Srihari Thyagarajan committed on
Commit
a100272
·
unverified ·
2 Parent(s): d004e37 b85ef90

Merge pull request #117 from thliang01/Working-with-Apache-Arrow


feat: Expand Apache Arrow tutorial with advanced examples and performance benchmarks

duckdb/011_working_with_apache_arrow.py ADDED
@@ -0,0 +1,613 @@
+ # /// script
+ # requires-python = ">=3.11"
+ # dependencies = [
+ #     "marimo",
+ #     "duckdb==1.3.2",
+ #     "pyarrow==19.0.1",
+ #     "polars[pyarrow]==1.25.2",
+ #     "pandas==2.2.3",
+ #     "sqlglot==27.0.0",
+ #     "psutil==7.0.0",
+ #     "altair",
+ # ]
+ # ///
+
+ import marimo
+
+ __generated_with = "0.14.11"
+ app = marimo.App(width="medium")
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         # Working with Apache Arrow
+
+         *By [Thomas Liang](https://github.com/thliang01)*
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         [Apache Arrow](https://arrow.apache.org/) is a multi-language toolbox for building high-performance applications that process and transport large datasets. It is designed both to speed up analytical algorithms and to make moving data between systems and programming languages more efficient.
+
+         A critical component of Apache Arrow is its in-memory columnar format: a standardized, language-agnostic specification for representing structured, table-like datasets in memory. The format has a rich type system (including nested and user-defined types) designed to support the needs of analytic database systems, data frame libraries, and more.
+
+         DuckDB supports Arrow natively, which allows efficient data transfer between DuckDB and other Arrow-compatible systems, such as Polars and Pandas (via PyArrow).
+
+         In this notebook, we'll explore how to:
+
+         - Create an Arrow table from a DuckDB query.
+         - Load an Arrow table into DuckDB.
+         - Convert between DuckDB, Arrow, and Polars/Pandas DataFrames.
+         - Combine data from multiple sources in a single query.
+         - Measure the performance benefits of Arrow integration.
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(mo):
+     users = mo.sql(
+         """
+         CREATE TABLE IF NOT EXISTS users (
+             id INTEGER,
+             name VARCHAR,
+             age INTEGER,
+             city VARCHAR
+         );
+
+         INSERT INTO users VALUES
+             (1, 'Alice', 30, 'New York'),
+             (2, 'Bob', 24, 'London'),
+             (3, 'Charlie', 35, 'Paris'),
+             (4, 'David', 29, 'New York'),
+             (5, 'Eve', 40, 'London');
+         """
+     )
+     return (users,)
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ## 1. Creating an Arrow Table from a DuckDB Query
+
+         You can fetch the results of a DuckDB query as an Apache Arrow table. In marimo, `mo.sql` returns a dataframe, and calling `.to_arrow()` on it yields a `pyarrow.Table`. (With the DuckDB Python API directly, you would call `.arrow()` on the query result.)
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(mo, users):
+     users_arrow_table = mo.sql(  # type: ignore
+         """
+         SELECT * FROM users WHERE age > 30;
+         """
+     ).to_arrow()
+     return (users_arrow_table,)
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""`.to_arrow()` returns a `pyarrow.Table` object. We can inspect its schema:""")
+     return
+
+
+ @app.cell
+ def _(users_arrow_table):
+     users_arrow_table.schema
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ## 2. Loading an Arrow Table into DuckDB
+
+         You can also query an existing Arrow table (or a Polars/Pandas DataFrame, which uses Arrow under the hood) directly from DuckDB. DuckDB scans the in-memory data without copying it, which is highly efficient.
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(pa):
+     # Create an Arrow table in Python
+     new_data = pa.table({
+         "id": [6, 7],
+         "name": ["Fiona", "George"],
+         "age": [22, 45],
+         "city": ["Berlin", "Tokyo"],
+     })
+     return (new_data,)
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""Now we can query the Arrow table `new_data` from SQL simply by referencing it by name.""")
+     return
+
+
+ @app.cell
+ def _(mo, new_data):
+     mo.sql(
+         f"""
+         SELECT name, age, city
+         FROM new_data
+         WHERE age > 30;
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ## 3. Converting Between DuckDB, Arrow, and Polars/Pandas
+
+         The real power of DuckDB's Arrow integration comes from its seamless interoperability with data frame libraries like Polars and Pandas. Because they all share the Arrow in-memory format, conversions are often zero-copy and extremely fast.
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""### From DuckDB to Polars/Pandas""")
+     return
+
+
+ @app.cell
+ def _(pl, users_arrow_table):
+     # Convert the Arrow table to a Polars DataFrame
+     users_polars_df = pl.from_arrow(users_arrow_table)
+     users_polars_df
+     return
+
+
+ @app.cell
+ def _(users_arrow_table):
+     # Convert the Arrow table to a Pandas DataFrame
+     users_pandas_df = users_arrow_table.to_pandas()
+     users_pandas_df
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""### From Polars/Pandas to DuckDB""")
+     return
+
+
+ @app.cell
+ def _(pl):
+     # Create a Polars DataFrame
+     polars_df = pl.DataFrame({
+         "product_id": [101, 102, 103],
+         "product_name": ["Laptop", "Mouse", "Keyboard"],
+         "price": [1200.00, 25.50, 75.00],
+     })
+     polars_df
+     return (polars_df,)
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""Now we can query this Polars DataFrame directly in DuckDB:""")
+     return
+
+
+ @app.cell
+ def _(mo, polars_df):
+     # Query the Polars DataFrame directly in DuckDB
+     mo.sql(
+         f"""
+         SELECT product_name, price
+         FROM polars_df
+         WHERE price > 50
+         ORDER BY price DESC;
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""Similarly, we can query a Pandas DataFrame:""")
+     return
+
+
+ @app.cell
+ def _(pd):
+     # Create a Pandas DataFrame
+     pandas_df = pd.DataFrame({
+         "order_id": [1001, 1002, 1003, 1004],
+         "product_id": [101, 102, 103, 101],
+         "quantity": [1, 2, 1, 3],
+         "order_date": pd.to_datetime(["2024-01-15", "2024-01-16", "2024-01-16", "2024-01-17"]),
+     })
+     pandas_df
+     return (pandas_df,)
+
+
+ @app.cell
+ def _(mo, pandas_df):
+     # Query the Pandas DataFrame in DuckDB
+     mo.sql(
+         f"""
+         SELECT order_date, SUM(quantity) AS total_quantity
+         FROM pandas_df
+         GROUP BY order_date
+         ORDER BY order_date;
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ## 4. Advanced Example: Combining Multiple Data Sources
+
+         One of the most powerful features of this integration is the ability to join data from different sources (DuckDB tables, Arrow tables, Polars/Pandas DataFrames) in a single query:
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(mo, pandas_df, polars_df, users):
+     # Join the DuckDB users table with the Polars products DataFrame and the
+     # Pandas orders DataFrame. The orders carry no user key, so for
+     # illustration we cross join and filter to a single user.
+     result = mo.sql(
+         f"""
+         SELECT
+             u.name AS customer_name,
+             p.product_name,
+             o.quantity,
+             p.price,
+             (o.quantity * p.price) AS total_amount
+         FROM users u
+         CROSS JOIN pandas_df o
+         JOIN polars_df p ON o.product_id = p.product_id
+         WHERE u.id = 1  -- just Alice
+         ORDER BY o.order_date;
+         """
+     )
+     result
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ## 5. Performance Benefits of Arrow Integration
+
+         The zero-copy integration between DuckDB and Apache Arrow delivers significant performance and memory benefits:
+
+         ### Key Benefits
+
+         - **Memory efficiency**: Arrow's compact columnar representation typically needs noticeably less memory than pandas' default object-based representation, especially for string-heavy data.
+         - **Zero-copy operations**: Data can be shared between DuckDB and Arrow-compatible systems (Polars, Pandas) without copying, eliminating redundant memory usage.
+         - **Query performance**: Avoiding the copy step often makes queries several times faster, as the benchmark below shows.
+         - **Larger-than-memory analysis**: Because both libraries support streaming query results as record batches, you can execute queries on data bigger than available memory by processing one batch at a time.
+         - **Query optimization**: DuckDB's optimizer can push filters and projections down into Arrow scans, reading only the relevant columns and row ranges.
+
+         Let's demonstrate these benefits with concrete examples:
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""### Memory Efficiency Demonstration""")
+     return
+
+
+ @app.cell
+ def _(pd, pl):
+     import time
+
+     # Create identical datasets in different formats
+     n_rows = 1_000_000
+
+     # Pandas DataFrame (traditional approach)
+     pandas_data = pd.DataFrame({
+         "id": range(n_rows),
+         "value": [i * 2.5 for i in range(n_rows)],
+         "category": [f"cat_{i % 100}" for i in range(n_rows)],
+         "description": [f"This is a longer text description for row {i}" for i in range(n_rows)],
+     })
+
+     # Polars DataFrame (Arrow-based)
+     polars_data = pl.DataFrame({
+         "id": range(n_rows),
+         "value": pl.Series([i * 2.5 for i in range(n_rows)]),
+         "category": pl.Series([f"cat_{i % 100}" for i in range(n_rows)]),
+         "description": pl.Series([f"This is a longer text description for row {i}" for i in range(n_rows)]),
+     })
+
+     # Get memory usage
+     pandas_memory = pandas_data.memory_usage(deep=True).sum() / 1024 / 1024  # MB
+     polars_memory = polars_data.estimated_size() / 1024 / 1024  # MB
+
+     print(f"Dataset size: {n_rows:,} rows")
+     print(f"Pandas memory usage: {pandas_memory:.2f} MB")
+     print(f"Polars (Arrow) memory usage: {polars_memory:.2f} MB")
+     print(f"Memory savings: {(pandas_memory - polars_memory) / pandas_memory * 100:.1f}%")
+     return pandas_data, polars_data, time
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""### Performance Comparison: Arrow vs Non-Arrow Approaches""")
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""Let's compare three approaches to the same analytical query:""")
+     return
+
+
+ @app.cell
+ def _(duckdb, mo, pandas_data, polars_data, time):
+     # Test query: group by category and calculate aggregations
+     query = """
+         SELECT
+             category,
+             COUNT(*) AS count,
+             AVG(value) AS avg_value,
+             MIN(value) AS min_value,
+             MAX(value) AS max_value,
+             SUM(value) AS sum_value
+         FROM data_source
+         GROUP BY category
+         ORDER BY count DESC
+     """
+
+     # Approach 1: Traditional - copy the data into a DuckDB table first
+     start_time = time.time()
+     conn = duckdb.connect(":memory:")
+     conn.execute("CREATE TABLE pandas_table AS SELECT * FROM pandas_data")
+     result1 = conn.execute(query.replace("data_source", "pandas_table")).fetchall()
+     approach1_time = time.time() - start_time
+
+     # Approach 2: Direct Pandas groupby (no DuckDB)
+     start_time = time.time()
+     result2 = pandas_data.groupby("category").agg({
+         "id": "count",
+         "value": ["mean", "min", "max", "sum"],
+     }).sort_values(("id", "count"), ascending=False)
+     approach2_time = time.time() - start_time
+
+     # Approach 3: Arrow-based - zero-copy scan of the Polars DataFrame
+     start_time = time.time()
+     result3 = mo.sql(
+         f"""
+         SELECT
+             category,
+             COUNT(*) AS count,
+             AVG(value) AS avg_value,
+             MIN(value) AS min_value,
+             MAX(value) AS max_value,
+             SUM(value) AS sum_value
+         FROM polars_data
+         GROUP BY category
+         ORDER BY count DESC
+         """
+     )
+     approach3_time = time.time() - start_time
+
+     print("Performance Comparison:")
+     print(f"1. Traditional (copy to DuckDB): {approach1_time:.3f} seconds")
+     print(f"2. Pandas groupby: {approach2_time:.3f} seconds")
+     print(f"3. Arrow-based (zero-copy): {approach3_time:.3f} seconds")
+     print(f"\nSpeedup vs traditional: {approach1_time / approach3_time:.1f}x")
+     print(f"Speedup vs pandas: {approach2_time / approach3_time:.1f}x")
+
+     return approach1_time, approach2_time, approach3_time
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""### Visualizing the Performance Difference""")
+     return
+
+
+ @app.cell
+ def _(approach1_time, approach2_time, approach3_time, mo, pl):
+     import altair as alt
+
+     # Assemble the three timings into a small table for plotting
+     performance_data = pl.DataFrame({
+         "Approach": ["Traditional\n(Copy to DuckDB)", "Pandas\nGroupBy", "Arrow-based\n(Zero-copy)"],
+         "Time (seconds)": [approach1_time, approach2_time, approach3_time],
+     })
+
+     # Create the Altair bar chart
+     chart = alt.Chart(performance_data.to_pandas()).mark_bar().encode(
+         x=alt.X("Approach", type="nominal", sort="-y"),
+         y=alt.Y("Time (seconds)", type="quantitative"),
+         color=alt.Color(
+             "Approach",
+             type="nominal",
+             scale=alt.Scale(range=["#ff6b6b", "#ffd93d", "#6bcf7f"]),
+         ),
+     ).properties(
+         title="Query Performance Comparison",
+         width=400,
+         height=300,
+     )
+
+     # Display using marimo's altair_chart UI element
+     mo.ui.altair_chart(chart)
+     return alt, chart, performance_data
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""### Complex Query Performance""")
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""Let's test a more complex query with joins and window functions:""")
+     return
+
+
+ @app.cell
+ def _(mo, pl, polars_data, time):
+     # Create an additional dataset for the join
+     categories_df = pl.DataFrame({
+         "category": [f"cat_{i}" for i in range(100)],
+         "category_group": [f"group_{i // 10}" for i in range(100)],
+         "priority": [i % 5 + 1 for i in range(100)],
+     })
+
+     # Complex query with a join and window functions
+     new_start_time = time.time()
+
+     complex_result = mo.sql(
+         f"""
+         WITH ranked_data AS (
+             SELECT
+                 d.*,
+                 c.category_group,
+                 c.priority,
+                 ROW_NUMBER() OVER (PARTITION BY c.category_group ORDER BY d.value DESC) AS rank_in_group,
+                 AVG(d.value) OVER (PARTITION BY c.category_group) AS group_avg_value
+             FROM polars_data d
+             JOIN categories_df c ON d.category = c.category
+         )
+         SELECT
+             category_group,
+             COUNT(DISTINCT category) AS unique_categories,
+             AVG(value) AS avg_value,
+             MAX(value) AS max_value,
+             AVG(group_avg_value) AS avg_group_value,
+             COUNT(CASE WHEN rank_in_group <= 10 THEN 1 END) AS top_10_count
+         FROM ranked_data
+         GROUP BY category_group
+         ORDER BY avg_value DESC
+         """
+     )
+
+     complex_query_time = time.time() - new_start_time
+     print(f"Complex query with joins and window functions completed in {complex_query_time:.3f} seconds")
+
+     complex_result
+     return (categories_df,)
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ### Memory Efficiency During Operations
+
+         Let's demonstrate how Arrow's zero-copy operations save memory during data transformations:
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(polars_data, time):
+     import os
+
+     import psutil
+     import pyarrow.compute as pc
+
+     # Get the current process so we can sample its memory usage
+     process = psutil.Process(os.getpid())
+
+     # Measure memory before the operations
+     memory_before = process.memory_info().rss / 1024 / 1024  # MB
+
+     # Perform several Arrow-based operations (zero-copy)
+     latest_start_time = time.time()
+
+     # These operations use Arrow's zero-copy capabilities
+     arrow_table = polars_data.to_arrow()
+     arrow_sliced = arrow_table.slice(0, 100000)
+     # Use a PyArrow compute function for filtering
+     arrow_filtered = arrow_table.filter(pc.greater(arrow_table["value"], 500000))
+
+     arrow_ops_time = time.time() - latest_start_time
+     memory_after_arrow = process.memory_info().rss / 1024 / 1024  # MB
+
+     # Compare with traditional copy-based operations
+     latest_start_time = time.time()
+
+     # These operations create copies
+     pandas_copy = polars_data.to_pandas()
+     pandas_sliced = pandas_copy.iloc[:100000].copy()
+     pandas_filtered = pandas_copy[pandas_copy["value"] > 500000].copy()
+
+     copy_ops_time = time.time() - latest_start_time
+     memory_after_copy = process.memory_info().rss / 1024 / 1024  # MB
+
+     print("Memory Usage Comparison:")
+     print(f"Initial memory: {memory_before:.2f} MB")
+     print(f"After Arrow operations: {memory_after_arrow:.2f} MB (diff: +{memory_after_arrow - memory_before:.2f} MB)")
+     print(f"After copy operations: {memory_after_copy:.2f} MB (diff: +{memory_after_copy - memory_before:.2f} MB)")
+     print("\nTime comparison:")
+     print(f"Arrow operations: {arrow_ops_time:.3f} seconds")
+     print(f"Copy operations: {copy_ops_time:.3f} seconds")
+     print(f"Speedup: {copy_ops_time / arrow_ops_time:.1f}x")
+     return (pc,)
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ## Summary
+
+         In this notebook, we've explored:
+
+         1. **Creating Arrow tables from DuckDB queries** using `.to_arrow()`
+         2. **Loading Arrow tables into DuckDB** and querying them directly
+         3. **Converting between DuckDB, Arrow, Polars, and Pandas** with zero-copy operations
+         4. **Combining data from multiple sources** in a single SQL query
+         5. **Performance and memory benefits**, including:
+             - **Memory efficiency**: Arrow's columnar format needs less memory than pandas' default representation
+             - **Query performance**: zero-copy scans avoid the cost of copying data before querying it
+             - **Reduced memory overhead**: operations on Arrow data avoid creating copies
+             - **Better scalability**: larger datasets fit within the same memory constraints
+
+         The seamless integration between DuckDB and Arrow-compatible systems makes it easy to work with data across different tools while maintaining high performance and memory efficiency.
+         """
+     )
+     return
+
+
+ @app.cell
+ def _():
+     import marimo as mo
+     import pyarrow as pa
+     import polars as pl
+     import pandas as pd
+     import duckdb
+     import sqlglot
+     return duckdb, mo, pa, pd, pl
+
+
+ if __name__ == "__main__":
+     app.run()