thliang01 committed
Commit ef43da6 · unverified · 1 Parent(s): 34c490f

feat: Add Apache Arrow integration tutorial notebook

- Create comprehensive tutorial demonstrating Apache Arrow usage with DuckDB
- Cover Arrow table creation from DuckDB queries
- Demonstrate loading Arrow tables into DuckDB for zero-copy operations
- Include examples of interoperability with Polars and Pandas DataFrames
- Add marimo notebook with interactive SQL queries and data transformations
- Configure dependencies: duckdb==1.2.1, pyarrow==19.0.1, polars[pyarrow]==1.25.2, pandas==2.2.3

This tutorial helps users understand how to leverage Apache Arrow's columnar
format for efficient data transfer between DuckDB and other data processing
libraries, enabling high-performance analytical workflows.
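
To make "columnar format" concrete, here is a minimal plain-Python sketch. It does not use Arrow itself; the `rows` and `columns` names are illustrative assumptions, and the sample values mirror the `users` table created in the notebook. It only shows why an analytical scan over one field is simpler when each column is stored contiguously.

```python
from array import array

# Row-oriented layout: one record per user.
rows = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 24},
    {"id": 3, "name": "Charlie", "age": 35},
    {"id": 4, "name": "David", "age": 29},
    {"id": 5, "name": "Eve", "age": 40},
]

# Column-oriented layout: one contiguous sequence per field.
# array("i") stores the integers in a packed, typed buffer.
columns = {
    "id": array("i", (r["id"] for r in rows)),
    "name": [r["name"] for r in rows],
    "age": array("i", (r["age"] for r in rows)),
}

# An aggregate over a single field touches every record in the row layout...
mean_age_rows = sum(r["age"] for r in rows) / len(rows)

# ...but only one typed buffer in the column layout.
mean_age_cols = sum(columns["age"]) / len(columns["age"])

print(mean_age_rows, mean_age_cols)
```

Arrow standardizes a layout of the second kind, so that different libraries can read the same buffers without converting them.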

duckdb/011_working_with_apache_arrow.py ADDED
@@ -0,0 +1,192 @@
+ # /// script
+ # requires-python = ">=3.11"
+ # dependencies = [
+ #     "marimo",
+ #     "duckdb==1.2.1",
+ #     "pyarrow==19.0.1",
+ #     "polars[pyarrow]==1.25.2",
+ #     "pandas==2.2.3",
+ # ]
+ # ///
+
+ import marimo
+
+ __generated_with = "0.14.10"
+ app = marimo.App(width="medium")
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+     # Working with Apache Arrow
+     *By [Thomas Liang](https://github.com/thliang01)*
+     #
+     """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+     [Apache Arrow](https://arrow.apache.org/) is a multi-language toolbox for building high-performance applications that process and transport large data sets. It is designed to improve both the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.
+
+     A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in memory. This format has a rich data type system (including nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more.
+
+     DuckDB has native support for this format, which allows efficient data transfer between DuckDB and other Arrow-compatible libraries, such as Polars and Pandas (via PyArrow).
+
+     In this notebook, we'll explore how to:
+
+     - Create an Arrow table from a DuckDB query.
+     - Load an Arrow table into DuckDB.
+     - Convert between DuckDB, Arrow, and Polars/Pandas DataFrames.
+     """
+     )
+     return
+
+
+ @app.cell
+ def _(mo):
+     mo.sql(
+         """
+         CREATE TABLE IF NOT EXISTS users (
+             id INTEGER,
+             name VARCHAR,
+             age INTEGER,
+             city VARCHAR
+         );
+
+         INSERT INTO users VALUES
+             (1, 'Alice', 30, 'New York'),
+             (2, 'Bob', 24, 'London'),
+             (3, 'Charlie', 35, 'Paris'),
+             (4, 'David', 29, 'New York'),
+             (5, 'Eve', 40, 'London');
+         """
+     )
+     return
+
+
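+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+     ## 1. Creating an Arrow Table from a DuckDB Query
+
+     In marimo, `mo.sql` runs the query with DuckDB and returns the result as a DataFrame (a Polars DataFrame when Polars is installed, as it is here). Calling `.to_arrow()` on that result converts it to an Apache Arrow table. With DuckDB's Python API directly, the equivalent is the `.arrow()` method on the query result.
+     """
+     )
+     return
+
+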
+ @app.cell
+ def _(mo):
+     users_arrow_table = mo.sql(  # type: ignore
+         """
+         SELECT * FROM users WHERE age > 30;
+         """
+     ).to_arrow()
+     return (users_arrow_table,)
+
+
+ @app.cell
+ def _(users_arrow_table):
+     users_arrow_table
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"The `.to_arrow()` call returns a `pyarrow.Table` object. We can inspect its schema:")
+     return
+
+
+ @app.cell
+ def _(users_arrow_table):
+     users_arrow_table.schema
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+     ## 2. Loading an Arrow Table into DuckDB
+
+     DuckDB can also query an existing Arrow table directly, along with Polars DataFrames (which use Arrow's memory format) and Pandas DataFrames. DuckDB scans the in-memory data in place rather than importing it into its own storage first, which avoids an extra copy in most cases.
+     """
+     )
+     return
+
+
+ @app.cell
+ def _(pa):
+     # Create an Arrow table in Python
+     new_data = pa.table({
+         'id': [6, 7],
+         'name': ['Fiona', 'George'],
+         'age': [22, 45],
+         'city': ['Berlin', 'Tokyo']
+     })
+     return (new_data,)
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+     Now we can query the Arrow table `new_data` directly from SQL by using its Python variable name as the table name. DuckDB resolves the name to the in-memory object (a replacement scan).
+     """
+     )
+     return
+
+
+ @app.cell
+ def _(mo, new_data):
+     mo.sql(
+         f"""
+         SELECT name, age, city
+         FROM new_data
+         WHERE age > 30;
+         """
+     )
+     return
+
+ # Work in progress: interoperability with Polars and Pandas
+
+ # @app.cell(hide_code=True)
+ # def _(mo):
+ #     mo.md(
+ #         r"""
+ #     ## 3. Interoperability with Polars and Pandas
+
+ #     The real power of DuckDB's Arrow integration comes from its interoperability with data frame libraries like Polars and Pandas. Because these libraries can exchange data in the Arrow in-memory format, conversions are often zero-copy and fast.
+ #     """
+ #     )
+ #     return
+
+
+ # @app.cell(hide_code=True)
+ # def _(mo):
+ #     mo.md(r"### From DuckDB to Polars/Pandas")
+ #     return
+
+
+ @app.cell
+ def _():
+     import marimo as mo
+     return (mo,)
+
+
+ @app.cell
+ def _():
+     import pyarrow as pa
+     import polars as pl
+     import pandas as pd
+     return pa, pd, pl
+
+
+ if __name__ == "__main__":
+     app.run()