feat: Add Apache Arrow integration tutorial notebook
- Create comprehensive tutorial demonstrating Apache Arrow usage with DuckDB
- Cover Arrow table creation from DuckDB queries
- Demonstrate loading Arrow tables into DuckDB for zero-copy operations
- Include examples of interoperability with Polars and Pandas DataFrames
- Add marimo notebook with interactive SQL queries and data transformations
- Configure dependencies: duckdb==1.2.1, pyarrow==19.0.1, polars==1.25.2, pandas==2.2.3
This tutorial helps users understand how to leverage Apache Arrow's columnar
format for efficient data transfer between DuckDB and other data processing
libraries, enabling high-performance analytical workflows.
duckdb/011_working_with_apache_arrow.py
ADDED
@@ -0,0 +1,192 @@
+# /// script
+# requires-python = ">=3.11"
+# dependencies = [
+#     "marimo",
+#     "duckdb==1.2.1",
+#     "pyarrow==19.0.1",
+#     "polars[pyarrow]==1.25.2",
+#     "pandas==2.2.3",
+# ]
+# ///
+
+import marimo
+
+__generated_with = "0.14.10"
+app = marimo.App(width="medium")
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    # Working with Apache Arrow
+    *By [Thomas Liang](https://github.com/thliang01)*
+    #
+    """
+    )
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    [Apache Arrow](https://arrow.apache.org/) is a multi-language toolbox for building high-performance applications that process and transport large data sets. It is designed to improve both the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.
+
+    A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in memory. This data format has a rich data type system (including nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more.
+
+    DuckDB has native support for Apache Arrow. This allows for efficient data transfer between DuckDB and other Arrow-compatible libraries, such as Polars and Pandas (via PyArrow).
+
+    In this notebook, we'll explore how to:
+
+    - Create an Arrow table from a DuckDB query.
+    - Load an Arrow table into DuckDB.
+    - Convert between DuckDB, Arrow, and Polars/Pandas DataFrames.
+    """
+    )
+    return
+
+
+@app.cell
+def _(mo):
+    mo.sql(
+        """
+        CREATE OR REPLACE TABLE users (
+            id INTEGER,
+            name VARCHAR,
+            age INTEGER,
+            city VARCHAR
+        );
+
+        INSERT INTO users VALUES
+            (1, 'Alice', 30, 'New York'),
+            (2, 'Bob', 24, 'London'),
+            (3, 'Charlie', 35, 'Paris'),
+            (4, 'David', 29, 'New York'),
+            (5, 'Eve', 40, 'London');
+        """
+    )
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    ## 1. Creating an Arrow Table from a DuckDB Query
+
+    In marimo, `mo.sql()` returns the query result as a DataFrame (by default a Polars DataFrame when Polars is installed), and calling `.to_arrow()` on it produces an Apache Arrow table. With the DuckDB Python API, the `.arrow()` method on a query result serves the same purpose.
+    """
+    )
+    return
+
+
+@app.cell
+def _(mo):
+    # mo.sql() returns a DataFrame (Polars here); .to_arrow() converts it to a pyarrow.Table
+    users_arrow_table = mo.sql(  # type: ignore
+        """
+        SELECT * FROM users WHERE age > 30;
+        """
+    ).to_arrow()
+    return (users_arrow_table,)
+
+
+@app.cell
+def _(users_arrow_table):
+    users_arrow_table
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"The `.to_arrow()` method returns a `pyarrow.Table` object. We can inspect its schema:")
+    return
+
+
+@app.cell
+def _(users_arrow_table):
+    users_arrow_table.schema
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    ## 2. Loading an Arrow Table into DuckDB
+
+    You can also query an existing Arrow table (or a Polars or Pandas DataFrame) directly from DuckDB. For Arrow tables, DuckDB can scan the in-memory data in place instead of first copying it into its own storage, which is highly efficient.
+    """
+    )
+    return
+
+
+@app.cell
+def _(pa):
+    # Create an Arrow table in Python
+    new_data = pa.table({
+        'id': [6, 7],
+        'name': ['Fiona', 'George'],
+        'age': [22, 45],
+        'city': ['Berlin', 'Tokyo']
+    })
+    return (new_data,)
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    Now, we can query this Arrow table `new_data` directly from SQL by referring to the Python variable name in the query.
+    """
+    )
+    return
+
+
+@app.cell
+def _(mo, new_data):
+    mo.sql(
+        f"""
+        SELECT name, age, city
+        FROM new_data
+        WHERE age > 30;
+        """
+    )
+    return
+
+# Work in progress: Interoperability with Polars and Pandas
+
+# @app.cell(hide_code=True)
+# def _(mo):
+#     mo.md(
+#         r"""
+#     ## 3. Interoperability with Polars and Pandas
+
+#     The real power of DuckDB's Arrow integration comes from its interoperability with data frame libraries like Polars and Pandas. Because Polars uses the Arrow in-memory format and Pandas can convert to and from it via PyArrow, conversions are fast and often zero-copy.
+#     """
+#     )
+#     return
+
+
+# @app.cell(hide_code=True)
+# def _(mo):
+#     mo.md(r"### From DuckDB to Polars/Pandas")
+#     return
+
+
+@app.cell
+def _():
+    import marimo as mo
+    return (mo,)
+
+
+@app.cell
+def _():
+    import pyarrow as pa
+    import polars as pl
+    import pandas as pd
+    return (pa,)
+
+
+if __name__ == "__main__":
+    app.run()