Spaces:
Running
Running
File size: 4,696 Bytes
ef43da6 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 |
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "marimo",
# "duckdb==1.2.1",
# "pyarrow==19.0.1",
# "polars[pyarrow]==1.25.2",
# "pandas==2.2.3",
# ]
# ///
import marimo
__generated_with = "0.14.10"
app = marimo.App(width="medium")
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
# Working with Apache Arrow
*By [Thomas Liang](https://github.com/thliang01)*
#
"""
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
[Apache Arrow](https://arrow.apache.org/) is a multi-language toolbox for building high performance applications that process and transport large data sets. It is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.
A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. This data format has a rich data type system (included nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more.
DuckDB has native support for Apache Arrow, which is an in-memory columnar data format. This allows for efficient data transfer between DuckDB and other Arrow-compatible systems, such as Polars and Pandas (via PyArrow).
In this notebook, we'll explore how to:
- Create an Arrow table from a DuckDB query.
- Load an Arrow table into DuckDB.
- Convert between DuckDB, Arrow, and Polars/Pandas DataFrames.
"""
)
return
@app.cell
def _(mo):
mo.sql(
"""
CREATE TABLE IF NOT EXISTS users (
id INTEGER,
name VARCHAR,
age INTEGER,
city VARCHAR
);
INSERT INTO users VALUES
(1, 'Alice', 30, 'New York'),
(2, 'Bob', 24, 'London'),
(3, 'Charlie', 35, 'Paris'),
(4, 'David', 29, 'New York'),
(5, 'Eve', 40, 'London');
"""
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
## 1. Creating an Arrow Table from a DuckDB Query
You can directly fetch the results of a DuckDB query as an Apache Arrow table using the `.arrow()` method on the query result.
"""
)
return
@app.cell
def _(mo):
users_arrow_table = mo.sql( # type: ignore
"""
SELECT * FROM users WHERE age > 30;
"""
).to_arrow()
return (users_arrow_table,)
@app.cell
def _(users_arrow_table):
users_arrow_table
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"The `.arrow()` method returns a `pyarrow.Table` object. We can inspect its schema:")
return
@app.cell
def _(users_arrow_table):
users_arrow_table.schema
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
## 2. Loading an Arrow Table into DuckDB
You can also register an existing Arrow table (or a Polars/Pandas DataFrame, which uses Arrow under the hood) directly with DuckDB. This allows you to query the in-memory data without any copying, which is highly efficient.
"""
)
return
@app.cell
def _(pa):
# Create an Arrow table in Python
new_data = pa.table({
'id': [6, 7],
'name': ['Fiona', 'George'],
'age': [22, 45],
'city': ['Berlin', 'Tokyo']
})
return (new_data,)
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
Now, we can query this Arrow table `new_data` directly from SQL by embedding it in the query.
"""
)
return
@app.cell
def _(mo, new_data):
mo.sql(
f"""
SELECT name, age, city
FROM new_data
WHERE age > 30;
"""
)
return
# Working in Interoperability with Polars and Pandas
# @app.cell(hide_code=True)
# def _(mo):
# mo.md(
# r"""
# ## 3. Interoperability with Polars and Pandas
# The real power of DuckDB's Arrow integration comes from its seamless interoperability with data frame libraries like Polars and Pandas. Because they all share the Arrow in-memory format, conversions are often zero-copy and extremely fast.
# """
# )
# return
# @app.cell(hide_code=True)
# def _(mo):
# mo.md(r"### From DuckDB to Polars/Pandas")
# return
@app.cell
def _():
import marimo as mo
import plotly.express as px
return mo, px
@app.cell
def _():
import pyarrow as pa
import polars as pl
import pandas as pd
return
if __name__ == "__main__":
app.run() |