Spaces:
Running
Running
id: "duckdb-summarize" | |
title: "Histogram" | |
slug: "duckdb-histogram-query" | |
description: "Create a histogram for a specific column to visualize the distribution of values." | |
code: | | |
from histogram( | |
table_name, | |
column_name, | |
) | |
# DuckDB Histogram | |
This snippet demonstrates how to use the `Histogram` function in DuckDB to calculate aggregate statistics for a dataset. The `histogram` function in DuckDB is used to compute histograms over columns of a dataset. It works for columns of any type and allows for various binning strategies and a custom number of bins. | |
```sql | |
from histogram( | |
table_name, | |
column_name, | |
bin_count := 10 | |
) | |
``` | |
## Parameters | |
- `table_name`: The name of the table or a subquery result. | |
- `column_name`: The name of the column for which to create the histogram, you can use different expressions to summarize the data such as length of a string. | |
- `bin_count`: The number of bins to use in the histogram. (_**Optional**_) | |
- `technique`: The binning technique to use. (_**Optional**_) | |
## Binning Techniques | |
| Technique | Description | | |
|-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | |
| `auto` | Automatically selects the best binning technique based on the data type. If the data type is not numeric or timestamp, it defaults to `sample`. For numeric or timestamp data, it defaults to `equi-width-nice`. | | |
| `sample` | Uses distinct values in the column as bins. This technique is useful when the column has a small number of distinct values. | | |
| `equi-height` | Creates bins such that each bin has approximately the same number of data points. This technique is useful for ensuring that each bin has a similar number of entries. This can be helpful for skewed distributions. | | |
| `equi-width` | Creates bins of equal width. This technique is useful for numeric data. You want each bin to cover the same range of values. | | |
| `equi-width-nice` | Creates bins of equal width with "nice" boundaries. This technique is similar to `equi-width`. It adjusts the bin boundaries to be more human-readable (e.g., rounding to the nearest whole number). | | |
You can find more information in the [PR](https://github.com/duckdb/duckdb/pull/12590) that added this feature. | |
## Histogram of the length of the input persona from the `PersonaHub` dataset | |
```sql | |
from histogram( | |
instruction, | |
len("input persona"), | |
bin_count := 5 | |
) | |
``` | |
<iframe | |
src="https://huggingface.co/datasets/proj-persona/PersonaHub/embed/viewer/instruction/train?sql_console=true&sql=from+histogram%28%0A++instruction%2C%0A++len%28%22input+persona%22%29%2C%0A++bin_count+%3A%3D+5%0A%29" | |
frameborder="0" | |
width="100%" | |
height="560px" | |
></iframe> |