File size: 3,257 Bytes
6032e5b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
318fad3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6032e5b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
---
id: "duckdb-summarize"
title: "Histogram"
slug: "duckdb-histogram-query"
description: "Create a histogram for a specific column to visualize the distribution of values."
code: |
    from histogram(
        table_name,
        column_name,
    )
---

# DuckDB Histogram

This snippet demonstrates how to use the `Histogram` function in DuckDB to calculate aggregate statistics for a dataset. The `histogram` function in DuckDB is used to compute histograms over columns of a dataset. It works for columns of any type and allows for various binning strategies and a custom number of bins.

```sql
from histogram(
    table_name,
    column_name,
    bin_count := 10
)
```

## Parameters

- `table_name`: The name of the table or a subquery result.
- `column_name`: The name of the column for which to create the histogram, you can use different expressions to summarize the data such as length of a string.
- `bin_count`: The number of bins to use in the histogram. (_**Optional**_)
- `technique`: The binning technique to use. (_**Optional**_)


## Binning Techniques

| Technique         | Description                                                                                                                                                                                                 |
|-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `auto`            | Automatically selects the best binning technique based on the data type. If the data type is not numeric or timestamp, it defaults to `sample`. For numeric or timestamp data, it defaults to `equi-width-nice`. |
| `sample`          | Uses distinct values in the column as bins. This technique is useful when the column has a small number of distinct values.                                                                                   |
| `equi-height`     | Creates bins such that each bin has approximately the same number of data points. This technique is useful for ensuring that each bin has a similar number of entries. This can be helpful for skewed distributions. |
| `equi-width`      | Creates bins of equal width. This technique is useful for numeric data. You want each bin to cover the same range of values.                                                                                   |
| `equi-width-nice` | Creates bins of equal width with "nice" boundaries. This technique is similar to `equi-width`. It adjusts the bin boundaries to be more human-readable (e.g., rounding to the nearest whole number).            |

You can find more information in the [PR](https://github.com/duckdb/duckdb/pull/12590) that added this feature.


## Histogram of the length of the input persona from the `PersonaHub` dataset

```sql
from histogram(
  instruction,
  len("input persona"),
  bin_count := 5
)
```

<iframe
		src="https://huggingface.co/datasets/proj-persona/PersonaHub/embed/viewer/instruction/train?sql_console=true&sql=from+histogram%28%0A++instruction%2C%0A++len%28%22input+persona%22%29%2C%0A++bin_count+%3A%3D+5%0A%29"
  frameborder="0"
  width="100%"
  height="560px"
></iframe>