cfahlgren1's picture
cfahlgren1 HF staff
update snippet
d25a864
---
id: "duckdb-leakage-detection"
title: "Leakage Detection"
slug: "duckdb-leakage-detection-query"
description: "Detect data leakage between train and test datasets by calculating the overlap percentage."
code: |
WITH
overlapping_rows AS (
SELECT COUNT(*) AS overlap_count
FROM train
INTERSECT
SELECT COUNT(*) AS overlap_count
FROM test
),
total_unique_rows AS (
SELECT COUNT(*) AS total_count
FROM (
SELECT * FROM train
UNION
SELECT * FROM test
) combined
)
SELECT
overlap_count,
total_count,
(overlap_count * 100.0 / total_count) AS overlap_percentage
FROM overlapping_rows, total_unique_rows;
---
# Dataset Leakage Detection
This snippet demonstrates how to use DuckDB to detect potential data leakage between train and test datasets by calculating the overlap percentage.
```sql
WITH
overlapping_rows AS (
SELECT COUNT(*) AS overlap_count
FROM train
INTERSECT
SELECT COUNT(*) AS overlap_count
FROM test
),
total_unique_rows AS (
SELECT COUNT(*) AS total_count
FROM (
SELECT * FROM train
UNION
SELECT * FROM test
) combined
)
SELECT
overlap_count,
total_count,
(overlap_count * 100.0 / total_count) AS overlap_percentage
FROM overlapping_rows, total_unique_rows;
```
There is a very good in depth explanation of leakage in public datasets in this [article](https://huggingface.co/blog/lbourdois/lle).
# Example Query
You can check out this [link](https://huggingface.co/datasets/stanfordnlp/imdb?sql_console=true&sql=WITH%0A++++overlapping_rows+AS+%28%0A++++++++SELECT+COUNT%28*%29+AS+overlap_count%0A++++++++FROM+train%0A++++++++INTERSECT%0A++++++++SELECT+COUNT%28*%29+AS+overlap_count%0A++++++++FROM+test%0A++++%29%2C%0A++++total_unique_rows+AS+%28%0A++++++++SELECT+COUNT%28*%29+AS+total_count%0A++++++++FROM+%28%0A++++++++++++SELECT+*+FROM+train%0A++++++++++++UNION%0A++++++++++++SELECT+*+FROM+test%0A++++++++%29+combined%0A++++%29%0ASELECT%0A++++overlap_count%2C%0A++++total_count%2C%0A++++%28overlap_count+*+100.0+%2F+total_count%29+AS+overlap_percentage%0AFROM+overlapping_rows%2C+total_unique_rows%3B%0A) for the leakage query for the IMDB dataset.
![leakage-query-imdb](./leakage-query-image.png)