Spaces:
Running
Running
id: "duckdb-leakage-detection" | |
title: "Leakage Detection" | |
slug: "duckdb-leakage-detection-query" | |
description: "Detect data leakage between train and test datasets by calculating the overlap percentage." | |
code: | | |
WITH | |
overlapping_rows AS ( | |
SELECT COUNT(*) AS overlap_count | |
FROM train | |
INTERSECT | |
SELECT COUNT(*) AS overlap_count | |
FROM test | |
), | |
total_unique_rows AS ( | |
SELECT COUNT(*) AS total_count | |
FROM ( | |
SELECT * FROM train | |
UNION | |
SELECT * FROM test | |
) combined | |
) | |
SELECT | |
overlap_count, | |
total_count, | |
(overlap_count * 100.0 / total_count) AS overlap_percentage | |
FROM overlapping_rows, total_unique_rows; | |
# Dataset Leakage Detection | |
This snippet demonstrates how to use DuckDB to detect potential data leakage between train and test datasets by calculating the overlap percentage. | |
```sql | |
WITH | |
overlapping_rows AS ( | |
SELECT COUNT(*) AS overlap_count | |
FROM train | |
INTERSECT | |
SELECT COUNT(*) AS overlap_count | |
FROM test | |
), | |
total_unique_rows AS ( | |
SELECT COUNT(*) AS total_count | |
FROM ( | |
SELECT * FROM train | |
UNION | |
SELECT * FROM test | |
) combined | |
) | |
SELECT | |
overlap_count, | |
total_count, | |
(overlap_count * 100.0 / total_count) AS overlap_percentage | |
FROM overlapping_rows, total_unique_rows; | |
``` | |
There is a very good in depth explanation of leakage in public datasets in this [article](https://huggingface.co/blog/lbourdois/lle). | |
# Example Query | |
You can check out this [link](https://huggingface.co/datasets/stanfordnlp/imdb?sql_console=true&sql=WITH%0A++++overlapping_rows+AS+%28%0A++++++++SELECT+COUNT%28*%29+AS+overlap_count%0A++++++++FROM+train%0A++++++++INTERSECT%0A++++++++SELECT+COUNT%28*%29+AS+overlap_count%0A++++++++FROM+test%0A++++%29%2C%0A++++total_unique_rows+AS+%28%0A++++++++SELECT+COUNT%28*%29+AS+total_count%0A++++++++FROM+%28%0A++++++++++++SELECT+*+FROM+train%0A++++++++++++UNION%0A++++++++++++SELECT+*+FROM+test%0A++++++++%29+combined%0A++++%29%0ASELECT%0A++++overlap_count%2C%0A++++total_count%2C%0A++++%28overlap_count+*+100.0+%2F+total_count%29+AS+overlap_percentage%0AFROM+overlapping_rows%2C+total_unique_rows%3B%0A) for the leakage query for the IMDB dataset. | |
 |