File size: 2,457 Bytes
db39944
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d25a864
db39944
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
---
id: "duckdb-leakage-detection"
title: "Leakage Detection"
slug: "duckdb-leakage-detection-query"
description: "Detect data leakage between train and test datasets by calculating the overlap percentage."
code: |
    WITH
        overlapping_rows AS (
            SELECT COUNT(*) AS overlap_count
            FROM train
            INTERSECT
            SELECT COUNT(*) AS overlap_count
            FROM test
        ),
        total_unique_rows AS (
            SELECT COUNT(*) AS total_count
            FROM (
                SELECT * FROM train
                UNION
                SELECT * FROM test
            ) combined
        )
    SELECT
        overlap_count,
        total_count,
        (overlap_count * 100.0 / total_count) AS overlap_percentage
    FROM overlapping_rows, total_unique_rows;
---

# Dataset Leakage Detection 

This snippet demonstrates how to use DuckDB to detect potential data leakage between train and test datasets by calculating the overlap percentage.

```sql
WITH
    overlapping_rows AS (
        SELECT COUNT(*) AS overlap_count
        FROM train
        INTERSECT
        SELECT COUNT(*) AS overlap_count
        FROM test
    ),
    total_unique_rows AS (
        SELECT COUNT(*) AS total_count
        FROM (
            SELECT * FROM train
            UNION
            SELECT * FROM test
        ) combined
    )
SELECT
    overlap_count,
    total_count,
    (overlap_count * 100.0 / total_count) AS overlap_percentage
FROM overlapping_rows, total_unique_rows;
```

There is a very good in depth explanation of leakage in public datasets in this [article](https://huggingface.co/blog/lbourdois/lle).

# Example Query

You can check out this [link](https://huggingface.co/datasets/stanfordnlp/imdb?sql_console=true&sql=WITH%0A++++overlapping_rows+AS+%28%0A++++++++SELECT+COUNT%28*%29+AS+overlap_count%0A++++++++FROM+train%0A++++++++INTERSECT%0A++++++++SELECT+COUNT%28*%29+AS+overlap_count%0A++++++++FROM+test%0A++++%29%2C%0A++++total_unique_rows+AS+%28%0A++++++++SELECT+COUNT%28*%29+AS+total_count%0A++++++++FROM+%28%0A++++++++++++SELECT+*+FROM+train%0A++++++++++++UNION%0A++++++++++++SELECT+*+FROM+test%0A++++++++%29+combined%0A++++%29%0ASELECT%0A++++overlap_count%2C%0A++++total_count%2C%0A++++%28overlap_count+*+100.0+%2F+total_count%29+AS+overlap_percentage%0AFROM+overlapping_rows%2C+total_unique_rows%3B%0A) for the leakage query for the IMDB dataset.

![leakage-query-imdb](./leakage-query-image.png)