Commit 1a411ea · Parent(s): de3b367
change base datasets links to the dataset original paper

tasks_config/pt_config.yaml CHANGED (+18 -18)
@@ -62,8 +62,8 @@ tasks:
       level exam widely applied every year by the Brazilian government to students that
       wish to undertake a University degree. This dataset contains 1,430 questions that don't require
       image understanding of the exams from 2010 to 2018, 2022 and 2023."
-    link: https://
-    sources: ["https://www.ime.usp.br/~ddm/project/enem/", "https://github.com/piresramon/gpt-4-enem", "https://huggingface.co/datasets/maritaca-ai/enem"]
+    link: https://www.ime.usp.br/~ddm/project/enem/ENEM-GuidingTest.pdf
+    sources: ["https://huggingface.co/datasets/eduagarcia/enem_challenge", "https://www.ime.usp.br/~ddm/project/enem/", "https://github.com/piresramon/gpt-4-enem", "https://huggingface.co/datasets/maritaca-ai/enem"]
     baseline_sources: ["https://www.sejalguem.com/enem", "https://vestibular.brasilescola.uol.com.br/enem/confira-as-medias-e-notas-maximas-e-minimas-do-enem-2020/349732.html"]
   bluex:
     benchmark: bluex
@@ -81,8 +81,8 @@ tasks:
     description: "BLUEX is a multimodal dataset consisting of the two leading
       university entrance exams conducted in Brazil: Convest (Unicamp) and Fuvest (USP),
       spanning from 2018 to 2024. The benchmark comprises of 724 questions that do not have accompanying images"
-    link: https://
-    sources: ["https://github.com/portuguese-benchmark-datasets/bluex", "https://huggingface.co/datasets/portuguese-benchmark-datasets/BLUEX"]
+    link: https://arxiv.org/abs/2307.05410
+    sources: ["https://huggingface.co/datasets/eduagarcia-temp/BLUEX_without_images", "https://github.com/portuguese-benchmark-datasets/bluex", "https://huggingface.co/datasets/portuguese-benchmark-datasets/BLUEX"]
     baseline_sources: ["https://www.comvest.unicamp.br/wp-content/uploads/2023/08/Relatorio_F1_2023.pdf", "https://acervo.fuvest.br/fuvest/2018/FUVEST_2018_indice_discriminacao_1_fase_ins.pdf"]
   oab_exams:
     benchmark: oab_exams
@@ -104,8 +104,8 @@ tasks:
     expert_human_baseline: 75.0
     description: OAB Exams is a dataset of more than 2,000 questions from the Brazilian Bar
       Association's exams, from 2010 to 2018.
-    link: https://
-    sources: ["https://github.com/legal-nlp/oab-exams"]
+    link: https://arxiv.org/abs/1712.05128
+    sources: ["https://huggingface.co/datasets/eduagarcia/oab_exams", "https://github.com/legal-nlp/oab-exams"]
     baseline_sources: ["http://fgvprojetos.fgv.br/publicacao/exame-de-ordem-em-numeros", "http://fgvprojetos.fgv.br/publicacao/exame-de-ordem-em-numeros-vol2", "http://fgvprojetos.fgv.br/publicacao/exame-de-ordem-em-numeros-vol3"]
   assin2_rte:
     benchmark: assin2_rte
@@ -124,8 +124,8 @@ tasks:
       of Portuguese. Recognising Textual Entailment (RTE), also called Natural Language
       Inference (NLI), is the task of predicting if a given text (premise) entails (implies) in
       other text (hypothesis)."
-    link: https://
-    sources: ["https://sites.google.com/view/assin2/", "https://huggingface.co/datasets/assin2"]
+    link: https://dl.acm.org/doi/abs/10.1007/978-3-030-41505-1_39
+    sources: ["https://huggingface.co/datasets/eduagarcia/portuguese_benchmark", "https://sites.google.com/view/assin2/", "https://huggingface.co/datasets/assin2"]
   assin2_sts:
     benchmark: assin2_sts
     col_name: ASSIN2 STS
@@ -139,8 +139,8 @@ tasks:
     expert_human_baseline: null
     description: "Same as dataset as above. Semantic Textual Similarity (STS)
       ‘measures the degree of semantic equivalence between two sentences’."
-    link: https://
-    sources: ["https://sites.google.com/view/assin2/", "https://huggingface.co/datasets/assin2"]
+    link: https://dl.acm.org/doi/abs/10.1007/978-3-030-41505-1_39
+    sources: ["https://huggingface.co/datasets/eduagarcia/portuguese_benchmark", "https://sites.google.com/view/assin2/", "https://huggingface.co/datasets/assin2"]
   faquad_nli:
     benchmark: faquad_nli
     col_name: FAQUAD NLI
@@ -161,8 +161,8 @@ tasks:
       Brazilian higher education system. FaQuAD-NLI is a modified version of the
       FaQuAD dataset that repurposes the question answering task as a textual
       entailment task between a question and its possible answers."
-    link: https://
-    sources: ["https://github.com/liafacom/faquad/"]
+    link: https://ieeexplore.ieee.org/abstract/document/8923668
+    sources: ["https://github.com/liafacom/faquad/", "https://huggingface.co/datasets/ruanchaves/faquad-nli"]
   hatebr_offensive:
     benchmark: hatebr_offensive
     col_name: HateBR Offensive
@@ -178,8 +178,8 @@ tasks:
       on the web and social media. The HateBR was collected from Brazilian Instagram comments of politicians and manually annotated
       by specialists. It is composed of 7,000 documents annotated with a binary classification (offensive
       versus non-offensive comments)."
-    link: https://
-    sources: ["https://github.com/franciellevargas/HateBR", "https://huggingface.co/datasets/ruanchaves/hatebr"]
+    link: https://arxiv.org/abs/2103.14972
+    sources: ["https://huggingface.co/datasets/eduagarcia/portuguese_benchmark", "https://github.com/franciellevargas/HateBR", "https://huggingface.co/datasets/ruanchaves/hatebr"]
   portuguese_hate_speech:
     benchmark: portuguese_hate_speech
     col_name: PT Hate Speech
@@ -192,8 +192,8 @@ tasks:
     human_baseline: null
     expert_human_baseline: null
     description: "Portuguese dataset for hate speech detection composed of 5,668 tweets with binary annotations (i.e. 'hate' vs. 'no-hate')"
-    link: https://
-    sources: ["https://github.com/paulafortuna/Portuguese-Hate-Speech-Dataset", "https://huggingface.co/datasets/hate_speech_portuguese"]
+    link: https://aclanthology.org/W19-3510/
+    sources: ["https://huggingface.co/datasets/eduagarcia/portuguese_benchmark", "https://github.com/paulafortuna/Portuguese-Hate-Speech-Dataset", "https://huggingface.co/datasets/hate_speech_portuguese"]
   tweetsentbr:
     benchmark: tweetsentbr
     col_name: tweetSentBR
@@ -209,6 +209,6 @@ tasks:
       It was labeled by several annotators following steps stablished on the literature for
       improving reliability on the task of Sentiment Analysis. Each Tweet was annotated
       in one of the three following classes: Positive, Negative, Neutral."
-    link: https://
-    sources: ["https://bitbucket.org/HBrum/tweetsentbr"]
+    link: https://arxiv.org/abs/1712.08917
+    sources: ["https://bitbucket.org/HBrum/tweetsentbr"]

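Every edited pair follows the same pattern: a placeholder "link: https://" is replaced by the dataset's original paper, and for most tasks a leaderboard-hosted Hugging Face mirror is prepended to "sources". A quick script can confirm no placeholder survived the change. The following is a minimal sketch, not part of the commit: it assumes PyYAML is installed and that the file keeps the tasks / link / sources layout shown in the diff.

# Hypothetical sanity check (not in the repo): verify that every task in
# tasks_config/pt_config.yaml has a real link and a non-empty sources list.
import yaml

with open("tasks_config/pt_config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

for name, task in config["tasks"].items():
    link = task.get("link", "")
    # A bare "https://" was the old placeholder value removed by this commit.
    assert link and link != "https://", f"{name}: link is still a placeholder"
    assert task.get("sources"), f"{name}: sources list is empty or missing"
    print(f"{name}: OK ({link})")

Running it after this commit should print one OK line per task; before the commit, the first assertion would fail on every task that still carried the bare "https://" placeholder.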