Wyona
/

message-classification-question-other-smalltalk-modified

@@ -1,101 +0,0 @@
----
-annotations_creators:
-- expert-generated
-language:
-- en
-language_creators:
-- found
-license: []
-multilinguality:
-- monolingual
-pretty_name: message-classification
-size_categories:
-- 10K<n<100K
-source_datasets:
-- https://github.com/zeloru/small-english-smalltalk-corpus
-tags: []
-task_categories:
-- text-classification
-task_ids:
-- semantic-similarity-scoring
----
-# Dataset Card for message-classification
-## Table of Contents
-- [Table of Contents](#table-of-contents)
-- [Description](#description)
-  - [Summary](#summary)
-  - [Languages](#languages)
-- [Dataset Structure](#dataset-structure)
-  - [Data Fields](#data-fields)
-  - [Data Splits](#data-splits)
-- [Dataset Creation](#dataset-creation)
-  - [Curation Rationale](#curation-rationale)
-  - [Source Data](#source-data)
-  - [Annotations](#annotations)
-- [Considerations for Using the Data](#considerations-for-using-the-data)
-  - [Known Limitations](#known-limitations)
-## Description
-### Summary
-https://ukatie.com
-The classifier trained on this dataset is used to trigger chatbots to respond in chat rooms such as Slack, MS Teams, Discord and Matrix by detecting whether a user message is a question or just a comment.
-It is also used to identify the question and its context within an input sentence.
-### Languages
-So far, English is the only supported language.
-## Dataset Structure
-### Data Fields
-- Text: Short input sentence
-- Label: Question or Other
-### Data Splits
-- Question: 10K samples
-- Other: 10K samples
-- Training: 18K samples, shuffled
-- Validation: 2K samples, shuffled
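The split above can be reproduced roughly as follows. This is a minimal sketch, not the original preprocessing script: the helper `shuffle_and_split`, the seed, and the placeholder samples are all assumptions made for illustration.

```python
import random


def shuffle_and_split(samples, validation_size, seed=0):
    """Shuffle labelled samples and split off a validation set.

    `seed` is an assumed parameter; the card does not say how the
    original shuffle was seeded.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # First `validation_size` items become validation, the rest training.
    return shuffled[validation_size:], shuffled[:validation_size]


# Placeholder data standing in for the real 10K "question" and 10K "other" samples.
samples = [(f"question {i}", "question") for i in range(10_000)]
samples += [(f"other {i}", "other") for i in range(10_000)]

train, validation = shuffle_and_split(samples, validation_size=2_000)
print(len(train), len(validation))  # 18000 2000
```

Shuffling before splitting keeps both labels mixed in each split; a stratified split would additionally guarantee an exact 50/50 label balance in the validation set.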
-## Dataset Creation
-### Curation Rationale
-Simple, short and basic language examples were chosen because they contain the same kinds of words and word order as long-winded questions with rare words.
-The goal is also to detect questions in a chat room, a medium where people often use very short sentences and a lot of greetings or small talk.
-### Source Data
-#### Initial Data Collection
-https://github.com/zeloru/small-english-smalltalk-corpus
-The data was scraped from ESL language-learning material.
-Of the scraped data, only samples from certain conversations were used, because some conversations had quality issues.
-Because we also want to detect questions that are missing a question mark, most of the question samples had the question mark removed. The same was done for the "other" label, where a trailing "." was removed from the end of a sentence. This was done so that these markers do not become the only feature the model looks at.
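The punctuation removal described above might look like the sketch below. The function name `strip_end_mark` and the `keep_probability` parameter are hypothetical: the card only says that "most" question marks were removed and does not give a ratio.

```python
import random


def strip_end_mark(text, label, keep_probability=0.1, rng=random):
    """Remove the trailing '?' (question) or '.' (other) from most samples.

    `keep_probability` is an assumed value; the fraction of samples that
    kept their end mark in the original data is not documented.
    """
    mark = "?" if label == "question" else "."
    stripped = text.rstrip()
    if stripped.endswith(mark) and rng.random() >= keep_probability:
        # Drop every trailing copy of the mark, then any leftover whitespace.
        return stripped.rstrip(mark).rstrip()
    return stripped


print(strip_end_mark("How are you?", "question", keep_probability=0.0))
print(strip_end_mark("I am fine.", "other", keep_probability=0.0))
```

Keeping a small share of samples with their end mark lets the model see both forms, so the mark stays a useful signal without being the only one.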
-### Annotations
-#### Annotation process
-The "question" and "other" labels were assigned automatically: questions in the conversations were labelled "question" and the answers were labelled "other".
-## Considerations for Using the Data
-### Known Limitations
-There seems to be an imbalance of greeting word combinations at the beginning of sentences. For example, "Hi, has anyone deployed X in Y" is wrongly not detected as a question because of the "Hi, " part. This issue will be addressed in updates.
-Sentences of the form "Wondering if 'question'..." and "I'm asking for help about 'question'..." are seemingly hard to detect and need more samples in the data.
-Code fragments in input sentences are sometimes detected as questions. If code is present, it should probably be filtered out beforehand.
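One possible way to filter code beforehand is sketched below. It assumes that code in chat messages is wrapped in Markdown backticks, as is common on Slack, Discord and Matrix; the helper `strip_code` is hypothetical, and unformatted code pasted as plain text would not be caught.

```python
import re

# Assumption: code is marked up with Markdown fences or inline backticks.
FENCED = re.compile(r"```.*?```", re.DOTALL)
INLINE = re.compile(r"`[^`\n]+`")


def strip_code(message):
    """Remove fenced and inline code spans, then collapse whitespace."""
    without_fenced = FENCED.sub(" ", message)
    without_inline = INLINE.sub(" ", without_fenced)
    return " ".join(without_inline.split())


print(strip_code("Has anyone seen `foo()` fail"))
```

The cleaned message can then be passed to the classifier in place of the raw input.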