scasps committed on
Commit
abe2e14
·
1 Parent(s): 3ac9e26

Delete README.md

Files changed (1)
  1. README.md +0 -101
README.md DELETED
@@ -1,101 +0,0 @@
1
- ---
2
- annotations_creators:
3
- - expert-generated
4
- language:
5
- - en
6
- language_creators:
7
- - found
8
- license: []
9
- multilinguality:
10
- - monolingual
11
- pretty_name: message-classification
12
- size_categories:
13
- - 10K<n<100K
14
- source_datasets:
15
- - https://github.com/zeloru/small-english-smalltalk-corpus
16
- tags: []
17
- task_categories:
18
- - text-classification
19
- task_ids:
20
- - semantic-similarity-scoring
21
- ---
22
-
23
- # Dataset Card for message-classification
24
-
25
- ## Table of Contents
26
- - [Table of Contents](#table-of-contents)
27
- - [Description](#description)
28
- - [Summary](#summary)
29
- - [Languages](#languages)
30
- - [Dataset Structure](#dataset-structure)
31
- - [Data Fields](#data-fields)
32
- - [Data Splits](#data-splits)
33
- - [Dataset Creation](#dataset-creation)
34
- - [Curation Rationale](#curation-rationale)
35
- - [Source Data](#source-data)
36
- - [Annotations](#annotations)
37
- - [Considerations for Using the Data](#considerations-for-using-the-data)
38
- - [Known Limitations](#known-limitations)
39
-
40
- ## Description
41
- ### Summary
42
- https://ukatie.com
43
-
44
- This model triggers chatbots to respond in chat rooms such as Slack, MS Teams, Discord and Matrix by detecting whether a user's message is a question or just a comment.
45
-
46
- It is also used to determine the questions and context within an input sentence.
47
-
48
- ### Languages
49
-
50
- So far, English is the only supported language.
51
-
52
- ## Dataset Structure
53
-
54
- ### Data Fields
55
-
56
- - Text: Short input sentence
57
- - Label: Question or Other
58
-
59
- ### Data Splits
60
-
61
- - Question: 10K samples
62
- - Other: 10K samples
63
-
64
- - Training: 18K samples, shuffled
65
- - Validation: 2K samples, shuffled
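The 18K/2K division above corresponds to a shuffled 90/10 split. A minimal sketch of such a split follows; the function name `shuffle_and_split`, the fixed seed, and the toy samples are illustrative assumptions, not the script used to build this dataset.

```python
import random


def shuffle_and_split(samples, validation_fraction=0.1, seed=0):
    """Shuffle (text, label) pairs and split them into train and validation lists.

    Hypothetical helper: the card only states that both splits are shuffled.
    """
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible shuffle
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Toy stand-in for the 10K "question" and 10K "other" samples.
samples = [(f"question {i}", "question") for i in range(10)]
samples += [(f"other {i}", "other") for i in range(10)]

train, validation = shuffle_and_split(samples)
```

With 20K samples and `validation_fraction=0.1`, the same call would yield 18K training and 2K validation samples.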
66
-
67
- ## Dataset Creation
68
-
69
- ### Curation Rationale
70
-
71
- Simple, short examples in basic language were chosen because they contain the same kinds of words and word order as long-winded questions with rare words.
72
-
73
- The goal is also to detect questions in a chat room, a medium where people often use very short sentences and a lot of greetings and small talk.
74
-
75
- ### Source Data
76
-
77
- #### Initial Data Collection
78
-
79
- https://github.com/zeloru/small-english-smalltalk-corpus
80
-
81
- The data was scraped from ESL (English as a second language) learning material.
82
-
83
- From the scraped data, only samples from certain conversations were used, because some of the conversations had quality issues.
84
-
85
- Because questions that are missing a question mark should also be detected, the question mark was removed from most of the question samples. Likewise, the trailing "." was removed from sentences with the "other" label. This keeps the end punctuation from becoming the only feature the model looks at.
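The punctuation removal described above could look roughly like the following sketch. The function name `strip_end_marker` and the `drop_probability` value are assumptions for illustration; the card only says that "most" samples had the marker removed and does not give the exact proportion.

```python
import random


def strip_end_marker(text, label, drop_probability=0.9, rng=random):
    """Remove the trailing "?" (question) or "." (other) from most samples.

    drop_probability is a hypothetical setting, not a value from the card.
    """
    marker = "?" if label == "question" else "."
    stripped = text.rstrip()
    # Only the final marker is dropped, and only with the given probability,
    # so some samples keep their punctuation.
    if stripped.endswith(marker) and rng.random() < drop_probability:
        return stripped[:-1].rstrip()
    return stripped
```

For example, `strip_end_marker("How are you?", "question", drop_probability=1.0)` returns `"How are you"`, while a `drop_probability` of `0.0` leaves the sentence unchanged.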
86
-
87
- ### Annotations
88
-
89
- #### Annotation process
90
-
91
- The "question" and "other" annotations were assigned automatically: questions in the conversations were labeled "question" and the answers were labeled "other".
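One way such automatic labeling might work is sketched below. It assumes that a conversation is a list of turns and that a turn ending in "?" is a question; the card does not say how questions were told apart from answers, so both the rule and the name `label_turns` are hypothetical. Labeling of this kind would have to run before any end punctuation is removed.

```python
def label_turns(conversation):
    """Label each non-empty turn as "question" or "other".

    Assumption: a turn that ends with "?" counts as a question and every
    other turn counts as an answer. The original rule is not documented.
    """
    labelled = []
    for turn in conversation:
        text = turn.strip()
        if not text:
            continue  # skip empty turns
        label = "question" if text.endswith("?") else "other"
        labelled.append((text, label))
    return labelled


conversation = [
    "How was your weekend?",
    "It was great, thanks.",
    "Did you go anywhere?",
    "Just to the park.",
]
samples = label_turns(conversation)
```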
92
-
93
- ## Considerations for Using the Data
94
-
95
- ### Known Limitations
96
-
97
- There seems to be an imbalance of greeting word combinations at the beginning of sentences. For example, "Hi, has anyone deployed X in Y" is not detected as a question because of the "Hi, " part. This issue will be addressed in updates.
98
-
99
- Sentences of the form "Wondering if 'question'..." and "I'm asking for help about 'question'..." seem hard to detect and need more samples in the data.
100
-
101
- Code fragments in input sentences are sometimes detected as questions. If code is present, it should probably be filtered out beforehand.
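A minimal sketch of filtering code out beforehand follows. It assumes that code in chat messages is wrapped in backticks, as is common on Slack, Discord and Matrix; code pasted without backticks is not caught. The name `strip_code` is illustrative and not part of any released tooling.

```python
import re

FENCE = "`" * 3  # a triple-backtick fence, built this way to keep the example readable

# Fenced blocks may span several lines, so "." has to match newlines too.
FENCED_CODE = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
# Inline code: a single pair of backticks on one line.
INLINE_CODE = re.compile(r"`[^`\n]+`")


def strip_code(message):
    """Remove backtick-delimited code from a chat message before classification."""
    without_blocks = FENCED_CODE.sub(" ", message)
    without_inline = INLINE_CODE.sub(" ", without_blocks)
    # Collapse the whitespace left behind by the removed fragments.
    return " ".join(without_inline.split())
```

The cleaned message can then be passed to the classifier in place of the raw input.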