drewThomasson committed on
Commit c010f6f · verified · 1 Parent(s): 91455a7

Update README.md
Files changed (1): README.md (+225 -0)
README.md CHANGED

pinned: false
python_version: 3.10.4
license: mit
short_description: You can test out BookNLP here! :)
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Try out BookNLP here for free! :)

## With the help of Calibre, it can accept these ebook file formats

Input formats: AZW, AZW3, AZW4, CBZ, CBR, CB7, CBC, CHM, DJVU, DOCX, EPUB, FB2, FBZ, HTML, HTMLZ, LIT, LRF, MOBI, ODT, PDF, PRC, PDB, PML, RB, RTF, SNB, TCR, TXT, TXTZ

# BookNLP

BookNLP is a natural language processing pipeline that scales to books and other long documents (in English), including:

* Part-of-speech tagging
* Dependency parsing
* Entity recognition
* Character name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> TOM_SAWYER) and coreference resolution
* Quotation speaker identification
* Supersense tagging (e.g., "animal", "artifact", "body", "cognition", etc.)
* Event tagging
* Referential gender inference (TOM_SAWYER -> he/him/his)

BookNLP ships with two models, both with identical architectures but different underlying BERT sizes. The larger and more accurate `big` model is suited to GPUs and multi-core computers; the faster `small` model is more appropriate for personal computers. The table below compares the two, both in overall speed and in accuracy on the tasks that BookNLP performs.

| |Small|Big|
|---|---|---|
|Entity tagging (F1)|88.2|90.0|
|Supersense tagging (F1)|73.2|76.2|
|Event tagging (F1)|70.6|74.1|
|Coreference resolution (Avg. F1)|76.4|79.0|
|Speaker attribution (B3)|86.4|89.9|
|CPU time, 2019 MacBook Pro (mins.)*|3.6|15.4|
|CPU time, 10-core server (mins.)*|2.4|5.2|
|GPU time, Titan RTX (mins.)*|2.1|2.2|

\*Timings measure the speed of running BookNLP on a sample book, *The Secret Garden* (99K tokens). To explore running BookNLP in Google Colab on a GPU, see [this notebook](https://colab.research.google.com/drive/1c9nlqGRbJ-FUP2QJe49h21hB4kUXdU_k?usp=sharing).

## Installation

* Create an anaconda environment, if desired. First [download and install anaconda](https://www.anaconda.com/download/); then create and activate a fresh environment.

```sh
conda create --name booknlp python=3.7
conda activate booknlp
```

* If using a GPU, install pytorch for your system and CUDA version by following the installation instructions at [https://pytorch.org](https://pytorch.org).

* Install booknlp and download the Spacy model.

```sh
pip install booknlp
python -m spacy download en_core_web_sm
```

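If PyTorch was installed with CUDA support, a quick check like the following (a minimal sketch; it only works after the PyTorch install above) confirms that a GPU is visible before running the `big` model:

```python
# Minimal check that PyTorch can see a CUDA GPU (assumes PyTorch is installed as above)
import torch

print(torch.cuda.is_available())          # True if a CUDA-capable GPU is usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first visible GPU
```
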
## Usage

```python
from booknlp.booknlp import BookNLP

model_params = {
    "pipeline": "entity,quote,supersense,event,coref",
    "model": "big"
}

booknlp = BookNLP("en", model_params)

# Input file to process
input_file = "input_dir/bartleby_the_scrivener.txt"

# Output directory to store resulting files in
output_directory = "output_dir/bartleby/"

# Files within this directory will be named ${book_id}.entities, ${book_id}.tokens, etc.
book_id = "bartleby"

booknlp.process(input_file, output_directory, book_id)
```

This runs the full BookNLP pipeline; to cut down on computation time, you can run only some elements of the pipeline by listing them in the `pipeline` parameter (e.g., to run only entity tagging and event tagging, change `model_params` above to include `"pipeline":"entity,event"`).

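For example, a reduced run with just those two components might look like this (a sketch using the same API as above; only the `pipeline` string and the `model` choice change):

```python
from booknlp.booknlp import BookNLP

# Only entity tagging and event tagging, with the faster `small` model
model_params = {
    "pipeline": "entity,event",
    "model": "small"
}

booknlp = BookNLP("en", model_params)
booknlp.process("input_dir/bartleby_the_scrivener.txt", "output_dir/bartleby/", "bartleby")
```
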
This process creates the directory `output_dir/bartleby` and generates the following files:

* `bartleby/bartleby.tokens` -- This encodes core word-level information. Each row corresponds to one token and includes the following information:
    * paragraph ID
    * sentence ID
    * token ID within sentence
    * token ID within document
    * word
    * lemma
    * byte onset within original document
    * byte offset within original document
    * POS tag
    * dependency relation
    * token ID within document of syntactic head
    * event

* `bartleby/bartleby.entities` -- This represents the typed entities within the document (e.g., people and places), along with their coreference.
    * coreference ID (unique entity ID)
    * start token ID within document
    * end token ID within document
    * NOM (nominal), PROP (proper), or PRON (pronoun)
    * PER (person), LOC (location), FAC (facility), GPE (geo-political entity), VEH (vehicle), ORG (organization)
    * text of entity
* `bartleby/bartleby.supersense` -- This stores information from supersense tagging.
    * start token ID within document
    * end token ID within document
    * supersense category (verb.cognition, verb.communication, noun.artifact, etc.)
* `bartleby/bartleby.quotes` -- This stores information about the quotations in the document, along with the speaker. In a sentence like "'Yes', she said", where she -> ELIZABETH\_BENNETT, "she" is the attributed mention of the quotation 'Yes', and is coreferent with the unique entity ELIZABETH\_BENNETT.
    * start token ID within document of quotation
    * end token ID within document of quotation
    * start token ID within document of attributed mention
    * end token ID within document of attributed mention
    * attributed mention text
    * coreference ID (unique entity ID) of attributed mention
    * quotation text
* `bartleby/bartleby.book` -- A JSON file providing information about all characters mentioned more than once in the book, including their proper/common/pronominal references, referential gender, actions for which they are the agent and patient, objects they possess, and modifiers.
* `bartleby/bartleby.book.html` -- An HTML file containing (a) the full text of the book along with annotations for entities, coreference, and speaker attribution and (b) a list of the named characters and major entity categories (FAC, GPE, LOC, etc.).

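The `.tokens`, `.entities`, `.supersense`, and `.quotes` files are plain-text tables. Assuming they are tab-separated with a header row (worth confirming against your own output, since the exact column names are not listed here), they can be loaded for analysis like this (a minimal sketch using the paths from the example above):

```python
import pandas as pd

# Paths matching the book_id/output_directory used in the Usage example
tokens = pd.read_csv("output_dir/bartleby/bartleby.tokens", sep="\t", quoting=3)
entities = pd.read_csv("output_dir/bartleby/bartleby.entities", sep="\t", quoting=3)

# Inspect the column names before relying on them
print(tokens.columns.tolist())
print(entities.head())
```
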
# Annotations

## Entity annotations

The entity annotation layer covers six of the ACE 2005 categories in text:

* People (PER): *Tom Sawyer*, *her daughter*
* Facilities (FAC): *the house*, *the kitchen*
* Geo-political entities (GPE): *London*, *the village*
* Locations (LOC): *the forest*, *the river*
* Vehicles (VEH): *the ship*, *the car*
* Organizations (ORG): *the army*, *the Church*

The targets of annotation include named entities (e.g., Tom Sawyer), common entities (the boy), and pronouns (he). These entities can be nested, as in the following:

<img src="img/nested_structure.png" alt="drawing" width="300"/>

For more, see: David Bamman, Sejal Popat and Sheng Shen, "[An Annotated Dataset of Literary Entities](http://people.ischool.berkeley.edu/~dbamman/pubs/pdf/naacl2019_literary_entities.pdf)," NAACL 2019.

The entity tagging model within BookNLP is trained on an annotated dataset of 968K tokens, including the public domain materials in [LitBank](https://github.com/dbamman/litbank) and a new dataset of ~500 contemporary books, including bestsellers, Pulitzer Prize winners, works by Black authors, global Anglophone books, and genre fiction (article forthcoming).

## Event annotations

The event layer identifies events with asserted *realis* (depicted as actually taking place, with specific participants at a specific time) -- as opposed to events with other epistemic modalities (hypotheticals, future events, extradiegetic summaries by the narrator).

|Text|Events|Source|
|---|---|---|
|My father’s eyes had **closed** upon the light of this world six months, when mine **opened** on it.|{closed, opened}|Dickens, *David Copperfield*|
|Call me Ishmael.|{}|Melville, *Moby Dick*|
|His sister was a tall, strong girl, and she **walked** rapidly and resolutely, as if she knew exactly where she was going and what she was going to do next.|{walked}|Cather, *O Pioneers!*|

For more, see: Matt Sims, Jong Ho Park and David Bamman, "[Literary Event Detection](http://people.ischool.berkeley.edu/~dbamman/pubs/pdf/acl2019_literary_events.pdf)," ACL 2019.

The event tagging model is trained on event annotations within [LitBank](https://github.com/dbamman/litbank). The `small` model above makes use of a distillation process, training on the predictions made by the `big` model for a collection of contemporary texts.

## Supersense tagging

[Supersense tagging](https://aclanthology.org/W06-1670.pdf) provides coarse semantic information for a sentence by tagging spans with 41 lexical semantic categories drawn from WordNet, spanning both nouns (including *plant*, *animal*, *food*, *feeling*, and *artifact*) and verbs (including *cognition*, *communication*, *motion*, etc.).

|Example|Source|
|---|---|
|The [station wagons]<sub>artifact</sub> [arrived]<sub>motion</sub> at [noon]<sub>time</sub>, a long shining [line]<sub>group</sub> that [coursed]<sub>motion</sub> through the [west campus]<sub>location</sub>.|Delillo, *White Noise*|

The BookNLP tagger is trained on [SemCor](https://web.eecs.umich.edu/~mihalcea/downloads.html#semcor).

## Character name clustering and coreference

The coreference layer covers the six ACE entity categories outlined above (people, facilities, locations, geo-political entities, organizations and vehicles) and is trained on [LitBank](https://github.com/dbamman/litbank) and [PreCo](https://preschool-lab.github.io/PreCo/).

|Example|Source|
|---|---|
|One may as well begin with [Helen]<sub>x</sub>'s letters to [[her]<sub>x</sub> sister]<sub>y</sub>|Forster, *Howard's End*|

Accurate coreference at the scale of a book-length document is still an open research problem, and attempting full coreference -- where any named entity (Elizabeth), common entity (her sister, his daughter) and pronoun (she) can corefer -- tends to erroneously conflate multiple distinct entities into one. By default, BookNLP addresses this by first carrying out character name clustering (grouping "Tom", "Tom Sawyer" and "Mr. Sawyer" into a single entity), and then allowing pronouns to corefer with either named entities (Tom) or common entities (the boy), but disallowing common entities from coreferring with named entities. To turn off this mode and carry out full coreference, add `pronominalCorefOnly=False` to the `model_params` parameters dictionary above (but be sure to inspect the output!).

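A minimal sketch of that configuration, using the same entry points as in the Usage section above:

```python
from booknlp.booknlp import BookNLP

model_params = {
    "pipeline": "entity,quote,supersense,event,coref",
    "model": "big",
    # Allow full coreference (common entities may corefer with named entities);
    # the default pronominal-only behavior is generally safer
    "pronominalCorefOnly": False
}

booknlp = BookNLP("en", model_params)
```
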
For more on the coreference criteria used in this work, see David Bamman, Olivia Lewke and Anya Mansoor (2020), "[An Annotated Dataset of Coreference in English Literature](https://arxiv.org/abs/1912.01140)", LREC.

## Referential gender inference

BookNLP infers the *referential gender* of characters by associating them with the pronouns (he/him/his, she/her, they/them, xe/xem/xyr/xir, etc.) used to refer to them in the context of the story. This method encodes several assumptions:

* BookNLP describes the referential gender of characters, and not their gender identity. Characters are described by the pronouns used to refer to them (e.g., he/him, she/her) rather than labels like "M/F".

* Prior information on the alignment of names with referential gender (e.g., from government records or larger background datasets) can be used to inform this process if desired (e.g., "Tom" is often associated with he/him in pre-1923 English texts). Name information, however, should not be uniquely determinative, but rather should be sensitive to the context in which it is used (e.g., "Tom" in the book "Tom and Some Other Girls", where Tom is aligned with she/her). By default, BookNLP uses prior information on the alignment of proper names and honorifics with pronouns drawn from ~15K works from Project Gutenberg; this prior information can be ignored by setting `referential_gender_hyperparameterFile:None` in the `model_params` dictionary. Alternative priors can be used by passing the pathname of a prior file (in the same format as `english/data/gutenberg_prop_gender_terms.txt`) to this parameter.

* Users should be free to define the referential gender categories used here. The default set of categories is {he, him, his}, {she, her}, {they, them, their}, {xe, xem, xyr, xir}, and {ze, zem, zir, hir}. To specify a different set of categories, update the `model_params` setting to define them: `referential_gender_cats: [ ["he", "him", "his"], ["she", "her"], ["they", "them", "their"], ["xe", "xem", "xyr", "xir"], ["ze", "zem", "zir", "hir"] ]`

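Combining both options into `model_params` might look like this (a sketch based on the parameter names above; the exact dictionary form is an assumption inferred from the notation used in this README):

```python
from booknlp.booknlp import BookNLP

model_params = {
    "pipeline": "entity,quote,supersense,event,coref",
    "model": "big",
    # Ignore the Project Gutenberg name/honorific priors
    "referential_gender_hyperparameterFile": None,
    # Explicitly define the referential gender categories
    "referential_gender_cats": [
        ["he", "him", "his"],
        ["she", "her"],
        ["they", "them", "their"],
        ["xe", "xem", "xyr", "xir"],
        ["ze", "zem", "zir", "hir"]
    ]
}

booknlp = BookNLP("en", model_params)
```
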
## Speaker attribution

The speaker attribution model identifies all instances of direct speech in the text and attributes each to its speaker.

|Quote|Speaker|Source|
|---|---|---|
|— Come up , Kinch ! Come up , you fearful jesuit !|Buck\_Mulligan-0|Joyce, *Ulysses*|
|‘ Oh dear ! Oh dear ! I shall be late ! ’|The\_White\_Rabbit-4|Carroll, *Alice in Wonderland*|
|“ Do n't put your feet up there , Huckleberry ; ”|Miss\_Watson-26|Twain, *Huckleberry Finn*|

This model is trained on speaker attribution data in [LitBank](https://github.com/dbamman/litbank). For more on the quotation annotations, see [this paper](https://arxiv.org/pdf/2004.13980.pdf).

## Part-of-speech tagging and dependency parsing

BookNLP uses [Spacy](https://spacy.io) for part-of-speech tagging and dependency parsing.

# Acknowledgments

<table><tr><td><img width="250" src="https://www.neh.gov/sites/default/files/inline-files/NEH-Preferred-Seal820.jpg" /></td><td><img width="150" src="https://www.nsf.gov/images/logos/NSF_4-Color_bitmap_Logo.png" /></td><td>
BookNLP is supported by the National Endowment for the Humanities (HAA-271654-20) and the National Science Foundation (IIS-1942591).
</td></tr></table>