import streamlit as st

# Page configuration
st.set_page_config(
    layout="wide", 
    initial_sidebar_state="auto"
)

# Custom CSS for better styling
st.markdown("""

    <style>

        .main-title {

            font-size: 36px;

            color: #4A90E2;

            font-weight: bold;

            text-align: center;

        }

        .sub-title {

            font-size: 24px;

            color: #4A90E2;

            margin-top: 20px;

        }

        .section {

            background-color: #f9f9f9;

            padding: 15px;

            border-radius: 10px;

            margin-top: 20px;

        }

        .section h2 {

            font-size: 22px;

            color: #4A90E2;

        }

        .section p, .section ul {

            color: #666666;

        }

        .link {

            color: #4A90E2;

            text-decoration: none;

        }

        .benchmark-table {

            width: 100%;

            border-collapse: collapse;

            margin-top: 20px;

        }

        .benchmark-table th, .benchmark-table td {

            border: 1px solid #ddd;

            padding: 8px;

            text-align: left;

        }

        .benchmark-table th {

            background-color: #4A90E2;

            color: white;

        }

        .benchmark-table td {

            background-color: #f2f2f2;

        }

    </style>

""", unsafe_allow_html=True)

# Title
st.markdown('<div class="main-title">Introduction to Longformer for Token & Sequence Classification</div>', unsafe_allow_html=True)

# Subtitle
st.markdown("""

<div class="section">

    <p>Longformer is a transformer-based model designed to handle long documents by leveraging an attention mechanism that scales linearly with the length of the document. This makes it highly effective for tasks such as token classification and sequence classification, especially when dealing with lengthy text inputs.</p>

</div>

""", unsafe_allow_html=True)

# Tabs for Longformer Annotators
tab1, tab2, tab3 = st.tabs(["Longformer For Token Classification", "Longformer For Sequence Classification", "Longformer For Question Answering"])

# Tab 1: LongformerForTokenClassification
with tab1:
    st.markdown("""

    <div class="section">

        <h2>Longformer for Token Classification</h2>

        <p><strong>Token Classification</strong> involves assigning labels to individual tokens (words or subwords) within a sentence. This is essential for tasks like Named Entity Recognition (NER), where each token is classified as a specific entity such as a person, organization, or location.</p>

        <p>Longformer is particularly effective for token classification tasks due to its ability to handle long contexts and capture dependencies over long spans of text.</p>

        <p>Using Longformer for token classification enables:</p>

        <ul>

            <li><strong>Precise NER:</strong> Extract entities from lengthy documents with high accuracy.</li>

            <li><strong>Efficient Contextual Understanding:</strong> Leverage Longformer's attention mechanism to model long-range dependencies.</li>

            <li><strong>Scalability:</strong> Process large documents efficiently using Spark NLP.</li>

        </ul>

    </div>

    """, unsafe_allow_html=True)

    # Implementation Section
    st.markdown('<div class="sub-title">How to Use Longformer for Token Classification in Spark NLP</div>', unsafe_allow_html=True)
    st.markdown("""

    <div class="section">

        <p>Below is an example of how to set up a pipeline in Spark NLP using the Longformer model for token classification, specifically for Named Entity Recognition (NER).</p>

    </div>

    """, unsafe_allow_html=True)

    st.code('''
    from sparknlp.base import *
    from sparknlp.annotator import *
    from pyspark.ml import Pipeline
    from pyspark.sql.functions import col, expr

    document_assembler = DocumentAssembler() \\
        .setInputCol('text') \\
        .setOutputCol('document')

    tokenizer = Tokenizer() \\
        .setInputCols(['document']) \\
        .setOutputCol('token')

    tokenClassifier = LongformerForTokenClassification \\
        .pretrained('longformer_base_token_classifier_conll03', 'en') \\
        .setInputCols(['token', 'document']) \\
        .setOutputCol('ner') \\
        .setCaseSensitive(True) \\
        .setMaxSentenceLength(512)

    ner_converter = NerConverter() \\
        .setInputCols(['document', 'token', 'ner']) \\
        .setOutputCol('entities')

    pipeline = Pipeline(stages=[
        document_assembler,
        tokenizer,
        tokenClassifier,
        ner_converter
    ])

    text = "Facebook is a social networking service launched as TheFacebook on February 4, 2004. It was founded by Mark Zuckerberg with his college roommates and fellow Harvard University students Eduardo Saverin, Andrew McCollum, Dustin Moskovitz and Chris Hughes. The website's membership was initially limited by the founders to Harvard students, but was expanded to other colleges in the Boston area, the Ivy League, and gradually most universities in the United States and Canada."
    example = spark.createDataFrame([[text]]).toDF("text")
    result = pipeline.fit(example).transform(example)

    result.select(
        expr("explode(entities) as ner_chunk")
    ).select(
        col("ner_chunk.result").alias("chunk"),
        col("ner_chunk.metadata.entity").alias("ner_label")
    ).show(truncate=False)
    ''', language='python')

    # Example Output
    st.text("""

    +------------------+---------+

    |chunk             |ner_label|

    +------------------+---------+

    |Mark Zuckerberg   |PER      |

    |Harvard University|ORG      |

    |Eduardo Saverin   |PER      |

    |Andrew McCollum   |PER      |

    |Dustin Moskovitz  |PER      |

    |Chris Hughes      |PER      |

    |Harvard           |ORG      |

    |Boston            |LOC      |

    |Ivy               |ORG      |

    |League            |ORG      |

    |United            |LOC      |

    |States            |LOC      |

    |Canada            |LOC      |

    +------------------+---------+

    """)

    # Model Info Section
    st.markdown('<div class="sub-title">Choosing the Right Longformer Model</div>', unsafe_allow_html=True)
    st.markdown("""

    <div class="section">

        <p>Spark NLP offers various Longformer models tailored for token classification tasks. Selecting the appropriate model can significantly impact performance.</p>

        <p>Explore the available models on the <a class="link" href="https://sparknlp.org/models?annotator=LongformerForTokenClassification" target="_blank">Spark NLP Models Hub</a> to find the one that fits your needs.</p>

    </div>

    """, unsafe_allow_html=True)

# Tab 2: LongformerForSequenceClassification
with tab2:
    st.markdown("""

    <div class="section">

        <h2>Longformer for Sequence Classification</h2>

        <p><strong>Sequence Classification</strong> involves assigning a label to an entire sequence of text, such as determining the sentiment of a review or categorizing a document into topics. Longformer’s ability to model long-range dependencies is particularly beneficial for sequence classification tasks.</p>

        <p>Using Longformer for sequence classification enables:</p>

        <ul>

            <li><strong>Accurate Sentiment Analysis:</strong> Determine the sentiment of long text sequences.</li>

            <li><strong>Effective Document Classification:</strong> Categorize lengthy documents based on their content.</li>

            <li><strong>Robust Performance:</strong> Benefit from Longformer’s attention mechanism for improved classification accuracy.</li>

        </ul>

    </div>

    """, unsafe_allow_html=True)

    # Implementation Section
    st.markdown('<div class="sub-title">How to Use Longformer for Sequence Classification in Spark NLP</div>', unsafe_allow_html=True)
    st.markdown("""

    <div class="section">

        <p>The following example demonstrates how to set up a pipeline in Spark NLP using the Longformer model for sequence classification, particularly for sentiment analysis of movie reviews.</p>

    </div>

    """, unsafe_allow_html=True)

    st.code('''
    from sparknlp.base import *
    from sparknlp.annotator import *
    from pyspark.ml import Pipeline

    document_assembler = DocumentAssembler() \\
        .setInputCol('text') \\
        .setOutputCol('document')

    tokenizer = Tokenizer() \\
        .setInputCols(['document']) \\
        .setOutputCol('token')

    sequenceClassifier = LongformerForSequenceClassification \\
        .pretrained('longformer_base_sequence_classifier_imdb', 'en') \\
        .setInputCols(['token', 'document']) \\
        .setOutputCol('class') \\
        .setCaseSensitive(False) \\
        .setMaxSentenceLength(1024)

    pipeline = Pipeline(stages=[
        document_assembler,
        tokenizer,
        sequenceClassifier
    ])

    example = spark.createDataFrame([['I really liked that movie!']]).toDF("text")
    result = pipeline.fit(example).transform(example)

    result.select('document.result','class.result').show()
    ''', language='python')

    # Example Output
    st.text("""

    +--------------------+------+

    |              result|result|

    +--------------------+------+

    |[I really liked t...| [pos]|

    +--------------------+------+

    """)

    # Model Info Section
    st.markdown('<div class="sub-title">Choosing the Right Longformer Model</div>', unsafe_allow_html=True)
    st.markdown("""

    <div class="section">

        <p>Various Longformer models are available for sequence classification in Spark NLP. Each model is fine-tuned for specific tasks, so selecting the right one is crucial for achieving optimal performance.</p>

        <p>Explore the available models on the <a class="link" href="https://sparknlp.org/models?annotator=LongformerForSequenceClassification" target="_blank">Spark NLP Models Hub</a> to find the best fit for your use case.</p>

    </div>

    """, unsafe_allow_html=True)

# Tab 3: LongformerForQuestionAnswering
with tab3:
    st.markdown("""

    <div class="section">

        <h2>Longformer for Question Answering</h2>

        <p><strong>Question Answering</strong> is the task of identifying the correct answer to a question from a given context or passage. Longformer's ability to process long documents makes it highly suitable for question answering tasks, especially in cases where the context is lengthy.</p>

        <p>Using Longformer for question answering enables:</p>

        <ul>

            <li><strong>Accurate Answer Extraction:</strong> Identify precise answers within long passages.</li>

            <li><strong>Contextual Understanding:</strong> Benefit from Longformer's global and local attention mechanisms to capture relevant information from context.</li>

            <li><strong>Scalability:</strong> Efficiently process and handle extensive documents using Spark NLP.</li>

        </ul>

    </div>

    """, unsafe_allow_html=True)

    # Implementation Section
    st.markdown('<div class="sub-title">How to Use Longformer for Question Answering in Spark NLP</div>', unsafe_allow_html=True)
    st.markdown("""

    <div class="section">

        <p>The following example demonstrates how to set up a pipeline in Spark NLP using the Longformer model for question answering, specifically tailored for SQuAD v2 dataset.</p>

    </div>

    """, unsafe_allow_html=True)

    st.code('''
    from sparknlp.base import *
    from sparknlp.annotator import *
    from pyspark.ml import Pipeline

    documentAssembler = MultiDocumentAssembler() \\
        .setInputCols(["question", "context"]) \\
        .setOutputCols(["document_question", "document_context"])

    spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_base_base_qa_squad2", "en") \\
        .setInputCols(["document_question", "document_context"]) \\
        .setOutputCol("answer") \\
        .setCaseSensitive(True)

    pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

    data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

    result = pipeline.fit(data).transform(data)

    result.select("answer.result").show()
    ''', language='python')

    # Example Output
    st.text("""

    +-------+

    | result|

    +-------+

    |[Clara]|

    +-------+

    """)

    # Model Info Section
    st.markdown('<div class="sub-title">Choosing the Right Longformer Model</div>', unsafe_allow_html=True)
    st.markdown("""

    <div class="section">

        <p>Various Longformer models are available for question answering in Spark NLP. Each model is fine-tuned for specific tasks, so selecting the right one is crucial for achieving optimal performance.</p>

        <p>Explore the available models on the <a class="link" href="https://sparknlp.org/models?annotator=LongformerForQuestionAnswering" target="_blank">Spark NLP Models Hub</a> to find the best fit for your use case.</p>

    </div>

    """, unsafe_allow_html=True)

# Footer
st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)

st.markdown("""

<div class="section">

    <ul>

        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>

        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>

        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>

        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>

        <li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>

    </ul>

</div>

""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">Quick Links</div>', unsafe_allow_html=True)

st.markdown("""

<div class="section">

    <ul>

        <li><a class="link" href="https://sparknlp.org/docs/en/quickstart" target="_blank">Getting Started</a></li>

        <li><a class="link" href="https://nlp.johnsnowlabs.com/models" target="_blank">Pretrained Models</a></li>

        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/annotation/text/english" target="_blank">Example Notebooks</a></li>

        <li><a class="link" href="https://sparknlp.org/docs/en/install" target="_blank">Installation Guide</a></li>

    </ul>

</div>

""", unsafe_allow_html=True)