import streamlit as st

# Page configuration
st.set_page_config(
    layout="wide", 
    initial_sidebar_state="auto"
)

# Custom CSS for better styling
st.markdown("""

    <style>

        .main-title {

            font-size: 36px;

            color: #4A90E2;

            font-weight: bold;

            text-align: center;

        }

        .sub-title {

            font-size: 24px;

            color: #4A90E2;

            margin-top: 20px;

        }

        .section {

            background-color: #f9f9f9;

            padding: 15px;

            border-radius: 10px;

            margin-top: 20px;

        }

        .section h2 {

            font-size: 22px;

            color: #4A90E2;

        }

        .section p, .section ul {

            color: #666666;

        }

        .link {

            color: #4A90E2;

            text-decoration: none;

        }

        .benchmark-table {

            width: 100%;

            border-collapse: collapse;

            margin-top: 20px;

        }

        .benchmark-table th, .benchmark-table td {

            border: 1px solid #ddd;

            padding: 8px;

            text-align: left;

        }

        .benchmark-table th {

            background-color: #4A90E2;

            color: white;

        }

        .benchmark-table td {

            background-color: #f2f2f2;

        }

    </style>

""", unsafe_allow_html=True)

# Title
st.markdown('<div class="main-title">Introduction to XLNet for Token & Sequence Classification in Spark NLP</div>', unsafe_allow_html=True)

# Subtitle
st.markdown("""

<div class="section">

    <p>XLNet is a transformer-based language model trained with a permutation language modeling objective, which lets it capture bidirectional context without relying on masked tokens. This makes it highly effective for NLP tasks such as token classification and sequence classification.</p>

</div>

""", unsafe_allow_html=True)

# Tabs for XLNet Annotators
tab1, tab2 = st.tabs(["XlnetForTokenClassification", "XlnetForSequenceClassification"])

# Tab 1: XlnetForTokenClassification
with tab1:
    st.markdown("""

    <div class="section">

        <h2>XLNet for Token Classification</h2>

        <p><strong>Token Classification</strong> assigns a label to each token (word or subword) in a sentence. It is the basis of tasks such as Named Entity Recognition (NER), where tokens are labeled with entity types like person, organization, or location.</p>

        <p>XLNet, with its robust contextual understanding, is particularly suited for token classification tasks. Its permutation-based training enables the model to capture dependencies across different parts of a sentence, improving accuracy in token-level predictions.</p>

        <p>Using XLNet for token classification enables:</p>

        <ul>

            <li><strong>Accurate NER:</strong> Extract entities from text with high precision.</li>

            <li><strong>Contextual Understanding:</strong> Benefit from XLNet's ability to model bidirectional context for each token.</li>

            <li><strong>Scalability:</strong> Efficiently process large-scale datasets using Spark NLP.</li>

        </ul>

    </div>

    """, unsafe_allow_html=True)

    # Implementation Section
    st.markdown('<div class="sub-title">How to Use XLNet for Token Classification in Spark NLP</div>', unsafe_allow_html=True)
    st.markdown("""

    <div class="section">

        <p>Below is an example of how to set up a pipeline in Spark NLP using the XLNet model for token classification, specifically for Named Entity Recognition (NER).</p>

    </div>

    """, unsafe_allow_html=True)

    st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start Spark NLP (provides the SparkSession used below)
spark = sparknlp.start()

document_assembler = DocumentAssembler() \\
    .setInputCol('text') \\
    .setOutputCol('document')

tokenizer = Tokenizer() \\
    .setInputCols(['document']) \\
    .setOutputCol('token')

tokenClassifier = XlnetForTokenClassification \\
    .pretrained('xlnet_base_token_classifier_conll03', 'en') \\
    .setInputCols(['token', 'document']) \\
    .setOutputCol('ner') \\
    .setCaseSensitive(True) \\
    .setMaxSentenceLength(512)

ner_converter = NerConverter() \\
    .setInputCols(['document', 'token', 'ner']) \\
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier,
    ner_converter
])

example = spark.createDataFrame([['My name is John!']]).toDF("text")
result = pipeline.fit(example).transform(example)
''', language='python')

    # Example Output
    st.text("""
    +---------+---------+
    |entities |label    |
    +---------+---------+
    |John     |PER      |
    +---------+---------+
    """)

    # Model Info Section
    st.markdown('<div class="sub-title">Choosing the Right XLNet Model</div>', unsafe_allow_html=True)
    st.markdown("""

    <div class="section">

        <p>Spark NLP offers various XLNet models tailored for token classification tasks. Selecting the appropriate model can significantly impact performance.</p>

        <p>Explore the available models on the <a class="link" href="https://sparknlp.org/models?annotator=XlnetForTokenClassification" target="_blank">Spark NLP Models Hub</a> to find the one that fits your needs.</p>

    </div>

    """, unsafe_allow_html=True)

# Tab 2: XlnetForSequenceClassification
with tab2:
    st.markdown("""

    <div class="section">

        <h2>XLNet for Sequence Classification</h2>

        <p><strong>Sequence Classification</strong> is the task of assigning a label to an entire sequence of text, such as determining the sentiment of a review or categorizing a document into topics. XLNet's ability to model long-range dependencies makes it particularly effective for sequence classification.</p>

        <p>Using XLNet for sequence classification enables:</p>

        <ul>

            <li><strong>Sentiment Analysis:</strong> Accurately determine the sentiment of text.</li>

            <li><strong>Document Classification:</strong> Categorize documents based on their content.</li>

            <li><strong>Robust Performance:</strong> Benefit from XLNet's permutation-based training for improved classification accuracy.</li>

        </ul>

    </div>

    """, unsafe_allow_html=True)

    # Implementation Section
    st.markdown('<div class="sub-title">How to Use XLNet for Sequence Classification in Spark NLP</div>', unsafe_allow_html=True)
    st.markdown("""

    <div class="section">

        <p>The following example demonstrates how to set up a pipeline in Spark NLP using the XLNet model for sequence classification, particularly for sentiment analysis of movie reviews.</p>

    </div>

    """, unsafe_allow_html=True)

    st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start Spark NLP (provides the SparkSession used below)
spark = sparknlp.start()

document_assembler = DocumentAssembler() \\
    .setInputCol('text') \\
    .setOutputCol('document')

tokenizer = Tokenizer() \\
    .setInputCols(['document']) \\
    .setOutputCol('token')

sequenceClassifier = XlnetForSequenceClassification \\
    .pretrained('xlnet_base_sequence_classifier_imdb', 'en') \\
    .setInputCols(['token', 'document']) \\
    .setOutputCol('class') \\
    .setCaseSensitive(False) \\
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

example = spark.createDataFrame([['I really liked that movie!']]).toDF("text")
result = pipeline.fit(example).transform(example)
''', language='python')

    # Example Output
    st.text("""
    +------------------------+
    |class                   |
    +------------------------+
    |[positive]              |
    +------------------------+
    """)

    # Model Info Section
    st.markdown('<div class="sub-title">Choosing the Right XLNet Model</div>', unsafe_allow_html=True)
    st.markdown("""

    <div class="section">

        <p>Various XLNet models are available for sequence classification in Spark NLP. Each model is fine-tuned for specific tasks, so selecting the right one is crucial for achieving optimal performance.</p>

        <p>Explore the available models on the <a class="link" href="https://sparknlp.org/models?annotator=XlnetForSequenceClassification" target="_blank">Spark NLP Models Hub</a> to find the best fit for your use case.</p>

    </div>

    """, unsafe_allow_html=True)


# Footer
st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)

st.markdown("""

<div class="section">

    <ul>

        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>

        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>

        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>

        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>

        <li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>

    </ul>

</div>

""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">Quick Links</div>', unsafe_allow_html=True)

st.markdown("""

<div class="section">

    <ul>

        <li><a class="link" href="https://sparknlp.org/docs/en/quickstart" target="_blank">Getting Started</a></li>

        <li><a class="link" href="https://nlp.johnsnowlabs.com/models" target="_blank">Pretrained Models</a></li>

        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/annotation/text/english" target="_blank">Example Notebooks</a></li>

        <li><a class="link" href="https://sparknlp.org/docs/en/install" target="_blank">Installation Guide</a></li>

    </ul>

</div>

""", unsafe_allow_html=True)