import streamlit as st

st.set_page_config(
    layout="wide",
    initial_sidebar_state="auto"
)

st.markdown("""
<style>
.main-title {
    font-size: 36px;
    color: #4A90E2;
    font-weight: bold;
    text-align: center;
}
.sub-title {
    font-size: 24px;
    color: #4A90E2;
    margin-top: 20px;
}
.section {
    background-color: #f9f9f9;
    padding: 15px;
    border-radius: 10px;
    margin-top: 20px;
}
.section h2 {
    font-size: 22px;
    color: #4A90E2;
}
.section p, .section ul {
    color: #666666;
}
.link {
    color: #4A90E2;
    text-decoration: none;
}
.benchmark-table {
    width: 100%;
    border-collapse: collapse;
    margin-top: 20px;
}
.benchmark-table th, .benchmark-table td {
    border: 1px solid #ddd;
    padding: 8px;
    text-align: left;
}
.benchmark-table th {
    background-color: #4A90E2;
    color: white;
}
.benchmark-table td {
    background-color: #f2f2f2;
}
</style>
""", unsafe_allow_html=True)

st.markdown('<div class="main-title">Introduction to XLNet for Token & Sequence Classification in Spark NLP</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
    <p>XLNet is a transformer-based language model that performs strongly across a wide range of Natural Language Processing (NLP) tasks. Its permutation-based training objective lets it capture bidirectional context, making it highly effective for tasks like token classification and sequence classification.</p>
</div>
""", unsafe_allow_html=True)

tab1, tab2 = st.tabs(["XlnetForTokenClassification", "XlnetForSequenceClassification"])

with tab1:
    st.markdown("""
    <div class="section">
        <h2>XLNet for Token Classification</h2>
        <p><strong>Token Classification</strong> involves assigning labels to individual tokens (words or subwords) within a sentence. This is crucial for tasks such as Named Entity Recognition (NER), where each token is classified as a specific entity like a person, organization, or location.</p>
        <p>XLNet, with its robust contextual understanding, is particularly well suited to token classification. Its permutation-based training enables the model to capture dependencies across different parts of a sentence, improving the accuracy of token-level predictions.</p>
        <p>Using XLNet for token classification enables:</p>
        <ul>
            <li><strong>Accurate NER:</strong> Extract entities from text with high precision.</li>
            <li><strong>Contextual Understanding:</strong> Benefit from XLNet's ability to model bidirectional context for each token.</li>
            <li><strong>Scalability:</strong> Efficiently process large-scale datasets using Spark NLP.</li>
        </ul>
    </div>
    """, unsafe_allow_html=True)

    st.markdown('<div class="sub-title">How to Use XLNet for Token Classification in Spark NLP</div>', unsafe_allow_html=True)
    st.markdown("""
    <div class="section">
        <p>Below is an example of how to set up a pipeline in Spark NLP using the XLNet model for token classification, specifically for Named Entity Recognition (NER).</p>
    </div>
    """, unsafe_allow_html=True)

    st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

document_assembler = DocumentAssembler() \\
    .setInputCol('text') \\
    .setOutputCol('document')

tokenizer = Tokenizer() \\
    .setInputCols(['document']) \\
    .setOutputCol('token')

tokenClassifier = XlnetForTokenClassification \\
    .pretrained('xlnet_base_token_classifier_conll03', 'en') \\
    .setInputCols(['token', 'document']) \\
    .setOutputCol('ner') \\
    .setCaseSensitive(True) \\
    .setMaxSentenceLength(512)

# Convert token-level IOB tags into entity chunks
ner_converter = NerConverter() \\
    .setInputCols(['document', 'token', 'ner']) \\
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier,
    ner_converter
])

example = spark.createDataFrame([['My name is John!']]).toDF("text")
result = pipeline.fit(example).transform(example)
''', language='python')

    st.text("""
+---------+---------+
|entities |label    |
+---------+---------+
|John     |PER      |
+---------+---------+
""")

    st.markdown('<div class="sub-title">Choosing the Right XLNet Model</div>', unsafe_allow_html=True)
    st.markdown("""
    <div class="section">
        <p>Spark NLP offers various XLNet models tailored for token classification tasks. Selecting the appropriate model can significantly impact performance.</p>
        <p>Explore the available models on the <a class="link" href="https://sparknlp.org/models?annotator=XlnetForTokenClassification" target="_blank">Spark NLP Models Hub</a> to find the one that fits your needs.</p>
    </div>
    """, unsafe_allow_html=True)

with tab2:
    st.markdown("""
    <div class="section">
        <h2>XLNet for Sequence Classification</h2>
        <p><strong>Sequence Classification</strong> is the task of assigning a label to an entire sequence of text, such as determining the sentiment of a review or categorizing a document into topics. XLNet's ability to model long-range dependencies makes it particularly effective for sequence classification.</p>
        <p>Using XLNet for sequence classification enables:</p>
        <ul>
            <li><strong>Sentiment Analysis:</strong> Accurately determine the sentiment of text.</li>
            <li><strong>Document Classification:</strong> Categorize documents based on their content.</li>
            <li><strong>Robust Performance:</strong> Benefit from XLNet's permutation-based training for improved classification accuracy.</li>
        </ul>
    </div>
    """, unsafe_allow_html=True)

    st.markdown('<div class="sub-title">How to Use XLNet for Sequence Classification in Spark NLP</div>', unsafe_allow_html=True)
    st.markdown("""
    <div class="section">
        <p>The following example demonstrates how to set up a pipeline in Spark NLP using the XLNet model for sequence classification, particularly for sentiment analysis of movie reviews.</p>
    </div>
    """, unsafe_allow_html=True)

    st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

document_assembler = DocumentAssembler() \\
    .setInputCol('text') \\
    .setOutputCol('document')

tokenizer = Tokenizer() \\
    .setInputCols(['document']) \\
    .setOutputCol('token')

sequenceClassifier = XlnetForSequenceClassification \\
    .pretrained('xlnet_base_sequence_classifier_imdb', 'en') \\
    .setInputCols(['token', 'document']) \\
    .setOutputCol('class') \\
    .setCaseSensitive(False) \\
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

example = spark.createDataFrame([['I really liked that movie!']]).toDF("text")
result = pipeline.fit(example).transform(example)
''', language='python')

    st.text("""
+----------+
|class     |
+----------+
|[positive]|
+----------+
""")

    st.markdown('<div class="sub-title">Choosing the Right XLNet Model</div>', unsafe_allow_html=True)
    st.markdown("""
    <div class="section">
        <p>Various XLNet models are available for sequence classification in Spark NLP. Each model is fine-tuned for specific tasks, so selecting the right one is crucial for achieving optimal performance.</p>
        <p>Explore the available models on the <a class="link" href="https://sparknlp.org/models?annotator=XlnetForSequenceClassification" target="_blank">Spark NLP Models Hub</a> to find the best fit for your use case.</p>
    </div>
    """, unsafe_allow_html=True)

st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
    <ul>
        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
        <li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
    </ul>
</div>
""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">Quick Links</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
    <ul>
        <li><a class="link" href="https://sparknlp.org/docs/en/quickstart" target="_blank">Getting Started</a></li>
        <li><a class="link" href="https://nlp.johnsnowlabs.com/models" target="_blank">Pretrained Models</a></li>
        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/annotation/text/english" target="_blank">Example Notebooks</a></li>
        <li><a class="link" href="https://sparknlp.org/docs/en/install" target="_blank">Installation Guide</a></li>
    </ul>
</div>
""", unsafe_allow_html=True)