import streamlit as st
# Page configuration
st.set_page_config(
layout="wide",
initial_sidebar_state="auto"
)
# Custom CSS for better styling
st.markdown("""
<style>
.main-title {
font-size: 36px;
color: #4A90E2;
font-weight: bold;
text-align: center;
}
.sub-title {
font-size: 24px;
color: #4A90E2;
margin-top: 20px;
}
.section {
background-color: #f9f9f9;
padding: 15px;
border-radius: 10px;
margin-top: 20px;
}
.section h2 {
font-size: 22px;
color: #4A90E2;
}
.section p, .section ul {
color: #666666;
}
.link {
color: #4A90E2;
text-decoration: none;
}
.benchmark-table {
width: 100%;
border-collapse: collapse;
margin-top: 20px;
}
.benchmark-table th, .benchmark-table td {
border: 1px solid #ddd;
padding: 8px;
text-align: left;
}
.benchmark-table th {
background-color: #4A90E2;
color: white;
}
.benchmark-table td {
background-color: #f2f2f2;
}
</style>
""", unsafe_allow_html=True)
# Title
st.markdown('<div class="main-title">Introduction to XLNet for Token & Sequence Classification in Spark NLP</div>', unsafe_allow_html=True)
# Subtitle
st.markdown("""
<div class="section">
<p>XLNet is a transformer-based language model pretrained with a permutation-based (generalized autoregressive) objective. By modeling all permutations of the factorization order, it captures bidirectional context without masking the input, which makes it highly effective for Natural Language Processing (NLP) tasks such as token classification and sequence classification.</p>
</div>
""", unsafe_allow_html=True)
# Tabs for XLNet Annotators
tab1, tab2 = st.tabs(["XlnetForTokenClassification", "XlnetForSequenceClassification"])
# Tab 1: XlnetForTokenClassification
with tab1:
st.markdown("""
<div class="section">
<h2>XLNet for Token Classification</h2>
<p><strong>Token Classification</strong> involves assigning labels to individual tokens (words or subwords) within a sentence. This is crucial for tasks such as Named Entity Recognition (NER), where each token is classified as a specific entity like a person, organization, or location.</p>
<p>XLNet, with its robust contextual understanding, is particularly suited for token classification tasks. Its permutation-based training enables the model to capture dependencies across different parts of a sentence, improving accuracy in token-level predictions.</p>
<p>Using XLNet for token classification enables:</p>
<ul>
<li><strong>Accurate NER:</strong> Extract entities from text with high precision.</li>
<li><strong>Contextual Understanding:</strong> Benefit from XLNet's ability to model bidirectional context for each token.</li>
<li><strong>Scalability:</strong> Efficiently process large-scale datasets using Spark NLP.</li>
</ul>
</div>
""", unsafe_allow_html=True)
# Implementation Section
st.markdown('<div class="sub-title">How to Use XLNet for Token Classification in Spark NLP</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
<p>Below is an example of how to set up a pipeline in Spark NLP using the XLNet model for token classification, specifically for Named Entity Recognition (NER).</p>
</div>
""", unsafe_allow_html=True)
st.code('''
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
document_assembler = DocumentAssembler() \\
.setInputCol('text') \\
.setOutputCol('document')
tokenizer = Tokenizer() \\
.setInputCols(['document']) \\
.setOutputCol('token')
tokenClassifier = XlnetForTokenClassification \\
.pretrained('xlnet_base_token_classifier_conll03', 'en') \\
.setInputCols(['token', 'document']) \\
.setOutputCol('ner') \\
.setCaseSensitive(True) \\
.setMaxSentenceLength(512)
ner_converter = NerConverter() \\
.setInputCols(['document', 'token', 'ner']) \\
.setOutputCol('entities')
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
tokenClassifier,
ner_converter
])
example = spark.createDataFrame([['My name is John!']]).toDF("text")
result = pipeline.fit(example).transform(example)
''', language='python')
# Example Output
st.text("""
+--------+-----+
|entities|label|
+--------+-----+
|John    |PER  |
+--------+-----+
""")
# Model Info Section
st.markdown('<div class="sub-title">Choosing the Right XLNet Model</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
<p>Spark NLP offers various XLNet models tailored for token classification tasks. Selecting the appropriate model can significantly impact performance.</p>
<p>Explore the available models on the <a class="link" href="https://sparknlp.org/models?annotator=XlnetForTokenClassification" target="_blank">Spark NLP Models Hub</a> to find the one that fits your needs.</p>
</div>
""", unsafe_allow_html=True)
# Tab 2: XlnetForSequenceClassification
with tab2:
st.markdown("""
<div class="section">
<h2>XLNet for Sequence Classification</h2>
<p><strong>Sequence Classification</strong> is the task of assigning a label to an entire sequence of text, such as determining the sentiment of a review or categorizing a document into topics. XLNet's ability to model long-range dependencies makes it particularly effective for sequence classification.</p>
<p>Using XLNet for sequence classification enables:</p>
<ul>
<li><strong>Sentiment Analysis:</strong> Accurately determine the sentiment of text.</li>
<li><strong>Document Classification:</strong> Categorize documents based on their content.</li>
<li><strong>Robust Performance:</strong> Benefit from XLNet's permutation-based training for improved classification accuracy.</li>
</ul>
</div>
""", unsafe_allow_html=True)
# Implementation Section
st.markdown('<div class="sub-title">How to Use XLNet for Sequence Classification in Spark NLP</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
<p>The following example demonstrates how to set up a pipeline in Spark NLP using the XLNet model for sequence classification, particularly for sentiment analysis of movie reviews.</p>
</div>
""", unsafe_allow_html=True)
st.code('''
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
document_assembler = DocumentAssembler() \\
.setInputCol('text') \\
.setOutputCol('document')
tokenizer = Tokenizer() \\
.setInputCols(['document']) \\
.setOutputCol('token')
sequenceClassifier = XlnetForSequenceClassification \\
.pretrained('xlnet_base_sequence_classifier_imdb', 'en') \\
.setInputCols(['token', 'document']) \\
.setOutputCol('class') \\
.setCaseSensitive(False) \\
.setMaxSentenceLength(512)
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
example = spark.createDataFrame([['I really liked that movie!']]).toDF("text")
result = pipeline.fit(example).transform(example)
''', language='python')
# Example Output
st.text("""
+----------+
|class     |
+----------+
|[positive]|
+----------+
""")
# Model Info Section
st.markdown('<div class="sub-title">Choosing the Right XLNet Model</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
<p>Various XLNet models are available for sequence classification in Spark NLP. Each model is fine-tuned for specific tasks, so selecting the right one is crucial for achieving optimal performance.</p>
<p>Explore the available models on the <a class="link" href="https://sparknlp.org/models?annotator=XlnetForSequenceClassification" target="_blank">Spark NLP Models Hub</a> to find the best fit for your use case.</p>
</div>
""", unsafe_allow_html=True)
# Footer
st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
<ul>
<li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
<li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
<li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
<li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
<li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
</ul>
</div>
""", unsafe_allow_html=True)
st.markdown('<div class="sub-title">Quick Links</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
<ul>
<li><a class="link" href="https://sparknlp.org/docs/en/quickstart" target="_blank">Getting Started</a></li>
<li><a class="link" href="https://nlp.johnsnowlabs.com/models" target="_blank">Pretrained Models</a></li>
<li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/annotation/text/english" target="_blank">Example Notebooks</a></li>
<li><a class="link" href="https://sparknlp.org/docs/en/install" target="_blank">Installation Guide</a></li>
</ul>
</div>
""", unsafe_allow_html=True)