|
import streamlit as st
|
|
|
|
|
|
st.set_page_config(
|
|
layout="wide",
|
|
initial_sidebar_state="auto"
|
|
)
|
|
|
|
|
|
st.markdown("""
|
|
<style>
|
|
.main-title {
|
|
font-size: 36px;
|
|
color: #4A90E2;
|
|
font-weight: bold;
|
|
text-align: center;
|
|
}
|
|
.sub-title {
|
|
font-size: 24px;
|
|
color: #4A90E2;
|
|
margin-top: 20px;
|
|
}
|
|
.section {
|
|
background-color: #f9f9f9;
|
|
padding: 15px;
|
|
border-radius: 10px;
|
|
margin-top: 20px;
|
|
}
|
|
.section h2 {
|
|
font-size: 22px;
|
|
color: #4A90E2;
|
|
}
|
|
.section p, .section ul {
|
|
color: #666666;
|
|
}
|
|
.link {
|
|
color: #4A90E2;
|
|
text-decoration: none;
|
|
}
|
|
.benchmark-table {
|
|
width: 100%;
|
|
border-collapse: collapse;
|
|
margin-top: 20px;
|
|
}
|
|
.benchmark-table th, .benchmark-table td {
|
|
border: 1px solid #ddd;
|
|
padding: 8px;
|
|
text-align: left;
|
|
}
|
|
.benchmark-table th {
|
|
background-color: #4A90E2;
|
|
color: white;
|
|
}
|
|
.benchmark-table td {
|
|
background-color: #f2f2f2;
|
|
}
|
|
</style>
|
|
""", unsafe_allow_html=True)
|
|
|
|
|
|
st.markdown('<div class="main-title">Introduction to XLM-RoBERTa Annotators in Spark NLP</div>', unsafe_allow_html=True)
|
|
|
|
|
|
st.markdown("""
|
|
<div class="section">
|
|
<p>XLM-RoBERTa (Cross-lingual Robustly Optimized BERT Approach) is an advanced multilingual model that extends the capabilities of RoBERTa to over 100 languages. Pre-trained on a massive, diverse corpus, XLM-RoBERTa is designed to handle various NLP tasks in a multilingual context, making it ideal for applications that require cross-lingual understanding. Below, we provide an overview of the XLM-RoBERTa annotators for these tasks:</p>
|
|
</div>
|
|
""", unsafe_allow_html=True)
|
|
|
|
st.markdown("""<div class="sub-title">Zero-Shot Classification with XLM-RoBERTa</div>""", unsafe_allow_html=True)
|
|
st.markdown("""
|
|
<div class="section">
|
|
<p>Zero-shot classification is a powerful technique that allows models to classify text into categories that the model has never seen before during training. This is particularly useful in scenarios where labeled training data is scarce or when new categories emerge frequently.</p>
|
|
<p><strong>XLM-RoBERTa</strong> is a multilingual model, making it highly effective for zero-shot classification tasks across various languages. It leverages large-scale cross-lingual pretraining to understand and classify text in multiple languages without requiring language-specific annotated data.</p>
|
|
<p>Using XLM-RoBERTa for Zero-Shot Classification enables:</p>
|
|
<ul>
|
|
<li><strong>Multilingual Understanding:</strong> Classify text across multiple languages without needing language-specific training data.</li>
|
|
<li><strong>Dynamic Classification:</strong> Adapt to new or emerging categories without retraining the model.</li>
|
|
<li><strong>Resource Efficiency:</strong> Bypass the need for extensive labeled datasets for each language or category.</li>
|
|
</ul>
|
|
<p>Advantages of using XLM-RoBERTa for Zero-Shot Classification in Spark NLP include:</p>
|
|
<ul>
|
|
<li><strong>Scalability:</strong> Built on Apache Spark, the solution scales efficiently for processing large datasets.</li>
|
|
<li><strong>Flexibility:</strong> Easily adapt and integrate with existing Spark pipelines.</li>
|
|
<li><strong>Cross-Lingual Transfer:</strong> Benefit from XLM-RoBERTa’s cross-lingual transfer capabilities to classify text in various languages without additional fine-tuning.</li>
|
|
<li><strong>Pretrained Models:</strong> Leverage state-of-the-art pretrained models available in Spark NLP, reducing the need for custom training.</li>
|
|
</ul>
|
|
</div>
|
|
""", unsafe_allow_html=True)
|
|
|
|
st.markdown("""<div class="sub-title">How to Use XLM-RoBERTa for Token Classification in Spark NLP</div>""", unsafe_allow_html=True)
|
|
st.markdown("""
|
|
<div class="section">
|
|
<p>To leverage XLM-RoBERTa for zero-shot classification, Spark NLP offers a seamless pipeline configuration. The following example demonstrates how to utilize XLM-RoBERTa for zero-shot text classification, enabling the classification of text into categories that the model has never encountered during training. Thanks to its multilingual training, XLM-RoBERTa can perform zero-shot classification across various languages, making it a versatile tool for global NLP applications.</p>
|
|
</div>""", unsafe_allow_html=True)
|
|
st.code('''
|
|
from sparknlp.base import *
|
|
from sparknlp.annotator import *
|
|
from pyspark.ml import Pipeline
|
|
|
|
document_assembler = DocumentAssembler() \\
|
|
.setInputCol('text') \\
|
|
.setOutputCol('document')
|
|
|
|
tokenizer = Tokenizer() \\
|
|
.setInputCols(['document']) \\
|
|
.setOutputCol('token')
|
|
|
|
zeroShotClassifier = XlmRoBertaForZeroShotClassification \\
|
|
.pretrained('xlm_roberta_large_zero_shot_classifier_xnli_anli', 'xx') \\
|
|
.setInputCols(['token', 'document']) \\
|
|
.setOutputCol('class') \\
|
|
.setCaseSensitive(False) \\
|
|
.setMaxSentenceLength(512) \\
|
|
.setCandidateLabels(["urgent", "mobile", "travel", "movie", "music", "sport", "weather", "technology"])
|
|
|
|
pipeline = Pipeline(stages=[
|
|
document_assembler,
|
|
tokenizer,
|
|
zeroShotClassifier
|
|
])
|
|
|
|
example = spark.createDataFrame([['I have a problem with my iphone that needs to be resolved asap!!']]).toDF("text")
|
|
result = pipeline.fit(example).transform(example)
|
|
result.select("class.result").show(truncate=False)
|
|
''', language='python')
|
|
|
|
st.text("""
|
|
+------------------+
|
|
|result |
|
|
+------------------+
|
|
|["urgent"] |
|
|
+------------------+
|
|
""")
|
|
|
|
|
|
st.markdown("""
|
|
<div class="section">
|
|
<p>This pipeline processes the input text and classifies it into one of the candidate labels provided. In the example given, the text is classified into categories such as "urgent", "mobile", "travel", etc.</p>
|
|
</div>
|
|
""", unsafe_allow_html=True)
|
|
|
|
|
|
st.markdown('<div class="sub-title">Choosing the Right Model</div>', unsafe_allow_html=True)
|
|
st.markdown("""
|
|
<div class="section">
|
|
<p>The XLM-RoBERTa model used here is pretrained on large multilingual datasets and fine-tuned for zero-shot classification tasks. It is available in Spark NLP, providing robust performance across different languages without needing task-specific annotated data.</p>
|
|
<p>For more information about the model, visit the <a class="link" href="https://huggingface.co/xlm-roberta-large-zero-shot-classifier-xnli-anli" target="_blank">XLM-RoBERTa Model Hub</a>.</p>
|
|
</div>
|
|
""", unsafe_allow_html=True)
|
|
|
|
|
|
st.markdown('<div class="sub-title">References</div>', unsafe_allow_html=True)
|
|
st.markdown("""
|
|
<div class="section">
|
|
<ul>
|
|
<li><a class="link" href="https://arxiv.org/abs/1911.02116" target="_blank">XLM-R: Cross-lingual Pre-training</a></li>
|
|
<li><a class="link" href="https://arxiv.org/abs/2008.03415" target="_blank">Zero-Shot Learning with XLM-RoBERTa</a></li>
|
|
<li><a class="link" href="https://huggingface.co/xlm-roberta-large-zero-shot-classifier-xnli-anli" target="_blank">XLM-RoBERTa Zero-Shot Classifier</a></li>
|
|
</ul>
|
|
</div>
|
|
""", unsafe_allow_html=True)
|
|
|
|
|
|
|
|
st.markdown("""
|
|
<div class="section">
|
|
<ul>
|
|
<li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
|
|
<li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
|
|
<li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
|
|
<li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
|
|
<li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
|
|
</ul>
|
|
</div>
|
|
""", unsafe_allow_html=True)
|
|
|
|
st.markdown('<div class="sub-title">Quick Links</div>', unsafe_allow_html=True)
|
|
|
|
st.markdown("""
|
|
<div class="section">
|
|
<ul>
|
|
<li><a class="link" href="https://sparknlp.org/docs/en/quickstart" target="_blank">Getting Started</a></li>
|
|
<li><a class="link" href="https://nlp.johnsnowlabs.com/models" target="_blank">Pretrained Models</a></li>
|
|
<li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/annotation/text/english" target="_blank">Example Notebooks</a></li>
|
|
<li><a class="link" href="https://sparknlp.org/docs/en/install" target="_blank">Installation Guide</a></li>
|
|
</ul>
|
|
</div>
|
|
""", unsafe_allow_html=True)
|
|
|