import streamlit as st

# Page configuration
st.set_page_config(
    layout="wide",
    initial_sidebar_state="auto"
)

# Custom CSS for better styling
st.markdown(""" """, unsafe_allow_html=True)

# Title
st.markdown('Introduction to XLNet for Token & Sequence Classification in Spark NLP', unsafe_allow_html=True)

# Subtitle
st.markdown("""

XLNet is a transformer-based language model for Natural Language Processing (NLP). It is pretrained with a permutation-based objective: the model learns to predict each token from varying subsets of the other tokens in the sequence, so it captures context on both sides of a token. This makes it well suited to tasks like token classification and sequence classification.
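
As a rough illustration of the permutation idea (a toy sketch, not XLNet's actual training code): for one sampled factorization order, each token is predicted from the tokens that come earlier in that order, and those may sit to its left or to its right in the sentence. The helper name `visible_context`, the example sentence, and the chosen order are all made up for this sketch.

```python
def visible_context(tokens, order):
    # `order` is a permutation of token positions (the factorization order).
    # For each position, collect the tokens that appear earlier in `order`:
    # these are the tokens the model may condition on when predicting it.
    context = {}
    seen = []
    for pos in order:
        # Sort only so the context prints in sentence order.
        context[pos] = [tokens[i] for i in sorted(seen)]
        seen.append(pos)
    return context

tokens = ["New", "York", "is", "a", "city"]
order = [2, 4, 0, 3, 1]  # one hypothetical sampled order
ctx = visible_context(tokens, order)

for pos in sorted(ctx):
    print(tokens[pos], "is predicted from", ctx[pos])
```

In this order, "York" is predicted last and sees both "New" on its left and "is", "a", "city" on its right, which is how a single left-to-right style objective can still expose bidirectional context.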

""", unsafe_allow_html=True)

# Tabs for XLNet Annotators
tab1, tab2 = st.tabs(["XlnetForTokenClassification", "XlnetForSequenceClassification"])

# Tab 1: XlnetForTokenClassification
with tab1:
    st.markdown("""

XLNet for Token Classification

Token Classification involves assigning labels to individual tokens (words or subwords) within a sentence. This is crucial for tasks such as Named Entity Recognition (NER), where each token is classified as a specific entity like a person, organization, or location.

XLNet is well suited to token classification. Because its permutation-based training exposes each token to context from both directions, the model can capture dependencies across different parts of a sentence, which improves token-level predictions.

Using XLNet for token classification enables:

- Accurate NER: Extract entities from text with high precision.
- Contextual Understanding: Benefit from XLNet's ability to model bidirectional context for each token.
- Scalability: Efficiently process large-scale datasets using Spark NLP.
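
To make token-level labels concrete, here is a small pure-Python sketch of how per-token tags in the common IOB scheme (`B-PER`, `I-PER`, `O`) can be merged into entity spans. This is conceptually similar to what a converter stage such as Spark NLP's `NerConverter` does, but the function `group_entities` and the sample tokens and tags are illustrative assumptions, not Spark NLP code.

```python
def group_entities(tokens, tags):
    # Merge IOB-tagged tokens into (text, label) entity spans.
    entities = []
    current_tokens, current_label = [], None

    def flush():
        # Emit the entity collected so far, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            # A B- tag (or an I- tag with a different label) opens a new span.
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continue the span that is currently open.
            current_tokens.append(token)
        else:
            # An O tag closes any open span.
            flush()
            current_tokens, current_label = [], None
    flush()
    return entities

tokens = ["John", "Smith", "works", "at", "Acme", "Corp", "in", "Paris"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
entities = group_entities(tokens, tags)
print(entities)
```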

""", unsafe_allow_html=True)

    # Implementation Section
    st.markdown('How to Use XLNet for Token Classification in Spark NLP', unsafe_allow_html=True)

    st.markdown("""

Below is an example of how to set up a pipeline in Spark NLP using the XLNet model for token classification, specifically for Named Entity Recognition (NER).

""", unsafe_allow_html=True)

    st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP available
spark = sparknlp.start()

# Wrap the raw text column in a document annotation
document_assembler = DocumentAssembler() \\
    .setInputCol('text') \\
    .setOutputCol('document')

# Split each document into tokens
tokenizer = Tokenizer() \\
    .setInputCols(['document']) \\
    .setOutputCol('token')

# Predict one NER tag per token with a pretrained XLNet model
tokenClassifier = XlnetForTokenClassification \\
    .pretrained('xlnet_base_token_classifier_conll03', 'en') \\
    .setInputCols(['token', 'document']) \\
    .setOutputCol('ner') \\
    .setCaseSensitive(True) \\
    .setMaxSentenceLength(512)

# Merge token-level tags into entity chunks
ner_converter = NerConverter() \\
    .setInputCols(['document', 'token', 'ner']) \\
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier,
    ner_converter
])

example = spark.createDataFrame([['My name is John!']]).toDF("text")
result = pipeline.fit(example).transform(example)
''', language='python')

    # Example Output
    st.text("""
+---------+---------+
|entities |label    |
+---------+---------+
|John     |PER      |
+---------+---------+
""")

    # Model Info Section
    st.markdown('Choosing the Right XLNet Model', unsafe_allow_html=True)

    st.markdown("""

Spark NLP offers various XLNet models tailored for token classification tasks. Selecting the appropriate model can significantly impact performance.

Explore the available models on the Spark NLP Models Hub to find the one that fits your needs.

""", unsafe_allow_html=True)

# Tab 2: XlnetForSequenceClassification
with tab2:
    st.markdown("""

XLNet for Sequence Classification

Sequence Classification is the task of assigning a label to an entire sequence of text, such as determining the sentiment of a review or categorizing a document into topics. XLNet's ability to model long-range dependencies makes it particularly effective for sequence classification.

Using XLNet for sequence classification enables:

- Sentiment Analysis: Accurately determine the sentiment of text.
- Document Classification: Categorize documents based on their content.
- Robust Performance: Benefit from XLNet's permutation-based training for improved classification accuracy.
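
Conceptually, a sequence classifier produces one score per class for the whole input and reports the highest-scoring label. The sketch below shows only that final step, using a softmax over made-up scores. The labels, the numbers, and the function name `pick_label` are assumptions for illustration and do not come from any Spark NLP model.

```python
import math

def pick_label(scores, labels):
    # Numerically stable softmax: subtract the max before exponentiating.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Report the label with the highest probability.
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Hypothetical raw scores for a two-class sentiment model
label, probs = pick_label([-1.2, 2.3], ["negative", "positive"])
print(label, [round(p, 3) for p in probs])
```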

""", unsafe_allow_html=True)

    # Implementation Section
    st.markdown('How to Use XLNet for Sequence Classification in Spark NLP', unsafe_allow_html=True)

    st.markdown("""

The following example demonstrates how to set up a pipeline in Spark NLP using the XLNet model for sequence classification, particularly for sentiment analysis of movie reviews.

""", unsafe_allow_html=True)

    st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP available
spark = sparknlp.start()

# Wrap the raw text column in a document annotation
document_assembler = DocumentAssembler() \\
    .setInputCol('text') \\
    .setOutputCol('document')

# Split each document into tokens
tokenizer = Tokenizer() \\
    .setInputCols(['document']) \\
    .setOutputCol('token')

# Predict one label for the whole input with a pretrained XLNet model
sequenceClassifier = XlnetForSequenceClassification \\
    .pretrained('xlnet_base_sequence_classifier_imdb', 'en') \\
    .setInputCols(['token', 'document']) \\
    .setOutputCol('class') \\
    .setCaseSensitive(False) \\
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

example = spark.createDataFrame([['I really liked that movie!']]).toDF("text")
result = pipeline.fit(example).transform(example)
''', language='python')

    # Example Output
    st.text("""
+------------------------+
|class                   |
+------------------------+
|[positive]              |
+------------------------+
""")

    # Model Info Section
    st.markdown('Choosing the Right XLNet Model', unsafe_allow_html=True)

    st.markdown("""

Various XLNet models are available for sequence classification in Spark NLP. Each model is fine-tuned for specific tasks, so selecting the right one is crucial for achieving optimal performance.

Explore the available models on the Spark NLP Models Hub to find the best fit for your use case.

""", unsafe_allow_html=True)

# Footer
st.markdown('Community & Support', unsafe_allow_html=True)

st.markdown("""

- Official Website: Documentation and examples
- Slack: Live discussion with the community and team
- GitHub: Bug reports, feature requests, and contributions
- Medium: Spark NLP articles
- YouTube: Video tutorials

""", unsafe_allow_html=True)

st.markdown('Quick Links', unsafe_allow_html=True)

st.markdown("""
""", unsafe_allow_html=True)