import streamlit as st

# Custom CSS for better styling
st.markdown("""
""", unsafe_allow_html=True)

# Title
st.markdown('Introduction to ALBERT Annotators in Spark NLP', unsafe_allow_html=True)

# Subtitle
st.markdown("""

ALBERT (A Lite BERT) is a more efficient alternative to BERT built on two parameter-reduction techniques: factorizing the embedding matrix into two smaller matrices and sharing parameters across layers. It maintains high performance while using less memory. The tabs below give an overview of the ALBERT annotators for token classification, sequence classification, and question answering.

""", unsafe_allow_html=True)

tab1, tab2, tab3 = st.tabs(["ALBERT for Token Classification", "ALBERT for Sequence Classification", "ALBERT for Question Answering"])

with tab1:
    st.markdown("""

ALBERT for Token Classification

The AlbertForTokenClassification annotator performs token classification with ALBERT, most commonly for Named Entity Recognition (NER): it assigns a label to each token so that entities in the text can be identified and classified. Thanks to ALBERT's parameter-reduction techniques, it delivers strong performance with a much more lightweight model than BERT.
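
To make the parameter savings concrete, the sketch below compares a single vocabulary-by-hidden embedding matrix with the factorized pair of smaller matrices that ALBERT uses instead. The sizes (30,000-token vocabulary, hidden size 768, embedding size 128) are the ALBERT-base values as reported in the ALBERT paper and should be read as assumptions here; the function names are purely illustrative and are not part of Spark NLP.

```python
# Illustrative arithmetic only; sizes are assumed ALBERT-base values.

def full_embedding_params(vocab_size: int, hidden_size: int) -> int:
    # One V x H matrix, as in a BERT-style embedding layer
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size: int, embedding_size: int, hidden_size: int) -> int:
    # A V x E lookup table followed by an E x H projection
    return vocab_size * embedding_size + embedding_size * hidden_size


VOCAB, HIDDEN, EMBED = 30_000, 768, 128  # assumptions, not read from the model

full = full_embedding_params(VOCAB, HIDDEN)
factorized = factorized_embedding_params(VOCAB, EMBED, HIDDEN)
print(f"full: {full:,}  factorized: {factorized:,}")
```

With these sizes the embedding parameters drop from about 23.0M to about 3.9M.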

Token classification with ALBERT enables:

Here is an example of how ALBERT token classification works:

| Entity | Label |
|---|---|
| Google | ORG |
| Satya Nadella | PER |
| Seattle | LOC |
""", unsafe_allow_html=True)

    # ALBERT Token Classification - NER CoNLL
    st.markdown('ALBERT Token Classification - NER CoNLL', unsafe_allow_html=True)
    st.markdown("""

The albert_base_token_classifier_conll03 model is ALBERT fine-tuned for Named Entity Recognition (NER) on the CoNLL-03 dataset. It recognizes four entity types: location (LOC), organization (ORG), person (PER), and miscellaneous (MISC).

""", unsafe_allow_html=True)

    # How to Use the Model - Token Classification
    st.markdown('How to Use the Model', unsafe_allow_html=True)
    st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, expr

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Turn the raw text column into Spark NLP documents
document_assembler = DocumentAssembler() \\
    .setInputCol('text') \\
    .setOutputCol('document')

tokenizer = Tokenizer() \\
    .setInputCols(['document']) \\
    .setOutputCol('token')

# Pretrained ALBERT token classifier (one label per token)
tokenClassifier = AlbertForTokenClassification \\
    .pretrained('albert_base_token_classifier_conll03', 'en') \\
    .setInputCols(['token', 'document']) \\
    .setOutputCol('ner') \\
    .setCaseSensitive(True) \\
    .setMaxSentenceLength(512)

# Convert NER labels to entities
ner_converter = NerConverter() \\
    .setInputCols(['document', 'token', 'ner']) \\
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier,
    ner_converter
])

example = spark.createDataFrame([["My name is John!"]]).toDF("text")
result = pipeline.fit(example).transform(example)

# One row per detected entity chunk
result.select(
    expr("explode(entities) as ner_chunk")
).select(
    col("ner_chunk.result").alias("chunk"),
    col("ner_chunk.metadata.entity").alias("ner_label")
).show(truncate=False)
''', language='python')

    # Results
    st.text("""
+-----+---------+
|chunk|ner_label|
+-----+---------+
|John |PER      |
+-----+---------+
""")

    # Performance Metrics
    st.markdown('Performance Metrics', unsafe_allow_html=True)
    st.markdown("""
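
The per-label rows reported in this section use IOB tags: B- marks the first token of an entity, I- marks a continuation token, and O marks tokens outside any entity. The following is a minimal pure-Python sketch of how such token-level tags can be grouped into entity chunks like the ones in the Results output. It is an illustration only, not the NerConverter implementation; `group_iob` and the sample tags are hypothetical.

```python
from typing import List, Optional, Tuple


def group_iob(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    # Hypothetical helper: merge consecutive B-/I- tokens into (text, label) chunks
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts here; flush the previous one if any
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continuation of the current entity
            current_tokens.append(token)
        else:
            # "O": outside any entity, so close the open chunk
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Sample tags are made up for illustration, not model output
tokens = ["Satya", "Nadella", "visited", "Google", "in", "Seattle"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]
chunks = group_iob(tokens, tags)
print(chunks)
```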

Here are the detailed performance metrics for the ALBERT token classification model:

| Entity | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| B-LOC | 0.95 | 0.97 | 0.96 | 1837 |
| B-MISC | 0.87 | 0.86 | 0.87 | 922 |
| B-ORG | 0.90 | 0.91 | 0.90 | 1341 |
| B-PER | 0.91 | 0.97 | 0.94 | 1842 |
| I-LOC | 0.88 | 0.86 | 0.87 | 257 |
| I-MISC | 0.78 | 0.76 | 0.77 | 346 |
| I-ORG | 0.84 | 0.85 | 0.85 | 751 |
| I-PER | 0.97 | 0.92 | 0.94 | 1307 |
| O | 0.99 | 0.99 | 0.99 | 42759 |
| average | 0.92 | 0.92 | 0.92 | 52000 |
""", unsafe_allow_html=True)

    # Model Info Section
    st.markdown('Model Info', unsafe_allow_html=True)
    st.markdown("""
""", unsafe_allow_html=True)

    # References Section
    st.markdown('References', unsafe_allow_html=True)
    st.markdown("""
""", unsafe_allow_html=True)

with tab2:
    st.markdown("""

ALBERT for Sequence Classification

The AlbertForSequenceClassification annotator handles sequence-level tasks such as sentiment analysis and multi-class text classification with ALBERT. It assigns a single label to an entire input text and delivers strong performance with fewer parameters than BERT.
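
Conceptually, a sequence classifier produces one score per class for the whole input and reports the class with the highest score. The sketch below illustrates that final step in plain Python, using the four AG News categories covered later in this tab. The label order, the example scores, and the helper names are assumptions made for illustration and do not reflect Spark NLP internals.

```python
import math
from typing import List, Tuple

# Label order is an assumption; the real model defines its own mapping
LABELS = ["Business", "Sci/Tech", "Sports", "World"]


def softmax(scores: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability before exponentiating
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(scores: List[float]) -> Tuple[str, float]:
    # Return the most probable label and its probability
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Made-up scores for a single input text
label, prob = predict_label([3.1, 0.2, -1.0, 0.4])
print(label, round(prob, 3))
```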

Sequence classification with ALBERT enables:

Here is an example of how ALBERT sequence classification works:

| Text | Label |
|---|---|
| Disney Comics was a comic book publishing company operated by The Walt Disney Company which ran from 1990 to 1993. | Business |
""", unsafe_allow_html=True)

    # ALBERT Sequence Classification - AG News
    st.markdown('ALBERT Sequence Classification - AG News', unsafe_allow_html=True)
    st.markdown("""

The albert_base_sequence_classifier_ag_news model is ALBERT fine-tuned for text classification on the AG News dataset. It recognizes four categories: Business, Sci/Tech, Sports, and World.

""", unsafe_allow_html=True)

    # How to Use the Model - Sequence Classification
    st.markdown('How to Use the Model', unsafe_allow_html=True)
    st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, expr

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Turn the raw text column into Spark NLP documents
document_assembler = DocumentAssembler() \\
    .setInputCol('text') \\
    .setOutputCol('document')

tokenizer = Tokenizer() \\
    .setInputCols(['document']) \\
    .setOutputCol('token')

# Pretrained ALBERT sequence classifier (one label per document)
sequenceClassifier = AlbertForSequenceClassification \\
    .pretrained('albert_base_sequence_classifier_ag_news', 'en') \\
    .setInputCols(['token', 'document']) \\
    .setOutputCol('class') \\
    .setCaseSensitive(False) \\
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

example = spark.createDataFrame([["Disney Comics was a comic book publishing company operated by The Walt Disney Company which ran from 1990 to 1993."]]).toDF("text")
result = pipeline.fit(example).transform(example)

# One row per predicted category
result.select(
    expr("explode(class) as classification_result")
).select(
    col("classification_result.result").alias("category")
).show(truncate=False)
''', language='python')

    # Results
    st.text("""
+---------+
|category |
+---------+
|Business |
+---------+
""")

    # Performance Metrics
    st.markdown('Performance Metrics', unsafe_allow_html=True)
    st.markdown("""
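
F1 is the harmonic mean of precision and recall, so when the two are equal the F1 score equals them as well, which is the pattern in the scores reported for this model. A minimal sketch of the formula follows. `f1_score` is an illustrative helper, and because the reported figures are presumably averaged over the four classes, the identity should be treated as a sanity check rather than an exact reconstruction of how they were computed.

```python
def f1_score(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; defined as 0 when both are 0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Equal precision and recall give the same value back
equal_case = f1_score(0.9472, 0.9472)

# Unequal values are pulled toward the smaller of the two
unequal_case = f1_score(0.5, 1.0)

print(round(equal_case, 4), round(unequal_case, 4))
```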

Here are the detailed performance metrics for the ALBERT sequence classification model on the AG News dataset:

| Metric | Score |
|---|---|
| Accuracy | 0.9472 |
| F1-Score | 0.9472 |
| Precision | 0.9472 |
| Recall | 0.9472 |
| Evaluation Loss | 0.1882 |
""", unsafe_allow_html=True)

    # Model Info Section
    st.markdown('Model Info', unsafe_allow_html=True)
    st.markdown("""
""", unsafe_allow_html=True)

    # References Section
    st.markdown('References', unsafe_allow_html=True)
    st.markdown("""
""", unsafe_allow_html=True)

with tab3:
    st.markdown("""

ALBERT for Question Answering

The AlbertForQuestionAnswering annotator performs extractive Question Answering (QA) with ALBERT: given a question and a context passage, it returns the span of the context that answers the question. This makes it well suited to QA systems and information retrieval applications.
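
Extractive QA models typically score every context token as a possible answer start and as a possible answer end, then return the highest-scoring valid span. The sketch below shows that selection step in plain Python. The scores are made up, and `best_span` and its `max_answer_len` parameter are hypothetical names used for illustration; this is not how Spark NLP exposes or implements the step.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_answer_len: int = 15) -> Tuple[int, int]:
    # Pick (start, end) with start <= end that maximizes start + end score
    best = (0, 0)
    best_score = float("-inf")
    for i, start in enumerate(start_scores):
        last = min(i + max_answer_len, len(end_scores))
        for j in range(i, last):
            score = start + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best


context_tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Made-up per-token scores for the question "What is my name?"
start_scores = [0.1, 0.0, 0.2, 4.0, 0.0, 0.1, 0.0, 0.0, 1.5, 0.0]
end_scores = [0.0, 0.1, 0.0, 3.5, 0.1, 0.0, 0.0, 0.2, 2.0, 0.1]

start, end = best_span(start_scores, end_scores)
answer = " ".join(context_tokens[start:end + 1])
print(answer)
```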

Question Answering with ALBERT enables:

Here is an example of how ALBERT question answering works:

| Question | Context | Answer |
|---|---|---|
| What is my name? | My name is Clara and I live in Berkeley. | Clara |
""", unsafe_allow_html=True)

    # ALBERT Question Answering - SQuAD2
    st.markdown('ALBERT Question Answering - SQuAD2', unsafe_allow_html=True)
    st.markdown("""

The albert_base_qa_squad2 model is ALBERT fine-tuned for question answering on the SQuAD2 dataset. It answers questions based on the provided context with high accuracy.

""", unsafe_allow_html=True)

    # How to Use the Model - Question Answering
    st.markdown('How to Use the Model', unsafe_allow_html=True)
    st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, expr

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Assemble the question and context columns into separate documents
documentAssembler = MultiDocumentAssembler() \\
    .setInputCols(["question", "context"]) \\
    .setOutputCols(["document_question", "document_context"])

# Pretrained ALBERT span classifier for question answering
spanClassifier = AlbertForQuestionAnswering.pretrained("albert_base_qa_squad2","en") \\
    .setInputCols(["document_question", "document_context"]) \\
    .setOutputCol("answer") \\
    .setCaseSensitive(False)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)

result.select(
    col("answer.result").alias("predicted_answer")
).show(truncate=False)
''', language='python')

    # Results
    st.text("""
+----------------+
|predicted_answer|
+----------------+
|[clara]         |
+----------------+
""")

    # Performance Metrics
    st.markdown('Performance Metrics', unsafe_allow_html=True)
    st.markdown("""
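
The overall Exact Match and F1 scores reported in this section can be cross-checked against the HasAns and NoAns subsets: each overall score should be close to the average of the two subset scores weighted by subset size. The sketch below performs that check with the reported figures. `weighted_mean` is an illustrative helper, and small differences are expected because the published values are rounded.

```python
from typing import Sequence


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    # Average of values where each value counts in proportion to its weight
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Subset sizes and scores as reported for this model
has_ans_total, no_ans_total = 2910, 3168

overall_em = weighted_mean([75.40, 81.76], [has_ans_total, no_ans_total])
overall_f1 = weighted_mean([82.04, 81.76], [has_ans_total, no_ans_total])

print(f"Exact Match ~ {overall_em:.2f}%  F1 ~ {overall_f1:.2f}%")
```

With the reported figures this gives roughly 78.71% and 81.89%, in line with the overall Exact Match and F1 rows, and the two subset sizes add up to the reported total of 6078.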

The performance metrics of the ALBERT question answering model on a development subset of the SQuAD2 dataset are:

| Metric | Score |
|---|---|
| Exact Match | 78.71% |
| F1 Score | 81.89% |
| Total | 6078 |
| HasAns Exact Match | 75.40% |
| HasAns F1 Score | 82.04% |
| HasAns Total | 2910 |
| NoAns Exact Match | 81.76% |
| NoAns F1 Score | 81.76% |
| NoAns Total | 3168 |
| Best Exact Match | 78.73% |
| Best F1 Score | 81.91% |
""", unsafe_allow_html=True)

    # Model Info Section
    st.markdown('Model Info', unsafe_allow_html=True)
    st.markdown("""
""", unsafe_allow_html=True)

    # References Section
    st.markdown('References', unsafe_allow_html=True)
    st.markdown("""
""", unsafe_allow_html=True)

# Community & Support
st.markdown('Community & Support', unsafe_allow_html=True)
st.markdown("""
""", unsafe_allow_html=True)