import streamlit as st
# Title
st.markdown('# Introduction to ALBERT Annotators in Spark NLP', unsafe_allow_html=True)
# Subtitle
st.markdown("""
ALBERT (A Lite BERT) is a more parameter-efficient alternative to BERT. It applies two parameter-reduction techniques: factorizing the embedding matrix into two smaller matrices, and sharing parameters across the repeated transformer layers. These changes reduce memory use while keeping accuracy competitive with BERT. The tabs below give an overview of the ALBERT annotators for token classification, sequence classification, and question answering.
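
To make the effect of embedding factorization concrete, the sketch below counts the weights in the token-embedding block with and without factorization. The sizes are the ALBERT-base values reported in the ALBERT paper (vocabulary 30,000, embedding size 128, hidden size 768); the function names are illustrative, bias terms are ignored, and the unfactorized case assumes the same vocabulary purely for comparison.

```python
# Sizes as reported for ALBERT-base in the ALBERT paper.
VOCAB_SIZE = 30_000
EMBEDDING_SIZE = 128
HIDDEN_SIZE = 768

def unfactorized_embedding_params(vocab_size, hidden_size):
    # BERT-style: a single V x H matrix maps tokens straight to the hidden size.
    return vocab_size * hidden_size

def factorized_embedding_params(vocab_size, embedding_size, hidden_size):
    # ALBERT-style: a V x E lookup table followed by an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size

bert_style = unfactorized_embedding_params(VOCAB_SIZE, HIDDEN_SIZE)
albert_style = factorized_embedding_params(VOCAB_SIZE, EMBEDDING_SIZE, HIDDEN_SIZE)

print(bert_style)    # 23040000
print(albert_style)  # 3938304
```

Under these assumptions the factorized embedding block needs roughly one sixth of the weights; cross-layer parameter sharing reduces the transformer layers separately.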
""", unsafe_allow_html=True)
tab1, tab2, tab3 = st.tabs(["ALBERT for Token Classification", "ALBERT for Sequence Classification", "ALBERT for Question Answering"])
with tab1:
    st.markdown("""
## ALBERT for Token Classification

The AlbertForTokenClassification annotator performs Named Entity Recognition (NER) with ALBERT: it assigns a label to every token so that entities in the text can be identified and classified. Because of ALBERT's parameter-reduction techniques, the model is lighter than BERT while remaining competitive in accuracy.

Token classification with ALBERT enables:

- Named Entity Recognition (NER): Identifying and classifying entities such as names, organizations, locations, and other predefined categories.
- Information Extraction: Extracting key information from unstructured text for further analysis.
- Text Categorization: Enhancing document retrieval and categorization based on entity recognition.

Here is an example of the entities and labels that ALBERT token classification produces:

| Entity | Label |
| --- | --- |
| Google | ORG |
| Satya Nadella | PER |
| Seattle | LOC |
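
The entity/label pairs above are chunk-level results. Token classifiers of this kind usually emit one IOB-style tag per token (the B- and I- prefixed labels that appear in the performance metrics), which are then merged into chunks. The sketch below shows one simple way to do that merge in plain Python. The function name `merge_iob_tags` and the example sentence are hypothetical, and this is not the implementation used by Spark NLP's NerConverter.

```python
def merge_iob_tags(tokens, tags):
    # Group consecutive B-/I- tags of the same type into (chunk, label) pairs.
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Store the chunk collected so far, if there is one.
        if current_tokens:
            chunks.append((' '.join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition('-')
        if prefix == 'B' or (prefix == 'I' and label != current_label):
            # A B- tag (or an I- tag of a different type) starts a new chunk.
            flush()
            current_tokens, current_label = [token], label
        elif prefix == 'I':
            # Continuation of the current chunk.
            current_tokens.append(token)
        else:
            # An O tag closes any open chunk.
            flush()
            current_tokens, current_label = [], None
    flush()
    return chunks

# Hypothetical sentence and tags, chosen to reproduce the table above.
tokens = ['Satya', 'Nadella', 'visited', 'Google', 'in', 'Seattle']
tags = ['B-PER', 'I-PER', 'O', 'B-ORG', 'O', 'B-LOC']
entities = merge_iob_tags(tokens, tags)
print(entities)
```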
""", unsafe_allow_html=True)
    # ALBERT Token Classification - NER CoNLL
    st.markdown('### ALBERT Token Classification - NER CoNLL', unsafe_allow_html=True)
    st.markdown("""
The albert_base_token_classifier_conll03 model is ALBERT fine-tuned for Named Entity Recognition (NER) on the CoNLL-03 dataset. It recognizes four entity types: locations (LOC), organizations (ORG), persons (PER), and miscellaneous (MISC).
""", unsafe_allow_html=True)
    # How to Use the Model - Token Classification
    st.markdown('### How to Use the Model', unsafe_allow_html=True)
    st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, expr

# Start a Spark session with Spark NLP loaded
spark = sparknlp.start()

document_assembler = DocumentAssembler() \\
.setInputCol('text') \\
.setOutputCol('document')
tokenizer = Tokenizer() \\
.setInputCols(['document']) \\
.setOutputCol('token')
tokenClassifier = AlbertForTokenClassification \\
.pretrained('albert_base_token_classifier_conll03', 'en') \\
.setInputCols(['token', 'document']) \\
.setOutputCol('ner') \\
.setCaseSensitive(True) \\
.setMaxSentenceLength(512)
# Convert NER labels to entities
ner_converter = NerConverter() \\
.setInputCols(['document', 'token', 'ner']) \\
.setOutputCol('entities')
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
tokenClassifier,
ner_converter
])
example = spark.createDataFrame([["My name is John!"]]).toDF("text")
result = pipeline.fit(example).transform(example)
result.select(
expr("explode(entities) as ner_chunk")
).select(
col("ner_chunk.result").alias("chunk"),
col("ner_chunk.metadata.entity").alias("ner_label")
).show(truncate=False)
''', language='python')
    # Results
    st.text("""
+-----+---------+
|chunk|ner_label|
+-----+---------+
|John |PER      |
+-----+---------+
""")
    # Performance Metrics
    st.markdown('### Performance Metrics', unsafe_allow_html=True)
    st.markdown("""
Here are the detailed performance metrics for the ALBERT token classification model:

| Entity | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| B-LOC | 0.95 | 0.97 | 0.96 | 1837 |
| B-MISC | 0.87 | 0.86 | 0.87 | 922 |
| B-ORG | 0.90 | 0.91 | 0.90 | 1341 |
| B-PER | 0.91 | 0.97 | 0.94 | 1842 |
| I-LOC | 0.88 | 0.86 | 0.87 | 257 |
| I-MISC | 0.78 | 0.76 | 0.77 | 346 |
| I-ORG | 0.84 | 0.85 | 0.85 | 751 |
| I-PER | 0.97 | 0.92 | 0.94 | 1307 |
| O | 0.99 | 0.99 | 0.99 | 42759 |
| average | 0.92 | 0.92 | 0.92 | 52000 |
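
As a reading aid for the columns above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it for the B-LOC row. Because the table reports values rounded to two decimals, recomputing other rows this way may differ from the listed F1 by 0.01. The helper name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# B-LOC row: precision 0.95, recall 0.97
b_loc_f1 = f1_score(0.95, 0.97)
print(round(b_loc_f1, 2))  # 0.96
```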
""", unsafe_allow_html=True)
    # Model Info Section
    st.markdown('### Model Info', unsafe_allow_html=True)
    st.markdown("""
- Model Name: ALBERT for Token Classification
- Pretrained Model: albert_base_token_classifier_conll03
- Training Dataset: CoNLL-03
- Languages Supported: English
- Use Cases:
  - Named Entity Recognition (NER)
  - Information Extraction
  - Text Categorization
- Performance: High accuracy with a focus on memory efficiency
- Implementation: Spark NLP
- Resource Requirements: Moderate computational resources; suitable for production environments with optimization
""", unsafe_allow_html=True)
    # References Section
    st.markdown('### References', unsafe_allow_html=True)
    st.markdown("""
- Lan, Z., Chen, J., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv preprint arXiv:1909.11942.
- Google Research's ALBERT GitHub Repository
- Spark NLP Model - albert_base_token_classifier_conll03
- CoNLL-03 Named Entity Recognition Dataset
""", unsafe_allow_html=True)
with tab2:
    st.markdown("""
## ALBERT for Sequence Classification

The AlbertForSequenceClassification annotator handles tasks such as sentiment analysis and multi-class text classification with ALBERT. It assigns a single label to an entire input text and uses fewer parameters than a comparable BERT model.

Sequence classification with ALBERT enables:

- Sentiment Analysis: Determining the sentiment expressed in text, such as positive, negative, or neutral.
- Multi-Class Text Classification: Categorizing text into predefined classes, such as news categories or topics.
- Document Classification: Enhancing search and categorization of documents based on content classification.

Here is an example of how ALBERT sequence classification works:

| Text | Label |
| --- | --- |
| Disney Comics was a comic book publishing company operated by The Walt Disney Company which ran from 1990 to 1993. | Business |
""", unsafe_allow_html=True)
    # ALBERT Sequence Classification - AG News
    st.markdown('### ALBERT Sequence Classification - AG News', unsafe_allow_html=True)
    st.markdown("""
The albert_base_sequence_classifier_ag_news model is ALBERT fine-tuned for text classification on the AG News dataset. It recognizes four categories: Business, Sci/Tech, Sports, and World.
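
A sequence classifier of this kind typically produces one score (logit) per category and reports the category with the highest softmax probability. The sketch below illustrates that final step in plain Python. The logit values and the label order are made up for illustration and are not taken from the model.

```python
import math

# The four AG News categories; the index order here is an assumption.
LABELS = ['Business', 'Sci/Tech', 'Sports', 'World']

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, labels=LABELS):
    # Return the most probable label together with its probability.
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

hypothetical_logits = [3.1, 0.4, -1.2, 0.9]
label, confidence = predict_label(hypothetical_logits)
print(label)  # Business
```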
""", unsafe_allow_html=True)
    # How to Use the Model - Sequence Classification
    st.markdown('### How to Use the Model', unsafe_allow_html=True)
    st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, expr

# Start a Spark session with Spark NLP loaded
spark = sparknlp.start()

document_assembler = DocumentAssembler() \\
.setInputCol('text') \\
.setOutputCol('document')
tokenizer = Tokenizer() \\
.setInputCols(['document']) \\
.setOutputCol('token')
sequenceClassifier = AlbertForSequenceClassification \\
.pretrained('albert_base_sequence_classifier_ag_news', 'en') \\
.setInputCols(['token', 'document']) \\
.setOutputCol('class') \\
.setCaseSensitive(False) \\
.setMaxSentenceLength(512)
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
example = spark.createDataFrame([["Disney Comics was a comic book publishing company operated by The Walt Disney Company which ran from 1990 to 1993."]]).toDF("text")
result = pipeline.fit(example).transform(example)
result.select(
expr("explode(class) as classification_result")
).select(
col("classification_result.result").alias("category")
).show(truncate=False)
''', language='python')
    # Results
    st.text("""
+--------+
|category|
+--------+
|Business|
+--------+
""")
    # Performance Metrics
    st.markdown('### Performance Metrics', unsafe_allow_html=True)
    st.markdown("""
Here are the detailed performance metrics for the ALBERT sequence classification model on the AG News dataset:

| Metric | Score |
| --- | --- |
| Accuracy | 0.9472 |
| F1-Score | 0.9472 |
| Precision | 0.9472 |
| Recall | 0.9472 |
| Evaluation Loss | 0.1882 |
""", unsafe_allow_html=True)
    # Model Info Section
    st.markdown('### Model Info', unsafe_allow_html=True)
    st.markdown("""
- Model Name: ALBERT for Sequence Classification
- Pretrained Model: albert_base_sequence_classifier_ag_news
- Training Dataset: AG News
- Languages Supported: English
- Use Cases:
  - Sentiment Analysis
  - Multi-Class Text Classification
  - Document Classification
- Performance: High accuracy with a focus on memory efficiency
- Implementation: Spark NLP
- Resource Requirements: Moderate computational resources; suitable for production environments with optimization
""", unsafe_allow_html=True)
    # References Section
    st.markdown('### References', unsafe_allow_html=True)
    st.markdown("""
- Lan, Z., Chen, J., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv preprint arXiv:1909.11942.
- Google Research's ALBERT GitHub Repository
- Spark NLP Model - albert_base_sequence_classifier_ag_news
- AG News Dataset
""", unsafe_allow_html=True)
with tab3:
    st.markdown("""
## ALBERT for Question Answering

The AlbertForQuestionAnswering annotator performs extractive Question Answering (QA) with ALBERT. It takes a question and a context passage and returns the span of the context that answers the question, which makes it a good fit for QA systems and information retrieval applications.

Question Answering with ALBERT enables:

- Information Retrieval: Extracting precise answers from large text corpora based on user queries.
- Knowledge Management: Enhancing customer support and information systems by providing accurate answers.
- Contextual Understanding: Leveraging ALBERT’s capabilities to understand the context of questions and provide relevant answers.

Here is an example of how ALBERT question answering works:

| Question | Context | Answer |
| --- | --- | --- |
| What is my name? | My name is Clara and I live in Berkeley. | Clara |
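
Extractive question answering of this kind is commonly implemented by scoring every context token as a possible answer start and answer end, then returning the highest-scoring valid span. The sketch below shows only that selection step. The tokenization, the scores, and the `max_answer_len` parameter are hypothetical, chosen so that the example reproduces the answer in the table.

```python
def best_span(start_scores, end_scores, max_answer_len=10):
    # Return the (start, end) pair that maximizes
    # start_scores[start] + end_scores[end],
    # subject to start <= end < start + max_answer_len.
    best, best_score = (0, 0), float('-inf')
    for start, start_score in enumerate(start_scores):
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best

# Hypothetical tokens and scores for the context sentence.
context_tokens = ['My', 'name', 'is', 'Clara', 'and', 'I', 'live', 'in', 'Berkeley', '.']
start_scores = [0.1, 0.2, 0.1, 4.0, 0.0, 0.3, 0.1, 0.2, 1.5, 0.0]
end_scores = [0.0, 0.1, 0.2, 3.8, 0.1, 0.2, 0.1, 0.1, 1.9, 0.0]

start, end = best_span(start_scores, end_scores)
answer = ' '.join(context_tokens[start:end + 1])
print(answer)  # Clara
```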
""", unsafe_allow_html=True)
    # ALBERT Question Answering - SQuAD2
    st.markdown('### ALBERT Question Answering - SQuAD2', unsafe_allow_html=True)
    st.markdown("""
The albert_base_qa_squad2 model is ALBERT fine-tuned for Question Answering on the SQuAD2 dataset. It extracts the answer to a question from the provided context.
""", unsafe_allow_html=True)
    # How to Use the Model - Question Answering
    st.markdown('### How to Use the Model', unsafe_allow_html=True)
    st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, expr

# Start a Spark session with Spark NLP loaded
spark = sparknlp.start()

documentAssembler = MultiDocumentAssembler() \\
.setInputCols(["question", "context"]) \\
.setOutputCols(["document_question", "document_context"])
spanClassifier = AlbertForQuestionAnswering.pretrained("albert_base_qa_squad2","en") \\
.setInputCols(["document_question", "document_context"]) \\
.setOutputCol("answer") \\
.setCaseSensitive(False)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
result.select(
col("answer.result").alias("predicted_answer")
).show(truncate=False)
''', language='python')
    # Results
    st.text("""
+----------------+
|predicted_answer|
+----------------+
|[clara]         |
+----------------+
""")
    # Performance Metrics
    st.markdown('### Performance Metrics', unsafe_allow_html=True)
    st.markdown("""
The performance metrics of the ALBERT question answering model on a development subset of the SQuAD2 dataset are:

| Metric | Score |
| --- | --- |
| Exact Match | 78.71% |
| F1 Score | 81.89% |
| Total | 6078 |
| HasAns Exact Match | 75.40% |
| HasAns F1 Score | 82.04% |
| HasAns Total | 2910 |
| NoAns Exact Match | 81.76% |
| NoAns F1 Score | 81.76% |
| NoAns Total | 3168 |
| Best Exact Match | 78.73% |
| Best F1 Score | 81.91% |
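
Exact Match and F1 in SQuAD-style evaluation compare a normalized predicted answer with a normalized reference answer. The sketch below is a simplified version of that scoring: it only lowercases and splits on whitespace, whereas the official evaluation script also strips punctuation and articles. It ends with a check that the overall Exact Match in the table is the count-weighted average of the HasAns and NoAns rows. All function names are illustrative.

```python
from collections import Counter

def normalize(text):
    # Simplified normalization: lowercase and split on whitespace.
    return text.lower().split()

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Two empty answers count as a match (the no-answer case).
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match('clara', 'Clara'))     # 1.0
print(token_f1('Clara and I', 'Clara'))  # 0.5

# Overall Exact Match as a weighted average of the two subsets in the table.
has_ans_em, has_ans_total = 75.40, 2910
no_ans_em, no_ans_total = 81.76, 3168
overall_em = (has_ans_em * has_ans_total + no_ans_em * no_ans_total) / (has_ans_total + no_ans_total)
print(round(overall_em, 2))  # 78.71
```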
""", unsafe_allow_html=True)
    # Model Info Section
    st.markdown('### Model Info', unsafe_allow_html=True)
    st.markdown("""
- Model Name: ALBERT for Question Answering
- Pretrained Model: albert_base_qa_squad2
- Training Dataset: SQuAD2
- Languages Supported: English
- Use Cases:
  - Information Retrieval
  - Knowledge Management
  - Contextual Understanding
- Performance: High accuracy with optimized resource usage
- Implementation: Spark NLP
- Resource Requirements: Moderate computational resources; suitable for production environments
""", unsafe_allow_html=True)
    # References Section
    st.markdown('### References', unsafe_allow_html=True)
    st.markdown("""
- Lan, Z., Chen, J., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv preprint arXiv:1909.11942.
- Spark NLP Model - albert_base_qa_squad2
""", unsafe_allow_html=True)
# Community & Support
st.markdown('## Community & Support', unsafe_allow_html=True)
st.markdown("""
- Official Website: Documentation and examples
- Slack: Live discussion with the community and team
- GitHub: Bug reports, feature requests, and contributions
""", unsafe_allow_html=True)