Upload 5 files
Browse files- .streamlit/config.toml +3 -0
- Demo.py +124 -0
- Dockerfile +72 -0
- pages/Workflow & Model Overview.py +174 -0
- requirements.txt +7 -0
.streamlit/config.toml
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
[theme]
|
2 |
+
base="light"
|
3 |
+
primaryColor="#29B4E8"
|
Demo.py
ADDED
@@ -0,0 +1,124 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
import streamlit as st
import sparknlp

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Page configuration
st.set_page_config(
    layout="wide",
    initial_sidebar_state="auto"
)

# Page-level CSS: header styling plus the light "section" card used below.
_PAGE_CSS = """
    <style>
        .main-title {
            font-size: 36px;
            color: #4A90E2;
            font-weight: bold;
            text-align: center;
        }
        .section {
            background-color: #f9f9f9;
            padding: 10px;
            border-radius: 10px;
            margin-top: 10px;
        }
        .section p, .section ul {
            color: #666666;
        }
    </style>
"""
st.markdown(_PAGE_CSS, unsafe_allow_html=True)
@st.cache_resource
def init_spark():
    """Start (or attach to) the Spark NLP session.

    Cached with st.cache_resource so Streamlit reruns reuse one session.
    """
    session = sparknlp.start()
    return session
@st.cache_resource
def create_pipeline():
    """Build the extractive question-answering pipeline (built once, cached).

    Stages:
      1. MultiDocumentAssembler - wraps the raw question/context strings into
         the document columns Spark NLP annotators consume.
      2. XlmRoBertaForQuestionAnswering - pretrained SQuAD2 span extractor
         writing its prediction to the "answer" column.
    """
    assembler = MultiDocumentAssembler() \
        .setInputCols(["question", "context"]) \
        .setOutputCols(["document_question", "document_context"])

    span_classifier = XlmRoBertaForQuestionAnswering \
        .pretrained("xlm_roberta_base_qa_squad2", "en") \
        .setInputCols(["document_question", "document_context"]) \
        .setOutputCol("answer")

    return Pipeline(stages=[assembler, span_classifier])
def fit_data(pipeline, ques='', cont=''):
    """Run the QA pipeline on a single (question, context) pair.

    NOTE: relies on the module-level `spark` session being created (at the
    bottom of this script) before the first call.
    Returns the collected rows of the "answer.result" column.
    """
    frame = spark.createDataFrame([[ques, cont]]).toDF("question", "context")
    transformed = pipeline.fit(frame).transform(frame)
    return transformed.select('answer.result').collect()
# Model/task metadata shown in the sidebar and page header.
# Fixed: the description previously said "based on RoBERTa" and used plural
# "They excel" although it describes the single XLM-RoBERTa model served here.
tasks_models_descriptions = {
    "Question Answering": {
        "models": ["xlm_roberta_base_qa_squad2"],
        "description": "The 'xlm_roberta_base_qa_squad2' model, based on XLM-RoBERTa, is designed for precise question answering. It excels in extracting answers from a given context, making it suitable for developing advanced QA systems, enhancing customer support, and retrieving specific information from text."
    }
}

# Sidebar content
task = 'Question Answering'
model = st.sidebar.selectbox("Choose the pretrained model", tasks_models_descriptions[task]["models"], help="For more info about the models visit: https://sparknlp.org/models")

# Reference notebook link in sidebar
link = """
<a href="https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/357691d18373d6e8f13b5b1015137a398fd0a45f/Spark_NLP_Udemy_MOOC/Open_Source/17.01.Transformers-based_Embeddings.ipynb#L103">
<img src="https://colab.research.google.com/assets/colab-badge.svg" style="zoom: 1.3" alt="Open In Colab"/>
</a>
"""
st.sidebar.markdown('Reference notebook:')
st.sidebar.markdown(link, unsafe_allow_html=True)

# Page content.
# Fixed: the title previously said "DeBERTa" although this demo serves an
# XLM-RoBERTa question-answering model.
title, sub_title = (f'XLM-RoBERTa for {task}', tasks_models_descriptions[task]["description"])
st.markdown(f'<div class="main-title">{title}</div>', unsafe_allow_html=True)
container = st.container(border=True)
container.write(sub_title)

# Load examples: question -> context pairs the user can pick from.
examples_mapping = {
    "Question Answering": {
        """What does increased oxygen concentrations in the patient’s lungs displace?""": """Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O 2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O 2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible is part of the treatment.""",
        """What category of game is Legend of Zelda: Twilight Princess?""": """The Legend of Zelda: Twilight Princess (Japanese: ゼルダの伝説 トワイライトプリンセス, Hepburn: Zeruda no Densetsu: Towairaito Purinsesu?) is an action-adventure game developed and published by Nintendo for the GameCube and Wii home video game consoles. It is the thirteenth installment in the The Legend of Zelda series. Originally planned for release on the GameCube in November 2005, Twilight Princess was delayed by Nintendo to allow its developers to refine the game, add more content, and port it to the Wii. The Wii version was released alongside the console in North America in November 2006, and in Japan, Europe, and Australia the following month. The GameCube version was released worldwide in December 2006.""",
        """Who is founder of Alibaba Group?""": """Alibaba Group founder Jack Ma has made his first appearance since Chinese regulators cracked down on his business empire. His absence had fuelled speculation over his whereabouts amid increasing official scrutiny of his businesses. The billionaire met 100 rural teachers in China via a video meeting on Wednesday, according to local government media. Alibaba shares surged 5% on Hong Kong's stock exchange on the news.""",
        """For what instrument did Frédéric write primarily for?""": """Frédéric François Chopin (/ˈʃoʊpæn/; French pronunciation: [fʁe.de.ʁik fʁɑ̃.swa ʃɔ.pɛ̃]; 22 February or 1 March 1810 – 17 October 1849), born Fryderyk Franciszek Chopin,[n 1] was a Polish and French (by citizenship and birth of father) composer and a virtuoso pianist of the Romantic era, who wrote primarily for the solo piano. He gained and has maintained renown worldwide as one of the leading musicians of his era, whose "poetic genius was based on a professional technique that was without equal in his generation." Chopin was born in what was then the Duchy of Warsaw, and grew up in Warsaw, which after 1815 became part of Congress Poland. A child prodigy, he completed his musical education and composed his earlier works in Warsaw before leaving Poland at the age of 20, less than a month before the outbreak of the November 1830 Uprising.""",
        """The most populated city in the United States is which city?""": """New York—often called New York City or the City of New York to distinguish it from the State of New York, of which it is a part—is the most populous city in the United States and the center of the New York metropolitan area, the premier gateway for legal immigration to the United States and one of the most populous urban agglomerations in the world. A global power city, New York exerts a significant impact upon commerce, finance, media, art, fashion, research, technology, education, and entertainment, its fast pace defining the term New York minute. Home to the headquarters of the United Nations, New York is an important center for international diplomacy and has been described as the cultural and financial capital of the world."""
    }
}

examples = list(examples_mapping[task].keys())
selected_text = st.selectbox('Select an Example:', examples)
st.subheader('Try it yourself!')
custom_input_question = st.text_input('Create a question')
# Fixed typo in the label: "it's" -> "its" (possessive).
custom_input_context = st.text_input("Create its context")

custom_examples = {}

st.subheader('Selected Text')

# A fully specified custom pair takes precedence over the chosen example;
# the selectbox always returns a value, so QUESTION/CONTEXT are always set.
if custom_input_question and custom_input_context:
    QUESTION = custom_input_question
    CONTEXT = custom_input_context
elif selected_text:
    QUESTION = selected_text
    CONTEXT = examples_mapping[task][selected_text]

st.markdown(f"**Question:** {QUESTION}")
st.markdown(f"**Context:** {CONTEXT}")

# Initialize Spark and create pipeline (both cached), then run inference.
spark = init_spark()
pipeline = create_pipeline()
output = fit_data(pipeline, QUESTION, CONTEXT)

# Display the predicted answer span.
st.subheader("Prediction:")

output_text = "".join(output[0][0])
st.markdown(f"Answer: **{output_text}**")
Dockerfile
ADDED
@@ -0,0 +1,72 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# Download base image ubuntu 18.04
# NOTE(review): ubuntu:18.04 is past standard support — consider a newer LTS
# once the Python/Java stack is validated against it.
FROM ubuntu:18.04

# Set environment variables
ENV NB_USER jovyan
ENV NB_UID 1000
ENV HOME /home/${NB_USER}
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/

# Install required packages.
# Fixed: remove the apt package lists in the same layer so they are not
# baked into the image (they were previously left behind, bloating it).
RUN apt-get update && apt-get install -y \
    tar \
    wget \
    bash \
    rsync \
    gcc \
    libfreetype6-dev \
    libhdf5-serial-dev \
    libpng-dev \
    libzmq3-dev \
    python3 \
    python3-dev \
    python3-pip \
    unzip \
    pkg-config \
    software-properties-common \
    graphviz \
    openjdk-8-jdk \
    ant \
    ca-certificates-java \
    && apt-get clean \
    && update-ca-certificates -f \
    && rm -rf /var/lib/apt/lists/*

# Install Python 3.8 and pip (same cleanup applied in this layer too)
RUN add-apt-repository ppa:deadsnakes/ppa \
    && apt-get update \
    && apt-get install -y python3.8 python3-pip \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Set up JAVA_HOME for login shells
RUN echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/" >> /etc/profile \
    && echo "export PATH=\$JAVA_HOME/bin:\$PATH" >> /etc/profile
# Create a new user named "jovyan" with user ID 1000 (non-root runtime user)
RUN useradd -m -u ${NB_UID} ${NB_USER}

# Switch to the "jovyan" user
USER ${NB_USER}

# Set home and path variables for the user (~/.local/bin for user pip installs)
ENV HOME=/home/${NB_USER} \
    PATH=/home/${NB_USER}/.local/bin:$PATH

# Set up PySpark to use Python 3.8 for both driver and workers
ENV PYSPARK_PYTHON=/usr/bin/python3.8
ENV PYSPARK_DRIVER_PYTHON=/usr/bin/python3.8

# Set the working directory to the user's home directory
WORKDIR ${HOME}

# Upgrade pip and install Python dependencies (runs as jovyan -> user site)
RUN python3.8 -m pip install --upgrade pip
COPY requirements.txt /tmp/requirements.txt
RUN python3.8 -m pip install -r /tmp/requirements.txt

# Copy the application code into the container at /home/jovyan
COPY --chown=${NB_USER}:${NB_USER} . ${HOME}

# Expose port for Streamlit
EXPOSE 7860

# Define the entry point for the container
ENTRYPOINT ["streamlit", "run", "Demo.py", "--server.port=7860", "--server.address=0.0.0.0"]
pages/Workflow & Model Overview.py
ADDED
@@ -0,0 +1,174 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
import streamlit as st

# Page configuration
st.set_page_config(
    layout="wide",
    initial_sidebar_state="auto"
)

# Custom CSS shared by every section of this overview page.
_CUSTOM_CSS = """
    <style>
        .main-title {
            font-size: 36px;
            color: #4A90E2;
            font-weight: bold;
            text-align: center;
        }
        .sub-title {
            font-size: 24px;
            color: #4A90E2;
            margin-top: 20px;
        }
        .section {
            background-color: #f9f9f9;
            padding: 15px;
            border-radius: 10px;
            margin-top: 20px;
        }
        .section h2 {
            font-size: 22px;
            color: #4A90E2;
        }
        .section p, .section ul {
            color: #666666;
        }
        .link {
            color: #4A90E2;
            text-decoration: none;
        }
        .benchmark-table {
            width: 100%;
            border-collapse: collapse;
            margin-top: 20px;
        }
        .benchmark-table th, .benchmark-table td {
            border: 1px solid #ddd;
            padding: 8px;
            text-align: left;
        }
        .benchmark-table th {
            background-color: #4A90E2;
            color: white;
        }
        .benchmark-table td {
            background-color: #f2f2f2;
        }
    </style>
"""
st.markdown(_CUSTOM_CSS, unsafe_allow_html=True)
# Title
st.markdown('<div class="main-title">Introduction to XLM-RoBERTa Annotators in Spark NLP</div>', unsafe_allow_html=True)

# Subtitle
st.markdown("""
<div class="section">
    <p>XLM-RoBERTa (Cross-lingual Robustly Optimized BERT Approach) is an advanced multilingual model that extends the capabilities of RoBERTa to over 100 languages. Pre-trained on a massive, diverse corpus, XLM-RoBERTa is designed to handle various NLP tasks in a multilingual context, making it ideal for applications that require cross-lingual understanding. Below, we provide an overview of the XLM-RoBERTa annotators for these tasks:</p>
</div>
""", unsafe_allow_html=True)

# XLM-RoBERTa for Question Answering
st.markdown("""<div class="sub-title">Question Answering with XLM-RoBERTa</div>""", unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>Question answering (QA) is a crucial task in Natural Language Processing (NLP) where the goal is to extract an answer from a given context in response to a specific question.</p>
    <p><strong>XLM-RoBERTa</strong> excels in question answering tasks across multiple languages, making it an invaluable tool for global applications. Below is an example of how to implement question answering using XLM-RoBERTa in Spark NLP.</p>
    <p>Using XLM-RoBERTa for Question Answering enables:</p>
    <ul>
        <li><strong>Multilingual QA:</strong> Extract answers from text in various languages with a single model.</li>
        <li><strong>Accurate Contextual Understanding:</strong> Leverage XLM-RoBERTa's deep understanding of context to provide precise answers.</li>
        <li><strong>Cross-Domain Flexibility:</strong> Apply to different domains, from customer support to education, across languages.</li>
    </ul>
    <p>Advantages of using XLM-RoBERTa for Question Answering in Spark NLP include:</p>
    <ul>
        <li><strong>Scalability:</strong> Spark NLP is built on Apache Spark, ensuring efficient scaling for large datasets.</li>
        <li><strong>Pretrained Excellence:</strong> Utilize state-of-the-art pretrained models to achieve high accuracy in question answering tasks.</li>
        <li><strong>Multilingual Flexibility:</strong> XLM-RoBERTa’s multilingual capabilities make it suitable for global applications, reducing the need for language-specific models.</li>
        <li><strong>Seamless Integration:</strong> Easily incorporate XLM-RoBERTa into your existing Spark pipelines for streamlined NLP workflows.</li>
    </ul>
</div>
""", unsafe_allow_html=True)

st.markdown("""<div class="sub-title">How to Use XLM-RoBERTa for Question Answering in Spark NLP</div>""", unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>To leverage XLM-RoBERTa for question answering, Spark NLP provides a user-friendly pipeline setup. The following example shows how to use XLM-RoBERTa for extracting answers from a given context based on a specific question. XLM-RoBERTa’s multilingual training enables it to perform question answering across various languages, making it an essential tool for global NLP tasks.</p>
</div>
""", unsafe_allow_html=True)

# Code Example.
# Fixed for consistency: the example now loads "xlm_roberta_base_qa_squad2",
# the same pretrained model the Demo page of this app actually serves
# (it previously showed an unrelated model name).
st.code('''
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

document_assembler = MultiDocumentAssembler() \\
    .setInputCols(["question", "context"]) \\
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_base_qa_squad2","en") \\
    .setInputCols(["document_question", "document_context"]) \\
    .setOutputCol("answer") \\
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([document_assembler, spanClassifier])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
result.select("answer.result").show(truncate=False)
''', language='python')

st.text("""
+-----------+
| result    |
+-----------+
|[Clara]    |
+-----------+
""")

# Model Info Section
st.markdown('<div class="sub-title">Choosing the Right Model</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>The XLM-RoBERTa model used here is pretrained and fine-tuned for question answering tasks, providing high accuracy and multilingual support.</p>
    <p>For more information about the model, visit the <a class="link" href="https://huggingface.co/xlm-roberta-base" target="_blank">XLM-RoBERTa Model Hub</a>.</p>
</div>
""", unsafe_allow_html=True)

# References Section
st.markdown('<div class="sub-title">References</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <ul>
        <li><a class="link" href="https://arxiv.org/abs/1911.02116" target="_blank">XLM-R: Cross-lingual Pre-training</a></li>
        <li><a class="link" href="https://huggingface.co/xlm-roberta-base" target="_blank">XLM-RoBERTa Model Overview</a></li>
    </ul>
</div>
""", unsafe_allow_html=True)

# Footer
st.markdown("""
<div class="section">
    <ul>
        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
        <li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
    </ul>
</div>
""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">Quick Links</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
    <ul>
        <li><a class="link" href="https://sparknlp.org/docs/en/quickstart" target="_blank">Getting Started</a></li>
        <li><a class="link" href="https://nlp.johnsnowlabs.com/models" target="_blank">Pretrained Models</a></li>
        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/annotation/text/english" target="_blank">Example Notebooks</a></li>
        <li><a class="link" href="https://sparknlp.org/docs/en/install" target="_blank">Installation Guide</a></li>
    </ul>
</div>
""", unsafe_allow_html=True)
requirements.txt
ADDED
@@ -0,0 +1,7 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
streamlit
|
2 |
+
st-annotated-text
|
3 |
+
streamlit-tags
|
4 |
+
pandas
|
5 |
+
numpy
|
6 |
+
spark-nlp
|
7 |
+
pyspark
|