import streamlit as st

# Custom CSS for better styling
st.markdown("""
""", unsafe_allow_html=True)

# Main Title
st.markdown('Extract Aspects and Entities from Airline Questions (ATIS dataset)', unsafe_allow_html=True)

# Description
st.markdown("""

Named Entity Recognition (NER) is an NLP task that identifies and classifies key entities in text. For airline questions, NER extracts details such as flight numbers, dates, and locations, which can then be used to automate responses and improve user interaction.

This app extracts entities from airline-related questions using a model trained on the ATIS (Airline Travel Information Systems) dataset, which contains diverse queries about flight schedules, fares, and other travel information.

""", unsafe_allow_html=True)

# What is NER
st.markdown('What is Named Entity Recognition (NER)?', unsafe_allow_html=True)

st.markdown("""

Named Entity Recognition (NER) is a process in Natural Language Processing (NLP) that locates and classifies named entities into predefined categories such as person names, organizations, locations, dates, etc. For instance, in the sentence "Flight DL 108 departs from New York on August 1st", NER helps identify 'DL 108' as a flight number, 'New York' as a location, and 'August 1st' as a date.
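
To make the example sentence concrete, here is a minimal sketch of the kind of output an NER step produces: each entity as a chunk, a label, and character offsets. It is a toy illustration only. The spans are found by plain substring search rather than by a trained model, and the label names and the `locate` helper are hypothetical.

```python
# Toy illustration only: spans are located by substring search,
# not by a trained NER model.
sentence = "Flight DL 108 departs from New York on August 1st"

# Hypothetical label names that follow the wording of the example sentence.
expected = [
    ("DL 108", "flight_number"),
    ("New York", "location"),
    ("August 1st", "date"),
]


def locate(text, entities):
    # Return (chunk, label, start, end) tuples with character offsets.
    spans = []
    for chunk, label in entities:
        start = text.find(chunk)
        if start == -1:
            raise ValueError(f"{chunk!r} not found in text")
        spans.append((chunk, label, start, start + len(chunk)))
    return spans


spans = locate(sentence, expected)
for chunk, label, start, end in spans:
    print(f"{chunk:<12}{label:<15}[{start}:{end}]")
```

A real NER model predicts these spans from context instead of looking them up, but its output has the same shape.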

NER models learn from context which words form an entity and what type it is, so automated systems can recognize and categorize entities accurately. This capability is essential for building systems that process and respond to user queries efficiently.

""", unsafe_allow_html=True)

# Why We Use NER
st.markdown('Why Use NER for Airline Data?', unsafe_allow_html=True)

st.markdown("""

In the airline industry, answering a customer query usually means pulling specific details, such as flight numbers, dates, and city names, out of unstructured text. NER automates that extraction.

""", unsafe_allow_html=True)

# Model Details
st.markdown('About the Model', unsafe_allow_html=True)

st.markdown("""

The nerdl_atis_840b_300d model used in this app is a pre-trained NER model for airline-related entities. It is part of the Spark NLP library, was trained on the ATIS dataset, and runs on top of the glove_840B_300 word embeddings.

It recognizes entities such as flight numbers, airport codes, and dates, which makes it suited to processing airline-related queries.

""", unsafe_allow_html=True)

st.write("")

# Predicted Entities
with st.expander("Predicted Entities 80+"):
    st.markdown(""" """, unsafe_allow_html=True)

# How to use
st.markdown('How to Use the Model', unsafe_allow_html=True)

st.markdown("""

To use this model in Python, build and run a Spark NLP pipeline as follows:

""", unsafe_allow_html=True)

st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, expr

# Start a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Define the components of the pipeline
document_assembler = DocumentAssembler() \\
    .setInputCol("text") \\
    .setOutputCol("document")

tokenizer = Tokenizer() \\
    .setInputCols(["document"]) \\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx") \\
    .setInputCols(["document", "token"]) \\
    .setOutputCol("embeddings")

ner_model = NerDLModel.pretrained("nerdl_atis_840b_300d", "en") \\
    .setInputCols(["document", "token", "embeddings"]) \\
    .setOutputCol("ner")

ner_converter = NerConverter() \\
    .setInputCols(["document", "token", "ner"]) \\
    .setOutputCol("ner_chunk")

# Create the pipeline
pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

# Create some example data
sample_text = """
On August 20, 2024, Delta Airlines flight DL 456, operated with a B737 aircraft, will depart from Hartsfield-Jackson Atlanta International Airport (ATL) located in Atlanta, Georgia, United States. The flight will begin at 10:00 AM local time and is scheduled to conclude its departure process by 10:30 AM, reflecting the morning period of the day. This non-stop flight, classified under business class and costing $850, is set for travel on Monday, Wednesday, and Friday, with the fare basis code J. The flight is categorized as a direct route without any stops, and passengers will enjoy a vegetarian meal (meal code VGML) on board. Upon arrival, the flight will land at Los Angeles International Airport (LAX) in Los Angeles, California, United States at 02:00 PM local time. The arrival process is expected to end by 02:30 PM, placing this in the afternoon period of the day. The flight, part of a round-trip itinerary, will return on August 27, 2024.
The return flight will depart from LAX at 03:00 PM and is scheduled to conclude by 03:30 PM, also reflecting the afternoon period. Both departure and arrival times are stated in the local time zones of their respective locations. The round-trip journey involves non-refundable tickets and features a direct flight with no connecting flights. The entire travel itinerary spans a total of 7 days from departure to return, with all dates relative to today’s date. Passengers should be aware of the restriction code attached to the fare, indicating its non-refundable nature. Additionally, the flight details include flight number, aircraft code, airline code, airline name, class type, fare amount, fare basis code, flight days, and flight stop information.
"""

data = spark.createDataFrame([[sample_text]]).toDF("text")

# Apply the pipeline to the data
model = pipeline.fit(data)
result = model.transform(data)

# Select each extracted chunk and its entity label
result.select(
    expr("explode(ner_chunk) as ner_chunk")
).select(
    col("ner_chunk.result").alias("chunk"),
    col("ner_chunk.metadata").getItem("entity").alias("ner_label")
).show(truncate=False)
''')

# Results
st.text("""
+--------------+-------------------------+
|chunk         |ner_label                |
+--------------+-------------------------+
|20            |depart_time.time         |
|2024          |flight_number            |
|Delta Airlines|airline_name             |
|456           |flight_number            |
|B737          |aircraft_code            |
|Georgia       |airline_name             |
|10:00 AM      |depart_time.time         |
|by            |depart_time.time_relative|
|10:30 AM      |depart_time.time         |
|morning       |depart_time.period_of_day|
|non-stop      |flight_stop              |
|under         |cost_relative            |
|business class|class_type               |
|Monday        |arrive_date.day_name     |
|Wednesday     |arrive_date.day_name     |
|Friday        |arrive_date.day_name     |
|vegetarian    |meal                     |
|meal          |meal                     |
|meal          |meal                     |
|Los Angeles   |toloc.city_name          |
+--------------+-------------------------+
""")

# Benchmarks
st.markdown('Model Benchmarks', unsafe_allow_html=True)

st.markdown("""

The following table shows the performance benchmarks of the nerdl_atis_840b_300d model on the ATIS dataset:

| Metric | Score |
|---|---|
| Precision | 93.5% |
| Recall | 92.7% |
| F1 Score | 93.1% |
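
As a quick consistency check, and assuming the F1 score in the table is the usual harmonic mean of the listed precision and recall, the three numbers can be verified with a few lines of Python. The function name `f1_score` is only illustrative.

```python
def f1_score(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Values taken from the benchmark table.
precision, recall = 0.935, 0.927
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.1%}")  # prints "F1 = 93.1%"
```

The computed value rounds to the 93.1% reported in the table.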

These scores indicate that the model identifies and classifies airline-related entities reliably on ATIS-style queries, which makes it a solid choice for processing travel-related questions.

""", unsafe_allow_html=True)

# Conclusion
st.markdown('Conclusion', unsafe_allow_html=True)

st.markdown("""

Named Entity Recognition is a powerful tool for extracting structured information from unstructured text. With the nerdl_atis_840b_300d model, loaded through Spark NLP's NerDLModel annotator, you can process airline-related queries efficiently and automate responses.
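
One way to turn the pipeline output into structured information is to group the extracted chunks by label. The sketch below is a minimal example under stated assumptions: the rows are a handful copied from the results table shown earlier, and `group_by_label` is a hypothetical helper, not part of Spark NLP.

```python
from collections import defaultdict

# A few (chunk, label) rows copied from the results table shown earlier.
rows = [
    ("Delta Airlines", "airline_name"),
    ("B737", "aircraft_code"),
    ("10:00 AM", "depart_time.time"),
    ("morning", "depart_time.period_of_day"),
    ("business class", "class_type"),
    ("Los Angeles", "toloc.city_name"),
]


def group_by_label(pairs):
    # Collect chunks under their label, keeping the order of appearance.
    slots = defaultdict(list)
    for chunk, label in pairs:
        slots[label].append(chunk)
    return dict(slots)


slots = group_by_label(rows)
print(slots["toloc.city_name"])
```

In a real application the rows would come from collecting the `chunk` and `ner_label` columns of the result DataFrame. Downstream code can then look up slots such as `toloc.city_name` directly.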

With the benchmark scores reported above and its coverage of more than 80 entity types, the model is well suited to customer service, data analysis, and similar applications in the travel and airline industry.

As a next step, consider integrating the model into your own systems and using the extracted entities to improve user experience and operational efficiency.

""", unsafe_allow_html=True)

# References
st.markdown('References', unsafe_allow_html=True)

st.markdown("""
""", unsafe_allow_html=True)

# Community & Support
st.markdown('Community & Support', unsafe_allow_html=True)

st.markdown("""
""", unsafe_allow_html=True)