# -*- coding: utf-8 -*-
"""Copy of Week_7_Workshop.ipynb
Automatically generated by Colab.
Original file is located at
https://colab.research.google.com/drive/13AsDIcNVewwKVKo0AiSBt6sPLjkdXo_q
# Week 7 Workshop: Prompting
In this workshop, we'll use the CREATE framework to iteratively develop and test prompts for our review analysis system.
## CREATE Framework Components:
- **C**haracter: Give the LLM a persona with expertise
- **R**equest: Clearly state what you want
- **E**xample: Include examples to guide behavior
- **A**udience: Specify the target audience
- **T**ype: Define the format for the response
- **E**xtras: Add any additional instructions
We'll create multiple prompts and evaluate them against our ground truth data to find the most effective approach.
## Setup
First, let's install our dependencies and make sure we've got our API Key loaded from Secrets (or locally, if you're running via Jupyter)
"""
# The pip install commands are not needed in Hugging Face, but may be needed in other environments.
#!pip install datasets
#!pip install gradio
from openai import OpenAI
import os
#from google.colab import userdata
import pandas as pd
import numpy as np
import json
import re
from typing import List, Dict
import time
from datasets import load_dataset
# Setup OpenAI client
#OPENAI_KEY = userdata.get('OPENAI_KEY')
OPENAI_KEY = os.getenv('OPENAI_KEY')
client = OpenAI(api_key=OPENAI_KEY)
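"""If you're running locally (e.g. in Jupyter) rather than on Hugging Face, one option for loading the key is a `.env` file read with `python-dotenv`. This is just a sketch: it assumes you've installed `python-dotenv` and created a `.env` file containing `OPENAI_KEY=...`."""
# Uncomment to load OPENAI_KEY from a local .env file instead of Space secrets / Colab userdata
# from dotenv import load_dotenv
# load_dotenv()  # reads key=value pairs from .env into environment variables
# OPENAI_KEY = os.getenv('OPENAI_KEY')
# client = OpenAI(api_key=OPENAI_KEY)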
"""## Helper Functions
Let's create some helper functions for querying OpenAI and evaluating our prompts. Initially this is set up for sentiment classification. As the workshop progresses you may want to duplicate this notebook and create a helper function that evaluates how close the model gets to a star rating, or one that rates the quality of a review response (a sketch of the former appears after `evaluate_prompt` below).
NB: this helper function only pulls out the text between <classification></classification> tags, so you'll need to make sure your prompt outputs the classification between such tags.
This function currently measures accuracy, but you might consider a different measure depending on your use case.
"""
def query_openai(messages: List[Dict[str, str]],
                 #model: str = "gpt-4o-mini",
                 #model: str = "gpt-4",
                 model: str = "gpt-3.5-turbo",
                 temperature: float = 0.7) -> str:
    """Send a chat completion request and return the model's reply, or None on error."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error querying OpenAI: {e}")
        return None
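"""If you start hitting transient rate-limit or network errors on larger eval runs, a simple retry wrapper around `query_openai` can help. The sketch below is an addition (not part of the original workshop code) that retries with exponential backoff before giving up."""
def query_openai_with_retry(messages: List[Dict[str, str]], retries: int = 3, backoff: float = 2.0):
    """Call query_openai, retrying with exponential backoff whenever it returns None."""
    for attempt in range(retries):
        response = query_openai(messages)
        if response is not None:
            return response
        # Wait 2s, 4s, 8s, ... between attempts
        time.sleep(backoff ** (attempt + 1))
    return None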
def evaluate_prompt(prompt_template: dict, test_data: pd.DataFrame) -> pd.DataFrame:
    """Evaluate a prompt against test data and return per-review results (accuracy is printed)."""
    results = []
    for _, row in test_data.iterrows():
        messages = [
            {"role": "system", "content": prompt_template['system']},
            {"role": "user", "content": prompt_template['user'].replace('{{REVIEW}}', row['text'])}
        ]
        # Fall back to an empty string if the API call failed
        response = query_openai(messages) or ""
        # Extract reasoning, classification, and reply from the XML tags in the response
        reasoning_match = re.search(r'<reasoning>(.*?)</reasoning>', response, re.IGNORECASE | re.DOTALL)
        reasoning = reasoning_match.group(1) if reasoning_match else "unknown"
        classification_match = re.search(r'<classification>(.*?)</classification>', response, re.IGNORECASE)
        try:
            predicted = int(classification_match.group(1)) if classification_match else "unknown"
        except ValueError:
            predicted = "unknown"
        reply = re.search(r'<reply>(.*?)</reply>', response, re.IGNORECASE | re.DOTALL)
        results.append({
            'review': row['text'],
            'true_label': row['label'],
            'reasoning': reasoning,
            'predicted': predicted,
            'correct': predicted == row['label'],
            'reply': reply.group(1) if reply else ""
        })
        time.sleep(1)  # Rate limiting
    results_df = pd.DataFrame(results)
    accuracy = results_df['correct'].mean()
    print(f"Accuracy: {accuracy:.2%}")
    return results_df
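"""As mentioned above, if your use case predicts a star rating rather than a sentiment class, exact-match accuracy can be too harsh. The helper below is a sketch of an alternative metric (an addition, not part of the workshop code): it measures how close the model gets on average, assuming a results DataFrame with numeric `true_label` values and the `predicted` column produced by `evaluate_prompt`."""
def rating_closeness(results_df: pd.DataFrame) -> float:
    """Return the mean absolute error between predicted and true ratings, ignoring unparseable predictions."""
    # "unknown" predictions become NaN and are dropped from the error calculation
    predicted = pd.to_numeric(results_df['predicted'], errors='coerce')
    errors = (predicted - results_df['true_label']).abs().dropna()
    if errors.empty:
        return float('nan')
    mae = errors.mean()
    print(f"Mean absolute error: {mae:.2f} (lower is better)")
    return mae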
"""## Load Ground Truth Data
Let's load our test dataset with labeled reviews.
You'll want to replace this dataset with a larger number of reviews from your dataset. In the real world this would be something like 100 reviews, and would include lots of edge cases, but to keep things simple today, aim for 10-20 reviews.
You can either create synthetic reviews (a sketch follows the loading code below), or use an LLM to get your existing classified reviews into the correct format, as below.
"""
# Sample test data - replace with your actual dataset
# test_data = pd.DataFrame({
# 'review': [
# "The product arrived quickly but didn't meet my expectations. The quality wasn't great.",
# "Absolutely love this! Best purchase I've made all year.",
# "It's okay, does the job but nothing special.",
# "Terrible experience. Would not recommend."
# ],
# 'label': ['negative', 'positive', 'neutral', 'negative']
# })
ds = load_dataset("vincha77/filtered_yelp_restaurant_reviews")
df = ds["train"].to_pandas()
# splits = {'train': 'train.parquet', 'validation': 'validation.parquet', 'test': 'test.parquet'}
# df = pd.read_parquet("hf://datasets/cornell-movie-review-data/rotten_tomatoes/" + splits["train"])
test_data = df.tail(20)
print(f"Loaded {len(test_data)} test examples")
"""NB when working at scale it can sometimes be better to create eval data sets that are easier to check and use, like this:
```py
test_data = [
    {
        "review": "I’ve been using this app for a while, but recently, every time I try to upload a photo, it crashes without fail. It’s really frustrating because I need this feature for my work.",
        "issue_tag": ["Software Bug"],
        "review_sentiment": "Negative",
        "review_star_rating": 2,
        "suggested_response": "We’re sorry to hear the app is crashing. Please try updating to the latest version or reinstalling the app. If the issue persists, contact our support team for further assistance!"
    },
    {
        "review": "I just bought a new printer, but my computer refuses to recognise it. I’ve checked the cables and tried reinstalling the drivers, but nothing seems to work. I’m very disappointed as this shouldn’t be so difficult.",
        "issue_tag": ["Hardware Malfunction"],
        "review_sentiment": "Negative",
        "review_star_rating": 1,
        "suggested_response": "We apologise for the trouble. Have you checked the printer's connection or installed the latest drivers? If you need more help, our support team would be happy to assist."
    },
    {
        "review": "I needed to change my password and spent ages clicking through the menus. The process is not intuitive at all, and I had to look it up online to figure it out. Not the best user experience.",
        "issue_tag": ["User Error"],
        "review_sentiment": "Neutral",
        "review_star_rating": 3,
        "suggested_response": "Changing your password should be simple! Here’s a quick guide: Go to the settings menu, select 'Account', and click 'Change Password'. Let us know if you need more assistance!"
    },
    {
        "review": "This app is absolutely fantastic! It’s easy to use, and I love the new photo editing features—they’ve made my projects so much better. I had a minor issue at first, but support was super helpful and sorted it out quickly.",
        "issue_tag": ["Positive Feedback"],
        "review_sentiment": "Positive",
        "review_star_rating": 5,
        "suggested_response": "Thank you so much for your kind words! We’re thrilled that you’re enjoying the app and the photo editing features. If you ever need help again, our support team is always here for you!"
    }
]
```
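If you build an eval set in this richer shape, one way to plug it into `evaluate_prompt` is to convert it into a DataFrame with the `text` and `label` columns that helper expects. This is a sketch; the sentiment-to-number mapping is an assumption chosen to match the 2/1/0 scheme used later:
```py
sentiment_to_label = {"Positive": 2, "Neutral": 1, "Negative": 0}
test_data = pd.DataFrame({
    "text": [item["review"] for item in test_data],
    "label": [sentiment_to_label[item["review_sentiment"]] for item in test_data],
})
```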
## Prompt Version 1: Basic Prompt
Let's start with a basic prompt - make sure to update the system prompt to your use case, but don't add any examples. We deliberately want to start simple, and use the minimum number of input tokens.
"""
prompt_v1 = {
    'system': """You are a chatbot specializing in restaurant review classification. Your goal is to help businesses understand customer sentiment accurately.""",
    'user': """Please classify the following review as 2, 1, 0 with 2 for a positive review, 1 for a neutral review, and 0 for a negative review.
Output Format: classification in XML tags - <classification>sentiment</classification>
Review to classify: {{REVIEW}}"""
}
# Evaluate prompt_v1
results_v1 = evaluate_prompt(prompt_v1, test_data)
print("\nDetailed Results:")
print(results_v1[['review', 'true_label', 'predicted', 'correct']])
"""## Prompt Version 2: Enhanced Examples
Let's improve our prompt by adding chain-of-thought reasoning.
Make some small edits and fill ou the reasoning steps to adjust this to your use case.
"""
prompt_v2 = {
    'system': """You are a senior sentiment analysis expert with extensive experience in restaurant reviews.""",
    'user': """Please classify the following review as 2, 1, 0 with 2 for a positive review, 1 for a neutral review, and 0 for a negative review.
Output Format:
1. List key sentiment indicators
2. Provide reasoning
3. End with classification in XML tags:
<reasoning>reasoning</reasoning>
<classification>sentiment</classification>
Review to classify: {{REVIEW}}"""
}
# Evaluate prompt_v2
results_v2 = evaluate_prompt(prompt_v2, test_data)
print("\nDetailed Results:")
print(results_v2[['review', 'true_label','reasoning', 'predicted', 'correct']])
"""## 🔨 Exercise: Create Your Enhanced Prompt
Now it's your turn! Create a new version of the prompt using the CREATE framework.
|CREATE|Element|Description|
|---|---|---|
| C | Character | Give the LLM a persona. Who has the right expertise to give you the information you need? |
| R | Request | State what you want and provide relevant details, including context and other useful information such as specific keywords and facts. |
| E | Example | Use examples in the prompt to provide context and direction for the output. |
| A | Audience | Who is the target audience for this request? What tone would resonate with them? |
| T | Type | What format and structure do you want e.g. a table, bullet points, number of words? |
| E | Extras | e.g. Ignore everything before this prompt<br>e.g. Ask me questions before you answer<br>e.g. Explain your thinking |
Try adding ONE element at a time before comparing it against the other two prompts in the following code block.
Consider:
- What persona would be most effective?
- What examples would help guide the model?
- How can you make the output format more precise?
- What additional instructions might help?
"""
prompt_v3 = {
    'system': """You are the manager of the restaurant that received the review below.""",
    'user': """Please classify the following review as 2, 1, 0 with 2 for a positive review, 1 for a neutral review, and 0 for a negative review.
Use the following examples to determine review sentiment:
Input: Lived out of state for awhile and really missed Trader's great food and specialty items (ie try the pop-up sponges; special dark chocolate bars; thick hand done tortillas; different naan breads; good coffee selection) Wide variety of frozen entrees, interesting frozen veggies, rice, sauces (curries great!). A very happy staff happy to serve will answer your questions and guide you to your item. Due to the competition among grocers, the workers in the large chain stores in town have now stopped being so snarky with customers and some come close to Trader's service, but Trader's cornered the market as a pleasant place to shop with good prices years ago. Free small samples of coffee and food round out the friendly atmosphere.
Output: sentiment is 2
Input: I really wanted to like this place. Especially since two businesses failed before it. Sadly, if things don't improve I foresee Pyramid grill following suit. Maricella was our server. Poor girl seemed a little overwhelmed. She was the hostess, server, busser, and bartender for the whole place. (Not that it was busy) She was very sweet and and prompt as she could possibly be. I ordered the Alfredo. I wasn't really impressed with it. It was lacking the creamy savoryness that one would expect with an Alfredo sauce. To me, it tastes like over cooked, over buttered noodles with feta cheese on top. My boys had the children's Mac and cheese. I think their meal tasted way better than mine. The fries were really good. The inside was a little weird. It's like they gutted the whole thing, but left the booths. It seemed divey to me. The front entrance was nicely decorated for the holidays. It looked like it would be a pretty nice restaurant at first glance until you peek around the corner. I'm not sure I will be back unless I'm with a group that decided to go or unless I hear that changes have been made.
Output: sentiment is 1
Input: Things started off somewhat negative. The hostess was somewhat snarky and not accommodating. In front of me she had to discuss with the other hostess why she wouldn't seat us in a certain seat. The kind hostess was exasperated with her and handed her the menus and said I will let you handle this. I just stood there in amazement. She refused to seat us in a booth because she was reserving it for possible larger parties. They do not take reservations yet we had to be seated at a tiny round bar table for two. Yet there was a party of two seated right before us in a booth. I explained my husband had a recent injury and the high chairs were uncomfortable. She still would not let us sit in a booth. Now the up side of this was the waiter!! He was wonderful and very accommodating. Did I mention it was our anniversary and I thought we would try a new restaurant. We ordered an appetizer and a drink, honestly by this time I was feeling quite upset about the whole thing. The appetizer was extremely mediocre and I suggested we not order an entree and we left. The waiter said he would pay for our meal ( how very kind he was) we paid and left and I will never return.
Output: sentiment is 0
Output Format:
1. List key sentiment indicators
2. Put reasoning in the reasoning XML tags:
<reasoning>reasoning</reasoning>
3. End with classification in XML tags:
<classification>sentiment</classification>
4. Write a reply of less than 50 words back to the reviewer. Be sympathetic. If the review is negative, include that you are sorry they felt that way and hope to do better. If the review is positive, include how much you appreciate the review.
If a review is not about a restaurant, respond by saying that we cannot answer non-restaurant reviews.
Put reply in the reply XML tags:
<reply>reply</reply>
Review to classify: {{REVIEW}}"""
}
# Evaluate prompt v3
results_v3 = evaluate_prompt(prompt_v3, test_data)
print("\nDetailed Results:")
print(results_v3[['review', 'true_label', 'reasoning', 'predicted', 'correct', 'reply']])
"""## Compare All Versions
Let's analyze how each version of our prompt performed.
If you're getting 100%, consider adding more examples to the eval dataset, including some much harder use cases.
If you're not getting 100%, consider some further iterations.
"""
# Calculate accuracy for each version
results = {
    'Version 1 (Basic)': results_v1['correct'].mean(),
    'Version 2 (Chain-of-Thought)': results_v2['correct'].mean(),
    'Version 3 (Your Version)': results_v3['correct'].mean()
}
results_df = pd.DataFrame(results.items(), columns=['Version', 'Accuracy'])
print("Accuracy Comparison:")
print(results_df)
# Create a visualization of the results
# results_df.plot(x='Version', y='Accuracy', kind='bar')
"""## Define Gradio functions and create the chat interaction space"""
def test_prompt(prompt_template: dict, user_prompt: str):
    """Run a single review through the prompt and return (classification, reply)."""
    messages = [
        {"role": "system", "content": prompt_template['system']},
        {"role": "user", "content": prompt_template['user'].replace('{{REVIEW}}', user_prompt)}
    ]
    # Fall back to an empty string if the API call failed
    response = query_openai(messages) or ""
    classification_match = re.search(r'<classification>(.*?)</classification>', response, re.IGNORECASE)
    reply_match = re.search(r'<reply>(.*?)</reply>', response, re.IGNORECASE | re.DOTALL)
    classification = classification_match.group(1) if classification_match else "unknown"
    reply = reply_match.group(1) if reply_match else ""
    return classification, reply
# Here is where we set up the interface for restaurant reviews and provide feedback
import gradio as gr
# Define the response function
def get_response(question):
    # Classify the review with prompt_v3 and map the numeric label onto a sentiment name
    sentiment, reply = test_prompt(prompt_v3, question)
    if sentiment == '2':
        sentiment = "Positive"
    elif sentiment == '1':
        sentiment = "Neutral"
    else:
        sentiment = "Negative"
    return sentiment, reply
# Create the interface
interface = gr.Interface(
    fn=get_response,  # The function to process input
    inputs=gr.Textbox(label="Write a restaurant review"),  # Input box
    outputs=[
        gr.Textbox(label="Predicted sentiment"),
        #gr.Textbox(label="True sentiment"),
        gr.Textbox(label="Reply to reviewer"),
    ],
    title="Restaurant Review Interface",  # Title for the app
    description="Write a review to see its predicted sentiment and an auto-generated reply. Do not write non-restaurant reviews."
)
# Launch the interface
interface.launch()
"""## Your task for this week will be to:
### Topic
Experiment with prompts to optimize your LLMs' performance.
### Deliverable
Link to your updated Hugging Face Space from week 5 that demonstrates improved classification, rating, and response generation. The user should be able to select prompts that use examples and chain of thought, and possibly even tool/function use (a sketch follows the guides below).
### OpenAI Guides for further exploration
- [Tool Use for Customer Service](https://cookbook.openai.com/examples/using_tool_required_for_customer_service)
- [OpenAI Evals](https://cookbook.openai.com/examples/evaluation/getting_started_with_openai_evals)
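If you want to experiment with tool/function use, a minimal sketch of the chat-completions `tools` parameter is below; the tool name and schema are hypothetical, and the cookbook guide above covers a full workflow.
```py
# Hypothetical tool the model can choose to call for reviews that need escalation
tools = [{
    "type": "function",
    "function": {
        "name": "flag_for_human_review",
        "description": "Escalate a review that needs a human response",
        "parameters": {
            "type": "object",
            "properties": {"reason": {"type": "string"}},
            "required": ["reason"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "The waiter was rude and our food never arrived."}],
    tools=tools,
)

# If the model decided to call the tool, the call details are on the first choice
tool_calls = response.choices[0].message.tool_calls
```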
## Reflection
Consider the following questions:
1. Which aspects of the CREATE framework had the biggest impact on performance?
2. How did different types of examples affect the results?
3. What other improvements could you make to your prompt?
4. How would you adapt this prompt for different use cases (e.g., generating responses)?
"""