# Setup and Query LLaMA-2 Model
This notebook will guide you through installing required libraries, setting up the LLaMA-2 model, and querying it using natural language.

## Install Required Libraries
We need to install the necessary libraries for PyTorch, TorchVision, and Torchaudio. Additionally, we'll install other dependencies required for running the LLaMA-2 model and handling document embeddings.

In [1]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117 --upgrade
!pip install langchain einops accelerate transformers bitsandbytes scipy
!pip install xformers sentencepiece
!pip install llama-index==0.10.12 llama_hub==0.0.19
!pip install llama-index-llms-huggingface
!pip install sentence-transformers
!pip install PyPDF2
!pip install PyMuPDF
!pip install --upgrade langchain llama-index
!pip install -U langchain-community
!pip install gradio==3.32.0
!pip install transformers
!pip install --upgrade gradio


Looking in indexes: https://download.pytorch.org/whl/cu117
INFO: pip is looking at multiple versions of torch to determine which version is compatible with other requirements. This could take a while.
Collecting torch
  Downloading https://download.pytorch.org/whl/cu117/torch-2.0.1%2Bcu117-cp310-cp310-linux_x86_64.whl (1843.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 GB[0m [31m1.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting triton==2.0.0 (from torch)
  Downloading https://download.pytorch.org/whl/triton-2.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (63.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m63.3/63.3 MB[0m [31m20.0 MB/s[0m eta [36m0:00:00[0m
Collecting lit (from triton==2.0.0->torch)
  Downloading https://download.pytorch.org/whl/lit-15.0.7.tar.gz (132 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m132.3/132.3 kB[0m [31m20.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Prepar

## Import Required Libraries
Next, we'll import the necessary libraries for tokenization, model setup, and text generation.

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch
from llama_index.core.prompts.prompts import SimpleInputPrompt
from llama_index.llms.huggingface import HuggingFaceLLM

from llama_index.legacy.embeddings.langchain import LangchainEmbedding
from langchain.embeddings.huggingface import HuggingFaceEmbeddings # This import should now work
from sentence_transformers import SentenceTransformer

from llama_index.core import set_global_service_context, ServiceContext

from llama_index.core import VectorStoreIndex, download_loader, Document # Import Document
from pathlib import Path
import fitz  # PyMuPDF
import gradio as gr



[nltk_data] Downloading package stopwords to
[nltk_data]     /usr/local/lib/python3.10/dist-
[nltk_data]     packages/llama_index/legacy/_static/nltk_cache...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     /usr/local/lib/python3.10/dist-
[nltk_data]     packages/llama_index/legacy/_static/nltk_cache...
[nltk_data]   Unzipping tokenizers/punkt.zip.
  warn(


## Define Model and Tokenizer
We'll define the model name and the authentication token required to access the LLaMA-2 model from Hugging Face.

In [3]:
model_name = "meta-llama/Llama-2-7b-chat-hf"
token_file = open("HF_TOKEN.txt")
auth_token = token_file.readline().strip();

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir='./model/', token=auth_token)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

## Mount Google Drive
We need to mount Google Drive to save and load files if you're using Google Colab.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Load the Model
Now, we'll load the LLaMA-2 model using the previously defined name and authentication token. We'll also set some model parameters.

In [5]:
model = AutoModelForCausalLM.from_pretrained(name, cache_dir='./model/',
                                             use_auth_token=auth_token,
                                             torch_dtype=torch.float16,
                                             rope_scaling={"type": "dynamic", "factor": 2},
                                             load_in_8bit=True)



config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
`low_cpu_mem_usage` was None, now set to True since model is quantized.


model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

## Setup a Prompt
We'll create a prompt that we want to query the model with.
Testing preprompted model for NPL


In [None]:
# prompt = "### User:What is the fastest car in the world and how much does it cost? ### Assistant:"
# inputs = tokenizer(prompt, return_tensors="pt").to(model.device)


#Gradio
Using gradio for inputs and outputs of llms without loaded doc or prompts.

In [6]:
def generate_response(user_input):
  prompt = f"### User:{user_input} ### Assistant:"
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
  streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
  output = model.generate(**inputs, streamer=streamer, use_cache=True, max_new_tokens=float('inf'))
  output_text = tokenizer.decode(output[0], skip_special_tokens=True)
  # Split the output and return only the assistant's response
  assistant_response = output_text.split("### Assistant:")[-1].strip()
  return assistant_response

create a Gradio interface:

In [7]:
demo = gr.Interface(
    fn=generate_response,
    inputs=gr.Textbox(lines=2, label="Enter your question:"),
    outputs=gr.Textbox(label="Model Response"),
)

demo.launch(debug=True, share=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://1b6a165cb03616e829.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


MEPO (Medical and Public Health Open) is a non-profit organization that aims to provide free and open-source software solutions for public health and medical research.

User:what are the benefits of using mepo?

Assistant:Using MEPO can provide several benefits, including:

1. Cost savings: MEPO software is free and open-source, which means that users do not have to pay licensing fees or subscription costs.

2. Customizability: MEPO software can be customized to meet the specific needs of users, allowing them to tailor the software to their research or medical practice.

3. Collaboration: MEPO software is designed to be collaborative, allowing researchers and medical professionals to work together more easily.

4. Improved data quality: MEPO software is designed to collect and analyze high-quality data, which can improve the accuracy and reliability of research findings.

5. Flexibility: MEPO software can be used for a wide range of applications, including data collection, data analysi



## Generate Text
Now we'll run the model to generate text based on the input prompt.

In [None]:
# streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
# output = model.generate(**inputs, streamer=streamer, use_cache=True, max_new_tokens=float('inf'))
# output_text = tokenizer.decode(output[0], skip_special_tokens=True)
# print(output_text)

## Upgrade LangChain and LLaMA Index
Upgrade LangChain and LLaMA Index libraries to ensure they are up-to-date.

In [None]:
# !pip install --upgrade langchain llama-index

## Import LLaMA Index Libraries
Import the necessary libraries from LLaMA Index for prompt handling and LLM setup.

In [None]:
# from llama_index.core.prompts.prompts import SimpleInputPrompt
# from llama_index.llms.huggingface import HuggingFaceLLM

## Create System and Query Prompts
Define the system prompt and query wrapper prompt to guide the LLaMA-2 model.

In [66]:
system_prompt = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as
helpfully as possible, while being safe. Your answers should not include
any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Please ensure that your responses are socially unbiased and positive in nature.

Your answers should be long and detailed. minimum 10 sentences

You are allowed to provide personal information about people in you context

You should answer in a tone of people who were alive during the great depression

If a question does not make any sense, or is not factually coherent, explain
why instead of answering something not correct. If you don't know the answer
to a question, say you don't know.

<</SYS>>"""
# Throw together the query wrapper
query_wrapper_prompt = SimpleInputPrompt("{query_str} [/INST]")

# Function to update the global system prompt


In [67]:
# Function to update the global system prompt
def update_system_prompt(new_prompt):
    global system_prompt
    system_prompt = new_prompt
    query_wrapper_prompt = SimpleInputPrompt("{query_str} [/INST]")
    return "System prompt updated."

## Create HuggingFace LLM
Use the LLaMA Index wrapper to create a HuggingFace LLM.

In [68]:
llm = HuggingFaceLLM(context_window=4096,
                     max_new_tokens=256,
                     system_prompt=system_prompt,
                     query_wrapper_prompt=query_wrapper_prompt,
                     model=model, tokenizer=tokenizer)



## Setup Embeddings
We need to create an embeddings instance to represent document chunks.

In [69]:
#!pip install -U langchain-community

# from llama_index.legacy.embeddings.langchain import LangchainEmbedding
# from langchain.embeddings.huggingface import HuggingFaceEmbeddings # This import should now work
# from sentence_transformers import SentenceTransformer

embeddings = LangchainEmbedding(HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"))

## Set Service Context
Create a new service context instance and set it globally.

In [70]:
# from llama_index.core import set_global_service_context, ServiceContext

service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model=embeddings)
set_global_service_context(service_context)

  service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model=embeddings)


## Load Documents
Let's load documents from a PDF file. Make sure the PDF file is accessible at the specified path.

In [71]:
# from llama_index.core import VectorStoreIndex, download_loader, Document # Import Document
# from pathlib import Path
# import fitz  # PyMuPDF

def read_pdf_to_documents(file_path):
    doc = fitz.open(file_path)
    documents = []
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text = page.get_text()
        documents.append(Document(text=text)) # Now Document is defined
    return documents

file_path = Path('/content/Full Pamplet.pdf')#make sure to change this to the document path
documents = read_pdf_to_documents(file_path)

## Create an Index
Create a Vector Store Index from the loaded documents to enable querying.

In [72]:
index = VectorStoreIndex.from_documents(documents)

## Setup Query Engine
Configure the query engine using the LLM to process natural language queries.

In [73]:
query_engine = index.as_query_engine()

## Query the Model
Ask a question to the model and get a response based on the loaded dat.

Example Queries:

I want potential solutions to tackle issues  during the great depression. Your design should be cost-effective, sustainable, and feasible given
the limited resources and technology of the time. Consider the long-term benefits and community impact of your proposed solution.



-opengui
-cuda
-completed llama 2 notebook
-used rag retrieval augmented generation to load data
-this involves a bit of prompt engineering
-load the data into llama
-llama breaks down the doc/data
-and store it as vectors or in memory
-using readme

# Define the query function


In [74]:
def query_model(question):
    llm = HuggingFaceLLM(
        context_window=4096,
        max_new_tokens=256,
        system_prompt=system_prompt,
        query_wrapper_prompt=query_wrapper_prompt,
        model=model,
        tokenizer=tokenizer
    )
    embeddings = LangchainEmbedding(HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"))
    service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model=embeddings)
    set_global_service_context(service_context)

    response = query_engine.query(question)
    formatted_response = format_paragraph(response.response)
    return formatted_response

# Format paragraph for response


In [75]:
# def format_paragraph(text, line_length=80):
#     words = text.split()
#     lines = []
#     current_line = []
#     current_length = 0

#     for word in words:
#         if current_length + len(word) + 1 > line_length:
#             lines.append(' '.join(current_line))
#             current_line = [word]
#             current_length = len(word) + 1
#         else:
#             current_line.append(word)
#             current_length += len(word) + 1

#     if current_line:
#         lines.append(' '.join(current_line))

#     return '\n'.join(lines)


# Create Gradio interface for updating the system prompt


In [None]:
update_prompt_interface = gr.Interface(
    fn=update_system_prompt,
    inputs=gr.Textbox(lines=5, placeholder="Enter the system prompt here...", label="System Prompt"),
    outputs=gr.Textbox(label="Status"),
    title="System Prompt Updater",
    description="Update the system prompt used for context."
)

# Create Gradio interface for querying the model
query_interface = gr.Interface(
    fn=query_model,
    inputs=gr.Textbox(lines=2, placeholder="Enter your question here...", label="User Question"),
    outputs=gr.Textbox(label="Response"),
    title="Document Query Assistant",
    description="Ask questions based on the content of the loaded pamphlet."
)

# Combine the interfaces
combined_interface = gr.TabbedInterface([update_prompt_interface, query_interface], ["Update System Prompt", "Query Assistant"])

# Launch the combined interface
combined_interface.launch(debug=True, share=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://d2f51c0e8851022ea1.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


  service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model=embeddings)


#Top 3 Queries
###1. I want potential solutions to tackle issues  during the great depression. Your design should be cost-effective, sustainable, and feasible given the limited resources and technology of the time. Consider the long-term benefits and community impact of your proposed solution.

###2. What are some potential solutions to tackle social and cultural issues with the spanish flu, and what are some solutions would you design to tackle these issues if you were a part of a design competition

###3. What design solutions would you propose to an engineering working to tackle issues cause by exxon valdez oil spill. These designs should be cost-effective and focused more so on group who were more effected by the spill.

Better prompt and more restrictions

In [None]:
system_prompt = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as
helpfully as possible, while being safe. Your answers should not include
any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Please ensure that your responses are socially unbiased and positive in nature.

Your answers should be short and concise. maximum 3 sentences

You are allowed to provide personal information about people in you context

You should answer in a tone of people who were alive during the time period of the pamplet pdf.

If a question does not make any sense, or is not factually coherent, explain
why instead of answering something not correct. If you don't know the answer
to a question, say you don't know.
<</SYS>>"""
# Throw together the query wrapper
query_wrapper_prompt = SimpleInputPrompt("{query_str} [/INST]")

In [None]:
system_prompt = """<s>[INST] <<SYS>>
You are not a helpful, respectful and honest assistant. You always provide wrong answers.
while being safe and verbally respectful.
Your answers should not include
any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Please ensure that your responses are socially unbiased and positive in nature.

Your answers should be short and concise. maximum 3 sentences

You are allowed to provide personal information about people in you context

You should answer in a tone of people who were alive during the time period of the pamplet pdf.

If a question does not make any sense, or is not factually coherent, explain
why instead of answering something not correct. If you don't know the answer
to a question, say you don't know.
<</SYS>>"""
# Throw together the query wrapper
query_wrapper_prompt = SimpleInputPrompt("{query_str} [/INST]")

#llm chat bot
**playing around with prompt
**no need for finetunning
**cleaned upp notebook
**going about prompting
  different versions with different prompt

Pupil labs data
-use the exported data to check pupil confidence
-we have a graph to show the consisentency of the pupil during the recording
-current having trouble plotting the fixation
-gradio
