Merge branch 'main' into pdf-render
- CHANGELOG.md +31 -10
- README.md +7 -3
- document_qa/document_qa_engine.py +66 -25
- document_qa/grobid_processors.py +1 -1
- pyproject.toml +1 -1
- streamlit_app.py +15 -11
CHANGELOG.md
CHANGED

@@ -4,27 +4,49 @@ All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
+## [0.3.1] - 2023-11-22
+
+### Added
+
++ Include biblio in embeddings by @lfoppiano in #21
+
+### Fixed
+
++ Fix conversational memory by @lfoppiano in #20
+
+## [0.3.0] - 2023-11-18
+
+### Added
+
++ add zephyr-7b by @lfoppiano in #15
++ add conversational memory in #18
+
+## [0.2.1] - 2023-11-01
+
+### Fixed
+
++ fix env variables by @lfoppiano in #9
+
 ## [0.2.0] – 2023-10-31
 
 ### Added
 
 + Selection of chunk size on which embeddings are created upon
 + Mistral model to be used freely via the Huggingface free API
 
 ### Changed
 
++ Improved documentation, adding privacy statement
 + Moved settings on the sidebar
 + Disable NER extraction by default, and allow user to activate it
 + Read API KEY from the environment variables and if present, avoid asking the user
 + Avoid changing model after update
 
 ## [0.1.3] – 2023-10-30
 
 ### Fixed
 
 + ChromaDb accumulating information even when new papers were uploaded
 
 ## [0.1.2] – 2023-10-26
 
@@ -36,9 +58,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
 ### Fixed
 
 + Github action build
 + dependencies of langchain and chromadb
 
 ## [0.1.0] – 2023-10-26
 
@@ -54,8 +75,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 + Kick off application
 + Support for GPT-3.5
 + Support for Mistral + SentenceTransformer
 + Streamlit application
 + Docker image
 + pypi package
 
 <!-- markdownlint-disable-file MD024 MD033 -->
README.md
CHANGED

@@ -14,6 +14,8 @@ license: apache-2.0
 
 **Work in progress** :construction_worker:
 
+<img src="https://github.com/lfoppiano/document-qa/assets/15426/f0a04a86-96b3-406e-8303-904b93f00015" width=300 align="right" />
+
 ## Introduction
 
 Question/Answering on scientific documents using LLMs: ChatGPT-3.5-turbo, Mistral-7b-instruct and Zephyr-7b-beta.
 
@@ -23,11 +25,13 @@ We target only the full-text using [Grobid](https://github.com/kermitt2/grobid)
 
 Additionally, this frontend provides the visualisation of named entities on LLM responses to extract <span stype="color:yellow">physical quantities, measurements</span> (with [grobid-quantities](https://github.com/kermitt2/grobid-quantities)) and <span stype="color:blue">materials</span> mentions (with [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors)).
 
-The conversation is
+The conversation is kept in memory by a buffered sliding window memory (top 4 most recent messages) and the messages are injected in the context as "previous messages".
+
+(The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)
 
 **Demos**:
-- (
-- (
+- (stable version): https://lfoppiano-document-qa.hf.space/
+- (unstable version): https://document-insights.streamlit.app/
 
 ## Getting started
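The sliding-window memory described in the README is what the engine changes below consume. A minimal sketch of how such a memory could be wired into `DocumentQAEngine` (not part of this commit; the LLM, embedding function and Grobid URL are placeholder assumptions, and `ConversationBufferWindowMemory` is the obvious LangChain candidate for a 4-message window):

```python
# Illustrative sketch only — not part of this diff.
from langchain.chat_models import ChatOpenAI          # assumed LLM
from langchain.embeddings import OpenAIEmbeddings     # assumed embedding function
from langchain.memory import ConversationBufferWindowMemory

from document_qa.document_qa_engine import DocumentQAEngine

# k=4 mirrors the "top 4 most recent messages" mentioned in the README
memory = ConversationBufferWindowMemory(k=4)

engine = DocumentQAEngine(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo"),
    embedding_function=OpenAIEmbeddings(),
    grobid_url="http://localhost:8070",    # assumed local Grobid instance
    memory=memory,
)
```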
document_qa/document_qa_engine.py
CHANGED

@@ -3,17 +3,18 @@ import os
 from pathlib import Path
 from typing import Union, Any
 
+from document_qa.grobid_processors import GrobidProcessor
 from grobid_client.grobid_client import GrobidClient
-from langchain.chains import create_extraction_chain
-from langchain.chains.question_answering import load_qa_chain
+from langchain.chains import create_extraction_chain, ConversationChain, ConversationalRetrievalChain
+from langchain.chains.question_answering import load_qa_chain, stuff_prompt, refine_prompts, map_reduce_prompt, \
+    map_rerank_prompt
 from langchain.prompts import SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate
 from langchain.retrievers import MultiQueryRetriever
+from langchain.schema import Document
 from langchain.text_splitter import RecursiveCharacterTextSplitter
 from langchain.vectorstores import Chroma
 from tqdm import tqdm
 
-from document_qa.grobid_processors import GrobidProcessor
-
 
 class DocumentQAEngine:
     llm = None
@@ -23,15 +24,24 @@ class DocumentQAEngine:
     embeddings_map_from_md5 = {}
     embeddings_map_to_md5 = {}
 
+    default_prompts = {
+        'stuff': stuff_prompt,
+        'refine': refine_prompts,
+        "map_reduce": map_reduce_prompt,
+        "map_rerank": map_rerank_prompt
+    }
+
     def __init__(self,
                  llm,
                  embedding_function,
                  qa_chain_type="stuff",
                  embeddings_root_path=None,
                  grobid_url=None,
+                 memory=None
                  ):
         self.embedding_function = embedding_function
         self.llm = llm
+        self.memory = memory
         self.chain = load_qa_chain(llm, chain_type=qa_chain_type)
 
         if embeddings_root_path is not None:
@@ -87,14 +97,14 @@ class DocumentQAEngine:
         return self.embeddings_map_from_md5[md5]
 
     def query_document(self, query: str, doc_id, output_parser=None, context_size=4, extraction_schema=None,
-                       verbose=False
+                       verbose=False) -> (
             Any, str):
         # self.load_embeddings(self.embeddings_root_path)
 
         if verbose:
             print(query)
 
-        response = self._run_query(doc_id, query, context_size=context_size
+        response = self._run_query(doc_id, query, context_size=context_size)
         response = response['output_text'] if 'output_text' in response else response
 
         if verbose:
@@ -144,21 +154,25 @@ class DocumentQAEngine:
 
         return parsed_output
 
-    def _run_query(self, doc_id, query,
+    def _run_query(self, doc_id, query, context_size=4):
         relevant_documents = self._get_context(doc_id, query, context_size)
-        return self.chain.run(input_documents=relevant_documents,
-                              question=query)
-        # return self.chain({"input_documents": relevant_documents, "question": prompt_chat_template}, return_only_outputs=True)
+        response = self.chain.run(input_documents=relevant_documents,
+                                  question=query)
+
+        if self.memory:
+            self.memory.save_context({"input": query}, {"output": response})
+        return response
 
     def _get_context(self, doc_id, query, context_size=4):
         db = self.embeddings_dict[doc_id]
         retriever = db.as_retriever(search_kwargs={"k": context_size})
         relevant_documents = retriever.get_relevant_documents(query)
+        if self.memory and len(self.memory.buffer_as_messages) > 0:
+            relevant_documents.append(
+                Document(
+                    page_content="""Following, the previous question and answers. Use these information only when in the question there are unspecified references:\n{}\n\n""".format(
+                        self.memory.buffer_as_str))
+            )
         return relevant_documents
 
     def get_all_context_by_document(self, doc_id):
@@ -173,8 +187,10 @@ class DocumentQAEngine:
         relevant_documents = multi_query_retriever.get_relevant_documents(query)
         return relevant_documents
 
-    def get_text_from_document(self, pdf_file_path, chunk_size=-1, perc_overlap=0.1, verbose=False):
-        """
+    def get_text_from_document(self, pdf_file_path, chunk_size=-1, perc_overlap=0.1, include=(), verbose=False):
+        """
+        Extract text from documents using Grobid, if chunk_size is < 0 it keeps each paragraph separately
+        """
         if verbose:
             print("File", pdf_file_path)
         filename = Path(pdf_file_path).stem
@@ -189,6 +205,7 @@ class DocumentQAEngine:
         texts = []
         metadatas = []
         ids = []
+
         if chunk_size < 0:
             for passage in structure['passages']:
                 biblio_copy = copy.copy(biblio)
@@ -212,28 +229,49 @@ class DocumentQAEngine:
             metadatas = [biblio for _ in range(len(texts))]
             ids = [id for id, t in enumerate(texts)]
 
+        if "biblio" in include:
+            biblio_metadata = copy.copy(biblio)
+            biblio_metadata['type'] = "biblio"
+            biblio_metadata['section'] = "header"
+            for key in ['title', 'authors', 'publication_year']:
+                if key in biblio_metadata:
+                    texts.append("{}: {}".format(key, biblio_metadata[key]))
+                    metadatas.append(biblio_metadata)
+                    ids.append(key)
+
         return texts, metadatas, ids
 
-    def create_memory_embeddings(self, pdf_path, doc_id=None, chunk_size=500, perc_overlap=0.1):
+    def create_memory_embeddings(self, pdf_path, doc_id=None, chunk_size=500, perc_overlap=0.1, include_biblio=False):
+        include = ["biblio"] if include_biblio else []
+        texts, metadata, ids = self.get_text_from_document(
+            pdf_path,
+            chunk_size=chunk_size,
+            perc_overlap=perc_overlap,
+            include=include)
         if doc_id:
             hash = doc_id
         else:
            hash = metadata[0]['hash']
 
         if hash not in self.embeddings_dict.keys():
-            self.embeddings_dict[hash] = Chroma.from_texts(texts,
+            self.embeddings_dict[hash] = Chroma.from_texts(texts,
+                                                           embedding=self.embedding_function,
+                                                           metadatas=metadata,
                                                            collection_name=hash)
         else:
+            # if 'documents' in self.embeddings_dict[hash].get() and len(self.embeddings_dict[hash].get()['documents']) == 0:
+            #     self.embeddings_dict[hash].delete(ids=self.embeddings_dict[hash].get()['ids'])
+            self.embeddings_dict[hash].delete_collection()
+            self.embeddings_dict[hash] = Chroma.from_texts(texts,
+                                                           embedding=self.embedding_function,
+                                                           metadatas=metadata,
                                                            collection_name=hash)
 
         self.embeddings_root_path = None
 
         return hash
 
-    def create_embeddings(self, pdfs_dir_path: Path, chunk_size=500, perc_overlap=0.1):
+    def create_embeddings(self, pdfs_dir_path: Path, chunk_size=500, perc_overlap=0.1, include_biblio=False):
         input_files = []
         for root, dirs, files in os.walk(pdfs_dir_path, followlinks=False):
             for file_ in files:
@@ -250,9 +288,12 @@ class DocumentQAEngine:
             if os.path.exists(data_path):
                 print(data_path, "exists. Skipping it ")
                 continue
+            include = ["biblio"] if include_biblio else []
+            texts, metadata, ids = self.get_text_from_document(
+                input_file,
+                chunk_size=chunk_size,
+                perc_overlap=perc_overlap,
+                include=include)
             filename = metadata[0]['filename']
 
             vector_db_document = Chroma.from_texts(texts,
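Taken together, these engine changes add a `memory` hook and an `include_biblio` option. A hedged usage sketch, continuing the earlier example (the PDF path is a placeholder, and `engine` is the `DocumentQAEngine` built above):

```python
# Continuing the earlier sketch; "paper.pdf" is a placeholder path.
doc_id = engine.create_memory_embeddings(
    "paper.pdf",
    chunk_size=500,
    perc_overlap=0.1,
    include_biblio=True,   # also index title / authors / publication_year as extra chunks
)

# _get_context() fetches the top-k chunks and, when the memory buffer is non-empty,
# appends the previous turns as one extra Document; _run_query() then saves the new
# question/answer pair back into the memory.
_, answer = engine.query_document("What is the main contribution?", doc_id, context_size=4)
print(answer)
```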
document_qa/grobid_processors.py
CHANGED

@@ -171,7 +171,7 @@ class GrobidProcessor(BaseProcessor):
         }
         try:
             year = dateparser.parse(doc_biblio.header.date).year
-            biblio["
+            biblio["publication_year"] = year
         except:
             pass
 
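This one-line change feeds the biblio feature above: `publication_year` is one of the keys (`title`, `authors`, `publication_year`) that `get_text_from_document()` turns into extra chunks when `include=["biblio"]` is requested, so without it the publication year would never be indexed.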
pyproject.toml
CHANGED

@@ -3,7 +3,7 @@ requires = ["setuptools", "setuptools-scm"]
 build-backend = "setuptools.build_meta"
 
 [tool.bumpversion]
-current_version = "0.3.
+current_version = "0.3.2"
 commit = "true"
 tag = "true"
 tag_name = "v{new_version}"
streamlit_app.py
CHANGED

@@ -115,6 +115,7 @@ def clear_memory():
 
 # @st.cache_resource
 def init_qa(model, api_key=None):
+    ## For debug add: callbacks=[PromptLayerCallbackHandler(pl_tags=["langchain", "chatgpt", "document-qa"])])
     if model == 'chatgpt-3.5-turbo':
         if api_key:
             chat = ChatOpenAI(model_name="gpt-3.5-turbo",
@@ -143,7 +144,7 @@ def init_qa(model, api_key=None):
         st.stop()
         return
 
-    return DocumentQAEngine(chat, embeddings, grobid_url=os.environ['GROBID_URL'])
+    return DocumentQAEngine(chat, embeddings, grobid_url=os.environ['GROBID_URL'], memory=st.session_state['memory'])
 
 
 @st.cache_resource
@@ -252,7 +253,8 @@ with st.sidebar:
 
     st.button(
         'Reset chat memory.',
+        key="reset-memory-button",
+        on_click=clear_memory,
         help="Clear the conversational memory. Currently implemented to retain the 4 most recent messages.")
 
     left_column, right_column = st.columns([1, 1])
@@ -264,7 +266,9 @@ with right_column:
     st.markdown(
        ":warning: Do not upload sensitive data. We **temporarily** store text from the uploaded PDF documents solely for the purpose of processing your request, and we **do not assume responsibility** for any subsequent use or handling of the data submitted to third parties LLMs.")
 
+    uploaded_file = st.file_uploader("Upload an article",
+                                     type=("pdf", "txt"),
+                                     on_change=new_file,
                                      disabled=st.session_state['model'] is not None and st.session_state['model'] not in
                                               st.session_state['api_keys'],
                                      help="The full-text is extracted using Grobid. ")
@@ -331,7 +335,8 @@ if uploaded_file and not st.session_state.loaded_embeddings:
 
     st.session_state['doc_id'] = hash = st.session_state['rqa'][model].create_memory_embeddings(tmp_file.name,
                                                                                                  chunk_size=chunk_size,
+                                                                                                 perc_overlap=0.1,
+                                                                                                 include_biblio=True)
     st.session_state['loaded_embeddings'] = True
     st.session_state.messages = []
 
@@ -384,8 +389,7 @@ with right_column:
     elif mode == "LLM":
         with st.spinner("Generating response..."):
             _, text_response = st.session_state['rqa'][model].query_document(question, st.session_state.doc_id,
-                                                                              memory=st.session_state.memory)
+                                                                              context_size=context_size)
 
     if not text_response:
         st.error("Something went wrong. Contact Luca Foppiano ([email protected]) to report the issue.")
@@ -404,11 +408,11 @@ with right_column:
             st.write(text_response)
             st.session_state.messages.append({"role": "assistant", "mode": mode, "content": text_response})
 
+        # if len(st.session_state.messages) > 1:
+        #     last_answer = st.session_state.messages[len(st.session_state.messages)-1]
+        #     if last_answer['role'] == "assistant":
+        #         last_question = st.session_state.messages[len(st.session_state.messages)-2]
+        #         st.session_state.memory.save_context({"input": last_question['content']}, {"output": last_answer['content']})
 
 elif st.session_state.loaded_embeddings and st.session_state.doc_id:
     play_old_messages()