File size: 19,477 Bytes
cfd3735
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fc0db1bc",
   "metadata": {},
   "source": [
    "# Contextual Compression Retriever\n",
    "\n",
    "This notebook introduces the concept of DocumentCompressors and the ContextualCompressionRetriever. The core idea is simple: given a specific query, we should be able to return only the documents relevant to that query, and only the parts of those documents that are relevant. The ContextualCompressionsRetriever is a wrapper for another retriever that iterates over the initial output of the base retriever and filters and compresses those initial documents, so that only the most relevant information is returned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "28e8dc12",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Helper function for printing docs\n",
    "\n",
    "def pretty_print_docs(docs):\n",
    "    print(f\"\\n{'-' * 100}\\n\".join([f\"Document {i+1}:\\n\\n\" + d.page_content for i, d in enumerate(docs)]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fa3d916",
   "metadata": {},
   "source": [
    "## Using a vanilla vector store retriever\n",
    "Let's start by initializing a simple vector store retriever and storing the 2023 State of the Union speech (in chunks). We can see that given an example question our retriever returns one or two relevant docs and a few irrelevant docs. And even the relevant docs have a lot of irrelevant information in them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "9fbcc58f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document 1:\n",
      "\n",
      "Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
      "\n",
      "Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
      "\n",
      "One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
      "\n",
      "And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 2:\n",
      "\n",
      "A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n",
      "\n",
      "And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n",
      "\n",
      "We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.  \n",
      "\n",
      "We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.  \n",
      "\n",
      "We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n",
      "\n",
      "We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 3:\n",
      "\n",
      "And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong. \n",
      "\n",
      "As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. \n",
      "\n",
      "While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice. \n",
      "\n",
      "And soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things. \n",
      "\n",
      "So tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together.  \n",
      "\n",
      "First, beat the opioid epidemic.\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 4:\n",
      "\n",
      "Tonight, I’m announcing a crackdown on these companies overcharging American businesses and consumers. \n",
      "\n",
      "And as Wall Street firms take over more nursing homes, quality in those homes has gone down and costs have gone up.  \n",
      "\n",
      "That ends on my watch. \n",
      "\n",
      "Medicare is going to set higher standards for nursing homes and make sure your loved ones get the care they deserve and expect. \n",
      "\n",
      "We’ll also cut costs and keep the economy going strong by giving workers a fair shot, provide more training and apprenticeships, hire them based on their skills not degrees. \n",
      "\n",
      "Let’s pass the Paycheck Fairness Act and paid leave.  \n",
      "\n",
      "Raise the minimum wage to $15 an hour and extend the Child Tax Credit, so no one has to raise a family in poverty. \n",
      "\n",
      "Let’s increase Pell Grants and increase our historic support of HBCUs, and invest in what Jill—our First Lady who teaches full-time—calls America’s best-kept secret: community colleges.\n"
     ]
    }
   ],
   "source": [
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "from langchain.embeddings import OpenAIEmbeddings\n",
    "from langchain.document_loaders import TextLoader\n",
    "from langchain.vectorstores import FAISS\n",
    "\n",
    "documents = TextLoader('../../../state_of_the_union.txt').load()\n",
    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "texts = text_splitter.split_documents(documents)\n",
    "retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever()\n",
    "\n",
    "docs = retriever.get_relevant_documents(\"What did the president say about Ketanji Brown Jackson\")\n",
    "pretty_print_docs(docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7648612",
   "metadata": {},
   "source": [
    "## Adding contextual compression with an `LLMChainExtractor`\n",
    "Now let's wrap our base retriever with a `ContextualCompressionRetriever`. We'll add an `LLMChainExtractor`, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "9a658023",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document 1:\n",
      "\n",
      "\"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
      "\n",
      "And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\"\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 2:\n",
      "\n",
      "\"A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.\"\n"
     ]
    }
   ],
   "source": [
    "from langchain.llms import OpenAI\n",
    "from langchain.retrievers import ContextualCompressionRetriever\n",
    "from langchain.retrievers.document_compressors import LLMChainExtractor\n",
    "\n",
    "llm = OpenAI(temperature=0)\n",
    "compressor = LLMChainExtractor.from_llm(llm)\n",
    "compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)\n",
    "\n",
    "compressed_docs = compression_retriever.get_relevant_documents(\"What did the president say about Ketanji Jackson Brown\")\n",
    "pretty_print_docs(compressed_docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2cd38f3a",
   "metadata": {},
   "source": [
    "## More built-in compressors: filters\n",
    "### `LLMChainFilter`\n",
    "The `LLMChainFilter` is slightly simpler but more robust compressor that uses an LLM chain to decide which of the initially retrieved documents to filter out and which ones to return, without manipulating the document contents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "b216a767",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document 1:\n",
      "\n",
      "Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
      "\n",
      "Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
      "\n",
      "One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
      "\n",
      "And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n"
     ]
    }
   ],
   "source": [
    "from langchain.retrievers.document_compressors import LLMChainFilter\n",
    "\n",
    "_filter = LLMChainFilter.from_llm(llm)\n",
    "compression_retriever = ContextualCompressionRetriever(base_compressor=_filter, base_retriever=retriever)\n",
    "\n",
    "compressed_docs = compression_retriever.get_relevant_documents(\"What did the president say about Ketanji Jackson Brown\")\n",
    "pretty_print_docs(compressed_docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c709598",
   "metadata": {},
   "source": [
    "### `EmbeddingsFilter`\n",
    "\n",
    "Making an extra LLM call over each retrieved document is expensive and slow. The `EmbeddingsFilter` provides a cheaper and faster option by embedding the documents and query and only returning those documents which have sufficiently similar embeddings to the query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "6fbc801f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document 1:\n",
      "\n",
      "Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
      "\n",
      "Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
      "\n",
      "One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
      "\n",
      "And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 2:\n",
      "\n",
      "A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n",
      "\n",
      "And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n",
      "\n",
      "We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.  \n",
      "\n",
      "We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.  \n",
      "\n",
      "We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n",
      "\n",
      "We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 3:\n",
      "\n",
      "And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong. \n",
      "\n",
      "As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. \n",
      "\n",
      "While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice. \n",
      "\n",
      "And soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things. \n",
      "\n",
      "So tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together.  \n",
      "\n",
      "First, beat the opioid epidemic.\n"
     ]
    }
   ],
   "source": [
    "from langchain.embeddings import OpenAIEmbeddings\n",
    "from langchain.retrievers.document_compressors import EmbeddingsFilter\n",
    "\n",
    "embeddings = OpenAIEmbeddings()\n",
    "embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)\n",
    "compression_retriever = ContextualCompressionRetriever(base_compressor=embeddings_filter, base_retriever=retriever)\n",
    "\n",
    "compressed_docs = compression_retriever.get_relevant_documents(\"What did the president say about Ketanji Jackson Brown\")\n",
    "pretty_print_docs(compressed_docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07365d36",
   "metadata": {},
   "source": [
    "# Stringing compressors and document transformers together\n",
    "Using the `DocumentCompressorPipeline` we can also easily combine multiple compressors in sequence. Along with compressors we can add `BaseDocumentTransformer`s to our pipeline, which don't perform any contextual compression but simply perform some transformation on a set of documents. For example `TextSplitter`s can be used as document transformers to split documents into smaller pieces, and the `EmbeddingsRedundantFilter` can be used to filter out redundant documents based on embedding similarity between documents.\n",
    "\n",
    "Below we create a compressor pipeline by first splitting our docs into smaller chunks, then removing redundant documents, and then filtering based on relevance to the query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "2a150a63",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_transformers import EmbeddingsRedundantFilter\n",
    "from langchain.retrievers.document_compressors import DocumentCompressorPipeline\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "\n",
    "splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0, separator=\". \")\n",
    "redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)\n",
    "relevant_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)\n",
    "pipeline_compressor = DocumentCompressorPipeline(\n",
    "    transformers=[splitter, redundant_filter, relevant_filter]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "3ceab64a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document 1:\n",
      "\n",
      "One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
      "\n",
      "And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 2:\n",
      "\n",
      "As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. \n",
      "\n",
      "While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 3:\n",
      "\n",
      "A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder\n"
     ]
    }
   ],
   "source": [
    "compression_retriever = ContextualCompressionRetriever(base_compressor=pipeline_compressor, base_retriever=retriever)\n",
    "\n",
    "compressed_docs = compression_retriever.get_relevant_documents(\"What did the president say about Ketanji Jackson Brown\")\n",
    "pretty_print_docs(compressed_docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8cfd9fc5",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}