File size: 12,097 Bytes
cfd3735
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9fc6205b",
   "metadata": {},
   "source": [
    "# Arxiv\n",
    "\n",
    ">[arXiv](https://arxiv.org/) is an open-access archive for 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.\n",
    "\n",
    "This notebook shows how to retrieve scientific articles from `Arxiv.org` into the Document format that is used downstream."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51489529-5dcd-4b86-bda6-de0a39d8ffd1",
   "metadata": {},
   "source": [
    "## Installation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1435c804-069d-4ade-9a7b-006b97b767c1",
   "metadata": {},
   "source": [
    "First, you need to install `arxiv` python package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a737220",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "#!pip install arxiv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c15470b-a16b-4e0d-bc6a-6998bafbb5a4",
   "metadata": {},
   "source": [
    "`ArxivRetriever` has these arguments:\n",
    "- optional `load_max_docs`: default=100. Use it to limit number of downloaded documents. It takes time to download all 100 documents, so use a small number for experiments. There is a hard limit of 300 for now.\n",
    "- optional `load_all_available_meta`: default=False. By default only the most important fields downloaded: `Published` (date when document was published/last updated), `Title`, `Authors`, `Summary`. If True, other fields also downloaded.\n",
    "\n",
    "`get_relevant_documents()` has one argument, `query`: free text which used to find documents in `Arxiv.org`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae3c3d16",
   "metadata": {},
   "source": [
    "## Examples"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fafb73b-d6ec-4822-b161-edf0aaf5224a",
   "metadata": {},
   "source": [
    "### Running retriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0e6f506",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.retrievers import ArxivRetriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "f381f642",
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = ArxivRetriever(load_max_docs=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "20ae1a74",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = retriever.get_relevant_documents(query='1605.08386')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "1d5a5088",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Published': '2016-05-26',\n",
       " 'Title': 'Heat-bath random walks with Markov bases',\n",
       " 'Authors': 'Caprice Stanley, Tobias Windisch',\n",
       " 'Summary': 'Graphs on lattice points are studied whose edges come from a finite set of\\nallowed moves of arbitrary length. We show that the diameter of these graphs on\\nfibers of a fixed integer matrix can be bounded from above by a constant. We\\nthen study the mixing behaviour of heat-bath random walks on these graphs. We\\nalso state explicit conditions on the set of moves so that the heat-bath random\\nwalk, a generalization of the Glauber dynamics, is an expander in fixed\\ndimension.'}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[0].metadata  # meta-information of the Document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "c0ccd0c7-f6a6-43e7-b842-5f57afb94224",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'arXiv:1605.08386v1  [math.CO]  26 May 2016\\nHEAT-BATH RANDOM WALKS WITH MARKOV BASES\\nCAPRICE STANLEY AND TOBIAS WINDISCH\\nAbstract. Graphs on lattice points are studied whose edges come from a 铿乶ite set of\\nallowed moves of arbitrary length. We show that the diameter of these graphs on 铿乥ers of a\\n铿亁ed integer matrix can be bounded from above by a constant. We then study the mixing\\nbehaviour of heat-b'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[0].page_content[:400]  # a content of the Document "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2670363b-3806-4c7e-b14d-90a4d5d2a200",
   "metadata": {},
   "source": [
    "### Question Answering on facts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "bb3601df-53ea-4826-bdbe-554387bc3ad4",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " 路路路路路路路路\n"
     ]
    }
   ],
   "source": [
    "# get a token: https://platform.openai.com/account/api-keys\n",
    "\n",
    "from getpass import getpass\n",
    "\n",
    "OPENAI_API_KEY = getpass()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "e9c1a114-0410-4804-be30-05f34a9760f9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "51a33cc9-ec42-4afc-8a2d-3bfff476aa59",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.chains import ConversationalRetrievalChain\n",
    "\n",
    "model = ChatOpenAI(model_name='gpt-3.5-turbo') # switch to 'gpt-4'\n",
    "qa = ConversationalRetrievalChain.from_llm(model,retriever=retriever)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "ea537767-a8bf-4adf-ae03-b353c9145d58",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-> **Question**: What are Heat-bath random walks with Markov base? \n",
      "\n",
      "**Answer**: I'm not sure, as I don't have enough context to provide a definitive answer. The term \"Heat-bath random walks with Markov base\" is not mentioned in the given text. Could you provide more information or context about where you encountered this term? \n",
      "\n",
      "-> **Question**: What is the ImageBind model? \n",
      "\n",
      "**Answer**: ImageBind is an approach developed by Facebook AI Research to learn a joint embedding across six different modalities, including images, text, audio, depth, thermal, and IMU data. The approach uses the binding property of images to align each modality's embedding to image embeddings and achieve an emergent alignment across all modalities. This enables novel multimodal capabilities, including cross-modal retrieval, embedding-space arithmetic, and audio-to-image generation, among others. The approach sets a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Additionally, it shows strong few-shot recognition results and serves as a new way to evaluate vision models for visual and non-visual tasks. \n",
      "\n",
      "-> **Question**: How does Compositional Reasoning with Large Language Models works? \n",
      "\n",
      "**Answer**: Compositional reasoning with large language models refers to the ability of these models to correctly identify and represent complex concepts by breaking them down into smaller, more basic parts and combining them in a structured way. This involves understanding the syntax and semantics of language and using that understanding to build up more complex meanings from simpler ones. \n",
      "\n",
      "In the context of the paper \"Does CLIP Bind Concepts? Probing Compositionality in Large Image Models\", the authors focus specifically on the ability of a large pretrained vision and language model (CLIP) to encode compositional concepts and to bind variables in a structure-sensitive way. They examine CLIP's ability to compose concepts in a single-object setting, as well as in situations where concept binding is needed. \n",
      "\n",
      "The authors situate their work within the tradition of research on compositional distributional semantics models (CDSMs), which seek to bridge the gap between distributional models and formal semantics by building architectures which operate over vectors yet still obey traditional theories of linguistic composition. They compare the performance of CLIP with several architectures from research on CDSMs to evaluate its ability to encode and reason about compositional concepts. \n",
      "\n"
     ]
    }
   ],
   "source": [
    "questions = [\n",
    "    \"What are Heat-bath random walks with Markov base?\",\n",
    "    \"What is the ImageBind model?\",\n",
    "    \"How does Compositional Reasoning with Large Language Models works?\",   \n",
    "] \n",
    "chat_history = []\n",
    "\n",
    "for question in questions:  \n",
    "    result = qa({\"question\": question, \"chat_history\": chat_history})\n",
    "    chat_history.append((question, result['answer']))\n",
    "    print(f\"-> **Question**: {question} \\n\")\n",
    "    print(f\"**Answer**: {result['answer']} \\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "8e0c3fc6-ae62-4036-a885-dc60176a7745",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-> **Question**: What are Heat-bath random walks with Markov base? Include references to answer. \n",
      "\n",
      "**Answer**: Heat-bath random walks with Markov base (HB-MB) is a class of stochastic processes that have been studied in the field of statistical mechanics and condensed matter physics. In these processes, a particle moves in a lattice by making a transition to a neighboring site, which is chosen according to a probability distribution that depends on the energy of the particle and the energy of its surroundings.\n",
      "\n",
      "The HB-MB process was introduced by Bortz, Kalos, and Lebowitz in 1975 as a way to simulate the dynamics of interacting particles in a lattice at thermal equilibrium. The method has been used to study a variety of physical phenomena, including phase transitions, critical behavior, and transport properties.\n",
      "\n",
      "References:\n",
      "\n",
      "Bortz, A. B., Kalos, M. H., & Lebowitz, J. L. (1975). A new algorithm for Monte Carlo simulation of Ising spin systems. Journal of Computational Physics, 17(1), 10-18.\n",
      "\n",
      "Binder, K., & Heermann, D. W. (2010). Monte Carlo simulation in statistical physics: an introduction. Springer Science & Business Media. \n",
      "\n"
     ]
    }
   ],
   "source": [
    "questions = [\n",
    "    \"What are Heat-bath random walks with Markov base? Include references to answer.\",\n",
    "] \n",
    "chat_history = []\n",
    "\n",
    "for question in questions:  \n",
    "    result = qa({\"question\": question, \"chat_history\": chat_history})\n",
    "    chat_history.append((question, result['answer']))\n",
    "    print(f\"-> **Question**: {question} \\n\")\n",
    "    print(f\"**Answer**: {result['answer']} \\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "09794ab5-759c-4b56-95d4-2454d4d86da1",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}