File size: 5,113 Bytes
cfd3735
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Apify Dataset\n",
    "\n",
    ">[Apify Dataset](https://docs.apify.com/platform/storage/dataset) is a scaleable append-only storage with sequential access built for storing structured web scraping results, such as a list of products or Google SERPs, and then export them to various formats like JSON, CSV, or Excel. Datasets are mainly used to save results of [Apify Actors](https://apify.com/store)—serverless cloud programs for varius web scraping, crawling, and data extraction use cases.\n",
    "\n",
    "This notebook shows how to load Apify datasets to LangChain.\n",
    "\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "You need to have an existing dataset on the Apify platform. If you don't have one, please first check out [this notebook](../../../agents/tools/examples/apify.ipynb) on how to use Apify to extract content from documentation, knowledge bases, help centers, or blogs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "#!pip install apify-client"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, import `ApifyDatasetLoader` into your source code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import ApifyDatasetLoader\n",
    "from langchain.document_loaders.base import Document"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then provide a function that maps Apify dataset record fields to LangChain `Document` format.\n",
    "\n",
    "For example, if your dataset items are structured like this:\n",
    "\n",
    "```json\n",
    "{\n",
    "    \"url\": \"https://apify.com\",\n",
    "    \"text\": \"Apify is the best web scraping and automation platform.\"\n",
    "}\n",
    "```\n",
    "\n",
    "The mapping function in the code below will convert them to LangChain `Document` format, so that you can use them further with any LLM model (e.g. for question answering)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = ApifyDatasetLoader(\n",
    "    dataset_id=\"your-dataset-id\",\n",
    "    dataset_mapping_function=lambda dataset_item: Document(\n",
    "        page_content=dataset_item[\"text\"], metadata={\"source\": dataset_item[\"url\"]}\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## An example with question answering\n",
    "\n",
    "In this example, we use data from a dataset to answer a question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.docstore.document import Document\n",
    "from langchain.document_loaders import ApifyDatasetLoader\n",
    "from langchain.indexes import VectorstoreIndexCreator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = ApifyDatasetLoader(\n",
    "    dataset_id=\"your-dataset-id\",\n",
    "    dataset_mapping_function=lambda item: Document(\n",
    "        page_content=item[\"text\"] or \"\", metadata={\"source\": item[\"url\"]}\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "index = VectorstoreIndexCreator().from_loaders([loader])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"What is Apify?\"\n",
    "result = index.query_with_sources(query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " Apify is a platform for developing, running, and sharing serverless cloud programs. It enables users to create web scraping and automation tools and publish them on the Apify platform.\n",
      "\n",
      "https://docs.apify.com/platform/actors, https://docs.apify.com/platform/actors/running/actors-in-store, https://docs.apify.com/platform/security, https://docs.apify.com/platform/actors/examples\n"
     ]
    }
   ],
   "source": [
    "print(result[\"answer\"])\n",
    "print(result[\"sources\"])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}