{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Awesum Care dataset upload flow.\n",
    "\n",
    "### This section describe the flow for turning information from text to vector for RAG. The vector db used below is locally hosted. To upload to the production, change the qdrant config.\n",
    "\n",
    "1. Put the data into a text file (.pdf/.docx/.txt/.md), then put then into a subdirectory. (/awesumcare_data in this example).\n",
    "2. Change them to embedding.\n",
    "3. Verify locally if works.\n",
    "\n",
    "(If without old data file)\n",
    "\n",
    "4. Create a duplicate of existing collection.\n",
    "5. Deploy new collection with snapshots and upload new data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt to\n",
      "[nltk_data]     C:\\Users\\josh\\AppData\\Roaming\\nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n",
      "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
      "[nltk_data]     C:\\Users\\josh\\AppData\\Roaming\\nltk_data...\n",
      "[nltk_data]   Package averaged_perceptron_tagger is already up-to-\n",
      "[nltk_data]       date!\n",
      "[nltk_data] Downloading package punkt to\n",
      "[nltk_data]     C:\\Users\\josh\\AppData\\Roaming\\nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n",
      "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
      "[nltk_data]     C:\\Users\\josh\\AppData\\Roaming\\nltk_data...\n",
      "[nltk_data]   Package averaged_perceptron_tagger is already up-to-\n",
      "[nltk_data]       date!\n",
      "[nltk_data] Downloading package punkt to\n",
      "[nltk_data]     C:\\Users\\josh\\AppData\\Roaming\\nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n",
      "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
      "[nltk_data]     C:\\Users\\josh\\AppData\\Roaming\\nltk_data...\n",
      "[nltk_data]   Package averaged_perceptron_tagger is already up-to-\n",
      "[nltk_data]       date!\n"
     ]
    }
   ],
   "source": [
    "from llama_index.core import SimpleDirectoryReader\n",
    "\n",
    "from custom_io import MarkdownReader, UnstructuredReader, default_file_metadata_func\n",
    "\n",
    "dir_reader = SimpleDirectoryReader(\n",
    "    \"./awesumcare_data\",\n",
    "    file_extractor={\n",
    "        \".pdf\": UnstructuredReader(),\n",
    "        \".docx\": UnstructuredReader(),\n",
    "        \".pptx\": UnstructuredReader(),\n",
    "        \".md\": MarkdownReader(),\n",
    "    },\n",
    "    recursive=True,\n",
    "    exclude=[\"*.png\", \"*.pptx\", \"*.docx\", \"*.pdf\"],\n",
    "    file_metadata=default_file_metadata_func,\n",
    ")\n",
    "\n",
    "documents = dir_reader.load_data()"
   ]
  },
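  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optional sanity check (a sketch, not part of the original flow): confirm the reader picked up the expected files before spending embedding tokens. It only uses the `documents` list loaded above; `file_name` is assumed to be set by the metadata function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quick look at what was loaded: document count and a sample of sources.\n",
    "print(f\"Loaded {len(documents)} documents\")\n",
    "for doc in documents[:3]:\n",
    "    print(doc.metadata.get(\"file_name\"), \"-\", len(doc.text), \"chars\")"
   ]
  },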
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the embedding client add feed to the IngestionPipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING:root:Payload indexes have no effect in the local Qdrant. Please use server Qdrant if you need payload indexes.\n"
     ]
    }
   ],
   "source": [
    "from llama_index.core import VectorStoreIndex\n",
    "from llama_index.core.ingestion import IngestionPipeline\n",
    "from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding\n",
    "from llama_index.vector_stores.qdrant import QdrantVectorStore\n",
    "\n",
    "\n",
    "import qdrant_client\n",
    "import nest_asyncio\n",
    "\n",
    "client = qdrant_client.QdrantClient(location=\":memory:\")\n",
    "vector_store = QdrantVectorStore(client=client, collection_name=\"test_store\")\n",
    "\n",
    "embedding_client = AzureOpenAIEmbedding(\n",
    "    deployment_name=\"text-embedding-ada-002\",\n",
    "    api_key=\"\",\n",
    "    azure_endpoint=\"\",\n",
    "    api_version=\"2024-02-01\",\n",
    ")\n",
    "\n",
    "pipeline = IngestionPipeline(\n",
    "    transformations=[\n",
    "        embedding_client,\n",
    "    ],\n",
    "    vector_store=vector_store,\n",
    ")\n",
    "\n",
    "# Need this for the code to run in my jupyter notebook. Not sure if needed in a different env.\n",
    "nest_asyncio.apply()\n",
    "\n",
    "# Ingest directly into a vector db\n",
    "pipeline.run(documents=documents)\n",
    "\n",
    "# Create your index\n",
    "index = VectorStoreIndex.from_vector_store(\n",
    "    vector_store=vector_store, embed_model=embedding_client\n",
    ")"
   ]
  },
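  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For steps 4-5 (duplicating the existing collection and deploying with snapshots), a minimal sketch is below. Snapshots only work against a server-hosted Qdrant, not the `:memory:` client used above, so the code is left commented out; the URL and the `test_store_v2` collection name are placeholder assumptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: assumes a server-hosted Qdrant at a placeholder URL.\n",
    "# prod_client = qdrant_client.QdrantClient(url=\"http://localhost:6333\")\n",
    "#\n",
    "# # Step 4: snapshot the existing collection before replacing it.\n",
    "# snapshot = prod_client.create_snapshot(collection_name=\"test_store\")\n",
    "# print(snapshot.name)\n",
    "#\n",
    "# # Step 5: restore the snapshot into a new collection on the target server.\n",
    "# prod_client.recover_snapshot(\n",
    "#     collection_name=\"test_store_v2\",\n",
    "#     location=f\"http://localhost:6333/collections/test_store/snapshots/{snapshot.name}\",\n",
    "# )"
   ]
  },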
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Verify embeeding result:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7. 見證人可以是親人嗎?\n",
      "\n",
      "    見證人不能是遺囑的受益人或受益人的配偶。如果親人是遺囑的見證人,那他們便不能在遺囑中有利益關係,意思是親人不能同時作為遺囑的見證人及受益人。見證人必須是年滿18歲且具有完全行為能力的成年人。\n",
      "18. 在遺囑裡分配共同擁有一物業(長命契),”如何分配所持有的物業”一欄應如何填寫?\n",
      "\n",
      "    在填寫遺囑時,如果涉及共同擁有的物業(例如長命契),需要特別注意如何分配這部分資產。共同擁有的物業通常有兩種形式:聯權共有(長命契)和分權共有(分權契)。\n",
      "    ### 聯權共有(長命契):\n",
      "    > - 在聯權共有的情況下,當其中一位擁有人去世,其持有的份額會自動轉移給其他聯權共有人,這稱為「生者繼承權」。這種情況下,該物業的份額通常不會在遺囑中分配,因為它自動轉移給其他共有人。\n",
      "    ### 分權共有(分權契):\n",
      "    > - 在分權共有的情況下,每位共有人擁有物業的特定份額,並且這些份額可以在遺囑中分配給指定的受益人。\n",
      "    > - 在填寫遺囑的「如何分配所持有的物業」一欄時,可以就分權共有的物業分配,寫上:我所持有的 [物業地址] 的 [百分比或具體份額],應分配給 [受益人姓名]。\n",
      "8. 遺囑一定需要見證人嗎?沒有見證人的遺囑有效嗎?\n",
      "\n",
      "    是的,遺囑必須有見證人。根據《遺囑條例》(第30章)第5(1)(c)條,一份具有法律效力的遺囑必須在兩名年滿18歲且非受益人的獨立見證人面前簽署和加上日期。如果遺囑沒有見證人,則該遺囑可能會被視為無效。請注意,如果立遺囑人年紀較大,或已開始有腦退化症狀,最好找醫生見證。此外,如果遺產承辦處對非律師製作的遺囑的真實性有任何疑問,可能會要求見證人簽署並作出有關見證遺囑的聲明。\n"
     ]
    }
   ],
   "source": [
    "from qdrant_client.http import models\n",
    "import random\n",
    "\n",
    "# You can comment out this line to reuse the same query vector after a run if you find the results unsatisfactory.\n",
    "# This allows you to compare the results after modify the document.\n",
    "query_vector=[random.random() for _ in range(1536)]\n",
    "\n",
    "res = client.search(\n",
    "    collection_name=\"test_store\",\n",
    "    search_params=models.SearchParams(hnsw_ef=128, exact=False),\n",
    "    # create a list of random float with 1536 elements\n",
    "    query_vector=query_vector,\n",
    "    limit=3,\n",
    ")\n",
    "\n",
    "# Need this line, or will have error: \"NameError: name 'null' is not defined\"\n",
    "null = None\n",
    "print(eval(res[0].payload[\"_node_content\"])[\"text\"])\n",
    "print(eval(res[1].payload[\"_node_content\"])[\"text\"])\n",
    "print(eval(res[2].payload[\"_node_content\"])[\"text\"])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}