hugging2021 committed on
Commit 5f3b20a · verified · 1 Parent(s): 1edba9f

Upload folder using huggingface_hub

Files changed (22)
  1. .env +1 -0
  2. .gitattributes +5 -33
  3. .gitignore +8 -0
  4. Dockerfile +14 -0
  5. README.md +245 -10
  6. concat_vector_store.py +46 -0
  7. concat_vector_store_์ •๋ฆฌ๋œ.py +55 -0
  8. dataset/์ถœ๋ ฅ HWPํŒŒ์ผ ์–‘์‹/(๋™ํ–ฅ๋ณด๊ณ ) ๊ต์œก๋ถ€ K-์—๋“€ํŒŒ์ธ ์‹œ์Šคํ…œ ์ „๋ถ ๋ฐ์ดํ„ฐ์„ผํ„ฐ ํƒ‘์žฌ ๊ฒ€ํ† .hwp +3 -0
  9. dataset/์ถœ๋ ฅ HWPํŒŒ์ผ ์–‘์‹/23.05.10 ์ „๋ผ๋ถ๋„ ๋ฐ์ดํ„ฐ์„ผํ„ฐ ๊ฑด๋ฆฝ ๊ฐ€๋Šฅ ๋ถ€์ง€.hwp +3 -0
  10. dataset/์ถœ๋ ฅ HWPํŒŒ์ผ ์–‘์‹/25.02.28 ํ–ฅํ›„ ๊ณต๊ณต ๋ฏผ๊ฐ„๋ฌผ๋Ÿ‰ ๋…ธ๋ ฅ ํฌ์ธํŠธ.hwp +3 -0
  11. dataset/์ถœ๋ ฅ HWPํŒŒ์ผ ์–‘์‹/25.03.07 ์ƒ์„ฑํ˜• AI ์‹œ์Šคํ…œ ๊ตฌ์ถ•์„ ์œ„ํ•œ ์—…๋ฌดํ˜‘์•ฝ์‹ ๊ณ„ํš.hwp +3 -0
  12. dataset/์ถœ์žฅ๊ฒฐ๊ณผ๋ณด๊ณ /(1) 24.08.21 ์นด์นด์˜ค ์•„ํ†  ๋…น์Œ ํ’€๋ณธ1.txt +0 -0
  13. dataset/์ถœ์žฅ๊ฒฐ๊ณผ๋ณด๊ณ /(4) 24.08.21 ์นด์นด์˜ค,์•„ํ†  ๋ฉด๋‹ด ๊ฒฐ๊ณผ๋ณด๊ณ F.hwp +3 -0
  14. docker-compose.yml +12 -0
  15. document_processor_image_test.py +440 -0
  16. e5_embeddings.py +9 -0
  17. llm_loader.py +24 -0
  18. rag_server.py +197 -0
  19. rag_system.py +227 -0
  20. requirements.txt +18 -0
  21. vector_store.py +104 -0
  22. vector_store_test.py +121 -0
.env ADDED
@@ -0,0 +1 @@
+ HUGGINGFACE_TOKEN=<Huggingface_Token>
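The commit does not show where this variable is consumed; one way it could be picked up at runtime is sketched below, assuming the `python-dotenv` and `huggingface_hub` packages, neither of which is listed in `requirements.txt`.

```python
import os

from dotenv import load_dotenv        # assumption: python-dotenv is installed
from huggingface_hub import login     # assumption: huggingface_hub is installed

load_dotenv()                          # reads HUGGINGFACE_TOKEN from the .env file
token = os.environ.get("HUGGINGFACE_TOKEN")
if token:
    login(token=token)                 # authenticates Hub downloads for gated models
```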
.gitattributes CHANGED
@@ -1,35 +1,7 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.faiss filter=lfs diff=lfs merge=lfs -text
  *.pkl filter=lfs diff=lfs merge=lfs -text
  *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.pdf filter=lfs diff=lfs merge=lfs -text
+ vector_db/*.faiss filter=lfs diff=lfs merge=lfs -text
+ vector_db/*.pkl filter=lfs diff=lfs merge=lfs -text
+ dataset/* filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,8 @@
+ vector_db/
+ *.index
+ *.faiss
+ *.pkl
+ *.pdf
+ *.faiss
+ *.pkl
+ *.hwpx
Dockerfile ADDED
@@ -0,0 +1,14 @@
+ FROM python:3.10
+
+ WORKDIR /app
+ COPY . /app
+
+ # Create the /tmp directory and grant permissions
+ RUN mkdir -p /tmp && chmod 1777 /tmp
+
+ # Run pip install with the TMPDIR environment variable set
+ RUN TMPDIR=/tmp pip install --upgrade pip && TMPDIR=/tmp pip install --no-cache-dir -r requirements.txt
+
+ EXPOSE 8500
+
+ CMD ["uvicorn", "rag_server:app", "--host", "0.0.0.0", "--port", "8500"]
README.md CHANGED
@@ -1,10 +1,245 @@
- ---
- title: Open Webui Rag System
- emoji: 📊
- colorFrom: green
- colorTo: yellow
- sdk: docker
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Open WebUI RAG System
+
+ A Korean document-based RAG (Retrieval-Augmented Generation) system that can be connected to Open WebUI. It supports PDF and HWPX files and provides accurate page-level information extraction and source tracking.
+
+ ## Key Features
+
+ ### 1. Document Processing
+ - **PDF documents**: PyMuPDF-based extraction of text, tables, and image OCR
+ - **HWPX documents**: per-section extraction of text, tables, and images via XML parsing
+ - **Page-level processing**: each document is split precisely into page/section units
+ - **Multiple content types**: body text, tables, and OCR text are identified and handled separately
+
+ ### 2. Vector Search
+ - **E5-Large embeddings**: a high-performance multilingual embedding model
+ - **FAISS vector store**: fast similarity search
+ - **Batch processing**: optimized for large document collections
+ - **Chunk splitting**: overlapping chunks to preserve context
+
+ ### 3. RAG System
+ - **Refine chain**: accurate answers built up across multiple referenced documents
+ - **Source tracking**: precise citations including page numbers and document names
+ - **Hallucination prevention**: strict prompts that use only information stated in the documents
+
+ ### 4. API Server
+ - **FastAPI-based**: supports asynchronous processing
+ - **OpenAI-compatible**: exposes a `/v1/chat/completions` endpoint
+ - **Streaming support**: real-time answer generation
+ - **Open WebUI integration**: connects directly, no plugin required
+
+ ## System Requirements
+
+ ### Hardware
+ - **GPU**: CUDA support (for embedding and LLM inference)
+ - **RAM**: at least 16 GB (more for large document sets)
+ - **Storage**: 10 GB+ for models and the vector store
+
+ ### Software
+ - Python 3.8+
+ - CUDA 11.7+ (when using a GPU)
+ - Tesseract OCR
+
+ ## Installation
+
+ ### 1. Clone the repository
+ ```bash
+ git clone <repository-url>
+ cd open-webui-rag-system
+ ```
+
+ ### 2. Install dependencies
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ### 3. Install Tesseract OCR
+ **Ubuntu/Debian:**
+ ```bash
+ sudo apt-get install tesseract-ocr tesseract-ocr-kor
+ ```
+
+ **Windows:**
+ - Install from the [Tesseract official page](https://github.com/UB-Mannheim/tesseract/wiki)
+
+ ### 4. Configure the LLM server
+ Set the LLM server to use in `llm_loader.py`:
+ ```python
+ # Example using the EXAONE model
+ base_url="http://vllm:8000/v1"
+ model="LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct"
+ openai_api_key="token-abc123"
+ ```
+
+ ## Running the System
+
+ ### 1. Prepare documents
+ Place the documents to process in the `dataset_test` folder:
+ ```
+ dataset_test/
+ ├── document1.pdf
+ ├── document2.hwpx
+ └── document3.pdf
+ ```
+
+ ### 2. Process documents and build the vector store
+ ```bash
+ python document_processor_image_test.py
+ ```
+ Or use the vector store build script:
+ ```bash
+ python vector_store_test.py --folder dataset_test --save_path faiss_index_pymupdf
+ ```
+
+ ### 3. Run the RAG server
+ ```bash
+ python rag_server.py
+ ```
+ The server listens on port 8000 by default.
+
+ ### 4. Connect to Open WebUI
+ In Open WebUI's model settings, configure:
+ - **API Base URL**: `http://localhost:8000/v1`
+ - **API Key**: `token-abc123`
+ - **Model Name**: `rag`
+
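Programmatic access works with any OpenAI-compatible client as well; a minimal sketch using the `openai` Python package (listed in `requirements.txt`), with the base URL and key from the settings above:

```python
from openai import OpenAI

# Point an OpenAI-compatible client at the local RAG server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

resp = client.chat.completions.create(
    model="rag",
    messages=[{"role": "user", "content": "What is the current budget status?"}],
    stream=False,
)
print(resp.choices[0].message.content)
```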
+ ### 5. Standalone testing
+ Ask a question directly from the command line:
+ ```bash
+ python rag_system.py --query "what you want to find in the documents"
+ ```
+
+ Interactive mode:
+ ```bash
+ python rag_system.py
+ ```
+
+ ## Project Structure
+
+ ```
+ open-webui-rag-system/
+ ├── document_processor_image_test.py  # main document processing module
+ ├── vector_store_test.py              # vector store build module
+ ├── rag_system.py                     # RAG chain construction and Q&A
+ ├── rag_server.py                     # FastAPI server
+ ├── llm_loader.py                     # LLM model loader
+ ├── e5_embeddings.py                  # E5 embedding module
+ ├── requirements.txt                  # dependency list
+ ├── dataset_test/                     # document folder
+ └── faiss_index_pymupdf/              # generated vector store
+ ```
+
+ ## Core Modules
+
+ ### document_processor_image_test.py
+ - Extracts text, tables, and images from PDF and HWPX files page by page
+ - Multi-layer processing with PyMuPDF, pdfplumber, and pytesseract
+ - Preserves per-section metadata and page information
+
+ ### vector_store_test.py
+ - Vectorization with the E5-Large embedding model
+ - Efficient vector store construction with FAISS
+ - Memory optimization through batch processing
+
+ ### rag_system.py
+ - Multi-stage answer generation using a refine chain
+ - Prompts that prevent page-number hallucination
+ - Source tracking and metadata management
+
+ ### rag_server.py
+ - Provides OpenAI-compatible API endpoints
+ - Supports streaming responses
+ - Seamless integration with Open WebUI
+
+ ## Configuration Options
+
+ ### Document processing options
+ - **Chunk size**: `chunk_size=800` (default)
+ - **Chunk overlap**: `chunk_overlap=100` (default)
+ - **OCR languages**: `lang='kor+eng'` (Korean + English)
+
+ ### Retrieval options
+ - **Number of retrieved documents**: `k=7` (default)
+ - **Embedding model**: `intfloat/multilingual-e5-large-instruct`
+ - **Device**: `cuda` or `cpu`
+
+ ### LLM settings
+ Supported models:
+ - LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct
+ - meta-llama/Meta-Llama-3-8B-Instruct
+ - Other OpenAI-compatible models
+
+ ## Troubleshooting
+
+ ### 1. Out of CUDA memory
+ ```bash
+ # Run in CPU mode
+ python vector_store_test.py --device cpu
+ ```
+
+ ### 2. Korean font issues
+ ```bash
+ # Install Korean fonts (Ubuntu)
+ sudo apt-get install fonts-nanum
+ ```
+
+ ### 3. Tesseract path issues
+ ```python
+ # Manually set the pytesseract binary path
+ pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'
+ ```
+
+ ### 4. Model download failures
+ ```bash
+ # Point the Hugging Face cache to a writable path
+ export HF_HOME=/path/to/huggingface/cache
+ ```
+
+ ## API Usage Examples
+
+ ### Direct query
+ ```bash
+ curl -X POST "http://localhost:8000/ask" \
+      -H "Content-Type: application/json" \
+      -d '{"question": "Find the budget-related content in the documents"}'
+ ```
+
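The `/ask` endpoint can also be called from Python; a short sketch assuming the `requests` package (not pinned in `requirements.txt`). The response shape, an `answer` string plus a `sources` list with `filename` and `page` fields, follows `rag_server.py`:

```python
import requests

resp = requests.post(
    "http://localhost:8000/ask",
    json={"question": "Find the budget-related content in the documents"},
    timeout=120,
)
data = resp.json()
print(data["answer"])
for src in data["sources"]:
    print(src["filename"], src["page"])
```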
+ ### OpenAI-compatible API
+ ```bash
+ curl -X POST "http://localhost:8000/v1/chat/completions" \
+      -H "Content-Type: application/json" \
+      -d '{
+        "model": "rag",
+        "messages": [{"role": "user", "content": "What is the current budget status?"}],
+        "stream": false
+      }'
+ ```
+
+ ## Performance Tuning
+
+ ### 1. Adjust the batch size
+ ```bash
+ python vector_store_test.py --batch_size 32  # adjust to available GPU memory
+ ```
+
+ ### 2. Optimize the chunk size
+ ```python
+ # Increase the chunk size for long documents
+ chunks = split_documents(docs, chunk_size=800, chunk_overlap=150)
+ ```
+
+ ### 3. Adjust the number of retrieved documents
+ ```bash
+ python rag_system.py --k 10  # consult more documents
+ ```
+
+ ## License
+
+ MIT License
+
+ ## Contributing
+
+ 1. Fork the repository
+ 2. Create your feature branch
+ 3. Commit your changes
+ 4. Push to the branch
+ 5. Create a Pull Request
concat_vector_store.py ADDED
@@ -0,0 +1,46 @@
+ import os
+ from langchain.schema.document import Document
+ from e5_embeddings import E5Embeddings
+ from langchain_community.vectorstores import FAISS
+
+ from document_processor_image import load_documents, split_documents  # these functions are required
+
+ # Path settings
+ NEW_FOLDER = "25.05.28 RAG์šฉ 2์ฐจ ์—…๋ฌดํŽธ๋žŒ ์ทจํ•ฉ๋ณธ"
+ #NEW_FOLDER = "์ž„์‹œ"
+ VECTOR_STORE_PATH = "vector_db"
+
+ # 1. Load the embedding model
+ def get_embeddings(model_name="intfloat/multilingual-e5-large-instruct", device="cuda"):
+     return E5Embeddings(
+         model_name=model_name,
+         model_kwargs={'device': device},
+         encode_kwargs={'normalize_embeddings': True}
+     )
+
+ # 2. Load the existing vector store
+ def load_vector_store(embeddings, load_path="vector_db"):
+     if not os.path.exists(load_path):
+         raise FileNotFoundError(f"Vector store not found: {load_path}")
+     return FAISS.load_local(load_path, embeddings, allow_dangerous_deserialization=True)
+
+ # 3. Embed and add the new documents
+ def add_new_documents_to_vector_store(new_folder, vectorstore, embeddings):
+     print(f"Loading new documents from: {new_folder}")
+     new_docs = load_documents(new_folder)
+     new_chunks = split_documents(new_docs, chunk_size=800, chunk_overlap=100)
+
+     print(f"New chunks: {len(new_chunks)}")
+     print(f"Vectors before adding: {vectorstore.index.ntotal}")
+     vectorstore.add_documents(new_chunks)
+     print(f"Vectors after adding: {vectorstore.index.ntotal}")
+
+     print("New documents have been added to the vector store.")
+
+ # 4. Run
+ if __name__ == "__main__":
+     embeddings = get_embeddings()
+     vectorstore = load_vector_store(embeddings, VECTOR_STORE_PATH)
+     add_new_documents_to_vector_store(NEW_FOLDER, vectorstore, embeddings)
+     vectorstore.save_local(VECTOR_STORE_PATH)
+     print(f"Vector store saved: {VECTOR_STORE_PATH}")
concat_vector_store_์ •๋ฆฌ๋œ.py ADDED
@@ -0,0 +1,55 @@
+ import os
+ import glob
+ from langchain.schema.document import Document
+ from e5_embeddings import E5Embeddings
+ from langchain_community.vectorstores import FAISS
+ from document_processor import load_pdf_with_pymupdf, split_documents
+
+ # Path settings
+ FOLDER = "25.05.28 RAG์šฉ 2์ฐจ ์—…๋ฌดํŽธ๋žŒ ์ทจํ•ฉ๋ณธ"
+ VECTOR_STORE_PATH = "vector_db"
+
+ # 1. Load the embedding model
+ def get_embeddings(model_name="intfloat/multilingual-e5-large-instruct", device="cuda"):
+     return E5Embeddings(
+         model_name=model_name,
+         model_kwargs={'device': device},
+         encode_kwargs={'normalize_embeddings': True}
+     )
+
+ # 2. Load the existing vector store
+ def load_vector_store(embeddings, load_path=VECTOR_STORE_PATH):
+     if not os.path.exists(load_path):
+         raise FileNotFoundError(f"Vector store not found: {load_path}")
+     return FAISS.load_local(load_path, embeddings, allow_dangerous_deserialization=True)
+
+ # 3. Embed only the PDFs whose filenames match the cleaned-up pattern
+ def embed_cleaned_pdfs(folder, vectorstore, embeddings):
+     pattern = os.path.join(folder, "์ •๋ฆฌ๋œ*.pdf")
+     pdf_files = glob.glob(pattern)
+     print(f"Target PDFs: {len(pdf_files)}")
+
+     new_documents = []
+     for pdf_path in pdf_files:
+         print(f"Processing: {pdf_path}")
+         text = load_pdf_with_pymupdf(pdf_path)
+         if text.strip():
+             new_documents.append(Document(page_content=text, metadata={"source": pdf_path}))
+
+     print(f"Documents: {len(new_documents)}")
+
+     chunks = split_documents(new_documents, chunk_size=300, chunk_overlap=50)
+     print(f"Chunks: {len(chunks)}")
+
+     print(f"Vectors before adding: {vectorstore.index.ntotal}")
+     vectorstore.add_documents(chunks)
+     print(f"Vectors after adding: {vectorstore.index.ntotal}")
+
+     vectorstore.save_local(VECTOR_STORE_PATH)
+     print(f"Saved: {VECTOR_STORE_PATH}")
+
+ # Run
+ if __name__ == "__main__":
+     embeddings = get_embeddings()
+     vectorstore = load_vector_store(embeddings)
+     embed_cleaned_pdfs(FOLDER, vectorstore, embeddings)
dataset/์ถœ๋ ฅ HWPํŒŒ์ผ ์–‘์‹/(๋™ํ–ฅ๋ณด๊ณ ) ๊ต์œก๋ถ€ K-์—๋“€ํŒŒ์ธ ์‹œ์Šคํ…œ ์ „๋ถ ๋ฐ์ดํ„ฐ์„ผํ„ฐ ํƒ‘์žฌ ๊ฒ€ํ† .hwp ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b9ffae391271afee3a1d65cf2a46c58eabeca3ba9305ac0a987fb034e63b1708
+ size 110080
dataset/์ถœ๋ ฅ HWPํŒŒ์ผ ์–‘์‹/23.05.10 ์ „๋ผ๋ถ๋„ ๋ฐ์ดํ„ฐ์„ผํ„ฐ ๊ฑด๋ฆฝ ๊ฐ€๋Šฅ ๋ถ€์ง€.hwp ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2c8bbb64a8ec39a2bfcc373ba0b7bacb4dc5fb25200c2eda08a3abcea733368f
+ size 651264
dataset/์ถœ๋ ฅ HWPํŒŒ์ผ ์–‘์‹/25.02.28 ํ–ฅํ›„ ๊ณต๊ณต ๋ฏผ๊ฐ„๋ฌผ๋Ÿ‰ ๋…ธ๋ ฅ ํฌ์ธํŠธ.hwp ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2a89a3d34664fd68852dc1c4129efe73752dac17272d8238653b80d49309f88b
+ size 101376
dataset/์ถœ๋ ฅ HWPํŒŒ์ผ ์–‘์‹/25.03.07 ์ƒ์„ฑํ˜• AI ์‹œ์Šคํ…œ ๊ตฌ์ถ•์„ ์œ„ํ•œ ์—…๋ฌดํ˜‘์•ฝ์‹ ๊ณ„ํš.hwp ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:39286f70b90a2e4207947785576467f99cf33081aed24edd0f588c8d10f07cbc
+ size 817152
dataset/์ถœ์žฅ๊ฒฐ๊ณผ๋ณด๊ณ /(1) 24.08.21 ์นด์นด์˜ค ์•„ํ†  ๋…น์Œ ํ’€๋ณธ1.txt ADDED
The diff for this file is too large to render. See raw diff
 
dataset/์ถœ์žฅ๊ฒฐ๊ณผ๋ณด๊ณ /(4) 24.08.21 ์นด์นด์˜ค,์•„ํ†  ๋ฉด๋‹ด ๊ฒฐ๊ณผ๋ณด๊ณ F.hwp ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:930e7fde652397ee17ea6cfbbe1b993b7166fc3e350f9aa6c999f982baad3944
+ size 120832
docker-compose.yml ADDED
@@ -0,0 +1,12 @@
+ version: '3.8'
+
+ services:
+   rag-api:
+     build: .
+     ports:
+       - "8500:8500"
+     volumes:
+       - ./dataset:/app/dataset
+     environment:
+       - PYTHONPATH=/app
+     command: uvicorn rag_server:app --host 0.0.0.0 --port 8500 --reload
document_processor_image_test.py ADDED
@@ -0,0 +1,440 @@
1
+ import os
2
+ import re
3
+ import glob
4
+ import time
5
+ from collections import defaultdict
6
+
7
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
8
+ from langchain_core.documents import Document
9
+ from langchain_community.embeddings import HuggingFaceEmbeddings
10
+ from langchain_community.vectorstores import FAISS
11
+
12
+ # PyMuPDF ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
13
+ try:
14
+ import fitz # PyMuPDF
15
+ PYMUPDF_AVAILABLE = True
16
+ print("โœ… PyMuPDF ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์‚ฌ์šฉ ๊ฐ€๋Šฅ")
17
+ except ImportError:
18
+ PYMUPDF_AVAILABLE = False
19
+ print("โš ๏ธ PyMuPDF ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๊ฐ€ ์„ค์น˜๋˜์ง€ ์•Š์Œ. pip install PyMuPDF๋กœ ์„ค์น˜ํ•˜์„ธ์š”.")
20
+
21
+ # PDF ์ฒ˜๋ฆฌ์šฉ
22
+ import pytesseract
23
+ from PIL import Image
24
+ from pdf2image import convert_from_path
25
+ import pdfplumber
26
+ from pymupdf4llm import LlamaMarkdownReader
27
+
28
+ # --------------------------------
29
+ # ๋กœ๊ทธ ์ถœ๋ ฅ
30
+ # --------------------------------
31
+
32
+ def log(msg):
33
+ print(f"[{time.strftime('%H:%M:%S')}] {msg}")
34
+
35
+ # --------------------------------
36
+ # ํ…์ŠคํŠธ ์ •์ œ ํ•จ์ˆ˜
37
+ # --------------------------------
38
+
39
+ def clean_text(text):
40
+ return re.sub(r"[^\uAC00-\uD7A3\u1100-\u11FF\u3130-\u318F\w\s.,!?\"'()$:\-]", "", text)
41
+
42
+ def apply_corrections(text):
43
+ corrections = {
44
+ 'ยบยฉ': '์ •๋ณด', 'รŒ': '์˜', 'ยฝ': '์šด์˜', 'รƒ': '', 'ยฉ': '',
45
+ 'รขโ‚ฌโ„ข': "'", 'รขโ‚ฌล“': '"', 'รขโ‚ฌ': '"'
46
+ }
47
+ for k, v in corrections.items():
48
+ text = text.replace(k, v)
49
+ return text
50
+
51
+ # --------------------------------
52
+ # HWPX ์ฒ˜๋ฆฌ (์„น์…˜๋ณ„ ์ฒ˜๋ฆฌ๋งŒ ์‚ฌ์šฉ)
53
+ # --------------------------------
54
+
55
+ def load_hwpx(file_path):
56
+ """HWPX ํŒŒ์ผ ๋กœ๋”ฉ (XML ํŒŒ์‹ฑ ๋ฐฉ์‹๋งŒ ์‚ฌ์šฉ)"""
57
+ import zipfile
58
+ import xml.etree.ElementTree as ET
59
+ import chardet
60
+
61
+ log(f"๐Ÿ“ฅ HWPX ์„น์…˜๋ณ„ ์ฒ˜๋ฆฌ ์‹œ์ž‘: {file_path}")
62
+ start = time.time()
63
+ documents = []
64
+
65
+ try:
66
+ with zipfile.ZipFile(file_path, 'r') as zip_ref:
67
+ file_list = zip_ref.namelist()
68
+ section_files = [f for f in file_list
69
+ if f.startswith('Contents/section') and f.endswith('.xml')]
70
+ section_files.sort() # section0.xml, section1.xml ์ˆœ์„œ๋กœ ์ •๋ ฌ
71
+
72
+ log(f"๐Ÿ“„ ๋ฐœ๊ฒฌ๋œ ์„น์…˜ ํŒŒ์ผ: {len(section_files)}๊ฐœ")
73
+
74
+ for section_idx, section_file in enumerate(section_files):
75
+ with zip_ref.open(section_file) as xml_file:
76
+ raw = xml_file.read()
77
+ encoding = chardet.detect(raw)['encoding'] or 'utf-8'
78
+ try:
79
+ text = raw.decode(encoding)
80
+ except UnicodeDecodeError:
81
+ text = raw.decode("cp949", errors="replace")
82
+
83
+ tree = ET.ElementTree(ET.fromstring(text))
84
+ root = tree.getroot()
85
+
86
+ # ๋„ค์ž„์ŠคํŽ˜์ด์Šค ์—†์ด ํ…์ŠคํŠธ ์ฐพ๊ธฐ
87
+ t_elements = [elem for elem in root.iter() if elem.tag.endswith('}t') or elem.tag == 't']
88
+ body_text = ""
89
+ for elem in t_elements:
90
+ if elem.text:
91
+ body_text += clean_text(elem.text) + " "
92
+
93
+ # page ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ๋Š” ๋นˆ ๊ฐ’์œผ๋กœ ์„ค์ •
94
+ page_value = ""
95
+
96
+ if body_text.strip():
97
+ documents.append(Document(
98
+ page_content=apply_corrections(body_text),
99
+ metadata={
100
+ "source": file_path,
101
+ "filename": os.path.basename(file_path),
102
+ "type": "hwpx_body",
103
+ "page": page_value,
104
+ "total_sections": len(section_files)
105
+ }
106
+ ))
107
+ log(f"โœ… ์„น์…˜ ํ…์ŠคํŠธ ์ถ”์ถœ ์™„๋ฃŒ (chars: {len(body_text)})")
108
+
109
+ # ํ‘œ ์ฐพ๊ธฐ
110
+ table_elements = [elem for elem in root.iter() if elem.tag.endswith('}table') or elem.tag == 'table']
111
+ if table_elements:
112
+ table_text = ""
113
+ for table_idx, table in enumerate(table_elements):
114
+ table_text += f"[Table {table_idx + 1}]\n"
115
+ rows = [elem for elem in table.iter() if elem.tag.endswith('}tr') or elem.tag == 'tr']
116
+ for row in rows:
117
+ row_text = []
118
+ cells = [elem for elem in row.iter() if elem.tag.endswith('}tc') or elem.tag == 'tc']
119
+ for cell in cells:
120
+ cell_texts = []
121
+ for t_elem in cell.iter():
122
+ if (t_elem.tag.endswith('}t') or t_elem.tag == 't') and t_elem.text:
123
+ cell_texts.append(clean_text(t_elem.text))
124
+ row_text.append(" ".join(cell_texts))
125
+ if row_text:
126
+ table_text += "\t".join(row_text) + "\n"
127
+
128
+ if table_text.strip():
129
+ documents.append(Document(
130
+ page_content=apply_corrections(table_text),
131
+ metadata={
132
+ "source": file_path,
133
+ "filename": os.path.basename(file_path),
134
+ "type": "hwpx_table",
135
+ "page": page_value,
136
+ "total_sections": len(section_files)
137
+ }
138
+ ))
139
+ log(f"๐Ÿ“Š ํ‘œ ์ถ”์ถœ ์™„๋ฃŒ")
140
+
141
+ # ์ด๋ฏธ์ง€ ์ฐพ๊ธฐ
142
+ if [elem for elem in root.iter() if elem.tag.endswith('}picture') or elem.tag == 'picture']:
143
+ documents.append(Document(
144
+ page_content="[์ด๋ฏธ์ง€ ํฌํ•จ]",
145
+ metadata={
146
+ "source": file_path,
147
+ "filename": os.path.basename(file_path),
148
+ "type": "hwpx_image",
149
+ "page": page_value,
150
+ "total_sections": len(section_files)
151
+ }
152
+ ))
153
+ log(f"๐Ÿ–ผ๏ธ ์ด๋ฏธ์ง€ ๋ฐœ๊ฒฌ")
154
+
155
+ except Exception as e:
156
+ log(f"โŒ HWPX ์ฒ˜๋ฆฌ ์˜ค๋ฅ˜: {e}")
157
+
158
+ duration = time.time() - start
159
+
160
+ # ๋ฌธ์„œ ์ •๋ณด ์š”์•ฝ ์ถœ๋ ฅ
161
+ if documents:
162
+ log(f"๐Ÿ“‹ ์ถ”์ถœ๋œ ๋ฌธ์„œ ์ˆ˜: {len(documents)}")
163
+
164
+ log(f"โœ… HWPX ์ฒ˜๋ฆฌ ์™„๋ฃŒ: {file_path} โฑ๏ธ {duration:.2f}์ดˆ, ์ด {len(documents)}๊ฐœ ๋ฌธ์„œ")
165
+ return documents
166
+
167
+ # --------------------------------
168
+ # PDF ์ฒ˜๋ฆฌ ํ•จ์ˆ˜๋“ค (๊ธฐ์กด๊ณผ ๋™์ผ)
169
+ # --------------------------------
170
+
171
+ def run_ocr_on_image(image: Image.Image, lang='kor+eng'):
172
+ return pytesseract.image_to_string(image, lang=lang)
173
+
174
+ def extract_images_with_ocr(pdf_path, lang='kor+eng'):
175
+ try:
176
+ images = convert_from_path(pdf_path)
177
+ page_ocr_data = {}
178
+ for idx, img in enumerate(images):
179
+ page_num = idx + 1
180
+ text = run_ocr_on_image(img, lang=lang)
181
+ if text.strip():
182
+ page_ocr_data[page_num] = text.strip()
183
+ return page_ocr_data
184
+ except Exception as e:
185
+ print(f"โŒ ์ด๋ฏธ์ง€ OCR ์‹คํŒจ: {e}")
186
+ return {}
187
+
188
+ def extract_tables_with_pdfplumber(pdf_path):
189
+ page_table_data = {}
190
+ try:
191
+ with pdfplumber.open(pdf_path) as pdf:
192
+ for i, page in enumerate(pdf.pages):
193
+ page_num = i + 1
194
+ tables = page.extract_tables()
195
+ table_text = ""
196
+ for t_index, table in enumerate(tables):
197
+ if table:
198
+ table_text += f"[Table {t_index+1}]\n"
199
+ for row in table:
200
+ row_text = "\t".join(cell if cell else "" for cell in row)
201
+ table_text += row_text + "\n"
202
+ if table_text.strip():
203
+ page_table_data[page_num] = table_text.strip()
204
+ return page_table_data
205
+ except Exception as e:
206
+ print(f"โŒ ํ‘œ ์ถ”์ถœ ์‹คํŒจ: {e}")
207
+ return {}
208
+
209
+ def extract_body_text_with_pages(pdf_path):
210
+ page_body_data = {}
211
+ try:
212
+ pdf_processor = LlamaMarkdownReader()
213
+ docs = pdf_processor.load_data(file_path=pdf_path)
214
+
215
+ combined_text = ""
216
+ for d in docs:
217
+ if isinstance(d, dict) and "text" in d:
218
+ combined_text += d["text"]
219
+ elif hasattr(d, "text"):
220
+ combined_text += d.text
221
+
222
+ if combined_text.strip():
223
+ chars_per_page = 2000
224
+ start = 0
225
+ page_num = 1
226
+
227
+ while start < len(combined_text):
228
+ end = start + chars_per_page
229
+ if end > len(combined_text):
230
+ end = len(combined_text)
231
+
232
+ page_text = combined_text[start:end]
233
+ if page_text.strip():
234
+ page_body_data[page_num] = page_text.strip()
235
+ page_num += 1
236
+
237
+ if end == len(combined_text):
238
+ break
239
+ start = end - 100
240
+
241
+ except Exception as e:
242
+ print(f"โŒ ๋ณธ๋ฌธ ์ถ”์ถœ ์‹คํŒจ: {e}")
243
+
244
+ return page_body_data
245
+
246
+ def load_pdf_with_metadata(pdf_path):
247
+ """PDF ํŒŒ์ผ์—์„œ ํŽ˜์ด์ง€๋ณ„ ์ •๋ณด๋ฅผ ์ถ”์ถœ"""
248
+ log(f"๐Ÿ“‘ PDF ํŽ˜์ด์ง€๋ณ„ ์ฒ˜๋ฆฌ ์‹œ์ž‘: {pdf_path}")
249
+ start = time.time()
250
+
251
+ # ๋จผ์ € PyPDFLoader๋กœ ์‹ค์ œ ํŽ˜์ด์ง€ ์ˆ˜ ํ™•์ธ
252
+ try:
253
+ from langchain_community.document_loaders import PyPDFLoader
254
+ loader = PyPDFLoader(pdf_path)
255
+ pdf_pages = loader.load()
256
+ actual_total_pages = len(pdf_pages)
257
+ log(f"๐Ÿ“„ PyPDFLoader๋กœ ํ™•์ธํ•œ ์‹ค์ œ ํŽ˜์ด์ง€ ์ˆ˜: {actual_total_pages}")
258
+ except Exception as e:
259
+ log(f"โŒ PyPDFLoader ํŽ˜์ด์ง€ ์ˆ˜ ํ™•์ธ ์‹คํŒจ: {e}")
260
+ actual_total_pages = 1
261
+
262
+ try:
263
+ page_tables = extract_tables_with_pdfplumber(pdf_path)
264
+ except Exception as e:
265
+ page_tables = {}
266
+ print(f"โŒ ํ‘œ ์ถ”์ถœ ์‹คํŒจ: {e}")
267
+
268
+ try:
269
+ page_ocr = extract_images_with_ocr(pdf_path)
270
+ except Exception as e:
271
+ page_ocr = {}
272
+ print(f"โŒ ์ด๋ฏธ์ง€ OCR ์‹คํŒจ: {e}")
273
+
274
+ try:
275
+ page_body = extract_body_text_with_pages(pdf_path)
276
+ except Exception as e:
277
+ page_body = {}
278
+ print(f"โŒ ๋ณธ๋ฌธ ์ถ”์ถœ ์‹คํŒจ: {e}")
279
+
280
+ duration = time.time() - start
281
+ log(f"โœ… PDF ํŽ˜์ด์ง€๋ณ„ ์ฒ˜๋ฆฌ ์™„๋ฃŒ: {pdf_path} โฑ๏ธ {duration:.2f}์ดˆ")
282
+
283
+ # ์‹ค์ œ ํŽ˜์ด์ง€ ์ˆ˜๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์„ค์ •
284
+ all_pages = set(page_tables.keys()) | set(page_ocr.keys()) | set(page_body.keys())
285
+ if all_pages:
286
+ max_extracted_page = max(all_pages)
287
+ # ์‹ค์ œ ํŽ˜์ด์ง€ ์ˆ˜์™€ ์ถ”์ถœ๋œ ํŽ˜์ด์ง€ ์ˆ˜ ์ค‘ ํฐ ๊ฐ’ ์‚ฌ์šฉ
288
+ total_pages = max(actual_total_pages, max_extracted_page)
289
+ else:
290
+ total_pages = actual_total_pages
291
+
292
+ log(f"๐Ÿ“Š ์ตœ์ข… ์„ค์ •๋œ ์ด ํŽ˜์ด์ง€ ์ˆ˜: {total_pages}")
293
+
294
+ docs = []
295
+
296
+ for page_num in sorted(all_pages):
297
+ if page_num in page_tables and page_tables[page_num].strip():
298
+ docs.append(Document(
299
+ page_content=clean_text(apply_corrections(page_tables[page_num])),
300
+ metadata={
301
+ "source": pdf_path,
302
+ "filename": os.path.basename(pdf_path),
303
+ "type": "table",
304
+ "page": page_num,
305
+ "total_pages": total_pages
306
+ }
307
+ ))
308
+ log(f"๐Ÿ“Š ํŽ˜์ด์ง€ {page_num}: ํ‘œ ์ถ”์ถœ ์™„๋ฃŒ")
309
+
310
+ if page_num in page_body and page_body[page_num].strip():
311
+ docs.append(Document(
312
+ page_content=clean_text(apply_corrections(page_body[page_num])),
313
+ metadata={
314
+ "source": pdf_path,
315
+ "filename": os.path.basename(pdf_path),
316
+ "type": "body",
317
+ "page": page_num,
318
+ "total_pages": total_pages
319
+ }
320
+ ))
321
+ log(f"๐Ÿ“„ ํŽ˜์ด์ง€ {page_num}: ๋ณธ๋ฌธ ์ถ”์ถœ ์™„๋ฃŒ")
322
+
323
+ if page_num in page_ocr and page_ocr[page_num].strip():
324
+ docs.append(Document(
325
+ page_content=clean_text(apply_corrections(page_ocr[page_num])),
326
+ metadata={
327
+ "source": pdf_path,
328
+ "filename": os.path.basename(pdf_path),
329
+ "type": "ocr",
330
+ "page": page_num,
331
+ "total_pages": total_pages
332
+ }
333
+ ))
334
+ log(f"๐Ÿ–ผ๏ธ ํŽ˜์ด์ง€ {page_num}: OCR ์ถ”์ถœ ์™„๋ฃŒ")
335
+
336
+ if not docs:
337
+ docs.append(Document(
338
+ page_content="[๋‚ด์šฉ ์ถ”์ถœ ์‹คํŒจ]",
339
+ metadata={
340
+ "source": pdf_path,
341
+ "filename": os.path.basename(pdf_path),
342
+ "type": "error",
343
+ "page": 1,
344
+ "total_pages": total_pages
345
+ }
346
+ ))
347
+
348
+ # ํŽ˜์ด์ง€ ์ •๋ณด ์š”์•ฝ ์ถœ๋ ฅ
349
+ if docs:
350
+ page_numbers = [doc.metadata.get('page', 0) for doc in docs if doc.metadata.get('page')]
351
+ if page_numbers:
352
+ log(f"๐Ÿ“‹ ์ถ”์ถœ๋œ ํŽ˜์ด์ง€ ๋ฒ”์œ„: {min(page_numbers)} ~ {max(page_numbers)}")
353
+
354
+ log(f"๐Ÿ“Š ์ถ”์ถœ๋œ ํŽ˜์ด์ง€๋ณ„ PDF ๋ฌธ์„œ: {len(docs)}๊ฐœ (์ด {total_pages}ํŽ˜์ด์ง€)")
355
+ return docs
356
+
357
+ # --------------------------------
358
+ # ๋ฌธ์„œ ๋กœ๋”ฉ ๋ฐ ๋ถ„ํ• 
359
+ # --------------------------------
360
+
361
+ def load_documents(folder_path):
362
+ documents = []
363
+
364
+ for file in glob.glob(os.path.join(folder_path, "*.hwpx")):
365
+ log(f"๐Ÿ“„ HWPX ํŒŒ์ผ ํ™•์ธ: {file}")
366
+ docs = load_hwpx(file)
367
+ documents.extend(docs)
368
+
369
+ for file in glob.glob(os.path.join(folder_path, "*.pdf")):
370
+ log(f"๐Ÿ“„ PDF ํŒŒ์ผ ํ™•์ธ: {file}")
371
+ documents.extend(load_pdf_with_metadata(file))
372
+
373
+ log(f"๐Ÿ“š ๋ฌธ์„œ ๋กœ๋”ฉ ์ „์ฒด ์™„๋ฃŒ! ์ด ๋ฌธ์„œ ์ˆ˜: {len(documents)}")
374
+ return documents
375
+
376
+ def split_documents(documents, chunk_size=800, chunk_overlap=100):
377
+ log("๐Ÿ”ช ์ฒญํฌ ๋ถ„ํ•  ์‹œ์ž‘")
378
+ splitter = RecursiveCharacterTextSplitter(
379
+ chunk_size=chunk_size,
380
+ chunk_overlap=chunk_overlap,
381
+ length_function=len
382
+ )
383
+ chunks = []
384
+ for doc in documents:
385
+ split = splitter.split_text(doc.page_content)
386
+ for i, chunk in enumerate(split):
387
+ enriched_chunk = f"passage: {chunk}"
388
+ chunks.append(Document(
389
+ page_content=enriched_chunk,
390
+ metadata={**doc.metadata, "chunk_index": i}
391
+ ))
392
+ log(f"โœ… ์ฒญํฌ ๋ถ„ํ•  ์™„๋ฃŒ: ์ด {len(chunks)}๊ฐœ ์ƒ์„ฑ")
393
+ return chunks
394
+
395
+ # --------------------------------
396
+ # ๋ฉ”์ธ ์‹คํ–‰
397
+ # --------------------------------
398
+
399
+ if __name__ == "__main__":
400
+ folder = "dataset_test"
401
+ log("๐Ÿš€ PyMuPDF ๊ธฐ๋ฐ˜ ๋ฌธ์„œ ์ฒ˜๋ฆฌ ์‹œ์ž‘")
402
+ docs = load_documents(folder)
403
+ log("๐Ÿ“ฆ ๋ฌธ์„œ ๋กœ๋”ฉ ์™„๋ฃŒ")
404
+
405
+ # ํŽ˜์ด์ง€ ์ •๋ณด ํ™•์ธ
406
+ log("๐Ÿ“„ ํŽ˜์ด์ง€ ์ •๋ณด ์š”์•ฝ:")
407
+ page_info = {}
408
+ for doc in docs:
409
+ source = doc.metadata.get('source', 'unknown')
410
+ page = doc.metadata.get('page', 'unknown')
411
+ doc_type = doc.metadata.get('type', 'unknown')
412
+
413
+ if source not in page_info:
414
+ page_info[source] = {'pages': set(), 'types': set()}
415
+ page_info[source]['pages'].add(page)
416
+ page_info[source]['types'].add(doc_type)
417
+
418
+ for source, info in page_info.items():
419
+ max_page = max(info['pages']) if info['pages'] and isinstance(max(info['pages']), int) else 'unknown'
420
+ log(f" ๐Ÿ“„ {os.path.basename(source)}: {max_page}ํŽ˜์ด์ง€, ํƒ€์ž…: {info['types']}")
421
+
422
+ chunks = split_documents(docs)
423
+ log("๐Ÿ’ก E5-Large-Instruct ์ž„๋ฒ ๋”ฉ ์ค€๋น„ ์ค‘")
424
+ embedding_model = HuggingFaceEmbeddings(
425
+ model_name="intfloat/e5-large-v2",
426
+ model_kwargs={"device": "cuda"}
427
+ )
428
+
429
+ vectorstore = FAISS.from_documents(chunks, embedding_model)
430
+ vectorstore.save_local("vector_db")
431
+
432
+ log(f"๐Ÿ“Š ์ „์ฒด ๋ฌธ์„œ ์ˆ˜: {len(docs)}")
433
+ log(f"๐Ÿ”— ์ฒญํฌ ์ด ์ˆ˜: {len(chunks)}")
434
+ log("โœ… FAISS ์ €์žฅ ์™„๋ฃŒ: vector_db")
435
+
436
+ # ํŽ˜์ด์ง€ ์ •๋ณด๊ฐ€ ํฌํ•จ๋œ ์ƒ˜ํ”Œ ์ถœ๋ ฅ
437
+ log("\n๐Ÿ“‹ ์‹ค์ œ ํŽ˜์ด์ง€ ์ •๋ณด ํฌํ•จ ์ƒ˜ํ”Œ:")
438
+ for i, chunk in enumerate(chunks[:5]):
439
+ meta = chunk.metadata
440
+ log(f" ์ฒญํฌ {i+1}: {meta.get('type')} | ํŽ˜์ด์ง€ {meta.get('page')} | {os.path.basename(meta.get('source', 'unknown'))}")
e5_embeddings.py ADDED
@@ -0,0 +1,9 @@
+ from langchain_huggingface import HuggingFaceEmbeddings
+
+ class E5Embeddings(HuggingFaceEmbeddings):
+     def embed_documents(self, texts):
+         texts = [f"passage: {text}" for text in texts]
+         return super().embed_documents(texts)
+
+     def embed_query(self, text):
+         return super().embed_query(f"query: {text}")
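The wrapper exists because the E5 model family expects `query:` / `passage:` prefixes on its inputs; a small usage sketch (model weights are downloaded from the Hub on first use):

```python
from e5_embeddings import E5Embeddings

emb = E5Embeddings(
    model_name="intfloat/multilingual-e5-large-instruct",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)

# "passage: " is prepended to documents and "query: " to queries, matching
# the input convention the multilingual-e5 models were trained with.
doc_vecs = emb.embed_documents(["Jeonbuk data center site review report"])
query_vec = emb.embed_query("data center site")
print(len(doc_vecs[0]), len(query_vec))  # 1024-dimensional for the e5-large models
```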
llm_loader.py ADDED
@@ -0,0 +1,24 @@
+ from langchain.chat_models import ChatOpenAI
+
+ def load_llama_model():
+     return ChatOpenAI(
+
+         # To run RAG with the Llama 3 8B model:
+         #base_url="http://torch27:8000/v1",
+         #model="meta-llama/Meta-Llama-3-8B-Instruct",
+         #openai_api_key="EMPTY"
+
+         # To run RAG with EXAONE:
+         base_url="http://220.124.155.35:8000/v1",
+         model="LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct",
+         openai_api_key="token-abc123"
+
+         #base_url="https://7xiebe4unotxnp-8000.proxy.runpod.net/v1",
+         #model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
+         #openai_api_key="EMPTY"
+
+         # base_url="http://vllm_yjy:8000/v1",
+         # model="/models/Llama-3.3-70B-Instruct-AWQ",
+         # openai_api_key="token-abc123"
+
+     )
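A minimal smoke test for the configured endpoint (a sketch; it assumes the OpenAI-compatible server at `base_url` is reachable):

```python
from llm_loader import load_llama_model

llm = load_llama_model()

# ChatOpenAI speaks the OpenAI chat API, so this sends a single chat
# completion request to the configured server and prints the reply text.
reply = llm.invoke("Reply with OK if you can read this.")
print(reply.content)
```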
rag_server.py ADDED
@@ -0,0 +1,197 @@
1
+ from fastapi import FastAPI, Request
2
+ from fastapi.responses import JSONResponse, FileResponse, HTMLResponse
3
+ from fastapi.staticfiles import StaticFiles
4
+ from pydantic import BaseModel
5
+ from rag_system import build_rag_chain, ask_question
6
+ from vector_store import get_embeddings, load_vector_store
7
+ from llm_loader import load_llama_model
8
+ import uuid
9
+ import os
10
+ import shutil
11
+ from urllib.parse import urljoin, quote
12
+
13
+ from fastapi.responses import StreamingResponse
14
+ import json
15
+ import time
16
+
17
+ app = FastAPI()
18
+
19
+ # ์ •์  ํŒŒ์ผ ์„œ๋น™์„ ์œ„ํ•œ ์„ค์ •
20
+ os.makedirs("static/documents", exist_ok=True)
21
+ app.mount("/static", StaticFiles(directory="static"), name="static")
22
+
23
+ # ์ „์—ญ ๊ฐ์ฒด ์ค€๋น„
24
+ embeddings = get_embeddings(device="cpu")
25
+ vectorstore = load_vector_store(embeddings, load_path="vector_db")
26
+ llm = load_llama_model()
27
+ qa_chain = build_rag_chain(llm, vectorstore, language="ko", k=7)
28
+
29
+ # ์„œ๋ฒ„ URL ์„ค์ • (์‹ค์ œ ํ™˜๊ฒฝ์— ๋งž๊ฒŒ ์ˆ˜์ • ํ•„์š”)
30
+ BASE_URL = "http://220.124.155.35:8500"
31
+
32
+ class Question(BaseModel):
33
+ question: str
34
+
35
+ def get_document_url(source_path):
36
+ if not source_path or source_path == 'N/A':
37
+ return None
38
+ filename = os.path.basename(source_path)
39
+ dataset_root = os.path.join(os.getcwd(), "dataset")
40
+ # dataset ์ „์ฒด ํ•˜์œ„ ํด๋”์—์„œ ํŒŒ์ผ๋ช… ์ผ์น˜ํ•˜๋Š” ํŒŒ์ผ ์ฐพ๊ธฐ
41
+ found_path = None
42
+ for root, dirs, files in os.walk(dataset_root):
43
+ if filename in files:
44
+ found_path = os.path.join(root, filename)
45
+ break
46
+ if not found_path or not os.path.exists(found_path):
47
+ return None
48
+ static_path = f"static/documents/{filename}"
49
+ shutil.copy2(found_path, static_path)
50
+ encoded_filename = quote(filename)
51
+ return urljoin(BASE_URL, f"/static/documents/{encoded_filename}")
52
+
53
+ def create_download_link(url, filename):
54
+ return f'์ถœ์ฒ˜: [{filename}]({url})'
55
+
56
+ @app.post("/ask")
57
+ def ask(question: Question):
58
+ result = ask_question(qa_chain, question.question)
59
+
60
+ # ์†Œ์Šค ๋ฌธ์„œ ์ •๋ณด ์ฒ˜๋ฆฌ
61
+ sources = []
62
+ for doc in result["source_documents"]:
63
+ source_path = doc.metadata.get('source', 'N/A')
64
+ document_url = get_document_url(source_path) if source_path != 'N/A' else None
65
+
66
+ source_info = {
67
+ "source": source_path,
68
+ "content": doc.page_content,
69
+ "page": doc.metadata.get('page', 'N/A'),
70
+ "document_url": document_url,
71
+ "filename": os.path.basename(source_path) if source_path != 'N/A' else None
72
+ }
73
+ sources.append(source_info)
74
+
75
+ return {
76
+ "answer": result['result'].split("A:")[-1].strip() if "A:" in result['result'] else result['result'].strip(),
77
+ "sources": sources
78
+ }
79
+
80
+ @app.get("/v1/models")
81
+ def list_models():
82
+ return JSONResponse({
83
+ "object": "list",
84
+ "data": [
85
+ {
86
+ "id": "rag",
87
+ "object": "model",
88
+ "owned_by": "local",
89
+ }
90
+ ]
91
+ })
92
+
93
+ @app.post("/v1/chat/completions")
94
+ async def openai_compatible_chat(request: Request):
95
+ payload = await request.json()
96
+ messages = payload.get("messages", [])
97
+ user_input = messages[-1]["content"] if messages else ""
98
+ stream = payload.get("stream", False)
99
+
100
+ result = ask_question(qa_chain, user_input)
101
+ answer = result['result']
102
+
103
+ # ์†Œ์Šค ๋ฌธ์„œ ์ •๋ณด ์ฒ˜๋ฆฌ
104
+ sources = []
105
+ for doc in result["source_documents"]:
106
+ source_path = doc.metadata.get('source', 'N/A')
107
+ document_url = get_document_url(source_path) if source_path != 'N/A' else None
108
+ filename = os.path.basename(source_path) if source_path != 'N/A' else None
109
+
110
+ source_info = {
111
+ "source": source_path,
112
+ "content": doc.page_content,
113
+ "page": doc.metadata.get('page', 'N/A'),
114
+ "document_url": document_url,
115
+ "filename": filename
116
+ }
117
+ sources.append(source_info)
118
+
119
+ # ์†Œ์Šค ์ •๋ณด๋ฅผ ํ•œ ์ค„์”ฉ๋งŒ ์ถœ๋ ฅ
120
+ sources_md = "\n์ฐธ๊ณ  ๋ฌธ์„œ:\n"
121
+ seen = set()
122
+ for source in sources:
123
+ key = (source['filename'], source['document_url'])
124
+ if source['document_url'] and source['filename'] and key not in seen:
125
+ sources_md += f"์ถœ์ฒ˜: [{source['filename']}]({source['document_url']})\n"
126
+ seen.add(key)
127
+
128
+ final_answer = answer.split("A:")[-1].strip() if "A:" in answer else answer.strip()
129
+ final_answer += sources_md
130
+
131
+ if not stream:
132
+ return JSONResponse({
133
+ "id": f"chatcmpl-{uuid.uuid4()}",
134
+ "object": "chat.completion",
135
+ "choices": [{
136
+ "index": 0,
137
+ "message": {
138
+ "role": "assistant",
139
+ "content": final_answer
140
+ },
141
+ "finish_reason": "stop"
142
+ }],
143
+ "model": "rag",
144
+ })
145
+
146
+ # ์ŠคํŠธ๋ฆฌ๋ฐ ์‘๋‹ต์„ ์œ„ํ•œ generator
147
+ def event_stream():
148
+ # ๋‹ต๋ณ€ ๋ณธ๋ฌธ๋งŒ ๋จผ์ € ์ŠคํŠธ๋ฆฌ๋ฐ
149
+ answer_main = answer.split("A:")[-1].strip() if "A:" in answer else answer.strip()
150
+ for char in answer_main:
151
+ chunk = {
152
+ "id": f"chatcmpl-{uuid.uuid4()}",
153
+ "object": "chat.completion.chunk",
154
+ "choices": [{
155
+ "index": 0,
156
+ "delta": {
157
+ "content": char
158
+ },
159
+ "finish_reason": None
160
+ }]
161
+ }
162
+ yield f"data: {json.dumps(chunk)}\n\n"
163
+ time.sleep(0.005)
164
+ # ์ฐธ๊ณ  ๋ฌธ์„œ(๋‹ค์šด๋กœ๋“œ ๋งํฌ)๋Š” ๋งˆ์ง€๋ง‰์— ํ•œ ๋ฒˆ์— ๋ถ™์—ฌ์„œ ์ „์†ก
165
+ sources_md = "\n์ฐธ๊ณ  ๋ฌธ์„œ:\n"
166
+ seen = set()
167
+ for source in sources:
168
+ key = (source['filename'], source['document_url'])
169
+ if source['document_url'] and source['filename'] and key not in seen:
170
+ sources_md += f"์ถœ์ฒ˜: [{source['filename']}]({source['document_url']})\n"
171
+ seen.add(key)
172
+ if sources_md.strip() != "์ฐธ๊ณ  ๋ฌธ์„œ:":
173
+ chunk = {
174
+ "id": f"chatcmpl-{uuid.uuid4()}",
175
+ "object": "chat.completion.chunk",
176
+ "choices": [{
177
+ "index": 0,
178
+ "delta": {
179
+ "content": sources_md
180
+ },
181
+ "finish_reason": None
182
+ }]
183
+ }
184
+ yield f"data: {json.dumps(chunk)}\n\n"
185
+ done = {
186
+ "id": f"chatcmpl-{uuid.uuid4()}",
187
+ "object": "chat.completion.chunk",
188
+ "choices": [{
189
+ "index": 0,
190
+ "delta": {},
191
+ "finish_reason": "stop"
192
+ }]
193
+ }
194
+ yield f"data: {json.dumps(done)}\n\n"
195
+ return
196
+
197
+ return StreamingResponse(event_stream(), media_type="text/event-stream")
rag_system.py ADDED
@@ -0,0 +1,227 @@
1
+ import os
2
+ import argparse
3
+ import sys
4
+ from langchain.chains import RetrievalQA
5
+ from langchain.prompts import PromptTemplate
6
+ from vector_store import get_embeddings, load_vector_store
7
+ from llm_loader import load_llama_model
8
+
9
+ def create_refine_prompts_with_pages(language="ko"):
10
+ if language == "ko":
11
+ question_prompt = PromptTemplate(
12
+ input_variables=["context_str", "question"],
13
+ template="""
14
+ ๋‹ค์Œ์€ ๊ฒ€์ƒ‰๋œ ๋ฌธ์„œ ์กฐ๊ฐ๋“ค์ž…๋‹ˆ๋‹ค:
15
+
16
+ {context_str}
17
+
18
+ ์œ„ ๋ฌธ์„œ๋“ค์„ ์ฐธ๊ณ ํ•˜์—ฌ ์งˆ๋ฌธ์— ๋‹ต๋ณ€ํ•ด์ฃผ์„ธ์š”.
19
+
20
+ **์ค‘์š”ํ•œ ๊ทœ์น™:**
21
+ - ๋‹ต๋ณ€ ์‹œ ์ฐธ๊ณ ํ•œ ๋ฌธ์„œ๊ฐ€ ์žˆ๋‹ค๋ฉด ํ•ด๋‹น ์ •๋ณด๋ฅผ ์ธ์šฉํ•˜์„ธ์š”
22
+ - ๋ฌธ์„œ์— ๋ช…์‹œ๋œ ์ •๋ณด๋งŒ ์‚ฌ์šฉํ•˜๊ณ , ์ถ”์ธกํ•˜์ง€ ๋งˆ์„ธ์š”
23
+ - ํŽ˜์ด์ง€ ๋ฒˆํ˜ธ๋‚˜ ์ถœ์ฒ˜๋Š” ์œ„ ๋ฌธ์„œ์—์„œ ํ™•์ธ๋œ ๊ฒƒ๋งŒ ์–ธ๊ธ‰ํ•˜์„ธ์š”
24
+ - ํ™•์‹คํ•˜์ง€ ์•Š์€ ์ •๋ณด๋Š” "๋ฌธ์„œ์—์„œ ํ™•์ธ๋˜์ง€ ์•Š์Œ"์ด๋ผ๊ณ  ๋ช…์‹œํ•˜์„ธ์š”
25
+
26
+ ์งˆ๋ฌธ: {question}
27
+ ๋‹ต๋ณ€:"""
28
+ )
29
+
30
+ refine_prompt = PromptTemplate(
31
+ input_variables=["question", "existing_answer", "context_str"],
32
+ template="""
33
+ ๊ธฐ์กด ๋‹ต๋ณ€:
34
+ {existing_answer}
35
+
36
+ ์ถ”๊ฐ€ ๋ฌธ์„œ:
37
+ {context_str}
38
+
39
+ ๊ธฐ์กด ๋‹ต๋ณ€์„ ์œ„ ์ถ”๊ฐ€ ๋ฌธ์„œ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ๋ณด์™„ํ•˜๊ฑฐ๋‚˜ ์ˆ˜์ •ํ•ด์ฃผ์„ธ์š”.
40
+
41
+ **๊ทœ์น™:**
42
+ - ์ƒˆ๋กœ์šด ์ •๋ณด๊ฐ€ ๊ธฐ์กด ๋‹ต๋ณ€๊ณผ ๋‹ค๋ฅด๋‹ค๋ฉด ์ˆ˜์ •ํ•˜์„ธ์š”
43
+ - ์ถ”๊ฐ€ ๋ฌธ์„œ์— ๋ช…์‹œ๋œ ์ •๋ณด๋งŒ ์‚ฌ์šฉํ•˜์„ธ์š”
44
+ - ํ•˜๋‚˜์˜ ์™„๊ฒฐ๋œ ๋‹ต๋ณ€์œผ๋กœ ์ž‘์„ฑํ•˜์„ธ์š”
45
+ - ํ™•์‹คํ•˜์ง€ ์•Š์€ ์ถœ์ฒ˜๋‚˜ ํŽ˜์ด์ง€๋Š” ์–ธ๊ธ‰ํ•˜์ง€ ๋งˆ์„ธ์š”
46
+
47
+ ์งˆ๋ฌธ: {question}
48
+ ๋‹ต๋ณ€:"""
49
+ )
50
+ else:
51
+ question_prompt = PromptTemplate(
52
+ input_variables=["context_str", "question"],
53
+ template="""
54
+ Here are the retrieved document fragments:
55
+
56
+ {context_str}
57
+
58
+ Please answer the question based on the above documents.
59
+
60
+ **Important rules:**
61
+ - Only use information explicitly stated in the documents
62
+ - If citing sources, only mention what is clearly indicated in the documents above
63
+ - Do not guess or infer page numbers not shown in the context
64
+ - If unsure, state "not confirmed in the provided documents"
65
+
66
+ Question: {question}
67
+ Answer:"""
68
+ )
69
+
70
+ refine_prompt = PromptTemplate(
71
+ input_variables=["question", "existing_answer", "context_str"],
72
+ template="""
73
+ Existing answer:
74
+ {existing_answer}
75
+
76
+ Additional documents:
77
+ {context_str}
78
+
79
+ Refine the existing answer using the additional documents.
80
+
81
+ **Rules:**
82
+ - Only use information explicitly stated in the additional documents
83
+ - Create one coherent final answer
84
+ - Do not mention uncertain sources or page numbers
85
+
86
+ Question: {question}
87
+ Answer:"""
88
+ )
89
+
90
+ return question_prompt, refine_prompt
91
+
92
+ def build_rag_chain(llm, vectorstore, language="ko", k=7):
93
+ """RAG ์ฒด์ธ ๊ตฌ์ถ•"""
94
+ question_prompt, refine_prompt = create_refine_prompts_with_pages(language)
95
+
96
+ qa_chain = RetrievalQA.from_chain_type(
97
+ llm=llm,
98
+ chain_type="refine",
99
+ retriever=vectorstore.as_retriever(search_kwargs={"k": k}),
100
+ chain_type_kwargs={
101
+ "question_prompt": question_prompt,
102
+ "refine_prompt": refine_prompt
103
+ },
104
+ return_source_documents=True
105
+ )
106
+
107
+ return qa_chain
108
+
109
+ def ask_question_with_pages(qa_chain, question):
110
+ """์งˆ๋ฌธ ์ฒ˜๋ฆฌ"""
111
+ result = qa_chain.invoke({"query": question})
112
+
113
+ # ๊ฒฐ๊ณผ์—์„œ A: ์ดํ›„ ๋ฌธ์žฅ๋งŒ ์ถ”์ถœ
114
+ answer = result['result']
115
+ final_answer = answer.split("A:")[-1].strip() if "A:" in answer else answer.strip()
116
+
117
+ print(f"\n๐Ÿงพ ์งˆ๋ฌธ: {question}")
118
+ print(f"\n๐ŸŸข ์ตœ์ข… ๋‹ต๋ณ€: {final_answer}")
119
+
120
+ # ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ๋””๋ฒ„๊น… ์ •๋ณด ์ถœ๋ ฅ (๋น„ํ™œ์„ฑํ™”)
121
+ # debug_metadata_info(result["source_documents"])
122
+
123
+ # ์ฐธ๊ณ  ๋ฌธ์„œ๋ฅผ ํŽ˜์ด์ง€๋ณ„๋กœ ์ •๋ฆฌ
124
+ print("\n๐Ÿ“š ์ฐธ๊ณ  ๋ฌธ์„œ ์š”์•ฝ:")
125
+ source_info = {}
126
+
127
+ for doc in result["source_documents"]:
128
+ source = doc.metadata.get('source', 'N/A')
129
+ page = doc.metadata.get('page', 'N/A')
130
+ doc_type = doc.metadata.get('type', 'N/A')
131
+ section = doc.metadata.get('section', None)
132
+ total_pages = doc.metadata.get('total_pages', None)
133
+
134
+ filename = doc.metadata.get('filename', 'N/A')
135
+ if filename == 'N/A':
136
+ filename = os.path.basename(source) if source != 'N/A' else 'N/A'
137
+
138
+ if filename not in source_info:
139
+ source_info[filename] = {
140
+ 'pages': set(),
141
+ 'sections': set(),
142
+ 'types': set(),
143
+ 'total_pages': total_pages
144
+ }
145
+
146
+ if page != 'N/A':
147
+ if isinstance(page, str) and page.startswith('์„น์…˜'):
148
+ source_info[filename]['sections'].add(page)
149
+ else:
150
+ source_info[filename]['pages'].add(page)
151
+
152
+ if section is not None:
153
+ source_info[filename]['sections'].add(f"์„น์…˜ {section}")
154
+
155
+ source_info[filename]['types'].add(doc_type)
156
+
157
+ # ๊ฒฐ๊ณผ ์ถœ๋ ฅ
158
+ total_chunks = len(result["source_documents"])
159
+ print(f"์ด ์‚ฌ์šฉ๋œ ์ฒญํฌ ์ˆ˜: {total_chunks}")
160
+
161
+ for filename, info in source_info.items():
162
+ print(f"\n- {filename}")
163
+
164
+ # ์ „์ฒด ํŽ˜์ด์ง€ ์ˆ˜ ์ •๋ณด
165
+ if info['total_pages']:
166
+ print(f" ์ „์ฒด ํŽ˜์ด์ง€ ์ˆ˜: {info['total_pages']}")
167
+
168
+ # ํŽ˜์ด์ง€ ์ •๋ณด ์ถœ๋ ฅ
169
+ if info['pages']:
170
+ pages_list = list(info['pages'])
171
+ print(f" ํŽ˜์ด์ง€: {', '.join(map(str, pages_list))}")
172
+
173
+ # ์„น์…˜ ์ •๋ณด ์ถœ๋ ฅ
174
+ if info['sections']:
175
+ sections_list = sorted(list(info['sections']))
176
+ print(f" ์„น์…˜: {', '.join(sections_list)}")
177
+
178
+ # ํŽ˜์ด์ง€์™€ ์„น์…˜์ด ๋ชจ๋‘ ์—†๋Š” ๊ฒฝ์šฐ
179
+ if not info['pages'] and not info['sections']:
180
+ print(f" ํŽ˜์ด์ง€: ์ •๋ณด ์—†์Œ")
181
+
182
+ # ๋ฌธ์„œ ์œ ํ˜• ์ถœ๋ ฅ
183
+ types_str = ', '.join(sorted(info['types']))
184
+ print(f" ์œ ํ˜•: {types_str}")
185
+
186
+ return result
187
+
188
+ # ๊ธฐ์กด ask_question ํ•จ์ˆ˜๋Š” ask_question_with_pages๋กœ ๊ต์ฒด
189
+ def ask_question(qa_chain, question):
190
+ """ํ˜ธํ™˜์„ฑ์„ ์œ„ํ•œ ๋ž˜ํผ ํ•จ์ˆ˜"""
191
+ return ask_question_with_pages(qa_chain, question)
192
+
193
+ if __name__ == "__main__":
194
+ parser = argparse.ArgumentParser(description="RAG refine system (ํŽ˜์ด์ง€ ๋ฒˆํ˜ธ ์ง€์›)")
195
+ parser.add_argument("--vector_store", type=str, default="vector_db", help="๋ฒกํ„ฐ ์Šคํ† ์–ด ๊ฒฝ๋กœ")
196
+ parser.add_argument("--model", type=str, default="LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct", help="LLM ๋ชจ๋ธ ID")
197
+ parser.add_argument("--device", type=str, default="cuda", choices=["cuda", "cpu"], help="์‚ฌ์šฉํ•  ๋””๋ฐ”์ด์Šค")
198
+ parser.add_argument("--k", type=int, default=7, help="๊ฒ€์ƒ‰ํ•  ๋ฌธ์„œ ์ˆ˜")
199
+ parser.add_argument("--language", type=str, default="ko", choices=["ko", "en"], help="์‚ฌ์šฉํ•  ์–ธ์–ด")
200
+ parser.add_argument("--query", type=str, help="์งˆ๋ฌธ (์—†์œผ๋ฉด ๋Œ€ํ™”ํ˜• ๋ชจ๋“œ ์‹คํ–‰)")
201
+
202
+ args = parser.parse_args()
203
+
204
+ embeddings = get_embeddings(device=args.device)
205
+ vectorstore = load_vector_store(embeddings, load_path=args.vector_store)
206
+ llm = load_llama_model()
207
+
208
+ qa_chain = build_rag_chain(llm, vectorstore, language=args.language, k=args.k)
209
+
210
+ print("๐ŸŸข RAG ํŽ˜์ด์ง€ ๋ฒˆํ˜ธ ์ง€์› ์‹œ์Šคํ…œ ์ค€๋น„ ์™„๋ฃŒ!")
211
+
212
+ if args.query:
213
+ ask_question_with_pages(qa_chain, args.query)
214
+ else:
215
+ print("๐Ÿ’ฌ ๋Œ€ํ™”ํ˜• ๋ชจ๋“œ ์‹œ์ž‘ (์ข…๋ฃŒํ•˜๋ ค๋ฉด 'exit', 'quit', '์ข…๋ฃŒ' ์ž…๋ ฅ)")
216
+ while True:
217
+ try:
218
+ query = input("\n์งˆ๋ฌธ: ").strip()
219
+ if query.lower() in ["exit", "quit", "์ข…๋ฃŒ"]:
220
+ break
221
+ if query: # ๋นˆ ์ž…๋ ฅ ๋ฐฉ์ง€
222
+ ask_question_with_pages(qa_chain, query)
223
+ except KeyboardInterrupt:
224
+ print("\n\nํ”„๋กœ๊ทธ๋žจ์„ ์ข…๋ฃŒํ•ฉ๋‹ˆ๋‹ค.")
225
+ break
226
+ except Exception as e:
227
+ print(f"โ— ์˜ค๋ฅ˜ ๋ฐœ์ƒ: {e}\n๋‹ค์‹œ ์‹œ๋„ํ•ด์ฃผ์„ธ์š”.")
requirements.txt ADDED
@@ -0,0 +1,18 @@
+ langchain>=0.1.0
+ langchain-community>=0.0.13
+ langchain-core>=0.1.0
+ langchain-huggingface>=0.0.2
+ sentence-transformers>=2.2.2
+ pypdf>=3.15.1
+ faiss-cpu>=1.7.4
+ transformers>=4.36.0
+ accelerate>=0.21.0
+ torch>=2.0.0
+ peft>=0.7.0
+ bitsandbytes>=0.41.0
+ tqdm>=4.65.0
+ python-docx>=0.8.11
+ olefile>=0.46
+ uvicorn
+ fastapi
+ openai
vector_store.py ADDED
@@ -0,0 +1,104 @@
1
+ #!/usr/bin/env python
2
+ # -*- coding: utf-8 -*-
3
+
4
+ """
5
+ ๋ฒกํ„ฐ ์Šคํ† ์–ด ๋ชจ๋“ˆ: ๋ฌธ์„œ ์ž„๋ฒ ๋”ฉ ์ƒ์„ฑ ๋ฐ ๋ฒกํ„ฐ ์Šคํ† ์–ด ๊ตฌ์ถ•
6
+ ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ ์ ์šฉ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰ ์ตœ์ ํ™” + ๊ธด ์ฒญํฌ ์˜ค๋ฅ˜ ๋ฐฉ์ง€
7
+ """
8
+
9
+ import os
10
+ import argparse
11
+ import logging
12
+ from tqdm import tqdm
13
+ from langchain_community.vectorstores import FAISS
14
+ from langchain.schema.document import Document
15
+ from langchain_huggingface import HuggingFaceEmbeddings
16
+
17
+ # ๋กœ๊น… ์„ค์ • - ๋ถˆํ•„์š”ํ•œ ๊ฒฝ๊ณ  ๋ฉ”์‹œ์ง€ ์ œ๊ฑฐ
18
+ logging.getLogger().setLevel(logging.ERROR)
19
+
20
+ def get_embeddings(model_name="intfloat/multilingual-e5-large-instruct", device="cuda"):
21
+ return HuggingFaceEmbeddings(
22
+ model_name=model_name,
23
+ model_kwargs={'device': device},
24
+ encode_kwargs={'normalize_embeddings': True}
25
+ )
26
+
27
+ def build_vector_store_batch(documents, embeddings, save_path="vector_db", batch_size=16):
28
+ if not documents:
29
+ raise ValueError("๋ฌธ์„œ๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค. ๋ฌธ์„œ๊ฐ€ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๋กœ๋“œ๋˜์—ˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”.")
30
+
31
+ texts = [doc.page_content for doc in documents]
32
+ metadatas = [doc.metadata for doc in documents]
33
+
34
+ # ๋ฐฐ์น˜๋กœ ๋ถ„ํ• 
35
+ batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
36
+ metadata_batches = [metadatas[i:i + batch_size] for i in range(0, len(metadatas), batch_size)]
37
+
38
+ print(f"Processing {len(batches)} batches with size {batch_size}")
39
+ print(f"Initializing vector store with batch 1/{len(batches)}")
40
+
41
+ # โœ… from_texts ๋Œ€์‹  from_documents ์‚ฌ์šฉ (๊ธธ์ด ๋ฌธ์ œ ๋ฐฉ์ง€)
42
+ first_docs = [
43
+ Document(page_content=text, metadata=meta)
44
+ for text, meta in zip(batches[0], metadata_batches[0])
45
+ ]
46
+ vectorstore = FAISS.from_documents(first_docs, embeddings)
47
+
48
+ # ๋‚˜๋จธ์ง€ ๋ฐฐ์น˜ ์ถ”๊ฐ€
49
+ for i in tqdm(range(1, len(batches)), desc="Processing batches"):
50
+ try:
51
+ docs_batch = [
52
+ Document(page_content=text, metadata=meta)
53
+ for text, meta in zip(batches[i], metadata_batches[i])
54
+ ]
55
+ vectorstore.add_documents(docs_batch)
56
+
57
+ if i % 10 == 0:
58
+ temp_save_path = f"{save_path}_temp"
59
+ os.makedirs(os.path.dirname(temp_save_path) if os.path.dirname(temp_save_path) else '.', exist_ok=True)
60
+ vectorstore.save_local(temp_save_path)
61
+ print(f"Temporary vector store saved to {temp_save_path} after batch {i}")
62
+
63
+ except Exception as e:
64
+ print(f"Error processing batch {i}: {e}")
65
+ error_save_path = f"{save_path}_error_at_batch_{i}"
66
+ os.makedirs(os.path.dirname(error_save_path) if os.path.dirname(error_save_path) else '.', exist_ok=True)
67
+ vectorstore.save_local(error_save_path)
68
+ print(f"Partial vector store saved to {error_save_path}")
69
+ raise
70
+
71
+ os.makedirs(os.path.dirname(save_path) if os.path.dirname(save_path) else '.', exist_ok=True)
72
+ vectorstore.save_local(save_path)
73
+ print(f"Vector store saved to {save_path}")
74
+
75
+ return vectorstore
76
+
77
+ def load_vector_store(embeddings, load_path="vector_db"):
78
+ if not os.path.exists(load_path):
79
+ raise FileNotFoundError(f"๋ฒกํ„ฐ ์Šคํ† ์–ด๋ฅผ ์ฐพ์„ ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค: {load_path}")
80
+ return FAISS.load_local(load_path, embeddings, allow_dangerous_deserialization=True)
81
+
82
+
83
+ if __name__ == "__main__":
84
+ parser = argparse.ArgumentParser(description="๋ฒกํ„ฐ ์Šคํ† ์–ด ๊ตฌ์ถ•")
85
+ parser.add_argument("--folder", type=str, default="dataset", help="๋ฌธ์„œ๊ฐ€ ์žˆ๋Š” ํด๋” ๊ฒฝ๋กœ")
86
+ parser.add_argument("--save_path", type=str, default="vector_db", help="๋ฒกํ„ฐ ์Šคํ† ์–ด ์ €์žฅ ๊ฒฝ๋กœ")
87
+ parser.add_argument("--batch_size", type=int, default=16, help="๋ฐฐ์น˜ ํฌ๊ธฐ")
88
+ parser.add_argument("--model_name", type=str, default="intfloat/multilingual-e5-large-instruct", help="์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ ์ด๋ฆ„")
89
+ parser.add_argument("--device", type=str, default="cuda", help="์‚ฌ์šฉํ•  ๋””๋ฐ”์ด์Šค ('cuda' ๋˜๋Š” 'cpu')")
90
+
91
+ args = parser.parse_args()
92
+
93
+ # ๋ฌธ์„œ ์ฒ˜๋ฆฌ ๋ชจ๋“ˆ import
94
+ from document_processor import load_documents, split_documents
95
+
96
+ # ๋ฌธ์„œ ๋กœ๋“œ ๋ฐ ๋ถ„ํ• 
97
+ documents = load_documents(args.folder)
98
+ chunks = split_documents(documents, chunk_size=800, chunk_overlap=100)
99
+
100
+ # ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ ๋กœ๋“œ
101
+ embeddings = get_embeddings(model_name=args.model_name, device=args.device)
102
+
103
+ # ๋ฒกํ„ฐ ์Šคํ† ์–ด ๊ตฌ์ถ•
104
+ build_vector_store_batch(chunks, embeddings, args.save_path, args.batch_size)
vector_store_test.py ADDED
@@ -0,0 +1,121 @@
1
+ #!/usr/bin/env python
2
+ # -*- coding: utf-8 -*-
3
+
4
+ """
5
+ ๋ฒกํ„ฐ ์Šคํ† ์–ด ๋ชจ๋“ˆ: ๋ฌธ์„œ ์ž„๋ฒ ๋”ฉ ์ƒ์„ฑ ๋ฐ ๋ฒกํ„ฐ ์Šคํ† ์–ด ๊ตฌ์ถ•
6
+ ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ ์ ์šฉ + ์ฒญํฌ ๊ธธ์ด ํ™•์ธ ์ถ”๊ฐ€
7
+ """
8
+
9
+ import os
10
+ import argparse
11
+ import logging
12
+ from tqdm import tqdm
13
+ from langchain_community.vectorstores import FAISS
14
+ from langchain.schema.document import Document
15
+ from langchain_huggingface import HuggingFaceEmbeddings
16
+ from e5_embeddings import E5Embeddings
17
+
18
+ # ๋กœ๊น… ์„ค์ •
19
+ logging.getLogger().setLevel(logging.ERROR)
20
+
21
+ def get_embeddings(model_name="intfloat/multilingual-e5-large-instruct", device="cuda"):
22
+ print(f"[INFO] ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ ๋””๋ฐ”์ด์Šค: {device}")
23
+ return E5Embeddings(
24
+ model_name=model_name,
25
+ model_kwargs={'device': device},
26
+ encode_kwargs={'normalize_embeddings': True}
27
+ )
28
+
29
+ def build_vector_store_batch(documents, embeddings, save_path="vector_db", batch_size=4):
30
+ if not documents:
31
+ raise ValueError("๋ฌธ์„œ๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค. ๋ฌธ์„œ๊ฐ€ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๋กœ๋“œ๋˜์—ˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”.")
32
+
33
+ texts = [doc.page_content for doc in documents]
34
+ metadatas = [doc.metadata for doc in documents]
35
+
36
+ # ์ฒญํฌ ๊ธธ์ด ์ถœ๋ ฅ
37
+ lengths = [len(t) for t in texts]
38
+ print(f"๐Ÿ’ก ์ฒญํฌ ์ˆ˜: {len(texts)}")
39
+ print(f"๐Ÿ’ก ๊ฐ€์žฅ ๊ธด ์ฒญํฌ ๊ธธ์ด: {max(lengths)} chars")
40
+ print(f"๐Ÿ’ก ํ‰๊ท  ์ฒญํฌ ๊ธธ์ด: {sum(lengths) // len(lengths)} chars")
41
+
42
+ # ๋ฐฐ์น˜๋กœ ๋‚˜๋ˆ„๊ธฐ
43
+ batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
44
+ metadata_batches = [metadatas[i:i + batch_size] for i in range(0, len(metadatas), batch_size)]
45
+
46
+ print(f"Processing {len(batches)} batches with size {batch_size}")
47
+ print(f"Initializing vector store with batch 1/{len(batches)}")
48
+
49
+ # โœ… from_documents ์‚ฌ์šฉ
50
+ first_docs = [
51
+ Document(page_content=text, metadata=meta)
52
+ for text, meta in zip(batches[0], metadata_batches[0])
53
+ ]
54
+ vectorstore = FAISS.from_documents(first_docs, embeddings)
55
+
56
+ for i in tqdm(range(1, len(batches)), desc="Processing batches"):
57
+ try:
58
+ docs_batch = [
59
+ Document(page_content=text, metadata=meta)
60
+ for text, meta in zip(batches[i], metadata_batches[i])
61
+ ]
62
+ vectorstore.add_documents(docs_batch)
63
+
64
+ if i % 10 == 0:
65
+ temp_save_path = f"{save_path}_temp"
66
+ os.makedirs(os.path.dirname(temp_save_path) if os.path.dirname(temp_save_path) else '.', exist_ok=True)
67
+ vectorstore.save_local(temp_save_path)
68
+ print(f"Temporary vector store saved to {temp_save_path} after batch {i}")
69
+
70
+ except Exception as e:
71
+ print(f"Error processing batch {i}: {e}")
72
+ error_save_path = f"{save_path}_error_at_batch_{i}"
73
+ os.makedirs(os.path.dirname(error_save_path) if os.path.dirname(error_save_path) else '.', exist_ok=True)
74
+ vectorstore.save_local(error_save_path)
75
+ print(f"Partial vector store saved to {error_save_path}")
76
+ raise
77
+
78
+ os.makedirs(os.path.dirname(save_path) if os.path.dirname(save_path) else '.', exist_ok=True)
79
+ vectorstore.save_local(save_path)
80
+ print(f"Vector store saved to {save_path}")
81
+
82
+ return vectorstore
83
+
84
+ def load_vector_store(embeddings, load_path="vector_db"):
85
+ if not os.path.exists(load_path):
86
+ raise FileNotFoundError(f"๋ฒกํ„ฐ ์Šคํ† ์–ด๋ฅผ ์ฐพ์„ ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค: {load_path}")
87
+ return FAISS.load_local(load_path, embeddings, allow_dangerous_deserialization=True)
88
+
89
+ if __name__ == "__main__":
90
+ parser = argparse.ArgumentParser(description="๋ฒกํ„ฐ ์Šคํ† ์–ด ๊ตฌ์ถ•")
91
+ parser.add_argument("--folder", type=str, default="final_dataset", help="๋ฌธ์„œ๊ฐ€ ์žˆ๋Š” ํด๋” ๊ฒฝ๋กœ")
92
+ parser.add_argument("--save_path", type=str, default="vector_db", help="๋ฒกํ„ฐ ์Šคํ† ์–ด ์ €์žฅ ๊ฒฝ๋กœ")
93
+ parser.add_argument("--batch_size", type=int, default=4, help="๋ฐฐ์น˜ ํฌ๊ธฐ")
94
+ parser.add_argument("--model_name", type=str, default="intfloat/multilingual-e5-large-instruct", help="์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ ์ด๋ฆ„")
95
+ # parser.add_argument("--device", type=str, default="cuda", help="์‚ฌ์šฉํ•  ๋””๋ฐ”์ด์Šค ('cuda' ๋˜๋Š” 'cpu')")
96
+ parser.add_argument("--device", type=str, default="cuda", help="์‚ฌ์šฉํ•  ๋””๋ฐ”์ด์Šค ('cuda' ๋˜๋Š” 'cpu' ๋˜๋Š” 'cuda:1')")
97
+
98
+ args = parser.parse_args()
99
+
100
+ # ๋ฌธ์„œ ์ฒ˜๋ฆฌ ๋ชจ๋“ˆ import
101
+ from document_processor_image_test import load_documents, split_documents
102
+
103
+ documents = load_documents(args.folder)
104
+ chunks = split_documents(documents, chunk_size=800, chunk_overlap=100)
105
+
106
+ print(f"[DEBUG] ๋ฌธ์„œ ๋กœ๋”ฉ ๋ฐ ์ฒญํฌ ๋ถ„ํ•  ์™„๋ฃŒ, ์ž„๋ฒ ๋”ฉ ๋‹จ๊ณ„ ์ง„์ž… ์ „")
107
+ print(f"[INFO] ์„ ํƒ๋œ ๋””๋ฐ”์ด์Šค: {args.device}")
108
+
109
+ try:
110
+ embeddings = get_embeddings(
111
+ model_name=args.model_name,
112
+ device=args.device
113
+ )
114
+ print(f"[DEBUG] ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ ์ƒ์„ฑ ์™„๋ฃŒ")
115
+ except Exception as e:
116
+ print(f"[ERROR] ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ ์ƒ์„ฑ ๏ฟฝ๏ฟฝ ์—๋Ÿฌ ๋ฐœ์ƒ: {e}")
117
+ import traceback; traceback.print_exc()
118
+ exit(1)
119
+
120
+ build_vector_store_batch(chunks, embeddings, args.save_path, args.batch_size)
121
+