Upload folder using huggingface_hub

- .env +1 -0
- .gitattributes +5 -33
- .gitignore +8 -0
- Dockerfile +14 -0
- README.md +245 -10
- concat_vector_store.py +46 -0
- concat_vector_store_์ ๋ฆฌ๋.py +55 -0
- dataset/์ถ๋ ฅ HWPํ์ผ ์์/(๋ํฅ๋ณด๊ณ ) ๊ต์ก๋ถ K-์๋ํ์ธ ์์คํ ์ ๋ถ ๋ฐ์ดํฐ์ผํฐ ํ์ฌ ๊ฒํ .hwp +3 -0
- dataset/์ถ๋ ฅ HWPํ์ผ ์์/23.05.10 ์ ๋ผ๋ถ๋ ๋ฐ์ดํฐ์ผํฐ ๊ฑด๋ฆฝ ๊ฐ๋ฅ ๋ถ์ง.hwp +3 -0
- dataset/์ถ๋ ฅ HWPํ์ผ ์์/25.02.28 ํฅํ ๊ณต๊ณต ๋ฏผ๊ฐ๋ฌผ๋ ๋ ธ๋ ฅ ํฌ์ธํธ.hwp +3 -0
- dataset/์ถ๋ ฅ HWPํ์ผ ์์/25.03.07 ์์ฑํ AI ์์คํ ๊ตฌ์ถ์ ์ํ ์ ๋ฌดํ์ฝ์ ๊ณํ.hwp +3 -0
- dataset/์ถ์ฅ๊ฒฐ๊ณผ๋ณด๊ณ /(1) 24.08.21 ์นด์นด์ค ์ํ ๋ น์ ํ๋ณธ1.txt +0 -0
- dataset/์ถ์ฅ๊ฒฐ๊ณผ๋ณด๊ณ /(4) 24.08.21 ์นด์นด์ค,์ํ ๋ฉด๋ด ๊ฒฐ๊ณผ๋ณด๊ณ F.hwp +3 -0
- docker-compose.yml +12 -0
- document_processor_image_test.py +440 -0
- e5_embeddings.py +9 -0
- llm_loader.py +24 -0
- rag_server.py +197 -0
- rag_system.py +227 -0
- requirements.txt +18 -0
- vector_store.py +104 -0
- vector_store_test.py +121 -0
.env
ADDED
@@ -0,0 +1 @@
HUGGINGFACE_TOKEN=<Huggingface_Token>
.gitattributes
CHANGED
@@ -1,35 +1,7 @@
-*. filter=lfs diff=lfs merge=lfs -text
-*.arrow filter=lfs diff=lfs merge=lfs -text
-*.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
+*.faiss filter=lfs diff=lfs merge=lfs -text
 *.pkl filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
-*. filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text
+*.pdf filter=lfs diff=lfs merge=lfs -text
+vector_db/*.faiss filter=lfs diff=lfs merge=lfs -text
+vector_db/*.pkl filter=lfs diff=lfs merge=lfs -text
+dataset/* filter=lfs diff=lfs merge=lfs -text
.gitignore
ADDED
@@ -0,0 +1,8 @@
vector_db/
*.index
*.faiss
*.pkl
*.pdf
*.faiss
*.pkl
*.hwpx
Dockerfile
ADDED
@@ -0,0 +1,14 @@
FROM python:3.10

WORKDIR /app
COPY . /app

# Create the /tmp directory and grant permissions
RUN mkdir -p /tmp && chmod 1777 /tmp

# Run pip install with the TMPDIR environment variable set
RUN TMPDIR=/tmp pip install --upgrade pip && TMPDIR=/tmp pip install --no-cache-dir -r requirements.txt

EXPOSE 8500

CMD ["uvicorn", "rag_server:app", "--host", "0.0.0.0", "--port", "8500"]
README.md
CHANGED
@@ -1,10 +1,245 @@
# Open WebUI RAG System

A Korean document-based RAG (Retrieval-Augmented Generation) system that integrates with Open WebUI. It supports PDF and HWPX files and provides page-accurate information extraction and source tracking.

## Key Features

### 1. Document Processing
- **PDF documents**: PyMuPDF-based extraction of text, tables, and image OCR
- **HWPX documents**: Section-wise extraction of text, tables, and images via XML parsing
- **Page-level processing**: Each document is split precisely into pages/sections
- **Multiple content types**: Body text, tables, and OCR text are identified and handled separately

### 2. Vector Search
- **E5-Large embeddings**: High-performance multilingual embedding model
- **FAISS vector store**: Fast similarity search
- **Batch processing**: Optimized for large document sets
- **Chunk splitting**: Overlapping chunks to preserve context

### 3. RAG System
- **Refine chain**: Accurate answers built by consulting multiple documents
- **Source tracking**: Precise citations including page numbers and document names
- **Hallucination prevention**: Strict prompts that use only information stated in the documents

### 4. API Server
- **FastAPI-based**: Asynchronous processing
- **OpenAI-compatible**: Provides a `/v1/chat/completions` endpoint
- **Streaming support**: Real-time answer generation
- **Open WebUI integration**: Connects directly, no plugins required

## System Requirements

### Hardware
- **GPU**: CUDA support (for embedding and LLM inference)
- **RAM**: At least 16GB (more is needed for large document sets)
- **Storage**: 10GB+ for models and the vector store

### Software
- Python 3.8+
- CUDA 11.7+ (when using a GPU)
- Tesseract OCR

## Installation

### 1. Clone the repository
```bash
git clone <repository-url>
cd open-webui-rag-system
```

### 2. Install dependencies
```bash
pip install -r requirements.txt
```

### 3. Install Tesseract OCR
**Ubuntu/Debian:**
```bash
sudo apt-get install tesseract-ocr tesseract-ocr-kor
```

**Windows:**
- Install from the [Tesseract official page](https://github.com/UB-Mannheim/tesseract/wiki)

### 4. Configure the LLM server
Set the LLM server to use in `llm_loader.py`:
```python
# Example using the EXAONE model
base_url="http://vllm:8000/v1"
model="LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct"
openai_api_key="token-abc123"
```

## How to Run

### 1. Prepare documents
Place the documents to process in the `dataset_test` folder:
```
dataset_test/
├── document1.pdf
├── document2.hwpx
└── document3.pdf
```

### 2. Process documents and build the vector store
```bash
python document_processor_image_test.py
```
Or use the vector store build script:
```bash
python vector_store_test.py --folder dataset_test --save_path faiss_index_pymupdf
```

### 3. Run the RAG server
```bash
python rag_server.py
```
The server runs on port 8000 by default.

### 4. Connect to Open WebUI
In the Open WebUI model settings, configure:
- **API Base URL**: `http://localhost:8000/v1`
- **API Key**: `token-abc123`
- **Model Name**: `rag`

### 5. Standalone testing
Ask a question directly from the command line:
```bash
python rag_system.py --query "content you want to find in the documents"
```

Interactive mode:
```bash
python rag_system.py
```

## Project Structure

```
open-webui-rag-system/
├── document_processor_image_test.py  # Main document processing module
├── vector_store_test.py              # Vector store build module
├── rag_system.py                     # RAG chain construction and Q&A
├── rag_server.py                     # FastAPI server
├── llm_loader.py                     # LLM model loader
├── e5_embeddings.py                  # E5 embedding module
├── requirements.txt                  # Dependency list
├── dataset_test/                     # Document folder
└── faiss_index_pymupdf/              # Generated vector store
```

## Core Modules

### document_processor_image_test.py
- Extracts text, tables, and images page by page from PDF and HWPX files
- Multi-layer processing with PyMuPDF, pdfplumber, and pytesseract
- Preserves section-level metadata and page information

### vector_store_test.py
- Vectorization with the E5-Large embedding model
- Efficient vector store construction with FAISS
- Memory optimization through batch processing

### rag_system.py
- Multi-step answer generation with a Refine chain
- Prompts that prevent page-number hallucination
- Source tracking and metadata management

### rag_server.py
- OpenAI-compatible API endpoints
- Streaming response support
- Seamless integration with Open WebUI

## Configuration Options

### Document processing options
- **Chunk size**: `chunk_size=500` (default)
- **Chunk overlap**: `chunk_overlap=100` (default)
- **OCR language**: `lang='kor+eng'` (Korean + English)

### Search options
- **Number of retrieved documents**: `k=7` (default)
- **Embedding model**: `intfloat/multilingual-e5-large-instruct`
- **Device**: `cuda` or `cpu`

### LLM settings
Supported models:
- LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct
- meta-llama/Meta-Llama-3-8B-Instruct
- Other OpenAI-compatible models

## Troubleshooting

### 1. CUDA out of memory
```bash
# Run in CPU mode
python vector_store_test.py --device cpu
```

### 2. Korean font issues
```bash
# Install Korean fonts (Ubuntu)
sudo apt-get install fonts-nanum
```

### 3. Tesseract path issues
```python
# Set the pytesseract path manually
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'
```

### 4. Model download failures
```bash
# Check the Hugging Face cache path
export HF_HOME=/path/to/huggingface/cache
```

## API Usage Examples

### Direct question
```bash
curl -X POST "http://localhost:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{"question": "Please find the budget-related content in the documents"}'
```

### OpenAI-compatible API
```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "rag",
    "messages": [{"role": "user", "content": "What is the budget status?"}],
    "stream": false
  }'
```

## Performance Optimization

### 1. Adjust the batch size
```bash
python vector_store_test.py --batch_size 32  # adjust to GPU memory
```

### 2. Tune the chunk size
```python
# Increase the chunk size for long documents
chunks = split_documents(docs, chunk_size=800, chunk_overlap=150)
```

### 3. Adjust the number of retrieved documents
```bash
python rag_system.py --k 10  # consult more documents
```

## License

MIT License

## Contributing

1. Fork the repository
2. Create your feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request
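For Python clients, the same `/v1/chat/completions` endpoint shown in the API Usage section can be called with the `openai` package that is already listed in `requirements.txt`. The snippet below is a hedged sketch, not part of the repository; it assumes the server from `rag_server.py` is reachable at `http://localhost:8000/v1` with the API key and model name given above.

```python
# Minimal sketch: query the RAG server through its OpenAI-compatible API.
# Assumes a locally running server on port 8000 with key "token-abc123"
# and the "rag" model name from the README.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

response = client.chat.completions.create(
    model="rag",
    messages=[{"role": "user", "content": "What is the budget status?"}],
    stream=False,
)
print(response.choices[0].message.content)
```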
concat_vector_store.py
ADDED
@@ -0,0 +1,46 @@
import os
from langchain.schema.document import Document
from e5_embeddings import E5Embeddings
from langchain_community.vectorstores import FAISS

from document_processor_image import load_documents, split_documents  # these functions are required

# Path settings
NEW_FOLDER = "25.05.28 RAG์ฉ 2์ฐจ ์ ๋ฌดํธ๋ ์ทจํฉ๋ณธ"
#NEW_FOLDER = "์์"
VECTOR_STORE_PATH = "vector_db"

# 1. Load the embedding model
def get_embeddings(model_name="intfloat/multilingual-e5-large-instruct", device="cuda"):
    return E5Embeddings(
        model_name=model_name,
        model_kwargs={'device': device},
        encode_kwargs={'normalize_embeddings': True}
    )

# 2. Load the existing vector store
def load_vector_store(embeddings, load_path="vector_db"):
    if not os.path.exists(load_path):
        raise FileNotFoundError(f"Vector store not found: {load_path}")
    return FAISS.load_local(load_path, embeddings, allow_dangerous_deserialization=True)

# 3. Embed and add the new documents
def add_new_documents_to_vector_store(new_folder, vectorstore, embeddings):
    print(f"Loading new documents: {new_folder}")
    new_docs = load_documents(new_folder)
    new_chunks = split_documents(new_docs, chunk_size=800, chunk_overlap=100)

    print(f"New chunk count: {len(new_chunks)}")
    print(f"Vector count before adding: {vectorstore.index.ntotal}")
    vectorstore.add_documents(new_chunks)
    print(f"Vector count after adding: {vectorstore.index.ntotal}")

    print("New documents have been added to the vector store.")

# 4. Run everything
if __name__ == "__main__":
    embeddings = get_embeddings()
    vectorstore = load_vector_store(embeddings, VECTOR_STORE_PATH)
    add_new_documents_to_vector_store(NEW_FOLDER, vectorstore, embeddings)
    vectorstore.save_local(VECTOR_STORE_PATH)
    print(f"Vector store saved: {VECTOR_STORE_PATH}")
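One way to sanity-check the merged index after running this script is a quick similarity search against the saved store. The snippet below is a sketch using the same helpers as above; the query string and the CPU device are only examples.

```python
# Sketch: reload the saved index and run a small search against it.
from e5_embeddings import E5Embeddings
from langchain_community.vectorstores import FAISS

embeddings = E5Embeddings(
    model_name="intfloat/multilingual-e5-large-instruct",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)
store = FAISS.load_local("vector_db", embeddings, allow_dangerous_deserialization=True)

for doc in store.similarity_search("data center construction", k=3):  # example query
    print(doc.metadata.get("source"), "|", doc.page_content[:80])
```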
concat_vector_store_์ ๋ฆฌ๋.py
ADDED
@@ -0,0 +1,55 @@
import os
import glob
from langchain.schema.document import Document
from e5_embeddings import E5Embeddings
from langchain_community.vectorstores import FAISS
from document_processor import load_pdf_with_pymupdf, split_documents

# Path settings
FOLDER = "25.05.28 RAG์ฉ 2์ฐจ ์ ๋ฌดํธ๋ ์ทจํฉ๋ณธ"
VECTOR_STORE_PATH = "vector_db"

# 1. Load the embedding model
def get_embeddings(model_name="intfloat/multilingual-e5-large-instruct", device="cuda"):
    return E5Embeddings(
        model_name=model_name,
        model_kwargs={'device': device},
        encode_kwargs={'normalize_embeddings': True}
    )

# 2. Load the existing vector store
def load_vector_store(embeddings, load_path=VECTOR_STORE_PATH):
    if not os.path.exists(load_path):
        raise FileNotFoundError(f"Vector store not found: {load_path}")
    return FAISS.load_local(load_path, embeddings, allow_dangerous_deserialization=True)

# 3. Embed only the cleaned PDFs
def embed_cleaned_pdfs(folder, vectorstore, embeddings):
    pattern = os.path.join(folder, "์ ๋ฆฌ๋*.pdf")
    pdf_files = glob.glob(pattern)
    print(f"Target PDF count: {len(pdf_files)}")

    new_documents = []
    for pdf_path in pdf_files:
        print(f"Processing: {pdf_path}")
        text = load_pdf_with_pymupdf(pdf_path)
        if text.strip():
            new_documents.append(Document(page_content=text, metadata={"source": pdf_path}))

    print(f"Document count: {len(new_documents)}")

    chunks = split_documents(new_documents, chunk_size=300, chunk_overlap=50)
    print(f"Chunk count: {len(chunks)}")

    print(f"Vector count before adding: {vectorstore.index.ntotal}")
    vectorstore.add_documents(chunks)
    print(f"Vector count after adding: {vectorstore.index.ntotal}")

    vectorstore.save_local(VECTOR_STORE_PATH)
    print(f"Save complete: {VECTOR_STORE_PATH}")

# Run
if __name__ == "__main__":
    embeddings = get_embeddings()
    vectorstore = load_vector_store(embeddings)
    embed_cleaned_pdfs(FOLDER, vectorstore, embeddings)
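Because this script appends whole-PDF documents that carry only a `source` field in their metadata, re-running it adds duplicate vectors. A defensive sketch, not part of the repository, that lists which files are already present so a re-run could skip them:

```python
# Sketch: collect the source paths already indexed in the saved store.
from e5_embeddings import E5Embeddings
from langchain_community.vectorstores import FAISS

embeddings = E5Embeddings(
    model_name="intfloat/multilingual-e5-large-instruct",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)
store = FAISS.load_local("vector_db", embeddings, allow_dangerous_deserialization=True)

indexed_sources = {
    doc.metadata.get("source")
    for doc in store.docstore._dict.values()  # private attribute; acceptable for a local check
}
print(f"{len(indexed_sources)} files already indexed")
```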
dataset/์ถ๋ ฅ HWPํ์ผ ์์/(๋ํฅ๋ณด๊ณ ) ๊ต์ก๋ถ K-์๋ํ์ธ ์์คํ ์ ๋ถ ๋ฐ์ดํฐ์ผํฐ ํ์ฌ ๊ฒํ .hwp
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b9ffae391271afee3a1d65cf2a46c58eabeca3ba9305ac0a987fb034e63b1708
size 110080

dataset/์ถ๋ ฅ HWPํ์ผ ์์/23.05.10 ์ ๋ผ๋ถ๋ ๋ฐ์ดํฐ์ผํฐ ๊ฑด๋ฆฝ ๊ฐ๋ฅ ๋ถ์ง.hwp
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2c8bbb64a8ec39a2bfcc373ba0b7bacb4dc5fb25200c2eda08a3abcea733368f
size 651264

dataset/์ถ๋ ฅ HWPํ์ผ ์์/25.02.28 ํฅํ ๊ณต๊ณต ๋ฏผ๊ฐ๋ฌผ๋ ๋ ธ๋ ฅ ํฌ์ธํธ.hwp
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2a89a3d34664fd68852dc1c4129efe73752dac17272d8238653b80d49309f88b
size 101376

dataset/์ถ๋ ฅ HWPํ์ผ ์์/25.03.07 ์์ฑํ AI ์์คํ ๊ตฌ์ถ์ ์ํ ์ ๋ฌดํ์ฝ์ ๊ณํ.hwp
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:39286f70b90a2e4207947785576467f99cf33081aed24edd0f588c8d10f07cbc
size 817152

dataset/์ถ์ฅ๊ฒฐ๊ณผ๋ณด๊ณ /(1) 24.08.21 ์นด์นด์ค ์ํ ๋ น์ ํ๋ณธ1.txt
ADDED
The diff for this file is too large to render. See raw diff.

dataset/์ถ์ฅ๊ฒฐ๊ณผ๋ณด๊ณ /(4) 24.08.21 ์นด์นด์ค,์ํ ๋ฉด๋ด ๊ฒฐ๊ณผ๋ณด๊ณ F.hwp
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:930e7fde652397ee17ea6cfbbe1b993b7166fc3e350f9aa6c999f982baad3944
size 120832
docker-compose.yml
ADDED
@@ -0,0 +1,12 @@
version: '3.8'

services:
  rag-api:
    build: .
    ports:
      - "8500:8500"
    volumes:
      - ./dataset:/app/dataset
    environment:
      - PYTHONPATH=/app
    command: uvicorn rag_server:app --host 0.0.0.0 --port 8500 --reload
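After `docker compose up`, a quick way to confirm the API container is serving is to hit the `/v1/models` route that `rag_server.py` exposes. The check below is a sketch using only the Python standard library; it assumes the compose port mapping above (8500 on localhost).

```python
# Sketch: poll the running container's model-listing endpoint.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8500/v1/models", timeout=10) as resp:
    payload = json.load(resp)

# The server registers a single model with id "rag".
print([m["id"] for m in payload["data"]])
```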
document_processor_image_test.py
ADDED
@@ -0,0 +1,440 @@
import os
import re
import glob
import time
from collections import defaultdict

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# PyMuPDF library
try:
    import fitz  # PyMuPDF
    PYMUPDF_AVAILABLE = True
    print("PyMuPDF library available")
except ImportError:
    PYMUPDF_AVAILABLE = False
    print("PyMuPDF library is not installed. Install it with: pip install PyMuPDF")

# For PDF processing
import pytesseract
from PIL import Image
from pdf2image import convert_from_path
import pdfplumber
from pymupdf4llm import LlamaMarkdownReader

# --------------------------------
# Logging
# --------------------------------

def log(msg):
    print(f"[{time.strftime('%H:%M:%S')}] {msg}")

# --------------------------------
# Text cleaning functions
# --------------------------------

def clean_text(text):
    return re.sub(r"[^\uAC00-\uD7A3\u1100-\u11FF\u3130-\u318F\w\s.,!?\"'()$:\-]", "", text)

def apply_corrections(text):
    # Common mojibake sequences mapped back to readable text
    corrections = {
        'ยบยฉ': '์ ๋ณด', 'ร': '์', 'ยฝ': '์ด์', 'ร': '', 'ยฉ': '',
        'รขโฌโข': "'", 'รขโฌล': '"', 'รขโฌ': '"'
    }
    for k, v in corrections.items():
        text = text.replace(k, v)
    return text

# --------------------------------
# HWPX processing (section-wise only)
# --------------------------------

def load_hwpx(file_path):
    """Load an HWPX file (XML parsing only)."""
    import zipfile
    import xml.etree.ElementTree as ET
    import chardet

    log(f"HWPX section-wise processing started: {file_path}")
    start = time.time()
    documents = []

    try:
        with zipfile.ZipFile(file_path, 'r') as zip_ref:
            file_list = zip_ref.namelist()
            section_files = [f for f in file_list
                             if f.startswith('Contents/section') and f.endswith('.xml')]
            section_files.sort()  # sort as section0.xml, section1.xml, ...

            log(f"Section files found: {len(section_files)}")

            for section_idx, section_file in enumerate(section_files):
                with zip_ref.open(section_file) as xml_file:
                    raw = xml_file.read()
                    encoding = chardet.detect(raw)['encoding'] or 'utf-8'
                    try:
                        text = raw.decode(encoding)
                    except UnicodeDecodeError:
                        text = raw.decode("cp949", errors="replace")

                tree = ET.ElementTree(ET.fromstring(text))
                root = tree.getroot()

                # Find text elements regardless of namespace
                t_elements = [elem for elem in root.iter() if elem.tag.endswith('}t') or elem.tag == 't']
                body_text = ""
                for elem in t_elements:
                    if elem.text:
                        body_text += clean_text(elem.text) + " "

                # The page metadata is left empty for HWPX sections
                page_value = ""

                if body_text.strip():
                    documents.append(Document(
                        page_content=apply_corrections(body_text),
                        metadata={
                            "source": file_path,
                            "filename": os.path.basename(file_path),
                            "type": "hwpx_body",
                            "page": page_value,
                            "total_sections": len(section_files)
                        }
                    ))
                    log(f"Section text extracted (chars: {len(body_text)})")

                # Find tables
                table_elements = [elem for elem in root.iter() if elem.tag.endswith('}table') or elem.tag == 'table']
                if table_elements:
                    table_text = ""
                    for table_idx, table in enumerate(table_elements):
                        table_text += f"[Table {table_idx + 1}]\n"
                        rows = [elem for elem in table.iter() if elem.tag.endswith('}tr') or elem.tag == 'tr']
                        for row in rows:
                            row_text = []
                            cells = [elem for elem in row.iter() if elem.tag.endswith('}tc') or elem.tag == 'tc']
                            for cell in cells:
                                cell_texts = []
                                for t_elem in cell.iter():
                                    if (t_elem.tag.endswith('}t') or t_elem.tag == 't') and t_elem.text:
                                        cell_texts.append(clean_text(t_elem.text))
                                row_text.append(" ".join(cell_texts))
                            if row_text:
                                table_text += "\t".join(row_text) + "\n"

                    if table_text.strip():
                        documents.append(Document(
                            page_content=apply_corrections(table_text),
                            metadata={
                                "source": file_path,
                                "filename": os.path.basename(file_path),
                                "type": "hwpx_table",
                                "page": page_value,
                                "total_sections": len(section_files)
                            }
                        ))
                        log("Table extracted")

                # Find images
                if [elem for elem in root.iter() if elem.tag.endswith('}picture') or elem.tag == 'picture']:
                    documents.append(Document(
                        page_content="[Image included]",
                        metadata={
                            "source": file_path,
                            "filename": os.path.basename(file_path),
                            "type": "hwpx_image",
                            "page": page_value,
                            "total_sections": len(section_files)
                        }
                    ))
                    log("Image found")

    except Exception as e:
        log(f"HWPX processing error: {e}")

    duration = time.time() - start

    # Print a document summary
    if documents:
        log(f"Extracted document count: {len(documents)}")

    log(f"HWPX processing finished: {file_path} in {duration:.2f}s, {len(documents)} documents total")
    return documents

# --------------------------------
# PDF processing functions (unchanged from before)
# --------------------------------

def run_ocr_on_image(image: Image.Image, lang='kor+eng'):
    return pytesseract.image_to_string(image, lang=lang)

def extract_images_with_ocr(pdf_path, lang='kor+eng'):
    try:
        images = convert_from_path(pdf_path)
        page_ocr_data = {}
        for idx, img in enumerate(images):
            page_num = idx + 1
            text = run_ocr_on_image(img, lang=lang)
            if text.strip():
                page_ocr_data[page_num] = text.strip()
        return page_ocr_data
    except Exception as e:
        print(f"Image OCR failed: {e}")
        return {}

def extract_tables_with_pdfplumber(pdf_path):
    page_table_data = {}
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for i, page in enumerate(pdf.pages):
                page_num = i + 1
                tables = page.extract_tables()
                table_text = ""
                for t_index, table in enumerate(tables):
                    if table:
                        table_text += f"[Table {t_index+1}]\n"
                        for row in table:
                            row_text = "\t".join(cell if cell else "" for cell in row)
                            table_text += row_text + "\n"
                if table_text.strip():
                    page_table_data[page_num] = table_text.strip()
        return page_table_data
    except Exception as e:
        print(f"Table extraction failed: {e}")
        return {}

def extract_body_text_with_pages(pdf_path):
    page_body_data = {}
    try:
        pdf_processor = LlamaMarkdownReader()
        docs = pdf_processor.load_data(file_path=pdf_path)

        combined_text = ""
        for d in docs:
            if isinstance(d, dict) and "text" in d:
                combined_text += d["text"]
            elif hasattr(d, "text"):
                combined_text += d.text

        if combined_text.strip():
            chars_per_page = 2000
            start = 0
            page_num = 1

            while start < len(combined_text):
                end = start + chars_per_page
                if end > len(combined_text):
                    end = len(combined_text)

                page_text = combined_text[start:end]
                if page_text.strip():
                    page_body_data[page_num] = page_text.strip()
                    page_num += 1

                if end == len(combined_text):
                    break
                start = end - 100

    except Exception as e:
        print(f"Body extraction failed: {e}")

    return page_body_data

def load_pdf_with_metadata(pdf_path):
    """Extract per-page information from a PDF file."""
    log(f"PDF page-wise processing started: {pdf_path}")
    start = time.time()

    # First check the actual page count with PyPDFLoader
    try:
        from langchain_community.document_loaders import PyPDFLoader
        loader = PyPDFLoader(pdf_path)
        pdf_pages = loader.load()
        actual_total_pages = len(pdf_pages)
        log(f"Actual page count from PyPDFLoader: {actual_total_pages}")
    except Exception as e:
        log(f"PyPDFLoader page count check failed: {e}")
        actual_total_pages = 1

    try:
        page_tables = extract_tables_with_pdfplumber(pdf_path)
    except Exception as e:
        page_tables = {}
        print(f"Table extraction failed: {e}")

    try:
        page_ocr = extract_images_with_ocr(pdf_path)
    except Exception as e:
        page_ocr = {}
        print(f"Image OCR failed: {e}")

    try:
        page_body = extract_body_text_with_pages(pdf_path)
    except Exception as e:
        page_body = {}
        print(f"Body extraction failed: {e}")

    duration = time.time() - start
    log(f"PDF page-wise processing finished: {pdf_path} in {duration:.2f}s")

    # Determine the total page count
    all_pages = set(page_tables.keys()) | set(page_ocr.keys()) | set(page_body.keys())
    if all_pages:
        max_extracted_page = max(all_pages)
        # Use the larger of the actual page count and the highest extracted page
        total_pages = max(actual_total_pages, max_extracted_page)
    else:
        total_pages = actual_total_pages

    log(f"Final total page count: {total_pages}")

    docs = []

    for page_num in sorted(all_pages):
        if page_num in page_tables and page_tables[page_num].strip():
            docs.append(Document(
                page_content=clean_text(apply_corrections(page_tables[page_num])),
                metadata={
                    "source": pdf_path,
                    "filename": os.path.basename(pdf_path),
                    "type": "table",
                    "page": page_num,
                    "total_pages": total_pages
                }
            ))
            log(f"Page {page_num}: table extracted")

        if page_num in page_body and page_body[page_num].strip():
            docs.append(Document(
                page_content=clean_text(apply_corrections(page_body[page_num])),
                metadata={
                    "source": pdf_path,
                    "filename": os.path.basename(pdf_path),
                    "type": "body",
                    "page": page_num,
                    "total_pages": total_pages
                }
            ))
            log(f"Page {page_num}: body extracted")

        if page_num in page_ocr and page_ocr[page_num].strip():
            docs.append(Document(
                page_content=clean_text(apply_corrections(page_ocr[page_num])),
                metadata={
                    "source": pdf_path,
                    "filename": os.path.basename(pdf_path),
                    "type": "ocr",
                    "page": page_num,
                    "total_pages": total_pages
                }
            ))
            log(f"Page {page_num}: OCR extracted")

    if not docs:
        docs.append(Document(
            page_content="[Content extraction failed]",
            metadata={
                "source": pdf_path,
                "filename": os.path.basename(pdf_path),
                "type": "error",
                "page": 1,
                "total_pages": total_pages
            }
        ))

    # Print a page summary
    if docs:
        page_numbers = [doc.metadata.get('page', 0) for doc in docs if doc.metadata.get('page')]
        if page_numbers:
            log(f"Extracted page range: {min(page_numbers)} ~ {max(page_numbers)}")

    log(f"Extracted page-wise PDF documents: {len(docs)} (total {total_pages} pages)")
    return docs

# --------------------------------
# Document loading and splitting
# --------------------------------

def load_documents(folder_path):
    documents = []

    for file in glob.glob(os.path.join(folder_path, "*.hwpx")):
        log(f"HWPX file found: {file}")
        docs = load_hwpx(file)
        documents.extend(docs)

    for file in glob.glob(os.path.join(folder_path, "*.pdf")):
        log(f"PDF file found: {file}")
        documents.extend(load_pdf_with_metadata(file))

    log(f"Document loading complete! Total documents: {len(documents)}")
    return documents

def split_documents(documents, chunk_size=800, chunk_overlap=100):
    log("Chunk splitting started")
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )
    chunks = []
    for doc in documents:
        split = splitter.split_text(doc.page_content)
        for i, chunk in enumerate(split):
            enriched_chunk = f"passage: {chunk}"
            chunks.append(Document(
                page_content=enriched_chunk,
                metadata={**doc.metadata, "chunk_index": i}
            ))
    log(f"Chunk splitting complete: {len(chunks)} chunks created")
    return chunks

# --------------------------------
# Main
# --------------------------------

if __name__ == "__main__":
    folder = "dataset_test"
    log("PyMuPDF-based document processing started")
    docs = load_documents(folder)
    log("Document loading complete")

    # Check page information
    log("Page information summary:")
    page_info = {}
    for doc in docs:
        source = doc.metadata.get('source', 'unknown')
        page = doc.metadata.get('page', 'unknown')
        doc_type = doc.metadata.get('type', 'unknown')

        if source not in page_info:
            page_info[source] = {'pages': set(), 'types': set()}
        page_info[source]['pages'].add(page)
        page_info[source]['types'].add(doc_type)

    for source, info in page_info.items():
        max_page = max(info['pages']) if info['pages'] and isinstance(max(info['pages']), int) else 'unknown'
        log(f"  {os.path.basename(source)}: {max_page} pages, types: {info['types']}")

    chunks = split_documents(docs)
    log("Preparing E5-Large-Instruct embeddings")
    embedding_model = HuggingFaceEmbeddings(
        model_name="intfloat/e5-large-v2",
        model_kwargs={"device": "cuda"}
    )

    vectorstore = FAISS.from_documents(chunks, embedding_model)
    vectorstore.save_local("vector_db")

    log(f"Total documents: {len(docs)}")
    log(f"Total chunks: {len(chunks)}")
    log("FAISS saved: vector_db")

    # Sample output including actual page information
    log("Samples with page information:")
    for i, chunk in enumerate(chunks[:5]):
        meta = chunk.metadata
        log(f"  Chunk {i+1}: {meta.get('type')} | page {meta.get('page')} | {os.path.basename(meta.get('source', 'unknown'))}")
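Because every chunk produced above carries `type`, `page`, and `source` metadata, downstream code can slice the corpus before indexing it. The snippet below is an illustrative sketch, not part of the repository; the folder name is the `dataset_test` default used in the `__main__` block.

```python
# Sketch: keep only table chunks and group them by source file.
from collections import defaultdict
from document_processor_image_test import load_documents, split_documents

docs = load_documents("dataset_test")
chunks = split_documents(docs, chunk_size=800, chunk_overlap=100)

tables_by_file = defaultdict(list)
for chunk in chunks:
    if chunk.metadata.get("type") in ("table", "hwpx_table"):
        tables_by_file[chunk.metadata["filename"]].append(chunk.metadata.get("page"))

for filename, pages in tables_by_file.items():
    # HWPX tables have an empty page value, so sort by string representation
    print(f"{filename}: tables on pages {sorted(set(pages), key=str)}")
```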
e5_embeddings.py
ADDED
@@ -0,0 +1,9 @@
from langchain_huggingface import HuggingFaceEmbeddings

class E5Embeddings(HuggingFaceEmbeddings):
    def embed_documents(self, texts):
        texts = [f"passage: {text}" for text in texts]
        return super().embed_documents(texts)

    def embed_query(self, text):
        return super().embed_query(f"query: {text}")
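The subclass above exists so that stored passages get the `passage:` prefix while queries get `query:`, the asymmetric convention the E5 models were trained with. A minimal usage sketch (it downloads the model on first run; the sentences and CPU device are only examples):

```python
# Sketch: embed a passage and a query with the asymmetric E5 prefixes.
from e5_embeddings import E5Embeddings

emb = E5Embeddings(
    model_name="intfloat/multilingual-e5-large-instruct",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)

doc_vectors = emb.embed_documents(["The data center will be built in 2025."])
query_vector = emb.embed_query("When will the data center be built?")

# With normalized vectors, the dot product equals the cosine similarity.
score = sum(d * q for d, q in zip(doc_vectors[0], query_vector))
print(f"cosine similarity: {score:.3f}")
```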
llm_loader.py
ADDED
@@ -0,0 +1,24 @@
from langchain.chat_models import ChatOpenAI

def load_llama_model():
    return ChatOpenAI(

        # To run RAG with the Llama 3 8B model:
        #base_url="http://torch27:8000/v1",
        #model="meta-llama/Meta-Llama-3-8B-Instruct",
        #openai_api_key="EMPTY"

        # To run RAG with EXAONE:
        base_url="http://220.124.155.35:8000/v1",
        model="LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct",
        openai_api_key="token-abc123"

        #base_url="https://7xiebe4unotxnp-8000.proxy.runpod.net/v1",
        #model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        #openai_api_key="EMPTY"

        # base_url="http://vllm_yjy:8000/v1",
        # model="/models/Llama-3.3-70B-Instruct-AWQ",
        # openai_api_key="token-abc123"

    )
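A quick way to verify that the configured endpoint answers before wiring it into the RAG chain is to send a single message through the returned `ChatOpenAI` object. This is only a sketch; it assumes the vLLM server selected in the active (EXAONE) block is reachable from where you run it.

```python
# Sketch: smoke-test the chat model returned by load_llama_model().
from llm_loader import load_llama_model

llm = load_llama_model()
reply = llm.invoke("Reply with the single word: ready")
print(reply.content)
```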
rag_server.py
ADDED
@@ -0,0 +1,197 @@
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, FileResponse, HTMLResponse
from fastapi.staticfiles import StaticFiles
from pydantic import BaseModel
from rag_system import build_rag_chain, ask_question
from vector_store import get_embeddings, load_vector_store
from llm_loader import load_llama_model
import uuid
import os
import shutil
from urllib.parse import urljoin, quote

from fastapi.responses import StreamingResponse
import json
import time

app = FastAPI()

# Settings for serving static files
os.makedirs("static/documents", exist_ok=True)
app.mount("/static", StaticFiles(directory="static"), name="static")

# Prepare global objects
embeddings = get_embeddings(device="cpu")
vectorstore = load_vector_store(embeddings, load_path="vector_db")
llm = load_llama_model()
qa_chain = build_rag_chain(llm, vectorstore, language="ko", k=7)

# Server URL setting (adjust for the actual environment)
BASE_URL = "http://220.124.155.35:8500"

class Question(BaseModel):
    question: str

def get_document_url(source_path):
    if not source_path or source_path == 'N/A':
        return None
    filename = os.path.basename(source_path)
    dataset_root = os.path.join(os.getcwd(), "dataset")
    # Search every subfolder of dataset for a file with a matching name
    found_path = None
    for root, dirs, files in os.walk(dataset_root):
        if filename in files:
            found_path = os.path.join(root, filename)
            break
    if not found_path or not os.path.exists(found_path):
        return None
    static_path = f"static/documents/{filename}"
    shutil.copy2(found_path, static_path)
    encoded_filename = quote(filename)
    return urljoin(BASE_URL, f"/static/documents/{encoded_filename}")

def create_download_link(url, filename):
    return f'Source: [{filename}]({url})'

@app.post("/ask")
def ask(question: Question):
    result = ask_question(qa_chain, question.question)

    # Process source document information
    sources = []
    for doc in result["source_documents"]:
        source_path = doc.metadata.get('source', 'N/A')
        document_url = get_document_url(source_path) if source_path != 'N/A' else None

        source_info = {
            "source": source_path,
            "content": doc.page_content,
            "page": doc.metadata.get('page', 'N/A'),
            "document_url": document_url,
            "filename": os.path.basename(source_path) if source_path != 'N/A' else None
        }
        sources.append(source_info)

    return {
        "answer": result['result'].split("A:")[-1].strip() if "A:" in result['result'] else result['result'].strip(),
        "sources": sources
    }

@app.get("/v1/models")
def list_models():
    return JSONResponse({
        "object": "list",
        "data": [
            {
                "id": "rag",
                "object": "model",
                "owned_by": "local",
            }
        ]
    })

@app.post("/v1/chat/completions")
async def openai_compatible_chat(request: Request):
    payload = await request.json()
    messages = payload.get("messages", [])
    user_input = messages[-1]["content"] if messages else ""
    stream = payload.get("stream", False)

    result = ask_question(qa_chain, user_input)
    answer = result['result']

    # Process source document information
    sources = []
    for doc in result["source_documents"]:
        source_path = doc.metadata.get('source', 'N/A')
        document_url = get_document_url(source_path) if source_path != 'N/A' else None
        filename = os.path.basename(source_path) if source_path != 'N/A' else None

        source_info = {
            "source": source_path,
            "content": doc.page_content,
            "page": doc.metadata.get('page', 'N/A'),
            "document_url": document_url,
            "filename": filename
        }
        sources.append(source_info)

    # Emit each source on a single line, without duplicates
    sources_md = "\nReferences:\n"
    seen = set()
    for source in sources:
        key = (source['filename'], source['document_url'])
        if source['document_url'] and source['filename'] and key not in seen:
            sources_md += f"Source: [{source['filename']}]({source['document_url']})\n"
            seen.add(key)

    final_answer = answer.split("A:")[-1].strip() if "A:" in answer else answer.strip()
    final_answer += sources_md

    if not stream:
        return JSONResponse({
            "id": f"chatcmpl-{uuid.uuid4()}",
            "object": "chat.completion",
            "choices": [{
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": final_answer
                },
                "finish_reason": "stop"
            }],
            "model": "rag",
        })

    # Generator for the streaming response
    def event_stream():
        # Stream the answer body first
        answer_main = answer.split("A:")[-1].strip() if "A:" in answer else answer.strip()
        for char in answer_main:
            chunk = {
                "id": f"chatcmpl-{uuid.uuid4()}",
                "object": "chat.completion.chunk",
                "choices": [{
                    "index": 0,
                    "delta": {
                        "content": char
                    },
                    "finish_reason": None
                }]
            }
            yield f"data: {json.dumps(chunk)}\n\n"
            time.sleep(0.005)
        # Append the reference documents (download links) once at the end
        sources_md = "\nReferences:\n"
        seen = set()
        for source in sources:
            key = (source['filename'], source['document_url'])
            if source['document_url'] and source['filename'] and key not in seen:
                sources_md += f"Source: [{source['filename']}]({source['document_url']})\n"
                seen.add(key)
        if sources_md.strip() != "References:":
            chunk = {
                "id": f"chatcmpl-{uuid.uuid4()}",
                "object": "chat.completion.chunk",
                "choices": [{
                    "index": 0,
                    "delta": {
                        "content": sources_md
                    },
                    "finish_reason": None
                }]
            }
            yield f"data: {json.dumps(chunk)}\n\n"
        done = {
            "id": f"chatcmpl-{uuid.uuid4()}",
            "object": "chat.completion.chunk",
            "choices": [{
                "index": 0,
                "delta": {},
                "finish_reason": "stop"
            }]
        }
        yield f"data: {json.dumps(done)}\n\n"
        return

    return StreamingResponse(event_stream(), media_type="text/event-stream")
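The `event_stream` generator above emits OpenAI-style `chat.completion.chunk` events over server-sent events, so any SSE-aware client can consume the stream. The snippet below is an assumption-based sketch pointed at the `BASE_URL` host and port used in this file; it uses `httpx`, which is pulled in as a dependency of FastAPI/openai rather than listed directly in `requirements.txt`.

```python
# Sketch: read the server-sent event stream and print each content delta.
import json
import httpx

payload = {
    "model": "rag",
    "messages": [{"role": "user", "content": "Summarize the business trip report."}],
    "stream": True,
}
with httpx.stream("POST", "http://220.124.155.35:8500/v1/chat/completions",
                  json=payload, timeout=None) as response:
    for line in response.iter_lines():
        if not line.startswith("data: "):
            continue
        chunk = json.loads(line[len("data: "):])
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            print(delta["content"], end="", flush=True)
print()
```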
rag_system.py
ADDED
@@ -0,0 +1,227 @@
import os
import argparse
import sys
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from vector_store import get_embeddings, load_vector_store
from llm_loader import load_llama_model

def create_refine_prompts_with_pages(language="ko"):
    if language == "ko":
        question_prompt = PromptTemplate(
            input_variables=["context_str", "question"],
            template="""
The following are retrieved document fragments:

{context_str}

Answer the question by consulting the documents above.

**Important rules:**
- If you used a document for the answer, cite that information
- Only use information explicitly stated in the documents; do not guess
- Only mention page numbers or sources confirmed in the documents above
- If information is uncertain, state that it is "not confirmed in the documents"

Question: {question}
Answer:"""
        )

        refine_prompt = PromptTemplate(
            input_variables=["question", "existing_answer", "context_str"],
            template="""
Existing answer:
{existing_answer}

Additional documents:
{context_str}

Supplement or revise the existing answer based on the additional documents.

**Rules:**
- If new information differs from the existing answer, revise it
- Only use information explicitly stated in the additional documents
- Write a single, complete answer
- Do not mention uncertain sources or page numbers

Question: {question}
Answer:"""
        )
    else:
        question_prompt = PromptTemplate(
            input_variables=["context_str", "question"],
            template="""
Here are the retrieved document fragments:

{context_str}

Please answer the question based on the above documents.

**Important rules:**
- Only use information explicitly stated in the documents
- If citing sources, only mention what is clearly indicated in the documents above
- Do not guess or infer page numbers not shown in the context
- If unsure, state "not confirmed in the provided documents"

Question: {question}
Answer:"""
        )

        refine_prompt = PromptTemplate(
            input_variables=["question", "existing_answer", "context_str"],
            template="""
Existing answer:
{existing_answer}

Additional documents:
{context_str}

Refine the existing answer using the additional documents.

**Rules:**
- Only use information explicitly stated in the additional documents
- Create one coherent final answer
- Do not mention uncertain sources or page numbers

Question: {question}
Answer:"""
        )

    return question_prompt, refine_prompt

def build_rag_chain(llm, vectorstore, language="ko", k=7):
    """Build the RAG chain."""
    question_prompt, refine_prompt = create_refine_prompts_with_pages(language)

    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="refine",
        retriever=vectorstore.as_retriever(search_kwargs={"k": k}),
        chain_type_kwargs={
            "question_prompt": question_prompt,
            "refine_prompt": refine_prompt
        },
        return_source_documents=True
    )

    return qa_chain

def ask_question_with_pages(qa_chain, question):
    """Handle a question."""
    result = qa_chain.invoke({"query": question})

    # Extract only the text after "A:" from the result
    answer = result['result']
    final_answer = answer.split("A:")[-1].strip() if "A:" in answer else answer.strip()

    print(f"\nQuestion: {question}")
    print(f"\nFinal answer: {final_answer}")

    # Metadata debugging output (disabled)
    # debug_metadata_info(result["source_documents"])

    # Summarize reference documents by page
    print("\nReference document summary:")
    source_info = {}

    for doc in result["source_documents"]:
        source = doc.metadata.get('source', 'N/A')
        page = doc.metadata.get('page', 'N/A')
        doc_type = doc.metadata.get('type', 'N/A')
        section = doc.metadata.get('section', None)
        total_pages = doc.metadata.get('total_pages', None)

        filename = doc.metadata.get('filename', 'N/A')
        if filename == 'N/A':
            filename = os.path.basename(source) if source != 'N/A' else 'N/A'

        if filename not in source_info:
            source_info[filename] = {
                'pages': set(),
                'sections': set(),
                'types': set(),
                'total_pages': total_pages
            }

        if page != 'N/A':
            if isinstance(page, str) and page.startswith('Section'):
                source_info[filename]['sections'].add(page)
            else:
                source_info[filename]['pages'].add(page)

        if section is not None:
            source_info[filename]['sections'].add(f"Section {section}")

        source_info[filename]['types'].add(doc_type)

    # Print the results
    total_chunks = len(result["source_documents"])
    print(f"Total chunks used: {total_chunks}")

    for filename, info in source_info.items():
        print(f"\n- {filename}")

        # Total page count
        if info['total_pages']:
            print(f"  Total pages: {info['total_pages']}")

        # Page information
        if info['pages']:
            pages_list = list(info['pages'])
            print(f"  Pages: {', '.join(map(str, pages_list))}")

        # Section information
        if info['sections']:
            sections_list = sorted(list(info['sections']))
            print(f"  Sections: {', '.join(sections_list)}")

        # Neither pages nor sections available
        if not info['pages'] and not info['sections']:
            print("  Pages: no information")

        # Document types
        types_str = ', '.join(sorted(info['types']))
        print(f"  Types: {types_str}")

    return result

# The original ask_question is replaced by ask_question_with_pages
def ask_question(qa_chain, question):
    """Wrapper function kept for compatibility."""
    return ask_question_with_pages(qa_chain, question)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="RAG refine system (with page number support)")
    parser.add_argument("--vector_store", type=str, default="vector_db", help="Vector store path")
    parser.add_argument("--model", type=str, default="LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct", help="LLM model ID")
    parser.add_argument("--device", type=str, default="cuda", choices=["cuda", "cpu"], help="Device to use")
    parser.add_argument("--k", type=int, default=7, help="Number of documents to retrieve")
    parser.add_argument("--language", type=str, default="ko", choices=["ko", "en"], help="Language to use")
    parser.add_argument("--query", type=str, help="Question (interactive mode if omitted)")

    args = parser.parse_args()

    embeddings = get_embeddings(device=args.device)
    vectorstore = load_vector_store(embeddings, load_path=args.vector_store)
    llm = load_llama_model()

    qa_chain = build_rag_chain(llm, vectorstore, language=args.language, k=args.k)

    print("RAG system with page number support is ready!")

    if args.query:
        ask_question_with_pages(qa_chain, args.query)
    else:
        print("Interactive mode started (type 'exit', 'quit', or '종료' to stop)")
        while True:
            try:
                query = input("\nQuestion: ").strip()
                if query.lower() in ["exit", "quit", "종료"]:
                    break
                if query:  # ignore empty input
                    ask_question_with_pages(qa_chain, query)
            except KeyboardInterrupt:
                print("\n\nExiting the program.")
                break
            except Exception as e:
                print(f"Error occurred: {e}\nPlease try again.")
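Besides the CLI entry point, the chain can be embedded in other Python code; `rag_server.py` does exactly this at startup. A reduced sketch of that pattern (the paths, device, and question are only examples):

```python
# Sketch: build the chain once and reuse it for several questions.
from vector_store import get_embeddings, load_vector_store
from llm_loader import load_llama_model
from rag_system import build_rag_chain, ask_question

embeddings = get_embeddings(device="cpu")
vectorstore = load_vector_store(embeddings, load_path="vector_db")
llm = load_llama_model()
qa_chain = build_rag_chain(llm, vectorstore, language="ko", k=7)

result = ask_question(qa_chain, "What is the current budget status?")  # example question
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata.get("filename"), doc.metadata.get("page"))
```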
requirements.txt
ADDED
@@ -0,0 +1,18 @@
langchain>=0.1.0
langchain-community>=0.0.13
langchain-core>=0.1.0
langchain-huggingface>=0.0.2
sentence-transformers>=2.2.2
pypdf>=3.15.1
faiss-cpu>=1.7.4
transformers>=4.36.0
accelerate>=0.21.0
torch>=2.0.0
peft>=0.7.0
bitsandbytes>=0.41.0
tqdm>=4.65.0
python-docx>=0.8.11
olefile>=0.46
uvicorn
fastapi
openai
vector_store.py
ADDED
@@ -0,0 +1,104 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+"""
+Vector store module: builds document embeddings and the FAISS vector store.
+Uses batch processing to keep memory usage low and to avoid errors on overly long chunks.
+"""
+
+import os
+import argparse
+import logging
+from tqdm import tqdm
+from langchain_community.vectorstores import FAISS
+from langchain.schema.document import Document
+from langchain_huggingface import HuggingFaceEmbeddings
+
+# Logging setup - suppress noisy warnings
+logging.getLogger().setLevel(logging.ERROR)
+
+def get_embeddings(model_name="intfloat/multilingual-e5-large-instruct", device="cuda"):
+    return HuggingFaceEmbeddings(
+        model_name=model_name,
+        model_kwargs={'device': device},
+        encode_kwargs={'normalize_embeddings': True}
+    )
+
+def build_vector_store_batch(documents, embeddings, save_path="vector_db", batch_size=16):
+    if not documents:
+        raise ValueError("No documents found. Check that the documents were loaded correctly.")
+
+    texts = [doc.page_content for doc in documents]
+    metadatas = [doc.metadata for doc in documents]
+
+    # Split into batches
+    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
+    metadata_batches = [metadatas[i:i + batch_size] for i in range(0, len(metadatas), batch_size)]
+
+    print(f"Processing {len(batches)} batches with size {batch_size}")
+    print(f"Initializing vector store with batch 1/{len(batches)}")
+
+    # ✅ Use from_documents instead of from_texts (avoids length issues)
+    first_docs = [
+        Document(page_content=text, metadata=meta)
+        for text, meta in zip(batches[0], metadata_batches[0])
+    ]
+    vectorstore = FAISS.from_documents(first_docs, embeddings)
+
+    # Add the remaining batches
+    for i in tqdm(range(1, len(batches)), desc="Processing batches"):
+        try:
+            docs_batch = [
+                Document(page_content=text, metadata=meta)
+                for text, meta in zip(batches[i], metadata_batches[i])
+            ]
+            vectorstore.add_documents(docs_batch)
+
+            if i % 10 == 0:
+                temp_save_path = f"{save_path}_temp"
+                os.makedirs(os.path.dirname(temp_save_path) if os.path.dirname(temp_save_path) else '.', exist_ok=True)
+                vectorstore.save_local(temp_save_path)
+                print(f"Temporary vector store saved to {temp_save_path} after batch {i}")
+
+        except Exception as e:
+            print(f"Error processing batch {i}: {e}")
+            error_save_path = f"{save_path}_error_at_batch_{i}"
+            os.makedirs(os.path.dirname(error_save_path) if os.path.dirname(error_save_path) else '.', exist_ok=True)
+            vectorstore.save_local(error_save_path)
+            print(f"Partial vector store saved to {error_save_path}")
+            raise
+
+    os.makedirs(os.path.dirname(save_path) if os.path.dirname(save_path) else '.', exist_ok=True)
+    vectorstore.save_local(save_path)
+    print(f"Vector store saved to {save_path}")
+
+    return vectorstore
+
+def load_vector_store(embeddings, load_path="vector_db"):
+    if not os.path.exists(load_path):
+        raise FileNotFoundError(f"Vector store not found: {load_path}")
+    return FAISS.load_local(load_path, embeddings, allow_dangerous_deserialization=True)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Build the vector store")
+    parser.add_argument("--folder", type=str, default="dataset", help="Folder containing the documents")
+    parser.add_argument("--save_path", type=str, default="vector_db", help="Where to save the vector store")
+    parser.add_argument("--batch_size", type=int, default=16, help="Batch size")
+    parser.add_argument("--model_name", type=str, default="intfloat/multilingual-e5-large-instruct", help="Embedding model name")
+    parser.add_argument("--device", type=str, default="cuda", help="Device to use ('cuda' or 'cpu')")
+
+    args = parser.parse_args()
+
+    # Import the document processing module
+    from document_processor import load_documents, split_documents
+
+    # Load and split the documents
+    documents = load_documents(args.folder)
+    chunks = split_documents(documents, chunk_size=800, chunk_overlap=100)
+
+    # Load the embedding model
+    embeddings = get_embeddings(model_name=args.model_name, device=args.device)
+
+    # Build the vector store
+    build_vector_store_batch(chunks, embeddings, args.save_path, args.batch_size)
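A typical build invocation, using the argparse options defined above (the paths are examples only):

    python vector_store.py --folder dataset --save_path vector_db --batch_size 16 --device cuda

The periodic save to a _temp copy every 10 batches, plus the _error_at_batch_N copy written on failure, bounds how much embedding work is lost if a long run is interrupted.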
vector_store_test.py
ADDED
@@ -0,0 +1,121 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+"""
+Vector store module: builds document embeddings and the FAISS vector store.
+Batch processing plus a chunk-length report.
+"""
+
+import os
+import argparse
+import logging
+from tqdm import tqdm
+from langchain_community.vectorstores import FAISS
+from langchain.schema.document import Document
+from langchain_huggingface import HuggingFaceEmbeddings
+from e5_embeddings import E5Embeddings
+
+# Logging setup
+logging.getLogger().setLevel(logging.ERROR)
+
+def get_embeddings(model_name="intfloat/multilingual-e5-large-instruct", device="cuda"):
+    print(f"[INFO] Embedding model device: {device}")
+    return E5Embeddings(
+        model_name=model_name,
+        model_kwargs={'device': device},
+        encode_kwargs={'normalize_embeddings': True}
+    )
+
+def build_vector_store_batch(documents, embeddings, save_path="vector_db", batch_size=4):
+    if not documents:
+        raise ValueError("No documents found. Check that the documents were loaded correctly.")
+
+    texts = [doc.page_content for doc in documents]
+    metadatas = [doc.metadata for doc in documents]
+
+    # Report chunk lengths
+    lengths = [len(t) for t in texts]
+    print(f"🟡 Number of chunks: {len(texts)}")
+    print(f"🟡 Longest chunk: {max(lengths)} chars")
+    print(f"🟡 Average chunk length: {sum(lengths) // len(lengths)} chars")
+
+    # Split into batches
+    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
+    metadata_batches = [metadatas[i:i + batch_size] for i in range(0, len(metadatas), batch_size)]
+
+    print(f"Processing {len(batches)} batches with size {batch_size}")
+    print(f"Initializing vector store with batch 1/{len(batches)}")
+
+    # ✅ Use from_documents
+    first_docs = [
+        Document(page_content=text, metadata=meta)
+        for text, meta in zip(batches[0], metadata_batches[0])
+    ]
+    vectorstore = FAISS.from_documents(first_docs, embeddings)
+
+    for i in tqdm(range(1, len(batches)), desc="Processing batches"):
+        try:
+            docs_batch = [
+                Document(page_content=text, metadata=meta)
+                for text, meta in zip(batches[i], metadata_batches[i])
+            ]
+            vectorstore.add_documents(docs_batch)
+
+            if i % 10 == 0:
+                temp_save_path = f"{save_path}_temp"
+                os.makedirs(os.path.dirname(temp_save_path) if os.path.dirname(temp_save_path) else '.', exist_ok=True)
+                vectorstore.save_local(temp_save_path)
+                print(f"Temporary vector store saved to {temp_save_path} after batch {i}")
+
+        except Exception as e:
+            print(f"Error processing batch {i}: {e}")
+            error_save_path = f"{save_path}_error_at_batch_{i}"
+            os.makedirs(os.path.dirname(error_save_path) if os.path.dirname(error_save_path) else '.', exist_ok=True)
+            vectorstore.save_local(error_save_path)
+            print(f"Partial vector store saved to {error_save_path}")
+            raise
+
+    os.makedirs(os.path.dirname(save_path) if os.path.dirname(save_path) else '.', exist_ok=True)
+    vectorstore.save_local(save_path)
+    print(f"Vector store saved to {save_path}")
+
+    return vectorstore
+
+def load_vector_store(embeddings, load_path="vector_db"):
+    if not os.path.exists(load_path):
+        raise FileNotFoundError(f"Vector store not found: {load_path}")
+    return FAISS.load_local(load_path, embeddings, allow_dangerous_deserialization=True)
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Build the vector store")
+    parser.add_argument("--folder", type=str, default="final_dataset", help="Folder containing the documents")
+    parser.add_argument("--save_path", type=str, default="vector_db", help="Where to save the vector store")
+    parser.add_argument("--batch_size", type=int, default=4, help="Batch size")
+    parser.add_argument("--model_name", type=str, default="intfloat/multilingual-e5-large-instruct", help="Embedding model name")
+    parser.add_argument("--device", type=str, default="cuda", help="Device to use ('cuda', 'cpu', or e.g. 'cuda:1')")
+
+    args = parser.parse_args()
+
+    # Import the document processing module
+    from document_processor_image_test import load_documents, split_documents
+
+    documents = load_documents(args.folder)
+    chunks = split_documents(documents, chunk_size=800, chunk_overlap=100)
+
+    print("[DEBUG] Documents loaded and split into chunks; entering the embedding step")
+    print(f"[INFO] Selected device: {args.device}")
+
+    try:
+        embeddings = get_embeddings(
+            model_name=args.model_name,
+            device=args.device
+        )
+        print("[DEBUG] Embedding model created")
+    except Exception as e:
+        print(f"[ERROR] Error while creating the embedding model: {e}")
+        import traceback; traceback.print_exc()
+        exit(1)
+
+    build_vector_store_batch(chunks, embeddings, args.save_path, args.batch_size)
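vector_store_test.py differs from vector_store.py mainly by using the custom E5Embeddings wrapper imported from e5_embeddings (added elsewhere in this upload) and by printing chunk-length statistics. The wrapper itself is not shown in this section; below is a minimal hypothetical sketch of what such a class usually looks like, assuming the common E5 convention of "query: "/"passage: " prefixes - the internals here are an assumption, not the uploaded implementation:

    # Hypothetical sketch only - the actual e5_embeddings.py in this upload may differ.
    from langchain_huggingface import HuggingFaceEmbeddings

    class E5Embeddings(HuggingFaceEmbeddings):
        def embed_documents(self, texts):
            # E5-instruct models expect passages to be prefixed before encoding
            return super().embed_documents([f"passage: {t}" for t in texts])

        def embed_query(self, text):
            # ...and queries to carry a "query: " prefix
            return super().embed_query(f"query: {text}")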