---
title: open-webui-rag-system
sdk: docker
---
# Open WebUI RAG System
A Korean-language, document-based RAG (Retrieval-Augmented Generation) system that integrates with Open WebUI. It supports PDF and HWPX files and provides accurate per-page information extraction and source tracking.
## Key Features
### 1. Document Processing
- **PDF documents**: PyMuPDF-based extraction of text, tables, and image OCR
- **HWPX documents**: Section-by-section extraction of text, tables, and images via XML parsing
- **Per-page processing**: Each document is split accurately into page/section units
- **Multiple content types**: Body text, tables, and OCR text are identified and handled separately
### 2. Vector Search
- **E5-Large embeddings**: High-performance multilingual embedding model
- **FAISS vector store**: Fast similarity search
- **Batch processing**: Optimized for processing large documents
- **Chunk splitting**: Overlapping chunks to preserve context
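The overlap idea can be sketched in a few lines. This is a minimal character-based illustration, not the project's actual splitter: `split_with_overlap` is a hypothetical helper, and its defaults only mirror the `chunk_size=500` and `chunk_overlap=100` values listed in this README.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> List[str]:
    """Split text into character windows; neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

Each chunk starts with the last 100 characters of the previous one, so text near a boundary is embedded together with context from both sides.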
### 3. RAG System
- **Refine chain**: Accurate answer generation by consulting multiple documents
- **Source tracking**: Accurate citations that include page numbers and document names
- **Hallucination prevention**: Strict prompts that use only information stated in the documents
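A Refine chain can be pictured as a loop: draft an answer from the first retrieved chunk, then ask the model to revise that draft once per remaining chunk. The sketch below assumes a generic `llm` callable that maps a prompt string to a completion. The function name and prompt wording are illustrative and are not taken from `rag_system.py`.

```python
from typing import Callable, List


def refine_answer(question: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """Draft an answer from the first chunk, then revise it once per remaining chunk."""
    if not chunks:
        return "The retrieved documents do not contain an answer."
    # Initial draft: only the first chunk is visible to the model.
    answer = llm(
        "Answer using only the context below.\n"
        f"Context: {chunks[0]}\nQuestion: {question}"
    )
    # Refinement passes: one call per additional chunk, carrying the draft forward.
    for chunk in chunks[1:]:
        answer = llm(
            "Revise the existing answer only if the new context adds stated facts.\n"
            f"Existing answer: {answer}\nNew context: {chunk}\nQuestion: {question}"
        )
    return answer
```

This pattern makes one LLM call per retrieved chunk, so latency grows linearly with the number of retrieved documents.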
### 4. API Server
- **FastAPI-based**: Supports asynchronous processing
- **OpenAI-compatible**: Provides the `/v1/chat/completions` endpoint
- **Streaming support**: Answers are generated in real time
- **Open WebUI integration**: Connects directly, with no plugin required
## System Requirements
### Hardware
- **GPU**: CUDA-capable (for embedding and LLM inference)
- **RAM**: At least 16GB (more when processing large documents)
- **Storage**: 10GB+ for models and the vector store
### Software
- Python 3.8+
- CUDA 11.7+ (when using a GPU)
- Tesseract OCR
## Installation
### 1. Clone the Repository
```bash
git clone <repository-url>
cd open-webui-rag-system
```
### 2. Install Dependencies
```bash
pip install -r requirements.txt
```
### 3. Install Tesseract OCR
**Ubuntu/Debian:**
```bash
sudo apt-get install tesseract-ocr tesseract-ocr-kor
```
**Windows:**
- Install from the [official Tesseract page](https://github.com/UB-Mannheim/tesseract/wiki)
### 4. Configure the LLM Server
Set the LLM server to use in `llm_loader.py`:
```python
# Example using the EXAONE model
base_url="http://vllm:8000/v1"
model="LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct"
openai_api_key="token-abc123"
```
## Usage
### 1. Prepare Documents
Place the documents to process in the `dataset_test` folder:
```
dataset_test/
├── document1.pdf
├── document2.hwpx
└── document3.pdf
```
### 2. Process Documents and Build the Vector Store
```bash
python document_processor_image_test.py
```
Or use the vector store build script:
```bash
python vector_store_test.py --folder dataset_test --save_path faiss_index_pymupdf
```
### 3. Run the RAG Server
```bash
python rag_server.py
```
The server runs on port 8000 by default.
### 4. Connect Open WebUI
In Open WebUI's model settings, configure the following:
- **API Base URL**: `http://localhost:8000/v1`
- **API Key**: `token-abc123`
- **Model Name**: `rag`
### 5. Standalone Testing
Ask a question directly from the command line:
```bash
python rag_system.py --query "what you want to find in the documents"
```
Interactive mode:
```bash
python rag_system.py
```
## Project Structure
```
open-webui-rag-system/
├── document_processor_image_test.py  # Main document processing module
├── vector_store_test.py              # Vector store creation module
├── rag_system.py                     # RAG chain setup and question answering
├── rag_server.py                     # FastAPI server
├── llm_loader.py                     # LLM model loader
├── e5_embeddings.py                  # E5 embedding module
├── requirements.txt                  # Dependency list
├── dataset_test/                     # Document folder
└── faiss_index_pymupdf/              # Generated vector store
```
## Core Modules
### document_processor_image_test.py
- Extracts text, tables, and images from PDF and HWPX files page by page
- Multi-layer processing with PyMuPDF, pdfplumber, and pytesseract
- Preserves per-section metadata and page information
### vector_store_test.py
- Vectorization with the E5-Large embedding model
- Efficient vector store construction with FAISS
- Memory optimization through batch processing
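Batch processing keeps memory bounded because only one slice of chunks is embedded at a time. A minimal sketch of the batching step follows. `iter_batches` is a hypothetical helper, and the default of 32 simply mirrors the `--batch_size 32` example in this README.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 32) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Each batch would be embedded and added to the index before the next is loaded.
for batch in iter_batches([f"chunk {i}" for i in range(70)]):
    print(len(batch))
```

A smaller batch size lowers peak GPU memory at the cost of more embedding calls.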
### rag_system.py
- Multi-step answer generation with a Refine chain
- Prompts that prevent hallucinated page numbers
- Source tracking and metadata management
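One way to keep page numbers trustworthy is to build citations from chunk metadata rather than from model output. The sketch below illustrates that idea only. The `Chunk` class and its field names (`source`, `page`, `content_type`) are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str                 # document file name (hypothetical field)
    page: int                   # page or section number (hypothetical field)
    content_type: str = "text"  # e.g. "text", "table", "ocr"


def format_citations(chunks: Iterable[Chunk]) -> str:
    """List each (document, page) pair once, in retrieval order."""
    seen: List[Tuple[str, int]] = []
    for chunk in chunks:
        key = (chunk.source, chunk.page)
        if key not in seen:
            seen.append(key)
    return "; ".join(f"{source}, p. {page}" for source, page in seen)
```

Because the citation string is derived only from metadata attached at extraction time, the model cannot introduce a page number that was never retrieved.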
### rag_server.py
- Provides OpenAI-compatible API endpoints
- Supports streaming responses
- Integrates smoothly with Open WebUI
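For clients that consume the stream themselves, the sketch below shows how streamed chunks could be reassembled. It assumes the server follows the OpenAI streaming convention (`data: {json}` lines carrying `choices[0].delta.content`, ending with `data: [DONE]`). This README does not spell out the wire format, so treat that as an assumption. `iter_stream_text` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        event = json.loads(payload)
        for choice in event.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Joining the yielded fragments with `"".join(...)` gives the full answer text.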
## Configuration Options
### Document Processing Options
- **Chunk size**: `chunk_size=500` (default)
- **Chunk overlap**: `chunk_overlap=100` (default)
- **OCR languages**: `lang='kor+eng'` (Korean + English)
### Search Options
- **Number of retrieved documents**: `k=7` (default)
- **Embedding model**: `intfloat/multilingual-e5-large-instruct`
- **Device**: `cuda` or `cpu`
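Conceptually, `k` controls how many of the highest-scoring chunks are handed to the LLM. The pure-Python sketch below shows that top-k selection. FAISS performs the same job with optimized index structures, and the similarity metric used by the project's index is not stated here, so cosine similarity is an assumption. `top_k` is a hypothetical helper.

```python
import math
from typing import List, Sequence, Tuple


def top_k(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], k: int = 7) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) for the k most similar vectors."""

    def cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    return scored[:k]
```

Raising `k` gives the model more context to draw on but lengthens each request.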
### LLM Settings
Supported models:
- LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct
- meta-llama/Meta-Llama-3-8B-Instruct
- Other OpenAI-compatible models
## Troubleshooting
### 1. CUDA Out of Memory
```bash
# Run in CPU mode
python vector_store_test.py --device cpu
```
### 2. Korean Font Issues
```bash
# Install Korean fonts (Ubuntu)
sudo apt-get install fonts-nanum
```
### 3. Tesseract Path Issues
```python
import pytesseract
# Set the pytesseract path manually
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'
```
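If the binary is not at `/usr/bin/tesseract`, its location can be looked up rather than hard-coded. A small sketch using only the standard library follows. `find_tesseract` is a hypothetical helper, not part of this project.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the `tesseract` executable on PATH, or None."""
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    print("tesseract not found on PATH; install it or set tesseract_cmd by hand")
else:
    print(f"tesseract found at {path}")
    # With pytesseract installed, the path can then be applied like this:
    # import pytesseract
    # pytesseract.pytesseract.tesseract_cmd = path
```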
### 4. Model Download Failure
```bash
# Check the Hugging Face cache path (set it explicitly if needed)
export HF_HOME=/path/to/huggingface/cache
```
## API Usage Examples
### Direct Query
```bash
curl -X POST "http://localhost:8000/ask" \
-H "Content-Type: application/json" \
-d '{"question": "Find budget-related content in the documents"}'
```
### OpenAI-Compatible API
```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "rag",
"messages": [{"role": "user", "content": "What is the current budget status?"}],
"stream": false
}'
```
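The same request can be issued from Python with only the standard library. This is a sketch under stated assumptions. The helper names are hypothetical, the response is assumed to follow the OpenAI `choices[0].message.content` shape, and the `Authorization` header is included only because an API key appears in the Open WebUI settings (whether the server checks it is not stated here).

```python
import json
import urllib.request


def build_chat_request(
    question: str,
    base_url: str = "http://localhost:8000/v1",
    api_key: str = "token-abc123",
    model: str = "rag",
) -> urllib.request.Request:
    """Build a non-streaming chat completion request for the RAG server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumption: may be ignored by the server
        },
        method="POST",
    )


def ask(question: str) -> str:
    """Send the request and return the answer text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(question)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("What is the current budget status?")` requires `rag_server.py` to be running on port 8000.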
## Performance Optimization
### 1. Adjust the Batch Size
```bash
python vector_store_test.py --batch_size 32 # Adjust to fit GPU memory
```
### 2. Optimize the Chunk Size
```python
# Increase the chunk size for long documents
chunks = split_documents(docs, chunk_size=800, chunk_overlap=150)
```
### 3. Adjust the Number of Search Results
```bash
python rag_system.py --k 10 # Consult more documents
```
## License
MIT License
## Contributing
1. Fork the repository
2. Create your feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request