hugging2021 committed on
Commit 5f3b20a · verified · 1 Parent(s): 1edba9f

Upload folder using huggingface_hub

Files changed (22)
  1. .env +1 -0
  2. .gitattributes +5 -33
  3. .gitignore +8 -0
  4. Dockerfile +14 -0
  5. README.md +245 -10
  6. concat_vector_store.py +46 -0
  7. concat_vector_store_์ •๋ฆฌ๋œ.py +55 -0
  8. dataset/์ถœ๋ ฅ HWPํŒŒ์ผ ์–‘์‹/(๋™ํ–ฅ๋ณด๊ณ ) ๊ต์œก๋ถ€ K-์—๋“€ํŒŒ์ธ ์‹œ์Šคํ…œ ์ „๋ถ ๋ฐ์ดํ„ฐ์„ผํ„ฐ ํƒ‘์žฌ ๊ฒ€ํ† .hwp +3 -0
  9. dataset/์ถœ๋ ฅ HWPํŒŒ์ผ ์–‘์‹/23.05.10 ์ „๋ผ๋ถ๋„ ๋ฐ์ดํ„ฐ์„ผํ„ฐ ๊ฑด๋ฆฝ ๊ฐ€๋Šฅ ๋ถ€์ง€.hwp +3 -0
  10. dataset/์ถœ๋ ฅ HWPํŒŒ์ผ ์–‘์‹/25.02.28 ํ–ฅํ›„ ๊ณต๊ณต ๋ฏผ๊ฐ„๋ฌผ๋Ÿ‰ ๋…ธ๋ ฅ ํฌ์ธํŠธ.hwp +3 -0
  11. dataset/์ถœ๋ ฅ HWPํŒŒ์ผ ์–‘์‹/25.03.07 ์ƒ์„ฑํ˜• AI ์‹œ์Šคํ…œ ๊ตฌ์ถ•์„ ์œ„ํ•œ ์—…๋ฌดํ˜‘์•ฝ์‹ ๊ณ„ํš.hwp +3 -0
  12. dataset/์ถœ์žฅ๊ฒฐ๊ณผ๋ณด๊ณ /(1) 24.08.21 ์นด์นด์˜ค ์•„ํ†  ๋…น์Œ ํ’€๋ณธ1.txt +0 -0
  13. dataset/์ถœ์žฅ๊ฒฐ๊ณผ๋ณด๊ณ /(4) 24.08.21 ์นด์นด์˜ค,์•„ํ†  ๋ฉด๋‹ด ๊ฒฐ๊ณผ๋ณด๊ณ F.hwp +3 -0
  14. docker-compose.yml +12 -0
  15. document_processor_image_test.py +440 -0
  16. e5_embeddings.py +9 -0
  17. llm_loader.py +24 -0
  18. rag_server.py +197 -0
  19. rag_system.py +227 -0
  20. requirements.txt +18 -0
  21. vector_store.py +104 -0
  22. vector_store_test.py +121 -0
.env ADDED
@@ -0,0 +1 @@
+ HUGGINGFACE_TOKEN=<Huggingface_Token>
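The commit does not show where this variable is consumed; one way it could be picked up at runtime is sketched below, assuming the `python-dotenv` and `huggingface_hub` packages, neither of which is listed in `requirements.txt`.

```python
import os

from dotenv import load_dotenv        # assumption: python-dotenv is installed
from huggingface_hub import login     # assumption: huggingface_hub is installed

load_dotenv()                          # reads HUGGINGFACE_TOKEN from the .env file
token = os.environ.get("HUGGINGFACE_TOKEN")
if token:
    login(token=token)                 # authenticates Hub downloads for gated models
```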
.gitattributes CHANGED
@@ -1,35 +1,7 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.faiss filter=lfs diff=lfs merge=lfs -text
  *.pkl filter=lfs diff=lfs merge=lfs -text
  *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.pdf filter=lfs diff=lfs merge=lfs -text
+ vector_db/*.faiss filter=lfs diff=lfs merge=lfs -text
+ vector_db/*.pkl filter=lfs diff=lfs merge=lfs -text
+ dataset/* filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,8 @@
+ vector_db/
+ *.index
+ *.faiss
+ *.pkl
+ *.pdf
+ *.faiss
+ *.pkl
+ *.hwpx
Dockerfile ADDED
@@ -0,0 +1,14 @@
+ FROM python:3.10
+
+ WORKDIR /app
+ COPY . /app
+
+ # Create the /tmp directory and grant permissions
+ RUN mkdir -p /tmp && chmod 1777 /tmp
+
+ # Run pip install with the TMPDIR environment variable set
+ RUN TMPDIR=/tmp pip install --upgrade pip && TMPDIR=/tmp pip install --no-cache-dir -r requirements.txt
+
+ EXPOSE 8500
+
+ CMD ["uvicorn", "rag_server:app", "--host", "0.0.0.0", "--port", "8500"]
README.md CHANGED
@@ -1,10 +1,245 @@
- ---
- title: Open Webui Rag System
- emoji: 📊
- colorFrom: green
- colorTo: yellow
- sdk: docker
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Open WebUI RAG System
+
+ A Korean document-based RAG (Retrieval-Augmented Generation) system that can be connected to Open WebUI. It supports PDF and HWPX files and provides accurate page-level information extraction and source tracking.
+
+ ## Key Features
+
+ ### 1. Document Processing
+ - **PDF documents**: PyMuPDF-based extraction of text, tables, and image OCR
+ - **HWPX documents**: per-section extraction of text, tables, and images via XML parsing
+ - **Page-level processing**: each document is split precisely into page/section units
+ - **Multiple content types**: body text, tables, and OCR text are identified and handled separately
+
+ ### 2. Vector Search
+ - **E5-Large embeddings**: a high-performance multilingual embedding model
+ - **FAISS vector store**: fast similarity search
+ - **Batch processing**: optimized for large document collections
+ - **Chunk splitting**: overlapping chunks to preserve context
+
+ ### 3. RAG System
+ - **Refine chain**: accurate answers built up across multiple referenced documents
+ - **Source tracking**: precise citations including page numbers and document names
+ - **Hallucination prevention**: strict prompts that use only information stated in the documents
+
+ ### 4. API Server
+ - **FastAPI-based**: supports asynchronous processing
+ - **OpenAI-compatible**: exposes a `/v1/chat/completions` endpoint
+ - **Streaming support**: real-time answer generation
+ - **Open WebUI integration**: connects directly, no plugin required
+
+ ## System Requirements
+
+ ### Hardware
+ - **GPU**: CUDA support (for embedding and LLM inference)
+ - **RAM**: at least 16 GB (more for large document sets)
+ - **Storage**: 10 GB+ for models and the vector store
+
+ ### Software
+ - Python 3.8+
+ - CUDA 11.7+ (when using a GPU)
+ - Tesseract OCR
+
+ ## Installation
+
+ ### 1. Clone the repository
+ ```bash
+ git clone <repository-url>
+ cd open-webui-rag-system
+ ```
+
+ ### 2. Install dependencies
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ### 3. Install Tesseract OCR
+ **Ubuntu/Debian:**
+ ```bash
+ sudo apt-get install tesseract-ocr tesseract-ocr-kor
+ ```
+
+ **Windows:**
+ - Install from the [Tesseract official page](https://github.com/UB-Mannheim/tesseract/wiki)
+
+ ### 4. Configure the LLM server
+ Set the LLM server to use in `llm_loader.py`:
+ ```python
+ # Example using the EXAONE model
+ base_url="http://vllm:8000/v1"
+ model="LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct"
+ openai_api_key="token-abc123"
+ ```
+
+ ## Running the System
+
+ ### 1. Prepare documents
+ Place the documents to process in the `dataset_test` folder:
+ ```
+ dataset_test/
+ ├── document1.pdf
+ ├── document2.hwpx
+ └── document3.pdf
+ ```
+
+ ### 2. Process documents and build the vector store
+ ```bash
+ python document_processor_image_test.py
+ ```
+ Or use the vector store build script:
+ ```bash
+ python vector_store_test.py --folder dataset_test --save_path faiss_index_pymupdf
+ ```
+
+ ### 3. Run the RAG server
+ ```bash
+ python rag_server.py
+ ```
+ The server listens on port 8000 by default.
+
+ ### 4. Connect to Open WebUI
+ In Open WebUI's model settings, configure:
+ - **API Base URL**: `http://localhost:8000/v1`
+ - **API Key**: `token-abc123`
+ - **Model Name**: `rag`
+
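Programmatic access works with any OpenAI-compatible client as well; a minimal sketch using the `openai` Python package (listed in `requirements.txt`), with the base URL and key from the settings above:

```python
from openai import OpenAI

# Point an OpenAI-compatible client at the local RAG server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

resp = client.chat.completions.create(
    model="rag",
    messages=[{"role": "user", "content": "What is the current budget status?"}],
    stream=False,
)
print(resp.choices[0].message.content)
```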
+ ### 5. Standalone testing
+ Ask a question directly from the command line:
+ ```bash
+ python rag_system.py --query "what you want to find in the documents"
+ ```
+
+ Interactive mode:
+ ```bash
+ python rag_system.py
+ ```
+
+ ## Project Structure
+
+ ```
+ open-webui-rag-system/
+ ├── document_processor_image_test.py  # main document processing module
+ ├── vector_store_test.py              # vector store build module
+ ├── rag_system.py                     # RAG chain construction and Q&A
+ ├── rag_server.py                     # FastAPI server
+ ├── llm_loader.py                     # LLM model loader
+ ├── e5_embeddings.py                  # E5 embedding module
+ ├── requirements.txt                  # dependency list
+ ├── dataset_test/                     # document folder
+ └── faiss_index_pymupdf/              # generated vector store
+ ```
+
+ ## Core Modules
+
+ ### document_processor_image_test.py
+ - Extracts text, tables, and images from PDF and HWPX files page by page
+ - Multi-layer processing with PyMuPDF, pdfplumber, and pytesseract
+ - Preserves per-section metadata and page information
+
+ ### vector_store_test.py
+ - Vectorization with the E5-Large embedding model
+ - Efficient vector store construction with FAISS
+ - Memory optimization through batch processing
+
+ ### rag_system.py
+ - Multi-stage answer generation using a refine chain
+ - Prompts that prevent page-number hallucination
+ - Source tracking and metadata management
+
+ ### rag_server.py
+ - Provides OpenAI-compatible API endpoints
+ - Supports streaming responses
+ - Seamless integration with Open WebUI
+
+ ## Configuration Options
+
+ ### Document processing options
+ - **Chunk size**: `chunk_size=800` (default)
+ - **Chunk overlap**: `chunk_overlap=100` (default)
+ - **OCR languages**: `lang='kor+eng'` (Korean + English)
+
+ ### Retrieval options
+ - **Number of retrieved documents**: `k=7` (default)
+ - **Embedding model**: `intfloat/multilingual-e5-large-instruct`
+ - **Device**: `cuda` or `cpu`
+
+ ### LLM settings
+ Supported models:
+ - LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct
+ - meta-llama/Meta-Llama-3-8B-Instruct
+ - Other OpenAI-compatible models
+
+ ## Troubleshooting
+
+ ### 1. Out of CUDA memory
+ ```bash
+ # Run in CPU mode
+ python vector_store_test.py --device cpu
+ ```
+
+ ### 2. Korean font issues
+ ```bash
+ # Install Korean fonts (Ubuntu)
+ sudo apt-get install fonts-nanum
+ ```
+
+ ### 3. Tesseract path issues
+ ```python
+ # Manually set the pytesseract binary path
+ pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'
+ ```
+
+ ### 4. Model download failures
+ ```bash
+ # Point the Hugging Face cache to a writable path
+ export HF_HOME=/path/to/huggingface/cache
+ ```
+
+ ## API Usage Examples
+
+ ### Direct query
+ ```bash
+ curl -X POST "http://localhost:8000/ask" \
+      -H "Content-Type: application/json" \
+      -d '{"question": "Find the budget-related content in the documents"}'
+ ```
+
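The `/ask` endpoint can also be called from Python; a short sketch assuming the `requests` package (not pinned in `requirements.txt`). The response shape, an `answer` string plus a `sources` list with `filename` and `page` fields, follows `rag_server.py`:

```python
import requests

resp = requests.post(
    "http://localhost:8000/ask",
    json={"question": "Find the budget-related content in the documents"},
    timeout=120,
)
data = resp.json()
print(data["answer"])
for src in data["sources"]:
    print(src["filename"], src["page"])
```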
+ ### OpenAI-compatible API
+ ```bash
+ curl -X POST "http://localhost:8000/v1/chat/completions" \
+      -H "Content-Type: application/json" \
+      -d '{
+        "model": "rag",
+        "messages": [{"role": "user", "content": "What is the current budget status?"}],
+        "stream": false
+      }'
+ ```
+
+ ## Performance Tuning
+
+ ### 1. Adjust the batch size
+ ```bash
+ python vector_store_test.py --batch_size 32  # adjust to available GPU memory
+ ```
+
+ ### 2. Optimize the chunk size
+ ```python
+ # Increase the chunk size for long documents
+ chunks = split_documents(docs, chunk_size=800, chunk_overlap=150)
+ ```
+
+ ### 3. Adjust the number of retrieved documents
+ ```bash
+ python rag_system.py --k 10  # consult more documents
+ ```
+
+ ## License
+
+ MIT License
+
+ ## Contributing
+
+ 1. Fork the repository
+ 2. Create your feature branch
+ 3. Commit your changes
+ 4. Push to the branch
+ 5. Create a Pull Request
concat_vector_store.py ADDED
@@ -0,0 +1,46 @@
+ import os
+ from langchain.schema.document import Document
+ from e5_embeddings import E5Embeddings
+ from langchain_community.vectorstores import FAISS
+
+ from document_processor_image import load_documents, split_documents  # these functions are required
+
+ # Path settings
+ NEW_FOLDER = "25.05.28 RAG์šฉ 2์ฐจ ์—…๋ฌดํŽธ๋žŒ ์ทจํ•ฉ๋ณธ"
+ #NEW_FOLDER = "์ž„์‹œ"
+ VECTOR_STORE_PATH = "vector_db"
+
+ # 1. Load the embedding model
+ def get_embeddings(model_name="intfloat/multilingual-e5-large-instruct", device="cuda"):
+     return E5Embeddings(
+         model_name=model_name,
+         model_kwargs={'device': device},
+         encode_kwargs={'normalize_embeddings': True}
+     )
+
+ # 2. Load the existing vector store
+ def load_vector_store(embeddings, load_path="vector_db"):
+     if not os.path.exists(load_path):
+         raise FileNotFoundError(f"Vector store not found: {load_path}")
+     return FAISS.load_local(load_path, embeddings, allow_dangerous_deserialization=True)
+
+ # 3. Embed and add the new documents
+ def add_new_documents_to_vector_store(new_folder, vectorstore, embeddings):
+     print(f"Loading new documents from: {new_folder}")
+     new_docs = load_documents(new_folder)
+     new_chunks = split_documents(new_docs, chunk_size=800, chunk_overlap=100)
+
+     print(f"New chunks: {len(new_chunks)}")
+     print(f"Vectors before adding: {vectorstore.index.ntotal}")
+     vectorstore.add_documents(new_chunks)
+     print(f"Vectors after adding: {vectorstore.index.ntotal}")
+
+     print("New documents have been added to the vector store.")
+
+ # 4. Run
+ if __name__ == "__main__":
+     embeddings = get_embeddings()
+     vectorstore = load_vector_store(embeddings, VECTOR_STORE_PATH)
+     add_new_documents_to_vector_store(NEW_FOLDER, vectorstore, embeddings)
+     vectorstore.save_local(VECTOR_STORE_PATH)
+     print(f"Vector store saved: {VECTOR_STORE_PATH}")
concat_vector_store_์ •๋ฆฌ๋œ.py ADDED
@@ -0,0 +1,55 @@
+ import os
+ import glob
+ from langchain.schema.document import Document
+ from e5_embeddings import E5Embeddings
+ from langchain_community.vectorstores import FAISS
+ from document_processor import load_pdf_with_pymupdf, split_documents
+
+ # Path settings
+ FOLDER = "25.05.28 RAG์šฉ 2์ฐจ ์—…๋ฌดํŽธ๋žŒ ์ทจํ•ฉ๋ณธ"
+ VECTOR_STORE_PATH = "vector_db"
+
+ # 1. Load the embedding model
+ def get_embeddings(model_name="intfloat/multilingual-e5-large-instruct", device="cuda"):
+     return E5Embeddings(
+         model_name=model_name,
+         model_kwargs={'device': device},
+         encode_kwargs={'normalize_embeddings': True}
+     )
+
+ # 2. Load the existing vector store
+ def load_vector_store(embeddings, load_path=VECTOR_STORE_PATH):
+     if not os.path.exists(load_path):
+         raise FileNotFoundError(f"Vector store not found: {load_path}")
+     return FAISS.load_local(load_path, embeddings, allow_dangerous_deserialization=True)
+
+ # 3. Embed only the PDFs whose filenames match the cleaned-up pattern
+ def embed_cleaned_pdfs(folder, vectorstore, embeddings):
+     pattern = os.path.join(folder, "์ •๋ฆฌ๋œ*.pdf")
+     pdf_files = glob.glob(pattern)
+     print(f"Target PDFs: {len(pdf_files)}")
+
+     new_documents = []
+     for pdf_path in pdf_files:
+         print(f"Processing: {pdf_path}")
+         text = load_pdf_with_pymupdf(pdf_path)
+         if text.strip():
+             new_documents.append(Document(page_content=text, metadata={"source": pdf_path}))
+
+     print(f"Documents: {len(new_documents)}")
+
+     chunks = split_documents(new_documents, chunk_size=300, chunk_overlap=50)
+     print(f"Chunks: {len(chunks)}")
+
+     print(f"Vectors before adding: {vectorstore.index.ntotal}")
+     vectorstore.add_documents(chunks)
+     print(f"Vectors after adding: {vectorstore.index.ntotal}")
+
+     vectorstore.save_local(VECTOR_STORE_PATH)
+     print(f"Saved: {VECTOR_STORE_PATH}")
+
+ # Run
+ if __name__ == "__main__":
+     embeddings = get_embeddings()
+     vectorstore = load_vector_store(embeddings)
+     embed_cleaned_pdfs(FOLDER, vectorstore, embeddings)
dataset/์ถœ๋ ฅ HWPํŒŒ์ผ ์–‘์‹/(๋™ํ–ฅ๋ณด๊ณ ) ๊ต์œก๋ถ€ K-์—๋“€ํŒŒ์ธ ์‹œ์Šคํ…œ ์ „๋ถ ๋ฐ์ดํ„ฐ์„ผํ„ฐ ํƒ‘์žฌ ๊ฒ€ํ† .hwp ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b9ffae391271afee3a1d65cf2a46c58eabeca3ba9305ac0a987fb034e63b1708
+ size 110080
dataset/์ถœ๋ ฅ HWPํŒŒ์ผ ์–‘์‹/23.05.10 ์ „๋ผ๋ถ๋„ ๋ฐ์ดํ„ฐ์„ผํ„ฐ ๊ฑด๋ฆฝ ๊ฐ€๋Šฅ ๋ถ€์ง€.hwp ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2c8bbb64a8ec39a2bfcc373ba0b7bacb4dc5fb25200c2eda08a3abcea733368f
+ size 651264
dataset/์ถœ๋ ฅ HWPํŒŒ์ผ ์–‘์‹/25.02.28 ํ–ฅํ›„ ๊ณต๊ณต ๋ฏผ๊ฐ„๋ฌผ๋Ÿ‰ ๋…ธ๋ ฅ ํฌ์ธํŠธ.hwp ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2a89a3d34664fd68852dc1c4129efe73752dac17272d8238653b80d49309f88b
+ size 101376
dataset/์ถœ๋ ฅ HWPํŒŒ์ผ ์–‘์‹/25.03.07 ์ƒ์„ฑํ˜• AI ์‹œ์Šคํ…œ ๊ตฌ์ถ•์„ ์œ„ํ•œ ์—…๋ฌดํ˜‘์•ฝ์‹ ๊ณ„ํš.hwp ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:39286f70b90a2e4207947785576467f99cf33081aed24edd0f588c8d10f07cbc
+ size 817152
dataset/์ถœ์žฅ๊ฒฐ๊ณผ๋ณด๊ณ /(1) 24.08.21 ์นด์นด์˜ค ์•„ํ†  ๋…น์Œ ํ’€๋ณธ1.txt ADDED
The diff for this file is too large to render. See raw diff
 
dataset/์ถœ์žฅ๊ฒฐ๊ณผ๋ณด๊ณ /(4) 24.08.21 ์นด์นด์˜ค,์•„ํ†  ๋ฉด๋‹ด ๊ฒฐ๊ณผ๋ณด๊ณ F.hwp ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:930e7fde652397ee17ea6cfbbe1b993b7166fc3e350f9aa6c999f982baad3944
+ size 120832
docker-compose.yml ADDED
@@ -0,0 +1,12 @@
+ version: '3.8'
+
+ services:
+   rag-api:
+     build: .
+     ports:
+       - "8500:8500"
+     volumes:
+       - ./dataset:/app/dataset
+     environment:
+       - PYTHONPATH=/app
+     command: uvicorn rag_server:app --host 0.0.0.0 --port 8500 --reload
document_processor_image_test.py ADDED
@@ -0,0 +1,440 @@
1
+ import os
2
+ import re
3
+ import glob
4
+ import time
5
+ from collections import defaultdict
6
+
7
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
8
+ from langchain_core.documents import Document
9
+ from langchain_community.embeddings import HuggingFaceEmbeddings
10
+ from langchain_community.vectorstores import FAISS
11
+
12
+ # PyMuPDF ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
13
+ try:
14
+ import fitz # PyMuPDF
15
+ PYMUPDF_AVAILABLE = True
16
+ print("โœ… PyMuPDF ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์‚ฌ์šฉ ๊ฐ€๋Šฅ")
17
+ except ImportError:
18
+ PYMUPDF_AVAILABLE = False
19
+ print("โš ๏ธ PyMuPDF ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๊ฐ€ ์„ค์น˜๋˜์ง€ ์•Š์Œ. pip install PyMuPDF๋กœ ์„ค์น˜ํ•˜์„ธ์š”.")
20
+
21
+ # PDF ์ฒ˜๋ฆฌ์šฉ
22
+ import pytesseract
23
+ from PIL import Image
24
+ from pdf2image import convert_from_path
25
+ import pdfplumber
26
+ from pymupdf4llm import LlamaMarkdownReader
27
+
28
+ # --------------------------------
29
+ # ๋กœ๊ทธ ์ถœ๋ ฅ
30
+ # --------------------------------
31
+
32
+ def log(msg):
33
+ print(f"[{time.strftime('%H:%M:%S')}] {msg}")
34
+
35
+ # --------------------------------
36
+ # ํ…์ŠคํŠธ ์ •์ œ ํ•จ์ˆ˜
37
+ # --------------------------------
38
+
39
+ def clean_text(text):
40
+ return re.sub(r"[^\uAC00-\uD7A3\u1100-\u11FF\u3130-\u318F\w\s.,!?\"'()$:\-]", "", text)
41
+
42
+ def apply_corrections(text):
43
+ corrections = {
44
+ 'ยบยฉ': '์ •๋ณด', 'รŒ': '์˜', 'ยฝ': '์šด์˜', 'รƒ': '', 'ยฉ': '',
45
+ 'รขโ‚ฌโ„ข': "'", 'รขโ‚ฌล“': '"', 'รขโ‚ฌ': '"'
46
+ }
47
+ for k, v in corrections.items():
48
+ text = text.replace(k, v)
49
+ return text
50
+
51
+ # --------------------------------
52
+ # HWPX ์ฒ˜๋ฆฌ (์„น์…˜๋ณ„ ์ฒ˜๋ฆฌ๋งŒ ์‚ฌ์šฉ)
53
+ # --------------------------------
54
+
55
+ def load_hwpx(file_path):
56
+ """HWPX ํŒŒ์ผ ๋กœ๋”ฉ (XML ํŒŒ์‹ฑ ๋ฐฉ์‹๋งŒ ์‚ฌ์šฉ)"""
57
+ import zipfile
58
+ import xml.etree.ElementTree as ET
59
+ import chardet
60
+
61
+ log(f"๐Ÿ“ฅ HWPX ์„น์…˜๋ณ„ ์ฒ˜๋ฆฌ ์‹œ์ž‘: {file_path}")
62
+ start = time.time()
63
+ documents = []
64
+
65
+ try:
66
+ with zipfile.ZipFile(file_path, 'r') as zip_ref:
67
+ file_list = zip_ref.namelist()
68
+ section_files = [f for f in file_list
69
+ if f.startswith('Contents/section') and f.endswith('.xml')]
70
+ section_files.sort() # section0.xml, section1.xml ์ˆœ์„œ๋กœ ์ •๋ ฌ
71
+
72
+ log(f"๐Ÿ“„ ๋ฐœ๊ฒฌ๋œ ์„น์…˜ ํŒŒ์ผ: {len(section_files)}๊ฐœ")
73
+
74
+ for section_idx, section_file in enumerate(section_files):
75
+ with zip_ref.open(section_file) as xml_file:
76
+ raw = xml_file.read()
77
+ encoding = chardet.detect(raw)['encoding'] or 'utf-8'
78
+ try:
79
+ text = raw.decode(encoding)
80
+ except UnicodeDecodeError:
81
+ text = raw.decode("cp949", errors="replace")
82
+
83
+ tree = ET.ElementTree(ET.fromstring(text))
84
+ root = tree.getroot()
85
+
86
+ # ๋„ค์ž„์ŠคํŽ˜์ด์Šค ์—†์ด ํ…์ŠคํŠธ ์ฐพ๊ธฐ
87
+ t_elements = [elem for elem in root.iter() if elem.tag.endswith('}t') or elem.tag == 't']
88
+ body_text = ""
89
+ for elem in t_elements:
90
+ if elem.text:
91
+ body_text += clean_text(elem.text) + " "
92
+
93
+ # page ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ๋Š” ๋นˆ ๊ฐ’์œผ๋กœ ์„ค์ •
94
+ page_value = ""
95
+
96
+ if body_text.strip():
97
+ documents.append(Document(
98
+ page_content=apply_corrections(body_text),
99
+ metadata={
100
+ "source": file_path,
101
+ "filename": os.path.basename(file_path),
102
+ "type": "hwpx_body",
103
+ "page": page_value,
104
+ "total_sections": len(section_files)
105
+ }
106
+ ))
107
+ log(f"โœ… ์„น์…˜ ํ…์ŠคํŠธ ์ถ”์ถœ ์™„๋ฃŒ (chars: {len(body_text)})")
108
+
109
+ # ํ‘œ ์ฐพ๊ธฐ
110
+ table_elements = [elem for elem in root.iter() if elem.tag.endswith('}table') or elem.tag == 'table']
111
+ if table_elements:
112
+ table_text = ""
113
+ for table_idx, table in enumerate(table_elements):
114
+ table_text += f"[Table {table_idx + 1}]\n"
115
+ rows = [elem for elem in table.iter() if elem.tag.endswith('}tr') or elem.tag == 'tr']
116
+ for row in rows:
117
+ row_text = []
118
+ cells = [elem for elem in row.iter() if elem.tag.endswith('}tc') or elem.tag == 'tc']
119
+ for cell in cells:
120
+ cell_texts = []
121
+ for t_elem in cell.iter():
122
+ if (t_elem.tag.endswith('}t') or t_elem.tag == 't') and t_elem.text:
123
+ cell_texts.append(clean_text(t_elem.text))
124
+ row_text.append(" ".join(cell_texts))
125
+ if row_text:
126
+ table_text += "\t".join(row_text) + "\n"
127
+
128
+ if table_text.strip():
129
+ documents.append(Document(
130
+ page_content=apply_corrections(table_text),
131
+ metadata={
132
+ "source": file_path,
133
+ "filename": os.path.basename(file_path),
134
+ "type": "hwpx_table",
135
+ "page": page_value,
136
+ "total_sections": len(section_files)
137
+ }
138
+ ))
139
+ log(f"๐Ÿ“Š ํ‘œ ์ถ”์ถœ ์™„๋ฃŒ")
140
+
141
+ # ์ด๋ฏธ์ง€ ์ฐพ๊ธฐ
142
+ if [elem for elem in root.iter() if elem.tag.endswith('}picture') or elem.tag == 'picture']:
143
+ documents.append(Document(
144
+ page_content="[์ด๋ฏธ์ง€ ํฌํ•จ]",
145
+ metadata={
146
+ "source": file_path,
147
+ "filename": os.path.basename(file_path),
148
+ "type": "hwpx_image",
149
+ "page": page_value,
150
+ "total_sections": len(section_files)
151
+ }
152
+ ))
153
+ log(f"๐Ÿ–ผ๏ธ ์ด๋ฏธ์ง€ ๋ฐœ๊ฒฌ")
154
+
155
+ except Exception as e:
156
+ log(f"โŒ HWPX ์ฒ˜๋ฆฌ ์˜ค๋ฅ˜: {e}")
157
+
158
+ duration = time.time() - start
159
+
160
+ # ๋ฌธ์„œ ์ •๋ณด ์š”์•ฝ ์ถœ๋ ฅ
161
+ if documents:
162
+ log(f"๐Ÿ“‹ ์ถ”์ถœ๋œ ๋ฌธ์„œ ์ˆ˜: {len(documents)}")
163
+
164
+ log(f"โœ… HWPX ์ฒ˜๋ฆฌ ์™„๋ฃŒ: {file_path} โฑ๏ธ {duration:.2f}์ดˆ, ์ด {len(documents)}๊ฐœ ๋ฌธ์„œ")
165
+ return documents
166
+
167
+ # --------------------------------
168
+ # PDF ์ฒ˜๋ฆฌ ํ•จ์ˆ˜๋“ค (๊ธฐ์กด๊ณผ ๋™์ผ)
169
+ # --------------------------------
170
+
171
+ def run_ocr_on_image(image: Image.Image, lang='kor+eng'):
172
+ return pytesseract.image_to_string(image, lang=lang)
173
+
174
+ def extract_images_with_ocr(pdf_path, lang='kor+eng'):
175
+ try:
176
+ images = convert_from_path(pdf_path)
177
+ page_ocr_data = {}
178
+ for idx, img in enumerate(images):
179
+ page_num = idx + 1
180
+ text = run_ocr_on_image(img, lang=lang)
181
+ if text.strip():
182
+ page_ocr_data[page_num] = text.strip()
183
+ return page_ocr_data
184
+ except Exception as e:
185
+ print(f"โŒ ์ด๋ฏธ์ง€ OCR ์‹คํŒจ: {e}")
186
+ return {}
187
+
188
+ def extract_tables_with_pdfplumber(pdf_path):
189
+ page_table_data = {}
190
+ try:
191
+ with pdfplumber.open(pdf_path) as pdf:
192
+ for i, page in enumerate(pdf.pages):
193
+ page_num = i + 1
194
+ tables = page.extract_tables()
195
+ table_text = ""
196
+ for t_index, table in enumerate(tables):
197
+ if table:
198
+ table_text += f"[Table {t_index+1}]\n"
199
+ for row in table:
200
+ row_text = "\t".join(cell if cell else "" for cell in row)
201
+ table_text += row_text + "\n"
202
+ if table_text.strip():
203
+ page_table_data[page_num] = table_text.strip()
204
+ return page_table_data
205
+ except Exception as e:
206
+ print(f"โŒ ํ‘œ ์ถ”์ถœ ์‹คํŒจ: {e}")
207
+ return {}
208
+
209
+ def extract_body_text_with_pages(pdf_path):
210
+ page_body_data = {}
211
+ try:
212
+ pdf_processor = LlamaMarkdownReader()
213
+ docs = pdf_processor.load_data(file_path=pdf_path)
214
+
215
+ combined_text = ""
216
+ for d in docs:
217
+ if isinstance(d, dict) and "text" in d:
218
+ combined_text += d["text"]
219
+ elif hasattr(d, "text"):
220
+ combined_text += d.text
221
+
222
+ if combined_text.strip():
223
+ chars_per_page = 2000
224
+ start = 0
225
+ page_num = 1
226
+
227
+ while start < len(combined_text):
228
+ end = start + chars_per_page
229
+ if end > len(combined_text):
230
+ end = len(combined_text)
231
+
232
+ page_text = combined_text[start:end]
233
+ if page_text.strip():
234
+ page_body_data[page_num] = page_text.strip()
235
+ page_num += 1
236
+
237
+ if end == len(combined_text):
238
+ break
239
+ start = end - 100
240
+
241
+ except Exception as e:
242
+ print(f"โŒ ๋ณธ๋ฌธ ์ถ”์ถœ ์‹คํŒจ: {e}")
243
+
244
+ return page_body_data
245
+
246
+ def load_pdf_with_metadata(pdf_path):
247
+ """PDF ํŒŒ์ผ์—์„œ ํŽ˜์ด์ง€๋ณ„ ์ •๋ณด๋ฅผ ์ถ”์ถœ"""
248
+ log(f"๐Ÿ“‘ PDF ํŽ˜์ด์ง€๋ณ„ ์ฒ˜๋ฆฌ ์‹œ์ž‘: {pdf_path}")
249
+ start = time.time()
250
+
251
+ # ๋จผ์ € PyPDFLoader๋กœ ์‹ค์ œ ํŽ˜์ด์ง€ ์ˆ˜ ํ™•์ธ
252
+ try:
253
+ from langchain_community.document_loaders import PyPDFLoader
254
+ loader = PyPDFLoader(pdf_path)
255
+ pdf_pages = loader.load()
256
+ actual_total_pages = len(pdf_pages)
257
+ log(f"๐Ÿ“„ PyPDFLoader๋กœ ํ™•์ธํ•œ ์‹ค์ œ ํŽ˜์ด์ง€ ์ˆ˜: {actual_total_pages}")
258
+ except Exception as e:
259
+ log(f"โŒ PyPDFLoader ํŽ˜์ด์ง€ ์ˆ˜ ํ™•์ธ ์‹คํŒจ: {e}")
260
+ actual_total_pages = 1
261
+
262
+ try:
263
+ page_tables = extract_tables_with_pdfplumber(pdf_path)
264
+ except Exception as e:
265
+ page_tables = {}
266
+ print(f"โŒ ํ‘œ ์ถ”์ถœ ์‹คํŒจ: {e}")
267
+
268
+ try:
269
+ page_ocr = extract_images_with_ocr(pdf_path)
270
+ except Exception as e:
271
+ page_ocr = {}
272
+ print(f"โŒ ์ด๋ฏธ์ง€ OCR ์‹คํŒจ: {e}")
273
+
274
+ try:
275
+ page_body = extract_body_text_with_pages(pdf_path)
276
+ except Exception as e:
277
+ page_body = {}
278
+ print(f"โŒ ๋ณธ๋ฌธ ์ถ”์ถœ ์‹คํŒจ: {e}")
279
+
280
+ duration = time.time() - start
281
+ log(f"โœ… PDF ํŽ˜์ด์ง€๋ณ„ ์ฒ˜๋ฆฌ ์™„๋ฃŒ: {pdf_path} โฑ๏ธ {duration:.2f}์ดˆ")
282
+
283
+ # ์‹ค์ œ ํŽ˜์ด์ง€ ์ˆ˜๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์„ค์ •
284
+ all_pages = set(page_tables.keys()) | set(page_ocr.keys()) | set(page_body.keys())
285
+ if all_pages:
286
+ max_extracted_page = max(all_pages)
287
+ # ์‹ค์ œ ํŽ˜์ด์ง€ ์ˆ˜์™€ ์ถ”์ถœ๋œ ํŽ˜์ด์ง€ ์ˆ˜ ์ค‘ ํฐ ๊ฐ’ ์‚ฌ์šฉ
288
+ total_pages = max(actual_total_pages, max_extracted_page)
289
+ else:
290
+ total_pages = actual_total_pages
291
+
292
+ log(f"๐Ÿ“Š ์ตœ์ข… ์„ค์ •๋œ ์ด ํŽ˜์ด์ง€ ์ˆ˜: {total_pages}")
293
+
294
+ docs = []
295
+
296
+ for page_num in sorted(all_pages):
297
+ if page_num in page_tables and page_tables[page_num].strip():
298
+ docs.append(Document(
299
+ page_content=clean_text(apply_corrections(page_tables[page_num])),
300
+ metadata={
301
+ "source": pdf_path,
302
+ "filename": os.path.basename(pdf_path),
303
+ "type": "table",
304
+ "page": page_num,
305
+ "total_pages": total_pages
306
+ }
307
+ ))
308
+ log(f"๐Ÿ“Š ํŽ˜์ด์ง€ {page_num}: ํ‘œ ์ถ”์ถœ ์™„๋ฃŒ")
309
+
310
+ if page_num in page_body and page_body[page_num].strip():
311
+ docs.append(Document(
312
+ page_content=clean_text(apply_corrections(page_body[page_num])),
313
+ metadata={
314
+ "source": pdf_path,
315
+ "filename": os.path.basename(pdf_path),
316
+ "type": "body",
317
+ "page": page_num,
318
+ "total_pages": total_pages
319
+ }
320
+ ))
321
+ log(f"๐Ÿ“„ ํŽ˜์ด์ง€ {page_num}: ๋ณธ๋ฌธ ์ถ”์ถœ ์™„๋ฃŒ")
322
+
323
+ if page_num in page_ocr and page_ocr[page_num].strip():
324
+ docs.append(Document(
325
+ page_content=clean_text(apply_corrections(page_ocr[page_num])),
326
+ metadata={
327
+ "source": pdf_path,
328
+ "filename": os.path.basename(pdf_path),
329
+ "type": "ocr",
330
+ "page": page_num,
331
+ "total_pages": total_pages
332
+ }
333
+ ))
334
+ log(f"๐Ÿ–ผ๏ธ ํŽ˜์ด์ง€ {page_num}: OCR ์ถ”์ถœ ์™„๋ฃŒ")
335
+
336
+ if not docs:
337
+ docs.append(Document(
338
+ page_content="[๋‚ด์šฉ ์ถ”์ถœ ์‹คํŒจ]",
339
+ metadata={
340
+ "source": pdf_path,
341
+ "filename": os.path.basename(pdf_path),
342
+ "type": "error",
343
+ "page": 1,
344
+ "total_pages": total_pages
345
+ }
346
+ ))
347
+
348
+ # ํŽ˜์ด์ง€ ์ •๋ณด ์š”์•ฝ ์ถœ๋ ฅ
349
+ if docs:
350
+ page_numbers = [doc.metadata.get('page', 0) for doc in docs if doc.metadata.get('page')]
351
+ if page_numbers:
352
+ log(f"๐Ÿ“‹ ์ถ”์ถœ๋œ ํŽ˜์ด์ง€ ๋ฒ”์œ„: {min(page_numbers)} ~ {max(page_numbers)}")
353
+
354
+ log(f"๐Ÿ“Š ์ถ”์ถœ๋œ ํŽ˜์ด์ง€๋ณ„ PDF ๋ฌธ์„œ: {len(docs)}๊ฐœ (์ด {total_pages}ํŽ˜์ด์ง€)")
355
+ return docs
356
+
357
+ # --------------------------------
358
+ # ๋ฌธ์„œ ๋กœ๋”ฉ ๋ฐ ๋ถ„ํ• 
359
+ # --------------------------------
360
+
361
+ def load_documents(folder_path):
362
+ documents = []
363
+
364
+ for file in glob.glob(os.path.join(folder_path, "*.hwpx")):
365
+ log(f"๐Ÿ“„ HWPX ํŒŒ์ผ ํ™•์ธ: {file}")
366
+ docs = load_hwpx(file)
367
+ documents.extend(docs)
368
+
369
+ for file in glob.glob(os.path.join(folder_path, "*.pdf")):
370
+ log(f"๐Ÿ“„ PDF ํŒŒ์ผ ํ™•์ธ: {file}")
371
+ documents.extend(load_pdf_with_metadata(file))
372
+
373
+ log(f"๐Ÿ“š ๋ฌธ์„œ ๋กœ๋”ฉ ์ „์ฒด ์™„๋ฃŒ! ์ด ๋ฌธ์„œ ์ˆ˜: {len(documents)}")
374
+ return documents
375
+
376
+ def split_documents(documents, chunk_size=800, chunk_overlap=100):
377
+ log("๐Ÿ”ช ์ฒญํฌ ๋ถ„ํ•  ์‹œ์ž‘")
378
+ splitter = RecursiveCharacterTextSplitter(
379
+ chunk_size=chunk_size,
380
+ chunk_overlap=chunk_overlap,
381
+ length_function=len
382
+ )
383
+ chunks = []
384
+ for doc in documents:
385
+ split = splitter.split_text(doc.page_content)
386
+ for i, chunk in enumerate(split):
387
+ enriched_chunk = f"passage: {chunk}"
388
+ chunks.append(Document(
389
+ page_content=enriched_chunk,
390
+ metadata={**doc.metadata, "chunk_index": i}
391
+ ))
392
+ log(f"โœ… ์ฒญํฌ ๋ถ„ํ•  ์™„๋ฃŒ: ์ด {len(chunks)}๊ฐœ ์ƒ์„ฑ")
393
+ return chunks
394
+
395
+ # --------------------------------
396
+ # ๋ฉ”์ธ ์‹คํ–‰
397
+ # --------------------------------
398
+
399
+ if __name__ == "__main__":
400
+ folder = "dataset_test"
401
+ log("๐Ÿš€ PyMuPDF ๊ธฐ๋ฐ˜ ๋ฌธ์„œ ์ฒ˜๋ฆฌ ์‹œ์ž‘")
402
+ docs = load_documents(folder)
403
+ log("๐Ÿ“ฆ ๋ฌธ์„œ ๋กœ๋”ฉ ์™„๋ฃŒ")
404
+
405
+ # ํŽ˜์ด์ง€ ์ •๋ณด ํ™•์ธ
406
+ log("๐Ÿ“„ ํŽ˜์ด์ง€ ์ •๋ณด ์š”์•ฝ:")
407
+ page_info = {}
408
+ for doc in docs:
409
+ source = doc.metadata.get('source', 'unknown')
410
+ page = doc.metadata.get('page', 'unknown')
411
+ doc_type = doc.metadata.get('type', 'unknown')
412
+
413
+ if source not in page_info:
414
+ page_info[source] = {'pages': set(), 'types': set()}
415
+ page_info[source]['pages'].add(page)
416
+ page_info[source]['types'].add(doc_type)
417
+
418
+ for source, info in page_info.items():
419
+ max_page = max(info['pages']) if info['pages'] and isinstance(max(info['pages']), int) else 'unknown'
420
+ log(f" ๐Ÿ“„ {os.path.basename(source)}: {max_page}ํŽ˜์ด์ง€, ํƒ€์ž…: {info['types']}")
421
+
422
+ chunks = split_documents(docs)
423
+ log("๐Ÿ’ก E5-Large-Instruct ์ž„๋ฒ ๋”ฉ ์ค€๋น„ ์ค‘")
424
+ embedding_model = HuggingFaceEmbeddings(
425
+ model_name="intfloat/e5-large-v2",
426
+ model_kwargs={"device": "cuda"}
427
+ )
428
+
429
+ vectorstore = FAISS.from_documents(chunks, embedding_model)
430
+ vectorstore.save_local("vector_db")
431
+
432
+ log(f"๐Ÿ“Š ์ „์ฒด ๋ฌธ์„œ ์ˆ˜: {len(docs)}")
433
+ log(f"๐Ÿ”— ์ฒญํฌ ์ด ์ˆ˜: {len(chunks)}")
434
+ log("โœ… FAISS ์ €์žฅ ์™„๋ฃŒ: vector_db")
435
+
436
+ # ํŽ˜์ด์ง€ ์ •๋ณด๊ฐ€ ํฌํ•จ๋œ ์ƒ˜ํ”Œ ์ถœ๋ ฅ
437
+ log("\n๐Ÿ“‹ ์‹ค์ œ ํŽ˜์ด์ง€ ์ •๋ณด ํฌํ•จ ์ƒ˜ํ”Œ:")
438
+ for i, chunk in enumerate(chunks[:5]):
439
+ meta = chunk.metadata
440
+ log(f" ์ฒญํฌ {i+1}: {meta.get('type')} | ํŽ˜์ด์ง€ {meta.get('page')} | {os.path.basename(meta.get('source', 'unknown'))}")
e5_embeddings.py ADDED
@@ -0,0 +1,9 @@
+ from langchain_huggingface import HuggingFaceEmbeddings
+
+ class E5Embeddings(HuggingFaceEmbeddings):
+     def embed_documents(self, texts):
+         texts = [f"passage: {text}" for text in texts]
+         return super().embed_documents(texts)
+
+     def embed_query(self, text):
+         return super().embed_query(f"query: {text}")
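The wrapper exists because the E5 model family expects `query:` / `passage:` prefixes on its inputs; a small usage sketch (model weights are downloaded from the Hub on first use):

```python
from e5_embeddings import E5Embeddings

emb = E5Embeddings(
    model_name="intfloat/multilingual-e5-large-instruct",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)

# "passage: " is prepended to documents and "query: " to queries, matching
# the input convention the multilingual-e5 models were trained with.
doc_vecs = emb.embed_documents(["Jeonbuk data center site review report"])
query_vec = emb.embed_query("data center site")
print(len(doc_vecs[0]), len(query_vec))  # 1024-dimensional for the e5-large models
```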
llm_loader.py ADDED
@@ -0,0 +1,24 @@
+ from langchain.chat_models import ChatOpenAI
+
+ def load_llama_model():
+     return ChatOpenAI(
+
+         # To run RAG with the Llama 3 8B model:
+         #base_url="http://torch27:8000/v1",
+         #model="meta-llama/Meta-Llama-3-8B-Instruct",
+         #openai_api_key="EMPTY"
+
+         # To run RAG with EXAONE:
+         base_url="http://220.124.155.35:8000/v1",
+         model="LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct",
+         openai_api_key="token-abc123"
+
+         #base_url="https://7xiebe4unotxnp-8000.proxy.runpod.net/v1",
+         #model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
+         #openai_api_key="EMPTY"
+
+         # base_url="http://vllm_yjy:8000/v1",
+         # model="/models/Llama-3.3-70B-Instruct-AWQ",
+         # openai_api_key="token-abc123"
+
+     )
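A minimal smoke test for the configured endpoint (a sketch; it assumes the OpenAI-compatible server at `base_url` is reachable):

```python
from llm_loader import load_llama_model

llm = load_llama_model()

# ChatOpenAI speaks the OpenAI chat API, so this sends a single chat
# completion request to the configured server and prints the reply text.
reply = llm.invoke("Reply with OK if you can read this.")
print(reply.content)
```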
rag_server.py ADDED
@@ -0,0 +1,197 @@
1
+ from fastapi import FastAPI, Request
2
+ from fastapi.responses import JSONResponse, FileResponse, HTMLResponse
3
+ from fastapi.staticfiles import StaticFiles
4
+ from pydantic import BaseModel
5
+ from rag_system import build_rag_chain, ask_question
6
+ from vector_store import get_embeddings, load_vector_store
7
+ from llm_loader import load_llama_model
8
+ import uuid
9
+ import os
10
+ import shutil
11
+ from urllib.parse import urljoin, quote
12
+
13
+ from fastapi.responses import StreamingResponse
14
+ import json
15
+ import time
16
+
17
+ app = FastAPI()
18
+
19
+ # ์ •์  ํŒŒ์ผ ์„œ๋น™์„ ์œ„ํ•œ ์„ค์ •
20
+ os.makedirs("static/documents", exist_ok=True)
21
+ app.mount("/static", StaticFiles(directory="static"), name="static")
22
+
23
+ # ์ „์—ญ ๊ฐ์ฒด ์ค€๋น„
24
+ embeddings = get_embeddings(device="cpu")
25
+ vectorstore = load_vector_store(embeddings, load_path="vector_db")
26
+ llm = load_llama_model()
27
+ qa_chain = build_rag_chain(llm, vectorstore, language="ko", k=7)
28
+
29
+ # ์„œ๋ฒ„ URL ์„ค์ • (์‹ค์ œ ํ™˜๊ฒฝ์— ๋งž๊ฒŒ ์ˆ˜์ • ํ•„์š”)
30
+ BASE_URL = "http://220.124.155.35:8500"
31
+
32
+ class Question(BaseModel):
33
+ question: str
34
+
35
+ def get_document_url(source_path):
36
+ if not source_path or source_path == 'N/A':
37
+ return None
38
+ filename = os.path.basename(source_path)
39
+ dataset_root = os.path.join(os.getcwd(), "dataset")
40
+ # dataset ์ „์ฒด ํ•˜์œ„ ํด๋”์—์„œ ํŒŒ์ผ๋ช… ์ผ์น˜ํ•˜๋Š” ํŒŒ์ผ ์ฐพ๊ธฐ
41
+ found_path = None
42
+ for root, dirs, files in os.walk(dataset_root):
43
+ if filename in files:
44
+ found_path = os.path.join(root, filename)
45
+ break
46
+ if not found_path or not os.path.exists(found_path):
47
+ return None
48
+ static_path = f"static/documents/{filename}"
49
+ shutil.copy2(found_path, static_path)
50
+ encoded_filename = quote(filename)
51
+ return urljoin(BASE_URL, f"/static/documents/{encoded_filename}")
52
+
53
+ def create_download_link(url, filename):
54
+ return f'์ถœ์ฒ˜: [{filename}]({url})'
55
+
56
+ @app.post("/ask")
57
+ def ask(question: Question):
58
+ result = ask_question(qa_chain, question.question)
59
+
60
+ # ์†Œ์Šค ๋ฌธ์„œ ์ •๋ณด ์ฒ˜๋ฆฌ
61
+ sources = []
62
+ for doc in result["source_documents"]:
63
+ source_path = doc.metadata.get('source', 'N/A')
64
+ document_url = get_document_url(source_path) if source_path != 'N/A' else None
65
+
66
+ source_info = {
67
+ "source": source_path,
68
+ "content": doc.page_content,
69
+ "page": doc.metadata.get('page', 'N/A'),
70
+ "document_url": document_url,
71
+ "filename": os.path.basename(source_path) if source_path != 'N/A' else None
72
+ }
73
+ sources.append(source_info)
74
+
75
+ return {
76
+ "answer": result['result'].split("A:")[-1].strip() if "A:" in result['result'] else result['result'].strip(),
77
+ "sources": sources
78
+ }
79
+
80
+ @app.get("/v1/models")
81
+ def list_models():
82
+ return JSONResponse({
83
+ "object": "list",
84
+ "data": [
85
+ {
86
+ "id": "rag",
87
+ "object": "model",
88
+ "owned_by": "local",
89
+ }
90
+ ]
91
+ })
92
+
93
+ @app.post("/v1/chat/completions")
94
+ async def openai_compatible_chat(request: Request):
95
+ payload = await request.json()
96
+ messages = payload.get("messages", [])
97
+ user_input = messages[-1]["content"] if messages else ""
98
+ stream = payload.get("stream", False)
99
+
100
+ result = ask_question(qa_chain, user_input)
101
+ answer = result['result']
102
+
103
+ # ์†Œ์Šค ๋ฌธ์„œ ์ •๋ณด ์ฒ˜๋ฆฌ
104
+ sources = []
105
+ for doc in result["source_documents"]:
106
+ source_path = doc.metadata.get('source', 'N/A')
107
+ document_url = get_document_url(source_path) if source_path != 'N/A' else None
108
+ filename = os.path.basename(source_path) if source_path != 'N/A' else None
109
+
110
+ source_info = {
111
+ "source": source_path,
112
+ "content": doc.page_content,
113
+ "page": doc.metadata.get('page', 'N/A'),
114
+ "document_url": document_url,
115
+ "filename": filename
116
+ }
117
+ sources.append(source_info)
118
+
119
+ # ์†Œ์Šค ์ •๋ณด๋ฅผ ํ•œ ์ค„์”ฉ๋งŒ ์ถœ๋ ฅ
120
+ sources_md = "\n์ฐธ๊ณ  ๋ฌธ์„œ:\n"
121
+ seen = set()
122
+ for source in sources:
123
+ key = (source['filename'], source['document_url'])
124
+ if source['document_url'] and source['filename'] and key not in seen:
125
+ sources_md += f"์ถœ์ฒ˜: [{source['filename']}]({source['document_url']})\n"
126
+ seen.add(key)
127
+
128
+ final_answer = answer.split("A:")[-1].strip() if "A:" in answer else answer.strip()
129
+ final_answer += sources_md
130
+
131
+ if not stream:
132
+ return JSONResponse({
133
+ "id": f"chatcmpl-{uuid.uuid4()}",
134
+ "object": "chat.completion",
135
+ "choices": [{
136
+ "index": 0,
137
+ "message": {
138
+ "role": "assistant",
139
+ "content": final_answer
140
+ },
141
+ "finish_reason": "stop"
142
+ }],
143
+ "model": "rag",
144
+ })
145
+
146
+ # ์ŠคํŠธ๋ฆฌ๋ฐ ์‘๋‹ต์„ ์œ„ํ•œ generator
147
+ def event_stream():
148
+ # ๋‹ต๋ณ€ ๋ณธ๋ฌธ๋งŒ ๋จผ์ € ์ŠคํŠธ๋ฆฌ๋ฐ
149
+ answer_main = answer.split("A:")[-1].strip() if "A:" in answer else answer.strip()
150
+ for char in answer_main:
151
+ chunk = {
152
+ "id": f"chatcmpl-{uuid.uuid4()}",
153
+ "object": "chat.completion.chunk",
154
+ "choices": [{
155
+ "index": 0,
156
+ "delta": {
157
+ "content": char
158
+ },
159
+ "finish_reason": None
160
+ }]
161
+ }
162
+ yield f"data: {json.dumps(chunk)}\n\n"
163
+ time.sleep(0.005)
164
+ # ์ฐธ๊ณ  ๋ฌธ์„œ(๋‹ค์šด๋กœ๋“œ ๋งํฌ)๋Š” ๋งˆ์ง€๋ง‰์— ํ•œ ๋ฒˆ์— ๋ถ™์—ฌ์„œ ์ „์†ก
165
+ sources_md = "\n์ฐธ๊ณ  ๋ฌธ์„œ:\n"
166
+ seen = set()
167
+ for source in sources:
168
+ key = (source['filename'], source['document_url'])
169
+ if source['document_url'] and source['filename'] and key not in seen:
170
+ sources_md += f"์ถœ์ฒ˜: [{source['filename']}]({source['document_url']})\n"
171
+ seen.add(key)
172
+ if sources_md.strip() != "์ฐธ๊ณ  ๋ฌธ์„œ:":
173
+ chunk = {
174
+ "id": f"chatcmpl-{uuid.uuid4()}",
175
+ "object": "chat.completion.chunk",
176
+ "choices": [{
177
+ "index": 0,
178
+ "delta": {
179
+ "content": sources_md
180
+ },
181
+ "finish_reason": None
182
+ }]
183
+ }
184
+ yield f"data: {json.dumps(chunk)}\n\n"
185
+ done = {
186
+ "id": f"chatcmpl-{uuid.uuid4()}",
187
+ "object": "chat.completion.chunk",
188
+ "choices": [{
189
+ "index": 0,
190
+ "delta": {},
191
+ "finish_reason": "stop"
192
+ }]
193
+ }
194
+ yield f"data: {json.dumps(done)}\n\n"
195
+ return
196
+
197
+ return StreamingResponse(event_stream(), media_type="text/event-stream")
rag_system.py ADDED
@@ -0,0 +1,227 @@
1
+ import os
2
+ import argparse
3
+ import sys
4
+ from langchain.chains import RetrievalQA
5
+ from langchain.prompts import PromptTemplate
6
+ from vector_store import get_embeddings, load_vector_store
7
+ from llm_loader import load_llama_model
8
+
9
+ def create_refine_prompts_with_pages(language="ko"):
10
+ if language == "ko":
11
+ question_prompt = PromptTemplate(
12
+ input_variables=["context_str", "question"],
13
+ template="""
14
+ ๋‹ค์Œ์€ ๊ฒ€์ƒ‰๋œ ๋ฌธ์„œ ์กฐ๊ฐ๋“ค์ž…๋‹ˆ๋‹ค:
15
+
16
+ {context_str}
17
+
18
+ ์œ„ ๋ฌธ์„œ๋“ค์„ ์ฐธ๊ณ ํ•˜์—ฌ ์งˆ๋ฌธ์— ๋‹ต๋ณ€ํ•ด์ฃผ์„ธ์š”.
19
+
20
+ **์ค‘์š”ํ•œ ๊ทœ์น™:**
21
+ - ๋‹ต๋ณ€ ์‹œ ์ฐธ๊ณ ํ•œ ๋ฌธ์„œ๊ฐ€ ์žˆ๋‹ค๋ฉด ํ•ด๋‹น ์ •๋ณด๋ฅผ ์ธ์šฉํ•˜์„ธ์š”
22
+ - ๋ฌธ์„œ์— ๋ช…์‹œ๋œ ์ •๋ณด๋งŒ ์‚ฌ์šฉํ•˜๊ณ , ์ถ”์ธกํ•˜์ง€ ๋งˆ์„ธ์š”
23
+ - ํŽ˜์ด์ง€ ๋ฒˆํ˜ธ๋‚˜ ์ถœ์ฒ˜๋Š” ์œ„ ๋ฌธ์„œ์—์„œ ํ™•์ธ๋œ ๊ฒƒ๋งŒ ์–ธ๊ธ‰ํ•˜์„ธ์š”
24
+ - ํ™•์‹คํ•˜์ง€ ์•Š์€ ์ •๋ณด๋Š” "๋ฌธ์„œ์—์„œ ํ™•์ธ๋˜์ง€ ์•Š์Œ"์ด๋ผ๊ณ  ๋ช…์‹œํ•˜์„ธ์š”
25
+
26
+ ์งˆ๋ฌธ: {question}
27
+ ๋‹ต๋ณ€:"""
28
+ )
29
+
30
+ refine_prompt = PromptTemplate(
31
+ input_variables=["question", "existing_answer", "context_str"],
32
+ template="""
33
+ ๊ธฐ์กด ๋‹ต๋ณ€:
34
+ {existing_answer}
35
+
36
+ ์ถ”๊ฐ€ ๋ฌธ์„œ:
37
+ {context_str}
38
+
39
+ ๊ธฐ์กด ๋‹ต๋ณ€์„ ์œ„ ์ถ”๊ฐ€ ๋ฌธ์„œ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ๋ณด์™„ํ•˜๊ฑฐ๋‚˜ ์ˆ˜์ •ํ•ด์ฃผ์„ธ์š”.
40
+
41
+ **๊ทœ์น™:**
42
+ - ์ƒˆ๋กœ์šด ์ •๋ณด๊ฐ€ ๊ธฐ์กด ๋‹ต๋ณ€๊ณผ ๋‹ค๋ฅด๋‹ค๋ฉด ์ˆ˜์ •ํ•˜์„ธ์š”
43
+ - ์ถ”๊ฐ€ ๋ฌธ์„œ์— ๋ช…์‹œ๋œ ์ •๋ณด๋งŒ ์‚ฌ์šฉํ•˜์„ธ์š”
44
+ - ํ•˜๋‚˜์˜ ์™„๊ฒฐ๋œ ๋‹ต๋ณ€์œผ๋กœ ์ž‘์„ฑํ•˜์„ธ์š”
45
+ - ํ™•์‹คํ•˜์ง€ ์•Š์€ ์ถœ์ฒ˜๋‚˜ ํŽ˜์ด์ง€๋Š” ์–ธ๊ธ‰ํ•˜์ง€ ๋งˆ์„ธ์š”
46
+
47
+ ์งˆ๋ฌธ: {question}
48
+ ๋‹ต๋ณ€:"""
49
+ )
50
+ else:
51
+ question_prompt = PromptTemplate(
52
+ input_variables=["context_str", "question"],
53
+ template="""
54
+ Here are the retrieved document fragments:
55
+
56
+ {context_str}
57
+
58
+ Please answer the question based on the above documents.
59
+
60
+ **Important rules:**
61
+ - Only use information explicitly stated in the documents
62
+ - If citing sources, only mention what is clearly indicated in the documents above
63
+ - Do not guess or infer page numbers not shown in the context
64
+ - If unsure, state "not confirmed in the provided documents"
65
+
66
+ Question: {question}
67
+ Answer:"""
68
+ )
69
+
70
+ refine_prompt = PromptTemplate(
71
+ input_variables=["question", "existing_answer", "context_str"],
72
+ template="""
73
+ Existing answer:
74
+ {existing_answer}
75
+
76
+ Additional documents:
77
+ {context_str}
78
+
79
+ Refine the existing answer using the additional documents.
80
+
81
+ **Rules:**
82
+ - Only use information explicitly stated in the additional documents
83
+ - Create one coherent final answer
84
+ - Do not mention uncertain sources or page numbers
85
+
86
+ Question: {question}
87
+ Answer:"""
88
+ )
89
+
90
+ return question_prompt, refine_prompt
91
+
92
+ def build_rag_chain(llm, vectorstore, language="ko", k=7):
93
+ """RAG ์ฒด์ธ ๊ตฌ์ถ•"""
94
+ question_prompt, refine_prompt = create_refine_prompts_with_pages(language)
95
+
96
+ qa_chain = RetrievalQA.from_chain_type(
97
+ llm=llm,
98
+ chain_type="refine",
99
+ retriever=vectorstore.as_retriever(search_kwargs={"k": k}),
100
+ chain_type_kwargs={
101
+ "question_prompt": question_prompt,
102
+ "refine_prompt": refine_prompt
103
+ },
104
+ return_source_documents=True
105
+ )
106
+
107
+ return qa_chain
108
+
109
+ def ask_question_with_pages(qa_chain, question):
110
+ """์งˆ๋ฌธ ์ฒ˜๋ฆฌ"""
111
+ result = qa_chain.invoke({"query": question})
112
+
113
+ # ๊ฒฐ๊ณผ์—์„œ A: ์ดํ›„ ๋ฌธ์žฅ๋งŒ ์ถ”์ถœ
114
+ answer = result['result']
115
+ final_answer = answer.split("A:")[-1].strip() if "A:" in answer else answer.strip()
116
+
117
+ print(f"\n๐Ÿงพ ์งˆ๋ฌธ: {question}")
118
+ print(f"\n๐ŸŸข ์ตœ์ข… ๋‹ต๋ณ€: {final_answer}")
119
+
120
+ # ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ๋””๋ฒ„๊น… ์ •๋ณด ์ถœ๋ ฅ (๋น„ํ™œ์„ฑํ™”)
121
+ # debug_metadata_info(result["source_documents"])
122
+
123
+ # ์ฐธ๊ณ  ๋ฌธ์„œ๋ฅผ ํŽ˜์ด์ง€๋ณ„๋กœ ์ •๋ฆฌ
124
+ print("\n๐Ÿ“š ์ฐธ๊ณ  ๋ฌธ์„œ ์š”์•ฝ:")
125
+ source_info = {}
126
+
127
+ for doc in result["source_documents"]:
128
+ source = doc.metadata.get('source', 'N/A')
129
+ page = doc.metadata.get('page', 'N/A')
130
+ doc_type = doc.metadata.get('type', 'N/A')
131
+ section = doc.metadata.get('section', None)
132
+ total_pages = doc.metadata.get('total_pages', None)
133
+
134
+ filename = doc.metadata.get('filename', 'N/A')
135
+ if filename == 'N/A':
136
+ filename = os.path.basename(source) if source != 'N/A' else 'N/A'
137
+
138
+ if filename not in source_info:
139
+ source_info[filename] = {
140
+ 'pages': set(),
141
+ 'sections': set(),
142
+ 'types': set(),
143
+ 'total_pages': total_pages
144
+ }
145
+
146
+ if page != 'N/A':
147
+ if isinstance(page, str) and page.startswith('์„น์…˜'):
148
+ source_info[filename]['sections'].add(page)
149
+ else:
150
+ source_info[filename]['pages'].add(page)
151
+
152
+ if section is not None:
153
+ source_info[filename]['sections'].add(f"์„น์…˜ {section}")
154
+
155
+ source_info[filename]['types'].add(doc_type)
156
+
157
+ # ๊ฒฐ๊ณผ ์ถœ๋ ฅ
158
+ total_chunks = len(result["source_documents"])
159
+ print(f"์ด ์‚ฌ์šฉ๋œ ์ฒญํฌ ์ˆ˜: {total_chunks}")
160
+
161
+ for filename, info in source_info.items():
162
+ print(f"\n- {filename}")
163
+
164
+ # ์ „์ฒด ํŽ˜์ด์ง€ ์ˆ˜ ์ •๋ณด
165
+ if info['total_pages']:
166
+ print(f" ์ „์ฒด ํŽ˜์ด์ง€ ์ˆ˜: {info['total_pages']}")
167
+
168
+ # ํŽ˜์ด์ง€ ์ •๋ณด ์ถœ๋ ฅ
169
+ if info['pages']:
170
+ pages_list = list(info['pages'])
171
+ print(f" ํŽ˜์ด์ง€: {', '.join(map(str, pages_list))}")
172
+
173
+ # ์„น์…˜ ์ •๋ณด ์ถœ๋ ฅ
174
+ if info['sections']:
175
+ sections_list = sorted(list(info['sections']))
176
+ print(f" ์„น์…˜: {', '.join(sections_list)}")
177
+
178
+ # ํŽ˜์ด์ง€์™€ ์„น์…˜์ด ๋ชจ๋‘ ์—†๋Š” ๊ฒฝ์šฐ
179
+ if not info['pages'] and not info['sections']:
180
+ print(f" ํŽ˜์ด์ง€: ์ •๋ณด ์—†์Œ")
181
+
182
+ # ๋ฌธ์„œ ์œ ํ˜• ์ถœ๋ ฅ
183
+ types_str = ', '.join(sorted(info['types']))
184
+ print(f" ์œ ํ˜•: {types_str}")
185
+
186
+ return result
187
+
188
+ # ๊ธฐ์กด ask_question ํ•จ์ˆ˜๋Š” ask_question_with_pages๋กœ ๊ต์ฒด
189
+ def ask_question(qa_chain, question):
190
+ """ํ˜ธํ™˜์„ฑ์„ ์œ„ํ•œ ๋ž˜ํผ ํ•จ์ˆ˜"""
191
+ return ask_question_with_pages(qa_chain, question)
192
+
193
+ if __name__ == "__main__":
194
+ parser = argparse.ArgumentParser(description="RAG refine system (ํŽ˜์ด์ง€ ๋ฒˆํ˜ธ ์ง€์›)")
195
+ parser.add_argument("--vector_store", type=str, default="vector_db", help="๋ฒกํ„ฐ ์Šคํ† ์–ด ๊ฒฝ๋กœ")
196
+ parser.add_argument("--model", type=str, default="LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct", help="LLM ๋ชจ๋ธ ID")
197
+ parser.add_argument("--device", type=str, default="cuda", choices=["cuda", "cpu"], help="์‚ฌ์šฉํ•  ๋””๋ฐ”์ด์Šค")
198
+ parser.add_argument("--k", type=int, default=7, help="๊ฒ€์ƒ‰ํ•  ๋ฌธ์„œ ์ˆ˜")
199
+ parser.add_argument("--language", type=str, default="ko", choices=["ko", "en"], help="์‚ฌ์šฉํ•  ์–ธ์–ด")
200
+ parser.add_argument("--query", type=str, help="์งˆ๋ฌธ (์—†์œผ๋ฉด ๋Œ€ํ™”ํ˜• ๋ชจ๋“œ ์‹คํ–‰)")
201
+
202
+ args = parser.parse_args()
203
+
204
+ embeddings = get_embeddings(device=args.device)
205
+ vectorstore = load_vector_store(embeddings, load_path=args.vector_store)
206
+ llm = load_llama_model()
207
+
208
+ qa_chain = build_rag_chain(llm, vectorstore, language=args.language, k=args.k)
209
+
210
+ print("๐ŸŸข RAG ํŽ˜์ด์ง€ ๋ฒˆํ˜ธ ์ง€์› ์‹œ์Šคํ…œ ์ค€๋น„ ์™„๋ฃŒ!")
211
+
212
+ if args.query:
213
+ ask_question_with_pages(qa_chain, args.query)
214
+ else:
215
+ print("๐Ÿ’ฌ ๋Œ€ํ™”ํ˜• ๋ชจ๋“œ ์‹œ์ž‘ (์ข…๋ฃŒํ•˜๋ ค๋ฉด 'exit', 'quit', '์ข…๋ฃŒ' ์ž…๋ ฅ)")
216
+ while True:
217
+ try:
218
+ query = input("\n์งˆ๋ฌธ: ").strip()
219
+ if query.lower() in ["exit", "quit", "์ข…๋ฃŒ"]:
220
+ break
221
+ if query: # ๋นˆ ์ž…๋ ฅ ๋ฐฉ์ง€
222
+ ask_question_with_pages(qa_chain, query)
223
+ except KeyboardInterrupt:
224
+ print("\n\nํ”„๋กœ๊ทธ๋žจ์„ ์ข…๋ฃŒํ•ฉ๋‹ˆ๋‹ค.")
225
+ break
226
+ except Exception as e:
227
+ print(f"โ— ์˜ค๋ฅ˜ ๋ฐœ์ƒ: {e}\n๋‹ค์‹œ ์‹œ๋„ํ•ด์ฃผ์„ธ์š”.")
requirements.txt ADDED
@@ -0,0 +1,18 @@
+ langchain>=0.1.0
+ langchain-community>=0.0.13
+ langchain-core>=0.1.0
+ langchain-huggingface>=0.0.2
+ sentence-transformers>=2.2.2
+ pypdf>=3.15.1
+ faiss-cpu>=1.7.4
+ transformers>=4.36.0
+ accelerate>=0.21.0
+ torch>=2.0.0
+ peft>=0.7.0
+ bitsandbytes>=0.41.0
+ tqdm>=4.65.0
+ python-docx>=0.8.11
+ olefile>=0.46
+ uvicorn
+ fastapi
+ openai
vector_store.py ADDED
@@ -0,0 +1,104 @@
1
+ #!/usr/bin/env python
2
+ # -*- coding: utf-8 -*-
3
+
4
+ """
5
+ ๋ฒกํ„ฐ ์Šคํ† ์–ด ๋ชจ๋“ˆ: ๋ฌธ์„œ ์ž„๋ฒ ๋”ฉ ์ƒ์„ฑ ๋ฐ ๋ฒกํ„ฐ ์Šคํ† ์–ด ๊ตฌ์ถ•
6
+ ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ ์ ์šฉ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰ ์ตœ์ ํ™” + ๊ธด ์ฒญํฌ ์˜ค๋ฅ˜ ๋ฐฉ์ง€
7
+ """
8
+
9
+ import os
10
+ import argparse
11
+ import logging
12
+ from tqdm import tqdm
13
+ from langchain_community.vectorstores import FAISS
14
+ from langchain.schema.document import Document
15
+ from langchain_huggingface import HuggingFaceEmbeddings
16
+
17
+ # ๋กœ๊น… ์„ค์ • - ๋ถˆํ•„์š”ํ•œ ๊ฒฝ๊ณ  ๋ฉ”์‹œ์ง€ ์ œ๊ฑฐ
18
+ logging.getLogger().setLevel(logging.ERROR)
19
+
20
+ def get_embeddings(model_name="intfloat/multilingual-e5-large-instruct", device="cuda"):
21
+ return HuggingFaceEmbeddings(
22
+ model_name=model_name,
23
+ model_kwargs={'device': device},
24
+ encode_kwargs={'normalize_embeddings': True}
25
+ )
26
+
27
+ def build_vector_store_batch(documents, embeddings, save_path="vector_db", batch_size=16):
28
+ if not documents:
29
+ raise ValueError("๋ฌธ์„œ๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค. ๋ฌธ์„œ๊ฐ€ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๋กœ๋“œ๋˜์—ˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”.")
30
+
31
+ texts = [doc.page_content for doc in documents]
32
+ metadatas = [doc.metadata for doc in documents]
33
+
34
+ # ๋ฐฐ์น˜๋กœ ๋ถ„ํ• 
35
+ batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
36
+ metadata_batches = [metadatas[i:i + batch_size] for i in range(0, len(metadatas), batch_size)]
37
+
38
+ print(f"Processing {len(batches)} batches with size {batch_size}")
39
+ print(f"Initializing vector store with batch 1/{len(batches)}")
40
+
41
+ # โœ… from_texts ๋Œ€์‹  from_documents ์‚ฌ์šฉ (๊ธธ์ด ๋ฌธ์ œ ๋ฐฉ์ง€)
42
+ first_docs = [
43
+ Document(page_content=text, metadata=meta)
44
+ for text, meta in zip(batches[0], metadata_batches[0])
45
+ ]
46
+ vectorstore = FAISS.from_documents(first_docs, embeddings)
47
+
48
+ # ๋‚˜๋จธ์ง€ ๋ฐฐ์น˜ ์ถ”๊ฐ€
49
+ for i in tqdm(range(1, len(batches)), desc="Processing batches"):
50
+ try:
51
+ docs_batch = [
52
+ Document(page_content=text, metadata=meta)
53
+ for text, meta in zip(batches[i], metadata_batches[i])
54
+ ]
55
+ vectorstore.add_documents(docs_batch)
56
+
57
+ if i % 10 == 0:
58
+ temp_save_path = f"{save_path}_temp"
59
+ os.makedirs(os.path.dirname(temp_save_path) if os.path.dirname(temp_save_path) else '.', exist_ok=True)
60
+ vectorstore.save_local(temp_save_path)
61
+ print(f"Temporary vector store saved to {temp_save_path} after batch {i}")
62
+
63
+ except Exception as e:
64
+ print(f"Error processing batch {i}: {e}")
65
+ error_save_path = f"{save_path}_error_at_batch_{i}"
66
+ os.makedirs(os.path.dirname(error_save_path) if os.path.dirname(error_save_path) else '.', exist_ok=True)
67
+ vectorstore.save_local(error_save_path)
68
+ print(f"Partial vector store saved to {error_save_path}")
69
+ raise
70
+
71
+ os.makedirs(os.path.dirname(save_path) if os.path.dirname(save_path) else '.', exist_ok=True)
72
+ vectorstore.save_local(save_path)
73
+ print(f"Vector store saved to {save_path}")
74
+
75
+ return vectorstore
76
+
77
+ def load_vector_store(embeddings, load_path="vector_db"):
78
+ if not os.path.exists(load_path):
79
+ raise FileNotFoundError(f"๋ฒกํ„ฐ ์Šคํ† ์–ด๋ฅผ ์ฐพ์„ ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค: {load_path}")
80
+ return FAISS.load_local(load_path, embeddings, allow_dangerous_deserialization=True)
81
+
82
+
83
+ if __name__ == "__main__":
84
+ parser = argparse.ArgumentParser(description="๋ฒกํ„ฐ ์Šคํ† ์–ด ๊ตฌ์ถ•")
85
+ parser.add_argument("--folder", type=str, default="dataset", help="๋ฌธ์„œ๊ฐ€ ์žˆ๋Š” ํด๋” ๊ฒฝ๋กœ")
86
+ parser.add_argument("--save_path", type=str, default="vector_db", help="๋ฒกํ„ฐ ์Šคํ† ์–ด ์ €์žฅ ๊ฒฝ๋กœ")
87
+ parser.add_argument("--batch_size", type=int, default=16, help="๋ฐฐ์น˜ ํฌ๊ธฐ")
88
+ parser.add_argument("--model_name", type=str, default="intfloat/multilingual-e5-large-instruct", help="์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ ์ด๋ฆ„")
89
+ parser.add_argument("--device", type=str, default="cuda", help="์‚ฌ์šฉํ•  ๋””๋ฐ”์ด์Šค ('cuda' ๋˜๋Š” 'cpu')")
90
+
91
+ args = parser.parse_args()
92
+
93
+ # ๋ฌธ์„œ ์ฒ˜๋ฆฌ ๋ชจ๋“ˆ import
94
+ from document_processor import load_documents, split_documents
95
+
96
+ # ๋ฌธ์„œ ๋กœ๋“œ ๋ฐ ๋ถ„ํ• 
97
+ documents = load_documents(args.folder)
98
+ chunks = split_documents(documents, chunk_size=800, chunk_overlap=100)
99
+
100
+ # ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ ๋กœ๋“œ
101
+ embeddings = get_embeddings(model_name=args.model_name, device=args.device)
102
+
103
+ # ๋ฒกํ„ฐ ์Šคํ† ์–ด ๊ตฌ์ถ•
104
+ build_vector_store_batch(chunks, embeddings, args.save_path, args.batch_size)
vector_store_test.py ADDED
@@ -0,0 +1,121 @@
1
+ #!/usr/bin/env python
2
+ # -*- coding: utf-8 -*-
3
+
4
+ """
5
+ ๋ฒกํ„ฐ ์Šคํ† ์–ด ๋ชจ๋“ˆ: ๋ฌธ์„œ ์ž„๋ฒ ๋”ฉ ์ƒ์„ฑ ๋ฐ ๋ฒกํ„ฐ ์Šคํ† ์–ด ๊ตฌ์ถ•
6
+ ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ ์ ์šฉ + ์ฒญํฌ ๊ธธ์ด ํ™•์ธ ์ถ”๊ฐ€
7
+ """
8
+
9
+ import os
10
+ import argparse
11
+ import logging
12
+ from tqdm import tqdm
13
+ from langchain_community.vectorstores import FAISS
14
+ from langchain.schema.document import Document
15
+ from langchain_huggingface import HuggingFaceEmbeddings
16
+ from e5_embeddings import E5Embeddings
17
+
18
+ # ๋กœ๊น… ์„ค์ •
19
+ logging.getLogger().setLevel(logging.ERROR)
20
+
21
+ def get_embeddings(model_name="intfloat/multilingual-e5-large-instruct", device="cuda"):
22
+ print(f"[INFO] ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ ๋””๋ฐ”์ด์Šค: {device}")
23
+ return E5Embeddings(
24
+ model_name=model_name,
25
+ model_kwargs={'device': device},
26
+ encode_kwargs={'normalize_embeddings': True}
27
+ )
28
+
29
+ def build_vector_store_batch(documents, embeddings, save_path="vector_db", batch_size=4):
30
+ if not documents:
31
+ raise ValueError("๋ฌธ์„œ๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค. ๋ฌธ์„œ๊ฐ€ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๋กœ๋“œ๋˜์—ˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”.")
32
+
33
+ texts = [doc.page_content for doc in documents]
34
+ metadatas = [doc.metadata for doc in documents]
35
+
36
+ # ์ฒญํฌ ๊ธธ์ด ์ถœ๋ ฅ
37
+ lengths = [len(t) for t in texts]
38
+ print(f"๐Ÿ’ก ์ฒญํฌ ์ˆ˜: {len(texts)}")
39
+ print(f"๐Ÿ’ก ๊ฐ€์žฅ ๊ธด ์ฒญํฌ ๊ธธ์ด: {max(lengths)} chars")
40
+ print(f"๐Ÿ’ก ํ‰๊ท  ์ฒญํฌ ๊ธธ์ด: {sum(lengths) // len(lengths)} chars")
41
+
42
+ # ๋ฐฐ์น˜๋กœ ๋‚˜๋ˆ„๊ธฐ
43
+ batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
44
+ metadata_batches = [metadatas[i:i + batch_size] for i in range(0, len(metadatas), batch_size)]
45
+
46
+ print(f"Processing {len(batches)} batches with size {batch_size}")
47
+ print(f"Initializing vector store with batch 1/{len(batches)}")
48
+
49
+ # โœ… from_documents ์‚ฌ์šฉ
50
+ first_docs = [
51
+ Document(page_content=text, metadata=meta)
52
+ for text, meta in zip(batches[0], metadata_batches[0])
53
+ ]
54
+ vectorstore = FAISS.from_documents(first_docs, embeddings)
55
+
56
+ for i in tqdm(range(1, len(batches)), desc="Processing batches"):
57
+ try:
58
+ docs_batch = [
59
+ Document(page_content=text, metadata=meta)
60
+ for text, meta in zip(batches[i], metadata_batches[i])
61
+ ]
62
+ vectorstore.add_documents(docs_batch)
63
+
64
+ if i % 10 == 0:
65
+ temp_save_path = f"{save_path}_temp"
66
+ os.makedirs(os.path.dirname(temp_save_path) if os.path.dirname(temp_save_path) else '.', exist_ok=True)
67
+ vectorstore.save_local(temp_save_path)
68
+ print(f"Temporary vector store saved to {temp_save_path} after batch {i}")
69
+
70
+ except Exception as e:
71
+ print(f"Error processing batch {i}: {e}")
72
+ error_save_path = f"{save_path}_error_at_batch_{i}"
73
+ os.makedirs(os.path.dirname(error_save_path) if os.path.dirname(error_save_path) else '.', exist_ok=True)
74
+ vectorstore.save_local(error_save_path)
75
+ print(f"Partial vector store saved to {error_save_path}")
76
+ raise
77
+
78
+ os.makedirs(os.path.dirname(save_path) if os.path.dirname(save_path) else '.', exist_ok=True)
79
+ vectorstore.save_local(save_path)
80
+ print(f"Vector store saved to {save_path}")
81
+
82
+ return vectorstore
83
+
84
+ def load_vector_store(embeddings, load_path="vector_db"):
85
+ if not os.path.exists(load_path):
86
+ raise FileNotFoundError(f"๋ฒกํ„ฐ ์Šคํ† ์–ด๋ฅผ ์ฐพ์„ ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค: {load_path}")
87
+ return FAISS.load_local(load_path, embeddings, allow_dangerous_deserialization=True)
88
+
89
+ if __name__ == "__main__":
90
+ parser = argparse.ArgumentParser(description="๋ฒกํ„ฐ ์Šคํ† ์–ด ๊ตฌ์ถ•")
91
+ parser.add_argument("--folder", type=str, default="final_dataset", help="๋ฌธ์„œ๊ฐ€ ์žˆ๋Š” ํด๋” ๊ฒฝ๋กœ")
92
+ parser.add_argument("--save_path", type=str, default="vector_db", help="๋ฒกํ„ฐ ์Šคํ† ์–ด ์ €์žฅ ๊ฒฝ๋กœ")
93
+ parser.add_argument("--batch_size", type=int, default=4, help="๋ฐฐ์น˜ ํฌ๊ธฐ")
94
+ parser.add_argument("--model_name", type=str, default="intfloat/multilingual-e5-large-instruct", help="์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ ์ด๋ฆ„")
95
+ # parser.add_argument("--device", type=str, default="cuda", help="์‚ฌ์šฉํ•  ๋””๋ฐ”์ด์Šค ('cuda' ๋˜๋Š” 'cpu')")
96
+ parser.add_argument("--device", type=str, default="cuda", help="์‚ฌ์šฉํ•  ๋””๋ฐ”์ด์Šค ('cuda' ๋˜๋Š” 'cpu' ๋˜๋Š” 'cuda:1')")
97
+
98
+ args = parser.parse_args()
99
+
100
+ # ๋ฌธ์„œ ์ฒ˜๋ฆฌ ๋ชจ๋“ˆ import
101
+ from document_processor_image_test import load_documents, split_documents
102
+
103
+ documents = load_documents(args.folder)
104
+ chunks = split_documents(documents, chunk_size=800, chunk_overlap=100)
105
+
106
+ print(f"[DEBUG] ๋ฌธ์„œ ๋กœ๋”ฉ ๋ฐ ์ฒญํฌ ๋ถ„ํ•  ์™„๋ฃŒ, ์ž„๋ฒ ๋”ฉ ๋‹จ๊ณ„ ์ง„์ž… ์ „")
107
+ print(f"[INFO] ์„ ํƒ๋œ ๋””๋ฐ”์ด์Šค: {args.device}")
108
+
109
+ try:
110
+ embeddings = get_embeddings(
111
+ model_name=args.model_name,
112
+ device=args.device
113
+ )
114
+ print(f"[DEBUG] ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ ์ƒ์„ฑ ์™„๋ฃŒ")
115
+ except Exception as e:
116
+ print(f"[ERROR] ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ ์ƒ์„ฑ ๏ฟฝ๏ฟฝ ์—๋Ÿฌ ๋ฐœ์ƒ: {e}")
117
+ import traceback; traceback.print_exc()
118
+ exit(1)
119
+
120
+ build_vector_store_batch(chunks, embeddings, args.save_path, args.batch_size)
121
+