gabykim committed on
Commit e0ace9d · 1 Parent(s): a610ed1

readme & license

Files changed (2)
  1. LICENSE +13 -0
  2. README.md +126 -17
LICENSE ADDED
@@ -0,0 +1,13 @@
+ Copyright 2025 KnowLang
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
README.md CHANGED
@@ -1,17 +1,126 @@
- ---
- title: KnowLangBot
- emoji: 🤖
- colorFrom: blue
- colorTo: purple
- sdk: docker
- app_port: 7860
- ---
-
- # Know Lang Bot
- A tool for exploring and understanding complex codebases using LLMs.
- Features
-
- # Code parsing and analysis
- - Semantic search across repositories
- - Automatic documentation maintenance
- - LLM-powered code understanding
+ # KnowLang: Comprehensive Understanding for Complex Codebase
+
+ KnowLang is an advanced codebase exploration tool that helps software engineers better understand complex codebases through semantic search and intelligent Q&A capabilities. Our first release focuses on providing RAG-powered search and Q&A for popular open-source libraries, with Hugging Face's repositories as our initial targets.
+
+ [![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/gabykim/KnowLang_Transformers_Demo)
+
+ ## Features
+
+ - 🔍 **Semantic Code Search**: Find relevant code snippets based on natural language queries
+ - 📚 **Contextual Q&A**: Get detailed explanations about code functionality and implementation details
+ - 🎯 **Smart Chunking**: Intelligent code parsing that preserves semantic meaning
+ - 🔄 **Multi-Stage Retrieval**: Combined embedding and semantic search for better results
+ - 🐍 **Python Support**: Currently optimized for Python codebases, with a roadmap for multi-language support
+
+ ## How It Works
+
+ ### Code Parsing Pipeline
+
+ ```mermaid
+ flowchart TD
+ A[Git Repository] --> B[Code Files]
+ B --> C[Code Parser]
+ C --> D{Parse by Type}
+ D --> E[Class Definitions]
+ D --> F[Function Definitions]
+ D --> G[Other Code]
+ E --> H[Code Chunks]
+ F --> H
+ G --> H
+ H --> I[LLM Summarization]
+ H --> J
+ I --> J[Embeddings]
+ J --> K[(Vector Store)]
+ ```
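The "Parse by Type" step in the diagram above can be illustrated with a minimal sketch. It assumes Python's standard `ast` module as a stand-in for Tree-sitter, and the names `CodeChunk` and `chunk_python_source` are hypothetical, not part of the KnowLang API. It only splits top-level statements into class, function, and other chunks.

```python
import ast
from dataclasses import dataclass


@dataclass
class CodeChunk:
    """Hypothetical chunk record; not KnowLang's actual data model."""
    kind: str    # "class", "function", or "other"
    name: str    # identifier, or "" for "other" chunks
    source: str  # source text of the chunk (decorators are not included)


def chunk_python_source(source: str) -> list[CodeChunk]:
    """Split a module into class, function, and 'other' chunks."""
    tree = ast.parse(source)
    chunks: list[CodeChunk] = []
    for node in tree.body:  # top-level statements only
        text = ast.get_source_segment(source, node) or ""
        if isinstance(node, ast.ClassDef):
            chunks.append(CodeChunk("class", node.name, text))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append(CodeChunk("function", node.name, text))
        else:
            chunks.append(CodeChunk("other", "", text))
    return chunks


SAMPLE = '''import os

class Greeter:
    def greet(self, name):
        return f"Hello, {name}"

def main():
    print(Greeter().greet("world"))
'''

chunks = chunk_python_source(SAMPLE)
for chunk in chunks:
    print(chunk.kind, chunk.name)
```

Keeping each class or function as one chunk means the text that is later summarized and embedded is a complete definition rather than an arbitrary window of lines.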
+
+ ### RAG Chatbot Pipeline
+
+ ```mermaid
+ flowchart LR
+ A[User Query] --> B[Query Embedding]
+ B --> C[Vector Search]
+ C --> D[Context Collection]
+ D --> E[LLM Response Generation]
+ E --> F[User Interface]
+ ```
+
+ ## Quick Start
+
+ ```bash
+ pip install knowlang
+ ```
+
+ Basic usage:
+
+ ```python
+ from knowlang import CodebaseRAG
+
+ # Initialize with a repository
+ rag = CodebaseRAG("huggingface/transformers")
+
+ # Start the chat interface
+ rag.launch_chat()
+ ```
+
+ ## Architecture
+
+ KnowLang uses several key technologies:
+
+ - **Tree-sitter**: For robust, language-agnostic code parsing
+ - **ChromaDB**: For efficient vector storage and retrieval
+ - **PydanticAI**: For type-safe LLM interactions
+ - **Gradio**: For the interactive chat interface
+
+ ## Technical Details
+
+ ### Code Parsing
+
+ Our code parsing pipeline uses Tree-sitter to break down source code into meaningful chunks while preserving context:
+
+ 1. Repository cloning and file identification
+ 2. Semantic parsing with Tree-sitter
+ 3. Smart chunking based on code structure
+ 4. LLM-powered summarization
+ 5. Embedding generation with mxbai-embed-large
+ 6. Vector store indexing
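As a rough illustration of the file identification part of step 1, the sketch below walks an already cloned repository and collects Python files. The function name `find_python_files` and the `SKIP_DIRS` set are assumptions made for this example, not KnowLang's actual behavior.

```python
from pathlib import Path

# Hypothetical set of directories to ignore; adjust for the repository at hand.
SKIP_DIRS = {".git", "__pycache__", "node_modules", ".venv"}


def find_python_files(repo_root: Path) -> list[Path]:
    """Return all .py files under repo_root, skipping ignored directories."""
    files = []
    for path in repo_root.rglob("*.py"):
        parent_parts = path.relative_to(repo_root).parts[:-1]
        if any(part in SKIP_DIRS for part in parent_parts):
            continue  # the file lives inside an ignored directory
        if path.is_file():
            files.append(path)
    return sorted(files)
```

Calling `find_python_files(Path("path/to/clone"))` yields a sorted list, so later stages process files in a stable order.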
+
+ ### RAG Implementation
+
+ The RAG system uses a multi-stage retrieval process:
+
+ 1. Query embedding generation
+ 2. Initial vector similarity search
+ 3. Context aggregation
+ 4. LLM response generation with full context
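The first three stages above can be sketched in plain Python. This is an illustration under stated assumptions: a toy bag-of-words `embed` function stands in for a real embedding model such as mxbai-embed-large, an in-memory dict stands in for the vector store, and every name here (`embed`, `cosine`, `retrieve`, `build_prompt`, `CHUNKS`) is hypothetical. The final LLM call is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words term counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: dict[str, str], top_k: int = 2) -> list[str]:
    """Stages 1-2: embed the query and rank chunk ids by similarity."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda cid: cosine(q, embed(chunks[cid])), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, chunks: dict[str, str], top_k: int = 2) -> str:
    """Stage 3: aggregate the retrieved chunks into one context block."""
    ids = retrieve(query, chunks, top_k)
    context = "\n\n".join(f"# {cid}\n{chunks[cid]}" for cid in ids)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical chunk store: id -> chunk text.
CHUNKS = {
    "tokenizer.encode": "def encode(text): convert text into token ids",
    "trainer.train": "def train(model, data): run the training loop",
    "utils.download": "def download(url): fetch a file from a url",
}

question = "how do I convert text into token ids?"
prompt = build_prompt(question, CHUNKS, top_k=1)
print(prompt)
```

In a real deployment the dict and the `cosine` loop would be replaced by queries against the vector store, and `prompt` would be passed to the LLM for stage 4.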
+
+
+ ## Roadmap
+
+ - [ ] Inter-repository semantic search
+ - [ ] Support for additional programming languages
+ - [ ] Automatic documentation maintenance
+ - [ ] Integration with popular IDEs
+ - [ ] Custom embedding model training
+ - [ ] Enhanced evaluation metrics
+
+ ## License
+
+ This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details. The Apache License 2.0 is a permissive license that enables broad use, modification, and distribution while providing patent rights and protecting trademark use.
+
+ ## Citation
+
+ If you use KnowLang in your research, please cite:
+
+ ```bibtex
+ @software{knowlang2025,
+ author = {KnowLang},
+ title = {KnowLang: Comprehensive Understanding for Complex Codebase},
+ year = {2025},
+ publisher = {GitHub},
+ url = {https://github.com/kimgb415/know-lang}
+ }
+ ```
+
+ ## Support
+
+ For support, please open an issue on GitHub or reach out to us directly through discussions.