readme & license
LICENSE
ADDED
@@ -0,0 +1,13 @@
Copyright 2025 KnowLang

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
README.md
CHANGED
@@ -1,17 +1,126 @@
# KnowLang: Comprehensive Understanding for Complex Codebase

KnowLang is an advanced codebase exploration tool that helps software engineers better understand complex codebases through semantic search and intelligent Q&A capabilities. Our first release focuses on providing RAG-powered search and Q&A for popular open-source libraries, with Hugging Face's repositories as our initial targets.

[KnowLang Transformers Demo on Hugging Face Spaces](https://huggingface.co/spaces/gabykim/KnowLang_Transformers_Demo)

## Features

- **Semantic Code Search**: Find relevant code snippets based on natural language queries
- **Contextual Q&A**: Get detailed explanations about code functionality and implementation details
- **Smart Chunking**: Intelligent code parsing that preserves semantic meaning
- **Multi-Stage Retrieval**: Combined embedding and semantic search for better results
- **Python Support**: Currently optimized for Python codebases, with a roadmap for multi-language support

## How It Works

### Code Parsing Pipeline

```mermaid
flowchart TD
    A[Git Repository] --> B[Code Files]
    B --> C[Code Parser]
    C --> D{Parse by Type}
    D --> E[Class Definitions]
    D --> F[Function Definitions]
    D --> G[Other Code]
    E --> H[Code Chunks]
    F --> H
    G --> H
    H --> I[LLM Summarization]
    H --> J
    I --> J[Embeddings]
    J --> K[(Vector Store)]
```
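
The "Parse by Type" step in the diagram can be illustrated with a minimal sketch. It uses Python's standard-library `ast` module as a stand-in for Tree-sitter, so it only handles Python and only splits at the top level of a module. The `CodeChunk` dataclass and `chunk_python_source` function are hypothetical names chosen for this example, not part of KnowLang's API.

```python
import ast
from dataclasses import dataclass


@dataclass
class CodeChunk:
    """One unit of source code destined for summarization and embedding."""
    kind: str        # "class", "function", or "other"
    name: str        # identifier, or "<module>" for leftover code
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str


def chunk_python_source(source: str) -> list[CodeChunk]:
    """Split a module into class, function, and 'other' chunks (top level only)."""
    lines = source.splitlines()
    chunks: list[CodeChunk] = []
    other: list[ast.stmt] = []

    def flush_other() -> None:
        # Group consecutive non-definition statements into one "other" chunk.
        if other:
            start, end = other[0].lineno, other[-1].end_lineno
            chunks.append(CodeChunk("other", "<module>", start, end,
                                    "\n".join(lines[start - 1:end])))
            other.clear()

    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            kind = "class"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        else:
            other.append(node)
            continue
        flush_other()
        # Start at the first decorator, if any, so the chunk keeps it.
        start = min([node.lineno] + [d.lineno for d in node.decorator_list])
        chunks.append(CodeChunk(kind, node.name, start, node.end_lineno,
                                "\n".join(lines[start - 1:node.end_lineno])))
    flush_other()
    return chunks


SAMPLE = '''import os

class Greeter:
    def greet(self, name):
        return f"Hello, {name}"

def main():
    print(Greeter().greet("world"))

main()
'''

chunks = chunk_python_source(SAMPLE)
for chunk in chunks:
    print(chunk.kind, chunk.name, chunk.start_line, chunk.end_line)
```

Keeping the line range on each chunk lets a later stage point the user back to the exact location in the file. A Tree-sitter based parser would produce the same kind of records while also covering languages other than Python.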

### RAG Chatbot Pipeline

```mermaid
flowchart LR
    A[User Query] --> B[Query Embedding]
    B --> C[Vector Search]
    C --> D[Context Collection]
    D --> E[LLM Response Generation]
    E --> F[User Interface]
```
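
The first two stages of this flow (query embedding and vector search) can be sketched in plain Python. This is an illustration under loud assumptions: the `embed` function is a toy bag-of-words counter rather than a real embedding model, the store is an in-memory dictionary rather than a vector database, and all names (`embed`, `cosine`, `vector_search`, `STORE`) are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def vector_search(query: str, store: dict[str, str],
                  top_k: int = 2) -> list[tuple[str, float]]:
    """Embed the query, score every stored chunk, return the top_k (id, score)."""
    q = embed(query)
    scored = [(chunk_id, cosine(q, embed(text))) for chunk_id, text in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical chunk ids and texts, standing in for indexed code chunks.
STORE = {
    "tokenizer.encode": "def encode(text): convert text into token ids",
    "trainer.train": "def train(model, dataset): run the training loop",
    "model.save": "def save(model, path): write model weights to disk",
}

hits = vector_search("how do I save model weights", STORE, top_k=2)
for chunk_id, score in hits:
    print(chunk_id, round(score, 3))
```

A real deployment would replace `embed` with a learned embedding model and the linear scan with an approximate nearest-neighbour index, but the ranking logic keeps the same shape.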

## Quick Start

```bash
pip install knowlang
```

Basic usage:

```python
from knowlang import CodebaseRAG

# Initialize with a repository
rag = CodebaseRAG("huggingface/transformers")

# Start the chat interface
rag.launch_chat()
```

## Architecture

KnowLang uses several key technologies:

- **Tree-sitter**: For robust, language-agnostic code parsing
- **ChromaDB**: For efficient vector storage and retrieval
- **PydanticAI**: For type-safe LLM interactions
- **Gradio**: For the interactive chat interface

## Technical Details

### Code Parsing

Our code parsing pipeline uses Tree-sitter to break down source code into meaningful chunks while preserving context:

1. Repository cloning and file identification
2. Semantic parsing with Tree-sitter
3. Smart chunking based on code structure
4. LLM-powered summarization
5. Embedding generation with mxbai-embed-large
6. Vector store indexing
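
As an illustration of step 1 (file identification), the sketch below walks an already-cloned repository and collects its Python files. The `find_python_files` function and the `SKIP_DIRS` exclusion set are assumptions made for this example; KnowLang's actual file selection rules may differ.

```python
import tempfile
from pathlib import Path

# Assumed exclusions: directories that rarely contain code worth indexing.
SKIP_DIRS = {".git", "__pycache__", "node_modules", ".venv"}


def find_python_files(repo_root: str) -> list[Path]:
    """Return the .py files under repo_root, sorted, skipping SKIP_DIRS."""
    root = Path(repo_root)
    found = []
    for path in root.rglob("*.py"):
        parent_parts = path.relative_to(root).parts[:-1]
        if any(part in SKIP_DIRS for part in parent_parts):
            continue  # the file lives inside an excluded directory
        if path.is_file():
            found.append(path)
    return sorted(found)


# Demo on a throwaway directory tree instead of a real clone.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("a.py", "pkg/b.py", ".git/hook.py", "notes.txt"):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("# placeholder\n")
    demo_files = [p.relative_to(root).as_posix() for p in find_python_files(tmp)]

print(demo_files)
```

Sorting the result keeps indexing runs reproducible, which helps when comparing vector stores built from the same commit.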

### RAG Implementation

The RAG system uses a multi-stage retrieval process:

1. Query embedding generation
2. Initial vector similarity search
3. Context aggregation
4. LLM response generation with full context
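
Step 3 (context aggregation) can be sketched as follows: retrieved chunks are ordered by score and packed into a single prompt under a character budget. The `build_prompt` function, the `(source, text, score)` tuple layout, and the prompt wording are all hypothetical choices for this example, not KnowLang's actual prompt.

```python
def build_prompt(question: str, hits: list[tuple[str, str, float]],
                 max_chars: int = 2000) -> str:
    """Pack retrieved chunks (best score first) into one prompt under a size budget."""
    ordered = sorted(hits, key=lambda hit: hit[2], reverse=True)
    sections: list[str] = []
    used = 0
    for source, text, _score in ordered:
        section = f"### {source}\n{text}"
        if used + len(section) > max_chars:
            # Skip chunks that would exceed the budget; smaller,
            # lower-ranked chunks may still fit afterwards.
            continue
        sections.append(section)
        used += len(section)
    context = "\n\n".join(sections) if sections else "(no context retrieved)"
    return (
        "Answer the question using only the code context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical search results: (source label, chunk text, similarity score).
example_hits = [
    ("a.py:foo", "def foo(): ...", 0.4),
    ("b.py:bar", "def bar(): ...", 0.9),
    ("c.py:big", "x" * 5000, 0.7),
]
prompt = build_prompt("What does bar do?", example_hits, max_chars=200)
print(prompt)
```

Labelling each section with its source lets the LLM cite where an answer came from, and the budget keeps the prompt within the model's context window.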

## Roadmap

- [ ] Inter-repository semantic search
- [ ] Support for additional programming languages
- [ ] Automatic documentation maintenance
- [ ] Integration with popular IDEs
- [ ] Custom embedding model training
- [ ] Enhanced evaluation metrics

## License

This project is licensed under the Apache License 2.0; see the [LICENSE](LICENSE) file for details. The Apache License 2.0 is a permissive license that allows broad use, modification, and distribution. It includes an express patent grant and does not grant rights to the licensor's trademarks.

## Citation

If you use KnowLang in your research, please cite:

```bibtex
@software{knowlang2025,
  author = {KnowLang},
  title = {KnowLang: Comprehensive Understanding for Complex Codebase},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/kimgb415/know-lang}
}
```

## Support

For support, please open an issue on GitHub or reach out to us directly through discussions.