---
title: KnowLangBot
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
---
# KnowLang: Comprehensive Understanding for Complex Codebase
KnowLang is an advanced codebase exploration tool that helps software engineers better understand complex codebases through semantic search and intelligent Q&A capabilities. Our first release focuses on providing RAG-powered search and Q&A for popular open-source libraries, with Hugging Face's repositories as our initial targets.
[Live Demo on Hugging Face Spaces](https://huggingface.co/spaces/gabykim/KnowLang_Transformers_Demo)
## Features
- 🔍 **Semantic Code Search**: Find relevant code snippets based on natural language queries
- 💬 **Contextual Q&A**: Get detailed explanations about code functionality and implementation details
- 🎯 **Smart Chunking**: Intelligent code parsing that preserves semantic meaning
- 🔄 **Multi-Stage Retrieval**: Combined embedding and semantic search for better results
- 🐍 **Python Support**: Currently optimized for Python codebases, with a roadmap for multi-language support
## How It Works
### Code Parsing Pipeline
```mermaid
flowchart TD
A[Git Repository] --> B[Code Files]
B --> C[Code Parser]
C --> D{Parse by Type}
D --> E[Class Definitions]
D --> F[Function Definitions]
D --> G[Other Code]
E --> H[Code Chunks]
F --> H
G --> H
H --> I[LLM Summarization]
I --> J[Embeddings]
H --> J
J --> K[(Vector Store)]
```
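KnowLang's actual parser is built on Tree-sitter, but the chunking idea can be sketched with Python's standard `ast` module: each top-level class and function becomes its own chunk, and the remaining statements are grouped into an "other" chunk. This is a simplified stand-in for illustration, not the real implementation:

```python
import ast

def chunk_source(source: str):
    """Split Python source into class/function chunks plus leftover code.

    Simplified stand-in for the Tree-sitter parser described above:
    top-level classes and functions become one chunk each; everything
    else (imports, module-level statements) is grouped into 'Other'.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks, covered = [], set()
    for node in tree.body:
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            start, end = node.lineno, node.end_lineno
            covered.update(range(start, end + 1))
            chunks.append({
                "kind": type(node).__name__,
                "name": node.name,
                "code": "\n".join(lines[start - 1:end]),
            })
    # Everything not claimed by a class/function forms the 'Other Code' chunk.
    other = [l for i, l in enumerate(lines, 1) if i not in covered]
    if any(l.strip() for l in other):
        chunks.append({"kind": "Other", "name": None, "code": "\n".join(other)})
    return chunks

example = '''import os

def greet(name):
    return f"hello {name}"

class Greeter:
    def run(self):
        return greet("world")
'''
for c in chunk_source(example):
    print(c["kind"], c["name"])
```

Each chunk then flows into LLM summarization and embedding, as shown in the diagram above.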
### RAG Chatbot Pipeline
```mermaid
flowchart LR
A[User Query] --> B[Query Embedding]
B --> C[Vector Search]
C --> D[Context Collection]
D --> E[LLM Response Generation]
E --> F[User Interface]
```
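The steps above can be sketched end-to-end with a toy bag-of-words embedder standing in for the real model (KnowLang uses mxbai-embed-large and ChromaDB; everything here is illustrative only):

```python
import math
import re

# Toy fixed vocabulary; a real embedding model learns this representation.
VOCAB = ["tokenize", "train", "text", "loop", "weights", "save", "model"]

def embed(text: str):
    """Toy bag-of-words embedding: count vocabulary words in the text."""
    words = re.findall(r"[a-z_]+", text.lower())
    return [float(words.count(w)) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, store, top_k: int = 2):
    """Vector search: embed the query, rank chunks by cosine similarity."""
    q = embed(query)
    return sorted(store, key=lambda c: cosine(q, c["embedding"]), reverse=True)[:top_k]

# Index a few code "chunks" (normally done once, at parse time).
store = [
    {"text": t, "embedding": embed(t)}
    for t in [
        "def tokenize(text): split text into tokens",
        "class Trainer: runs the training loop",
        "def save_model(path): write weights to disk",
    ]
]

# Query -> embedding -> vector search -> context handed to the LLM.
hits = retrieve("how does tokenize work", store)
context = "\n---\n".join(h["text"] for h in hits)
print(hits[0]["text"])
```

The collected `context` is then inserted into the LLM prompt for response generation, which is the final stage of the pipeline.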
## Prerequisites
KnowLang uses [Ollama](https://ollama.com) as its default LLM and embedding provider. Before installing KnowLang:
1. Install Ollama:
```bash
# check the official download instructions from https://ollama.com/download
curl -fsSL https://ollama.com/install.sh | sh
```
2. Pull required models:
```bash
# For LLM responses
ollama pull llama3.2
# For code embeddings
ollama pull mxbai-embed-large
```
3. Verify Ollama is running:
```bash
ollama list
```
You should see both `llama3.2` and `mxbai-embed-large` in the list of available models.
Note: While Ollama is the default for easy setup, KnowLang supports other LLM providers through configuration. See our [Configuration Guide](configuration.md) for using alternative providers such as OpenAI or Anthropic.
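For example, a provider override might look roughly like the following. The key names here are hypothetical and shown only to convey the idea; the actual schema is documented in the Configuration Guide:

```yaml
# Illustrative only -- these keys are NOT guaranteed to match KnowLang's schema.
llm:
  model_provider: openai        # e.g. switch responses from ollama to openai
  model_name: gpt-4o
embedding:
  model_provider: ollama        # embeddings can stay on Ollama
  model_name: mxbai-embed-large
```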
## Quick Start
### System Requirements
- **RAM**: Minimum 16GB recommended (Ollama models require significant memory)
- **Storage**: At least 10GB free space for model files
- **OS**:
- Linux (recommended)
- macOS 12+ (Intel or Apple Silicon)
- Windows 10+ with WSL2
- **Python**: 3.10 or higher
### Installation
You can install KnowLang via pip:
```bash
pip install knowlang
```
Alternatively, you can clone the repository and install it in editable mode:
```bash
git clone https://github.com/kimgb415/know-lang.git
cd know-lang
pip install -e .
```
This allows you to make changes to the source code and have them immediately reflected without reinstalling the package.
### Basic Usage
1. First, parse and index your codebase:
```bash
# For a local codebase
knowlang parse ./my-project
# For verbose output
knowlang -v parse ./my-project
```
> ⚠️ **Warning**
> Make sure to set up the correct include and exclude paths for parsing. See the "Parser Settings" section in the [Configuration Guide](configuration.md) for more information.
2. Then, launch the chat interface:
```bash
knowlang chat
```
That's it! The chat interface will open in your browser, ready to answer questions about your codebase.
### Advanced Usage
#### Custom Configuration
```bash
# Use custom configuration file
knowlang parse --config my_config.yaml ./my-project
# Output parsing results in JSON format
knowlang parse --output json ./my-project
```
#### Chat Interface Options
```bash
# Run on a specific port
knowlang chat --port 7860
# Create a shareable link
knowlang chat --share
# Run on custom server
knowlang chat --server-name localhost --server-port 8000
```
### Example Session
```bash
# Parse the transformers library
$ knowlang parse ./transformers
Found 1247 code chunks
Processing summaries... Done!
# Start chatting
$ knowlang chat
💡 Ask questions like:
- How is tokenization implemented?
- Explain the training pipeline
- Show me examples of custom model usage
```
## Architecture
KnowLang uses several key technologies:
- **Tree-sitter**: For robust, language-agnostic code parsing
- **ChromaDB**: For efficient vector storage and retrieval
- **PydanticAI**: For type-safe LLM interactions
- **Gradio**: For the interactive chat interface
## Technical Details
### Code Parsing
Our code parsing pipeline uses Tree-sitter to break down source code into meaningful chunks while preserving context:
1. Repository cloning and file identification
2. Semantic parsing with Tree-sitter
3. Smart chunking based on code structure
4. LLM-powered summarization
5. Embedding generation with mxbai-embed-large
6. Vector store indexing
### RAG Implementation
The RAG system uses a multi-stage retrieval process:
1. Query embedding generation
2. Initial vector similarity search
3. Context aggregation
4. LLM response generation with full context
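Steps 3–4 above (aggregating context before generation) can be illustrated with a hypothetical prompt builder that packs ranked chunks into a fixed character budget. The function name and prompt format are assumptions for illustration, not KnowLang's actual prompt:

```python
def build_prompt(query: str, ranked_chunks, max_chars: int = 2000) -> str:
    """Pack the highest-ranked chunks into a context budget, then wrap
    them in a simple instruction prompt for the LLM (illustrative format)."""
    parts, used = [], 0
    for chunk in ranked_chunks:  # assumed sorted by similarity, best first
        if used + len(chunk) > max_chars:
            break  # stop before overflowing the context budget
        parts.append(chunk)
        used += len(chunk)
    context = "\n---\n".join(parts)
    return (
        "Answer the question using only the code context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_prompt(
    "What does greet do?",
    ["def greet(name): return f'hello {name}'", "x" * 5000],
)
print(len(prompt))
```

Note how the oversized second chunk is dropped: a real implementation would budget in model tokens rather than characters, but the aggregation logic is the same.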
## Roadmap
- [ ] Inter-repository semantic search
- [ ] Support for additional programming languages
- [ ] Automatic documentation maintenance
- [ ] Integration with popular IDEs
- [ ] Custom embedding model training
- [ ] Enhanced evaluation metrics
## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details. The Apache License 2.0 is a permissive license that enables broad use, modification, and distribution while providing patent rights and protecting trademark use.
## Citation
If you use KnowLang in your research, please cite:
```bibtex
@software{knowlang2025,
  author = {KnowLang},
  title = {KnowLang: Comprehensive Understanding for Complex Codebase},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/kimgb415/know-lang}
}
```
## Support
For support, please open an issue on GitHub or reach out to us directly through discussions.