---
title: KnowLangBot
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
---
# KnowLang: Comprehensive Understanding for Complex Codebase
KnowLang is a codebase exploration tool that helps software engineers understand complex codebases through semantic search and Q&A. Our first release provides RAG-powered search and Q&A for popular open-source libraries, starting with Hugging Face's repositories.
[KnowLang Transformers Demo on Hugging Face Spaces](https://huggingface.co/spaces/gabykim/KnowLang_Transformers_Demo)
## Features
- **Semantic Code Search**: Find relevant code snippets based on natural language queries
- **Contextual Q&A**: Get detailed explanations about code functionality and implementation details
- **Smart Chunking**: Intelligent code parsing that preserves semantic meaning
- **Multi-Stage Retrieval**: Combined embedding and semantic search for better results
- **Python Support**: Currently optimized for Python codebases, with a roadmap for multi-language support
## How It Works
### Code Parsing Pipeline
```mermaid
flowchart TD
A[Git Repository] --> B[Code Files]
B --> C[Code Parser]
C --> D{Parse by Type}
D --> E[Class Definitions]
D --> F[Function Definitions]
D --> G[Other Code]
E --> H[Code Chunks]
F --> H
G --> H
H --> I[LLM Summarization]
H --> J
I --> J[Embeddings]
J --> K[(Vector Store)]
```
### RAG Chatbot Pipeline
```mermaid
flowchart LR
A[User Query] --> B[Query Embedding]
B --> C[Vector Search]
C --> D[Context Collection]
D --> E[LLM Response Generation]
E --> F[User Interface]
```
## Architecture
KnowLang uses several key technologies:
- **Tree-sitter**: For robust, language-agnostic code parsing
- **ChromaDB**: For efficient vector storage and retrieval
- **PydanticAI**: For type-safe LLM interactions
- **Gradio**: For the interactive chat interface
## Technical Details
### Code Parsing
Our code parsing pipeline uses Tree-sitter to break down source code into meaningful chunks while preserving context:
1. Repository cloning and file identification
2. Semantic parsing with Tree-sitter
3. Smart chunking based on code structure
4. LLM-powered summarization
5. Embedding generation with mxbai-embed-large
6. Vector store indexing
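
Steps 2 and 3 above can be illustrated with a minimal sketch. It is not KnowLang's parser: it assumes Python input only and swaps in the standard-library `ast` module for Tree-sitter so the example runs without extra dependencies. The `CodeChunk` type and the `chunk_python_source` function are hypothetical names introduced for illustration.

```python
import ast
from dataclasses import dataclass


@dataclass
class CodeChunk:
    """Hypothetical chunk record; the field names are assumptions."""
    kind: str        # "class", "function", or "other"
    name: str        # identifier, or "" for "other" chunks
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str        # source text of the chunk


def chunk_python_source(source: str) -> list[CodeChunk]:
    """Split top-level statements into class, function, and other chunks."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: list[CodeChunk] = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            kind, name = "class", node.name
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind, name = "function", node.name
        else:
            kind, name = "other", ""
        # Start at the first decorator, if any, so the chunk keeps its context.
        start = node.lineno
        decorators = getattr(node, "decorator_list", [])
        if decorators:
            start = min(d.lineno for d in decorators)
        end = node.end_lineno
        text = "\n".join(lines[start - 1:end])
        chunks.append(CodeChunk(kind, name, start, end, text))
    return chunks


# Small made-up input used to exercise the sketch.
SAMPLE = '''import os

class Greeter:
    def greet(self, name):
        return f"Hello, {name}"

def main():
    print(Greeter().greet("world"))
'''

for chunk in chunk_python_source(SAMPLE):
    print(chunk.kind, chunk.name, chunk.start_line, chunk.end_line)
```

Chunking along definition boundaries keeps each class or function intact, so a retrieved snippet is readable on its own. A Tree-sitter based parser can apply the same idea to languages other than Python.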
### RAG Implementation
The RAG system uses a multi-stage retrieval process:
1. Query embedding generation
2. Initial vector similarity search
3. Context aggregation
4. LLM response generation with full context
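
The four stages above can be sketched end to end with the standard library. This is a toy under stated assumptions, not KnowLang's implementation: a bag-of-words `Counter` stands in for the mxbai-embed-large embeddings, a plain list stands in for the ChromaDB vector store, and the final LLM call is left out, so the sketch stops at the assembled prompt. All function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Stages 1 and 2: embed the query, then rank stored chunks by similarity."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, contexts: list[str]) -> str:
    """Stage 3: aggregate retrieved chunks into a prompt for the LLM (stage 4)."""
    joined = "\n\n".join(contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )


# Made-up chunk summaries standing in for an indexed repository.
DOCS = [
    "def load_tokenizer(name): load a tokenizer by model name",
    "class Trainer: runs the training loop and evaluation",
    "def save_checkpoint(model, path): write model weights to disk",
]
STORE = [(doc, embed(doc)) for doc in DOCS]

question = "how do I load a tokenizer"
print(build_prompt(question, retrieve(question, STORE, k=1)))
```

Keeping retrieval and prompt assembly as separate functions mirrors the staged design: either stage can be replaced (for example, with a real embedding model or a reranker) without touching the other.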
## Roadmap
- [ ] Inter-repository semantic search
- [ ] Support for additional programming languages
- [ ] Automatic documentation maintenance
- [ ] Integration with popular IDEs
- [ ] Custom embedding model training
- [ ] Enhanced evaluation metrics
## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details. The Apache License 2.0 is a permissive license that enables broad use, modification, and distribution while providing patent rights and protecting trademark use.
## Citation
If you use KnowLang in your research, please cite:
```bibtex
@software{knowlang2025,
author = {KnowLang},
title = {KnowLang: Comprehensive Understanding for Complex Codebase},
year = {2025},
publisher = {GitHub},
url = {https://github.com/kimgb415/know-lang}
}
```
## Support
For support, please open an issue on GitHub or reach out to us directly through discussions.