	---
	title: KnowLangBot
	emoji: 🤖
	colorFrom: blue
	colorTo: purple
	sdk: docker
	app_port: 7860
	---

	# KnowLang: Comprehensive Understanding for Complex Codebase

	KnowLang is an advanced codebase exploration tool that helps software engineers better understand complex codebases through semantic search and intelligent Q&A capabilities. Our first release focuses on providing RAG-powered search and Q&A for popular open-source libraries, with Hugging Face's repositories as our initial targets.

	[![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/gabykim/KnowLang_Transformers_Demo)

	## Features

	- 🔍 Semantic Code Search: Find relevant code snippets based on natural language queries
	- 📚 Contextual Q&A: Get detailed explanations about code functionality and implementation details
	- 🎯 Smart Chunking: Intelligent code parsing that preserves semantic meaning
	- 🔄 Multi-Stage Retrieval: Combined embedding and semantic search for better results
	- 🐍 Python Support: Currently optimized for Python codebases, with a roadmap for multi-language support

	## How It Works

	### Code Parsing Pipeline

	```mermaid
	flowchart TD
	A[Git Repository] --> B[Code Files]
	B --> C[Code Parser]
	C --> D{Parse by Type}
	D --> E[Class Definitions]
	D --> F[Function Definitions]
	D --> G[Other Code]
	E --> H[Code Chunks]
	F --> H
	G --> H
	H --> I[LLM Summarization]
	H --> J
	I --> J[Embeddings]
	J --> K[(Vector Store)]
	```

	### RAG Chatbot Pipeline

	```mermaid
	flowchart LR
	A[User Query] --> B[Query Embedding]
	B --> C[Vector Search]
	C --> D[Context Collection]
	D --> E[LLM Response Generation]
	E --> F[User Interface]
	```


	## Prerequisites

	KnowLang uses [Ollama](https://ollama.com) as its default LLM and embedding provider. Before installing KnowLang:

	1. Install Ollama:
	```bash
	# check the official download instructions from https://ollama.com/download
	curl -fsSL https://ollama.com/install.sh | sh
	```

	2. Pull required models:
	```bash
	# For LLM responses
	ollama pull llama3.2

	# For code embeddings
	ollama pull mxbai-embed-large
	```

	3. Verify Ollama is running:
	```bash
	ollama list
	```

	You should see both `llama3.2` and `mxbai-embed-large` in the list of available models.
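
The same check can be scripted. The sketch below assumes Ollama is serving its local HTTP API at the default address (`http://localhost:11434`) and that `GET /api/tags` lists the pulled models. The helper names `missing_models` and `fetch_installed` are illustrative and not part of KnowLang.

```python
import json
from urllib.request import urlopen

# Models named in the steps above.
REQUIRED = ("llama3.2", "mxbai-embed-large")


def missing_models(installed, required=REQUIRED):
    """Return the required models that are absent from `installed`.

    Ollama reports names with a tag suffix (e.g. "llama3.2:latest"),
    so only the part before the colon is compared.
    """
    base_names = {name.split(":", 1)[0] for name in installed}
    return [model for model in required if model not in base_names]


def fetch_installed(base_url="http://localhost:11434"):
    """Ask a running Ollama server which models it has pulled."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as response:
        payload = json.load(response)
    return [entry["name"] for entry in payload.get("models", [])]
```

With Ollama running, `missing_models(fetch_installed())` should return an empty list once both models are pulled.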

	Note: While Ollama is the default choice for easy setup, KnowLang supports other LLM providers through configuration. See our [Configuration Guide](configuration.md) for using alternative providers like OpenAI or Anthropic.

	## Quick Start

	### System Requirements

	- RAM: 16 GB or more recommended (Ollama models require significant memory)
	- Storage: At least 10GB free space for model files
	- OS:
	  - Linux (recommended)
	  - macOS 12+ (Intel or Apple Silicon)
	  - Windows 10+ with WSL2
	- Python: 3.10 or higher
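
A quick way to check the Python and storage requirements before installing is sketched below. It uses only the standard library; the `preflight` helper is hypothetical and not part of KnowLang. RAM is not checked because the standard library has no portable call for it.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)   # "Python: 3.10 or higher"
MIN_FREE_GB = 10       # "At least 10GB free space for model files"


def preflight(python_version=None, free_bytes=None, path="."):
    """Return a list of problems found; an empty list means the checks passed."""
    if python_version is None:
        python_version = sys.version_info[:2]
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free

    problems = []
    if tuple(python_version) < MIN_PYTHON:
        problems.append("Python 3.10 or higher is required")
    if free_bytes < MIN_FREE_GB * 1024**3:
        problems.append(f"Less than {MIN_FREE_GB} GB of free disk space")
    return problems
```

Calling `preflight()` with no arguments inspects the current interpreter and the disk that holds the working directory.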


	### Installation
	You can install KnowLang via pip:
	```bash
	pip install knowlang
	```
	Alternatively, you can clone the repository and install it in editable mode:
	```bash
	git clone https://github.com/kimgb415/know-lang.git
	cd know-lang
	pip install -e .
	```
	This allows you to make changes to the source code and have them immediately reflected without reinstalling the package.

	### Basic Usage

	1. First, parse and index your codebase:
	```bash
	# For a local codebase
	knowlang parse ./my-project

	# For verbose output
	knowlang -v parse ./my-project
	```
	> ⚠️ Warning
	> Make sure to set up the correct include and exclude paths for parsing. See the "Parser Settings" section of the [Configuration Guide](configuration.md) for details.

	2. Then, launch the chat interface:
	```bash
	knowlang chat
	```

	That's it! The chat interface will open in your browser, ready to answer questions about your codebase.

	![Chat Interface](chat.png)

	### Advanced Usage

	#### Custom Configuration
	```bash
	# Use custom configuration file
	knowlang parse --config my_config.yaml ./my-project

	# Output parsing results in JSON format
	knowlang parse --output json ./my-project
	```

	#### Chat Interface Options
	```bash
	# Run on a specific port
	knowlang chat --port 7860

	# Create a shareable link
	knowlang chat --share

	# Run on custom server
	knowlang chat --server-name localhost --server-port 8000
	```

	### Example Session

	```bash
	# Parse the transformers library
	$ knowlang parse ./transformers
	Found 1247 code chunks
	Processing summaries... Done!

	# Start chatting
	$ knowlang chat

	💡 Ask questions like:
	- How is tokenization implemented?
	- Explain the training pipeline
	- Show me examples of custom model usage
	```

	## Architecture

	KnowLang uses several key technologies:

	- Tree-sitter: For robust, language-agnostic code parsing
	- ChromaDB: For efficient vector storage and retrieval
	- PydanticAI: For type-safe LLM interactions
	- Gradio: For the interactive chat interface

	## Technical Details

	### Code Parsing

	Our code parsing pipeline uses Tree-sitter to break down source code into meaningful chunks while preserving context:

	1. Repository cloning and file identification
	2. Semantic parsing with Tree-sitter
	3. Smart chunking based on code structure
	4. LLM-powered summarization
	5. Embedding generation with mxbai-embed-large
	6. Vector store indexing
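
Steps 2 and 3 can be sketched as follows. To stay runnable without extra dependencies, this example uses Python's standard `ast` module as a stand-in for Tree-sitter, so it illustrates the idea of structure-based chunking and is not KnowLang's actual parser. `CodeChunk` and `chunk_source` are hypothetical names.

```python
import ast
from dataclasses import dataclass


@dataclass
class CodeChunk:
    kind: str        # "class", "function", or "other"
    name: str        # definition name, or the AST node type for other code
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str


def chunk_source(source: str) -> list[CodeChunk]:
    """Split a Python module into one chunk per top-level statement."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            kind, name = "class", node.name
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind, name = "function", node.name
        else:
            kind, name = "other", type(node).__name__
        # Decorators sit above the `def`/`class` line, so start from the first one.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        end = node.end_lineno
        text = "\n".join(lines[start - 1:end])
        chunks.append(CodeChunk(kind, name, start, end, text))
    return chunks
```

Keeping each class or function whole means a chunk carries its own signature and body, which gives the summarization and embedding steps a self-contained unit to work with.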

	### RAG Implementation

	The RAG system uses a multi-stage retrieval process:

	1. Query embedding generation
	2. Initial vector similarity search
	3. Context aggregation
	4. LLM response generation with full context
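
The four stages above can be sketched end to end. This is a minimal illustration under stated assumptions: `embed` is a toy word-count embedding standing in for a real model such as `mxbai-embed-large`, the in-memory list stands in for the vector store, and the final LLM call is left out. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Stages 1 and 2: embed the query and rank chunks by similarity."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Stage 3: aggregate the retrieved chunks into a prompt for the LLM (stage 4)."""
    joined = "\n\n".join(context)
    return f"Answer using only the code below.\n\n{joined}\n\nQuestion: {query}"
```

In a real deployment the query and the chunks must be embedded with the same model, otherwise the similarity scores are not comparable.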


	## Roadmap

	- [ ] Inter-repository semantic search
	- [ ] Support for additional programming languages
	- [ ] Automatic documentation maintenance
	- [ ] Integration with popular IDEs
	- [ ] Custom embedding model training
	- [ ] Enhanced evaluation metrics

	## License

	This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details. The Apache License 2.0 is a permissive license that enables broad use, modification, and distribution while providing patent rights and protecting trademark use.

	## Citation

	If you use KnowLang in your research, please cite:

	```bibtex
	@software{knowlang2025,
	author = {KnowLang},
	title = {KnowLang: Comprehensive Understanding for Complex Codebase},
	year = {2025},
	publisher = {GitHub},
	url = {https://github.com/kimgb415/know-lang}
	}
	```

	## Support

	For support, please open an issue on GitHub or reach out to us directly through discussions.