metadata
title: KnowLangBot
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860

KnowLang: Comprehensive Understanding for Complex Codebase

KnowLang is a codebase exploration tool that helps software engineers understand complex codebases through semantic search and Q&A. Our first release provides RAG-powered search and Q&A for popular open-source libraries, with Hugging Face's repositories as the initial targets.

Hugging Face Space

Features

  • 🔍 Semantic Code Search: Find relevant code snippets based on natural language queries
  • 📚 Contextual Q&A: Get detailed explanations about code functionality and implementation details
  • 🎯 Smart Chunking: Intelligent code parsing that preserves semantic meaning
  • 🔄 Multi-Stage Retrieval: Combined embedding and semantic search for better results
  • 🐍 Python Support: Currently optimized for Python codebases, with a roadmap for multi-language support

How It Works

Code Parsing Pipeline

flowchart TD
    A[Git Repository] --> B[Code Files]
    B --> C[Code Parser]
    C --> D{Parse by Type}
    D --> E[Class Definitions]
    D --> F[Function Definitions]
    D --> G[Other Code]
    E --> H[Code Chunks]
    F --> H
    G --> H
    H --> I[LLM Summarization]
    H --> J
    I --> J[Embeddings]
    J --> K[(Vector Store)]
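
The "Parse by Type" step in the flowchart above can be sketched in a few lines. This is a minimal illustration, not KnowLang's parser: it swaps in Python's standard-library `ast` module for Tree-sitter so that it runs without extra dependencies, and the `CodeChunk` dataclass and `chunk_python_source` function are hypothetical names chosen for this example.

```python
import ast
from dataclasses import dataclass


@dataclass
class CodeChunk:
    """Hypothetical chunk record; the fields are assumptions for this sketch."""
    kind: str        # "class", "function", or "other"
    name: str        # identifier, or "" for "other" chunks
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    content: str


def chunk_python_source(source: str) -> list[CodeChunk]:
    """Split top-level statements into class, function, and other chunks."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            kind, name = "class", node.name
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind, name = "function", node.name
        else:
            kind, name = "other", ""
        # Decorators sit above the node's own line, so extend the span upward
        start = node.lineno
        decorators = getattr(node, "decorator_list", [])
        if decorators:
            start = min(start, min(d.lineno for d in decorators))
        content = "\n".join(lines[start - 1 : node.end_lineno])
        chunks.append(CodeChunk(kind, name, start, node.end_lineno, content))
    return chunks


SAMPLE = '''import os

class Tokenizer:
    def encode(self, text):
        return text.split()

def load(path):
    return open(path).read()
'''

for chunk in chunk_python_source(SAMPLE):
    print(chunk.kind, chunk.name, chunk.start_line, chunk.end_line)
```

Keeping a whole class or function in one chunk means each embedded unit is a complete definition rather than an arbitrary window of lines, which is one plausible reading of "preserves semantic meaning".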

RAG Chatbot Pipeline

flowchart LR
    A[User Query] --> B[Query Embedding]
    B --> C[Vector Search]
    C --> D[Context Collection]
    D --> E[LLM Response Generation]
    E --> F[User Interface]

Quick Start

Installation

pip install knowlang

Basic Usage

  1. First, parse and index your codebase:
# For a local codebase
knowlang parse ./my-project

# For verbose output
knowlang -v parse ./my-project
  2. Then, launch the chat interface:
knowlang chat

That's it! The chat interface will open in your browser, ready to answer questions about your codebase.

Advanced Usage

Custom Configuration

# Use custom configuration file
knowlang parse --config my_config.yaml ./my-project

# Output parsing results in JSON format
knowlang parse --output json ./my-project

Chat Interface Options

# Run on a specific port
knowlang chat --port 7860

# Create a shareable link
knowlang chat --share

# Run on a custom server
knowlang chat --server-name localhost --server-port 8000

Example Session

# Parse the transformers library
$ knowlang parse ./transformers
Found 1247 code chunks
Processing summaries... Done!

# Start chatting
$ knowlang chat

💡 Ask questions like:
- How is tokenization implemented?
- Explain the training pipeline
- Show me examples of custom model usage

Architecture

KnowLang uses several key technologies:

  • Tree-sitter: For robust, language-agnostic code parsing
  • ChromaDB: For efficient vector storage and retrieval
  • PydanticAI: For type-safe LLM interactions
  • Gradio: For the interactive chat interface

Technical Details

Code Parsing

Our code parsing pipeline uses Tree-sitter to break down source code into meaningful chunks while preserving context:

  1. Repository cloning and file identification
  2. Semantic parsing with Tree-sitter
  3. Smart chunking based on code structure
  4. LLM-powered summarization
  5. Embedding generation with mxbai-embed-large
  6. Vector store indexing
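
To make the last three steps more concrete, here is a rough sketch of how a chunk and its LLM summary might be packaged into a single record before embedding and indexing. Everything in it is an assumption made for illustration: the `to_index_record` function, the record layout, and the choice to embed the summary and the code together are not taken from KnowLang's source.

```python
import hashlib


def to_index_record(file_path: str, name: str, start_line: int,
                    content: str, summary: str) -> dict:
    """Package one chunk and its summary as a record for a vector store.

    Hypothetical helper: the field names are assumptions for this sketch.
    """
    # Stable ID derived from the chunk's location, so re-indexing the same
    # chunk overwrites the old entry instead of adding a duplicate
    key = f"{file_path}:{name}:{start_line}"
    chunk_id = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return {
        "id": chunk_id,
        # Text to embed: the summary first, then the raw code (an assumption)
        "document": f"{summary}\n\n{content}",
        # Metadata lets the chat interface point back to the source location
        "metadata": {
            "file_path": file_path,
            "name": name,
            "start_line": start_line,
        },
    }


record = to_index_record(
    "tokenizer.py", "encode", 4,
    "def encode(self, text):\n    return text.split()",
    "Splits the input text on whitespace.",
)
print(record["id"], record["metadata"])
```

A record shaped like this maps naturally onto vector stores that accept parallel lists of IDs, documents, and metadata, though the exact call depends on the store and its version.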

RAG Implementation

The RAG system uses a multi-stage retrieval process:

  1. Query embedding generation
  2. Initial vector similarity search
  3. Context aggregation
  4. LLM response generation with full context
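
The four steps above can be illustrated with a small, dependency-free sketch. It is a toy under stated assumptions: `embed` uses word counts in place of a real embedding model such as mxbai-embed-large, a plain Python list stands in for the vector store, and `retrieve` and `build_prompt` are hypothetical names. The final LLM call is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system would call a model."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Steps 1 and 2: embed the query, then rank documents by similarity."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop documents that share nothing with the query
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(query: str, context: list[str]) -> str:
    """Steps 3 and 4: aggregate the context and frame the LLM request."""
    joined = "\n\n---\n\n".join(context)
    return f"Answer using only this code context:\n\n{joined}\n\nQuestion: {query}"


DOCS = [
    "def tokenize(text): return text.split()",
    "def train(model, data): model.fit(data)",
    "class Config: pass",
]
question = "how does tokenize split text"
print(build_prompt(question, retrieve(question, DOCS, top_k=1)))
```

Filtering out zero-score results keeps unrelated chunks out of the prompt, which matters because the model is asked to answer from the supplied context alone.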

Roadmap

  • Inter-repository semantic search
  • Support for additional programming languages
  • Automatic documentation maintenance
  • Integration with popular IDEs
  • Custom embedding model training
  • Enhanced evaluation metrics

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details. The Apache License 2.0 is a permissive license that enables broad use, modification, and distribution while providing patent rights and protecting trademark use.

Citation

If you use KnowLang in your research, please cite:

@software{knowlang2025,
  author = {KnowLang},
  title = {KnowLang: Comprehensive Understanding for Complex Codebase},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/kimgb415/know-lang}
}

Support

For support, please open an issue on GitHub or reach out to us directly through discussions.