CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

VayuChat is a Streamlit-based conversational AI application for air quality data analysis. Users ask natural-language questions about PM2.5 and PM10 pollution data in an interactive chat interface and receive answers that include visualizations and data insights.

Architecture

The application follows a two-file architecture:

app.py: Main Streamlit application with UI components, chat interface, and user interaction handling
src.py: Core data processing logic, LLM integration, and code generation/execution engine

Key architectural patterns:

Code Generation Pipeline: User questions are converted to executable Python code via LLM prompting, then executed dynamically
Multi-LLM Support: Supports both Groq-hosted models (LLaMA and others) and Google Gemini models through LangChain
Session Management: Uses Streamlit session state for chat history and user interactions
Feedback Loop: Interactions and user feedback are logged to HuggingFace datasets

Development Commands

Run the Application

streamlit run app.py

Install Dependencies

pip install -r requirements.txt

Environment Setup

Create a .env file with the following variables:

GROQ_API_KEY=your_groq_api_key_here
GEMINI_TOKEN=your_google_gemini_api_key_here
HF_TOKEN=your_huggingface_token_here  # Optional, for logging
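A startup check for these variables could look like the minimal sketch below. The `check_env` helper is hypothetical and not part of the repository. It assumes the `.env` values have already been loaded into the process environment (for example by python-dotenv), and it treats both model keys as required, which is an assumption based on only `HF_TOKEN` being marked optional above.

```python
import os

# Variable names come from the .env listing above; the helper is hypothetical.
REQUIRED_KEYS = ("GROQ_API_KEY", "GEMINI_TOKEN")  # assumed required
OPTIONAL_KEYS = ("HF_TOKEN",)                     # only needed for logging


def check_env(env=None):
    """Return (missing_required, missing_optional) for the given mapping."""
    env = os.environ if env is None else env
    missing_required = [key for key in REQUIRED_KEYS if not env.get(key)]
    missing_optional = [key for key in OPTIONAL_KEYS if not env.get(key)]
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = check_env()
    if required:
        print("Missing required variables:", ", ".join(required))
    if optional:
        print("Missing optional variables:", ", ".join(optional))
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.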

Data Requirements

Data.csv: Must contain columns: Timestamp, station, PM2.5, PM10, address, city, latitude, longitude, state
IITGN_Logo.png: Logo image for the sidebar
questions.txt: Pre-defined quick prompt questions (optional)
system_prompt.txt: Contains specific instructions for the LLM code generation
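The column requirement for Data.csv can be verified before the app starts. The sketch below uses only the standard library and reads just the header row. The `missing_columns` helper is a hypothetical name, and the sample data is made up for illustration.

```python
import csv
import io

# Column names are taken from the Data.csv requirement above.
REQUIRED_COLUMNS = {
    "Timestamp", "station", "PM2.5", "PM10", "address",
    "city", "latitude", "longitude", "state",
}


def missing_columns(csv_file):
    """Return the sorted list of required columns absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return sorted(REQUIRED_COLUMNS - present)


# Illustrative in-memory file with an incomplete header.
sample = io.StringIO("Timestamp,station,PM2.5,PM10,city\n2023-01-01,S1,40.0,80.0,Delhi\n")
print(missing_columns(sample))
```

For the real file, the same function can be called as `missing_columns(open("Data.csv", newline=""))`; an empty list means every required column is present.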

Code Generation System

Code generation is implemented in src.py and works as follows:

Template-based Code Generation: User questions are embedded into a Python code template that includes data loading and analysis patterns
Dynamic Execution: Generated code is executed in a controlled environment with pandas, matplotlib, and other libraries available
Result Handling: The generated code stores its result in an answer variable, which holds either text, a number, or the path to a saved plot file
Error Recovery: Execution errors are caught and logged to HuggingFace datasets
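The steps above can be illustrated with a simplified sketch. It is not the code in src.py: the template, the `build_program` and `run_program` helpers, and the hard-coded stand-in for the LLM output are all hypothetical. Note also that plain `exec` is not a sandbox, so the "controlled environment" is not reproduced here.

```python
# Hypothetical template: the real one also includes data loading code.
TEMPLATE = """\
# Question: {question}
{body}
"""


def build_program(question, generated_body):
    """Embed the question and the LLM-generated body into the template."""
    return TEMPLATE.format(question=question, body=generated_body)


def run_program(program, data):
    """Execute the program and return whatever it stored in `answer`."""
    namespace = {"data": data}  # names made available to the generated code
    try:
        exec(program, namespace)
    except Exception as exc:  # report the failure instead of crashing the app
        return f"Error: {type(exc).__name__}: {exc}"
    return namespace.get("answer", "No answer produced")


# Stand-in for what an LLM might return for this question.
fake_llm_output = "answer = round(sum(data) / len(data), 2)"
program = build_program("What is the mean PM2.5?", fake_llm_output)
print(run_program(program, [40.0, 55.5, 61.2]))
```

Reading the result from a single agreed-upon variable keeps the caller independent of what the generated code does internally, whether it computes a number or saves a plot and returns its path.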

Key Functions (src.py unless noted)

ask_question(): Main entry point for processing user queries
preprocess_and_load_df(): Data loading and preprocessing
load_agent() / load_smart_df(): LLM agent initialization
log_interaction(): Interaction logging to HuggingFace
upload_feedback(): User feedback collection (in app.py)

Model Configuration

Available models are defined in both app.py and src.py:

Groq models: LLaMA 3.1, LLaMA 3.3, LLaMA 4 variants, DeepSeek-R1, GPT-OSS
Google models: Gemini 1.5 Pro

Plotting Guidelines

When generating visualization code, the system follows specific guidelines from system_prompt.txt:

Include India (60 µg/m³) and WHO (15 µg/m³) guidelines for PM2.5
Include India (100 µg/m³) and WHO (50 µg/m³) guidelines for PM10
Use tight layout and 45-degree rotated x-axis labels
Save plots with unique filenames using UUID
Use 'Reds' colormap for air quality visualizations
Round floating point numbers to 2 decimal places
Always report units (µg/m³) and include standard deviation/error for aggregations
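A few of these guidelines (reference thresholds, two-decimal rounding, units with standard deviation, and UUID-based filenames) can be sketched without any plotting library. The `summarize` and `unique_plot_path` helpers are hypothetical, and comparing a plain mean against the thresholds is a simplification for illustration.

```python
import statistics
import uuid

# Thresholds in µg/m³, taken from the guideline list above.
GUIDELINES = {
    "PM2.5": {"India": 60, "WHO": 15},
    "PM10": {"India": 100, "WHO": 50},
}


def summarize(pollutant, values):
    """Report mean ± sample standard deviation with units and exceeded guidelines."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # needs at least two values
    exceeded = [name for name, limit in GUIDELINES[pollutant].items() if mean > limit]
    return (
        f"{pollutant}: {mean:.2f} ± {std:.2f} µg/m³ "
        f"(exceeds: {', '.join(exceeded) or 'none'})"
    )


def unique_plot_path(prefix="plot"):
    """Build a collision-resistant filename for a saved plot."""
    return f"{prefix}_{uuid.uuid4().hex}.png"


print(summarize("PM2.5", [40.0, 50.0, 60.0]))
print(unique_plot_path())
```

Unique filenames matter because several chat sessions may save plots concurrently, and a fixed name would let one session overwrite another's figure.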

Logging and Feedback

All interactions are logged to the SustainabilityLabIITGN/VayuChat_logs HuggingFace dataset
User feedback is collected and stored in the SustainabilityLabIITGN/VayuChat_Feedback dataset
Sessions are tracked by UUID for analytics