CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with the OCR Text Explorer.

Project Overview

OCR Text Explorer is a modern, standalone web application for browsing and comparing OCR text improvements in HuggingFace datasets. Built as a lightweight alternative to the Gradio-based OCR Time Machine, it focuses specifically on exploring pre-OCR'd datasets with enhanced user experience.

Recent Updates

Deep Link Sharing (Added 2025-08-07)

The application now supports deep linking with full state preservation for easy sharing and collaboration:

Features:

Complete URL state management for all view settings
Copy Link button for one-click sharing
Automatic restoration of view state from URL parameters
Success notification when link is copied

URL Parameters Supported:

dataset - HuggingFace dataset ID
index - Sample index (0-based)
view - View mode (comparison, diff, improved)
diff - Diff algorithm (char, word, line, markdown)
markdown - Markdown rendering state (true/false)
reasoning - Reasoning panel expansion state (true/false, only for samples with reasoning traces)

Implementation Details:

URL updates automatically as users navigate and change settings
Prevents double-loading when URL contains specific index
Fallback clipboard API support for older browsers
Reasoning state only included in URL when reasoning trace is present

Example Deep Links:

Basic: /?dataset=davanstrien/exams-ocr&index=5
Full state: /?dataset=davanstrien/india-medical-ocr-test&index=0&view=improved&diff=word&markdown=true&reasoning=true

Reasoning Trace Support (Added 2025-08-07)

The application now supports displaying reasoning traces from models like NuMarkdown-8B-Thinking that include their analysis process in the output:

Features:

Automatic detection of <think> and <answer> XML-like tags in improved text
Collapsible "Model Reasoning" panel showing the model's thought process
Clean separation of reasoning from final output for better readability
Reasoning statistics (word count, percentage of total output)
Support for formatted reasoning steps with numbered analysis points
"Reasoning Trace" badge indicator in the statistics panel
Deep link support preserves reasoning panel state

Implementation Details:

New reasoning-parser.js module handles detection and parsing of reasoning traces
Supports multiple reasoning formats (<think>, <thinking>, <reasoning> tags)
Important: Only well-formed traces with both opening AND closing tags are parsed
Malformed traces (missing closing tags) are displayed as plain text
Formats numbered steps from reasoning content for structured display
Caches parsed reasoning to avoid reprocessing
Exports include optional reasoning trace content
Reasoning panel state included in shareable URLs

Supported Datasets:

davanstrien/india-medical-ocr-test - Medical documents processed with NuMarkdown-8B-Thinking
Any dataset with reasoning traces in supported XML-like formats

UI Components:

Collapsible reasoning panel with smooth animations
Step-by-step reasoning display with numbered indicators
"Final Output" label when reasoning is present
Dark mode optimized styling for reasoning sections

Markdown Rendering Support (Added 2025-08-01)

The application now supports rendering markdown-formatted VLM output for improved readability:

Features:

Automatic markdown detection in improved OCR text
Toggle button to switch between raw markdown and rendered view
Support for common markdown elements: headers, lists, tables, code blocks, links
Security-focused implementation with XSS prevention
Performance optimization with render caching

Implementation Details:

Uses marked.js library for markdown parsing
Custom renderers for security (sanitizes URLs, prevents script injection)
Tailwind-styled markdown elements matching the app's design
HTML table support for VLM outputs that use table tags
Cache system limits memory usage to 50 rendered items

UI Changes:

Markdown toggle button appears when markdown is detected
"Markdown Detected" badge in statistics panel
New "Markdown Diff" mode showing plain vs rendered comparison
Both "Improved Only" and "Side by Side" views support rendering

Architecture

Technology Stack

Frontend Framework: Alpine.js (lightweight reactivity, ~15KB)
Styling: Tailwind CSS (utility-first, responsive design)
Interactions: HTMX (server-side rendering capabilities)
API: HuggingFace Dataset Viewer API (no backend required)
Language: Vanilla JavaScript (no build process needed)

Core Components

index.html - Main application shell

Split-pane layout (1/3 image, 2/3 text comparison)
Three view modes: Side-by-side, Inline diff, Improved only
Dark mode support with proper contrast
Responsive design for mobile devices

js/dataset-api.js - HuggingFace API wrapper

Smart caching with 45-minute expiration for signed URLs
Batch loading (100 rows at a time)
Automatic column detection for different dataset schemas
Image URL refresh on expiration

js/app.js - Alpine.js application logic

Keyboard navigation (J/K, arrows)
URL state management for shareable links
Diff mode switching (character/word/line)
Dark mode persistence in localStorage

js/diff-utils.js - Text comparison algorithms

Character-level diff with inline highlighting
Word-level diff preserving whitespace
Line-level diff for larger changes
LCS (Longest Common Subsequence) implementation

css/styles.css - Custom styling

Dark mode enhancements
Diff highlighting with accessibility in mind
Smooth transitions and animations
Print-friendly styles

Key Design Decisions

Why Separate from OCR Time Machine?

Focused Purpose: OCR Time Machine is for live OCR processing with VLMs (requires GPU), while this explorer is for browsing pre-processed results
Performance: No Python/Gradio overhead - instant loading and navigation
User Experience: Custom UI optimized for text comparison workflows
Deployment: Static files can be hosted anywhere (GitHub Pages, CDN, etc.)

API vs Backend Trade-offs

Chose HF Dataset Viewer API because:

No backend infrastructure needed
Automatic image serving with CDN
Built-in pagination support
Works with any public HF dataset

Limitations accepted:

Image URLs expire (~1 hour)
100 rows max per request
No write capabilities
Public datasets only (no auth yet)

UI/UX Principles

Keyboard-first: Professional users prefer keyboard navigation
Information density: Show more content, less chrome
Visual diff: Color-coded changes are easier to scan than side-by-side
Dark mode: Essential for extended reading sessions
Responsive: Works on tablets for field work

Development Approach

Phase 1: MVP (Completed)

Basic dataset loading and navigation
Side-by-side text comparison
Keyboard shortcuts
Dark mode

Phase 2: Enhancements (Completed)

Three diff algorithms (char/word/line)
URL state management
Image error handling with refresh
Responsive mobile layout

Phase 3: Polish (Completed)

Fixed dark mode contrast issues
Optimized performance with direct indexing
Added loading states and error handling
Comprehensive documentation

Common Tasks

Adding Column Name Patterns

// In dataset-api.js detectColumns() method
if (!originalTextColumn && ['your_column_name'].includes(name)) {
    originalTextColumn = name;
}

Adding Keyboard Shortcuts

// In app.js setupKeyboardNavigation()
case 'your_key':
    // Your action
    break;

Customizing Diff Colors

// In diff-utils.js
// Light mode: bg-red-200, text-red-800
// Dark mode: bg-red-950, text-red-300

Working with Markdown Rendering

// Enable/disable markdown rendering
this.renderMarkdown = true; // Toggle markdown rendering

// Add new markdown patterns to detection
// In app.js detectMarkdown() method
const markdownPatterns = [
    /your_pattern_here/,  // Add your pattern
    // ... existing patterns
];

// Customize markdown styles
// In app.js renderMarkdownText() method
html = html.replace(/<your_element>/g, '<your_element class="your-tailwind-classes">');

Performance Optimizations

Direct Dataset Indexing: Uses dataset[index] instead of loading batches into memory
Smart Caching: Caches API responses for 45 minutes (conservative for signed URLs)
Batch Fetching: Loads 100 rows at once, caches for smooth navigation
Lazy Loading: Only fetches data when needed

Known Issues & Solutions

Issue: Navigation buttons were disabled

Cause: API response structure wasn't parsed correctly Fix: Updated getTotalRows() to check size.config.num_rows and size.splits[0].num_rows

Issue: Dark mode text unreadable

Cause: Insufficient contrast in diff highlighting and code blocks Fix:

Changed diff colors to use dark:bg-red-950 and dark:text-red-300
Added explicit text-gray-900 dark:text-gray-100 to all text containers

Issue: Image loading errors

Cause: Signed URLs expire after ~1 hour Fix: Implemented handleImageError() with automatic URL refresh

Issue: Markdown tables not rendering

Cause: Default marked.js settings and HTML security restrictions Fix:

Enabled tables: true in marked.js options
Added safe HTML table tag allowlist in renderer
Applied proper Tailwind CSS classes to table elements
Added CSS overrides for prose container compatibility

Mobile Support Status

While the application claims responsive design, the current mobile support is limited. A comprehensive mobile enhancement is planned but not yet implemented. See mobile-enhancement-plan.md for detailed technical requirements and implementation approach.

Current limitations:

Fixed desktop layout doesn't adapt well to small screens
No touch gesture support for navigation
Small touch targets for buttons and inputs
Desktop-only interactions (hover states, keyboard shortcuts)

Planned improvements:

Responsive stacked layout for mobile devices
Touch gestures (swipe for navigation)
Mobile-optimized navigation bar
Touch-friendly UI components

Future Enhancements

Comprehensive mobile support (see mobile-enhancement-plan.md)
Search/filter within dataset
Bookmark favorite samples
Export selected texts
Support for private datasets (auth)
Metrics display (CER/WER)
Batch operations
PWA support for offline viewing

Deployment

Static Hosting (Recommended)

# Any static file server works
python3 -m http.server 8000
npx serve .

GitHub Pages

Push to GitHub repository
Enable Pages in settings
Access at: https://[username].github.io/[repo]/ocr-text-explorer/

CDN Deployment

Upload files to any CDN
No server-side processing needed
Works with CloudFlare, Netlify, Vercel, etc.

Testing Datasets

Known working datasets:

davanstrien/exams-ocr - Default dataset with exam papers (uses text and markdown columns)
davanstrien/rolm-test - Victorian theatre playbills processed with RolmOCR (uses text and rolmocr_text columns, includes inference_info metadata)
Any dataset with image + text columns

Column patterns automatically detected:

Original: text, ocr, original_text, ground_truth
Improved: markdown, new_ocr, corrected_text, vlm_ocr, rolmocr_text
Metadata: inference_info (JSON array with model details, processing date, parameters)

Recent Updates

Model Information Display (Added 2025-08-04)

The application now displays model processing information when available:

Features:

Automatic detection of inference_info column
Model metadata panel showing: model name, processing date, batch size, max tokens
Link to processing script when available
Positioned prominently below image for immediate visibility

Implementation Notes:

The model info panel only appears when inference_info column exists
Supports datasets processed with UV scripts via HF Jobs
Gracefully handles datasets without model metadata

Reasoning Trace Parsing Fix (Added 2025-08-07)

Fixed an issue where reasoning traces with incomplete or malformed XML tags would cause parsing errors:

Problem:

Some model outputs contained opening <think> tags without closing </think> tags
This appeared to be truncated or malformed model output
The parser would attempt to parse these incomplete traces, causing confusion

Solution:

Updated detectReasoningTrace() to require BOTH opening and closing tags
Added console warnings when incomplete traces are detected
Malformed traces are now displayed as plain text instead of being parsed

Benefits:

Cleaner handling of incomplete model outputs
No confusing partial reasoning panels for malformed content
Maintains full functionality for well-formed reasoning traces
Helpful console warnings for debugging

Technical Details:

File: js/reasoning-parser.js
Only traces with complete XML tags (<think>...</think>, <thinking>...</thinking>, etc.) are parsed
Incomplete traces log: "Incomplete reasoning trace detected - missing closing tags"