ocr-time-capsule / CLAUDE.md
davanstrien's picture
davanstrien HF Staff
Update CLAUDE.md with reasoning trace parsing fix documentation
c4a9ea8

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with the OCR Text Explorer.

Project Overview

OCR Text Explorer is a modern, standalone web application for browsing and comparing OCR text improvements in HuggingFace datasets. Built as a lightweight alternative to the Gradio-based OCR Time Machine, it focuses specifically on exploring pre-OCR'd datasets with enhanced user experience.

Recent Updates

Deep Link Sharing (Added 2025-08-07)

The application now supports deep linking with full state preservation for easy sharing and collaboration:

Features:

  • Complete URL state management for all view settings
  • Copy Link button for one-click sharing
  • Automatic restoration of view state from URL parameters
  • Success notification when link is copied

URL Parameters Supported:

  • dataset - HuggingFace dataset ID
  • index - Sample index (0-based)
  • view - View mode (comparison, diff, improved)
  • diff - Diff algorithm (char, word, line, markdown)
  • markdown - Markdown rendering state (true/false)
  • reasoning - Reasoning panel expansion state (true/false, only for samples with reasoning traces)

Implementation Details:

  • URL updates automatically as users navigate and change settings
  • Prevents double-loading when URL contains specific index
  • Fallback clipboard API support for older browsers
  • Reasoning state only included in URL when reasoning trace is present

Example Deep Links:

  • Basic: /?dataset=davanstrien/exams-ocr&index=5
  • Full state: /?dataset=davanstrien/india-medical-ocr-test&index=0&view=improved&diff=word&markdown=true&reasoning=true

Reasoning Trace Support (Added 2025-08-07)

The application now supports displaying reasoning traces from models like NuMarkdown-8B-Thinking that include their analysis process in the output:

Features:

  • Automatic detection of <think> and <answer> XML-like tags in improved text
  • Collapsible "Model Reasoning" panel showing the model's thought process
  • Clean separation of reasoning from final output for better readability
  • Reasoning statistics (word count, percentage of total output)
  • Support for formatted reasoning steps with numbered analysis points
  • "Reasoning Trace" badge indicator in the statistics panel
  • Deep link support preserves reasoning panel state

Implementation Details:

  • New reasoning-parser.js module handles detection and parsing of reasoning traces
  • Supports multiple reasoning formats (<think>, <thinking>, <reasoning> tags)
  • Important: Only well-formed traces with both opening AND closing tags are parsed
  • Malformed traces (missing closing tags) are displayed as plain text
  • Formats numbered steps from reasoning content for structured display
  • Caches parsed reasoning to avoid reprocessing
  • Exports include optional reasoning trace content
  • Reasoning panel state included in shareable URLs

Supported Datasets:

  • davanstrien/india-medical-ocr-test - Medical documents processed with NuMarkdown-8B-Thinking
  • Any dataset with reasoning traces in supported XML-like formats

UI Components:

  • Collapsible reasoning panel with smooth animations
  • Step-by-step reasoning display with numbered indicators
  • "Final Output" label when reasoning is present
  • Dark mode optimized styling for reasoning sections

Markdown Rendering Support (Added 2025-08-01)

The application now supports rendering markdown-formatted VLM output for improved readability:

Features:

  • Automatic markdown detection in improved OCR text
  • Toggle button to switch between raw markdown and rendered view
  • Support for common markdown elements: headers, lists, tables, code blocks, links
  • Security-focused implementation with XSS prevention
  • Performance optimization with render caching

Implementation Details:

  • Uses marked.js library for markdown parsing
  • Custom renderers for security (sanitizes URLs, prevents script injection)
  • Tailwind-styled markdown elements matching the app's design
  • HTML table support for VLM outputs that use table tags
  • Cache system limits memory usage to 50 rendered items

UI Changes:

  • Markdown toggle button appears when markdown is detected
  • "Markdown Detected" badge in statistics panel
  • New "Markdown Diff" mode showing plain vs rendered comparison
  • Both "Improved Only" and "Side by Side" views support rendering

Architecture

Technology Stack

  • Frontend Framework: Alpine.js (lightweight reactivity, ~15KB)
  • Styling: Tailwind CSS (utility-first, responsive design)
  • Interactions: HTMX (server-side rendering capabilities)
  • API: HuggingFace Dataset Viewer API (no backend required)
  • Language: Vanilla JavaScript (no build process needed)

Core Components

index.html - Main application shell

  • Split-pane layout (1/3 image, 2/3 text comparison)
  • Three view modes: Side-by-side, Inline diff, Improved only
  • Dark mode support with proper contrast
  • Responsive design for mobile devices

js/dataset-api.js - HuggingFace API wrapper

  • Smart caching with 45-minute expiration for signed URLs
  • Batch loading (100 rows at a time)
  • Automatic column detection for different dataset schemas
  • Image URL refresh on expiration

js/app.js - Alpine.js application logic

  • Keyboard navigation (J/K, arrows)
  • URL state management for shareable links
  • Diff mode switching (character/word/line)
  • Dark mode persistence in localStorage

js/diff-utils.js - Text comparison algorithms

  • Character-level diff with inline highlighting
  • Word-level diff preserving whitespace
  • Line-level diff for larger changes
  • LCS (Longest Common Subsequence) implementation

css/styles.css - Custom styling

  • Dark mode enhancements
  • Diff highlighting with accessibility in mind
  • Smooth transitions and animations
  • Print-friendly styles

Key Design Decisions

Why Separate from OCR Time Machine?

  1. Focused Purpose: OCR Time Machine is for live OCR processing with VLMs (requires GPU), while this explorer is for browsing pre-processed results
  2. Performance: No Python/Gradio overhead - instant loading and navigation
  3. User Experience: Custom UI optimized for text comparison workflows
  4. Deployment: Static files can be hosted anywhere (GitHub Pages, CDN, etc.)

API vs Backend Trade-offs

Chose HF Dataset Viewer API because:

  • No backend infrastructure needed
  • Automatic image serving with CDN
  • Built-in pagination support
  • Works with any public HF dataset

Limitations accepted:

  • Image URLs expire (~1 hour)
  • 100 rows max per request
  • No write capabilities
  • Public datasets only (no auth yet)

UI/UX Principles

  1. Keyboard-first: Professional users prefer keyboard navigation
  2. Information density: Show more content, less chrome
  3. Visual diff: Color-coded changes are easier to scan than side-by-side
  4. Dark mode: Essential for extended reading sessions
  5. Responsive: Works on tablets for field work

Development Approach

Phase 1: MVP (Completed)

  • Basic dataset loading and navigation
  • Side-by-side text comparison
  • Keyboard shortcuts
  • Dark mode

Phase 2: Enhancements (Completed)

  • Three diff algorithms (char/word/line)
  • URL state management
  • Image error handling with refresh
  • Responsive mobile layout

Phase 3: Polish (Completed)

  • Fixed dark mode contrast issues
  • Optimized performance with direct indexing
  • Added loading states and error handling
  • Comprehensive documentation

Common Tasks

Adding Column Name Patterns

// In dataset-api.js detectColumns() method
if (!originalTextColumn && ['your_column_name'].includes(name)) {
    originalTextColumn = name;
}

Adding Keyboard Shortcuts

// In app.js setupKeyboardNavigation()
case 'your_key':
    // Your action
    break;

Customizing Diff Colors

// In diff-utils.js
// Light mode: bg-red-200, text-red-800
// Dark mode: bg-red-950, text-red-300

Working with Markdown Rendering

// Enable/disable markdown rendering
this.renderMarkdown = true; // Toggle markdown rendering

// Add new markdown patterns to detection
// In app.js detectMarkdown() method
const markdownPatterns = [
    /your_pattern_here/,  // Add your pattern
    // ... existing patterns
];

// Customize markdown styles
// In app.js renderMarkdownText() method
html = html.replace(/<your_element>/g, '<your_element class="your-tailwind-classes">');

Performance Optimizations

  1. Direct Dataset Indexing: Uses dataset[index] instead of loading batches into memory
  2. Smart Caching: Caches API responses for 45 minutes (conservative for signed URLs)
  3. Batch Fetching: Loads 100 rows at once, caches for smooth navigation
  4. Lazy Loading: Only fetches data when needed

Known Issues & Solutions

Issue: Navigation buttons were disabled

Cause: API response structure wasn't parsed correctly Fix: Updated getTotalRows() to check size.config.num_rows and size.splits[0].num_rows

Issue: Dark mode text unreadable

Cause: Insufficient contrast in diff highlighting and code blocks Fix:

  • Changed diff colors to use dark:bg-red-950 and dark:text-red-300
  • Added explicit text-gray-900 dark:text-gray-100 to all text containers

Issue: Image loading errors

Cause: Signed URLs expire after ~1 hour Fix: Implemented handleImageError() with automatic URL refresh

Issue: Markdown tables not rendering

Cause: Default marked.js settings and HTML security restrictions Fix:

  • Enabled tables: true in marked.js options
  • Added safe HTML table tag allowlist in renderer
  • Applied proper Tailwind CSS classes to table elements
  • Added CSS overrides for prose container compatibility

Mobile Support Status

While the application claims responsive design, the current mobile support is limited. A comprehensive mobile enhancement is planned but not yet implemented. See mobile-enhancement-plan.md for detailed technical requirements and implementation approach.

Current limitations:

  • Fixed desktop layout doesn't adapt well to small screens
  • No touch gesture support for navigation
  • Small touch targets for buttons and inputs
  • Desktop-only interactions (hover states, keyboard shortcuts)

Planned improvements:

  • Responsive stacked layout for mobile devices
  • Touch gestures (swipe for navigation)
  • Mobile-optimized navigation bar
  • Touch-friendly UI components

Future Enhancements

  • Comprehensive mobile support (see mobile-enhancement-plan.md)
  • Search/filter within dataset
  • Bookmark favorite samples
  • Export selected texts
  • Support for private datasets (auth)
  • Metrics display (CER/WER)
  • Batch operations
  • PWA support for offline viewing

Deployment

Static Hosting (Recommended)

# Any static file server works
python3 -m http.server 8000
npx serve .

GitHub Pages

  1. Push to GitHub repository
  2. Enable Pages in settings
  3. Access at: https://[username].github.io/[repo]/ocr-text-explorer/

CDN Deployment

  • Upload files to any CDN
  • No server-side processing needed
  • Works with CloudFlare, Netlify, Vercel, etc.

Testing Datasets

Known working datasets:

  • davanstrien/exams-ocr - Default dataset with exam papers (uses text and markdown columns)
  • davanstrien/rolm-test - Victorian theatre playbills processed with RolmOCR (uses text and rolmocr_text columns, includes inference_info metadata)
  • Any dataset with image + text columns

Column patterns automatically detected:

  • Original: text, ocr, original_text, ground_truth
  • Improved: markdown, new_ocr, corrected_text, vlm_ocr, rolmocr_text
  • Metadata: inference_info (JSON array with model details, processing date, parameters)

Recent Updates

Model Information Display (Added 2025-08-04)

The application now displays model processing information when available:

Features:

  • Automatic detection of inference_info column
  • Model metadata panel showing: model name, processing date, batch size, max tokens
  • Link to processing script when available
  • Positioned prominently below image for immediate visibility

Implementation Notes:

  • The model info panel only appears when inference_info column exists
  • Supports datasets processed with UV scripts via HF Jobs
  • Gracefully handles datasets without model metadata

Reasoning Trace Parsing Fix (Added 2025-08-07)

Fixed an issue where reasoning traces with incomplete or malformed XML tags would cause parsing errors:

Problem:

  • Some model outputs contained opening <think> tags without closing </think> tags
  • This appeared to be truncated or malformed model output
  • The parser would attempt to parse these incomplete traces, causing confusion

Solution:

  • Updated detectReasoningTrace() to require BOTH opening and closing tags
  • Added console warnings when incomplete traces are detected
  • Malformed traces are now displayed as plain text instead of being parsed

Benefits:

  • Cleaner handling of incomplete model outputs
  • No confusing partial reasoning panels for malformed content
  • Maintains full functionality for well-formed reasoning traces
  • Helpful console warnings for debugging

Technical Details:

  • File: js/reasoning-parser.js
  • Only traces with complete XML tags (<think>...</think>, <thinking>...</thinking>, etc.) are parsed
  • Incomplete traces log: "Incomplete reasoning trace detected - missing closing tags"