ocr-time-capsule / CLAUDE.md
davanstrien's picture
davanstrien HF Staff
Update CLAUDE.md with reasoning trace parsing fix documentation
c4a9ea8
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with the OCR Text Explorer.
## Project Overview
OCR Text Explorer is a modern, standalone web application for browsing and comparing OCR text improvements in HuggingFace datasets. Built as a lightweight alternative to the Gradio-based OCR Time Machine, it focuses specifically on exploring pre-OCR'd datasets with enhanced user experience.
## Recent Updates
### Deep Link Sharing (Added 2025-08-07)
The application now supports deep linking with full state preservation for easy sharing and collaboration:
**Features:**
- Complete URL state management for all view settings
- Copy Link button for one-click sharing
- Automatic restoration of view state from URL parameters
- Success notification when link is copied
**URL Parameters Supported:**
- `dataset` - HuggingFace dataset ID
- `index` - Sample index (0-based)
- `view` - View mode (comparison, diff, improved)
- `diff` - Diff algorithm (char, word, line, markdown)
- `markdown` - Markdown rendering state (true/false)
- `reasoning` - Reasoning panel expansion state (true/false, only for samples with reasoning traces)
**Implementation Details:**
- URL updates automatically as users navigate and change settings
- Prevents double-loading when URL contains specific index
- Fallback clipboard API support for older browsers
- Reasoning state only included in URL when reasoning trace is present
**Example Deep Links:**
- Basic: `/?dataset=davanstrien/exams-ocr&index=5`
- Full state: `/?dataset=davanstrien/india-medical-ocr-test&index=0&view=improved&diff=word&markdown=true&reasoning=true`
### Reasoning Trace Support (Added 2025-08-07)
The application now supports displaying reasoning traces from models like NuMarkdown-8B-Thinking that include their analysis process in the output:
**Features:**
- Automatic detection of `<think>` and `<answer>` XML-like tags in improved text
- Collapsible "Model Reasoning" panel showing the model's thought process
- Clean separation of reasoning from final output for better readability
- Reasoning statistics (word count, percentage of total output)
- Support for formatted reasoning steps with numbered analysis points
- "Reasoning Trace" badge indicator in the statistics panel
- Deep link support preserves reasoning panel state
**Implementation Details:**
- New `reasoning-parser.js` module handles detection and parsing of reasoning traces
- Supports multiple reasoning formats (`<think>`, `<thinking>`, `<reasoning>` tags)
- **Important:** Only well-formed traces with both opening AND closing tags are parsed
- Malformed traces (missing closing tags) are displayed as plain text
- Formats numbered steps from reasoning content for structured display
- Caches parsed reasoning to avoid reprocessing
- Exports include optional reasoning trace content
- Reasoning panel state included in shareable URLs
**Supported Datasets:**
- `davanstrien/india-medical-ocr-test` - Medical documents processed with NuMarkdown-8B-Thinking
- Any dataset with reasoning traces in supported XML-like formats
**UI Components:**
- Collapsible reasoning panel with smooth animations
- Step-by-step reasoning display with numbered indicators
- "Final Output" label when reasoning is present
- Dark mode optimized styling for reasoning sections
### Markdown Rendering Support (Added 2025-08-01)
The application now supports rendering markdown-formatted VLM output for improved readability:
**Features:**
- Automatic markdown detection in improved OCR text
- Toggle button to switch between raw markdown and rendered view
- Support for common markdown elements: headers, lists, tables, code blocks, links
- Security-focused implementation with XSS prevention
- Performance optimization with render caching
**Implementation Details:**
- Uses marked.js library for markdown parsing
- Custom renderers for security (sanitizes URLs, prevents script injection)
- Tailwind-styled markdown elements matching the app's design
- HTML table support for VLM outputs that use table tags
- Cache system limits memory usage to 50 rendered items
**UI Changes:**
- Markdown toggle button appears when markdown is detected
- "Markdown Detected" badge in statistics panel
- New "Markdown Diff" mode showing plain vs rendered comparison
- Both "Improved Only" and "Side by Side" views support rendering
## Architecture
### Technology Stack
- **Frontend Framework**: Alpine.js (lightweight reactivity, ~15KB)
- **Styling**: Tailwind CSS (utility-first, responsive design)
- **Interactions**: HTMX (server-side rendering capabilities)
- **API**: HuggingFace Dataset Viewer API (no backend required)
- **Language**: Vanilla JavaScript (no build process needed)
### Core Components
**index.html** - Main application shell
- Split-pane layout (1/3 image, 2/3 text comparison)
- Three view modes: Side-by-side, Inline diff, Improved only
- Dark mode support with proper contrast
- Responsive design for mobile devices
**js/dataset-api.js** - HuggingFace API wrapper
- Smart caching with 45-minute expiration for signed URLs
- Batch loading (100 rows at a time)
- Automatic column detection for different dataset schemas
- Image URL refresh on expiration
**js/app.js** - Alpine.js application logic
- Keyboard navigation (J/K, arrows)
- URL state management for shareable links
- Diff mode switching (character/word/line)
- Dark mode persistence in localStorage
**js/diff-utils.js** - Text comparison algorithms
- Character-level diff with inline highlighting
- Word-level diff preserving whitespace
- Line-level diff for larger changes
- LCS (Longest Common Subsequence) implementation
**css/styles.css** - Custom styling
- Dark mode enhancements
- Diff highlighting with accessibility in mind
- Smooth transitions and animations
- Print-friendly styles
## Key Design Decisions
### Why Separate from OCR Time Machine?
1. **Focused Purpose**: OCR Time Machine is for live OCR processing with VLMs (requires GPU), while this explorer is for browsing pre-processed results
2. **Performance**: No Python/Gradio overhead - instant loading and navigation
3. **User Experience**: Custom UI optimized for text comparison workflows
4. **Deployment**: Static files can be hosted anywhere (GitHub Pages, CDN, etc.)
### API vs Backend Trade-offs
**Chose HF Dataset Viewer API because:**
- No backend infrastructure needed
- Automatic image serving with CDN
- Built-in pagination support
- Works with any public HF dataset
**Limitations accepted:**
- Image URLs expire (~1 hour)
- 100 rows max per request
- No write capabilities
- Public datasets only (no auth yet)
### UI/UX Principles
1. **Keyboard-first**: Professional users prefer keyboard navigation
2. **Information density**: Show more content, less chrome
3. **Visual diff**: Color-coded changes are easier to scan than side-by-side
4. **Dark mode**: Essential for extended reading sessions
5. **Responsive**: Works on tablets for field work
## Development Approach
### Phase 1: MVP (Completed)
- Basic dataset loading and navigation
- Side-by-side text comparison
- Keyboard shortcuts
- Dark mode
### Phase 2: Enhancements (Completed)
- Three diff algorithms (char/word/line)
- URL state management
- Image error handling with refresh
- Responsive mobile layout
### Phase 3: Polish (Completed)
- Fixed dark mode contrast issues
- Optimized performance with direct indexing
- Added loading states and error handling
- Comprehensive documentation
## Common Tasks
### Adding Column Name Patterns
```javascript
// In dataset-api.js detectColumns() method
if (!originalTextColumn && ['your_column_name'].includes(name)) {
originalTextColumn = name;
}
```
### Adding Keyboard Shortcuts
```javascript
// In app.js setupKeyboardNavigation()
case 'your_key':
// Your action
break;
```
### Customizing Diff Colors
```javascript
// In diff-utils.js
// Light mode: bg-red-200, text-red-800
// Dark mode: bg-red-950, text-red-300
```
### Working with Markdown Rendering
```javascript
// Enable/disable markdown rendering
this.renderMarkdown = true; // Toggle markdown rendering
// Add new markdown patterns to detection
// In app.js detectMarkdown() method
const markdownPatterns = [
/your_pattern_here/, // Add your pattern
// ... existing patterns
];
// Customize markdown styles
// In app.js renderMarkdownText() method
html = html.replace(/<your_element>/g, '<your_element class="your-tailwind-classes">');
```
## Performance Optimizations
1. **Direct Dataset Indexing**: Uses `dataset[index]` instead of loading batches into memory
2. **Smart Caching**: Caches API responses for 45 minutes (conservative for signed URLs)
3. **Batch Fetching**: Loads 100 rows at once, caches for smooth navigation
4. **Lazy Loading**: Only fetches data when needed
## Known Issues & Solutions
### Issue: Navigation buttons were disabled
**Cause**: API response structure wasn't parsed correctly
**Fix**: Updated getTotalRows() to check `size.config.num_rows` and `size.splits[0].num_rows`
### Issue: Dark mode text unreadable
**Cause**: Insufficient contrast in diff highlighting and code blocks
**Fix**:
- Changed diff colors to use `dark:bg-red-950` and `dark:text-red-300`
- Added explicit `text-gray-900 dark:text-gray-100` to all text containers
### Issue: Image loading errors
**Cause**: Signed URLs expire after ~1 hour
**Fix**: Implemented handleImageError() with automatic URL refresh
### Issue: Markdown tables not rendering
**Cause**: Default marked.js settings and HTML security restrictions
**Fix**:
- Enabled `tables: true` in marked.js options
- Added safe HTML table tag allowlist in renderer
- Applied proper Tailwind CSS classes to table elements
- Added CSS overrides for prose container compatibility
## Mobile Support Status
While the application claims responsive design, the current mobile support is limited. A comprehensive mobile enhancement is planned but not yet implemented. See [mobile-enhancement-plan.md](mobile-enhancement-plan.md) for detailed technical requirements and implementation approach.
**Current limitations:**
- Fixed desktop layout doesn't adapt well to small screens
- No touch gesture support for navigation
- Small touch targets for buttons and inputs
- Desktop-only interactions (hover states, keyboard shortcuts)
**Planned improvements:**
- Responsive stacked layout for mobile devices
- Touch gestures (swipe for navigation)
- Mobile-optimized navigation bar
- Touch-friendly UI components
## Future Enhancements
- [ ] Comprehensive mobile support (see mobile-enhancement-plan.md)
- [ ] Search/filter within dataset
- [ ] Bookmark favorite samples
- [ ] Export selected texts
- [ ] Support for private datasets (auth)
- [ ] Metrics display (CER/WER)
- [ ] Batch operations
- [ ] PWA support for offline viewing
## Deployment
### Static Hosting (Recommended)
```bash
# Any static file server works
python3 -m http.server 8000
npx serve .
```
### GitHub Pages
1. Push to GitHub repository
2. Enable Pages in settings
3. Access at: `https://[username].github.io/[repo]/ocr-text-explorer/`
### CDN Deployment
- Upload files to any CDN
- No server-side processing needed
- Works with CloudFlare, Netlify, Vercel, etc.
## Testing Datasets
Known working datasets:
- `davanstrien/exams-ocr` - Default dataset with exam papers (uses `text` and `markdown` columns)
- `davanstrien/rolm-test` - Victorian theatre playbills processed with RolmOCR (uses `text` and `rolmocr_text` columns, includes `inference_info` metadata)
- Any dataset with image + text columns
Column patterns automatically detected:
- Original: `text`, `ocr`, `original_text`, `ground_truth`
- Improved: `markdown`, `new_ocr`, `corrected_text`, `vlm_ocr`, `rolmocr_text`
- Metadata: `inference_info` (JSON array with model details, processing date, parameters)
## Recent Updates
### Model Information Display (Added 2025-08-04)
The application now displays model processing information when available:
**Features:**
- Automatic detection of `inference_info` column
- Model metadata panel showing: model name, processing date, batch size, max tokens
- Link to processing script when available
- Positioned prominently below image for immediate visibility
**Implementation Notes:**
- The model info panel only appears when `inference_info` column exists
- Supports datasets processed with UV scripts via HF Jobs
- Gracefully handles datasets without model metadata
### Reasoning Trace Parsing Fix (Added 2025-08-07)
Fixed an issue where reasoning traces with incomplete or malformed XML tags would cause parsing errors:
**Problem:**
- Some model outputs contained opening `<think>` tags without closing `</think>` tags
- This appeared to be truncated or malformed model output
- The parser would attempt to parse these incomplete traces, causing confusion
**Solution:**
- Updated `detectReasoningTrace()` to require BOTH opening and closing tags
- Added console warnings when incomplete traces are detected
- Malformed traces are now displayed as plain text instead of being parsed
**Benefits:**
- Cleaner handling of incomplete model outputs
- No confusing partial reasoning panels for malformed content
- Maintains full functionality for well-formed reasoning traces
- Helpful console warnings for debugging
**Technical Details:**
- File: `js/reasoning-parser.js`
- Only traces with complete XML tags (`<think>...</think>`, `<thinking>...</thinking>`, etc.) are parsed
- Incomplete traces log: "Incomplete reasoning trace detected - missing closing tags"