Spaces:

davanstrien
/

ocr-time-capsule

Running

App Files Files Community

ocr-time-capsule / CLAUDE.md

davanstrien HF Staff

Update CLAUDE.md with reasoning trace parsing fix documentation

c4a9ea8 7 days ago

preview code

raw

history blame contribute delete

13.5 kB

	# CLAUDE.md

	This file provides guidance to Claude Code (claude.ai/code) when working with the OCR Text Explorer.

	## Project Overview

	OCR Text Explorer is a modern, standalone web application for browsing and comparing OCR text improvements in HuggingFace datasets. Built as a lightweight alternative to the Gradio-based OCR Time Machine, it focuses specifically on exploring pre-OCR'd datasets with enhanced user experience.

	## Recent Updates

	### Deep Link Sharing (Added 2025-08-07)

	The application now supports deep linking with full state preservation for easy sharing and collaboration:

	Features:
	- Complete URL state management for all view settings
	- Copy Link button for one-click sharing
	- Automatic restoration of view state from URL parameters
	- Success notification when link is copied

	URL Parameters Supported:
	- `dataset` - HuggingFace dataset ID
	- `index` - Sample index (0-based)
	- `view` - View mode (comparison, diff, improved)
	- `diff` - Diff algorithm (char, word, line, markdown)
	- `markdown` - Markdown rendering state (true/false)
	- `reasoning` - Reasoning panel expansion state (true/false, only for samples with reasoning traces)

	Implementation Details:
	- URL updates automatically as users navigate and change settings
	- Prevents double-loading when URL contains specific index
	- Fallback clipboard API support for older browsers
	- Reasoning state only included in URL when reasoning trace is present

	Example Deep Links:
	- Basic: `/?dataset=davanstrien/exams-ocr&index=5`
	- Full state: `/?dataset=davanstrien/india-medical-ocr-test&index=0&view=improved&diff=word&markdown=true&reasoning=true`

	### Reasoning Trace Support (Added 2025-08-07)

	The application now supports displaying reasoning traces from models like NuMarkdown-8B-Thinking that include their analysis process in the output:

	Features:
	- Automatic detection of `<think>` and `<answer>` XML-like tags in improved text
	- Collapsible "Model Reasoning" panel showing the model's thought process
	- Clean separation of reasoning from final output for better readability
	- Reasoning statistics (word count, percentage of total output)
	- Support for formatted reasoning steps with numbered analysis points
	- "Reasoning Trace" badge indicator in the statistics panel
	- Deep link support preserves reasoning panel state

	Implementation Details:
	- New `reasoning-parser.js` module handles detection and parsing of reasoning traces
	- Supports multiple reasoning formats (`<think>`, `<thinking>`, `<reasoning>` tags)
	- Important: Only well-formed traces with both opening AND closing tags are parsed
	- Malformed traces (missing closing tags) are displayed as plain text
	- Formats numbered steps from reasoning content for structured display
	- Caches parsed reasoning to avoid reprocessing
	- Exports include optional reasoning trace content
	- Reasoning panel state included in shareable URLs

	Supported Datasets:
	- `davanstrien/india-medical-ocr-test` - Medical documents processed with NuMarkdown-8B-Thinking
	- Any dataset with reasoning traces in supported XML-like formats

	UI Components:
	- Collapsible reasoning panel with smooth animations
	- Step-by-step reasoning display with numbered indicators
	- "Final Output" label when reasoning is present
	- Dark mode optimized styling for reasoning sections

	### Markdown Rendering Support (Added 2025-08-01)

	The application now supports rendering markdown-formatted VLM output for improved readability:

	Features:
	- Automatic markdown detection in improved OCR text
	- Toggle button to switch between raw markdown and rendered view
	- Support for common markdown elements: headers, lists, tables, code blocks, links
	- Security-focused implementation with XSS prevention
	- Performance optimization with render caching

	Implementation Details:
	- Uses marked.js library for markdown parsing
	- Custom renderers for security (sanitizes URLs, prevents script injection)
	- Tailwind-styled markdown elements matching the app's design
	- HTML table support for VLM outputs that use table tags
	- Cache system limits memory usage to 50 rendered items

	UI Changes:
	- Markdown toggle button appears when markdown is detected
	- "Markdown Detected" badge in statistics panel
	- New "Markdown Diff" mode showing plain vs rendered comparison
	- Both "Improved Only" and "Side by Side" views support rendering

	## Architecture

	### Technology Stack
	- Frontend Framework: Alpine.js (lightweight reactivity, ~15KB)
	- Styling: Tailwind CSS (utility-first, responsive design)
	- Interactions: HTMX (server-side rendering capabilities)
	- API: HuggingFace Dataset Viewer API (no backend required)
	- Language: Vanilla JavaScript (no build process needed)

	### Core Components

	index.html - Main application shell
	- Split-pane layout (1/3 image, 2/3 text comparison)
	- Three view modes: Side-by-side, Inline diff, Improved only
	- Dark mode support with proper contrast
	- Responsive design for mobile devices

	js/dataset-api.js - HuggingFace API wrapper
	- Smart caching with 45-minute expiration for signed URLs
	- Batch loading (100 rows at a time)
	- Automatic column detection for different dataset schemas
	- Image URL refresh on expiration

	js/app.js - Alpine.js application logic
	- Keyboard navigation (J/K, arrows)
	- URL state management for shareable links
	- Diff mode switching (character/word/line)
	- Dark mode persistence in localStorage

	js/diff-utils.js - Text comparison algorithms
	- Character-level diff with inline highlighting
	- Word-level diff preserving whitespace
	- Line-level diff for larger changes
	- LCS (Longest Common Subsequence) implementation

	css/styles.css - Custom styling
	- Dark mode enhancements
	- Diff highlighting with accessibility in mind
	- Smooth transitions and animations
	- Print-friendly styles

	## Key Design Decisions

	### Why Separate from OCR Time Machine?

	1. Focused Purpose: OCR Time Machine is for live OCR processing with VLMs (requires GPU), while this explorer is for browsing pre-processed results
	2. Performance: No Python/Gradio overhead - instant loading and navigation
	3. User Experience: Custom UI optimized for text comparison workflows
	4. Deployment: Static files can be hosted anywhere (GitHub Pages, CDN, etc.)

	### API vs Backend Trade-offs

	Chose HF Dataset Viewer API because:
	- No backend infrastructure needed
	- Automatic image serving with CDN
	- Built-in pagination support
	- Works with any public HF dataset

	Limitations accepted:
	- Image URLs expire (~1 hour)
	- 100 rows max per request
	- No write capabilities
	- Public datasets only (no auth yet)

	### UI/UX Principles

	1. Keyboard-first: Professional users prefer keyboard navigation
	2. Information density: Show more content, less chrome
	3. Visual diff: Color-coded changes are easier to scan than side-by-side
	4. Dark mode: Essential for extended reading sessions
	5. Responsive: Works on tablets for field work

	## Development Approach

	### Phase 1: MVP (Completed)
	- Basic dataset loading and navigation
	- Side-by-side text comparison
	- Keyboard shortcuts
	- Dark mode

	### Phase 2: Enhancements (Completed)
	- Three diff algorithms (char/word/line)
	- URL state management
	- Image error handling with refresh
	- Responsive mobile layout

	### Phase 3: Polish (Completed)
	- Fixed dark mode contrast issues
	- Optimized performance with direct indexing
	- Added loading states and error handling
	- Comprehensive documentation

	## Common Tasks

	### Adding Column Name Patterns
	```javascript
	// In dataset-api.js detectColumns() method
	if (!originalTextColumn && ['your_column_name'].includes(name)) {
	originalTextColumn = name;
	}
	```

	### Adding Keyboard Shortcuts
	```javascript
	// In app.js setupKeyboardNavigation()
	case 'your_key':
	// Your action
	break;
	```

	### Customizing Diff Colors
	```javascript
	// In diff-utils.js
	// Light mode: bg-red-200, text-red-800
	// Dark mode: bg-red-950, text-red-300
	```

	### Working with Markdown Rendering
	```javascript
	// Enable/disable markdown rendering
	this.renderMarkdown = true; // Toggle markdown rendering

	// Add new markdown patterns to detection
	// In app.js detectMarkdown() method
	const markdownPatterns = [
	/your_pattern_here/, // Add your pattern
	// ... existing patterns
	];

	// Customize markdown styles
	// In app.js renderMarkdownText() method
	html = html.replace(/<your_element>/g, '<your_element class="your-tailwind-classes">');
	```

	## Performance Optimizations

	1. Direct Dataset Indexing: Uses `dataset[index]` instead of loading batches into memory
	2. Smart Caching: Caches API responses for 45 minutes (conservative for signed URLs)
	3. Batch Fetching: Loads 100 rows at once, caches for smooth navigation
	4. Lazy Loading: Only fetches data when needed

	## Known Issues & Solutions

	### Issue: Navigation buttons were disabled
	Cause: API response structure wasn't parsed correctly
	Fix: Updated getTotalRows() to check `size.config.num_rows` and `size.splits[0].num_rows`

	### Issue: Dark mode text unreadable
	Cause: Insufficient contrast in diff highlighting and code blocks
	Fix:
	- Changed diff colors to use `dark:bg-red-950` and `dark:text-red-300`
	- Added explicit `text-gray-900 dark:text-gray-100` to all text containers

	### Issue: Image loading errors
	Cause: Signed URLs expire after ~1 hour
	Fix: Implemented handleImageError() with automatic URL refresh

	### Issue: Markdown tables not rendering
	Cause: Default marked.js settings and HTML security restrictions
	Fix:
	- Enabled `tables: true` in marked.js options
	- Added safe HTML table tag allowlist in renderer
	- Applied proper Tailwind CSS classes to table elements
	- Added CSS overrides for prose container compatibility

	## Mobile Support Status

	While the application claims responsive design, the current mobile support is limited. A comprehensive mobile enhancement is planned but not yet implemented. See [mobile-enhancement-plan.md](mobile-enhancement-plan.md) for detailed technical requirements and implementation approach.

	Current limitations:
	- Fixed desktop layout doesn't adapt well to small screens
	- No touch gesture support for navigation
	- Small touch targets for buttons and inputs
	- Desktop-only interactions (hover states, keyboard shortcuts)

	Planned improvements:
	- Responsive stacked layout for mobile devices
	- Touch gestures (swipe for navigation)
	- Mobile-optimized navigation bar
	- Touch-friendly UI components

	## Future Enhancements

	- [ ] Comprehensive mobile support (see mobile-enhancement-plan.md)
	- [ ] Search/filter within dataset
	- [ ] Bookmark favorite samples
	- [ ] Export selected texts
	- [ ] Support for private datasets (auth)
	- [ ] Metrics display (CER/WER)
	- [ ] Batch operations
	- [ ] PWA support for offline viewing

	## Deployment

	### Static Hosting (Recommended)
	```bash
	# Any static file server works
	python3 -m http.server 8000
	npx serve .
	```

	### GitHub Pages
	1. Push to GitHub repository
	2. Enable Pages in settings
	3. Access at: `https://[username].github.io/[repo]/ocr-text-explorer/`

	### CDN Deployment
	- Upload files to any CDN
	- No server-side processing needed
	- Works with CloudFlare, Netlify, Vercel, etc.

	## Testing Datasets

	Known working datasets:
	- `davanstrien/exams-ocr` - Default dataset with exam papers (uses `text` and `markdown` columns)
	- `davanstrien/rolm-test` - Victorian theatre playbills processed with RolmOCR (uses `text` and `rolmocr_text` columns, includes `inference_info` metadata)
	- Any dataset with image + text columns

	Column patterns automatically detected:
	- Original: `text`, `ocr`, `original_text`, `ground_truth`
	- Improved: `markdown`, `new_ocr`, `corrected_text`, `vlm_ocr`, `rolmocr_text`
	- Metadata: `inference_info` (JSON array with model details, processing date, parameters)

	## Recent Updates

	### Model Information Display (Added 2025-08-04)

	The application now displays model processing information when available:

	Features:
	- Automatic detection of `inference_info` column
	- Model metadata panel showing: model name, processing date, batch size, max tokens
	- Link to processing script when available
	- Positioned prominently below image for immediate visibility

	Implementation Notes:
	- The model info panel only appears when `inference_info` column exists
	- Supports datasets processed with UV scripts via HF Jobs
	- Gracefully handles datasets without model metadata

	### Reasoning Trace Parsing Fix (Added 2025-08-07)

	Fixed an issue where reasoning traces with incomplete or malformed XML tags would cause parsing errors:

	Problem:
	- Some model outputs contained opening `<think>` tags without closing `</think>` tags
	- This appeared to be truncated or malformed model output
	- The parser would attempt to parse these incomplete traces, causing confusion

	Solution:
	- Updated `detectReasoningTrace()` to require BOTH opening and closing tags
	- Added console warnings when incomplete traces are detected
	- Malformed traces are now displayed as plain text instead of being parsed

	Benefits:
	- Cleaner handling of incomplete model outputs
	- No confusing partial reasoning panels for malformed content
	- Maintains full functionality for well-formed reasoning traces
	- Helpful console warnings for debugging

	Technical Details:
	- File: `js/reasoning-parser.js`
	- Only traces with complete XML tags (`<think>...</think>`, `<thinking>...</thinking>`, etc.) are parsed
	- Incomplete traces log: "Incomplete reasoning trace detected - missing closing tags"