Spaces:
Running
Running
# CLAUDE.md | |
This file provides guidance to Claude Code (claude.ai/code) when working with the OCR Text Explorer. | |
## Project Overview | |
OCR Text Explorer is a modern, standalone web application for browsing and comparing OCR text improvements in HuggingFace datasets. Built as a lightweight alternative to the Gradio-based OCR Time Machine, it focuses specifically on exploring pre-OCR'd datasets with enhanced user experience. | |
## Recent Updates | |
### Deep Link Sharing (Added 2025-08-07) | |
The application now supports deep linking with full state preservation for easy sharing and collaboration: | |
**Features:** | |
- Complete URL state management for all view settings | |
- Copy Link button for one-click sharing | |
- Automatic restoration of view state from URL parameters | |
- Success notification when link is copied | |
**URL Parameters Supported:** | |
- `dataset` - HuggingFace dataset ID | |
- `index` - Sample index (0-based) | |
- `view` - View mode (comparison, diff, improved) | |
- `diff` - Diff algorithm (char, word, line, markdown) | |
- `markdown` - Markdown rendering state (true/false) | |
- `reasoning` - Reasoning panel expansion state (true/false, only for samples with reasoning traces) | |
**Implementation Details:** | |
- URL updates automatically as users navigate and change settings | |
- Prevents double-loading when URL contains specific index | |
- Fallback clipboard API support for older browsers | |
- Reasoning state only included in URL when reasoning trace is present | |
**Example Deep Links:** | |
- Basic: `/?dataset=davanstrien/exams-ocr&index=5` | |
- Full state: `/?dataset=davanstrien/india-medical-ocr-test&index=0&view=improved&diff=word&markdown=true&reasoning=true` | |
### Reasoning Trace Support (Added 2025-08-07) | |
The application now supports displaying reasoning traces from models like NuMarkdown-8B-Thinking that include their analysis process in the output: | |
**Features:** | |
- Automatic detection of `<think>` and `<answer>` XML-like tags in improved text | |
- Collapsible "Model Reasoning" panel showing the model's thought process | |
- Clean separation of reasoning from final output for better readability | |
- Reasoning statistics (word count, percentage of total output) | |
- Support for formatted reasoning steps with numbered analysis points | |
- "Reasoning Trace" badge indicator in the statistics panel | |
- Deep link support preserves reasoning panel state | |
**Implementation Details:** | |
- New `reasoning-parser.js` module handles detection and parsing of reasoning traces | |
- Supports multiple reasoning formats (`<think>`, `<thinking>`, `<reasoning>` tags) | |
- **Important:** Only well-formed traces with both opening AND closing tags are parsed | |
- Malformed traces (missing closing tags) are displayed as plain text | |
- Formats numbered steps from reasoning content for structured display | |
- Caches parsed reasoning to avoid reprocessing | |
- Exports include optional reasoning trace content | |
- Reasoning panel state included in shareable URLs | |
**Supported Datasets:** | |
- `davanstrien/india-medical-ocr-test` - Medical documents processed with NuMarkdown-8B-Thinking | |
- Any dataset with reasoning traces in supported XML-like formats | |
**UI Components:** | |
- Collapsible reasoning panel with smooth animations | |
- Step-by-step reasoning display with numbered indicators | |
- "Final Output" label when reasoning is present | |
- Dark mode optimized styling for reasoning sections | |
### Markdown Rendering Support (Added 2025-08-01) | |
The application now supports rendering markdown-formatted VLM output for improved readability: | |
**Features:** | |
- Automatic markdown detection in improved OCR text | |
- Toggle button to switch between raw markdown and rendered view | |
- Support for common markdown elements: headers, lists, tables, code blocks, links | |
- Security-focused implementation with XSS prevention | |
- Performance optimization with render caching | |
**Implementation Details:** | |
- Uses marked.js library for markdown parsing | |
- Custom renderers for security (sanitizes URLs, prevents script injection) | |
- Tailwind-styled markdown elements matching the app's design | |
- HTML table support for VLM outputs that use table tags | |
- Cache system limits memory usage to 50 rendered items | |
**UI Changes:** | |
- Markdown toggle button appears when markdown is detected | |
- "Markdown Detected" badge in statistics panel | |
- New "Markdown Diff" mode showing plain vs rendered comparison | |
- Both "Improved Only" and "Side by Side" views support rendering | |
## Architecture | |
### Technology Stack | |
- **Frontend Framework**: Alpine.js (lightweight reactivity, ~15KB) | |
- **Styling**: Tailwind CSS (utility-first, responsive design) | |
- **Interactions**: HTMX (server-side rendering capabilities) | |
- **API**: HuggingFace Dataset Viewer API (no backend required) | |
- **Language**: Vanilla JavaScript (no build process needed) | |
### Core Components | |
**index.html** - Main application shell | |
- Split-pane layout (1/3 image, 2/3 text comparison) | |
- Three view modes: Side-by-side, Inline diff, Improved only | |
- Dark mode support with proper contrast | |
- Responsive design for mobile devices | |
**js/dataset-api.js** - HuggingFace API wrapper | |
- Smart caching with 45-minute expiration for signed URLs | |
- Batch loading (100 rows at a time) | |
- Automatic column detection for different dataset schemas | |
- Image URL refresh on expiration | |
**js/app.js** - Alpine.js application logic | |
- Keyboard navigation (J/K, arrows) | |
- URL state management for shareable links | |
- Diff mode switching (character/word/line) | |
- Dark mode persistence in localStorage | |
**js/diff-utils.js** - Text comparison algorithms | |
- Character-level diff with inline highlighting | |
- Word-level diff preserving whitespace | |
- Line-level diff for larger changes | |
- LCS (Longest Common Subsequence) implementation | |
**css/styles.css** - Custom styling | |
- Dark mode enhancements | |
- Diff highlighting with accessibility in mind | |
- Smooth transitions and animations | |
- Print-friendly styles | |
## Key Design Decisions | |
### Why Separate from OCR Time Machine? | |
1. **Focused Purpose**: OCR Time Machine is for live OCR processing with VLMs (requires GPU), while this explorer is for browsing pre-processed results | |
2. **Performance**: No Python/Gradio overhead - instant loading and navigation | |
3. **User Experience**: Custom UI optimized for text comparison workflows | |
4. **Deployment**: Static files can be hosted anywhere (GitHub Pages, CDN, etc.) | |
### API vs Backend Trade-offs | |
**Chose HF Dataset Viewer API because:** | |
- No backend infrastructure needed | |
- Automatic image serving with CDN | |
- Built-in pagination support | |
- Works with any public HF dataset | |
**Limitations accepted:** | |
- Image URLs expire (~1 hour) | |
- 100 rows max per request | |
- No write capabilities | |
- Public datasets only (no auth yet) | |
### UI/UX Principles | |
1. **Keyboard-first**: Professional users prefer keyboard navigation | |
2. **Information density**: Show more content, less chrome | |
3. **Visual diff**: Color-coded changes are easier to scan than side-by-side | |
4. **Dark mode**: Essential for extended reading sessions | |
5. **Responsive**: Works on tablets for field work | |
## Development Approach | |
### Phase 1: MVP (Completed) | |
- Basic dataset loading and navigation | |
- Side-by-side text comparison | |
- Keyboard shortcuts | |
- Dark mode | |
### Phase 2: Enhancements (Completed) | |
- Three diff algorithms (char/word/line) | |
- URL state management | |
- Image error handling with refresh | |
- Responsive mobile layout | |
### Phase 3: Polish (Completed) | |
- Fixed dark mode contrast issues | |
- Optimized performance with direct indexing | |
- Added loading states and error handling | |
- Comprehensive documentation | |
## Common Tasks | |
### Adding Column Name Patterns | |
```javascript | |
// In dataset-api.js detectColumns() method | |
if (!originalTextColumn && ['your_column_name'].includes(name)) { | |
originalTextColumn = name; | |
} | |
``` | |
### Adding Keyboard Shortcuts | |
```javascript | |
// In app.js setupKeyboardNavigation() | |
case 'your_key': | |
// Your action | |
break; | |
``` | |
### Customizing Diff Colors | |
```javascript | |
// In diff-utils.js | |
// Light mode: bg-red-200, text-red-800 | |
// Dark mode: bg-red-950, text-red-300 | |
``` | |
### Working with Markdown Rendering | |
```javascript | |
// Enable/disable markdown rendering | |
this.renderMarkdown = true; // Toggle markdown rendering | |
// Add new markdown patterns to detection | |
// In app.js detectMarkdown() method | |
const markdownPatterns = [ | |
/your_pattern_here/, // Add your pattern | |
// ... existing patterns | |
]; | |
// Customize markdown styles | |
// In app.js renderMarkdownText() method | |
html = html.replace(/<your_element>/g, '<your_element class="your-tailwind-classes">'); | |
``` | |
## Performance Optimizations | |
1. **Direct Dataset Indexing**: Uses `dataset[index]` instead of loading batches into memory | |
2. **Smart Caching**: Caches API responses for 45 minutes (conservative for signed URLs) | |
3. **Batch Fetching**: Loads 100 rows at once, caches for smooth navigation | |
4. **Lazy Loading**: Only fetches data when needed | |
## Known Issues & Solutions | |
### Issue: Navigation buttons were disabled | |
**Cause**: API response structure wasn't parsed correctly | |
**Fix**: Updated getTotalRows() to check `size.config.num_rows` and `size.splits[0].num_rows` | |
### Issue: Dark mode text unreadable | |
**Cause**: Insufficient contrast in diff highlighting and code blocks | |
**Fix**: | |
- Changed diff colors to use `dark:bg-red-950` and `dark:text-red-300` | |
- Added explicit `text-gray-900 dark:text-gray-100` to all text containers | |
### Issue: Image loading errors | |
**Cause**: Signed URLs expire after ~1 hour | |
**Fix**: Implemented handleImageError() with automatic URL refresh | |
### Issue: Markdown tables not rendering | |
**Cause**: Default marked.js settings and HTML security restrictions | |
**Fix**: | |
- Enabled `tables: true` in marked.js options | |
- Added safe HTML table tag allowlist in renderer | |
- Applied proper Tailwind CSS classes to table elements | |
- Added CSS overrides for prose container compatibility | |
## Mobile Support Status | |
While the application claims responsive design, the current mobile support is limited. A comprehensive mobile enhancement is planned but not yet implemented. See [mobile-enhancement-plan.md](mobile-enhancement-plan.md) for detailed technical requirements and implementation approach. | |
**Current limitations:** | |
- Fixed desktop layout doesn't adapt well to small screens | |
- No touch gesture support for navigation | |
- Small touch targets for buttons and inputs | |
- Desktop-only interactions (hover states, keyboard shortcuts) | |
**Planned improvements:** | |
- Responsive stacked layout for mobile devices | |
- Touch gestures (swipe for navigation) | |
- Mobile-optimized navigation bar | |
- Touch-friendly UI components | |
## Future Enhancements | |
- [ ] Comprehensive mobile support (see mobile-enhancement-plan.md) | |
- [ ] Search/filter within dataset | |
- [ ] Bookmark favorite samples | |
- [ ] Export selected texts | |
- [ ] Support for private datasets (auth) | |
- [ ] Metrics display (CER/WER) | |
- [ ] Batch operations | |
- [ ] PWA support for offline viewing | |
## Deployment | |
### Static Hosting (Recommended) | |
```bash | |
# Any static file server works | |
python3 -m http.server 8000 | |
npx serve . | |
``` | |
### GitHub Pages | |
1. Push to GitHub repository | |
2. Enable Pages in settings | |
3. Access at: `https://[username].github.io/[repo]/ocr-text-explorer/` | |
### CDN Deployment | |
- Upload files to any CDN | |
- No server-side processing needed | |
- Works with CloudFlare, Netlify, Vercel, etc. | |
## Testing Datasets | |
Known working datasets: | |
- `davanstrien/exams-ocr` - Default dataset with exam papers (uses `text` and `markdown` columns) | |
- `davanstrien/rolm-test` - Victorian theatre playbills processed with RolmOCR (uses `text` and `rolmocr_text` columns, includes `inference_info` metadata) | |
- Any dataset with image + text columns | |
Column patterns automatically detected: | |
- Original: `text`, `ocr`, `original_text`, `ground_truth` | |
- Improved: `markdown`, `new_ocr`, `corrected_text`, `vlm_ocr`, `rolmocr_text` | |
- Metadata: `inference_info` (JSON array with model details, processing date, parameters) | |
## Recent Updates | |
### Model Information Display (Added 2025-08-04) | |
The application now displays model processing information when available: | |
**Features:** | |
- Automatic detection of `inference_info` column | |
- Model metadata panel showing: model name, processing date, batch size, max tokens | |
- Link to processing script when available | |
- Positioned prominently below image for immediate visibility | |
**Implementation Notes:** | |
- The model info panel only appears when `inference_info` column exists | |
- Supports datasets processed with UV scripts via HF Jobs | |
- Gracefully handles datasets without model metadata | |
### Reasoning Trace Parsing Fix (Added 2025-08-07) | |
Fixed an issue where reasoning traces with incomplete or malformed XML tags would cause parsing errors: | |
**Problem:** | |
- Some model outputs contained opening `<think>` tags without closing `</think>` tags | |
- This appeared to be truncated or malformed model output | |
- The parser would attempt to parse these incomplete traces, causing confusion | |
**Solution:** | |
- Updated `detectReasoningTrace()` to require BOTH opening and closing tags | |
- Added console warnings when incomplete traces are detected | |
- Malformed traces are now displayed as plain text instead of being parsed | |
**Benefits:** | |
- Cleaner handling of incomplete model outputs | |
- No confusing partial reasoning panels for malformed content | |
- Maintains full functionality for well-formed reasoning traces | |
- Helpful console warnings for debugging | |
**Technical Details:** | |
- File: `js/reasoning-parser.js` | |
- Only traces with complete XML tags (`<think>...</think>`, `<thinking>...</thinking>`, etc.) are parsed | |
- Incomplete traces log: "Incomplete reasoning trace detected - missing closing tags" |