ritabratamaiti committed · verified
Commit 02b0a38 · Parent(s): 3f8880d

Update README.md

Files changed (1): README.md (+124, -3)

---
license: mit
datasets:
- unsloth/LaTeX_OCR
language:
- en
base_model:
- meta-llama/Llama-3.2-1B
- google/siglip-so400m-patch14-384
tags:
- vlm
- vision
- multimodal
- AnyModal
---
# AnyModal/LaTeX-OCR-Llama-3.2-1B

**AnyModal/LaTeX-OCR-Llama-3.2-1B** is an experimental model that converts images of handwritten and printed mathematical equations into LaTeX. Developed within the [AnyModal](https://github.com/ritabratamaiti/AnyModal) framework, it combines a `google/siglip-so400m-patch14-384` image encoder with the Llama 3.2-1B language model. The model was trained on 20% of the [unsloth/LaTeX_OCR dataset](https://huggingface.co/datasets/unsloth/LaTeX_OCR), which is itself a subset of the [linxy/LaTeX_OCR dataset](https://huggingface.co/datasets/linxy/LaTeX_OCR).

---

## Trained On

This model was trained on the [unsloth/LaTeX_OCR](https://huggingface.co/datasets/unsloth/LaTeX_OCR) dataset:

**LaTeX OCR Dataset**
*Linxy et al.*

The dataset contains 1% of the samples in the larger [linxy/LaTeX_OCR dataset](https://huggingface.co/datasets/linxy/LaTeX_OCR), which pairs images of handwritten and printed mathematical equations with their corresponding LaTeX expressions. The current model was trained on only 20% of this unsloth subset, underscoring its experimental nature.
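
For reference, here is a minimal sketch of loading and slicing this subset with the `datasets` library (requires `pip install datasets`; the `train` split and the `text` field name are assumptions based on the linxy/LaTeX_OCR layout and may differ):

```python
# Sketch only: load unsloth/LaTeX_OCR and take a 20% slice like the one used for training.
from datasets import load_dataset

ds = load_dataset("unsloth/LaTeX_OCR", split="train")  # split name assumed

# Deterministic 20% subset (the exact sampling used for this model is not specified here).
subset = ds.shuffle(seed=42).select(range(int(0.2 * len(ds))))

sample = subset[0]
print(sample.keys())       # expected to include an equation image and its LaTeX string
print(sample.get("text"))  # 'text' field name assumed
```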

---

## How to Use

### Installation

Install the required dependencies:

```bash
pip install torch transformers torchvision huggingface_hub tqdm matplotlib Pillow
```
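
Access to the Llama 3.2 weights is gated on the Hugging Face Hub, so you will also need an access token; the inference example below passes one to `llm.get_llm`. One common way to authenticate (adjust to your own setup):

```bash
huggingface-cli login              # interactive login; stores your token locally
# or export it for the current shell:
export HF_TOKEN=your_token_here    # placeholder value
```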

### Inference

The `llm`, `vision`, and `anymodal` modules used below are part of the [AnyModal repository](https://github.com/ritabratamaiti/AnyModal) rather than pip packages, so make sure those files are importable (for example, by running from the project's LaTeX OCR directory linked further down). Below is an example of generating LaTeX code from an image:

```python
import os

import torch
from PIL import Image
from huggingface_hub import snapshot_download

import anymodal
import llm
import vision

# Load the language model and tokenizer
llm_tokenizer, llm_model = llm.get_llm(
    "meta-llama/Llama-3.2-1B",
    access_token="GET_YOUR_OWN_TOKEN_FROM_HUGGINGFACE",
    quantized=False,
    use_peft=False,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
llm_model.to(device)

llm_hidden_size = llm.get_hidden_size(llm_tokenizer, llm_model)

# Load the vision model components
image_processor, vision_model, vision_hidden_size = vision.get_image_encoder(
    "google/siglip-so400m-patch14-384", use_peft=False
)

# Initialize the vision encoder and the projector ("vision tokenizer")
vision_encoder = vision.VisionEncoder(vision_model)
vision_tokenizer = vision.Projector(vision_hidden_size, llm_hidden_size, num_hidden=1)

# Initialize the MultiModalModel
multimodal_model = anymodal.MultiModalModel(
    input_processor=None,
    input_encoder=vision_encoder,
    input_tokenizer=vision_tokenizer,
    language_tokenizer=llm_tokenizer,
    language_model=llm_model,
    prompt_text="The latex expression of the equation in the image is: ",
)

# Download and load the pre-trained weights
if not os.path.exists("latex_ocr"):
    os.makedirs("latex_ocr")

snapshot_download("AnyModal/latex-ocr-Llama-3.2-1B", local_dir="latex_ocr")
multimodal_model._load_model("latex_ocr")

# Prepare an input image
image_path = "example_equation.jpg"  # Path to your image
image = Image.open(image_path).convert("RGB")
processed_image = image_processor(image, return_tensors="pt")
processed_image = {key: val.squeeze(0) for key, val in processed_image.items()}

# Generate the LaTeX expression
generated_caption = multimodal_model.generate(processed_image, max_new_tokens=120)
print("Generated LaTeX Caption:", generated_caption)
```
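
Since `matplotlib` is already in the dependency list, a quick way to eyeball the result is to render the predicted string with matplotlib's mathtext (a sketch only; mathtext supports just a subset of LaTeX, and this assumes `generate` returns a plain string):

```python
# Optional: render the predicted LaTeX to an image for a quick visual check.
import matplotlib
matplotlib.use("Agg")  # headless backend, so this also works on a server
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 1.5))
ax.axis("off")
ax.text(0.5, 0.5, f"${generated_caption.strip()}$", ha="center", va="center", fontsize=16)
fig.savefig("rendered_equation.png", bbox_inches="tight")
```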

---

## Project and Training Scripts

This model is part of the [AnyModal LaTeX OCR Project](https://github.com/ritabratamaiti/AnyModal/tree/main/LaTeX%20OCR).

- **Training Script**: [train.py](https://github.com/ritabratamaiti/AnyModal/blob/main/LaTeX%20OCR/train.py)
- **Inference Script**: [inference.py](https://github.com/ritabratamaiti/AnyModal/blob/main/LaTeX%20OCR/inference.py)

Refer to the project repository for further implementation details.
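
To work from the repository directly, a typical workflow looks like the following (a sketch; the directory name is taken from the links above, and the scripts may expect additional configuration such as a Hugging Face token):

```bash
git clone https://github.com/ritabratamaiti/AnyModal.git
cd "AnyModal/LaTeX OCR"
python train.py       # training entry point linked above
python inference.py   # inference entry point linked above
```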

---

## Project Details

- **Vision Encoder**: The `google/siglip-so400m-patch14-384` model, pre-trained for visual feature extraction, is used as the image encoder.
- **Projector Network**: A dense projection network aligns the visual features with the embedding space of the Llama 3.2-1B text generation model (see the sketch after this list).
- **Language Model**: Llama 3.2-1B, a small causal language model, generates the LaTeX expression.
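
The actual projector implementation lives in the repository's `vision.Projector`; purely as an illustration (an assumed, minimal PyTorch sketch, not the repository's code), a one-hidden-layer dense projector mapping SigLIP features into the Llama embedding space could look like this:

```python
import torch
import torch.nn as nn

class DenseProjector(nn.Module):
    """Illustrative projector: maps vision features to the LLM hidden size."""

    def __init__(self, vision_hidden_size: int, llm_hidden_size: int, num_hidden: int = 1):
        super().__init__()
        layers, in_dim = [], vision_hidden_size
        for _ in range(num_hidden):
            layers += [nn.Linear(in_dim, llm_hidden_size), nn.GELU()]
            in_dim = llm_hidden_size
        layers.append(nn.Linear(in_dim, llm_hidden_size))
        self.net = nn.Sequential(*layers)

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_hidden_size) -> (batch, num_patches, llm_hidden_size);
        # the projected tokens are prepended to the text embeddings fed to the language model.
        return self.net(vision_features)

# Example with typical sizes: SigLIP so400m features (1152-dim) -> Llama 3.2-1B hidden size (2048-dim).
projector = DenseProjector(vision_hidden_size=1152, llm_hidden_size=2048)
patch_tokens = torch.randn(1, 729, 1152)  # ~27x27 patch tokens for a 384px image at patch size 14
print(projector(patch_tokens).shape)      # torch.Size([1, 729, 2048])
```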

This implementation highlights a proof-of-concept approach using a limited training subset. Better performance can likely be achieved by training on more samples and incorporating a text-conditioned image encoder.