---

language:
- de
pipeline_tag: image-to-text
---

# Ingredient Scanner
## Abstract

Thanks to recent advances in computer vision and optical character recognition, and by using a convolutional neural network to crop the product out of a picture, it has become possible to reliably extract ingredient lists from the back of a product with the Anthropic API. Open-weight or purely on-device optical character recognition still lacks the quality required for a production environment, although progress is promising. The Anthropic API is currently not feasible for production either, due to its high cost of 1 Swiss Franc per 100 pictures.

The training code and data are available on [GitHub](https://github.com/lenamerkli/ingredient-scanner/). This repository only contains an inference example and the [report](https://huggingface.co/lenamerkli/ingredient-scanner/blob/main/ingredient-scanner.pdf).

This is an entry for the [2024 Swiss AI competition](https://www.ki-wettbewerb.ch/).

## Table of Contents

0. [Abstract](#abstract)
1. [Report](#report)
2. [Model Details](#model-details)
3. [Usage](#usage)
4. [Citation](#citation)

## Report
Read the full report [here](https://huggingface.co/lenamerkli/ingredient-scanner/blob/main/ingredient-scanner.pdf).

## Model Details
This repository consists of two models: a vision model and a large language model.

### Vision Model
Custom convolutional neural network based on [ResNet18](https://pytorch.org/hub/pytorch_vision_resnet/). It detects the four corner points and the upper and lower limits of a product.
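
For illustration, a minimal PyTorch sketch of what such a model could look like follows. It assumes the network regresses ten values (four `(x, y)` corner points plus an upper and a lower limit); the class name, head size and input resolution are placeholders and are not taken from the actual training code.
```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


class ProductCornerNet(nn.Module):
    """Hypothetical ResNet18-based regressor for corner points and limits."""

    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()  # keep the 512-dimensional feature vector
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Linear(128, 10),  # 4 corners * (x, y) + upper limit + lower limit
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))


if __name__ == '__main__':
    model = ProductCornerNet()
    out = model(torch.randn(1, 3, 512, 512))  # one RGB image, assumed resolution
    print(out.shape)  # torch.Size([1, 10])
```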

### Language Model
Converts the text produced by the optical character recognition engine, which runs between the two models, into JSON. It is fine-tuned from [unsloth/Qwen2-0.5B-Instruct-bnb-4bit](https://huggingface.co/unsloth/Qwen2-0.5B-Instruct-bnb-4bit).
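
A minimal sketch of how such a model could be queried with the `transformers` library is shown below. The repository id, the prompt wording and the chat-template usage are assumptions; the actual prompt format used during fine-tuning may differ, and loading the 4-bit base model may additionally require `bitsandbytes`.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = 'lenamerkli/ingredient-scanner'  # assumed; replace with the actual model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map='auto')

# Raw OCR output from the vision stage (example text)
ocr_text = 'Zutaten: Weizenmehl, Zucker, Kakaobutter, Haselnüsse, Emulgator: Lecithine'
messages = [{'role': 'user', 'content': f'Convert this ingredient list to JSON:\n{ocr_text}'}]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors='pt'
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```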

## Usage
Clone the repository and install the dependencies on any Debian-based system:
```bash
git clone https://huggingface.co/lenamerkli/ingredient-scanner
cd ingredient-scanner
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
```
Note: not all requirements are needed for inference, as both training and inference requirements are listed.

Select the OCR engine in `main.py` by uncommenting one of lines 20 to 22:
```python
# ENGINE: list[str] = ['easyocr']
# ENGINE: list[str] = ['anthropic', 'claude-3-5-sonnet-20240620']
# ENGINE: list[str] = ['llama_cpp/v2/vision', 'qwen-vl-next_b2583']
```
Note: Qwen-VL-Next is not an official Qwen model; the name is used only to protect business secrets of a private model.
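
As a rough idea of how such an `ENGINE` list might be dispatched, a hypothetical sketch follows; the actual logic lives in `main.py` and may differ.
```python
# Hypothetical dispatch on the ENGINE setting; not the actual code from main.py.
ENGINE: list[str] = ['easyocr']

if ENGINE[0] == 'easyocr':
    import easyocr
    reader = easyocr.Reader(['de'])  # German, matching the model card's language tag
    text = ' '.join(reader.readtext('product.png', detail=0))
elif ENGINE[0] == 'anthropic':
    ...  # send the image to the Anthropic model named in ENGINE[1]
elif ENGINE[0].startswith('llama_cpp'):
    ...  # query a local llama.cpp vision endpoint with the model named in ENGINE[1]
```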

Run the inference script:
```bash
python3 main.py
```

You will be asked to enter the file path to a PNG image.

### Anthropic API

If you want to use the Anthropic API, create a `.env` file with the following content:
```
ANTHROPIC_API_KEY=YOUR_API_KEY
```
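
The script can then read the key at startup, for example with `python-dotenv`; this is a minimal sketch and not necessarily how `main.py` loads it.
```python
import os

import anthropic
from dotenv import load_dotenv

load_dotenv()  # reads ANTHROPIC_API_KEY from the .env file into the environment
client = anthropic.Anthropic(api_key=os.environ['ANTHROPIC_API_KEY'])
```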

## Citation
Here is how to cite this paper in BibTeX format:
```bibtex
@misc{merkli2024ingredient-scanner,
    title={Ingredient Scanner: Automating Reading of Ingredient Labels with Computer Vision},
    author={Lena Merkli and Sonja Merkli},
    date={2024-07-16},
    url={https://huggingface.co/lenamerkli/ingredient-scanner},
}
```