machineuser committed · Commit 2b1e8c5 · 1 parent: 86c4ad7

Sync widgets demo
packages/tasks/src/tasks/image-to-text/about.md (CHANGED)
````diff
@@ -27,6 +27,19 @@ captioner("https://huggingface.co/datasets/Narsil/image_dummy/resolve/main/parro
 ## [{'generated_text': 'two birds are standing next to each other '}]
 ```
 
+### Conversation about the Image
+
+Some text generation models also take image inputs. These are called vision language models. You can use the `image-to-text` pipeline to run these models, as shown below.
+
+```python
+from transformers import pipeline
+
+mm_pipeline = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")
+mm_pipeline("https://huggingface.co/spaces/llava-hf/llava-4bit/resolve/main/examples/baklava.png", "How to make this pastry?")
+
+## [{'generated_text': 'To create these pastries, you will need a few key ingredients and tools. Firstly, gather the dough by combining flour with water in your mixing bowl until it forms into an elastic ball that can be easily rolled out on top of another surface or table without breaking apart (like pizza).'}]
+```
+
 ### OCR
 
 This code snippet uses Microsoft’s TrOCR, an encoder-decoder model consisting of an image Transformer encoder and a text Transformer decoder for state-of-the-art optical character recognition (OCR) on single-text line images.
````
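A note on the snippet added above: depending on the installed transformers version, the question may need to be passed through the pipeline's `prompt` keyword and wrapped in LLaVA's chat format rather than given as a bare positional string. A minimal sketch of that variant, reusing the model and image URL from the diff:

```python
from transformers import pipeline

# Minimal sketch of the snippet added in this diff. Assumes a transformers
# release whose image-to-text pipeline accepts a `prompt` keyword argument.
mm_pipeline = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")

# LLaVA 1.5 checkpoints expect the chat format below, with an <image>
# placeholder marking where the image embedding is inserted.
result = mm_pipeline(
    "https://huggingface.co/spaces/llava-hf/llava-4bit/resolve/main/examples/baklava.png",
    prompt="USER: <image>\nHow to make this pastry?\nASSISTANT:",
)
print(result[0]["generated_text"])
```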
packages/tasks/src/tasks/image-to-text/data.ts (CHANGED)
```diff
@@ -42,6 +42,10 @@ const taskData: TaskDataCustom = {
 			description: "A strong optical character recognition model.",
 			id: "facebook/nougat-base",
 		},
+		{
+			description: "A powerful model that lets you have a conversation about the image.",
+			id: "llava-hf/llava-1.5-7b-hf",
+		},
 	],
 	spaces: [
 		{
```
packages/tasks/src/tasks/index.ts (CHANGED)
```diff
@@ -51,8 +51,8 @@ export const TASKS_MODEL_LIBRARIES: Record<PipelineType, ModelLibraryKey[]> = {
 	"graph-ml": ["transformers"],
 	"image-classification": ["keras", "timm", "transformers", "transformers.js"],
 	"image-segmentation": ["transformers", "transformers.js"],
-	"image-to-image": ["diffusers", "transformers.js"],
-	"image-to-text": ["transformers.js"],
+	"image-to-image": ["diffusers", "transformers", "transformers.js"],
+	"image-to-text": ["transformers", "transformers.js"],
 	"image-to-video": ["diffusers"],
 	"video-classification": ["transformers"],
 	"mask-generation": ["transformers"],
```
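With this change, `transformers` is listed as a supported library for both `image-to-image` and `image-to-text`. As a rough sanity check of the former mapping, here is a sketch assuming a transformers release that ships the image-to-image pipeline (e.g. with Swin2SR support); the checkpoint below is an illustrative assumption, not something named in this commit:

```python
from transformers import pipeline
from PIL import Image
import numpy as np

# Sanity-check sketch for the task -> library mappings added above.
# Assumes a transformers release that ships an image-to-image pipeline;
# the Swin2SR checkpoint is an illustrative choice.
upscaler = pipeline("image-to-image", model="caidas/swin2SR-classical-sr-x2-64")

# A small dummy image is enough to confirm the pipeline resolves and runs.
dummy = Image.fromarray(np.zeros((64, 64, 3), dtype=np.uint8))
upscaled = upscaler(dummy)
print(upscaled.size)  # Swin2SR x2 should roughly double each dimension
```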
packages/tasks/src/tasks/text-generation/about.md (CHANGED)
```diff
@@ -42,6 +42,10 @@ When it comes to text generation, the underlying language model can come in seve
 
 - **Human feedback models:** these models extend base and instruction-trained models by incorporating human feedback that rates the quality of the generated text according to criteria like [helpfulness, honesty, and harmlessness](https://arxiv.org/abs/2112.00861). The human feedback is then combined with an optimization technique like reinforcement learning to align the original model to be closer with human preferences. The overall methodology is often called [Reinforcement Learning from Human Feedback](https://huggingface.co/blog/rlhf), or RLHF for short. [Llama2-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) is an open-source model aligned through human feedback.
 
+## Text Generation from Image and Text
+
+Some language models can take both text and an image as input and generate text; these are called vision language models. [LLaVA](https://huggingface.co/llava-hf/llava-1.5-7b-hf) and [BLIP-2](https://huggingface.co/Salesforce/blip2-opt-2.7b) are good examples. They accept the same generation parameters as other language models, but since they also take images as input, you use them with the `image-to-text` pipeline. You can find more information on the [image-to-text](https://huggingface.co/tasks/image-to-text) task page.
+
 ## Inference
 
 You can use the 🤗 Transformers library `text-generation` pipeline to do inference with Text Generation models. It takes an incomplete text and returns multiple outputs with which the text can be completed.
```
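The new section's claim that vision language models accept the usual generation parameters can be illustrated with a short sketch. It assumes the `prompt` and `generate_kwargs` arguments of the `image-to-text` pipeline are available in the installed transformers version; the sampling settings are illustrative, not part of the commit:

```python
from transformers import pipeline

# Sketch for the section added above: a vision language model driven through
# the image-to-text pipeline with ordinary text-generation parameters.
# The sampling values are illustrative assumptions, not from the commit.
vlm = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")
output = vlm(
    "https://huggingface.co/spaces/llava-hf/llava-4bit/resolve/main/examples/baklava.png",
    prompt="USER: <image>\nDescribe this dish in one sentence.\nASSISTANT:",
    generate_kwargs={"max_new_tokens": 50, "do_sample": True, "temperature": 0.7},
)
print(output[0]["generated_text"])
```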