machineuser committed · 2b1e8c5 · 1 parent: 86c4ad7

Sync widgets demo
packages/tasks/src/tasks/image-to-text/about.md CHANGED
@@ -27,6 +27,19 @@ captioner("https://huggingface.co/datasets/Narsil/image_dummy/resolve/main/parro
 ## [{'generated_text': 'two birds are standing next to each other '}]
 ```
 
+### Conversation about the Image
+
+Some text generation models also take image inputs. These are called vision language models. You can use the `image-to-text` pipeline to run these models, as shown below.
+
+```python
+from transformers import pipeline
+
+mm_pipeline = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")
+mm_pipeline("https://huggingface.co/spaces/llava-hf/llava-4bit/resolve/main/examples/baklava.png", prompt="How to make this pastry?")
+
+## [{'generated_text': 'To create these pastries, you will need a few key ingredients and tools. Firstly, gather the dough by combining flour with water in your mixing bowl until it forms into an elastic ball that can be easily rolled out on top of another surface or table without breaking apart (like pizza).'}]
+```
+
 ### OCR
 
 This code snippet uses Microsoft’s TrOCR, an encoder-decoder model consisting of an image Transformer encoder and a text Transformer decoder for state-of-the-art optical character recognition (OCR) on single-text line images.
packages/tasks/src/tasks/image-to-text/data.ts CHANGED
@@ -42,6 +42,10 @@ const taskData: TaskDataCustom = {
 			description: "A strong optical character recognition model.",
 			id: "facebook/nougat-base",
 		},
+		{
+			description: "A powerful model that lets you have a conversation about the image.",
+			id: "llava-hf/llava-1.5-7b-hf",
+		},
 	],
 	spaces: [
 		{
packages/tasks/src/tasks/index.ts CHANGED
@@ -51,8 +51,8 @@ export const TASKS_MODEL_LIBRARIES: Record<PipelineType, ModelLibraryKey[]> = {
 	"graph-ml": ["transformers"],
 	"image-classification": ["keras", "timm", "transformers", "transformers.js"],
 	"image-segmentation": ["transformers", "transformers.js"],
-	"image-to-image": ["diffusers", "transformers.js"],
-	"image-to-text": ["transformers.js"],
+	"image-to-image": ["diffusers", "transformers", "transformers.js"],
+	"image-to-text": ["transformers", "transformers.js"],
 	"image-to-video": ["diffusers"],
 	"video-classification": ["transformers"],
 	"mask-generation": ["transformers"],
packages/tasks/src/tasks/text-generation/about.md CHANGED
@@ -42,6 +42,10 @@ When it comes to text generation, the underlying language model can come in seve
 
 - **Human feedback models:** these models extend base and instruction-trained models by incorporating human feedback that rates the quality of the generated text according to criteria like [helpfulness, honesty, and harmlessness](https://arxiv.org/abs/2112.00861). The human feedback is then combined with an optimization technique like reinforcement learning to align the original model closer to human preferences. The overall methodology is often called [Reinforcement Learning from Human Feedback](https://huggingface.co/blog/rlhf), or RLHF for short. [Llama2-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) is an open-source model aligned through human feedback.
 
+## Text Generation from Image and Text
+
+There are language models that can take both text and an image as input and output text; these are called vision language models. [LLaVA](https://huggingface.co/llava-hf/llava-1.5-7b-hf) and [BLIP-2](https://huggingface.co/Salesforce/blip2-opt-2.7b) are good examples. They accept the same generation parameters as other language models, but because they also take an image as input, you can use them through the `image-to-text` pipeline. You can find more information on the [image-to-text](https://huggingface.co/tasks/image-to-text) task page.
+
 ## Inference
 
 You can use the 🤗 Transformers library `text-generation` pipeline to do inference with Text Generation models. It takes an incomplete text and returns multiple outputs with which the text can be completed.
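
One practical detail the additions above gloss over: LLaVA-style models expect the text prompt to contain an `<image>` placeholder plus role markers, which the pipeline then fills with image features. A minimal sketch of building such a prompt, assuming the `USER: <image>\n... ASSISTANT:` template from the `llava-hf` model cards; the pipeline call itself is shown only as a comment, since it downloads a ~7B-parameter checkpoint:

```python
def build_llava_prompt(question: str) -> str:
    """Build a LLaVA-1.5 style chat prompt.

    "<image>" marks where the model injects image features;
    "USER:"/"ASSISTANT:" are the role markers the model was
    trained with (format assumed from the llava-hf model cards).
    """
    return f"USER: <image>\n{question} ASSISTANT:"


prompt = build_llava_prompt("How to make this pastry?")

# The actual call would look like this (heavyweight, so not run here):
# from transformers import pipeline
# vlm = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")
# vlm("https://huggingface.co/spaces/llava-hf/llava-4bit/resolve/main/examples/baklava.png",
#     prompt=prompt)
```

Keeping the template in one helper makes it easy to swap in a different format for models such as BLIP-2, which use their own prompt conventions.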