Sam Heutmaker committed · Commit 6c8e84b · Parent(s): 079a741

update readme

README.md CHANGED

@@ -1,9 +1,55 @@
-
-
-
-
-
-
@@ -18,7 +64,7 @@ The model generates structured, schema-consistent JSON outputs for every video f
-
@@ -64,7 +110,7 @@ FP8 quantization showed no measurable quality degradation compared to bf16 preci
-
@@ -84,7 +130,7 @@ ClipTagger-12b offers **15x cost savings** compared to GPT-4.1 and **17x cost sa
- **[Run

---
language:
- en
license: apache-2.0
tags:
- vision-language-model
- video-understanding
- image-captioning
- gemma
- fp8
- json-mode
- structured-output
- video-analysis
- content-moderation
- accessibility
base_model: google/gemma-12b
pipeline_tag: image-text-to-text
widget:
- example_title: Video Frame Analysis
  messages:
  - role: system
    content: You are an image annotation API trained to analyze YouTube video keyframes. You will be given instructions on the output format, what to caption, and how to perform your job. Follow those instructions. For descriptions and summaries, provide them directly and do not lead them with 'This image shows' or 'This keyframe displays...', just get right into the details.
  - role: user
    content: "[Image of a nature scene] Analyze this frame and return structured JSON output."
model-index:
- name: ClipTagger-12b
  results:
  - task:
      type: image-to-text
      name: Video Frame Captioning
    metrics:
    - name: Average Judge Score
      type: quality
      value: 3.53
    - name: ROUGE-1
      type: rouge-1
      value: 0.674
    - name: ROUGE-L
      type: rouge-l
      value: 0.520
    - name: BLEU
      type: bleu
      value: 0.267
---

# ClipTagger-12b



## Model Description

**ClipTagger-12b** is a 12-billion parameter vision-language model (VLM) designed for video understanding at massive scale. Developed by [Inference.net](https://inference.net) in collaboration with [Grass](https://grass.io), this model was created to meet the demanding requirements of trillion-scale video frame captioning workloads.

The model generates structured, schema-consistent JSON outputs for every video frame, making it ideal for building searchable video databases, content moderation systems, and accessibility tools. It maintains temporal consistency across frames while delivering frontier-quality performance at a fraction of the cost of closed-source alternatives.
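
The exact output schema is defined by the required prompts later in this card. As a rough sketch only (the field names below are illustrative placeholders, not the model's actual schema), a per-frame generation can be parsed and checked for schema consistency like this:

```python
import json

# Illustrative only: the real schema is specified by the required prompts in
# this model card. These keys are placeholders for a per-frame record.
EXPECTED_KEYS = {"description", "objects", "actions", "summary"}

def parse_frame_annotation(raw: str) -> dict:
    """Parse one generation and verify it is schema-consistent JSON."""
    record = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    missing = EXPECTED_KEYS - record.keys()
    if missing:
        raise ValueError(f"annotation missing keys: {sorted(missing)}")
    return record

# A hand-written example generation for a nature-scene keyframe.
raw_output = (
    '{"description": "A river winds through a pine forest at dusk.", '
    '"objects": ["river", "pine trees"], "actions": [], '
    '"summary": "Calm nature scene with no people present."}'
)
print(parse_frame_annotation(raw_output)["summary"])
```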

## Architecture

ClipTagger-12b is based on the Gemma-12B architecture and has been optimized with FP8 quantization for maximum throughput on modern GPUs. The model is specifically tuned for RTX 40-series and H100 GPUs, leveraging native FP8 support for efficient inference.
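
As a minimal local-inference sketch (an assumption, not an official recipe from this card: it presumes a vLLM build with FP8 and multimodal support for this model family, and uses a hypothetical Hub repo id), the FP8 checkpoint could be served like this:

```python
# Local-serving sketch, not an official recipe: assumes a vLLM build with FP8
# and multimodal support for this model family. The repo id below is a
# hypothetical placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="inference-net/ClipTagger-12b",  # placeholder Hub repo id
    quantization="fp8",                    # use the native FP8 weights
    max_model_len=4096,
)

messages = [
    # Use the full required system prompt from this card; truncated here for brevity.
    {"role": "system", "content": "You are an image annotation API trained to analyze YouTube video keyframes."},
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/frame_0001.jpg"}},
            {"type": "text", "text": "Analyze this frame and return structured JSON output."},
        ],
    },
]

outputs = llm.chat(messages, SamplingParams(temperature=0.0, max_tokens=512))
print(outputs[0].outputs[0].text)  # schema-consistent JSON for the frame
```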

### Technical Specifications
- **Parameters**: 12 billion

## Cost Comparison

ClipTagger-12b delivers frontier-quality performance at a fraction of the cost of closed-source alternatives. Based on typical usage patterns (700 input tokens and 250 output tokens per generation), here's how the costs compare:

<img src="./assets/cost.png" alt="Cost Comparison Per 1 Million Generations" width="100%" />
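
As a back-of-the-envelope check of the chart above, the per-generation cost math at this usage profile looks like the following; the per-million-token prices are placeholders, not published rates:

```python
# Rough cost arithmetic for 1M generations at the usage profile above
# (700 input + 250 output tokens per generation). The prices are placeholders;
# substitute each model's actual $ per 1M input/output tokens.
def cost_per_million_generations(input_price_per_mtok: float,
                                 output_price_per_mtok: float,
                                 input_tokens: int = 700,
                                 output_tokens: int = 250,
                                 generations: int = 1_000_000) -> float:
    input_cost = generations * input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = generations * output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost

# Example with made-up rates: $0.30 / 1M input tokens, $0.50 / 1M output tokens.
print(f"${cost_per_million_generations(0.30, 0.50):,.2f} per 1M generations")
```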

For production deployments, we recommend using our managed API service, which includes advanced features like batch processing, webhooks, and automatic scaling:

**[Run ClipTagger-12b via Inference.net API →](https://inference.net/use-cases/video-understanding)**
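
As a client-side sketch (assuming the managed API is OpenAI-compatible; the base URL, model slug, and environment variable below are placeholders, so follow the linked docs for the real values and the full required prompts), a single frame could be annotated like this:

```python
# Hypothetical client sketch: assumes an OpenAI-compatible chat-completions
# endpoint. Base URL, model slug, and env var are placeholders; see the linked
# docs for the real values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.net/v1",  # placeholder base URL
    api_key=os.environ["INFERENCE_API_KEY"],  # placeholder env var
)

response = client.chat.completions.create(
    model="cliptagger-12b",  # placeholder model slug
    messages=[
        # Use the full required system prompt from this card; truncated here for brevity.
        {"role": "system", "content": "You are an image annotation API trained to analyze YouTube video keyframes."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/frame_0001.jpg"}},
                {"type": "text", "text": "Analyze this frame and return structured JSON output."},
            ],
        },
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)  # structured JSON annotation
```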

### Required Prompts
|