smolagents can see! We just shipped vision support to smolagents. Agentic computers FTW
You can now:
- let the agent get images dynamically (e.g. an agentic web browser)
- pass images at the init of the agent (e.g. chatting with documents, filling forms automatically, etc.)

with only a few LoC changed! You can use transformers models locally (like Qwen2VL) OR plug in your favorite multimodal inference provider (gpt-4o, Anthropic & co).
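For the "pass images up front" case, a minimal sketch could look like the following. It assumes a recent smolagents release where `agent.run` accepts an `images` list of PIL images, an OpenAI-compatible multimodal endpoint, and an `OPENAI_API_KEY` environment variable. The helper name `describe_images` and its parameters are made up for illustration, so check the calls against the version you have installed.

```python
import os
from typing import Any, List


def describe_images(image_paths: List[str], task: str, model_id: str = "gpt-4o") -> Any:
    """Run a CodeAgent on `task` with the given images attached from the start.

    Hypothetical helper, not part of smolagents itself.
    """
    # Imported lazily so this module can be loaded without the optional packages.
    from PIL import Image
    from smolagents import CodeAgent, OpenAIServerModel

    # Assumption: the provider is reachable through an OpenAI-compatible API.
    model = OpenAIServerModel(model_id=model_id, api_key=os.environ["OPENAI_API_KEY"])

    # No extra tools: the agent only reasons over the task text and the images.
    agent = CodeAgent(tools=[], model=model)

    # The agent expects PIL images; convert to RGB to avoid mode surprises.
    images = [Image.open(path).convert("RGB") for path in image_paths]
    return agent.run(task, images=images)
```

Calling `describe_images(["invoice.png"], "Summarize this document.")` would then send the image along with the task (the file name is a placeholder). Swapping `OpenAIServerModel` for a local transformers-backed model class should leave the rest unchanged, which is the "few LoC" point made above.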
Multimodal
> ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released multimodal textbook: 22k hours' worth of samples from instruction videos
> Dataset: SciCap, a benchmark dataset for captioning scientific documents, is released along with the challenge!
Embeddings
> @MoritzLaurer released a zero-shot version of ModernBERT large
> KaLM is a new family of performant multilingual embedding models with MIT license, built using Qwen2-0.5B
Image/Video Generation
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models generating worlds from images, videos and texts
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m, a Cosmos-tokenized dataset with samples from OpenVid-1M
Others
> Prior Labs released TabPFNv2, the best tabular transformer for classification and regression
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding
Revolutionize Your Video Creation: Dokdo Multimodal AI
Transform a single image into a stunning video with perfect audio harmony!
Superior Technology
- Advanced Flow Matching: Smoother video transitions surpassing Kling and Sora
- Intelligent Sound System: Automatically generates perfect audio by analyzing video mood
- Multimodal Framework: Advanced AI integrating image, text, and audio analysis

Outstanding Performance
- Ultra-High Resolution: 4K video quality with bfloat16 acceleration
- Real-Time Optimization: 3x faster processing with PyTorch GPU acceleration
- Smart Sound Matching: Real-time audio effects based on scene transitions and motion

Exceptional Features
- Custom Audio Creation: Natural soundtrack matching video tempo and rhythm
- Intelligent Watermarking: Adaptive watermark adjusting to video characteristics
- Multilingual Support: Precise translation engine powered by Helsinki-NLP

Versatile Applications
- Social Media Marketing: Create engaging shorts for Instagram and YouTube
- Product Promotion: Dynamic promotional videos highlighting product features
- Educational Content: Interactive learning materials with enhanced engagement
- Portfolio Enhancement: Professional-grade videos showcasing your work

Experience the video revolution with Dokdo Multimodal, where anyone can create professional-quality content from a single image. Elevate your content with perfectly synchronized video and audio that captivates your audience!
Start creating stunning videos that stand out from the crowd - whether you're a marketer, educator, content creator, or business owner. Join the future of AI-powered video creation today!
Stoked to release the latest iteration of our MilkDropLM project! This new release is based on the powerful Qwen2.5-Coder-32B-Instruct model using the same great dataset that powered our 7b model.
What's new?
- Genome Unlocked: Deeper understanding of preset relationships for more accurate and creative generations.
- Preset Revival: Breathe new life into old presets with our upgraded model!
- Loop-B-Gone: Say goodbye to pesky loops and hello to smooth generation.
- Natural Chats: Engage in more natural-sounding conversations with our LLM than ever before.
Released under Apache 2.0, because sharing is caring!
Shoutout to @superwatermelon for his invaluable insights and collab, and to all the courageous community members who tested earlier versions and provided feedback!