Ljubomir Josifovski

ljupco

https://ljubomirj.github.io/

AI & ML interests

Now - systematic trading, research & development. Prior - speech recognition in noise, speech synthesis, machine learning.

ljupco's activity

liked a model about 17 hours ago

perplexity-ai/r1-1776

Updated about 19 hours ago • 475 • 693

liked a model 5 days ago

NousResearch/DeepHermes-3-Llama-3-8B-Preview-GGUF

Updated 3 days ago • 14.3k • 57

liked a model 8 days ago

ProsusAI/finbert

Text Classification • Updated May 23, 2023 • 1.44M • 785

liked a model 9 days ago

tomg-group-umd/huginn-0125

Text Generation • Updated 2 days ago • 8.77k • 212

liked a model 11 days ago

simplescaling/s1-32B

Text Generation • Updated 8 days ago • 9.28k • 274

liked a model 12 days ago

NovaSky-AI/Sky-T1-32B-Preview

Text Generation • Updated Jan 13 • 14k • 533

upvoted a collection 13 days ago

Hibiki fr-en

Collection

Hibiki is a model for streaming speech translation, which can run on device! See https://github.com/kyutai-labs/hibiki. • 5 items • Updated 13 days ago • 48

liked a model 13 days ago

TheBloke/Wizard-Vicuna-30B-Uncensored-GGUF

Updated Sep 27, 2023 • 6.31k • 49

liked 3 models 19 days ago

liked a Space 22 days ago

HuggingDiscussions

🏢

Join discussions on Hugging Face Hub

liked a model 26 days ago

HKUSTAudio/Llasa-3B

Text-to-Speech • Updated 6 days ago • 8.71k • 449

reacted to tomaarsen's post with ❤️ about 1 month ago

Post

4606

🏎️ Today I'm introducing a method to train static embedding models that run 100x to 400x faster on CPU than common embedding models, while retaining 85%+ of the quality! Including 2 fully open models: training scripts, datasets, metrics.

We apply our recipe to train 2 Static Embedding models that we release today! We release:
2️⃣ an English Retrieval model and a general-purpose Multilingual similarity model (e.g. classification, clustering, etc.), both Apache 2.0
🧠 my modern training strategy: ideation -> dataset choice -> implementation -> evaluation
📜 my training scripts, using the Sentence Transformers library
📊 my Weights & Biases reports with losses & metrics
📕 my list of 30 training and 13 evaluation datasets

The 2 Static Embedding models have the following properties:
🏎️ Extremely fast, e.g. 107500 sentences per second on a consumer CPU, compared to 270 for 'all-mpnet-base-v2' and 56 for 'gte-large-en-v1.5'
0️⃣ Zero active parameters: No Transformer blocks, no attention, not even a matrix multiplication. Super speed!
📏 No maximum sequence length! Embed texts at any length (note: longer texts may embed worse)
📐 Linear instead of quadratic complexity: 2x longer text takes 2x longer, instead of 2.5x or more.
🪆 Matryoshka support: allows you to truncate embeddings with minimal performance loss (e.g. 4x smaller with a 0.56% perf. decrease for English Similarity tasks)
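
To make the "no Transformer blocks, no attention" property concrete, here is a minimal sketch of how a static embedding model can produce a sentence vector: look up one fixed vector per token and average them, which takes a single pass over the text. The 4-dimensional table, the whitespace tokenizer, and the names `TOKEN_TABLE` and `embed` are illustrative assumptions, not the released models' code; the actual models use a trained table and a real tokenizer.

```python
from typing import Dict, List

# Hypothetical toy token table (assumption); real models learn it during training.
TOKEN_TABLE: Dict[str, List[float]] = {
    "fast": [1.0, 0.0, 0.0, 0.0],
    "cpu": [0.0, 1.0, 0.0, 0.0],
    "embedding": [0.0, 0.0, 1.0, 0.0],
    "model": [0.0, 0.0, 0.0, 1.0],
}
DIM = 4


def embed(text: str) -> List[float]:
    """Average the static vectors of all known tokens in `text`."""
    total = [0.0] * DIM
    count = 0
    for token in text.lower().split():  # toy whitespace tokenizer (assumption)
        vector = TOKEN_TABLE.get(token)
        if vector is None:
            continue  # this sketch simply skips out-of-vocabulary tokens
        for i, value in enumerate(vector):
            total[i] += value
        count += 1
    if count == 0:
        return total  # no known tokens: return the zero vector
    return [value / count for value in total]


if __name__ == "__main__":
    print(embed("fast cpu"))
```

Because each token costs one lookup and one addition, the work grows in direct proportion to the number of tokens, which is where the linear scaling comes from.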

Check out the full blogpost if you'd like to 1) use these lightning-fast models or 2) learn how to train them with consumer-level hardware: https://huggingface.co/blog/static-embeddings

The blogpost contains a lengthy list of possible advancements; I'm very confident that our 2 models are only the tip of the iceberg, and we may be able to get even better performance.

Alternatively, check out the models:
* sentence-transformers/static-retrieval-mrl-en-v1
* sentence-transformers/static-similarity-mrl-multilingual-v1

1 reply

reacted to tomaarsen's post with ❤️ about 1 month ago

Post

3007

That didn't take long! Nomic AI has finetuned the new ModernBERT-base encoder model into a strong embedding model for search, classification, clustering and more!

Details:
🤖 Based on ModernBERT-base with 149M parameters.
📊 Outperforms both nomic-embed-text-v1 and nomic-embed-text-v1.5 on MTEB!
🏎️ Immediate FA2 and unpacking support for super efficient inference.
🪆 Trained with Matryoshka support, i.e. 2 valid output dimensionalities: 768 and 256.
➡️ Maximum sequence length of 8192 tokens!
2️⃣ Trained in 2 stages: unsupervised contrastive data -> high quality labeled datasets.
➕ Integrated in Sentence Transformers, Transformers, LangChain, LlamaIndex, Haystack, etc.
🏛️ Apache 2.0 licensed: fully commercially permissible
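
As a rough illustration of what the two Matryoshka output sizes mean in practice, the sketch below keeps only the leading dimensions of an embedding and re-normalizes the result so cosine similarity still behaves as expected. The helper names `truncate_embedding` and `cosine` are hypothetical and written in plain Python; they are not part of any library mentioned in the post.

```python
import math
from typing import List


def truncate_embedding(embedding: List[float], dim: int) -> List[float]:
    """Keep the first `dim` values and rescale them to unit length."""
    if dim <= 0 or dim > len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(value * value for value in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalized
    return [value / norm for value in head]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(truncate_embedding([3.0, 4.0, 12.0], 2))
```

Under these assumptions, going from a 768-dimensional embedding `vec` to the 256-dimensional variant would be `truncate_embedding(vec, 256)`, and both the query and the document embeddings need to be truncated to the same size before comparing them.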

Try it out here: nomic-ai/modernbert-embed-base

Very nice work by Zach Nussbaum and colleagues at Nomic AI.