
Nikita Gryzunov

nikgr

AI & ML interests

Interested in developing classification models (CNNs, etc.) for transcription factor (TF) binding sites (TFBS). Developer of the IBIS Challenge (https://ibis.autosome.org) and member of the GRECO-BIT consortium (https://thegreco.org).

Organizations

Blog-explorers, Social Post Explorers

nikgr's activity

reacted to merterbak's post with 🔥 7 months ago
posted an update 7 months ago
🐦 Do you remember IBIS? Not a fancy bird but the open challenge in Inferring Binding Specificities of unexplored human transcription factors. Check our site (https://ibis.autosome.org/) and have a sip of fresh news below.

👥 More than 100 teams have registered for the challenge, yet only two dozen are using the opportunity to test their models on the Leaderboard. Don't miss the chance to participate in the Leaderboard stage; independently of it, you can still submit a final solution.

🌐 Remember, the training data for the Leaderboard and Final stages are available online, and you are free to mix and match them in any combination.

🌌 For the Leaderboard, we have received 650 submissions of AAA (advanced ML) models and 296 submissions of PWM models (a whopping 6682 PWMs in total).

🚀 For PWMs, the baseline has been left far behind, but some TFs remain tough nuts to crack (see the attached figure 1).

📈 For AAAs, there is a solid improvement over the best-submitted PWMs in A2G, but the G2A discipline remains unpopular (see the attached figure 2). Free hint: this is your chance!

💡 Another free hint: if your model tends to overfit given the limited data for some TFs, don't forget to use reverse-complement and shift augmentations. Also, don't hesitate to use multitarget models, i.e. models predicting the binding of multiple TFs at the same time.
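
A minimal sketch of both augmentations, assuming plain A/C/G/T strings (the helper names are ours, not part of any IBIS starter code):

import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def random_shift(seq, max_shift=10, pad="N"):
    """Shift the sequence left or right by a random offset, padding with N."""
    shift = random.randint(-max_shift, max_shift)
    if shift > 0:
        return pad * shift + seq[:-shift]
    if shift < 0:
        return seq[-shift:] + pad * (-shift)
    return seq

def augment(seq):
    """Reverse-complement with probability 0.5, then apply a random shift."""
    if random.random() < 0.5:
        seq = reverse_complement(seq)
    return random_shift(seq)

print(augment("ACGTACGTAAGGCCTT"))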

💡 Last but not least, try to combine knowledge from all accessible experiment types, especially for the G2A discipline (ChIP-Seq & genomic HT-SELEX), in a single model!
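
To make the multitarget idea concrete, here is a hedged PyTorch sketch: one shared trunk producing one binding logit per TF, trained with a multi-label loss so data from different experiments can feed the same representation (the layer sizes and the 40-TF head are illustrative assumptions, not an official baseline):

import torch
import torch.nn as nn

class MultiTFNet(nn.Module):
    """Shared CNN trunk with one binding logit per TF (multi-label output)."""

    def __init__(self, n_tfs=40):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv1d(4, 128, kernel_size=15, padding=7),  # 4 channels = one-hot A/C/G/T
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                       # global max over positions
            nn.Flatten(),
        )
        self.head = nn.Linear(128, n_tfs)                  # one logit per TF

    def forward(self, x):                                  # x: (batch, 4, seq_len)
        return self.head(self.trunk(x))

model = MultiTFNet()
logits = model(torch.randn(8, 4, 301))                     # toy batch of 301-bp windows
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(8, 40))  # multi-label binding targets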

📣 Finally and importantly, following the requests from the community, we decided to EXTEND the Leaderboard until the final submission deadline.

🗓️ The final submission deadline is also EXTENDED until Aug 15. The final submission form and details will be posted on the IBIS website in the first half of July; follow our Telegram group and mailing list (see the links at https://ibis.autosome.org).
posted an update 10 months ago
🐦 The IBIS Challenge: an open competition in Inferring and predicting transcription factor Binding Specificities: modeling DNA patterns recognized by human regulatory proteins.

🧬 Deciphering human gene regulation is a cornerstone of modern molecular biology and biomedicine. Gene activity is controlled by special regulatory proteins, the transcription factors, which recognize DNA sequence patterns. We invite you to join IBIS in our search for the best method to model binding specificities of yet unexplored human regulatory proteins.

In the challenge, you may use classic methods to represent sequence patterns or any modern approaches 🚀, including decision trees, CNNs, RNNs, LSTMs, and transformers.
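
As a toy illustration of the classic route, a position weight matrix (PWM) can be scored over a sequence with a sliding window (the matrix below is random, purely for illustration):

import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def best_pwm_score(seq, pwm):
    """Maximum sliding-window score of a PWM (shape 4 x width) over a sequence."""
    width = pwm.shape[1]
    scores = [
        sum(pwm[BASE_INDEX[base], pos] for pos, base in enumerate(seq[start:start + width]))
        for start in range(len(seq) - width + 1)
    ]
    return max(scores)

toy_pwm = np.random.randn(4, 8)  # a random 8-bp "motif", not a real one
print(best_pwm_score("ACGTTGCAAGGTACGT", toy_pwm))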

💡 IBIS allows using arbitrary genomic or random sequences to pre-train an artificial neural network or to extract features, and allows using a few existing datasets; see the IBIS documentation (https://ibis.autosome.org/docs/technical_details).

📊 Yet, the main power and opportunity come with a diverse array of experimental data on 40 human regulatory proteins, many of which remained unexplored until now.

🏆 The best methods will be highlighted in the post-challenge high-impact scientific paper 📝, while the winners 🥇of the Primary track of the Final round will be invited to contribute as co-authors.

🌐 Learn more at https://ibis.autosome.org/
🤗 Our article at HF: https://huggingface.co/blog/nikgr/the-ibis-challenge
👥 Organizers - GRECO-BIT & Codebook consortia: https://ibis.autosome.org/docs/about_us
reacted to DmitryRyumin's post with 🤗 10 months ago
🚀🎭🌟 New Research Alert - CVPR 2024 (Avatars Collection)! 🌟 🎭🚀
📄 Title: GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image 🔝

📝 Description: GeneAvatar is a generic approach for editing 3D head avatars based on a single 2D image, applicable to different volumetric representations. The novel expression-aware generative modification model delivers high quality and consistent editing results across multiple viewpoints and emotions.

👥 Authors: Chong Bao et al.

📅 Conference: CVPR, Jun 17-21, 2024 | Seattle WA, USA 🇺🇸

🔗 Paper: GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image (2404.02152)

🌐 Github Page: https://zju3dv.github.io/geneavatar/
📁 Repository: https://github.com/zju3dv/GeneAvatar

📺 Video: https://www.youtube.com/watch?v=4zfbfPivtVU

📚 More Papers: more cutting-edge research presented at other conferences in the DmitryRyumin/NewEraAI-Papers collection curated by @DmitryRyumin

🚀 Added to the Avatars Collection: DmitryRyumin/avatars-65df37cdf81fec13d4dbac36

🔍 Keywords: #GeneAvatar #HeadAvatar #3DHeadAvatarEditing #VolumetricHeadAvatar #SingleImageEditing #ExpressionAwareModification #CVPR2024 #DeepLearning #Innovation
reacted to clem's post with 🤗 10 months ago
Introducing gretelai/synthetic_text_to_sql by https://huggingface.co/gretelai

It stands as the largest and most diverse synthetic Text-to-SQL dataset available to date.

The dataset includes:

- 105,851 records partitioned into 100,000 train and 5,851 test records
- ~23M total tokens, including ~12M SQL tokens
- Coverage across 100 distinct domains/verticals
- Comprehensive array of SQL tasks: data definition, retrieval, manipulation, analytics & reporting
- Wide range of SQL complexity levels, including subqueries, single joins, multiple joins, aggregations, window functions, set operations
- Database context, including table and view create statements
- Natural language explanations of what the SQL query is doing
- Contextual tags to optimize model training

Blogpost: https://gretel.ai/blog/synthetic-text-to-sql-dataset
Dataset: gretelai/synthetic_text_to_sql
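
A quick, hedged way to pull the dataset down and inspect a record with the datasets library (we just print the record rather than assume specific column names):

from datasets import load_dataset

ds = load_dataset("gretelai/synthetic_text_to_sql")
print(ds)              # expected: DatasetDict with 100,000 train / 5,851 test records
print(ds["train"][0])  # one record with its SQL, database context, explanation, and tags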
reacted to akhaliq's post with ❤️ 10 months ago
Mixture-of-Depths

Dynamically allocating compute in transformer-based language models

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (2404.02258)

Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens (k) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-k routing mechanism. Since k is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional computation techniques. Nevertheless, since the identities of the k tokens are fluid, this method can expend FLOPs non-uniformly across the time and model depth dimensions. Thus, compute expenditure is entirely predictable in sum total, but dynamic and context-sensitive at the token-level. Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPS and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.
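
To make the routing mechanism concrete, here is a rough, simplified sketch of a layer that processes only the top-k tokens and lets the rest pass through the residual stream (our own toy code with assumed sizes, not the authors' implementation; the real method also scales the block output by the router weight so routing stays differentiable):

import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Toy Mixture-of-Depths layer: only the top-k tokens enter the block."""

    def __init__(self, d_model=256, n_heads=8, capacity=0.125):
        super().__init__()
        self.router = nn.Linear(d_model, 1)            # scalar routing score per token
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.capacity = capacity                       # fraction of tokens processed

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        scores = self.router(x).squeeze(-1)            # (batch, seq_len)
        k = max(1, int(self.capacity * x.shape[1]))    # static, known-ahead token budget
        idx = scores.topk(k, dim=1).indices            # which tokens get processed
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, x.shape[-1])
        routed = torch.gather(x, 1, gather_idx)        # (batch, k, d_model)
        processed = self.block(routed)                 # only k tokens pay for attention/MLP
        out = x.clone()                                # everyone else skips via identity
        out.scatter_(1, gather_idx, processed)
        return out

out = MoDBlock()(torch.randn(2, 64, 256))              # toy forward pass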
reacted to SivilTaram's post with 🔥 10 months ago
⚓️ Sailor: A New Multilingual Open LLM for South-East Asia 🌏

Last month we released a new family of multilingual language models called **Sailor**, ranging from 0.5B to 7B parameters and continually pre-trained from the Qwen1.5 models. Based on our extensive benchmarking, the Sailor models demonstrate exceptional performance on South-East Asian languages, taking us one step closer to multilingual LLMs that can serve the diverse needs of the region and beyond.

Today, we're more than excited to share the key technical details behind the Sailor models! 💪

**Key highlights**:
🔍 Data curation: Merging short examples, document-level code-switching, aggressive data cleaning and deduplication.
🤖 Tokenization Robustness: We find that BPE dropout is really effective for dealing with prompt variations (see the sketch after this list).
🔍 Optimizing Data Mixture: We propose a new approach to automatically balance capabilities across different languages!
🌟 Recipe in Continual Pre-training: We discover a powerful metric that can help predict how well the Sailor models will perform on the original domain (e.g., English) after continual pre-training.
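
As a hedged illustration of the BPE dropout point (using the Hugging Face tokenizers library on toy text, not the Sailor training setup): with dropout enabled, merges are randomly skipped at encoding time, so the same prompt can be segmented differently across calls, which makes the model more robust to prompt variations.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a tiny BPE on toy text; dropout=0.1 randomly skips merges during encoding.
tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["the model follows the prompt"] * 1000, trainer=trainer)

print(tokenizer.encode("the model follows the prompt").tokens)  # segmentation can
print(tokenizer.encode("the model follows the prompt").tokens)  # differ between calls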

We are thrilled to share these technical details with the community and invite you to explore the Sailor models. We hope Sailor models take us one step closer to multilingual LLMs in the world! 🌍✨

To learn more, please access our research paper or reach out to our team.
🔗 Paper: Sailor: Open Language Models for South-East Asia (2404.03608)
🧩 Model: sail/sailor-language-models-65e19a749f978976f1959825
💻 Code: https://github.com/sail-sg/sailor-llm
reacted to m-ric's post with ❤️ 10 months ago
[𝐍𝐞𝐰 𝐏𝐚𝐩𝐞𝐫] 𝐀𝐥𝐥 𝐭𝐨𝐤𝐞𝐧𝐬 𝐬𝐡𝐨𝐮𝐥𝐝 𝐧𝐨𝐭 𝐫𝐞𝐪𝐮𝐢𝐫𝐞 𝐭𝐡𝐞 𝐬𝐚𝐦𝐞 𝐞𝐟𝐟𝐨𝐫𝐭 𝐭𝐨 𝐜𝐨𝐦𝐩𝐮𝐭𝐞! ⇒ 𝐌𝐢𝐱𝐭𝐮𝐫𝐞 𝐨𝐟 𝐝𝐞𝐩𝐭𝐡𝐬 🫧🐠

Google researchers were unhappy with the way current decoding generally works: all tokens go through the same layers, thus requiring exactly the same effort to compute.

Whereas in reality, completing the answer to a difficult math problem for instance should be more computationally intense than completing the text of the Declaration of Independence: 𝗻𝗼𝘁 𝗮𝗹𝗹 𝘁𝗼𝗸𝗲𝗻𝘀 𝗮𝗿𝗲 𝗰𝗿𝗲𝗮𝘁𝗲𝗱 𝗲𝗾𝘂𝗮𝗹!

➡️ 𝗧𝗵𝗲𝘆 𝗵𝗮𝗱 𝘁𝗵𝗶𝘀 𝗴𝗲𝗻𝗶𝘂𝘀 𝗶𝗱𝗲𝗮: 💡 𝗵𝗮𝘃𝗶𝗻𝗴 𝗮 𝘁𝗼𝗸𝗲𝗻 𝗴𝗼 𝘁𝗵𝗿𝗼𝘂𝗴𝗵 𝗮 𝗯𝗹𝗼𝗰𝗸 𝘀𝗵𝗼𝘂𝗹𝗱 𝗯𝗲 𝗼𝗽𝘁𝗶𝗼𝗻𝗮𝗹. The token can go through the block (thus undergoing expensive self-attention computation) or avoid it through a skip connection.
The routing decision is taken on the block level: each block selects from the total sequence the top-k tokens that will go through it, and the other tokens will skip it. 𝘛𝘩𝘪𝘴 𝘢𝘭𝘭𝘰𝘸𝘴 𝘵𝘰 𝘤𝘩𝘰𝘰𝘴𝘦 𝘵𝘩𝘦 𝘦𝘹𝘢𝘤𝘵 𝙘𝙖𝙥𝙖𝙘𝙞𝙩𝙮 𝘰𝘧 𝘢 𝘣𝘭𝘰𝘤𝘬, 𝘪.𝘦. 𝘵𝘩𝘦 𝘱𝘳𝘰𝘱𝘰𝘳𝘵𝘪𝘰𝘯 𝘰𝘧 𝘵𝘰𝘬𝘦𝘯𝘴 𝘵𝘩𝘢𝘵 𝘨𝘰 𝘵𝘩𝘳𝘰𝘶𝘨𝘩 𝘪𝘵, 𝘸𝘩𝘪𝘤𝘩 𝘥𝘪𝘳𝘦𝘤𝘵𝘭𝘺 𝘪𝘯𝘧𝘭𝘶𝘦𝘯𝘤𝘦𝘴 𝘵𝘩𝘦 𝘤𝘰𝘮𝘱𝘶𝘵𝘢𝘵𝘪𝘰𝘯𝘢𝘭 𝘪𝘯𝘵𝘦𝘯𝘴𝘪𝘵𝘺 𝘰𝘧 𝘵𝘩𝘦 𝘧𝘰𝘳𝘸𝘢𝘳𝘥 𝘱𝘢𝘴𝘴.

This yields Mixture-of-Depths (MoD), with spectacular results.

✨ 𝗥𝗲𝘀𝘂𝗹𝘁𝘀:
🎚️ 𝗖𝗮𝗽𝗮𝗰𝗶𝘁𝘆 𝗰𝗮𝗻 𝗯𝗲 𝘁𝘂𝗻𝗲𝗱 𝗮𝗹𝗹 𝘁𝗵𝗲 𝘄𝗮𝘆 𝗱𝗼𝘄𝗻 𝘁𝗼 𝟭𝟮.𝟱% for every second block: thus 87.5% of tokens just skip the block!
🚀 For the same training time and performance, >𝟲𝟬% 𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝘀𝗽𝗲𝗲𝗱!
🤝 𝗖𝗮𝗻 𝗯𝗲 𝗰𝗼𝗺𝗯𝗶𝗻𝗲𝗱 𝘄𝗶𝘁𝗵 𝗠𝗶𝘅𝘁𝘂𝗿𝗲-𝗼𝗳-𝗘𝘅𝗽𝗲𝗿𝘁𝘀 for further improvements.

📄 𝗣𝗮𝗽𝗲𝗿 𝗵𝗲𝗿𝗲 👉 Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (2404.02258)
📚 I added it to my paper collection 👉 m-ric/spinning-up-in-llms-659e698f9dd5a71bd3f579a7
reacted to Jaward's post with 🤗 10 months ago
After giving GPU Programming a hands-on try, I have come to appreciate the level of complexity in AI compute:

- Existing/leading frameworks (CUDA, OpenCL, DSLs, even Triton) still leave you at the mercy of low-level compute details that require deeper understanding and experience.
- Ambiguous optimization methods that will literally drive you mad 🤯
- Triton is cool but not cool enough (high-level abstractions that fall back to low-level compute issues as you build more specialized kernels)
- As for CUDA, optimization requires considering all major components of the GPU (DRAM, SRAM, ALUs) 🤕
- Models today require expertly hand-written GPU kernels to reduce storage and compute cost.
- GPTQ was a big save 👍🏼

@karpathy is right: expertise in this area is scarce, and the reason is quite obvious - uncertainty: we are still struggling to get peak performance from multi-connected GPUs while maintaining precision and reducing cost.

May the Scaling Laws favor us lol.
reacted to davanstrien's post with 👍 10 months ago
TIL: since Text Generation Inference supports the Messages API, which is compatible with the OpenAI Chat Completions API, you can trace calls made to Inference Endpoints using Langfuse's OpenAI API integration.

A Hugging Face Pro subscription includes access to many models you want to test when developing an app (https://huggingface.co/blog/inference-pro). Using the endpoint and tracing your generations during this development process is an excellent way for GPU-poor people to bootstrap an initial dataset quickly while prototyping.
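
A hedged sketch of that setup, assuming a TGI-backed Inference Endpoint and Langfuse credentials already in the environment (the endpoint URL and token below are placeholders):

from langfuse.openai import OpenAI  # Langfuse's drop-in OpenAI client, traces every call

client = OpenAI(
    base_url="https://<your-endpoint>.endpoints.huggingface.cloud/v1/",  # placeholder URL
    api_key="hf_xxx",                                                    # your HF token
)

response = client.chat.completions.create(
    model="tgi",  # TGI's Messages API accepts a generic model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)  # the request/response shows up in Langfuse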
reacted to yagilb's post with 🔥 10 months ago
Today we're starting a new initiative: LM Studio Community Models! 🤖

@bartowski, a prolific quantizer (both GGUF and EXL2), will be helping to curate notable new models in LM Studio's Community Models page: https://huggingface.co/lmstudio-community.

Our goal is to ensure the community has access to GGUF files for new & noteworthy models as soon as possible. Keep an eye on that page for updates.

If you're unfamiliar with GGUF, it's the de-facto standard for 'compressed' LLM weights. It is the native format of llama.cpp (https://github.com/ggerganov/llama.cpp), an LLM runtime C/C++ library. This format is supported in LM Studio.

We will also be sharing new models on the LM Studio Discord: https://discord.gg/aPQfnNkxGC
reacted to gsarti's post with ❤️ 10 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models by @sammarks C. Rager @eircjm @belinkov @davidbau @amueller

This work proposes using features and errors from sparse autoencoders trained to reconstruct LM activations as interpretable units for circuit discovery. The authors then introduce SHIFT, a technique for editing model behavior by ablating interpretable elements from sparse feature circuits. This method is applied alongside unsupervised circuit discovery at scale by means of clustering, showing highly interpretable feature circuits interacting to produce behaviors like predicting sequence increments.

I found the experiment in Section 4 especially convincing and exciting in terms of downstream applications: the authors trained a classifier on a biased dataset and showed how a SHIFT intervention in feature space leads to performance matching that of the same model trained on an unbiased data distribution!

📄 Paper: Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (2403.19647)

🔍 All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
reacted to abidlabs's post with 🤗 10 months ago
Introducing the Gradio API Recorder 🪄

Every Gradio app now includes an API recorder that lets you reconstruct your interaction in a Gradio app as code using the Python or JS clients! Our goal is to make Gradio the easiest way to build ML APIs, not just UIs 🔥
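
Roughly the kind of snippet the recorder produces for the Python client (the Space name and api_name below are placeholders, not a real app):

from gradio_client import Client

client = Client("username/my-gradio-app")                        # hypothetical Space
result = client.predict("an example input", api_name="/predict")
print(result)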

reacted to aari1995's post with 🚀 10 months ago
ARABIC CHINESE FRENCH GERMAN RUSSIAN SPANISH TURKISH

mLLM - first release:
orca_dpo_pairs by Intel (translated into 7 languages)

ARABIC CHINESE FRENCH GERMAN RUSSIAN SPANISH TURKISH

Upcoming:
- more datasets
- cleaning steps
- a blogpost
- stay updated at https://hf.co/multilingual

multilingual/orca_dpo_pairs
reacted to akhaliq's post with ❤️ 10 months ago
Advancing LLM Reasoning Generalists with Preference Trees

Advancing LLM Reasoning Generalists with Preference Trees (2404.02078)

We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins more than 13.3%. The strong performance of Eurus can be primarily attributed to UltraInteract, our newly-curated large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks compared to their effectiveness in general conversations. Inspired by this, we derive a novel reward modeling objective which, together with UltraInteract, leads to a strong reward model.
reacted to conceptofmind's post with 🔥 10 months ago
Teraflop AI is excited to help support the Caselaw Access Project and Harvard Library Innovation Lab, in the release of over 6.6 million state and federal court decisions published throughout U.S. history. It is important to democratize fair access to data to the public, legal community, and researchers. This is a processed and cleaned version of the original CAP data.

During the digitization of these texts, OCR errors occurred. We worked to post-process each of the texts for model training, fixing encoding, normalization, repetition, redundancy, parsing, and formatting.

Teraflop AI’s data engine allows for the massively parallel processing of web-scale datasets into cleaned text form.

Link to the processed dataset: https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_Project

The Caselaw Access Project dataset is licensed under the CC0 License.

We plan to release trillions of commercially licensed text tokens, images, audio, videos, and other datasets spanning numerous domains and modalities over the next months. If you are interested in contributing commercially licensed data, be sure to reach out: https://twitter.com/EnricoShippole

Follow us for the next collaborative dataset releases: https://twitter.com/TeraflopAI
reacted to clefourrier's post with 👍 10 months ago
Fun fact about evaluation!

Did you know that, if you evaluate the same model, with the same prompt formatting & the same fixed few-shot examples, only changing
♻️the order in which the few shot examples are added to the prompt ♻️
you get a difference of up to 3 points in evaluation score?

I did a small experiment using some MMLU subsets on the best performing 7B and lower pretrained models from the leaderboard.

I tried 8 different prompting methods (containing more or less information, such as just the question, or Question: question, or Question: question Choices: ..., see the x axis) that are commonly used in evaluation.

I then compared the results for all these methods, in 5-shot, during 2 runs. The *only difference* between the first and second run being that the samples used in few-shot are not introduced in the same order.
For example, run one would have been "A B C D E Current sample", vs, in run 2, "D C E A B Current sample".
All the other experiment parameters stayed exactly the same.

As you can see in the attached picture, you get a difference of up to 3 points between the two few-shot example orderings.

So, when just changing *the order of the few shot samples* can change your results by several points, what is the impact of all other "minimal" and unreported prompting changes?

-> Any kind of model score, provided without an evaluation script for reproducibility, is basically bullshit (or coms).
-> This is why we need reproducible evaluation in a fair and exactly similar setup, using evaluation suites such as lm_eval from the Harness, lighteval from HF, or the Open LLM Leaderboard.
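
For concreteness, a tiny sketch of the only thing that differs between the two runs (the strings stand in for full question/answer few-shot examples):

fewshot = ["A", "B", "C", "D", "E"]            # fixed few-shot examples

def build_prompt(examples, current_sample):
    # identical formatting everywhere; only the example order changes between runs
    return "\n\n".join(examples + [current_sample])

run1 = build_prompt(fewshot, "Current sample")                    # A B C D E, then the sample
run2 = build_prompt(["D", "C", "E", "A", "B"], "Current sample")  # D C E A B, then the sample
# Everything else stays the same, yet scores can differ by up to ~3 points.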
reacted to AlekseyKorshuk's post with 🚀 10 months ago
Happy to share Living Images and the demo video of the product outpainting model behind it 🚀

Send your generation requests in the thread 🧵 or use https://img.coframe.ai
reacted to osanseviero's post with ❤️ 10 months ago
Diaries of Open Source. Part 12 🤗

🚀Alibaba releases Qwen1.5-MoE-A2.7B, an interesting MoE with 2.7B activated parameters and 64 experts
Blog https://qwenlm.github.io/blog/qwen-moe/
Demo: Qwen/qwen1.5-MoE-A2.7B-Chat-demo
Models: https://hf.co/Qwen
GitHub: https://github.com/QwenLM/Qwen1.5

🎵VoiceCraft, SOTA speech editing and text to speech
GitHub: https://github.com/jasonppy/VoiceCraft
Model: pyp1/VoiceCraft

🐍 AI21 Labs releases Jamba, a pretrained SSM-Transformer MoE that allows a large context window (256K) and high throughput
Blog https://www.ai21.com/blog/announcing-jamba
Model ai21labs/Jamba-v0.1

✨ Berkeley releases Starling-LM-7B, an RLHF-ed model, and Starling-RM-34B, a Yi-based reward model that is very good for its size
Starling Beta: Nexusflow/Starling-LM-7B-beta
Starling RM: Nexusflow/Starling-RM-34B

🖥️Stability releases Stable Code Instruct 3B, an instruct model for code generation
Blog: https://stability.ai/news/introducing-stable-code-instruct-3b
Demo: stabilityai/stable-code-instruct-3b
Report: https://stability.ai/s/Stable_Code_TechReport_release.pdf

📚Common Corpus: the largest public domain dataset for training LLMs
Blog: https://hf.co/blog/Pclanglais/common-corpus
Dataset: https://hf.co/collections/PleIAs/common-corpus-65d46e3ea3980fdcd66a5613

Misc:
⚡ GaLore: a very memory-efficient technique that allows pretraining models on consumer GPUs https://hf.co/blog/galore
📈Moirai, foundation models for time series forecasting https://hf.co/collections/Salesforce/moirai-10-r-models-65c8d3a94c51428c300e0742
🔥 Mistral-ORPO-Capybara-7K, a high-quality Mistral fine-tune using ORPO, a new alignment technique kaist-ai/mistral-orpo-capybara-7k
🤯APISR, an anime super-resolution upscaling model HikariDawn/APISR
reacted to Molbap's post with 🔥 10 months ago
🚀🚀 Exciting times for the document AI community!

We're thrilled to announce the release of some of the largest OCR datasets available to the public.
🔥 With over 26 million pages, 18 billion text tokens, and 6TB of data, these resources are a significant leap forward for document AI research.

Here's how to access these datasets quickly:

from datasets import load_dataset

pdfa_dataset = load_dataset('pixparse/pdfa-eng-wds', streaming=True)  # stream samples on the fly
IDL_dataset = load_dataset('pixparse/idl-wds', streaming=True)        # no need to download all 6TB first

This enables you to stream them directly, integrating seamlessly with your projects using the Hugging Face datasets library. On the hub, you can find them here:

pixparse/pdfa-eng-wds
pixparse/idl-wds

For lean data loading, the new [chug](https://github.com/huggingface/chug) library offers a solution with pdf decoding:


import chug

task_cfg = chug.DataTaskDocReadCfg(
    page_sampling='all',
)
data_cfg = chug.DataCfg(
    source='pixparse/pdfa-eng-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
data_loader = chug.create_loader(
    data_cfg,
    task_cfg,
)
sample = next(iter(data_loader))



We owe a huge thank you to Peter Wyatt, Kate Tasker, Rachel Taketa, Ali Furkan Biten, Ruben Tito, and their colleagues for their contributions. Their work putting these datasets together has been invaluable. 🤗

Looking Ahead:

We're on a mission to enhance document AI capabilities, and these datasets are just the beginning. With your engagement and innovation, we're confident in the community's ability to develop robust OCR solutions. We encourage you to explore these datasets, experiment with the code, and contribute to the collective progress in document AI.

For detailed information on usage and licensing, please refer to the dataset cards on the Hugging Face hub.