Guilherme Penedo
guipenedo
Recent Activity
new activity
about 6 hours ago
HuggingFaceFW/fineweb: Downloading the 350BT sample uses 990GB of disk space
new activity
about 6 hours ago
HuggingFaceFW/fineweb: Create Ffcc
upvoted an article
about 9 hours ago
Finding Moroccan Arabic (Darija) in Fineweb 2
guipenedo's activity
Downloading the 350BT sample uses 990GB of disk space
4
#57 opened 26 days ago by ddh0
Create Ffcc
1
#58 opened 1 day ago by Ricky23184
Update 2025/2025-01-22-Torstar.md
#4 opened 12 days ago by guipenedo
New update returns a 500 server error using the datasets-server API
6
#18 opened about 2 months ago by jonna32
Synthetic Data Generator
1
#5 opened about 1 month ago by kishorekashyap
Cannot load with datasets
3
#4 opened about 1 month ago by mbanon
A lot of load errors after new update
14
#19 opened about 1 month ago by yzhangcs
Add "date" column to "default" subset
#20 opened about 1 month ago by lhoestq
Simple exact deduplication removes 2/3 of data.
4
#49 opened 6 months ago by egor-pakhomov
Torrent?
3
#4 opened 10 months ago by emilss
Any plan to train models on larger subset of dataset?
1
#8 opened 10 months ago by mrfakename
Are copyrighted works included in this dataset?
4
#9 opened 10 months ago by umm-maybe
Reprocessing for a new language
14
#12 opened 10 months ago by pere
Training configs for data ablation study
2
#14 opened 10 months ago by jimmyhbx
tiny-fineweb
3
#19 opened 10 months ago by 3thn
Unsafe files
1
#25 opened 9 months ago by alielfilali01
"Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20" using fineweb by Karpathy
#28 opened 9 months ago by clem
Regarding to the newly updated indexes (written as deduplication issues)
5
#29 opened 8 months ago by kimcando
Language subset
3
#33 opened 8 months ago by talmor