- FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
  Paper • 2506.20920 • Published • 76
- HuggingFaceFW/finewiki
  Viewer • Updated • 61.6M • 8.77k • 273
- nhagar/fineweb_urls
  Viewer • Updated • 24.5B • 238 • 1
- PleIAs/common_corpus
  Viewer • Updated • 470M • 43.2k • 330
code
codenmhf
AI & ML interests
None yet
Recent Activity
liked a dataset about 2 months ago: Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset
liked a dataset about 2 months ago: HuggingFaceFW/fineweb-edu
liked a dataset about 2 months ago: mlfoundations/dclm-baseline-1.0
Organizations
None yet