LAION-5B: An open large-scale dataset for training next generation image-text models

Schuhmann, Christoph; Katta, Aarush; Kaczmarczyk, Robert; Wortsman, Mitchell; Cherti, Mehdi; Schmidt, Ludwig; Wightman, Ross; Vencu, Richard; Beaumont, Romain; Gordon, Cade; Jitsev, Jenia; Crowson, Katherine; Schramowski, Patrick; Coombes, Theo; Kundurthy, Srivatsa; Mullis, Clayton

TY  - CONF
AU  - Schuhmann, Christoph
AU  - Beaumont, Romain
AU  - Vencu, Richard
AU  - Gordon, Cade
AU  - Wightman, Ross
AU  - Cherti, Mehdi
AU  - Coombes, Theo
AU  - Katta, Aarush
AU  - Mullis, Clayton
AU  - Wortsman, Mitchell
AU  - Schramowski, Patrick
AU  - Kundurthy, Srivatsa
AU  - Crowson, Katherine
AU  - Schmidt, Ludwig
AU  - Kaczmarczyk, Robert
AU  - Jitsev, Jenia
TI  - LAION-5B: An open large-scale dataset for training next generation image-text models
VL  - 35
CY  - Red Hook, NY
PB  - Curran Associates, Inc.
M1  - FZJ-2024-00372
SN  - 9781713871088
T2  - Advances in neural information processing systems
SP  - 25278
EP  - 25294
PY  - 2022
N1  - Also on arXiv: https://doi.org/10.48550/arXiv.2210.08402
AB  - Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on expensive accurate labels used in standard vision unimodal supervised learning. The resulting models showed capabilities of strong text-guided image generation and transfer to downstream tasks, while performing remarkably at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and Imagen made further improvements. Studying the training and capabilities of such models requires datasets containing billions of image-text pairs. Until now, no datasets of this size have been made openly available for the broader research community. To address this problem and democratize research on large-scale multi-modal models, we present LAION-5B - a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language. We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset, and discuss further experiments enabled with an openly available dataset of this scale. Additionally we provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation, and detection scores for watermark, NSFW, and toxic content detection.
T2  - 9781713871088
CY  - 28 Nov 2022 - 9 Dec 2022, New Orleans, Louisiana (USA)
Y2  - 28 Nov 2022 - 9 Dec 2022
M2  - New Orleans, Louisiana, USA
LB  - PUB:(DE-HGF)8 ; PUB:(DE-HGF)7
DO  - DOI:10.34734/FZJ-2024-00372
UR  - https://juser.fz-juelich.de/record/1020896
ER  -
