LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
Contribution to a conference proceedings | FZJ-2022-00923
2021
Please use a persistent id in citations: http://hdl.handle.net/2128/30478
Abstract: Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g. CLIP, DALL-E) have recently surged in popularity, showing a remarkable capability to perform zero- or few-shot learning and transfer even in the absence of per-sample labels on target image data. Despite this trend, to date there have been no publicly available datasets of sufficient scale for training such models from scratch. To address this issue, in a community effort we build and release for public use LAION-400M, a dataset of 400 million CLIP-filtered image-text pairs, together with their CLIP embeddings and kNN indices that allow efficient similarity search.
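The abstract names two technical components: CLIP-based filtering of candidate image-text pairs and kNN indices over the CLIP embeddings for similarity search. The sketch below is a minimal illustration of both, not the authors' release pipeline: the model checkpoint, the 0.3 cosine-similarity cutoff, and the helper names (`clip_embed`, `keep_pair`, `build_index`) are assumptions made for this example; in practice one would load the precomputed embeddings and indices shipped with the dataset.

```python
# Illustrative sketch only: CLIP-score filtering of image-text pairs and a
# kNN index over the embeddings. Model choice, threshold, and helper names
# are assumptions, not the dataset's actual build pipeline.
import faiss                      # kNN similarity-search index
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_embed(image: Image.Image, caption: str) -> tuple[np.ndarray, np.ndarray]:
    """Return L2-normalised CLIP embeddings for one image and its caption."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = torch.nn.functional.normalize(img, dim=-1)
    txt = torch.nn.functional.normalize(txt, dim=-1)
    return img[0].numpy(), txt[0].numpy()

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.3) -> bool:
    """CLIP filtering: keep a pair only if image and caption embeddings agree.
    The 0.3 cosine-similarity cutoff is an assumed value for illustration."""
    img, txt = clip_embed(image, caption)
    return float(np.dot(img, txt)) >= threshold

def build_index(embeddings: np.ndarray) -> faiss.IndexFlatIP:
    """kNN index over retained embeddings; inner product equals cosine
    similarity here because the vectors are L2-normalised."""
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(np.ascontiguousarray(embeddings, dtype=np.float32))
    return index

# Usage sketch:
#   index = build_index(kept_image_embeddings)
#   scores, ids = index.search(query_vec[None, :].astype(np.float32), k=10)
```

Because the embeddings are normalised, a flat inner-product index returns cosine similarities directly; at 400M-pair scale an approximate index would typically replace the exhaustive `IndexFlatIP` used here for brevity.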