LLM-Assisted Pipeline for Scraping and Validating Image-Based Datasets

Bruns, Benjamin

Poster (After Call)

FZJ-2026-02014

Bruns, B. (FZJ)*

2026

deRSE26 - 6th conference for Research Software Engineering, Stuttgart, Germany, 3 Mar 2026 - 5 Mar 2026

Abstract: The performance of deep neural networks in computer vision tasks depends heavily on access to large, diverse, and well-documented datasets. However, such datasets are often scattered across the web and lack standardized formats, which makes automated integration into training pipelines challenging. This work presents a prototype pipeline that addresses this issue by targeting the datasets listed in a semi-structured format on CVonline’s “Image Databases” page (https://homepages.inf.ed.ac.uk/rbf/CVonline/Imagedbase.htm). The pipeline features a modular web scraping system that extracts dataset links, documentation, and metadata. To address the heterogeneity and sparsity of the metadata, large language models (LLMs) are employed to semantically parse descriptions, infer missing attributes, and classify datasets into high-level categories such as classification, segmentation, detection, and pose estimation. The output is a structured, validated dataset catalog that is enriched with inferred metadata and categorized for downstream use. Instead of redistributing the datasets, the system provides detailed metadata and download instructions. Additionally, a secondary LLM generates custom download scripts for datasets identified as directly accessible. In a preliminary evaluation, more than 50 datasets were identified as automatically downloadable, well documented, and suitable for machine learning workflows. Several have already been integrated into training pipelines with almost no modification of the downloaded content.
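
The scrape, classify, and validate flow described in the abstract can be pictured with a minimal Python sketch. It is not the poster's implementation: the names CatalogEntry, LinkExtractor, validate_category, build_catalog, and keyword_classifier are hypothetical, the "unknown" fallback category is an assumption, and the LLM step is replaced by a naive keyword matcher so that the example runs without any model or network access.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Task categories named in the abstract; "unknown" is an assumed fallback.
CATEGORIES = {"classification", "segmentation", "detection", "pose estimation"}


@dataclass
class CatalogEntry:
    """One row of the (hypothetical) structured catalog."""
    name: str
    url: str
    category: str = "unknown"


class LinkExtractor(HTMLParser):
    """Collect (anchor text, href) pairs for absolute http(s) links."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Skip in-page anchors and relative links.
            if href and href.startswith(("http://", "https://")):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            name = " ".join("".join(self._text).split())
            if name:
                self.links.append((name, self._href))
            self._href = None


def validate_category(label):
    """Accept a proposed label only if it is a known category."""
    label = label.strip().lower()
    return label if label in CATEGORIES else "unknown"


def build_catalog(html, classify):
    """Scrape links from html and attach a validated category to each."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return [CatalogEntry(n, u, validate_category(classify(n))) for n, u in parser.links]


def keyword_classifier(description):
    """Hypothetical stand-in for the LLM call: naive keyword matching."""
    text = description.lower()
    for category in sorted(CATEGORIES):
        if category in text:
            return category
    return "no idea"  # deliberately invalid, to exercise validation
```

Passing the classifier in as a function keeps the validation step independent of whichever model produces the label, so free-form model output that falls outside the allowed categories is mapped to the fallback instead of entering the catalog.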

Contributing Institute(s):

Datenanalyse und Maschinenlernen (IAS-8)

Research Program(s):

5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)

Appears in the scientific report 2026

The record appears in these collections:
Document types > Presentations > Poster
Institute collections > IAS > IAS-8
Workflow collections > Public entries
Publication database

Record created 2026-03-11, last modified 2026-03-12
