Poster (After Call) FZJ-2026-02014

LLM-Assisted Pipeline for Scraping and Validating Image-Based Datasets



2026

deRSE26 - 6th conference for Research Software Engineering, Stuttgart, Germany, 3 Mar 2026 - 5 Mar 2026

Abstract: The performance of deep neural networks in computer vision tasks depends heavily on access to large, diverse, and well-documented datasets. However, such datasets are often scattered across the web and lack standardized formats, making automated integration into training pipelines challenging. This work presents a prototype pipeline designed to address this issue by targeting datasets listed in a semi-structured format on CVonline’s “Image Databases” page (https://homepages.inf.ed.ac.uk/rbf/CVonline/Imagedbase.htm). The pipeline features a modular web scraping system that extracts dataset links, documentation, and metadata. To address the heterogeneity and sparsity of metadata, large language models (LLMs) are employed to semantically parse descriptions, infer missing attributes, and classify datasets into high-level categories such as classification, segmentation, detection, and pose estimation. The output is a structured, validated dataset catalog enriched with inferred metadata and categorized for downstream use. Instead of redistributing the datasets, the system provides detailed metadata and download instructions. Additionally, a secondary LLM generates custom download scripts for datasets identified as directly accessible. Preliminary evaluation demonstrates promising results: over 50 datasets were successfully identified as automatically downloadable, well-documented, and suitable for machine learning workflows. Several have already been integrated into training pipelines with almost no modifications to the downloaded content.
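
The scrape, classify, and validate stages described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the poster: the sample HTML is invented and does not reproduce the CVonline markup, the names DatasetEntry, ListingParser, classify, and build_catalog are hypothetical, and a simple keyword match stands in for the LLM-based classification so that the example runs offline.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Invented listing snippet (assumption): NOT the real CVonline page markup.
SAMPLE_HTML = """
<ul>
  <li><a href="https://example.org/street-seg">Street Scenes</a> - pixel-wise segmentation masks for urban images</li>
  <li><a href="https://example.org/hand-pose">Hand Keypoints</a> - 2D pose estimation annotations</li>
  <li><a href="">Broken Entry</a> - no link given</li>
</ul>
"""

# High-level categories named in the abstract.
CATEGORIES = ("classification", "segmentation", "detection", "pose estimation")


@dataclass
class DatasetEntry:
    name: str = ""
    url: str = ""
    description: str = ""
    category: str = "unknown"


class ListingParser(HTMLParser):
    """Collects one DatasetEntry per <li>: link target, link text, trailing text."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._current = None
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._current = DatasetEntry()
        elif tag == "a" and self._current is not None:
            self._current.url = dict(attrs).get("href") or ""
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "li" and self._current is not None:
            self._current.name = self._current.name.strip()
            self._current.description = self._current.description.strip(" -\n")
            self.entries.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is None:
            return
        if self._in_link:
            self._current.name += data
        else:
            self._current.description += data


def classify(description: str) -> str:
    """Keyword stand-in for the LLM classification step (an assumption)."""
    text = description.lower()
    for category in CATEGORIES:
        if category in text:
            return category
    return "unknown"


def build_catalog(html: str) -> list:
    """Scrape entries, attach a category, and keep only entries that validate."""
    parser = ListingParser()
    parser.feed(html)
    catalog = []
    for entry in parser.entries:
        entry.category = classify(entry.description)
        # Validation: require a name and an http(s) link.
        if entry.name and entry.url.startswith(("http://", "https://")):
            catalog.append(entry)
    return catalog


catalog = build_catalog(SAMPLE_HTML)
for e in catalog:
    print(e.name, "|", e.category, "|", e.url)
```

In a setup closer to the one described, classify would presumably be replaced by a call to a language model, and the validation step would likely also check documentation and download accessibility; the sketch only shows how the stages fit together.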


Contributing Institute(s):
  1. Datenanalyse und Maschinenlernen (IAS-8)
Research Program(s):
  1. 5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)

Appears in the scientific report 2026

The record appears in these collections:
Document types > Presentations > Poster
Institute Collections > IAS > IAS-8
Workflow collections > Public records
Publications database

 Record created 2026-03-11, last modified 2026-03-12

