Poster (After Call) FZJ-2026-02014

LLM-Assisted Pipeline for Scraping and Validating Image-Based Datasets



2026

deRSE26 - 6th conference for Research Software Engineering, Stuttgart, Germany, 3 Mar 2026 - 5 Mar 2026

Abstract: The performance of deep neural networks in computer vision tasks depends heavily on access to large, diverse, and well-documented datasets. However, such datasets are often scattered across the web and lack standardized formats, which makes automated integration into training pipelines challenging. This work presents a prototype pipeline that addresses this issue by targeting datasets listed in a semi-structured format on CVonline’s “Image Databases” page (https://homepages.inf.ed.ac.uk/rbf/CVonline/Imagedbase.htm). The pipeline features a modular web scraping system that extracts dataset links, documentation, and metadata. To handle the heterogeneity and sparsity of this metadata, large language models (LLMs) are employed to semantically parse descriptions, infer missing attributes, and classify datasets into high-level categories such as classification, segmentation, detection, and pose estimation. The output is a structured, validated dataset catalog enriched with inferred metadata and categorized for downstream use. Instead of redistributing the datasets, the system provides detailed metadata and download instructions. Additionally, a secondary LLM generates custom download scripts for datasets identified as directly accessible. Preliminary evaluation shows promising results: over 50 datasets were identified as automatically downloadable, well-documented, and suitable for machine learning workflows. Several have already been integrated into training pipelines with almost no modifications to the downloaded content.
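The scraping and categorization stages described in the abstract can be illustrated with a minimal Python sketch. It does not reproduce the authors' implementation: the class `ListLinkParser`, the record type `DatasetEntry`, the helper functions, and the sample HTML are all hypothetical, and the keyword lookup in `classify` is only a stand-in for the LLM-based classification step, assuming that list entries consist of a link followed by a short description.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://homepages.inf.ed.ac.uk/rbf/CVonline/Imagedbase.htm"

# Hypothetical keyword table; a placeholder for the LLM classification step.
CATEGORY_KEYWORDS = {
    "segmentation": ("segmentation", "mask"),
    "detection": ("detection", "bounding box"),
    "pose estimation": ("pose",),
    "classification": ("classification", "recognition"),
}


@dataclass
class DatasetEntry:
    name: str
    url: str
    description: str = ""
    category: str = "unknown"


class ListLinkParser(HTMLParser):
    """Collect one entry per <li>: its first link plus the remaining text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.entries = []
        self._href, self._name, self._text = None, [], []
        self._in_li = self._in_a = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._flush()  # tolerate list items without a closing tag
            self._in_li = True
        elif tag == "a" and self._in_li and self._href is None:
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._in_a = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_a = False
        elif tag in ("li", "ul", "ol"):
            self._flush()

    def handle_data(self, data):
        if self._in_li:
            (self._name if self._in_a else self._text).append(data)

    def _flush(self):
        # Store the current list item if it contained a link, then reset.
        if self._in_li and self._href:
            name = " ".join("".join(self._name).split())
            desc = " ".join("".join(self._text).split()).strip(" -:")
            self.entries.append(DatasetEntry(name, self._href, desc))
        self._href, self._name, self._text = None, [], []
        self._in_li = self._in_a = False


def classify(description):
    """Return the first category whose keywords occur in the description."""
    text = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "unknown"


def build_catalog(html, base_url=BASE_URL):
    parser = ListLinkParser(base_url)
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up a trailing, unclosed list item
    for entry in parser.entries:
        entry.category = classify(entry.description)
    return parser.entries


# Made-up sample input; not taken from the CVonline page.
SAMPLE = """
<ul>
<li><a href="http://example.org/cars.zip">Example Cars</a> - vehicle detection with bounding box labels</li>
<li><a href="hands/index.html">Example Hands</a> - hand pose annotations</li>
</ul>
"""
catalog = build_catalog(SAMPLE)
```

Each `DatasetEntry` in `catalog` holds a resolved link, a cleaned description, and a coarse category. In a pipeline like the one described, the keyword lookup would be replaced by an LLM call that also infers missing attributes.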


Contributing Institute(s):
  1. Datenanalyse und Maschinenlernen (IAS-8)
Research Program(s):
  1. 5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)

Appears in the scientific report 2026

The record appears in these collections:
Document types > Presentations > Poster
Institute collections > IAS > IAS-8
Workflow collections > Public entries
Publications database

Record created 2026-03-11, last modified 2026-03-12

