| LLM-Assisted Pipeline for Scraping and Validating Image-Based Datasets |
| Poster (After Call) | FZJ-2026-02014 |
2026
Abstract: The performance of deep neural networks in computer vision tasks depends heavily on access to large, diverse, and well-documented datasets. However, such datasets are often scattered across the web and lack standardized formats, which makes automated integration into training pipelines challenging.

This work presents a prototype pipeline that addresses this issue by targeting the datasets listed in a semi-structured format on CVonline’s “Image Databases” page (https://homepages.inf.ed.ac.uk/rbf/CVonline/Imagedbase.htm). The pipeline features a modular web scraping system that extracts dataset links, documentation, and metadata. To cope with heterogeneous and sparse metadata, large language models (LLMs) are employed to semantically parse descriptions, infer missing attributes, and classify datasets into high-level categories such as classification, segmentation, detection, and pose estimation.

The output is a structured, validated dataset catalog that is enriched with inferred metadata and categorized for downstream use. Instead of redistributing the datasets, the system provides detailed metadata and download instructions. Additionally, a secondary LLM generates custom download scripts for datasets identified as directly accessible.

A preliminary evaluation shows promising results: over 50 datasets were identified as automatically downloadable, well-documented, and suitable for machine learning workflows. Several have already been integrated into training pipelines with almost no modifications to the downloaded content.
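The scrape-then-classify flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `DatasetRecord`, `LinkExtractor`, `extract_records`, `build_prompt`, `classify`, and `keyword_stub` are hypothetical, the LLM is modeled as a plain callable that maps a prompt string to a reply string, and `keyword_stub` is only a keyword-matching stand-in so the example runs without any model. The category list is taken from the abstract; the validation step simply rejects replies outside that list.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# High-level categories named in the abstract.
CATEGORIES = ("classification", "segmentation", "detection", "pose estimation")


@dataclass
class DatasetRecord:
    """One catalog entry (hypothetical schema, for illustration only)."""
    name: str
    url: str
    description: str = ""
    category: str = "unknown"


class LinkExtractor(HTMLParser):
    """Collects (anchor text, href) pairs for absolute http(s) links."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Skip in-page anchors and relative links.
            if href and href.startswith(("http://", "https://")):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = " ".join("".join(self._text).split())
            self.links.append((name, self._href))
            self._href = None


def extract_records(html):
    """Turn the links found in an HTML string into catalog records."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return [DatasetRecord(name=n, url=u) for n, u in parser.links if n]


def build_prompt(record):
    """Build a single-label classification prompt for one record."""
    return (
        "Assign exactly one category from: " + ", ".join(CATEGORIES) + ".\n"
        "Dataset: " + record.name + "\n"
        "Description: " + record.description
    )


def classify(record, llm):
    """Query the LLM callable and accept only replies from CATEGORIES."""
    answer = llm(build_prompt(record)).strip().lower()
    record.category = answer if answer in CATEGORIES else "unknown"
    return record


def keyword_stub(prompt):
    """Stand-in for a real LLM: keyword match on the dataset-specific lines."""
    # The first prompt line lists every category, so it is skipped.
    body = prompt.split("\n", 1)[1].lower()
    for category in CATEGORIES:
        if category in body:
            return category
    return "none of these"
```

In a real setup, `keyword_stub` would be replaced by a call to an actual model, and the HTML string would come from fetching the listing page. Keeping the LLM behind a plain callable makes the validation step testable offline.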