001018240 001__ 1018240
001018240 005__ 20231128201904.0
001018240 0247_ $$2doi$$a10.48550/ARXIV.2311.04179
001018240 0247_ $$2datacite_doi$$a10.34734/FZJ-2023-04636
001018240 037__ $$aFZJ-2023-04636
001018240 1001_ $$0P:(DE-Juel1)190306$$aSasse, Leonard$$b0$$ufzj
001018240 245__ $$aOn Leakage in Machine Learning Pipelines
001018240 260__ $$barXiv$$c2023
001018240 3367_ $$0PUB:(DE-HGF)25$$2PUB:(DE-HGF)$$aPreprint$$bpreprint$$mpreprint$$s1701175918_23345
001018240 3367_ $$2ORCID$$aWORKING_PAPER
001018240 3367_ $$028$$2EndNote$$aElectronic Article
001018240 3367_ $$2DRIVER$$apreprint
001018240 3367_ $$2BibTeX$$aARTICLE
001018240 3367_ $$2DataCite$$aOutput Types/Working Paper
001018240 520__ $$aMachine learning (ML) provides powerful tools for predictive modeling. ML's popularity stems from the promise of sample-level prediction with applications across a variety of fields from physics and marketing to healthcare. However, if not properly implemented and evaluated, ML pipelines may contain leakage typically resulting in overoptimistic performance estimates and failure to generalize to new data. This can have severe negative financial and societal implications. Our aim is to expand understanding associated with causes leading to leakage when designing, implementing, and evaluating ML pipelines. Illustrated by concrete examples, we provide a comprehensive overview and discussion of various types of leakage that may arise in ML pipelines.
001018240 536__ $$0G:(DE-HGF)POF4-5254$$a5254 - Neuroscientific Data Analytics and AI (POF4-525)$$cPOF4-525$$fPOF IV$$x0
001018240 588__ $$aDataset connected to DataCite
001018240 650_7 $$2Other$$aMachine Learning (cs.LG)
001018240 650_7 $$2Other$$aArtificial Intelligence (cs.AI)
001018240 650_7 $$2Other$$aFOS: Computer and information sciences
001018240 7001_ $$0P:(DE-HGF)0$$aNicolaisen-Sobesky, Eliana$$b1
001018240 7001_ $$0P:(DE-Juel1)177727$$aDukart, Jürgen$$b2$$ufzj
001018240 7001_ $$0P:(DE-Juel1)131678$$aEickhoff, Simon B.$$b3$$ufzj
001018240 7001_ $$0P:(DE-HGF)0$$aGötz, Michael$$b4
001018240 7001_ $$0P:(DE-Juel1)184874$$aHamdan, Sami$$b5$$ufzj
001018240 7001_ $$0P:(DE-Juel1)187351$$aKomeyer, Vera$$b6$$ufzj
001018240 7001_ $$0P:(DE-HGF)0$$aKulkarni, Abhijit$$b7
001018240 7001_ $$0P:(DE-Juel1)179423$$aLahnakoski, Juha$$b8$$ufzj
001018240 7001_ $$0P:(DE-HGF)0$$aLove, Bradley C.$$b9
001018240 7001_ $$0P:(DE-Juel1)185083$$aRaimondo, Federico$$b10$$ufzj
001018240 7001_ $$0P:(DE-Juel1)172843$$aPatil, Kaustubh R.$$b11$$eCorresponding author$$ufzj
001018240 773__ $$a10.48550/ARXIV.2311.04179
001018240 8564_ $$uhttps://juser.fz-juelich.de/record/1018240/files/on_leakage.pdf$$yOpenAccess
001018240 909CO $$ooai:juser.fz-juelich.de:1018240$$popenaire$$popen_access$$pVDB$$pdriver$$pdnbdelivery
001018240 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)190306$$aForschungszentrum Jülich$$b0$$kFZJ
001018240 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)177727$$aForschungszentrum Jülich$$b2$$kFZJ
001018240 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)131678$$aForschungszentrum Jülich$$b3$$kFZJ
001018240 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)184874$$aForschungszentrum Jülich$$b5$$kFZJ
001018240 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)187351$$aForschungszentrum Jülich$$b6$$kFZJ
001018240 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)179423$$aForschungszentrum Jülich$$b8$$kFZJ
001018240 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)185083$$aForschungszentrum Jülich$$b10$$kFZJ
001018240 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)172843$$aForschungszentrum Jülich$$b11$$kFZJ
001018240 9131_ $$0G:(DE-HGF)POF4-525$$1G:(DE-HGF)POF4-520$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5254$$aDE-HGF$$bKey Technologies$$lNatural, Artificial and Cognitive Information Processing$$vDecoding Brain Organization and Dysfunction$$x0
001018240 9141_ $$y2023
001018240 915__ $$0StatID:(DE-HGF)0510$$2StatID$$aOpenAccess
001018240 920__ $$lyes
001018240 9201_ $$0I:(DE-Juel1)INM-7-20090406$$kINM-7$$lGehirn & Verhalten$$x0
001018240 980__ $$apreprint
001018240 980__ $$aVDB
001018240 980__ $$aUNRESTRICTED
001018240 980__ $$aI:(DE-Juel1)INM-7-20090406
001018240 9801_ $$aFullTexts