RESTORE: Automated Regression Testing for Datasets

dc.contributor.author: Zhang, Lei
dc.contributor.author: Howard, Sean
dc.contributor.author: Montpool, Tom
dc.contributor.author: Moore, Jessica
dc.contributor.author: Mahajan, Krittika
dc.contributor.author: Miranskyy, Andriy V.
dc.date.accessioned: 2025-04-23T20:30:41Z
dc.date.available: 2025-04-23T20:30:41Z
dc.date.issued: 2019-01-01
dc.description.abstract: There has been a massive explosion of data generated by customers and retained by companies in the last decade. However, there is a significant mismatch between the increasing volume of data and the lack of automation methods and tools. The lack of best practices in data science programming may lead to software quality degradation, release schedule slippage, and budget overruns. To mitigate these concerns, we would like to bring software engineering best practices into data science. Specifically, we focus on automated data validation in the data preparation phase of the software development life cycle. This paper studies a real-world industrial case and applies software engineering best practices to develop an automated test harness called RESTORE. We release RESTORE as an open-source R package. Our experience report, done on the geodemographic data, shows that RESTORE enables efficient and effective detection of errors injected during the data preparation phase. RESTORE also significantly reduced the cost of testing. We hope that the community benefits from the open-source project and the practical advice based on our experience.
dc.description.sponsorship: The work reported in this paper is supported and funded by Natural Sciences and Engineering Research Council of Canada, Ontario Centres of Excellence, and Environics Analytics. We thank Environics Analytics data scientists for their valuable feedback.
dc.description.uri: https://openreview.net/forum?id=6nt9PtCTTo
dc.format.extent: 39 pages
dc.genre: journal articles
dc.identifier: doi:10.13016/m231pw-vd2z
dc.identifier.citation: Zhang, Lei, Sean Howard, Tom Montpool, Jessica Moore, Krittika Mahajan, and Andriy V. Miranskyy. “RESTORE: Automated Regression Testing for Datasets.” CoRR, January 1, 2019. https://openreview.net/forum?id=6nt9PtCTTo.
dc.identifier.uri: http://hdl.handle.net/11603/37980
dc.language.iso: en_US
dc.publisher: Open Review
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Faculty Collection
dc.relation.ispartof: UMBC Information Systems Department
dc.rights: Attribution-ShareAlike 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by-sa/4.0/
dc.title: RESTORE: Automated Regression Testing for Datasets
dc.type: Text
dcterms.creator: https://orcid.org/0000-0001-9343-3654

Files

Original bundle

Name: automated190303676v2.pdf
Size: 618.48 KB
Format: Adobe Portable Document Format