Automated data validation: An industrial experience report
| dc.contributor.author | Zhang, Lei | |
| dc.contributor.author | Howard, Sean | |
| dc.contributor.author | Montpool, Tom | |
| dc.contributor.author | Moore, Jessica | |
| dc.contributor.author | Mahajan, Krittika | |
| dc.contributor.author | Miranskyy, Andriy | |
| dc.date.accessioned | 2024-02-28T14:18:33Z | |
| dc.date.available | 2024-02-28T14:18:33Z | |
| dc.date.issued | 2022-12-16 | |
| dc.description.abstract | There has been a massive explosion of data generated by customers and retained by companies in the last decade. However, there is a significant mismatch between the increasing volume of data and the lack of automation methods and tools. The lack of best practices in data science programming may lead to software quality degradation, release schedule slippage, and budget overruns. To mitigate these concerns, we would like to bring software engineering best practices into data science. Specifically, we focus on automated data validation in the data preparation phase of the software development life cycle. This paper studies a real-world industrial case and applies software engineering best practices to develop an automated test harness called RESTORE. We release RESTORE as an open-source R package. Our experience report, done on the geodemographic data, shows that RESTORE enables efficient and effective detection of errors injected during the data preparation phase. RESTORE also significantly reduced the cost of testing. We hope that the community benefits from the open-source project and the practical advice based on our experience. | |
| dc.description.sponsorship | The work reported in this paper is supported and funded by the Natural Sciences and Engineering Research Council of Canada, Ontario Centres of Excellence, and Environics Analytics. We thank Environics Analytics data scientists for their valuable feedback. | |
| dc.description.uri | https://www.sciencedirect.com/science/article/pii/S0164121222002497 | |
| dc.format.extent | 39 pages | |
| dc.genre | journal articles | |
| dc.genre | preprints | |
| dc.identifier | doi:10.13016/m2ljyy-8i7q | |
| dc.identifier.citation | Zhang, Lei, Sean Howard, Tom Montpool, Jessica Moore, Krittika Mahajan, and Andriy Miranskyy. “Automated Data Validation: An Industrial Experience Report.” Journal of Systems and Software 197 (March 1, 2023): 111573. https://doi.org/10.1016/j.jss.2022.111573. | |
| dc.identifier.uri | https://doi.org/10.1016/j.jss.2022.111573 | |
| dc.identifier.uri | http://hdl.handle.net/11603/31734 | |
| dc.language.iso | en_US | |
| dc.publisher | Elsevier | |
| dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
| dc.relation.ispartof | UMBC Information Systems Department Collection | |
| dc.relation.ispartof | UMBC Faculty Collection | |
| dc.rights | This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author. | |
| dc.title | Automated data validation: An industrial experience report | |
| dc.type | Text | |
| dcterms.creator | https://orcid.org/0000-0001-9343-3654 | |
