Data validation for machine learning
Data validation for machine learning, Breck et al., SysML’19
Last time out we looked at continuous integration testing of machine learning models, but arguably even more important than the model is the data. Garbage in, garbage out.
In this paper we focus on the problem of validating the input data fed to ML pipelines. The importance of this problem is hard to overstate, especially for production pipelines. Irrespective of the ML algorithms used, data errors can adversely affect the quality of the generated model.
Breck et al. describe for us the data validation pipeline deployed in production at Google, “used by hundreds of product teams to continuously monitor and validate several petabytes of production data per day.” That’s trillions of training and serving examples per day, across more than 700 machine learning pipelines. More than enough to have accumulated some hard-won experience on what can go wrong and the kinds of safeguards it is useful to have in place!
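To make the validation step concrete, here is a minimal sketch using TensorFlow Data Validation, the open-source release of the data validation component described in the paper. The file names are hypothetical placeholders, and this shows only the basic single-batch workflow (compute statistics, infer a schema, validate a new batch against it), not the full production setup the paper covers.

```python
# A minimal sketch of schema-driven data validation with TensorFlow Data
# Validation (TFDV). File paths are hypothetical placeholders.
import tensorflow_data_validation as tfdv

# Compute summary statistics over a batch of training data.
train_stats = tfdv.generate_statistics_from_csv(data_location='train_day1.csv')

# Infer an initial schema (expected features, types, value domains) from the
# statistics; in practice the schema is then reviewed and curated by the
# pipeline owners.
schema = tfdv.infer_schema(statistics=train_stats)

# Validate the next batch of data against the schema, surfacing anomalies such
# as missing features, type mismatches, or out-of-domain values.
serve_stats = tfdv.generate_statistics_from_csv(data_location='serve_day2.csv')
anomalies = tfdv.validate_statistics(statistics=serve_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```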
What could possibly go wrong?
The motivating example is based on an actual production outage at Google, and demonstrates a couple of the trickier issues: feedback loops caused by training on corrupted data, and distance between data…








