Explanations for Data Repair Through Shapley Values


Data repair, i.e., the identification and fix of errors in the data, is a central component of the Data Science cycle. As such, significant research effort has been devoted to automate the repair process. Yet it still requires significant manual labor by the Data Scientists, tweaking and optimizing repair modules, up to 80% of their time, according to surveys. To this end, we propose in this paper a novel framework for explaining the results of any data repair module. Explanations involve identifying the table cells and database constraints having the strongest influence on the process. Influence, in turn, is quantified through the game-theoretic notion of Shapley values, commonly used for explaining Machine Learning classifier results. The main technical challenge is that exact computation of Shapley values incurs exponential time. We consequently devise and optimize novel approximation algorithms, and analyze them both theoretically and empirically. Our results show the efficiency of our approach when compared to the alternative of adapting existing Shapley value computation techniques to the data repair settings.