Data Quality Assessment and Repair

An academic viewpoint on data quality

There is a disconnect between the academic literature on data quality and data repairing, and the way the industry repairs and cleans data. While papers on data quality have well-defined models for how data quality can be measured and how data can be repaired to achieve higher quality, this viewpoint has not been widely adopted by companies that possess and handle large quantities of data. The goal of this article is to give a partial, high-level view of the academic side of data quality and a taste of the areas and projects that we are working on.

Why care about data quality?

Many studies and papers have shown that poor data quality can degrade the performance of applications that use it, and specifically of Machine Learning models that train on it (e.g., Budach et al.), validating the famous saying “Garbage In – Garbage Out”. Thus, it makes sense that companies spend over a billion dollars in an effort to improve data quality.

But what is data quality? Is it simply having no “NaN”s in the data, or is it something more? Can we find a better definition of data quality, one that can be used to measure and improve it?

Here, we can look to the field of Database Research to get a more formal description of data quality. Specifically, we can talk about two types of constraints: Integrity Constraints and Equity Constraints.

High-Level Data Repair Model

Integrity Constraints

The standard approach to measuring data quality and repairing it involves Integrity Constraints (ICs). Such constraints aim to ensure that the data satisfies conditions that reflect reality. An IC is usually expressed as a logical statement that is either true or false (although relaxations have also been considered in the literature). One example of an IC is a Functional Dependency (FD), which has the form $A \rightarrow B$, where $A, B$ are subsets of the attributes (columns) of the dataset. This implies that for any two records $t_1, t_2$ in the dataset, if $t_1.C_A = t_2.C_A$ holds for all attributes $C_A \in A$, then $t_1.C_B = t_2.C_B$ holds for all $C_B \in B$. For example, the constraint $\texttt{address}\rightarrow \texttt{state,zipCode}$ says that a person’s address determines their state and zipCode. In other words, we can interpret this constraint as stating that no two people can have the same address but a different state or zipCode.
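To make the FD definition concrete, here is a minimal sketch that checks a toy table for violations of the FD $\texttt{address}\rightarrow \texttt{state,zipCode}$. The column names, the toy records, and the pandas-based check are illustrative assumptions, not part of any particular repair system.

```python
import pandas as pd

# A toy dataset; the column names and records are made up for illustration.
df = pd.DataFrame({
    "name":    ["Alice", "Bob", "Carol"],
    "address": ["1 Main St", "1 Main St", "5 Oak Ave"],
    "state":   ["NY", "NJ", "CA"],      # Bob disagrees with Alice -> FD violation
    "zipCode": ["10001", "10001", "94016"],
})

def fd_violations(df, lhs, rhs):
    """Return the rows involved in a violation of the FD lhs -> rhs.

    A group of rows sharing the same lhs values violates the FD if it
    contains more than one distinct combination of rhs values.
    """
    distinct_rhs = df.groupby(lhs)[rhs].nunique()
    bad_keys = distinct_rhs[(distinct_rhs > 1).any(axis=1)].index
    return df[df.set_index(lhs).index.isin(bad_keys)]

# Alice's and Bob's rows share an address but disagree on the state.
print(fd_violations(df, lhs=["address"], rhs=["state", "zipCode"]))
```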
Generalizations of FDs such as conditional FDs, metric FDs, and Denial Constraints (DCs) have also been proposed. We will devote a different article to these more complex constraints.

Equity Constraints

There has been a significant effort in recent years to define fairness and equity. Such constraints can be conditions over query results, such as requiring that the number of Males in the result be identical to the number of Females, or that a specific attribute be independent of another attribute (e.g., Income should be independent of Gender). They can also be statistical conditions that refer to the performance of an algorithm over the data (a very prolific vein of literature called “Algorithmic Fairness”). The latter can be defined as an equal probability of a positive outcome regardless of the value of a sensitive attribute, e.g., Race.

One example of such a constraint is Demographic Parity, which says that the proportions of positive predictions for each population group should be approximately identical. A common relaxed formalization (the “80% rule”) is $\frac{Pr(O = 1 \mid S = 1)}{Pr(O = 1 \mid S = 0)} \geq 0.8$, where $O$ is the prediction of a classifier, e.g., “loan approved”, and $S$ is a sensitive attribute, e.g., Race. Another example is Equalized Odds, which states that a person has the same opportunity for a positive outcome as people from another population segment with the same qualifications, regardless of whether they belong to a protected population (i.e., both populations have equal true positive rates and equal false positive rates): $Pr(O = 1 \mid Y = 1, S = 1) = Pr(O = 1 \mid Y = 1, S = 0)$, and analogously for $Y = 0$ (e.g., $S$ is Gender, $Y$ is qualification for getting a loan, and $O$ is the bank’s decision on whether that person gets the loan). More definitions and examples can be found on Google’s page devoted to fairness. We will not elaborate on equity and fairness constraints further, as they deserve a (very) detailed discussion.
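As a concrete illustration of the two criteria above, here is a small sketch that computes the demographic parity ratio and the equalized odds gaps from arrays of predictions, labels, and a binary sensitive attribute. The arrays $O$, $Y$, $S$ mirror the symbols in the text; the toy values are made up.

```python
import numpy as np

# Toy arrays; in practice these come from a trained classifier and the data.
O = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # classifier predictions (e.g., loan approved)
Y = np.array([1, 0, 1, 0, 0, 1, 1, 0])  # true qualification labels
S = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # binary sensitive attribute

def demographic_parity_ratio(O, S):
    """Pr(O=1 | S=1) / Pr(O=1 | S=0); the '80% rule' asks for a value >= 0.8."""
    return O[S == 1].mean() / O[S == 0].mean()

def equalized_odds_gaps(O, Y, S):
    """Gaps in true-positive and false-positive rates between the two groups;
    equalized odds asks for both gaps to be (close to) zero."""
    tpr = lambda s: O[(Y == 1) & (S == s)].mean()  # Pr(O=1 | Y=1, S=s)
    fpr = lambda s: O[(Y == 0) & (S == s)].mean()  # Pr(O=1 | Y=0, S=s)
    return abs(tpr(1) - tpr(0)), abs(fpr(1) - fpr(0))

print("Demographic parity ratio:", demographic_parity_ratio(O, S))
print("Equalized odds gaps (TPR, FPR):", equalized_odds_gaps(O, Y, S))
```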

Measuring data quality

We can first start with measuring data quality. For this, we can take the “downstream task” view and see how accurate or discriminatory our model is when we train it over our data. This is a valid measure, especially when we have a particular task in mind for which the dataset is designated. However, often we first just aim to make our data reliable and unbiased, so we may wish to use measures of quality that only take the data itself into account. Such measures have been proposed for ICs in Livshits et al. and aim to measure the inconsistency of a dataset with respect to a set of constraints.
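To give a flavor of what such a measure can look like, the sketch below computes one very simple inconsistency degree for a single FD: the fraction of tuple pairs that jointly violate it. This is only meant as an illustration; it is not necessarily one of the measures studied in Livshits et al.

```python
import itertools
import pandas as pd

def fd_inconsistency(df, lhs, rhs):
    """A simple inconsistency degree for the FD lhs -> rhs: the fraction of
    tuple pairs that jointly violate it (0 means the FD holds on the data)."""
    rows = df[lhs + rhs].values.tolist()
    k = len(lhs)
    violating = sum(
        1
        for t1, t2 in itertools.combinations(rows, 2)
        if t1[:k] == t2[:k] and t1[k:] != t2[k:]
    )
    total = len(rows) * (len(rows) - 1) // 2
    return violating / total if total else 0.0

# Toy example: one of the three tuple pairs violates address -> state, zipCode.
df = pd.DataFrame({
    "address": ["1 Main St", "1 Main St", "5 Oak Ave"],
    "state":   ["NY", "NJ", "CA"],
    "zipCode": ["10001", "10001", "94016"],
})
print(fd_inconsistency(df, ["address"], ["state", "zipCode"]))  # 0.333...
```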

Approaches for Data Repairing with Integrity Constraints

Given a set of constraints, we can look at several approaches for changing the data to ensure that it satisfies them. Specifically, previous works have considered two main approaches: (1) repairing attribute values in cells (e.g., Rekatsinas et al. and Chu et al.) and (2) tuple deletion (e.g., Chomicki et al. and Gilad et al.). Cell value updates are useful when we do not want to lose records to deletion, so we simply change values to make them fit the constraints, while tuple deletion is useful when we want to make sure we do not accidentally create artificial tuples (for example, in data derived from population surveys, we may not want to update peoples’ information since this may “create” people who do not exist in the population). It is worth noting that repairs by deletion are much better understood theoretically than repairs through cell updates: they have specific complexity bounds and even algorithms (when the constraints are FDs) that allow us to distinguish between tractable and intractable cases (see Livshits et al. for more details).
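As a minimal illustration of the tuple-deletion model for a single FD, here is a sketch of a simple heuristic that, for every left-hand-side value, keeps only the rows agreeing with its most frequent right-hand-side combination and deletes the rest. It is purely illustrative and not an optimal or published repair algorithm; ties are broken arbitrarily.

```python
import pandas as pd

def repair_by_deletion(df, lhs, rhs):
    """Heuristic subset repair for the FD lhs -> rhs: for every lhs value,
    keep only the rows that carry its most frequent rhs combination."""
    # Count how often each (lhs, rhs) combination occurs.
    counts = df.groupby(lhs + rhs).size().reset_index(name="n")
    # For each lhs value, pick the rhs combination with the highest count
    # (ties are broken arbitrarily).
    winners = counts.sort_values("n", ascending=False).drop_duplicates(subset=lhs)
    # Keep only the rows matching a winning combination; the rest are deleted.
    return df.merge(winners[lhs + rhs], on=lhs + rhs)

# Toy example: one of the two conflicting "1 Main St" rows is deleted.
df = pd.DataFrame({
    "address": ["1 Main St", "1 Main St", "5 Oak Ave"],
    "state":   ["NY", "NJ", "CA"],
    "zipCode": ["10001", "10001", "94016"],
})
print(repair_by_deletion(df, ["address"], ["state", "zipCode"]))
```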

One way of looking at the data repairing problem is as an optimization problem. To formally define it, we need a few ingredients. First, we are given a database $D$, a set of constraints $\Sigma$, and a repair model (e.g., tuple deletion or cell updates) $\mathcal{M}$, all of which we have already discussed. Finally, we are also given a distance metric $\Delta$ saying how we measure the distance between different databases. Formally, $\Delta$ is a function taking two databases $D, D'$ and returning a number that expresses the distance between them. This can be the number of differing tuples between $D$ and $D'$, the number of differing cell values between $D$ and $D'$, etc.

Now that all of the ingredients are in place we can define the optimization problem as finding the closest database, according to $\Delta$, that can be obtained by the repair model, $\mathcal{M}$, and satisfies all the constraints in the set $\Sigma$.

$$ \begin{equation} \boxed{\begin{aligned} & \underset{D' \in \mathcal{M}(D)}{\text{minimize}} & & \Delta(D,D') \\ & \text{subject to} & & D' \models \Sigma \end{aligned}} \end{equation} $$
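To connect this formulation to something executable, here is a brute-force sketch that instantiates the problem for a single FD under the tuple-deletion model, with $\Delta$ counting the number of deleted tuples. It enumerates subsets of rows from largest to smallest and is therefore exponential; it is purely illustrative, and practical repair algorithms are far more sophisticated.

```python
import itertools
import pandas as pd

def satisfies_fd(rows, lhs_idx, rhs_idx):
    """Check whether a list of row tuples satisfies the FD lhs -> rhs."""
    seen = {}
    for r in rows:
        key = tuple(r[i] for i in lhs_idx)
        val = tuple(r[i] for i in rhs_idx)
        if key in seen and seen[key] != val:
            return False
        seen[key] = val
    return True

def optimal_subset_repair(df, lhs, rhs):
    """Find a consistent sub-database with the minimal number of deletions
    (the optimization problem above with Delta = number of deleted tuples).
    Exponential in the number of rows -- for illustration only."""
    cols = list(df.columns)
    lhs_idx = [cols.index(a) for a in lhs]
    rhs_idx = [cols.index(b) for b in rhs]
    rows = [tuple(r) for r in df.itertuples(index=False, name=None)]
    for keep in range(len(rows), -1, -1):  # try the largest subsets first
        for subset in itertools.combinations(range(len(rows)), keep):
            if satisfies_fd([rows[i] for i in subset], lhs_idx, rhs_idx):
                return df.iloc[list(subset)]
    return df.iloc[[]]
```

On the toy table used above, this finds a repair that deletes exactly one of the two conflicting “1 Main St” rows.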

Other approaches such as tuple additions and imputations of missing values (NaNs) also exist.

Approaches for Data Repairing with Equity Constraints

If we have a statistical (or causal) constraint that expresses some fairness condition, one approach is to modify the training procedure of the model, but another, detailed in Calmon et al., is to change the data so that it satisfies the constraint. Basically, the goal is to find a randomized mapping that transforms the given dataset into a new dataset that can be used as a training or test dataset. In this transformation, the discriminatory attributes (e.g., gender and race) remain unchanged, while the other attributes, including the outcome attribute, may change. The goal is to change the data so that the joint distribution of the transformed data is close to the original distribution, while individual attributes are not considerably distorted. Another approach, detailed in Salimi et al., gives a causal criterion for fairness and proposes to repair the data by converting the causal criterion into a set of ICs called Multi-Valued Dependencies (MVDs), and then changing the data with the minimal number of insertions and deletions of tuples so that it satisfies the MVDs.
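To make the idea of “changing the data until a statistical constraint holds” concrete, here is a deliberately crude sketch in the spirit of label massaging: it flips outcome labels of the group with the lower positive rate until the demographic parity ratio reaches a target. This is not the method of Calmon et al. or Salimi et al.; the column names, the target of 0.8, and the random tie-breaking are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

def crude_parity_repair(df, sensitive, outcome, target_ratio=0.8):
    """Flip outcome labels (0 -> 1) of the group with the lower positive rate,
    one at a time, until Pr(O=1 | S=1) / Pr(O=1 | S=0) reaches target_ratio.
    Assumes both groups are non-empty. Crude illustration only."""
    repaired = df.copy()
    rng = np.random.default_rng(0)
    while True:
        p1 = repaired.loc[repaired[sensitive] == 1, outcome].mean()
        p0 = repaired.loc[repaired[sensitive] == 0, outcome].mean()
        if max(p1, p0) == 0 or min(p1, p0) / max(p1, p0) >= target_ratio:
            return repaired
        low_group = 1 if p1 < p0 else 0
        # Promote one randomly chosen negative example from the lower-rate group.
        candidates = repaired.index[
            (repaired[sensitive] == low_group) & (repaired[outcome] == 0)
        ]
        if len(candidates) == 0:
            return repaired
        repaired.loc[rng.choice(candidates), outcome] = 1
```

A real pre-processing repair would also control how much the joint distribution and the individual records are distorted, which is exactly the concern behind the randomized mapping of Calmon et al. described above.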

Approaches for repairing data according to equity constraints vary widely, depending on the exact constraint definition, and this again warrants a more detailed discussion.

Discovering Integrity Constraints

Constraints can either be manually written by users who possess domain knowledge and expertise, or, alternatively, they can be automatically generated by constraint discovery systems such as those of Chu et al., Bleifuss et al., and Pena et al. Such systems aim to generate constraints automatically from the data, as doing this manually requires domain knowledge, skill, time, and effort.
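The systems cited above use sophisticated algorithms over rich constraint languages; the sketch below is only a naive enumeration of single-attribute FDs that hold exactly on a given instance, which is enough to illustrate the idea (and the pitfall discussed next: on small or dirty instances, many spurious constraints hold by accident).

```python
import itertools
import pandas as pd

def discover_simple_fds(df):
    """Naively enumerate all single-attribute FDs A -> B that hold exactly on
    the given instance. Real discovery systems handle much larger constraint
    languages and far bigger datasets."""
    fds = []
    for a, b in itertools.permutations(df.columns, 2):
        # A -> B holds iff every value of A maps to exactly one value of B.
        if (df.groupby(a)[b].nunique() <= 1).all():
            fds.append((a, b))
    return fds
```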

Challenges with Automatic Repairs and Automatic Constraint Discovery

Users can, therefore, devise (or automatically discover) constraints and choose a suitable repair algorithm from the variety of existing ones, based on their domain knowledge. This combination, along with the input database, will result in a repaired database. However, these choices are highly non-trivial and can lead to uncertainty with regard to the quality of the repair. In particular:

  1. Repair: There are many possible repairs but there is insufficient information to decide which of them is the correct one since the correct repair can be defined in different ways. Therefore, most algorithms are heuristic and can aim for different repairs. Different heuristics can perform differently, depending on the scenario. Additionally, repair algorithms can be arbitrarily complex, and it is difficult to predict if the interaction between the constraints and the repair algorithm will generate a good repair.
  2. Constraints: Constraints are expensive to create and verify, and even domain experts can make mistakes. Moreover, automatic constraint discovery algorithms may learn wrong constraints, as they learn them from noisy data. Thus, repairs may be based on incorrect constraints. When the user notices a wrong repair, it is non-trivial to understand whether it is due to an issue in the algorithm or the constraints. Furthermore, automating the constraint discovery step may result in a large number of constraints. For instance, running DCFinder and Hydra can generate hundreds of constraints. Some of these constraints can be overly specific to the data instance and thus describe wrong statements simply because there are no tuples that disprove them or because the data that the constraints are derived from is dirty.

Summary

The goal here is to give a taste of the field of data repair in academia and discuss some of the future directions and challenges that we are actively working on. We note that there is a wide variety of constraints and algorithms that can be used to repair databases based on such constraints. In this short article, we only covered a small subset of these and did not discuss other prominent types of constraints, such as containment constraints, or other approaches for data repair, such as missing value imputation. Additionally, there are many approaches for automatically discovering and verifying integrity constraints, which are intriguing to investigate in depth.

Amir Gilad
Assistant Professor

The Hebrew University
