Data Quality Assessment and Repair

An academic viewpoint on data quality

There is a disconnect between the academic literature on data quality and data repairing and the way industry repairs and cleans data. While papers on data quality offer well-defined models for measuring the quality of data and for repairing it to achieve higher quality, this viewpoint has not been widely adopted by companies that possess and handle large quantities of data. The goal of this article is to give a partial, high-level view of the academic side of data quality and a taste of the areas and projects that we are working on.

Why care about data quality?

Many studies and papers have shown that poor data quality can degrade the performance of the applications that use it, and in particular of Machine Learning models trained on it (e.g., Budach et al.), validating the famous saying “Garbage In – Garbage Out”. It is thus no surprise that companies spend over a billion dollars in an effort to improve data quality.

But what is data quality? Is it simply having no “NaN”s in the data, or is it something more? Can we find a better definition of data quality, one that can be used to measure and improve it?

Here, we can look to the field of Database Research to get a more formal description of data quality. Specifically, we can talk about two types of constraints: Integrity Constraints and Equity Constraints.

Figure: High-level data repair model.

Integrity Constraints

The standard approach to measuring data quality and repairing it involves Integrity Constraints (ICs). Such constraints aim to ensure that the data satisfies conditions that reflect reality. An IC is usually expressed as a logical statement that is either true or false (although relaxations have also been considered in the literature). One example of an IC is a Functional Dependency (FD), which has the form $A \rightarrow B$, where $A, B$ are subsets of the attributes (columns) of the dataset. This means that for any two records $t_1, t_2$ in the dataset, if $t_1.C_A = t_2.C_A$ holds for all attributes $C_A \in A$, then $t_1.C_B = t_2.C_B$ holds for all $C_B \in B$. For example, the constraint $\texttt{address}\rightarrow \texttt{state,zipCode}$ says that a person’s address determines their state and zip code. In other words, no two people can have the same address but a different state or zip code.
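
To make this concrete, here is a minimal sketch (my own illustration, not code from any of the cited works) of how one might detect violations of the FD $\texttt{address}\rightarrow \texttt{state,zipCode}$ over a pandas DataFrame; the column names are assumptions about the schema.

```python
# A minimal sketch: report the rows that participate in a violation of an FD lhs -> rhs.
import pandas as pd

def fd_violations(df: pd.DataFrame, lhs: list[str], rhs: list[str]) -> pd.DataFrame:
    # Within each group of identical lhs values, the rhs values must also be identical.
    group_sizes = df.groupby(lhs)[rhs].nunique()
    bad_keys = group_sizes[(group_sizes > 1).any(axis=1)].index
    return df.set_index(lhs).loc[bad_keys].reset_index()

people = pd.DataFrame({
    "address": ["1 Main St", "1 Main St", "5 Oak Ave"],
    "state":   ["NY",        "NJ",        "CA"],
    "zipCode": ["10001",     "10001",     "94016"],
})
print(fd_violations(people, ["address"], ["state", "zipCode"]))
```

On this toy table, the two records that share the address “1 Main St” but disagree on the state are reported as a violation.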
Generalizations of FDs such as conditional FDs, metric FDs, and Denial Constraints (DCs) have also been proposed. We will devote a different article to these more complex constraints.

Equity Constraints

There has been a significant effort in recent years to define fairness and equity. Such constraints can be defined as conditions over query results, e.g., requiring that the number of males in the result be identical to the number of females, or that one attribute be independent of another (e.g., Income should be independent of Gender). They can also be statistical conditions on the performance of an algorithm over the data (a very prolific vein of literature called “Algorithmic Fairness”), for example, requiring an equal probability of a positive outcome regardless of the value of a sensitive attribute such as Race. One example of such a constraint is Demographic Parity, which says that the proportions of positive predictions for each population group should be approximately identical. Formally, $\frac{Pr(O = 1 \mid S = 1)}{Pr(O = 1 \mid S = 0)} \geq 0.8$, where $O$ is the prediction of a classifier (e.g., “loan approved”) and $S$ is a sensitive attribute (e.g., Race). Another example is Equalized Odds, which states that a person with the same qualifications has the same opportunity for a positive outcome regardless of whether they belong to a protected population (i.e., both populations have equal true positive rates and equal false positive rates): $Pr(O = 1 \mid Y = 1, S = 1) = Pr(O = 1 \mid Y = 1, S = 0)$, where, e.g., $S$ is Gender, $Y$ is the qualification for getting a loan, and $O$ is the bank’s decision on whether that person gets the loan. More definitions and examples can be found on Google’s page devoted to fairness. We will not elaborate further on equity and fairness constraints, as they deserve a (very) detailed discussion.
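
As a toy illustration (my own sketch, not taken from the cited sources), the two measures above can be computed directly from a table of predictions, where $S$, $Y$, and $O$ are binary 0/1 columns following the notation used here.

```python
# Demographic parity (worst-case group ratio, the "80% rule") and equalized-odds gaps.
import pandas as pd

def demographic_parity_ratio(df: pd.DataFrame) -> float:
    rates = df.groupby("S")["O"].mean()          # Pr(O = 1 | S = s) per group
    return rates.min() / rates.max()             # >= 0.8 passes the 80% rule

def equalized_odds_gaps(df: pd.DataFrame) -> dict:
    gaps = {}
    for y in (1, 0):                             # y = 1: TPR gap, y = 0: FPR gap
        rates = df[df["Y"] == y].groupby("S")["O"].mean()
        gaps["TPR gap" if y == 1 else "FPR gap"] = abs(rates.max() - rates.min())
    return gaps

data = pd.DataFrame({"S": [0, 0, 0, 1, 1, 1],
                     "Y": [1, 1, 0, 1, 1, 0],
                     "O": [1, 1, 0, 1, 0, 0]})
print(demographic_parity_ratio(data), equalized_odds_gaps(data))
```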

Measuring data quality

We can start with measuring data quality. For this, we can adopt the “downstream task” view and see how accurate or discriminatory our model is when we train it on our data. This is a valid measure, especially when we have a particular task in mind for which the dataset is designated. However, oftentimes we first simply aim to make our data reliable and unbiased. Therefore, we may wish to use measures of quality that only take the data into account. Such measures have been proposed for ICs in Livshits et al. and aim to quantify the inconsistency of a dataset with respect to a set of constraints.
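
As a rough illustration of such a data-only measure, the sketch below computes one of the simplest options: the fraction of tuple pairs that jointly violate a given FD. The measures studied by Livshits et al. are considerably more refined; this only conveys the idea.

```python
# Fraction of tuple pairs that jointly violate the FD lhs -> rhs (toy sketch).
import pandas as pd
from itertools import combinations

def violating_pair_fraction(df: pd.DataFrame, lhs: list[str], rhs: list[str]) -> float:
    n = len(df)
    total_pairs = n * (n - 1) // 2
    violations = 0
    # O(n^2) scan over all tuple pairs; fine for an illustration, not for large data.
    for (_, t1), (_, t2) in combinations(df.iterrows(), 2):
        if (t1[lhs] == t2[lhs]).all() and not (t1[rhs] == t2[rhs]).all():
            violations += 1
    return violations / total_pairs if total_pairs else 0.0

people = pd.DataFrame({"address": ["1 Main St", "1 Main St", "5 Oak Ave"],
                       "state":   ["NY",        "NJ",        "CA"]})
print(violating_pair_fraction(people, ["address"], ["state"]))  # 1 of 3 pairs -> 0.33
```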

Approaches for Data Repairing with Integrity Constraints

Given a set of constraints, we can consider several approaches for changing the data to ensure that it satisfies them. Specifically, previous works have considered two main approaches: (1) repairing attribute values in cells (e.g., Rekatsinas et al. and Chu et al.) and (2) tuple deletion (e.g., Chomicki et al. and Gilad et al.). Cell-value updates are useful when we do not want to lose records to deletion, so we simply change values to make the records fit the constraints, while tuple deletion is useful when we want to make sure we do not accidentally create artificial tuples (for example, in data derived from population surveys, we may not want to update people’s information since this may “create” people who do not exist in the population). It is worth noting that repairs by deletion are much better understood theoretically than repairs through cell updates: there are specific complexity bounds and even algorithms (when the constraints are FDs) that allow us to distinguish between tractable and intractable cases (see Livshits et al. for more details).

One way of looking at the data repairing problem is as an optimization problem. To define it formally, we need a few ingredients. First, we are given a database $D$, a set of constraints $\Sigma$, and a repair model (e.g., tuple deletion or cell updates) $\mathcal{M}$, all of which we have already discussed. Finally, we are also given a distance metric $\Delta$ specifying how we measure the distance between databases. Formally, $\Delta$ is a function that takes two databases $D, D'$ and returns a number expressing the distance between them. This can be the number of differing tuples between $D$ and $D'$, the number of differing cell values between $D$ and $D'$, etc.

Now that all of the ingredients are in place, we can define the optimization problem as finding the closest database, according to $\Delta$, that can be obtained by the repair model $\mathcal{M}$ and satisfies all the constraints in the set $\Sigma$.

$$ \begin{equation} \boxed{\begin{aligned} & \underset{D' \in \mathcal{M}(D)}{\text{minimize}} & & \Delta(D, D') \\ & \text{subject to} & & D' \models \Sigma \end{aligned}} \end{equation} $$
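
As a toy illustration of this optimization problem in the tuple-deletion model with a single FD, where $\Delta$ counts deleted tuples, the sketch below keeps, within every group of tuples that agree on the left-hand side, only the tuples carrying the most frequent right-hand-side combination; for a single FD this per-group majority choice minimizes the number of deletions. This is my own sketch, not an algorithm from the cited works.

```python
# Tuple-deletion repair for a single FD lhs -> rhs (toy sketch).
import pandas as pd

def repair_by_deletion(df: pd.DataFrame, lhs: list[str], rhs: list[str]) -> pd.DataFrame:
    # Count how often each (lhs, rhs) combination appears.
    counts = df.groupby(lhs + rhs).size().reset_index(name="n")
    # For every lhs value, pick the rhs combination with the highest count.
    winners = counts.sort_values("n").drop_duplicates(subset=lhs, keep="last")[lhs + rhs]
    # Keep only the tuples that match their group's winning combination.
    return df.merge(winners, on=lhs + rhs, how="inner")

people = pd.DataFrame({
    "address": ["1 Main St", "1 Main St", "1 Main St", "5 Oak Ave"],
    "state":   ["NY",        "NY",        "NJ",        "CA"],
})
# Deletes the single tuple claiming that 1 Main St is in NJ.
print(repair_by_deletion(people, ["address"], ["state"]))
```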

Other approaches such as tuple additions and imputations of missing values (NaNs) also exist.

Approaches for Data Repairing with Equity Constraints

If we have a statistical (or causal) constraint that expresses some fairness condition, one approach is to modify the training procedure of the model, but another, detailed in Calmon et al., is to change the data so that it satisfies the constraint. Essentially, the goal is to find a randomized mapping that transforms the given dataset into a new dataset that can then serve as a training or test set. In this transformation, the discriminatory attributes (e.g., gender and race) remain unchanged, while the other attributes, including the outcome attribute, may change. The aim is to change the data so that the joint distribution of the transformed data is close to the original distribution, while individual attributes are not considerably distorted. Another approach, detailed in Salimi et al., gives a causal criterion for fairness and proposes to repair the data by converting the causal criterion into a set of ICs called Multi-Valued Dependencies (MVDs) and then changing the data with the minimal number of tuple insertions and deletions so that it satisfies the MVDs.
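
As a toy illustration of the data-changing flavor of fairness repair, here is a simple “massaging”-style relabeling heuristic (my own sketch, explicitly not the methods of Calmon et al. or Salimi et al.): it flips a small number of outcome labels so that both groups of a binary sensitive attribute end up with equal positive rates. The column names S and outcome are assumptions.

```python
# "Massaging"-style relabeling to equalize positive rates (toy sketch).
import pandas as pd

def massage(df: pd.DataFrame, sensitive: str = "S", outcome: str = "outcome") -> pd.DataFrame:
    df = df.copy()
    rates = df.groupby(sensitive)[outcome].mean()
    low, high = rates.idxmin(), rates.idxmax()      # disadvantaged / advantaged group
    n_low = (df[sensitive] == low).sum()
    n_high = (df[sensitive] == high).sum()
    p_low = df.loc[df[sensitive] == low, outcome].sum()
    p_high = df.loc[df[sensitive] == high, outcome].sum()
    # Number of flips in each group that equalizes the two positive rates.
    k = int(round((n_low * p_high - n_high * p_low) / (n_low + n_high)))
    # The original technique picks the tuples to relabel with a ranker; for the
    # sketch we simply take the first k candidates in each group.
    promote = df[(df[sensitive] == low) & (df[outcome] == 0)].head(k).index
    demote = df[(df[sensitive] == high) & (df[outcome] == 1)].head(k).index
    df.loc[promote, outcome] = 1
    df.loc[demote, outcome] = 0
    return df

data = pd.DataFrame({"S":       [0, 0, 0, 0, 1, 1, 1, 1],
                     "outcome": [1, 1, 1, 0, 1, 0, 0, 0]})
print(massage(data).groupby("S")["outcome"].mean())  # both groups end up at 0.5
```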

Approaches for repairing data according to ECs vary widely and depend on the exact definition used; this, again, warrants a more detailed discussion.

Discovering Integrity Constraints

Constraints can either be written manually by users who possess domain knowledge and expertise, or they can be generated automatically by constraint-discovery systems such as those of Chu et al., Bleifuss et al., and Pena et al. Such systems aim to generate constraints automatically from the data, since doing so manually requires domain knowledge, skill, time, and effort.
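
For intuition only, a naive discovery sketch (nothing like the sophisticated systems cited above) might enumerate candidate single-attribute FDs $A \rightarrow B$ and keep the ones the current instance does not contradict:

```python
# Naive single-attribute FD discovery: keep A -> B if no pair of rows contradicts it.
import pandas as pd
from itertools import permutations

def discover_simple_fds(df: pd.DataFrame) -> list[tuple[str, str]]:
    fds = []
    for a, b in permutations(df.columns, 2):
        # A -> B holds on this instance iff every value of A maps to exactly one value of B.
        if (df.groupby(a)[b].nunique() <= 1).all():
            fds.append((a, b))
    return fds

people = pd.DataFrame({
    "address": ["1 Main St", "1 Main St", "5 Oak Ave"],
    "state":   ["NY",        "NY",        "CA"],
    "zipCode": ["10001",     "10001",     "94016"],
})
print(discover_simple_fds(people))
```

On this tiny instance every candidate FD happens to hold, which already hints at the over-specificity problem discussed below.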

Challenges with Automatic Repairs and Automatic Constraint Discovery

Users can, therefore, devise (or automatically discover) constraints and, based on their domain knowledge, choose a suitable repair algorithm from the variety of existing ones. This combination, along with the input database, results in a repaired database. However, these choices are highly non-trivial and can lead to uncertainty with regard to the quality of the repair. In particular:

  1. Repair: There are many possible repairs, and there is typically insufficient information to decide which of them is the correct one, since the correct repair itself can be defined in different ways. Therefore, most algorithms are heuristic and can aim for different repairs, and different heuristics perform differently depending on the scenario. Additionally, repair algorithms can be arbitrarily complex, and it is difficult to predict whether the interaction between the constraints and the repair algorithm will produce a good repair.
  2. Constraints: Constraints are expensive to create and verify, and even domain experts can make mistakes. Moreover, automatic constraint discovery algorithms may learn wrong constraints, as they learn them from noisy data. Thus, repairs may be based on incorrect constraints. When the user notices a wrong repair, it is non-trivial to understand whether it is due to an issue in the algorithm or the constraints. Furthermore, automating the constraint discovery step may result in a large number of constraints. For instance, running DCFinder and Hydra can generate hundreds of constraints. Some of these constraints can be overly specific to the data instance and thus describe wrong statements simply because there are no tuples that disprove them or because the data that the constraints are derived from is dirty.

Summary

The goal here is to give a taste of the field of data repair in academia and to discuss some of the future directions and challenges that we are actively working on. We note that there is a wide variety of constraints and algorithms that can be used to repair databases based on such constraints. In this short article, we covered only a small subset of these and did not discuss other prominent types of constraints, such as containment constraints, or other approaches for data repair, such as missing-value imputation. Additionally, there are many approaches for automatically discovering and verifying integrity constraints, which are intriguing to investigate in depth.

Collaborating with Industry

What do I do?

I work with organizations that depend on data for analytics, machine learning, and operational decision making and that need their data to be reliable without slowing down delivery. My focus is on turning data quality from an ongoing source of friction into a systematic, automated capability embedded directly into data pipelines. I help teams identify where data quality breaks in practice and put concrete mechanisms in place to detect and address those failures early. This typically includes:

  • Assessing the current state of data quality across critical datasets and pipelines
  • Making implicit data assumptions explicit through constraints and quality expectations
  • Designing automated checks for completeness, consistency, validity, and freshness
  • Identifying root causes of recurring data issues rather than treating symptoms
  • Integrating data quality assessment into existing workflows and infrastructure

The emphasis is always on practical outcomes rather than tooling for its own sake.

How Engagements Work

Engagements are focused and time bounded. They are designed to deliver clear value quickly rather than producing abstract recommendations. A typical engagement results in:

  • A clear map of data quality risks that affect business critical use cases
  • A concrete set of automated quality checks aligned with how the data is actually used
  • Guidance on where to repair data, where to block pipelines, and where to tolerate noise
  • Documentation that makes data quality expectations explicit and shareable across teams

This enables teams to move faster with greater confidence in their data.

Who This Is For

I typically work with:

  • Heads of data and analytics teams
  • Engineering leaders responsible for data platforms
  • Organizations scaling machine learning or AI systems
  • Teams experiencing repeated data incidents or loss of trust in dashboards and models

If data issues are slowing decisions, breaking pipelines, or forcing repeated manual validation, this work is directly applicable.

The Outcome

The goal is simple and measurable. Teams spend less time firefighting data issues, downstream systems behave more predictably, and data driven work proceeds with fewer surprises. Data becomes something teams can rely on rather than something they must constantly question. If this approach aligns with your challenges, the next step is usually a short and focused assessment to identify where automated data quality improvements will have the highest impact.

Amir Gilad
Assistant Professor

The Hebrew University