Managing Probabilistic Duplicates in Databases

Citation:
Hafez, M. M., Managing Probabilistic Duplicates in Databases, Giza, Cairo University, 2014.

Thesis Type:

M. Sc. Thesis

Abstract:

Data fusion in a virtual data integration environment starts after duplicate records from the integrated data sources have been detected and clustered. It is the process of selecting, for each attribute, one value from among the values in the clustered records, so as to form a single record that represents the real-world object.

Many attempts have been made to resolve inconsistencies at the data level, but none of them performs data fusion fully automatically, without predefined metadata or user intervention.

In this thesis, a new direction is opened by performing data fusion as a fully automated process, and two data fusion techniques are proposed. The proposed Data Dependency (DD) technique resolves conflicts using a final statistical score for each requested attribute, computed from two scores: a local score reflecting the level of correlation between the attribute and a unified detector over all of the data sources, and a score reflecting how common an attribute value is within its cluster of records. The Information Gain (IG) technique, on the other hand, resolves data conflicts based on two factors: one measuring the information gain for each attribute when the data is partitioned based on the unified detector, and the other, shared with DD, measuring the popularity of a given data value within its cluster.
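
The two IG factors can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: information gain is taken as the standard entropy reduction of the detector column, popularity as a value's relative frequency within its cluster, and the two are combined by a simple product. The names `entropy`, `information_gain`, `popularity`, and `fuse` are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(attribute_values, detector_values):
    """Entropy reduction of the detector after grouping rows by the attribute."""
    total = len(detector_values)
    partitions = {}
    for attr, det in zip(attribute_values, detector_values):
        partitions.setdefault(attr, []).append(det)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(detector_values) - remainder


def popularity(value, cluster_values):
    """Relative frequency of a value among the values in its cluster."""
    return cluster_values.count(value) / len(cluster_values)


def fuse(candidates, gain_by_source):
    """Pick one value for an attribute from a cluster of duplicate records.

    candidates: list of (source, value) pairs, one per clustered record.
    gain_by_source: information gain of this attribute in each source.
    Assumption: final score = source gain * value popularity.
    """
    values = [value for _, value in candidates]
    best_source, best_value = max(
        candidates,
        key=lambda c: gain_by_source[c[0]] * popularity(c[1], values),
    )
    return best_value


# Hypothetical cluster: three sources disagree on the "city" attribute.
cluster = [("s1", "Cairo"), ("s2", "Cairo"), ("s3", "Giza")]
gains = {"s1": 0.2, "s2": 0.3, "s3": 0.5}
chosen = fuse(cluster, gains)
```

In this toy cluster the more popular value wins even though the source reporting the other value has the highest gain; a different combination rule, such as a weighted sum, could change that outcome.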

A simulation was carried out to evaluate both techniques. In summary, DD performs better when the dependency between data is high, while IG gives better results when data dependency is low. Further conclusions are drawn about how both techniques behave under different sets of simulation input parameters. Finally, more techniques should be developed to handle the cases where both proposed techniques failed to give acceptable matching results.