What is deduplication?

The term “deduplication” refers to identifying a record that is repeated, or that contains the same data as another record, in a database. The term is not recognised by the RAE (Real Academia Española), but it is widely used in data management to describe locating one or more duplicate records in a file.

By comparing communication, identification and/or location data, we can measure the degree of similarity between two records. Depending on this degree of similarity, we can tell whether the two records are duplicates or whether each one is unique. All of this allows us to obtain a single customer view.

To locate duplicates, the name and postal address data are compared. Supporting parameters such as telephone number or tax number can also be used. The degree of similarity then determines whether the records are treated as the same.
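The comparison described above can be sketched in a few lines of Python. This is only an illustration, not how Deyde MyDataQ works internally: the field names, the weights in `FIELD_WEIGHTS`, the phone value and the use of `difflib.SequenceMatcher` as the text comparison are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical weights: name and address dominate, phone is supporting evidence.
FIELD_WEIGHTS = {"name": 4, "address": 4, "phone": 2}


def field_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two field values."""
    return SequenceMatcher(None, a.upper().strip(), b.upper().strip()).ratio()


def record_similarity(rec_a: dict, rec_b: dict, weights: dict = FIELD_WEIGHTS) -> float:
    """Weighted average of per-field similarities.

    Fields that are empty or missing in either record are skipped, so a
    missing phone number does not count against the pair.
    """
    total = 0.0
    weight_sum = 0
    for field, weight in weights.items():
        value_a, value_b = rec_a.get(field), rec_b.get(field)
        if not value_a or not value_b:
            continue
        total += weight * field_similarity(value_a, value_b)
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


# Illustrative records; the phone number is a made-up placeholder.
a = {"name": "LUIS MARTINEZ PEREZ", "address": "PSO CASTELLANA 23 MADRID", "phone": "910000000"}
b = {"name": "LUIS MTNEZ PEREZ", "address": "PSO CASTELLANA 23 MADRID", "phone": ""}
score = record_similarity(a, b)
print(f"similarity: {score:.2f}")
```

Skipping empty fields is a design choice: supporting parameters raise or lower the score only when both records actually carry them.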

Throughout, we are talking about identifying or locating duplicate records in a database for subsequent processing. Once the duplicates have been identified, actions such as deleting the duplicate data, enriching the information by combining both records, or any other desired processing can be carried out.
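As a rough sketch of one such follow-up action, the function below combines two records that have already been identified as duplicates. The function name `merge_records`, the field names and the rule "keep the primary value, fill gaps from the duplicate" are assumptions for illustration; a real process would apply its own survivorship rules.

```python
def merge_records(primary: dict, duplicate: dict) -> dict:
    """Combine two duplicate records into one.

    Values from `primary` win; fields that are empty or missing in
    `primary` are filled from `duplicate`. Neither input is modified.
    """
    merged = dict(primary)
    for field, value in duplicate.items():
        if not merged.get(field) and value:
            merged[field] = value
    return merged


# Illustrative data only; the phone number is a made-up placeholder.
kept = {"name": "LUIS MARTINEZ PEREZ", "address": "PSO CASTELLANA 23 MADRID", "phone": ""}
dup = {"name": "LUIS MTNEZ PEREZ", "address": "PSO CASTELLANA 23 MADRID", "phone": "910000000"}
merged = merge_records(kept, dup)
print(merged)
```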

The software reports the degree of similarity between the compared records as a difference weight: the lower the weight, the greater the similarity between the records. This indicator is of great help when making decisions about the results obtained in the process.
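To make the idea of a difference weight concrete, here is a minimal sketch. The 0 to 100 scale, the thresholds `auto_max` and `review_max`, and the use of `difflib.SequenceMatcher` are assumptions chosen for the example; the source does not state the scale or the formula the software actually uses.

```python
from difflib import SequenceMatcher


def difference_weight(a: str, b: str) -> int:
    """Return 0 for identical text and 100 for text with nothing in common."""
    similarity = SequenceMatcher(None, a, b).ratio()
    return round((1.0 - similarity) * 100)


def decide(weight: int, auto_max: int = 10, review_max: int = 30) -> str:
    """Turn a difference weight into a decision (thresholds are hypothetical)."""
    if weight <= auto_max:
        return "duplicate"
    if weight <= review_max:
        return "review"
    return "distinct"


weight = difference_weight("LUIS MARTINEZ PEREZ", "LUIS MTNEZ PEREZ")
print(weight, decide(weight))
```

Keeping a middle "review" band is one way to use the weight for decision making: clear cases are handled automatically and borderline ones go to a person.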

To increase confidence in the similarity results for the records in a database, it is recommended to first apply the standardisation and codification rules to the data. Comparing raw (non-standardised) data increases the risk of missing duplicates.

A clear example:

LUIS MARTINEZ PEREZ PSO CASTELLANA 23 MADRID

LUIS MTNEZ PEREZ AVDA GENERALISIMO 23 MADRID (Old name of Paseo de la Castellana)

If the duplicate detection process is not run on a properly standardised database, the two records in the example above would not be marked as equal. However, if the file is standardised beforehand, both records to be compared become:

LUIS MARTINEZ PEREZ PSO CASTELLANA 23 MADRID

At first glance, it can be seen that the records are not only duplicated, but also identical.
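The effect of standardising before comparing can be illustrated with a small sketch. The two lookup tables are hypothetical and contain only the entries needed for this example; a real standardisation step relies on much larger reference data and more careful matching than plain substring replacement.

```python
# Hypothetical lookup tables, just large enough for the example above.
TOKEN_EXPANSIONS = {"MTNEZ": "MARTINEZ"}                   # abbreviated surname
STREET_ALIASES = {"AVDA GENERALISIMO": "PSO CASTELLANA"}   # old street name


def standardise(record: str) -> str:
    """Upper-case, collapse whitespace, then apply aliases and expansions."""
    text = " ".join(record.upper().split())
    # Naive substring replacement; enough for a sketch.
    for old_name, current_name in STREET_ALIASES.items():
        text = text.replace(old_name, current_name)
    tokens = [TOKEN_EXPANSIONS.get(token, token) for token in text.split()]
    return " ".join(tokens)


first = standardise("LUIS MARTINEZ PEREZ PSO CASTELLANA 23 MADRID")
second = standardise("LUIS MTNEZ PEREZ AVDA GENERALISIMO 23 MADRID")
print(first == second)
```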

The search for duplicates begins by generating a key for each record, which speeds up the comparison. This key is built from the name and postal address information of each record.
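The source does not describe how the key is built, so the following is only one possible sketch of the idea: records are grouped by a cheap key, and detailed comparison is then limited to records that share a key. The key recipe in `blocking_key` (name initials plus the digits of the address) is an assumption for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name: str, address: str) -> str:
    """Hypothetical key: initials of the name plus the digits in the address."""
    initials = "".join(token[0] for token in name.upper().split())
    digits = "".join(ch for ch in address if ch.isdigit())
    return f"{initials}|{digits}"


def candidate_pairs(records: list) -> list:
    """Return index pairs of records that share a key and deserve a full comparison."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[blocking_key(record["name"], record["address"])].append(index)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


records = [
    {"name": "LUIS MARTINEZ PEREZ", "address": "PSO CASTELLANA 23 MADRID"},
    {"name": "LUIS MTNEZ PEREZ", "address": "AVDA GENERALISIMO 23 MADRID"},
    {"name": "ANA GARCIA LOPEZ", "address": "CL MAYOR 5 MADRID"},
]
print(candidate_pairs(records))
```

With a key, only records in the same group are compared in detail, instead of every record against every other record in the file.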

As with the identification of duplicates, the technology used in Deyde MyDataQ is also useful for comparing records between files and against deletion lists.
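A comparison against a deletion list can be sketched as follows. This is a simplified illustration using exact matching on a normalised key; the function names and record layout are assumptions, and a real process would use the same similarity-based comparison as for duplicates rather than exact equality.

```python
def match_key(record: dict) -> str:
    """Upper-cased name and address with whitespace collapsed."""
    return " ".join(f'{record["name"]} {record["address"]}'.upper().split())


def split_by_deletion_list(records: list, deletion_list: list) -> tuple:
    """Return (kept, flagged): records absent from / present in the deletion list."""
    blocked = {match_key(entry) for entry in deletion_list}
    kept, flagged = [], []
    for record in records:
        (flagged if match_key(record) in blocked else kept).append(record)
    return kept, flagged


# Illustrative data only.
customers = [
    {"name": "Luis Martinez Perez", "address": "PSO CASTELLANA 23 MADRID"},
    {"name": "Ana Garcia Lopez", "address": "CL MAYOR 5 MADRID"},
]
deletions = [{"name": "LUIS MARTINEZ PEREZ", "address": "PSO  CASTELLANA 23 MADRID"}]
kept, flagged = split_by_deletion_list(customers, deletions)
print(len(kept), len(flagged))
```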

Another strength of Deyde MyDataQ’s technology, beyond identifying duplicate records, lies in the identification of family and/or neighbourhood units. The ability to group several records that belong to the same household or surroundings is fundamental when developing the company’s strategy, and it also achieves a significant reduction in time and costs.
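As a very rough illustration of grouping records into units, the sketch below groups people by their postal address. Treating "same standardised address" as one unit is an assumption made for the example; the source does not say which criteria the product uses to define a family or neighbourhood unit.

```python
from collections import defaultdict


def household_groups(records: list) -> dict:
    """Group names by address and keep only addresses with more than one person."""
    groups = defaultdict(list)
    for record in records:
        address_key = " ".join(record["address"].upper().split())
        groups[address_key].append(record["name"])
    return {address: names for address, names in groups.items() if len(names) > 1}


# Illustrative data only.
people = [
    {"name": "LUIS MARTINEZ PEREZ", "address": "PSO CASTELLANA 23 MADRID"},
    {"name": "MARIA MARTINEZ PEREZ", "address": "PSO CASTELLANA 23 MADRID"},
    {"name": "ANA GARCIA LOPEZ", "address": "CL MAYOR 5 MADRID"},
]
print(household_groups(people))
```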