In our daily work we often need to make comparisons of data between and within datasets. Such operations including looking for multiple copies of records for elimination or some other analysis task. Another type of operation is a “join” which is a method of combining two tables into one. For example, we may have one table that contains products and their prices while another table contains products and the current number of orders for them. To predict anticipated revenue we really need one table with all of that information at once and a join operation on the tables can produce such a table.
These types of procedures may seem rudimentary and common and for the most part this is true. If our datasets contain unique keys for each record then these tasks are simple and can be conducted without much thought. However, there are many scenarios in which we do not possess unique keys for our data. This frequently occurs when the datasets have different origins. For example, in our analysis of the health of patients we may have a dataset that comes from a hospital and another that comes from an insurance company. Within these datasets there may be a unique patient ID or a unique policy holder ID, but it is virtually guaranteed that these keys do not match between the datasets. Therefore, the algorithms and methods that rely on unique keys will not work and we have no traditional manner of performing such tasks as joining datasets.
To tackle such situations we must instead rely on a separate group of operations characterized as Fuzzy Matching. Unlike regular deterministic record comparison, Fuzzy Matching does not strive to create matches between records with 100% certainty. This is where the name “Fuzzy” comes from as the matches are not perfectly clear and guaranteed. Despite of this lack of determinism, Fuzzy Matching can obtain results that are accurate. While classic matching operates with equality between record keys, Fuzzy Matching operates with equivalence between record attributes. To illustrate this, observe the following data:
|Steven M Johnson||967 N Woodwillow Street||8/11/1985|
|Steve Johnson||967 N Woodwillow St||8/11/1985|
These records are not equal as they have many differences in values, however, a human reader would not be taking great risk is suggesting that they are the same person. The records are not equal, but they are equivalent.
Fuzzy Matching allows us to specify the defining attributes of a record which will be used to identify how to join it with other datasets. For example, with the medical and insurance datasets from earlier, the defining attributes would be those that make up an individual person which may include their name, phone number, address, and date of birth. Note that these attributes are not always assumed to be unique as many people can have the same name or birthdate. Additionally, these attributes rarely have perfectly normal forms as the same name can be spelled in various ways and an address has some components that may be absent in some record keeping system. Despite this, Fuzzy Matching can link records corresponding to the same entity based on this “fuzzy” information by abstracting and normalizing the data and then constructing equivalency classes rather than equality classes. A person in one dataset may have their name spelled differently in another and Fuzzy Join is able to identify that the two are actually the same if enough supplementary information is available. We will discuss how we actually calculate equivalency of records quantitatively in following posts.