Merging mixed data sources more accurately with fuzzy joins
In our daily work we often need to combine two or more datasets into one. This operation, known as a join, is simple when every record carries a unique ID that appears in both datasets. In many scenarios, however, the datasets generate their keys in different ways, so the keys do not match, or the records have no unique keys at all. In these situations a traditional join is not enough. For example, many of our projects involve analyzing data about individual people. One dataset may come from a hospital and contain a person's medical data, while another may come from an insurance company and contain policy information. It is unlikely that these two institutions share a record-keeping system in which a real-world individual is given the same unique key in both…
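The hospital and insurer scenario can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the records, the field names (`patient_id`, `policy_id`, `name`), the helper functions, and the 0.75 threshold are all hypothetical. The similarity score comes from `difflib.SequenceMatcher` in the Python standard library, which is only one of several possible measures and is not necessarily the one used in our projects.

```python
from difflib import SequenceMatcher

# Hypothetical records: the two sources use unrelated ID schemes,
# so the only field they have in common is a free-text name.
hospital = [
    {"patient_id": "H-001", "name": "Jonathan Smith", "diagnosis": "asthma"},
    {"patient_id": "H-002", "name": "Maria Garcia", "diagnosis": "diabetes"},
]
insurer = [
    {"policy_id": "P-9001", "name": "Jon Smith", "plan": "gold"},
    {"policy_id": "P-9002", "name": "Maria Garcia", "plan": "silver"},
    {"policy_id": "P-9003", "name": "Wei Chen", "plan": "bronze"},
]


def exact_join(left, right, key):
    """Traditional join: pair rows only when the key values are identical."""
    index = {row[key]: row for row in right}
    return [(l, index[l[key]]) for l in left if l[key] in index]


def similarity(a, b):
    """Case-insensitive similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_join(left, right, key, threshold=0.75):
    """Pair each left row with its most similar right row.

    A pair is kept only if its score reaches the threshold
    (0.75 is an arbitrary illustrative value, not a recommendation).
    """
    pairs = []
    for l in left:
        best = max(right, key=lambda r: similarity(l[key], r[key]))
        if similarity(l[key], best[key]) >= threshold:
            pairs.append((l, best))
    return pairs


exact_pairs = exact_join(hospital, insurer, "name")
fuzzy_pairs = fuzzy_join(hospital, insurer, "name")

print(len(exact_pairs), "exact match(es)")
print(len(fuzzy_pairs), "fuzzy match(es)")
for patient, policy in fuzzy_pairs:
    print(patient["patient_id"], "<->", policy["policy_id"])
```

The exact join misses the record where one source wrote "Jonathan Smith" and the other "Jon Smith", while the fuzzy join pairs them because their similarity score clears the threshold. Comparing every left row against every right row is quadratic in the number of records, so this shape is only suitable for small inputs. The threshold also trades missed matches against false ones and would need tuning on real data.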