Article Four: Comparing data sets with fuzzy logic, the benefits and pitfalls.
Welcome to week 4, the second-to-last article in this series.
Fuzzy logic, what is it? Essentially, fuzzy logic is a process that uses matching algorithms to determine whether two or more records refer to the same thing, even when they are not identical. So if we had two records with the same last name, but one record was at 1/1 Apple St and the other at 1 Aple St, then, depending upon the algorithm, fuzzy matching would be able to determine that these two records are actually the same. As you can see, the advantages of fuzzy matching come into play when a data entry error, or some other issue, reduces the completeness or accuracy of a record. In these cases, fuzzy matching can still produce a result that the business finds satisfactory.
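To make the Apple St example concrete, here is a minimal sketch in Python. It assumes a simple character-similarity score from the standard library's difflib; real matching products use their own algorithms, so treat the function names, the record layout and the 0.8 threshold as illustrative assumptions only.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case the text and collapse repeated whitespace."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a score from 0.0 (nothing in common) to 1.0 (identical)."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def is_probable_match(rec_a, rec_b, threshold=0.8):
    """Hypothetical rule: same last name plus a sufficiently similar address.

    The threshold of 0.8 is an assumption for illustration, not a recommendation.
    """
    if normalise(rec_a["last_name"]) != normalise(rec_b["last_name"]):
        return False
    return similarity(rec_a["address"], rec_b["address"]) >= threshold


record_1 = {"last_name": "Smith", "address": "1/1 Apple St"}
record_2 = {"last_name": "Smith", "address": "1 Aple St"}
record_3 = {"last_name": "Smith", "address": "27 Banana Rd"}

print(is_probable_match(record_1, record_2))  # the two Apple St records
print(is_probable_match(record_1, record_3))  # an unrelated address
```

With this rule the two Apple St records are treated as the same customer, while the Banana Rd record is not. A different algorithm or threshold could easily give a different answer, which is the "depending upon the algorithm" caveat above.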
Benefits;
Well, the benefits certainly outweigh the pitfalls, but no system that I have worked with is 100% perfect, and as such you have only two options: you either “Over” dedupe your data or “Under” dedupe it. This decision depends entirely on the data you are deduping and why you are deduping it. For example, if you’re looking to send out a direct mailer to your existing customers, then you could accept the small loss of records and “Over” dedupe. This presents a higher level of professionalism, because it ensures that no customer receives two mailers. Conversely, if you were merging your customer database to remove duplicates for the sake of a data migration, you’d choose to “Under” dedupe your data, as you would not want two distinct customers to be linked together.
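One way to picture the “Over” versus “Under” choice is as a similarity threshold: the lower the threshold, the more records get treated as duplicates. The sketch below assumes a difflib-based score, and the two threshold values are hypothetical numbers picked for this sample data, not settings from any particular product.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Character-level similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def duplicate_pairs(values, threshold):
    """Return every pair of values whose similarity meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(values, 2)
        if similarity(a, b) >= threshold
    ]


# Hypothetical thresholds, chosen only to illustrate the trade-off.
OVER_DEDUPE_THRESHOLD = 0.80   # direct mailer: accept losing a few records
UNDER_DEDUPE_THRESHOLD = 0.93  # data migration: never link distinct customers

addresses = ["1/1 Apple St", "1 Aple St", "1 Apple St", "27 Banana Rd"]

mailer_pairs = duplicate_pairs(addresses, OVER_DEDUPE_THRESHOLD)
migration_pairs = duplicate_pairs(addresses, UNDER_DEDUPE_THRESHOLD)

print(len(mailer_pairs), "pairs flagged for the mailer")
print(len(migration_pairs), "pairs flagged for the migration")
```

On this sample the looser threshold groups all three Apple St variants together, including 1/1 Apple St and 1 Apple St, which could well be two different households. The stricter threshold flags only the closest pair. That is the trade-off described above: the mailer can tolerate the first outcome, the migration cannot.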
Pitfalls;
As already stated, no system is perfect! You WILL have to make a choice between “Under” and “Over” deduping. But I don’t see this as a negative. It’s more important to understand the weaknesses of a process and deal with them than to just blindly use and trust it. Deduping data sets is not “Plug and Play”; you have to work with the system to develop rules in line with the business and the respective data sets. I can honestly say that so far, no two companies have been the same. The reason for this is that each company has had its own procedures for deriving, keying, importing or purchasing data.
There are, of course, cases where fuzzy logic is not required. This is generally when the data that you’re manipulating is already of such high quality that there is little or no chance of erroneous entries. Electoral data, council data and some other government datasets spring to mind. They are, however, few and far between.