“Hey, here are those new data files to add. I ‘cleaned’ them myself so it should be easy. Right?”
Words like these strike fear into the hearts of developers, but integrating ‘dirty’, unstructured, denormalized and text-heavy datasets from multiple sources is becoming the de facto standard when building out data platforms.
In this talk we will look at how we can augment our graph’s attributes using techniques from data mining (e.g. string similarity/distance measures) and natural language processing (e.g. keyword extraction, named entity recognition). We will then walk through an example that uses this methodology to demonstrate the improvement in the accuracy of the resulting matches.
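To give a flavour of the string similarity idea, here is a minimal sketch in Python. It is not the implementation used in the talk. It assumes Levenshtein edit distance as the measure, and the helper names (`levenshtein`, `similarity`, `best_match`), the sample company names and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalise the distance to a score in [0, 1]; 1.0 means identical after cleanup."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(name: str, candidates: Iterable[str], threshold: float = 0.8) -> Optional[str]:
    """Return the most similar candidate, or None if nothing reaches the threshold.

    The threshold value is an illustrative assumption, not a recommendation.
    """
    scored = [(similarity(name, candidate), candidate) for candidate in candidates]
    if not scored:
        return None
    score, candidate = max(scored)
    return candidate if score >= threshold else None


if __name__ == "__main__":
    # Hypothetical node names taken from two 'cleaned' source files.
    print(best_match("Acme Corp.", ["Acme Corp", "Apex Corp"]))
```

A score like this could be stored as an extra attribute on a candidate edge between two nodes, so that later matching steps can weigh it alongside signals such as extracted keywords or named entities.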