Maps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphX
Hendrik Frentrup, systemati.co
#UnifiedDataAnalytics #SparkAISummit
Data integration and the automation of tedious data extraction tasks are fundamental building blocks of a data-driven organization, yet they are at times overlooked or underestimated. Aside from extraction, scraping and ETL tasks, entity resolution is a crucial step in successfully combining datasets: the combination of data sources is usually what provides richness in features and variance, so expertise in entity resolution is important for data engineers. Graph-based entity resolution algorithms have emerged as a highly effective approach.

This talk presents the implementation of a graph-based entity resolution technique in both GraphX and GraphFrames. Working from the concept through to how the algorithm is implemented in Spark, the technique is also illustrated with a practical example. The example shows that efficacy can be achieved with simple heuristics, while mapping a path to a machine-learning-assisted entity resolution engine with a powerful knowledge graph at its center.

Machine learning can play a role upstream in building the graph, for example by using classification algorithms to determine the link strength between nodes, or downstream, where dimensionality reduction can assist clustering and reduce the computational load of the resolution stage. The audience will leave with a clear picture of a scalable data pipeline performing entity resolution effectively and a thorough understanding of its internal mechanism, ready to apply it to their own use cases.

  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Hendrik Frentrup, systemati.co. Maps and Meaning: Graph-based Entity Resolution. #UnifiedDataAnalytics #SparkAISummit
  3. Maps and Meaning: Graph-based Entity Resolution. "Data is the new oil" (image source: Jordi Guzmán, Creative Commons)
  4. Building Value Streams: Data Extraction, Data Refining, Data Warehousing (image source: Malcolm Manners, Creative Commons)
  5. Data Pipeline: Sources 1..N feed Data Extraction, Transformation, Integration and Data Modelling, which serve Visualisation (Presentation, Dashboards) and Machine Learning (Statistical Analysis, Inference, Predictions)
  6. Upstream integrations: Sources 1..N
     First Order Transformations:
     • Deduplication -> df.distinct()
     • Transformations -> df.withColumn(col, expr(col))
     • Mapping -> df.withColumnRenamed(old, new)
     Second Order Transformation:
     • Denormalisation -> lhs.join(rhs, key)
     Nth Order Transformation:
     • Merge N sources -> Entity Resolution
  7. Outline
     • Motivation
     • Entity Resolution Example
     • Graph-based Entity Resolution Algorithm
     • Data Pipeline Architecture
     • Implementation: in GraphFrames (Python API) and in GraphX (Scala API)
     • The Role of Machine Learning in Entity Resolution
  8. Example: Find Duplicates. Merge records in your address book:
     ID 1: Harry Mulisch, harry@mulisch.nl, +31 101 1001
     ID 2: HKV Mulisch, Harry.Mulish@gmail.com, +31 666 7777
     ID 3: author@heaven.nl, +31 101 1001
     ID 4: Harry Mulisch, +31 123 4567, +31 666 7777
     Merged record: Harry/HKV Mulisch; emails: harry@mulisch.nl, Harry.Mulish@gmail.com, author@heaven.nl; numbers: +31 101 1001, +31 123 4567, +31 666 7777
  9. ...such as Google Contacts
  10. Example: Resolving records (with a Source column)
      ID 1: Harry Mulisch, harry@mulisch.nl, +31 101 1001 (Phone)
      ID 2: S Nadolny, +49 899 9898 (Phone)
      ID 3: Harry Mulisch, +31 123 4567, +31 666 7777 (Phone)
      ID 4: author@heaven.nl, +31 101 1001 (Gmail)
      ID 5: Sten Nadolny, sten@slow.de, +49 899 9898 (Gmail)
      ID 6: Max Frisch, max@andorra.ch (Outlook)
      ID 7: HKV, Harry.Mulish@gmail.com, +31 666 7777 (Outlook)
  11. Graph Algorithm Walkthrough
  12. Records 1-7 shown as a graph:
      • Each record is a node
      • Create edges based on similarities
      • Collect connected nodes
      • Consolidate information in records
  13.-14. Animation steps over the same graph: edges are drawn between similar records, then the connected nodes are collected. Copyright 2019 © systemati.co
  15. Consolidated result:
      • Harry Mulisch/HKV: harry@mulisch.nl, author@heaven.nl, Harry.Mulish@gmail.com; +31 123 4567, +31 666 7777, +31 101 1001
      • Sten/S Nadolny: sten@slow.de; +49 899 9898
      • Max Frisch: max@andorra.ch
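The four walkthrough steps can be sketched outside Spark with a plain-Python union-find over the example records. The record data comes from the slides; the matching rule (exact equality of name or email, or a shared phone number) is a deliberate simplification for illustration.

```python
# Illustrative, non-Spark sketch: records are nodes, similarities are edges,
# and union-find plays the role of connected components.
records = {
    1: {"name": "Harry Mulisch", "email": "harry@mulisch.nl", "phones": {"+31 101 1001"}},
    2: {"name": "S Nadolny", "email": None, "phones": {"+49 899 9898"}},
    3: {"name": "Harry Mulisch", "email": None, "phones": {"+31 123 4567", "+31 666 7777"}},
    4: {"name": "Sten Nadolny", "email": "sten@slow.de", "phones": {"+49 899 9898"}},
    5: {"name": None, "email": "author@heaven.nl", "phones": {"+31 101 1001"}},
    6: {"name": "Max Frisch", "email": "max@andorra.ch", "phones": set()},
    7: {"name": "HKV", "email": "Harry.Mulish@gmail.com", "phones": {"+31 666 7777"}},
}

parent = {i: i for i in records}  # each node starts in its own component

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

def similar(a, b):
    # Create edges based on similarities: exact name, exact email,
    # or at least one shared phone number (None never matches None).
    return (
        (a["name"] is not None and a["name"] == b["name"])
        or (a["email"] is not None and a["email"] == b["email"])
        or bool(a["phones"] & b["phones"])
    )

ids = sorted(records)
for a in ids:
    for b in ids:
        if a < b and similar(records[a], records[b]):
            union(a, b)

# Collect connected nodes into entities.
entities = {}
for i in ids:
    entities.setdefault(find(i), set()).add(i)

print(sorted(sorted(e) for e in entities.values()))  # [[1, 3, 5, 7], [2, 4], [6]]
```

The three components match slide 15's consolidated entities: Harry Mulisch/HKV, Sten/S Nadolny and Max Frisch.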
  16. Entity Resolution Pipeline Architecture: Sources 1..N -> Extract -> Data Hub/Lake/Warehouse, holding a clean source copy of the records, appended records, consolidated nodes and resolved records; the Resolve Entities and Merge Entities stages operate in between
  17. Technical Implementation
  18. Graphs in Apache Spark:
      • Python API: GraphFrames 👍
      • Scala API: GraphX 👍, GraphFrames 👍
  19. With GraphFrames
  20. Create nodes: add an id column to the dataframe of records.
      from pyspark.sql.functions import monotonically_increasing_id
      nodes = records.withColumn("id", monotonically_increasing_id())
      Example rows (id is the identifier; ssn, email, phone, address, DoB, Name are attributes):
      0 | 714-12-4462 | len@sma.ll | 6088881234 | ... | 15/4/1937 | Lennie Small
      1 | 481-33-1024 | geo@mil.tn | 6077654980 | ... | 15/4/1937 | Goerge Milton
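Outside Spark, this step is just numbering the records. A plain-Python sketch with illustrative rows (note that `monotonically_increasing_id` only guarantees unique, increasing ids, not the consecutive 0, 1, 2, ... that `enumerate` gives):

```python
# Plain-Python analogue of adding an id column: give each record a unique
# node id, which the graph steps below rely on. Rows are illustrative.
raw = [
    {"ssn": "714-12-4462", "email": "len@sma.ll", "name": "Lennie Small"},
    {"ssn": "481-33-1024", "email": "geo@mil.tn", "name": "Goerge Milton"},
]
nodes = [{**record, "id": i} for i, record in enumerate(raw)]
print([n["id"] for n in nodes])  # [0, 1]
```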
  21. Edge creation: self-join the records against a mirrored copy of themselves.
      match_cols = ["ssn", "email"]
      mirrorColNames = [f"_{col}" for col in records.columns]
      mirror = records.toDF(*mirrorColNames)
      mcond = [col(c) == col(f"_{c}") for c in match_cols]
      cond = [(col("id") != col("_id")) & reduce(lambda x, y: x | y, mcond)]
      edges = records.join(mirror, cond)
      The resulting condition reads: (NOT (id = _id)) AND ((ssn = _ssn) OR (email = _email))
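The join condition can be checked in plain Python: an edge exists when the ids differ and any of the match columns agree. The two sample rows below are hypothetical.

```python
from functools import reduce

match_cols = ["ssn", "email"]

def edge_condition(row, mirror_row):
    # OR-reduce the per-column match predicates, then require distinct ids,
    # mirroring: (NOT (id = _id)) AND ((ssn = _ssn) OR (email = _email)).
    # The None guard loosely mirrors SQL semantics, where null = null
    # does not evaluate to true and so creates no edge.
    mcond = [row[c] is not None and row[c] == mirror_row[c] for c in match_cols]
    return row["id"] != mirror_row["id"] and reduce(lambda x, y: x or y, mcond)

a = {"id": 0, "ssn": "714-12-4462", "email": "len@sma.ll"}
b = {"id": 1, "ssn": "714-12-4462", "email": "lennie@sma.ll"}
print(edge_condition(a, b))  # True: the ssn columns agree
print(edge_condition(a, a))  # False: a record never matches itself
```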
  22. Resolve entities and consolidation: Connected Components, then consolidate the components.
      graph = gf.GraphFrame(nodes, edges)
      sc.setCheckpointDir("/tmp/checkpoints")
      cc = graph.connectedComponents()
      entities = cc.groupBy("component").agg(collect_set("name"))
  23. With GraphX
  24. Strongly Typed Scala: defining the schema of our data.
      val record_schema = StructType(Seq(
        StructField("id", LongType, nullable = false),
        StructField("name", StringType, true),
        StructField("email", StringType, true),
        StructField("ssn", LongType, true),
        StructField("attr", StringType, true)
      ))
  25. Node creation: add an ID column to records, then turn the DataFrame into an RDD.
      val nodesRDD = records.map(r => (r.getAs[VertexId]("id"), 1)).rdd
  26. Edge creation:
      val mirrorColNames = for (col <- records.columns) yield "_" + col
      val mirror = records.toDF(mirrorColNames: _*)
      def conditions(matchCols: Seq[String]): Column = {
        col("id") =!= col("_id") && matchCols.map(c => col(c) === col("_" + c)).reduce(_ || _)
      }
      val edges = records.join(mirror, conditions(Seq("ssn", "email")))
      val edgesRDD = edges
        .select("id", "_id")
        .map(r => Edge(r.getAs[VertexId](0), r.getAs[VertexId](1), null))
        .rdd
  27. Resolve entities and consolidation: Connected Components, then consolidate the components.
      val graph = Graph(nodesRDD, edgesRDD)
      val cc = graph.connectedComponents()
      val entities = cc.vertices.toDF()
      val resolved_records = records.join(entities, $"id" === $"_1")
      val res_records = resolved_records
        .withColumnRenamed("_2", "e_id")
        .groupBy("e_id")
        .agg(collect_set($"name"))
  28. Resolve operation: takes a DataFrame and the columns to match (e.g. ["ssn", "email"]), returns a DataFrame of resolved records.
  29. Evaluation
      • Number of source records per entity
      • Business logic: conflicts (e.g. multiple SSNs)
      • Distribution of matches (chart: entities by number of sources, from one source up to four)
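These evaluation counts are easy to compute once each entity carries its source records; the resolved mapping below is hypothetical, shaped like the slide 10 example.

```python
from collections import Counter

# Hypothetical resolver output: entity id -> list of (record id, source) pairs.
resolved = {
    "e1": [(1, "Phone"), (3, "Phone"), (4, "Gmail"), (7, "Outlook")],
    "e2": [(2, "Phone"), (5, "Gmail")],
    "e3": [(6, "Outlook")],
}

# Number of source records per entity.
records_per_entity = {e: len(recs) for e, recs in resolved.items()}

# Distribution of matches: how many entities draw on 1, 2, 3, ... distinct sources.
sources_per_entity = {e: len({src for _, src in recs}) for e, recs in resolved.items()}
distribution = Counter(sources_per_entity.values())
print(records_per_entity)  # {'e1': 4, 'e2': 2, 'e3': 1}
print(distribution)        # one entity each drawing on 1, 2 and 3 sources
```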
  30. Evolving Entity Resolution
  31. Machine learning in Entity Resolution
      • Pairwise comparison: string matching / distance measures; incorporate temporal data into edge creation
      • Replace the hard { 1, 0 } edge decision with a score, e.g. P(match) = 0.8762 between record 1 ("Harry Mulisch", harry@mulisch.nl) and "H Muiisch" (Harry.Mulish@gmail.com)
      • Edge creation is the most computationally heavy step
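As a minimal sketch of such a pairwise score, using Python's standard-library `difflib` rather than any particular string-distance package (the 0.8762 on the slide comes from a different, unspecified model):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    # A soft match score in [0, 1] instead of a hard {1, 0} edge decision.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Identical names score 1.0; the misspelled variant still scores fairly high.
exact = name_similarity("Harry Mulisch", "Harry Mulisch")
fuzzy = name_similarity("Harry Mulisch", "H Muiisch")
print(exact, round(fuzzy, 2))
```

A threshold on this score, or a classifier trained on several such features, then decides whether to create the edge.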
  32. Machine learning in Entity Resolution
      • Structuring connected data
      • Partitioning the graph based on clustering of records
      • Using weighted edges and learning a classifier to evaluate links between records
  33. Feeding a knowledge graph. Human interface: analytics, forensics, discovery, iterative. Improvements: data quality, contextual information, use-case driven.
  34. Get started yourself
      • GitHub project (resolver & notebook): https://github.com/hendrikfrentrup/maps-meaning
      • Docker container with pySpark & GraphFrames: https://hub.docker.com/r/hendrikfrentrup/pyspark-graphframes
  35. Key Takeaways
      • The data pipeline coalesces into a single record table
      • Connected Components is at the core of resolving
      • Edge creation is the expensive operation
      • Batch operation over a single corpus
  36. Thanks! Any questions? Comments? Observations?