| 1
Large-scale extraction, structuring and matching of data
Deep Kayal
Machine Learning Engineer, Elsevier
| 2
How we managed to make sense of more than 100 million things!
| 3
Quick Introduction
• I work as a Machine Learning Engineer
• At Elsevier
• To use data (mostly text)
• To make lives easier for people in healthcare and education (amongst others!)
| 4
Setting the tone..
Good Data:
• We know what it looks like
• We could improve its quality
Data dump:
• All over the place!
• Could add information to the Good Data
| 5
Specifically..
Good Data:
• We know what it looks like
• We could improve its quality
| 6
Specifically..
Data dump:
• All over the place!
• Could add information to the Good Data
| 7
What is so large-scale?
Good Data + Data Dump = over 100 million files
| 8
How do we do this?
The relevant questions are:
• How to untangle the data mess?
• How to extract useful information?
• Using this information, how to match it to the Good Data?
• Recurring: How to do this at scale?
| 9
Tech stack?
Python and Spark (per the deck title): Win!
| 10
How to start untangling?
• Automating the structuring of an arbitrary data dump in a general way is (probably) hard
• But one can formulate good-enough assumptions about what’s in the dump(s)
• By using prior knowledge of how the data came to be
• Or by sampling from the data
• And use those assumptions to attempt unarchiving
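One way to sample from the data is to read a few files and guess their container format from the leading bytes. The sketch below is only an illustration, not the approach used in the talk: the function names are hypothetical, and the magic-byte checks cover just the common cases (an empty zip, for example, starts with a different signature and would be reported as unknown).

```python
from collections import Counter


def sniff_format(data: bytes) -> str:
    """Guess the container format of a blob from its leading magic bytes."""
    if data[:4] == b"PK\x03\x04":
        return "zip"
    if data[:2] == b"\x1f\x8b":
        return "gzip"
    # POSIX tar headers carry the string "ustar" at byte offset 257.
    if data[257:262] == b"ustar":
        return "tar"
    if data.lstrip()[:5] == b"<?xml":
        return "xml"
    if data[:5] == b"%PDF-":
        return "pdf"
    return "unknown"


def summarize_sample(blobs) -> Counter:
    """Count the guessed formats over a sample of file contents."""
    return Counter(sniff_format(blob) for blob in blobs)
```

A summary such as "mostly zip, some gzip, a few unknown" is the kind of evidence that can back an assumption before any unarchiving code is written.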
| 11
Our data dump
Simple or nested zips, gzips and tars
| 12
A very simple example of unzipping at scale
Distribute the files to Spark executors
| 13
A very simple example of unzipping at scale..
Write some functions to unzip and flatten
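The original slide showed code that is not preserved in this transcript. As a stand-in, here is a minimal sketch of what an unzip-and-flatten function could look like, using only the Python standard library. The name flatten_archive, the path-joining convention and the max_depth guard are assumptions made for illustration. Corrupt archives will raise the usual zipfile, gzip or tarfile errors, which a real job would want to catch.

```python
import gzip
import io
import tarfile
import zipfile


def flatten_archive(name: str, data: bytes, depth: int = 0, max_depth: int = 5):
    """Yield (path, content) pairs for every plain file inside a blob.

    Nested zips, gzips and tars are unpacked recursively; anything that is
    not a recognised archive is yielded unchanged.
    """
    if depth >= max_depth:
        # Guard against pathological nesting: emit the blob as-is.
        yield name, data
    elif data[:4] == b"PK\x03\x04":
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    inner = f"{name}/{info.filename}"
                    yield from flatten_archive(inner, zf.read(info), depth + 1, max_depth)
    elif data[:2] == b"\x1f\x8b":
        # Drop a trailing ".gz" so the inner name reads naturally.
        inner = name[:-3] if name.endswith(".gz") else name
        yield from flatten_archive(inner, gzip.decompress(data), depth + 1, max_depth)
    elif data[257:262] == b"ustar":
        with tarfile.open(fileobj=io.BytesIO(data)) as tf:
            for member in tf:
                if member.isfile():
                    content = tf.extractfile(member).read()
                    inner = f"{name}/{member.name}"
                    yield from flatten_archive(inner, content, depth + 1, max_depth)
    else:
        yield name, data
```

Because the function works on bytes in memory and returns a generator, it can be handed to a flatMap-style operation without touching the local disk.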
| 14
A very simple example of unzipping at scale..
Use the functions via Spark to produce sequence files containing the unzipped file content
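The slide's code is not preserved here, so the following is a hedged sketch of how the distribute, unzip and save steps could be wired together in PySpark. It assumes sc is an existing pyspark.SparkContext; the function names and the zip-only extractor are illustrative choices, not the speaker's implementation. binaryFiles, flatMap and saveAsSequenceFile are standard SparkContext and RDD methods.

```python
import io
import zipfile


def unzip_pairs(path: str, data: bytes):
    """Turn one zip archive into (member_path, content) pairs."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if not info.is_dir():
                # bytearray is the Python type PySpark documents as the
                # counterpart of Hadoop's BytesWritable.
                yield f"{path}/{info.filename}", bytearray(zf.read(info))


def archives_to_sequence_file(sc, input_glob: str, output_dir: str) -> None:
    """Unzip every archive matched by input_glob and save a sequence file.

    sc is assumed to be an existing pyspark.SparkContext.
    """
    # Each element is (path, bytes); whole files are shipped to executors.
    archives = sc.binaryFiles(input_glob)
    files = archives.flatMap(lambda kv: unzip_pairs(kv[0], kv[1]))
    files.saveAsSequenceFile(output_dir)
```

Since binaryFiles loads each archive fully into executor memory, this shape suits many moderately sized archives better than a few very large ones.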
| 15
In the sequence file..
| 16
On to the next problem: extracting useful information
• Like the last problem, this one required us to make some well-founded assumptions
• Our task was to extract bibliographic information
• The files we deemed relevant were
• Mostly XML files
• And PDFs
• Extracting things from XML is relatively simple: using the xml library
• Structuring PDFs is very hard: we tried using CERMINE (https://github.com/CeON/CERMINE) to do our best!
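For the XML case, a minimal sketch with Python's standard xml.etree.ElementTree module might look as follows. The tag names (title, author, doi) are hypothetical placeholders, since the real schema of the files is not shown in the deck. Malformed input raises ET.ParseError.

```python
import xml.etree.ElementTree as ET


def extract_bibliographic(xml_bytes: bytes) -> dict:
    """Pull a few bibliographic fields out of one XML record.

    The tag names used here are assumptions; adapt the paths to the
    schema of the actual files.
    """
    root = ET.fromstring(xml_bytes)
    return {
        "title": (root.findtext(".//title") or "").strip(),
        "authors": [a.text.strip() for a in root.iter("author") if a.text],
        # An absent or empty <doi> element is reported as None.
        "doi": (root.findtext(".//doi") or "").strip() or None,
    }
```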
| 17
Let’s go through another example
| 18
Let’s go through another example..
| 19
Scale up
Extract everything needed and make a Row out of it
| 20
Scale up..
Make a table, and we’re ready to match!
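The two "Scale up" slides showed code that did not survive extraction. A rough sketch of the same idea: parse each file into a flat record, wrap it in a pyspark.sql.Row, and build a DataFrame. The column names, the tag names and the function names are assumptions, and spark is assumed to be an existing SparkSession.

```python
import xml.etree.ElementTree as ET

# Hypothetical column layout for the extracted table.
COLUMNS = ("filename", "title", "doi")


def to_record(filename: str, content: bytes):
    """Return one flat tuple per XML file, or None if it cannot be parsed."""
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        return None
    title = (root.findtext(".//title") or "").strip()
    doi = (root.findtext(".//doi") or "").strip()
    return (filename, title, doi)


def build_table(spark, files_rdd):
    """Turn an RDD of (filename, content) pairs into a DataFrame.

    spark is assumed to be an existing pyspark.sql.SparkSession.
    """
    # Imported lazily so the parser above can be used without Spark installed.
    from pyspark.sql import Row

    rows = (
        files_rdd.map(lambda kv: to_record(kv[0], bytes(kv[1])))
        .filter(lambda rec: rec is not None)
        .map(lambda rec: Row(**dict(zip(COLUMNS, rec))))
    )
    return spark.createDataFrame(rows)
```

Returning None for unparseable files and filtering afterwards keeps one bad record from failing the whole job.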
| 21
Quick recap
Good Data:
• We now know what it looks like
Data dump:
• All over the place!
| 22
Matching?
• How to match depends on what to match!
• Matching can be exact or approximate
• Joins are a great way to match exactly
• But they need some preprocessing:
• "This is a title" vs "This is a title."
• Good preprocessing is a great way to avoid approximate matching
| 23
Simple matching – Step 1: Normalize
Write a preprocessing function
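The preprocessing function from the slide is not preserved, so this is one plausible sketch of a title normalizer: strip accents, lowercase, drop punctuation and collapse whitespace. The function name and the exact rules are assumptions; the right rules depend on the data.

```python
import re
import unicodedata


def normalize_title(text: str) -> str:
    """Reduce a title to a canonical form suitable for exact joins."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Replace punctuation and underscores with spaces.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())
```

With this in place, "This is a title" and "This is a title." map to the same key. In PySpark the function could be applied to a column by wrapping it with pyspark.sql.functions.udf.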
| 24
Simple matching – Step 1: Normalize..
| 25
Simple matching – Step 2: Join and Union
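The join-and-union code from the slide is missing from this transcript. The idea can be sketched in plain Python: run one exact join per key, then take the union of the matched pairs. The key names doi and title_norm are hypothetical; pui and filename are the table keys named later in the deck. In PySpark, the same shape would be a DataFrame join on each key followed by select, union and distinct.

```python
def join_on(good, dump, key):
    """Exact-match (inner) join of two lists of dicts on one shared key."""
    # Index the dump side by key value, skipping rows where it is missing.
    index = {}
    for d in dump:
        if d.get(key):
            index.setdefault(d[key], []).append(d)
    return {
        (g["pui"], d["filename"])
        for g in good
        if g.get(key)
        for d in index.get(g[key], [])
    }


def match(good, dump, keys=("doi", "title_norm")):
    """Union of the exact matches found on each key."""
    pairs = set()
    for key in keys:
        pairs |= join_on(good, dump, key)
    return pairs
```

Using a set for the result means a pair that matches on both keys is reported once.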
| 26
Finally..
Matched pairs between one table (key: pui) and another table (key: filename)
| 27
In summary, from here..
Good Data:
• We know what it looks like
• We could improve its quality
Data dump:
• All over the place!
• Could add information to the Good Data
| 28
In summary, to here..
• Matched pairs by key
• Matched pairs ready to be processed for enrichment
| 29
Subproblems
• How to untangle the data mess?
• How to extract useful information?
• Using this information, how to match it to the Good Data?
• Recurring: How to do this at scale?
| 30
Thanks to..
| 31
Thank you!
Feel free to reach out to me at:
d.kayal@elsevier.com
And we’re always recruiting people like you:
https://4re.referrals.selectminds.com/elsevier
If you don’t find what you’re looking for there, email me directly and we can set something up!

Large-Scale Data Extraction, Structuring and Matching using Python and Spark
