Matching data collections, with the aim of augmenting and integrating the information for any data point that lies in two or more of those collections, is a problem that arises often nowadays. Notable examples of such data points are scientific publications, whose metadata and data are kept in various repositories, and user profiles, whose metadata and data exist in several social networks or platforms.
In our case, the collections were as follows: (1) a large dump of compressed data files on S3, containing archives in the form of ZIPs, TARs, bzips and gzips, which were expected to contain published papers in the form of XMLs and PDFs, amongst other files; and (2) a large store of metadata records in the form of XMLs, some of which were to be matched to Collection 1.
The problems, then, are: (1) How best to unzip the compressed archives and extract the relevant files? (2) How to extract meta-information from the XML or PDF files? (3) How to match the meta-information across the two collections? And all of this must be done in a big-data environment.
The presentation will describe the solution process: the use of Python and Spark in the large-scale unzipping and extraction of files from archives, and how metadata was then extracted from those files to perform the matches.
How we managed to make sense of more than 100 million things!
Deep Kayal
Machine Learning Engineer, Elsevier
Quick Introduction
• I work as a Machine Learning Engineer
• At Elsevier
• To use data (mostly text)
• To make lives easier for people in healthcare and education (amongst others!)
Setting the tone..
Good Data:
• We know what it looks like
• We could improve its quality
Data dump:
• All over the place!
• Could add information to the Good Data
What is so large-scale?
Good Data + Data Dump = Over 100 million files..
How do we do this?
The relevant questions are:
• How to untangle the data mess?
• How to extract useful information?
• Using this information, how to match it to the Good Data?
• Recurring: How to do this at scale?
How to start untangling?
• It is (probably) hard to automate the structuring of a data dump in a generalizable way
• But one can formulate some good-enough assumptions about what’s in the dump(s)
• By utilizing prior knowledge of how the data came to be
• Or by sampling from the data
• And use those assumptions to make an attempt at unarchiving
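As a sketch of the sampling step above: profiling file extensions over a sample of keys is one cheap way to form assumptions about what the dump contains. The paths below are hypothetical, not the actual dump.

```python
import os
from collections import Counter

def extension_profile(paths):
    """Count file extensions in a sample of paths from the dump.

    A quick profile like this helps form assumptions about what the
    dump contains before committing to an unarchiving strategy.
    """
    counts = Counter()
    for path in paths:
        # Normalise case so ".XML" and ".xml" are counted together.
        ext = os.path.splitext(path)[1].lower() or "<none>"
        counts[ext] += 1
    return counts

# Hypothetical sample of keys listed from the dump:
sample = ["a/paper1.xml", "a/paper1.pdf", "b/archive.zip",
          "c/notes.txt", "c/paper2.XML"]
profile = extension_profile(sample)
```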
Our data dump
Simple or nested ZIPs, gzips, tars
A very simple example of unzipping at scale
Distribute the files to Spark executors
A very simple example of unzipping at scale..
Write some functions to unzip and flatten
A very simple example of unzipping at scale..
Use the functions via Spark to produce sequence files containing the unzipped file content
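The three steps above might be sketched as follows. The `flatten_archive` helper is illustrative, not the production code, and all paths are hypothetical; `binaryFiles` and `saveAsSequenceFile` are the standard PySpark calls for reading whole files as (path, bytes) pairs and writing sequence files.

```python
import gzip
import io
import tarfile
import zipfile

def flatten_archive(name, payload):
    """Recursively unpack an archive given as raw bytes, yielding
    (member_name, content_bytes) pairs. Nested archives (a gzip
    inside a zip, etc.) are flattened too; anything that is not an
    archive is yielded as-is."""
    if zipfile.is_zipfile(io.BytesIO(payload)):
        with zipfile.ZipFile(io.BytesIO(payload)) as zf:
            for member in zf.namelist():
                if not member.endswith("/"):  # skip directory entries
                    yield from flatten_archive(member, zf.read(member))
    elif name.endswith((".tar", ".tar.gz", ".tgz")):
        with tarfile.open(fileobj=io.BytesIO(payload)) as tf:
            for member in tf.getmembers():
                if member.isfile():
                    yield from flatten_archive(member.name,
                                               tf.extractfile(member).read())
    elif name.endswith(".gz"):
        yield from flatten_archive(name[:-3], gzip.decompress(payload))
    else:
        yield name, payload

def run_on_spark(input_path, output_path):
    """Hedged sketch of the Spark driver (never called here; paths
    and app name are illustrative)."""
    from pyspark import SparkContext
    sc = SparkContext(appName="unzip-at-scale")
    (sc.binaryFiles(input_path)                   # (path, bytes), one per file
       .flatMap(lambda kv: flatten_archive(*kv))  # unzip/flatten on executors
       .saveAsSequenceFile(output_path))
```

The key design choice is that the unarchiving logic is a plain generator over bytes, so it can be unit-tested locally and then shipped to executors unchanged via `flatMap`.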
On to the next problem: extracting useful information
• Like the last problem, this one required us to make some well-founded assumptions too
• Our task was to extract bibliographic information
• Amongst the files we deemed relevant were
• Mostly XML files
• And PDFs
• Extracting things from XML is relatively simple: we used Python’s xml library
• Structuring PDFs is very hard: we tried CERMINE (https://github.com/CeON/CERMINE) to do our best!
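A minimal sketch of the XML side, using Python’s standard `xml.etree.ElementTree`. The element names below (`article-title`, `doi`, `year`) and the sample record are illustrative only; real publishing schemas such as JATS differ.

```python
import xml.etree.ElementTree as ET

def extract_bibliography(xml_bytes):
    """Pull basic bibliographic fields out of an XML record.
    Element names are hypothetical; adapt to the actual schema."""
    root = ET.fromstring(xml_bytes)

    def text_of(tag):
        # Find the first occurrence of the tag anywhere in the tree.
        node = root.find(f".//{tag}")
        return node.text.strip() if node is not None and node.text else None

    return {
        "title": text_of("article-title"),
        "doi": text_of("doi"),
        "year": text_of("year"),
    }

# Illustrative record, loosely JATS-shaped:
record = b"""<article>
  <front>
    <article-title> A study of things </article-title>
    <doi>10.1000/xyz123</doi>
    <year>2018</year>
  </front>
</article>"""
meta = extract_bibliography(record)
```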
Quick recap
Good Data:
• We now know what it looks like
Data dump:
• All over the place!
Matching?
• How to match depends on what to match!
• Matching can be exact or approximate
• Joins are a great way to match exactly
• But they need some preprocessing:
• “This is a title” vs. “This is a title.”
• Good preprocessing mechanisms are a great way to avoid approximate matching
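A toy sketch of the preprocessing idea: normalise titles so that near-identical strings become exact join keys. The `normalise_title` function and the two-record "join" below are illustrative; in production the join would run in Spark, keyed on the normalised title.

```python
import re
import string

def normalise_title(title):
    """Canonicalise a title so that near-identical strings
    ('This is a title' vs 'This is a title.') join exactly:
    lowercase, drop punctuation, collapse whitespace."""
    title = title.lower()
    title = title.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", title).strip()

# Toy stand-in for the Spark join: key both sides by normalised title.
good_data = {normalise_title("This is a title"): {"id": "good-1"}}
dump_side = {normalise_title("This is a title."): {"file": "paper.xml"}}
matches = {k: (good_data[k], dump_side[k])
           for k in good_data.keys() & dump_side.keys()}
```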
In summary, from here..
Good Data:
• We know what it looks like
• We could improve its quality
Data dump:
• All over the place!
• Could add information to the Good Data
In summary, to here..
• Pairs matched by key
• Matched pairs ready to be processed for enrichment
Subproblems
• How to untangle the data mess?
• How to extract useful information?
• Using this information, how to match it to the Good Data?
• Recurring: How to do this at scale?
Thank you!
Feel free to reach out to me at:
d.kayal@elsevier.com
And we’re always recruiting people like you:
https://4re.referrals.selectminds.com/elsevier
If you don’t find what you’re looking for there, email me directly and we can set something up!