OpenStreetMap in the age of Spark
@adrianulbona
OpenStreetMap
- The Wikipedia of Maps
- https://www.openstreetmap.org
- nodes: geo-localized points on the map
- ways: roads, building contours, borders (multiple nodes), …
- relations: highways (multiple ways), schools (building contours), ...
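The three entity types can be sketched as plain data structures. This is only an illustration of the model described above: the class and field names (`Node`, `Way`, `Relation`, `Member`, `node_ids`, `tags`) are assumptions made for this sketch, not the schema of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A geo-localized point on the map."""
    id: int
    latitude: float
    longitude: float
    tags: dict[str, str] = field(default_factory=dict)


@dataclass
class Way:
    """An ordered list of node ids (a road, a building contour, a border)."""
    id: int
    node_ids: list[int]
    tags: dict[str, str] = field(default_factory=dict)


@dataclass
class Member:
    """One member of a relation: a reference to another entity."""
    type: str  # "node", "way" or "relation"
    ref: int   # id of the referenced entity
    role: str = ""


@dataclass
class Relation:
    """A group of entities (e.g. the ways that make up a highway)."""
    id: int
    members: list[Member]
    tags: dict[str, str] = field(default_factory=dict)


# Toy data, made up for illustration.
road = Way(id=10, node_ids=[1, 2, 3], tags={"highway": "residential"})
route = Relation(id=100, members=[Member(type="way", ref=road.id)])
```

Note that a way stores only node ids, not coordinates, so anything geometric needs a lookup from node id to node.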
OpenStreetMap - the data
The data is available at http://planet.openstreetmap.org and it comes in two
formats:
- XML: around 53 GB
- PBF: around 34 GB
Take a look at http://osmstats.neis-one.org if you are interested in how the map
evolves on a daily basis.
OpenStreetMap - story 1
- download the PBF/XML files
- wait days to import the data into PostgreSQL
- write some SQL
- grab a coffee
- grab a second coffee
- see some query results
- …
- manage scary scripts that keep your OSM DB updated
OpenStreetMap - story 2
- download the PBF/XML files
- extract various pieces of information into obscure CSVs
- write MapReduce (MR) jobs full of string-parsing bugs
- run the jobs
- grab a coffee
- fix your job
- grab a second coffee
- …
- MR is not cool anymore
OpenStreetMap - story 3
- one day some weird guy comes and asks:
what is the total road network length from OSM?
- you have ways as collections of node ids
- you have nodes with ids and coordinates (latitude, longitude)
- all this mixed up in a huge protobuf file
- options?
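To make the question concrete, here is a minimal single-machine sketch of the computation: resolve each way's node ids to coordinates, then sum the great-circle length of every segment. The function names, the in-memory `nodes` dictionary and the toy data are assumptions for illustration. Treating every way with a `highway` tag as a road is a simplification, and the haversine formula with a mean Earth radius is an approximation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points, in km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def road_network_length_km(nodes, ways):
    """Sum the length of all ways tagged as roads.

    nodes: dict mapping node id -> (latitude, longitude)
    ways:  iterable of (node_ids, tags) pairs
    """
    total = 0.0
    for node_ids, tags in ways:
        if "highway" not in tags:  # OSM marks roads with the `highway` tag
            continue
        # Resolve each node id to its coordinates: this is the "join" step.
        coords = [nodes[n] for n in node_ids if n in nodes]
        # Add up the length of each consecutive pair of points.
        for (lat1, lon1), (lat2, lon2) in zip(coords, coords[1:]):
            total += haversine_km(lat1, lon1, lat2, lon2)
    return total


# Toy data, made up for illustration.
nodes = {1: (0.0, 0.0), 2: (0.0, 1.0), 3: (0.0, 2.0), 4: (5.0, 5.0)}
ways = [
    ([1, 2, 3], {"highway": "primary"}),  # a road: two segments of 1 degree each
    ([3, 4], {"building": "yes"}),        # not a road: ignored
]
total_km = road_network_length_km(nodes, ways)
```

On the toy data this comes to roughly 222 km (two degrees along the equator). The catch at planet scale is the node lookup: the `nodes` dictionary no longer fits comfortably in one process, and the lookup becomes a large join.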
what is the actual problem?
we need parallel data access
we need the data structured
Spark will handle the rest
Parquet will give you more …
performance
Apache Parquet
Apache Parquet is a columnar storage format available to any project in the
Hadoop ecosystem, regardless of the choice of data processing framework, data
model or programming language.
Google paper: Dremel - Interactive Analysis of Web-Scale Datasets
Twitter blog: Dremel Made Simple with Parquet
Apache Parquet
[Figure: the same data laid out in row-based storage and in column-based storage]
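The difference between the two layouts can be sketched in a few lines of plain Python. This is a conceptual illustration only, not how Parquet encodes data on disk. The records and the `to_columns` helper are made up for the example.

```python
# Row-based layout: one complete record after another.
rows = [
    {"id": 1, "latitude": 46.77, "longitude": 23.59},
    {"id": 2, "latitude": 44.43, "longitude": 26.10},
    {"id": 3, "latitude": 45.75, "longitude": 21.23},
]


def to_columns(records):
    """Pivot a list of records into one list per field (column-based layout)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited just to read one field.
max_lat_rows = max(record["latitude"] for record in rows)

# Column layout: only the "latitude" column is touched.
max_lat_columns = max(columns["latitude"])
```

A query that needs one or two fields out of many reads far less data in the column layout, and values of the same type stored together tend to compress better.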
Apache Parquet
- protobuf-like
- primitives, arrays, structs
- bonus: nested structs
Parquet + Spark
osm-parquetizer
- github.com/adrianulbona/osm-parquetizer
- input: one OSM PBF file
- output: one Parquet file for each entity type (nodes, ways, relations)
- conversion takes minutes for countries like Romania
- between 2 and 4 hours for the entire planet
- Parquet file size: ~3x the original PBF (~100 GB for the planet)
http://bit.ly/2n9TRF3
