Processing Planetary Sized Datasets
Vehicle Id Trip Id Timestamp Latitude Longitude Altitude
…
10152875766888406 57169639 1445377623000 36.966819 -122.012298 1809
10152875766888406 57169639 1445377625000 36.966845 -122.012248 1809
10152875766888406 57169639 1445377627000 36.966877 -122.012228 1814
10152875766888406 57169639 1445377629000 36.966913 -122.012236 1814
10152875766888406 57169639 1445377630000 36.966946 -122.012236 1814
10152875766888406 57169639 1445377631000 36.966984 -122.012263 1815
10152875766888406 57169639 1445377632000 36.967027 -122.012281 1815
…
39 TB of raw location data:
• 584 billion data points
• 116 million trips
• Many trips per vehicle.
• Want to be able to pull a range of locations by timestamp for trip display.
Vehicle Id Timestamp Latitude Longitude
…
10152875766888406 1445374423000 36.966819 -122.012298
10152875766888406 1445377625000 36.966845 -122.012248
10152875766888406 1445377627000 36.966877 -122.012228
10152875766888406 1445377629000 36.966913 -122.012236
10152875766888406 1445377630000 36.966946 -122.012236
10152875766888406 1445377631000 36.966984 -122.012263
10152875766888406 1445379512000 36.967027 -122.012281
…
This is a challenge with a large dataset:
• A traditional relational database (e.g., Postgres) typically requires hand sharding to scale to PBs of data.
• Highly indexed non-relational solutions (e.g., MongoDB) can be very expensive.
• Lightly indexed solutions (HBase, Cassandra, Azure Table Storage) are a good fit because we really only have one query we need to execute against the data.
PartitionKey (vehicleId) RowKey (timestamp) Latitude Longitude
10152875766888406 1445377623000 36.966819 -122.012298
10152875766888406 1445377625000 36.966845 -122.012248
10152875766888406 1445377627000 36.966877 -122.012228
10152875766888406 1445377629000 36.966913 -122.012236
10152875766888406 1445377630000 36.966946 -122.012236
…
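With this layout, pulling a trip’s locations is a single partition scan over a RowKey range. A minimal sketch of that query using the azure-data-tables Python SDK; the connection string and the timestamp bounds are placeholders, and fixed-width timestamp strings are assumed since RowKey comparisons are lexicographic:

from azure.data.tables import TableClient

# Placeholder connection string and table name.
table = TableClient.from_connection_string(
    '<connection-string>', table_name='locations')

# All locations for one vehicle in a time window: PartitionKey fixes
# the vehicle, the RowKey range fixes the timestamps.
query = (
    "PartitionKey eq '10152875766888406' "
    "and RowKey ge '1445377623000' "
    "and RowKey lt '1445377632000'"
)
for entity in table.query_entities(query):
    print(entity['Latitude'], entity['Longitude'])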
• Want to query a set of trips in a bounding box.
• Also want to filter activities based on distance and duration.
Trip Id start (sec) finish (sec) distance (m) duration (m) bbox (geometry)
101528 1445377625 1445383025 50023 6222 [-104.990, 39.7392...
101643 1445362577 1445373616 28778 2498 [-122.01228, 36.96…
101843 1445377627 1445382432 4629 701 [0.1278, 51.5074 …
101901 1445362577 1445374713 99691 14232 [139.6917, 35.699...
102102 1445374713 1445374713 25259 6657 [1.3521, 103.8129…
user Id timestamp latitude longitude
10152875766888406 1445377623 36.966819 -122.012298
10152875766888406 1445377625 36.966845 -122.012248
…
10152875766888406 1445383025 36.966913 -122.012236
10152875766888406 1445383030 36.966946 -122.012236
activity id start finish … bbox
101528 1445362577 1445373616 … [-104.990, 39.7392...
101643 1445377625 1445383025 … [-122.01228, 36.96…
101843 1445377627 1445382432 … [0.1278, 51.5074 …
101901 1445362577 1445374713 … [139.6917, 35.699...
102102 1445374713 1445374713 … [1.3521, 103.8129…
Location Data (Azure Table Storage)
Trip Data (Postgres + PostGIS)
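On the trip side of this split, the highly indexed store supports the rich queries above. A minimal sketch of the bounding-box-plus-filters query against Postgres/PostGIS via psycopg2; the table and column names follow the trip table shown earlier but are assumptions, as is the connection string:

import psycopg2

conn = psycopg2.connect('dbname=trips user=app')  # placeholder connection
cur = conn.cursor()

# Trips whose bounding box overlaps a query envelope, filtered on
# distance and duration. ST_MakeEnvelope takes (xmin, ymin, xmax, ymax, srid).
cur.execute(
    '''
    SELECT trip_id, start, finish, distance, duration
    FROM trips
    WHERE bbox && ST_MakeEnvelope(%s, %s, %s, %s, 4326)
      AND distance < %s
      AND duration < %s
    ''',
    (-122.1, 36.9, -121.9, 37.0, 10000, 3600))
for row in cur.fetchall():
    print(row)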
• Total number of location samples in a geographical area.
• Whole dataset operation.
• Divides the world up into tiles.
• Each tile has four children at the next higher zoom level.
• Maps 2-dimensional space to 1 dimension (see the tile math sketch below).
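For reference, a minimal sketch of the standard XYZ (slippy-map) tile math behind this addressing. The zoom_row_column ID format is inferred from the examples on the later slides (e.g. 10_398_164), though the exact format the geotile library uses is an assumption:

import math

def tile_id(latitude, longitude, zoom):
    # Standard Web Mercator tile math: project lat/lon onto a
    # 2^zoom by 2^zoom grid and take the integer cell indexes.
    n = 2 ** zoom
    column = int((longitude + 180.0) / 360.0 * n)
    lat_rad = math.radians(latitude)
    row = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return '{}_{}_{}'.format(zoom, row, column)

print(tile_id(36.9741, -122.0308, 10))  # -> 10_398_164, matching the slides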
• Can think of it as “Hadoop the Next Generation”.
• Better performance (10-100x).
• Cleaner programming model.
• Used HDInsight Spark (Azure) to avoid the operational difficulties of running our own Spark cluster.
For each location, map to tiles at every zoom level:
(36.9741, -122.0308) → [
  (10_398_164, 1), (11_797_329, 1),
  (12_1594_659, 1), (13_3189_1319, 1),
  (14_6378_2638, 1), (15_12757_5276, 1),
  (16_25514_10552, 1), (17_51028_21105, 1),
  (18_102057_42211, 1)
]
Reduce all these mappings with the same key into an aggregate value:
(10_398_164, 151) ← [
  (10_398_164, 1), (10_398_164, 1), …
  (10_398_164, 1), (10_398_164, 1), …
  (10_398_164, 1), (10_398_164, 1), …
  (10_398_164, 1), (10_398_164, 1), …
  (10_398_164, 1)
]
def tile_id_mapper(location):
    # Map one location to a (tileId, 1) pair for every zoom level we
    # want in the heatmap. Tile.tile_ids_for_zoom_levels comes from
    # the geotile library linked at the end of the deck.
    tileMappings = []
    tileIds = Tile.tile_ids_for_zoom_levels(
        location['latitude'],
        location['longitude'],
        MIN_ZOOM_LEVEL,
        MAX_ZOOM_LEVEL
    )
    for tileId in tileIds:
        tileMappings.append((tileId, 1))
    return tileMappings
Building the heatmap then boils down to this in Spark:

lines = sc.textFile('wasb://locations@loc.blob.core.windows.net/')
locations = lines.flatMap(json_loader)
heatmap = (locations
    .flatMap(tile_id_mapper)
    .reduceByKey(lambda agg1, agg2: agg1 + agg2))
heatmap.saveAsTextFile('wasb://heatmap@loc.blob.core.windows.net/')
…
Geolocated Tweets → Summary Updates
Tile index example (tile IDs across zoom levels 6-9): 6_25_31; 7_50_62, 7_50_63, 7_50_64; 8_100_124, 8_100_125, 8_100_126; 9_201_245, 9_201_247, 9_201_248, 9_201_249, 9_201_250
JSON: 39 TB → Avro: 23 TB (60% smaller)
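To make the Avro comparison concrete, a minimal sketch using the fastavro Python library; the schema fields mirror the location records shown earlier, but the exact schema the project used is an assumption:

from fastavro import parse_schema, writer

# Assumed record schema mirroring the location rows above.
schema = parse_schema({
    'name': 'Location',
    'type': 'record',
    'fields': [
        {'name': 'vehicleId', 'type': 'string'},
        {'name': 'timestamp', 'type': 'long'},
        {'name': 'latitude', 'type': 'double'},
        {'name': 'longitude', 'type': 'double'},
    ],
})

records = [{'vehicleId': '10152875766888406', 'timestamp': 1445377623000,
            'latitude': 36.966819, 'longitude': -122.012298}]

# Schema-ed binary output is substantially smaller than equivalent JSON.
with open('locations.avro', 'wb') as out:
    writer(out, schema, records)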
• geotile (http://github.com/timfpark/geotile): XYZ tile math in C#, JavaScript, and Python
• heatmap (http://github.com/timfpark/heatmap): Spark code for building heatmaps
• tileIndex (http://github.com/timfpark/tileIndexPusher): Azure Function for pushing tile indexes
Tim Park
@timpark
Editor's Notes
  1. Thanks… I thought I would start today by briefly looking at where we’ve been as an industry, and how that has shaped our practice of building software
  2. So, in the beginning there was a computer. It filled a room and was something that only a government had access to. It was very sexily called the ENIAC and it did mathematical computations at the blazing speed of 360 calculations per second. Required an army of people in suits to keep it operating. It was also central to the scientific understanding of how far neutrons could penetrate matter before hitting a nucleus and how much energy it would give off when it did. To calculate this, the engineers on the project invented the Monte Carlo Method, which basically uses random samples of a problem to arrive at a converged numerical result. We still use this technique today in software profiling.
  3. Over a decade or so, computers got smaller and only took up half a room. They became accessible in price to universities and large corporations. 1000x faster than the first computer. 1000x more access by developers to computers. This led to many more geniuses at the keyboard (like Kernighan and Ritchie here). Their desire to be more productive led to an explosion of useful software and techniques that we still use today: the first modern systems language in C, and modern operating systems like UNIX. We still model many of our programming languages and operating systems off of this work.
  4. All of these techniques trickled down into the original PCs like the IBM PC and the Macintosh. These computers put computing in the hands of ordinary people, workers, students, and hackers. This led to the first widespread usage of computers by non-engineers and to the reinvention of a wide range of tasks.
  5. Writers moved from typewriters to…
  6. … To word processing …
  7. Accountants and inventory management moved from error prone paper…
  8. …To spreadsheets and enterprise resource management systems. The big point here is that the personal computer enabled the automation of a wide range of office work tasks that were largely abstract. And we as developers invented things like filesystems to manage files and B-trees to enable fast queries against databases. And this is the big meta point: at every point in the evolution of the computing industry, we have had to investigate and discover what have become widely applicable approaches to building applications based on solving practical real world problems.
  9. The mobile phone has obviously taken this technology progression forward another step.
  10. While many people would point to app stores as the most interesting way mobile has changed the software industry, in my opinion the most interesting thing it has done is expand computing out to the real world... enabling whole classes of new applications...
  11. Like having a map of the world in our pockets…
  12. …being able to push a button and have a car show up 5 meters away…
  13. And, of course, the all-important Pokemon Go, which allows us to catch virtual animals…
  14. In the same way that our predecessors had to figure out the best way to process text and structured data, the explosion of mobile, and the fact that many mobile apps revolve around the real 3D world, means that processing and analyzing geospatial data is increasingly woven into the applications that we are building as developers. I work in our developer advocacy team at Microsoft and I really have an awesome job where I get to work with a bunch of customers and partners on their hardest technical challenges. Today I’m going to share some of the things we have learned in the course of those projects around effectively processing geospatial data in the cloud. The first project I wanted to share with you, and the one I’ll use to largely ground this talk, is around a transportation partner that we worked with here in Europe.
  15. This well known transportation company collects location traces for each of the trips that their fleet takes. Here is a visualization of what one of those trips looks like from a prototype we built with them.
  16. The shape of this dataset will probably not surprise you. It’s basically a whole lot of data that includes: a vehicle id, a bunch of sets of trips that are identified by a trip id, and a bunch of timestamped location data: latitude, longitude, and altitude. And so, there are many many vehicles, who each have many trips, which all have many timestamped locations.
  17. It’s probably not hard for you to imagine that this company ends up with literally a mountain of data. This is pretty common in this space. Collecting geospatial data often results in a ton of data landing at your doorstep.
  18. And to give you an idea of this, we worked with them on a month’s worth of data. Even this small time slice of data is still pretty large. A CSV dump of a month, containing 584B locations across 116M trips, is over 39TB in size. The point here is that this is definitely larger than any typical computing node, and therefore we have to use larger scale data techniques in order to store and process it.
  19. Let’s first start by talking about how we store the location data. As we talked about previously, a trip has an ordered set of locations by timestamp. To display a trip, we need to pull all of the associated locations for that trip.
  20. So what we need from the storage system is the ability to pull a range of timestamps. With this, we can pull all of the locations for a particular trip.
  21. This is pretty standard stuff for a database but becomes a challenge with a large dataset (reasons above)
  22. This brings us to the first pattern that we have used for all of our geospatial projects. And that’s to use a lightly structured storage system for this sort of data. In our projects, we, naturally, almost exclusively use Azure. And for this sort of storage requirement we use Azure Table Storage. For those of you that aren’t familiar with Azure Table Storage, it sits somewhere just above blob storage. And as such, it is very inexpensive, and costs roughly 2 US cents per GB per month. And unlike blob storage, you can access ranges and individual rows in the data. But you can only query on a set of RowKeys within the same PartitionKey. But we only need to query on timestamp. And it satisfies our need to be able to query a range of user locations by timestamp range.
  23. Now that we have an approach for storing locations, let’s look at how we store the trip metadata itself. One of the key queries we want to do is around querying the trips in a bounding box. We also, in the future, want to be able to filter trips on distance or duration.
  24. What this essentially means is that unlike the location data, we want the trip data to be highly indexed. We want to be able to do queries like “Give me all of the trips under 10km in length near Dublin” or “Give me all of the trips over 1 hour in duration near Berlin”. And this means we need the data columns highlighted above to be indexed, which is not a great fit for the Azure Table Storage we saw in the first pattern. But the good news is that there is 10 to 20 thousand times less data as well, which brings it within range of a traditional relational database like MySQL or Postgres. And, so, we can just set up some schema’ed tables which allow us to make rich queries against the data.
  25. Which brings us to the second pattern for dealing with high scale data like this: use the best storage system for each scenario that solves a particular application’s needs. We call this “polyglot persistence”. And what we mean by that is that you should choose each storage system because it excels at a portion of the problem that you are trying to solve. As we saw before, I am using Azure Table Storage for the location data, and for the trip data, I am using PostgreSQL + PostGIS. The way this works is that we query for trips using Postgres and then, when we need the location data, we query for it from Table Storage. Ok, so that is how we are storing and querying trips. We load locations into Table Storage and trips into Postgres at creation and then query them.
  26. As we saw, the data from these vehicle trips is basically a whole lot of CSV files with location information that includes: user id, activity id (which identifies all of the data that is part of the same run, walk, hike, bike ride, etc.), timestamp, latitude and longitude. There are many many users, who each have many activities, which all have many timestamped locations.
  27. So what does the shape of this dataset look like? It’s basically a whole lot of CSV files with location information that includes: a user id, activity id, timestamp, latitude and longitude. There are many many users, who each have many activities, which all have many timestamped locations.
  28. Now that we have talked about how we store a dataset of this size, let’s dig in and talk more about some of the techniques we can use to process it at scale. The transportation company we worked with also wanted to be able to visualize a heatmap of where their fleet spends its time.
  29. This is a pretty common problem. We also tackled this with Stroeer, an outdoor advertiser in Germany. They are the company that does most of the advertising billboards you see on the sidewalks in urban centers throughout Germany. One of the important factors in advertising is that the overall mood of a particular place can make a particular ad much more or less effective. Given this, Stroeer combined a number of datasets, including geotagged social feeds, to come up with an overall estimate of what people were feeling, and then used this plus demographic data to decide which ad to show and how much to charge for each ad. So, building heatmaps like these is a pretty common problem…
  30. Let’s look at how we generated the heatmap for our transportation company, since it is a simpler scenario. In that case, the heatmap is generated by summing up the number of location samples in a particular geographical area. So for every location that a truck sends back, we attribute that location to a summary in the heatmap, and sum up over all of the locations that are associated with that summary.
  31. So before we dive into how we implemented this, let’s discuss how we map a particular location point to a geographic summary. One of the common patterns we’ve seen is using the XYZ tile as a summarization bucket. This is not something we’ve invented. Open Street Maps, Google, and Apple use the same concept for addressing geographical areas in their maps. It is a fairly simple concept. The top level world is divided into 4 tiles. For each zoom level below that, you take the parent tile and recursively divide it into 4 tiles. An individual tile is then addressed by its zoom level and its row and column within that zoom level. And so, in order to build these heatmap summaries, we basically count location samples in these geographic XYZ tiles. This means that it is operating over the whole dataset, and given the large size of the dataset we need to open our big data toolbox to accomplish this.
  32. To accomplish this, we used Apache Spark. For those of you that have used Hadoop, Apache Spark is sort of a “Hadoop the next generation”. It operates on data in a similar paradigm but offers much better performance and, in my opinion, a much nicer programming model than the original Hadoop engine. HDInsight Spark is Azure’s hosted version of Apache Spark. We used this hosted offering so we could ignore the significant operational work of running a Spark cluster and focus on the actual problem we are trying to solve.
  33. In order to compute our heatmap, we use a pretty standard map/reduce algorithm. For every location in the dataset, we generate a tile id key/value pair for every zoom level that we want results for. In this case, we are generating tile key/value pairs for the zoom levels from 10 to 18 because we knew that these are the only set of tiles that the user interface would end up using. So for instance, for this location from a vehicle, we generated one for zoom level 10, zoom level 11, … through zoom level 18.
  34. We then reduce all of these mappings with the same key down into its aggregate heatmap value. In Spark, the first element in a tuple is considered the key and the second element is considered the value. Which is to say, we take all of the locations with the same tile ids from the previous step and count them.
  35. So let’s look at what an implementation of these concepts looks like in Python. We first compute the tileIds for all of the zoom levels that we want to collect results for, and then use that to build tuples for each tileId.
  36. With this mapper, the overall implementation in Spark is fairly straightforward. We point it at the dataset in blob storage, then parse it as JSON, and then use the tile_id_mapper function we defined earlier to map each location to the appropriate zoom level result. We then reduce all of these individual results by key to get a final total for each tile result. We implement the reducer as an anonymous lambda function that essentially sums the intermediate aggregates for tiles with the same id. And then write the heatmaps back out to blob storage.
  37. From a programming standpoint, Spark makes this look really easy. But under the covers, Spark is doing a lot of work for us. Remember, we are working on a dataset that doesn’t fit onto a single machine. During the map stage, any tile id could be generated by any of the mappers in the Spark cluster since locations are uniformly distributed in the dataset. This means that there needs to be a shuffle step in which the results for a given tile id are assigned to the same reducer so that we can calculate an aggregate value, and therefore there will be potentially billions of these tuples floating around across the cluster. The good news is that Spark handles all of this underneath the covers for us.
  38. The next pattern I wanted to talk about is incremental ingestion. In the real world, we don't usually have static data but instead data that is constantly arriving. In this case, we have vehicles that are constantly delivering data. And ideally, to make Spark more efficient, we'd like to combine these incoming small trips into a set of large aggregate files. For this we are using Azure Event Hub. You can think of Event Hub as a giant cloud buffer where a downstream backend system controls the rate at which data is read out. It is helpful in situations where you have 1) bursty data and 2) data that isn't needed immediately. In this case, we want to use it to buffer up the data for a particular hour time slice, and then create hourly summaries of this new data using another Azure service called Stream Analytics. Using these pieces of infrastructure in conjunction with each other to enable incremental ingestion of data is a key pattern for high scale location data.
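As a rough illustration of the buffering half of this pattern, a minimal sketch of publishing location events to Event Hub with the azure-eventhub Python SDK; the connection string, hub name, and payload shape are placeholders:

import json
from azure.eventhub import EventData, EventHubProducerClient

# Placeholder connection string and event hub name.
producer = EventHubProducerClient.from_connection_string(
    '<connection-string>', eventhub_name='locations')

# Buffer a batch of samples into the hub; the downstream slicing job
# drains them into hourly aggregates at its own pace.
batch = producer.create_batch()
batch.add(EventData(json.dumps({
    'vehicleId': '10152875766888406',
    'timestamp': 1445377623000,
    'latitude': 36.966819,
    'longitude': -122.012298,
})))
producer.send_batch(batch)
producer.close()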
  39. Incremental ingestion leads us to our next pattern: processing data in slices. We do this in a manner analogous to how we processed the whole dataset in the previous slides, but we only operate on the single new data slice. Since we are only processing an individual slice, we do not need nearly as large a cluster to do the processing. We then load in the previous complete result, fold in this newly computed partial heatmap, and then write out the new heatmap. Although this adds a second step, overall we are operating on a much smaller set of data, and therefore it is much more efficient.
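A minimal PySpark sketch of that fold-in step, assuming both the previous result and the slice's partial heatmap are stored as (tileId, count) pairs and that parse_tile_count is a small assumed helper that turns the saved text lines back into tuples:

# Load the previous complete heatmap and the newly computed slice.
previous = (sc.textFile('wasb://heatmap@loc.blob.core.windows.net/')
    .map(parse_tile_count))
slice_update = (sc.textFile('wasb://slice@loc.blob.core.windows.net/')
    .map(parse_tile_count))

# Folding the slice in is just another reduceByKey over the union.
updated = previous.union(slice_update).reduceByKey(lambda a, b: a + b)
updated.saveAsTextFile('wasb://heatmap-updated@loc.blob.core.windows.net/')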
  40. The next project I wanted to share with you is some work that we’ve been doing with the United Nations Office for the Coordination of Humanitarian Affairs. This part of the UN is tasked with getting help to the most vulnerable people in the midst of humanitarian crises as quickly as possible.
  41. We worked with them to see how technology could have helped them during the many humanitarian crises that have happened in the Libyan Civil War that has been ongoing since 2014.
  42. Armed conflicts and natural disasters make up nearly equal parts of their remit. These events yield devastated civilian infrastructure…
  43. …and many vulnerable refugees seeking shelter and food.
  44. The state of the art of detecting these sorts of crises is still very human driven. For example, a photojournalist may be documenting the event
  45. And might call back with his observations of where and what help is needed most
  46. The UN then kicks into gear and coordinates a disaster relief effort. But this is a very slow approach – it can be days before critical needs are discovered
  47. And sadly, that is several days too late for some of the victims of these crises
  48. Fortunately, the UN had an idea for how to improve on this, and it leverages the high penetration of mobile in these developing countries.
  49. The idea was pretty simple: could we search for humanitarian keywords in geolocated tweets and other short messages, intersecting and summarizing them against real world features like a city or a state, to build a near real time dashboard that detects more quickly where these crises are occurring?
  50. Geographical features in the real world have complex shapes. This is probably not a surprise to you. The challenge, given this, is how to do these intersections at scale. In the United Nations project we utilized the Open Street Maps dataset to compute intersections using a two stage process. We again employed the XYZ tile to accomplish this. We loaded and keyed each geolocated tweet. These are the small squares in the slide. We then loaded each Open Street Map feature and keyed it with the tileId at the maximum zoom level that encompasses the feature. In this case, the black square represents the smallest XYZ tile that spans the Benghazi region completely. In Spark, when you have two datasets of tuples keyed the same way, you can use a join operation to select the elements that exist in both datasets. We use this to join the geolocated tweets (the small rectangles) against the features. This yields a set of candidate feature matches. It is not the final set of matches because the black box that represents the spanning tileId for Benghazi obviously does not fit it perfectly. This means that false matches like the red square are included in the candidate matches. That said, the join does narrow down the set of potential matches considerably, including the blue square below it. This narrowed set of potential matches allows us to make a second pass, where we do a fine grained intersection test against the real border data. So this is how we can do a scaled intersection of the dataset to come up with the set of features each datapoint intersects with, so we can aggregate against them. However, as I mentioned previously, the United Nations wanted a system that could react in near real time to humanitarian crises. There are a couple of ways that you can do this. You could do it with Spark Streaming and micro batches, but that can be an expensive way to accomplish this. Maintaining an entire cluster to handle the worst case load means that in the common case a number of the nodes of the cluster will go underutilized, so it only really makes sense when you have a pretty consistent volume of data and you can size the cluster appropriately.
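A rough PySpark sketch of this two-stage intersection; the keying of both datasets and the use of shapely for the fine-grained pass are assumptions for illustration:

from shapely.geometry import Point, shape

# Assumed inputs:
#   tweets:   RDD of (tileId, tweet), each tweet keyed by the tile that
#             contains it at the relevant zoom level
#   features: RDD of (tileId, feature), each feature keyed by the
#             smallest tile that completely spans its geometry
candidates = tweets.join(features)  # coarse tile-level matches

def truly_intersects(pair):
    # Second pass: exact point-in-polygon test against the real border.
    tweet, feature = pair[1]
    point = Point(tweet['longitude'], tweet['latitude'])
    return shape(feature['geometry']).contains(point)

matches = candidates.filter(truly_intersects)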
  51. Using an entirely batch infrastructure for this can be slow and expensive. So instead, we adopted a Lambda Architecture. This is becoming a pretty common pattern in our industry. A lambda architecture pairs a batch layer that does full dataset processing with a speed layer that does near real time stream level processing to keep these results up to date. Projects use this architecture because it enables you to do a full recompute on the dataset to handle, for example, new feature additions. In between those full recomputes, you can use a less expensive mechanism to update the overall view with the latest data.
  52. For the batch layer, we are using, like the previous examples, Apache Spark. This is the architectural diagram that implements the algorithm I described in the previous slide. From an architectural perspective it is very straightforward: we take in the geolocated tweets and the features, intersect them with a join, aggregate the values over the features, and then do a bulk update of the results.
  53. We've talked about what the batch layer for this looks like. The speed processing layer we implemented builds off of the incremental ingestion processing that we described previously. We used a new service in Azure called Azure Functions. Azure Functions enables you to process data as it arrives by applying a function to each new piece of data. In our case, for each tweet that comes in, we find the geographic features it intersects and update the existing dataset. This enables a nearly real time update of the dashboard, but if we add features that we want to aggregate over, we can still rerun the batch layer to get summaries over those previous results.
  54. We efficiently implemented this feature intersection service in Azure using a platform called Azure Functions. Azure Functions is a serverless platform, which means that instead of dimensioning the platform in terms of the number of instances that are constantly running, you instead provide a function that should be executed every time a particular event happens, and the infrastructure handles automatically scaling the number of instances to match the incoming event rate. This slide shows a very simple example of what one of these functions looks like. I wrote it in node.js, but you can write these functions in a variety of languages including C#. Basically, the function receives the tweet data slice that the incremental processing pipeline we previously discussed has dropped into blob storage, breaks it down into individual tweets since the blob contains one per line, and calls an underlying service to aggregate them on an hourly basis. It then stores these such that the frontend we have consuming them can use them. (old) The great news is that Azure has a nice new feature called Functions that does just this. Azure Functions allows you to set up a trigger on a wide variety of events, including queued events, http requests, service bus, timers, etc. It builds on top of our Azure App Service’s Web Jobs functionality to provide a super easy interface. When one of these triggers fires, a function that you have written is executed. In this case, I wrote my function in JavaScript, but C# and a host of other languages are also supported. I also set up this Azure Function to trigger on a new blob being added to a storage account container. So each time a blob is created by the previous incremental ingestion pattern, it is passed into this function so that it can be enriched with elevation data. Note that the whole blob is passed into the function, so you need to make sure that the blobs for your scenario will fit in memory. Once the locations in the blob have been enriched, we then push these out to another blob, which we use downstream.
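The function itself isn't reproduced on the slides (the talk's version was written in Node.js); as a stand-in, here is a minimal blob-triggered Azure Function sketch in Python, where the binding names and the aggregate_by_feature helper are hypothetical:

import json
import azure.functions as func

def main(tweetblob: func.InputStream, outblob: func.Out[str]):
    # Fires when the incremental pipeline drops a new slice into blob
    # storage; the blob holds one JSON tweet per line.
    tweets = [json.loads(line)
              for line in tweetblob.read().decode('utf-8').splitlines()
              if line]

    # Hypothetical helper: find the features each tweet intersects and
    # sum counts per feature for this hourly slice.
    summaries = aggregate_by_feature(tweets)

    # Write updated summaries where the dashboard frontend reads them.
    outblob.set(json.dumps(summaries))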
  55. I wanted to end today by talking about how we present and display this information. Many of you might have suspected it when you saw this the first time, but there is a fairly large set of data that is sent with each of these heatmap queries. If we made a traditional bounding box database query against a geospatial relational store for each of these heatmaps, you’d likely end up with a solution that didn’t scale well or scaled pretty expensively.
  56. Instead of querying for the heatmaps, we instead precomputed the heatmap elements that should be displayed within a particular view. We use a lower zoom level block, shown here, as a container for the higher zoom level summaries, and then precompute what should be displayed. We store all of these resultsets in blob storage, which is very inexpensive compared to the number of VMs/databases that you’d otherwise need to do this. This allows us to turn a querying problem into a “sending json” problem for our web frontends, and also allows us to cache each of these resultsets in the browser.
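A minimal sketch of pushing one precomputed resultset to blob storage with the azure-storage-blob Python SDK; the container layout and naming-by-tile-ID scheme are assumptions:

import json
from azure.storage.blob import BlobServiceClient

# Placeholder connection string; one JSON blob per container tile.
service = BlobServiceClient.from_connection_string('<connection-string>')
blob = service.get_blob_client(container='heatmaps', blob='10_398_164.json')

# Precomputed summaries for the higher zoom levels inside this tile.
resultset = {'tileId': '10_398_164', 'summaries': [['14_6378_2638', 151]]}
blob.upload_blob(json.dumps(resultset), overwrite=True)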
  57. Architecturally, this looks like this. We use the slice architecture that I talked about and heatmap deltas to determine which heatmaps require updates. We then push new heatmap resultsets out to blob storage for each of these. And that’s it – we are trading off using more storage for precomputed views vs. using computation to generate them on the fly – and this is something that you should in general look for in your projects.
  58. One other thing that I glossed over when we described displaying trips: as you remember, the application displays an elevation graph for each trip.
  59. But, also remember, our input data set does not have elevation as part of it.
  60. We’ve worked with Guide Dogs for the Blind to build a device that uses data from Open Street Maps The app helps blind people: * Discover where they are * What's around them * And helps navigate them to locations
  61. That said, while storage is becoming increasingly cheap, and while JSON is a fantastic format and very developer friendly, the traditional laws still apply: reading and writing data incurs latency. So when you go to do a project like this for real, consider a binary data serialization format like Avro instead. By establishing a schema, and using a binary serialization format, you can achieve on the order of a 60% size reduction. This improves the performance of deserializing and serializing the data from and to blob storage. It also, naturally, linearly cuts your data storage costs.
  62. Ok, that’s what I have for you today. I’ve open sourced a couple of things as part of this presentation that you should have a look at if you are interested in more details: geotile, heatmap, and tileIndex (an Azure Function for pushing tile indexes; includes a sample dataset that you can work against).
  63. I will share these slides via Twitter, so follow me @timpark if you’d like to get a copy. And with that, thank you for coming out today for the talk, and I’d be happy to take any questions…