SlideShare a Scribd company logo
#EntSAIS18
Exploring a Billion Activity Dataset from Athletes
with Apache Spark
Drew Robb, Infrastructure/Data at Strava
Strava Labs:
#EntSAIS18
1
#EntSAIS18 2
#EntSAIS18
Outline
1. Activity Stream Data “Lake”
2. Grade Adjusted Pace
3. Global Heatmap
a. Rasterize
b. Normalize
c. Recursion/Zoom out
d. Store+Serve
4. Clusterer
5. Final Thoughts
3
#EntSAIS18
- Activity Stream: Array of
measurement data over
time from a single
Run/Ride/etc
- Typically at 1/second rate
- Stream Types: Time,
Position (lat, lng),
elevation, heart rate,...
Data “Lake”
4
#EntSAIS18
- Billions of streams, each with thousands of points
- Canonical stream storage 1 file per stream in s3!
- Keyed by autoincrement id!!
- Over 10 billion total s3 files!
- Can’t actually bulk read-- too error prone, too expensive
- Need replica of data in a different format suitable for bulk
read with spark
Data “Lake”
5
#EntSAIS18
- Delta Encoding + Compression
- Parquet+S3 multi partition hive table
- ~100mb per s3 file (streams for ~30,000 activities)
- New data batch appended as new partitions
- Handle updates: Entire table periodically rewritten in
new s3 path, then update metastore
- Original data store: Cost: ~$5,000, time: 3 days
- Optimized for bulk read: Cost ~$5, time: 5 min
Data “Lake”
6
#EntSAIS18
- A synthetic running pace that reflects how fast
you would have run in the absence of hills
- Helps athletes compare performance of flat vs
hilly runs
- Requires precise understanding of how running
difficulty is affected by elevation gradient
Grade Adjusted Pace
7
#EntSAIS18
- Previous model: Derived from metabolic
measurements of a few athletes running on a
treadmill
- New model: Fit from actual running data drawn
from millions of runs in real world
- Find smooth data sample windows
- Only draw data from near maximum effort for
each athlete.
Grade Adjusted Pace
8
#EntSAIS18 9
#EntSAIS18 10
#EntSAIS18
Global Heatmap
- 700 million activities
- 1.4 trillion GPS points
- 10 billion miles
- 100,000 years of exercise
- 10TB input
- 500GB output
- 5 Layers by Activity Type
- https://strava.com/heatmap
11
#EntSAIS18 12
#EntSAIS18
Rasterize
13
#EntSAIS18
Web Mercator Tile Math
- Pixel coordinate system for Earth Surface.
- Each zoom level: 4x as many tiles at 2x resolution
- Each tile is 256x256 pixels
def fromPoint(lat, lng, zoom) = TilePixel(
zoom = zoom,
tileY = floor((lat+90 ) / (180 * 2^zoom)),
tileX = floor((lng+180) / (360 * 2^zoom)),
pixelY = floor((256 * (lat+90 ) / (180 * 2^zoom)) % 256),
pixelX = floor((256 * (lng+180) / (360 * 2^zoom)) % 256))
14
#EntSAIS18
Rasterize
- Convert vectorized lat/lng line data into
discrete pixel lines (Bresenham’s line
algorithm)
- Aggregate count of total passes of each
pixel
- Total pixel counts ~ 7 trillion!
- Total tile counts ~ 30 million
- Total memory load of tiles is 30TB, avoid
simultaneous allocation by having many
more tasks than cores
15
#EntSAIS18
Rasterize
Boundary case: adjacent points in different tiles
16
#EntSAIS18
Rasterize
Solution: Add virtual points on tile boundary
17
#EntSAIS18
Rasterize
- Add noise to
“uncorrect” GPS
road corrections
- Other filtering of
erroneous data
18
#EntSAIS18
Normalize
19
#EntSAIS18
Normalize
- Raw count of activity per pixel has domain [0, inf)
- To visualize with a colormap we need [0, 1]
- Tiles have dramatically different distributions of raw count values
- General and parameter free solution:
- Normalize using the Cumulative Distribution Function (CDF) of raw
count values within each tile
- Can approximate CDF with sampling (Sample 2^8 out of 2^16 values)
- Evaluate CDF with binary search in sample array
20
#EntSAIS18
Problem:
Normalization function is
different for every tile!
=> Artifacts along tile
boundaries
21
#EntSAIS18
Normalize
Solution:
- Averaging: Each tile’s CDF is calculated from data for
that tile + some radius of neighbor tiles
22
#EntSAIS18
Normalize
Solution:
- Averaging: Each tile’s CDF is calculated from data for
that tile + some radius of neighbor tiles
- Bilinear smoothing:
- Every pixel value is a weighted average of CDFs
from 4 nearest tiles
- Continuously varying pixel perfect normalization!
23
#EntSAIS18 24
#EntSAIS18 25
#EntSAIS18
Recursion /
Zoom Out
26
#EntSAIS18
Recursion / Zoom Out
27
#EntSAIS18
Recursion / Zoom Out
def finish(data, zoom) {
write(normalize(data), zoom)
if (zoom > 0) {
finish(zoomOut(data), zoom-1)
}
}
28
#EntSAIS18
Recursion / Zoom out
- Each iteration has ~1/4th as
much data
- Reduce number of tasks to
not waste time
- Exponential speedup of
stage runtime incredibly
satisfying
Level 16 written, 136928925282 bytes
Level 15 written, 53804156232 bytes
Level 14 written, 19795342663 bytes
Level 13 written, 6996622566 bytes
Level 12 written, 2474617135 bytes
Level 11 written, 897467724 bytes
Level 10 written, 331303516 bytes
Level 9 written, 122600962 bytes
Level 8 written, 44957727 bytes
Level 7 written, 16254495 bytes
Level 6 written, 5728247 bytes
Level 5 written, 1978297 bytes
Level 4 written, 673066 bytes
Level 3 written, 228565 bytes
Level 2 written, 78380 bytes
Level 1 written, 27417 bytes
29
#EntSAIS18
Store+Serve
30
#EntSAIS18
Store+Serve
Idea: Group spatially adjacent tiles to get similar file sizes.
Consider tile data as quad tree.
- Starting from all leafs, walk up tree accumulating total file size
at each node.
- When file size exceeds threshold, write all tiles as one file.
- Direct s3 write (no hadoopFS api)
- Repeat for each level of tile data.
31
#EntSAIS18
Tile Packing
Zoom 2:
-(1,0)
-(2,2)
Zoom 1:
-(0,2), (1,2), (0,3), (1,3)
-(3,2), (2,3), (3,3)
Zoom 0:
- Everything else
32
#EntSAIS18
Store+Serve
Reads:
- Reconstruct tree of tile groups in memory
- Traverse tree to find tile location
- Cache the tile group file-- nearby tiles also likely to be
served.
- Persisted data is 1 byte per pixel, colormap and PNG
applied at read time.
- Final rendered PNG also cached by cloudfront
33
#EntSAIS18
Clusterer
- Discover sets of
spatially similar activities
using clustering
- Compute patterns in
aggregated activity
properties
- Build catalog of routes
for every cluster (a few
million clusters)
34
#EntSAIS18
Clusterer
- Locality Sensitive Hashing (SuperMinHash O Ertl, 2017)
of quantized lat/lng shingles from activity streams
- MinHash pairwise similarity of activities: 10,000x
speedup over direct stream based methods
- Similarity join on collisions to efficiently find sample of
highly similar pairs
- Single linkage clustering globally on pairs
- Split locally and repeat, rehashing sample of activities in
each cluster
35
#EntSAIS18 36
#EntSAIS18
Spark Environment
- Spark on mesos on AWS
- Job submitted with Metronome/Marathon
- Isolated context with job-specific docker image for
driver+executors
- Mesos cluster running multiple spark contexts
- Scale up instances based on unscheduled tasks
- Prevent scale down on instances with active executors
- 100 i3.2xlarge machines (8cpu, 60gb ram, 1.7tb ssd)
- 5 hours for full heatmap build
37
#EntSAIS18
Spark Lessons
- Parameter tuning seems unavoidable in well utilized distributed systems
- With scala, managing application dependency conflicts can be hell (guava,
netty, logging frameworks…)
- Optimizing locally can be very misleading, must profile on real cluster
- Even with fantastic configuration management tools, lots of time wasted on
configuration
- Building from scratch may be worthwhile to fully understand performance
- Invest in generating representative sample data to use in prototypes
- Spilling to disk can use a surprising amount of IOPS
- Probably still a lot more optimizations possible
38
#EntSAIS18
Resources
Strava Global Heatmap: strava.com/heatmap
Strava Labs: labs.strava.com
Strava Engineering Blog: medium.com/strava-engineering
We are hiring Data Scientists: strava.com/careers
Strava Labs Clusterer*: labs.strava.com/clusterer
*Currently offline, will be back soon
39

More Related Content

What's hot

Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Databricks
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Databricks
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
Databricks
 
Spark Streaming into context
Spark Streaming into contextSpark Streaming into context
Spark Streaming into context
David Martínez Rego
 
A look ahead at spark 2.0
A look ahead at spark 2.0 A look ahead at spark 2.0
A look ahead at spark 2.0
Databricks
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
Databricks
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkr
Databricks
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Anton Kirillov
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
Whiteklay
 
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDeep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Databricks
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in Production
Databricks
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
Databricks
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Summit
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
Helena Edelson
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
Databricks
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the Hood
Databricks
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
Databricks
 
Omid: A Transactional Framework for HBase
Omid: A Transactional Framework for HBaseOmid: A Transactional Framework for HBase
Omid: A Transactional Framework for HBase
DataWorks Summit/Hadoop Summit
 

What's hot (20)

Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
Spark Streaming into context
Spark Streaming into contextSpark Streaming into context
Spark Streaming into context
 
A look ahead at spark 2.0
A look ahead at spark 2.0 A look ahead at spark 2.0
A look ahead at spark 2.0
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkr
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDeep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in Production
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the Hood
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Omid: A Transactional Framework for HBase
Omid: A Transactional Framework for HBaseOmid: A Transactional Framework for HBase
Omid: A Transactional Framework for HBase
 

Similar to Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache Spark with Drew Robb

Spark streaming
Spark streamingSpark streaming
Spark streaming
Venkateswaran Kandasamy
 
Data Collection and Storage
Data Collection and StorageData Collection and Storage
Data Collection and Storage
Amazon Web Services
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
rveiga100
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Codemotion
 
Intel realtime analytics_spark
Intel realtime analytics_sparkIntel realtime analytics_spark
Intel realtime analytics_sparkGeetanjali G
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
DataWorks Summit
 
Artmosphere Demo
Artmosphere DemoArtmosphere Demo
Artmosphere Demo
Keira Zhou
 
Managing your black friday logs - Code Europe
Managing your black friday logs - Code EuropeManaging your black friday logs - Code Europe
Managing your black friday logs - Code Europe
David Pilato
 
Eedc.apache.pig last
Eedc.apache.pig lastEedc.apache.pig last
Eedc.apache.pig last
Francesc Lordan Gomis
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
InfluxData
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC Oslo
David Pilato
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
ScyllaDB
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
Sagar Dolas
 
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
Alexander Ulanov
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
Stanka Dalekova
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
Stanka Dalekova
 
EEDC - Apache Pig
EEDC - Apache PigEEDC - Apache Pig
EEDC - Apache Pigjavicid
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
AbhijitManna19
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
snowflakebatch
 
strata spark streaming strata spark streamingsrata spark streaming
strata spark streaming strata spark streamingsrata spark streamingstrata spark streaming strata spark streamingsrata spark streaming
strata spark streaming strata spark streamingsrata spark streaming
ShidrokhGoudarzi1
 

Similar to Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache Spark with Drew Robb (20)

Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Data Collection and Storage
Data Collection and StorageData Collection and Storage
Data Collection and Storage
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
Intel realtime analytics_spark
Intel realtime analytics_sparkIntel realtime analytics_spark
Intel realtime analytics_spark
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
 
Artmosphere Demo
Artmosphere DemoArtmosphere Demo
Artmosphere Demo
 
Managing your black friday logs - Code Europe
Managing your black friday logs - Code EuropeManaging your black friday logs - Code Europe
Managing your black friday logs - Code Europe
 
Eedc.apache.pig last
Eedc.apache.pig lastEedc.apache.pig last
Eedc.apache.pig last
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC Oslo
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
 
EEDC - Apache Pig
EEDC - Apache PigEEDC - Apache Pig
EEDC - Apache Pig
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
 
strata spark streaming strata spark streamingsrata spark streaming
strata spark streaming strata spark streamingsrata spark streamingstrata spark streaming strata spark streamingsrata spark streaming
strata spark streaming strata spark streamingsrata spark streaming
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
AnirbanRoy608946
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 

Recently uploaded (20)

Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 

Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache Spark with Drew Robb

  • 1. #EntSAIS18 Exploring a Billion Activity Dataset from Athletes with Apache Spark Drew Robb, Infrastructure/Data at Strava Strava Labs: #EntSAIS18 1
  • 3. #EntSAIS18 Outline 1. Activity Stream Data “Lake” 2. Grade Adjusted Pace 3. Global Heatmap a. Rasterize b. Normalize c. Recursion/Zoom out d. Store+Serve 4. Clusterer 5. Final Thoughts 3
  • 4. #EntSAIS18 - Activity Stream: Array of measurement data over time from a single Run/Ride/etc - Typically at 1/second rate - Stream Types: Time, Position (lat, lng), elevation, heart rate,... Data “Lake” 4
  • 5. #EntSAIS18 - Billions of streams, each with thousands of points - Canonical stream storage 1 file per stream in s3! - Keyed by autoincrement id!! - Over 10 billion total s3 files! - Can’t actually bulk read-- too error prone, too expensive - Need replica of data in a different format suitable for bulk read with spark Data “Lake” 5
  • 6. #EntSAIS18 - Delta Encoding + Compression - Parquet+S3 multi partition hive table - ~100mb per s3 file (streams for ~30,000 activities) - New data batch appended as new partitions - Handle updates: Entire table periodically rewritten in new s3 path, then update metastore - Original data store: Cost: ~$5,000, time: 3 days - Optimized for bulk read: Cost ~$5, time: 5 min Data “Lake” 6
  • 7. #EntSAIS18 - A synthetic running pace that reflects how fast you would have run in the absence of hills - Helps athletes compare performance of flat vs hilly runs - Requires precise understanding of how running difficulty is affected by elevation gradient Grade Adjusted Pace 7
  • 8. #EntSAIS18 - Previous model: Derived from metabolic measurements of a few athletes running on a treadmill - New model: Fit from actual running data drawn from millions of runs in real world - Find smooth data sample windows - Only draw data from near maximum effort for each athlete. Grade Adjusted Pace 8
  • 11. #EntSAIS18 Global Heatmap - 700 million activities - 1.4 trillion GPS points - 10 billion miles - 100,000 years of exercise - 10TB input - 500GB output - 5 Layers by Activity Type - https://strava.com/heatmap 11
  • 14. #EntSAIS18 Web Mercator Tile Math - Pixel coordinate system for Earth Surface. - Each zoom level: 4x as many tiles at 2x resolution - Each tile is 256x256 pixels def fromPoint(lat, lng, zoom) = TilePixel( zoom = zoom, tileY = floor((lat+90 ) / (180 * 2^zoom)), tileX = floor((lng+180) / (360 * 2^zoom)), pixelY = floor((256 * (lat+90 ) / (180 * 2^zoom)) % 256), pixelX = floor((256 * (lng+180) / (360 * 2^zoom)) % 256)) 14
  • 15. #EntSAIS18 Rasterize - Convert vectorized lat/lng line data into discrete pixel lines (Bresenham’s line algorithm) - Aggregate count of total passes of each pixel - Total pixel counts ~ 7 trillion! - Total tile counts ~ 30 million - Total memory load of tiles is 30TB, avoid simultaneous allocation by having many more tasks than cores 15
  • 16. #EntSAIS18 Rasterize Boundary case: adjacent points in different tiles 16
  • 17. #EntSAIS18 Rasterize Solution: Add virtual points on tile boundary 17
  • 18. #EntSAIS18 Rasterize - Add noise to “uncorrect” GPS road corrections - Other filtering of erroneous data 18
  • 20. #EntSAIS18 Normalize - Raw count of activity per pixel has domain [0, inf) - To visualize with a colormap we need [0, 1] - Tiles have dramatically different distributions of raw count values - General and parameter free solution: - Normalize using the Cumulative Distribution Function (CDF) of raw count values within each tile - Can approximate CDF with sampling (Sample 2^8 out of 2^16 values) - Evaluate CDF with binary search in sample array 20
  • 21. #EntSAIS18 Problem: Normalization function is different for every tile! => Artifacts along tile boundaries 21
  • 22. #EntSAIS18 Normalize Solution: - Averaging: Each tile’s CDF is calculated from data for that tile + some radius of neighbor tiles 22
  • 23. #EntSAIS18 Normalize Solution: - Averaging: Each tile’s CDF is calculated from data for that tile + some radius of neighbor tiles - Bilinear smoothing: - Every pixel value is a weighted average of CDFs from 4 nearest tiles - Continuously varying pixel perfect normalization! 23
  • 28. #EntSAIS18 Recursion / Zoom Out def finish(data, zoom) { write(normalize(data), zoom) if (zoom > 0) { finish(zoomOut(data), zoom-1) } } 28
  • 29. #EntSAIS18 Recursion / Zoom out - Each iteration has ~1/4th as much data - Reduce number of tasks to not waste time - Exponential speedup of stage runtime incredibly satisfying Level 16 written, 136928925282 bytes Level 15 written, 53804156232 bytes Level 14 written, 19795342663 bytes Level 13 written, 6996622566 bytes Level 12 written, 2474617135 bytes Level 11 written, 897467724 bytes Level 10 written, 331303516 bytes Level 9 written, 122600962 bytes Level 8 written, 44957727 bytes Level 7 written, 16254495 bytes Level 6 written, 5728247 bytes Level 5 written, 1978297 bytes Level 4 written, 673066 bytes Level 3 written, 228565 bytes Level 2 written, 78380 bytes Level 1 written, 27417 bytes 29
  • 31. #EntSAIS18 Store+Serve Idea: Group spatially adjacent tiles to get similar file sizes. Consider tile data as quad tree. - Starting from all leafs, walk up tree accumulating total file size at each node. - When file size exceeds threshold, write all tiles as one file. - Direct s3 write (no hadoopFS api) - Repeat for each level of tile data. 31
  • 32. #EntSAIS18 Tile Packing Zoom 2: -(1,0) -(2,2) Zoom 1: -(0,2), (1,2), (0,3), (1,3) -(3,2), (2,3), (3,3) Zoom 0: - Everything else 32
  • 33. #EntSAIS18 Store+Serve Reads: - Reconstruct tree of tile groups in memory - Traverse tree to find tile location - Cache the tile group file-- nearby tiles also likely to be served. - Persisted data is 1 byte per pixel, colormap and PNG applied at read time. - Final rendered PNG also cached by cloudfront 33
  • 34. #EntSAIS18 Clusterer - Discover sets of spatially similar activities using clustering - Compute patterns in aggregated activity properties - Build catalog of routes for every cluster (a few million clusters) 34
  • 35. #EntSAIS18 Clusterer - Locality Sensitive Hashing (SuperMinHash O Ertl, 2017) of quantized lat/lng shingles from activity streams - MinHash pairwise similarity of activities: 10,000x speedup over direct stream based methods - Similarity join on collisions to efficiently find sample of highly similar pairs - Single linkage clustering globally on pairs - Split locally and repeat, rehashing sample of activities in each cluster 35
  • 37. #EntSAIS18 Spark Environment - Spark on mesos on AWS - Job submitted with Metronome/Marathon - Isolated context with job-specific docker image for driver+executors - Mesos cluster running multiple spark contexts - Scale up instances based on unscheduled tasks - Prevent scale down on instances with active executors - 100 i3.2xlarge machines (8cpu, 60gb ram, 1.7tb ssd) - 5 hours for full heatmap build 37
  • 38. #EntSAIS18 Spark Lessons - Parameter tuning seems unavoidable in well utilized distributed systems - With scala, managing application dependency conflicts can be hell (guava, netty, logging frameworks…) - Optimizing locally can be very misleading, must profile on real cluster - Even with fantastic configuration management tools, lots of time wasted on configuration - Building from scratch may be worthwhile to fully understand performance - Invest in generating representative sample data to use in prototypes - Spilling to disk can use a surprising amount of IOPS - Probably still a lot more optimizations possible 38
  • 39. #EntSAIS18 Resources Strava Global Heatmap: strava.com/heatmap Strava Labs: labs.strava.com Strava Engineering Blog: medium.com/strava-engineering We are hiring Data Scientists: strava.com/careers Strava Labs Clusterer*: labs.strava.com/clusterer *Currently offline, will be back soon 39