SlideShare a Scribd company logo
Kelvin Chu, Hadoop Platform, Uber
Gang Wu, Hadoop Platform, Uber
Spark Uber Development Kit
Spark Summit 2016
June 07, 2016
“Transportation as reliable as
running water, everywhere,
for everyone”
Uber Mission
About Us
● Hadoop team of Data Infrastructure at Uber
● Schema systems
● HDFS data lake
● Analytics engines on Hadoop
● Spark computing framework and toolings
Execution Environment
Complexity
Cluster Sizes
20 times
YARN Mesos
Docker JVM
Parquet ORC
Sequence Text
Home Built Services
Hive Kafka ELK
Consequence:
Pretty hard for beginners, sometimes hard for experienced
users too.
Goals:
Multi-Platform: Abstract out environment
Self-Service: Create and run Spark jobs super easily
Reliability: Prevent harm to infrastructure systems
Engineers SRE
API
Tools
Engineers SRE
API
Tools
Easy
Self-Service
Multi-Platform
No Harm
Reliability
• SCBuilder
• Kafka dispersal
• SparkPlug
Engineers SRE
API
Tools
SCBuilder
Encapsulate cluster environment details
● Builder Pattern for SparkContext
● Incentive for users:
○ performance optimized (default can’t pass 100GB)
○ debug optimized (history server, event logs)
○ don’t need to ask around YARN, history servers, HDFS configs
● Best practices enforcement:
○ SRE approved CPU and memory settings
○ resource efficient serialization
Kafka Dispersal
Kafka as data sink of RDD result
● Incentive for users:
○ RDD as first class citizen => parallelization
○ built-in HA
● Best practices enforcement:
○ rate limiting
○ message integrity by schema
○ bad messages tracking
publish(data: RDD, topic: String, schemaId: Int, appId: String)
SparkPlug
Kickstart job development
● A collection of popular job templates
○ Two commands to run the first job in Dev
● One use case per template
○ e.g. Ozzie + SparkSQL + Incremental processing
○ e.g. Incremental processing + Kafka dispersal
● Best Practices
○ built-in unit tests, test coverage, Jenkins
○ built-in Kafka, HDFS mocks
Example Spark Application Data Flow Chart
Query
UI
ELK Search
Dashboards
Report
HIVE
Topic 1
Topic 1
Topic 1
Topic 2
Topic 2
Topic 2
Topic 3
Topic 3
Topic 4
HIVE
Shared
DB
HIVE
Custom
DB
App
Web
UI
App
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
A
B
Title: Job
message
TITLE TITLE TITLE: WORKFLOW ENV
NAME
TITLE: WORKFLOW ENV ID TITLE: LOCATION
• Geo-spatial processing
• SCBuilder
• Kafka dispersal
• SparkPlug
Engineers SRE
API
Tools
Commonly used UDFs
GeoSpatial UDF
within(trip_location, city_shape) contains(geofence, auto_location)
Find if a car is inside a city Find all autos in one area
Commonly used UDFs
GeoSpatial UDF
overlaps(trip1, trip2) intersects(trip_location, gas_location)
Find trips that have similar
routes
Find all gas stations a trip route has
passed by
Objective: associate all trips with city_id for a single day.
SELECT trip.trip_id, city.city_id
FROM trip JOIN city
WHERE contains(city.city_shape, trip.start_location)
AND trip.datestr = ‘2016-06-07’
Spatial Join
Common query at Uber
Spatial Join
Problem
It takes nearly ONE WEEK to run at Uber’s data scale.
1. Spark does not have broadcast join optimization for non-equation join.
2. Not scalable, only one executor is used for cartesian join.
Spatial Join
Build a UDF to broadcast geo-spatial index
Spatial Join
Runtime Index Generation
Index data is small but change often (city table)
Get fields from geo tables (city_id and city_shape)
Build QuadTree or RTree index at Spark Driver
1. Build Index
Spatial Join
Executor Execution
UDF code is part of the Spark UDK jar.
⇒ get_city_id(location), returns city_id of a location
Use the broadcasted spatial index for fast spatial retrieval
2. Broadcast Index
Spatial Join
Runtime UDF Generation
SELECT
trip_id, get_city_id(start_location)
FROM
trip
WHERE
datestr = ‘2016-06-07’
3. Rewrite Query (2 mins only! compared to 1 week before)
• Geo-spatial processing
• SCBuilder
• Kafka dispersal
• SparkChamber
• SparkPlug
Engineers SRE
API
Tools
Spark Debugging
1. Tons of local log files across many
machines.
2. Overall file size is huge and difficult
to be handled by a single machine.
3. Painful for debugging, which log is
useful?
Spark Chamber
Distributed Log Debugger for Spark
Extend Spark Shell by Hooks.
Easy to adopt for Spark developers.
Interactive
Spark Chamber Session
my_user_name
Spark Chamber
Distributed Log Debugger for Spark
1. Get all recent Spark Application IDs.
2. Get first exception, all exceptions grouped by types sorted by time, etc.
3. Display CPU, memory, I/O metrics.
4. Dive into a specific driver/executor/machine
5. Search
Features
Spark Chamber
Distributed Log Debugger for Spark
Developer mode: debug developer’s own Spark job.
SRE mode: view and check all users’ Spark job information.
Security
YARN aggregates log files on HDFS Files are named after host names
All application IDs of the same user are
under same place.
One machine has one log file, regardless
of # executors on that machine.
Spark Chamber
Enable Yarn Log Aggregation
/ tmp / logs / username / logs // tmp / logs / username /
username
username
username
username
username
username
username
username
username
username
username
username
username
username
Spark Chamber
Use Spark to debug Spark
Extend the Spark Shell by Hooks:
1. For ONE application Id, distribute log files to different executors.
2. Extract each lines and save into DataFrame.
3. Sort log dataframe by time and hostname.
4. Retrieve target log via SparkSQL DataFrame APIs.
• Geo-spatial processing
• SCBuilder
• Kafka dispersal
• SparkChamber
• SparkPlug
• SparkChamber
Engineers SRE
API
Tools
Future
Work
Spark Chamber
SRE version - Cluster wide insights
● Dimensions - Jobs
○ All
○ Single team
○ Single engineer
● Dimensions - Time
○ Last month, week, day
● Dimensions - Hardware
○ Specific rack, pod
Spark Chamber
SRE version - Analytics and Machine Learning
● Analytics
○ Resource auditing
○ Data access auditing
● Machine Learning
○ Failures diagnostics
○ Malicious jobs detection
○ Performance optimization
• Geo-spatial processing
• SCBuilder
• Kafka dispersal
• Hive table registration (Didn’t cover today)
• Incremental processing (Didn’t cover
today)
• Debug logging
• Metrics
• Configurations
• Data Freshness
• Resource usage
• SparkChamber
• SparkPlug
• Unit testing (Didn’t cover today)
• Oozie integration (Didn’t cover today)
• SparkChamber
• Resource usage auditing
• Data access auditing
• Machine learning on jobs
Engineers SRE
API
Tools
Future
Work
Today, Tuesday, June 7
4:50 PM – 5:20 PM
Room: Ballroom B
SPARK: INTERACTIVE TO PRODUCTION
Dara Adib, Uber
Tomorrow, Wednesday, June 8
5:25 PM – 5:55 PM
Room: Imperial
Locality Sensitive Hashing by Spark
Alain Rodriguez, Fraud Platform, Uber
Kelvin Chu, Hadoop Platform, Uber
Thank you
Proprietary and confidential © 2016 Uber Technologies, Inc. All rights reserved. No part of this document may be
reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or
by any information storage or retrieval systems, without permission in writing from Uber. This document is intended
only for the use of the individual or entity to whom it is addressed and contains information that is privileged,
confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified
that the information contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person
other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.

More Related Content

What's hot

Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John Musser
Spark Summit
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
Spark Summit
 
Spark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar VeliqiSpark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Databricks
 
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowImproving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Databricks
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Databricks
 
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan FritzAnalytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
Databricks
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Spark Summit
 
Spark Summit EU talk by Jim Dowling
Spark Summit EU talk by Jim DowlingSpark Summit EU talk by Jim Dowling
Spark Summit EU talk by Jim Dowling
Spark Summit
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Databricks
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit
 
Spark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan Kessler
Spark Summit
 
Spark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon Whitear
Spark Summit
 
An Introduction to Sparkling Water by Michal Malohlava
An Introduction to Sparkling Water by Michal MalohlavaAn Introduction to Sparkling Water by Michal Malohlava
An Introduction to Sparkling Water by Michal Malohlava
Spark Summit
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
Jen Aman
 
Databricks: What We Have Learned by Eating Our Dog Food
Databricks: What We Have Learned by Eating Our Dog FoodDatabricks: What We Have Learned by Eating Our Dog Food
Databricks: What We Have Learned by Eating Our Dog Food
Databricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
huguk
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
AKASH SIHAG
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Spark Summit
 

What's hot (20)

Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John Musser
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
 
Spark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar VeliqiSpark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar Veliqi
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
 
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowImproving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan FritzAnalytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
 
Spark Summit EU talk by Jim Dowling
Spark Summit EU talk by Jim DowlingSpark Summit EU talk by Jim Dowling
Spark Summit EU talk by Jim Dowling
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Spark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan Kessler
 
Spark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon Whitear
 
An Introduction to Sparkling Water by Michal Malohlava
An Introduction to Sparkling Water by Michal MalohlavaAn Introduction to Sparkling Water by Michal Malohlava
An Introduction to Sparkling Water by Michal Malohlava
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
Databricks: What We Have Learned by Eating Our Dog Food
Databricks: What We Have Learned by Eating Our Dog FoodDatabricks: What We Have Learned by Eating Our Dog Food
Databricks: What We Have Learned by Eating Our Dog Food
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 

Viewers also liked

Airstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbAirstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At Airbnb
Jen Aman
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
Jen Aman
 
Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Netflix - Productionizing Spark On Yarn For ETL At Petabyte ScaleNetflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Jen Aman
 
Huohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For SparkHuohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For Spark
Jen Aman
 
Big Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudBig Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the Cloud
Jen Aman
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Jen Aman
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Jen Aman
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Databricks
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Databricks
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
Jen Aman
 
Spark at Bloomberg: Dynamically Composable Analytics
Spark at Bloomberg:  Dynamically Composable Analytics Spark at Bloomberg:  Dynamically Composable Analytics
Spark at Bloomberg: Dynamically Composable Analytics
Jen Aman
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
Jen Aman
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
Jen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Jen Aman
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Jen Aman
 
Morticia: Visualizing And Debugging Complex Spark Workflows
Morticia: Visualizing And Debugging Complex Spark WorkflowsMorticia: Visualizing And Debugging Complex Spark Workflows
Morticia: Visualizing And Debugging Complex Spark Workflows
Spark Summit
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
Jen Aman
 
From MapReduce to Apache Spark
From MapReduce to Apache SparkFrom MapReduce to Apache Spark
From MapReduce to Apache Spark
Jen Aman
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
DataWorks Summit/Hadoop Summit
 

Viewers also liked (20)

Airstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbAirstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At Airbnb
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Netflix - Productionizing Spark On Yarn For ETL At Petabyte ScaleNetflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
 
Huohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For SparkHuohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For Spark
 
Big Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudBig Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the Cloud
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
 
Spark at Bloomberg: Dynamically Composable Analytics
Spark at Bloomberg:  Dynamically Composable Analytics Spark at Bloomberg:  Dynamically Composable Analytics
Spark at Bloomberg: Dynamically Composable Analytics
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
 
Morticia: Visualizing And Debugging Complex Spark Workflows
Morticia: Visualizing And Debugging Complex Spark WorkflowsMorticia: Visualizing And Debugging Complex Spark Workflows
Morticia: Visualizing And Debugging Complex Spark Workflows
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
From MapReduce to Apache Spark
From MapReduce to Apache SparkFrom MapReduce to Apache Spark
From MapReduce to Apache Spark
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 

Similar to Spark Uber Development Kit

Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020
Pavel Hardak
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Eren Avşaroğulları
 
Integrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsIntegrating Splunk into your Spring Applications
Integrating Splunk into your Spring Applications
Damien Dallimore
 
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath LakkundiApache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Databricks
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
SoftServe
 
Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)
Amazon Web Services
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
Chris Purrington
 
Using PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of DataUsing PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of Data
Robert Dempsey
 
Distributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop ClustersDistributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop Clusters
DataWorks Summit/Hadoop Summit
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
Paco Nathan
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Olalekan Fuad Elesin
 
(BDT402) Performance Profiling in Production: Analyzing Web Requests at Scale...
(BDT402) Performance Profiling in Production: Analyzing Web Requests at Scale...(BDT402) Performance Profiling in Production: Analyzing Web Requests at Scale...
(BDT402) Performance Profiling in Production: Analyzing Web Requests at Scale...
Amazon Web Services
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
Databricks
 
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Oleksiy Panchenko
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
Databricks
 
Big Data in the Cloud
Big Data in the CloudBig Data in the Cloud
Big Data in the Cloud
Amazon Web Services
 
Modern application development with oracle cloud sangam17
Modern application development with oracle cloud sangam17Modern application development with oracle cloud sangam17
Modern application development with oracle cloud sangam17
Vinay Kumar
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Landon Robinson
 
Google App Engine for Java v0.0.2
Google App Engine for Java v0.0.2Google App Engine for Java v0.0.2
Google App Engine for Java v0.0.2
Matthew McCullough
 

Similar to Spark Uber Development Kit (20)

Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
 
Integrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsIntegrating Splunk into your Spring Applications
Integrating Splunk into your Spring Applications
 
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath LakkundiApache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
 
Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)
 
resumePdf
resumePdfresumePdf
resumePdf
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Using PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of DataUsing PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of Data
 
Distributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop ClustersDistributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop Clusters
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
(BDT402) Performance Profiling in Production: Analyzing Web Requests at Scale...
(BDT402) Performance Profiling in Production: Analyzing Web Requests at Scale...(BDT402) Performance Profiling in Production: Analyzing Web Requests at Scale...
(BDT402) Performance Profiling in Production: Analyzing Web Requests at Scale...
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
 
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Big Data in the Cloud
Big Data in the CloudBig Data in the Cloud
Big Data in the Cloud
 
Modern application development with oracle cloud sangam17
Modern application development with oracle cloud sangam17Modern application development with oracle cloud sangam17
Modern application development with oracle cloud sangam17
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Google App Engine for Java v0.0.2
Google App Engine for Java v0.0.2Google App Engine for Java v0.0.2
Google App Engine for Java v0.0.2
 

More from Jen Aman

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Jen Aman
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
Jen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
Jen Aman
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Jen Aman
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
Jen Aman
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
Jen Aman
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
Jen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
Jen Aman
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
Jen Aman
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
Jen Aman
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
Jen Aman
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To Production
Jen Aman
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
Jen Aman
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In Baidu
Jen Aman
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Jen Aman
 
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Jen Aman
 
Utilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningUtilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine Learning
Jen Aman
 

More from Jen Aman (18)

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To Production
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In Baidu
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
 
Utilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningUtilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine Learning
 

Recently uploaded

1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 

Recently uploaded (20)

1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 

Spark Uber Development Kit

  • 1. Kelvin Chu, Hadoop Platform, Uber Gang Wu, Hadoop Platform, Uber Spark Uber Development Kit Spark Summit 2016 June 07, 2016
  • 2. “Transportation as reliable as running water, everywhere, for everyone” Uber Mission
  • 3. About Us ● Hadoop team of Data Infrastructure at Uber ● Schema systems ● HDFS data lake ● Analytics engines on Hadoop ● Spark computing framework and toolings
  • 10. Consequence: Pretty hard for beginners, sometimes hard for experienced users too.
  • 11. Goals: Multi-Platform: Abstract out environment Self-Service: Create and run Spark jobs super easily Reliability: Prevent harm to infrastructure systems
  • 14. • SCBuilder • Kafka dispersal • SparkPlug Engineers SRE API Tools
  • 15. SCBuilder Encapsulate cluster environment details ● Builder Pattern for SparkContext ● Incentive for users: ○ performance optimized (default can’t pass 100GB) ○ debug optimized (history server, event logs) ○ don’t need to ask around YARN, history servers, HDFS configs ● Best practices enforcement: ○ SRE approved CPU and memory settings ○ resource efficient serialization
  • 16. Kafka Dispersal Kafka as data sink of RDD result ● Incentive for users: ○ RDD as first class citizen => parallelization ○ built-in HA ● Best practices enforcement: ○ rate limiting ○ message integrity by schema ○ bad messages tracking publish(data: RDD, topic: String, schemaId: Int, appId: String)
  • 17. SparkPlug Kickstart job development ● A collection of popular job templates ○ Two commands to run the first job in Dev ● One use case per template ○ e.g. Ozzie + SparkSQL + Incremental processing ○ e.g. Incremental processing + Kafka dispersal ● Best Practices ○ built-in unit tests, test coverage, Jenkins ○ built-in Kafka, HDFS mocks
  • 18. Example Spark Application Data Flow Chart Query UI ELK Search Dashboards Report HIVE Topic 1 Topic 1 Topic 1 Topic 2 Topic 2 Topic 2 Topic 3 Topic 3 Topic 4 HIVE Shared DB HIVE Custom DB App Web UI App
  • 20. • Geo-spatial processing • SCBuilder • Kafka dispersal • SparkPlug Engineers SRE API Tools
  • 21. Commonly used UDFs GeoSpatial UDF within(trip_location, city_shape) contains(geofence, auto_location) Find if a car is inside a city Find all autos in one area
  • 22. Commonly used UDFs GeoSpatial UDF overlaps(trip1, trip2) intersects(trip_location, gas_location) Find trips that have similar routes Find all gas stations a trip route has passed by
  • 23. Objective: associate all trips with city_id for a single day. SELECT trip.trip_id, city.city_id FROM trip JOIN city WHERE contains(city.city_shape, trip.start_location) AND trip.datestr = ‘2016-06-07’ Spatial Join Common query at Uber
  • 24. Spatial Join Problem It takes nearly ONE WEEK to run at Uber’s data scale. 1. Spark does not have broadcast join optimization for non-equation join. 2. Not scalable, only one executor is used for cartesian join.
  • 25. Spatial Join Build a UDF to broadcast geo-spatial index
  • 26. Spatial Join Runtime Index Generation Index data is small but change often (city table) Get fields from geo tables (city_id and city_shape) Build QuadTree or RTree index at Spark Driver 1. Build Index
  • 27. Spatial Join Executor Execution UDF code is part of the Spark UDK jar. ⇒ get_city_id(location), returns city_id of a location Use the broadcasted spatial index for fast spatial retrieval 2. Broadcast Index
  • 28. Spatial Join Runtime UDF Generation SELECT trip_id, get_city_id(start_location) FROM trip WHERE datestr = ‘2016-06-07’ 3. Rewrite Query (2 mins only! compared to 1 week before)
  • 29. • Geo-spatial processing • SCBuilder • Kafka dispersal • SparkChamber • SparkPlug Engineers SRE API Tools
  • 30. Spark Debugging 1. Tons of local log files across many machines. 2. Overall file size is huge and difficult to be handled by a single machine. 3. Painful for debugging, which log is useful?
  • 31. Spark Chamber Distributed Log Debugger for Spark Extend Spark Shell by Hooks. Easy to adopt for Spark developers. Interactive
  • 34.
  • 35.
  • 36.
  • 37.
  • 38. Spark Chamber Distributed Log Debugger for Spark 1. Get all recent Spark Application IDs. 2. Get first exception, all exceptions grouped by types sorted by time, etc. 3. Display CPU, memory, I/O metrics. 4. Dive into a specific driver/executor/machine 5. Search Features
  • 39. Spark Chamber Distributed Log Debugger for Spark Developer mode: debug developer’s own Spark job. SRE mode: view and check all users’ Spark job information. Security
  • 40. YARN aggregates log files on HDFS Files are named after host names All application IDs of the same user are under same place. One machine has one log file, regardless of # executors on that machine. Spark Chamber Enable Yarn Log Aggregation / tmp / logs / username / logs // tmp / logs / username / username username username username username username username username username username username username username username
  • 41. Spark Chamber Use Spark to debug Spark Extend the Spark Shell by Hooks: 1. For ONE application Id, distribute log files to different executors. 2. Extract each lines and save into DataFrame. 3. Sort log dataframe by time and hostname. 4. Retrieve target log via SparkSQL DataFrame APIs.
  • 42. • Geo-spatial processing • SCBuilder • Kafka dispersal • SparkChamber • SparkPlug • SparkChamber Engineers SRE API Tools Future Work
  • 43. Spark Chamber SRE version - Cluster wide insights ● Dimensions - Jobs ○ All ○ Single team ○ Single engineer ● Dimensions - Time ○ Last month, week, day ● Dimensions - Hardware ○ Specific rack, pod
  • 44. Spark Chamber SRE version - Analytics and Machine Learning ● Analytics ○ Resource auditing ○ Data access auditing ● Machine Learning ○ Failures diagnostics ○ Malicious jobs detection ○ Performance optimization
  • 45. • Geo-spatial processing • SCBuilder • Kafka dispersal • Hive table registration (Didn’t cover today) • Incremental processing (Didn’t cover today) • Debug logging • Metrics • Configurations • Data Freshness • Resource usage • SparkChamber • SparkPlug • Unit testing (Didn’t cover today) • Oozie integration (Didn’t cover today) • SparkChamber • Resource usage auditing • Data access auditing • Machine learning on jobs Engineers SRE API Tools Future Work
  • 46. Today, Tuesday, June 7 4:50 PM – 5:20 PM Room: Ballroom B SPARK: INTERACTIVE TO PRODUCTION Dara Adib, Uber
  • 47. Tomorrow, Wednesday, June 8 5:25 PM – 5:55 PM Room: Imperial Locality Sensitive Hashing by Spark Alain Rodriguez, Fraud Platform, Uber Kelvin Chu, Hadoop Platform, Uber
  • 48. Thank you Proprietary and confidential © 2016 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.