SlideShare a Scribd company logo
1 of 20
Download to read offline
Pulkit Bhanot, Amit Nene
Risk Platform
Large-scale Feature Aggregation Using
Apache Spark
#Dev1SAIS
#Dev1SAIS
• Motivation
• Challenges
• Architecture Deep Dive
• Role of Spark
• Takeaways
2
Agenda
#Dev1SAIS
Build a scalable, self-service Feature Engineering Platform for
predictive decisioning (based on ML and Business Rules)
• Feature Engineering: use of domain knowledge to create Features
• Self-Service for Data Scientists and Analysts without reliance on Engineers
• Generic Platform: consolidating work towards wider ML efforts at Uber
3
Team Mission
#Dev1SAIS
Promotions
Detect and prevent bad actors in real-time
number of trips over X hours/weeks
trips cancelled over Y months
count of referrals over lifetime
...
4
Payments
Sample use case: Fraud detection
#Dev1SAIS
Indexed
databases
Streaming
aggregations
Needs
• Lifetime of entity
• Sliding long-window: days/weeks/months
• Sliding short-window: mins/hours
• Real-time
Existing solutions
• None satisfies all of above
• Complex onboarding
Warehouse
None
fits the
bill!
5
Time Series Aggregations
#Dev1SAIS
• Scale: 1000s of aggregations for 100s million of
business entities
• Long-window aggregation queries slow even with
indexes (seconds). Millis at high QPS needed.
• Onboarding complexity: many moving parts
• Recovery from accrual of errors
6
Technical Challenges
#Dev1SAIS
One-stop shop for aggregation
• Single system to interact with
• Single spec: automate configurations of underlying system
Scalable
• Leverage the scale of Batch system for long window
• Combine with real-time aggregation for freshness
• Rollups: aggregate over time intervals
• Fast query over rolled up aggregates
• Caveat: summable functions
Self-healing
• Batch system auto-corrects any errors
7
Our approach
#Dev1SAIS
Aggregator
Aggregated
Features
Raw Events:
Streaming+Hive
Aggregation
Function:
sum, count, etc.
Aggregation
Window:
LTD, 7 days, 30
days, etc.
Grain Size:
5 min (realtime),
1 day (offline), etc.
Input parameters to black
box
➔ Source events
➔ Grain size
➔ Aggregation
functions
➔ Aggregation windows
8
Aggregator as a black box
#Dev1SAIS
Specs
Feature Store
Batch
Aggregator
(Spark Apps)
Real-time
Aggregator
(Streaming)
Feature Access
(Microservice)
• Batch (Spark)
– Long-window: weeks, months
– Bootstrap, incremental modes
• Streaming (e.g. Kafka events)
– Short-window (<24 hrs)
– Near-real time
• Real-time Access
– Merge offline and streaming
• Feature Store
– Save rolled-up aggregates in Hive
and Cassandra
9
1
1
2
2
3
4
Overall architecture
#Dev1SAIS
Computation
Hive
Specs
Batch Aggregator (Spark Apps)
Feature Store
(Cassandra)
Feature
Extractor
Scheduler
Rollup
Generator
Bootstrap
Periodic
Snapshot
Manager
Optimizer
Feature
Access
Tbl1:<2018-04-10>
Tb1:<2018-04-09>
Optimized
Snapshot
Full Snapshot
Incremental
Snapshot
Dispersal
Decisioning
System
10
Batch Spark Engine
1
2
3
4
5
6
7
8
9
10
Optimized
Snapshot
#Dev1SAIS
Hive
Tbl1:<2018-04-13> ATable-1_daily:<2018-04-13>
ATable-1_LTD:<2018-04-13>
Table-1
-partition-<2018-04-10>
col1, col2, col3, col4
-partition-<2018-04-11>
col1, col2, col3, col4
-partition-<2018-04-12>
col1, col2, col3, col4
-partition-<2018-04-13>
col1, col2, col3, col4
Tbl1:<2018-04-10>
Tbl1:<2018-04-13>
…….
ATable-1_Lifetime
-partition-<2018-04-10>
uuid, f1_ltd, f2_ltd
-partition-<2018-04-11>
uuid, f1_ltd, f2_ltd
-partition-<2018-04-12>
uuid, f1_ltd, f2_ltd
-partition-<2018-04-13>
uuid, f1_ltd, f2_ltd
ATable-1_daily
-partition-<2018-04-10>
uuid, f1, f2
-partition-<2018-04-11>
uuid, f1, f2
-partition-<2018-04-12>
uuid, f1, f2
-partition-<2018-04-13>
uuid, f1, f2
ATable-1_joined
-partition-<2018-04-10>
uuid, f1, f2, f1_ltd, f2_ltd
-partition-<2018-04-11>
uuid, f1, f2, f1_ltd, f2_ltd
-partition-<2018-04-12>
uuid, f1, f2, f1_ltd, f2_ltd
-partition-<2018-04-13>
uuid, f1, f2, f1_ltd, f2_ltd
Daily Partitioned
Source Tables
Rolled-up Tables
Features involving Lifetime computation.
Features involving sliding window
computation.
Dispersed to real-
time store
11
Batch Storage
Daily Lifetime
Snapshot
Daily Incremental
Rollup
#Dev1SAIS 12
● Orchestrator of ETL pipelines
○ Scheduling of subtasks
○ Record incremental progress
● Optimally resize HDFS files: scale with
size of data set.
● Rich set of APIs to enable complex
optimizations
e.g of an optimization in bootstrap dispersal
dailyDataset.join(
ltdData,
JavaConverters.asScalaIteratorConverter(
Arrays.asList(pipelineConfig.getEntityKey()).iterator())
.asScala()
.toSeq(),
"outer");
uuid _ltd daily_buckets
44b7dc88 1534 [{"2017-10-24":"4"},{"2017-08-
22":"3"},{"2017-09-21":"4"},{"2017-
08-08":"3"},{"2017-10-
03":"3"},{"2017-10-19":"5"},{"2017-
09-06":"1"},{"2017-08-
17":"5"},{"2017-09-09":"12"},{"2017-
10-05":"5"},{"2017-09-
25":"4"},{"2017-09-17":"13"}]
Role of Spark
#Dev1SAIS 13
• Ability to disperse billions of records
– HashPartitioner to the rescue
//Partition the data by hash
HashPartitioner hashPartitioner = new HashPartitioner(partitionNumber);
JavaPairRDD<String, Row> hashedRDD = keyedRDD.partitionBy(hashPartitioner);
//Fetch each hash partition and process
foreach partition{
JavaRDD<Tuple2<String, Row>> filteredHashRDD = filterRows(hashedRDD, index, paritionId);
raise error if partition mismatch
Dataset<Row> filteredDataSet =
etlContext.getSparkSession().createDataset(filteredHashRDD.map(tuple -> tuple._2()).rdd(),
data.org$apache$spark$sql$Dataset$$encoder);
//repartition filteredDataSet, update checkpoint and records processed after successful
completion.
Paren
t
RDD
P1
P2
P3
Pn
…..
Process
Each
Partition
Role of Spark (continued)
#Dev1SAIS 14
2018-02-01
2018-02-02
2018-02-03
2018-03-01
….
Dispersal C*
2018-02-01
2018-02-02
2018-02-03
2018-03-01
Bootstrap
• Global Throttling
– Feature Store can be the bottleneck
– coalesce() to limit the executors
• Inspect data
– Disperse only if any column has
changed
• Monitoring and alert
– create custom metrics
Role of Spark in Dispersal
Full computation snapshots
Optimized snapshots
#Dev1SAIS
● Real-time, summable
aggregations for < 24 hours
● Semantically equivalent to
offline computation
● Aggregation rollups (5 mins)
maintained in feature store
(Cassandra)
event
enrichment
raw kafka
events
microservices
xform_0
xform_1
xform_2
streaming
computation
pipelines
time
window
aggregator
C*
RPCs
Uber Athena streaming
15
Real-time streaming engine
#Dev1SAIS
● Uses time series and clustering
key support in Cassandra
○ 1 table for Lifetime & LTD
values.
○ Multiple tables for realtime
values with grain size 5M
● Consult metadata and assemble
into single result at feature
access time
entity_common_aggr_bt_ltd
UUID
(PK)
trip_count_ltd
entity_common_aggr_bt
UUID
(PK)
eventbucket
(CK)
trip_count
entity_common_aggr_rt_2018_05_08
entity_common_aggr_rt_2018_05_09
entity_common_aggr_rt_2018_05_10
UUID
(PK)
eventbucket
(CK)
trip_coun
t
Feature access Service
Metadata
Service
Query
Planner
e.g - lifetime trip count
- trips over last 51 hrs
- trips over previous 2
days
16
Final aggregation in real time
#Dev1SAIS
Create Query
(Spark SQL)
Configure Spec
Commit to Prod
Test Spec
17
Self-service onboarding
#Dev1SAIS
Backfill Support: what is the value of a feature f1 for
an entity E1 from Thist to Tnow
• Bootstrap to historic point in time: Thist
• Incrementally compute from Thist to Tnow
How ?
• Lifetime: feature f1 on Thist access partition Thist
• Windowed: feature f2 on Thist with window N days
• Merge partitions Thist-N to Thist
18
Machine learning support
T-120
T-119
T-90
T-1
T
…..
…..
Lifetime
value on a
given date
Last 30 day
trips at T-90
Last 30 day
trips at T-89
#Dev1SAIS
● Use of Spark to achieve massive scale
● Combine with Streaming aggregation for freshness
● Low latency access in production (P99 <= 20ms) at high QPS
● Simplify onboarding via single spec, onboarding time in hours
● Huge computational cost improvements
19
Takeaways
Proprietary and confidential © 2018 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information
to any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.
bhanotp@uber.com
anene@uber.com

More Related Content

What's hot

An Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDBAn Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDBMongoDB
 
Introduction to HBase - NoSqlNow2015
Introduction to HBase - NoSqlNow2015Introduction to HBase - NoSqlNow2015
Introduction to HBase - NoSqlNow2015Apekshit Sharma
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaGuido Schmutz
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouseJames Serra
 
Data Architecture Brief Overview
Data Architecture Brief OverviewData Architecture Brief Overview
Data Architecture Brief OverviewHal Kalechofsky
 
Data Migration with Spark to Hive
Data Migration with Spark to HiveData Migration with Spark to Hive
Data Migration with Spark to HiveDatabricks
 
Big Query - Utilizing Google Data Warehouse for Media Analytics
Big Query - Utilizing Google Data Warehouse for Media AnalyticsBig Query - Utilizing Google Data Warehouse for Media Analytics
Big Query - Utilizing Google Data Warehouse for Media Analyticshafeeznazri
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSAmazon Web Services
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaThe Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaKai Wähner
 
Scalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDBScalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDBAlluxio, Inc.
 
Power BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudPower BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudKellyn Pot'Vin-Gorman
 
The Top 5 Apache Kafka Use Cases and Architectures in 2022
The Top 5 Apache Kafka Use Cases and Architectures in 2022The Top 5 Apache Kafka Use Cases and Architectures in 2022
The Top 5 Apache Kafka Use Cases and Architectures in 2022Kai Wähner
 
Bridge to Cloud: Using Apache Kafka to Migrate to GCP
Bridge to Cloud: Using Apache Kafka to Migrate to GCPBridge to Cloud: Using Apache Kafka to Migrate to GCP
Bridge to Cloud: Using Apache Kafka to Migrate to GCPconfluent
 
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...Dr. Arif Wider
 

What's hot (20)

An Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDBAn Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDB
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
Introduction to HBase - NoSqlNow2015
Introduction to HBase - NoSqlNow2015Introduction to HBase - NoSqlNow2015
Introduction to HBase - NoSqlNow2015
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Data Architecture Brief Overview
Data Architecture Brief OverviewData Architecture Brief Overview
Data Architecture Brief Overview
 
Data Migration with Spark to Hive
Data Migration with Spark to HiveData Migration with Spark to Hive
Data Migration with Spark to Hive
 
Big Query - Utilizing Google Data Warehouse for Media Analytics
Big Query - Utilizing Google Data Warehouse for Media AnalyticsBig Query - Utilizing Google Data Warehouse for Media Analytics
Big Query - Utilizing Google Data Warehouse for Media Analytics
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
Data Mesh 101
Data Mesh 101Data Mesh 101
Data Mesh 101
 
Apache Spark MLlib
Apache Spark MLlib Apache Spark MLlib
Apache Spark MLlib
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaThe Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
 
Scalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDBScalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDB
 
Power BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudPower BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle Cloud
 
The Top 5 Apache Kafka Use Cases and Architectures in 2022
The Top 5 Apache Kafka Use Cases and Architectures in 2022The Top 5 Apache Kafka Use Cases and Architectures in 2022
The Top 5 Apache Kafka Use Cases and Architectures in 2022
 
Kafka internals
Kafka internalsKafka internals
Kafka internals
 
Bridge to Cloud: Using Apache Kafka to Migrate to GCP
Bridge to Cloud: Using Apache Kafka to Migrate to GCPBridge to Cloud: Using Apache Kafka to Migrate to GCP
Bridge to Cloud: Using Apache Kafka to Migrate to GCP
 
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
 
Gcp data engineer
Gcp data engineerGcp data engineer
Gcp data engineer
 

Similar to Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Amit Nene

Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink Forward
 
Flink Forward San Francisco 2018: Robert Metzger & Patrick Lucas - "dA Platfo...
Flink Forward San Francisco 2018: Robert Metzger & Patrick Lucas - "dA Platfo...Flink Forward San Francisco 2018: Robert Metzger & Patrick Lucas - "dA Platfo...
Flink Forward San Francisco 2018: Robert Metzger & Patrick Lucas - "dA Platfo...Flink Forward
 
Uber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkUber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkWenrui Meng
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsClaudiu Barbura
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerFederico Palladoro
 
Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...
Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...
Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...Pierre GRANDIN
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_SummaryHiram Fleitas León
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...GetInData
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesSeungYong Oh
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futuremarkgrover
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesKarthik Murugesan
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Giridhar Addepalli
 
Adtech x Scala x Performance tuning
Adtech x Scala x Performance tuningAdtech x Scala x Performance tuning
Adtech x Scala x Performance tuningYosuke Mizutani
 
Real-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on KubernetesReal-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on KubernetesDatabricks
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...NETWAYS
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidTony Ng
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Etu Solution
 

Similar to Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Amit Nene (20)

Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
dA Platform Overview
dA Platform OverviewdA Platform Overview
dA Platform Overview
 
Flink Forward San Francisco 2018: Robert Metzger & Patrick Lucas - "dA Platfo...
Flink Forward San Francisco 2018: Robert Metzger & Patrick Lucas - "dA Platfo...Flink Forward San Francisco 2018: Robert Metzger & Patrick Lucas - "dA Platfo...
Flink Forward San Francisco 2018: Robert Metzger & Patrick Lucas - "dA Platfo...
 
Uber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkUber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache Flink
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatterns
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
 
Shaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M ResumeShaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M Resume
 
Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...
Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...
Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01
 
Adtech x Scala x Performance tuning
Adtech x Scala x Performance tuningAdtech x Scala x Performance tuning
Adtech x Scala x Performance tuning
 
Real-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on KubernetesReal-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on Kubernetes
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 

Recently uploaded (20)

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 

Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Amit Nene

  • 1. Pulkit Bhanot, Amit Nene Risk Platform Large-scale Feature Aggregation Using Apache Spark #Dev1SAIS
  • 2. #Dev1SAIS • Motivation • Challenges • Architecture Deep Dive • Role of Spark • Takeaways 2 Agenda
  • 3. #Dev1SAIS Build a scalable, self-service Feature Engineering Platform for predictive decisioning (based on ML and Business Rules) • Feature Engineering: use of domain knowledge to create Features • Self-Service for Data Scientists and Analysts without reliance on Engineers • Generic Platform: consolidating work towards wider ML efforts at Uber 3 Team Mission
  • 4. #Dev1SAIS Promotions Detect and prevent bad actors in real-time number of trips over X hours/weeks trips cancelled over Y months count of referrals over lifetime ... 4 Payments Sample use case: Fraud detection
  • 5. #Dev1SAIS Indexed databases Streaming aggregations Needs • Lifetime of entity • Sliding long-window: days/weeks/months • Sliding short-window: mins/hours • Real-time Existing solutions • None satisfies all of above • Complex onboarding Warehouse None fits the bill! 5 Time Series Aggregations
  • 6. #Dev1SAIS • Scale: 1000s of aggregations for 100s million of business entities • Long-window aggregation queries slow even with indexes (seconds). Millis at high QPS needed. • Onboarding complexity: many moving parts • Recovery from accrual of errors 6 Technical Challenges
  • 7. #Dev1SAIS One-stop shop for aggregation • Single system to interact with • Single spec: automate configurations of underlying system Scalable • Leverage the scale of Batch system for long window • Combine with real-time aggregation for freshness • Rollups: aggregate over time intervals • Fast query over rolled up aggregates • Caveat: summable functions Self-healing • Batch system auto-corrects any errors 7 Our approach
  • 8. #Dev1SAIS Aggregator Aggregated Features Raw Events: Streaming+Hive Aggregation Function: sum, count, etc. Aggregation Window: LTD, 7 days, 30 days, etc. Grain Size: 5 min (realtime), 1 day (offline), etc. Input parameters to black box ➔ Source events ➔ Grain size ➔ Aggregation functions ➔ Aggregation windows 8 Aggregator as a black box
  • 9. #Dev1SAIS Specs Feature Store Batch Aggregator (Spark Apps) Real-time Aggregator (Streaming) Feature Access (Microservice) • Batch (Spark) – Long-window: weeks, months – Bootstrap, incremental modes • Streaming (e.g. Kafka events) – Short-window (<24 hrs) – Near-real time • Real-time Access – Merge offline and streaming • Feature Store – Save rolled-up aggregates in Hive and Cassandra 9 1 1 2 2 3 4 Overall architecture
  • 10. #Dev1SAIS Computation Hive Specs Batch Aggregator (Spark Apps) Feature Store (Cassandra) Feature Extractor Scheduler Rollup Generator Bootstrap Periodic Snapshot Manager Optimizer Feature Access Tbl1:<2018-04-10> Tb1:<2018-04-09> Optimized Snapshot Full Snapshot Incremental Snapshot Dispersal Decisioning System 10 Batch Spark Engine 1 2 3 4 5 6 7 8 9 10 Optimized Snapshot
  • 11. #Dev1SAIS Hive Tbl1:<2018-04-13> ATable-1_daily:<2018-04-13> ATable-1_LTD:<2018-04-13> Table-1 -partition-<2018-04-10> col1, col2, col3, col4 -partition-<2018-04-11> col1, col2, col3, col4 -partition-<2018-04-12> col1, col2, col3, col4 -partition-<2018-04-13> col1, col2, col3, col4 Tbl1:<2018-04-10> Tbl1:<2018-04-13> ……. ATable-1_Lifetime -partition-<2018-04-10> uuid, f1_ltd, f2_ltd -partition-<2018-04-11> uuid, f1_ltd, f2_ltd -partition-<2018-04-12> uuid, f1_ltd, f2_ltd -partition-<2018-04-13> uuid, f1_ltd, f2_ltd ATable-1_daily -partition-<2018-04-10> uuid, f1, f2 -partition-<2018-04-11> uuid, f1, f2 -partition-<2018-04-12> uuid, f1, f2 -partition-<2018-04-13> uuid, f1, f2 ATable-1_joined -partition-<2018-04-10> uuid, f1, f2, f1_ltd, f2_ltd -partition-<2018-04-11> uuid, f1, f2, f1_ltd, f2_ltd -partition-<2018-04-12> uuid, f1, f2, f1_ltd, f2_ltd -partition-<2018-04-13> uuid, f1, f2, f1_ltd, f2_ltd Daily Partitioned Source Tables Rolled-up Tables Features involving Lifetime computation. Features involving sliding window computation. Dispersed to real- time store 11 Batch Storage Daily Lifetime Snapshot Daily Incremental Rollup
  • 12. #Dev1SAIS 12 ● Orchestrator of ETL pipelines ○ Scheduling of subtasks ○ Record incremental progress ● Optimally resize HDFS files: scale with size of data set. ● Rich set of APIs to enable complex optimizations e.g of an optimization in bootstrap dispersal dailyDataset.join( ltdData, JavaConverters.asScalaIteratorConverter( Arrays.asList(pipelineConfig.getEntityKey()).iterator()) .asScala() .toSeq(), "outer"); uuid _ltd daily_buckets 44b7dc88 1534 [{"2017-10-24":"4"},{"2017-08- 22":"3"},{"2017-09-21":"4"},{"2017- 08-08":"3"},{"2017-10- 03":"3"},{"2017-10-19":"5"},{"2017- 09-06":"1"},{"2017-08- 17":"5"},{"2017-09-09":"12"},{"2017- 10-05":"5"},{"2017-09- 25":"4"},{"2017-09-17":"13"}] Role of Spark
  • 13. #Dev1SAIS 13 • Ability to disperse billions of records – HashPartitioner to the rescue //Partition the data by hash HashPartitioner hashPartitioner = new HashPartitioner(partitionNumber); JavaPairRDD<String, Row> hashedRDD = keyedRDD.partitionBy(hashPartitioner); //Fetch each hash partition and process foreach partition{ JavaRDD<Tuple2<String, Row>> filteredHashRDD = filterRows(hashedRDD, index, paritionId); raise error if partition mismatch Dataset<Row> filteredDataSet = etlContext.getSparkSession().createDataset(filteredHashRDD.map(tuple -> tuple._2()).rdd(), data.org$apache$spark$sql$Dataset$$encoder); //repartition filteredDataSet, update checkpoint and records processed after successful completion. Paren t RDD P1 P2 P3 Pn ….. Process Each Partition Role of Spark (continued)
  • 14. #Dev1SAIS 14 2018-02-01 2018-02-02 2018-02-03 2018-03-01 …. Dispersal C* 2018-02-01 2018-02-02 2018-02-03 2018-03-01 Bootstrap • Global Throttling – Feature Store can be the bottleneck – coalesce() to limit the executors • Inspect data – Disperse only if any column has changed • Monitoring and alert – create custom metrics Role of Spark in Dispersal Full computation snapshots Optimized snapshots
  • 15. #Dev1SAIS ● Real-time, summable aggregations for < 24 hours ● Semantically equivalent to offline computation ● Aggregation rollups (5 mins) maintained in feature store (Cassandra) event enrichment raw kafka events microservices xform_0 xform_1 xform_2 streaming computation pipelines time window aggregator C* RPCs Uber Athena streaming 15 Real-time streaming engine
  • 16. #Dev1SAIS ● Uses time series and clustering key support in Cassandra ○ 1 table for Lifetime & LTD values. ○ Multiple tables for realtime values with grain size 5M ● Consult metadata and assemble into single result at feature access time entity_common_aggr_bt_ltd UUID (PK) trip_count_ltd entity_common_aggr_bt UUID (PK) eventbucket (CK) trip_count entity_common_aggr_rt_2018_05_08 entity_common_aggr_rt_2018_05_09 entity_common_aggr_rt_2018_05_10 UUID (PK) eventbucket (CK) trip_coun t Feature access Service Metadata Service Query Planner e.g - lifetime trip count - trips over last 51 hrs - trips over previous 2 days 16 Final aggregation in real time
  • 17. #Dev1SAIS Create Query (Spark SQL) Configure Spec Commit to Prod Test Spec 17 Self-service onboarding
  • 18. #Dev1SAIS Backfill Support: what is the value of a feature f1 for an entity E1 from Thist to Tnow • Bootstrap to historic point in time: Thist • Incrementally compute from Thist to Tnow How ? • Lifetime: feature f1 on Thist access partition Thist • Windowed: feature f2 on Thist with window N days • Merge partitions Thist-N to Thist 18 Machine learning support T-120 T-119 T-90 T-1 T ….. ….. Lifetime value on a given date Last 30 day trips at T-90 Last 30 day trips at T-89
  • 19. #Dev1SAIS ● Use of Spark to achieve massive scale ● Combine with Streaming aggregation for freshness ● Low latency access in production (P99 <= 20ms) at high QPS ● Simplify onboarding via single spec, onboarding time in hours ● Huge computational cost improvements 19 Takeaways
  • 20. Proprietary and confidential © 2018 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber. bhanotp@uber.com anene@uber.com