A Tale of Lambdas, Kappas & Pancakes
@osamakhn
Who am I?
Osama Khan
Big Data Engineer @ACLServices
Grad Student @GTComputing
AWS Big Data Specialist+
Vancouver, BC
Java, C# (met thru J#), Python
Golang, NodeJS, Scala
Previously: Robot Soccer, BI, Credit Rating, AML,
O&G Portfolio, NLP/Governance, Doctor Triage,
Energy Monitoring, Consulting, Private Equity
Recently: Data/ML Pipeline, Tools & Platforms
What am I going to talk
about today?
The goal of this talk is to provide a high level overview
of the big data landscape to help software engineers
distinguish signal from noise
1) How big is BIG: Let's get our scales recalibrated to
understand what we mean by BIG Data
2) Lineage: Evolution of the Big Data ecosystem; from
EDW to Data Lakes
3) Lambda & Kappa Architectures: The foundation
of data pipelines and machine learning systems
4) Technology Choices: SMACK that PANCAKE
BUT I ❤ Serverless
5) Demos: Athena, EMR, Redshift, Quicksight,
Sagemaker, ModelDB
How big is BIG?
Big is BIG when Bieber breaks the Google Cloud (wat!?!)
Let's get our scales recalibrated to understand what we
mean by BIG Data
This is BIG …
§ 390 Hyperscale Datacenters ( < 300, 2016)
§ Hyperscale == (5k servers, 10k sq.ft space)
§ > 400, 100M+ total servers
§ 56% web content in English
§ 8,000 languages spoken globally
§ Hello, friend.
§ 100M+ active users, 40M+ subscribers
§ 30M+ songs, 20K new per day, 2B+ playlists, 1B+ plays per day
§ 2,500 node Hadoop cluster, 100 PB+ Disk, 100TB+ RAM
§ 60TB+ per day log ingestion, 20k+ jobs per day
§ Listening History Query
§ user x track x [day/week/month/all time]
§ 300B elements
§ 800 workers, 32 core, 208 GB ram
§ 240TB in, 90TB out
§ Top Tracks in Vancouver (June 2017)
§ 30 date partitioned tables, 60TB data
§ 1 metadata table, 418GB
§ 94.2s, 4.82TB processed
§ Despacito – Remix (Luis Fonsi)
§ (2017)
§ 2.8 B+ US Tweets
§ Donald Trump (901.8 M)
§ Hillary Clinton (123.2 M)
§ Mike Pence (31.4 M)
§ 30x more than VP, 7x more than opponent
(2013)
§ 170M individual metrics (timeseries) per minute
§ 200M queries served/day, 47 charts/user
Lineage
Evolution of the Big Data ecosystem; from EDW to
Data Lakes
A journey from ETL to Distributed Transactions via the ELT alley…
[Timeline figure, 2000 to 2017; years marked: 2000, 2003, 2004, 2006, 2007, 2009, 2010, 2011, 2012, 2014, 2017]
Milestones shown: CAP Theorem; Google File System; Google MapReduce; 2006: Yahoo! Hadoop; AWS DynamoDB; UC Berkeley Spark; FB Cassandra; Google Dremel; LinkedIn Kafka; 2012: Google Spanner; Lambda, Kappa Architectures; FaunaDB, Aurora
Vendors shown: IBM, Oracle, MSFT, Teradata, SAP; AWS, GCP, Azure, Hana
Occupy the Cloud:
Distributed
Computing for the
99%
Big Data Landscape
Distributed systems rule the world!
Yet Another Big Data Framework (YABDF)
Doesn’t fit on a slide or two … and you
thought you had library fatigue in the JS
world!
http://mattturck.com/wp-content/uploads/2017/05/Matt-Turck-FirstMark-2017-Big-Data-Landscape.png
Lambda & Kappa Architectures
The foundation of data pipelines for enterprise insights
Lambda Architecture: First Principles & Desired Properties
Data: special information from which everything else is derived
Information: processed data
Data System: query = function(ALL_DATA)
1. Robustness & Fault Tolerance
2. Low Latency Read & Update
3. Scalability
4. Generalization
5. Ad hoc Queries
6. Minimal Maintenance
7. Debuggability
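As a rough illustration of `query = function(ALL_DATA)`, the sketch below answers a question by scanning every raw record. The event shape (`user`, `track`), the sample records, and the function name `top_track` are assumptions made for this example and do not come from the talk.

```python
from collections import Counter

# Hypothetical master dataset: one immutable record per play event.
ALL_DATA = [
    {"user": "a", "track": "despacito"},
    {"user": "b", "track": "despacito"},
    {"user": "a", "track": "hello"},
]

def top_track(all_data):
    """query = function(ALL_DATA): derive the answer from every raw record."""
    counts = Counter(event["track"] for event in all_data)
    # most_common(1) returns a list holding one (track, count) pair.
    return counts.most_common(1)[0][0]

print(top_track(ALL_DATA))  # despacito
```

Scanning everything on each query is simple and easy to debug, but it gets slower as the dataset grows, which is the gap the batch, serving, and speed layers are meant to close.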
The Lambda Architecture
[Diagram: new data feeds both the batch layer (master dataset, ∑ ALL_DATA) and the speed layer (real-time views, Δ NEW_DATA); the serving layer holds the batch views; queries read from the batch views and the real-time views]
The Lambda Architecture (in the enterprise)
[Diagram: the Lambda Architecture (new data, master dataset, batch layer, serving layer with batch views, speed layer with real-time views, queries) framed by enterprise concerns: Data Catalog, Impact Analysis, Data Lineage, Access Control, Secure Data Rest, GDPR]
The Lambda Architecture (deep dive)
[Diagram: the enterprise Lambda Architecture with numbered callouts 1 to 5]
Batch layer:
> home of ‘master data’
> precomputed_batch_view = fn(ALL_DATA)
> user_query = fn(precomputed_batch_view)
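A minimal sketch of the two formulas above, assuming a listening-history view keyed by `(user, track)`. The record layout and the names `build_batch_view` and `user_query` are illustrative assumptions, not the speaker's implementation.

```python
from collections import Counter

def build_batch_view(all_data):
    """precomputed_batch_view = fn(ALL_DATA): full recomputation over the master dataset."""
    return Counter((event["user"], event["track"]) for event in all_data)

def user_query(precomputed_batch_view, user, track):
    """user_query = fn(precomputed_batch_view): a key lookup, no scan of raw data."""
    return precomputed_batch_view.get((user, track), 0)

# Hypothetical master dataset of play events.
master_dataset = [
    {"user": "a", "track": "despacito"},
    {"user": "a", "track": "despacito"},
    {"user": "b", "track": "despacito"},
]

batch_view = build_batch_view(master_dataset)
print(user_query(batch_view, "a", "despacito"))  # 2
```

The expensive work happens once per batch run; each user query afterwards only touches the precomputed view.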
The Lambda Architecture (deep dive)
[Diagram: the enterprise Lambda Architecture with numbered callouts 1 to 5]
Serving layer:
> (distributed) db to store batch view data
> produce fast results for known queries
> allow (random) reads by users/systems
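The three points above can be sketched with an in-memory stand-in for the serving database. The class `ServingLayer` and its `publish` and `read` methods are hypothetical names chosen for this example; a real deployment would use a distributed store instead of a dict.

```python
class ServingLayer:
    """In-memory stand-in for the (distributed) db that stores batch view data."""

    def __init__(self):
        self._views = {}

    def publish(self, name, batch_view):
        # Swap in a whole new view at once: batch views are recomputed,
        # not updated in place, so readers see either the old or the new one.
        self._views[name] = dict(batch_view)

    def read(self, name, key, default=0):
        # Random read by key: a known query becomes a fast lookup.
        return self._views[name].get(key, default)

serving = ServingLayer()
serving.publish("plays_per_track", {"despacito": 1000, "hello": 40})
print(serving.read("plays_per_track", "despacito"))  # 1000
```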
The Lambda Architecture (deep dive)
[Diagram: the enterprise Lambda Architecture with numbered callouts 1 to 5]
Speed layer:
> compensate for high latency of batch updates
> run fast, incremental algorithms (probabilistic data structures, for the win)
> realtime_view = fn(realtime_view, new_data)
> user_query = fn(realtime_view)
> user_query = fn(precomputed_batch_view)
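To make `realtime_view = fn(realtime_view, new_data)` concrete, here is a sketch that uses a Count-Min Sketch as one example of a probabilistic data structure. The talk does not say which structure it has in mind, so this choice, the sizes `width=1024` and `depth=4`, and all names are assumptions.

```python
import hashlib

class CountMinSketch:
    """Approximate counter in fixed memory: may overestimate, never underestimates."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash per row, derived by salting the key with the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += count

    def estimate(self, key):
        # The smallest cell across rows is the least polluted by collisions.
        return min(self.rows[row][self._index(row, key)] for row in range(self.depth))

def update_realtime_view(realtime_view, new_data):
    """realtime_view = fn(realtime_view, new_data): fold in only the new events."""
    for event in new_data:
        realtime_view.add(event["track"])
    return realtime_view

view = CountMinSketch()
view = update_realtime_view(view, [{"track": "despacito"}, {"track": "despacito"}])
view = update_realtime_view(view, [{"track": "hello"}])
print(view.estimate("despacito"))  # at least 2
```

Each update touches only the new events, so the cost per event stays constant no matter how much history has already been folded in.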
The Lambda Architecture (deep dive)
[Diagram: the enterprise Lambda Architecture with numbered callouts 1 to 5]
Query:
> user_query = fn(
precomputed_batch_view,
realtime_view)
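One way to read the merged query above, sketched under two assumptions: the metric is an additive count, and the real-time view only holds events that arrived after the last batch run. The function signature and sample numbers are illustrative.

```python
def user_query(precomputed_batch_view, realtime_view, track):
    """Merge both views: batch covers history, real-time covers recent events."""
    historical = precomputed_batch_view.get(track, 0)
    recent = realtime_view.get(track, 0)
    return historical + recent

# Hypothetical views: plays per track.
batch_view = {"despacito": 1000, "hello": 40}
realtime_view = {"despacito": 7}

print(user_query(batch_view, realtime_view, "despacito"))  # 1007
```

For metrics that are not additive (distinct users, for instance) the merge step needs a different combine function, so the plain sum here should not be taken as general.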
The Lambda Architecture (ingest? speedlayer?)
[Diagram: the enterprise Lambda Architecture with numbered callouts 1 to 5]
The Kappa Architecture
[Diagram: the enterprise architecture diagram with numbered callouts 1 to 5]
The Lambda Architecture (SMACK)
[Diagram: the enterprise Lambda Architecture with numbered callouts 1 to 5]
Big Picture of Metadata Management for Data Governance
[Diagram: the Lambda Architecture (with Access Control, Secure Data Rest, GDPR) overlaid with metadata management elements: lineage, impact analysis, semantic lineage, Enterprise Vocabulary, Semantic Mapping, (metadata harvesting), (metadata stitching)]
Machine Learning Pipelines
BAML PIT == $$100MM
Blockchain based Adversarial Machine Learning
Platform for IoT Testing
classic model
Pancake Stack
[PANCAKE: Presto, Arrow, NiFi, Cassandra, Airflow, Kafka, Elasticsearch; STACK: Spark, TensorFlow, Algebird, CoreNLP, Kibana]
[Diagram: pipeline stages Data Source, Data & Feature Engineering, Model Building, Deploy, Monitor, annotated with "data science silo" and "maturity spectrum"]
Adaptation of slide by Ben Lorica
what’s changing(-ed)?
1. Cloud (faas, serverless data pipelines, ml-as-a-service)
2. Consumer demand for ML features/products/applications
3. Targeted Models (we need to manage 20MM models for 10MM users maybe)
4. Localization (ASEAN facial recognition)
5. Security (Adv. ML, Side-channel attacks)
6. Transparency (Bias is a BUG)
7. Many sophisticated toy solutions exist, but conventional, simpler techniques (regression)
still deliver more business value!
8. Monitoring to ensure deployed models are making high quality predictions
9. Need practices to maintain (update or rebuild) models over time
10. and ….
feature engineering, wat?
By @MLpuppy
rise of machine learning engineers
Online Machine Learning Pipeline
Model Inventory
Model Output Monitoring
Take Action
ML Serving Layer
Hyper-parameter Tuning
www.productionml.org

DSSML24_tspann_CodelessGenerativeAIPipelines
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 

Big Data & Machine Learning Pipelines: A Tale of Lambdas, Kappas and Pancakes

  • 1. A Tale of Lambdas, Kappas & Pancakes @osamakhn
  • 2. Who am I? Osama Khan. Big Data Engineer @ACLServices, Grad Student @GTComputing, AWS Big Data Specialist+. Vancouver, BC. Languages: Java, C# (met thru J#), Python, Golang, NodeJS, Scala. Previously: Robot Soccer, BI, Credit Rating, AML, O&G Portfolio, NLP/Governance, Doctor Triage, Energy Monitoring, Consulting, Private Equity. Recently: Data/ML Pipeline, Tools & Platforms.
  • 3. What am I going to talk about today? The goal of this talk is to provide a high-level overview of the big data landscape to help software engineers distinguish signal from noise. 1) How big is BIG: Let's get our scales recalibrated to understand what we mean by BIG Data. 2) Lineage: Evolution of the Big Data ecosystem, from EDW to Data Lakes. 3) Lambda & Kappa Architectures: The foundation of data pipelines and machine learning systems. 4) Technology Choices: SMACK that PANCAKE, BUT I ❤ Serverless. 5) Demos: Athena, EMR, Redshift, Quicksight, Sagemaker, ModelDB.
  • 4. How big is BIG? Big is BIG when Bieber breaks the Google Cloud (wat!?!). Let's get our scales recalibrated to understand what we mean by BIG Data.
  • 5. This is BIG … § 390 Hyperscale Datacenters ( < 300, 2016) § Hyperscale == (5k servers, 10k sq.ft space) § > 400, 100M+ total servers § 56% web content in English § 8,000 languages spoken globally § Hello, friend. § 100M+ active users, 40M+ subscribers § 30M+ songs, 20K new per day, 2B+ playlists, 1B+ plays per day § 2,500 node Hadoop cluster, 100 PB+ Disk, 100TB+ RAM § 60TB+ per day log ingestion, 20k+ jobs per day § Listening History Query § user x track x [day/week/month/all time] § 300B elements § 800 workers, 32 core, 208 GB ram § 240TB in, 90TB out § Top Tracks in Vancouver (June 2017) § 30 date partitioned tables, 60TB data § 1 metadata table, 418GB § 94.2s, 4.82TB processed § Despacito – Remix (Luis Fonsi) § (2017) § 2.8 B+ US Tweets § Donald Trump (901.8 M) § Hillary Clinton (123.2 M) § Mike Pence (31.4 M) § 30x more than VP, 7x more than opponent (2013) § 170M individual metrics (timeseries) per minute § 200M queries served/day, 47 charts/user
  • 6. Lineage: Evolution of the Big Data ecosystem, from EDW to Data Lakes
  • 7. A journey from ETL to Distributed Transactions via the ELT alley… Timeline years: 2000, 2003, 2004, 2007, 2009, 2010, 2011, 2014, 2017. Timeline entries (in extraction order): FaunaDB, Aurora; Lambda, Kappa Architectures; UC Berkeley Spark; Google File System; LinkedIn Kafka; FB Cassandra; AWS DynamoDB; Google Dremel; 2012: Google Spanner; IBM, Oracle, MSFT, Teradata, SAP; CAP Theorem; Google MapReduce; AWS, GCP, Azure, Hana; 2006: Yahoo! Hadoop; Occupy the Cloud: Distributed Computing for the 99%
  • 8. Big Data Landscape Distributed systems rule the !
  • 9. Yet Another Big Data Framework (YABDF) Doesn’t fit on a slide or two … and you thought you had library fatigue in the JS world ! http://mattturck.com/wp-content/uploads/2017/05/Matt-Turck-FirstMark-2017-Big-Data-Landscape.png
  • 10. Lambda & Kappa Architectures: The foundation of data pipelines for enterprise insights
  • 11. Lambda Architecture: First Principles & Desired Properties. Data: special information from which everything else is derived. Information: processed data. Data System: query = function(ALL_DATA). Desired properties: 1. Robustness & Fault Tolerance 2. Low Latency Read & Update 3. Scalability 4. Generalization 5. Ad-hoc 6. Minimal Maintenance 7. Debuggability
  • 12. The Lambda Architecture [diagram: new data flows into the batch layer (master dataset, ∑ ALL_DATA) and the speed layer (real-time views, Δ NEW_DATA); the serving layer holds the batch views; queries read the batch views and the real-time views]
  • 13. The Lambda Architecture (in the enterprise) [Lambda diagram (new data, master dataset, batch layer, batch views, serving layer, speed layer, real-time views, queries) overlaid with governance labels: Data Catalog, Impact Analysis, Data Lineage, Access Control, Secure Data Rest, GDPR]
  • 16. The Lambda Architecture (deep dive), batch layer: > home of ‘master data’ > precomputed_batch_view = fn(ALL_DATA) > user_query = fn(precomputed_batch_view) [Lambda diagram with governance labels and callouts 1 to 5]
  • 17. The Lambda Architecture (deep dive), serving layer: > (distributed) db to store batch view data > produce fast results for known queries > allow (random) reads by users/systems [Lambda diagram with governance labels and callouts 1 to 5]
  • 18. The Lambda Architecture (deep dive), speed layer: > compensate for high latency of batch updates > run fast, incremental algorithms (probabilistic data structures, for the win) > realtime_view = fn(realtime_view, new_data) > user_query = fn(realtime_view) > user_query = fn(precomputed_batch_view) [Lambda diagram with governance labels and callouts 1 to 5]
  • 19. The Lambda Architecture (deep dive), query: > user_query = fn(precomputed_batch_view, realtime_view) [Lambda diagram with governance labels and callouts 1 to 5]
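The three formulas on the deep-dive slides (16 to 19) can be read as three small functions. Below is a minimal Python sketch of that reading, not the speaker's implementation. It assumes the views are play counts per track, that events are hypothetical `(user, track)` pairs, and that an in-memory `Counter` stands in for the real batch and serving stores; all function names, track names and data are illustrative.

```python
from collections import Counter
from typing import Iterable, Tuple

Event = Tuple[str, str]  # hypothetical (user, track) play event


def batch_view(all_data: Iterable[Event]) -> Counter:
    """precomputed_batch_view = fn(ALL_DATA): recompute counts from scratch."""
    return Counter(track for _user, track in all_data)


def update_realtime_view(realtime_view: Counter, new_data: Iterable[Event]) -> Counter:
    """realtime_view = fn(realtime_view, new_data): fold new events into the old view."""
    updated = realtime_view.copy()
    updated.update(track for _user, track in new_data)
    return updated


def user_query(precomputed_batch_view: Counter, realtime_view: Counter, track: str) -> int:
    """user_query = fn(precomputed_batch_view, realtime_view): merge both views."""
    return precomputed_batch_view[track] + realtime_view[track]


# Illustrative data: the master dataset as of the last batch run,
# plus events that arrived after it and only the speed layer has seen.
master_dataset = [("ann", "despacito"), ("bob", "despacito"), ("ann", "hello")]
plays_since_last_batch = [("cat", "despacito")]

batch = batch_view(master_dataset)
realtime = update_realtime_view(Counter(), plays_since_last_batch)
print(user_query(batch, realtime, "despacito"))  # prints 3
```

The batch function is a pure recomputation over everything, while the speed-layer function only touches the increment. That difference is why the speed layer can "compensate for high latency of batch updates". A production speed layer would likely swap the exact `Counter` for an approximate structure, as the slide hints with "probabilistic data structures".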
  • 20. The Lambda Architecture (ingest? speed layer?) [Lambda diagram with governance labels and callouts 1 to 5]
  • 22. The Kappa Architecture [diagram carrying the same labels as the Lambda slides, with governance labels and callouts 1 to 5]
  • 23. The Lambda Architecture (SMACK) [Lambda diagram with governance labels and callouts 1 to 5]
  • 24. Big Picture of Metadata Management for Data Governance [Lambda diagram annotated with: Access Control, Secure Data Rest, GDPR, lineage, impact analysis, semantic lineage, Enterprise Vocabulary, Semantic Mapping, (metadata harvesting), (metadata stitching)]
  • 25. Big Picture of Metadata Management for Data Governance [Lambda diagram annotated with: Access Control, Secure Data Rest, GDPR, lineage, impact analysis, semantic lineage, Enterprise Vocabulary, Semantic Mapping]
  • 27. Machine Learning Pipelines. BAML PIT == $$100MM: Blockchain-based Adversarial Machine Learning Platform for IoT Testing
  • 30. Pancake Stack [Presto, Arrow, NiFi, Cassandra, Airflow, Kafka, Elasticsearch, Spark, TensorFlow, Algebird, CoreNLP, Kibana]
  • 31. data science silo: Data Source, Data & Feature Engineering, Model Building, Deploy, Monitor (adaptation of slide by Ben Lorica)
  • 33. what’s changing(-ed)? 1. Cloud (FaaS, serverless data pipelines, ML-as-a-service) 2. Consumer demand for ML features/products/applications 3. Targeted Models (we need to manage 20MM models for 10MM users maybe) 4. Localization (ASEAN facial recognition) 5. Security (Adv. ML, Side-channel attacks) 6. Transparency (Bias is a BUG) 7. Many sophisticated toy solutions exist, but conventional, simpler techniques (regression) still deliver more business value! 8. Monitoring to ensure deployed models are making high-quality predictions 9. Need practices to maintain (update or rebuild) models over time 10. and ….
  • 35. rise of machine learning engineers
  • 37. Online Machine Learning Pipeline Model Inventory Model Output Monitoring Take Action ML Serving Layer Hyper-parameter Tuning