SlideShare a Scribd company logo
Ricardo Fanjul, letgo
Designing a Horizontally
Scalable Event-Driven Big
Data Architecture with
Apache Spark
#SAISExp2
2 0 1 8 : DATA O DY S S E Y
Ricardo Fanjul
Data Engineer
Founded in 2015
100MM+ downloads
400MM+ listings
#SAISExp2
L E T G O D ATA P L AT F O R M I N N U M B E R S
500GB
Data daily
Events Processed Daily
1 billion
50K
Peaks of events per Second
600+
Event Types
200TB
Storage (S3)
< 1sec
NRT Processing Time
#SAISExp2
T H E DAW N O F L E T G O
#SAISExp2
C L A S S I C A L B I P L AT F O R M
T H E D A W N O F L E T G O
#SAISExp2
C L A S S I C A L B I P L AT F O R M
T H E D A W N O F L E T G O
#SAISExp2
M OV I N G TO - S E RV I C E S A N D
E V E N T S
T H E D A W N O F L E T G O
μ
#SAISExp2
M OV I N G TO - S E RV I C E S A N D
E V E N T S
T H E D A W N O F L E T G O
Domain EventsTracking Events
μ
#SAISExp2
Data Ingestion
Storage
INGEST
Stream
Batch
PROCESS
Query
Data exploitation
Orchestration
DISCOVER
#SAISExp2
I N G E S T
Data Ingestion
Storage
INGEST PROCESS
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
Data Ingestion
#SAISExp2
O U R G OA L
DATA I N G E S T I O N
#SAISExp2
T H E D I S C OV E RY
DATA I N G E S T I O N
#SAISExp2
K A F K A C O N N E C T
DATA I N G E S T I O N
Amazon Aurora
#SAISExp2
T H E J O U R N E Y B E G I N S
Data Ingestion
INGEST PROCESS
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
Storage
#SAISExp2
B U I L D I N G T H E DATA L A K E
S TO R AG E
#SAISExp2
B U I L D I N G T H E DATA L A K E
S TO R AG E
We want to store all events coming from Kafka to S3.
#SAISExp2
B U I L D I N G T H E DATA L A K E
S TO R AG E
#SAISExp2
S O M E T I M E S … S H I T H A P P E N S
S TO R AG E
#SAISExp2
D U P L I C AT E D E V E N T S
S TO R AG E
#SAISExp2
D U P L I C AT E D E V E N T S
S TO R AG E
#SAISExp2
( V E RY ) L AT E E V E N T S
S TO R AG E
#SAISExp2
( V E RY ) L AT E E V E N T S
S TO R AG E
1
2
3
4
5
6
Dirty Buckets
1 Read batch of events from Kafka.
2 Write each event to Cassandra.
3 Write “dirty hours” to compact topic:
Key=(event_type, hour).
4 Read dirty “hours” topic.
5 Read all events with dirty hours.
6 Store in S3
#SAISExp2
S 3 P RO B L E M S
S TO R AG E
#SAISExp2
S 3 P RO B L E M S
S TO R AG E
1. Eventual consistency
2. Very slow renames
S O M E S 3 B I G DATA P RO B L E M S :
S 3 P RO B L E M S :
E V E N T UA L C O N S I S T E N C Y
S TO R AG E
#SAISExp2
S 3 P RO B L E M S :
E V E N T UA L C O N S I S T E N C Y
S TO R AG E
S3GUARD
S3AFileSystem
The Hadoop FileSystem for Amazon S3
FileSystem
Operations
S3 ClientDynamoDB Client
1
WRITE
fs metadata
READ
fs metadata
WRITE
object data
LIST
object info
LIST
object data
Write Path
Read Path 2 3
1 2
2 1 1 2 3
S 3 P RO B L E M S :
S L OW R E N A M E S
S TO R AG E


¿Job freeze?
#SAISExp2
S 3 P RO B L E M S :
S L OW R E N A M E S
S TO R AG E
New Hadoop 3.1 S3A committers:
• Directory
• Partitioned
• Magic
#SAISExp2
P R O C E S S
INGEST PROCESS
Batch
DISCOVER
Query
Data exploitation
Orchestration
StreamData Ingestion
Storage
#SAISExp2
R E A L T I M E U S E R S E G M E N TAT I O N
S T R E A M
Stream Journal
User bucket’s changed
1 2
#SAISExp2
R E A L T I M E U S E R S E G M E N TAT I O N
S T R E A M
JOURNAL
User bucket’s changed
1 2
STREAM
#SAISExp2
R E A L T I M E PAT T E R N D E T E C T I O N
S T R E A M
Is it still available?
What condition is it in?
Could we meet at……?
Is the price negotiable?
I offer you….$
#SAISExp2
R E A L T I M E PAT T E R N D E T E C T I O N
S T R E A M
{
"type": "meeting_proposal",
"properties": {
"location_name": “Letgo HQ",
"geo": {
"lat": "41.390205",
"lon": "2.154007"
},
"date": "1511193820350",
"meeting_id": "23213213213"
}
}
Structured data
#SAISExp2
R E A L T I M E PAT T E R N D E T E C T I O N
S T R E A M
Meeting proposed + meeting accepted = emit accepted-meeting event
Meeting proposed + nothing in X time = “You have a proposal to meet”
#SAISExp2
Data Ingestion
Storage
INGEST PROCESS DISCOVER
Query
Data exploitation
Orchestration
Stream
Batch
#SAISExp2
G E O DATA E N R I C H M E N T
B AT C H
#SAISExp2
G E O DATA E N R I C H M E N T
B AT C H
{
"data": {
"id": "105dg3272-8e5f-426f-
bca0-704e98552961",
"type": "some_event",
"attributes": {
"latitude": 42.3677203,
"longitude": -83.1186093
}
},
"meta": {
"created_at": 1522886400036
}
}


Technically correct but
not very actionable
#SAISExp2
G E O DATA E N R I C H M E N T
B AT C H
•City: Detroit
•Postal code: 48206
•State: Michigan
•DMA: Detroit
•Country: US
What we know:
(42.3677203, -83.1186093)
#SAISExp2
G E O DATA E N R I C H M E N T
B AT C H
How we do it:
• Populating JTS indices from WKT polygon data
• Custom Spark SQL UDF
SELECT geodata.dma_name,
geodata.dma_number AS dma_number,
geodata.city AS city,
geodata.state AS state ,
geodata.zip_code AS zip_code
FROM (
SELECT
geodata(longitude, latitude) AS geodata
FROM ….
)


#SAISExp2
D I S C O V E R
Data Ingestion
Storage
INGEST PROCESS DISCOVER
Data exploitation
Orchestration
Stream
Batch
Query
#SAISExp2
Stream Journal
User bucket’s changed
1 2
QU E RY I N G DATA
QU E RY
#SAISExp2
Stream Journal
User bucket’s changed
1 2
QU E RY I N G DATA
QU E RY
#SAISExp2
Stream Journal
User bucket’s changed
1 2
QU E RY I N G DATA
QU E RY
Amazon Aurora
QU E RY I N G DATA
QU E RY
METASTORE
SPECTRUM
QU E RY I N G DATA
QU E RY
#SAISExp2
QU E RY I N G DATA
QU E RY
METASTORE
Thrift Server
Amazon Aurora
QU E RY I N G DATA
QU E RY
CREATE TABLE IF NOT EXISTS
database_name.table_name(
some_column STRING,
...
dt DATE
)
USING json
PARTITIONED BY (`dt`)
CREATE TEMPORARY VIEW table_name
USING org.apache.spark.sql.cassandra
OPTIONS (
table "table_name",
keyspace "keyspace_name")
CREATE EXTERNAL TABLE IF NOT EXISTS
database_name.table_name(
some_column STRING...,
dt DATE
)
PARTITIONED BY (`dt`)
USING PARQUET
LOCATION 's3a://bucket-name/database_name/table_name'
CREATE TABLE IF NOT EXISTS database_name.table_name
using com.databricks.spark.redshift
options (
dbtable 'schema.redshift_table_name',
tempdir 's3a://redshift-temp/',
url 'jdbc:redshift://xxxx.redshift.amazonaws.com:5439/letgo?
user=xxx&password=xxx',
forward_spark_s3_credentials 'true')
QU E RY I N G DATA
QU E RY
CREATE TABLE …
STORED AS…
CREATE TABLE …
USING [parquet,json,csv…]
70%
Higher performance!
QU E RY I N G DATA : B AT C H E S W I T H S Q L
QU E RY
Creating the
table
Inserting data
1 2
#SAISExp2
QU E RY I N G DATA : B AT C H E S W I T H S Q L
QU E RY
Creating the
table
1 CREATE EXTERNAL TABLE IF NOT EXISTS database.some_name(
  user_id STRING,
  column_b STRING,
...
)
USING PARQUET
PARTITIONED BY (`dt` STRING)
LOCATION 's3a://example/some_table'
#SAISExp2
QU E RY I N G DATA : B AT C H E S W I T H S Q L
QU E RY
Inserting data
2 INSERT OVERWRITE TABLE database.some_name PARTITION(dt)
SELECT
user_id,
column_b,
dt
FROM other_table
...
#SAISExp2
QU E RY I N G DATA : B AT C H E S W I T H S Q L
QU E RY
Problem?
#SAISExp2
QU E RY I N G DATA : B AT C H E S W I T H S Q L
QU E RY

 200 files because default value of
“spark.sql.shuffle.partition”
QU E RY I N G DATA : B AT C H E S W I T H
S Q L
QU E RY
INSERT OVERWRITE TABLE database.some_name PARTITION(dt)
SELECT
user_id,
column_b,
dt
FROM other_table
...
?
#SAISExp2
QU E RY I N G DATA : B AT C H E S W I T H S Q L
QU E RY
?
DISTRIBUTE BY (dt):
Only one file not Sorted
CLUSTERED BY (dt, user_id, column_b):
Multiple files
DISTRIBUTE BY (dt) SORT BY (user_id, column_b):
Only one file sorted by user_id, column_b.
Good for joins using this properties.
#SAISExp2
QU E RY I N G DATA : B AT C H E S W I T H S Q L
QU E RY
INSERT OVERWRITE TABLE database.some_name
PARTITION(dt)
SELECT
user_id,
column_b,
dt
FROM other_table
...
DISTRIBUTE BY (dt) SORT BY (user_id)
#SAISExp2
QU E RY I N G DATA : B AT C H E S W I T H S Q L
QU E RY
#SAISExp2
Data Ingestion
Storage
INGEST PROCESS DISCOVER
Orchestration
Stream
Batch
Query
Data exploitation
#SAISExp2
O U R A N A LY T I C A L S TAC K
DATA E X P L O I TAT I O N
#SAISExp2
O U R A N A LY T I C A L S TAC K
DATA E X P L O I TAT I O N
Amazon Aurora
METASTORE
Thrift Server
DATA S C I E N T I S T S T E A M ?
DATA E X P L O I TAT I O N
#SAISExp2
DATA S C I E N T I S T S A S I S E E T H E M
DATA E X P L O I TAT I O N
#SAISExp2
DATA S C I E N T I S T S S I N S
DATA E X P L O I TAT I O N
#SAISExp2
DATA S C I E N T I S T S S I N S
DATA E X P L O I TAT I O N
Too many
small files!
#SAISExp2
DATA S C I E N T I S T S S I N S
DATA E X P L O I TAT I O N
Huge Query!
#SAISExp2
DATA S C I E N T I S T S S I N S
DATA E X P L O I TAT I O N
Too much
shuffle!
#SAISExp2
Data Ingestion
Storage
INGEST PROCESS DISCOVER
Stream
Batch
Query
Data exploitation
Orchestration
#SAISExp2
A I R F L OW
O R C H E S T R AT I O N
#SAISExp2
A I R F L OW
O R C H E S T R AT I O N
#SAISExp2
A I R F L OW
O R C H E S T R AT I O N
I’m happy!!
#SAISExp2
M OV I N G TO S TAT E L E S S
C L U S T E R
I ’ M S O R RY R I C A R D O. I ’ M
A F R A I D I C A N ’ T D O T H AT.
P L AT F O R M L I M I TAT I O N S
M OV I N G TO S TAT E L E S S C L U S T E R
#SAISExp2
P L A N N I N G T H E S O L U T I O N
M OV I N G TO S TAT E L E S S C L U S T E R
#SAISExp2
P L A N N I N G T H E S O L U T I O N
M OV I N G TO S TAT E L E S S C L U S T E R
#SAISExp2
P L A N N I N G T H E S O L U T I O N
M OV I N G TO S TAT E L E S S C L U S T E R
#SAISExp2
A L O N G A N D W I N D I N G P RO C E S S
M OV I N G TO S TAT E L E S S C L U S T E R
#SAISExp2
A L O N G A N D W I N D I N G P RO C E S S
M OV I N G TO S TAT E L E S S C L U S T E R
#SAISExp2
N E W C A PA B I L I T I E S O F T H E
P L AT F O R M
M OV I N G TO S TAT E L E S S C L U S T E R
#SAISExp2
N E W C A PA B I L I T I E S O F T H E
P L AT F O R M
M OV I N G TO S TAT E L E S S C L U S T E R
#SAISExp2
J U P I T E R A N D B E YO N D
T H E I N F I N I T E
S O M E T H I N G WO N D E R F U L W I L L
H A P P E N
J U P I T E R A N D B E YO N D T H E I N F I N I T E
#SAISExp2
R E V E A L
J U P I T E R A N D B E YO N D T H E I N F I N I T E
https://youtu.be/CInMDMuSFwc
– A R T H U R C . C L A R K E
“The only way to discover the limits of the possible
is to go beyond them into the impossible.”
#SAISExp2
D O YO U WA N T TO J O I N U S ?
T H E F U T U R E … ?
https://boards.greenhouse.io/letgo

More Related Content

What's hot

Jenkins 20
Jenkins 20Jenkins 20
Jenkins 20
Alex Soto
 
Let’s spread Phishing and escape the blocklists
Let’s spread Phishing and escape the blocklistsLet’s spread Phishing and escape the blocklists
Let’s spread Phishing and escape the blocklists
Andrea Draghetti
 
Smart Contracts Technical Overview - Meetup Roma - 17/09/19
Smart Contracts Technical Overview - Meetup Roma - 17/09/19Smart Contracts Technical Overview - Meetup Roma - 17/09/19
Smart Contracts Technical Overview - Meetup Roma - 17/09/19
Federico Tenga
 
19. stretnutie komunity kubernetes
19. stretnutie komunity kubernetes19. stretnutie komunity kubernetes
19. stretnutie komunity kubernetes
Juraj Hantak
 
Logs Are Magic: Why Git Workflows and Commit Structure Should Matter To You
Logs Are Magic: Why Git Workflows and Commit Structure Should Matter To YouLogs Are Magic: Why Git Workflows and Commit Structure Should Matter To You
Logs Are Magic: Why Git Workflows and Commit Structure Should Matter To You
John Anderson
 
Uncovering and Visualizing Malicious Infrastructure
Uncovering and Visualizing Malicious InfrastructureUncovering and Visualizing Malicious Infrastructure
Uncovering and Visualizing Malicious Infrastructure
Andrea Scarfo
 
On the development and distribution of R packages
On the development and distribution of R packagesOn the development and distribution of R packages
On the development and distribution of R packages
Tom Mens
 
April Wensel - Crafting Compassionate Code
April Wensel - Crafting Compassionate CodeApril Wensel - Crafting Compassionate Code
April Wensel - Crafting Compassionate Code
April Wensel
 
The net is dark and full of terrors - James Bennett
The net is dark and full of terrors - James BennettThe net is dark and full of terrors - James Bennett
The net is dark and full of terrors - James Bennett
Leo Zhou
 
Decoupled APIs through Microservices
Decoupled APIs through MicroservicesDecoupled APIs through Microservices
Decoupled APIs through Microservices
David Simons
 
Footprinting-and-the-basics-of-hacking
Footprinting-and-the-basics-of-hackingFootprinting-and-the-basics-of-hacking
Footprinting-and-the-basics-of-hackingSathishkumar A
 
Drupal Decoupled on ARTE
Drupal Decoupled on ARTEDrupal Decoupled on ARTE
Drupal Decoupled on ARTE
jolidog
 
Bristol Uni - Use Cases of NoSQL
Bristol Uni - Use Cases of NoSQLBristol Uni - Use Cases of NoSQL
Bristol Uni - Use Cases of NoSQL
David Simons
 
Statistical Programming with JavaScript
Statistical Programming with JavaScriptStatistical Programming with JavaScript
Statistical Programming with JavaScript
David Simons
 
Python for Data-driven Storytelling
Python for Data-driven StorytellingPython for Data-driven Storytelling
Python for Data-driven Storytelling
Hamlet Batista
 
Lyhyt somepala - sosiaalisen median analytiikkaa ja päivitysten tekoa
Lyhyt somepala - sosiaalisen median analytiikkaa ja päivitysten tekoaLyhyt somepala - sosiaalisen median analytiikkaa ja päivitysten tekoa
Lyhyt somepala - sosiaalisen median analytiikkaa ja päivitysten tekoa
Maria Rajakallio
 

What's hot (16)

Jenkins 20
Jenkins 20Jenkins 20
Jenkins 20
 
Let’s spread Phishing and escape the blocklists
Let’s spread Phishing and escape the blocklistsLet’s spread Phishing and escape the blocklists
Let’s spread Phishing and escape the blocklists
 
Smart Contracts Technical Overview - Meetup Roma - 17/09/19
Smart Contracts Technical Overview - Meetup Roma - 17/09/19Smart Contracts Technical Overview - Meetup Roma - 17/09/19
Smart Contracts Technical Overview - Meetup Roma - 17/09/19
 
19. stretnutie komunity kubernetes
19. stretnutie komunity kubernetes19. stretnutie komunity kubernetes
19. stretnutie komunity kubernetes
 
Logs Are Magic: Why Git Workflows and Commit Structure Should Matter To You
Logs Are Magic: Why Git Workflows and Commit Structure Should Matter To YouLogs Are Magic: Why Git Workflows and Commit Structure Should Matter To You
Logs Are Magic: Why Git Workflows and Commit Structure Should Matter To You
 
Uncovering and Visualizing Malicious Infrastructure
Uncovering and Visualizing Malicious InfrastructureUncovering and Visualizing Malicious Infrastructure
Uncovering and Visualizing Malicious Infrastructure
 
On the development and distribution of R packages
On the development and distribution of R packagesOn the development and distribution of R packages
On the development and distribution of R packages
 
April Wensel - Crafting Compassionate Code
April Wensel - Crafting Compassionate CodeApril Wensel - Crafting Compassionate Code
April Wensel - Crafting Compassionate Code
 
The net is dark and full of terrors - James Bennett
The net is dark and full of terrors - James BennettThe net is dark and full of terrors - James Bennett
The net is dark and full of terrors - James Bennett
 
Decoupled APIs through Microservices
Decoupled APIs through MicroservicesDecoupled APIs through Microservices
Decoupled APIs through Microservices
 
Footprinting-and-the-basics-of-hacking
Footprinting-and-the-basics-of-hackingFootprinting-and-the-basics-of-hacking
Footprinting-and-the-basics-of-hacking
 
Drupal Decoupled on ARTE
Drupal Decoupled on ARTEDrupal Decoupled on ARTE
Drupal Decoupled on ARTE
 
Bristol Uni - Use Cases of NoSQL
Bristol Uni - Use Cases of NoSQLBristol Uni - Use Cases of NoSQL
Bristol Uni - Use Cases of NoSQL
 
Statistical Programming with JavaScript
Statistical Programming with JavaScriptStatistical Programming with JavaScript
Statistical Programming with JavaScript
 
Python for Data-driven Storytelling
Python for Data-driven StorytellingPython for Data-driven Storytelling
Python for Data-driven Storytelling
 
Lyhyt somepala - sosiaalisen median analytiikkaa ja päivitysten tekoa
Lyhyt somepala - sosiaalisen median analytiikkaa ja päivitysten tekoaLyhyt somepala - sosiaalisen median analytiikkaa ja päivitysten tekoa
Lyhyt somepala - sosiaalisen median analytiikkaa ja päivitysten tekoa
 

Similar to Designing a Horizontally Scalable Event-Driven Big Data Architecture with Apache Spark

Meteor - not just for rockstars
Meteor - not just for rockstarsMeteor - not just for rockstars
Meteor - not just for rockstars
Stephan Hochhaus
 
Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016 Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016
DISID
 
Fast api
Fast apiFast api
Fast api
Simone Di Maulo
 
Meteor WWNRW Intro
Meteor WWNRW IntroMeteor WWNRW Intro
Meteor WWNRW Intro
Stephan Hochhaus
 
Choosing the right database
Choosing the right databaseChoosing the right database
Choosing the right database
David Simons
 
Choosing the Right Database
Choosing the Right DatabaseChoosing the Right Database
Choosing the Right Database
David Simons
 
10 d bs in 30 minutes
10 d bs in 30 minutes10 d bs in 30 minutes
10 d bs in 30 minutesDavid Simons
 
airflow_aws_snow.pptx
airflow_aws_snow.pptxairflow_aws_snow.pptx
airflow_aws_snow.pptx
rishikakhanna7
 
Data Modelling at Scale
Data Modelling at ScaleData Modelling at Scale
Data Modelling at Scale
David Simons
 
Parcours de professionnalisation
Parcours de professionnalisationParcours de professionnalisation
Parcours de professionnalisation
SbastienGuritey
 
eHarmony @ Phoenix Con 2016
eHarmony @ Phoenix Con 2016eHarmony @ Phoenix Con 2016
eHarmony @ Phoenix Con 2016
Vijaykumar Vangapandu
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limiters
Marina Kolpakova
 
Writing (Meteor) Code With Style
Writing (Meteor) Code With StyleWriting (Meteor) Code With Style
Writing (Meteor) Code With Style
Stephan Hochhaus
 
GraphQL, l'avenir du REST ?
GraphQL, l'avenir du REST ?GraphQL, l'avenir du REST ?
GraphQL, l'avenir du REST ?
Francois Zaninotto
 
Innovations and trends in Cloud. Connectfest Porto 2019
Innovations and trends in Cloud. Connectfest Porto 2019Innovations and trends in Cloud. Connectfest Porto 2019
Innovations and trends in Cloud. Connectfest Porto 2019
javier ramirez
 
SEO orientado a Ventas - DSMVALENCIA 2017
SEO orientado a Ventas - DSMVALENCIA 2017SEO orientado a Ventas - DSMVALENCIA 2017
SEO orientado a Ventas - DSMVALENCIA 2017
Luis M Villanueva
 
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
New Relic
 
Whitepaper - Spinbackup Overview
Whitepaper - Spinbackup OverviewWhitepaper - Spinbackup Overview
Whitepaper - Spinbackup Overview
Spinbackup
 
Apache Spark: the next big thing? - StampedeCon 2014
Apache Spark: the next big thing? - StampedeCon 2014Apache Spark: the next big thing? - StampedeCon 2014
Apache Spark: the next big thing? - StampedeCon 2014
StampedeCon
 
AI Deeplearning Programming
AI Deeplearning ProgrammingAI Deeplearning Programming
AI Deeplearning Programming
PaulSombat
 

Similar to Designing a Horizontally Scalable Event-Driven Big Data Architecture with Apache Spark (20)

Meteor - not just for rockstars
Meteor - not just for rockstarsMeteor - not just for rockstars
Meteor - not just for rockstars
 
Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016 Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016
 
Fast api
Fast apiFast api
Fast api
 
Meteor WWNRW Intro
Meteor WWNRW IntroMeteor WWNRW Intro
Meteor WWNRW Intro
 
Choosing the right database
Choosing the right databaseChoosing the right database
Choosing the right database
 
Choosing the Right Database
Choosing the Right DatabaseChoosing the Right Database
Choosing the Right Database
 
10 d bs in 30 minutes
10 d bs in 30 minutes10 d bs in 30 minutes
10 d bs in 30 minutes
 
airflow_aws_snow.pptx
airflow_aws_snow.pptxairflow_aws_snow.pptx
airflow_aws_snow.pptx
 
Data Modelling at Scale
Data Modelling at ScaleData Modelling at Scale
Data Modelling at Scale
 
Parcours de professionnalisation
Parcours de professionnalisationParcours de professionnalisation
Parcours de professionnalisation
 
eHarmony @ Phoenix Con 2016
eHarmony @ Phoenix Con 2016eHarmony @ Phoenix Con 2016
eHarmony @ Phoenix Con 2016
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limiters
 
Writing (Meteor) Code With Style
Writing (Meteor) Code With StyleWriting (Meteor) Code With Style
Writing (Meteor) Code With Style
 
GraphQL, l'avenir du REST ?
GraphQL, l'avenir du REST ?GraphQL, l'avenir du REST ?
GraphQL, l'avenir du REST ?
 
Innovations and trends in Cloud. Connectfest Porto 2019
Innovations and trends in Cloud. Connectfest Porto 2019Innovations and trends in Cloud. Connectfest Porto 2019
Innovations and trends in Cloud. Connectfest Porto 2019
 
SEO orientado a Ventas - DSMVALENCIA 2017
SEO orientado a Ventas - DSMVALENCIA 2017SEO orientado a Ventas - DSMVALENCIA 2017
SEO orientado a Ventas - DSMVALENCIA 2017
 
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
 
Whitepaper - Spinbackup Overview
Whitepaper - Spinbackup OverviewWhitepaper - Spinbackup Overview
Whitepaper - Spinbackup Overview
 
Apache Spark: the next big thing? - StampedeCon 2014
Apache Spark: the next big thing? - StampedeCon 2014Apache Spark: the next big thing? - StampedeCon 2014
Apache Spark: the next big thing? - StampedeCon 2014
 
AI Deeplearning Programming
AI Deeplearning ProgrammingAI Deeplearning Programming
AI Deeplearning Programming
 

Recently uploaded

原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 

Recently uploaded (20)

原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 

Designing a Horizontally Scalable Event-Driven Big Data Architecture with Apache Spark

  • 1. Ricardo Fanjul, letgo Designing a Horizontally Scalable Event-Driven Big Data Architecture with Apache Spark #SAISExp2
  • 2. 2 0 1 8 : DATA O DY S S E Y
  • 4. Founded in 2015 100MM+ downloads 400MM+ listings #SAISExp2
  • 5. L E T G O D ATA P L AT F O R M I N N U M B E R S 500GB Data daily Events Processed Daily 1 billion 50K Peaks of events per Second 600+ Event Types 200TB Storage (S3) < 1sec NRT Processing Time #SAISExp2
  • 6. T H E DAW N O F L E T G O
  • 7. #SAISExp2 C L A S S I C A L B I P L AT F O R M T H E D A W N O F L E T G O #SAISExp2
  • 8. C L A S S I C A L B I P L AT F O R M T H E D A W N O F L E T G O #SAISExp2
  • 9. M OV I N G TO - S E RV I C E S A N D E V E N T S T H E D A W N O F L E T G O μ #SAISExp2
  • 10. M OV I N G TO - S E RV I C E S A N D E V E N T S T H E D A W N O F L E T G O Domain EventsTracking Events μ #SAISExp2
  • 12. I N G E S T
  • 13. Data Ingestion Storage INGEST PROCESS Stream Batch DISCOVER Query Data exploitation Orchestration Data Ingestion #SAISExp2
  • 14. O U R G OA L DATA I N G E S T I O N #SAISExp2
  • 15. T H E D I S C OV E RY DATA I N G E S T I O N #SAISExp2
  • 16. K A F K A C O N N E C T DATA I N G E S T I O N Amazon Aurora #SAISExp2
  • 17.
  • 18. T H E J O U R N E Y B E G I N S
  • 19. Data Ingestion INGEST PROCESS Stream Batch DISCOVER Query Data exploitation Orchestration Storage #SAISExp2
  • 20. B U I L D I N G T H E DATA L A K E S TO R AG E #SAISExp2
  • 21. B U I L D I N G T H E DATA L A K E S TO R AG E We want to store all events coming from Kafka to S3. #SAISExp2
  • 22. B U I L D I N G T H E DATA L A K E S TO R AG E #SAISExp2
  • 23. S O M E T I M E S … S H I T H A P P E N S S TO R AG E #SAISExp2
  • 24. D U P L I C AT E D E V E N T S S TO R AG E #SAISExp2
  • 25. D U P L I C AT E D E V E N T S S TO R AG E #SAISExp2
  • 26. ( V E RY ) L AT E E V E N T S S TO R AG E #SAISExp2
  • 27. ( V E RY ) L AT E E V E N T S S TO R AG E 1 2 3 4 5 6 Dirty Buckets 1 Read batch of events from Kafka. 2 Write each event to Cassandra. 3 Write “dirty hours” to compact topic: Key=(event_type, hour). 4 Read dirty “hours” topic. 5 Read all events with dirty hours. 6 Store in S3 #SAISExp2
  • 28. S 3 P RO B L E M S S TO R AG E #SAISExp2
  • 29. S 3 P RO B L E M S S TO R AG E 1. Eventual consistency 2. Very slow renames S O M E S 3 B I G DATA P RO B L E M S :
  • 30. S 3 P RO B L E M S : E V E N T UA L C O N S I S T E N C Y S TO R AG E #SAISExp2
  • 31. S 3 P RO B L E M S : E V E N T UA L C O N S I S T E N C Y S TO R AG E S3GUARD S3AFileSystem The Hadoop FileSystem for Amazon S3 FileSystem Operations S3 ClientDynamoDB Client 1 WRITE fs metadata READ fs metadata WRITE object data LIST object info LIST object data Write Path Read Path 2 3 1 2 2 1 1 2 3
  • 32. S 3 P RO B L E M S : S L OW R E N A M E S S TO R AG E 
 ¿Job freeze? #SAISExp2
  • 33. S 3 P RO B L E M S : S L OW R E N A M E S S TO R AG E New Hadoop 3.1 S3A committers: • Directory • Partitioned • Magic #SAISExp2
  • 34. P R O C E S S
  • 36. R E A L T I M E U S E R S E G M E N TAT I O N S T R E A M Stream Journal User bucket’s changed 1 2 #SAISExp2
  • 37. R E A L T I M E U S E R S E G M E N TAT I O N S T R E A M JOURNAL User bucket’s changed 1 2 STREAM #SAISExp2
  • 38. R E A L T I M E PAT T E R N D E T E C T I O N S T R E A M Is it still available? What condition is it in? Could we meet at……? Is the price negotiable? I offer you….$ #SAISExp2
  • 39. R E A L T I M E PAT T E R N D E T E C T I O N S T R E A M { "type": "meeting_proposal", "properties": { "location_name": “Letgo HQ", "geo": { "lat": "41.390205", "lon": "2.154007" }, "date": "1511193820350", "meeting_id": "23213213213" } } Structured data #SAISExp2
  • 40. R E A L T I M E PAT T E R N D E T E C T I O N S T R E A M Meeting proposed + meeting accepted = emit accepted-meeting event Meeting proposed + nothing in X time = “You have a proposal to meet” #SAISExp2
  • 41. Data Ingestion Storage INGEST PROCESS DISCOVER Query Data exploitation Orchestration Stream Batch #SAISExp2
  • 42. G E O DATA E N R I C H M E N T B AT C H #SAISExp2
  • 43. G E O DATA E N R I C H M E N T B AT C H { "data": { "id": "105dg3272-8e5f-426f- bca0-704e98552961", "type": "some_event", "attributes": { "latitude": 42.3677203, "longitude": -83.1186093 } }, "meta": { "created_at": 1522886400036 } } 
 Technically correct but not very actionable #SAISExp2
  • 44. G E O DATA E N R I C H M E N T B AT C H •City: Detroit •Postal code: 48206 •State: Michigan •DMA: Detroit •Country: US What we know: (42.3677203, -83.1186093) #SAISExp2
  • 45. G E O DATA E N R I C H M E N T B AT C H How we do it: • Populating JTS indices from WKT polygon data • Custom Spark SQL UDF SELECT geodata.dma_name, geodata.dma_number AS dma_number, geodata.city AS city, geodata.state AS state , geodata.zip_code AS zip_code FROM ( SELECT geodata(longitude, latitude) AS geodata FROM …. ) 
 #SAISExp2
  • 46. D I S C O V E R
  • 47. Data Ingestion Storage INGEST PROCESS DISCOVER Data exploitation Orchestration Stream Batch Query #SAISExp2
  • 48. Stream Journal User bucket’s changed 1 2 QU E RY I N G DATA QU E RY #SAISExp2
  • 49. Stream Journal User bucket’s changed 1 2 QU E RY I N G DATA QU E RY #SAISExp2
  • 50. Stream Journal User bucket’s changed 1 2 QU E RY I N G DATA QU E RY Amazon Aurora
  • 51. QU E RY I N G DATA QU E RY METASTORE SPECTRUM
  • 52. QU E RY I N G DATA QU E RY #SAISExp2
  • 53. QU E RY I N G DATA QU E RY METASTORE Thrift Server Amazon Aurora
  • 54. QU E RY I N G DATA QU E RY CREATE TABLE IF NOT EXISTS database_name.table_name( some_column STRING, ... dt DATE ) USING json PARTITIONED BY (`dt`) CREATE TEMPORARY VIEW table_name USING org.apache.spark.sql.cassandra OPTIONS ( table "table_name", keyspace "keyspace_name") CREATE EXTERNAL TABLE IF NOT EXISTS database_name.table_name( some_column STRING..., dt DATE ) PARTITIONED BY (`dt`) USING PARQUET LOCATION 's3a://bucket-name/database_name/table_name' CREATE TABLE IF NOT EXISTS database_name.table_name using com.databricks.spark.redshift options ( dbtable 'schema.redshift_table_name', tempdir 's3a://redshift-temp/', url 'jdbc:redshift://xxxx.redshift.amazonaws.com:5439/letgo? user=xxx&password=xxx', forward_spark_s3_credentials 'true')
  • 55. QU E RY I N G DATA QU E RY CREATE TABLE … STORED AS… CREATE TABLE … USING [parquet,json,csv…] 70% Higher performance!
  • 56. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY Creating the table Inserting data 1 2 #SAISExp2
  • 57. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY Creating the table 1 CREATE EXTERNAL TABLE IF NOT EXISTS database.some_name(   user_id STRING,   column_b STRING, ... ) USING PARQUET PARTITIONED BY (`dt` STRING) LOCATION 's3a://example/some_table' #SAISExp2
  • 58. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY Inserting data 2 INSERT OVERWRITE TABLE database.some_name PARTITION(dt) SELECT user_id, column_b, dt FROM other_table ... #SAISExp2
  • 59. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY Problem? #SAISExp2
  • 60. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY 
 200 files because default value of “spark.sql.shuffle.partition”
  • 61. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY INSERT OVERWRITE TABLE database.some_name PARTITION(dt) SELECT user_id, column_b, dt FROM other_table ... ? #SAISExp2
  • 62. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY ? DISTRIBUTE BY (dt): Only one file not Sorted CLUSTERED BY (dt, user_id, column_b): Multiple files DISTRIBUTE BY (dt) SORT BY (user_id, column_b): Only one file sorted by user_id, column_b. Good for joins using this properties. #SAISExp2
  • 63. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY INSERT OVERWRITE TABLE database.some_name PARTITION(dt) SELECT user_id, column_b, dt FROM other_table ... DISTRIBUTE BY (dt) SORT BY (user_id) #SAISExp2
  • 64. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY #SAISExp2
  • 65. Data Ingestion Storage INGEST PROCESS DISCOVER Orchestration Stream Batch Query Data exploitation #SAISExp2
  • 66. O U R A N A LY T I C A L S TAC K DATA E X P L O I TAT I O N #SAISExp2
  • 67. O U R A N A LY T I C A L S TAC K DATA E X P L O I TAT I O N Amazon Aurora METASTORE Thrift Server
  • 68. DATA S C I E N T I S T S T E A M ? DATA E X P L O I TAT I O N #SAISExp2
  • 69. DATA S C I E N T I S T S A S I S E E T H E M DATA E X P L O I TAT I O N #SAISExp2
  • 70. DATA S C I E N T I S T S S I N S DATA E X P L O I TAT I O N #SAISExp2
  • 71. DATA S C I E N T I S T S S I N S DATA E X P L O I TAT I O N Too many small files! #SAISExp2
  • 72. DATA S C I E N T I S T S S I N S DATA E X P L O I TAT I O N Huge Query! #SAISExp2
  • 73. DATA S C I E N T I S T S S I N S DATA E X P L O I TAT I O N Too much shuffle! #SAISExp2
  • 74. Data Ingestion Storage INGEST PROCESS DISCOVER Stream Batch Query Data exploitation Orchestration #SAISExp2
  • 75. A I R F L OW O R C H E S T R AT I O N #SAISExp2
  • 76. A I R F L OW O R C H E S T R AT I O N #SAISExp2
  • 77. A I R F L OW O R C H E S T R AT I O N I’m happy!! #SAISExp2
  • 78. M OV I N G TO S TAT E L E S S C L U S T E R
  • 79. I ’ M S O R RY R I C A R D O. I ’ M A F R A I D I C A N ’ T D O T H AT.
  • 80. P L AT F O R M L I M I TAT I O N S M OV I N G TO S TAT E L E S S C L U S T E R #SAISExp2
  • 81. P L A N N I N G T H E S O L U T I O N M OV I N G TO S TAT E L E S S C L U S T E R #SAISExp2
  • 82. P L A N N I N G T H E S O L U T I O N M OV I N G TO S TAT E L E S S C L U S T E R #SAISExp2
  • 83. P L A N N I N G T H E S O L U T I O N M OV I N G TO S TAT E L E S S C L U S T E R #SAISExp2
  • 84. A L O N G A N D W I N D I N G P RO C E S S M OV I N G TO S TAT E L E S S C L U S T E R #SAISExp2
  • 85. A L O N G A N D W I N D I N G P RO C E S S M OV I N G TO S TAT E L E S S C L U S T E R #SAISExp2
  • 86. N E W C A PA B I L I T I E S O F T H E P L AT F O R M M OV I N G TO S TAT E L E S S C L U S T E R #SAISExp2
  • 87. N E W C A PA B I L I T I E S O F T H E P L AT F O R M M OV I N G TO S TAT E L E S S C L U S T E R #SAISExp2
  • 88. J U P I T E R A N D B E YO N D T H E I N F I N I T E
  • 89. S O M E T H I N G WO N D E R F U L W I L L H A P P E N J U P I T E R A N D B E YO N D T H E I N F I N I T E #SAISExp2
  • 90. R E V E A L J U P I T E R A N D B E YO N D T H E I N F I N I T E https://youtu.be/CInMDMuSFwc
  • 91. – A R T H U R C . C L A R K E “The only way to discover the limits of the possible is to go beyond them into the impossible.” #SAISExp2
  • 92. D O YO U WA N T TO J O I N U S ? T H E F U T U R E … ? https://boards.greenhouse.io/letgo