SlideShare a Scribd company logo
1 of 92
Ricardo Fanjul, letgo
Designing a Horizontally
Scalable Event-Driven Big
Data Architecture with
Apache Spark
#SAISExp2
2 0 1 8 : DATA O DY S S E Y
Ricardo Fanjul
Data Engineer
Founded in 2015
100MM+ downloads
400MM+ listings
#SAISExp2
L E T G O D ATA P L AT F O R M I N N U M B E R S
500GB
Data daily
Events Processed Daily
1 billion
50K
Peaks of events per Second
600+
Event Types
200TB
Storage (S3)
< 1sec
NRT Processing Time
#SAISExp2
T H E DAW N O F L E T G O
#SAISExp2
C L A S S I C A L B I P L AT F O R M
T H E D A W N O F L E T G O
#SAISExp2
C L A S S I C A L B I P L AT F O R M
T H E D A W N O F L E T G O
#SAISExp2
M OV I N G TO - S E RV I C E S A N D
E V E N T S
T H E D A W N O F L E T G O
μ
#SAISExp2
M OV I N G TO - S E RV I C E S A N D
E V E N T S
T H E D A W N O F L E T G O
Domain EventsTracking Events
μ
#SAISExp2
Data Ingestion
Storage
INGEST
Stream
Batch
PROCESS
Query
Data exploitation
Orchestration
DISCOVER
#SAISExp2
I N G E S T
Data Ingestion
Storage
INGEST PROCESS
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
Data Ingestion
#SAISExp2
O U R G OA L
DATA I N G E S T I O N
#SAISExp2
T H E D I S C OV E RY
DATA I N G E S T I O N
#SAISExp2
K A F K A C O N N E C T
DATA I N G E S T I O N
Amazon Aurora
#SAISExp2
T H E J O U R N E Y B E G I N S
Data Ingestion
INGEST PROCESS
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
Storage
#SAISExp2
B U I L D I N G T H E DATA L A K E
S TO R AG E
#SAISExp2
B U I L D I N G T H E DATA L A K E
S TO R AG E
We want to store all events coming from Kafka to S3.
#SAISExp2
B U I L D I N G T H E DATA L A K E
S TO R AG E
#SAISExp2
S O M E T I M E S … S H I T H A P P E N S
S TO R AG E
#SAISExp2
D U P L I C AT E D E V E N T S
S TO R AG E
#SAISExp2
D U P L I C AT E D E V E N T S
S TO R AG E
#SAISExp2
( V E RY ) L AT E E V E N T S
S TO R AG E
#SAISExp2
( V E RY ) L AT E E V E N T S
S TO R AG E
1
2
3
4
5
6
Dirty Buckets
1 Read batch of events from Kafka.
2 Write each event to Cassandra.
3 Write “dirty hours” to compact topic:
Key=(event_type, hour).
4 Read dirty “hours” topic.
5 Read all events with dirty hours.
6 Store in S3
#SAISExp2
S 3 P RO B L E M S
S TO R AG E
#SAISExp2
S 3 P RO B L E M S
S TO R AG E
1. Eventual consistency
2. Very slow renames
S O M E S 3 B I G DATA P RO B L E M S :
S 3 P RO B L E M S :
E V E N T UA L C O N S I S T E N C Y
S TO R AG E
#SAISExp2
S 3 P RO B L E M S :
E V E N T UA L C O N S I S T E N C Y
S TO R AG E
S3GUARD
S3AFileSystem
The Hadoop FileSystem for Amazon S3
FileSystem
Operations
S3 ClientDynamoDB Client
1
WRITE
fs metadata
READ
fs metadata
WRITE
object data
LIST
object info
LIST
object data
Write Path
Read Path 2 3
1 2
2 1 1 2 3
S 3 P RO B L E M S :
S L OW R E N A M E S
S TO R AG E


¿Job freeze?
#SAISExp2
S 3 P RO B L E M S :
S L OW R E N A M E S
S TO R AG E
New Hadoop 3.1 S3A committers:
• Directory
• Partitioned
• Magic
#SAISExp2
P R O C E S S
INGEST PROCESS
Batch
DISCOVER
Query
Data exploitation
Orchestration
StreamData Ingestion
Storage
#SAISExp2
R E A L T I M E U S E R S E G M E N TAT I O N
S T R E A M
Stream Journal
User bucket’s changed
1 2
#SAISExp2
R E A L T I M E U S E R S E G M E N TAT I O N
S T R E A M
JOURNAL
User bucket’s changed
1 2
STREAM
#SAISExp2
R E A L T I M E PAT T E R N D E T E C T I O N
S T R E A M
Is it still available?
What condition is it in?
Could we meet at……?
Is the price negotiable?
I offer you….$
#SAISExp2
R E A L T I M E PAT T E R N D E T E C T I O N
S T R E A M
{
"type": "meeting_proposal",
"properties": {
"location_name": “Letgo HQ",
"geo": {
"lat": "41.390205",
"lon": "2.154007"
},
"date": "1511193820350",
"meeting_id": "23213213213"
}
}
Structured data
#SAISExp2
R E A L T I M E PAT T E R N D E T E C T I O N
S T R E A M
Meeting proposed + meeting accepted = emit accepted-meeting event
Meeting proposed + nothing in X time = “You have a proposal to meet”
#SAISExp2
Data Ingestion
Storage
INGEST PROCESS DISCOVER
Query
Data exploitation
Orchestration
Stream
Batch
#SAISExp2
G E O DATA E N R I C H M E N T
B AT C H
#SAISExp2
G E O DATA E N R I C H M E N T
B AT C H
{
"data": {
"id": "105dg3272-8e5f-426f-
bca0-704e98552961",
"type": "some_event",
"attributes": {
"latitude": 42.3677203,
"longitude": -83.1186093
}
},
"meta": {
"created_at": 1522886400036
}
}


Technically correct but
not very actionable
#SAISExp2
G E O DATA E N R I C H M E N T
B AT C H
•City: Detroit
•Postal code: 48206
•State: Michigan
•DMA: Detroit
•Country: US
What we know:
(42.3677203, -83.1186093)
#SAISExp2
G E O DATA E N R I C H M E N T
B AT C H
How we do it:
• Populating JTS indices from WKT polygon data
• Custom Spark SQL UDF
SELECT geodata.dma_name,
geodata.dma_number AS dma_number,
geodata.city AS city,
geodata.state AS state ,
geodata.zip_code AS zip_code
FROM (
SELECT
geodata(longitude, latitude) AS geodata
FROM ….
)


#SAISExp2
D I S C O V E R
Data Ingestion
Storage
INGEST PROCESS DISCOVER
Data exploitation
Orchestration
Stream
Batch
Query
#SAISExp2
Stream Journal
User bucket’s changed
1 2
QU E RY I N G DATA
QU E RY
#SAISExp2
Stream Journal
User bucket’s changed
1 2
QU E RY I N G DATA
QU E RY
#SAISExp2
Stream Journal
User bucket’s changed
1 2
QU E RY I N G DATA
QU E RY
Amazon Aurora
QU E RY I N G DATA
QU E RY
METASTORE
SPECTRUM
QU E RY I N G DATA
QU E RY
#SAISExp2
QU E RY I N G DATA
QU E RY
METASTORE
Thrift Server
Amazon Aurora
QU E RY I N G DATA
QU E RY
CREATE TABLE IF NOT EXISTS
database_name.table_name(
some_column STRING,
...
dt DATE
)
USING json
PARTITIONED BY (`dt`)
CREATE TEMPORARY VIEW table_name
USING org.apache.spark.sql.cassandra
OPTIONS (
table "table_name",
keyspace "keyspace_name")
CREATE EXTERNAL TABLE IF NOT EXISTS
database_name.table_name(
some_column STRING...,
dt DATE
)
PARTITIONED BY (`dt`)
USING PARQUET
LOCATION 's3a://bucket-name/database_name/table_name'
CREATE TABLE IF NOT EXISTS database_name.table_name
using com.databricks.spark.redshift
options (
dbtable 'schema.redshift_table_name',
tempdir 's3a://redshift-temp/',
url 'jdbc:redshift://xxxx.redshift.amazonaws.com:5439/letgo?
user=xxx&password=xxx',
forward_spark_s3_credentials 'true')
QU E RY I N G DATA
QU E RY
CREATE TABLE …
STORED AS…
CREATE TABLE …
USING [parquet,json,csv…]
70%
Higher performance!
QU E RY I N G DATA : B AT C H E S W I T H S Q L
QU E RY
Creating the
table
Inserting data
1 2
#SAISExp2
QU E RY I N G DATA : B AT C H E S W I T H S Q L
QU E RY
Creating the
table
1 CREATE EXTERNAL TABLE IF NOT EXISTS database.some_name(
  user_id STRING,
  column_b STRING,
...
)
USING PARQUET
PARTITIONED BY (`dt` STRING)
LOCATION 's3a://example/some_table'
#SAISExp2
QU E RY I N G DATA : B AT C H E S W I T H S Q L
QU E RY
Inserting data
2 INSERT OVERWRITE TABLE database.some_name PARTITION(dt)
SELECT
user_id,
column_b,
dt
FROM other_table
...
#SAISExp2
QU E RY I N G DATA : B AT C H E S W I T H S Q L
QU E RY
Problem?
#SAISExp2
QU E RY I N G DATA : B AT C H E S W I T H S Q L
QU E RY

 200 files because default value of
“spark.sql.shuffle.partition”
QU E RY I N G DATA : B AT C H E S W I T H
S Q L
QU E RY
INSERT OVERWRITE TABLE database.some_name PARTITION(dt)
SELECT
user_id,
column_b,
dt
FROM other_table
...
?
#SAISExp2
QU E RY I N G DATA : B AT C H E S W I T H S Q L
QU E RY
?
DISTRIBUTE BY (dt):
Only one file not Sorted
CLUSTERED BY (dt, user_id, column_b):
Multiple files
DISTRIBUTE BY (dt) SORT BY (user_id, column_b):
Only one file sorted by user_id, column_b.
Good for joins using this properties.
#SAISExp2
QU E RY I N G DATA : B AT C H E S W I T H S Q L
QU E RY
INSERT OVERWRITE TABLE database.some_name
PARTITION(dt)
SELECT
user_id,
column_b,
dt
FROM other_table
...
DISTRIBUTE BY (dt) SORT BY (user_id)
#SAISExp2
QU E RY I N G DATA : B AT C H E S W I T H S Q L
QU E RY
#SAISExp2
Data Ingestion
Storage
INGEST PROCESS DISCOVER
Orchestration
Stream
Batch
Query
Data exploitation
#SAISExp2
O U R A N A LY T I C A L S TAC K
DATA E X P L O I TAT I O N
#SAISExp2
O U R A N A LY T I C A L S TAC K
DATA E X P L O I TAT I O N
Amazon Aurora
METASTORE
Thrift Server
DATA S C I E N T I S T S T E A M ?
DATA E X P L O I TAT I O N
#SAISExp2
DATA S C I E N T I S T S A S I S E E T H E M
DATA E X P L O I TAT I O N
#SAISExp2
DATA S C I E N T I S T S S I N S
DATA E X P L O I TAT I O N
#SAISExp2
DATA S C I E N T I S T S S I N S
DATA E X P L O I TAT I O N
Too many
small files!
#SAISExp2
DATA S C I E N T I S T S S I N S
DATA E X P L O I TAT I O N
Huge Query!
#SAISExp2
DATA S C I E N T I S T S S I N S
DATA E X P L O I TAT I O N
Too much
shuffle!
#SAISExp2
Data Ingestion
Storage
INGEST PROCESS DISCOVER
Stream
Batch
Query
Data exploitation
Orchestration
#SAISExp2
A I R F L OW
O R C H E S T R AT I O N
#SAISExp2
A I R F L OW
O R C H E S T R AT I O N
#SAISExp2
A I R F L OW
O R C H E S T R AT I O N
I’m happy!!
#SAISExp2
M OV I N G TO S TAT E L E S S
C L U S T E R
I ’ M S O R RY R I C A R D O. I ’ M
A F R A I D I C A N ’ T D O T H AT.
P L AT F O R M L I M I TAT I O N S
M OV I N G TO S TAT E L E S S C L U S T E R
#SAISExp2
P L A N N I N G T H E S O L U T I O N
M OV I N G TO S TAT E L E S S C L U S T E R
#SAISExp2
P L A N N I N G T H E S O L U T I O N
M OV I N G TO S TAT E L E S S C L U S T E R
#SAISExp2
P L A N N I N G T H E S O L U T I O N
M OV I N G TO S TAT E L E S S C L U S T E R
#SAISExp2
A L O N G A N D W I N D I N G P RO C E S S
M OV I N G TO S TAT E L E S S C L U S T E R
#SAISExp2
A L O N G A N D W I N D I N G P RO C E S S
M OV I N G TO S TAT E L E S S C L U S T E R
#SAISExp2
N E W C A PA B I L I T I E S O F T H E
P L AT F O R M
M OV I N G TO S TAT E L E S S C L U S T E R
#SAISExp2
N E W C A PA B I L I T I E S O F T H E
P L AT F O R M
M OV I N G TO S TAT E L E S S C L U S T E R
#SAISExp2
J U P I T E R A N D B E YO N D
T H E I N F I N I T E
S O M E T H I N G WO N D E R F U L W I L L
H A P P E N
J U P I T E R A N D B E YO N D T H E I N F I N I T E
#SAISExp2
R E V E A L
J U P I T E R A N D B E YO N D T H E I N F I N I T E
https://youtu.be/CInMDMuSFwc
– A R T H U R C . C L A R K E
“The only way to discover the limits of the possible
is to go beyond them into the impossible.”
#SAISExp2
D O YO U WA N T TO J O I N U S ?
T H E F U T U R E … ?
https://boards.greenhouse.io/letgo

More Related Content

What's hot

Let’s spread Phishing and escape the blocklists
Let’s spread Phishing and escape the blocklistsLet’s spread Phishing and escape the blocklists
Let’s spread Phishing and escape the blocklistsAndrea Draghetti
 
Smart Contracts Technical Overview - Meetup Roma - 17/09/19
Smart Contracts Technical Overview - Meetup Roma - 17/09/19Smart Contracts Technical Overview - Meetup Roma - 17/09/19
Smart Contracts Technical Overview - Meetup Roma - 17/09/19Federico Tenga
 
19. stretnutie komunity kubernetes
19. stretnutie komunity kubernetes19. stretnutie komunity kubernetes
19. stretnutie komunity kubernetesJuraj Hantak
 
Logs Are Magic: Why Git Workflows and Commit Structure Should Matter To You
Logs Are Magic: Why Git Workflows and Commit Structure Should Matter To YouLogs Are Magic: Why Git Workflows and Commit Structure Should Matter To You
Logs Are Magic: Why Git Workflows and Commit Structure Should Matter To YouJohn Anderson
 
Uncovering and Visualizing Malicious Infrastructure
Uncovering and Visualizing Malicious InfrastructureUncovering and Visualizing Malicious Infrastructure
Uncovering and Visualizing Malicious InfrastructureAndrea Scarfo
 
On the development and distribution of R packages
On the development and distribution of R packagesOn the development and distribution of R packages
On the development and distribution of R packagesTom Mens
 
April Wensel - Crafting Compassionate Code
April Wensel - Crafting Compassionate CodeApril Wensel - Crafting Compassionate Code
April Wensel - Crafting Compassionate CodeApril Wensel
 
The net is dark and full of terrors - James Bennett
The net is dark and full of terrors - James BennettThe net is dark and full of terrors - James Bennett
The net is dark and full of terrors - James BennettLeo Zhou
 
Decoupled APIs through Microservices
Decoupled APIs through MicroservicesDecoupled APIs through Microservices
Decoupled APIs through MicroservicesDavid Simons
 
Footprinting-and-the-basics-of-hacking
Footprinting-and-the-basics-of-hackingFootprinting-and-the-basics-of-hacking
Footprinting-and-the-basics-of-hackingSathishkumar A
 
Drupal Decoupled on ARTE
Drupal Decoupled on ARTEDrupal Decoupled on ARTE
Drupal Decoupled on ARTEjolidog
 
Bristol Uni - Use Cases of NoSQL
Bristol Uni - Use Cases of NoSQLBristol Uni - Use Cases of NoSQL
Bristol Uni - Use Cases of NoSQLDavid Simons
 
Statistical Programming with JavaScript
Statistical Programming with JavaScriptStatistical Programming with JavaScript
Statistical Programming with JavaScriptDavid Simons
 
Python for Data-driven Storytelling
Python for Data-driven StorytellingPython for Data-driven Storytelling
Python for Data-driven StorytellingHamlet Batista
 
Lyhyt somepala - sosiaalisen median analytiikkaa ja päivitysten tekoa
Lyhyt somepala - sosiaalisen median analytiikkaa ja päivitysten tekoaLyhyt somepala - sosiaalisen median analytiikkaa ja päivitysten tekoa
Lyhyt somepala - sosiaalisen median analytiikkaa ja päivitysten tekoaMaria Rajakallio
 

What's hot (16)

Jenkins 20
Jenkins 20Jenkins 20
Jenkins 20
 
Let’s spread Phishing and escape the blocklists
Let’s spread Phishing and escape the blocklistsLet’s spread Phishing and escape the blocklists
Let’s spread Phishing and escape the blocklists
 
Smart Contracts Technical Overview - Meetup Roma - 17/09/19
Smart Contracts Technical Overview - Meetup Roma - 17/09/19Smart Contracts Technical Overview - Meetup Roma - 17/09/19
Smart Contracts Technical Overview - Meetup Roma - 17/09/19
 
19. stretnutie komunity kubernetes
19. stretnutie komunity kubernetes19. stretnutie komunity kubernetes
19. stretnutie komunity kubernetes
 
Logs Are Magic: Why Git Workflows and Commit Structure Should Matter To You
Logs Are Magic: Why Git Workflows and Commit Structure Should Matter To YouLogs Are Magic: Why Git Workflows and Commit Structure Should Matter To You
Logs Are Magic: Why Git Workflows and Commit Structure Should Matter To You
 
Uncovering and Visualizing Malicious Infrastructure
Uncovering and Visualizing Malicious InfrastructureUncovering and Visualizing Malicious Infrastructure
Uncovering and Visualizing Malicious Infrastructure
 
On the development and distribution of R packages
On the development and distribution of R packagesOn the development and distribution of R packages
On the development and distribution of R packages
 
April Wensel - Crafting Compassionate Code
April Wensel - Crafting Compassionate CodeApril Wensel - Crafting Compassionate Code
April Wensel - Crafting Compassionate Code
 
The net is dark and full of terrors - James Bennett
The net is dark and full of terrors - James BennettThe net is dark and full of terrors - James Bennett
The net is dark and full of terrors - James Bennett
 
Decoupled APIs through Microservices
Decoupled APIs through MicroservicesDecoupled APIs through Microservices
Decoupled APIs through Microservices
 
Footprinting-and-the-basics-of-hacking
Footprinting-and-the-basics-of-hackingFootprinting-and-the-basics-of-hacking
Footprinting-and-the-basics-of-hacking
 
Drupal Decoupled on ARTE
Drupal Decoupled on ARTEDrupal Decoupled on ARTE
Drupal Decoupled on ARTE
 
Bristol Uni - Use Cases of NoSQL
Bristol Uni - Use Cases of NoSQLBristol Uni - Use Cases of NoSQL
Bristol Uni - Use Cases of NoSQL
 
Statistical Programming with JavaScript
Statistical Programming with JavaScriptStatistical Programming with JavaScript
Statistical Programming with JavaScript
 
Python for Data-driven Storytelling
Python for Data-driven StorytellingPython for Data-driven Storytelling
Python for Data-driven Storytelling
 
Lyhyt somepala - sosiaalisen median analytiikkaa ja päivitysten tekoa
Lyhyt somepala - sosiaalisen median analytiikkaa ja päivitysten tekoaLyhyt somepala - sosiaalisen median analytiikkaa ja päivitysten tekoa
Lyhyt somepala - sosiaalisen median analytiikkaa ja päivitysten tekoa
 

Similar to Designing a Horizontally Scalable Event-Driven Big Data Architecture with Apache Spark

Meteor - not just for rockstars
Meteor - not just for rockstarsMeteor - not just for rockstars
Meteor - not just for rockstarsStephan Hochhaus
 
Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016 Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016 DISID
 
Choosing the right database
Choosing the right databaseChoosing the right database
Choosing the right databaseDavid Simons
 
Choosing the Right Database
Choosing the Right DatabaseChoosing the Right Database
Choosing the Right DatabaseDavid Simons
 
Data Modelling at Scale
Data Modelling at ScaleData Modelling at Scale
Data Modelling at ScaleDavid Simons
 
Parcours de professionnalisation
Parcours de professionnalisationParcours de professionnalisation
Parcours de professionnalisationSbastienGuritey
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersMarina Kolpakova
 
Writing (Meteor) Code With Style
Writing (Meteor) Code With StyleWriting (Meteor) Code With Style
Writing (Meteor) Code With StyleStephan Hochhaus
 
Innovations and trends in Cloud. Connectfest Porto 2019
Innovations and trends in Cloud. Connectfest Porto 2019Innovations and trends in Cloud. Connectfest Porto 2019
Innovations and trends in Cloud. Connectfest Porto 2019javier ramirez
 
SEO orientado a Ventas - DSMVALENCIA 2017
SEO orientado a Ventas - DSMVALENCIA 2017SEO orientado a Ventas - DSMVALENCIA 2017
SEO orientado a Ventas - DSMVALENCIA 2017Luis M Villanueva
 
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]New Relic
 
Whitepaper - Spinbackup Overview
Whitepaper - Spinbackup OverviewWhitepaper - Spinbackup Overview
Whitepaper - Spinbackup OverviewSpinbackup
 
Apache Spark: the next big thing? - StampedeCon 2014
Apache Spark: the next big thing? - StampedeCon 2014Apache Spark: the next big thing? - StampedeCon 2014
Apache Spark: the next big thing? - StampedeCon 2014StampedeCon
 
AI Deeplearning Programming
AI Deeplearning ProgrammingAI Deeplearning Programming
AI Deeplearning ProgrammingPaulSombat
 
Drones - What's next?
Drones - What's next?Drones - What's next?
Drones - What's next?Speck&Tech
 

Similar to Designing a Horizontally Scalable Event-Driven Big Data Architecture with Apache Spark (20)

Meteor - not just for rockstars
Meteor - not just for rockstarsMeteor - not just for rockstars
Meteor - not just for rockstars
 
Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016 Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016
 
Fast api
Fast apiFast api
Fast api
 
Meteor WWNRW Intro
Meteor WWNRW IntroMeteor WWNRW Intro
Meteor WWNRW Intro
 
Choosing the right database
Choosing the right databaseChoosing the right database
Choosing the right database
 
Choosing the Right Database
Choosing the Right DatabaseChoosing the Right Database
Choosing the Right Database
 
airflow_aws_snow.pptx
airflow_aws_snow.pptxairflow_aws_snow.pptx
airflow_aws_snow.pptx
 
Data Modelling at Scale
Data Modelling at ScaleData Modelling at Scale
Data Modelling at Scale
 
Parcours de professionnalisation
Parcours de professionnalisationParcours de professionnalisation
Parcours de professionnalisation
 
eHarmony @ Phoenix Con 2016
eHarmony @ Phoenix Con 2016eHarmony @ Phoenix Con 2016
eHarmony @ Phoenix Con 2016
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limiters
 
Writing (Meteor) Code With Style
Writing (Meteor) Code With StyleWriting (Meteor) Code With Style
Writing (Meteor) Code With Style
 
GraphQL, l'avenir du REST ?
GraphQL, l'avenir du REST ?GraphQL, l'avenir du REST ?
GraphQL, l'avenir du REST ?
 
Innovations and trends in Cloud. Connectfest Porto 2019
Innovations and trends in Cloud. Connectfest Porto 2019Innovations and trends in Cloud. Connectfest Porto 2019
Innovations and trends in Cloud. Connectfest Porto 2019
 
SEO orientado a Ventas - DSMVALENCIA 2017
SEO orientado a Ventas - DSMVALENCIA 2017SEO orientado a Ventas - DSMVALENCIA 2017
SEO orientado a Ventas - DSMVALENCIA 2017
 
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
 
Whitepaper - Spinbackup Overview
Whitepaper - Spinbackup OverviewWhitepaper - Spinbackup Overview
Whitepaper - Spinbackup Overview
 
Apache Spark: the next big thing? - StampedeCon 2014
Apache Spark: the next big thing? - StampedeCon 2014Apache Spark: the next big thing? - StampedeCon 2014
Apache Spark: the next big thing? - StampedeCon 2014
 
AI Deeplearning Programming
AI Deeplearning ProgrammingAI Deeplearning Programming
AI Deeplearning Programming
 
Drones - What's next?
Drones - What's next?Drones - What's next?
Drones - What's next?
 

Recently uploaded

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 

Recently uploaded (20)

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 

Designing a Horizontally Scalable Event-Driven Big Data Architecture with Apache Spark

  • 1. Ricardo Fanjul, letgo Designing a Horizontally Scalable Event-Driven Big Data Architecture with Apache Spark #SAISExp2
  • 2. 2 0 1 8 : DATA O DY S S E Y
  • 4. Founded in 2015 100MM+ downloads 400MM+ listings #SAISExp2
  • 5. L E T G O D ATA P L AT F O R M I N N U M B E R S 500GB Data daily Events Processed Daily 1 billion 50K Peaks of events per Second 600+ Event Types 200TB Storage (S3) < 1sec NRT Processing Time #SAISExp2
  • 6. T H E DAW N O F L E T G O
  • 7. #SAISExp2 C L A S S I C A L B I P L AT F O R M T H E D A W N O F L E T G O #SAISExp2
  • 8. C L A S S I C A L B I P L AT F O R M T H E D A W N O F L E T G O #SAISExp2
  • 9. M OV I N G TO - S E RV I C E S A N D E V E N T S T H E D A W N O F L E T G O μ #SAISExp2
  • 10. M OV I N G TO - S E RV I C E S A N D E V E N T S T H E D A W N O F L E T G O Domain EventsTracking Events μ #SAISExp2
  • 12. I N G E S T
  • 13. Data Ingestion Storage INGEST PROCESS Stream Batch DISCOVER Query Data exploitation Orchestration Data Ingestion #SAISExp2
  • 14. O U R G OA L DATA I N G E S T I O N #SAISExp2
  • 15. T H E D I S C OV E RY DATA I N G E S T I O N #SAISExp2
  • 16. K A F K A C O N N E C T DATA I N G E S T I O N Amazon Aurora #SAISExp2
  • 17.
  • 18. T H E J O U R N E Y B E G I N S
  • 19. Data Ingestion INGEST PROCESS Stream Batch DISCOVER Query Data exploitation Orchestration Storage #SAISExp2
  • 20. B U I L D I N G T H E DATA L A K E S TO R AG E #SAISExp2
  • 21. B U I L D I N G T H E DATA L A K E S TO R AG E We want to store all events coming from Kafka to S3. #SAISExp2
  • 22. B U I L D I N G T H E DATA L A K E S TO R AG E #SAISExp2
  • 23. S O M E T I M E S … S H I T H A P P E N S S TO R AG E #SAISExp2
  • 24. D U P L I C AT E D E V E N T S S TO R AG E #SAISExp2
  • 25. D U P L I C AT E D E V E N T S S TO R AG E #SAISExp2
  • 26. ( V E RY ) L AT E E V E N T S S TO R AG E #SAISExp2
  • 27. ( V E RY ) L AT E E V E N T S S TO R AG E 1 2 3 4 5 6 Dirty Buckets 1 Read batch of events from Kafka. 2 Write each event to Cassandra. 3 Write “dirty hours” to compact topic: Key=(event_type, hour). 4 Read dirty “hours” topic. 5 Read all events with dirty hours. 6 Store in S3 #SAISExp2
  • 28. S 3 P RO B L E M S S TO R AG E #SAISExp2
  • 29. S 3 P RO B L E M S S TO R AG E 1. Eventual consistency 2. Very slow renames S O M E S 3 B I G DATA P RO B L E M S :
  • 30. S 3 P RO B L E M S : E V E N T UA L C O N S I S T E N C Y S TO R AG E #SAISExp2
  • 31. S 3 P RO B L E M S : E V E N T UA L C O N S I S T E N C Y S TO R AG E S3GUARD S3AFileSystem The Hadoop FileSystem for Amazon S3 FileSystem Operations S3 ClientDynamoDB Client 1 WRITE fs metadata READ fs metadata WRITE object data LIST object info LIST object data Write Path Read Path 2 3 1 2 2 1 1 2 3
  • 32. S 3 P RO B L E M S : S L OW R E N A M E S S TO R AG E 
 ¿Job freeze? #SAISExp2
  • 33. S 3 P RO B L E M S : S L OW R E N A M E S S TO R AG E New Hadoop 3.1 S3A committers: • Directory • Partitioned • Magic #SAISExp2
  • 34. P R O C E S S
  • 36. R E A L T I M E U S E R S E G M E N TAT I O N S T R E A M Stream Journal User bucket’s changed 1 2 #SAISExp2
  • 37. R E A L T I M E U S E R S E G M E N TAT I O N S T R E A M JOURNAL User bucket’s changed 1 2 STREAM #SAISExp2
  • 38. R E A L T I M E PAT T E R N D E T E C T I O N S T R E A M Is it still available? What condition is it in? Could we meet at……? Is the price negotiable? I offer you….$ #SAISExp2
  • 39. R E A L T I M E PAT T E R N D E T E C T I O N S T R E A M { "type": "meeting_proposal", "properties": { "location_name": “Letgo HQ", "geo": { "lat": "41.390205", "lon": "2.154007" }, "date": "1511193820350", "meeting_id": "23213213213" } } Structured data #SAISExp2
  • 40. R E A L T I M E PAT T E R N D E T E C T I O N S T R E A M Meeting proposed + meeting accepted = emit accepted-meeting event Meeting proposed + nothing in X time = “You have a proposal to meet” #SAISExp2
  • 41. Data Ingestion Storage INGEST PROCESS DISCOVER Query Data exploitation Orchestration Stream Batch #SAISExp2
  • 42. G E O DATA E N R I C H M E N T B AT C H #SAISExp2
  • 43. G E O DATA E N R I C H M E N T B AT C H { "data": { "id": "105dg3272-8e5f-426f- bca0-704e98552961", "type": "some_event", "attributes": { "latitude": 42.3677203, "longitude": -83.1186093 } }, "meta": { "created_at": 1522886400036 } } 
 Technically correct but not very actionable #SAISExp2
  • 44. G E O DATA E N R I C H M E N T B AT C H •City: Detroit •Postal code: 48206 •State: Michigan •DMA: Detroit •Country: US What we know: (42.3677203, -83.1186093) #SAISExp2
  • 45. G E O DATA E N R I C H M E N T B AT C H How we do it: • Populating JTS indices from WKT polygon data • Custom Spark SQL UDF SELECT geodata.dma_name, geodata.dma_number AS dma_number, geodata.city AS city, geodata.state AS state , geodata.zip_code AS zip_code FROM ( SELECT geodata(longitude, latitude) AS geodata FROM …. ) 
 #SAISExp2
  • 46. D I S C O V E R
  • 47. Data Ingestion Storage INGEST PROCESS DISCOVER Data exploitation Orchestration Stream Batch Query #SAISExp2
  • 48. Stream Journal User bucket’s changed 1 2 QU E RY I N G DATA QU E RY #SAISExp2
  • 49. Stream Journal User bucket’s changed 1 2 QU E RY I N G DATA QU E RY #SAISExp2
  • 50. Stream Journal User bucket’s changed 1 2 QU E RY I N G DATA QU E RY Amazon Aurora
  • 51. QU E RY I N G DATA QU E RY METASTORE SPECTRUM
  • 52. QU E RY I N G DATA QU E RY #SAISExp2
  • 53. QU E RY I N G DATA QU E RY METASTORE Thrift Server Amazon Aurora
  • 54. QU E RY I N G DATA QU E RY CREATE TABLE IF NOT EXISTS database_name.table_name( some_column STRING, ... dt DATE ) USING json PARTITIONED BY (`dt`) CREATE TEMPORARY VIEW table_name USING org.apache.spark.sql.cassandra OPTIONS ( table "table_name", keyspace "keyspace_name") CREATE EXTERNAL TABLE IF NOT EXISTS database_name.table_name( some_column STRING..., dt DATE ) PARTITIONED BY (`dt`) USING PARQUET LOCATION 's3a://bucket-name/database_name/table_name' CREATE TABLE IF NOT EXISTS database_name.table_name using com.databricks.spark.redshift options ( dbtable 'schema.redshift_table_name', tempdir 's3a://redshift-temp/', url 'jdbc:redshift://xxxx.redshift.amazonaws.com:5439/letgo? user=xxx&password=xxx', forward_spark_s3_credentials 'true')
  • 55. QU E RY I N G DATA QU E RY CREATE TABLE … STORED AS… CREATE TABLE … USING [parquet,json,csv…] 70% Higher performance!
  • 56. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY Creating the table Inserting data 1 2 #SAISExp2
  • 57. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY Creating the table 1 CREATE EXTERNAL TABLE IF NOT EXISTS database.some_name(   user_id STRING,   column_b STRING, ... ) USING PARQUET PARTITIONED BY (`dt` STRING) LOCATION 's3a://example/some_table' #SAISExp2
  • 58. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY Inserting data 2 INSERT OVERWRITE TABLE database.some_name PARTITION(dt) SELECT user_id, column_b, dt FROM other_table ... #SAISExp2
  • 59. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY Problem? #SAISExp2
  • 60. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY 
 200 files because default value of “spark.sql.shuffle.partition”
  • 61. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY INSERT OVERWRITE TABLE database.some_name PARTITION(dt) SELECT user_id, column_b, dt FROM other_table ... ? #SAISExp2
  • 62. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY ? DISTRIBUTE BY (dt): Only one file not Sorted CLUSTERED BY (dt, user_id, column_b): Multiple files DISTRIBUTE BY (dt) SORT BY (user_id, column_b): Only one file sorted by user_id, column_b. Good for joins using this properties. #SAISExp2
  • 63. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY INSERT OVERWRITE TABLE database.some_name PARTITION(dt) SELECT user_id, column_b, dt FROM other_table ... DISTRIBUTE BY (dt) SORT BY (user_id) #SAISExp2
  • 64. QU E RY I N G DATA : B AT C H E S W I T H S Q L QU E RY #SAISExp2
  • 65. Data Ingestion Storage INGEST PROCESS DISCOVER Orchestration Stream Batch Query Data exploitation #SAISExp2
  • 66. O U R A N A LY T I C A L S TAC K DATA E X P L O I TAT I O N #SAISExp2
  • 67. O U R A N A LY T I C A L S TAC K DATA E X P L O I TAT I O N Amazon Aurora METASTORE Thrift Server
  • 68. DATA S C I E N T I S T S T E A M ? DATA E X P L O I TAT I O N #SAISExp2
  • 69. DATA S C I E N T I S T S A S I S E E T H E M DATA E X P L O I TAT I O N #SAISExp2
  • 70. DATA S C I E N T I S T S S I N S DATA E X P L O I TAT I O N #SAISExp2
  • 71. DATA S C I E N T I S T S S I N S DATA E X P L O I TAT I O N Too many small files! #SAISExp2
  • 72. DATA S C I E N T I S T S S I N S DATA E X P L O I TAT I O N Huge Query! #SAISExp2
  • 73. DATA S C I E N T I S T S S I N S DATA E X P L O I TAT I O N Too much shuffle! #SAISExp2
  • 74. Data Ingestion Storage INGEST PROCESS DISCOVER Stream Batch Query Data exploitation Orchestration #SAISExp2
  • 75. A I R F L OW O R C H E S T R AT I O N #SAISExp2
  • 76. A I R F L OW O R C H E S T R AT I O N #SAISExp2
  • 77. A I R F L OW O R C H E S T R AT I O N I’m happy!! #SAISExp2
  • 78. M OV I N G TO S TAT E L E S S C L U S T E R
  • 79. I ’ M S O R RY R I C A R D O. I ’ M A F R A I D I C A N ’ T D O T H AT.
  • 80. P L AT F O R M L I M I TAT I O N S M OV I N G TO S TAT E L E S S C L U S T E R #SAISExp2
  • 81. P L A N N I N G T H E S O L U T I O N M OV I N G TO S TAT E L E S S C L U S T E R #SAISExp2
  • 82. P L A N N I N G T H E S O L U T I O N M OV I N G TO S TAT E L E S S C L U S T E R #SAISExp2
  • 83. P L A N N I N G T H E S O L U T I O N M OV I N G TO S TAT E L E S S C L U S T E R #SAISExp2
  • 84. A L O N G A N D W I N D I N G P RO C E S S M OV I N G TO S TAT E L E S S C L U S T E R #SAISExp2
  • 85. A L O N G A N D W I N D I N G P RO C E S S M OV I N G TO S TAT E L E S S C L U S T E R #SAISExp2
  • 86. N E W C A PA B I L I T I E S O F T H E P L AT F O R M M OV I N G TO S TAT E L E S S C L U S T E R #SAISExp2
  • 87. N E W C A PA B I L I T I E S O F T H E P L AT F O R M M OV I N G TO S TAT E L E S S C L U S T E R #SAISExp2
  • 88. J U P I T E R A N D B E YO N D T H E I N F I N I T E
  • 89. S O M E T H I N G WO N D E R F U L W I L L H A P P E N J U P I T E R A N D B E YO N D T H E I N F I N I T E #SAISExp2
  • 90. R E V E A L J U P I T E R A N D B E YO N D T H E I N F I N I T E https://youtu.be/CInMDMuSFwc
  • 91. – A R T H U R C . C L A R K E “The only way to discover the limits of the possible is to go beyond them into the impossible.” #SAISExp2
  • 92. D O YO U WA N T TO J O I N U S ? T H E F U T U R E … ? https://boards.greenhouse.io/letgo