1
Letgo Data Platform
A global overview
2
Data Engineer
Ricardo Fanjul
3
LETGO DATA PLATFORM IN NUMBERS
• 500 GB of data daily
• 530+ event types
• 565M events processed daily
• 180 TB storage (S3)
• 8K events processed per second
• < 1 sec NRT processing time
4
OUR DATA JOURNEY
5
OUR DATA JOURNEY
6
MOVING TO EVENTS
7
MOVING TO EVENTS
Domain Events
Tracking Events
8
MOVING TO EVENTS
“A Domain Event captures the memory of something interesting (external stimulus) which affects the domain”
Martin Fowler
Domain Events
Tracking Events
9
GO TO EVENTS
{
  "data":{
    "id":"1c4da9f0-605e-4708-a8e7-0f2c97dff16e",
    "type":"message_sent",
    "attributes":{
      "id":"e0abcd6c-c027-489f-95bf-24796e421e8b",
      "conversation_id":"47596e76-0fb4-4155-9aeb-6a5ba14c9cef",
      "product_id":"a0ef64a5-0a4d-48b8-9124-dd57371128f5",
      "from_talker_id":"5mCK6K8VCc",
      "to_talker_ids":[
        "def4442e-6df5-4385-938a-8180ddfb6c5e"
      ],
      "message_id":"e0abcd6c-c027-489f-95bf-24796e421e8b",
      "type":"offer",
      "message":"Is this item still available?",
      "belongs_to_spam_conversation":false,
      "sent_at":1509235499995
    }
  },
  "meta":{
    "created_at":1509235500012,
    ...
  }
}
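To make the envelope shape explicit, here is a minimal sketch of how a consumer might model and parse it in Scala. The field names follow the JSON above, but the parsing library (circe) and the helper function are assumptions, not necessarily what letgo uses.

import io.circe.generic.auto._
import io.circe.parser.decode

case class Meta(created_at: Long)
case class EventData[A](id: String, `type`: String, attributes: A)
case class Envelope[A](data: EventData[A], meta: Meta)

case class MessageSent(
  id: String,
  conversation_id: String,
  product_id: String,
  from_talker_id: String,
  to_talker_ids: List[String],
  message_id: String,
  `type`: String,
  message: String,
  belongs_to_spam_conversation: Boolean,
  sent_at: Long
)

// Hypothetical helper: turns a raw Kafka value into a typed event.
def parseMessageSent(rawJson: String): Either[io.circe.Error, Envelope[MessageSent]] =
  decode[Envelope[MessageSent]](rawJson)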
10
THE PATH WE CHOSE
11
INGEST
Data Ingestion
Storage
PROCESSING
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
12
INGEST
Data Ingestion
Storage
PROCESSING
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
13
DATA INGESTION
14
DATA INGESTION: Kafka Connect
Kafka Connect: Connectors
15
DATA INGESTION
Event A:
Event …:
16
INGEST
Data Ingestion
Storage
PROCESSING
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
17
STORAGE
We want to store every event coming from Kafka in S3
18
STORAGE
19
20
Duplicated events
21
STORAGE: Deduplication
Deduplication
(Exactly-once)
22
STORAGE: Deduplication
23
Late events
24
STORAGE: Late events
25
STORAGE: Late events
Dirty Buckets
1. Read a batch of events from Kafka.
2. Write each event to Cassandra.
3. Write dirty “hours” to a compacted topic, with key = (event_type, hour).
4. Read the dirty “hours” topic.
5. Read all events with dirty hours.
6. Store them in S3.
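A compact sketch of steps 1–3 in Spark (batch read from Kafka, idempotent write to Cassandra, dirty-hour markers to a compacted topic). Broker address, topic, keyspace and column names are illustrative, and the schema is reduced to the fields the flow needs; the real jobs are more involved.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("kafka-to-cassandra").getOrCreate()
import spark.implicits._

val eventSchema = new StructType()
  .add("id", StringType).add("type", StringType).add("sent_at", LongType)

// 1. Read a batch of events from Kafka.
val batch = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "domain-events")
  .load()
  .select(from_json($"value".cast("string"), eventSchema).as("e"))
  .select($"e.*")

// 2. Write each event to Cassandra. The event id is the primary key,
//    so replayed duplicates simply overwrite themselves (the exactly-once trick).
batch.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "events", "table" -> "raw_events"))
  .mode("append")
  .save()

// 3. Mark every touched (event_type, hour) bucket as dirty in a compacted topic.
batch
  .select($"type".as("event_type"),
          date_trunc("hour", ($"sent_at" / 1000).cast("timestamp")).as("hour"))
  .distinct()
  .select(concat_ws("|", $"event_type", $"hour".cast("string")).as("key"),
          lit("dirty").as("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "dirty-hours")
  .save()

// Steps 4-6 are a second job: read the dirty-hours topic, re-read those hours
// from Cassandra, and rewrite the corresponding S3 partitions.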
26
STORAGE: Late events
Some time later: count the events in Cassandra and S3 and check for inconsistencies.
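A sketch of that check as a Spark job, assuming both copies keep the same columns as the sketch above; table and bucket names are illustrative.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("consistency-check").getOrCreate()
import spark.implicits._

// Count events per (event_type, hour) bucket.
def hourlyCounts(df: DataFrame): DataFrame =
  df.groupBy($"type".as("event_type"),
             date_trunc("hour", ($"sent_at" / 1000).cast("timestamp")).as("hour"))
    .count()

val cassandraCounts = hourlyCounts(
  spark.read.format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "events", "table" -> "raw_events"))
    .load()
).withColumnRenamed("count", "cassandra_count")

val s3Counts = hourlyCounts(spark.read.parquet("s3a://bucket-name/events/"))
  .withColumnRenamed("count", "s3_count")

// Buckets where the two counts differ have to be re-exported to S3.
val inconsistent = cassandraCounts
  .join(s3Counts, Seq("event_type", "hour"), "full_outer")
  .where(coalesce($"cassandra_count", lit(0L)) =!= coalesce($"s3_count", lit(0L)))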
27
S3 Big data problems
28
STORAGE: S3 Big data problems
Some S3 Big Data problems:
1. Eventual consistency
2. Very slow renames. Why? A rename = copy + delete.
29
STORAGE: S3 Big data problems
1. Eventual consistency
30
STORAGE: S3 Big data problems
Solution: S3Guard
Available in:
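S3Guard keeps the bucket's listing metadata in DynamoDB so directory listings become consistent. The configuration looks roughly like the lines below; the exact keys and the DynamoDB table name depend on your Hadoop version and setup, so treat this as a sketch rather than the settings used at letgo.

fs.s3a.metadatastore.impl = org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore
fs.s3a.s3guard.ddb.table = s3guard-metadata
fs.s3a.s3guard.ddb.table.create = true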
31
STORAGE: S3 Big data problems
2. Slow renames
Job freeze?
32
STORAGE: S3 Big data problems
Solution
• New S3A committers (Hadoop 3.1)
33
STORAGE: S3 Big data problems
• Committers architecture
• S3A committers
• Spark and S3 with Ryan Blue: SlideShare, YouTube
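The new committers upload task output as S3 multipart uploads and only complete the uploads at job commit, so the copy-and-delete rename disappears. Enabling them from Spark looks roughly like this (it needs the spark-hadoop-cloud / PathOutputCommitter classes; keys vary by Spark and Hadoop version, so this is a sketch, not letgo's exact configuration):

spark.hadoop.fs.s3a.committer.name = directory
spark.sql.sources.commitProtocolClass = org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class = org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter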
34
STORAGE: S3 Big data problems
Surprise!
35
INGEST
Data Ingestion
Storage
PROCESSING
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
36
REAL-TIME USER
SEGMENTATION
37
STREAMING: REAL-TIME USER SEGMENTATION
Stream Journal
User bucket’s changed
38
REAL-TIME USER
STATISTICS
39
STREAMING: REAL-TIME USER STATISTICS
What we do:
• Sessions (Last session)
• Conversation (Last conversation created, counts, …)
• Products (Last product approved, counts, ...)
• User classification (Last bucket changed, counts, ...)
Tool:
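The tool logo on this slide does not survive the text export, so purely as an illustration, here is what per-user running statistics could look like with Spark Structured Streaming; topic, field and sink choices are assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("user-stats").getOrCreate()
import spark.implicits._

val eventSchema = new StructType()
  .add("user_id", StringType).add("type", StringType).add("sent_at", LongType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "domain-events")
  .load()
  .select(from_json($"value".cast("string"), eventSchema).as("e"))
  .select($"e.user_id", $"e.type", $"e.sent_at")

// Running counts plus "last seen" per user and event type.
val stats = events
  .groupBy($"user_id", $"type")
  .agg(count("*").as("event_count"), max($"sent_at").as("last_event_at"))

stats.writeStream
  .outputMode("update")
  .format("console") // in production this would be a Cassandra/Redis-style sink
  .start()
  .awaitTermination()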
40
REAL-TIME PATTERN
DETECTION
41
STREAMING: REAL-TIME PATTERN DETECTION
Is it still available?
Is the price negotiable?
What condition is it in?
I offer you….$
Could we meet at……?
42
STREAMING: REAL-TIME PATTERN DETECTION
Event A + Event B = Complex Event C
Event A + NOTHING in 2 hours = Complex Event D
Some common use cases:
• Fraud detection, scammer detection?
• Real-time recommendations
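The extracted slides do not name the engine behind these rules; as one way to express them, here is a small sketch with Flink CEP (Scala API). The event fields and the "offer_made" follow-up event are hypothetical.

import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class ChatEvent(conversationId: String, eventType: String, ts: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment

// In production this would be a Kafka source; a fixed list keeps the sketch self-contained.
val events: DataStream[ChatEvent] = env.fromElements(
  ChatEvent("c1", "message_sent", 1L),
  ChatEvent("c1", "offer_made", 2L)
)

// "Event A followed by Event B within 2 hours" => complex event C.
val pattern = Pattern
  .begin[ChatEvent]("ask").where(_.eventType == "message_sent")
  .followedBy("offer").where(_.eventType == "offer_made")
  .within(Time.hours(2))

val complexEvents: DataStream[String] =
  CEP.pattern(events.keyBy(_.conversationId), pattern)
    .select(m => s"Complex event C in conversation ${m("ask").head.conversationId}")

complexEvents.print()
env.execute("pattern-detection")

// "Event A + NOTHING in 2 hours" (complex event D) is the timed-out side of the same
// pattern; it can be captured with the select variant that takes a timeout handler.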
43
GO TO EVENTS: Tips
INGEST
Data Ingestion
Storage
PROCESSING
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
44
GEODATA ENRICHMENT
45
BATCH: Geodata enrichment
{
  "data": {
    "id": "105dg3272-8e5f-426f-bca0-704e98552961",
    "type": "some_event",
    "attributes": {
      "latitude": 39.740028705904,
      "longitude": -104.97341156236
    }
  },
  "meta": {
    "created_at": 1522886400036
  }
}
46
BATCH: Geodata enrichment
What we know: coordinates (longitude and latitude).
POINT (77.3548351 28.6973627)
47
BATCH: Geodata enrichment
Where is this point? City, state, zip code, DMA.
48
BATCH: Geodata enrichment
What we wanted to know:
We represent states, cities, zip codes and DMAs as polygons:
POLYGON((-114.816294 32.508038, -114.814321 32.509023, -114.810159 32.508383, -114.807726 32.508726, -114.805239 32.509985...
49
BATCH: Geodata enrichment
How we do it (see the sketch below):
• We load city, state, zip code, DMA… polygons from well-known text (WKT)
• Create spatial indexes using the JTS Topology Suite
• Expose the lookup as a custom Spark SQL UDF
Tool:
SELECT geodata.dma_name,
       geodata.dma_number AS dma_number,
       geodata.city AS city,
       geodata.state AS state,
       geodata.zip_code AS zip_code
FROM (
  SELECT
    geodata(longitude, latitude) AS geodata
  FROM ….
)
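A rough sketch of that UDF in Scala, assuming the org.locationtech.jts packaging of the JTS Topology Suite; the polygon list, the Geodata fields and the registration code are illustrative, not letgo's actual implementation.

import org.apache.spark.sql.SparkSession
import org.locationtech.jts.geom.{Coordinate, Geometry, GeometryFactory}
import org.locationtech.jts.index.strtree.STRtree
import org.locationtech.jts.io.WKTReader
import scala.collection.JavaConverters._

case class Geodata(city: String, state: String, zip_code: String,
                   dma_name: String, dma_number: String)

object GeodataUdf {
  private val factory = new GeometryFactory()
  private val reader  = new WKTReader(factory)

  // Hypothetical reference data: (WKT polygon, attributes to attach).
  private val polygons: Seq[(String, Geodata)] = Seq(
    ("POLYGON((-105.1 39.6, -104.6 39.6, -104.6 39.9, -105.1 39.9, -105.1 39.6))",
     Geodata("Denver", "CO", "80202", "Denver", "751"))
  )

  // Spatial index: each point is only tested against a few candidate polygons.
  private lazy val index: STRtree = {
    val tree = new STRtree()
    polygons.foreach { case (wkt, meta) =>
      val geom = reader.read(wkt)
      tree.insert(geom.getEnvelopeInternal, (geom, meta))
    }
    tree
  }

  def lookup(longitude: Double, latitude: Double): Option[Geodata] = {
    val point = factory.createPoint(new Coordinate(longitude, latitude))
    index.query(point.getEnvelopeInternal).asScala.collectFirst {
      case (geom: Geometry, meta: Geodata) if geom.contains(point) => meta
    }
  }

  def register(spark: SparkSession): Unit =
    spark.udf.register("geodata", (lon: Double, lat: Double) => lookup(lon, lat).orNull)
}

// After GeodataUdf.register(spark), the SQL above can call geodata(longitude, latitude)
// and read geodata.city, geodata.state, geodata.zip_code, geodata.dma_name, geodata.dma_number.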
50
INGEST
Data Ingestion
Storage
PROCESSING
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
51
QUERYING DATA
52
QUERYING DATA
53
QUERYING DATA
Metastore
54
WHY WE CHOSE
SPARK THRIFT SERVER
55
QUERYING DATA: WHY WE CHOSE SPARK THRIFT SERVER
56
QUERYING DATA: WHY WE CHOSE SPARK THRIFT SERVER
57
QUERYING DATA: WHY WE CHOSE SPARK THRIFT SERVER
Thrift Server
58
QUERYING DATA
59
QUERYING DATA
CREATE EXTERNAL TABLE IF NOT EXISTS database_name.table_name(
  some_column STRING...,
  dt DATE
)
USING PARQUET
PARTITIONED BY (`dt`)
LOCATION 's3a://bucket-name/database_name/table_name'

CREATE TABLE IF NOT EXISTS database_name.table_name(
  some_column STRING,
  ...
  dt DATE
)
USING JSON
PARTITIONED BY (`dt`)

CREATE TABLE IF NOT EXISTS database_name.table_name
USING com.databricks.spark.redshift
OPTIONS (
  dbtable 'schema.redshift_table_name',
  tempdir 's3a://redshift-temp/',
  url 'jdbc:redshift://xxxx.redshift.amazonaws.com:5439/letgo?user=xxx&password=xxx',
  forward_spark_s3_credentials 'true'
)

CREATE TEMPORARY VIEW table_name
USING org.apache.spark.sql.cassandra
OPTIONS (
  table "table_name",
  keyspace "keyspace_name"
)
60
QUERYING DATA
CREATE TABLE … USING [parquet, json, csv…]
VS
CREATE TABLE … STORED AS …
61
QUERYING DATA
CREATE TABLE … STORED AS …
VS
CREATE TABLE … USING [parquet, json, csv…]
~70% higher performance with USING, since it reads through Spark's native data sources instead of the Hive SerDe path.
62
BATCHES WITH SQL
63
QUERYING DATA: Batches with SQL
1. Creating the table
CREATE EXTERNAL TABLE IF NOT EXISTS database.some_name(
  user_id STRING,
  column_b STRING,
  ...
  dt STRING
)
USING PARQUET
PARTITIONED BY (dt)
LOCATION 's3a://example/some_table'
64
QUERYING DATA: Batches with SQL
2. Inserting data
INSERT OVERWRITE TABLE database.some_name
PARTITION(dt)
SELECT
  user_id,
  column_b,
  dt
FROM other_table
...
65
QUERYING DATA: Batches with SQL
66
QUERYING DATA: Batches with SQL
200 files, because the default value of “spark.sql.shuffle.partitions” is 200.
67
QUERYING DATA: Batches with SQL
Solution
INSERT OVERWRITE TABLE database.some_name PARTITION(dt)
SELECT
  user_id,
  column_b,
  dt
FROM other_table
...
68
QUERYING DATA: Batches with SQL
DISTRIBUTE BY (dt):
Only one file per dt partition, not sorted.
CLUSTER BY (dt, user_id, column_b):
Multiple files.
DISTRIBUTE BY (dt) SORT BY (user_id, column_b):
Only one file per dt partition, sorted by user_id, column_b.
Good for joins using these columns.
69
QUERYING DATA: Batches with SQL
INSERT OVERWRITE TABLE database.some_name
PARTITION(dt)
SELECT
user_id,
column_b,
dt
FROM other_table
...
DISTRIBUTE BY (dt) SORT BY (user_id)
70
QUERYING DATA: Batches with SQL
• Hive Bucketing in Apache Spark: SlideShare, YouTube.
71
INGEST
Data Ingestion
Storage
PROCESSING
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
72
DATA EXPLOITATION
Thrift Server
Metastore
73
DATA EXPLOITATION
74
DATA EXPLOITATION: Superset
75
DATA SCIENTISTS
(WINTER IS COMING)
76
DATA EXPLOITATION: Data Scientists
77
DATA EXPLOITATION: Data Scientists
78
79
DATA EXPLOITATION: Data Scientists
Too many small files!
80
DATA EXPLOITATION: Data Scientists
Huge Query!
81
DATA EXPLOITATION: Data Scientists
Too much shuffle!
82
DATA EXPLOITATION: Data Scientists
83
Some data scientist tips
Listen carefully
or you will die!
84
DATA EXPLOITATION: Data Scientists
Cross-team workshops: BI, data scientists, data engineers.
Create a wiki with guidelines, tips, links...
Learn from your mistakes!
Well done!
85
INGEST
Data Ingestion
Storage
PROCESSING
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
86
ORCHESTRATION
• It’s an open-source project written in Python by Airbnb.
• Easy to schedule jobs, write new tasks and monitor them.
87
ORCHESTRATION
88
ORCHESTRATION
I’m happy!!
89
TIPS
90
Bigger is better
91
TIPS: Bigger is better
20 slots
Total used: 3x6x4=72
Total wasted: 2x4=8
Total used: 3x26=78
Total wasted: 2
80 slots
We prefer big Hadoop instances.
We use m4.16xlarge instances.
We are migrating to m5.24xlarge instances.
92
Job Scheduling
93
TIPS: Job Scheduling
94
Be polite
95
TIPS: Be polite
spark.executor.memory = 4G
spark.executor.cores = 3
spark.driver.memory = 4G
spark.driver.cores = 2
## dynamicAllocation
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true
spark.executor.instances=1
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.initialExecutors=3
spark.dynamicAllocation.maxExecutors=20
spark.dynamicAllocation.cachedExecutorIdleTimeout=60s
Others are using the same cluster, so don’t claim more resources than you need.
96
But not too polite
97
TIPS: But not too polite
98
Metrics, metrics!
99
TIPS: SPARK METRICS
What we did:
• Custom InfluxDB Spark metrics sink
• Custom Spark metrics using Dropwizard
• Grafana dashboards and alerts
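As an illustration of the Dropwizard part, a minimal registry with a counter and a timer; hooking the registry into Spark's metrics system and the InfluxDB sink is deployment-specific and omitted, and the metric names are made up.

import java.util.concurrent.TimeUnit
import com.codahale.metrics.{ConsoleReporter, MetricRegistry}

object JobMetrics {
  val registry = new MetricRegistry()
  val eventsWritten = registry.counter("storage.events_written")
  val batchTimer    = registry.timer("storage.batch_duration")

  // Console reporter for local runs; an InfluxDB/Graphite reporter would replace it in production.
  val reporter: ConsoleReporter = ConsoleReporter.forRegistry(registry)
    .convertRatesTo(TimeUnit.SECONDS)
    .convertDurationsTo(TimeUnit.MILLISECONDS)
    .build()
  reporter.start(30, TimeUnit.SECONDS)
}

// Usage inside a job:
// val timing = JobMetrics.batchTimer.time()
// ... write the batch ...
// JobMetrics.eventsWritten.inc(rowsWritten)
// timing.stop()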
100
Mistakes are good
101
TIPS: Mistakes are good
102
Do you want to
join us?
