1
Letgo Data Platform
A global overview
2
Data Engineer
Ricardo Fanjul
3
LETGO DATA PLATFORM IN NUMBERS
• 500 GB of data daily
• 530+ event types
• 565M events processed daily
• 180 TB storage (S3)
• 8K events processed per second
• < 1 sec NRT processing time
4
OUR DATA JOURNEY
5
OUR DATA JOURNEY
6
MOVING TO EVENTS
7
MOVING TO EVENTS
Domain Events
Tracking Events
8
MOVING TO EVENTS
“A Domain Event captures the memory of something interesting (external stimulus) which affects the domain”
Martin Fowler
Domain Events
Tracking Events
9
GO TO EVENTS
{
  "data":{
    "id":"1c4da9f0-605e-4708-a8e7-0f2c97dff16e",
    "type":"message_sent",
    "attributes":{
      "id":"e0abcd6c-c027-489f-95bf-24796e421e8b",
      "conversation_id":"47596e76-0fb4-4155-9aeb-6a5ba14c9cef",
      "product_id":"a0ef64a5-0a4d-48b8-9124-dd57371128f5",
      "from_talker_id":"5mCK6K8VCc",
      "to_talker_ids":[
        "def4442e-6df5-4385-938a-8180ddfb6c5e"
      ],
      "message_id":"e0abcd6c-c027-489f-95bf-24796e421e8b",
      "type":"offer",
      "message":"Is this item still available?",
      "belongs_to_spam_conversation":false,
      "sent_at":1509235499995
    }
  },
  "meta":{
    "created_at":1509235500012,
    ...
  }
}
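To make the envelope shape explicit, here is a minimal sketch of how a consumer might model and parse it in Scala. The field names follow the JSON above, but the parsing library (circe) and the helper function are assumptions, not necessarily what letgo uses.

import io.circe.generic.auto._
import io.circe.parser.decode

case class Meta(created_at: Long)
case class EventData[A](id: String, `type`: String, attributes: A)
case class Envelope[A](data: EventData[A], meta: Meta)

case class MessageSent(
  id: String,
  conversation_id: String,
  product_id: String,
  from_talker_id: String,
  to_talker_ids: List[String],
  message_id: String,
  `type`: String,
  message: String,
  belongs_to_spam_conversation: Boolean,
  sent_at: Long
)

// Hypothetical helper: turns a raw Kafka value into a typed event.
def parseMessageSent(rawJson: String): Either[io.circe.Error, Envelope[MessageSent]] =
  decode[Envelope[MessageSent]](rawJson)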
10
THE PATH WE CHOSE
11
INGEST
Data Ingestion
Storage
PROCESSING
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
12
INGEST
Data Ingestion
Storage
PROCESSING
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
13
DATA INGESTION
14
DATA INGESTION: Kafka Connect
Kafka Connect: Connectors
15
DATA INGESTION
Event A:
Event …:
16
INGEST
Data Ingestion
Storage
PROCESSING
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
17
STORAGE
We want to store every event coming from Kafka in S3
18
STORAGE
19
20
Duplicated events
21
STORAGE: Deduplication
Deduplication
(Exactly-once)
22
STORAGE: Deduplication
23
Late events
24
STORAGE: Late events
25
STORAGE: Late events
Dirty Buckets
1. Read a batch of events from Kafka.
2. Write each event to Cassandra.
3. Write dirty “hours” to a compacted topic, with key = (event_type, hour).
4. Read the dirty “hours” topic.
5. Read all events with dirty hours.
6. Store them in S3.
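A compact sketch of steps 1–3 in Spark (batch read from Kafka, idempotent write to Cassandra, dirty-hour markers to a compacted topic). Broker address, topic, keyspace and column names are illustrative, and the schema is reduced to the fields the flow needs; the real jobs are more involved.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("kafka-to-cassandra").getOrCreate()
import spark.implicits._

val eventSchema = new StructType()
  .add("id", StringType).add("type", StringType).add("sent_at", LongType)

// 1. Read a batch of events from Kafka.
val batch = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "domain-events")
  .load()
  .select(from_json($"value".cast("string"), eventSchema).as("e"))
  .select($"e.*")

// 2. Write each event to Cassandra. The event id is the primary key,
//    so replayed duplicates simply overwrite themselves (the exactly-once trick).
batch.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "events", "table" -> "raw_events"))
  .mode("append")
  .save()

// 3. Mark every touched (event_type, hour) bucket as dirty in a compacted topic.
batch
  .select($"type".as("event_type"),
          date_trunc("hour", ($"sent_at" / 1000).cast("timestamp")).as("hour"))
  .distinct()
  .select(concat_ws("|", $"event_type", $"hour".cast("string")).as("key"),
          lit("dirty").as("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "dirty-hours")
  .save()

// Steps 4-6 are a second job: read the dirty-hours topic, re-read those hours
// from Cassandra, and rewrite the corresponding S3 partitions.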
26
STORAGE: Late events
Some time later: count the events in Cassandra and S3 and check for inconsistencies.
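A sketch of that check as a Spark job, assuming both copies keep the same columns as the sketch above; table and bucket names are illustrative.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("consistency-check").getOrCreate()
import spark.implicits._

// Count events per (event_type, hour) bucket.
def hourlyCounts(df: DataFrame): DataFrame =
  df.groupBy($"type".as("event_type"),
             date_trunc("hour", ($"sent_at" / 1000).cast("timestamp")).as("hour"))
    .count()

val cassandraCounts = hourlyCounts(
  spark.read.format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "events", "table" -> "raw_events"))
    .load()
).withColumnRenamed("count", "cassandra_count")

val s3Counts = hourlyCounts(spark.read.parquet("s3a://bucket-name/events/"))
  .withColumnRenamed("count", "s3_count")

// Buckets where the two counts differ have to be re-exported to S3.
val inconsistent = cassandraCounts
  .join(s3Counts, Seq("event_type", "hour"), "full_outer")
  .where(coalesce($"cassandra_count", lit(0L)) =!= coalesce($"s3_count", lit(0L)))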
27
S3 Big data problems
28
STORAGE: S3 Big data problems
Some S3 Big Data problems:
1. Eventual consistency
2. Very slow renames. Why? A rename = copy + delete.
29
STORAGE: S3 Big data problems
1. Eventual consistency
30
STORAGE: S3 Big data problems
Solution: S3Guard
Available in:
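S3Guard keeps the bucket's listing metadata in DynamoDB so directory listings become consistent. The configuration looks roughly like the lines below; the exact keys and the DynamoDB table name depend on your Hadoop version and setup, so treat this as a sketch rather than the settings used at letgo.

fs.s3a.metadatastore.impl = org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore
fs.s3a.s3guard.ddb.table = s3guard-metadata
fs.s3a.s3guard.ddb.table.create = true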
31
STORAGE: S3 Big data problems
2. Slow renames
Job freeze?
32
STORAGE: S3 Big data problems
Solution
• New S3A committers (Hadoop 3.1)
33
STORAGE: S3 Big data problems
• Committers architecture
• S3A committers
• Spark and S3 with Ryan Blue: SlideShare, YouTube
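The new committers upload task output as S3 multipart uploads and only complete the uploads at job commit, so the copy-and-delete rename disappears. Enabling them from Spark looks roughly like this (it needs the spark-hadoop-cloud / PathOutputCommitter classes; keys vary by Spark and Hadoop version, so this is a sketch, not letgo's exact configuration):

spark.hadoop.fs.s3a.committer.name = directory
spark.sql.sources.commitProtocolClass = org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class = org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter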
34
STORAGE: S3 Big data problems
Surprise!
35
INGEST
Data Ingestion
Storage
PROCESSING
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
36
REAL-TIME USER
SEGMENTATION
37
STREAMING: REAL-TIME USER SEGMENTATION
Stream Journal
User bucket’s changed
38
REAL-TIME USER
STATISTICS
39
STREAMING: REAL-TIME USER STATISTICS
What we do:
• Sessions (Last session)
• Conversation (Last conversation created, counts, …)
• Products (Last product approved, counts, ...)
• User classification (Last bucket changed, counts, ...)
Tool:
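The tool logo on this slide does not survive the text export, so purely as an illustration, here is what per-user running statistics could look like with Spark Structured Streaming; topic, field and sink choices are assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("user-stats").getOrCreate()
import spark.implicits._

val eventSchema = new StructType()
  .add("user_id", StringType).add("type", StringType).add("sent_at", LongType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "domain-events")
  .load()
  .select(from_json($"value".cast("string"), eventSchema).as("e"))
  .select($"e.user_id", $"e.type", $"e.sent_at")

// Running counts plus "last seen" per user and event type.
val stats = events
  .groupBy($"user_id", $"type")
  .agg(count("*").as("event_count"), max($"sent_at").as("last_event_at"))

stats.writeStream
  .outputMode("update")
  .format("console") // in production this would be a Cassandra/Redis-style sink
  .start()
  .awaitTermination()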
40
REAL-TIME PATTERN
DETECTION
41
STREAMING: REAL-TIME PATTERN DETECTION
Is it still available?
Is the price negotiable?
What condition is it in?
I offer you….$
Could we meet at……?
42
STREAMING: REAL-TIME PATTERN DETECTION
Event A + Event B = Complex Event C
Event A + NOTHING in 2 hours = Complex Event D
Some common use cases:
• Fraud detection, scammer detection?
• Real-time recommendations
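The extracted slides do not name the engine behind these rules; as one way to express them, here is a small sketch with Flink CEP (Scala API). The event fields and the "offer_made" follow-up event are hypothetical.

import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class ChatEvent(conversationId: String, eventType: String, ts: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment

// In production this would be a Kafka source; a fixed list keeps the sketch self-contained.
val events: DataStream[ChatEvent] = env.fromElements(
  ChatEvent("c1", "message_sent", 1L),
  ChatEvent("c1", "offer_made", 2L)
)

// "Event A followed by Event B within 2 hours" => complex event C.
val pattern = Pattern
  .begin[ChatEvent]("ask").where(_.eventType == "message_sent")
  .followedBy("offer").where(_.eventType == "offer_made")
  .within(Time.hours(2))

val complexEvents: DataStream[String] =
  CEP.pattern(events.keyBy(_.conversationId), pattern)
    .select(m => s"Complex event C in conversation ${m("ask").head.conversationId}")

complexEvents.print()
env.execute("pattern-detection")

// "Event A + NOTHING in 2 hours" (complex event D) is the timed-out side of the same
// pattern; it can be captured with the select variant that takes a timeout handler.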
43
GO TO EVENTS: Tips
INGEST
Data Ingestion
Storage
PROCESSING
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
44
GEODATA ENRICHMENT
45
BATCH: Geodata enrichment
{
  "data": {
    "id": "105dg3272-8e5f-426f-bca0-704e98552961",
    "type": "some_event",
    "attributes": {
      "latitude": 39.740028705904,
      "longitude": -104.97341156236
    }
  },
  "meta": {
    "created_at": 1522886400036
  }
}
46
BATCH: Geodata enrichment
What we know: coordinates (longitude and latitude).
POINT (77.3548351 28.6973627)
47
BATCH: Geodata enrichment
Where is this point? City, state, zip code, DMA.
48
BATCH: Geodata enrichment
What we wanted to know:
We represent states, cities, zip codes and DMAs as polygons:
POLYGON((-114.816294 32.508038, -114.814321 32.509023, -114.810159 32.508383, -114.807726 32.508726, -114.805239 32.509985...
49
BATCH: Geodata enrichment
How we do it (see the sketch below):
• We load city, state, zip code, DMA… polygons from well-known text (WKT)
• Create spatial indexes using the JTS Topology Suite
• Expose the lookup as a custom Spark SQL UDF
Tool:
SELECT geodata.dma_name,
       geodata.dma_number AS dma_number,
       geodata.city AS city,
       geodata.state AS state,
       geodata.zip_code AS zip_code
FROM (
  SELECT
    geodata(longitude, latitude) AS geodata
  FROM ….
)
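A rough sketch of that UDF in Scala, assuming the org.locationtech.jts packaging of the JTS Topology Suite; the polygon list, the Geodata fields and the registration code are illustrative, not letgo's actual implementation.

import org.apache.spark.sql.SparkSession
import org.locationtech.jts.geom.{Coordinate, Geometry, GeometryFactory}
import org.locationtech.jts.index.strtree.STRtree
import org.locationtech.jts.io.WKTReader
import scala.collection.JavaConverters._

case class Geodata(city: String, state: String, zip_code: String,
                   dma_name: String, dma_number: String)

object GeodataUdf {
  private val factory = new GeometryFactory()
  private val reader  = new WKTReader(factory)

  // Hypothetical reference data: (WKT polygon, attributes to attach).
  private val polygons: Seq[(String, Geodata)] = Seq(
    ("POLYGON((-105.1 39.6, -104.6 39.6, -104.6 39.9, -105.1 39.9, -105.1 39.6))",
     Geodata("Denver", "CO", "80202", "Denver", "751"))
  )

  // Spatial index: each point is only tested against a few candidate polygons.
  private lazy val index: STRtree = {
    val tree = new STRtree()
    polygons.foreach { case (wkt, meta) =>
      val geom = reader.read(wkt)
      tree.insert(geom.getEnvelopeInternal, (geom, meta))
    }
    tree
  }

  def lookup(longitude: Double, latitude: Double): Option[Geodata] = {
    val point = factory.createPoint(new Coordinate(longitude, latitude))
    index.query(point.getEnvelopeInternal).asScala.collectFirst {
      case (geom: Geometry, meta: Geodata) if geom.contains(point) => meta
    }
  }

  def register(spark: SparkSession): Unit =
    spark.udf.register("geodata", (lon: Double, lat: Double) => lookup(lon, lat).orNull)
}

// After GeodataUdf.register(spark), the SQL above can call geodata(longitude, latitude)
// and read geodata.city, geodata.state, geodata.zip_code, geodata.dma_name, geodata.dma_number.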
50
INGEST
Data Ingestion
Storage
PROCESSING
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
51
QUERYING DATA
52
QUERYING DATA
53
QUERYING DATA
Metastore
54
WHY WE CHOSE
SPARK THRIFT SERVER
55
QUERYING DATA: WHY WE CHOSE SPARK THRIFT SERVER
56
QUERYING DATA: WHY WE CHOSE SPARK THRIFT SERVER
57
QUERYING DATA: WHY WE CHOSE SPARK THRIFT SERVER
Thrift Server
58
QUERYING DATA
59
QUERYING DATA
CREATE EXTERNAL TABLE IF NOT EXISTS database_name.table_name(
  some_column STRING...,
  dt DATE
)
USING PARQUET
PARTITIONED BY (`dt`)
LOCATION 's3a://bucket-name/database_name/table_name'

CREATE TABLE IF NOT EXISTS database_name.table_name(
  some_column STRING,
  ...
  dt DATE
)
USING JSON
PARTITIONED BY (`dt`)

CREATE TABLE IF NOT EXISTS database_name.table_name
USING com.databricks.spark.redshift
OPTIONS (
  dbtable 'schema.redshift_table_name',
  tempdir 's3a://redshift-temp/',
  url 'jdbc:redshift://xxxx.redshift.amazonaws.com:5439/letgo?user=xxx&password=xxx',
  forward_spark_s3_credentials 'true'
)

CREATE TEMPORARY VIEW table_name
USING org.apache.spark.sql.cassandra
OPTIONS (
  table "table_name",
  keyspace "keyspace_name"
)
60
QUERYING DATA
CREATE TABLE … USING [parquet, json, csv…]
VS
CREATE TABLE … STORED AS …
61
QUERYING DATA
CREATE TABLE … STORED AS …
VS
CREATE TABLE … USING [parquet, json, csv…]
~70% higher performance with USING, since it reads through Spark's native data sources instead of the Hive SerDe path.
62
BATCHES WITH SQL
63
QUERYING DATA: Batches with SQL
1. Creating the table
CREATE EXTERNAL TABLE IF NOT EXISTS database.some_name(
  user_id STRING,
  column_b STRING,
  ...
  dt STRING
)
USING PARQUET
PARTITIONED BY (dt)
LOCATION 's3a://example/some_table'
64
QUERYING DATA: Batches with SQL
2. Inserting data
INSERT OVERWRITE TABLE database.some_name
PARTITION(dt)
SELECT
  user_id,
  column_b,
  dt
FROM other_table
...
65
QUERYING DATA: Batches with SQL
66
QUERYING DATA: Batches with SQL
200 files, because the default value of “spark.sql.shuffle.partitions” is 200.
67
QUERYING DATA: Batches with SQL
Solution
INSERT OVERWRITE TABLE database.some_name PARTITION(dt)
SELECT
  user_id,
  column_b,
  dt
FROM other_table
...
68
QUERYING DATA: Batches with SQL
DISTRIBUTE BY (dt):
Only one file per dt partition, not sorted.
CLUSTER BY (dt, user_id, column_b):
Multiple files.
DISTRIBUTE BY (dt) SORT BY (user_id, column_b):
Only one file per dt partition, sorted by user_id, column_b.
Good for joins using these columns.
69
QUERYING DATA: Batches with SQL
INSERT OVERWRITE TABLE database.some_name
PARTITION(dt)
SELECT
user_id,
column_b,
dt
FROM other_table
...
DISTRIBUTE BY (dt) SORT BY (user_id)
70
QUERYING DATA: Batches with SQL
• Hive Bucketing in Apache Spark: SlideShare, YouTube.
71
INGEST
Data Ingestion
Storage
PROCESSING
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
72
DATA EXPLOITATION
Thrift Server
Metastore
73
DATA EXPLOITATION
74
DATA EXPLOITATION: Superset
75
DATA SCIENTISTS
(WINTER IS COMING)
76
DATA EXPLOITATION: Data Scientists
77
DATA EXPLOITATION: Data Scientists
78
79
DATA EXPLOITATION: Data Scientists
Too many small files!
80
DATA EXPLOITATION: Data Scientists
Huge Query!
81
DATA EXPLOITATION: Data Scientists
Too much shuffle!
82
DATA EXPLOITATION: Data Scientists
83
Some data scientist tips
Listen carefully
or you will die!
84
DATA EXPLOITATION: Data Scientists
Cross-team workshops: BI, data scientists, data engineers.
Create a wiki with guidelines, tips, links...
Learn from your mistakes!
Well done!
85
INGEST
Data Ingestion
Storage
PROCESSING
Stream
Batch
DISCOVER
Query
Data exploitation
Orchestration
86
ORCHESTRATION
• It’s an open-source project written in Python by Airbnb.
• Easy to schedule jobs, write new tasks and monitor them.
87
ORCHESTRATION
88
ORCHESTRATION
I’m happy!!
89
TIPS
90
Bigger is better
91
TIPS: Bigger is better
20 slots
Total used: 3x6x4=72
Total wasted: 2x4=8
Total used: 3x26=78
Total wasted: 2
80 slots
We prefer big Hadoop instances.
We use m4.16xlarge instances.
We are migrating to m5.24xlarge instances.
92
Job Scheduling
93
TIPS: Job Scheduling
94
Be polite
95
TIPS: Be polite
spark.executor.memory = 4G
spark.executor.cores = 3
spark.driver.memory = 4G
spark.driver.cores = 2
## dynamicAllocation
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true
spark.executor.instances=1
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.initialExecutors=3
spark.dynamicAllocation.maxExecutors=20
spark.dynamicAllocation.cachedExecutorIdleTimeout=60s
Others are using the same cluster, so don’t claim more resources than you need.
96
But not too polite
97
TIPS: But not too polite
98
Metrics, metrics!
99
TIPS: SPARK METRICS
What we did:
• Custom InfluxDB Spark metrics sink
• Custom Spark metrics using Dropwizard
• Grafana dashboards and alerts
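As an illustration of the Dropwizard part, a minimal registry with a counter and a timer; hooking the registry into Spark's metrics system and the InfluxDB sink is deployment-specific and omitted, and the metric names are made up.

import java.util.concurrent.TimeUnit
import com.codahale.metrics.{ConsoleReporter, MetricRegistry}

object JobMetrics {
  val registry = new MetricRegistry()
  val eventsWritten = registry.counter("storage.events_written")
  val batchTimer    = registry.timer("storage.batch_duration")

  // Console reporter for local runs; an InfluxDB/Graphite reporter would replace it in production.
  val reporter: ConsoleReporter = ConsoleReporter.forRegistry(registry)
    .convertRatesTo(TimeUnit.SECONDS)
    .convertDurationsTo(TimeUnit.MILLISECONDS)
    .build()
  reporter.start(30, TimeUnit.SECONDS)
}

// Usage inside a job:
// val timing = JobMetrics.batchTimer.time()
// ... write the batch ...
// JobMetrics.eventsWritten.inc(rowsWritten)
// timing.stop()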
100
Mistakes are good
101
TIPS: Mistakes are good
102
Do you want to
join us?
