GAM306_Building a Lake of Wisdom

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Building a Lake of Wisdom
Henri Heiskanen
GAM306
November 27, 2017

Who am I?
Hollywood Production Executive...
Moonlighting as Data Engineering Lead at Rovio Games
20 years of software development
5 years of analytics

“Lake of Wisdom”

Agenda
Rovio and data
Building the lake
Learnings

Possible takeaways
How should I build my data lake?
How should I handle schema evolution?
What are the limitations of Amazon Athena that I need to consider?
Are there services out there that would make this easier to accomplish?
Could this be our high-level architecture in the future?
How should I partition my data in S3?

Rovio and data

• Creator of Angry Birds
• Founded in 2003
• HQ in Espoo, Finland
• ~400 employees in Finland, Sweden,
UK, China, and US
• Games-first entertainment company
• Games
• Brand licensing
• 3.7 billion game downloads*
• 80 million monthly active users and 11
million daily active users**
• Angry Birds Movie
About Rovio
* at June 30, 2017
** during the three months ended June 30, 2017

[Chart: Puzzle & Casino subgenres (Tap & Clear, Puzzle Shooter, Casino Card, Casino Bingo, Puzzle Card, Puzzle Words, String Match); value axis from 0 to 14]
[Funnel: Impression, Click, Install, Purchase]
Rovio games and data
Market research
Performance marketing
Game optimization

Our data philosophy
Single truth
Sharing is caring
We own our future
???

Pre-lake architecture Q1/2017

Single truth
¯\_(ツ)_/¯
Our data philosophy revisited
“It’s not perfect, but don’t we have more important
things to do?”
Multiple dashboards with different DAU
Sharing is caring
We own our future
Limited access to game
dashboards and raw data
I suppose…

…And then this happened!
Our analytics platform provider got acquired by
another mobile games company!

Let’s have a quick look at
what we already had

What is Beacon?
Payment, Ads, Push
Analytics, A/B testing, Segmentation

What is missing?
The ability for data analysts to build game-specific
dashboards using game events joined with portfolio-wide
games business data.

What we were set to do
BUY or BUILD
“Well, whatever…
but do it in seven weeks!”

Building the lake

What do we need?
Petabyte scale “data warehouse” with fast query access to raw
datasets
Ability to efficiently query and process data from multiple systems
Data visualization SDK for data analysts to create dashboards that can
be integrated into our Beacon UI

Data lakes are all the rage
“A data lake is a method of storing data within a system or
repository, in its natural format, that facilitates the
collocation of data in various schemata and structural
forms, usually object blobs or files.”–Wikipedia

But we have a data lake already then?
Human latency due to cluster launching and lacking
schemas
Query latency due to inefficient data formats
Only a few people who knew how and wanted to access
the raw data

Why Amazon Athena?
We are fluent with AWS
Data and processing layers separated
Pay-per-use billing
Performance is at least on par with alternative solutions

SELECT m.t, COUNT(DISTINCT s.aid1) AS dau FROM abba_raw where processdate = 20170116 group by m.t;
Amazon Athena performance
Data format          Database X   Presto (21 x m3.xlarge)   Amazon Athena
JSON + SEQ + gzip    N/A          9 min                     1 min 50 sec
ORC                  N/A          37 sec                    9 sec
DB native            35 sec       N/A                       N/A

We need columnar format
Flatten
Re-partition
Manage schema

We also need profiles
Analysis is almost always done for a specific player cohort
In the old system, each raw event row was enriched with player state
Can we build a system where we do not need to bake in player state,
but instead join dynamically with acceptable performance?

select b.registration_date, avg(a.level_id) as avg_level from
(
select
max(cast(m_level_id as integer)) as level_id,
s_aid2 as player_id
from default.abba_orc
where m_t = 'level_finished' and processdate = 20170116
group by s_aid2
) as a
left join wolery_orc b
on a.player_id = b.player_id
and b.app_id = 'Abba'
group by b.registration_date;
How about joins in Amazon Athena?
Run time: 12.73 seconds
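
The shape of this query (aggregate the events per player first, then join the much smaller result to the profile table) can be tried locally. What follows is a minimal sketch using Python's built-in sqlite3 module as a stand-in engine; the sample rows are invented for illustration, the table and column names simply mirror the slide, and nothing here says anything about Athena's performance.

```python
import sqlite3


def avg_level_by_registration_date(events, profiles):
    """Return {registration_date: average of each player's highest finished level}."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE abba_orc (s_aid2 TEXT, m_t TEXT, m_level_id TEXT, processdate INTEGER)"
    )
    con.execute(
        "CREATE TABLE wolery_orc (player_id TEXT, app_id TEXT, registration_date TEXT)"
    )
    con.executemany("INSERT INTO abba_orc VALUES (?, ?, ?, ?)", events)
    con.executemany("INSERT INTO wolery_orc VALUES (?, ?, ?)", profiles)
    rows = con.execute(
        """
        SELECT b.registration_date, AVG(a.level_id) AS avg_level
        FROM (
            SELECT MAX(CAST(m_level_id AS INTEGER)) AS level_id, s_aid2 AS player_id
            FROM abba_orc
            WHERE m_t = 'level_finished' AND processdate = 20170116
            GROUP BY s_aid2
        ) AS a
        LEFT JOIN wolery_orc b
          ON a.player_id = b.player_id AND b.app_id = 'Abba'
        GROUP BY b.registration_date
        """
    ).fetchall()
    con.close()
    return dict(rows)


# Invented sample rows (not Rovio data)
EVENTS = [
    ("p1", "level_finished", "3", 20170116),
    ("p1", "level_finished", "7", 20170116),
    ("p2", "level_finished", "5", 20170116),
    ("p2", "level_started", "9", 20170116),   # filtered out: wrong event type
    ("p3", "level_finished", "2", 20170115),  # filtered out: wrong processdate
]
PROFILES = [
    ("p1", "Abba", "2017-01-01"),
    ("p2", "Abba", "2017-01-01"),
    ("p3", "Abba", "2017-01-02"),
]

result = avg_level_by_registration_date(EVENTS, PROFILES)
```

Because the inner query collapses the event table to one row per player before the join, the join input stays small regardless of how many raw events a day contains.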

Profiles and daily aggregates
• Profile is the user journey across the Rovio portfolio
• Origin: paid, xpromo, network, organic
• Time spent
• Purchases
• Predictions: LTV, churn, cheating…
• Aggregates are game agnostic daily/cumulative activities
• Time spent on one day / cumulative time spent until that day
• Data in Amazon Redshift
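
The "one day / cumulative until that day" pairing in the aggregates is a running sum over a player's daily rows. A minimal sketch, assuming a hypothetical input of (date, seconds) tuples already sorted by date for one player:

```python
from itertools import accumulate


def cumulative_time_spent(daily_seconds):
    """Turn [(date, seconds)] into [(date, seconds, cumulative_seconds)].

    The input is assumed to be sorted by date and to belong to one player.
    """
    totals = accumulate(seconds for _, seconds in daily_seconds)
    return [(day, seconds, total) for (day, seconds), total in zip(daily_seconds, totals)]
```

Storing both values per day means a cohort query can read a single row per player instead of summing the player's whole history.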

Amazon Athena
Technology selections

Rovio data crash course

• Produced by game clients and services
• Have standard headers and custom
message body
• In JSON format
• Protobuf packets encoded from
game clients
Analytics events
{
"o": {
"ts": "2015-02-20T13:29:06.054+0000"
},
"s": {
"aid1": "AN7624...",
"cid": "analyticstestapp_7bc0125b",
"cver": "1.6.0"
},
"t": {
"dt": "iPad3,1",
"geo": "FI",
"os": "iOS",
"osv": "7.0.4"
},
"m": {
"t": "Conquer_The_World",
"weapon_of_choice": "water pistol"
}
}
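
The flat column names used for the ORC tables in this deck (o_ts, s_aid1, m_weapon_of_choice, and so on) correspond to joining each header key and field name with an underscore. A minimal sketch of that flattening step, assuming this naming rule holds for every field; the function name is hypothetical:

```python
def flatten_event(event, prefix="", sep="_"):
    """Flatten nested dicts: {"s": {"aid1": "x"}} becomes {"s_aid1": "x"}."""
    flat = {}
    for key, value in event.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so deeper nesting is flattened the same way
            flat.update(flatten_event(value, name, sep))
        else:
            flat[name] = value
    return flat


sample_event = {
    "o": {"ts": "2015-02-20T13:29:06.054+0000"},
    "s": {"aid1": "AN7624...", "cid": "analyticstestapp_7bc0125b", "cver": "1.6.0"},
    "t": {"dt": "iPad3,1", "geo": "FI", "os": "iOS", "osv": "7.0.4"},
    "m": {"t": "Conquer_The_World", "weapon_of_choice": "water pistol"},
}
flat_event = flatten_event(sample_event)
```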

• Service topics contain events from all
games
• Game topics have events from a single
game
• Game clients and/or server
Kafka topics

audit.ads/
audit.session/
processdate=20170926/
1_0_00000000002252646951.gz
1_0_00000000002252832986.gz
1_0_00000000002252943052.gz
1_0_00000000002253053431.gz
...
...
audit.ua/
audit.wallet/
collector.AngryBirdsFriends/
collector.Crimson/
collector.Yellow/
collector.nibblers2_2c37376e/
collector.ocean_2ee459c8/
collector.tntgame_149fb4d3/
...
• Events from Kafka are stored to Amazon
S3 as compressed sequence files with
one JSON object per row
• Partitioned by topic and processing date
Raw event Amazon S3 sink

app_id=Abba/
app_id=AngryBirdsClassic/
topic=audit.ads/
topic=audit.identity/
topic=audit.session/
topic=audit.supermoon/
topic=audit.wallet/
topic=collector.angrybirdsclassic/
eventtype=mightyleaguedemotion/
eventtype=mightyleagueentered/
000058_0
...
app_id=BadPiggiesFull/
...
• Events stored as compressed ORC files
• Partitioned by game, topic, processing
date, and event type
• Additional event type partition
improved performance over
ORC indexing
Target: ORC Amazon S3 sink
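
The key layout above follows the Hive-style `name=value` convention, which lets query engines map folder names to partition columns. A minimal sketch of building and parsing such a prefix; the function names are hypothetical, and placing `processdate` between `topic` and `eventtype` is an assumption based on the bullet order, since the listing does not show that level:

```python
def orc_partition_prefix(app_id, topic, processdate, eventtype):
    """Build the S3 key prefix for one ORC partition."""
    return (
        f"app_id={app_id}/topic={topic}/"
        f"processdate={processdate}/eventtype={eventtype}/"
    )


def parse_partition_prefix(prefix):
    """Recover the partition values (as strings) from a name=value key prefix."""
    parts = [part for part in prefix.strip("/").split("/") if "=" in part]
    return dict(part.split("=", 1) for part in parts)
```

For example, `parse_partition_prefix` applied to the output of `orc_partition_prefix` returns the four partition values again, which is handy when listing S3 to find partitions that a table does not yet know about.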

CREATE EXTERNAL TABLE abba.client_events (
`o_ts` STRING,
`s_aid1` STRING,
`s_cid` STRING,
`s_cver` STRING,
`t_dt` STRING,
`t_geo` STRING,
`t_os` STRING,
`t_osv` STRING,
`m_weapon_of_choice` STRING
)
PARTITIONED BY (processdate INT, eventtype STRING)
STORED AS ORC
LOCATION
's3://bucket/app_id=Abba/topic=collector.abba';
• Flat structure
• Schema per game
• Table per Kafka topic
• Partitioned by date and event type
Target: ORC table DDL

Problem: ORC schema evolution
• Years of raw data from multiple games and services
• Monthly game updates introduce new fields
• Reprocessing everything takes time and is expensive
• Solution: Maintain ORC file compatibility
• Do not remove fields
• Do not change the data type of the field
• Append the new fields to the end
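
The three compatibility rules can be checked mechanically before a new schema is published. A minimal sketch, assuming a schema is represented as an ordered list of (name, type) pairs; the function names are hypothetical and this is not Rovio's implementation:

```python
def check_orc_compatible(old, new):
    """Return a list of rule violations; an empty list means `new` is compatible."""
    problems = []
    new_types = dict(new)
    for name, dtype in old:
        if name not in new_types:
            problems.append(f"removed field: {name}")
        elif new_types[name] != dtype:
            problems.append(f"type changed: {name} {dtype} -> {new_types[name]}")
    # Existing fields must keep their positions, so new ones can only follow them
    if not problems and new[: len(old)] != old:
        problems.append("existing fields reordered; new fields must be appended at the end")
    return problems


def evolve(old, discovered):
    """Append fields that are new in `discovered` to the end of `old`."""
    known = {name for name, _ in old}
    return old + [(name, dtype) for name, dtype in discovered if name not in known]
```

`evolve` never drops or retypes a field, so its output passes `check_orc_compatible` against the schema it started from.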

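// Infers a JSON schema for each topic folder and writes it to S3.
// listFolders, writeToS3, and TableMeta are helpers not shown on the slide.
static void discover(AmazonS3Client s3, JavaSparkContext sc, String input, String output,
        String processdate, String[] whiteList) throws Exception {
    SQLContext sql = new SQLContext(sc);
    for (TableMeta folder : listFolders(s3, input, whiteList)) {
        String base = folder.getPath();
        String path = base + (base.endsWith("/") ? "" : "/") + "processdate=" + processdate;
        // Raw events are sequence files; the JSON document is the value of each record
        JavaPairRDD<Text, Text> rdd = sc.sequenceFile(path, Text.class, Text.class);
        JavaRDD<String> raw = rdd.map(tuple -> tuple._2().toString());
        // Let Spark infer the schema from the JSON documents
        // (jsonRDD is not available in the Spark 2.x API that Dataset<Row> implies)
        Dataset<Row> df = sql.read().json(raw);
        df.printSchema();
        // write json schema
        writeToS3(s3, folder, df.schema().prettyJson(), output, ".json");
    }
}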
Schema discovery Spark job
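
Spark's JSON reader infers the schema by scanning the documents. The idea can be illustrated without a cluster: walk every event and collect each flattened field the first time it appears. A minimal sketch in plain Python, assuming one JSON object per line; the type names are Python's own, not Spark's or ORC's, and the function name is hypothetical:

```python
import json


def discover_fields(json_lines):
    """Return [(flattened_field_name, python_type_name)] in first-seen order."""
    seen = {}

    def walk(obj, prefix):
        for key, value in obj.items():
            name = f"{prefix}_{key}" if prefix else key
            if isinstance(value, dict):
                walk(value, name)
            else:
                # Keep the first type seen; later events cannot change it
                seen.setdefault(name, type(value).__name__)

    for line in json_lines:
        walk(json.loads(line), "")
    return list(seen.items())
```

Keeping first-seen order matters here because fields discovered later end up at the end of the list, which matches the rule that new ORC fields are appended.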

Schema discovery pipeline

Problem: Number of partitions
• Angry Birds 2 has ~100 event types and 2 years of ORC
data
• 100 * 365 * 2 = 73,000 partitions
• ADD COLUMNS did not work
• Schema update needs DROP and CREATE TABLE
• MSCK REPAIR and DROP TABLE time out
• Workaround: Add and remove partitions in batches of 10
and parallelize
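
The batching workaround can be sketched as follows: split the partition list into groups of 10, emit one `ALTER TABLE ... ADD PARTITION` statement per group, and submit the statements from a small thread pool. This is an illustration, not the deck's actual tooling; the function names are hypothetical and `execute` stands in for whatever client call submits a statement.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def add_partition_statements(table, location, partitions, batch_size=10):
    """Build one ALTER TABLE statement per batch of (processdate, eventtype) pairs."""
    statements = []
    for batch in chunked(partitions, batch_size):
        clauses = [
            f"PARTITION (processdate={d}, eventtype='{e}') "
            f"LOCATION '{location}/processdate={d}/eventtype={e}/'"
            for d, e in batch
        ]
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS\n  " + "\n  ".join(clauses)
        )
    return statements


def run_in_parallel(statements, execute, workers=4):
    """Submit the statements concurrently; `execute` is a caller-supplied function."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(execute, statements))


# The partition count from the slide: ~100 event types, daily, for 2 years
PARTITION_COUNT = 100 * 365 * 2
```

At 10 partitions per statement, the 73,000 partitions above become 7,300 statements, which is why running them in parallel matters.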

SET hive.optimize.sort.dynamic.partition=false;
INSERT OVERWRITE TABLE tmp.abba_orc_tmp_${hiveconf:processdate}
PARTITION (app_id, topic, processdate, eventtype)
SELECT
`h`.`v` AS h_v,
`m`.`ad_type` AS m_ad_type,
`m`.`ANDROID_ADVERTISING_ID` AS m_ANDROID_ADVERTISING_ID,
`m`.`arena_birds_used` AS m_arena_birds_used,
`m`.`arena_entry` AS m_arena_entry,
`m`.`arena_position` AS m_arena_position,
`s`.`aid1` AS s_aid1,
'Abba' AS app_id,
'collector.abba' AS topic,
processdate AS processdate,
regexp_replace(regexp_replace(lower(substr(m.t,1,100)), '\\s', '__'), '[^\\w]', '_') AS eventtype
FROM abba.client_collector
WHERE processdate=${hiveconf:processdate}
DISTRIBUTE BY CASE WHEN eventtype IN ('frame_unlocked', 'performance_event', 'feat_progress', 'level_finished',
'level_started', 'currency_change') THEN CONCAT(eventtype, ABS(HASH(o_ts))%10) ELSE eventtype END;
Data conversion Hive job
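
Two ideas in this job are easy to miss: the event name is normalized into a safe partition value, and the highest-volume event types are spread over 10 writers by salting the distribution key. A minimal Python sketch of both, assuming the two regular expressions were meant to match whitespace (`\s`) and non-word characters (`[^\w]`); `zlib.crc32` is only a stand-in for Hive's `HASH`, so the bucket numbers will not match Hive's.

```python
import re
import zlib

# High-volume event types listed in the DISTRIBUTE BY clause
HOT_EVENT_TYPES = {
    "frame_unlocked", "performance_event", "feat_progress",
    "level_finished", "level_started", "currency_change",
}


def normalize_event_type(raw):
    """First 100 chars, lowercased, whitespace -> '__', other non-word chars -> '_'."""
    name = raw[:100].lower()
    name = re.sub(r"\s", "__", name)
    # re.ASCII keeps \w to [a-zA-Z0-9_], matching Java's default behavior
    return re.sub(r"[^\w]", "_", name, flags=re.ASCII)


def distribution_key(eventtype, o_ts, buckets=10):
    """Salt hot event types so their rows are split across `buckets` writers."""
    if eventtype in HOT_EVENT_TYPES:
        return f"{eventtype}{zlib.crc32(o_ts.encode('utf-8')) % buckets}"
    return eventtype
```

Without the salt, every row of a hot event type would be routed to a single writer, and that one task would dominate the job's run time.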

Data conversion pipeline

The lake architecture Q3/2017

So…what was our deliverable?

Well-defined schema…
Table              Description
client_events      Analytics events sent from the game client
identity_events    Analytics events sent from login service
session_events     Analytics events sent from new login service
ads_events         Impression and click information received from ads service
wallet_events      IAP purchase events received from wallet service
supermoon_events   Rule matching information from Manage tool
player_profile     Player profile information such as registration date and user origin
player_daily       Cumulative daily player activity and purchase information

…You can query in Amazon Athena…

…And publish insight in Beacon

…And as a bonus we got the lake

Learnings

ORC vs. Parquet vs. something else
“ORC is newest and fastest but Parquet is more widely
supported at the moment”

“INSERT INTO” support
“We are running reporting queries in Hive and Presto,
which brings extra cost and complexity to the system”

External metastore support
“We keep two identical metastores up-to-date”

Mind the number of partitions
“The biggest performance problems come from running
the DDL statements”

Concurrent query limits
“The default concurrent query limit of five is far from what
a data-hungry enterprise needs”

New products and services
“Stay informed on the new product launches and
constantly evolve your architecture”

Schema discovery is easy,
schema management can get complicated
“Maybe take a look at AWS Glue instead of building
everything by yourself”

Bias
“Unbiased information is hard to find. For every blog there
is a counter blog. Test your own use case, if possible.”

Amazon Athena, Presto, Hive, and Spark are
great together
“Amazon Athena provides the data for a wider audience,
but our data architecture allows us to access the Lake of
Wisdom with other tools for more advanced use cases, if
needed”

Thank You!
