Big data challenges are common: we are all doing aggregations, machine learning, anomaly detection, OLAP...
This presentation describes how InnerActive answers those requirements.
2. AT A GLANCE
InnerActive in numbers:
● 1.2M transactions/s
● 8 TB raw data/day
● 800M unique users
● 500-1000 servers
● 100 employees
● Founded in 2007
● Offices: New York, London, San Francisco, Tel Aviv, Beijing
4. InnerActive Main DB Challenges
Big data challenges are complicated and include features like:
● Massive cross-dimension aggregations with real-time drill-down, for presenting ad ROI…
● Analytics (Databricks / Zeppelin) on raw data
● Cube (dynamic reports)
● Anomaly detection
● Audience targeting, like Google AdWords:
○ Offline process to analyze users
○ Real-time targeting requests
● ML: filtering bad traffic, targeting recommendations
8. Massive Aggregations
Lambda Flow
● Raw data arrives from Kafka
● Aggregations are needed by:
○ Country
○ Country, OS and AppId
● Batch / Stream
[Flow diagram: Secor archives the raw Kafka data; the stream job aggregates hourly data; an hourly + daily batch recovers hourly data from the raw data; results feed the analytics layer]
Parquet is used in order to avoid the read-update-save cycle in C*
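As an illustration only, a minimal Spark (Scala) sketch of such an hourly aggregation step; the paths and column names (clicks, os, appId, date, hour) are assumptions, not the production job:

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

object HourlyAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hourly-aggregation").getOrCreate()

    // Hypothetical raw events dumped from Kafka to HDFS (e.g. by Secor) as JSON.
    val events = spark.read.json("hdfs:///raw/events/") // illustrative path

    // Aggregation keyed by country (date/hour kept so the output can be partitioned).
    val byCountry = events
      .groupBy("date", "hour", "country")
      .agg(F.count("*").as("impressions"), F.sum("clicks").as("clicks"))

    // Aggregation keyed by country, OS and appId.
    val byCountryOsApp = events
      .groupBy("date", "hour", "country", "os", "appId")
      .agg(F.count("*").as("impressions"), F.sum("clicks").as("clicks"))

    // Writing Parquet (append-only, partitioned) avoids the read-update-save
    // cycle that the same aggregates would need in a Cassandra table.
    byCountry.write.mode("overwrite").partitionBy("date", "hour")
      .parquet("hdfs:///agg/by_country/")
    byCountryOsApp.write.mode("overwrite").partitionBy("date", "hour")
      .parquet("hdfs:///agg/by_country_os_app/")
  }
}
```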
9. In short...
● Exactly-once semantics:
○ Recovery from failure (YARN also helps)
○ Recovery from updates (checkpoint)
■ Non-idempotent data (aggregations)
■ Non-transactional DB (Cassandra)
● Offset management:
○ Save in the Spark checkpoint? No, we use ZK
● Out-of-order data
● Flow:
○ We save data in HDFS; the folder name contains the "from" offset
○ We update ZK with the "from" offset at the beginning of each stream iteration
○ In case Spark crashes, during streaming initialization we retrieve the offsets from ZK (the "from" of the last iteration)
○ Data is partitioned by date to manage out-of-order data
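A rough sketch of this offset bookkeeping using Apache Curator against ZooKeeper; the ZK path layout and the string serialization are assumptions for illustration, not the production code:

```scala
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

object OffsetStore {
  private val zk = CuratorFrameworkFactory.newClient(
    "zk1:2181", new ExponentialBackoffRetry(1000, 3))
  zk.start()

  private def path(topic: String, partition: Int) =
    s"/offsets/$topic/$partition" // illustrative ZK path

  // Called at the beginning of every stream iteration with the "from" offsets.
  def saveFromOffsets(topic: String, offsets: Map[Int, Long]): Unit =
    offsets.foreach { case (partition, offset) =>
      val p = path(topic, partition)
      if (zk.checkExists().forPath(p) == null) zk.create().creatingParentsIfNeeded().forPath(p)
      zk.setData().forPath(p, offset.toString.getBytes("UTF-8"))
    }

  // Called on streaming initialization after a crash: restart from the "from"
  // offsets of the last iteration. Data may be re-read, but since the HDFS
  // folders are named by the "from" offset they are simply overwritten.
  def loadFromOffsets(topic: String, partitions: Seq[Int]): Map[Int, Long] =
    partitions.map { partition =>
      val p = path(topic, partition)
      val bytes = Option(zk.checkExists().forPath(p)).map(_ => zk.getData.forPath(p))
      partition -> bytes.map(new String(_, "UTF-8").toLong).getOrElse(0L)
    }.toMap
}
```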
10. In short...
Example:
● Before updating the data in the DB, only ZK was updated:
○ read from Kafka offsets 200-300
○ update ZK with 200
○ crash
○ recover
○ get offsets from ZK
○ read again from 200, possibly up to 305
○ write to HDFS - no data lost
● After updating the data in the DB (and ZK):
○ read from Kafka offsets 200-300
○ update ZK with 200
○ update the data in HDFS folder "200"
○ crash
○ recover
○ get offsets from ZK
○ read from 200 up to 305
○ overwrite the data in HDFS folder "200" - no duplication
11. Can we make it simple?
We could use the checkpoint and enjoy its built-in offset management, fixing the data with a batch job only during deployments, but that is a maintenance nightmare and is not aligned with CI/CD.
14. Analytics: ETL, Dashboards
● Data format: a Spark batch converts the raw data to Parquet format, which significantly reduces query time
● ETL: once we have Parquet files we run several ETLs
● Dashboards: Databricks scheduled jobs update the dashboards
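A minimal sketch, assuming hypothetical S3 paths and columns, of the raw-to-Parquet conversion plus one ETL query on top of it:

```scala
import org.apache.spark.sql.SparkSession

object RawToParquetEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("raw-to-parquet-etl").getOrCreate()

    // 1. Convert the raw JSON dumps to Parquet: columnar and compressed,
    //    so the analytic queries scan far less data.
    spark.read.json("s3://raw-bucket/events/dt=2017-01-01/") // illustrative path
      .write.mode("overwrite")
      .parquet("s3://parquet-bucket/events/dt=2017-01-01/")

    // 2. Run an ETL over the Parquet files (hypothetical columns).
    val events = spark.read.parquet("s3://parquet-bucket/events/dt=2017-01-01/")
    events.createOrReplaceTempView("events")
    val roiByCountry = spark.sql(
      "SELECT country, sum(revenue) AS revenue, count(*) AS impressions " +
      "FROM events GROUP BY country")
    roiByCountry.write.mode("overwrite").parquet("s3://parquet-bucket/etl/roi_by_country/")
  }
}
```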
18. Druid Usage
● Hourly - stream
○ Segments are available once they are closed; we use Tranquility (a Kafka consumer) to create the hourly segments.
● Daily - batch
○ We use a batch job (firehose) to create daily segments from the hourly segments (from the start of the day until now).
● Hourly - batch
○ We use EMR with a batch job to fix hours missing due to streaming failures, creating hourly segments from S3.
21. Flow
We use the Twitter anomaly detector: https://github.com/twitter/AnomalyDetection
A scheduled process queries the DB and generates a CSV with the time-series data; the CSV file is the input for the anomaly detection, and an email is sent in case of an anomaly.
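A rough Scala sketch of that scheduled glue step; the JDBC URL, the query, and the detect_anomalies.R wrapper script around the Twitter library are all hypothetical placeholders:

```scala
import java.io.PrintWriter
import java.sql.DriverManager
import scala.sys.process._

object AnomalyCheck {
  def main(args: Array[String]): Unit = {
    // 1. Query the DB for an hourly time series (placeholder URL, table and columns).
    val conn = DriverManager.getConnection("jdbc:presto://presto:8080/hive/default", "etl", null)
    val rs = conn.createStatement()
      .executeQuery("SELECT hour, count(*) AS impressions FROM events GROUP BY hour ORDER BY hour")

    // 2. Write the series as the CSV that the R script expects.
    val csv = new PrintWriter("/tmp/impressions.csv")
    csv.println("timestamp,count")
    while (rs.next()) csv.println(s"${rs.getString("hour")},${rs.getLong("impressions")}")
    csv.close()
    conn.close()

    // 3. Run a small R wrapper around twitter/AnomalyDetection; assume it exits
    //    with a non-zero code when anomalies are found.
    val exitCode = Seq("Rscript", "detect_anomalies.R", "/tmp/impressions.csv").!
    if (exitCode != 0) {
      // 4. Alert: in production this sends an email; logging stands in for it here.
      println("Anomaly detected, sending alert email")
    }
  }
}
```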
23. Requirement
Send an ad to a 20-year-old male from New York...
Data Types
● A regular DMP can provide information per user, such as whether he is a 20-year-old male from New York who likes sports and owns a car (taxonomy).
● Internal DMP - InnerActive sees across publishers and demand sources, and monitors the user's ad experience, such as which users click more and on what, and which type of video ad a user watches to the end...
Data Collection and Merge
● 8 TB of daily raw data needs to be aggregated to provide relevant information per user, and then merged with data from several DMPs.
25. Flow
● A Spark batch generates hourly / daily / monthly aggregations per user from S3 (a Kafka consumer uploads the data to S3) → 800M rows with InnerActive DMP data
● The data is merged with the DMP data (aligned with the taxonomy)
● The rows are ready to use and can be loaded into a DB for analytic usage, and later on for audience selection through the console.
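A minimal Spark (Scala) sketch of the per-user aggregation and DMP merge; only the userId key is taken from the slide, while the paths and DMP columns are illustrative assumptions:

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

object AudienceBuild {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("audience-build").getOrCreate()

    // Raw events that the Kafka consumer uploaded to S3 (illustrative path).
    val events = spark.read.parquet("s3://raw-bucket/events/month=2017-01/")

    // Daily/monthly aggregation per user: ~800M resulting rows.
    val perUser = events.groupBy("userId").agg(
      F.sum("clicks").as("clicks"),
      F.sum("impressions").as("impressions"),
      F.collect_set("domain").as("domains"))

    // Merge with external DMP data (hypothetical taxonomy columns: age, gender, interests).
    val dmp = spark.read.parquet("s3://dmp-bucket/taxonomy/")
    val audienceRows = perUser.join(dmp, Seq("userId"), "left_outer")

    // Rows are now ready to be loaded into the analytics DB (HDFS drop point here).
    audienceRows.write.mode("overwrite").parquet("hdfs:///audience/rows/")
  }
}
```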
26. Raw Data
● 120 columns with:
○ Dimensions: Age, Gender, Country
○ Metrics: #Clicks, #Impressions, #Video
● 20 columns with multi-value dimensions:
○ Domains: [CNN.com, Ebay.com]
○ Interests: [Sport, Shopping, Cars]
● We also checked whether the DB supports advanced features for future requirements, such as a Map of {#clicks per domain} in addition to the user's total number of clicks
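One way to picture that row layout is a Spark schema sketch; only a handful of the 120 columns are shown and every name here is illustrative:

```scala
import org.apache.spark.sql.types._

// Illustrative subset of the audience row schema.
val audienceSchema = StructType(Seq(
  // Plain dimensions
  StructField("userId",  StringType),
  StructField("age",     IntegerType),
  StructField("gender",  StringType),
  StructField("country", StringType),
  // Metrics
  StructField("clicks",      LongType),
  StructField("impressions", LongType),
  StructField("videoViews",  LongType),
  // Multi-value dimensions
  StructField("domains",   ArrayType(StringType)),  // e.g. [CNN.com, Ebay.com]
  StructField("interests", ArrayType(StringType)),  // e.g. [Sport, Shopping, Cars]
  // Possible future requirement: clicks broken down per domain
  StructField("clicksPerDomain", MapType(StringType, LongType))
))
```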
27. DB & Queries POC Requirements
● 1-2 sec queries over 400M users (not all queries took 2 s)
● Complex queries, but still avoiding joins (multi-value / Map)
● Statistics functionality such as HyperLogLog / ThetaSketch
● JDBC compliance
● Easy to maintain
● Low concurrency
● Scale, scale, scale
● Supported / community
● Price estimation
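To make these requirements concrete, this is the kind of query the POC had to run, sent over JDBC from Scala; contains and approx_distinct are standard Presto functions, while the table, column names and connection URL are placeholders:

```scala
import java.sql.DriverManager

object PocQuery {
  def main(args: Array[String]): Unit = {
    // Any JDBC-compliant candidate DB can be swapped in via the URL.
    val conn = DriverManager.getConnection("jdbc:presto://presto:8080/raptor/default", "poc", null)
    val sql =
      """SELECT country,
        |       approx_distinct(userId) AS uniqueUsers,   -- HyperLogLog-style count
        |       sum(clicks)             AS clicks
        |FROM audienceTbl
        |WHERE age BETWEEN 20 AND 25
        |  AND gender = 'male'
        |  AND contains(interests, 'Sport')               -- multi-value "IN", no join
        |GROUP BY country""".stripMargin
    val rs = conn.createStatement().executeQuery(sql)
    while (rs.next())
      println(s"${rs.getString("country")}: ${rs.getLong("uniqueUsers")} users, ${rs.getLong("clicks")} clicks")
    conn.close()
  }
}
```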
28. DB - How We Chose
1. Columnar database
● Minimize reads and writes from disk:
"select age, name from tbl where age > 30"
Instead of reading all rows, returning all columns and then selecting only 2 fields, a columnar DB reads data only from the 2 columns and returns only those 2 columns
● High compression
29. DB - How We Chose
2. MPP
The data is big (0.5-1 TB) and needs to be split across machines to get results within ~2 sec
● Massively Parallel Processing:
○ Multiple processors work on different parts of the program
○ Each processor has its own operating system and memory
○ It speeds up the performance of huge databases that deal with massive amounts of data
30. DB - How We Chose
3. FAST: in-memory DB or OS cache?
Main memory layers (trading off speed against price):
● RAM / in-process memory, like a Java HashMap
● Operating system cache
● Disk: SSD or not
The two options:
● In memory
○ Row data resides in the computer's random access memory (RAM), not on physical disk
● OS cache
○ Columnar/compressed data on disk; executing a query loads the data into the OS memory
31. DB - How We Chose
3. FAST: in-memory DB or OS cache? (continued)
● We found that it is faster (in query time) to use columnar data loaded into the OS cache than row data in memory, since the latter needs to read all the data (we run random queries, so an index will not help)
● Columnar compressed data reduces the storage size, so more raw data fits into the OS cache; the disadvantage is that we pay more CPU time to decompress the data per query
34. ClickHouse
● JDBC compliance
● Very fast
● FS columnar DB (loads data into OS memory)
● Multi-value support - "IN"
● HyperLogLog support
● Easy to install + scale
● Loads CSV data
● Built for our use case
● No support - open source
● Community?
35. MemSql
Commercial
○ A combination of aggregator nodes that keep data in memory as row data, and leaf nodes that store data in FS columnar format
○ Increase aggregators to better handle concurrency
○ Increase leaves to minimize latency
○ We checked both options (row data in memory, columnar data in FS)
36. MemSql: Memory vs. Cache
■ JDBC compliance
■ Very fast
■ Columnar DB + memory
■ Multi-value support? JSON search; we avoid joins
■ Easy to install + scale
■ Loads CSV data
■ Support / commercial
37. RedShift
■ JDBC compliance
■ Fast
■ Columnar DB + memory
■ No multi-value support - you can use "like", and joins are not fast enough
■ Service
■ Commercial
■ High price
Built upon Postgres; stores data in columnar format, no indexes, scalable, each query runs on its own node (MPP)
38. Solr
We tried Solr although it handles data differently: Solr indexes documents and allows very rich queries.
Solr fits our complex query requirements through its faceted navigation.
Once you run queries, the data is loaded into the cache and it provides good query times; the price is paid in data loading/indexing time.
We didn't finish the POC on Solr, as we decided to continue with another solution.
40. Presto: How It Works
● Uses connectors to MySQL / HDFS / Cassandra
● MPP, shared nothing across machines
● Priority queries
● Can load Parquet (columnar compressed format) - no need for joins
● Supports Hive queries
41. Presto on Steroids
● Raptor is a Presto connector (configuration) that loads data onto the Presto machines (sharded on SSD)
● Instead of querying HDFS or S3, it queries its local SSD, and subsequent queries use the OS cache
○ Short load time (<1 h) from the remote HDFS to the local SSDs
(We also tried Presto on top of Alluxio, and HDFS & Presto on the same machine)
42. Presto + Raptor: Load Data Flow
● Create a table using Hive (the data is located in the HDFS cluster)
● Create a table in Raptor
○ CREATE TABLE raptor.audienceTbl AS SELECT * FROM hive.default.audience;
● Load data with Raptor onto the Presto machines
○ INSERT INTO raptorTbl SELECT * FROM hiveTbl WHERE day = <day>
● In case of a Raptor failure, we can continue using Presto to query HDFS via Hive, and in the meantime load the HDFS data into Raptor again.
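A hedged JDBC sketch of that load flow and its Hive fallback; the coordinator URL, the raptor.default schema and the helper names are assumptions, not the actual job:

```scala
import java.sql.{Connection, DriverManager}

object RaptorLoad {
  // Assumed Presto coordinator URL; catalog names follow the slide.
  private def connect(): Connection =
    DriverManager.getConnection("jdbc:presto://presto-coordinator:8080/raptor/default", "etl", null)

  // Daily load: copy one day of the Hive (HDFS) table into Raptor's local SSD shards.
  def loadDay(day: String): Unit = {
    val conn = connect()
    try conn.createStatement().execute(
      s"INSERT INTO raptor.default.audienceTbl SELECT * FROM hive.default.audience WHERE day = '$day'")
    finally conn.close()
  }

  // Fallback: if Raptor is unavailable, the same query can run against Hive/HDFS directly.
  def queryTable(useRaptor: Boolean): String =
    if (useRaptor) "raptor.default.audienceTbl" else "hive.default.audience"
}
```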
46. And the Winner Is
We didn't look for the fastest DB in the world.
We looked for valid solutions for our use case, and we checked functionality, price, etc.
Some solutions failed on the functionality requirements, like MariaDB (columnar); some on their high price (RedShift).
We decided to continue with Presto + Raptor: we can use our Parquet data as-is, load it to HDFS, and load it onto the Presto machines...
48. Serving Flow
● Once we have selected audiences, the next phase is to match audiences to the adRequest in real time
● We use a Spark batch that creates a key-value mapping between userId and audiences, and we upload it to Aerospike
● During the adRequest flow we retrieve (~4 ms) the user's audiences...
Key-value at scale - an in-memory KV store: lots of reads, updated once a day in batch
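A minimal sketch of the Aerospike side using the standard Aerospike Java client from Scala; the namespace, set and bin names are assumptions:

```scala
import com.aerospike.client.{AerospikeClient, Bin, Key}

object AudienceServing {
  private val client = new AerospikeClient("aerospike-host", 3000)

  // Batch side: upload one userId -> audiences record (called per row of the Spark output).
  def upload(userId: String, audiences: Seq[String]): Unit = {
    val key = new Key("dmp", "audiences", userId) // namespace / set / user key
    client.put(null, key, new Bin("audiences", audiences.mkString(",")))
  }

  // Serving side: during the adRequest flow, fetch the user's audiences (millisecond latency).
  def lookup(userId: String): Seq[String] = {
    val record = client.get(null, new Key("dmp", "audiences", userId))
    Option(record)
      .map(_.getString("audiences").split(",").toSeq)
      .getOrElse(Seq.empty)
  }
}
```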