Budapest Spark Meetup - Apache Spark @enbrite.ly presentation held on
March 30, 2016.
The vision we all share at enbrite.ly is to create the next-generation decision-support system for online advertising, one that combines the market's needs (anti-fraud, viewability, brand safety and traffic quality assurance) in one platform. We do this by analyzing vast amounts of data to create value for our customers. In the last 6 months we built our ETL pipeline, the core component of our data platform, on Apache Spark. In this presentation I share the journey from the whiteboard designs to the maintenance of a TB-scale data pipeline, along with the lessons we learned and the ups and downs of using Spark at scale.
Apache Spark @enbrite.ly
Budapest Spark Meetup
March 30, 2016
Who are we?
Our vision is to revolutionize the KPIs and metrics the online
advertising industry is currently using. With our products,
Antifraud, Brandsafety and Viewability, we provide actionable
data to our customers.
● What do we do?
● How do we do it? - enbrite.ly data platform
● Real-world antifraud example
● Lessons learned + Spark at scale +/-
REPORT + API
What do we do?
● Most popular cloud service provider
● Amazon Big Data ecosystem
● Applications: Hadoop, Spark, Hive, …
● Scaling is easy
● Do not trust the BIG guys (API problem)
● Spark applications on EMR run on YARN (cluster manager)
For more information: https://aws.amazon.com/elasticmapreduce/
Tools we use
https://github.com/spotify/luigi | 4500 ★ | more than 200 contributors
A workflow engine that helps you build
complex data pipelines of batch jobs.
Created by Spotify’s engineering team.
Your friendly plumber that sticks your Hadoop, Spark, … jobs together
with simple dependency definitions and failure management.
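import luigi
class HelloTask(luigi.Task):  # class name assumed; the slide's line was lost
    param = luigi.Parameter(default=42)
    def output(self):
        return luigi.LocalTarget('hello.txt')  # target path assumed
    def run(self):
        with self.output().open('w') as f:
            f.write('Hello Spark meetup!')
if __name__ == '__main__':
    luigi.run()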
Tools we created: Gabo Luigi
Luigi + enbrite.ly extensions = Gabo Luigi
● Dynamic task configuration + dependencies
● Reshaped web interface
● Reusable data pipeline templates
● Monitoring for each task
Real world example
You are fighting against robots and want to humanize
the ad tech era. You have a simple idea to detect bot traffic,
which saves the world. Let’s implement it!
Real world example
THE IDEA: Analyse events which are too hasty and deviate
from regular, humanlike profiles: too many clicks in a defined
INPUT: Load balancer access log files on S3
OUTPUT: Print invalid sessions
Step 1: convert access log files to events
Step 2: sessionize events
Step 3: detect too many clicks
How to solve it?
The way to the access log
Click event attributes
(created by JS tracker)
Access log format
TS CLIENT_IP STATUS "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."
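The access log format above can be turned into events along these lines. This is a minimal Python sketch, not the code from the talk. It assumes the `event` query parameter holds URL-safe base64-encoded JSON (the `eyJzZXNzaW9uX2lkIj` prefix on the slide decodes to `{"session_id"`). The function name `log_line_to_event` and the added keys `log_ts`, `client_ip` and `status` are made up for illustration.

```python
import base64
import json
from datetime import datetime
from urllib.parse import urlparse, parse_qs


def log_line_to_event(line):
    """Turn one access log line into an event dict, or None if it carries no event."""
    # Simplified format from the slide: TS CLIENT_IP STATUS "GET <url>"
    ts, client_ip, status, request = line.split(" ", 3)
    url = request.strip('"').split(" ")[1]
    params = parse_qs(urlparse(url).query)
    if "event" not in params:
        return None
    payload = params["event"][0]
    # Restore base64 padding in case it was dropped from the URL
    payload += "=" * (-len(payload) % 4)
    # Assumption: the tracker sends URL-safe base64-encoded JSON
    event = json.loads(base64.urlsafe_b64decode(payload))
    event["log_ts"] = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
    event["client_ip"] = client_ip
    event["status"] = int(status)
    return event
```

Lines without an `event` parameter yield `None`, so a caller can filter them out before the sessionizing step.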
Step 1: log to event
Simplification: the log files are on local storage and contain only click events.
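import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
SparkConf conf = new SparkConf().setAppName("LogToEvent");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> rawEvents = sparkContext.textFile(LOG_FOLDER);  // one RDD element per log line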
// 2016-02-29T23:50:36.269432Z 220.127.116.11 200 "GET
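Steps 2 and 3 could look roughly like the following plain-Python sketch (no Spark, so the logic is easy to follow). It assumes each event is a dict with a `session_id` and a numeric `ts` in seconds. The function names and the thresholds `max_clicks` and `window_seconds` are hypothetical, not values from the talk.

```python
from collections import defaultdict


def sessionize(events):
    """Step 2: group click timestamps (in seconds) by session_id."""
    sessions = defaultdict(list)
    for event in events:
        sessions[event["session_id"]].append(event["ts"])
    return {sid: sorted(stamps) for sid, stamps in sessions.items()}


def has_too_many_clicks(stamps, max_clicks, window_seconds):
    """Step 3: True if more than max_clicks fall inside any window of window_seconds."""
    start = 0
    for end, ts in enumerate(stamps):
        # Move the left edge until the window spans less than window_seconds
        while ts - stamps[start] >= window_seconds:
            start += 1
        if end - start + 1 > max_clicks:
            return True
    return False


def invalid_sessions(events, max_clicks=5, window_seconds=10):
    """Return the sorted ids of sessions that click too fast to look human."""
    return sorted(
        sid
        for sid, stamps in sessionize(events).items()
        if has_too_many_clicks(stamps, max_clicks, window_seconds)
    )


if __name__ == "__main__":
    clicks = [{"session_id": "bot", "ts": t} for t in range(6)]
    clicks += [{"session_id": "human", "ts": t * 20} for t in range(6)]
    print(invalid_sessions(clicks))
```

In a Spark job the grouping would typically be a by-key operation on a pair RDD keyed by session id, with the window check applied to each group.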
YOU just saved the world with a
simple idea within ~10 minutes.
Pros of using Spark
● Sparking is fun: community, tools
● Easy to get started
● Language support: Python, Scala, Java, R
● Unified stack: batch, streaming, SQL, …
Cons of using Spark
● You need memory and more memory
● Distributed application, hard to debug
● Hard to optimize
● Do not use default config, always optimize!
● Eliminate technical debt + automate
● Failures happen: use monitoring from the very
first breath + a fault-tolerant implementation
● Sparking is fun, but not a hammer for …
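As one way to act on "do not use default config", the settings can be written down explicitly and passed to `spark-submit` as `--conf` overrides. The sketch below only builds the command line. The property names are real Spark settings, but the values, the helper name `spark_submit_command`, the jar name and the main class are illustrative assumptions, not the configuration used at enbrite.ly.

```python
def spark_submit_command(app_jar, main_class, conf):
    """Build a spark-submit argument list with explicit --conf overrides."""
    command = ["spark-submit", "--master", "yarn", "--class", main_class]
    for key, value in sorted(conf.items()):
        command += ["--conf", "%s=%s" % (key, value)]
    command.append(app_jar)
    return command


# Illustrative values only; the right numbers depend on the cluster and the job.
TUNED_CONF = {
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.default.parallelism": "200",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}

if __name__ == "__main__":
    # Hypothetical jar and class names
    print(" ".join(spark_submit_command("etl.jar", "com.example.LogToEvent", TUNED_CONF)))
```

Keeping the overrides in one place makes it easier to review and version them alongside the job.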
Data platform future
● Would like to play with Redshift
● Change data format (Avro, Parquet, …)
● Would like to play with streaming
● Would like to play with Spark 2.0
WE ARE HIRING!
working @exPrezi office, K9
check out the company in Forbes :-)
amazing company culture
BUT the real reason ….
… is our mood manager, Bigyó :)