4. THE FIRST SECOND ON A SCALE
User on site → Request to bid → Accept bid → Deliver ad
80 ms to predict viewability and make a decision
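The 80 ms window above is a hard deadline: a bidder that answers late simply loses the impression. A minimal sketch of guarding a decision with such a budget (all names here, e.g. `BidDecision` and `decideWithinBudget`, are hypothetical illustrations, not part of the stack described in this deck):

```scala
// Hypothetical sketch: enforce the ~80 ms decision budget around the
// viewability-prediction step. If the model runs over budget, no-bid.
case class BidDecision(participate: Boolean, priceCpm: Double)

def decideWithinBudget(budgetMs: Long)(predict: => BidDecision): BidDecision = {
  val start = System.nanoTime()
  val decision = predict // run the viewability model / bidding logic
  val elapsedMs = (System.nanoTime() - start) / 1000000
  if (elapsedMs > budgetMs) BidDecision(participate = false, priceCpm = 0.0) // too late
  else decision
}
```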
5. REAL-TIME BIDDING IN DETAIL
STORY ACTORS:
User
Publisher (foxnews.com)
Ad Server (Google's DoubleClick)
SSP (Ad Exchange)
DSP (decides which ad to show)
Advertiser (Nike)
COLLECTIVE
6. THE FIRST SECOND IN DETAIL
Timeline (ms markers from the slide: 20, 100, 150, 170, 200, 210, 280, 300, 350, 400, 1 sec):
1. Publisher receives the request
2. Publisher sends the response back
3. Content is delivered to the user
4. Site sends a request to the Ad Server to show an ad
5. Ad Server sends a signal to the Ad Exchange
6. SSP (Ad Exchange) receives the ad request and opens an RTB auction
7. Every bidder/DSP receives info about the user: ssp_cookie_id, geo data, site URL
   (~70% of users have this cookie aboard, thanks to retargeting;
   >>1 independent companies take part in this auction)
8. All bidders must send their decision (participate? & price) back within 80 ms
9. SSP chooses the winning DSP and sends this info back to the Ad Server
10. Ad Server shows a page that redirects to the winning DSP's server, which does piggybacking
11. User's web page shows the banner with the ad, firing the DSP's 1x1 pixel (impression)
Piggybacking: an async JS redirect to all DSP servers with ssp_cookie as a query param.
A chance for a DSP to link ssp_cookie_id with its own dsp_cookie (retargeting again).
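The "SSP chooses the winning DSP" step can be sketched in a few lines. This assumes a second-price rule, which was typical for RTB exchanges of that era (the slide does not specify the pricing model), and `Bid`/`runAuction` are hypothetical names:

```scala
// Illustrative second-price auction: the highest bid above the floor wins,
// but pays the second-highest price (or the floor if it is the only bid).
case class Bid(dspId: String, priceCpm: Double)

def runAuction(bids: Seq[Bid], floorCpm: Double): Option[(Bid, Double)] = {
  val eligible = bids.filter(_.priceCpm >= floorCpm).sortBy(-_.priceCpm)
  eligible match {
    case winner +: second +: _ => Some(winner -> second.priceCpm) // pay 2nd price
    case winner +: _           => Some(winner -> floorCpm)        // single bidder
    case _                     => None                            // no fill
  }
}
```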
7. REAL TIME / OFFLINE
REAL-TIME ZONE:
Bidder Farm: serves auction requests from the SSP / Ad Exchange
Tag Farm: serves impressions, clicks, post-click activities
Both farms write hourly logs into Hadoop and keep cached user interests, which are updated from the offline side
OFFLINE ZONE (Hadoop's HDFS: MapReduce, Scalding, HBase, Oozie, Hive):
HBase keeps user profiles; profiles are updated with new segments
Data is exported to the warehouse, data scientists and partners
Additional inputs: householders data, 3rd-party data
Data scientists return info about new user interests with special markers (segments) that indicate a new fact about the user, e.g. the user is a man who has an iPhone, lives in NYC and has a dog
Major format: <cookie_id, segment_id>
A brand-new feed about user interests flows back to the real-time zone, updating the cached user interests
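A minimal in-memory sketch of the `<cookie_id, segment_id>` profile format that HBase keeps, with a plain `Map` standing in for the HBase table (the helper `addSegments` is illustrative, not from the codebase described here):

```scala
// Toy stand-in for the HBase profile table: cookie_id -> set of segment ids.
// Appending segments is how the offline side enriches a profile, e.g. with
// "man", "has iPhone", "lives in NYC" markers encoded as segment ids.
def addSegments(profiles: Map[String, Set[Int]],
                cookie: String,
                segments: Set[Int]): Map[String, Set[Int]] =
  profiles.updated(cookie, profiles.getOrElse(cookie, Set.empty[Int]) ++ segments)
```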
8. QUICK FACTS ABOUT TECH STACK
REALTIME TEAM
LANGUAGES
Scala, Akka
Java
WEB SERVICES
RESTful interfaces
Pure Servlets
Spray
IN-MEMORY CACHES
Aerospike
Redis
HADOOP
HBASE to keep profile database
HDFS to store everything
OOZIE to build workflows + custom coordinator implementation
Scalding
CLOUDERA
CDH 5.4.1
EVENT-DRIVEN NEW GEN APPROACH
Spark + Kafka
OFFLINE TEAM
13. WHO ARE DATA SCIENTISTS?
Data Scientist: data, visualization, computers, models, algorithms → insights & predictions
14. AUDIENCE FISHING
THE PROBLEM: Data Sparsity
DEEP AUDIENCE TARGETING
Only 1-4 providers know about 40% of users; the rest are terra incognita.
LIFE CASE
A customer (Nike) would like to show an ad to all men who live in NYC, have an iPhone and pets.
OTHER ACTIVITIES
Profile Bridging
Audience Modelling
Audience Optimization
15. DATA SOURCES
DATA USED FOR MODELING
PROPRIETARY DATA:
VISITATION HISTORY: site / page / context, time
IP-DERIVED DATA: location, connection type
PERFORMANCE HISTORY: actions, engagements
USER AGENT DATA: devices, OS & browser
3rd-PARTY DATA:
DEMOGRAPHICS: age, gender, household composition
POINT OF SALE: purchase history
SEARCH: in-market for a product
Find the meaningful signal in noisy internet data through high-quality models.
16. QUICK FACTS ABOUT TECH STACK
YARN-BASED
IMPALA & HIVE: scalable local warehouse
SPARK: streaming via Kafka
CLASSIC
IBM Netezza: local warehouse
PYTHON & R: predictions, modelling
GGPLOT2: visualizations, insights
18. BIG DATA AT SCALE
USERS: >1B active user profiles
MODELS: 1000s of models built weekly
PREDICTIONS: 100s of billions of predictions daily
MODELING AT SCALE
VOLUME: petabytes of data used
VARIETY: profiles, formats, screens
VELOCITY: 100k+ requests per second, 20 billion events per day
VERACITY: robust measurements
19. INFRASTRUCTURE
Expensive infrastructure (cluster size >>100s of machines)
We use our own private cloud
We use the Cloudera platform (CDH 5.4.1)
Efficient software and hardware deployment strategy
LESSONS LEARNED
Do not delegate cluster maintenance to developers; rely on devops engineers and Cloudera
Everything should be monitored and measured in real time
Think like Google: our data occupies a lot of space, and we track and log everything
20. SHIFTING PARADIGM FROM BATCH-PROCESSING TO EVENT-DRIVEN
PAST: long feedback loop of 1h-1d, belated user identification
FUTURE: event-driven, Reactive Streams, quick feedback
Data science zone + real-time zone (bidders, clickers etc.)
22. SCALDING IN A NUTSHELL
CONCISE DSL OVER SCALA
Developed on top of the Java-based Cascading framework
CAN BE CONSIDERED AS A FLOW PIPE (hdfs in → hdfs out)
CONFIGURABLE SOURCE AND SINK
Multiple sources and sinks
DATA TRANSFORM OPERATIONS:
map / flatMap
pivot / unpivot
project
groupBy / reduce / foldLeft
23. JUST ONE EXAMPLE (JAVA WAY)
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
24. JUST ONE EXAMPLE (SCALDING WAY)
class WordCountJob(args : Args) extends Job(args) {
  // Source
  TextLine( args("input") )
    // Transform operations
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    // Sink
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual lowercased words.
  def tokenize(text : String) : Array[String] = {
    text.toLowerCase.split("\\s+")
  }
}
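For readers without a cluster at hand, the same flatMap → groupBy pipeline can be followed on plain Scala collections. This is a sketch of the semantics only; Scalding runs the equivalent job distributed over HDFS:

```scala
// Plain-collections analogue of the Scalding word-count job above:
// flatMap splits lines into words, groupBy + size counts occurrences.
def tokenize(text: String): Array[String] =
  text.toLowerCase.split("\\s+")

def wordCount(lines: Seq[String]): Map[String, Int] =
  lines.flatMap(tokenize).groupBy(identity).map { case (w, ws) => (w, ws.size) }
```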
25. USE CASE 1 - SPLIT
val common = Tsv("./file").map(...)
val branch1 = common.map(..).write(Tsv("output1"))
val branch2 = common.groupBy(..).write(Tsv("output2"))
MOTIVATION
Reuse calculated streams
Performance
26. USE CASE 3 - EXOTIC SOURCES / SINKS
HBASE
HBaseSource
https://github.com/ParallelAI/SpyGlass
SCAN_ALL
GET_LIST
SCAN_RANGE
HBaseRawSource
https://github.com/btrofimov/SpyGlass
Advanced filtering via base64Scan
Support of CDH 5.4.1
Feature to remove particular columns from rows

val hbs3 = new HBaseSource(
    tableName,
    quorum,
    'key,
    List("data"),
    List('data),
    sourceMode = SourceMode.SCAN_ALL)
  .read

val scan = new Scan()
scan.setCaching(caching)
val activity_filters = new FilterList(MUST_PASS_ONE, {
  val scvf = new SingleColumnValueFilter(toBytes("family"),
    toBytes("column"), GREATER_OR_EQUAL, toBytes(value))
  scvf.setFilterIfMissing(true)
  scvf.setLatestVersionOnly(true)
  val scvf2 = ...
  List(scvf, scvf2)
})
scan.setFilter(activity_filters)
new HBaseRawSource(tableName, quorum, families,
  base64Scan = convertScanToBase64(scan)).read. ...
27. USE CASE 4 - JOIN
MOTIVATION
Joining two streams by key
Different performance strategies:
joinWithLarger
joinWithSmaller
joinWithTiny
Inner, Left, Right join strategies

val pipe1 = Tsv("file1").read
val pipe2 = Tsv("file2").read // small file
val pipe3 = Tsv("file3").read // huge file
val joinedPipe = pipe1.joinWithTiny('id1 -> 'id2, pipe2)
val joinedPipe2 = pipe1.joinWithLarger('id1 -> 'id2, pipe3)
28. USE CASE 5 - DISTRIBUTED CACHING AND COUNTERS
// somewhere outside the Job definition
val fl = DistributedCacheFile("/user/boris/zooKeeper.json")
// the next value can be passed through any of Scalding's jobs, for instance via the Args object
val fileName = fl.path
...
class Job(val args:Args) {
  // once we receive fl.path we can read it like an ordinary file
  val fileName = args.get("fileName")
  lazy val data = readJSONFromFile(fileName)
  ...
  Tsv(args.get("input")).read.map('line -> 'word ) {
    line => ... /* using the data json object */ ... }
}

// counter example
Stat("jdbc.call.counter","myapp").incBy(1)
30. MULTISCREEN AND PROFILE BRIDGING
PROBLEM: often neighbouring profiles belong to the same user
GOAL: build a complete person profile by bridging information from different sources:
Mobile
Desktop
TV
Social
Example: WiFi location X at 10:00 am, WiFi location Y at 8:00 pm
32. PROFILE BRIDGING THROUGH MATH LENSES
General task definition:
Build a graph
Vertices: user's interests
Edges: bridging rules [cookies, IP, ...]
Task: identify connected components
We do bridging on a daily basis
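The connected-components task above can be sketched on plain Scala collections as iterative minimum-label propagation: keep assigning each endpoint of an edge the smaller of the two group ids until nothing changes. This is a toy illustration of the idea only; the production job runs this kind of iteration in Scalding over HDFS:

```scala
// Toy label propagation: gid(v) converges to the minimum vertex id in
// v's component, i.e. one representative id per connected component.
def connectedComponents(vertices: Set[String],
                        edges: Seq[(String, String)]): Map[String, String] = {
  var gid = vertices.map(v => v -> v).toMap // initially gid == id
  var changed = true
  while (changed) {
    changed = false
    for ((a, b) <- edges) {
      val m = if (gid(a) < gid(b)) gid(a) else gid(b) // smaller group id wins
      if (gid(a) != m || gid(b) != m) {
        gid = gid.updated(a, m).updated(b, m)
        changed = true
      }
    }
  }
  gid
}
```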
33. SCALDING CONNECTED COMPONENTS
/**
 * The class represents just one iteration of the connected-components search algorithm.
 * Somewhere outside the Job code we have to run this job iteratively, up to N [~20] times, and check the number inside the "count" file.
 * If it is zero then we can stop running further iterations.
 */
class ConnectedComponentsOneIterationJob(args : Args) extends Job(args) {
  val vertexes = Tsv( args("vertexes"), ('id, 'gid) ).read // by default gid is equal to id
  val edges = Tsv( args("edges"), ('id_a, 'id_b) ).read

  val groups = vertexes.joinWithSmaller('id -> 'id_b, vertexes.joinWithSmaller('id -> 'id_a, edges).discard('id).rename('gid -> 'gid_a))
    .discard('id)
    .rename('gid -> 'gid_b)
    .filter('gid_a, 'gid_b) { gid : (String, String) => gid._1 != gid._2 }
    .project('gid_a, 'gid_b)
    .mapTo(('gid_a, 'gid_b) -> ('gid_a, 'gid_b)) { gid : (String, String) => max(gid._1, gid._2) -> min(gid._1, gid._2) }

  // if count = 0 then we can stop running further iterations
  groups.groupAll { _.size }.write(Tsv("count"))

  val new_groups = groups.groupBy('gid_a) { _.min('gid_b) }.rename(('gid_a, 'gid_b) -> ('source, 'target))

  val new_vertexes = vertexes.joinWithSmaller('id -> 'source, new_groups, joiner = new LeftJoin)
    .mapTo(('id, 'gid, 'source, 'target) -> ('id, 'gid)) { param : (String, String, String, String) =>
      val (id, gid, source, target) = param
      if (target != null) (id, min(gid, target)) else (id, gid)
    }

  new_vertexes.write( Tsv( args("new_vertexes") ) )
}
34. OTHER SWEET THINGS
Typed Pipes
Elegant and fast Matrix operations
Simple migration on Spark/Kafka
More sources: e.g. retrieving data from Hive's HCatalog, JDBC, …
Simple Integration Testing
Scalding is good in every respect except one: its documentation. I am not going to walk you through a tutorial here, but I will show a couple of dark corners that are not at all obvious from the documentation.