4. THE FIRST SECOND ON A SCALE
User on site → Request to bid → Accept bid → Deliver ad
80 ms to predict viewability and make a decision
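The 80 ms window above is a hard deadline: a bidder that answers late simply loses the impression. A minimal sketch of guarding a decision with such a budget (all names here, e.g. `BidDecision` and `decideWithinBudget`, are hypothetical illustrations, not part of the stack described in this deck):

```scala
// Hypothetical sketch: enforce the ~80 ms decision budget around the
// viewability-prediction step. If the model runs over budget, no-bid.
case class BidDecision(participate: Boolean, priceCpm: Double)

def decideWithinBudget(budgetMs: Long)(predict: => BidDecision): BidDecision = {
  val start = System.nanoTime()
  val decision = predict // run the viewability model / bidding logic
  val elapsedMs = (System.nanoTime() - start) / 1000000
  if (elapsedMs > budgetMs) BidDecision(participate = false, priceCpm = 0.0) // too late
  else decision
}
```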
5. REAL-TIME BIDDING IN DETAIL
STORY ACTORS:
User
Publisher (foxnews.com)
Ad Server (Google's DoubleClick)
SSP (Ad Exchange)
DSP (decides which ad to show)
Advertiser (Nike)
COLLECTIVE
6. THE FIRST SECOND IN DETAIL
Timeline (ms markers from the slide: 20, 100, 150, 170, 200, 210, 280, 300, 350, 400, 1 sec):
1. Publisher receives the request
2. Publisher sends the response back
3. Content is delivered to the user
4. Site sends a request to the Ad Server to show an ad
5. Ad Server sends a signal to the Ad Exchange
6. SSP (Ad Exchange) receives the ad request and opens an RTB auction
7. Every bidder/DSP receives info about the user: ssp_cookie_id, geo data, site URL
   (~70% of users have this cookie aboard, thanks to retargeting;
   >>1 independent companies take part in this auction)
8. All bidders must send their decision (participate? & price) back within 80 ms
9. SSP chooses the winning DSP and sends this info back to the Ad Server
10. Ad Server shows a page that redirects to the winning DSP's server, which does piggybacking
11. User's web page shows the banner with the ad, firing the DSP's 1x1 pixel (impression)
Piggybacking: an async JS redirect to all DSP servers with ssp_cookie as a query param.
A chance for a DSP to link ssp_cookie_id with its own dsp_cookie (retargeting again).
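The "SSP chooses the winning DSP" step can be sketched in a few lines. This assumes a second-price rule, which was typical for RTB exchanges of that era (the slide does not specify the pricing model), and `Bid`/`runAuction` are hypothetical names:

```scala
// Illustrative second-price auction: the highest bid above the floor wins,
// but pays the second-highest price (or the floor if it is the only bid).
case class Bid(dspId: String, priceCpm: Double)

def runAuction(bids: Seq[Bid], floorCpm: Double): Option[(Bid, Double)] = {
  val eligible = bids.filter(_.priceCpm >= floorCpm).sortBy(-_.priceCpm)
  eligible match {
    case winner +: second +: _ => Some(winner -> second.priceCpm) // pay 2nd price
    case winner +: _           => Some(winner -> floorCpm)        // single bidder
    case _                     => None                            // no fill
  }
}
```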
7. REAL TIME / OFFLINE
REAL-TIME ZONE:
Bidder Farm: serves auction requests from the SSP / Ad Exchange
Tag Farm: serves impressions, clicks, post-click activities
Both farms write hourly logs into Hadoop and keep cached user interests, which are updated from the offline side
OFFLINE ZONE (Hadoop's HDFS: MapReduce, Scalding, HBase, Oozie, Hive):
HBase keeps user profiles; profiles are updated with new segments
Data is exported to the warehouse, data scientists and partners
Additional inputs: householders data, 3rd-party data
Data scientists return info about new user interests with special markers (segments) that indicate a new fact about the user, e.g. the user is a man who has an iPhone, lives in NYC and has a dog
Major format: <cookie_id, segment_id>
A brand-new feed about user interests flows back to the real-time zone, updating the cached user interests
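A minimal in-memory sketch of the `<cookie_id, segment_id>` profile format that HBase keeps, with a plain `Map` standing in for the HBase table (the helper `addSegments` is illustrative, not from the codebase described here):

```scala
// Toy stand-in for the HBase profile table: cookie_id -> set of segment ids.
// Appending segments is how the offline side enriches a profile, e.g. with
// "man", "has iPhone", "lives in NYC" markers encoded as segment ids.
def addSegments(profiles: Map[String, Set[Int]],
                cookie: String,
                segments: Set[Int]): Map[String, Set[Int]] =
  profiles.updated(cookie, profiles.getOrElse(cookie, Set.empty[Int]) ++ segments)
```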
8. QUICK FACTS ABOUT TECH STACK
REALTIME TEAM
LANGUAGES
Scala, Akka
Java
WEB SERVICES
RESTful interfaces
Pure Servlets
Spray
IN-MEMORY CACHES
Aerospike
Redis
HADOOP
HBASE to keep profile database
HDFS to store everything
OOZIE to build workflows + custom coordinator implementation
Scalding
CLOUDERA
CDH 5.4.1
EVENT-DRIVEN NEW GEN APPROACH
Spark + Kafka
OFFLINE TEAM
13. WHO ARE DATA SCIENTISTS?
Data Scientist: data, visualization, computers, models, algorithms → insights & predictions
14. AUDIENCE FISHING
THE PROBLEM: Data Sparsity
DEEP AUDIENCE TARGETING
Only 1-4 providers know about 40% of users; the rest are terra incognita.
LIFE CASE
A customer (Nike) would like to show an ad to all men who live in NYC, have an iPhone and pets.
OTHER ACTIVITIES
Profile Bridging
Audience Modelling
Audience Optimization
15. DATA SOURCES
DATA USED FOR MODELING
PROPRIETARY DATA:
VISITATION HISTORY: site / page / context, time
IP-DERIVED DATA: location, connection type
PERFORMANCE HISTORY: actions, engagements
USER AGENT DATA: devices, OS & browser
3rd-PARTY DATA:
DEMOGRAPHICS: age, gender, household composition
POINT OF SALE: purchase history
SEARCH: in-market for a product
Find the meaningful signal in noisy internet data through high-quality models.
16. QUICK FACTS ABOUT TECH STACK
YARN-BASED
IMPALA & HIVE: scalable local warehouse
SPARK: streaming via Kafka
CLASSIC
IBM Netezza: local warehouse
PYTHON & R: predictions, modelling
GGPLOT2: visualizations, insights
18. BIG DATA AT SCALE
USERS: >1B active user profiles
MODELS: 1000s of models built weekly
PREDICTIONS: 100s of billions of predictions daily
MODELING AT SCALE
VOLUME: petabytes of data used
VARIETY: profiles, formats, screens
VELOCITY: 100k+ requests per second, 20 billion events per day
VERACITY: robust measurements
19. INFRASTRUCTURE
Expensive infrastructure (cluster size >>100s of machines)
We use our own private cloud
We use the Cloudera platform (CDH 5.4.1)
Efficient software and hardware deployment strategy
LESSONS LEARNED
Do not delegate cluster maintenance to developers; rely on devops engineers and Cloudera
Everything should be monitored and measured in real time
Think like Google: our data occupies a lot of space, and we track and log everything
20. SHIFTING PARADIGM FROM BATCH-PROCESSING TO EVENT-DRIVEN
PAST: long feedback loop of 1h-1d, belated user identification
FUTURE: event-driven, Reactive Streams, quick feedback
Data science zone + real-time zone (bidders, clickers etc.)
22. SCALDING IN A NUTSHELL
CONCISE DSL OVER SCALA
Developed on top of the Java-based Cascading framework
CAN BE CONSIDERED AS A FLOW PIPE (hdfs in → hdfs out)
CONFIGURABLE SOURCE AND SINK
Multiple sources and sinks
DATA TRANSFORM OPERATIONS:
map / flatMap
pivot / unpivot
project
groupBy / reduce / foldLeft
23. JUST ONE EXAMPLE (JAVA WAY)
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
24. JUST ONE EXAMPLE (SCALDING WAY)
class WordCountJob(args : Args) extends Job(args) {
  // Source
  TextLine( args("input") )
    // Transform operations
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    // Sink
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual lowercased words.
  def tokenize(text : String) : Array[String] = {
    text.toLowerCase.split("\\s+")
  }
}
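For readers without a cluster at hand, the same flatMap → groupBy pipeline can be followed on plain Scala collections. This is a sketch of the semantics only; Scalding runs the equivalent job distributed over HDFS:

```scala
// Plain-collections analogue of the Scalding word-count job above:
// flatMap splits lines into words, groupBy + size counts occurrences.
def tokenize(text: String): Array[String] =
  text.toLowerCase.split("\\s+")

def wordCount(lines: Seq[String]): Map[String, Int] =
  lines.flatMap(tokenize).groupBy(identity).map { case (w, ws) => (w, ws.size) }
```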
25. USE CASE 1 - SPLIT
val common = Tsv("./file").map(...)
val branch1 = common.map(..).write(Tsv("output1"))
val branch2 = common.groupBy(..).write(Tsv("output2"))
MOTIVATION
Reuse calculated streams
Performance
26. USE CASE 3 - EXOTIC SOURCES / SINKS
HBASE
HBaseSource
https://github.com/ParallelAI/SpyGlass
SCAN_ALL
GET_LIST
SCAN_RANGE
HBaseRawSource
https://github.com/btrofimov/SpyGlass
Advanced filtering via base64Scan
Support of CDH 5.4.1
Feature to remove particular columns from rows

val hbs3 = new HBaseSource(
    tableName,
    quorum,
    'key,
    List("data"),
    List('data),
    sourceMode = SourceMode.SCAN_ALL)
  .read

val scan = new Scan()
scan.setCaching(caching)
val activity_filters = new FilterList(MUST_PASS_ONE, {
  val scvf = new SingleColumnValueFilter(toBytes("family"),
    toBytes("column"), GREATER_OR_EQUAL, toBytes(value))
  scvf.setFilterIfMissing(true)
  scvf.setLatestVersionOnly(true)
  val scvf2 = ...
  List(scvf, scvf2)
})
scan.setFilter(activity_filters)
new HBaseRawSource(tableName, quorum, families,
  base64Scan = convertScanToBase64(scan)).read. ...
27. USE CASE 4 - JOIN
MOTIVATION
Joining two streams by key
Different performance strategies:
joinWithLarger
joinWithSmaller
joinWithTiny
Inner, Left, Right join strategies

val pipe1 = Tsv("file1").read
val pipe2 = Tsv("file2").read // small file
val pipe3 = Tsv("file3").read // huge file
val joinedPipe = pipe1.joinWithTiny('id1 -> 'id2, pipe2)
val joinedPipe2 = pipe1.joinWithLarger('id1 -> 'id2, pipe3)
28. USE CASE 5 - DISTRIBUTED CACHING AND COUNTERS
// somewhere outside the Job definition
val fl = DistributedCacheFile("/user/boris/zooKeeper.json")
// the next value can be passed through any of Scalding's jobs, for instance via the Args object
val fileName = fl.path
...
class Job(val args:Args) {
  // once we receive fl.path we can read it like an ordinary file
  val fileName = args.get("fileName")
  lazy val data = readJSONFromFile(fileName)
  ...
  Tsv(args.get("input")).read.map('line -> 'word ) {
    line => ... /* using the data json object */ ... }
}

// counter example
Stat("jdbc.call.counter","myapp").incBy(1)
30. MULTISCREEN AND PROFILE BRIDGING
PROBLEM: often neighbouring profiles belong to the same user
GOAL: build a complete person profile by bridging information from different sources:
Mobile
Desktop
TV
Social
Example: WiFi location X at 10:00 am, WiFi location Y at 8:00 pm
32. PROFILE BRIDGING THROUGH MATH LENSES
General task definition:
Build a graph
Vertices: user's interests
Edges: bridging rules [cookies, IP, ...]
Task: identify connected components
We do bridging on a daily basis
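The connected-components task above can be sketched on plain Scala collections as iterative minimum-label propagation: keep assigning each endpoint of an edge the smaller of the two group ids until nothing changes. This is a toy illustration of the idea only; the production job runs this kind of iteration in Scalding over HDFS:

```scala
// Toy label propagation: gid(v) converges to the minimum vertex id in
// v's component, i.e. one representative id per connected component.
def connectedComponents(vertices: Set[String],
                        edges: Seq[(String, String)]): Map[String, String] = {
  var gid = vertices.map(v => v -> v).toMap // initially gid == id
  var changed = true
  while (changed) {
    changed = false
    for ((a, b) <- edges) {
      val m = if (gid(a) < gid(b)) gid(a) else gid(b) // smaller group id wins
      if (gid(a) != m || gid(b) != m) {
        gid = gid.updated(a, m).updated(b, m)
        changed = true
      }
    }
  }
  gid
}
```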
33. SCALDING CONNECTED COMPONENTS
/**
 * The class represents just one iteration of the connected-components search algorithm.
 * Somewhere outside the Job code we have to run this job iteratively, up to N [~20] times, and check the number inside the "count" file.
 * If it is zero then we can stop running further iterations.
 */
class ConnectedComponentsOneIterationJob(args : Args) extends Job(args) {
  val vertexes = Tsv( args("vertexes"), ('id, 'gid) ).read // by default gid is equal to id
  val edges = Tsv( args("edges"), ('id_a, 'id_b) ).read

  val groups = vertexes.joinWithSmaller('id -> 'id_b, vertexes.joinWithSmaller('id -> 'id_a, edges).discard('id).rename('gid -> 'gid_a))
    .discard('id)
    .rename('gid -> 'gid_b)
    .filter('gid_a, 'gid_b) { gid : (String, String) => gid._1 != gid._2 }
    .project('gid_a, 'gid_b)
    .mapTo(('gid_a, 'gid_b) -> ('gid_a, 'gid_b)) { gid : (String, String) => max(gid._1, gid._2) -> min(gid._1, gid._2) }

  // if count = 0 then we can stop running further iterations
  groups.groupAll { _.size }.write(Tsv("count"))

  val new_groups = groups.groupBy('gid_a) { _.min('gid_b) }.rename(('gid_a, 'gid_b) -> ('source, 'target))

  val new_vertexes = vertexes.joinWithSmaller('id -> 'source, new_groups, joiner = new LeftJoin)
    .mapTo(('id, 'gid, 'source, 'target) -> ('id, 'gid)) { param : (String, String, String, String) =>
      val (id, gid, source, target) = param
      if (target != null) (id, min(gid, target)) else (id, gid)
    }

  new_vertexes.write( Tsv( args("new_vertexes") ) )
}
34. OTHER SWEET THINGS
Typed Pipes
Elegant and fast Matrix operations
Simple migration on Spark/Kafka
More sources: e.g. retrieving data from Hive's HCatalog, JDBC, …
Simple Integration Testing
Scalding is good in every respect except one: its documentation. I am not going to walk you through a tutorial here, but I will show a couple of dark corners that are not at all obvious from the documentation.