SCALDING BIG (AD)TA
Boris Trofimoff @ Sigma Software
@b0ris_1
REAL-TIME
BIDDING
or the story about the first second…
BE CAREFUL BUYING SHOES
THE FIRST SECOND ON A SCALE
User on Site
Request to bid
Accept bid
Deliver Ad
80 ms
to predict viewability
and make decision
REAL-TIME BIDDING IN DETAILS
STORY ACTORS:
 User
 Publisher (foxnews.com)
 Ad Server (Google’s DoubleClick)
 SSP (Ad Exchange)
 DSP (decides what ad to show)
 Advertiser (Nike)
COLLECTIVE
THE FIRST SECOND IN DETAILS
Timeline marks: 20 ms, 100, 150, 170, 200, 210, 280, 300, 350, 400, 1 sec
Publisher receives request
Publisher sends response back
Content delivered to user
Site sends request to Ad Server to show ad
Ad Server sends signal to Ad Exchange
SSP (Ad Exchange) receives ad request and opens RTB Auction
 >>1 independent companies take part in this auction
Every bidder/DSP receives info about user:
 ssp_cookie_id (~70% of users have this cookie aboard thanks to retargeting)
 geo data
 site url
All bidders should send their decision (participate? & price) back
 80 ms to respond
SSP chooses the winning DSP and sends this info back to Ad Server
Ad Server shows page that
 redirects to the winning DSP's server
 makes piggybacking
User’s web page
 shows ad banner
 fires DSP’s 1x1 pixel (impression)
 Piggybacking – async JS redirect to all DSP servers with ssp_cookie as a query param
 Chance for DSP to link ssp_cookie_id with own dsp_cookie (retargeting again)
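The auction step above can be sketched in plain Scala. This is only an illustration under stated assumptions: the `BidResponse` fields, the `priceCents` unit and the highest-price-wins rule are hypothetical, since the slides only say that bidders return a decision and a price within 80 ms and that the SSP chooses a winner.

```scala
object RtbAuctionSketch {
  // Hypothetical shape of one bidder's answer; field names are illustrative.
  final case class BidResponse(dsp: String, participate: Boolean, priceCents: Int, latencyMs: Int)

  // Time budget each bidder has to answer, taken from the slide.
  val TimeoutMs = 80

  // Keep bids that opted in and arrived in time, then pick the highest price.
  // Highest-price-wins is an assumption; the slides do not state the pricing rule.
  def chooseWinner(responses: Seq[BidResponse]): Option[BidResponse] = {
    val valid = responses.filter(r => r.participate && r.latencyMs <= TimeoutMs)
    if (valid.isEmpty) None else Some(valid.maxBy(_.priceCents))
  }
}
```

A bidder that answers after the timeout, or declines to participate, simply drops out of the candidate list.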
REAL TIME
OFFLINE
Hadoop’s HDFS
MapReduce, Scalding, HBase, Oozie, Hive
Updating user profiles
Data export
Update user profiles with new segments
Partners
Warehouse
Data Scientists
Brand new feed about user interests
Return info about new user interests with special markers (segments), each indicating a new fact about the user, e.g. the user is a man who has an iPhone, lives in NYC and has a dog.
Major format: <cookie_id – segment_id>
HBase keeps user profiles
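A minimal sketch of how such a feed of (cookie_id, segment_id) pairs could be merged into stored profiles. It uses an in-memory `Map` as a stand-in for the HBase profile table; the object name, type aliases and `applyFeed` are assumptions made for illustration, not the production code.

```scala
object ProfileUpdateSketch {
  type CookieId = String
  type SegmentId = String
  // In-memory stand-in for the profile table: cookie id -> set of segments.
  type Profiles = Map[CookieId, Set[SegmentId]]

  // Merge a feed of (cookie_id, segment_id) pairs into existing profiles.
  // Unknown cookies get a new profile; repeated segments are deduplicated by the Set.
  def applyFeed(profiles: Profiles, feed: Seq[(CookieId, SegmentId)]): Profiles =
    feed.foldLeft(profiles) { case (acc, (cookie, segment)) =>
      acc.updated(cookie, acc.getOrElse(cookie, Set.empty[SegmentId]) + segment)
    }
}
```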
Bidder Farm
Auction requests
Tag Farm
Impressions, Clicks, Post-click Activities
Hourly Logs
Updates cached user interests
SSP Ad Exchange
Household data
3rd-party data
QUICK FACTS ABOUT TECH STACK
REALTIME TEAM
LANGUAGES
 Scala, Akka
 Java
WEB SERVICES
 RESTful interfaces
 Pure Servlets
 Spray
IN-MEMORY CACHES
 Aerospike
 Redis
OFFLINE TEAM
HADOOP
 HBASE to keep profile database
 HDFS to store everything
 OOZIE to build workflows + custom coordinator implementation
 Scalding
CLOUDERA
 CDH 5.4.1
EVENT-DRIVEN NEW GEN APPROACH
 Spark + Kafka
DATA SCIENCE?
or why we need all this science
WHO ARE DATA SCIENTISTS?
Data Scientist
Data
Visualization
Computers
Models
Algorithms
Insights Predictions
AUDIENCE FISHING
THE PROBLEM: Data Sparsity
DEEP AUDIENCE TARGETING
 Only 1-4 providers know about 40% of users. Another 40% is Terra Incognita
LIFE CASE
 A customer (Nike) would like to show an ad to all men who live in NYC and have an iPhone and pets
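The life case above amounts to selecting every profile that carries all requested segments. A minimal sketch, assuming profiles are held as a map from cookie id to a set of segment names; the segment names and the `select` helper are hypothetical.

```scala
object AudienceSketch {
  // Return the cookies whose profile contains every required segment.
  def select(profiles: Map[String, Set[String]], required: Set[String]): Set[String] =
    profiles.collect { case (cookie, segments) if required.subsetOf(segments) => cookie }.toSet
}
```

With sparse data most profiles miss at least one of the required segments, which is why the audience for a narrow request such as this one tends to be small.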
OTHER ACTIVITIES
 Profile Bridging
 Audience Modelling
 Audience Optimization
DATA SOURCES
DATA USED FOR MODELING
PROPRIETARY DATA 3rd PARTY DATA
VISITATION HISTORY
 Site / Page / Context
 Time
IP DERIVED DATA
 Location
 Connection type
PERFORMANCE HISTORY
 Actions
 Engagements
USER AGENT DATA
 Devices
 OS & Browser
DEMOGRAPHICS
 Age, Gender
 Household composition
POINT OF SALE
 Purchase history
SEARCH
 In-market for a product
Find the meaningful signal in noisy internet data through high-quality models
QUICK FACTS ABOUT TECH STACK
YARN-BASED
IMPALA & HIVE
 Scalable local warehouse
SPARK
 Streaming via Kafka
CLASSIC
IBM Netezza
 Local warehouse
PYTHON
R LANGUAGE
 predictions
 modelling
GGPLOT2
 Visualizations
 Insights
DATA
CHALLENGES
BIG DATA AT SCALE
USERS
 >1B active user profiles
MODELS
 1000s of models built weekly
PREDICTIONS
 100s of billions of predictions daily
MODELING AT SCALE
VOLUME
Petabytes of data used
VARIETY
Profiles, formats, screens
VELOCITY
100k+ requests per second
20 billion events per day
VERACITY
Robust measurements
INFRASTRUCTURE
 Expensive infrastructure (cluster size >>100s of machines)
 We use our own private cloud
 We use the Cloudera platform (CDH 5.4.1)
 Efficient software and hardware deployment strategy
LESSONS LEARNED
 Do not delegate cluster maintenance to developers; rely on DevOps engineers and Cloudera
 Everything should be monitored and measured in real time
 Think like Google. Our data takes up a lot of space. We track and log everything
SHIFTING PARADIGM FROM
BATCH-PROCESSING TO EVENT-DRIVEN
PAST
 Long feedback in 1h-1d
 Belated user identification
FUTURE
 Event driven
 Reactive Streams
 Quick feedback
+
Data science zone
Real-time zone
(Bidders, Clickers etc.)
LET’S SCALDING
SOMETHING
SCALDING IN A NUTSHELL
hdfs
hdfs
CONCISE DSL OVER SCALA
 Developed on top of Java-based Cascading framework
CAN BE CONSIDERED AS A FLOW PIPE
CONFIGURABLE SOURCE AND SINK
 Multiple sources and sinks
DATA TRANSFORM OPERATIONS:
 Map / flatMap
 Pivot / unpivot
 Project
 groupBy / reduce / foldLeft
JUST ONE EXAMPLE (JAVA WAY)
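import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token of the input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}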
Source
JUST ONE EXAMPLE (SCALDING WAY)
import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase the text and split it on whitespace.
    text.toLowerCase.split("\\s+")
  }
}
Sink
Transform
operations
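For comparison, the same flatMap / groupBy / size shape can be tried on ordinary Scala collections without a cluster. This is a local sketch only; `WordCountSketch` is a hypothetical name and nothing here uses the Scalding API.

```scala
object WordCountSketch {
  // Same shape as the Scalding job: flatMap lines to words, group by word, count each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+"))
      .filter(_.nonEmpty) // split yields an empty token when a line starts with whitespace
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```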
USE CASE 1 - SPLIT
val common = Tsv("./file").map(...)
val branch1 = common.map(..).write(Tsv("output1"))
val branch2 = common.groupBy(..).write(Tsv("output2"))
MOTIVATION
 Reuse calculated streams
 Performance
USE CASE 3 - EXOTIC SOURCES / SINKS
HBASE
HBaseSource
https://github.com/ParallelAI/SpyGlass
 SCAN_ALL
 GET_LIST
 SCAN_RANGE
HBaseRawSource
https://github.com/btrofimov/SpyGlass
 Advanced filtering via base64Scan
 Support of CDH 5.4.1
 Feature to remove particular columns from rows
// HBaseRawSource with a custom Scan
val scan = new Scan()
scan.setCaching(caching)
val activity_filters = new FilterList(MUST_PASS_ONE, {
  val scvf = new SingleColumnValueFilter(toBytes("family"),
    toBytes("column"), GREATER_OR_EQUAL, toBytes(value))
  scvf.setFilterIfMissing(true)
  scvf.setLatestVersionOnly(true)
  val scvf2 = ...
  List(scvf, scvf2)
})
scan.setFilter(activity_filters)
new HBaseRawSource(tableName, quorum, families,
  base64Scan = convertScanToBase64(scan)).read. ...

// HBaseSource in SCAN_ALL mode
val hbs3 = new HBaseSource(
  tableName,
  quorum,
  'key,
  List("data"),
  List('data),
  sourceMode = SourceMode.SCAN_ALL)
  .read
USE CASE 4 - JOIN
MOTIVATION
 Joining two streams by key
 Different performance strategies:
   joinWithLarger
   joinWithSmaller
   joinWithTiny
 Inner, Left, Right join strategies
val pipe1 = Tsv("file1").read
val pipe2 = Tsv("file2").read // small file
val pipe3 = Tsv("file3").read // huge file
val joinedPipe = pipe1.joinWithTiny('id1 -> 'id2, pipe2)
val joinedPipe2 = pipe1.joinWithLarger('id1 -> 'id2, pipe3)
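The idea behind the tiny-side strategy can be sketched in plain Scala: the small input is held entirely in memory as a lookup table and the large input is streamed past it, so the large side never has to be shuffled. This is a local illustration under that assumption, not Scalding's implementation; `replicatedJoin` is a hypothetical helper name.

```scala
object TinyJoinSketch {
  // Inner join: the tiny side becomes an in-memory lookup table,
  // the large side is traversed once as an iterator.
  def replicatedJoin[K, A, B](large: Iterator[(K, A)], tiny: Seq[(K, B)]): Iterator[(K, A, B)] = {
    val lookup: Map[K, Seq[B]] =
      tiny.groupBy(_._1).map { case (key, rows) => key -> rows.map(_._2) }
    large.flatMap { case (key, left) =>
      lookup.getOrElse(key, Seq.empty[B]).map(right => (key, left, right))
    }
  }
}
```

The approach only pays off when the small side really fits in memory on every worker; otherwise a reduce-side join such as joinWithSmaller or joinWithLarger is the safer choice.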
USE CASE 5 – DISTRIBUTED CACHING
AND COUNTERS
// somewhere outside the Job definition
val fl = DistributedCacheFile("/user/boris/zooKeeper.json")
// the next value can be passed to any Scalding job, for instance via the Args object
val fileName = fl.path
...
class MyJob(args : Args) extends Job(args) {
  // once we receive fl.path we can read it like an ordinary file
  val fileName = args("fileName")
  lazy val data = readJSONFromFile(fileName)
  ...
  Tsv(args("input")).read.map('line -> 'word ) {
    line => ... /* using data json object */ ... }
}
// counter example
Stat("jdbc.call.counter","myapp").incBy(1)
USE CASE 5
PROFILE BRIDGING
MULTISCREEN AND PROFILE BRIDGING
PROBLEM: Often neighboring profiles are the same user
GOAL: Build complete person profile based on bridging information from different sources:
 Mobile
 Desktop
 TV
 Social
WiFi Location X
Time 10:00 am
WiFi Location Y
Time 8:00 pm
ONE EXAMPLE
ssp_cookie_Id1
dsp_cookie_id1
IP
Bridging via
ip address
IP
Bridging two ssp_cookies
via private cookie
imp
ssp_cookie_Id1
dsp_cookie_id2 ssp_cookie_id1
PROFILE BRIDGING THROUGH MATH
LENSES
General task definition:
 Build graph
 Vertices – user’s interests
 Edges – bridging rules [cookies, IP,…]
 Task – Identify connected components
 We do bridging on a daily basis
SCALDING CONNECTED COMPONENTS
/**
 * This class represents a single iteration of the connected components search.
 * Somewhere outside the Job code we have to run this job iteratively, up to N [~20] times, checking the number inside the "count" file after each run.
 * If it is zero then we can stop running further iterations.
 */
class ConnectedComponentsOneIterationJob(args : Args) extends Job(args) {
  val vertexes = Tsv( args("vertexes"), ('id, 'gid) ).read // by default gid is equal to id
  val edges = Tsv( args("edges"), ('id_a, 'id_b) ).read

  // min and max are assumed to be string-comparison helpers defined elsewhere
  val groups = vertexes.joinWithSmaller('id -> 'id_b,
      vertexes.joinWithSmaller('id -> 'id_a, edges).discard('id).rename('gid -> 'gid_a))
    .discard('id)
    .rename('gid -> 'gid_b)
    .filter('gid_a, 'gid_b) { gid : (String, String) => gid._1 != gid._2 }
    .project('gid_a, 'gid_b)
    .mapTo(('gid_a, 'gid_b) -> ('gid_a, 'gid_b)) { gid : (String, String) => max(gid._1, gid._2) -> min(gid._1, gid._2) }

  // if count=0 then we can stop running next iterations
  groups.groupAll { _.size }.write(Tsv("count"))

  val new_groups = groups.groupBy('gid_a) { _.min('gid_b) }.rename(('gid_a, 'gid_b) -> ('source, 'target))
  val new_vertexes = vertexes.joinWithSmaller('id -> 'source, new_groups, joiner = new LeftJoin)
    .mapTo(('id, 'gid, 'source, 'target) -> ('id, 'gid)) { param : (String, String, String, String) =>
      val (id, gid, source, target) = param
      if (target != null) (id, min(gid, target)) else (id, gid)
    }

  new_vertexes.write( Tsv( args("new_vertexes") ) )
}
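The iterate-until-nothing-changes loop can be tried locally with a simplified variant of the same idea. The sketch below is plain Scala and uses ordinary min-label propagation (each vertex adopts the smallest group id among itself and its neighbours), which is not the exact pipeline of the job above; the object and method names are hypothetical, and every edge endpoint is assumed to be present in the vertex set.

```scala
object ConnectedComponentsSketch {
  // One iteration: every vertex adopts the smallest group id among itself and its neighbours.
  // Returns the new labelling and the number of vertices whose group id changed.
  def iterate(groups: Map[String, String], edges: Seq[(String, String)]): (Map[String, String], Int) = {
    val undirected = edges ++ edges.map(_.swap)
    // Smallest group id seen among the neighbours of each vertex.
    val candidate: Map[String, String] =
      undirected.groupBy(_._1).map { case (id, es) => id -> es.map(e => groups(e._2)).min }
    val updated = groups.map { case (id, gid) =>
      id -> candidate.get(id).fold(gid)(c => if (c < gid) c else gid)
    }
    val changed = updated.count { case (id, gid) => groups(id) != gid }
    (updated, changed)
  }

  // Repeat until nothing changes or the iteration cap (about 20 in the slides) is reached.
  def run(vertices: Set[String], edges: Seq[(String, String)], maxIterations: Int = 20): Map[String, String] = {
    var groups = vertices.map(v => v -> v).toMap // by default gid is equal to id
    var changed = 1
    var i = 0
    while (changed > 0 && i < maxIterations) {
      val (next, n) = iterate(groups, edges)
      groups = next
      changed = n
      i += 1
    }
    groups
  }
}
```

The `changed` counter plays the role of the "count" file: once an iteration reports zero changes, every vertex in a component carries the same group id and the loop stops.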
OTHER SWEET THINGS
Typed Pipes
Elegant and fast Matrix operations
Simple migration to Spark/Kafka
More sources: e.g. retrieve data from Hive’s HCatalog, JDBC, …
Simple Integration Testing
USEFUL RESOURCES
 http://www.adopsinsider.com/ad-serving/how-does-ad-serving-work/
 http://www.adopsinsider.com/ad-serving/diagramming-the-ssp-dsp-and-rtb-redirect-path/
 https://github.com/twitter/scalding
 https://github.com/ParallelAI/SpyGlass
 https://github.com/btrofimov/SpyGlass
 https://github.com/branky/cascading.hive
THANK YOU!

Semantic Shed - Squashing and Squeezing.pptx
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 

Scalding Big (Ad)ta

  • 1. SCALDING BIG (AD)TA Boris Trofimoff @ Sigma Software @b0ris_1
  • 2. REAL-TIME BIDDING or the story about the first second…
  • 4. THE FIRST SECOND ON A SCALE User on Site Request to bid Accept bid Deliver Ad 80 ms to predict viewability and make decision
  • 5. REAL-TIME BIDDING IN DETAILS STORY ACTORS: User Publisher (foxnews.com) Ad Server (Google’s Doubleclick) SSP (Ad Exchange) DSP (decides what ad to show) Advertiser (Nike) COLLECTIVE
  • 6. THE FIRST SECOND IN DETAILS
      Steps:
      - Publisher receives request
      - Publisher sends response back
      - Content delivered to user
      - Site sends request to Ad Server to show ad
      - Ad Server sends signal to Ad Exchange
      - SSP (Ad Exchange) receives ad request and opens RTB auction
      - Every bidder/DSP receives info about the user: ssp_cookie_id, geo data, site url
      - All bidders should send their decision (participate? and price) back
      - SSP chooses the winning DSP and sends this info back to the Ad Server
      - Ad Server shows a page that redirects to the winning DSP's server and does piggybacking
      - User's web page shows the ad banner and fires the DSP's 1x1 pixel (impression)
      Timeline marks: 20 ms, 100, 150, 170, 200, 210, 280, 300, 350, 400, 1 sec
      Annotations:
      - ~70% of users already have this cookie thanks to retargeting
      - >>1 independent companies take part in this auction
      - 80 ms
      - Piggybacking: async JS redirect to all DSP servers with ssp_cookie as a query param
      - A chance for each DSP to link ssp_cookie_id with its own dsp_cookie (retargeting again)
  • 7. REAL TIME / OFFLINE (architecture diagram; flow steps are numbered 0 to 9 on the slide)
      - Offline stack: Hadoop's HDFS, MapReduce, Scalding, HBase, Oozie, Hive
      - Diagram labels: Bidder Farm; Auction requests; Tag Farm; Impressions, Clicks, Post-click Activities; Hourly Logs; Updating user profiles; Updates cached user interests; Data export; Warehouse; Data Scientists; Partners; SSP Ad Exchange; Householder data; 3rd-party data
      - Brand new feed about user interests
      - Return info about new user interests with special markers (segments) that indicate a new fact about the user, e.g. the user is a man who has an iPhone, lives in NYC and has a dog. Major format: <cookie_id - segment_id>
      - Update user profiles with new segments
      - HBase keeps user profiles
  • 8. QUICK FACTS ABOUT TECH STACK (REALTIME TEAM / OFFLINE TEAM)
      LANGUAGES
      - Scala, Akka
      - Java
      WEB SERVICES
      - RESTful interfaces
      - Pure Servlets
      - Spray
      IN-MEMORY CACHES
      - Aerospike
      - Redis
      HADOOP
      - HBase to keep the profile database
      - HDFS to store everything
      - Oozie to build workflows + custom coordinator implementation
      - Scalding
      CLOUDERA
      - CDH 5.4.1
      EVENT-DRIVEN NEW GEN APPROACH
      - Spark + Kafka
  • 9. DATA SCIENCE? or why we need all this science
  • 10. WHO ARE DATA SCIENTISTS?
  • 11. WHO ARE DATA SCIENTISTS?
  • 12. WHO ARE DATA SCIENTISTS?
  • 13. WHO ARE DATA SCIENTISTS Data Scientist Data Visualization Computers Models Algorithms Insights Predictions
  • 14. AUDIENCE FISHING
      THE PROBLEM: Data Sparsity
      DEEP AUDIENCE TARGETING
      - Only 1-4 providers know about 40% of users. The remaining 40% is terra incognita
      LIFE CASE
      - A customer (Nike) would like to show an ad to all men who live in NYC and have an iPhone and pets
      OTHER ACTIVITIES
      - Profile Bridging
      - Audience Modelling
      - Audience Optimization
  • 15. DATA SOURCES: DATA USED FOR MODELING
      PROPRIETARY DATA
      - VISITATION HISTORY: site / page / context, time
      - IP DERIVED DATA: location, connection type
      - PERFORMANCE HISTORY: actions, engagements
      - USER AGENT DATA: devices, OS & browser
      3rd PARTY DATA
      - DEMOGRAPHICS: age, gender, household composition
      - POINT OF SALE: purchase history
      - SEARCH: in-market for a product
      Find the meaningful signal in noisy internet data through high quality models
  • 16. QUICK FACTS ABOUT TECH STACK
      YARN-BASED
      - Impala & Hive: scalable local warehouse
      - Spark: streaming via Kafka
      CLASSIC
      - IBM Netezza: local warehouse
      PYTHON, R LANGUAGE
      - Predictions
      - Modelling
      GGPLOT2
      - Visualizations
      - Insights
  • 18. BIG DATA AT SCALE / MODELING AT SCALE
      - USERS: >1B active user profiles
      - MODELS: 1000s of models built weekly
      - PREDICTIONS: 100s of billions of predictions daily
      - VOLUME: petabytes of data used
      - VARIETY: profiles, formats, screens
      - VELOCITY: 100k+ requests per second, 20 billion events per day
      - VERACITY: robust measurements
  • 19. INFRASTRUCTURE
      - Expensive infrastructure (cluster size >>100s of machines)
      - We use our own private cloud
      - We use the Cloudera platform (CDH 5.4.1)
      - Efficient software and hardware deployment strategy
      LESSONS LEARNED
      - Do not delegate cluster maintenance to developers; rely on DevOps engineers and Cloudera
      - Everything should be monitored and measured in real time
      - Think like Google. Our data takes up a lot of space. We track and log everything
  • 20. SHIFTING PARADIGM FROM BATCH-PROCESSING TO EVENT-DRIVEN
      PAST
      - Long feedback, in 1h-1d
      - Belated user identification
      FUTURE
      - Event driven
      - Reactive Streams
      - Quick feedback
      Data science zone + real-time zone (bidders, clickers, etc.)
  • 22. SCALDING IN A NUTSHELL
      CONCISE DSL OVER SCALA
      - Developed on top of the Java-based Cascading framework
      CAN BE CONSIDERED AS A FLOW PIPE (hdfs to hdfs)
      CONFIGURABLE SOURCE AND SINK
      - Multiple sources and sinks
      DATA TRANSFORM OPERATIONS:
      - map / flatMap
      - pivot / unpivot
      - project
      - groupBy / reduce / foldLeft
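  • 23. JUST ONE EXAMPLE (JAVA WAY)
      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

      public class WordCount {
        // Emits (word, 1) for every token of every input line.
        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              context.write(word, one);
            }
          }
        }

        // Sums the counts collected for each word.
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            context.write(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "wordcount");
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          job.setMapperClass(Map.class);
          job.setReducerClass(Reduce.class);
          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          job.waitForCompletion(true);
        }
      }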
  • 24. JUST ONE EXAMPLE (SCALDING WAY)
      import com.twitter.scalding._

      class WordCountJob(args : Args) extends Job(args) {
        TextLine( args("input") )                                       // source
          .flatMap('line -> 'word) { line : String => tokenize(line) }  // transform operations
          .groupBy('word) { _.size }
          .write( Tsv( args("output") ) )                               // sink

        // Split a piece of text into individual words.
        def tokenize(text : String) : Array[String] = {
          // Lowercase the text and split it on whitespace.
          text.toLowerCase.split("\\s+")
        }
      }
  • 25. USE CASE 1 - SPLIT
      MOTIVATION
      - Reuse calculated streams
      - Performance
      val common = Tsv("./file").map(...)
      val branch1 = common.map(..).write(Tsv("output1"))
      val branch2 = common.groupBy(..).write(Tsv("output2"))
  • 26. USE CASE 3 - EXOTIC SOURCES / SINKS (HBASE)
      HBaseSource, https://github.com/ParallelAI/SpyGlass
      - SCAN_ALL
      - GET_LIST
      - SCAN_RANGE
      HBaseRawSource, https://github.com/btrofimov/SpyGlass
      - Advanced filtering via base64Scan
      - Support of CDH 5.4.1
      - Feature to remove particular columns from rows
      val scan = new Scan()
      scan.setCaching(caching)
      val activity_filters = new FilterList(MUST_PASS_ONE, {
        val scvf = new SingleColumnValueFilter(toBytes("family"), toBytes("column"), GREATER_OR_EQUAL, toBytes(value))
        scvf.setFilterIfMissing(true)
        scvf.setLatestVersionOnly(true)
        val scvf2 = ...
        List(scvf, scvf2)
      })
      scan.setFilter(activity_filters)
      new HBaseRawSource(tableName, quorum, families, base64Scan = convertScanToBase64(scan)).read. ...
      val hbs3 = new HBaseSource(tableName, quorum, 'key, List("data"), List('data), sourceMode = SourceMode.SCAN_ALL).read
  • 27. USE CASE 4 - JOIN
      MOTIVATION
      - Joining two streams by key
      - Different performance strategies: joinWithLarger, joinWithSmaller, joinWithTiny
      - Inner, Left and Right join strategies
      val pipe1 = Tsv("file1").read
      val pipe2 = Tsv("file2").read // small file
      val pipe3 = Tsv("file3").read // huge file
      val joinedPipe = pipe1.joinWithTiny('id1 -> 'id2, pipe2)
      val joinedPipe2 = pipe1.joinWithLarger('id1 -> 'id2, pipe3)
  • 28. USE CASE 5 – DISTRIBUTED CACHING AND COUNTERS
      // somewhere outside the Job definition
      val fl = DistributedCacheFile("/user/boris/zooKeeper.json")
      // the next value can be passed to any Scalding job, for instance via the Args object
      val fileName = fl.path
      ...
      class SomeJob(args : Args) extends Job(args) {
        // once we receive fl.path we can read it like an ordinary file
        val fileName = args("fileName")
        lazy val data = readJSONFromFile(fileName)
        ...
        Tsv(args("input")).read.map('line -> 'word) { line => ... /* using data json object */ ... }
      }
      // counter example
      Stat("jdbc.call.counter", "myapp").incBy(1)
  • 29. USE CASE 5 PROFILE BRIDGING
  • 30. MULTISCREEN AND PROFILE BRIDGING
      PROBLEM: Often neighboring profiles are the same user
      GOAL: Build a complete person profile by bridging information from different sources:
      - Mobile
      - Desktop
      - TV
      - Social
      Example: WiFi location X at 10:00 am; WiFi location Y at 8:00 pm
  • 31. ONE EXAMPLE (diagram)
      - Bridging via IP address
      - Bridging two ssp_cookies via private cookie
      - Node labels: ssp_cookie_Id1, dsp_cookie_id1, IP, IP, imp, ssp_cookie_Id1, dsp_cookie_id2, ssp_cookie_id1
  • 32. PROFILE BRIDGING THROUGH MATH LENSES
      General task definition:
      - Build a graph
      - Vertices: user's interests
      - Edges: bridging rules [cookies, IP, …]
      - Task: identify connected components
      - We do bridging on a daily basis
  • 33. SCALDING CONNECTED COMPONENTS
      /**
       * One iteration of the connected-components search.
       * The job has to be run iteratively from outside the job code, up to N (~20) times,
       * checking the number in the "count" file after each run.
       * If it is zero, no further iterations are needed.
       */
      class ConnectedComponentsOneIterationJob(args : Args) extends Job(args) {
        val vertexes = Tsv( args("vertexes"), ('id, 'gid) ).read // by default gid is equal to id
        val edges = Tsv( args("edges"), ('id_a, 'id_b) ).read

        // attach the current group id of both endpoints to every edge
        val groups = vertexes.joinWithSmaller('id -> 'id_b,
            vertexes.joinWithSmaller('id -> 'id_a, edges).discard('id).rename('gid -> 'gid_a))
          .discard('id)
          .rename('gid -> 'gid_b)
          .filter('gid_a, 'gid_b) { gid : (String, String) => gid._1 != gid._2 }
          .project('gid_a, 'gid_b)
          .mapTo(('gid_a, 'gid_b) -> ('gid_a, 'gid_b)) { gid : (String, String) => max(gid._1, gid._2) -> min(gid._1, gid._2) }

        // if count = 0 then we can stop running further iterations
        groups.groupAll { _.size }.write(Tsv("count"))

        val new_groups = groups.groupBy('gid_a) { _.min('gid_b) }.rename(('gid_a, 'gid_b) -> ('source, 'target))

        // join on the current group id so that every vertex of a group is relabelled
        val new_vertexes = vertexes.joinWithSmaller('gid -> 'source, new_groups, joiner = new LeftJoin)
          .mapTo(('id, 'gid, 'source, 'target) -> ('id, 'gid)) { param : (String, String, String, String) =>
            val (id, gid, source, target) = param
            if (target != null) (id, min(gid, target)) else (id, gid)
          }

        new_vertexes.write( Tsv( args("new_vertexes") ) )
      }
  • 34. OTHER SWEET THINGS
      - Typed Pipes
      - Elegant and fast Matrix operations
      - Simple migration to Spark/Kafka
      - More sources: e.g. retrieve data from Hive's HCatalog, JDBC, …
      - Simple integration testing
  • 35. USEFUL RESOURCES
      - http://www.adopsinsider.com/ad-serving/how-does-ad-serving-work/
      - http://www.adopsinsider.com/ad-serving/diagramming-the-ssp-dsp-and-rtb-redirect-path/
      - https://github.com/twitter/scalding
      - https://github.com/ParallelAI/SpyGlass
      - https://github.com/btrofimov/SpyGlass
      - https://github.com/branky/cascading.hive
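
The join strategies named on slide 27 differ mainly in which side is expected to be small. As a rough illustration in plain Scala collections (not the Scalding API), a joinWithTiny-style join can be pictured as loading the small side into an in-memory lookup table and streaming the large side past it. The names `HashJoinSketch`, `innerJoinWithTiny` and `leftJoinWithTiny` are hypothetical and exist only for this sketch.

```scala
object HashJoinSketch {
  // Index the small side by key; it is assumed to fit in memory.
  private def index[K, B](tiny: Seq[(K, B)]): Map[K, Seq[B]] =
    tiny.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }

  // Inner join: rows of `large` without a match in `tiny` are dropped.
  def innerJoinWithTiny[K, A, B](large: Seq[(K, A)], tiny: Seq[(K, B)]): Seq[(K, A, B)] = {
    val lookup = index(tiny)
    large.flatMap { case (k, a) =>
      lookup.getOrElse(k, Seq.empty[B]).map(b => (k, a, b))
    }
  }

  // Left join: every row of `large` is kept; a missing match becomes None.
  def leftJoinWithTiny[K, A, B](large: Seq[(K, A)], tiny: Seq[(K, B)]): Seq[(K, A, Option[B])] = {
    val lookup = index(tiny)
    large.flatMap { case (k, a) =>
      lookup.get(k) match {
        case Some(bs) => bs.map(b => (k, a, Option(b)))
        case None     => Seq((k, a, Option.empty[B]))
      }
    }
  }
}
```

Only the small side is indexed, so the large side is traversed once and never grouped or sorted, which is the reason for picking the tiny-side strategy when one input is much smaller than the other.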

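Slide 33 says the single-iteration job has to be driven from outside until the "count" file reaches zero. A minimal sketch of that driver loop on in-memory collections, assuming string ids and plain Scala instead of Scalding pipes, could look as follows. `ConnectedComponentsSketch`, `iterate` and `run` are hypothetical names, and relabelling each vertex by its current group id is an assumption of this sketch.

```scala
object ConnectedComponentsSketch {
  type Id = String

  // One iteration: returns the updated (id -> gid) labels and the number of edges
  // whose endpoints still carry different group ids (the "count" of the slide).
  def iterate(gids: Map[Id, Id], edges: Seq[(Id, Id)]): (Map[Id, Id], Int) = {
    // (larger gid, smaller gid) for every edge that still crosses two groups.
    val crossing: Seq[(Id, Id)] = edges.flatMap { case (a, b) =>
      val ga = gids(a)
      val gb = gids(b)
      if (ga == gb) None
      else if (ga > gb) Some((ga, gb))
      else Some((gb, ga))
    }
    // For each larger gid keep the smallest gid it touches.
    val target: Map[Id, Id] =
      crossing.groupBy(_._1).map { case (source, pairs) => source -> pairs.map(_._2).min }
    // Relabel every vertex whose current gid has a smaller target.
    val updated = gids.map { case (id, gid) => id -> target.getOrElse(gid, gid) }
    (updated, crossing.size)
  }

  // Driver: start with gid == id and iterate until no edge crosses two groups
  // or the iteration cap is reached.
  def run(vertexIds: Seq[Id], edges: Seq[(Id, Id)], maxIterations: Int = 20): Map[Id, Id] = {
    var gids: Map[Id, Id] = vertexIds.map(id => id -> id).toMap
    var remaining = 1
    var i = 0
    while (remaining > 0 && i < maxIterations) {
      val (next, count) = iterate(gids, edges)
      gids = next
      remaining = count
      i += 1
    }
    gids
  }
}
```

With vertices a to e and edges a-b, b-c and d-e, `run` labels a, b and c with "a" and d and e with "d". Long chains need more iterations than short ones, so the cap of 20 taken from the slide comment may have to be raised for sparse, elongated graphs.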
Editor's Notes

  1. Scalding is good in every respect except one: its documentation. I am not going to walk you through a tutorial here, but I will show a couple of dark corners that are not at all obvious from the documentation.