Real-Time Fraud Detection with Storm and Kafka

Slides presented at the NYC Hadoop User Group Meetup in Nov 2016

Published in: Technology

  1. WELCOME: Alexey Kharlamov, VP Technology
  2. WE MEASURE AND ENSURE QUALITY
     • Ad Fraud: We eliminate fraudulent impressions by adbots and make sure ads don’t show up on fraudulent web sites
     • Brand Safety: We make sure ads don’t show up in places that brands don’t want them
     • Viewability: We measure whether a person actually viewed an ad
  3. INTEGRAL ENGINEERING BY THE NUMBERS
     • Ad impressions processed: 5+ billion/day
     • HTTP requests: 50+ billion/day
     • Data centers: 10+ (AWS and on-premises)
     • Data stored in clusters: 6+ petabytes
     • New data collected daily: 20+ terabytes
     • Hadoop cluster processing cores: ~20,000
  4. AD FRAUD (H20 World, 11/10/2015)
     NEARLY ALL AD FRAUD IS CAUSED BY BOT ACTIVITY
     • Ad Stacking: Placing multiple ads on top of one another in a single ad placement, with only the top ad in view
     • Illegal Bots: Compromised computers whose breached security defenses leave them under a third party's control
     • Pixel Stuffing: Stuffing an entire ad-supported site into a 1x1 pixel
  5. FRAUD DETECTION (diagram labels: facebook, cnn, ebay, nothingtoseehere.com, thisisnotabotnet.com)
  6. FRAUD DETECTION (diagram labels: facebook, cnn, ebay, nothingtoseehere.com, thisisnotabotnet.com)
  7. REQUIREMENTS
     • Quickly identify freshly activated bots
     • High accuracy of detection algorithms
     • Avoid transfer of personal information across borders
     • Withstand a single data center failure
  8. BLOCKING / MONITORING: 5+ billion events per day
  9. EVENT SESSIONIZATION (timeline diagram; labels: Time, Transaction 1, Transaction 2, Join Window, Impression 1, Impression 2, Impression 3, Init, DT, Unload, Timeout, Emit, Drop)
  10. DATA FLOW (diagram labels: InputTopic, Session Builder, QLogTopic, Fraud Detection, Hadoop, Model Training, Assets, Firewall)
  11. INTRA-DC DATA PROCESSING
      • Local log aggregation and processing
      • Transfer over long links causes all sorts of synchronization problems
      • Intra-DC links are reliable, the Internet is NOT. We can keep data locality and log time coherence
      • A single firewall server failure is not a “stop-the-world” event. The data is present on the Kafka cluster.
      • A completely autonomous system
      • Higher availability due to DC redundancy
  12. DATA CENTER ARCHITECTURE (diagram labels: Server 1 ... Server N, Front-End Server, STORM)
  13. LOG SOURCING: TAILER AGENT
      • Non-invasive event sourcing
      • Decoupled data publication and event processing
      • Data fan-out
      • Hard latency requirements
        - <10ms response
      • Periodic checkpoints to recover after failure
  14. RECOVERY STRATEGY
      • Read logs in micro-batches and maintain state in memory
      • Reliable processing
        - On successful operation: write a checkpoint
        - On failure: return to the previous checkpoint
        - On catastrophic failure: rewind the data feed to a point before the problem started
  15. LOGICAL TIME
      • Wall-clock does not work
        - Load spikes
        - Recovery rewinds data feed to previous time
      • Logical clock
        - Maximum timestamp seen by Bolt
        - New messages with smaller timestamp are late
        - No clock synchronization
        - All bolts are in “weak synchrony”
  16. DEBUGGING AND MONITORING
      • Metrics recording and visualization are an essential component of the development cycle
        - Ease correlation of failure symptoms
        - Accelerate the build/deploy/debug cycle
        - Provide a trace for production issues
      • Monitor business metrics
        - This is the only thing you care about
        - Technical issues may or may not have consequences
      • Do it a lot
        - 150K metrics/sec
  17. GLOBAL CONFIGURATION (diagram labels: EAST COAST, EUROPE, DC-X; Stream, Mirror (x3); Kafka Backbone; CENTRAL; Spark; Hadoop)
  18. LESSONS LEARNED
      • Use staged roll-out
        - Start from minimal infrastructure for log delivery
      • Do not try to build a fortress
        - It is much easier to build a system that accepts limited data loss
      • Minimize persistent state
        - Slows the system down
        - Expensive to maintain
      • Hardware matters
  19. THANK YOU! alexey@integralads.com
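
The EVENT SESSIONIZATION slide (slide 9) only survives here as diagram labels, so the exact rules are not recoverable. The following is a minimal Python sketch of one plausible reading, under stated assumptions: events of kind Init, DT and Unload are grouped by impression, a session is emitted on Unload or on timeout if it contains an Init, and a session that times out without an Init is dropped. The names `Session`, `SessionBuilder`, `on_event` and `expire` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    impression_id: str
    last_seen: int                      # timestamp of the latest event seen
    has_init: bool = False              # True once an Init event arrived
    events: list = field(default_factory=list)


class SessionBuilder:
    """Groups events per impression; closes sessions on Unload or timeout.

    Assumption (not from the slides): sessions without an Init are dropped.
    """

    def __init__(self, timeout: int):
        self.timeout = timeout
        self.open = {}                  # impression_id -> Session

    def on_event(self, impression_id, kind, ts):
        """Add one event and return the sessions emitted as a result."""
        emitted = self.expire(ts)
        session = self.open.get(impression_id)
        if session is None:
            session = Session(impression_id, ts)
            self.open[impression_id] = session
        session.events.append(kind)
        session.last_seen = ts
        if kind == "Init":
            session.has_init = True
        if kind == "Unload":
            del self.open[impression_id]
            if session.has_init:
                emitted.append(session)
        return emitted

    def expire(self, now):
        """Close sessions idle for at least `timeout`; emit only complete ones."""
        emitted = []
        for key in list(self.open):
            session = self.open[key]
            if now - session.last_seen >= self.timeout:
                del self.open[key]
                if session.has_init:
                    emitted.append(session)
                # Sessions that never saw an Init are silently dropped.
        return emitted
```

In a real topology the `now` argument would more likely come from a logical clock than from wall-clock time, as slide 15 suggests.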

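
The RECOVERY STRATEGY slide (slide 14) describes micro-batch processing with in-memory state and checkpoints. A minimal sketch of that loop might look like the following. It is an illustration only: the function name `process_with_checkpoints`, the `apply_batch` callback and the `max_retries` limit are assumptions, and the checkpoint is kept in memory here, whereas a real system would persist it.

```python
import copy


def process_with_checkpoints(records, batch_size, apply_batch, state, max_retries=3):
    """Process records in micro-batches, checkpointing after each success.

    apply_batch(state, batch) is a hypothetical callback that mutates state.
    On failure the state and offset return to the previous checkpoint.
    """
    checkpoint_offset, checkpoint_state = 0, copy.deepcopy(state)
    offset, failures = 0, 0
    while offset < len(records):
        batch = records[offset:offset + batch_size]
        try:
            apply_batch(state, batch)
        except Exception:
            failures += 1
            if failures > max_retries:
                # Treated as catastrophic: the caller has to rewind the feed.
                raise
            # Discard partial changes and go back to the last checkpoint.
            offset, state = checkpoint_offset, copy.deepcopy(checkpoint_state)
            continue
        failures = 0
        offset += len(batch)
        # On success: write a checkpoint (offset plus a snapshot of the state).
        checkpoint_offset, checkpoint_state = offset, copy.deepcopy(state)
    return state, offset
```

Snapshotting the state with `deepcopy` keeps the sketch short. It also hints at why slide 18 advises minimizing persistent state: the larger the state, the more expensive each checkpoint becomes.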
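
The LOGICAL TIME slide (slide 15) defines the logical clock as the maximum timestamp seen by a bolt and treats messages with a smaller timestamp as late. A minimal sketch of that rule, with the hypothetical names `LogicalClock` and `observe`, could be:

```python
class LogicalClock:
    """Tracks the maximum event timestamp seen so far by one bolt."""

    def __init__(self):
        self.now = None                 # no events observed yet

    def observe(self, ts):
        """Return True if the message is on time, False if it is late.

        On-time messages advance the clock; late ones leave it unchanged.
        """
        if self.now is None or ts >= self.now:
            self.now = ts
            return True
        return False
```

Because each bolt derives its clock only from the data it sees, no clock synchronization between machines is needed, and replaying an older part of the feed during recovery moves logical time consistently with the data.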