Storm at spider.io - London Storm Meetup 2013-06-18

Slides from my talk at the London Storm Meetup on 2013-06-18. Charting our journey from being a Storm early adopter, to our freeze of Storm releases and switch to batch processing only, to us coming full circle and implementing new fraudulent traffic algorithms with Trident.

You might like our blog: http://www.spider.io/blog

    Presentation Transcript

    • Storm at spider.io: Cleaning up fraudulent traffic on the internet (http://xkcd.com/570/)
    • Ashley Brown, Chief Architect
      ● Using Storm since September 2011
      ● Based in the West End
      ● Founded in early 2010
      ● Focused on (fighting) advertising fraud since 2011
    • What I'll cover
      ● This is a case study: how Storm fits into our architecture and business
      ● NO CODE
      ● NO WORKER-QUEUE DIAGRAMS¹
      ¹ I assume you've seen them all before
    • Our Rules
      ● Why did we pick Storm in the first place?
    • Use the right tool for the job
      ● If a piece of software isn't helping the business, chuck it out
      ● Only jump on a bandwagon if it's going in your direction
      (Douglas de Jager, CEO)
    • Don't write it if you can download it
      ● If there's an open source project that does what's needed, use that
      ● Sometimes this means throwing out home-grown code when a new project is released (e.g. Storm)
      (Ben Hodgson, Software Engineer)
    • Our goals
      ● Why do we turn up in the morning?
    • Find fraudulent website traffic
      ● Collect billions of server and client-side signals per month
      ● Sift, sort, analyse and condense
      ● Identify automated, fraudulent and suspicious traffic
      (Joe Tallett, Software Engineer)
    • Protect against it
      ● Give our customers the information they need to protect their businesses from bad traffic
      ● This means: give them clean data to make business decisions
      (Simon Overell, Chief Scientist)
    • Expose it
      ● Work with partners to reveal the scale of the problem
      ● Drive people to solutions which can eliminate fraud
    • Cut it off
      ● Eliminate bad traffic sources by cutting off their revenue
      ● Build reactive solutions that stop fraud being profitable
      (Vegard Johnsen, CCO)
    • Storm & our timeline
    • Storm solved a scaling problem (timeline diagram; key moments 1, 2 and 3 are marked on it)
      ● Volumes: impressions/month and signals/month grow from 60 million in pre-history (Summer 2010), through 240 million, 1.5 billion and 6 billion, to "billions and billions and billions" at the present day
      ● Pre-history: RabbitMQ queues, Python workers, Hypertable as API datastore, Hadoop batch analysis
      ● After Storm's release: RabbitMQ queues, Python workers, VoltDB as API datastore + real-time joins, no batch analysis (this setup was pretty reliable!), custom cluster management + worker scaling
      ● Later: RabbitMQ queues, Storm topologies for in-memory joins, HBase as API datastore, Cascading for post-failure restores (too much data to do without!)
      ● Present day: logging via CDN, Cascading for data analysis, Hive for aggregations, high-level aggregates in MySQL
    • Key moments (1)
      ● Enter the advertising anti-fraud market
        ○ 30x increase in impression volume
        ○ Bursty traffic
        ○ Existing queue + worker system not robust enough
        ○ I can do without the 2am wake up calls
      ● Enter Storm.
    • Architecture diagram¹: server- and client-side signals → RabbitMQ cluster (scalable) → Python worker cluster² (scalable) → VoltDB (scalable at launch-time only)
      ¹ I lied about the queue/worker diagrams.
      ² We have a bar in the office; our workers are happy.
    • What's wrong with that?
      ● Queue/worker scaling system relied on our code; it worked, but:
        ○ Only 1, maybe 2 pairs of eyes had ever looked at it
        ○ Code becomes a maintenance liability as soon as it is written
        ○ Writing infrastructure software not one of our goals
      ● Maintaining an in-memory database for the volume we wanted was not cost-effective (mainly due to AWS memory costs)
      ● Couldn't scale dynamically: full DB cluster restart required
    • The Solution
      ● Migrate our internal event-stream-based workers to Storm
        ○ Whole community to check and maintain code
      ● Move to HBase for long-term API datastore
        ○ Keep data for longer - better trend decisions
      ● VoltDB joins → in-memory joins & HBase
        ○ Small in-memory join window, then flushed
        ○ Full 15-minute join achieved by reading from HBase
        ○ Trident solves this now - wasn't around then (see the sketch below)
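      As a rough illustration of that last point, here is a minimal Trident sketch. It is hypothetical, not spider.io's code: the "impressionId" field, the spout and the use of Count as the aggregate are assumptions, and MemoryMapState stands in for a durable (e.g. HBase-backed) state factory. The idea is that grouping by impression and folding signals into persistent state replaces the hand-rolled in-memory join window.

          import backtype.storm.topology.IRichSpout;
          import backtype.storm.tuple.Fields;
          import storm.trident.TridentState;
          import storm.trident.TridentTopology;
          import storm.trident.operation.builtin.Count;
          import storm.trident.testing.MemoryMapState;

          public class ImpressionJoinSketch {
              // Group incoming signals per impression and keep a persistent aggregate,
              // instead of maintaining a hand-rolled in-memory join window.
              public static TridentState build(IRichSpout signalSpout) {
                  TridentTopology topology = new TridentTopology();
                  return topology
                      .newStream("signals", signalSpout)                  // server- and client-side signals
                      .groupBy(new Fields("impressionId"))                // assumed field name
                      .persistentAggregate(new MemoryMapState.Factory(),  // swap for a durable state in production
                                           new Count(),                   // placeholder for the real join/aggregation
                                           new Fields("signalCount"));
              }
          }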
    • Architecture diagram: server- and client-side signals → RabbitMQ cluster (scalable) → Storm cluster (scalable) → HBase (scalable); Cascading on Amazon Elastic MapReduce restores from logs (EMERGENCY RESTORE path)
    • How long (Storm migration)?
      ● 17 Sept 2011: Released
      ● 21 Sept 2011: Test cluster processing
      ● 29 Sept 2011: Substantial implementation of core workers
      ● 30 Sept 2011: Python workers running under Storm control
      ● Total engineers: 1
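      For context on how existing Python workers can run under Storm control: Storm's multilang ShellBolt launches a script as a subprocess and exchanges tuples with it over stdin/stdout. The sketch below is hypothetical (classify_worker.py and the output field are invented, not spider.io's actual workers).

          import backtype.storm.task.ShellBolt;
          import backtype.storm.topology.IRichBolt;
          import backtype.storm.topology.OutputFieldsDeclarer;
          import backtype.storm.tuple.Fields;
          import java.util.Map;

          public class PythonWorkerBolt extends ShellBolt implements IRichBolt {
              public PythonWorkerBolt() {
                  // Storm spawns the script and speaks the multilang (JSON) protocol with it;
                  // the script ships in the topology jar's multilang resources directory.
                  super("python", "classify_worker.py");   // hypothetical script name
              }

              @Override
              public void declareOutputFields(OutputFieldsDeclarer declarer) {
                  declarer.declare(new Fields("classification"));   // assumed output field
              }

              @Override
              public Map<String, Object> getComponentConfiguration() {
                  return null;
              }
          }

      On the Python side, the script would subclass BasicBolt from the storm.py multilang helper that ships with Storm and emit its results with storm.emit, which keeps most of the original worker code intact.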
    • The Results (redacted)
      ● Classifications available within 15 minutes
      ● Dashboard provides overview of legitimate vs other traffic
      ● Better data on which to make business decisions
    • Lessons
      ● Storm is easy to install & run
      ● First iteration: use Storm for control and scaling of existing queue+worker systems
      ● Second iteration: use Storm to provide redundancy via acking/replays
      ● Third iteration: remove intermediate queues to realise performance benefits
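      The second iteration boils down to anchoring each emitted tuple to its input and acking or failing explicitly, so the spout can replay anything that is lost. A minimal, hypothetical sketch (ReliableWorkerBolt and enrich() are stand-ins for a real worker, not spider.io's code):

          import backtype.storm.task.OutputCollector;
          import backtype.storm.task.TopologyContext;
          import backtype.storm.topology.OutputFieldsDeclarer;
          import backtype.storm.topology.base.BaseRichBolt;
          import backtype.storm.tuple.Fields;
          import backtype.storm.tuple.Tuple;
          import backtype.storm.tuple.Values;
          import java.util.Map;

          public class ReliableWorkerBolt extends BaseRichBolt {
              private OutputCollector collector;

              @Override
              public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
                  this.collector = collector;
              }

              @Override
              public void execute(Tuple input) {
                  try {
                      String enriched = enrich(input.getString(0));   // stand-in for the worker's real logic
                      collector.emit(input, new Values(enriched));    // anchor the output to its input tuple
                      collector.ack(input);                           // success: the tuple tree can complete
                  } catch (Exception e) {
                      collector.fail(input);                          // failure: ask the spout to replay
                  }
              }

              @Override
              public void declareOutputFields(OutputFieldsDeclarer declarer) {
                  declarer.declare(new Fields("enriched"));
              }

              private String enrich(String raw) { return raw; }       // placeholder
          }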
    • A Quick Aside on DRPC
      ● Our initial API implementation in HBase was slow
      ● Large number of partial aggregates to consume, all handled by a single process
      ● Storm's DRPC provided a 10x speedup - machines across the cluster pulled partials from HBase, generated mega-partials; final step as a reducer => final totals
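      Structurally, that maps onto Storm's LinearDRPCTopologyBuilder, which chains bolts behind a named DRPC function. The sketch below is only an outline of that shape, not spider.io's implementation: the three bolt classes are hypothetical placeholders, and each would have to pass the DRPC request id through, with the last one emitting the id alongside the final totals.

          import backtype.storm.Config;
          import backtype.storm.StormSubmitter;
          import backtype.storm.drpc.LinearDRPCTopologyBuilder;

          public class TotalsDrpcTopology {
              public static void main(String[] args) throws Exception {
                  LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("totals");
                  builder.addBolt(new PartialFetchBolt(), 16);   // hypothetical: each task pulls partial aggregates from HBase
                  builder.addBolt(new CombineBolt(), 4);         // hypothetical: merge partials into "mega-partials"
                  builder.addBolt(new TotalReduceBolt(), 1);     // hypothetical: final reduce => totals for the request
                  StormSubmitter.submitTopology("totals-drpc", new Config(), builder.createRemoteTopology());
              }
          }

      A caller would then fetch results with something like new DRPCClient("drpc-host", 3772).execute("totals", argument).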
    • Storm solved a scaling problem (the same timeline diagram as above, shown again to introduce key moment 2)
    • Key moments (2)
      ● Enabled across substantial internet ad inventory
        ○ 10x increase in impression volume
        ○ Low-latency, always-up requirements
        ○ Competitive marketplace
      ● Exit Storm.
    • What happened?
      ● Stopped upgrading at 0.6.2
        ○ Big customers unable to use real-time data at the time
        ○ An unnecessary cost
        ○ Batch options provided better resiliency and cost profile
      ● Too expensive to provide very low-latency data collection in a compatible way
      ● Legacy systems continue to run...
    • How reliable?
      ● Legacy topologies still running:
    • Architecture diagram: server- and client-side signals now arrive via the CDN; Cascading on Amazon Elastic MapReduce works from logs, with Hive on EMR for aggregate generation and a bulk export step; the LEGACY path (RabbitMQ cluster → Storm cluster → HBase, all scalable) is still in place
    • The Results
      ● Identification of a botnet cluster attracting international press
      ● Many other sources of fraud under active investigation
      ● Using Amazon EC2 spot instances for batch analysis when cheapest - not paying for always-up
    • The Results
    • Lessons
      ● Benefit of real-time processing is a business decision - batch may be more cost effective
      ● Storm is easy and reliable to use, but you need supporting infrastructure around it (e.g. queue servers)
      ● It may be the supporting infrastructure that gives you problems...
    • Storm solved a scaling problem (the same timeline diagram as above, shown again to introduce key moment 3)
    • Key moments (3)
      ● Arms race begins
        ○ Fraudsters in control of large botnets able to respond quickly
        ○ Source and signatures of fraud will change faster and faster in the future, as we close off more avenues
        ○ Growing demand for more immediate classifications than provided by batch-only
      ● Welcome back, Storm.
    • What now?
      ● Returning to Storm, paired with Mahout
      ● Crunching billions and billions of impressions using Cascading + Mahout
      ● Real-time response using Trident + Mahout
        ○ Known-bad signatures identify new botnet IPs, suspect publishers
        ○ Online learning adapts models to emerging threats
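      As a concrete shape for the Trident + Mahout point, the sketch below (hypothetical names throughout; score() is a placeholder rather than a real Mahout call) shows a Trident function scoring each impression's features and emitting a label that downstream filters or state updates can act on.

          import backtype.storm.generated.StormTopology;
          import backtype.storm.topology.IRichSpout;
          import backtype.storm.tuple.Fields;
          import backtype.storm.tuple.Values;
          import storm.trident.TridentTopology;
          import storm.trident.operation.BaseFunction;
          import storm.trident.operation.TridentCollector;
          import storm.trident.tuple.TridentTuple;

          public class ClassifySketch {
              // Wraps whatever model is loaded (a Mahout classifier, a known-bad signature list, ...).
              public static class Classify extends BaseFunction {
                  @Override
                  public void execute(TridentTuple tuple, TridentCollector collector) {
                      double[] features = (double[]) tuple.getValueByField("features");
                      collector.emit(new Values(score(features) > 0.5 ? "suspect" : "ok"));
                  }
                  private double score(double[] features) { return 0.0; }   // placeholder for the real model
              }

              public static StormTopology build(IRichSpout impressions) {
                  TridentTopology topology = new TridentTopology();
                  topology.newStream("impressions", impressions)
                          .each(new Fields("features"), new Classify(), new Fields("label"));
                  return topology.build();
              }
          }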
    • Lessons
      ● As your business changes, your architecture must change
      ● Choose Storm if:
        ○ you have existing ad-hoc event streaming systems that could use more resiliency
        ○ your business needs a new real-time analysis component that fits an event-streaming model
        ○ you're happy to run appropriate infrastructure around it
      ● Don't choose Storm if:
        ○ you have no use for real-time data
        ○ you only want to use it because it's cool
    • More Lessons
      ● Using Cascading for Hadoop jobs and Storm for real-time is REALLY handy
        ○ Retains event-streaming paradigm
        ○ No need to completely re-think implementation when switching between them
        ○ In some circumstances can share code
        ○ We have a library which provides common analysis components for both implementations
      ● A reasonably managed Storm cluster will stay up for ages.
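      The code-sharing idea can be sketched as follows (hypothetical names; this is not spider.io's library): the analysis logic lives in one plain serializable class, which is wrapped once as a Storm bolt for the real-time path and once as a Cascading Function for the Hadoop jobs.

          import backtype.storm.topology.BasicOutputCollector;
          import backtype.storm.topology.OutputFieldsDeclarer;
          import backtype.storm.topology.base.BaseBasicBolt;
          import backtype.storm.tuple.Values;
          import cascading.flow.FlowProcess;
          import cascading.operation.BaseOperation;
          import cascading.operation.Function;
          import cascading.operation.FunctionCall;
          import cascading.tuple.Fields;
          import cascading.tuple.Tuple;

          public class SharedAnalysis {
              /** Framework-agnostic analysis component shared by both paths. */
              public static class SignalScorer implements java.io.Serializable {
                  public double score(String signal) { return signal.length(); }   // placeholder logic
              }

              /** Storm wrapper: one incoming tuple, one scored tuple out. */
              public static class ScoreBolt extends BaseBasicBolt {
                  private final SignalScorer scorer = new SignalScorer();
                  @Override
                  public void execute(backtype.storm.tuple.Tuple input, BasicOutputCollector collector) {
                      collector.emit(new Values(scorer.score(input.getStringByField("signal"))));
                  }
                  @Override
                  public void declareOutputFields(OutputFieldsDeclarer declarer) {
                      declarer.declare(new backtype.storm.tuple.Fields("score"));
                  }
              }

              /** Cascading wrapper around the same scorer, for the batch jobs. */
              public static class ScoreFunction extends BaseOperation implements Function {
                  private final SignalScorer scorer = new SignalScorer();
                  public ScoreFunction() { super(1, new Fields("score")); }
                  @Override
                  public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
                      String signal = functionCall.getArguments().getTuple().getString(0);
                      functionCall.getOutputCollector().add(new Tuple(scorer.score(signal)));
                  }
              }
          }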
    • http://xkcd.com/749/