Hadoop and Storm - AJUG talk

Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation.
Storm often coexists in Big Data architectures with Hadoop. We will talk about different approaches to this interoperability between the systems, their benefits & trade-offs, and a new open source library available for high throughput use.

Published in: Technology


  • 1. Hadoop and Storm. AJUG, 5/21/2013. ©MapR Technologies
  • 2. whoami
      • Brad Anderson
      • Solutions Architect at MapR (Atlanta)
      • ATLHUG co-chair
      • NoSQL East Conference 2009
      • “boorad” most places (twitter, github)
  • 3. Hadoop: A Paradigm Shift
      • Distributed computing platform: large clusters, commodity hardware
      • Pioneered at Google: Google File System, MapReduce and BigTable
      • Commercially available as Hadoop
  • 4. Ship the Function to the Data. [Diagram. Traditional Architecture: data on SAN/NAS, function in the RDBMS. Distributed Computing: every node pairs data with the function.]
  • 5. MapReduce Flow: Input → Map → Combine → Shuffle and sort → Reduce → Output
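
The phases named on the MapReduce Flow slide can be sketched in a single process. This is a hypothetical plain-Java word count, not Hadoop API code: the class `MapReduceFlowSketch` and its method names are invented for illustration, and each input string stands in for one input split.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of the MapReduce phases (not Hadoop code).
class MapReduceFlowSketch {

    // Map: emit a (word, 1) pair for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Combine: pre-aggregate one mapper's output locally so less data is shuffled.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> partial = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            partial.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return partial;
    }

    // Shuffle and sort: group the partial counts from all mappers by key, in key order.
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> partials) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        return grouped;
    }

    // Reduce: fold the values collected for one key into a single count.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Input -> Map -> Combine -> Shuffle and sort -> Reduce -> Output.
    static Map<String, Integer> run(List<String> inputSplits) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String split : inputSplits) {
            partials.add(combine(map(split)));
        }
        Map<String, Integer> output = new TreeMap<>();
        shuffle(partials).forEach((word, values) -> output.put(word, reduce(values)));
        return output;
    }

    public static void main(String[] args) {
        // prints {and=2, hadoop=3, storm=1}
        System.out.println(run(Arrays.asList("storm and hadoop", "hadoop and hadoop")));
    }
}
```

In a real cluster the map and combine steps would run on the nodes that hold each split, which is the "ship the function to the data" idea from the previous slide.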
  • 6. Variation: No Reduce Necessary. Example: Batch File Transformation. Input → Map → Output (MPG → M4V)
  • 7. Variation: Multiple MapReduces. Example: Fraud Detection in User Transactions. [Diagram labels: Transaction data; LDA training; LDA scoring; HBase / MapR M7 Edition; G2 score; 95 %-ile LDA anomaly; Candidate events for analyst review; MapReduce]
  • 8. Pig
  • 9. MR Equivalent to Pig Script
  • 10. Hive
  • 11. MapR Distribution for Apache Hadoop
      • Complete Hadoop distribution
      • Comprehensive management suite
      • Industry-standard interfaces
      • Enterprise-grade dependability
      • Enterprise-grade security (US Intelligence Agency)
      • Patents - IP
      • Higher performance
  • 12. Hadoop Use Cases
      • ETL/EDW Offload
      • Sensor / Telemetry Data
      • Recommendation Engine
      • Search
      • ML algorithms
      • eDiscovery
      • Fleet Management
      • Fraud Detection / Risk Management
      • Traffic Decongestion
  • 13. One Platform for Big Data…
      • Platform: 99.999% HA; Data Protection; Disaster Recovery; Scalability & Performance; Enterprise Integration; Multi-tenancy
      • Workloads: MapReduce; File-Based Applications; SQL; Database; Search; Stream Processing
      • Modes: Batch; Interactive; Realtime
      • Batch: Log file Analysis; Data Warehouse Offload; Fraud Detection; Clickstream Analytics
      • Realtime: Sensor Analysis; “Twitterscraping”; Telematics; Process Optimization
      • Interactive: Forensic Analysis; Analytic Modeling; BI User Focus
  • 14. Storm: “Hadoop for Realtime”
  • 15. Before Storm: Queues, Workers
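
The "Before Storm" pattern of queues feeding pools of workers can be sketched with the JDK's own concurrency classes. This is a minimal, hypothetical stand-in: the class `QueueWorkerSketch`, the `__STOP__` sentinel and the upper-casing "work" are all assumptions for illustration, and an in-memory `BlockingQueue` takes the place of a real message broker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical hand-rolled queue/worker stage, the kind of plumbing Storm aims to replace.
class QueueWorkerSketch {
    // Sentinel message telling a worker to exit (an assumption of this sketch).
    private static final String STOP = "__STOP__";

    // Starts `workers` threads that take messages from one queue and put results on another.
    static List<String> process(List<String> messages, int workers) {
        BlockingQueue<String> inbound = new LinkedBlockingQueue<>();
        BlockingQueue<String> outbound = new LinkedBlockingQueue<>();
        List<Thread> pool = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread worker = new Thread(() -> {
                try {
                    for (String msg = inbound.take(); !STOP.equals(msg); msg = inbound.take()) {
                        outbound.put(msg.toUpperCase()); // the "work" done per message
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.start();
            pool.add(worker);
        }
        try {
            for (String m : messages) {
                inbound.put(m);
            }
            for (int i = 0; i < workers; i++) {
                inbound.put(STOP); // one sentinel per worker
            }
            for (Thread t : pool) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining the queue", e);
        }
        // Workers finish in no particular order, so sort for a stable result.
        List<String> results = new ArrayList<>(outbound);
        Collections.sort(results);
        return results;
    }

    public static void main(String[] args) {
        // prints [A, B, C]
        System.out.println(process(Arrays.asList("c", "a", "b"), 2));
    }
}
```

Note what the sketch leaves out: nothing retries a message if a worker dies mid-task, and adding a second stage means another queue and another pool to deploy and monitor.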
  • 16. Example (simplified)
  • 17. Storm
      • Guaranteed data processing
      • Horizontal scalability
      • Fault-tolerance
      • No intermediate message brokers!
      • Higher level abstraction than message passing
      • “Just works”
  • 18. Streams: Unbounded sequence of tuples. [Diagram: a row of tuples]
  • 19. Spouts: Source of streams
  • 20. Spouts
        public interface ISpout extends Serializable {
            void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
            void close();
            void nextTuple();
            void ack(Object msgId);
            void fail(Object msgId);
        }
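
One way to read the `nextTuple` / `ack` / `fail` trio is as a replay contract: the source remembers each message until it is acknowledged and re-emits it on failure. The sketch below is plain Java and does not implement Storm's `ISpout`; the class `ReplayingSourceSketch`, the `payload` and `isDone` helpers, and the `drive` loop are hypothetical and only mirror the method names on the slide.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical message source that replays failed messages (not a Storm class).
class ReplayingSourceSketch {
    private final Deque<String> toEmit = new ArrayDeque<>();
    private final Map<Long, String> pending = new HashMap<>(); // emitted but not yet acked
    private long nextId = 0;

    ReplayingSourceSketch(List<String> messages) {
        toEmit.addAll(messages);
    }

    // Emits the next message and returns its id, or -1 if nothing is waiting.
    long nextTuple() {
        String msg = toEmit.poll();
        if (msg == null) {
            return -1;
        }
        long id = nextId++;
        pending.put(id, msg);
        return id;
    }

    String payload(long id) {
        return pending.get(id);
    }

    // Fully processed downstream: forget the message.
    void ack(long id) {
        pending.remove(id);
    }

    // Processing failed downstream: put the message back at the front for replay.
    void fail(long id) {
        String msg = pending.remove(id);
        if (msg != null) {
            toEmit.addFirst(msg);
        }
    }

    boolean isDone() {
        return toEmit.isEmpty() && pending.isEmpty();
    }

    // Drives the source with a consumer that fails the first attempt at `failOnce`.
    static List<String> drive(List<String> messages, String failOnce) {
        ReplayingSourceSketch source = new ReplayingSourceSketch(messages);
        List<String> processed = new ArrayList<>();
        boolean alreadyFailed = false;
        while (!source.isDone()) {
            long id = source.nextTuple();
            if (id < 0) {
                continue;
            }
            String msg = source.payload(id);
            if (!alreadyFailed && msg.equals(failOnce)) {
                alreadyFailed = true;
                source.fail(id);   // simulate a downstream failure
            } else {
                processed.add(msg);
                source.ack(id);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        // prints [a, b, c]
        System.out.println(drive(Arrays.asList("a", "b", "c"), "b"));
    }
}
```

Every message ends up processed even though one attempt failed, which is the behaviour the "guaranteed data processing" bullet refers to.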
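  • 21. Bolts: Processes input streams and produces new streams
  • 22. Bolts
        import java.util.Map;
        import backtype.storm.task.OutputCollector;
        import backtype.storm.task.TopologyContext;
        import backtype.storm.topology.OutputFieldsDeclarer;
        import backtype.storm.topology.base.BaseRichBolt;
        import backtype.storm.tuple.Fields;
        import backtype.storm.tuple.Tuple;
        import backtype.storm.tuple.Values;

        public class DoubleAndTripleBolt extends BaseRichBolt {
            private OutputCollector _collector;

            @Override
            public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
                _collector = collector; // kept for use in execute()
            }

            @Override
            public void execute(Tuple input) {
                int val = input.getInteger(0);
                // Anchor the output to the input tuple, then ack the input.
                _collector.emit(input, new Values(val * 2, val * 3));
                _collector.ack(input);
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("double", "triple"));
            }
        }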
  • 23. Topologies: Network of spouts and bolts
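
A topology can be pictured as a source feeding a chain of per-tuple functions. The sketch below is plain Java rather than Storm's topology API: `TopologySketch`, `runThrough` and the second "sum" stage are hypothetical, and only the doubling and tripling arithmetic is taken from the bolt slide. It runs sequentially in one thread, whereas a real topology runs its stages in parallel across a cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical in-process picture of "a network of spouts and bolts" (not Storm code).
class TopologySketch {

    // Applies one bolt-like function to every tuple; each input may yield several outputs.
    static List<List<Integer>> runThrough(List<List<Integer>> stream,
                                         Function<List<Integer>, List<List<Integer>>> bolt) {
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> tuple : stream) {
            out.addAll(bolt.apply(tuple));
        }
        return out;
    }

    static List<List<Integer>> run(List<Integer> source) {
        // Spout stand-in: wrap each value in a one-field tuple.
        List<List<Integer>> stream = new ArrayList<>();
        for (Integer value : source) {
            stream.add(Arrays.asList(value));
        }
        // First bolt: the same arithmetic as DoubleAndTripleBolt.
        Function<List<Integer>, List<List<Integer>>> doubleAndTriple =
                t -> Arrays.asList(Arrays.asList(t.get(0) * 2, t.get(0) * 3));
        // Second bolt (invented for this sketch): add the two fields together.
        Function<List<Integer>, List<List<Integer>>> sum =
                t -> Arrays.asList(Arrays.asList(t.get(0) + t.get(1)));
        return runThrough(runThrough(stream, doubleAndTriple), sum);
    }

    public static void main(String[] args) {
        // prints [[5], [10]]
        System.out.println(run(Arrays.asList(1, 2)));
    }
}
```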
  • 24. Trident: Cascading for Storm
        TridentTopology topology = new TridentTopology();
        TridentState wordCounts =
            topology.newStream("spout1", spout)
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
                .parallelismHint(6);
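
The Trident pipeline above splits sentences into words, groups by word, and keeps a running count in state. A rough plain-Java equivalent of that logic is sketched here, with no Storm dependency: `RunningWordCountSketch` is a hypothetical class, and its `HashMap` only stands in for the in-memory state, without any of Trident's batching or fault-tolerance guarantees.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for split -> groupBy -> persistentAggregate(Count).
class RunningWordCountSketch {
    // Plays the role of the state that the aggregate keeps updating across batches.
    private final Map<String, Long> counts = new HashMap<>();

    // Processes one small batch: split sentences, count per word, fold into the stored counts.
    void processBatch(List<String> sentences) {
        Map<String, Long> batchCounts = new HashMap<>();
        for (String sentence : sentences) {
            for (String word : sentence.split(" ")) {
                if (!word.isEmpty()) {
                    batchCounts.merge(word, 1L, Long::sum);
                }
            }
        }
        batchCounts.forEach((word, n) -> counts.merge(word, n, Long::sum));
    }

    long count(String word) {
        return counts.getOrDefault(word, 0L);
    }

    public static void main(String[] args) {
        RunningWordCountSketch state = new RunningWordCountSketch();
        state.processBatch(Arrays.asList("the cow jumped", "the moon"));
        state.processBatch(Arrays.asList("the end"));
        // prints 3
        System.out.println(state.count("the"));
    }
}
```

Unlike the batch word count, the totals here are available after every batch instead of only once the whole input has been read.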
  • 25. Parallel Cluster Ingest. [Diagram labels: Raw Data; Queue (Kafka); Storm realtime processes; Hadoop batch processes; Apps; Business Value]
  • 26. [Diagram labels: Raw Data; Queue (Kafka); Franz; TailSpout; Storm realtime processes; Hadoop batch processes; Apps; Business Value]
  • 27. [Diagram labels: Storm; Kafka; Twitter; Twitter API; TweetLogger; Kafka Cluster (×3); Kafka API; Storm; Web Service; NAS; Web Data; Hadoop; Flume; HDFS; Data]
  • 28. [Diagram labels: Twitter; Twitter API; Catcher; Storm; Topic Queue; Web-server; http; Web Data; MapR; TweetLogger]
  • 29. Scaling Estimates: Twitter Firehose
      • Old School: 8+ separate clusters, 20-25 nodes
          • >3 Kafka nodes
          • >2 TweetLoggers
          • 5-10 Hadoop
          • >2 Catcher nodes
          • >3 Storm
          • 3 zookeepers
          • NAS for web storage
          • >2 web servers
      • MapR: One Platform
          • 5-10 nodes total
          • Any node does any job
          • Full HA included
          • Backups included
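
The "20-25 nodes" figure on the Scaling Estimates slide can be reproduced by adding up the per-role counts. The sketch below rests on two assumptions that the slide does not state: each ">N" is read as a floor of N nodes, and the NAS is not counted as a node. `ClusterSizeSketch` is a hypothetical helper written only for this arithmetic.

```java
// Hypothetical arithmetic check of the "Old School" node estimate.
class ClusterSizeSketch {

    // Sums the per-role minimums from the slide; only the Hadoop cluster size varies (5 to 10).
    static int oldSchoolNodes(int hadoopNodes) {
        int kafka = 3;        // ">3 Kafka nodes", read as a floor of 3 (assumption)
        int tweetLoggers = 2; // ">2 TweetLoggers"
        int catchers = 2;     // ">2 Catcher nodes"
        int storm = 3;        // ">3 Storm"
        int zookeepers = 3;   // "3 zookeepers"
        int webServers = 2;   // ">2 web servers"
        return kafka + tweetLoggers + hadoopNodes + catchers + storm + zookeepers + webServers;
    }

    public static void main(String[] args) {
        // prints 20 to 25 nodes
        System.out.println(oldSchoolNodes(5) + " to " + oldSchoolNodes(10) + " nodes");
    }
}
```

Under those assumptions the fixed roles contribute 15 nodes, so the range is driven entirely by the 5-10 Hadoop nodes, against 5-10 nodes in total for the single-platform layout.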
  • 30. github
      • Watch TailSpout & Franz development: {tdunning | boorad | pfcurtis}/mapr-spout
      • And our example Twitter implementation: {tdunning | boorad | pfcurtis}/mapr-spout-test
  • 31. Demo