Hadoop and Storm - AJUG talk

Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation.
Storm often coexists in Big Data architectures with Hadoop. We will talk about different approaches to this interoperability between the systems, their benefits & trade-offs, and a new open source library available for high throughput use.


Usage Rights

© All Rights Reserved

Hadoop and Storm - AJUG talk: Presentation Transcript

  • Hadoop and Storm. AJUG, 5/21/2013. ©MapR Technologies
  • whoami
    • Brad Anderson
    • Solutions Architect at MapR (Atlanta)
    • ATLHUG co-chair
    • NoSQL East Conference 2009
    • “boorad” most places (twitter, github)
    • banderson@maprtech.com
  • Hadoop: A Paradigm Shift
    • Distributed computing platform
      – Large clusters
      – Commodity hardware
    • Pioneered at Google
      – Google File System, MapReduce and BigTable
    • Commercially available as Hadoop
  • Ship the Function to the Data
    • Traditional Architecture [diagram: data blocks on SAN/NAS, a single function, RDBMS]
    • Distributed Computing [diagram: a function paired with each data block]
  • MapReduce Flow: Input → Map → Combine → Shuffle and sort → Reduce → Output
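The phases named on the MapReduce Flow slide can be sketched in a single process. The example below is a minimal illustration in plain Java, not Hadoop API code: the class `MapReduceFlowSketch` and its method names are hypothetical, word count is an assumed example job, and each input line is treated as if it went to its own mapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration of the MapReduce phases; not Hadoop API code.
class MapReduceFlowSketch {

    // Map: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Combine: pre-aggregate one mapper's output locally to cut shuffle traffic.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> partial = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            partial.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return partial;
    }

    // Shuffle and sort: group the values from all mappers by key, in key order.
    static TreeMap<String, List<Integer>> shuffle(List<Map<String, Integer>> mapperOutputs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : mapperOutputs) {
            output.forEach((key, value) ->
                    grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(value));
        }
        return grouped;
    }

    // Reduce: collapse each key's list of partial counts into one total.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Wire the phases together: one simulated mapper per input line.
    static Map<String, Integer> run(List<String> lines) {
        List<Map<String, Integer>> mapperOutputs = new ArrayList<>();
        for (String line : lines) {
            mapperOutputs.add(combine(map(line)));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(mapperOutputs).forEach((key, values) -> result.put(key, reduce(values)));
        return result;
    }

    public static void main(String[] args) {
        // prints {be=3, not=1, or=1, quick=1, to=2}
        System.out.println(run(List.of("to be or not to be", "be quick")));
    }
}
```

In a real cluster the map and combine steps run on the nodes that hold the data, and only the combined partial counts cross the network during the shuffle.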
  • Variation: No Reduce Necessary
    • Example: Batch File Transformation
    • Input → Map → Output (MPG → M4V)
  • Variation: Multiple MapReduces
    • Example: Fraud Detection in User Transactions
    • [Diagram; labels: LDA training, Transaction data, LDA scoring, HBase / MapR M7 Edition, G2 score, Candidate events for analyst review, 95 %-ile LDA anomaly, MapReduce]
    • http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
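The fraud-detection slide mentions a "95 %-ile" anomaly cut that yields candidate events for analyst review. The slide does not say how that cut is computed, so the following is only a guess at the final filtering step: it assumes each event already has a numeric anomaly score, uses the nearest-rank percentile definition, and flags scores strictly above the threshold. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical final stage: keep only events whose score exceeds a percentile cut.
class PercentileFilterSketch {

    // Nearest-rank percentile: the smallest score with at least `percent`%
    // of all scores at or below it. Requires a non-empty array.
    static double percentile(double[] scores, int percent) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("scores must not be empty");
        }
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int rank = (percent * sorted.length + 99) / 100; // ceil(percent/100 * n), integer math
        return sorted[Math.max(rank, 1) - 1];
    }

    // Scores strictly above the percentile threshold become review candidates.
    static List<Double> candidates(double[] scores, int percent) {
        double threshold = percentile(scores, percent);
        List<Double> flagged = new ArrayList<>();
        for (double score : scores) {
            if (score > threshold) {
                flagged.add(score);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        double[] scores = new double[20];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = i + 1; // scores 1.0 through 20.0
        }
        System.out.println(percentile(scores, 95)); // prints 19.0
        System.out.println(candidates(scores, 95)); // prints [20.0]
    }
}
```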
  • Pig
  • MR Equivalent to Pig Script
  • Hive
  • MapR Distribution for Apache Hadoop
    • Complete Hadoop distribution
    • Comprehensive management suite
    • Industry-standard interfaces
    • Enterprise-grade dependability
    • Enterprise-grade security (US Intelligence Agency)
    • Patents - IP
    • Higher performance
  • Hadoop Use Cases
    • ETL/EDW Offload
    • Sensor / Telemetry Data
    • Recommendation Engine
    • Search
      – ML algorithms
      – eDiscovery
    • Fleet Management
    • Fraud Detection / Risk Management
    • Traffic Decongestion
  • One Platform for Big Data…
    • 99.999% HA, Data Protection, Disaster Recovery, Scalability & Performance, Enterprise Integration, Multi-tenancy
    • MapReduce, File-Based Applications, SQL, Database, Search, Stream Processing
    • Batch: Log file Analysis, Data Warehouse Offload, Fraud Detection, Clickstream Analytics
    • Realtime: Sensor Analysis, “Twitterscraping”, Telematics, Process Optimization
    • Interactive: Forensic Analysis, Analytic Modeling, BI User Focus
  • Storm: “Hadoop for Realtime”
  • Before Storm: Queues, Workers
  • Example (simplified)
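The slide images for the pre-Storm design did not survive in this transcript. As a rough idea of what a hand-built queue-and-worker stage looks like, here is a minimal sketch in plain Java. All names (`QueueWorkerSketch`, `process`, the `STOP` marker) are hypothetical, and the "work" is a placeholder that measures message length.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hand-built queue/worker stage; every name here is hypothetical.
class QueueWorkerSketch {
    private static final String STOP = "__STOP__"; // poison pill that tells a worker to exit

    // Runs one stage: workers pull messages off one queue and push results onto another.
    static List<Integer> process(List<String> messages, int workerCount) {
        BlockingQueue<String> input = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> output = new LinkedBlockingQueue<>();

        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < workerCount; i++) {
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        String msg = input.take();      // block until work arrives
                        if (STOP.equals(msg)) {
                            return;
                        }
                        output.add(msg.length());       // placeholder for real processing
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // stop this worker if interrupted
                }
            });
            worker.start();
            workers.add(worker);
        }

        input.addAll(messages);
        for (int i = 0; i < workerCount; i++) {
            input.add(STOP);                            // one pill per worker
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }

        List<Integer> results = new ArrayList<>(output);
        Collections.sort(results);                      // arrival order is nondeterministic
        return results;
    }

    public static void main(String[] args) {
        System.out.println(process(List.of("storm", "hadoop", "kafka"), 2)); // prints [5, 5, 6]
    }
}
```

Nothing in this sketch retries a message if a worker dies, rebalances work, or scales across machines; every stage needs its own queue and its own deployment.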
  • Storm
    • Guaranteed data processing
    • Horizontal scalability
    • Fault-tolerance
    • No intermediate message brokers!
    • Higher level abstraction than message passing
    • “Just works”
  • Streams: unbounded sequence of tuples
  • Spouts: source of streams
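  • Spouts

        public interface ISpout extends Serializable {
            // Called once when the spout task starts; keep the collector for emitting.
            void open(Map conf,
                      TopologyContext context,
                      SpoutOutputCollector collector);
            void close();
            // Called repeatedly; emit the next tuple(s) here, if any are available.
            void nextTuple();
            // The tuple emitted with this message id was fully processed.
            void ack(Object msgId);
            // The tuple emitted with this message id failed and may be replayed.
            void fail(Object msgId);
        }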
  • Bolts: process input streams and produce new streams
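  • Bolts

        // Package names are those of Storm 0.8.x (backtype.storm).
        import java.util.Map;
        import backtype.storm.task.OutputCollector;
        import backtype.storm.task.TopologyContext;
        import backtype.storm.topology.OutputFieldsDeclarer;
        import backtype.storm.topology.base.BaseRichBolt;
        import backtype.storm.tuple.Fields;
        import backtype.storm.tuple.Tuple;
        import backtype.storm.tuple.Values;

        public class DoubleAndTripleBolt extends BaseRichBolt {
            // BaseRichBolt.prepare receives an OutputCollector, not OutputCollectorBase.
            private OutputCollector _collector;

            public void prepare(Map conf,
                                TopologyContext context,
                                OutputCollector collector) {
                _collector = collector;
            }

            public void execute(Tuple input) {
                int val = input.getInteger(0);
                // Anchor the new tuple to the input, then ack the input.
                _collector.emit(input, new Values(val*2, val*3));
                _collector.ack(input);
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("double", "triple"));
            }
        }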
  • Topologies: network of spouts and bolts
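To make the idea of a network of spouts and bolts concrete without a Storm cluster, here is a toy in-memory model in plain Java. It is not the Storm API: `MiniSpout`, `MiniBolt`, and `run` are hypothetical stand-ins, the stream is finite and fully materialized between stages (a real Storm stream is unbounded), and there is no parallelism, acking, or fault tolerance.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

// Toy in-memory model of a linear topology; not the Storm API.
class MiniTopologySketch {

    // Hypothetical spout: returns the next tuple, or null when the source is drained.
    interface MiniSpout {
        List<Object> nextTuple();
    }

    // Hypothetical bolt: consumes one tuple and emits zero or more new tuples.
    interface MiniBolt {
        void execute(List<Object> input, Consumer<List<Object>> emit);
    }

    // Drain the spout, then pass the stream through each bolt in order.
    static List<List<Object>> run(MiniSpout spout, List<MiniBolt> bolts) {
        List<List<Object>> stream = new ArrayList<>();
        for (List<Object> t = spout.nextTuple(); t != null; t = spout.nextTuple()) {
            stream.add(t);
        }
        for (MiniBolt bolt : bolts) {
            List<List<Object>> next = new ArrayList<>();
            for (List<Object> tuple : stream) {
                bolt.execute(tuple, next::add);
            }
            stream = next;
        }
        return stream;
    }

    // A spout that emits one single-field tuple per value.
    static MiniSpout fromValues(Integer... values) {
        Queue<Integer> pending = new ArrayDeque<>(List.of(values));
        return () -> pending.isEmpty() ? null : List.<Object>of(pending.poll());
    }

    // Spout of 1, 2, 3 -> double-and-triple bolt -> sum bolt.
    static List<List<Object>> demo() {
        MiniBolt doubleAndTriple = (input, emit) -> {
            int val = (Integer) input.get(0);
            emit.accept(List.<Object>of(val * 2, val * 3));
        };
        MiniBolt sum = (input, emit) -> {
            int doubled = (Integer) input.get(0);
            int tripled = (Integer) input.get(1);
            emit.accept(List.<Object>of(doubled + tripled));
        };
        return run(fromValues(1, 2, 3), List.of(doubleAndTriple, sum));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [[5], [10], [15]]
    }
}
```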
  • Trident: Cascading for Storm

        TridentTopology topology = new TridentTopology();
        TridentState wordCounts =
            topology.newStream("spout1", spout)
                .each(new Fields("sentence"),
                      new Split(),
                      new Fields("word"))
                .groupBy(new Fields("word"))
                .persistentAggregate(new MemoryMapState.Factory(),
                                     new Count(),
                                     new Fields("count"))
                .parallelismHint(6);
  • Parallel Cluster Ingest [diagram; labels: Storm, Hadoop, batch processes, Apps, Business Value, Raw Data, realtime processes, Queue (Kafka)]
  • [Diagram; labels: Hadoop, batch processes, Apps, Business Value, Raw Data, realtime processes, Storm, TailSpout, Franz, Queue (Kafka)]
  • [Diagram; labels: Storm, Kafka, Twitter, Twitter API, TweetLogger, Kafka Cluster (×3), Kafka API, Storm, Web Service, NAS, Web Data, Hadoop, Flume, HDFS, Data]
  • [Diagram; labels: Twitter, Twitter API, Catcher, Storm, Topic Queue, Web-server, http, Web Data, MapR, TweetLogger]
  • Scaling Estimates: Twitter Firehose
    • Old School: 8+ separate clusters, 20-25 nodes
      – >3 Kafka nodes
      – >2 TweetLoggers
      – 5-10 Hadoop
      – >2 Catcher nodes
      – >3 Storm
      – 3 zookeepers
      – NAS for web storage
      – >2 web servers
    • MapR: One Platform
      – 5-10 nodes total
      – Any node does any job
      – Full HA included
      – Backups included
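The "20-25 nodes" figure for the separate-cluster design can be checked by adding up the per-role counts on the slide. The sketch below assumes each ">N" entry means "at least N" and takes N as the count, and it leaves out the NAS since the slide gives no node count for it. The class and method names are hypothetical.

```java
// Adds up the per-role node counts listed on the Scaling Estimates slide.
class ScalingEstimateSketch {

    // Hadoop is given as a 5-10 node range; every other role uses its listed minimum.
    static int separateClusterNodes(int hadoopNodes) {
        int kafka = 3;
        int tweetLoggers = 2;
        int catchers = 2;
        int storm = 3;
        int zookeepers = 3;
        int webServers = 2;
        return kafka + tweetLoggers + hadoopNodes + catchers + storm + zookeepers + webServers;
    }

    public static void main(String[] args) {
        System.out.println(separateClusterNodes(5));  // prints 20
        System.out.println(separateClusterNodes(10)); // prints 25
    }
}
```

Under these assumptions the totals match the slide's 20-25 range, and the eight listed roles (Kafka, TweetLoggers, Hadoop, Catchers, Storm, ZooKeeper, NAS, web servers) account for the "8+ separate clusters".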
  • github
    • Watch TailSpout & Franz development
      – https://github.com/{tdunning | boorad | pfcurtis}/mapr-spout
    • And our example Twitter implementation
      – https://github.com/{tdunning | boorad | pfcurtis}/mapr-spout-test
  • Demo