Storm - The Real-Time Layer Your Big Data's Been Missing
Upcoming SlideShare
Loading in...5
×
 

Storm - The Real-Time Layer Your Big Data's Been Missing

on

  • 1,465 views

Presentation on Twitter Storm by Dan Lynn, CTO of FullContact, at GlueCon 2012.

Presentation on Twitter Storm by Dan Lynn, CTO of FullContact, at GlueCon 2012.

Statistics

Views

Total Views
1,465
Slideshare-icon Views on SlideShare
1,448
Embed Views
17

Actions

Likes
5
Downloads
65
Comments
0

4 Embeds 17

http://192.168.6.179 13
http://192.168.6.184 2
http://pinterest.com 1
http://jp.pinterest.com 1

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Storm - The Real-Time Layer Your Big Data's Been Missing Storm - The Real-Time Layer Your Big Data's Been Missing Presentation Transcript

    • StormThe Real-Time Layer Your Big Data’s Been Missing Dan Lynn dan@fullcontact.com @danklynn
    • Keeps Contact Information Current and Complete Based in Denver, Colorado CTO & Co-Founder dan@fullcontact.com @danklynn
    • Turn Partial Contacts Into Full Contacts
    • Your data is old
    • Old data is confusing
    • First, there was the spreadsheet
    • Next, there was SQL
    • Then, there wasn’t SQL
    • Batch Processing = Stale Data
    • StreamingComputation
    • Queues / Workers
    • Queues / WorkersMessagesMessages Messages Messages Messages Messages Queue Messages Workers
    • Queues / WorkersMe ssage rou tingcan be com ple x Messages Messages Messages Messages Messages Messages Queue Messages Workers
    • Queues / Workers
    • Brittle
    • Hard to scale
    • Queues / Workers
    • Queues / Workers
    • Queues / Workers
    • Queues / WorkersRou ting mus t bereco nfig ured whe nsca ling out
    • Storm
    • StormDistributed and fault-tolerant real-time computation
    • StormDistributed and fault-tolerant real-time computation
    • StormDistributed and fault-tolerant real-time computation
    • StormDistributed and fault-tolerant real-time computation
    • Key Concepts
    • TuplesOrdered list of elements
    • Tuples Ordered list of elements("search-01384", "e:dan@fullcontact.com")
    • StreamsUnbounded sequence of tuples
    • Streams Unbounded sequence of tuplesTuple Tuple Tuple Tuple Tuple Tuple
    • Spouts Source of streams
    • Spouts Source of streams
    • Spouts Source of streamsTuple Tuple Tuple Tuple Tuple Tuple
    • Spouts can talk with some images from http://commons.wikimedia.org
    • Spouts can talk with•Queues some images from http://commons.wikimedia.org
    • Spouts can talk with•Queues•Web logs some images from http://commons.wikimedia.org
    • Spouts can talk with•Queues•Web logs•API calls some images from http://commons.wikimedia.org
    • Spouts can talk with•Queues•Web logs•API calls•Event data some images from http://commons.wikimedia.org
    • BoltsProcess tuples and create new streams
    • Bolts Tuple Tuple Tuple Tuple Tuple TupleTuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple some images from http://commons.wikimedia.org
    • Bolts some images from http://commons.wikimedia.org
    • Bolts•Apply functions / transforms some images from http://commons.wikimedia.org
    • Bolts•Apply functions / transforms•Filter some images from http://commons.wikimedia.org
    • Bolts•Apply functions / transforms•Filter•Aggregation some images from http://commons.wikimedia.org
    • Bolts•Apply functions / transforms•Filter•Aggregation•Streaming joins some images from http://commons.wikimedia.org
    • Bolts•Apply functions / transforms•Filter•Aggregation•Streaming joins•Access DBs, APIs, etc... some images from http://commons.wikimedia.org
    • TopologiesA directed graph of Spouts and Bolts
    • This is a Topology some images from http://commons.wikimedia.org
    • This is also a topology some images from http://commons.wikimedia.org
    • TasksProcesses which execute Streams or Bolts
    • Running a Topology$ storm jar my-code.jar com.example.MyTopology arg1 arg2
    • Storm Cluster Nathan Marz
    • Storm ClusterIf thi s we reHadoo p... Nathan Marz
    • Storm ClusterIf thi s we reHadoo p...Job Tracke r Nathan Marz
    • Storm ClusterIf thi s we reHadoo p... Tas k Tracke rs Nathan Marz
    • Storm ClusterBut it’s not Hado opCoo rdi nates eve ry thi ng Nathan Marz
    • Example:Streaming Word Count
    • Streaming Word CountTopologyBuilder builder = new TopologyBuilder();builder.setSpout("sentences", new RandomSentenceSpout(), 5);builder.setBolt("split", new SplitSentence(), 8) .shuffleGrouping("sentences");builder.setBolt("count", new WordCount(), 12) .fieldsGrouping("split", new Fields("word"));
    • Streaming Word CountTopologyBuilder builder = new TopologyBuilder();builder.setSpout("sentences", new RandomSentenceSpout(), 5);builder.setBolt("split", new SplitSentence(), 8) .shuffleGrouping("sentences");builder.setBolt("count", new WordCount(), 12) .fieldsGrouping("split", new Fields("word"));
    • Streaming Word Countpublic static class SplitSentence extends ShellBolt implements IRichBolt {            public SplitSentence() {        super("python", "splitsentence.py");    }    @Override    public void declareOutputFields(OutputFieldsDeclarer declarer) {        declarer.declare(new Fields("word"));    }    @Override    public Map<String, Object> getComponentConfiguration() {        return null;    }} SplitSentence.java
    • Streaming Word Countpublic static class SplitSentence extends ShellBolt implements IRichBolt {            public SplitSentence() {        super("python", "splitsentence.py");    }    @Override    public void declareOutputFields(OutputFieldsDeclarer declarer) {        declarer.declare(new Fields("word"));    }    @Override    public Map<String, Object> getComponentConfiguration() {        return null;    }} SplitSentence.java splitsentence.py
    • Streaming Word Countpublic static class SplitSentence extends ShellBolt implements IRichBolt {            public SplitSentence() {        super("python", "splitsentence.py");    }    @Override    public void declareOutputFields(OutputFieldsDeclarer declarer) {        declarer.declare(new Fields("word"));    }    @Override    public Map<String, Object> getComponentConfiguration() {        return null;    }} SplitSentence.java
    • Streaming Word CountTopologyBuilder builder = new TopologyBuilder();builder.setSpout("sentences", new RandomSentenceSpout(), 5);builder.setBolt("split", new SplitSentence(), 8) .shuffleGrouping("sentences");builder.setBolt("count", new WordCount(), 12) .fieldsGrouping("split", new Fields("word")); java
    • Streaming Word Countpublic static class WordCount extends BaseBasicBolt {    Map<String, Integer> counts = new HashMap<String, Integer>();    @Override    public void execute(Tuple tuple, BasicOutputCollector collector) {        String word = tuple.getString(0);        Integer count = counts.get(word);        if(count==null) count = 0;        count++;        counts.put(word, count);        collector.emit(new Values(word, count));    }    @Override    public void declareOutputFields(OutputFieldsDeclarer declarer) {        declarer.declare(new Fields("word", "count"));    }} WordCount.java
    • Streaming Word Count TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("sentences", new RandomSentenceSpout(), 5); builder.setBolt("split", new SplitSentence(), 8) .shuffleGrouping("sentences"); builder.setBolt("count", new WordCount(), 12) .fieldsGrouping("split", new Fields("word")); javaGro upings con tro l how tup les are rou ted
    • Shuffle groupingTuples are randomly distributed across all of the tasks running the bolt
    • Fields groupingGroups tuples by specific named fields and routes them to the same task
    • o p’s ado r t o H av i o o us b e hAn a lo g i ng rt i t io npa Fields grouping Groups tuples by specific named fields and routes them to the same task
    • Distributed RPC
    • Before Distributed RPC,time-sensitive queries relied on a pre-computed index
    • What if you didn’t need an index?
    • Distributed RPC
    • Try it out!Huge thanks to Nathan Marz - @nathanmarz http://github.com/nathanmarz/stormhttps://github.com/nathanmarz/storm-starter @stormprocessor
    • Questions? dan@fullcontact.com