Beyond Batch
  HBase, Drill, & Storm

  Brad Anderson


©MapR Technologies
whoami
• Brad Anderson

• Solutions Architect at MapR (Atlanta)

• ATLHUG co-chair

• ‘boorad’ most places (Twitter, GitHub)

• banderson@maprtech.com
•    The open enterprise-grade distribution for Hadoop
     • Easy, dependable and fast
     • Open source with standards-based extensions

•    MapR is deployed at thousands of companies
     • From small Internet startups to the world’s largest enterprises

•    MapR customers analyze massive amounts of data:
     • Hundreds of billions of events daily
     • 90% of the world’s Internet population monthly
     • $1 trillion in retail purchases annually

•    MapR cloud partners
     • Google provides Hadoop on Google Compute Engine
     • Amazon for Elastic MapReduce instances
Beyond Batch
• HBase & M7

• Apache Drill

• Storm




Latency Matters

         Batch       Interactive   Streaming




HBase Issues
Reliability
• Compactions disrupt operations
• Very slow crash recovery
• Unreliable splitting

Business continuity
• Common hardware/software issues cause downtime
• Administration requires downtime
• No point-in-time recovery
• Complex backup process

Performance
• Many bottlenecks result in low throughput
• Limited data locality
• Limited # of tables

Manageability
• Compactions, splits and merges must be done manually (in reality)
• Basic operations like backup or table rename are complex
M7
    An integrated system for unstructured and structured data
     – Unified namespace for files and tables
     – Data management
     – Data protection
     – Disaster recovery
     – No additional administration

    An architecture that delivers reliability and performance
     – Fewer layers
     – No compactions
     – Seamless splits
     – Automatic merges
     – Single network hop
     – Instant recovery
     – Reduced read and write amplification

Unified Namespace
$ pwd
/mapr/default/user/boorad

$ ls
file1 file2 table1 table2

$ hbase shell
hbase(main):003:0> create '/user/boorad/table3', 'cf1', 'cf2', 'cf3'
0 row(s) in 0.1570 seconds

$ ls
file1 file2 table1 table2 table3

$ hadoop fs -ls /user/boorad
Found 5 items
-rw-r--r-- 3 mapr mapr       16 2012-09-28 08:34 /user/boorad/file1
-rw-r--r-- 3 mapr mapr       22 2012-09-28 08:34 /user/boorad/file2
trwxr-xr-x 3 mapr mapr       2 2012-09-28 08:32 /user/boorad/table1
trwxr-xr-x 3 mapr mapr       2 2012-09-28 08:33 /user/boorad/table2
trwxr-xr-x 3 mapr mapr       2 2012-09-28 08:38 /user/boorad/table3
Simplifying HBase Architecture

[Diagram: other distributions run an HBase JVM on top of a separate DFS JVM over ext3 disks; MapR removes the ext3 layer; M7 unifies files and tables in one layer directly over the disks.]
No RegionServers?
• No daemons to manage
• One network hop
• One cache
Region Assignment




Instant Recovery

    Apache HBase experiences an outage when any node
     crashes
     – Each RegionServer replays WAL before any region can be
       recovered
     – All regions served by that RegionServer cannot be accessed
    M7 provides instant recovery
     –   M7 uses small WALs
         •   Multiple WALs per region vs. 1 per RegionServer (1000 regions)
     –   Instant recovery on put
     –   1000-10000x faster recovery on get
    How?
     –   M7 leverages unique MapR-FS capabilities, not impacted by
         HDFS limitations
         •   Append support
         •   No limit to # of files
LSMT (FTW)
 Traditional disk-based index structures like B-Trees are expensive to maintain in real time
 Log Structured Merge Trees reduce the cost by deferring and batching index changes
 Writes
     – Writes go to an in-memory index
         • And a commit log in case the node crashes and recovery is needed
     – The in-memory index is occasionally merged into the disk-based index
         • This may trigger a compaction
 Reads
     – Reads hit the in-memory index and the disk-based index
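The write and read paths above can be sketched with a toy LSM tree. `ToyLsm` is a made-up class for illustration (not M7's implementation): writes go to a commit log and a sorted in-memory index, which is flushed to an immutable sorted segment when full; reads check the in-memory index first, then segments newest-first.

```java
import java.util.*;

// Toy LSM tree: memtable + commit log + immutable sorted segments.
// Lists stand in for on-disk files to keep the sketch self-contained.
class ToyLsm {
    private final int memtableLimit;
    private NavigableMap<String, String> memtable = new TreeMap<>();
    private final List<String> commitLog = new ArrayList<>();            // for crash recovery
    private final Deque<NavigableMap<String, String>> segments = new ArrayDeque<>();

    ToyLsm(int memtableLimit) { this.memtableLimit = memtableLimit; }

    void put(String key, String value) {
        commitLog.add(key + "=" + value);   // log first, so a crash loses nothing
        memtable.put(key, value);           // then the in-memory index
        if (memtable.size() >= memtableLimit) flush();
    }

    private void flush() {                  // deferred, batched index change
        segments.addFirst(memtable);        // newest segment first
        memtable = new TreeMap<>();
        commitLog.clear();                  // safe once the flush is durable
    }

    String get(String key) {
        String v = memtable.get(key);       // in-memory index first
        if (v != null) return v;
        for (NavigableMap<String, String> seg : segments) {   // then disk, newest wins
            v = seg.get(key);
            if (v != null) return v;
        }
        return null;
    }
}
```

Merging segments together (compaction) is the part the slide's next table compares across implementations; this sketch never compacts.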
Storage Subsystem Performance
What does it cost to merge the in-memory index into the disk-based index?

                        HBase-style         LevelDB-style     M7
Examples                BigTable, HBase,    Cassandra, Riak   M7
                        Cassandra, Riak
WAF                     Low                 High              Low
RAF                     High                Low               Low
I/O storms              Yes                 No                No
Disk space overhead     High (2x)           Low               Low
Skewed data handling    Bad                 Good              Good
Rewrite large values    Yes                 Yes               No

Terminology:
    Write-amplification factor (WAF): the ratio between writes to disk and
     application writes. Note that data must be rewritten in every indexed structure.
    Read-amplification factor (RAF): the ratio between reads from disk and
     application reads.
    Skewed data handling: when inserting values with similar keys (e.g., increasing
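As a quick worked example of the two definitions above (all numbers invented):

```java
// Amplification factors as defined in the terminology section.
class Amplification {
    // WAF: bytes written to disk (flushes + compaction rewrites) / bytes the app wrote
    static double waf(long diskBytesWritten, long appBytesWritten) {
        return (double) diskBytesWritten / appBytesWritten;
    }

    // RAF: bytes read from disk / bytes the app asked for
    static double raf(long diskBytesRead, long appBytesRead) {
        return (double) diskBytesRead / appBytesRead;
    }
}
```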
Other M7 Features
    Smaller disk footprint
     – HBase  stores key & column name for every version of
       every cell
     – M7 never repeats the key or column name

    Columnar layout
     – HBase supports 2-3 column families in practice
     – M7 supports 64 column families

    Online schema changes
     – No   need to disable table to add/remove/modify
         column families


Big Data Picture

                      Batch processing    Interactive analysis       Stream processing

Query runtime         Minutes to hours    Milliseconds to minutes    Never-ending

Data volume           TBs to PBs          GBs to PBs                 Continuous stream

Programming model     MapReduce           Queries                    DAG

Users                 Developers          Analysts and developers    Developers

Google project        MapReduce           Dremel

Open source project   Hadoop MapReduce    Apache Drill               Storm, S4
Google Dremel
• Interactive analysis of large-scale datasets
      • Trillion records at interactive speeds
      • Complementary to MapReduce
      • Used by thousands of Google employees
      • Paper published at VLDB 2010
• Model
      • Nested data model with schema
          • Most data at Google is stored/transferred in Protocol Buffers
          • Normalization (to relational) is prohibitive
      • SQL-like query language with nested data support
• Implementation
      • Column-based storage and processing
      • In-situ data access (GFS and Bigtable)
      • Tree architecture as in Web search (and databases)
Google BigQuery
• Hosted Dremel (Dremel as a Service)
• CLI (bq) and Web UI
• Import data from Google Cloud Storage or local files
          • Files must be in CSV format
          • Nested data not supported [yet] except built-in datasets
          • Schema definition required




DrQL Example

Input record (t):
  DocId: 10
  Links
    Forward: 20
    Forward: 40
    Forward: 60
  Name
    Language
      Code: 'en-us'
      Country: 'us'
    Language
      Code: 'en'
    Url: 'http://A'
  Name
    Url: 'http://B'
  Name
    Language
      Code: 'en-gb'
      Country: 'gb'

Query:
  SELECT DocId AS Id,
    COUNT(Name.Language.Code) WITHIN Name AS Cnt,
    Name.Url + ',' + Name.Language.Code AS Str
  FROM t
  WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

Result:
  Id: 10
  Name
    Cnt: 2
    Language
      Str: 'http://A,en-us'
      Str: 'http://A,en'
  Name
    Cnt: 0
                                       * Example from the Dremel paper
Data Flow




Extensibility
• Nested query languages
      •   Pluggable model
      •   DrQL
      •   Mongo Query Language
      •   Cascading
• Distributed execution engine
      • Extensible model (e.g., Dryad)
      • Low-latency
      • Fault tolerant



Extensibility
• Nested data formats
      • Pluggable model
        • Column-based (ColumnIO/Dremel, Trevni, RCFile)
        • Row-based (RecordIO, Avro, JSON, CSV)
        • Schema (Protocol Buffers, Avro, CSV)
        • Schema-less (JSON, BSON)
• Scalable data sources
      • Pluggable model
      • Hadoop
      • HBase


Architecture


• Only the execution engine knows the physical attributes of the
  cluster
      • # nodes, hardware, file locations, …


• Public interfaces enable extensibility
      • Developers can build parsers for new query languages
      • Developers can provide an execution plan directly


• Each level of the plan has a human readable representation
      • Facilitates debugging and unit testing
Query Components
• Query components:
      •   SELECT
      •   FROM
      •   WHERE
      •   GROUP BY
      •   HAVING
      •   (JOIN)

• Key logical operators:
      •   Scan
      •   Filter
      •   Aggregate
      •   (Join)
Execution Engine Layers
• Drill execution engine has two layers
      • Operator layer is serialization-aware
          • Processes individual records
      • Execution layer is not serialization-aware
          • Processes batches of records (blobs)
          • Responsible for communication, dependencies and fault tolerance
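One way to picture this two-layer split (the names below are hypothetical, not Drill's actual classes): only operators touch individual records, while the execution layer just moves whole batches between pipeline stages without inspecting them.

```java
import java.util.*;
import java.util.function.*;

// Toy two-layer engine: record-at-a-time operators, batch-at-a-time execution.
class TwoLayerEngine {
    // Operator layer: serialization-aware, processes individual records.
    static List<String> filterOperator(List<String> batch, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        for (String record : batch) if (keep.test(record)) out.add(record);
        return out;
    }

    // Execution layer: never looks inside a record; just routes batches
    // through the pipeline stages (and, in a real engine, would handle
    // communication, dependencies, and fault tolerance here).
    @SafeVarargs
    static List<String> run(List<String> batch, UnaryOperator<List<String>>... stages) {
        for (UnaryOperator<List<String>> stage : stages) batch = stage.apply(batch);
        return batch;
    }
}
```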




Design Principles

Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
  • Column-based and row-based
  • Schema and schema-less

Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages

Fast
• C/C++ core with Java support
  • Google C++ style guide
• Min latency and max throughput (limited only by hardware)

Dependable
• No SPOF
• Instant recovery from crashes
Hadoop Integration
• Hadoop data sources
      • Hadoop FileSystem API (HDFS/MapR-FS)
      • HBase
• Hadoop data formats
      • Apache Avro
      • RCFile
• MapReduce-based tools to create column-based
  formats




Fully Open




Storm




Before Storm

[Diagram: queues feeding pools of workers]
Example (simplified)
Storm

                     Guaranteed data processing
                     Horizontal scalability
                     Fault-tolerance
                     No intermediate message brokers!
                     Higher level abstraction than
                     message passing
                     “Just works”
Concepts




Streams

Tuple   Tuple   Tuple   Tuple   Tuple   Tuple   Tuple

Unbounded sequence of tuples
Spouts

[Diagram: spouts emitting streams of tuples]

Source of streams
Spouts

public interface ISpout extends Serializable {
  void open(Map conf,
         TopologyContext context,
         SpoutOutputCollector collector);
  void close();
  void nextTuple();
  void ack(Object msgId);
  void fail(Object msgId);
}
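The interface's contract is easiest to see with a toy implementation. The sketch below drops Storm's collector and context for simplified stand-ins (the `emitted` list is hypothetical), but shows how `nextTuple`/`ack`/`fail` cooperate for guaranteed processing: a failed tuple is re-queued and emitted again.

```java
import java.util.*;

// Toy in-memory spout: replays unacked messages on fail().
class SentenceSpout {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Map<Integer, String> inFlight = new HashMap<>();  // msgId -> tuple
    private int nextId = 0;
    final List<String> emitted = new ArrayList<>();                 // stand-in for a collector

    SentenceSpout(Collection<String> source) { pending.addAll(source); }

    void nextTuple() {                   // Storm calls this in a loop
        String s = pending.poll();
        if (s == null) return;           // nothing to emit right now
        inFlight.put(nextId++, s);       // remember until acked
        emitted.add(s);
    }

    void ack(Object msgId) { inFlight.remove(msgId); }   // fully processed downstream

    void fail(Object msgId) {            // downstream failure: replay the tuple
        String s = inFlight.remove(msgId);
        if (s != null) pending.add(s);
    }
}
```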



Bolts

[Diagram: bolts consuming streams of tuples and emitting new streams]

Processes input streams and produces new streams
Bolts
public class DoubleAndTripleBolt extends BaseRichBolt {
  private OutputCollectorBase _collector;

  public void prepare(Map conf,
                      TopologyContext context,
                      OutputCollectorBase collector) {
    _collector = collector;
  }

  public void execute(Tuple input) {
    int val = input.getInteger(0);
    _collector.emit(input, new Values(val * 2, val * 3));
    _collector.ack(input);
  }

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("double", "triple"));
  }
}
Topologies




                     Network of spouts and bolts
Trident
Cascading for Storm




Trident
TridentTopology topology = new TridentTopology();
TridentState wordCounts =
  topology.newStream("spout1", spout)
    .each(new Fields("sentence"),
          new Split(),
          new Fields("word"))
    .groupBy(new Fields("word"))
    .persistentAggregate(new MemoryMapState.Factory(),
                         new Count(),
                         new Fields("count"))
    .parallelismHint(6);
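The Trident topology above splits sentences into words, groups by word, and keeps a running count in state. A plain-Java sketch of the same dataflow, with hypothetical names and none of Trident's batching or fault tolerance:

```java
import java.util.*;

// each(Split) -> groupBy(word) -> persistentAggregate(Count), in plain Java.
// The Map plays the role of MemoryMapState.
class WordCountPipeline {
    private final Map<String, Long> wordCounts = new HashMap<>();   // persistent state

    void process(String sentence) {
        for (String word : sentence.split("\\s+")) {                // Split()
            wordCounts.merge(word, 1L, Long::sum);                  // Count() per group
        }
    }

    long count(String word) { return wordCounts.getOrDefault(word, 0L); }
}
```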
Interoperability




Spouts
• Kafka (with transactions)
• Kestrel
• JMS
• AMQP
• Beanstalkd
Bolts
 •Functions
 •Filters
 •Aggregation
 •Joins
 •Talk to databases, Hadoop write-behind


©MapR Technologies
Storm

[Diagram sequence: Raw Data flows into a Queue; Storm realtime processes and Hadoop batch processes both consume it, feeding Apps that deliver Business Value. Later builds add Parallel Cluster Ingest and drop the Queue, so Raw Data flows directly into Storm and Hadoop.]
Get Involved!
• Get more details on M7
      • http://mapr.com/products/mapr-editions/m7-edition

• Join the Apache Drill mailing list
      • drill-dev-subscribe@incubator.apache.org

• Watch TailSpout development
      • https://github.com/{tdunning | boorad}/mapr-spout

• Join MapR
      • jobs@mapr.com
      • banderson@maprtech.com

• @boorad

TriHUG - Beyond Batch


Editor's Notes

  • #6 HBase: random reads/writes; on 45% of all Hadoop clusters.
  • #20 Drill: remove the schema requirement; in-situ for real, since we’ll support multiple formats. Note: MR still needed for big joins, so to speak.
  • #21 Drill will support nested data; no schema required.
  • #22 Protocol Buffers are the conceptual data model; will support multiple data models; will have to define a way to describe data format (filtering, fields, etc.); schema-less will have a perf penalty; HBase will be one format.
  • #23 Likely to support these; could add HiveQL and more as well; could even be clever and route HiveQL to MR or Drill based on the query; Pig as well. Pluggability: data format, query language. Something 6-9 months from alpha quality; community driven, I can’t speak for the project. MapR: FS gives better chunk-size control; NFS support may make small test drivers easier; unified namespace will allow multi-cluster access; might even have a Drill component that autoformats data. Read-only model.
  • #24 Example query that Drill should support. Need to talk more here about what Dremel does.
  • #25 Load data into Drill (optional); could just use as-is in “row” format; multiple query languages; pluggability very important.
  • #26 Note: we have an already partially built execution engine.
  • #27 Note: we have an already partially built execution engine.
  • #34 Be prepared for Apache questions: committer vs. committee vs. contributor. If I can’t answer a question, ask them to answer and contribute. Lisa: need landing page; references to the paper and such at the end.
  • #38 Scaling is painful; poor fault tolerance; coding is hard.
  • #41 Tweets, stock ticks, manufacturing machine data, sensor messages.
  • #46 DAG; runs continuously.
  • #47 Abstractions like Cascading, Hive, Pig make MR approachable; code size reduction.
  • #50 Kestrel via Thrift; Kafka: transactional topologies, idempotency, process only once; ActiveMQ.
  • #52 Current architecture; data ingest tool for Hadoop (avoid Flume madness).
  • #53 Current architecture; data ingest tool for Hadoop (avoid Flume madness).