PhillyDB Talk - Beyond Batch



The venerable MapReduce framework has allowed Hadoop to prove its worth in the big data space by storing and analyzing much larger data sets than was previously possible. Much of the current activity in the big data ecosystem, however, centers on other major categories of workflows beyond batch.

The emerging tools cover low-latency I/O (HBase), interactive queries (Drill), stream processing (Storm), and text processing and indexing (Solr). This talk discusses some of the more interesting developments in Drill and Storm, their capabilities, and how they are being put to use in real-world situations.

Published in: Technology


  1. Beyond Batch
     Drill & Storm
     Brad Anderson
     ©MapR Technologies
  2. whoami
     • Brad Anderson
     • Solutions Architect at MapR (Atlanta)
     • ATLHUG co-chair
     • 'boorad' most places (twitter, github)
  3. MapR - Faster and More Scalable

     Benchmark       MapR 2.1.1     CDH 4.1.1      MapR Speed Increase
     Terasort (1x replication, compression disabled)
       Total         13m 35s        26m 6s         1.9x
       Map           7m 58s         21m 8s         2.7x
       Reduce        13m 32s        23m 37s        1.7x
     DFSIO throughput/node
       Read          1003 MB/s      656 MB/s       1.5x
       Write         924 MB/s       654 MB/s       1.4x
     YCSB (50% read, 50% update)
       Throughput    36,584.4 op/s  12,500.5 op/s  2.9x
       Runtime       3.80 hr        11.11 hr       2.9x
     YCSB (95% read, 5% update)
       Throughput    24,704.3 op/s  10,776.4 op/s  2.3x
       Runtime       0.56 hr        1.29 hr        2.3x

                     MapR/Google    Apache Hadoop
       Time          54 sec         62 sec
       Nodes         1,003          1,460
       Disks         1,003          5,840
       Cores         4,012          11,680

     Benchmark hardware configuration: 10 servers, 12 x 2 cores (2.4 GHz), 12 x 2TB, 48 GB, 1 x 10GbE
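The "MapR Speed Increase" column on slide 3 can be recomputed from the two measurement columns. The sketch below assumes the ratio is baseline over candidate for run times and candidate over baseline for throughput, rounded to one decimal; the class and method names are hypothetical and not part of the talk.

```java
import java.util.Locale;

// Hypothetical helper, not from the talk: recomputes the "speed increase" column.
final class Speedup {
    // Converts a duration such as "13m 35s" into seconds.
    static int seconds(int minutes, int secs) {
        return minutes * 60 + secs;
    }

    // Lower-is-better metrics (run time): how many times longer the baseline takes.
    static double lowerIsBetter(double candidate, double baseline) {
        return baseline / candidate;
    }

    // Higher-is-better metrics (throughput): how many times more the candidate delivers.
    static double higherIsBetter(double candidate, double baseline) {
        return candidate / baseline;
    }

    // Formats a ratio the way the slide does, for example "1.9x".
    static String format(double ratio) {
        return String.format(Locale.ROOT, "%.1fx", ratio);
    }

    public static void main(String[] args) {
        // Terasort total: MapR 13m 35s versus CDH 26m 6s.
        System.out.println(format(lowerIsBetter(seconds(13, 35), seconds(26, 6))));
        // YCSB 50/50 throughput: 36,584.4 op/s versus 12,500.5 op/s.
        System.out.println(format(higherIsBetter(36584.4, 12500.5)));
    }
}
```

Under those assumptions the two printed values agree with the 1.9x and 2.9x entries in the table.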
  4. Beyond Batch
     • HBase & M7
     • Apache Drill
     • Storm
     • Solr & Elasticsearch
  5. Latency Matters
     Batch, Interactive, Streaming
  6. Big Data Picture

                           Batch processing    Interactive analysis      Stream processing
     Query runtime         Minutes to hours    Milliseconds to minutes   Never-ending
     Data volume           TBs to PBs          GBs to PBs                Continuous stream
     Programming model     MapReduce           Queries                   DAG
     Users                 Developers          Analysts and Developers   Developers
     Google project        MapReduce           Dremel
     Open source project   Hadoop MapReduce    Apache Drill              Storm, S4
  7. Interactive SQL Initiatives for Hadoop
     • SQL based OLTP
     • SQL based analytics
     • Real-time interactive queries
     • Impala*: real-time interactive queries
     • SQL conversion to MapReduce
     * Does not work with other distributions
  8. (no transcribed text)
  9. Google Dremel
     • Interactive analysis of large-scale datasets
       • Trillion records at interactive speeds
       • Complementary to MapReduce
       • Used by thousands of Google employees
       • Paper published at VLDB 2010
     • Model
       • Nested data model with schema
       • Most data at Google is stored/transferred in Protocol Buffers
       • SQL-like query language with nested data support
     • Implementation
       • Column-based storage and processing
       • In-situ data access (GFS and Bigtable)
       • Tree architecture as in Web search (and databases)
  10. Google BigQuery
      • Hosted Dremel (Dremel as a Service)
      • CLI (bq) and Web UI
      • Import data from Google Cloud Storage or local files
        • Files must be in CSV format
        • Nested data not supported [yet] except built-in datasets
        • Schema definition required
  11. Drill Design Principles
      Flexible
      • Pluggable query languages
      • Extensible execution engine
      • Pluggable data formats
        • Columns and Rows
        • Schema and Schema-less
      • Pluggable data sources
      Easy
      • Unzip and run
      • Zero configuration
      • Reverse DNS not needed
      • IP addresses can change
      • Clear and concise log messages
      Fast
      • C/C++ core with Java support
        • Google C++ style guide
      • Min latency and max throughput (limited only by hardware)
      Dependable
      • No SPOF
      • Instant recovery from crashes
  12. DrQL Example (from the Dremel paper)

      Query:
        SELECT DocId AS Id,
               COUNT(Name.Language.Code) WITHIN Name AS Cnt,
               Name.Url + ',' + Name.Language.Code AS Str
        FROM t
        WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

      Input record:
        DocId: 10
        Links
          Forward: 20
          Forward: 40
          Forward: 60
        Name
          Language
            Code: en-us
            Country: us
          Language
            Code: en
          Url: http://A
        Name
          Url: http://B
        Name
          Language
            Code: en-gb
            Country: gb

      Result:
        Id: 10
        Name
          Cnt: 2
          Language
            Str: http://A,en-us
          Language
            Str: http://A,en
        Name
          Cnt: 0
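To make the semantics of the slide 12 query concrete, here is a minimal in-memory sketch of how the sample record could be evaluated. It assumes that a Name group without a Url is pruned by the REGEXP filter and that COUNT ... WITHIN Name counts the Language.Code values under each surviving Name. The types `NameGroup`, `NameResult`, and `DrqlExample` are hypothetical and are not Drill or Dremel code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical in-memory model of one Name group in the sample record.
final class NameGroup {
    final String url;            // null when the Name has no Url field
    final List<String> codes;    // Name.Language.Code values, in order

    NameGroup(String url, List<String> codes) {
        this.url = url;
        this.codes = codes;
    }
}

// Hypothetical result for one surviving Name group.
final class NameResult {
    final int cnt;               // COUNT(Name.Language.Code) WITHIN Name
    final List<String> strs;     // one Str per Language under this Name

    NameResult(int cnt, List<String> strs) {
        this.cnt = cnt;
        this.strs = strs;
    }
}

final class DrqlExample {
    private static final Pattern HTTP_PREFIX = Pattern.compile("^http");

    // Returns the surviving Name groups, or an empty list when DocId >= 20.
    static List<NameResult> evaluate(int docId, List<NameGroup> names) {
        List<NameResult> out = new ArrayList<>();
        if (docId >= 20) {
            return out;                               // WHERE DocId < 20
        }
        for (NameGroup name : names) {
            // WHERE REGEXP(Name.Url, '^http'): a Name without a Url is pruned.
            if (name.url == null || !HTTP_PREFIX.matcher(name.url).find()) {
                continue;
            }
            List<String> strs = new ArrayList<>();
            for (String code : name.codes) {
                strs.add(name.url + "," + code);      // Name.Url + ',' + Name.Language.Code
            }
            out.add(new NameResult(name.codes.size(), strs));
        }
        return out;
    }

    public static void main(String[] args) {
        List<NameGroup> names = List.of(
            new NameGroup("http://A", List.of("en-us", "en")),
            new NameGroup("http://B", List.of()),
            new NameGroup(null, List.of("en-gb")));
        for (NameResult r : evaluate(10, names)) {
            System.out.println("Cnt: " + r.cnt + " Str: " + r.strs);
        }
    }
}
```

For the sample record this yields two Name groups, one with Cnt 2 and two Str values and one with Cnt 0, which matches the result shown on the slide.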
  13. Data Flow
  14. Extensibility
      • Nested query languages
        • DrQL
        • Mongo Query Language
        • Cascading, Hive, Pig
      • Distributed execution engine
        • Extensible model (e.g., Dryad)
        • Low-latency
        • Fault tolerant
  15. Extensibility
      • Nested data formats
        • Pluggable model
        • Column-based (ColumnIO/Dremel, Trevni, RCFile)
        • Row-based (RecordIO, Avro, JSON, CSV)
        • Schema (Protocol Buffers, Avro, CSV)
        • Schema-less (JSON, BSON)
      • Scalable data sources
        • Pluggable model
        • Hadoop
        • HBase
  16. Drill Architecture
      Diagram labels: Client, Cluster; Driver, Parser, Compiler, Execution Engine, Data Source; Query (text), AST (text), Plan (text), API
      Public interfaces enable extensibility
      – Add a new query language by implementing a parser
      – Add a new data source by implementing an API
      – Provide a plan directly to the execution engine to control execution
      Each level of the plan has a human readable representation
      – Facilitates debugging and development
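Slide 16 says a new query language is added by implementing a parser, and that each stage has a human readable (text) representation. One way to picture that extension point is a small parser registry, sketched below. Every name here (`QueryParser`, `ParserRegistry`, the "toy" language) is an assumption made for illustration; the talk does not show Drill's actual interfaces.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical extension point: turns query text into a textual AST,
// mirroring the "Query (text)" to "AST (text)" step on the slide.
interface QueryParser {
    String parse(String queryText);
}

// Hypothetical registry: adding a query language means registering one more parser.
final class ParserRegistry {
    private final Map<String, QueryParser> parsers = new HashMap<>();

    void register(String language, QueryParser parser) {
        parsers.put(language, parser);
    }

    String toAst(String language, String queryText) {
        QueryParser parser = parsers.get(language);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + language);
        }
        return parser.parse(queryText);
    }
}

final class ParserDemo {
    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        // Toy "language": upper-cases the text and wraps it so the AST stays readable.
        registry.register("toy",
            text -> "(query " + text.trim().toUpperCase(Locale.ROOT) + ")");
        System.out.println(registry.toAst("toy", " select 1 "));
    }
}
```

Keeping the AST as text in this sketch follows the slide's point that readable intermediate forms ease debugging; a real engine would more likely pass a structured tree between stages.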
  17. Drill Architecture (2)
      Diagram labels: DrQL Clients, Cascading/Pig/... Clients, Driver, Drill Query Servers, DrQL Parser, Other Parser, Intermediate Parser, Compiler, Drill Workers
  18. Query Components
      • Query components:
        • SELECT
        • FROM
        • WHERE
        • GROUP BY
        • HAVING
        • JOIN
      • Key logical operators:
        • Scan
        • Filter
        • Aggregate
        • Join
  19. Scan Operators
      • Drill supports multiple data formats by having per-format scan operators
      • Queries involving multiple data formats/sources are supported
      • Fields and predicates can be pushed down into the scan operator
      • Scan operators may have adaptive side-effects (database cracking)
      • Produce ColumnIO from RecordIO
      • Google PowerDrill stores materialized expressions with the data

                              Scan with schema                          Scan without schema
      Operator output         Protocol Buffers                          JSON-like (MessagePack)
      Supported data formats  ColumnIO (column-based protobuf/Dremel)   JSON
                              RecordIO (row-based protobuf)             HBase
                              CSV
      SELECT … FROM …         ColumnIO(proto URI, data URI)             Json(data URI)
                              RecordIO(proto URI, data URI)             HBase(table name)
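Slide 19 notes that fields and predicates can be pushed down into the scan operator. The sketch below shows the idea on in-memory rows: the scan applies the filter and projection itself, so rejected rows and unused columns never reach later operators. `PushdownScan` is a hypothetical class written for this illustration and is not Drill's operator API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical scan operator over in-memory rows.
final class PushdownScan {
    private final List<Map<String, Object>> rows;

    PushdownScan(List<Map<String, Object>> rows) {
        this.rows = rows;
    }

    // Applies the pushed-down predicate first, then keeps only the requested fields.
    List<Map<String, Object>> scan(List<String> fields,
                                   Predicate<Map<String, Object>> predicate) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            if (!predicate.test(row)) {
                continue;                         // row filtered inside the scan
            }
            Map<String, Object> projected = new LinkedHashMap<>();
            for (String field : fields) {
                if (row.containsKey(field)) {
                    projected.put(field, row.get(field));
                }
            }
            out.add(projected);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(Map.of("DocId", 10, "Url", "http://A"));
        rows.add(Map.of("DocId", 20, "Url", "http://C"));
        PushdownScan scan = new PushdownScan(rows);
        // Keep only DocId, and only for rows with DocId < 20.
        System.out.println(scan.scan(List.of("DocId"),
            row -> (Integer) row.get("DocId") < 20));
    }
}
```

A format-specific scan could go further than this sketch, for example by skipping whole column chunks that the projection never touches.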
  20. Execution Engine Layers
      • Drill execution engine has two layers
        • Operator layer is serialization-aware
          • Processes individual records
        • Execution layer is not serialization-aware
          • Processes batches of records (blobs)
          • Responsible for communication, dependencies and fault tolerance
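The two-layer split on slide 20 can be sketched as follows: the operator layer knows how to decode a blob into records, while the execution layer only moves opaque blobs between stages. The class names and the newline-delimited UTF-8 encoding are assumptions chosen to keep the example small; the talk does not specify Drill's wire format.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical operator layer: serialization-aware, works record by record.
final class OperatorLayer {
    // Decodes a blob into records, transforms each record, and re-encodes the batch.
    static byte[] apply(byte[] blob, Function<String, String> perRecord) {
        String[] records = new String(blob, StandardCharsets.UTF_8).split("\n");
        List<String> out = new ArrayList<>();
        for (String record : records) {
            out.add(perRecord.apply(record));
        }
        return String.join("\n", out).getBytes(StandardCharsets.UTF_8);
    }
}

// Hypothetical execution layer: never looks inside a blob.
final class ExecutionLayer {
    // Hands each opaque batch to a stage; a real engine would also handle
    // communication, dependencies, and retries at this level.
    static List<byte[]> run(List<byte[]> batches, Function<byte[], byte[]> stage) {
        List<byte[]> out = new ArrayList<>();
        for (byte[] batch : batches) {
            out.add(stage.apply(batch));
        }
        return out;
    }
}

final class LayersDemo {
    public static void main(String[] args) {
        List<byte[]> batches = List.of("a\nb".getBytes(StandardCharsets.UTF_8));
        List<byte[]> result = ExecutionLayer.run(batches,
            blob -> OperatorLayer.apply(blob, record -> record + "!"));
        System.out.println(new String(result.get(0), StandardCharsets.UTF_8));
    }
}
```

Because `ExecutionLayer` only sees `byte[]`, the record encoding could change without touching the code that schedules and ships batches, which is the separation the slide describes.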
  21. Hadoop Integration
      • Hadoop data sources
        • Hadoop FileSystem API (HDFS/MapR-FS)
        • HBase
      • Hadoop data formats
        • Apache Avro
        • RCFile
      • MapReduce-based tools to create column-based formats
      • Table registry in HCatalog
      • Run long-running services in YARN
  22. Fully Open
  23. Momentum
      • Over 200 people on the Drill mailing list
      • Over 200 members of the Bay Area Drill User Group
      • Over 100 participants at the first meetup in Sunnyvale, CA
        • MapR, Cisco, Intel, eBay, Google, Yahoo!, LinkedIn, …
      • Drill meetups across the US and Europe
      • OpenDremel team and source code merged with Apache Drill
      • Simba Technologies, the ODBC inventor, is developing a Drill ODBC driver
        • Tableau, MicroStrategy, Excel, SAP Crystal Reports, …
  24. Storm
  25. Before Storm
      Queues, Workers
  26. Example (simplified)
  27. Storm
      • Guaranteed data processing
      • Horizontal scalability
      • Fault-tolerance
      • No intermediate message brokers!
      • Higher level abstraction than message passing
      • "Just works"
  28. Concepts
  29. Streams
      Unbounded sequence of tuples
  30. Spouts
      Source of streams
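  31. Spouts

      public interface ISpout extends Serializable {
          // Called once when the spout task is initialized.
          void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
          // Called when the spout is shutting down.
          void close();
          // Called repeatedly; emit the next tuple(s) through the collector.
          void nextTuple();
          // The tuple emitted with this message id was fully processed.
          void ack(Object msgId);
          // The tuple emitted with this message id failed to be fully processed.
          void fail(Object msgId);
      }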
  32. Bolts
      Processes input streams and produces new streams
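  33. Bolts

      // Package names as in Storm releases of this era (backtype.storm.*).
      import java.util.Map;
      import backtype.storm.task.OutputCollector;
      import backtype.storm.task.TopologyContext;
      import backtype.storm.topology.OutputFieldsDeclarer;
      import backtype.storm.topology.base.BaseRichBolt;
      import backtype.storm.tuple.Fields;
      import backtype.storm.tuple.Tuple;
      import backtype.storm.tuple.Values;

      public class DoubleAndTripleBolt extends BaseRichBolt {
          private OutputCollector _collector;

          public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
              _collector = collector;
          }

          public void execute(Tuple input) {
              int val = input.getInteger(0);
              // Anchor the output to the input tuple, then ack the input.
              _collector.emit(input, new Values(val * 2, val * 3));
              _collector.ack(input);
          }

          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("double", "triple"));
          }
      }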
  34. Topologies
      Network of spouts and bolts
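As a rough mental model of slides 29 to 34, the plain-Java sketch below wires one spout to a chain of bolts and pushes tuples through them one at a time. It deliberately does not use the Storm API: `ToySpout`, `ToyBolt`, and `ToyTopology` are hypothetical stand-ins, there is no parallelism, acking, or fault tolerance, and the unbounded stream is cut off by a `tupleLimit` so the demo terminates. The first bolt mirrors the arithmetic of DoubleAndTripleBolt from slide 33.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy stand-ins for Storm concepts; none of these types are part of the Storm API.
final class ToyTopology {
    interface ToySpout { List<Object> nextTuple(); }
    interface ToyBolt { List<List<Object>> execute(List<Object> input); }

    static List<Object> tuple(Object... values) {
        return Arrays.asList(values);
    }

    // Pulls tupleLimit tuples from the spout and pushes each through the bolt chain.
    static List<List<Object>> run(ToySpout spout, List<ToyBolt> bolts, int tupleLimit) {
        List<List<Object>> sink = new ArrayList<>();
        for (int i = 0; i < tupleLimit; i++) {
            push(spout.nextTuple(), bolts, 0, sink);
        }
        return sink;
    }

    // Sends one tuple to bolt number index; tuples leaving the last bolt reach the sink.
    private static void push(List<Object> tuple, List<ToyBolt> bolts, int index,
                             List<List<Object>> sink) {
        if (index == bolts.size()) {
            sink.add(tuple);
            return;
        }
        for (List<Object> out : bolts.get(index).execute(tuple)) {
            push(out, bolts, index + 1, sink);
        }
    }

    public static void main(String[] args) {
        int[] counter = {0};
        ToySpout numbers = () -> tuple(++counter[0]);          // emits 1, 2, 3, ...
        ToyBolt doubleAndTriple = in -> {
            int val = (Integer) in.get(0);
            return Collections.singletonList(tuple(val * 2, val * 3));
        };
        ToyBolt keepMultiplesOfFour = in -> {
            if (((Integer) in.get(0)) % 4 == 0) {
                return Collections.singletonList(in);           // pass the tuple on
            }
            return Collections.emptyList();                     // drop the tuple
        };
        System.out.println(run(numbers,
            Arrays.asList(doubleAndTriple, keepMultiplesOfFour), 4));
    }
}
```

In real Storm the wiring is declared up front and the cluster runs many instances of each spout and bolt in parallel; this sketch only shows the data flow through the network.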
  35. Trident
      Cascading for Storm
  36. Trident

      TridentTopology topology = new TridentTopology();
      TridentState wordCounts =
          topology.newStream("spout1", spout)
              .each(new Fields("sentence"), new Split(), new Fields("word"))
              .groupBy(new Fields("word"))
              .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
              .parallelismHint(6);
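For readers new to Trident, the pipeline on slide 36 computes a running word count. The sketch below is a batch analogue of the same three steps (split, group, count) using java.util.stream on a finite list. It is not Trident code: `WordCountSketch` is a hypothetical name, and it omits everything Trident adds around the computation, such as the unbounded stream and the persisted state.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Batch analogue of the Trident word count; a sketch, not Trident itself.
final class WordCountSketch {
    static Map<String, Long> wordCounts(List<String> sentences) {
        return sentences.stream()
            // Roughly each(new Fields("sentence"), new Split(), new Fields("word")).
            .flatMap(sentence -> Arrays.stream(sentence.split(" ")))
            .filter(word -> !word.isEmpty())
            // Roughly groupBy(new Fields("word")) followed by a Count aggregate.
            .collect(Collectors.groupingBy(word -> word, TreeMap::new,
                Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCounts(List.of("the cow jumped", "the moon")));
    }
}
```

The TreeMap is used only so the printed counts come out in a stable, sorted order.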
  37. Interoperability
  38. Spouts
      • Kafka (with transactions)
      • Kestrel
      • JMS
      • AMQP
  39. Bolts
      • Functions
      • Filters
      • Aggregation
      • Joins
      • Talk to databases, Hadoop write-behind
  40. Diagram labels: Raw Data, Queue, Parallel Cluster Ingest, Storm (realtime processes), Hadoop (batch processes), Apps, Business Value
  41. Diagram labels: Raw Data, Queue, TailSpout, Storm (realtime processes), Georg, Hadoop (batch processes), Apps, Business Value
  42. Georg and TailSpout
  43. Get Involved!
      • Slides
      • Join the Apache Drill mailing list
      • Watch TailSpout & Georg development
        • {tdunning | boorad | rlankenau}/mapr-spout
      • Join MapR
      • @boorad