Your SlideShare is downloading. ×
PhillyDB Talk - Beyond Batch
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.

Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

PhillyDB Talk - Beyond Batch


Published on

The venerable MapReduce framework has allowed Hadoop to prove its worth in the big data space, and to store and analyze much larger data sets than was possible before. But there is a lot of activity …

The venerable MapReduce framework has allowed Hadoop to prove its worth in the big data space, and to store and analyze much larger data sets than was possible before. But there is a lot of activity in the big data ecosystem currently surrounding other major categories of workflows beyond batch.

These emerging tools include low latency i/o (HBase), interactive queries (Drill), stream processing (Storm), and text processing / indexing (Solr). This talk discusses some of the more interesting developments in Drill and Storm, their capabilities, and how they are being put to use in real world situations.

Published in: Technology
  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Beyond Batch Drill & Storm Brad Anderson©MapR Technologies
  • 2. whoami• Brad Anderson• Solutions Architect at MapR (Atlanta)• ATLHUG co-chair• „boorad‟ most places (twitter, github)•©MapR Technologies
  • 3. MapR - Faster and More ScalableBenchmark MapR 2.1.1 CDH 4.1.1 MapR Speed IncreaseTerasort (1x replication, compression disabled)Total 13m 35s 26m 6s 1.9xMap 7m 58s 21m 8s 2.7xReduce 13m 32s 23m 37s 1.7xDFSIO throughput/node MapR/Googl ApacheRead 1003 MB/s 656 MB/s 1.5x e HadoopWrite 924 MB/s 654 MB/s 1.4x Time 54 sec 62 secYCSB (50% read, 50% update) Nodes 1,003 1,460Throughput 36,584.4 op/s 12,500.5 op/s 2.9x Disks 1,003 5,840 Cores 4,012 11,680Runtime 3.80 hr 11.11 hr 2.9xYCSB (95% read, 5% update)Throughput 24,704.3 op/s 10,776.4 op/s 2.3xRuntime 0.56 hr 1.29 hr 2.3x Benchmark hardware configuration: 10 servers, 12 x 2 cores (2.4 GHz), 12 x 2TB, 48 GB, 1 x 10GbE
  • 4. Beyond Batch HBase & M7 Apache Drill Storm Solr & Elastic Search©MapR Technologies
  • 5. Latency Matters Batch Interactive Streaming©MapR Technologies
  • 6. Big Data Picture Batch processing Interactive analysis Stream processingQuery runtime Minutes to hours Milliseconds to minutes Never-endingData volume TBs to PBs GBs to PBs Continuous streamProgramming model MapReduce Queries DAGUsers Developers Analysts and Developers DevelopersGoogle project MapReduce DremelOpen source project Hadoop MapReduce Storm, S4 Apache Drill ©MapR Technologies
  • 7. Interactive SQL Initiatives for Hadoop SQL based OLTP SQL based analytics Real-time interactive queries Impala* Real-time SQL conversion interactive queries to MapReduce* Does not work with other distributions
  • 8. ©MapR Technologies
  • 9. Google Dremel• Interactive analysis of large-scale datasets • Trillion records at interactive speeds • Complementary to MapReduce • Used by thousands of Google employees • Paper published at VLDB 2010• Model • Nested data model with schema • Most data at Google is stored/transferred in Protocol Buffers • SQL-like query language with nested data support• Implementation • Column-based storage and processing • In-situ data access (GFS and Bigtable) • Tree architecture as in Web search (and databases)©MapR Technologies
  • 10. Google BigQuery• Hosted Dremel (Dremel as a Service)• CLI (bq) and Web UI• Import data from Google Cloud Storage or local files • Files must be in CSV format • Nested data not supported [yet] except built-in datasets • Schema definition required©MapR Technologies
  • 11. Drill Design PrinciplesFlexible Easy•Pluggable query languages •Unzip and run•Extensible execution engine •Zero configuration•Pluggable data formats •Reverse DNS not needed • Columns and Rows •IP addresses can change • Schema and Schema-less •Clear and concise log messages•Pluggable data sourcesFast Dependable•C/C++ core with Java support •No SPOF • Google C++ style guide •Instant recovery from crashes•Min latency and max throughput(limited only by hardware) ©MapR Technologies
  • 12. DrQL Example DocId: 10 Links Forward: 20 SELECT DocId AS Id, Forward: 40 COUNT(Name.Language.Code) WITHIN Name AS Cnt, Forward: 60 Name.Url + , + Name.Language.Code AS Str Name FROM t Language WHERE REGEXP(Name.Url, ^http) AND DocId < 20; Code: en-us Country: us Language Code: en Id: 10 Url: http://A Name Name Cnt: 2 Url: http://B Language Name Str: http://A,en-us Language Str: http://A,en Code: en-gb Name Country: gb Cnt: 0©MapR Technologies * Example from the Dremel paper
  • 13. Data Flow©MapR Technologies
  • 14. Extensibility• Nested query languages • DrQL • Mongo Query Language • Cascading, Hive, Pig• Distributed execution engine • Extensible model (eg, Dryad) • Low-latency • Fault tolerant©MapR Technologies
  • 15. ExtensibilityNested data formats • Pluggable model • Column-based (ColumnIO/Dremel, Trevni, RCFile) • Row-based (RecordIO, Avro, JSON, CSV) • Schema (Protocol Buffers, Avro, CSV) • Schema-less (JSON, BSON)Scalable data sources • Pluggable model • Hadoop • HBase©MapR Technologies
  • 16. Drill Architecture Client Cluster Execu2on4 Driver Parser Compiler Data4 Source Engine Query4 (text) AST4 (text) Plan4 (text) API Public interfaces enable extensibility – Add a new query language by implementing a parser – Add a new data source by implementing an API – Provide a plan directly to the execution engine to control execution Each level of the plan has a human readable representation – Facilitates debugging and development
  • 17. Drill Architecture (2) DrQL% Clients Driver Drill% Query% Servers DrQL% Parser Compiler Drill% Worker Drill%Worker Cascading/Pig/...% Clients Other% Parser Drill%Worker Intermediate% Driver Parser
  • 18. Query Components• Query components: • SELECT • FROM • WHERE • GROUP BY • HAVING • JOIN• Key logical operators: • Scan • Filter • Aggregate • Join©MapR Technologies
  • 19. Scan Operators• Drill supports multiple data formats by having per-format scan operators• Queries involving multiple data formats/sources are supported• Fields and predicates can be pushed down into the scan operator• Scan operators may have adaptive side-effects (database cracking)• Produce ColumnIO from RecordIO• Google PowerDrill stores materialized expressions with the data Scan with schema Scan without schema Operator output Protocol Buffers JSON-like (MessagePack) ColumnIO (column-based protobuf/Dremel) Supported data JSON RecordIO (row-based protobuf) formats HBase CSV SELECT … ColumnIO(proto URI, data URI) Json(data URI) FROM … RecordIO(proto URI, data URI) HBase(table name)©MapR Technologies
  • 20. Execution Engine Layers• Drill execution engine has two layers • Operator layer is serialization-aware • Processes individual records • Execution layer is not serialization-aware • Processes batches of records (blobs) • Responsible for communication, dependencies and fault tolerance©MapR Technologies
  • 21. Hadoop Integration• Hadoop data sources • Hadoop FileSystem API (HDFS/MapR-FS) • HBase• Hadoop data formats • Apache Avro • RCFile• MapReduce-based tools to create column-based formats• Table registry in HCatalog• Run long-running services in YARN©MapR Technologies
  • 22. Fully Open©MapR Technologies
  • 23. MomentumOver 200 people on the Drill mailing listOver 200 members of the Bay Area Drill User GroupOver 100 participants the first meetup in Sunnyvale, CA • MapR, Cisco, Intel, eBay, Google, Yahoo!, LinkedIn, …Drill meetups across the US and EuropeOpenDremel team and source code merged with Apache DrillSimba Technologies – ODBC inventor developing a Drill ODBC driver • Tableau, MicroStrategy, Excel, SAP Crystal Reports, …
  • 24. Storm©MapR Technologies
  • 25. Before Storm Queues Workers©MapR Technologies
  • 26. Example©MapR Technologies (simplified)
  • 27. Storm Guaranteed data processing Horizontal scalability Fault-tolerance No intermediate message brokers! Higher level abstraction than message passing “Just works”©MapR Technologies
  • 28. Concepts©MapR Technologies
  • 29. Streams Tuple Tuple Tuple Tuple Tuple Tuple Tuple Unbounded sequence of tuples©MapR Technologies
  • 30. Spouts Source of streams©MapR Technologies
  • 31. Spoutspublic interface ISpout extends Serializable { void open(Map conf, TopologyContext context, SpoutOutputCollector collector); void close(); void nextTuple(); void ack(Object msgId); void fail(Object msgId);}©MapR Technologies
  • 32. Bolts Tuple Tuple Tuple TupleProcesses input streams and produces new streams ©MapR Technologies
  • 33. Bolts public class DoubleAndTripleBolt extends BaseRichBolt { private OutputCollectorBase _collector; public void prepare(Map conf, TopologyContext context, OutputCollectorBase collector) { _collector = collector; } public void execute(Tuple input) { int val = input.getInteger(0); _collector.emit(input, new Values(val*2, val*3)); _collector.ack(input); } public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("double", "triple")); } }©MapR Technologies
  • 34. Topologies Network of spouts and bolts©MapR Technologies
  • 35. TridentCascading for Storm©MapR Technologies
  • 36. TridentTridentTopology topology = new TridentTopology();TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")) .parallelismHint(6); ©MapR Technologies
  • 37. Interoperability©MapR Technologies
  • 38. Spouts Kafka (with transactions) Kestrel JMS AMQP©MapR Technologies
  • 39. Bolts Functions Filters Aggregation Joins Talk to databases, Hadoop write- behind©MapR Technologies
  • 40. Storm realtime processes Apps QueueRaw BusinesData s Value HadoopParallel Cluster Ingest batch processes ©MapR Technologies
  • 41. Storm realtime processes Apps TailSpout QueueRaw BusinesData s Georg Value Hadoop batch processes ©MapR Technologies
  • 42. Georg and TailSpout
  • 43. Get Involved!• Slides •• Join the Apache Drill mailing list •• Watch TailSpout & Georg development •{tdunning | boorad | rlankenau}/mapr-spout• Join MapR • •• @boorad©MapR Technologies