Your SlideShare is downloading. ×
2013 year of real-time hadoop
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.


Saving this for later?

Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime - even offline.

Text the download link to your phone

Standard text messaging rates apply

2013 year of real-time hadoop


Published on

geoffrey hendrey

geoffrey hendrey

Published in: Technology

1 Comment
No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. 2013: year of real-time access to Big Data? Geoffrey Hendrey @geoffhendrey @vertascale
  • 2. Agenda• Motivation• Hadoop stack & data formats• File access times and mechanics• Key-based indexing systems (HBase)• MapReduce, Hive/Pig• MPP approaches & alternatives
  • 3. Motivation• Big Data is more opaque than small data – Spreadsheets choke – BI tools can’t scale – Small samples often fail to replicate issues• Engineers, data scientists, analysts need: – Faster “time to answer” on Big Data – Rapid “find, quantify, extract”• Solve “I don’t know what I don’t know”
  • 4. Survey or real-time capabilities• Real-time, in-situ, self-service is the “Holy Grail” for the business analyst• spectrum of real-time capabilities exists on Hadoop Available in Hadoop Proprietary HDFS HBase Drill Easy Hard
  • 5. Hadoop Stack Review
  • 6. Real-time spectrum on HadoopUse Case Support Real-timeSeek to a particular byte in a distributed HDFS YESfileSeek to a particular value in a distributed HBase YESfile, by key (1-dimensional indexing)Answer complex questions expressible in MapReduce NOcode (e.g. matching users to music (Hive, Pig)albums). Data science.Ad-hoc query for scattered records given MPP YESsimple constraints (“field*4+==“music” && Architecturesfield*9+==“dvd”)
  • 7. Hadoop Underpinned By HDFS• Hadoop Distributed File System (HDFS)• inspired by Google FileSystem (GFS)• underpins every piece of data in “Hadoop”• Hadoop FileSystem API is pluggable• HDFS can be replaced with other suitable distributed filesystem – S3 – kosmos – etc
  • 8. Amazon S3
  • 9. Distributed File System
  • 10. Distributed File System Mechanics
  • 11. HDFS performance characteristics• HDFS was designed for high throughput, not low seek latency• best-case configurations have shown HDFS to perform 92K/s random reads []• Personal experience: HDFS very robust. Fault tolerance is “real”. I’ve unplugged machines and never lost data.
  • 12. Typical Hadoop File Formats
  • 13. MapFile for real-time access? – Index file must be loaded by client (slow) – Index file must fit in RAM of client by default – scan an average of 50% of the sampling interval – Large records make scanning intolerable – not a viable “real world” solution for random access
  • 14. Apache HBase• Clone of Google’s Big Table.• Key-based access mechanism• Designed to hold billions of rows• “Tables” stored in HDFS• Supports MapReduce over tables, into tables• Requires you to think hard, and commit to a key design.
  • 15. HBase Architecture
  • 16. HBase random read performance• 7 servers, each with • 8 cores • 32GB DDR3 and • 24 x 146GB SAS 2.0 10K RPM disks.• Hbase table • 3 billion records, • 6600 regions. • data size is between 128-256 bytes per row, spread in 1 to 5 columns.
  • 17. Zoomed-in “Get” time histogram
  • 18. MapReduce• “MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers”-wikipedia• MapReduce is strongly tied to HDFS in Hadoop.• Systems built on HDFS (i.e. HBase) leverage this common foundation for integration with the MR paradigm
  • 19. MapReduce and Data Science• Many complex algorithms can be expressed in the MapReduce paradigm – NLP – Graph processing – Image codecs• The more complex the algorithm, the more Map and Reduce processes become complex programs in their own right.• Often cascade multiple MR jobs in succession
  • 20. A very bad* diagram*this diagram makes it appear that data flows through the master node.
  • 21. A better picture
  • 22. Is MapReduce real-time?• MapReduce on Hadoop has certain latencies that are hard to improve – Copy – Shuffle, sort – Iterate• time-dependent on the both the size of the input data and the number of processors available• In a nutshell, it’s a “batch process” and isn’t “real-time”
  • 23. Hive and Pig• Run on top of MapReduce• Provide “Table” metaphor familiar to SQL users• Provide SQL-like (or actually same) syntax• Store a “schema” in a database, mapping tables to HDFS files• Translate “queries” to MapReduce jobs• No more real-time than MapReduce
  • 24. MPP Architectures• Massively Parallel Processing• Lots of machines, so also lots of memoryExamples:• Spark – general purpose data science framework sort of like real-time MapReduce for data science• Dremel – columnar approach, geared toward answering SQL-like aggregations and BI-style questions
  • 25. Spark• Originally designed for iterative machine learning problems at Berkeley• MapReduce does not do a great job on iterative workloads• Spark makes more explicit use of memory caches than Hadoop• Spark can load data from any Hadoop input source
  • 26. Effect of Memory Caching in Spark
  • 27. Is Spark Real-time?• If data fits in memory, execution time for most algorithms still depends on – amount of data to be processed – number of processors• So, it still “depends”• …but definitely more focused on fast time-to- answer• Interactive scala and java shells
  • 28. Dremel MPP architecture• MPP architecture for ad-hoc query on nested data• Apache Drill is an OS clone of Dremel• Dremel originally developed at Google• Features “in situ” data analysis• “Dremel is not intended as a replacement for MR and is often used in conjunction with it to analyze outputs of MR pipelines or rapidly prototype larger computations.” -Dremel: Interactive Analysis of WebScaleDatasets
  • 29. In Situ Analysis• Moving Big Data is a nightmare• In situ: ability to access data in place – In HDFS – In Big Table
  • 30. Uses For Dremel At Google• Analysis of crawled web documents.• Tracking install data for applications on Android Market.• Crash reporting for Google products.• OCR results from Google Books.• Spam analysis.• Debugging of map tiles on Google Maps.• Tablet migrations in managed Bigtable instances.• Results of tests run on Google’s distributed build system.• Etc, etc.
  • 31. Why so many uses for Dremel?• On any Big Data problem or application, dev team faces these problems: – “I don’t know what I don’t know” about data – Debugging often requires finding and correlating specific needles in the haystack – Support and marketing often require segmentation analysis (identify and characterize wide swaths of data)• Every developer/analyst wants – Faster time to answer – Fewer trips around the mulberry bush
  • 32. Column Oriented Approach
  • 33. Dremel MPP query execution tree
  • 34. Is Dremel real-time?
  • 35. Alternative approaches?• Both MapReduce and MPP query architectures take “throw hardware at the problem” approach.• Alternatives? – Use MapReduce to build distributed indexes on data – Combine columnar storage and inverted indexes to create columnar inverted indexes – Aim for the sweet spot for data scientist and engineer: Ad-hoc queries with results returned in seconds on a single processing node.
  • 36. Contact Info Email: Twitter: @geoffhendrey @vertascale www:
  • 37. references• the-elephant/• clusters-and-the-network/••••• s3_growth_2012_q1_1.png••• view1.png• Dremel: Interactive Analysis of WebScale Datasets