BDM8 - Near-realtime Big Data Analytics using Impala


A quick overview of the information I've gathered on Cloudera Impala. It describes use cases for Impala and what not to use Impala for. Presented at Big Data Montreal #8 at RPM Startup Center.

Published in: Technology

  1. Near-realtime Big Data Analytics using Impala. David Lauzon, Big Data Montreal #8, January 10th 2013
  2. Plan
     • What is Impala?
     • Why Google built Dremel?
     • Use cases for Impala
     • Use cases for Map-Reduce
     • Cloudera Customer Survey
     • Impala Features
     • Impala Performance Expectations and Benchmarks
     • Impala Components
     • Impala Architecture
     • Impala Development Roadmap
     • Where to learn more and get started
  3. Disclaimer
     • To keep the descriptions of Dremel and Impala as accurate as possible, most of the content in this presentation was gathered from the authors of the respective technologies. References are listed at the end of the presentation.
     • I am not affiliated with or sponsored by Cloudera or Google.
  4. What is Impala? “An impala is an athletic, graceful African antelope, famous for its velocity and its agility to jump” - Wikipedia
  5. Seriously, what is Impala?
     • “Impala enables real-time, interactive, analytical queries of the data stored in HBase or HDFS” – Cloudera
     • Inspired by the Google Dremel paper (2010)
       – BigQuery is a Dremel implementation offered as a service
         • It's proprietary, not free, and requires uploading your data to Google servers
  6. Why Google built Dremel?
     • Problems with Data Warehouse Solutions for OLAP/BI:
       – Relational OLAP (ROLAP):
         • Need to build indices for every possible query (for performance reasons), so the indices could take up all of the RAM
       – Multi-dimensional OLAP (MOLAP):
         • Requires extensive time and money to design and build the data cubes
       – Ad-hoc query (specific non-optimised query):
         • When you don't know what you'll need, or need to work in iterations, i.e. quite often!
     • Solution:
       – Increase full-scan speed without requiring indexing or pre-aggregated values
  7. Come on, give me some use cases!
     • Finding particular records with specified conditions:
       – “Find all the locations from which account ‘ABC’ was accessed.”
     • Quick aggregation of statistics with dynamically changing conditions:
       – “Can you give me yesterday's number of impressions for Google AdWords display ads – but only in the Tokyo region?”
     • Trial-and-error data analysis:
       – “And between 11am and 1pm?”
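
The trial-and-error pattern in these use cases can be sketched as a series of ad-hoc SQL filters that get narrower with each question. This is only an illustration: it uses Python's built-in `sqlite3` module as a small in-memory stand-in rather than Impala itself, and the `impressions` table, its columns, the sample rows and the `total_views` helper are all hypothetical.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for an Impala-backed table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE impressions "
    "(day TEXT, hour INTEGER, region TEXT, network TEXT, views INTEGER)"
)
# Hypothetical sample data.
rows = [
    ("2013-01-09", 10, "Tokyo", "display", 120),
    ("2013-01-09", 11, "Tokyo", "display", 200),
    ("2013-01-09", 12, "Tokyo", "display", 150),
    ("2013-01-09", 12, "Osaka", "display", 90),
    ("2013-01-09", 14, "Tokyo", "search", 300),
]
conn.executemany("INSERT INTO impressions VALUES (?, ?, ?, ?, ?)", rows)


def total_views(where, params):
    """Sum the views matching an ad-hoc WHERE clause (hypothetical helper)."""
    sql = "SELECT COALESCE(SUM(views), 0) FROM impressions WHERE " + where
    return conn.execute(sql, params).fetchone()[0]


# "Yesterday's display impressions, but only in the Tokyo region?"
tokyo = total_views(
    "day = ? AND network = ? AND region = ?",
    ("2013-01-09", "display", "Tokyo"),
)

# "And between 11am and 1pm?" (same question, one more condition)
tokyo_lunch = total_views(
    "day = ? AND network = ? AND region = ? AND hour >= ? AND hour < ?",
    ("2013-01-09", "display", "Tokyo", 11, 13),
)

print(tokyo, tokyo_lunch)
```

Each follow-up question only adds a condition to the previous query, which is why full-scan speed matters more here than a pre-built index for any single query shape.
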
  8. Use cases for which you should stick with Map-Reduce based applications
     • Very long running, batch-oriented tasks such as ETL:
       – e.g. exporting large amounts of data after processing
     • Complex event processing:
       – e.g. stream-processing
     • “Complex data mining on Big Data which requires multiple iterations and paths of data processing with programmed algorithms” - Google
  9. Integration with Hadoop
     • Cloudera Customer Survey (Aug. 2012)
       – 80% need faster queries on Hadoop data
       – 65% query Hadoop using Hive
       – 70% move data from Hadoop to RDBMS for interactive SQL
       – 60% see value today in consolidating to a single platform
  10. Impala Features
      • Shared with Hive:
        – Hive MetaStore
        – Hive SQL (most common SQL-92 features)
        – ODBC Driver
        – User Interface (Hue Beeswax)
      • Specific to Impala:
        – No MapReduce; in-memory data transfers instead
        – Host and disk awareness (data locality)
        – Table data caching in RAM
        – No virtual columns or locking
  11. Impala Performance Expectations
      • Performance improvements over Hive
        – 3 - 4X for purely I/O bound queries
        – 7 - 45X for queries with at least one join
        – 20 - 90X when data is available in the cache
  12. External Benchmarks
      • Searching log files at 37signals (creators of the Ruby on Rails web framework)

      Workload                                                    Impala query time  Hive query time  MySQL query time
      5.2 GB HAProxy log – top IPs by request count               3.1s               65.4s            146.0s
      5.2 GB HAProxy log – top IPs by total request time          3.3s               65.2s            164.0s
      800 MB parsed Rails log – slowest accounts                  1.0s               33.2s            48.1s
      800 MB parsed Rails log – highest database time paths       1.1s               33.7s            49.6s
      8 GB pageview table – daily pageviews and unique visitors   22.4s              92.2s            180.0s
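
To compare these numbers with the expected speedups on the previous slide, the ratios can be worked out directly. The sketch below only divides the query times listed in the benchmark above; the dictionary layout and the `speedup` helper are mine, not part of the original benchmark.

```python
# Query times in seconds, copied from the 37signals benchmark above:
# workload -> (Impala, Hive, MySQL)
benchmarks = {
    "HAProxy log: top IPs by request count": (3.1, 65.4, 146.0),
    "HAProxy log: top IPs by total request time": (3.3, 65.2, 164.0),
    "Rails log: slowest accounts": (1.0, 33.2, 48.1),
    "Rails log: highest database time paths": (1.1, 33.7, 49.6),
    "Pageview table: daily pageviews and unique visitors": (22.4, 92.2, 180.0),
}


def speedup(baseline_s, impala_s):
    """How many times faster Impala was than the baseline engine."""
    return baseline_s / impala_s


hive_speedups = {}
mysql_speedups = {}
for name, (impala_s, hive_s, mysql_s) in benchmarks.items():
    hive_speedups[name] = round(speedup(hive_s, impala_s), 1)
    mysql_speedups[name] = round(speedup(mysql_s, impala_s), 1)

for name in benchmarks:
    print(f"{name}: {hive_speedups[name]}x vs Hive, {mysql_speedups[name]}x vs MySQL")
```

On these five workloads the speedup over Hive ranges from roughly 4x (the 8 GB pageview table) to roughly 33x (the Rails log queries).
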
  13. Impala Components
      • Impala State Store: 1 per cluster
        – Coordinates information (location and status) about all the running impalad instances
      • Impala Daemon: 1 per DataNode
        – Coordinates and executes queries
        – Distributes query fragments to other Impala Daemons
      • Impala Shell: 1 per node
        – Provides a command-line interface for interacting with Impala
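
How these components cooperate on a query can be sketched with a toy scatter-gather model: one daemon acts as coordinator, asks the state store which daemons are running, sends each a fragment to run on its local data, and merges the partial results. This is a simplified illustration, not Impala's implementation: the `StateStore` and `Daemon` classes, their methods and the sample rows are all hypothetical.

```python
from collections import Counter


class StateStore:
    """Toy registry of running daemons (one per cluster)."""

    def __init__(self):
        self.daemons = {}

    def register(self, daemon):
        self.daemons[daemon.host] = daemon

    def live_daemons(self):
        return list(self.daemons.values())


class Daemon:
    """Toy daemon (one per DataNode) holding the rows stored on that node."""

    def __init__(self, host, local_rows):
        self.host = host
        self.local_rows = local_rows

    def run_fragment(self, predicate, key):
        """Run one query fragment against local data only: filter, then count by key."""
        return Counter(key(row) for row in self.local_rows if predicate(row))

    def coordinate(self, state_store, predicate, key):
        """Act as coordinator: fan the fragment out to every daemon, merge the partials."""
        total = Counter()
        for daemon in state_store.live_daemons():
            total.update(daemon.run_fragment(predicate, key))
        return dict(total)


store = StateStore()
node1 = Daemon("node1", [{"ip": "10.0.0.1", "status": 200},
                         {"ip": "10.0.0.2", "status": 500}])
node2 = Daemon("node2", [{"ip": "10.0.0.1", "status": 200},
                         {"ip": "10.0.0.2", "status": 200},
                         {"ip": "10.0.0.1", "status": 404}])
store.register(node1)
store.register(node2)

# Roughly: SELECT ip, COUNT(*) FROM requests WHERE status = 200 GROUP BY ip
result = node1.coordinate(store, lambda r: r["status"] == 200, lambda r: r["ip"])
print(result)
```

In this model any daemon can coordinate a query, and each fragment only reads rows held by its own node, which mirrors the data-locality point made on the features slide.
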
  14. Impala Architecture
  15. Roadmap: 0.3 Beta Version
      • Operating System:
        – Only RHEL/CentOS 6.2 is supported
      • File formats:
        – Text files, SequenceFiles, HBase table
      • Compression:
        – Snappy, Gzip, BZip
      • No UDFs or user extensibility *
      • Largest table in joins must be specified first *
      • Right side of join must fit in RAM *
      • No support for complex nested structures, e.g. maps, structs and arrays *
      * Top asks for the post-1.0 G.A. version
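
The two join limits listed above make sense if the join is executed as an in-memory hash join. The slide does not name the algorithm, so take this as an assumption: the sketch below builds a hash table from the whole right side (which is why it must fit in RAM) and streams the left side past it (which is why the largest table goes first). The `hash_join` function and the sample tables are hypothetical.

```python
def hash_join(left_rows, right_rows, left_key, right_key):
    """Inner join: hash the right side in memory, stream the left side."""
    # Build phase: the entire right side is held in a dict,
    # so its size is bounded by available memory.
    table = {}
    for right in right_rows:
        table.setdefault(right[right_key], []).append(right)

    # Probe phase: left rows are read one at a time and never stored,
    # so the left side can be arbitrarily large.
    for left in left_rows:
        for right in table.get(left[left_key], []):
            yield {**left, **right}


# Large fact table on the left (could be a generator over files on disk).
pageviews = [
    {"account_id": 1, "path": "/home"},
    {"account_id": 2, "path": "/login"},
    {"account_id": 1, "path": "/settings"},
    {"account_id": 3, "path": "/home"},
]
# Small dimension table on the right.
accounts = [
    {"id": 1, "name": "ABC"},
    {"id": 2, "name": "XYZ"},
]

joined = list(hash_join(pageviews, accounts, "account_id", "id"))
for row in joined:
    print(row["name"], row["path"])
```

Swapping the two inputs would produce the same matches but would force the large table into memory, which is the situation the "largest table first" rule avoids.
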
  16. Roadmap: 1.0 General Availability (Q1 2013)
      • File Formats:
        – RCFile, Avro, LZO
        – Trevni: new columnar file format by Doug Cutting
      • More OS Support:
        – Same as those supported by CDH4
      • Performance:
        – Faster, bigger, and more memory-efficient joins and aggregations
        – Straggler handling:
          • more work to faster machines, and less to slower machines
      • DDL: enables users to create tables from Impala
      • JDBC Driver (shared with Hive)
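
The straggler-handling item ("more work to faster machines, and less to slower machines") can be illustrated with a proportional split. The roadmap does not say how Impala will do this, so the following is only one possible scheme under my own assumptions: data blocks are divided in proportion to each node's measured throughput, with leftovers assigned by largest remainder. The `assign_blocks` function, host names and throughput figures are hypothetical.

```python
def assign_blocks(num_blocks, throughput):
    """Split num_blocks across hosts in proportion to their throughput.

    throughput maps host name -> measured speed (any consistent unit).
    """
    total = sum(throughput.values())
    # Ideal (fractional) share for each host.
    shares = {host: num_blocks * speed / total for host, speed in throughput.items()}
    # Start from the whole-number part of each share.
    assigned = {host: int(share) for host, share in shares.items()}
    # Hand out the remaining blocks to the hosts with the largest fractional parts.
    leftover = num_blocks - sum(assigned.values())
    by_remainder = sorted(shares, key=lambda h: shares[h] - assigned[h], reverse=True)
    for host in by_remainder[:leftover]:
        assigned[host] += 1
    return assigned


plan = assign_blocks(10, {"fast": 300, "medium": 200, "slow": 100})
print(plan)
```

With an even split the slowest node would finish last and hold up the whole query; weighting by throughput aims to have all nodes finish at about the same time.
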
  17. Where to learn more and get started
      • Impala Documentation
      • Cloudera's Impala Demo VM
      • Cloudera Blog
      • Impala-user Google Group
      • (Unofficial) presentation at Apache Asia Road Show
      • Official announcement of Impala at Strata Conference NY 2012
      • Dremel: Interactive Analysis of Web-Scale Datasets
      • BigQuery Technical White Paper
  18. Conclusion
      • Use Impala when you need to quickly find or compute a small amount of data from a large data source
      • Impala does not replace batch-oriented jobs
      • The Impala beta and its documentation are quite good for a beta
        – If you can't wait for Impala v1.0, try BigQuery