BDM8 - Near-realtime Big Data Analytics using Impala

Quick overview of all the information I've gathered on Cloudera Impala. It describes use cases for Impala and what not to use Impala for. Presented at Big Data Montreal #8 at RPM Startup Center.

  • Database containing the test analysis data for patient specimens, together with the results. Running analytical queries against the production database is very slow and can interfere with normal operation with…
  • Marcel Kornacker is the architect of Impala. Prior to joining Cloudera, he was the lead developer for the query engine of Google’s F1 project.
  • You may well have an OLAP cube, but not for this specific use case…
  • Impala uses SSE4.2 for checksumming (2X faster than without SSE4.2), e.g. Intel Nehalem+, AMD Bulldozer+
  • 37signals – web application company (where Ruby on Rails originated)
  • Recall the objective of the talk:
    Key points:
    A meaningful, easy-to-remember message:

Transcript

  • 1. Near-realtime Big Data Analytics using Impala
    David Lauzon
    Big Data Montreal #8
    January 10th 2013
  • 2. Plan
    • What is Impala?
    • Why Google built Dremel?
    • Use cases for Impala
    • Use cases for Map-Reduce
    • Cloudera Customer Survey
    • Impala Features
    • Impala Performance Expectations and Benchmarks
    • Impala Components
    • Impala Architecture
    • Impala Development Roadmap
    • Where to learn more and get started
  • 3. Disclaimer
    • To keep the descriptions of Dremel and Impala as accurate as possible, most of the content in this presentation has been gathered from the authors of the respective technologies. References are found at the end of the presentation.
    • I am not affiliated with or sponsored by Cloudera or Google.
  • 4. What is Impala?
    “An impala is an athletic, graceful African antelope, famous for its speed and its jumping agility” - Wikipedia
  • 5. Seriously, what is Impala?
    • “Impala enables real-time, interactive, analytical queries of the data stored in HBase or HDFS” – Cloudera
    • Inspired by the Google Dremel paper (2010)
      – BigQuery is a service that implements Dremel
        • It is proprietary, not free, and requires uploading your data to Google servers
  • 6. Why Google built Dremel?
    • Problems with data warehouse solutions for OLAP/BI:
      – Relational OLAP (ROLAP):
        • Need to build indices for every possible query (for performance reasons)
          → Index size could take up the whole RAM
      – Multi-dimensional OLAP (MOLAP):
        • Requires extensive time and money to design and build the data cubes
      – Ad-hoc queries (specific, non-optimised queries):
        • When you don’t know what you’ll need, or need to work in iterations, i.e. quite often!
    • Solution:
      – Increase full-scan speed without requiring indexing or pre-aggregated values
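The "solution" bullet on slide 6 can be pictured with a small sketch: instead of consulting an index, every partition of the data is scanned in full and the partial results are merged. This is only an illustration under assumptions, not Dremel's or Impala's actual implementation; the `scan_partition` and `parallel_count` helpers and the sample rows are hypothetical, and Python threads show the structure of the approach rather than real scan throughput.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(rows, predicate):
    """Full scan of one partition: count the rows that satisfy the predicate."""
    return sum(1 for row in rows if predicate(row))


def parallel_count(partitions, predicate, max_workers=4):
    """Scan every partition concurrently, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_counts = list(
            pool.map(lambda part: scan_partition(part, predicate), partitions)
        )
    return sum(partial_counts)


# Hypothetical access log, already split into three partitions.
partitions = [
    [{"account": "ABC", "city": "Montreal"}, {"account": "XYZ", "city": "Tokyo"}],
    [{"account": "ABC", "city": "Tokyo"}],
    [{"account": "DEF", "city": "Paris"}, {"account": "ABC", "city": "Paris"}],
]

# No index and no pre-aggregated value is used: every row is inspected.
abc_accesses = parallel_count(partitions, lambda row: row["account"] == "ABC")
print(abc_accesses)
```

Because no index or cube has to exist beforehand, the predicate can change from one query to the next at no extra preparation cost.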
  • 7. Come on, give me some use cases!
    • Finding particular records with specified conditions:
      – “Find all the locations where account ‘ABC’ was accessed from.”
    • Quick aggregation of statistics with dynamically changing conditions:
      – “Can you give me yesterday’s number of impressions for Google AdWords display ads – but only in the Tokyo region?”
    • Trial-and-error data analysis:
      – “And between 11am and 1pm?”
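To make the trial-and-error pattern on slide 7 concrete, here is a minimal sketch in which the same question is asked twice, the second time with extra conditions. It assumes a hypothetical `impressions` table and uses Python's built-in `sqlite3` module purely as a stand-in for an Impala connection, so it shows the shape of the queries, not Impala's behaviour or performance.

```python
import sqlite3

# sqlite3 is only a stand-in here; the table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE impressions (region TEXT, hour INTEGER, network TEXT)")
conn.executemany(
    "INSERT INTO impressions VALUES (?, ?, ?)",
    [
        ("Tokyo", 11, "display"),
        ("Tokyo", 12, "display"),
        ("Tokyo", 15, "display"),
        ("Osaka", 11, "display"),
        ("Tokyo", 12, "search"),
    ],
)


def count_impressions(conn, conditions, params):
    """Count rows matching an ad-hoc list of parameterised SQL conditions."""
    where = " AND ".join(conditions) if conditions else "1 = 1"
    sql = f"SELECT COUNT(*) FROM impressions WHERE {where}"
    return conn.execute(sql, params).fetchone()[0]


# First question: display ads, but only in the Tokyo region.
tokyo_display = count_impressions(
    conn, ["network = ?", "region = ?"], ["display", "Tokyo"]
)

# Follow-up question: "And between 11am and 1pm?"
tokyo_midday = count_impressions(
    conn,
    ["network = ?", "region = ?", "hour >= ?", "hour < ?"],
    ["display", "Tokyo", 11, 13],
)
print(tokyo_display, tokyo_midday)
```

Each follow-up only adds a condition to the previous query, which is why short response times matter more here than batch throughput.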
  • 8. Use cases for which you should stick with Map-Reduce based applications
    • Very long-running, batch-oriented tasks such as ETL:
      – e.g. exporting a large amount of data after processing
    • Complex event processing:
      – e.g. stream processing
    • “Complex data mining on Big Data which requires multiple iterations and paths of data processing with programmed algorithms” - Google
  • 9. Integration with Hadoop
    • Cloudera Customer Survey (Aug. 2012)
      – 80% need faster queries on Hadoop data
      – 65% query Hadoop using Hive
      – 70% move data from Hadoop to an RDBMS for interactive SQL
      – 60% see value today in consolidating to a single platform
  • 10. Impala Features
    • Shared with Hive:
      – Hive MetaStore
      – Hive SQL (most common SQL-92 features)
      – ODBC Driver
      – User Interface (Hue Beeswax)
    • Specific to Impala:
      – No Map-Reduce; in-memory transfers instead
      – Host and disk awareness (data locality)
      – Table data caching in RAM
      – No virtual columns or locking
  • 11. Impala Performance Expectations
    • Performance improvements over Hive
      – 3 - 4X for purely I/O bound queries
      – 7 - 45X for queries with at least one join
      – 20 - 90X when data is available in the cache
  • 12. External Benchmarks
    • Searching log files at 37signals (creators of the Ruby on Rails web framework)

      Workload                                                   Impala query time  Hive query time  MySQL query time
      5.2 GB HAproxy log – top IPs by request count              3.1s               65.4s            146.0s
      5.2 GB HAproxy log – top IPs by total request time         3.3s               65.2s            164.0s
      800 MB parsed rails log – slowest accounts                 1.0s               33.2s            48.1s
      800 MB parsed rails log – highest database time paths      1.1s               33.7s            49.6s
      8 GB pageview table – daily pageviews and unique visitors  22.4s              92.2s            180.0s

      http://37signals.com/svn/posts/3315-how-i-came-to-love-big-data-or-at-least-acknowledge-its-existence
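As a quick sanity check on the 37signals figures from slide 12, the speedup of Impala over Hive can be computed directly from the reported query times. The dictionary keys below are shorthand labels invented for this sketch; the numbers are the ones listed on the slide.

```python
# Query times in seconds, copied from the 37signals benchmark on slide 12.
# The workload keys are shorthand labels chosen for this sketch.
BENCHMARKS = {
    "haproxy_top_ips_by_count": {"impala": 3.1, "hive": 65.4, "mysql": 146.0},
    "haproxy_top_ips_by_time": {"impala": 3.3, "hive": 65.2, "mysql": 164.0},
    "rails_slowest_accounts": {"impala": 1.0, "hive": 33.2, "mysql": 48.1},
    "rails_highest_db_time_paths": {"impala": 1.1, "hive": 33.7, "mysql": 49.6},
    "pageviews_daily_and_unique": {"impala": 22.4, "hive": 92.2, "mysql": 180.0},
}


def speedup(times, baseline, engine="impala"):
    """How many times faster `engine` answered than `baseline`."""
    return times[baseline] / times[engine]


speedups_vs_hive = {
    name: speedup(times, "hive") for name, times in BENCHMARKS.items()
}

for name, ratio in speedups_vs_hive.items():
    print(f"{name}: {ratio:.1f}x faster than Hive")
```

On these five workloads the ratio ranges from roughly 4x on the 8 GB pageview query to roughly 33x on the 800 MB Rails log queries.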
  • 13. Impala Components
    • Impala State Store: 1 per cluster
      – Coordinates information (location and status) about all the running impalad instances
    • Impala Daemon: 1 per DataNode
      – Coordinates and executes queries
      – Distributes query fragments to other Impala Daemons
    • Impala Shell: 1 per node
      – Provides a command-line interface for interacting with Impala
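The division of labour on slide 13 can be sketched as a toy model: a state store that knows which daemons are alive, and a coordinating daemon that hands each data block to a live daemon holding a copy of it. Everything here (the `StateStore` class, `plan_fragments`, the host and block names) is hypothetical and greatly simplified; it is not Impala's scheduler or API.

```python
from dataclasses import dataclass, field


@dataclass
class StateStore:
    """Toy stand-in for the state store: tracks which daemons are alive."""

    daemons: dict = field(default_factory=dict)  # host name -> is alive

    def register(self, host):
        self.daemons[host] = True

    def mark_down(self, host):
        self.daemons[host] = False

    def live_hosts(self):
        return sorted(host for host, alive in self.daemons.items() if alive)


def plan_fragments(block_locations, live_hosts):
    """Assign each block to the least-loaded live host that stores a replica."""
    plan = {host: [] for host in live_hosts}
    for block, replicas in block_locations.items():
        candidates = [host for host in replicas if host in plan]
        if not candidates:
            raise RuntimeError(f"no live daemon holds block {block}")
        target = min(candidates, key=lambda host: (len(plan[host]), host))
        plan[target].append(block)
    return plan


store = StateStore()
for host in ("node1", "node2", "node3"):
    store.register(host)
store.mark_down("node3")  # the coordinator must route around this daemon

# Hypothetical replica placement: block id -> hosts storing it.
block_locations = {
    "b1": ["node1", "node2"],
    "b2": ["node1", "node3"],
    "b3": ["node2", "node3"],
}
plan = plan_fragments(block_locations, store.live_hosts())
print(plan)
```

Reading each block on a host that already stores it corresponds to the "host and disk awareness (data locality)" feature listed on slide 10.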
  • 14. Impala Architecture
  • 15. Roadmap: 0.3 Beta Version
    • Operating system:
      – Only RHEL/CentOS 6.2 is supported
    • File formats:
      – Text files, SequenceFiles, HBase tables
    • Compression:
      – Snappy, Gzip, BZip
    • No UDFs or user extensibility *
    • Largest table in joins must be specified first *
    • Right side of a join must fit in RAM *
    • No support for complex nested structures *:
      – e.g. maps, structs and arrays
    * Top asks for the post-1.0 G.A. version
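The two join restrictions on slide 15 are easier to picture with a generic in-memory hash join: the right-hand table is loaded entirely into a hash table, and the left-hand table is streamed past it. The slides do not describe Impala's join implementation, so treat this as a sketch of the general technique under that assumption; `hash_join` and the sample tables are hypothetical.

```python
def hash_join(left_rows, right_rows, key):
    """Inner join of two iterables of dicts on `key`.

    Build phase: the whole right side is held in memory as a hash table.
    Probe phase: the left side is streamed one row at a time.
    """
    table = {}
    for row in right_rows:
        table.setdefault(row[key], []).append(row)

    for left in left_rows:
        for right in table.get(left[key], []):
            yield {**left, **right}


# Hypothetical data: a large fact table (left) and a small lookup table (right).
pageviews = [
    {"account_id": 1, "path": "/home"},
    {"account_id": 2, "path": "/pricing"},
    {"account_id": 1, "path": "/docs"},
    {"account_id": 3, "path": "/home"},
]
accounts = [
    {"account_id": 1, "name": "ABC"},
    {"account_id": 2, "name": "XYZ"},
]

joined = list(hash_join(pageviews, accounts, "account_id"))
for row in joined:
    print(row)
```

With this layout only the right side has to fit in RAM, so putting the largest table first (on the left) keeps the in-memory hash table as small as possible.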
  • 16. Roadmap: 1.0 General Availability (Q1 2013)
    • File formats:
      – RCFile, Avro, LZO
      – Trevni: new columnar file format by Doug Cutting
    • More OS support:
      – Same as those supported by CDH4
    • Performance:
      – Faster, bigger, and more memory-efficient joins and aggregations
      – Straggler handling:
        • More work to faster machines, and less to slower machines
    • DDL: enables users to create tables from Impala
    • JDBC driver (shared with Hive)
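One simple way to get "more work to faster machines, and less to slower machines" is a greedy scheduler that always gives the next task to whichever worker would become free first. The slides do not say how Impala planned to handle stragglers, so this is only an assumed illustration; `assign_work`, the worker names and the speeds are hypothetical.

```python
import heapq


def assign_work(task_costs, worker_speeds):
    """Greedily give each task to the worker that becomes free first.

    task_costs: work units per task.
    worker_speeds: worker name -> work units processed per second.
    Returns worker name -> list of task indices.
    """
    # Heap of (time at which the worker is free again, worker name).
    heap = [(0.0, name) for name in sorted(worker_speeds)]
    heapq.heapify(heap)
    assignment = {name: [] for name in worker_speeds}

    for index, cost in enumerate(task_costs):
        free_at, name = heapq.heappop(heap)
        assignment[name].append(index)
        heapq.heappush(heap, (free_at + cost / worker_speeds[name], name))
    return assignment


# Six equal tasks; the hypothetical "fast" worker is twice as quick as "slow".
assignment = assign_work([1, 1, 1, 1, 1, 1], {"fast": 2.0, "slow": 1.0})
print(assignment)
```

In this example the fast worker ends up with four of the six tasks and the slow one with two, so both finish at about the same time.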
  • 17. Where to learn more and get started
    • Impala Documentation
      https://ccp.cloudera.com/display/IMPALA10BETADOC/Cloudera+Impala+1.0+Beta+Documentation
    • Cloudera’s Impala Demo VM
      https://ccp.cloudera.com/display/SUPPORT/Clouderas+Impala+Demo+VM
    • Cloudera Blog
      http://blog.cloudera.com/blog/category/impala/
    • Impala-user Google Group
      https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!forum/impala-user
    • (Unofficial) presentation at Apache Asia Road Show
      http://sizeofvoid.net/wp-content/uploads/ImpalaIntroduction.pdf
    • Official announcement of Impala at Strata Conference NY 2012
      http://www.slideshare.net/cloudera/2012-1025-hadoop-world-impala-16x9
    • Dremel: Interactive Analysis of Web-Scale Datasets
      http://research.google.com/pubs/pub36632.html
    • BigQuery Technical White Paper
      https://cloud.google.com/files/BigQueryTechnicalWP.pdf
  • 18. Conclusion
    • Use Impala when you need to quickly find or compute a small amount of data from a large data source
    • Impala does not replace batch-oriented jobs
    • The Impala beta and its documentation are quite good for a beta
      – If you can’t wait for Impala v1.0, try BigQuery