Real-time Big Data Analytics Engine using Impala

Cloudera Impala is an open-source project under the Apache License that enables real-time, interactive analytical SQL queries over data stored in HBase or HDFS. The work was inspired by the Google Dremel paper, which is also the basis for Google BigQuery. Impala provides access to the same unified storage platform through its own distributed query engine and does not use MapReduce. In addition, it uses the same metadata, SQL syntax (HiveQL-like), ODBC driver, and user interface (Hue Beeswax) as Hive. Beyond the traditional Hadoop approach, which aims to provide a low-cost solution for resilient, batch-oriented distributed data processing, more and more effort in the Big Data world is pursuing the right solution for ad-hoc, fast queries and real-time processing of large datasets. In this presentation, we'll explore how to run interactive queries with Impala, the advantages of the approach, and its architecture, and understand how it optimizes data systems, including practical performance analysis.

Real-time Big Data Analytics Engine using Impala Presentation Transcript

  • 1. Real-time Big Data Analytics Engine using Impala Jason Shih Etu 28 Sept, HIT 2013
  • 2. Outline •  Motivation & users' perspective •  Impala architecture and data analytics stack overview •  Performance benchmark •  Use cases (demo) HIT 2013 2
  • 3. Motivation & Users' Perspective •  Leverage existing Hadoop deployments •  Reuse HIVE metadata, metastore, DDL & JDBC/ODBC drivers •  File formats widely supported in Hadoop •  Read performance: disk awareness and short-circuit •  MPP SQL query engine (over Hadoop) •  billions to trillions of records at interactive speeds •  Both analytical & transactional •  General purpose & ad-hoc •  MR •  High latency, dismissed for interactive workloads •  Disk-based intermediate outputs •  Execution strategies (lack of optimization based on data statistics) •  Task and scheduling overhead •  Task launch delay 5~10 sec (pre-defined delay due to the periodic heartbeat for newly scheduled tasks) HIT 2013 3
  • 4. Motivation & Users' Perspective (cont'd) •  High performance •  In-memory query engine •  C++ instead of Java •  Runtime code generation •  Completely new execution engine (cf. MR framework) •  Data locality and short-circuit reads •  HDFS-2246: avoid HDFS API overhead •  HDFS-34: Making Short-Circuit Local Reads Secure •  Intermediate data never hits disk •  Data streams to the client HIT 2013 4
  • 5. Motivation & Users' Perspective (cont'd) •  MPP-RDB Paradigm •  HDFS: •  Scalability & availability •  Price/performance & commodity hardware •  MPP DW appliances: •  Exadata, Vertica, HANA, Aster (SQL-MapReduce), HAWQ (Pivotal HD) & Dremel etc. •  Pros: •  Very mature & highly optimized engines •  Cons: •  Generally not fault-tolerant –  For long-running queries when the cluster scales up –  Lack of rich analytics (machine learning) HIT 2013 5
  • 6. •  Impala •  Real-time queries in Apache Hadoop, sitting atop HDFS •  ~2010-2012, 7 FTEs (Marcel Kornacker) •  Completely open source, ASLv2 •  GA: connectors for BI/DW generally available. Google F1 - The Fault-Tolerant Distributed RDBMS, May 2012. 6 Ref: http://www.wired.com/wiredenterprise/2012/10/cloudera-impala-hadoop/
  • 7. Impala Overview: SQL Support •  Functionality highlights: •  SQL-92 features minus correlated subqueries •  SELECT, INSERT INTO ... SELECT, INSERT INTO ... VALUES(...) •  ORDER BY requires LIMIT •  Flexible file formats: e.g. RCFile •  Unsupported/limitations: •  WITH clause does not support recursive queries •  Only hash joins •  Joined tables have to fit in the aggregate memory of all executing nodes •  Nothing beyond SQL: •  buckets, samples, transforms, arrays, structs, maps, xpath and json •  UDF support •  Impala 1.2: supports HIVE UDFs (existing jars without recompile) •  Impala native UDFs/UDAs, registered in the metadata catalog HIT 2013 7
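As a concrete illustration of the restrictions above, a short sketch in Impala 1.x SQL (table and column names here are hypothetical, not from the deck):

```sql
-- ORDER BY must be paired with LIMIT in Impala 1.x:
SELECT user_id, SUM(bytes) AS total_bytes
FROM weblogs
GROUP BY user_id
ORDER BY total_bytes DESC
LIMIT 100;

-- The two supported INSERT forms named on the slide:
INSERT INTO daily_totals SELECT dt, COUNT(*) FROM weblogs GROUP BY dt;
INSERT INTO daily_totals VALUES ('2013-09-28', 42);
```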
  • 8. Impala SQL: create table HIT 2013 8 Ref: SQL Language Elements: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref_sql.html
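The CREATE TABLE example on this slide survives only as an image; a minimal sketch of the HiveQL-compatible syntax, with hypothetical table, column, and path names (see the language reference above for the full grammar):

```sql
-- External text-file table over an existing HDFS directory:
CREATE EXTERNAL TABLE weblogs (
  ts    TIMESTAMP,
  host  STRING,
  bytes BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/etu/weblogs';
```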
  • 9. Architecture Overview •  Two daemons: •  impalad: •  Runs on all HDFS DNs •  Functions as the distributed query engine •  Handles client and internal requests (query exec) •  Designs execution plans for queries and processes queries on DNs •  Thrift services for these two roles •  statestored: •  Cluster metadata, name service & metadata distribution –  cf. HIVE metastore: RDB metadata •  Metadata updated when impalad processes are added/deleted •  Daemons cache metadata (INVALIDATE METADATA or REFRESH) •  Exports a thrift service •  Sends periodic heartbeats, checks for live backends and pushes new data •  Failure of the statestore won't affect query execution, except for stale DN state HIT 2013 9
  • 10. Architecture Overview: Impala daemons •  Impalad: •  Impala 1.1 integrates Sentry for a fine-grained authorization framework •  Daemon startup args (default): •  impalad -log_dir=/opt/impala/var/log/impala -state_store_port=24000 -state_store_host=impala-server -be_port=22000 •  Enabling security: •  Relies on the existing Kerberos subsystem for the authentication framework •  -use_statestore -kerberos_reinit_interval=60 -principal=impala/impalad-server@TESTDOMAIN.COM -keytab_file=impala.keytab •  Authorization: •  -authorization_policy_file argument, fed with an .ini-format file •  divided into [groups] & [roles] (optionally [databases] & [users]) •  [users] will override the OS-level mapping of users to groups •  Statestored: •  Daemon startup: •  statestored -log_dir=/opt/impala/var/log/impala -state_store_port=24000 •  Enabling Kerberos: •  -kerberos_reinit_interval=60 -principal=impala/statestored-server@TESTDOMAIN.COM -keytab_file=impala.keytab •  Available flags: •  http://statestored-server:25010/varz HIT 2013 10
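A sketch of what the .ini policy file mentioned above might contain; group, role, server, and database names are all made up for illustration, and the exact privilege grammar should be checked against the Sentry documentation:

```ini
[groups]
# OS group "analysts" is granted the "sales_reader" role
analysts = sales_reader

[roles]
# a role maps to one or more privileges
sales_reader = server=server1->db=sales->table=*->action=select
```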
  • 11. Architecture Overview (cont'd) •  Query execution phases •  Planner, coordinator, executor •  Queries arrive via JDBC/ODBC, Thrift API/CLI, Hue/Beeswax •  Planner turns the request into a collection of plan fragments •  Coordinator initiates execution on impalad(s) local to the data HIT 2013 11
  • 12. Architecture Overview: Query Execution •  Plan fragments built upon request from JDBC/ODBC or thrift clients •  Execution initiated on impalads by the coordinator •  Intermediate results are streamed between impalads •  Results are streamed back to the client 12
  • 13. Architecture Overview: Query Plan HIT 2013 •  Plan nodes & operators: •  Depth-first execution tree •  Scan, HashJoin, HashAggr, Union, TopN, Exchange •  Two-phase process •  Single-node plan (left-deep tree) •  Plan fragments: partitioning the operator tree •  Fragment: distributed atomic executable unit (plan nodes) •  Distributed plans: •  Query operators are fully distributed •  Max. scan locality & min. data movement •  Parallel joins: •  Order: FROM clause •  Broadcast join & partitioned join •  Future roadmap: cost-based optimization based on column stats & cost of data transfers 13
  • 14. Architecture Overview: Query Plan (cont'd) HIT 2013 14
  • 15. Logging and Profile •  Impala logs: •  Logging level controlled by: •  GLOG_v env variable –  Default level = 1: connection logging and execution profiles –  Level 2: logs each RPC initiated and execution-progress info –  Level 3: everything, plus logging of every row read •  -logbuflevel daemon startup flag •  Examine: •  $IMPALA_HOME/var/log/impala/{impalad,statestore}.{INFO,WARNING,ERROR} •  Consolidated: impala-server.log & impala-state-store.log •  http://impalad-server:25000/logs •  Content: •  Startup options: CPU, available spindles, flags, version and machine info •  Query profile: composition, degree of data locality, throughput statistics and response time •  Auditing log featured in release 1.1.1 •  Extensive analytics data for query execution: •  query profiles stored in zlib-compressed format: •  $IMPALA_HOME/var/log/impala/profiles •  http://impalad-server:25000/queries HIT 2013 15
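For example, a sketch of raising the log verbosity via the GLOG_v environment variable before launching the daemon (the echo is only there to confirm the setting; starting impalad itself is omitted):

```shell
# Level 2 adds per-RPC and execution-progress logging (per the slide above).
export GLOG_v=2
echo "GLOG_v=$GLOG_v"
# ...then start impalad from this environment so it inherits the setting.
```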
  • 16. Performance Tips •  Partitioning •  For large tables that are always or almost always queried with conditions on the partitioning columns •  JOIN •  Broadcast join by default •  Partitioned join: •  suitable for large tables of roughly equal size •  subsets of rows can be processed in parallel by sending portions of each table •  Join the biggest table first •  Then join the table with the most selective filter •  INSERT •  not suitable for loading large quantities of data into HDFS-based tables, due to the lack of parallelized operations •  Stage temporary files in an ETL pipeline and upload to HDFS (refresh) •  Resource usage: •  Impalad startup flag: "-mem_limits" 16
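A sketch of the partitioning tip above, with hypothetical table and column names; queries that filter on the partition columns let Impala prune whole partitions instead of scanning the full table:

```sql
CREATE TABLE events (
  user_id STRING,
  bytes   BIGINT
)
PARTITIONED BY (year INT, month INT);

-- Only the year=2013/month=9 partition directories are scanned:
SELECT COUNT(*) FROM events WHERE year = 2013 AND month = 9;
```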
  • 17. Troubleshooting Hints •  Queries are slow? •  Test: "select count(*) from table" •  A non-zero "Total remote scan volume" in the impalad log indicates either some DNs not running impalad, or an impalad instance failing to contact one or more other impalad instances •  Missing impalad instances on DNs •  live backends: http://statestore:25010/metrics •  Data locality and native checksumming (>= CDH 4.2) •  Enable properties: "dfs.client.read.shortcircuit" & "dfs.client.read.shortcircuit.skip.checksum" •  Rebuild/reinstall the hadoop native lib "libhadoop.so" if needed •  Errors: –  Unknown disk id. This will negatively affect performance. Check your hdfs settings to enable block location metadata –  Unable to load native-hadoop library for your platform... using builtin-java classes where applicable HIT 2013 17
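A sketch of the hdfs-site.xml entries for the two client properties named above, set as the slide suggests; exact property behavior varies across CDH versions, so verify against your release's documentation:

```xml
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.read.shortcircuit.skip.checksum</name>
  <value>true</value>
</property>
```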
  • 18. Troubleshooting Hints (cont'd) •  Queries getting slower? •  Impalad paging after memory limit exceeded •  E.g.: mem-limit.h:86] Query: 0:0Exceeded limit: limit=26996031488 consumption=26996148624 •  Incorrect results? •  Invalid metadata (GA: REFRESH, post-GA: INVALIDATE METADATA) •  Invalid query? •  Cross-check the query in HIVE •  Useful debugging info in the impala service logs •  Invalid/unsupported statements: •  http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref.html#langref •  Auth errors: •  Server logging: •  "Minor code may provide more information (Cannot contact any KDC for realm)" or Kerberos: •  "GSSAPI Error: Unspecified GSS failure" •  Client: "Error connecting: <class 'thrift.transport.TTransport.TTransportException'>, TSocket read 0 bytes" •  Ensure •  a valid Kerberos ticket lifetime at the client •  the "-s" service principal is specified, and the "-k" flag for kerberized impalad connections HIT 2013 18
  • 19. Limitations and Wish List •  Limitations: •  Subqueries referenced in the SELECT clause •  Optional WITH clause before an INSERT •  Recursive queries in WITH clauses •  Inconsistent VIEWs •  Parentheses in WHERE clauses •  Wish list: •  SQL modeling tools •  Fault-tolerant queries •  Memory management (caching parquet tables) & usage estimation •  Aggregation over groups of columns (> 30 etc.) HIT 2013 19
  • 20. Impala: Now & Future Roadmap •  Now (1.1.x/1.0) •  OS support: •  RHEL/CentOS 5.7, Ubuntu, Debian, SLES, and Oracle Linux •  Connectors: JDBC/ODBC drivers •  DDL support & SQL performance optimization •  Fast & memory-efficient joins & aggregation •  File formats: Parquet, Avro & LZO-compressed •  Future (1.2) - late 2013 •  UDFs and extensibility •  Automatic metadata refresh •  In-memory HDFS caching •  Cost-based join-order optimization •  Preview of YARN-integrated resource manager •  2.0 roadmap - first third of 2014 •  SQL 2003-compliant analytic window functions •  Additional authentication mechanisms •  UDTFs (user-defined table functions) •  Intra-node parallelized aggregations and joins •  Nested data •  YARN-integrated resource manager •  Additional data types - including Date and Decimal types HIT 2013 20
  • 21. More Information & Related Works •  "Dremel: Interactive Analysis of Web-Scale Datasets", Sergey Melnik et al., Google •  "Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real", http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/ •  "Impala Unlocks Interactive BI on Hadoop with MicroStrategy", Justin Erickson & Jochen Demuth, Cloudera •  "Cloudera Impala Performance Evaluation", Yukinori SUDA •  "HANA vs Impala, on AWS Cloud", Aron MacDonald •  "Spark and Shark: High-speed In-memory Analytics over Hadoop Data", Reynold Xin, AMPLab •  Stinger Initiative: http://hortonworks.com/blog/100x-faster-hive/ •  Apache Drill: http://incubator.apache.org/drill/ HIT 2013 21
  • 22. Performance Evaluation (joke chart: top speed in km/h of shark, impala, pig, and elephant, 0-100 scale) Ref: Wiki & http://www.speedofanimals.com
  • 23. Breakdown of DNS Anomaly Analytics HIT 2013 23 (chart: HDFS volume (GB) vs. query resp. (sec)) Two DNs + master: -  Dual DC E5620 2.40GHz -  MEM 32GB ea. -  4 spindles, 2T ea.
  • 24. Data Volume and Ingest HIT 2013 24 (table, columns 1D / 1W / 1M / 2M) Data (raw, GB): 5.1 / 35 / 140 / 280; Data (HDFS, GB): 3.8 / 25.9 / 103.6 / 207.2; Blocks (HDFS): 31 / 211 / 844 / 1598; MEvt: 42 / 291 / 1,166 / 2,209
  • 25. PIG vs. Impala •  Domain-level compute in preprocessing streaming •  DN sort throughput: ~120MB/s; SIP/Qry ~50MB/s •  Processing time scales linearly with data volume HIT 2013 27 (chart: query resp. (sec)) Impala: 71s, 7 times faster.
  • 26. Observation & Estimation •  Speed-up: 4.5~7 times •  DL calc.: 57~70% memory usage •  Data ingest –  Est. ~3TB takes ~55K sec •  Plus pre-processing time –  Throughput constrained by GbE linkage (in/out bound) –  Avg. throughput ~80MB/s •  non-encrypted file transfer •  RTQ: ~15K sec for 3TB processing –  cf. 115K sec based on MR HIT 2013 28
  • 27. Query Throughput & Latency •  Queries •  20 from TPC-DS •  3 categories •  Interactive: 1 month •  Reports: several months •  Deep analytics: all data •  Fact table: •  1TB snappy seq.-files / 5 yr •  Resource level: •  20 nodes, 24 cores/node •  Speed-up: •  Interactive: 25~68 •  Reports: 6~56 •  Deep analytics: 6~55 29 Ref: "Impala: A Modern, Open-Source SQL Engine for Hadoop", Marcel Kornacker, Cloudera
  • 28. Impala vs. Stinger •  Stinger •  Optimized execution plans •  TEZ framework optimizes execution •  Columnar file format 30 Ref: Cloudera Impala Overview, Scott Leberknight, Cloudera
  • 29. Impala Use Cases •  Offload the DW for ad-hoc query environments, ETL and archiving •  Interactive BI/analytics on large volumes of data •  Real-time responses for unstructured data analysis
  • 30. Impala and HIVE HIT 2013 32 •  Impala: •  Native MPP query engine for low runtime overhead & interactive SQL •  No fault tolerance •  GA: UDF supported •  HIVE: •  MapReduce as the execution engine •  Fault-tolerant, leveraging the MR framework •  High runtime overhead (extensive layering) •  UDFs •  Common for clients: •  SQL syntax •  highly compatible with HiveQL •  ODBC/JDBC drivers •  Metadata (table definitions) •  HUE
  • 31. Data Warehouse Offload 33 Ref: Hadoop and the Data Warehouse: When to Use Which, Teradata
  • 32. Query Run Times •  Table with 60M Records 34 Ref: HANA vs Impala, on AWS Cloud
  • 33. TPC-H Query Run Times •  Lineitem table 60M Rows 35 Ref: HANA vs Impala, on AWS Cloud
  • 34. •  On-demand customer segmentation based on various demographic and mobile-behavior attributes •  On-demand customer profiling through fast screening & ranking of critical attributes. With the power of distributed in-memory computation on Hadoop, Impala enables market analysts to conduct various interactive analytics, such as OLAP, statistical correlation, and data mining, on big data. HIT 2013 36
  • 35. Associated-attribute analysis of target segments (chart residue: social-network shares for two segments - Facebook 43%/44%, Twitter 31%/30%, Google+ 19%/17%, LinkedIn 7%/9% - plus app-usage percentage breakdowns) 39
  • 36. DEMO •  CREATE TABLE, LOAD DATA from HDFS DROP TABLE IF EXISTS demo; CREATE EXTERNAL TABLE demo ( a string, b int, c int ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/etu/demo'; •  PIG & Impala: •  SUM •  SUM with GROUP BY HIT 2013 40
  • 37. DEMO (cont'd) •  SUM in PIG: a = load 'demo/demo_data.csv' using PigStorage(',') as (col1:chararray, col2:int, col3:int); b = foreach a generate col1, col2, col3, 1 as col4; d = group b by col4; d1 = foreach d generate SUM(b.col3); store d1 into 'demo/count2' using PigStorage(','); •  SUM in Impala: SELECT sum(demo.c) FROM demo; HIT 2013 41
  • 38. DEMO (cont’) •  SUM with GROUP BY in PIG a = load 'demo/demo_data.csv' using PigStorage(',') as (col1:chararray, col2:int, col3:int); b = foreach a generate col1, col2, col3, 1 as col4; c = group b by col1; c1 = foreach c generate group, SUM(b.col2); store c1 into 'demo/count1' using PigStorage(','); •  SUM with GROUP BY in Impala SELECT demo.a AS tag, sum(demo.b) AS val FROM demo GROUP BY demo.a; HIT 2013 42
  • 39. DEMO (cont'd) •  Speed-up: HIT 2013 43 (chart: query resp. (sec); speed-ups of 60x and 18x) Two DNs, same spec as for the DNS log analytics: dual DC E5620, MEM 32GB ea. ~100 times faster when the cluster scales.
  • 40. 44 Questions? jasonshih@etusolution.com Slideshare: www.slideshare.net/hlshih/hit2013-impala-0925etu Acknowledgement: Dr. CM Fan, MFactory, SYSTEX
  • 41. www.etusolution.com info@etusolution.com Taipei, Taiwan 318, Rueiguang Rd., Taipei 114, Taiwan T: +886 2 7720 1888 F: +886 2 8798 6069 Beijing, China Room B-26, Landgent Center, No. 24, East Third Ring Middle Rd., Beijing, China 100022 T: +86 10 8441 7988 F: +86 10 8441 7227 Contact