Impala: A Modern, Open-Source SQL Engine for Hadoop
Mark Grover
Software Engineer, Cloudera
January 7th, 2014
Twitter: mark_grover
github.com/markgrover/impala-thug/
  
Agenda
• What is Impala and what is Hadoop?
• Execution frameworks on Hadoop – MR, Hive, etc.
• Goals and user view of Impala
• Architecture of Impala
• Comparing Impala to other systems
• Impala Roadmap
  
What is Impala?
• General-purpose SQL engine
• Real-time queries in Apache Hadoop
  
What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is…
✓ Distributed
✓ Fault tolerant
✓ Scalable

CORE HADOOP SYSTEM COMPONENTS
• Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
• MapReduce: distributed computing framework

Has the flexibility to store and mine any type of data
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema
Excels at processing complex data
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks
Scales economically
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in
So, what's wrong with MapReduce?
• Batch oriented
• High latency
• Not all paradigms fit very well
• Only for developers
  
What are Hive and Pig?
• MR is hard and only for developers
• Higher-level platforms for converting declarative syntax to MapReduce
  • SQL – Hive
  • workflow language – Pig
• Built on top of MapReduce
  
What is Impala?
• General-purpose SQL engine
• Real-time queries in Apache Hadoop
• Beta version released in October 2012
• General availability (v1.0) release out since April 2013
• Open source under the Apache license
• Latest release (v1.2.3) released on December 23rd
  
Impala Overview: Goals
• General-purpose SQL query engine:
  • works both for analytical and transactional/single-row workloads
  • supports queries that take from milliseconds to hours
• Runs directly within Hadoop:
  • reads widely used Hadoop file formats
  • talks to widely used Hadoop storage managers
  • runs on the same nodes that run Hadoop processes
• High performance:
  • C++ instead of Java
  • runtime code generation
  • completely new execution engine – no MapReduce
  
User View of Impala: Overview
• Runs as a distributed service in the cluster: one Impala daemon on each node with data
• Highly available: no single point of failure
• User submits query via ODBC/JDBC, the Impala CLI, or Hue to any of the daemons
• Query is distributed to all nodes with relevant data
• Impala uses Hive's metadata interface, connects to the Hive metastore
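As a concrete sketch (not in the deck; the table web_logs and the host name are hypothetical), a query like the one below can be submitted through impala-shell, JDBC/ODBC, or Hue to any impalad, which then coordinates its execution:

  -- connect with: impala-shell -i some-impalad-host (any daemon in the cluster)
  SELECT url, COUNT(*) AS hits
  FROM web_logs            -- hypothetical table registered in the Hive metastore
  GROUP BY url
  ORDER BY hits DESC
  LIMIT 10;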
  
User	
  View	
  of	
  Impala:	
  Overview	
  
•  There	
  is	
  no	
  ‘Impala	
  format’!	
  
•  There	
  is	
  no	
  ‘Impala	
  format’!!	
  
•  Supported	
  file	
  formats:	
  
•  uncompressed/lzo-­‐compressed	
  text	
  files	
  
•  sequence	
  files	
  and	
  RCFile	
  with	
  snappy/gzip	
  compression	
  
•  Avro	
  data	
  files	
  
•  Parquet	
  columnar	
  format	
  (more	
  on	
  that	
  later)	
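As an illustration (not from the deck; table names are hypothetical, and the exact Parquet keyword has varied across Impala versions, e.g. PARQUETFILE in early releases), tables are simply declared over ordinary HDFS files in one of these formats:

  CREATE TABLE logs_text (ts BIGINT, url STRING, status INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;                                  -- plain text files already in HDFS

  CREATE TABLE logs_parquet LIKE logs_text STORED AS PARQUET;
  INSERT INTO logs_parquet SELECT * FROM logs_text;    -- rewrite the data as columnar Parquet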
  
User View of Impala: SQL
• SQL support:
  • essentially SQL-92, minus correlated subqueries
  • INSERT INTO … SELECT …
  • only equi-joins; no non-equi joins, no cross products
  • ORDER BY requires LIMIT
  • (limited) DDL support
  • SQL-style authorization via Apache Sentry (incubating)
  • UDFs and UDAFs are supported
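A small sketch of that supported subset (the tables orders, customers, and daily_spend are hypothetical):

  -- INSERT INTO ... SELECT with an equi-join
  INSERT INTO daily_spend
  SELECT c.name, SUM(o.total) AS spend
  FROM orders o JOIN customers c ON (o.cust_id = c.id)
  GROUP BY c.name;

  -- ORDER BY must be accompanied by LIMIT in this version of Impala
  SELECT name, spend FROM daily_spend ORDER BY spend DESC LIMIT 100;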
  
User View of Impala: SQL
• Functional limitations:
  • no custom file formats or SerDes
  • no beyond-SQL features (buckets, samples, transforms, arrays, structs, maps, xpath, json)
• Broadcast joins and partitioned hash joins supported
  • smaller table has to fit in the aggregate memory of all executing nodes
  
User View of Impala: HBase
• Functionality highlights:
  • support for SELECT, INSERT INTO … SELECT …, and INSERT INTO … VALUES(…)
  • predicates on rowkey columns are mapped into start/stop rows
  • predicates on other columns are mapped into SingleColumnValueFilters
• But: the mapping of HBase tables to metastore tables is patterned after Hive
  • all data stored as scalars and in ASCII
  • the rowkey needs to be mapped into a single string column
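A hedged illustration of that predicate mapping (hbase_users is a hypothetical Impala table backed by HBase, with the rowkey exposed as the string column id):

  -- rowkey range predicate: becomes start/stop rows on the HBase scan
  SELECT id, city FROM hbase_users WHERE id >= 'user_1000' AND id < 'user_2000';

  -- predicate on a non-key column: becomes a SingleColumnValueFilter
  SELECT id, city FROM hbase_users WHERE city = 'San Francisco';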
  
User View of Impala: HBase
• Roadmap:
  • full support for UPDATE and DELETE
  • storage of structured data to minimize storage and access overhead
  • composite row key encoding, mapped into an arbitrary number of table columns
  
Impala Architecture
• Three binaries: impalad, statestored, catalogd
• Impala daemon (impalad) – N instances
  • handles client requests and all internal requests related to query execution
• State store daemon (statestored) – 1 instance
  • provides name service and metadata distribution
• Catalog daemon (catalogd) – 1 instance
  • relays metadata changes to all impalad's
  
Impala Architecture
• Query execution phases:
  • request arrives via ODBC/JDBC
  • planner turns the request into collections of plan fragments
  • coordinator initiates execution on remote impalad's
  
Impala Architecture
• During execution:
  • intermediate results are streamed between executors
  • query results are streamed back to the client
  • subject to limitations imposed by blocking operators (top-n, aggregation)
  
Impala Architecture: Query Execution
Request arrives via ODBC/JDBC.
[Diagram: a SQL app sends the request over ODBC to the Query Planner/Query Coordinator of one impalad; every node runs a Query Planner, Query Coordinator, and Query Executor colocated with an HDFS DataNode and HBase; the Hive Metastore, HDFS NameNode, and Statestore + Catalogd sit alongside the cluster.]
  
Impala Architecture: Query Execution
Planner turns the request into collections of plan fragments; the coordinator initiates execution on remote impalad's.
[Diagram: same cluster layout; the coordinating impalad distributes plan fragments to the Query Executors on the other nodes.]
  
Impala Architecture: Query Execution
Intermediate results are streamed between impalad's; query results are streamed back to the client.
[Diagram: same cluster layout; the Query Executors exchange intermediate results and the coordinator streams the final query results back to the SQL app.]
  
Query Planning: Overview
• 2-phase planning process:
  • single-node plan: left-deep tree of plan operators
  • plan partitioning: partition the single-node plan to maximize scan locality, minimize data movement
• Parallelization of operators:
  • all query operators are fully distributed
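The resulting plan can be inspected with EXPLAIN; a minimal sketch (the table sales is hypothetical):

  EXPLAIN
  SELECT region, SUM(amount) AS total
  FROM sales
  GROUP BY region;
  -- prints the plan fragments (scans, aggregations, exchanges) Impala would run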
  
Query Planning: Single-Node Plan
• Plan operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange
  
Single-Node Plan: Example Query

SELECT t1.custid,
       SUM(t2.revenue) AS revenue
FROM LargeHdfsTable t1
JOIN LargeHdfsTable t2 ON (t1.id1 = t2.id)
JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
WHERE t3.category = 'Online'
GROUP BY t1.custid
ORDER BY revenue DESC LIMIT 10;
  
Query Planning: Single-Node Plan
• Single-node plan for the example query:
[Plan tree, bottom up: Scan t1 and Scan t2 feed a HashJoin; its output is HashJoined with Scan t3; then Agg; then TopN.]
  	
  
Query Planning: Distributed Plans
• Goals:
  o maximize scan locality, minimize data movement
  o full distribution of all query operators (where semantically correct)
• Parallel joins:
  o broadcast join: join is collocated with the left input; the right-hand side table is broadcast to each node executing the join
    -> preferred for small right-hand side input
  o partitioned join: both tables are hash-partitioned on the join columns
    -> preferred for large joins
  o cost-based decision based on column stats/estimated cost of data transfers
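A hedged sketch of how that choice interacts with statistics and hints (the tables big1, big2, and small are hypothetical; COMPUTE STATS and the [SHUFFLE]/[BROADCAST] hint syntax come from Impala releases of this era):

  COMPUTE STATS big1;   -- gather table/column stats used for the cost-based decision

  -- force a partitioned (shuffle) join between two large tables
  SELECT COUNT(*) FROM big1 JOIN [SHUFFLE] big2 ON (big1.id = big2.id);

  -- force a broadcast join when the right-hand side is small
  SELECT COUNT(*) FROM big1 JOIN [BROADCAST] small ON (big1.k = small.k);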
  
Query Planning: Distributed Plans
• Parallel aggregation:
  o pre-aggregation where data is first materialized
  o merge aggregation partitioned by grouping columns
• Parallel top-N:
  o initial top-N operation where data is first materialized
  o final top-N in a single-node plan fragment
  
Query Planning: Distributed Plans
• In the example:
  o scans are local: each scan receives its own fragment
  o 1st join: large x large -> partitioned join
  o 2nd join: large x small -> broadcast join
  o pre-aggregation in the fragment that materializes the join result
  o merge aggregation after repartitioning on the grouping column
  o initial top-N in the fragment that does the merge aggregation
  o final top-N in the coordinator fragment
  
Query Planning: Distributed Plans
[Distributed plan for the example: at the HDFS DataNodes, Scan t1 and Scan t2 feed a partitioned HashJoin (hashed on t1.id1 / t2.id); its output is joined with a broadcast of Scan t3 (read at the HBase RegionServers), followed by a Pre-Agg; after repartitioning on hash(t1.custid), a MergeAgg and an initial TopN run; the final TopN runs at the coordinator.]
Metadata Handling
• Impala metadata:
  o Hive's metastore: logical metadata (table definitions, columns, CREATE TABLE parameters)
  o HDFS NameNode: directory contents and block replica locations
  o HDFS DataNode: block replicas' volume ids
  
Metadata Handling
• Caches metadata: no synchronous metastore API calls during query execution
• impalad instances read metadata from the metastore at startup
• Catalog Service relays metadata when you run DDL or update metadata on one of the impalad's
• REFRESH [<tbl>]: reloads metadata on all impalad's (if you added new files via Hive)
• INVALIDATE METADATA: reloads metadata for all tables
• Roadmap: HCatalog
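For example, after new files are added to an existing table from Hive or directly in HDFS, the cached metadata can be brought up to date (web_logs is a hypothetical table):

  REFRESH web_logs;        -- reload file/block metadata for one table on all impalad's
  INVALIDATE METADATA;     -- discard and reload metadata for all tables (heavier operation)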
  
Impala Execution Engine
• Written in C++ for minimal execution overhead
• Internal in-memory tuple format puts fixed-width data at fixed offsets
• Uses intrinsics/special CPU instructions for text parsing, CRC32 computation, etc.
• Runtime code generation for "big loops"
  
Impala Execution Engine
• More on runtime code generation:
  • example of a "big loop": inserting a batch of rows into a hash table
  • known at query compile time: # of tuples in a batch, tuple layout, column types, etc.
  • generated at compile time: an unrolled loop that inlines all function calls, contains no dead code, and minimizes branches
  • code generated using LLVM
  
Impala's Statestore
• Central system state repository
  • name service (membership)
  • metadata
  • roadmap: other scheduling-relevant or diagnostic state
• Soft-state
  • all data can be reconstructed from the rest of the system
  • cluster continues to function when the statestore fails, but per-node state becomes increasingly stale
• Sends periodic heartbeats
  • pushes new data
  • checks for liveness
  
Statestore: Why not ZooKeeper?
• ZK is not a good pub-sub system
  • Watch API is awkward and requires a lot of client logic
  • multiple round-trips required to get data for changes to a node's children
  • push model is more natural for our use case
• Don't need all the guarantees ZK provides:
  • serializability
  • persistence
  • prefer to avoid complexity where possible
• ZK is bad at the things we care about and good at the things we don't
  
Comparing Impala to Dremel
• What is Dremel?
  • columnar storage for data with nested structures
  • distributed scalable aggregation on top of that
• Columnar storage in Hadoop: Parquet
  • stores data in appropriate native/binary types
  • can also store nested structures similar to Dremel's ColumnIO
• Distributed aggregation: Impala
• Impala plus Parquet: a superset of the published version of Dremel (which didn't support joins)
  
More about Parquet
• What is it:
  • container format for all popular serialization formats: Avro, Thrift, Protocol Buffers
  • successor to Trevni
  • jointly developed between Cloudera and Twitter
  • open source; hosted on github
• Features:
  • rowgroup format: file contains multiple horizontal slices
  • supports storing each column in a separate file
  • supports fully shredded nested data; repetition and definition levels similar to Dremel's ColumnIO
  • column values stored in native types (bool, int<x>, float, double, byte array)
  • support for index pages for fast lookup
  • extensible value encodings
  
Comparing Impala to Hive
• Hive: MapReduce as an execution engine
  • high-latency, low-throughput queries
  • fault-tolerance model based on MapReduce's on-disk checkpointing; materializes all intermediate results
  • Java runtime allows for easy late-binding of functionality: file formats and UDFs
  • extensive layering imposes high runtime overhead
• Impala:
  • direct, process-to-process data exchange
  • no fault tolerance
  • an execution engine designed for low runtime overhead
  
Impala Roadmap: 2013
• Additional SQL:
  • ORDER BY without LIMIT
  • analytic window functions
  • support for structured data types
• Improved HBase support:
  • composite keys, complex types in columns, index nested-loop joins, INSERT/UPDATE/DELETE
  
Impala Roadmap: 2013
• Runtime optimizations:
  • straggler handling
  • improved cache management
  • data collocation for improved join performance
• Resource management:
  • goal: run exploratory and production workloads in the same cluster, against the same data, without impacting production jobs
  
Demo
• Uses Cloudera's Quickstart VM: http://tiny.cloudera.com/quick-start
  
Try it out!
• Open source! Available at cloudera.com, AWS EMR!
• We have packages for:
  • RHEL 5, 6, SLES11, Ubuntu Lucid, Maverick, Precise, Debian, etc.
• Questions/comments? community.cloudera.com
• My twitter handle: mark_grover
• Slides at: github.com/markgrover/impala-thug
  

More Related Content

What's hot

Parquet and impala overview external
Parquet and impala overview externalParquet and impala overview external
Parquet and impala overview externalmattlieber
 
How Impala Works
How Impala WorksHow Impala Works
How Impala WorksYue Chen
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala InternalsDavid Groozman
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Impala presentation
Impala presentationImpala presentation
Impala presentationtrihug
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataDataWorks Summit
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Admission Control in Impala
Admission Control in ImpalaAdmission Control in Impala
Admission Control in ImpalaCloudera, Inc.
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
How the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksHow the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksEDB
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 

What's hot (20)

Parquet and impala overview external
Parquet and impala overview externalParquet and impala overview external
Parquet and impala overview external
 
How Impala Works
How Impala WorksHow Impala Works
How Impala Works
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Impala presentation
Impala presentationImpala presentation
Impala presentation
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
 
ORC Deep Dive 2020
ORC Deep Dive 2020ORC Deep Dive 2020
ORC Deep Dive 2020
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Admission Control in Impala
Admission Control in ImpalaAdmission Control in Impala
Admission Control in Impala
 
Hive Does ACID
Hive Does ACIDHive Does ACID
Hive Does ACID
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
How the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksHow the Postgres Query Optimizer Works
How the Postgres Query Optimizer Works
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 

Similar to Impala Architecture presentation

Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impalamarkgrover
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfssusere05ec21
 
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera, Inc.
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera, Inc.
 
Cloudera Impala presentation
Cloudera Impala presentationCloudera Impala presentation
Cloudera Impala presentationmarkgrover
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 
Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache HadoopJan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache HadoopYahoo Developer Network
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Modern Data Stack France
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Cloudera, Inc.
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopChicago Hadoop Users Group
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application Apache Apex
 
Customer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCCustomer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCPrecisely
 

Similar to Impala Architecture presentation (20)

Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
 
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for Hadoop
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
 
Cloudera Impala presentation
Cloudera Impala presentationCloudera Impala presentation
Cloudera Impala presentation
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache HadoopJan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
 
Apache drill
Apache drillApache drill
Apache drill
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
 
Customer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCCustomer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDC
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
 

More from hadooparchbook

Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platformhadooparchbook
 
Architecting a next-generation data platform
Architecting a next-generation data platformArchitecting a next-generation data platform
Architecting a next-generation data platformhadooparchbook
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationshadooparchbook
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platformhadooparchbook
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platformhadooparchbook
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patternshadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applicationshadooparchbook
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detectionhadooparchbook
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an examplehadooparchbook
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detectionhadooparchbook
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoophadooparchbook
 
Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata LondonHadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata Londonhadooparchbook
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
 

More from hadooparchbook (20)

Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platform
 
Architecting a next-generation data platform
Architecting a next-generation data platformArchitecting a next-generation data platform
Architecting a next-generation data platform
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platform
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an example
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 
Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata LondonHadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata London
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
 

Recently uploaded

Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningVitsRangannavar
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 

Recently uploaded (20)

Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learning
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 

Impala Architecture presentation

  • 1.     Impala: A Modern, Open-Source SQL Engine for Hadoop   Mark  Grover   So+ware  Engineer,  Cloudera   January  7th,  2014   Twi@er:  mark_grover   github.com/markgrover/impala-­‐thug/  
  • 2. •  What  is  Impala  and  what  is  Hadoop?   •  ExecuPon  frameworks  on  Hadoop  –  MR,  Hive,  etc.   •  Goals  and  user  view  of  Impala   •  Architecture  of  Impala   •  Comparing  Impala  to  other  systems   •  Impala  Roadmap   Agenda  
  • 3. •  General-­‐purpose  SQL  engine   •  Real-­‐Pme  queries  in  Apache  Hadoop   What  is  Impala?  
  • 4. What  is  Apache  Hadoop?   4   Has the Flexibility to Store and Mine Any Type of Data §  Ask questions across structured and unstructured data that were previously impossible to ask or solve §  Not bound by a single schema Excels at Processing Complex Data §  Scale-out architecture divides workloads across multiple nodes §  Flexible file system eliminates ETL bottlenecks Scales Economically §  Can be deployed on commodity hardware §  Open source platform guards against vendor lock Hadoop Distributed File System (HDFS) Self-Healing, High Bandwidth Clustered Storage MapReduce Distributed Computing Framework Apache Hadoop is an open source platform for data storage and processing that is… ü  Distributed ü  Fault tolerant ü  Scalable CORE HADOOP SYSTEM COMPONENTS
  • 5. •  Batch  oriented   •  High  latency   •  Not  all  paradigms  fit  very  well   •  Only  for  developers   So,  what’s  wrong  with  MapReduce?  
  • 6. •  MR  is  hard  and  only  for  developers   •  Higher  level  pla]orms  for  converPng  declaraPve   syntax  to  MapReduce   •  SQL  –  Hive   •  workflow  language  –  Pig   •  Build  on  top  of  MapReduce   What  are  Hive  and  Pig?  
  • 7. •  General-­‐purpose  SQL  engine   •  Real-­‐Pme  queries  in  Apache  Hadoop   •  Beta  version  released  since  October  2012   •  General  availability  (v1.0)  release  out  since  April  2013   •  Open  source  under  Apache  license   •  Latest  release  (v1.2.3)  released  on  December  23rd   What  is  Impala?  
  • 8. Impala  Overview:  Goals   •  General-­‐purpose  SQL  query  engine:   •  Works  for  both  for  analyPcal  and  transacPonal/single-­‐row   workloads   •  Supports  queries  that  take  from  milliseconds  to  hours   •  Runs  directly  within  Hadoop:   •  reads  widely  used  Hadoop  file  formats   •  talks  to  widely  used  Hadoop  storage  managers     •  runs  on  same  nodes  that  run  Hadoop  processes   •  High  performance:   •  C++  instead  of  Java   •  runPme  code  generaPon   •  completely  new  execuPon  engine  –  No  MapReduce  
  • 9. User  View  of  Impala:  Overview   •  Runs  as  a  distributed  service  in  cluster:  one  Impala  daemon  on   each  node  with  data   •  Highly  available:  no  single  point  of  failure   •  User  submits  query  via  ODBC/JDBC,  Impala  CLI  or  Hue  to  any   of  the  daemons   •  Query  is  distributed  to  all  nodes  with  relevant  data   •  Impala  uses  Hive's  metadata  interface,  connects  to  Hive   metastore  
  • 10. User  View  of  Impala:  Overview   •  There  is  no  ‘Impala  format’!   •  There  is  no  ‘Impala  format’!!   •  Supported  file  formats:   •  uncompressed/lzo-­‐compressed  text  files   •  sequence  files  and  RCFile  with  snappy/gzip  compression   •  Avro  data  files   •  Parquet  columnar  format  (more  on  that  later)  
  • 11. User  View  of  Impala:  SQL   •  SQL  support:   •  essenPally  SQL-­‐92,  minus  correlated  subqueries   •  INSERT  INTO  …  SELECT  …   •  only  equi-­‐joins;  no  non-­‐equi  joins,  no  cross  products   •  Order  By  requires  Limit   •  (Limited)  DDL  support   •  SQL-­‐style  authorizaPon  via  Apache  Sentry  (incubaPng)   •  UDFs  and  UDAFs  are  supported  
  • 12. User  View  of  Impala:  SQL   •  FuncPonal  limitaPons:   •  No  file  formats,  SerDes   •  no  beyond  SQL  (buckets,  samples,  transforms,  arrays,   structs,  maps,  xpath,  json)   •  Broadcast  joins  and  parPPoned  hash  joins  supported   •  Smaller  table  has  to  fit  in  aggregate  memory  of  all  execuPng   nodes  
  • 13. User  View  of  Impala:  HBase   •  FuncPonality  highlights:   •  Support  for  SELECT,  INSERT  INTO  …  SELECT  …,  and  INSERT   INTO  …  VALUES(…)   •  Predicates  on  rowkey  columns  are  mapped  into  start/stop   rows   •  Predicates  on  other  columns  are  mapped  into   SingleColumnValueFilters   •  But:  mapping  of  HBase  tables  metastore  table   pa@erned  a+er  Hive   •  All  data  stored  as  scalars  and  in  ascii   •  The  rowkey  needs  to  be  mapped  into  a  single  string   column  
  • 14. User  View  of  Impala:  HBase   •  Roadmap   •  Full  support  for  UPDATE  and  DELETE   •  Storage  of  structured  data  to  minimize  storage  and  access   overhead   •  Composite  row  key  encoding,  mapped  into  an  arbitrary   number  of  table  columns  
  • 15. Impala  Architecture   •  Three  binaries:  impalad,  statestored,  catalogd   •  Impala  daemon  (impalad)  –  N  instances   •  handles  client  requests  and  all  internal  requests  related  to   query  execuPon   •  State  store  daemon  (statestored)  –  1  instance   •  Provides  name  service  and  metadata  distribuPon   •  Catalog  daemon  (catalogd)  –  1  instance   •  Relays  metadata  changes  to  all  impalad’s  
  • 16. Impala  Architecture   •  Query  execuPon  phases   •  request  arrives  via  odbc/jdbc   •  planner  turns  request  into  collecPons  of  plan   fragments   •  coordinator  iniPates  execuPon  on  remote   impalad’s  
  • 17. Impala  Architecture   •  During  execuPon   •  intermediate  results  are  streamed  between   executors   •  query  results  are  streamed  back  to  client   •  subject  to  limitaPons  imposed  to  blocking   operators  (top-­‐n,  aggregaPon)  
  • 18. Impala  Architecture:  Query  ExecuPon   Request  arrives  via  odbc/jdbc   Query  Planner   Query  Executor   HDFS  DN   HBase   SQL  App   ODBC   Query  Planner   Query  Coordinator   Query  Executor   HDFS  DN   HBase   Query  Planner   Query  Executor   HDFS  DN   HBase   SQL   request   Query  Coordinator   Query  Coordinator   HiveMeta store   HDFS  NN   Statestore   +   Catalogd  
  • 19. Impala  Architecture:  Query  ExecuPon   Planner  turns  request  into  collecPons  of  plan  fragments   Coordinator  iniPates  execuPon  on  remote  impalad's   Query  Planner   Query  Coordinator   Query  Executor   HDFS  DN   HBase   SQL  App   ODBC   Query  Planner   Query  Coordinator   Query  Executor   HDFS  DN   HBase   Query  Planner   Query  Coordinator   Query  Executor   HDFS  DN   HBase   HiveMeta store   HDFS  NN   Statestore   +   Catalogd  
• 20. Impala Architecture: Query Execution
  •  Intermediate results are streamed between impalads
  •  Query results are streamed back to the client
  [Diagram: same cluster layout; executors exchange intermediate results directly and the coordinator streams query results back to the SQL app]
• 21. Query Planning: Overview
  •  2-phase planning process:
     •  single-node plan: left-deep tree of plan operators
     •  plan partitioning: partition single-node plan to maximize scan locality, minimize data movement
  •  Parallelization of operators:
     •  all query operators are fully distributed
• 22. Query Planning: Single-Node Plan
  •  Plan operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange
• 23. Single-Node Plan: Example Query
  SELECT t1.custid,
         SUM(t2.revenue) AS revenue
  FROM LargeHdfsTable t1
  JOIN LargeHdfsTable t2 ON (t1.id1 = t2.id)
  JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
  WHERE t3.category = 'Online'
  GROUP BY t1.custid
  ORDER BY revenue DESC LIMIT 10;
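As an aside, the plan Impala builds for a statement like this can be inspected by prefixing it with EXPLAIN in impala-shell; the output lists the plan fragments, scan nodes, and chosen join strategy (the exact format varies by release):

  EXPLAIN
  SELECT t1.custid, SUM(t2.revenue) AS revenue
  FROM LargeHdfsTable t1
  JOIN LargeHdfsTable t2 ON (t1.id1 = t2.id)
  JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
  WHERE t3.category = 'Online'
  GROUP BY t1.custid
  ORDER BY revenue DESC LIMIT 10;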
• 24. Query Planning: Single-Node Plan
  •  Single-node plan for the example:
  [Diagram: left-deep operator tree – Scan t1 and Scan t2 feed the first HashJoin, its result and Scan t3 feed the second HashJoin, followed by Agg and TopN]
• 25. Query Planning: Distributed Plans
  •  Goals:
     o  maximize scan locality, minimize data movement
     o  full distribution of all query operators (where semantically correct)
  •  Parallel joins:
     o  broadcast join: join is collocated with left input; right-hand side table is broadcast to each node executing the join
        -> preferred for small right-hand side input
     o  partitioned join: both tables are hash-partitioned on join columns
        -> preferred for large joins
     o  cost-based decision based on column stats/estimated cost of data transfers
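The planner normally makes this choice from statistics, but it can also be steered with Impala's join hints. A sketch with hypothetical tables (BROADCAST replicates the right-hand side to every node; SHUFFLE forces a partitioned join):

  SELECT f.order_id, d.name
  FROM big_facts f
  JOIN [BROADCAST] small_dim d ON (f.dim_id = d.id);

  SELECT a.cust_id, b.amount
  FROM big_facts a
  JOIN [SHUFFLE] big_facts b ON (a.cust_id = b.cust_id);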
• 26. Query Planning: Distributed Plans
  •  Parallel aggregation:
     o  pre-aggregation where data is first materialized
     o  merge aggregation partitioned by grouping columns
  •  Parallel top-N:
     o  initial top-N operation where data is first materialized
     o  final top-N in single-node plan fragment
• 27. Query Planning: Distributed Plans
  •  In the example:
     o  scans are local: each scan receives its own fragment
     o  1st join: large x large -> partitioned join
     o  2nd join: large x small -> broadcast join
     o  pre-aggregation in fragment that materializes join result
     o  merge aggregation after repartitioning on grouping column
     o  initial top-N in fragment that does merge aggregation
     o  final top-N in coordinator fragment
• 28. Query Planning: Distributed Plans
  [Diagram: distributed plan for the example – Scan t1 and Scan t2 run at the HDFS DataNodes and are hash-partitioned on t1.id1 / t2.id into the first HashJoin; Scan t3 runs at the HBase RegionServers and is broadcast to the second HashJoin; Pre-Agg runs in the join fragment, MergeAgg and the initial TopN run after repartitioning on t1.custid, and the final TopN runs at the coordinator]
• 29. Metadata Handling
  •  Impala metadata:
     o  Hive's metastore: logical metadata (table definitions, columns, CREATE TABLE parameters)
     o  HDFS NameNode: directory contents and block replica locations
     o  HDFS DataNode: block replicas' volume ids
• 30. Metadata Handling
  •  Caches metadata: no synchronous metastore API calls during query execution
  •  impalad instances read metadata from the metastore at startup
  •  Catalog Service relays metadata changes when you run DDL or update metadata on one of the impalads
  •  REFRESH [<tbl>]: reloads metadata on all impalads (if you added new files via Hive)
  •  INVALIDATE METADATA: reloads metadata for all tables
  •  Roadmap: HCatalog
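In practice these are issued as ordinary statements from impala-shell. A small sketch, reusing the hypothetical page_views table from the earlier example:

  -- Files were appended to an existing table from outside Impala (e.g. via Hive):
  REFRESH page_views;

  -- Tables were created or dropped outside Impala; reload metadata for all tables:
  INVALIDATE METADATA;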
• 31. Impala Execution Engine
  •  Written in C++ for minimal execution overhead
  •  Internal in-memory tuple format puts fixed-width data at fixed offsets
  •  Uses intrinsics/special CPU instructions for text parsing, CRC32 computation, etc.
  •  Runtime code generation for "big loops"
• 32. Impala Execution Engine
  •  More on runtime code generation
     •  example of "big loop": insert batch of rows into hash table
     •  known at query compile time: # of tuples in a batch, tuple layout, column types, etc.
     •  generated at compile time: unrolled loop that inlines all function calls, contains no dead code, minimizes branches
     •  code generated using LLVM
• 33. Impala's Statestore
  •  Central system state repository
     •  name service (membership)
     •  metadata
     •  roadmap: other scheduling-relevant or diagnostic state
  •  Soft-state
     •  all data can be reconstructed from the rest of the system
     •  cluster continues to function when statestore fails, but per-node state becomes increasingly stale
  •  Sends periodic heartbeats
     •  pushes new data
     •  checks for liveness
• 34. Statestore: Why not ZooKeeper?
  •  ZK is not a good pub-sub system
     •  Watch API is awkward and requires a lot of client logic
     •  multiple round-trips required to get data for changes to a node's children
     •  push model is more natural for our use case
  •  Don't need all the guarantees ZK provides:
     •  serializability
     •  persistence
     •  prefer to avoid complexity where possible
  •  ZK is bad at the things we care about and good at the things we don't
• 35. Comparing Impala to Dremel
  •  What is Dremel?
     •  columnar storage for data with nested structures
     •  distributed scalable aggregation on top of that
  •  Columnar storage in Hadoop: Parquet
     •  stores data in appropriate native/binary types
     •  can also store nested structures similar to Dremel's ColumnIO
  •  Distributed aggregation: Impala
  •  Impala plus Parquet: a superset of the published version of Dremel (which didn't support joins)
• 36. More about Parquet
  •  What is it:
     •  container format for all popular serialization formats: Avro, Thrift, Protocol Buffers
     •  successor to Trevni
     •  jointly developed between Cloudera and Twitter
     •  open source; hosted on github
  •  Features
     •  rowgroup format: file contains multiple horizontal slices
     •  supports storing each column in a separate file
     •  supports fully shredded nested data; repetition and definition levels similar to Dremel's ColumnIO
     •  column values stored in native types (bool, int<x>, float, double, byte array)
     •  support for index pages for fast lookup
     •  extensible value encodings
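From SQL, converting an existing table to Parquet is a CREATE plus an INSERT … SELECT. A sketch reusing the hypothetical page_views table (in the Impala 1.x releases the format keyword is PARQUETFILE; later releases also accept PARQUET):

  CREATE TABLE page_views_parquet LIKE page_views STORED AS PARQUETFILE;

  INSERT OVERWRITE page_views_parquet
  SELECT * FROM page_views;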
• 37. Comparing Impala to Hive
  •  Hive: MapReduce as an execution engine
     •  high-latency, low-throughput queries
     •  fault-tolerance model based on MapReduce's on-disk checkpointing; materializes all intermediate results
     •  Java runtime allows for easy late-binding of functionality: file formats and UDFs
     •  extensive layering imposes high runtime overhead
  •  Impala:
     •  direct, process-to-process data exchange
     •  no fault tolerance
     •  an execution engine designed for low runtime overhead
• 38. Impala Roadmap: 2013
  •  Additional SQL:
     •  ORDER BY without LIMIT
     •  analytic window functions
     •  support for structured data types
  •  Improved HBase support:
     •  composite keys, complex types in columns, index nested-loop joins, INSERT/UPDATE/DELETE
• 39. Impala Roadmap: 2013
  •  Runtime optimizations:
     •  straggler handling
     •  improved cache management
     •  data collocation for improved join performance
  •  Resource management:
     •  goal: run exploratory and production workloads in the same cluster, against the same data, w/o impacting production jobs
• 40. Demo
  •  Uses Cloudera's Quickstart VM: http://tiny.cloudera.com/quick-start
• 41. Try it out!
  •  Open source! Available at cloudera.com, AWS EMR!
  •  We have packages for:
     •  RHEL 5/6, SLES 11, Ubuntu Lucid, Maverick, Precise, Debian, etc.
  •  Questions/comments? community.cloudera.com
  •  My twitter handle: mark_grover
  •  Slides at: github.com/markgrover/impala-thug