Cloudera Impala 
LV Big Data Monthly Meetup #1 
November 5th 2014 
Maxime Dumas 
Systems Engineer
Thirty Seconds About Max 
• Systems Engineer 
• aka Sales Engineer 
• SoCal, AZ, NV 
• former coder of PHP 
• teaches meditation + yoga 
• from Montreal, Canada 
What Does Cloudera Do? 
• product 
• distribution of Hadoop components, Apache licensed 
• enterprise tooling 
• support 
• training 
• services (aka consulting) 
• community 
What This Talk Isn’t About 
• deploying 
• Puppet, Chef, Ansible, homegrown scripts, intern labor 
• sizing & tuning 
• depends heavily on data and workload 
• coding 
• unless you count XML or CSV or SQL 
• algorithms 
What is Cloudera Impala? 
Image: Public Domain, IFCAR
cloud·e·ra im·pal·a
/kloudˈi(ə)rə imˈpalə/
noun
a modern, open source, MPP SQL query engine for Apache Hadoop.
“Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”
Impala adoption 
Vendor Support, by Component (and Founder)
Component (and Founder)      Cloudera  MapR  Amazon  IBM  Pivotal  Hortonworks
Impala (Cloudera)            ✔         ✔     ✔       X    X        X
Hue (Cloudera)               ✔         ✔     X       X    X        ✔
Sentry (Cloudera)            ✔         ✔     X       ✔    ✔        X
Flume (Cloudera)             ✔         ✔     X       ✔    ✔        ✔
Parquet (Cloudera/Twitter)   ✔         ✔     X       ✔    ✔        X
Sqoop (Cloudera)             ✔         ✔     ✔       ✔    ✔        ✔
Ambari (Hortonworks)         X         X     X       X    ✔        ✔
Knox (Hortonworks)           X         X     X       X    X        ✔
Tez (Hortonworks)            X         X     X       X    X        ✔
Drill (MapR)                 X         ✔     X       X    X        X
The Apache Hadoop Ecosystem 
Quick and dirty, for context.
©2014 Cloudera, Inc. All rights reserved.
Why Hadoop? 
• Scalability 
• Simply scales just by adding nodes 
• Local processing to avoid network bottlenecks 
• Efficiency 
• Cost efficiency (<$1k/TB) on commodity hardware 
• Unified storage, metadata, security (no duplication or synchronization) 
• Flexibility 
• All kinds of data (blobs, documents, records, etc) 
• In all forms (structured, semi-structured, unstructured) 
• Store anything then later analyze what you need
Why “Ecosystem?” 
• In the beginning, just Hadoop 
• HDFS 
• MapReduce 
• Today, dozens of interrelated components 
• I/O 
• Processing 
• Specialty Applications 
• Configuration 
• Workflow 
HDFS 
• Distributed, highly fault-tolerant filesystem 
• Optimized for large streaming access to data 
• Based on Google File System 
• http://research.google.com/archive/gfs.html 
Lots of Commodity Machines 
Image: Yahoo! Hadoop cluster [OSCON ’07]
MapReduce (MR) 
• Programming paradigm 
• Batch oriented, not realtime 
• Works well with distributed computing 
• Lots of Java, but other languages supported 
• Based on Google’s paper 
• http://research.google.com/archive/mapreduce.html 
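The paradigm above can be sketched in a few lines. This is a single-process toy, not Hadoop's Java API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real job would spread the map and reduce calls across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

Each phase only sees its own slice of the data, which is what lets the framework run the phases in parallel on different nodes.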
Apache Hive 
• Abstraction of Hadoop’s Java API 
• HiveQL “compiles” down to MR 
• a “SQL-like” language 
• Eases analysis using MapReduce 
Apache Hive Metastore 
• Maps HDFS files to DB-like resources 
• Databases 
• Tables 
• Column/field names, data types 
• Roles/users 
• InputFormat/OutputFormat 
Architecture 
CLOUDERA’S ENTERPRISE DATA HUB 
[Diagram, top to bottom] 
• Processing: batch processing (MapReduce, Spark), analytic SQL (Impala), search (Solr), machine learning, stream processing (Spark), 3rd party apps; the diagram also labels partners and Mahout 
• Workload management: YARN 
• Storage for any type of data (unified, elastic, resilient, secure): filesystem (HDFS), online NoSQL (HBase) 
• Data management: Cloudera Navigator 
• System management: Cloudera Manager 
• Sentry
But wait… 
WHY DO WE NEED THIS? 
Cloudera Impala 
Familiar interface, but more powerful.
Cloudera Impala 
• Interactive query on Hadoop 
• think seconds, not minutes 
• ANSI-92 standard SQL 
• compatible with HiveQL 
• Native MPP query engine 
• built for low-latency queries 
• HDFS and HBase storage 
Cloudera Impala – Design Choices 
• Native daemons, written in C/C++ 
• No JVM, no MapReduce 
• Saturate disks on reads 
• Uses in-memory HDFS caching 
• Re-uses Hive metastore 
• Not as fault-tolerant as MapReduce 
Benefits of Impala 
Unlocks BI/analytics on Hadoop 
• Interactive SQL in seconds 
• Highly concurrent to handle 100s of users 
Native Hadoop flexibility 
• No data migration, conversion, or duplication required 
• Query existing Hadoop data 
• Run multiple frameworks on the same data at the same time 
• Supports Parquet for best-of-breed columnar performance 
Native MPP query engine designed into Hadoop: 
• Unified Hadoop storage 
• Unified Hadoop metadata (uses Hive and HCatalog) 
• Unified Hadoop security 
• Fine-grained role-based access controls with Sentry 
Apache-licensed open source 
Proven in Production
Cloudera Impala – Architecture 
• Impala Daemon 
• runs on every node 
• handles client requests 
• handles query planning & execution 
• State Store Daemon 
• provides name service 
• metadata distribution 
• used for finding data 
Impala Query Execution 
1) Request arrives via ODBC/JDBC/HUE/Shell 
[Diagram: a SQL App sends a SQL request over ODBC to one of three nodes. Each node runs a Query Planner, Query Coordinator, and Query Executor next to an HDFS DN and HBase. The Hive Metastore, HDFS NN, and Statestore sit alongside the nodes.]
Impala Query Execution 
2) Planner turns request into collections of plan fragments 
3) Coordinator initiates execution on impalad(s) local to data 
[Diagram: three nodes, each running a Query Planner, Query Coordinator, and Query Executor next to an HDFS DN and HBase, with the SQL App (ODBC), Hive Metastore, HDFS NN, and Statestore alongside.]
Impala Query Execution 
4) Intermediate results are streamed between impalad(s) 
5) Query results are streamed back to client 
[Diagram: three nodes, each running a Query Planner, Query Coordinator, and Query Executor next to an HDFS DN and HBase, with the Hive Metastore, HDFS NN, and Statestore alongside. Query results flow back to the SQL App over ODBC.]
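The five steps can be pictured as a scatter/gather loop. The sketch below is a toy model only: `NODE_DATA`, `run_fragment`, and `coordinate` are hypothetical names, the "nodes" are plain dictionary entries, and nothing here reflects Impala's actual C++ internals or wire protocol.

```python
from collections import Counter

# Hypothetical data layout: each "node" holds the rows stored on its local disks.
NODE_DATA = {
    "node1": [("NV", 10), ("CA", 5)],
    "node2": [("NV", 7), ("AZ", 3)],
    "node3": [("CA", 1)],
}


def run_fragment(rows):
    """Plan fragment run next to the data: partial SUM(value) GROUP BY key."""
    partial = Counter()
    for key, value in rows:
        partial[key] += value
    return partial


def coordinate(node_data):
    """Coordinator: run one fragment per node, then merge the partial results."""
    final = Counter()
    for _node, rows in node_data.items():
        # Counter.update adds the incoming counts to the running totals.
        final.update(run_fragment(rows))
    return dict(final)


if __name__ == "__main__":
    print(coordinate(NODE_DATA))
```

The point of the design is that only small partial aggregates cross the network; the raw rows are read where they are stored.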
Cloudera Impala – Results 
• Allows for fast iteration/discovery 
• How much faster? 
• 3-4x faster on I/O bound workloads 
• up to 45x faster on multi-MR queries 
• up to 90x faster on in-memory cache 
Latest SQL Performance 
Single User vs 10 User Response Time/Impala Times Faster 
Time (in seconds); lower bars = better 

Engine        Single User   10 Users   Impala Times Faster (Single User / 10 Users)
Impala        5             11         
Spark SQL     25            120        5.0x / 10.6x
Presto        37            302        7.4x / 27.4x
Hive-on-Tez   77            202        15.4x / 18.3x

Independent validation by IBM Research SQL-on-Hadoop VLDB paper: 
“Impala’s database architecture provides significant performance gains”
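The "times faster" multipliers are simply each engine's response time divided by Impala's. A minimal sketch of that arithmetic follows; the pairing of times to engines is inferred from the order of the chart labels, and `TIMES` and `speedup_vs_impala` are illustrative names. Because the chart shows rounded seconds, the computed ratios land close to, but not always exactly on, the multipliers printed on the slide.

```python
# Response times in seconds, as read from the chart labels (rounded values).
# Assumption: times are paired with engines in the order the labels appear.
TIMES = {
    "Impala": {"single": 5, "ten_users": 11},
    "Spark SQL": {"single": 25, "ten_users": 120},
    "Presto": {"single": 37, "ten_users": 302},
    "Hive-on-Tez": {"single": 77, "ten_users": 202},
}


def speedup_vs_impala(times, baseline="Impala"):
    """Return how many times slower each engine is than the baseline, per load."""
    base = times[baseline]
    return {
        engine: {load: seconds[load] / base[load] for load in seconds}
        for engine, seconds in times.items()
        if engine != baseline
    }


if __name__ == "__main__":
    for engine, ratios in speedup_vs_impala(TIMES).items():
        print(engine, {load: round(r, 1) for load, r in ratios.items()})
```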
Previous Milestones 
• Impala 1.0 (GA): Spring 2013 
• Impala 1.1 (Security): Summer 2013 
• Impala 1.2 (Usability): Fall 2013 
• Impala 1.3 (Resource Management): Spring 2014 
• Impala 1.4 (Extensibility): Summer 2014 
• Impala 2.0 (SQL): Fall 2014 
Analytic Database Capabilities
Cloudera Impala 2.0 
Window Functions 
“Aggregate function applied to a partition of the result set” (SQL 2003) 
Ex: 
sum(population) OVER (PARTITION BY city) 
rank() OVER (PARTITION BY state ORDER BY population) 
We’ve implemented most of the spec 
• PARTITION BY, ORDER BY 
• WINDOW 
• PRECEDING, FOLLOWING 
• ROWS 
• Any number of analytic functions in one query 
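To see what the two example expressions return, here is a small runnable sketch. It uses Python's built-in `sqlite3` module as a stand-in for Impala (an assumption made so the example runs anywhere; it needs SQLite 3.25 or newer for window functions). The `cities` table and its population figures are invented for illustration.

```python
import sqlite3


def state_rankings():
    """Run two window functions over a tiny, made-up cities table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cities (state TEXT, city TEXT, population INTEGER)")
    conn.executemany(
        "INSERT INTO cities VALUES (?, ?, ?)",
        [
            ("NV", "Las Vegas", 600),
            ("NV", "Reno", 230),
            ("AZ", "Phoenix", 1500),
            ("AZ", "Tucson", 520),
        ],
    )
    rows = conn.execute(
        """
        SELECT state, city, population,
               sum(population) OVER (PARTITION BY state) AS state_total,
               rank() OVER (PARTITION BY state ORDER BY population DESC) AS rank_in_state
        FROM cities
        ORDER BY state, rank_in_state
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in state_rankings():
        print(row)
```

Unlike GROUP BY, the window version keeps every input row and attaches the per-partition total and rank to it.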
Cloudera Impala 2.0 
Subqueries 
A query that is part of another query. Ex: 
select col from t1 
where col in 
(select c2 from t2) 
Support: 
• Correlated and uncorrelated subqueries. 
• IN, NOT IN, EXISTS, NOT EXISTS 
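The difference between an uncorrelated and a correlated subquery is easiest to see side by side. The sketch below again runs on Python's built-in `sqlite3` module as a stand-in for Impala (an assumption for portability); the tables `t1` and `t2` follow the slide's example, and the row values are made up.

```python
import sqlite3


def subquery_demo():
    """Return (rows matched by IN, rows matched by a correlated NOT EXISTS)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (col INTEGER)")
    conn.execute("CREATE TABLE t2 (c2 INTEGER)")
    conn.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,), (3,), (4,)])
    conn.executemany("INSERT INTO t2 VALUES (?)", [(2,), (4,), (5,)])

    # Uncorrelated: the inner query does not reference the outer row.
    in_rows = conn.execute(
        "SELECT col FROM t1 WHERE col IN (SELECT c2 FROM t2) ORDER BY col"
    ).fetchall()

    # Correlated: the inner query references t1.col from the outer row.
    not_exists_rows = conn.execute(
        """
        SELECT col FROM t1
        WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.c2 = t1.col)
        ORDER BY col
        """
    ).fetchall()
    conn.close()
    return [r[0] for r in in_rows], [r[0] for r in not_exists_rows]


if __name__ == "__main__":
    print(subquery_demo())
```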
Cloudera Impala 2.0 
Spill to disk joins & aggregations 
• Previously, if a query ran out of memory, Impala would abort it 
• This meant some big joins (fact table to fact table) could never run. 
• All operators that accumulate memory can now spill to disk if necessary. 
• Order by (Impala 1.4) 
• Join/Agg (Impala 2.0) 
• Analytic Functions (Impala 2.0) 
• Transparent to existing workloads 
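The general idea of spilling can be shown with a toy aggregation that writes partial results to temporary files once an in-memory budget is exceeded, then merges them at the end. This is not Impala's implementation: `spilling_count` and `max_keys_in_memory` are hypothetical names, and the final merge here still fits in memory, whereas a real engine would partition the data so that each merge step stays within budget.

```python
import json
import tempfile
from collections import Counter


def spilling_count(keys, max_keys_in_memory=2):
    """Count keys, spilling the in-memory table to disk when it grows too large.

    Returns (final counts, number of spills performed).
    """
    in_memory = Counter()
    spill_files = []
    for key in keys:
        in_memory[key] += 1
        if len(in_memory) > max_keys_in_memory:
            # Budget exceeded: write the partial counts out and start over.
            spill = tempfile.TemporaryFile(mode="w+")
            json.dump(in_memory, spill)
            spill_files.append(spill)
            in_memory = Counter()

    # Merge phase: combine what is still in memory with every spilled partial.
    total = Counter(in_memory)
    for spill in spill_files:
        spill.seek(0)
        total.update(json.load(spill))
        spill.close()
    return dict(total), len(spill_files)


if __name__ == "__main__":
    print(spilling_count(list("abcabcd")))
```

The caller sees the same result whether or not a spill happened, which is the sense in which spilling is transparent to existing workloads.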
Cloudera Impala 2.1 + 
• Nested data: enables queries on complex nested structures including maps, structs, and arrays (early 2015) 
• MERGE statement: enables merging updates into existing tables 
• Additional analytic SQL functionality: ROLLUP, CUBE, and GROUPING SETS 
• SQL set operators: MINUS, INTERSECT 
• Apache HBase CRUD: allows use of Impala for inserts and updates into HBase 
• UDTFs (user-defined table functions): for more advanced user functions and extensibility 
• Intra-node parallelized aggregations and joins: to provide even faster joins and aggregations on top of the performance gains of Impala 
• Parquet enhancements: continued performance gains including index pages 
• Amazon S3 integration
Quick Demo 
Hold onto something, folks.
Try It Out 
Apache-licensed open source 
• Download: cloudera.com/downloads 
• Email: impala-user@cloudera.org 
• Join: groups.cloudera.org 
Cloudera Live 
• Free, interactive tutorials at cloudera.com/live
Special thanks: 
LAS VEGAS BIG DATA 
Questions? 
Preferably related to the talk… or not.
Thank You! 
Maxime Dumas 
mdumas@cloudera.com 
We’re hiring.

More Related Content

What's hot

Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoop
markgrover
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
Shravan (Sean) Pabba
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
Mike Percy
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
hadooparchbook
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
Cloudera, Inc.
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Jeremy Beard
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data Success
Cloudera, Inc.
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
Andriy Zabavskyy
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Mike Percy
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
Chris George
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
Swiss Big Data User Group
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Cloudera, Inc.
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
markgrover
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
Ofir Manor
 
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon ValleyIntro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
markgrover
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Cloudera, Inc.
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
Jeff Holoman
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoop
markgrover
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
nvvrajesh
 

What's hot (20)

Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoop
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data Success
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon ValleyIntro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoop
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 

Viewers also liked

AWS re:Invent re:Cap - 데이터 분석: Amazon EC2 C4 Instance + Amazon EBS - 김일호
AWS re:Invent re:Cap - 데이터 분석: Amazon EC2 C4 Instance + Amazon EBS - 김일호AWS re:Invent re:Cap - 데이터 분석: Amazon EC2 C4 Instance + Amazon EBS - 김일호
AWS re:Invent re:Cap - 데이터 분석: Amazon EC2 C4 Instance + Amazon EBS - 김일호
Amazon Web Services Korea
 
clock-pro
clock-proclock-pro
clock-pro
huliang64
 
Cloudera Federal Forum 2014: Cloud Deployment for the Enterprise Data Hub
Cloudera Federal Forum 2014: Cloud Deployment for the Enterprise Data HubCloudera Federal Forum 2014: Cloud Deployment for the Enterprise Data Hub
Cloudera Federal Forum 2014: Cloud Deployment for the Enterprise Data Hub
Cloudera, Inc.
 
Challenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS MeetupChallenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS Meetup
Andrei Savu
 
Hadoop and Manufacturing
Hadoop and ManufacturingHadoop and Manufacturing
Hadoop and Manufacturing
Cloudera, Inc.
 
Oozie sweet
Oozie sweetOozie sweet
Oozie sweet
mislam77
 
CoreOS : 설치부터 컨테이너 배포까지
CoreOS : 설치부터 컨테이너 배포까지CoreOS : 설치부터 컨테이너 배포까지
CoreOS : 설치부터 컨테이너 배포까지
충섭 김
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Douglas Bernardini
 
Five Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWSFive Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWS
Cloudera, Inc.
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
Amazon Web Services
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
nzhang
 
HBase 훑어보기
HBase 훑어보기HBase 훑어보기
HBase 훑어보기
beom kyun choi
 
Hadoop Workshop using Cloudera on Amazon EC2
Hadoop Workshop using Cloudera on Amazon EC2Hadoop Workshop using Cloudera on Amazon EC2
Hadoop Workshop using Cloudera on Amazon EC2
IMC Institute
 
Docker + Kubernetes를 이용한 빌드 서버 가상화 사례
Docker + Kubernetes를 이용한 빌드 서버 가상화 사례Docker + Kubernetes를 이용한 빌드 서버 가상화 사례
Docker + Kubernetes를 이용한 빌드 서버 가상화 사례
NAVER LABS
 
처음 접하는 Oozie Workflow, Coordinator
처음 접하는 Oozie Workflow, Coordinator처음 접하는 Oozie Workflow, Coordinator
처음 접하는 Oozie Workflow, Coordinator
Kim Log
 
하둡 (Hadoop) 및 관련기술 훑어보기
하둡 (Hadoop) 및 관련기술 훑어보기하둡 (Hadoop) 및 관련기술 훑어보기
하둡 (Hadoop) 및 관련기술 훑어보기
beom kyun choi
 
AWS Cloud Kata | Bangkok - Getting to MVP
AWS Cloud Kata | Bangkok - Getting to MVPAWS Cloud Kata | Bangkok - Getting to MVP
AWS Cloud Kata | Bangkok - Getting to MVPAmazon Web Services
 
도커 무작정 따라하기: 도커가 처음인 사람도 60분이면 웹 서버를 올릴 수 있습니다!
도커 무작정 따라하기: 도커가 처음인 사람도 60분이면 웹 서버를 올릴 수 있습니다!도커 무작정 따라하기: 도커가 처음인 사람도 60분이면 웹 서버를 올릴 수 있습니다!
도커 무작정 따라하기: 도커가 처음인 사람도 60분이면 웹 서버를 올릴 수 있습니다!
pyrasis
 

Viewers also liked (20)

AWS re:Invent re:Cap - 데이터 분석: Amazon EC2 C4 Instance + Amazon EBS - 김일호
AWS re:Invent re:Cap - 데이터 분석: Amazon EC2 C4 Instance + Amazon EBS - 김일호AWS re:Invent re:Cap - 데이터 분석: Amazon EC2 C4 Instance + Amazon EBS - 김일호
AWS re:Invent re:Cap - 데이터 분석: Amazon EC2 C4 Instance + Amazon EBS - 김일호
 
clock-pro
clock-proclock-pro
clock-pro
 
Cloudera Federal Forum 2014: Cloud Deployment for the Enterprise Data Hub
Cloudera Federal Forum 2014: Cloud Deployment for the Enterprise Data HubCloudera Federal Forum 2014: Cloud Deployment for the Enterprise Data Hub
Cloudera Federal Forum 2014: Cloud Deployment for the Enterprise Data Hub
 
Hive case studies
Hive case studiesHive case studies
Hive case studies
 
Challenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS MeetupChallenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS Meetup
 
Hadoop and Manufacturing
Hadoop and ManufacturingHadoop and Manufacturing
Hadoop and Manufacturing
 
Oozie sweet
Oozie sweetOozie sweet
Oozie sweet
 
CoreOS : 설치부터 컨테이너 배포까지
CoreOS : 설치부터 컨테이너 배포까지CoreOS : 설치부터 컨테이너 배포까지
CoreOS : 설치부터 컨테이너 배포까지
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
 
Five Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWSFive Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWS
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
HBase 훑어보기
HBase 훑어보기HBase 훑어보기
HBase 훑어보기
 
Hadoop Workshop using Cloudera on Amazon EC2
Hadoop Workshop using Cloudera on Amazon EC2Hadoop Workshop using Cloudera on Amazon EC2
Hadoop Workshop using Cloudera on Amazon EC2
 
Docker + Kubernetes를 이용한 빌드 서버 가상화 사례
Docker + Kubernetes를 이용한 빌드 서버 가상화 사례Docker + Kubernetes를 이용한 빌드 서버 가상화 사례
Docker + Kubernetes를 이용한 빌드 서버 가상화 사례
 
처음 접하는 Oozie Workflow, Coordinator
처음 접하는 Oozie Workflow, Coordinator처음 접하는 Oozie Workflow, Coordinator
처음 접하는 Oozie Workflow, Coordinator
 
하둡 (Hadoop) 및 관련기술 훑어보기
하둡 (Hadoop) 및 관련기술 훑어보기하둡 (Hadoop) 및 관련기술 훑어보기
하둡 (Hadoop) 및 관련기술 훑어보기
 
Overview of Amazon Web Services
Overview of Amazon Web ServicesOverview of Amazon Web Services
Overview of Amazon Web Services
 
AWS Cloud Kata | Bangkok - Getting to MVP
AWS Cloud Kata | Bangkok - Getting to MVPAWS Cloud Kata | Bangkok - Getting to MVP
AWS Cloud Kata | Bangkok - Getting to MVP
 
도커 무작정 따라하기: 도커가 처음인 사람도 60분이면 웹 서버를 올릴 수 있습니다!
도커 무작정 따라하기: 도커가 처음인 사람도 60분이면 웹 서버를 올릴 수 있습니다!도커 무작정 따라하기: 도커가 처음인 사람도 60분이면 웹 서버를 올릴 수 있습니다!
도커 무작정 따라하기: 도커가 처음인 사람도 60분이면 웹 서버를 올릴 수 있습니다!
 

Similar to Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014

Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
Cloudera, Inc.
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
cdmaxime
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
huguk
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
James Chen
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
Shweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
Shweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
Shweta Patnaik
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
StampedeCon
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
Cloudera, Inc.
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Mladen Kovacevic
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
Spark Summit
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Data Con LA
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
gluent.
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
markgrover
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
hadooparchbook
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted Cloud
Colin Charles
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
Felicia Haggarty
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
Yahoo Developer Network
 

Similar to Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014 (20)

Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted Cloud
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
 

Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014

  • 1. Cloudera Impala LV Big Data Monthly Meetup #1 November 5th 2014 Maxime Dumas Systems Engineer
  • 2. Thirty Seconds About Max • Systems Engineer • aka Sales Engineer • SoCal, AZ, NV • former coder of PHP • teaches meditation + yoga • from Montreal, Canada
  • 3. What Does Cloudera Do? • product • distribution of Hadoop components, Apache licensed • enterprise tooling • support • training • services (aka consulting) • community
  • 4. What This Talk Isn’t About • deploying • Puppet, Chef, Ansible, homegrown scripts, intern labor • sizing & tuning • depends heavily on data and workload • coding • unless you count XML or CSV or SQL • algorithms
  • 5. What is Cloudera Impala?
  • 7. cloud·e·ra im·pal·a /kloudˈi(ə)rə imˈpalə/ noun a modern, open source, MPP SQL query engine for Apache Hadoop. “Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”
  • 8. Impala adoption: vendor support by component (and founder)
        Component (and Founder)      Cloudera  MapR  Amazon  IBM  Pivotal  Hortonworks
        Impala (Cloudera)            ✔         ✔     ✔       X    X        X
        Hue (Cloudera)               ✔         ✔     X       X    X        ✔
        Sentry (Cloudera)            ✔         ✔     X       ✔    ✔        X
        Flume (Cloudera)             ✔         ✔     X       ✔    ✔        ✔
        Parquet (Cloudera/Twitter)   ✔         ✔     X       ✔    ✔        X
        Sqoop (Cloudera)             ✔         ✔     ✔       ✔    ✔        ✔
        Ambari (Hortonworks)         X         X     X       X    ✔        ✔
        Knox (Hortonworks)           X         X     X       X    X        ✔
        Tez (Hortonworks)            X         X     X       X    X        ✔
        Drill (MapR)                 X         ✔     X       X    X        X
  • 9. The Apache Hadoop Ecosystem Quick and dirty, for context.
  • 10. Why Hadoop? • Scalability • Scales simply by adding nodes • Local processing to avoid network bottlenecks • Efficiency • Cost efficiency (<$1k/TB) on commodity hardware • Unified storage, metadata, security (no duplication or synchronization) • Flexibility • All kinds of data (blobs, documents, records, etc.) • In all forms (structured, semi-structured, unstructured) • Store anything, then later analyze what you need
  • 11. Why “Ecosystem?” • In the beginning, just Hadoop • HDFS • MapReduce • Today, dozens of interrelated components • I/O • Processing • Specialty Applications • Configuration • Workflow
  • 12. HDFS • Distributed, highly fault-tolerant filesystem • Optimized for large streaming access to data • Based on Google File System • http://research.google.com/archive/gfs.html
  • 13. Lots of Commodity Machines Image: Yahoo! Hadoop cluster [ OSCON ’07 ]
  • 14. MapReduce (MR) • Programming paradigm • Batch oriented, not realtime • Works well with distributed computing • Lots of Java, but other languages supported • Based on Google’s paper • http://research.google.com/archive/mapreduce.html
  • 15. Apache Hive • Abstraction of Hadoop’s Java API • HiveQL “compiles” down to MR • a “SQL-like” language • Eases analysis using MapReduce
  • 16. Apache Hive Metastore • Maps HDFS files to DB-like resources • Databases • Tables • Column/field names, data types • Roles/users • InputFormat/OutputFormat
  • 17. Architecture: Cloudera’s Enterprise Data Hub [diagram labels: 3rd party apps; batch processing (MapReduce, Spark); analytic SQL (Impala); search (Solr); machine learning; stream processing (Spark); partners, Mahout; workload management (YARN); storage for any type of data (unified, elastic, resilient, secure): filesystem (HDFS), online NoSQL (HBase); management components: Cloudera Navigator, Cloudera Manager, Sentry]
  • 18. But wait… WHY DO WE NEED THIS?
  • 19. 19
  • 20. Cloudera Impala Familiar interface, but more powerful.
  • 21. Cloudera Impala • Interactive query on Hadoop • think seconds, not minutes • ANSI-92 standard SQL • compatible with HiveQL • Native MPP query engine • built for low-latency queries • HDFS and HBase storage
  • 22. Cloudera Impala – Design Choices • Native daemons, written in C/C++ • No JVM, no MapReduce • Saturate disks on reads • Uses in-memory HDFS caching • Re-uses Hive metastore • Not as fault-tolerant as MapReduce
  • 23. Benefits of Impala Unlocks BI/analytics on Hadoop • Interactive SQL in seconds • Highly concurrent to handle 100s of users Native Hadoop flexibility • No data migration, conversion, or duplication required • Query existing Hadoop data • Run multiple frameworks on the same data at the same time • Supports Parquet for best-of-breed columnar performance Native MPP query engine designed into Hadoop: • Unified Hadoop storage • Unified Hadoop metadata (uses Hive and HCatalog) • Unified Hadoop security • Fine-grained role-based access controls with Sentry Apache-licensed open source Proven in Production
  • 24. Cloudera Impala – Architecture • Impala Daemon • runs on every node • handles client requests • handles query planning & execution • State Store Daemon • provides name service • metadata distribution • used for finding data
  • 25. Impala Query Execution 1) Request arrives via ODBC/JDBC/HUE/Shell [diagram: a SQL app sends a SQL request over ODBC to one of three nodes, each running a Query Planner, Query Coordinator, and Query Executor next to an HDFS DN and HBase; Hive Metastore, HDFS NN, and Statestore shown alongside]
  • 26. Impala Query Execution 2) Planner turns request into collections of plan fragments 3) Coordinator initiates execution on impalad(s) local to data [same diagram]
  • 27. Impala Query Execution 4) Intermediate results are streamed between impalad(s) 5) Query results are streamed back to client [same diagram, with query results returned to the SQL app]
  • 28. Cloudera Impala – Results • Allows for fast iteration/discovery • How much faster? • 3-4x faster on I/O bound workloads • up to 45x faster on multi-MR queries • up to 90x faster on in-memory cache
  • 29. Latest SQL Performance: Single User vs 10 User Response Time, in seconds (lower = better), with “Impala times faster” in parentheses
        Engine        Single user    10 users
        Impala        5              11
        Spark SQL     25 (5.0x)      120 (10.6x)
        Presto        37 (7.4x)      302 (27.4x)
        Hive-on-Tez   77 (15.4x)     202 (18.3x)
    Independent validation by IBM Research SQL-on-Hadoop VLDB paper: “Impala’s database architecture provides significant performance gains”
  • 30. Previous Milestones • Impala 1.0 (GA): Spring 2013 • Impala 1.1 (Security): Summer 2013 • Impala 1.2 (Usability): Fall 2013 • Impala 1.3 (Resource Management): Spring 2014 • Impala 1.4 (Extensibility): Summer 2014 • Impala 2.0 (SQL), Analytic Database Capabilities: Fall 2014
  • 31. Cloudera Impala 2.0 Window Functions “Aggregate function applied to a partition of the result set” (SQL 2003) Ex: sum(population) OVER (PARTITION BY city) rank() OVER (PARTITION BY state ORDER BY population) We’ve implemented most of the spec • PARTITION BY, ORDER BY • WINDOW • PRECEDING, FOLLOWING • ROWS • Any number of analytic functions in one query
  • 32. Cloudera Impala 2.0 Subqueries A query that is part of another query. Ex: select col from t1 where col in (select c2 from t2) Support: • Correlated and uncorrelated subqueries. • IN, NOT IN, EXISTS, NOT EXISTS
  • 33. Cloudera Impala 2.0 Spill to disk joins & aggregations • Previously, if a query ran out of memory, Impala would abort it • This means some big joins (fact table to fact table) could never run. • All operators that accumulate memory can now spill to disk if necessary. • Order by (Impala 1.4) • Join/Agg (Impala 2.0) • Analytic Functions (Impala 2.0) • Transparent to existing workloads
  • 34. Cloudera Impala 2.1 + • Nested data – enables queries on complex nested structures including maps, structs, and arrays (early 2015) • MERGE statement – enables merging updates into existing tables • Additional analytic SQL functionality – ROLLUP, CUBE, and GROUPING SETS • SQL SET operators – MINUS, INTERSECT • Apache HBase CRUD – allows use of Impala for inserts and updates into HBase • UDTFs (user-defined table functions) – for more advanced user functions and extensibility • Intra-node parallelized aggregations and joins – to provide even faster joins and aggregations on top of the performance gains of Impala • Parquet enhancements – continued performance gains including index pages • Amazon S3 integration
  • 35. Quick Demo Hold onto something, folks.
  • 36. Try It Out Apache-licensed open source • Download: cloudera.com/downloads • Email: impala-user@cloudera.org • Join: groups.cloudera.org Cloudera Live Free, Interactive Tutorials at cloudera.com/live
  • 37. Special thanks: LAS VEGAS BIG DATA
  • 38. Questions? Preferably related to the talk… or not.
  • 39. Thank You! Maxime Dumas mdumas@cloudera.com We’re hiring.
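
The window function and subquery examples on slides 31 and 32 can be tried out with a small sketch like the one below. It is only an illustration: it runs the SQL through Python's built-in sqlite3 module (SQLite 3.25 or later is assumed for window functions) rather than Impala, and the `cities` and `visited` tables and their rows are made up for the example.

```python
import sqlite3

# Assumption: SQLite stands in for Impala; tables and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (state TEXT, city TEXT, population INTEGER);
    INSERT INTO cities VALUES
        ('NV', 'Las Vegas', 600),
        ('NV', 'Reno', 240),
        ('AZ', 'Phoenix', 1500),
        ('AZ', 'Tucson', 530);
    CREATE TABLE visited (city TEXT);
    INSERT INTO visited VALUES ('Reno'), ('Tucson');
""")

# Window functions: an aggregate and a ranking computed per partition,
# without collapsing the rows the way GROUP BY would.
window_rows = conn.execute("""
    SELECT state, city,
           sum(population) OVER (PARTITION BY state) AS state_total,
           rank() OVER (PARTITION BY state ORDER BY population DESC) AS pop_rank
    FROM cities
    ORDER BY state, pop_rank
""").fetchall()

# Uncorrelated subquery with IN.
in_rows = conn.execute("""
    SELECT city FROM cities
    WHERE city IN (SELECT city FROM visited)
    ORDER BY city
""").fetchall()

# Correlated subquery with NOT EXISTS.
not_exists_rows = conn.execute("""
    SELECT c.city FROM cities c
    WHERE NOT EXISTS (SELECT 1 FROM visited v WHERE v.city = c.city)
    ORDER BY c.city
""").fetchall()

for row in window_rows:
    print(row)
print(in_rows, not_exists_rows)
```

Each row keeps its own city while also carrying the per-state total and rank, which is the difference between an analytic function and a plain GROUP BY aggregate.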

Editor's Notes

  1. Similar to the Red Hat model. Hadoop elephant logo licensed for public use via Apache license: Apache Software Foundation, http://www.apache.org/foundation/marks/
  2. Similar to the Red Hat model. Hadoop elephant logo licensed for public use via Apache license: Apache Software Foundation, http://www.apache.org/foundation/marks/
  3. Furthermore, for projects that carry the Apache License, open-ness does not always guarantee freedom from lock-in to a single support provider. For example, Drill, Knox, Tez, and Falcon are all open source, and all shipped by a single vendor – what’s a better example of “lock-in” than that?
  4. We’re going to breeze through these really quick, just to show how Search plugs in later…
  5. Lose a server, no problem. Lose a rack, no problem.
  6. We’re going to breeze through these really quick, just to show how Search plugs in later…
  7. More & Faster Value from Big Data: Provides an interactive BI/analytics experience on Hadoop; previously BI/analytics was impractical due to the batch orientation of MapReduce. Enables more users to gain value from organizational data assets (SQL/BI users). Makes more data available for analysis (raw data, multi-structured data, historical data). Removes delays from data migration into specialized analytical DBMSs, into proprietary file formats that happen to be stored in HDFS, or into transient in-memory stores. Flexibility: Query across existing data in Hadoop (HDFS and HBase). Access data immediately and directly in its native format. Select best-fit file formats: use raw data formats when unsure of access patterns (text files, RCFiles, LZO); increase performance with optimized file formats when access patterns are known (Parquet, Avro). Run multiple frameworks on the same data at the same time: all file formats are compatible across the entire Hadoop ecosystem (MapReduce, Pig, Hive, Impala, etc.). Cost Efficiency: Reduce movement, duplicate storage, and compute. Data movement: no time or resource penalty for migrating data into specialized systems or formats. Duplicate storage: no need to duplicate data across systems or within the same system in different file formats. Compute: use the same compute resources as the rest of the Hadoop system; you don’t need a separate set of nodes to run interactive query vs. batch processing (MapReduce), and you don’t need to overprovision your hardware to enable memory-intensive, on-the-fly format conversions. 1% to 10% of the cost of an analytic DBMS; less than $1,000/TB. Full Fidelity Analysis: No loss of fidelity from aggregations or conforming to fixed schemas. If the attribute exists in the raw data, you can query against it.
  8. These run continuously, always ready. In C/C++ for the most part.
  9. Impala 1.0: ~SQL-92 (minus correlated sub-queries); native Hadoop file formats (Parquet, Avro, text, Sequence, …); enterprise-readiness (authentication, ODBC/JDBC drivers, etc.); service-level resource isolation with other Hadoop frameworks. Impala 1.1: fine-grained, role-based authorization via Apache Sentry; auditing (Impala 1.1.1 and CM 4.7+). Impala 1.2: custom language extensibility (UDFs, UDAFs); cost-based join-order optimization; on-par performance compared to traditional MPP query engines while maintaining native Hadoop data flexibility. Impala 1.3 / CDH 5.0 (also has version for CDH 4.x): resource management.
  10. We do not support RANGE windows. Range windows let you specify a range based on the current row’s value (as opposed to ROWS, which is the ordinal). Example: sum(c) OVER (ORDER BY year RANGE BETWEEN 1 PRECEDING AND 2 FOLLOWING) Error: “RANGE is only supported with both the lower and upper bounds UNBOUNDED or one UNBOUNDED and the other CURRENT ROW." No UDA support. Not all aggregate functions are supported (ndv, etc.). Looking at both for 2.1.
  11. All subqueries are rewritten as joins. No “independent evaluation”. We’ve added additional join types to support this: LEFT/RIGHT ANTI-JOIN, RIGHT SEMI-JOIN, NULL AWARE LEFT ANTI JOIN. Subqueries are only supported in the WHERE clause. Impala can’t reason if a subquery returns one row in all cases: “select col limit 1” works; “select min(col)” works; “select min(col) group by x where x = 1” doesn’t. Can manually add a limit 1 to the subquery. See docs for more details. These should all have error messages explaining why. We implemented the common use cases.
  12. Impala hash partitions the input to the operator, spilling partitions as necessary. When all the input is partitioned, Impala processes the partitions that are still in memory (did not spill). Impala then processes the spilled partitions one by one, repartitioning if necessary. Impala tries to minimize the number of spilled bytes. Memory usage peaks when the first spill happens, stays high until all the non-spilled partitions have been handled, then drops as the spilled partitions are handled one by one.
  13. We’re going to breeze through these really quick, just to show how Search plugs in later…
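
The spill-to-disk flow described in note 12 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Impala's implementation: the function name `spilling_count`, the row budget `max_rows_in_memory`, the "spill the largest partition" policy, and the use of pickle temp files are all hypothetical choices, the buffer of raw keys stands in for operator memory, and the repartitioning step mentioned in the note is left out.

```python
import pickle
import tempfile
from collections import Counter


def spilling_count(keys, num_partitions=4, max_rows_in_memory=1000):
    """Count keys with a bounded in-memory buffer, spilling partitions to disk.

    Returns (counts, number_of_spilled_partitions).
    """
    in_memory = {p: [] for p in range(num_partitions)}  # partition id -> buffered keys
    spilled = {}  # partition id -> temp file of pickled keys
    rows_held = 0

    # Phase 1: hash-partition the input; when over budget, spill the
    # largest in-memory partition (a hypothetical victim policy).
    for key in keys:
        p = hash(key) % num_partitions
        if p in spilled:
            pickle.dump(key, spilled[p])  # partition already lives on disk
            continue
        in_memory[p].append(key)
        rows_held += 1
        if rows_held > max_rows_in_memory:
            victim = max(in_memory, key=lambda q: len(in_memory[q]))
            buffered = in_memory.pop(victim)
            spill_file = tempfile.TemporaryFile()
            for buffered_key in buffered:
                pickle.dump(buffered_key, spill_file)
            rows_held -= len(buffered)
            spilled[victim] = spill_file

    counts = Counter()

    # Phase 2: process the partitions that never spilled.
    for buffered in in_memory.values():
        counts.update(buffered)

    # Phase 3: process the spilled partitions one by one.
    for spill_file in spilled.values():
        spill_file.seek(0)
        while True:
            try:
                counts[pickle.load(spill_file)] += 1
            except EOFError:
                break
        spill_file.close()

    return counts, len(spilled)


keys = [i % 10 for i in range(1000)]
counts, n_spilled = spilling_count(keys, num_partitions=4, max_rows_in_memory=50)
print(counts[3], n_spilled)
```

Because every key always hashes to the same partition, each partition can be aggregated independently, which is what lets the spilled ones be handled later without changing the result.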