The document discusses Impala, an SQL query engine for Hadoop. It provides an overview of Impala, details improvements in versions 1.4 and 2.0, and describes new features like subqueries, analytic functions, and data types. Performance optimizations like HDFS caching and partition pruning are also covered.
Real-time Big Data Analytics Engine using Impala
Jason Shih
Cloudera Impala is an open-source engine, released under the Apache License, that enables real-time, interactive analytical SQL queries over data stored in HBase or HDFS. The work was inspired by the Google Dremel paper, which is also the basis for Google BigQuery. Impala provides access to the same unified storage platform through its own distributed query engine and does not use MapReduce. It also shares the same metadata, SQL syntax (HiveQL-like), ODBC driver, and user interface (Hue Beeswax) as Hive. Beyond the traditional Hadoop approach, which aims to provide a low-cost solution for resilient, batch-oriented distributed data processing, more and more effort in the Big Data world is going into the right solution for ad-hoc, fast queries and real-time processing of large datasets. In this presentation, we'll explore how to run interactive queries in Impala, the advantages of the approach and its architecture, and how it optimizes data systems, including practical performance analysis.
2. Today's Topic
• What is Cloudera Impala?
• Impala 1.4 / 2.0 update
• Performance Improvement
• Query Language
• Resource Management and Security
• Others
3. Who am I?
• Pre-sales Solutions Architect
• joined Cloudera in 2011, the first Japanese employee at Cloudera
• email: sho@cloudera.com
• twitter: @shiumachi
5. What is Impala?
• MPP SQL query engine for the Hadoop environment
• written in native code for maximum hardware efficiency
• open-source! http://impala.io/
• supported by Cloudera, Amazon, and MapR
• History
  • 2012/10 Public Beta released
  • 2013/04 Impala 1.0 released
  • current version: Impala 2.0
6. Impala is easy to use
• create tables as virtual views over data stored in HDFS / HBase
  • schema metadata is stored in the Metastore, shared with Hive, Pig, etc.
• connect via ODBC / JDBC
• authenticate via Kerberos / LDAP
• run standard SQL
  • ANSI SQL-92 based
  • limited to SELECT and bulk INSERT
  • no correlated subqueries (now available in 2.0)
  • UDF / UDAF
7. Impala 1.4 (2014/07)
• DECIMAL(<precision>, <scale>)
• HDFS caching DDL
• column definition based on a Parquet file (CREATE TABLE … LIKE PARQUET)
• ORDER BY without LIMIT
• LDAP connections through TLS
• SHOW PARTITIONS
• YARN-integrated resource manager will be production ready
  • Llama HA support
• CREATE TABLE … STORED AS AVRO
• SUMMARY command in impala-shell (provides a high-level summary of the query plan)
• faster COMPUTE STATS
• performance improvements for partition pruning
• impala-shell supports UTF-8 characters
• additional built-ins from EDW systems
8. Impala 2.0 (2014/10)
• hash table can spill to disk
  • join and aggregate tables of arbitrary size
• subquery enhancements
  • allowed in WHERE clauses
  • EXISTS / NOT EXISTS
  • IN / NOT IN can operate on the result set from a subquery
  • correlated / uncorrelated subqueries
  • scalar subqueries
• SQL:2003-compliant analytic window functions
  • LEAD(), LAG(), RANK(), FIRST_VALUE(), etc.
• new data types: VARCHAR, CHAR
• security enhancements
  • multiple authentication methods
  • GRANT / REVOKE / CREATE ROLE / DROP ROLE / SHOW ROLES / etc.
• text + gzip / bzip2 / Snappy
• hints inside views
• QUERY_TIMEOUT_S
• DATE_PART() / EXTRACT()
• Parquet default block size changed to 256MB (was: 1GB)
• LEFT ANTI JOIN / RIGHT ANTI JOIN
• impala-shell can read settings from $HOME/.impalarc
10. HDFS caching
• When HDFS files are cached in memory, Impala can read the cached data without any disk reads, and without making an additional copy of the data in memory
  • avoids checksumming and data copies
• the new HDFS API is available in CDH 5.0
• configure the cache with Impala DDL
  • CREATE TABLE tbl_name CACHED IN '<pool>'
  • ALTER TABLE tbl_name ADD PARTITION … CACHED IN '<pool>'
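A minimal sketch of the DDL in context, assuming a cache pool named impala_pool already exists (e.g. created by an HDFS admin with hdfs cacheadmin -addPool impala_pool); the sales and logs tables are hypothetical:

-- cache a whole table in the pre-existing HDFS cache pool
CREATE TABLE sales (id INT, amount DECIMAL(9,2))
  CACHED IN 'impala_pool';

-- cache only one new partition of an existing partitioned table
ALTER TABLE logs ADD PARTITION (year=2014, month=10)
  CACHED IN 'impala_pool';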
11. Partition Pruning improvement
• Previously, Impala typically queried tables with up to approximately 3000 partitions. With the performance improvement in partition pruning, Impala can now comfortably handle tables with tens of thousands of partitions.
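A hedged illustration of what pruning means in practice; the logs table and its partition keys are hypothetical:

-- hypothetical partitioned table
CREATE TABLE logs (msg STRING) PARTITIONED BY (year INT, month INT);

-- a filter on the partition keys lets the planner skip every
-- non-matching partition instead of scanning it
SELECT COUNT(*) FROM logs WHERE year = 2014 AND month = 10;

-- EXPLAIN shows how many partitions survive pruning at the scan node
EXPLAIN SELECT COUNT(*) FROM logs WHERE year = 2014 AND month = 10;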
12. Spilling to Disk SQL Operation
• writes temporary data to disk when Impala is close to exceeding its memory limit
• in the PROFILE output, the BlockMgr.BytesWritten counter reports how much data was written to disk during the query
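One way to observe spilling from impala-shell, as a sketch; big_table and the 1g limit are illustrative, not recommendations:

-- cap per-node query memory so a large aggregation may spill
SET MEM_LIMIT=1g;
SELECT col, COUNT(*) FROM big_table GROUP BY col;
PROFILE;  -- inspect BlockMgr.BytesWritten in the profile of the last query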
14. Subquery
Scalar subquery: produces a result set with a single row containing a single column
SELECT x FROM t1 WHERE x > (SELECT MAX(y) FROM t2);
Uncorrelated subquery: does not refer to any tables from the outer block of the query
SELECT x FROM t1 WHERE x IN (SELECT y FROM t2);
Correlated subquery: compares one or more values from the outer query block to values referenced in the WHERE clause of the subquery
SELECT employee_name, employee_id FROM employees one
WHERE salary > (SELECT avg(salary) FROM employees two
                WHERE one.dept_id = two.dept_id);
15. Analytic Functions (a.k.a. Window Functions)
• supported in 2.0 and later
• supported functions
  • RANK() / DENSE_RANK()
  • FIRST_VALUE() / LAST_VALUE()
  • LAG() / LEAD()
  • ROW_NUMBER()
• aggregate functions are already implemented
  • MAX(), MIN(), AVG(), SUM(), etc.
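Before the fuller LAG() example on the next slide, a minimal RANK() sketch, reusing the hypothetical employees table from slide 14:

-- rank employees by salary within each department (highest salary = rank 1)
SELECT dept_id, employee_name,
       RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS salary_rank
FROM employees;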
16. Analytic Functions Example
For each day, the query prints the closing price alongside the previous day's closing price:

select stock_symbol, closing_date, closing_price,
  lag(closing_price, 1) over (partition by stock_symbol order by closing_date)
    as "yesterday closing"
from stock_ticker
order by closing_date;

+--------------+---------------------+---------------+-------------------+
| stock_symbol | closing_date        | closing_price | yesterday closing |
+--------------+---------------------+---------------+-------------------+
| JDR          | 2014-09-13 00:00:00 | 12.86         | NULL              |
| JDR          | 2014-09-14 00:00:00 | 12.89         | 12.86             |
| JDR          | 2014-09-15 00:00:00 | 12.94         | 12.89             |
| JDR          | 2014-09-16 00:00:00 | 12.55         | 12.94             |
| JDR          | 2014-09-17 00:00:00 | 14.03         | 12.55             |
| JDR          | 2014-09-18 00:00:00 | 14.75         | 14.03             |
| JDR          | 2014-09-19 00:00:00 | 13.98         | 14.75             |
+--------------+---------------------+---------------+-------------------+
17. Approximation features
• APPX_COUNT_DISTINCT query option
  • rewrites COUNT(DISTINCT) calls to use NDV()
  • speeds up the operation
  • allows multiple COUNT(DISTINCT) expressions in a single query
• APPX_MEDIAN()
  • returns a value that is approximately the median (midpoint) of the set of input values
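A short hedged sketch; the clicks and sales tables are hypothetical:

-- trade exactness for speed: both DISTINCT counts are rewritten to NDV(),
-- which also allows them to coexist in one query
SET APPX_COUNT_DISTINCT=true;
SELECT COUNT(DISTINCT user_id), COUNT(DISTINCT session_id) FROM clicks;

-- approximate midpoint of a numeric column
SELECT APPX_MEDIAN(amount) FROM sales;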
19. CREATE TABLE … LIKE PARQUET
• CREATE TABLE ... LIKE PARQUET 'hdfs_path_of_parquet_file'
• the column names and data types are automatically configured based on the Parquet data file
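For instance (the HDFS path and table name here are hypothetical):

-- derive the schema from an existing Parquet file instead of typing it out
CREATE TABLE new_sales
  LIKE PARQUET '/user/etl/sales/part-00000.parquet'
  STORED AS PARQUET;

DESCRIBE new_sales;  -- verify the inferred column names and types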
20. ORDER BY without LIMIT
• the LIMIT clause is now optional for queries that use the ORDER BY clause
• Impala automatically uses a temporary disk work area to perform the sort if the sort operation would otherwise exceed the Impala memory limit for a particular data node
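So a full sort, which earlier releases rejected without a LIMIT, now simply runs (orders is a hypothetical table):

-- valid in 1.4 and later without a LIMIT; spills to the temporary
-- disk work area if the sort exceeds the node's memory limit
SELECT customer_id, total FROM orders ORDER BY total DESC;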
22. ANTI JOIN
LEFT ANTI JOIN / RIGHT ANTI JOIN are supported in Impala 2.0

[localhost:21000] > create table t1 (x int);
[localhost:21000] > insert into t1 values (1), (2), (3), (4), (5), (6);

[localhost:21000] > create table t2 (y int);
[localhost:21000] > insert into t2 values (2), (4), (6);

[localhost:21000] > select x from t1 left anti join t2 on (t1.x = t2.y);
+---+
| x |
+---+
| 1 |
| 3 |
| 5 |
+---+
23. new data types
• DECIMAL (Impala 1.4)
  • column_name DECIMAL[(precision[,scale])]
  • with no precision or scale values, equivalent to DECIMAL(9,0)
• VARCHAR (Impala 2.0)
  • STRING with a maximum length
• CHAR (Impala 2.0)
  • STRING with a precise length
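A small sketch pulling the three types together (table and columns hypothetical):

CREATE TABLE products (
  price DECIMAL(9,2),  -- up to 9 digits in total, 2 after the decimal point
  sku   CHAR(8),       -- fixed length of 8, padded with spaces if shorter
  name  VARCHAR(64)    -- variable length, at most 64 characters
);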
24. new built-in functions
• EXTRACT(): returns one date or time field from a TIMESTAMP value
• TRUNC(): truncates date/time values to year, month, etc.
• ADD_MONTHS(): alias for MONTHS_ADD()
• ROUND(): rounds DECIMAL values
• for computing properties of statistical distributions
  • STDDEV()
  • STDDEV_SAMP() / STDDEV_POP()
  • VARIANCE()
  • VARIANCE_SAMP() / VARIANCE_POP()
• MAX_INT() / MIN_SMALLINT()
• IS_INF() / IS_NAN()
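A few hedged example calls (the table t and its ts / val columns are hypothetical):

-- pull one field out of a TIMESTAMP and truncate to the start of the month
SELECT EXTRACT(ts, 'year') AS yr, TRUNC(ts, 'MM') AS month_start FROM t;

-- sample vs. population variants of the distribution functions
SELECT STDDEV_SAMP(val), STDDEV_POP(val), VARIANCE_SAMP(val) FROM t;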
26. SUMMARY
• impala-shell command
• easy-to-digest overview of the timings for the different phases of execution for a query

[localhost:21000] > select avg(ss_sales_price) from store_sales where ss_coupon_amt = 0;
+---------------------+
| avg(ss_sales_price) |
+---------------------+
| 37.80770926328327   |
+---------------------+
[localhost:21000] > summary;
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail          |
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
| 03:AGGREGATE | 1      | 1.03ms   | 1.03ms   | 1     | 1          | 48.00 KB | -1 B          | MERGE FINALIZE  |
| 02:EXCHANGE  | 1      | 0ns      | 0ns      | 1     | 1          | 0 B      | -1 B          | UNPARTITIONED   |
| 01:AGGREGATE | 1      | 30.79ms  | 30.79ms  | 1     | 1          | 80.00 KB | 10.00 MB      |                 |
| 00:SCAN HDFS | 1      | 5.45s    | 5.45s    | 2.21M | -1         | 64.05 MB | 432.00 MB     | tpc.store_sales |
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
27. SET statement
• before Impala 2.0, SET could be used only in impala-shell
• in Impala 2.0, you can use SET in a client app through the JDBC / ODBC APIs
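In practice this means a BI tool or driver connection can tune options per session; a minimal sketch (option values illustrative, sales hypothetical):

-- with 2.0, a JDBC/ODBC client can issue these as ordinary statements
SET MEM_LIMIT=2g;
SET QUERY_TIMEOUT_S=60;
SELECT COUNT(*) FROM sales;  -- later queries on the session pick up the options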
29. Admission Control (Impala 1.3)
• fast and lightweight resource management mechanism
  • avoids oversubscription of resources for concurrent workloads
  • queries are queued when reaching configurable limits
• runs on every impalad
  • no SPOF
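The "configurable limits" live in impalad startup flags; a rough sketch of the default-pool flags from the CDH docs of that era (flag names as I recall them, values are placeholders, not recommendations):

# illustrative impalad flags for the default request pool
--default_pool_max_requests=50    # concurrent queries before queueing starts
--default_pool_max_queued=200     # queued queries before new ones are rejected
--default_pool_mem_limit=128g     # aggregate memory limit across the cluster
--queue_wait_timeout_ms=60000     # how long a query may wait in the queue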
30. YARN and Llama
• Llama: Low Latency Application MAster
• subdivides coarse-grain YARN scheduling into finer granularity for low-latency and short-lived queries
• Llama registers one long-lived AM per YARN pool
• Llama caches resources allocated by YARN for a short time, so that they can be quickly re-allocated to Impala queries
  • much faster than waiting for YARN
• Impala 1.4: GA; Llama HA support
31. Query Timeout
• a new query option, QUERY_TIMEOUT_S, lets you specify a timeout period in seconds for individual queries
• Note: the timeout clock for queries and sessions only starts ticking when the query or session is idle
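For example (the value is illustrative):

-- cancel a query after it has been idle for more than 30 seconds
SET QUERY_TIMEOUT_S=30;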
32. Security
• Impala 2.0 can accept either kind of auth. request
  • e.g. host A with Kerberos, and host B with LDAP
• security-related statements
  • GRANT
  • REVOKE
  • CREATE ROLE
  • DROP ROLE
  • SHOW ROLES
  • SHOW ROLE GRANT
• --disk_spill_encryption option
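A minimal sketch of how these statements combine, assuming Sentry authorization is enabled; the role, group, and table names are hypothetical:

CREATE ROLE analyst;
GRANT ROLE analyst TO GROUP analysts;           -- map an LDAP/OS group to the role
GRANT SELECT ON TABLE sales TO ROLE analyst;    -- attach a privilege to the role
SHOW ROLE GRANT GROUP analysts;                 -- inspect what the group holds
REVOKE SELECT ON TABLE sales FROM ROLE analyst;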
34. Text + gzip, bzip2, and Snappy
• in Impala 2.0 and later, Impala supports using text data files that employ gzip, bzip2, or Snappy compression
• use ROW FORMAT with delimiter and escape character to create the table

CREATE TABLE csv_compressed (a STRING, b STRING, c STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
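To get compressed data into such a table, one option is to move pre-compressed files into place; a sketch with a hypothetical HDFS path:

-- a gzipped text file can be loaded as-is; Impala decompresses while scanning
LOAD DATA INPATH '/user/etl/incoming/data.csv.gz' INTO TABLE csv_compressed;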