2. • We are building innovative advertising management platforms that help our customers make
smarter decisions and reach their business goals faster, in real time.
• We are proud to have the most cutting-edge products and to lead the performance and video online
advertising market while striving to build long-term relationships with our clients and partners.
• Our intention is to simplify the complexity of the ad-tech industry and give our
customers the ability to earn more revenue while using our products and services.
• Edge is led by a team of industry veterans, with offices in NY, Tel-Aviv and Beijing, and employs
over 100 team members.
3. • Data & BI Team Leader at Edge
• Experienced in a wide range of RDBMS technologies
• Working with Hadoop since 2014
• Certified Cloudera Administrator and trainer
• Oracle Certified Professional
4. • Big Data at Edge
• Our Goals
• About Impala
• High Overview
• Why We Chose to Work with Impala
• Our Challenges
• Our Setup
• Designing Impala Tables
6.
• Deliver insights on data in real time
Fraud Detection
Time-series Analysis
Predictive analytics
Interactive exploratory analytics on our data sets
• Provide a convenient way to interact with the data
• Continuously load batches of data, and make them visible with
minimal delay.
• Handle a high number of concurrent users
7.
• Cloudera's open source massively parallel processing (MPP)
SQL query engine
• Runs on Hadoop clusters
• 100% open source, released under the Apache License
8.
• Does not rely on a general purpose
data processing engine such as
MapReduce
• Executes queries directly on the
Hadoop cluster
• Well-suited for executing interactive
analytics queries on large data sets
• Tables are really directories of files
in HDFS
9.
• Impala Servers run on each node of a cluster.
• The Impala State Store Server is responsible
for confirming which nodes are healthy and
can accept new work
• The Catalog Server (new in CDH5) is responsible for
sending new metadata to all other Impala nodes
• You can submit a query to the Impala Server running on any node
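Because any Impala Server can accept a query, a client can spread its connections across the nodes. The sketch below is only an illustration of that idea: the host names are hypothetical, the round-robin helper is our own, and `run_query` assumes the third-party `impyla` package and the default HiveServer2-protocol port 21050.

```python
import itertools
from typing import Iterator, List, Tuple

# Hypothetical host names; replace with nodes that run an Impala Server.
IMPALAD_HOSTS = ["worker1.example.com", "worker2.example.com", "worker3.example.com"]
HS2_PORT = 21050  # default HiveServer2-protocol port of the Impala daemon


def coordinator_cycle(hosts: List[str], port: int = HS2_PORT) -> Iterator[Tuple[str, int]]:
    """Return an endless round-robin iterator of (host, port) pairs."""
    if not hosts:
        raise ValueError("at least one Impala host is required")
    return itertools.cycle([(host, port) for host in hosts])


def run_query(sql: str, host: str, port: int = HS2_PORT):
    """Run one query on the given node and return all rows.

    Assumes the impyla client is installed (pip install impyla).
    """
    from impala.dbapi import connect  # imported lazily: third-party dependency

    conn = connect(host=host, port=port)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()
```

A caller would take `host, port = next(targets)` from `targets = coordinator_cycle(IMPALAD_HOSTS)` before each `run_query` call, so that no single node coordinates every query.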
10.
• Supports a large set of SQL statements and features, including SELECT,
INSERT, joins, subqueries, and analytic functions.
• Highly compatible with HiveQL
• Impala services can be deployed and managed with
Cloudera Manager
• Queries can be run from Hue
• Impala is certified to work with Tableau
11.
• Queries data stored in HDFS (provides distributed,
high-performance queries)
• Each Impala daemon can handle multiple concurrent client
requests
• Impala is pioneering the use of the Parquet file format, a
columnar storage layout that is optimized for large-scale queries
typical in data warehouse scenarios.
12.
• Supports partitioning
• By default, all the data files for a table are located in a
single directory. Partitioning is a technique for physically
dividing the data during loading, based on values from one
or more columns.
• Impala is a widely adopted standard across the ecosystem,
with many users and extensive documentation
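To make the partitioning and Parquet points concrete, here is a minimal sketch that builds the DDL for a partitioned Parquet table. The table and column names are hypothetical, and the STRING partition columns are an assumption chosen so that zero-padded values such as `03` are kept as written; the helper itself is ours, not part of Impala.

```python
from typing import Dict


def create_table_ddl(table: str, columns: Dict[str, str], partition_cols: Dict[str, str]) -> str:
    """Build a CREATE TABLE statement for a partitioned Parquet table.

    Partition columns are declared only in PARTITIONED BY, so they must
    not also appear in the regular column list.
    """
    overlap = set(columns) & set(partition_cols)
    if overlap:
        raise ValueError(f"columns listed twice: {sorted(overlap)}")
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns.items())
    parts = ", ".join(f"{name} {ctype}" for name, ctype in partition_cols.items())
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        "STORED AS PARQUET"
    )


# Hypothetical fact table partitioned by year, month and day.
ddl = create_table_ddl(
    "events",
    {"user_id": "BIGINT", "revenue": "DOUBLE"},
    {"year": "STRING", "month": "STRING", "day": "STRING"},
)
print(ddl)
```

Each distinct combination of partition values then becomes its own subdirectory under the table's directory in HDFS.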
13.
• Hadoop Cluster
o Cluster sizing
o Workload testing (query throughput and response time)
• Database Design
o Identify access pattern based on real use cases
o Make sure we’re not generating too many partitions
o Make sure the data in each partition is large enough
o Design our “Star Schema” data warehouse
• Data Types
o Data consistency across Pig, Hive, and Impala
• File formats
• Tune queries
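The two partition checks in the database design list come down to simple arithmetic. The sketch below estimates the partition count and the average partition size for a candidate scheme; the default thresholds (at least 256 MiB per partition, at most 30,000 partitions) are illustrative assumptions and should be replaced with values that suit your cluster.

```python
MIB = 1024 ** 2


def partition_report(total_bytes: int, n_days: int, values_per_day: int = 1,
                     min_partition_bytes: int = 256 * MIB,
                     max_partitions: int = 30_000) -> dict:
    """Estimate partition count and average size for a partitioning scheme.

    values_per_day is the number of distinct values of any extra partition
    key within one day (1 means partitioning by date only).
    The thresholds are assumptions, not limits enforced by Impala.
    """
    partitions = n_days * values_per_day
    avg_bytes = total_bytes / partitions
    return {
        "partitions": partitions,
        "avg_bytes": avg_bytes,
        "too_many": partitions > max_partitions,
        "too_small": avg_bytes < min_partition_bytes,
    }


# Two years of data at roughly 512 MiB per day, partitioned by day only.
print(partition_report(730 * 512 * MIB, n_days=730))
# Same data with a second partition key that takes 100 values per day.
print(partition_report(730 * 512 * MIB, n_days=730, values_per_day=100))
```

In the second call the same data is split across 100 times as many partitions, which trips both checks and suggests dropping the extra key or using a coarser date granularity.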
15.
● Year = 2017
○ Month = 03
■ Day = 01
■ Day = 02
■ Day = 03
■ …
○ Month = 04
■ Day = 01
■ Day = 02
■ ...
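The layout above maps directly onto HDFS directories and onto the statements that register each day's partition. A minimal sketch, assuming a hypothetical table `events` stored under `/warehouse/events` with string-typed `year`, `month` and `day` partition columns (so the zero padding shown above is kept):

```python
from datetime import date, timedelta
from typing import Iterator


def days_between(start: date, end: date) -> Iterator[date]:
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def partition_dir(table_dir: str, d: date) -> str:
    """Directory that holds the data files of one daily partition."""
    return f"{table_dir}/year={d.year}/month={d.month:02d}/day={d.day:02d}"


def add_partition_sql(table: str, d: date) -> str:
    """ALTER TABLE statement that registers one daily partition."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
        f"(year='{d.year}', month='{d.month:02d}', day='{d.day:02d}')"
    )


for d in days_between(date(2017, 3, 1), date(2017, 3, 3)):
    print(partition_dir("/warehouse/events", d))
    print(add_partition_sql("events", d))
```

Generating the statements from dates keeps directory names and partition values in step when batches are loaded day by day.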
16.
• Although we use a “Star Schema” design in Impala, there are many
architectural differences between our Impala layout and
the old RDBMS system.
• Keep that in mind and avoid using your existing RDBMS data
storage and processing strategies in Impala
19.
• Instead of using MapReduce, Impala reads the HDFS data
directly
• Impala allows users to query data in HDFS using an SQL-like
language
• The administrative tasks related to Impala are greatly simplified
by Cloudera Manager
20.
• Quickly get started with Cloudera using a preconfigured VM or a
Docker Image
• Impala Frequently Asked Questions
• More details on Apache Parquet
• The Impala Cookbook