1
• We build innovative advertising management platforms that help our customers make
smarter decisions and reach their business goals faster, in real time.
• We are proud to have the most cutting-edge products and lead the performance and video online
advertising market while striving to build long-term relationships with our clients and partners.
• Our intention is to simplify the complexity existing in the ad-tech industry and provide our
customers with the ability to earn more revenue while using our products and services.
• Edge is led by a team of industry veterans, with offices in NY, Tel-Aviv and Beijing, and employs
over 100 team members.
2
• Data & BI Team Leader at Edge
• Experienced in a wide range of RDBMS technologies
• Working with Hadoop since 2014
• Certified Cloudera Administrator and trainer
• Oracle Certified Professional
3
• Big Data at Edge
• Our Goals
• About Impala
• High-Level Overview
• Why We Chose to Work with Impala
• Our Challenges
• Our Setup
• Designing Impala Tables
4
5
6
• Deliver insights on data in real time
Fraud Detection
Time-series Analysis
Predictive analytics
Interactive exploratory analytics on our data sets
• Provide a convenient way to interact with the data
• Continuously load batches of data, and make them visible with
minimal delay.
• Handle a high number of concurrent users
7
• Cloudera's open source massively parallel processing (MPP)
SQL query engine
• Runs on Hadoop clusters
• 100% open source, released under the Apache
License
8
• Does not rely on a general purpose
data processing engine such as
MapReduce
• Executes queries directly on the
Hadoop cluster
• Well-suited for executing interactive
analytics queries on large data sets
• Tables are really directories of files
in HDFS
9
• Impala Servers run on each node of a cluster.
• The Impala State Store Server is responsible
for confirming which nodes are healthy and
can accept new work
• The Catalog Server (new in CDH5) is
responsible for sending the new
metadata to all other Impala
nodes
• You can submit a query to the Impala Server running on any node
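
Because any Impala Server can accept a query, a client can spread its connections across nodes. The following is a minimal sketch of that idea, not the setup used at Edge: the host names are hypothetical, and port 21050 is assumed to be the port the daemons expose for HiveServer2-protocol clients, so check your own cluster configuration.

```python
from itertools import cycle

# Hypothetical host names, for illustration only.
IMPALAD_HOSTS = ["worker01", "worker02", "worker03"]
# Assumed client port; verify against your cluster settings.
IMPALAD_PORT = 21050


def coordinator_picker(hosts, port):
    """Return a function that hands out (host, port) pairs in round-robin order."""
    rotation = cycle(hosts)

    def pick():
        return next(rotation), port

    return pick


pick = coordinator_picker(IMPALAD_HOSTS, IMPALAD_PORT)
# Four consecutive picks wrap around to the first host again.
first_four = [pick() for _ in range(4)]
```

In practice a load balancer in front of the daemons is a common alternative to client-side rotation; the sketch only illustrates that no single node is special for query submission.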
10
• Supports a large set of SQL statements, including SELECT,
INSERT, JOIN, subqueries, and SQL analytic functions.
• Highly compatible with HiveQL
• Using Cloudera Manager, Impala services can be deployed and
managed
• Allows the usage of Hue for queries.
• Impala is certified to run against Tableau
11
• Queries data stored in HDFS (provides distributed,
high-performance queries)
• Each Impala daemon can handle multiple concurrent client
requests
• Impala is pioneering the use of the Parquet file format, a
columnar storage layout that is optimized for large-scale queries
typical in data warehouse scenarios.
12
• Allows the usage of partitioning
• By default, all the data files for a table are located in a
single directory. Partitioning is a technique for physically
dividing the data during loading, based on values from one
or more columns.
• Impala is a widely adopted standard across the ecosystem,
with many users and extensive documentation
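
To make the idea of physically dividing data by column values concrete, here is a small sketch that maps rows to the directory each would land in. It is an illustration under assumptions: the table root `/warehouse/events` and the `event_date` field are hypothetical, and the zero-padded month and day simply follow the notation used on the partition layout slide.

```python
from datetime import date


def partition_dir(table_root, event_date):
    """Build the directory that would hold the rows for one day."""
    return (
        f"{table_root}/year={event_date.year}"
        f"/month={event_date.month:02d}"
        f"/day={event_date.day:02d}"
    )


def group_by_partition(table_root, rows):
    """Map each partition directory to the rows that belong in it."""
    layout = {}
    for row in rows:
        key = partition_dir(table_root, row["event_date"])
        layout.setdefault(key, []).append(row)
    return layout


# Hypothetical sample rows: two on 1 March 2017, one on 2 March 2017.
rows = [
    {"event_id": 1, "event_date": date(2017, 3, 1)},
    {"event_id": 2, "event_date": date(2017, 3, 1)},
    {"event_id": 3, "event_date": date(2017, 3, 2)},
]
layout = group_by_partition("/warehouse/events", rows)
```

A query that filters on the partition columns only needs to read the matching directories, which is the reason partitioning pays off for time-bounded queries.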
13
• Hadoop Cluster
o Cluster sizing
o Workload testing (query throughput and response time)
• Database Design
o Identify access pattern based on real use cases
o Make sure we’re not generating too many partitions
o Make sure the data in each partition is large enough
o Design our “Star Schema” data warehouse
• Data Types
o Data consistency across Pig, Hive, and Impala
• File formats
• Tune queries
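
The two partitioning concerns above (not too many partitions, and enough data in each) can be checked with simple arithmetic before committing to a design. The sketch below is a rough estimate under stated assumptions: the 256 MiB minimum, the 10 TiB yearly volume, and the 500-customer figure are all illustrative values, not numbers from Edge.

```python
# Assumed minimum average partition size; tune this for your own cluster.
MIN_PARTITION_BYTES = 256 * 1024**2


def partition_stats(total_bytes, days_retained, partitions_per_day):
    """Return (partition count, average bytes per partition) for a design."""
    count = days_retained * partitions_per_day
    return count, total_bytes / count


def is_large_enough(avg_bytes, minimum=MIN_PARTITION_BYTES):
    """True when the average partition meets the assumed minimum size."""
    return avg_bytes >= minimum


TOTAL_BYTES = 10 * 1024**4  # hypothetical: 10 TiB of data per year

# Candidate 1: one partition per day.
daily_count, daily_avg = partition_stats(TOTAL_BYTES, 365, 1)

# Candidate 2: one partition per hour per customer (500 hypothetical customers).
fine_count, fine_avg = partition_stats(TOTAL_BYTES, 365, 24 * 500)
```

Under these assumptions the daily design yields 365 large partitions, while the hour-by-customer design yields several million partitions of a few megabytes each, which is the pattern the checklist warns against.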
14
15
● Year = 2017
○ Month = 03
■ Day = 01
■ Day = 02
■ Day = 03
■ …
○ Month = 04
■ Day = 01
■ Day = 02
■ ...
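
A year/month/day layout like the one above could be declared and populated with statements along these lines. This is a sketch that only generates SQL text: the table name `events` and its columns are hypothetical, and the integer partition columns are an assumption (the slide does not state the column types).

```python
from datetime import date, timedelta

TABLE = "events"  # hypothetical table name

# Hypothetical columns; only the PARTITIONED BY clause mirrors the slide.
CREATE_TABLE = f"""
CREATE TABLE IF NOT EXISTS {TABLE} (
  event_id BIGINT,
  customer_id BIGINT,
  revenue DOUBLE
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET
""".strip()


def add_partition_sql(table, day):
    """Build the statement that registers one daily partition."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year={day.year}, month={day.month}, day={day.day})"
    )


def daily_partitions(table, start, days):
    """Build one ADD PARTITION statement per consecutive day."""
    return [add_partition_sql(table, start + timedelta(days=i)) for i in range(days)]


# Three consecutive days that cross the March/April boundary.
statements = daily_partitions(TABLE, date(2017, 3, 30), 3)
```

The generated strings can be submitted through any client that speaks to Impala; generating them from dates avoids hand-maintaining the partition list.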
16
• Although we use a “Star Schema” design in Impala, there are a
lot of architectural differences between our Impala layout and
the old RDBMS system.
• Keep that in mind and avoid using your existing RDBMS data
storage and processing strategies in Impala
17
18
19
• Instead of using MapReduce, Impala reads the HDFS data
directly
• Impala allows users to query data in HDFS using an SQL-like
language
• The administrative tasks related to Impala are greatly simplified
by Cloudera Manager
20
• Quickly get started with Cloudera using a preconfigured VM or a
Docker Image
• Impala Frequently Asked Questions
• More details on Apache Parquet
• The Impala Cookbook
21

Impala use case @ edge
