
Impala use case @ edge

Using Impala to enable near real time analytics


  2. • We are building innovative advertising management platforms to help our customers make smarter decisions and reach their business goals faster and better, in real time.
     • We are proud to have the most cutting-edge products and to lead the performance and video online advertising market, while striving to build long-term relationships with our clients and partners.
     • Our intention is to simplify the complexity of the ad-tech industry and give our customers the ability to earn more revenue using our products and services.
     • Edge is led by a team of industry veterans, with offices in NY, Tel-Aviv and Beijing, and employs over 100 team members.
  3. • Data & BI Team Leader at Edge
     • Experienced in a wide range of RDBMS technologies
     • Working with Hadoop since 2014
     • Certified Cloudera Administrator and trainer
     • Oracle Certified Professional
  4. • Big Data at Edge
     • Our Goals
     • About Impala
     • High-Level Overview
     • Why We Chose to Work with Impala
     • Our Challenges
     • Our Setup
     • Designing Impala Tables
  6. • Deliver insights on data in real time
       o Fraud detection
       o Time-series analysis
       o Predictive analytics
       o Interactive exploratory analytics on our data sets
     • Provide a convenient way to interact with the data
     • Continuously load batches of data and make them visible with minimal delay
     • Handle a high number of concurrent users
  7. • Cloudera's open source massively parallel processing (MPP) SQL query engine
     • Runs on Hadoop clusters
     • 100% open source, released under the Apache Software license
  8. • Does not rely on a general-purpose data processing engine such as MapReduce
     • Executes queries directly on the Hadoop cluster
     • Well-suited for executing interactive analytics queries on large data sets
     • Tables are really directories of files in HDFS
  9. • Impala Servers run on each node of a cluster
     • The Impala State Store Server is responsible for confirming which nodes are healthy and can accept new work
     • The Catalog Server (new in CDH5) is responsible for sending new metadata to all other Impala nodes
     • You can submit a query to the Impala Server running on any node
  10. • Supports a large set of SQL statements, including SELECT, INSERT, JOIN, subqueries, and SQL analytic functions
      • Highly compatible with HiveQL
      • Impala services can be deployed and managed using Cloudera Manager
      • Allows the usage of Hue for queries
      • Impala is certified to run against Tableau
  11. • Queries data stored in HDFS (provides distributed, high-performance queries)
      • Each Impala daemon can handle multiple concurrent client requests
      • Impala is pioneering the use of the Parquet file format, a columnar storage layout that is optimized for the large-scale queries typical in data warehouse scenarios
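Slide 11 describes Parquet as a columnar storage layout. The Python sketch below is a toy illustration of that idea only; it is not Parquet's actual on-disk format, and the sample rows and column names are hypothetical. It shows why an aggregate over a single column can be cheaper when values are stored column by column.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# This is NOT the Parquet format; it only illustrates the idea.

# Hypothetical sample data.
rows = [
    {"campaign": "a", "country": "US", "revenue": 1.5},
    {"campaign": "b", "country": "IL", "revenue": 2.0},
    {"campaign": "a", "country": "CN", "revenue": 0.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every full record is visited to sum one field.
total_from_rows = sum(record["revenue"] for record in rows)

# Column layout: only the "revenue" list is read.
total_from_columns = sum(columns["revenue"])
```

Both totals are the same; the difference is how much data has to be touched to compute them, which is what matters for wide tables and analytic queries.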
  12. • Allows the usage of partitioning
      • By default, all the data files for a table are located in a single directory. Partitioning is a technique for physically dividing the data during loading, based on values from one or more columns.
      • Impala is a widely adopted standard across the ecosystem, with many users and extensive documentation
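As a rough illustration of what slide 12 means by dividing data "based on values from one or more columns", the following Python sketch groups in-memory rows by their partition column values. It is a simplified model, not how Impala loads data; the function name `partition_rows`, the sample rows, and the `column=value` key format (borrowed from the Hive-style directory naming convention) are assumptions made for the example.

```python
from collections import defaultdict


def partition_rows(rows, partition_columns):
    """Group rows by the values of the partition columns.

    Each key mimics a Hive-style partition directory name such as
    'year=2017/month=03'. Hypothetical helper for illustration.
    """
    partitions = defaultdict(list)
    for row in rows:
        key = "/".join(f"{col}={row[col]}" for col in partition_columns)
        partitions[key].append(row)
    return dict(partitions)


# Hypothetical sample data.
events = [
    {"year": "2017", "month": "03", "clicks": 10},
    {"year": "2017", "month": "03", "clicks": 7},
    {"year": "2017", "month": "04", "clicks": 3},
]

layout = partition_rows(events, ["year", "month"])
```

A query that filters on `year` and `month` then only needs the rows under the matching key, instead of scanning everything in a single directory.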
  13. • Hadoop Cluster
        o Cluster sizing
        o Workload testing (query throughput and response time)
      • Database Design
        o Identify access patterns based on real use cases
        o Make sure we’re not generating too many partitions
        o Make sure the data in each partition is large enough
        o Design our “Star Schema” data warehouse
      • Data Types
        o Data consistency across Pig, Hive, and Impala
      • File formats
      • Tune queries
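The two partition-related checks on slide 13 (not too many partitions, enough data in each) come down to simple arithmetic. The sketch below assumes a hypothetical table size and hypothetical thresholds; the slides do not give concrete numbers, so `min_partition_bytes` and `max_partitions` are placeholders to be replaced with values appropriate for a real cluster.

```python
GIB = 1024 ** 3


def check_partitioning(total_bytes, partition_count,
                       min_partition_bytes, max_partitions):
    """Return warnings for a candidate partitioning scheme.

    Thresholds are supplied by the caller; the values used below
    are illustrative assumptions, not recommendations.
    """
    warnings = []
    if partition_count > max_partitions:
        warnings.append("too many partitions")
    average = total_bytes / partition_count
    if average < min_partition_bytes:
        warnings.append("partitions too small")
    return warnings


total = 2048 * GIB      # hypothetical table size: 2 TiB
daily = 2 * 365         # two years of daily partitions
hourly = daily * 24     # same period, hourly partitions

daily_warnings = check_partitioning(total, daily, 1 * GIB, 10_000)
hourly_warnings = check_partitioning(total, hourly, 1 * GIB, 10_000)
```

With these assumed numbers, daily partitioning passes both checks, while hourly partitioning produces many more, much smaller partitions and trips both.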
  15. ● Year = 2017
        ○ Month = 03
          ■ Day = 01
          ■ Day = 02
          ■ Day = 03
          ■ …
        ○ Month = 04
          ■ Day = 01
          ■ Day = 02
          ■ ...
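The year / month / day hierarchy on slide 15 can be sketched as one directory per day. The Python example below enumerates such directory names for a date range; the `year=.../month=.../day=...` path format follows the common Hive-style convention and is an assumption here, as is the helper name `partition_dirs`.

```python
from datetime import date, timedelta


def partition_dirs(start, end):
    """Yield one 'year=/month=/day=' path per day, start to end inclusive.

    Hypothetical helper; the path format is an assumed convention.
    """
    current = start
    while current <= end:
        yield f"year={current:%Y}/month={current:%m}/day={current:%d}"
        current += timedelta(days=1)


dirs = list(partition_dirs(date(2017, 3, 30), date(2017, 4, 2)))
```

Here `dirs` covers the last two days of March and the first two days of April 2017, so the range crosses a month boundary the same way the hierarchy on the slide does.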
  16. • Although we use a “Star Schema” design in Impala, there are a lot of architectural differences between our Impala layout and the old RDBMS system.
      • Keep that in mind and avoid carrying your existing RDBMS data storage and processing strategies over to Impala.
  19. • Instead of using MapReduce, Impala reads the HDFS data directly
      • Impala allows users to query data in HDFS using an SQL-like language
      • The administrative tasks related to Impala are greatly simplified by Cloudera Manager
  20. • Quickly get started with Cloudera using a preconfigured VM or a Docker image
      • Impala Frequently Asked Questions
      • More details on Apache Parquet
      • The Impala Cookbook
