ETL with Apache Impala, by Athemaster

WE SEE,
WE REFORM.
Athemaster Co., Ltd. (炬識科技股份有限公司)
An ETL Case Study with Apache Impala on the Hadoop Platform
Chen, Chih Chun
2017/01/21

Speaker: Chen Chih Chun (陳之駿)
Expertise:
1.  System architecture design
2.  System software development
3.  System software testing
www.athemaster.com

Outline
•  Introduction
•  Case Study and Analysis
•  Methodology
•  Summary

Introduction
A Hadoop ETL Project
•  What is a Hadoop ETL project?
•  What are the components?
•  What are the characteristics?
•  Can those be defined by “equations”?

Hadoop = ???
Hadoop
= An Ecosystem
= Lots of Animals (and new ones keep coming ...)
= A Diverse Platform
= Tools fit for a job
= Chaos

ETL = ???
ETL
= Data × Systems
= [Data(src) + Data(target)] × [System(src) + System(target)]
= Many, many “pipelines”
= Humongous integrations ...
= Troubles

People = ???
People
= People(persons)
= SYS_ADMINS, POWER_USERS, USERS, 
HACKERS !!?? ... ...
= Tons of requirements
= Challenges

Hadoop + ETL + People =???
Hadoop + ETL + People
= ((Chaos)^Troubles)^Challenges
= Lots of Chaos!!!
= End of the world !!!!
…
…
What is missing?

Solution!!!
Order
= Organized logic
= Rules for doing things
= Methodologies, Disciplines
= Structures, Interfaces
= Engineering
The Final Equation:
(Hadoop + ETL + People)-Order
= Success

Conclusion
•  A successful Hadoop ETL project needs data engineering.
•  Data engineering means “the right people doing the right things to deliver the right stuff at the right time”.

Case Study and Analysis
•  Customer Main Use Cases
•  Final Customer Requirements
•  Customer Profile
•  Deliverable

Customer Main Use Cases
•  Data lake:
Store data from many sources for the current reporting system, retained for a specified period of time.
•  Data processing:
Produce standardized data for analytic tools by applying business logic.
•  Data mart:
Provide standardized data to the current reporting system by applying business logic.
•  In-time queries:
Data must be accessible within a specific time frame.

Final Customer Requirements
•  Replace the current SQL-based reporting system.
1,000+ reports a day.
•  Minimal code changes.
80% unchanged.
•  In-time queries.
3 ~ 5 GB / min
3,000 EPS (events per second)
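
To put the in-time query targets (3 ~ 5 GB/min and 3,000 EPS) in perspective, they can be converted into per-second and per-day volumes. The sketch below is plain arithmetic under stated assumptions: 1 GB is taken as 1,000 MB, the rates are assumed to be sustained around the clock, and EPS is read as events per second. The slide does not say how the two figures relate, so they are converted independently. The function names are illustrative only.

```python
SECONDS_PER_MINUTE = 60
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = 24 * 60 * 60


def mb_per_second(gb_per_minute: float) -> float:
    """Convert GB/min to MB/s, assuming 1 GB = 1,000 MB."""
    return gb_per_minute * 1000 / SECONDS_PER_MINUTE


def tb_per_day(gb_per_minute: float) -> float:
    """Convert GB/min to TB/day, assuming a sustained rate and 1 TB = 1,000 GB."""
    return gb_per_minute * MINUTES_PER_DAY / 1000


def events_per_day(events_per_second: int) -> int:
    """Convert events per second to events per day, assuming a sustained rate."""
    return events_per_second * SECONDS_PER_DAY


if __name__ == "__main__":
    for rate in (3, 5):  # the slide's 3 ~ 5 GB/min range
        print(f"{rate} GB/min = {mb_per_second(rate):.1f} MB/s "
              f"= {tb_per_day(rate):.2f} TB/day")
    print(f"3,000 EPS = {events_per_day(3000):,} events/day")
```

Under these assumptions the lower bound works out to 50 MB/s (about 4.3 TB/day) and the upper bound to roughly 83 MB/s (7.2 TB/day), while 3,000 EPS corresponds to about 259 million events per day.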

Customer Profile
•  End users know:
   •  SQL
   •  Java JDBC
•  End users are new to Hadoop.

Final Deliverable – The Structure
The basic logical component stack:
•  Sources
•  ETL
•  Storage
•  Metadata
•  JDBC
•  Targets

Deliverable: Mark I
•  Data in-take
•  ETL process
•  Data storage
•  File structure design
[Diagram components: Sources, Data in-take, HDFS]

Deliverable: Mark II
•  Data access
•  Basic reporting function
•  SQL statement porting
•  Code migration
•  Data processing workflow
•  Metadata design
   •  Schema
   •  Partitioning
[Diagram components: Sources, Data in-take, HDFS, Impala (external tables), Data Proc., JDBC, Report System]
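
Mark II exposes the files already landed in HDFS through Impala external tables, with schema and partitioning as part of the metadata design. As a minimal sketch of what such DDL could look like, the Python helper below builds a `CREATE EXTERNAL TABLE` statement and an `ALTER TABLE ... ADD PARTITION` statement as strings. Everything specific here is an assumption made for illustration and not the project's actual design: the table name `raw.events`, the columns, the `dt` partition column, the comma-delimited text format, and the HDFS path. Identifiers are not escaped, so this is suitable only for trusted input.

```python
from typing import Dict


def external_table_ddl(table: str, columns: Dict[str, str],
                       partition_col: str, location: str) -> str:
    """Build Impala DDL for a partitioned external table over delimited text."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({partition_col} STRING)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


def add_partition_ddl(table: str, partition_col: str,
                      value: str, location: str) -> str:
    """Build DDL that registers one partition directory with the table."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({partition_col}='{value}') "
        f"LOCATION '{location}/{partition_col}={value}'"
    )


if __name__ == "__main__":
    # Hypothetical table, columns, and path, for illustration only.
    columns = {"event_id": "BIGINT", "source": "STRING", "payload": "STRING"}
    print(external_table_ddl("raw.events", columns, "dt", "/data/raw/events"))
    print(add_partition_ddl("raw.events", "dt", "2017-01-21", "/data/raw/events"))
```

Because the table is external, dropping it in Impala leaves the underlying HDFS files in place, which suits a stage where the in-take process still owns the files.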

Deliverable: Mark III
•  Increase in data
   •  Quantity
   •  Complexity
•  In-time query
[Diagram components: Sources, Data in-take, HDFS, HBase, Impala (Parquet, internal table), Data Proc., JDBC, Report System, In-time Query]
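
Mark III moves the Impala layer to internal tables stored as Parquet. One common way to populate such a table is to create it with `STORED AS PARQUET` and copy one partition at a time with `INSERT OVERWRITE ... SELECT`. The sketch below builds those statements as strings. It is an assumption about how the step might be done, not the project's recorded procedure, and the table names (`raw.events`, `mart.events_parquet`), columns, and `dt` partition column are hypothetical.

```python
from typing import Dict, List


def parquet_table_ddl(table: str, columns: Dict[str, str],
                      partition_col: str) -> str:
    """Build Impala DDL for an internal (managed) Parquet table."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
        f"PARTITIONED BY ({partition_col} STRING) "
        "STORED AS PARQUET"
    )


def load_partition_sql(target: str, source: str, column_names: List[str],
                       partition_col: str, value: str) -> str:
    """Build a statement that rewrites one partition from the source table.

    With a static partition spec, the SELECT list omits the partition column.
    """
    select_list = ", ".join(column_names)
    return (
        f"INSERT OVERWRITE TABLE {target} "
        f"PARTITION ({partition_col}='{value}') "
        f"SELECT {select_list} FROM {source} "
        f"WHERE {partition_col}='{value}'"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    columns = {"event_id": "BIGINT", "payload": "STRING"}
    print(parquet_table_ddl("mart.events_parquet", columns, "dt"))
    print(load_partition_sql("mart.events_parquet", "raw.events",
                             list(columns), "dt", "2017-01-21"))
    print("COMPUTE STATS mart.events_parquet")
```

Using `INSERT OVERWRITE` on a single partition keeps the load step re-runnable: repeating it for the same date replaces that partition instead of appending duplicates.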

Final Deliverable
Why Impala?
•  Ease of use
   •  SQL-based
   •  JDBC
•  Packed with power
   •  Many built-in transformation functions, e.g. string, date, time.
•  Compatible with Hadoop, which makes data extraction and loading easy.
•  Reasonable performance
•  Well supported; good documentation.

Methodology
Heaven rewards the diligent; great virtue sustains all things (天道酬勤, 厚德載物)
Discipline and Integrity

Overview
•  Data Engineering Concepts
   •  Abstraction (Interface)
   •  Modularization
   •  Logging
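
The three concepts above can be illustrated together in a few lines. In the sketch below, an abstract `Step` class is the interface, each transformation is a small module behind that interface, and the pipeline runner logs record counts as each step finishes. All class and function names are hypothetical and chosen for illustration; the slides do not show the project's actual code.

```python
import logging
from abc import ABC, abstractmethod
from typing import Dict, List

Record = Dict[str, str]
logger = logging.getLogger("etl")


class Step(ABC):
    """Abstraction: every ETL step exposes the same run() interface."""

    @abstractmethod
    def run(self, records: List[Record]) -> List[Record]:
        ...


class TrimFields(Step):
    """Modularization: one small step that strips whitespace from all values."""

    def run(self, records: List[Record]) -> List[Record]:
        return [{k: v.strip() for k, v in r.items()} for r in records]


class DropEmpty(Step):
    """Another small step: drop records whose given field is missing or empty."""

    def __init__(self, field: str) -> None:
        self.field = field

    def run(self, records: List[Record]) -> List[Record]:
        return [r for r in records if r.get(self.field)]


def run_pipeline(steps: List[Step], records: List[Record]) -> List[Record]:
    """Logging: report how many records enter and leave each step."""
    for step in steps:
        before = len(records)
        records = step.run(records)
        logger.info("%s: %d -> %d records",
                    type(step).__name__, before, len(records))
    return records


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    rows = [{"id": " 1 ", "name": " a"}, {"id": "  ", "name": "b"}]
    print(run_pipeline([TrimFields(), DropEmpty("id")], rows))
```

Because the runner depends only on the `Step` interface, steps can be added, reordered, or tested in isolation without touching the pipeline code.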

Overview
•  Data Engineering Methods
   •  POC
   •  Architecture Design
   •  Implement UT
   •  BVT
   •  RPT
   •  SIT

Summary
Q & A

Thank you
info@athemaster.com
