Running Cloudera Impala on PostgreSQL

By Chengzhong Liu
liuchengzhong@miaozhen.com
2013.12
Story coming from…
• Data gravity
• Why big data
• Why SQL on big data
Today's agenda
• Big data in Miaozhen 秒针系统
• Overview of Cloudera Impala
• Hacking practice in Cloudera Impala
• Perfor...
What happened at Miaozhen
• 3 billion ad impressions per day
• 20 TB data scan for report generation every morning
• 24 ser...
Before Hadoop
• Scrat
– PostgreSQL 9.1 cluster
– Write a simple proxy
– <2s for 2TB data scan

• Mobile Monitor
– Hadoop-l...
Problem & Chance
• Database cluster
• SQL on Hadoop
• Miscellaneous data
• Requirements
– Most data is relational
– SQL inte...
SQL on Hadoop
• Google Dremel
• Apache Drill
• Cloudera Impala
• Facebook Presto
• EMC Greenplum/Pivotal

Latency matters...
What’s this
• A kind of MPP engine
• In memory processing
• Small to big join
– Broadcast join

• Small result size
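
The "small to big join" with a broadcast strategy can be sketched as follows. This is a minimal illustration, not Impala's implementation: the function name, the row layout (dicts), and the sample tables are assumptions made for the example. The small table is copied to every partition of the big table, so the big table never has to move.

```python
from collections import defaultdict


def broadcast_hash_join(big_partitions, small_rows, big_key, small_key):
    """Inner-join every partition of a big table against a copy of a small table.

    Hypothetical helper for illustration only.
    """
    # Build phase: hash the small table once. In a cluster, this table is
    # what would be broadcast to every node.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[small_key]].append(row)

    # Probe phase: each partition is joined independently against the
    # broadcast copy, so no big-table rows cross the network.
    results = []
    for partition in big_partitions:
        local = []
        for row in partition:
            for match in lookup.get(row[big_key], []):
                local.append({**match, **row})
        results.append(local)
    return results


# Two partitions of a (made-up) impression table and a small campaign table.
impressions = [
    [{"id": "u1", "campaign": 10}, {"id": "u2", "campaign": 20}],
    [{"id": "u3", "campaign": 10}, {"id": "u4", "campaign": 30}],
]
campaigns = [{"campaign": 10, "name": "spring"}, {"campaign": 20, "name": "summer"}]
joined = broadcast_hash_join(impressions, campaigns, "campaign", "campaign")
```

The approach only pays off while the small side fits in memory on every node, which matches the "small to big join" and "small result size" constraints listed on the slide.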
Why Cloudera Impala
• The team moves fast
– UDF coming out
– Better join strategy on the way

• Good code base
– Modularize...
Typical Arch.
[Architecture diagram. Labels: SQL Interface; Meta Store; Query Planner (x3); Coordinator (x3) ...]
Our target
• An MPP database
– Built on PostgreSQL 9.1
– Scales well
– Speed

• A mixed data source MPP query engine
– Join t...
Hacking… from where
• Add, not change
– Scan Node type
– DB Meta info

• Put changes in configuration
– Thrift Protocol up...
Front end
• Meta store update
– Link data to the table name
– Table location management

• Front end
– Compute table locat...
Back end
• Coordinator
– pg host

• New scan node type
– db scan node
• Pg scan node
• Psql library using cursor
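
The slide only says that the new scan node reads PostgreSQL through a cursor. A rough sketch of that idea follows, under loud assumptions: Python's built-in sqlite3 module stands in for PostgreSQL purely so the example runs without a server, and the function name, table, and batch size are hypothetical. The point is that a cursor lets the scan node pull rows in fixed-size batches instead of materializing the whole result.

```python
import sqlite3


def scan_in_batches(conn, query, batch_size):
    """Yield row batches from a cursor rather than loading the full result.

    Hypothetical helper; sqlite3 is a stand-in for a PostgreSQL client library.
    """
    cur = conn.cursor()
    cur.execute(query)
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield batch
    cur.close()


# Made-up table with ten rows, scanned four rows at a time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, campaign INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(f"u{i}", i % 3) for i in range(10)]
)
batches = list(scan_in_batches(conn, "SELECT id, campaign FROM t ORDER BY id", 4))
batch_sizes = [len(b) for b in batches]
```

Each batch could then be handed to the next plan node (for example an aggregation) while the following batch is still being fetched.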
SQL Plan
• select count(distinct id) from table
– MapReduce-like process

[Plan diagram: HDFS/PG scan → Aggr.: group by id → Exchange node → Aggr. :...]
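
The full slide transcript lists the plan stages as: scan, group by id, exchange, group by id, count(id), exchange, sum(count(id)). The sketch below simulates those stages in a single process. It is an illustration under assumptions: the function name, the use of Python sets for the "group by id" steps, and hash routing for the exchange are choices made for this example, not details taken from the slides.

```python
def count_distinct(scan_partitions, num_nodes):
    """Simulate a distributed count(distinct id) in one process.

    Hypothetical helper for illustration only.
    """
    # Scan + first aggregation: each scan node groups by id locally,
    # which removes duplicates within its own partition.
    local_ids = [set(partition) for partition in scan_partitions]

    # Exchange: route each id to one node by hash so equal ids meet.
    # Second aggregation (group by id): the per-node set merges duplicates
    # that came from different scan nodes.
    shuffled = [set() for _ in range(num_nodes)]
    for ids in local_ids:
        for value in ids:
            shuffled[hash(value) % num_nodes].add(value)

    # count(id) on each node, then a final exchange to the coordinator,
    # which sums the partial counts.
    partial_counts = [len(ids) for ids in shuffled]
    return sum(partial_counts)


# Three made-up scan partitions with overlapping ids.
total = count_distinct([["a", "b", "a"], ["b", "c"], ["c", "d", "a"]], num_nodes=2)
```

Because every id lands on exactly one node after the first exchange, the partial counts never overlap and a plain sum is correct.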
Env.
• Ads impression logs
– 150 million, 100 KB/line

• 3 servers
– 24 cores
– 32 GB mem
– 2 TB x 12 HD
– 100 Mbps LAN

• Qu...
Performance
• Group by speed / core
• 20 M/s

[Chart series: impala, hive, pg+impala]
With index
Codegen on/off
• select count(distinct id) from t group by c
• select distinct id from t
• select id from t group by id h...
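
The third test query is cut off here; the full slide transcript gives it as a GROUP BY on id with two HAVING conditions built from count(case ...). The sketch below runs that query on a tiny made-up table to show what it selects: ids that occur with both c = '1' and c = '2'. Assumptions: sqlite3 stands in for Impala so the example is runnable, the sample rows are invented, and the unbalanced quote in the slide's second condition is read as c = '2'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, c TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("u1", "1"), ("u1", "2"),               # seen with both values
        ("u2", "1"),                            # only c = '1'
        ("u3", "2"),                            # only c = '2'
        ("u4", "1"), ("u4", "2"), ("u4", "2"),  # both, with a repeat
    ],
)

# count() ignores NULLs, so each CASE counts only the rows matching one value.
query = """
SELECT id FROM t
GROUP BY id
HAVING count(CASE WHEN c = '1' THEN 1 ELSE NULL END) > 0
   AND count(CASE WHEN c = '2' THEN 1 ELSE NULL END) > 0
LIMIT 10
"""
matching_ids = sorted(row[0] for row in conn.execute(query))
```

A query of this shape evaluates the same CASE expressions for every input row, which is the kind of per-row work the codegen on/off comparison is measuring.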
Multi-users
Conclusion
• Source quality
– Readable
– Google C++ style
– Robust

• MPP solution based on PG
– Proved perf.
– Easy to sc...
What’s next
• YARN integration
• UDF
• Join with big table
• BI roadmap
• Failover
Ref.
• Cloudera Impala online doc. & src
• http://files.meetup.com/1727991/Impala%20and%20BigQuery.ppt
• http://www.cub...
Thanks!
Q&A

Liu Chengzhong (刘诚忠): Running Cloudera Impala on PostgreSQL


Published on

BDTC 2013 Beijing China

Published in: Technology, Sports


  1. Running Cloudera Impala on PostgreSQL
     By Chengzhong Liu, liuchengzhong@miaozhen.com, 2013.12
  2. Story coming from…
     • Data gravity
     • Why big data
     • Why SQL on big data
  3. Today's agenda
     • Big data in Miaozhen 秒针系统
     • Overview of Cloudera Impala
     • Hacking practice in Cloudera Impala
     • Performance
     • Conclusions
     • Q&A
  4. What happened at Miaozhen
     • 3 billion ad impressions per day
     • 20 TB data scan for report generation every morning
     • 24-server cluster
     • Besides this
       – TV Monitor
       – Mobile Monitor
       – Site Monitor
       – …
  5. Before Hadoop
     • Scrat
       – PostgreSQL 9.1 cluster
       – Write a simple proxy
       – <2s for 2 TB data scan
     • Mobile Monitor
       – Hadoop-like distributed computing system
       – RabbitMQ + 3 computing servers
       – Write a MapReduce in C++
       – Handles 30 million to 500 million ad impressions
  6. Problem & Chance
     • Database cluster
     • SQL on Hadoop
     • Miscellaneous data
     • Requirements
       – Most data is relational
       – SQL interface
  7. SQL on Hadoop
     • Google Dremel
     • Apache Drill
     • Cloudera Impala
     • Facebook Presto
     • EMC Greenplum/Pivotal
     Latency matters
     [Diagram labels: Pig; Hive; Map Reduce; Impala/Drill/Pivotal/Presto; HDFS]
  8. What’s this
     • A kind of MPP engine
     • In-memory processing
     • Small to big join
       – Broadcast join
     • Small result size
  9. Why Cloudera Impala
     • The team moves fast
       – UDF coming out
       – Better join strategy on the way
     • Good code base
       – Modularized
       – Easy to add subclasses
     • Really fast
       – LLVM code generation
         • 80s/95s – uv test
       – Distributed aggregation tree
       – In-situ data processing (inside storage)
  10. Typical Arch.
     [Architecture diagram. Labels: SQL Interface; Meta Store; Query Planner (x3); Coordinator (x3); Exec Engine (x3)]
  11. Our target
     • An MPP database
       – Built on PostgreSQL 9.1
       – Scales well
       – Speed
     • A mixed data source MPP query engine
       – Join two tables in different sources
       – In fact…
  12. Hacking… from where
     • Add, not change
       – Scan Node type
       – DB Meta info
     • Put changes in configuration
       – Thrift Protocol update
         • TDBHostInfo
         • TDBScanNode
  13. Front end
     • Meta store update
       – Link data to the table name
       – Table location management
     • Front end
       – Compute table location
  14. Back end
     • Coordinator
       – pg host
     • New scan node type
       – db scan node
         • Pg scan node
         • Psql library using cursor
  15. SQL Plan
     • select count(distinct id) from table
       – MapReduce-like process
     [Plan diagram: HDFS/PG scan → Aggr.: group by id → Exchange node → Aggr.: group by id → Aggr.: count(id) → Exchange node → Aggr.: sum(count(id))]
  16. Env.
     • Ads impression logs
       – 150 million, 100 KB/line
     • 3 servers
       – 24 cores
       – 32 GB mem
       – 2 TB x 12 HD
       – 100 Mbps LAN
     • Query
       – Select count(id) from t group by campaign
       – Select count(distinct id) from t group by campaign
       – Select * from t where id = 'xxxxxxxx'
  17. Performance
     • Group by speed / core
     • 20 M/s
     [Chart series: impala, hive, pg+impala]
  18. With index
  19. Codegen on/off
     • select count(distinct id) from t group by c
     • select distinct id from t
     • select id from t group by id having count(case when c = '1' then 1 else null end) > 0 and count(case when c = '2' then 1 else null end) > 0 limit 10;
     [Chart series: en_codegen, dis_codegen]
  20. Multi-users
  21. Conclusion
     • Source quality
       – Readable
       – Google C++ style
       – Robust
     • MPP solution based on PG
       – Proved perf.
       – Easy to scale
     • Mixed engine usage
       – HDFS and DB
  22. What’s next
     • YARN integration
     • UDF
     • Join with big table
     • BI roadmap
     • Failover
  23. Ref.
     • Cloudera Impala online doc. & src
     • http://files.meetup.com/1727991/Impala%20and%20BigQuery.ppt
     • http://www.cubrid.org/blog/dev-platform/meetimpala-open-source-real-time-sql-querying-onhadoop/
     • http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/slides/Impala%20tech%20talk.pdf
     • @datascientist, @dongxicheng, @flyingsk, @zhh
  24. Thanks! Q&A