Chengzhong Liu (刘诚忠): Running Cloudera Impala on PostgreSQL

BDTC 2013 Beijing China

Running Cloudera Impala on PostgreSQL

By Chengzhong Liu
liuchengzhong@miaozhen.com
2013.12

Story coming from…
• Data gravity
• Why big data
• Why SQL on big data

Today's agenda
• Big data in Miaozhen (秒针系统)
• Overview of Cloudera Impala
• Hacking practice in Cloudera Impala
• Performance
• Conclusions
• Q&A

What happened at Miaozhen
• 3 billion ad impressions per day
• 20TB data scan for report generation every morning
• 24-server cluster
• Besides this
– TV Monitor
– Mobile Monitor
– Site Monitor
– …

Before Hadoop
• Scrat
– PostgreSQL 9.1 cluster
– Wrote a simple proxy
– <2s for 2TB data scan

• Mobile Monitor
– Hadoop-like distributed computing system
– RabbitMQ + 3 computing servers
– Wrote a MapReduce in C++
– Handles 30 million to 500 million ad impressions

Problems & Opportunities
• Database cluster
• SQL on Hadoop
• Miscellaneous data
• Requirements
– Most data is relational
– SQL interface

SQL on Hadoop
• Google Dremel
• Apache Drill
• Cloudera Impala
• Facebook Presto
• EMC Greenplum/Pivotal

Latency matters
[Diagram: Pig, Hive, Impala/Drill/Pivotal/Presto, MapReduce, HDFS]

What’s this
• A kind of MPP engine
• In-memory processing
• Small-to-big join
– Broadcast join

• Small result size
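
A broadcast join, as named on this slide, can be sketched in a few lines. This is only an illustration of the general technique, not Impala's implementation: the function name, the sample tables, and the use of Python lists as stand-ins for nodes are all assumptions.

```python
from collections import defaultdict

def broadcast_hash_join(big_partitions, small_table, key):
    """Inner-join every partition of the big table against the whole small table."""
    # Build a hash table from the small table once. In an MPP engine a copy
    # of this table would be shipped ("broadcast") to every node.
    build = defaultdict(list)
    for row in small_table:
        build[row[key]].append(row)

    joined = []
    for partition in big_partitions:      # each partition stands in for one node
        for row in partition:             # big-table rows never leave their node
            for match in build.get(row[key], []):
                joined.append({**match, **row})
    return joined

# Hypothetical sample data: a small dimension table and a partitioned fact table.
campaigns = [{"campaign": 1, "name": "spring"}, {"campaign": 2, "name": "summer"}]
impressions = [
    [{"id": "a", "campaign": 1}, {"id": "b", "campaign": 2}],  # node 1
    [{"id": "c", "campaign": 1}, {"id": "d", "campaign": 3}],  # node 2
]
result = broadcast_hash_join(impressions, campaigns, "campaign")
```

Only the small table moves over the network, which is why the approach suits small-to-big joins and breaks down when both sides are large.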

Why Cloudera Impala
• The team moves fast
– UDF support coming out
– Better join strategy on the way

• Good code base
– Modular
– Easy to add subclasses

• Really fast
– LLVM code generation
• 80s/95s – uv test

– Distributed aggregation tree
– In-situ data processing (inside storage)

Typical Arch.
[Diagram: SQL Interface and Meta Store; three nodes, each with a Query Planner, a Coordinator, and an Exec Engine]

Our target
• An MPP database
– Built on PostgreSQL 9.1
– Scale well
– Speed

• A mixed data source MPP query engine
– Join two tables in different sources
– In fact…

Hacking… where to start
• Add, not change
– Scan Node type
– DB Meta info

• Put changes in configuration
– Thrift Protocol update
• TDBHostInfo
• TDBScanNode

Front end
• Meta store update
– Link data to the table name
– Table location management

• Front end
– Compute table location

Back end
• Coordinator
– pg host

• New scan node type
– db scan node
• Pg scan node
• Psql library using cursor
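
The idea of a scan node that reads through a cursor can be sketched as below. The slide only says that the Pg scan node uses a cursor; everything else here is an assumption for illustration. In particular, Python's built-in sqlite3 module stands in for PostgreSQL so the example runs without a server, and the function name and batch size are hypothetical.

```python
import sqlite3

def scan_in_batches(conn, query, batch_size):
    """Yield row batches from a cursor instead of loading the whole result set."""
    cur = conn.cursor()
    cur.execute(query)
    while True:
        batch = cur.fetchmany(batch_size)   # fetch at most batch_size rows
        if not batch:
            break
        yield batch

# Stand-in database (sqlite3, NOT PostgreSQL) with a tiny hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, campaign INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(f"id{i}", i % 3) for i in range(10)],
)
batches = list(scan_in_batches(conn, "SELECT id, campaign FROM t", batch_size=4))
```

Fetching in batches keeps memory bounded on the scan side and lets downstream operators start on the first rows before the scan finishes.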

SQL Plan
• select count(distinct id) from table
– MR-like process

HDFS/PG scan
Aggr.: group by id

Exchange node
Aggr.: group by id

Aggr.: count(id)

Exchange node

Aggr.: sum(count(id))
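
The plan above can be simulated in plain Python to show why it gives the right answer. This is a sketch of the stages named on the slide, not Impala code: the function name, the `num_mergers` parameter, and the use of lists as scan partitions are assumptions.

```python
from collections import defaultdict

def count_distinct(partitions, num_mergers=2):
    """Simulate the staged plan for: select count(distinct id) from table."""
    # Scan + first aggregation: each scan node groups by id locally,
    # which removes duplicates inside that node.
    local_groups = [set(rows) for rows in partitions]

    # Exchange + second aggregation: ids are routed by hash, so every copy
    # of the same id reaches the same merger and is grouped again there.
    merger_inputs = defaultdict(set)
    for group in local_groups:
        for value in group:
            merger_inputs[hash(value) % num_mergers].add(value)

    # Each merger counts the ids it owns.
    partial_counts = [len(ids) for ids in merger_inputs.values()]

    # Final exchange to the coordinator: sum the partial counts.
    return sum(partial_counts)

# Hypothetical ids spread over three scan nodes, with duplicates across nodes.
partitions = [["a", "b", "a"], ["b", "c"], ["c", "d", "a"]]
distinct_ids = count_distinct(partitions)
```

Because the hash exchange sends each id to exactly one merger, the partial counts never overlap and can simply be summed.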

Env.
• Ads impression logs
– 150 million lines, 100KB/line

• 3 servers
– 24 cores
– 32 GB memory
– 12 × 2 TB hard disks
– 100Mbps LAN

• Queries
– select count(id) from t group by campaign
– select count(distinct id) from t group by campaign
– select * from t where id = 'xxxxxxxx'

Performance
• Group by speed / core
• 20 M /s
[Chart series: impala, hive, pg+impala]

Codegen on/off
• select count(distinct id)
from t group by c
• select distinct id
from t
• select id from t
group by id
having
count(case when c = '1' then 1 else null end) > 0
and
count(case when c = '2' then 1 else null end) > 0
limit 10;

[Chart series: en_codegen, dis_codegen]
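
To make the third test query concrete: it returns the ids that appear with both c = '1' and c = '2'. The sketch below runs it on a tiny made-up table. Python's built-in sqlite3 module is used here only as a convenient SQL engine (it is not Impala or PostgreSQL), and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, c TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("a", "1"), ("a", "2"),              # has both values: kept
        ("b", "1"), ("b", "1"),              # only '1': dropped
        ("c", "2"),                          # only '2': dropped
        ("d", "2"), ("d", "1"), ("d", "3"),  # has both, plus another: kept
    ],
)

# count() ignores NULLs, so each conditional count is the number of rows
# in the group that match that value of c.
query = """
select id from t
group by id
having
  count(case when c = '1' then 1 else null end) > 0
  and
  count(case when c = '2' then 1 else null end) > 0
limit 10
"""
ids = sorted(row[0] for row in conn.execute(query))
```

With these rows, `ids` is `["a", "d"]`. Every row pays for the per-row case expressions, which is the kind of expression evaluation that code generation targets.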

Conclusion
• Source quality
– Readable
– Google C++ style
– Robust

• MPP solution based on PG
– Proven performance
– Easy to scale

• Mixed engine usage
– HDFS and DB

What’s next
• YARN integration
• UDF
• Joins with big tables
• BI roadmap
• Failover

Ref.
• Cloudera Impala online doc. & src
• http://files.meetup.com/1727991/Impala%20and%20BigQuery.ppt
• http://www.cubrid.org/blog/dev-platform/meetimpala-open-source-real-time-sql-querying-onhadoop/
• http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/slides/Impala%20tech%20talk.pdf
• @datascientist, @dongxicheng, @flyingsk, @zhh
