Copyright © 2016 NTT DATA Corporation
December 2, 2016
NTT Data Corporation
Ayumi Ishii
Application of PostgreSQL to large social
infrastructure
PGCONF.ASIA 2016

How to use PostgreSQL in social infrastructure

Positioning of smart meter management system
[Diagram: smart meters (SM) send data through aggregation devices to the smart meter management system (★) in the data center. Other systems shown: wheeling management system, fee calculation for new menu, billing processing, member management system, reward points system, switching support system, other power companies, and the Organization for Cross-regional Coordination of Transmission Operators.]

Main processing and mission of the system
Main processing (5 million datasets per 30 min):
1. validate
2. save data (5 million tuple INSERT)
3. calculation
4. save calculated data (5 million tuple INSERT)
• Mission 1: finish the main processing within 10 minutes
• Mission 2: 240 million additional tuples per day, which must be saved for 24 months
• Mission 3: large scale SELECT
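
The daily volume on this slide follows from the 30-minute cycle. A minimal sketch of the arithmetic, assuming 24 months is counted as 2 × 365 days (the retention in days is an assumption, not a figure from the slides):

```python
# Figures taken from the slide
DATASETS_PER_INTERVAL = 5_000_000    # datasets arriving every 30 minutes
INTERVALS_PER_DAY = 24 * 60 // 30    # 48 half-hour intervals per day

tuples_per_day = DATASETS_PER_INTERVAL * INTERVALS_PER_DAY

# Assumption for illustration only: 24 months counted as 2 * 365 days
RETENTION_DAYS = 2 * 365
tuples_retained = tuples_per_day * RETENTION_DAYS

print(f"{tuples_per_day:,} tuples per day")
print(f"{tuples_retained:,} tuples over {RETENTION_DAYS} days")
```

The first figure matches the 240 million tuples per day stated on the slide. The second is only an order-of-magnitude estimate for one of the two saved datasets.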

Mission
1. Load 10 million datasets within 10 minutes !
2. Must save data for 24 months !
3. Stabilize large scale SELECT performance !

(1) Load 10 million datasets within 10 minutes !

Data model
Data: [Device ID] [Date] [Electricity Usage]
ex) Device ID 1 used 500 at 1:00 on August 1st.

Method 1: UPDATE model
UPDATE new data into one row per device per day

Device ID  Day  0:00  0:30  1:00  1:30  …
1          8/1  100   300   500
2          8/1  200   400

Frequent UPDATEs are unfavorable for PostgreSQL in terms of performance.

Data model
Method 1: UPDATE model

Device ID  Day  0:00  0:30  1:00  1:30  …
1          8/1  100   300   500
2          8/1  200   400

Method 2: INSERT model
INSERT new data for each device, every 30 mins

Device ID  Date      Value
1          8/1 0:00  100
1          8/1 0:30  300
1          8/1 1:00  500
…          …         …

INSERT model: ○ performance, × data size

The INSERT model (Method 2) was selected based on performance.

Performance factors
Pre-research regarding performance factors:
• number of tuples in one transaction?
• multiplicity?
• parameters?
• data type?
• restrictions?
• index?
• version?
• how to load to partition table?

Performance factors (DB design and performance tuning)
• number of tuples in one transaction: 10000
• multiplicity: 8
• parameter: wal_buffers=1GB
• data type: minimum
• restriction: minimum
• index: minimum
• version: 9.4
• direct load to partition child table
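
The first two factors (10000 tuples per transaction, multiplicity 8) can be pictured as splitting the incoming rows into batches and spreading the batches over parallel loaders. The sketch below is only an illustration: the round-robin assignment and the sample rows are assumptions, and no database access is shown.

```python
from itertools import islice

BATCH_SIZE = 10_000   # tuples per transaction (from the slide)
WORKERS = 8           # multiplicity (from the slide)


def batches(rows, size=BATCH_SIZE):
    """Yield consecutive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def assign_round_robin(all_batches, workers=WORKERS):
    """Distribute batches over workers in round-robin order (an assumed policy)."""
    plan = [[] for _ in range(workers)]
    for i, batch in enumerate(all_batches):
        plan[i % workers].append(batch)
    return plan


# Hypothetical input: 100,000 rows of (device_id, value)
rows = [(device_id, 0) for device_id in range(100_000)]
plan = assign_round_robin(batches(rows))
for worker, assigned in enumerate(plan):
    print(worker, len(assigned), sum(len(b) for b in assigned))
```

Each inner list would correspond to one transaction issued by one loader. How the real system divided the work is not stated on the slides.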

Bottleneck Analysis with perf
perf output:
19.83% postgres postgres [.] XLogInsert ★
 6.45% postgres postgres [.] LWLockRelease
 4.41% postgres postgres [.] PinBuffer
 3.03% postgres postgres [.] LWLockAcquire
WAL is the bottleneck!
[Diagram: WAL is held in the WAL buffer (memory) and written to the WAL file (disk I/O) on commit or when the buffer is full.]

wal_buffers parameter
“The auto-tuning selected by the default setting of -1 should give reasonable results in most cases.”
(from the PostgreSQL documentation)

wal_buffers
[Chart: Impact of wal_buffers. Load time (axis 0:00:00 to 0:09:00) for wal_buffers = 16MB vs 1GB. Note: INSERT only (except SELECT).]

PostgreSQL version
Changes from 9.3 to 9.4:
• WAL performance improved
• JSONB
• GIN performance improved
• CONCURRENTLY option

Version up
• We had originally planned to use 9.3, but changed to 9.4.
[Chart: Impact of version up. Load time (axis 0:00:00 to 0:08:00) for 9.3 vs 9.4. Note: INSERT only (except SELECT).]

Result
[Chart: time per configuration (axis 0:00:00 to 0:12:01), stacked as INSERT and others]

Configuration  INSERT   others
9.3, 16MB      0:07:57  0:03:29
9.3, 1GB       0:06:59  0:03:29
9.4, 1GB       0:05:49  0:03:29

Target accomplished!!
Other processes are already tuned.
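
The result can be checked against the 10-minute target by adding the INSERT time and the time of the other processes. This is a sketch under an assumption about the chart: that the three INSERT labels pair with the configurations in the order shown and that "others" is 0:03:29 for each bar.

```python
from datetime import timedelta


def parse_hms(text):
    """Convert an 'H:MM:SS' label from the chart into a timedelta."""
    h, m, s = (int(part) for part in text.split(":"))
    return timedelta(hours=h, minutes=m, seconds=s)


TARGET = timedelta(minutes=10)     # Mission 1
OTHERS = parse_hms("0:03:29")      # time of the non-INSERT processes

# Assumed pairing of the INSERT labels with the three configurations
insert_times = {
    "9.3, 16MB": parse_hms("0:07:57"),
    "9.3, 1GB": parse_hms("0:06:59"),
    "9.4, 1GB": parse_hms("0:05:49"),
}

totals = {name: t + OTHERS for name, t in insert_times.items()}
for name, total in totals.items():
    verdict = "meets target" if total <= TARGET else "misses target"
    print(name, total, verdict)
```

Under that reading, only the combination of 9.4 and wal_buffers=1GB comes in under 10 minutes, which is consistent with the "target accomplished" label.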

(2) Must save data for 24 months !
Data size: 108TB

Reduce data size by selecting the best data type
• Integer
  → Use the smallest data type that can cover the range and precision

Type      Precision   Size
SMALLINT  4 digit     2 byte
INTEGER   9 digit     4 byte
BIGINT    18 digit    8 byte
NUMERIC   1000 digit  3 or 6 or 8 + ceiling(digit / 4) * 2

• Boolean
  → Use BOOLEAN instead of CHAR(1)

Type      Available data        Size
CHAR(1)   string (length is 1)  5 byte
BOOLEAN   true or false         1 byte
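
The "smallest type that covers the range" rule can be written down as a small lookup. This is an illustrative sketch, not the project's tooling: the function name is hypothetical, and the ranges are the standard limits of PostgreSQL's fixed-size integer types.

```python
# (name, size in bytes, minimum, maximum) for PostgreSQL's fixed-size integer types
INTEGER_TYPES = [
    ("SMALLINT", 2, -2**15, 2**15 - 1),
    ("INTEGER", 4, -2**31, 2**31 - 1),
    ("BIGINT", 8, -2**63, 2**63 - 1),
]


def smallest_integer_type(min_value, max_value):
    """Return (type name, size in bytes) of the smallest type covering the range."""
    for name, size, low, high in INTEGER_TYPES:
        if low <= min_value and max_value <= high:
            return name, size
    # Outside the BIGINT range: fall back to variable-length NUMERIC
    return "NUMERIC", None


print(smallest_integer_type(0, 9_999))          # a 4-digit value
print(smallest_integer_type(0, 5_000_000))      # e.g. a device counter
print(smallest_integer_type(0, 10**12))
```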

Reduce the data size by changing column order
• Alignment
• PostgreSQL does not store a value across an alignment boundary; padding is inserted instead.

ex) a 4-byte column followed by an 8-byte column (each row below is 8 bytes wide):
| column_1 (4 byte) | PADDING |
| column_2 (8 byte)           |

Column    Type
column_1  integer
column_2  timestamp without time zone
column_3  integer
column_4  smallint
column_5  timestamp without time zone
column_6  smallint
column_7  timestamp without time zone

Before (72 bytes):
| column_1 | PADDING            |
| column_2                      |
| column_3 | column_4 | PADDING |
| column_5                      |
| column_6 | PADDING            |
| column_7                      |

After reordering (60 bytes):
| column_2            |
| column_5            |
| column_7            |
| column_1 | column_3 |
| column_4 | column_6 |

ex) 12 bytes saved per tuple → 2.8GB/day!
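
The effect of column order can be reproduced with a short calculation. This is a sketch under stated assumptions: a fixed 24-byte tuple header, 2/4/8-byte alignment for smallint/integer/timestamp, and no rounding of the whole tuple. `tuple_size` is a hypothetical helper, not a PostgreSQL function.

```python
# (size, alignment) in bytes for the column types used on the slide
TYPE_LAYOUT = {
    "smallint": (2, 2),
    "integer": (4, 4),
    "timestamp": (8, 8),   # timestamp without time zone
}
HEADER_BYTES = 24          # assumed fixed tuple header size


def tuple_size(column_types):
    """Header plus column data, padding each column to its alignment."""
    offset = 0
    for type_name in column_types:
        size, align = TYPE_LAYOUT[type_name]
        offset += -offset % align   # pad up to the next alignment boundary
        offset += size
    return HEADER_BYTES + offset


before = ["integer", "timestamp", "integer", "smallint",
          "timestamp", "smallint", "timestamp"]
after = ["timestamp", "timestamp", "timestamp",
         "integer", "integer", "smallint", "smallint"]

saved = tuple_size(before) - tuple_size(after)
saved_per_day = saved * 240_000_000   # 240 million tuples per day
print(tuple_size(before), tuple_size(after), saved, saved_per_day / 1e9, "GB/day")
```

With these assumptions the sketch reproduces the 72 and 60 bytes shown on the slide, and the 12-byte difference over 240 million tuples per day comes to roughly the 2.8GB/day quoted.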

Change data model
We adopted the INSERT model for its performance.
• However, the data size is large, which makes long-term storage difficult.
→ Convert the model for old data.

No.  Data                   Select frequency  Update frequency  Policy                       Model
1    1st day to 65th day    high              high              performance is the priority  INSERT
2    66th day to 24 months  low               low               data size is the priority    UPDATE

Change data model
INSERT model:

ID  timestamp  value
1   8/1 0:00   100
2   8/1 0:00   100
1   8/1 0:30   300
2   8/1 0:30   200
1   8/1 1:00   500
2   8/1 1:00   300
…   …          …
1   8/1 22:30  1000
2   8/1 22:30  800
1   8/1 23:00  1100
2   8/1 23:00  900
1   8/1 23:30  1200
2   8/1 23:30  1000

UPDATE model:

ID  date  0:00  0:30  1:00  …  22:30  23:00  23:30
1   8/1   100   300   500   …  1000   1100   1200
2   8/1   100   200   300   …  800    900    1000

• Remove duplicated data (ID, timestamp)
• Number of tuples per day: 240 million → 5 million
• Size: 22GB → 3GB
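
The conversion from the INSERT model to the UPDATE model is a pivot: 48 half-hourly rows per device and day collapse into one row with 48 value slots. A minimal in-memory sketch follows. The function name, the year in the sample dates and the use of `None` for missing slots are assumptions for illustration; the actual conversion procedure is not described on the slides.

```python
from collections import defaultdict
from datetime import datetime

SLOTS_PER_DAY = 48  # one slot per 30 minutes


def to_wide(rows):
    """Group (device_id, timestamp, value) rows into one 48-slot row per device and day."""
    wide = defaultdict(lambda: [None] * SLOTS_PER_DAY)
    for device_id, ts, value in rows:
        slot = ts.hour * 2 + ts.minute // 30
        wide[(device_id, ts.date())][slot] = value
    return dict(wide)


# Sample values from the slide; the year is chosen arbitrarily
narrow = [
    (1, datetime(2016, 8, 1, 0, 0), 100),
    (2, datetime(2016, 8, 1, 0, 0), 100),
    (1, datetime(2016, 8, 1, 0, 30), 300),
    (2, datetime(2016, 8, 1, 0, 30), 200),
    (1, datetime(2016, 8, 1, 23, 30), 1200),
]
wide = to_wide(narrow)
print(len(narrow), "narrow rows ->", len(wide), "wide rows")
```

Because the device ID and the date are stored once per day instead of once per reading, the row count drops by a factor of 48, which matches the 240 million → 5 million figure.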

Result
[Chart: reduce data size. Data size (TB): 108 before, 11 after.]

(3) Stabilize large scale SELECT performance !

Stabilize the performance of 10 million SELECT statements!
“Stable performance” is important.
• Performance degradation caused by sudden changes in the execution plan is a problem.
[Diagram: controlling execution plans (pg_hint_plan) and locking statistical information (pg_dbms_stats) lead to stable performance.]

Before using pg_hint_plan & pg_dbms_stats
In most cases, the optimizer generates the best execution plan, and fixing the execution plan does not always bring good results.
• The best execution plan at this time may not be the best in the future.
However, it is necessary to reduce the risk: if the execution plan suddenly changes during operation, performance may be reduced.
• SELECT immediately after batch, before ANALYZE
• SELECT from a lot of tables (JOIN)
• …
→ Understand the drawbacks and use these extensions.

pg_dbms_stats
[Diagram: pg_dbms_stats locks PostgreSQL's original statistics, and the planner generates plans from the “locked” statistics.]

pg_dbms_stats in this system
Usage data is stored in day partitions (one table per day), and each day table has its own locked statistics.
• Set locked statistics on each new table by COPY.
• Some statistics are different depending on each child table:
  • table’s OID, table name
  • partition key, date
We can certainly get the best plan even without using ANALYZE.

Replacing statistics that should be changed according to table
• Create assumed dummy data
• ANALYZE dummy data

Column         Statistic
partition key  Most Common Value
Date           Histogram

Ex) “8/1 0:00”, “8/1 0:30”, “8/1 1:00”
48 patterns per day. Uniform distribution.
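
The dummy data described here has 48 timestamp values per day, each equally frequent. A minimal sketch of generating such rows is below. The helper names, the sample date and the device IDs are hypothetical; loading the rows into a table and running ANALYZE is outside the sketch.

```python
from datetime import datetime, timedelta


def half_hour_slots(day):
    """Return the 48 half-hour timestamps of `day`, starting at 0:00."""
    return [day + timedelta(minutes=30 * i) for i in range(48)]


def dummy_rows(day, device_ids):
    """One row per device and slot, so every timestamp value is equally frequent."""
    return [(device_id, ts) for ts in half_hour_slots(day) for device_id in device_ids]


# Hypothetical inputs: the date and the device IDs are made up for illustration
slots = half_hour_slots(datetime(2016, 8, 1))
rows = dummy_rows(datetime(2016, 8, 1), device_ids=range(3))
print(slots[0], slots[-1], len(rows))
```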

Mission
1. Load 10 million datasets within 10 minutes !
2. Must save data for 24 months !
3. Stabilize large scale SELECT performance !
COMPLETE

Conclusion
The 20th anniversary of PostgreSQL
PostgreSQL has finally evolved to the point of being adopted in large scale social infrastructure.
Both PostgreSQL technical knowledge and business application knowledge are necessary to be successful in difficult and large scale projects.
Pre-research and know-how are important to get the most out of PostgreSQL.

Application of postgre sql to large social infrastructure

  • 1. Copyright © 2016 NTT DATA Corporation December 2, 2016 NTT Data Corporation Ayumi Ishii Application of PostgreSQL to large social infrastructure PGCONF.ASIA 2016
  • 2. Copyright © 2016 NTT DATA Corporation 2 How to use PostgreSQL in social infrastructure
  • 3. 3Copyright © 2016 NTT DATA Corporation Positioning of smart meter management system aggregation device SM SM SM smart meter management system SM Data Center SM SM SM aggregation device wheeling management system fee calculation for new menu other power companies billing processing member management system reward points system switching support system Organization for Cross- regional Coordination of Transmission Operators ★
  • 4. 4Copyright © 2016 NTT DATA Corporation Main processing and mission of the system main processing 5 million datasets per 30 min validate save data save calculated datacalculation within 10minutes • 240 million additional tuples per day • must be saved for 24 months 5 million tuple INSERT Mission 1 Mission 2 large scale SELECT Mission 35 million tuple INSERT
  • 5. Mission: 1. Load 10 million datasets within 10 minutes! 2. Must save data for 24 months! 3. Stabilize large scale SELECT performance!
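The volumes behind these missions can be checked with a little arithmetic. The sketch below uses only figures quoted on the slides, except for `DAYS_PER_MONTH`, which is an assumption made here for a rough retention estimate. Reading the 10 million datasets as the two 5 million tuple INSERTs per cycle is also an interpretation of the diagram.

```python
# Figures from the slides.
DATASETS_PER_INTERVAL = 5_000_000      # datasets arriving every 30 minutes
INTERVALS_PER_DAY = 24 * 60 // 30      # 48 half-hour intervals per day

tuples_per_day = DATASETS_PER_INTERVAL * INTERVALS_PER_DAY

# Mission 1: 10 million tuples (assumed: 5M raw + 5M calculated) in 10 minutes.
LOAD_TUPLES = 10_000_000
LOAD_WINDOW_SECONDS = 10 * 60
required_rate = LOAD_TUPLES / LOAD_WINDOW_SECONDS   # tuples per second

# Mission 2: 24 months of retention.
RETENTION_MONTHS = 24
DAYS_PER_MONTH = 30                    # assumption, for a rough estimate only
retained_tuples = tuples_per_day * RETENTION_MONTHS * DAYS_PER_MONTH

print(f"{tuples_per_day:,} tuples/day")
print(f"{required_rate:,.0f} tuples/s sustained during the load window")
print(f"{retained_tuples:,} tuples retained (approx.)")
```

The daily figure reproduces the 240 million tuples per day stated on slide 4.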
  • 6. (1) Load 10 million datasets within 10 minutes! [Section slide: repeats the slide 4 diagram, marking Mission 1 with ★.]
  • 7. Data model. Data: [Device ID] [Date] [Electricity Usage]. ex) ID 1 used 500 at 1:00 on August 1st.
      Method 1: UPDATE model. UPDATE new data for each device, daily.
      Device ID | Day | 0:00 | 0:30 | 1:00 | 1:30 | …
      1         | 8/1 | 100  | 300  | 500  |      |
      2         | 8/1 | 200  | 400  |      |      |
      Frequent UPDATEs are unfavorable for PostgreSQL in terms of performance.
  • 8. Data model. Method 2: INSERT model. INSERT new data for each device, every 30 mins.
      Device ID | Date     | Value
      1         | 8/1 0:00 | 100
      1         | 8/1 0:30 | 300
      1         | 8/1 1:00 | 500
      …         | …        | …
      ○ performance, × data size. [Shown next to the Method 1 (UPDATE model) table from slide 7.]
  • 9. Data model. [Slide 8 repeated.] Method 2 (INSERT model) was selected based on performance.
  • 10. Performance factors. Pre-research regarding performance factors: number of tuples in one transaction? multiplicity? parameters? data type? restrictions? index? version? how to load to partition table?
  • 11. Performance factors (DB design, performance tuning). Number of tuples in one transaction: 10000. Multiplicity: 8. Parameter: wal_buffers=1GB. Data type: minimum. Restriction: minimum. Index: minimum. Version: 9.4. Direct load to partition child table.
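To make the transaction size and multiplicity figures concrete, here is a minimal sketch of how one 5 million tuple load could be split into 10,000-tuple transactions and spread over 8 workers. It assumes that "multiplicity" means the number of parallel loading sessions. The `plan_batches` helper and the round-robin assignment are hypothetical illustrations, not the presenter's loader.

```python
def plan_batches(total_rows, batch_size, workers):
    """Split [0, total_rows) into batch ranges and deal them round-robin to workers.

    Each (start, end) range stands for one transaction's worth of rows.
    """
    batches = [
        (start, min(start + batch_size, total_rows))
        for start in range(0, total_rows, batch_size)
    ]
    return [batches[w::workers] for w in range(workers)]


# Figures from slide 11: 10,000 tuples per transaction, multiplicity 8.
plan = plan_batches(total_rows=5_000_000, batch_size=10_000, workers=8)

for worker_id, batches in enumerate(plan):
    print(f"worker {worker_id}: {len(batches)} transactions")
```

Under these assumptions each load cycle becomes 500 transactions, with every worker committing 62 or 63 of them.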
  • 12. Performance factors. [Same values as slide 11.]
  • 13. Bottleneck analysis with perf
      19.83%  postgres  postgres  [.] XLogInsert ★
       6.45%  postgres  postgres  [.] LWLockRelease
       4.41%  postgres  postgres  [.] PinBuffer
       3.03%  postgres  postgres  [.] LWLockAcquire
      WAL is the bottleneck!
      [Diagram: WAL goes to the WAL buffer (memory) and is written to the WAL file (disk I/O) at commit or when the buffer is full.]
  • 14. wal_buffers parameter: “The auto-tuning selected by the default setting of -1 should give reasonable results in most cases.” (PostgreSQL documentation)
  • 15. wal_buffers. [Chart: impact of wal_buffers; time for 16MB vs. 1GB. ※ INSERT only (excluding SELECT).]
  • 16. PostgreSQL version, 9.3 → 9.4: WAL performance improved; JSONB; GIN performance improved; CONCURRENTLY option.
  • 17. Version upgrade. We had originally planned to use 9.3, but changed to 9.4. [Chart: impact of the version upgrade; time for 9.3 vs. 9.4. ※ INSERT only (excluding SELECT).]
  • 18. Result (time per configuration; ■ INSERT, ■ others)
      Configuration | INSERT  | others
      9.3, 16MB     | 0:07:57 | 0:03:29
      9.3, 1GB      | 0:06:59 | 0:03:29
      9.4, 1GB      | 0:05:49 | 0:03:29
      Target accomplished! Other processes are already tuned.
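The result chart can be checked against the 10 minute target by adding the two stacked segments for each configuration. This sketch assumes the flattened chart values map to the series as INSERT = 7:57, 6:59, 5:49 and others = 3:29 in each case. That reading is inferred from the extracted text rather than stated explicitly.

```python
from datetime import timedelta


def parse_hms(text):
    """Parse an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# (INSERT time, other processing time) per configuration, as read from slide 18.
RESULTS = {
    "9.3, 16MB": ("0:07:57", "0:03:29"),
    "9.3, 1GB": ("0:06:59", "0:03:29"),
    "9.4, 1GB": ("0:05:49", "0:03:29"),
}
TARGET = timedelta(minutes=10)

totals = {
    config: parse_hms(insert) + parse_hms(others)
    for config, (insert, others) in RESULTS.items()
}

for config, total in totals.items():
    verdict = "meets target" if total <= TARGET else "over target"
    print(f"{config}: {total} ({verdict})")
```

On that reading, only the combination of version 9.4 and wal_buffers=1GB brings the whole cycle under 10 minutes.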
  • 19. (2) Must save data for 24 months! [Section slide: repeats the slide 4 diagram, marking Mission 2 with ★.]
  • 20. 108TB
  • 21. Reduce data size by selecting the best data type
      Integer: use the smallest data type that can cover the range and precision.
      Boolean: use BOOLEAN instead of CHAR(1).
      Type     | Precision  | Size
      SMALLINT | 4 digit    | 2 byte
      INTEGER  | 9 digit    | 4 byte
      BIGINT   | 18 digit   | 8 byte
      NUMERIC  | 1000 digit | 3 or 6 or 8 + ceiling(digit / 4) * 2
      Type     | Available data       | Size
      CHAR(1)  | string (length is 1) | 5 byte
      BOOLEAN  | true or false        | 1 byte
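The integer rule on this slide can be expressed as a small lookup. The sketch below takes its digit counts, sizes, and NUMERIC formula directly from the slide's table. The helper names `smallest_integer_type` and `numeric_size` are hypothetical, introduced only for illustration.

```python
import math

# (type name, decimal digits it can always hold, size in bytes), from slide 21.
INTEGER_TYPES = [
    ("SMALLINT", 4, 2),
    ("INTEGER", 9, 4),
    ("BIGINT", 18, 8),
]


def smallest_integer_type(digits):
    """Return (name, bytes) of the smallest integer type covering `digits` digits."""
    for name, max_digits, size in INTEGER_TYPES:
        if digits <= max_digits:
            return name, size
    raise ValueError(f"{digits} digits do not fit an integer type; consider NUMERIC")


def numeric_size(digits, overhead=3):
    """NUMERIC size per the slide: overhead (3, 6 or 8) + ceiling(digits / 4) * 2."""
    return overhead + math.ceil(digits / 4) * 2


for digits in (4, 9, 18):
    name, size = smallest_integer_type(digits)
    print(f"{digits} digits: {name} = {size} bytes, NUMERIC >= {numeric_size(digits)} bytes")
```

Even with the smallest overhead, the slide's formula puts NUMERIC above the matching integer type at each of these digit counts.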
  • 22. Reduce the data size by changing column order. Alignment: PostgreSQL does not store data across the alignment. [Diagram of 8-byte rows: column_1 (4 byte) followed by padding; column_2 (8 byte) fills the next 8 bytes.]
  • 23. Column   | Type
      column_1 | integer
      column_2 | timestamp without time zone
      column_3 | integer
      column_4 | smallint
      column_5 | timestamp without time zone
      column_6 | smallint
      column_7 | timestamp without time zone
      Layout in the original order (each row is 8 bytes):
        column_1 | padding
        column_2
        column_3 | column_4 | padding
        column_5
        column_6 | padding
        column_7
      Layout after reordering (each row is 8 bytes):
        column_2
        column_5
        column_7
        column_1 | column_3
        column_4 | column_6
      72 → 60. ex) 12 bytes / 1 tuple → 2.8GB/day!
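The 72 and 60 on this slide can be reproduced by walking the columns and padding each one to its alignment. This is a minimal sketch that assumes a 24-byte tuple header with no null bitmap, alignments of 4, 2 and 8 bytes for integer, smallint and timestamp, and no trailing padding after the last column. The `tuple_size` helper is hypothetical.

```python
# type name -> (size in bytes, required alignment in bytes); assumed values.
TYPE_LAYOUT = {
    "integer": (4, 4),
    "smallint": (2, 2),
    "timestamp": (8, 8),
}
HEADER_BYTES = 24  # assumption: heap tuple header padded to 8 bytes, no null bitmap


def tuple_size(column_types):
    """Return header + data size, inserting padding so each column is aligned."""
    offset = HEADER_BYTES
    for type_name in column_types:
        size, align = TYPE_LAYOUT[type_name]
        offset += (-offset) % align   # padding needed to reach the next boundary
        offset += size
    return offset


original = ["integer", "timestamp", "integer", "smallint",
            "timestamp", "smallint", "timestamp"]
reordered = ["timestamp", "timestamp", "timestamp",
             "integer", "integer", "smallint", "smallint"]

before = tuple_size(original)
after = tuple_size(reordered)
saved_per_day = (before - after) * 240_000_000   # 240 million tuples per day

print(before, after, f"{saved_per_day / 1e9:.2f} GB/day saved")
```

Ordering columns from the widest alignment to the narrowest removes all interior padding in this example, which is how the 12-byte saving per tuple arises.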
  • 24. Change data model. We adopted the INSERT model considering performance. However, the data size is large, making it difficult to store long term → convert the model for old data.
      Num | Data                   | Select frequency | Update frequency | Policy                      | Model
      1   | 1st day ~ 65th day     | high             | high             | performance is the priority | INSERT
      2   | 66th day ~ 24 months   | low              | low              | data size is the priority   | UPDATE
  • 25. Change data model: INSERT model → UPDATE model, removing duplicated data (ID, timestamp).
      INSERT model:
      ID | timestamp | value
      1  | 8/1 0:00  | 100
      2  | 8/1 0:00  | 100
      1  | 8/1 0:30  | 300
      2  | 8/1 0:30  | 200
      1  | 8/1 1:00  | 500
      2  | 8/1 1:00  | 300
      …  | …         | …
      1  | 8/1 22:30 | 1000
      2  | 8/1 22:30 | 800
      1  | 8/1 23:00 | 1100
      2  | 8/1 23:00 | 900
      1  | 8/1 23:30 | 1200
      2  | 8/1 23:30 | 1000
      UPDATE model:
      ID | date | 0:00 | 0:30 | 1:00 | … | 22:30 | 23:00 | 23:30
      1  | 8/1  | 100  | 300  | 500  | … | 1000  | 1100  | 1200
      2  | 8/1  | 100  | 200  | 300  | … | 800   | 900   | 1000
      Num of tuples/day: 240 million → 5 million. Size: 22GB → 3GB.
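The conversion on this slide is a pivot: 48 half-hourly rows per device and day collapse into one row with 48 value columns. The sketch below shows the idea on the sample values from the slide. The `to_wide` helper and the in-memory row format are hypothetical, and the real system would do this inside the database.

```python
from collections import defaultdict

# The 48 half-hour slot labels of a day: "0:00", "0:30", ..., "23:30".
SLOTS = [f"{hour}:{minute:02d}" for hour in range(24) for minute in (0, 30)]
SLOT_INDEX = {slot: i for i, slot in enumerate(SLOTS)}


def to_wide(long_rows):
    """Pivot (device_id, day, slot, value) rows into one 48-value row per device and day."""
    wide = defaultdict(lambda: [None] * len(SLOTS))
    for device_id, day, slot, value in long_rows:
        wide[(device_id, day)][SLOT_INDEX[slot]] = value
    return dict(wide)


# First rows of the INSERT model table on the slide.
long_rows = [
    (1, "8/1", "0:00", 100), (2, "8/1", "0:00", 100),
    (1, "8/1", "0:30", 300), (2, "8/1", "0:30", 200),
    (1, "8/1", "1:00", 500), (2, "8/1", "1:00", 300),
]

wide_rows = to_wide(long_rows)
for (device_id, day), values in sorted(wide_rows.items()):
    print(device_id, day, values[:3])

# Row count reduction for a full day: 240 million rows become 5 million.
print(240_000_000 // len(SLOTS))
```

The device ID and date are stored once per day instead of 48 times, which is the duplicated data the slide refers to.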
  • 26. Result. [Chart: reduce data size; data size (TB): before 108, after 11.]
  • 27. (3) Stabilize large scale SELECT performance! [Section slide: repeats the slide 4 diagram, marking Mission 3 with ★.]
  • 28. Stabilize the performance of 10 million SELECT statements! “Stable performance” is important: performance degradation caused by sudden changes in the execution plan is a problem. Control execution plans: pg_hint_plan. Lock statistical information: pg_dbms_stats. → Stable performance.
  • 29. Before using pg_hint_plan & pg_dbms_stats. In most cases, the optimizer generates the best execution plan. Fixing the execution plan does not always bring good results: the best execution plan at this time may not be the best in the future. However, it is necessary to reduce the risk: if the execution plan suddenly changes during operation, performance may drop. Examples: SELECT immediately after batch, before ANALYZE; SELECT from a lot of tables (JOIN); … → Understand the demerits and use these extensions.
  • 30. pg_dbms_stats. [Diagram: pg_dbms_stats locks PostgreSQL's original statistics; the planner generates plans from the “locked” statistics.]
  • 31. pg_dbms_stats in this system. Usage data is held in a partition set of day tables, each with locked statistics. For a new day table, locked statistics are set by COPY. Some statistics differ for each child table: table's OID and table name; partition key and date. We can reliably get the best plan even without using ANALYZE.
  • 32. Replacing statistics that should be changed according to table. Create assumed dummy data, then ANALYZE the dummy data.
      Column        | Statistic
      partition key | Most Common Value
      Date          | Histogram
      Ex) “8/1 0:00”, “8/1 0:30”, “8/1 1:00”: 48 patterns per day, uniform distribution.
  • 33. Mission COMPLETE: 1. Load 10 million datasets within 10 minutes! 2. Must save data for 24 months! 3. Stabilize large scale SELECT performance!
  • 34. Conclusion. The 20th anniversary of PostgreSQL: PostgreSQL has finally evolved to be adopted in large scale social infrastructure. Both PostgreSQL technical knowledge and business application knowledge are necessary to be successful in difficult and large scale projects. Pre-research and know-how are important to get the most out of PostgreSQL.