Timestamped Binary Association Table - IEEE Big Data Congress 2015

Write Optimization using Asynchronous Update on Out-of-Core
Column-Store Databases in Map-Reduce
Feng Yu, Eric S. Jones
Youngstown State University, Youngstown, OH
fyu@ysu.edu, esjones@student.ysu.edu
Wen-Chi Hou
Southern Illinois University, Carbondale, IL
hou@cs.siu.edu
Youngstown State University
Column-Store Databases
• The column-store database is also known as a columnar
database or column-oriented database.
• The column-store database fits well into the
write-once-and-read-many environment.
– It retrieves only the attributes referenced by the
query, without the need to read the entire tuple.
– It works especially well for OLAP and data mining queries.
– It can reach a higher compression rate and higher reading
speed than row-based databases.
Challenge
• Optimizing write operations in a column-store database has
always been a challenge.
• Data is vertically decomposed into BATs (Binary Association Tables)
and randomly distributed over the storage.
• Writes to a column-store database are significantly delayed by
ad hoc access to large BATs across multiple pages.
• Existing work mostly focuses on write optimization in
main-memory column-store databases.
BAT Example
[Fig. 1: customer data in row-based (relational) and column-store (BAT) format, with the mapping rules between the two. A BUN consists of (oid, value).]
Update on BAT in Map-Reduce
• In a Map-Reduce environment, we assume the
update list of OIDs is collected and submitted in a
batch.
1. Map-Reduce Join
BAT LEFT OUTER JOIN UPDATE_LIST ON OID => (BAT combine UPDATE_LIST)
• Map-side join: when UPDATE_LIST is small enough to fit into memory
• Reduce-side join: when UPDATE_LIST is too large to fit into memory
2. Projection (Map-Only)
FOR each record in (BAT combine UPDATE_LIST)
    IF UPDATE_LIST attribute is not NULL: output updated value
    ELSE: output original value
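The two steps above can be sketched in plain Python. This is a minimal in-memory illustration of the map-side case only, not the authors' Map-Reduce job: the function names `map_side_join` and `project`, the dictionary used for UPDATE_LIST, and the sample values are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

BUN = Tuple[int, float]                       # (oid, value)
Joined = Tuple[int, float, Optional[float]]   # (oid, original, updated or None)


def map_side_join(bat: Iterable[BUN],
                  update_list: Dict[int, float]) -> Iterator[Joined]:
    """Step 1: BAT LEFT OUTER JOIN UPDATE_LIST ON oid.

    UPDATE_LIST is assumed small enough to sit in memory as a dict,
    so every BAT record can be joined without a shuffle.
    """
    for oid, value in bat:
        # None plays the role of NULL for OIDs that have no update.
        yield oid, value, update_list.get(oid)


def project(joined: Iterable[Joined]) -> Iterator[BUN]:
    """Step 2: map-only projection that picks the updated value if present."""
    for oid, original, updated in joined:
        yield oid, (updated if updated is not None else original)


# Illustrative data: a three-record BAT and one pending update.
bat = [(101, 100.00), (102, 200.00), (103, 300.00)]
update_list = {102: 201.00}

new_bat = list(project(map_side_join(bat, update_list)))
print(new_bat)
```

Note that every BAT record passes through both steps even though only one value changes, which is the cost the timestamped approach tries to avoid.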
Motivation
• Focus: Write-optimization on column-store in
Map-Reduce
• Principle: avoid seeking and writing on every
change
• Solution: Timestamp the newly updated data
(TBAT)
– multi-version
– no index needed
• Update: AMO (Asynchronous Map-Only) update
– the newly updated data is appended to the end of a
TBAT slip in a map-only manner
TBAT (Timestamped BAT)
• TBAT in HDFS:
struct TBUN {
    TIMESTAMP optime,
    ROWID oid,
    USER_DEFINED_TYPE attrv
}
struct TBAT_slip {
    TBUN[max_size_per_HDFS_slip] tbuns
}
– No need for any global pre-sorting or indexing
– ‘attrv’ can be any user-defined type, so a TBAT can flexibly
represent arbitrary kinds of schema
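As a rough illustration, the two structures could be modelled in Python as follows. This is only a sketch: the integer timestamp, the constant `MAX_TBUNS_PER_SLIP` (a small stand-in for max_size_per_HDFS_slip), and the `is_full`/`append` helpers are assumptions for the example and say nothing about the on-disk HDFS layout.

```python
from dataclasses import dataclass, field
from typing import Any, List

# Hypothetical stand-in for max_size_per_HDFS_slip; kept tiny for the demo.
MAX_TBUNS_PER_SLIP = 4


@dataclass(frozen=True)
class TBUN:
    optime: int   # timestamp of the operation (an integer here, by assumption)
    oid: int      # row identifier
    attrv: Any    # attribute value of any user-defined type


@dataclass
class TBATSlip:
    tbuns: List[TBUN] = field(default_factory=list)

    def is_full(self) -> bool:
        return len(self.tbuns) >= MAX_TBUNS_PER_SLIP

    def append(self, tbun: TBUN) -> None:
        # TBUNs are stored in arrival order: no sorting, no index.
        if self.is_full():
            raise ValueError("slip is full; a new slip would be needed")
        self.tbuns.append(tbun)


slip = TBATSlip()
for oid, balance in [(101, 100.00), (102, 200.00), (103, 300.00)]:
    slip.append(TBUN(optime=1, oid=oid, attrv=balance))
print(len(slip.tbuns), slip.is_full())
```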
TBAT Example (logical view)
customer_balance (BAT)
oid   float
101   100.00
102   200.00
103   300.00

customer_balance (TBAT)
optime   oid   float
time1    101   100.00
time1    102   200.00
time1    103   300.00

Suppose the existing records were inserted in one batch at time1.
AMO Update (logical)
Example:
Update query on the customer table:
update customer set balance=201.00 where id=2
The current timestamp is time2 (> time1).
The new TBUN (time2, 102, 201.00) is appended to the end of TBAT customer_balance.
The old TBUN (time1, 102, 200.00) stays in place as old data.
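The append-only behaviour of this example can be sketched as below. The list-of-tuples TBAT, the function name `amo_update`, and the integer values standing in for time1 and time2 are assumptions made for illustration; a real job would write to a TBAT slip in HDFS rather than to a Python list.

```python
from typing import Iterable, List, Tuple

TBUN = Tuple[int, int, float]  # (optime, oid, value)


def amo_update(tbat: List[TBUN],
               updates: Iterable[Tuple[int, float]],
               optime: int) -> None:
    """Append one timestamped TBUN per update.

    Existing TBUNs are never looked up or rewritten, so each update
    record can be handled independently (map-only, no join).
    """
    tbat.extend((optime, oid, value) for oid, value in updates)


TIME1, TIME2 = 1, 2  # assumed integer timestamps with TIME2 > TIME1

tbat = [(TIME1, 101, 100.00), (TIME1, 102, 200.00), (TIME1, 103, 300.00)]
amo_update(tbat, [(102, 201.00)], TIME2)

for tbun in tbat:
    print(tbun)
```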
Selection after AMO Update
• Data consistency remains intact in a TBAT after an AMO
update.
• Example:
– Selection after AMO update:
SELECT balance FROM customer WHERE id=2
– Two tuples will be retrieved:
t1=(time1, 102, 200.00)
t2=(time2, 102, 201.00)
– Compare the timestamps: time2 > time1, so 201.00 is
returned, which is consistent with the last updated value.
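A minimal sketch of this selection rule, assuming a TBAT held as a list of (optime, oid, value) tuples with integer timestamps; the function name `select_latest` is hypothetical, and a full scan stands in for the Map-Reduce selection job.

```python
from typing import List, Optional, Tuple

TBUN = Tuple[int, int, float]  # (optime, oid, value)


def select_latest(tbat: List[TBUN], oid: int) -> Optional[float]:
    """Return the value of the newest TBUN for `oid`, or None if absent."""
    versions = [tbun for tbun in tbat if tbun[1] == oid]
    if not versions:
        return None
    # The version with the largest timestamp reflects the last update.
    return max(versions, key=lambda tbun: tbun[0])[2]


TIME1, TIME2 = 1, 2  # assumed integer timestamps with TIME2 > TIME1

tbat_after_update = [
    (TIME1, 101, 100.00),
    (TIME1, 102, 200.00),
    (TIME1, 103, 300.00),
    (TIME2, 102, 201.00),  # appended by the update
]

print(select_latest(tbat_after_update, 102))
```

The scan has to visit every version of the requested oid, so read cost grows with the number of appended updates.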
Preliminary Experiment
• Performed on a cluster running Cloudera's Distribution
Including Apache Hadoop (CDH) version 5.3
– 1 master and 3 slaves
– Total HDFS capacity = 310 GB (block size = 64 MB)
– Interconnection is Gigabit Ethernet
• Data sets: 1 GB and 10 GB of random synthetic
data in BAT and TBAT.
• Update queries: from 10% to 30% of the
original data.
Preliminary Experiment Results (cont.)
[Figure: 1GB Update Running Time. x-axis: Update Percentage (%), 10 to 30; y-axis: Running Time (sec), 0 to 500; series: BAT and TBAT.]
Preliminary Experiment Results (cont.)
[Figure: 10GB Update Running Time. x-axis: Update Percentage (%), 10 to 30; y-axis: Running Time (sec), 0 to 5000; series: BAT and TBAT.]
Preliminary Experiment Results (cont.)
[Figure: Overhead Changing over Data Sets. x-axis: Update Percentage (%), 10 to 30; y-axis: Overhead (%), 0 to 180; series: 1GB and 10GB.]
Resource Usage
Conclusion
• We introduce a new method called AMO
update for write optimization on out-of-core (OOC)
column-store databases in Map-Reduce.
• AMO update employs TBAT to improve
update performance while guaranteeing data
atomicity.
• Preliminary experiment results show a significant
improvement in the running speed of AMO
update.
Future Works
• Study the performance variation of the Map-Reduce
selection algorithm on TBAT after different
percentages of the file are updated.
• Introduce distributed local indexing on each
TBAT slip in HDFS to improve global data
retrieval performance.
THANK YOU!
Feng “George” Yu
Computer Science and Information Systems
Youngstown State University, Youngstown, OH
fyu@ysu.edu