Timestamped Binary Association Table - IEEE Big Data Congress 2015
1. Write Optimization using Asynchronous Update on Out-of-Core Column-Store Databases in Map-Reduce
Feng Yu, Eric S. Jones
Youngstown State University, Youngstown, OH
fyu@ysu.edu, esjones@student.ysu.edu
Wen-Chi Hou
Southern Illinois University, Carbondale, IL
hou@cs.siu.edu
Youngstown State University
2. Column-Store Databases
• The column-store database is also known as a columnar database or column-oriented database.
• The column-store database fits well into the write-once-and-read-many environment.
– It retrieves only the necessary attributes included in the query predicate, without the need to read the entire tuple.
– It works especially well for OLAP and data mining queries.
– It can reach a higher compression rate and higher reading speed than row-based databases.
3. Challenge
• Optimizing write operations in a column-store database has
always been a challenge.
• Data is vertically decomposed into BATs (Binary Association Tables)
and randomly distributed over the storage.
• Writes to a column-store database are significantly delayed by ad hoc access to large BATs across multiple pages.
• Existing work mainly focuses on write optimization in main-memory column-store databases.
4. BAT Example
[Fig. 1: customer data in row-based (relational) format and in column-store (BAT) format, related by mapping rules. A BUN consists of (oid, value).]
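The decomposition shown in Fig. 1 can be sketched in Python as follows. This is only an illustration under assumptions: the figure's actual rows and mapping rules are not recoverable here, so the sample rows, the `decompose` helper, and the rule `oid = OID_BASE + id` are all hypothetical (the offset of 100 is chosen so that id 2 maps to oid 102).

```python
from typing import Any, Dict, List, Tuple

OID_BASE = 100  # assumed mapping rule (hypothetical): oid = OID_BASE + id


def decompose(table: str, rows: List[Dict[str, Any]], key: str = "id") -> Dict[str, List[Tuple[int, Any]]]:
    """Vertically decompose relational rows into one BAT per attribute.

    Each BAT is a list of BUNs, i.e. (oid, value) pairs.
    """
    bats: Dict[str, List[Tuple[int, Any]]] = {}
    for row in rows:
        oid = OID_BASE + row[key]
        for attr, value in row.items():
            bats.setdefault(f"{table}_{attr}", []).append((oid, value))
    return bats


# Hypothetical sample rows; only the balances echo values used elsewhere in the slides.
rows = [
    {"id": 1, "name": "Alice", "balance": 100.00},
    {"id": 2, "name": "Bob", "balance": 200.00},
    {"id": 3, "name": "Carol", "balance": 300.00},
]
bats = decompose("customer", rows)
print(bats["customer_balance"])
```

A query that touches only `balance` then needs to read only the `customer_balance` BAT, not whole tuples.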
5. Update on BAT in Map-Reduce
• In a Map-Reduce environment, we assume the update list of OIDs is collected and submitted in a batch.
1. Map-Reduce Join
BAT LEFT OUTER JOIN UPDATE_LIST ON OID => (BAT combine UPDATE_LIST)
• Map-side join: when UPDATE_LIST is small enough to fit into memory
• Reduce-side join: when UPDATE_LIST is too large to fit into memory
2. Projection (Map-Only)
FOR each record in (BAT combine UPDATE_LIST)
IF UPDATE_LIST attribute is not NULL: output updated value
ELSE: output original value
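The two steps above can be sketched in plain Python, simulating the map-side variant in memory. This is a minimal illustration, not the authors' Map-Reduce job: the names `left_outer_join` and `project` are hypothetical, and UPDATE_LIST is assumed to be small enough to hold in a dictionary.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

BUN = Tuple[int, float]                      # (oid, value)
Joined = Tuple[int, float, Optional[float]]  # (oid, original value, update or None)


def left_outer_join(bat: Iterable[BUN], update_list: Dict[int, float]) -> Iterator[Joined]:
    """Step 1: BAT LEFT OUTER JOIN UPDATE_LIST ON OID (map-side style)."""
    for oid, value in bat:
        # A missing key plays the role of a NULL UPDATE_LIST attribute.
        yield oid, value, update_list.get(oid)


def project(joined: Iterable[Joined]) -> Iterator[BUN]:
    """Step 2: map-only projection that keeps the updated value when present."""
    for oid, original, updated in joined:
        yield (oid, updated) if updated is not None else (oid, original)


bat = [(101, 100.00), (102, 200.00), (103, 300.00)]
update_list = {102: 201.00}
new_bat = list(project(left_outer_join(bat, update_list)))
print(new_bat)
```

Note that every BUN is read and written again, even though only one OID changed.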
6. Motivation
• Focus: Write-optimization on column-store in
Map-Reduce
• Principle: avoid seeking and writing on every
change
• Solution: Timestamp the newly updated data
(TBAT)
– multi-version
– no index needed
• Update: AMO (Asynchronous Map-Only) update
– the newly updated data is appended to the end of a
TBAT slip in a map-only manner
7. TBAT (Timestamped BAT)
• TBAT in HDFS:
struct TBUN{
TIMESTAMP optime,
ROWID oid,
USER_DEFINED_TYPE attrv
}
struct TBAT_slip{
TBUN[max_size_per_HDFS_slip] tbuns
}
– No need for any global pre-sorting or indexing
– ‘attrv’ can be any user-defined type, which allows arbitrary kinds of schema to be defined flexibly
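The TBUN and TBAT slip layout above could be mirrored in Python roughly as follows. This is a sketch under assumptions rather than the actual HDFS representation: integer timestamps, the constant `MAX_TBUNS_PER_SLIP` (a stand-in for max_size_per_HDFS_slip), and the `is_full` and `append` helpers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, List

MAX_TBUNS_PER_SLIP = 4  # hypothetical stand-in for max_size_per_HDFS_slip


@dataclass(frozen=True)
class TBUN:
    optime: int  # timestamp of the operation (assumed to be an integer here)
    oid: int     # row identifier
    attrv: Any   # attribute value of any user-defined type


@dataclass
class TBATSlip:
    tbuns: List[TBUN] = field(default_factory=list)

    def is_full(self) -> bool:
        return len(self.tbuns) >= MAX_TBUNS_PER_SLIP

    def append(self, tbun: TBUN) -> None:
        """Add a TBUN at the end; no sorting or index maintenance is done."""
        if self.is_full():
            raise ValueError("slip is full")
        self.tbuns.append(tbun)


slip = TBATSlip()
slip.append(TBUN(optime=1, oid=101, attrv=100.00))
# attrv is not restricted to a scalar; a structured value works as well.
slip.append(TBUN(optime=1, oid=102, attrv={"currency": "USD", "amount": 200.00}))
print(len(slip.tbuns))
```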
8. TBAT Example (logical view)
BAT customer_balance:
oid    float
101    100.00
102    200.00
103    300.00

TBAT customer_balance:
optime  oid    float
time1   101    100.00
time1   102    200.00
time1   103    300.00

Suppose the existing records were inserted in one batch at time1.
9. AMO Update (logical)
Example:
Update query on customer table:
update customer set balance=201.00 where id=2
Current timestamp is time2 (>time1).
The newest TBUN, (time2, 102, 201.00), is appended to the end of TBAT customer_balance as new data; the old TBUN (time1, 102, 200.00) is left in place as old data.
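The append-only behavior of this update can be sketched in Python. It is a minimal in-memory illustration, not the Map-Reduce implementation: the function name `amo_update` is hypothetical, and time1 and time2 are modeled as the integers 1 and 2.

```python
from typing import List, Tuple

TBUN = Tuple[int, int, float]  # (optime, oid, value)


def amo_update(tbat: List[TBUN], updates: List[Tuple[int, float]], now: int) -> None:
    """Append one timestamped TBUN per update; existing TBUNs are never modified."""
    tbat.extend((now, oid, value) for oid, value in updates)


TIME1, TIME2 = 1, 2  # hypothetical integer timestamps with TIME2 > TIME1
customer_balance: List[TBUN] = [
    (TIME1, 101, 100.00),
    (TIME1, 102, 200.00),
    (TIME1, 103, 300.00),
]
amo_update(customer_balance, [(102, 201.00)], now=TIME2)
print(customer_balance)
```

Because nothing is looked up or overwritten, the cost of the update depends on the size of the update list rather than on the size of the TBAT.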
10. Selection after AMO Update
• Data consistency is preserved in a TBAT after an AMO update.
• Example:
– Selection after AMO update:
SELECT balance FROM customer WHERE id=2
– Two tuples will be retrieved:
t1=(time1, 102, 200.00)
t2=(time2, 102, 201.00)
– Compare the timestamps: time2 > time1, so 201.00 is returned, which is consistent with the last updated value.
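The timestamp comparison in this selection can be sketched as a full scan that keeps the newest version per OID. This is an illustrative in-memory version only: `select_latest` is a hypothetical name, and time1 and time2 are modeled as the integers 1 and 2.

```python
from typing import List, Optional, Tuple

TBUN = Tuple[int, int, float]  # (optime, oid, value)


def select_latest(tbat: List[TBUN], oid: int) -> Optional[float]:
    """Return the value of the newest TBUN for the given oid, or None if absent."""
    matches = [tbun for tbun in tbat if tbun[1] == oid]
    if not matches:
        return None
    # The TBUN with the largest optime is the most recent version.
    return max(matches, key=lambda tbun: tbun[0])[2]


tbat: List[TBUN] = [
    (1, 101, 100.00),
    (1, 102, 200.00),
    (1, 103, 300.00),
    (2, 102, 201.00),  # appended by the update at time2
]
print(select_latest(tbat, 102))
```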
11. Preliminary Experiment
• Performed on a cluster running Cloudera's Distribution Including Apache Hadoop (CDH) version 5.3
– 1 master and 3 slaves
– Total HDFS capacity = 310GB (block size = 64MB)
– Interconnection is Gigabit Ethernet
• Data sets: 1GB and 10GB of random synthetic data, stored in both BAT and TBAT formats.
• Update queries: modify from 10% to 30% of the original data.
12. Preliminary Experiment Results (cont.)
[Figure: 1GB Update Running Time. x-axis: Update Percentage (%), 10 to 30; y-axis: Running Time (sec), 0 to 500; series: BAT, TBAT.]
13. Preliminary Experiment Results (cont.)
[Figure: 10GB Update Running Time. x-axis: Update Percentage (%), 10 to 30; y-axis: Running Time (sec), 0 to 5000; series: BAT, TBAT.]
14. Preliminary Experiment Results (cont.)
[Figure: Overhead Changing over Data Sets. x-axis: Update Percentage (%), 10 to 30; y-axis: Overhead (%), 0 to 180; series: 1GB, 10GB.]
16. Conclusion
• We introduce a new method called AMO update for write optimization on out-of-core (OOC) column-store databases in Map-Reduce.
• AMO update employs TBAT to improve the update performance with data atomicity guaranteed.
• Preliminary experiment results show a significant improvement in running speed with AMO update.
17. Future Work
• Study how the performance of the Map-Reduce selection algorithm on a TBAT varies after different percentages of the file have been updated.
• Introduce distributed local indexing on each TBAT slip in HDFS to improve the global data retrieval performance.
18. THANK YOU!
Feng “George” Yu
Computer Science and Information Systems
Youngstown State University, Youngstown, OH
fyu@ysu.edu