
Write Optimization of Column-Store Databases in Out-of-Core Environment

Oct. 20, 2016

  1. Write Optimization of Column-Store Databases in Out-of-Core Environment. Dr. Feng “George” Yu, Assistant Professor, Department of Computer Science and Information Systems, Youngstown State University, fyu@ysu.edu
  2. Outline
     Part I: Write-Optimization
     Part II: Data Cleaning
     Part III: Application on Big Data
  3. What is a Column-Store Database? A column-store database is also known as a columnar database or column-oriented database. The history of column-store databases can be traced back to the 1970s, but it was not until about 2005 that many open-source and commercial implementations took off. Well-known column-store databases: [shown as logos on the slide; not captured in this transcript]
  4. Features of Column-Store Databases
     Fits well into a write-once-read-many environment.
     • Works especially well for OLAP and data mining queries
     • Such queries retrieve many records but need only a few attributes
     Higher data compression rate
     • Low data entropy
     • Much better than row-based storage
  5. Row-Based to Column-Store
     Fig. 1: customer Data in Row-Based and Column-Store (BAT) Format

     (a) Row-based table customer
         id   name      balance
         1    Alissa    100.00
         2    Bob       200.00
         3    Charles   300.00

     (b) BAT customer_id
         oid   int
         101   1
         102   2
         103   3

     (c) BAT customer_name
         oid   varchar
         101   Alissa
         102   Bob
         103   Charles

     (d) BAT customer_balance
         oid   float
         101   100.00
         102   200.00
         103   300.00

     A BUN consists of (oid, value). Mapping rules convert the relational data into the column-store format.
     Another benefit of the column-store database is data compression, which can reach a higher compression rate and higher speed than a traditional row-based database. One major reason is that the information entropy of the data in one column is lower than that of row-based data.
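
The decomposition in Fig. 1 can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `decompose` and the rule that oids start at 101 and increase by one per row are assumptions taken from the figure.

```python
def decompose(rows, columns, first_oid=101):
    """Split row-based records into one BAT per column.

    Each BAT is a list of BUNs, i.e. (oid, value) pairs. The starting
    oid of 101 mirrors Fig. 1; the real oid assignment rule is assumed.
    """
    bats = {name: [] for name in columns}
    for offset, row in enumerate(rows):
        oid = first_oid + offset
        for name, value in zip(columns, row):
            bats[name].append((oid, value))
    return bats


customer = [(1, "Alissa", 100.00), (2, "Bob", 200.00), (3, "Charles", 300.00)]
bats = decompose(customer, ["id", "name", "balance"])
```

Every BAT shares the same oids, so a row can be reassembled by looking up one oid in each BAT.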
  6. Challenge
     Optimizing write operations in a column-store database has always been a challenge because:
     • Data is vertically decomposed into BATs that are randomly distributed over the storage.
     • Writes are significantly delayed by ad hoc access to large BATs spread across multiple pages.
  7. Out-Of-Core (OOC)? Existing work focuses mainly on write optimization for main-memory column-store databases. To the best of our knowledge, very few works address write performance on Out-Of-Core (OOC, or external-memory) column-store databases.
  8. Traditional Update on BAT. In a traditional BAT, an update for a given OID involves two phases: (1) search for the location in the BAT by OID (time-consuming); (2) update the value at the target location.
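
The two phases can be sketched as follows. This is a simplified in-memory stand-in, assuming the BAT is a Python list of (oid, value) pairs; on an out-of-core BAT the search phase would touch disk pages, which is where the cost comes from. The function name is hypothetical.

```python
def traditional_update(bat, oid, new_value):
    """Update a BAT in place: search for the oid, then overwrite the value."""
    for position, (current_oid, _) in enumerate(bat):  # phase 1: search by oid
        if current_oid == oid:
            bat[position] = (oid, new_value)           # phase 2: update in place
            return position
    raise KeyError(oid)


customer_balance = [(101, 100.00), (102, 200.00), (103, 300.00)]
position = traditional_update(customer_balance, 102, 201.00)
```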
  9. Motivation: (1) Avoid searching. (2) Allow multiple values for a given OID. (3) Keep data consistent.
  10. Timestamped Binary Association Table (TBAT)

      BAT customer_balance
          oid   float
          101   100.00
          102   200.00
          103   300.00

      TBAT customer_balance
          optime   oid   float
          time1    101   100.00
          time1    102   200.00
          time1    103   300.00

      Suppose the existing records were inserted in one batch at time1.
  11. The principle of the AOC (asynchronous out-of-core) update is to avoid OOC searching and writing wherever possible and to use the timestamp field of the TBAT to label each update. In an AOC update, the newly updated data is appended directly to the end of the TBAT. In this way, frequent ad hoc data searching is no longer needed.
  12. AOC Update Example
      Example: update query on the customer table:
          update customer set balance=201.00 where id=2
      The current timestamp is time2 (> time1). Instead of seeking the position of the record with oid=102, the AOC update directly appends a new tuple (time2, 102, 201.00) to the end of the TBAT customer_balance, where 201.00 is the newly updated value.

      Table 3: TBAT customer_balance after AOC Update
          optime   oid   float
          time1    101   100.00    (body)
          time1    102   200.00    (body)
          time1    103   300.00    (body)
          time2    102   201.00    (appendix: new update)
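
A minimal file-based sketch of the append in this example follows. It is not taken from the authors' code: the fixed-width record layout, the file name, and the use of the integers 1 and 2 for time1 and time2 are all assumptions made for illustration.

```python
import os
import struct
import tempfile

# Assumed record layout for one TBUN: (optime, oid, value), 24 bytes.
TBUN = struct.Struct("<qqd")


def aoc_update(path, optime, oid, value):
    """Append one TBUN to the end of the TBAT file; no search is performed."""
    with open(path, "ab") as tbat:
        tbat.write(TBUN.pack(optime, oid, value))


def read_tbat(path):
    """Read the whole TBAT file back as a list of (optime, oid, value)."""
    with open(path, "rb") as tbat:
        return [TBUN.unpack(chunk)
                for chunk in iter(lambda: tbat.read(TBUN.size), b"")]


path = os.path.join(tempfile.mkdtemp(), "customer_balance.tbat")
for oid, value in [(101, 100.00), (102, 200.00), (103, 300.00)]:
    aoc_update(path, 1, oid, value)   # initial batch, time1 = 1
aoc_update(path, 2, 102, 201.00)      # the update from the example, time2 = 2
tbuns = read_tbat(path)
```

The old tuple for oid 102 is still in the file; only the timestamp distinguishes it from the new one.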
  13. Selection after AOC Update
      Data consistency remains intact in a TBAT after an AOC update. After the AOC updates have been applied to the TBAT of customer, we run the following query:
          SELECT balance FROM customer WHERE id=2
      In the updated TBAT customer_balance, two tuples are returned:
          t1 = (time1, 102, 200.00)
          t2 = (time2, 102, 201.00)
      Comparing the timestamps gives time2 > time1, so 201.00 is returned, which is consistent with the last updated value.
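
The timestamp comparison can be sketched as below, assuming the TBAT is available as a list of (optime, oid, value) tuples and that integer timestamps stand in for time1 and time2. The function name is hypothetical.

```python
def select_latest(tbat, oid):
    """Return the value of the TBUN with the greatest optime for this oid."""
    matches = [(optime, value) for optime, tbun_oid, value in tbat
               if tbun_oid == oid]
    if not matches:
        return None
    # Several tuples may share the oid; the newest timestamp wins.
    return max(matches, key=lambda match: match[0])[1]


customer_balance = [
    (1, 101, 100.00),
    (1, 102, 200.00),
    (1, 103, 300.00),
    (2, 102, 201.00),  # appended by the AOC update at time2
]
balance = select_latest(customer_balance, 102)
```

Note that the scan visits every tuple, including the appendix, which is the linear-search cost the later slides discuss.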
  14. AOC Update Experiment. A preliminary experiment compares the speed of AOC updates on TBATs with that of traditional updates on BATs. The experiment is performed on a CentOS 6.5 workstation with an Intel Core i7-3700 3.4GHz CPU, 16GB of memory, and a 250GB SATA 7200RPM hard disk. The test code is implemented in Python 2.7.
  15. AOC Update Experiment (cont.)
      Figure: AOC Update and Traditional Update Running Time (elapsed time in seconds)
          Update percentage   BAT     TBAT
          10%                 2.27    1.63E-03
          20%                 4.71    3.25E-03
          30%                 7.13    4.81E-03
          40%                 9.59    6.41E-03
          50%                 12.01   7.95E-03
  16. AOC Update Experiment (cont.)
      Figure: Times of AOC Update Faster than Traditional Update
      Times faster = Time(Update on BAT) / Time(AOC Update on TBAT)
          Update percentage   Times faster
          10%                 1392.84
          20%                 1449.65
          30%                 1484.23
          40%                 1495.56
          50%                 1509.9
      Average: 1466.436 times faster
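
The reported average can be checked with a few lines of arithmetic. The sketch below uses only the numbers printed on slides 15 and 16. The ratios recomputed from the rounded running times differ slightly from the reported ones, presumably because the chart labels are rounded; that explanation is an assumption.

```python
bat_seconds = [2.27, 4.71, 7.13, 9.59, 12.01]
tbat_seconds = [1.63e-3, 3.25e-3, 4.81e-3, 6.41e-3, 7.95e-3]
reported = [1392.84, 1449.65, 1484.23, 1495.56, 1509.9]

# Times faster = Time(update on BAT) / Time(AOC update on TBAT)
recomputed = [bat / tbat for bat, tbat in zip(bat_seconds, tbat_seconds)]

# Mean of the five reported ratios; matches the 1466.436 on the slide.
average_reported = sum(reported) / len(reported)
```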
  17. Potential Problems with AOC Update. When many AOC updates are performed, searching gradually slows down because the unsorted appendix requires a linear search: the greater the volume of updated data, the slower the search.
  18. Search Speed Degeneration. Figure: Selection Query Execution Overhead, TBAT over BAT (time(TBAT) / time(BAT) × 100%).
  19. Outline
      Part I: Write-Optimization
      Part II: Data Cleaning
      Part III: Application on Big Data
  20. Data Cleaning After AOC Update
      Data cleaning has recently drawn a lot of attention. In our context, data cleaning is the process of merging the updated data from the appendix into the body. It serves to:
      • Remove multiple values of the same OID
      • Avoid the slower linear search
      Offline Data Cleaning performs this merge during non-peak times, folding the recently updated data into the body.
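
A minimal sketch of such a merge is shown below. It assumes the body holds one TBUN per oid, that tuples are (optime, oid, value), and that the cleaned body should be sorted by oid; none of these details are stated on the slide, and the function name is hypothetical.

```python
def offline_clean(body, appendix):
    """Merge appendix TBUNs into the body, keeping the newest value per oid."""
    latest = {oid: (optime, oid, value) for optime, oid, value in body}
    for optime, oid, value in appendix:
        # Keep the appended tuple only if it is at least as new as what we have.
        if oid not in latest or optime >= latest[oid][0]:
            latest[oid] = (optime, oid, value)
    return [latest[oid] for oid in sorted(latest)]


body = [(1, 101, 100.00), (1, 102, 200.00), (1, 103, 300.00)]
appendix = [(2, 102, 201.00), (3, 102, 202.00)]
cleaned = offline_clean(body, appendix)
```

After the merge each oid appears once, so a selection no longer has to scan an unsorted appendix.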
  21. Problems with the Offline Data Cleaning Method. This method takes the database offline, so any incoming queries must wait until the database comes back online. Such a lapse in service may be unacceptable in environments with a constant workload, such as continuous input streams.
  22. Online Data Cleaning. The major difference in online data cleaning is the use of a more sophisticated data structure called a snapshot. The idea of a live snapshot originates in cloud computing.
      Figure (labels only). During Online Cleaning: Body (original), Snapshot of Body, Appendix, New Appendix; operations shown: online merge, read, read, read & write. After Online Cleaning: Body (merged), Appendix (new).
  23. Online Data Cleaning (cont.)
      The Online Eager Data Cleaning method (speed priority) merges the entire appendix of a TBAT into the body in one pass to save time.
      The Online Progressive Data Cleaning method (memory-usage priority) is used in more extreme cases, when the full appendix may not fit into memory. The DBA manually chooses a block size; the appendix is split into blocks of that size, which are added to an appendix queue. The eager method is then applied to each queued appendix file. Streaming updates that arrive in the meantime are written to a new split appendix file, which is queued once it reaches the block size.
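
The progressive variant can be sketched as a queue of appendix blocks, each merged with the same routine an eager merge would use. This is a simplified in-memory illustration: the snapshot mechanism and the handling of streaming updates are left out, and all function names are hypothetical.

```python
from collections import deque


def merge_block(body, block):
    """Eager-style merge of one appendix block into the body."""
    latest = {oid: (optime, oid, value) for optime, oid, value in body}
    for optime, oid, value in block:
        if oid not in latest or optime >= latest[oid][0]:
            latest[oid] = (optime, oid, value)
    return [latest[oid] for oid in sorted(latest)]


def progressive_clean(body, appendix, block_size):
    """Split the appendix into blocks, queue them, and merge one at a time."""
    queue = deque(appendix[start:start + block_size]
                  for start in range(0, len(appendix), block_size))
    while queue:
        body = merge_block(body, queue.popleft())
    return body


body = [(1, 101, 100.00), (1, 102, 200.00), (1, 103, 300.00)]
appendix = [(2, 102, 201.00), (3, 103, 301.00), (4, 102, 202.00)]
progressive = progressive_clean(body, appendix, block_size=1)
eager = merge_block(body, appendix)
```

Only one block needs to be held in memory at a time, which is the point of choosing a block size; the result matches the eager merge.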
  24. Progressive Data-Cleaning Results
  25. Outline
      Part I: Write-Optimization
      Part II: Data Cleaning
      Part III: Application on Big Data
  26. Update on BAT in Map-Reduce
      In a Map-Reduce environment, we assume the updates, keyed by OID, are collected and submitted as a batch called UPDATE_LIST.
      1. Map-Reduce join: BAT LEFT OUTER JOIN UPDATE_LIST ON OID => (BAT combined with UPDATE_LIST)
         • Map-side join: when UPDATE_LIST is small enough to fit into memory
         • Reduce-side join: when UPDATE_LIST is too large to fit into memory
      2. Selective projection (map-only):
         FOR each record in (BAT combined with UPDATE_LIST)
             IF UPDATE_LIST attribute is not NULL:
                 output updated value (keep the most recent update)
             ELSE:
                 output original value
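
The two steps can be imitated in plain Python, without Hadoop, to show the data flow. This sketch corresponds to the map-side case: UPDATE_LIST is held in an in-memory dictionary. It assumes each update carries a timestamp, and for brevity it picks the most recent update while building the dictionary rather than in the projection step. All names are hypothetical.

```python
def left_outer_join(bat, update_list):
    """BAT LEFT OUTER JOIN UPDATE_LIST ON oid, with updates held in memory."""
    newest = {}
    for optime, oid, value in update_list:
        if oid not in newest or optime > newest[oid][0]:
            newest[oid] = (optime, value)
    for oid, value in bat:
        update = newest.get(oid)
        # The third field plays the role of the UPDATE_LIST attribute (or NULL).
        yield oid, value, (update[1] if update is not None else None)


def selective_projection(joined):
    """Output the updated value where one exists, else the original value."""
    for oid, original, updated in joined:
        yield oid, (updated if updated is not None else original)


bat = [(101, 100.00), (102, 200.00), (103, 300.00)]
update_list = [(2, 102, 201.00), (3, 102, 202.00)]
updated_bat = list(selective_projection(left_outer_join(bat, update_list)))
```

Every record of the BAT passes through both steps even when only a few oids change, which is the cost the TBAT approach aims to avoid.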
  27. TBAT (Timestamped BAT)
      TBAT in HDFS:
          struct TBUN {
              TIMESTAMP optime,
              ROWID oid,
              USER_DEFINED_TYPE attrv
          }
          struct TBAT_slip {
              TBUN[max_size_per_HDFS_slip] tbuns
          }
      • No need for any global pre-sorting or indexing
      • ‘attrv’ can be any user-defined type, which allows arbitrary kinds of schema to be defined flexibly
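
One possible Python rendering of these two structures is sketched below. The field names follow the slide; the integer timestamp, the use of `Any` for the user-defined type, and the helper `to_slips` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class TBUN:
    optime: int  # TIMESTAMP on the slide; an integer stand-in is assumed
    oid: int     # ROWID
    attrv: Any   # USER_DEFINED_TYPE: any Python object in this sketch


def to_slips(tbuns: List[TBUN], max_size_per_slip: int) -> List[List[TBUN]]:
    """Cut a run of TBUNs into slips of bounded size, in arrival order."""
    return [tbuns[start:start + max_size_per_slip]
            for start in range(0, len(tbuns), max_size_per_slip)]


tbuns = [TBUN(1, 101, 100.00), TBUN(1, 102, 200.00),
         TBUN(1, 103, 300.00), TBUN(2, 102, 201.00)]
slips = to_slips(tbuns, max_size_per_slip=3)
```

No sorting happens anywhere in the sketch: slips are filled strictly in arrival order, in line with the first bullet above.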
  28. AMO Update (logical)
      Example: update query on the customer table:
          update customer set balance=201.00 where id=2
      The current timestamp is time2 (> time1). The newest TBUN, carrying 201.00, is appended to the end of TBAT customer_balance.
          optime   oid   float
          time1    101   100.00    (old data)
          time1    102   200.00    (old data)
          time1    103   300.00    (old data)
          time2    102   201.00    (new data)
  29. AMO Update Experiment
      Performed on a Cloudera Distributed Hadoop (CDH) cluster:
      • 1 master and 3 slaves
      • Total HDFS capacity = 310GB (block size = 64MB)
      • Interconnection is Gigabit Ethernet
      Data sets: 1GB and 10GB of random synthetic data in BAT and TBAT.
      Update queries: from 10% to 30% of the original data.
  30. AMO Update Experiment (cont.) Figure: 1GB Update Running Time. Running time (sec) versus update percentage (10% to 30%) for BAT and TBAT.
  31. AMO Update Experiment (cont.) Figure: 10GB Update Running Time. Running time (sec) versus update percentage (10% to 30%) for BAT and TBAT.
  32. AMO Update Experiment (cont.) Figure: Relative Overhead Changing over Data Sets. Overhead (%) versus update percentage (10% to 30%) for the 1GB and 10GB data sets.
  33. Generalized Write Optimization Framework (DEXA’15)
      Figure (labels only).
      Write Optimized Module: Input Stream; Buffer Pool of Atomic Buffers (one marked Read In); Write Queue of full Atomic Buffers; Serialized TBAT_1, Serialized TBAT_2, …, Serialized TBAT_N.
      Read Optimized Module: Read Optimized Data.
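
One way to read the labels in this diagram is sketched below: incoming TBUNs fill an atomic buffer, full buffers move to a write queue, and the queue is drained into serialized TBAT files. This interpretation, the class and method names, and the in-memory list standing in for serialized files are all assumptions; the actual design is described in the DEXA 2015 paper and the linked source code.

```python
from collections import deque


class WriteOptimizedModule:
    """Toy model: input stream -> atomic buffer -> write queue -> serialized TBATs."""

    def __init__(self, buffer_capacity):
        self.buffer_capacity = buffer_capacity
        self.current = []           # atomic buffer currently being read in
        self.write_queue = deque()  # full atomic buffers waiting to be written
        self.serialized = []        # stand-in for Serialized TBAT_1 .. TBAT_N

    def receive(self, tbun):
        """Accept one TBUN from the input stream."""
        self.current.append(tbun)
        if len(self.current) == self.buffer_capacity:
            self.write_queue.append(self.current)
            self.current = []

    def flush(self):
        """Drain the write queue, one serialized TBAT per full buffer."""
        while self.write_queue:
            self.serialized.append(tuple(self.write_queue.popleft()))


module = WriteOptimizedModule(buffer_capacity=2)
for optime in range(1, 6):
    module.receive((optime, 102, 200.00 + optime))
queued_before_flush = len(module.write_queue)
module.flush()
```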
  34. Publications
      1. Hastening Data Retrieval on Out-of-Core Column-Store Databases using Offset B+-Tree. F. Yu, E. S. Jones. 28th International Conference on Computer Applications in Industry and Engineering (CAINE 2015), October 12-14, 2015, Hilton San Diego/Harbor Island, San Diego, California, USA, pp. 313-318.
      2. A Framework of Write Optimization on Read-Optimized Out-of-Core Column-Store Databases. F. Yu, W.-C. Hou. 26th International Conference on Database and Expert Systems Applications (DEXA 2015), Valencia, Spain, September 1-4, 2015, pp. 155-169.
      3. Write Optimization using Asynchronous Update on Out-of-Core Column-Store Databases in Map-Reduce. F. Yu, E. S. Jones, W.-C. Hou. 2015 IEEE International Congress on Big Data, June 27 - July 2, 2015, New York, USA, pp. 720-723.
      4. Online Data Cleaning for Out-Of-Core Column-Store Databases with Timestamped Binary Association Tables. F. Yu, C. Luo, W.-C. Hou, E. S. Jones. Proceedings of the 30th International Conference on Computers and Their Applications (CATA 2015), Honolulu, Hawaii, USA, March 9-11, 2015, pp. 407-412.
      5. Asynchronous Update on Out-of-Core Column-Store Databases Utilizing the Timestamped Binary Association Table. F. Yu, C. Luo, W.-C. Hou, E. S. Jones. Proceedings of the 27th International Conference on Computer Applications in Industry and Engineering (CAINE 2014), New Orleans, Louisiana, USA, October 13-15, 2014, pp. 215-220.
  35. Source Code: https://github.com/YSU-Data-Lab/TBAT-DEXA15
  36. New Challenges
      • New Index on C-S DBs
          • Local and global
          • Searching
          • Data Cleaning
          • Parallel Processing
      • Big Data
          • Searching
          • Data Cleaning
          • Auto Mapping
          • To index or not to index?
      • Broader Applications
          • Scientific Data Management
          • Big Data Analytics
          • Machine Learning
          • OLAP
          • OLTP
          • HPC
          • HTC
  37. Thank you! Feng “George” Yu, Computer Science and Information Systems, Youngstown State University, Youngstown, OH, fyu@ysu.edu