
ICDE2015 Research 3: Distributed Storage and Processing


  1. 1. [ICDE2015 Study Group] Research 3: Distributed Storage and Processing. Presenter: Takuma Wakamori (NTT), 2015.5.16
  2. 2. Papers covered: 1. PABIRS: A Data Access Middleware for Distributed File Systems – S. Wu (Zhejiang Univ.), G. Chen, X. Zhou, Z. Zhang, A. K. H. Tung, and M. Winslett (UIUC). 2. Scalable Distributed Transactions across Heterogeneous Stores – A. Dey, A. Fekete, and U. Röhm (Univ. of Sydney)
  3. 3. PABIRS: A Data Access Middleware for Distributed File Systems
     • Goal – Efficiently process workloads that mix highly selective queries and analytic queries on a distributed file system, e.g. retrieving the last few months of records (a few thousand) out of billions of call records
     • Challenges – Preprocessing such as sorting and indexing helps, but complex preprocessing lowers insertion throughput – Designing an index for real data with a power-law distribution (figures below) is also difficult
     • Contribution – PABIRS, a unified data access middleware that efficiently handles workloads mixing complex queries
     (Figures quoted from the original paper; example: 1,000 phone numbers (caller IDs) sampled at random from a telecom operator's call log)
     Fig. 1. Distribution of call frequency – calls from 1% of the IDs account for more than half of all calls (power-law distribution); analyses aggregate by attributes such as location and call count
     Fig. 2. Number of data blocks containing each ID – records of frequent IDs appear in almost every DFS block
  4. 4. PABIRS = Bitmap + LSM index
     • Access paths to (semi-)structured data on a DFS – a GET interface to the DFS – MapReduce processing: an InputFormat feeding map tasks – KVS transactions: a secondary index
     • DFS wrapper: a hybrid index – Bitmap index: for low-selectivity keys/tuples – LSM index: built only for hot values – a cost estimate chooses between the two (slide 5)
     Fig. 3. PABIRS architecture – Lookup(key), Insert(key, value), and InputFormat interfaces layered on top of the DFS
     Fig. 4. Bitmap example – each data block keeps a block signature; the signatures form a DAG-based hierarchy (directory vertices → data vertices) in which each intermediate vertex summarizes the signatures of its descendants (a small signature sketch follows below)
     LSM Tree [O'Neil+, '96] – key property: high write throughput; writes go to the in-memory C0 component (an AVL tree), and when C0 exceeds a size threshold it is rolling-merged into the on-disk C1 component (a B-tree)
     (Figures quoted from the original PABIRS and LSM-Tree papers)
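     To make the bitmap-signature idea concrete, here is a minimal Java sketch; the class and method names (BlockSignature, candidateBlocks) and the 1024-bit width are illustrative assumptions, not the paper's implementation. Each data block gets a fixed-width signature in which several key values may share a bit, so a lookup only narrows the search to candidate blocks; false positives are resolved by scanning those blocks, and false negatives cannot occur. A directory vertex in the DAG simply ORs the signatures of its children, letting a query skip whole subtrees whose summary bit is 0.

       // Minimal sketch of PABIRS-style per-block bitmap signatures (illustrative names).
       import java.util.ArrayList;
       import java.util.BitSet;
       import java.util.List;

       class BlockSignature {
           static final int SIGNATURE_BITS = 1024;       // m bits per block; several keys may map to one bit
           final BitSet bits = new BitSet(SIGNATURE_BITS);

           // Hash an attribute value to one bit position.
           static int bitFor(String key) {
               return Math.floorMod(key.hashCode(), SIGNATURE_BITS);
           }

           void add(String key) { bits.set(bitFor(key)); }

           // False positives are possible (hash collisions), false negatives are not.
           boolean mayContain(String key) { return bits.get(bitFor(key)); }
       }

       class BitmapIndex {
           final List<BlockSignature> signatures = new ArrayList<>();   // one signature per DFS block, in block order

           // Ids of blocks that may hold records for the key; only these blocks are read from the DFS.
           List<Integer> candidateBlocks(String key) {
               List<Integer> hits = new ArrayList<>();
               for (int b = 0; b < signatures.size(); b++) {
                   if (signatures.get(b).mayContain(key)) hits.add(b);
               }
               return hits;
           }
       }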
  5. 5. Optimizing the hybrid index
     1. Cost model and optimization of the bitmap signature – high-level vertices are built from groups of low-level vertices according to a fanout parameter F – a cost model is defined and the F that minimizes the cost is estimated – the signature graph is searched with the BSP model of Pregel [Malewicz+, '10]
     2. Optimization with the LSM index – an LSM index is created for a key once its selectivity exceeds a threshold (see the cost-based sketch below)
     Fig. 7. Search cost of bitmap vs. LSM – the LSM cost is roughly constant regardless of selectivity, while the bitmap is faster at selectivities of 0.1% or less; in practice more than 90% of queries have selectivity below 0.1%
     Fig. 8. Index update – new data is appended to the DFS; an offline MapReduce job refreshes the bitmap signatures and the LSM hot keys
     (See the original paper for further optimizations, e.g. for analytic workloads; figures quoted from the original paper)
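     The paper's selection rule compares estimated costs: a key is kept in the LSM index only when its estimated LSM lookup cost is no larger than the bitmap cost. The Java sketch below uses the paper's notation loosely (theta for selectivity, rd for the cost of reading a data block, Bp/Bt for records per block) but the formulas are simplified stand-ins, assuming records are spread uniformly at random over the DFS blocks; it is illustrative, not the paper's exact cost model.

       // Hedged sketch of the cost-based choice between the bitmap and the LSM index.
       class IndexSelector {
           final long   totalBlocks;       // number of data blocks in the DFS
           final long   recordsPerBlock;   // roughly Bp / Bt in the paper's notation
           final double rd;                // cost of reading one data block
           final double lsmLookupCost;     // roughly constant: B-tree descent plus scan of the sorted replica

           IndexSelector(long totalBlocks, long recordsPerBlock, double rd, double lsmLookupCost) {
               this.totalBlocks = totalBlocks;
               this.recordsPerBlock = recordsPerBlock;
               this.rd = rd;
               this.lsmLookupCost = lsmLookupCost;
           }

           // Expected number of blocks containing at least one record of a key with selectivity theta;
           // this grows quickly with theta, which is why the bitmap only wins for very selective keys
           // (<= ~0.1% in Fig. 7).
           double bitmapCost(double theta) {
               double pBlockHit = 1.0 - Math.pow(1.0 - theta, recordsPerBlock);
               return totalBlocks * pBlockHit * rd;
           }

           // Keep the key in the LSM index only when its estimated cost is no larger than the bitmap cost.
           boolean indexInLsm(double theta) {
               return lsmLookupCost <= bitmapCost(theta);
           }
       }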
  6. 6. Experiments and evaluation
     • Setup – Implemented on Hadoop 1.0.4 + GPS [Salihoglu+, '12] (an open-source implementation of Pregel) – Experiments on a 32-node Hadoop cluster (4-core CPU and 8 GB RAM per node)
     • Workloads – A. Highly selective queries: three select queries over the call-log data, compared with HBase Phoenix, Impala, and BIDS [Lu+, '13] (Fig. 13) – B. Analytic queries: TPC-H Q3, Q5, and Q6 on synthetic data generated with tpcdskew [Bruno+, '05], compared with Hive (Fig. 21: skew, Fig. 22: selectivity)
     Table III. Processing time of PABIRS – Q1: selectivity 1.2%, index 1.03 s, disk 1.47 s, total 2.50 s; Q2: 70%, 2.11 s, 137.63 s, 139.74 s; Q3: 0.8%, 1.04 s, 1.28 s, 2.32 s
     Analytic-query results – with small skew, performance is on par with Hive; with large skew, the index improves performance; no benefit for Q5 (the index exists only on the orders table)
     (Figures quoted from the original paper)
  7. 7. Scalable Distributed Transactions across Heterogeneous Stores
     • Goal – Run multi-item transactions that span different data stores, e.g.: BEGIN TRANSACTION; SET item1 of Store1; SET item2 of Store2; COMMIT TRANSACTION
     • Challenges – Implementing transactions inside the application is error-prone for programmers and risks losing availability and scalability – Introducing coordinator middleware requires every application to be placed under its control
     • Contributions – Cherry Garcia (CG), a client library that supports multi-item transactions across heterogeneous data stores – Implemented against Windows Azure Storage (WAS), Google Cloud Storage (GCS), and Tora (a high-throughput KVS) – Evaluated with YCSB+T [Dey+, '14], a web-scale transaction benchmark
  8. 8. Transactions across different data stores
     Listing 1. Example code that uses the CG API to access two data stores (quoted from the original paper, with extraction artifacts repaired):
       public void UserTransaction() {
         Datastore cds = Datastore.create("credentials.xml");
         Datastore gds = Datastore.create("goog_creds.xml");
         Datastore wds = Datastore.create("msft_creds.xml");
         Transaction tx = new Transaction(cds);
         try {
           tx.start();
           Record saving = tx.read(gds, "saving");
           Record checking = tx.read(wds, "checking");
           int s = saving.get("amount");
           int c = checking.get("amount");
           saving.set("amount", s - 5);
           checking.set("amount", c + 5);
           tx.write(gds, "saving", saving);
           tx.write(wds, "checking", checking);
           tx.commit();
         } catch (Exception e) {
           tx.abort();
         }
       }
     • Datastore: a data store instance; Transaction: the transaction coordinator
     • The code reads 'saving' from the Google Cloud Storage Datastore (gds) and 'checking' from the Windows Azure Storage Datastore (wds), then updates both; a third Datastore acting as the Coordinating Data Store (CDS) backs the Transaction
  9. 9. Cherry Garcia (CG): a client library
     • Assumptions about the underlying stores – strong consistency when reading a single record – atomic single-item update/delete (test-and-set) – user-defined metadata can be stored with each item
     • Client-side transaction coordination, in outline – each record is treated like a single-item database – transactions are coordinated with 2PC – there is no central coordinator – transaction state is carried with the data itself and coordination is done by the client (a sketch of the per-record version metadata follows below)
     Fig. 1. Library architecture – applications link the Cherry Garcia library, which talks to WAS, GCS, and Tora through per-store Datastore abstractions and store-specific REST APIs, plus a coordinating store holding the TSR (figure quoted from the original paper)
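     Per the original paper, every item keeps its last committed version and, while a write is in flight, the currently active one; each version is tagged with the writing transaction's identifier, its commit time, and the URI of a globally visible Transaction Status Record (TSR), which readers consult to decide which version is visible. A minimal sketch of that per-record metadata follows; the class and field names are illustrative, not the library's actual API.

       // Sketch of the metadata that lets transaction state travel with the data (illustrative names).
       import java.util.HashMap;
       import java.util.Map;

       class RecordVersion {
           Map<String, Object> fields = new HashMap<>();   // the application payload
           String txId;         // identifier of the transaction that wrote this version
           String tsrUri;       // URI of the globally visible Transaction Status Record (TSR)
           long   commitTime;   // commit timestamp of the writing transaction
       }

       class StoredRecord {
           RecordVersion lastCommitted;   // always readable
           RecordVersion current;         // present only while a writing transaction is in flight

           // A reader consults the TSR referenced by the in-flight version: if that transaction
           // has committed, the current version wins, otherwise fall back to the last committed one.
           RecordVersion visibleVersion(TsrClient tsr) {
               if (current != null && tsr.isCommitted(current.tsrUri)) {
                   return current;
               }
               return lastCommitted;
           }
       }

       interface TsrClient {
           boolean isCommitted(String tsrUri);   // one lookup against the coordinating data store
       }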
  10. 10. Timeline of a CG transaction
     • 2PC, coordinated by the client – each record carries its current and previous state – PREPARED flags are set on the records in the order of their key hashes – a Transaction Status Record (TSR) is written to the Coordinating Data Store (CDS), after which the COMMITTED flags are set (in parallel); a sketch of this commit path follows below
     Fig. 2. Timeline of three transactions running on two client hosts, accessing records in two data stores and using a third data store as the CDS (figure quoted from the original paper)
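     A minimal sketch of this commit path: set PREPARED on the written records in key-hash order, make the transaction durable in a single step by writing the TSR to the CDS, then flip the records to COMMITTED in parallel. The Store interface and its methods below are stand-ins for the library's Datastore abstraction, not its real API.

       // Hedged sketch of the two-phase commit driven from the client (illustrative names).
       import java.util.Comparator;
       import java.util.List;

       class Commit {
           static void commit(String txId, List<WriteOp> writes, Store cds) throws Exception {
               // Phase 1: PREPARE each written record, ordered by the hash of its key so that
               // concurrent transactions prepare in a consistent order.
               writes.sort(Comparator.comparingInt((WriteOp w) -> w.key.hashCode()));
               for (WriteOp w : writes) {
                   w.store.prepare(txId, w.key, w.newVersion);   // test-and-set against the old version
               }
               // The transaction becomes durable in one step: the TSR is written to the CDS.
               cds.writeTsr(txId, "COMMITTED");
               // Phase 2: mark each record COMMITTED; this can run in parallel because readers
               // can already resolve visibility through the TSR.
               writes.parallelStream().forEach(w -> w.store.markCommitted(txId, w.key));
           }
       }

       class WriteOp {
           Store  store;
           String key;
           Object newVersion;   // e.g. the RecordVersion sketched on the previous slide
       }

       interface Store {
           void prepare(String txId, String key, Object newVersion) throws Exception;
           void markCommitted(String txId, String key);
           void writeTsr(String txId, String state) throws Exception;
       }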
  11. 11. Implementation and experiments
     • Cherry Garcia implementation – a Java library (JDK 1.6) – the Datastore abstraction is implemented for Windows Azure Storage (WAS), Google Cloud Storage (GCS), and Tora (a KVS built on the WiredTiger storage engine)
     • Experiments (figures quoted from the original paper) – Fig. 6. Aborts while varying the Zipfian theta, 1 YCSB+T client against a 1-node Tora cluster – Fig. 7. Throughput of 4 YCSB+T client hosts, each with 1 to 64 threads, against a 4-node Tora cluster – Fig. 8. Throughput of YCSB+T with 16 to 128 threads on 8 client hosts against a 4-node Tora cluster: scales linearly, up to 23,288 transactions/sec – Fig. 9. Transaction overhead and the effect of the 1-phase optimization (*), which skips the PREPARE phase for single-item transactions: the overhead with this optimization is small, and the parallelized phase 2 improves throughput
