Undergraduate Class of 101 Capstone Project: CUDA

Published in: Technology, Sports

  1. PARALLELIZING THE PROBLEM OF SET INTERSECTION BY GPU COMPUTING. Advisor: Prof. 伍朝欽. Team members: 張力升, 黃迺翔, 林承祖, 黃勛賢
  2. Outline: CUDA; apriori algorithm; Memory coalescing; Algorithm; Experimental results; Conclusion
  3. CUDA: Background; GPU versus CPU; Architecture; Programming Model; Streaming Multiprocessor
  4. Background: CUDA is NVIDIA's architecture for parallel computing on the GPU. Programmers can implement parallel processing with either a high-level language or the driver API.
  5. GPU versus CPU (diagram). CPU: Control, Cache, a few ALUs, DRAM. GPU: Control & cache, many ALUs, DRAM.
  6. CUDA architecture: CUDA consists of three parts: the Library, the Runtime, and the Driver. A program under development controls and uses the GPU's computing power through these three parts. (Diagram, layered from top to bottom: CPU Application, CUDA Library, CUDA Runtime, CUDA Driver, GPU.)
  7. Programming Model (diagram): the host (CPU) calls Kernel 1 and Kernel 2 on the device (GPU); one kernel corresponds to one grid (Grid 1, Grid 2). A grid is made of blocks, e.g. Block (0, 0) through Block (1, 2), and a block is made of threads, e.g. Thread (0, 0) through Thread (2, 1). Blocks can be arranged in up to two dimensions, and threads in up to three.
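The grid, block, and thread hierarchy on the Programming Model slide is usually flattened into one global index per thread. The following is a minimal Python sketch of that index arithmetic, simulated on the CPU. The function names `global_thread_id` and `launch_ids` are illustrative and do not come from the slides, and a one-dimensional grid and block are assumed.

```python
def global_thread_id(block_idx, block_dim, thread_idx):
    """Flatten (block index, thread index) into one global index.

    Mirrors the common CUDA expression
    blockIdx.x * blockDim.x + threadIdx.x for a 1-D launch.
    """
    return block_idx * block_dim + thread_idx


def launch_ids(num_blocks, threads_per_block):
    """List the global index of every thread in a simulated 1-D launch."""
    return [
        global_thread_id(b, threads_per_block, t)
        for b in range(num_blocks)
        for t in range(threads_per_block)
    ]
```

With the largest configuration used in the experiments (Block = 10, Thread = 512), `launch_ids(10, 512)` yields 5120 distinct indices, one per thread.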
  8. Memory model (device), ordered by speed: read/write per-thread: registers; read/write per-block: shared memory; read/write per-thread: local memory (DRAM); read/write per-grid: global memory (DRAM); read-only per-grid: constant and texture memories (DRAM). (Diagram: a grid with Block (0, 0) and Block (1, 0); each block has its own shared memory, each thread has its own registers and local memory, and the host connects to global, constant, and texture memory.)
  9. G80 Streaming Multiprocessor (SM): a multithreaded issue unit (MT issue) that schedules instructions; an instruction cache (I cache) and a constant cache (C cache); 8 streaming processors (SP), each SP processing one thread; 2 Special Function Units (SFU) for transcendental operations (e.g. sin, cos); a 16 KB read/write shared memory, used as software-managed data storage.
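  10. apriori algorithm: Data Mining; Association Analysis; Find Frequent Patterns; apriori Algorithm
  11. Data Mining (1): Concept: extract understandable, useful information from large amounts of stored data to help decision makers reach better-informed decisions. (Diagram: data → Data Mining → information.)
  12. Data Mining (2): Scope: models that turn data into information by describing the features and relationships in the data: Classification; Association; Clustering; Trend Analysis; Sequence Pattern
  13. Association analysis: analyze how two pieces of data are related. E.g. a retail store manager choosing a promotion strategy finds in the transaction records that carbonated drinks and potato chips are bought together unusually often, and can plan the promotion around that information.
  14. Find Frequent Patterns (1): Concept: find the data that occur most frequently. E.g. identify which combinations of products many customers buy together.
  15. Find Frequent Patterns (2): Definitions. Minimum support: the minimum number of occurrences required. K-itemset: a set of K items, e.g. the 2-itemsets {A, B} and {A, C}. Transactions: the index of the itemsets, e.g. T1 = {A, B, C}, T2 = {A, D}.
  16. Find Frequent Patterns (3): Minimum support, e.g. minimum support = 3: itemsets that appear in 3 or more transaction records. 1-itemsets: {A}, {B}, {C}, {D}, {E}. 2-itemsets: {A,B}, {B,C}, {B,D}, {B,E}, {C,D}. 3-itemset: {B,C,D}. Legend: {A} potato chips, {B} carbonated drinks, {C} instant noodles, {D} milk, {E} eggs. (Table columns: transaction number, transaction record.)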
  17. apriori Algorithm (1): apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested (Agrawal & Srikant @VLDB'94; Mannila et al. @KDD'94).
  18. apriori Algorithm (2) (diagram): Data → Handle → Result, with Search and Sort steps.
  19. apriori Algorithm (3), minimum support = 3. Support counts in original order: {A} 3, {B} 6, {C} 4, {D} 4, {E} 3. Sorted: {B} 6, {C} 4, {D} 4, {A} 3, {E} 3. Frequent 1-itemsets: {A}, {B}, {C}, {D}, {E}.
  20. apriori Algorithm (4), minimum support = 3. Support counts in original order: {A,B} 3, {A,C} 2, {A,D} 2, {A,E} 1, {B,C} 4, {B,D} 4, {B,E} 3, {C,D} 3, {C,E} 2, {D,E} 2. Sorted: {B,C} 4, {B,D} 4, {A,B} 3, {B,E} 3, {C,D} 3, {A,C} 2, {A,D} 2, {C,E} 2, {D,E} 2, {A,E} 1. Frequent 2-itemsets: {A,B}, {B,C}, {B,D}, {B,E}, {C,D}.
  21. apriori Algorithm (5), minimum support = 3. Support counts in original order: {A,B,C} 2, {A,B,D} 2, {A,B,E} 1, {B,C,D} 3, {B,C,E} 2, {B,D,E} 2, {C,D,E} 2. Sorted: {B,C,D} 3, {A,B,C} 2, {A,B,D} 2, {B,C,E} 2, {B,D,E} 2, {C,D,E} 2, {A,B,E} 1. Frequent 3-itemset: {B,C,D}.
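The level-by-level procedure on the apriori slides (count support, keep itemsets that reach the minimum support, build larger candidates only from frequent ones) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function name `apriori` is illustrative, and the transaction list in the usage note is hypothetical because the original transaction table did not survive in the transcript.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count in how many transactions each candidate occurs.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        prev = set(level)
        # Join step: unions of frequent itemsets that have exactly k items.
        joined = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        ]
    return frequent
```

For the hypothetical transactions `["ABC", "AB", "BCD", "BCD", "ABCD"]` with a minimum support of 3, the candidate {A,B,C} is never counted because its subset {A,C} is infrequent, which is the pruning principle at work.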
  22. Memory coalescing: Concept; Compare
  23. Concept: to maximize global memory bandwidth, minimize the number of bus transactions by coalescing memory accesses. Coalescing: memory transactions are issued per half-warp (16 threads).
  24. Compare (1): a half-warp (threads 0 to 15) accesses addresses 128, 132, 136, 140, …, 188 in order. Case 1: all threads participate. Case 2: some threads do not participate.
  25. Compare (2): the same half-warp and address range. Case 1: permuted access by threads. Case 2: misaligned starting address (not a multiple of 64).
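The four access patterns on the Compare slides can be checked with a small Python model. This is a simplified sketch of the criteria the slides show (thread k reads the k-th 4-byte word of a segment whose start is a multiple of 64), not an exact description of any particular GPU generation. The name `is_coalesced` and the convention of using `None` for a non-participating thread are assumptions made for this example.

```python
WORD_BYTES = 4                              # one 32-bit word per thread
HALF_WARP = 16                              # threads per memory transaction
SEGMENT_BYTES = WORD_BYTES * HALF_WARP      # 64-byte aligned segment


def is_coalesced(addresses):
    """addresses[k] is the byte address read by thread k, or None if idle."""
    active = [(k, a) for k, a in enumerate(addresses) if a is not None]
    if not active:
        return True
    first_thread, first_addr = active[0]
    # Address that thread 0 would read if the pattern is sequential.
    base = first_addr - WORD_BYTES * first_thread
    if base % SEGMENT_BYTES != 0:
        return False                        # misaligned starting address
    # Every active thread k must read word k of the segment.
    return all(a == base + WORD_BYTES * k for k, a in active)
```

Under this model, the sequential pattern starting at address 128 passes with or without idle threads, while swapping the addresses of two threads or starting at address 132 fails.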
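  26. Algorithm: Purpose; Parameter definitions; Experimental algorithm
  27. Purpose: solve the Find Frequent Patterns problem in parallel, and optimize the program with Memory Coalescing. 1-itemsets: {A}, {B}, {C}, {D}, {E}. 2-itemsets: {A,B}, {B,C}, {B,D}, {B,E}, {C,D}. 3-itemset: {B,C,D}.
  28. Parameter definitions. Target[]: the shorter set. Set[]: the longer set. Result: the result of intersecting the two sets. Begin: comparison starts at Set[Begin]. End: comparison does not go past Set[End].
  29. Experimental algorithm (1): Target = B, C, E, F. Set = A, B, C, D, E, G, with Begin marking where comparison starts. Result = True, True, True, False.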
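The parameters defined above can be tied together in a short Python sketch. It assumes that `End` is inclusive (one reading of "does not go past Set[End]") and that `Result` holds one flag per element of `Target`, as in the example on the slide. The function name `intersect_range` is illustrative.

```python
def intersect_range(target, set_, begin, end):
    """Flag each element of target that occurs in set_[begin..end].

    begin and end are treated as inclusive positions in set_.
    """
    window = set_[begin:end + 1]
    return [t in window for t in target]
```

Calling `intersect_range(list("BCEF"), list("ABCDEG"), 0, 5)` gives the same flags as the slide: B, C, and E are found, F is not.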
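  30. Experimental algorithm (2), CPU (diagram): G, H, I, J compared against A, B, C, D, E, G, H by the CPU.
  31. Experimental algorithm (2), GPU without memory coalescing (diagram): G, H, I, J compared against A, B, C, D, E, G by Thread_1, Thread_2, and Thread_3.
  32. Experimental algorithm (3), GPU with memory coalescing (diagram): G, H, I, J compared against A, B, C, D, E, G by Thread_1, Thread_2, and Thread_3.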
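The diagrams for the two GPU variants did not survive in the transcript, so the following Python sketch rests on an assumption: in the non-coalesced version each thread scans its own contiguous chunk of `Set`, and in the coalesced version neighbouring threads read neighbouring elements at every step. Threads are simulated with a plain loop, and all function names are illustrative.

```python
def blocked_indices(n, num_threads, tid):
    """Contiguous chunk per thread: neighbouring threads read far apart."""
    chunk = (n + num_threads - 1) // num_threads
    return range(tid * chunk, min(n, (tid + 1) * chunk))


def interleaved_indices(n, num_threads, tid):
    """Stride of num_threads: neighbouring threads read adjacent elements."""
    return range(tid, n, num_threads)


def parallel_intersect(target, set_, num_threads, index_fn):
    """Simulate num_threads threads, each checking its share of set_."""
    result = [False] * len(target)
    for tid in range(num_threads):
        for i in index_fn(len(set_), num_threads, tid):
            for j, t in enumerate(target):
                if set_[i] == t:
                    result[j] = True
    return result
```

Both mappings produce the same flags; they differ only in which elements the threads touch at the same step. With 6 elements and 3 threads, thread 1 reads positions 2 and 3 in the blocked mapping but positions 1 and 4 in the interleaved one.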
  33. Experimental results: Platform; Data Size, Block & Thread; CPU versus GPU; Memory Coalescing Effect
  34. Platform (CPU): # of Cores: 4; # of Threads: 4; Clock Speed: 2 GHz; Memory Size: 6 GB; Memory Type: DDR3 800; # of Memory Channels: 3; Max Memory Bandwidth: 19.2 GB/s; Cache: 4 MB
  35. Platform (GPU): Number of GPUs: 1; Number of processor cores: 240; Clock Speed: 1300 MHz; Memory Size: 4 GB; Memory Type: GDDR3; Memory Clock: 800 MHz; Max Memory Bandwidth: 102.4 GB/s; Compute capability: 1.3
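The CPU bandwidth figure on the platform slide can be reproduced with a short calculation. The sketch below assumes that DDR3 800 means 800 million transfers per second and that each channel moves 8 bytes per transfer (a 64-bit channel); neither assumption is stated on the slides. The GPU figure is taken directly from the slide rather than derived, because the memory bus width is not given.

```python
def peak_bandwidth_gb_s(transfers_per_second, bytes_per_transfer, channels):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return transfers_per_second * bytes_per_transfer * channels / 1e9


# Assumed: 800e6 transfers/s, 8 bytes per transfer, 3 channels (slide value).
cpu_bw = peak_bandwidth_gb_s(800e6, 8, 3)

gpu_bw = 102.4              # GB/s, as listed on the GPU platform slide
ratio = gpu_bw / cpu_bw     # how much more peak bandwidth the GPU has
```

Under these assumptions the CPU value matches the 19.2 GB/s on the slide, and the GPU has a little over five times the peak memory bandwidth.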
  36. Data Size, Block & Thread: Data Size = 10, Block fixed; Data Size = 10K, Block fixed; Data Size = 10K, Thread fixed
  37. Data Size = 10
  38. Data Size = 100K, Block fixed
  39. Data Size = 100K, Thread fixed
  40. CPU versus GPU: Block = 1, Thread = 1; Block = 1, Thread = 10; Block = 10, Thread = 100; Block = 10, Thread = 512
  41. Block = 1, Thread = 1
  42. Block = 1, Thread = 10
  43. Block = 10, Thread = 100
  44. Block = 10, Thread = 512
  45. Memory Coalescing Effect: Block = 1, Thread = 10; Block = 1, Thread = 512; Block = 10, Thread = 512
  46. Block = 1, Thread = 10
  47. Block = 1, Thread = 512
  48. Block = 10, Thread = 512
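  49. Conclusion
  50. Conclusion: With the CUDA architecture, the search step of data mining takes only one third of the original CPU program's time when the data size is large. The CUDA program improved with Memory Coalescing performs about three times better than the original CUDA program, and nearly ten times better than the CPU program. Future goal: parallelize more algorithms on the CUDA architecture to optimize them further.
  51. Thank you for listening!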