Optimizing data mining process using graphic processors
Data Mining: an interdisciplinary field
“Extracting Knowledge from the Data”
Contributing fields: machine learning, database systems, pattern recognition, statistics, information science
CRISP-DM: CRoss Industry Standard Process for Data Mining
• Six phases
• Founded in 1996
• http://www.crisp-dm.org/
• Telecommunications
• Financial data analysis
• Retail industry
• Healthcare and biomedical research
• Web data mining
• Scalability
• Dimensionality
• Complex data
• Data quality
• Data ownership
Architecture difference between GPU and CPU
• More transistors for data processing
• Many-core (hundreds of cores)
General-purpose computation using the GPU in applications “other than 3D graphics”
• Flexible and programmable
• Fully supports vectorized floating-point operations at IEEE single precision
• Additional levels of programmability are emerging with every generation of GPU (about every 18 months)
• An attractive platform for general-purpose computation
Thread block: “a batch of threads that can cooperate together by efficiently sharing data through some fast shared memory and synchronizing their execution to coordinate memory accesses.”
Example of block ID: a block (x, y) of a grid of DIM(X, Y) has block ID (x + y · X)
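The block ID formula above is a row-major linearization of a 2D index. A minimal Python sketch of the arithmetic follows; the function name `block_id` and the 3 x 2 grid are illustrative assumptions, not part of the slides, and the code runs on the CPU only to show the numbering.

```python
def block_id(x: int, y: int, grid_dim_x: int) -> int:
    """Linear ID of block (x, y) in a grid that is grid_dim_x blocks wide."""
    return x + y * grid_dim_x


# Hypothetical grid of DIM(X, Y) = (3, 2), enumerated row by row.
GRID_X, GRID_Y = 3, 2
ids = [[block_id(x, y, GRID_X) for x in range(GRID_X)] for y in range(GRID_Y)]
print(ids)  # [[0, 1, 2], [3, 4, 5]]
```

Each block receives a unique ID in the range 0 to X·Y - 1, which is what lets a kernel map blocks onto disjoint slices of the input.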
• Data Mining on Cloud (Nov 22nd ‘10)
• GPU Miner: http://code.google.com/p/gpuminer/
• SVM for Estimation of Aqueous Solubility
An itemset is frequent if its support is not less than a threshold specified by the user.
Thresholds:
• Minimum Confidence (in %): the strength of the bond between the items of an itemset
• Minimum Support Count (a number): how many times an itemset occurs in the database
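The two thresholds can be made concrete with a small Python sketch. The toy transaction list and the function names are assumptions for illustration; confidence is computed with the standard rule form, support(A ∪ B) / support(A), which the slides do not spell out.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy database: each transaction is a set of items.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def support_count(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item of the itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def is_frequent(itemset: Iterable[str], transactions: List[FrozenSet[str]],
                min_support: int) -> bool:
    """Frequent means the support count is not less than the threshold."""
    return support_count(itemset, transactions) >= min_support


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Confidence of the rule antecedent -> consequent, in percent."""
    base = support_count(antecedent, transactions)
    if base == 0:
        return 0.0
    both = support_count(frozenset(antecedent) | frozenset(consequent), transactions)
    return 100.0 * both / base


print(support_count({"bread", "milk"}, TRANSACTIONS))   # 2
print(is_frequent({"bread", "milk"}, TRANSACTIONS, 2))  # True
print(round(confidence({"bread"}, {"milk"}, TRANSACTIONS), 1))  # 66.7
```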
“if an itemset is not frequent, any of its superset is never frequent”
• Proposed by Agrawal & Srikant @ VLDB’94
• An influential algorithm for mining frequent itemsets for association rules
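The pruning principle quoted above is what keeps the candidate set small: a (k+1)-itemset needs to be counted only if all of its k-item subsets are already known to be frequent. The following Python sketch shows one simple way to generate candidates under that rule; the function name and the sample itemsets are illustrative assumptions, and the join step here is a plain pairwise union rather than any particular optimized variant.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set


def generate_candidates(frequent_k: Iterable[FrozenSet[str]], k: int) -> Set[FrozenSet[str]]:
    """Build (k+1)-item candidates from frequent k-itemsets, with Apriori pruning."""
    frequent = set(frequent_k)
    candidates: Set[FrozenSet[str]] = set()
    for a in frequent:
        for b in frequent:
            union = a | b
            if len(union) != k + 1:
                continue  # the pair does not differ by exactly one item
            # Prune: every k-item subset must itself be frequent.
            if all(frozenset(s) in frequent for s in combinations(union, k)):
                candidates.add(union)
    return candidates


# Hypothetical frequent 2-itemsets.
frequent_2 = {frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("BD")}
print(generate_candidates(frequent_2, 2) == {frozenset("ABC")})  # True
```

Here {A, B, D} and {B, C, D} are discarded without touching the database, because {A, D} and {C, D} are not frequent.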
o We have presented a GPU-based implementation of the Apriori algorithm for frequent itemset mining.
o This implementation employs a bitmap data structure to encode the transaction database on the GPU and utilizes the GPU's SIMD parallelism for support counting.
o Our implementation stores the itemsets in a bitmap, and runs entirely on the GPU.
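To illustrate why a bitmap encoding suits support counting, here is a minimal CPU-side Python sketch. It assumes one bitmap per item with bit t set when transaction t contains that item; the slides do not specify the exact layout, so this layout, the function names, and the toy data are assumptions. Python integers stand in for the bit vectors that a GPU would process in parallel.

```python
from typing import Dict, Iterable, List, Set


def build_item_bitmaps(transactions: List[Set[str]]) -> Dict[str, int]:
    """One integer per item; bit t is set if transaction t contains the item."""
    bitmaps: Dict[str, int] = {}
    for tid, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def bitmap_support(itemset: Iterable[str], bitmaps: Dict[str, int],
                   n_transactions: int) -> int:
    """Support count = number of set bits in the AND of the item bitmaps."""
    acc = (1 << n_transactions) - 1  # start with every transaction selected
    for item in itemset:
        acc &= bitmaps.get(item, 0)
    return bin(acc).count("1")


# Hypothetical toy database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
bitmaps = build_item_bitmaps(transactions)
print(bitmap_support({"bread", "milk"}, bitmaps, len(transactions)))  # 2
```

The bitwise AND and the bit count are independent across machine words, which is the kind of uniform, data-parallel work that maps well onto SIMD hardware.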