Optimizing Data Mining Process Using
         Graphic Processors
Data Mining
An interdisciplinary field
• Machine learning
• Database systems
• Pattern recognition
• Statistics
• Information science
“Extracting Knowledge from the Data”
CRISP-DM
CRoss Industry Standard Process for Data Mining
Six phases
http://www.crisp-dm.org/ founded in 1996
• Telecommunications
• Financial data analysis
• Retail industry
• Healthcare and biomedical research
• Web data mining
• Scalability
• Dimensionality
• Complex data
• Data quality
• Data ownership
Architectural differences between GPU and CPU
• The GPU devotes more transistors to data processing
• The GPU is many-core (hundreds of cores)
General-purpose computation using the GPU in applications “other than 3D graphics”
• Flexible and programmable
• Fully supports vectorized floating-point operations at IEEE single precision
• Additional levels of programmability emerge with every generation of GPU (about every 18 months)
• An attractive platform for general-purpose computation
Thread block
“a batch of threads that can cooperate together by efficiently sharing data through some fast shared memory and synchronizing their execution to coordinate memory accesses.”
Example of block ID:
A block (x, y) of a grid of dimension (X, Y) has block ID (x + y·X)
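The block ID formula can be checked with a few lines of arithmetic. This is a minimal sketch in plain Python rather than GPU code; the function names `block_id` and `block_coords` are illustrative assumptions and do not come from the slides or from any GPU toolkit.

```python
def block_id(x: int, y: int, grid_dim_x: int) -> int:
    """Linear ID of block (x, y) in a grid that is grid_dim_x blocks wide."""
    return x + y * grid_dim_x


def block_coords(linear_id: int, grid_dim_x: int) -> tuple:
    """Inverse mapping: recover (x, y) from a linear block ID."""
    return linear_id % grid_dim_x, linear_id // grid_dim_x


if __name__ == "__main__":
    # Hypothetical grid of dimension (X, Y) = (4, 3).
    print(block_id(1, 2, 4))    # 1 + 2 * 4 = 9
    print(block_coords(9, 4))   # (1, 2)
```

Because y is multiplied by the grid width X, blocks are numbered row by row, and every block in the grid receives a distinct ID.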
Data Mining on Cloud (Nov 22nd ’10)
GPU Miner: http://code.google.com/p/gpuminer/
SVM for Estimation of Aqueous Solubility
An itemset is frequent if its support is not less than a threshold specified by the user.
Thresholds:
• Minimum confidence (in %): the strength of the association between the items of an itemset
• Minimum support count (a number): how many times an itemset occurs in the database
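The two thresholds can be made concrete with a small example. This is a minimal sketch under stated assumptions: the toy transactions and the function names are hypothetical, and confidence is computed in the usual association-rule sense, as the percentage of transactions containing the antecedent that also contain the consequent.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def is_frequent(itemset, transactions, min_support_count):
    """An itemset is frequent if its support is not less than the threshold."""
    return support_count(itemset, transactions) >= min_support_count


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent, in percent."""
    a = frozenset(antecedent)
    a_count = support_count(a, transactions)
    if a_count == 0:
        return 0.0
    both = a | frozenset(consequent)
    return 100.0 * support_count(both, transactions) / a_count


# Hypothetical toy database of four transactions.
TRANSACTIONS = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

if __name__ == "__main__":
    print(support_count({"bread", "milk"}, TRANSACTIONS))    # 2
    print(is_frequent({"bread", "milk"}, TRANSACTIONS, 2))   # True
    print(round(confidence({"bread"}, {"milk"}, TRANSACTIONS), 1))
```

With a minimum support count of 2, {bread, milk} is frequent, while {milk, butter} appears in only one transaction and is not.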
“if an itemset is not frequent, any of its superset is never frequent”
Proposed by Agrawal & Srikant @ VLDB’94
An influential algorithm for mining frequent itemsets for association rules.
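The pruning rule quoted above is what keeps level-wise mining tractable: a k-item candidate is only counted if all of its (k-1)-item subsets are already known to be frequent. The following is a minimal CPU sketch of that idea, not the implementation presented in these slides; the function name `apriori` and the toy baskets are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {frequent itemset: support count}, mined level by level."""
    transactions = [frozenset(t) for t in transactions]

    def count_support(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Level 1: every single item is a candidate.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        counts = count_support(candidates)
        level = {c for c, n in counts.items() if n >= min_support_count}
        frequent.update((c, counts[c]) for c in level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop any candidate with an infrequent (k-1)-subset.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical toy database.
BASKETS = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]

if __name__ == "__main__":
    for itemset, count in sorted(apriori(BASKETS, 3).items(),
                                 key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), count)
```

In this toy database every single item and every pair meets a minimum support count of 3, but {a, b, c} occurs only twice, so it is counted once at level 3 and rejected.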
Vertical data layout
Horizontal data layout
Bitmap representation
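The three layouts can be illustrated on a tiny database. This sketch assumes the common meaning of the terms: a horizontal layout lists the items of each transaction, a vertical layout lists the transaction IDs of each item, and a bitmap stores one bit per (item, transaction) pair. The function names and the sample data are hypothetical, and a Python integer stands in for each bitmap row.

```python
def to_vertical(horizontal):
    """Horizontal layout (transaction -> items) to vertical (item -> set of IDs)."""
    vertical = {}
    for tid, items in enumerate(horizontal):
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


def to_bitmap(horizontal):
    """One integer per item; bit t is set if transaction t contains the item."""
    bitmap = {}
    for tid, items in enumerate(horizontal):
        for item in items:
            bitmap[item] = bitmap.get(item, 0) | (1 << tid)
    return bitmap


# Hypothetical database: the list index is the transaction ID.
HORIZONTAL = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["c"]]

if __name__ == "__main__":
    print(to_vertical(HORIZONTAL))
    for item, bits in sorted(to_bitmap(HORIZONTAL).items()):
        # Leftmost character is transaction 3, rightmost is transaction 0.
        print(item, format(bits, "04b"))
```

The bitmap form has a fixed width per item regardless of how many transactions contain it, which suits regular, data-parallel processing.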
Agrawal & Srikant @ VLDB’94
• We have presented a GPU-based implementation of the Apriori algorithm for frequent itemset mining.
• This implementation employs a bitmap data structure to encode the transaction database on the GPU and utilizes the GPU's SIMD parallelism for support counting.
• Our implementation stores the itemsets in a bitmap, and runs entirely on the GPU.
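Support counting over a bitmap reduces to a bitwise AND followed by a population count. The sketch below shows that operation on the CPU in plain Python, as a stand-in for the GPU version described above; it assumes one bitmap row per item with one bit per transaction, and the names `ITEM_BITMAPS` and `support_from_bitmaps` are illustrative, not taken from the authors' code.

```python
from functools import reduce

# Hypothetical bitmap rows: bit t is set if transaction t contains the item.
ITEM_BITMAPS = {
    "a": 0b0101,  # transactions 0 and 2
    "b": 0b0111,  # transactions 0, 1 and 2
    "c": 0b1110,  # transactions 1, 2 and 3
}


def support_from_bitmaps(itemset, bitmaps):
    """Support count of a non-empty itemset: AND the item rows, count set bits."""
    rows = [bitmaps.get(item, 0) for item in itemset]
    combined = reduce(lambda x, y: x & y, rows)
    return bin(combined).count("1")


if __name__ == "__main__":
    print(support_from_bitmaps(["a", "b"], ITEM_BITMAPS))       # 2
    print(support_from_bitmaps(["a", "b", "c"], ITEM_BITMAPS))  # 1
```

Each word of the AND is independent of the others, so on a GPU the rows could plausibly be split into machine words and processed by many threads at once, with the per-word bit counts summed at the end.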