SlideShare a Scribd company logo
1 of 42
Case study: d60 Raptor
    smartAdvisor


                           Jan Neerbek
                     Alexandra Institute
Agenda

·   d60: A cloud/data mining case
·   Cloud
·   Data Mining
·   Market Basket Analysis
·   Large data sets
·   Our solution




2
Alexandra Institute


    The Alexandra Institute is a non-profit
    company that works with application-
    oriented IT research.

    Focus is pervasive computing, and we
    activate the business potential of our
    members and customers through research-
    based userdriven innovation.


3
The case: d60


·   Danish company
·   A similar products recommendation engine
·   d60 was outgrowing their servers (late 2010)
·   They saw a potential in moving to Azure




4
The setup



                           Product
                Internet   Recommendations
Webshops




           Log shopping
           patterns

                                 Do data mining




 5
The cloud potential


· Elasticity
· No upfront server cost
· Cheaper licenses
· Faster calculations



6
Challenges


· No SQL Server Analysis Services (SSAS)
· Small compute nodes
· Partioned database (50GB)
· SQL server ingress/outgress access is
    slow




7
The cloud



                          Node           Node
                                  Node

            Node




                   Node
     Node
                                 Node




8
The cloud and services



                          Node            Node
                                  Node

            Node
                                         Data layer
                                          service



                   Node                   Messaging
     Node                                  Service
                                 Node




9
Data layer service

                                            Data layer
·    Application specific (schema/layout)    service

·    SQL, table or other
·    Easy a bottleneck
·    Can be difficult to scale




10
Messaging service
Task Queues

·    Standard data structure          Messaging
                                       Service
·    Build-in ordering (FIFO)
·    Can be scaled
·    Good for asynchronous messages




11
12
Data mining


Data mining is the use of automated data analysis
 techniques to uncover relationships among data
 items


Market basket analysis is a data mining
 technique that discovers co-occurrence
 relationships among activities performed by
 specific individuals


     [about.com/wikipedia.org]
13
Market basket analysis


     Customer1    Customer2    Customer3    Customer4
Avocado          Milk         Beef         Cereal
Milk             Diapers      Lemons       Beer
Butter           Avocado      Beer         Beef
Potatoes         Beer         Chips        Diapers




14
Market basket analysis


      Customer1    Customer2    Customer3    Customer4
 Avocado          Milk         Beef         Cereal
 Milk             Diapers      Lemons       Beer
 Butter           Avocado      Beer         Beef
 Potatoes         Beer         Chips        Diapers



Itemset (Diapers, Beer) occur 50%

Frequency threshold parameter
Find as many frequent itemsets as possible
 15
Market basket analysis


Popular effective algorithm: FP-growth 
Based on data structure FP-tree
Requires all data in near-memory 
Most research in distributed models has been for
  cluster setups 




16
Building the FP-tree
(extends the prefix-tree structure)

                                       Customer1
           Avocado
                                      Avocado
                                      Milk
                                      Butter
          Butter
                                      Potatoes



         Milk




     Potatoes



17
Building the FP-tree

                        Customer2
           Avocado
                       Milk
                       Diapers
                       Avocado
          Butter
                       Beer



         Milk




     Potatoes



18
Building the FP-tree

                                Customer2
           Avocado
                               Milk
                               Diapers
                               Avocado
          Butter     Beer
                               Beer



         Milk        Diapers




     Potatoes           Milk



19
Building the FP-tree

                                Customer2
           Avocado
                               Milk
                               Diapers
                               Avocado
          Butter     Beer
                               Beer



         Milk        Diapers




     Potatoes           Milk



20
Building the FP-tree


           Avocado             Beef




          Butter     Beer        Beer




         Milk        Diapers          Chips   Cereal




     Potatoes           Milk          Lemon   Diapers



21
FP-growth

Grows the frequent itemsets, recusively

FP-growth(FP-tree tree)
{
     …
     for-each (item in tree)
          count =CountOccur(tree,item);
          if (IsFrequent(count))
          {
               OutputSet(item);
               sub = tree.GetTree(tree, item);
               FP-growth(sub);
          }

22
FP-growth algorithm
Divide and Conquer

Traverse tree
      Avocado             Beef



     Butter     Beer         Beer



     Milk       Diapers     Chips    Cereal



Potatoes           Milk      Lemon    Diapers




23
FP-growth algorithm
Divide and Conquer

Generate sub-trees
      Avocado             Beef



     Butter     Beer         Beer



     Milk       Diapers     Chips    Cereal



Potatoes           Milk      Lemon    Diapers




24
FP-growth algorithm
Divide and Conquer

Call recursively
      Avocado             Beef



     Butter     Beer         Beer                Avocado



     Milk       Diapers     Chips    Cereal     Butter     Beer



                                                           Diapers
Potatoes           Milk      Lemon    Diapers




25
FP-growth algorithm
Memory usage

The FP-tree does not fit in local memory; what to
  do?
· Emulate Distributed Shared Memory




26
Distributed Shared Memory?


          CPU       CPU           CPU      CPU      CPU


         Memory    Memory       Memory    Memory   Memory


                            Network


                          Shared Memory



·    To add nodes is to add memory
·    Works best in tightly coubled setups, with low-lantency,
     high-speed networks
27
FP-growth algorithm
Memory usage

The FP-tree does not fit in local memory; what to
  do?
· Emulate Distributed Shared Memory
· Optimize your data structures
· Buy more RAM
· Get a good idea



28
Get a good idea


·    Database scans are serial and can be
     distributed
·    The list of items used in the recursive calls
     uniquely determines what part of data we are
     looking at




29
Get a good idea



      Avocado             Beef



     Butter     Beer         Beer                Avocado



     Milk       Diapers     Chips    Cereal     Butter     Beer



                                                           Diapers
Potatoes           Milk      Lemon    Diapers




30
Get a good idea


                                    Avocado



           Avocado                   Butter, Milk

          Butter          Beer



                          Diapers



                   Milk
                                      Avocado



                                                Beer




                                    Diapers,Milk
     These are postfix paths
31
32
Buckets


·    Use postfix paths for messaging
·    Working with buckets


                         Transactions




                 Items




33
FP-growth revisited
                                      Replaced with
 FP-growth(FP-tree tree)                 postfix

 {
          …
Done in parallel
          for-each (item in tree)
                                                  Done in parallel
                 count =CountOccur(tree,item);
                 if (IsFrequent(count))
                 {
                      OutputSet(item);
   Done in parallel   sub = tree.GetTree(tree, item);
                      FP-growth(sub);
                 }



 34
Communication




         Node                Node




                Data layer




         Node                Node




35
Revised Communication




         Node           Node




                MQ
                               Data layer



         Node           Node




36
Running FP-growth


                    Distribute buckets




                    Count items
                    (with postfix size=n)



                    Collect counts
                    (per postfix)
                    Call recursive


                    Standard FP-growth

37
Running FP-growth


                    Distribute buckets




                    Count items
                    (with postfix size=n)



                    Collect counts
                    (per postfix)
                    Call recursive


                    Standard FP-growth

38
Collecting what we have learned


·    Message-driven work, using message-queue
·    Peer-to-peer for intermediate results
·    Distribute data for scalability (buckets)
·    Small messages (list of items)
·    Allow us to distribute FP-growth




39
Advantages


·    Configurable work sizes
·    Good distribution of work
·    Robust against computer failure
·    Fast!




40
So what about performance?
     04:30:00


     04:00:00


     03:30:00


     03:00:00


     02:30:00                   Message-driven FP-growth

                                FP-growth
     02:00:00

                                Total node time
     01:30:00


     01:00:00


     00:30:00


     00:00:00
                1   2   4   8




41
Thank you!




42

More Related Content

Viewers also liked

2010 Sorø "Velkommen i evv’s univers"
2010 Sorø "Velkommen i evv’s univers"2010 Sorø "Velkommen i evv’s univers"
2010 Sorø "Velkommen i evv’s univers"Alexandra Instituttet
 
Innovation & it i turisterhvervet "Præsentation Alexandra Instituttet"
Innovation & it i turisterhvervet "Præsentation Alexandra Instituttet"Innovation & it i turisterhvervet "Præsentation Alexandra Instituttet"
Innovation & it i turisterhvervet "Præsentation Alexandra Instituttet"Alexandra Instituttet
 
2010 Sorø "Fra spæd idé til Discovery Channel"
2010 Sorø "Fra spæd idé til Discovery Channel"2010 Sorø "Fra spæd idé til Discovery Channel"
2010 Sorø "Fra spæd idé til Discovery Channel"Alexandra Instituttet
 
2010 Sorø "Kontekst-sensitiv teknologi – GPS er kun begyndelsen"
2010 Sorø "Kontekst-sensitiv teknologi – GPS er kun begyndelsen"2010 Sorø "Kontekst-sensitiv teknologi – GPS er kun begyndelsen"
2010 Sorø "Kontekst-sensitiv teknologi – GPS er kun begyndelsen"Alexandra Instituttet
 
2010 Sorø "Internet of Things, Cloud Computing & Sikkerhed"
2010 Sorø "Internet of Things, Cloud Computing & Sikkerhed"2010 Sorø "Internet of Things, Cloud Computing & Sikkerhed"
2010 Sorø "Internet of Things, Cloud Computing & Sikkerhed"Alexandra Instituttet
 
øLmave eller bare insulinresistent
øLmave   eller bare insulinresistentøLmave   eller bare insulinresistent
øLmave eller bare insulinresistentEdb Huset a/s
 
2010 Sorø "Det der med innovation i danske virksomheder"
2010 Sorø "Det der med innovation i danske virksomheder"2010 Sorø "Det der med innovation i danske virksomheder"
2010 Sorø "Det der med innovation i danske virksomheder"Alexandra Instituttet
 
Kontekst-sensitiv teknologi – GPS er kun begyndelsen
Kontekst-sensitiv teknologi – GPS er kun begyndelsenKontekst-sensitiv teknologi – GPS er kun begyndelsen
Kontekst-sensitiv teknologi – GPS er kun begyndelsenAlexandra Instituttet
 
Customer Experience Maturity Assessment
Customer Experience Maturity AssessmentCustomer Experience Maturity Assessment
Customer Experience Maturity AssessmentMac Wheeler
 
Gartner for Product Management and Marketing Clients
Gartner for Product Management and Marketing ClientsGartner for Product Management and Marketing Clients
Gartner for Product Management and Marketing Clientsfranzel77
 
Product and UX - are the roles blurring?
Product and UX - are the roles blurring?Product and UX - are the roles blurring?
Product and UX - are the roles blurring?Jesse Gant
 
Top 10 utilities interview questions with answers
Top 10 utilities interview questions with answersTop 10 utilities interview questions with answers
Top 10 utilities interview questions with answerslibbygray000
 
Defining product marketing
Defining product marketingDefining product marketing
Defining product marketingGerardo A Dada
 

Viewers also liked (16)

2010 Sorø "Velkommen i evv’s univers"
2010 Sorø "Velkommen i evv’s univers"2010 Sorø "Velkommen i evv’s univers"
2010 Sorø "Velkommen i evv’s univers"
 
Innovation & it i turisterhvervet "Præsentation Alexandra Instituttet"
Innovation & it i turisterhvervet "Præsentation Alexandra Instituttet"Innovation & it i turisterhvervet "Præsentation Alexandra Instituttet"
Innovation & it i turisterhvervet "Præsentation Alexandra Instituttet"
 
2010 Sorø "Fra spæd idé til Discovery Channel"
2010 Sorø "Fra spæd idé til Discovery Channel"2010 Sorø "Fra spæd idé til Discovery Channel"
2010 Sorø "Fra spæd idé til Discovery Channel"
 
2010 Sorø "Kontekst-sensitiv teknologi – GPS er kun begyndelsen"
2010 Sorø "Kontekst-sensitiv teknologi – GPS er kun begyndelsen"2010 Sorø "Kontekst-sensitiv teknologi – GPS er kun begyndelsen"
2010 Sorø "Kontekst-sensitiv teknologi – GPS er kun begyndelsen"
 
Sund Innovation i Randers
Sund Innovation i RandersSund Innovation i Randers
Sund Innovation i Randers
 
2010 Sorø "Internet of Things, Cloud Computing & Sikkerhed"
2010 Sorø "Internet of Things, Cloud Computing & Sikkerhed"2010 Sorø "Internet of Things, Cloud Computing & Sikkerhed"
2010 Sorø "Internet of Things, Cloud Computing & Sikkerhed"
 
øLmave eller bare insulinresistent
øLmave   eller bare insulinresistentøLmave   eller bare insulinresistent
øLmave eller bare insulinresistent
 
2010 Sorø "Det der med innovation i danske virksomheder"
2010 Sorø "Det der med innovation i danske virksomheder"2010 Sorø "Det der med innovation i danske virksomheder"
2010 Sorø "Det der med innovation i danske virksomheder"
 
Forretningsudvikling og Innovation
Forretningsudvikling og InnovationForretningsudvikling og Innovation
Forretningsudvikling og Innovation
 
Kontekst-sensitiv teknologi – GPS er kun begyndelsen
Kontekst-sensitiv teknologi – GPS er kun begyndelsenKontekst-sensitiv teknologi – GPS er kun begyndelsen
Kontekst-sensitiv teknologi – GPS er kun begyndelsen
 
Customer Experience Maturity Assessment
Customer Experience Maturity AssessmentCustomer Experience Maturity Assessment
Customer Experience Maturity Assessment
 
Gartner for Product Management and Marketing Clients
Gartner for Product Management and Marketing ClientsGartner for Product Management and Marketing Clients
Gartner for Product Management and Marketing Clients
 
Product and UX - are the roles blurring?
Product and UX - are the roles blurring?Product and UX - are the roles blurring?
Product and UX - are the roles blurring?
 
Top 10 utilities interview questions with answers
Top 10 utilities interview questions with answersTop 10 utilities interview questions with answers
Top 10 utilities interview questions with answers
 
Defining product marketing
Defining product marketingDefining product marketing
Defining product marketing
 
Strategic Role - Product Management
Strategic Role - Product ManagementStrategic Role - Product Management
Strategic Role - Product Management
 

Similar to Apriori data mining in the cloud

Data Mining Association Analysis Basic Concepts a
Data Mining Association Analysis Basic Concepts aData Mining Association Analysis Basic Concepts a
Data Mining Association Analysis Basic Concepts aOllieShoresna
 
Avatara: OLAP for Web-scale Analytics Products
Avatara: OLAP for Web-scale Analytics Products Avatara: OLAP for Web-scale Analytics Products
Avatara: OLAP for Web-scale Analytics Products Lili Wu
 
Jazoon 2011 - Smart EAI with Apache Camel
Jazoon 2011 - Smart EAI with Apache CamelJazoon 2011 - Smart EAI with Apache Camel
Jazoon 2011 - Smart EAI with Apache CamelKai Wähner
 
Lecture 04 Association Rules Basics
Lecture 04 Association Rules BasicsLecture 04 Association Rules Basics
Lecture 04 Association Rules BasicsPier Luca Lanzi
 
All Aboard the Databus
All Aboard the DatabusAll Aboard the Databus
All Aboard the DatabusAmy W. Tang
 
Rules of data mining
Rules of data miningRules of data mining
Rules of data miningSulman Ahmed
 
Datacamp @ Bar Camp Bratislava
Datacamp @ Bar Camp BratislavaDatacamp @ Bar Camp Bratislava
Datacamp @ Bar Camp BratislavaKnowerce
 
DMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association RulesDMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association RulesPier Luca Lanzi
 
Eclat algorithm in association rule mining
Eclat algorithm in association rule miningEclat algorithm in association rule mining
Eclat algorithm in association rule miningDeepa Jeya
 
DM -Unit 2-PPT.ppt
DM -Unit 2-PPT.pptDM -Unit 2-PPT.ppt
DM -Unit 2-PPT.pptraju980973
 
Google refine tutotial
Google refine tutotialGoogle refine tutotial
Google refine tutotialVijaya Prabhu
 
Google refine tutotial
Google refine tutotialGoogle refine tutotial
Google refine tutotialVijaya Prabhu
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesPier Luca Lanzi
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8dallemang
 

Similar to Apriori data mining in the cloud (14)

Data Mining Association Analysis Basic Concepts a
Data Mining Association Analysis Basic Concepts aData Mining Association Analysis Basic Concepts a
Data Mining Association Analysis Basic Concepts a
 
Avatara: OLAP for Web-scale Analytics Products
Avatara: OLAP for Web-scale Analytics Products Avatara: OLAP for Web-scale Analytics Products
Avatara: OLAP for Web-scale Analytics Products
 
Jazoon 2011 - Smart EAI with Apache Camel
Jazoon 2011 - Smart EAI with Apache CamelJazoon 2011 - Smart EAI with Apache Camel
Jazoon 2011 - Smart EAI with Apache Camel
 
Lecture 04 Association Rules Basics
Lecture 04 Association Rules BasicsLecture 04 Association Rules Basics
Lecture 04 Association Rules Basics
 
All Aboard the Databus
All Aboard the DatabusAll Aboard the Databus
All Aboard the Databus
 
Rules of data mining
Rules of data miningRules of data mining
Rules of data mining
 
Datacamp @ Bar Camp Bratislava
Datacamp @ Bar Camp BratislavaDatacamp @ Bar Camp Bratislava
Datacamp @ Bar Camp Bratislava
 
DMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association RulesDMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association Rules
 
Eclat algorithm in association rule mining
Eclat algorithm in association rule miningEclat algorithm in association rule mining
Eclat algorithm in association rule mining
 
DM -Unit 2-PPT.ppt
DM -Unit 2-PPT.pptDM -Unit 2-PPT.ppt
DM -Unit 2-PPT.ppt
 
Google refine tutotial
Google refine tutotialGoogle refine tutotial
Google refine tutotial
 
Google refine tutotial
Google refine tutotialGoogle refine tutotial
Google refine tutotial
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rules
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8
 

More from Alexandra Instituttet

2010 Sorø "Erhvervsudvikling og Alexandra Instituttet i Region Sjælland"
2010 Sorø "Erhvervsudvikling og Alexandra Instituttet i Region Sjælland"2010 Sorø "Erhvervsudvikling og Alexandra Instituttet i Region Sjælland"
2010 Sorø "Erhvervsudvikling og Alexandra Instituttet i Region Sjælland"Alexandra Instituttet
 
2010 Sorø "Innovative alliancer flytter grænser – skal din virksomhed være med?"
2010 Sorø "Innovative alliancer flytter grænser – skal din virksomhed være med?"2010 Sorø "Innovative alliancer flytter grænser – skal din virksomhed være med?"
2010 Sorø "Innovative alliancer flytter grænser – skal din virksomhed være med?"Alexandra Instituttet
 
Midt- og vestjysk it-satsning – oplæg til vendepunkt
Midt- og vestjysk it-satsning – oplæg til vendepunktMidt- og vestjysk it-satsning – oplæg til vendepunkt
Midt- og vestjysk it-satsning – oplæg til vendepunktAlexandra Instituttet
 
Perspektiverne i den lokale erhvervsudvikling
Perspektiverne i den lokale erhvervsudvikling Perspektiverne i den lokale erhvervsudvikling
Perspektiverne i den lokale erhvervsudvikling Alexandra Instituttet
 
Alexandra Instituttet som samarbejdspartner i udviklingsprojekter
Alexandra Instituttet som samarbejdspartner i udviklingsprojekterAlexandra Instituttet som samarbejdspartner i udviklingsprojekter
Alexandra Instituttet som samarbejdspartner i udviklingsprojekterAlexandra Instituttet
 
Find nye forretningsmuligheder med IT-I-ALTING
Find nye forretningsmuligheder med IT-I-ALTINGFind nye forretningsmuligheder med IT-I-ALTING
Find nye forretningsmuligheder med IT-I-ALTINGAlexandra Instituttet
 
Sund Innovation i Randers Sundhedscenter
Sund Innovation i Randers SundhedscenterSund Innovation i Randers Sundhedscenter
Sund Innovation i Randers SundhedscenterAlexandra Instituttet
 

More from Alexandra Instituttet (11)

2010 Sorø "Ekko af byen"
2010 Sorø "Ekko af byen"2010 Sorø "Ekko af byen"
2010 Sorø "Ekko af byen"
 
2010 Sorø "Erhvervsudvikling og Alexandra Instituttet i Region Sjælland"
2010 Sorø "Erhvervsudvikling og Alexandra Instituttet i Region Sjælland"2010 Sorø "Erhvervsudvikling og Alexandra Instituttet i Region Sjælland"
2010 Sorø "Erhvervsudvikling og Alexandra Instituttet i Region Sjælland"
 
2010 Sorø "Åbning i Sorø"
2010 Sorø "Åbning i Sorø"2010 Sorø "Åbning i Sorø"
2010 Sorø "Åbning i Sorø"
 
2010 Sorø "Innovative alliancer flytter grænser – skal din virksomhed være med?"
2010 Sorø "Innovative alliancer flytter grænser – skal din virksomhed være med?"2010 Sorø "Innovative alliancer flytter grænser – skal din virksomhed være med?"
2010 Sorø "Innovative alliancer flytter grænser – skal din virksomhed være med?"
 
Midt- og vestjysk it-satsning – oplæg til vendepunkt
Midt- og vestjysk it-satsning – oplæg til vendepunktMidt- og vestjysk it-satsning – oplæg til vendepunkt
Midt- og vestjysk it-satsning – oplæg til vendepunkt
 
Perspektiverne i den lokale erhvervsudvikling
Perspektiverne i den lokale erhvervsudvikling Perspektiverne i den lokale erhvervsudvikling
Perspektiverne i den lokale erhvervsudvikling
 
Alexandra Instituttet som samarbejdspartner i udviklingsprojekter
Alexandra Instituttet som samarbejdspartner i udviklingsprojekterAlexandra Instituttet som samarbejdspartner i udviklingsprojekter
Alexandra Instituttet som samarbejdspartner i udviklingsprojekter
 
Massive Data
Massive DataMassive Data
Massive Data
 
Internet of Things
Internet of ThingsInternet of Things
Internet of Things
 
Find nye forretningsmuligheder med IT-I-ALTING
Find nye forretningsmuligheder med IT-I-ALTINGFind nye forretningsmuligheder med IT-I-ALTING
Find nye forretningsmuligheder med IT-I-ALTING
 
Sund Innovation i Randers Sundhedscenter
Sund Innovation i Randers SundhedscenterSund Innovation i Randers Sundhedscenter
Sund Innovation i Randers Sundhedscenter
 

Recently uploaded

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 

Recently uploaded (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 

Apriori data mining in the cloud

  • 1. Case study: d60 Raptor smartAdvisor Jan Neerbek Alexandra Institute
  • 2. Agenda · d60: A cloud/data mining case · Cloud · Data Mining · Market Basket Analysis · Large data sets · Our solution 2
  • 3. Alexandra Institute The Alexandra Institute is a non-profit company that works with application- oriented IT research. Focus is pervasive computing, and we activate the business potential of our members and customers through research- based userdriven innovation. 3
  • 4. The case: d60 · Danish company · A similar products recommendation engine · d60 was outgrowing their servers (late 2010) · They saw a potential in moving to Azure 4
  • 5. The setup Product Internet Recommendations Webshops Log shopping patterns Do data mining 5
  • 6. The cloud potential · Elasticity · No upfront server cost · Cheaper licenses · Faster calculations 6
  • 7. Challenges · No SQL Server Analysis Services (SSAS) · Small compute nodes · Partioned database (50GB) · SQL server ingress/outgress access is slow 7
  • 8. The cloud Node Node Node Node Node Node Node 8
  • 9. The cloud and services Node Node Node Node Data layer service Node Messaging Node Service Node 9
  • 10. Data layer service Data layer · Application specific (schema/layout) service · SQL, table or other · Easy a bottleneck · Can be difficult to scale 10
  • 11. Messaging service Task Queues · Standard data structure Messaging Service · Build-in ordering (FIFO) · Can be scaled · Good for asynchronous messages 11
  • 12. 12
  • 13. Data mining Data mining is the use of automated data analysis techniques to uncover relationships among data items Market basket analysis is a data mining technique that discovers co-occurrence relationships among activities performed by specific individuals [about.com/wikipedia.org] 13
  • 14. Market basket analysis Customer1 Customer2 Customer3 Customer4 Avocado Milk Beef Cereal Milk Diapers Lemons Beer Butter Avocado Beer Beef Potatoes Beer Chips Diapers 14
  • 15. Market basket analysis Customer1 Customer2 Customer3 Customer4 Avocado Milk Beef Cereal Milk Diapers Lemons Beer Butter Avocado Beer Beef Potatoes Beer Chips Diapers Itemset (Diapers, Beer) occur 50% Frequency threshold parameter Find as many frequent itemsets as possible 15
  • 16. Market basket analysis Popular effective algorithm: FP-growth  Based on data structure FP-tree Requires all data in near-memory  Most research in distributed models has been for cluster setups  16
  • 17. Building the FP-tree (extends the prefix-tree structure) Customer1 Avocado Avocado Milk Butter Butter Potatoes Milk Potatoes 17
  • 18. Building the FP-tree Customer2 Avocado Milk Diapers Avocado Butter Beer Milk Potatoes 18
  • 19. Building the FP-tree Customer2 Avocado Milk Diapers Avocado Butter Beer Beer Milk Diapers Potatoes Milk 19
  • 20. Building the FP-tree Customer2 Avocado Milk Diapers Avocado Butter Beer Beer Milk Diapers Potatoes Milk 20
  • 21. Building the FP-tree Avocado Beef Butter Beer Beer Milk Diapers Chips Cereal Potatoes Milk Lemon Diapers 21
  • 22. FP-growth Grows the frequent itemsets, recusively FP-growth(FP-tree tree) { … for-each (item in tree) count =CountOccur(tree,item); if (IsFrequent(count)) { OutputSet(item); sub = tree.GetTree(tree, item); FP-growth(sub); } 22
  • 23. FP-growth algorithm Divide and Conquer Traverse tree Avocado Beef Butter Beer Beer Milk Diapers Chips Cereal Potatoes Milk Lemon Diapers 23
  • 24. FP-growth algorithm Divide and Conquer Generate sub-trees Avocado Beef Butter Beer Beer Milk Diapers Chips Cereal Potatoes Milk Lemon Diapers 24
  • 25. FP-growth algorithm Divide and Conquer Call recursively Avocado Beef Butter Beer Beer Avocado Milk Diapers Chips Cereal Butter Beer Diapers Potatoes Milk Lemon Diapers 25
  • 26. FP-growth algorithm Memory usage The FP-tree does not fit in local memory; what to do? · Emulate Distributed Shared Memory 26
  • 27. Distributed Shared Memory? CPU CPU CPU CPU CPU Memory Memory Memory Memory Memory Network Shared Memory · To add nodes is to add memory · Works best in tightly coubled setups, with low-lantency, high-speed networks 27
  • 28. FP-growth algorithm Memory usage The FP-tree does not fit in local memory; what to do? · Emulate Distributed Shared Memory · Optimize your data structures · Buy more RAM · Get a good idea 28
  • 29. Get a good idea · Database scans are serial and can be distributed · The list of items used in the recursive calls uniquely determines what part of data we are looking at 29
  • 30. Get a good idea Avocado Beef Butter Beer Beer Avocado Milk Diapers Chips Cereal Butter Beer Diapers Potatoes Milk Lemon Diapers 30
  • 31. Get a good idea Avocado Avocado Butter, Milk Butter Beer Diapers Milk Avocado Beer Diapers,Milk These are postfix paths 31
  • 32. 32
  • 33. Buckets · Use postfix paths for messaging · Working with buckets Transactions Items 33
  • 34. FP-growth revisited Replaced with FP-growth(FP-tree tree) postfix { … Done in parallel for-each (item in tree) Done in parallel count =CountOccur(tree,item); if (IsFrequent(count)) { OutputSet(item); Done in parallel sub = tree.GetTree(tree, item); FP-growth(sub); } 34
  • 35. Communication Node Node Data layer Node Node 35
  • 36. Revised Communication Node Node MQ Data layer Node Node 36
  • 37. Running FP-growth Distribute buckets Count items (with postfix size=n) Collect counts (per postfix) Call recursive Standard FP-growth 37
  • 38. Running FP-growth Distribute buckets Count items (with postfix size=n) Collect counts (per postfix) Call recursive Standard FP-growth 38
  • 39. Collecting what we have learned · Message-driven work, using message-queue · Peer-to-peer for intermediate results · Distribute data for scalability (buckets) · Small messages (list of items) · Allow us to distribute FP-growth 39
  • 40. Advantages · Configurable work sizes · Good distribution of work · Robust against computer failure · Fast! 40
  • 41. So what about performance? 04:30:00 04:00:00 03:30:00 03:00:00 02:30:00 Message-driven FP-growth FP-growth 02:00:00 Total node time 01:30:00 01:00:00 00:30:00 00:00:00 1 2 4 8 41

Editor's Notes

  1. Weare a Tech transfer company. Webuildbrigdesbetweenuniversities (and other research institutes) and companiesFocuspervasisecomputing. For example mobile, cloud, data treatment
  2. Product: Raptor smart advisorHeavy data mining solutionproductrecomendation (brought, browsing, otherusers), association data miningMulti-passRessource intensiveLast year (2010) current server is becoming to smallNeed to upgrade -> biginvestment (hw, licenses)Looked to the cloud, Azure (utility model, usagepricing)What potentials did theysee?
  3. TheiroldsetupWe log patterns and build a model.Weuse the currentuserspattern to queryagainst the model (historic data)
  4. The reasons d60 looked to the cloudCheaperlicenses,e.g. continuespayment But typicallyyougetupgrades for freeFaster calculation ->currently (last year) batch processing, and the batch processingshouldbe done within 12 hoursNowtheycan do somethingtheycouldn’tbeforeNear-real time responses – still workingon-real time events, trends etc-huge potential
  5. D60 wanted to continue to use SQL Server50 GB in corboratesetting is not muchDuringprojectwerealised:Contacting the sqlserver from outside is slowish. By 10-20% (sometimes more!) (compared to a onpremisenetworkedsetup)
  6. This isloose talk. Weneed to establish basis. The cloud is a bounch of looselyconnected nodes. For it to be cloud you have to have the ability to scale up and downondemand - elasticityNodes aretypically virtual images of a small(ish) computerNodes interconnected via LAN (ifwearelucky), but mightbepositioned in geographicaldifferent locationsWeexpectbetter respons times thanifthiswas over the internet. Howeverwewillexperiencelower respons times thanifwe had a dedicatedsetuponpremises.But it’scheapIt is a distributedsetup, where hardware and OS is typicallyvendorcontrolled. As in otherdistributedsetups – plan on nodes beingunavailable.
  7. QueueAzure has a messagequeueGoogle has map-reduceframe-workorAppEngineTaskqueueAmazon has simple queue service (SQS)ScalableimplementationsexistsGlobal dbSimilary all providers has a global dbThereare a number of other cloud services, but theseare the onesthat matter to us
  8. Manyorderes, alsounordered, we just need FIFOWeended up buildingourown, because ofsome initial bad experienceswith the azureone (slowresponsesetc)Maybe not so negative
  9. This is prettyvague. Nowweconsider a more concreteexample
  10. Marketbasket is the historicalexample of associasionmining.Diapers and beerImagien 4 customers and their shopping baskets (carts)A basket is called a transactionAn item in a basket is called an ”item”Wewill talk about sets of items ”itemsets” or sets of items
  11. School book examplesGive story of whyDiapers and BeerareconnectedDon’t have time to go outDiscussvalues for frequencythresholdF-itemsetsGoal of Shopping basket analysis is to find as manyfrequent itemsets as possibleGenerateConditionalProbabilistrulesHard problemInternet and everyclick – lot of dataEach item witheach item – exponential
  12. FP-growth (from 2000)Clustersetups(allowing for fast information exchangebetween nodes)FP-growth: recursivealgorithm, each step takes a FP-treewithconditions, a conditionalFP-treeGenerallyperformsquitewellTreesize is comparable to data set size (huge)Nearmemory: fast memory (ram vsharddrivevsnetworkstorage), knownthat page faultdestroyeffecientcy of algortihmOptions for caching, but cloud is big problemLot of current researchCluster (paraellel) setupvsdistributedsetup. Shared RAM allow for really fast info transferReasearch: transfer of subtreesWiderapplication. For metypical type of problem. Centralizedalgorithm, want to distributewant to do
  13. Example of tree from examplebeforeHere customer1Note alphanummericalsorting, not kosher (weusefrequencycount)A prefixtreeortrieWill not talk about the lookup pointer structuresFP-tree is abouttwothings:CompressingFast data lookup
  14. Customer 2 from before
  15. Compression, but not 100% (Milk) (continueonnext slide)
  16. Note Milk node is in theretwice. Low-complexitycompressionNot intelligent compressionBut weneed to consider the orderingFor live data set typicallyyouareable to shrink an order of magnitude, due to the orderingEach node have weight
  17. All fourcustomersaddedActually I am not showning the root (null-node)Compression 13/16 approx 18% compressionLets look at the FP-growthalgorithm
  18. Thiswewant to distributeLets look at the tree
  19. Find frequent items (in tree)
  20. Wecount the occurences of ”Milk”, if support count is highengoughwegenerate the sub-tree. Otherwise just forget it.Milk occurs in two branches
  21. Note: all edges has a weight (as memtionedbefore)Weused the parts of the FP-tree not shownhere to loop/build the treeefficiently
  22. Youcan of course havemix of memorysetup types
  23. We of coursegot a goodidea
  24. That database scansareserial is crusial – wecould have neededrandomlookup. Thismeansthatdistributing the database is in-expensiveSecondbulletimplythatwe do not need the tree to do the recursive (distributed) calls. Allow for cheap/fastnetworkusageThiswashardwork to come up withWe show last bulletwith an examplenext
  25. Example from beforeI am going to move the treeon the right to the left (next)
  26. As youcansee as the postfixpathbecomes longer and longer the prefix-treebecomessmaller and smallerWecanbuild the subtree from the postfixpath!
  27. Each trans is made of itemsThe reasonthatwecanusebuckets is because of Observation1
  28. What is wrongwiththispicture?Sort of the original algorithm. We ”inheirited” turned out to be bad. Next gen:
  29. Peer-2-peer based + messagequeueRead-onlydbWeareworkingon new version withevenlessdblookup
  30. For eachpostfix:Distributeitem-buckets, transaction-bucketsCount number of postfixpaths in bucketsCollectcountacrosstransactionsFor eachfrequentpostfixpathcallrecursivelyIf expectedsize of prefixtree is small do standard FP-growthIN order to distributebucketsweuse MessageQueueINorder to collectcountsweusemessage parsing (node to node)IN order to do standard FPG weuselocalcomputation
  31. Message-driven – good for cloudMQ scaleswellDistributed data, thiswas a lessonlearned, also bad experiencewith SQL serverNew distribute FPGSmall messages in constrast to the cluster solutions, good for slownetworks
  32. Whataboutperf? (next)
  33. Ours is the lowgraphWith onlyoneworker.Growth as dbgrows