Apriori data mining in the cloud


Published on

Published in: Technology, Self Improvement
  • Be the first to comment

  • Be the first to like this

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • Weare a Tech transfer company. Webuildbrigdesbetweenuniversities (and other research institutes) and companiesFocuspervasisecomputing. For example mobile, cloud, data treatment
  • Product: Raptor smart advisorHeavy data mining solutionproductrecomendation (brought, browsing, otherusers), association data miningMulti-passRessource intensiveLast year (2010) current server is becoming to smallNeed to upgrade -> biginvestment (hw, licenses)Looked to the cloud, Azure (utility model, usagepricing)What potentials did theysee?
  • TheiroldsetupWe log patterns and build a model.Weuse the currentuserspattern to queryagainst the model (historic data)
  • The reasons d60 looked to the cloudCheaperlicenses,e.g. continuespayment But typicallyyougetupgrades for freeFaster calculation ->currently (last year) batch processing, and the batch processingshouldbe done within 12 hoursNowtheycan do somethingtheycouldn’tbeforeNear-real time responses – still workingon-real time events, trends etc-huge potential
  • D60 wanted to continue to use SQL Server50 GB in corboratesetting is not muchDuringprojectwerealised:Contacting the sqlserver from outside is slowish. By 10-20% (sometimes more!) (compared to a onpremisenetworkedsetup)
  • This isloose talk. Weneed to establish basis. The cloud is a bounch of looselyconnected nodes. For it to be cloud you have to have the ability to scale up and downondemand - elasticityNodes aretypically virtual images of a small(ish) computerNodes interconnected via LAN (ifwearelucky), but mightbepositioned in geographicaldifferent locationsWeexpectbetter respons times thanifthiswas over the internet. Howeverwewillexperiencelower respons times thanifwe had a dedicatedsetuponpremises.But it’scheapIt is a distributedsetup, where hardware and OS is typicallyvendorcontrolled. As in otherdistributedsetups – plan on nodes beingunavailable.
  • QueueAzure has a messagequeueGoogle has map-reduceframe-workorAppEngineTaskqueueAmazon has simple queue service (SQS)ScalableimplementationsexistsGlobal dbSimilary all providers has a global dbThereare a number of other cloud services, but theseare the onesthat matter to us
  • Manyorderes, alsounordered, we just need FIFOWeended up buildingourown, because ofsome initial bad experienceswith the azureone (slowresponsesetc)Maybe not so negative
  • This is prettyvague. Nowweconsider a more concreteexample
  • Marketbasket is the historicalexample of associasionmining.Diapers and beerImagien 4 customers and their shopping baskets (carts)A basket is called a transactionAn item in a basket is called an ”item”Wewill talk about sets of items ”itemsets” or sets of items
  • School book examplesGive story of whyDiapers and BeerareconnectedDon’t have time to go outDiscussvalues for frequencythresholdF-itemsetsGoal of Shopping basket analysis is to find as manyfrequent itemsets as possibleGenerateConditionalProbabilistrulesHard problemInternet and everyclick – lot of dataEach item witheach item – exponential
  • FP-growth (from 2000)Clustersetups(allowing for fast information exchangebetween nodes)FP-growth: recursivealgorithm, each step takes a FP-treewithconditions, a conditionalFP-treeGenerallyperformsquitewellTreesize is comparable to data set size (huge)Nearmemory: fast memory (ram vsharddrivevsnetworkstorage), knownthat page faultdestroyeffecientcy of algortihmOptions for caching, but cloud is big problemLot of current researchCluster (paraellel) setupvsdistributedsetup. Shared RAM allow for really fast info transferReasearch: transfer of subtreesWiderapplication. For metypical type of problem. Centralizedalgorithm, want to distributewant to do
  • Example of tree from examplebeforeHere customer1Note alphanummericalsorting, not kosher (weusefrequencycount)A prefixtreeortrieWill not talk about the lookup pointer structuresFP-tree is abouttwothings:CompressingFast data lookup
  • Customer 2 from before
  • Compression, but not 100% (Milk) (continueonnext slide)
  • Note Milk node is in theretwice. Low-complexitycompressionNot intelligent compressionBut weneed to consider the orderingFor live data set typicallyyouareable to shrink an order of magnitude, due to the orderingEach node have weight
  • All fourcustomersaddedActually I am not showning the root (null-node)Compression 13/16 approx 18% compressionLets look at the FP-growthalgorithm
  • Thiswewant to distributeLets look at the tree
  • Find frequent items (in tree)
  • Wecount the occurences of ”Milk”, if support count is highengoughwegenerate the sub-tree. Otherwise just forget it.Milk occurs in two branches
  • Note: all edges has a weight (as memtionedbefore)Weused the parts of the FP-tree not shownhere to loop/build the treeefficiently
  • Youcan of course havemix of memorysetup types
  • We of coursegot a goodidea
  • That database scansareserial is crusial – wecould have neededrandomlookup. Thismeansthatdistributing the database is in-expensiveSecondbulletimplythatwe do not need the tree to do the recursive (distributed) calls. Allow for cheap/fastnetworkusageThiswashardwork to come up withWe show last bulletwith an examplenext
  • Example from beforeI am going to move the treeon the right to the left (next)
  • As youcansee as the postfixpathbecomes longer and longer the prefix-treebecomessmaller and smallerWecanbuild the subtree from the postfixpath!
  • Each trans is made of itemsThe reasonthatwecanusebuckets is because of Observation1
  • What is wrongwiththispicture?Sort of the original algorithm. We ”inheirited” turned out to be bad. Next gen:
  • Peer-2-peer based + messagequeueRead-onlydbWeareworkingon new version withevenlessdblookup
  • For eachpostfix:Distributeitem-buckets, transaction-bucketsCount number of postfixpaths in bucketsCollectcountacrosstransactionsFor eachfrequentpostfixpathcallrecursivelyIf expectedsize of prefixtree is small do standard FP-growthIN order to distributebucketsweuse MessageQueueINorder to collectcountsweusemessage parsing (node to node)IN order to do standard FPG weuselocalcomputation
  • Message-driven – good for cloudMQ scaleswellDistributed data, thiswas a lessonlearned, also bad experiencewith SQL serverNew distribute FPGSmall messages in constrast to the cluster solutions, good for slownetworks
  • Whataboutperf? (next)
  • Ours is the lowgraphWith onlyoneworker.Growth as dbgrows
  • Apriori data mining in the cloud

    1. 1. Case study: d60 Raptor smartAdvisor Jan Neerbek Alexandra Institute
    2. 2. Agenda· d60: A cloud/data mining case· Cloud· Data Mining· Market Basket Analysis· Large data sets· Our solution2
    3. 3. Alexandra Institute The Alexandra Institute is a non-profit company that works with application- oriented IT research. Focus is pervasive computing, and we activate the business potential of our members and customers through research- based userdriven innovation.3
    4. 4. The case: d60· Danish company· A similar products recommendation engine· d60 was outgrowing their servers (late 2010)· They saw a potential in moving to Azure4
    5. 5. The setup Product Internet RecommendationsWebshops Log shopping patterns Do data mining 5
    6. 6. The cloud potential· Elasticity· No upfront server cost· Cheaper licenses· Faster calculations6
    7. 7. Challenges· No SQL Server Analysis Services (SSAS)· Small compute nodes· Partioned database (50GB)· SQL server ingress/outgress access is slow7
    8. 8. The cloud Node Node Node Node Node Node Node8
    9. 9. The cloud and services Node Node Node Node Data layer service Node Messaging Node Service Node9
    10. 10. Data layer service Data layer· Application specific (schema/layout) service· SQL, table or other· Easy a bottleneck· Can be difficult to scale10
    11. 11. Messaging serviceTask Queues· Standard data structure Messaging Service· Build-in ordering (FIFO)· Can be scaled· Good for asynchronous messages11
    12. 12. 12
    13. 13. Data miningData mining is the use of automated data analysis techniques to uncover relationships among data itemsMarket basket analysis is a data mining technique that discovers co-occurrence relationships among activities performed by specific individuals [about.com/wikipedia.org]13
    14. 14. Market basket analysis Customer1 Customer2 Customer3 Customer4Avocado Milk Beef CerealMilk Diapers Lemons BeerButter Avocado Beer BeefPotatoes Beer Chips Diapers14
    15. 15. Market basket analysis Customer1 Customer2 Customer3 Customer4 Avocado Milk Beef Cereal Milk Diapers Lemons Beer Butter Avocado Beer Beef Potatoes Beer Chips DiapersItemset (Diapers, Beer) occur 50%Frequency threshold parameterFind as many frequent itemsets as possible 15
    16. 16. Market basket analysisPopular effective algorithm: FP-growth Based on data structure FP-treeRequires all data in near-memory Most research in distributed models has been for cluster setups 16
    17. 17. Building the FP-tree(extends the prefix-tree structure) Customer1 Avocado Avocado Milk Butter Butter Potatoes Milk Potatoes17
    18. 18. Building the FP-tree Customer2 Avocado Milk Diapers Avocado Butter Beer Milk Potatoes18
    19. 19. Building the FP-tree Customer2 Avocado Milk Diapers Avocado Butter Beer Beer Milk Diapers Potatoes Milk19
    20. 20. Building the FP-tree Customer2 Avocado Milk Diapers Avocado Butter Beer Beer Milk Diapers Potatoes Milk20
    21. 21. Building the FP-tree Avocado Beef Butter Beer Beer Milk Diapers Chips Cereal Potatoes Milk Lemon Diapers21
    22. 22. FP-growthGrows the frequent itemsets, recusivelyFP-growth(FP-tree tree){ … for-each (item in tree) count =CountOccur(tree,item); if (IsFrequent(count)) { OutputSet(item); sub = tree.GetTree(tree, item); FP-growth(sub); }22
    23. 23. FP-growth algorithmDivide and ConquerTraverse tree Avocado Beef Butter Beer Beer Milk Diapers Chips CerealPotatoes Milk Lemon Diapers23
    24. 24. FP-growth algorithmDivide and ConquerGenerate sub-trees Avocado Beef Butter Beer Beer Milk Diapers Chips CerealPotatoes Milk Lemon Diapers24
    25. 25. FP-growth algorithmDivide and ConquerCall recursively Avocado Beef Butter Beer Beer Avocado Milk Diapers Chips Cereal Butter Beer DiapersPotatoes Milk Lemon Diapers25
    26. 26. FP-growth algorithmMemory usageThe FP-tree does not fit in local memory; what to do?· Emulate Distributed Shared Memory26
    27. 27. Distributed Shared Memory? CPU CPU CPU CPU CPU Memory Memory Memory Memory Memory Network Shared Memory· To add nodes is to add memory· Works best in tightly coubled setups, with low-lantency, high-speed networks27
    28. 28. FP-growth algorithmMemory usageThe FP-tree does not fit in local memory; what to do?· Emulate Distributed Shared Memory· Optimize your data structures· Buy more RAM· Get a good idea28
    29. 29. Get a good idea· Database scans are serial and can be distributed· The list of items used in the recursive calls uniquely determines what part of data we are looking at29
    30. 30. Get a good idea Avocado Beef Butter Beer Beer Avocado Milk Diapers Chips Cereal Butter Beer DiapersPotatoes Milk Lemon Diapers30
    31. 31. Get a good idea Avocado Avocado Butter, Milk Butter Beer Diapers Milk Avocado Beer Diapers,Milk These are postfix paths31
    32. 32. 32
    33. 33. Buckets· Use postfix paths for messaging· Working with buckets Transactions Items33
    34. 34. FP-growth revisited Replaced with FP-growth(FP-tree tree) postfix { …Done in parallel for-each (item in tree) Done in parallel count =CountOccur(tree,item); if (IsFrequent(count)) { OutputSet(item); Done in parallel sub = tree.GetTree(tree, item); FP-growth(sub); } 34
    35. 35. Communication Node Node Data layer Node Node35
    36. 36. Revised Communication Node Node MQ Data layer Node Node36
    37. 37. Running FP-growth Distribute buckets Count items (with postfix size=n) Collect counts (per postfix) Call recursive Standard FP-growth37
    38. 38. Running FP-growth Distribute buckets Count items (with postfix size=n) Collect counts (per postfix) Call recursive Standard FP-growth38
    39. 39. Collecting what we have learned· Message-driven work, using message-queue· Peer-to-peer for intermediate results· Distribute data for scalability (buckets)· Small messages (list of items)· Allow us to distribute FP-growth39
    40. 40. Advantages· Configurable work sizes· Good distribution of work· Robust against computer failure· Fast!40
    41. 41. So what about performance? 04:30:00 04:00:00 03:30:00 03:00:00 02:30:00 Message-driven FP-growth FP-growth 02:00:00 Total node time 01:30:00 01:00:00 00:30:00 00:00:00 1 2 4 841
    42. 42. Thank you!42