
Distributed data mining



Presentation of Research Paper: Distributed Data Mining for User Sensemaking in Online Collaborative Spaces.

Published in: Technology

  1. School of Computing, Faculty of Engineering. Distributed Data Mining for User Sensemaking in Online Collaborative Spaces. Submitted to: DicoSyn2012 Workshop @ CSCW’12. Presented by: Ahmad Ammari, RF in User & Community Modelling
  2. OUTLINE
     • The Big Data “Problem” in Online Collaborative Spaces
     • What is User Sensemaking and How is Big Data Affecting it?
     • Can Distributed Data Mining Help?
       • What is Hadoop & Map/Reduce?
       • What is Mahout?
     • Proposed Approach to Support User Sensemaking in Online Collaborative Spaces (OCS)
       • Content Pre-Processing
       • Content Clustering
       • Topic Modelling
     • Case Study: Making Sense of Online Forums
       • How are Discussions currently Organized? Clusters vs. Categories
       • Which Content to Mine? Mining the Right Discussion Parts
       • How Can This Help Sensemaking? Some Usage Scenarios
  3. How “Big” is Big Data?
     • Emails
       • 90 Trillion – The Number of Emails Sent on the Internet in 2009
       • 107 Trillion – The Number of Emails Sent on the Internet in 2010
     • Websites
       • 234 Million – The Number of Websites by Dec 2009
       • 255 Million – The Number of Websites by Dec 2010
     • Social Media
       • 152 Million – The Number of Blogs on the Internet in 2010
       • 25 Billion – The Number of Tweets Sent on Twitter in 2010
     • Multimedia
       • 5 Billion – The Number of Photos Hosted by Flickr (Sep 2010)
       • 2 Billion – The Number of Videos Watched per Day on YouTube
  4. What about Online Collaborative Spaces? They are Big Too! Top 10 biggest Internet forums
  5. What about Online Collaborative Spaces? They are Big Too! Stack Exchange Family of Forums
  6. Why is it a Problem? Where should I post my programming question to get relevant replies?
  7. Why is it a Problem? Where can I find a solution to my MS Outlook problem?
  8. Why is it a Problem? What are the actual discussions really about? I cannot make sense of Big Content!
  9. Why is Making Sense of Big Data neither Easy nor Fast?
     • Because it’s Big and still increasing!
     • Because it’s Diverse!
       • The Stack Exchange Suite of Forums has more than 50 Different Technical Discussion Forums
       • WebProWorld Technical Forums has more than 40 Discussion Categories
     • Because it’s Dynamic!
       • 294 Billion – The Average Number of Email Messages per Day
       • 21.4 Million – The Number of Websites Added in 2010
       • 96,101 – The Number of New Blogs in the last 24 hours (8th Dec 2011)
       • 190 Million – The Number of Tweets per day in June 2011
     • Because it’s Noisy!
       • 200 Billion – The Number of Spam Emails per Day in 2009
       • 262 Billion – The Number of Spam Emails per Day in 2010
  10. But What is “Sensemaking”?!
      • Creating a representation of a collection of information [Russell et al, 1993]
        • Focused on the context of understanding large document collections [Paul et al, 2011]
      • Transforming Information into Knowledge [Pirolli & Card, 2005]
        • Seeking, filtering, searching for relations, extracting, schematizing
      • Understanding connections among people, places, and events [Klein et al, 2006]
  11. Our Solution!
      • Large-Scale Data Processing
        • Quick Data Processing
        • Scalable Data Processing
        • Robust Data Processing
      • Knowledge Discovery in Big Content
        • Analysis of Unstructured Data
        • Machine Intelligence to Support Humans
  12. What is Hadoop?
      • A framework for storing and processing big data on lots of commodity machines
        • Up to 4,000 machines in a cluster
        • Up to 20 PB in a cluster
      • Open Source Apache project
      • Implemented in Java
      • Contains Many Sub-Projects:
        • Map/Reduce – Software Framework for Distributed Processing of Large Datasets
        • HDFS – Hadoop Distributed File System
        • Hadoop Common – Provides Access to the File Systems Supported by Hadoop
        • Chukwa – Data Collection System for Managing Large Distributed Systems
        • HBase – Scalable, Distributed Database that Supports Structured Data Storage
        • Hive – Data Warehouse Infrastructure that Provides Data Summarization & Ad Hoc Querying
        • Pig – High-Level Data-Flow Language & Execution Framework for Parallel Computation
        • ZooKeeper – High-Performance Coordination Service for Distributed Applications
      • We focused on distributed computation with Map/Reduce
  13. Who Uses Hadoop?
  14. Why do they Use Hadoop?
  15. Hadoop Map/Reduce
      • Simply: a parallel programming model and an associated implementation
      • Abstract model: hides many system-level details from the programmer
      • Move-code-to-data philosophy: computation on a piece of data takes place on the same machine where that piece resides
      • A Map/Reduce Job runs in Phases; each Phase runs in Parallel across all Nodes in the Hadoop Cluster
      • Main Phases: Mapping, Reducing
      • Are there Other Phases? Yes!
        • Shuffling & Sorting, Combining, Partitioning
        • But the Programmer writes the “Mapper” and “Reducer” functions only!
  16. Hadoop Map/Reduce
  17. Hadoop Map/Reduce
      More formally,
      • Map(k1, v1) → list(k2, v2)
      • Shuffle & Sort(list(k2, v2)) → k2, list(v2)
      • Reduce(k2, list(v2)) → list(k3, v3)
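The three signatures on the Map/Reduce slide can be illustrated with a small word count. This is a single-process Python sketch of the programming model only, not Hadoop's Java API: the names `map_fn`, `shuffle_and_sort`, `reduce_fn`, and `run_job`, as well as the sample posts, are assumptions made for illustration.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in one document."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle_and_sort(pairs):
    """Shuffle & Sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(word, counts):
    """Reduce(k2, list(v2)) -> list(k3, v3): sum the counts of one word."""
    return [(word, sum(counts))]


def run_job(documents):
    """Run the phases in sequence; Hadoop would spread each phase across nodes."""
    mapped = [pair for doc_id, text in documents.items()
              for pair in map_fn(doc_id, text)]
    reduced = []
    for key, values in shuffle_and_sort(mapped):
        reduced.extend(reduce_fn(key, values))
    return dict(reduced)


# Hypothetical forum posts, used only to exercise the sketch.
posts = {"post-1": "Outlook crashes on start",
         "post-2": "Outlook backup on disk"}
word_counts = run_job(posts)
```

Only `map_fn` and `reduce_fn` correspond to code a programmer would write; the grouping step stands in for what the framework does between the two phases.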
  18. Hadoop Map/Reduce
  19. Our Solution!
      • Large-Scale Data Processing
        • Quick Data Processing
        • Scalable Data Processing
        • Robust Data Processing
      • Knowledge Discovery in Big Content
        • Analysis of Unstructured Data
        • Machine Intelligence to Support Humans
  20. What is Mahout?
      • Open source machine learning library from Apache
      • Began life in 2008 as a subproject of Apache’s Lucene Search Engine
      • In 2009 absorbed the Taste open source collaborative filtering project
      • In April 2010 became a stand-alone Project
      • Written in Java
      • ML algorithms mainly for
        • Recommender Engines (CF-based)
        • Clustering
        • Classification
      • Pre-Processing algorithms for Unstructured Data
      • Scalability is achieved by Map/Reduce Implementations of ML Algorithms
      • We focused on Mahout Clustering and Pre-Processing Implementations in Map/Reduce
  21. Sensemaking-Support with DDM – INPUT: Collaboration Content (Discussions)
  22. Sensemaking-Support with DDM – Content Pre-Processing: Prepare Content for Mining
  23. Sensemaking-Support with DDM – Content Clustering: Derive Groups of Similar Content
  24. Sensemaking-Support with DDM – Topic Modelling: Identify Fine-Grained Topics and Generate Topic Clouds
  25. Sensemaking-Support with DDM – OUTPUT: Topic Clouds
  26. Content Pre-Processing
      • Apache Lucene Text Analysis
        • Tokenization, Non-Letter Removal, Lower-Case Filtering, Stop Word Removal
      • TF-IDF Weighting: Computing Numerical Weights for Content Terms
      • n-gram Collocations
        • Multi-Term Phrases with a high probability of occurring together
        • Examples: “Social Media”, “Data Mining”, “Machine Learning”
      • Normalization
        • Decreasing the magnitude of large document vectors & increasing the magnitude of small ones
        • p-norm
        • p depends on the similarity measure used
        • With Text Content, the best similarity measures are Euclidean & Cosine → p = 2
        • Example: the 2-norm of a 3-dimensional vector, [x, y, z], is √(x² + y² + z²)
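The pre-processing steps on this slide (analysis, TF-IDF weighting, p-norm normalization) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Lucene or Mahout implementation: the stop-word list is a made-up sample, the weight is the textbook tf × log(N / df) rather than whatever formula Mahout applies, n-gram collocations are left out, and all function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not Lucene's list).
STOP_WORDS = {"the", "a", "is", "of", "to", "in", "my", "and"}


def analyze(text):
    """Lower-case, split on non-letters, and drop stop words."""
    tokens = re.split(r"[^A-Za-z]+", text.lower())
    return [t for t in tokens if t and t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document, weight = tf * log(N / df)."""
    tokenized = [analyze(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors


def p_normalize(vector, p=2):
    """Divide every weight by the vector's p-norm (p = 2 gives unit Euclidean length)."""
    norm = sum(abs(w) ** p for w in vector.values()) ** (1.0 / p)
    if norm == 0:
        return dict(vector)
    return {t: w / norm for t, w in vector.items()}


# Hypothetical discussion titles, used only to exercise the sketch.
docs = ["My Outlook file is corrupted",
        "Backup of the hard disk",
        "Outlook backup to disk"]
vectors = [p_normalize(v) for v in tfidf_vectors(docs)]
```

After 2-norm normalization every document vector has length 1, so long and short discussions contribute comparably when Euclidean or cosine distance is computed during clustering.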
  27. Content Clustering: Discovering Clusters of “similar” Points. [Figure: EM algorithm fitted to a 2-component Gaussian mixture model on the Old Faithful Geyser dataset]
  28. K-Means Clustering: Map/Reduce Implementation in Mahout
      1. Start with three random points as centroids
      2. Map stage: assign each point to the cluster nearest to it
      3. Reduce stage: the associated points are averaged out to produce the new location of the centroid
      4. After each iteration, the final configuration is fed back into the same loop until the centroids come to rest at their final positions
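The four K-Means steps can be sketched as a map stage and a reduce stage that are applied repeatedly. This is a single-process Python illustration, not Mahout's Java implementation: the function names, the convergence `tolerance`, and the fixed starting centroids (used instead of random ones so the run is repeatable) are assumptions.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def map_stage(points, centroids):
    """Map: emit (cluster_id, point) for every point."""
    return [(nearest(p, centroids), p) for p in points]


def reduce_stage(assignments, centroids):
    """Reduce: average the points of each cluster to get its new centroid."""
    groups = {}
    for cluster_id, point in assignments:
        groups.setdefault(cluster_id, []).append(point)
    updated = []
    for cluster_id, old in enumerate(centroids):
        members = groups.get(cluster_id)
        if not members:
            updated.append(old)  # keep an empty cluster's centroid in place
        else:
            updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return updated


def kmeans(points, centroids, tolerance=1e-6, max_iterations=100):
    """Feed each iteration's centroids back in until they stop moving."""
    for _ in range(max_iterations):
        updated = reduce_stage(map_stage(points, centroids), centroids)
        shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tolerance:
            break
    return centroids


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
final_centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

In a real cluster each iteration is a separate Map/Reduce job, and the mappers only need the current centroids plus their local share of the points.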
  29. Canopy Clustering
      • Fast approximate clustering technique
      • Divides the input set of points into overlapping clusters known as canopies
      • In Mahout, it is used to estimate the approximate cluster centroids (or canopy centroids) using two distance thresholds, T1 and T2, with T1 > T2
      1. Start with a point and mark it as part of a canopy
      2. All the points within distance T2 are removed from the data set and prevented from becoming new canopies
      3. The points within the outer circle (distance T1) are also put in the same canopy, but they are allowed to be part of other canopies. The assignment process is done in a single pass on a mapper
      4. The reducer computes the average of the centroid and merges close canopies
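The canopy steps can be sketched as follows. This is a simplified single-process Python illustration, not Mahout's mapper and reducer: the function names and sample points are assumptions, the centroid of a canopy is taken as the plain average of its members, and the merging of close canopies mentioned in step 4 is not implemented.

```python
import math


def canopy_clustering(points, t1, t2):
    """Return a list of (center, members) canopies using thresholds t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)          # step 1: start a new canopy
        members = [center]
        still_available = []
        for point in remaining:
            distance = math.dist(center, point)
            if distance < t1:              # step 3: inside the outer circle
                members.append(point)
            if distance >= t2:             # step 2: only points beyond T2 stay
                still_available.append(point)
        remaining = still_available
        canopies.append((center, members))
    return canopies


def canopy_centroids(canopies):
    """Average the members of each canopy (merging of close canopies omitted)."""
    return [tuple(sum(c) / len(members) for c in zip(*members))
            for _, members in canopies]


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
canopies = canopy_clustering(points, t1=4.0, t2=2.0)
centroids = canopy_centroids(canopies)
```

In this run the point (3.0, 0.0) lies between T2 and T1 of the first canopy, so it joins that canopy and later also starts a canopy of its own, which is the overlap the slide describes. The resulting centroids are the kind of estimate that can seed K-Means.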
  30. Sensemaking in Online Forums
      • Illustration of the Approach to support user sensemaking in Online Forums
      • Content Collection from WebProWorld Technical Forums
      • Large Forum (1000s of Discussion Threads)
      • Discussions are Organized into Categories (Subforums) Defined by Forum Designers
      • Four subforums were chosen for the experiment:
        • Two subforums representing fairly specialized categories – SEO (Search Engine Optimization) and e-Commerce
        • Two subforums representing broad categories – IT and Computer Assistance
      • Objectives for the experiment
        • Investigate the extent of sensemaking support needed for the public technical forum
        • Determine which content representation for clustering is more appropriate to derive topic clouds for the sensemaker
        • Illustrate how the output of the approach could provide sensemaking support
  31. Clusters vs Categories
      • [Figure: Distribution of Four Categories in Four Mahout-based Clusters by Title]
      • [Figure: Distribution of Four Categories in Four Mahout-based Clusters by Title and First Post]
  32. Content Representation
      • The smaller the average DBI, the better the model is for achieving a coherent set of similar discussions.
      • Clustering models having item distribution values closer to 1.0 will derive minor distinct clusters with topic-specific discussions.
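Assuming that DBI on this slide stands for the Davies-Bouldin index, the claim that smaller values indicate more coherent clusters can be checked with a short Python sketch. The function names and the toy clusters are hypothetical, Euclidean distance is assumed, and the sketch expects distinct cluster centroids (coinciding centroids would divide by zero).

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def davies_bouldin_index(clusters):
    """Mean over clusters of the worst (scatter_i + scatter_j) / distance(c_i, c_j)."""
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    centroids = [centroid(c) for c in clusters]
    # Scatter: average distance of a cluster's points to its own centroid.
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, centroids)]
    total = 0.0
    for i in range(len(clusters)):
        total += max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                     for j in range(len(clusters)) if j != i)
    return total / len(clusters)


# Two toy clusterings: the same cluster shapes, far apart versus close together.
well_separated = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
poorly_separated = [[(0.0, 0.0), (0.0, 2.0)], [(3.0, 0.0), (3.0, 2.0)]]
dbi_good = davies_bouldin_index(well_separated)
dbi_poor = davies_bouldin_index(poorly_separated)
```

The well-separated clustering yields the smaller index, which matches the slide's reading that a lower average DBI means tighter, better-separated groups of discussions.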
  33. Example Topic Clouds
      Enabled Discovery of Topic-Specific Discussions not Obvious in Category Names:
      • Disk & Keyboard Problems
      • Security Issues
      • Hard Disk Backup
      • MS Outlook File Problems
      • Certificates and Skills in Web Design
      • Photo features in social networks (Facebook)
      • Optimizing Search Engines for Blog Search
      • Design of Data Warehousing Systems
  34. Cross Validated Statistics Forum
  35. Conclusion
      • Big Data creates a Big Challenge to sensemaking in Online Collaborative Spaces
      • Distributed Data Mining with Hadoop Map/Reduce and Mahout is exploited to support user sensemaking by summarizing the huge content found in Large-scale Discussion Forums
      • Cluster Analysis shows that Different User-created Categories may contain similar Collaborative Content, making it difficult for users to find the content that addresses their problems / interests
      • Clustering of content represented by titles produces more coherent clusters, with more ability to uncover fine-grained discussions that are buried in the huge amount of content
      • Mahout is not currently perfect!
        • Lack of Clustering Validity Measures
        • Lack of Dimension Reduction Algorithms (e.g. LSI), which are important to improve clustering results
        • Lack of GUI Support
  36. School of Computing, Faculty of Engineering. Thank You. Ahmad Ammari