Data Mining in Hadoop, Making Sense of it in Mahout! Hadoop World 2011, Michael Cutler @cotdp

Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Michael Cutler - British Sky Broadcasting

Much of Hadoop adoption thus far has been for use cases such as processing log files, text mining, and storing masses of file data -- all very necessary, but largely not exciting. In this presentation, Michael Cutler presents a selection of methodologies, primarily using Mahout, that will enable you to derive real insight into your data (mined in Hadoop) and build a recommendation engine focused on the implicit data collected from your users.

Published in: Technology, Education

  • Clustering, outliers, association. It’s not new; we’ve been doing it manually for years.
  • So why has it changed?
  • Cost per GB: 1990 ~ $10,000; 2000 ~ $10; 2010 ~ $0.10. Currently 5 cents per GB.
  • It’s easier than ever before to generate or collect data. Complexity has increased. Storage and processing power are relatively cheap.
  • Call data records, web logs etc. Rinse, repeat. The problem is that, as the volume of data has grown, you need to go about it in a better way.
  • Files, HBase etc. Dashboards.
  • Collaborative filtering for user-based and item-based recommendations. Various clustering algorithms.
  • Two JARs, “core” and “math”. Basic implementations for everything. You can string together many use cases just using the examples and the CLI.
  • Examples: detecting spam email, optical character recognition.
  • You feed in the data. Give it a similarity metric. Set a limit on the number of clusters.
  • Colors Blue & Red appear together three times. Purple, Orange and Green appear only twice.
  • How do you recommend to users you know nothing about? If nobody has stumbled onto it, how do you recommend it? Outlier behaviour skewing results. Tastes can change over time or seasonally.
  • On the face of it, the fact it recommended SAW based on a Kids movie just means that parents are likely to watch SAW.
  • Item-to-item relationships rarely change. Historical data and trends rarely change. Easy to compute for new items.
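
The frequent pattern note above (colours that appear together) can be illustrated with a small sketch. This is plain Python, not Mahout's implementation: it counts co-occurring pairs naively, whereas a scalable library would use a dedicated algorithm. The function name `frequent_pairs`, the `min_support` parameter and the baskets are all hypothetical and are not the data shown on the slide.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return the item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops repeats and gives every pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical baskets, made up for illustration only.
baskets = [
    ["blue", "red", "green"],
    ["blue", "red"],
    ["red", "blue", "orange"],
    ["green", "orange"],
]

print(frequent_pairs(baskets, min_support=3))
```

Counting every pair grows quadratically with basket size, so this approach only suits small inputs; it is meant to show the idea of "support", not how to run it at Hadoop scale.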

    1. Data Mining in Hadoop, Making Sense of it in Mahout!
       Hadoop World 2011
       Michael Cutler @cotdp
    2. Hello (Hadoop) World!
       • Senior Research Engineer
       • British Sky Broadcasting
       • Lead the Hadoop initiative
       • Fostering development
    3. Topics
       • What is Data Mining?
       • Introducing Mahout
       • Using Mahout
       • Demo
       • Summary
       • Q&A
    4. What is Data Mining?
    5. It’s all about discovery...
       • Grouping similar data records
       • Identifying unusual records
       • Detecting relationships between records
       • Discovering previously unknown patterns
    6. Trends...
       • 1990s approach: “Think carefully first and get it right!”
       • 2000s approach: “Think a little first, evolve it later...”
       • 2010s approach: “... if we capture everything, sense will come(?)”
    7. Cost of Storage
       http://www.mkomo.com/cost-per-gigabyte
    8. Other Reasons...
       • Increased generation of data
       • Complex interconnected datasets
       • You can be lazy about it...
       Consequence:
         – More data to process than ever before
    9. Traditional Approach...
       • Collate your data into files
       • 6pm take your Database offline
       • Bulk load the previous 24hrs data
       • Run data mining, analytics, reporting overnight
       • Bring the database back up for 9am
    10. Modern Approach
        • Stream data straight into Hadoop
        • No need for downtime
        • Analysis updated periodically or real-time
        • Scalable approach
    11. Introducing Mahout
    12. What is it?
        Library of scalable machine learning algorithms:
        • Classification
        • Clustering
        • Collaborative Filtering (Recommendations)
        • Frequent Pattern mining
        ... and many more
    13. How do you use it?
        • It’s just a Java library
        • Simple to get started
        • Easy to extend and enhance
        • Powerful command-line tools & examples
    14. Classification
        • Labels input data with one or more categories
        • Trained with known data
    15. Clustering
        • Groups data based on their similarity
        • Unsupervised – no training
    16. Collaborative Filtering
        • User-based recommendations
          – Analyse user data
          – Build neighbourhoods of users
          – Other people like you, liked <these>
        • Item-based recommendations
          – Analyse domain data
          – Build relationships between items
          – If you liked this, what about <these>
    17. Others
        • Frequent Pattern mining
        • High performance maths & utilities
    18. Mahout is a toolbox
        • Understand your data
        • Determine what needs to be done
        • Build a pipeline to compute results
        • Think about performance from the start
    19. Please Note
        • Scalability through Map/Reduce jobs
        • Like MR it is inherently batch-driven
        • Some algorithms are not implemented in MR yet
        • Fast-paced development
    20. Using Mahout
    21. Building a Recommender
        Objectives:
        • Personalised
        • Item-based recommendations
        • Evolve with the times
        • Implicit feedback through measurement
    22. Problems with Recommenders
        • “Cold start” problem
        • “New stuff” problem
        • Tainted profiles
        • Stale profile data
    23. When they go wrong...
    24. Basic Strategy
        • Pre-compute rarely-changing data
        • Cache and serve them using traditional means
        • Flag data when it needs to be refreshed
        • Tailor the cache on-the-fly
    25. Demo
    26. Summary
        • Mahout is exciting!
        • Wide range of applications
        • Scalable algorithms
        • Scalable community
    27. Questions?
    28. Thank you!
        Hadoop World 2011
        Michael Cutler @cotdp
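
The item-based recommender described in slides 16, 21 and 24 can be sketched in a few lines. This is a minimal illustration in plain Python under stated assumptions, not the code from the demo and not Mahout's API: implicit feedback is modelled as "user watched item", item-to-item similarity is cosine similarity over the sets of viewers, and the names `item_similarities`, `recommend` and the sample `views` data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(views):
    """Pre-compute cosine similarity between items from implicit view data.

    views maps a user id to the set of item ids that user has watched.
    """
    # Invert the data: for each item, which users watched it?
    item_users = defaultdict(set)
    for user, items in views.items():
        for item in items:
            item_users[item].add(user)

    sims = defaultdict(dict)
    for a, b in combinations(sorted(item_users), 2):
        shared = len(item_users[a] & item_users[b])
        if shared:
            score = shared / sqrt(len(item_users[a]) * len(item_users[b]))
            sims[a][b] = score
            sims[b][a] = score
    return sims


def recommend(user, views, sims, top_n=3):
    """Rank unseen items by their summed similarity to the items the user has seen."""
    seen = views.get(user, set())
    scores = defaultdict(float)
    for item in seen:
        for other, score in sims.get(item, {}).items():
            if other not in seen:
                scores[other] += score
    # Highest score first; ties broken alphabetically for stable output
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical viewing data, made up for illustration only.
views = {
    "ann": {"news", "sport", "drama"},
    "bob": {"news", "sport"},
    "cat": {"sport", "drama"},
    "dan": {"news"},
}

sims = item_similarities(views)      # the rarely-changing part, computed in batch
print(recommend("dan", views, sims))
print(recommend("eve", views, sims)) # unknown user: nothing to go on ("cold start")
```

The split between the two functions mirrors the basic strategy on slide 24: the similarity table changes slowly and can be pre-computed and cached, while the per-user ranking is cheap enough to do on request.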