Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Michael Cutler - British Sky Broadcasting
 


Much of Hadoop adoption thus far has been for use cases such as processing log files, text mining, and storing masses of file data -- all very necessary, but largely not exciting. In this presentation, Michael Cutler presents a selection of methodologies, primarily using Mahout, that will enable you to derive real insight into your data (mined in Hadoop) and build a recommendation engine focused on the implicit data collected from your users.

Usage Rights

© All Rights Reserved

  • Clustering, outliers, association. It’s not new; we’ve been doing it manually for years.
  • So why has it changed?
  • 1990 ~ $10,000; 2000 ~ $10; 2010 ~ $0.10. Currently 5 cents per GB.
  • Why it has changed:
    • It’s easier than ever before to generate or collect data
    • Complexity has increased
    • Storage and processing power is relatively cheap
  • Call data records, web logs etc. Rinse, repeat. The problem is that, as the volume of data has grown, you need to go about it in a better way.
  • Files, HBase etc. Dashboards.
  • Collaborative filtering for user-based and item-based recommendations. Various clustering algorithms.
  • Two JARs: “core” and “math”. Basic implementations for everything. You can string together many use-cases just using the examples and CLI.
  • Examples: detecting spam email, optical character recognition.
  • You feed in the data. Give it a similarity metric. Set a limit on the number of clusters.
  • Colors Blue & Red appear together three times. Purple, Orange and Green appear only twice.
  • How do you recommend to users you know nothing about? If nobody has stumbled onto it, how do you recommend it? Outlier behaviour skewing results. Tastes can change over time or seasonally.
  • On the face of it, the fact it recommended SAW based on a kids’ movie just means that parents are likely to watch SAW.
  • Item-to-item relationships rarely change. Historical data and trends rarely change. Easy to compute for new items.
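
The note about colours appearing together is the core idea behind frequent pattern mining: count how often items co-occur and keep the combinations that clear a support threshold. The sketch below is a brute-force pair count in plain Python, written only to illustrate that idea. It does not use Mahout, and it is not the algorithm Mahout runs. The function names and the baskets are hypothetical, with the baskets invented to mirror the counts in the note.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how often each unordered pair of items shares a basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def frequent_pairs(baskets, min_support):
    """Keep only the pairs seen together at least min_support times."""
    return {pair: n for pair, n in pair_counts(baskets).items() if n >= min_support}


# Hypothetical data, made up for illustration (not taken from the slides)
baskets = [
    ["blue", "red"],
    ["red", "blue"],
    ["blue", "red"],
    ["purple", "orange", "green"],
    ["green", "purple", "orange"],
]

print(frequent_pairs(baskets, min_support=3))
```

With a support threshold of 3 only the blue/red pair survives. Lowering it to 2 would also admit the purple, orange and green pairs. Enumerating every pair does not scale to large item sets, which is the gap a library implementation is meant to fill.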

Presentation Transcript

  • Data Mining in Hadoop, Making Sense of it in Mahout!
    Hadoop World 2011
    Michael Cutler @cotdp
  • Hello (Hadoop) World!
    • Senior Research Engineer
    • British Sky Broadcasting
    • Lead the Hadoop initiative
    • Fostering development
  • Topics
    • What is Data Mining?
    • Introducing Mahout
    • Using Mahout
    • Demo
    • Summary
    • Q&A
  • What is Data Mining?
  • It’s all about discovery...
    • Grouping similar data records
    • Identifying unusual records
    • Detecting relationships between records
    • Discovering previously unknown patterns
  • Trends...
    • 1990s approach: “Think carefully first and get it right!”
    • 2000s approach: “Think a little first, evolve it later...”
    • 2010s approach: “... if we capture everything, sense will come(?)”
  • Cost of Storage
    http://www.mkomo.com/cost-per-gigabyte
  • Other Reasons...
    • Increased generation of data
    • Complex interconnected datasets
    • You can be lazy about it...
    Consequence:
      – More data to process than ever before
  • Traditional Approach...
    • Collate your data into files
    • 6pm take your Database offline
    • Bulk load the previous 24hrs data
    • Run data mining, analytics, reporting overnight
    • Bring the database back up for 9am
  • Modern Approach
    • Stream data straight into Hadoop
    • No need for downtime
    • Analysis updated periodically or real-time
    • Scalable approach
  • Introducing Mahout
  • What is it?
    Library of scalable machine learning algorithms:
    • Classification
    • Clustering
    • Collaborative Filtering (Recommendations)
    • Frequent Pattern mining
    ... and many more
  • How do you use it?
    • It’s just a Java library
    • Simple to get started
    • Easy to extend and enhance
    • Powerful command-line tools & examples
  • Classification
    • Labels input data with one or more categories
    • Trained with known data
  • Clustering
    • Groups data based on their similarity
    • Unsupervised: no training
  • Collaborative Filtering
    • User-based recommendations
      – Analyse user data
      – Build neighbourhoods of users
      – Other people like you, liked <these>
    • Item-based recommendations
      – Analyse domain data
      – Build relationships between items
      – If you liked this, what about <these>
  • Others
    • Frequent Pattern mining
    • High performance maths & utilities
  • Mahout is a toolbox
    • Understand your data
    • Determine what needs to be done
    • Build a pipeline to compute results
    • Think about performance from the start
  • Please Note
    • Scalability through Map/Reduce jobs
    • Like MR, it is inherently batch-driven
    • Some algorithms are not implemented in MR yet
    • Fast-paced development
  • Using Mahout
  • Building a Recommender
    Objectives:
    • Personalised
    • Item-based recommendations
    • Evolve with the times
    • Implicit feedback through measurement
  • Problems with Recommenders
    • “Cold start” problem
    • “New stuff” problem
    • Tainted profiles
    • Stale profile data
  • When they go wrong...
  • Basic Strategy
    • Pre-compute rarely-changing data
    • Cache and serve them using traditional means
    • Flag data when it needs refreshing
    • Tailor the cache on-the-fly
  • Demo
  • Summary
    • Mahout is exciting!
    • Wide range of applications
    • Scalable algorithms
    • Scalable community
  • Questions?
  • Thank you!
    Hadoop World 2011
    Michael Cutler @cotdp
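
The "Basic Strategy" slide (pre-compute, cache, flag for refresh, tailor on the fly) can be sketched for an item-based recommender driven by implicit feedback. What follows is a toy, in-memory illustration in plain Python, not the recommender built in the talk and not Mahout's API. The class name, the method names, the co-view count used as a similarity measure, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


class ItemRecommender:
    """Toy item-based recommender over implicit feedback (items each user viewed)."""

    def __init__(self, views_by_user):
        self.views_by_user = {u: set(items) for u, items in views_by_user.items()}
        self._similar = {}   # cache: item -> {other item: co-view count}
        self._stale = True   # flag: the cache must be rebuilt before it is used

    def record_view(self, user, item):
        """Store a new implicit signal and flag the cached relationships as stale."""
        self.views_by_user.setdefault(user, set()).add(item)
        self._stale = True

    def _rebuild(self):
        """Pre-compute item-to-item relationships (the rarely-changing data)."""
        similar = defaultdict(lambda: defaultdict(int))
        for items in self.views_by_user.values():
            for a, b in combinations(sorted(items), 2):
                similar[a][b] += 1
                similar[b][a] += 1
        self._similar = {item: dict(others) for item, others in similar.items()}
        self._stale = False

    def recommend(self, user, n=3):
        """Serve from the cache, tailored to the user by dropping items already seen."""
        if self._stale:
            self._rebuild()
        seen = self.views_by_user.get(user, set())
        scores = defaultdict(int)
        for item in seen:
            for other, count in self._similar.get(item, {}).items():
                if other not in seen:
                    scores[other] += count
        # Highest score first; ties broken alphabetically so results are stable
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [item for item, _ in ranked[:n]]


# Hypothetical viewing data, made up for illustration
views = {
    "ann": ["news", "sport", "film"],
    "bob": ["news", "sport"],
    "cat": ["sport", "film", "drama"],
    "dan": ["news"],
}
recommender = ItemRecommender(views)
print(recommender.recommend("dan"))
print(recommender.recommend("eve"))  # unknown user: the "cold start" problem
```

A user with no history gets an empty list, which is the "cold start" problem from the earlier slide in miniature. In a batch setting the rebuild step would be the expensive offline job and the serving step would read from an ordinary cache. Here both live in one object purely to keep the sketch short.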