Your SlideShare is downloading. ×
0
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Design Patterns for Large-Scale Real-Time Learning

334

Published on

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1jupcVS. …

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1jupcVS.

Sean Owen provides examples of operational analytics projects in the field, presenting a reference architecture and algorithm design choices for a successful implementation based on his experience with customers and Oryx/Cloudera. Filmed at qconlondon.com.

Sean Owen is Director of Data Science at Cloudera, based in London. Before Cloudera, he founded Myrrix Ltd, a company commercializing large-scale real-time recommender systems on Apache Hadoop. He has been a primary committer and VP for Apache Mahout, and co-author of Mahout in Action. Previously, Sean was a senior engineer at Google.

Published in: Technology, Education
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
334
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
0
Comments
0
Likes
3
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. 1  
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /pattern-real-time-learning http://www.infoq.com/presentati ons/nasa-big-data http://www.infoq.com/presentati ons/nasa-big-data
  • 3. Presented at QCon London www.qconlondon.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. 2 Design  Pa+erns  for  Large-­‐Scale     Real-­‐Time  Learning   QCon  London  2014   Sean  Owen  /  Director  of  Data  Science  /  Cloudera  
  • 5. 3 What  We  Talk  About  When     We  Talk  About  Data  Science  
  • 6. 4 www.quora.com/Data-­‐Science/What-­‐is-­‐the-­‐difference-­‐between-­‐a-­‐data-­‐scienJst-­‐and-­‐a-­‐staJsJcian  
  • 7. 5
  • 8. 6    tist
  • 9. Data  Science  Is  Exploratory  Analy-cs?   7   www.tc.umn.edu/~zief0002/Comparing-­‐Groups/blog.html   thenextweb.com/microsoS/2013/07/08/microsoS-­‐brings-­‐the-­‐office-­‐store-­‐to-­‐22-­‐new-­‐markets-­‐adds-­‐power-­‐bi-­‐an-­‐intelligence-­‐tool-­‐to-­‐office-­‐365/  
  • 10. Example:  Drug  InteracJons   8   Cloudera  analysis  of  FDA  drug   data:  “Our  analysis  revealed  a  few   drug  pairs  with  surprisingly  high   correlaJons  with  adverse  events   that  did  not  show  up  in  a  search  of   the  academic  literature:   gabapenJn  (a  seizure  medicaJon)   taken  in  conjuncJon  with   hydrocodone/paracetamol  was   correlated  with  memory   impairment,  and  haloperidol  in   conjuncJon  with  lorazepam  was   correlated  with  the  paJent   entering  into  a  coma.”   blog.cloudera.com/blog/2011/11/using-­‐hadoop-­‐to-­‐analyze-­‐adverse-­‐drug-­‐events/  
  • 11. 9
  • 12. Example:  Data  Science  in  the  Field   10   •  [Large  European  e-­‐commerce  site]   •  Wants  real-­‐Jme  recommendaJons    for  new  and  returning  users   •  Data  streamed  from  web  server  via     Flume  to  HDFS   •  MulJple  data  sources   •  100K+  products,  20M  users   Exploratory?  
  • 13. Example:   11   •  Search,  ML  over  PaJent  Data   •  MapReduce  for  indexing,  learning   •  HBase  for  storage  and  fast  access   •  Also:  Storm  for     incremental  update   •  And:  relaJonal  DB  for   most  recent  derived  data   •  API  façade  for  input;   API  for  querying  learning   engineering.cerner.com/2013/02/near-­‐real-­‐Jme-­‐processing-­‐over-­‐hadoop-­‐and-­‐hbase/  Engineering   Machine  Learning  
  • 14. 12 Adding  OperaJonal  AnalyJcs  
  • 15. 2014:  Lab  to  Factory   13  
  • 16. Data  Science  Will  Be  Opera-onal  Analy-cs   14  
  • 17. I  Built  A  Model  On  Hadoop.  Now  What?   15   Build  Model   Query  Model  Collect  Input   Repeat   ?   ?   ?  
  • 18. 16 Example:  Oryx  
  • 19. 17   www.mw+l.com/wp-­‐content/uploads/2013/11/IMG_5446_edited-­‐2_mw+l.jpg  
  • 20. Gaps  to  fill,  and  Goals   18   •  Model  Building   •  Large-­‐scale   •  Con-nuous   •  Apache  Hadoop™-­‐based   •  Few,  good  algorithms   •  Model  Serving   •  Real-­‐-me  query   •  Real-­‐-me  update   •  Algorithms   •  Parallelizable   •  Updateable   •  Works  on  diverse  input   •  Interoperable   •  PMML  model  format   •  Simple  REST  API   •  Open  source  
  • 21. Large-­‐Scale  or  Real-­‐Time?   19   Large-­‐Scale   Offline   Batch   Real-­‐Time   Online   Streaming   vs   Why  Don’t  We  Have  Both?   λ!  
  • 22. Lambda  Architecture   20   •  Batch,  Stream     Processing  are  different   •  Tackle  separately  in     2+  Layers   •  Batch  Layer:  offline,   asynchronous   •  Serving  /  Speed  Layer:   real-­‐Jme,  incremental,   approximate   jameskinley.tumblr.com/post/37398560534/the-­‐lambda-­‐architecture-­‐principles-­‐for-­‐architecJng   …  λ?  
  • 23. 21   Batch   Serving/Speed  
  • 24. Two  Layers   22   •  Computa-on  Layer   •  Java-­‐based  server  process   •  Client  of  Hadoop  2.x   •  Periodically  builds   “generaJon”  from  recent   data  and  past  model   •  Baby-­‐sits  MapReduce*   jobs  (or,  locally  in-­‐core)   •  Publishes  models   •  Serving  Layer   •  Apache  Tomcat™-­‐based   server  process   •  Consumes  models  from   HDFS  (or  local  FS)   •  Serves  queries  from   model  in  memory   •  Updates  from  new  input   •  Also  writes  input  to  HDFS   •  Replicas  for  scale   *  Apache  Spark  later  
  • 25. CollaboraJve  Filtering  :  ALS   23   •  AlternaJng  Least  Squares   •  Latent-­‐factor  model   •  Accepts  implicit  or     explicit  feedback   •  Real-­‐Jme  update     via  fold-­‐in  of  input   •  No  cold-­‐start   •  Parallelizable   YT   X  
  • 26. Clustering  :  k-­‐means++   24   •  Well-­‐known  and   understood   •  Parallelizable   •  Clusters  updateable   cwiki.apache.org/confluence/display/MAHOUT/K-­‐Means+Clustering  
  • 27. ClassificaJon  /  Regression  :  RDF   25   •  Random  Decision  Forests   •  Ensemble  method   •  Numeric,  categorical     features  and  target     •  Very  parallel   •  Nodes  updateable   •  Works  well  on  many   problems   age$>$30 female? Yes income$>$20000 Yes Yes No
  • 28. PMML   26   •  PredicJve  Modeling   Markup  Language   •  XML-­‐based  format  for   predicJve  models   •  Standardized  by  Data   Mining  Group   (www.dmg.org)   •  Wide  tool  support   <PMML xmlns="http://www.dmg.org/PMML-4_1"! version="4.1">! <Header copyright="www.dmg.org"/>! <DataDictionary numberOfFields="5">! <DataField name="temperature"! optype="continuous"! dataType="double"/>! …! </DataDictionary>! <TreeModel modelName="golfing"! functionName="classification">! <MiningSchema>! <MiningField name="temperature"/>! … ! </MiningSchema>! <Node score="will play">! <Node score="will play">! <SimplePredicate field="outlook"! operator="equal" ! value="sunny"/>! …! </Node>! </Node>! </TreeModel>! </PMML>! www.dmg.org/v4-­‐1/TreeModel.html  
  • 29. Extra:  Apache  Spark  as  “Crossover  Hit”   27   •  Exploratory-­‐friendly   •  REPL   •  Scala  closures   •  MLlib   •  OperaJonal-­‐friendly   •  Distributed   •  Hadoop  integraJon   •  All  Java  libraries  available   blog.cloudera.com/blog/2014/03/why-­‐apache-­‐spark-­‐is-­‐a-­‐crossover-­‐hit-­‐for-­‐data-­‐scienJsts/  
  • 30. Thanks!   28   ?  
  • 31. 29  
  • 32. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/pattern- real-time-learning

×