  
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/pattern-real-time-learning
http://www.infoq.com/presentations/nasa-big-data
Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Design Patterns for Large-Scale Real-Time Learning
QCon London 2014
Sean Owen / Director of Data Science / Cloudera
  
What We Talk About When We Talk About Data Science
  
www.quora.com/Data-Science/What-is-the-difference-between-a-data-scientist-and-a-statistician
  
Data Science Is Exploratory Analytics?
www.tc.umn.edu/~zief0002/Comparing-Groups/blog.html
thenextweb.com/microsoft/2013/07/08/microsoft-brings-the-office-store-to-22-new-markets-adds-power-bi-an-intelligence-tool-to-office-365/
  
Example: Drug Interactions
Cloudera analysis of FDA drug data: “Our analysis revealed a few drug pairs with surprisingly high correlations with adverse events that did not show up in a search of the academic literature: gabapentin (a seizure medication) taken in conjunction with hydrocodone/paracetamol was correlated with memory impairment, and haloperidol in conjunction with lorazepam was correlated with the patient entering into a coma.”
blog.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-events/
  
Example: Data Science in the Field
•  [Large European e-commerce site]
•  Wants real-time recommendations for new and returning users
•  Data streamed from web server via Flume to HDFS
•  Multiple data sources
•  100K+ products, 20M users
Exploratory?
  
Example:
•  Search, ML over Patient Data
•  MapReduce for indexing, learning
•  HBase for storage and fast access
•  Also: Storm for incremental update
•  And: relational DB for most recent derived data
•  API façade for input; API for querying learning
engineering.cerner.com/2013/02/near-real-time-processing-over-hadoop-and-hbase/
Engineering
Machine Learning
  
Adding Operational Analytics
2014: Lab to Factory
Data Science Will Be Operational Analytics
  
I Built A Model On Hadoop. Now What?
[Diagram: Build Model, Query Model, Collect Input, Repeat; three “?” marks]
Example: Oryx
www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg
  
Gaps to fill, and Goals
•  Model Building
   •  Large-scale
   •  Continuous
   •  Apache Hadoop™-based
   •  Few, good algorithms
•  Model Serving
   •  Real-time query
   •  Real-time update
•  Algorithms
   •  Parallelizable
   •  Updateable
   •  Works on diverse input
•  Interoperable
   •  PMML model format
   •  Simple REST API
   •  Open source
  
Large-Scale or Real-Time?
Large-Scale / Offline / Batch
vs
Real-Time / Online / Streaming
Why Don’t We Have Both?
λ!
  
Lambda Architecture
•  Batch, Stream Processing are different
•  Tackle separately in 2+ Layers
•  Batch Layer: offline, asynchronous
•  Serving / Speed Layer: real-time, incremental, approximate
jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting
… λ?
  
[Diagram: Batch layer and Serving/Speed layer]
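As a rough illustration of the Batch versus Serving/Speed split above, the following Python sketch keeps a batch view that is rebuilt from all stored input and a speed view that is updated incrementally, and merges the two at query time. The class and method names (`LambdaCounter`, `record`, `run_batch`, `query`) are hypothetical and come from no framework named in the talk; a real batch layer would run asynchronously on a cluster rather than in-process.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda Architecture that counts events per key (illustrative only)."""

    def __init__(self):
        self.master_log = []         # append-only input; stands in for HDFS
        self.batch_view = Counter()  # rebuilt from scratch by the batch layer
        self.speed_view = Counter()  # incremental counts since the last batch run

    def record(self, key):
        """Speed layer: store the event and update the real-time view immediately."""
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Batch layer: recompute the view from all input, then clear the speed view."""
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, key):
        """Serving: merge the batch result with the recent incremental delta."""
        return self.batch_view[key] + self.speed_view[key]
```

Queries stay correct between batch runs because the speed view only has to cover input that arrived after the most recent rebuild.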
  
Two Layers
•  Computation Layer
   •  Java-based server process
   •  Client of Hadoop 2.x
   •  Periodically builds “generation” from recent data and past model
   •  Baby-sits MapReduce* jobs (or, locally in-core)
   •  Publishes models
•  Serving Layer
   •  Apache Tomcat™-based server process
   •  Consumes models from HDFS (or local FS)
   •  Serves queries from model in memory
   •  Updates from new input
   •  Also writes input to HDFS
   •  Replicas for scale
* Apache Spark later
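The division of labor between the two layers can be sketched in a few lines of Python. This is not Oryx code: the function and class names (`build_generation`, `ServingLayer`), the JSON file layout, and the trivial popularity-count "model" are all assumptions made for illustration, with a local directory standing in for HDFS.

```python
import json
import os


def build_generation(model_dir, input_path, generation):
    """Computation layer: combine the past model with recent input, then publish."""
    past_path = os.path.join(model_dir, f"model-{generation - 1}.json")
    counts = {}
    if os.path.exists(past_path):
        with open(past_path) as f:
            counts = json.load(f)
    if os.path.exists(input_path):
        with open(input_path) as f:
            for line in f:
                item = line.strip()
                if item:
                    counts[item] = counts.get(item, 0) + 1
        os.remove(input_path)  # this input now belongs to the new generation
    with open(os.path.join(model_dir, f"model-{generation}.json"), "w") as f:
        json.dump(counts, f)


class ServingLayer:
    """Serving layer: hold the model in memory, answer queries, accept updates."""

    def __init__(self, model_dir, input_path):
        self.model_dir = model_dir
        self.input_path = input_path
        self.counts = {}

    def load_generation(self, generation):
        """Consume a published model from shared storage."""
        with open(os.path.join(self.model_dir, f"model-{generation}.json")) as f:
            self.counts = json.load(f)

    def update(self, item):
        """Update the in-memory model now and persist the input for the next build."""
        self.counts[item] = self.counts.get(item, 0) + 1
        with open(self.input_path, "a") as f:
            f.write(item + "\n")

    def most_popular(self, n=1):
        """Answer a query from memory only; ties are broken alphabetically."""
        return sorted(self.counts, key=lambda k: (-self.counts[k], k))[:n]
```

Because the serving side never trains, several replicas could load the same published generation, which is one way to read "Replicas for scale".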
  
Collaborative Filtering : ALS
•  Alternating Least Squares
•  Latent-factor model
•  Accepts implicit or explicit feedback
•  Real-time update via fold-in of input
•  No cold-start
•  Parallelizable
[Diagram: factor matrices X and Y^T]
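One common way to read "fold-in" is: hold the item factors Y fixed and solve a small regularized least-squares problem for a single user's row of X. The Python sketch below does that for explicit feedback, solving (Ys^T Ys + reg * I) x = Ys^T r over the rated items. It is an assumption-laden illustration, not the method Oryx uses; the names `fold_in_user`, `solve`, `predict` and the `reg` parameter are hypothetical, and implicit feedback would need additional confidence weighting that is not shown.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fold_in_user(item_factors, ratings, reg=0.1):
    """Estimate one user's factor vector with the item factors held fixed.

    item_factors: dict mapping item -> list of k floats (a row of Y)
    ratings: dict mapping item -> observed strength for this user
    reg must be positive so that the system is always solvable.
    """
    k = len(next(iter(item_factors.values())))
    gram = [[reg if i == j else 0.0 for j in range(k)] for i in range(k)]
    rhs = [0.0] * k
    for item, r in ratings.items():
        y = item_factors[item]
        for i in range(k):
            rhs[i] += y[i] * r
            for j in range(k):
                gram[i][j] += y[i] * y[j]
    return solve(gram, rhs)


def predict(user_vector, item_vector):
    """Predicted strength is the dot product of the two factor vectors."""
    return sum(u * v for u, v in zip(user_vector, item_vector))
```

The cost of one fold-in depends only on the number of factors and the number of items the user touched, which is what makes it plausible to run at request time.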
  
Clustering : k-means++
•  Well-known and understood
•  Parallelizable
•  Clusters updateable
cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
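To make "k-means++" and "clusters updateable" concrete, here is a small Python sketch: seeding picks each new center with probability proportional to its squared distance from the nearest center chosen so far, and an online update then nudges the nearest center toward each new point using a running mean. The names `kmeans_pp_seeds` and `OnlineClusters` are hypothetical, and the running-mean update is one simple choice among several, not something the slides specify.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, k, rng):
    """k-means++ seeding: sample new centers proportionally to D(x)^2."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        total = sum(d2)
        if total == 0:  # every point coincides with an existing center
            centers.append(list(rng.choice(points)))
            continue
        threshold = rng.random() * total
        cumulative = 0.0
        for p, w in zip(points, d2):
            cumulative += w
            if cumulative > threshold:  # zero-weight points are never picked
                centers.append(list(p))
                break
    return centers


class OnlineClusters:
    """Keeps centers updateable: each new point moves its nearest center."""

    def __init__(self, centers):
        self.centers = [list(c) for c in centers]
        self.counts = [1] * len(centers)  # each seed counts as one observation

    def assign(self, p):
        """Index of the nearest center."""
        return min(range(len(self.centers)),
                   key=lambda i: sq_dist(p, self.centers[i]))

    def update(self, p):
        """Move the nearest center toward p by a running-mean step."""
        i = self.assign(p)
        self.counts[i] += 1
        eta = 1.0 / self.counts[i]
        self.centers[i] = [c + eta * (x - c) for c, x in zip(self.centers[i], p)]
        return i
```

The batch layer could periodically re-run full k-means from such seeds, while the serving side applies only the cheap `update` step between rebuilds.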
  
Classification / Regression : RDF
•  Random Decision Forests
•  Ensemble method
•  Numeric, categorical features and target
•  Very parallel
•  Nodes updateable
•  Works well on many problems
[Diagram: example decision tree with tests age > 30, female?, income > 20000 and Yes / No branches]
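A minimal Python sketch of the two ideas on this slide, ensemble voting and updateable nodes, follows. Each tree routes an example to a leaf that stores per-class counts; the forest takes a majority vote; and a new labeled example is folded in by incrementing the count at the leaf it reaches, leaving the tree structure unchanged. The class names (`Leaf`, `Decision`) and the update rule are assumptions for illustration, and the sketch assumes every leaf has seen at least one example.

```python
from collections import Counter


class Leaf:
    """Terminal node holding class label -> number of examples seen."""

    def __init__(self, counts=None):
        self.counts = Counter(counts or {})

    def find_leaf(self, example):
        return self


class Decision:
    """Internal node: a boolean test on the example picks the branch."""

    def __init__(self, test, if_true, if_false):
        self.test = test
        self.if_true = if_true
        self.if_false = if_false

    def find_leaf(self, example):
        branch = self.if_true if self.test(example) else self.if_false
        return branch.find_leaf(example)


def predict_tree(tree, example):
    """Most frequent class at the leaf the example reaches."""
    return tree.find_leaf(example).counts.most_common(1)[0][0]


def predict_forest(trees, example):
    """Ensemble prediction: majority vote over the individual trees."""
    votes = Counter(predict_tree(t, example) for t in trees)
    return votes.most_common(1)[0][0]


def update_tree(tree, example, label):
    """Fold in a new labeled example by updating leaf counts only."""
    tree.find_leaf(example).counts[label] += 1
```

Since every tree is evaluated and updated independently, both prediction and update parallelize naturally across trees.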
PMML
•  Predictive Modeling Markup Language
•  XML-based format for predictive models
•  Standardized by Data Mining Group (www.dmg.org)
•  Wide tool support

<PMML xmlns="http://www.dmg.org/PMML-4_1"
      version="4.1">
  <Header copyright="www.dmg.org"/>
  <DataDictionary numberOfFields="5">
    <DataField name="temperature"
               optype="continuous"
               dataType="double"/>
    …
  </DataDictionary>
  <TreeModel modelName="golfing"
             functionName="classification">
    <MiningSchema>
      <MiningField name="temperature"/>
      …
    </MiningSchema>
    <Node score="will play">
      <Node score="will play">
        <SimplePredicate field="outlook"
                         operator="equal"
                         value="sunny"/>
        …
      </Node>
    </Node>
  </TreeModel>
</PMML>
www.dmg.org/v4-1/TreeModel.html
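Because PMML is plain XML, a consumer can read a model with any XML parser. The Python sketch below walks a tiny tree model using the standard library. The `SAMPLE` document is hand-written in the style of the slide's snippet and is not the elided original; the helper names (`matches`, `score`, `evaluate`) are hypothetical; only the `True` and `SimplePredicate` (`equal`, `notEqual`) predicates are handled, with string comparison only. Returning `None` when no child matches is assumed here to mirror PMML's `returnNullPrediction` behavior.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.dmg.org/PMML-4_1}"  # namespace prefix used by ElementTree

# Hand-written sample for illustration; not the model from the slide.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <Header copyright="example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="outlook" optype="categorical" dataType="string"/>
    <DataField name="play" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="golfing" functionName="classification">
    <MiningSchema>
      <MiningField name="outlook"/>
      <MiningField name="play" usageType="predicted"/>
    </MiningSchema>
    <Node score="will play">
      <True/>
      <Node score="no play">
        <SimplePredicate field="outlook" operator="equal" value="rain"/>
      </Node>
      <Node score="will play">
        <SimplePredicate field="outlook" operator="notEqual" value="rain"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""


def matches(node, record):
    """Evaluate the node's predicate against a dict of field values."""
    if node.find(NS + "True") is not None:
        return True
    pred = node.find(NS + "SimplePredicate")
    if pred is None:
        raise ValueError("only True and SimplePredicate are handled in this sketch")
    actual = record.get(pred.get("field"))
    op = pred.get("operator")
    if op == "equal":
        return actual == pred.get("value")
    if op == "notEqual":
        return actual != pred.get("value")
    raise ValueError("unsupported operator: " + op)


def score(node, record):
    """Descend into the first child whose predicate holds; leaves give the score."""
    children = node.findall(NS + "Node")
    if not children:
        return node.get("score")
    for child in children:
        if matches(child, record):
            return score(child, record)
    return None  # no child matched: no prediction


def evaluate(pmml_text, record):
    """Parse the document and score one record against its first TreeModel."""
    root = ET.fromstring(pmml_text)
    top = root.find(NS + "TreeModel").find(NS + "Node")
    return score(top, record) if matches(top, record) else None
```

For example, `evaluate(SAMPLE, {"outlook": "rain"})` follows the first child node of the sample tree. A production consumer would more likely rely on an existing PMML library than on a hand-rolled evaluator like this one.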
  
Extra: Apache Spark as “Crossover Hit”
•  Exploratory-friendly
   •  REPL
   •  Scala closures
   •  MLlib
•  Operational-friendly
   •  Distributed
   •  Hadoop integration
   •  All Java libraries available
blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
  
Thanks!
?
  
