MLconf NYC Ted Willke

  1. CONTEXT SEMANTICS!
  2. Danny : isBrotherOf : Nezih
     food cart : uses : bicycles
     Frank : isFriendsWith : Mohit
     Frank : isFriendsWith : Ted
     Frank : likes : bicycles
     Frank : likes : food carts
     Ivy : isFriendsWith : Kushal
     Ivy : isFriendsWith : Ted
     Ivy : likes : bicycles
     Ivy : likes : food carts
     Kushal : isFriendsWith : Mohit
     Kushal : isFriendsWith : Nezih
     Nezih : isFriendsWith : Ted
     Ted : likes : bicycles
  3. This model... ... infers this interest.
     [Graph: Ted, Kushal, Mohit, Danny, Ivy, Frank and Nezih joined by "friends" and "brothers" edges; "likes" edges from people to Food Cart and Bicycles; Food Cart "uses" Bicycles; one candidate edge marked "Likes?"]
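The triples on slide 2 can be loaded into a small in-memory index to show how an interest might be inferred. The sketch below is an assumption, not the model used in the talk: it scores a candidate `likes` edge by the fraction of a person's friends who already like the item. The names `index_triples` and `friend_vote`, and the choice to treat `isFriendsWith` as symmetric, are illustrative.

```python
from collections import defaultdict

# Triples copied from slide 2 (subject, predicate, object).
TRIPLES = [
    ("Danny", "isBrotherOf", "Nezih"),
    ("food cart", "uses", "bicycles"),
    ("Frank", "isFriendsWith", "Mohit"),
    ("Frank", "isFriendsWith", "Ted"),
    ("Frank", "likes", "bicycles"),
    ("Frank", "likes", "food carts"),
    ("Ivy", "isFriendsWith", "Kushal"),
    ("Ivy", "isFriendsWith", "Ted"),
    ("Ivy", "likes", "bicycles"),
    ("Ivy", "likes", "food carts"),
    ("Kushal", "isFriendsWith", "Mohit"),
    ("Kushal", "isFriendsWith", "Nezih"),
    ("Nezih", "isFriendsWith", "Ted"),
    ("Ted", "likes", "bicycles"),
]


def index_triples(triples):
    """Build friend and interest lookups; friendship is assumed symmetric."""
    friends = defaultdict(set)
    likes = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "isFriendsWith":
            friends[subject].add(obj)
            friends[obj].add(subject)
        elif predicate == "likes":
            likes[subject].add(obj)
    return friends, likes


def friend_vote(person, item, triples=TRIPLES):
    """Hypothetical score: share of the person's friends who like the item."""
    friends, likes = index_triples(triples)
    circle = friends.get(person, set())
    if not circle:
        return 0.0
    return sum(item in likes.get(f, set()) for f in circle) / len(circle)
```

With these triples, `friend_vote("Ted", "food carts")` is 2/3: Frank and Ivy like food carts, Nezih does not. A neighbor vote like this ignores the `uses` and `isBrotherOf` edges, which a richer graphical model could exploit.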
  4. Virtuous cycle of data
     [Cycle diagram labels: CLOUD; Richer data to analyze; CLIENTS; Richer data from devices; Richer user experiences; INTELLIGENT SYSTEMS]
  5. SEMANTIC INFORMATION IS FUEL FOR THE CYCLE
  6. [Timeline: 1985, 1995, 2005, 2015; labels: enterprise, NoSQL, Docs + Semantics, RDF]
     WIDESPREAD MACHINE LEARNING ON THIS
  7. IMAGINE THE POSSIBILITIES
  8. Graph centrality
     [Figure: graph of channel viewing behavior, shaded by program importance (centrality) from high to low; current popular surfing patterns; program IDs SH002463130000 and EP005544723744]
     Changes in surfing behavior may predict customer churn.
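The slide does not say which centrality measure was used. As one possibility, the sketch below ranks programs with PageRank computed by power iteration over a channel-switching graph. The `transitions` input format, the program names, and the damping factor are all assumptions made for illustration.

```python
def pagerank(transitions, damping=0.85, iterations=50):
    """PageRank by power iteration.

    transitions maps each program to the list of programs viewers
    switched to from it (a hypothetical input format).
    """
    nodes = sorted(set(transitions) | {d for dsts in transitions.values() for d in dsts})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives the teleport share first.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A node with no outgoing edges spreads its rank evenly.
            targets = transitions.get(src) or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Toy surfing data, invented for this example.
SURFING = {
    "news": ["sports"],
    "movie": ["sports"],
    "sports": ["news"],
    "cartoon": ["sports", "movie"],
}
RANK = pagerank(SURFING)
```

In this toy graph `sports` ends up with the highest score because most switches lead to it. Comparing such scores across time windows is one way to look for the changes in surfing behavior the slide mentions.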
  9. Preference and Similarity Recommendations
     [Figure: User and Movie vertices; 1.7MM Nodes, 23.9MM Edges; edge types: prefers, similar cast, similar topic]
     userId: A0A22A5
     title: The Godfather, genre: Crime drama, cast: [M. Brando, Al Pacino]
     title: Scarface, genre: Crime drama, cast: [Al Pacino, M. Pfeiffer]
     title: The Departed, genre: Crime drama, cast: [L. DiCaprio, M. Damon]
     Edge weights shown: 11.8, 0.67, 0.03, 14.98
     Min-cost path search
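A min-cost path search over a weighted preference graph can be sketched with Dijkstra's algorithm. This is a minimal illustration, not the toolkit's implementation. The slide does not say which edge carries which weight, so the edge costs below are placeholders, and the `user:` prefix and adjacency-dict layout are assumptions.

```python
import heapq


def min_cost_path(graph, source, target):
    """Dijkstra's algorithm; returns (cost, path), or (inf, []) if unreachable."""
    queue = [(0.0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in settled:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []


# Placeholder costs: lower means a stronger preference or similarity.
GRAPH = {
    "user:A0A22A5": {"The Godfather": 1.0},                   # prefers
    "The Godfather": {"Scarface": 0.5, "The Departed": 3.0},  # similar cast / topic
    "Scarface": {"The Godfather": 0.5, "The Departed": 1.0},
    "The Departed": {"The Godfather": 3.0, "Scarface": 1.0},
}

COST, PATH = min_cost_path(GRAPH, "user:A0A22A5", "The Departed")
```

With these placeholder costs the cheapest route to The Departed goes through The Godfather and Scarface at a total cost of 2.5, cheaper than the direct similar-topic edge from The Godfather (4.0 in total).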
  10. Loopy Belief Propagation on the (semantic) web
      URL Ground-Truth Data, IP/Domain Reputations
      420MM Records, 74.5MM Nodes, 185MM Edges
      URL, Domain, IP Address
      Calculation of priors, LBP Messaging
      84.231.82.93, 86.39.155.137, forum.vsichko.com, hermansonskok.se, euskzzbz.nonetheups.com, keesenbep.spaces.live.com
  11. Loopy Belief Propagation on the (semantic) web
  12. A yoga ball graph. Really!?!
  13. You may actually need this
      • When the problem is an information network
      • When a graph is a natural way of expressing the algorithm
      • When you want to study specific relationships
      • When you want faster machine learning or solvers on sparse data
      [Figure labels: shortest path, central influence, sub networks, triangle count]
  14. But there are challenges.
      Handling all that data.
      Finding people good at both handling all that data and data analysis.
      Putting exploratory work into production fast enough to keep up with the competition.
  15. Congratulations! You are a data scientist!
  16. It’s a demanding job
      [Workflow stages: Ingest & Clean, Engineer Features, Structure Model, Train Model, Query & Analyze, Learn, Visualize]
      Skills shortage at intersection of systems engineering and data analysis
      Painful data ingestion and preparation
      Workflows that are not designed with loopbacks in mind
      Few tools for analyzing semantics at scale
      Composing pipeline is DIY
  17. Decomposing the “data scientist”
      Source: 2013 Report from Accenture Institute for High Performance
  18. IMAGINE A PLATFORM FOR DATA SCIENTISTS
      DOCS + SEMANTICS + MACHINE LEARNING
  19. Ease-of-use: Making big data familiar
      Python, R, Dataflow, GUI, ...
      Datacenter / Cloud, Network, Client
      BIG DATA API
      Connect, Manage, Secure, Analyze (distributed and parallel)
      Manage, Secure, Connect, Analyze (local)
      Query, Big Data Java/Scala/C++, Computational Frameworks, Big Data Algorithms, Cluster Workload Mgmt, Cluster Storage, Machine Learning & Statistics, Data Wrangling
      Analyst Skills, The Other Skills
  20. Delivering it
      FILESYSTEMS AND NOSQL STORAGE
      HW PLATFORM
      APACHE HADOOP, APACHE SPARK
      DATA WRANGLING
      MACHINE LEARNING AND STATISTICS
      Graphical Algorithms, Classical Algorithms, Graph Construction Tools, Useful String Manipulation, Useful Math Operators
      BIG DATA API
      DATA SCIENCE SERVER (Query and Scripting)
      Intel Analytics Toolkit
      A UNIFIED DOCUMENT + SEMANTIC STORE
      The Ask
  21. Bringing a full spectrum of possibilities
      Algorithm | Category | Applications/Use Cases
      Loopy Belief Propagation (LBP) | Structured Prediction | Personalized recs, image de-noising
      Label Propagation | Structured Prediction | Personalized recommendations
      Alternating Least Squares (ALS) | Collaborative Filtering | Recommenders
      Conjugate Gradient Descent (CGD) | Collaborative Filtering | Recommenders
      Connected Components | Graph Analytics | Network manipulation, image analysis
      Latent Dirichlet Allocation (LDA) | Topic Modeling | Document Clustering
      Structure Attribute | Clustering | Network analysis, consumer seg
      K-Truss | Clustering | Social network analysis
      KNN* | Clustering | Recommenders
      Logistic Regression* | Classification | Fraud detection
      Random Forest* | Classification | Fraud detection, consumer seg
      Generalized Linear Model (Binomial, Poisson) | Non-linear Curve Fitting | Forecasting, pricing, market mix models
      Association Rule Mining | Data Mining | Market basket analysis, recommenders
      Frequent Pattern Mining* | Data Mining | Pattern Recognition
      [Approach column label visible on the slide: Graph]
  22. Article Tagging Problem
      • Articles are tagged by experts with MeSH terms, drawn from a hierarchical controlled vocabulary of 55,000 keywords
      • Process is resource-intensive – can we automate it?
      • Categorize articles into a hierarchy that matches the same categorization from the MeSH controlled vocabulary
  23. [Chart: Article Count by Hierarchy Level]
  24. Demo: Graph Analytics For Medical Journal Analysis
      [Workflow stages: INGEST & CLEAN, ENGINEER FEATURES, STRUCTURE GRAPH, QUERY & ANALYZE, LEARN, VISUALIZE]
      PARSE AND EXTRACT WORDS
      • Medline™ XML
      • MeSH Ontology XML
      CREATE ARTICLE/WORD LIST
      • Create list of unique words
      • Stemming and lemmatization
      • Index word list
      • Transform articles into list of article/word pairs
      BUILD GRAPH
      • Extract vertices
      • Assign id columns to vertex property
      • Assign year and count edge properties
      QUERY/VISUALIZE DATA
      • Gremlin query for each visual
      • Python web server and other libraries
      DETECT CLUSTERS USING LDA
      • Select optimization parameters
      • Invoke LDA
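The "create article/word list" and "build graph" steps can be sketched in plain Python. This is an illustration under stated assumptions, not the toolkit's API: the input layout `{article_id: (year, text)}`, the function names, and the sample articles are hypothetical. Stemming and lemmatization are omitted to keep the example dependency-free.

```python
import re
from collections import Counter


def article_word_pairs(articles):
    """Yield (article_id, year, word, count) rows from {id: (year, text)}."""
    for article_id, (year, text) in articles.items():
        words = re.findall(r"[a-z]+", text.lower())
        for word, count in sorted(Counter(words).items()):
            yield article_id, year, word, count


def build_graph(rows):
    """Build a bipartite article/word graph with year and count on each edge."""
    vertices = {}  # (kind, name) -> integer vertex id
    edges = []
    for article_id, year, word, count in rows:
        src = vertices.setdefault(("article", article_id), len(vertices))
        dst = vertices.setdefault(("word", word), len(vertices))
        edges.append({"src": src, "dst": dst, "year": year, "count": count})
    return vertices, edges


# Sample input invented for this example.
ARTICLES = {
    "a1": (2012, "Sleep stages and REM sleep"),
    "a2": (2013, "Circadian rhythm and sleep"),
}
VERTICES, EDGES = build_graph(article_word_pairs(ARTICLES))
```

Each article and each distinct word becomes one vertex, and every article/word pair becomes an edge carrying the publication year and the word count, which is the shape a topic model such as LDA would consume.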
  25. The Playbook?
      [Steps: PARSE AND EXTRACT WORDS, CREATE ARTICLE/WORD LIST, BUILD GRAPH, QUERY/VISUALIZE DATA, DETECT CLUSTERS USING LDA]
      Parse, Prepare graph data, Basic analysis, Run LDA, INSIGHTFUL RESULT
      This never happens!
  26. The Real Playbook
      [Steps: PARSE AND EXTRACT WORDS, CREATE ARTICLE/WORD LIST, BUILD GRAPH, QUERY/VISUALIZE DATA, DETECT CLUSTERS USING LDA]
      Parse, Correct mistake, Prepare graph data, Correct schema mistake, Correct aggregation mistake, Data validation, Correct dataset mistake, Guess LDA settings, Tune and re-run, Detect bias in dataset
  27. WE NEED THE AGILITY OF INTERACTIVE SCRIPTING AND THE BRAINS AND BRAWN OF SCALABLE GRAPH ANALYTICS
  28. Build Frame
  29. Build Graph
  30. Query Vertices
  31. LDA with 3 Topics
  32. LDA with 5 Topics
  33. LDA with 7 Topics
  34. Query Vertices Again – Now with ML Properties
  35. Following Analysis
      [Bar chart, axis from 0 to 0.08: Top MeSH terms that predict which category an article will be assigned]
      Wakefulness, Sleep, Animals, Electroencephalography, Circadian Rhythm, Arousal, Sleep Stages, REM, Mental Recall, Attention, Rats, Child, Evoked Potentials, Aged, Schizophrenia, Ocular, Conditioning, Infant, Psychophysics, Dreams
  36. Reimagining 2014
      New partnerships in big data
      Contributions to the open source community
      The Intel Analytics Toolkit – COMING SOON
      SEMANTICS + MACHINE LEARNING TOGETHER AT LAST!
  37. INTERESTED IN THE INTEL ANALYTICS TOOLKIT? THEODORE.L.WILLKE@INTEL.COM
  38. Legal Disclaimers
      All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
      Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number
      Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
      Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM). Functionality, performance or other benefits will vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For more information, visit http://www.intel.com/go/virtualization
      No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT-compatible measured launched environment (MLE). Intel TXT also requires the system to contain a TPM v1.s. For more information, visit http://www.intel.com/technology/security
      Intel, Intel Xeon, Intel Atom, Intel Xeon Phi, Intel Itanium, the Intel Itanium logo, the Intel Xeon Phi logo, the Intel Xeon logo and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others.
      Copyright © 2013, Intel Corporation. All rights reserved.
