MLconf NYC Ted Willke
Published in Technology, Education

Transcript

  • 1. CONTEXT SEMANTICS!
  • 2. Danny : isBrotherOf : Nezih; food cart : uses : bicycles; Frank : isFriendsWith : Mohit; Frank : isFriendsWith : Ted; Frank : likes : bicycles; Frank : likes : food carts; Ivy : isFriendsWith : Kushal; Ivy : isFriendsWith : Ted; Ivy : likes : bicycles; Ivy : likes : food carts; Kushal : isFriendsWith : Mohit; Kushal : isFriendsWith : Nezih; Nezih : isFriendsWith : Ted; Ted : likes : bicycles
  • 3. This model... infers this interest. [Diagram: Ted, Kushal, Mohit, Danny, Ivy, Frank, and Nezih connected by "friends" and "brothers" edges, with "likes" edges to Bicycles and Food Cart, a "uses" edge from Food Cart to Bicycles, and a "Likes?" edge marking the inferred interest.]
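
A minimal sketch of how an interest could be inferred from the triples on slide 2. It is not the model used in the talk: it assumes friendship is symmetric and simply counts how many of a person's friends like something the person is not yet recorded as liking. The function names are hypothetical.

```python
from collections import Counter, defaultdict

# Triples transcribed from slide 2 as (subject, predicate, object).
TRIPLES = [
    ("Danny", "isBrotherOf", "Nezih"),
    ("food cart", "uses", "bicycles"),
    ("Frank", "isFriendsWith", "Mohit"),
    ("Frank", "isFriendsWith", "Ted"),
    ("Frank", "likes", "bicycles"),
    ("Frank", "likes", "food carts"),
    ("Ivy", "isFriendsWith", "Kushal"),
    ("Ivy", "isFriendsWith", "Ted"),
    ("Ivy", "likes", "bicycles"),
    ("Ivy", "likes", "food carts"),
    ("Kushal", "isFriendsWith", "Mohit"),
    ("Kushal", "isFriendsWith", "Nezih"),
    ("Nezih", "isFriendsWith", "Ted"),
    ("Ted", "likes", "bicycles"),
]


def build_indexes(triples):
    """Index friendships and likes by person."""
    friends = defaultdict(set)
    likes = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "isFriendsWith":
            # Assumption: friendship holds in both directions.
            friends[subject].add(obj)
            friends[obj].add(subject)
        elif predicate == "likes":
            likes[subject].add(obj)
    return friends, likes


def suggest_interests(person, triples):
    """Count friends' likes that the person does not already have."""
    friends, likes = build_indexes(triples)
    votes = Counter()
    for friend in friends[person]:
        for item in likes[friend]:
            if item not in likes[person]:
                votes[item] += 1
    return votes.most_common()


print(suggest_interests("Ted", TRIPLES))  # [('food carts', 2)]
```

Two of Ted's three friends (Frank and Ivy) like food carts, so that is the only candidate the vote produces.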
  • 4. Virtuous cycle of data CLOUD Richer data to analyze CLIENTS Richer data from devices Richer user experiences INTELLIGENT SYSTEMS
  • 5. SEMANTIC INFORMATION IS FUEL FOR THE CYCLE
  • 6. [Timeline, 1985 / 1995 / 2005 / 2015: enterprise, NoSQL, Docs + Semantics, RDF] WIDESPREAD MACHINE LEARNING ON THIS
  • 7. IMAGINE THE POSSIBILITIES
  • 8. Graph centrality. Graph of channel viewing behavior, with program importance (centrality) ranging from high to low, showing current popular surfing patterns (example IDs: SH002463130000, EP005544723744). Changes in surfing behavior may predict customer churn.
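
The slide does not say which centrality measure was used. As one possibility, here is a sketch of PageRank by power iteration over a weighted graph of channel switches. The channel names and switch counts are invented for illustration.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over weighted directed edges.

    edges maps (source, destination) to a positive weight, here the
    number of observed switches from one channel to another.
    """
    nodes = sorted({node for edge in edges for node in edge})
    count = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (source, _destination), weight in edges.items():
        out_weight[source] += weight

    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for (source, destination), weight in edges.items():
            share = weight / out_weight[source]
            new_rank[destination] += damping * rank[source] * share
        rank = new_rank
    return rank


# Hypothetical switch counts between four channels.
switches = {
    ("news", "sports"): 30,
    ("drama", "sports"): 20,
    ("kids", "sports"): 5,
    ("sports", "news"): 10,
    ("news", "drama"): 5,
    ("kids", "drama"): 2,
}

ranks = pagerank(switches)
for channel in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{channel}: {ranks[channel]:.3f}")
```

Recomputing the ranks on successive time windows and comparing them is one way to surface the changes in surfing behavior that the slide mentions.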
  • 9. Preference and Similarity Recommendations. User and Movie graph: 1.7MM nodes, 23.9MM edges. Edge types: prefers, similar cast, similar topic. Example vertices: userId: A0A22A5; title: The Godfather, genre: Crime drama, cast: [M. Brando, Al Pacino]; title: Scarface, genre: Crime drama, cast: [Al Pacino, M. Pfeiffer]; title: The Departed, genre: Crime drama, cast: [L. DiCaprio, M. Damon]. Example edge weights: 11.8, 0.67, 0.03, 14.98. Min-cost path search.
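
A sketch of min-cost path search using Dijkstra's algorithm. The slide does not show which edge carries which weight, or whether the toolkit uses Dijkstra at all, so the graph below reuses the four weights from the slide on edges chosen purely for illustration and treats each weight as a cost.

```python
import heapq


def min_cost_path(graph, source, target):
    """Dijkstra's algorithm; graph maps node -> {neighbor: non-negative cost}."""
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for neighbor, edge_cost in graph.get(node, {}).items():
            candidate = cost + edge_cost
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if target not in best:
        return None, float("inf")
    # Walk the predecessor links back from the target.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1], best[target]


# Assumed edge placement; only the weight values come from the slide.
graph = {
    "user:A0A22A5": {"The Godfather": 0.03},
    "The Godfather": {"Scarface": 0.67, "The Departed": 14.98},
    "Scarface": {"The Departed": 11.8},
}

path, cost = min_cost_path(graph, "user:A0A22A5", "The Departed")
print(path, round(cost, 2))
```

With these assumed edges the route through Scarface (total about 12.5) is cheaper than the direct edge from The Godfather (total about 15.01).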
  • 10. IP/Domain Reputations. URL ground-truth data: 420MM records, 74.5MM nodes, 185MM edges. Vertex types: URL, Domain, IP Address. Steps: calculation of priors, LBP messaging. Loopy Belief Propagation on the (semantic) web. Example vertices: 84.231.82.93, 86.39.155.137, forum.vsichko.com, hermansonskok.se, euskzzbz.nonetheups.com, keesenbep.spaces.live.com
  • 11. Loopy Belief Propagation on the (semantic) web
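
A compact sketch of sum-product loopy belief propagation for binary labels, in the spirit of the "calculation of priors" and "LBP messaging" steps on slide 10. It is not the toolkit's implementation. The `coupling` parameter, the placeholder domain names, the documentation-range IP addresses, and the prior values are all assumptions.

```python
def loopy_bp(priors, edges, coupling=0.8, iterations=20):
    """Sum-product loopy belief propagation for binary node labels.

    priors:   {node: prior probability that the node is malicious}
    edges:    iterable of undirected (u, v) pairs between nodes in priors
    coupling: assumed probability that two linked nodes share a label
    """
    neighbors = {node: set() for node in priors}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    phi = {node: (1.0 - p, p) for node, p in priors.items()}
    psi = ((coupling, 1.0 - coupling), (1.0 - coupling, coupling))
    # messages[(i, j)][x] is what node i tells node j about j having label x.
    messages = {(i, j): (0.5, 0.5) for i in neighbors for j in neighbors[i]}

    def incoming(node, label, skip=None):
        """Node potential times all incoming messages except from skip."""
        product = phi[node][label]
        for k in neighbors[node]:
            if k != skip:
                product *= messages[(k, node)][label]
        return product

    for _ in range(iterations):
        updated = {}
        for (i, j) in messages:
            raw = [
                sum(incoming(i, xi, skip=j) * psi[xi][xj] for xi in (0, 1))
                for xj in (0, 1)
            ]
            total = raw[0] + raw[1]
            updated[(i, j)] = (raw[0] / total, raw[1] / total)
        messages = updated  # synchronous update of every message

    beliefs = {}
    for node in neighbors:
        raw = [incoming(node, label) for label in (0, 1)]
        beliefs[node] = raw[1] / (raw[0] + raw[1])
    return beliefs


# Hypothetical priors: two domains with ground truth, the rest unknown (0.5).
priors = {
    "known-bad.example": 0.95,
    "known-good.example": 0.05,
    "unknown.example": 0.5,
    "192.0.2.1": 0.5,
    "192.0.2.2": 0.5,
    "192.0.2.3": 0.5,
}
edges = [
    ("known-bad.example", "192.0.2.1"),
    ("known-bad.example", "192.0.2.2"),
    ("unknown.example", "192.0.2.1"),
    ("unknown.example", "192.0.2.2"),  # closes a cycle, hence "loopy"
    ("known-good.example", "192.0.2.3"),
]

beliefs = loopy_bp(priors, edges)
for node, belief in sorted(beliefs.items()):
    print(f"{node}: {belief:.2f}")
```

The unknown domain shares both of its IP addresses with a known-bad domain, so its belief rises above 0.5, while the address used only by the known-good domain falls below 0.5.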
  • 12. A yoga ball graph. Really!?!
  • 13. You may actually need this
      • When the problem is an information network
      • When a graph is a natural way of expressing the algorithm
      • When you want to study specific relationships
      • When you want faster machine learning or solvers on sparse data
    Examples: shortest path, central influence, sub networks, triangle count
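
Triangle count is the simplest of the examples named on this slide to show in a few lines. The sketch below counts triangles in an undirected graph using the friendship pairs from slide 2. The extra Frank/Ivy edge is hypothetical and is added only to create a triangle.

```python
from itertools import combinations


def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    neighbors = {}
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors.setdefault(u, set()).add(v)
            neighbors.setdefault(v, set()).add(u)
    total = 0
    for adjacent in neighbors.values():
        # Two neighbors of the same node that are also linked close a triangle.
        for a, b in combinations(adjacent, 2):
            if b in neighbors[a]:
                total += 1
    return total // 3  # each triangle is found once from each of its corners


# Friendship pairs from slide 2.
FRIENDSHIPS = [
    ("Frank", "Mohit"), ("Frank", "Ted"), ("Ivy", "Kushal"), ("Ivy", "Ted"),
    ("Kushal", "Mohit"), ("Kushal", "Nezih"), ("Nezih", "Ted"),
]

print(count_triangles(FRIENDSHIPS))                      # 0
print(count_triangles(FRIENDSHIPS + [("Frank", "Ivy")]))  # 1
```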
  • 14. But there are challenges. Handling all that data. Finding people good at both handling all that data and data analysis. Putting exploratory work into production fast enough to keep up with the competition.
  • 15. Congratulations! You are a data scientist!
  • 16. It’s a demanding job. Workflow: Ingest & Clean, Engineer Features, Structure Model, Train Model, Query & Analyze, Learn, Visualize.
      • Skills shortage at intersection of systems engineering and data analysis
      • Painful data ingestion and preparation
      • Workflows that are not designed with loopbacks in mind
      • Few tools for analyzing semantics at scale
      • Composing pipeline is DIY
  • 17. Decomposing the “data scientist” Source: 2013 Report from Accenture Institute for High Performance
  • 18. IMAGINE A PLATFORM FOR DATA SCIENTISTS DOCS + SEMANTICS + MACHINE LEARNING
  • 19. Ease-of-use: Making big data familiar. Python, R, Dataflow GUI, ... Datacenter / Cloud, Network, Client. BIG DATA API. Connect, Manage, Secure, Analyze (distributed and parallel). Manage, Secure, Connect, Analyze (local). Query Big Data. Java/Scala/C++, Computational Frameworks, Big Data Algorithms, Cluster Workload Mgmt, Cluster Storage, Machine Learning & Statistics, Data Wrangling. Analyst Skills. The Other Skills.
  • 20. Delivering it. FILESYSTEMS AND NOSQL STORAGE; HW PLATFORM; APACHE HADOOP; APACHE SPARK; DATA WRANGLING; MACHINE LEARNING AND STATISTICS (Graphical Algorithms, Classical Algorithms, Graph Construction Tools, Useful String Manipulation, Useful Math Operators); BIG DATA API; DATA SCIENCE SERVER (Query and Scripting). Intel Analytics Toolkit. A UNIFIED DOCUMENT + SEMANTIC STORE. The Ask.
  • 21. Bringing a full spectrum of possibilities (Approach: Graph)
      Algorithm | Category | Applications/Use Cases
      Loopy Belief Propagation (LBP) | Structured Prediction | Personalized recs, image de-noising
      Label Propagation | Structured Prediction | Personalized recommendations
      Alternating Least Squares (ALS) | Collaborative Filtering | Recommenders
      Conjugate Gradient Descent (CGD) | Collaborative Filtering | Recommenders
      Connected Components | Graph Analytics | Network manipulation, image analysis
      Latent Dirichlet Allocation (LDA) | Topic Modeling | Document Clustering
      Structure Attribute | Clustering | Network analysis, consumer seg
      K-Truss | Clustering | Social network analysis
      KNN* | Clustering | Recommenders
      Logistic Regression* | Classification | Fraud detection
      Random Forest* | Classification | Fraud detection, consumer seg
      Generalized Linear Model (Binomial, Poisson) | Non-linear Curve Fitting | Forecasting, pricing, market mix models
      Association Rule Mining | Data Mining | Market basket analysis, recommenders
      Frequent Pattern Mining* | Data Mining | Pattern Recognition
  • 22. Article Tagging Problem
      • Articles are tagged by experts with MeSH terms, drawn from a hierarchical controlled vocabulary of 55,000 keywords
      • Process is resource-intensive – can we automate it?
      • Categorize articles into a hierarchy that matches the same categorization from the MeSH controlled vocabulary
  • 23. [Chart: article count by hierarchy level]
  • 24. Demo: Graph Analytics For Medical Journal Analysis
    Workflow: INGEST & CLEAN, ENGINEER FEATURES, STRUCTURE GRAPH, QUERY & ANALYZE, LEARN, VISUALIZE
    Steps: PARSE AND EXTRACT WORDS, CREATE ARTICLE/WORD LIST, BUILD GRAPH, QUERY/VISUALIZE DATA, DETECT CLUSTERS USING LDA
      • Medline™ XML
      • MeSH Ontology XML
      • Create list of unique words
      • Stemming and lemmatization
      • Index word list
      • Transform articles into list of article/word pairs
      • Extract vertices
      • Assign id columns to vertex property
      • Assign year and count edge properties
      • Gremlin query for each visual
      • Python web server and other libraries
      • Select optimization parameters
      • Invoke LDA
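
The word-list and graph-building steps above can be sketched in plain Python. This is not the toolkit's API: `crude_stem` is a hypothetical stand-in for real stemming and lemmatization, the sample articles are invented, and the vertex and edge layout is an assumption based on the bullet points (id vertex property, year and count edge properties).

```python
import re
from collections import Counter


def crude_stem(word):
    """Hypothetical stand-in for stemming: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def article_word_pairs(articles):
    """Yield (article_id, year, word, count) rows from {id: (year, text)}."""
    for article_id, (year, text) in articles.items():
        words = [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]
        for word, count in Counter(words).items():
            yield article_id, year, word, count


def build_graph(rows):
    """Build article and word vertices plus article-to-word edges."""
    vertices = {}
    edges = []
    for article_id, year, word, count in rows:
        vertices[("article", article_id)] = {"id": article_id}
        vertices[("word", word)] = {"id": word}
        edges.append(
            (("article", article_id), ("word", word), {"year": year, "count": count})
        )
    return vertices, edges


# Invented sample input; the demo parsed Medline XML instead.
articles = {
    "a1": (2001, "Sleep stages and dreams"),
    "a2": (2003, "Dreams during REM sleep"),
}

vertices, edges = build_graph(article_word_pairs(articles))
print(len(vertices), "vertices,", len(edges), "edges")
```

The resulting bipartite article/word graph is the kind of input that a topic model such as LDA consumes in the later steps.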
  • 25. The Playbook? Steps: PARSE AND EXTRACT WORDS, CREATE ARTICLE/WORD LIST, BUILD GRAPH, QUERY/VISUALIZE DATA, DETECT CLUSTERS USING LDA. Sequence: Parse, Prepare graph data, Basic analysis, Run LDA, INSIGHTFUL RESULT. This never happens!
  • 26. The Real Playbook. Steps: PARSE AND EXTRACT WORDS, CREATE ARTICLE/WORD LIST, BUILD GRAPH, QUERY/VISUALIZE DATA, DETECT CLUSTERS USING LDA. Sequence: Parse, Correct mistake, Prepare graph data, Correct schema mistake, Correct aggregation mistake, Data validation, Correct dataset mistake, Guess LDA settings, Tune and re-run, Detect bias in dataset.
  • 27. WE NEED THE AGILITY OF INTERACTIVE SCRIPTING AND THE BRAINS AND BRAWN OF SCALABLE GRAPH ANALYTICS
  • 28. Build Frame
  • 29. Build Graph
  • 30. Query Vertices
  • 31. LDA with 3 Topics
  • 32. LDA with 5 Topics
  • 33. LDA with 7 Topics
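
The three LDA slides differ only in the number of topics. As an illustration of what changing that parameter involves, here is a small collapsed Gibbs sampler for LDA in plain Python. The toolkit's own LDA implementation is not shown in the transcript, so the function name, hyperparameters, and sample documents are all assumptions.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            topic = rng.randrange(num_topics)
            doc_assignments.append(topic)
            doc_topic[d][topic] += 1
            topic_word[topic][word] += 1
            topic_total[topic] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                topic = assignments[d][i]
                doc_topic[d][topic] -= 1
                topic_word[topic][word] -= 1
                topic_total[topic] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                topic = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = topic
                doc_topic[d][topic] += 1
                topic_word[topic][word] += 1
                topic_total[topic] += 1
    return doc_topic, topic_word


# Invented toy corpus; the demo used Medline articles.
docs = [
    ["sleep", "dreams", "rem", "sleep"],
    ["rats", "animals", "conditioning", "rats"],
    ["sleep", "arousal", "wakefulness"],
    ["animals", "rats", "attention"],
]

for k in (3, 5, 7):
    doc_topic, topic_word = lda_gibbs(docs, num_topics=k)
    print(k, "topics:", doc_topic)
```

Each run assigns every token to exactly one topic, so the per-document counts always sum to the document length regardless of the number of topics chosen.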
  • 34. Query Vertices Again – Now with ML Properties
  • 35. Following Analysis. [Bar chart, scale 0 to 0.08: top MeSH terms that predict which category an article will be assigned.] Terms shown: Wakefulness, Sleep, Animals, Electroencephalography, Circadian Rhythm, Arousal, Sleep Stages, REM, Mental Recall, Attention, Rats, Child, Evoked Potentials, Aged, Schizophrenia, Ocular, Conditioning, Infant, Psychophysics, Dreams.
  • 36. Reimagining 2014. New partnerships in big data. Contributions to the open source community. The Intel Analytics Toolkit – COMING SOON. SEMANTICS + MACHINE LEARNING TOGETHER AT LAST!
  • 37. INTERESTED IN THE INTEL ANALYTICS TOOLKIT? THEODORE.L.WILLKE@INTEL.COM
  • 38. Legal Disclaimers. All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM). Functionality, performance or other benefits will vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For more information, visit http://www.intel.com/go/virtualization No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT-compatible measured launched environment (MLE). Intel TXT also requires the system to contain a TPM v1.s. For more information, visit http://www.intel.com/technology/security Intel, Intel Xeon, Intel Atom, Intel Xeon Phi, Intel Itanium, the Intel Itanium logo, the Intel Xeon Phi logo, the Intel Xeon logo and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. Copyright © 2013, Intel Corporation. All rights reserved.