How to build your own Delve: combining machine learning, big data and SharePoint

You are experiencing the benefits of machine learning every day through product recommendations on Amazon and Bol.com, credit card fraud prevention, and more. So how can we leverage machine learning together with SharePoint and Yammer? We will first look into the fundamentals of machine learning and big data solutions, and then explore how tools such as Windows Azure HDInsight, R and Azure Machine Learning can be combined to extend and support collaboration and content management scenarios within your organization.

Published in: Data & Analytics

  1. 1. How to build your own Delve: combining machine learning, big data and SharePoint • #SPSBE11 • Joris Poelmans • April 18th, 2015
  2. 2. Thanks to our sponsors! • Platinum • Gold • Silver
  3. 3. http://jopx.blogspot.com
  4. 4. Agenda • Introduction to Delve • Office Graph • Big Data and Machine Learning • Building your own Delve - architectural concept
  5. 5. Agenda • Introduction to Delve • Office Graph • Big Data and Machine Learning • Building your own Delve - architectural concept
  6. 6. Delve: Search and Discovery Across O365, Powered by Office Graph • Stay in the Know: discover new information tailored to you from your network • Find What You Need: find just the right results from any source and take action • Discover New Connections: connect with the right experts and learn more about their content
  7. 7. Agenda • Introduction to Delve • Office Graph • Big Data and Machine Learning • Building your own Delve - architectural concept
  8. 8. What is the Office Graph? • User • Documents • People • Conversations
  9. 9. What is the Office Graph? • Manager • Direct report • Works with • Shared with me • Viewed by me • Trending around me • Presented to me • Liked by me
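
The two Office Graph slides describe entities (users, documents, people, conversations) connected by typed relationships such as "Works with" or "Viewed by me". A minimal Python sketch of that idea follows. It is not the Office Graph API: the `TypedGraph` class, the edge names `WorksWith`, `Manager` and `Viewed`, and the sample users and documents are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


class TypedGraph:
    """Tiny in-memory graph with typed, directed edges (illustrative only)."""

    def __init__(self):
        # (source, edge_type) -> targets, kept in insertion order
        self._edges = defaultdict(list)

    def add_edge(self, source, edge_type, target):
        self._edges[(source, edge_type)].append(target)

    def neighbours(self, source, edge_type):
        # .get() avoids creating empty entries for unknown keys
        return list(self._edges.get((source, edge_type), []))


def viewed_by_colleagues(graph, user):
    """Documents viewed by the people `user` works with, without duplicates."""
    documents = []
    for colleague in graph.neighbours(user, "WorksWith"):
        for document in graph.neighbours(colleague, "Viewed"):
            if document not in documents:
                documents.append(document)
    return documents


# Hypothetical sample data
graph = TypedGraph()
graph.add_edge("alice", "WorksWith", "bob")
graph.add_edge("alice", "WorksWith", "dave")
graph.add_edge("alice", "Manager", "carol")
graph.add_edge("bob", "Viewed", "roadmap.pptx")
graph.add_edge("bob", "Viewed", "budget.xlsx")
graph.add_edge("dave", "Viewed", "roadmap.pptx")

print(viewed_by_colleagues(graph, "alice"))
```

A two-hop traversal like this is roughly the shape of a "what are my colleagues looking at" query; a real graph store would add weights, timestamps and security trimming.
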
  10. 10. Connected Enterprise
  11. 11. Signals sent from Delve, Exchange, O365, … • Click person • Modify/Save • Elevate • Share • Follow • Like • Comments • Email • Ignore • Presented to • Shown document • Open document • Shown board • ++
  12. 12. Content and signals across O365 auto-populating the Office Graph • Insights derived with machine learning for proactive and intelligent experiences
  13. 13. Agenda • Introduction to Delve • Office Graph • Big Data and Machine Learning • Building your own Delve - architectural concept
  14. 14. Big data is what happened when the cost of storing user data became cheaper than making the decision to throw it away
  15. 15. Transactions + Interactions + Observations = Big Data • Axes: data volume (Megabytes, Gigabytes, Terabytes, Petabytes) against increasing data variety and complexity • ERP: purchase detail, purchase record, payment record • CRM: offer details, support contacts, customer touches, segmentation • WEB: web logs, offer history, A/B testing, dynamic pricing, affiliate networks, search marketing, behavioral targeting, dynamic funnels • Big Data: user generated content, mobile web, SMS/MMS, sentiment, external demographics, HD video, audio, images, speech to text, product/service logs, social interactions & feeds, business data feeds, user click stream, sensors / RFID / devices, spatial & GPS coordinates
  16. 16. Big Data Core Technology landscape • Hadoop: Hadoop/HDFS + MapReduce; key Big Data technology • MPP: appliances (Teradata, Microsoft PDW/APS, Oracle BDA X4-2) • NoSQL: new paradigm for storing data; 100+ non-SQL DBs and growing • NewSQL: supports SQL querying; internal architecture different from classic DBs
  17. 17. Modern Data Architecture • Apache Hadoop is an open source framework that supports data-intensive distributed applications – Uses HDFS storage to enable applications to work with 1000s of nodes and petabytes of data using a scale-out model – Uses MapReduce to process data – Inspired by Google: MapReduce, Google File System – Related projects: HBase, Hive, Mahout, Pig, Sqoop, Ambari, Storm, Zookeeper, and many more
  18. 18. HDFS and MapReduce in a nutshell
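
The slide itself carries only a diagram, so here is a single-process Python sketch of the MapReduce programming model using the classic word count. It only imitates the three phases (map, shuffle, reduce) in memory; it does not use Hadoop or HDFS, and the function names are chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big insights", "Big Data"])
print(counts)
```

On a real cluster the map calls run in parallel on the nodes that hold the data blocks, and the shuffle moves data across the network; the per-key logic stays this small.
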
  19. 19. Hadoop components • Layers: Hadoop core, data access, data integration, operations • Components: Distributed Storage (HDFS), Distributed Processing (MapReduce), Hive, Pig, HBase, HCatalog, Data Integration (ODBC/Sqoop/REST/Flume), Mahout, Pegasus, RHadoop, Oozie, Ambari, Zookeeper, Storm, Kafka • http://jopx.blogspot.be/2015/03/overview-of-apache-hadoop-components-in.html
  20. 20. Microsoft Azure HDInsight • Hadoop in Azure • Based on Hortonworks Data Platform • Pay per use model • Able to leverage Azure Blob Storage • Support HBase as NoSQL columnar database on Azure Blobs • Support Storm as stream processing • Diagram components: Name Node, Job Tracker, HMaster, Coordination; four each of Data Node, Task Tracker, Region Server
  21. 21. Hive • Hadoop feature to perform data warehouse operations • HiveQL – High-level, SQL-like language, abstraction over MapReduce – Supports equi-joins – Schema on read, NOT schema on write – Automatically invokes MapReduce jobs – Much simpler than using MapReduce directly • Metadata store – Contains descriptions of tables • Acts as a bridge to many BI products which expect tabular data
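
"Schema on read" means the raw files are stored untouched and a table definition is applied only when the data is queried. The Python sketch below imitates that idea; it is not Hive or HiveQL. The tab-separated log format, the column names in `SCHEMA` and the sample rows are assumptions, and skipping rows that do not fit the schema is a simplification made for this sketch.

```python
from collections import Counter

# Hypothetical raw log lines, stored exactly as they arrived
RAW_LINES = [
    "2015-04-18\talice\tview\troadmap.pptx",
    "2015-04-18\tbob\tview\troadmap.pptx",
    "2015-04-18\tbob\tedit\tbudget.xlsx",
    "malformed line",
]

# The "table definition": applied at read time, never enforced at write time
SCHEMA = ("day", "user", "action", "document")


def read_with_schema(lines, schema):
    """Yield one dict per line that fits the schema; the raw data is untouched."""
    for line in lines:
        fields = line.split("\t")
        if len(fields) != len(schema):
            continue  # simplification: ignore rows that do not fit
        yield dict(zip(schema, fields))


def views_per_document(lines):
    """Roughly: SELECT document, COUNT(*) ... WHERE action = 'view' GROUP BY document."""
    rows = read_with_schema(lines, SCHEMA)
    return Counter(row["document"] for row in rows if row["action"] == "view")


views = views_per_document(RAW_LINES)
print(views)
```

Because the schema lives outside the data, a second schema with different column names could be applied to the same lines without rewriting them.
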
  22. 22. Sample Hive queries
  23. 23. Machine learning: finding the needle in the haystack • Formal definition: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” - Tom M. Mitchell • Another definition: “The goal of machine learning is to program computers to use example data or past experience to solve a given problem.” - Introduction to Machine Learning, 2nd Edition, MIT Press • ML often involves two primary techniques: – Supervised Learning: Finding the mapping between inputs and outputs using correct values to “train” a model – Unsupervised Learning: Finding patterns in the input data (similar to Density Estimates in Statistics)
  24. 24. Vision Analytics • Recommendation engines • Advertising analysis • Weather forecasting for business planning • Social network analysis • Legal discovery and document archiving • Pricing analysis • Fraud detection • Churn analysis • Equipment monitoring • Location-based tracking and services • Personalized Insurance
  25. 25. Some retailers profit … by predicting major changes in your life.
  26. 26. Steps to build a machine learning solution
  27. 27. Typical machine learning algorithms • Clustering (k-means, orthogonal partitioning, …) • Association rule learning (Apriori) • Regression (linear/logistic) • Recommendation engines • Classification (C4.5, decision trees, SVM, Naïve Bayes, AdaBoost, Random Forest, …) • Similarity matching • Neural networks • Bayesian networks • Genetic algorithms • Ensembles • See http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/ and http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf and http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
  28. 28. Doing recommendations – some approaches • Collaborative filtering • Feature based recommendations • K-nearest neighbours
  29. 29. Collaborative filtering • A set of items (books, beers, blogposts,…) • Ratings from users • Recommended items based on your ratings and other people’s ratings
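
One simple way to turn "your ratings and other people's ratings" into recommendations is to count overlapping likes: people who liked what you liked vote for the other items they liked. The Python sketch below shows that one variant of collaborative filtering, not the only one. The threshold of 4 for a "like", the user names and the ratings are assumptions made for the example.

```python
LIKE_THRESHOLD = 4  # assumption: a rating of 4 or 5 counts as a "like"

# Hypothetical ratings on a 1-5 scale
RATINGS = {
    "ann":  {"book_a": 5, "book_b": 4, "book_c": 1},
    "ben":  {"book_a": 4, "book_b": 5, "book_d": 5, "book_f": 4},
    "cara": {"book_a": 5, "book_d": 4, "book_e": 2},
    "dan":  {"book_c": 5, "book_e": 4},
}


def liked(ratings, user):
    """Items the user rated at or above the like threshold."""
    return {item for item, r in ratings[user].items() if r >= LIKE_THRESHOLD}


def recommend(ratings, user):
    """Score unseen items by how much their fans' tastes overlap with the user's."""
    mine = liked(ratings, user)
    scores = {}
    for other in ratings:
        if other == user:
            continue
        theirs = liked(ratings, other)
        overlap = len(mine & theirs)
        if overlap == 0:
            continue  # no shared taste, so this user gets no vote
        for item in theirs:
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0) + overlap
    # Highest score first; ties broken alphabetically for stable output
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


print(recommend(RATINGS, "ann"))
```

Note that no item metadata is needed: the recommendation comes purely from rating behaviour, which is what distinguishes this approach from feature based recommendations.
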
  30. 30. Feature based recommendations • Use user’s ratings of items – Create an algorithm to define which features (metadata) of items the user likes • Requires detailed information about items - content based – An item can be a person as well – see “People you may know” • Most approaches combine “feature based” and “collaborative filtering”
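
A feature based recommender can be sketched as two steps: build a per-user profile of feature weights from rated items, then score unrated items by the features they carry. The Python sketch below does this with plain sums. The neutral midpoint of 3, the document names and the tags are assumptions; real systems typically use richer features and a learned model.

```python
from collections import defaultdict

NEUTRAL = 3  # assumption: midpoint of a 1-5 scale, so 5 is +2 and 2 is -1

# Hypothetical items and their metadata tags
ITEM_FEATURES = {
    "doc_1": {"sharepoint", "search"},
    "doc_2": {"sharepoint", "governance"},
    "doc_3": {"hadoop", "hive"},
    "doc_4": {"search", "hadoop"},
    "doc_5": {"governance"},
}


def build_profile(ratings, item_features):
    """Sum the centred ratings per feature to learn what the user likes."""
    profile = defaultdict(float)
    for item, rating in ratings.items():
        for feature in item_features[item]:
            profile[feature] += rating - NEUTRAL
    return dict(profile)


def rank_unrated(ratings, item_features):
    """Score each unrated item by the profile weights of its features."""
    profile = build_profile(ratings, item_features)
    scores = {
        item: sum(profile.get(feature, 0.0) for feature in features)
        for item, features in item_features.items()
        if item not in ratings
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


user_ratings = {"doc_1": 5, "doc_2": 2}
print(rank_unrated(user_ratings, ITEM_FEATURES))
```

Unlike collaborative filtering, this works for a brand-new item nobody has rated yet, as long as its metadata is filled in.
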
  31. 31. K-Nearest Neighbours (Classification approach) • Find ratings from people similar to you and see what they liked – Use similarity functions (Minkowski distance, RMSE, Pearson Correlation Coefficient, …) • Take the average ratings of the k people most similar to you – Display the items with the highest averages • Conclusion – requires solid background in Math and Statistics
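
The steps on this slide can be sketched in a few lines of Python: compute a similarity to every other user, keep the k most similar, and average their ratings for items you have not rated. The sketch uses the Pearson correlation coefficient, one of the similarity functions the slide names. User names and ratings are made up, and returning 0.0 when the correlation is undefined is a choice made for this sketch.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale
RATINGS = {
    "you": {"a": 5, "b": 3, "c": 1},
    "u1":  {"a": 5, "b": 3, "c": 1, "d": 5, "e": 2},
    "u2":  {"a": 5, "b": 2, "c": 1, "d": 4, "e": 3},
    "u3":  {"a": 1, "b": 3, "c": 5, "d": 1, "e": 5},
}


def pearson(a, b):
    """Pearson correlation over items both users rated; 0.0 if undefined."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    xs = [a[item] for item in common]
    ys = [b[item] for item in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def knn_recommend(ratings, user, k=2):
    """Average the k most similar users' ratings for items `user` has not rated."""
    others = [(pearson(ratings[user], r), name)
              for name, r in ratings.items() if name != user]
    neighbours = [name for _, name in sorted(others, reverse=True)[:k]]
    totals, counts = {}, {}
    for name in neighbours:
        for item, rating in ratings[name].items():
            if item in ratings[user]:
                continue
            totals[item] = totals.get(item, 0) + rating
            counts[item] = counts.get(item, 0) + 1
    averages = {item: totals[item] / counts[item] for item in totals}
    return sorted(averages.items(), key=lambda kv: (-kv[1], kv[0]))


print(knn_recommend(RATINGS, "you", k=2))
```

Here `u3` has exactly the opposite taste (correlation -1) and is left out of the neighbourhood, so its low rating for `d` does not drag the average down.
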
  32. 32. Machine Learning and Data Scientists • Developing predictive analytics and machine learning must be simpler; today it requires specialized skills: • Data management • Data exploration • Math & statistics • Domain expertise • Machine learning • Software development • Data visualization • 65% of enterprises feel they have a strategic shortage of data scientists, a role many did not know existed 12 months ago …
  33. 33. Microsoft Azure Machine Learning
  34. 34. Microsoft Azure Machine Learning (Ctd.) • Personalized Workspace: combine R modules with Microsoft’s best-in-class algorithms running Xbox and Bing; work with anyone, anywhere by simply sharing the workspace • Easy Access to All Data: drop desktop data sets into the built-in storage space; bring in cloud data with the ease of a drop-down • Deploy Models as Web Services: operationalize in minutes and refine models at the speed of the market • Partner Tools: ML partners enjoy SDK access for robust solutions • Components: Microsoft Azure Machine Learning Studio, Microsoft Azure Machine Learning API service, Microsoft Azure Machine Learning SDK
  35. 35. Agenda • Introduction to Delve • Office Graph • Big Data and Machine Learning • Building your own Delve - architectural concept
  36. 36. Building your own Delve - high level architecture • Event producers: web logs, documents & metadata • Transform • Long-term storage: Azure SQL Database & Azure Storage • Predictive analytics: Azure Machine Learning • Presentation and action • On premise
  37. 37. Building your own Delve – remarks • Graph technology left out for simplicity – Take a look at Neo4j or Pegasus on Hadoop if you are interested • Not very realistic to rebuild Delve, but possible to define point solutions • If you still go ahead – Think about the end-to-end data pipeline – Fast track with Recommendation API in datamarket http://datamarket.azure.com/dataset/amla/recommendations – Cache recommendations for performance and cost optimization – Learn R or Python to extend AzureML capabilities
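
The advice to cache recommendations can be sketched as a small time-to-live cache placed in front of whatever service computes them. The Python sketch below assumes nothing about the Recommendation API itself: `fetch_recommendations` is a hypothetical stand-in for the real, possibly billed, call, and the 15-minute lifetime is an arbitrary example value.

```python
import time


class TTLCache:
    """Cache results per key and refetch once they are older than ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # function that does the expensive call
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._entries = {}           # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: no remote call
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value


def fetch_recommendations(user):
    """Hypothetical stand-in for a call to a recommendation service."""
    return [f"document-for-{user}"]


cache = TTLCache(fetch_recommendations, ttl_seconds=15 * 60)
print(cache.get("alice"))  # first call goes to the service
print(cache.get("alice"))  # second call is served from the cache
```

The lifetime is the trade-off knob: a longer value means fewer paid calls and faster pages, at the price of staler recommendations.
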
  38. 38. Online Resources • www.coursera.org (MOOC) • Microsoft Virtual Academy – http://www.microsoftvirtualacademy.com/training-courses/getting-started-with-microsoft-azure-machine-learning – http://www.microsoftvirtualacademy.com/training-courses/implementing-big-data-analysis • Cloud Data Science process - http://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-how-to-create-machine-learning-service/ • Blogs – http://blogs.msdn.com/b/benjguin/ – http://hortonworks.com/blog/ – http://blogs.msdn.com/b/bigdatasupport/ – http://blogs.msdn.com/b/big_data_france/ – http://blogs.msdn.com/b/brian_swan/ – http://blogs.msdn.com/b/mwinkle/ – http://blogs.msdn.com/b/avkashchauhan/ – http://blogs.msdn.com/b/carlnol/ – http://blogs.technet.com/b/machinelearning/
  39. 39. Recommended books
  40. 40. Thank you!