Big Data: The Magic to Attain New Heights

Ken Johnston of Microsoft shares the story of how to take advantage of Big Data for your team.

Published in: Data & Analytics


  1. 1. Big Data: The Magic to Attain New Heights Ken Johnston Principal Data Science Manager Twitter – @rkjohnston Blog – http://linkedin.com/in/rkjohnston Email – kenj@Microsoft.com LinkedIn - http://linkedin.com/in/rkjohnston @rkjohnston #DataMagic
  2. 2. About Ken • Data Scientist in Core Data Science Team • Office Live, WebApps, Office Online • Cosmos, AutoPilot, Local, Shopping • Kanban and Data Science series on LinkedIn • EaaSy & MVQ – Everything as a Service & Minimum Viable Quality • Write books and blog and some fiction
  3. 3. I have a lot of love in my life
  4. 4. My Kids
  5. 5. @rkjohnston #DataMagic
  6. 6. Team of Amazing Magicians
  7. 7. Getting hands dirty in the data
  8. 8. Connect the Dots
  9. 9. Create Deep Insights
  10. 10. Taking on Sudden Infant Death Syndrome
  11. 11. Big Data and Magic
  12. 12. So, my son gets this kids' “Magic Kit in a Box” for his 8th birthday
  13. 13. Open our own Magic Show
  14. 14. Six Keys to a “Big” Magic Show • Try, Try, Try Again: The Tyranny of Counting • Magic Tricks (A/B Testing, Runtime Flags) • The Venue (Big Data Infrastructure) • Foundation (Tools for Big Data) • Security (Protection, Privacy, Fraud) • The Assistant: Recruit, Train, & Retain. Chart: “Big Data” Search Trends
  15. 15. The Venue: Your Big Data Infrastructure
  16. 16. Common Design Patterns. Good paper to read: IDC, Six Patterns of Big Data and Analytics Adoption: The Importance of the Information Architecture. Ingest: From Services, IOT, Apps • Via Streams • Into Storage. Process: Build Pipelines • Reduce, Transform, Join • Pipe out. Analyze: From Services, IOT, Apps • Via Streams • Into Storage
  17. 17. Azure Model Cindy Gross – Technical Fellow: Big Data and Cloud Twitter: @SQLCindy cindyg@NealAnalytics.com Ingest Process Analyze
  18. 18. Hybrid: Azure and Hadoop Model Ingest Process Analyze
  19. 19. Amazon Model Ingest Process Analyze
  20. 20. How we do it in Windows
  21. 21. Prototypical Big Data Platform: Client 1, Client 2, Client 3 • Telemetry Front End Service • Fast pipeline for high priority data • Alerting DB • Alerting Dashboard • Big Data Map Reduce Cloud • PII Scrubbing Service • Data Extraction Service • Insights DB 1, Insights DB N • Additional Reporting Dashboards • Big Data & ML Model Orchestration. Notes: Personally Identifiable Information (PII) management is very critical. Data Driven Quality (DDQ) and big data pipelines will need a cloud platform. The superfast pipeline typically (not always) bypasses the cloud and is also void of PII.
  22. 22. Prototypical Big Data Platform: Client 1, Client 2, Client 3 • Telemetry Front End Service • Fast pipeline for high priority data • Alerting DB • Alerting Dashboard • Big Data Map Reduce Cloud • PII Scrubbing Service • Data Extraction Service • Insights DB 1, Insights DB N • Additional Reporting Dashboards • Big Data & ML Model Orchestration. Stages: Ingest, Process, Analyze
  23. 23. User Segmentation Approaches • Risk Tolerance Model • Users Segment themselves • Opt in for greater risk with a reward in mind • Profile Based • Usage behaviors • new vs. power users • Browser type • Connection Type • Device and Device OS
  24. 24. Balancing Speed and Risk with Rings • Ring 0: Buddy Build • Ring 1: My Team • Ring 2: Company & NDA • Ring 3: External Beta Users • Ring 4: Everyone. The red line demarcates disclosure risk and possible loss of patent rights. Scale labels: “Risk tolerance is highest” and “No desire for risk”
  25. 25. Security: Protection, Privacy, Fraud
  26. 26. Office 365 Advanced Threat Protection • Big Data Only Solution • Safe Link is powered by Cloud Exchange & Bing data • AI Model powered by data from thousands of companies and attachments
  27. 27. Short-lived identifiers • Increase transparency and control for users • Build privacy into the OS and all apps
  28. 28. How the Windows Store Security Team made the Insights Leap
  29. 29. App Store Data Architecture • App Certification and Analysis Pipeline • Store Services Log and Telemetry • Bing Spam and Malware • Windows Services Safety Platform (MSA, SmartScreen, etc.) • MMPC/Spynet • Network IPs • File Hashes • PhotoDNA • Strings • API Called • User Install Data • Ratings and Reviews • Purchases • Geographic Data • Account Reputation • Bad URLs • Botnet-infected Clients • Cosmos Storage and Compute. BTW this was not Big Data
  30. 30. NoName was learning basic DS: “Look at how I did this k-means clustering and found these weird outliers in buying circles from Dev accounts created the same week and same IP address.” “Check it out, I found this guy's FB page. We have his picture!” NoName and I were spitballing ideas
  31. 31. Fraud Network Identification. XXXDeveloper created 40 different Store developer accounts and 100s of apps. Diagram labels: Bad Dev 1 • Bad Dev 2 • Bad Dev ‘N’ • Payment Instruments • Shared Fraudulent Payment Instruments • App Similarity • App Metadata (URL, Websites) • New Identity Metadata • Social Networks • 3rd party app stores • Developer Watering Holes
  32. 32. lights out
  33. 33. Foundation: Tools and Skills for Big Data
  34. 34. The Big Red Switch This used to require humans
  35. 35. Sidebar: I had an Epiphany
  36. 36. Speed is your friend because…
  37. 37. Code Churn Example 1: Six-week coding milestone. Code churn is cumulative. Imagine this as part of a larger multi-layered project (Layer 1, Layer 2, Layer 3) • Tightly coupled layers • Long stabilization phase • Complicated end-to-end integration. Sim-ship increases risk. Maximum point of instability is at end of milestone
  38. 38. Code Churn Example 2 (Continuous Deployment): Rapid release cadence (weekly or daily) across Layer 1, Layer 2, Layer 3, Layer N • Risk per release decreases because of more incremental change • You still must be careful of Risk within Production but… • Total risk over time can be less with incremental change. Max Risk is Production
  39. 39. As Speed Accelerates Up Front & Post Deploy Testing Decreases
  40. 40. Measures = Test Cases • We do Measures • What is a post release test case? • Automation validates the golden path • We measure the golden path • Measures are the same as test cases • Monitor the golden path
  41. 41. >1.5*IQR = Outlier = Bug (probably) • What is a Test Case? • What I expect to happen vs. what does happen • A Test Case is binary • Measures can observe success and failure • Measures have a history of pass/fail • When pass or fail rates drift from the standard expected rates we find outliers • Outliers are often bugs
  42. 42. Rings + Speed + Data = Success • When speed increases the need for telemetry increases • The rings model provides a buffer
  43. 43. Tricks
  44. 44. Flighting and A/B testing are mostly the same thing
  45. 45. Runtime Flags = Continuous Deployment
  46. 46. Generic Service Stack: Service UX • Front Door Service • Auth/Identity • Layer A vCurrent • Layer B vCurrent • Service Layer C (Persistent Data Store) • Default path: production traffic. Layer types: front door servers for logging and access management; UX rendering layers; identity or authentication layers; persistent data layers
  47. 47. Runtime Flags Example 1: Side-by-Side Deployments • Flags direct traffic through the stack • Used to test vNext before full release. Stack: Service UX • Front Door Service • Auth/Identity • Layer A vCurrent • Layer B vCurrent • Layer B vNext • Service Layer C (Persistent Data Store). Paths: default path (production traffic) and runtime-flag path (test or forked traffic)
  48. 48. Runtime Flags Example 2: N Test Environments. Stack: Service UX • Front Door Service • Auth/Identity • Layer A vCurrent • Layer B vCurrent • Service Layer C (Persistent Data Store) • Layer A Test Path • Layer B Test Path. Traffic: Production Traffic, Test Case, Checkin Tests. Paths: default path and runtime-flag paths
  49. 49. Apps as a Service: Facebook. “How Facebook secretly redesigned its iPhone app with your help”: …a system for creating alternate versions… within the native app. The team could then turn on certain new features for a subset of its users, directly, …a system of "different types of Legos... and see the results on the server in real time." From an article on The Verge by Dieter Bohn, September 18, 2013
  50. 50. All Magicians need an Assistant
  51. 51. Typical Industry Staffing: Data Scientist and Data Engineer as separate roles. Skills shown: Visualization • Machine Learning • Extract Load Transform • Data Architecture • Operations and Monitoring • Big Data Infrastructure & Storage • DB Administration • Statistics • Math • Programming • Modeling • Story Telling • Data Exploration. Source: http://www.datasciencecentral.com/profiles/blogs/difference-between-data-engineers-and-data-scientists
  52. 52. Blended Role for Agile: Data Scientist/Data Engineer. Skills shown: Visualization • Machine Learning • Extract Load Transform • Data Architecture • Operations and Monitoring • Big Data Infrastructure & Storage • DB Administration • Statistics • Math • Programming • Modeling • Story Telling • Data Exploration
  53. 53. Process and Culture impact Retention • Kanban for Project Management • Balance long and short term impact • Participate in Industry papers and reviews. Kanban board columns: Backlog, Doing, Validation, Done. Cards: LDA vs PCA vs A13 before stratified sampling • MLADS ARPD Rehearsal • Submit Abstract to Strata + Hadoop World • Edge Experiment 1 Data Processing • Edge Experiment 2 • Customer Sat and Post Sales Monetization Factors Analysis • Install Base Decay Rate estimation using Bayesian Model • Friday Review Slides for Edge Experiment 1 • Edge Experiment 1 Insights Analysis • Top Enterprise DSAT list from textual analysis • Business Entity Graph with DUNS, Domain Name, & TaxIDs • Open Source Entity Graph visualization technology research • Submit Paper to Informs 2016 • ARPD V3 Model with FFF • MLADS ARPD Slides Draft 1 • Device Lifetime Value (LTV) model 2
  54. 54. Trying Again & Again Advantages and Disadvantages of the counting culture
  55. 55. KPIs drive companies and behavior
  56. 56. The 5 Vs of Big Data (nine months ago there were only three Vs) • Variety • Velocity • Volume • Verify (Verification – managing data quality and access control at all points) • Value
  57. 57. Must Count More Counting More Granular Make it go up and to the right Is vs Likely Business Impact is a Given Drives behavior (especially if tied to compensation)
  58. 58. MVP in a Nutshell. Diagram of Minimum, Viable, and Possible Features, with the target at Minimum + Viable • Good features to test the users' responses • Bad user experience: too minimal a set or the wrong set of features will not engage users enough to gain valuable insights • The product you want to build, but to deliver all features will take too long • Wasted work adding features that do not add critical value for winning and retaining customers
  59. 59. Minimum Viable Model (MVM). Diagram of Minimum, Viable, Possible Data, and Possible Features, with the target at Minimum + Viable • The model should provide enough coverage that it can be used for core insights • Many models try to include all data and large numbers of attributes, but that slows down innovation • If precision is too low then the model can't be trusted for even first-level insights • More features can increase complexity without significant improvement in precision and recall. An ideal MVM uses a modest amount of data, implements a relatively simple initial algorithm, has good precision (we aim for 98% or more) and enough recall to be used for core insights.
  60. 60. Keep your eye on the target • The goal is not to get a bull's-eye every time • The goal is to get the data and learn
  61. 61. Test & Ops = Data Science
  62. 62. Six Keys to a “Big” Magic Show • Try, Try, Try Again: The Tyranny of Counting • Magic Tricks (A/B Testing, Runtime Flags) • The Venue (Big Data Infrastructure) • Foundation (Tools for Big Data) • Security (Protection, Privacy, Fraud) • The Assistant: Recruit, Train, & Retain. Chart: “Big Data” Search Trends
  63. 63. Big Data: The Magic to Attain New Heights Ken Johnston Principal Data Science Manager Twitter – @rkjohnston Blog – http://linkedin.com/in/rkjohnston Email – kenj@Microsoft.com LinkedIn - http://linkedin.com/in/rkjohnston @rkjohnston #DataMagic
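
Slide 41 equates a measure that drifts more than 1.5*IQR from its usual range with an outlier, and an outlier with a probable bug. The following is a minimal sketch of that rule, not the speaker's implementation. It assumes a measure is tracked as a list of daily pass rates; the function name `iqr_outliers`, the sample data, and the choice of Python's default quartile method are all illustrative assumptions.

```python
import statistics


def iqr_outliers(rates, k=1.5):
    """Return (index, value) pairs lying more than k*IQR outside the quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(rates, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, r) for i, r in enumerate(rates) if r < low or r > high]


# Hypothetical daily pass rates for one golden-path measure.
daily_pass_rates = [0.991, 0.989, 0.992, 0.990, 0.988,
                    0.993, 0.990, 0.942, 0.991, 0.989]

suspects = iqr_outliers(daily_pass_rates)
for day, rate in suspects:
    print(f"day {day}: pass rate {rate:.3f} is an outlier, investigate as a likely bug")
```

With this sample data only the day with the 0.942 pass rate is flagged. In practice the history window and the multiplier `k` would be tuned per measure.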

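Slides 45 to 47 describe runtime flags that direct a share of traffic through a vNext layer deployed side by side with vCurrent. The sketch below shows one way such routing could work, under stated assumptions: the `RuntimeFlags` class, the `route` helper, the layer names, and hash-based bucketing of request IDs are hypothetical and do not come from the deck.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class RuntimeFlags:
    # Maps a layer name to the fraction of traffic (0.0 to 1.0) forked to vNext.
    vnext_fraction: dict = field(default_factory=dict)

    def version_for(self, layer: str, request_id: str) -> str:
        """Pick vCurrent or vNext for one request at one layer."""
        fraction = self.vnext_fraction.get(layer, 0.0)
        # Hash the layer and request ID so a given request always gets the same path.
        digest = hashlib.sha256(f"{layer}:{request_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
        return "vNext" if bucket < fraction else "vCurrent"


def route(flags, request_id, layers=("Layer A", "Layer B", "Layer C")):
    """Return the (layer, version) path a request takes through the stack."""
    return [(layer, flags.version_for(layer, request_id)) for layer in layers]


# Default path: no flags set, so all production traffic stays on vCurrent.
default_path = route(RuntimeFlags(), "req-123")

# Forked path: send all traffic for Layer B through the vNext deployment.
forked_path = route(RuntimeFlags({"Layer B": 1.0}), "req-123")
print(default_path)
print(forked_path)
```

Setting a fraction such as 0.05 instead of 1.0 would fork roughly five percent of requests, which is how a flag can double as a flighting or A/B testing control.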