Self-Service Data Science for Leveraging ML & AI on All of Your Data

MapR has launched the MapR Data Science Refinery, which leverages a scalable data science notebook with native platform access, superior out-of-the-box security, and access to global event streaming and a multi-model NoSQL database.

Published in: Data & Analytics

  1. Self-Service Data Science for Leveraging ML & AI on All of Your Data: Introducing the MapR Data Science Refinery. Rachel Silver, Product Manager – Data Science & Analytics. 11/16/17. © 2017 MapR Technologies, MapR Confidential.
  2. Summary
      • Why Companies Invest In ML/AI
      • Winning With a Data First Approach
      • Introducing the MapR Data Science Refinery
      • Deep Dive & Demos
        – Ease of Deployment
        – Data Exploration
        – Extensibility & Collaboration
  3. Why Companies Invest In ML/AI
  4. Where AI Creates Value In The Value Chain
      • Project: Smarter R&D and forecasting
      • Produce: Optimized production & maintenance
      • Promote: Targeted sales & marketing
      • Provide: Rich, personal, and convenient user experiences
      Source: McKinsey Global Institute – Artificial Intelligence / The Next Digital Frontier? (2017)
  5. Project Where The Next Threat Will Come From (Retail Bank)
      Deep security analytics and advanced persistent threat (APT) detection
      • Centralization and visibility of all data from an information security perspective
      • Reduced risk of data breaches from DDoS and APT attacks
      • Real-time insights into what is happening within the environment
      OBJECTIVE
      • Early detection of data breaches and suspicious activity
      • Aggregate and retain all security-related data in a single central store, then build statistical models to detect abnormal activity within the environment
      • Get insights into what insiders are doing within the environment
      CHALLENGES
      • Existing SIEM solution could not scale
      • Current solutions do not work well for “unknown” threats
      SOLUTION
      • Leverage MapR-DB for fast data ingestion and query performance
      • MapR provided the deep storage and machine learning algorithms
      • NFS enabled easy integration with the IT ecosystem
  6. Produce More Efficiently (Oil & Gas company)
      ML aggregation and processing at the edge optimizes production
      • Before edge: Source 1, Source 2, through Source 1000 feed the Houston MapR core cluster; manual process; time to insight 48 hrs
      • After edge: 1000s of oil & drill sources do pre-processing locally and at the core (custom app + down sampling); automated process; time to insight <2 hrs
  7. Promote personalized offers in real-time (Leading Credit Card Company)
      Targeting credit card customers using a recommendation engine. A global financial services company wanted to offer real-time localized & personalized recommendations to their credit card holders using ML/AI.
      OBJECTIVE
      • Increase revenue and customer loyalty through real-time personalized offers generated by a recommendation engine
      CHALLENGES
      • In order to be accurate, data had to be updated on a real-time basis
      • Being a global company, their platform has to be consistent and 100% available 24x7, with no downtime
      • Must be able to simultaneously ingest (stream) and update data in the same cluster
      SOLUTION
      • MapR was the only distribution that met the mission-critical needs of the customer and also provided the capability to ingest data continuously into the cluster
      • Direct NFS allows data to be continuously ingested directly into their cluster
      • MapR-XD’s self-healing capability allowed them to go into production safely
  8. Provide Customers With a Customized Experience
      Provide customers with a personalized and convenient experience. Using ML/AI to bring customer understanding to the center of business processes.
      OBJECTIVE
      • Use full knowledge of customer relationship to inform online interactions
      CHALLENGES
      • Need to store 20 trillion records
      • Training sample size is 400 million records
      • The decision trees contained 2 million possible pathways
      • Every combination must be evaluated every time a model is used (~15 billion combinations)
      SOLUTION
      • The MapR Converged Data Platform centralizes analytics and operational apps on one platform, allowing Quantium to make one large infrastructure investment instead of many small siloed ones. The current cluster has 50TB of memory and 5000 CPUs to process and store 5PB of data.
  9. A Winning Approach: Data First
  10. Entry Points in the Data Science Journey
      • Contemplators (40%): Uncertain about the benefits of data science. Desire easy entry.
      • Experimenters (41%): Gartner estimates they solve between 3-20 business problems in three to five years.
      • Adopters (20%): Gartner estimates they solve between 10-100 business problems in three to five years.
      Sources: McKinsey Global Institute – Artificial Intelligence / The Next Digital Frontier? (2017); Gartner – Magic Quadrant for Data Science Platforms (2017)
  11. Entry Points in the Data Science Journey (same breakdown as slide 10, with the highlight: 80%!)
  12. Entry Points in the Data Science Journey (same breakdown as slide 10, with added notes)
      • AI adoption outside of the tech sector is stuck here, and many firms report they are uncertain of the ROI
      • Investment in AI is growing at a high rate, but adoption in 2017 remains low
      • AI is only deployed into production 12% of the time
      Source: McKinsey Global Institute – Artificial Intelligence / The Next Digital Frontier? (2017)
  13. Entry Points in the Data Science Journey (same breakdown as slide 10). Key Traits Of A Successful Data Science Approach:
      • Seamless Data Access
      • Technical Capabilities (a strong digital foundation)
      • Leadership From The Top
      Source: McKinsey Global Institute – Artificial Intelligence / The Next Digital Frontier? (2017)
  14. Seamless Data Access: If it is ALL about the data, then it better be about ALL your data.
  15. ML Models Improve when Trained on Larger Datasets: Instead of relying on assumptions and weak correlations, the presence of more data results in better and more accurate models. Source: A Survey of Applications of AI Algorithms in Eco-environmental modelling (2009)
  16. Data Growth Puts A Premium on Efficient Leverage
      The amount of data is predicted to double every three years.
      • 1986: 3 exabytes of data
      • 2011: 300 exabytes of data
      • 2016: 2 zettabytes of data
      • 2019: 4 zettabytes of data
      Data diversity: Emails, Call Detail Records, Click stream, CSV, Documents, Data, PDF, Billing Data, Meta Data, JSON, Network Data, Mobile Data, XML, Product Catalog, Medical Records, Text Files, Video, Text Messages, Merchant Listings, Sensor Data, Server Logs, Set Top Box, Social Media, Audio
      Source: McKinsey Global Institute: “The Age of Analytics”, Dec. 2016
  17. Hadoop + Vendor Approach to Data Science: Requires yet another cluster. On premises: Batch cluster, Streaming cluster, NoSQL cluster, Data Science cluster
  19. A Capable Platform With a Strong Digital Foundation. Diagram labels: MapR Converged Data Platform (on-premises, multi-cloud, IoT edge); NFS, POSIX, REST, HDFS, JSON, HBase, Kafka, SQL; file store, container store, metadata management; custom file apps, Hadoop & Spark apps, real-time BI apps, streaming apps, IoT/edge; operational data hub, CDC, contextual user experiences, core business apps, single view, IoT
  20. Real-time Machine Learning Pipelines: A Robust Microservices Framework
      • Event streams: persistent, infinitely replicable, re-playable
      • Compare model results live! (Model A, Model B)
      • Persistent client & application containers
  21. Advice For Leadership
      Avoid
      • Creating new silos
      • Looking for a one-trick pony
      • Adopting tools that have unwieldy install, integration, and configuration processes
      • Tools that don’t scale to broader enterprise use
      Important
      • Ensure secure role-based access to all data
      • Adopt tools that meet the needs of a broad range of data science teams
      • Encourage adoption by making things easy, secure, and complete
  22. Data Science @ MapR
  23. The MapR Data Science Vision: A Holistic Approach To Self-Service Data Science
      • MapR Data Science Refinery: An easy-to-deploy, secure, and extensible data science offering that leverages all existing platform assets
      • Refinery Data Scientists: Data scientist-led product-and-services offerings, including Quick Start Solutions (QSS) & training
      • Refinery Partnerships: Expand on what we offer in-product to meet the needs of all data science teams
      • MapR Converged Data Platform
  24. MapR Data Science Refinery: An Enterprise-ready Data Science Notebook
      Provides the ability to work across many engines in one visual space:
      • Apache Spark: Spark Streaming, SparkSQL, SparkR, and PySpark
      • Apache Hive
      • Apache Pig
      • Apache Drill
      • Python
      • Shell access to MapR-FS
      • Programmatic access to MapR-DB and MapR-ES in Spark
      Pluggable visualization available via Helium!
      Diagram labels: MapR POSIX Client for Containers, MapR Converged Client for Containers
  25. MapR Data Science Refinery Benefits
      Easy to Deploy
      • A Docker image includes all the necessary bits (no more, no less) required to leverage MapR as a persistent data store for your data science output
      • Available on DockerHub
      Secure
      • Authentication occurs at the container level to ensure containerized applications only have access to data for which they are authorized
      • Communications are encrypted to ensure privacy when accessing data in MapR
      Extensible
      • A Dockerfile is also available on GitHub, allowing you to further customize the image as needed to support your specific application needs
      • The Helium Framework enables pluggable visualization
      Leverage locally, on-premises, or in the cloud.
      Diagram labels: MapR Converged Data Platform: cloud-scale data store (MapR-XD), operational database (MapR-DB), event streaming (MapR-ES); high availability, real-time, unified security, multi-tenancy, disaster recovery, global namespace
  26. Partner Integration: An Example. We’re enabling our partners to integrate with and use this product. Diagram labels: DataScience.com Platform Services, MapR DSR, Zeppelin, Livy, JDBC, MapR Clients
  28. Demo: Ease of Deployment & Data Exploration
  29. Demo: Ease of Deployment. What’s in the command:
        docker run --rm -it \
          --cap-add SYS_ADMIN --cap-add SYS_RESOURCE --device /dev/fuse \
          --memory 0 \
          -e MAPR_CLUSTER=my.cluster.com \
          -e MAPR_MEMORY=0 \
          -e MAPR_MOUNT_PATH=/mapr \
          -e MAPR_TZ=America/Los_Angeles \
          -e MAPR_CONTAINER_USER=mapr \
          -e MAPR_CONTAINER_UID=5000 \
          -e MAPR_CONTAINER_GROUP=mapr \
          -e MAPR_CONTAINER_GID=5000 \
          -e MAPR_CLDB_HOSTS=172.24.8.195,172.24.11.200,172.24.10.4 \
          -e MAPR_TICKETFILE_LOCATION=/tmp/maprticket_5000 \
          -e ZEPPELIN_SSL_PORT=9995 \
          -e HOST_IP=172.24.11.62 \
          -e MAPR_HS_HOST=172.24.8.195 \
          -p 9995:9995 -p 10000-10010:10000-10010 \
          -v /tmp/maprticket_5000:/tmp/maprticket_5000:ro \
          -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
          maprtech/data-science-refinery:v1.0_6.0.0_4.0.0_centos7
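The command on this slide is easier to audit when the cluster-specific values are pulled into variables. The following is a minimal sketch, not an official MapR script: every flag and value is copied from the slide, while the function name `refinery`, the variable names, and the `REFINERY_MODE` switch are assumptions added for illustration. By default it only prints the command, so it can be tried on a machine without Docker.

```shell
#!/bin/sh
# Sketch: assemble the slide's `docker run` command from a few variables.
# Flags and values come from the slide; names and the print/run switch
# are illustrative assumptions.
set -eu

CLUSTER="my.cluster.com"
CLDB_HOSTS="172.24.8.195,172.24.11.200,172.24.10.4"
HOST_ADDR="172.24.11.62"
HS_HOST="172.24.8.195"
TICKET="/tmp/maprticket_5000"
IMAGE="maprtech/data-science-refinery:v1.0_6.0.0_4.0.0_centos7"

# refinery MODE: MODE=run executes the command, anything else prints it.
refinery() {
  mode=$1
  set -- docker run --rm -it \
    --cap-add SYS_ADMIN --cap-add SYS_RESOURCE --device /dev/fuse \
    --memory 0 \
    -e MAPR_CLUSTER="$CLUSTER" \
    -e MAPR_MEMORY=0 \
    -e MAPR_MOUNT_PATH=/mapr \
    -e MAPR_TZ=America/Los_Angeles \
    -e MAPR_CONTAINER_USER=mapr \
    -e MAPR_CONTAINER_UID=5000 \
    -e MAPR_CONTAINER_GROUP=mapr \
    -e MAPR_CONTAINER_GID=5000 \
    -e MAPR_CLDB_HOSTS="$CLDB_HOSTS" \
    -e MAPR_TICKETFILE_LOCATION="$TICKET" \
    -e ZEPPELIN_SSL_PORT=9995 \
    -e HOST_IP="$HOST_ADDR" \
    -e MAPR_HS_HOST="$HS_HOST" \
    -p 9995:9995 -p 10000-10010:10000-10010 \
    -v "$TICKET:$TICKET:ro" \
    -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    "$IMAGE"
  if [ "$mode" = "run" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Default to printing; set REFINERY_MODE=run to actually start the container.
refinery "${REFINERY_MODE:-print}"
```

Keeping the ticket path in one variable means the `-e MAPR_TICKETFILE_LOCATION` value and the read-only `-v` mount cannot drift apart.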
  34. Demo: Ease of Deployment. How is Security Handled?
        $ maprlogin password
        [Password for user 'jane' at cluster 'my.cluster.com': ]
        MapR credentials of user 'jane' for cluster 'my.cluster.com' are written to '/tmp/janes_ticket'
      Job submits as 'jane'.
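The ticket written by `maprlogin password` is what the container later uses to act as that user, so it is worth checking that the file exists before launching. This is a minimal sketch under stated assumptions: the helper name `ticket_flags` is hypothetical, the flags it prints are the two ticket-related flags from the deployment command (`-e MAPR_TICKETFILE_LOCATION` and a read-only `-v` mount), and the demo uses a throwaway stand-in file, not a real ticket.

```shell
#!/bin/sh
# Sketch: validate a ticket file and print the docker flags that expose it
# read-only inside the container. `ticket_flags` is a hypothetical helper.
set -eu

# ticket_flags PATH: fail if PATH is missing or empty, else print the flags.
ticket_flags() {
  ticket=$1
  if [ ! -s "$ticket" ]; then
    echo "ticket not found or empty: $ticket" >&2
    return 1
  fi
  echo "-e MAPR_TICKETFILE_LOCATION=$ticket -v $ticket:$ticket:ro"
}

# Demo with a placeholder file; a real ticket would come from
# `maprlogin password` as shown on the slide.
demo_ticket=$(mktemp)
echo "placeholder-ticket" > "$demo_ticket"
ticket_flags "$demo_ticket"
rm -f "$demo_ticket"
```

Mounting the ticket read-only matches the `:ro` suffix in the slide's command and keeps the container from modifying the credential.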
  35. Demo: Ease of Deployment. Why Livy?
      Advantages over the native Spark interpreter:
      • Jobs are submitted in YARN cluster mode
      • Spark context can be shared
      • Support for Spark Dynamic Resource Allocation
      Diagram labels: HTTP (RPC); MapR Converged Data Platform: cloud-scale data store (MapR-XD), operational database (MapR-DB), event streaming (MapR-ES)
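Livy exposes Spark over a REST interface, which is what makes shared contexts and remote job submission possible. The sketch below illustrates that interface and is not a description of how Zeppelin calls it internally: it builds the JSON bodies for Livy's `POST /sessions` and `POST /sessions/{id}/statements` endpoints and prints the matching `curl` commands without sending them. The endpoint `http://localhost:8998`, session id `0`, and the helper names are assumptions; the two `spark.*` keys are standard Spark settings for dynamic allocation.

```shell
#!/bin/sh
# Sketch: build Livy REST requests (printed, not sent).
# LIVY_URL, the session id, and the helper names are assumptions.
set -eu

LIVY_URL="${LIVY_URL:-http://localhost:8998}"

# session_payload KIND: JSON body for POST /sessions, asking Spark for
# dynamic resource allocation.
session_payload() {
  printf '{"kind": "%s", "conf": {"spark.dynamicAllocation.enabled": "true", "spark.shuffle.service.enabled": "true"}}\n' "$1"
}

# statement_payload CODE: JSON body for POST /sessions/{id}/statements.
# No escaping is done, so CODE must not contain double quotes or backslashes.
statement_payload() {
  printf '{"code": "%s"}\n' "$1"
}

session_json=$(session_payload pyspark)
stmt_json=$(statement_payload 'sc.parallelize(range(100)).sum()')

echo "curl -s -X POST -H 'Content-Type: application/json' -d '$session_json' $LIVY_URL/sessions"
echo "curl -s -X POST -H 'Content-Type: application/json' -d '$stmt_json' $LIVY_URL/sessions/0/statements"
```

Because statements are posted to an existing session, several clients can reuse one Spark context, which is the sharing advantage the slide lists.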
  36. Demo: Extensibility & Collaboration
  37. Demo: Extensibility & Collaboration. Collaboration. Diagram labels: MapR Converged Data Platform: cloud-scale data store (MapR-XD), operational database (MapR-DB), event streaming (MapR-ES)
  38. Demo: Extensibility & Collaboration. Collaboration. Diagram labels: MapR POSIX Client for Containers; MapR Converged Data Platform: cloud-scale data store (MapR-XD), operational database (MapR-DB), event streaming (MapR-ES)
  39. Demo: Extensibility & Collaboration. What’s in the command:
        docker run --rm -it \
          --cap-add SYS_ADMIN --cap-add SYS_RESOURCE --device /dev/fuse \
          --memory 0 \
          -e MAPR_CLUSTER=my.cluster.com \
          -e MAPR_MEMORY=0 \
          -e MAPR_MOUNT_PATH=/mapr \
          -e ZEPPELIN_NOTEBOOK_DIR=/mapr/my.cluster.com/user/mapr/zeppelin/shared-notebooks/ \
          -e MAPR_TZ=America/Los_Angeles \
          -e MAPR_CONTAINER_USER=mapr \
          -e MAPR_CONTAINER_UID=5000 \
          -e MAPR_CONTAINER_GROUP=mapr \
          -e MAPR_CONTAINER_GID=5000 \
          -e MAPR_CLDB_HOSTS=172.24.8.195,172.24.11.200,172.24.10.4 \
          -e MAPR_TICKETFILE_LOCATION=/tmp/maprticket_5000 \
          -e ZEPPELIN_SSL_PORT=9995 \
          -e HOST_IP=172.24.11.62 \
          -e MAPR_HS_HOST=172.24.8.195 \
          -p 9995:9995 -p 10000-10010:10000-10010 \
          -v /tmp/maprticket_5000:/tmp/maprticket_5000:ro \
          -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
          maprtech/data-science-refinery:v1.0_6.0.0_4.0.0_centos7
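The only difference from the earlier deployment command is `ZEPPELIN_NOTEBOOK_DIR`, which points notebook storage at a path under the MapR mount so that every container started with the same value sees the same notebooks. A minimal sketch of how that path is composed, assuming the layout shown on the slide (`<mount>/<cluster>/user/<user>/zeppelin/shared-notebooks/`); the helper name `notebook_dir` is hypothetical.

```shell
#!/bin/sh
# Sketch: derive the shared notebook directory from the mount path,
# cluster name, and user, following the path layout on the slide.
set -eu

# notebook_dir MOUNT CLUSTER USER: print the shared notebook path.
notebook_dir() {
  echo "$1/$2/user/$3/zeppelin/shared-notebooks/"
}

# Two containers launched with this same flag share one set of notebooks.
echo "-e ZEPPELIN_NOTEBOOK_DIR=$(notebook_dir /mapr my.cluster.com mapr)"
```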
  40. Demo: Extensibility. Adding Deep Learning libraries to the container
  41. Demo: Extensibility. Adding Deep Learning libraries to the container. Diagram labels: Compute; Persistent Storage; MapR Converged Data Platform: cloud-scale data store (MapR-XD), operational database (MapR-DB), event streaming (MapR-ES)
  42. Demo: Extensibility. Adding Deep Learning libraries to the container. What if this was a box of GPUs? Diagram labels: Compute; Persistent Storage; MapR Converged Data Platform: cloud-scale data store (MapR-XD), operational database (MapR-DB), event streaming (MapR-ES)
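One way to add deep learning libraries is to layer them on top of the published image. The sketch below is an illustration under several assumptions, not the procedure from the demo: it assumes `pip` is available inside the image, uses `tensorflow` purely as an example library, and the helper name `write_dockerfile` and image tag `my-refinery-dl` are hypothetical. It writes a two-line Dockerfile to a temporary directory and prints the `docker build` command without running it. Customizing the Dockerfile published on GitHub, as the deck mentions, is the alternative route.

```shell
#!/bin/sh
# Sketch: generate a Dockerfile that extends the Refinery image with
# extra Python libraries. Assumes pip exists in the base image.
set -eu

BASE_IMAGE="maprtech/data-science-refinery:v1.0_6.0.0_4.0.0_centos7"

# write_dockerfile DIR LIB...: write DIR/Dockerfile installing the LIBs.
write_dockerfile() {
  dir=$1
  shift
  {
    echo "FROM $BASE_IMAGE"
    echo "RUN pip install $*"
  } > "$dir/Dockerfile"
}

workdir=$(mktemp -d)
write_dockerfile "$workdir" tensorflow
cat "$workdir/Dockerfile"

# Printed rather than executed, so the sketch runs without Docker.
echo "docker build -t my-refinery-dl $workdir"
```

Because notebooks and data live on the platform, the customized compute image can be rebuilt or moved to different hardware without touching persistent storage.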
  43. A Final Comparison. Traditional Hadoop vendor, on premises: Batch cluster, Streaming cluster, NoSQL cluster, Data Science cluster
  44. Q&A. Engage with us: @mapr, rsilver@mapr.com
