Cloudera Analytics and Machine Learning Platform - Optimized for Cloud

Making Data Real for FinServ
Jun. 13, 2018

  1. © Cloudera, Inc. All rights reserved. THE MODERN, OPEN-SOURCE-BASED, CLOUD-OPTIMIZED BIG DATA PLATFORM FOR MACHINE LEARNING & ANALYTICS. Stefan Lipp & Frank Hereygers / June 2018
  2. CLOUDERA’S COMMITMENTS
     • Anything that stores your data
     • Any APIs your applications call
     • Uses open source code
     • Our contributions and fixes go back to open source first
     • When possible, use projects supported by multiple commercial vendors
     • Keeping your cluster running
     • Cloudera CDH edition
     • No limit to number of servers
     • Managing your applications
     • Employ* committers, if not PMC members, on the projects we support (* People manage their own careers. Temporary gaps may exist.)
     • High availability features
     • Open source
     • Subscription expiration won’t stop the cluster
     • Free to use forever
     • RBAC over your data
  3. OUR GOAL: CUSTOMER SUCCESS WITH OPEN SOURCE
     • By innovating in open source: Some vendors consume the open source community’s activity; others help drive it. Cloudera leads in influencing the Hadoop platform's evolution by creating, contributing, donating (Apache Sentry, Apache Impala, Apache Kudu) and supporting new capabilities that meet customer requirements for security, scale, and usability.
     • By curating open standards: Cloudera has a long and proven track record of identifying, curating, and supporting the open standards (including Apache HBase, Apache Solr, Apache Spark and Apache Kafka) that provide the mainstream, long-term architecture upon which new customer use cases are built.
     • By meeting the highest enterprise requirements: To ensure the best customer experience, Cloudera invests significant resources in multi-dimensional testing on real workloads before releases, as well as in supportability of the entire platform via extensive involvement in the open source community.
  4. CDH: CLOUDERA DISTRIBUTION OF HADOOP
     • INTEGRATE: structured (Sqoop); unstructured (Kafka, Flume)
     • PROCESS, ANALYZE, SERVE: batch (Spark, Hive, Pig, MapReduce); stream (Spark); SQL (Impala); search (Solr)
     • UNIFIED SERVICES: resource management (YARN, ZooKeeper); security (Sentry)
     • STORE: filesystem (HDFS); relational (Kudu); NoSQL (HBase)
     • OPERATIONS: Cloudera Manager “Express”
     • Ensure that disparate Apache projects work together reliably
     • Provide enterprise-class capabilities initially not addressed by Apache
     • Create sustainability
  5. CDH6: GIANT LEAP FORWARD. Hadoop 3, Hive 2.1, HBase 2, Spark 2.2, Parquet 1.9, Solr 7, Oozie 5, Sentry 2, Kafka 1, Avro 1.8, ZooKeeper 3.4, Flume 1.8, Sqoop 1.4, Pig 0.17. Currently in beta, GA by mid-year.
  6. CLOUDERA SUBSCRIPTION EXTENDS ON THE EDGES
     • INTEGRATE: structured (Sqoop); unstructured (Kafka, Flume)
     • PROCESS, ANALYZE, SERVE: batch (Spark, Hive, Pig, MapReduce); stream (Spark); SQL (Impala); search (Solr)
     • UNIFIED SERVICES: resource management (YARN, ZooKeeper); security (Sentry)
     • STORE: filesystem (HDFS); relational (Kudu); NoSQL (HBase)
     • DATA MANAGEMENT: Cloudera Navigator, Navigator Encrypt, Navigator Optimizer
     • OPERATIONS: Cloudera Manager, Cloudera Director, Cloudera Altus
     • DATA SCIENCE ENABLEMENT: Cloudera Data Science Workbench
     • Subscription adds: enhancements based on customers’ needs; 24x7 support; rolling upgrades; data governance and lineage; automated backup and recovery; full disk encryption; hybrid & portable multicloud usage; data science enablement
     • With partners: rigorous testing and certification cycles
     • #1 Goal: Maximum value with minimum risk
  7. BIG DATA MARKET EVOLUTION
     • BIG DATA TECH: IT early adopters & developers
     • DATA PLATFORM: CIO & data admins
     • ML, ANALYTICS & CLOUD: LOB & data scientists
     • DIGITAL TRANSFORMATION powered by data: C-suite & boards
  8. EARLY STAGE: CHAIN OF BIG DATA TOOLS
     Stages: Data Sources; Data Ingest; Data Storage & Processing; Serving, Analytics & Machine Learning
     • Data Sources: Connected Things / Data Sources; Structured Data Sources
     • Apache Kafka: stream or batch ingestion of IoT data
     • Apache Sqoop: ingestion of data from relational sources
     • Apache Hadoop: storage (HDFS) & deep batch processing
     • Apache Spark: stream & iterative processing, ML
     • Apache Kudu: storage & serving for fast-changing data
     • Apache HBase: NoSQL data store for real-time applications
     • Apache Impala: MPP SQL for fast analytics
     • Cloudera Search: real-time search
  9. EARLY STAGE: CHAIN OF CLOUD BIG DATA TOOLS
  10. CLOUDERA DIRECTOR: Infrastructure-as-a-Service, automate cluster provisioning. OPERATIONAL DATABASE, DATA ENGINEERING, ANALYTIC DATABASE, DATA SCIENCE; Cloudera Director (cloud provider APIs)
  11. WHAT IS A BIG DATA WORKLOAD? Data + Compute + Data Context
     Data Context:
     • Schema definitions (HMS)
     • Security authorizations (Sentry)
     • Metadata (Navigator)
     • Business glossary (Navigator)
     • Data lineage (Navigator)
     • Audit logs (Navigator)
  12. LIFT & SHIFT
     • ON PREMISES: Cloudera cluster (persistent). COMPUTE: Data Engineering, Analytics, Data Science. DATA CONTEXT: Security, Metadata, Governance. STORAGE: HDFS
     • PUBLIC CLOUD (customer VPC): Cloudera cluster (persistent). COMPUTE: Data Engineering, Analytics, Data Science. DATA CONTEXT: Security, Metadata, Governance. STORAGE: cloud object store
  13. EVOLUTION PHASE 1: DATA MANAGEMENT PLATFORM. Integrated data, workflows, metadata, security, governance, ...
     • Extensible services: analytic database, data science, operational database, data engineering
     • Core services: security, governance, workload management, ingest & replication, data catalog
     • Storage services: Amazon S3, Microsoft ADLS, HDFS, Kudu
  14. EVEN AVAILABLE AS PLATFORM AS A SERVICE. Portable code, APIs, data, workflows, metadata, security, governance, ...
     Diagram labels: customer cloud (compute, storage); CLI, Web, SDK; Altus Analytic Database; Altus Data Engineering; Altus control plane
  15. NOW: THE NEXT CHALLENGE. Balance these needs:
     DATA SCIENCE
     • Access to granular data
     • Flexibility: preferred open source tools
     • Elastic provisioning of compute and storage
     • Reproducible research
     • Path to production
     DATA MANAGEMENT
     • Security
     • Governance
     • Standards
     • Low maintenance
     • Low cost
     • Self-service access
  16. THE TYPICAL DATA SCIENTIST
     “If I can’t use my favorite tools, I’ll…”
     • Copy data to my laptop
     • Copy data to a data science appliance
     • Copy data to a cloud service
     Why this is a problem:
     • Complicates security
     • Breaks data governance
     • Adds latency to process
     • Makes collaboration more difficult
     • Complicates model management and deployment
     • No model governance
  17. DATA SCIENCE / MACHINE LEARNING AT CLOUDERA
     Our philosophy: We empower our customers to run their business on data with an open platform:
     • Your data
     • Open algorithms
     • Running anywhere
     We accelerate enterprise data science.
  18. THE IMPORTANCE OF AN OPEN DATA SCIENCE ECOSYSTEM. Open ecosystem / Black box
  19. CURRENT INNOVATION: MACHINE LEARNING PLATFORM. Enable applied machine learning from research to production
  20. CLOUDERA DATA SCIENCE WORKBENCH. Accelerate Machine Learning from Research to Production
     For data scientists
     • Experiment faster: Use R, Python, or Scala with on-demand compute and secure CDH data access
     • Work together: Share reproducible research with your whole team
     • Deploy with confidence: Get to production repeatably and without recoding
     For IT professionals
     • Bring data science to the data: Give your data science team more freedom while reducing the risk and cost of silos
     • Secure by default: Leverage common security and governance across workloads
     • Run anywhere: On-premises or in the cloud
  21. PLATFORM FOR DATA SCIENCE & MACHINE LEARNING
     • Open platform
     • Complete lifecycle
     • Team collaboration
     • Enterprise ready
     • Runs anywhere
     Diagram layers: DEPLOYMENT (research | production); COMPUTE (local | Spark | Impala); ALGORITHMS (open source ecosystem); TOOLS (self-service); APPS (solutions | use cases); cloud and on-premises storage (ADLS, S3, HDFS, Kudu); catalog | security | governance
  22. A MODERN DATA SCIENCE ARCHITECTURE. Containerized environments with scalable, on-demand compute
     • Built with Docker and Kubernetes
     • Isolated, reproducible user environments
     • Supports both big and small data
     • Local Python, R, Scala runtimes
     • Schedule & share GPU resources
     • Run Spark, Impala, and other CDH services
     • Secure and governed by default
     • Easy, audited access to Kerberized clusters
     • Leverages SDX platform services
     • Deployed with Cloudera Manager
     Diagram labels: Cloudera Manager; gateway node(s) running CDSW (master, engines); CDH nodes (Hive, HDFS, ...)
  23. ACCELERATED DEEP LEARNING WITH GPUs. Multi-tenant GPU support on-premises or cloud
     • Extend CDSW to deep learning
     • Schedule & share GPU resources
     • Train on GPUs, deploy on CPUs
     • Works on-premises or cloud
     Diagram labels: CDSW (GPU, CPU), single-node training; CDH (CPU), distributed training, scoring
     “Our data scientists want GPUs, but we need multi-tenancy. If they go to the cloud on their own, it’s expensive and we lose governance.”
     GPU on CDH coming in C6
  24. SUMMARY. Cloudera helps with open source data management AND machine learning
     • DATA MANAGEMENT: Enterprise Data Hub with SDX provides a unified foundation.
     • MACHINE LEARNING: Data Science Workbench enables collaborative self-service.
     • APPLIED RESEARCH: Fast Forward Labs cuts through the hype.
  25. ONE MORE THING: https://www.cloudera.com/products/altus.html
  26. ALTUS ARCHITECTURE
     • Cloud IaaS: Cloudera cluster (transient / persistent). COMPUTE: Data Engineering, Analytics, Data Science. DATA CONTEXT: Security, Metadata, Governance. STORAGE: cloud object store
     • Altus PaaS, customer VPC: Cloudera clusters (transient, Altus) with COMPUTE for Data Engineering and for Analytics; Cloudera cluster (persistent, Director) with COMPUTE and DATA CONTEXT; STORAGE: cloud object store
     • Altus PaaS, Cloudera VPC: Cloudera Altus control plane; DATA CONTEXT
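
Slides 11 and 12 describe a workload as data plus compute plus data context, and lift & shift as swapping HDFS for a cloud object store while compute and context stay the same. A minimal Python sketch of that idea follows. The class names, field names, and the `nightly-etl` workload are illustrative assumptions for this note only; none of them is a Cloudera API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DataContext:
    """Shared context from slide 11 (field names are hypothetical)."""
    schema_store: str = "HMS"          # schema definitions
    authorization: str = "Sentry"      # security authorizations
    governance: str = "Navigator"      # metadata, lineage, audit logs


@dataclass(frozen=True)
class Workload:
    """Workload = data (storage) + compute + data context."""
    name: str
    compute: str
    storage: str
    context: DataContext


def lift_and_shift(workload: Workload, object_store: str) -> Workload:
    """Return a copy that points at a cloud object store.

    Only the storage field changes; compute and data context are carried over.
    """
    return replace(workload, storage=object_store)


# Hypothetical on-premises workload backed by HDFS.
on_prem = Workload("nightly-etl", "Data Engineering", "HDFS", DataContext())
in_cloud = lift_and_shift(on_prem, "Amazon S3")

print(in_cloud.storage)
print(in_cloud.context == on_prem.context)
```

Keeping the context in its own immutable object mirrors the slides' point that schemas, authorizations, and governance travel with the workload rather than being rebuilt per cluster.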