Data Science in the Enterprise

Amr Awadallah's slides from his talk at TIBCO in collaboration with The Hive Think Tank on May 11th, 2017.

  1. Data Science in the Enterprise. Amr Awadallah (@awadallah), Founder, Chief Technical Officer, Cloudera. © Cloudera, Inc. All rights reserved.
  2. Typical Data Science Workflow
       • Stages: Data Engineering; Data Science (Exploratory); Production (Operational)
       • Activities: Data Wrangling; Visualization and Analysis; Model Training & Testing; Production Data Pipelines; Batch Scoring; Online Scoring
       • Also shown: Acquisition; Processing; Serving; Data Governance
  3. Types of Data Science
     Exploratory (discover and quantify opportunities)
       • Team: Data scientists and analysts
       • Goal: Understand data, develop and improve models, share insights
       • Data: New and changing; often sampled
       • Environment: Local machine, sandbox cluster
       • Tools: R, Python, SAS/SPSS, SQL; notebooks; data wrangling/discovery tools, …
       • End State: Reports, dashboards, PDF, MS Office
     Operational (deploy production systems)
       • Team: Data engineers, developers, SREs
       • Goal: Build and maintain applications, improve model performance, manage models in production
       • Data: Known data; full scale
       • Environment: Production clusters
       • Tools: Java/Scala, C++; IDEs; continuous integration, source control, …
       • End State: Online/production applications
  4. Common Limitations
       • Access: Secured clusters are often hard for data science professionals to connect to, either because they don’t have the right permissions or because resources are too scarce to afford them access. In addition, popular frameworks and libraries don’t read Hadoop data formats out of the box.
       • Scale: Notebook environments seldom have enough data storage for medium data, let alone big data. Data scientists are often relegated to sample data and constrained when working on distributed systems. Popular frameworks and libraries don’t easily parallelize across the cluster.
       • Developer Experience: Popular notebooks don’t work well with access engines like Spark, and package deployment and dependency management across multiple software versions are often hard to manage. Once a model is built, there is no easy path from model development to production.
  5. Management of Dependencies
  6. Open Data Science in the Enterprise
       • IT: drive adoption while maintaining compliance
       • Data Scientist: explore, experiment, iterate
  8. Introducing Cloudera Data Science Workbench: self-service data science for the enterprise
     Accelerates data science from development to production with:
       • Secure self-service environments for data scientists to work against Cloudera clusters
       • Support for Python, R, and Scala, plus project dependency isolation for multiple library versions
       • Workflow automation, version control, collaboration and sharing
  9. How does CDSW help?
       • Visualize results
       • Change and compile source code
       • Retrain and redeploy
       • Extensible engines
       • Configurable sessions
       • Trivial to tweak parameters
       • Multiple users
       • Roles/Governance
       • CDH
  10. The Importance of an Open Ecosystem: Open Ecosystem vs. Black Box
  11. Demo
  12. Key Benefits: How is Cloudera Data Science different?
       • Works with fully secured clusters
       • One tool for multiple standard languages (Python, R, Scala)
       • Multi-tenant architecture
       • Common platform
  13. A conference for and by practicing data scientists! Save the Date: July 20th at the Chapel, San Francisco. Wrangle is a one-day, single-track community event that hosts the best and brightest in the Bay Area talking about the principles, practice, and application of Data Science across multiple data-rich industries. Join Cloudera, Facebook, Netflix and more to discuss future trends, how they can be predicted, and most importantly, how they can be anticipated. #wrangleconf | Powered by Cloudera
  14. Thank You. Amr Awadallah (@awadallah)
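
The workflow on slide 2 moves from model training and testing to batch scoring and online scoring. The slides do not show any code, so the following is only a minimal sketch of that hand-off. It assumes a toy one-variable linear model fitted by ordinary least squares, and the function names `train`, `score_online`, and `score_batch` are hypothetical. It uses only the Python standard library rather than any Cloudera or Spark API.

```python
from statistics import mean


def train(samples):
    """Fit y = slope * x + intercept by ordinary least squares.

    `samples` is a list of (x, y) pairs, standing in for the sampled
    data an exploratory workflow typically trains on.
    """
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    variance = sum((x - x_bar) ** 2 for x in xs)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in samples)
    slope = covariance / variance
    return {"slope": slope, "intercept": y_bar - slope * x_bar}


def score_online(model, x):
    """Online scoring: score one record at request time."""
    return model["slope"] * x + model["intercept"]


def score_batch(model, xs):
    """Batch scoring: score a whole collection of records in one pass."""
    return [score_online(model, x) for x in xs]


# Exploratory step: train on a small, illustrative sample.
sample = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
model = train(sample)

# Operational step: reuse the same model for batch and online scoring.
batch_scores = score_batch(model, [4.0, 5.0])
online_score = score_online(model, 10.0)
```

Both scoring paths share one model object, which is the point of the sketch: the exploratory output becomes the input to the operational side. In a real deployment the model would be serialized and loaded by separate batch and serving systems.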