Easy Data Science Deployment with the Anaconda Platform

Presented by Peter Wang at the Gartner Data Analytics Summit, March 2017.

Published in: Data & Analytics

  1. Anaconda – Open Data Science Platform (© 2016 Continuum Analytics - Confidential & Proprietary)
  2. Continuum Analytics is the company behind Anaconda and offers:
     – Open-Source Software
     – Commercial Software
     – Training
     – Consulting
     Anaconda is the leading Open Data Science platform powered by Python, the fastest growing open data science language.
     Accelerate, Connect & Empower
  3. Quickly Engage with Your Data
     Anaconda: Modern, Open Data Science Platform powered by Python
     – 730+ Popular Python & R packages
     – Compiled for Windows, Mac, and Linux
     – Package Distribution is free for everyone
     – Foundation of our Enterprise Platform
     – Extensible via Conda Package Manager
     – Easily sandbox and deploy packages & analytical computing environments
  4. Anaconda is Trusted by Industry Leaders
     • Financial Services: risk management, quant modeling, data exploration and processing, algorithmic trading, compliance reporting
     • Government: fraud detection, data crawling, web & cyber data analytics, statistical modeling
     • Healthcare & Life Sciences: genomics data processing, cancer research, natural language processing for health data science
     • High Tech: customer behavior, recommendations, ad bidding, retargeting, social media analytics
     • Retail & CPG: engineering simulation, supply chain modeling, scientific analysis
     • Oil & Gas: pipeline monitoring, noise logging, seismic data processing, geophysics
  5. Conda: Package and Environment Management
     Platforms: Windows, Mac OSX, Linux
     – Install packages
     – Update packages
     – Create sandboxes: Conda environments
     – Conda environments: Critical for reproducibility, collaboration & scale
     Diagram labels:
     • Env 1: Python 2.7
     • Env 2: Python 3.4, Pandas v0.18, Jupyter
     • Env 3: R, R Essentials
     • Package versions shown: NumPy v1.11, NumPy v1.10, Pandas v0.16
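
The environments on the conda slide can be described as data and turned into `conda create` command lines. The following is a minimal sketch, not something from the deck: the environment names and the assignment of NumPy and pandas versions to particular environments are assumptions (the transcript lists the versions without saying where they belong), and the helper names are hypothetical. The code only builds the command strings; it does not run conda.

```python
import shlex

# Hypothetical environment specs. Which NumPy/pandas pin belongs to which
# environment is an assumption; the slide transcript does not say.
ENVIRONMENTS = {
    "env1": ["python=2.7", "numpy=1.10", "pandas=0.16"],
    "env2": ["python=3.4", "numpy=1.11", "pandas=0.18", "jupyter"],
    "env3": ["r-essentials"],
}


def conda_create_command(name, packages):
    """Return the argv list for creating one isolated conda environment."""
    return ["conda", "create", "--yes", "--name", name] + list(packages)


def render(argv):
    """Join an argv list into a shell-safe command string."""
    return " ".join(shlex.quote(arg) for arg in argv)


for env_name, env_packages in ENVIRONMENTS.items():
    print(render(conda_create_command(env_name, env_packages)))
```

Because each environment pins its own interpreter and library versions, two projects with conflicting requirements can live side by side on one machine. Depending on the conda configuration, the R packages may need an extra channel argument.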
  6. Continuum Sponsored Open-Source Projects
     • Bokeh – Interactive Web Visualizations and Applications
     • Dask – Painless distributed and parallel computations in Python
     • Numba – JIT for Python applications
     • Jupyter, Spyder – Notebooks and IDE for data science
     • Pandas, Datashader, Blaze, …
  7. Anaconda Enterprise Components
     Categories: OPEN DATA SCIENCE, DATA SCIENCE GOVERNANCE, DATA SCIENCE COLLABORATION, DATA SCIENCE FOR BIG DATA
     • Anaconda
       – High performance Python & R
       – 720+ data science packages
       – Cross-platform package, dependency & environments
       – Community driven package repository collaboration
     • Anaconda Navigator
       – Desktop Portal & Installer
     • Anaconda Repository
       – Storage & sharing of packages, environments, notebooks
       – On-premise governance
       – Enterprise authentication
     • Anaconda
       – Deep Learning: Theano, Tensorflow, Caffe, Keras, Neon, Lasagne
       – Natural Language Processing: NLTK, spaCy
       – Machine Learning: Scikit-learn
       – GPU enablement
     • Anaconda Enterprise Notebooks
       – Collaborative project based workflows for Python & R
       – Enterprise authentication & permissioning
       – Notebook sharing, versioning, search, differencing
     • Anaconda
       – Interactive browser based dashboards & visualizations with Bokeh
       – Bokeh apps using Python, R, Scala
     • Anaconda Scale
       – Hadoop & Spark integration
       – Scalable distributed processing framework
       – Integration with resource management & data stores
       – Distributed package, dependency & environments
     • Anaconda Fusion
       – Integration of Open Data Science with Microsoft Excel®
       – Big Data querying & transformations
  8. Anaconda Enterprise: Open Source Without Anxiety: Governance and Scalability
     • On-premises package repository
       – Governance for your analytics environment
       – Empower your data scientists within the structure of enterprise IT
     • Enterprise notebook collaboration
       – Easily replicate and share analysts’ environments
       – Centrally store proprietary libraries and manage versioning
     • Scalable analytics computations
       – Scale up: leverage GPU and parallel-optimized libraries
       – Scale out: easily manage Anaconda across your Hadoop/Spark cluster
       – Scale up and out with Python and R
     • Enterprise data science deployment
       – Encapsulate and deploy data science projects
       – Deploy live notebooks, dashboards, interactive applications, and models with REST APIs
  9. Scaling Out with Anaconda
  10. Anaconda – Scaled Out Open Data Science
      • Application and Visualization: Jupyter Notebook, Matplotlib, seaborn, Bokeh, etc.
      • Analytics: pandas, NumPy, SciPy, Numba, scikit-learn, NLTK, scikit-image, PIL, and more
      • Computation: PySpark, SparkR, Dask, Distributed
      • Data and Resource Management: HDFS, NFS, YARN, SGE, SLURM
      • Servers: Bare-metal or Cloud-based Cluster
      Diagram labels: Cluster, Anaconda
  11. Spectrum of Parallelization
      • Explicit control (fast but low-level): Threads, Processes, MPI, ZeroMQ
      • Dask
      • Implicit control (restrictive but easy): Hadoop, Spark, SQL: Hive, Pig, Impala
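
The two ends of this spectrum can be illustrated with the Python standard library alone. The sketch below is an illustration added for this transcript, not code from the talk: `explicit_threads` manages every thread by hand, while `implicit_map` hands the same work to a pool and lets the library schedule it. The function names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """The unit of work applied to every input value."""
    return x * x


def explicit_threads(values):
    """Explicit control: create, start, and join each thread by hand."""
    results = [None] * len(values)

    def worker(index, value):
        # Each thread writes only to its own slot, so no lock is needed.
        results[index] = square(value)

    threads = [
        threading.Thread(target=worker, args=(i, v))
        for i, v in enumerate(values)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def implicit_map(values, max_workers=4):
    """Implicit control: describe the work and let the pool schedule it."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(square, values))
```

Both functions return the same results in input order. The explicit version exposes every scheduling decision to the caller; the pooled version hides those decisions behind a single `map` call, which is easier to write but offers fewer knobs.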
  12. Scaling Out with Anaconda and Spark
      • Using Anaconda with Spark is:
        – Extensible: Use libraries from Anaconda with PySpark and SparkR jobs
        – Integrated: Use interactive notebooks with data in HDFS and on YARN clusters
        – Secure: Works with Kerberized Hadoop clusters
        – Scalable: Map pandas, NumPy, SciPy jobs on large clusters and data sets
        – Seamless: Works with Cloudera CDH, Hortonworks HDP, and other enterprise Hadoop distributions
      • Anaconda dramatically simplifies the installation and management of popular Python and R packages and their dependencies.
  13. Other Ways to Scale Out with Anaconda
      • Anaconda integrates with:
        – Spark (PySpark, SparkR) and other Hadoop components, including YARN, HDFS, Hive, Impala, and more
        – Dask, Distributed, knit, dask-ec2, hdfs3, fastparquet
        – CSV, SQL, JSON, HDF5, Parquet, etc.
        – Amazon Web Services, Microsoft Azure, Google Cloud Platform
        – Streaming analytics: Streamparse for Apache Storm, Spark Streaming, Kafka, Python integration with ELK
      • Anaconda Technology Partners:
        – Cloudera
        – Hortonworks
        – IBM
        – H2O
        – Docker
        – … and more
  14. Scaling Out with Anaconda
      • Without Anaconda Scale, on the head node and on each compute node:
        1. Manually install Python, packages & dependencies
        2. Manually install R, packages & dependencies
      • With Anaconda Scale, on the head node and compute nodes: easily install Anaconda with performance optimized Python and R packages and manage environments across all nodes in a cluster
  15. Scaling Out with Anaconda – Example Use Cases
      • Analyzing text, tabular, or array data using Dask
        – Use Pandas dataframes or NumPy arrays at scale
        – Work with data in different formats and data stores
      • Distributed natural language processing with text data using PySpark
        – Explore data using a distributed memory cluster
        – Interactively query and analyze data using libraries from Anaconda
      • Distributed machine learning workflows with Dask, Spark, H2O, Tensorflow, and more
        – Work interactively and collaboratively in notebooks
        – Simplify installation and management of ML libraries and dependencies
      • Handle custom code and workflows using Dask
        – Work with custom data formats
        – Construct complex pipelines including ETL and flexible computations
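
Working with dataframes or arrays "at scale" generally means splitting the data into partitions, computing a small partial result per partition, and combining the partials. The sketch below shows that general pattern in plain Python; it is not Dask code and makes no claim about how any particular library implements it. The function names and the default chunk size are hypothetical.

```python
def partition(values, size):
    """Split a sequence into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_stats(chunk):
    """Per-partition work: a (sum, count) pair that is cheap to combine."""
    return sum(chunk), len(chunk)


def chunked_mean(values, chunk_size=3):
    """Compute a mean from per-partition partial results."""
    partials = [partial_stats(chunk) for chunk in partition(values, chunk_size)]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    if count == 0:
        raise ValueError("cannot take the mean of an empty sequence")
    return total / count
```

Each call to `partial_stats` depends only on its own chunk, so in a distributed setting the chunks could sit on different workers and only the small `(sum, count)` pairs would need to travel back to be combined.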
  16. Productionizing and Deploying Data Science Projects
  17. Productionizing Data Science Projects
  18. Deploying Data Science Projects – Notebooks
  19. Deploying Data Science Projects – Dashboards
  20. Deploying Data Science Projects – Interactive Applications
  21. Deploying Data Science Projects – Models with REST APIs
      • Machine Learning Pipeline: Load Data, Clean Data, Anomaly Detection, Regression, Clustering
      • Deployed Applications: Models with REST APIs, Dashboards, Reports, Interactive Applications
      Developers and data scientists can build additional layers of visualizations, dashboards, or interactive applications that consume data from API endpoints.
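
To make "models with REST APIs" concrete, here is a minimal sketch of a model exposed as a JSON endpoint using only the Python standard library. It does not show how Anaconda Enterprise deploys models; the z-score anomaly detector, the `{"values": [...]}` payload shape, and the names `detect_anomalies` and `app` are all assumptions chosen for illustration.

```python
import json
from statistics import mean, pstdev


def detect_anomalies(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


def app(environ, start_response):
    """WSGI app: accept a JSON body {"values": [...]} and return anomalies."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
        values = [float(v) for v in payload["values"]]
        body = {"anomalies": detect_anomalies(values)}
        status = "200 OK"
    except (KeyError, TypeError, ValueError):
        # Covers malformed JSON, a missing "values" key, and empty input.
        body = {"error": "expected a JSON body like {\"values\": [1, 2, 3]}"}
        status = "400 Bad Request"
    data = json.dumps(body).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(data))),
    ])
    return [data]
```

For local experimentation the app can be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. A dashboard or interactive application would then consume the JSON response instead of importing the model code directly.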
  22. Scalable and Deployable Data Science
      • … with Anaconda and Anaconda Enterprise, including:
        – Scaled-up Analytics: Develop and deploy the same code/environments on your local machine and a cluster
        – Environment management: Dynamically manage Python, R, dependencies and other conda packages and environments across a cluster
        – Collaboration: Easily share versioned notebooks and projects across users and replicate analysts’ environments for different jobs/users/groups
        – Hadoop integration: Support for Hadoop, Spark and other distributed workflows; compatible with enterprise Hadoop distributions
  23. Data Science Deployment in Anaconda Enterprise
  24. With Anaconda Enterprise life just got a whole lot easier…
      • ENSURES AVAILABILITY, UPTIME, & MONITORING
      • PROVISIONS COMPUTE RESOURCES
      • MANAGES DEPENDENCIES & ENVIRONMENTS
      • SHARE COMPUTE RESOURCES
      • SECURE NETWORK COMMUNICATIONS & SSL
      • SECURE DATA & NETWORK CONNECTIVITY
      • ENGINEER FOR SCALABILITY
      • MANAGE AUTHENTICATION & ACCESS CONTROL
      • SCHEDULE REGULAR EXECUTION OF JOBS
      Learn more: https://www.continuum.io/blog/developer-blog/productionizing-deploying-data-science-projects
  25. Anaconda Fusion
  26. Anaconda Fusion brings Open Data Science to Microsoft Excel
      • BRING interactive visualizations, machine learning and ETL to Excel
      • BRIDGE Excel Data to Python & R through notebooks
      • ACCESS all the power of Python and Big Data, natively embedded inside Excel
  27. Empowering Business Analysts & Data-driven Employees
      • Anaconda Fusion is a Microsoft Excel® Add-in that provides a simple link between Excel and Python without writing code
      • Anaconda Fusion is targeted at Business Analysts who want “No Code” Data Science
  28. Analysts and Data Scientists can keep using their preferred tools
  29. “No Code” Data Science – Data Loading Example
      1. Select Anaconda Fusion Notebook and click “Upload”
      2. Select function you wish to run
      3. Click “Run”
      4. Data is loaded into spreadsheet
  30. Just change one line of code in your notebook
  31. Anaconda Fusion Use Cases
      • Extract data – pull data directly into Excel to perform analysis
      • Machine Learning – use trained models created by Data Scientists and plug them into your spreadsheet data
      • Interactive Visualizations – create custom advanced interactive graphs, charts and plots from Excel data
      • Big Data – analyze, transform, model and query data stored in Hadoop and Spark
      Figure: Anaconda Fusion on Mac