Machine Learning and Deep learning on HDP 3.0.1 and HDF 3.2

https://www.meetup.com/futureofdata-princeton/events/254821251/

My talk, given along with IBM, on data science, big data engineering, and machine learning pipelines with Apache NiFi. It also covers running deep learning workloads on YARN, HDF, and HDP, with and without containers, and interfacing with Spark workloads.

Published in: Engineering

  1. © Hortonworks, Inc. 2011–2018. All rights reserved. | Hortonworks confidential and proprietary information.
     Machine Learning and Deep Learning on HDP 3.0.1 and HDF 3.2
     Timothy Spann, Senior Solutions Engineer, Hortonworks (@PaaSDev)
     November 14, 2018, Future of Data – Princeton
  2. Disclaimer
     • This document may contain product features and technology directions that are under development, may be under development in the future, or may ultimately not be developed.
     • Technical feasibility, market demand, user feedback, and the Apache Software Foundation community development process can all affect timing and final delivery.
     • This document’s description of these features and technology directions does not represent a contractual commitment, promise, or obligation from Hortonworks to deliver these features in any generally available product.
     • Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
     • Since this document contains an outline of general product development plans, customers should not rely upon it when making a purchase decision.
  3. Apache Deep Learning Flow
  4. Global Data Management With Hortonworks: Globally Manage, Secure, Govern, Consume
     • DataPlane Service (DPS): Manage, Govern, Secure. Data Lifecycle Manager, Data Steward Studio, ISV Services, Extensible Services, IBM DSX, Cloudbreak, Data Analytics Studio
     • Connected Data Platforms: Hortonworks Data Platform (HDP®), data-at-rest; Hortonworks DataFlow (HDF™), data-in-motion
     • Modern Data Use Cases: EDW Optimization, Cybersecurity, Data Science, Advanced Analytics, IoT/Streaming Analytics
     • Hortonworks Connection: Enterprise Support, Premier Support, Educational Services, Professional Services, Community Connection
     • Hortonworks Platform Services: Operational Services, SmartSense™
     • Data Sources: Data Center, Cloud, Edge
     • Exception Monitoring, 360 View of Operations, Cyber Security, Telemetry – Connected Devices, Time Series Sensors, Control Systems
  5. NiFi - Terminology
     • FlowFile
       • Unit of data moving through the system
       • Content + Attributes (key/value pairs)
     • Processor
       • Performs the work, can access FlowFiles
     • Connection
       • Links between processors
       • Queues that can be dynamically prioritized
     • Process Group
       • Set of processors and their connections
       • Receive data via input ports, send data via output ports
  6. NiFi is based on Flow Based Programming (FBP)
     FBP Term | NiFi Term | Description
     Information Packet | FlowFile | Each object moving through the system.
     Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
     Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
     Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
     Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
  7. Visual Command and Control
     • Drag and drop processors to build a flow
     • Start, stop, and configure components in real time
     • View errors and corresponding error messages
     • View statistics and health of the data flow
     • Create templates of common processors & connections
  8. Provenance/Lineage
  9. Prioritization
     • Configure a prioritizer per connection
     • Determine what is important for your data: time based, arrival order, importance of a data set
     • Funnel many connections down to a single connection to prioritize across data sets
     • Develop your own prioritizer if needed
  11. Latency vs. Throughput
     • Choose between lower latency or higher throughput on each processor
  12. NiFi UI
  13. Edge Intelligence with Apache MiNiFi
     Key Features
     • Guaranteed delivery
     • Data buffering
       • Backpressure
       • Pressure release
     • Prioritized queuing
     • Flow specific QoS
       • Latency vs. throughput
       • Loss tolerance
     • Data provenance
     • Recovery / recording a rolling log of fine-grained history
     • Designed for extension
     Different from Apache NiFi
     • Design and Deploy
     • Warm re-deploys
  14. Integrating TensorFlow with Streaming
     • https://community.hortonworks.com/articles/198855/executing-tensorflow-classifications-from-apache-n.html
     • https://community.hortonworks.com/articles/116803/building-a-custom-processor-in-apache-nifi-12-for.html
     • https://community.hortonworks.com/articles/224268/running-tensorflow-on-yarn-31-with-or-without-gpu.html
     • https://community.hortonworks.com/articles/183806/using-a-tensorflow-person-blocker-with-apache-nifi.html
  15. Apache MXNet: http://mxnet.incubator.apache.org/
     • Cloud ready
     • Experienced team (XGBoost)
     • AWS, Microsoft, NVIDIA, Baidu, Intel backing
     • Apache Incubator Project
     • Run distributed on YARN
     • In my early tests, faster than TensorFlow.
     • Runs on Raspberry Pi, NVIDIA Jetson TX1, and other constrained devices
     • Great documentation
     • Gluon
     • Great Python Interaction
     • Model Server Available
     • ONNX Support
     • Now in Version 1.1!
     • Great Model Zoo
     https://mxnet.incubator.apache.org/how_to/cloud.html
     https://github.com/apache/incubator-mxnet/tree/1.1.0/example
  16. Apache NiFi Integration with Apache Hadoop Options
     • Apache MXNet Running in Apache Zeppelin Notebooks
     • Apache MXNet Running on YARN 3.1 in Hadoop 3.1 in Dockerized Containers
     • Apache MXNet Running on YARN
     https://community.hortonworks.com/articles/176789/apache-deep-learning-101-using-apache-mxnet-in-apa.html
     https://community.hortonworks.com/articles/174399/apache-deep-learning-101-using-apache-mxnet-on-apa.html
     https://www.slideshare.net/Hadoop_Summit/deep-learning-on-yarn-running-distributed-tensorflow-etc-on-hadoop-cluster-v3
  17. Instance Segmentation: Mask RCNN with GluonCV
     net = model_zoo.get_model('mask_rcnn_resnet50_v1b_coco', pretrained=True)
     Mask RCNN model trained on COCO dataset with ResNet-50 backbone
     https://gluon-cv.mxnet.io/build/examples_instance/demo_mask_rcnn.html
     https://arxiv.org/abs/1703.06870
     https://github.com/matterport/Mask_RCNN
  18. Semantic Segmentation: Fully Convolutional Networks
     model = gluoncv.model_zoo.get_model('fcn_resnet101_voc', pretrained=True)
     GluonCV FCN model on PASCAL VOC dataset
     https://gluon-cv.mxnet.io/build/examples_segmentation/demo_fcn.html
     Files: run1.sh, demo_fcn_webcam.py
     https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf
  19. Introducing HDP 3.0: FASTER
  20. Faster Time to Deployment: Containerization
     Why containerization?
     • Overcomes limits of data architecture
     • Allows for agility and elasticity to process data
     • Developers can build data intensive apps quickly
     • Ensure apps deploy quickly, reliably, and consistently across deployment environments
     Result: Faster time to deployment and increased developer productivity -> competitive advantage
  21. Easy to create and access a containerized HBase service
  22. New Abstraction: YARN Container Runtimes
     • Challenge: Run existing process containers in the same cluster as Docker containers
     • Solution: Container Runtimes - specify the container runtime to use at application submission time
     • DefaultLinuxContainerRuntime: Existing Linux process-based execution
     • DockerLinuxContainerRuntime: Using Docker to run and monitor a container
     • Early versions shipped in Apache Hadoop 2.8
  23. Key Improvements in HDP 3.0
     • Improved container lifecycle management: reliably run, stop, and remove Docker containers
     • Delayed deletion of exited containers for debugging
     • ACLs for privileged containers, with the ability to disable privileged containers system wide
     • Default untrusted mode for running unmodified images out of the box
     • Support for bind mounting host files into containers, validated against an admin supplied whitelist
     • Ambari Integration for configuring YARN containerization features and security
  24. Use Case - Cloud Portability with Containers
     Any company that wants faster time to deployment
     • Containerization helps companies
       • Move apps from on-prem to cloud, or between cloud environments
       • Deploy apps quickly
     • Helpful for migrating to cloud or looking to adopt a hybrid cloud strategy
     Result: Maximizes portability: self-contained runtime environments can be taken anywhere
  25. Introducing HDP 3.0: SMARTER
  26. Smarter Decisions Made Based on Support for Deep Learning Workloads
     Why GPU support?
     • Enhances the performance of computations needed for enterprise ML/DL apps
     • DL requires intense computational algorithms
     • Containerized software powered by GPUs helps data processing at scale
     Result: Data Scientists can run DL models in days vs months, hours vs days, minutes vs hours
  27. Use Case - Autonomous Driving Car
     • Capture billions of images in data lake
     • Pool GPUs and CPUs - think a giant supercomputer for 100x faster processing
     • Deploy data intensive containerized deep learning micro-services in minutes
     • Train deep learning models using GPUs & images in data lake
     • Diagram labels: Deploy, Core, Edge (Nvidia Drive PX 2)
  28. Enable GPUs: Easy to enable GPUs with a slider
  29. Use Case - Manufacturing: Machine Learning & Deep Learning Using GPUs
     • Monitor factory equipment in real-time using ML/DL apps
     • Capture sensor data including temperature, vibration, internal pressures to perform preventative maintenance
     • Minimize costly downtime on machinery before problems occur
     Result: Improve bottom line - less downtime
  30. Data Science and In-Memory Performance Improvements
     • Securely & seamlessly integrate with other services including Ranger & Atlas
     • TensorFlow Tech Preview will complement GPU pooling to support deep learning use cases
     • Spark testing with S3Guard to support cloud
     • Spark/Hive integration to connect easily to the cloud
  31. Enable TensorFlow: TensorFlow training metrics & TensorFlow on YARN
  32. Apache MXNet on Apache YARN 3.1 Native, No Spark
     yarn jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -shell_command python3.6 -shell_args "/opt/demo/analyzex.py /opt/images/cat.jpg" -container_resources memory-mb=512,vcores=1
     Uses: Python Any
  33. Apache MXNet on Apache YARN 3.1 Native, No Spark
     https://community.hortonworks.com/content/kbentry/222242/running-apache-mxnet-deep-learning-on-yarn-31-hdp.html
     https://github.com/tspannhw/ApacheDeepLearning101/blob/master/analyzehdfs.py
  34. Apache MXNet on YARN 3.2 in Docker Using “Submarine”
     https://github.com/apache/hadoop/tree/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine
     yarn jar hadoop-yarn-applications-submarine-<version>.jar job run --name xyz-job-001 --docker_image <your docker image> --input_path hdfs://default/dataset/cifar-10-data --checkpoint_path hdfs://default/tmp/cifar-10-jobdir --num_workers 1 --worker_resources memory=8G,vcores=2,gpu=2 --worker_launch_cmd "shell for Apache MXNet"
     Wangda Tan (wangda@apache.org), Hadoop {Submarine} Project: Running deep learning workloads on YARN
     https://issues.apache.org/jira/browse/YARN-8135
  35. Introducing HDP 3.0: HYBRID DATA, Cloud Native
  36. A hybrid architecture is a key requirement of a modern data architecture, and is composed of on prem + multi cloud + edge. A modern data architecture requires:
     • Data at rest + data in motion
     • Manages the entire lifecycle of data
     • Spans across on premises, cloud, and multi-cloud
     • Processes and drives insight
     • Consistent security, governance, and operations
  37. Modern Hybrid Data Architecture
     • Core requirements: Cloud-native Data Architecture, Extend to the Edge, Seamless Architecture, Consistent Security and Governance
     • Solution: Hortonworks Data Platform, Hortonworks DataFlow, Hortonworks DataPlane
     • New, Open Hybrid Architecture Initiative
  38. Deliver Data Science at Scale: Stronger Together
     • Focus on extending data science and machine learning to analyze the data in Apache Hadoop systems
     • Provides Data Science & Machine Learning
     • Provides Open Hadoop Data Platform
     • Make our clients competitive in their markets using advanced analytics faster and at scale
  39. Together we Lead
     • Top Technology
       • Top Hadoop Engine
       • Top SQL on Hadoop Engine
       • Top Data Science Platform
     • Open Source Approach Leads to Future Integration
       • 100% Pure Open Source Hadoop Distribution
       • Big SQL Maintains the Integrity of Open Source Hadoop
       • Data Science leverages Open Source Analytics
     • Solve Real Business Problems
     • Provides Best in Class Technology
     • Gives Clients Freedom Today and Innovation Tomorrow
  40. HDP Technical Differentiators
     Hybrid Cloud deployment
     • Cloudbreak: Deployment made easy - USABILITY
     • Elasticity with long running
     • Supports GCP, AWS, Azure, and private cloud (OpenStack)
     Hive 3.0
     • Support for Tez, LLAP, and Druid integration
     • Allows ACID - PERFORMANCE is there
     • Complete coverage of ANSI SQL 2011 - SIMPLIFY ETL development
     • Competition uses Hive 2.x, supports only Hive on Spark, and does not allow Tez
     Security
     • Knox: SSO across ecosystem - SIMPLIFY security management
     • Ranger: Attribute based tagging - SIMPLIFY security deployment
     • Complete platform integration: HDFS, YARN, Hive, HBase, Storm, Atlas, Kafka
  41. HDP Technical Differentiators
     Hadoop 3.1
     • Full support for Docker containers in YARN - INTELLIGENT
     • Make the most of your applications with GPU support!
     Data Science
     • Zeppelin: Powerful notebook made available with HDP
     • No need for an advanced science subscription
     • Broad partner integration
     • TensorFlow can run on the cluster (Docker)
     • IBM Watson is a market leader
  42. HDF Technical Differentiators
     • SAM: UI for Storm - USABILITY improvement
     • Druid: OLAP cube - REAL TIME analytics
     Data Ingestion
     • NiFi: Fully Open Source project
     • Very mature (est. 2006 - NSA) - USABILITY
     • Home grown, no OEM
     • Edge deployment via MiNiFi - PORTABILITY
     • Schema registry: Ability to apply same schema to NiFi, Kafka, Storm, SAM - PRODUCTIVITY
     • Solve Kafka blindness with SMM!
     • Single Monitoring Dashboard for all your Kafka Clusters across 4 entities: Broker, Producer, Topic, Consumer
     Other section labels on the slide: Stream Processing, Streaming Analytics
  43. HCC – community.hortonworks.com
     • Full Q&A Platform (like StackOverflow)
     • Knowledge Base Articles
     • Code Gallery and Samples
     Read access for everyone, join to participate and be recognized
  44. Community Engagement
     • 9,000+ Registered Users
     • 25,000+ Answers
     • 40,000+ Technical Assets
     One Website! https://community.hortonworks.com
  45. Thanks!!!
     • https://community.hortonworks.com/users/9304/tspann.html
     • https://dzone.com/users/297029/bunkertor.html
     • https://www.meetup.com/futureofdata-princeton/
     • https://twitter.com/PaaSDev
     • https://github.com/tspannhw/ApacheDeepLearning201
     • https://www.youtube.com/watch?v=ksDKNp6Z4BE
     • https://community.hortonworks.com/articles/193835/detecting-language-with-apache-nifi.html
     • https://community.hortonworks.com/content/kbentry/189213/etl-with-lookups-with-apache-hbase-and-apache-nifi.html
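
The NiFi terminology and prioritization slides (5 and 9) describe a FlowFile as content plus key/value attributes, and a connection as a queue with a configurable prioritizer. A minimal sketch of that idea in plain Python follows. It is a toy model, not NiFi's API: the class names `FlowFile` and `Connection`, the `priority` and `filename` attributes, the default of 100, and the helper `drain` are all assumptions made for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class FlowFile:
    """Toy stand-in for a NiFi FlowFile: content plus key/value attributes."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


# A prioritizer maps a FlowFile to a sort key; smaller keys leave the queue first.
Prioritizer = Callable[[FlowFile], int]


def by_priority_attribute(flow_file: FlowFile) -> int:
    """Hypothetical prioritizer: read an integer 'priority' attribute (default 100)."""
    return int(flow_file.attributes.get("priority", "100"))


class Connection:
    """Toy queue between two processors, ordered by a per-connection prioritizer."""

    def __init__(self, prioritizer: Prioritizer) -> None:
        self._prioritizer = prioritizer
        self._heap: List[Tuple[int, int, FlowFile]] = []
        self._arrival = itertools.count()  # tie-breaker preserves arrival order

    def put(self, flow_file: FlowFile) -> None:
        key = self._prioritizer(flow_file)
        heapq.heappush(self._heap, (key, next(self._arrival), flow_file))

    def get(self) -> FlowFile:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def drain(connection: Connection) -> List[str]:
    """Pop every FlowFile and return the 'filename' attributes in dequeue order."""
    names = []
    while len(connection) > 0:
        names.append(connection.get().attributes.get("filename", ""))
    return names
```

Swapping in a different function for `by_priority_attribute` (for example one keyed on a timestamp attribute) mirrors the "develop your own prioritizer" bullet without touching the queue itself.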

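The YARN distributed shell command on slide 32 is easy to mistype, so one option is to assemble it programmatically. The sketch below only builds the argument list; it does not contact a cluster. The jar path, flags, and default values are copied from the slide, while the function name `build_dshell_command` and its parameters are assumptions introduced here.

```python
import shlex
from typing import List

# Jar path copied from slide 32; adjust for your cluster layout.
DSHELL_JAR = (
    "/usr/hdp/current/hadoop-yarn-client/"
    "hadoop-yarn-applications-distributedshell.jar"
)


def build_dshell_command(script: str, script_args: List[str],
                         memory_mb: int = 512, vcores: int = 1,
                         python_bin: str = "python3.6",
                         jar: str = DSHELL_JAR) -> List[str]:
    """Return the argv list for the distributed shell call shown on the slide."""
    # The script and its arguments travel as one -shell_args string.
    shell_args = " ".join([script] + list(script_args))
    return [
        "yarn", "jar", jar,
        "-jar", jar,
        "-shell_command", python_bin,
        "-shell_args", shell_args,
        "-container_resources", f"memory-mb={memory_mb},vcores={vcores}",
    ]


if __name__ == "__main__":
    argv = build_dshell_command("/opt/demo/analyzex.py", ["/opt/images/cat.jpg"])
    # Print a copy-pasteable command line. On a host with the yarn CLI installed,
    # the same list could be handed to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Keeping the command as a list rather than a single string avoids quoting mistakes around the `-shell_args` value, which contains a space.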