Data Lake for the Cloud: Extending your Hadoop Implementation



As more applications are created using Apache Hadoop that derive value from the new types of data from sensors/machines, server logs, click-streams, and other sources, the enterprise "Data Lake" forms with Hadoop acting as a shared service. While these Data Lakes are important, a broader life-cycle needs to be considered that spans development, test, production, and archival and that is deployed across a hybrid cloud architecture.

If you have already deployed Hadoop on-premises, this session will also provide an overview of the key scenarios and benefits of joining your on-premises Hadoop implementation with the cloud through backup/archive, dev/test, or bursting. Learn how you can get the benefits of an on-premises Hadoop deployment that can seamlessly scale with the power of the cloud.

Published in: Technology


  1. © Hortonworks Inc. 2014. June 2014. We do Hadoop. Data Lake for the Cloud … Extending your Hadoop Implementation
  2. Your speakers… John O’Brien, Principal Analyst and CEO, Radiant Advisors; Bob Page, VP Partner Product Management, Hortonworks; Matt Winkler, Principal Program Manager, Microsoft
  3. Poll: Where are you on your Hadoop Journey? • Researching our options • Currently Evaluating • Deep in a trial • What’s Hadoop?
  4. Trends and drivers… John O’Brien, Principal Analyst and CEO, Radiant Advisors
  5. Leading Business Drivers and Trends. 1. Scale down operational infrastructure management costs • General evaluation for all on-premises to private/public/hybrid cloud • Hadoop does not fit IT efficiency through economies of scale and standards. 2. Centralize Hadoop data management • Resolve costly data movement, duplication and latency between data centers • Cloud Data Lake Strategy for shared access across geographic regions. 3. Moving data store closer to data sources and users • Performance and costs (Internet/VPN, LAN Ethernet, InfiniBand) • Data sources are increasingly external to the company. 4. Ecosystem of strategic IT relationships • “Our sister organization just signed a great deal with Microsoft Azure and we want to leverage shared services.”
  6. Technical Drivers for Hadoop in the Cloud. 1. Elasticity – setting nominal resources and handling load volatility 2. Flexibility – managing base workloads and handling others 3. Scalability – can on-premises handle scalable requirements 4. Security – requirements dictate from Hadoop apps to networking 5. Proximity – distance data travels impacts cost and performance 6. Functionality – not all distributions are equal (Hive, HBase versions) 7. Usability – internal existing skillsets with OS and scripting 8. Manageability – monitoring cloud and hybrid easily. Reference: Microsoft Big Data Solutions. Wiley 2014. Adam Jorgensen, James Rowland-Jones, John Welch, Dan Clark, Christopher Price, Brian Mitchell.
  7. Hadoop Operating Models and Maturity. 1. On-Premises Hadoop Clusters • Predefined balanced configurations with internal connectivity • May leverage private cloud architecture for elasticity. 2. Cloud-based Hadoop Clusters and Storage • Always-on Infrastructure-as-a-Service (IaaS) pricing model and workloads • On-demand Platform-as-a-Service (PaaS) pricing model and workloads. 3. Hybrid Hadoop Architectures • Affordable storage and access to second class data • Separation of production Analytic Applications from temporary activities • Enabling on-premises clusters to efficiently meet the demands of volatility
  8. Hybrid Cloud Architecture Driver #1. Driver: Lower cost through optimized data platform • Lower cost storage for lower value data needs (lower SLA) • Regulatory requirements of historical data. Online Transparent Archive: • Data policy driven by time, status and read-only state • 10/90 or 10/100 data architecture to simplify data management. Online Backup and Business Continuity: • Hadoop has good fault tolerance built-in with multiple data copies • “Clusters” are single location oriented and not disaster recovery
  9. Hybrid Cloud Architecture Driver #2. Driver: Flexibility for on-demand and temporary needs • Workload and cluster management (prioritize jobs) • Separate Production from Dev/Test and Discovery (mindset). Discovery Sandboxes: • Loading external data to the cloud for evaluation is easier than loading it into the data centers (network load, storage, security). Proof of Concepts: • Verifying new technologies and analytic apps on smaller subset • Beyond exploring new data (not evaluation of Hadoop distribution). Separating environment for Analytic Applications: • Ensuring SLA-driven operational applications from discovery
  10. Hybrid Cloud Architecture Driver #3. Driver: Need for temporary elasticity • On-premises clusters are typically configured for nominal load • Volatility requires on-demand temporary resources. Bursting: • Setting and managing ongoing nominal workloads with expected volatility in data volumes (threshold). Surging: • Maintaining performance levels during surging event data volumes or surging user activity (dynamic). Analogy: electric grids maintain the balance of dynamic energy generation with dynamic demand.
  11. Dig Deeper Considerations. 1. Network connectivity between corporate data centers and cloud locations is often taken for granted, yet configuration stability and latency have become obstacles. 2. Unified Data Access can become an issue when federated access involves extracting data out rather than pushing workloads into Hadoop clusters. 3. Hybrid Cloud Architectures vary for IaaS and PaaS implementations of Hadoop. Understand the drivers for either Always-on IaaS or On-Demand PaaS first, then adjust the hybrid architecture.
  12. Key Takeaways. 1. Hadoop with the Cloud is driven by a set of business drivers and then feasibility assessments for an increasing number of use cases, architecture patterns and balance. 2. Understand the different value propositions for Hadoop in the Cloud with both IaaS and PaaS architectures, as Cloud elasticity comes in various forms. 3. Strategic relationships play a significant role in determining Cloud and Hybrid-Cloud Hadoop architectures.
  13. Data Lake for the Cloud… Bob Page, VP Partner Product Management, Hortonworks
  14. Hadoop Deployments Start Small. Diagram labels: SCALE; SCOPE; New Analytic Apps; New types of data; LOB-driven
  15. And Then Grow Into Data Lakes. A Modern Data Architecture/Data Lake. Diagram labels: SCALE; SCOPE; New Analytic Apps; New types of data; LOB-driven; RDBMS; MPP; EDW; Governance & Integration; Security; Operations; Data Access; Data Management. Data Lake: An architectural shift in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale, supporting multiple applications and workloads.
  16. Example Applications on the Data Lake. Financial Services: • New Account Risk Screens • Fraud Prevention • Trading Risk • Maximize Deposit Spread • Insurance Underwriting • Accelerate Loan Processing. Telecom: • Call Detail Records (CDRs) • Infrastructure Investment • Next Product to Buy (NPTB) • Real-time Bandwidth Allocation • New Product Development. Retail: • 360° View of the Customer • Analyze Brand Sentiment • Localized, Personalized Promotions • Website Optimization • Optimal Store Layout. Healthcare: • Genomic data for medical trials • Monitor patient vitals • Reduce re-admittance rates • Store medical research data • Recruit cohorts for pharmaceutical trials. Utilities, Oil & Gas: • Smart meter stream analysis • Slow oil well decline curves • Optimize lease bidding • Compliance reporting • Proactive equipment repair • Seismic image processing. Public Sector: • Analyze public sentiment • Protect critical networks • Prevent fraud and waste • Crowdsource reporting for repairs to infrastructure • Fulfill open records requests. Manufacturing: • Supplier Consolidation • Supply Chain and Logistics • Assembly Line Quality Assurance • Proactive Maintenance • Crowdsourced Quality Assurance
  17. Efficient Data Lakes can Span to the Cloud. On-Premises: • HDP on Windows, HDP on Linux: full control of HW and software configs • Analytics Platform System: turnkey Hadoop and relational warehouse appliance. Cloud: • HDP on Windows, HDP on Linux: your deployment of Hadoop hosted as a VM in Azure • HDInsight: managed Hadoop service built on Azure storage. Enjoy cross-platform interoperability based on 100% open source HDP.
  18. …and Provide On-Premises and Cloud Interoperability. • Deployment choice: run the same apps in the environment of your choice • Consistent management story • Co-locate Hadoop processing next to your apps, deployed on-premises or in the cloud • Leverage Azure for cloud hosting, Hadoop as a service, or as a destination for backup. Diagram labels: On-premises or “private cloud”; Microsoft Analytics Platform System; Operational Tools; Microsoft Azure; Microsoft Applications; Azure Storage; Azure HDInsight
  19. Hybrid Hadoop Scenarios. Key Considerations: • Deployment Choice – Linux, Windows – On-Premises, Cloud, Hybrid • “Tethered” Clusters – Compatible services – An explicit “connection” • Synchronized Datasets – Efficient sharing & access – Governance & lineage. Diagram labels: Develop/POC; Bursting; Backup/Archive; Production; Learn
  20. Hybrid Hadoop Scenarios: Cloud Backup and Archive. Azure blob storage as low cost, offsite backup • Run HDP and HDInsight to power analytics on your data in the cloud. Automated data upload & backup • Use Falcon to schedule data load rules, push data based on business needs. Global aggregation • Capture data centers around the world • Run Hadoop local to a DC, or aggregate across DCs to query the entire dataset. Seamless transfer to other storage • Leverage Azure SQL DB & Azure storage as data sources or destinations. Diagram labels: On-premises or “private cloud”; Microsoft Analytics Platform System; Microsoft Azure; Azure Storage; Azure HDInsight
  21. Hybrid Hadoop Scenarios: App Development/POC. Develop new apps on 100% interoperable infrastructure • Develop & test without pre-committing to on-prem or cloud deployment. Create new development & test environments on demand • Do development with predictable costs. De-risk application development • Protect production data & SLA workloads from new dev errors and load spikes. Experiment with new types of data to create new apps • Defer decisions on data value and integration with the Data Lake. Diagram labels: On-premises or “private cloud”; Microsoft Analytics Platform System; Microsoft Azure; Azure Storage; Azure HDInsight …
  22. Hybrid Hadoop Scenarios: Bursting. Handle peak workloads in 100% interoperable environments • Run HDP and HDInsight to power analytics on your data in the cloud • Runs the same application code. Make additional capacity available by separating jobs, e.g. • Ad hoc from scheduled • Analytics from reporting • Recent data from archived data • ETL from aggregation • SLA from non-SLA • Departmental • By priorities. Diagram labels: On-premises or “private cloud”; Microsoft Analytics Platform System; Microsoft Azure; Azure Storage; Azure HDInsight …
  23. Demo: Matt Winkler, Principal Program Manager, Microsoft
  24. Story line: Leveraging Falcon to enable data movement to the cloud. Diagram labels: Microsoft Azure (Azure Storage; HDInsight; Hadoop cluster deployed to IaaS); On-Premises Hadoop Cluster (HDP 2.1) running on CentOS (HDFS; YARN; Tez; Hive; MR; Falcon). • Leveraging Falcon to seamlessly move data to the cloud • Leveraging HDInsight to create a cluster on demand to process the same data with the same job
  25. Wait for it…
  26. Demo wrap-up… Why Cloud? • Elasticity • Cost Optimization • Economic flexibility • Support for bursting workloads • Global footprint. Why On-Premises? • Compliance requirements • Specific control over hardware/networking • Integration requirements for additional apps to be close to cluster. Why Both? • Offsite backup • Dev/Test • Burst to Cloud
  27. Next steps… Industry leading Hadoop Sandbox • Free download • Personal, portable Hadoop environment. Included Tutorials for Microsoft • How to Use Excel 2013 to Access Hadoop Data • How to Use Excel 2013 to Analyze Hadoop Data • How to Install and Configure the Hortonworks ODBC driver on Windows 7. Try Hadoop in the Cloud • Up and running in minutes • Spin up without hardware. Free Trial:
  28. Thank you. Time for Q&A
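
Slide 10 describes bursting as running a nominal workload on-premises and handing load above a threshold to temporary cloud resources. A minimal Python sketch of that split follows. `BurstPlan`, `plan_burst`, and the idea of counting load in integer "work units" are illustrative assumptions, not part of the presentation or of any Hadoop API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurstPlan:
    on_premises: int  # work units kept on the on-premises cluster
    cloud: int        # work units sent to temporary cloud capacity


def plan_burst(demand: int, nominal_capacity: int) -> BurstPlan:
    """Split demand at the nominal-capacity threshold (hypothetical helper).

    Work up to the threshold stays on-premises; anything above it is
    assigned to on-demand cloud resources.
    """
    if demand < 0 or nominal_capacity < 0:
        raise ValueError("demand and nominal_capacity must be non-negative")
    on_prem = min(demand, nominal_capacity)
    return BurstPlan(on_premises=on_prem, cloud=demand - on_prem)


# Demand below the threshold stays local; demand above it spills over.
print(plan_burst(80, 100))
print(plan_burst(130, 100))
```

A fixed threshold matches the "bursting" case on the slide. The "surging" case, where demand changes dynamically, would need the capacity figure to be re-evaluated continuously rather than set once.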
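
Slides 20 and 24 describe moving on-premises HDFS data into Azure Storage, with Falcon handling the scheduling. The sketch below does not reproduce a Falcon feed definition. As a simpler stand-in, it only builds the argument list for a plain `hadoop distcp` copy to a `wasb://` blob storage URI and prints it without running anything. The container name, storage account, and paths are hypothetical, and the sketch assumes the cluster is already configured with credentials for that storage account.

```python
import shlex
from typing import List


def wasb_uri(container: str, account: str, path: str) -> str:
    """Build an Azure blob storage URI in the wasb:// form used by Hadoop."""
    return f"wasb://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def distcp_command(source: str, destination: str) -> List[str]:
    """Return the argv for a `hadoop distcp` copy; nothing is executed here."""
    return ["hadoop", "distcp", source, destination]


# Hypothetical source path, container, and storage account, for illustration only.
cmd = distcp_command(
    "hdfs:///data/clickstream/2014-06",
    wasb_uri("archive", "exampleaccount", "/clickstream/2014-06"),
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument intact if it is later passed to a process launcher, and `shlex.join` is used only to display it.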