EMC HADOOP Storage Strategy

EMC Hadoop Storage Strategy, presented at EMC World 2014

  • Situation: Over the past few years we have seen a major transformation in next-gen analytics. Big Data is a major focus of your business and application teams. At EMC World 2013 we announced Isilon HDFS support and the launch of Pivotal. At EMC World 2014: ViPR HDFS, Pivotal momentum, and industry investment in Cloudera and MongoDB. Today roughly a third of Amazon's sales comes from personalization and recommendation systems. It's not just big companies like Amazon and eBay but even your local grocery: the weekly sales circular is giving way to targeted marketing messages with discounts just for you as you enter the store. Stores are collecting data on every shopper using loyalty programs and WiFi. This touches every industry: healthcare, insurance, financial. Problem: confusion about what to do; it comes up in almost 40% of EBCs. Stake: to determine the best IT infrastructure for these services, you need to understand the key enablers. <next>
  • The first evolution is data volume (EMC Digital Universe, 7th update): digital data stored will double every two years for the next decade, and growth from emerging markets is exploding. Today 60% of data is generated in mature markets (US, Japan, Germany); by 2016, 60% will be generated in emerging economies such as Brazil, China, and Mexico. The second evolution is the impact of the Internet of Things, or the Industrial Internet; data collection is accelerating: 14 billion internet-connected devices today (2% of all data), 32 billion in 2020 (10% of all data). GE wind turbine example: 20,000 sensors, 400 updates per second. The third is analysis of unstructured data, including images, video, and audio: we are no longer just analyzing neat tables of data organized in columns and rows. The NYC exploding-manhole-cover example: messy data, records back to 1880, 51,000 manholes, enough cable to wrap the earth 2.5 times, and a 44% reduction in disasters. The fourth is new tools optimized to analyze large, complete data sets, often dense with frequently collected sensor and device data and including unstructured data such as images, video, and audio. These tools are inexpensive, leverage open source, and are easy to deploy, even for the local grocer. The combination of collecting and storing more data cost-effectively with new tools creates the perfect storm for Big Data. <click> My current storage architecture can't meet all these requirements. What should a storage architect do?
  • The first step: you need a content repository, or Data Lake. Most of the new analytics tools such as Hadoop rely upon HDFS and its API interface. HDFS has several great attributes: it scales from terabytes to petabytes easily; it is optimized for big-block IO (64 MB block size); it supports structured and unstructured data; and it is open source, low cost, and hardware independent. Let's look at this simple HDFS block diagram: highly distributed processing. But what about those wind turbines streaming 400 data points a second? Customers are combining in-memory database technologies such as GemFire XD and Impala. <click>
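The fixed-block layout described above can be sketched in a few lines. This is a simplified illustration of the arithmetic, not the actual HDFS implementation; 64 MB was the HDFS default block size of this era.

```python
# Simplified sketch of how HDFS splits a file into fixed-size blocks.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the HDFS default of this era

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return (block_count, size_of_last_block) for a file of the given size."""
    full_blocks, remainder = divmod(file_size_bytes, block_size)
    if remainder:
        return full_blocks + 1, remainder
    return full_blocks, (block_size if full_blocks else 0)

# A 1 GB file maps to 16 full 64 MB blocks, each placed on a DataNode.
print(split_into_blocks(1024 ** 3))
```

The large block size is what makes HDFS good at streaming big sequential reads and poor at random seeks: each block is one long contiguous transfer.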
  • IMDGs provide the fast ingest and query performance. IMDG technologies such as GemFire XD write copies of their data to HDFS for persistence and deeper analytics. IMDG + HDFS together support storage and analysis of large data sets, streaming ingest, and analysis of structured and unstructured data. Tools like Pivotal HAWQ let you access data in both the IMDG and the HDFS Data Lake. So what are the storage requirements for a Data Lake?
  • Cost optimized: we recommend HDFS plus IMDG to manage storage costs at scale with a hot-edge, cold-core architecture. Minimize $/GB; data will double every two years. No silos: the content repository/Data Lake must be accessible by all protocols. Write with one protocol, read with any; be ready for the next big thing. Scalability from terabytes to hundreds of petabytes, with non-disruptive capacity growth and no-downtime migrations. Piece of cake? How many storage solutions can do this today? No one storage platform provides all of this. EMC believes in building blocks and options. There are four common Data Lake storage options today, each with pros and cons.
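The "data doubles every two years" planning assumption and the hot-edge/cold-core split can be put into rough numbers. The per-terabyte tier costs below are illustrative assumptions, not EMC pricing.

```python
# Rough capacity/cost projection under the "data doubles every two years"
# assumption, split across a hot edge and a cold core.
def project_capacity(start_tb, years, doubling_period_years=2):
    """Capacity after `years`, doubling once per doubling period."""
    return start_tb * 2 ** (years / doubling_period_years)

def blended_cost(total_tb, hot_fraction, hot_cost_per_tb, cold_cost_per_tb):
    """Total spend when hot_fraction of the data sits on the expensive tier."""
    hot_tb = total_tb * hot_fraction
    return hot_tb * hot_cost_per_tb + (total_tb - hot_tb) * cold_cost_per_tb

# 100 TB today becomes 3.2 PB in a decade (five doublings).
ten_year_tb = project_capacity(100, 10)
# Keeping 10% hot at a hypothetical 10x cost premium over the cold core:
spend = blended_cost(ten_year_tb, 0.10, 2000, 200)
print(ten_year_tb, spend)
```

The point of the arithmetic: with five doublings ahead, the $/GB of the cold core dominates total spend, which is why the notes push a cheap, scale-out core.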
  • First option: Hadoop HDFS on server storage. Most customers start here, then experience issues with scale and poor capacity utilization. Disadvantages: low efficiency, hardware support at scale, limited to a single Hadoop distro, and a Hadoop silo.
  • Strengths: access data already stored; leverage existing investment; enterprise reliability, security, and availability. ** EMC Hadoop Starter Kit. <<talk about EMC Elastic Cloud Storage (ECS)>> Common concerns: limited high-performance options, storage hardware lock-in, and HDFS compatibility with Hadoop distros.
  • ViPR architecture; Hadoop Starter Kit, ViPR edition. Lots to like: leverage existing investment; centralized management and provisioning; the reliability, security, and availability of the storage hardware; and flexibility of data services. Common concerns: it is new, with HDFS data services GA in February 2014, and HDFS compatibility with Hadoop distros (HCFS).
  • Mature: Greenplum DCA and VCE Vblock for Big Data, with large enterprise and service-provider customers. Fast to deploy, with predictable performance. Common concerns: hardware vendor lock-in, inflexible modules, and slower innovation. These four options all have strengths and weaknesses. The most mature for grown-up HDFS is our Isilon solution, with many happy customers. The most compelling is our storage software virtualization solution, ViPR, but it is new and still building traction. With the 2.0 release it gains many of the features customers need now; additional protocol support is road-mapped over the next 12 months. Do you want to see this in action? I'd like to introduce Jim Ruddy, Lead EMC OCTO Big Data Architect, to demo a Data Lake in action with the Pivotal analytics suite from a recent customer deployment. Jim, what are you going to show us?
  • Demo – Retail Use Case
    1. Data enters through adapters. These adapters can receive data from multiple sources like Twitter, POS, manufacturing devices, or sources on the Internet of Things.
    2. The adapter is written in Spring XD, which can run on a single node or scale to multiple nodes.
    3. The first analysis of the data happens at this level: where does the data go? Does it need instant analysis, or does it need to be compared against a history of transactions?
    4. There are two ways data can be written at this point. It can be written directly from the adapter to GemFire XD for in-memory analytics, or a tap can be used, where data is written to GemFire XD and HDFS at the same time. The adapter can also decide which data goes to GemFire XD and which goes to HDFS, and how to make that determination. This is the first level of analytics.
    5. Once data is in GemFire, it is stored in in-memory tables, or you can persist very large tables to local disk store files or to HDFS. How long and where the data is kept is variable and can be tuned per table. The Pivotal framework extension (PFX) allows HAWQ to query data in memory; as the data is persisted to HDFS, HAWQ can query it there as well.
    6. GemFire XD is built as a cluster: one locator server and one or more data servers that host data. These servers keep the tables in memory, have local storage to persist data, can read and write data to/from HDFS, and run YARN/MapReduce jobs.
    7. Once data is persisted to HDFS, YARN (MapReduce version 2) can run batch jobs against it.
    8. Every node in Pivotal Hadoop that is a YARN node manager is also a HAWQ segment; this is how HAWQ accesses data in HDFS.
    9. Once data is persisted to HDFS, HAWQ and Hadoop can do historical analysis.
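The tap described in step 4 can be sketched as follows. The classes here are hypothetical stand-ins to show the routing pattern, not the Spring XD or GemFire XD APIs.

```python
# Sketch of the "tap" pattern: every ingested record goes to the hot
# in-memory tier and, simultaneously, to the durable HDFS tier.
class InMemoryGrid:
    """Stand-in for an IMDG table (e.g. GemFire XD) -- hypothetical."""
    def __init__(self):
        self.rows = []
    def write(self, record):
        self.rows.append(record)

class HdfsSink:
    """Stand-in for an HDFS writer -- hypothetical."""
    def __init__(self):
        self.files = []
    def write(self, record):
        self.files.append(record)

class TappedAdapter:
    """Routes each record to the hot tier and taps a copy to the cold tier."""
    def __init__(self, hot, cold):
        self.hot, self.cold = hot, cold
    def ingest(self, record):
        self.hot.write(record)   # instant in-memory analytics
        self.cold.write(record)  # durable copy for batch/historical jobs

grid, hdfs = InMemoryGrid(), HdfsSink()
adapter = TappedAdapter(grid, hdfs)
adapter.ingest({"sku": "A1", "qty": 3})  # a hypothetical POS record
print(len(grid.rows), len(hdfs.files))
```

The design point is that the split is made once, at ingest: the query tools (HAWQ in the demo) then see the same record in whichever tier suits the question being asked.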
  • Awesome, Jim. As you can see, next-gen analytics is very powerful for your business; it is the top priority for many of our customers' application teams. EMC is uniquely qualified as the industry leader in data storage, with 30+ years of innovation helping our customers and the industry through these evolutions, and we have learned a great deal from our experience with Pivotal. We believe the key is to architect your content repository using a combination of storage technologies optimized for both $/GB and performance to support the new analytics tools. These tools require access via a variety of protocols, including legacy file, SQL, and new storage protocols such as object and HDFS. In closing, EMC provides highly scalable and cost-efficient storage solutions as part of our building-block approach. We have proven solutions to help you deploy a Data Lake that scales effortlessly and cost-effectively across geographies. Thank you. We have time for some questions; Jim, Dan, please join me.

    1. © Copyright 2014 EMC Corporation. All rights reserved. EMC Hadoop Storage Strategy. Ed Walsh - @vEddieW, Jim Ruddy - @Darth_Ruddy, Dan Baskette - @dbbaskette
    2. CHANGES IN ANALYTICS: DATA VOLUME, DATA VELOCITY, DATA TYPES, APPS
    3. DATA LAKE TECHNOLOGY. HADOOP DISTRIBUTED FILE SYSTEM: • Highly scalable & portable - Apache Open Source specification • Structured and unstructured data • Analytics API interface standard • Storage hardware flexibility • Performance optimized for large file access. HDFS TRADE-OFFS: • Optimized for streaming writes; poor for random seeks • Write-once file system • Hardware failure results in reduced performance • Specialized file system, not designed for general use. [HDFS architecture diagram: a Client asks the NameNode where to read or write data, and is directed to specific DataNodes; a Secondary NameNode (now called checkpoint or backup node) assists; DataNodes hold the HDFS data and report status.]
    4. DATA LAKE TECHNOLOGY. HADOOP TIER (DataNodes running HDFS). PROCESSING TIER – MR, HIVE, ETC. DEEP SCALE SQL ANALYTICS – PIVOTAL HAWQ. IN MEMORY TIER: SQL, OBJECTS, JSON DATABASES. Operational data is the focus (it is in memory, mostly). Continue to work with RDBs. All data and history in HDFS. HDFS data files directly accessible inside Hadoop. Analytic results routed to memory tier.
    5. DATA LAKE STORAGE FEATURES. NO SILOS: Multi-protocol access, Simultaneous access for unstructured data, Separation of storage from access protocol. OPTIMIZED COST: Choice of storage hardware, Multi-vendor, no lock-in. LIMITLESS SCALE: Expand capacity as needed, Massive scale-out, Highly available.
    6. INTEGRATED HDFS WITH HADOOP DISTRO. STRENGTHS: • Tightly coupled with Hadoop software • Low cost • Storage hardware choice • Integrated software support • Data locality
    7. HDFS STORAGE ARRAY INTERFACE. STRENGTHS: • No ingest necessary • NameNode fault tolerance • Eliminate 3x mirroring • Multi-protocol access • Simultaneous multi-Hadoop-distribution support • Smart-Dedupe for Hadoop • SEC 17a-4 compliance • Kerberos authentication • Application multi-tenancy. EMC Hadoop Starter Kit: https://community.emc.com/docs/DOC-26892
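The "eliminate 3x mirroring" point above can be quantified with a back-of-the-envelope comparison. The 1.2x array protection overhead used here is an illustrative assumption, not a product specification.

```python
# Back-of-the-envelope: raw capacity for 1 PB of usable data under HDFS
# 3x replication on server storage vs. an array with an assumed ~1.2x
# protection overhead (illustrative figure, not a product spec).
def raw_capacity_tb(usable_tb, overhead_factor):
    """Raw TB required to hold usable_tb under the given overhead."""
    return usable_tb * overhead_factor

usable_tb = 1000                            # 1 PB of usable data
das_tb = raw_capacity_tb(usable_tb, 3.0)    # three full copies on DAS
array_tb = raw_capacity_tb(usable_tb, 1.2)  # assumed array-side protection
savings_tb = das_tb - array_tb
print(das_tb, array_tb, savings_tb)
```

Under these assumptions the array holds the same petabyte in less than half the raw disk, which is the efficiency argument the slide is making.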
    8. HDFS BY STORAGE VIRTUALIZATION SOFTWARE. STRENGTHS: • Multi-protocol access - Object, HDFS, Block (iSCSI), more coming - Write file, read object & vice versa • NameNode fault tolerance • Eliminate 3x mirroring • Compute & data locality • Application multi-tenancy • Heterogeneous storage: pool server storage, enterprise arrays (EMC, NetApp, Hitachi). EMC Hadoop Starter Kit: https://community.emc.com/docs/DOC-34442
    9. ANALYTICS APPLIANCES. STRENGTHS: • Rapid deployment • Predictable performance & scale • Optimum resource utilization • Integrated, simplified management • Simplified support & maintenance • Optimized cost • Highest reliability, availability, and stability
    10. Traditional Analytics Architecture: RMT, Historian, IMAS, Alarm, LIMS, Oracle, BI (SSRS, Panopticon, Web), Analytics Server (SAS), Analytics Server (R), Pre-aggregated Tables, BI (Cognos)
    11. Modern Analytics Architecture - EMC Data Lake Architecture: RMT, Historian, IMAS, Alarm, LIMS, BI Server (SSRS, Panopticon, Web), Analytics Server (SAS), Analytics Server (R), Historian, Alpine/Chorus (Pivotal), "Real Time" Feed, BI Server (Tableau or other), Reporting DB, GemFire XD, HAWQ, HDFS
    12. MODERN ANALYTICS USING DATA LAKE: DEMO
    13. EMC DATA LAKE CAPABILITIES: Documents (XLS, PPT, DOC), SQL Databases, Rich Media (PDF, JPG, Video, Streaming), Sensor Data (GPS coordinates, temperature measurements), Unstructured Content (Web Server Logs, … Scale Effortlessly | Store Efficiently | Access Globally
    14. Ed Walsh - @vEddieW, Jim Ruddy - @Darth_Ruddy, Dan Baskette - @dbbaskette
    15. The Emerging Data Platform Ecosystem: Business Data Lake. Data Sources: Clickstream, Sensors, Telemetrics, Weblogs, Network Data, CRM, ERP Data, Collab. Ingestion Tier: Real-time, Batch, Micro batch. Insights Tier: SQL, MapReduce, NoSQL, Spark, R. Action Tier: Real-time Insights, Batch Insights, Interactive Insights. Operations Tier: MDM, RDM, Audit and Policy mgmt, Data mgmt services, Systems monitoring and management. Data Services Tier: Relational Database, MPP Database, In-memory processing. Processing Tier: Workflow Management, Hadoop, App Server, Web Services. Data Management Tier: HDFS, Software-defined Storage, Enterprise SAN/NAS. Infrastructure Tier: Public Cloud, Hybrid Cloud, Private Cloud.
    16. Business Data Lake: EMC Federation Solutions. Data Sources: Clickstream, Sensors, Telemetrics, Weblogs, Network Data, CRM, ERP Data, Collab. Ingestion Tier: Real-time, Batch, Micro batch. Operations Tier: MDM, RDM, Audit and Policy mgmt, Data mgmt services, Systems monitoring and management. Data Services Tier: Relational Database, MPP Database, In-memory processing. Processing Tier: Workflow Management, Hadoop, App Server, Web Services. Insights Tier: SQL, MapReduce, NoSQL, Spark, R. Action Tier: Real-time Insights, Batch Insights, Interactive Insights. Data Management Tier: HDFS, Software-defined Storage, Enterprise SAN/NAS. Infrastructure Tier: Public Cloud, Hybrid Cloud, Private Cloud, VMware vCloud Suite, vCloud Hybrid Services.
