Emerging Big Data & Analytics Trends with Hadoop

Presented at the 2012 InnoTech San Antonio by Paul Levine.

  • Here's what we're going to cover in today's session: walk through the agenda.
  • To start things off today, let’s look at “The Big Data Opportunity”
  • <This slide gives you the opportunity to tell the audience that we are in the era of big data.> I'm sure you've seen some of the articles in the press about "Big Data". It seems as if everyone is talking about it. Some of you are probably living it today. There's lots of interest in it, but many aren't exactly sure what they should be doing about it. Big Data has been recognized the world over for the potential impact it can have. Gartner has said that enterprises that embrace Big Data will outperform their peers financially by 20%. <click>
  • Make no mistake about it: the era of Big Data is here.
  • Over the next decade, the explosion of data will introduce not only massive challenges for IT, but massive opportunities for business. In fact, we've seen a number of our customers use Big Data to transform their business. Let's look at just a few examples.
  • Healthcare: Hospitals are implementing EMR (electronic medical records) and enabling access to larger volumes of historical patient data. Big Data analytics infrastructures enable doctors and hospitals to leverage this EMR data to find patterns in the success of various treatments for patients with a variety of characteristics. Through the ability to store and analyze massive volumes of patient data, doctors are discovering more effective treatment options targeted at the specific characteristics of their patients. Financial services: Banks and investment institutions have always been focused on the use of data in all of their operations. Now Big Data brings the ability to run predictive analytics, enabling these organizations to determine how their balance sheets can be affected by a variety of different market forces. For example, if the Euro drops 20%, how will that affect the bank's balance and its ability to borrow or lend money? Utilities: The implementation of Advanced Metering Infrastructure is generating massive amounts of data on the distribution and consumption of energy by commercial institutions and businesses. Utility companies can leverage these new forms of data to predict service failures and more quickly detect energy theft.
  • Now let's look at Hadoop and its role in Big Data analytics.
  • To harness the full power of Big Data assets, Big Data analytics are increasingly important. With Big Data analytics, organizations can leverage their Big Data assets to uncover new, emerging trends and identify potential business opportunities. With these powerful tools, businesses can tap into their Big Data assets and potentially discover new ways to gain competitive advantages. In sum, these technologies help organizations become more agile, identify opportunities, and respond faster. Recent technology trends, including the growth of the Internet, have generated an immense and growing wave of Big Data that will require your Big Data storage and analytics platforms to scale significantly to handle the volume, velocity and variety of this data. To underscore this, IDC recently projected that the amount of data managed by enterprises will increase by 50x by 2020. In addition, 80% or more of this data will be unstructured, file-based data. With this as the backdrop, let's look at the emergence of Hadoop.
  • Hadoop was developed 5-6 years ago to specifically address the need for Big Data analytics. At the time, development for Hadoop was being driven by the big Internet companies like Yahoo! and Google, who were amassing huge amounts of unstructured data and needed a new way to analyze it, because traditional approaches couldn't handle this new Big Data challenge. The development of Hadoop was pioneered by Doug Cutting, a former Yahoo! engineer. Hadoop consists of 2 key elements: the Hadoop Distributed File System (HDFS), which handles the storage component of the system, and MapReduce, which handles the compute function. Today, Hadoop is an open-source initiative, very similar to Linux, backed by a large, open source development community who collaborate on Apache Hadoop. As with Linux, there are a number of approved or authorized Apache Hadoop distributions, including EMC Greenplum's "Greenplum HD". <You may also want to note that Hadoop got its name from Doug Cutting's son's toy elephant. This also explains the elephant that is often depicted on materials relating to Apache Hadoop.> Now let's look at why Hadoop is so important.
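  • The map/reduce split described above can be sketched in a few lines of pure Python. This is a toy word-count illustration of the programming model, not the actual Hadoop API: a map phase emits (word, 1) pairs, a shuffle groups them by key (as Hadoop does between the two phases), and a reduce phase sums each group.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group intermediate pairs by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    pairs = [p for line in lines for p in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big analytics", "big hadoop"])
print(counts["big"])  # 3
```

  In real Hadoop, the map and reduce functions run in parallel across the cluster and the shuffle moves data over the network; the logic per record is the same.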
  • One reason Hadoop has emerged as an important technology is that it is an innovative Big Data analytics engine designed specifically for massively large data volumes. With it, organizations can greatly reduce the time required to derive valuable insight from an enterprise's dataset. By adopting Hadoop to store and analyze massive data volumes, enterprises gain an agile new platform to deliver new insights and identify new opportunities to accelerate their business. Hadoop has also been designed to tackle analytics for unstructured data. This is significant because unstructured data is the dominant area of data growth projected for the foreseeable future. Now let's look at how the adoption of Hadoop is evolving.
  • <This slide will automatically build to the next slide>
  • The initial, early adopters of Hadoop were largely the big Internet companies, as well as a number of universities and research organizations. These early adopters were very technical and research-oriented. Typically, Hadoop was deployed in a lab environment, outside the domain of any traditional enterprise IT department. Often, these early deployments were very much a do-it-yourself effort involving the assembly of systems using commodity components. It wasn't unusual, especially in academic environments, for a small army of research assistants to be used to keep the system running. <advance to next slide>
  • Now, flash forward 5-6 years, and we are seeing Hadoop beginning to go mainstream in enterprise environments across a wide range of industries. Increasingly, IT executives and line-of-business managers are looking to leverage the Big Data assets within their organizations to identify new opportunities and accelerate their business. Related to this, we are seeing the emergence of a new role in organizations: data scientists. These organizations are also keenly interested in integrating Hadoop and its infrastructure into their overall IT environment so that they can protect the data and manage it with their standard IT processes. They are also more interested in acquiring and deploying proven Hadoop solutions than in building a do-it-yourself project. While Hadoop offers great potential value to organizations, it is not without certain challenges that need to be addressed. Let's look at these.
  • In this section, we're going to identify and describe the key technology challenges of Hadoop, especially when deployed using direct-attached storage (DAS).
  • One challenge associated with traditional deployments of Hadoop is that they have largely been done on a dedicated infrastructure, not integrated with or connected to any other applications: in effect, a siloed environment, often outside the realm of the IT team. This poses a number of inefficiencies and risks. <click> A well-recognized issue with traditional Hadoop deployments is the single-point-of-failure problem with the Hadoop NameNode. In a Hadoop environment, a single NameNode manages the Hadoop file system. If it goes down, the Hadoop environment will immediately go off-line. <click to next build slide>
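  • Why a single NameNode is a single point of failure can be sketched with a toy model (the class and method names below are illustrative, not real Hadoop code): all file-to-block metadata lives in one process, so even though the data blocks survive on the DataNodes, nothing can be located once that process dies.

```python
class NameNode:
    """Toy metadata service: maps file names to block locations (illustrative only)."""
    def __init__(self):
        self.block_map = {}   # file name -> list of (datanode, block id)
        self.alive = True

    def locate(self, filename):
        if not self.alive:
            raise ConnectionError("NameNode down: entire file system is unreadable")
        return self.block_map[filename]

namenode = NameNode()
namenode.block_map["/logs/day1"] = [("datanode-1", "blk_001"), ("datanode-2", "blk_001")]

print(namenode.locate("/logs/day1"))   # metadata lookup succeeds

namenode.alive = False                 # simulate the NameNode failing
try:
    namenode.locate("/logs/day1")
except ConnectionError as err:
    print(err)  # the blocks still exist on the DataNodes, but cannot be found
```

  This is the failure mode the distributed-NameNode approach described later is designed to eliminate.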
  • Another issue with traditional Hadoop environments is the lack of enterprise-level data protection. Typical Hadoop deployments do not have rigorous backup and recovery capabilities such as snapshots, or data replication capabilities for disaster recovery (DR) purposes. <click> Traditional Hadoop deployments on direct-attached storage (DAS) are also extremely inefficient. It's not unusual for a DAS environment to operate with a 30-35% storage utilization rate (or less). Compounding this inefficiency is the fact that data is often mirrored (the default is 3 times). In addition to storage inefficiency, this type of infrastructure is very management-intensive. <click> Another issue with Hadoop running on direct-attached storage is that server and storage resources must be increased together in lock-step. For example, if more storage resources are required, a new server must be deployed (and vice versa). This rigidity adds further inefficiencies. Another issue is the manual import/export of data that is required in a traditional Hadoop environment. In addition to consuming time and resources (bandwidth), the Hadoop data in typical environments cannot be accessed or shared with other enterprise applications due to the lack of industry-standard protocol support. To address these challenges and to enable enterprises to begin realizing the benefits of Hadoop quickly and easily, EMC has recently introduced an exciting new Hadoop solution. <click to advance to next slide>
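  • The storage-efficiency point is easy to quantify. With HDFS's default 3x replication, every usable terabyte consumes three raw terabytes, capping utilization at about 33% before any other overhead. The cluster size below is an illustrative figure, not from the presentation:

```python
def usable_capacity(raw_tb, replication_factor):
    """Usable capacity once every block is stored `replication_factor` times."""
    return raw_tb / replication_factor

raw = 300  # TB of raw disk across the cluster (illustrative figure)

das = usable_capacity(raw, 3)   # default HDFS 3x mirroring
print(das)                      # 100.0 TB usable, i.e. ~33% utilization

# A platform that protects data with parity instead of full copies can run at
# much higher utilization; the 80% figure cited later for Isilon would give:
print(raw * 0.80)               # 240.0 TB usable from the same raw disk
```

  The same arithmetic explains the 30-35% utilization figure quoted for typical DAS deployments.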
  • With the new EMC solution, which incorporates EMC Isilon scale-out NAS storage, organizations can deploy Hadoop on a highly scalable platform that is easily leveraged by other enterprise applications and workflows. <click>
  • The new EMC solution also eliminates the single-point-of-failure issue. We do this by enabling all nodes in an EMC Isilon storage cluster to act, in effect, as NameNodes. This greatly improves the resiliency of your Hadoop environment. The EMC solution for Hadoop also provides reliable, end-to-end data protection for Hadoop data, including snapshotting for backup and recovery and data replication (with SyncIQ) for disaster recovery. Our new Hadoop solution also takes advantage of the outstanding efficiency of EMC Isilon storage systems: with our solutions, customers can achieve 80% or more storage utilization. EMC Hadoop solutions can also scale easily and independently. This means if you need to add more storage capacity, you don't need to add another server (and vice versa). With EMC Isilon, you also get the added benefit of linear increases in performance as the scale increases. EMC also recently announced that we are the first vendor to integrate HDFS (the Hadoop Distributed File System) into our storage solutions. This means that with EMC Isilon storage, you can readily use your Hadoop data with other enterprise applications and workloads while eliminating the need to manually move data around, as you would with direct-attached storage.
  • EMC is the industry's first and only storage vendor to provide native Hadoop integration with scale-out storage. Our solution is designed to deliver a number of key benefits. Our end-to-end approach helps enterprises deploy a proven Hadoop solution quickly, so that you can begin benefitting from this powerful technology right away. Our solution reduces risk and increases data protection. Another advantage of EMC's Hadoop solution is that we have a significant amount of knowledge and expertise about Big Data analytics that you can leverage (we'll cover this in more detail later in the presentation). Now let's take a closer look at the EMC solution to see how we're able to deliver on these benefits.
  • Scale-out software architectures make commodity hardware work. You don't want to be in the hardware business; it is being commoditized. <Graphic shows the accelerating growth of commodity hardware under downward price pressure.>
  • Main points: With a shared, node-based architecture, any node can go down and any other node can take over for it (N-way resiliency). Isilon stripes vertically across all nodes. If a drive were to fail, we rebuild the data across the available free space of the cluster. Isilon can do protection levels unprecedented in the storage industry: N+1 through N+4, i.e. up to quadruple parity protection. It can sustain up to four simultaneous failures (4 drives or 4 nodes). Since each node in the cluster participates in rebuilding a small piece of the data in parallel, we can rebuild lost drives faster than anyone in the industry, easily rebuilding a 250GB drive in minutes rather than hours.
________________________________________________________
Example narration: Let's talk a little about reliability. First, as a shared, node-based architecture, any node can go down and any other node can take over for it. We call this N-way resiliency. Second, we do data protection very uniquely. Take this oil-and-gas file. A user hits "save" <click> and the file is sent to the cluster and striped vertically across all nodes. Each node takes a small part of the file; it is distributed across the entire cluster. We also do this with parity, or ECC. If a drive were to fail, we rebuild the data across the available free space of the cluster, rather than on some dedicated parity drive or within some RAID group of drives. <click> Moreover, we can do protection levels unprecedented in the storage industry: N+2 (akin to RAID 6 or RAID DP) all the way through N+4, or quadruple parity protection. So we can sustain up to four simultaneous failures in our solution and still be protected. These are industry-leading data protection levels not previously seen in storage, but achieved with Isilon. Finally, since each node in the cluster participates in rebuilding a small piece of the data in parallel, we can rebuild lost drives faster than anyone in the industry, easily rebuilding a 250GB drive in minutes rather than hours. This minimizes your window of risk when you have failed components.
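  • The parity idea behind N+1 protection can be illustrated with a toy XOR example. Real erasure codes at the N+2 through N+4 levels are far more sophisticated than a single XOR; this sketch only shows the principle that a lost stripe can be rebuilt from the surviving stripes plus parity, with no full mirror copy required.

```python
def parity(stripes):
    """XOR equal-length stripes together to form one parity stripe."""
    out = bytearray(len(stripes[0]))
    for stripe in stripes:
        for i, b in enumerate(stripe):
            out[i] ^= b
    return bytes(out)

# A file split into three stripes, one per node, plus a parity stripe (N+1).
stripes = [b"node-A-data!", b"node-B-data!", b"node-C-data!"]
p = parity(stripes)

# Node B fails: rebuild its stripe by XOR-ing the survivors with the parity.
rebuilt = parity([stripes[0], stripes[2], p])
print(rebuilt == stripes[1])  # True
```

  Because every surviving node holds a piece of the puzzle, every node can contribute to the rebuild in parallel, which is the basis of the fast-rebuild claim above.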
  • EMC's enterprise Hadoop solution combines the power of EMC Greenplum HD, EMC's Apache Hadoop distribution, with EMC Isilon scale-out NAS storage. The Greenplum HD software, depicted at the top of the diagram, provides the compute function, while the Isilon storage (depicted at the bottom of the diagram) provides the storage function in the EMC Hadoop solution. Note that the Hadoop Distributed File System (HDFS) is integrated into the OneFS operating system used by the EMC Isilon storage systems. Together, this provides a comprehensive Hadoop solution that is easy to implement and manage. It is also highly efficient, reliable and highly scalable. Our Hadoop solution can also be easily augmented with additional EMC Greenplum technologies to expand your data analytics capabilities (these will be discussed later in the presentation). Now let's look at how the EMC Hadoop solution is packaged.
  • EMC's Hadoop solution is available in 2 basic configurations: EMC Greenplum Hadoop software + EMC Isilon storage, or an EMC Hadoop "data computing appliance" + EMC Isilon storage. In the first configuration, the customer provides their own x86 server hardware, which is then loaded with Greenplum HD, packaged as software-only. The server is then connected to the EMC Isilon scale-out NAS. In the second configuration, an EMC Greenplum "Data Computing Appliance" (an x86 server appliance pre-loaded with Greenplum HD software) connects to the Isilon scale-out NAS storage platform. With either offering, enterprises can deploy and implement a comprehensive Hadoop solution quickly and easily. Now let's look at the underlying software architecture of the solution with our "Data Computing Appliance".
  • This slide illustrates the architecture of EMC's enterprise Hadoop solution based on our Greenplum "Data Computing Appliance" (DCA). Starting at the bottom, you'll note that the solution incorporates EMC Isilon storage, which connects to our DCA via the HDFS protocol. Within the DCA, you'll note the pluggable storage layer; the MapReduce layer of Hadoop (which provides the compute function); standard Hadoop tools such as Pig and Hive; and advanced tools through Greenplum Chorus (which will be described in more detail in a few minutes). This solution provides a number of advantages over traditional Hadoop deployments. Easier and more reliable: EMC's end-to-end approach removes the pain associated with building out a Hadoop cluster from scratch, which is required with other distributions. A purpose-built Hadoop infrastructure: enterprises can deploy a Hadoop cluster quickly while eliminating the risk associated with the typical hardware and software configuration process. A key component of a unified analytics platform: Greenplum HD is a core component of Greenplum's Unified Analytics Platform, which is designed to answer the Big Data analytics needs of the agile enterprise by delivering business value through analytical insights. As a packaged and supported solution from EMC, you can also take advantage of EMC's extensive support and services. Enterprise Hadoop support: rely on EMC to provide 24x7 worldwide support with the industry's largest Hadoop support infrastructure. Proven at scale: certified by EMC to remove the guesswork associated with Hadoop deployments. Now, let me introduce my colleague from EMC's Greenplum team to describe additional ways we can help you address your Big Data analytics needs.
  • Greenplum is working with an amazing group of customers to help them pursue business value from analytics and participate in this era of Big Data. These industry leaders and innovative thinkers are doing extraordinary things with our platform. As you can see, we are working with companies in many industries and verticals, everything from finance to retail to telecom to Internet. Regardless of the sector, companies using Greenplum are innovating in new ways.
  • Our expansive partner network ensures you protect your existing investments while having the opportunity to leverage the best available technology. Greenplum has deep partnerships with industry-leading organizations such as the SAS Institute, MicroStrategy and Informatica. We are also working with emerging partners, including Karmasphere, Datameer and Predixion, who are doing new and interesting things with Hadoop and Big Data. Finally, we are fortunate to work with a number of leading application providers, like Silver Spring Networks and ClickFox, who leverage Greenplum as a powerful backend technology. Greenplum is proud to work with this extraordinary partner ecosystem.
  • You have heard us say Greenplum is not just a database; well, Greenplum is also not just about technology. Data science teams are an emerging practice making amazing things happen with Big Data on behalf of their organizations. Greenplum is committed to the future of data science. We are working with leading universities on developing data science curriculums and programs. And we are investing in the community: we recently announced, with the help of several partners, a 1,000-node Hadoop cluster called the Greenplum Analytic Workbench, the only one of its kind in the industry. We will always have community editions of our software available for free. And we continue to invest in the practice by creating and publicizing events like the Data Science Summits. We also have our own data science practice, with PhDs who have expertise in leading analytic tools. This team works every day with our customers, advancing their projects and enabling new things from data.

    1. Big Data and Big Analytics: Big Opportunities with Hadoop Solutions from EMC. Featuring EMC Isilon Scale-Out NAS Storage and EMC Greenplum HD. Paul S. Levine, Senior Systems Engineer. April 9, 2012. © Copyright 2011 EMC Corporation. All rights reserved.
    2. Today's Agenda • The Big Data Opportunity • Big Data Analytics with Hadoop • Technology Challenges of Hadoop • EMC's Hadoop Solutions for the Enterprise • EMC Greenplum's Unified Analytics Platform (UAP) for Big Data • Q+A
    3. The Big Data Opportunity
    4. "Big Data Is Less About Size, And More About Freedom" (Techcrunch) • "Findings: 'Big Data' Is More Extreme Than Volume" (Gartner) • "Big Data! It's Real, It's Real-time, and It's Already Changing Your World" (IDC) • "Total data: 'bigger' than big data" (451 Group)
    5. THE ERA OF BIG DATA IS HERE <headline overlaid on the press quotes from the previous slide>
    6. BIG DATA IS TRANSFORMING BUSINESS
    7. Big Data in Action • Healthcare: Leverage historical data to discover better treatments • Financial Services: Data-driven banking stress tests & risk analysis • Utilities: Machine learning to predict service outages & prevent energy theft
    8. Hadoop & Big Data
    9. The Promise of Big Data Analytics • Leverage data assets to identify key trends and new business opportunities • Analyze new sources of information to gain competitive advantages • Take an agile approach to analytics that can adapt at the speed of business • Scale your storage and analysis platform to handle Big Data's volume, velocity and variety
    10. The Emergence of Hadoop • Created 5-6 years ago by former Yahoo! engineer Doug Cutting • Software platform designed to analyze massive amounts of unstructured data • Two core components: Hadoop Distributed File System (HDFS) for storage and MapReduce for compute • Now a top-level Apache project backed by a large, open source development community
    11. Why Hadoop is Important • Pragmatic approach to analytics on a very large scale: opens up new ways of gaining insights and identifying opportunities for businesses • Designed to address the rise of unstructured data: enterprise data to grow by 650% over the next 5 years, with more than 80% of this growth in unstructured data
    12. Evolution of the Hadoop Market <adoption-curve graphic: Innovators/Early Adopters, Early Majority, Late Majority, Laggards; marking the Hadoop Early Adopters and Hadoop Early Majority segments>
    13. Evolution of the Hadoop Market • HADOOP PROFILE (TO DATE): Pioneers and academics; Application Architect; Visionary; Open source / community driven; Build-your-own server, application & storage infrastructure; Commodity components; Web 2.0, Universities, Life Sciences
    14. Evolution of the Hadoop Market • HADOOP PROFILE (TO DATE): Pioneers and academics; Application Architect; Visionary; Open source / community driven; Build-your-own server, application & storage infrastructure; Commodity components; Web 2.0, Universities, Life Sciences • HADOOP PROFILE (EMERGING): IT Manager & CIO; Data Scientist; Line-of-business; Commercial distribution; Turnkey solution; End-to-end data protection; Fortune 1000, Financial Services, Retail
    15. Technology Challenges of Hadoop
    16. Technology Challenges of Hadoop (Hadoop DAS Environment) 1 – Dedicated Storage Infrastructure: one-off for Hadoop only 2 – Single Point of Failure: Namenode 3 – Lacking Enterprise Data Protection: no snapshots, replication, backup 4 – Poor Storage Efficiency: 3X mirroring 5 – Fixed Scalability: rigid compute-to-storage ratio 6 – Manual Import/Export: no protocol support
    17. Technology Challenges of Hadoop <build slide: the same six challenges, with a diagram of the Namenode and the 3X-mirrored data blocks (1x, 2x, 3x copies)>
    18. EMC Addresses the Hadoop Challenge 1 – Dedicated Storage Infrastructure (one-off for Hadoop only) → Scale-Out Storage Platform (multiple applications & workflows) 2 – Single Point of Failure (Namenode) → No Single Point of Failure (distributed Namenode) 3 – Lacking Enterprise Data Protection (no snapshots, replication, backup) → End-to-End Data Protection (SnapshotIQ, SyncIQ, NDMP backup) 4 – Poor Storage Efficiency (3X mirroring) → Industry-Leading Storage Efficiency (>80% storage utilization) 5 – Fixed Scalability (rigid compute-to-storage ratio) → Independent Scalability (add compute & storage separately) 6 – Manual Import/Export (no protocol support) → Multi-Protocol (industry-standard protocols: NFS, CIFS, FTP, HTTP, HDFS)
    19. The EMC Isilon Advantage for Hadoop 1 – Scale-Out Storage Platform: multiple applications & workflows 2 – No Single Point of Failure: distributed Namenode 3 – End-to-End Data Protection: SnapshotIQ, SyncIQ, NDMP backup 4 – Industry-Leading Storage Efficiency: >80% storage utilization 5 – Independent Scalability: add compute & storage separately 6 – Multi-Protocol: industry-standard protocols (NFS, CIFS, FTP, HTTP, HDFS)
    20. Industry's First and Only Scale-Out Storage Solution with Native Hadoop Integration • Accelerating the benefits of Hadoop for the enterprise • Reducing risk • End-to-end data protection • Organizational knowledge/experience
    21. Core Innovation…Value to Customers • Isilon's OneFS Scale-Out Operating System: creates one giant network drive • Single file system, single volume • Guaranteed 80% raw storage utilization • Highest-performance, fully symmetric cluster • Easy to manage and grow • Auto-balanced, self-healing • Global namespace • Multi-tier single file system / single cluster • No more management of LUNs, volumes or RAID
    22. Isilon's Clustered Storage Solution • Enterprise-class hardware & software • FreeBSD operating system • Software file system • A 3-node Isilon IQ cluster with 40 Gigabit InfiniBand • Expandable to 15.5 PB in a single file system (144 nodes) • Sequential I/O performance: >85 GB/sec • SPECsfs2008 I/O operations/sec: 1.6 million IOPS
    23. Isilon IQ Network Architecture • Client/Application Layer (standard Gigabit Ethernet): Windows, UNIX/Linux, Mac clients over NFS, CIFS, iSCSI, FTP, HTTP on 1 or 10 GigE (optional switches for additional subnets; optional 2nd switch for high availability) • Intracluster Communication Layer: 40 Gb InfiniBand • Isilon IQ Clustered Storage Layer • Industry-standard protocols: NFS v3/v4, SMB, SMB2 (native), iSCSI, HTTP, HDFS (Hadoop), NDMP, SNMP; ADS, LDAP and NIS for security
    24. The Most Reliable Storage System • Built-in high-availability clustered architecture (traditional storage requires costly, redundant heads and software) • With N+2:1 protection, data is 100% available even if a single drive fails; with N+2, N+3 and N+4 protection, data is 100% available even if multiple drives or nodes fail • Isilon IQ offers the industry's fastest drive rebuild times: less than an hour • Protection can be set at the cluster, directory, or file level
    25. Largest and Most Scalable Storage System • OneFS™ can scale from 18TB to over 15,000 TB in a single file system
    26. Linear, Predictable Performance = SLAs • AutoBalance: automated data balancing across nodes • Reduces costs, complexity and risks for scaling storage • AutoBalance migrates content to new storage nodes while the system is online and in production • Requires no manual intervention, no server or client mount-point reconfiguration, and no application changes • Under 15 seconds to scale with no downtime • World's fastest performance and capacity scaling <diagram: balanced nodes filling an initially empty new node>
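    The AutoBalance idea on this slide, spreading existing data onto a newly added empty node, can be sketched with a toy greedy rebalancer. This is an illustration of the concept only, not Isilon's actual algorithm:

```python
def rebalance(nodes):
    """Greedy toy rebalancer: repeatedly move one block from the fullest node
    to the emptiest until no two nodes differ by more than one block."""
    while max(nodes.values()) - min(nodes.values()) > 1:
        src = max(nodes, key=nodes.get)
        dst = min(nodes, key=nodes.get)
        nodes[src] -= 1
        nodes[dst] += 1
    return nodes

# Three full nodes holding 90 blocks each; a fourth, empty node joins the cluster.
cluster = {"node1": 90, "node2": 90, "node3": 90, "node4": 0}
print(rebalance(cluster))  # every node now holds 67 or 68 of the 270 blocks
```

    A real system moves data in the background without taking the cluster offline, but the end state is the same: an even spread across all nodes.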
    27. Solutions <node specification table> • S200: 2U, 2x 4-core, 24-96GB memory, SSD/SAS (24 slots), 7-14 TB • X200: 2U, 1x 4-core, 6-48GB memory, SSD/SATA (12 slots), 6-36 TB • X400: 4U, Westmere (2x 6-core), 24-192GB memory, SSD/SATA (36 slots), 36/72/108 TB • NL400: 4U, Westmere (2x 4-core), 12-48GB memory, SATA (36 slots), 36/72/108 TB • Backup Accelerator: 1U, 2x 4-core, 32GB memory, diskless, 4 Fibre ports
    28. Full Suite of Enterprise Software Options • Combine multiple storage tiers into a single file system • Simple, scalable and flexible data protection • Policy-based client load balancing with NFS failover • Quota management and thin provisioning • Fast and flexible file-based asynchronous replication • Analytics platform to maximize performance and resource utilization • WORM functionality enforces file-level retention
    29. EMC's Enterprise Hadoop Solution: EMC Greenplum HD and EMC Isilon Scale-Out Storage • Apache Hadoop certified by Greenplum • Simple platform management and control • Parallel analytics access with Greenplum Database <diagram labels: Compute (Greenplum HD), Storage (Isilon)>
    30. Flexible Packaging • Hadoop Software + Storage Package: Greenplum HD software on commodity x86 hardware, plus Isilon scale-out NAS • Hadoop Appliance + Storage: Greenplum HD Data Computing Appliance, plus Isilon scale-out NAS
    31. Greenplum HD Data Computing Appliance: Software Architecture with Isilon • Greenplum Chorus and Greenplum Command Center • Greenplum Hadoop Tools (Pig, Hive, HBase, Mahout, etc.) • MapReduce Layer • Pluggable Storage Layer (HDFS API) • HDFS Protocol • Isilon OneFS
    32. Innovative Companies Using Greenplum
    33. Powerful Partner Ecosystem <partner logos, including Discovix>
    34. Greenplum: Not Just About Technology • Data science teams will become the driving force for success with Big Data analytics • Greenplum is committed to the future of data science: university data science program collaboration with Stanford and UC Berkeley; community investment including the Greenplum Analytic Workbench, Community Edition software, and Data Science Summits • Greenplum built its own data science practice: leading PhDs with analytic tools expertise
    35. Questions?
