Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight
Using Greenplum HD, Isilon Scale-Out NAS and EMC services, learn how you can quickly and easily deploy a powerful yet worry-free Hadoop-based analytics engine. If you have ever wanted to take the plunge with Hadoop, or wanted the confidence to grow your Hadoop deployment to full-scale production, learn how EMC can provide you with a tested solution to do so.


Presentation Transcript

  • 1. EMC Isilon Big Data Storage and Hadoop Analytics (Jemish Patel). © Copyright 2012 EMC Corporation. All rights reserved.
  • 2. Today’s Agenda
      – The Big Data Opportunity
      – Big Data Analytics with Hadoop
      – Technology Challenges of Hadoop
      – EMC’s Hadoop Solutions for the Enterprise
      – Q+A
  • 3. The Big Data Opportunity
  • 4.
      – “Big Data Is Less About Size, And More About Freedom” (TechCrunch)
      – “Findings: ‘Big Data’ Is More Extreme Than Volume” (Gartner)
      – “Big Data! It’s Real, It’s Real-time, and It’s Already Changing Your World” (IDC)
      – “Total data: ‘bigger’ than big data” (451 Group)
  • 5. THE ERA OF BIG DATA IS HERE (shown over the same four quotes as slide 4)
  • 6. BIG DATA IS TRANSFORMING BUSINESS
  • 7. Big Data in Action
      – Healthcare: Leverage historical data to discover better treatments
      – Financial Services: Data-driven banking stress tests & risk analysis
      – Utilities: Machine learning to predict service outages & prevent energy theft
  • 8. Hadoop & Big Data
  • 9. The Promise of Big Data Analytics
      – Leverage data assets to identify key trends and new business opportunities
      – Analyze new sources of information to gain competitive advantages
      – Take an agile approach to analytics that can adapt at the speed of business
      – Scale your storage and analysis platform to handle Big Data’s volume, velocity and variety
  • 10. The Emergence of Hadoop
      – Created 5-6 years ago by former Yahoo! engineer Doug Cutting
      – Software platform designed to analyze massive amounts of unstructured data
      – Two core components:
          – Hadoop Distributed File System (HDFS) (storage)
          – MapReduce (compute)
      – Now a top-level Apache project backed by a large open source development community
  • 11. MapReduce
      – "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
      – "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output: the answer to the problem it was originally trying to solve.
  • 12. MapReduce
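The map and reduce steps described on slide 11 can be illustrated with a small in-memory word count. This is only a sketch of the programming model in plain Python, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration, and everything runs in a single process rather than on worker nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(chunk: str) -> Iterator[Tuple[str, int]]:
    """Map step: turn one input chunk into (word, 1) pairs."""
    for word in chunk.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key so each key is reduced once."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine all values for one key into a single answer."""
    return key, sum(values)


def word_count(chunks: Iterable[str]) -> Dict[str, int]:
    """Run map over every chunk, then shuffle, then reduce each key."""
    pairs = (pair for chunk in chunks for pair in map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Each string stands in for one sub-problem handed to a worker.
counts = word_count(["big data is here", "big data is transforming business"])
```

In a real cluster each chunk would be processed on a different node; the sketch keeps the same three stages so the data flow is easy to follow.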
  • 13. Services for MapReduce
      – JobTracker: A master node that manages job submissions, scheduling and reprocessing in case of job failures. Jobs consist of a mapper, a reducer and a list of inputs.
      – TaskTracker: Each slave node in the cluster runs a TaskTracker process. The JobTracker instructs the TaskTrackers to run and monitor a task. A task consists of a map or a reduce over a piece of data.
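The division of labour between the JobTracker and the TaskTrackers can be sketched as a scheduler that hands tasks to workers and re-queues any task that fails. The classes below are a toy model written for this transcript, not Hadoop's real services: the round-robin assignment, the `max_attempts` limit and the `fail_on` switch used to simulate a failure are all assumptions.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Sequence


class TaskTracker:
    """Toy worker: runs one task at a time and may be told to fail some."""

    def __init__(self, name: str, fail_on: Iterable[int] = ()) -> None:
        self.name = name
        self.fail_on = set(fail_on)  # task IDs this worker fails (simulation only)

    def run(self, task_id: int, func: Callable, data):
        if task_id in self.fail_on:
            raise RuntimeError(f"{self.name} failed task {task_id}")
        return func(data)


class JobTracker:
    """Toy master: schedules tasks round-robin and reprocesses failures."""

    def __init__(self, trackers: Sequence[TaskTracker], max_attempts: int = 3) -> None:
        self.trackers = list(trackers)
        self.max_attempts = max_attempts

    def run_job(self, func: Callable, inputs: Sequence) -> List:
        pending = deque((task_id, 1) for task_id in range(len(inputs)))
        results: Dict[int, object] = {}
        turn = 0
        while pending:
            task_id, attempt = pending.popleft()
            tracker = self.trackers[turn % len(self.trackers)]
            turn += 1
            try:
                results[task_id] = tracker.run(task_id, func, inputs[task_id])
            except RuntimeError:
                if attempt >= self.max_attempts:
                    raise  # give up after too many failed attempts
                pending.append((task_id, attempt + 1))  # reschedule the task
        return [results[i] for i in range(len(inputs))]


# Worker "a" fails task 0; the retry is later picked up by worker "b".
job_tracker = JobTracker([TaskTracker("a", fail_on={0}), TaskTracker("b")])
totals = job_tracker.run_job(sum, [[1, 2], [3, 4], [5]])
```

The point of the sketch is the retry loop: a failed task goes back on the queue and is handed out again, so the job still completes.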
  • 14. HDFS – Hadoop Distributed Filesystem
      – HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
      – HDFS has a permissions model for files and directories that is much like POSIX.
  • 15. Services for HDFS
      – Namenode: Manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log.
      – Datanode: Workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
      – Secondary Namenode: Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine.
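The secondary namenode's merge of the namespace image with the edit log can be pictured as replaying logged operations onto a snapshot and then starting a fresh log. The sketch below is a simplified model under stated assumptions: the dictionary layout, the `create` and `delete` operations and the block IDs are hypothetical and do not reflect HDFS's on-disk formats.

```python
import copy
from typing import Dict, List, Tuple

Namespace = Dict[str, List[str]]   # path -> block IDs (hypothetical layout)
Edit = Tuple[str, str, List[str]]  # (operation, path, block IDs)


def apply_edit(namespace: Namespace, edit: Edit) -> None:
    """Replay one logged operation against the namespace."""
    op, path, blocks = edit
    if op == "create":
        namespace[path] = list(blocks)
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(image: Namespace, edit_log: List[Edit]) -> Tuple[Namespace, List[Edit]]:
    """Merge the edit log into a copy of the image and return an empty log."""
    merged = copy.deepcopy(image)
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


image = {"/logs/a.log": ["blk_1", "blk_2"]}
edit_log = [
    ("create", "/logs/b.log", ["blk_3"]),
    ("delete", "/logs/a.log", []),
]
new_image, new_log = checkpoint(image, edit_log)
```

After the checkpoint the new image already contains every logged change, which is why the edit log can be truncated instead of growing without bound.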
  • 16. Hadoop Eco-System Components
      – Pig: A high-level data-flow language and execution framework for parallel computation
      – Mahout: A scalable machine learning and data mining library
      – Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying (SQL)
      – HBase: A scalable, distributed database that supports structured data storage for large tables
      – R (RHIPE): Combines Hadoop + R analytics language
      – Diagram, ecosystem layer: Pig, Mahout, Hive, HBase, R (RHIPE)
      – Diagram, core layers: MapReduce – Compute Layer (Job Scheduling / Execution); HDFS – Storage Layer (Hadoop Distributed Filesystem)
  • 17. Why Hadoop is Important
      – Pragmatic approach to analytics on a very large scale
          – Opens up new ways of gaining insights and identifying opportunities for businesses
      – Designed to address the rise of unstructured data
          – Enterprise data to grow by 650% over next 5 years
          – More than 80% of this growth will be unstructured data
  • 18. Evolution of the Hadoop Market
      – Adoption curve: Innovators/Early Adopters, Early Majority, Late Majority, Laggards
      – Marked on the curve: Hadoop Early Adopters, Hadoop Early Majority
  • 19. Evolution of the Hadoop Market
      – HADOOP PROFILE (TO DATE), Hadoop Early Adopters
          – Pioneers and academics
          – Application Architect
          – Visionary
          – Open source / community driven
          – Build-your-own server, application & storage infrastructure
          – Commodity components
          – Web 2.0
          – Universities
          – Life Sciences
  • 20. Evolution of the Hadoop Market
      – HADOOP PROFILE (TO DATE), Hadoop Early Adopters
          – Pioneers and academics
          – Application Architect
          – Visionary
          – Open source / community driven
          – Build-your-own server, application & storage infrastructure
          – Commodity components
          – Web 2.0
          – Universities
          – Life Sciences
      – HADOOP PROFILE (EMERGING), Hadoop Early Majority
          – IT Manager & CIO
          – Data Scientist
          – Line-of-business
          – Commercial distribution
          – Turnkey solution
          – End-to-End Data protection
          – Fortune 1000
          – Financial Services
          – Retail
  • 21. Technology Challenges of Hadoop
  • 22. Hadoop Architecture
      1. Data is ingested into the Hadoop File System (HDFS)
      2. Computation occurs inside Hadoop (MapReduce)
      3. Results are exported from HDFS for use
      – Diagram: one Hadoop Name Node and six Hadoop Data Nodes connected over Ethernet
  • 23. Writing Data into Hadoop
  • 24. Reading Data from HDFS
  • 25. Technology Challenges of Hadoop (Hadoop DAS Environment)
      1. Dedicated Storage Infrastructure: One-off for Hadoop only
      2. Single Point of Failure: Namenode
      3. Lacking Enterprise Data Protection: No snapshots, replication, backup
      4. Poor Storage Efficiency: 3X mirroring
      5. Fixed Scalability: Rigid compute to storage ratio
      6. Manual Import/Export: No protocol support
  • 26. Technology Challenges of Hadoop (same six challenges as slide 25; the Hadoop DAS Environment diagram adds 1x, 2x and 3x labels beneath the Namenode)
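The storage-efficiency point can be made concrete with a little arithmetic. The sketch below compares the raw capacity needed under the "3X mirroring" named on these slides with the ">80% Storage Utilization" figure the deck claims for the scale-out platform. The 100 TB data set and the function names are assumptions chosen for illustration, and 0.80 is taken as the lower bound of the deck's claim rather than a measured value.

```python
def raw_needed_with_replication(data_tb: float, replicas: int = 3) -> float:
    """Raw capacity when every block is stored `replicas` times."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return data_tb * replicas


def raw_needed_at_utilization(data_tb: float, utilization: float) -> float:
    """Raw capacity when only `utilization` (0-1] of raw space holds user data."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return data_tb / utilization


data_tb = 100.0  # hypothetical data set size

# 3X mirroring: one usable copy out of three stored, about 33% efficiency.
mirrored_raw = raw_needed_with_replication(data_tb, replicas=3)

# The deck's claimed utilization, taken at its 80% lower bound.
scale_out_raw = raw_needed_at_utilization(data_tb, utilization=0.80)

saved_tb = mirrored_raw - scale_out_raw
```

Under these assumptions the mirrored layout needs about 300 TB of raw disk for 100 TB of data, against roughly 125 TB at 80% utilization.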
  • 27. EMC Addresses the Hadoop Challenge
      1. Dedicated Storage Infrastructure (one-off for Hadoop only) becomes Scale-Out Storage Platform (multiple applications & workflows)
      2. Single Point of Failure (Namenode) becomes No Single Point of Failure (Distributed Namenode)
      3. Lacking Enterprise Data Protection (no snapshots, replication, backup) becomes End-to-End Data Protection (SnapshotIQ, SyncIQ, NDMP Backup)
      4. Poor Storage Efficiency (3X mirroring) becomes Industry-Leading Storage Efficiency (>80% Storage Utilization)
      5. Fixed Scalability (rigid compute to storage ratio) becomes Independent Scalability (add compute & storage separately)
      6. Manual Import/Export (no protocol support) becomes Multi-Protocol (industry standard protocols: NFS, CIFS, FTP, HTTP, HDFS)
  • 28. The EMC Isilon Advantage for Hadoop
      1. Scale-Out Storage Platform: Multiple applications & workflows
      2. No Single Point of Failure: Distributed Namenode
      3. End-to-End Data Protection: SnapshotIQ, SyncIQ, NDMP Backup
      4. Industry-Leading Storage Efficiency: >80% Storage Utilization
      5. Independent Scalability: Add compute & storage separately
      6. Multi-Protocol: Industry standard protocols (NFS, CIFS, FTP, HTTP, HDFS)
  • 29. Writing into Hadoop with Isilon
      – Isilon becomes the namenode as well as the data node
      – Provides scalability and protection of the data
      – The Hadoop cluster no longer has a single point of failure and no longer writes multiple 64MB-128MB chunks of data to datanodes
  • 30. Reading Hadoop Data with Isilon
      – Data is read off the cluster back to the compute nodes. The datanodes are now just compute nodes and are independent of the data in the Hadoop cluster.
      – Benefit: the Hadoop hardware can be upgraded without migrating the data
  • 31. Industry’s First and Only Scale-Out Storage Solution with Native Hadoop Integration
      – Accelerating the Benefits of Hadoop for the Enterprise
      – Reducing Risk
      – End-to-End Data Protection
      – Organizational Knowledge/Experience
  • 32. EMC’s Enterprise Hadoop Solution: EMC Greenplum HD and EMC Isilon Scale-Out Storage
      – Apache Hadoop certified by Greenplum
      – Simple platform management and control
      – Parallel analytics access with Greenplum Database
      – Diagram labels: Compute, Storage
  • 33. Greenplum: Not Just About Technology
      – Data Science teams will become the driving force for success with big data analytics
      – Greenplum is committed to the future of data science
          – University data science program collaboration with Stanford and UC Berkeley
          – Community investment including the Greenplum Analytic Workbench, Community edition software, and Data Science Summits
      – Greenplum built its own Data Science practice
          – Leading PhDs with analytic tools expertise
  • 34. Hadoop in Action
  • 35. Customer Case Study: Purdue University
      – Leading Big Ten university renowned worldwide for its research and academic excellence.
      – Sections: Background, Challenge, Solution
  • 36. Customer Case Study: Purdue University
      – Large Hadoop environment for researchers in the Statistics Department
      – No central storage infrastructure, leading to many disparate islands of data without consistent protection or performance
      – Small IT staff managing large amounts of data and hundreds of data-intensive users
  • 37. Customer Case Study: Purdue University
      – Deployed Isilon with HDFS, which plugged seamlessly into their Hadoop environment
      – Created a single, shared storage resource for data computing and analytics
      – Delivered a highly reliable and flexible storage infrastructure that protected data from loss or corruption
      – Eliminated the need to migrate data between storage silos, delivering immediate accessibility and significantly higher performance
  • 38. Customer Case Study: Purdue University
      – “We tested EMC Isilon with Hadoop in our statistics department, which must often analyze huge data sets. EMC Isilon’s multi-protocol capabilities provided fast and reliable delivery of data to our statisticians, demonstrating the potential to increase the time spent on actually doing the science, while reducing management costs.” (Alex Younts, Purdue University)
  • 39. Customer Case Study: Global Shipping & Transportation Co.
      – Leading global shipping and transportation company.
      – Sections: Background, Challenge, Solution
  • 40. Customer Case Study: Global Shipping & Transportation Co.
      – Large amounts of data in different formats from various business units. Focused on an e-commerce self-service site with semi-structured (XML) and unstructured log data
      – Looking to optimize their current ways of analyzing this data regardless of format
      – They wanted to understand what devices were accessing their self-service site in order to measure usage patterns and enhance the user experience on their e-commerce site
  • 41. Customer Case Study: Global Shipping & Transportation Co.
      – Using Isilon with HDFS as the central storage for their Hadoop environment, they eliminated any ETL steps, as data could simply be copied over standard protocols
      – Created a single, shared storage resource for data analytics, supporting queries across their entire data set regardless of whether the data is structured, semi-structured or unstructured
      – Delivered a highly reliable and flexible storage infrastructure that enabled mechanisms such as backup and archive to be part of their analytics workflow
  • 42. Questions?
  • 43. Thank You!
  • 44. Provide Feedback & Win!
      – 125 attendees will receive $100 iTunes gift cards. To enter the raffle, simply complete:
          – 5 session surveys
          – The conference survey
      – Download the EMC World Conference App to learn more: emcworld.com/app