AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS
Amazon Elastic MapReduce (Amazon EMR) makes it easy to provision and manage Hadoop in the AWS Cloud. Hadoop is available in multiple distributions and Amazon EMR gives you the option of using the Amazon Distribution or the MapR Distribution for Hadoop.

This webinar will show you examples of how to use Amazon EMR with the MapR Distribution for Hadoop. You will learn how you can free yourself from the heavy lifting required to run Hadoop on-premises, and gain the advantages of using the cloud to increase flexibility and accelerate projects while lowering costs.

What you'll learn:
• How to launch your first Hadoop cluster quickly and easily in a few steps, shown in a live demonstration.
• Examples of real-world applications and customer successes in production.
• Best practices for maximizing the benefits of using MapR with AWS.
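
As a rough illustration of the launch step, the sketch below builds the parameters for a cluster that runs the MapR distribution, in the shape accepted by the `run_job_flow` call of boto3's EMR client. The function name, cluster name, instance type and node counts are hypothetical. The `mapr` product name with an `--edition` argument is an assumption based on the EMR API of the period, and required settings such as IAM roles and the AMI version are deliberately left out, so this is an outline and not a working launch script.

```python
def build_mapr_cluster_request(name, core_count, instance_type="m1.large", edition="m3"):
    """Return keyword arguments shaped for boto3's emr.run_job_flow().

    Nothing is sent to AWS here; the function only builds a dictionary.
    """
    if core_count < 1:
        raise ValueError("a cluster needs at least one core node")
    return {
        "Name": name,
        # Assumption: MapR is requested as a supported product with an edition argument.
        "NewSupportedProducts": [{"Name": "mapr", "Args": ["--edition", edition]}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "Market": "ON_DEMAND",
                    "InstanceType": instance_type,
                    "InstanceCount": core_count,
                },
            ],
            # Keep the cluster running after its steps finish (long-running cluster).
            "KeepJobFlowAliveWhenNoSteps": True,
        },
    }


# Hypothetical example: one master node plus four core nodes.
request = build_mapr_cluster_request("mapr-demo", core_count=4)
print(request["Instances"]["InstanceGroups"][1]["InstanceCount"])
```

With credentials configured and the omitted settings filled in, the dictionary could be passed as `boto3.client("emr").run_job_flow(**request)`.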

Usage Rights

© All Rights Reserved

Presentation Transcript

  • Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS
  • Introducing Maya Cabassi Partner Marketing Manager Amazon Web Services
  • Webinar Overview • Submit your questions using the Q&A tool. • A copy of today’s presentation will be made available on: • AWS SlideShare Channel @ http://www.slideshare.net/AmazonWebServices/ • AWS Webinar Channel on YouTube @ http://www.youtube.com/channel/UCT-nPlVzJI-ccQXlxjSvJmw
  • Introducing Jonathan Fritz Sr. Product Manager Amazon Web Services Steve Wooledge VP, Product Marketing MapR Technologies Bruce Penn Principal Sales Engineer MapR Technologies
  • What We’ll Cover • Elastic MapReduce (EMR): Hadoop in the cloud • Elastic clusters tailored for your workflows • Best container to run Hadoop in the AWS Ecosystem • Introduction to MapR’s Hadoop Platform • Defining feature • Increased performance • Case Studies: MapR + Elastic MapReduce • Q&A
  • Hadoop in the Cloud Using MapR and Amazon Elastic MapReduce to unlock Big Data Jonathan Fritz, Sr. Product Manager, Amazon Web Services Steve Wooledge, VP, Product Marketing, MapR Technologies
  • Agenda • Elastic MapReduce (EMR): Hadoop in the cloud – Elastic clusters tailored for your workflows – Best container to run Hadoop in the AWS Ecosystem • Introduction to MapR’s Hadoop Platform – Defining features – Increased performance • Case Studies: MapR + Elastic MapReduce • Q+A
  • The Three V’s: the drivers behind Big Data (Variety, Velocity, Volume) • YouTube users upload 48 hours of new video every minute • Twitter sees roughly 175 million tweets every day • Facebook analyzes 30+ petabytes of user-generated data • More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide • 2.7 zettabytes of data exist in the digital universe today • Data production will be 44 times greater in 2020 than in 2009
  • Hadoop is the right system for Big Data • Scalable and fault tolerant • Flexibility for multiple languages and data formats • Open source • Ecosystem of tools • Batch and real-time analytics
  • Challenges with managing Hadoop • On-premises: manage HDFS, upgrades, and system administration; pay for expensive support contracts; select hardware in advance and stick with predictions • Cloud: hard to tightly integrate with AWS storage services; independently manage and monitor clusters
  • Amazon Elastic MapReduce (EMR) is the easiest way to run Hadoop in the cloud.
  • Why Amazon Elastic MapReduce? • Managed services • Easy to tune clusters and trim costs • Support for multiple AWS datastores • Unique features and ecosystem support
  • Elastic MapReduce data flow (diagram built up over several slides): Input data (S3, DynamoDB, Redshift) • Code • Name node • Elastic cluster (S3/HDFS) • Queries + BI via JDBC, Pig, Hive • Output
  • Elastic clusters. Customize size and type to reduce costs.
  • Choose your instance types Try out different configurations to find your optimal architecture. • CPU: c1.xlarge, cc1.4xlarge, cc2.8xlarge • Memory: m1.large, m2.2xlarge, m2.4xlarge • Disk: hs1.8xlarge
  • Long running or transient clusters Easy to run Hadoop clusters short-term or 24/7, and only pay for what you need.
  • Resizable clusters Easy to add and remove compute capacity on your cluster. Match compute demands with cluster sizing. (Diagram labels: 10 hours, 6 hours, peak capacity.)
  • Use Spot and Reserved Instances. Minimize costs by supplementing on-demand pricing.
  • Easy to use Spot Instances Name-your-price supercomputing to minimize costs. • Spot for task nodes: up to 90% off EC2 on-demand pricing • On-demand for core nodes: standard EC2 pricing for on-demand capacity
  • 24/7 clusters on Reserved Instances Minimize cost for consistent capacity. Reserved Instances for long running clusters. Up to 65% off on-demand pricing.
  • Your data, your choice. Easy to integrate Elastic MapReduce with your datastores.
  • Using Amazon S3 and On-Cluster Storage • Data sources • Data aggregated and stored in Amazon S3 • Transient EMR cluster for batch map/reduce jobs for daily reports • Long running EMR cluster holding data on the cluster in a NoSQL database • Weekly report • Ad-hoc query
  • Use Amazon EMR with Redshift and S3 • Data sources • Daily data aggregated in Amazon S3 • Amazon EMR cluster used to process data • Processed data loaded into Amazon Redshift data warehouse
  • © 2014 MapR Technologies Introduction to MapR
  • MapR: Worldwide Hadoop Technology Leader • Uniquely addresses both analytic and operational use cases • 500+ paying customers
  • Hadoop Distributions (diagram labels: Distribution A, Distribution B, Open Source, Management, Architectural Innovations)
  • MapR Distribution for Hadoop • Management • MapR Data Platform • Apache Hadoop & OSS ecosystem: Impala, Shark, Hive, Pig, Hue, Oozie, ZooKeeper, Mahout, MLLib, Juju, Solr, Cascading, HttpFS, Flume, Storm, Spark Streaming, YARN, MapReduce, HBase, Whirr, Sqoop, Drill, Tez, Knox, Sentry, Spark, Falcon • Enterprise-grade: high availability; data protection; disaster recovery • Interoperability: standard file access; standard database access; pluggable services; broad developer support • Security: enterprise security authorization; wire-level authentication; data governance • Operational: ability to support predictive analytics, real-time database operations, and high arrival rate data • Multi-tenancy: ability to logically divide a cluster to support different use cases, job types, user groups, and administrators • Performance: 2X to 7X higher performance; consistent, low latency
  • A winning combination: MapR with Amazon Elastic MapReduce
  • Launching a Cluster MapR Option Integrated within EMR
  • MapR: Designed for Both Transient and Long-Term Clusters • High Availability • Easy Development • Multi-Tenancy • World-Record Performance • Breadth of Applications Fastest On-Ramp to Develop Hadoop Applications Best Platform for Long-Term Hadoop Production Success
  • High Availability (HA) for Hadoop • Resource Manager HA, Application Master HA: YARN jobs are not impacted by failures; continue to meet SLAs with MapReduce v2 • JobTracker HA for MRv1: MapReduce v1 jobs are not impacted by failures; meet your data processing SLAs • NFS HA: high throughput and resilience for NFS-based data ingestion, import/export and multi-client access • Instant recovery: files and tables are accessible within seconds of a node failure or cluster restart • No-NameNode architecture: distributed metadata can self-heal; no practical limit on # of files
  • Direct Integration with Existing Applications • 100% POSIX compliant • Industry standard APIs - NFS, ODBC, LDAP, REST • More 3rd-party solutions • No proprietary connectors required • Language neutral
  • Multi-Tenancy Support for Parallelized App Development • Isolation: tasks sandboxed so they don’t impact other tasks or system daemons; system resources protected from runaway jobs; volume-based data segregation based on users and groups; volume-based data placement to control; label-based job scheduling to control • Quotas: storage quotas by volume/user/group; CPU and memory quotas by queue/user/group • Security and delegation: fine-grained administration permissions including volume-level delegation; authenticate users to AD, LDAP and Kerberos via Linux PAM • Reporting: detailed reporting on resource usage (75+ different metrics); all reports are available via UI, CLI and REST API
  • MapR M7: The Best In-Hadoop NoSQL Database (benefit: features) • High Performance: over 1 million ops/sec with 10 node cluster • Continuous Low Latency: no I/O storms, no compactions • 24x7 Applications: instant recovery, online schema modification, snapshots, mirroring • Zero Administration: no processes to manage, automated splits, self-tuning • High Scalability: 1 trillion tables, billions of rows, millions of columns • Low TCO: files and tables on one platform, more work with fewer nodes (row groups: Performance, Reliability, Easy Administration)
  • Flux7: Comparative Study of Hadoop Distributions Web Search and Data Analytics Benchmarks • Charts: Page Rank and Hive JOIN Query, time in seconds (lower is better) • Chart values: 425, 925, 333, 563, 367, 532, 163, 331 • Chart labels: IDH 2.4.1, CDH 4.3 • Source: Flux7 Labs Study, October 2013 • Hardware specs: EC2 on AWS • 1 master: m1.xlarge; 64-bit; 4 vCPU, 8 ECU; 15 GiB RAM; 4x420 GB storage; 4x Intel® Xeon® CPU E5-2650 0 @ 2.00 GHz • 4 slaves: m1.large; 64-bit; 2 vCPU, 4 ECU; 7.5 GiB RAM; 2x420 GB storage; 2x Intel® Xeon® CPU E5430 @ 2.66 GHz
  • Comparative Study of Hadoop Distributions Read and Write Throughput Benchmarks • Charts: DFSIO Read Throughput and DFSIO Write Throughput, MB per second (higher is better) • Chart values: 212, 59, 262, 69, 276, 64, 475, 465 • Chart labels: IDH, CDH, HDP, MapR • Source: Flux7 Labs Study, October 2013 http://flux7.com/blogs/case-studies/hadoop-distributions-a-detailed-comparative-study-whitepaper/ • Hardware specs: EC2 on AWS • 1 master: m1.xlarge; 64-bit; 4 vCPU, 8 ECU; 15 GiB RAM; 4x420 GB storage; 4x Intel® Xeon® CPU E5-2650 0 @ 2.00 GHz • 4 slaves: m1.large; 64-bit; 2 vCPU, 4 ECU; 7.5 GiB RAM; 2x420 GB storage; 2x Intel® Xeon® CPU E5430 @ 2.66 GHz
  • MapR M7: The Best In-Hadoop Database • NoSQL columnar store • Apache HBase API • Integrated with Hadoop • Other distros: HBase (JVM), HDFS (JVM), ext3/ext4, disks • MapR M7: tables/files, disks • The most scalable, enterprise-grade, NoSQL database that supports online applications and analytics
  • Customer Case Studies MapR with Amazon Elastic MapReduce in Action
  • Use cases for MapR with Amazon EMR • Targeted advertising / clickstream analysis • Security: anti-virus, fraud detection, image recognition • Pattern matching / recommendations • Reporting / BI • Bio-informatics (genome analysis) • Financial simulation (Monte Carlo simulation) • File processing (resize jpegs, video encoding) • Web indexing
  • Case Study • Challenges: RDBMS on AWS too slow; solution must be compatible with AWS & Java 7; high performance • Outcomes from MapR deployment w/ EMR: increased flexibility to scale at lower costs; faster turnaround for customer requests; ease of experimentation
  • Case Study • Challenges: large volumes of sensor data; project weather for 2.5 years at every 20x20 plot across the US; climatology simulations need to quickly experiment at small scale and then scale reliably • Outcomes from MapR deployment w/ EMR: faster machine learning performance enables more/faster simulations; MapR M7 provides geospatial database backed by Amazon S3
  • Demo
  • MapR/EMR Demonstration • Create MapR cluster using EMR • Review MapR Control System (MCS) • Show S3 and MapR integration • Demonstrate MapR’s real-time capability • Connect Mac to MapR via NFS • Run queries with HiveServer2 and Impala • Visualize data with Tableau
  • Questions and Contact • MapR: http://aws.amazon.com/elasticmapreduce/mapr/, swooledge@mapr.com, @mapr, Maprtech • AWS: aws.amazon.com/contact-us, jonfritz@amazon.com, @awscloud, Amazon Web Services
  • We’d like your feedback. Please complete a short survey: https://aws.asia.qualtrics.com/SE/?SID=SV_brzWlylHrqM29tr
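
The pricing slides in the transcript recommend running core nodes on on-demand capacity and task nodes on Spot Instances. The sketch below works through that arithmetic for one cluster run. The function name, the hourly rate, the node counts and the 50% Spot discount are all hypothetical figures chosen for illustration; the slides only state that Spot can be "up to 90%" below on-demand pricing, and real prices vary by instance type and over time.

```python
def estimate_cluster_cost(hours, core_nodes, task_nodes, on_demand_rate, spot_discount):
    """Estimate the cost of one cluster run.

    Core nodes are billed at the on-demand hourly rate; task nodes are billed
    at that rate reduced by spot_discount (a fraction between 0 and 1).
    """
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    core_cost = hours * core_nodes * on_demand_rate
    task_cost = hours * task_nodes * on_demand_rate * (1.0 - spot_discount)
    return core_cost + task_cost


# Hypothetical run: 10 hours, 4 core nodes, 6 task nodes, $0.10 per node-hour.
all_on_demand = estimate_cluster_cost(10, 4, 6, 0.10, spot_discount=0.0)
with_spot = estimate_cluster_cost(10, 4, 6, 0.10, spot_discount=0.5)
print(f"all on-demand: ${all_on_demand:.2f}")
print(f"spot task nodes: ${with_spot:.2f}")
```

The same function can be used to compare cluster shapes, for example a longer run on fewer nodes against a shorter run with extra task nodes added for the peak.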