Joseph Ziegler                                                                            Abhishek Sinha
Technical Evangelist                                                                      Big Data BDM
zieglerj@amazon.com                                                                       sinhaar@amazon.com
  @jiyosub

© 2012 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified or distributed in whole or in part without the express consent of Amazon.com, Inc.
Collect, Store, Organize, Analyze and Share



                               Velocity, Volume and Variety

App                                                    Data                                                  App




Data




                                                                                                    http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/
Data




                                                                                                    http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/
Hadoop




[Diagram labels: Uncertainty, Flexibility, Variety, Volume]




• But




- 2011 IDC Digital Universe Study




Best Friends




EMR is Hadoop in the Cloud

Hadoop is an open-source framework for
processing huge amounts of data in parallel
on a cluster of machines

Versions & Distributions

• Versions
  • 1.0.3
  • 0.20.205
  • 0.20
  • 0.18
• Distributions
  • Apache Hadoop




Job Flows

• Custom JAR
• Cascading
• Streaming




Applications

• Hive
• Pig
• HBase




[Diagram: S3 and EMR Cluster]

• Put the data into S3
• Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.
• Launch the cluster using the EMR console, CLI, SDK, or APIs
• Get the output from S3
• You can also store everything in HDFS
[Diagram: S3 and EMR Cluster]

You can easily add and remove nodes


[Diagram: S3 and EMR Cluster]

When processing is complete, you can terminate the cluster (and stop paying)

[Diagram: S3 and EMR Cluster]

If you run your jobs 24x7, you can also run a persistent cluster and use RI (Reserved Instance) models to save costs
[Diagram: AWS, with S3 at the center, surrounded by:]

• Console Upload
• FTP
• AWS Import / Export
• S3 API
• Direct Connect
• Storage Gateway
• Tsunami UDP
• 3rd Party Commercial Applications




code                        S3




Hadoop

 elastic-mapreduce --create --alive \
   --instance-type m1.xlarge \
   --num-instances 5

 ./elastic-mapreduce --create --alive \
   --name "Test Hive" \
   --hadoop-version 0.20 \
   --num-instances 5 \
   --instance-type m1.large \
   --hive-interactive \
   --hive-versions 0.7.1
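
The choices listed earlier (Hadoop version, number of nodes, node type, Hive) map one-to-one onto the flags in the commands above. As a rough sketch only, the Python below assembles the second command from those choices. The helper name `build_create_command` is hypothetical, the flags are copied from the slide rather than checked against any current tool, and nothing is executed: the script only prints the command line.

```python
import shlex


def build_create_command(name, num_instances, instance_type,
                         hadoop_version=None, hive_version=None, alive=True):
    """Assemble an elastic-mapreduce --create command as an argv list.

    Hypothetical helper: it only uses flags that appear on the slide.
    """
    argv = ["./elastic-mapreduce", "--create"]
    if alive:
        argv.append("--alive")  # the slide's flag for an interactive, long-lived cluster
    argv += ["--name", name]
    if hadoop_version is not None:
        argv += ["--hadoop-version", hadoop_version]
    argv += ["--num-instances", str(num_instances),
             "--instance-type", instance_type]
    if hive_version is not None:
        argv += ["--hive-interactive", "--hive-versions", hive_version]
    return argv


# Values taken from the "Test Hive" example on the slide.
hive_cluster = build_create_command(
    "Test Hive", 5, "m1.large", hadoop_version="0.20", hive_version="0.7.1"
)

# shlex.join quotes arguments that contain spaces, so the result can be pasted into a shell.
print(shlex.join(hive_cluster))
```

Building the command as a list and quoting it at the end avoids the line-continuation and quoting mistakes that are easy to make when typing a long command by hand.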

What are Spot Instances?

[Figure: a Region with two Availability Zones; blocks of unused capacity are sold at discounts of 50%, 54%, 56%, 59%, 66% and 63%.]
What is the tradeoff?

[Figure: a Region with two Availability Zones; six blocks of unused capacity, two of them marked Reclaimed.]
Mix Spot and On-Demand Instances

Scenario #1 (Cluster #1), duration: 14 hours
#1: Cost without Spot
4 instances * 14 hrs * $0.45 = $25.20

Scenario #2 (Cluster #2), duration: 7 hours
#2: Cost with Spot
4 instances * 7 hrs * $0.45 = $12.60 +
5 instances * 7 hrs * $0.225 = $7.875
Total = $20.475

Time Savings: 50%
Cost Savings: ~19%
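
The arithmetic in the two scenarios can be checked with a short Python sketch. The instance counts, durations, and hourly rates ($0.45 On-Demand, $0.225 Spot) are the slide's. The function name `cluster_cost` is made up for illustration, and the sketch assumes simple per-instance-hour billing with a Spot price that stays constant for the whole run, which real Spot pricing does not guarantee.

```python
def cluster_cost(instances, hours, hourly_rate):
    """Cost of running `instances` machines for `hours` at a flat hourly rate."""
    return instances * hours * hourly_rate


ON_DEMAND_RATE = 0.45   # $/instance-hour, from the slide
SPOT_RATE = 0.225       # $/instance-hour, from the slide (assumed constant)

# Scenario #1: 4 On-Demand instances for 14 hours.
cost_without_spot = cluster_cost(4, 14, ON_DEMAND_RATE)

# Scenario #2: the same 4 On-Demand instances plus 5 Spot instances, for 7 hours.
cost_with_spot = (cluster_cost(4, 7, ON_DEMAND_RATE)
                  + cluster_cost(5, 7, SPOT_RATE))

time_savings = 1 - 7 / 14
cost_savings = 1 - cost_with_spot / cost_without_spot

print(f"Without Spot: ${cost_without_spot:.2f}")
print(f"With Spot:    ${cost_with_spot:.3f}")
print(f"Time savings: {time_savings:.0%}")
print(f"Cost savings: {cost_savings:.2%}")
```

The exact cost saving works out to 18.75%, which the slide rounds to ~19%.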
options




Enterprise Data Integration

• Data Exchange
• Data Masking
• Data Quality
• MDM
• Data Transformation
• Connectivity
• Identity Resolution
Amazon.com Confidential/NDA Only
aws.amazon.com/elasticmapreduce
• Online Training
   – Videos
   – Articles/tutorials
• Documentation
   – Getting Started Guide
   – Developer Guide
   – API Reference
• FAQs




Joseph Ziegler                                                                           Abhishek Sinha
 Technical Evangelist                                                                     Big Data BDM
 zieglerj@amazon.com                                                                      sinhaar@amazon.com
 @jiyosub


More Related Content

What's hot

TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Yu Gong, Adobe, 23 Ap...
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Yu Gong, Adobe, 23 Ap...TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Yu Gong, Adobe, 23 Ap...
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Yu Gong, Adobe, 23 Ap...
TAUS - The Language Data Network
 
Develop multi-screen applications with Flex
Develop multi-screen applications with Flex Develop multi-screen applications with Flex
Develop multi-screen applications with Flex
Codemotion
 
CloudFront Partner Webinar
CloudFront Partner WebinarCloudFront Partner Webinar
CloudFront Partner Webinar
Amazon Web Services
 
AWS Webcast - Accelerating Application Performance Using In-Memory Caching in...
AWS Webcast - Accelerating Application Performance Using In-Memory Caching in...AWS Webcast - Accelerating Application Performance Using In-Memory Caching in...
AWS Webcast - Accelerating Application Performance Using In-Memory Caching in...
Amazon Web Services
 
How to: Avoid Mistakes at Scale
How to: Avoid Mistakes at ScaleHow to: Avoid Mistakes at Scale
How to: Avoid Mistakes at Scale
Amazon Web Services
 
Preserving Customizations with Overlays & Custom Objects in AR System 7.6.04
Preserving Customizations with Overlays & Custom Objects in AR System 7.6.04Preserving Customizations with Overlays & Custom Objects in AR System 7.6.04
Preserving Customizations with Overlays & Custom Objects in AR System 7.6.04
Vyom Labs
 

What's hot (6)

TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Yu Gong, Adobe, 23 Ap...
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Yu Gong, Adobe, 23 Ap...TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Yu Gong, Adobe, 23 Ap...
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Beijing, Yu Gong, Adobe, 23 Ap...
 
Develop multi-screen applications with Flex
Develop multi-screen applications with Flex Develop multi-screen applications with Flex
Develop multi-screen applications with Flex
 
CloudFront Partner Webinar
CloudFront Partner WebinarCloudFront Partner Webinar
CloudFront Partner Webinar
 
AWS Webcast - Accelerating Application Performance Using In-Memory Caching in...
AWS Webcast - Accelerating Application Performance Using In-Memory Caching in...AWS Webcast - Accelerating Application Performance Using In-Memory Caching in...
AWS Webcast - Accelerating Application Performance Using In-Memory Caching in...
 
How to: Avoid Mistakes at Scale
How to: Avoid Mistakes at ScaleHow to: Avoid Mistakes at Scale
How to: Avoid Mistakes at Scale
 
Preserving Customizations with Overlays & Custom Objects in AR System 7.6.04
Preserving Customizations with Overlays & Custom Objects in AR System 7.6.04Preserving Customizations with Overlays & Custom Objects in AR System 7.6.04
Preserving Customizations with Overlays & Custom Objects in AR System 7.6.04
 

Viewers also liked

Couchbase Server 2.0 - Indexing and Querying - Deep dive
Couchbase Server 2.0 - Indexing and Querying - Deep diveCouchbase Server 2.0 - Indexing and Querying - Deep dive
Couchbase Server 2.0 - Indexing and Querying - Deep dive
Dipti Borkar
 
Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28
Ted Dunning
 
Development Platform as a Service - erfarenheter efter ett års användning - ...
Development Platform as a Service - erfarenheter efter ett års användning -  ...Development Platform as a Service - erfarenheter efter ett års användning -  ...
Development Platform as a Service - erfarenheter efter ett års användning - ...
IBM Sverige
 
Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013
boorad
 
Hadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural PatternsHadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural Patterns
DataWorks Summit
 
OpenStack Heat slides
OpenStack Heat slidesOpenStack Heat slides
OpenStack Heat slides
dbelova
 
Cassandra at Instagram (August 2013)
Cassandra at Instagram (August 2013)Cassandra at Instagram (August 2013)
Cassandra at Instagram (August 2013)
Rick Branson
 
A user's perspective on SaltStack and other configuration management tools
A user's perspective on SaltStack and other configuration management toolsA user's perspective on SaltStack and other configuration management tools
A user's perspective on SaltStack and other configuration management tools
SaltStack
 
storm at twitter
storm at twitterstorm at twitter
storm at twitter
Krishna Gade
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016
Sid Anand
 
Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?
Cloudera, Inc.
 
Building Your First App with MongoDB
Building Your First App with MongoDBBuilding Your First App with MongoDB
Building Your First App with MongoDB
MongoDB
 
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB
 
Hadoop meets Cloud with Multi-Tenancy
Hadoop meets Cloud with Multi-TenancyHadoop meets Cloud with Multi-Tenancy
Hadoop meets Cloud with Multi-Tenancy
Treasure Data, Inc.
 
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Bolke de Bruin
 
Cloud Computing: Hadoop
Cloud Computing: HadoopCloud Computing: Hadoop
Cloud Computing: Hadoop
darugar
 
Data Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop ImplementationData Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop Implementation
Hortonworks
 
AWS Summit Auckland 2014 | Moving to the Cloud. What does it Mean to your Bus...
AWS Summit Auckland 2014 | Moving to the Cloud. What does it Mean to your Bus...AWS Summit Auckland 2014 | Moving to the Cloud. What does it Mean to your Bus...
AWS Summit Auckland 2014 | Moving to the Cloud. What does it Mean to your Bus...
Amazon Web Services
 
Accelerating DevOps Pipelines with AWS
Accelerating DevOps Pipelines with AWSAccelerating DevOps Pipelines with AWS
Accelerating DevOps Pipelines with AWS
Amazon Web Services
 
Managing an Enterprise Class Hybrid Architecture
Managing an Enterprise Class Hybrid ArchitectureManaging an Enterprise Class Hybrid Architecture
Managing an Enterprise Class Hybrid Architecture
Amazon Web Services
 

Viewers also liked (20)

Couchbase Server 2.0 - Indexing and Querying - Deep dive
Couchbase Server 2.0 - Indexing and Querying - Deep diveCouchbase Server 2.0 - Indexing and Querying - Deep dive
Couchbase Server 2.0 - Indexing and Querying - Deep dive
 
Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28
 
Development Platform as a Service - erfarenheter efter ett års användning - ...
Development Platform as a Service - erfarenheter efter ett års användning -  ...Development Platform as a Service - erfarenheter efter ett års användning -  ...
Development Platform as a Service - erfarenheter efter ett års användning - ...
 
Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013
 
Hadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural PatternsHadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural Patterns
 
OpenStack Heat slides
OpenStack Heat slidesOpenStack Heat slides
OpenStack Heat slides
 
Cassandra at Instagram (August 2013)
Cassandra at Instagram (August 2013)Cassandra at Instagram (August 2013)
Cassandra at Instagram (August 2013)
 
A user's perspective on SaltStack and other configuration management tools
A user's perspective on SaltStack and other configuration management toolsA user's perspective on SaltStack and other configuration management tools
A user's perspective on SaltStack and other configuration management tools
 
storm at twitter
storm at twitterstorm at twitter
storm at twitter
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016
 
Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?
 
Building Your First App with MongoDB
Building Your First App with MongoDBBuilding Your First App with MongoDB
Building Your First App with MongoDB
 
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
 
Hadoop meets Cloud with Multi-Tenancy
Hadoop meets Cloud with Multi-TenancyHadoop meets Cloud with Multi-Tenancy
Hadoop meets Cloud with Multi-Tenancy
 
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
 
Cloud Computing: Hadoop
Cloud Computing: HadoopCloud Computing: Hadoop
Cloud Computing: Hadoop
 
Data Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop ImplementationData Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop Implementation
 
AWS Summit Auckland 2014 | Moving to the Cloud. What does it Mean to your Bus...
AWS Summit Auckland 2014 | Moving to the Cloud. What does it Mean to your Bus...AWS Summit Auckland 2014 | Moving to the Cloud. What does it Mean to your Bus...
AWS Summit Auckland 2014 | Moving to the Cloud. What does it Mean to your Bus...
 
Accelerating DevOps Pipelines with AWS
Accelerating DevOps Pipelines with AWSAccelerating DevOps Pipelines with AWS
Accelerating DevOps Pipelines with AWS
 
Managing an Enterprise Class Hybrid Architecture
Managing an Enterprise Class Hybrid ArchitectureManaging an Enterprise Class Hybrid Architecture
Managing an Enterprise Class Hybrid Architecture
 

Hadoop on the Cloud

  • 1. Joseph Ziegler, Technical Evangelist, zieglerj@amazon.com, @jiyosub. Abhishek Sinha, Big Data BDM, sinhaar@amazon.com. © 2012 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • 4. Collect, Store, Organize, Analyze and Share. Velocity, Volume and Variety.
  • 5. App, Data, App
  • 6. Data (http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/)
  • 7. Data (http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/)
  • 8. Hadoop
  • 9. Uncertainty, Flexibility, Variety, Volume
  • 10. But
  • 11. - 2011 IDC Digital Universe Study
  • 12. Best Friends
  • 13. EMR is Hadoop in the Cloud. Hadoop is an open-source framework for processing huge amounts of data in parallel on a cluster of machines.
  • 14. Versions & Distributions. Versions: 1.0.3, 0.20.205, 0.20, 0.18. Distributions: Apache Hadoop.
  • 15. Job Flows: Custom JAR, Cascading, Streaming
  • 16. Applications: Hive, Pig, HBase
  • 19. Put the data into S3. Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc. Launch the cluster using the EMR console, CLI, SDK, or APIs. Get the output from S3. You can also store everything in HDFS. [Diagram: S3, EMR, EMR cluster]
  • 20. You can easily add and remove nodes. [Diagram: S3, EMR, EMR cluster]
  • 21. When processing is complete, you can terminate the cluster (and stop paying). [Diagram: S3, EMR cluster]
  • 22. If you run your jobs 24 x 7, you can also run a persistent cluster and use RI models to save costs. [Diagram: S3, EMR, EMR cluster]
  • 23. S3: AWS Console Upload, 3rd Party Commercial Applications, FTP, Tsunami UDP, AWS Import / Export, Storage Gateway, S3 API, Direct Connect
  • 24. code S3
  • 36. Hadoop elastic-mapreduce --create --alive --instance-type m1.xlarge --num-instances 5 ./elastic-mapreduce --create --alive --name "Test Hive" --hadoop-version 0.20 --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions 0.7.1
  • 39. What are Spot Instances? [Diagram: unused capacity in two Availability Zones of a Region, sold at 50%, 54%, 56%, 59%, 66% and 63% discounts]
  • 40. What is the tradeoff? [Diagram: unused capacity in two Availability Zones of a Region, some of it reclaimed]
  • 41. Mix Spot and On-Demand Instances. Scenario #1 (Cluster #1), cost without Spot: 4 instances * 14 hrs * $0.45 = $25.20. Duration: 14 hours. Scenario #2 (Cluster #2), cost with Spot: 4 instances * 7 hrs * $0.45 = $12.60 + 5 instances * 7 hrs * $0.225 = $7.875. Total = $20.475. Duration: 7 hours. Time savings: 50%. Cost savings: ~19%.
  • 44. options
  • 46. Data Quality, Data Masking, Data Exchange, MDM, Data Transformation, Enterprise Data Integration, Identity Resolution, Connectivity
  • 48. Amazon.com Confidential/NDA Only
  • 49. Amazon.com Confidential/NDA Only
  • 57. aws.amazon.com/elasticmapreduce • Online Training – Videos – Articles/tutorials • Documentation – Getting Started Guide – Developer Guide – API Reference • FAQs
  • 58. Joseph Ziegler, Technical Evangelist, zieglerj@amazon.com, @jiyosub; Abhishek Sinha, Big Data BDM, sinhaar@amazon.com

Editor's Notes

  1. The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
  2. Hadoop is complex and hard to set up, and it is capex intensive. Also, if your workload needs to scale, Hadoop is pretty hard to scale on physical infrastructure. Though Hadoop is a fault-tolerant system, it is difficult to replace failed components like disk drives or nodes: you still need time to procure the replacements.
  3. The key messages that we want to deliver with this slide are: 1. Elastic MapReduce is a hosted Hadoop service. We use the most stable version of Apache Hadoop, provide it as a hosted service, and build integration points with other services in the AWS ecosystem such as S3, CloudWatch, and DynamoDB. We make other improvements to Hadoop so that it becomes easier to scale and manage on AWS. 2. We will keep iterating on the different versions of Hadoop as they become stable. When you use the console you launch the latest version of Hadoop, but you also have the choice of launching an older version of Hadoop via the CLI or the SDK. 3. So what can you do with EMR? You can build applications on Amazon EMR, just like you would with Hadoop. In order to develop custom Hadoop applications, you used to need access to a lot of hardware to test your Hadoop programs. Amazon EMR makes it easy to spin up a set of Amazon EC2 instances as virtual servers to run your Hadoop cluster. You can also test various server configurations without having to purchase or reconfigure hardware. When you're done developing and testing your application, you can terminate your cluster, only paying for the computational time you used. Amazon EMR provides three types of clusters (also called job flows) that you can launch to run custom map-reduce applications, depending on the type of program you're developing and which libraries you intend to use.
  4. Supported Hadoop versions are 1.0.3, 0.20.205, 0.20, and 0.18.
  5. Custom JAR: Run your custom map-reduce program written in Java. This cluster provides low-level access to the MapReduce API. You have the most flexibility programming for this type of cluster, but also the responsibility of defining and implementing the map and reduce tasks in your Java application.
     Cascading: Cascading is an open-source Java library that provides a query API, a query planner, and a job scheduler for creating and running Hadoop MapReduce applications. Applications developed with Cascading are compiled and packaged into standard Hadoop-compatible JAR files similar to other native Hadoop applications. Multitool is a Cascading application that provides a simple command line interface for managing large datasets. For example, you can filter records matching a Java regular expression from data stored in Amazon S3 and copy the results to the Hadoop file system. You can run the Cascading Multitool application on Amazon Elastic MapReduce (Amazon EMR) using either the Amazon EMR command line interface or the Amazon EMR console. Amazon EMR supports all Multitool arguments.
     Streaming: Run a single Hadoop job based on map and reduce functions you upload to Amazon S3. The functions can be implemented in any of the following supported languages: Ruby, Perl, Python, PHP, R, Bash, C++.
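To make the Streaming job flow type concrete, here is a minimal word-count sketch in Python. It assumes the usual Hadoop Streaming convention of tab-separated key/value lines, with the reducer receiving its input sorted by key. The function names `map_lines` and `reduce_lines` are hypothetical and are not part of EMR or Hadoop; in a real job each function would be wrapped in a script that reads standard input and writes standard output.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per word. Input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local dry run: sorted() stands in for the sort/shuffle phase between map and reduce.
sample = ["the quick fox", "The lazy dog"]
shuffled = sorted(map_lines(sample))
for record in reduce_lines(shuffled):
    print(record)
```

Keeping the logic in plain generator functions makes it possible to test the mapper and reducer locally before uploading the scripts to Amazon S3.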
  6. Hive and Pig: You can use Amazon EMR to analyze data without writing a line of code. Several open-source applications run on top of Hadoop and make it possible to run map-reduce jobs and query data using either a SQL-like syntax or a specialized query language called Pig Latin. Amazon EMR is integrated with Apache Hive and Apache Pig. With Hive on Amazon EMR, you can run queries against data in NoSQL data stores like DynamoDB and HBase, along with data present in S3 and in HDFS, all in a single query. This is an Amazon-specific option.
     Moving large volumes of data: You can use Amazon EMR to move large amounts of data in and out of databases and data stores. By distributing the work, the data can be moved quickly. Amazon EMR provides custom libraries to move data in and out of Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, and Apache HBase.
  7. RazorFish
  8. EMR supports multiple instance types, including the latest HS1 instance type. EMR now supports High Storage Instances (hs1.8xlarge) in US East. These new instances offer 48 TB of storage across 24 hard disk drives, 35 EC2 Compute Units (ECUs) of compute capacity, 117 GB of RAM, 10 Gbps networking, and 2.4+ GB per second of sequential I/O performance. High Storage Instances are ideally suited for Hadoop and they significantly reduce the cost of processing very large data sets on EMR. We look forward to adding support for High Storage Instances in additional regions early next year.
  9. The concept of adding nodes works well with Hadoop, especially in the cloud, since 10 nodes running for 10 hours cost the same as 100 nodes running for 1 hour.
  10. 10 nodes x 10 hours = 100 node-hours, the same as 100 nodes running for 1 hour.
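The node-hours argument can be checked with a few lines of arithmetic. This is a sketch under two assumptions that the notes do not state: the job scales linearly with the number of nodes, and every instance-hour is billed at the same flat rate. The $0.45 rate is borrowed from the Spot example in this deck purely for illustration, and `cluster_cost` is a hypothetical helper.

```python
import math


def cluster_cost(nodes, hours, hourly_rate):
    """Cost of running `nodes` instances for `hours` hours at a flat hourly rate."""
    return nodes * hours * hourly_rate


HOURLY_RATE = 0.45  # assumed On-Demand price per instance-hour, in dollars

small_and_slow = cluster_cost(nodes=10, hours=10, hourly_rate=HOURLY_RATE)
large_and_fast = cluster_cost(nodes=100, hours=1, hourly_rate=HOURLY_RATE)

print(f"10 nodes x 10 hours: ${small_and_slow:.2f}")
print(f"100 nodes x 1 hour:  ${large_and_fast:.2f}")
print("same cost:", math.isclose(small_and_slow, large_and_fast))
```

Both configurations consume 100 node-hours, so the bill is the same while the larger cluster finishes ten times sooner.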
  11. 1.3 trillion objects; 835k+ peak transactions per second.
  12. You can run Hadoop clusters in automated mode, where your code is pulled out of S3 automatically by the cluster, or you can run an interactive cluster, where once the cluster boots you can SSH into the master node and manually fire a job.
  13. Now you can create a job flow. It's important to understand the concept of a job flow. A job flow is the series of instructions Amazon Elastic MapReduce (Amazon EMR) uses to process data. A job flow contains any number of user-defined steps. A step is any instruction that manipulates the data. Steps are executed in the order in which they are defined in the job flow.
  14. This screen gives you the chance to select a different version of Hadoop.
  15. Now you can select the type of job flow you want to run.
  16. Different options are available for different types of programs. For example, the Java-based JAR will ask you for the location of your input data, your output data, and your mapper and reducer scripts. Extra arguments are anything extra that your programs might need; for example, in this case I have chosen to include some specific Hive libraries that my Hive script refers to.
  17. Amazon EMR refers to a managed Hadoop cluster as a job flow, and defines the concept of instance groups, which are collections of Amazon EC2 instances that perform roles analogous to the master and slave nodes of Hadoop. There are three types of instance groups: master, core, and task. Each Amazon EMR job flow includes one master instance group that contains one master node, a core instance group containing one or more core nodes, and an optional task instance group, which can contain any number of task nodes. If the job flow is run on a single node, then that instance is simultaneously a master and a core node. For job flows running on more than one node, one instance is the master node and the remaining are core or task nodes. You have the choice of running different instance types for each of them. Let's look at each of these instance group types.
      Master Instance Group: The master instance group manages the job flow, coordinating the distribution of the MapReduce executable and subsets of the raw data to the core and task instance groups. It also tracks the status of each task performed and monitors the health of the instance groups. To monitor the progress of the job flow, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly or access the user interface that Hadoop publishes to the web server running on the master node. As the job flow progresses, each core and task node processes its data, transfers the data back to Amazon S3, and provides status metadata to the master node.
      Core Instance Group: The core instance group contains all of the core nodes of a job flow. A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node. The EC2 instances you assign as core nodes are capacity that must be allotted for the entire job flow run. Because core nodes store data, you can't remove them from a job flow. However, you can add more core nodes to a running job flow. Core nodes run both the DataNode and TaskTracker Hadoop daemons. Caution: Removing HDFS from a running node runs the risk of losing data.
      Task Instance Group: The task instance group contains all of the task nodes in a job flow. The task instance group is optional. You can add it when you start the job flow or add a task instance group to a job flow in progress. Task nodes are managed by the master node. While a job flow is running you can increase and decrease the number of task nodes. Because they don't store data and can be added and removed from a job flow, you can use task nodes to manage the EC2 instance capacity your job flow uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.
      Three other aspects related to instance groups are important, and we will address them later in this presentation: Spot Instances, dealing with failure, and resizing job flows.
  18. Amazon EC2 Key Pair: Optionally, specify a key pair that you created previously. If you do not enter a value in this field, you cannot use SSH to connect to the master node.
      Amazon VPC Subnet Id: Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC.
      Amazon S3 Log Path: Optionally, specify a path in Amazon S3 to receive a copy of the log files generated by the job flow. When this value is set, Amazon EMR copies the log files from the EC2 instances in the job flow to Amazon S3. This prevents the log files from being lost when the job flow ends and the EC2 instances hosting the job flow are terminated.
      Enable Debugging: Optionally, select Yes to create an index of your log files in Amazon SimpleDB. This index must exist in order to use the debugging tool in the Amazon EMR console. Whether or not to create this index can only be set when the job flow is created. If you set this to Yes, you must also specify a value for Amazon S3 Log Path.
      Keep Alive: Optionally, select Yes to cause the job flow to continue running when all processing is completed. This is how you would run a persistent cluster. Once you keep the cluster alive, you will be able to submit jobs to the cluster. Once a job is finished you will see that the cluster is in WAITING mode, as we discussed earlier. If you select No, the job flow is non-interactive and terminates automatically when it is done, so you do not continue to accrue charges on an idle job flow.
      Termination Protection: Optionally, select Yes to ensure the job flow is not shut down due to accident or error.
      Visible To All IAM Users: Select Yes to make the job flow visible and accessible to all IAM users on the AWS account. For more information, see Configure User Permissions with IAM.
  19. Bootstrap actions allow you to pass a reference to a script stored in Amazon S3. This script can contain configuration settings and arguments related to Hadoop or Elastic MapReduce. Bootstrap actions are run before Hadoop starts and before the node begins processing data. Unlike other managed services, EMR gives you complete control. With a bootstrap action you can make any customization to the Hadoop cluster or install other open-source projects such as Mahout on it. Note: If the bootstrap action returns a nonzero error code, Amazon Elastic MapReduce (Amazon EMR) treats it as a failure and terminates the instance. If too many instances fail their bootstrap actions, then Amazon EMR terminates the job flow. If just a few instances fail, then an attempt is made to reallocate the failed instances and continue. So this is another advantage of the managed service. Amazon provides a number of predefined bootstrap action scripts that you can use to customize Hadoop settings. This section describes the available predefined bootstrap actions. References to predefined bootstrap action scripts are passed to Elastic MapReduce by using the bootstrap-action parameter. I am going to talk to you about the predefined bootstrap actions in the next slide.
  20. All these predefined bootstrap action scripts are available in S3, and you can download them and change them. You can also use your own scripts. One example is a script that installs another script that pulls data incrementally from a relational data store. Another example is a script that installs Mahout and configures the environment for it. Let's look at the existing predefined bootstrap actions.
      Configure Daemons: This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use this bootstrap action to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collection behavior.
      Configure Hadoop: This bootstrap action allows you to set cluster-wide Hadoop settings. This script provides two types of command line options. Option 1: Enables you to upload an XML file containing configuration settings to Amazon S3. The bootstrap action merges the new configuration settings with the existing Hadoop configuration. Option 2: Allows you to specify a Hadoop key-value pair from the command line that overrides the existing Hadoop configuration.
      Configure Memory-Intensive Workloads: This bootstrap action allows you to set cluster-wide Hadoop settings to values appropriate for job flows with memory-intensive workloads. Note: The default configurations for cc1.4xlarge, cc2.8xlarge, hs1.8xlarge, and cg1.4xlarge instances are sufficient for memory-intensive workloads. This bootstrap action does not modify the settings for these instance types.
      Shutdown Actions: A bootstrap action script can create one or more shutdown actions by writing scripts to the /mnt/var/lib/instance-controller/public/shutdown-actions/ directory. When a job flow is terminated, all the scripts in this directory are executed in parallel. Each script must run and complete within 60 seconds. Note: Shutdown action scripts are not guaranteed to run if the node terminates with an error.
      Run If: You can use this predefined bootstrap action to conditionally run a command when an instance-specific value is found in the instance.json or job-flow.json files. The command can refer to a file in Amazon S3 that Amazon EMR can download and execute.
      Ganglia: Lastly, the one that we think gets used quite frequently is Ganglia. The Ganglia open source project is a scalable, distributed system designed to monitor clusters and grids while minimizing the impact on their performance. When you enable Ganglia on your job flow, you can generate reports and view the performance of the cluster as a whole, as well as inspect the performance of individual node instances. To set up Ganglia monitoring on a job flow, you must specify the Ganglia bootstrap action when you create the job flow. You cannot add Ganglia monitoring to a job flow that is already running. Amazon Elastic MapReduce (Amazon EMR) then installs the monitoring agents and the aggregator that Ganglia uses to report data. Once you have Ganglia set up, you can look at detailed Ganglia metrics like those on the next slide.
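The two Configure Hadoop options can be pictured as successive dictionary merges. The sketch below is a local model, not the bootstrap action's actual code: it assumes that settings from the uploaded XML file take precedence over existing ones and that command-line key-value pairs are applied last. The helper names `parse_hadoop_xml` and `merge_config` are hypothetical, and the property values are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_hadoop_xml(text):
    """Read <property><name>/<value> pairs from a Hadoop-style configuration file."""
    root = ET.fromstring(text)
    return {prop.findtext("name"): prop.findtext("value") for prop in root.iter("property")}


def merge_config(existing, uploaded_xml=None, overrides=None):
    """Return a new dict: existing settings, then the uploaded file, then key-value overrides."""
    merged = dict(existing)
    if uploaded_xml is not None:
        merged.update(parse_hadoop_xml(uploaded_xml))  # Option 1: merge an XML file
    if overrides:
        merged.update(overrides)  # Option 2: key-value pairs from the command line
    return merged


existing = {"io.sort.mb": "100", "mapred.tasktracker.map.tasks.maximum": "2"}
uploaded = """<configuration>
  <property><name>io.sort.mb</name><value>200</value></property>
  <property><name>io.sort.factor</name><value>20</value></property>
</configuration>"""

merged = merge_config(existing, uploaded, {"mapred.tasktracker.map.tasks.maximum": "4"})
for key in sorted(merged):
    print(key, "=", merged[key])
```

Returning a new dictionary instead of mutating `existing` keeps the cluster defaults available for comparison after the merge.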
  21. When you open the Ganglia web reports in a browser, you see an overview of the cluster's performance, with graphs detailing the load, memory usage, CPU utilization, and network traffic of the cluster. Below the cluster statistics are graphs for each individual server in the cluster. For example, in this job we launched three instances, so the following reports contain three instance charts showing the cluster data.
  22. When you don't specify --alive, the cluster shuts down when processing completes and you stop paying.
  23. You can increase or decrease the number of nodes in a running job flow. A job flow contains a single master node. The master node controls any slave nodes that are present. There are two types of slave nodes: core nodes, which hold data to process in the Hadoop Distributed File System (HDFS), and task nodes, which do not contain HDFS. After a job flow is running, you can increase, but not decrease, the number of core nodes. Task nodes also run your Hadoop jobs. After a job flow is running, you can both increase and decrease the number of task nodes. You can modify the size of a running job flow using either the API or the CLI. The AWS Management Console allows you to monitor job flows that you resized, but it does not provide the option to resize job flows. You may include a predefined step in your workflow that automatically resizes a job flow between steps that are known to have different capacity needs. As all steps are guaranteed to run sequentially, this allows you to set the number of slave nodes that will execute a given job flow step.
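The resizing rules above (core nodes can only grow, task nodes can grow or shrink) can be captured in a small local model. This is an illustrative sketch only: the `JobFlow` class and its `resize` method are hypothetical names and do not correspond to the EMR API or CLI.

```python
from dataclasses import dataclass


@dataclass
class JobFlow:
    """Toy model of a running job flow with one master node."""
    core_nodes: int
    task_nodes: int = 0

    def resize(self, group, new_count):
        """Apply the resize rules for the 'core' or 'task' instance group."""
        if new_count < 0:
            raise ValueError("node count cannot be negative")
        if group == "core":
            if new_count < self.core_nodes:
                # Core nodes hold HDFS data, so a running job flow cannot shed them.
                raise ValueError("core nodes can be added but not removed")
            self.core_nodes = new_count
        elif group == "task":
            # Task nodes store no data, so they can be added and removed freely.
            self.task_nodes = new_count
        else:
            raise ValueError(f"unknown instance group: {group}")


flow = JobFlow(core_nodes=4)
flow.resize("task", 5)   # scale out for a heavy step
flow.resize("task", 0)   # scale back in afterwards
flow.resize("core", 6)   # growing the core group is allowed
try:
    flow.resize("core", 2)
except ValueError as error:
    print("rejected:", error)
print(flow)
```

This is why task nodes are the natural place to absorb peak load: they are the only group that can shrink again once the peak has passed.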
  24. Enter spot instances
  25. What is the tradeoff? Spot capacity can be reclaimed. With Hadoop, if your task nodes are on Spot and they get taken away, your job won't stop and you will be able to continue.
  26. Suppose you have a job that runs for 14 hours on 4 nodes. Those 4 nodes running for 14 hours at $0.45 per hour (On-Demand) will cost you $25.20. Now assume that we add 5 more nodes, but on Spot. Since the number of nodes has roughly doubled, the time taken is halved, given Hadoop's scalability. So in the second case, I pay 4 instances x 7 hours x $0.45 = $12.60 for On-Demand and, assuming Spot is at 50% of On-Demand pricing, 5 instances x 7 hours x $0.225 = $7.875 for Spot, totalling $20.475. So you save 50% of the time and about 19% of the cost. If your Spot capacity gets taken away, you will be back to scenario one, which is what you intended to run earlier. So everything in scenario 2 (the bottom one) is a bonus!
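The two scenarios can be reproduced with a short calculation. The figures come from the slide; the assumptions that Spot stays at 50% of the On-Demand price for the whole run and that the job time halves are the ones stated in the note. The `cost` helper is a hypothetical name.

```python
ON_DEMAND_RATE = 0.45            # dollars per instance-hour, from the slide
SPOT_RATE = ON_DEMAND_RATE / 2   # assumes Spot stays at 50% of On-Demand


def cost(instances, hours, rate):
    """Cost of a group of instances billed per instance-hour."""
    return instances * hours * rate


# Scenario 1: 4 On-Demand instances for 14 hours.
scenario_1 = cost(4, 14, ON_DEMAND_RATE)

# Scenario 2: the same 4 On-Demand instances plus 5 Spot instances for 7 hours.
scenario_2 = cost(4, 7, ON_DEMAND_RATE) + cost(5, 7, SPOT_RATE)

time_savings = 1 - 7 / 14
cost_savings = 1 - scenario_2 / scenario_1

print(f"Scenario 1: ${scenario_1:.2f}")    # $25.20
print(f"Scenario 2: ${scenario_2:.3f}")    # $20.475
print(f"Time savings: {time_savings:.0%}")
print(f"Cost savings: {cost_savings:.2%}")
```

The exact cost saving is 18.75%, which the slide rounds to about 19%.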
  27. This is a great time to talk about what happens in case of a failure. If the master node goes down, your job flow will be terminated and you'll have to rerun your job. Amazon Elastic MapReduce currently does not support automatic failover of the master node or master node state recovery. In case of master node failure, the AWS Management Console displays "The master node was terminated", which is an indicator for you to start a new job flow. Customers can instrument checkpointing in their job flows to save intermediate data (data created in the middle of a job flow that has not yet been reduced) on Amazon S3. This allows resuming the job flow from the last checkpoint in case of failure. Amazon Elastic MapReduce is fault tolerant for slave failures and continues job execution if a slave node goes down. The service also monitors your job flow execution: retrying failed tasks, shutting down problematic instances, and provisioning new nodes to replace those that fail. Amazon EMR supports name node redundancy using MapR, so if you want to try MapR, please go ahead.
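The checkpointing idea can be sketched as follows. This is a local illustration, not EMR functionality: a plain dictionary stands in for the Amazon S3 bucket that would hold the intermediate data, and `run_with_checkpoints`, `parse`, and `total` are hypothetical names.

```python
def run_with_checkpoints(steps, data, store):
    """Run named steps in order, saving each output and skipping steps already saved."""
    for name, step in steps:
        if name in store:
            data = store[name]   # resume from the saved intermediate result
            continue
        data = step(data)
        store[name] = data       # checkpoint before moving to the next step
    return data


calls = []
attempts = {"total": 0}


def parse(lines):
    calls.append("parse")
    return [int(line) for line in lines]


def total(numbers):
    attempts["total"] += 1
    if attempts["total"] == 1:
        raise RuntimeError("simulated master node failure")
    return sum(numbers)


steps = [("parse", parse), ("total", total)]
store = {}  # stands in for an S3 bucket in this sketch

try:
    run_with_checkpoints(steps, ["1", "2", "3"], store)
except RuntimeError:
    pass  # the first run dies after "parse" has been checkpointed

result = run_with_checkpoints(steps, ["1", "2", "3"], store)
print(result, calls)
```

On the rerun the `parse` step is not executed again, because its output is read back from the checkpoint store.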
  28. There are two types of logs that store information about your job flow: step-level logs generated by Amazon Elastic MapReduce (Amazon EMR) and Hadoop job logs generated by Apache Hadoop. You need to examine both log types to have complete information about your job flow. Amazon EMR step-level logs contain information about the job flow and the results of each step. These logs are useful when you are debugging problems that you encounter initializing and running the job flow. For example, a step-level log contains status information such as "Streaming Command Failed!". Hadoop logs contain information about Hadoop jobs, tasks, and task attempts. They are the standard log files generated by Apache Hadoop. The following image shows the relationship between Amazon EMR job flow steps and Hadoop jobs, tasks, and task attempts. Both step-level logs and Hadoop logs are generated by default and stored on the master node of the job flow. You can access them while the job flow is running by using SSH to connect to the master node. When the job flow ends, the master node is terminated and you will no longer be able to access those logs using SSH. To be able to access the log files of a terminated job flow, you can direct Amazon EMR to copy the step-level and Hadoop log files to an Amazon S3 bucket. If you specify that the log files are to be copied to an Amazon S3 bucket, you have the option to have Amazon EMR create an index over those log files to generate debugging information and reports. This index is stored in Amazon SimpleDB and can be accessed by clicking the Debug button in the Amazon EMR console.
  29. Summarize this slide
  30. Quickly show this slide, take the names and move on to move examples as listed down from 49 to 53
  31. There is also support for enterprise products such as Informatica which you probably have heard about. Informatica is the leader in Enterprise data integration space. Their product Hparser allows you to use the cloud to do ETL operations on large data sets.Informatica'sHParser is a tool you can use to extract data stored in heterogeneous formats and convert it into a form that is easy to process and analyze. For example, if your company has legacy stock trading information stored in custom-formatted text files, you could use HParser to read the text files and extract the relevant data as XML. In addition to text and XML, HParser can extract and convert data stored in proprietary formats such as PDF and Word files.HParser is designed to run on top of the Hadoop architecture, which means you can distribute operations across many computers in a cluster to efficiently parse vast amounts of data. Amazon Elastic MapReduce (Amazon EMR) makes it easy to run Hadoop in the Amazon Web Services (AWS) cloud. With Amazon EMR you can set up a Hadoop cluster in minutes and automatically terminate the resources when the processing is complete.
  32. The MapR Hadoop distribution adds dependability and ease of use to the strength and flexibility of Hadoop. The Amazon Elastic MapReduce (EMR) service enables you to easily setup, operate, and scale MapR deployments in the cloud as well as integrate with other AWS services.
  33. NFSThe MapR distribution for Hadoop provides an NFS interface that you can use to mount the cluster. The NFS interface enables you to use standard Linux tools and applications with your cluster directly. You can get data into and out of the cluster with scp, and analyze data with commands like grep, sed, awk, or your own applications or scripts. Amazon EMR with MapR clusters have NFS preconfigured. The cluster is mounted at the /mapr directory on the master node; cluster data and files reside in the directory /mapr/clustername (for example/mapr/my.cluster.com). To use NFS on your Amazon EMR with MapR cluster, log in to the master node via ssh. After logging in to the cluster, you can use standard file-based applications, including Linux utilities, file browsers, and other applications.The MapRdistrbution for Hadoop provides a Hive ODBC driver that conforms to the standard ODBC 3.52 specification
  34. With the M5 version of the MapR software you get enterprise features like DR across availability zones, where you can mirror specific data between clusters. You can also extend an on-premises MapR cluster to the cloud. Last but by no means least, you can take periodic on-demand snapshots to S3.
  35. So let's look at some of the common design decisions developers have to make before they start deploying a cluster. The first one is: should I use S3 or should I run HDFS? Actually you can use both; the choice is yours. Remember that with EMR, data in HDFS is lost as soon as you shut down the cluster, since HDFS sits on the local ephemeral drives and goes away with the cluster.
  36. Take for example the Netflix Hadoop platform-as-a-service architecture. Netflix collects a huge amount of data, and what you see in the diagram is their Hadoop-as-a-service platform built on AWS. They offer a big data processing engine to different stakeholders within the business. At the base of the service is S3, where everything that is worth storing is stored, and which is therefore the "single version of truth". With its scale, cost, global reach and durability, S3 is the perfect place for them to store data. From S3 they run multiple EMR clusters. They prefer EMR to building their own cluster on EC2 because EMR takes away the undifferentiated heavy lifting. Various tools are used to explore the data, such as Hive, Pig, Java programs and Python code. On top of that they have a job execution and resource management platform called Genie. Genie is connected to enterprise schedulers and to other visualization and web tools for data analysis.
  37. These are the reasons why customers choose S3. Eleven 9s of durability. Version control against failure: with S3 you can enable versioning, which protects the data from logical corruption. Let's say that on your physical cluster a developer overwrote something and logically corrupted the data. Despite the 3x replication of data in HDFS, you cannot recover it. With S3 you just roll back. Elastic and practically unlimited size. You can run multiple clusters: one production cluster, one SLA-driven high-performance cluster, many ad hoc clusters, many dev clusters. Running different types of workflow in parallel on separate clusters guarantees isolation between jobs. Remember that five 10-node clusters cost you the same as one 50-node cluster but provide better isolation. Now, if your data is in HDFS, you need to replicate all the data to each cluster; with S3 there is one single version of truth and you can run as many clusters as you want. Continuously resizing clusters on the fly can be difficult if all your data is in HDFS, because data redistribution can happen; with S3 there is just a single version of truth. On failure or spiky load, spin up a new cluster and start the job flow; there is no need to mirror data across HDFS. http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
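The cost claim follows from billing by node-hour: the total depends on the number of nodes and the hours they run, not on how the nodes are grouped into clusters. A small worked sketch in Python; the 4-hour duration and the $0.25 per node-hour price are made-up figures for illustration, not AWS pricing.

```python
def cluster_cost(clusters, nodes_per_cluster, hours, price_per_node_hour):
    """Total cost when billing is per node-hour."""
    return clusters * nodes_per_cluster * hours * price_per_node_hour


# Hypothetical figures: 4 hours of work at an assumed $0.25 per node-hour.
HOURS = 4
PRICE = 0.25

five_small = cluster_cost(5, 10, HOURS, PRICE)  # five 10-node clusters
one_large = cluster_cost(1, 50, HOURS, PRICE)   # one 50-node cluster
```

Both calls cover the same 200 node-hours, so the two totals are equal; the difference is that the five small clusters do not compete with each other for resources.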
  38. However, if you do want to use HDFS, you can. Remember that if the cluster is shut down, the data is lost, so make sure termination protection is on. All data processing then happens locally and not from S3. Alternatively, consider snapshotting to S3 periodically. Use S3DistCp to pull large volumes of data from S3 or to push large volumes of data to S3. S3DistCp is a tool available on EMR for moving large amounts of data. Remember that S3DistCp runs across multiple nodes, so each node moves data in parallel.
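As a rough sketch, an S3DistCp copy can be described as an EMR step. The Python below only builds the step definition, in the dictionary shape that boto3's EMR `add_job_flow_steps` call takes for its `Steps` parameter. The helper name, step name and bucket paths are hypothetical, and invoking `s3-dist-cp` through `command-runner.jar` is an assumption that matches recent EMR releases; older AMIs shipped a separate S3DistCp jar instead.

```python
def s3distcp_step(src, dest, name="S3DistCp copy"):
    """Build an EMR step definition that runs S3DistCp from src to dest."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # Assumption: a recent EMR release where command-runner.jar
            # dispatches to the s3-dist-cp command.
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }


# Hypothetical bucket and paths: pull input from S3 into the cluster's HDFS.
step = s3distcp_step("s3://example-bucket/input/", "hdfs:///data/input/")
```

Swapping `src` and `dest` gives the reverse direction, which is one way to do the periodic snapshot of HDFS data to S3 mentioned above.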
  39. You can definitely use HDFS on EMR. You need to have