The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
Hadoop is complex to set up and hard to operate. It is capex intensive. Also, if your workload needs to scale, Hadoop is hard to scale on physical infrastructure. Though Hadoop is a fault-tolerant system, it is difficult to replace failed components such as disk drives or nodes: you still need time to procure the replacement hardware.
The key messages that we want to deliver with this slide:
1. Elastic MapReduce is a hosted Hadoop service. We take the most stable version of Apache Hadoop, provide it as a hosted service, and build integration points with other services in the AWS ecosystem such as S3, CloudWatch, DynamoDB, etc. We also make other improvements to Hadoop so that it becomes easier to scale and manage on AWS.
2. We will keep iterating on the different versions of Hadoop as they become stable. When you use the console you launch the latest version of Hadoop, but you also have the choice of launching an older version of Hadoop via the CLI or the SDK.
3. So what can you do with EMR? You can build applications on Amazon EMR, just as you would with Hadoop. To develop custom Hadoop applications, you used to need access to a lot of hardware to test your Hadoop programs. Amazon EMR makes it easy to spin up a set of Amazon EC2 instances as virtual servers to run your Hadoop cluster. You can also test various server configurations without having to purchase or reconfigure hardware. When you're done developing and testing your application, you can terminate your cluster, paying only for the computational time you used. Amazon EMR provides three types of clusters (also called job flows) that you can launch to run custom map-reduce applications, depending on the type of program you're developing and which libraries you intend to use.
Supported Hadoop versions are 1.0.3, 0.20.205, 0.20, and 0.18.
Custom JAR
Run your custom map-reduce program written in Java. This cluster provides low-level access to the MapReduce API. You have the most flexibility programming for this type of cluster, but also the responsibility of defining and implementing the map and reduce tasks in your Java application.

Cascading
Cascading is an open-source Java library that provides a query API, a query planner, and a job scheduler for creating and running Hadoop MapReduce applications. Applications developed with Cascading are compiled and packaged into standard Hadoop-compatible JAR files, similar to other native Hadoop applications. Multitool is a Cascading application that provides a simple command line interface for managing large datasets. For example, you can filter records matching a Java regular expression from data stored in Amazon S3 and copy the results to the Hadoop file system. You can run the Cascading Multitool application on Amazon Elastic MapReduce (Amazon EMR) using either the Amazon EMR command line interface or the Amazon EMR console. Amazon EMR supports all Multitool arguments.

Streaming
Run a single Hadoop job based on map and reduce functions you upload to Amazon S3. The functions can be implemented in any of the following supported languages: Ruby, Perl, Python, PHP, R, Bash, C++.
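A streaming job's map and reduce functions are ordinary programs that read lines from stdin and write tab-separated key/value pairs to stdout. A minimal word-count sketch in Python (one common way to write it; the command-line harness at the bottom is just an illustrative convention, not required by Hadoop):

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming word-count sketch.
# Mapper: reads raw lines, emits "word\t1" pairs.
# Reducer: reads sorted "word\tcount" lines, sums counts per word.
# (Hadoop Streaming sorts mapper output by key before the reduce phase.)
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def reducer(lines):
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

if __name__ == "__main__" and len(sys.argv) > 1:
    # Run as "script.py map" or "script.py reduce" in a streaming step.
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(sys.stdin):
        print(out)
```

You would upload this script to S3 and reference it as the mapper and reducer of a streaming step; the same pattern carries over to Ruby, Perl, PHP, or any of the other supported languages.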
Hive and Pig
You can use Amazon EMR to analyze data without writing a line of code. Several open-source applications run on top of Hadoop and make it possible to run map-reduce jobs and query data using either a SQL-like syntax or a specialized query language called Pig Latin. Amazon EMR is integrated with Apache Hive and Apache Pig. With Amazon EMR's version of Hive, you can run queries against data in NoSQL data stores like DynamoDB and HBase, along with data present in S3 and in HDFS, all in a single query. This is an Amazon-specific option.

You can also use EMR to move large volumes of data. You can use Amazon EMR to move large amounts of data in and out of databases and data stores. By distributing the work, the data can be moved quickly. Amazon EMR provides custom libraries to move data in and out of Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, and Apache HBase.
EMR supports multiple instance types, including the latest High Storage instance type. EMR now supports High Storage Instances (hs1.8xlarge) in US East. These new instances offer 48 TB of storage across 24 hard disk drives, 35 EC2 Compute Units (ECUs) of compute capacity, 117 GB of RAM, 10 Gbps networking, and 2.4+ GB per second of sequential I/O performance. High Storage Instances are ideally suited for Hadoop and they significantly reduce the cost of processing very large data sets on EMR. We look forward to adding support for High Storage Instances in additional regions early next year.
And the concept of adding nodes works well with Hadoop, especially on the cloud, since 10 nodes running for 10 hours cost the same as 100 nodes running for 1 hour.
10 nodes × 10 hours = 100 node-hours = 100 nodes × 1 hour
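The node-hour equivalence above is simple arithmetic; a quick sketch makes it concrete (the hourly rate is an assumed figure for illustration, not a real EMR price):

```python
# Node-hour cost equivalence: with per-instance-hour pricing, the same
# total amount of work costs the same whether you run wide or long.
RATE = 0.45  # $/instance-hour (assumed, for illustration only)

def cluster_cost(nodes, hours, rate=RATE):
    """Total cost of running `nodes` instances for `hours` hours."""
    return nodes * hours * rate

slow = cluster_cost(nodes=10, hours=10)   # 100 node-hours
fast = cluster_cost(nodes=100, hours=1)   # 100 node-hours, 10x faster
assert slow == fast
```

Same bill, answers ten times sooner: this is the core economic argument for scaling out on the cloud.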
1.3 trillion objects; 835k+ peak transactions per second
You can run Hadoop clusters in automated mode, where your code is pulled from S3 automatically by the cluster, or you can run an interactive cluster, where once the cluster boots you can SSH into the master node and fire a job manually.
Now you can create a job flow. It's important to understand the concept of a job flow. A job flow is the series of instructions Amazon Elastic MapReduce (Amazon EMR) uses to process data. A job flow contains any number of user-defined steps. A step is any instruction that manipulates the data. Steps are executed in the order in which they are defined in the job flow.
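The job flow concept can be sketched as a tiny model: an ordered list of steps executed strictly in definition order. The class and step names here are made up for illustration; this is not the EMR API.

```python
# Illustrative model of a job flow (names are hypothetical, not the EMR API):
# a job flow is an ordered list of user-defined steps, run sequentially.
class JobFlow:
    def __init__(self, name):
        self.name = name
        self.steps = []       # user-defined steps, in definition order
        self.completed = []

    def add_step(self, instruction):
        self.steps.append(instruction)

    def run(self):
        # Steps execute one after another, each manipulating the data.
        for step in self.steps:
            self.completed.append(step)
        return self.completed

flow = JobFlow("nightly-aggregation")
flow.add_step("copy raw logs from S3 into the cluster")
flow.add_step("run the aggregation MapReduce job")
flow.add_step("copy results back to S3")
```

Calling `flow.run()` returns the steps in exactly the order they were added, which mirrors EMR's sequential-step guarantee.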
This screen gives you the chance to select a different version of Hadoop.
Now you can select the type of job flow you want to run
There will be different options available for different types of program. For example, the Java-based JAR will ask you for the location of your input data, output data, and mapper and reducer scripts. Extra arguments are anything extra your programs might need; in this case, I have chosen to include some specific Hive libraries that my Hive script refers to.
Amazon EMR refers to a managed Hadoop cluster as a job flow, and defines the concept of instance groups, which are collections of Amazon EC2 instances that perform roles analogous to the master and slave nodes of Hadoop. There are three types of instance groups: master, core, and task. Each Amazon EMR job flow includes one master instance group that contains one master node, a core instance group containing one or more core nodes, and an optional task instance group, which can contain any number of task nodes. If the job flow is run on a single node, that instance is simultaneously a master and a core node. For job flows running on more than one node, one instance is the master node and the remaining are core or task nodes. You have the choice of running different instance types for each of them. Let's look at each of these instance group types.

Master Instance Group
The master instance group manages the job flow: coordinating the distribution of the MapReduce executable, and subsets of the raw data, to the core and task instance groups. It also tracks the status of each task performed and monitors the health of the instance groups. To monitor the progress of the job flow, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly or access the user interface that Hadoop publishes to the web server running on the master node. As the job flow progresses, each core and task node processes its data, transfers the data back to Amazon S3, and provides status metadata to the master node.

Core Instance Group
The core instance group contains all of the core nodes of a job flow. A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node. The EC2 instances you assign as core nodes are capacity that must be allotted for the entire job flow run. Because core nodes store data, you can't remove them from a job flow.
However, you can add more core nodes to a running job flow. Core nodes run both the DataNode and TaskTracker Hadoop daemons.

Caution: Removing HDFS from a running node runs the risk of losing data.

Task Instance Group
The task instance group contains all of the task nodes in a job flow. The task instance group is optional. You can add it when you start the job flow, or add a task instance group to a job flow in progress. Task nodes are managed by the master node. While a job flow is running you can increase and decrease the number of task nodes. Because they don't store data and can be added and removed from a job flow, you can use task nodes to manage the EC2 instance capacity your job flow uses, increasing capacity to handle peak loads and decreasing it later. Task nodes run only a TaskTracker Hadoop daemon.

Three other aspects of instance groups are important, and we will address them later in this presentation: Spot Instances, dealing with failure, and resizing job flows.
Amazon EC2 Key Pair: Optionally, specify a key pair that you created previously. If you do not enter a value in this field, you cannot use SSH to connect to the master node.

Amazon VPC Subnet Id: Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC.

Amazon S3 Log Path: Optionally, specify a path in Amazon S3 to receive a copy of the log files generated by the job flow. When this value is set, Amazon EMR copies the log files from the EC2 instances in the job flow to Amazon S3. This prevents the log files from being lost when the job flow ends and the EC2 instances hosting the job flow are terminated.

Enable Debugging: Optionally, select Yes to create an index of your log files in Amazon SimpleDB. This index must exist in order to use the debugging tool in the Amazon EMR console. Whether or not to create this index can only be set when the job flow is created. If you set this to Yes, you must also specify a value for Amazon S3 Log Path.

Keep Alive: Optionally, select Yes to cause the job flow to continue running when all processing is completed. This is how you would run a persistent cluster. Once you keep the cluster alive, you will be able to submit jobs to it; when a job finishes, you will see that the cluster is in WAITING mode, as we discussed earlier. If you select No, the job flow is non-interactive and terminates automatically when it is done, so you do not continue to accrue charges on an idle job flow.

Termination Protection: Optionally, select Yes to ensure the job flow is not shut down due to accident or error.

Visible To All IAM Users: Select Yes to make the job flow visible and accessible to all IAM users on the AWS account. For more information, see Configure User Permissions with IAM.
Bootstrap actions allow you to pass a reference to a script stored in Amazon S3. This script can contain configuration settings and arguments related to Hadoop or Elastic MapReduce. Bootstrap actions are run before Hadoop starts and before the node begins processing data. Unlike other managed services, EMR gives you complete control: with a bootstrap action you can make any customizations to the Hadoop cluster, or install other open source projects like Mahout on it.

Note: If the bootstrap action returns a nonzero error code, Amazon Elastic MapReduce (Amazon EMR) treats it as a failure and terminates the instance. If too many instances fail their bootstrap actions, then Amazon EMR terminates the job flow. If just a few instances fail, then an attempt is made to reallocate the failed instances and continue. This is another advantage of the managed service.

Amazon provides a number of predefined bootstrap action scripts that you can use to customize Hadoop settings. References to predefined bootstrap action scripts are passed to Elastic MapReduce using the bootstrap-action parameter. I am going to talk about the predefined bootstrap actions in the next slide.
All these predefined bootstrap action scripts are available in S3, and you can download and change them. You can also use your own scripts: for example, a script that installs a data-import tool and pulls data from a relational data store incrementally, or a script that installs Mahout and configures the environment for it. Let's look at the existing predefined bootstrap actions.

Configure Daemons
This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use this bootstrap action to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collection behavior.

Configure Hadoop
This bootstrap action allows you to set cluster-wide Hadoop settings. The script provides two types of command line options:
Option 1: Enables you to upload an XML file containing configuration settings to Amazon S3. The bootstrap action merges the new configuration settings with the existing Hadoop configuration.
Option 2: Allows you to specify a Hadoop key-value pair on the command line that overrides the existing Hadoop configuration.

Configure Memory-Intensive Workloads
This bootstrap action allows you to set cluster-wide Hadoop settings to values appropriate for job flows with memory-intensive workloads.
NOTE: The default configurations for cc1.4xlarge, cc2.8xlarge, hs1.8xlarge, and cg1.4xlarge instances are sufficient for memory-intensive workloads. This bootstrap action does not modify the settings for these instance types.

Shutdown Actions
A bootstrap action script can create one or more shutdown actions by writing scripts to the /mnt/var/lib/instance-controller/public/shutdown-actions/ directory. When a job flow is terminated, all the scripts in this directory are executed in parallel.
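The merge and override behavior of Configure Hadoop can be sketched as plain dictionary operations. This is a simplification for the talk; the real bootstrap action operates on Hadoop's XML configuration files, and the property names below are just sample Hadoop settings.

```python
# Sketch of the Configure Hadoop bootstrap action's two modes, modeled
# as dict operations (the real action works on Hadoop XML config files).
existing = {"mapred.map.tasks": "2", "dfs.replication": "3"}

# Option 1: merge an uploaded configuration file into the existing
# configuration; new settings win where keys collide.
uploaded = {"dfs.replication": "2", "io.sort.mb": "200"}
merged = dict(existing)
merged.update(uploaded)

# Option 2: override a single key-value pair from the command line.
merged["mapred.map.tasks"] = "4"
```

The end result: untouched keys survive, uploaded-file keys replace their existing counterparts, and command-line pairs win over everything.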
Each script must run and complete within 60 seconds.
Note: Shutdown action scripts are not guaranteed to run if the node terminates with an error.

Run If
You can use this predefined bootstrap action to conditionally run a command when an instance-specific value is found in the instance.json or job-flow.json files. The command can refer to a file in Amazon S3 that MapReduce can download and execute.

Lastly, the one that we think gets used quite frequently is Ganglia. The Ganglia open source project is a scalable, distributed system designed to monitor clusters and grids while minimizing the impact on their performance. When you enable Ganglia on your job flow, you can generate reports and view the performance of the cluster as a whole, as well as inspect the performance of individual node instances. To set up Ganglia monitoring on a job flow, you must specify the Ganglia bootstrap action when you create the job flow; you cannot add Ganglia monitoring to a job flow that is already running. Amazon Elastic MapReduce (Amazon EMR) then installs the monitoring agents and the aggregator that Ganglia uses to report data. Once you have Ganglia set up, you can look at detailed Ganglia metrics like those on the next slide.
When you open the Ganglia web reports in a browser, you see an overview of the cluster's performance, with graphs detailing the load, memory usage, CPU utilization, and network traffic of the cluster. Below the cluster statistics are graphs for each individual server in the cluster. For example, in this job we launched three instances, so in the following reports there are three instance charts showing the cluster data.
When you don't keep the cluster alive, it shuts down and you stop paying.
You can increase or decrease the number of nodes in a running job flow. A job flow contains a single master node. The master node controls any slave nodes that are present. There are two types of slave nodes: core nodes, which hold data to process in the Hadoop Distributed File System (HDFS), and task nodes, which do not contain HDFS. After a job flow is running, you can increase, but not decrease, the number of core nodes. Task nodes also run your Hadoop jobs; after a job flow is running, you can both increase and decrease the number of task nodes. You can modify the size of a running job flow using either the API or the CLI. The AWS Management Console allows you to monitor job flows that you resized, but it does not provide the option to resize job flows. You may include a predefined step in your workflow that automatically resizes a job flow between steps that are known to have different capacity needs. Because all steps are guaranteed to run sequentially, this allows you to set the number of slave nodes that will execute a given job flow step.
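The resize rules above (core nodes can only be added, task nodes can be added or removed, and the master group is fixed) can be captured in a small validity check. The function name and shape are made up for illustration; this is not an EMR API.

```python
# Sketch of the resize rules for a running job flow (hypothetical helper,
# not an EMR API):
#   - core nodes store HDFS data, so the core group can only grow;
#   - task nodes hold no HDFS data, so the task group can grow or shrink;
#   - the master group is fixed at one node.
def resize_allowed(group, current, requested):
    if group == "core":
        return requested >= current   # grow-only
    if group == "task":
        return requested >= 0         # grow or shrink freely
    return False                      # master group cannot be resized

assert resize_allowed("core", 4, 6)       # adding core nodes: OK
assert not resize_allowed("core", 4, 2)   # removing core nodes: not allowed
assert resize_allowed("task", 8, 2)       # shrinking the task group: OK
```

Encoding the rule this way makes the reasoning explicit: the only asymmetry comes from where HDFS blocks live.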
Enter spot instances
What is the trade-off? In the case of Hadoop, if your task nodes are on Spot and they get taken away, your job won't stop and you will be able to continue.
Suppose you have a job that takes 4 nodes and runs for 14 hours. 4 nodes running for 14 hours at $0.45 per hour (on-demand) will cost you $25.20. Now assume we add 5 more nodes, but on Spot. Since the number of nodes has roughly doubled, the time taken is roughly halved, given Hadoop's scalability. So in the second case I pay 4 instances × 7 hours × $0.45 = $12.60, and assuming Spot is at 50% of on-demand pricing, 5 instances × 7 hours × $0.225 = $7.88, totalling $20.48. So you get a 50% time saving and roughly 19% cost savings. If your Spot capacity gets taken away, you are back to scenario one, which is what you intended to run in the first place. So everything in scenario two (the bottom one) is a bonus!
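The scenario arithmetic checks out; here it is reproduced end to end. The prices are the slide's assumed figures, not actual EC2 rates, and the "doubling nodes halves runtime" scaling is the idealized assumption the slide makes.

```python
# Reproducing the slide's Spot-instance scenario arithmetic.
ON_DEMAND = 0.45          # $/instance-hour (assumed figure)
SPOT = ON_DEMAND * 0.5    # assume Spot runs at 50% of on-demand

# Scenario 1: 4 on-demand nodes for 14 hours.
scenario1 = 4 * 14 * ON_DEMAND                # $25.20

# Scenario 2: add 5 Spot task nodes; the node count roughly doubles,
# so (idealized) the runtime halves to 7 hours.
scenario2 = 4 * 7 * ON_DEMAND + 5 * 7 * SPOT  # $12.60 + $7.88 = $20.48

time_saved = 1 - 7 / 14                       # 50% faster
cost_saved = 1 - scenario2 / scenario1        # ~19% cheaper
```

And if the Spot capacity is reclaimed mid-run, the worst case degrades back to scenario 1, which is exactly what you had planned to pay anyway.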
I guess this is a great time to talk about what happens in case of a failure. If the master node goes down, your job flow will be terminated and you'll have to rerun your job. Amazon Elastic MapReduce currently does not support automatic failover of the master node or master node state recovery. In case of master node failure, the AWS Management Console displays a "The master node was terminated" message, which is an indicator for you to start a new job flow. Customers can instrument checkpointing in their job flows to save intermediate data (data created in the middle of a job flow that has not yet been reduced) to Amazon S3. This allows resuming the job flow from the last checkpoint in case of failure. Amazon Elastic MapReduce is fault tolerant for slave failures and continues job execution if a slave node goes down. The service also monitors your job flow execution, retrying failed tasks, shutting down problematic instances, and provisioning new nodes to replace those that fail. AWS EMR supports NameNode redundancy using MapR, so if you want to try MapR, please go ahead.
There are two types of logs that store information about your job flow: step-level logs generated by Amazon Elastic MapReduce (Amazon EMR), and Hadoop job logs generated by Apache Hadoop. You need to examine both log types to have complete information about your job flow. Amazon EMR step-level logs contain information about the job flow and the results of each step. These logs are useful when you are debugging problems you encounter initializing and running the job flow. For example, a step-level log contains status information such as "Streaming Command Failed!". Hadoop logs contain information about Hadoop jobs, tasks, and task attempts. They are the standard log files generated by Apache Hadoop. The following image shows the relationship between Amazon EMR job flow steps and Hadoop jobs, tasks, and task attempts. Both step-level logs and Hadoop logs are generated by default and stored on the master node of the job flow. You can access them while the job flow is running by using SSH to connect to the master node as the Hadoop user. When the job flow ends, the master node is terminated and you will no longer be able to access those logs using SSH. To be able to access the log files of a terminated job flow, you can direct Amazon EMR to copy the step-level and Hadoop log files to an Amazon S3 bucket. If you specify that the log files are to be copied to an Amazon S3 bucket, you have the option to have Amazon EMR create an index over those log files to generate debugging information and reports. This index is stored in Amazon SimpleDB and can be accessed by clicking the Debug button in the Amazon EMR console.
Summarize this slide
Quickly show this slide, take the names, and move on to more examples, as listed on slides 49 to 53.
There is also support for enterprise products such as Informatica, which you have probably heard about. Informatica is the leader in the enterprise data integration space. Their product HParser allows you to use the cloud to do ETL operations on large data sets. Informatica's HParser is a tool you can use to extract data stored in heterogeneous formats and convert it into a form that is easy to process and analyze. For example, if your company has legacy stock trading information stored in custom-formatted text files, you could use HParser to read the text files and extract the relevant data as XML. In addition to text and XML, HParser can extract and convert data stored in proprietary formats such as PDF and Word files. HParser is designed to run on top of the Hadoop architecture, which means you can distribute operations across many computers in a cluster to efficiently parse vast amounts of data. Amazon Elastic MapReduce (Amazon EMR) makes it easy to run Hadoop in the Amazon Web Services (AWS) cloud. With Amazon EMR you can set up a Hadoop cluster in minutes and automatically terminate the resources when the processing is complete.
The MapR Hadoop distribution adds dependability and ease of use to the strength and flexibility of Hadoop. The Amazon Elastic MapReduce (EMR) service enables you to easily set up, operate, and scale MapR deployments in the cloud, as well as integrate with other AWS services.
NFS
The MapR distribution for Hadoop provides an NFS interface that you can use to mount the cluster. The NFS interface enables you to use standard Linux tools and applications with your cluster directly. You can get data into and out of the cluster with scp, and analyze data with commands like grep, sed, awk, or your own applications or scripts. Amazon EMR with MapR clusters have NFS preconfigured. The cluster is mounted at the /mapr directory on the master node; cluster data and files reside in the directory /mapr/clustername (for example, /mapr/my.cluster.com). To use NFS on your Amazon EMR with MapR cluster, log in to the master node via ssh. After logging in to the cluster, you can use standard file-based applications, including Linux utilities, file browsers, and other applications. The MapR distribution for Hadoop also provides a Hive ODBC driver that conforms to the standard ODBC 3.52 specification.
With the M5 version of the MapR software you get enterprise features like DR across Availability Zones, where you can mirror specific data between clusters. You can also extend an on-premises MapR cluster to the cloud. Last, but by no means least, you can take periodic on-demand snapshots to S3.
So let's look at some of the common design decisions developers have to make before deploying a cluster. The first one: should I use S3, or should I run HDFS? Actually, you can use both; the choice is yours. Remember that with EMR, HDFS data is lost as soon as you shut down the cluster, since HDFS sits on the local ephemeral drives and dies when the cluster is shut down.
Take for example the Netflix Hadoop platform-as-a-service architecture. Netflix collects a huge amount of data, and what you see in the diagram is their Hadoop-as-a-service platform built on AWS. They offer a big data processing engine to different stakeholders within the business. At the base of the service is S3, where everything worth storing is stored, and which is hence the "single version of truth". With its scale, cost, global reach, and durability, S3 is the perfect place for them to store data. From S3 they run multiple EMR clusters. They like to use EMR instead of building their own cluster on EC2 because EMR takes away the undifferentiated heavy lifting. Various tools are used to explore the data, like Hive, Pig, Java programs, and Python code. On top of it they have a job execution and resource management platform called Genie. Genie is connected to enterprise schedulers and to other visualization and web tools for data analysis.
These are the reasons why customers choose S3:
- 11 9s of reliability and durability.
- Version control against failure: with S3 you can enable versioning, which protects the data from logical corruption. Say that on your physical cluster a developer overwrote something and logically corrupted the data; despite HDFS's 3x replication, you cannot recover it. With S3, just roll back.
- Elastic and practically unlimited size.
- You can run multiple clusters: one production cluster, one SLA-driven high performance cluster, many ad-hoc clusters, many dev clusters. Running different types of workflow in parallel guarantees isolation between jobs. Remember, five 10-node clusters cost you the same as one 50-node cluster but provide better isolation. If your data is in HDFS, you need to replicate all the data between each cluster; with S3 there is one single version of truth and you can run as many clusters as you want.
- Continuously resizing clusters on the run can be difficult if all your data is in HDFS (data redistribution can happen); with S3 there is just a single version of truth.
- On failure or a spiky load, spin up a new cluster and start the job flow; there is no need to mirror data across HDFS.
http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
However, if you do want to use HDFS, you can. Remember that if the cluster is shut down, the data is lost, so make sure termination protection is on. All data processing happens locally and not from S3. Alternatively, consider snapshotting to S3 periodically. Use S3DistCp to pull large volumes of data from S3 or push large volumes of data to S3. S3DistCp is a tool available on EMR that can be used to move large amounts of data; remember that it runs on multiple nodes, so each node pulls data in parallel.
You can definitely use HDFS on EMR.
[Slide graphic: What are Spot Instances? Unused capacity in each Availability Zone of a Region, sold at discounts ranging from roughly 50% to 66%.]
[Slide graphic: What is the tradeoff? Spot capacity is unused capacity; it can be reclaimed at any time within an Availability Zone.]