Yahoo! Hack Europe Workshop
Slides from London

Published in: Technology

  • Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this. The first way is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as a remote NFS mount.
  • Despite its name, the SNN (secondary namenode) does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain. The usual course of action in this case is to copy the namenode’s metadata files that are on NFS to the secondary and run it as the new primary.

    Edit log
    When a filesystem client performs a write operation (such as creating or moving a file), it is first recorded in the edit log. The namenode also has an in-memory representation of the filesystem metadata, which it updates after the edit log has been modified. The in-memory metadata is used to serve read requests. The edit log is flushed and synced after every write before a success code is returned to the client. For namenodes that write to multiple directories, the write must be flushed and synced to every copy before returning successfully. This ensures that no operation is lost due to machine failure.

    fsimage
    The fsimage file is a persistent checkpoint of the filesystem metadata. However, it is not updated for every filesystem write operation, since writing out the fsimage file, which can grow to be gigabytes in size, would be very slow. This does not compromise resilience, however, because if the namenode fails, then the latest state of its metadata can be reconstructed by loading the fsimage from disk into memory, then applying each of the operations in the edit log. In fact, this is precisely what the namenode does when it starts up.

    According to Apache, SecondaryNameNode is deprecated. See: (Accessed May 2012).
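The recovery path described in this note (load the fsimage checkpoint, then replay the edit log) can be illustrated with a toy in-memory model. This is only a sketch: the operation tuples, the dictionary layout and the function names are assumptions made for illustration, not the namenode's actual on-disk formats.

```python
import copy

def apply_op(namespace, op):
    """Apply one logged operation to the namespace (path -> list of block ids)."""
    kind = op[0]
    if kind == "create":
        _, path, blocks = op
        namespace[path] = list(blocks)
    elif kind == "rename":
        _, src, dst = op
        namespace[dst] = namespace.pop(src)
    elif kind == "delete":
        _, path = op
        del namespace[path]
    else:
        raise ValueError(f"unknown operation: {kind}")

def recover(fsimage, edit_log):
    """Rebuild the latest metadata: load the checkpoint, then replay every edit."""
    namespace = copy.deepcopy(fsimage)  # leave the checkpoint itself untouched
    for op in edit_log:
        apply_op(namespace, op)
    return namespace

def checkpoint(fsimage, edit_log):
    """The periodic merge: fold the log into a new image and start an empty log."""
    return recover(fsimage, edit_log), []

fsimage = {"/a.txt": ["blk_1"]}
edit_log = [("create", "/b.txt", ["blk_2", "blk_3"]),
            ("rename", "/a.txt", "/c.txt")]
print(recover(fsimage, edit_log))
```

After `checkpoint`, recovering from the new image and the empty log gives the same namespace as replaying the old log, which is why the merge keeps startup time bounded without losing operations.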
– LWSNN configuration options:
    - dfs.namenode.checkpoint.period: value
    - dfs.namenode.checkpoint.size: value
    - dfs.http.address: value; point to the namenode's port 50070; gets the fsimage and edit log
    - dfs.namenode.checkpoint.dir: value; else defaults to /tmp
  • Data Disk Failure, Heartbeats and Re-Replication
    Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.

    Cluster Rebalancing
    The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.

    Data Integrity
    It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.

    Metadata Disk Failure
    The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.

    The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.
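The Data Integrity behaviour above (verify each fetched block against its stored checksum, and fall back to another replica on mismatch) might look roughly like the following. This is a minimal sketch: CRC32 via `zlib` stands in for whatever checksum the client really uses, and the `(datanode, block)` pairs and the function names are hypothetical.

```python
import zlib

def checksum(block: bytes) -> int:
    """Checksum computed when the file is created and stored alongside it."""
    return zlib.crc32(block)

def read_verified(replicas, expected_checksum):
    """Try each (datanode, block) replica in turn; return the first that verifies."""
    for datanode, block in replicas:
        if checksum(block) == expected_checksum:
            return datanode, block
        # Mismatch: this copy is corrupt, so move on to the next replica.
    raise IOError("no replica passed checksum verification")

good = b"block contents"
expected = checksum(good)
replicas = [("dn1", b"block contentz"),  # corrupted copy
            ("dn2", good)]
print(read_verified(replicas, expected)[0])
```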
  • Use the put command
    - If bar is a directory then foo will be placed in the directory bar
    - If no file or directory exists of name bar, file foo will be named
    - If bar is already a file in HDFS then an error is returned
    - If subName does not exist then the file will be created in the directory
    - Note: -copyFromLocal can be used instead of -put
  • Can display using cat
    Copy a file from HDFS to the local file system with the -get command:
        $ bin/hadoop fs -get foo LocalFoo
    Other commands:
        $ hadoop fs -rmr directory|file    Recursively
  • A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system (HDFS). The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks. The MapReduce framework and HDFS typically run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

    M/R supports multiple nodes processing contiguous data sets in parallel. Often, one node will “Map”, while another “Reduces”. In between is the “shuffle”. We’ll cover all these in more detail.

    Basic “phases”, or steps, are displayed here. Other custom phases may be added.

    In MapReduce, data is passed around as key/value pairs.

    QUESTION: is the “Reduce” phase required? NO

    INSTRUCTOR SUGGESTION: During the second day start – have one or more students re-draw this on the white board to re-iterate its importance.
  • MapReduce consists of Java classes that can process big data in parallel processes. It receives three inputs (a source collection, a Map function and a Reduce function) and returns a new result data collection. The algorithm is composed of a few steps. The first step executes the map() function on each item within the source collection; map's responsibility is to convert an item from the source collection to zero or many instances of Key/Value pairs. In the next step, an algorithm sorts all Key/Value pairs to create new object instances where all values are grouped by Key. reduce() is then called for each Key with its grouped values. It returns a new instance to be included in the result collection.
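The three steps in this note (map each item, sort and group by key, reduce each group) can be imitated in a few lines of single-process code. This is an illustrative sketch only: `map_reduce`, `word_map` and `sum_reduce` are made-up names, and nothing here uses or mirrors the Hadoop Java API.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(source, map_fn, reduce_fn):
    """Run map, sort/group and reduce over an in-memory collection."""
    # Map: every source item yields zero or more (key, value) pairs.
    pairs = [kv for item in source for kv in map_fn(item)]
    # Sort so that equal keys become adjacent, then group values by key.
    pairs.sort(key=itemgetter(0))
    result = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        # Reduce: one call per key, receiving all of that key's values.
        result.append(reduce_fn(key, [value for _, value in group]))
    return result

def word_map(line):
    return [(word, 1) for word in line.split()]

def sum_reduce(key, values):
    return (key, sum(values))

counts = map_reduce(["pigs eat anything", "pigs fly"], word_map, sum_reduce)
print(counts)  # [('anything', 1), ('eat', 1), ('fly', 1), ('pigs', 2)]
```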
  • A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.

    An InputFormat determines how the data is split across map tasks:
    - TextInputFormat is the default InputFormat
    - It divides the input data into an InputSplit[] array
    - Each InputSplit is given an individual map
    - Each InputSplit is associated with a destination node that should be used to read the data, so Hadoop can try to assign the computation to the local data location

    The RecordReader class:
    - Reads input data and produces key-value pairs to be passed into the Mapper
    - May control how data is decompressed
    - Converts data to Java types that MapReduce can work with

    MapReduce relies on the InputFormat to:
    - Validate the input-specification of the job.
    - Split up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
    - Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.

    The default behavior of InputFormat is to split the input into logical InputSplit instances based on the total size, in bytes, of the input files. [However, HDFS blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via the config parameter mapred.min.split.size.] * validate this

    Logical splits based on input size are insufficient for many applications since record boundaries must be respected. In such cases, the application should implement a RecordReader, which is responsible for respecting record boundaries and presents a record-oriented view of the logical InputSplit to the individual task.

    If the default TextInputFormat is used for a given job, the framework detects compressed input files (i.e., with the .gz extension) and automatically decompresses them using an appropriate CompressionCodec. Note that, depending on the codec used, compressed files cannot be split and each compressed file is processed in its entirety by a single mapper. gzip cannot be split; bzip and lzo can.
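To see why a byte-based split needs a RecordReader that respects record boundaries, here is a small sketch of a line-oriented reader over byte ranges. The ownership rule it uses (a line belongs to the split that contains its first byte) and the name `read_split` are assumptions chosen for illustration; this is not a port of Hadoop's reader.

```python
def read_split(data: bytes, start: int, length: int):
    """Return the lines whose first byte lies in [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Skip forward to the first line that begins at or after `start`;
        # the reader of an earlier split owns the line we may have landed in.
        newline = data.find(b"\n", start - 1)
        pos = len(data) if newline == -1 else newline + 1
    lines = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:                # final line without a trailing newline
            lines.append(data[pos:])
            break
        lines.append(data[pos:newline])  # may read past `end` to finish the record
        pos = newline + 1
    return lines

data = b"alpha\nbeta\ngamma\n"
print(read_split(data, 0, 6), read_split(data, 6, 6), read_split(data, 12, 6))
```

Even though the 6-byte splits cut through "gamma", every line is returned exactly once and never in pieces.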
  • Mapping is often used for typical “ETL” processes, such as filtering, transformation, and categorization.

    Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes.

    The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.

    The OutputCollector is provided by the framework to collect data output by the Mapper or the Reducer.

    The Reporter class is a facility for MapReduce applications to report progress, set application-level status messages and update Counters. Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive. In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial since the framework might assume that the task has timed out and kill that task. Another way to avoid this is to set the configuration parameter mapred.task.timeout to a high-enough value (or even set it to zero for no time-outs). Applications can also update Counters using the Reporter.

    Parallel Map Processes
    The number of maps is usually driven by the number of inputs, or the total number of blocks in the input files. The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute. If you expect 10TB of input data and have a block size of 128MB, you'll end up with 82,000 maps, unless setNumMapTasks(int) (which only provides a hint to the framework) is used to set it even higher.
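The "82,000 maps" figure follows from dividing the input size by the block size. A quick check, assuming binary units (1 TB = 1024^4 bytes) and one map per block; `num_map_tasks` is a hypothetical helper name:

```python
MB = 1024 ** 2
TB = 1024 ** 4

def num_map_tasks(input_bytes: int, block_bytes: int) -> int:
    """One map per block: the ceiling of input size over block size."""
    return -(-input_bytes // block_bytes)  # integer ceiling division

print(num_map_tasks(10 * TB, 128 * MB))  # 81920, i.e. roughly 82,000
```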
  • The Partitioner controls the balancing of the keys of intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function on the key (or a portion of it).

    All Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of Reduce tasks for the job. The Partitioner is like a load balancer that controls which one of the Reduce tasks the intermediate key (and hence the record) is sent to for reduction. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.

    All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer to determine final output. Users can control that grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).

    MapReduce comes bundled with a library of generally useful Mappers, Reducers, and Partitioner classes.
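Hash partitioning as described above can be sketched as "hash of the key, made non-negative, modulo the number of reducers". In this illustration CRC32 is only a deterministic stand-in for the key's hash function, and `partition` and `shuffle` are hypothetical names.

```python
import zlib
from collections import defaultdict

def partition(key: str, num_reducers: int) -> int:
    """Derive the partition from a hash of the key (CRC32 used as a stand-in)."""
    return (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % num_reducers

def shuffle(pairs, num_reducers):
    """Route every (key, value) pair to its reducer's bucket, sorted by key."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return {p: sorted(kvs) for p, kvs in buckets.items()}

pairs = [("pigs", 1), ("eat", 1), ("pigs", 1), ("anything", 1)]
buckets = shuffle(pairs, 3)
print(buckets)
```

Because the partition depends only on the key, both ("pigs", 1) records land in the same bucket, which is what lets a single reduce call see all values for a key.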
  • MapReduce partitions data among individual Reducers, usually running on separate nodes. The Shuffle phase is determined by how a Partitioner (embedded or custom) partitions key/value pairs to Reducers. The shuffle is where refinements and improvements are continually being made. In many ways, the shuffle is the heart of MapReduce and is where the “magic” happens.
  • The Sort phase guarantees that the input to every Reducer is sorted by key.
  • A Reducer extends MapReduceBase. It implements the Reducer interface.
    - Receives output from multiple mappers
    - The reduce() method is called for each <key, (list of values)> pair in the grouped inputs
    - It sorts data by key of key/value pair
    - Typically iterates through the values associated with a key

    It is perfectly legal to set the number of reduce-tasks to zero if no reduction is desired. In this case the outputs of the map-tasks go directly to HDFS, to the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to HDFS.

    Reducer has 3 primary phases: Shuffle, Sort and Reduce.

    Shuffle
    Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.

    Sort
    The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.

    Secondary Sort
    If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values.

    Reduce
    In this phase the reduce() method is called for each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to HDFS via OutputCollector.collect(WritableComparable, Writable). Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive.

    The output of the Reducer is not sorted.
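The secondary-sort idea in this note (sort on one rule, group on a coarser rule) can be mimicked outside Hadoop: order the records by the full (key, value) pair, then group by the key alone, so each group's values arrive already sorted. A rough sketch with made-up data and a hypothetical function name:

```python
from itertools import groupby

def secondary_sort(pairs):
    """Sort by (key, value), then group by key only."""
    # Plays the role of the sort comparator: order on the whole composite.
    ordered = sorted(pairs, key=lambda kv: (kv[0], kv[1]))
    # Plays the role of the grouping comparator: only the key decides the group.
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=lambda kv: kv[0])]

readings = [("2012", 35), ("2011", 20), ("2012", 10), ("2011", 42)]
print(secondary_sort(readings))  # [('2011', [20, 42]), ('2012', [10, 35])]
```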
  • OutputFormat has two responsibilities: determining where and how data is written. It determines where by examining the JobConf and checking to make sure it is a legitimate destination. The 'how' is handled by the getRecordWriter() function.

    From Hadoop's point of view, these classes come together when the MapReduce job finishes and each reducer has produced a stream of key-value pairs. Hadoop calls checkOutputSpecs with the job's configuration. If the function runs without throwing an Exception, it moves on and calls getRecordWriter, which returns an object which can write that stream of data. When all of the pairs have been written, Hadoop calls the close function on the writer, committing that data to HDFS and finishing the responsibility of that reducer.

    MapReduce relies on the OutputFormat of the job to:
    - Validate the output-specification of the job; for example, check that the output directory doesn't already exist
    - Provide the RecordWriter implementation used to write the output files of the job. Output files are stored in a FileSystem (usually HDFS)

    TextOutputFormat is the default OutputFormat.

    OutputFormats specify how to serialize data by providing an implementation of RecordWriter. RecordWriter classes handle the job of taking an individual key-value pair and writing it to the location prepared by the OutputFormat. There are two main functions a RecordWriter implements: 'write' and 'close'. The 'write' function takes key-values from the MapReduce job and writes the bytes to disk. The default RecordWriter is LineRecordWriter, part of the TextOutputFormat mentioned earlier. It writes:
    - The key's bytes (returned by the getBytes() function)
    - a tab character delimiter
    - the value's bytes (again, produced by getBytes())
    - a newline character.
    The 'close' function closes the Hadoop data stream to the output file.

    We've talked about the format of output data, but where is it stored? Again, you've probably seen the output of a job stored in many 'part' files under the output directory like so:

        |-- output-directory
        |   |-- part-00000
        |   |-- part-00001
        |   |-- part-00002
        |   |-- part-00003
        |   |-- part-00004
        |   '-- part-00005

    from
  • Mostly "boilerplate" with fill in the blank for datatypes
    - Hadoop datatypes have well defined methods to get/put values into them
    - Standard pattern to declare your output objects outside map() loop scope
    - Standard pattern to 1) get Java datatypes 2) do your logic 3) put into Hadoop datatypes
    - This map() method is called in a "loop" with successive key, value pairs
    - Each time through, you typically write key, value pairs to the reduce phase
    - The key, value pairs go to the Hadoop Framework where the key is hashed to choose a reducer
  • In the Reducer:
    - Reduce code is copied to multiple nodes in the cluster
    - The copies are identical
    - All key,value pairs with identical keys are placed into a pair of (key, Iterator<values>) when passed to the reducer
    - Each invocation of the reduce() method is passed all values associated with a particular key
    - The keys are guaranteed to arrive in sorted order
    - Each incoming value is a numeric count of words in a particular line of text
    - The multiple values represent multiple lines of text processed by multiple map() methods
    - The values are in an Iterator, and are summed in a loop
    - The sum, with its associated key, is sent to a disk file via the Hadoop Framework
    - There are typically many copies of your reduce code
    - Incoming key,value pairs are of the datatypes that map emits
    - Output key,value pairs go to the HDFS, which writes them to disk
    - The Hadoop Framework constructs the Iterator from map values that are sent to this Reducer
  • An inverted index is a data structure that stores a mapping from content to its locations in a database file, or in a document or a set of documents. It can provide what to display for a search. NOTE: need to deal with OutputCollector() here.
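As a concrete picture of the "content to locations" mapping, the sketch below builds a word-to-documents index in memory. The document ids and the function name are invented for the example, and splitting on whitespace is a deliberate simplification.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lower-cased word to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)  # the "map" side emits (word, doc_id)
    # The "reduce" side collapses each word's ids into one posting list.
    return {word: sorted(ids) for word, ids in index.items()}

docs = {"doc1": "pigs eat anything", "doc2": "Pigs fly"}
index = build_inverted_index(docs)
print(index["pigs"])  # ['doc1', 'doc2']
```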
  • - First the files are partitioned into parts that will be distributed across the cluster nodes for processing
    - Each part is parsed into pairs of Key (sortable object) - Value (object), which will be the input parameters for the tasks implementing the Map function
    - These user-defined tasks (map) will read the value object, do something with it, and then build a new key-value list that will be stored by the framework in intermediate files
    - Once all the map tasks are finished, the whole data to process has been completely read and reordered into this MapReduce model of key-value pairs
    - These intermediate key-value results are combined, resulting in new key-value pairs that will be the input for the next reduce tasks
    - These user-defined tasks (reduce) will read the value object, do something with it, and then produce the third and last list of key-value pairs, which the framework will combine and regroup into a final result

    from
  • The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.
    JobTracker also:
    - Checks job specifications
    - Calculates the InputSplit[]’s for the job
    - Initiates the DistributedCache of the job, if there is one
    - Deploys the runtime jar and configurations to the system directory on HDFS
    - Submits the job to the JobTracker
    - Monitors the job
  • - TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from the JobTracker
    - TaskTracker is configured with a set of slots. Slots define the upper bound on the number of tasks that TaskTracker can accept
    - When JobTracker attempts to schedule a task, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack
    - TaskTracker spawns separate JVM processes to do the actual work; this is to ensure that process failure does not take TaskTracker down
    - TaskTracker monitors spawned processes, capturing output and exit codes
    - When the process finishes, TaskTracker notifies JobTracker as to success or failure
    - TaskTracker also regularly sends “I’m alive” heartbeat messages to JobTracker
    - The message included in the heartbeat also tells JobTracker the number of available slots, so JobTracker knows which nodes are available for new tasks
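The slot-selection order described above (same node as the data first, then same rack) could be sketched as follows. The dictionaries, the `pick_tracker` name and the final "any free slot" fallback are assumptions added for illustration; the note itself only describes the first two preferences.

```python
def pick_tracker(free_slots, data_hosts, rack_of):
    """Choose a host for a task: node-local, then rack-local, then any free slot.

    free_slots: host -> number of empty slots (as reported in heartbeats)
    data_hosts: hosts holding a replica of the task's input
    rack_of:    host -> rack id
    """
    candidates = [host for host, slots in free_slots.items() if slots > 0]
    for host in candidates:
        if host in data_hosts:
            return host, "node-local"
    data_racks = {rack_of[host] for host in data_hosts}
    for host in candidates:
        if rack_of[host] in data_racks:
            return host, "rack-local"
    # Assumed fallback: take any free slot rather than leave the task waiting.
    return (candidates[0], "off-rack") if candidates else (None, None)

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
print(pick_tracker({"n1": 0, "n2": 2, "n3": 1}, {"n1"}, rack_of))  # ('n2', 'rack-local')
```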
  • You can chain MapReduce jobs to accomplish complex tasks which cannot be completed with a single job. This is fairly easy since the output of the job typically goes to HDFS. In this way the output can be used as the input for the next sequential job.
    Remember that ensuring jobs are complete (success/failure) is a client responsibility.
    Job-control options are:
    - runJob(): Submits the job and returns only after the job has completed.
    - submitJob(): Only submits the job; then poll the returned handle to the RunningJob to query status and make scheduling decisions.
    - JobConf.setJobEndNotificationURI(): Sets up a notification upon job completion, thus avoiding polling.
  • -D property=value
        Allows for the setting of values at the command line to override default site properties and properties set with the -conf option (e.g. numReduceTasks)
    -conf
        Adds an XML formatted property file to your job
    -fs uri
        Shortcut for – files from the local client filesystem to the local filesystem on the cluster nodes. Distributed Cache is used.
  • GenericOptionsParser
    Hadoop comes with a few helper classes for making it easier to run jobs from the command line. GenericOptionsParser is a class that interprets common Hadoop command-line options and sets them on a Configuration object for your application to use as desired. You don’t usually use GenericOptionsParser directly, as it’s more convenient to implement the Tool interface and run your application with the ToolRunner, which uses GenericOptionsParser internally.

    GenericOptionsParser also allows you to set individual properties. For example:

        $ hadoop ConfigurationPrinter -D color=yellow | grep color
        color=yellow

    The -D option is used to set the configuration property with key color to the value yellow. Options specified with -D take priority over properties from the configuration files. This is very useful: you can put defaults into configuration files and then override them with the -D option as needed. A common example of this is setting the number of reducers for a MapReduce job via -D mapred.reduce.tasks=n. This will override the number of reducers set on the cluster or set in any client-side configuration files.
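The precedence rule (values given with -D win over values from configuration files) can be shown with a tiny stand-alone parser. This is not GenericOptionsParser: `parse_generic_options` is a hypothetical helper that handles only the "-D key=value" form used in the example above.

```python
def parse_generic_options(argv, file_defaults):
    """Overlay '-D key=value' options on defaults; return (conf, remaining args)."""
    conf = dict(file_defaults)  # start from what the config files say
    remaining = []
    i = 0
    while i < len(argv):
        if argv[i] == "-D" and i + 1 < len(argv):
            key, _, value = argv[i + 1].partition("=")
            conf[key] = value   # the command line overrides the file value
            i += 2
        else:
            remaining.append(argv[i])
            i += 1
    return conf, remaining

conf, rest = parse_generic_options(
    ["-D", "color=yellow", "input", "output"],
    {"color": "blue", "mapred.reduce.tasks": "1"},
)
print(conf["color"], rest)  # yellow ['input', 'output']
```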
  • JobConf represents a MapReduce job configuration
    - It sets up properties associated with a particular Job submission
    - It reads a Configuration() instance to get some of those properties
    - It assigns a default name to the Job
    - It specifies key & value datatypes output from mapper() and datatypes output from reducer()

    JobConf is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution. JobConf is typically used to specify the Mapper, combiner (if any), Partitioner, Reducer, InputFormat, OutputFormat and OutputCommitter implementations.

    It also indicates the set of input files (setInputPaths(JobConf, Path...) / addInputPath(JobConf, Path)) and (setInputPaths(JobConf, String) / addInputPaths(JobConf, String)) and where the output files should be written (setOutputPath(Path)).

    JobConf is optionally used to specify other advanced facets of the job such as:
    - the Comparator to be used
    - files to be put in the DistributedCache
    - whether intermediate and/or job outputs are to be compressed (and how)
    - debugging via user-provided scripts (setMapDebugScript(String) / setReduceDebugScript(String))
    - whether job tasks can be executed in a speculative manner (setMapSpeculativeExecution(boolean) / setReduceSpeculativeExecution(boolean))
    - maximum number of attempts per task (setMaxMapAttempts(int) / setMaxReduceAttempts(int))
    - percentage of task failures which can be tolerated by the job (setMaxMapTaskFailuresPercent(int) / setMaxReduceTaskFailuresPercent(int))

    Of course, users can use set(String, String) / get(String, String) to set/get arbitrary parameters needed by applications. NOTE: for performance, use the DistributedCache for large amounts of (read-only) data.
  • See more at
  • Reference Task Tracker slides

    One of the primary reasons to use Hadoop to run your jobs is due to its high degree of fault tolerance. Even when running jobs on a large cluster where individual nodes or network components may experience high rates of failure, Hadoop can guide jobs toward a successful completion.

    The primary way that Hadoop achieves fault tolerance is through restarting tasks. Individual task nodes (TaskTrackers) are in constant communication with the head node of the system, called the JobTracker. If a TaskTracker fails to communicate with the JobTracker for a period of time (by default, 1 minute), the JobTracker will assume that the TaskTracker in question has crashed. The JobTracker knows which map and reduce tasks were assigned to each TaskTracker.

    If the job is still in the mapping phase, then other TaskTrackers will be asked to re-execute all map tasks previously run by the failed TaskTracker. If the job is in the reducing phase, then other TaskTrackers will re-execute all reduce tasks that were in progress on the failed TaskTracker.

    Reduce tasks, once completed, have been written back to HDFS. Thus, if a TaskTracker has already completed two out of three reduce tasks assigned to it, only the third task must be executed elsewhere. Map tasks are slightly more complicated: even if a node has completed ten map tasks, the reducers may not have all copied their inputs from the output of those map tasks. If a node has crashed, then its mapper outputs are inaccessible. So any already-completed map tasks must be re-executed to make their results available to the rest of the reducing machines. All of this is handled automatically by the Hadoop platform.

    This fault tolerance underscores the need for program execution to be side-effect free. If Mappers and Reducers had individual identities and communicated with one another or the outside world, then restarting a task would require the other nodes to communicate with the new instances of the map and reduce tasks, and the re-executed tasks would need to reestablish their intermediate state. This process is notoriously complicated and error-prone in the general case. MapReduce simplifies this problem drastically by eliminating task identities or the ability for task partitions to communicate with one another. An individual task sees only its own direct inputs and knows only its own outputs, to make this failure and restart process clean and dependable.

    from
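The re-execution rule in this note treats map and reduce tasks differently: every map task that ran on the failed TaskTracker is redone (its output lived on that node), while only unfinished reduce tasks are redone (finished ones are already in HDFS). A small sketch of that decision, with a made-up task record layout and function name:

```python
def tasks_to_rerun(tasks, failed_tracker):
    """Return ids of tasks that must be re-executed after a TaskTracker is lost."""
    rerun = []
    for task in tasks:
        if task["tracker"] != failed_tracker:
            continue
        if task["type"] == "map":
            # Map output is stored on the lost node, so even completed maps rerun.
            rerun.append(task["id"])
        elif task["state"] != "completed":
            # Completed reduce output is already in HDFS; only unfinished ones rerun.
            rerun.append(task["id"])
    return rerun

tasks = [
    {"id": "m1", "type": "map", "state": "completed", "tracker": "tt1"},
    {"id": "m2", "type": "map", "state": "running", "tracker": "tt2"},
    {"id": "r1", "type": "reduce", "state": "completed", "tracker": "tt1"},
    {"id": "r2", "type": "reduce", "state": "running", "tracker": "tt1"},
]
print(tasks_to_rerun(tasks, "tt1"))  # ['m1', 'r2']
```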
  • - The default number of reducers is config dependent
    - The number of output files = number of reducers
    - One reducer implies one output file
    - Ten reducers implies ten output files
    - In your code you can specify the number of reducers
    - It is possible to specify zero reducers
    - With a “map only job” the number of output files = number of mappers
    - The number of mappers = the number of input blocks
    - Your code can implement InputSplit[] to set the number of mappers
  • - Pig Latin is a data flow language – a sequence of steps where each step is an operation or command
    - During execution each statement is processed by the Pig interpreter and checked for syntax errors
    - If a statement is valid, it gets added to a logical plan built by the interpreter
    - However, the step does not actually execute until the entire script is processed, unless the step is a DUMP or STORE command, in which case the logical plan is compiled into a physical plan and executed
  • After the LOAD statement, point out that nothing is actually loaded! The LOAD command only defines a relation. The FILTER and GROUP commands are fairly self-explanatory. The HDFS is not even touched until the STORE command executes, at which point a MapReduce application is built from the Pig Latin statements shown here.
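The deferred execution described here (LOAD and FILTER only extend a plan; work happens at STORE or DUMP) can be imitated with a small lazy wrapper. `LazyRelation` and `load_clicks` are invented for this sketch and have nothing to do with Pig's real planner.

```python
class LazyRelation:
    """Records operations in a plan; nothing runs until dump() is called."""

    def __init__(self, loader, plan=()):
        self._loader = loader
        self._plan = tuple(plan)

    def filter(self, predicate):
        # Only extends the plan and returns a new relation; no data is read.
        return LazyRelation(self._loader, self._plan + (("filter", predicate),))

    def dump(self):
        # Execution point: load the data, then apply each planned step in order.
        rows = self._loader()
        for op, fn in self._plan:
            if op == "filter":
                rows = [row for row in rows if fn(row)]
        return rows

load_calls = []

def load_clicks():
    load_calls.append("read")  # lets us observe when the data is touched
    return [("alice", 3), ("bob", 7)]

a = LazyRelation(load_clicks)
b = a.filter(lambda row: row[1] > 5)
print(load_calls)  # [] because nothing has been loaded yet
print(b.dump())    # [('bob', 7)]
```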
  • Grunt
    - Pig’s interactive shell
    - Enables users to enter Pig Latin interactively
    - Grunt will do basic syntax and semantic checking as you enter each line
    - Provides a shell for users to interact with HDFS
    - To enter Grunt and use the local file system instead:
          $ pig -x local

    In the example script:
    - A is called a relation or alias; immutable, recreated if reused
    - myfile is read from your home directory in HDFS
    - A has no schema associated with it (bytearray)
    - LOAD uses the default function PigStorage() to load data; assumes TAB delimited in HDFS
    - The entire file 'myfile' will be read ("pigs eat anything")
    - The elements in A can be referenced by position if no schema is associated: $0, $1, $2, ...
  • Pig return codes: (type in page 18 of Pig)
  • The complex types can contain data of any type
  • The DESCRIBE output is:
        describe employees
        employees: {name: chararray,age: bytearray,zip: int,salary: bytearray}
  • Pig includes the concept of a data element being null. Data of any type can be null. It is important to understand that in Pig the concept of null is the same as in SQL, which is completely different from the concept of null in C, Java, Python, etc. In Pig a null data element means the value is unknown. This might be because the data is missing, an error occurred in processing it, etc. In most procedural languages, a data value is said to be null when it is unset or does not point to a valid address or object. This difference in the concept of null is important and affects the way Pig treats null data, especially when operating on it.
    (from Programming Pig, Alan Gates 2011)
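One consequence of "null means unknown" is that a comparison involving null is itself unknown, so a filter keeps the row neither for a condition nor for its negation. The sketch below models that SQL-style behaviour with None as the null value; the helper names and sample rows are hypothetical.

```python
def gt(a, b):
    """SQL-style 'greater than': unknown (None) if either operand is null."""
    if a is None or b is None:
        return None
    return a > b

def negate(result):
    """NOT of an unknown result stays unknown."""
    return None if result is None else not result

def filter_rows(rows, predicate):
    """Keep only rows for which the predicate is definitely true."""
    return [row for row in rows if predicate(row) is True]

people = [("alice", 30), ("bob", None), ("carol", 12)]
adults = filter_rows(people, lambda r: gt(r[1], 18))
minors = filter_rows(people, lambda r: negate(gt(r[1], 18)))
print(adults, minors)  # bob appears in neither list
```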
  • twokey.pig collects all records with the same value for the provided key together into a bag
    - Within a single relation, group together tuples with the same group key
    - Can use keywords BY and ALL with GROUP
    - Can optionally be passed as an aggregate function (count.pig)
    - Can group on multiple keys if they are surrounded by parentheses
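What GROUP produces (one bag of tuples per distinct key, with a tuple-valued key when several fields are used) can be sketched as below. The function name, the use of field positions and the sample relation are assumptions for illustration, not Pig itself.

```python
from collections import defaultdict

def group_by(relation, *positions):
    """Collect tuples into bags keyed by one field, or by a tuple of fields."""
    bags = defaultdict(list)
    for tup in relation:
        if len(positions) == 1:
            key = tup[positions[0]]
        else:
            key = tuple(tup[p] for p in positions)  # composite group key
        bags[key].append(tup)
    return sorted(bags.items())

visits = [("alice", "nyc", 3), ("bob", "sf", 5),
          ("alice", "nyc", 7), ("alice", "sf", 1)]

by_user = group_by(visits, 0)
by_user_city = group_by(visits, 0, 1)
# An aggregate such as a count is then applied to each bag.
counts = {key: len(bag) for key, bag in by_user_city}
print(counts)
```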
  • Web site click log processing; user activity; daily updates; business intelligence; advertising placement; data mining (key words)
  • Central repository of Hive metadata regarding tables, partitions and databases. Runs in the same JVM as the Hive service and services metadata requests from clients. Hive stores metadata in a standard relational database. The default database is Derby, an open source, in-memory, embedded SQL database, which is run on the same machine as Hive. It is a single-user database, so you need to move to MySQL to support multiple users. The metastore database is typically called metastore_db. Hive also supports remote metastores, which are metastores running in separate processes to the Hive service. Really just another way to submit queries using Hive: an alternative to the Command Line Interface that supports clients using different languages. Exposes a Thrift service and provides APIs that can be used by other clients (such as JDBC drivers) to talk to Hive. Hive clients: the Thrift client (supports C++, Java, PHP, Python and Ruby), the JDBC driver and the ODBC driver. Classes to serialize and deserialize data: SerDe is a short name for Serializer and Deserializer. Serializing means converting data into a format for storage; deserializing means extracting a data structure from a series of bytes. Hive uses SerDes to read from and write to tables. Framework libraries allow users to develop serializers and deserializers for their own data formats, and Hive contains some built-in serialization/deserialization families.
  • You have to use the ANSI standard JOIN syntax (i.e., no WHERE clause in a JOIN; the join condition goes in the ON clause)
  • Table formats: the default table format, without using ROW FORMAT or STORED AS, is delimited text, one row per line. Control-A is the default field delimiter within a row. The default collection delimiter for an ARRAY, STRUCT or MAP is the Control-B character. The default map key delimiter is Control-C.
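To make the default delimiters concrete, here is a minimal Python sketch that splits one line of default-format delimited text. It is not Hive code: `parse_row`, the `kinds` list and the sample line are hypothetical, and the sketch assumes Control-A (`\x01`) separates fields, Control-B (`\x02`) separates collection items, and Control-C (`\x03`) separates a map key from its value.

```python
FIELD_DELIM = "\x01"       # Control-A: between fields of a row
COLLECTION_DELIM = "\x02"  # Control-B: between items of an ARRAY, STRUCT or MAP
MAP_KEY_DELIM = "\x03"     # Control-C: between a map key and its value


def parse_row(line, kinds):
    """Split one text row into Python values.

    kinds is a hypothetical per-column hint: "scalar", "array" or "map".
    """
    fields = line.rstrip("\n").split(FIELD_DELIM)
    values = []
    for raw, kind in zip(fields, kinds):
        if kind == "array":
            values.append(raw.split(COLLECTION_DELIM))
        elif kind == "map":
            pairs = (item.split(MAP_KEY_DELIM, 1) for item in raw.split(COLLECTION_DELIM))
            values.append({key: value for key, value in pairs})
        else:
            values.append(raw)
    return values


# Made-up row: a name, an array of subjects, and a map of properties.
line = "steve\x01math\x02physics\x01city\x03london\x02team\x03a\n"
parsed = parse_row(line, ["scalar", "array", "map"])
```

Control characters are used as defaults because they rarely occur in real text, so fields seldom need escaping.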
  • HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools (Pig, MapReduce, and Hive) to more easily read and write data on the grid. HCatalog's table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored: RCFile format, text files, or sequence files. HCatalog supports reading and writing files in any format for which a SerDe can be written. By default, HCatalog supports RCFile, CSV, JSON, and sequence file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe. SerDe is short for Serializer/Deserializer. A SerDe allows Hive to read in data from a table, and write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats. For JSON files, Amazon has provided a JSON SerDe (s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar). Hive uses the SerDe interface for IO. The interface handles both serialization and deserialization and also interpreting the results of serialization as individual fields for processing.
  • Slide 40 - all of Ecosystem – use these early. DO THIS VERY EARLY!
  • Transcript

    • 1. © Hortonworks Inc. 2012 Hadoop Workshop. Chris Harris. Twitter: cj_harris5. E-mail:
    • 2. © Hortonworks Inc. 2013. Enhancing the Core of Apache Hadoop. Hadoop Core: HDFS, YARN (in 2.0), MapReduce. Platform Services: Enterprise Readiness. Deliver high-scale storage & processing with enterprise-ready platform services. Unique Focus Areas: • Bigger, faster, more flexible: continued focus on speed & scale and enabling near-real-time apps • Tested & certified at scale: run ~1300 system tests on large Yahoo clusters for every release • Enterprise-ready services: high availability, disaster recovery, snapshots, security, …
    • 3. Data Services for Full Data Lifecycle. Hadoop Core: Distributed Storage & Processing. Platform Services: Enterprise Readiness. Data Services: WebHDFS, HCatalog, Hive, Pig, HBase, Sqoop, Flume. Provide data services to store, process & access data in many ways. Unique Focus Areas: • Apache HCatalog: metadata services for consistent table access to Hadoop data • Apache Hive: explore & process Hadoop data via SQL & ODBC-compliant BI tools • Apache HBase: NoSQL database for Hadoop • WebHDFS: access Hadoop files via scalable REST API • Talend Open Studio for Big Data: graphical data integration tools
    • 4. Operational Services for Ease of Use. Operational Services: Oozie, Ambari. Data Services: Store, Process and Access Data. Hadoop Core: Distributed Storage & Processing. Platform Services: Enterprise Readiness. Include complete operational services for productive operations & management. Unique Focus Area: • Apache Ambari: provision, manage & monitor a cluster; complete REST APIs to integrate with existing operational tools; job & task visualizer to diagnose issues
    • 5. Useful Links • Hortonworks Sandbox: – • Sample Data: – – – Other speakers…
    • 6. HDFS Architecture
    • 7. HDFS Architecture [diagram: a NameNode (namespace, block map, block management) holding the namespace metadata image (checkpoint) and edit journal log; a Secondary NameNode that checkpoints the image and edit journal log (backup); three DataNodes holding replicated blocks (BL1, BL2, BL3, BL6, BL7, BL8, BL9)]
    • 8. HDFS Heartbeats [diagram: four DataNode daemons send heartbeats to the NameNode (fsimage, editlog): "I'm datanode X, and I'm OK; I do have some new information for you: the new blocks are …"]
    • 9. Basic HDFS File System Commands. Here are a few (of the almost 30) HDFS commands: • -cat: just like Unix cat, displays file content (uncompressed) • -text: just like cat, but works on compressed files • -chgrp, -chmod, -chown: just like the Unix commands, change group, permissions and owner • -put, -get, -copyFromLocal, -copyToLocal: copy files from the local file system to HDFS and vice versa (two versions) • -ls, -lsr: just like Unix ls, list files/directories • -mv, -moveFromLocal, -moveToLocal: move files • -stat: statistical info for any given file (block size, number of blocks, file type, etc.)
    • 10. Commands Example
        $ hadoop fs -ls /user/brian/
        $ hadoop fs -lsr
        $ hadoop fs -mkdir notes
        $ hadoop fs -put ~/training/commands.txt notes
        $ hadoop fs -chmod 777 notes/commands.txt
        $ hadoop fs -cat notes/commands.txt | more
        $ hadoop fs -rm notes/*.txt
        $ hadoop fs -put filenameSrc filenameDest
        $ hadoop fs -put filename dirName/fileName
        $ hadoop fs -cat foo
        $ hadoop fs -get foo LocalFoo
        $ hadoop fs -rmr directory|file
    • 11. MapReduce
    • 12. A Basic MapReduce Job: map() implemented
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Split the input line into whitespace-separated tokens.
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // Emit (word, 1) for every token.
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    • 13. A Basic MapReduce Job: reduce() implemented
        private final IntWritable totalCount = new IntWritable();
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Sum all the counts emitted for this word.
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            totalCount.set(sum);
            output.collect(key, totalCount);
        }
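The data flow behind the map() and reduce() slides can be followed end to end with a small in-memory sketch. This is plain Python rather than the Hadoop API: `map_fn`, `reduce_fn` and `run_job` are hypothetical names, the input lines are made up, and the "shuffle" is simulated with a dictionary that groups values by key, which is roughly what the framework does between the two phases.

```python
from collections import defaultdict


def map_fn(_offset, line):
    """Emit (word, 1) for every whitespace-separated token, like the map() slide."""
    for word in line.split():
        yield word, 1


def reduce_fn(word, counts):
    """Sum the counts collected for one word, like the reduce() slide."""
    yield word, sum(counts)


def run_job(lines):
    """Simulate map, shuffle (group by key) and reduce in a single process."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            groups[key].append(value)

    result = {}
    for key in sorted(groups):  # reducers see keys in sorted order
        for out_key, out_value in reduce_fn(key, groups[key]):
            result[out_key] = out_value
    return result


counts = run_job(["pigs eat anything", "pigs fly"])
```

On a real cluster the map calls run in parallel across input splits and the grouped values arrive at the reducer as an iterator, which is why the Java reduce() loops with hasNext() instead of receiving a list.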
    • 14. Pig
    • 15. What is Pig? • Pig is an extension of Hadoop that simplifies the ability to query large HDFS datasets • Pig is made up of two main components: – A SQL-like data processing language called Pig Latin – A compiler that compiles and runs Pig Latin scripts • Pig was created at Yahoo! to make it easier to analyze the data in your HDFS without the complexities of writing a traditional MapReduce program • With Pig, you can develop MapReduce jobs with a few lines of Pig Latin
    • 16. Running Pig. A Pig Latin script executes in three modes: 1. MapReduce: the code executes as a MapReduce application on a Hadoop cluster (the default mode) 2. Local: the code executes locally in a single JVM using a local text file (for development purposes) 3. Interactive: Pig commands are entered manually at a command prompt known as the Grunt shell
        $ pig myscript.pig
        $ pig -x local myscript.pig
        $ pig
        grunt>
    • 17. Understanding Pig Execution • Pig Latin is a data flow language • During execution each statement is processed by the Pig interpreter • If a statement is valid, it gets added to a logical plan built by the interpreter • The steps in the logical plan do not actually execute until a DUMP or STORE command
    • 18. A Pig Example • The first three commands are built into a logical plan • The STORE command triggers the logical plan to be built into a physical plan • The physical plan will be executed as one or more MapReduce jobs
        logevents = LOAD 'input/my.log' AS (date, level, code, message);
        severe = FILTER logevents BY (level == 'severe' AND code >= 500);
        grouped = GROUP severe BY code;
        STORE grouped INTO 'output/severeevents';
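The deferred execution in this example (nothing runs until STORE) can be imitated in a few lines of Python. This is a sketch of the idea only, not how Pig is implemented: the `Relation` class, `load` function and `reads` counter are hypothetical, and the log records are made-up sample data with the same four fields as the slide.

```python
class Relation:
    """A step in a logical plan: it holds a recipe for data, not the data itself."""

    def __init__(self, compute):
        self._compute = compute

    def filter(self, predicate):
        return Relation(lambda: [r for r in self._compute() if predicate(r)])

    def group_by(self, key):
        def compute():
            groups = {}
            for record in self._compute():
                groups.setdefault(key(record), []).append(record)
            return sorted(groups.items())  # keys are unique, so this orders by key
        return Relation(compute)

    def store(self):
        """Only here is the chain of recipes actually executed."""
        return self._compute()


reads = []  # records every time the input is really read


def load(records):
    def compute():
        reads.append("read")
        return list(records)
    return Relation(compute)


# Made-up sample records: (date, level, code, message)
log = [
    ("2013-04-20", "severe", 500, "disk failure"),
    ("2013-04-20", "info", 200, "ok"),
    ("2013-04-21", "severe", 503, "timeout"),
    ("2013-04-21", "severe", 500, "disk failure"),
    ("2013-04-21", "severe", 404, "not found"),
]

logevents = load(log)
severe = logevents.filter(lambda r: r[1] == "severe" and r[2] >= 500)
grouped = severe.group_by(lambda r: r[2])
reads_before_store = len(reads)  # still zero: only a plan has been built

result = grouped.store()         # the input is read once, here
```

Building the whole plan before running anything is what lets Pig decide how to turn several statements into as few MapReduce jobs as possible.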
    • 19. Hive
    • 20. What is Hive? • Hive is a subproject of the Apache Hadoop project that provides a data warehousing layer built on top of Hadoop • Hive allows you to define a structure for your unstructured big data, simplifying the process of performing analysis and queries by introducing a familiar, SQL-like language called HiveQL • Hive is for data analysts familiar with SQL who need to do ad-hoc queries, summarization and data analysis on their HDFS data
    • 21. Hive is not… • Hive is not a relational database • Hive uses a database to store metadata, but the data that Hive processes is stored in HDFS • Hive is not designed for on-line transaction processing and does not offer real-time queries and row level updates
    • 22. Pig vs. Hive • Pig and Hive work well together • Hive is a good choice: – when you want to query the data – when you need an answer to specific questions – if you are familiar with SQL • Pig is a good choice: – for ETL (Extract -> Transform -> Load) – preparing your data so that it is easier to analyze – when you have a long series of steps to perform • Many businesses use both Pig and Hive together
    • 23. What is a Hive Table? • A Hive table consists of: – Data: typically a file or group of files in HDFS – Schema: in the form of metadata stored in a relational database • Schema and data are separate. – A schema can be defined for existing data – Data can be added or removed independently – Hive can be "pointed" at existing data • You have to define a schema if you have existing data in HDFS that you want to use in Hive
    • 24. HiveQL • Hive's SQL-like language, HiveQL, uses familiar relational database concepts such as tables, rows, columns and schema • Designed to work with structured data • Converts SQL queries into MapReduce jobs • Supports uses such as: – Ad-hoc queries – Summarization – Data Analysis
    • 25. Running Jobs with the Hive Shell • The primary way people interact with Hive: $ hive, which opens the hive> prompt • Can run in the shell in a non-interactive way: $ hive -f myhive.q – Use the -S option to have only the results show
    • 26. Hive Shell - Information • At terminal enter: – $ hive • List all properties and values: – hive> set -v; • List and describe tables: – hive> show tables; – hive> describe <tablename>; – hive> describe extended <tablename>; • List and describe functions: – hive> show functions; – hive> describe function <functionname>;
    • 27. Hive Shell – Querying Data • Selecting Data – hive> SELECT * FROM students; – hive> SELECT * FROM students WHERE gpa > 3.6 SORT BY gpa ASC;
    • 28. HiveQL • HiveQL is similar to other SQL dialects • User does not need to know Map/Reduce • HiveQL is based on the SQL-92 specification • Supports multi-table inserts via your code
    • 29. Table Operations • Defining a table:
        hive> CREATE TABLE mytable (name string, age int)
              ROW FORMAT DELIMITED
              FIELDS TERMINATED BY ','
              STORED AS TEXTFILE;
        • ROW FORMAT is a Hive-specific clause; here it indicates that each row is comma-delimited text • HiveQL statements are terminated with a semicolon (;) • Other table operations: – SHOW TABLES – CREATE TABLE – ALTER TABLE – DROP TABLE
    • 30. SELECT • Simple query example: – SELECT * FROM mytable; • Supports the following: – WHERE clause – ALL and DISTINCT – GROUP BY and HAVING – LIMIT clause (rows returned are chosen at random) – REGEX column specification, for example: SELECT `(ds|hr)?+.+` FROM sales;
    • 31. JOIN - Inner Joins • Inner joins are implemented with ease:
        SELECT * FROM students;
        Steve 2.8
        Raman 3.2
        Mary 3.9
        SELECT * FROM grades;
        2.8 B
        3.2 B+
        3.9 A
        SELECT students.*, grades.* FROM students JOIN grades ON (students.grade = grades.grade)
        Steve 2.8 2.8 B
        Raman 3.2 3.2 B+
        Mary 3.9 3.9 A
    • 32. JOIN – Outer Joins • Allows for the finding of rows with non-matches in the tables being joined • Outer joins can be of three types: – LEFT OUTER JOIN: returns a row for every row in the first table entered – RIGHT OUTER JOIN: returns a row for every row in the second table entered – FULL OUTER JOIN: returns a row for every row from both tables
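The three outer join types can be compared on a tiny in-memory example. This is a Python sketch of the join semantics, not of how Hive executes a join: `outer_join` is a hypothetical helper, `None` stands in for NULL, and the student "Lee" is a made-up row added so that each table has one non-matching key.

```python
def outer_join(left, right, how):
    """Join two lists of (key, value) pairs; how is 'left', 'right' or 'full'.

    Unmatched rows are padded with None, standing in for SQL NULL.
    """
    rows = []
    matched_right = set()
    for left_key, left_value in left:
        hits = [(i, rv) for i, (rk, rv) in enumerate(right) if rk == left_key]
        for i, right_value in hits:
            matched_right.add(i)
            rows.append((left_key, left_value, right_value))
        if not hits and how in ("left", "full"):
            rows.append((left_key, left_value, None))
    if how in ("right", "full"):
        for i, (right_key, right_value) in enumerate(right):
            if i not in matched_right:
                rows.append((right_key, None, right_value))
    return rows


# Keyed by grade point; "Lee" (3.5) has no grade letter and 3.9 has no student.
students = [(2.8, "Steve"), (3.2, "Raman"), (3.5, "Lee")]
grades = [(2.8, "B"), (3.2, "B+"), (3.9, "A")]

left_rows = outer_join(students, grades, "left")
right_rows = outer_join(students, grades, "right")
full_rows = outer_join(students, grades, "full")
```

The left join keeps Lee with a missing grade, the right join keeps the 3.9 grade with a missing student, and the full join keeps both, which is the "rows with non-matches" behaviour the slide describes.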
    • 33. Sorting • ORDER BY: produces a total ordering but sets the number of reducers to 1 • SORT BY: uses multiple reducers and produces a sorted file from each
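The difference between the two sorting clauses is easy to see with a small simulation. This Python sketch is an assumption-laden model, not Hive behaviour in detail: `partition`, `sort_by` and `order_by` are hypothetical names, the GPA values are made up, and rows are dealt to reducers round-robin purely for illustration (how Hive actually distributes rows is not specified here).

```python
def partition(rows, num_reducers):
    """Deal rows out to reducers round-robin (an illustrative assumption)."""
    buckets = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        buckets[i % num_reducers].append(row)
    return buckets


def sort_by(rows, num_reducers):
    """SORT BY: each reducer sorts only its own share, giving one sorted file each."""
    return [sorted(bucket) for bucket in partition(rows, num_reducers)]


def order_by(rows):
    """ORDER BY: a single reducer sees every row, so the output is totally ordered."""
    return sort_by(rows, 1)[0]


gpas = [3.9, 2.8, 3.2, 3.6, 2.5, 3.0]

per_reducer = sort_by(gpas, 2)
concatenated = [g for part in per_reducer for g in part]
total = order_by(gpas)
```

Each per-reducer list is sorted, but reading the files one after another does not give a globally sorted result; that guarantee is what ORDER BY buys at the cost of funnelling all rows through one reducer.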
    • 34. Hive Summary • Not suitable for complex machine learning algorithms • All jobs have a minimum overhead and will take time just for set up – Still Hadoop MapReduce on a cluster • Good for batch jobs on large amounts of append-only data – Immutable filesystem – Does not support row level updates (except through file deletion or creation)
    • 35. HCatalog
    • 36. What is HCatalog? • Table management and storage management layer for Hadoop • HCatalog provides a shared schema and data type mechanism – Enables users with different data processing tools (Pig, MapReduce and Hive) to have data interoperability – HCatalog provides read and write interfaces for Pig, MapReduce and Hive to HDFS (or other data sources) – HCatalog's data abstraction presents users with a relational view of data • Command line interface for data manipulation • Designed to be accessed through other programs such as Pig, Hive, MapReduce and HBase • HCatalog installs on top of Hive
    • 37. HCatalog DDL • CREATE/ALTER/DROP Table • SHOW TABLES • SHOW FUNCTIONS • DESCRIBE • Many of the commands in Hive are supported – Any command which is not supported throws an exception and returns the message "Operation Not Supported".
    • 38. Accessing HCatalog Metastore through CLI • Using the HCatalog client: – Execute a script file: hcat -f "myscript.hcatalog" – Execute DDL: hcat -e "create table mytable(a int);"
    • 39. Define a New Schema • A schema is defined as an HCatalog table:
        create table mytable (
            id int,
            firstname string,
            lastname string)
        comment 'An example of an HCatalog table'
        partitioned by (birthday string)
        stored as sequencefile;
    • 40. Pig Specific HCatStorer Interface • Used with Pig scripts to write data to HCatalog managed tables • Accepts a table to write to and optionally a specification of partition keys to create a new partition • HCatStorer is implemented on top of HCatOutputFormat – HCatStorer is accessed via a Pig store statement – Storing into a table partitioned on month, date, hour:
        STORE my_processed_data INTO 'dbname.tablename'
            USING org.apache.hcatalog.pig.HCatStorer('month=12,date=25,hour=0300', 'a:int,b:chararray,c:map[]');
    • 41. Thank You! Questions & Answers