Yahoo! Hack Europe

Slides from London
  • I can’t really talk about Hortonworks without first taking a moment to talk about the history of Hadoop. What we now know as Hadoop really started back in 2005, when Eric Baldeschwieler (known as “E14”) started work on a project to build a large-scale data storage and processing technology that would allow Yahoo to store and process massive amounts of data to underpin its most critical application, Search. The initial focus was on building out the technology (the key components being HDFS and MapReduce) that would become the Core of what we think of as Hadoop today, and on continuing to innovate it to meet the needs of this specific application. By 2008, Hadoop usage had greatly expanded inside of Yahoo, to the point that many applications were using this data management platform, and as a result the team’s focus extended to include Operations: now that applications were propagating around the organization, sophisticated capabilities for operating Hadoop at scale were necessary. It was also at this time that usage began to expand well beyond Yahoo, with many notable organizations (including Facebook and others) adopting Hadoop as the basis of their large-scale data processing and storage applications, necessitating a focus on operations to support what was by now a large variety of critical business applications. In 2011, recognizing that more mainstream adoption of Hadoop was beginning to take off, and with an objective of facilitating it, the core team left (with the blessing of Yahoo) to form Hortonworks. The goal of the group was to facilitate broader adoption by addressing the Enterprise capabilities that would enable a larger number of organizations to adopt and expand their usage of Hadoop. [Note: if useful as a talk track, Cloudera was formed in 2008, well BEFORE the operational expertise of running Hadoop at scale was established inside of Yahoo.]
  • In that capacity, Arun allows Hortonworks to be instrumental in working with the community to drive the roadmap for Core Hadoop, where the focus today is on things like YARN, MapReduce2, HDFS2 and more. For Core Hadoop, in absolute terms, Hortonworkers have contributed more than twice as many lines of code as the next closest contributor, and even more if you include Yahoo, our development partner. Taking such a prominent role also enables us to ensure that our distribution integrates deeply with the ecosystem: both in choice of deployment platforms such as Windows, Azure and more, and in creating deeply engineered solutions with key partners such as Teradata. And consistent with our approach, all of this is done in 100% open source.
  • Across all of our user base, we have identified just 3 separate usage patterns. Sometimes more than one is used in concert during a complex project, but the patterns are distinct nonetheless: Refine, Explore and Enrich. The first of these, the Refine case, is probably the most common today. It is about taking very large quantities of data and using Hadoop to distill the information down into a more manageable data set that can then be loaded into a traditional data warehouse for use with existing tools. This is relatively straightforward and allows an organization to harness a much larger data set for its analytics applications while leveraging its existing data warehousing and analytics tools. Using the graphic here: in step 1 data is pulled from a variety of sources, into the Hadoop platform in step 2, and then in step 3 loaded into a data warehouse for analysis by existing BI tools.
  • A second use case is what we would refer to as Data Exploration; this is the use case most commonly in question when people talk about “Data Science”. In simplest terms, it is about using Hadoop as the primary data store rather than performing the secondary step of moving data into a data warehouse. To support this use case you’ve seen all the BI tool vendors rally to add support for Hadoop (most commonly HDP) as a peer to the database, and in so doing allow for rich analytics on extremely large datasets that would be both unwieldy and costly in a traditional data warehouse. Hadoop allows for interaction with a much richer dataset and has spawned a whole new generation of analytics tools that rely on Hadoop (HDP) as the data store. To use the graphic: in step 1 data is pulled into HDP, it is stored and processed in step 2, before being surfaced directly into the analytics tools for the end user in step 3.
  • The final use case is called Application Enrichment. This is about incorporating data stored in HDP to enrich an existing application. This could be an on-line application in which we want to surface custom information to a user based on their particular profile. For example: if a user has been searching the web for information on home renovations, in the context of your application you may want to use that knowledge to surface a custom offer for a product that you sell related to that category. Large web companies such as Facebook and others are very sophisticated in the use of this approach. In the diagram, this is about pulling data from disparate sources into HDP in step 1, storing and processing it in step 2, and then interacting with it directly from your applications in step 3, typically in a bi-directional manner (e.g. request data, return data, store response).
  • It is for that reason that we focus on HDP interoperability across all of these categories:
    – Data systems: HDP is endorsed and embedded with SQL Server, Teradata and more
    – BI tools: HDP is certified for use with the packaged applications you already use: from Microsoft, to Tableau, MicroStrategy, Business Objects and more
    – Development tools: for .NET developers, Visual Studio (used to build more than half the custom applications in the world) certifies with HDP to enable Microsoft app developers to build custom apps with Hadoop; for Java developers, Spring for Apache Hadoop enables quickly and easily building Hadoop-based applications with HDP
    – Operational tools: integration with System Center, and with Teradata Viewpoint
  • At its core, Hadoop is about HDFS and MapReduce, two projects for distributed storage and data processing that are the underpinnings of Hadoop. In addition to Core Hadoop, we must identify and include the requisite “Platform Services” that are central to any piece of enterprise software. These include High Availability, Disaster Recovery, Security, etc., which enable use of the technology for a much broader (and mission-critical) problem set. This is accomplished not by introducing new open source projects, but rather by ensuring that these aspects are addressed within existing projects.
    – HDFS: Self-healing, distributed file system for multi-structured data; breaks files into blocks and stores them redundantly across the cluster
    – MapReduce: Framework for running large data processing jobs in parallel across many nodes and combining the results
    – YARN: New application management framework that enables Hadoop to go beyond MapReduce apps
    – Enterprise-ready services: High availability, disaster recovery, snapshots, security, …
  • Beyond Core and Platform Services, we must add a set of Data Services that enable the full data lifecycle: capabilities to store, process and access data. For example: how do we maintain the consistent metadata required to determine how best to query data stored in HDFS? The answer: a project called Apache HCatalog. Or how do we access data stored in Hadoop from SQL-oriented tools? The answer: with projects such as Hive, the de facto standard for accessing data stored in HDFS. All of these are broadly captured under the category of “data services”.
    – Apache HCatalog: Metadata & Table Management. A metadata service that enables users to access Hadoop data as a set of tables without needing to be concerned with where or how their data is stored. Enables consistent data sharing and interoperability across data processing tools such as Pig, MapReduce and Hive, and deep interoperability and data access with systems such as Teradata, SQL Server, etc.
    – Apache Hive: SQL Interface for Hadoop. The de-facto SQL-like interface for Hadoop that enables data summarization, ad-hoc query, and analysis of large datasets. Connects to Excel, MicroStrategy, PowerPivot, Tableau and other leading BI tools via the Hortonworks Hive ODBC Driver. Hive currently serves batch and non-interactive use cases; in 2013, Hortonworks is working with the Hive community to extend those use cases to interactive query. Cloudera, on the other hand, has chosen to abandon Hive in favor of Cloudera Impala (a Cloudera-controlled technology aimed at the analytics market and solely focused on non-operational interactive query use cases).
    – Apache HBase: NoSQL DB for Interactive Apps. A non-relational, columnar database that provides a way for developers to create, read, update, and delete data in Hadoop in a way that performs well for interactive applications. Commonly used for serving “intelligent applications” that predict user behavior, detect shifting usage patterns, or recommend ways for users to engage.
    – WebHDFS: Web service interface for HDFS. A scalable REST API that enables easy and scalable access to HDFS: move files in and out of HDFS, delete them, and perform file and directory functions, leveraging the parallelism of the cluster. Paths take the form webhdfs://<HOST>:<HTTP PORT>/PATH. Included in versions 1.0 and 2.0 of Hadoop; created and driven by Hortonworkers.
    – Talend Open Studio for Big Data: an open source ETL tool available as an optional download with HDP. Intuitive graphical data integration tools for HDFS, Hive, HBase, HCatalog and Pig; Oozie scheduling to manage and stage jobs; connectors for any database, business application or system; integrated HCatalog storage.
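As a sketch of how the WebHDFS REST API described above maps a webhdfs:// path onto a plain HTTP request (the host name is a placeholder, and the helper class is illustrative, not part of Hadoop):

```java
// Sketch: construct the HTTP URL corresponding to a webhdfs:// path.
// The /webhdfs/v1 prefix and the op= query parameter follow the WebHDFS REST API;
// "namenode" below is a hypothetical host name.
class WebHdfsUrl {
    static String openUrl(String host, int httpPort, String path) {
        // e.g. webhdfs://namenode:50070/user/foo/bar.txt maps to
        //      http://namenode:50070/webhdfs/v1/user/foo/bar.txt?op=OPEN
        return "http://" + host + ":" + httpPort + "/webhdfs/v1" + path + "?op=OPEN";
    }

    public static void main(String[] args) {
        System.out.println(openUrl("namenode", 50070, "/user/foo/bar.txt"));
    }
}
```

Any HTTP client (curl, a browser, HttpURLConnection) can then issue the request; the cluster redirects the read to a datanode, which is what lets WebHDFS leverage the parallelism of the cluster.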
  • HCatalog: metadata shared across the whole platform. File locations become abstract (not hard-coded); data types become shared (not redefined per tool); partitioning is HDFS-optimized.
  • Any data management platform that is operated at any reasonable scale requires a management technology: for example, SQL Server Management Studio for SQL Server, or Oracle Enterprise Manager for Oracle DB. Hadoop is no exception, and there that means Apache Ambari, which is increasingly being recognized as foundational to the operation of Hadoop infrastructures. It allows users to provision, manage and monitor a cluster and provides a set of tools to visualize and diagnose operational issues. There are other projects in this category (such as Oozie) but Ambari is really the most influential.
    – Apache Ambari: Management & Monitoring. Makes Hadoop clusters easy to operate: simplified cluster provisioning with a step-by-step install wizard; pre-configured operational metrics for insight into the health of Hadoop services; visualization of job and task execution for visibility into performance issues; a complete RESTful API for integrating with existing operational tools; and an intuitive user interface that makes controlling a cluster easy and productive.
  • So how does this get brought together into our distribution? It is really pretty straightforward, but also very unique. We start with the group of open source projects that I described and that we are continually driving in the OSS community. [CLICK] We then package the appropriate versions of those open source projects, integrate and test them using a full suite (including all the IP for regression testing contributed by Yahoo), and [CLICK] contribute all of the bug fixes back to the open source tree. From there, we package and certify a distribution in the form of the Hortonworks Data Platform (HDP), which includes both Hadoop Core and the related projects required by the Enterprise user, and provide it to our customers. Through this application of an Enterprise software development process to the open source projects, the result is a 100% open source distribution that has been packaged, tested and certified by Hortonworks, and that is 100% in sync with the open source trees.
  • And finally, because any enterprise runs a heterogeneous set of infrastructures, we ensure that HDP runs on your choice of infrastructure: Linux, Windows (HDP is the only distribution certified for Windows), a cloud platform such as Azure or Rackspace, or an appliance. All of these are supported, and this work is all contributed back to the open source community.
  • Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost, since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this. The first is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as a remote NFS mount.
  • Despite its name, the SNN (secondary namenode) does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log, to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain. The usual course of action in this case is to copy the namenode’s metadata files that are on NFS to the secondary and run it as the new primary.
    Edit log: When a filesystem client performs a write operation (such as creating or moving a file), it is first recorded in the edit log. The namenode also has an in-memory representation of the filesystem metadata, which it updates after the edit log has been modified. The in-memory metadata is used to serve read requests. The edit log is flushed and synced after every write before a success code is returned to the client. For namenodes that write to multiple directories, the write must be flushed and synced to every copy before returning successfully. This ensures that no operation is lost due to machine failure.
    fsimage: The fsimage file is a persistent checkpoint of the filesystem metadata. However, it is not updated for every filesystem write operation, since writing out the fsimage file, which can grow to be gigabytes in size, would be very slow. This does not compromise resilience, however, because if the namenode fails, the latest state of its metadata can be reconstructed by loading the fsimage from disk into memory and then applying each of the operations in the edit log. In fact, this is precisely what the namenode does when it starts up. According to Apache, SecondaryNameNode is deprecated. See: (Accessed May 2012).
    – SNN configuration options:
      dfs.namenode.checkpoint.period <value>
      dfs.namenode.checkpoint.size <value>
      dfs.http.address <value> (points to the namenode’s port 50070; used to fetch the fsimage and edit log)
      dfs.namenode.checkpoint.dir <value> (else defaults to /tmp)
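The startup/recovery sequence above (load the latest fsimage checkpoint, then replay the edit log) can be illustrated with a toy model. This is a minimal sketch, not Hadoop's actual classes; the class and method names are illustrative only:

```java
import java.util.*;

// Toy model of namenode metadata recovery: start from the last checkpoint
// (fsimage) and replay the edit log to reconstruct the current namespace.
class NamespaceRecovery {
    // fsimage: set of paths at checkpoint time; editLog: {op, path} pairs recorded since
    static Set<String> recover(Set<String> fsimage, List<String[]> editLog) {
        Set<String> namespace = new TreeSet<>(fsimage);   // load the checkpoint
        for (String[] op : editLog) {                     // replay each logged operation in order
            if (op[0].equals("create")) namespace.add(op[1]);
            else if (op[0].equals("delete")) namespace.remove(op[1]);
        }
        return namespace;
    }
}
```

The secondary namenode's periodic merge is the same replay, done offline, producing a new fsimage so the edit log can be truncated.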
  • Data disk failure, heartbeats and re-replication: Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode; the NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is no longer available to HDFS. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise for many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
    Cluster rebalancing: The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.
    Data integrity: It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file.
    If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
    Metadata disk failure: The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to be updated synchronously. This synchronous updating of multiple copies may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use. The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.
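The client-side integrity check described above can be sketched in plain Java. This is an illustration of the idea, not HDFS's actual implementation (HDFS uses per-chunk CRC checksums internally; the class here is hypothetical and uses the JDK's CRC32 for simplicity):

```java
import java.util.zip.CRC32;
import java.nio.charset.StandardCharsets;

// Sketch of HDFS-style client-side integrity checking: compare the checksum of
// a block read from a DataNode against the value recorded when it was written.
class BlockIntegrity {
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    // true if the block matches the stored checksum; on a mismatch the client
    // would instead fetch the block from another DataNode holding a replica
    static boolean verify(byte[] block, long storedChecksum) {
        return checksum(block) == storedChecksum;
    }

    public static void main(String[] args) {
        byte[] block = "example block contents".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(block);           // computed at write time
        System.out.println(verify(block, stored)); // verified at read time
    }
}
```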
  • Use the put command:
    If bar is a directory, then foo will be placed in the directory bar.
    If no file or directory exists with the name bar, the file foo will be named bar.
    If bar is already a file in HDFS, an error is returned.
    If subName does not exist, the file will be created in the directory.
    Note: -copyFromLocal can be used instead of -put.
  • Can display a file using the -cat command.
    Copy a file from HDFS to the local file system with the -get command:
    $ bin/hadoop fs -get foo LocalFoo
    Other commands:
    $ hadoop fs -rmr <directory|file>   (removes recursively)
  • A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system (HDFS). The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks. Typically the compute and storage run on the same set of nodes; this configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster. M/R supports multiple nodes processing contiguous data sets in parallel. Often, one node will “Map” while another “Reduces”; in between is the “shuffle”. We’ll cover all these in more detail. The basic “phases”, or steps, are displayed here; other custom phases may be added. In MapReduce, data is passed around as key/value pairs.
    QUESTION: is the “Reduce” phase required? NO
    INSTRUCTOR SUGGESTION: during the second day start, have one or more students re-draw this on the white board to reiterate its importance.
  • MapReduce consists of Java classes that can process big data in parallel. It receives three inputs (a source collection, a Map function and a Reduce function) and returns a new result data collection. The algorithm is composed of a few steps. The first executes the map() function on each item within the source collection; map() returns zero or many instances of Key/Value objects. Map’s responsibility is to convert an item from the source collection to zero or many instances of Key/Value pairs. In the next step, an algorithm sorts all Key/Value pairs to create new object instances in which all values are grouped by Key. reduce() then processes each grouped Key/Value instance and returns a new instance to be included in the result collection.
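The map, shuffle/sort, and reduce steps above can be sketched in plain Java, with no Hadoop classes at all. This is a teaching sketch of the flow using the canonical word-count example; the class and method names are illustrative, not Hadoop's API:

```java
import java.util.*;

// Pure-Java sketch of the map -> shuffle/sort -> reduce flow: word count.
class MiniMapReduce {
    // map(): one input line -> zero or many (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+"))
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // shuffle/sort: group all values by key; TreeMap keeps keys in sorted order
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // reduce(): sum the grouped values for each key
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        grouped.forEach((k, vs) -> out.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }
}
```

In real Hadoop the three steps run on different nodes and the shuffle moves data over the network, but the data flow is the same.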
  • A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. An InputFormat determines how the data is split across map tasks; TextInputFormat is the default InputFormat. It divides the input data into an InputSplit[] array. Each InputSplit is given an individual map, and each InputSplit is associated with a destination node that should be used to read the data, so Hadoop can try to assign the computation to the local data location.
    The RecordReader class: reads input data and produces key-value pairs to be passed into the Mapper; may control how data is decompressed; converts data to Java types that MapReduce can work with.
    MapReduce relies on the InputFormat to: validate the input-specification of the job; split up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper; and provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.
    The default behavior of InputFormat is to split the input into logical InputSplit instances based on the total size, in bytes, of the input files. [However, the HDFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via the config parameter mapred.min.split.size.] * validate this
    Logical splits based on input size are insufficient for many applications, since record boundaries must be respected. In such cases, the application should implement a RecordReader, which is responsible for respecting record boundaries and presents a record-oriented view of the logical InputSplit to the individual task.
    If the default TextInputFormat is used for a given job, the framework detects compressed input files (i.e., with the .gz extension) and automatically decompresses them using an appropriate CompressionCodec. Note that, depending on the codec used, compressed files cannot be split, and each such file is processed in its entirety by a single mapper: gzip cannot be split; bzip2 and LZO can.
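The split-sizing rule described above (total size driven, block size as an upper bound, a configurable lower bound) can be sketched as arithmetic. This is a simplified sketch of the classic old-API formula max(minSize, min(goalSize, blockSize)); the class is illustrative, and the exact behavior should be checked against your Hadoop version:

```java
// Sketch of input split sizing: goalSize is totalBytes / requestedMaps,
// blockSize caps the split from above, minSize raises it from below.
class SplitMath {
    static long splitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // number of splits needed to cover the input at the chosen split size
    static long numSplits(long totalBytes, long splitSize) {
        return (totalBytes + splitSize - 1) / splitSize;  // ceiling division
    }
}
```

Plugging in the example from the notes that follow (10 TB of input, 128 MB blocks) gives 81,920 splits, i.e. roughly the "82,000 maps" figure.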
  • Mapping is often used for typical “ETL” processes, such as filtering, transformation, and categorization. Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes. The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
    The OutputCollector is provided by the framework to collect data output by the Mapper or the Reducer.
    The Reporter class is a facility for MapReduce applications to report progress, set application-level status messages and update Counters. Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive. In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial, since the framework might otherwise assume that the task has timed out and kill it. Another way to avoid this is to set the configuration parameter mapred.task.timeout to a high enough value (or even to zero, for no time-outs). Applications can also update Counters using the Reporter.
    Parallel map processes: the number of maps is usually driven by the number of inputs, or the total number of blocks in the input files. The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute. If you expect 10TB of input data and have a block size of 128MB, you’ll end up with 82,000 maps, unless setNumMapTasks(int) (which only provides a hint to the framework) is used to set it even higher.
  • The Partitioner controls the balancing of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function on the key (or a portion of it). All Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of Reduce tasks for the job. The Partitioner is like a load balancer that controls which of the Reduce tasks the intermediate key (and hence the record) is sent to for reduction. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner. All intermediate values associated with a given output key are subsequently grouped by the framework and passed to the Reducer to determine the final output. Users can control that grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class). MapReduce comes bundled with a library of generally useful Mapper, Reducer, and Partitioner classes.
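The hash-based partitioning described above boils down to a one-line rule. The sketch below mirrors the logic of Hadoop's default HashPartitioner (hash the key, mask off the sign bit, mod by the number of reducers); the wrapper class here is illustrative:

```java
// Sketch of default hash partitioning: every record with the same key lands
// on the same reducer, and keys spread roughly evenly across reducers.
class HashPartition {
    static int getPartition(String key, int numReduceTasks) {
        // & Integer.MAX_VALUE clears the sign bit so the result is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

The guarantee that matters for correctness is determinism: calling getPartition twice with the same key and reducer count must give the same answer, otherwise values for one key would be split across reducers.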
  • MapReduce partitions data among individual Reducers, usually running on separate nodes. The Shuffle phase is determined by how a Partitioner (built-in or custom) partitions key/value pairs to Reducers. The shuffle is where refinements and improvements are continually being made; in many ways, the shuffle is the heart of MapReduce and is where the “magic” happens.
  • The Sort phase guarantees that the input to every Reducer is sorted by key.
  • A Reducer extends MapReduceBase and implements the Reducer interface. It receives output from multiple mappers; the reduce() method is called for each <key, (list of values)> pair in the grouped inputs. The framework sorts data by the key of each key/value pair, and the reducer typically iterates through the values associated with a key. It is perfectly legal to set the number of reduce tasks to zero if no reduction is desired. In this case the outputs of the map tasks go directly to HDFS, to the output path set by setOutputPath(Path); the framework does not sort the map-outputs before writing them out to HDFS.
    The Reducer has 3 primary phases: Shuffle, Sort and Reduce.
    Shuffle: Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
    Sort: The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.
    Secondary sort: If the equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate a secondary sort on values.
    Reduce: In this phase the reduce() method is called for each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to HDFS via OutputCollector.collect(WritableComparable, Writable). Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive. The output of the Reducer is not sorted.
  • OutputFormat has two responsibilities: determining where and how data is written. It determines where by examining the JobConf and checking that the destination is legitimate; the ‘how’ is handled by the getRecordWriter() function. From Hadoop’s point of view, these classes come together when the MapReduce job finishes and each reducer has produced a stream of key-value pairs. Hadoop calls checkOutputSpecs with the job’s configuration; if the function runs without throwing an Exception, it moves on and calls getRecordWriter, which returns an object that can write that stream of data. When all of the pairs have been written, Hadoop calls the close function on the writer, committing that data to HDFS and finishing the responsibility of that reducer.
    MapReduce relies on the OutputFormat of the job to: validate the output-specification of the job (for example, check that the output directory doesn’t already exist), and provide the RecordWriter implementation used to write the output files of the job. Output files are stored in a FileSystem (usually HDFS). TextOutputFormat is the default OutputFormat.
    OutputFormats specify how to serialize data by providing an implementation of RecordWriter. RecordWriter classes handle the job of taking an individual key-value pair and writing it to the location prepared by the OutputFormat. A RecordWriter implements two main functions: ‘write’ and ‘close’. The ‘write’ function takes key-values from the MapReduce job and writes the bytes to disk. The default RecordWriter is LineRecordWriter, part of the TextOutputFormat mentioned earlier. It writes: the key’s bytes (returned by the getBytes() function); a tab character delimiter; the value’s bytes (again, produced by getBytes()); and a newline character. The ‘close’ function closes the Hadoop data stream to the output file.
    We’ve talked about the format of output data, but where is it stored? You’ve probably seen the output of a job stored in many ‘part’ files under the output directory, like so:
    |-- output-directory
    |   |-- part-00000
    |   |-- part-00001
    |   |-- part-00002
    |   |-- part-00003
    |   |-- part-00004
    |   '-- part-00005
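The record layout written by LineRecordWriter (key bytes, tab, value bytes, newline) and the part-file naming scheme shown above can be sketched in a few lines. The helper class is illustrative, not Hadoop's API:

```java
// Sketch of TextOutputFormat conventions: the line layout LineRecordWriter
// emits for each key-value pair, and the zero-padded part-file names.
class TextOutputSketch {
    // key TAB value NEWLINE, as described for LineRecordWriter
    static String writeLine(String key, String value) {
        return key + "\t" + value + "\n";
    }

    // reducer N writes to part-0000N (five digits, zero-padded)
    static String partFileName(int partition) {
        return String.format("part-%05d", partition);
    }
}
```

So a word-count reducer emitting ("apple", 3) as partition 4 would append "apple\t3\n" to output-directory/part-00004.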
  • Mostly "boilerplate", with fill-in-the-blank for datatypes. Hadoop datatypes have well-defined methods to get/put values into them. It is a standard pattern to declare your output objects outside the map() loop scope, and a standard pattern to 1) get the Java datatypes, 2) do your logic, 3) put the results into Hadoop datatypes. This map() method is called in a "loop" with successive key, value pairs. Each time through, you typically write key, value pairs to the reduce phase. The key, value pairs go to the Hadoop Framework, where the key is hashed to choose a reducer.
  • In the Reducer: reduce code is copied to multiple nodes in the cluster, and the copies are identical. All key, value pairs with identical keys are placed into a (key, Iterator<values>) pair when passed to the reducer. Each invocation of the reduce() method is passed all values associated with a particular key, and the keys are guaranteed to arrive in sorted order. Each incoming value is a numeric count of words in a particular line of text; the multiple values represent multiple lines of text processed by multiple map() methods. The values are in an Iterator and are summed in a loop. The sum, with its associated key, is sent to a disk file via the Hadoop Framework. There are typically many copies of your reduce code. Incoming key, value pairs are of the datatypes that map emits; output key, value pairs go to HDFS, which writes them to disk. The Hadoop Framework constructs the Iterator from the map values that are sent to this Reducer.
  • An inverted index is a data structure that stores a mapping from content to its locations in a database file, or in a document or a set of documents. It can provide what to display for a search. NOTE: need to deal with OutputCollector() here.
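The content-to-locations mapping can be sketched in plain Java: each word maps to the set of documents it appears in. This is a minimal in-memory sketch (document IDs are just list positions), not the MapReduce version the slide builds:

```java
import java.util.*;

// Sketch of an inverted index: word -> sorted set of document IDs containing it.
class InvertedIndex {
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.size(); id++)           // id doubles as the "location"
            for (String w : docs.get(id).toLowerCase().split("\\s+"))
                if (!w.isEmpty())
                    index.computeIfAbsent(w, k -> new TreeSet<>()).add(id);
        return index;
    }
}
```

As a MapReduce job the shape is the same: map emits (word, docId) pairs, and reduce collects the docIds for each word into the posting list.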
  • Pig Latin is a data flow language: a sequence of steps where each step is an operation or command. During execution, each statement is processed by the Pig interpreter and checked for syntax errors. If a statement is valid, it gets added to a logical plan built by the interpreter. However, a step does not actually execute until the entire script is processed, unless the step is a DUMP or STORE command, in which case the logical plan is compiled into a physical plan and executed.
  • After the LOAD statement, point out that nothing is actually loaded! The LOAD command only defines a relation. The FILTER and GROUP commands are fairly self-explanatory. HDFS is not even touched until the STORE command executes, at which point a MapReduce application is built from the Pig Latin statements shown here.
  • Grunt is Pig's interactive shell. It enables users to enter Pig Latin interactively, does basic syntax and semantic checking as you enter each line, and provides a shell for users to interact with HDFS. To enter Grunt and use the local file system instead: $ pig -x local. In the example script: A is called a relation or alias; it is immutable, and recreated if reused. 'myfile' is read from your home directory in HDFS. A has no schema associated with it (bytearray). LOAD uses the default function PigStorage() to load data, and assumes TAB-delimited data in HDFS. The entire file 'myfile' will be read ("pigs eat anything"). The elements in A can be referenced by position ($0, $1, $2, ...) if no schema is associated.
  • Pig return codes:(type in page 18 of Pig)
  • The complex types can contain data of any type
  • The DESCRIBE output is:

describe employees;
employees: {name: chararray, age: bytearray, zip: int, salary: bytearray}
  • Pig includes the concept of a data element being null. Data of any type can be null. It is important to understand that in Pig the concept of null is the same as in SQL, which is completely different from the concept of null in C, Java, Python, etc. In Pig a null data element means the value is unknown. This might be because the data is missing, an error occurred in processing it, etc. In most procedural languages, a data value is said to be null when it is unset or does not point to a valid address or object. This difference in the concept of null is important and affects the way Pig treats null data, especially when operating on it.from Programming Pig, Alan Gates 2011
  • twokey.pig collects all records with the same value for the provided key together into a bag. Within a single relation, GROUP groups together tuples with the same group key. You can use the keywords BY and ALL with GROUP, optionally pass the result to an aggregate function (count.pig), and group on multiple keys if they are surrounded by parentheses.
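GROUP's behavior of collecting whole records into a bag per key can be sketched in pure Python (a simulation of the semantics, not Pig itself; the sample tuples are illustrative):

```python
from collections import defaultdict

def pig_group(tuples, key_index):
    # GROUP collects every whole tuple sharing a key value into one bag
    bags = defaultdict(list)
    for t in tuples:
        bags[t[key_index]].append(t)
    return dict(bags)

daily = [("NYSE", "IBM", 1.0), ("NYSE", "GE", 0.5), ("NYSE", "IBM", 1.1)]
grpd = pig_group(daily, 1)   # like: grpd = GROUP daily BY stock;
# grpd["IBM"] is a bag holding both IBM tuples
```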
  • Yahoo! Hack Europe

    1. © Hortonworks Inc. 2013. Chris Harris. Twitter: cj_harris5. E-mail: charris@hortonworks.com. Page 1. Introduction to Big Data with Hadoop
    2. 2. © Hortonworks Inc. 2013What is Big Data?Page 2
    3. 3. © Hortonworks Inc. 2013Web giants proved the ROI in data productsapplying data science to large amounts of dataPage 3Amazon: 35% ofproduct sales comefrom productrecommendationsNetflix: 75% of streamingvideo results fromrecommendationsPrediction of clickthrough rates
    4. 4. © Hortonworks Inc. 2013Data science is a natural next step afterbusiness intelligencePage 4ValueRefine Extract EnrichData ScienceDashboardsReportsScore-cardsAffinity AnalysisOutlier DetectionClusteringRecommendationRegressionClassificationBusiness Intelligence: measure & count; simple analyticsData Science: discovery & prediction; complex analytics; “data product”DiscoveryPrediction
    5. 5. © Hortonworks Inc. 2013Key use-cases in Finance/Insurance• Customer risk profiling:–How likely is this customer to pay back his mortgage?–How likely is this customer to get sick?• Fraud detection:–Detect illegal credit card activity and alert bank/consumer–Detect illegal insurance claims• Internal fraud detection (compliance):–Is this employee accessing financial information they are notallowed to access?Page 5
    6. 6. © Hortonworks Inc. 2013Key use-cases in Telco/Mobile• Customer life-time-value prediction–What is the LTV for customer X?• Marketing–Which new mobile phone should we offer to customer X so that theyremain with us?–Location based advertising• Failure prediction–When will equipment X in cell tower Y fail?• Cell Tower Management–Predict load and bandwidth on cell towers to optimize networkPage 6
    7. 7. © Hortonworks Inc. 2013Key use-cases in Healthcare• Clinical Decision Support:–What is the ideal treatment for this patient?• Cost management:–What is the expected overall cost of treatment for this patientover the life of the disease• Diagnostics:–Given these test results, what is the likelihood of cancer?• Epidemic management–Predict size and location of epidemic spreadPage 7
    8. 8. © Hortonworks Inc. 2013What is Hadoop?Page 8
    9. 9. © Hortonworks Inc. 2013A Brief History of Apache HadoopPage 92013Focus on INNOVATION2005: Yahoo! createsteam under E14 towork on HadoopFocus on OPERATIONS2008: Yahoo team extends focus tooperations to support multipleprojects & growing clustersYahoo! begins toOperate at scaleEnterpriseHadoopApache ProjectEstablishedHortonworksData Platform2004 2008 2010 20122006STABILITY2011: Hortonworks created to focus on“Enterprise Hadoop“. Starts with 24key Hadoop engineers from Yahoo
    10. 10. © Hortonworks Inc. 2013Leadership that Starts at the CorePage 10• Driving next generation Hadoop– YARN, MapReduce2, HDFS2, HighAvailability, Disaster Recovery• 420k+ lines authored since 2006– More than twice nearest contributor• Deeply integrating w/ecosystem– Enabling new deployment platforms– (ex. Windows & Azure, Linux & VMware HA)– Creating deeply engineered solutions– (ex. Teradata big data appliance)• All Apache, NO holdbacks– 100% of code contributed to Apache
    11. 11. © Hortonworks Inc. 2013Operational Data RefineryPage 11DATASYSTEMSDATASOURCES131 CaptureCapture all dataProcessParse, cleanse, applystructure & transformExchangePush to existing datawarehouse for use withexisting analytic tools23Refine ExploreEnrich2APPLICATIONSCollect data and applya known algorithm to itin trusted operationalprocessTRADITIONAL REPOSRDBMS EDW MPPBusinessAnalyticsCustomApplicationsEnterpriseApplicationsTraditional Sources(RDBMS, OLTP, OLAP)New Sources(web logs, email, sensor data, social media)
    12. 12. © Hortonworks Inc. 2013Key Capability in Hadoop: Late bindingPage 12DATASERVICESOPERATIONALSERVICESHORTONWORKSDATA PLATFORMHADOOP COREWEB LOGS,CLICK STREAMSMACHINEGENERATEDOLTPData Mart /EDWClient AppsDynamically ApplyTransformationsHortonworks HDPWith traditional ETL, structure must be agreed upon far in advance and is difficult to change.With Hadoop, capture all data, structure data as business need evolve.WEB LOGS,CLICK STREAMSMACHINEGENERATEDOLTPETL Server Data Mart /EDWClient AppsStore TransformedData
    13. 13. © Hortonworks Inc. 2013Big Data Exploration & VisualizationPage 13DATASYSTEMSDATASOURCESRefine Explore EnrichAPPLICATIONS1 CaptureCapture all dataProcessParse, cleanse, applystructure & transformExchangeExplore and visualizewith analytics toolssupporting Hadoop23Collect data andperform iterativeinvestigation for value32TRADITIONAL REPOSRDBMS EDW MPP1BusinessAnalyticsTraditional Sources(RDBMS, OLTP, OLAP)New Sources(web logs, email, sensor data, social media)CustomApplicationsEnterpriseApplications
    14. 14. © Hortonworks Inc. 2013Visualization Tooling• Robust visualization and business tooling• Ensures scalability when working with large datasetsPage 14Native Excel supportWeb browser supportMobile support
    15. 15. © Hortonworks Inc. 2013Application EnrichmentPage 15DATASYSTEMSDATASOURCESRefine Explore EnrichAPPLICATIONS1 CaptureCapture all dataProcessParse, cleanse, applystructure & transformExchangeIncorporate data directlyinto applications23Collect data, analyzeand present salientresults for online apps312TRADITIONAL REPOSRDBMS EDW MPPTraditional Sources(RDBMS, OLTP, OLAP)New Sources(web logs, email, sensor data, social media)CustomApplicationsEnterpriseApplicationsNOSQL
    16. 16. © Hortonworks Inc. 2013Web giants proved the ROI in data productsapplying data science to large amounts of dataPage 16Amazon: 35% ofproduct sales comefrom productrecommendationsNetflix: 75% of streamingvideo results fromrecommendationsPrediction of clickthrough rates
    17. 17. © Hortonworks Inc. 2013Interoperating With Your ToolsPage 17APPLICATIONSDATASYSTEMSTRADITIONAL REPOSDEV & DATATOOLSOPERATIONALTOOLSViewpointMicrosoft ApplicationsDATASOURCESMOBILEDATAOLTP,POSSYSTEMSTraditional Sources(RDBMS, OLTP, OLAP)New Sources(web logs, email, sensor data, social media)
    18. 18. © Hortonworks Inc. 2013Deep Drive on HadoopComponentsPage 18
    19. 19. © Hortonworks Inc. 2013Enhancing the Core of Apache HadoopPage 19HADOOP COREPLATFORM SERVICES Enterprise ReadinessHDFS YARN (in 2.0)MAP REDUCEDeliver high-scalestorage & processingwith enterprise-readyplatform servicesUnique Focus Areas:• Bigger, faster, more flexibleContinued focus on speed & scale andenabling near-real-time apps• Tested & certified at scaleRun ~1300 system tests on large Yahooclusters for every release• Enterprise-ready servicesHigh availability, disaster recovery,snapshots, security, …
    20. 20. © Hortonworks Inc. 2013Page 20HADOOP COREDATASERVICESDistributedStorage & ProcessingPLATFORM SERVICES Enterprise ReadinessData Services for Full Data LifecycleWEBHDFSHCATALOGHIVEPIGHBASESQOOPFLUMEProvide data services tostore, process & accessdata in many waysUnique Focus Areas:• Apache HCatalogMetadata services for consistent tableaccess to Hadoop data• Apache HiveExplore & process Hadoop data via SQL &ODBC-compliant BI tools• Apache HBaseNoSQL database for Hadoop• WebHDFSAccess Hadoop files via scalable REST API• Talend Open Studio for Big DataGraphical data integration tools
    21. 21. © Hortonworks Inc. 2013Operational Services for Ease of UsePage 23OPERATIONALSERVICESDATASERVICESStore,Process andAccess DataHADOOP COREDistributedStorage & ProcessingPLATFORM SERVICES Enterprise ReadinessOOZIEAMBARIInclude completeoperational services forproductive operations& managementUnique Focus Area:• Apache Ambari:Provision, manage & monitor a cluster;complete REST APIs to integrate withexisting operational tools; job & taskvisualizer to diagnose issues
    22. 22. © Hortonworks Inc. 2013Getting StartedPage 26
    23. 23. © Hortonworks Inc. 2013Hortonworks Process for Enterprise HadoopPage 27Upstream Community Projects Downstream Enterprise ProductHortonworksData PlatformDesign &DevelopDistributeIntegrate& TestPackage& CertifyApacheHCatalogApachePigApacheHBaseOtherApacheProjectsApacheHiveApacheAmbariApacheHadoopTest &PatchDesign & DevelopReleaseNo Lock-in: Integrated, tested & certified distribution lowersrisk by ensuring close alignment with Apache projectsVirtuous cycle when development & fixed issues done upstream & stable project releases flow downstreamStable ProjectReleasesFixed Issues
    24. 24. © Hortonworks Inc. 2013OS Cloud VM AppliancePage 28PLATFORM SERVICESHADOOP COREDATASERVICESOPERATIONALSERVICESManage &Operate atScaleStore,Process andAccess DataEnterprise ReadinessOnly Hortonworksallows you to deployseamlessly across anydeployment option• Linux & Windows• Azure, Rackspace & other clouds• Virtual platforms• Big data appliancesHORTONWORKSDATA PLATFORM (HDP)DistributedStorage & ProcessingDeployable Across a Range of Options
    25. 25. © Hortonworks Inc. 2013Refine-Explore-Enrich DemoPage 29Hands on tutorialsintegrated intoSandboxHDP environment forevaluationThe Sandbox lets you experience ApacheHadoop from the convenience of your ownlaptop – no data center, no cloud and nointernet connection needed!The Hortonworks Sandbox is:• A free download:• A complete, self contained virtualmachine with Apache Hadoop pre-configured• A personal, portable and standaloneHadoop environment• A set of hands-on, step-by-step tutorialsthat allow you to learn and exploreHadoop
    26. 26. © Hortonworks Inc. 2013Hortonworks & MicrosoftPage 30HDInsight• Big Data Insight for Millions, Massiveexpansion of Hadoop• Simplifies Hadoop, Enterprise Ready• Hortonworks Data Platform used forHadoop on Windows Server and Azure• An engineered, open source solution– Hadoop engineered for Windows– Hadoop powered Microsoft business tools– Ops integration with MS System Center– Bidirectional connectors for SQL Server– Support for Hyper-V, deploy Hadoop on VMs– Opens the .NET developer community to Hadoop– Javascript for Hadoop– Deploy on Azure in 10 minutes• Excel• PowerPivot (BI)• PowerView(visualization)• SharePoint+
    27. 27. © Hortonworks Inc. 2013Useful Links• Hortonworks Sandbox:–• HDInsight Service:– ?–User/PWD• Sample Data:Page 31
    28. 28. © Hortonworks Inc. 2013Hadoop 1 hour WorkshopPage 32
    29. 29. © Hortonworks Inc. 2013Useful Links• Hortonworks Sandbox:–• HDInsight Service:– ?–User/PWD• Sample Data:Page 33
    30. 30. © Hortonworks Inc. 2013Working with HDFS
    31. 31. © Hortonworks Inc. 2013What is HDFS?• Stands for Hadoop Distributed File System• Primary storage system for Hadoop• Fast and reliable• Deployed only on Linux (as of May 2012)–Active work around Hadoop on Windows
    32. 32. © Hortonworks Inc. 2013HDFS Characteristics (Cont.)• Write once and read many times• Files only append• Data stored in blocks–Distributed over many nodes–Block sizes often range from 128MB to 1GB
    33. 33. © Hortonworks Inc. 2013HDFS Architecture
    34. © Hortonworks Inc. 2013 HDFS Architecture (figure: the NameNode holds the NameSpace, Block Map and Block Management, and writes the NameSpace metadata image (checkpoint) and edit journal log; a Secondary NameNode checkpoints the image and edit journal log as a backup; DataNodes hold replicated blocks such as BL1, BL2, BL3, BL6, BL7, BL8, BL9)
    35. 35. © Hortonworks Inc. 2013Data Organization• Metadata–organized into files and directories–linux-like permissions prevent accidental deletions• Files–divided into uniform sized blocks–default 64 MB–distributed across clusters• Rack-aware• Keeps sizing checksums–for corruption detection
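The block-plus-checksum idea on this slide can be sketched in pure Python. This is a toy illustration of splitting data into uniform blocks with a checksum per block, not how HDFS is implemented (HDFS actually checksums small chunks within each block):

```python
import zlib

def split_into_blocks(data: bytes, block_size: int):
    # Split a byte stream into fixed-size blocks, each paired with a
    # CRC32 checksum so corruption can be detected on read.
    blocks = []
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        blocks.append((block, zlib.crc32(block)))
    return blocks

blocks = split_into_blocks(b"x" * 10, 4)   # toy 4-byte "blocks"
# 3 blocks of sizes 4, 4 and 2; a changed byte no longer matches its CRC
```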
    36. © Hortonworks Inc. 2013
HDFS Cluster
• HDFS runs in Hadoop distributed mode, on a cluster
• 3 main components:
–NameNode: manages DataNodes; keeps metadata for all nodes & blocks
–DataNodes: hold data blocks; live on racks
–Client: talks directly to the NameNode, then to DataNodes
    37. 37. © Hortonworks Inc. 2013NameNode• Server running the namenode daemon–Responsible for coordinating datanodes• The Master of the DataNodes• Problems–Lots of overhead to being the Master–Should be special server for performance–Single point of failure
    38. © Hortonworks Inc. 2013
DataNode
• A common node running a datanode daemon
• Slave
• Manages block reads/writes for HDFS
• Manages block replication
• Pings NameNode and gets instructions back
• If the heartbeat fails
–NameNode removes it from the cluster
–Replicated blocks take over
    39. © Hortonworks Inc. 2013 HDFS Heartbeats (figure: each DataNode daemon sends heartbeats to the NameNode, which maintains the fsimage and edit log: "I'm datanode X, and I'm OK; I do have some new information for you: the new blocks are …")
    40. 40. © Hortonworks Inc. 2013HDFS Commands
    41. © Hortonworks Inc. 2012
Basic HDFS File System Commands
$ hadoop fs -command <args>
Here are a few (of the almost 30) HDFS commands:
-cat: just like Unix cat, displays file content (uncompressed)
-text: just like cat, but works on compressed files
-chgrp, -chmod, -chown: just like the Unix commands; change permissions
-put, -get, -copyFromLocal, -copyToLocal: copy files from the local file system to HDFS and vice-versa (two versions of each)
-ls, -lsr: just like Unix ls; list files/directories
-mv, -moveFromLocal, -moveToLocal: move files
-stat: statistical info for any given file (block size, number of blocks, file type, etc.)
    42. 42. © Hortonworks Inc. 2012 Page 47Commands Example$ hadoop fs –ls /user/brian/$ hadoop fs -lsr$ hadoop fs –mkdir notes$ hadoop fs –put ~/training/commands.txt notes$ hadoop fs –chmod 777 notes/commands.txt$ hadoop fs –cat notes/commands.txt | more$ hadoop fs –rm notes/*.txt
    43. 43. © Hortonworks Inc. 2012 Page 48Uploading Files into HDFS$ hadoop fs –put filenameSrc filenameDest$ hadoop fs –put filename dirName/fileName$ hadoop fs –put foo bar$ hadoop fs –put foo dirName/fileName$ hadoop fs –lsr dirName
    44. 44. © Hortonworks Inc. 2012 Page 49Retrieving FilesNote: Another name for the –get command is -copyToLocal$ hadoop fs –cat foo$ hadoop fs –get foo LocalFoo$ hadoop fs –rmr directory|file
    45. 45. © Hortonworks Inc. 2013How MapReduce Works
    46. © Hortonworks Inc. 2012 Page 51 Basic MapReduce Architecture (figure: InputFormat → Map → Partitioner → Sort → Reduce → OutputFormat, with Map and Reduce running on cluster nodes over the Distributed File System (HDFS))
    47. © Hortonworks Inc. 2012 Page 52 Simple MapReduce (figure: input key|value pairs flow through Map to intermediate key|value pairs, which are grouped into collections of values per key and reduced to a result, some kind of collection)
    48. 48. © Hortonworks Inc. 2012 Page 53InputFormat• Determines how the data is split up• Creates InputSplit[] arrays– Each is individual map– Associated with a list of destination nodes• RecordReader– Makes key,value pairs– Converts data types
    49. © Hortonworks Inc. 2013 (figure slide)
    50. © Hortonworks Inc. 2012 Page 55
Partitioner
• Distributes key,value pairs
• Decides the target Reducer
– Uses the key to determine
– Uses a hash function by default
– Can be custom
getPartition(K2 key, V2 value, int numPartitions)
Reduce Phase
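The default hash-based partitioning can be sketched in pure Python. Hadoop's HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numPartitions; Python's hash() is not Java's hashCode(), so this only illustrates the idea that equal keys always land on the same reducer:

```python
def default_partition(key: str, num_reducers: int) -> int:
    # Mask to a non-negative value, then take the modulus, as Hadoop's
    # default HashPartitioner does with Java's hashCode().
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# Every occurrence of the same key is routed to the same reducer:
same = default_partition("hadoop", 4) == default_partition("hadoop", 4)
```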
    51. © Hortonworks Inc. 2012 Page 56 The Shuffle (figure: three Map tasks feed two Reduce tasks over HTTP) Reduce Phase
    52. 52. © Hortonworks Inc. 2012 Page 57Sort• Guarantees sorted inputs to Reducers• Final step of Shuffle• Helps to merge Reducer inputsReduce Phase
    53. 53. © Hortonworks Inc. 2012 Page 58Reduce• Receives output from many Mappers• Consolidates values for commonintermediate keys• Groups values by keyreduce(K2 key, Iterator<V2> values,OutputCollector<K3,V3> output,Reporter reporter)
    54. 54. © Hortonworks Inc. 2012 Page 59OutputFormat• Validator– For output specs• Sets up a RecordWriter– Which writes out to HDFS– Organizes output into part-0000x files
    55. © Hortonworks Inc. 2012 Page 60
A Basic MapReduce Job: map() implemented
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
    OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line);
  while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    output.collect(word, one);
  }
}
    56. © Hortonworks Inc. 2012 Page 61
A Basic MapReduce Job: reduce() implemented
private final IntWritable totalCount = new IntWritable();
public void reduce(Text key, Iterator<IntWritable> values,
    OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  int sum = 0;
  while (values.hasNext()) {
    sum +=;
  }
  totalCount.set(sum);
  output.collect(key, totalCount);
}
    57. © Hortonworks Inc. 2012 Page 62 Another Use Case – Inverted Index (figure: map output pairing terms such as news, sports, finance, email, shoes, books, operating-system, productivity and search with the URLs that contain them, e.g. www.facebook.com)
    58. 58. © Hortonworks Inc. 2013Pig
    59. © Hortonworks Inc. 2012 Page 64
What is Pig?
• Pig is an extension of Hadoop that simplifies the ability to query large HDFS datasets
• Pig is made up of two main components:
– A SQL-like data processing language called Pig Latin
– A compiler that compiles and runs Pig Latin scripts
• Pig was created at Yahoo! to make it easier to analyze the data in your HDFS without the complexities of writing a traditional MapReduce program
• With Pig, you can develop MapReduce jobs with a few lines of Pig Latin
    60. 60. © Hortonworks Inc. 2012 Page 65Pig In The EcoSystem• Pig runs on Hadoop utilizing both HDFS andMapReduce• By default, Pig reads and writes files from HDFS• Pig stores intermediate data among MapReduce jobsHDFSMapReducePigHCatalogHBase
    61. © Hortonworks Inc. 2012 Page 66
Running Pig
A Pig Latin script executes in three modes:
1. MapReduce: the code executes as a MapReduce application on a Hadoop cluster (the default mode)
2. Local: the code executes locally in a single JVM using a local text file (for development purposes)
3. Interactive: Pig commands are entered manually at a command prompt known as the Grunt shell
$ pig myscript.pig
$ pig -x local myscript.pig
$ pig
grunt>
    62. © Hortonworks Inc. 2012 Page 67
Understanding Pig Execution
• Pig Latin is a data flow language
• During execution each statement is processed by the Pig interpreter
• If a statement is valid, it gets added to a logical plan built by the interpreter
• The steps in the logical plan do not actually execute until a DUMP or STORE command
    63. © Hortonworks Inc. 2012 Page 68
A Pig Example
• The first three commands are built into a logical plan
• The STORE command triggers the logical plan to be built into a physical plan
• The physical plan will be executed as one or more MapReduce jobs
logevents = LOAD 'input/my.log' AS (date, level, code, message);
severe = FILTER logevents BY (level == 'severe' AND code >= 500);
grouped = GROUP severe BY code;
STORE grouped INTO 'output/severeevents';
    64. © Hortonworks Inc. 2012 Page 69
An Interactive Pig Session
• Command line history and editing
• Tab will complete commands (but not filenames)
• To exit, enter quit
$ pig
grunt> A = LOAD 'myfile';
grunt> DUMP A;
(output appears here)
    65. 65. © Hortonworks Inc. 2012 Page 70Pig Command Options• To see a full listing enter:pig –h or help• Execute-e or –execute-f scriptname or -filename scriptname• Specify a parameter setting-p or –parameterexample: -p key1=value1 –p key2=value2• List the properties that Pig will use if they are setby the user-h properties• Display the version-version
    66. 66. © Hortonworks Inc. 2012 Page 71Grunt – HDFS Commands• Grunt acts as a shell to access HDFS• Commands include:fs -lsfs -cat filenamefs -copyFromLocal localfile hdfsfilefs -copyToLocal hdfsfile localfilefs -rm filenamefs -mkdir dirnamefs -mv fromLocation/filename toLocation/filename
    67. 67. © Hortonworks Inc. 2012 Page 72Pig’s Data Model• 6 Scalar Types– int, long, float, double, chararray, bytearray• 3 Complex types– Tuple: ordered set of values• (F, 66, 41000, 95103)– Bag: unordered collection of tuples• { (F, 66, 41000, 95103), (M, 40, 14000, 95102) }– Map: collection of key value pairs• [name#Bob, age#34]
    68. © Hortonworks Inc. 2012 Page 73
Relations
• Pig Latin statements work with relations
• A relation is a bag of tuples
• Similar to a table in a relational database, where the tuples in the bag correspond to rows in a table
• Unlike rows in a table
– the tuples in a Pig relation do not have to contain the same number of fields
– nor do the fields have to be the same data type
• Relation schemas are optional
    69. © Hortonworks Inc. 2012 Page 74
Defining Relations
• The LOAD command loads data from a file into a relation. The syntax looks like:
alias = LOAD 'data';
where 'data' is either a filename or directory
• Use the AS option to define a schema for the relation:
alias = LOAD 'data' AS (name1:type, name2:type, ...);
• TIP: Use the DESCRIBE command to view the schema of a relation:
DESCRIBE alias;
    70. © Hortonworks Inc. 2012 Page 75
A Relation with a Schema
• Suppose we have the following data in HDFS:
Tom,21,94085,5000
John,45,95014,25000
Joe,21,94085,50000
Larry,45,95014,36000
Hans,21,94085,80000
• The data above represents the name, age, ZIP code and salary of employees
    71. © Hortonworks Inc. 2012 Page 76
A Relation with a Schema
• The following LOAD command defines a relation named employees with a schema:
employees = LOAD 'pig/input/File1' USING PigStorage(',') AS (name:chararray, age:int, zip:int, salary:double);
• The DESCRIBE command for employees outputs the following:
describe employees;
employees: {name: chararray, age: int, zip: int, salary: double}
    72. © Hortonworks Inc. 2012 Page 77
Default Schema Datatype
• If not specified, the data type of a field in a relation defaults to bytearray
• What will the data type be for each field in the following relation?
employees = LOAD 'pig/input/File1' USING PigStorage(',') AS (name:chararray, age, zip:int, salary);
    73. © Hortonworks Inc. 2012 Page 78
Using a Schema
• Defining a schema allows you to refer to the values of a relation by the name of the field in the schema
• Because we defined a schema for the employees relation, the FILTER command can refer to the second field in the relation by the name "age"
employees = LOAD 'pig/input/File1' USING PigStorage(',') AS (name:chararray, age:int, zip:int, salary:double);
newhires = FILTER employees BY age <= 21;
    74. © Hortonworks Inc. 2012 Page 79
Relations without a Schema
• If a relation does not define a schema, then Pig will simply load the data anyway (because "pigs eat anything")
employees = LOAD 'pig/input/File1' USING PigStorage(',');
DESCRIBE employees;
• The output of the above DESCRIBE command is:
Schema for employees unknown.
    75. © Hortonworks Inc. 2012 Page 80
Relations without a Schema
• Without a schema, a field is referenced by its position within the relation
• $0 is the first field, $1 is the second field, and so on
employees = LOAD 'pig/input/File1' USING PigStorage(',');
newhires = FILTER employees BY $1 <= 21;
DUMP newhires;
• The output of the above commands is:
(Tom,21,94085,5000.0)
(Joe,21,94085,50000.0)
(Hans,21,94085,80000.0)
    76. © Hortonworks Inc. 2012 Page 81
Nulls in Pig
• Data elements may be null
– Null means that the data element value is "undefined"
• In a LOAD command, null is automatically inserted for missing or invalid fields
• Example: LOAD 'a.txt' AS (a1:int, a2:int) USING PigStorage(',');
This data:   Is loaded as:
1,2,3        (1, 2)
5            (5, null)
6,bye        (6, null)
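Pig's rule of inserting null for missing or invalid fields can be simulated in pure Python. This sketches the semantics of LOAD 'a.txt' AS (a1:int, a2:int) with PigStorage(','); it is not Pig itself, and the helper name is illustrative:

```python
def load_row(line, n_fields=2, cast=int):
    # Mimic Pig's LOAD with a two-int schema: a missing or invalid field
    # becomes null (None); extra fields beyond the schema are dropped.
    parts = line.split(",")
    row = []
    for i in range(n_fields):
        try:
            row.append(cast(parts[i]))
        except (IndexError, ValueError):
            row.append(None)
    return tuple(row)

rows = [load_row(l) for l in ["1,2,3", "5", "6,bye"]]
# [(1, 2), (5, None), (6, None)]
```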
    77. © Hortonworks Inc. 2012 Page 82
The GROUP Operator
• The GROUP operator groups together tuples based on a specified key
• The usage for GROUP is: x = GROUP alias BY expression
– alias = the name of the existing relation that you want to group together
– expression = a tuple expression that is the key you want to group by
• The result of the GROUP command is a new relation
    78. © Hortonworks Inc. 2012 Page 83
A GROUP Example
employees = LOAD 'pig/input/File1' USING PigStorage(',') AS (name:chararray, age:int, zip:int, salary:double);
a = GROUP employees BY salary;
DESCRIBE a;
• The output of DESCRIBE is:
a: {group: double, employees: {(name: chararray, age: int, zip: int, salary: double)}}
    79. © Hortonworks Inc. 2012 Page 84
More GROUP Examples
daily = LOAD 'NYSE_daily' AS (exchange, stock, date, dividends);
grpd = GROUP daily BY (exchange, stock);
DESCRIBE grpd;
grpd: {group: (exchange: bytearray, stock: bytearray), daily: {(exchange: bytearray, stock: bytearray, date: bytearray, dividends: bytearray)}}
daily = LOAD 'NYSE_daily' AS (exchange, stock);
grpd = GROUP daily BY stock;
cnt = FOREACH grpd GENERATE group, COUNT(daily);
    80. © Hortonworks Inc. 2012 Page 85
The JOIN Operator
• The JOIN operator performs an inner join on two or more relations based on common field values
• The syntax for JOIN is: x = JOIN alias BY expression, alias BY expression, …
– alias = an existing relation
– expression = a field of the relation
• The result of JOIN is a flat set of tuples
    81. © Hortonworks Inc. 2012 Page 86
A JOIN Example
• Suppose we add another set of data that contains the employee's name and a phone number:
Tom,4085551211
Tom,6505550123
John,4085554332
Joe,4085559898
Joe,4085557777
    82. © Hortonworks Inc. 2012 Page 87
A JOIN Example
e1 = LOAD 'pig/input/File1' USING PigStorage(',') AS (name:chararray, age:int, zip:int, salary:double);
e2 = LOAD 'pig/input/File2' USING PigStorage(',') AS (name:chararray, phone:chararray);
e3 = JOIN e1 BY name, e2 BY name;
DESCRIBE e3;
• The output of the DESCRIBE above is:
e3: {e1::name: chararray, e1::age: int, e1::zip: int, e1::salary: double, e2::name: chararray, e2::phone: chararray}
    83. 83. © Hortonworks Inc. 2012 Page 88A JOIN Example• The JOIN output looks like:grunt> DUMP e3;(Joe,21,94085,50000.0,Joe,4085559898)(Joe,21,94085,50000.0,Joe,4085557777)(Tom,21,94085,5000.0,Tom,4085551211)(Tom,21,94085,5000.0,Tom,6505550123)(John,45,95014,25000.0,John,4085554332)
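Pig's inner JOIN, which emits one flat tuple per matching pair, can be simulated in pure Python (a sketch of the semantics, not Pig; the sample data abbreviates the slide's employee relations):

```python
def pig_join(left, right, lkey, rkey):
    # Inner join: for each pair of tuples with equal key fields, emit the
    # concatenated (flat) tuple, like JOIN e1 BY name, e2 BY name
    return [l + r for l in left for r in right if l[lkey] == r[rkey]]

e1 = [("Tom", 21), ("John", 45)]
e2 = [("Tom", "4085551211"), ("Tom", "6505550123"), ("Joe", "4085559898")]
e3 = pig_join(e1, e2, 0, 0)
# Tom matches twice; Joe has no match in e1 and is dropped
```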
    84. © Hortonworks Inc. 2012 Page 89
The FOREACH Operator
• The FOREACH operator transforms data into a new relation based on the columns of the data
• The syntax looks like: x = FOREACH alias GENERATE expression
– alias = an existing relation
– expression = an expression that determines the output
    85. © Hortonworks Inc. 2012 Page 90
A FOREACH Example
• The output of this example is a bag:
e1 = LOAD 'pig/input/File1' USING PigStorage(',') AS (name:chararray, age:int, zip:int, salary:double);
f = FOREACH e1 GENERATE age, salary;
DESCRIBE f;
DUMP f;
f: {age: int, salary: double}
(21,5000.0)
(45,25000.0)
(21,50000.0)
(45,36000.0)
(21,80000.0)
    86. © Hortonworks Inc. 2012 Page 91
Using FOREACH on Groups
e1 = LOAD 'pig/input/File1' USING PigStorage(',') AS (name:chararray, age:int, zip:int, salary:double);
g = GROUP e1 BY age;
DESCRIBE g;
g: {group: int, e1: {(name: chararray, age: int, zip: int, salary: double)}}
f = FOREACH g GENERATE group, SUM(e1.salary);
DESCRIBE f;
f: {group: int, double}
DUMP f;
(21,135000.0)
(45,61000.0)
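The GROUP-then-FOREACH-with-SUM pattern on this slide can be checked with a pure-Python simulation, using the salaries as they appear in the DUMP output (a sketch of the semantics, not Pig):

```python
from collections import defaultdict

def group_sum(rows, key_idx, val_idx):
    # Like: g = GROUP e1 BY age; f = FOREACH g GENERATE group, SUM(e1.salary);
    sums = defaultdict(float)
    for row in rows:
        sums[row[key_idx]] += row[val_idx]
    return sorted(sums.items())

employees = [("Tom", 21, 94085, 5000.0), ("John", 45, 95014, 25000.0),
             ("Joe", 21, 94085, 50000.0), ("Larry", 45, 95014, 36000.0),
             ("Hans", 21, 94085, 80000.0)]
result = group_sum(employees, 1, 3)
# [(21, 135000.0), (45, 61000.0)], matching the slide's DUMP f output
```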
    87. 87. © Hortonworks Inc. 2012 Page 92Pig Latin Structured Processing Flow• Pig Latin script describes a directed acyclic graph (DAG)– The edges are data flows and the nodes are operators thatprocess the dataLOADFOREACHJOINLOADFOREACH12 34 5 67
    88. 88. © Hortonworks Inc. 2013Thank You!Questions & AnswersPage 96