Practical Hadoop using Pig
So you want to get started with Hadoop, but how? This session will show you how to get started with Hadoop development using Pig. Prior Hadoop experience is not needed.
Thursday, May 8th, 02:00pm-02:50pm

Usage Rights

CC Attribution-ShareAlike License

Practical Hadoop using Pig: Presentation Transcript

  • Practical Hadoop with Pig Dave Wellman #openwest @dwellman
  • How does it all work? HDFS Hadoop Shell MR Data Structures Pig Commands Pig Example
  • HDFS
  • HDFS has 3 main actors
  • The Name Node The Name Node is “The Conductor”. It directs the performance of the cluster.
  • The Data Nodes: A Data Node stores blocks of data. Clusters can contain thousands of Data Nodes. *Yahoo has a 40,000 node cluster.
  • The Client The client is a window to the cluster.
  • The Name Node
  • The heart of the System. Maintains a virtual File Directory. Tracks all the nodes. Listens for “heartbeats” and “Block Reports” (more on this later). If the NameNode is down, the cluster is offline.
  • Storing Data
  • The Data Nodes
  • Add a Data Node: The Data Node says “Hello” to the Name Node. The Name Node offers the Data Node a handshake with version requirements. The Data Node either replies “Okay” or shuts down. The Name Node hands the Data Node a NodeId that it remembers. The Data Node is now part of the cluster and checks in with the Name Node every 3 seconds.
  • Data Node Heartbeat: The “check-in” is a simple HTTP request/response. This check-in is a very important communication protocol that guarantees the health of the cluster. Block Reports tell the Name Node what data the node holds and whether it is okay. The Name Node controls the Data Nodes by issuing orders when they check in and report their status: replicate data, delete data, verify data. The same process applies to every node in the cluster.
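You can see the result of these heartbeats and block reports from the client side; a minimal sketch, assuming a running cluster with the hadoop client on your path:

    # Ask the Name Node for its current view of the cluster
    > hadoop dfsadmin -report
    # The report lists each Data Node along with its capacity and the
    # time of its last contact (heartbeat) with the Name Node.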
  • Writing Data
  • The client “tells” the NameNode the virtual directory location for the file. The client breaks the file into 64MB “blocks”. The client “asks” the NameNode where the blocks go. The client “streams” the blocks, in parallel, to the DataNodes. The DataNodes tell the NameNode they have the data via the block report. The NameNode then tells the DataNodes where to replicate each block.
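One way to watch this from the shell is to copy a file in and then ask the Name Node how it was laid out. A minimal sketch, assuming a running cluster; the file and path names are only illustrative:

    # The client splits the file into blocks and streams them to Data Nodes
    > hadoop fs -copyFromLocal bigfile.log /user/hadoop/bigfile.log
    # Ask the Name Node which blocks make up the file and which Data Nodes hold them
    > hadoop fsck /user/hadoop/bigfile.log -files -blocks -locations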
  • Reading Data
  • The client tells the NameNode it would like to read a file. The NameNode replies with the list of blocks and the nodes each block is on. The client requests the first block from a DataNode and compares the checksum of the block against the manifest from the NameNode. The client moves on to the next block in the sequence until the whole file has been read.
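From the client's side the whole read path is hidden behind a single command; a sketch, with an illustrative path:

    # The client fetches the block list from the NameNode, then pulls each
    # block from a DataNode, verifying checksums along the way
    > hadoop fs -cat /user/hadoop/bigfile.log
    # Or copy the file back to the local file system
    > hadoop fs -copyToLocal /user/hadoop/bigfile.log ./bigfile.log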
  • Failure Recovery
  • A Data Node fails to “check-in”. After 10 minutes the Name Node gives up on that Data Node. When another node that holds blocks originally assigned to the lost node checks in, the Name Node sends it a block replication command, and that Data Node replicates the block of data (just like a write).
  • Interacting with Hadoop HDFS Shell Commands
  • HDFS Shell Commands. > hadoop fs -ls <args> Same as the Unix or OS X ls command. /user/hadoop/file1 /user/hadoop/file2 ...
  • HDFS Shell Commands. > hadoop fs -mkdir <path> Creates directories in HDFS at the given path.
  • HDFS Shell Commands. > hadoop fs -copyFromLocal <localsrc> URI Copy a file from your client to HDFS. Similar to the put command, except that the source is restricted to a local file reference.
  • HDFS Shell Commands. > hadoop fs -cat <path> Copies source paths to stdout.
  • HDFS Shell Commands. > hadoop fs -copyToLocal URI <localdst> Copy a file from HDFS to your client. Similar to the get command, except that the destination is restricted to a local file reference.
  • HDFS Shell Commands. cat chgrp chmod chown copyFromLocal copyToLocal cp du dus expunge get getmerge ls lsr mkdir moveFromLocal mv put rm rmr setrep stat tail test text touchz
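Put together, a typical session with these commands looks something like this (a sketch; the paths and file names are only illustrative):

    > hadoop fs -mkdir /user/hadoop/logs
    > hadoop fs -copyFromLocal access.log /user/hadoop/logs/access.log
    > hadoop fs -ls /user/hadoop/logs
    > hadoop fs -cat /user/hadoop/logs/access.log
    > hadoop fs -copyToLocal /user/hadoop/logs/access.log ./access.copy.log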
  • Map Reduce Data Structures Basic, Tuples & Bags
  • Basic Data Types: Strings, Integers, Doubles, Longs, Byte, Boolean, etc. Advanced Data Types: Tuples and Bags
  • Tuples are JSON-like and simple. raw_data: { date_time: bytearray, seconds: bytearray }
  • Bags hold Tuples and Bags element: { date_time: bytearray, seconds: bytearray, group: chararray, ordered_list: { date: chararray, hour: chararray, score: long } }
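A bag like the one above is exactly what a GROUP produces. A minimal sketch in Pig, reusing the raw_data schema from the earlier slide:

    -- raw_data: { date_time: bytearray, seconds: bytearray }
    grouped = GROUP raw_data BY date_time;
    describe grouped;
    -- grouped: { group: bytearray, raw_data: { (date_time: bytearray, seconds: bytearray) } }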
  • Expert Advice: Always know your data structures. They are the foundation for all Map Reduce operations. Complex (deep) data structures will kill -9 performance. Keep them simple!
  • Processing Data Interacting with Pig using Grunt
  • GRUNT Grunt is a command line interface used to debug pig jobs. Similar to Ruby IRB or Groovy CLI. Grunt is your best weapon against bad pigs. pig -x local Grunt> |
  • GRUNT Grunt> describe Element Describe will display the data structure of an Element Grunt> dump Element Dump will display the data represented by an Element
  • GRUNT > describe raw_data Produces the output: raw_data: { date_time: bytearray, items: bytearray }
  • GRUNT > dump raw_data You can dump terabytes of data to your screen, so be careful. (05/10/2011 20:30:00.0,0) (05/10/2011 20:45:00.0,0) (05/10/2011 21:00:00.0,0) (05/10/2011 21:15:00.0,0) ...
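A short Grunt session that puts describe and dump together; a sketch, assuming a local tab-separated file named data.tsv with the two columns used above. LIMIT keeps the dump small:

    > pig -x local
    grunt> raw_data = LOAD 'data.tsv' USING PigStorage('\t') AS (date_time, seconds);
    grunt> describe raw_data;
    grunt> sample_rows = LIMIT raw_data 10;
    grunt> dump sample_rows;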
  • Pig Programs Map Reduce Made Simple
  • Most PIG commands are assignments. • The element names the collection of records that exist out in the cluster. • It’s not a traditional programming variable. • It describes the data from the operation. • It does not change. Element = Operation;
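A small illustration of that idea; the relation names and input file are only illustrative:

    raw_data = LOAD 'data.tsv' USING PigStorage('\t') AS (date_time: chararray, items: int);
    non_zero = FILTER raw_data BY items > 0;
    -- raw_data still describes the full data set; non_zero describes the filtered view.
    -- Each assignment names a new relation; nothing is mutated in place.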
  • The SET command Used to set a hadoop job variable. Like the name of your pig job. SET job.name 'Day over Day - [$input]’;
  • The REGISTER and DEFINE commands -- Setup udf jars REGISTER $jar_prefix/sidekick-hadoop-0.0.1.jar; DEFINE BUCKET_FORMAT_DATE com.sidekick.hadoop.udf.UnixTimeFormatter('MM/dd/yyyy HH:mm', 'HH');
  • The LOAD USING command -- load in the data from HDFS raw_data = LOAD '$input' USING PigStorage('\t') AS (date_time, items);
  • The FILTER BY command Selects tuples from a relation based on some condition. -- filter to the week we want broadcast_week = FILTER bucket_list BY (date >= '03-Oct-2011') AND (date <= '10-Oct-2011');
  • The GROUP BY command Groups the data in one or multiple relations. daily_stats = GROUP broadcast_week BY (date, hour);
  • The FOREACH command Generates data transformations based on columns of data. bucket_list = FOREACH raw_data GENERATE FLATTEN(DATE_FORMAT_DATE(date_time)) AS date, MINUTE_BUCKET(date_time) AS hour, MAX_ITEMS(items) AS items; *DATE_FORMAT_DATE is a user defined function, an advanced topic we’ll come to in a minute.
  • The GENERATE command Use the FOREACH GENERATE operation to work with columns of data. bucket_list = FOREACH raw_data GENERATE FLATTEN(DATE_FORMAT_DATE(date_time)) AS date, MINUTE_BUCKET(date_time) AS hour, MAX_ITEMS(items) AS items;
  • The FLATTEN command FLATTEN substitutes the fields of a tuple in place of the tuple. traffic_stats = FOREACH daily_stats GENERATE FLATTEN(GROUP), COUNT(broadcast_week) AS cnt, SUM(broadcast_week.items) AS total;
  • The STORE INTO USING command Store functions determine how data is stored after a Pig job. -- All done, now store it STORE final_results INTO '$output' USING PigStorage();
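Stitching the fragments from the preceding slides into one script, the whole job reads roughly like this. It is a sketch of the flow rather than the exact demo: the jar, the UnixTimeFormatter class, and the $input/$output/$jar_prefix parameters are the ones shown on the slides; the UnixTimeFormatter is wired up here under the DATE_FORMAT_DATE name used in the FOREACH, and the MINUTE_BUCKET and MAX_ITEMS UDFs would need DEFINE lines of their own.

    SET job.name 'Day over Day - [$input]';

    -- Setup udf jars
    REGISTER $jar_prefix/sidekick-hadoop-0.0.1.jar;
    DEFINE DATE_FORMAT_DATE com.sidekick.hadoop.udf.UnixTimeFormatter('MM/dd/yyyy HH:mm', 'HH');

    -- load in the data from HDFS
    raw_data = LOAD '$input' USING PigStorage('\t') AS (date_time, items);

    -- bucket each record by date and hour
    bucket_list = FOREACH raw_data GENERATE
        FLATTEN(DATE_FORMAT_DATE(date_time)) AS date,
        MINUTE_BUCKET(date_time) AS hour,
        MAX_ITEMS(items) AS items;

    -- filter to the week we want
    broadcast_week = FILTER bucket_list BY (date >= '03-Oct-2011') AND (date <= '10-Oct-2011');

    -- group by day and hour, then count and total the items
    daily_stats = GROUP broadcast_week BY (date, hour);
    traffic_stats = FOREACH daily_stats GENERATE
        FLATTEN(group),
        COUNT(broadcast_week) AS cnt,
        SUM(broadcast_week.items) AS total;

    -- All done, now store it
    STORE traffic_stats INTO '$output' USING PigStorage();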
  • Demo Time! “Because it’s all a big lie until someone demos the code.” - Genghis Khan
  • Thank You. - Genghis Khan