Apache Pig
 

Introduction to Apache PIG

  • BIG Data refers to datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualization. What is needed is a data management system that is highly available, reliable, transparent, high performance, scalable, accessible, secure, usable, and inexpensive.
  • Source: Wikipedia
  • Source: Internet (Googling)
  • Facebook statistics. URL: http://www.facebook.com/press/info.php?factsheet
  • Img source: Yahoo Hadoop website, “Pig Makes Hadoop Easy to Drive”
    Pig vs. Hive: http://developer.yahoo.com/blogs/hadoop/posts/2010/08/pig_and_hive_at_yahoo/
    Pig vs. SQL: http://developer.yahoo.com/blogs/hadoop/posts/2010/01/comparing_pig_latin_and_sql_fo/
  • Input: User profiles, Page visits. Find the top 5 most visited pages by users aged 18-25.
  • http://developer.yahoo.com/blogs/hadoop/posts/2010/01/comparing_pig_latin_and_sql_fo/
    In SQL:
      insert into ValuableClicksPerDMA
      select dma, count(*)
      from geoinfo join (
        select name, ipaddr
        from users join clicks on (users.name = clicks.user)
        where value > 0
      ) using ipaddr
      group by dma;
    The Pig Latin for this will look like:
      Users = load 'users' as (name, age, ipaddr);
      Clicks = load 'clicks' as (user, url, value);
      ValuableClicks = filter Clicks by value > 0;
      UserClicks = join Users by name, ValuableClicks by user;
      Geoinfo = load 'geoinfo' as (ipaddr, dma);
      UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
      ByDMA = group UserGeo by dma;
      ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
      store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
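The same filter–join–group–count dataflow can be sketched in plain Python on toy in-memory data (the table contents below are invented for illustration; this is only a sketch of the semantics, not how Pig executes it):

```python
# Illustrative only: mimics the Pig Latin dataflow above on invented toy data.
from collections import Counter

users = [("alice", 22, "1.1.1.1"), ("bob", 30, "2.2.2.2")]           # name, age, ipaddr
clicks = [("alice", "/a", 3), ("alice", "/b", 0), ("bob", "/c", 1)]  # user, url, value
geoinfo = [("1.1.1.1", "NYC"), ("2.2.2.2", "SF")]                    # ipaddr, dma

# ValuableClicks = filter Clicks by value > 0
valuable = [c for c in clicks if c[2] > 0]

# UserClicks = join Users by name, ValuableClicks by user
user_clicks = [(u, c) for u in users for c in valuable if u[0] == c[0]]

# UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr
user_geo = [(u, c, g) for (u, c) in user_clicks for g in geoinfo if u[2] == g[0]]

# ByDMA = group UserGeo by dma; generate group, COUNT(UserGeo)
clicks_per_dma = Counter(g[1] for (_, _, g) in user_geo)
print(dict(clicks_per_dma))  # {'NYC': 1, 'SF': 1}
```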
  • SerDe - Serializer/Deserializer
  • Execution Types
    Pig has two execution types or modes: local mode and Hadoop mode.
    Local mode (pig -x local)
    In local mode, Pig runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets and when trying out Pig. Local mode does not use Hadoop. In particular, it does not use Hadoop's local job runner; instead, Pig translates queries into a physical plan that it executes itself.
    Hadoop mode
    In Hadoop mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster. Hadoop mode (with a fully distributed cluster) is what you use when you want to run Pig on large datasets.
  • import java.io.IOException;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class WordCount {
      public static void main(String[] args) {
        try {
          PigServer pigServer = new PigServer(ExecType.LOCAL);
          pigServer.registerJar("/mylocation/tokenize.jar");
          runMyQuery(pigServer, "myinput.txt");
        } catch (IOException e) {
          e.printStackTrace();
        }
      }

      public static void runMyQuery(PigServer pigServer, String inputFile) throws IOException {
        pigServer.registerQuery("A = load '" + inputFile + "' using TextLoader();");
        pigServer.registerQuery("B = foreach A generate flatten(tokenize($0));");
        pigServer.registerQuery("C = group B by $1;");
        pigServer.registerQuery("D = foreach C generate flatten(group), COUNT(B.$0);");
        pigServer.store("D", "myoutput");
      }
    }
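As a rough sketch of what that embedded pipeline computes, here is the equivalent word count in plain Python (a simple whitespace split stands in for the custom tokenize UDF, and the input lines are invented for illustration):

```python
# Illustrative word count mirroring the Pig pipeline: tokenize, flatten, group, count.
from collections import Counter

lines = ["apache pig makes hadoop easy", "pig runs on hadoop"]  # stands in for myinput.txt

# foreach A generate flatten(tokenize($0)): one record per word
words = [w for line in lines for w in line.split()]

# group by word, then COUNT each group
counts = Counter(words)
print(counts["pig"], counts["hadoop"])  # 2 2
```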
  • Pig vs. RDBMS data model:
      PIG   | RDBMS
      Atom  ~ Cell
      Tuple ~ Row
      Bag   ~ Table
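The correspondence can be sketched with plain Python types (an assumed, simplified representation: atoms as scalars, tuples as Python tuples, bags as lists of tuples):

```python
# A bag is a collection of tuples; each tuple holds atoms (scalar values).
atom = "Krishna"                          # ~ a cell
tup = (1, "Krishna", 234000000.0)         # ~ a row
bag = [tup, (124163, "Shashi", 10000.0)]  # ~ a table

# Unlike an RDBMS table, a bag may nest: a tuple field can itself be a bag.
nested = (1, "Krishna", [("phone", "123"), ("phone", "456")])
print(len(bag), nested[2][0][0])  # 2 phone
```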
  • Example contents of 'employee.txt', a tab-delimited text file (id, name, salary, dept):
      1       Krishna      234000000  none
      2       Krishna_01   234000000  none
      124163  Shashi       10000      cloud
      124164  Gopal        1000000    setlabs
      124165  Govind       1000000    setlabs
      124166  Ram          450000     es
      124167  Madhusudhan  450000     e&r
      124168  Hari         6500000    e&r
      124169  Sachith      50000      cloud
  • Example contents of 'people.txt', a tab-delimited text file (same layout as 'employee.txt'):
      1       Krishna      234000000  none
      2       Krishna_01   234000000  none
      124163  Shashi       10000      cloud
      124164  Gopal        1000000    setlabs
      124165  Govind       1000000    setlabs
      124166  Ram          450000     es
      124167  Madhusudhan  450000     e&r
      124168  Hari         6500000    e&r
      124169  Sachith      50000      cloud

    -- Loading data from people.txt into the emps bag, with a schema
    emps = LOAD 'people.txt' AS (id:int, name:chararray, salary:double, dept:chararray);
    -- Filtering the data as required
    rich = FILTER emps BY $2 > 100000;
    -- Sorting
    srtd = ORDER rich BY salary DESC;
    -- Storing the final results
    STORE srtd INTO 'rich_people.txt';
    -- Or alternatively, dump the records to the screen
    DUMP srtd;

    Import data using SQOOP:
    1. Import movie
       sqoop import \
         --connect jdbc:mysql://localhost/movielens \
         --table movie --fields-terminated-by '\t' \
         --username training --password training
    2. Import movierating
       sqoop import \
         --connect jdbc:mysql://localhost/movielens \
         --table movierating --fields-terminated-by '\t' \
         --username training --password training
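The LOAD → FILTER → ORDER steps above can be sketched in plain Python over a few of the same sample rows (a sketch of the semantics only, not of Pig's execution):

```python
# Mimics: rich = FILTER emps BY $2 > 100000; srtd = ORDER rich BY salary DESC.
emps = [
    (1, "Krishna", 234000000.0, "none"),
    (124163, "Shashi", 10000.0, "cloud"),
    (124164, "Gopal", 1000000.0, "setlabs"),
    (124169, "Sachith", 50000.0, "cloud"),
]

rich = [e for e in emps if e[2] > 100000]              # FILTER by salary ($2)
srtd = sorted(rich, key=lambda e: e[2], reverse=True)  # ORDER BY salary DESC
print([e[1] for e in srtd])  # ['Krishna', 'Gopal']
```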
  • The PARALLEL keyword only affects the number of reduce tasks. Map parallelism is determined by the input file: one map for each HDFS block. By default, at most 2 map or reduce tasks can run on a machine simultaneously.
    grunt> personal = load 'personal.txt' as (empid, name, phonenumber);
    grunt> official = load 'official.txt' as (empid, dept, dc);
    grunt> joined = join personal by empid, official by empid;
    grunt> dump joined;
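The join above can be sketched in plain Python (the rows for personal.txt and official.txt are hypothetical):

```python
# Mimics: joined = join personal by empid, official by empid.
personal = [(1, "Krishna", "555-0100"), (2, "Shashi", "555-0101")]  # empid, name, phonenumber
official = [(1, "setlabs", "DC1"), (2, "cloud", "DC2")]             # empid, dept, dc

# Inner join on empid: matching tuples are concatenated, as Pig's JOIN does.
joined = [p + o for p in personal for o in official if p[0] == o[0]]
print(joined[0])  # (1, 'Krishna', '555-0100', 1, 'setlabs', 'DC1')
```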
  • http://pig.apache.org/docs/r0.7.0/udf.html
    Eval Functions
    Eval is the most common type of function.
    How to write? UPPER extends EvalFunc<String>
    Code snippet:
      -- myscript.pig
      REGISTER myudfs.jar;
      A = LOAD 'employee_data' AS (id: int, name: chararray, salary: double, dept: chararray);
      B = FOREACH A GENERATE myudfs.UPPER(name);
      DUMP B;
    Sample UDF:
      package myudfs;
      import java.io.IOException;
      import org.apache.pig.EvalFunc;
      import org.apache.pig.data.Tuple;
      import org.apache.pig.impl.util.WrappedIOException;

      public class UPPER extends EvalFunc<String> {
        public String exec(Tuple input) throws IOException {
          if (input == null || input.size() == 0)
            return null;
          try {
            String str = (String) input.get(0);
            return str.toUpperCase();
          } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
          }
        }
      }
    How to execute the above script?
      java -cp pig.jar org.apache.pig.Main -x local myscript.pig
    or
      pig -x local myscript.pig
    Note: myudfs.jar should be in the classpath!

    Aggregate Functions
    An aggregate function is an eval function that takes a bag and returns a scalar value. One interesting and useful property of many aggregate functions is that they can be computed incrementally in a distributed fashion. Aggregate functions are usually applied to grouped data.
    How to write? COUNT extends EvalFunc<Long> implements Algebraic
    Ex: COUNT, AVG (built-in)

    Filter Functions
    Filter functions are eval functions that return a boolean value. Filter functions can be used anywhere a Boolean expression is appropriate, including the FILTER operator.
    Ex: IsEmpty (built-in)
    How to write? IsEmpty extends FilterFunc
    How to use it? D = FILTER C BY not IsEmpty(A);

    Load/Store Functions
    The load/store user-defined functions control how data goes into Pig and comes out of Pig. Often, the same function handles both input and output, but that does not have to be the case.
    Ex: PigStorage (built-in)
    How to write?
      LOAD: SimpleTextLoader extends LoadFunc
      STORE: SimpleTextStorer extends StoreFunc
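The per-tuple semantics of the UPPER eval function can be sketched in plain Python: applied once per input tuple, with null or empty inputs passed through as null (the sample bag below is invented for illustration):

```python
# Mimics UPPER.exec(): take the first field of a tuple, return it uppercased,
# and return None for null/empty input tuples, as the Java UDF does.
def upper(input_tuple):
    if input_tuple is None or len(input_tuple) == 0:
        return None
    return str(input_tuple[0]).upper()

# FOREACH A GENERATE myudfs.UPPER(name) over a small bag of name tuples:
bag = [("Krishna",), ("Shashi",), None]
print([upper(t) for t in bag])  # ['KRISHNA', 'SHASHI', None]
```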
  • Tuple: an ordered list of data. A tuple has fields, numbered 0 through (number of fields - 1). The entry in a field can be of any datatype, or it can be null. Tuples are constructed only by a TupleFactory; a DefaultTupleFactory is provided by the system. If users wish to use their own type of Tuple, they should also provide an implementation of TupleFactory to construct their Tuples.
