Pig - Processing XML data

This presentation demonstrates how to process XML data using Pig.

  1. Pig - Working with XML
     Ram Kedem
  2. Working with XML
     • This presentation demonstrates how to process XML data using Pig.
     • The data used to illustrate this topic is taken from the MSSQL Northwind database.
     • This presentation is based on:
       • hadoop-1.1.2
       • pig-0.11.1
     Results may vary under different versions of Pig or Hadoop.
     Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent ramkedem.com
  3. Sample Data
     <orders>
       <OrderID>10248</OrderID>
       <CustomerID>VINET</CustomerID>
       <EmployeeID>5</EmployeeID>
       <OrderDate>1996-07-04T14:15:14.257</OrderDate>
       <RequiredDate>1996-08-01T14:15:14.257</RequiredDate>
       <ShippedDate>1996-07-16T14:15:14.257</ShippedDate>
       <ShipCity>Reims</ShipCity>
       <ShipCountry>France</ShipCountry>
     </orders>
  4. Load Data – Move into HDFS
     Because we are not working in local mode, the first step is to copy the XML file into HDFS:
         hadoop fs -put ORDERS_XML.xml /user/hduser/orders/orders_xml.xml
  5. Load XML Using Pig
     Next, from the Pig shell, we register Piggybank and load the XML using XMLLoader:
         REGISTER '/home/hduser/Downloads/piggybank.jar' ;
         -- XMLLoader returns one record per <orders>...</orders> element
         LOAD_ORDERS = LOAD '/user/hduser/orders/orders_xml.xml'
             USING org.apache.pig.piggybank.storage.XMLLoader('orders')
             AS (mydoc:chararray);
  6. XML to Pig Structure
     • Next we translate the XML structure into a format Pig can understand.
     • This phase involves two steps:
       • Use a regular expression to translate the XML structure into a Pig "table" (GENERATE FLATTEN)
       • Map each column in that table and name it (AS)
  7. XML to Pig Structure
         -- One capture group per tag; \\s* absorbs the whitespace between tags
         CLEAN = FOREACH LOAD_ORDERS GENERATE FLATTEN(
             REGEX_EXTRACT_ALL(mydoc, '<orders>\\s*<OrderID>(.*)</OrderID>\\s*<CustomerID>(.*)</CustomerID>\\s*<EmployeeID>(.*)</EmployeeID>\\s*<OrderDate>(.*)</OrderDate>\\s*<RequiredDate>(.*)</RequiredDate>\\s*<ShippedDate>(.*)</ShippedDate>\\s*<ShipCity>(.*)</ShipCity>\\s*<ShipCountry>(.*)</ShipCountry>\\s*</orders>'))
             AS (OrderID:chararray, CustomerID:chararray, EmployeeID:chararray,
                 OrderDate:chararray, RequiredDate:chararray, ShippedDate:chararray,
                 ShipCity:chararray, ShipCountry:chararray) ;
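The extraction step can be checked outside Pig before running a job. The following is a minimal Python sketch, not part of the original deck: it rebuilds the same shape of pattern (one capture group per tag, optional whitespace between tags) and applies it to the sample record from the Sample Data slide. The names `FIELDS`, `PATTERN`, and `extract_order` are hypothetical helpers introduced only for this illustration.

```python
import re

# Sample record copied from the "Sample Data" slide.
SAMPLE = """<orders>
  <OrderID>10248</OrderID>
  <CustomerID>VINET</CustomerID>
  <EmployeeID>5</EmployeeID>
  <OrderDate>1996-07-04T14:15:14.257</OrderDate>
  <RequiredDate>1996-08-01T14:15:14.257</RequiredDate>
  <ShippedDate>1996-07-16T14:15:14.257</ShippedDate>
  <ShipCity>Reims</ShipCity>
  <ShipCountry>France</ShipCountry>
</orders>"""

# Tag names in the order they appear in the record (hypothetical helper list).
FIELDS = ["OrderID", "CustomerID", "EmployeeID", "OrderDate",
          "RequiredDate", "ShippedDate", "ShipCity", "ShipCountry"]

# One capture group per tag, joined by optional whitespace.
PATTERN = re.compile(
    r"<orders>\s*"
    + r"\s*".join(rf"<{name}>(.*)</{name}>" for name in FIELDS)
    + r"\s*</orders>"
)


def extract_order(doc):
    """Return {field: text} for one record, or None if the whole record does not match."""
    match = PATTERN.fullmatch(doc.strip())
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


if __name__ == "__main__":
    print(extract_order(SAMPLE))
```

The sketch uses a full match on purpose: a record with a missing or reordered tag yields no columns at all, which is worth knowing before relying on this approach for less regular XML.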
  8. Filter by Id (int)
     • From this step on we can start querying our data.
     • The first query filters by an int value. Because every column was mapped as chararray, we cast OrderID to int before comparing.
         CONV_CLEAN = FOREACH CLEAN GENERATE (int)OrderID, ShipCity, ShipCountry ;
         FILTER_CONV_CLEAN = FILTER CONV_CLEAN BY OrderID == 11066 ;
  9. Filter by date, using GetYear
     • The same applies to dates: we convert the string with ToDate, and this time we also use the GetYear function.
         CONV_CLEAN = FOREACH CLEAN GENERATE (int)OrderID, ShipCity, ShipCountry,
             GetYear(ToDate(ShippedDate)) AS YearShippedDate ;
         FILTER_CONV_CLEAN = FILTER CONV_CLEAN BY YearShippedDate == 2014 ;
         DUMP FILTER_CONV_CLEAN ;
  10. Comparing Dates using DaysBetween
     • DaysBetween returns the number of days from its second argument to its first. Here the value can be:
       • Negative (ShippedDate is before 1998-05-04)
       • Positive (ShippedDate is after 1998-05-04)
       • Equal to zero (ShippedDate is equal to 1998-05-04)
         CONV_CLEAN = FOREACH CLEAN GENERATE (int)OrderID, ShipCity, ShipCountry,
             ToDate(ShippedDate) AS GeneralDate,
             DaysBetween( ToDate(ShippedDate) , ToDate('1998-05-04', 'yyyy-MM-dd') ) ;
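The sign rules above can be illustrated with a small Python sketch. It is not Pig code, and it assumes that whole days are counted by truncating toward zero; Pig's exact rounding should be checked against its own documentation. The function `days_between` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def days_between(first, second):
    """Whole days from `second` to `first`, truncated toward zero (assumption of this sketch)."""
    return int((first - second) / timedelta(days=1))


# Reference date used on the slide.
REFERENCE = datetime(1998, 5, 4)

if __name__ == "__main__":
    for shipped in (datetime(1998, 5, 1, 14, 15),   # before the reference date
                    datetime(1998, 5, 4, 14, 15),   # same calendar day
                    datetime(1998, 5, 6, 14, 15)):  # after the reference date
        print(shipped.isoformat(), days_between(shipped, REFERENCE))
```

Under this assumption, a timestamp that carries a time of day can still give zero when it falls on the same calendar day as the reference, because less than one full day separates the two values.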
  11. Comparing Dates using DaysBetween
         CONV_CLEAN = FOREACH CLEAN GENERATE (int)OrderID, ShipCity, ShipCountry,
             ToDate(ShippedDate) AS GeneralDate ;
         -- Keep only the rows whose difference from the given date is zero days
         FILTER_CONV_CLEAN = FILTER CONV_CLEAN BY
             DaysBetween(GeneralDate, ToDate('2014-11-29', 'yyyy-MM-dd')) == (long)0 ;
         DUMP FILTER_CONV_CLEAN ;
  12. How many orders were shipped per year and month?
         GENERATE_CLEAN = FOREACH CLEAN GENERATE (int)OrderID,
             GetYear(ToDate(ShippedDate)) AS YearShippedDate ,
             GetMonth(ToDate(ShippedDate)) AS MonthShippedDate ;
         GROUP_CLEAN = GROUP GENERATE_CLEAN BY (YearShippedDate, MonthShippedDate) ;
         COUNT_ORDERS = FOREACH GROUP_CLEAN GENERATE group, COUNT(GENERATE_CLEAN.OrderID) AS Count ;
         -- Sort the (year, month) groups by their order count
         ORDER_COUNT_ORDERS = ORDER COUNT_ORDERS BY Count ;
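To make the group-then-count logic concrete, here is a hedged Python sketch of the same idea on a handful of timestamps. Only the first timestamp comes from the sample record; the others are made up for illustration, and `count_by_year_month` is a hypothetical helper rather than anything provided by Pig.

```python
from collections import Counter
from datetime import datetime

# First value is from the sample record; the rest are made-up illustration data.
SHIPPED = [
    "1996-07-16T14:15:14.257",
    "1996-07-23T14:15:14.257",
    "1996-08-02T14:15:14.257",
]


def count_by_year_month(timestamps):
    """Count timestamps per (year, month), sorted by ascending count."""
    counts = Counter()
    for text in timestamps:
        parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f")
        counts[(parsed.year, parsed.month)] += 1
    return sorted(counts.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for (year, month), total in count_by_year_month(SHIPPED):
        print(year, month, total)
```

As in the Pig statements, the grouping key is the (year, month) pair, and sorting happens only after the counts exist.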
  13. List occurrences of "Germany" / "GER" (case insensitive search)
         GENERATE_CLEAN = FOREACH CLEAN GENERATE (int)OrderID, ShipCity, ShipCountry ;
         -- UPPER makes both comparisons case insensitive
         FILTER_BY_SC = FILTER GENERATE_CLEAN BY
             UPPER(ShipCountry) == 'GERMANY' OR UPPER(ShipCountry) MATCHES '.*GER.*' ;
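The filter condition above can be tried on a few country names before running it in Pig. The sketch below is Python, not Pig, and mirrors the condition with an exact comparison plus a whole-string regular expression match; `matches_filter` is a hypothetical helper name.

```python
import re


def matches_filter(country):
    """Mirror the slide's condition: upper-case, then exact match OR whole-string regex match."""
    upper = country.upper()
    return upper == "GERMANY" or re.fullmatch(r".*GER.*", upper) is not None


if __name__ == "__main__":
    for name in ("Germany", "germany", "Algeria", "France"):
        print(name, matches_filter(name))
```

Two things stand out when trying it: the pattern also accepts any other name that contains "GER" (for example "Algeria"), and since "GERMANY" itself contains "GER", the exact comparison does not change the result.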
  14. Top 10 Orders (by ShippedDate)
         GENERATE_CLEAN = FOREACH CLEAN GENERATE (int)OrderID, ShipCity, ShipCountry ,
             ToDate(ShippedDate) AS ShippedDate ;
         -- Most recent shipments first, then keep the first 10 rows
         ORDER_LIST = ORDER GENERATE_CLEAN BY ShippedDate DESC ;
         TOP_10 = LIMIT ORDER_LIST 10 ;
  15. Top 3 ShippedDate per year
         GENERATE_CLEAN = FOREACH CLEAN GENERATE
             GetYear(ToDate(ShippedDate)) AS YearCreationDate ,
             ShipCity, ShipCountry, ToDate(ShippedDate) AS ShipDates ;
         GROUP_CLEAN = GROUP GENERATE_CLEAN BY YearCreationDate ;
         TOP_3 = FOREACH GROUP_CLEAN {
             -- TOP(n, column, bag): 3 tuples, compared on column index 3 (ShipDates, zero-based)
             RESULT = TOP(3, 3, GENERATE_CLEAN) ;
             GENERATE group, RESULT ;
         } ;
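The "top N within each group" pattern can be sketched in Python as follows. This is an illustration of the idea only, with made-up dates; `top_n_per_year` is a hypothetical helper, and the order in which Pig returns tuples inside the result bag may differ from the descending order used here.

```python
import heapq
from collections import defaultdict
from datetime import datetime


def top_n_per_year(shipped_dates, n=3):
    """Group dates by year and keep the n latest dates of each year, latest first."""
    by_year = defaultdict(list)
    for shipped in shipped_dates:
        by_year[shipped.year].append(shipped)
    return {year: heapq.nlargest(n, dates) for year, dates in by_year.items()}


if __name__ == "__main__":
    # Made-up illustration data.
    dates = [datetime(1996, 7, day) for day in (1, 5, 9, 20)] + [datetime(1997, 1, 3)]
    for year, latest in sorted(top_n_per_year(dates).items()):
        print(year, [d.date().isoformat() for d in latest])
```

A year with fewer than n shipments simply returns all of its dates.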
  16. Using DaysBetween
         GENERATE_CLEAN = FOREACH CLEAN GENERATE OrderID, ShipCity, ShipCountry,
             RequiredDate, OrderDate,
             DaysBetween(ToDate(RequiredDate), ToDate(OrderDate)) AS DaysBetween ;
         -- Work on at most 1000 rows, then list the 10 largest gaps
         GENERATE_CLEAN = LIMIT GENERATE_CLEAN 1000 ;
         ORDER_LIST = ORDER GENERATE_CLEAN BY DaysBetween DESC ;
         TOP_10 = LIMIT ORDER_LIST 10 ;
