Hadoop: HDFS

HDFS is a distributed, scalable filesystem designed to store large files. In combination with the Hadoop JobTracker it provides data locality. By default it replicates every block to 3 data nodes; preferably two copies are stored on two data nodes within the same rack and one on a node in another rack.
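The placement policy can be sketched as a toy model in Python (this is not HDFS's actual BlockPlacementPolicy; the rack map and node names are invented for illustration):

```python
import random

def place_replicas(racks, local_node, replication=3):
    """Toy HDFS-style rack-aware placement: first replica on the
    writer's node, the remaining two on two different nodes of one
    other rack."""
    node_rack = {n: r for r, nodes in racks.items() for n in nodes}
    local_rack = node_rack[local_node]
    targets = [local_node]
    # pick a remote rack that can hold the two remaining replicas
    remote_rack = random.choice(
        [r for r in racks if r != local_rack and len(racks[r]) >= 2])
    targets += random.sample(racks[remote_rack], 2)
    return targets

# invented cluster topology
racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5"]}
replicas = place_replicas(racks, "n1")
# two replicas share a remote rack, one stays on the writer's rack
```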
Hadoop Pig

data = LOAD 'employee.csv' USING PigStorage(',') AS
    (first_name:chararray, last_name:chararray, age:int,
     wage:float, department:chararray);
grouped_by_department = GROUP data BY department;
total_wage_by_department = FOREACH grouped_by_department GENERATE
    group AS department,
    COUNT(data) AS employee_count,
    SUM(data.wage) AS total_wage;
total_ordered = ORDER total_wage_by_department BY total_wage;
total_limited = LIMIT total_ordered 10;
DUMP total_limited;
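What the script computes can be mimicked in plain Python as a sanity check of the GROUP/FOREACH/ORDER/LIMIT pipeline (the sample rows are invented):

```python
from collections import defaultdict

# invented sample rows: (first_name, last_name, age, wage, department)
rows = [
    ("Ada", "Lovelace", 36, 3000.0, "engineering"),
    ("Grace", "Hopper", 40, 3500.0, "engineering"),
    ("Edgar", "Codd", 45, 2800.0, "research"),
]

# GROUP data BY department, then COUNT and SUM per group
totals = defaultdict(lambda: [0, 0.0])  # department -> [count, total_wage]
for _, _, _, wage, department in rows:
    totals[department][0] += 1
    totals[department][1] += wage

# ORDER ... BY total_wage (ascending, as in Pig), then LIMIT 10
report = sorted(
    ((dept, cnt, total) for dept, (cnt, total) in totals.items()),
    key=lambda t: t[2],
)[:10]
```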
books = LOAD 'books.csv.bz2' USING PigStorage(',') AS
    (book_id:int, book_name:chararray, author_name:chararray);
book_sales = LOAD 'book_sales.csv.bz2' USING PigStorage(',') AS
    (book_id:int, price:float, country:chararray);
-- books = FILTER books BY (author_name MATCHES '.*Pamuk.*');
data = JOIN books BY book_id, book_sales BY book_id PARALLEL 12;
grouped_by_book = GROUP data BY books::book_name;
total_sales_by_book = FOREACH grouped_by_book GENERATE
    group AS book,
    COUNT(data) AS sales_volume,
    SUM(data.book_sales::price) AS total_sales;
STORE total_sales_by_book INTO 'book_sale_results';
UDFs

● Custom Load and Store classes
● HBase
● Protocol Buffers
● CombinedLog
● Custom extraction, e.g. dates, ...

Take a look at the PiggyBank.
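A custom extraction UDF such as the date example can also be written in Python and registered through Pig's Jython support; a minimal sketch (the file name myudfs.py, the relation and field names, and the timestamp format are assumptions):

```python
# hypothetical myudfs.py -- in Pig it would be registered and used as:
#   REGISTER 'myudfs.py' USING jython AS myfuncs;
#   years = FOREACH logs GENERATE myfuncs.extract_year(ts);
from datetime import datetime

# in a real Jython UDF you would also annotate the function with
#   @outputSchema("year:int")
def extract_year(timestamp):
    """Pull the year out of a '2013-05-17 10:22:01'-style string."""
    if timestamp is None:
        return None  # Pig passes NULL fields as None
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").year

extract_year("2013-05-17 10:22:01")  # → 2013
```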