This document discusses Pig Latin, a language for analyzing large datasets in Hadoop. It is intended for procedural programmers who may find SQL restrictive. Pig Latin allows writing MapReduce jobs as a series of statements rather than a single query. The document outlines Pig Latin's data model, commands such as LOAD, FILTER and JOIN, and how the Pig system compiles scripts to optimized physical plans. It also introduces Pig Pen, a tool for debugging Pig Latin scripts using sample datasets.
Lec_4_1_IntrotoPIG.pptx
1. Pig Latin: A Not-So-Foreign
Language for Data Processing
2. Motivation
You're a procedural programmer
You have huge data
You want to analyze it
3. Motivation
As a procedural programmer…
May find writing queries in SQL unnatural and too restrictive
More comfortable with writing code; a series of statements as
opposed to a long query. (Ex: MapReduce is so successful).
4. Motivation
Data analysis goals
Quick
Exploit parallel processing power of a distributed system
Easy
Be able to write a program or query without a huge learning curve
Have some common analysis tasks predefined
Flexible
Transform a data set(s) into a workable structure without much
overhead
Perform customized processing
Transparent
Have a say in how the data processing is executed on the system
5. Motivation
Relational Distributed Databases
Parallel database products expensive
Rigid schemas
Processing requires declarative SQL query construction
Map-Reduce
Relies on custom code for even common operations
Need to do workarounds for tasks whose data flow differs from
the expected Map → Combine → Reduce
6. Motivation
Sweet spot between Relational Distributed Databases and Map-Reduce:
take the best of both SQL and Map-Reduce; combine high-level
declarative querying with low-level procedural programming… Pig Latin!
7. Pig Latin Example
Table urls: (url, category, pagerank)
Find, for each sufficiently large category, the average pagerank of
high-pagerank urls in that category
SQL:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 1000000
Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output = FOREACH big_groups GENERATE category,
AVG(good_urls.pagerank);
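To make the step-by-step dataflow of the Pig Latin version concrete, here is a minimal Python sketch of the same four steps running in memory on one machine. The function name, parameter names and sample rows are illustrative assumptions, not part of Pig; the thresholds are parameters so the sketch can be tried on tiny data.

```python
from collections import defaultdict
from statistics import mean

def avg_pagerank_of_big_categories(urls, min_pagerank=0.2, min_count=10**6):
    """urls: iterable of (url, category, pagerank) tuples."""
    # Step 1 (FILTER): keep only high-pagerank urls.
    good_urls = [u for u in urls if u[2] > min_pagerank]
    # Step 2 (GROUP): one bag of tuples per category.
    groups = defaultdict(list)
    for u in good_urls:
        groups[u[1]].append(u)
    # Step 3 (FILTER): keep only sufficiently large groups.
    big_groups = {c: bag for c, bag in groups.items() if len(bag) > min_count}
    # Step 4 (FOREACH ... GENERATE): average pagerank per remaining category.
    return {c: mean(u[2] for u in bag) for c, bag in big_groups.items()}

# Made-up sample rows for illustration only.
sample_urls = [
    ("a.example", "news", 1.0),
    ("b.example", "news", 0.5),
    ("c.example", "news", 0.1),
    ("d.example", "blog", 0.3),
]
result = avg_pagerank_of_big_categories(sample_urls, min_count=1)
```

Each intermediate name corresponds to one Pig Latin statement, which is the point of the example: the program reads as a sequence of small transformations rather than one nested query.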
8. Outline
System Overview
Pig Latin (The Language)
Data Structures
Commands
Pig (The Compiler)
Logical & Physical Plans
Optimization
Efficiency
Pig Pen (The Debugger)
Conclusion
10. Data Model
Atom - simple atomic value (e.g., a number or string)
Tuple
Bag
Map
11. Data Model
Atom
Tuple - sequence of fields; each field any type
Bag
Map
12. Data Model
Atom
Tuple
Bag - collection of tuples
Duplicates possible
Tuples in a bag can have different field lengths and field types
Map
13. Data Model
Atom
Tuple
Bag
Map - collection of key-value pairs
Key is an atom; value can be any type
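As a rough analogy only, the four types can be mirrored with built-in Python values. The mapping below (str/number for atom, tuple for tuple, list of tuples for bag, dict for map) and all sample values are assumptions made for illustration; Pig's own types are not Python objects.

```python
# Atom: a simple atomic value such as a number or a string.
atom = "alice"

# Tuple: a sequence of fields; each field may be of any type.
tup = ("alice", 30)

# Bag: a collection of tuples. Duplicates are possible, and tuples in the
# same bag may differ in number of fields and in field types.
bag = [("alice", 30), ("alice", 30), ("bob",)]

# Map: a collection of key-value pairs. Keys are atoms; values can be any
# type, including a nested bag.
mapping = {"name": "alice", "visited": [("example.com", 3), ("example.org",)]}

# Full nesting: a tuple whose fields are an atom, a bag and a map.
nested = ("alice", bag, mapping)
```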
14. Data Model
Control over dataflow
Ex 1 (less efficient)
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
Ex 2 (more efficient)
highpgr_urls = FILTER urls BY pagerank > 0.8;
spam_urls = FILTER highpgr_urls BY isSpam(url);
Fully nested
More natural for procedural programmers (target user) than
normalization
Data is often stored on disk in a nested fashion
Facilitates ease of writing user-defined functions
No schema required
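The two filter orderings on this slide give the same answer but do different amounts of work. The Python sketch below counts how often a stand-in for the isSpam UDF is called under each ordering. It assumes the UDF is the expensive step; the `is_spam` logic and the sample rows are hypothetical.

```python
def run_filters(urls, pagerank_first):
    """Return (result, number of is_spam calls) for one ordering of the filters."""
    calls = 0

    def is_spam(url):
        # Hypothetical stand-in for the isSpam UDF; assumed to be costly.
        nonlocal calls
        calls += 1
        return "spam" in url

    def high_pagerank(row):
        return row[1] > 0.8

    if pagerank_first:
        # Ex 2: cheap comparison first, UDF only on the survivors.
        stage1 = [r for r in urls if high_pagerank(r)]
        result = [r for r in stage1 if is_spam(r[0])]
    else:
        # Ex 1: UDF on every row, cheap comparison afterwards.
        stage1 = [r for r in urls if is_spam(r[0])]
        result = [r for r in stage1 if high_pagerank(r)]
    return result, calls

# Made-up (url, pagerank) rows.
urls = [
    ("spam-a.example", 0.9),
    ("ok-b.example", 0.95),
    ("spam-c.example", 0.1),
    ("ok-d.example", 0.2),
]
ex1_result, ex1_calls = run_filters(urls, pagerank_first=False)
ex2_result, ex2_calls = run_filters(urls, pagerank_first=True)
```

On this sample, Ex 1 calls the UDF once per input row, while Ex 2 calls it only for rows that already passed the pagerank test. Because Pig Latin is written as a sequence of steps, the programmer chooses this order directly.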
15. Data Model
User-Defined Functions (UDFs)
Can be used in many Pig Latin statements
Useful for custom processing tasks
Can use non-atomic values for input and output
Currently must be written in Java
Ex: spam_urls = FILTER urls BY isSpam(url);
16. Speaking Pig Latin
LOAD
Input is assumed to be a bag (sequence of tuples)
Can specify a deserializer with "USING"
Can provide a schema with "AS"
newBag = LOAD 'filename'
<USING functionName()>
<AS (fieldName1, fieldName2, …)>;
Queries = LOAD 'query_log.txt'
USING myLoad()
AS (userID, queryString, timeStamp);
17. Speaking Pig Latin
FOREACH
Apply some processing to each tuple in a bag
Each field can be:
A fieldname of the bag
A constant
A simple expression (e.g., f1+f2)
A predefined function (e.g., SUM, AVG, COUNT, FLATTEN)
A UDF (e.g., sumTaxes(gst, pst))
newBag =
FOREACH bagName
GENERATE field1, field2, …;
18. Speaking Pig Latin
FILTER
Select a subset of the tuples in a bag
newBag = FILTER bagName
BY expression;
Expression uses simple comparison operators (==, !=, <, >, …)
and logical connectors (AND, NOT, OR)
some_apples =
FILTER apples BY colour != 'red';
Can use UDFs
some_apples =
FILTER apples BY NOT isRed(colour);
19. Speaking Pig Latin
COGROUP
Group two datasets together by a common attribute
Groups data into nested bags
grouped_data = COGROUP results BY queryString,
revenue BY queryString;
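To show what "nested bags" means here, the following Python sketch groups two in-memory datasets by a shared key and keeps each side's tuples in its own bag. The helper name `cogroup`, the field layouts and the sample rows are illustrative assumptions, not Pig's implementation.

```python
from collections import defaultdict

def cogroup(left, right, left_key, right_key):
    """Return {key: (bag of left tuples, bag of right tuples)}."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[left_key(t)][0].append(t)
    for t in right:
        groups[right_key(t)][1].append(t)
    return dict(groups)

# Made-up rows: results = (queryString, url, rank),
# revenue = (queryString, adSlot, amount).
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]

grouped_data = cogroup(results, revenue, lambda t: t[0], lambda t: t[0])
```

Each entry of `grouped_data` holds one key together with two separate bags, so no cross product has been taken yet.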
20. Speaking Pig Latin
Why COGROUP and not JOIN?
url_revenues =
FOREACH grouped_data GENERATE
FLATTEN(distributeRev(results, revenue));
21. Speaking Pig Latin
Why COGROUP and not JOIN?
May want to process nested bags of tuples before taking the
cross product.
Keeps to the goal of a single high-level data transformation per
Pig Latin statement.
However, the JOIN keyword is still available:
JOIN results BY queryString,
revenue BY queryString;
Equivalent to:
temp = COGROUP results BY queryString,
revenue BY queryString;
join_result = FOREACH temp GENERATE
FLATTEN(results), FLATTEN(revenue);
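The equivalence above can be checked on small data. This Python sketch groups both inputs by key and then flattens the two bags of each group, which amounts to taking their cross product within the group. The function name and sample rows are assumptions for illustration; the key is assumed to be the first field of each tuple.

```python
from collections import defaultdict
from itertools import product

def join_via_cogroup(left, right, key=lambda t: t[0]):
    # COGROUP step: one (left bag, right bag) pair per key.
    temp = defaultdict(lambda: ([], []))
    for t in left:
        temp[key(t)][0].append(t)
    for t in right:
        temp[key(t)][1].append(t)
    # FLATTEN step: flattening both bags pairs every left tuple with every
    # right tuple of the same group; a group with an empty bag yields nothing.
    return [l + r for lbag, rbag in temp.values() for l, r in product(lbag, rbag)]

# Made-up rows: results = (queryString, url, rank),
# revenue = (queryString, adSlot, amount).
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]

join_result = join_via_cogroup(results, revenue)
```

With two "lakers" tuples on each side and one "kings" tuple on each side, the join has 2 × 2 + 1 × 1 rows.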
22. Speaking Pig Latin
STORE (& DUMP)
Output data to a file (or screen)
STORE bagName INTO 'filename'
<USING serializer()>;
Other Commands (incomplete)
UNION - return the union of two or more bags
CROSS - take the cross product of two or more bags
ORDER - order tuples by a specified field(s)
DISTINCT - eliminate duplicate tuples in a bag
LIMIT - Limit results to a subset
23. Compilation
Pig system does two tasks:
Builds a Logical Plan from a Pig Latin script
Supports execution platform independence
No processing of data performed at this stage
Compiles the Logical Plan to a Physical Plan and executes it
Converts the Logical Plan into a series of Map-Reduce jobs to
be executed (in this case) by Hadoop Map-Reduce
24. Compilation
Building a Logical Plan
Verify input files and bags referred to are valid
Create a logical plan for each bag (variable) defined
25. Compilation
Building a Logical Plan Example
A = LOAD 'user.dat' AS (name, age, city);
B = GROUP A BY city;
C = FOREACH B GENERATE group AS city,
COUNT(A);
D = FILTER C BY city == 'kitchener'
OR city == 'waterloo';
STORE D INTO 'local_user_count.dat';
Logical plan so far: Load(user.dat)
26. Compilation
Building a Logical Plan Example
(Script unchanged from slide 25.)
Logical plan so far: Load(user.dat) → Group
27. Compilation
Building a Logical Plan Example
(Script unchanged from slide 25.)
Logical plan so far: Load(user.dat) → Group → Foreach
28. Compilation
Building a Logical Plan Example
(Script unchanged from slide 25.)
Logical plan so far: Load(user.dat) → Group → Foreach → Filter
29. Compilation
Building a Logical Plan Example
(Script unchanged from slide 25.)
Logical plan, with the filter now placed ahead of the group: Load(user.dat) → Filter → Group → Foreach
30. Compilation
Building a Physical Plan
(Script unchanged from slide 25.)
Logical plan: Load(user.dat) → Filter → Group → Foreach
Only happens when output is specified by STORE or DUMP
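The idea that statements only extend a plan, and that data is touched only once output is requested, can be sketched in Python as follows. The `Relation` class and its methods are hypothetical names for illustration and do not reflect Pig's internal classes.

```python
class Relation:
    """Records operators in a plan; nothing runs until store() is called."""

    def __init__(self, plan):
        self.plan = plan  # list of (operator name, function over rows)

    @staticmethod
    def load(rows):
        return Relation([("Load", lambda _: list(rows))])

    def filter(self, pred):
        step = ("Filter", lambda rows: [r for r in rows if pred(r)])
        return Relation(self.plan + [step])

    def foreach(self, fn):
        step = ("Foreach", lambda rows: [fn(r) for r in rows])
        return Relation(self.plan + [step])

    def describe(self):
        # The plan can be inspected without processing any data.
        return " -> ".join(name for name, _ in self.plan)

    def store(self):
        # Execution is triggered only here, like STORE or DUMP.
        rows = None
        for _, op in self.plan:
            rows = op(rows)
        return rows

# Made-up (name, age, city) rows.
users = [("ann", 30, "waterloo"), ("bob", 25, "toronto")]
local = Relation.load(users).filter(lambda r: r[2] == "waterloo").foreach(lambda r: r[0])
```

Calling `local.describe()` returns the recorded plan, while `local.store()` is the first point at which any row is read.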
31. Compilation
Building a Physical Plan
Step 1: Create a map-reduce job for each COGROUP
[Diagram: the plan Load(user.dat) → Filter → Group → Foreach, annotated with Map and Reduce phases]
32. Compilation
Building a Physical Plan
Step 1: Create a map-reduce job for each COGROUP
Step 2: Push other commands into the map and reduce functions where possible
Certain commands may require their own map-reduce job (e.g., ORDER needs separate map-reduce jobs)
[Diagram: the plan Load(user.dat) → Filter → Group → Foreach, annotated with Map and Reduce phases]
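The two steps above can be sketched for a simple linear plan. This Python function is a simplified model under stated assumptions: operators are plain strings, every GROUP or COGROUP starts one job and marks its map/reduce boundary, commands before the first grouping go into its map function, and commands after a grouping go into that job's reduce function. It ignores commands such as ORDER that need jobs of their own, and it is not Pig's actual compiler.

```python
def to_mapreduce_jobs(operators):
    """Split a linear list of operator names into map-reduce jobs."""
    jobs = []
    before_first_group = []
    for op in operators:
        if op in ("GROUP", "COGROUP"):
            # Step 1: each (CO)GROUP gets its own job and is the boundary
            # between that job's map and reduce functions.
            jobs.append({"map": before_first_group, "boundary": op, "reduce": []})
            before_first_group = []
        elif jobs:
            # Step 2: commands after a (CO)GROUP run in its reduce function.
            jobs[-1]["reduce"].append(op)
        else:
            # Step 2: commands before the first (CO)GROUP run in its map function.
            before_first_group.append(op)
    if not jobs and before_first_group:
        # No grouping at all: a single map-only job.
        jobs.append({"map": before_first_group, "boundary": None, "reduce": []})
    return jobs

plan = ["LOAD", "FILTER", "GROUP", "FOREACH"]
jobs = to_mapreduce_jobs(plan)
```

For the example plan this yields one job whose map function does the load and filter and whose reduce function does the foreach.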
33. Compilation
Efficiency in Execution
Parallelism
Loading data - Files are loaded from HDFS
Statements are compiled into map-reduce jobs
34. Compilation
Efficiency with Nested Bags
In many cases, the nested bags created in each tuple of a COGROUP
statement never need to physically materialize
Generally perform aggregation after a COGROUP and the
statements for said aggregation are pushed into the reduce function
Applies to algebraic functions (e.g., COUNT, MAX, MIN, SUM, AVG)
38. Compilation
Efficiency with Nested Bags
Why this works:
COUNT is an algebraic function; it can be structured as a tree of sub-functions
with each leaf working on a subset of the data
Combine: COUNT, COUNT (one per subset of the data)
Reduce: SUM (of the partial counts)
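The Combine/Reduce tree for COUNT can be written out in a few lines of Python. The function names and the split of the data into chunks are illustrative assumptions; AVG is included as a second algebraic example, using (sum, count) pairs as the partial result.

```python
def count_combine(chunk):
    # Leaf: count one subset of the data.
    return len(chunk)

def count_reduce(partial_counts):
    # Root: the total count is the SUM of the partial counts.
    return sum(partial_counts)

def avg_combine(chunk):
    # Leaf: a (sum, count) pair is enough to finish the average later.
    return (sum(chunk), len(chunk))

def avg_reduce(partials):
    total = sum(s for s, _ in partials)
    n = sum(c for _, c in partials)
    return total / n

# Made-up data, already split into three subsets.
chunks = [[1, 2, 3], [4, 5], [6]]
total_count = count_reduce([count_combine(c) for c in chunks])
overall_avg = avg_reduce([avg_combine(c) for c in chunks])
```

Because each leaf needs only its own subset, the full bag never has to be assembled in one place.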
39. Compilation
Efficiency with Nested Bags
Pig provides an interface for writing algebraic UDFs so they can take
advantage of this optimization as well.
Inefficiencies
Non-algebraic aggregate functions (e.g., MEDIAN) need the entire bag to
materialize; may cause a very large bag to spill to disk if it doesn't fit
in memory
Every map-reduce job requires data to be written and replicated to
HDFS (although this is offset by the parallelism achieved)
40. Debugging
How to verify the semantics of an analysis program
Run the program against the whole data set. Might take hours!
Generate a sample dataset
An empty result set may occur for some operations, such as join or filter
In general, testing with a sample dataset is difficult
Pig-Pen
Samples data from the large dataset for Pig statements
Applies individual Pig-Latin commands against the sample dataset
In case of an empty result, the Pig system resamples
Removes redundant samples
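The resampling idea can be illustrated with a short Python sketch: draw a small sample, run one command on it, and draw again if the output is empty. This assumes plain random resampling with a fixed retry limit; the function name and parameters are hypothetical, and the real Pig-Pen may choose its samples differently.

```python
import random

def sample_with_nonempty_result(data, command, sample_size, max_tries=20, seed=0):
    """Return (sample, output) for the first sample whose output is non-empty."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        sample = rng.sample(data, min(sample_size, len(data)))
        output = command(sample)
        if output:
            return sample, output
    # Give up after max_tries; the command may be too selective to illustrate.
    return None, []

# Made-up data and a filter-like command that keeps even numbers.
data = list(range(1000))
keep_even = lambda rows: [x for x in rows if x % 2 == 0]
sample, output = sample_with_nonempty_result(data, keep_even, sample_size=5)
```

A small sample with a non-empty result lets the programmer see what the command does without running it against the whole data set.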
44. Debugging
Pig-Pen
Provides sample data that is:
Real - taken from actual data
Concise - as small as possible
Complete - collectively illustrate the key semantics of each command
Helps with schema definition
Facilitates incremental program writing
45. Conclusion
Pig is a data processing environment in Hadoop that is
specifically targeted towards procedural programmers
who perform large-scale data analysis.
Pig-Latin offers high-level data manipulation in a
procedural style.
Pig-Pen is a debugging environment for Pig-Latin
commands that generates samples from real data.
46. More Info
Pig, http://hadoop.apache.org/pig/
Hadoop, http://hadoop.apache.org
Anks-
Thay!