GridSQL is commonly lumped in with replication solutions like Slony and Bucardo, but the open source GridSQL project actually allows PostgreSQL queries to be parallelized across many servers, letting performance scale nearly linearly. In this session, we will discuss the advantages of using GridSQL for large multi-terabyte data warehouses and how to design your PostgreSQL schemas and queries to leverage it. We will dig into how GridSQL plans a query that spans multiple PostgreSQL servers and executes it across those nodes. We will also cover performance expectations and where GridSQL should be deployed.
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters, by Xiao Qin
An increasing number of popular applications have become data-intensive in nature. In the past decade, the World Wide Web has been adopted as an ideal platform for developing data-intensive applications, since the communication paradigm of the Web is sufficiently open and powerful. Data-intensive applications like data mining and web indexing need to access ever-expanding data sets ranging from a few gigabytes to several terabytes or even petabytes. Google leverages the MapReduce model to process approximately twenty petabytes of data per day in a parallel fashion. In this talk, we introduce Google's MapReduce framework for processing huge datasets on large clusters. We first outline the motivations for the MapReduce framework. Then, we describe the dataflow of MapReduce. Next, we show a couple of example applications of MapReduce. Finally, we present our research project on the Hadoop Distributed File System.
The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous. Data locality is not taken into account when launching speculative map tasks, because it is assumed that most maps are data-local. Unfortunately, neither the homogeneity nor the data-locality assumption holds in virtualized data centers. We show that ignoring the data-locality issue in heterogeneous environments can noticeably reduce MapReduce performance. In this paper, we address the problem of placing data across nodes so that each node has a balanced data-processing load. Given a data-intensive application running on a Hadoop MapReduce cluster, our data placement scheme adaptively balances the amount of data stored on each node to improve data-processing performance. Experimental results on two real data-intensive applications show that our data placement strategy consistently improves MapReduce performance by rebalancing data across the nodes of a heterogeneous Hadoop cluster before running a data-intensive application.
Data Analytics and Simulation in Parallel with MATLAB*, by Intel® Software
This talk covers the current parallel capabilities in MATLAB*: its parallel language, distributed and tall arrays, and interacting with GPUs both on the desktop and in the cluster. It then combines these pieces into an algorithmic framework for data analysis and simulation.
A talk given by Julian Hyde at DataCouncil SF on April 18, 2019
How do you organize your data so that your users get the right answers at the right time? That question is a pretty good definition of data engineering — but it also describes the purpose of every DBMS (database management system). And it's not a coincidence that these are so similar.
This talk looks at the patterns that recur throughout data management — such as caching, partitioning, sorting, and derived data sets. As the speaker is the author of Apache Calcite, we first look at these patterns through the lens of relational algebra and DBMS architecture. Then we apply these patterns to the modern data pipeline, ETL, and analytics. As a case study, we look at how Looker's "derived tables" blur the line between ETL and caching, and leverage the power of cloud databases.
Apache Beam — a data engineer's ray of hope, Toruń JUG, 28.03.2018, by Piotr Wikiel
In the microservices world, the Lambda architecture has settled in for good. Many companies build both streaming and batch processing. There are many frameworks on the market (insofar as one can speak of a market in an open source context), but each has certain traits that make work harder, especially on large projects. Some are built for real-time processing, others do better on batch workloads. And some of them can be considered "rock-solid" only when run on Hadoop. The absence of these problems is not, however, Beam's main advantage. So what is? You'll find out at the presentation! We will cover topics such as the processing model, the use cases where Beam works well, and the execution environments. You will also see how to run Apache Beam jobs on Google Cloud Platform.
Advanced MapReduce - Apache Hadoop Big Data training, by Design Pathshala
Learn Hadoop and Big Data analytics: join Design Pathshala training programs on big data and analytics.
This deck covers advanced MapReduce concepts in Hadoop and Big Data.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
go-git is a 100% Go library used to interact with git repositories. Although it already supports most of git's functionality, it still lags somewhat in performance compared with the git CLI and some other libraries. I'll explain some of the problems we face when dealing with git repos and walk through some of the performance improvements made to the library.
Processing massive amounts of data with MapReduce using Apache Hadoop, by IndicThreads
Session presented at the 2nd IndicThreads.com Conference on Cloud Computing held in Pune, India on 3-4 June 2011.
http://CloudComputing.IndicThreads.com
Abstract: Processing massive amounts of data yields great insights for business analysis. Many core algorithms run over the data and produce information that can be used for business benefit and scientific research. Extracting and processing large amounts of data has become a primary concern in terms of time, processing power, and cost. The MapReduce algorithm promises to address these concerns: it makes computing over large data sets considerably easier and more flexible, and it offers high scalability across many computing nodes. This session will introduce the MapReduce algorithm, followed by a few variations of it, along with a hands-on MapReduce example using Apache Hadoop.
Speaker: Allahbaksh Asadullah is a Product Technology Lead at Infosys Labs, Bangalore. He has over 5 years of experience in the software industry across various technologies, and has worked extensively on GWT, Eclipse plugin development, Lucene, Solr, NoSQL databases, etc. He speaks at developer events such as ACM Compute, IndicThreads, and DevCamps.
1. Pig Latin: A Not-So-Foreign Language for Data Processing
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
Acknowledgement: slides adapted from the University of Waterloo
2. Motivation
You're a procedural programmer
You have huge data
You want to analyze it
3. Motivation
As a procedural programmer…
You may find writing queries in SQL unnatural and too restrictive
You are more comfortable writing code: a series of statements, as opposed to one long query (this is part of why MapReduce is so successful)
4. Motivation
Data analysis goals
Quick
Exploit the parallel processing power of a distributed system
Easy
Be able to write a program or query without a huge learning curve
Have some common analysis tasks predefined
Flexible
Transform data sets into a workable structure without much overhead
Perform customized processing
Transparent
Have a say in how the data processing is executed on the system
5. Motivation
Relational distributed databases
Parallel database products are expensive
Rigid schemas
Processing requires declarative SQL query construction
Map-Reduce
Relies on custom code for even common operations
Requires workarounds for tasks whose data flows differ from the expected Map-Combine-Reduce
6. Motivation
Sweet spot between relational distributed databases and Map-Reduce: take the best of both SQL and Map-Reduce, combining high-level declarative querying with low-level procedural programming… Pig Latin!
7. Pig Latin Example
Table urls: (url, category, pagerank)
Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category.
SQL:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6
Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
8. Outline
System Overview
Pig Latin (The Language)
Data Structures
Commands
Pig (The Compiler)
Logical & Physical Plans
Optimization
Efficiency
Pig Pen (The Debugger)
Conclusion
10. Data Model
Atom - a simple atomic value (e.g., a number or string)
Tuple - a sequence of fields; each field can be of any type
Bag - a collection of tuples; duplicates are possible, and tuples in a bag can have different field lengths and field types
Map - a collection of key-value pairs; the key is an atom, the value can be of any type
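To make the nesting concrete, here is a sketch of a single tuple combining all four types (the values are invented for illustration):
('alice', {('lakers', 1), ('iPod', 2)}, ['age' # 20])
'alice' is an atom, the second field is a bag of tuples, and the third is a map from the atomic key 'age' to the value 20.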
14. Data Model
Control over dataflow
Ex 1 (less efficient):
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
Ex 2 (more efficient):
highpgr_urls = FILTER urls BY pagerank > 0.8;
spam_urls = FILTER highpgr_urls BY isSpam(url);
Filtering on pagerank first means the (presumably expensive) isSpam UDF runs on far fewer tuples.
Fully nested
More natural for procedural programmers (the target user) than normalization
Data is often stored on disk in a nested fashion
Facilitates writing user-defined functions
No schema required
15. Data Model
User-Defined Functions (UDFs)
Can be used in many Pig Latin statements
Useful for custom processing tasks
Can use non-atomic values for input and output
Currently must be written in Java
Ex: spam_urls = FILTER urls BY isSpam(url);
16. Speaking Pig Latin
LOAD
Input is assumed to be a bag (a sequence of tuples)
Can specify a deserializer with USING
Can provide a schema with AS
newBag = LOAD 'filename'
<USING functionName()>
<AS (fieldName1, fieldName2, …)>;
Example:
queries = LOAD 'query_log.txt'
USING myLoad()
AS (userID, queryString, timeStamp);
17. Speaking Pig Latin
FOREACH
Apply some processing to each tuple in a bag
Each generated field can be:
A field name of the bag
A constant
A simple expression (e.g., f1+f2)
A predefined function (e.g., SUM, AVG, COUNT, FLATTEN)
A UDF (e.g., sumTaxes(gst, pst))
newBag = FOREACH bagName GENERATE field1, field2, …;
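To make FOREACH concrete, here is a sketch using the queries bag loaded above; expandQuery is a hypothetical UDF that maps one query string to a bag of expanded queries, and FLATTEN unnests that bag into top-level tuples:
expanded_queries = FOREACH queries GENERATE userID, FLATTEN(expandQuery(queryString));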
18. Speaking Pig Latin
FILTER
Select a subset of the tuples in a bag
newBag = FILTER bagName BY expression;
The expression uses simple comparison operators (==, !=, <, >, …) and logical connectors (AND, NOT, OR)
some_apples = FILTER apples BY colour != 'red';
Can use UDFs:
some_apples = FILTER apples BY NOT isRed(colour);
19. Speaking Pig Latin
COGROUP
Group two datasets together by a common attribute
Groups data into nested bags
grouped_data = COGROUP results BY queryString, revenue BY queryString;
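As a sketch of the resulting shape (the schemas and values here are invented for illustration; assume results holds (queryString, url, position) and revenue holds (queryString, amount)), each output tuple carries the group key plus one nested bag per input dataset:
('lakers', {('lakers', 'nba.com', 1)}, {('lakers', 0.50)})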
20. Speaking Pig Latin
Why COGROUP and not JOIN?
url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRev(results, revenue));
21. Speaking Pig Latin
Why COGROUP and not JOIN?
May want to process nested bags of tuples before taking the cross product.
Keeps to the goal of a single high-level data transformation per Pig Latin statement.
However, the JOIN keyword is still available:
JOIN results BY queryString, revenue BY queryString;
Equivalent:
temp = COGROUP results BY queryString, revenue BY queryString;
join_result = FOREACH temp GENERATE FLATTEN(results), FLATTEN(revenue);
22. Speaking Pig Latin
STORE (& DUMP)
Output data to a file (or the screen)
STORE bagName INTO 'filename' <USING serializer()>;
Other commands (partial list):
UNION - return the union of two or more bags
CROSS - take the cross product of two or more bags
ORDER - order tuples by specified field(s)
DISTINCT - eliminate duplicate tuples in a bag
LIMIT - limit results to a subset
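A brief sketch chaining a few of these commands over the urls table from the earlier example (the exact LIMIT syntax shown follows later Pig releases, so treat it as an assumption):
cats = FOREACH urls GENERATE category;
dcats = DISTINCT cats;               -- eliminate duplicate categories
top10 = LIMIT dcats 10;              -- keep only a subset
STORE top10 INTO 'categories.dat';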
23. Compilation
The Pig system does two tasks:
Builds a Logical Plan from a Pig Latin script
Supports execution-platform independence
No processing of data is performed at this stage
Compiles the Logical Plan to a Physical Plan and executes it
Converts the Logical Plan into a series of Map-Reduce jobs to be executed (in this case) by Hadoop Map-Reduce
24. Compilation
Building a Logical Plan
Verify that the input files and bags referred to are valid
Create a logical plan for each bag (variable) defined
25. Compilation
Building a Logical Plan Example
A = LOAD 'user.dat' AS (name, age, city);
B = GROUP A BY city;
C = FOREACH B GENERATE group AS city, COUNT(A);
D = FILTER C BY city IS 'kitchener' OR city IS 'waterloo';
STORE D INTO 'local_user_count.dat';
The plan grows one operator per statement: Load(user.dat) → Group → Foreach → Filter
30. Compilation
Building a Physical Plan
Compilation to a physical plan only happens when output is specified by STORE or DUMP.
Logical plan so far: Load(user.dat) → Group → Foreach → Filter
31. Compilation
Building a Physical Plan
Step 1: Create a map-reduce job for each COGROUP (the Group operator sits at the map/reduce boundary)
Step 2: Push other commands into the map and reduce functions where possible (here, the Foreach and Filter that follow the Group are pushed into the reduce function)
Certain commands may require their own map-reduce job (e.g., ORDER needs separate map-reduce jobs)
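As an aside, later Pig releases let you inspect these plans directly from the Grunt shell; a sketch (EXPLAIN is not covered on these slides):
EXPLAIN D;   -- prints the logical, physical, and map-reduce plans for D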
33. Compilation
Efficiency in Execution
Parallelism
Loading data - files are loaded from HDFS
Statements are compiled into map-reduce jobs
34. Compilation
Efficiency with Nested Bags
In many cases, the nested bags created in each tuple of a COGROUP statement never need to physically materialize
Aggregation is generally performed after a COGROUP, and the statements for that aggregation are pushed into the reduce function
This applies to algebraic functions (e.g., COUNT, MAX, MIN, SUM, AVG)
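A sketch reusing the grouped_data bag from the COGROUP example above: because COUNT and SUM are algebraic, Pig can evaluate this FOREACH in the combine and reduce stages without materializing the nested bags (the amount field of revenue is an assumed schema, as in the earlier sketch):
per_query = FOREACH grouped_data GENERATE group, COUNT(results), SUM(revenue.amount);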
38. Compilation
Efficiency with Nested Bags
Why this works: COUNT is an algebraic function; it can be structured as a tree of sub-functions, with each leaf working on a subset of the data
(Figure: the combine stage computes partial COUNTs; the reduce stage SUMs them)
39. Compilation
Efficiency with Nested Bags
Pig provides an interface for writing algebraic UDFs so they can take advantage of this optimization as well
Inefficiencies:
Non-algebraic aggregate functions (e.g., MEDIAN) need the entire bag to materialize; this may cause a very large bag to spill to disk if it doesn't fit in memory
Every map-reduce job requires data to be written and replicated to HDFS (although this is offset by the parallelism achieved)
40. Debugging
How do you verify the semantics of an analysis program?
Run the program against the whole data set - might take hours!
Generate a sample dataset - but an empty result may occur for operations like join and filter, so testing with a sample dataset is generally difficult
Pig-Pen
Samples data from the large dataset for Pig statements
Applies individual Pig Latin commands against the sample
In case of an empty result, the Pig system re-samples
Removes redundant samples
44. Debugging
Pig-Pen
Provides sample data that is:
Real - taken from the actual data
Concise - as small as possible
Complete - collectively illustrates the key semantics of each command
Helps with schema definition
Facilitates incremental program writing
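This example-generation idea later surfaced in Apache Pig itself as the ILLUSTRATE command, which shows small concrete samples flowing through each statement (a sketch; exact availability per version is an assumption):
ILLUSTRATE D;   -- show example tuples at each step of the plan behind D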
45. Conclusion
Pig is a data processing environment in Hadoop that is specifically targeted at procedural programmers who perform large-scale data analysis.
Pig Latin offers high-level data manipulation in a procedural style.
Pig-Pen is a debugging environment for Pig Latin commands that generates samples from real data.
46. More Info
Pig, http://hadoop.apache.org/pig/
Hadoop, http://hadoop.apache.org
Anks-thay!