High-level languages for Big Data Analytics
Jose Luis Lopez Pino
June 19, 2013
This work presents a review of the literature on the high-level languages that have appeared since the MapReduce programming model and its Hadoop implementation shook up parallel programming over huge datasets. MapReduce was a major step forward in the field, but it has severe limitations that the high-level programming languages try to overcome in different ways.
Our work compares three of the main high-level languages (Pig Latin, HiveQL and Jaql) based on four criteria that we consider highly relevant and for which the existing studies are, in our opinion, consistent. Those criteria are expressive power, performance, query processing and join implementation.
Analysis based on multiple criteria reveals the differences between the languages, but it shows that none of the languages analysed (Pig Latin, HiveQL and Jaql) beats all the others in every criterion. Depending on the scenario or application, we need to consider these comparison results to choose which language will be the most suitable for implementation.
Finally, we are going to address two very well-known pitfalls of MapRe-
duce: latency and implementation of complex algorithms.
1.1 The MapReduce programming model
The MapReduce programming model was introduced by Google in 2004. This model allows programmers without any experience in parallel coding to write highly scalable programs and hence process voluminous datasets. This high level of scalability is reached thanks to the decomposition of the problem into a large number of tasks.
MapReduce is a completely diﬀerent approach to big data analysis and it
has been proven eﬀective in large cluster systems. This model is based on two
functions that are coded by the user: map and reduce.
• The Map function takes a single key/value pair as input and produces a set of key/value pairs.
• The Reduce function takes a key and a set of values related to this key as input; it might also produce a set of values, but commonly it emits only one value, or none, as output.
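To make the model concrete, the canonical word-count example can be sketched as a single-machine simulation in Python (the names map_fn, reduce_fn and run_mapreduce are illustrative, not part of any Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    # Map: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all the values emitted for the same key.
    yield (word, sum(counts))

def run_mapreduce(records, mapper, reducer):
    # The framework's job: run the mapper, then shuffle (sort and group
    # the intermediate pairs by key), then run the reducer per key.
    pairs = [kv for k, v in records for kv in mapper(k, v)]
    pairs.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        output.extend(reducer(key, (v for _, v in group)))
    return output

result = dict(run_mapreduce([(0, "to be or not to be")], map_fn, reduce_fn))
# result == {"be": 2, "not": 1, "or": 1, "to": 2}
```

In the real framework the map and reduce calls run on different machines, but the dataflow is exactly this: the programmer supplies only the two functions.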
Using this model, programmers do not have to care about how the data is distributed, how to handle failures or how to balance the system. However, this model also has important drawbacks: it is complicated to code very simple and common tasks using this single dataflow, many tasks are expensive to perform, the user code is difficult to debug, there are no schemas or indexes, and a lot of network bandwidth might be consumed. The purpose of the different high-level languages is to address some of those shortcomings.
Nowadays, the generation of data per day is measured in petabytes. These large amounts of data have made it necessary to store data in more than one system at a time. This means partitioning the data and storing it on separate machines. File systems that manage the storage across a network of machines are called distributed file systems (DFS). The challenging aspect of a DFS is fault tolerance, since failures can lead to data loss. Hadoop is a solution to these challenges.
Doug Cutting, the creator of Hadoop, named it after his son's toy elephant. Hadoop has two layers: a storage layer and an execution layer. The storage layer is the Hadoop Distributed File System (HDFS), designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. The default HDFS block size is 64 MB. The execution layer is Hadoop MapReduce, and it is responsible for running a job in parallel on multiple servers at the same time.
Since the Hadoop cluster nodes are commodity hardware, the cost is very low and the architecture is very simple. A typical Hadoop cluster has a single master server, called the name node, that manages the file system, while the job tracker process manages the jobs. This node should be high-quality hardware with high processing speed. The cluster also has multiple slave nodes, called data nodes, that run the tasks on their local servers.
In general the data is replicated (3 times by default) on different nodes to support fault tolerance. If any of these data nodes fails, the master node detects it, immediately re-replicates the affected data on active nodes, and updates its metadata.
2 High level languages
After MapReduce was publicly announced and the Hadoop framework was created, multiple high-level languages have been created specifically to deal with some of the problems of the model mentioned before. Some already existing languages have also been integrated to work over this model, like R (MapR, Ricardo) and SQL.
Concerning the selection of the languages for our comparison, we have chosen the three programming languages that are present in all the comparisons (Pig Latin, HiveQL and Jaql) for the sake of consistency, but we also consider it important to mention some other interesting high-level query languages (Meteor and DryadLINQ). Additionally, we study how each query is processed by the system, which translates it into a MapReduce workflow.
2.1 Pig Latin
Pig Latin is executed over Hadoop, an open-source implementation of the MapReduce programming model, and was originally developed at Yahoo! Research. It is a high-level procedural language that implements high-level operations similar to those that we can find in SQL, plus some other interesting operators listed below:
• FOREACH to apply a transformation to every tuple of the set. To make it possible to parallelise this operation, the transformation of one row must not depend on any other row.
• COGROUP to group related tuples of multiple datasets. It is similar to
the ﬁrst step of a join.
• LOAD to load the input data and its structure and STORE to save data
in a ﬁle.
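COGROUP's semantics, grouping the tuples of several inputs under a common key without flattening them into joined rows, can be sketched in Python (the cogroup function and the sample data are illustrative):

```python
from collections import defaultdict

def cogroup(left, right, key=lambda t: t[0]):
    # Group tuples from both datasets under their shared key. Unlike a
    # join, each key maps to one bag of tuples per input dataset.
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[key(t)][0].append(t)
    for t in right:
        groups[key(t)][1].append(t)
    return dict(groups)

owners = [("alice", "cat"), ("bob", "dog"), ("alice", "fish")]
visits = [("alice", "2013-05-01")]
grouped = cogroup(owners, visits)
# grouped["alice"] == ([("alice", "cat"), ("alice", "fish")],
#                      [("alice", "2013-05-01")])
```

Flattening these bags into pairs would complete an inner join, which is why COGROUP is described as the first step of a join.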
The main goal of Pig Latin is to reduce development time. For this purpose, it includes features like a nested data model, user-defined functions and the possibility of executing analytic queries over text files without loading the data first.
Unlike SQL, the procedural nature of the language also allows the programmer to have more control over the execution plan, meaning that he can speed up performance without relying on the query optimiser for this task.
2.2 HiveQL
Hive is an open-source project initially developed by Facebook. It is a system built on top of Hadoop that uses MapReduce for execution and HDFS for storage, and keeps the metadata in an RDBMS.
In simple terms, Hive can be described as a data warehouse built on top of Hadoop. The main advantage of Hive is that it makes the system familiar by extending the functionality of SQL, and its queries look similar to SQL. Scalability and performance are, of course, its other two key features. Hive tables can be defined directly on HDFS, while schemas are stored in an RDBMS. Hive supports complex column types such as map, array and struct in addition to the basic types.
Hive supports most traditional SQL query constructs, such as:
• Subqueries
• Different kinds of joins: inner, left outer, right outer and full outer joins
• Cartesian products, group-bys and aggregations
• Union all
• Create table as select.
Hive uses a traditional RDBMS to store the metadata. The metadata storage is accessed frequently, and it is preferable to keep metadata in random-access storage rather than sequential-access storage. As HDFS is not well suited for random access, Hive stores the metadata in databases like MySQL and Oracle.
It is also important to note that low latency is required whenever HiveQL accesses metadata. In spite of this impedance, Hive maintains the consistency between metadata and data.
2.3 Jaql
Jaql is a declarative scripting language built on top of Hadoop and used in some IBM products (InfoSphere BigInsights and Cognos Consumer Insight). This language was developed after Pig and Hive, and hence it has been designed with the purpose of making it more scalable, flexible and reusable than the alternatives that existed at the time.
Simplicity is one of the key goals of the Jaql data model, which is clearly inspired by JSON: values are always trees, there are no references and the textual representation is very similar. This simplicity has two advantages: it facilitates development and also makes it easier to distribute the data between nodes.
The other main goal of the data model is adaptability. Jaql can handle semistructured documents (data without a schema) but also structured records validated against a schema. In consequence, programs written in Jaql can read and write information in different sources, from relational databases to delimited files.
The flexibility of the language relies on the data model but also on the control over the evaluation plan, because the programmer can work at different levels of abstraction using Jaql's syntax:
• Full definition of the execution plan.
• Use of hints to indicate some evaluation features to the optimizer. This feature is present in most of the database engines that use SQL as their query language.
• Declarative programming, without any control over the flow.
2.4 Other languages
Stratosphere is a system designed to process massive datasets, and one of its main components is the Pact programming model. Pact is an extension of the MapReduce programming model, also inspired by functional programming. One of the limitations of MapReduce is that it is based on only two simple second-order functions; this new model addresses the problem by including new operators (called contracts) that make those analyses easier to express:
• Cross: performs the Cartesian product over the input sets.
• CoGroup: groups all the pairs with the same key and processes them with a user-defined function.
• Match: it also matches key/value pairs from the input data, but pairs with the same key might be processed separately by the user function.
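The difference between the contracts can be sketched in Python; cross and match below are single-machine illustrations of the semantics, not the Pact API:

```python
from collections import defaultdict
from itertools import product

def cross(left, right, udf):
    # Cross: the user function sees every pair of the Cartesian product.
    return [udf(l, r) for l, r in product(left, right)]

def match(left, right, udf):
    # Match: pairs sharing a key are handed to the user function one
    # pair at a time, not as a whole group (contrast with CoGroup).
    index = defaultdict(list)
    for k, v in right:
        index[k].append(v)
    return [udf(k, lv, rv) for k, lv in left for rv in index[k]]

pairs = match([("a", 1), ("b", 2)], [("a", 10), ("a", 20)],
              lambda k, l, r: (k, l + r))
# pairs == [("a", 11), ("a", 21)]
```

A CoGroup, by contrast, would call the user function once per key with the complete bags of values from both inputs.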
In addition to this new programming model, Stratosphere's stack includes Meteor, a query language, and Sopremo, the operator model used by Meteor. Meteor, like the other high-level languages presented before, was designed to facilitate the task of developing parallel programs using the programming model (in this case Pact). In consequence, Meteor programs are in the end translated into Pact programs. Sopremo helps manage collections of semantically rich operators that can be extended and that are grouped into packages.
To speed up the execution of the code, the optimization is applied in two steps: first the logical plan, which consists of Sopremo operators, is optimized, and secondly Pact's compiler applies physical optimizations to the resulting program.
Dryad is another execution engine to compute large-scale analyses over datasets. DryadLINQ gained the community's interest because it is embedded in .NET programming languages, and a large number of programmers are already familiar with this development platform and with LINQ in particular.
The designers of this system made a big effort to support almost all the operators available in LINQ, plus some extra operators interesting for parallel programming. The framework also allows developers to include their own implementations of the operators.
After the DryadLINQ code is extracted from the programme, it is translated to a Dryad plan and then optimized. The optimizer mainly performs four tasks: it pipelines operations that can be executed by a single machine, removes redundancy, pushes aggregations down and reduces the network traffic.
3 Comparing high level languages
The design motivations of the languages are diverse and therefore the diﬀerences
between the languages are multiple. To compare these three high level languages
we have decided to choose four criteria that are interesting from our point of
view and that are well described in the literature that we have reviewed.
First of all, we have analysed the expressiveness and the general performance of the languages. In general, developers prefer a language that allows them to write concise code (expressive power) and that is efficient (performance).
After that we dive into two criteria that also have an important impact on performance: the join implementation and the query processing. Join algorithms are a well-known burden on performance when working with sets, and therefore we will analyse the different algorithms implemented by those languages.
We consider these criteria sufficient to choose the solution that best suits our needs; however, there are many other aspects that are also mentioned or studied in the literature written so far, like the programming paradigm, the code size or scalability. Scalability is a very relevant criterion, since it motivated the creation of MapReduce; however, it is not easy to find consistent literature covering all the topics that we consider significant.
3.1 Expressive power
Robert Stewart classifies the high-level languages into three categories according to their computational power, from less to more powerful:
• Relational complete: a language is considered relational complete if it includes the primitive operations of the relational algebra: selection, rename, projection, set union, set difference and the cross (Cartesian) product. The different kinds of joins implemented in each language are compared in a separate section.
• SQL equivalent: SQL is a standard language for querying data stored in relational database management systems. It provides all the operations of the relational algebra plus aggregate functions, which are not part of the relational algebra although they are a common extension for data computation.
• Turing complete: a Turing complete language must allow conditional branching, indefinite iteration by means of recursion and emulation of an infinite memory model.
Pig Latin and HiveQL are considered SQL equivalent because they are more powerful than relational algebra (they include numeric aggregation functions). Hence we can consider their expressive power equivalent, but there are evident differences. SQL is an industry standard that has been developed over almost 40 years, including several extensions, and in consequence it is not easy to develop a SQL-compliant system.
HiveQL is inspired by SQL but it does not support the full repertoire included in the SQL-92 specification. On the other hand, HiveQL has also included some features notably inspired by MySQL and MapReduce that are not part of this specification. A comparison between SQL and HiveQL by Tom White in 2009 revealed some limitations of HiveQL, like the lack of indexes, transactions, subqueries outside the FROM clause, etc. In the case of Pig Latin, its syntax and functionalities are not inspired by SQL and the differences are more obvious. For instance, Pig Latin does not have an OVER clause and includes the COGROUP operator, which is not present in SQL.
Jaql includes basic flow control using if-else structures and recursion in higher-order functions, and hence it is considered Turing complete. However, taking into account that Pig and HiveQL programs can be extended using user-defined functions, Pig Latin and HiveQL might also be considered Turing complete.
Finally, Jaql makes it possible to compile high-level declarative expressions into lower-level function calls. As a result, the low-level functions can be extended. Neither Pig Latin nor HiveQL includes this feature, called source-to-source compilation, which can increase the expressiveness of the language.
3.2 Performance
The usual benchmark to measure Pig's performance is PigMix, a set of queries that tests scalability and latency. The Hive performance benchmark is mainly based on the queries specified by Pavlo et al., which basically cover a selection task, an aggregation task and a join task.
There are also a Pig Latin implementation and a HiveQL implementation of the TPC-H queries.
Even though the objective of each of these languages is to generate equivalent MapReduce jobs for the corresponding input script, the runtime measurements of these languages show different results for the same kind of benchmarking applications.
The paper first describes scale-up, scale-out, scale-in and runtime as its performance metrics. In the scale-up experiments the cluster size was fixed and the computation was increased, which means the number of nodes in the Hadoop environment was kept constant. The performance of all three languages, interestingly, varied based on the distribution of the data. For skewed data, Pig and Hive seem to be more effective than Jaql.
In the scale-out experiments the computation is fixed, in the sense that there is no increase in computation for a given experiment as the number of nodes increases. Again, the paper argues that Pig and Hive are better at utilizing the increase in cluster size than Jaql. But at some point there is no further improvement in performance from adding nodes.
Moreover, Pig and Hive allow the user to explicitly specify the number of reducer tasks. It has been argued that this feature has a significant influence on performance.
3.3 Query processing
In order to make a good comparison we should have basic knowledge of how these HLQLs work. In this section we will focus on answering the following question: how is the abstract user representation of the query or script converted into MapReduce jobs? For data-intensive parallel computations, the choice of a high-level language mainly depends on the specific application scenario. Taking this into account, we can see the importance of understanding the query compilation methods implemented by these languages.
3.3.1 Pig Latin
The structure of Pig Latin is similar in style to SQL. The goal of writing a Pig Latin script is to produce equivalent MapReduce jobs that can be executed in the Hadoop environment. Pig is considered to have the basic characteristics of a query language; hence the initial steps of compilation are similar to SQL query compilation.
Pig programs are first passed to a parser component, which checks the syntactic correctness of the Pig Latin script. The result of the parser is a complete logical plan. Unlike SQL parsing, which produces a parse tree, the parsing phase in Pig compilation produces a directed acyclic graph (DAG). The logical plan is then passed to the logical optimizer component, where classical optimization operations such as the pushing of projections are carried out. The result of the logical optimizer is passed to the MapReduce compiler to compute a sequence of MapReduce jobs, which is then passed to an optimization phase and finally submitted to Hadoop for execution. The following example (Figure 1) describes the generation of the logical plan for a simple word count program in Pig Latin. The output of each operator is shown next to the rectangles.
Figure 1: Compilation of Pig Latin to Logical Plan
Pig then translates the logical plan into a physical plan, replacing each logical operator with physical operators in the MapReduce jobs. In most cases the logical operator becomes the equivalent physical operator; in this case LOAD, FOREACH and STORE remain the same. The GROUP operator, however, is translated into LOCAL REARRANGE, GLOBAL REARRANGE and PACKAGE in the physical plan. Rearranging means either hashing or sorting by key. The combination of local and global rearranges produces the result in such a way that the tuples having the same group key are moved to the same machine.
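The rearrange/package pipeline can be sketched in Python as a single-machine simulation (the function names mirror the physical operators but are otherwise illustrative):

```python
from collections import defaultdict

def local_rearrange(tuples, key_index=0):
    # LOCAL REARRANGE (map side): annotate each tuple with its group key.
    return [(t[key_index], t) for t in tuples]

def global_rearrange(keyed_partitions, n_reducers):
    # GLOBAL REARRANGE (shuffle): route every tuple with the same key to
    # the same machine, simulated here by hash partitioning.
    shuffled = [defaultdict(list) for _ in range(n_reducers)]
    for partition in keyed_partitions:
        for key, t in partition:
            shuffled[hash(key) % n_reducers][key].append(t)
    return shuffled

def package(reducer_input):
    # PACKAGE (reduce side): materialize each key with its bag of tuples.
    return dict(reducer_input)

maps = [local_rearrange([("a", 1), ("b", 2)]), local_rearrange([("a", 3)])]
reducers = [package(p) for p in global_rearrange(maps, 2)]
merged = {k: v for r in reducers for k, v in r.items()}
# merged == {"a": [("a", 1), ("a", 3)], "b": [("b", 2)]}
```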
The input data is broken down into chunks, and the map tasks all run independently in parallel to process the input data; this is handled by the job tracker of the MapReduce framework. Map takes the input data and constructs the key/value pairs. It also sorts the data by key. The shuffle phase is managed by Hadoop: it fetches the corresponding partition of data from the map phase, merges it into a single sorted list and then groups it by key. This is the input for the reduce phase, which usually performs the aggregation part of the query (in our case, count). The equivalent MapReduce jobs for our example Pig Latin script are shown in Figure 2.
Figure 2: Physical Plan to Map Reduce Jobs
3.3.2 HiveQL
In a DBMS, the query processor transforms the user queries into a sequence of database operations and executes those operations. Initially the query is turned into a parse tree structure, in such a way that it can be transformed into relational algebra notation. This is termed the logical query plan [Garcia]. As HiveQL has an SQL-like declarative structure, its query processing is similar to that of SQL in a traditional database engine. The following steps briefly describe query processing in HiveQL:
• It gets the HiveQL string from the client.
• The parser phase converts it into a parse tree representation.
• The semantic analyser component converts the parse tree into a block-based internal query format.
• The logical query plan generator converts it into a logical query representation and then optimizes it, pruning columns early and pushing predicates closer to the tables.
• Finally, the logical plan is converted into a physical plan and then into MapReduce tasks.
3.3.3 Jaql
A Jaql script written by the application is first evaluated by the compiler. MapReduce jobs can be directly called from a Jaql script, but usually the user relies on the Jaql compiler to convert the script into MapReduce jobs. Jaql includes two higher-order functions, mapReduceFn and mapAggregate, to execute MapReduce and aggregation operations respectively. The rewriter engine generates calls to mapReduceFn or mapAggregate by identifying the relevant parts of the script and moving them into the map, reduce and aggregate function parameters. Based on some rules, the rewriter converts them into an Expr tree. Finally it checks for the presence of algebraic aggregates and, if there are any, it invokes the mrAggregate function; in other words, it can complete the task with a single MapReduce job. The stages are shown in Figure 3.
Figure 3: Jaql-Query processing stages
Each language has its own implementation of query processing. During the review we noted that:
• Pig currently misses out on optimized storage structures like indexes and column groups. HiveQL provides more optimization functionality, such as performing joins in the map phase instead of the reduce phase and, in the case of sampling queries, pruning the buckets that are not needed.
• Jaql's physical transparency is a valuable feature because it allows the user to add new runtime operators without affecting Jaql's internals.
3.4 JOIN implementation
Join is an essential operation in the relational database model. The basic need for joins comes from the fact that relations are in normalized form, so in the computation of aggregations, and in many kinds of OLAP operations, the join becomes a necessary step to compute the expected results.
3.4.1 Pig Latin
Pig Latin supports inner joins, equijoins and outer joins. The JOIN operator always performs an inner join. Pig executes joins in two flavours: first, a join can be achieved by a COGROUP operation followed by FLATTEN; second, the inner join can be performed by one of three specialized joins:
• Skewed joins: the basic idea is to compute a histogram of the key space and use this data to allocate reducers for a given key. Currently Pig allows a skewed join of only two tables. The join is performed in the reduce phase.
• Merge joins: Pig allows a merge join only if the input relations are already sorted, in which case the join can be performed without a full shuffle.
• Fragment-replicate joins: this is only possible if one of the two relations is small enough to fit into memory. In this case, the big relation is distributed across Hadoop nodes and the smaller relation is replicated on each node. Here the entire join operation is performed in the map phase. This is, of course, the trivial case.
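The fragment-replicate strategy can be sketched in Python (a single-process simulation; the function name and data are illustrative):

```python
def fragment_replicate_join(big_fragments, small_relation, key=lambda t: t[0]):
    # The small relation is replicated to every node and loaded into an
    # in-memory hash table; each map task joins its own fragment locally,
    # so no reduce phase (and no shuffle) is needed.
    replicated = {}
    for t in small_relation:
        replicated.setdefault(key(t), []).append(t)
    joined = []
    for fragment in big_fragments:        # one fragment per map task
        for t in fragment:
            for s in replicated.get(key(t), []):
                joined.append(t + s[1:])
    return joined

big = [[("a", 1), ("b", 2)], [("a", 3)]]  # partitioned across nodes
small = [("a", "x"), ("b", "y")]          # fits in memory, replicated
result = fragment_replicate_join(big, small)
# result == [("a", 1, "x"), ("b", 2, "y"), ("a", 3, "x")]
```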
The join strategy can be specified by the user while writing the script. An example join operation in Pig Latin is shown in Figure 4.
Figure 4: Join code in Pig Latin
3.4.2 HiveQL
In the early stages, HiveQL was designed to support only the common join operation. In this join, the joining tables are read in the map phase, and pairs of join keys and values are written into an intermediate file in order to pass them to the shuffle phase, which is handled by Hadoop. In the shuffle phase, Hadoop sorts and combines these key/value pairs and sends the tuples having the same key to the same reducer in order to perform the actual join operation. Here the shuffle and reduce phases are the more expensive part, since they involve sorting.
In order to overcome this, the map-side join was introduced; it is only possible if one of the joining tables fits entirely into memory. It is similar to the fragment-replicate join in Pig Latin.
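The common (reduce-side) join described at the start of this subsection can be sketched in Python; the shuffle is simulated with an in-memory grouping (common_join is an illustrative name, not a Hive API):

```python
from collections import defaultdict

def common_join(left, right, key=lambda t: t[0]):
    # Map phase: tag every row with its source table and emit it by key;
    # the shuffle then delivers both tables' rows for a key together.
    shuffled = defaultdict(lambda: ([], []))
    for t in left:
        shuffled[key(t)][0].append(t)
    for t in right:
        shuffled[key(t)][1].append(t)
    # Reduce phase: each reducer joins the rows it received for its keys.
    joined = []
    for k in sorted(shuffled):            # the shuffle sorts by key
        left_rows, right_rows = shuffled[k]
        for l in left_rows:
            for r in right_rows:
                joined.append(l + r[1:])
    return joined

result = common_join([("a", 1), ("b", 2)], [("a", "x")])
# result == [("a", 1, "x")]
```

The cost the text mentions is visible here: every row of both tables crosses the network and is sorted, even rows that join nothing.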
3.4.3 Jaql
Jaql supports only equijoins. A JOIN is expressed between two or more input arrays, and multiple types of joins are supported, including natural, left-outer, right-outer and full outer joins. One of the advantages of Jaql is that its physical transparency allows users to add new join operators and use them in queries without modifying anything in the query compiler.
The following points summarize the join implementations:
• Both Pig and Hive can perform joins in the map phase instead of the reduce phase.
• For skewed distributions of data, the join performance of Jaql is not comparable to that of the other two languages.
4 Future work
4.1 Interactive queries
One of the main problems of MapReduce and all the languages built on top of this framework (Pig, Hive, etc.) is latency. As a complement to those technologies, some new frameworks that allow programmers to query large datasets interactively have been developed, like Dremel or the open-source project Apache Drill.
In order to reduce the latency of the queries compared to other tools for
large dataset analysis, Dremel stores the information as nested columns, uses
a multi-level tree architecture in the query execution and balances the load by
means of a query dispatcher.
We do not have many details of Dremel's query language, but we know that it is based on SQL and includes the usual operations (selection, projection, etc.) and features (user-defined functions, nested subqueries) of SQL-like languages. The characteristic that distinguishes this language is that it operates with nested tables as inputs and outputs.
4.2 Machine learning
MapReduce, as a way to process big data, obviously performs well for basic operations such as selection; on the other hand, it is more complicated to address complex queries with this processing technique. The challenging aspect of machine learning algorithms is that they do not simply compute aggregates over datasets: they identify hidden patterns in the given data. An example of such a question is: which page will the visitor visit next?
Some ML algorithms, and the general approach to processing the data with MapReduce, are discussed in the literature. A naive Bayes classifier requires counting occurrences in the training data. On a large dataset the extraction of features is intensive, and at the very least the reduce task should be configured to compute the summation for each (feature, label) pair.
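That counting step can be sketched in Python as a MapReduce-style computation (map_fn and count_features are illustrative names; a real job would distribute the documents across map tasks):

```python
from collections import Counter

def map_fn(doc):
    # Map: emit a ((feature, label), 1) pair for every feature occurrence.
    label, text = doc
    for feature in text.split():
        yield ((feature, label), 1)

def count_features(docs):
    # Shuffle + reduce: sum the 1s emitted for each (feature, label) key;
    # these counts let the classifier estimate P(feature | label).
    counts = Counter()
    for doc in docs:
        for key, one in map_fn(doc):
            counts[key] += one
    return counts

docs = [("spam", "buy now"), ("ham", "see you now")]
counts = count_features(docs)
# counts[("now", "spam")] == 1 and counts[("now", "ham")] == 1
```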
Mahout is an Apache project for building scalable machine learning algorithms. These algorithms include clustering, classification, collaborative filtering and frequent itemset mining, which in turn are predominantly used in recommendations. The collaborative filtering supports both user-user and item-item similarity.
• Pig Latin has extensions that deal with predictive analytics. Twitter has implemented learning algorithms by placing them in Pig storage functions, which are called in the final reduce stage of the overall dataflow.
• There is recent work on extensive support for ML in Hive. The authors try to follow the same approach that was implemented for Pig by Twitter; here the machine learning is treated as UDAFs.
• A new data analytics platform, Ricardo, has been proposed that combines the functionalities of R and Jaql. It basically takes advantage of the statistical computing features provided by R together with a high-level language that generates MapReduce jobs, Jaql.
5 Conclusions
In this literature review we have first introduced the MapReduce programming model, paying attention to its main drawbacks and its main open-source implementation.
After that we have briefly described some high-level languages that try to address the problems mentioned, from different perspectives, focusing on those that are popular in the literature available at the time of writing (Pig Latin, HiveQL and Jaql) and some interesting alternatives (DryadLINQ and Meteor).
Based on the consistent and relevant studies reviewed, it is clear that there is no single language that beats all the other options. Jaql was created after the other two languages, and that probably gave it some advantages in its design. Based on the first criterion analysed, we can state that Jaql is expressively more powerful since it includes basic flow control using if-else structures, while with the other two this is only possible using UDFs. However, we have seen that Jaql also shows the worst performance in the benchmarks described before. Pig and Hive probably perform better in those benchmarks because they support map-phase joins. Hive also adopts advanced optimization techniques for query processing that certainly speed up the resulting code.
Finally, we have seen how high level languages for big data analytics are ad-
dressing some of the problems of this paradigm. Real-time processing demands
a very low latency of response and this is one of the main disadvantages of
the MapReduce model. In consequence, some new languages for large dataset
analytics that do not use this model have been designed.
Additionally, some machine learning algorithms are difficult to implement using this model. Some alternatives have shown up in the last few years; for instance, the Apache Software Foundation is developing Mahout, a library that implements scalable machine learning algorithms using the MapReduce paradigm.
References
Alexander Alexandrov, Stephan Ewen, Max Heimel, Fabian Hueske, Odej Kao, Volker Markl, Erik Nijkamp, and Daniel Warneke. MapReduce and PACT - comparing data parallel programming models. In Proceedings of the 14th Conference on Database Systems for Business, Technology, and Web (BTW), BTW 2011, pages 25–44, Bonn, Germany, 2011. GI.
Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, and Daniel Warneke. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pages 119–130, New York, NY, USA, 2010. ACM.
 Kevin S Beyer, Vuk Ercegovac, Rainer Gemulla, Andrey Balmin, Mohamed
Eltabakh, Carl-Christian Kanne, Fatma Ozcan, and Eugene J Shekita.
Jaql: A scripting language for large scale semistructured data analysis.
In Proceedings of VLDB Conference, 2011.
 Sudipto Das, Yannis Sismanis, Kevin S Beyer, Rainer Gemulla, Peter J
Haas, and John McPherson. Ricardo: integrating r and hadoop. In Proceed-
ings of the 2010 ACM SIGMOD International Conference on Management
of data, pages 987–998. ACM, 2010.
Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
 Alan F Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shra-
van M Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh
Srinivasan, and Utkarsh Srivastava. Building a high-level dataﬂow sys-
tem on top of map-reduce: the pig experience. Proceedings of the VLDB
Endowment, 2(2):1414–1425, 2009.
 Dan Gillick, Arlo Faria, and John DeNero. Mapreduce: Distributed com-
puting for machine learning. Berkley (December 18, 2006), 2006.
Arvid Heise, Astrid Rheinländer, Marcus Leich, Ulf Leser, and Felix Naumann. Meteor/Sopremo: An extensible query language and operator model. In Proceedings of the International Workshop on End-to-end Management of Big Data (BigData) in conjunction with VLDB 2012, 2012.
 Michael Isard and Yuan Yu. Distributed data-parallel computing using a
high-level programming language. In Proceedings of the 2009 ACM SIG-
MOD International Conference on Management of data, pages 987–994.
 Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and
Bongki Moon. Parallel data processing with mapreduce: a survey. ACM
SIGMOD Record, 40(4):11–20, 2012.
 Jimmy Lin and Alek Kolcz. Large-scale machine learning at twitter. In
Proceedings of the 2012 international conference on Management of Data,
pages 793–804. ACM, 2012.
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330–339, 2010.
 Extension of Hive to support Machine Learning. Hiveql.
 Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and
Andrew Tomkins. Pig latin: a not-so-foreign language for data process-
ing. In Proceedings of the 2008 ACM SIGMOD international conference
on Management of data, pages 1099–1110. ACM, 2008.
Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J Abadi, David J DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of the 35th SIGMOD International Conference on Management of Data, pages 165–178. ACM, 2009.
 Pig. Fragment replicate join.
Caetano Sauer and Theo Härder. Compilation of query languages into MapReduce. Datenbank-Spektrum, pages 1–11, 2013.
 Benchmarking standards. Pigmix.
 Benchmarking standards Hive. Hiveql.
Robert Stewart. Performance and programmability of high level data parallel processing languages: Pig, Hive, Jaql & Java-MapReduce, 2010. Heriot-Watt University.
 Robert J Stewart, Phil W Trinder, and Hans-Wolfgang Loidl. Comparing
high level mapreduce query languages. In Advanced Parallel Processing
Technologies, pages 58–72. Springer, 2011.
 Ronald C Taylor. An overview of the hadoop/mapreduce/hbase frame-
work and its current applications in bioinformatics. BMC bioinformatics,
11(Suppl 12):S1, 2010.
 Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad
Chakka, Suresh Anthony, Hao Liu, Pete Wyckoﬀ, and Raghotham Murthy.
Hive: a warehousing solution over a map-reduce framework. Proceedings of
the VLDB Endowment, 2(2):1626–1629, 2009.
 Tom White. Hadoop: The deﬁnitive guide. O’Reilly Media, 2012.
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pages 1–14, 2008.