High-level languages for Big Data Analytics (Report)


Report for the course 'Business Intelligence Seminar' of the IT4BI Erasmus Mundus Master's Programme


Jose Luis Lopez Pino (jllopezpino@gmail.com), Janani Chakkaradhari (chjananicse@yahoo.com)

June 19, 2013

Abstract

This work presents a review of the literature about the high-level languages that came out after the MapReduce programming model and its Hadoop implementation shook up parallel programming over huge datasets. MapReduce was a major step forward in the field, but it has severe limitations that the high-level programming languages try to overcome in different ways.

Our work compares three of the main high-level languages (Pig Latin, HiveQL and Jaql) based on four criteria that we consider very relevant and for which, in our opinion, the studies are consistent: expressive power, performance, query processing and JOIN implementation.

The analysis based on these criteria reveals the differences between the languages, but it shows that none of the languages analysed beats all the others in every criterion. Which language is most suitable for implementation depends on the scenario or application, and these comparison results should be weighed accordingly.

Finally, we address two very well-known pitfalls of MapReduce: latency and the implementation of complex algorithms.

1 Introduction

1.1 The MapReduce programming model

The MapReduce programming model was introduced by Google in 2004 [5]. This model allows programmers without any experience in parallel coding to write highly scalable programs and hence process voluminous data sets. This high level of scalability is reached thanks to the decomposition of the problem into a large number of tasks. MapReduce is a completely different approach to big data analysis and it has been proven effective on large cluster systems. The model is based on two functions that are coded by the user: map and reduce.
• The Map function takes a single key/value pair as input and produces a set of key/value pairs.

• The Reduce function takes a key and the set of values related to that key as input, and it may also produce a set of values, but commonly it emits only one value or none at all.

Using this model, programmers do not have to care about how the data is distributed, how to handle failures or how to balance the system. However, the model also has important drawbacks: it is complicated to code very simple and common tasks using this single dataflow, many of the tasks are expensive to perform, the user code is difficult to debug, there are no schemas or indexes, and a lot of network bandwidth may be consumed [10]. The purpose of the different high-level languages is to address some of those shortcomings.

1.2 Hadoop

Nowadays, the generation of data per day is measured at the petabyte scale. Such large amounts of data make it necessary to store the data on more than one system at a time, i.e. to partition the data and store it on separate machines. File systems that manage storage across a network of machines are called distributed file systems (DFS). The challenging aspect of a DFS is fault tolerance, since failures can lead to data loss. Hadoop became a solution to this problem. Doug Cutting, the creator of Hadoop, named it after his son's toy elephant.

Hadoop has two layers: a storage layer and an execution layer. The storage layer is the Hadoop Distributed File System (HDFS), designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. The default HDFS block size is 64 MB. The execution layer is Hadoop MapReduce, and it is responsible for running a job in parallel on multiple servers at the same time.
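The map/reduce contract described above can be sketched in a few lines of Python. The driver function below stands in for the framework's shuffle and grouping; the names are illustrative and do not correspond to Hadoop's actual API.

```python
from collections import defaultdict

def map_fn(_key, line):
    # Emit one (word, 1) pair per word in the input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum all counts observed for this word; emits a single pair.
    yield (word, sum(counts))

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)            # "shuffle": group values by key
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    out = {}
    for k, vs in sorted(groups.items()):  # reducers see keys in sorted order
        for rk, rv in reduce_fn(k, vs):
            out[rk] = rv
    return out

result = run_mapreduce(enumerate(["big data", "big deal"]), map_fn, reduce_fn)
# result == {"big": 2, "data": 1, "deal": 1}
```

The programmer writes only map_fn and reduce_fn; everything the driver does here (partitioning, grouping, sorting) is what the framework provides transparently.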
Since the Hadoop cluster nodes are commodity hardware, the cost is low and the architecture is simple. A typical Hadoop cluster has a single master server called the name node, which manages the file system, while the job tracker process manages the jobs. This node should be high-quality hardware with high processing speed. The cluster also has multiple slave nodes, called data nodes, that run tasks on their local servers. In general, the data is replicated (3 times by default) on different nodes to support fault tolerance. If any data node fails, the master node detects the failure and immediately re-replicates the lost data on the nodes that remain active [24].

2 High level languages

After MapReduce was publicly announced and the Hadoop framework was created, multiple high-level languages have been created specifically to deal with some of the problems of the model mentioned before. Some already existing languages have also been integrated to work over this model, like R (MapR, Ricardo [4]) and SQL.
Concerning the selection of the languages for our comparison, we have chosen the three programming languages that are present in all the comparisons (Pig Latin, HiveQL and Jaql) for the sake of consistency, but we have also considered it important to mention some other interesting high-level query languages (Meteor and DryadLINQ). Additionally, we also study how each query is processed by the system that translates it into a MapReduce workflow.

2.1 Pig Latin

Pig Latin, originally developed at Yahoo! Research [14], is executed over Hadoop, an open-source implementation of the map-reduce programming model. It is a high-level procedural language that implements high-level operations similar to those that we can find in SQL, plus some other interesting operators listed below:

• FOREACH, to process a transformation over every tuple of the set. To make it possible to parallelise this operation, the transformation of one row must not depend on any other row.

• COGROUP, to group related tuples of multiple datasets. It is similar to the first step of a join.

• LOAD, to load the input data and its structure, and STORE, to save data to a file.

The main goal of Pig Latin is to reduce development time. For this purpose, it includes features like a nested data model, user-defined functions and the possibility of executing some analytic queries over text files without loading the data.

Unlike SQL, the procedural nature of the language also allows programmers to have more control over the execution plan, meaning that they can speed up performance without relying on the query optimiser for this task.

2.2 HiveQL

HiveQL is the language of Hive, an open-source project initially developed by Facebook. Hive is a system built on top of Hadoop that incorporates MapReduce for execution and HDFS for storage, and keeps the metadata in an RDBMS. In simple terms, Hive has been described as a data warehouse built on top of Hadoop.
The main advantage of Hive is familiarity: it extends the functionality of SQL and its queries look similar to SQL. Scalability and performance are two further features. Hive tables can be defined directly on HDFS, while schemas are stored in an RDBMS. Hive supports complex column types such as map, array and struct in addition to the basic types [23]. Hive supports most of the usual SQL query constructs, such as:

• Subqueries

• Different kinds of joins: inner, left outer, right outer and full outer joins

• Cartesian products, group-bys and aggregations

• Union all

• Create table as select

Hive uses a traditional RDBMS to store the metadata. Metadata storage is accessed frequently, and it is preferable to access metadata randomly rather than sequentially. As HDFS is not well suited for random access, Hive stores the metadata in databases like MySQL and Oracle. It is also important to note that this gives HiveQL low latency whenever it accesses metadata. In spite of this impedance, Hive maintains consistency between metadata and data [23].

2.3 Jaql

Jaql [3] is a declarative scripting language built on top of Hadoop and used in some IBM products (InfoSphere BigInsights and Cognos Consumer Insight). This language was developed after Pig and Hive, and hence it has been designed with the purpose of making it more scalable, flexible and reusable than the alternatives that existed at the time.

Simplicity is one of the key goals of the Jaql data model, which is clearly inspired by JSON: values are always trees, there are no references and the textual representation is very similar. This simplicity has two advantages: it facilitates development and it also makes it easier to distribute programs between nodes.

The other main goal of the data model is adaptability. Jaql can handle semistructured documents (data without a schema) but also structured records validated against a schema. In consequence, programs written in Jaql can read and write information in different sources, from relational databases to delimited plain files.

The flexibility of the language relies on the data model but also on the control over the evaluation plan, because the programmer can work at different levels of abstraction using Jaql's syntax:

• Full definition of the execution plan.

• Use of hints to indicate some evaluation features to the optimizer. This feature is present in most of the database engines that use SQL as their query language.

• Declarative programming, without any control over the flow.
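The JSON-flavoured, pipeline-oriented style of Jaql can be imitated over lists of dicts in Python. The helper names below (pipe, filter_, transform) are invented for illustration; they mimic the flavour of Jaql's '->' operator chaining but are not Jaql syntax.

```python
def pipe(data, *stages):
    # Thread a collection of JSON-like values through a chain of stages.
    for stage in stages:
        data = stage(data)
    return data

def filter_(pred):
    return lambda items: [x for x in items if pred(x)]

def transform(fn):
    return lambda items: [fn(x) for x in items]

people = [
    {"name": "ada", "age": 36, "city": "london"},
    {"name": "bob", "age": 17, "city": "paris"},
]

adults = pipe(
    people,
    filter_(lambda p: p["age"] >= 18),         # like: filter $.age >= 18
    transform(lambda p: {"name": p["name"]}),  # like: transform {name: $.name}
)
# adults == [{"name": "ada"}]
```

Because every stage consumes and produces plain trees of values, a compiler is free to push individual stages into the map or reduce side of a job, which is essentially what Jaql's rewriter does.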
2.4 Other languages

2.4.1 Meteor

Stratosphere [2] is a system designed to process massive datasets, and one of its main components is the Pact programming model. Pact [1] is an extension of the MapReduce programming model, also inspired by functional programming. One of the limitations of MapReduce is that it is based on only two simple second-order functions; this new model addresses the problem by including new operators (called contracts) to perform such analyses more easily and more efficiently:

• Cross: performs the Cartesian product over the input sets.

• CoGroup: groups all the pairs with the same key and processes them with a user-defined function.

• Match: also matches key/value pairs from the input data, but pairs with the same key may be processed separately by the user function.

In addition to this new programming model, Stratosphere's stack includes Meteor, a query language, and Sopremo, the operator model used by Meteor [8]. Meteor, like the other high-level languages presented before, was designed to facilitate the task of developing parallel programs using the underlying programming model (in this case Pact). In consequence, Meteor programs are in the end translated to Pact programs. Sopremo manages collections of semantically rich operators that can be extended and that are grouped into packages.

To speed up the execution of the code, optimization is applied in two steps: first the logical plan, which consists of Sopremo operators, is optimized, and secondly Pact's compiler applies physical optimizations to the resulting program.

2.4.2 DryadLINQ

Dryad [25] is another execution engine for large-scale analysis over datasets. DryadLINQ gained the community's interest because it is embedded in the .NET programming languages, and a large number of programmers are already familiar with this development platform and with LINQ in particular.

The designers of this system made a big effort to support almost all the operators available in LINQ plus some extra operators interesting for parallel programming. The framework also allows developers to include their own implementations of the operators [9].

After the DryadLINQ code is extracted from the program, it is translated to a Dryad plan and then optimized. The optimizer mainly performs four tasks: it pipelines operations that can be executed by a single machine, removes redundancy, pushes aggregations down and reduces network traffic.
3 Comparing high level languages

The design motivations of the languages are diverse and therefore the differences between them are numerous. To compare these three high-level languages we have decided to choose four criteria that are interesting from our point of view and that are well described in the literature that we have reviewed.

First of all, we have analysed the expressiveness and the general performance of the languages. In general, developers prefer a language that allows them to write concise code (expressive power) and that is efficient (performance). After that we dive into two criteria that also have an important impact on performance: the join implementation and the query processing. Join algorithms are a well-known burden on performance when working with sets, and therefore we analyse the different algorithms implemented by these languages.

We consider these criteria sufficient to choose the solution that best suits our needs; however, there are many other factors that are also mentioned or studied in the literature written so far, like the programming paradigm, code size or scalability. Scalability is a very relevant criterion, since it motivated the creation of MapReduce, but it is not easy to find consistent literature covering all the topics that we consider significant.

3.1 Expressive power

Robert Stewart [20] classifies the high-level languages into three categories in accordance with their computational power, from less to more powerful:

• Relational complete: a language is considered relational complete if it includes the primitive operations of the relational algebra: selection, rename, projection, set union, set difference and the cross (Cartesian) product. The different kinds of joins implemented in each language are compared in a separate section.

• SQL equivalent: SQL is a standard language for querying data stored in relational database management systems. It provides all the operations of the relational algebra plus aggregate functions, which are not part of the relational algebra although they are a common extension for data computation.

• Turing complete: a Turing-complete language must allow conditional branching, indefinite iteration by means of recursion, and emulation of an infinite memory model.

Pig Latin and HiveQL are considered SQL equivalent because they are more powerful than relational algebra (they include numeric aggregation functions). Hence we can consider their expressive power equivalent, but there are evident differences. SQL is a standard of the industry that has been developed for almost 40 years, including several extensions, and in consequence it is not easy to develop an SQL-compliant system.
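To make the "relational complete" baseline above concrete, the primitive operations can be sketched over lists of dicts in Python. This is an illustrative toy, not any engine's implementation; rename is included alongside the five set-oriented primitives.

```python
def select(rel, pred):            # selection: keep tuples matching a predicate
    return [t for t in rel if pred(t)]

def project(rel, attrs):          # projection: keep only the listed attributes
    return [{a: t[a] for a in attrs} for t in rel]

def rename(rel, mapping):         # rename: relabel attributes
    return [{mapping.get(a, a): v for a, v in t.items()} for t in rel]

def union(r, s):                  # set union (deduplicated)
    return [dict(t) for t in r] + [t for t in s if t not in r]

def difference(r, s):             # set difference
    return [t for t in r if t not in s]

def product(r, s):                # Cartesian (cross) product
    return [{**t, **u} for t in r for u in s]

r = [{"id": 1}, {"id": 2}]
s = [{"id": 2}, {"id": 3}]
assert union(r, s) == [{"id": 1}, {"id": 2}, {"id": 3}]
assert difference(r, s) == [{"id": 1}]
assert product([{"a": 1}], [{"b": 2}]) == [{"a": 1, "b": 2}]
```

A language that can express each of these six operations over its datasets meets Stewart's relational-complete bar; SQL equivalence additionally demands aggregate functions on top of them.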
HiveQL is inspired by SQL but it does not support the full repertoire included in the SQL-92 specification. On the other hand, HiveQL has also included some features notably inspired by MySQL and MapReduce that are not part of this specification. A comparison between SQL and HiveQL by Tom White in 2009 revealed some limitations of HiveQL, such as the lack of indexes, transactions, subqueries outside the FROM clause, etc. [24]. In the case of Pig Latin, its syntax and functionality are not inspired by SQL and the differences are more obvious. For instance, Pig Latin does not have an OVER clause and includes the COGROUP operator, which is not present in SQL.

Jaql includes basic flow control using if-else structures and recursion in higher-order functions, and hence it is considered Turing complete. However, taking into account that Pig and HiveQL programs can be extended using user-defined functions, Pig Latin and HiveQL might also be considered Turing complete.

Finally, Jaql [3] makes it possible to compile high-level declarative expressions into lower-level function calls. As a result, the low-level functions can be extended. Neither Pig Latin nor HiveQL includes this feature, called source-to-source compilation, which could increase the expressiveness of the language.

3.2 Performance

The usual benchmark to measure Pig's performance is PigMix, a set of queries that tests scalability and latency [18]. The Hive performance benchmark is mainly based on the queries specified by Pavlo et al. [15], which cover a selection task, an aggregation task and a join task. There are also Pig Latin and HiveQL implementations of the TPC-H queries [19].

Even though the objective of each of these languages is to generate equivalent MapReduce jobs for the corresponding input script, the runtime measurements of these languages show different results for the same kind of benchmarking applications, as experimented in [21]. The paper first describes scale-up, scale-out, scale-in and runtime as its performance metrics. In scale-up, the cluster size was fixed and the computation increased, which means the number of nodes in the Hadoop environment was kept constant. The performance of all three languages interestingly varied based on the distribution of data: for skewed data, Pig and Hive seem to be more effective than Jaql.

In scale-out, the computation is fixed, in the sense that there is no increase in computation for a given experiment as the number of nodes increases. Again, the paper argues that Pig and Hive are better at utilizing the increase in cluster size than Jaql, although at some point there is no further improvement in performance from adding nodes. Moreover, Pig and Hive allow the user to explicitly specify the number of reducer tasks, and it has been argued that this feature has a significant influence on performance.
3.3 Query processing

In order to make a good comparison we should have basic knowledge of how these high-level query languages work. In this section we focus on answering the following question: how is the abstract user representation of the query or script converted into MapReduce jobs? For data-intensive parallel computations, the choice of high-level language mainly depends on the specific application scenario [17]. Taking this into account, we can see the importance of understanding the query compilation methods implemented by these languages.

3.3.1 Pig Latin

The structure of Pig Latin is similar in style to SQL. The goal of writing a Pig Latin script is to produce equivalent MapReduce jobs that can be executed in the Hadoop environment. Pig is considered to have the basic characteristics of a query language, hence the initial steps of compilation are similar to SQL query processing [6]. Pig programs are first passed to a parser component, which checks the syntactic correctness of the Pig Latin script. The result of the parser is a complete logical plan. Unlike SQL parsing, which produces a parse tree, the parsing phase of Pig compilation produces a directed acyclic graph (DAG). The logical plan is then passed to the logical optimizer component, where classical optimization operations such as pushing down projections are carried out. The result of the logical optimizer is passed to the MapReduce compiler to compute a sequence of MapReduce jobs, which is then passed to an optimization phase and finally submitted to Hadoop for execution [6].

The following example (Figure 1) describes the generation of the logical plan for a simple word count program in Pig Latin. The output of each operator is shown next to the rectangles.

Figure 1: Compilation of Pig Latin to Logical Plan

Pig then translates the logical plan into a physical plan, replacing each logical operator with physical operators in the MapReduce jobs.
In most cases a logical operator becomes the equivalent physical operator; in this example LOAD, FOREACH and STORE remain the same. The GROUP operator, however, is translated by Pig into LOCAL REARRANGE, GLOBAL REARRANGE and PACKAGE in the physical plan.
Rearranging means either hashing or sorting by key. The combination of local and global rearranges produces the result in such a way that tuples having the same group key are moved to the same machine [6].

The input data is broken down into chunks, and the map tasks all run independently in parallel to process the input data; this is handled by the job tracker of the MapReduce framework. Map takes the input data and constructs the key/value pairs. It also sorts the data by key. The shuffle phase is managed by Hadoop: it fetches the corresponding partition of data from the map phase and merges it into a single sorted list, then groups it by key. This is the input for the reduce phase, which usually performs the aggregation part of the query; in our case it is a count. The equivalent MapReduce jobs for our example Pig Latin script are shown in Figure 2.

Figure 2: Physical Plan to Map Reduce Jobs

3.3.2 HiveQL

In a DBMS, the query processor transforms user queries into a sequence of database operations and executes those operations. Initially the query is turned into a parse tree structure that can be transformed into relational algebraic notation; this is termed the logical query plan [Garcia]. As HiveQL has an SQL-like declarative structure, query processing is similar to SQL processing in a traditional database engine. The following steps briefly describe query processing in HiveQL [23]:

• It gets the Hive SQL string from the client.

• The parser phase converts it into a parse tree representation.

• The semantic analyser component converts the parse tree into a block-based internal query format.

• The logical query plan generator converts it into a logical query representation and then optimizes it, pruning columns early and pushing predicates closer to the tables.

• Finally, the logical plan is converted to a physical plan and then to MapReduce jobs.

3.3.3 Jaql

A Jaql script is first evaluated by the compiler. MapReduce jobs can be called directly from a Jaql script, but usually the user relies on the Jaql compiler to convert the script into MapReduce jobs. Jaql includes two higher-order functions, mapReduceFn and mrAggregate, to execute map-reduce and aggregate operations respectively. The rewriter engine generates calls to mapReduceFn or mrAggregate by identifying the relevant parts of the script and moving them into the map, reduce and aggregate function parameters. Based on some rules, the rewriter converts them into an Expr tree. Finally it checks for the presence of algebraic aggregates and, if there are any, it invokes the mrAggregate function; in other words, it can then complete the task with a single MapReduce job (Figure 3).

Figure 3: Jaql-Query processing stages

Each language has its own implementation of query processing. During the review it was noted that:

• Pig currently misses out on optimized storage structures like indexes and column groups. HiveQL provides more optimization functionality, such as performing joins in the map phase instead of the reduce phase and, in the case of sampling queries, pruning the buckets that are not needed.

• Jaql's physical transparency is an added-value feature because it allows the user to add new runtime operators without affecting Jaql's internals.
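The LOCAL REARRANGE / GLOBAL REARRANGE / PACKAGE decomposition of GROUP described for Pig above can be sketched in Python. The operator names follow the text; the implementation is an illustrative approximation, not Pig's code.

```python
from collections import defaultdict
from itertools import groupby

def local_rearrange(tuples, key_fn):
    # Annotate each tuple with its group key (done inside each map task).
    return [(key_fn(t), t) for t in tuples]

def global_rearrange(keyed, n_partitions):
    # Route keyed tuples to partitions by hashed key, then sort each
    # partition by key (this models the framework's shuffle/sort).
    parts = defaultdict(list)
    for k, t in keyed:
        parts[sum(map(ord, str(k))) % n_partitions].append((k, t))
    return [sorted(p) for p in parts.values()]

def package(partition):
    # Collect each sorted run of same-key tuples into one group.
    return {k: [t for _, t in run]
            for k, run in groupby(partition, key=lambda kt: kt[0])}

keyed = local_rearrange(["big", "data", "big"], lambda w: w)
merged = {}
for part in global_rearrange(keyed, 2):
    merged.update(package(part))
# merged == {"big": ["big", "big"], "data": ["data"]}
```

Because the rearranges guarantee that all tuples with the same key land in the same sorted partition, PACKAGE only needs a single pass to assemble each group.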
3.4 JOIN implementation

Join is an essential operation in relational database models. The basic need for joins comes from the fact that relations are in normalized form, so in the computation of aggregations, and indeed in many kinds of OLAP operations, the join becomes a necessary step to compute the expected results.

3.4.1 Pig Latin

Pig Latin supports inner joins, equijoins and outer joins. The JOIN operator always performs an inner join; a join can also be achieved by a COGROUP operation followed by FLATTEN [4]. The inner join can be executed with three specialized strategies [16]:

• Skewed joins: the basic idea is to compute a histogram of the key space and use this data to allocate reducers for a given key. Currently Pig allows a skewed join of only two tables. The join is performed in the reduce phase.

• Merge joins: Pig allows a merge join only if the input relations are already sorted on the join key.

• Fragment-replicate joins: this is only possible if one of the two relations is small enough to fit into memory. In this case, the big relation is distributed across the Hadoop nodes and the smaller relation is replicated on each node. Here the entire join operation is performed in the map phase. Of course, this is the trivial case.

The choice of join strategy can be specified by the user while writing the script. An example join operation in Pig Latin is shown in Figure 4.

Figure 4: Join code in Pig Latin

3.4.2 HiveQL

In the early stages, HiveQL was only designed to support the common join operation. In this join, the tables to be joined are read in the map phase, and pairs of join key and value are written to an intermediate file, which is passed to the shuffle phase handled by Hadoop. In the shuffle phase, Hadoop sorts and combines these key/value pairs and sends the tuples having the same key to the same reducer in order to perform the actual join operation. Here the shuffle and reduce phases are the expensive part, since they involve sorting.
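The common (reduce-side) join just described can be sketched in Python: mappers tag each row with its source table and emit it under the join key, the shuffle groups rows by key, and each reducer pairs up rows from the two sides. This is an illustrative sketch; real Hive and Pig plans differ in detail.

```python
from collections import defaultdict

def map_side(rows, key_fn, tag):
    # Tag each row with its table of origin and emit it under the join key.
    return [(key_fn(r), (tag, r)) for r in rows]

def shuffle(pairs):
    # The framework's shuffle: group tagged rows by join key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_join(groups):
    # Per key, pair every left row with every right row (inner join).
    out = []
    for _, tagged in sorted(groups.items()):
        left = [r for tag, r in tagged if tag == "L"]
        right = [r for tag, r in tagged if tag == "R"]
        out += [(l, r) for l in left for r in right]
    return out

users = [("u1", "ada"), ("u2", "bob")]
visits = [("u1", "/home"), ("u1", "/faq")]
pairs = (map_side(users, lambda r: r[0], "L")
         + map_side(visits, lambda r: r[0], "R"))
result = reduce_join(shuffle(pairs))
# result pairs the u1 user row with each of u1's visits; u2 has no match
```

Everything between map_side and reduce_join is sorting and data movement, which is why this strategy is expensive and why map-side variants are attractive when one table is small.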
To overcome this cost, the map-side join was introduced; it is only possible when one of the tables to be joined fits entirely into memory. It is similar to the fragment-replicate join in Pig Latin.

3.4.3 Jaql

Jaql supports equijoins, expressed between two or more input arrays, and offers multiple variants including natural, left-outer, right-outer and full outer joins. One of the advantages of Jaql is that its physical transparency allows its function support to be used to add new join operators and use them in queries without modifying anything in the query compiler.

The following points summarise the join implementations:

• Both Pig and Hive have the possibility of performing the join in the map phase instead of the reduce phase.

• For skewed distributions of data, the join performance of Jaql does not match that of the other two languages.

4 Future work

4.1 Interactive queries

One of the main problems of MapReduce and of all the languages built on top of this framework (Pig, Hive, etc.) is latency. As a complement to those technologies, some new frameworks that allow programmers to query large datasets in an interactive manner have been developed, like Dremel [12] or the open-source project Apache Drill.

In order to reduce query latency compared to other tools for large-dataset analysis, Dremel stores the information as nested columns, uses a multi-level tree architecture in query execution and balances the load by means of a query dispatcher. We do not have many details of Dremel's query language, but we know that it is based on SQL and includes the usual operations (selection, projection, etc.) and features (user-defined functions, nested subqueries) of SQL-like languages. The characteristic that distinguishes this language is that it operates with nested tables as inputs and outputs.
4.2 Machine learning

MapReduce is a way to process big data, and it obviously performs well for basic operations such as selection; on the other hand, it is more complicated to address complex computations with this processing technique. The challenging aspect of machine learning algorithms is that they do not simply compute aggregates over datasets: they identify hidden patterns in the given data. An example of such a question is: which page will the visitor visit next?
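As one concrete example of how a learning primitive can still fit the model, counting (feature, label) co-occurrences for a naive Bayes classifier can itself be phrased as a map/reduce job: mappers emit ((feature, label), 1) and reducers sum the counts. This Python sketch is illustrative only, not a full classifier.

```python
from collections import defaultdict

def map_fn(label, features):
    # Emit one ((feature, label), 1) pair per feature of the example.
    for f in features:
        yield ((f, label), 1)

def reduce_fn(key, ones):
    # Sum the co-occurrence counts for one (feature, label) pair.
    return key, sum(ones)

def count_cooccurrences(examples):
    groups = defaultdict(list)        # stands in for the shuffle phase
    for label, features in examples:
        for k, v in map_fn(label, features):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

train = [("spam", ["free", "win"]), ("ham", ["hi"]), ("spam", ["free"])]
counts = count_cooccurrences(train)
# counts[("free", "spam")] == 2
```

The counting step parallelises well; it is the iterative parts of most learning algorithms (repeated passes over the data) that the single-pass MapReduce dataflow handles poorly.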
Some ML algorithms, and the general approach to processing the data with MapReduce, are discussed in [7]. A Bayes classifier, for instance, requires counting occurrences in the training data. On a large data set the extraction of features is intensive, and at the very least the reduce tasks should be configured to compute the summation for each (feature, label) pair.

Mahout is an Apache project for building scalable machine learning algorithms. These algorithms include clustering, classification, collaborative filtering and frequent itemset mining, which in turn are predominantly used in recommendations. The collaborative filtering supports both user-user and item-item based similarity [22].

• Pig Latin has extensions to deal with predictive analytics. Twitter has implemented learning algorithms by placing them in Pig storage functions [11]; the storage functions are called in the final reduce stage of the overall dataflow.

• There is recent work on extensive support for ML in Hive [13]. The author tries to follow the same approach that Twitter implemented for Pig; here the machine learning models are treated as user-defined aggregate functions.

• A new data analytics platform, Ricardo, has been proposed that combines the functionality of R and Jaql. It basically takes advantage of the statistical computing features provided by R together with a high-level language that generates MapReduce jobs using Jaql [4].

5 Conclusions

In this literature review we have first introduced the MapReduce programming model, paying attention to its main drawbacks and to its main open-source implementation (Hadoop). After that we briefly described some high-level languages that try to address the problems mentioned, from different perspectives, focusing on those that are popular in the literature available at the time of writing (Pig Latin, HiveQL and Jaql) and some interesting alternatives (DryadLINQ and Meteor).
Based on the consistent and relevant studies reviewed, it is clear that there is no single language that beats all the other options. Jaql was created after the other two languages, and that probably gave it some advantages in its design. Based on the first criterion analysed, we can state that Jaql is expressively more powerful, since it includes basic flow control using if-else structures, whereas with the other two this is only possible using user-defined functions. However, we have seen that Jaql also shows the worst performance in the benchmarks described before. Pig and Hive probably perform better in those benchmarks because they support map-phase joins. Hive also adopts advanced optimization techniques for query processing that certainly speed up the resulting code.

Finally, we have seen how high-level languages for big data analytics are addressing some of the problems of this paradigm. Real-time processing demands
a very low latency of response, and this is one of the main disadvantages of the MapReduce model. In consequence, some new languages for large-dataset analytics that do not use this model have been designed. Additionally, some machine learning algorithms are difficult to implement using this model. Some alternatives have shown up in the last years; for instance, the Apache Software Foundation is developing Mahout, a library that implements scalable machine learning algorithms using the map/reduce paradigm.

References

[1] Alexander Alexandrov, Stephan Ewen, Max Heimel, Fabian Hueske, Odej Kao, Volker Markl, Erik Nijkamp, and Daniel Warneke. MapReduce and PACT - comparing data parallel programming models. In Proceedings of the 14th Conference on Database Systems for Business, Technology, and Web (BTW), BTW 2011, pages 25–44, Bonn, Germany, 2011. GI.

[2] Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, and Daniel Warneke. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pages 119–130, New York, NY, USA, 2010. ACM.

[3] Kevin S. Beyer, Vuk Ercegovac, Rainer Gemulla, Andrey Balmin, Mohamed Eltabakh, Carl-Christian Kanne, Fatma Ozcan, and Eugene J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. In Proceedings of the VLDB Conference, 2011.

[4] Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, and John McPherson. Ricardo: integrating R and Hadoop. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 987–998. ACM, 2010.

[5] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[6] Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, and Utkarsh Srivastava. Building a high-level dataflow system on top of map-reduce: the Pig experience. Proceedings of the VLDB Endowment, 2(2):1414–1425, 2009.

[7] Dan Gillick, Arlo Faria, and John DeNero. MapReduce: Distributed computing for machine learning. Berkeley (December 18, 2006), 2006.

[8] Arvid Heise, Astrid Rheinländer, Marcus Leich, Ulf Leser, and Felix Naumann. Meteor/Sopremo: An extensible query language and operator model. In Proceedings of the International Workshop on End-to-end Management of Big Data (BigData), in conjunction with VLDB 2012, 2012.

[9] Michael Isard and Yuan Yu. Distributed data-parallel computing using a high-level programming language. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 987–994. ACM, 2009.

[10] Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and Bongki Moon. Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4):11–20, 2012.

[11] Jimmy Lin and Alek Kolcz. Large-scale machine learning at Twitter. In Proceedings of the 2012 International Conference on Management of Data, pages 793–804. ACM, 2012.

[12] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330–339, 2010.

[13] Extension of Hive to support machine learning. HiveQL.

[14] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1099–1110. ACM, 2008.

[15] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of the 35th SIGMOD International Conference on Management of Data, pages 165–178. ACM, 2009.

[16] Pig. Fragment replicate join.

[17] Caetano Sauer and Theo Härder. Compilation of query languages into MapReduce. Datenbank-Spektrum, pages 1–11, 2013.

[18] Benchmarking standards. PigMix.

[19] Benchmarking standards. HiveQL.

[20] Robert Stewart. Performance and programmability of high level data parallel processing languages: Pig, Hive, Jaql & Java-MapReduce, 2010. Heriot-Watt University.

[21] Robert J. Stewart, Phil W. Trinder, and Hans-Wolfgang Loidl. Comparing high level MapReduce query languages. In Advanced Parallel Processing Technologies, pages 58–72. Springer, 2011.

[22] Ronald C. Taylor. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics, 11(Suppl 12):S1, 2010.

[23] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626–1629, 2009.

[24] Tom White. Hadoop: The Definitive Guide. O'Reilly Media, 2012.

[25] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pages 1–14, 2008.