
The 11th International Conference on Parallel and Distributed Computing, Applications and Technologies

Integrating DBMSs as a Read-Only Execution Layer into Hadoop

Mingyuan An, Yang Wang
Key Laboratory of Computer System and Architecture, Chinese Academy of Sciences
Institute of Computing Technology, Chinese Academy of Sciences
Graduate University of Chinese Academy of Sciences
Beijing, China
{anmingyuan, aaron}

Weiping Wang, Ninghui Sun
Key Laboratory of Computer System and Architecture, Chinese Academy of Sciences
Institute of Computing Technology, Chinese Academy of Sciences
Beijing, China
{wpwang, snh}

Abstract: To obtain the efficiency of a DBMS, HadoopDB combines Hadoop and DBMS, and claims superiority over Hadoop in terms of performance. However, HadoopDB simply puts MapReduce on top of unmodified single-machine DBMSs, which has several obvious weaknesses. In essence, HadoopDB is a parallel DBMS with fault tolerance, and it incurs unnecessary overhead due to the DBMS legacy. Instead of augmenting a DBMS with Hadoop techniques, we propose a new system architecture that integrates modified DBMS engines into Hadoop as a read-only execution layer, where the DBMS provides efficient read-only operators rather than managing the data. Besides the efficiency obtained from the DBMS engine, there are other advantages. The modified DBMS engine can directly process data from HDFS (Hadoop Distributed File System) files at the block level, which means that data replication is handled by HDFS naturally and block-level parallelism is easily achieved. A global index access mechanism is added in accordance with the MapReduce paradigm. The data loading speed is also guaranteed by writing the data directly into HDFS with simplified logic. Experiments show that our system outperforms both the original Hadoop and a HadoopDB-style system.

Keywords: Hadoop, database, large-scale data processing, global index access

I. INTRODUCTION

Google File System (GFS) [1] and MapReduce [2] were developed (or popularized) by Google for large-scale dataset storage and processing. GFS is a distributed file system optimized for large sequential read operations, and provides a fault tolerance mechanism in the data layer. MapReduce is a programming paradigm for parallel processing. Using MapReduce, the user can easily express the application task without the complexity of detailed parallel execution. The MapReduce runtime system parallelizes and schedules the job to make full use of the available resources in a large-scale parallel system, while providing a fault tolerance mechanism in the execution layer.

Because of their fault tolerance, high scalability and ease of use, the techniques underlying MapReduce and GFS are very attractive for large-scale data processing applications. Hadoop [3], an open source system implementing these techniques, has greatly increased their popularity, and many systems have been constructed on top of Hadoop.

The loose constraints on data schema and execution style in Hadoop give the user maximum flexibility. The user can implement the upper application in an unrestricted way. However, as a very thin layer with basic mechanisms and functionality, Hadoop is not efficient enough when it faces the user directly [4][5][6]. In common cases, it lacks many performance-critical optimizations such as compact data representation, helper structures, etc.

A database management system, by comparison, has an optimized implementation that greatly improves efficiency. It has a read-optimized storage format, sophisticated query execution, different kinds of indexes, data or query caches that better understand the semantics of the application, etc. However, the lack of fault tolerance is one of the most important reasons why a DBMS is not suited to large-scale data processing applications.

HadoopDB [7] puts a middleware layer between Hadoop and DBMSs, and so obtains fault tolerance from Hadoop. HadoopDB makes itself a parallel DBMS with fault tolerance, claiming the ability to support large-scale data processing applications. But this method simply takes complete DBMSs as the underlying storage and execution units, which has some problems.

First, with respect to fault tolerance, although HadoopDB can take advantage of MapReduce to achieve it in the execution layer, replication in the data layer is not fully implemented. In the experiments conducted in the HadoopDB project, the data replicas are maintained manually. Before starting the benchmark, the data are first split into chunks and replicated onto the nodes in batch mode using scripts of some kind. This approach obviously does not support online loading. Without fault tolerance in the data layer, the system will still suffer from failures, so it will not be truly scalable in a large-scale environment. Implementing the replication mechanism in the middleware on top of DBMSs would need great effort, amounting to a big project in the fault tolerance domain. All this makes HadoopDB hard to use in practice. In fact, HadoopDB is just a prototype that focuses on testing query execution performance with data replicas prepared in advance, rather than a complete system architecture solution.

978-0-7695-4287-4/10 $26.00 © 2010 IEEE. DOI 10.1109/PDCAT.2010.43
Second, the underlying single-machine DBMSs are unable to use a global index residing outside each single system, so performance may not be optimal for some kinds of queries. Maybe this can be handled in the middleware in some way, but that is still a detour.

Third, the data loading speed of a DBMS is slow due to very strict constraints on the data schema and semantics. The data usually need to go through complex logic before being written to storage, which is not necessary for many applications.

HadoopDB is constructed on a DBMS basis, so it incurs inevitable limitations due to the DBMS legacy. We believe that Hadoop has already done a lot to make itself a competent system for large-scale data processing applications, so the right way is to start from a Hadoop basis and borrow DBMS techniques when appropriate.

In this paper, we propose our approach to integrating DBMSs as a read-only execution layer into Hadoop. Based on Hadoop, we incorporate modified DBMS engines augmented with a customized storage engine that can directly access data from HDFS and make use of a global index access method. In this architecture, the DBMS provides efficient read-only operators instead of managing the data. The following benefits are obtained:

(1) With the data stored on HDFS, the fault tolerance problem in the data layer is solved naturally.
(2) The DBMS engine executes the sub-queries with the same efficiency advantage as in HadoopDB. In addition, based on HDFS and MapReduce, a global index mechanism can be put into action with the DBMS engines, and it significantly improves the performance of certain queries.
(3) As a read-only layer, the DBMS is not responsible for data loading. Instead, the data are loaded through a loader outside the DBMS, or the user can write the data directly to HDFS in a predefined manner. This raises the data loading speed to the raw speed of HDFS writing, while keeping the convenience and flexibility for the user.

The remainder of the paper is organized as follows: Section II introduces the related work; Section III analyzes the application requirements and existing systems, and then positions the desired system; Section IV describes our proposed system; Section V gives experimental results; Section VI concludes the paper.

II. RELATED WORK

With the wide deployment of Hadoop in the data analysis field, two types of work have appeared to improve Hadoop-based systems.

The first type addresses usability. Despite its flexibility, many users find the low-level language of MapReduce somewhat inconvenient compared to a higher-level language such as the SQL of a relational DBMS. So some systems have been developed on top of Hadoop. Facebook's Hive [8] and Yahoo's Pig [9] are examples of this kind. They provide simple declarative languages capable of expressing complex ad-hoc queries on structured data. Some other higher-level languages or system-level products have also been developed on top of either MapReduce or similar techniques [10][11]. These works mainly deal with the language problem.

The second type addresses efficiency. Some work has been done to improve the kernel of Hadoop [12][13][14][15]. Other work improves the system through external mechanisms [7][16]. The work on integrating DBMS and Hadoop comes from the thoughts in [5], which point out that the brute-force style of MapReduce is not optimal in terms of efficiency, and that something should be done to merge the techniques of MapReduce and DBMS. HadoopDB [7] is, as far as we know, the first attempt to merge the two systems. Although it pioneered the idea, HadoopDB still has limitations that become a bottleneck for applications in real scenarios.

III. SYSTEMS FOR LARGE-SCALE DATA ANALYSIS

In this section we first introduce the requirements of our large dataset analysis application. We believe that these requirements are typical of many other applications. Based on these requirements, we revisit existing systems for data processing. Finally, in terms of merging the techniques of both Hadoop and DBMS, we determine the position where our system should stand.

A. Application Requirements

Our case is a network security application. A number of monitors keep watching the whole network and generate sampling records for captured events. The generated data stream into the analytic platform in real time. Once stored in the system, the data should be available for ad-hoc read-only queries. This analytic platform is the one we focus on.

1) Large-scale Parallel Processing

Large-scale parallel processing power is a basic requirement for all large dataset analysis applications. Unlike the point operations of the key-value model in web services, analysis work usually needs to access a big part of the whole dataset even for a single query, so it must resort to large-scale parallel processing for raw power. In such an environment, automatic mechanisms are critical. Automatic parallelization, scheduling and fault handling free the user from heavy programming and maintenance work. All these play an important role in guaranteeing the scalability of the system.

2) High Efficiency

Efficiency is a necessary concern because each single query takes up so many resources, and higher efficiency can lower the cost.

Besides the common reasons, there is a special one in our case. In many other applications, the data have a short life cycle. The data are loaded into the system in batch mode, then some almost fixed queries are run on the data, and after that the data are removed or offloaded to an offline system. Under such conditions, organizing the data into some sophisticated structure is not worthwhile given the extra maintenance cost and the low utility. Sometimes the tasks on the data simply generate statistical reports as scheduled jobs, for example running only at night, so it
may be acceptable even if the execution is less efficient. These applications can be seen as data processing rather than data analysis.

By comparison, our application deals with ad-hoc analytic queries over a long-lived dataset. It is worthwhile to adopt optimized data structures and execution mechanisms to improve query efficiency. For example, the queries often have predicates on some attributes, for which using an index can reduce the execution cost in certain cases. Although index maintenance has an extra cost, this is amortized by repeated use. Ad-hoc queries also make a cache useful: queries on the same set of data may recur, so caching makes sense. Using an index also creates demand for an index data cache. In short, the long life cycle of the data and the ad-hoc queries justify the effort of optimization.

3) Continuous High Speed Data Loading

The data in our application stream in continuously at relatively high speed. This requires the system to be capable of loading data at high speed in an online mode. The data can be stored in an append-only manner, and once stored they are never updated. High-speed online loading requires the logic on the loading path to be simple enough. Unnecessarily strict semantic checking should be avoided.

B. Existing Systems Reconsidered

In response to the requirements of the application, we consider three types of systems: database management systems, Hadoop, and HadoopDB. If we treat the techniques of DBMS and Hadoop as two extremes, there should be a broad spectrum between them. To satisfy the requirements of the application, it is right to draw strength from both. HadoopDB is actually a DBMS equipped with some Hadoop techniques. However, we believe that our system should start from the other side.

1) Existing Systems

DBMS: DBMSs have a long history in data management. The irreplaceable domain of the DBMS is transaction processing, where the ACID properties must be guaranteed. In data analysis, DBMSs also hold an important position. Parallel DBMSs are very popular in the data warehouse market. In this domain, a parallel DBMS provides a high degree of parallelism and achieves good performance for analytic queries.

For parallel processing, a parallel DBMS only competes at a limited scale. The serious problem with DBMSs is that most of them are not fault tolerant. If something goes wrong during the execution of a query, the query has to restart entirely. In a large-scale system consisting of thousands of components, a long-running query will never succeed given the high failure rate. While parallel DBMSs are competent for data warehousing, they are not suitable for large-scale applications in this respect. The largest DBMS-based analytic system we know of consists of only 100 machines.

For efficiency, the DBMS embodies decades of academic and industrial research. Optimizations have been applied to every layer and component of the DBMS, such as optimized data storage formats, diverse access methods, sophisticated query execution, efficient data caches, etc. Many of these techniques are widely copied and reinvented by other systems [17]. This is the advantage of the DBMS, and it is desirable for large dataset analysis.

For data loading, DBMS-based systems have a limitation. Due to strict constraints such as the ACID properties, the system cannot load data efficiently, especially for online loading. The online loading speed of any DBMS node is lower than 10MB/s, as far as we know. Systems in data warehouse applications are often offline systems with very weak online loading requirements, in which the common case is to load data in batches at regular points in time.

Hadoop: Many applications replace the DBMS with Hadoop for a couple of reasons. One of the most important is that Hadoop is scalable due to its fault tolerance. In addition, Hadoop is easier to deploy and use. For data processing tasks, MapReduce provides a very simple and flexible parallel programming paradigm, and is able to express complex queries.

As to parallel processing, Hadoop was born for it. The MapReduce run-time system is fully able to parallelize the whole processing in large-scale systems. Both MapReduce and HDFS are completely fault tolerant, making the whole system highly scalable. This is one of the most important properties of Hadoop, and is critical for large dataset analysis. In addition, the block-level replication of HDFS gives MapReduce great opportunities for a high degree of parallelism and fine-grained execution fault tolerance, which improves performance significantly.

For efficiency, Hadoop seems to have a long way to go. MapReduce works in a brute-force way. Whatever the query is, it has to scan all the data without helper structures such as an index. The scan and processing are not guaranteed to be efficient, because it is the user's job to implement the details. The data are often stored on HDFS in text format, which is straightforward for the user but not compact. In these respects, Hadoop is not as good as a DBMS, at least for now.

As to data loading, writing data directly to HDFS can be guaranteed to run at an acceptable speed. For many large-scale data analysis applications, including ours, weak consistency is enough. The data can go through without complex logic, so writing structured data can achieve the same speed as writing an unstructured byte stream. Although the speed can be lowered when replicas are stored, this is an inevitable tradeoff for fault tolerance on the data.

HadoopDB: DBMS and Hadoop each have their own strengths in the appropriate domain. To better satisfy emerging applications, the DBMS must incorporate fault tolerance in order to be scalable, while Hadoop should borrow techniques from the DBMS to improve efficiency. HadoopDB is constructed based on this idea. DBMSs are taken as the storage and execution units, and the MapReduce mechanism takes responsibility for parallelization and fault
tolerance on top of the underlying DBMSs. Fig. 1 briefly shows the architecture of HadoopDB. HDFS is used to store the system metadata and the result set of the query. All source data are stored in DBMSs. When executing a query, Maps are scheduled to the nodes according to the metadata, which tells the location of each block of the data. Maps issue SQL queries to the underlying DBMSs and emit the result records to Reduces. Reduces aggregate result sets from multiple nodes, and write the final results onto HDFS.

Figure 1. Architecture of HadoopDB.

For parallel processing, HadoopDB is only partially fault tolerant. MapReduce only guarantees fault tolerance in the execution layer. The data are stored in DBMSs rather than HDFS, so the availability of the data has to be handled specially. A common DBMS pays no special attention to fine-grained data replication for intra-query parallelization, without which the MapReduce framework cannot completely exploit the parallelism. HadoopDB uses a batched approach to dump the data out of the database and replicate them to some other nodes, which does not support online loading at all. This functionality can be achieved in the middleware, but implementing fine-grained replication on top of DBMS tables needs non-trivial work, which makes it hard to use in practice.

For efficiency, the advantage of the DBMS is reflected in the integrated system, because each query is translated into sub-queries that are actually executed by individual DBMS query engines. Despite the improvement, there is still one limitation concerning global structures. Because all DBMSs in HadoopDB are unmodified single-machine versions, this layer cannot make use of any global structure, such as a global index, which makes sense when the query has predicates of high selectivity.

Finally comes the data loading requirement. In HadoopDB, the DBMS-based storage obviously inherits the DBMS limitation. The data have to be loaded through the complete DBMS logic, so it is difficult to improve the loading speed.

2) Discussion

We have analyzed two typical systems and the integrated HadoopDB. Lack of fault tolerance eliminates traditional parallel DBMSs from the candidates for large-scale data processing applications. However, HadoopDB, in essence a fault-tolerant parallel DBMS, becomes a promising representative from the DBMS camp, which seems capable of supporting this kind of application.

HadoopDB merges techniques from both DBMS and Hadoop, but it can hardly be used for such applications due to the DBMS legacy and the further implementation work required. So we must now identify two different ways of merging DBMS and Hadoop techniques, which helps in positioning the desired system. The difference between the two ways is the starting point for constructing an integrated system: one starts from the DBMS, the other from Hadoop.

HadoopDB tries to introduce fault tolerance and fine-grained parallelism into the parallel DBMS, so it belongs to the first category. While high efficiency is preserved, all the strict constraints of the DBMS are also inherited. HadoopDB is desirable for applications where the strict schema and semantics of the data are given high priority. So it is capable of dealing with traditional database applications. After all, HadoopDB means Hadoop database, not database Hadoop.

The system we need should start from the other side. Hadoop satisfies the majority of our needs except for efficiency, so we should integrate DBMS techniques into a Hadoop-based system, rather than the reverse. MapReduce and HDFS were both developed for large-scale data processing applications, and it is only efficiency that needs special attention. Hence, we should position our system closer to the Hadoop side, while being ready to incorporate desired properties from the DBMS.

C. Our Approach

Unlike HadoopDB, we take DBMSs as read-only execution components. For a specific query, the dataset on HDFS is split logically into blocks as usual, and each block is assigned an executor, which is now a database execution thread; all intermediate results computed by the database engines are aggregated by Reducers, which are the same as before; final results are written onto HDFS naturally. Using this approach, parallel execution still has block granularity, and fault tolerance is guaranteed in both the data layer and the execution layer.

Efficiency now depends on the DBMS layer. Many techniques in the database, such as the data cache, the query cache and optimized operators, still take effect. But the data access methods differ partially from before due to the customized storage engine using HDFS. The index access mechanism must be reconsidered and adapted to work in this situation.

The data loading process is intuitive. Streaming data can be packed into an optimized format like that of a DBMS, and then written directly onto HDFS, bypassing the transaction logic. With this simple logic, loading does not cost much.

IV. DBMS ENGINE INTEGRATED HADOOP SYSTEM

In this section, we give a detailed description of the system constructed for our application. We first give an overview of the system architecture, and then focus on the query execution process. Besides the familiar full scan execution, a global index access mechanism in the MapReduce framework is introduced.
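The execution flow described under "Our Approach" can be sketched in a few lines. The following is only an illustration under stated assumptions, not the authors' implementation: an in-memory `sqlite3` database stands in for the per-node database engine thread, a Python list stands in for the HDFS file, and the names `BLOCK_ROWS`, `split_into_blocks`, `map_block` and `reduce_partials` are hypothetical. In the paper the engine reads the block in place from HDFS; here each block is copied into the stand-in database purely to keep the sketch self-contained.

```python
import sqlite3
from collections import defaultdict

BLOCK_ROWS = 4  # hypothetical block size in records, tiny for illustration


def split_into_blocks(records, block_rows=BLOCK_ROWS):
    """Logically split the dataset into fixed-size blocks, as MapReduce does."""
    return [records[i:i + block_rows] for i in range(0, len(records), block_rows)]


def map_block(block):
    """One Map task: run the SQL sub-query on a single block.

    The database already aggregates within the block, so the Map task
    only forwards the partial results.
    """
    conn = sqlite3.connect(":memory:")  # stand-in for a database engine thread
    conn.execute("CREATE TABLE events (system_id INTEGER, in_bytes INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", block)
    rows = conn.execute(
        "SELECT system_id, SUM(in_bytes) FROM events GROUP BY system_id"
    ).fetchall()
    conn.close()
    return rows


def reduce_partials(partials):
    """Reduce step: merge the per-block partial sums into final totals."""
    totals = defaultdict(int)
    for rows in partials:
        for key, partial_sum in rows:
            totals[key] += partial_sum
    return dict(totals)


def run_query(records):
    """Split, run one sub-query per block, then aggregate."""
    return reduce_partials(map_block(b) for b in split_into_blocks(records))


records = [(1, 10), (2, 5), (1, 7), (3, 1), (2, 2), (1, 3), (3, 9), (2, 4), (1, 1)]
print(run_query(records))
```

Because SUM is decomposable, merging the per-block partial sums gives the same totals as a single pass over the whole dataset, which is why the Reducers can stay unchanged.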
A. Overview

The system consists of four parts, as shown in Fig. 2. At the bottom is the storage layer, HDFS. On top of HDFS are the database query engines, which act as the executors. At the top is the MapReduce system. The middleware layer contains the data loader, the indexer, etc.

Figure 2. Architecture.

The data loader stores the incoming data onto storage in a simple way. The data are packed in binary format into pages, each of which is the smallest I/O unit of the database query engine. The page size is fixed by a system parameter, and is 32KB by default. Using a binary format reduces the occupied disk space compared to the text representation, so it improves both loading and query performance. The binary format is also obeyed by the customized storage engine in the database when parsing the data into tuples. Note that although a binary format is used for structured data, the CPU cost of loading is not increased; in fact, it is more efficient than using text format. The data are replicated automatically by HDFS at block granularity. The default block size is 64MB, and is configurable. The HDFS block size is an integer multiple of the page size for ease of implementation.

The indexer can create some kinds of index on the loaded data in batch mode. Because HDFS is append-only, it would be complex to build an updatable index in real time. In practice, the data are often queried with a time range predicate, so creating separate indexes along the time dimension on a periodic basis is an acceptable solution in real scenarios. We support the B+-tree index for now. The B+-tree index searches all the data across the cluster, so it is a global index structure. The index data are also stored in HDFS, and can be seen by each database executor. The detailed index structure and the index access method in the MapReduce framework are described in Section IV.C.

The database executors are actually MySQL server threads. The MapReduce run-time system schedules a sub-task (Map instance) to a specific node, on which the sub-task issues a SQL query to the underlying MySQL server on that node. We implement a new storage engine for MySQL so that the query engine can get tuples from HDFS data files. Some tricks are applied to make the query engine capable of accessing tuples from a specific block of the HDFS file, which provides the ability to execute at the block level and fits DBMSs into the MapReduce execution framework very well.

From the perspective of the whole system, we embed modified database engines into Hadoop rather than just gluing them together as HadoopDB does, which yields a more coordinated system.

B. Query Execution

Fig. 3 describes the framework of query execution. The query is first translated into sub-queries expressed in SQL. The sub-queries are executed by the database engine threads. Besides the operations to apply, each sub-query also indicates the position of the data block that the database engine thread should process. This position information is produced by the splitting process on the source data file, and is used by the MapReduce runtime system to schedule the sub-tasks. Each sub-query is passed to a Map instance on a specific node, where the position parameters in the sub-query are set to the corresponding splitting result values.

Figure 3. Query execution.

Each Map instance issues the SQL sub-query to the local database engine thread, and emits the results returned by the database. The Map instance does not need to aggregate the local intermediate results, because the sub-query executed by the database has already done this. The most critical part of this process is the customized storage engine, which provides the ability to access HDFS data at the block level.

1) Customized Storage Engine

The query engine accesses the data through the storage engine using a collection of routines in an iterator style. init() is invoked first, every time the executor wants to access the data. After that, the get_next() routine is called repeatedly by the executor; it returns one tuple each time, and the executor applies the operations to the stream of tuples. When the query is finished, a close() function is called, which cleans up the context.

We store the dataset on HDFS, so the first thing we do is implement the routines using the HDFS API. The implementation is almost the same as that for a local file system, except that all file system calls are replaced by their HDFS counterparts. The data are page formatted, and reading is done page by page. The predefined data format is obeyed when parsing tuples out of a page. A data
cache is also implemented in the storage engine, using the usual LRU eviction algorithm.

The schema definition of the HDFS dataset must be registered in the database, so that queries can be executed by the database without syntax or semantic exceptions. This is achieved through the cooperation of the data loader, the database query engine and the storage engine. When a new table is to be created, the data loader creates the necessary data files for this table and issues 'create table' to all nodes. The database query engine processes this DDL (Data Definition Language) query and records the information about the table in the metadata. Then the query engine calls the create() function of the storage engine with the necessary parameters. Normally, create() would create the data files and other required data structures, but in our implementation it just opens the data files already created by the data loader and initializes the context. After the command has been executed, the data loader is ready to load incoming data into the new table data file, and all databases are available for queries on this table.

Every database query engine is now able to see the whole dataset on HDFS. However, it can only execute a query at the table level, so one cannot specify which part of the table the query should process. This is not appropriate for MapReduce. The MapReduce framework logically splits the dataset into blocks and assigns each Map instance a block to process. To fit into this paradigm, the database must be able to process at the block level rather than the table level. Here we make use of a pseudo column to achieve this goal. Besides the columns for the data, we introduce an additional pseudo column blk, which exists in the metadata of the database but is not actually stored. This column is used to pass parameters describing the position of the data block that needs processing. Once its data block is determined, the Map instance adds a predicate on the pseudo column to the where clause of the SQL query. When this modified query is executed, the position constants are sent to the storage engine, so only the tuples in the indicated storage range are read in. During this process, the query engine works as before, unaware of the restriction.

At this point, the database engines are well integrated into the Hadoop framework.

C. Global Index Mechanism

One of the useful auxiliary data structures in a DBMS is the index. For certain queries, index-assisted execution can improve efficiency. For example, with a predicate of high selectivity, the qualified tuples for the query are only a small part of the whole. A brute-force scan of the whole dataset wastes too much work. If there is an index on the predicate attribute, using the index to retrieve the qualified tuples directly may save a lot.

In a parallel DBMS, there are two kinds of indexes in terms of locality. A local index resides on one node and only searches the local dataset on the same node; a global index has references to the data across the whole cluster, and the global index itself is usually distributed across all the nodes. When using local indexes, all nodes must be launched, because all the local indexes search the same value domain. The control overhead, such as the setup and cleanup of sub-tasks, will occupy a big part of the total running time. By comparison, when using a global index, only those nodes that hold qualified index entries need to be started. However, global index access incurs global communication, which should not be neglected.

HadoopDB consists of single-machine DBMSs, which are unable to make use of a global index, so only the local index mechanism is supported.

In our integrated system, the data are replicated and distributed by HDFS, so the DBMS layer has no knowledge of the locality of the data, which means that a local index does not make sense. We choose to implement the global index mechanism. The index mechanism must take the MapReduce execution style into account, so that index access can be parallelized in this framework.

1) Index Creation

The indexer is responsible for building the B+-tree index on the dataset, and the index file is stored in HDFS, so it can be accessed by all the nodes. Because there will only be read requests, the index does not need to support update operations. So the entries in each B+-tree node are densely packed, leaving no free space for later inserts. We create the B+-tree index as follows: first, sort the (value_on_index_attribute, offset) pairs from all the records and write them sequentially into the index file, which directly forms the leaf nodes of the tree; then scan the leaf nodes, create all the intermediate nodes and the root node in a bottom-up fashion, and append them to the index file. A traditional B+-tree index may be created through insert operations in an online mode, which leaves the leaf nodes not physically contiguous in the file, but connected by pointers. Our approach guarantees that the leaf nodes occupy a contiguous range of space in the index file, which facilitates parallel access to these leaf nodes during a query. The structure of the index is illustrated in Fig. 4.

Figure 4. Leaf nodes occupy continuous space in the index file and Maps work on selected leaf node data blocks.
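The bottom-up construction in "1) Index Creation" can be illustrated with a small in-memory model. This is a minimal sketch under assumptions, not the paper's on-disk format: a Python list stands in for the index file, the record's position in the input list stands in for its offset, and `FANOUT`, `build_index` and `range_lookup` are hypothetical names. A real node would be a page holding far more entries.

```python
import bisect

FANOUT = 4  # hypothetical number of entries per node, tiny for illustration


def build_index(records, key_of):
    """Bulk-build a dense-packed, read-only B+-tree-like index.

    Returns (nodes, num_leaves, root_id). `nodes` models the index file:
    the leaf nodes come first and are contiguous, then each upper level
    is appended, ending with the root.
    """
    if not records:
        raise ValueError("cannot index an empty dataset")
    # Step 1: sorted (value_on_index_attribute, offset) pairs form the leaves.
    pairs = sorted((key_of(rec), offset) for offset, rec in enumerate(records))
    nodes = [pairs[i:i + FANOUT] for i in range(0, len(pairs), FANOUT)]
    num_leaves = len(nodes)
    # Step 2: build the upper levels bottom-up.
    # Each internal entry is (smallest key under the child, child node id).
    level = list(range(num_leaves))
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), FANOUT):
            children = level[i:i + FANOUT]
            nodes.append([(nodes[c][0][0], c) for c in children])
            next_level.append(len(nodes) - 1)
        level = next_level
    return nodes, num_leaves, level[0]


def range_lookup(nodes, num_leaves, root_id, lo, hi):
    """Return the offsets of all records whose key lies in [lo, hi]."""
    node_id = root_id
    while node_id >= num_leaves:  # internal node: descend towards lo
        entries = nodes[node_id]
        min_keys = [key for key, _ in entries]
        # Last child whose smallest key is strictly below lo, else the first.
        pos = max(bisect.bisect_left(min_keys, lo) - 1, 0)
        node_id = entries[pos][1]
    offsets = []
    for leaf_id in range(node_id, num_leaves):  # leaves are contiguous
        for key, offset in nodes[leaf_id]:
            if key > hi:
                return offsets
            if key >= lo:
                offsets.append(offset)
    return offsets


ports = [80, 22, 443, 80, 8080, 22, 53, 80, 25, 443]
nodes, num_leaves, root_id = build_index(ports, key_of=lambda port: port)
print(range_lookup(nodes, num_leaves, root_id, 80, 80))
```

Since the leaves are written first and in key order, a lookup needs the upper levels only to find its starting leaf, and then scans forward through neighbouring leaves. That contiguous layout is what allows a range of leaf nodes to be divided among several Map tasks.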
2) Index Access

To support the index access method, we add another pseudo column idx to the table schema and modify the storage engine implementation accordingly. In the case of index access, the idx and blk attributes are used together to instruct the storage engine.

When the query has a predicate of high selectivity on the indexed attribute, the index access method is chosen. Before starting the MapReduce tasks, several traversals of the index are made to locate the start and end positions in the leaf nodes for each predicate value. The index entries in the leaf nodes between the start and end positions are those pointing to the records that satisfy the predicate. If the predicate is a range, only two traversals are needed. Because the height of the tree is usually very low, this process does not take much time compared to the later processing using MapReduce.

According to the start and end positions, Maps are generated and attached to the leaf node pages in the selected ranges, as shown in Fig. 4. Each Map adds the predicates on blk and idx to the where clause of the SQL query. The idx parameter specifies the index to be used, and the blk parameters now indicate the start and end offsets within the leaf nodes. During the execution of the sub-query, the storage engine scans the index entries in the selected range of index leaf nodes and retrieves each tuple using the offset in the index entry. The other phases of the execution are the same as in the full scan case.

In this execution mode, the number of Maps is related to the selectivity of the predicate. Using the index, a minimal number of Maps are generated and only the qualified records are read in.

D. Summary

We have proposed a new system architecture integrating modified database engines as a read-only execution layer into Hadoop. Data replication is handled naturally by HDFS. The modified database engine is able to process the data from an HDFS file at the block level, so it fits very well into MapReduce. The global index access mechanism is added according to the MapReduce paradigm. The loading speed can also be guaranteed using HDFS. This integrated system satisfies the needs of our application well.

The essential difference from HadoopDB is that we construct our system on a Hadoop basis rather than a parallel DBMS basis. The DBMS provides us with efficient operators, while the management of data is handled by other Hadoop-based components.

V. EXPERIMENTS

A. Configurations

The experiments are conducted on a cluster of 15 nodes connected by gigabit Ethernet, which is part of a public computing platform. Each node has two dual-core AMD Opteron™ Processor 275, 8GB DRAM, and a 136GB SCSI disk. The kernel of the operating system is Linux 2.6.9-4.2.ELsmp x86_64. The bandwidth of the local file system sequential I/O is about 60MB/s. Hadoop 0.19.2 is set up on the cluster, and one MySQL server of version 5.0 is running on each individual node.

The benchmark is from our application. Although it comes from a specific domain, the data schema and operations are common to many other applications. The data schema is a table with 9 integer columns: time, systemID, deviceID, eventType, port, inBytes, outBytes, inPackets, and outPackets. time is the number of seconds since the Epoch. It starts from 1235750430 in the benchmark and increases by 1 every 131072 records. systemID is uniformly distributed in the integer range [1, 15]. deviceID and eventType are uniformly distributed in the integer ranges [1, 50] and [100000000, 100000014] respectively. port, inBytes, outBytes, inPackets and outPackets are all uniformly distributed in the integer range [0, 65535]. The whole dataset has about 471,859,200 records over a time range of one hour.

The queries we use are

    SELECT truncate(time/60, 0), systemID, deviceID, eventType,
           sum(inBytes), sum(outBytes), sum(inPackets), sum(outPackets)
    FROM table [WHERE port IN (port_list)]
    GROUP BY truncate(time/60, 0), systemID, deviceID, eventType

where the where clause may vary in different experiments. This query groups and summarizes the data by the systemID, deviceID and eventType attributes at minute granularity, with a predicate on the port attribute.

We compare three systems: Hadoop, a HadoopDB-like system (HadoopDB-L for short) implemented on top of MySQL, and our database engine integrated Hadoop system (DBEHadoop for short). The data in Hadoop are in text format with columns separated by space characters, and the whole dataset in the benchmark occupies about 25GB. The data in HadoopDB-L and DBEHadoop are in paged binary format with each attribute value occupying four bytes, and the whole dataset occupies about 15GB. All systems are configured with 2 replicas at 64MB block granularity for the data, so the actual storage space for the dataset doubles. The data of HadoopDB-L are manually replicated across all nodes, and each block of 64MB data is simulated using a separate table. For each block (table), there is a local index on the port attribute. In DBEHadoop, a global index on port is created. Because the cluster is part of a public platform, we set the maximum number of parallel Maps per node to 3, which is less than the number of cores in the node.

B. Query without Predicate

The first experiment is on the query without a where clause, so all the data needs to be scanned. The result set contains 2,287,500 records. The system buffer is cleaned before the execution.

Fig. 5 shows the running time for the three systems to execute this query. The execution time of Hadoop is much longer than that of HadoopDB-L and DBEHadoop. Two factors affect the performance of Hadoop. The first is the raw size of the data file, and the second is the CPU efficiency during the execution. The raw size of the data file in Hadoop is larger than in the two other systems, so the
disk I/O is larger in Hadoop. In addition, a larger data file needs more Maps to process, which incurs more control overhead. In terms of CPU efficiency, due to the Java language, Hadoop is less I/O bound when processing a large number of records with many columns like the one in our case, especially in text format. For this query, Hadoop takes much more CPU time.

Figure 5. Running time for the query without predicate. [Bar chart: running time in seconds for Hadoop, HadoopDB-L, and DBEHadoop.]

The performance of HadoopDB-L and DBEHadoop is almost the same. They have similar data formats, so the data sizes are similar. Reading data through HDFS does not cause much overhead for DBEHadoop because of the paged sequential I/O, which makes the two systems essentially the same in query execution. Using the database for the underlying execution improves the CPU efficiency, so these systems are more I/O bound than Hadoop.

C. Query with Predicate

We now add a predicate to the query, so that it only processes the records whose port value is in the port_list. When a predicate of high selectivity exists, there is a chance to use an index to accelerate the execution. We therefore conduct this set of experiments under high-selectivity predicates, where an index makes sense. We select 1, 10 and 20 ports respectively to repeat the experiment. Because the value of port is uniformly distributed, the number of selected ports rather than the specific values determines the performance. The system buffer is cleaned before execution.

Fig. 6 shows the result of the three systems using full scan execution, which scans all the data and applies the predicate to qualify the records. The time is shorter than that without the predicate for each system, because fewer records need processing after the predicate is applied. Full scan has the same performance in all cases for each specific system, because with only a small number of qualified records, the I/O and control costs dominate the total time. Hadoop takes more time than the other systems mainly because of the larger data size (more I/O and more Map instances). HadoopDB-L and DBEHadoop have similar performance.

Figure 6. Running time for full scan execution. [Chart: running time in seconds versus number of ports selected (1, 10, 20) for Hadoop, HadoopDB-L, and DBEHadoop.]

Next we compare the performance of HadoopDB-L and DBEHadoop with index assistance. Fig. 7 shows the result. For the case of one port, index access is better than full scan, because random reads of a very small number of records outperform the sequential scan of the large dataset. For HadoopDB-L, index access is only a little better than full scan. This is because in both modes each data block needs a Map that has little work to do due to the high selectivity, so control overhead takes a considerably large part of the total running time. DBEHadoop is much more efficient than HadoopDB-L because of the very small number of Maps needed. For the case of 10 ports, the performance of index access is almost the same as full scan execution, which indicates that index access will not be superior from this point on. For HadoopDB-L, the cost of random reads offsets the benefit of the small I/O volume. For DBEHadoop, random reads and communication cost together offset the benefit. For the case of 20 ports, the DBEHadoop index access method is much more expensive than the full scan method. Compared to the local index, the cost of global index access increases more evidently because of the network communication in addition to the random disk I/O, which offsets the benefit of the small number of Maps. However, in these conditions, the full scan method will be chosen.

Figure 7. Full scan vs. index access. [Chart: running time in seconds versus number of ports selected (1, 10, 20) for HadoopDB-L full scan, HadoopDB-L index access, DBEHadoop full scan, and DBEHadoop index access.]

D. Execution under Warm Buffer

In many cases, the same set of data may be queried by different users, so it is necessary to evaluate the performance under warm buffer. The previous experiments
are repeated in the same way, except that the data are already in the system buffer cache.

Fig. 8 shows the result for full scan execution. Without the predicate, the time for Hadoop is similar to that under cold buffer, due to its CPU-bound behavior. HadoopDB-L and DBEHadoop are faster than in the cold buffer case. With a predicate of high selectivity, all three systems take much less time than in the cold buffer case, which is a result of saving on I/O cost. The difference in performance between Hadoop and the other two systems becomes smaller, because the reduced number of operations hides the inefficiency of Hadoop to some degree. HadoopDB-L and DBEHadoop have similar performance in all cases.

Figure 8. Running time for full scan execution under warm buffer. [Chart: running time in seconds versus number of ports selected (all, 1, 10, 20) for Hadoop, HadoopDB-L, and DBEHadoop.]

Fig. 9 gives the result for index access execution. Unlike in the cold buffer case, the DBEHadoop index access method achieves the same performance as the full scan method when 20 ports are selected, while the HadoopDB-L index access method is almost the same as the full scan method in all cases. Without I/O cost, the small number of Maps is the main contributor to the improvement in performance.

Figure 9. Running time for index access execution under warm buffer. [Chart: running time in seconds versus number of ports selected (1, 10, 20) for HadoopDB-L and DBEHadoop full scan, HadoopDB-L index access, and DBEHadoop index access.]

E. Data Loading

In the cluster, we start several loaders, one on each node, to test the streaming data loading speed. The loader just generates the data itself and loads it into the system. Doing it this way ignores the overhead of communication between the data source and the loader, and thus evaluates the raw loading speed of the underlying system. For HadoopDB-L, the loader loads the records into the local MySQL server through prepared batch inserts using JDBC, and no replication is considered. For DBEHadoop, the loader loads records through routines that format the data into pages and write them onto HDFS, and a replication of degree 2 is automatically maintained by HDFS.

Fig. 10 gives the result, from which we can see that DBEHadoop is much faster than HadoopDB-L. HadoopDB-L must load the data through DBMS logic, so the speed is very low, especially in the online mode for streaming data. Although only the MySQL server is used here, the speed is in the same order of magnitude for other DBMSs as far as we know. DBEHadoop preserves the high loading speed of Hadoop through direct writing to HDFS. The replication mechanism causes extra overhead, and when multiple loaders work in parallel, the overhead increases due to increased network traffic. However, the loading speed is fast enough for common applications.

Figure 10. Data loading speed. [Chart: loading speed (MB/s) versus number of loaders (1 to 15) for HadoopDB-L and DBEHadoop.]

F. Summary

Through these experiments, we show that DBEHadoop is as efficient as HadoopDB-L for full scan queries, and that for queries with a predicate of high selectivity, the global index access mechanism adopted in DBEHadoop is much more efficient than HadoopDB-L. For data loading, DBEHadoop achieves very good performance, which is far better than HadoopDB-L.

VI. CONCLUSION

Hadoop and DBMS are not ideal for large dataset analysis. HadoopDB, as an integrated system merging techniques from both, is promising, but still limited for reasons that are difficult to overcome. We believe that it is the way in which HadoopDB is constructed that makes it hard to satisfy the emerging applications. Taking a Hadoop basis, rather than a DBMS basis, and incorporating DBMS techniques is the right way to construct systems for large-scale data processing applications.
We propose a new system architecture integrating modified DBMS engines as a read-only execution layer into Hadoop, where the DBMS plays the role of providing efficient operators instead of managing the data. Besides sharing the advantages of HadoopDB, our system solves the limitations posed by HadoopDB in real scenarios. The HDFS-based storage solves the fault tolerance problem in the data layer. The modified database engine is able to process data from an HDFS file at the block level, so it fits very well into MapReduce. A global index access mechanism adapted to the MapReduce paradigm is added and shows better performance than HadoopDB for certain queries. The proposed system preserves the advantage of Hadoop in data loading speed, which is far better than HadoopDB. All these properties make the system more appropriate for large-scale dataset analysis applications.

ACKNOWLEDGMENT

We would like to thank the anonymous reviewers for their valuable feedback on this work. This research is supported by National Natural Science Foundation of China (Grant No. 60903047).

REFERENCES

[1] S. Ghemawat, H. Gobioff, and S-T. Leung, “The Google file system,” in Proc. SOSP’03, 2003, p. 29.
[2] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51 (1), pp. 107-113, Jan. 2008.
[3] Hadoop website. [Online]. Available:
[4] Andrew Pavlo, Erik Paulson, Alexander Rasin, “A comparison of approaches to large-scale data analysis,” in Proc. SIGMOD’09, 2009, p. 165.
[5] Daniel J. Abadi, “Data management in the cloud: limitations and opportunities,” Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2009.
[6] Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, Alexander Rasin, “MapReduce and parallel DBMSs: friends or foes?” Communications of the ACM, vol. 53, no. 1, 2010.
[7] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin, “HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads,” in Proc. VLDB’09, 2009.
[8] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain and Zheng Shao, “Hive – A warehousing solution over a MapReduce framework,” in Proc. VLDB’09, 2009.
[9] Christopher Olston, Benjamin Reed and Utkarsh Srivastava, “Pig Latin: A not-so-foreign language for data processing,” in Proc. SIGMOD’08, 2008.
[10] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey, “DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language,” 2008.
[11] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, “Interpreting the data: Parallel analysis with Sawzall,” Scientific Programming, vol. 13, no. 4, 2005.
[12] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, Ion Stoica, “Job scheduling for multi-User MapReduce clusters,” technical report No. UCB/EECS-2009-55, 2009.
[13] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, Russell Sears, “MapReduce online,” technical report No. UCB/EECS-2009-136.
[14] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, Christos Kozyrakis, “Evaluating MapReduce for multi-core and multiprocessor Systems,” in Proc. HPCA’07, 2007.
[15] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica, “Improving MapReduce performance in heterogeneous environments,” in Proc. OSDI’08, 2008.
[16] Jimmy Lin, Shravya Konda, Samantha Mahindrakar, “Low-latency, high-throughput access to static global sources within the Hadoop framework,” HCIL Technical Report HCIL-2009-01, 2009.
[17] Joseph M. Hellerstein, Michael Stonebraker, James Hamilton, “Architecture of a database system,” Foundations and Trends in Databases, vol. 1, no. 2, pp. 141–259, 2007.