Transcript of "Distributed parallel architecture for big data"
116 Informatica Economică vol. 16, no. 2/2012 Distributed Parallel Architecture for "Big Data" Catalin BOJA, Adrian POCOVNICU, Lorena BĂTĂGAN Department of Economic Informatics and Cybernetics Academy of Economic Studies, Bucharest, Romania email@example.com, firstname.lastname@example.org, email@example.comThis paper is an extension to the "Distributed Parallel Architecture for Storing and Pro-cessing Large Datasets" paper presented at the WSEAS SEPADS’12 conference in Cam-bridge. In its original version the paper went over the benefits of using a distributed parallelarchitecture to store and process large datasets. This paper analyzes the problem of storing,processing and retrieving meaningful insight from petabytes of data. It provides a survey oncurrent distributed and parallel data processing technologies and, based on them, will pro-pose an architecture that can be used to solve the analyzed problem. In this version there ismore emphasis put on distributed files systems and the ETL processes involved in a distribut-ed environment.Keywords: Large Dataset, Distributed, Parallel, Storage, Cluster, Cloud, MapReduce,Hadoop1 Introduction A look back to the immediate historyshows that the data storing capacity is con- years, figure 2. As a fact, in 1981 you must use 200 Seagate units, each having a five megabytes capacity and costing 1700$, totinuously increasing, while the cost per GB store one gigabyte of data.stored is decreasing, figure 1. The first harddisk came from IBM in 1956, it was calledThe IBM 350 Disk Storage and it had a ca-pacity of 5 MB, [10-11]. In 1980, IBM 3380breaks the gigabyte-capacity limit providingstorage for 2.52 GB. After 27 years, HitachiGST that acquired IBM drive division in2003, deliver the first terabyte hard drive. Af-ter only two years, in 2009, Western Digitallaunches industry’s first two terabyte hard Fig. 2. Average $ cost per GB (based ondrive. In 2011, Seagate introduces the )world’s first 4TB hard drive, . The reduce costs for data storage has repre- sented the main premise of the current data age, in which it is possible to record and store almost everything, from business to personal data. As an example, in the field of digital photos, it is estimated that around the world has been taken over 3.5 trillion photos and in 2011 alone have been made around 360 billion new snapshots, . Also, the Fig. 1. Average HDD capacity (based on availability and the speed of Internet connec- ) tions around the world have generated in- creased data traffic that is generated by aIn terms of price , the cost per gigabyte wide range of mobile devices and desktopdecreased from an average of 300.000 $ to a computers. Only for mobile data, Cisco ex-merely an average of 11 cents in the last 30 pects that mobile data traffic will grow from
Informatica Economică vol. 16, no. 2/2012 1170.6 exabytes (EB) in 2011 to 6.3 EB in 2015, ment a cluster analysis model.. The expansion of data communicationshas promoted different data services, social, 2 Processing and storing large datasetseconomic or scientific, to central nodes for As professor Anand Rajaraman questioned,storing and distributing large amounts of da- more data usually beats better algorithmta. For example, Facebook social network . The question is used to highlight the ef-hosts more than 140 billion photos, which is ficiency of a proposed algorithm for the Net-more than double compared to 60 billion pic- flix Challenge, . Despite the statement istures at the end of 2010. In terms of storage, still debatable, it brings up a true point. Aall these snapshots data take up more than 14 given data mining algorithm yields better re-petabytes. In other fields, like research, the sults with more data and it can reach theLarge Hadron Collider particle accelerator same accuracy of results of a better or morenear Geneva, will produce about 15 petabytes complex algorithm. In the end, the objectiveof data per year. The SETI project is record- of a data analysis and mining system is toing each month around 30 terabytes of data process more data with better algorithms,which are processed by over 250.000 com- .puters each day, . The supercomputer of In many fields more data is important be-the German Climate Computing Center cause it provides a more accurate description(DKRZ) has a storage capacity of 60 of the analyzed phenomenon. With more da-petabytes of climate data. In the financial ta, data mining algorithms are able to extractsector, records of every day financial opera- a wider group of influence factors and moretions generate huge amounts of data. Solely, subtle influences.the New York Stock Exchange records about Today, large datasets means volumes of hun-one terabyte of trade data per day, . dreds of terabytes or petabytes and these areDespite this spectacular evolution of storage real scenarios. The problem of storing thesecapacities and of deposits size, the problem large datasets is generated by the impossibil-that arises is to be able to process it. This is- ity to have a drive with that size and moresue is generated by available computing important, by the large amount of time re-power, algorithms complexity and access quired to access it.speeds. This paper makes a survey of differ- Access speed of large data is affected by theent technologies used to manage and process disk speed performances , Table 1, inter-large data volumes and proposes a distributed nal data transfer, external data transfer, cacheand parallel architecture used to acquire, memory, access time, rotational latency, thatstore and process large datasets. The objec- generate delays and bottlenecks.tive of the proposed architecture is to imple- Table 1. Example of disk drives performance (source ) Interface HDD Average Internal transfer External transfer Cache Spindle rotational [rpm] latency [ms] [Mbps] [MBps] [MB] SATA 7,200 11 1030 300 8 – 32 SCSI 10,000 4.7 – 5.3 944 320 8 High-end SCSI 15,000 3.6 – 4.0 1142 320 8 – 16 SAS 10,000 / 15,000 2.9 – 4.4 1142 300 16Despite the rapid evolution of drives capaci- 250 of them to store 1 PB of data. Once thety, described in figure 1, the large datasets of storage problem is solved another questionup to one petabytes can only be stored on arises, regarding how easy/hard is to readmultiple disks. Using 4 TB drives requires that data. Considering an optimal transfer
118 Informatica Economică vol. 16, no. 2/2012rate of 300 MB/s then the entire dataset is (file name, chunk index)read in 38.5 days. A simple solution is toread from the all disks at once. In this way, GFS Clients GFS Masterthe entire dataset is read in only 3.7 hours. (chunk handle, chunk loca-If we take into consideration the communica- tions)tion channel, then other bottlenecks are gen-erated by the available bandwidth. In the end, Instructions to chunk serverthe performance of the solution is reduced tothe speed of the slowest component. Chunk server stateOther characteristics of large data sets addsupplementary levels of difficulty: many input sources; in different econom- (chunk handle, byte range) Chunk Serv- ic and social fields there are multiple ers sources of information; (chunk data) redundancy, as the same data can be pro- vided by different sources; lack of normalization or data representa- tion standards; data can have different formats, unique IDs, measurement units; different degrees of integrity and con- sistency; data that describes the same Fig. 3. Simplified GFS Architecture phenomenon can vary in terms of meas- ured characteristics, measuring units, In a distributed approach, the file system time of the record, methods used. aims to achieve the following goals:For limited datasets the efficient data man- it should be scalable; the file systemagement solution is given by relational SQL should allow for additional hardware todatabases, , but for large datasets some of be added to increase storing capacitytheir founding principles are eroded , , and/or performance.. it should offer high performance; the file system should be able to locate the data3 Proposed Approach of interest on the distributed nodes in aThe next solutions provide answers to the timely manner.question regarding how to store and process it should reliable; the file system shouldlarge datasets in an efficient manner that can be able to recreate from the distributedjustify the effort. nodes the original data in a complete and undistorted manner.3.1 Large Data Sets Storage it should have high availability; the fileWhen accessing large data sets, the storage system should account for failures andfile system can become a bottle neck. Thats incorporate mechanisms for monitoring,why a lot of thought was put into redesigning error detection, fault tolerance and auto-the traditional file system for better perfor- matic recovery.mance when accessing large files of data. Google, one of the biggest search engine ser- vices providers, which handles “big web da- ta” has published a paper about the file sys- tem they claim its being used by their busi- ness called "The Google File System". From an architecture point of view the Google File System (GFS) comprises of a "single master", multiple "chunk servers" and its being accessed by multiple "clients".
Informatica Economică vol. 16, no. 2/2012 119In the simplified version of the GFS architec- ta into target data.ture, figure 3 it shows how the GFS clients The ETL process  is base on three ele-communicate with the master and with the ments:chunk servers. Basically the clients are ask- Extract – The process in which the data ising for a file and the GFS master tells them read from multiple source systems into awhich chunk servers contain the data for that single format. In this process data is ex-file. Then the GFS clients make a request for tracted from the data source;the data at those respective chunk servers. Transform – In this step, the source dataThen GFS chunk servers transmit the data di- is transform into a format relevant to therectly to the GFS clients. No user data actual- solution. The process transform the dataly passes the GFS master, this way it avoids from the various systems and made ithaving the GFS master as a bottleneck in the consistent;data transmission. Load – The transformed data is now writ-Periodically, the GFS master communicates ten into the warehouse.with the chunk servers to get their state and Usually the systems that acquire data are op-to transmit them instructions. timized so that the data is being stored as fastEach chunk of data is identified by an immu- as possible. Most of the time comprehensivetable and globally unique 64 bit chunk handle analyses require access to multiple sources ofassigned by the master at the time of chunk data. It’s common that those sources storecreation . raw data that yields minimal information un-The chunk size is a key differentiator from less properly process.more common file system. GFS uses 64 MB This is where the ETL or ELT processesfor a chunk, limiting this way the number or come into play. An ETL process will take therequests to the master for chunk locations, it data, stored in multiple sources, transform it,reduces the network overhead and it reduces so that the metrics and KPIs are readily ac-the metadata size on the master, allowing the cessible, and load it in an environment thatmaster to store the metadata in memory. has been modeled so that the analysis queriesHadoop, a software framework derived from are more efficient . An ETL system isGoogles papers about MapReduce and GFS, part of a bigger architecture that includes atoffers a similar file system, called Hadoop least one Database Management System,Distributed Files System (HDFS). DBMS. It is placed upstream from a DBMSWhat GFS was identifying as "Master" its because it feeds data directly into the nextbeing called NameNode in HDFS and the level.GFS "Chunk Servers" can be found as The ETL tools advantages are , ,"Datanodes" in HDFS. : save time and costs when developing and3.2 ETL maintaining data migration tasks;The Extract, Transform and Load (ETL) pro- use for complex processes to extract,cess provides an intermediary transfor- transform, and load heterogeneous datamations layer between outside sources and into a data warehouse or to perform otherthe end target database. data migration tasks;In literature ,  we identified ETL and in larger organizations for different dataELT. ETL refers to extract, transform and integration and warehouse projects ac-load in this case activities start with the use cumulate;of applications to perform data transfor- such processes encompass common sub-mations outside of a database on a row-by- processes, shared data sources and tar-row basis, and on the other hand ELT refers gets, and same or similar operations;to extract, load and transform which implied ETL tools support all common databases,the use first the relational databases, before file formats and data transformations,performing any transformations of source da- simplify the reuse of already created
120 Informatica Economică vol. 16, no. 2/2012 (sub-)processes due to a collaborative de- termediate subsets associated with the same velopment platform and provide central intermediate key and send them to the Re- scheduling. duce function. Portability: usually the ETL code can be The Reduce function, also accepts an inter- developed one a specific target database mediate key and subsets. This function merg- and ported later to other supported data- es together these subsets and key to form a bases. possibly smaller set of values. Normally justWhen developing the ETL process, there are zero or one output value is produced per Re-two options: either takes advantage of an ex- duce function.isting ETL tool, some key players, in this In  is highlight that many real worlddomain, are: IBM DataStage, Ab Initio, tasks such used MapReduce model. ThisInformatica or custom code it. Both ap- model is used for web search service, forproaches have benefits and pitfalls that need sorting and processing the data, for data min-to be carefully considered when selecting ing, for machine learning and for a big num-what better fits the specific environment. ber of other systems.The benefits of an in-house custom built ETL The entire framework manages how data isprocess are: split among nodes and how intermediary Flexibility; the custom ETL process can query results are aggregate. be designed to solve requirements specif- A general MapReduce architecture can be il- ic to the organization that some of the lustrated as Figure 4. ETL tools may have limitation with; Performance; the custom ETL process can be finely tuned for better perfor- mance; Tool agnostic; the custom ETL process can be built using skills available in- house; Cost efficient: custom ETL processes usually use resources already available in the organization, eliminating the addi- tional costs with licensing an ETL tool and training the internal resources on us- ing that specific ETL tool.3.3 MapReduce, Hadoop and HBaseMapReduce (MR) is a programming modeland an associated implementation for pro-cessing and generating large data sets, .The model was developed by Jeffrey Dean Fig. 4. MapReduce architecture (based onand Sanjay Ghemawat at Google. The foun- )dations of the MapReduce model are definedby a map function used top process key-value The MR advantages are , , , ,pairs and a reduce functions that merges all , , :intermediate values of the same key. the model is easy to use, even for pro-The large data set is split in smaller subsets grammers without experience with paral-which are processed in parallel by a large lel and distributed systems;cluster of commodity machines. storage-system independence as it not re-Map function  takes an input data and quires proprietary database file systemsproduces a set of intermediate subsets. The or predefined data models; data is storedMapReduce library groups together all in- in plain text files and it is not required to respect relational data schemes or any
Informatica Economică vol. 16, no. 2/2012 121 structure; in fact the architecture can use query predicate push down via server data that has an arbitrary format; side scan and get filters; fault tolerance; optimizations for real time queries ; the framework is available from high lev- a Thrift gateway and a REST-ful Web el programming languages; one such so- service that supports XML, Protobuf, and lution is the open-source Apache Hadoop binary data encoding options project which is implemented in Java; extensible JRuby based (JIRB) shell; the query language allows record-level support for exporting metrics via the manipulation; Hadoop metrics subsystem to files or projects as Pig and Hive  are provid- Ganglia; or via JMX. ing a rich interface that allows program- HBase database stores data in labeled tables. mers to do join datasets without repeating In this context  the table is designed to simple MapReduce code fragments; have a sparse structure and data is stored inHadoop is a distributed computing platform, table rows, and each row has a unique keywhich is an open source implementation of with arbitrary number of columns.the MapReduce framework proposed byGoogle . It is based on Java and uses the 3.4 Parallel database systemsHadoop Distributed File System (HDFS). A distributed database (DDB) is a collectionHDFS is the primary storage system used by of multiple, logically interconnected data-Hadoop applications. It is uses to create mul- bases distributed over a computer network. Atiple replicas of data blocks for reliability, distributed database management system,distributing them around the clusters and distributed DBMS, is the software systemsplitting the task into small blocks. The rela- that permits the management of the distribut-tionship between Hadoop, HBase and HDFS ed database and makes the distribution trans-can be illustrated as Figure 5. parent to the users. A parallel DBMS is aHBase is a distributed database. HBase is an DBMS implemented on a multiprocessoropen source project for a database, distribut- computer. . The parallel DBMS imple-ed, versioned, column-oriented, modeled af- ments the concept of horizontal partitioningter Google’ Bigtable .  by distributing parts of a large relational table across multiple nodes to be processed in HDFS parallel. This requires a partitioned execution Output of the SQL operators. Some basic operations, Output like a simple SELECT, can be executed in- dependently on all the nodes. More complex Input Input operations are executed through a multiple- operator pipeline. Different multiprocessor parallel system architectures , like share- Input memory, share-disks or share nothing, define Hadoop HBase possible strategies to implement a parallel Output DBMS, each with its own advantages and drawbacks. The share-nothing approach dis- Fig. 5. The relationship between Hadoop, tributes data across independent nodes and HBase and HDFS (based on ) has been implemented by many commercial systems as it provides extensibility and avail-Some as the features of HBASE as listed at ability. are: Based on the above definitions, we can con- convenient base classes for backing clude that parallel database systems improve Hadoop MapReduce jobs with HBase ta- performance of data processing by paralleliz- bles including cascading, hive and pig ing loading, indexing and querying data. In source and sink modules; distributed database systems, data is stored in
122 Informatica Economică vol. 16, no. 2/2012different DBMSs that can function inde- also reduces the administration effort as itpendently. Because parallel database systems comes as a single vendor out of the box solu-may distribute data to increase the architec- tion. The data warehouse appliances offerture performance, there is a fine line that sep- high availability through built-in fail-overarates the two concepts in real implementa- capabilities using data redundancy for eachtions. disk.Despite the differences between parallel and Ideally, each processing unit of the datadistributed DBMSs, most of their advantages warehouse appliance should process the sameare common to a simple DBMS, ,: amount of data at any given time. To achieve stored data is conform to a well-deﬁned that, the data should be distributed uniformly schema; this validates the data and pro- across each processing unit. Data skew is a vides data integrity; measure to evaluate how data is distributed data is structured in a relational paradigm across each processing unit. A data skew of 0 of rows and columns; means that the same number of records is SQL queries are fast; distributed on each processing unit. A data the SQL query language is flexible, easy skew of 0 is ideal. to learn and read and allows program- By having each processing unit do the same mers to implement complex operations amount of work it ensures that all processing with ease; units finish their task about the same time, use hash or B-tree indexes to speed up minimizing any waiting times. access to data; Another aspect that has an important impact can efficiently process datasets up to two on the query performance is having the all petabytes of data. the data that is related on the same pro-Known commercial parallel databases as Te- cessing unit. This way the time required toradata, Aster Data, Netezza , DATAllegro, transfer data between the processing units isVertica, Greenplum, IBM DB2 and Oracle eliminated. For example, if the user requiresExadata, have been proven successful be- the sales by country report, having both thecause: sales data for a customer and his geographic allow linear scale-up, ; the system information on the same processing unit will can maintain constant performance as the ensure that the processing unit has all the in- database size is increasing by adding formation that it needs and each processing more nodes to the parallel system; unit is able to perform its tasks independent- allow linear speed-up, ; for a data- ly. base with a constant size, the perfor- The way data is distributed across the parallel mance can be increased by adding more database nodes influence the overall perfor- components like processors, memory and mance. Though the power of the parallel disks; DBMS is given by the number of nodes, this implement inter-query, intra-query and can be also a drawback. For simple queries intra-operation parallelism, ; the actual processing time can be much reduced implementation effort; smaller to the time needed to launch the par- reduced administration effort; allel operation. Also, nodes can become hot high availability. spots or bottle necks as they delay the entireIn a massively parallel processing architec- system.ture (MPP), adding more hardware allows formore storage capacity and increases queries 4 Proposed architecturespeeds. MPP architecture, implemented as a The proposed architecture is used to processdata warehouse appliance, reduces the im- large financial datasets. The results of the da-plementation effort as the hardware and ta mining analysis help economists to identi-software are preinstalled and tested to work fy patterns in economic clusters that validateon the appliance, prior to the acquisition. It existing economic models or help to define
Informatica Economică vol. 16, no. 2/2012 123new ones. The bottom-up approach is more It is important that the ETL server also sup-efficient, but more difficult to implement, ports parallel processing allowing it to trans-because it can suggests, based on real eco- form large data sets in timely manner. ETLnomic data, relations between economic fac- tools like Ab Initio, DataStage andtors that are specific to the cluster model. The Informatica have this capability built in. Ifdifficulty comes from the large volume of the ETL server does not support parallel pro-economic data that needs to be analyzed. cessing, then it should just define the trans-The architecture, described in figure 6, has formations and push the processing to thethree layers: target parallel DBMS. the input layer implements data acquisi- tion processes; it gets data from different sources, reports, data repositories and ar- chives which are managed by govern- mental and public structures, economic agencies and institutions, NGO projects; some global sources of statistical eco- nomic data are Eurostat, International Monetary Fund and World Bank; the problem of these sources is that they use independent data schemes and bringing them to a common format it is an inten- sive data processing stage taking into consideration national data sources or crawling the Web for free data, the task becomes a very complex one; the data layer stores and process large da- tasets of economic and financial records; this layer implements distributed, parallel processing; the user layer provides access to data and manage requests for analysis and reports.The ETL intermediary layer placed betweenthe first two main layers, collects data fromthe data crawler and harvester component,converts it in a new form and loads it in theparallel DBMS data store. The ETL normal-ize data, transforms it based on a predefinedstructure and discards not needed or incon-sistent information. Fig. 6. Proposed architectureThe ETL layer inserts data in the parallel dis-tributed DBMS that implements the Hadoop The user will submit his inquiries through aand MapReduce framework. The objective of front end application server, which will con-the layer is to normalize data and bring it to a vert them into queries and submit them to thecommon format, requested by the parallel parallel DBMS for processing.DBMS. The front end application server will includeUsing an ETL process, data collected by the an user friendly metadata layer, that will al-data crawler & harvester gets consolidated, low the user to query the data ad-hoc, it willtransformed and loaded into the parallel also include canned reports and dashboards.DBMS, using a data model optimized for da- The objective of the proposed architecture ista retrieval and analysis. to separate the layers that are data processing
124 Informatica Economică vol. 16, no. 2/2012intensive and to link them by ETL services data schema in order to obtain a full de-that will act as data buffers or cache zones. scription of the data; the same schema isAlso the ETL will support the transformation used to validate data;effort. DBMSs use B-trees indexes to achieveFor the end user the architecture is complete- fast searching times; indexes can be de-ly transparent. All he will experience is the fined on any attributes are managed by thelook and feel of the front end application. system; MR framework does not imple-Based on extensive comparisons between ment this built-in facility and an imple-MR and parallel DBMS, , , , we mentation of a similar functionality isconclude that there is no all-scenarios good done by the programmers who control thesolution for large scale data analysis because: data fetching mechanism; both solutions can be used for the same DBMSs provide high level querying lan- processing task; you can implement any guages, like SQL, which are ease to read parallel processing task based either on a and write; instead the MR use code frag- combinations of queries or a set of MR ments, seen as algorithms, to process rec- jobs; ords; projects as Pig and Hive  are the two approaches are complementary; providing a rich interface based on high– each solution gives better performance level programming languages that allows over the other one in particular data sce- programmers to reuse code fragments. narios; you must decide with approach parallel DBMSs have been proved to be saves time for the needed data processing more efficient in terms of speed but they task; are more vulnerable to node failures. the process of configuring and loading da- A decision which approach to take must be ta into the parallel DBMS is more com- made taking into consideration: plex and time consuming than the setting performance criteria; up of the MR architecture; one reason is internal structure of processed data; that the DBMS requires complex schemas available hardware infrastructure; to describe data, whereas MR can process maintenance and software costs. data in arbitrary format; A combined MR-parallel DBMS solution, the performance of the parallel DBMS is  can be a possibility as it benefits from given by system fine tune level; the sys- each approach advantages. tem must be configured accordingly to the tasks needed to complete and to the avail- 5 Conclusion able resources; Processing large datasets obtained from mul- common MR implementations take full tiple sources is a daunting task as it requires advantage of the reduced complexity by tremendous storing and processing capaci- processing data with simple structure, be- ties. Also, processing and analyzing large cause the entire MR model is built on the volumes of data becomes non-feasible using key-value pair format; a MR system can a traditional serial approach. Distributing the be used to process more complex data, but data across multiple processing units and the input data structure must be integrated parallel processing unit yields linear im- in a custom parser in order to obtain ap- proved processing speeds. propriate semantics; not relaying on a When distributing the data is critical that common recognized data structure has an- each processing unit is allocated the same other drawback as data is not validated by number of records and that all the related da- default by the system; this can conduct to ta sets reside on the same processing unit. situations in which modified data violates Using a multi-layer architecture to acquire, integrity or other constraints; in contrast, transform, load and analyze the data, ensures the SQL query language used by any par- that each layer can use the best of bread for allel DBMS, takes full advantage of the its specific task. For the end user, the experi-
Informatica Economică vol. 16, no. 2/2012 125ence is transparent. Despite the number of Update, 2010–2015,layers that are behind the scenes, all that it is http://www.cisco.com/en/US/solutions/cexposed to him is a user friendly interface ollateral/ns341/ns525/ns537/ns705/ns82supplied by the front end application server. 7/white_paper_c11-520862.htmlIn the end, once the storing and processing  D. Henschen, “New York Stock Ex-issues are solved, the real problem is to change Ticks on Data Warehouse Appli-search for relationships between different ances,” InformationWeek, 2008,types of data . Others has done it very http://www.informationweek.com/news/successfully, like Google in Web searching software/bi/207800705or Amazon in e-commerce.  IBM, IBM 350 disk storage unit, http://www-Acknowledgment 03.ibm.com/ibm/history/exhibits/storageThis work was supported from the European /storage_profile.htmlSocial Fund through Sectored Operational  Wikipedia, History of IBM magneticProgrammer Human Resources Development disk drives,2007-2013, project number http://en.wikipedia.org/wiki/History_of_POSDRU/89/1.5/S/59184, “Performance and IBM_magnetic_disk_drivesexcellence in postdoctoral research in Roma-  Wikipedia, History of hard disk drives,nian economics science domain”. http://en.wikipedia.org/wiki/History_of_ hard_disksReferences  I. Smith, Cost of Hard Drive Storage T. White, Hadoop: The Definitive Guide, Space, O’Reilly, 2009 http://ns1758.ca/winch/winchest.html G. Bruce Berriman, S. L. Groom, “How  Seti@home, SETI project, Will Astronomy Archives Survive the http://setiathome.berkeley.edu/ Data Tsunami?,” ACM Queue, Vol. 9  A. Rajaraman, More data usually beats No. 10, 2011, better algorithms, part I and part II, http://queue.acm.org/detail.cfm?id=2047 2008, 483 http://anand.typepad.com/datawocky/20 P. Helland, “If You Have Too Much Da- 08/03/more-data-usual.html ta, then “Good Enough” Is Good  J. Good, How many photos have ever Enough,” Vol. 9 No. 5, ACM Queue, been taken?, Sep 2011, 2011 http://1000memories.com/blog/ Apache Hadoop,  J. Bennett, S. Lanning, “The Netflix http://en.wikipedia.org/wiki/Hadoop Prize,” Proceedings of KDD Cup and Apache Software Foundation, Apache Workshop 2007, San Jose, California, Hadoop, 2007 http://wiki.apache.org/hadoop/FrontPage  J. Dean, S. Ghemawat, “MapReduce: J. Dean, S. Ghemawat, “MapReduce: A simplified data processing on large clus- Flexible Data Processing Tool,” Com- ters,” Commun. ACM, Vol. 51, No. 1, munications of the ACM, vol. 53, no. 1, pp. 107–113, 2008. 2010  S. Messenger, Meet the worlds most M. C. Chu-Carroll, Databases are ham- powerful weather supercomputer, 2009, mers; MapReduce is a screwdriver, http://www.treehugger.com/clean- Good Math, Bad Math, 2008, technology/meet-the-worlds-most- http://scienceblogs.com/goodmath/ powerful-weather-supercomputer.html 2008/01/databases_are_hammers_mapre  A. Pavlo, E. Paulson, A. Rasin, D.J. duc.php Abadi, D.J. DeWitt, S. Madden and M Cisco, Cisco Visual Networking Index: Stonebraker, “A comparison of ap- Global Mobile Data Traffic Forecast proaches to large-scale data analysis,” In
126 Informatica Economică vol. 16, no. 2/2012 Proceedings of the 2009 ACM SIGMOD sulting, Inc., June 11, 2007, Available International Conference, 2009. online at: M. Tamer Özsu, P. Valduriez, “Distrib- http://www.scribd.com/doc/55883818/A uted and Parallel Database System,” -Generalized-Lesson-in-ETL- ACM Computing Surveys, vol. 28, 1996, Architecture pp. 125 – 128  A. Albrecht, METL: Managing and Inte- Seagate, Performance Considerations, grating ETL Processes, VLDB ‘09, Au- http://www.seagate.com/www/en- gust 2428, 2009, Lyon, France Copy- us/support/before_you_buy/speed_consi right 2009 VLDB Endowment, ACM, derations http://www.vldb.org/pvldb/2/vldb09- P. Vassiliadis, “A Survey of Extract- 1051.pdf Transform-Load Technology,” Interna-  R. Davenport, ETL vs ELT, June 2008, tional Journal of Data Warehousing & Insource IT Consultancy, Insource Data Mining, Vol. 5, No. 3, pp. 1-27, 2009 Academy, M. Stonebreaker et al., “MapReduce and http://www.dataacademy.com/files/ETL- Parallel DBMSs: Friends or Foes,” vs-ELT-White-Paper.pdf Communications of the ACM 53(1):64--  S. Ghemawat, H. Gobioff and Shun-Tak 71 2010. Leung (GOOGLE) - The Google File Apache HBASE, System, http://hbase.apache.org/ http://www.cs.brown.edu/courses/cs295- H. Yang, A. Dasdan, R. Hsiao, D. Par- 11/2006/gfs.pdf ker, “Map-reduce-merge: simplified re-  C. Olston, B. Reed, U. Srivastava, R. lational data processing on large clus- Kumar and A. Tomkins, “Pig Latin: A ters,” Rain (2007), Publisher: ACM, Not-So-Foreign Language for Data Pro- Pages: 1029-1040 ISBN: 781595936868 cessing,” In SIGMOD ’08, pp. 1099– , 1110, 2008. http://www.mendeley.com/research/map  S. Pukdesree, V. Lacharoj and P. reducemergesimplified-relational-data- Sirisang, “Performance Evaluation of processing-on-large-clusters/ Distributed Database on PC Cluster J. Dean, S. Ghemawat, “MapReduce: Computers,” WSEAS Transactions on Simplified Data Processing on Large Computers, Issue 1, Vol. 10, January Clusters,” USENIX Association OSDI 2011, pp. 21 – 30, ISSN 1109-2750. ’04: 6th Symposium on Operating Sys-  A. Abouzeid, K. Bajda-Pawlikowski, tems Design and Implementation, D.J. Abadi, A. Silberschatz and A. http://static.usenix.org/event/osdi04/tech Rasin, “HadoopDB: An architectural hy- /full_papers/dean/dean.pdf brid of MapReduce and DBMS technol- X. Yu, “Estimating Language Models ogies for analytical workloads,” In Pro- Using Hadoop and Hbase,” Master of ceedings of the Conference on Very Science Artificial Intelligence, Universi- Large Databases, 2009. ty of Edinburgh, 2008,  C. Boja, A. Pocovnicu, “Distributed Par- http://homepages.inf.ed.ac.uk/miles/msc- allel Architecture for Storing and Pro- projects/yu.pdf cessing Large Datasets,” 11th WSEAS ETL Architecture Guide Int. Conf. on SOFTWARE ENGINEER- http://www.ipcdesigns.com/etl_metadata ING, PARALLEL and DISTRIBUTED /ETL_Architecture_Guide.html SYSTEMS (SEPADS 12), Cambridge, W. Dumey, A Generalized Lesson in UK, February 22-24, 2012, ISBN: 978- ETL Architecture Durable Impact Con- 1-61804-070-1, pp. 125-130.
Informatica Economică vol. 16, no. 2/2012 127 Catalin BOJA is Lecturer at the Economic Informatics Department at the Academy of Economic Studies in Bucharest, Romania. In June 2004 he has graduated the Faculty of Cybernetics, Statistics and Economic Informatics at the Academy of Economic Studies in Bucharest. In March 2006 he has grad- uated the Informatics Project Management Master program organized by the Academy of Economic Studies of Bucharest. He is a team member in various undergoing university research projects where he applied most of his projectmanagement knowledge. Also he has received a type D IPMA certification in project man-agement from Romanian Project Management Association which is partner of the IPMA or-ganization. He is the author of more than 40 journal articles and scientific presentations atconferences. His work focuses on the analysis of data structures, assembler and high levelprogramming languages. He is currently holding a PhD degree on software optimization andon improvement of software applications performance. Adrian POCOVNICU is a PhD Candidate at Academy of Economic Stud- ies. His main research areas are: Multimedia Databases, Information Retriev- al, Multimedia Compression Algorithms and Data Integration. He is a Data Integration Consultant for ISA Consulting, USA. Lorena BĂTĂGAN has graduated the Faculty of Cybernetics, Statistics and Economic Informatics in 2002. She has become teaching assistant in 2002. She has been university lecturer since 2009. She is university lecturer at Fac- ulty of Cybernetics, Statistics and Economic Informatics from Academy of Economic Studies. She holds a PhD degree in Economic Cybernetics and Sta- tistics in 2007. She is the author and co-author of 4 books and over 50 articles in journals and proceedings of national and international conferences, sympo-siums.