Hadoop Deep Dive


Sponsored by HP and Intel.

Published in: Technology

  1. SPECIAL REPORT: HPC and Hadoop Deep Dive. Uncovering insight with distributed processing. Copyright © InfoWorld Media Group. All rights reserved.
  2. HPC and Hadoop Deep Dive
     INFOWORLD.COM DEEP DIVE SERIES

     Hadoop: A platform for the big data era

     Businesses are using Hadoop across low-cost hardware clusters to find meaningful patterns in unstructured data. Here’s how Hadoop works and how you can reap its benefits.

     By Andrew Lampitt

     WE ARE AT THE BEGINNING of a data explosion. Ninety percent of the world’s data has been created in the past two years. It is estimated that we generate 2.5 exabytes (about 750 million DVDs) of data each day, more than 80 percent of which is unstructured. As traditional approaches have allowed only for analyzing structured data, unlocking insight from semistructured and unstructured data is one of the largest computing opportunities in our era.

     Standing squarely at the center of that opportunity is Hadoop, a distributed, reliable processing and storage framework for very large data sets. The phrase “big data” resists precise definition, but simply put, it applies to sets of data that are too costly or unwieldy to manage with traditional technologies and approaches. Big data is generally considered to be at least terabytes of data, usually unstructured or semistructured.

     Hadoop has already established itself as the de facto platform for unstructured and semistructured big data analytics, the discipline aimed at gaining insight from data. Thanks to Hadoop, which can spread enormous processing jobs across low-cost servers, big data analysis is now easily and affordably within reach.

     THE IMPORTANCE OF HADOOP

     Doug Cutting and Mike Cafarella implemented the first version of what eventually became Hadoop in 2004. Hadoop evolved into a top-level open source Apache project after Cutting joined Yahoo in 2006 and continued to lead the project.

     Hadoop comprises two core components: the Hadoop Distributed File System (HDFS) and MapReduce, both of which were inspired by papers Google published early in the last decade. HDFS handles storage, and MapReduce processes data. The beauty of Hadoop is its built-in resiliency to hardware failure and the way it spreads its tasks across a cluster of low-cost servers (nodes), freeing the developer from the onerous challenge of managing the scale-out infrastructure (increasing the number of nodes as required while maintaining a single, reliable, harmonious system).

     Most analytic tasks need to combine data in some way. MapReduce abstracts the problem, providing a programmatic model to describe computational requirements over sets of keys and values, such as attribute name and value (“book and author” or “region and temperature”). Those instructions are then translated into physical reads and writes on the various disks where the data is located.

     THE RISE OF BIG DATA

     The attributes generally associated with big data, first laid out in a paper published in 2001 by industry analyst Doug Laney, are the volume, velocity, and variety of the data. In terms of volume, single-digit terabytes of data are considered small; dozens or hundreds of terabytes are considered midsize; and petabytes are large by any measure. The velocity of adding a terabyte or more per week is not uncommon in big data scenarios. Variety refers to the range of data types beyond structured data to include semistructured and unstructured types, such as email, images, Web pages, Web logs, and so on. (To be clear, Hadoop can handle structured data as well.) Some argue that to be considered big data, the data set must have all of these attributes, while others feel that one or two attributes are enough.

     Just as the term “big data” is hotly debated, so too is what constitutes big data technology. As a rule of
  3. thumb, big data technologies typically address two or three of the Vs described previously. Commonly, big data technologies are open source and considered quite affordable in terms of licensing and hardware compared to proprietary systems. However, some proprietary systems can manage characteristics of the three Vs and are often referred to as big data systems as well.

     For the purpose of this discussion, other popular big data technologies include NoSQL databases and MPP (massively parallel processing) relational databases. Also, with respect to architectures, big data is generally considered to be shared-nothing (each node is independent and self-sufficient) as opposed to shared-everything (all the processors share the same memory address space for read/write access).

     Another loosely defined yet popular term, NoSQL, refers to those databases that do not rely on SQL alone. They represent a break from the conventional relational database. They offer developers greater affordability, flexibility, and scalability while eschewing some constraints that relational databases enforce, typically around ACID compliance (atomicity, consistency, isolation, and durability).

     For example, “eventual consistency” is a characteristic of some NoSQL databases. In other words, given a sufficiently long period of time, updates propagate through the system to achieve consistency. Relational databases, on the other hand, are consistent. The major categories of NoSQL databases are: document stores, Google BigTable clones, graph databases, key-value stores, and data grids, each with its own strengths for specific use cases.

     MPP relational databases also fall into the big data discussion due to the workloads they can handle and their scale-out, shared-nothing nature. Yet MPP relational databases can sometimes be quite costly to buy and maintain, attributes not typically associated with big data systems.

     Finally, if there’s one shared characteristic of big data management systems, it’s that their boundaries continue to blur. Hadoop continues to add database-like and SQL-like extensions and features. Also, many big data databases have added hooks to Hadoop and MapReduce. Big data is driving convergence of a range of old and new database technologies.

     THE HARDWARE TRENDS THAT SPAWNED HADOOP

     To fully appreciate Hadoop, it’s important to consider the hardware trends that gave it life and encouraged its growth. It’s easy to recognize the advantages of using low-cost servers. First, processors are fast, with chip density following Moore’s Law. The cost of the rest of the hardware, including storage, continues to fall. But a Hadoop implementation is not as simple as just networking servers together.

     Hadoop solves three major challenges of scale-out computing using low-cost hardware: slow disk access to data; abstraction of data analysis (separating the logical programming analysis from the strategy to physically execute it on hardware); and server failure. The abstraction challenge, which MapReduce solves, is described later; the other two challenges are solved with elegant strategies that play to the strengths of low-cost hardware.

     Disk access remains relatively slow: its speed has not improved at the same rate as other hardware, nor has it grown proportionally to storage capacity. That’s why Hadoop spreads the data set out among many servers and reads as required in parallel. Instead of reading a lot of data from one slow disk, it’s faster to read a smaller set of data in parallel across many disks.

     When dealing with large numbers of low-cost servers, component failure is common. Estimates vary, but loosely speaking, approximately 3 percent of hard drives fail per year. With 1,000 nodes, each having 10 or so disks, that means about 300 hard drives will fail each year (not to mention other components). Hadoop overcomes this liability with reliability strategies and data replication. Reliability strategies apply to how compute nodes are managed. Replication copies the data set so that in the event of failure, another copy is available.

     The net result of these strategies is that Hadoop allows low-cost servers to be marshaled as a scale-out, big data, batch-processing computing architecture.

     HADOOP AND THE CHANGING FACE OF ANALYTICS

     Analytics has become a fundamental part of the enterprise’s information fabric. At Hadoop’s inception in 2004, conventional thinking about analytic workloads focused primarily on structured data and real-time queries. Unstructured and semistructured data were
  4. considered too costly and unwieldy to tap. Business analysts, data management professionals, and business managers were the primary stakeholders.

     Today, developers have become part of the core analytics audience, in part because programming skills are required simply to set up the Hadoop environment. The net result of developers’ work with Hadoop enables traditional analytics stakeholders to consume big data more easily, an end result that is achieved in a number of ways.

     Because Hadoop stores and processes data, it resembles both a data warehouse and an ETL (extraction, transformation, and loading) solution. HDFS is where data is stored, serving a function similar to data warehouse storage. Meanwhile, MapReduce can act as both the data warehouse analytic engine and the ETL engine.

     The flipside is the limited capabilities in these same areas. For example, as a data warehouse, Hadoop provides only a limited ability to be accessed via some SQL, and handling real-time queries is a challenge (although progress is being made in both areas). Compared to an ETL engine, Hadoop lacks prebuilt, sophisticated data transformation capabilities.

     There has been a fair amount of discussion about the enterprise readiness of Hadoop and its possible use as an eventual widespread replacement of the relational data warehouse. Clearly, many organizations are using Hadoop in production today and consider it to be enterprise-ready, despite arbitrary version numbers such as the Apache Hadoop version 1.0 release in January 2012.

     Hadoop offers a rich set of analytic features for developers, and it’s making big strides in usability for personnel who know SQL. It includes a strong ecosystem of related technologies. On the business side, it offers traditional enterprise support via various Hadoop distribution vendors. There’s good reason for its acceptance.

     A CLOSER LOOK AT HDFS

     HDFS is a file system designed for efficiently storing and processing large files (at least hundreds of megabytes) on one or more clusters of low-cost hardware. Its design is centered on the philosophy that write-once and read-many-times is the most efficient computing approach. One of the biggest benefits of HDFS is its fault tolerance without losing data.

     BLOCKS

     Blocks are the fundamental unit of space used by a physical disk and a file system. The HDFS block is 64MB by default but can be customized. HDFS files are broken into block-size units and stored as such. However, unlike a file system for a single disk, a file in HDFS that is smaller than a single HDFS block does not take up a whole block.

     HDFS blocks provide multiple benefits. First, a file on HDFS can be larger than any single disk in the network. Second, blocks simplify the storage subsystem in terms of metadata management of individual files. Finally, blocks make replication easier, providing fault tolerance and availability. Each block is replicated to other machines (typically three). If a block becomes unavailable due to corruption or machine failure, a copy can be read from another location in a way that is transparent to the client.

     NAMENODES AND DATANODES

     An HDFS cluster has a master-slave architecture with multiple namenodes (the masters) for failover, and a number of datanodes (slaves). The namenode manages the filesystem namespace, the filesystem tree, and the metadata for all the files and directories. This information persists on the local disk as two files: the namespace image and the edit log. The namenode keeps track of the datanodes on which all the blocks for a given file are located.

     Datanodes store and retrieve blocks as directed by clients, and they notify the namenode of the blocks being stored. The namenode represents a single point of failure, and therefore to achieve resiliency, the system manages a backup strategy in which the namenode writes its persistent state to multiple filesystems.

     A CLOSER LOOK AT MAPREDUCE

     MapReduce is a programming model for parallel processing. It works by breaking the processing into two phases: the map phase and the reduce phase. In the map phase, input is divided into smaller subproblems and processed. In the reduce phase, the answers from the map phase are collected and combined to form the output of the original bigger problem. The phases have corresponding map and reduce functions defined by the developer. As input and output, each function has key-value pairs.

     A MapReduce job is a unit of work consisting of the
  5. input data, the MapReduce program, and configuration details. Hadoop runs a job by dividing it into two types of tasks: map tasks and reduce tasks. The map task invokes a user-defined map function that processes the input key-value pair into a different key-value pair as output. When processing demands more complexity, commonly the best strategy is to add more MapReduce jobs, rather than having more complex map and reduce functions.

     In creating a MapReduce program, the first step for the developer is to set up and configure the Hadoop development environment. Then the developer can create two separate map and reduce functions. Ideally, unit tests should be included along the way in the process to maximize development efficiency. Next, a driver program is created to run the job on a subset of the data from the developer’s development environment, where the debugger can identify a potential problem. Once the MapReduce job runs as expected, it can be run against a cluster that may surface other issues for further debugging.

     Two node types control the MapReduce job execution process: a jobtracker node and multiple tasktracker nodes. The jobtracker coordinates and tracks all the jobs by scheduling tasktrackers to run tasks. They in turn submit progress reports back to the jobtracker. If a task fails, the jobtracker reschedules it for a different tasktracker.

     Hadoop divides the input data set into segments called splits. One map task is created for each split, and Hadoop runs the map function for each record within the split. Typically, a good split size matches that of an HDFS block (64MB by default but customizable) because it is the largest size guaranteed to be stored on a single node. Otherwise, if the split spanned two blocks, it would slow the process since some of the split would have to be transferred across the network to the node running the map task.

     Because map task output is an intermediate step, it is written to local disk. A reduce task ingests a map task output to produce the final MapReduce output and store it in HDFS. If a map task were to fail before handing its output off to the reduce task, Hadoop would simply rerun the map task on another node. When many map and reduce tasks are involved, the flow can be quite complex and is known as “the shuffle.” Due to the complexity, tuning the shuffle can dramatically improve processing time.

     DATA SECURITY

     Some people fear that aggregating data into a unified Hadoop environment increases risk, in terms of accidental disclosure and data theft. The conventional ways to answer such data security issues are the use of encryption and/or access control.

     The traditional perspective about database encryption is that a disk encrypted at the operating system’s file system level (encryption at rest) is typically good enough. Anyone who steals the hard drive ends up with nothing readable. Recently, Hadoop has added wire encryption, which protects data in transit, to HDFS, alleviating some security concerns.

     Access control is less robust. Hadoop’s standard security model has been to accept that the user was granted access and to assume that users cannot access the root of nodes, access shared clients, or read/modify packets on the network of the cluster.

     Hadoop does not yet have the row-level or cell-level security that is standard in relational databases (although it seems HBase is making progress in that direction, and an alternative project called Accumulo, aimed at real-time use cases, hopes to add this functionality). Access to data on HDFS is often all or nothing for a small set of users. Sometimes, when users are given access to HDFS, it’s with the assumption that they will likely have access to all data on the system. The general consensus is that improvements are needed.

     But just as big data has caused us to look at analytics differently, it also makes us scrutinize differently how information is secured. The key to understanding security in the context of big data is that one must appreciate the paradigms of big data first. It is not appropriate to simply map structured relational concepts onto an unstructured data set and its characteristics.

     Hadoop does not make data access a free-for-all. Authentication is required against centralized credentials, same as for other internal systems. Hadoop has permissions that dictate which users may access what data. With respect to cell-level security, some advise producing a file that has just the columns and rows available to each corresponding person. If this seems onerous, consider that every MapReduce job creates a projection of the data, effectively analogous to a sandbox. Creating a data set with privileges for a particular user or group is thus simply a MapReduce job.
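As a rough illustration of that last point, the sketch below simulates the map, shuffle, and reduce steps in plain Python, entirely in memory and without Hadoop itself, to build a per-group projection of a small data set. The record layout, the `POLICY` table, and every function name here are assumptions made for this example; a real job would read its splits from HDFS and run the map and reduce functions on tasktracker nodes.

```python
from collections import defaultdict

# Hypothetical access policy: the columns and regions each group may see.
POLICY = {
    "sales": {"columns": ("region", "revenue"), "regions": {"east", "west"}},
    "audit": {"columns": ("region", "revenue", "customer"), "regions": {"east"}},
}

# Hypothetical input records standing in for one input split.
RECORDS = [
    {"region": "east", "revenue": 100, "customer": "acme"},
    {"region": "west", "revenue": 250, "customer": "globex"},
    {"region": "north", "revenue": 75, "customer": "initech"},
    {"region": "east", "revenue": 50, "customer": "umbrella"},
]


def map_record(record, group):
    """Map step: emit (region, projected_row) only for rows the group may see."""
    rule = POLICY[group]
    if record["region"] in rule["regions"]:
        projected = {col: record[col] for col in rule["columns"]}
        yield record["region"], projected


def shuffle(pairs):
    """Shuffle step: gather all mapped values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_rows(key, rows):
    """Reduce step: combine the permitted rows for one key into one output."""
    total = sum(row["revenue"] for row in rows)
    return {"region": key, "rows": rows, "total_revenue": total}


def run_job(records, group):
    """Driver: run map over every record, shuffle, then reduce each key."""
    mapped = (pair for record in records for pair in map_record(record, group))
    grouped = shuffle(mapped)
    return [reduce_rows(key, rows) for key, rows in sorted(grouped.items())]


if __name__ == "__main__":
    for result in run_job(RECORDS, "sales"):
        print(result)
```

Running the job once per group yields one restricted data set per group, which is the "file that has just the columns and rows available to each corresponding person" idea in miniature. Keeping the policy in the map function means rows a group may not see are dropped before the shuffle, so they never leave the node that read them.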
  6. CORE ANALYTICS USE CASES OF HADOOP

     The primary analytic use cases for Hadoop may be generally categorized as ETL and queries, primarily for unstructured and semistructured data but structured data as well. Archiving massive amounts of data affordably and reliably on HDFS for future batch access and querying is a fundamental benefit of Hadoop.

     HADOOP-AUGMENTED ETL

     In this use case, Hadoop handles transformation, the “T” in the ETL process. Either the ETL process or Hadoop may handle the extraction from the source(s) and/or loading to the target(s).

     A crucial role of an ETL is to prepare data (cleansing, normalizing, aligning, and creating aggregates) so that it’s in the right format to be ingested by a data warehouse (or Hadoop jobs) so that business intelligence tools may in turn be used to more easily interact with the data.

     A key challenge for the ETL is hitting the so-called data load window: the amount of time allotted to transform and load data before the data warehouse must be ready for querying. When vast amounts of data are involved, hitting the data load window can become insurmountable. That’s where Hadoop comes in. The data can be loaded into Hadoop, transformed at scale with Hadoop (and then potentially as an intermediate step, the Hadoop output may be loaded into the ETL for transformation by the ETL engine). Final results are then loaded into the data warehouse.

     BATCH QUERYING WITH HIVE

     Hive, frequently referred to as a data warehouse system, originated at Facebook and provides a SQL interface to Hadoop for batch querying (high-latency). In other words, it is not designed for fast, iterative querying (seeing results and thinking of more interesting queries in real time). Rather it is typically used as a first pass of the data to find “the gems in the mountain,” and add the gems to the data warehouse.

     Hive is commonly used via traditional business intelligence tools since its Hive Query Language (HiveQL) is similar to a subset of SQL, used for querying relational databases. HiveQL provides the following basic SQL features: from clause subquery; ANSI join (equi-join only); multitable insert; multi group-by; sampling; and objects traversal. For extensibility, HiveQL provides: pluggable MapReduce scripts in the language of your choice using TRANSFORM; pluggable user-defined functions; pluggable user-defined types; and pluggable SerDes (serialize/deserialize) to read different kinds of data formats, such as data formatted in JSON (JavaScript Object Notation).

     LOW-LATENCY QUERYING WITH HBASE

     HBase (from “Hadoop Database”), a NoSQL database modeled after Google’s BigTable, provides developers real-time, programmatic and query access to HDFS. HBase is the de facto database of choice in working with Hadoop and thus a primary enabler to building analytic applications.

     HBase is a column family database (not to be confused with a columnar database, which is a relational analytic database that stores data in columns) built on top of HDFS that enables low-latency queries and updates for large tables. One can access single rows quickly from a billion-row table. HBase achieves this by storing data in indexed StoreFiles on HDFS. Additional benefits include a flexible data model, fast table scans, and scale in terms of writes. It is well suited for sparse data sets (data sets with many sparse records), which are common in many big data scenarios.

     To assist data access further, for both queries and ETL, HCatalog is an API to a subset of the Hive metastore. As a table and storage management service, it provides a relational view of data in HDFS. In essence, HCatalog allows developers to write DDL (Data Definition Language) to create a virtual table and access it in HiveQL. With HCatalog, users do not need to worry about where or in what format (RCFile format, text files, sequence files, etc.) their data is stored. This also enables interoperability across data processing tools, such as MapReduce and Hive, as well as Pig (described later).

     RELATED TECHNOLOGIES AND HOW THEY CONNECT TO BI SOLUTIONS

     A lively ecosystem of Apache Hadoop subprojects supports deployments for broad analytic use cases.

     SCRIPTING WITH PIG

     Pig is a procedural scripting language for creating MapReduce jobs. For more complex problems not easily addressable by simple SQL, Hive is not an option. Pig offers a way to free the developer from having to
  7. do the translation into MapReduce jobs, allowing for focus on the core analysis problem at hand.

     EXTRACTING AND LOADING DATA WITH SQOOP

     Sqoop enables bulk data movement between Hadoop and structured systems such as relational databases and NoSQL systems. In using Hadoop as an ETL, Sqoop acts as the E (extract) and L (load) while MapReduce handles the T (transform). Sqoop can be called from traditional ETL tools, essentially allowing Hadoop to be treated as another source or target.

     AUTOMATED DATA COLLECTION AND INGESTION WITH FLUME

     Flume is a fault-tolerant framework for collecting, aggregating, and moving large amounts of log data from different sources (Web servers, application servers and mobile devices, etc.) to a centralized data store.

     WORKFLOW WITH OOZIE

     MapReduce becomes a more powerful, bigger solution when multiple jobs are connected in a dependent fashion. In Oozie, the Hadoop Workflow Scheduler, a client submits the workflow to a scheduler server. This enables different types of jobs to be run in the same workflow. Errors and status can be tracked, and if a job fails, the jobs that depend on it are not run. Oozie can be used to string jobs together from other Hadoop tools including MapReduce, Pig, Hive, Sqoop, etc., as well as Java programs and shell scripts.

     OPTIONS FOR ACHIEVING LOWER-LATENCY QUERIES FOR HADOOP

     As described earlier, achieving fast queries is of high interest in the analytics space. Unfortunately, Hive does not allow that as a default, and HBase has its own limitations. To solve this challenge, a number of approaches are available to decrease latency on queries.

     The first option is to integrate HDFS with HadoopDB (a hybrid of parallel database and MapReduce) or an analytic database that supports such integration. This enables low-latency queries and richer SQL functionality with the ability to query all the data, both in HDFS and the database. Additionally, MapReduce may be invoked via SQL.

     Another approach is to replace MapReduce with the Impala SQL Engine that runs on each Hadoop node. This increases performance over Hive in a couple of ways. First, it does not start Java processes, so it does not incur MapReduce’s latency. Second, Impala does not persist intermediate results to disk.

     SEARCH WITH SOLR

     Search is a logical interface for unstructured data as well as analytics, and Hadoop’s heritage is closely associated with search. Hadoop originated within Nutch, an open source Web-search software built on the Lucene search library. All three were created by Doug Cutting. Solr is a highly scalable search server based on Lucene. Its major features include: full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document handling.

     INSIGHT AND CHALLENGES FOR THE FUTURE

     Today, vast quantities of data, whether structured, unstructured, or semistructured, can be affordably stored, processed, and queried with Hadoop. Hadoop may be used alone or as an augmentation to traditional relational data warehouses for additional storage and processing capabilities. Developers can analyze big data without having to manage the underlying scale-out architecture. The results from Hadoop can then be made more widely available to business users through traditional analytic tools.

     There are still opportunities for improvement. Hadoop requires specialized skills to set up and use proficiently. Good developers can ramp up, but nonetheless, it’s a framework and approach that needs to be considered differently from the traditional ways of working with structured data. The environment needs to be simplified so that existing analysts can benefit from the technology more easily.

     Two of the most talked-about areas include real-time query improvement and more sophisticated visualization tools to make Hadoop even more directly accessible to business users. Additionally, it’s recognized that Hadoop can add layers of abstraction to become a better analytic application platform by extending the capabilities around HBase.

     Finally, the Hadoop YARN project aims to provide a generic resource-management and distributed
  8. application framework whereby one can deploy multiple services, including MapReduce, graph-processing, and simple services alongside each other in a Hadoop YARN cluster, thus greatly expanding the use cases that a Hadoop cluster may serve.

     In all, Hadoop has established itself as a dominant, successful force in big data analytics. It is used by some of the biggest name brands in the world, including Amazon.com, AOL, Apple, eBay, Facebook, HP, IBM, LinkedIn, Microsoft, Netflix, The New York Times, Rackspace, Twitter, Yahoo!, and more. It seems apparent that we are still at the beginning of the big data phenomenon with much more growth, evolution, and opportunity coming soon.

     Andrew Lampitt has served as a big data strategy consultant for several companies. A recognized voice in the open source community, he is credited with originating the “open core” terminology. Lampitt is also co-founder of zAgile, the creators of Wikidsmart, a commercial open source platform for integration of software engineering tools as well as CRM and help desk applications.
  9. HPC and Hadoop resources

     HIGH PERFORMANCE COMPUTING: Accelerate high-performance innovation at any scale. Accelerate innovation with HP Converged Infrastructure designed and engineered for high-performance computing, from mainstream applications to the grandest scientific challenges.

     HP PROLIANT SERVERS: HP Performance Optimized Datacenter. HP Performance Optimized Datacenter (POD) is an alternative to traditional brick and mortar Data Centers that is quicker to deploy, power and cooling efficient, and less expensive. HP PODs are ideal for High Performance Computing, Disaster Recovery, Cloud, IT expansion, and strategic locations.

     SOLUTIONS FROM HADOOP: You Can’t Spell Hadoop™ without HP. Many companies interested in Hadoop lack the expertise required to configure, manage and scale Hadoop clusters for optimal performance and resource utilization. HP offers a complete portfolio of solutions to enable customers to deploy, manage and scale Hadoop systems swiftly and painlessly.

     Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.