Hadoop Enterprise Readiness
 


    • Hadoop Enterprise Readiness
      Dell | Hadoop White Paper
      By Aurelian Dumitru
      Dell | Hadoop White Paper Series
    • Dell | Hadoop White Paper Series: Hadoop Enterprise Readiness

      Table of Contents

      Introduction
      Audience
      The value of big data analytics in the enterprise
      Case study: Using big data analytics to optimize/automate IT operations
      Big data analytics challenges in the enterprise
      The adoption of Hadoop technology
      Hadoop technical strengths and weaknesses
      Dell | Hadoop solutions
      Dell | Hadoop for the enterprise
      About the author
      Special thanks
      About Dell Next Generation Computing Solutions
      Hadoop ecosystem component “decoder ring”
      Bibliography
      To learn more

      This white paper is for informational purposes only, and may contain typographical errors and technical inaccuracies. The content is provided as is, without express or implied warranties of any kind.

      © 2011 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell.
    • Introduction

      This white paper describes the benefits and challenges of leveraging big data analytics in an enterprise environment.

      The white paper begins with a holistic view of business process phases and highlights ways in which analytics may stimulate better business operational efficiency, drive higher returns from existing or new investments, and help business leaders make rapid adjustments to the business strategy in response to varying market trends and/or customer demands.

      The white paper continues with a case study of how big data analytics helps information technology (IT) departments run information systems more efficiently and with little or no downtime.

      Lastly, the paper introduces the Dell | Hadoop solutions and presents several best practices for deploying and using Hadoop in the enterprise.

      Audience

      Dell intends this white paper for anyone in the business or IT community who wants to learn about the advantages and challenges of implementing and using big data analytics solutions (like Hadoop) in a production environment. Readers should be familiar with general concepts of business process design and implementation and also with the correlation between business processes and IT practices.

      The value of big data analytics in the enterprise

      Business processes define the way business activities are performed, the expected set of inputs, and the desired outcomes. Business processes often integrate business units, workgroups, infrastructures, business partners, etc. to achieve key performance goals (e.g. strategy, operations, functionality). Business process adjustments and improvements are expected as the company attempts to improve its operations or to create a competitive advantage. Business process maturity and execution excellence are the core competencies of any modern company. Switching from last decade’s product-centric business model to today’s customer-driven model requires reengineering of the business processes (e.g. just-in-time business intelligence) along with deeper collaboration among departments. [2]

      Enterprise business processes relate to cross-functional management of work activities across the boundaries of the various departments (or business units) within a large enterprise. Controlling the sequence of work activities (and the corresponding information flow) while meeting customers’ needs is fundamental to the successful implementation and execution of the business process. Because of this intrinsic complexity, enterprises are starting to take a process-centric approach to designing, planning, monitoring, and automating business operations. One example of such an approach stands out: Business Process Management (BPM).

      “BPM is a holistic management approach focused on aligning all aspects of an organization with the wants and needs of clients. It promotes business effectiveness and efficiency while striving for innovation, flexibility, and integration with technology. BPM attempts to improve processes continuously.” [3]
    • The main BPM phases (Figure 1) and their respective owners are:

      1. Vision: Functional leads in an organization create the strategic goals for the organization. The vision can be based on market research, internal strategy objectives, etc.
      2. Design & Simulate: Design leads in the organization work to identify existing processes or to design “to-be” processes. The result is a theoretical model that is tested against combinations of variables. Special consideration is given to “what if” scenarios. The aim of this step is to ensure that the design delivers the key performance goals established in the Vision phase.
      3. Implement: The theoretical design is adopted within the organization. A high degree of automation and integration are two key ingredients for successful implementation. Other key elements may be personnel training, user documentation, streamlined support, etc.
      4. Execute: The process is fully adopted within the organization. Special measures and procedures are put in place to enable the organization to investigate/monitor the execution of the process and test it against established performance objectives. An example of such measures and procedures is what Gartner defines as Business Activity Monitoring (BAM) [4].
      5. Monitor & Optimize: The process is monitored against performance objectives. Actual performance statistics are gathered and analyzed. An example of such a statistic is the measure of how quickly an online order is processed and sent for shipping. In addition, these statistics can be used to work with other organizations to improve their connected processes. The highest possible degree of automation can help tremendously. Automation can cut costs, save time, add value, and eventually lead to competitive advantage. Process Mining [5] is a collection of tools and methods related to process monitoring.

      Figure 1: Business Process Management (BPM) Phases

      How can analytics help the business?

      In 2005 Gartner released a thought-provoking study about combining business intelligence with a business process platform. This results in what Gartner calls an environment in which processes are self-configuring and driven by clients or transactions.

      The real challenge with such an endeavor is mapping business rules to intelligent processes that, by definition, need to be self-configurable and transaction-driven.
    • Recent advancements in high-volume data management technologies and data analysis algorithms make the mapping from business rules to intelligent processes plausible. First, analytics enable flow automation and monitoring. Second, removal of manual steps helps improve process reliability and efficiency. Third, analytics can become one of the driving factors for continuous optimization of business processes in the enterprise.

      In conclusion, analytics can be the foundation for the environment that Gartner had envisioned (Figure 2).

      Figure 2: BPM + Analytics Environment

      Embedding analytics into the process lifecycle has tremendous benefits. For example, during the Vision phase, functional leads need to understand market trends, customer behavior, internal business challenges, etc. Being able to comb through treasure troves of data quickly and pick the right signals impacts the long-term profitability of the business. Reliance on analytics during the Design & Simulate phase helps the designers rule out suboptimal designs. During the Execute and Monitor & Optimize phases, analytics can provide automation, ongoing performance evaluation, and decision-making.

      Why can analytics be the business processes foundation?

      Although analytics use cases vary between BPM phases, they all seem to answer the same basic questions: What happened? Why did it happen? Will it happen again? This convergence should be expected. In biology, convergent evolution is a powerful explanatory paradigm. [1] “Convergent evolution describes the acquisition of the same biological trait in unrelated lineages. The wing is a classic example. Although their last common ancestor did not have wings, birds and bats do.” [7] A similar phenomenon is occurring in the business analytics world: although different questions demand different answers, the algorithms that generate the answers are fairly similar.

      The different use cases are converging into three categories of analytics [6] (Figure 3):

      1. Reporting Analytics processes historical data for purposes of reporting statistical information and interpreting the insights identified by analysing the data.
      2. Forecast Analytics begins with a summary constructed from historical data and defines scenarios for better outcomes (“Model & Simulate”).
      3. Predictive Analytics encompasses the previous two categories and adds strategic decision-making.

      Reporting Analytics helps analysts characterize the performance of a process instance by aggregating historical data and presenting it in a human-readable form (e.g. spreadsheets, dashboards, etc.). Business analysts use Reporting Analytics to compare measured performance against objectives. They use Reporting Analytics only to understand the process. The intelligence gathered from Reporting Analytics cannot be used to influence process optimizations or to adjust the overall strategy. Process tuning and strategy adjustments are the subject of the next two types of analytics.
    • Figure 3: Business Analytics Categories

      Forecast Analytics uses data mining algorithms to process historical data (“Report” in Figure 3) and derive insights of statistical significance. These insights are then used to define a set of rules (or mathematical models) called “forecast models,” which are simply mathematical formulas. These models are iterated (“Model & Simulate” in Figure 3) until the one with the best outcome is selected. Forecast Analytics helps analysts optimize the process within prescribed boundaries. Practitioners can tune the process, for example by adopting automation, which is fundamentally the first step toward intelligent processes. Forecast Analytics’ primary role is to influence the optimizations needed to tune a process; however, it doesn’t provide the analyst with the insights needed to make strategy adjustments.

      Predictive Analytics offers the greatest opportunity to influence the strategy from which business objectives will be born. Predictive Analytics begins with historical facts, takes data mining into consideration, and fast-tracks the definition and validation of forecast models. Predictive Analytics looks at the strategy and its derived processes holistically (“Predict” in Figure 3).

      Let’s look at an example. We’ll consider the case of a home improvement company. Historical data indicates that ant killer sells very well across the southern U.S. during summer months. Historical data also indicates that shelf inventory sells very slowly and at deep discounts after Labor Day. This year the company wants to make sure there is no shelf inventory come Labor Day. Also, the ant killer manufacturer has announced a new product that combines the ant killer with a lawn fertilizer. How can analytics help?

      First, the company needs to start with Reporting Analytics to understand factors like volume of sales per month, geo-distribution across the region, sales volume for each sales representative, discounts after Labor Day, etc. Second, the company needs to consider Forecast Analytics to simulate various sales scenarios and choose the one that meets the strategic criterion: no inventory left come Labor Day. The results may include: accelerate sales in July and August using coupons, hire more sales representatives to “push” the inventory quicker, etc. Third, the company needs to use Predictive Analytics to identify the best strategy for selling the new product. Contributing factors to the new strategy may be not only the ant killer sales figures but also information like excessive drought zones (in these areas homeowners need both bug killers and fertilizers to keep their lawns bug free and beautiful during summer months), single-home ownership rates, demographic characteristics, social networks, etc.

      To summarize, the three categories of analytics build on each other. They all attack the same problem, though they do it at different levels and take a different view. It all starts with historical data, which is what Reporting Analytics is concerned with. Next comes Forecast Analytics, which has the power to influence the outcome of the interaction with the customer. Forecast Analytics gives us a glimpse into the future, though a very narrow one because it is based on limited insight. Predictive Analytics really opens the window into the future and lets us decide whether we like it or not.
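The first two steps of the ant killer example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the monthly sales figures, the starting inventory, the coupon multipliers, and the function names `report` and `simulate` are all hypothetical and do not come from the white paper. Predictive Analytics, which would bring in external signals such as drought zones or demographics, is deliberately left out.

```python
# Hypothetical monthly unit sales of ant killer (illustrative numbers only).
history = {"May": 400, "Jun": 600, "Jul": 700, "Aug": 500}


def report(sales):
    """Reporting Analytics: summarize what happened in the historical data."""
    total = sum(sales.values())
    best = max(sales, key=sales.get)
    return {"total_units": total, "best_month": best}


def simulate(inventory, baseline, uplift):
    """Forecast Analytics: project leftover inventory for one scenario.

    `uplift` maps a month to an assumed multiplier on baseline sales.
    """
    for month, units in baseline.items():
        inventory -= units * uplift.get(month, 1.0)
    return max(inventory, 0)


# Two hypothetical scenarios to compare against the Labor Day criterion.
scenarios = {
    "do nothing": {},
    "coupons in Jul/Aug": {"Jul": 1.2, "Aug": 1.3},
}
starting_inventory = 2450
leftover = {
    name: simulate(starting_inventory, history, uplift)
    for name, uplift in scenarios.items()
}
# Keep only the scenarios that leave no shelf inventory.
winners = [name for name, units in leftover.items() if units == 0]
```

Here `report(history)` plays the role of the "Report" step, and comparing `leftover` across scenarios is a very small stand-in for "Model & Simulate": the scenario that leaves zero units on the shelf is the one that meets the strategic criterion.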
    • Great, I understand it now! What about these exponentially growing volumes of data? Would analytics scale?

      An emerging trend that is beginning to disrupt traditional analytics is the ever-increasing amount of mostly unstructured data that organizations need to store and process. Tagged as big data, it refers to the means by which an organization can create, manipulate, store, and manage extremely large data sets (think tens of petabytes of data). Difficulties include capture, storage, search, sharing, analytics, and visualization. This trend continues because of the benefits of working with larger and larger datasets, which allow analysts to gain insights never possible before. [8] [10]

      Big data analytics require technologies like MPP (massively parallel processing) to process large quantities of data. Examples of organizations with large quantities of data are the oil and gas companies that gather huge amounts of geophysical data.

      Two chief concerns of big data analytics are linear scalability and efficient data processing. [9] Nobody wants to start down the big data analytics path and then realize that, in order to keep up with data growth, the solution needs armies of administrators.

      In short, leveraging big data analytics in the enterprise presents both benefits and challenges. From the holistic perspective, big data analytics enable businesses to build processes that encompass a variety of value streams (customers, business partners, internal operations, etc.). The technology offers a much broader set of data management and analysis capabilities and data consumption models. For example, the ability to consolidate data at scale and increase visibility into it has been a desire of the business community for years. Technologies like Hadoop finally make it possible. Businesses no longer need to forgo reporting and insights simply because the technology is not capable or is too expensive.

      Case study: Using big data analytics to optimize/automate IT operations

      Steve: “What was wrong with the server that crashed last week?”
      Bill: “I don’t know. I rebooted it and it’s just fine. Perhaps the software crashed.”

      Anyone who has been in IT operations must have had the above dialog, sometimes quite often. Today’s data centers generate immense quantities of data, and the answer to the above question lies in IT’s ability to mine the data and uncover the chain of events.

      IT operations are a crucial aspect of most organizational operations. Companies rely on their information systems to run their operations. IT must therefore keep high standards for assuring business continuity in spite of hardware or software glitches, network connectivity disruptions, unreliable power systems, etc. Effective IT operations require a balanced investment in both system data gathering and data analysis. Most IT operations nowadays gather up-to-the-minute (or up-to-the-second in some cases) logs from the servers, storage devices, network components, applications running on this infrastructure (e.g. Linux system log), and even the power and cooling components.

      The data lifecycle (Figure 4) begins with the data being generated and collected. The vast majority of the collected data consists of plain text files that have very little in common in the way the content is structured. Data can be stored in its original format or it can be pre-processed and then stored. Pre-processing increases the value of the data by removing less significant content. The data is then stored and made available for processing.

      Processing of the data is mainly focused on two goals:

      • Extract the value (also called insights) from the data through the use of statistical analysis
      • Make the results available for presentation in a format that readily communicates the value

      Figure 4: Data Lifecycle
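The collect, pre-process, process, and present stages of the data lifecycle can be illustrated with a small Python sketch. The log format, host names, and sample lines below are assumptions made for the example; real data center logs vary widely in structure, which is exactly why the pre-processing step matters.

```python
import re
from collections import Counter

# Hypothetical log format: "<timestamp> <host> <LEVEL> <message>".
LINE = re.compile(r"^(\S+) (\S+) (DEBUG|INFO|WARN|ERROR) (.*)$")

# Collected data: plain text with little structure in common (made-up samples).
raw_logs = [
    "2011-06-01T10:00:00 web01 INFO request served",
    "2011-06-01T10:00:01 web01 DEBUG cache hit",
    "2011-06-01T10:00:02 db01 ERROR disk latency high",
    "garbage line without structure",
    "2011-06-01T10:00:03 db01 ERROR disk latency high",
]


def preprocess(lines, keep=("WARN", "ERROR")):
    """Pre-process: parse each line and drop less significant content."""
    for line in lines:
        match = LINE.match(line)
        if match and match.group(3) in keep:
            ts, host, level, message = match.groups()
            yield {"ts": ts, "host": host, "level": level, "message": message}


def process(records):
    """Process: extract a simple insight, the count of significant events per host."""
    return Counter(record["host"] for record in records)


def present(counts):
    """Present: format the insight so that it readily communicates the value."""
    return [f"{host}: {n} significant events" for host, n in counts.most_common()]


report_lines = present(process(preprocess(raw_logs)))
```

In this toy run the only host with significant events is `db01`, which is the kind of starting point Steve and Bill would need to reconstruct the chain of events behind a crash.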
    • The last phase of the data lifecycle is the presentation of the insights uncovered along the way. At this phase data reaches its maximum potential and has the biggest impact on decisions derived from analysis. In the broad spectrum of options, presentation may imply graphic presentation of the results (e.g. a pie chart) or only bundling the results and shipping them off to an application for further examination.

      Big data analytics can help optimize/automate IT operations in several ways:

      • Improve the quality of the control processes by embedding big data analytics in the control path
      • Keep the system operating within set boundaries by being able to predict the future operational state of the system
      • Minimize system downtime by avoiding predictable failures

      Figure 5 illustrates an example of embedding analytics in the control loop of the data center management system.

      As explained above, system components (hardware or software) generate metering data that is readily available on a system-wide data bus. The analytics engine grabs the metering data from the data bus, processes it, and examines the results against historical data (i.e. data that was gathered in a previous iteration). Next, the analytics engine computes the deviation and compares it with the standard deviation defined in profiles. The analytics engine forwards the comparison results to the intelligent controller, which, after evaluating the particular condition against pre-defined policies, issues control commands back to the system.

      Figure 5: Embedding Analytics in Automated System Control

      The control system described above allows IT managers to rethink the operational efficiency of the data center. By harnessing the power of sophisticated analytics, the system’s response can be correlated in a timely manner with the control stimuli and external factors over a broad spectrum of conditions and application workloads. IT managers can optimize the system for the supply side (i.e. utilities), for the demand side (e.g. software applications, business processes, etc.), or for both. The long-term payoffs should outweigh the cost of analytics.
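The control loop in Figure 5 can be approximated with a short Python sketch. Everything specific here is an assumption for illustration: the metric name `cpu_temp_c`, the profile threshold, the `increase_fan_speed` command, and the function names are hypothetical, and a production system would read metering data from a data bus rather than from a list.

```python
from statistics import mean

# Hypothetical profile: allowed standard deviation per metric.
PROFILES = {"cpu_temp_c": 5.0}
# Hypothetical policy: command to issue when a metric is out of bounds.
POLICIES = {"cpu_temp_c": "increase_fan_speed"}


def analyze(metric, reading, history):
    """Analytics engine: compute the deviation of a new reading from the
    historical mean and compare it with the profile's standard deviation."""
    deviation = abs(reading - mean(history))
    return {
        "metric": metric,
        "deviation": deviation,
        "out_of_bounds": deviation > PROFILES[metric],
    }


def control(result):
    """Intelligent controller: evaluate the result against the policies and
    return a control command, or None when no action is needed."""
    if result["out_of_bounds"]:
        return POLICIES[result["metric"]]
    return None


# Metering data gathered in previous iterations (made-up values).
history = [60.0, 61.0, 59.0, 60.0]
# Two new readings: one within bounds, one well outside.
commands = [control(analyze("cpu_temp_c", r, history)) for r in (62.0, 71.0)]
```

The first reading deviates by 2 degrees and triggers nothing; the second deviates by 11 degrees, exceeds the profile, and causes the controller to issue a command back to the system.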
    • Big data analytics challenges in the enterprise

      The adoption of big data analytics in the enterprise can deliver huge benefits but also presents equally important challenges. Examples are in order:

      • An inability to share/correlate knowledge (data and algorithms) across organizational boundaries impacts the bottom line. As mentioned above, analytics are converging. Two or more business units may be working on a similar set of challenges. With no leveraged knowledge among them, each business unit will duplicate efforts only to discover similar solutions. Sharing the value of big data underpins substantial productivity gains and accelerates innovation.
      • Data is locked in many disparate data marts. This is not necessarily a new challenge. It has been seen since the early days of enterprise databases, when two or more departments could not agree on a common set of requirements and decided to go their own ways and build separate data stores. The advent of big data exacerbates the age-old dispute: the sheer volume of data requires even more data marts to store it. Big data mitigates this challenge by leveraging technologies that are built from the ground up to be scalable and schema-agnostic.
      • Traditional enterprise IT processes (e.g. user authentication and authorization) don’t scale with big data. Not being able to enforce and audit access controls against huge quantities of data leaves the enterprise open to unauthorized access and theft of intellectual property.

      The adoption of Hadoop technology

      Hadoop has become the most widely known big data technology implementation. The rise of Hadoop proved to be unstoppable. There is a very vibrant community around Hadoop. Venture capitalists are pouring money into startups much like we saw back in 2000 with Internet companies. Most of these startups begin as academic research projects. Customer demand eventually brings them into the mainstream marketplace, where they start competing with more established providers.

      On the receiving end of the market, businesses are picking up the pace at which Hadoop is deployed. Businesses are beginning to realize that data management, processing, and consumption are emerging as key challenges.

      The wide adoption of Hadoop is hindered by both socio-business and technical factors.

      Examples of socio-business factors are:

      • Hiring. Just like with any high-end niche technology, the emergence of Hadoop requires bleeding-edge data analytics design, processing, and visualization skills. For example, the Hadoop MapReduce API is more complex than SQL. Managing Hadoop deployments is equally complex. These skill sets are in short supply, thus slowing down the adoption of the technology. Hiring will get easier as tools and the underlying technology improve.
      • Confusion among vendors as well as buyers. The rapidly changing market landscape makes it difficult for technology innovators to forecast resource allocation and maximize their returns on investments. Buyers are equally confused because they need more information about the actual business value of the technology and about the costs and the characteristics of successful deployments. Companies like Dell are taking a customer-centric approach. They work directly with customers and vendors to ease the adoption of the technology by providing end-to-end Hadoop solutions and business value metrics, all wrapped in strong services and consulting offerings.
      • The “checkbox” mentality and the genesis of a new form of vendor lock-in. Traditional enterprises demand that their IT organizations require support contracts for all their software applications. The “checkbox” mentality is one in which support is provided so IT can mark off the appropriate checkbox. Yet businesses realize that true opportunities to improve the bottom line come from a deeper understanding of their internal processes; thus demand for big data is rapidly increasing. That leaves IT with only one option: choose one of many competing vendors. Because of fierce competition among vendors, the chosen vendor will try to lock in as much functionality as possible. The answer is a leveraged approach: use open source as much as possible and pay only for the support that is deemed absolutely necessary. Look for vendors that offer both open-source and commercial versions of the technology needed. A different, longer-term answer is standardization (e.g. of the API, the data models, the algorithms, etc.).
    • Hadoop technical strengths and weaknesses

      Hadoop has been designed from the ground up for seamless scalability and massively parallel compute and storage. Hadoop has been optimized for high aggregate data throughput (as opposed to query latency). The real power of Hadoop is in the number of compute nodes in the cluster rather than the compute and storage capacity of each individual node.

      Hadoop’s strengths are:

      • It is highly scalable: Yahoo runs Hadoop on thousands of nodes.
      • It integrates storage and compute: the data is processed right where it is stored.
      • It supports a broad range of data formats (CSV, XML, XSL, GIF, JPEG, SAM, BAM, TXT, JSON, etc.).
      • Data doesn’t have to be “normalized” before it is stored in Hadoop.

      Examples of Hadoop’s weaknesses are:

      • Security: Hadoop has a fairly incoherent security design. Data access controls are implemented at the lowest level of the stack (the file system on each compute node). Also, there is no binding between data access and job access models.
      • Advanced IT operations and developer skills are required.
      • Lack of enterprise hardening: the NameNode is a single point of failure.

      Dell | Hadoop solutions

      The Dell | Hadoop solutions lower the barrier to adoption for businesses looking to use Hadoop in production. Dell’s customer-centered approach is to create rapidly deployable and highly optimized end-to-end Hadoop solutions running on commodity hardware. Dell provides all the hardware and software components and resources to meet the customer’s requirements, and no other supplier need be involved.

      The hardware platforms for the Dell | Hadoop solutions (Figure 6) are the Dell™ PowerEdge™ C Series and Dell™ PowerEdge™ R Series. Dell PowerEdge C Series servers are focused on hyperscale and cloud capabilities. Rather than emphasizing gigahertz and gigabytes, these servers deliver maximum density, memory, and serviceability while minimizing total cost of ownership. It’s all about getting the processing customers need in the least amount of space and in an energy-efficient package that slashes operational costs. Dell PowerEdge R Series servers are widely popular with a variety of customers for their ease of management, virtually tool-less serviceability, power and thermal efficiency, and customer-inspired designs. Dell PowerEdge R Series servers are multi-purpose platforms designed to support multiple usage models/workloads for customers who want to minimize differing hardware product types in their environments.

      The operating system of choice for the Dell | Hadoop solutions is Linux (e.g. Red Hat Enterprise Linux, CentOS, etc.). The recommended Java Virtual Machine (JVM) is the Oracle Sun JVM.

      The hardware platforms, the operating system, and the Java Virtual Machine make up the foundation on which the Hadoop software stack runs.

      Figure 6: Dell | Hadoop Solution Taxonomy
    • The bottom layer of the Hadoop stack (Figure 6) comprises two frameworks:

      1. The Data Storage Framework (HDFS) is the filesystem that Hadoop uses to store data on the cluster nodes. Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable filesystem.
      2. The Data Processing Framework (MapReduce) is a massively parallel compute framework inspired by Google’s MapReduce papers.

      The next layer of the stack in the Dell | Hadoop solution design is the network layer. Dell recommends implementing the Hadoop cluster on a dedicated network for two reasons:

      1. Dell provides network design blueprints that have been tested and qualified.
      2. Network performance predictability: sharing the network with other applications may have a detrimental impact on the performance of the Hadoop jobs.

      The next two frameworks, the Data Access Framework and the Data Orchestration Framework, comprise utilities that are part of the Hadoop ecosystem.

      Dell listened to its customers and designed a Hadoop solution that is fairly unique in the marketplace. Dell’s end-to-end solution approach means that the customer can be in production with Hadoop in the shortest time possible. The Dell | Hadoop solutions embody all the software functions and services needed to run Hadoop in a production environment. The customer is not left wondering, “What else is missing?” One of Dell’s chief contributions to Hadoop is a method to rapidly deploy and integrate Hadoop in production. Other major contributions include integrated backup, management, and security functions. These complementary functions are designed and implemented side-by-side with the core Hadoop technology.

      Installing and configuring Hadoop is non-trivial. There are different roles and configurations that need to be deployed on various nodes. Designing, deploying, and optimizing the network layer to match Hadoop’s scalability requires careful thought and consideration of the type of workloads that will be running on the Hadoop cluster. The deployment mechanism that Dell designed for Hadoop automates the deployment of the cluster from “bare-metal” (no operating system installed) all the way to installing and configuring the Hadoop software components to specific customer requirements. Intermediary steps include system BIOS update and configuration, RAID/SAS configuration, operating system deployment, Hadoop software deployment, Hadoop software configuration, and integration with the customer’s data center applications (e.g. monitoring and alerting).

      Data backup and recovery is another topic that was brought up during customer roundtables. As Hadoop becomes the de facto platform for business-critical applications, the data that is stored in Hadoop is crucial for ensuring business continuity. Dell’s approach is to offer several enterprise-grade backup solutions and let the customer choose.

      Customers also commented on the current security model of Hadoop. It is a real concern because, as a larger number of business users share access to exponentially increasing volumes of data, the security designs and practices need to evolve to accommodate the scale and the risks involved. Also, HIPAA, Sarbanes-Oxley, SAS70, and PCI Security Standards Council requirements may apply to data stored in Hadoop. Particularly in industries like healthcare and financial services, access to the data has to be enforced and monitored across the entire stack. Unfortunately, there is no clear answer on how the security architecture of Hadoop is going to evolve. Dell’s approach is to educate the customer and also work directly with leading vendors to deliver a model that suits the enterprise.

      Lastly, Dell’s open, integrated approach to enterprise-wide systems management enables customers to build comprehensive system management solutions based on open standards and integrated with industry-leading partners. Instead of building a patchwork of solutions leading to systems management sprawl, Dell integrates the management of the Dell hardware running the Hadoop cluster with the “traditional” Hadoop management consoles (Ganglia, Nagios).

      To summarize, Dell is adding Hadoop to its data analytics solutions portfolio. Dell’s end-to-end solution approach means that Dell will provide readily available software interfaces for integration between the solutions in the portfolio. Dell will provide the ETL connector (Figure 6) that integrates Hadoop with the Dell | Aster Data solution.
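For readers new to the Data Processing Framework introduced at the start of this section, the MapReduce programming model can be imitated on a single machine in plain Python. This is a conceptual sketch only: it does not use the Hadoop MapReduce API, the function names are hypothetical, and the in-memory `shuffle` stands in for the distributed grouping that Hadoop performs across cluster nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would
    do between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Made-up input records; in Hadoop these would be read from HDFS blocks.
records = ["Hadoop stores data", "Hadoop processes data"]
pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each call to `map_phase` looks at one record in isolation and each call to `reduce_phase` looks at one key in isolation, both phases can in principle be spread across many nodes, which is where the framework's parallelism comes from.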
    • Dell | Hadoop White Paper Series: Hadoop Enterprise ReadinessDell | Hadoop for the enterpriseIn this section we introduce several best practices for deploying and running Hadoop in an enterprise environment:  Hardware selection  Integrating Hadoop with Enterprise Data Warehouse (data models, data governance, design optimization)  Data security  Backup and recoveryThe focus in the paper is only on the introduction and high-level overview of these best practices. Our goal is to raise theawareness among enterprise practitioners and help them create successful Hadoop-based designs. We leave theimplementation to be presented in an upcoming white paper titled Hadoop Enterprise How-To published in the same series.The inherent challenge with recommendations for Hadoop in the enterprise is that there is not a lot of published research todraw on. However, Dell has a very strong practice in defining and implementing best practices for its enterprise customers.Thus, we had to take a different approach. Namely, we began with a gap analysis of Hadoop and drew on our enterprisepractice to derive recommendations that are likely to have the most profound impact on building Hadoop solutions for theenterprise.As mentioned above, we intentionally left the details for additional white papers because we did not want to run the risk ofmaking this high-level outline overly complex and, thus, fail to meet the original goal, which was to raise awareness.Let’s now look at what it takes to run Hadoop in the enterprise.First off, we’ve been using clustering technologies like HPCC in the enterprise for years. How is Hadoop different fromHPCC?The main difference between high-performance computuing (HPC) and Hadoop is in the way the compute nodes in thecluster access the data that they need to process. Traditional HPC architectures employ a shared-disk setup—all computenodes process data loaded in a shared network storage pool. 
Network latency and disk bandwidth become the critical factors for HPC job performance. Therefore, low-latency network technologies (like InfiniBand) are commonly deployed in HPC.

Hadoop uses a shared-nothing architecture: data is distributed and copied locally on each compute node. Hadoop does not need a low-latency network; the vast majority of Hadoop deployments therefore use cheaper Ethernet networks. [11]

Got it! Let’s now look at the hardware. Is there anything I should be concerned with?

The quick answer is YES. First and foremost, standardization is key. Using the same server platform for all Hadoop nodes can save considerable money and allows for faster deployments. Other best practices for hardware selection include:
      • Use commodity hardware: Commodity hardware can be re-assigned between applications as needed. Specialized hardware cannot be moved that easily.
      • Purchase full racks: Hadoop scales very well with the number of racks, so why not let Dell do the rack-n-stack and wheel in the full rack?
      • Abstract the network and naming: Any IP addressing scheme, no matter how complex or laborious, can scale to only a few hundred nodes. Using DNS and CNAMEs scales much better.

Okay, I got the racks in production. How do I exchange data between Hadoop and my data marts?

The answer varies depending on who is asking the question.

To an IT architect, this is a typical system integration challenge. That is, there are two systems (Hadoop and the data mart) that need to be integrated with each other. For example, the IT architect would have to design the network connectivity between the two systems. Figure 7 illustrates a possible network connectivity design.
    • Figure 7: Example of Network Connectivity between a Hadoop Cluster and a Data Mart

To a data analyst, this is a data pipeline design challenge (Figure 8). The analyst’s chief concerns are data formatting, availability of data for processing and analysis, query performance, etc. The data analyst doesn’t need to know the topology of the network connectivity between the Hadoop cluster and the particular data mart.

The difference between the two perspectives could hardly be greater.

The solution is a mix of IT best practices and database administration best practices. The details are covered in an upcoming white paper, titled Integrating Hadoop and Data Warehouse, published in this same series of papers.

Figure 8: Example of Data Pipeline between Hadoop and Data Warehouse
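From the analyst’s side, the exchange between Hadoop and a data mart can be pictured as a small load step. The following is a minimal sketch, not Dell’s connector: it assumes the Hadoop job wrote tab-separated key/value text (the layout of Hadoop’s default text output), it uses an in-memory SQLite database as a stand-in for the data mart, and the table name `host_errors` and the sample records are hypothetical.

```python
import sqlite3

# Hypothetical job output: one "key<TAB>value" record per line.
SAMPLE_OUTPUT = "web01\t42\nweb02\t17\ndb01\t5\n"


def parse_records(text):
    """Yield (host, error_count) tuples, skipping blank or malformed lines."""
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 2 or not parts[1].strip().isdigit():
            continue  # data formatting is the analyst's first concern
        yield parts[0].strip(), int(parts[1])


def load_into_mart(conn, records):
    """Create the (hypothetical) summary table and bulk-insert the records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS host_errors "
        "(host TEXT PRIMARY KEY, errors INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO host_errors VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the data mart
load_into_mart(conn, parse_records(SAMPLE_OUTPUT))
rows = conn.execute(
    "SELECT host, errors FROM host_errors ORDER BY errors DESC"
).fetchall()
print(rows)
```

Note that nothing in the sketch depends on how the two systems are wired together, which is the point made above: the analyst works with formats and queries, while the network topology remains the IT architect’s concern.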
    • Great, I now have data in Hadoop! How should I secure the access to it?

Of all the technical challenges that Hadoop exhibits, the security model is likely to be the biggest obstacle to the adoption of Hadoop in the enterprise. Hadoop relies on Linux user permissions for data access. These user permissions are enforced only at the lowest level of the stack (the HDFS layer on each compute node) instead of being checked and enforced at the metadata layer (the NameNode) or higher. Jobs use the same userID to get access to data stored in Hadoop. A person skilled in the art can deploy a man-in-the-middle or denial-of-service attack.

It should be noted that both Yahoo and Cloudera are making intense efforts to bring Hadoop’s security in line with enterprise requirements. Meanwhile, the security best practices include:
      • Ensure strong perimeter security: for example, use strong authentication and encryption for all network access to the Hadoop cluster.
      • Use Kerberos inside Hadoop to authenticate users, so that authorization checks rest on verified identities.
      • If purchasing support from Cloudera is an option, use Cloudera Enterprise to streamline the management of the security functions across all the machines in the cluster.

Great, I’ll pay close attention to security! Last question: how do I back up the data in Hadoop?

Again, it depends on who is asking.

The IT administrator would be concerned about backup policies, media management, etc.

The data analyst wants to make sure that the data has been saved entirely, which means that the backup solution needs to be data-aware. Sometimes a dataset may be composed of more than one file. Any file in Hadoop is broken down into a number of blocks that are handed off to Hadoop nodes for storage.
A file-aware (or even worse, block-aware) backup solution will not maintain the dataset metadata (the association rules between files), which will render the dataset completely useless.

The intersection between the two views is the vision for Hadoop data backup. The best practices include:
      • Decide where the data is backed up: NAS, SAN, cloud, or another Hadoop cluster. While using the cloud for backing up the data makes perfect sense, most enterprises tend to keep the data private within the corporate firewall. Saving the data to another Hadoop cluster also makes sense; however, the destination Hadoop cluster will need a backup solution of its own. Realistically, there are only two options for backup: NAS and SAN. If the backup needs only volume and average performance is acceptable, then the answer is NAS. For best-in-class performance and undisrupted access requirements, the answer is SAN.
      • Dedupe your data.
      • Prioritize your data: back up only the data that is deemed valuable.
      • Add dataset metadata awareness to the backup.
      • Establish backup policies for both metadata and actual data.

Great, thanks, that makes sense! What do I do if I have questions?

First, please don’t hesitate to contact the author; contact information is provided below. Second, Dell offers a broad variety of consulting, support, and training services for Hadoop. Your Dell sales representative can put you in touch with the Dell Services team.
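Returning to the backup recommendation on dataset metadata awareness: one way to picture it is a small manifest that travels with the backup and records which files make up a dataset. The sketch below is an illustration under stated assumptions, not a feature of Hadoop or of any Dell product. The manifest layout, the function names, and the sample files are all hypothetical, and file contents are held in memory only to keep the example self-contained.

```python
import hashlib
import json


def build_manifest(dataset_name, files):
    """Describe a dataset as one unit: its member files, sizes, and checksums.

    `files` maps a file name to its bytes; a real backup tool would stream
    the contents from storage instead of holding them in memory.
    """
    return {
        "dataset": dataset_name,
        "files": {
            name: {"size": len(data), "sha256": hashlib.sha256(data).hexdigest()}
            for name, data in sorted(files.items())
        },
    }


def verify_restore(manifest, restored_files):
    """Return a list of problems; an empty list means the dataset is complete."""
    problems = []
    for name, meta in manifest["files"].items():
        data = restored_files.get(name)
        if data is None:
            problems.append(f"missing: {name}")
        elif hashlib.sha256(data).hexdigest() != meta["sha256"]:
            problems.append(f"corrupt: {name}")
    return problems


dataset = {"orders-part-0": b"1,widget\n", "orders-part-1": b"2,gadget\n"}
manifest = build_manifest("orders", dataset)
manifest_json = json.dumps(manifest, indent=2)  # stored alongside the backup

partial = {"orders-part-0": b"1,widget\n"}  # a restore that lost one file
print(verify_restore(manifest, dataset))
print(verify_restore(manifest, partial))
```

A file-aware backup would report the partial restore as a success because the one file it holds is intact; the manifest is what reveals that the dataset as a whole is incomplete.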
    • About the author

Aurelian “A.D.” Dumitru is the Dell | Hadoop chief architect. In that role he is responsible for all architecture decisions and long-term strategy for Hadoop. A.D. has over 20 years of experience. He has been with Dell for more than 11 years in various engineering, architecture, and management positions. His background is in hyperscale massively parallel compute systems. His interests are in automated process control, intelligent processes, and machine learning. Over the years he has authored or made significant contributions to more than 20 patent applications, from RFID and automated process controls to software security and mathematical algorithms. For similar topics please check his personal blog www.RationalIntelligence.com.

Special thanks

The author wishes to thank Nicholas Wakou, Howard Golden, Thomas Masson, Lee Zaretsky, Joey Jablonski, Scott Jensen, John Igoe, and Matthew McCarthy for their helpful comments.

About Dell Next Generation Computing Solutions

When cloud computing is the core of your business and its efficiency and vitality underpin your success, the Dell Next Generation Computing Solutions are Dell’s response to your unique needs. We understand your challenges, from compute and power density to global scaling and environmental impact. Dell has the knowledge and expertise to tune your company’s “factory” for maximum performance and efficiency.

Dell’s Next Generation Computing Solutions provide operational models backed by unique product solutions to meet the needs of companies at all stages of their lifecycles. Solutions are designed to meet the needs of small startups while allowing scalability as your company grows.

Deployment and support are tailored to your unique operational requirements. Dell’s Cloud Computing Solutions can help you minimize the tangible operating costs that have hyperscale impact on your business results.

Hadoop ecosystem component “decoder ring”

  1. Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data
  2. MapReduce: a software framework for distributed processing of large data sets on compute clusters
  3. Avro: a data serialization system
  4. Chukwa: a data collection system for managing large distributed systems
  5. HBase: a scalable, distributed database that supports structured data storage for large tables
  6. Hive: a data warehouse infrastructure that provides data summarization and ad-hoc querying
  7. ZooKeeper: a high-performance coordination service for distributed applications
  8. Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
  9. Sqoop (from Cloudera): a tool designed to import data from relational databases into Hadoop. Sqoop uses JDBC to connect to a database.
  10. Flume (from Cloudera): a distributed service for collecting, aggregating and moving large amounts of log data. Its architecture is based on streaming data flows.

(Source: http://hadoop.apache.org/)
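To make the MapReduce entry above more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps applied to a word count. It only imitates the programming model in plain Python; it does not use Hadoop’s API, and the function names and sample input are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


lines = ["big data big clusters", "data nodes"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

In a real cluster the map and reduce steps run in parallel on many nodes and the framework performs the shuffle over the network; the sketch keeps everything in one process so the data flow is easy to follow.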
    • Bibliography

[1] Donald F. Ferguson et al., Enterprise Business Process Management - Architecture, Technology and Standards, Lecture Notes in Computer Science 4102, 1-15, 2006
[2] Andrew Spanyi, Business Process Management (BPM) is a Team Sport: Play it to Win!, Meghan-Kiffer Press, June 2003, ISBN 978-0929652023
[3] http://en.wikipedia.org/wiki/Business_process_management
[4] David W. McCoy, Business Activity Monitoring: Calm Before the Storm, Gartner, 2002, http://www.gartner.com/resources/105500/105562/105562.pdf
[5] http://en.wikipedia.org/wiki/Process_mining
[6] http://www.bpminstitute.org/articles/article/article/bringing-analytics-into-processes-using-business-rules.html
[7] http://en.wikipedia.org/wiki/Convergent_evolution
[8] http://en.wikipedia.org/wiki/Big_data
[9] http://www.asterdata.com/blog/2008/05/19/discovering-the-dimensions-of-scalability/
[10] McKinsey Global Institute, Big data: The next frontier for innovation, competition, and productivity, May 2011
[11] S. Krishnan et al., myHadoop - Hadoop-on-demand on Traditional HPC Resources, University of California at San Diego, 2010

To learn more

To learn more about Dell cloud solutions, contact your Dell representative or visit: www.dell.com/hadoop

©2011 Dell Inc. All rights reserved. Trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Specifications are correct at date of publication but are subject to availability or change without notice at any time. Dell and its affiliates cannot be responsible for errors or omissions in typography or photography. Dell’s Terms and Conditions of Sales and Service apply and are available on request. Dell service offerings do not affect consumer’s statutory rights. Dell, the DELL logo, and the DELL badge, PowerConnect, and PowerVault are trademarks of Dell Inc.