Combining Hadoop with Big Data Analytics


Contents
  • Hadoop for Big Data Puts Architects on Journey of Discovery
  • 'Big Data' Analytics Programs Require Tech Savvy, Business Know-How

This e-guide explores the growing popularity of leveraging Hadoop, an open source software framework, to tackle the challenges of big data analytics, as well as some of the key business and technical barriers to adoption. Gain expert advice on the skill sets and roles needed for successful big data analytics programs, and discover how organizations can design and assemble the right team.

Hadoop for Big Data Puts Architects on Journey of Discovery
By: Jessica Twentyman, Contributor

"Problems don't care how you solve them," said James Kobielus, an analyst with Forrester Research, in a recent blog on the topic of 'big data'. "The only thing that matters is that you do indeed solve them, using any tools or approaches at your disposal."

In recent years, that imperative -- to solve the problems posed by big data -- has led data architects at many organizations on a journey of discovery. Put simply, the traditional database and business intelligence tools that they routinely use to slice and dice corporate data are just not up to the task of handling big data.

To put the challenge into perspective, think back a decade: few enterprise data warehouses grew as large as a terabyte. By 2009, Forrester analysts were reporting that over two-thirds of enterprise data warehouses (EDWs) deployed were in the 1 to 10-terabyte range. By 2015, they claim, a majority of EDWs in large organizations will be 100 TB or larger, "with petabyte-scale EDWs becoming well entrenched in sectors such as telecoms, financial services and web commerce."

What's needed to analyze these big data stores are special new tools and approaches that deliver "extremely scalable analytics," said Kobielus. By "extremely scalable," he means analytics that can deal with data that stands
out in four key ways: for its volume (from hundreds of terabytes to petabytes and beyond); its velocity (up to and including real-time, sub-second delivery); its variety (diverse structured, unstructured and semi-structured formats); and its volatility (wherein scores to hundreds of new data sources come and go from new applications, services, social networks and so on).

Hadoop for Big Data on the Rise

One of the principal approaches to emerge in recent years has been Apache Hadoop, an open source software framework that supports data-intensive distributed analytics involving thousands of nodes and petabytes of data.

The underlying technology was originally invented at search engine giant Google by in-house developers looking for a way to usefully index textual data and other "rich" information and present it back to users in meaningful ways. They called this technology MapReduce -- and today's Hadoop is an open-source implementation, used by data architects for analytics that are deep and computationally intensive.

In a recent survey of enterprise Hadoop users conducted by Ventana Research on behalf of Hadoop specialist Karmasphere, 94% of Hadoop users reported that they can now perform analyses on large volumes of data that weren't possible before; 88% said that they analyze data in greater detail; and 82% can now retain more of their data.

It's still a niche technology, but Hadoop's profile received a serious boost over the past year, thanks in part to start-up companies such as Cloudera and MapR that offer commercially licensed and supported distributions of Hadoop. Its growing popularity is also the result of serious interest shown by EDW vendors like EMC, IBM and Teradata.
EMC bought Hadoop specialist Greenplum in June 2010; Teradata announced its acquisition of Aster Data in March 2011; and IBM announced its own Hadoop offering, Infosphere, in May 2011.

What stood out about Greenplum for EMC was its x86-based, scale-out MPP, shared-nothing design, said Mark Sears, architect at the new organization EMC Greenplum. "Not everyone requires a set-up like that, but more and more customers do," he said.

"In many cases," he continued, "these are companies that are already adept at analysis, at mining customer data for buying trends, for example. In the past, however, they may have had to query a random sample of data relating to 500,000 customers from an overall pool of data of 5 million customers. That opens up the risk that they might miss something important. With a big data approach, it's perfectly possible and relatively cost-effective to query data relating to all 5 million."

Hadoop Ill Understood by the Business

While there's a real buzz around Hadoop among technologists, it's still not widely understood in business circles. In the simplest terms, Hadoop is engineered to run across a large number of low-cost commodity servers and distribute data volumes across that cluster, keeping track of where individual pieces of data reside using the Hadoop Distributed File System (HDFS). Analytic workloads are performed across the cluster, in a massively parallel processing (MPP) model, using tools such as Pig and Hive. Results, meanwhile, are delivered as a unified whole.

Earlier this year, Gartner analyst Marcus Collins mapped out some of the ways he's seeing Hadoop used today: by financial services companies, to discover fraud patterns in years of credit-card transaction records; by mobile telecoms providers, to identify customer churn patterns; by academic researchers, to identify celestial objects from telescope imagery.

"The cost-performance curve of commodity servers and storage is putting seemingly intractable complex analysis on extremely large volumes of data within the budget of an increasingly large number of enterprises," he concluded.
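The map-shuffle-reduce pattern at the heart of Hadoop can be illustrated with a few lines of Python. This is a single-machine analogue of the model, not Hadoop's actual API; in a real cluster, the map and reduce phases run in parallel across many nodes and the shuffle is handled by the framework:

```python
from collections import defaultdict

# Map phase: each "mapper" turns a line of text into (word, 1) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

# Shuffle phase: group intermediate pairs by key, as the framework would
# when routing mapper output to reducers.
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate each key's values into a single result.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big tools", "hadoop distributes big jobs"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])  # "big" appears three times across the two lines
```

Higher-level tools such as Pig and Hive generate jobs of this shape from scripts or SQL-like queries, which is what lets analysts use the cluster without writing the map and reduce functions by hand.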
"This technology promises to be a significant competitive advantage to early adopter organizations."

Early adopters include daily deal site Groupon, which uses the Cloudera distribution of Hadoop to analyse transaction data generated by over 70 million registered users worldwide, and NYSE Euronext, which operates the New York stock exchange as well as other equities and derivatives markets across Europe and the US. NYSE Euronext uses EMC Greenplum to manage transaction data that is growing, on average, by as much as 2 TB each day. US bookseller Barnes & Noble, meanwhile, is using Aster Data nCluster (now owned by Teradata) to better understand customer preferences and buying patterns across three sales channels: its retail outlets, its online store and e-reader downloads.

Skills Deficit in Big Data Analytics

While Hadoop's cost advantage might give it widespread appeal, the skills challenge it presents is more daunting, according to Collins. "Big data analytics is an emerging field, and insufficiently skilled people exist within many organizations or the [wider] job market," he said.

Skills-related issues that stand in the way of Hadoop adoption include the technical nature of the MapReduce framework, which requires users to develop Java code, and the lack of institutional knowledge on how to architect the Hadoop infrastructure. Another impediment to adoption is the lack of support for analysis tools used to examine data residing within the Hadoop architecture.

That, however, is starting to change. For a start, established BI tools vendors are adding support for Apache Hadoop: one example is Pentaho, which introduced support in May 2010 and, subsequently, specific support for the EMC Greenplum distribution.

Another sign of Hadoop's creep towards the mainstream is support from data integration suppliers, such as Informatica.
One of the most common uses of Hadoop to date has been as an engine for the 'transformation' stage of ETL (extraction, transformation, loading) processes: the MapReduce technology lends itself well to the task of preparing data for loading and further analysis in both conventional RDBMS-based data warehouses and those based on HDFS/Hive. In response, Informatica announced in May 2011 that integration is under way between EMC Greenplum and its own Data Integration Platform.
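A minimal sketch of the kind of record-at-a-time transformation that suits the MapReduce model well: each raw record is parsed, cleaned, and emitted as a row ready for loading. The records and field names here are hypothetical, and in a real Hadoop job this function would run inside a mapper over millions of records:

```python
# Hypothetical raw transaction records: messy whitespace, mixed-case
# identifiers, and amounts stored as decimal strings.
raw_records = [
    "2011-05-03T10:15:00, CUST-001 , gbp , 19.99",
    "2011-05-03T10:16:30, cust-002 , GBP , 5.00",
]

def transform(record):
    """The 'T' in ETL: normalise one raw record into a clean, loadable row."""
    timestamp, customer_id, currency, amount = [f.strip() for f in record.split(",")]
    return {
        "timestamp": timestamp,
        "customer_id": customer_id.upper(),   # canonical upper-case IDs
        "currency": currency.upper(),
        "amount_pence": int(round(float(amount) * 100)),  # integer minor units
    }

rows = [transform(r) for r in raw_records]
print(rows[1]["customer_id"])   # "CUST-002"
print(rows[1]["amount_pence"])  # 500
```

Because each record is transformed independently, the work partitions cleanly across a cluster, which is exactly why the transformation stage was an early fit for Hadoop.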
Additionally, there's the emergence of Hadoop appliances, as evidenced by EMC's Greenplum HD (a bundle that combines MapR's version of Hadoop, the Greenplum database and a standard x86-based server), announced in May, and more recently (August 2011), the Dell/Cloudera Solution (which combines Dell PowerEdge C2100 servers and PowerConnect switches with Cloudera's Hadoop distribution and its Cloudera Enterprise management tools).

Finally, Hadoop is proving well suited to deployment in cloud environments, a fact which gives IT teams the chance to experiment with pilot projects on cloud infrastructures. Amazon, for example, offers Amazon Elastic MapReduce as a hosted service running on its Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). The Apache distribution of Hadoop, moreover, now comes with a set of tools designed to make it easier to deploy and run on EC2.

"Running Hadoop on EC2 is particularly appropriate where the data already resides within Amazon S3," said Collins.
If the data does not reside on Amazon S3, he said, then it must be transferred: the Amazon Web Services (AWS) pricing model includes charges for network costs, and Amazon Elastic MapReduce pricing is in addition to normal Amazon EC2 and S3 pricing. "Given the volume of data required to run big data analytics, the cost of transferring data and running the analysis should be carefully considered," he cautioned.

The bottom line is that "Hadoop is the future of the EDW and its footprint in companies' core EDW architectures is likely to keep growing throughout this decade," said Kobielus.

'Big Data' Analytics Programs Require Tech Savvy, Business Know-How
By: Rick Sherman, Contributor

Articles and case studies written about "big data" analytics programs proclaim the successes of organizations, typically with a focus on the
technologies used to build the systems. Product information is useful, of course, but it's also critical for enterprises embarking on these projects to be sure they have people with the right skills and experience in place to successfully leverage big data analytics tools.

Knowledge of new or emerging big data technologies, such as Hadoop, is often required, especially if a project involves unstructured or semi-structured data. But even in such cases, Hadoop know-how is only a small portion of the overall staffing requirements; skills are also needed in areas such as business intelligence (BI), data integration and data warehousing. Business and industry expertise is another must-have for a successful big data analytics program, and many organizations also require advanced analytics skills, such as predictive analytics and data mining capabilities.

Data scientist is a new job title that is getting some industry buzz. But it's really a combination of skills from the areas listed above, including predictive modeling, both statistical and business analysis, and data integration development. I've seen some job postings for data scientists that require a statistical degree as well as an industry-specific one. Having both degrees and matching professional experience is great, but what are the chances of finding job candidates who do? Pretty slim!

Taking a more realistic look at what to plan for, I describe below the key skill sets and roles that should be part of big data analytics deployments. Instead of listing specific job titles, I've kept it general; organizations can bundle together the required roles and responsibilities in various ways based on the capabilities and experience levels of their existing workforce and new employees that they hire.

Business knowledge. In addition to technical skills, companies engaging in big data analytics projects need to involve people with extensive business and industry expertise. That must include knowledge of a company's own business strategy and operations as well as its competitors, current industry conditions and emerging trends, customer demographics and both macro- and microeconomic factors.
Much of the business value derived from big data analytics comes not from textbook key performance indicators but rather from insights and metrics gleaned as part of the analytics process. That process is part science (i.e., statistics) and part art, with users doing what-if analysis to gain actionable information about an organization's business operations. Such findings are only possible with the participation of business managers and workers who have firsthand knowledge of business strategies and issues.

Business analysts. This group also has an important role to play in helping organizations to understand the business ramifications of big data. In addition to doing analytics themselves, business analysts can be tasked with things such as gathering business and data requirements for a project, helping to design dashboards, reports and data visualizations for presenting analytical findings to business users and assisting in measuring the business value of the analytics program.

BI developers. These people work with the business analysts to build the required dashboards, reports and visualizations for business users. In addition, depending on internal needs and the BI tools that an organization is using, they can enable self-service BI capabilities by preparing the data and the required BI templates for use by business executives and workers.

Predictive model builders. In general, predictive models for analyzing big data need to be custom-built. Deep statistical skills are essential to creating good models; too often, companies underestimate the required skill level and hire people who have used statistical tools, including models developed by others, but don't have knowledge of the mathematical principles underlying predictive models.

Predictive modelers also need some business knowledge and an understanding of how to gather and integrate data into models, although in both cases they can leverage other people's more extensive expertise in creating and refining models. Another key skill for modelers to have is the ability to assess how comprehensive, clean and current the available data is before building models. There often are data gaps, and a modeler has to be able to close them.
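The data-assessment step described above can be sketched simply: before any model is built, measure how complete and how current each field is. The dataset, field names, and cutoff date below are hypothetical, chosen only to illustrate the checks:

```python
from datetime import date

# Hypothetical customer records with gaps: one missing age, one missing date.
records = [
    {"customer_id": 1, "age": 34, "last_purchase": date(2011, 7, 1)},
    {"customer_id": 2, "age": None, "last_purchase": date(2009, 2, 14)},
    {"customer_id": 3, "age": 51, "last_purchase": None},
]

def completeness(records, field):
    """Fraction of records with a non-missing value for the field."""
    present = sum(1 for r in records if r.get(field) is not None)
    return present / len(records)

def currency_rate(records, field, cutoff):
    """Fraction of records whose date field is on or after the cutoff."""
    recent = sum(1 for r in records if r.get(field) and r[field] >= cutoff)
    return recent / len(records)

print(round(completeness(records, "age"), 2))                           # 0.67
print(round(currency_rate(records, "last_purchase", date(2011, 1, 1)), 2))  # 0.33
```

Profiles like these tell the modeler where the gaps are and whether the available history is recent enough to train on before any modeling effort is committed.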
Data architects. A big data analytics project needs someone to design the data architecture and guide its development. Typically, data architects will need to incorporate various data structures into an architecture, along with processes for capturing and storing data and making it available for analysis. This role involves traditional IT skills such as data modeling, data profiling, data quality, data integration and data governance.

Data integration developers. Data integration is important enough to require its own developers. These folks design and develop integration processes to handle the full spectrum of data structures in a big data environment. In doing so, they ideally ensure that integration is done not in silos but as part of a comprehensive data architecture.

It's also best to use packaged data integration tools that support multiple forms of data, including structured, unstructured and semi-structured sources. Avoid the temptation to develop custom code to handle extract, transform and load operations on pools of big data; hand coding can increase a project's overall costs -- not to mention the likelihood of an organization being saddled with undocumented data integration processes that can't scale up as data volumes continue to grow.

Technology architects. This role involves designing the underlying IT infrastructure that will support a big data analytics initiative. In addition to understanding the principles of designing traditional data warehousing and BI systems, technology architects need to have expertise in the new technologies that often are used in big data projects. And if an organization plans to implement open source tools, such as Hadoop, the architects need to understand how to configure, design, develop and deploy such systems.

It can be exciting to tackle a big data analytics project and to acquire new technologies to support the analytics process. But don't be lulled into thinking that tools alone spell big data success. The real key to success is ensuring
that your organization has the skills it needs to effectively manage large and varied data sets as well as the analytics that will be used to derive business value from the available data.

About the Author:
Rick Sherman is the founder of Athena IT Solutions, which provides consulting, training and vendor services on business intelligence, data integration and data warehousing. Sherman has written more than 100 articles and spoken at dozens of events and webinars; he also is an adjunct faculty member at Northeastern University's Graduate School of Engineering. He blogs at The Data Doghouse and can be reached at