Understanding the Elements of Big Data: More than a Hadoop Distribution

White Paper

Prepared by: Martin Hall, Founder, Karmasphere
May 2011
Table of Contents

Executive Summary
  The Elements of Big Data
  Big Data Challenges
  Big Data Ecosystem
  Who Should Read This White Paper
Situation Analysis and Industry Trends
  Who Employs Big Data Technology and Techniques?
  Big Data Macro Trends
  Hadoop Adoption
The Elements of Big Data
  Architecture
  Data Management
  Data Analysis
  Data Use
The Players
  Open Source Projects, Developers and Communities
  Big Data Developers, Analysts and Other End-Users
  Commercial Suppliers
Conclusion
  Choices
  Short Path to Big Data Insight
  Learn More
Glossary
Executive Summary

It is perhaps no coincidence that the Hadoop mascot is an elephant. Big Data can seem like the proverbial pachyderm as described by blindfolded observers. The definition of "Big Data" varies greatly depending upon which part of the "animal" you touch, and where your interests lie.

The "big" in Big Data refers to unprecedented quantities of information – terabytes, petabytes and more of new and legacy data generated by today's fast-moving businesses and technology. In many instances, data collected over the course of days or weeks exceeds the entire corpus of legacy data in a given domain – examples abound in retail, social media and financial services, and also in scientific disciplines like genetics, astronomy and climate science. The data deluge is even challenging the physical logistics of storage.

    "It is very sad that today there is so little truly useless information." – Oscar Wilde, 1894

A distinguishing feature of Big Data is a mixture of traditional structured data together with massive amounts of unstructured information. The data can come from legacy databases and data warehouses, from web server logs of ecommerce companies and other high-traffic web sites, and from M2M (Machine-to-Machine) data traffic and sensor nets.

This white paper outlines the structure of Big Data solutions based on Hadoop and explores the particulars of the elements that comprise it.

The Elements of Big Data

At the highest level, Big Data presents three top-level elements:

  • Data Management – data storage infrastructure, and resources to manipulate it
  • Data Analysis – technologies and tools to analyze the data and glean insight from it
  • Data Use – putting Big Data insights to work in Business Intelligence and end-user applications

Underlying and pervading these high-level categories are the data (legacy and new, structured and unstructured) and the IT infrastructure that supports managing and operating upon it.

Figure 1 – Key Elements of Big Data (Source: Karmasphere)
Big Data Challenges

Besides the obvious difficulty of storing and parsing terabytes and exabytes of mostly unstructured information, Big Data itself – the platforms and tools – presents developers and analysts with important challenges:

  • Despite fast-growing deployment, Hadoop and other Big Data technologies are still time-consuming to set up, deploy and use
  • Building and running Hadoop jobs and queries is non-trivial for developers and analysts. They need "deep" understanding of Hadoop particulars – cluster size and structure, job performance, etc.
  • Analyzing and iterating queries and results with Hadoop does not leverage existing skills and tools for Business Intelligence

Many companies and open source projects are being launched to ease entry into Big Data and to ensure higher success rates of Hadoop-based data mining. To understand the impact and value-add of these technologies and products, it is important to comprehend the audiences they address.

Big Data Ecosystem

In each element of Big Data (Figure 1), there are multiple participants with complex relationships among them. Under Data Management there are suppliers of Hadoop-based solutions and other MapReduce technology suppliers with both Cloud and data center solutions. There are offerings in Big Data Analytics that address specific development and analysis requirements, complementing one another and addressing multiple phases in the Big Data application life cycle. And while most Big Data applications reflect and support the operations of particular end-user companies and products, there are others that cross industry and corporate boundaries.

In assimilating the particulars of the ecosystem – the players, the layers and the niches within it – you should always remember that:

  • Not everyone who works with Hadoop is in competition
  • Not everyone in the ecosystem is a Hadoop distribution vendor
  • While building upon open source technologies like Hadoop, Hive and Java, the value in Big Data offerings encompasses an increasingly rich mix of services and commercial software that goes beyond that open source core

Who Should Read This White Paper

This white paper provides a pragmatic vision and realistic overview of the elements that comprise Big Data. Its intended audience comprises both business people new to Big Data and technologists looking for perspective on this emerging industry.

In particular, this white paper speaks to:

  • Data Scientists
  • Big Data application developers
  • Big Data Analysts and the IT staff that support them
Situation Analysis and Industry Trends

Big Data is not just defined by the sheer volume of information, but also by the trends in the growth of that data and by how the IT industry and its customers are meeting the Big Data challenge.

Who Employs Big Data Technology and Techniques?

Not every massive data store or data-intensive segment is ready to embrace Big Data. However, numerous industries and segments stand out as leading deployers of Big Data platforms and analytics (Figure 2).

Figure 2 – Industries Deploying Hadoop (Source: Karmasphere)

Big Data Macro Trends

Cross-industry and IT industry-wide trends show data creation and consumption overwhelming conventional (legacy) approaches to data management, necessitating new approaches:

  • Growth in all types of data collection is estimated at 60% CAGR, and the $100B information management industry is growing at 10% CAGR [1]
  • Information generation is outstripping growth in storage capacity by a factor of two, and the gap continues to grow [2] (however, old data never dies – retention of historical data is on the upswing as well)
  • Sources of Big Data stores are becoming more varied, e.g., sensor nets and mobile devices: globally, there are today 4.5B mobile phone subscribers
  • There are almost 2B regular Internet users globally, and total Internet data traffic will top 667 exabytes by 2013 [3]
  • Data marketplaces (the places you go to get the data you need [4]) are growing – third-party data availability is on the rise, with the estimated worldwide market valued at $100B [5]
  • Hadoop is increasingly the preferred Big Data management platform for applications and analytics

[1] IDC
[2] Ibid.
[3] Cisco
[4] Stratus Security
[5] BuzzData
Hadoop Adoption

While a number of technologies fall under the Big Data label, Hadoop is the Big Data mascot.

  • Hadoop adoption impetus is greatest when projects combine "Big Analytics" (fast, comprehensive analysis of complex data) and massive, unstructured data sets (Figure 3)
  • Hadoop forms the foundation of infrastructure at leading social media companies Facebook, LinkedIn and Twitter
  • Hadoop is the fastest-growing Big Data technology, with 26% of organizations using it today in data centers and in the Cloud, and an additional 45% seriously considering its deployment (Figure 4)
  • Hadoop downloads increased 300% from 2009 to 2010 [6]
  • Google searches for the term "Hadoop" outstrip all other related queries – in fact, "Hadoop" searches outnumber even those for "Big Data" by a factor of four [7]
  • Hadoop-related hiring (job descriptions) rose 7,074% between Q3 2009 and Q1 2011 [8]
  • Attendance at Hadoop and Big Data conferences such as Hadoop Summit, Hadoop World, Strata and Data Scientist Summit is selling out

Figure 3 – Data Set Attributes and Hadoop Adoption "Sweet Spot" (Source: Karmasphere and Booz Allen Hamilton)

Figure 4 – Hadoop Adoption Trends (Source: Karmasphere)

[6] DBTA Survey, Q1 2011
[7] 451 Group, 2010
[8] Hive, HBase, Pig, Hadoop Job Trends – SimplyHired.com
The Elements of Big Data

Architecture

Figure 5 details the key elements of Big Data and the relationships among them. The following sections of this white paper explore these elements. Later in the document, we'll also examine some of the companies and communities implementing the "fabric" of Big Data and contributing to it.

Figure 5 – Big Data Architecture (Source: Karmasphere). The figure depicts a "Big Analytics" layer (Data Analytics & Use: developer analytics environments, BI and visualization tools, developers, data and business analysts, end users) sitting atop a "Big Data" layer (Data Management & Storage: Hadoop MapReduce and HDFS, NoSQL, MPP, RDBMS and DW systems, supported by ETL / data integration and workflow / scheduler tools and run by system administrators).

Data Management

Data Management is the logical starting place in exploring Big Data. It is where the data "lives" and where analytics acts upon it.

Legacy Systems

For the last two decades, Data Management has built upon three related primary technologies:

  • Relational Database Management Systems – to store and manipulate structured data
  • MPP Systems – to crunch increasingly massive data sets and scale with data growth
  • Data Warehousing – to subset and host data for subsequent reporting
Limitations in Legacy Systems

While these technologies remain important within Big Data, their role is more circumscribed due to limitations in:

  • Scalability: as data sets on RDBMSs grow, performance slows, requiring major (not incremental) investments in compute capacity. These investments are today outstripping the budgets of organizations, especially as data grows exponentially.
  • Representative Data: with declining ability to process whole data sets, information in Data Warehouses is no longer statistically representative of the data from which it is derived. As such, business intelligence derived from it is less reliable.
  • Unstructured Data: RDBMS and Data Warehousing are definitively structured-data entities. However, data growth is focused on unstructured data by a factor of 20:1.

RDBMS, MPP and DW are not going away any time soon. Rather, they are taking on new roles in support of Big Data management, most importantly to process and host the output of paradigms such as MapReduce and to continue to provide input to BI software and to applications.

The Data

The "Data" in Big Data originates from a wide variety of sources and can be organized into two broad categories: structured and unstructured data.

Structured Data

Structured Data by definition already resides in formal data stores, typically in an RDBMS, a Data Warehouse or an MPP system, and accounts for approximately 5% of the total data deluge [9] (the rest is unstructured). It is often categorized as "legacy data" carried forward from before Big Data had currency, but can also be recently derived data stored in pre-Big Data paradigms (RDBMS, DW, MPP, etc.). The "structure" typically refers to formal data groupings into database records with named fields and/or row and column organization, with established associations among the data elements.

While most Big Data discussions see Structured Data as an input, Big Data Management derives Structured Data sets as an output as well (Operational Data).

Unstructured Data

Unstructured Data, by contrast, comprises data collected during other activities and stored in amorphous logs or other files in a file system. Unstructured data can include raw text or binary content and contain a rich mix of lexical information and/or numerical values, with or without delimitation, punctuation or metadata.

Figure 6 – Data Sources and Operational Data (Source: Karmasphere)

[9] The Economist
69.178.92.118 - - [15/Mar/2011:04:29:05 -0400] "GET /images/logos/fst-website-logo-01.png HTTP/1.0" 200 8507 "http://www.linuxpundit.com/about.php" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15 ( .NET CLR 3.5.30729) SearchToolbar/1.2"
67.195.112.226 - - [15/Mar/2011:04:34:27 -0400] "GET /robots.txt HTTP/1.1" 200 442 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
209.85.226.83 - - [15/Mar/2011:04:39:18 -0400] "GET / HTTP/1.0" 200 4037 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; feed-id=5960076085481870538)"
67.195.114.55 - - [15/Mar/2011:04:39:20 -0400] "GET /cv/docs/LUD67_CGL.pdf HTTP/1.0" 200 1247313 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
65.52.110.77 - - [15/Mar/2011:04:39:22 -0400] "GET /robots.txt HTTP/1.1" 200 442 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
209.85.226.80 - - [15/Mar/2011:05:03:38 -0400] "GET / HTTP/1.0" 200 4045 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; feed-id=5960076085481870538)"
189.61.149.200 - - [15/Mar/2011:05:04:09 -0400] "GET / HTTP/1.0" 200 12003 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT)"
65.52.110.36 - - [15/Mar/2011:05:05:27 -0400] "GET /robots.txt HTTP/1.0" 200 442 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
65.52.110.77 - - [15/Mar/2011:05:10:20 -0400] "GET /index.php HTTP/1.0" 200 11796 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
95.108.128.241 - - [15/Mar/2011:05:13:59 -0400] "GET /documents/white_paper_motorola_evoke_teardown.pdf HTTP/1.0" 200 1165232 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
65.52.110.36 - - [15/Mar/2011:05:26:51 -0400] "GET /articles.php HTTP/1.0" 200 37434 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.124.22.107 - - [15/Mar/2011:05:27:35 -0400] "GET /cv/docs/RTOS_transition.pdf HTTP/1.0" 200 529567 "http://www.google.co.in/url?sa=t&source=web&cd=43&ved=0CCAQFjACOCg&url=http%3A%2F%2Fembeddedpundit.com%2Fcv%2Fdocs%2FRTOS_transition.pdf&rct=j&q=RTOS%20IN%20PDF&ei=2jB_TfKfBca4rAf01YSnBw&usg=AFQjCNHoceJZbCvRaDM2_1EYHqQ-YX-whg" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB6.6; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)"
203.124.22.107 - - [15/Mar/2011:05:27:37 -0400] "GET /cv/docs/RTOS_transition.pdf HTTP/1.0" 206 12641 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB6.6; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)"

Figure 7 – Typical Unstructured Data – Web Server Log Files
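Imposing structure on records like these is typically the first step in any analysis. As a minimal Java sketch (the class name LogLineParser and the chosen field names are ours, purely for illustration), a single regular expression can split an Apache combined-format line such as those in Figure 7 into named fields:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LogLineParser {
        // Apache "combined" log format:
        // host ident user [time] "request" status bytes "referrer" "user-agent"
        private static final Pattern COMBINED = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

        public static void main(String[] args) {
            // One of the lines from Figure 7
            String line = "65.52.110.77 - - [15/Mar/2011:04:39:22 -0400] "
                + "\"GET /robots.txt HTTP/1.1\" 200 442 \"-\" "
                + "\"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)\"";
            Matcher m = COMBINED.matcher(line);
            if (m.find()) {
                System.out.println("client address : " + m.group(1));
                System.out.println("timestamp      : " + m.group(4));
                System.out.println("request        : " + m.group(5));
                System.out.println("status code    : " + m.group(6));
                System.out.println("response bytes : " + m.group(7));
            }
        }
    }

At Big Data scale, the same parsing logic would sit inside a MapReduce mapper (see Hadoop Programming below) rather than in a standalone program.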
Data Sources and Size

To comprehend the extent and challenges of handling Big Data, it is imperative to understand where the data comes from, its scope and scale.

Unstructured Data:
  • Web server and search engine logs ("data exhaust")
  • Logs from other types of servers (e.g., telecom switches and gateways)
  • Social Media / Gaming messages
  • Multimedia – voice, video, images
  • Sensor data / M2M communications

Structured Data:
  • Customer Databases
  • Legacy BI / CRM / ERP systems
  • Inventory and Supply Chain
  • E-Commerce / Web Commerce records

Figure 8 – Sources for Structured and Unstructured Data

The "Big" in Big Data is to some degree in the eye of the beholder, but generally refers to data sets in the range of terabytes and beyond, composed of unstructured and structured data. These data sets can emanate from massive short-term activity (e.g., traffic on popular web sites or real-time telemetry from thousands of sensors) or from more modest collection of data over longer time periods (e.g., decade-scale climate data or long-term health studies).

Hadoop and MapReduce

Apache Hadoop and other MapReduce implementations constitute the core of modern Data Management. Hadoop and its underlying distributed file system (HDFS) offer numerous advantages over legacy Data Management, in particular:
  • Scalability – Hadoop and HDFS have been proven to scale up to 2,000 working nodes in a data management cluster, and beyond
  • Reliability – HDFS is architected to be fault-resilient and self-repairing, with minimal or no operator intervention for node failover
  • Data Centric – Big Data is almost always larger in size and scope than the software that processes it. Hadoop's architecture recognizes this fact and distributes Hadoop jobs to where the data resides instead of vice versa.
  • Cost – because Hadoop clusters are built from freely distributable open source software running on standard PC-type compute blades, Hadoop is extremely cost effective and scales with linear incremental investment
  • Innovation – Hadoop is an active open source project with a dynamic developer and user community. Like its parent Apache project and like Linux, Hadoop benefits from this worldwide network, rapidly advancing in capability and code quality, several steps ahead of competing Data Management paradigms and many times faster than legacy solutions.

Hadoop is an open source project under the umbrella of the Apache Foundation. Later in this document we'll review commercial suppliers of Hadoop distributions and other MapReduce implementations.

Operational Data

Within Hadoop clusters (or adjacent to them), there exist multiple options for storing and manipulating structured data created from the execution of Hadoop jobs. This structured data can represent Big Data outcomes or intermediate stages of complex multi-stage jobs and queries.

As with most other technologies that interoperate with Hadoop, Hadoop itself is fairly agnostic to the choice of non-relational databases (hence the term "NoSQL") and scalable document stores. These database technologies include:

  • HBase – the standard Hadoop database, an open source, distributed, versioned, column-oriented store providing Bigtable-like capabilities on top of Hadoop. HBase includes base classes for backing Hadoop MapReduce jobs; query predicate push-down; optimizations for real-time queries; a Thrift gateway and a RESTful web service to support XML, Protobuf and binary data encoding; an extensible JRuby-based (JIRB) shell; and support for the Hadoop metrics subsystem. Like Hadoop, HBase is an Apache project, hosted at http://hbase.apache.org/
  • Cassandra – Apache Cassandra is a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model. The Cassandra project lives at http://cassandra.apache.org. A good example of using Cassandra together with Hadoop is the DataStax Brisk platform – learn more at http://www.datastax.com
  • Mongo – an open source, scalable, high-performance, schema-free, document-oriented database written in C++. The MongoDB project is hosted at http://www.mongodb.org/. To use Mongo and Hadoop together, check out https://github.com/mongodb/mongo-hadoop
  • CouchDB – Apache CouchDB is a document-oriented database supporting queries and indexing in a MapReduce fashion using JavaScript. CouchDB provides APIs that can be accessed via HTTP requests to support web applications. Learn more at http://couchdb.apache.org
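To make the Operational Data idea concrete, here is a minimal, hypothetical sketch of storing and retrieving one aggregated value in the first of these stores, HBase, using the classic HTable client API of the HBase 0.90 era. The table name page_metrics, the column family stats and the row key are assumptions for the example, and the table is assumed to already exist (created, for instance, from the HBase shell).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class OperationalDataExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml from the classpath
            HTable table = new HTable(conf, "page_metrics");     // hypothetical table, created beforehand

            // Store one aggregated value produced by a Hadoop job:
            // row key = page URL, column family "stats", qualifier = a daily hit count
            Put put = new Put(Bytes.toBytes("/robots.txt"));
            put.add(Bytes.toBytes("stats"), Bytes.toBytes("hits_2011_03_15"), Bytes.toBytes("3"));
            table.put(put);

            // Read the value back for downstream reporting
            Result result = table.get(new Get(Bytes.toBytes("/robots.txt")));
            byte[] hits = result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("hits_2011_03_15"));
            System.out.println("hits = " + Bytes.toString(hits));

            table.close();
        }
    }

In practice, MapReduce jobs usually write such results directly from their reducers; the point of a NoSQL store here is simply to give downstream analytics and applications fast, keyed access to Hadoop output.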
Data Management Infrastructure

The most salient characteristics of Big Data deal with "What" and "How," but "Where" can be equally important. While Big Data is mostly "agnostic" or orthogonal to infrastructure, the underlying platforms present implications for cost, scalability and performance.
Physical and Virtual

MapReduce, Hadoop and other Big Data technologies originally evolved as internal projects at companies like Google and Yahoo that needed to scale massively with low incremental cost. They were designed to take advantage of "standard" hardware – primarily Intel Architecture blades – running FOSS Linux and open application platforms like Java, in local, and later remote, data centers.

Rapidly maturing, Big Data infrastructure proved a perfect candidate for public and private Cloud hosting, and so Big Data users frequently leverage PaaS (Platform as a Service) instead of their own physical data centers. Leading this trend is Amazon, whose Web Services and Elastic MapReduce (EMR) greatly simplify companies' first forays into Big Data and also provide tremendous scalability throughout the lifetime of Big Data projects.

Hosting Trends

The Hadoop project website states that "GNU/Linux is supported as a development and production platform," and indeed most Hadoop installations, physical and virtual, build on Linux infrastructure. While most code in Hadoop and related projects can migrate to other UNIX-type platforms (Solaris, etc.), Microsoft Windows hosting is more challenging. Hadoop core code depends primarily on Java, but traditionally needs support from UNIX shells, SSH and other utilities. As such, Windows hosting, predicated upon the availability and stability of the Cygwin emulation environment, is not supported as a production environment.

Big Data developers and analysts, however, make extensive use of other development hosts. Data collected by Karmasphere for its Karmasphere Studio Community Edition and professional products shows a developer host distribution of 45% Windows, 34% Linux and 22% MacOS.

Data Analysis

Analysis is where companies begin to extract value from Big Data. Distinct from Business Intelligence (see Data Use below), Big Data analysis involves development of applications and using those apps to gain insight into massive data sets.

Development

Big Data developers resemble other enterprise IT software engineers in many respects; in particular, they:

  • Use the same programming languages, starting with Java, augmented with higher-level languages like Pig Latin and Hive
  • Develop in the same environments, especially the Eclipse and NetBeans IDEs
  • Build applications that manipulate data stores, in some cases using SQL

However, today's Big Data developers diverge from traditional enterprise IT programmers in key aspects of their trade:

  • Their audience is more specialized – not average enterprise end users, but data analysts
  • The software they create must manipulate data sets orders of magnitude larger, increasingly with seemingly exotic programming constructs like MapReduce
  • They rely on batched execution, with unique and complex job execution sequences (most resembling High Performance Computing)

Hadoop Programming

While a tutorial on Hadoop is beyond the scope of this white paper, it is useful to understand the core programming tasks faced by Big Data developers. To gain insight from Big Data with Hadoop, developers must bootstrap Hadoop clusters, set up input sources to distribute data across the Hadoop file system, create code to implement the elements of MapReduce (mappers, partitioners, comparators, combiners, reducers, etc.), successfully build, deploy and run those jobs, and dispose of output to intermediate data stores or structured data storage (RDBMSs, etc.) for subsequent analysis.
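To make those tasks concrete, the following sketch is a complete, if deliberately simple, MapReduce job written in Java against the Hadoop 0.20-era org.apache.hadoop.mapreduce API: a word count over text files already loaded into HDFS. The class name WordCount and the input/output paths are illustrative only; a real Big Data job would substitute domain-specific parsing, combining and reducing logic.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in every input line
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reducer (also used as combiner): sums the counts for each word
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g., /input in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // e.g., /output in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, such a job is typically launched with the hadoop jar command (for example, hadoop jar wordcount.jar WordCount /input /output), and the framework distributes map and reduce tasks to the nodes where the data blocks reside.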
    • White Paper Understanding the Elements of Big Data: More than a Hadoop Distributionimplement the elements of MapReduce (mappers, partitioners, comparators, combiners, reducers, etc.),successfully build, deploy and run those jobs, and dispose of output to intermediate data stores or structured datastorage (RDBMs, etc.) for subsequent analysis.Big Data developers, especially ones new to the Hadoop framework, need to focus their energies on optimizingMapReduce, not on dealing with the intricacies of Hadoop implementation. Karmasphere and other companiesoffer a range of products to simplify the Hadoop development process. In particular, Karmasphere Studioprovides a graphical environment to develop, debug, deploy and monitor MapReduce jobs, cutting time, cost andeffort to get results from Hadoop.Learn more about Karmasphere Studio at http://karmasphere.com/Products-Information/karmasphere-studio.html.AnalyticsBuilding and executing jobs for Hadoop is only half the challenge of Big Data analysis.The outcome of Hadoop job execution, while greatly condensed and more structured, does not automatically yieldinsight to guide business decisions. Ideally, Big Data Analysts should be able to use familiar tools to characterize,query and visualize data sets coming out of Hadoop.Karmasphere and other suppliers offer Big Data analysts software platforms and tools to simplify and streamlineinteraction with Hadoop clusters, extract data sets and glean insight from that data. In particular, KarmasphereAnalyst provides Big Data analysts with quick, efficient SQL access and insight to Big Data on any Hadoop clusterfrom within a graphical desktop environment. Working with structured and unstructured data, automatically dis-covering its schema, it lets analysts, SQL programmers, developers and DBAs develop and debug SQL with anyHadoop cluster.Learn more about Karmasphere Analyst athttp://karmasphere.com/Products-Information/karmasphere-analyst.html.Data UseIf Big Analytics is about mining Big Data for insights, Data Use (consumption) is about acting upon thosediscoveries. Data Use falls into two rough categories:Business Intelligence and Visualization – feeding into traditional BI suites and into OLAP, the output of BigData provides business analysts with comprehensive data sets, not just statistically-selected sub-sets that fit intolegacy databases and schemas. By improving the scope and quality of data, Big Data greatly enhances the reli-ability of conclusions drawn from it and improves BI outcomes.Big Data Applications – using Big Data outcomes to drive applications in web commerce, social gaming, datavisualization, search, etc. Businesses in these and other areas are drawing upon Big Data not just for high-levelbusiness insights, but to provide concrete input to user-facing applications.The PlayersFor each Big Data element (Figure 9), there are multiple participants, with complex relationships among them.Under Data Management there are suppliers of Hadoop distributions as well as MapReduce technologysuppliers with both Cloud and data center solutions. There are offerings in Big Analytics that fulfill specificdevelopment and analysis requirements, complementing one another and addressing multiple phases in the BigData application life cycle. 
Data Use

If Big Analytics is about mining Big Data for insights, Data Use (consumption) is about acting upon those discoveries. Data Use falls into two rough categories:

Business Intelligence and Visualization – feeding into traditional BI suites and OLAP, the output of Big Data provides business analysts with comprehensive data sets, not just the statistically selected subsets that fit into legacy databases and schemas. By improving the scope and quality of data, Big Data greatly enhances the reliability of conclusions drawn from it and improves BI outcomes.

Big Data Applications – using Big Data outcomes to drive applications in web commerce, social gaming, data visualization, search, etc. Businesses in these and other areas are drawing upon Big Data not just for high-level business insights, but to provide concrete input to user-facing applications.

The Players

For each Big Data element (Figure 9), there are multiple participants, with complex relationships among them. Under Data Management there are suppliers of Hadoop distributions as well as MapReduce technology suppliers with both Cloud and data center solutions. There are offerings in Big Analytics that fulfill specific development and analysis requirements, complementing one another and addressing multiple phases in the Big Data application life cycle. And while most Big Data applications reflect and support the operations of particular end-user companies and products, there are others that cross industry and corporate boundaries.

In comprehending the elements of Big Data – the players, the layers and the niches within it – you should always remember that:
  • Not everyone who works with Hadoop is in competition
  • Not everyone in Big Data is a Hadoop distribution vendor
  • While building upon open source technologies like Hadoop, Hive and Java, the value in Big Data offerings encompasses an increasingly rich mix of services and commercial software that goes beyond that open source core

Open Source Projects, Developers and Communities

Unlike dominant legacy data technologies (proprietary RDBMS, etc.), Big Data has strong ties to Free and Open Source Software (FOSS) and to the community development model. Indeed, the technologies at the center of Big Data are primarily FOSS, many of them under the Apache project umbrella:

  • Hadoop – the data management platform at the core of Big Data. Key corporate contributors include Cloudera, Facebook, LinkedIn and Yahoo.
  • HDFS – the Apache Hadoop distributed file system
  • Hive – open source data warehouse and query infrastructure built on top of Hadoop
  • Java – the language of Hadoop and of Hadoop job programming, originally developed and maintained by Sun Microsystems (now Oracle)
  • Linux – for hosting Hadoop clusters and also as a development host [10], with perhaps the largest global developer community of any FOSS project. Red Hat and Canonical [11] in particular are investing in supporting Big Data
  • Eclipse and NetBeans – common Big Data application and analytics development environments (IDEs)
  • NoSQL – multiple implementations used in Big Data infrastructure, including projects such as Apache HBase, CouchDB, Cassandra and MongoDB

These projects boast communities of hundreds and in some cases thousands of developers and user/contributors, along with a smaller cadre of core maintainers/committers who guide project evolution and vet patches. A large swathe of FOSS Big Data project developers participate under the corporate banner of their employers, while others toil away out of personal or academic interest.

Big Data Developers, Analysts and Other End-Users

Given the open source nature of much Big Data technology, the term "developer" ends up being overloaded. "Big Data developers" in common parlance are not the programmers involved in building the software described in the previous section, but rather those engaged in building software for and on it, to create:

  • Hadoop / MapReduce jobs
  • Analytic queries and reporting software
  • Web, mobile and enterprise applications realizing value from and presenting Big Data outcomes

Examples of companies performing Big Data development and realizing value from it include:

  • TidalTV – video advertising, optimization, and yield management – http://www.tidaltv.com
  • XGraph – connect audience marketing – http://www.xgraph.com

[10] About 1/3 of Big Data development occurs on Linux workstations
[11] Cloudera and Karmasphere both host their software on Ubuntu Linux
Commercial Suppliers

While many of the underlying Big Data technologies are developed as open source software, the elements of Big Data include a rich mix of commercial software and services suppliers.

Figure 9 – Big Data Players (Source: Karmasphere)

ISVs

Companies deploying Hadoop increasingly turn to commercial Independent Software Vendors (ISVs) both for fully supported base platforms and for value-added capabilities beyond those included in community-based Big Data software.

Hadoop Distribution Suppliers / Integrators – aggregate, integrate and productize Hadoop, Hive and other elements for easy installation and use on Linux-based clusters and other host systems. These companies add value to Hadoop with "one-stop shopping" and through the addition of complementary software and services. The leading Hadoop distribution suppliers are Cloudera (http://www.cloudera.com), DataStax (http://www.datastax.com) and IBM (http://www.ibm.com/software/data/infosphere/hadoop/). Entering the space in May 2011 is also EMC (http://www.emc.com/about/news/press/2011/20110509-03.htm). Stay tuned for announcements from more players.

Big Data Analytics Solutions and Tools Providers – these companies offer products that streamline development of Hadoop applications and simplify interaction with, and visualization of, Hadoop outcomes by supporting familiar SQL and spreadsheet-style interfaces to Hadoop. Analytics suppliers include Karmasphere (http://www.karmasphere.com), as well as a few companies offering analytics for non-Hadoop database solutions (MPP, RDBMS, etc.).

Big Data Hosting Companies and Service Providers ("Hadoop on Demand") – many first-generation deployers of Hadoop invested in their own local infrastructure (clusters of standard hardware running Hadoop over Linux). Increasingly, companies are also looking outside their own data centers for Big Data hosting, turning to remote data centers (colocation), platform-level Cloud hosting and "Big Data as a Service." The best example of this last paradigm is Amazon Web Services' Elastic MapReduce – learn more at http://aws.amazon.com/elasticmapreduce/.
Conclusion

To outside observers and first-time visitors, the elements of Big Data, the players that implement and supply them, and the transactions among those players display their own peculiar logic. Depending on one's point of introduction, it is easy to miss the forest while focusing on interesting and feature-rich trees. For business people and technologists already engaged and involved in Big Data, a practical "heads down" approach can often limit point of view and obscure commercial and technical opportunities.

Choices

For Big Data users, success comes down to two key choices:

  • Infrastructure – where and how to host a project, which technologies to deploy, and how to scale
  • Value Extraction – paths to insight and methods for analysis and consumption

Comparably, Big Data suppliers need to ask themselves:

  • What do Big Data developers and analysts really need?
  • How can they add value to Hadoop and other open source Big Data projects?
  • How can they accommodate both new requirements emerging from Hadoop and other open source projects, and expectations for familiar/legacy capabilities?

Today's data collection trends stagger the imagination. Companies and research projects now routinely collect terabytes of data, gathering more volume in days or weeks than did all of human civilization over thousands of years. So that this data storm neither overwhelms its potential users nor goes unexamined in virtual warehouses, data mining and exploration are undergoing a complete reinvention. This Big Data shift is changing not just data processing methods but the entire data management paradigm, moving to build on Hadoop and other Big Data tools and platforms.

The sheer volume of data involved and the tantalizing possibility of game-changing insight buried within it give businesses a sense of great urgency. The pressure for insight and innovation leads many to jump into Big Data before they understand the scope of the challenges facing them. As a result, many organizations expend inordinate time and resources – hundreds of man-months and tens of thousands of dollars – on setting up and tweaking Hadoop and other infrastructure before arriving at any actionable insight.

Short Path to Big Data Insight

The main purpose of this white paper has been to educate readers about the elements of Big Data in general and, in particular, to suggest shorter paths across the Big Data landscape. At Karmasphere, our mission is to bring the power of Apache Hadoop to developers and analysts and to enable companies to unlock competitive advantage from their datasets with easy-to-use solutions.

A great way to get started is by downloading Karmasphere Studio Community Edition and by attending webinars and tutorials sponsored by Karmasphere and its Big Data partners.

Learn More

Visit http://www.karmasphere.com to learn more.
Glossary

Amazon Web Services (AWS) – a set of Cloud-based services hosted by Amazon that together form a reliable, scalable and inexpensive computing platform. More at http://aws.amazon.com/

Data Warehouse (DW) – a structured database used for reporting, offloaded from other operational systems. A DW comprises three layers: staging (used to store raw data for use by developers), integration (integrates data and provides abstraction) and access (actually supplying data to system users).

Elastic MapReduce (EMR) – Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts and developers to manage vast amounts of data. EMR utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). More at http://aws.amazon.com/elasticmapreduce/

Hadoop – a software framework that supports data-intensive distributed applications. Hadoop enables applications to work with thousands of nodes and petabytes of data. Hadoop is built in, and uses, the Java programming language and is maintained as a top-level Apache.org project built and used by a global community of contributors. More at http://hadoop.apache.org/

Hadoop Distributed File System (HDFS) – the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, rapid computations. More at http://hadoop.apache.org/hdfs/

Hive – a data warehouse infrastructure built on top of Hadoop. Hive provides tools to enable easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop. Hive defines a simple SQL-like query language (Hive QL) to let users familiar with SQL make Hadoop queries. Hive QL also allows programmers familiar with MapReduce to plug in custom mappers and reducers to perform more sophisticated analysis, extending the language. More at http://hive.apache.org/

MapReduce – a software framework for distributed computing on large data sets on clusters of computers. The framework is inspired by the "map" and "reduce" functions of functional programming, although their purpose in the MapReduce framework is not the same as in their original forms.

Massively Parallel Processing (MPP) – a distributed-memory computer system with many individual nodes, each an independent computer in itself. In the context of Big Data, MPP connotes a database processing system with hundreds or thousands of nodes, but with centralized storage.

NoSQL – a class of non-relational database management systems with distributed data stores that eschew use of Structured Query Language (SQL)

On-line Analytical Processing (OLAP) – an approach to answering multi-dimensional analytical database queries. Databases configured for OLAP use multidimensional data models, and borrow aspects of navigational databases and hierarchical databases. OLAP query output is typically displayed in a matrix format. OLAP is part of the broader category of business intelligence, which encompasses data mining.

Pig – a platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs coupled with infrastructure for evaluating those programs. Pig's program structure is amenable to parallelization, enabling Pig to handle very large data sets.

Structured Query Language (SQL) – a standard database computer language designed for managing data in relational database management systems (RDBMS)