This document contains a presentation about using open-source software and commodity hardware to process big data cost-effectively. It discusses how Apache Hadoop can be used to collect, store, process, and analyze large amounts of data without expensive proprietary software or hardware. The presentation gives examples of how companies are using Hadoop and walks through three approaches for working with data in Hadoop: refining, exploring, and enriching.
Bigger Data For Your Budget
1. Dave Porter – SproutCore Architect, Appnovation
davep@appnovation.com
Bigger Data For Your Budget
CANADIAN HEADQUARTERS
152 West Hastings Street
Vancouver BC, V6B 1G8
UNITED STATES OFFICE
3414 Peachtree Road, #1600
Atlanta Georgia, 30326-1164
UNITED KINGDOM OFFICE
3000 Hillswood Drive
Hillswood Business Park
Chertsey KT16 0RS, UK
www.appnovation.com
info@appnovation.com
How to turn your Big Data into Big Insights
without breaking the bank
2. Speakers
John Kreisa – VP Marketing, Hortonworks
Dave Porter – SproutCore Architect, Appnovation Technologies
4. LOCATIONS
VANCOUVER OFFICE
152 West Hastings Street
Vancouver BC, V6B 1G8
ATLANTA OFFICE
3414 Peachtree Road, #1600
Atlanta Georgia, 30326-1164
LONDON OFFICE
3000 Hillswood Drive
Hillswood Business Park
Chertsey KT16 0RS, UK
39. Thank You For Your Participation!
CANADIAN HEADQUARTERS
152 West Hastings Street
Vancouver BC, V6B 1G8
UNITED STATES OFFICE
3414 Peachtree Road, #1600
Atlanta Georgia, 30326-1164
UNITED KINGDOM OFFICE
3000 Hillswood Drive
Hillswood Business Park
Chertsey KT16 0RS, UK
www.appnovation.com
info@appnovation.com
Editor's Notes
Big Data is made up of traditional structured data in databases, but increasingly it’s also coming in from unstructured sources – server logs, sensor logs, raw transaction logs – and if you’re analyzing Twitter for market sentiment or searching the web for signs of terrorist plots, you’re digging through reams of human-quality input.
Where’s it coming from? As computers and networks speed up, their ability to capture and store more of what’s happening in the real world has gone up, and it’s kicked off a feedback loop. As high-speed trading has taken over the finance industry, the volume of transactions has skyrocketed. More scientific data points were generated in the last five years than in the previous 100,000 years of human existence, and that’s likely to be true again in five years. And it’s not just MIT and Wall Street. We’re increasingly living our lives through machines that can capture and aggregate more of our actions than ever before.
Data, meaning structured or unstructured information collected and stored in computing systems, is increasing exponentially.
Big Data is literally promising to cure cancer and fight off drug-resistant tuberculosis. It found the Higgs boson, and it’s going to find life on other planets. And of course it promises to let you see deeper into your business: insights into real-world problems that we never had the data to collect, or the tools to analyze, before. Understanding everything the way Amazon understands your taste in movies. Google can track the flu better than the CDC. Big Data is promising to be a kind of Magical Insight Portal.
Image courtesy of http://www.greenbookblog.org. Of course, magic doesn’t pay the bills, so the question is what can big data do for your business? I’d like to start with a very simple example:
Let’s say you’re a regional retail giant with an inventory system that tracks all of the transactions, then batch-processes them for your chief inventory manager overnight. Let’s say a radio DJ in Framingham plugs Widget A, and suddenly your Framingham location is sold out by 11 AM. Your inventory guy won’t find out about the unexpected spike until the next morning, and it’s probably day two before a truck can arrive, by which time the DJ is talking about something else. And that’s sort of okay, right? Waking up to discover your sales were through the roof yesterday is a sort of nice, 1990’s-style victory.
But instead of overnight, let’s restructure our processing with Big Data techniques to run on an hourly cycle. The system can tell by 10 AM that Framingham is selling through widgets faster than normal, and it knows they’re out by noon. Before noon, the inventory guy gets an alert on his HTML5 dashboard and an email on his phone, and he’s got a truck en route from the warehouse in time to restock the shelves the next morning. He’s cut his response time from 24 hours down to 1, and he’s restocked the shelves in hours instead of days. Most importantly, you doubled your sales of Widget A.
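To make that concrete, here is a minimal sketch of the kind of hourly aggregation job involved, written in Python for Hadoop Streaming. The file name (hourly_sales.py), the tab-separated transaction layout (store_id, sku, qty, timestamp), and the key format are illustrative assumptions, not details from the presentation.

    #!/usr/bin/env python
    # hourly_sales.py (hypothetical name): Hadoop Streaming mapper/reducer that
    # counts units sold per store, SKU and hour from tab-separated transaction logs.
    import sys

    def mapper():
        # Input lines: store_id <TAB> sku <TAB> qty <TAB> ISO-8601 timestamp
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 4:
                continue  # skip malformed records
            store_id, sku, qty, ts = fields
            hour = ts[:13]  # e.g. "2013-06-12T10"
            print("%s|%s|%s\t%s" % (store_id, sku, hour, qty))

    def reducer():
        # Streaming sorts mapper output by key, so equal keys arrive as a run.
        current, total = None, 0
        for line in sys.stdin:
            key, qty = line.rstrip("\n").split("\t")
            if key != current and current is not None:
                print("%s\t%d" % (current, total))
                total = 0
            current = key
            total += int(qty)
        if current is not None:
            print("%s\t%d" % (current, total))

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()

The same pair can be dry-run locally (cat transactions.tsv | ./hourly_sales.py map | sort | ./hourly_sales.py reduce) before it ever touches a cluster; the dashboard alert is then just a comparison of the hourly totals against a baseline.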
The big data challenge is twofold: Collecting and storing the data, and then chewing through it to produce the valuable insights.
Existing solutions work great, but they’re costly: custom “enterprise”-grade hardware (which is code for expensive) running expensive licensed software. The regional retail giant can’t scale that way on its budget.
Here’s the promise we’re delivering today. You can have the same insights into your accumulating data at a fraction of the price.
Scaling for the same budget requires a paradigm shift. Enter Hadoop. Hadoop is free & open-source software running on commodity hardware like you pick up at Best Buy (slight exaggeration). On a commodity hardware budget, the retail inventory system is able to run hourly and will allow dramatically faster reaction to inventory events.
Not just retail, and not just speeding processes up. Review a couple of other use cases.
Still planning on having a better analogy for Wednesday. This one is really growing on me though.
I can’t really talk about Hortonworks without first taking a moment to talk about the history of Hadoop. What we now know of as Hadoop really started back in 2005, when Eric Baldeschwieler – known as “E14” – started work on a project to build a large-scale data storage and processing technology that would allow Yahoo to store and process massive amounts of data to underpin its most critical application, Search. The initial focus was on building out the technology – the key components being HDFS and MapReduce – that would become the Core of what we think of as Hadoop today, and continuing to innovate it to meet the needs of this specific application.

By 2008, Hadoop usage had greatly expanded inside of Yahoo, to the point that many applications were now using this data management platform, and as a result the team’s focus extended to include Operations: now that applications were beginning to propagate around the organization, sophisticated capabilities for operating it at scale were necessary. It was also at this time that usage began to expand well beyond Yahoo, with many notable organizations (including Facebook and others) adopting Hadoop as the basis of their large-scale data processing and storage applications, necessitating a focus on operations to support what was by now a large variety of critical business applications.

In 2011, recognizing that more mainstream adoption of Hadoop was beginning to take off, and with an objective of facilitating it, the core team left – with the blessing of Yahoo – to form Hortonworks. The goal of the group was to facilitate broader adoption by addressing the Enterprise capabilities that would enable a larger number of organizations to adopt and expand their usage of Hadoop.

[note: if useful as a talk track, Cloudera was formed in 2008, well BEFORE the operational expertise of running Hadoop at scale was established inside of Yahoo]
At Hortonworks today, our focus is very clear: we Develop, Distribute and Support a 100% open source distribution of Enterprise Apache Hadoop.

We employ the core architects, builders and operators of Apache Hadoop and drive the innovation in the open source community. We distribute the only 100% open source Enterprise Hadoop distribution: the Hortonworks Data Platform. Given our operational expertise of running some of the largest Hadoop infrastructure in the world at Yahoo, our team is uniquely positioned to support you.

Our approach is also uniquely endorsed by some of the biggest vendors in the IT market. Yahoo is both an investor and a customer, and most importantly, a development partner. We partner to develop Hadoop, and no distribution of HDP is released without first being tested on Yahoo’s infrastructure, using the same regression suite they have used for years as they grew to have the largest production cluster in the world. Microsoft has partnered with Hortonworks to include HDP in both their off-premise offering on Azure and their on-premise offering under the product name HDInsight; this also includes integration with Visual Studio for application development and with System Center for operational management of the infrastructure. Teradata includes HDP in their products in order to provide the broadest possible range of options for their customers.
So how does this get brought together into our distribution? It is really pretty straightforward, but also very unique. We start with this group of open source projects that I described and that we are continually driving in the OSS community. [CLICK] We then package the appropriate versions of those open source projects, integrate and test them using a full suite, including all the IP for regression testing contributed by Yahoo, and [CLICK] contribute all of the bug fixes back to the open source tree. From there, we package and certify a distribution in the form of the Hortonworks Data Platform (HDP) that includes both Hadoop Core and the related projects required by the Enterprise user, and provide it to our customers. Through this application of Enterprise software development process to the open source projects, the result is a 100% open source distribution that has been packaged, tested and certified by Hortonworks. It is also 100% in sync with the open source trees.
At its core, Hadoop is about HDFS and MapReduce: two projects, for distributed storage and distributed data processing, that are the underpinnings of the platform. In addition to Core Hadoop, we must identify and include the requisite “Platform Services” that are central to any piece of enterprise software. These include High Availability, Disaster Recovery, Security, etc., which enable use of the technology for a much broader (and mission-critical) problem set. This is accomplished not by introducing new open source projects, but rather by ensuring that these aspects are addressed within existing projects.
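As a rough illustration of how those two pieces divide the work, the sketch below stages data into HDFS and then submits the hypothetical hourly job from earlier through the Hadoop Streaming jar. The paths, dates, and jar location are assumptions; they vary by distribution and cluster layout.

    # Sketch: HDFS handles distributed storage, MapReduce (here via Hadoop Streaming)
    # handles distributed processing. All paths below are placeholders.
    import subprocess

    # 1. Land the raw transaction logs in HDFS.
    subprocess.check_call(["hadoop", "fs", "-mkdir", "-p", "/data/retail/raw/2013-06-12"])
    subprocess.check_call(["hadoop", "fs", "-put", "transactions.tsv",
                           "/data/retail/raw/2013-06-12/"])

    # 2. Run the hourly aggregation as a MapReduce job over that directory.
    subprocess.check_call([
        "hadoop", "jar", "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar",  # location varies
        "-files", "hourly_sales.py",
        "-mapper", "hourly_sales.py map",
        "-reducer", "hourly_sales.py reduce",
        "-input", "/data/retail/raw/2013-06-12",
        "-output", "/data/retail/hourly/2013-06-12",
    ])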
Beyond Core and Platform Services, we must add a set of Data Services that enable the full data lifecycle: capabilities to store, process and access data. For example: how do we maintain the consistent metadata needed to determine how best to query data stored in HDFS? The answer is a project called Apache HCatalog. Or how do we access data stored in Hadoop from SQL-oriented tools? With projects such as Hive, which is the de facto standard for accessing data stored in HDFS. All of these are broadly captured under the category of “data services”.
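For instance, once table definitions are registered through HCatalog / the Hive metastore, an analyst can reach the same HDFS data with plain SQL. The sketch below uses the PyHive client; the host, port, credentials, and the retail.transactions table are placeholders, not anything from the presentation.

    # Hedged sketch: querying Hadoop-resident data through Hive's SQL interface.
    from pyhive import hive  # assumes HiveServer2 is running and PyHive is installed

    conn = hive.Connection(host="hive-gateway.example.com", port=10000, username="analyst")
    cursor = conn.cursor()
    cursor.execute("""
        SELECT store_id, sale_hour, SUM(qty) AS units_sold
        FROM retail.transactions
        WHERE sale_date = '2013-06-12'
        GROUP BY store_id, sale_hour
    """)
    for store_id, sale_hour, units_sold in cursor.fetchall():
        print(store_id, sale_hour, units_sold)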
Any data management platform that is operated at any reasonable scale requires a management technology – for example SQL Server Management Studio for SQL Server, or Oracle Enterprise Manager for Oracle DB. Hadoop is no exception, and for Hadoop that means Apache Ambari, which is increasingly being recognized as foundational to the operation of Hadoop infrastructures. It allows users to provision, manage and monitor a cluster and provides a set of tools to visualize and diagnose operational issues. There are other projects in this category (such as Oozie), but Ambari is really the most influential.
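A concrete example of what that management layer enables: Ambari exposes a REST API, so routine health checks can be scripted. The host, credentials, and cluster name below are placeholders; check the endpoints against the Ambari version you actually deploy.

    # Hedged sketch: listing service states for a cluster through Ambari's REST API.
    import requests

    AMBARI = "http://ambari.example.com:8080/api/v1"
    AUTH = ("admin", "admin")  # placeholder credentials
    HEADERS = {"X-Requested-By": "ops-check"}  # Ambari requires this on writes; harmless on reads

    resp = requests.get(AMBARI + "/clusters/retail_cluster/services?fields=ServiceInfo/state",
                        auth=AUTH, headers=HEADERS)
    resp.raise_for_status()
    for item in resp.json()["items"]:
        info = item["ServiceInfo"]
        print(info["service_name"], info.get("state", "UNKNOWN"))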
And finally, because any enterprise runs a heterogeneous set of infrastructures, we ensure that HDP runs on your choice of infrastructure. Whether this is Linux, Windows (HDP is the only distribution certified for Windows), on a cloud platform such as Azure or Rackspace, or in an appliance, we ensure that all of them are supported and that this work is all contributed back to the open source community.
In summary, by addressing these elements, we can provide an Enterprise Hadoop distribution that includes the Core Services, Platform Services, Data Services and Operational Services required by the Enterprise user. And all of this is done in 100% open source, and tested at scale by our team (together with our partner Yahoo) to bring Enterprise process to an open source approach. And finally, this is the distribution that is endorsed by the ecosystem to ensure interoperability in your environment.
While overly simplistic, this graphic represents what we commonly see as a general data architecture:
A set of data sources producing data.
A set of data systems to capture and store that data: most typically a mix of RDBMS and data warehouses.
A set of applications that leverage the data stored in those data systems. These could be packaged BI applications (Business Objects, Tableau, etc.), Enterprise Applications (e.g. SAP) or Custom Applications (e.g. custom web applications), ranging from ad-hoc reporting tools to mission-critical enterprise operations applications.
Your environment is undoubtedly more complicated, but conceptually it is likely similar.
As the volume of data has exploded, we increasingly see organizations acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets). Instead, we increasingly see Hadoop – and HDP in particular – being introduced as a complement to the traditional approaches. It is not replacing the database but rather complementing it, and as such it must integrate easily with existing tools and approaches. This means it must interoperate with:
Existing applications, such as Tableau, SAS, Business Objects, etc.
Existing databases and data warehouses, for loading data to and from the data warehouse (see the Sqoop sketch below).
Development tools used for building custom applications.
Operational tools for managing and monitoring.
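One common tool for that to-and-from movement is Apache Sqoop, which ships with HDP. As a hedged illustration, the sketch below pulls an archival table out of an existing warehouse into HDFS; the JDBC URL, credentials, and table are placeholders, and the matching JDBC driver must be available to Sqoop.

    # Hedged sketch: offloading a warehouse table into HDFS with Sqoop import.
    import subprocess

    subprocess.check_call([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dw.example.com/retail",    # placeholder warehouse
        "--username", "etl_user",
        "--password-file", "/user/etl/.dw_password",          # password kept in HDFS
        "--table", "transactions_archive",
        "--target-dir", "/data/retail/archive/transactions",
        "--num-mappers", "4",
    ])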
It is for that reason that we focus on HDP interoperability across all of these categories:
Data systems: HDP is endorsed and embedded with SQL Server, Teradata and more.
BI tools: HDP is certified for use with the packaged applications you already use, from Microsoft to Tableau, MicroStrategy, Business Objects and more.
Development tools: for .NET developers, Visual Studio, used to build more than half the custom applications in the world, certifies with HDP to enable Microsoft app developers to build custom apps with Hadoop. For Java developers, Spring for Apache Hadoop enables them to quickly and easily build Hadoop-based applications with HDP.
Operational tools: integration with System Center, and with Teradata Viewpoint.
Across all of our user base, we have identified just 3 separate usage patterns – sometimes more than one is used in concert during a complex project, but the patterns are distinct nonetheless. These are Refine, Explore and Enrich.

The first of these, the Refine case, is probably the most common today. It is about taking very large quantities of data and using Hadoop to distill the information down into a more manageable data set that can then be loaded into a traditional data warehouse for usage with existing tools. This is relatively straightforward and allows an organization to harness a much larger data set for their analytics applications while leveraging their existing data warehousing and analytics tools. Using the graphic here, in step 1 data is pulled from a variety of sources into the Hadoop platform in step 2, and then in step 3 loaded into a data warehouse for analysis by existing BI tools.
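Step 3 of the Refine pattern, loading the distilled result into the warehouse, is also commonly handled with Sqoop. The sketch below is illustrative only: the warehouse URL, target table, and export directory are assumptions, and the column layout of the HDFS files has to match the target table.

    # Hedged sketch: exporting refined, Hadoop-produced data into a warehouse table.
    import subprocess

    subprocess.check_call([
        "sqoop", "export",
        "--connect", "jdbc:mysql://dw.example.com/retail",    # placeholder warehouse
        "--username", "etl_user",
        "--password-file", "/user/etl/.dw_password",
        "--table", "HOURLY_SALES",
        "--export-dir", "/data/retail/hourly/2013-06-12",
        "--input-fields-terminated-by", "\t",
    ])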
A second use case is what we would refer to as Data Exploration – this is the use case people most commonly mean when they talk about “Data Science”. In simplest terms, it is about using Hadoop as the primary data store rather than performing the secondary step of moving data into a data warehouse. To support this use case you’ve seen all the BI tool vendors rally to add support for Hadoop – and most commonly HDP – as a peer to the database, allowing rich analytics on extremely large datasets that would be both unwieldy and costly in a traditional data warehouse. Hadoop allows for interaction with a much richer dataset and has spawned a whole new generation of analytics tools that rely on Hadoop (HDP) as the data store. To use the graphic, in step 1 data is pulled into HDP, it is stored and processed in step 2, before being surfaced directly into the analytics tools for the end user in step 3.
The final use case is called Application Enrichment. This is about incorporating data stored in HDP to enrich an existing application. This could be an online application in which we want to surface custom information to a user based on their particular profile. For example: if a user has been searching the web for information on home renovations, in the context of your application you may want to use that knowledge to surface a custom offer for a product that you sell related to that category. Large web companies such as Facebook and others are very sophisticated in the use of this approach. In the diagram, this is about pulling data from disparate sources into HDP in Step 1, storing and processing it in Step 2, and then interacting with it directly from your applications in Step 3, typically in a bi-directional manner (e.g. request data, return data, store response).
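In practice, “interacting with it directly from your applications” often means a low-latency lookup against a serving store such as Apache HBase, which is included in HDP. The sketch below uses the happybase client against an HBase Thrift gateway; the table name, column family, and offer logic are illustrative assumptions rather than details from the talk.

    # Hedged sketch: an application enriching a page view with a profile kept in HBase.
    import happybase

    conn = happybase.Connection("hbase-gateway.example.com")  # placeholder Thrift gateway
    profiles = conn.table("user_profiles")                    # hypothetical table

    def offer_for(user_id):
        # Read the interest computed offline (e.g. by a batch MapReduce job) for this user.
        row = profiles.row(user_id.encode("utf-8"))
        interest = row.get(b"behavior:top_interest", b"").decode("utf-8")
        # The application decides what to surface; this mapping is purely illustrative.
        return "10% off power tools" if interest == "home_renovation" else None

    print(offer_for("user-42"))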