I would like to spend the next 30 minutes covering 3 primary areas:
- A quick background on who we are
- A bit about our philosophy and our approach
- Then I’d like to spend a bit of time on the primary patterns of use that we see for organizations using HDP, and Hadoop more broadly
I can’t really talk about Hortonworks without first taking a moment to talk about the history of Hadoop.
What we now know as Hadoop really started back in 2005, when Eric Baldeschwieler (known as “E14”) started work on a project to build a large-scale data storage and processing technology that would allow Yahoo to store and process massive amounts of data to underpin its most critical application, Search. The initial focus was on building out the technology (the key components being HDFS and MapReduce) that would become the core of what we think of as Hadoop today, and on continuing to innovate it to meet the needs of this specific application.
By 2008, Hadoop usage had greatly expanded inside Yahoo, to the point that many applications were using this data management platform. As a result, the team’s focus extended to include Operations: now that applications were propagating around the organization, sophisticated capabilities for operating the platform at scale were necessary. It was also at this time that usage began to expand well beyond Yahoo, with many notable organizations (including Facebook and others) adopting Hadoop as the basis of their large-scale data processing and storage applications, which necessitated a focus on operations to support what was by now a large variety of critical business applications.
In 2011, recognizing that more mainstream adoption of Hadoop was beginning to take off, and with the objective of facilitating it, the core team left, with the blessing of Yahoo, to form Hortonworks. The goal of the group was to facilitate broader adoption by addressing the Enterprise capabilities that would enable a larger number of organizations to adopt and expand their usage of Hadoop.
[note: if useful as a talk track, Cloudera was formed in 2008, well BEFORE the operational expertise of running Hadoop at scale was established inside of Yahoo]
In that capacity, Arun allows Hortonworks to be instrumental in working with the community to drive the roadmap for Core Hadoop, where the focus today is on things like YARN, MapReduce2, HDFS2 and more.
For Core Hadoop, in absolute terms, Hortonworkers have contributed more than twice as many lines of code as the next closest contributor, and even more if you include Yahoo, our development partner. Taking such a prominent role also enables us to ensure that our distribution integrates deeply with the ecosystem: both in the choice of deployment platforms, such as Windows, Azure and more, and in creating deeply engineered solutions with key partners such as Teradata.
And consistent with our approach, all of this is done in 100% open source.
At Hortonworks today, our focus is very clear: we Develop, Distribute and Support a 100% open source distribution of Enterprise Apache Hadoop.
- We employ the core architects, builders and operators of Apache Hadoop and drive the innovation in the open source community.
- We distribute the only 100% open source Enterprise Hadoop distribution: the Hortonworks Data Platform.
- Given our operational expertise of running some of the largest Hadoop infrastructure in the world at Yahoo, our team is uniquely positioned to support you.
Our approach is also uniquely endorsed by some of the biggest vendors in the IT market:
- Yahoo is both an investor and a customer, and most importantly, a development partner. We partner to develop Hadoop, and no distribution of HDP is released without first being tested on Yahoo’s infrastructure, using the same regression suite that they have used for years as they grew to have the largest production cluster in the world.
- Microsoft has partnered with Hortonworks to include HDP in both their off-premise offering on Azure and their on-premise offering, under the product name HDInsight. This also includes integration with both Visual Studio for application development and System Center for operational management of the infrastructure.
- Teradata includes HDP in their products in order to provide the broadest possible range of options for their customers.
In summary, by addressing these elements, we can provide an Enterprise Hadoop distribution which includes the Core Services, Platform Services, Data Services and Operational Services required by the Enterprise user.
And all of this is done in 100% open source, and tested at scale by our team (together with our partner Yahoo) to bring Enterprise process to an open source approach. And finally, this is the distribution that is endorsed by the ecosystem to ensure interoperability in your environment.
While overly simplistic, this graphic represents what we commonly see as a general data architecture:
- A set of data sources producing data
- A set of data systems to capture and store that data: most typically a mix of RDBMS and data warehouses
- A set of applications that leverage the data stored in those data systems. These could be packaged BI applications (Business Objects, Tableau, etc.), Enterprise Applications (e.g. SAP) or Custom Applications (e.g. custom web applications), ranging from ad-hoc reporting tools to mission-critical enterprise operations applications.
Your environment is undoubtedly more complicated, but conceptually it is likely similar.
As the volume of data has exploded, we increasingly see organizations acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets).
Instead, we increasingly see Hadoop, and HDP in particular, being introduced as a complement to the traditional approaches. It is not replacing the database but rather complementing it, and as such it must integrate easily with existing tools and approaches. This means it must interoperate with:
- Existing applications, such as Tableau, SAS, Business Objects, etc.
- Existing databases and data warehouses, for loading data to and from the data warehouse
- Development tools used for building custom applications
- Operational tools for managing and monitoring
It is for that reason that we focus on HDP interoperability across all of these categories:
- Data systems: HDP is endorsed and embedded with SQL Server, Teradata and more
- BI tools: HDP is certified for use with the packaged applications you already use, from Microsoft to Tableau, MicroStrategy, Business Objects and more
- Development tools:
  - For .NET developers: Visual Studio, used to build more than half the custom applications in the world, certifies with HDP to enable Microsoft app developers to build custom apps with Hadoop
  - For Java developers: Spring for Apache Hadoop enables Java developers to quickly and easily build Hadoop-based applications with HDP
- Operational tools: integration with System Center and with Teradata Viewpoint
So we’ve covered the overall architecture and how Hadoop fits; let’s discuss the patterns of use that we’re seeing for Hadoop.
At a high level, we describe the 3 key patterns of use as Refine, Explore, and Enrich.
- Refine captures the data into the platform and transforms (or refines) it into the desired formats.
- Explore is about creating lakes of data that you can interactively surf through to find valuable insights.
- Enrich is about leveraging analytics and models to influence your online applications, making them more intelligent.
So while some categorize Hadoop as just a Batch platform, it is increasingly being used, and is evolving, to serve a wide range of usage patterns that span Batch, Interactive, and Online needs.
Let me cover these patterns in a little more detail.
Across all of our user base, we have identified just 3 separate usage patterns. Sometimes more than one is used in concert during a complex project, but the patterns are distinct nonetheless. These are Refine, Explore and Enrich.
The first of these, the Refine case, is probably the most common today. It is about taking very large quantities of data and using Hadoop to distill the information down into a more manageable data set that can then be loaded into a traditional data warehouse for use with existing tools. This is relatively straightforward and allows an organization to harness a much larger data set for their analytics applications while leveraging their existing data warehousing and analytics tools.
Using the graphic here, in step 1 data is pulled from a variety of sources, into the Hadoop platform in step 2, and then in step 3 loaded into a data warehouse for analysis by existing BI tools.
A second use case is what we would refer to as Data Exploration. This is the use case most commonly in question when people talk about “Data Science”.
In simplest terms, it is about using Hadoop as the primary data store rather than performing the secondary step of moving data into a data warehouse. To support this use case you’ve seen all the BI tool vendors rally to add support for Hadoop, and most commonly HDP, as a peer to the database, and in so doing allow for rich analytics on extremely large datasets that would be both unwieldy and costly in a traditional data warehouse. Hadoop allows for interaction with a much richer dataset and has spawned a whole new generation of analytics tools that rely on Hadoop (HDP) as the data store.
To use the graphic, in step 1 data is pulled into HDP, it is stored and processed in step 2, before being surfaced directly into the analytics tools for the end user in step 3.
The final use case is called Application Enrichment.
This is about incorporating data stored in HDP to enrich an existing application. This could be an online application in which we want to surface custom information to a user based on their particular profile. For example: if a user has been searching the web for information on home renovations, in the context of your application you may want to use that knowledge to surface a custom offer for a product that you sell related to that category. Large web companies such as Facebook and others are very sophisticated in the use of this approach.
In the diagram, this is about pulling data from disparate sources into HDP in step 1, storing and processing it in step 2, and then interacting with it directly from your applications in step 3, typically in a bi-directional manner (e.g. request data, return data, store response).
Additionally, we are a leading provider of Hadoop support through our Hortonworks University, with courses for both development and operations. If required, we can also provide expert consulting services, either ourselves or through our System Integrator partners.
And for anyone looking to get their hands on Hadoop, we have recently introduced the Hadoop Sandbox program, which enables users to download a full instance of HDP together with guided tutorials covering both development and administration topics.
So that is really our focus:
- Playing a leading role in the ecosystem to continue to lead the innovation for both Core Hadoop and the associated open source projects
- Identifying and addressing the Enterprise requirements to enable broad adoption
- Enabling interoperability of the ecosystem
All of this is done with a consistent philosophy: 100% open source.
Ben Marden - Making sense of Big Data