Our goal since our inception has been very simple: to enable a Modern Data Architecture with Enterprise Hadoop. Everything we do is with this architectural goal in mind.
The majority of enterprise data has traditionally come from large scale ERP, CRM, and other applications. Each application has become siloed without the ability to gain insights across ALL the data. Now the enterprise must rationalize existing data silos but also gain value from the explosion of data that is being generated from the new paradigm sources.
The challenge is that the existing data management platforms have become both architecturally and financially impractical. Architecturally, these systems were not designed to store or process vast quantities of data. Financially, the licensing structures of the traditional approach are no longer feasible.
These challenges and the rate at which data is being produced require a completely new approach to managing data. If we fast-forward another 3 to 5 years, more than 50% of the data under management within the enterprise will be from these new data paradigm sources. We have come to an inflection point in how the enterprise can manage its data.
What has created this inflection point is the growth and value from the new paradigm data.
New data paradigm sources have put tremendous pressure on existing platforms but have also created tremendous opportunities.
Exponential Growth: 85% year-over-year growth.
Varied Nature: The incoming data can have little or no structure, or structure that changes too frequently for reliable schema creation at time of ingest.
Value at High Volumes: The incoming data can have little or no value as individual records or small groups of records, but at high volumes and over longer historical periods it can be inspected for patterns and used for advanced analytic applications.
This New Data Paradigm opens up the opportunity for both an architectural and a business transformation that applies to virtually every industry.
In today’s data-rich world, overlooked insight translates into missed opportunity.
The opportunities afforded by the age of Big Data have given rise to a new ultra-competitive breed of business that consumes the full spectrum of its data, transforming immense volumes and varieties of data into currency.
Our customers are investing in next-generation “systems of insight,” with advanced analytic apps providing a single, holistic view of customers and processes, and delivering predictive analytics around business performance and discovery through machine learning.
Underpinning these capabilities is a YARN-based architecture that delivers huge new processing power, scale, and efficiency, especially when it’s properly integrated with existing operational and data warehousing systems.
HDP usage typically begins by creating new analytic applications fueled by the data that was not previously being captured.
As more and more applications are created, more opportunity is unlocked across ALL data sets, from the new types of data such as sensor/machine data, server logs, and clickstreams to traditional sources like ERP and CRM.
Ultimately, HDP’s YARN-based architecture acts as a shared service for delivering deep insight across a large, broad, diverse set of data at efficient scale in a way that existing enterprise systems and tools can integrate with.
Ultimately, most organizations that adopt Hadoop aspire to create a data lake where multiple applications use a shared set of resources for both storage and processing, all with a consistent level of service.
The value in the data lake ultimately results in delivery of “systems of insight,” where advanced algorithms and applications that access multiple data sets allow organizations to derive brand new value from data that was once impossible to investigate or simply too complex to combine and analyze. Hadoop doesn’t just create a data lake; it opens the platform for analysts to view multiple data sources in multiple dimensions and reduce time to insight.
This journey from apps to lake is only possible with HDP and its YARN-based architecture.
Since starting the company, one of our core missions has been to make Hadoop an enterprise-viable data platform.
With HDP and its YARN-based architecture, the market now has a multi-tenant data platform built on a centralized architecture that provides the shared enterprise services of Resource Management, Operations, Security, Governance in a consistent manner for all Data Access patterns, for batch, interactive, or real-time applications.
These enterprise readiness capabilities help enable HDP to be used everywhere.
While it’s clear that HDP is ready for the enterprise, that doesn’t mean that we stop our work on enterprise readiness.
In fact, it’s just the opposite. There are more security, governance and operational advancements taking place in the Hadoop ecosystem now than ever before.
And we continue to advance all of the services with the community.
From Jeff Dean http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
The outline stays the same; Map and Reduce change to fit the problem.
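To make that point concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: the `run_map_reduce` driver and the sample functions are hypothetical names chosen for illustration. The driver (map, group by key, reduce) is written once, and only the map and reduce functions are swapped to solve two different problems.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Fixed outline: map every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Problem 1: count words (hypothetical sample data).
def word_count_map(line):
    for word in line.lower().split():
        yield word, 1


def sum_reduce(key, values):
    return sum(values)


# Problem 2: highest reading per city (hypothetical sample data).
def reading_map(reading):
    city, temperature = reading
    yield city, temperature


def max_reduce(key, values):
    return max(values)


lines = ["the quick brown fox", "the lazy dog"]
word_counts = run_map_reduce(lines, word_count_map, sum_reduce)

readings = [("Austin", 31), ("Oslo", 12), ("Austin", 35)]
max_readings = run_map_reduce(readings, reading_map, max_reduce)
```

In a real cluster the grouping step is the distributed shuffle and the map and reduce calls run in parallel on many machines, but the shape of the program is the same.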
Faced with this challenge, the team at Yahoo conceived and created Apache Hadoop to address it. They were convinced that contributing this platform to an open community would speed innovation, so they open sourced the technology under the governance of the Apache Software Foundation (ASF). This introduced two distinct and significant advantages.
Not only could they manage new data types at scale, but they now had a commercially feasible approach.
However, there were still significant challenges. The first generation of Hadoop:
- was designed and optimized for batch-only workloads,
- required dedicated clusters for each application, and
- didn’t integrate easily with many of the existing technologies present in the data center.
Also, like any emerging technology, Hadoop had to reach the level of readiness required by the enterprise.
After running Hadoop at scale at Yahoo, the team spun out to form Hortonworks with the intent to address these challenges and make Hadoop enterprise ready.
Access, Execution, Resource Mgt
Since HDP provides a centralized architecture that is built on YARN with common services for security, operations, and governance, it enables the enterprise to run a wide range of applications simultaneously with well-managed service levels. More applications and more data can run in the same shared cluster, which simplifies security, operations, and governance.
Since the other pure play vendors have NOT built their products from the ground-up on a centralized YARN architecture, their platform architectures are disjoint.
Without a consistent set of services applied to all applications and workloads, users are forced to silo their clusters in order to achieve predictable performance and service levels – which is more complex and costly.
And since the critical services for security, operations, and governance are implemented as bolt-ons, the deployment architecture is further complicated.
In 2011, Hortonworks was founded with the 24 original Hadoop architects and engineers from Yahoo!
This original team had been working on a technology called YARN (Yet Another Resource Negotiator) that enables multiple applications to have access to all your enterprise data through an efficient centralized platform. It is the data operating system for Hadoop that provides the versatility to handle any application and dataset no matter the size or type.
Moreover, YARN provided the centralized architecture around which the critical enterprise services of Security, Operations, and Governance could be centrally addressed and integrated with existing enterprise policies.
This work allowed a new approach to data to emerge: the modern data architecture. At the heart of this approach is the capability for Hadoop to unify data and processing in an efficient data platform.
Pig Latin, a language intended to sit between the two:
- Provides standard relational transforms (join, sort, etc.)
- Schemas are optional, used when available, and can be defined at runtime
- User Defined Functions are first-class citizens
An engine for executing programs on top of Hadoop. It provides a language, Pig Latin, to specify these programs.
Pig executes in a unique fashion: some commands build on previous commands, while certain commands trigger a MapReduce job.
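The execution model described above can be pictured with a small Python analogy. This is not Pig's API: `LazyRelation` and its methods are hypothetical names, and the "job" is just a local loop. The idea it illustrates is that statements such as LOAD, FILTER, and FOREACH only extend a plan, while an output statement such as DUMP or STORE is what actually launches work.

```python
class LazyRelation:
    """Toy stand-in for a Pig relation: records a plan, runs nothing until asked."""

    jobs_launched = 0  # counts how many times work was actually executed

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Builds on the previous relation; no data is touched here.
        return LazyRelation(self._source, self._steps + (("filter", predicate),))

    def foreach(self, fn):
        # Also only extends the plan.
        return LazyRelation(self._source, self._steps + (("foreach", fn),))

    def dump(self):
        # The output step is what triggers execution of the whole plan.
        LazyRelation.jobs_launched += 1
        rows = list(self._source)
        for kind, fn in self._steps:
            if kind == "filter":
                rows = [row for row in rows if fn(row)]
            else:
                rows = [fn(row) for row in rows]
        return rows


# Hypothetical sample data: (truck id, miles driven).
trucks = LazyRelation([("A1", 120), ("B2", 45), ("C3", 300)])
long_trips = trucks.filter(lambda row: row[1] > 100)
truck_ids = long_trips.foreach(lambda row: row[0])

jobs_before_dump = LazyRelation.jobs_launched  # nothing has run yet
result = truck_ids.dump()                      # execution happens here
```

Deferring execution this way lets the engine see the whole plan before running it, so several statements can be combined into fewer jobs.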
Interactive queries at scale
Originally created by a team at Facebook
HDP 2.x ships with HiveServer2, a Thrift-based implementation that allows multiple concurrent connections and also supports Kerberos authentication.
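As a rough illustration of what a client needs in order to reach HiveServer2, the sketch below builds a JDBC connection URL in Python. The helper name, host, and Kerberos principal are hypothetical placeholders; the `jdbc:hive2://` scheme, the conventional default port 10000, and the `principal=` session parameter for Kerberos are the parts taken from standard HiveServer2 usage.

```python
def hiveserver2_jdbc_url(host, port=10000, database="default", kerberos_principal=None):
    """Build a HiveServer2 JDBC URL; add the server principal when Kerberos is used."""
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if kerberos_principal:
        # On a Kerberized cluster the client names the HiveServer2 service principal.
        url += f";principal={kerberos_principal}"
    return url


# Hypothetical host and realm, for illustration only.
plain_url = hiveserver2_jdbc_url("hs2.example.com")
secure_url = hiveserver2_jdbc_url(
    "hs2.example.com",
    kerberos_principal="hive/hs2.example.com@EXAMPLE.COM",
)
```

Many clients can open sessions against the same URL at once, which is the concurrency benefit the note above refers to.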
Note that this property is set to mr by default.
The first wave of Hadoop was about HDFS and MapReduce where MapReduce had a split brain, so to speak. It was a framework for massive distributed data processing, but it also had all of the Job Management capabilities built into it.
The second wave of Hadoop is upon us and a component called YARN has emerged that generalizes Hadoop’s Cluster Resource Management in a way where MapReduce is NOW just one of many frameworks or applications that can run atop YARN.
Simply put, YARN is the resource manager for data processing applications. For those curious, YARN stands for “Yet Another Resource Negotiator”.
[CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop.
[CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways, with predictable performance and quality of service.
[CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years. And this slide shows just a sampling of open source projects that are or will be leveraging YARN in the not so distant future.
For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN enabled. Spark is an in-memory data processing system built at Berkeley that’s been recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
You have talked about the components of Hadoop; now this slide covers the various roles of Hadoop professionals.
HDP is versatile enough to handle any data, for any application, anywhere.
ANY DATA: Hadoop was initially designed to store and process vast quantities of data and is still the optimal platform to do so. With YARN and the introduction of all types of access methods, from batch to interactive and real time, processing and analyzing this data has become even easier.
ANY APPLICATION: YARN also opens up Hadoop so that it can extend the value of linear-scale storage and processing to existing applications. This also allows you to reuse your existing skillsets and resources, but with Hadoop as a foundation.
To date, Hortonworks has certified over 70 ISVs as YARN ready, and the list is growing.
ANYWHERE: As a key part of the modern data architecture, Hadoop needs to be available across a wide range of deployment choices, and we enable the widest choice in the industry.
In 2011, we established our partnership with Microsoft based on a shared vision of a hybrid world where Hadoop can run on-premises on Windows Server or Linux, within turnkey appliances, and in the cloud as a fully managed service or simply running within virtual machines on infrastructure-as-a-service clouds.
Our work with Microsoft brought Hadoop to the Windows Server ecosystem and we’re the only vendor serving that market opportunity today.
While most of our customers are deploying on-premises Hadoop clusters, we are uniquely positioned to support a hybrid architecture as enterprises embrace cloud for specific use cases.
This is a great use case, but only spend 3-4 minutes on it.
Run Hive queries to refine the trucks data and get the average mileage. Compute the risk factor for each driver (mileage
Power Pivot again – this time demonstrating which drivers had the most incidents.
Power Pivot map again – this time showing the areas where the incidents occurred.