The document discusses Hadoop and its capabilities for real-time analytics. It describes how Hadoop provides massive storage and parallel processing. It also explains the key components of Hadoop including HDFS for storage and YARN as the data operating system. The document outlines use cases for real-time analytics in various industries and demonstrates a real-time analytics application for monitoring truck driver behavior.
So, where does Hadoop fit in the data center? This picture here is a very simple depiction of the typical data architecture in any organization.
- There are sources of data: ERP, CRM, other digital sources
- That data is then stored in a data system: a data warehouse, MPP system, etc.
- Then an application of some kind accesses that data system: a packaged application such as Excel or Tableau, a custom application written by a developer, or even another business application
This has been the foundation of the data center for years. We have always had challenges with this architecture; however, we are now seeing increased pressure to modify and improve this basic blueprint because:
A) this approach created silos of data, making it difficult to either share the data or get a single view of it
B) these systems are costly to scale
C) and they are also coupled to a very static schema. Changes to a data model are difficult if not impossible, which limits flexibility and insight.
Finally, as we digitize the world around us, NEW types of data such as clickstream and machine sensor data are emerging and growing at exponential rates. We are all becoming data-driven organizations.
In fact, the sheer volume of data is projected to grow 20X between 2013 and 2020, which puts tremendous pressure on this architecture. The old architecture is neither technologically nor commercially practical.
YARN is the element that enables the modern data architecture: it turns Hadoop into a truly multi-purpose data platform, with batch, interactive, and real-time workloads all running in a single cluster.
It enables users to:
- Create a central cluster in which data can be stored and then accessed using a range of processing engines: batch, interactive, real-time.
- It is akin to the journey with virtualization: from a single virtual server to a pool of virtual infrastructure.
It is the architectural center of Hadoop
- it provides the data operating system around which the core enterprise capabilities of security, governance and operations can be integrated
- It is the integration point into which all data processing engines integrate – from the open source community but also from the commercial vendor ecosystem
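As a concrete sketch, the "multiple engines, one cluster" idea above might look like this from an operator's shell. The cluster, file paths, class names, and job names here are illustrative assumptions, not part of the original material:

```shell
# Batch: a classic MapReduce job submitted to the YARN cluster
# (hadoop-mapreduce-examples ships with the Hadoop distribution;
# input/output paths are hypothetical)
yarn jar hadoop-mapreduce-examples.jar wordcount \
    /data/clickstream /results/wordcount

# Interactive: a Hive query; with Hive running on Tez, the query
# executes as a DAG of containers inside the same YARN cluster
hive -e "SELECT driver_id, COUNT(*) FROM events GROUP BY driver_id"

# Real-time / streaming: a Spark application whose executors are
# YARN containers running alongside the batch and interactive work
# (the class and jar are placeholders for a custom application)
spark-submit --master yarn --deploy-mode cluster \
    --class com.example.TruckMonitor truck-monitor.jar
```

All three engines request containers from the same YARN ResourceManager, which is what lets a single cluster serve batch, interactive, and real-time workloads at once rather than maintaining a separate silo per engine.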