The document discusses the growing importance of Hadoop and big data processing. It cites Gartner's prediction that, by 2015, organizations that build modern information management systems using technologies like Hadoop will financially outperform their peers by 20 percent. It then outlines Hortonworks' vision of developing Hadoop into an enterprise-ready platform that can support a wide range of workloads and use cases beyond batch processing. Finally, it discusses Hortonworks' role in driving adoption of Hadoop through open source community contributions as well as commercial support.
Modern applications, often grouped under the label “big-data” analysis, require us to manage immense amounts of data quickly. To deal with applications such as these, a new software stack has evolved.
I have studied Big Data analysis and found Hadoop to be the most popular and capable technology for it, thanks to its distributed data processing approach. In this slide show I have gathered information about the various Hadoop distributions available in the market and tried to describe the most important tools in the Hadoop ecosystem and their functionality. I have also discussed connectivity with the R language from a data analysis and visualization perspective. I hope you enjoy the whole thing!
Building a Big Data platform with the Hadoop ecosystem - Gregg Barrett
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
John Sing's Edge 2013 presentation, detailing when, where, and how external storage products and/or system software (e.g. GPFS) can be used effectively in a Hadoop storage environment. Many Hadoop situations absolutely require direct-attached storage, but there are many situations where shared external storage may make sense. This presentation details how, why, and where, and promotes taking an intelligent, Hadoop-aware approach to deciding between internal storage and external shared storage. Full awareness of Hadoop considerations is essential to selecting either internal or external shared storage in a Hadoop environment.
Top Hadoop Big Data Interview Questions and Answers for Freshers - JanBask Training
Big Data Warehousing: Pig vs. Hive Comparison - Caserta
In a recent Big Data Warehousing Meetup in NYC, Caserta Concepts partnered with Datameer to explore big data analytics techniques. In the presentation, we made a Hive vs. Pig Comparison. For more information on our services or this presentation, please visit www.casertaconcepts.com or contact us at info (at) casertaconcepts.com.
http://www.casertaconcepts.com
Big Data raises challenges about how to process such vast pools of raw data and how to turn them into value for our lives. To address these demands, an ecosystem of tools named Hadoop was conceived.
This is the basis for some talks I've given at the Microsoft Technology Center, the Chicago Mercantile Exchange, and local user groups over the past two years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc... - Agile Testing Alliance
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing, by Sampat Kumar from Harman. The presentation was given at the #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved by the author.
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ... - Renato Bonomini
Hadoop is a zoo of different types of workloads; even though most companies are simply using Hadoop to store information (HDFS), there are many other applications; to name a few: HDFS, Hive, Pig, Impala, Spark, Solr, and Flume.
Each animal in this zoo behaves differently; for example, there are significant differences between the two most common workloads, “MapReduce” and “HBase”.
This leads to three main points of view for analysis, to make sure service levels are achieved:
- Interest in response time for “interactive workloads”: CPU, memory, network, and I/O utilization levels that allow queries to be answered in a quick and effective way
- Interest in high throughput for “batch workloads”: maximize utilization levels; response time is not a concern
- Interest in planning storage capacity (filesystem and HDFS)
This talk focuses on providing guidelines to help the capacity planner translate existing techniques and frameworks and adapt them to these new technologies; in most cases, “what’s old is new again”.
Supporting Financial Services with a More Flexible Approach to Big Data - WANdisco Plc
In this webinar, WANdisco and Hortonworks look at three examples of using 'Big Data' to get a more comprehensive view of customer behavior and activity in the banking and insurance industries. Then we'll pull out the common threads from these examples, and see how a flexible next-generation Hadoop architecture lets you get a step up on improving your business performance. Join us to learn:
- How to leverage data from across an entire global enterprise
- How to analyze a wide variety of structured and unstructured data to get quick, meaningful answers to critical questions
- What industry leaders have put in place
In this webinar, we'll:
-Examine the key drivers and use cases for High Availability, performance and scalability for Apache Hadoop.
-Walk through an overview of reference architecture for a Non-Stop Hadoop implementation.
-Show how you can get started with Non-Stop Hadoop with the Hortonworks Data Platform.
Supporting Financial Services with a More Flexible Approach to Big Data - Hortonworks
Financial services companies can reap tremendous benefits from 'Big Data' and they have moved quickly to deploy it. But these companies also place heavy demands on 'Big Data' infrastructure for flexibility, reliability and performance. In this webinar, Hortonworks joins WANDisco to look at three examples of using 'Big Data' to get a more comprehensive view of customer behavior and activity in the banking and insurance industries. Then we'll pull out the common threads from these examples, and see how a flexible next-generation Hadoop architecture lets you get a step up on improving your business performance. Join us to learn:
How to leverage data from across an entire global enterprise
How to analyze a wide variety of structured and unstructured data to get quick, meaningful answers to critical questions
What industry leaders have put in place
Hadoop Reporting and Analysis - Jaspersoft - Hortonworks
Hadoop is deployed for a variety of uses, including web analytics, fraud detection, security monitoring, healthcare, environmental analysis, social media monitoring, and other purposes.
Create a Smarter Data Lake with HP Haven and Apache Hadoop - Hortonworks
An organization’s information is spread across multiple repositories, on-premise and in the cloud, with limited ability to correlate information and derive insights. The Smart Content Hub solution from HP and Hortonworks enables a shared content infrastructure that transparently synchronizes information with existing systems and offers an open standards-based platform for deep analysis and data monetization.
- Leverage 100% of your data: Text, images, audio, video, and many more data types can be automatically consumed and enriched using HP Haven (powered by HP IDOL and HP Vertica), making it possible to integrate this valuable content and insights into various line of business applications.
- Democratize and enable multi-dimensional content analysis: empower your analysts, business users, and data scientists to search and analyze Hadoop data with ease, using the 100% open source Hortonworks Data Platform.
- Extend the enterprise data warehouse: Synchronize and manage content from content management systems, and crack open the files in whatever format they happen to be in.
- Dramatically reduce complexity with enterprise-ready SQL engine: Tap into the richest analytics that support JOINs, complex data types, and other capabilities only available with HP Vertica SQL on the Hortonworks Data Platform.
Speakers:
- Ajay Singh, Director, Technical Channels, Hortonworks
- Will Gardella, Product Management, HP Big Data
Transform Your Business with Big Data and Hortonworks - Pactera_US
Customer insight and marketplace prediction are among the profitable benefits of big data technology. Leading companies are using advanced analytics solutions to find new revenue streams, increase customer satisfaction, and optimize the supply chain.
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data - Hortonworks
Hadoop is a great platform for storing and processing massive amounts of data. Elasticsearch is the ideal solution for Searching and Visualizing the same data. Join us to learn how you can leverage the full power of both platforms to maximize the value of your Big Data.
In this webinar we'll walk you through:
How Elasticsearch fits in the Modern Data Architecture.
A demo of Elasticsearch and Hortonworks Data Platform.
Best practices for combining Elasticsearch and Hortonworks Data Platform to extract maximum insights from your data.
Apache Hadoop and its role in Big Data architecture - Himanshu Bari - jaxconf
In today’s world of exponentially growing big data, enterprises are becoming increasingly aware of the business utility and necessity of harnessing, storing, and analyzing this information. Apache Hadoop has rapidly evolved to become a leading platform for managing and processing big data, with the vital management, monitoring, metadata, and integration services organizations require to glean maximum business value and intelligence from their burgeoning amounts of information on customers, web trends, products, and competitive markets. In this session, Hortonworks' Himanshu Bari will discuss the opportunities for deriving business value from big data by looking at how organizations utilize Hadoop to store, transform, and refine large volumes of this multi-structured information. He will also discuss the evolution of Apache Hadoop and where it is headed, the component requirements of a Hadoop-powered platform, and solution architectures that allow for Hadoop integration with existing data discovery and data warehouse platforms. In addition, he will look at real-world use cases where Hadoop has helped to produce more business value, augment productivity, or identify new and potentially lucrative opportunities.
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks - Data Con LA
Arun Murthy will discuss the future of Hadoop and the next steps in what the big data world will start to look like. With the advent of tools like Spark and Flink and the containerization of apps using Docker, there is a lot of momentum in this space. Arun will share his thoughts and ideas on what the future holds for us.
Bio:
Arun C. Murthy
Arun is an Apache Hadoop PMC member and has been a full-time contributor to the project since its inception in 2006. He is also the lead of the MapReduce project and has focused on building NextGen MapReduce (YARN). Prior to co-founding Hortonworks, Arun was responsible for all MapReduce code and configuration deployed across the 42,000+ servers at Yahoo!. In essence, he was responsible for running Apache Hadoop’s MapReduce as a service for Yahoo!. He also jointly holds the current world sorting record using Apache Hadoop. Follow Arun on Twitter: @acmurthy.
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop - Hortonworks
How can you simplify the management and monitoring of your Hadoop environment, and ensure IT can focus on the right business priorities supported by Hadoop? Take a look at this presentation to learn how.
Apache Ambari is a single framework for IT administrators to provision, manage and monitor a Hadoop cluster. Apache Ambari 1.7.0 is included with Hortonworks Data Platform 2.2.
In this 30-minute webinar, Hortonworks Product Manager Jeff Sposetti and Apache Ambari committer Mahadev Konar discussed new capabilities including:
Improvements to Ambari core - such as support for ResourceManager HA
Extensions to Ambari platform - introducing Ambari Administration and Ambari Views
Enhancements to Ambari Stacks - dynamic configuration recommendations and validations via a "Stack Advisor"
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level - Hortonworks
The HDF 3.3 release delivers several exciting enhancements and new features. But, the most noteworthy of them is the addition of support for Kafka 2.0 and Kafka Streams.
https://hortonworks.com/webinar/hortonworks-dataflow-hdf-3-3-taking-stream-processing-next-level/
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy - Hortonworks
Forrester forecasts* that direct spending on the Internet of Things (IoT) will exceed $400 Billion by 2023. From manufacturing and utilities, to oil & gas and transportation, IoT improves visibility, reduces downtime, and creates opportunities for entirely new business models.
But successful IoT implementations require far more than simply connecting sensors to a network. The data generated by these devices must be collected, aggregated, cleaned, processed, interpreted, understood, and used. Data-driven decisions and actions must be taken, without which an IoT implementation is bound to fail.
https://hortonworks.com/webinar/iot-predictions-2019-beyond-data-heart-iot-strategy/
Getting the Most Out of Your Data in the Cloud with Cloudbreak - Hortonworks
Cloudbreak, a part of Hortonworks Data Platform (HDP), simplifies the provisioning and cluster management within any cloud environment to help your business toward its path to a hybrid cloud architecture.
https://hortonworks.com/webinar/getting-data-cloud-cloudbreak-live-demo/
Johns Hopkins - Using Hadoop to Secure Access Log Events - Hortonworks
In this webinar, we talk with experts from Johns Hopkins as they share techniques and lessons learned in real-world Apache Hadoop implementation.
https://hortonworks.com/webinar/johns-hopkins-using-hadoop-securely-access-log-events/
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys - Hortonworks
Cybersecurity today is a big data problem. There’s a ton of data landing on you faster than you can load it, let alone search it. To make sense of it, we need to act on data-in-motion and use both machine learning and the most advanced pattern recognition system on the planet: your SOC analysts. Advanced visualization makes your analysts more efficient and helps them find the hidden gems, or bombs, in masses of logs and packets.
https://hortonworks.com/webinar/catch-hacker-real-time-live-visuals-bots-bad-guys/
Hortonworks DataFlow (HDF) 3.2 Release Raises the Bar on Operational Efficiency - Hortonworks
We have introduced several new features as well as delivered some significant updates to keep the platform tightly integrated and compatible with HDP 3.0.
https://hortonworks.com/webinar/hortonworks-dataflow-hdf-3-2-release-raises-bar-operational-efficiency/
Curing Kafka Blindness with Hortonworks Streams Messaging Manager - Hortonworks
With the growth of Apache Kafka adoption in all major streaming initiatives across large organizations, the operational and visibility challenges associated with Kafka are on the rise as well. Kafka users want better visibility in understanding what is going on in the clusters as well as within the stream flows across producers, topics, brokers, and consumers.
With no tools in the market that readily address the challenges of the Kafka Ops teams, the development teams, and the security/governance teams, Hortonworks Streams Messaging Manager is a game-changer.
https://hortonworks.com/webinar/curing-kafka-blindness-hortonworks-streams-messaging-manager/
Interpretation Tool for Genomic Sequencing Data in Clinical Environments - Hortonworks
The healthcare industry—with its huge volumes of big data—is ripe for the application of analytics and machine learning. In this webinar, Hortonworks and Quanam present a tool that uses machine learning and natural language processing in the clinical classification of genomic variants to help identify mutations and determine clinical significance.
Watch the webinar: https://hortonworks.com/webinar/interpretation-tool-genomic-sequencing-data-clinical-environments/
IBM+Hortonworks = Transformation of the Big Data Landscape - Hortonworks
Last year, IBM and Hortonworks jointly announced a strategic, deep partnership. Join us as we take a close look at the partnership's accomplishments and the joint road ahead with industry-leading analytics offerings.
View the webinar here: https://hortonworks.com/webinar/ibmhortonworks-transformation-big-data-landscape/
In this exclusive Premier Inside Out, you will hear from Druid committer and Staff Software Engineer Slim Bouguerra and Product Manager Will Xu. These Hortonworkers will explain the vision of these components, review new features, share some best practices, and answer your questions.
View the webinar here: https://hortonworks.com/webinar/hortonworks-premier-apache-druid/
Accelerating Data Science and Real Time Analytics at Scale - Hortonworks
Gaining business advantage from big data is moving beyond efficient storage and deep analytics on diverse data sources, toward using AI methods and analytics on streaming data to capture insights and take action at the edge of the network.
https://hortonworks.com/webinar/accelerating-data-science-real-time-analytics-scale/
Time Series: Applying Advanced Analytics to Industrial Process Data - Hortonworks
Thanks to sensors and the Internet of Things, industrial processes now generate a sea of data. But are you plumbing its depths to find the insight it contains, or are you just drowning in it? Now, Hortonworks and Seeq team to bring advanced analytics and machine learning to time-series data from manufacturing and industrial processes.
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ... - Hortonworks
Trimble Transportation Enterprise is a leading provider of enterprise software to over 2,000 transportation and logistics companies. They have designed an architecture that leverages Hortonworks Big Data solutions and Machine Learning models to power up multiple Blockchains, which improves operational efficiency, cuts down costs and enables building strategic partnerships.
https://hortonworks.com/webinar/blockchain-with-machine-learning-powered-by-big-data-trimble-transportation-enterprise/
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense - Hortonworks
For years, the healthcare industry has had problems of data scarcity and latency. Clearsense solved the problem by building an open-source Hortonworks Data Platform (HDP) solution while providing decades' worth of clinical expertise. Clearsense is delivering smart, real-time streaming data to its healthcare customers, enabling mission-critical data to feed clinical decisions.
https://hortonworks.com/webinar/delivering-smart-real-time-streaming-data-healthcare-customers-clearsense/
Making Enterprise Big Data Small with Ease - Hortonworks
Every division in an organization builds its own database to keep track of its business. When the organization becomes big, those individual databases grow as well. The data in each database becomes siloed, with no awareness of the data in the other databases.
https://hortonworks.com/webinar/making-enterprise-big-data-small-ease/
Driving Digital Transformation Through Global Data Management - Hortonworks
Using your data smarter and faster than your peers could be the difference between dominating your market and merely surviving. Organizations are investing in IoT, big data, and data science to drive better customer experience and create new products, yet these projects often stall in the ideation phase due to a lack of global data management processes and technologies. Your new data architecture may be taking shape around you, but your goal of globally managing, governing, and securing your data across a hybrid, multi-cloud landscape can remain elusive. Learn how industry leaders are developing their global data management strategies to drive innovation and ROI.
Presented at Gartner Data and Analytics Summit
Speaker:
Dinesh Chandrasekhar
Director of Product Marketing, Hortonworks
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features - Hortonworks
Hortonworks DataFlow (HDF) is the complete solution that addresses the most complex streaming architectures of today’s enterprises. More than 20 billion IoT devices are active on the planet today and thousands of use cases across IIOT, Healthcare and Manufacturing warrant capturing data-in-motion and delivering actionable intelligence right NOW. “Data decay” happens in a matter of seconds in today’s digital enterprises.
To meet all the needs of such fast-moving businesses, we have made significant enhancements and new streaming features in HDF 3.1.
https://hortonworks.com/webinar/series-hdf-3-1-technical-deep-dive-new-streaming-features/
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A... - Hortonworks
Join the Hortonworks product team as they introduce HDF 3.1 and the core components for a modern data architecture to support stream processing and analytics.
You will learn about the three main themes that HDF addresses:
Developer productivity
Operational efficiency
Platform interoperability
https://hortonworks.com/webinar/series-hdf-3-1-redefining-data-motion-modern-data-architectures/
Unlock Value from Big Data with Apache NiFi and Streaming CDC - Hortonworks
Apache NiFi is an easy-to-use, powerful, and reliable system for processing and distributing data. It provides an end-to-end platform that can collect, curate, analyze, and act on data in real time, on-premises or in the cloud, with a drag-and-drop visual interface. It is being used across industries on large amounts of data that had been stored in isolation, which made collaboration and analysis difficult.
Join industry experts from Hortonworks and Attunity as they explain how Apache NiFi and streaming CDC technology provides a distributed, resilient platform for unlocking the value of data in new ways.
“By 2015, organizations that build a modern information management system will outperform their peers financially by 20 percent.”
– Gartner, Mark Beyer, “Information Management in the 21st Century”
Where are we? Where does it go from here? What’s next?
- Community: new projects incubated (Falcon, Knox, and more); Hadoop 2 and the YARN-based architecture coming in for a landing (beta vote); certification of YARN-based applications, which Hortonworks just announced.
- Ecosystem: VCs invested $1.4B in Big Data companies in 2012, and 2013 is even bigger (now with huge investment into tools for accessing data in Hadoop, indicating “it has arrived”); virtually every provider that touches data in any shape has brought Hadoop in; job postings.
- Commercial adoption: projects going live at scale; amazing commercial use cases that you will hear more about, e.g. Cardinal Health and Home Depot, phenomenal examples of application to healthcare.
To frame up my talk, I chose this quote from Mark Beyer of Gartner: “By 2015, organizations that build a modern information management system will outperform their peers financially by 20 percent.” Whether it’s opening up new business opportunities or outperforming your competitors by 20% or more, the important point to be made is that big data technologies offer very real and compelling BUSINESS and FINANCIAL value, to go along with TECHNOLOGY that is able to do things never before possible.
So let’s set some context before digging into the Modern Data Architecture. While overly simplistic, this graphic represents the traditional data architecture:
- a set of data sources producing data
- a set of data systems to capture and store that data, most typically a mix of RDBMS and data warehouses
- a set of custom and packaged applications, as well as business analytics, that leverage the data stored in those data systems
This architecture is tuned to handle TRANSACTIONS and data that fits into relational database tables.
[CLICK] Fast-forward to recent years, and this traditional architecture has become PRESSURED by new sources of data that aren’t handled well by existing data systems. In the world of Big Data, we’ve got classic TRANSACTIONS as well as new sources of data that come from what I refer to as INTERACTIONS and OBSERVATIONS. INTERACTIONS come from such things as web logs, user click streams, social interactions and feeds, and user-generated content including video, audio, and images. OBSERVATIONS tend to come from the “Internet of Things”: sensors for heat, motion, and pressure, and RFID and GPS chips within such things as mobile devices, ATM machines, automobiles, and even farm tractors, are just some of the “things” that output observation data.
To get a sense of the scope of these NEW SOURCES of data, let’s look at some stats from IDC.
[CLICK] According to IDC, 2.8 ZB of data were created and replicated in 2012. A zettabyte, for those unfamiliar with the term, is 1 billion terabytes.
[CLICK] 85% of that is from new sources of data.
[CLICK] Within that 85%, machine-generated data is a key driver of the growth; that one new source of data alone is expected to grow 15X by 2020.
[CLICK] Fast-forward to 2020 and we’ll have 40 zettabytes of data in the digital universe. This represents 50-fold growth from the beginning of 2010 (implying roughly 40 / 50 = 0.8 ZB at the start of 2010).
[CLICK] Needless to say, wrestling that scale of data is like this poor guy trying to wrestle a champion sumo athlete: overwhelmed and outmatched, to say the least. Fortunately, your data architecture need not be outmatched.
As the volume of data has exploded, Enterprise Hadoop has emerged as a peer to traditional data systems. The momentum for Hadoop is NOT about a revolutionary replacement of traditional databases. Rather, it’s about adding a data system uniquely capable of handling big data problems at scale, and doing so in a way that integrates easily with existing data systems, tools, and approaches. This means it must interoperate with every layer of the stack:
- existing applications and BI tools
- existing databases and data warehouses, for loading data to and from the data warehouse
- development tools used for building custom applications
- operational tools for managing and monitoring
Mainstream enterprises want to get the benefits of new technologies in ways that leverage existing skills and integrate with existing systems.
So I’d like to walk you through a solution architecture focused on how new and existing data sources flow through this modern data architecture. The architecture starts with two major areas of data processing that are very familiar to enterprises:
1. Business Transactions & Interactions
2. Business Intelligence & Analytics
Enterprise IT has been connecting these systems via classic data integration and ETL processing for many years in order to deliver STRUCTURED and REPEATABLE business analytics. The business determines the questions to ask, and IT collects and structures the data needed to answer those questions.
[CLICK] As we’ve discussed, new data sources representing Interactions and Observations have come onto the scene, and Enterprise Hadoop has appeared as a new system capable of capturing ALL of this multi-structured data in one place. Hadoop acts as a “Data Lake”, if you will. Some call it a Data Reservoir, a Catch Basin, a Data Refinery, or the foundation for a Data Hub & Spoke architecture. Regardless of the name, it’s a place where ALL data can be brought together, flexibly aggregated, and transformed into useful formats that help fuel new insights for the business. Structure and schema are applied when needed, NOT as a prerequisite before landing the data.
[CLICK] The next step is about getting the data into the right format for the people and applications that need it. Some folks will earmark subsets of the Data Lake for data scientists, researchers, or particular departments to interact with. Tools like Hive and HBase are commonly used for interacting with Hadoop data directly. Others will directly integrate Enterprise Hadoop with Business Intelligence & Analytics solutions so they can obtain a 360° view of their customers and enhance their ability to more accurately understand the customer Interactions that lead to, or inhibit, their Transactions. Still others will run complex analytic models and calculations of key parameters in Hadoop and flow the results into online applications, with the goal of more accurately targeting customers with the best and most relevant offers, for example.
[CLICK] And to achieve a closed-loop analytics system, companies are leveraging Hadoop to cost-effectively retain large volumes of data for long periods of time. Keeping an active archive of the past 10 years of historical retail data enables companies to blend that data with 10 years of weather data, so they can analyze the impact of weather on the “Black Friday” selling season, for example.
The result? Customers now have an agile data architecture that enables them to maximize the value from ALL of their data: transactions + interactions + observations.
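To make the “interacting with Hadoop data directly” point concrete, here is a minimal sketch of an online read/write using the standard HBase Java client. The table name (“customer_profiles”), column family, row key, and values are illustrative assumptions, not anything from this deck; connection details come from hbase-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class DataLakeLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customer_profiles"))) { // hypothetical table

            // Write one cell: row key, column family "d", qualifier, and value are illustrative.
            Put put = new Put(Bytes.toBytes("customer-42"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("segment"), Bytes.toBytes("high-value"));
            table.put(put);

            // Low-latency point read: the kind of online access the notes attribute to HBase.
            Result result = table.get(new Get(Bytes.toBytes("customer-42")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("d"), Bytes.toBytes("segment"))));
        }
    }
}
```

Point reads like this return in milliseconds, which is what makes HBase suitable for serving Data Lake content to online applications.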
So as mainstream enterprises begin to store ALL of their data in one place, they will increasingly want to create applications that interact with that data in a wide variety of ways. While classic batch-oriented MapReduce is powerful, it’s just one of many application types people need.
[CLICK] Interactive SQL solutions running on or next to Hadoop have gotten lots of press over recent months. Online data systems that store their data in HDFS are on the rise, as are streaming and complex event processing solutions, and graph processing. In-memory data processing is another area. Even classic HPC Message Passing Interface apps are storing data in HDFS.
The first wave of Hadoop was about HDFS and MapReduce, where MapReduce had a split brain, so to speak: it was a framework for massive distributed data processing, but it also had all of the job management capabilities built into it.
The second wave of Hadoop is upon us, and a component called YARN has emerged that generalizes Hadoop’s cluster resource management in a way where MapReduce is now just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for “Yet Another Resource Negotiator”.
[CLICK] As I like to say, YARN enables applications to run natively IN Hadoop, versus ON HDFS or next to Hadoop.
[CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They’re adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways, with predictable performance and quality of service.
[CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years. And this slide shows just a sampling of open source projects that are or will be leveraging YARN in the not-so-distant future. For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN-enabled. Spark is an in-memory data processing system built at Berkeley that has recently been contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
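As a small, hedged illustration of the client side of “MapReduce is just one framework atop YARN”: a stock MapReduce driver that targets the YARN framework through standard Hadoop 2 configuration keys. The ResourceManager host and the input/output paths are placeholder assumptions for a real cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On Hadoop 2, MapReduce is just one YARN application type among many;
        // this standard property selects YARN as the execution framework.
        conf.set("mapreduce.framework.name", "yarn");
        // ResourceManager address (8032 is the default port); illustrative for your cluster.
        conf.set("yarn.resourcemanager.address", "rm-host:8032");

        Job job = Job.getInstance(conf, "example-on-yarn");
        job.setJarByClass(SubmitToYarn.class);
        // Mapper/Reducer classes omitted: the defaults pass records through unchanged.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The point is that nothing in the driver is YARN-specific beyond configuration, which is how the same cluster resources can simultaneously host Tez, Storm-on-YARN, Spark, and other application types.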
Hadoop 1.0 was architected for the large web properties. Hadoop 2.0 represents the next generation of the foundation of big data. Under development for nearly three years now, it is a more mature version of Hadoop that has been architected for broader use by the more generic enterprise. The main focus for this next generation has been the broader enterprise, which has very explicit requirements that are a little bit different from those of the typical web properties who first adopted Hadoop. Some of those requirements forced the community to rethink the approach; plus, our experience running Hadoop at Yahoo provided much insight into how we could architect things to make them better. Some of the critical features are listed here (go through them). Highlight workloads and explain how 2.0 is engineered to meet these exacting demands; there is a graphic to help illustrate. We have moved beyond just batch.
Since Enterprise Hadoop lies at the heart of the next-generation data architecture, it needs to provide the services and features that make it an enterprise-viable data platform.
At the center, we start with Apache Hadoop for distributed file storage and data processing (a la HDFS, MapReduce, and YARN).
[CLICK] Beyond that core, we need to address enterprise concerns such as high availability, disaster recovery, snapshots, security, etc. The community has been hard at work in both the 1.0 and 2.0 lines of Hadoop addressing these needs.
[CLICK] On top of this, we need to provide data services that make it easy to move data in and out of the platform, process and transform the data into useful formats, and enable people and other systems to access the data easily. This is where components such as Apache Hive for SQL access, HCatalog for describing and managing your tables within Hadoop, Pig for script-based data processing, HBase for online data serving, and Sqoop and Flume for getting data into Hadoop fit in.
[CLICK] It’s also important (I would argue equally important) to make the platform easy to operate. Components like Apache Ambari for provisioning, management, and monitoring of the cluster, Oozie for job and workflow scheduling, and a new framework called Apache Falcon for data lifecycle management fit here.
[CLICK] So all of that (core and platform services, data services, and operational services) comes together into what I think of as “Enterprise Hadoop”.
[CLICK] Ensuring that Enterprise Hadoop can be flexibly deployed across operating systems and virtual environments like Linux, Windows, and VMware is important. Targeting cloud environments like Amazon Web Services, Microsoft Azure, Rackspace OpenCloud, and OpenStack is increasingly important, as is the ability to provide Enterprise Hadoop pre-configured within a hardware appliance like Teradata’s Big Analytics Appliance, which helps enterprises deploy Hadoop quickly, easily, and in a familiar way.
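To ground the MapReduce core in something concrete, here is the canonical word-count mapper and reducer against the Hadoop 2 API, a minimal sketch rather than anything specific to this deck; it would be wired into a driver (like the one sketched earlier) via job.setMapperClass(...), job.setCombinerClass(...), and job.setReducerClass(...).

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every token in the input split.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: sum the counts for each word; the same class also works as a combiner.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}
```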
As mentioned previously, SQL for Hadoop has been a hot topic for the past six months or so, and rightly so: there are easily millions of people with SQL skills who would like to leverage those skills as they look to gain insight and value from data stored in Hadoop. With that as backdrop, the Stinger Initiative was rolled out at the beginning of the year. Its focus is to rally the Apache Hive community around two goals: making Hive 100X faster, so it can handle interactive querying use cases, and making Hive more SQL-compliant, so its BI use cases are richer. And, by the way, this work needs to happen in a way that PRESERVES Hive’s awesome capability of processing ginormous data sets. As part of the Stinger Initiative, a new data processing framework has emerged as a sibling to MapReduce. This project is called Apache Tez, and it handles the interactive querying use cases for Hive by eliminating needless HDFS writes that have traditionally slowed Hive down. Since Tez is built on YARN, interactive SQL querying use cases can now run natively IN Hadoop and coexist nicely with classic MapReduce processing, yielding predictable performance and SLAs for apps running in the cluster.
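For a sense of what this looks like from a client, here is a hedged sketch of querying Hive over JDBC and selecting the Tez engine for the session. The HiveServer2 host, credentials, and the “clickstream” table are illustrative assumptions; the hive-jdbc driver must be on the classpath, and hive.execution.engine=tez is the documented switch once Hive’s Tez support is installed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveInteractiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
        // HiveServer2 endpoint; host, port, database, and user are illustrative assumptions.
        String url = "jdbc:hive2://hive-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement()) {
            // Route this session through Tez instead of classic MapReduce.
            stmt.execute("SET hive.execution.engine=tez");
            // "clickstream" is a hypothetical table standing in for Data Lake content.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page")) {
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }
}
```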
Everybody’s adopting Hadoop as a data processing platform, because it accepts any kind of data and can process it at almost any scale. But as people adopt Hadoop and throw all this data on it, they start to find other challenges. For example: How do you ensure data is being processed reliably? How do you know you’re not keeping data that is too old? If you process data globally, how do you deal with multi-datacenter replication?
The challenge is that the tools that exist for Hadoop, including Oozie, DistCp, and others, operate at a very low level, so you need expert developers to build and test data processing solutions. This sort of custom development takes a lot of time and money, and it is error-prone precisely because you are working at such a low level. Still, everybody does it this way because there aren’t real alternatives. I see a lot of people who use custom scripts to delete files when they get too old; this approach has a lot of drawbacks. Hadoop traditionally doesn’t provide native tools that solve problems like retention, anonymization, and reprocessing.
Falcon solves this by letting developers work at a much higher level of abstraction. Falcon provides native APIs for data processing, retention, replication, and more, abstracting away low-level tools like scheduling and the mechanical details of replication. With Falcon, developers do more, do it more easily, and avoid common mistakes. Avoiding common mistakes is probably the most important thing: data management on Hadoop is not easy, and Falcon was developed by engineers who worked on large-scale data management at Yahoo, complete with all the battle scars that brings. Falcon has a lot of practical lessons learned baked into its APIs, ready for developers to simply use.
Question: What data lifecycle management needs do you have in your environment?
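To illustrate “a much higher level of abstraction”, here is a hedged sketch of a Falcon feed entity that declares retention instead of scripting it, loosely based on the feed entity format in Falcon’s documentation; the feed name, cluster, paths, frequency, and dates are all illustrative assumptions.

```xml
<!-- Hypothetical feed: raw click data kept for 90 days, then deleted by Falcon. -->
<feed name="raw-clicks" description="clickstream landing data" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primary-cluster" type="source">
      <validity start="2013-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <!-- Declarative retention replaces hand-rolled "delete old files" scripts. -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="etl" group="hadoop" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

Submitted via Falcon’s entity CLI (along the lines of falcon entity -submit -type feed -file raw-clicks.xml), a definition like this replaces the custom delete-old-files scripts described above.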
Apache Knox perimeter security:
- Operators can firewall the cluster without giving end users access to a “gateway node”
- Users see one cluster end-point that aggregates capabilities for data access, metadata, and job control
- Provides perimeter security to make Hadoop security setup easier
- Enables integration with enterprise and cloud identity management environments
- Verification: verify the identity token; SAML; propagation of identity
- Authentication: establish identity at the gateway; authenticate with LDAP + AD
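Concretely, “one cluster end-point” means a client talks HTTPS to the gateway instead of reaching individual Hadoop services behind the firewall. A minimal sketch, assuming a Knox-style gateway at gateway-host:8443 with the default topology, a trusted TLS certificate, and BASIC auth against LDAP/AD; the host, credentials, and path are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class GatewayListStatus {
    public static void main(String[] args) throws Exception {
        // Single HTTPS entry point; the gateway proxies the request to WebHDFS inside the firewall.
        URL url = new URL("https://gateway-host:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Identity is established at the gateway (e.g., against LDAP/AD); BASIC auth here.
        String creds = Base64.getEncoder().encodeToString("analyst:secret".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + creds);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) System.out.println(line); // JSON directory listing
        }
    }
}
```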
One thing I’ve learned in my last 10 years of working in the enterprise open source arena is that it’s best to think of “community” in a broad way. In the Hadoop space, there is clearly the open source community; without the innovative Apache open source technology, none of us would be here today. There’s also the end user community, which spans the tech-savvy early adopter types as well as the more pragmatic and conservative adopters. The third piece is the broader ecosystem that integrates with, extends, enhances, and builds on the platform.
Now let’s expand the scope to include ALL of the sponsors! I love this slide because it is very BUSY! The cool thing is that we have almost 70 sponsors providing really nice coverage across all layers of the data stack. This is a great sign that the Hadoop market is maturing quite nicely!
Hadoop Wave ONE started in 2006 and did a GREAT job at Web-scale Batch-oriented data processing. A vibrant community and strong enterprise interest propelled Hadoop across the Chasm at the end of 2012.
The 2nd wave of Hadoop has started and it will continue to fuel Hadoop on its path through mainstream adoption. Everyone in this room is at the forefront of a movement that will have lasting impact across the industry. Hadoop has the opportunity to process half the world’s data. There’s still a lot of work to be done.