This document discusses big data and Hadoop. It notes that traditional technologies are not well suited to the volume of data generated today. Hadoop, which grew out of work at Yahoo building on designs published by Google, addresses this challenge through its distributed file system (HDFS) and its processing framework (MapReduce). The document promotes Hadoop and the Hortonworks Data Platform for storing, processing, and analyzing large volumes of diverse data cost-effectively.
What is Hadoop: a brief intro for the Georgian Partners CTO Conference. This outlines the origins of open source Apache Hadoop and how Hortonworks fits into the picture. There is also a brief introduction to YARN, the new resource negotiation layer.
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture (Adam Muise)
An introduction to Hadoop's core components as well as the core Hadoop use case: the Data Lake. This deck was delivered at Big Data Congress 2014 in Saint John, NB on Feb 24.
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |... (Edureka!)
This Edureka "What is Hadoop" Tutorial (check our Hadoop blog series here: https://goo.gl/lQKjL8) will help you understand all the basics of Hadoop. Learn in detail about the differences between the traditional and Hadoop ways of storing and processing data. Below are the topics covered in this tutorial:
1) Traditional Way of Processing - SEARS
2) Big Data Growth Drivers
3) Problem Associated with Big Data
4) Hadoop: Solution to Big Data Problem
5) What is Hadoop?
6) HDFS
7) MapReduce
8) Hadoop Ecosystem
9) Demo: Hadoop Case Study - Orbitz
Subscribe to our channel to get updates.
Check our complete Hadoop playlist here: https://goo.gl/4OyoTW
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha... (Edureka!)
This Edureka "Hadoop Tutorial For Beginners" (Hadoop blog series: https://goo.gl/LFesy8) will help you understand the problems traditional systems face when processing Big Data and how Hadoop solves them. This tutorial provides a comprehensive idea of HDFS and YARN along with their architecture, explained in a very simple manner using examples and a practical demonstration. At the end, you will learn how to analyze an Olympic data set using Hadoop and gain useful insights.
Below are the topics covered in this tutorial:
1. Big Data Growth Drivers
2. What is Big Data?
3. Hadoop Introduction
4. Hadoop Master/Slave Architecture
5. Hadoop Core Components
6. HDFS Data Blocks
7. HDFS Read/Write Mechanism
8. What is MapReduce
9. MapReduce Program
10. MapReduce Job Workflow
11. Hadoop Ecosystem
12. Hadoop Use Case: Analyzing Olympic Dataset
Big Data Analytics Tutorial | Big Data Analytics for Beginners | Hadoop Tutor... (Edureka!)
This Edureka Big Data Analytics Tutorial will help you understand the basics of the Big Data domain and learn how to analyze Big Data. Below are the topics covered in this tutorial:
1) Big Data Introduction
2) What is Big Data Analytics?
3) Why Big Data Analytics?
4) Stages in Big Data Analytics
5) Big Data Analytics Domains
6) Big Data Analytics Use Cases
Subscribe to our channel to get updates.
Check our complete Hadoop playlist here: https://goo.gl/4OyoTW
A presentation regarding big data. The presentation also covers the basics of Hadoop and Hadoop components along with their architecture. Contents of the PPT:
1. Understanding Big Data
2. Understanding Hadoop & Its Components
3. Components of Hadoop Ecosystem
4. Data Storage Component of Hadoop
5. Data Processing Component of Hadoop
6. Data Access Component of Hadoop
7. Data Management Component of Hadoop
8. Hadoop Security Management Tools: Knox, Ranger
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo... (Edureka!)
This Hadoop Tutorial on Hadoop Interview Questions and Answers (Hadoop Interview blog series: https://goo.gl/ndqlss) will help you prepare for Big Data and Hadoop interviews. Learn the most important Hadoop interview questions and answers and know what will set you apart in the interview process. Below are the topics covered in this Hadoop Interview Questions and Answers tutorial:
Hadoop Interview Questions on:
1) Big Data & Hadoop
2) HDFS
3) MapReduce
4) Apache Hive
5) Apache Pig
6) Apache HBase and Sqoop
Check our complete Hadoop playlist here: https://goo.gl/4OyoTW
#HadoopInterviewQuestions #BigDataInterviewQuestions #HadoopInterview
Database is the new black. Ever the backbone of information architectures, database technology continually evolves to meet growing and changing business needs. New types of data and applications make the database more important than ever, and understanding which technology best serves your use case is paramount to building durable systems. These days, the choices are many, so users should be careful when deciding which direction to go. Register for this Exploratory Webcast to hear veteran database Analyst Dr. Robin Bloor explain why the database market has exploded in recent years. He'll outline the current database landscape, and provide insights about which kinds of technologies are suitable for the growing variety of business needs today. He'll also focus on key auxiliary technologies that enable modern databases to perform efficiently.
Big Data Warehousing: Pig vs. Hive Comparison (Caserta)
In a recent Big Data Warehousing Meetup in NYC, Caserta Concepts partnered with Datameer to explore big data analytics techniques. In the presentation, we made a Hive vs. Pig Comparison. For more information on our services or this presentation, please visit www.casertaconcepts.com or contact us at info (at) casertaconcepts.com.
http://www.casertaconcepts.com
Creating a Data Science Team from an architect's perspective. This is about team building: how to support a data science team with the right staff, including data engineers and DevOps.
Big Data with Hadoop and HDInsight. This is an intro to the technology. If you are new to Big Data or have just heard of it, this presentation will help you learn a little more about the technology.
Hadoop, SQL & NoSQL: No Longer an Either-or Question (Tony Baer)
It used to be black and white. If you needed MapReduce processing, you chose Hadoop; if you needed standard query and reporting, you chose a SQL data warehouse. The decision is no longer clear cut. With YARN clearing the way for Hadoop to accept multiple workloads, Hadoop is no longer your father's MapReduce machine, as frameworks are rapidly emerging for interactive SQL, search, streaming and other workloads. We are on the path toward a federated world of analytic and operational decision stores, but as the boundaries between platform types grow fuzzier, deciding which platforms to use and where to run which workloads grows trickier.
Content presented at a talk on Aug. 29th. The purpose is to inform a fairly technical audience on the primary tenets of Big Data and the Hadoop stack. Also includes a walk-through of Hadoop and some of the Hadoop stack, i.e. Pig, Hive, HBase.
Boost Performance with Scala – Learn From Those Who’ve Done It! (Cécile Poyet)
Scalding is a Scala DSL for Cascading. Running on Hadoop, it's a concise, functional, and very efficient way to build big data applications. One significant benefit of Scalding is that it allows easy porting of Scalding apps from MapReduce to newer, faster execution fabrics.
In this webinar, Cyrille Chépélov, of Transparency Rights Management, will share how his organization boosted the performance of their Scalding apps by over 50% by moving away from MapReduce to Cascading 3.0 on Apache Tez. Dhruv Kumar, Hortonworks Partner Solution Engineer, will then explain how you can interact with data on HDP using Scala and leverage Scala as a programming language to develop Big Data applications.
Are you confused by Big Data? Get in touch with this new "black gold" and familiarize yourself with undiscovered insights through our complimentary introductory lesson on Big Data and Hadoop!
Today, when data is mushrooming and coming in heterogeneous forms, there is a growing need for a flexible, adaptable, efficient, and cost-effective integration platform that requires minimal onboarding time and can interact with any number of platforms. Talend fits perfectly in this space with a proven track record, so learning Talend makes a lot of sense for anybody associated with the data world.
If you understand how to manage, transform, and store your organisation's data (retail, banking, airlines, research, insurance, cards, etc.) and represent it effectively, which is the backbone of any successful MIS/reporting/dashboard system, then you are the kind of key person organisations most seek out.
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos (Lester Martin)
A walk-thru of core Hadoop, the ecosystem tools, and Hortonworks Data Platform (HDP) followed by code examples in MapReduce (Java and C#), Pig, and Hive.
Presented at the Atlanta .NET User Group meeting in July 2014.
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016 (StampedeCon)
This session will detail best practices for architecting, building, operating and managing an Analytics Data Lake platform. Key topics will include:
1) Defining next-generation Data Lake architectures. The de facto standard has been commodity DAS servers with HDFS, but there are now multiple solutions aimed at separating compute and storage, virtualizing or containerizing Hadoop applications, and utilizing Hadoop-compatible or embedded HDFS filesystems. This portion will explore the options available, and the pros and cons of each.
2) Data Ingest. There are many ways to load data into a Data Lake, including standardized Apache tools (Sqoop, Flume, Kafka, Storm, Spark, NiFi), standard file and object protocols (SFTP, NFS, REST, WebHDFS), and proprietary tools (e.g., Zaloni Bedrock, DataTorrent). This section will explore these options in the context of best fit to workflows; it will also look at key gaps and challenges, particularly in the areas of data formats and integration with metadata/cataloging tools.
3) Metadata & Cataloguing. One of the biggest inhibitors of successful Data Lake deployments is Data Governance, particularly in the areas of indexing, cataloguing and metadata management. It is nearly impossible to run analytics on top of a Data Lake and get meaningful & timely results without solving these problems. This portion will explore both emerging open standards (Apache Atlas, HCatalog) and proprietary tools (Cloudera Navigator, Zaloni Bedrock/Mica, Informatica Metadata Manager), and balance the pros, cons and gaps of each.
4) Security & Access Controls. Solving these challenges is key for adoption in regulatory-driven industries like Healthcare & Financial Services. There are multiple Apache projects and proprietary tools to address this, but the challenge is making security and access controls consistent across the entire application and infrastructure stack, and over the data lifecycle, and being able to audit this in the face of legal challenges. This portion will explore available options and best practices.
5) Provisioning & Workflow Management. The real promise of the Data Lake is integrating Analytics workflows and tools on converged infrastructure-with shared data-and build “As A Service” oriented architectures that are oriented towards self-service data exploration and Analytics for end users. This is an emerging and immature area, but this session will explore some potential concepts, tools and options to achieve this.
This will be a moderately technical session, with the above topics being illustrated by real world examples. Attendees should have basic familiarity with Hadoop and the associated Apache projects.
Enough talking about Big Data and Hadoop; let's see how Hadoop works in action.
We will locate a real dataset, ingest it into our cluster, connect it to a database, apply some queries and data transformations to it, save our result, and show it via a BI tool.
Hadoop and the Data Warehouse: Point/Counter Point (Inside Analysis)
Robin Bloor and Teradata
Live Webcast on April 22, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6
Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating the value of that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.
Visit InsideAnalysis.com for more information.
How to use Hadoop for operational and transactional purposes by RODRIGO MERI... (Big Data Spain)
Hadoop is an open source framework designed to rapidly ingest, store, and analyze large data sets. Hadoop is well suited for batch processing where immediate interactive analytics are not required, but today Hadoop does not support operational and transactional workloads, which consist of a constant flow of transactions requiring low-latency response times for read/write access.
2015 nov 27_thug_paytm_rt_ingest_brief_final (Adam Muise)
Paytm Labs provides a quick overview of their Hadoop data ingest platform. We cover our journey from a batch-focused ingest system built on Sqoop to streaming ingest supported by Kafka, Confluent.io, Hadoop, Cassandra, and Spark Streaming. This presentation also provides an overview of our complete data platform, including our feature creation template.
Moving to a data-centric architecture: Toronto Data Unconference 2015 (Adam Muise)
Why use a data lake? Why use lambda? A conversation starter for Toronto Data Unconference 2015. We will discuss technologies such as Hadoop, Kafka, Spark Streaming, and Cassandra.
An overview of securing Hadoop. Content primarily by Balaji Ganesan, one of the leaders of the Apache Argus project. Presented on Sept 4, 2014 at the Toronto Hadoop User Group by Adam Muise.
Sept 17 2013 - THUG - HBase a Technical Introduction (Adam Muise)
HBase Technical Introduction. This deck includes a description of memory design, write path, read path, some operational tidbits, SQL on HBase (Phoenix and Hive), as well as HOYA (HBase on YARN).
4. We do Hadoop
The leaders of Hadoop's development
Community driven, Enterprise Focused
Drive Innovation in the platform: We lead the roadmap
100% Open Source: Democratized Access to Data
5. We do Hadoop successfully.
Support
Professional Services
Training
18. The solution?
[Diagram: separate silos, an EDW, Yet Another EDW, an Analytical DB, OLTP, and Another EDW, each packed with its own copies of data]
19. Ummm… you dropped something
[Diagram: the same silos, EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW, with loose data scattered everywhere between them]
22. Wait, you’ve seen this before.
[Diagram: piles of raw data pushed through an "Analytics Sausage Factory", with a thin trickle of data coming out the other side]
25. “Prices, Stupid passwords, and Boring Statistics.” - Hans Rosling
http://www.youtube.com/watch?v=hVimVzgtD6w
26. Your data silos are lonely places.
[Diagram: four isolated silos full of data: EDW, Accounts, Customers, Web Properties]
27. … Data likes to be together.
[Diagram: the same four silos, EDW, Accounts, Customers, Web Properties, drawn side by side]
28. Data likes to socialize too.
[Diagram: EDW, Accounts, Customers, and Web Properties joined by new sources: Machine Data, Twitter, Facebook, CDR, Weather Data]
29. New types of data don’t quite fit into your pristine view of the world.
[Diagram: "My Little Data Empire" beside incoming Logs and Machine Data, each tagged with question marks]
30. To resolve this, some people take hints from Lord Of The Rings...
32. …but that has its problems too.
[Diagram: two EDWs, each forcing incoming data through a single schema via chains of ETL jobs]
33. What if the data was processed and stored centrally? What if you didn’t need to force it into a single schema?
We call it a Data Lake.
[Diagram: data sources landing in a central Data Lake, where data is processed in place and then served through multiple schemas to the EDW and BI & Analytics tools]
34. A Data Lake Architecture enables:
- Landing data without forcing a single schema
- Landing a variety and large volume of data efficiently
- Retaining data for a long period of time with a very low $/TB
- A platform to feed other Analytical DBs
- A platform to execute next gen data analytics and processing applications (SAS, Informatica, Graph Analytics, Machine Learning, SAP, etc…)
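To make "landing data without forcing a single schema" concrete, here is a minimal Java sketch using Hadoop's FileSystem API: it drops a raw file into an HDFS landing zone exactly as it arrived. The NameNode address and the /datalake paths are illustrative assumptions, not something prescribed by the deck; any schema would be applied later, at read time, by whichever engine consumes the files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LandRawData {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);
        // Land the file exactly as it arrived; no schema is imposed on write.
        fs.copyFromLocalFile(new Path("/tmp/clickstream-2014-02-24.log"),
                             new Path("/datalake/raw/clickstream/"));
        fs.close();
    }
}

Because nothing is transformed on write, the same raw files can later feed Hive, Pig, or a downstream analytical DB, each with its own schema.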
35. In most cases, more data is better. Work with the population, not just a sample.
36. Your view of a client today.
Male
Female
Age: 25-30
Town/City
Middle Income Band
Product Category Preferences
37. Your view with more data.
Male
Female
Age: 27 but feels old
GPS coordinates
$65-68k per year
Product recommendations
Tea Party
Hippie
Looking to start a business
Walking into Starbucks right now…
A depressed Toronto Maple Leafs fan
Products left in basket indicate a drunk Amazon shopper
Gene Expression for Risk Taker
Thinking about a new house
Unhappy with his cell phone plan
Pregnant
Spent 25 minutes looking at tea cozies
44. If you could design a system that would handle this, what would it look like?
45. It would probably need a highly resilient, self-healing, cost-efficient, distributed file system…
[Diagram: a grid of Storage nodes]
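As a rough illustration of that file system layer, the sketch below writes a small file to HDFS and reads it back through Hadoop's FileSystem API; the fs.defaultFS address and path are assumptions for the example. Replication is what delivers the resilience and self-healing: each block is stored on several DataNodes, and the cluster re-replicates blocks when a node dies.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical address
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/hello.txt");

        // Write: the client streams blocks to DataNodes; each block is
        // replicated (3x by default), which makes the file system self-healing.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello, distributed file system\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back through the same abstraction, wherever the blocks live.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}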
46. It would probably need a completely parallel processing framework that took tasks to the data…
[Diagram: a Processing layer stacked on top of each Storage node]
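That framework is MapReduce. Below is the standard word-count example (not code from this deck) showing the model: map tasks are scheduled on the nodes that hold the input blocks, emit key/value pairs locally, and reducers aggregate by key. Input and output paths come from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                ctx.write(word, ONE); // emitted on the node holding the block
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // pre-aggregate on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}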
47. It would probably run on commodity hardware, virtualized machines, and common OS platforms
[Diagram: the same grid of storage and processing nodes]
48. It would probably be open source so innovation could happen as quickly as possible
55. The Sandbox is ‘Hadoop in a Can’. It contains one copy of each of the Master and Worker node processes used in a cluster, only in a single virtual node.
[Diagram: a full cluster of storage and processing nodes collapsed into a single Linux VM running one Storage and one Processing process]
56. Getting started with the Sandbox VM:
- Pick your flavor of VM at http://www.hortonworks.com/sandbox
- Start the sandbox VM and find the IP displayed
- Go to that address in a browser (e.g. http://172.16.130.137)
- Register
- Click on ‘Start Tutorials’
- In the left-hand nav, click on ‘HCatalog, Basic Pig & Hive Commands’
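Once the Sandbox is registered and running, it can also be reached programmatically. Below is a hedged sketch of querying Hive on the Sandbox over JDBC; the hostname, port, and credentials are assumptions for illustration (HiveServer2 conventionally listens on port 10000), and the tutorial itself walks through the equivalent Hive commands in the web UI.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SandboxHiveQuery {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hostname, port, and user are illustrative assumptions for a sandbox VM.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://sandbox.hortonworks.com:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // list tables in 'default'
            }
        }
    }
}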