Hadoop For Enterprises

Hadoop for Enterprise
rev 7

Rajesh Nadipalli
Mar 2012
rajesh.nadipalli@gmail.com

Hadoop getting attention
• Feb 2012: Microsoft, Hortonworks in partnership to develop Excel
plug-in for Hadoop

• Jan 2012: Oracle announces Big Data Appliance with Cloudera’s
Hadoop distribution

• Dec 2011: EMC released Unified Analytics Platform which includes
Greenplum Apache Hadoop distribution

• Oct 2011: Microsoft plans to add Hadoop support to SQL Server 2012

• May 2010: IBM introduces Hadoop-based InfoSphere BigInsights


In this Presentation…
 Big Data – Big Opportunities
  Hadoop for Enterprise – Reference Architecture
 Map Reduce Overview
 Hive
 References


BIG DATA – BIG
OPPORTUNITIES


Big Data – Business Opportunity
Enterprises today are challenged with:
 Exponential data growth
 Complex data needs: structured and unstructured
 Real-time insights with key indicators
 Heterogeneous environments: private and public clouds
 Tighter budgets and the need to do more with less

Traditional relational databases are not
able to scale and meet these challenges


http://blogs.forrester.com/brian_hopkins/11-08-29-big_data_brewer_and_a_couple_of_webinars

Big Data – 4 V’s (Forrester)


Why Hadoop?
Hadoop provides:
 Distributed file system
 Parallel computing across several nodes
 Support for structured and unstructured content
 Fault tolerance and linear scalability
 Open source under the Apache Software Foundation
 Increasing support from vendors
 Key philosophy: “moving compute is cheaper than moving data”
Forrester regards Hadoop as the nucleus of the next-generation EDW in the cloud.

Some users of Hadoop…
http://wiki.apache.org/hadoop/PoweredBy

• Use Hadoop to store copies of internal log and dimension data sources and use it as
a source for reporting/analytics and machine learning.
• Currently we have 2 major clusters: A 1100-machine cluster with 8800 cores and
about 12 PB raw storage. A 300-machine cluster with 2400 cores and about 3 PB raw
storage.
• Each (commodity) node has 8 cores and 12 TB of storage.

• Hadoop is used to analyze search logs and do some mining work on the web page database
• We handle about 3000 TB per week. Our clusters vary from 10 to 500 nodes.

• 532-node cluster (8 * 532 cores, 5.3 PB).
• Heavy usage of Java MapReduce, Pig, Hive, HBase
• Using it for search optimization and research

• 5-machine cluster (8 cores/machine, 5 TB/machine storage)
• Existing 19-virtual-machine cluster (2 cores/machine, 30 TB storage)
• Predominantly Hive and Streaming API based jobs (~20,000 jobs a week)
• Daily batch ETL; log analysis; data mining; machine learning


HADOOP REFERENCE
ARCHITECTURE


Hadoop for Enterprise – Technology Stack
User Experience: Ad-hoc queries, Notifications/Alerts, Analytics, Embedded Search
Data Access: Excel, Hive, Pig, Datameer, R (Rhipe, RBits)
Zookeeper (Orchestration, Quorum)
Pentaho (Scheduling, Integrations)
Data Processing: Mapreduce
Hadoop Data Store: Hbase (NOSQL DB), HDFS
Sqoop
Data Sources: Applications, Databases (internal), Log Files, RSS Feeds, Cloud, Others


Hadoop for BI – Reference Architecture
Data: RDBMS, XML, JSON, Binary, CSV, Log, Java Objects
Hadoop Distributed Computing Environment: Map Reduce on an N-node scalable cluster, Hadoop File System (HDFS)
Enterprise Apps: Dashboards, Excel, ERP and Enterprise Apps, RDBMS import


http://www.oracle.com/technetwork/database/focus-areas/bi-datawarehousing/wp-big-data-with-oracle-521209.pdf?ssSourceSiteId=ocomen

Oracle’s Big Data Solution

• Oracle sees Hadoop as a good fit for unstructured data sourcing and map reduce.
• It recommends using the Oracle database for the final analysis stage.
• Oracle Data Integrator can issue Hive queries (ETL).
• A wrapper on top of Sqoop, called Oraoop, is available for Oracle (see References).

DATA PROCESSING


Hadoop Mapreduce Overview
[Figure: Map Reduce process. Data (from HDFS) is split across Node 1, Node 2 and Node 3; each node runs a Map task, and a Reduce step combines the Map outputs into the results.]
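The split, map and reduce flow on this slide can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop's API: the function names (`split`, `map_phase`, `reduce_phase`), the word-count task and the sample lines are all assumptions made for the example.

```python
from collections import defaultdict

def split(data, n_nodes):
    """Split the input records into n_nodes chunks, one per (simulated) node."""
    return [data[i::n_nodes] for i in range(n_nodes)]

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in this node's chunk."""
    return [(word, 1) for line in chunk for word in line.split()]

def reduce_phase(mapped_chunks):
    """Reduce: group the pairs from all nodes by key and sum the counts."""
    totals = defaultdict(int)
    for pairs in mapped_chunks:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)

# Hypothetical input lines standing in for data read from HDFS.
lines = ["big data", "big opportunities", "data store", "big data"]
chunks = split(lines, 3)                                  # three nodes, as in the figure
results = reduce_phase([map_phase(c) for c in chunks])
print(results)
```

In a real cluster each Map task runs on the node that already holds its chunk, which is the "moving compute is cheaper than moving data" idea from the earlier slide.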


Map Reduce Tips
 The first step is to understand what data you have and how to feed it into the Hadoop distributed computing environment.

 Use distributed applications to analyze the massive data sets and surface opportunities in them.

 Hadoop stores your information for future queries, enhancing the exploratory capabilities (as well as historical reference) of your data.

DATA STORE


HDFS
 Distributed file system consisting of
◦ A single “Namenode”, which holds the metadata
◦ Several “Datanodes”
 Designed to run on commodity hardware
 Data is stored as blocks (64 MB)
 These blocks are replicated (typically 3 copies) to protect against hardware failures
 Access via Java APIs or the hadoop command line ($hadoop fs…)
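The block and replication figures above translate into simple arithmetic. The sketch below uses the 64 MB block size and 3 copies quoted on the slide; the helper name `hdfs_footprint` and the 1000 MB example file are assumptions for illustration.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # typical number of copies quoted on the slide

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB consumed) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # A final partial block only occupies its actual size, so raw usage
    # is the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb

blocks, replicas, raw_mb = hdfs_footprint(1000)   # hypothetical 1000 MB file
print(blocks, replicas, raw_mb)
```

This is why cluster sizes are usually quoted as raw storage: usable capacity is roughly the raw figure divided by the replication factor.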


http://hadoop.apache.org/common/docs/current/hdfs_design.html

HDFS architecture

The next Hadoop revision has a failover Namenode called “Avatar”


HBase
 Distributed, column-oriented database (NoSQL)
 Fault-tolerant
 Low latency
 HDFS aware
 Access via Java APIs or REST APIs
 It is not a replacement for an RDBMS
 Hbase is recommended when
◦ Data is searched by key (or key range)
◦ Data does not conform to a fixed schema (for instance, if the attributes change from record to record).

Hbase Architecture
[Diagram: Zookeeper; Hbase Master with Avatar (failover of master); four Region Servers.]

 Zookeeper maintains quorum and knows which server
is the master
 Master keeps track of regions and region servers
 Region servers store table regions


Hbase Column Storage
Hbase stores data like tags for a key;
for example:

Row        Column Family  Column        Cell
Star Wars  Cast           Cast:Actor1   Harrison Ford
           Cast           Cast:Actor2   Carrie Fisher
           Reviews        Review:IMDB   Review URL
           Reviews        Review:ET     Review URL2
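The "tags for a key" idea can be pictured as a nested dictionary. The sketch below is an in-memory stand-in, not the HBase client API: the `table` structure and the helpers `get`, `family` and `scan` are hypothetical names, and the only data is the Star Wars row from this slide.

```python
# Hypothetical stand-in for an HBase table: row key -> "family:qualifier" -> cell.
table = {
    "Star Wars": {
        "Cast:Actor1": "Harrison Ford",
        "Cast:Actor2": "Carrie Fisher",
        "Review:IMDB": "Review URL",
        "Review:ET": "Review URL2",
    },
}

def get(row_key, column):
    """Look up one cell by row key and column; None if absent."""
    return table.get(row_key, {}).get(column)

def family(row_key, family_name):
    """Return all columns of one column family for a row."""
    prefix = family_name + ":"
    return {c: v for c, v in table.get(row_key, {}).items() if c.startswith(prefix)}

def scan(start_key, stop_key):
    """Return rows whose key falls in [start_key, stop_key), in sorted key order."""
    return [(k, table[k]) for k in sorted(table) if start_key <= k < stop_key]

print(get("Star Wars", "Cast:Actor1"))
print(family("Star Wars", "Cast"))
```

Because columns are just keys inside a row, another row could carry entirely different columns, which is the "data does not conform to a schema" case from the HBase slide. The `scan` helper mirrors access by key range.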


DATA ACCESS


Hive Overview
 Data warehouse software built on top of Hadoop
 HiveQL provides a SQL-like interface; queries run as map reduce jobs
 Provides structure to HDFS data, similar to an Oracle external table
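"Providing structure to HDFS data" means the files stay as raw text and a table definition is applied only when they are read. A minimal Python sketch of that idea follows; it does not use Hive itself, and the sample rows, the `SCHEMA` list and the function names are assumptions for illustration.

```python
import csv
import io

# Hypothetical raw file content as it might sit in HDFS: untyped delimited text.
RAW = "1,alice,34.50\n2,bob,12.00\n3,alice,3.25\n"

# Hypothetical table definition: names and types are applied only at read time.
SCHEMA = [("order_id", int), ("customer", str), ("amount", float)]

def read_rows(raw, schema):
    """Apply the schema to each delimited line as it is read."""
    for fields in csv.reader(io.StringIO(raw)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}

def total_by_customer(rows):
    """Rough equivalent of a SQL-like SUM(amount) grouped by customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals

totals = total_by_customer(read_rows(RAW, SCHEMA))
print(totals)
```

The underlying file is never rewritten; dropping the table definition would leave the data in place, which is the parallel with an external table.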


Hive Architecture

[Diagram: Hive CLI (Browse, Query) and Hive QL feed the Parser and Execution (Map Reduce) components, alongside the Metastore and SerDe, all on top of HDFS.]


Pig Overview
 Pig is a layer on top of map-reduce for statisticians (programmers)
 It provides several standard operators: join, order by, etc.
 It allows user-defined functions (UDFs) to be included.
 Java or Python is supported for UDFs
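To give a feel for what the standard operators and a user-defined function do, here is a plain Python sketch. It is not Pig Latin and not Pig's UDF interface: the relations, the helper names (`join`, `order_by`, `upper_name`) and the data are all hypothetical.

```python
# Hypothetical relations, each a list of tuples.
users = [(1, "ann"), (2, "raj"), (3, "lee")]        # (user_id, name)
visits = [(1, 5), (3, 2), (1, 1), (2, 7)]           # (user_id, page_count)

def join(left, right, left_key=0, right_key=0):
    """Inner join of two relations on the given field positions."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [l + r for l in left for r in index.get(l[left_key], [])]

def order_by(relation, field, descending=False):
    """Sort a relation on one field position."""
    return sorted(relation, key=lambda row: row[field], reverse=descending)

def upper_name(row):
    """A user-defined function applied per row: upper-case the name field."""
    return (row[0], row[1].upper()) + row[2:]

joined = join(users, visits)
result = order_by([upper_name(r) for r in joined], field=3, descending=True)
print(result)
```

In Pig the same steps would be declared as a data flow, and the framework would turn them into map-reduce jobs.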


http://www.datameer.com/

Datameer Overview
Key philosophy: Business users understand Excel; let them
do the grouping, sorting, filtering, aggregates

Key Steps:
 Datameer’s source is a mapreduce output.
 Datameer takes a quick sample of 5000 records.
 The end user is then presented with an Excel-like interface on top of these 5000 records. This is where end users can define their filters, formulas, grouping, aggregations and joins across sheets (even joining hadoop data with data from a relational database table).
 Once the end user has defined the desired end result, they can submit a job to run on the complete data set.
 Datameer will then build the necessary map reduce jobs and run them on the complete data set.
 Finally, the user gets the results and can build charts, tables, etc., all in the browser.
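The design-on-a-sample, run-on-everything workflow described above can be sketched as follows. This is not Datameer's API: the function names, the generated data set and the choice to sample the first 5000 records are assumptions; only the sample size comes from the slide.

```python
import itertools

SAMPLE_SIZE = 5000  # sample size quoted on the slide

def take_sample(records, n=SAMPLE_SIZE):
    """Take the first n records as a quick preview sample (an assumed strategy)."""
    return list(itertools.islice(records, n))

def user_defined_steps(records):
    """Steps a user might design on the sample: filter, then group and aggregate."""
    totals = {}
    for region, amount in records:
        if amount > 0:                                        # filter
            totals[region] = totals.get(region, 0) + amount   # group + aggregate
    return totals

# Hypothetical full data set of (region, amount) records.
full = [("east" if i % 2 else "west", i % 10) for i in range(20000)]

preview = user_defined_steps(take_sample(iter(full)))   # fast feedback on 5000 rows
final = user_defined_steps(full)                        # same steps, complete data set
print(preview, final)
```

The point of the design is that the same step definition is reused unchanged; only the input grows from the sample to the complete data set.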

http://www.informationweek.com/news/software/info_management/232601675?cid=RSSfeed_IWK_News

Excel Integration
Microsoft announced Excel integration with Hadoop (Feb 2012), in partnership with Hortonworks

Key Highlights:
 Microsoft & Hortonworks will deliver a Hive ODBC driver that will enable integration with Excel
 Microsoft’s PowerPivot in-memory plug-in for Excel will handle larger data sets
 There is also a plan for a JavaScript framework for Hadoop enabling Ajax-like iterative

INTEGRATION,
SCHEDULING


Pentaho Data Integration
 Pentaho is rated a “strong performer” by Forrester (Feb 2012)
 It makes building MapReduce jobs easy via its Data Integration IDE.
 It can read from and write to HDFS, and run map reduce and Pig scripts
 The IDE has several standard connectors and transformations, and allows custom Java code
 http://www.pentaho.com/big-data/

http://www.youtube.com/watch?v=KZe1UugxXcs&feature=player_embedded

Pentaho Data Integration
1. Build Mapper
2. Build Reducer
3. Run Map Reduce


Talend - ETL
 Talend is another ETL development, scheduling and monitoring tool
 It supports HDFS, Pig, Hive, Sqoop
 http://www.talend.com/products-big-data/


Talend ETL – with Hadoop

• Can invoke Hadoop calls (generates Hive queries)
• See right slide “Processing”


USER EXPERIENCE


User Experience
This layer of the stack is generally custom development. However, some tools that work with Hadoop are:
 Tableau for data analysis &
visualizations
 SAS Enterprise Miner
 IBM BigInsights


REFERENCES


References
 http://hadoop.apache.org/
 http://www.cloudera.com/
 http://www-01.ibm.com/software/data/bigdata/
 http://www.cs.duke.edu/starfish/index.html
 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
 http://karmasphere.com/Download/karmasphere-studio-community-virtual-appliance-for-ibm.html
 http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation
 http://www.slideshare.net/trihug/trihug-november-pig-talk-by-alan-gates?from=ss_embed
 http://www.trihug.org/
 http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-690561.html
 http://www.cloudera.com/wp-content/uploads/2011/01/oraoopuserguide-With-OraHive.pdf

http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-690561.html


http://wiki.apache.org/hadoop/PoweredBy

Key Hadoop Players


MAP-R
 No single point of failure for the name node
 Performance improvements (5 times faster than HDFS)
 Snapshots, multi-site copies
 It has its own Mapreduce implementation (extended mapreduce)
 MapR uses 8K blocks instead of the 64 MB block size of HDFS


Open Topics – why there are adoption issues
 Security – no concept of roles
 Backup, Recovery
 ACID not supported


Thank You to my viewers


Questions / Comments

