YARN: the Key to overcoming the challenges of broad-based Hadoop Adoption


RedPoint Global Inc., 17 June 2014 © Confidential
Overview - What is Hadoop/Hadoop 2.0
• Lower cost scaling
• No need for structure
• Ease of data capture
Hadoop 1.0
• All operations based on MapReduce
• Intrinsic inconsistency of code-based solutions
• Highly skilled and expensive resources needed
• 3rd party applications constrained by the need to generate code
Hadoop 2.0
• Introduction of YARN: “a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.”
• Mature applications can now operate directly on Hadoop
• Reduced skill requirements and increased consistency

RedPoint Data Management on Hadoop
Diagram labels: Partitioning (AM / Tasks); Execution (AM / Tasks); Data I/O (Key / Split); Analysis (Parallel Section (UI)); YARN; MapReduce

Top Challenges to Adoption
Skills Gap
• Severe shortage of MR-skilled resources
• Resources are very expensive and hard to retain
• Inconsistent skills lead to inconsistent results
• Underutilizes existing resources
• Prevents broad leverage of investments across the enterprise
Maturity & Governance
• A nascent technology ecosystem around Hadoop
• Emerging technologies address only narrow slivers of functionality
• New applications are not enterprise class
• Legacy applications have built short-term capabilities
Data Into Information
• Data is not useful in its raw state; it must be turned into information
• Benefit of Hadoop is that the same data can be used from many perspectives
• Analysts must now structure the data based on its intended use

RedPoint Overcomes Challenges
First YARN-compliant ETL/data quality toolset on the market. It brings together both Big Data and traditional data to create Big Information!
RANKED #1 in:
• Customer or Party Data
• Processing Speed
• Match Quality
• Ease of Use
The power to make your data the biggest asset your organization has

Key features of RedPoint Data Management
ETL & ELT
• Profiling, reads/writes, transformations
• Single project for all jobs
Data Quality
• Cleanse data
• Parsing, correction
• Geo-spatial analysis
Integration & Matching
• Grouping
• Fuzzy match
Master Key Management
• Create keys
• Track changes
• Maintain matches over time
Web Services Integration
• Consume and publish
• HTTP/HTTPS protocols
• XML/JSON/SOAP formats
Process Automation & Operations
• Job scheduling, monitoring, notifications
• Central point of control
All functions can be used on both TRADITIONAL and BIG DATA
Creates clean, integrated, actionable data: quickly, reliably and at low cost

RedPoint Functional Footprint
Diagram labels:
• Monitoring and Management Tools: AMBARI
• DATA REFINEMENT: MAPREDUCE, HIVE, PIG
• STRUCTURE: HCATALOG (metadata services)
• INTERACTIVE: HIVE Server2
• STREAM; REST; HTTP
• LOAD: SQOOP, WebHDFS, Flume, NFS
• LOAD: SQOOP/Hive, WebHDFS
• YARN; HDFS (nodes 1 to n)
• Query/Visualization/Reporting/Analytical Tools and Apps
• SOURCE DATA: Sensor Logs, Clickstream, Flat Files, Unstructured, Sentiment, Customer, Inventory
• Data Sources: DBs, JMS Queues, Files, RDBMS, EDW

No Coding Necessary
PREVIOUS OPTIONS
Use MapReduce
• complex
• requires new skills
• inefficient execution
Move data out of Hadoop
• extra time and effort
• extra storage (expensive)
• defeats the purpose of Hadoop
WITH REDPOINT
the only pure YARN data management platform
For data management in Hadoop:
• Easy-to-use interface
• Leverages existing skills
• Executes in Hadoop 2.0 (using YARN architecture)
• Fast: no MapReduce
• Can combine Big Data with traditional data
• Data becomes actionable by RedPoint Interaction
Makes Hadoop data management easy, fast, low-cost.
Makes Big Data clean, integrated, usable.
You get more out of your Big Data investment.

RedPoint DM for Hadoop: Processing Flow
Components: Data Management Designer; DM Execution Server; Resource Manager; Node Managers (hosting the DM App Master and DM Tasks)
Flow labels: Launches DM App Master; Launches Tasks; Parallel Section Running DM Task (the slide numbers three steps, 1 to 3)

The Data Management designer

DM Parallel Section on Hadoop

DM Hadoop Settings

Benchmarks: Project Gutenberg
Compared: MapReduce, Pig, RedPoint
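Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // WordOffset is not a Hadoop built-in type; it is assumed to be
    // defined elsewhere in the full program.
    public static class MapClass
        extends Mapper<WordOffset, Text, Text, IntWritable> {
      // Characters treated as token separators (the double quote is escaped).
      private final static String delimiters =
          "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ |«»¡¢£¤¥¦©¬®¯±¶·¿";
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      // Emits (token, 1) for every token in the input line.
      public void map(WordOffset key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line, delimiters);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }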
Sample Pig script without the UDF:
SET pig.maxCombinedSplitSize 67108864
SET pig.splitCombination true
A = LOAD '/testdata/pg/*/*/*';
B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0
C = FOREACH B GENERATE UPPER(word) AS word;
D = GROUP C BY word;
E = FOREACH D GENERATE COUNT(C) AS occurrences, group
F = ORDER E BY occurrences DESC;
STORE F INTO '/user/cleonardi/pg/pig-count';
MapReduce
• >150 lines of MR code
• 6 hours of development
• 6 minutes runtime
• Extensive optimization needed
Pig
• ~50 lines of script code
• 3 hours of development
• 15 minutes runtime
• User Defined Functions required prior to running script
RedPoint
• 0 lines of code
• 15 min. of development
• 3 minutes runtime
• No tuning or optimization required
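
For readers who want to see what the benchmark task computes, independent of any framework, here is a minimal single-machine sketch in plain Java. It is not the benchmark code and uses neither Hadoop nor RedPoint: the class name `WordCountSketch`, its method names, and the shortened delimiter set are all assumptions made for illustration. It tokenizes lines, upper-cases each token (as the Pig sample does), counts occurrences, and orders the result by count, highest first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical illustration class; not part of Hadoop, Pig, or RedPoint.
public class WordCountSketch {
    // Shortened, ASCII-only stand-in for the delimiter set in the MapReduce sample.
    private static final String DELIMITERS = " \t\n\r',./<>?;:\"[]{}-=_+()&*%^#$!@`~|";

    /** Counts upper-cased tokens across all lines (tokenize and count in one pass). */
    public static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line, DELIMITERS);
            while (tokens.hasMoreTokens()) {
                String word = tokens.nextToken().toUpperCase(Locale.ROOT);
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the entries ordered by count, highest first. */
    public static List<Map.Entry<String, Integer>> byCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("The quick brown fox.", "The lazy dog; the end!");
        for (Map.Entry<String, Integer> entry : byCountDescending(countWords(lines))) {
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```

The logic itself is small; what the development and tuning figures above compare is the effort of running it in parallel across a cluster, which this sketch does not attempt.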

RedPoint in a modern data architecture
APPLICATIONS
• Any analytics
• Any reporting
• Any other application
DATA SYSTEMS
• RedPoint: Data Quality, Data Integration, Identity Resolution
• Functions: ELT, ETL, Cleanse, Match, De-dupe, Merge/Purge, Household, Partition, Parse, Append, Standardize, Key, Automate, Monitor, Notify
• Pure YARN application. No MapReduce needed. No in-cluster installation.
• One application, one graphical user interface for traditional and Big Data
• Pre-built native adapters
• HADOOP: YARN; HDFS (nodes 1 to n)
• TRADITIONAL REPOSITORIES + others
DATA SOURCES
• Traditional Sources (RDBMS, OLTP, OLAP)
• New Sources (web logs, email, sensors, social media)

Who Should Care
• Companies interested in exploring the promise of Big Data Analytics that need an easy way to get started
• Companies already investing heavily in Big Data Analytics technologies but stuck due to the shortage of skilled resources
• Large organizations that are focused on “Operational Offloading” and need to achieve it cost-effectively
• Companies that recognize that much of the data that lands in Hadoop is external to the organization and need Data Quality and proper data governance applied to their Hadoop data

RedPoint benefits and value
FEATURES
• Pure YARN, no MapReduce
• Graphical UI, not code-based
• All DQ/DI functions available
• Executes in Hadoop, no data movement
• Zero footprint install, nothing in the cluster
• Same product for Hadoop and database
• Top rated for ease-of-use
BENEFITS
• Users can work across any/all data
• Easy to integrate data from any source
• No need for extra storage
• No time wasted moving data
• Minimizes extra computing resources
• No compromises in quality or integration for data in Hadoop
• Overcomes the skills gap
• Existing staff can start working now
VALUE
Makes Hadoop data management:
• Faster
• Easier
• Less expensive
• More effective

For More Information on RedPoint
Visit us in booth P13
Download YARN article here: http://bit.ly/YARN-Article
Email: contact.us@redpoint.net
