This document discusses strategies for integrating Hadoop with existing databases and applications. It notes that over 50% of Hadoop initiatives fail because they attempt to replace databases with Hadoop; a better approach is to supplement the database by migrating some batch jobs to Hadoop. It also recommends developing REST APIs and services alongside Hadoop, which enables faster development of use cases than Hadoop components such as Hive and Pig alone. This hybrid REST + Hadoop approach can help establish a user base within about 6 months rather than the 2 years typically required for full Hadoop deployments.
2. BigData Failures
>50% of Hadoop initiatives fail; Why?
- Start: Assume Hadoop replaces a database and the DB apps
- Progression: Assume Hadoop supplements the DB and is not a complete replacement. Some of the batch jobs can migrate to Hadoop
This may solve the problem of having to pay the next round of licensing fees for the next step up in DB capacity
Most of these initiatives still fail. Why?
3. Hadoop/DB Migrations
Migrating the DB schema to Hadoop for the longer batch queries takes too long. Too long => increased cost => :(
- Vendor training is not adequate to get business logic implemented in an API on top of Hadoop quickly (tools, e.g. Sqoop), nor for devops/production/customization
Confusion over which components to use: workflows w/Oozie; Pig+UDFs, Spark, or Hive+UDFs; HBase
- Fix: Use REST APIs/Services + Hadoop MR+Spark
Shell; Training
4. What is a better strategy?
Besides going all in with Hadoop and buying the Cloudera/MapR/Hortonworks sales pitch, what is missing?
Goal: quickly establish a user base in ~6 months, not 2 years
- Mix REST services with Hadoop/HDFS. Tableau is one example; better to custom develop
- Start w/ open source Hadoop, not CM or Ambari; build the source; learn to apply the patches from JIRA bugs (this used to be important). Drives understanding of the internals for configuration, and builds skills for production
5. Open Source strategy
Normally takes 1-2 years
- Training reduces the time from POC to deployment to 6 months for the first use case
Training covers REST services together with Hadoop to establish a corporate agile strategy/template, which otherwise takes years to develop. This is different from Hadoop vendor training for implementing business logic
Covers REST examples w/Spring and/or Guice; building the source and removing unnecessary components to keep the code base small; adding integration tests specific to a customer deployment using iTest; Puppet scripts; and how to deploy from a single source tree using Jenkins
6. Use case: DB Queries
Misconception: replacing DB queries over a complex schema with Hadoop Hive/Pig/Spark queries is a strategy
- Develop REST BE/FE template/skills (<1 hour implementation). Can deploy w/HDFS queries (with or without indexes; see the sketch at the end of this slide). Why?
• Faster perf, less code to do the same thing, less admin, and lower cost at small scale. REST services are closer to a DB than Hadoop is => :) users
With training, REST services take 1 hour to 1 day to build.
Hadoop impediments: having to provision a cluster, understanding what the XML config files do, running benchmarks, configuring Kerberos, setting ACLs, versioning data, and testing backup and recovery strategies
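As a concrete version of the "REST over HDFS queries" point, here is a minimal sketch, assuming the Hadoop client libraries on the classpath; the object name, the CSV layout, and the lookup-by-first-column logic are illustrative, not from the deck. A REST handler (see slide 7) can wrap this method directly:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

object HdfsQuery {
  // Return the lines of a CSV file on HDFS whose first column equals `key`.
  // No cluster-side query engine involved: open the file, scan, filter.
  def lookup(uri: String, key: String): List[String] = {
    val fs = FileSystem.get(new URI(uri), new Configuration())
    val in = fs.open(new Path(uri))
    try Source.fromInputStream(in).getLines().filter(_.split(",")(0) == key).toList
    finally in.close()
  }
}
```

Called as e.g. HdfsQuery.lookup("hdfs://nn:8020/data/orders.csv", "42") (hypothetical namenode and path), this is the "less code, less admin" trade-off: fine at small scale, with no cluster provisioning required beyond HDFS itself.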
7. REST + Hadoop
Successful deployments contain a mix of
homegrown services + Hadoop components
- Training to develop REST services quickly
No Spring, no J2EE, no Glassfish, no complex s/w with
millions of lines of code.
DI with Google Guice; Maven; Jetty; FE using jQuery or Twitter Bootstrap (see the sketch after this slide). Keep the BE and FE simple before looking at web frameworks like Play, Django, Ruby, Node.js, etc.
Training materials: first without Guice, then w/Guice
- Package the REST services with the Hadoop distro using Bigtop skills
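A minimal sketch of the stack this slide names: Guice for DI plus embedded Jetty, no Spring and no container. All class names here are illustrative; only the Guice and Jetty APIs are real.

```scala
import com.google.inject.{AbstractModule, Guice, Inject}
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}
import org.eclipse.jetty.server.Server
import org.eclipse.jetty.servlet.{ServletContextHandler, ServletHolder}

trait Greeter { def greet(name: String): String }
class SimpleGreeter extends Greeter { def greet(name: String) = s"hello, $name" }

// The whole DI configuration: one binding, no XML.
class AppModule extends AbstractModule {
  override def configure(): Unit = bind(classOf[Greeter]).to(classOf[SimpleGreeter])
}

// Guice injects the service through the constructor.
class GreetServlet @Inject() (greeter: Greeter) extends HttpServlet {
  override def doGet(req: HttpServletRequest, resp: HttpServletResponse): Unit = {
    resp.setContentType("text/plain")
    val name = Option(req.getParameter("name")).getOrElse("world")
    resp.getWriter.println(greeter.greet(name))
  }
}

object RestMain {
  def main(args: Array[String]): Unit = {
    val injector = Guice.createInjector(new AppModule)
    val server = new Server(8080) // embedded Jetty, started from main()
    val ctx = new ServletContextHandler()
    ctx.addServlet(new ServletHolder(injector.getInstance(classOf[GreetServlet])), "/greet")
    server.setHandler(ctx)
    server.start()
    server.join()
  }
}
```

This would be the "w/Guice" variant of the training materials; deleting AppModule and constructing GreetServlet by hand gives the "no Guice" variant.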
9. Back to Hadoop
K/V storage; why?
- Add nodes to scale out horizontally, i.e. need more memory to handle more data <=> the more-DB-rows problem/solution
M/R spills to disk; speeding up data reads is OK, but M/R is still a problem. Spark/Scala in-memory computation w/a K/V store (sketch at the end of this slide)
Building a data repository: customize the CDK to reflect the schemas. Productionize using Guice, not Spring (too rigid) and not Morphlines (which works like sed)
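A minimal sketch of the in-memory K/V point, assuming a local Spark install; the input path and "key,value" CSV layout are hypothetical. The contrast with M/R is that the pair RDD stays cached in executor memory across jobs instead of spilling to disk between them:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD operations on older Spark versions

object KvCacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kv-cache").setMaster("local[*]"))
    // Parse "key,value" lines into a K/V RDD and pin it in memory.
    val kv = sc.textFile("hdfs:///data/events.csv")
      .map(_.split(","))
      .map(a => (a(0), a(1).toLong))
      .cache()
    // Both jobs below reuse the cached partitions instead of rereading HDFS.
    val sums   = kv.reduceByKey(_ + _).collect()
    val counts = kv.countByKey()
    println(sums.take(5).mkString(", ") + " ... " + counts.size + " distinct keys")
    sc.stop()
  }
}
```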
10. Hive/Pig/Oozie/Sqoop
Departments pick their own tools/approach
based on the problem description
HttpFS isn't an API
- Add a REST API
Hive/Pig are slow to develop with. Developing UDFs takes time, and production code is hard to maintain/modify when buried behind the production firewall
Better with Beeline ADD JAR (see the sketch below)
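For scale, here is a Hive UDF of the kind this slide says takes time to develop; the class and function names are illustrative. It uses the classic org.apache.hadoop.hive.ql.exec.UDF base class, which resolves evaluate() by reflection:

```scala
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Trim and lowercase an id column; Hive calls evaluate() once per row.
class NormalizeId extends UDF {
  def evaluate(s: Text): Text =
    if (s == null) null else new Text(s.toString.trim.toLowerCase)
}
```

The Beeline step above then attaches and registers it, e.g. ADD JAR followed by CREATE TEMPORARY FUNCTION normalize_id AS 'NormalizeId' (jar path and function name hypothetical).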
11. Scala/Spark
Some parts of Scala/Spark are not parallelizable
- Parallelize over threads in an ExecutionContext vs. workers in separate JVMs (see the sketch after this list)
Takes 3 iterations to get something right for users:
1) Learning; everything is new (vendor training is good here)
2) Know what is important for your own use case and focus time on the solution there; the code comes out different than the first time (e.g. Scala teaching)
3) Now the problem definition is known, and probably the best solution too; focus on execution and on making the service fast and usable
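A minimal sketch of the thread-level option from the list above: fan work out as Futures on a fixed thread pool inside one JVM, rather than across Spark workers in separate JVMs. The chunking and the sum are placeholders for real per-chunk work:

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object LocalParallelism {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(8)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    // Fan out: each chunk becomes a Future scheduled on the pool's threads.
    val chunks = (1 to 64).grouped(8).toSeq
    val partials: Seq[Future[Int]] = chunks.map(c => Future(c.sum))

    // Fan in: combine the per-thread partial results.
    val total = Await.result(Future.sequence(partials).map(_.sum), 30.seconds)
    println(s"total = $total")
    pool.shutdown()
  }
}
```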
12. Analytics Use case: Model building
Models take a long time to build. Example:
Random Forest
- 4h on an 8GB MacBook (~2010; R)
- 4h on an AWS Large instance (R)
- 16h (Mahout; not the same implementation as R) on M/R in an AWS cluster of 4 nodes. More nodes did not make it faster
- Soln: Distributed + multi-tenant; not Mahout (one hedged sketch below)
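The deck stops at "not Mahout"; as one hedged illustration of what a distributed random forest can look like, here is Spark MLlib's implementation (MLlib is my substitution, not named on the slide; the path and parameters are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

object RandomForestDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rf").setMaster("local[*]"))
    // LIBSVM-formatted training data on HDFS (hypothetical path).
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm")
    // Trees are trained in parallel across the cluster's executors.
    val model = RandomForest.trainClassifier(
      data,
      numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 100,
      featureSubsetStrategy = "auto",
      impurity = "gini",
      maxDepth = 8,
      maxBins = 32,
      seed = 42)
    println(model.toDebugString.take(500))
    sc.stop()
  }
}
```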