This document discusses strategies for integrating Hadoop with existing databases and applications. It notes that over 50% of Hadoop initiatives fail because they attempt to replace databases with Hadoop; a better approach is to supplement the database by migrating some batch jobs to Hadoop. It also recommends developing REST APIs and services alongside Hadoop, which enables faster development of use cases than Hadoop components such as Hive and Pig alone. This hybrid REST + Hadoop approach can help establish a user base within about 6 months rather than the 2 years typically required for full Hadoop deployments.
2. BigData Failures
>50% of Hadoop initiatives fail; Why?
- Start: Assume Hadoop replaces a database and the DB apps
- Progression: Assume Hadoop supplements the DB and is not a complete replacement. Some of the batch jobs can migrate to Hadoop
This may solve the problem of having to pay the next round of licensing fees for the next step up in DB capacity
Most of these initiatives still fail. Why?
3. Hadoop/DB Migrations
Migrating the DB schema to Hadoop for the longer batch queries takes too long. Too long => increased cost => :(
- Vendor training is not adequate to get business logic implemented in an API on top of Hadoop quickly (tools, e.g. Sqoop), nor for devops/production/customization
Confusion over which components to use: workflows w/Oozie; Pig+UDFs, Spark, or Hive+UDFs; HBase
- Fix: Use REST APIs/Services + Hadoop MR+Spark
Shell; Training
4. What is a better strategy?
Besides going all in with Hadoop and buying the Cloudera/MapR/Hortonworks sales pitch, what is missing?
Goal: quickly establish a user base in ~6 months, not 2 years
- Mix REST services with Hadoop/HDFS. Tableau is one example; better to custom develop
- Start w/ open source Hadoop, not CM or Ambari; build the source; learn to apply the patches from JIRA bugs (this used to be important). Drives understanding of the internals for configuration, and builds skills for production
5. Open Source strategy
Normally takes 1-2 years
- Training reduces the time from POC to deployment to 6 months for the first use case
Training covers REST services together with Hadoop to establish a corporate agile strategy/template, which otherwise takes years to develop. This is different from Hadoop vendor training for implementing business logic
Covers REST examples w/Spring and/or Guice; building the source and removing unnecessary components to keep the code base small; adding integration tests specific to a customer deployment using iTest; Puppet scripts; and how to deploy from a single source tree using Jenkins
6. Use case: DB Queries
Misconception: replacing DB queries over a complex schema with Hadoop Hive/Pig/Spark queries is a strategy
- Develop REST BE/FE template/skills (<1 hour implementation). Can deploy w/HDFS queries (with or without indexes; see the sketch at the end of this slide). Why?
• Faster perf, less code to do the same thing, less admin, and lower cost at small scale. REST services are closer to a DB than Hadoop is => :) users
With training, REST services take 1 hour to 1 day to build.
Hadoop impediments: having to provision a cluster, understanding what the XML config files do, running benchmarks, configuring Kerberos, setting ACLs, versioning data, and testing backup and recovery strategies
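As a concrete version of the "REST over HDFS queries" point, here is a minimal sketch, assuming the Hadoop client libraries on the classpath; the object name, the CSV layout, and the lookup-by-first-column logic are illustrative, not from the deck. A REST handler (see slide 7) can wrap this method directly:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

object HdfsQuery {
  // Return the lines of a CSV file on HDFS whose first column equals `key`.
  // No cluster-side query engine involved: open the file, scan, filter.
  def lookup(uri: String, key: String): List[String] = {
    val fs = FileSystem.get(new URI(uri), new Configuration())
    val in = fs.open(new Path(uri))
    try Source.fromInputStream(in).getLines().filter(_.split(",")(0) == key).toList
    finally in.close()
  }
}
```

Called as e.g. HdfsQuery.lookup("hdfs://nn:8020/data/orders.csv", "42") (hypothetical namenode and path), this is the "less code, less admin" trade-off: fine at small scale, with no cluster provisioning required beyond HDFS itself.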
7. REST + Hadoop
Successful deployments contain a mix of
homegrown services + Hadoop components
- Training to develop REST services quickly
No Spring, no J2EE, no Glassfish, no complex s/w with
millions of lines of code.
DI with Google Guice; Maven; Jetty; FE using jQuery or Twitter Bootstrap (see the sketch after this slide). Keep the BE and FE simple before looking at web frameworks like Play, Django, Ruby, Node.js, etc.
Training materials: first without Guice, then w/Guice
- Package the REST services with the Hadoop distro using Bigtop skills
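A minimal sketch of the stack this slide names: Guice for DI plus embedded Jetty, no Spring and no container. All class names here are illustrative; only the Guice and Jetty APIs are real.

```scala
import com.google.inject.{AbstractModule, Guice, Inject}
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}
import org.eclipse.jetty.server.Server
import org.eclipse.jetty.servlet.{ServletContextHandler, ServletHolder}

trait Greeter { def greet(name: String): String }
class SimpleGreeter extends Greeter { def greet(name: String) = s"hello, $name" }

// The whole DI configuration: one binding, no XML.
class AppModule extends AbstractModule {
  override def configure(): Unit = bind(classOf[Greeter]).to(classOf[SimpleGreeter])
}

// Guice injects the service through the constructor.
class GreetServlet @Inject() (greeter: Greeter) extends HttpServlet {
  override def doGet(req: HttpServletRequest, resp: HttpServletResponse): Unit = {
    resp.setContentType("text/plain")
    val name = Option(req.getParameter("name")).getOrElse("world")
    resp.getWriter.println(greeter.greet(name))
  }
}

object RestMain {
  def main(args: Array[String]): Unit = {
    val injector = Guice.createInjector(new AppModule)
    val server = new Server(8080) // embedded Jetty, started from main()
    val ctx = new ServletContextHandler()
    ctx.addServlet(new ServletHolder(injector.getInstance(classOf[GreetServlet])), "/greet")
    server.setHandler(ctx)
    server.start()
    server.join()
  }
}
```

This would be the "w/Guice" variant of the training materials; deleting AppModule and constructing GreetServlet by hand gives the "no Guice" variant.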
9. Back to Hadoop
K/V storage; why?
- Add nodes to scale out horizontally, i.e. need more memory to handle more data <=> the more-DB-rows problem/solution
M/R spills to disk; speeding up data reads is OK, but M/R is still a problem. Spark/Scala in-memory computation w/a K/V store (sketch at the end of this slide)
Building a data repository: customize the CDK to reflect the schemas. Productionize using Guice, not Spring (too rigid) and not Morphlines (which works like sed)
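A minimal sketch of the in-memory K/V point, assuming a local Spark install; the input path and "key,value" CSV layout are hypothetical. The contrast with M/R is that the pair RDD stays cached in executor memory across jobs instead of spilling to disk between them:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD operations on older Spark versions

object KvCacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kv-cache").setMaster("local[*]"))
    // Parse "key,value" lines into a K/V RDD and pin it in memory.
    val kv = sc.textFile("hdfs:///data/events.csv")
      .map(_.split(","))
      .map(a => (a(0), a(1).toLong))
      .cache()
    // Both jobs below reuse the cached partitions instead of rereading HDFS.
    val sums   = kv.reduceByKey(_ + _).collect()
    val counts = kv.countByKey()
    println(sums.take(5).mkString(", ") + " ... " + counts.size + " distinct keys")
    sc.stop()
  }
}
```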
10. Hive/Pig/Oozie/Sqoop
Departments pick their own tools/approach
based on the problem description
HttpFS isn't an API
- Add a REST API
Hive/Pig are slow to develop with. Developing UDFs takes time, and production code is hard to maintain/modify when buried behind the production firewall
Better with Beeline ADD JAR (see the sketch below)
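For scale, here is a Hive UDF of the kind this slide says takes time to develop; the class and function names are illustrative. It uses the classic org.apache.hadoop.hive.ql.exec.UDF base class, which resolves evaluate() by reflection:

```scala
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Trim and lowercase an id column; Hive calls evaluate() once per row.
class NormalizeId extends UDF {
  def evaluate(s: Text): Text =
    if (s == null) null else new Text(s.toString.trim.toLowerCase)
}
```

The Beeline step above then attaches and registers it, e.g. ADD JAR followed by CREATE TEMPORARY FUNCTION normalize_id AS 'NormalizeId' (jar path and function name hypothetical).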
11. Scala/Spark
Some parts of Scala/Spark are not parallelizable
- Parallelize over threads in an ExecutionContext vs. workers in separate JVMs (see the sketch after this list)
Takes 3 iterations to get something right for users:
1) Learning; everything is new (vendor training is good here)
2) Know what is important for your own use case and focus time on the solution there; the code comes out different than the first time (e.g. Scala teaching)
3) Now the problem definition is known, and probably the best solution too; focus on execution and on making the service fast and usable
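A minimal sketch of the thread-level option from the list above: fan work out as Futures on a fixed thread pool inside one JVM, rather than across Spark workers in separate JVMs. The chunking and the sum are placeholders for real per-chunk work:

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object LocalParallelism {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(8)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    // Fan out: each chunk becomes a Future scheduled on the pool's threads.
    val chunks = (1 to 64).grouped(8).toSeq
    val partials: Seq[Future[Int]] = chunks.map(c => Future(c.sum))

    // Fan in: combine the per-thread partial results.
    val total = Await.result(Future.sequence(partials).map(_.sum), 30.seconds)
    println(s"total = $total")
    pool.shutdown()
  }
}
```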
12. Analytics Use case: Model building
Models take a long time to build. Example:
Random Forest
- 4h on an 8GB MacBook (~2010; R)
- 4h on an AWS Large instance (R)
- 16h (Mahout; not the same implementation as R) on M/R in an AWS cluster of 4 nodes. More nodes did not make it faster
- Soln: Distributed + multi-tenant; not Mahout (one hedged sketch below)
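The deck stops at "not Mahout"; as one hedged illustration of what a distributed random forest can look like, here is Spark MLlib's implementation (MLlib is my substitution, not named on the slide; the path and parameters are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

object RandomForestDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rf").setMaster("local[*]"))
    // LIBSVM-formatted training data on HDFS (hypothetical path).
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm")
    // Trees are trained in parallel across the cluster's executors.
    val model = RandomForest.trainClassifier(
      data,
      numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 100,
      featureSubsetStrategy = "auto",
      impurity = "gini",
      maxDepth = 8,
      maxBins = 32,
      seed = 42)
    println(model.toDebugString.take(500))
    sc.stop()
  }
}
```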