THE FUTURE OF HADOOP: CHOOSING THE RIGHT OPTIONS

Subash D’Souza
Hadoop Innovation Summit 2014
Slides from my presentation at Hadoop Innovation Summit 2014
The Future of Hadoop: Choosing the right options

Transcript of "Hadoop Innovation Summit 2014"

  1. THE FUTURE OF HADOOP: CHOOSING THE RIGHT OPTIONS
     Subash D’Souza
     Hadoop Innovation Summit 2014
  2. WHO AM I?
     • Recognized as a Champion of Big Data by Cloudera
     • Co-Organizer: Los Angeles Hadoop User Group
     • Organizer: Los Angeles HBase User Group
     • Organizer: Los Angeles Big Data Users Group
     • Organizer: Big Data Camp LA
     • Speaker: Big Data Camp LA 2013
     • Leading a BOF Session at Hadoop Summit Europe 2014
     • Author: HBase Developer’s Cookbook (Out Fall 2014)
     • Technical Reviewer: Apache Flume: Distributed Log Collection for Hadoop
  3. HADOOP: OLD & NEW
     • Hadoop was first released in 2006.
     • Based on the GFS and MapReduce papers released by Google
     • Ever since, adoption has been massive and rapid
     • Companies like Facebook, Netflix, eBay, Yahoo, Expedia, Spotify and even the Social Security Administration are adopting Hadoop
     • Hadoop 2.0, AKA YARN, went GA in October of 2013
     • It is backwards compatible with the Hadoop 1.0 APIs
     • It replaced the JobTracker and TaskTrackers with the ApplicationMaster, ResourceManager and NodeManagers
  4. A BRIEF HISTORY
     Timeline, 2002 to 2014. Events shown:
     • Google releases GFS paper
     • Google releases MapReduce paper
     • Nutch adds distributed file system
     • Doug Cutting launches Nutch project
     • MapR founded
     • Hortonworks founded
     • Cloudera founded
     • Hadoop spun out of Nutch project at Yahoo
     • MapReduce implemented in Nutch
     • Stinger/Tez to be released
     • Hadoop 2.0 w/HA available
     • Hadoop breaks Terasort world record
     • YARN goes GA
     • HBase, ZooKeeper, Flume and more added to CDH
     • Impala (SQL on Hadoop) launched
  5. PREVIOUSLY, THE STATE OF DATA
     • As a data analyst, you previously could not ask the questions you wanted to ask because you did not have the data points available
     • As a corollary, you couldn’t think of questions to ask of your data because you didn’t know you had access to those data points
  6. BIG DATA IMPACT
  7. FOCUS
     • No standard way to get to the data
     • This is both a plus and a minus: a plus because there is variety to choose from, a minus because the number of tools for pulling the data is huge and ever expanding
     • As a company, what do you choose? What do you focus on?
     • Question: Do you replace your current data infrastructure or do you augment it?
  8. HADOOP TECHNOLOGIES
  9. DISTRIBUTIONS OF HADOOP
     • Apache
     • Hortonworks
     • Cloudera
     • MapR
     • Intel
     • IBM
     • Pivotal
  10. HORTONWORKS HDP 2.0
      Source: hortonworks.com
  11. CLOUDERA ENTERPRISE DATA HUB
      Source: cloudera.com & techweekly.com
  12. MAPR M7 ENTERPRISE
      Source: business-software.com & wn.com
  13. INTEL DISTRIBUTION FOR APACHE HADOOP
      Source: gigaom.com
  14. IBM BIGINSIGHTS ENTERPRISE EDITION
      Source: ndm.net
  15. PIVOTAL HD
      Source: infoq.com
  16. CHOICES
      • Hortonworks: Completely open source. Everything on their platform is available from the Apache Hadoop distribution. Available as a free download or with paid support.
      • Cloudera: Offers the open source Apache Hadoop distribution as well as management tools built for the Cloudera distribution. Available as a free download, or with paid support that includes the additional tools.
      • MapR: Offers a version of Hadoop that replaces HDFS with the proprietary MFS (MapR File System). Everything else on their stack is based on the open source Apache distribution. Offers a free M3 version along with paid M5 and M7 versions.
  17. ADVANTAGES OF YARN
      • Ability to handle multi-tenant clients, i.e. running multiple applications atop the same framework (multi-tenancy)
      • Splits the work of the JobTracker into the ResourceManager and the ApplicationMaster, so that no single component has to both allocate resources and manage the tasks
      • Ability to restart jobs from the place where they failed
      • Scales well beyond the limitations of MR1 (4000...
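The JobTracker split described on slide 17 can be pictured with a toy model. The sketch below is illustrative only: the classes `ResourceManager` and `ApplicationMaster`, the function `run_cluster`, and the notion of scheduling "rounds" are hypothetical stand-ins and do not mirror the real YARN API. The point is that one cluster-wide component only hands out containers, while each application tracks its own tasks, so several applications can share the same cluster.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Cluster-wide role: hands out containers, knows nothing about tasks."""
    free_containers: int

    def allocate(self, requested: int) -> int:
        # Grant as many containers as are free, up to the request.
        granted = min(requested, self.free_containers)
        self.free_containers -= granted
        return granted

    def release(self, count: int) -> None:
        self.free_containers += count


@dataclass
class ApplicationMaster:
    """Per-application role: tracks its own tasks, asks the RM for containers."""
    name: str
    pending: list
    running: list = field(default_factory=list)
    completed: list = field(default_factory=list)

    def start_tasks(self, rm: ResourceManager) -> None:
        granted = rm.allocate(len(self.pending))
        self.running = self.pending[:granted]
        self.pending = self.pending[granted:]

    def finish_tasks(self, rm: ResourceManager) -> None:
        # Give containers back and record the finished tasks.
        rm.release(len(self.running))
        self.completed.extend(self.running)
        self.running = []


def run_cluster(total_containers: int, apps: dict):
    """Run several applications side by side on one shared pool of containers."""
    if total_containers <= 0:
        raise ValueError("need at least one container")
    rm = ResourceManager(total_containers)
    masters = [ApplicationMaster(name, list(tasks)) for name, tasks in apps.items()]
    rounds = 0
    while any(m.pending for m in masters):
        for m in masters:
            m.start_tasks(rm)
        for m in masters:
            m.finish_tasks(rm)
        rounds += 1
    return {m.name: m.completed for m in masters}, rounds
```

Calling `run_cluster(4, {"etl": ["e1", "e2", "e3"], "sql": ["q1", "q2", "q3"]})` runs a batch job and a query workload against the same four containers. The allocation policy here is deliberately naive (first come, first served); real schedulers are considerably more involved.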
  18. SQL-ON-HADOOP
      The different SQL-On-Hadoop tools currently available:
      • Hive
      • Impala
      • Drill
      • Stinger/Tez
      • HAWQ
      • Hadapt
      • Presto
      • Shark
  19. SQL-ON-HADOOP BENCHMARK - SCAN
      Source:
  20. SQL-ON-HADOOP BENCHMARK - AGGREGATE
      Source:
  21. SQL-ON-HADOOP BENCHMARK - JOIN
      Source:
  22. SQL ON HADOOP VS. TRADITIONAL RDBMS
      • Data on Hadoop is not as responsive as data in an RDBMS
      • Data in Hadoop can scale much better than in an RDBMS
      • Data in Hadoop can be accessed using a variety of mechanisms such as Hive, Impala, Drill, etc., i.e. the query engines are abstracted from the Hadoop (HDFS) storage layer. The same cannot be said of an RDBMS, where the engine is tied to its own storage and you need to move data between one system and another. For example, Oracle cannot pull from SQL Server and vice versa
  23. QUESTION?
      • Do we augment or replace our current data infrastructure?
      • Answer: Augment
      • Why? Combine the best of both worlds: keep aggregated data in your data stores and all the detail data and its lifetime history in Hadoop
      • Of course, you will have different SLAs based on the query you ask.
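The "augment" answer on slide 23 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference architecture: the in-memory list `DETAIL_EVENTS` stands in for detail files that would live in Hadoop, an in-memory SQLite database stands in for the existing data store, and all table, column, and function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical detail records; in practice these would be files on HDFS.
DETAIL_EVENTS = [
    {"day": "2014-01-01", "user": "a", "amount": 10.0},
    {"day": "2014-01-01", "user": "b", "amount": 5.0},
    {"day": "2014-01-02", "user": "a", "amount": 7.5},
]


def aggregate_by_day(events):
    """Batch step: roll the detail records up into one total per day."""
    totals = defaultdict(float)
    for event in events:
        totals[event["day"]] += event["amount"]
    return sorted(totals.items())


def load_summary(conn, rows):
    """Load the small aggregated result into the relational store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_totals (day TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO daily_totals VALUES (?, ?)", rows)
    conn.commit()


def daily_total(conn, day):
    """Fast path: answer a dashboard-style question from the aggregate table."""
    row = conn.execute(
        "SELECT total FROM daily_totals WHERE day = ?", (day,)
    ).fetchone()
    return row[0] if row else None


def user_detail(events, user):
    """Slow path: go back to the full detail data for an ad hoc question."""
    return [event for event in events if event["user"] == user]
```

A typical flow would be `conn = sqlite3.connect(":memory:")`, then `load_summary(conn, aggregate_by_day(DETAIL_EVENTS))`, after which `daily_total` serves the quick lookups and `user_detail` serves the questions that need every record. The two paths naturally carry different SLAs, which is the point made on the slide.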
  24. CHALLENGES
      • Data Protection
      • Security
      • SLAs (Service Level Agreements)
      • Integration w/ applications
      • Services and support
      • Training
      • Performance
      • Scaling and Administration
  25. STARTUPS VS. MATURE
      • Startups that work with data should consider going with YARN to gain its advantages
      • Mature companies tend to be conservative and hence will look to the more established use cases of MR1
      • Both startups and mature companies should look at the advantages of YARN as well as at applying more near real-time SQL-on-Hadoop
  26. GETTING STARTED WITH HADOOP VS. ESTABLISHED HADOOP PRACTICES
      • Getting started with Hadoop: an opportunity to hit the ground running with YARN plus bleeding edge technologies.
      • Established companies with a Hadoop practice tend to be conservative, but that shouldn’t prevent them from coming up with a migration plan to YARN
  27. REAL TIME ANALYTICS
      • Kiji
      • HBase
      • Storm
      • Shark
      • Redshift
      • Impala
      • Stinger
      • Drill
      • Accumulo
      • Presto
      • HAWQ
      • IBM BigSQL
  28. REAL TIME STREAMING
      • Flume
      • Kafka
      • Scribe
      • HBase
  29. SECURITY
      • Kerberos with ACLs
      • Cloudera Sentry
      • Project Knox
      • Accumulo (BigTable clone)
      • HBase w/ Cell Security
  30. DEVELOPERS TOOLSET
      • Cloudera CDK, renamed to Kite
      • Java M/R
      • Spring for Hadoop
      • Hive
      • Pig
      • Scalding
      • Impala
      • Others
  31. MANAGEMENT, GUI, MACHINE LEARNING, MONITORING, SCHEDULING & GRAPH DB
      • Ambari
      • Cloudera Manager
      • HUE
      • Mahout
      • Giraph
      • ZooKeeper
      • Oozie
  32. FUTURE OF HADOOP: YARN & NEAR REAL TIME SQL-ON-HADOOP
      • Multi Tenancy
      • HA (High Availability)
      • Tools for SQL-On-Hadoop
        • Impala
        • Stinger/Tez
        • Drill
        • Shark
  33. WHAT DO YOU CHOOSE?
      • The choices are huge
      • The toolsets are varied
      • First focus on the problems you are trying to solve. Don’t choose Hadoop because it is the latest buzzword. Make sure there is a real need to solve
      • Focus on developers and administrators, and ensure that whatever toolset you choose, they have the relevant skillset, or training will be provided, or relevant resources will be brought in from outside (whether through hiring or consulting)
      • REMEMBER THE PROBLEM SET!!! i.e. what you are trying to...
  34. CAVEATS
      • Work is still being done on bringing real-time SQL-on-Hadoop to YARN.
      • Impala has Llama for this.
      • A preview of Stinger for Hive is currently available
      • HBase on YARN (HOYA) is also actively being worked on.
      • Since YARN is a low-level API, some abstraction is needed, which is available with tools such as Samza and Weave
  35. BIG DATA = BIG IMPACT
      Ken Rudin, Director of Analytics, Facebook
      • “You need to go the last mile and evangelize your insights so that people actually act on them and there is impact.”
      • “It doesn’t matter how brilliant our analyses are. If nothing changes we have made no impact”
  36. GIVING BACK
      • Hadoop is an open source project
      • Work on Hadoop and the ecosystem tools is done by committers and contributors, some of whom do this in their own personal time, reporting and fixing bugs as well as adding new functionality.
      • Please give back, either by becoming a contributor (testing, filing bugs) or by getting your use case for Hadoop out (at meetups and/or conferences such as this one) so others can make use of the issues you have faced as well as see the rapid adoption of the...
  37. THANKS
      Subash D’Souza
      Twitter: @sawjd22
      LinkedIn: www.linkedin.com/in/sawjd/
      Email: subashdsouza@gmail.com