Hadoop Innovation Summit 2014

Slides from my presentation at Hadoop Innovation Summit 2014
The Future of Hadoop: Choosing the right options

Transcript

  • 1. THE FUTURE OF HADOOP: CHOOSING THE RIGHT OPTIONS Subash D’Souza Hadoop Innovation Summit 2014
  • 2. WHO AM I?
      • Recognized as a Champion of Big Data by Cloudera
      • Co-Organizer: Los Angeles Hadoop User Group
      • Organizer: Los Angeles HBase User Group
      • Organizer: Los Angeles Big Data Users Group
      • Organizer: Big Data Camp LA
      • Speaker: Big Data Camp LA 2013
      • Leading a BOF Session at Hadoop Summit Europe 2014
      • Author: HBase Developer’s Cookbook (Out Fall 2014)
      • Technical Reviewer: Apache Flume: Distributed Log Collection for Hadoop
  • 3. HADOOP: OLD & NEW
      • Hadoop was first released in 2006.
      • It is based on the GFS and MapReduce papers published by Google.
      • Adoption has been massive and rapid ever since.
      • Companies like Facebook, Netflix, eBay, Yahoo, Expedia and Spotify, and even the Social Security Administration, are adopting Hadoop.
      • Hadoop 2.0, AKA YARN, went GA in October 2013.
      • It is backwards compatible with the Hadoop 1.0 APIs.
      • It replaced the JobTracker and TaskTrackers with the ApplicationMaster, ResourceManager and NodeManagers.
  • 4. A BRIEF HISTORY (timeline figure, 2002 to 2014)
      • Events shown: Google releases GFS paper; Google releases MapReduce paper; Doug Cutting launches Nutch project; Nutch adds distributed file system; MapReduce implemented in Nutch; Hadoop spun out of Nutch project at Yahoo; Cloudera founded; MapR founded; Hortonworks founded; Hadoop breaks Terasort world record; HBase, ZooKeeper, Flume and more added to CDH; Hadoop 2.0 w/HA available; Impala (SQL on Hadoop) launched; YARN goes GA; Stinger/Tez to be released.
  • 5. PREVIOUSLY, THE STATE OF DATA
      • As a data analyst, you were previously unable to ask the questions you wanted to ask because you did not have the data points available.
      • As a corollary, you couldn’t think of questions to ask of your data because you didn’t know you had access to those data points.
  • 6. BIG DATA IMPACT
  • 7. FOCUS
      • There is no standard way to get to the data.
      • This is both a plus and a minus: a plus because there is variety to choose from, a minus because the number of tools for pulling the data is huge and still growing.
      • As a company, what do you choose? What do you focus on?
      • Question: do you replace your current data infrastructure or do you augment it?
  • 8. HADOOP TECHNOLOGIES
  • 9. DISTRIBUTIONS OF HADOOP
      • Apache
      • Hortonworks
      • Cloudera
      • MapR
      • Intel
      • IBM
      • Pivotal
  • 10. HORTONWORKS HDP 2.0 Source: hortonworks.com
  • 11. CLOUDERA ENTERPRISE DATA HUB Source: cloudera.com & techweekly.com
  • 12. MAPR M7 ENTERPRISE Source: business-software.com & wn.com
  • 13. INTEL DISTRIBUTION FOR APACHE HADOOP Source: gigaom.com
  • 14. IBM BIGINSIGHTS ENTERPRISE EDITION Source: ndm.net
  • 15. PIVOTAL HD Source: infoq.com
  • 16. CHOICES
      • Hortonworks: completely open source. Everything on their platform is available from the Apache Hadoop distribution. Available as a free download or with paid support.
      • Cloudera: offers the open source Apache Hadoop distribution as well as management tools built for the Cloudera distribution. Available as a free download, or with paid support that includes the additional tools.
      • MapR: offers a version of Hadoop that replaces HDFS with a proprietary MFS (MapR File System). Everything else on their stack is based on the open source Apache distribution. Offers a free M3 version along with paid M5 and M7 versions.
  • 17. ADVANTAGES OF YARN
      • Ability to handle multi-tenant clients, i.e. running multiple applications atop the same framework (multi-tenancy).
      • Splits the work of the JobTracker between the ResourceManager and the ApplicationMaster, so that no single daemon has to both allocate resources and manage tasks.
      • Ability to restart jobs from the place where they failed.
      • Scales well beyond the limitations of MR1 (4000
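To make the split described on slide 17 concrete, here is a minimal toy model in Python. It is only a sketch: the class names borrow YARN's terminology, but none of this is the YARN API, and the node names, container counts and task names are made up for illustration. It shows one shared allocator serving two applications, each of which tracks its own tasks.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy stand-in for a worker node offering a fixed number of containers."""
    name: str
    free_containers: int


@dataclass
class ResourceManager:
    """Cluster-wide allocator: hands out containers, knows nothing about tasks."""
    nodes: list

    def allocate(self, wanted: int) -> list:
        granted = []
        for node in self.nodes:
            while node.free_containers > 0 and len(granted) < wanted:
                node.free_containers -= 1
                granted.append(node.name)
        return granted


@dataclass
class ApplicationMaster:
    """Per-application coordinator: tracks its own tasks, asks the RM for containers."""
    app_id: str
    pending: list
    done: list = field(default_factory=list)

    def run(self, rm: ResourceManager) -> None:
        containers = rm.allocate(len(self.pending))
        # Run only as many tasks as containers were granted; the rest stay pending.
        for _ in containers:
            self.done.append(self.pending.pop(0))


# Hypothetical cluster: three containers in total, shared by two applications.
rm = ResourceManager([NodeManager("node1", 2), NodeManager("node2", 1)])
etl = ApplicationMaster("etl-job", ["map-0", "map-1"])
sql = ApplicationMaster("sql-query", ["scan-0", "scan-1"])
etl.run(rm)
sql.run(rm)
print(etl.done, sql.done, sql.pending)
```

The toy never releases containers, so the second application is left with one pending task. A real scheduler would hand freed containers back to waiting applications.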
  • 18. SQL-ON-HADOOP
      • The different SQL-on-Hadoop tools currently available: Hive, Impala, Drill, Stinger/Tez, HAWQ, Hadapt, Presto, Shark
  • 19. SQL-ON-HADOOP BENCHMARK - SCAN Source:
  • 20. SQL-ON-HADOOP BENCHMARK - AGGREGATE Source:
  • 21. SQL-ON-HADOOP BENCHMARK - JOIN Source:
  • 22. SQL ON HADOOP VS. TRADITIONAL RDBMS
      • Data on Hadoop is not as responsive as data in an RDBMS.
      • Data in Hadoop can scale much better than data in an RDBMS.
      • Data in Hadoop can be accessed using a variety of mechanisms such as Hive, Impala, Drill, etc., i.e. the query engines are abstracted from the Hadoop (HDFS) storage layer.
      • The same cannot be said of an RDBMS, where you would need to move data between one system and another; for example, Oracle cannot pull from SQL Server and vice versa.
  • 23. QUESTION?
      • Do we augment or replace our current data infrastructure?
      • Answer: augment.
      • Why? To combine the best of both worlds: keep aggregated data in your existing data stores and all the detailed, lifetime data in Hadoop.
      • Of course, you will have different SLAs based on the query you ask.
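One way to picture the "augment" answer on slide 23 is as a routing rule. The Python sketch below is illustrative only: the backend names, the 90-day retention window and the SLA numbers are assumptions invented for the example, not figures from the talk. It sends queries over recent, aggregated data to the existing store and everything else to Hadoop, and each backend carries its own response-time target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    sla_seconds: int  # illustrative response-time target, not a measured figure


# Hypothetical backends: the existing warehouse keeps aggregates,
# Hadoop keeps the full detail and history.
WAREHOUSE = Backend("warehouse", sla_seconds=5)
HADOOP = Backend("hadoop", sla_seconds=600)


def choose_backend(needs_detail: bool, days_of_history: int,
                   warehouse_retention_days: int = 90) -> Backend:
    """Send a query to the warehouse unless it needs row-level detail
    or reaches back further than the warehouse retains data."""
    if needs_detail or days_of_history > warehouse_retention_days:
        return HADOOP
    return WAREHOUSE


print(choose_backend(needs_detail=False, days_of_history=30).name)
print(choose_backend(needs_detail=True, days_of_history=30).name)
print(choose_backend(needs_detail=False, days_of_history=365).name)
```

In practice the rule would depend on what each store actually holds. The point is only that the two systems sit side by side and answer different questions under different SLAs.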
  • 24. CHALLENGES
      • Data Protection
      • Security
      • SLAs (Service Level Agreements)
      • Integration w/ applications
      • Services and support
      • Training
      • Performance
      • Scaling and Administration
  • 25. STARTUPS VS. MATURE
      • Startups that work with data should consider going with YARN to gain its advantages.
      • Mature companies tend to be conservative and hence will look to the more established use cases of MR1.
      • Both startups and mature companies should look at the advantages of YARN as well as at applying more near-real-time SQL-on-Hadoop.
  • 26. GETTING STARTED WITH HADOOP VS. ESTABLISHED HADOOP PRACTICES
      • Getting started with Hadoop: an opportunity to hit the ground running with YARN plus bleeding-edge technologies.
      • Established companies with a Hadoop practice tend to be conservative, but that shouldn’t prevent them from coming up with a migration plan to YARN.
  • 27. REAL TIME ANALYTICS
      • Kiji
      • HBase
      • Storm
      • Shark
      • Redshift
      • Impala
      • Stinger
      • Drill
      • Accumulo
      • Presto
      • HAWQ
      • IBM BigSQL
  • 28. REAL TIME STREAMING
      • Flume
      • Kafka
      • Scribe
      • HBase
  • 29. SECURITY
      • Kerberos with ACLs
      • Cloudera Sentry
      • Project Knox
      • Accumulo (BigTable clone)
      • HBase w/ Cell Security
  • 30. DEVELOPERS TOOLSET
      • Cloudera CDK (renamed to Kite)
      • Java M/R
      • Spring for Hadoop
      • Hive
      • Pig
      • Scalding
      • Impala
      • Others
  • 31. MANAGEMENT, GUI, MACHINE LEARNING, MONITORING, SCHEDULING & GRAPH DB
      • Ambari
      • Cloudera Manager
      • HUE
      • Mahout
      • Giraph
      • ZooKeeper
      • Oozie
  • 32. FUTURE OF HADOOP: YARN & NEAR REAL TIME SQL-ON-HADOOP
      • Multi-tenancy
      • HA (High Availability)
      • Tools for SQL-on-Hadoop: Impala, Stinger/Tez, Drill, Shark
  • 33. WHAT DO YOU CHOOSE?
      • The choices are huge.
      • The toolsets are varied.
      • First focus on the problems you are trying to solve. Don’t choose Hadoop because it is the latest buzzword. Make sure there is a real need to solve.
      • Focus on developers and administrators, and ensure that, whatever toolset you choose, they have the relevant skillset, or training will be provided, or relevant resources will be brought in from outside (whether through hiring or consulting).
      • REMEMBER PROBLEMSET!!! i.e. what you are trying to
  • 34. CAVEATS
      • Work is still being done on bringing real-time SQL-on-Hadoop to YARN. Impala has Llama for this. A preview of Stinger for Hive is currently available.
      • HBase on YARN (HOYA) is also actively being worked on.
      • Since YARN is a low-level API, some abstraction is needed, which is available with tools such as Samza and Weave.
  • 35. BIG DATA = BIG IMPACT
      • Ken Rudin, Director of Analytics, Facebook:
      • “You need to go the last mile and evangelize your insights so that people actually act on them and there is impact.”
      • “It doesn’t matter how brilliant our analyses are. If nothing changes we have made no impact.”
  • 36. GIVING BACK
      • Hadoop is an open source project.
      • Work on it and on the ecosystem tools is done by committers and contributors, some of whom do this in their own personal time, reporting and fixing bugs as well as adding new functionality.
      • Please give back, either by becoming a contributor (testing, filing bugs) or by getting out your use case for Hadoop (at meetups and/or conferences such as this one), so others can make use of the issues you have faced as well as see the rapid adoption of the
  • 37. THANKS
      • Subash D’Souza
      • Twitter: @sawjd22
      • Linkedin: www.linkedin.com/in/sawjd/
      • Email: subashdsouza@gmail.com