Elephant in the cloud



The Elephant in the Cloud: A Quest for the Next Generation

In this talk, I will go through the evolution of Hadoop and its ecosystem projects and try to peer into the crystal ball to predict what may be coming down the pike. I will discuss various ways of crunching data on Hadoop (MapReduce, OpenMPI, Spark and various SQL engines) and how these tools complement each other.
Apache Hadoop is no longer just a faithful, open source, scalable implementation of two seminal papers that came out of Google 10 years ago. It has evolved into a project that provides enterprises with a reliable layer for storing massive amounts of unstructured data (HDFS) while allowing different computational frameworks to leverage those datasets.
The original computational framework (MapReduce) has evolved into a much more scalable set of general purpose cluster management APIs collectively known as YARN. With YARN underneath, MapReduce is still there to support batch-oriented computations, but it is no longer the only game in town. With OpenMPI, Spark, and Tez rapidly becoming available, now is truly the most exciting time to be a developer in the Hadoop ecosystem. It is also a time when you don't have to be employed by Yahoo!, Facebook or eBay to have access to mind-blowing compute power. That power is a credit card and a pivotal.io account away from anybody on the planet.
I will conclude by outlining some of the ongoing work that makes Hadoop and its ecosystem projects first-class citizens in cloud environments, based on the work that Pivotal engineers have done integrating Hadoop into the PivotalONE PaaS.

Published in: Technology

Elephant in the cloud

  1. Elephant in the Cloud: a quest for the next generation Hadoop architecture. Roman Shaposhnik, Sr. Manager, Open Source Hadoop Platform @Pivotal (Twitter: @rhatr)
  2. Who’s this guy? • Sr. Manager @Pivotal building a team of OS contributors • Apache Software Foundation guy (VP of Apache Incubator, VP of Apache Bigtop, committer on Hadoop, Giraph, Sqoop, etc.) • Used to be root@Cloudera • Used to be PHB@Yahoo! (original Hadoop team) • Used to be a hacker at Sun Microsystems (Sun Studio compilers and tools)
  3. Agenda
  4. Agenda
  5. Long, long time ago… Diagram labels: HDFS, MapReduce. Legend: ASF Projects, FLOSS Projects, Pivotal Products.
  6. In a blink of an eye: Diagram labels: HDFS, Pig, Sqoop, Flume, Coordination and workflow management, ZooKeeper, Command Center, GemFire XD, Oozie, MapReduce, Hive, Tez, Giraph, Hadoop UI, Hue, SolrCloud, Phoenix, HBase, Crunch, Mahout, Spark, Shark, Streaming, MLlib, GraphX, Impala, HAWQ, Spring XD, MADlib, Hamster, PivotalR, YARN. Legend: ASF Projects, FLOSS Projects, Pivotal Products.
  7. Genesis of Hadoop • Google papers on GFS and MapReduce • A subproject of Apache Nutch • A bet by Yahoo!
  8. Data brings value • What features to add to the product • Data analysis must enable decisions • V3: volume, velocity, variety
  9. Big Data brings big value
  10. Entering: Industrial Data
  11. Big Data Utility Gap • 70% of data generated by customers • 80% of data being stored • 3% being prepared for analysis • 0.5% being analyzed • <0.5% being operationalized • Average Enterprises • 3 exabytes per day now • 40 trillion total gigabytes in 2020 (or 162 iPhones of storage for every human)
  12. Hadoop’s childhood • HDFS: Hadoop Distributed Filesystem • MapReduce: computational framework
  13. HDFS: not a POSIX fs • Huge blocks: 64 MB (128 MB) • Mostly immutable files (append, truncate) • Streaming data access • Block replication
  14. How do I use it?
      $ hadoop fs -lsr /
      # hadoop-fuse-dfs dfs://hadoop-hdfs /mnt
      $ ls /mnt
      # mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt
      $ ls /mnt
  15. Principle #1: HDFS is the data lake
  16. Pivotal’s Focus on Data Lakes • Existing EDW / Datamarts • Raw “untouched” Data • In-Memory Parallel Ingest • Data Management
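
To make the block size and replication points on slide 13 concrete, here is a minimal shell sketch that estimates how many blocks a file occupies and how much raw cluster storage it consumes. The function name `hdfs_footprint`, the 1000 MB file size, and the replication factor of 3 are illustrative assumptions, not figures from the slides; only the 64 MB and 128 MB block sizes come from the talk. It is plain arithmetic and does not call Hadoop.

```shell
#!/bin/sh
# hdfs_footprint FILE_MB BLOCK_MB REPLICATION
# Hypothetical helper (not a Hadoop command): prints how many blocks a file
# of FILE_MB megabytes splits into, and the raw cluster storage it consumes.
hdfs_footprint() {
    file_mb=$1
    block_mb=$2
    replication=$3
    # Ceiling division: a partially filled last block still counts as a block.
    blocks=$(( (file_mb + block_mb - 1) / block_mb ))
    # Every block is stored `replication` times; the last block is not padded
    # to the full block size, so raw usage scales with the file size itself.
    raw_mb=$(( file_mb * replication ))
    echo "blocks=$blocks raw_mb=$raw_mb"
}

# Assumed example: a 1000 MB file with 3 replicas, at both block sizes.
hdfs_footprint 1000 64 3
hdfs_footprint 1000 128 3
```

Under these assumptions, doubling the block size halves the number of blocks to keep track of (16 versus 8) while the raw storage stays at 3000 MB, which is one reason the larger blocks suit streaming access to big files.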