Elephant in the Cloud:
a quest for the next generation
Hadoop architecture	

Roman Shaposhnik	

Sr. Manager, Open Source H...
The Elephant in the Cloud: A Quest for the Next Generation

In this talk, I will go through the evolution of Hadoop and its ecosystem projects and will try to peer into the crystal ball to predict what may be coming down the pike. I will discuss various ways of crunching data on Hadoop (MapReduce, OpenMPI, Spark and various SQL engines) and how these tools complement each other.
Apache Hadoop is no longer just a faithful, open source, scalable implementation of two seminal papers that came out of Google 10 years ago. It has evolved into a project that provides enterprises with a reliable layer for storing massive amounts of unstructured data (HDFS) while allowing different computational frameworks to leverage those datasets.
The original computational framework (MapReduce) has evolved into a much more scalable set of general-purpose cluster management APIs collectively known as YARN. With YARN underneath, MapReduce is still there to support batch-oriented computations, but it is no longer the only game in town. With OpenMPI, Spark, and Tez rapidly becoming available, now is truly the most exciting time to be a developer in the Hadoop ecosystem. It is also the time when you don't have to be employed by Yahoo!, Facebook or eBay to have access to mind-blowing compute power. That power is a credit card and a pivotal.io account away from anybody on the planet.
I will conclude by outlining some of the ongoing work that makes Hadoop and its ecosystem projects first-class citizens in cloud environments, based on the work that Pivotal engineers have done integrating Hadoop into PivotalONE PaaS.
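
For readers new to the batch-oriented MapReduce model mentioned in the abstract, the classic word count can be imitated on a single machine with a Unix pipeline. This is only an analogy, not Hadoop code: the function name `wordcount` is made up for this sketch, `tr` plays the role of the map step (emit one word per line), `sort` stands in for the shuffle (bring identical keys together), and `uniq -c` acts as the reduce step (count each key).

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): single-machine analogy of
# the map -> shuffle -> reduce phases, applied to word counting.
wordcount() {
  tr -s '[:space:]' '\n' |        # "map": split input into one word per line
    grep -v '^$' |                # drop empty records
    sort |                        # "shuffle": group identical words together
    uniq -c |                     # "reduce": count occurrences of each word
    awk '{print $2 "\t" $1}'      # print as word<TAB>count
}

# Usage: feed text on standard input.
printf 'hdfs yarn hdfs\n' | wordcount
```

On a real cluster the same three phases run in parallel across many machines, with the input and output stored in HDFS.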

Published in: Technology

Elephant in the cloud

  1. Elephant in the Cloud: a quest for the next generation Hadoop architecture. Roman Shaposhnik, Sr. Manager, Open Source Hadoop Platform @Pivotal (Twitter: @rhatr)
  2. Who’s this guy?
     • Sr. Manager @Pivotal building a team of OS contributors
     • Apache Software Foundation guy (VP of Apache Incubator, VP of Apache Bigtop, committer on Hadoop, Giraph, Sqoop, etc.)
     • Used to be root@Cloudera
     • Used to be PHB@Yahoo! (original Hadoop team)
     • Used to be a hacker at Sun Microsystems (Sun Studio compilers and tools)
  3. Agenda &
  4. Agenda
  5. Long, long time ago… (diagram; legend: ASF Projects, FLOSS Projects, Pivotal Products) HDFS, MapReduce
  6. In a blink of an eye: (diagram; legend: ASF Projects, FLOSS Projects, Pivotal Products) HDFS, Pig, Sqoop, Flume, Coordination and workflow management, ZooKeeper, Command Center, GemFire XD, Oozie, MapReduce, Hive, Tez, Giraph, Hadoop UI, Hue, SolrCloud, Phoenix, HBase, Crunch, Mahout, Spark, Shark, Streaming, MLlib, GraphX, Impala, HAWQ, SpringXD, MADlib, Hamster, PivotalR, YARN
  7. Genesis of Hadoop
     • Google papers on GFS and MapReduce
     • A subproject of Apache Nutch
     • A bet by Yahoo!
  8. Data brings value
     • What features to add to the product
     • Data analysis must enable decisions
     • V3: volume, velocity, variety
  9. Big Data brings big value
  10. Entering: Industrial Data
  11. Big Data Utility Gap
     • 70% of data generated by customers
     • 80% of data being stored
     • 3% being prepared for analysis
     • 0.5% being analyzed
     • <0.5% being operationalized
     • Average Enterprises: 3 Exabytes per day now; 40 Trillion total Gigabytes in 2020 (or 162 iPhones of storage for every human)
  12. Hadoop’s childhood
     • HDFS: Hadoop Distributed Filesystem
     • MapReduce: computational framework
  13. HDFS: not a POSIX fs
     • Huge blocks: 64 MB (128 MB)
     • Mostly immutable files (append, truncate)
     • Streaming data access
     • Block replication
  14. How do I use it?
     $ hadoop fs -lsr /
     # hadoop-fuse-dfs dfs://hadoop-hdfs /mnt
     $ ls /mnt
     # mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt
     $ ls /mnt
  15. Principle #1: HDFS is the datalake
  16. Pivotal’s Focus on Data Lakes: Existing EDW / Datamarts, Raw “untouched” Data, In-Memory Parallel Ingest, Data Management
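
The block size and replication bullets on slide 13 imply some simple arithmetic that is worth seeing once. The sketch below is an illustration only: the function names `blocks_for` and `raw_mb` are made up here, sizes are in whole megabytes, and the replication factor of 3 used in the usage lines is an assumption (it is the usual HDFS default, but clusters can configure it differently).

```shell
#!/bin/sh
# Hypothetical helpers for back-of-the-envelope HDFS sizing.

# blocks_for FILE_MB BLOCK_MB
# Number of blocks a file occupies: ceiling of FILE_MB / BLOCK_MB.
blocks_for() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# raw_mb FILE_MB REPLICATION
# Raw cluster storage consumed when every block is stored REPLICATION
# times. The last block is not padded, so this is size times replicas.
raw_mb() {
  echo $(( $1 * $2 ))
}

# Usage: a 1000 MB file with 64 MB and 128 MB blocks, replication 3 (assumed).
blocks_for 1000 64    # prints 16
blocks_for 1000 128   # prints 8
raw_mb 1000 3         # prints 3000
```

Doubling the block size halves the number of blocks the cluster has to track for the same file, which is one reason the larger block size appears in parentheses on the slide.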