Elephant in the cloud
The Elephant in the Cloud: A Quest for the Next Generation

In this talk, I will go through the evolution of Hadoop and its ecosystem projects and will try to peer into the crystal ball to predict what may be coming down the pike. I will discuss various ways of crunching the data on Hadoop (MapReduce, OpenMPI, Spark and various SQL engines) and how these tools complement each other.
Apache Hadoop is no longer just a faithful, open source, scalable implementation of two seminal papers that came out of Google 10 years ago. It has evolved into a project that provides the enterprises with a reliable layer for storing massive amounts of unstructured data (HDFS) while allowing different computational frameworks to leverage those datasets.
The original computational framework (MapReduce) has evolved into a much more scalable set of general-purpose cluster management APIs collectively known as YARN. With YARN underneath, MapReduce is still there to support batch-oriented computations, but it is no longer the only game in town. With OpenMPI, Spark, and Tez rapidly becoming available, now is truly the most exciting time to be a developer in the Hadoop ecosystem. It is also the time when you don't have to be employed by Yahoo!, Facebook or eBay to have access to mind-blowing compute power. That power is a credit card and a pivotal.io account away for anybody on the planet.
I will conclude by outlining some of the ongoing work that makes Hadoop and its ecosystem projects first class citizens in cloud environments based on the work that Pivotal engineers have done with integrating Hadoop into PivotalONE PaaS.

Presentation Transcript

  • Elephant in the Cloud: a quest for the next generation Hadoop architecture Roman Shaposhnik Sr. Manager, Open Source Hadoop Platform @Pivotal (Twitter: @rhatr)
  • Who’s this guy? •  Sr. Manager @Pivotal building a team of OS contributors •  Apache Software Foundation guy (VP of Apache Incubator, VP of Apache Bigtop, committer on Hadoop, Giraph, Sqoop, etc) •  Used to be root@Cloudera •  Used to be PHB@Yahoo! (original Hadoop team) •  Used to be a hacker at Sun Microsystems (Sun Studio compilers and tools)
  • Agenda
  • Long, long time ago… HDFS ASF Projects FLOSS Projects Pivotal Products MapReduce
  • In a blink of an eye: HDFS Pig Sqoop Flume Coordination and workflow management Zookeeper Command Center ASF Projects FLOSS Projects Pivotal Products GemFire XD Oozie MapReduce Hive Tez Giraph Hadoop UI Hue SolrCloud Phoenix HBase Crunch Mahout Spark Shark Streaming MLib GraphX Impala HAWQ SpringXD MADlib Hamster PivotalR YARN
  • Genesis of Hadoop • Google papers on GFS and MapReduce • A subproject of Apache Nutch • A bet by Yahoo!
  • Data brings value • What features to add to the product • Data analysis must enable decisions • The 3 Vs: volume, velocity, variety
  • Big Data brings big value
  • Entering: Industrial Data
  • Big Data Utility Gap 70% of data generated by customers 80% of data being stored 3% being prepared for analysis 0.5% being analyzed <0.5% being operationalized Average Enterprises 3 Exabytes per day now 40 Trillion total Gigabytes in 2020 (Or 162 iPhones of storage for every human) ?
  • Hadoop’s childhood • HDFS: Hadoop Distributed Filesystem • MapReduce: computational framework
  • HDFS: not a POSIX fs • Huge blocks: 64MB (128MB) • Mostly immutable files (append, truncate) • Streaming data access • Block replication
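The huge-block design is easy to see with a little arithmetic. The sketch below is a back-of-the-envelope calculation, not HDFS code; the 128 MB block size matches the slide, and the 3x replication factor is the common HDFS default, assumed here for illustration.

```java
// How a 1 TiB file maps onto HDFS blocks (illustrative arithmetic only).
public class BlockMath {
    public static void main(String[] args) {
        long fileSize = 1024L * 1024 * 1024 * 1024; // 1 TiB
        long blockSize = 128L * 1024 * 1024;        // 128 MiB blocks, as on the slide
        int replication = 3;                        // assumed: typical HDFS default

        // Round up: a partial final block still occupies a block slot.
        long blocks = (fileSize + blockSize - 1) / blockSize;
        long rawGiB = fileSize * replication / (1024L * 1024 * 1024);

        System.out.println(blocks); // 8192 blocks to track per file, not millions
        System.out.println(rawGiB); // 3072 GiB of raw cluster storage consumed
    }
}
```

A few thousand large blocks per file is what keeps the NameNode's in-memory metadata manageable, which is why HDFS trades fine-grained random access for streaming throughput.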
  • How do I use it? $ hadoop fs -lsr / # hadoop-fuse-dfs dfs://hadoop-hdfs /mnt $ ls /mnt # mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt $ ls /mnt
  • Principle #1 HDFS is the datalake
  • Pivotal’s Focus on Data Lakes Existing EDW / Datamarts Raw “untouched” Data In-MemoryParallelIngest Data Management (Search Engine) Processed Data In-Memory Services BI/AnalyticalTools Data Lake ERP HR SFDC New Data Sources/Formats Machine Traditional Data Sources Finally! I now have full transparency on the data with amazing speed! All data is now accessible! I can now afford  “Big Data” Business Users ELT Processing with Hadoop HDFS MapReduce/SQL/Pig/Hive Analytical Data Marts/ Sandboxes SecurityandControl
  • HDFS enables the stack HDFS Pig Sqoop Flume Coordination and workflow management Zookeeper Command Center ASF Projects FLOSS Projects Pivotal Products GemFire XD Oozie MapReduce Hive Tez Giraph Hadoop UI Hue SolrCloud Phoenix HBase Crunch Mahout Spark Shark Streaming MLib GraphX Impala HAWQ SpringXD MADlib Hamster PivotalR YARN
  • Principle #2 Apps share their internal state
  • MapReduce • Batch oriented (long jobs; final results) • Brings the computation to the data • Very constrained programming model • Embarrassingly parallel programming model • Used to be the only game in town for compute
  • MapReduce Overview • Record = (Key, Value) • Key : Comparable, Serializable • Value: Serializable • Logical Phases: Input, Map, Shuffle, Reduce, Output
  • Map • Input: (Key1, Value1) • Output: List(Key2, Value2) • Projections, Filtering, Transformation
  • Shuffle • Input: List(Key2, Value2) • Output • Sort(Partition(List(Key2, List(Value2)))) • Provided by Hadoop : Several Customizations Possible
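The partition step of the shuffle decides which reducer owns each key. The sketch below mirrors the default hash-partitioning scheme (mask off the sign bit, then take the value modulo the number of reducers); treat it as an illustration of the idea rather than a drop-in Hadoop `Partitioner` implementation.

```java
// Which reducer gets a key? A sketch of hash partitioning.
public class PartitionSketch {
    static int partition(String key, int numReducers) {
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Every occurrence of the same key lands on the same reducer,
        // which is what makes per-key aggregation in reduce() possible.
        System.out.println(partition("warning", 4) == partition("warning", 4)); // true
        System.out.println(partition("warning", 4) >= 0);                       // true
    }
}
```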
  • Reduce • Input: List(Key2, List(Value2)) • Output: List(Key3, Value3) • Aggregations
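The three logical phases above can be walked through in plain Java on an in-memory list. This uses no Hadoop APIs; it only illustrates the shape of the data at each step for word count, using the same record stream as the anatomy slide that follows.

```java
import java.util.*;
import java.util.stream.*;

// Map -> Shuffle -> Reduce for word count, in one JVM, no Hadoop.
public class PhasesSketch {
    public static void main(String[] args) {
        List<String> input = List.of("d a c", "a b c", "a");

        // Map: each line becomes a list of (word, 1) records.
        List<Map.Entry<String, Integer>> mapped = input.stream()
            .flatMap(line -> Arrays.stream(line.split(" ")))
            .map(w -> Map.entry(w, 1))
            .collect(Collectors.toList());

        // Shuffle: group values by key; TreeMap stands in for the sort.
        TreeMap<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped)
            shuffled.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());

        // Reduce: sum each key's list of values.
        TreeMap<String, Integer> counts = new TreeMap<>();
        shuffled.forEach((k, vs) ->
            counts.put(k, vs.stream().mapToInt(Integer::intValue).sum()));

        System.out.println(counts); // {a=3, b=1, c=2, d=1}
    }
}
```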
  • Anatomy of MapReduce (diagram: input records d, a, c, a, b, c, a flow from HDFS through mappers as (word, 1) pairs, are grouped in the shuffle, and emerge from the reducers as a=3, b=1, c=2, d=1 back into HDFS)
  • MapReduce DataFlow
  • How do I use it? public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } }
  • How do I use it? public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } }
  • How do I run it? $ hadoop jar hadoop-examples.jar wordcount input output
  • Principle #3 MapReduce is assembly language of Hadoop
  • Hadoop’s childhood • Compact (pretty much a single jar) • Challenged in scalability and SPOFs • Extremely batch oriented • Hard for non-Java programmers
  • Then, something happened
  • Hadoop 1.0 HDFS ASF Projects FLOSS Projects Pivotal Products MapReduce
  • Hadoop 2.0 HDFS ASF Projects FLOSS Projects Pivotal Products MapReduce Tez YARN Hamster YARN
  • Hadoop 2.0 • HDFS 2.0 • Yet Another Resource Negotiator (YARN) • MapReduce is just an “application” now • Tez is another “application” • Pivotal’s Hamster (OpenMPI) yet another one
  • MapReduce 1.0 (diagram: a single Job Tracker dispatching task1 … taskN to Task Trackers co-located with HDFS nodes)
  • YARN (AKA MR2.0) (diagram: a Resource Manager allocating containers, with a per-job Job Tracker and tasks running under Task Trackers)
  • YARN • Yet Another Resource Negotiator • Resource Manager • Node Managers • Application Masters • Specific to paradigm, e.g. MR Application master (aka JobTracker)
  • YARN: beyond MR (diagram: the Resource Manager scheduling MPI processes alongside MapReduce)
  • Hamster •  Hadoop and MPI on the same cluster •  OpenMPI Runtime on Hadoop YARN •  Hadoop Provides: Resource Scheduling,  Process monitoring, Distributed File System •  Open MPI Provides: Process launching,  Communication, I/O forwarding
  • Hamster Components • Hamster Application Master • Gang Scheduler, YARN Application Preemption • Resource Isolation (lxc Containers) • ORTE: Hamster Runtime • Process launching, Wireup, Interconnect
  • Hamster Architecture
  • Hadoop 2.0 HDFS ASF Projects FLOSS Projects Pivotal Products MapReduce Tez YARN Hamster YARN
  • Hadoop ecosystem HDFS Pig Sqoop Flume Coordination and workflow management Zookeeper Command Center ASF Projects FLOSS Projects Pivotal Products Oozie MapReduce Hive Tez Giraph Hadoop UI Hue SolrCloud Phoenix HBase Crunch Mahout YARN Hamster YARN
  • There’s way too much stuff • Tracking dependencies • Integration testing • Optimizing the defaults • Rationalizing the behaviour
  • Wait! We’ve seen this! GNU Software Linux kernel
  • Apache Bigtop Hadoop ecosystem (Hbase, Pig, Hive) Hadoop (HDFS,YARN, MR)
  • Principle #4 Apache Bigtop is how the Hadoop distros get defined
  • The ecosystem • Apache HBase • Apache Crunch, Pig, Hive and Phoenix • Apache Giraph • Apache Oozie • Apache Mahout • Apache Sqoop and Flume
  • Apache HBase • Small mutable records vs. HDFS files • HFiles kept in HDFS • Memcached for HDFS • Built on HDFS and Zookeeper • Google’s Bigtable
  • HBase data model • Driven by the original Webtable use case (diagram: row com.cnn.www with a content: column holding <html>... and anchor:a.com, anchor:b.com columns holding the link texts CNN, CNN.co)
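The Bigtable/HBase data model is essentially a sorted map from (row, column) to an uninterpreted value, with rows kept in lexicographic order. The following is an in-memory sketch of that idea, not the HBase client API; the row and column names echo the Webtable example on the slide.

```java
import java.util.*;

// A sorted-map sketch of the Bigtable/HBase data model (no HBase APIs).
public class SortedMapModel {
    // row -> ("family:qualifier" -> value); TreeMap keeps rows sorted.
    static final TreeMap<String, TreeMap<String, String>> table = new TreeMap<>();

    static void put(String row, String column, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>()).put(column, value);
    }

    public static void main(String[] args) {
        put("com.cnn.www", "content:", "<html>...");
        put("com.cnn.www", "anchor:a.com", "CNN");

        // Random access by row key is a log-time lookup...
        System.out.println(table.get("com.cnn.www").get("anchor:a.com")); // CNN

        // ...and a scan is just an in-order walk over a row-key range.
        System.out.println(table.subMap("com.a", "com.z").size()); // 1
    }
}
```

Sorted row keys are what make both fast point gets and range scans cheap, which is exactly the pair of access patterns the "When do I use it?" slide calls out.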
  • How do I use it? HTable table = new HTable(config, "table"); Put p = new Put(Bytes.toBytes("row")); p.add(Bytes.toBytes("family"), Bytes.toBytes("qualifier"), Bytes.toBytes("data")); table.put(p);
  • Dataflow model HBase HDFS Producer Consumer
  • When do I use it? • Serving up large amounts of data • Fast random access • Scan operations
  • Principle #5 HBase: when you need OLAP + OLTP
  • What if it's OLTP? HDFS Pig Sqoop Flume Coordination and workflow management Zookeeper Command Center ASF Projects FLOSS Projects Pivotal Products Oozie MapReduce Hive Tez Giraph Hadoop UI Hue SolrCloud Phoenix HBase Crunch Mahout YARN Hamster YARN
  • GemFire XD HDFS Pig Sqoop Flume Coordination and workflow management Zookeeper Command Center ASF Projects FLOSS Projects Pivotal Products Oozie MapReduce Hive Tez Giraph Hadoop UI Hue SolrCloud Phoenix HBase Crunch Mahout YARN GemFire XD Hamster YARN
  • GemFire XD: a better HBase? • Closed source but extremely mature • SQL/Objects/JSON data model • High concurrency, high update load • Mostly selective point queries (no scans) • Tiered storage architecture
  • YCSB Benchmark: throughput is 2-12X better (bar charts: HBase vs. GemFire XD throughput in ops/sec across workloads AU, BU, CU, D, FU, LOAD with 4, 8, 12, and 16 client threads)
  • YCSB Benchmark: latency is 2X-20X better (bar charts: HBase vs. GemFire XD latency in μsec with 4, 8, 12, and 16 client threads)
  • Principle #6 There are always 3 implementations
  • Querying data • MapReduce: “an assembly language” • Apache Pig: a data manipulation DSL (now Turing complete!) • Apache Hive: a batch-oriented SQL on top of Hadoop
  • How do I use Pig? grunt> A = load './input.txt'; grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word; grunt> C = group B by word; grunt> D = foreach C generate COUNT(B), group;
  • How do I use Hive? CREATE TABLE docs (line STRING); LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
  • Can we short Oracle now? • No indexing • Batch oriented scheduling • Optimization for long running queries • Metadata management is still in flux
  • [Close to] real-time SQL • Impala (inspired by Google’s F1) • Hive/Tez (AKA Stinger) • Facebook’s Presto (Hive’s lineage) • Pivotal’s HAWQ
  • HAWQ • GreenPlum MPP database core • True ANSI SQL support • HDFS storage backend • Parquet support is coming
  • Principle #7 SQL on Hadoop
  • Feeding the elephant
  • Getting data in: Flume • Designed for collecting log data • Flexible deployment topology
  • Sqoop: RDBMS connector • Sqoop 1 • A MapReduce tool • Must use Oozie for workflows • Sqoop 2 • Well, 0.99.x really • A standalone service
  • Spring XD • Unified, distributed, extensible system for data ingestions, real time analytics and data exports • Apache Licensed, not ASF • A runtime service, not a library • AKA “Oozie + Flume + Sqoop + Morphlines”
  • How do I use it? # deployment: ./xd-singlenode $ ./xd-shell xd:> hadoop config fs --namenode hdfs://nn:8020 xd:> stream create --definition "time | hdfs" --name ticktock xd:> stream destroy --name ticktock
  • Feeding the Elephant HDFS Pig Sqoop Flume Coordination and workflow management Zookeeper Command Center ASF Projects FLOSS Projects Pivotal Products Oozie MapReduce Hive Tez Giraph Hadoop UI Hue SolrCloud Phoenix HBase Crunch Mahout YARN GemFire XD SpringXD Hamster YARN
  • Spark the disruptor HDFS Pig Sqoop Flume Coordination and workflow management Zookeeper Command Center ASF Projects FLOSS Projects Pivotal Products GemFireXD Oozie MapReduce Hive Tez Giraph Hadoop UI Hue SolrCloud Phoenix HBase Crunch Mahout Spark Shark Streaming MLib GraphX SpringXD YARN Hamster YARN
  • What’s wrong with MR? (chart omitted; source: UC Berkeley Spark project)
  • Spark innovations • Resilient Distributed Datasets (RDDs) • Distributed on a cluster • Manipulated via parallel operators (map, etc.) • Automatically rebuilt on failure • A parallel ecosystem • A solution to iterative and multi-stage apps
  • RDDs warnings = textFile(…).filter(_.contains("warning")).map(_.split(' ')(1)) (lineage diagram: HadoopRDD [path = hdfs://] → FilteredRDD [contains…] → MappedRDD [split…])
  • Parallel operators • map, reduce • sample, filter • groupBy, reduceByKey • join, leftOuterJoin, rightOuterJoin • union, cross
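The operators listed above have close analogues in `java.util.stream`, which is a convenient way to get a feel for them without a cluster. This is only an analogy: RDD operators are lazy and run distributed across machines, while a local stream runs in one JVM.

```java
import java.util.*;
import java.util.stream.*;

// Local-stream analogues of Spark's parallel operators (illustration only).
public class OperatorsSketch {
    public static void main(String[] args) {
        List<String> lines = List.of("INFO ok", "warning disk", "warning net");

        // filter + map, like the RDD example on the earlier slide:
        List<String> warned = lines.stream()
            .filter(l -> l.contains("warning"))
            .map(l -> l.split(" ")[1])
            .collect(Collectors.toList());
        System.out.println(warned); // [disk, net]

        // groupBy / reduceByKey analogue: count lines by first token.
        Map<String, Long> byLevel = lines.stream()
            .collect(Collectors.groupingBy(l -> l.split(" ")[0], Collectors.counting()));
        System.out.println(byLevel.get("warning")); // 2
    }
}
```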
  • An alternative backend • Shark: a Hive on Spark • Spork: a Pig on Spark • MLlib: machine learning on Spark • GraphX: graph processing on Spark • Also featuring its own streaming engine
  • How do I use it? val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  • Principle #8 Spark is the technology of 2014
  • Where’s the cloud?
  • What’s new? • True elasticity • Resource partitioning • Security • Data marketplace • Data leaks/breaches
  • Hadoop Maturity ETL Offload Accommodate massive  data growth with existing EDW investments Data Lakes Unify Unstructured and Structured Data Access Big Data Apps Build analytic-led applications impacting  top line revenue Data-Driven Enterprise App Dev and Operational Management on HDFS Data Architecture
  • Pivotal HD on Pivotal CF • Enterprise PaaS Management System • Flexible multi-language ‘buildpack’ architecture • Deployed applications enjoy built-in services • On-Premise Hadoop as a Service • Single cluster deployment of Pivotal HD • Developers instantly bind to shared Hadoop Clusters • Speeds up time-to-value
  • Pivotal Data Fabric Evolution Analytic Data Marts SQL Services Operational Intelligence In-Memory Database Run-Time Applications Data Staging Platform Data Mgmt. Services Pivotal Data Platform Stream  Ingestion Streaming Services Software-Defined Datacenter New Data-fabrics In-Memory Grid ...ETC
  • Principle #9 Hadoop in the Cloud is one of many distributed frameworks
  • 2014 is the year of Hadoop HDFS Pig Sqoop Flume Coordination and workflow management Zookeeper Command Center ASF Projects FLOSS Projects Pivotal Products GemFire XD Oozie MapReduce Hive Tez Giraph Hadoop UI Hue SolrCloud Phoenix HBase Crunch Mahout Spark Shark Streaming MLib GraphX Impala HAWQ SpringXD MADlib Hamster PivotalR YARN
  • A NEW PLATFORM FOR A NEW ERA
  • Credits • Apache Software Foundation • Milind Bhandarkar • Konstantin Boudnik • Robert Geiger • Susheel Kaushik • Mak Gokhale
  • Questions ?