Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apache Hive. ALAN GATES at Big Data Spain 2012

Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/coordinating-many-tools-of-big-data/alan-gates

  • This is how we tend to think of Big Data
  • Limited in a couple of ways:
    – Scalability is limited by being on one machine, or on a small cluster that counts on all participants being up
    – Hard to apply different types of processing without moving data around
  • Hive is the only SQL-based app in this pile. Other apps are still in the picture; it's not as if Hadoop is displacing everything.

Presentation Transcript

  • Coordinating the Many Tools of Big Data
    Big Data Spain 2012
    http://www.bigdataspain.org/
    Alan F. Gates
    @alanfgates
  • Big Data = Terabytes, Petabytes, …
    Image Credit: Gizmodo
    © Hortonworks 2012
  • But It Is Also Complex Algorithms
    • An example from a talk by Jimmy Lin at Hadoop Summit 2012 on calculations Twitter is doing via UDFs (user defined functions) in Pig. This equation uses stochastic gradient descent to do machine learning across their data:
      $w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla \ell(f(x; w^{(t)}), y)$
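The update rule on this slide can be tried on a toy problem. The following is a minimal sketch in plain Python, not the Pig UDFs from the talk. It assumes a one-weight linear model f(x; w) = w * x, a squared-error loss, and a constant step size; the function name `sgd_linear` and the sample data are made up for illustration.

```python
import random

def sgd_linear(samples, gamma=0.05, epochs=200, seed=0):
    """Fit y ~ w * x with stochastic gradient descent.

    Assumptions (not from the talk): a single weight, squared-error loss
    l(f, y) = (f - y)^2 / 2, and a constant step size gamma.
    """
    rng = random.Random(seed)
    data = list(samples)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)              # visit the examples in random order
        for x, y in data:
            grad = (w * x - y) * x     # gradient of the loss with respect to w
            w = w - gamma * grad       # w(t+1) = w(t) - gamma * gradient
    return w

# Hypothetical data lying exactly on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = sgd_linear(points)
```

Each step looks at one example only, which is what makes the method attractive when the data is too large to process in a single pass on one machine.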
  • Pre-Cloud: One Tool per Machine
    • Databases presented SQL or SQL-like paradigms for operating on data
    • Other tools came in separate packages (e.g. R) or on separate platforms (e.g. SAS)
    [Diagram: separate systems labeled Data Mart, Statistical Analysis, Data Warehouse, Cube/MOLAP, OLTP]
  • Cloud: Many Tools, One Platform
    • Users no longer want to be concerned with what platform their data is in; they just want to apply the tool to it
    • SQL is no longer the only or primary data access tool
    [Diagram: Statistical Analysis, Data Mart, Data Warehouse, Cube/MOLAP and OLTP on one shared platform]
  • Upside - Pick the Right Tool for the Job
  • Downside – Tools Don’t Play Well Together
    • Hard for users to share data between tools
      – Different storage formats
      – Different data models
      – Different user defined function interfaces
  • Downside – Wasted Developer Time
    • Wastes developer time, since each tool supplies the same functionality redundantly
    [Diagram: Hive and Pig each with their own Parser, Optimizer, Physical Planner and Executor, plus a Metadata box; the duplicated components are marked "Overlap"]
  • Conclusion: We Need Services
    • We need to find a way to share services where we can.
    • Gives users the same experience across tools
    • Allows developers to share effort when it makes sense
  • Hadoop = Distributed Data Operating System
    Service               | Hadoop Component             | Single Node Analogue
    Table Management      | HCatalog                     | RDBMS
    User access control   | Hadoop                       | /etc/passwd, file system permissions, etc.
    Resource management   | YARN                         | Process management
    Notification          | HCatalog                     | Signals, semaphores, mutexes
    REST/Connectors       | HCatalog, Hive, HBase, Oozie | Network layer
    Batch data processing | Data Virtual Machine         | JVM
    Legend: Exists; Pieces exist in this component; To be built
  • HCatalog – Table Management
    • Opens up Hive’s tables to other tools inside and outside Hadoop
    • Presents tools with a table paradigm that abstracts away storage details
    • Provides a shared data model
    • Provides a shared code path for data and metadata access
  • Data Access Without HCatalog
    [Diagram: MapReduce reads and writes through InputFormat/OutputFormat; Hive through a SerDe, InputFormat/OutputFormat and the Metastore Client; Pig through Load/Store functions. The data sits in HDFS and the metadata in the Metastore.]
  • Data & Metadata Access With HCatalog
    [Diagram: MapReduce uses HCatInputFormat/HCatOutputFormat and Pig uses HCatLoader/HCatStorer; together with Hive they share the SerDe, InputFormat/OutputFormat and Metastore Client over HDFS and the Metastore. An External System connects through REST.]
  • Without HCatalog
    Feature       | MapReduce       | Pig                                           | Hive
    Record format | Key value pairs | Tuple                                         | Record
    Data model    | User defined    | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists
    Schema        | Encoded in app  | Declared in script or read by loader          | Read from metadata
    Data location | Encoded in app  | Declared in script                            | Read from metadata
    Data format   | Encoded in app  | Declared in script                            | Read from metadata
  • With HCatalog
    Feature       | MapReduce + HCatalog                     | Pig + HCatalog                                | Hive
    Record format | Record                                   | Tuple                                         | Record
    Data model    | int, float, string, maps, structs, lists | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists
    Schema        | Read from metadata                       | Read from metadata                            | Read from metadata
    Data location | Read from metadata                       | Read from metadata                            | Read from metadata
    Data format   | Read from metadata                       | Read from metadata                            | Read from metadata
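The change between the two comparison tables, from schema, location and format being encoded per application to being read from shared metadata, can be illustrated with a toy lookup service. This is a sketch only: `ToyCatalog`, `TableInfo`, `describe_read` and the `clicks` table are hypothetical names, and nothing here is the HCatalog API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TableInfo:
    schema: tuple        # (column name, type name) pairs
    location: str        # where the data lives
    data_format: str     # how the data is stored

class ToyCatalog:
    """Hypothetical stand-in for a shared table-metadata service."""

    def __init__(self):
        self._tables = {}

    def register(self, name, info):
        self._tables[name] = info

    def lookup(self, name):
        return self._tables[name]

def describe_read(tool, catalog, table):
    """Everything a tool needs before reading, taken from the catalog."""
    info = catalog.lookup(table)
    columns = ", ".join(f"{name}:{type_}" for name, type_ in info.schema)
    return f"{tool} reads {table}({columns}) from {info.location} as {info.data_format}"

catalog = ToyCatalog()
catalog.register(
    "clicks",
    TableInfo((("user", "string"), ("ts", "int")), "/data/clicks", "text"),
)
pig_view = describe_read("pig", catalog, "clicks")
mr_view = describe_read("mapreduce", catalog, "clicks")
```

Both callers name only the table. If the table is registered again with a new location or format, neither caller has to change.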
  • YARN – Resource Manager
    • Hadoop 1.0: HDFS plus MapReduce
    • Hadoop 2.0: HDFS plus the YARN Resource Manager, an interface for developers to write parallel applications on top of the Hadoop cluster
    • The Resource Manager provides:
      – applications a way to request resources in the cluster
      – allocation and scheduling of machine resources to the applications
    • MapReduce is now an application provided inside YARN
    • Other systems have been ported to YARN, such as Spark (a cluster computing system that focuses on in-memory operations) and Storm (streaming computations)
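The two duties listed for the Resource Manager, taking resource requests from applications and allocating cluster resources to them, can be sketched as a toy first-come, first-served scheduler. This is an illustration under simplifying assumptions (memory is the only resource, one queue, no nodes); `ToyResourceManager` and the job names are hypothetical and this is not the YARN API.

```python
from collections import deque

class ToyResourceManager:
    """Hypothetical scheduler: grants memory requests in arrival order."""

    def __init__(self, total_mb):
        self.free_mb = total_mb
        self.pending = deque()   # requests waiting for capacity
        self.granted = []        # requests currently holding memory

    def request(self, app, mb):
        self.pending.append((app, mb))
        self._schedule()

    def release(self, app, mb):
        self.granted.remove((app, mb))
        self.free_mb += mb
        self._schedule()

    def _schedule(self):
        # Grant requests in arrival order while the head of the queue fits.
        while self.pending and self.pending[0][1] <= self.free_mb:
            app, mb = self.pending.popleft()
            self.free_mb -= mb
            self.granted.append((app, mb))

rm = ToyResourceManager(total_mb=4096)
rm.request("mapreduce-job", 3072)
rm.request("streaming-job", 2048)    # has to wait: only 1024 MB are free
waiting = list(rm.pending)
rm.release("mapreduce-job", 3072)    # frees memory, so the queue drains
```

The scheduler does not care what kind of application is asking, which is the sense in which MapReduce becomes just one application among others.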
  • Architectural Comparison
    [Diagram: Hadoop 1.0 and Hadoop 2.0 side by side]
  • Data Virtual Machine – Shared Batch Processing
    • Recall our previous diagram of Pig and Hive
    [Diagram: Hive and Pig each with their own Parser, Optimizer, Physical Planner and Executor, plus a Metadata box; the duplicated components are marked "Overlap"]
  • A VM That Provides
    • Standard operators (equivalent of Java byte codes):
      – Project
      – Select
      – Join
      – Aggregate
      – Sort
      – …
    • An optimizer that could
      – Choose appropriate implementation of an operator based on physical data characteristics
      – Dynamically re-optimize the plan based on information gathered executing the plan
    • Shared execution layer
      – Can provide its own YARN application master and improve on MapReduce paradigm for batch processing
    • Shared User Defined Function (UDF) framework
      – user code works across systems
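The idea of a small set of standard operators that several front ends could target can be sketched with in-memory lists of dictionaries. This is a toy illustration, not a design from the talk: the function names mirror the operator list on the slide, and the `users` and `clicks` data are made up.

```python
from collections import defaultdict

def select(rows, predicate):
    """Keep the rows for which the predicate holds."""
    return [r for r in rows if predicate(r)]

def project(rows, columns):
    """Keep only the named columns."""
    return [{c: r[c] for c in columns} for r in rows]

def join(left, right, key):
    """Inner equi-join on a shared key column."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def aggregate(rows, key, column, fn):
    """Group by key and reduce one column with fn."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r[column])
    return [{key: k, column: fn(v)} for k, v in groups.items()]

def sort(rows, column):
    """Order rows by one column, ascending."""
    return sorted(rows, key=lambda r: r[column])

users = [{"uid": 1, "name": "ana"}, {"uid": 2, "name": "bo"}]
clicks = [{"uid": 1, "n": 3}, {"uid": 2, "n": 5}, {"uid": 1, "n": 4}, {"uid": 3, "n": 9}]

# A plan that either a SQL-like or a Pig Latin-like front end could emit.
filtered = select(clicks, lambda r: r["n"] > 3)
joined = join(filtered, users, "uid")
result = sort(aggregate(project(joined, ["name", "n"]), "name", "n", sum), "n")
```

Because the plan is just a composition of operators, an optimizer sitting underneath could swap in a different implementation of `join` without the front end changing.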
  • Taking Advantage of YARN – MR*
    [Diagram, built up over three slides: Map and Reduce stages of chained MapReduce jobs with HDFS between them; the Map stages of the downstream jobs are annotated "Why do I need these maps?"]
    • Removed an entire write/read cycle of HDFS
    • Still want to checkpoint sometimes
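One way to read these slides is that a second Reduce stage can consume the first Reduce's output directly, instead of writing it to HDFS and reading it back through a pass-through Map. The sketch below runs such a Map, Reduce, Reduce chain in memory. It is an illustration only: the two-step job (count words, then group words by count) and all function names are made up.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group (key, value) pairs by key, as a shuffle phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def map_words(lines):
    """Map: emit (word, 1) for every word."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_count(groups):
    """First reduce: emit (count, word) so the next stage can group by count."""
    for word, ones in groups.items():
        yield sum(ones), word

def reduce_group(groups):
    """Second reduce: fed by the first reduce, with no map stage in between."""
    return {count: sorted(words) for count, words in groups.items()}

lines = ["a b a", "c b a"]
result = reduce_group(shuffle(reduce_count(shuffle(map_words(lines)))))
```

In a chain of classic MapReduce jobs, the output of `reduce_count` would be written to HDFS and re-read by a map that does nothing but pass records through.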
  • Taking Advantage of YARN – In Memory Data Transfer
    [Diagram: Map and Reduce stages; the transfers between them are annotated "These are writes to disk"]
    Switching shuffle to in memory instead of on disk:
    • Better performance
    • Data must also be spilled to disk for retry-ability and to handle memory overflow
    • Will benefit from stronger guarantees of simultaneous execution
  • On the Fly Optimization
    • Traditionally databases do all optimization up front based on statistics
      – But often there are no statistics in Hadoop
      – Languages like Pig Latin allow very long series of operations that make up-front estimates unrealistic
    • Observation: as the system operates on the data it can gather basic statistics and change the subsequent operators based on this information
    [Diagram, built up over three slides: MR jobs feeding a Hash Join; once a job's output is seen to fit in memory, the plan changes to load that output into the distributed cache and run a Map-side Join]
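The re-planning step in the diagram can be sketched as a decision taken after an upstream stage has run and its output size is known. This is a toy under stated assumptions: the size statistic is a row count, the memory budget is a row limit, and `choose_join`, `map_side_join`, `partitioned_hash_join` and `run_plan` are hypothetical names, not functions from Pig or Hive.

```python
def choose_join(observed_rows, memory_budget_rows):
    """Pick the next operator from a statistic gathered at run time."""
    if observed_rows <= memory_budget_rows:
        return "map_side_join"    # small input can be shipped to every task
    return "hash_join"            # otherwise keep the partitioned join

def map_side_join(big, small, key):
    """Hold the small input fully in memory and stream the big one past it."""
    lookup = {}
    for row in small:
        lookup.setdefault(row[key], []).append(row)
    return [{**b, **s} for b in big for s in lookup.get(b[key], [])]

def partitioned_hash_join(left, right, key, partitions=4):
    """Split both inputs by key hash and join partition by partition."""
    out = []
    for p in range(partitions):
        l_part = [r for r in left if hash(r[key]) % partitions == p]
        r_part = [r for r in right if hash(r[key]) % partitions == p]
        out.extend(map_side_join(l_part, r_part, key))
    return out

def run_plan(big, upstream_output, key, memory_budget_rows):
    # The row count is only known once the upstream stage has finished.
    strategy = choose_join(len(upstream_output), memory_budget_rows)
    if strategy == "map_side_join":
        return strategy, map_side_join(big, upstream_output, key)
    return strategy, partitioned_hash_join(big, upstream_output, key)

big = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
small = [{"k": 1, "b": "p"}]
fits = run_plan(big, small, "k", memory_budget_rows=10)
too_big = run_plan(big, small, "k", memory_budget_rows=0)
```

Both strategies return the same rows; only the physical operator differs, which is why the choice can safely be deferred until the statistic is available.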
  • Thank You Big Data Spain