Hadoop & Greenplum: Why Do Such a Thing?

Greenplum is using Hadoop in several interesting ways as part of a larger big data architecture with EMC Greenplum Database (a scale-out MPP SQL database) and EMC Isilon (a scale-out network-attached storage appliance). After a quick introduction of Greenplum Database and Isilon, I list some ways Greenplum is tightly integrating with Hadoop and why we would want to do such a thing. Integration points discussed include: Greenplum Database external tables to seamlessly access data in HDFS, querying HBase tables natively from Greenplum Database, Greenplum Database having its underlying storage on HDFS, and Isilon OneFS as a seamless replacement for HDFS.

  • Great stuff, Donald. I worked with Scott Yara briefly before. Is he still with EMC, or has he left? He was CTO at Greenplum. Thanks, Kahn

Hadoop & Greenplum: Why Do Such a Thing? Presentation Transcript

  • Greenplum & Hadoop: Why do such a thing?
    Donald Miner, Solutions Architect, Advanced Technologies Group
    Donald.Miner@emc.com
    © Copyright 2012 EMC Corporation. All rights reserved.
  • QUICK INTRODUCTION TO GREENPLUM DATABASE
  • GREENPLUM DATABASE: Greenplum Database Basics
    – Massively Parallel Processing (MPP) database
    – Uses commodity hardware
    – Data is distributed by a user-defined “distribution key”
    – Master node delegates queries to segments
    – 1:1 segment and master mirroring for redundancy
    [Diagram: two Master nodes above four Segment nodes]
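The distribution-key idea on this slide can be sketched in a few lines of Python. This is an illustration only: the hash function (MD5 via `hashlib`), the segment count, and the name `segment_for` are assumptions made for the example and are not Greenplum's actual hashing scheme.

```python
import hashlib
from collections import Counter

NUM_SEGMENTS = 4  # assumed; the slide's diagram shows four segments


def segment_for(distribution_key, num_segments=NUM_SEGMENTS):
    """Map a distribution-key value to a segment index (hypothetical scheme)."""
    digest = hashlib.md5(str(distribution_key).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap to a segment.
    return int.from_bytes(digest[:8], "big") % num_segments


# The same key always maps to the same segment, while many distinct
# keys spread across all of the segments.
placement = Counter(segment_for(household_id) for household_id in range(10_000))
```

A key with many distinct values spreads rows evenly; a low-cardinality key would concentrate rows on a few segments and leave the others idle.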
  • GREENPLUM DATABASE: Greenplum Database Features
    – Full SQL support based on PostgreSQL 8.2
    – Columnar or row-oriented storage with compression
    – Multi-level table partitioning with query-time partition pruning
    – B-tree and bitmap indexes
    – JDBC, ODBC, OLEDB, etc. interfaces
    – High-speed, parallel bulk ingest
    – Parallel query optimizer
    – External tables
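Query-time partition pruning means the planner skips partitions whose key range cannot contain matching rows. A minimal Python sketch of that idea follows; the partition names, the monthly ranges, and the function `prune` are hypothetical and do not reflect how Greenplum represents partitions internally.

```python
from datetime import date

# Hypothetical monthly partitions: (name, first day, first day of next month).
PARTITIONS = [
    ("sales_2012_01", date(2012, 1, 1), date(2012, 2, 1)),
    ("sales_2012_02", date(2012, 2, 1), date(2012, 3, 1)),
    ("sales_2012_03", date(2012, 3, 1), date(2012, 4, 1)),
]


def prune(partitions, query_start, query_end):
    """Return names of partitions whose [start, end) range overlaps
    the half-open query range [query_start, query_end)."""
    return [
        name
        for name, start, end in partitions
        if start < query_end and query_start < end
    ]
```

A query restricted to ten days in February would then touch one partition instead of three.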
  • GREENPLUM DATABASE: MADlib Analytics with Greenplum
    – Scalable and in-database
    – Mathematical, statistical, machine learning
    – Active open source project
      > SELECT householdID, variables FROM households
        ORDER BY RANDOM() LIMIT 100000;
      > SELECT run_univariate_analysis(households_training, variables)
        WHERE pvalue < .01 AND r2 > .01;
      > SELECT run_regression(univariate_results, households_training);
      > SELECT householdID,
               madlib.array_dot(coef::REAL[], xmatrix::REAL[])
        FROM coefficients, households;
  • GREENPLUM DATABASE: MADlib In-Database Analytical Functions
    – Descriptive Statistics:
        – Quantile
        – Profile
        – CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
        – FM (Flajolet-Martin) Sketch-based Estimator
        – MFV (Most Frequent Values) Sketch-based Estimator
        – Frequency
        – Histogram
        – Bar Chart
        – Box Plot Chart
    – Modeling:
        – Correlation Matrix
        – Association Rule Mining
        – K-Means Clustering
        – Naïve Bayes Classification
        – Linear Regression
        – Logistic Regression
        – Support Vector Machines
        – SVD Matrix Factorisation
        – Decision Trees/CART
        – Latent Dirichlet Allocation Topic Modeling
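For readers unfamiliar with the sketch-based estimators listed on this slide, here is a minimal Count-Min sketch in Python. It shows the general technique only; the class name, the default width and depth, and the use of SHA-256 as the hash family are assumptions for this example and say nothing about how MADlib implements its estimator.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter that never underestimates a count."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        data = f"{row}:{item}".encode("utf-8")
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))
```

The table has a fixed size regardless of how many distinct items are counted, which is why this kind of structure suits large, distributed data sets.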
  • GREENPLUM DATABASE: PostGIS Support in Greenplum DB
    – PostGIS adds support for geographic objects in PostgreSQL
    – http://postgis.refractions.net/
    – Example: find all records within 25 miles of hurricane path
      select customer_id, ST_AsText(lat_lon), phone_num
      from clients
      where ST_DWithin(
              lat_lon,
              ST_GeometryFromText(
                'LINESTRING(-79.3 17, -79.3 17.1, -79.3 17.3, -79.7 17.6,
                            -79.6 17.4, -79.6 16.8, -79.9 15.8, -80.2 15.8,
                            -80 15.7, -80 15.7, -80.2 15.9, -80.6 16.5,
                            -81.1 16.7, -81.8 16.7, -82.1 16.8, -82.5 17.2,
                            -83.9 17.9, -85.2 18.3, -85.5 18.4)',
                4326),
              25.0/3959.0 * 180.0/PI());

       customer_id |          st_astext          | phone_num
      -------------+-----------------------------+------------
            493140 | POINT(-80.040397 26.570613) | 1231231234
            192401 | POINT(-81.820933 26.242611) | 2342342345
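The last argument in the query, `25.0/3959.0 * 180.0/PI()`, converts the 25-mile radius into degrees, because distances on SRID 4326 geometries are measured in degrees. The Python sketch below reproduces that arithmetic; the function name `miles_to_degrees` is made up for the example, and the conversion treats the Earth as a sphere, so it is only an approximation (a degree of longitude covers less ground away from the equator).

```python
import math

EARTH_RADIUS_MILES = 3959.0  # the radius used in the slide's query


def miles_to_degrees(miles, radius_miles=EARTH_RADIUS_MILES):
    """Convert a surface distance in miles to degrees of arc on a sphere."""
    # miles / radius is the angle in radians; math.degrees multiplies by 180/pi.
    return math.degrees(miles / radius_miles)


radius_in_degrees = miles_to_degrees(25.0)  # roughly 0.36 degrees
```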
  • GREENPLUM DATABASE: Solr integration with GPDB
    – Solr is an open source enterprise search engine
    – Enables in-database text indexing and search
      select t.id, q.score, t.message_text
      from message t,
           gptext.search(
             'twitter.public.message',
             '(iphone and (hate or love))',
             'author_lang:en',
             100
           ) q
      where t.id = q.id
      order by score desc;

          id     |      score       | message_text
      -----------+------------------+-------------------------------------------
        71552856 | 5.43078422546387 | Hates BBs Love IPhones!
        91373993 | 4.06371879577637 | Its a love hate relationship with iPhone spellcheck
        25444233 | 4.05911064147949 | #iPhone autocorrect is a love/hate relationship...
       120166038 | 3.39410924911499 | Love the new iPhone 4s, hate @ATT service #Verizonhereicome
       117498183 | 3.39181470870972 | I got a love-hate relationship for my iPhone!!!
        86416378 | 3.39180779457092 | Absolutely love the new iPhone, but Siri seems to hate me..
  • GREENPLUM HADOOP
  • GREENPLUM HADOOP: Greenplum “HD”
    – Bundled open source
    – HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Mahout
  • GREENPLUM HADOOP: Greenplum “MR”
    – Bundled MapR, a commercial version of Hadoop
    – API compatible with traditional Hadoop
    – MapR improvements over Hadoop:
        – Improved control system
        – Major portions of HDFS re-implemented in C++
        – HDFS is NFS mountable
        – Improved shuffle and sort
        – Distributed NameNode
        – Supports large number of files
        – Mirroring, snapshot capability
  • Why do such a thing? Greenplum DB
    [Diagram: a spectrum from STRUCTURED through SEMISTRUCTURED to UNSTRUCTURED data, labeled with MADlib, Partitioning, GP Solr/Lucene, SQL, Indexing, Text objects, RDBMS, PostGIS, GP MapReduce, Tables and Schemas]
  • Why do such a thing? Hadoop
    [Diagram: the same STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED spectrum, labeled with Schema on load, MapReduce, Hive, Pig, XML, JSON, …, Flat files]
  • Why do such a thing? HBase
    [Diagram: the same STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED spectrum, labeled with Row keys, Flexible schema, Hive, MapReduce, Pig, HBase Tables]
  • Why do such a thing? Hybrid architecture with all three (or two…)
    [Diagram: the three previous diagrams overlaid on one STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED spectrum, labeled with MADlib, Partitioning, GP Solr/Lucene, SQL, Indexing, Text objects, RDBMS, PostGIS, GP MapReduce, Tables and Schemas, Row keys, Flexible schema, HBase Tables, Schema on load, MapReduce, Hive, Pig, XML, JSON, …, Flat files]
  • Greenplum Unified Analytics Platform
  • Hadoop External Tables in GPDB
    – External tables bring external data into the database.
    – Native support for HDFS with parallelized loading.
    – Can write to HDFS or read from HDFS.
      > CREATE EXTERNAL TABLE hdfs_document_feature (
            docid integer, term text, freq integer)
        LOCATION ('gphdfs://namenode:9000/user/don/docs/part-*')
        FORMAT 'text' (delimiter '|');
      > SELECT COUNT(*)
        FROM hdfs_document_feature h, gpdb_words g
        WHERE h.term = g.word;
      > -- hdfs_export is assumed to be a WRITABLE external table
        INSERT INTO hdfs_export SELECT * FROM gpdb_source;
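To make the data flow concrete, the Python sketch below mimics what the external table and the `COUNT(*)` join do logically: it parses pipe-delimited `docid|term|freq` rows and counts the rows whose term matches a word list. It is a single-process illustration with made-up function names and sample data; it does not read HDFS and is not how Greenplum executes the query.

```python
def parse_document_features(lines, delimiter="|"):
    """Parse 'docid|term|freq' text rows into (int, str, int) tuples."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        docid, term, freq = line.split(delimiter)
        rows.append((int(docid), term, int(freq)))
    return rows


def count_matches(feature_rows, words):
    """Inner-join COUNT(*): each feature row counts once per matching word row."""
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1
    return sum(word_counts.get(term, 0) for _, term, _ in feature_rows)


# Hypothetical sample data standing in for part-* files and gpdb_words.
sample_lines = ["1|hadoop|3\n", "1|greenplum|2\n", "2|hadoop|1\n", "\n"]
features = parse_document_features(sample_lines)
matches = count_matches(features, ["hadoop", "isilon"])
```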
  • Why do such a thing?
    – Many of the same use cases as an HBase/Hadoop environment
    – Use Hadoop as a data groomer
    – Do rollups in Hadoop and store results in GPDB
    – Use the best tool for the job (structured vs. unstructured)
    – Use GPDB to host data sets in a more real-time layer for ad-hoc analytics
  • EMC Isilon
    – Hardware appliance for scale-out network-attached storage (NAS)
    – Stripes data across all nodes
    – Uses InfiniBand for intra-cluster communication
    – Up to 15.5PB total storage
    – 3 different hardware configurations to handle different workloads
    – Uses “OneFS”, Isilon’s operating system and file system
    – Interfaces with iSCSI, NFS, CIFS, HTTP, HDFS, and a few more.
  • Isilon HDFS interface
    – Isilon is able to “pretend” to be an HDFS cluster: it mimics the NameNode and DataNode protocols to host data.
    – Underlying system is OneFS and does not follow the traditional HDFS scheme.
    – Point HDFS clients (MapReduce, command line, etc.) to any IP in the Isilon cluster.
  • Pros & Cons
    – Isilon is more dense
    – Isilon can be mounted via a number of protocols
        – Easier ingest / egress
        – Raw data accessible by applications
    – Isilon is easy to manage
    – Free of certain HDFS limitations
    – Isilon loses data locality (~250MB/sec throughput per node over network)
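The ~250MB/sec per-node figure lends itself to a quick back-of-the-envelope estimate of how long a full scan takes when every byte crosses the network. The Python sketch below assumes decimal megabytes, perfectly even parallelism, and the network as the only bottleneck; the function name and the 10 TB, 10-node example are illustrative and are not measurements from the talk.

```python
def scan_seconds(dataset_bytes, nodes, mb_per_sec_per_node=250):
    """Estimate seconds to read a data set over the network at a fixed per-node rate."""
    aggregate_bytes_per_sec = nodes * mb_per_sec_per_node * 1_000_000
    return dataset_bytes / aggregate_bytes_per_sec


# Example: 10 TB read by 10 compute nodes at 250 MB/sec each.
ten_tb_scan = scan_seconds(10 * 10**12, nodes=10)  # 4000 seconds, about 67 minutes
```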
  • Why do such a thing?
    – Hadoop backup or archive: more dense than HDFS, more accessible than tape, no need for compute
    – Complete HDFS replacement: more dense, more accessible, utilize existing Isilon, slower per terabyte of storage
    – Hot/warm storage: use HDFS as primary, but Isilon as secondary
    – Storage for original content: use MapReduce to extract metadata from original content, and leave original content in place
  • HBase External Tables in GPDB
    – Project in development
    – Load data in parallel from HBase by specifying table name and column qualifiers
      > CREATE EXTERNAL TABLE hbase_document_feature (
            "HBASEROWKEY" text, "term" text, "freq" integer)
        LOCATION ('gphbase://docfeatures')
        FORMAT 'CUSTOM' (formatter='gpdbwriteable_import');
      > SELECT COUNT(*)
        FROM hbase_document_feature h, gpdb_words g
        WHERE h.term = g.word;
  • HBase External Tables in GPDB: Possible TODO list
    – Specify range of rowkeys
    – Support writes into HBase
    – Specify filter criteria on the external table
      select * from hbase_external where ROWKEY='abc'
    – Accumulo?
  • Why do such a thing?
    – Have HBase store semi-structured data
    – Exploit the strengths of each
    – Use HBase for really really wide tables
    – Use HBase as a scalable archive of raw records
    – Leverage existing HBase applications
  • Greenplum On HDFS
    – Get Greenplum Database to run natively off of HDFS
    – Underlying Greenplum Database data is stored in HDFS
    – Unifies the two platforms further, with no need for external tables
    – Fully supports Greenplum’s append-only tables
    – Early project in R&D
    – Talk will be given by Chang Lei at Yahoo Summit
  • Greenplum On HDFS
    [Diagram: a Master host linked by the Interconnect to five Segment hosts, each running Segment and Segment (Mirror) instances; the segment hosts send Meta Ops to the Namenode and Read/Write to Tables in HDFS filespace on Datanodes, with Datanode replication across Rack1 and Rack2]
  • Why do such a thing?
    – Covers many of the same use cases as Hive
    – Run Hadoop MapReduce over data managed by Greenplum DB
    – Initial results show it is faster than Hive
    – You only have to store your data in one system