Greenplum & Hadoop
Why do such a thing?

Donald Miner
Solutions Architect
Advanced Technologies Group
Donald.Miner@emc.com




© Copyright 2012 EMC Corporation. All rights reserved.
QUICK INTRODUCTION TO GREENPLUM DATABASE




GREENPLUM DATABASE

Greenplum Database Basics
Massively Parallel Processing (MPP) database

Uses commodity hardware

Data is distributed by a user-defined “distribution key”

Master node delegates queries to segments

1:1 segment and master mirroring for redundancy

[Diagram: two master nodes above a row of four segment nodes]
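
How a distribution key spreads rows across segments can be sketched in a few lines. This is only an illustration, not Greenplum's actual hashing: the `segment_for` and `distribute` helpers, the CRC32 hash, and the four-segment count are all assumptions made for the example.

```python
import zlib

NUM_SEGMENTS = 4  # assumed cluster size, matching the four segments in the diagram


def segment_for(distribution_key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a distribution-key value to a segment index (illustrative hash only)."""
    return zlib.crc32(distribution_key.encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments: int = NUM_SEGMENTS):
    """Group rows by the segment that would store them."""
    segments = {i: [] for i in range(num_segments)}
    for row in rows:
        segments[segment_for(str(row[key_column]), num_segments)].append(row)
    return segments


# Hypothetical table of 1000 rows, distributed by customer_id.
rows = [{"customer_id": i, "amount": i * 10} for i in range(1000)]
placement = distribute(rows, "customer_id")
segment_sizes = {seg: len(stored) for seg, stored in placement.items()}
```

Rows with the same key always land on the same segment, which is why joins on the distribution key can run locally on each segment.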





Greenplum Database Features
Full SQL support based on PostgreSQL 8.2

Columnar or row-oriented storage with compression

Multi-level table partitioning with query time partition pruning

B-tree and bitmap indexes

JDBC, ODBC, OLEDB, etc. interfaces

High speed, parallel bulk ingest

Parallel query optimizer

External tables
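
Query-time partition pruning can be pictured as an overlap test between the query's predicate and each partition's bounds. The sketch below assumes hypothetical monthly range partitions and a hypothetical `prune` helper; it shows the idea only and says nothing about how the Greenplum planner implements it.

```python
from datetime import date

# Hypothetical monthly range partitions: name -> [start, end) bounds.
PARTITIONS = {
    "sales_2012_01": (date(2012, 1, 1), date(2012, 2, 1)),
    "sales_2012_02": (date(2012, 2, 1), date(2012, 3, 1)),
    "sales_2012_03": (date(2012, 3, 1), date(2012, 4, 1)),
}


def prune(partitions, query_start, query_end):
    """Return the partitions whose [start, end) range overlaps [query_start, query_end)."""
    return [
        name
        for name, (start, end) in sorted(partitions.items())
        if start < query_end and query_start < end
    ]


# A query restricted to mid-February only needs to scan one partition.
kept = prune(PARTITIONS, date(2012, 2, 10), date(2012, 2, 20))
```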





MADlib Analytics with Greenplum

Scalable and in-database

Mathematical, statistical, machine learning

Active open source project

 > SELECT householdID, variables
    FROM households
    ORDER BY RANDOM()
    LIMIT 100000;
 > SELECT run_univariate_analysis (
       'households_training',
       'variables');
    WHERE pvalue<.01 AND r2>.01;
 > SELECT run_regression(
       'univariate_results',
       'households_training');
 > SELECT householdID,
       madlib.array_dot(
       coef::REAL[],
       xmatrix::REAL[])
    FROM coefficients, households;
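
The final scoring step applies the fitted coefficients to each household with a dot product. A minimal sketch of that arithmetic, assuming made-up coefficient and feature values (this `array_dot` is a plain Python stand-in, not the MADlib function itself):

```python
def array_dot(coef, x):
    """Dot product of two equal-length numeric sequences."""
    if len(coef) != len(x):
        raise ValueError("coefficient and feature vectors must have the same length")
    return sum(c * v for c, v in zip(coef, x))


# Hypothetical model: an intercept followed by two slopes.
coefficients = [0.5, 2.0, -1.0]

# Hypothetical households: id -> feature vector (leading 1.0 pairs with the intercept).
households = {
    101: [1.0, 3.0, 4.0],
    102: [1.0, 0.0, 2.0],
}

predictions = {hid: array_dot(coefficients, x) for hid, x in households.items()}
```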





MADlib In-Database Analytical Functions

    Descriptive Statistics
        Quantile
        Profile
        CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
        FM (Flajolet-Martin) Sketch-based Estimator
        MFV (Most Frequent Values) Sketch-based Estimator
        Frequency
        Histogram
        Bar Chart
        Box Plot Chart

    Modeling
        Correlation Matrix
        Association Rule Mining
        K-Means Clustering
        Naïve Bayes Classification
        Linear Regression
        Logistic Regression
        Support Vector Machines
        SVD Matrix Factorisation
        Decision Trees/CART
        Latent Dirichlet Allocation Topic Modeling
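
To give a feel for the sketch-based estimators in the list, here is a small Count-Min sketch. It is a generic textbook version, not MADlib's implementation; the class name, the SHA-256-based hashing, and the default width and depth are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter that may overestimate but never underestimates."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only ever inflate a counter, so the minimum is the best guess.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


sketch = CountMinSketch()
for word in ["iphone"] * 5 + ["android"] * 3:
    sketch.add(word)
```

The memory used is fixed by `width` and `depth` regardless of how many distinct items are seen, which is what makes this kind of estimator attractive for in-database use.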





PostGIS Support in Greenplum DB
  PostGIS adds support for geographic objects in PostgreSQL
  http://postgis.refractions.net/

  Example: find all records within 25 miles of hurricane path

 select customer_id, ST_AsText(lat_lon), phone_num
 from clients
 where ST_DWithin(lat_lon, ST_GeometryFromText('LINESTRING(
   -79.3 17, -79.3 17.1, -79.3 17.3, -79.7 17.6, -79.6 17.4, -79.6 16.8, -79.9 15.8,
   -80.2 15.8, -80 15.7, -80 15.7, -80.2 15.9, -80.6 16.5, -81.1 16.7, -81.8 16.7,
   -82.1 16.8, -82.5 17.2, -83.9 17.9, -85.2 18.3, -85.5 18.4)', 4326),
   25.0/3959.0 * 180.0/PI())


 customer_id | st_astext                   | phone_num
 ------------+-----------------------------+------------
 493140      | POINT(-80.040397 26.570613) | 1231231234
 192401      | POINT(-81.820933 26.242611) | 2342342345
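
The last argument in the query, `25.0/3959.0 * 180.0/PI()`, converts 25 miles into degrees, because geometries in SRID 4326 are measured in degrees. The arithmetic can be checked as below; the helper name is hypothetical, and the conversion is an approximation since a degree of longitude covers less ground away from the equator.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # Earth radius in miles, as used in the query


def miles_to_degrees(miles, radius_miles=EARTH_RADIUS_MILES):
    """Convert a surface distance to the angle it subtends at the Earth's centre, in degrees."""
    return math.degrees(miles / radius_miles)


# 25 miles comes out at roughly 0.36 degrees.
tolerance_degrees = miles_to_degrees(25.0)
```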




 Solr integration with GPDB
 Solr is an open source enterprise search engine

 Enable in-database text indexing and search

select
   t.id,
   q.score,
   t.message_text
from
   message t,
   gptext.search(
    'twitter.public.message',
    '(iphone and (hate or love))',
    'author_lang:en',
    100
   ) q
where
   t.id=q.id
order by score desc;

    id     |      score       | message_text
-----------+------------------+-------------------------------------------------------------
  71552856 | 5.43078422546387 | Hates BB's Love IPhones!
  91373993 | 4.06371879577637 | Its a love hate relationship with iPhone spellcheck
  25444233 | 4.05911064147949 | #iPhone autocorrect is a love/hate relationship...
 120166038 | 3.39410924911499 | Love the new iPhone 4s, hate @ATT service #Verizonhereicome
 117498183 | 3.39181470870972 | I got a love-hate relationship for my iPhone!!!
  86416378 | 3.39180779457092 | Absolutely love the new iPhone, but Siri seems to hate me..
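
The boolean part of the search string, `(iphone and (hate or love))`, can be mimicked with plain substring tests. This is a deliberately naive sketch: it has no tokenizing, stemming, or relevance scoring, all of which a real Solr index provides. The `matches` helper and the third, non-matching message are made up for the example.

```python
def matches(message: str) -> bool:
    """Naive stand-in for the query '(iphone and (hate or love))'."""
    text = message.lower()
    return "iphone" in text and ("hate" in text or "love" in text)


messages = {
    71552856: "Hates BB's Love IPhones!",
    25444233: "#iPhone autocorrect is a love/hate relationship...",
    1: "Android battery life is great",  # hypothetical row that should not match
}

hits = sorted(mid for mid, text in messages.items() if matches(text))
```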




GREENPLUM HADOOP





Greenplum “HD”
• Bundled open source

• HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Mahout





Greenplum “MR”
• Bundled MapR, a commercial version of Hadoop
• API compatible with traditional Hadoop
• MapR improvements over Hadoop:
        – Improved control system
        – Major portions of HDFS re-implemented in C++
        – HDFS is NFS mountable
        – Improved shuffle and sort
        – Distributed NameNode
        – Supports large number of files
        – Mirroring, snapshot capability



Why do such a thing?
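Greenplum DB
[Diagram: Greenplum DB capabilities placed along a spectrum from STRUCTURED through SEMISTRUCTURED to UNSTRUCTURED data. From left to right: SQL, Tables and Schemas, RDBMS, MADlib, Partitioning, Indexing, PostGIS, GPMapReduce, GP Solr/Lucene, Text objects]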




Why do such a thing?
Hadoop
[Diagram: Hadoop capabilities placed along the same STRUCTURED, SEMISTRUCTURED, UNSTRUCTURED spectrum. From left to right: Hive, Pig, XML, JSON, …, Schema on load, MapReduce, Flat files]




Why do such a thing?
HBase
[Diagram: HBase capabilities placed along the same STRUCTURED, SEMISTRUCTURED, UNSTRUCTURED spectrum. From left to right: Hive, Pig, Row keys, Flexible schema, HBase Tables, MapReduce]




Why do such a thing?
Hybrid architecture with all three (or two…)
[Diagram: the Greenplum DB, Hadoop, and HBase capabilities overlaid on one STRUCTURED, SEMISTRUCTURED, UNSTRUCTURED spectrum: SQL, RDBMS, Tables and Schemas, MADlib, Partitioning, Indexing, Hive, Pig, Row keys, PostGIS, Flexible schema, HBase Tables, XML, JSON, …, GPMapReduce, GP Solr/Lucene, Text objects, Schema on load, MapReduce, Flat files]




Greenplum Unified Analytics Platform




Hadoop External Tables in GPDB
  External tables bring external data into the database.

  Native support for HDFS with parallelized loading.

  Can write to HDFS or read from HDFS.

 > CREATE EXTERNAL TABLE hdfs_document_feature (
     docid integer,
     term text,
     freq integer)
   LOCATION ('gphdfs://namenode:9000/user/don/docs/part-*')
   FORMAT 'text' (delimiter '|');

 > SELECT COUNT(*)
   FROM hdfs_document_feature h, gpdb_words g
   WHERE h.term = g.word;

 -- hdfs_export must be a writable external table (CREATE WRITABLE EXTERNAL TABLE)
 > INSERT INTO hdfs_export SELECT * FROM gpdb_source;
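
What the external table does with each HDFS part file can be pictured as parsing pipe-delimited rows and joining them against a database table. The sketch below works on an in-memory string instead of HDFS; the sample rows, the `read_document_features` helper, and the contents of `gpdb_words` are all made up for illustration.

```python
import csv
import io

# Hypothetical contents of one part file, laid out as docid|term|freq.
PART_FILE = "1|hadoop|3\n1|greenplum|2\n2|hive|5\n2|hadoop|1\n"


def read_document_features(text):
    """Parse pipe-delimited rows into (docid, term, freq) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    return [(int(docid), term, int(freq)) for docid, term, freq in reader]


gpdb_words = {"hadoop", "greenplum"}  # stand-in for the gpdb_words table (unique words)

features = read_document_features(PART_FILE)

# Equivalent of SELECT COUNT(*) ... WHERE h.term = g.word when words are unique.
match_count = sum(1 for _, term, _ in features if term in gpdb_words)
```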




Why do such a thing?
Many of the same use cases as an HBase/Hadoop environment

Use Hadoop as a data groomer

Do rollups in Hadoop and store results in GPDB

Use the best tool for the job (structured vs. unstructured)

Use GPDB to host data sets in a more real-time layer for ad-hoc analytics
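
The "rollups in Hadoop, results in GPDB" pattern amounts to reducing raw records to a small summary table before loading it. A minimal single-process sketch of such a rollup, assuming a made-up event format of timestamp, event type, and amount (a real job would run as MapReduce, Pig, or Hive):

```python
from collections import defaultdict

# Hypothetical raw log lines: timestamp,event_type,amount
raw_events = [
    "2012-06-01T10:15:00,checkout,20.00",
    "2012-06-01T11:00:00,checkout,5.50",
    "2012-06-02T09:00:00,refund,3.25",
]


def rollup(lines):
    """Aggregate raw events into (day, event_type, count, total) summary rows."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        timestamp, event_type, amount = line.split(",")
        bucket = totals[(timestamp[:10], event_type)]  # key on the date part only
        bucket[0] += 1
        bucket[1] += float(amount)
    return sorted(
        (day, event_type, count, round(total, 2))
        for (day, event_type), (count, total) in totals.items()
    )


# These summary rows are what would be loaded into a GPDB table.
summary_rows = rollup(raw_events)
```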




EMC Isilon
    Hardware appliance for scale-out network-attached storage (NAS)
    Stripes data across all nodes
    Uses InfiniBand for intra-cluster communication
    Up to 15.5PB total storage
    3 different hardware configurations to handle different workloads
    Uses “OneFS”, Isilon’s operating system and file system
    Interfaces with iSCSI, NFS, CIFS, HTTP, HDFS, and a few more.



Isilon HDFS interface
    Isilon is able to “pretend” to be an HDFS cluster: it mimics the NameNode and DataNode protocols to host data.
    Underlying system is OneFS and does not follow the traditional HDFS scheme.
    Point HDFS clients (MapReduce, command line, etc.) to any IP in the Isilon cluster.




Pros & Cons
    Isilon is more dense
    Isilon can be mounted via a number of protocols
        – Easier ingest / egress
        – Raw data accessible by applications
    Isilon is easy to manage
    Free of certain HDFS limitations
    Isilon loses data locality (~250MB/sec throughput per node over network)
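
The ~250MB/sec per-node figure gives a quick way to estimate full-scan times over the network. The sketch below assumes throughput scales perfectly with node count and uses binary units (1 GB = 1024 MB); both are simplifications, and the `scan_seconds` helper and the 10 TB over 10 nodes scenario are made up for the example.

```python
def scan_seconds(dataset_gb, nodes, mb_per_sec_per_node=250.0):
    """Estimated time to read a dataset once, assuming perfectly even scaling."""
    total_mb = dataset_gb * 1024.0
    return total_mb / (nodes * mb_per_sec_per_node)


# Hypothetical case: 10 TB (10240 GB) read through 10 nodes takes about 70 minutes.
seconds = scan_seconds(10240, 10)
minutes = seconds / 60.0
```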

Why do such a thing?
    Hadoop backup or archive
     – More dense than HDFS, more accessible than tape, no need for compute
    Complete HDFS replacement
     – More dense, more accessible, utilize existing Isilon, slower per terabyte of storage
    Hot/warm storage
     – Use HDFS as primary, but Isilon as secondary
    Storage for original content
     – Use MapReduce to extract metadata from original content, and leave original content in place

HBase External Tables in GPDB
  Project in development

  Load data in parallel from HBase by specifying table name and column qualifiers


 > CREATE EXTERNAL TABLE hbase_document_feature (
     "HBASEROWKEY" text,
     "term" text,
     "freq" integer)
   LOCATION ('gphbase://docfeatures')
   FORMAT 'CUSTOM' (formatter='gpdbwriteable_import');

 > SELECT COUNT(*)
   FROM hbase_document_feature h, gpdb_words g
   WHERE h.term = g.word;
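
Mapping HBase rows onto a relational table means turning each row key plus its selected column qualifiers into one fixed-width tuple. A rough sketch of that flattening, using a plain dictionary as a stand-in for the HBase table (the sample data and the `to_relational` helper are hypothetical, and this is not the in-development connector's code):

```python
# Hypothetical HBase-like table: row key -> {column qualifier: value}
docfeatures = {
    "doc1|hadoop": {"term": "hadoop", "freq": "3"},
    "doc2|hive": {"term": "hive"},  # sparse row: no "freq" cell stored
}


def to_relational(table, qualifiers):
    """Flatten each row into (rowkey, value per qualifier...), with None for missing cells."""
    rows = []
    for rowkey in sorted(table):
        cells = table[rowkey]
        rows.append((rowkey,) + tuple(cells.get(q) for q in qualifiers))
    return rows


relational_rows = to_relational(docfeatures, ["term", "freq"])
```

Missing cells surface as `None` here, which corresponds to NULL on the database side.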




HBase External Tables in GPDB
Possible TODO list:

                 Specify range of rowkeys

                 Support writes into HBase

                 Specify filter criteria on the external table

                 select * from hbase_external where ROWKEY='abc'

                 Accumulo?
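
The "range of rowkeys" item relies on HBase keeping rows sorted by key, so a range can be read without touching the rest of the table. A small sketch of that idea over a sorted list of keys, with an inclusive start and exclusive stop as in an HBase scan (the `scan_range` helper and the sample keys are made up):

```python
import bisect


def scan_range(sorted_rowkeys, start_key, stop_key):
    """Return the row keys in [start_key, stop_key) from an already sorted list."""
    lo = bisect.bisect_left(sorted_rowkeys, start_key)
    hi = bisect.bisect_left(sorted_rowkeys, stop_key)
    return sorted_rowkeys[lo:hi]


rowkeys = sorted(["abc", "abd", "b01", "b02", "c99"])

# Only keys from "abd" up to, but not including, "c" are returned.
selected = scan_range(rowkeys, "abd", "c")
```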




Why do such a thing?
Have HBase store semi-structured data

Exploit the strengths of each

Use HBase for really really wide tables

Use HBase as a scalable archive of raw records

Leverage existing HBase applications




Greenplum On HDFS

  Get Greenplum Database to run natively off of HDFS

  Underlying Greenplum Database data is stored in HDFS

  Unifies the two platforms further – no need for external tables

  Fully supports Greenplum’s append-only tables


  Early project in R&D

  Talk will be given by Chang Lei at Yahoo Summit




Greenplum On HDFS
[Diagram: a Master host and the Interconnect above five Segment hosts, each running Segments and Segment (Mirror) instances. Below them, "Tables in HDFS filespace": a Namenode (Meta Ops) and Datanodes (Read/Write, replication) spread across Rack1 and Rack2]




Why do such a thing?
Covers many of the same use cases as Hive

Run Hadoop MapReduce over data managed by Greenplum DB

Initial results show it is faster than Hive

You only have to store your data in one system




Hadoop & Greenplum: Why Do Such a Thing?

More Related Content

What's hot

Accelerate Your Cloud Migration Journey.pdf
Accelerate Your Cloud Migration Journey.pdfAccelerate Your Cloud Migration Journey.pdf
Accelerate Your Cloud Migration Journey.pdfAmazon Web Services
 
Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)...
Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)...Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)...
Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)...Amazon Web Services
 
Getting Started with Amazon ElastiCache
Getting Started with Amazon ElastiCacheGetting Started with Amazon ElastiCache
Getting Started with Amazon ElastiCacheAmazon Web Services
 
Oracle Active Data Guard: Best Practices and New Features Deep Dive
Oracle Active Data Guard: Best Practices and New Features Deep Dive Oracle Active Data Guard: Best Practices and New Features Deep Dive
Oracle Active Data Guard: Best Practices and New Features Deep Dive Glen Hawkins
 
MySQL Replication Performance Tuning for Fun and Profit!
MySQL Replication Performance Tuning for Fun and Profit!MySQL Replication Performance Tuning for Fun and Profit!
MySQL Replication Performance Tuning for Fun and Profit!Vitor Oliveira
 
gDBClone - Database Clone “onecommand Automation Tool”
gDBClone - Database Clone “onecommand Automation Tool”gDBClone - Database Clone “onecommand Automation Tool”
gDBClone - Database Clone “onecommand Automation Tool”Ruggero Citton
 
Oracle GoldenGate 18c - REST API Examples
Oracle GoldenGate 18c - REST API ExamplesOracle GoldenGate 18c - REST API Examples
Oracle GoldenGate 18c - REST API ExamplesBobby Curtis
 
Multi Tenancy In The Cloud
Multi Tenancy In The CloudMulti Tenancy In The Cloud
Multi Tenancy In The Cloudrohit_ainapure
 
MySQL InnoDB Cluster and Group Replication in a Nutshell
MySQL InnoDB Cluster and Group Replication in a NutshellMySQL InnoDB Cluster and Group Replication in a Nutshell
MySQL InnoDB Cluster and Group Replication in a NutshellFrederic Descamps
 
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11Kenny Gryp
 
Veeam Solutions for SMB_2022.pptx
Veeam Solutions for SMB_2022.pptxVeeam Solutions for SMB_2022.pptx
Veeam Solutions for SMB_2022.pptxPrince Joseph
 
Unleash the Power of Redis with Amazon ElastiCache
Unleash the Power of Redis with Amazon ElastiCacheUnleash the Power of Redis with Amazon ElastiCache
Unleash the Power of Redis with Amazon ElastiCacheAmazon Web Services
 
MAA for Oracle Database, Exadata and the Cloud
MAA for Oracle Database, Exadata and the CloudMAA for Oracle Database, Exadata and the Cloud
MAA for Oracle Database, Exadata and the CloudMarkus Michalewicz
 
Oracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAsOracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAsGokhan Atil
 
Vce vxrail-customer-presentation new
Vce vxrail-customer-presentation newVce vxrail-customer-presentation new
Vce vxrail-customer-presentation newJennifer Graham
 
Virtual SAN 6.2, hyper-converged infrastructure software
Virtual SAN 6.2, hyper-converged infrastructure softwareVirtual SAN 6.2, hyper-converged infrastructure software
Virtual SAN 6.2, hyper-converged infrastructure softwareDuncan Epping
 
Cloud Migration, Application Modernization and Security for Partners
Cloud Migration, Application Modernization and Security for PartnersCloud Migration, Application Modernization and Security for Partners
Cloud Migration, Application Modernization and Security for PartnersAmazon Web Services
 
Elastic Load Balancing Deep Dive and Best Practices - NET402 - re:Invent 2017
Elastic Load Balancing Deep Dive and Best Practices - NET402 - re:Invent 2017Elastic Load Balancing Deep Dive and Best Practices - NET402 - re:Invent 2017
Elastic Load Balancing Deep Dive and Best Practices - NET402 - re:Invent 2017Amazon Web Services
 
Amazon RDS for MySQL: Best Practices and Migration
Amazon RDS for MySQL: Best Practices and MigrationAmazon RDS for MySQL: Best Practices and Migration
Amazon RDS for MySQL: Best Practices and MigrationAmazon Web Services
 
Private cloud network architecture (2018)
Private cloud network architecture (2018)Private cloud network architecture (2018)
Private cloud network architecture (2018)Gasida Seo
 

What's hot (20)

Accelerate Your Cloud Migration Journey.pdf
Accelerate Your Cloud Migration Journey.pdfAccelerate Your Cloud Migration Journey.pdf
Accelerate Your Cloud Migration Journey.pdf
 
Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)...
Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)...Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)...
Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)...
 
Getting Started with Amazon ElastiCache
Getting Started with Amazon ElastiCacheGetting Started with Amazon ElastiCache
Getting Started with Amazon ElastiCache
 
Oracle Active Data Guard: Best Practices and New Features Deep Dive
Oracle Active Data Guard: Best Practices and New Features Deep Dive Oracle Active Data Guard: Best Practices and New Features Deep Dive
Oracle Active Data Guard: Best Practices and New Features Deep Dive
 
MySQL Replication Performance Tuning for Fun and Profit!
MySQL Replication Performance Tuning for Fun and Profit!MySQL Replication Performance Tuning for Fun and Profit!
MySQL Replication Performance Tuning for Fun and Profit!
 
gDBClone - Database Clone “onecommand Automation Tool”
gDBClone - Database Clone “onecommand Automation Tool”gDBClone - Database Clone “onecommand Automation Tool”
gDBClone - Database Clone “onecommand Automation Tool”
 
Oracle GoldenGate 18c - REST API Examples
Oracle GoldenGate 18c - REST API ExamplesOracle GoldenGate 18c - REST API Examples
Oracle GoldenGate 18c - REST API Examples
 
Multi Tenancy In The Cloud
Multi Tenancy In The CloudMulti Tenancy In The Cloud
Multi Tenancy In The Cloud
 
MySQL InnoDB Cluster and Group Replication in a Nutshell
MySQL InnoDB Cluster and Group Replication in a NutshellMySQL InnoDB Cluster and Group Replication in a Nutshell
MySQL InnoDB Cluster and Group Replication in a Nutshell
 
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11
 
Veeam Solutions for SMB_2022.pptx
Veeam Solutions for SMB_2022.pptxVeeam Solutions for SMB_2022.pptx
Veeam Solutions for SMB_2022.pptx
 
Unleash the Power of Redis with Amazon ElastiCache
Unleash the Power of Redis with Amazon ElastiCacheUnleash the Power of Redis with Amazon ElastiCache
Unleash the Power of Redis with Amazon ElastiCache
 
MAA for Oracle Database, Exadata and the Cloud
MAA for Oracle Database, Exadata and the CloudMAA for Oracle Database, Exadata and the Cloud
MAA for Oracle Database, Exadata and the Cloud
 
Oracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAsOracle Enterprise Manager Cloud Control 13c for DBAs
Oracle Enterprise Manager Cloud Control 13c for DBAs
 
Vce vxrail-customer-presentation new
Vce vxrail-customer-presentation newVce vxrail-customer-presentation new
Vce vxrail-customer-presentation new
 
Virtual SAN 6.2, hyper-converged infrastructure software
Virtual SAN 6.2, hyper-converged infrastructure softwareVirtual SAN 6.2, hyper-converged infrastructure software
Virtual SAN 6.2, hyper-converged infrastructure software
 
Cloud Migration, Application Modernization and Security for Partners
Cloud Migration, Application Modernization and Security for PartnersCloud Migration, Application Modernization and Security for Partners
Cloud Migration, Application Modernization and Security for Partners
 
Elastic Load Balancing Deep Dive and Best Practices - NET402 - re:Invent 2017
Elastic Load Balancing Deep Dive and Best Practices - NET402 - re:Invent 2017Elastic Load Balancing Deep Dive and Best Practices - NET402 - re:Invent 2017
Elastic Load Balancing Deep Dive and Best Practices - NET402 - re:Invent 2017
 
Amazon RDS for MySQL: Best Practices and Migration
Amazon RDS for MySQL: Best Practices and MigrationAmazon RDS for MySQL: Best Practices and Migration
Amazon RDS for MySQL: Best Practices and Migration
 
Private cloud network architecture (2018)
Private cloud network architecture (2018)Private cloud network architecture (2018)
Private cloud network architecture (2018)
 

Viewers also liked

Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview EMC
 
Introduction to Greenplum
Introduction to GreenplumIntroduction to Greenplum
Introduction to GreenplumDave Cramer
 
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급hslkdfjs
 
New Use Cases for DAM in the Enterprise
New Use Cases for DAM in the EnterpriseNew Use Cases for DAM in the Enterprise
New Use Cases for DAM in the EnterpriseNuxeo
 
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB SchemasRemaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB SchemasMongoDB
 
01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기
01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기
01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기설리번 프로젝트
 
Tailings dump recovery concept
Tailings dump recovery conceptTailings dump recovery concept
Tailings dump recovery conceptphillip shambare
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezJan Pieter Posthuma
 
GIS for Infrastructure Management
GIS for Infrastructure ManagementGIS for Infrastructure Management
GIS for Infrastructure ManagementDavid Puckett
 
Real-time, Sensor-based Monitoring of Shipping Containers
Real-time, Sensor-based Monitoring of Shipping ContainersReal-time, Sensor-based Monitoring of Shipping Containers
Real-time, Sensor-based Monitoring of Shipping Containersbenaam
 
Designing your Product as a Platform
Designing your Product as a PlatformDesigning your Product as a Platform
Designing your Product as a PlatformMicah Laaker
 
Airport Billing System for Aviation and Non-Aviation Services
Airport Billing System for Aviation and Non-Aviation Services Airport Billing System for Aviation and Non-Aviation Services
Airport Billing System for Aviation and Non-Aviation Services Ericsson
 
Web Services Automated Testing via SoapUI Tool
Web Services Automated Testing via SoapUI ToolWeb Services Automated Testing via SoapUI Tool
Web Services Automated Testing via SoapUI ToolSperasoft
 

Viewers also liked (20)

Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview
 
Greenplum Architecture
Greenplum ArchitectureGreenplum Architecture
Greenplum Architecture
 
MPP vs Hadoop
MPP vs HadoopMPP vs Hadoop
MPP vs Hadoop
 
Introduction to Greenplum
Introduction to GreenplumIntroduction to Greenplum
Introduction to Greenplum
 
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급
 
New Use Cases for DAM in the Enterprise
New Use Cases for DAM in the EnterpriseNew Use Cases for DAM in the Enterprise
New Use Cases for DAM in the Enterprise
 
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB SchemasRemaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
 
01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기
01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기
01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기
 
Hadoop Cluster Management
Hadoop Cluster ManagementHadoop Cluster Management
Hadoop Cluster Management
 
Tailings dump recovery concept
Tailings dump recovery conceptTailings dump recovery concept
Tailings dump recovery concept
 
Polymer optical fibers
Polymer optical fibersPolymer optical fibers
Polymer optical fibers
 
SAP Cloud for Service
SAP Cloud for ServiceSAP Cloud for Service
SAP Cloud for Service
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
 
GIS for Infrastructure Management
GIS for Infrastructure ManagementGIS for Infrastructure Management
GIS for Infrastructure Management
 
Real-time, Sensor-based Monitoring of Shipping Containers
Real-time, Sensor-based Monitoring of Shipping ContainersReal-time, Sensor-based Monitoring of Shipping Containers
Real-time, Sensor-based Monitoring of Shipping Containers
 
Designing your Product as a Platform
Designing your Product as a PlatformDesigning your Product as a Platform
Designing your Product as a Platform
 
Chem Lab Report (1)
Chem Lab Report (1)Chem Lab Report (1)
Chem Lab Report (1)
 
High-Density Wireless Networks for Auditoriums
High-Density Wireless Networks for AuditoriumsHigh-Density Wireless Networks for Auditoriums
High-Density Wireless Networks for Auditoriums
 
Airport Billing System for Aviation and Non-Aviation Services
Airport Billing System for Aviation and Non-Aviation Services Airport Billing System for Aviation and Non-Aviation Services
Airport Billing System for Aviation and Non-Aviation Services
 
Web Services Automated Testing via SoapUI Tool
Web Services Automated Testing via SoapUI ToolWeb Services Automated Testing via SoapUI Tool
Web Services Automated Testing via SoapUI Tool
 





Hadoop & Greenplum: Why Do Such a Thing?

  • 1. Greenplum & Hadoop: Why do such a thing? Donald Miner, Solutions Architect, Advanced Technologies Group, Donald.Miner@emc.com. © Copyright 2012 EMC Corporation. All rights reserved.
  • 2. QUICK INTRODUCTION TO GREENPLUM DATABASE
  • 3. Greenplum Database Basics
      - Massively Parallel Processing (MPP) database
      - Uses commodity hardware
      - Data is distributed by a user-defined “distribution key”
      - Master node delegates queries to segments
      - 1:1 segment and master mirroring for redundancy
      [Diagram: two Master nodes above four Segment nodes]
  • 4. Greenplum Database Features
      - Full SQL support based on PostgreSQL 8.2
      - Columnar or row-oriented storage with compression
      - Multi-level table partitioning with query-time partition pruning
      - B-tree and bitmap indexes
      - JDBC, ODBC, OLEDB, etc. interfaces
      - High-speed, parallel bulk ingest
      - Parallel query optimizer
      - External tables
  • 5. MADlib Analytics with Greenplum
      - Scalable and in-database
      - Mathematical, statistical, machine learning
      - Active open source project
        > SELECT householdID, variables
            FROM households
            ORDER BY RANDOM()
            LIMIT 100000;
        > SELECT run_univariate_analysis (
            'households_training',
            'variables');
            WHERE pvalue<.01 AND r2>.01;
        > SELECT run_regression(
            'univariate_results',
            'households_training');
        > SELECT householdID,
            madlib.array_dot(
              coef::REAL[],
              xmatrix::REAL[])
            FROM coefficients, households;
  • 6. MADlib In-Database Analytical Functions
      Descriptive Statistics:
      - Quantile
      - Profile
      - CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
      - FM (Flajolet-Martin) Sketch-based Estimator
      - MFV (Most Frequent Values) Sketch-based Estimator
      - Frequency
      - Histogram
      - Bar Chart
      - Box Plot Chart
      Modeling:
      - Correlation Matrix
      - Association Rule Mining
      - K-Means Clustering
      - Naïve Bayes Classification
      - Linear Regression
      - Logistic Regression
      - Support Vector Machines
      - SVD Matrix Factorisation
      - Decision Trees/CART
      - Latent Dirichlet Allocation Topic Modeling
  • 7. PostGIS Support in Greenplum DB
      - PostGIS adds support for geographic objects in PostgreSQL (http://postgis.refractions.net/)
      - Example: find all records within 25 miles of hurricane path
        select customer_id, ST_AsText(lat_lon), phone_num
        from clients
        where ST_DWithin(lat_lon,
            ST_GeometryFromText('LINESTRING(
              -79.3 17, -79.3 17.1, -79.3 17.3, -79.7 17.6, -79.6 17.4,
              -79.6 16.8, -79.9 15.8, -80.2 15.8, -80 15.7, -80 15.7,
              -80.2 15.9, -80.6 16.5, -81.1 16.7, -81.8 16.7, -82.1 16.8,
              -82.5 17.2, -83.9 17.9, -85.2 18.3, -85.5 18.4)', 4326),
            25.0/3959.0 * 180.0/PI())

        customer_id | st_astext                   | phone_num
        ------------+-----------------------------+------------
             493140 | POINT(-80.040397 26.570613) | 1231231234
             192401 | POINT(-81.820933 26.242611) | 2342342345
  • 8. Solr integration with GPDB
      - Solr is an open source enterprise search engine
      - Enable in-database text indexing and search
        select t.id, q.score, t.message_text
        from message t,
             gptext.search(
               'twitter.public.message',
               '(iphone and (hate or love))',
               'author_lang:en', 100
             ) q
        where t.id=q.id
        order by score desc;

        id        | score            | message_text
        ----------+------------------+------------------------------------------------------------
         71552856 | 5.43078422546387 | Hates BB's Love IPhones!
         91373993 | 4.06371879577637 | Its a love hate relationship with iPhone spellcheck
         25444233 | 4.05911064147949 | #iPhone autocorrect is a love/hate relationship...
        120166038 | 3.39410924911499 | Love the new iPhone 4s, hate @ATT service #Verizonhereicome
        117498183 | 3.39181470870972 | I got a love-hate relationship for my iPhone!!!
         86416378 | 3.39180779457092 | Absolutely love the new iPhone, but Siri seems to hate me..
  • 9. GREENPLUM HADOOP
  • 10. Greenplum “HD”
      - Bundled open source
      - HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Mahout
  • 11. Greenplum “MR”
      - Bundled MapR, a commercial version of Hadoop
      - API compatible with traditional Hadoop
      - MapR improvements over Hadoop:
          - Improved control system
          - Major portions of HDFS re-implemented in C++
          - HDFS is NFS mountable
          - Improved shuffle and sort
          - Distributed NameNode
          - Supports large number of files
          - Mirroring, snapshot capability
  • 12. Why do such a thing? Greenplum DB
      [Diagram: capabilities placed along a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED axis. Labels: MADLib; Partitioning; GP Solr/Lucene; SQL; Indexing; Text objects; RDBMS; PostGIS; GPMapReduce; Tables and Schemas]
  • 13. Why do such a thing? Hadoop
      [Diagram: capabilities placed along a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED axis. Labels: Schema on load; MapReduce; Hive; XML, JSON, …; Flat files; Pig]
  • 14. Why do such a thing? HBase
      [Diagram: capabilities placed along a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED axis. Labels: Row keys; Hive; Flexible schema; MapReduce; HBase Tables; Pig]
  • 15. Why do such a thing? Hybrid architecture with all three (or two…)
      [Diagram: the three previous diagrams combined along a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED axis. Labels: MADLib; Partitioning; Row keys; GP Solr/Lucene; SQL; Schema on load; Indexing; Text objects; Flexible schema; MapReduce; RDBMS; Hive; PostGIS; HBase Tables; GPMapReduce; Tables and Schemas; Pig; XML, JSON, …; Flat files]
  • 16. Greenplum Unified Analytics Platform
  • 17. Hadoop External Tables in GPDB
      - External tables bring external data into the database.
      - Native support for HDFS with parallelized loading.
      - Can write to HDFS or read from HDFS.
        > CREATE EXTERNAL TABLE hdfs_document_feature (
            docid integer, term text, freq integer)
            LOCATION ('gphdfs://namenode:9000/user/don/docs/part-*')
            FORMAT 'text' (delimiter '|');
        > SELECT COUNT(*)
            FROM hdfs_document_feature h, gpdb_words g
            WHERE h.term = g.word;
        > WRITE INTO hdfs_export SELECT * FROM gpdb_source;
  • 18. Why do such a thing?
      - Many of the same use cases as an HBase/Hadoop environment
      - Use Hadoop as a data groomer
      - Do rollups in Hadoop and store results in GPDB
      - Use the best tool for the job (structured vs. unstructured)
      - Use GPDB to host data sets in a more real-time layer for ad-hoc analytics
  • 19. EMC Isilon
      - Hardware appliance for scale-out network-attached storage (NAS)
      - Stripes data across all nodes
      - Uses Infiniband for intra-cluster communication
      - Up to 15.5PB total storage
      - 3 different hardware configurations to handle different workloads
      - Uses “OneFS”, Isilon’s operating system and file system
      - Interfaces with iSCSI, NFS, CIFS, HTTP, HDFS, and a few more
  • 20. Isilon HDFS interface
      - Isilon is able to “pretend” to be an HDFS cluster: it mimics the NameNode and DataNode protocols to host data.
      - Underlying system is OneFS and does not follow the traditional HDFS scheme.
      - Point HDFS clients (MapReduce, command line, etc.) to any IP in the Isilon cluster.
  • 21. Pros & Cons
      - Isilon is more dense
      - Isilon can be mounted via a number of protocols
          - Easier ingest / egress
          - Raw data accessible by applications
      - Isilon is easy to manage
      - Free of certain HDFS limitations
      - Isilon loses data locality (~250MB/sec throughput per node over network)
  • 22. Why do such a thing?
      - Hadoop backup or archive
          - More dense than HDFS, more accessible than tape, no need for compute
      - Complete HDFS replacement
          - More dense, more accessible, utilize existing Isilon, slower per terabyte of storage
      - Hot/warm storage
          - Use HDFS as primary, but Isilon as secondary
      - Storage for original content
          - Use MapReduce to extract metadata from original content, and leave original content in place
  • 23. HBase External Tables in GPDB
      - Project in development
      - Load data in parallel from HBase by specifying table name and column qualifiers
        > CREATE EXTERNAL TABLE hbase_document_feature (
            "HBASEROWKEY" text, "term" text, "freq" integer)
            LOCATION ('gphbase://docfeatures')
            FORMAT 'CUSTOM' (formatter='gpdbwriteable_import');
        > SELECT COUNT(*)
            FROM hbase_document_feature h, gpdb_words g
            WHERE h.term = g.word;
  • 24. HBase External Tables in GPDB
      Possible TODO list:
      - Specify range of rowkeys
      - Support writes into HBase
      - Specify filter criteria on the external table
        select * from hbase_external where ROWKEY='abc'
      - Accumulo?
  • 25. Why do such a thing?
      - Have HBase store semi-structured data
      - Exploit the strengths of each
      - Use HBase for really really wide tables
      - Use HBase as a scalable archive of raw records
      - Leverage existing HBase applications
  • 26. Greenplum On HDFS
      - Get Greenplum Database to run natively off of HDFS
      - Underlying Greenplum Database data is stored in HDFS
      - Unifies the two platforms further: no need for external tables
      - Fully supports Greenplum’s append-only tables
      - Early project in R&D
      - Talk will be given by Chang Lei at Yahoo Summit
  • 27. Greenplum On HDFS
      [Diagram labels: Master host; Interconnect; five Segment hosts, each holding Segment and Segment (Mirror) instances; Meta Ops; Read/Write; Tables in HDFS filespace; Namenode; Datanodes with replication; Rack1; Rack2]
  • 28. Why do such a thing?
      - Covers many of the same use cases as Hive
      - Run Hadoop MapReduce over data managed by Greenplum DB
      - Initial results show it is faster than Hive
      - You only have to store your data in one system
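
Slide 3 says that data is spread across segments by a user-defined distribution key. A minimal Python sketch of that idea follows. The CRC32 hash, the four-segment count, and the names `segment_for` and `distribute` are illustrative assumptions; this is not Greenplum's actual placement algorithm.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # assumption: matches the four segments drawn on slide 3


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution-key value to a segment index.

    CRC32 is a stand-in hash chosen because it is deterministic;
    it is not the hash function Greenplum uses.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments=NUM_SEGMENTS):
    """Group rows by the segment their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_column], num_segments)].append(row)
    return dict(segments)


if __name__ == "__main__":
    # Hypothetical table: twelve households keyed by household_id.
    rows = [{"household_id": i, "income": 1000 * i} for i in range(12)]
    for seg, seg_rows in sorted(distribute(rows, "household_id").items()):
        print(seg, [r["household_id"] for r in seg_rows])
```

Rows with equal key values always land on the same segment, which is why a join on the distribution key can be evaluated locally on each segment in a scheme like this.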
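
The last argument to ST_DWithin on slide 7, 25.0/3959.0 * 180.0/PI(), turns 25 miles into degrees of arc, since geometry in SRID 4326 is measured in degrees. The arithmetic can be checked in a few lines of Python; `miles_to_degrees` is a hypothetical helper name, and 3959 miles is the Earth radius taken from the slide's expression.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # the radius used in the slide's expression


def miles_to_degrees(miles, radius_miles=EARTH_RADIUS_MILES):
    """Convert a great-circle distance in miles to degrees of arc."""
    radians = miles / radius_miles  # arc length divided by radius
    return math.degrees(radians)    # same as radians * 180 / pi


if __name__ == "__main__":
    # Roughly 0.36 degrees for the 25-mile radius on slide 7.
    print(miles_to_degrees(25.0))
```

This treats one degree as the same ground distance in every direction, so the 25-mile threshold is an approximation that is looser east-west at higher latitudes.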
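
Slide 17 declares an external table over '|'-delimited text with the columns docid, term, and freq, then joins it to gpdb_words on term = word. As a rough illustration of what that query computes (not of how Greenplum executes it), here is a small in-memory Python sketch. The sample rows and the function names `parse_row` and `count_join_matches` are made up for the example.

```python
from collections import Counter


def parse_row(line, delimiter="|"):
    """Split one text row into (docid, term, freq), as the table definition implies."""
    docid, term, freq = line.rstrip("\n").split(delimiter)
    return int(docid), term, int(freq)


def count_join_matches(lines, words):
    """Mimic SELECT COUNT(*) ... WHERE h.term = g.word.

    Each row contributes one match per occurrence of its term in `words`,
    which is how an inner join counts duplicates.
    """
    word_counts = Counter(words)
    return sum(word_counts[parse_row(line)[1]] for line in lines if line.strip())


if __name__ == "__main__":
    # Hypothetical contents of an HDFS part file and of gpdb_words.
    part_file = ["1|hadoop|3", "2|greenplum|5", "3|hive|1"]
    gpdb_words = ["hadoop", "greenplum", "greenplum"]
    print(count_join_matches(part_file, gpdb_words))
```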

Editor's Notes

  1. Greenplum HD Hadoop Software