Greenplum - Jacque Istok - Hadoop World 2010

RDBMS and Hadoop: A Powerful Combination

Jacque Istok

Learn more @ http://www.cloudera.com/hadoop/


Transcript

  • 1. RDBMS and Hadoop
    A Powerful Combination
    Jacque Istok
  • 2. You Know Hadoop, But What Is Greenplum?
    EMC/Greenplum is a massively parallel processing (MPP) data warehouse system based on PostgreSQL, with the full capabilities of a traditional RDBMS. Alongside SQL-99 compliance for structured analysis, Greenplum also offers a MapReduce implementation for unstructured analysis. In short:
    Greenplum ~ Hadoop/Hive
  • 3. Data in a Typical Enterprise
    • Data is everywhere – corporate EDW, 100s of data marts, ‘shadow’ databases, spreadsheets, logs, etc.
    • The goal of centralizing all data in a single EDW has proven untenable
    EDW: ~10% of data
    Data Marts and ‘Personal Databases’: ~90% of data
  • 5. Today’s Big Data Challenges
    The number of data sources and the amount of data to analyze are growing exponentially
    Stale data exists because DW solutions cannot ingest the vast amounts of data fast enough
    Lack of performance for advanced analytics and complex queries
    The number of users and the concurrency of users is increasing rapidly
    Security and privacy around the data are both preferred and often mandated
  • 6. Architecture of HDFS/Hadoop/Hive
    Flexible framework for processing large datasets
    Process large datasets with support for both SQL and MapReduce
    Hive Server accepts SQL and dynamically generates and executes MapReduce code
    Materialize data subsets to reduce impact of node failure
    DataNode servers process analytics close to the data in parallel
    [Diagram labels: SQL (subset), Hive, MapReduce; 2 × NameNode; 5 × DataNode]
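The idea that a SQL statement can be turned into MapReduce work can be illustrated with a toy example. The sketch below is not Hive's actual code generation; it only shows, under simplifying assumptions, how a query like `SELECT status, COUNT(*) FROM events GROUP BY status` breaks down into map, shuffle, and reduce steps. The `ROWS` data and all function names are hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Hypothetical rows standing in for an "events" table (not from the slides).
ROWS = [
    {"host": "a", "status": 200},
    {"host": "b", "status": 500},
    {"host": "a", "status": 200},
    {"host": "a", "status": 500},
]


def map_phase(rows):
    """Emit one (key, 1) pair per row; the GROUP BY column becomes the key."""
    for row in rows:
        yield row["status"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Apply the aggregate (COUNT(*) here) to each group of values."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_status(rows):
    """Toy equivalent of: SELECT status, COUNT(*) FROM events GROUP BY status."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_status(ROWS))  # prints {200: 2, 500: 2}
```

In a real cluster the map and reduce steps would run on many nodes next to the data; here everything runs in one process to keep the shape of the computation visible.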
  • 7. Architecture of Greenplum
    Flexible framework for processing large datasets
    Process large datasets with support for both SQL and MapReduce
    Master servers optimize queries for the most efficient query execution
    Interconnect for continuous pipelining of data processing
    Segment servers process queries close to the data in parallel
    MPP Scatter/Gather streaming for fast loading of data
    [Diagram labels: SQL, MapReduce; 2 × Master; 5 × Segment]
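The master/segment split described on this slide can be sketched as a small simulation. This is a simplified illustration, not Greenplum's implementation: rows are distributed across a hypothetical number of segments by a key, each segment computes a partial aggregate in parallel, and a master step combines the partial results. `NUM_SEGMENTS`, the `(customer_id, amount)` row layout, and the function names are assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NUM_SEGMENTS = 4  # hypothetical segment count, chosen for the example


def distribute(rows, num_segments):
    """Assign each (customer_id, amount) row to a segment by its key."""
    segments = [[] for _ in range(num_segments)]
    for customer_id, amount in rows:
        segments[customer_id % num_segments].append((customer_id, amount))
    return segments


def segment_sum(rows):
    """Work done on one segment: a partial SUM(amount) per customer."""
    partial = Counter()
    for customer_id, amount in rows:
        partial[customer_id] += amount
    return partial


def combine(partials):
    """Master step: merge the partial aggregates returned by the segments."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values per key
    return dict(total)


def total_by_customer(rows, num_segments=NUM_SEGMENTS):
    """Distribute rows, aggregate on each segment in parallel, then combine."""
    segments = distribute(rows, num_segments)
    with ThreadPoolExecutor(max_workers=num_segments) as pool:
        partials = list(pool.map(segment_sum, segments))
    return combine(partials)


if __name__ == "__main__":
    sample = [(1, 10), (2, 5), (1, 7), (6, 3), (2, 1)]
    print(total_by_customer(sample))
```

Because rows with the same key always land on the same segment, each segment's partial result is already final for its keys; the combine step only has to merge disjoint results.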
  • 8. RDBMS Advantages
  • 9. Common Real World Implementation
    Lots ‘O Data
  • 10. A Cyber-Analytics Data Mart Use Case
    Commercial SIEM products struggle with the volumes of data generated in a large enterprise. Non-parallel event processing systems can’t keep up with ingest, user load, etc.
    Greenplum provides the ability to cost-effectively ingest and store large volumes of sensor data.
    Greenplum provides the parallel analytics that support data mining, event correlation, etc., over datasets ranging from terabytes to petabytes in size.
    [Diagram labels: Access and Events, SoR, ETL, GPLoad, Greenplum, Analytics, Data Mart, ODS, SQL, MapReduce, (Perl), (Python Math Lib), (R), BI]
  • 11. Coexistence Approach – Use Case
    General-purpose x86 cluster of systems
    Provides true, complete, SQL-compliant analytics
    Data can be read from and written to Hadoop via Greenplum
    Store your data structured or unstructured, column- or row-oriented, and compressed, leveraging index support where appropriate
    SQL can be executed, through Greenplum, on data residing within Greenplum as well as data residing within HDFS
    MapReduce can be executed through Greenplum in Java, C, Perl, or Python, or through Java in Hadoop
    Designed for rapid analysis of data volumes scaling from less than a terabyte into the petabytes
    [Diagram labels: Compute, Storage, Analytics, Network]
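The coexistence idea, one SQL statement spanning data held in the database and data held outside it, can be illustrated with a stand-in. The sketch below does not use Greenplum or HDFS: it uses Python's built-in `sqlite3` module as the database and an in-memory CSV string as a placeholder for an HDFS-resident file. The table names, columns, and sample data are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CSV text standing in for a file stored in HDFS.
EXTERNAL_CSV = "user_id,event\n1,login\n2,login\n1,purchase\n"


def run_combined_query():
    """Join a native table with externally sourced rows in a single SQL query."""
    conn = sqlite3.connect(":memory:")

    # Data that lives inside the database.
    conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

    # Data that lives outside the database, exposed to SQL as a table.
    conn.execute("CREATE TEMP TABLE ext_events (user_id INTEGER, event TEXT)")
    reader = csv.DictReader(io.StringIO(EXTERNAL_CSV))
    conn.executemany(
        "INSERT INTO ext_events VALUES (?, ?)",
        [(int(row["user_id"]), row["event"]) for row in reader],
    )

    # One SQL statement over both sources.
    rows = conn.execute(
        "SELECT u.name, COUNT(*) "
        "FROM users u JOIN ext_events e ON e.user_id = u.user_id "
        "GROUP BY u.name ORDER BY u.name"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run_combined_query())  # prints [('ann', 2), ('bob', 1)]
```

The stand-in copies the external rows into a temporary table before querying; how a real system reads external data at query time is outside what this sketch shows.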
  • 12. Big Data is Complementary to EDW
    Greenplum
    Enterprise Data Warehouse
    • Single Source of Truth
    • 1 Logical Model
    • Heavy data governance and quality
    • Operational Reporting
    • Financial Consolidation
    MapReduce Analytics Cloud
    • Source of all raw data (often 10X size of EDW)
    • Self-service infrastructure to support multiple marts and sandboxes
    • Rapid analytic iteration, and business owned solutions
    Commodity Hardware
    Virtual Machines
    Public Cloud