Greenplum - Jacque Istok - Hadoop World 2010

RDBMS and Hadoop: A Powerful Combination

Jacque Istok

Learn more @ http://www.cloudera.com/hadoop/

Published in: Technology

Transcript

  • 1. RDBMS and Hadoop: A Powerful Combination. Jacque Istok. © Copyright 2010 EMC Corporation. All rights reserved.
  • 2. You Know Hadoop, But What Is Greenplum? EMC/Greenplum is an MPP data warehouse system based on PostgreSQL, with the full capabilities of a traditional RDBMS. Alongside SQL-99 compliance for structured analysis, Greenplum also offers a MapReduce implementation for unstructured analysis. In short: Greenplum ~ Hadoop/Hive
  • 3. Data in a Typical Enterprise
      • Data is everywhere: corporate EDW, 100s of data marts, ‘shadow’ databases, spreadsheets, logs, etc.
      • The goal of centralizing all data in a single EDW has proven untenable
      • Diagram labels: EDW, ~10% of data; Data Marts and ‘Personal Databases’, ~90% of data
  • 4. Today’s Big Data Challenges
      • The number of data sources and the amount of data to analyze are growing exponentially
      • Stale data exists because DW solutions cannot ingest the vast amounts of data fast enough
      • Lack of performance for advanced analytics and complex queries
      • The number of users and the concurrency of users are increasing rapidly
      • Security and privacy around the data are both preferred and often mandated
  • 5. Architecture of HDFS/Hadoop/Hive: Process large datasets with support for both SQL and MapReduce
      • Hive Server accepts SQL and dynamically generates and executes MapReduce code
      • Flexible framework for processing large datasets
      • Materialize data subsets to reduce impact of node failure
      • DataNode servers process analytics close to the data in parallel
      • Diagram labels: SQL (subset), Hive, MapReduce, NameNode, DataNode (repeated)
  • 6. Architecture of Greenplum: Process large datasets with support for both SQL and MapReduce
      • Master servers optimize queries for the most efficient query execution
      • MPP Scatter/Gather streaming for fast loading of data
      • Flexible framework for processing large datasets
      • Interconnect for continuous pipelining of data processing
      • Segment servers process queries close to the data in parallel
      • Diagram labels: SQL, MapReduce, Master, Segment (repeated)
  • 7. RDBMS Advantages
  • 8. Common Real World Implementation: Lots ‘O Data
  • 9. A Cyber-Analytics Data Mart Use Case
      • Commercial SIEM products struggle with the volumes of data generated in a large enterprise. Non-parallel event processing systems can’t keep up with ingest, user load, etc.
      • Greenplum provides the ability to cost-effectively ingest and store large volumes of sensor data.
      • Greenplum provides the parallel analytics that support data mining, event correlation, etc., over datasets from TBs to PBs in size.
      • Diagram labels: Access and Events; Greenplum Analytics Data Mart; GPLoad; SQL; MapReduce (Perl) (Python Math Lib) (R); SoR; ETL; ODS; BI
  • 10. Coexistence Approach – Use Case
      • Provides true, complete SQL-compliant analytics
      • Data can be read and written from Hadoop via Greenplum
      • Store your data structured, unstructured, column or row oriented, compressed, leveraging index support where appropriate
      • SQL can be executed, through Greenplum, on data residing within Greenplum as well as data residing within HDFS
      • MapReduce can be executed through Greenplum in Java, C, Perl, Python or through Java in Hadoop
      • Designed for rapid analysis of data volumes from less than a terabyte scaling into the petabytes
      • Diagram labels: Compute; Storage; Analytics; General Purpose X86 Cluster of Systems; Network
  • 11. Big Data is Complementary to EDW
      • Greenplum Enterprise Data Warehouse
          • Single Source of Truth
          • 1 Logical Model
          • Heavy data governance and quality
          • Operational Reporting
          • Financial Consolidation
      • MapReduce Analytics Cloud
          • Source of all raw data (often 10X size of EDW)
          • Self-service infrastructure to support multiple marts and sandboxes
          • Rapid analytic iteration, and business owned solutions
      • Diagram labels: Commodity Hardware; Virtual Machines; Public Cloud
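
Illustrative sketch for slide 5 (not from the deck). The slide says Hive accepts SQL and generates MapReduce code. As a rough picture of what that means, and not Hive's actual planner or generated code, a query such as `SELECT src_ip, COUNT(*) FROM events GROUP BY src_ip` can be read as a map step that emits one pair per row, a shuffle that groups pairs by key, and a reduce step that sums each group. The table name, column names and sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for an "events" table.
EVENTS = [
    {"src_ip": "10.0.0.1", "action": "login"},
    {"src_ip": "10.0.0.2", "action": "login"},
    {"src_ip": "10.0.0.1", "action": "logout"},
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["src_ip"], 1


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which is what COUNT(*) amounts to here."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_src_ip(rows):
    """Run the three phases in order on an in-memory list of rows."""
    return reduce_phase(shuffle(map_phase(rows)))
```

Calling `count_by_src_ip(EVENTS)` runs all three phases in a single process; a real cluster would spread the map and reduce work across many nodes.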
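
Illustrative sketch for slide 6 (not from the deck). The slide describes segment servers that process queries close to the data in parallel under a master. The generic pattern behind that description can be sketched as: distribute rows to segments by hashing a key, compute a partial result on each segment, then merge the partials on the master. This is a single-process toy, not Greenplum's loader, interconnect or optimizer; the segment count, the `src_ip` distribution key and all function names are assumptions made for the example.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NUM_SEGMENTS = 4  # hypothetical number of segments


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a segment using a stable hash of the distribution key."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def scatter(rows, num_segments=NUM_SEGMENTS):
    """Distribute rows so that equal keys always land on the same segment."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row["src_ip"], num_segments)].append(row)
    return segments


def segment_count(rows):
    """Local work on one segment: a partial COUNT(*) per key."""
    return Counter(row["src_ip"] for row in rows)


def gather(partials):
    """Master-side merge of the per-segment partial counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


def parallel_count(rows, num_segments=NUM_SEGMENTS):
    """Scatter rows, count on each segment concurrently, then gather."""
    segments = scatter(rows, num_segments)
    with ThreadPoolExecutor(max_workers=num_segments) as pool:
        partials = list(pool.map(segment_count, segments))
    return gather(partials)
```

Threads are used here only to suggest concurrency; in an MPP system each segment would be a separate process or host with its own storage.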
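
Illustrative sketch for slide 10 (not from the deck). The slide says SQL can be run through Greenplum over data held in Greenplum as well as data held in HDFS. The idea of one query spanning both stores can be imitated with Python's built-in `sqlite3` module, which is used here purely as a stand-in. One table plays the role of data stored in the database and a CSV string plays the role of a file in HDFS. Unlike what the slide describes, this sketch copies the external rows into a table before querying them. All table names and sample values are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CSV text standing in for a file stored in HDFS.
HDFS_FILE = "10.0.0.1,login\n10.0.0.3,login\n"


def build_database():
    """Create an in-memory database with an internal and an external table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events_internal (src_ip TEXT, action TEXT)")
    conn.executemany(
        "INSERT INTO events_internal VALUES (?, ?)",
        [("10.0.0.1", "logout"), ("10.0.0.2", "login")],
    )
    # Stand-in for external data: parse the CSV text and load its rows.
    conn.execute("CREATE TABLE events_external (src_ip TEXT, action TEXT)")
    conn.executemany(
        "INSERT INTO events_external VALUES (?, ?)",
        csv.reader(io.StringIO(HDFS_FILE)),
    )
    return conn


def count_across_stores(conn):
    """Count events per source address across both tables in one query."""
    query = """
        SELECT src_ip, COUNT(*) FROM (
            SELECT src_ip FROM events_internal
            UNION ALL
            SELECT src_ip FROM events_external
        ) GROUP BY src_ip ORDER BY src_ip
    """
    return conn.execute(query).fetchall()
```

The point of the sketch is only that the analyst writes a single SQL statement and does not need to know which store each row came from.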
