Hw09 Hadoop + Vertica

Transcript of "Hw09 Hadoop + Vertica"

  1. Vertica Integration with Apache Hadoop
     Hadoop World NYC 2009
     [Diagram: HDFS; Hadoop compute cluster (Map, Map, Map, Reduce)]
  2. Vertica® Analytic Database
       • MPP columnar architecture
       • Second to sub-second queries
       • 300GB/node load times
       • Scales to hundreds of TBs
       • Standard ETL & Reporting Tools
     www.vertica.com
  3. What do people do with Hadoop?
       • Transform data
       • Archive data
       • Look for patterns
       • Parse logs
  4. Big Data comes in Three Forms
       • Unstructured
           • Images, sound, video
       • Semi-structured
           • Logs, data feeds, event streams
       • Fully structured
           • Relational tables
  5. Availability, Scalability and Efficiency
       • … how fast can you go from data to answers?
       • Unstructured data needs to be analyzed to make sense of it.
       • Semi-structured data is parsed based on a spec (or by brute force).
       • Structured data can be optimized for ad-hoc analysis.
  6. Hadoop / Vertica
       • Hadoop provides a distributed processing framework (MapReduce)
       • Hadoop provides a distributed storage layer (HDFS)
       • Vertica can be used as a data source and target for MapReduce
       • Data can also be moved between Vertica and HDFS (Sqoop)
       • Hadoop talks to Vertica via custom input and output formatters
  7. Hadoop / Vertica
     Vertica serves as a structured data repository for Hadoop
     [Diagram: Hadoop compute cluster (Map, Map, Map, Reduce)]
  8. Hadoop / Vertica
       • Vertica’s input formatter takes a parameterized query
       • Relational Map operations can be pushed down to the database
       • Vertica’s output formatter takes an existing table name or a description
       • Vertica output tables can be optimized directly from Hadoop
  9. Hadoop / Vertica
     Federate multiple Vertica database clusters with Hadoop
     [Diagram: four Hadoop compute clusters, each with Map, Map, Map, Reduce]
  10. What is the Interface?
       • Input Formatter
           • Query specifies which data to read
           • Query can be parameterized (map push down)
           • Each input split gets one parameter
           • OR, input can be split using ORDER BY and LIMIT (slower)
       • Output Formatter
           • Job specifies format for output table
           • Vertica converts reduced output into trickle loads
           • Vertica can optimize new tables
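
The two ways of forming input splits described on this slide can be sketched in plain Python. This is only an illustration of the idea, not the formatter's actual API: the function names, the `?` placeholder convention, and the use of `OFFSET` alongside `LIMIT` are all assumptions made for the example.

```python
from typing import List, Tuple


def splits_from_parameters(query: str, params: List[str]) -> List[Tuple[str, tuple]]:
    """One input split per parameter (hypothetical helper).

    Each split runs the same query with its own bind value, so the
    filtering (the "map push down") happens inside the database.
    """
    if query.count("?") != 1:
        raise ValueError("this sketch expects exactly one '?' placeholder")
    return [(query, (p,)) for p in params]


def splits_from_order_and_limit(query: str, order_by: str,
                                total_rows: int, num_splits: int) -> List[str]:
    """Fallback for an unparameterized query (hypothetical helper).

    The result set is cut into ranges. Every slice has to repeat the
    sort, which is one plausible reason this path is slower.
    OFFSET is an assumption; the slide only mentions ORDER BY and LIMIT.
    """
    size = -(-total_rows // num_splits)  # ceiling division
    return [
        f"SELECT * FROM ({query}) q ORDER BY {order_by} "
        f"LIMIT {size} OFFSET {i * size}"
        for i in range(num_splits)
        if i * size < total_rows
    ]
```

With a parameterized query such as `SELECT * FROM ticks WHERE symbol = ?` (a hypothetical table), passing three symbols yields three splits, each of which only ever reads its own symbol's rows.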
  11. Some Hadoop / Vertica Applications
       • Elastic MapReduce parsing and loading CloudFront logs
       • Tickstore algorithm with map push down
       • Analyze time series
       • Sessionize click streams
       • Parse and load logs
  12. Basic Example
       • Elastic MapReduce parsing and loading CloudFront logs
       • Mapper reads from S3 CloudFront logs
       • Parses into records, transmits to reducer
       • Reducer loads into Vertica
       • All done with streaming API
     ~10 lines of Python
     Limitless SQL
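
A minimal sketch of what the mapper half of such a streaming job might look like. It is not the code from the talk. It assumes the log files are tab-separated with `#`-prefixed header lines, and the choice of the first column as the shuffle key and `|` as the value separator is arbitrary. The reducer that loads into Vertica is left out because its details are not given on the slide.

```python
from typing import Iterable, Iterator, List, TextIO


def parse_log(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of field values per log entry.

    Blank lines and '#' header lines (assumed format) are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        yield line.split("\t")


def to_record(fields: List[str]) -> str:
    """Format a parsed entry as 'key<TAB>value' for Hadoop Streaming.

    Assumption: the first column is used as the key and the remaining
    columns are joined with '|' as the value.
    """
    return fields[0] + "\t" + "|".join(fields[1:])


def map_stream(inp: TextIO, out: TextIO) -> None:
    """Streaming mapper body: read raw log lines, write keyed records."""
    for fields in parse_log(inp):
        out.write(to_record(fields) + "\n")
```

To use this as a streaming mapper, the script would end with a call to `map_stream(sys.stdin, sys.stdout)` after `import sys`, since Hadoop Streaming passes input on standard input and reads results from standard output.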
  13. Advanced Example
       • Tickstore algorithm with map push down
       • Input formatter queries Vertica using map push down
       • Identity Mapper passes through to reducer
       • Reducer runs proprietary algorithm
           • moving average, correlations, secret sauce
       • Results are stored in a new table for further analysis
       • Vertica optimizes the new table
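
The reducer step can be illustrated with a simple moving average, one of the computations the slide mentions. This is a sketch under stated assumptions rather than the proprietary algorithm: input lines are assumed to be `symbol<TAB>price`, grouped by symbol and already in time order within each symbol, and the window size of 3 is arbitrary.

```python
from collections import deque
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def moving_average(prices: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the most recent `window` prices seen so far."""
    recent = deque(maxlen=window)  # old prices fall off the left automatically
    for price in prices:
        recent.append(price)
        yield sum(recent) / len(recent)


def reduce_ticks(lines: Iterable[str], window: int = 3) -> Iterator[Tuple[str, float]]:
    """Emit (symbol, moving average) for every tick.

    Relies on all lines for one symbol being adjacent, which holds for
    reducer input sorted by key. Time order within a symbol is an
    assumption of this sketch.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for symbol, group in groupby(rows, key=lambda row: row[0]):
        prices = (float(row[1]) for row in group)
        for avg in moving_average(prices, window):
            yield symbol, avg
```

Because the identity mapper passes rows through unchanged, the only work in the job is this reducer plus the query that the input formatter pushes down to the database.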
  14. How to get started
       • Get a copy of Hadoop from Apache or Cloudera
       • Get Vertica from www.vertica.com, via Amazon or RightScale, or as a VM
       • Grab the formatter and Vertica JDBC drivers from vertica.com/MapReduce
       • Included in contrib as of Hadoop 0.21.0 (MR-775)
       • Put the jars in hadoop/lib
       • Run your Hadoop/Vertica job
  15. Future Directions and Questions
       • Archiving information lifecycle (Sqoop)
       • Invoking Hadoop jobs from Vertica
       • Joining Vertica data mid-job
       • Using Vertica for (structured) transient job data
       • [email_address]
       • Vertica.com/MapReduce