Bridging Structured and Unstructured Data with Apache Hadoop and Vertica
Presentation Transcript

  • Bridging Unstructured & Structured Data with Hadoop and Vertica
    Glenn Gebhart ggebhart@vertica.com
    Steve Watt swatt@hp.com
  • Contents
    • Our background with Big Data
    • Accelerating and monitoring Apache Hadoop deployments with HP CMU
    • I have my Apache Hadoop cluster deployed… now what?
    • Sample application scenario with Apache Hadoop and Vertica
  • Cluster Management Utility
  • Managing Scale Out with HP CMU
    • Proven cluster deployment and management tool
    • 11 years of experience
    • Proven with clusters of 3500+ nodes
    • Deployment and Management
    • Clone a Node (Hadoop Slave) and Deploy to an entire Logical Group.
    • Provision applications and dependencies with parallel distributed copy (pdcp) and parallel distributed shell (pdsh)
    • Command Line or GUI based cluster wide configuration
    • Manage a node individually or manage a cluster as a whole
    • Monitoring
    • Scalable Non-intrusive Monitoring across a wide set of infrastructure metrics
    • Extensible through Collectl integration
  • Tech Bubble?
    What does the Data Say?
    Attribution: CC PascalTerjan via Flickr
  • But what if I could turn that into this?
  • And see how the amount invested this year differs from previous years?
  • Where is the money going?
  • What type of startups get the most investment funding?
  • Amount invested in Software Startups by Zip Code
  • How did you do that?
    Attribution: CC Colin_K on Flickr
  • Apache Nutch
    Identify optimal seed URLs and crawl to a depth of 2
    http://www.crunchbase.com/companies?c=a&q=privately_held
    Crawl data is stored in segment directories on the HDFS
  • Making the Data Structured
    Retrieving HTML
    Preliminary filtering on URL
    Populate a Company POJO, then write it out tab-delimited ("\t"), as sketched below
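    A minimal sketch of this extraction step follows. The class, the URL filter, and the Company helper are hypothetical stand-ins; a real job would parse the Nutch segment records produced by the crawl.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: filter crawl records by URL, scrape company
    // fields out of the HTML, and emit one tab-delimited line per company.
    public class CompanyExtractorMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

      // Minimal POJO standing in for the real company model.
      static class Company {
        String name, city, state, sector;
        static Company fromHtml(String html) {
          // Real HTML scraping goes here; omitted in this sketch.
          return null;
        }
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Assume each input record is "<url>\t<html>".
        String[] parts = value.toString().split("\t", 2);
        if (parts.length < 2) return;

        // Preliminary filter on the URL: keep only company pages.
        if (!parts[0].contains("crunchbase.com/company/")) return;

        // Populate the POJO from the page, then emit it tab-delimited
        // so the Pig script below can load it with PigStorage('\t').
        Company c = Company.fromHtml(parts[1]);
        if (c == null) return;
        context.write(NullWritable.get(), new Text(
            c.name + "\t" + c.city + "\t" + c.state + "\t" + c.sector));
      }
    }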
  • Aargh! My viz tool requires zip codes to plot geospatially!
  • Apache Pig script to join on City and State to get the zip code, and write the results to Vertica:
    ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t')
        AS (State:chararray, City:chararray, ZipCode:int);
    CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t')
        AS (Company:chararray, City:chararray, State:chararray, Sector:chararray,
            Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);
    CrunchBaseZip = JOIN CrunchBase BY (City, State), ZipCodes BY (City, State);
    STORE CrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40),
        State varchar(40), Sector varchar(40), Round varchar(40), Month int,
        Year int, Investor varchar(40), Amount int)}'
        USING com.vertica.pig.VerticaStorer('VerticaServer','OSCON','5433','dbadmin','');
  • The Story So Far
    Used Nutch to retrieve investment data from the web site.
    Used Hadoop to extract and structure the data.
    Used Pig to add zip code data.
    The end result is a collection of relations describing investment activity.
    We’ve got raw data; now we need to understand it.
  • Why Vertica?
    Vertica and Hadoop are complementary technologies.
    Hadoop’s strengths:
    Analysis of unstructured data (screen scraping, natural language recognition)
    Non-numeric operations (graphics preparation)
    Vertica’s strengths:
    Counting, adding, grouping, sorting, …
    Rich suite of advanced analytic functions
    All at TB+ scales.
  • Built from the Ground Up: The Four C’s of Vertica
    Columnar storage and execution
    Continuous performance
    Clustering
    Compression
    Achieve the best query performance with Vertica’s unique column store
    Linear scaling by adding more resources on the fly
    Store more data, provide more views, use less hardware
    Query and load 24x7 with zero administration
  • Getting Data From Here To There
  • Connecting Vertica And Hadoop
    Vertica provides connectors for Hadoop 0.20.2 and Pig 0.7.
    Vertica acts as a passive component; Hadoop/Pig connect to it to read and write data.
    Input is retrieved from Vertica using a standard SQL query.
    Output is written to a Vertica table.
  • Vertica As a M/R Data Source
    // Set up the configuration and job objects
    Configuration conf = getConf();
    Job job = new Job(conf);
    // Set the input format to retrieve data from Vertica
    job.setInputFormatClass(VerticaInputFormat.class);
    // Set the query that retrieves data from the Vertica DB
    VerticaInputFormat.setInput(
        job,
        "SELECT * FROM foo WHERE bar = 'baz'"
    );
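    For completeness, here is a hedged sketch of a mapper consuming those rows. The positional accessor on VerticaRecord is an assumption; verify the exact API against your connector version’s javadoc.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Each map() call receives one row of the query's result set as a
    // VerticaRecord delivered by VerticaInputFormat.
    public class RowCountMapper
        extends Mapper<LongWritable, VerticaRecord, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);

      @Override
      protected void map(LongWritable key, VerticaRecord row, Context ctx)
          throws IOException, InterruptedException {
        // Key on the first column of the row (assumed accessor row.get(0);
        // check the connector docs for the real method name).
        ctx.write(new Text(String.valueOf(row.get(0))), ONE);
      }
    }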
  • Vertica As a M/R Data Sink
    // Set up the configuration and job objects
    Configuration conf = getConf();
    Job job = new Job(conf);
    // Set the output format to write data to Vertica
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(VerticaRecord.class);
    job.setOutputFormatClass(VerticaOutputFormat.class);
    // Define the table which will hold the output
    VerticaOutputFormat.setOutput(
    job, <table name>, <truncate table?>,
    <col 1 def>, <col 2 def>, …, <col N def>
    );
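    As a concrete, illustrative instantiation of that template, the table from the earlier Pig example might be defined like this (the truncate flag and the column-definition strings simply follow the pattern above):

    // Define the CrunchBaseZip table as the job's output, truncating any
    // existing rows (table/column names taken from the Pig example).
    VerticaOutputFormat.setOutput(
        job, "CrunchBaseZip", true,
        "Company varchar(40)", "City varchar(40)", "State varchar(40)",
        "Sector varchar(40)", "Round varchar(40)", "Month int",
        "Year int", "Investor varchar(40)", "Amount int");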
  • Reading Data Via Pig
    # Read some tuples
    A = LOAD 'sql://< Your query here >'
        USING com.vertica.pig.VerticaLoader(
            'server1,server2,server3',
            '< DB name >', '5433', '< user >', '< password >'
        );
  • Writing Data Via Pig
    # Write some tuples
    STORE < some var >
    INTO '{
        < table name > (< col 1 def >, < col 2 def >, … )
    }'
    USING com.vertica.pig.VerticaStorer(
        '< server >', '< DB >', '5433', '< user >', '< password >'
    );
  • Reporting And Data Visualization
  • Does My Favorite Application Work With Vertica?
    Vertica is an ANSI SQL-99 compliant database.
    Comes with drivers for ODBC, JDBC, and ADO.Net.
    If your tool uses a SQL DB, and speaks one of these protocols, it’ll work just fine.
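    As a quick illustration, here is a minimal JDBC client. The driver class name and connection URL format are assumptions; check your Vertica JDBC driver’s documentation.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class VerticaJdbcExample {
      public static void main(String[] args) throws Exception {
        // Assumed driver class name; verify against your driver jar.
        Class.forName("com.vertica.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
            "jdbc:vertica://VerticaServer:5433/OSCON", // assumed URL format
            "dbadmin", "");
        Statement stmt = conn.createStatement();
        // Standard SQL, exactly as you would issue it to any other DB.
        ResultSet rs = stmt.executeQuery(
            "SELECT Sector, SUM(Amount) FROM CrunchBaseZip GROUP BY Sector");
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        rs.close(); stmt.close(); conn.close();
      }
    }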
  • We Support…
  • Traditional Reports
    Integrates smoothly with reporting frontends such as Jasper and Pentaho.
    Scriptable via the vsql command line tool.
    C/C++ SDK for parallelized, in-DB computation.
    But… you have to know what questions you want to ask.
  • Graphical, Real-Time Data Exploration
  • Wrap-Up
  • In Closing…
    Solutions leveraging Vertica in conjunction with Hadoop can solve a tremendous range of analytical challenges.
    Hadoop is great for dealing with unstructured data, while Vertica is a superior platform for working with structured/relational data.
    Getting them to work together is easy.
  • Questions?