Bridging Structured and Unstructred Data with Apache Hadoop and Vertica

Bridging Unstructured & Structured Data with Hadoop and Vertica Glenn Gebhart ggebhart@vertica.com Steve Watt swatt@hp.com

Accelerating and monitoring Apache Hadoop deployments with HP CMU

I have my Apache Hadoop Cluster deployed….. Now what ?

Sample application scenario with Apache Hadoop and Vertica,[object Object]

Managing Scale Out with HP CMU ,[object Object]

Proven with clusters of 3500+ nodes

Clone a Node (Hadoop Slave) and Deploy to an entire Logical Group.

Provision applications and dependencies with parallel distributed copy (pdcp) and parallel distributed shell (pdsh)

Command Line or GUI based cluster wide configuration

Manage a node individually or manage a cluster as a whole

Scalable Non-intrusive Monitoring across a wide set of infrastructure metrics

Extensible through Collectl integration,[object Object]

6 HP Confidential Tech Bubble? What does the Data Say? Attribution: CC PascalTerjan via Flickr

But what if I could turn that into this? 8 HP Confidential

And see how the amount invested this year differs from previous years?

10 HP Confidential Where is the money going?

What type of startups get the most investment funding?

Amount invested in Software Startups by Zip Code

How did you do that? 13 HP Confidential How did you Do that? Attribution: CC Colin_K on Flickr

14 HP Confidential Apache Identify Optimal Seed URLs & Crawl to a depth of 2 http://www.crunchbase.com/companies?c=a&q=privately_held Crawl data is stored in segment dirs on the HDFS

16 HP Confidential Making the data STRUCTURED Retrieving HTML Prelim Filtering on URL Company POJO then /t Out

17 HP Confidential Aargh! My viz tool requires zipcodes to plot geospatially!

Apache Pig Script to Join on City to get Zip Code and Write the results to VerticaZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('') AS (State:chararray, City:chararray, ZipCode:int);CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('') AS (Company:chararray,City:chararray,State:chararray,Sector:chararray,Round:chararray,Month:int,Year:int,Investor:chararray,Amount:int);CrunchBaseZip = JOIN CrunchBase BY (City,State), ZipCodes BY (City,State);STORECrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor int, Amount varchar(40))}’USINGcom.vertica.pig.VerticaStorer(‘VerticaServer','OSCON','5433','dbadmin','');

The Story So Far Used Nutch to retrieve investment data from web site. Used Hadoop to extract and structure the data Used Pig to add zipcode data. End result is a collection of relations describing investment activity. We’ve got raw data, now we need to understand it.

Why Vertica? Vertica and Hadoop are complementary technologies. Hadoop’s strengths: Analysis of unstructured data (screen scraping, natural language recognition) Non-numeric operations (graphics preparation) Vertica’s strengths Counting, adding, grouping, sorting, … Rich suite of advanced analytic functions All at TB+ scales.

Built from the Ground Up: The Four C’s of Vertica Columnar storage and execution Continuous performance Clustering Compression Achieve best data query performance with unique Vertica column store Linear scaling by adding more resources on the fly Store more data, provide more views, use less hardware Query and load 24x7 with zero administration

Getting Data From Here To There

Connecting Vertica And Hadoop Vertica provides connectors for Hadoop 20.2 and Pig 0.7. Acts as a passive component; Hadoop/Pig connect to Vertica to read/write data. Input retrieved from Vertica using standard SQL query. Output written to Vertica table.

Vertica As a M/R Data Source // Set up the configuration and job objects Configuration conf = getConf(); Job job = new Job(conf); // Set the input format to retrieve data from Vertica job.setInputFormatClass(VerticaInputFormat.class); // Set the query to retrieve data from the Vertica DB VerticaInputFormat.setInput( job, “SELECT * FROM foo WHERE bar = ‘baz’ );

Bridging Structured and Unstructred Data with Apache Hadoop and Vertica

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Bridging Structured and Unstructred Data with Apache Hadoop and Vertica

Similar to Bridging Structured and Unstructred Data with Apache Hadoop and Vertica (20)

More from Steve Watt

More from Steve Watt (12)

Recently uploaded

Recently uploaded (20)

Bridging Structured and Unstructred Data with Apache Hadoop and Vertica