24. How did you do that? HP Confidential. Attribution: CC Colin_K on Flickr
25. Apache Nutch: Identify optimal seed URLs and crawl to a depth of 2, e.g. http://www.crunchbase.com/companies?c=a&q=privately_held. Crawl data is stored in segment directories on HDFS.
27. Making the data STRUCTURED: retrieve the HTML, apply preliminary filtering on URL, populate a Company POJO, then write tab-delimited (\t) output.
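The POJO-to-tab-delimited step on this slide might look roughly like the sketch below. It is not the presenters' code: the class layout, the `clean` helper, and the sample values are assumptions made for illustration, with field names borrowed from the columns the deck's Pig script loads. The one design point it shows is that tabs and newlines inside a field have to be neutralised before the record is written, or the downstream column layout breaks.

```java
import java.util.StringJoiner;

// Hypothetical Company POJO (illustrative only, not taken from the talk).
class Company {
    final String name, city, state, sector, round, investor;
    final int month, year, amount;

    Company(String name, String city, String state, String sector, String round,
            int month, int year, String investor, int amount) {
        this.name = name; this.city = city; this.state = state;
        this.sector = sector; this.round = round;
        this.month = month; this.year = year;
        this.investor = investor; this.amount = amount;
    }

    // Replace tabs and line breaks inside a field with a single space so
    // they cannot be mistaken for field or record separators.
    static String clean(String s) {
        return s == null ? "" : s.replaceAll("[\\t\\r\\n]+", " ").trim();
    }

    // Emit one tab-delimited record, in the column order assumed above.
    String toTsv() {
        StringJoiner j = new StringJoiner("\t");
        j.add(clean(name)).add(clean(city)).add(clean(state))
         .add(clean(sector)).add(clean(round))
         .add(Integer.toString(month)).add(Integer.toString(year))
         .add(clean(investor)).add(Integer.toString(amount));
        return j.toString();
    }

    public static void main(String[] args) {
        // Made-up sample record; note the stray tab inside the company name.
        Company c = new Company("Acme\tLabs", "Portland", "OR", "software", "a",
                                7, 2011, "Example Ventures", 5000000);
        System.out.println(c.toTsv());
    }
}
```

In a real job this formatting would presumably happen inside the mapper or reducer that parses the crawled HTML; the sketch leaves that wiring out.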
28. Aargh! My viz tool requires zipcodes to plot geospatially!
29. Apache Pig Script to Join on City and State to get Zip Code and Write the results to Vertica
ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('') AS (State:chararray, City:chararray, ZipCode:int);
CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('') AS (Company:chararray, City:chararray, State:chararray, Sector:chararray, Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);
CrunchBaseZip = JOIN CrunchBase BY (City,State), ZipCodes BY (City,State);
STORE CrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor varchar(40), Amount int)}' USING com.vertica.pig.VerticaStorer('VerticaServer','OSCON','5433','dbadmin','');
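To make the join on this slide concrete, here is a minimal in-memory sketch of an inner join keyed on (City, State). It only illustrates the semantics and is not how Pig executes the join on a cluster. The `ZipJoin` class, the row layouts, and the sample rows are assumptions for illustration. Zip codes are kept as strings here because loading them as `int`, as the script does, would presumably drop leading zeros.

```java
import java.util.*;

// Illustrative inner join on (city, state); not the presenters' code.
class ZipJoin {
    // Composite key; the NUL separator keeps ("ab","c") distinct from ("a","bc").
    static String key(String city, String state) {
        return city + "\u0000" + state;
    }

    // zipRows are {state, city, zip}; companyRows are {company, city, state, ...}.
    // Returns each matching company row with the zip code appended.
    static List<String[]> join(List<String[]> companyRows, List<String[]> zipRows) {
        Map<String, List<String>> zipsByCityState = new HashMap<>();
        for (String[] z : zipRows) {
            zipsByCityState.computeIfAbsent(key(z[1], z[0]), k -> new ArrayList<>()).add(z[2]);
        }
        List<String[]> out = new ArrayList<>();
        for (String[] c : companyRows) {
            List<String> zips =
                zipsByCityState.getOrDefault(key(c[1], c[2]), Collections.emptyList());
            for (String zip : zips) {            // one output row per matching zip code
                String[] row = Arrays.copyOf(c, c.length + 1);
                row[c.length] = zip;
                out.add(row);
            }
        }
        return out;                              // companies without a match are dropped
    }

    public static void main(String[] args) {
        // Sample rows for illustration only.
        List<String[]> zips = Arrays.asList(
            new String[]{"OR", "Portland", "97201"},
            new String[]{"OR", "Portland", "97205"},
            new String[]{"ME", "Portland", "04101"});
        List<String[]> companies = Arrays.asList(
            new String[]{"Acme Labs", "Portland", "OR"},
            new String[]{"Nowhere Inc", "Atlantis", "ZZ"});
        for (String[] row : join(companies, zips)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

Two things stand out in the sketch: a city with several zip codes produces one output row per zip code, so investment rows fan out, and joining on state as well as city keeps same-named cities in different states apart.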
30. The Story So Far: Used Nutch to retrieve investment data from the web site. Used Hadoop to extract and structure the data. Used Pig to add zipcode data. The end result is a collection of relations describing investment activity. We've got raw data; now we need to understand it.
31. Why Vertica? Vertica and Hadoop are complementary technologies. Hadoop's strengths: analysis of unstructured data (screen scraping, natural language recognition); non-numeric operations (graphics preparation). Vertica's strengths: counting, adding, grouping, sorting, …; a rich suite of advanced analytic functions; all at TB+ scales.
32. Built from the Ground Up: The Four C's of Vertica. Columnar storage and execution: achieve best data query performance with the unique Vertica column store. Clustering: linear scaling by adding more resources on the fly. Compression: store more data, provide more views, use less hardware. Continuous performance: query and load 24x7 with zero administration.
34. Connecting Vertica And Hadoop: Vertica provides connectors for Hadoop 0.20.2 and Pig 0.7. Vertica acts as a passive component; Hadoop/Pig connect to Vertica to read/write data. Input is retrieved from Vertica using a standard SQL query. Output is written to a Vertica table.
35. Vertica As a M/R Data Source
// Set up the configuration and job objects
Configuration conf = getConf();
Job job = new Job(conf);
// Set the input format to retrieve data from Vertica
job.setInputFormatClass(VerticaInputFormat.class);
// Set the query to retrieve data from the Vertica DB
VerticaInputFormat.setInput(job, "SELECT * FROM foo WHERE bar = 'baz'");
36. Vertica As a M/R Data Sink
// Set up the configuration and job objects
Configuration conf = getConf();
Job job = new Job(conf);
// Set the output format to write data to Vertica
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(VerticaRecord.class);
job.setOutputFormatClass(VerticaOutputFormat.class);
// Define the table which will hold the output
VerticaOutputFormat.setOutput(job, <table name>, <truncate table?>, <col 1 def>, <col 2 def>, …, <col N def>);
37. Reading Data Via Pig
-- Read some tuples
A = LOAD 'sql://< Your query here >' USING com.vertica.pig.VerticaLoader('server1,server2,server3', '< DB Name >', '5433', '< user >', '< password >');
38. Writing Data Via Pig
-- Write some tuples
STORE < some var > INTO '{ < table name > (< col 1 def >, < col 2 def >, … ) }' USING com.vertica.pig.VerticaStorer('< server >', '< DB >', '5433', '< user >', '< password >');
40. Does My Favorite Application Work With Vertica? Vertica is an ANSI SQL-99 compliant DB. It comes with drivers for ODBC, JDBC, and ADO.NET. If your tool uses a SQL DB and speaks one of these protocols, it'll work just fine.
42. Traditional Reports: Integrates smoothly with reporting frontends such as Jasper and Pentaho. Scriptable via the vsql command-line tool. C/C++ SDK for parallelized, in-DB computation. But… you have to know what questions you want to ask.
45. In Closing… Solutions leveraging Vertica in conjunction with Hadoop are capable of solving a tremendous range of analytical challenges. Hadoop is great for dealing with unstructured data, while Vertica is a superior platform for working with structured/relational data. Getting them to work together is easy.