Bridging Structured and Unstructured Data with Apache Hadoop and Vertica

  1. Bridging Unstructured & Structured Data with Hadoop and Vertica
     Glenn Gebhart, ggebhart@vertica.com
     Steve Watt, swatt@hp.com
  2. Contents
     Our background with Big Data
  3. Accelerating and monitoring Apache Hadoop deployments with HP CMU
  4. I have my Apache Hadoop cluster deployed... Now what?
  5. Sample application scenario with Apache Hadoop and Vertica
     HP Confidential
     Cluster Management Utility
  6. Managing Scale Out with HP CMU
     Proven cluster deployment and management tool
  7. 11 years' experience
  8. Proven with clusters of 3500+ nodes
  9. Deployment and Management
  10. Clone a node (Hadoop slave) and deploy it to an entire logical group.
  11. Provision applications and dependencies with parallel distributed copy (pdcp) and parallel distributed shell (pdsh)
  12. Command-line or GUI-based cluster-wide configuration
  13. Manage a node individually or manage a cluster as a whole
  14. Monitoring
  15. Scalable, non-intrusive monitoring across a wide set of infrastructure metrics
  16. Extensible through Collectl integration
  17. Tech Bubble? What does the Data Say?
      Attribution: CC PascalTerjan via Flickr
  19. But what if I could turn that into this?
  20. And see how the amount invested this year differs from previous years?
  21. Where is the money going?
  22. What type of startups get the most investment funding?
  23. Amount invested in Software Startups by Zip Code
  24. How did you do that?
      Attribution: CC Colin_K on Flickr
  25. Apache Nutch
      Identify optimal seed URLs and crawl to a depth of 2
      http://www.crunchbase.com/companies?c=a&q=privately_held
      Crawl data is stored in segment directories on HDFS
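
To make "crawl to a depth of 2" concrete, here is a minimal in-memory sketch. It assumes depth means the number of fetch rounds (round 1 fetches the seeds, round 2 fetches the pages they link to). The class name `DepthLimitedCrawl` and the link map are hypothetical; a real Nutch crawl keeps its state in HDFS rather than in memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; this is not Nutch code.
final class DepthLimitedCrawl {
    /**
     * Returns the URLs fetched when crawling for `depth` rounds.
     * `links` maps each URL to its outgoing links (a stand-in for fetching and parsing).
     */
    static Set<String> crawl(Map<String, List<String>> links, List<String> seeds, int depth) {
        Set<String> fetched = new LinkedHashSet<>();
        List<String> frontier = new ArrayList<>(seeds);
        for (int round = 0; round < depth && !frontier.isEmpty(); round++) {
            List<String> next = new ArrayList<>();
            for (String url : frontier) {
                if (!fetched.add(url)) {
                    continue; // already fetched in this or an earlier round
                }
                // Outlinks of a fetched page become candidates for the next round.
                next.addAll(links.getOrDefault(url, Collections.emptyList()));
            }
            frontier = next;
        }
        return fetched;
    }
}
```

Under this reading, a depth of 2 fetches the seed listing pages plus the pages they link to directly, and nothing further out.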
  27. Making the data STRUCTURED
      Retrieving HTML
      Preliminary filtering on URL
      Company POJO, then tab-delimited (\t) output
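
The "Company POJO, then tab-delimited output" step might look roughly like the sketch below. The field names are borrowed from the Pig schema used later in this deck. The class layout, the `clean` helper, and the sample values in the usage note are assumptions for illustration, not the presenters' code.

```java
// Hypothetical POJO for one funding record; not the presenters' actual class.
final class Company {
    final String name, city, state, sector, round, investor;
    final int month, year, amount;

    Company(String name, String city, String state, String sector, String round,
            int month, int year, String investor, int amount) {
        this.name = name;
        this.city = city;
        this.state = state;
        this.sector = sector;
        this.round = round;
        this.month = month;
        this.year = year;
        this.investor = investor;
        this.amount = amount;
    }

    /** Replaces tabs and newlines so a field cannot break the tab-delimited layout. */
    static String clean(String field) {
        return field == null ? "" : field.replace('\t', ' ').replace('\n', ' ');
    }

    /** Serializes the record as one tab-delimited line, in the order of the Pig schema. */
    String toTsv() {
        return String.join("\t",
                clean(name), clean(city), clean(state), clean(sector), clean(round),
                Integer.toString(month), Integer.toString(year),
                clean(investor), Integer.toString(amount));
    }
}
```

For example, `new Company("ExampleCo", "Portland", "OR", "software", "a", 7, 2011, "Example Ventures", 1000000).toTsv()` yields one nine-field line that a Pig `LOAD` with a tab delimiter could read. The values are made up.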
  28. Aargh!
      My viz tool requires zip codes to plot geospatially!
  29. Apache Pig script to join on City and State to get the zip code and write the results to Vertica
      -- Field delimiters are assumed to be tabs; the slide text shows PigStorage('').
      ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t')
          AS (State:chararray, City:chararray, ZipCode:int);
      CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t')
          AS (Company:chararray, City:chararray, State:chararray, Sector:chararray,
              Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);
      CrunchBaseZip = JOIN CrunchBase BY (City, State), ZipCodes BY (City, State);
      -- Column types follow the LOAD schema above: Investor is text, Amount is numeric.
      STORE CrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor varchar(40), Amount int)}'
          USING com.vertica.pig.VerticaStorer('VerticaServer', 'OSCON', '5433', 'dbadmin', '');
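
For readers new to Pig, the join in the script above behaves like an inner hash join keyed on (City, State). The plain-Java sketch below shows that idea on in-memory rows. The class name `CityStateJoin` and the simplified row layouts are assumptions, and Pig itself runs the join as distributed jobs rather than in a single process.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-process illustration of JOIN ... BY (City, State).
final class CityStateJoin {
    /** Combines City and State into one lookup key. */
    static String key(String city, String state) {
        return city + "\u0000" + state;
    }

    /**
     * Inner-joins funding rows with zip rows on (City, State).
     * fundingRows: {Company, City, State, Amount}; zipRows: {State, City, ZipCode}.
     * Returns rows of {Company, City, State, Amount, ZipCode}.
     */
    static List<String[]> join(List<String[]> fundingRows, List<String[]> zipRows) {
        // Build phase: index every zip code by its (City, State) key.
        Map<String, List<String>> zipsByCity = new HashMap<>();
        for (String[] z : zipRows) {
            zipsByCity.computeIfAbsent(key(z[1], z[0]), k -> new ArrayList<>()).add(z[2]);
        }
        // Probe phase: emit one output row per matching zip code.
        List<String[]> out = new ArrayList<>();
        for (String[] f : fundingRows) {
            List<String> zips = zipsByCity.getOrDefault(key(f[1], f[2]), Collections.emptyList());
            for (String zip : zips) {
                out.add(new String[] {f[0], f[1], f[2], f[3], zip});
            }
        }
        return out;
    }
}
```

One consequence worth noting: a city with several zip codes produces several output rows per company, so later sums over Amount can be inflated unless the zip file holds one representative zip code per city.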
  30. The Story So Far
      Used Nutch to retrieve investment data from a web site.
      Used Hadoop to extract and structure the data.
      Used Pig to add zip code data.
      End result is a collection of relations describing investment activity.
      We’ve got raw data; now we need to understand it.
  31. Why Vertica?
      Vertica and Hadoop are complementary technologies.
      Hadoop’s strengths:
        Analysis of unstructured data (screen scraping, natural language recognition)
        Non-numeric operations (graphics preparation)
      Vertica’s strengths:
        Counting, adding, grouping, sorting, …
        Rich suite of advanced analytic functions
        All at TB+ scales.
  32. Built from the Ground Up: The Four C’s of Vertica
      Columnar storage and execution
      Continuous performance
      Clustering
      Compression
      Achieve best data query performance with unique Vertica column store
      Linear scaling by adding more resources on the fly
      Store more data, provide more views, use less hardware
      Query and load 24x7 with zero administration
  33. Getting Data From Here To There
  34. Connecting Vertica And Hadoop
      Vertica provides connectors for Hadoop 0.20.2 and Pig 0.7.
      Vertica acts as a passive component; Hadoop/Pig connect to Vertica to read/write data.
      Input is retrieved from Vertica using a standard SQL query.
      Output is written to a Vertica table.
  35. Vertica As an M/R Data Source
      // Set up the configuration and job objects
      // (getConf() assumes the enclosing class extends Hadoop's Configured)
      Configuration conf = getConf();
      Job job = new Job(conf);
      // Set the input format to retrieve data from Vertica
      job.setInputFormatClass(VerticaInputFormat.class);
      // Set the query to retrieve data from the Vertica DB
      VerticaInputFormat.setInput(
          job,
          "SELECT * FROM foo WHERE bar = 'baz'"
      );
  36. Vertica As an M/R Data Sink
      // Set up the configuration and job objects
      // (getConf() assumes the enclosing class extends Hadoop's Configured)
      Configuration conf = getConf();
      Job job = new Job(conf);
      // Set the output format to write data to Vertica
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(VerticaRecord.class);
      job.setOutputFormatClass(VerticaOutputFormat.class);
      // Define the table which will hold the output
      VerticaOutputFormat.setOutput(
          job, <table name>, <truncate table?>,
          <col 1 def>, <col 2 def>, …, <col N def>
      );
  37. Reading Data Via Pig
      -- Read some tuples
      A = LOAD 'sql://< Your query here >'
          USING com.vertica.pig.VerticaLoader(
              'server1,server2,server3',
              '< DB Name >', '5433', '< user >', '< password >'
          );
  38. Writing Data Via Pig
      -- Write some tuples
      STORE < some var >
      INTO '{
          < table name > (< col 1 def >, < col 2 def >, … )
      }'
      USING com.vertica.pig.VerticaStorer(
          '< server >', '< DB >', '5433', '< user >', '< password >'
      );
  39. Reporting And Data Visualization
  40. Does My Favorite Application Work With Vertica?
      Vertica is an ANSI SQL-99 compliant DB.
      It comes with drivers for ODBC, JDBC, and ADO.NET.
      If your tool uses a SQL DB and speaks one of these protocols, it’ll work just fine.
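
As a rough sketch of the JDBC route, the code below runs a grouped query through the standard `java.sql` API. Several things here are assumptions: the `jdbc:vertica://host:port/db` URL shape (check it against your driver's documentation), the `CrunchBaseZip` table carrying `ZipCode`, `Sector`, and `Amount` columns, and the class and method names. It also needs the Vertica JDBC driver on the classpath to actually connect.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; table and column names are assumed, not taken from a real schema.
final class VerticaJdbcSketch {
    /** Assumed URL shape for the Vertica JDBC driver. */
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:vertica://" + host + ":" + port + "/" + database;
    }

    /** Total amount invested per zip code for one sector, largest first. */
    static final String AMOUNT_BY_ZIP =
            "SELECT ZipCode, SUM(Amount) AS Total FROM CrunchBaseZip "
            + "WHERE Sector = ? GROUP BY ZipCode ORDER BY Total DESC";

    /** Runs the query and returns zip code -> total, in result order. */
    static Map<String, Long> amountByZip(String url, String user, String password, String sector)
            throws SQLException {
        Map<String, Long> totals = new LinkedHashMap<>();
        try (Connection conn = DriverManager.getConnection(url, user, password);
             PreparedStatement stmt = conn.prepareStatement(AMOUNT_BY_ZIP)) {
            stmt.setString(1, sector); // bind the sector instead of concatenating it into SQL
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    totals.put(rs.getString(1), rs.getLong(2));
                }
            }
        }
        return totals;
    }
}
```

A call such as `amountByZip(jdbcUrl("VerticaServer", 5433, "OSCON"), "dbadmin", "", "software")` would reuse the server, database, and user names shown in the Pig script. The sector value is a made-up example.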
  41. We Support…
  42. Traditional Reports
      Integrates smoothly with reporting frontends such as Jasper and Pentaho.
      Scriptable via the vsql command-line tool.
      C/C++ SDK for parallelized, in-DB computation.
      But… you have to know what questions you want to ask.
  43. Graphical, Real-Time Data Exploration
  44. Wrap-Up
  45. In Closing…
      Solutions leveraging Vertica in conjunction with Hadoop are capable of solving a tremendous range of analytical challenges.
      Hadoop is great for dealing with unstructured data, while Vertica is a superior platform for working with structured/relational data.
      Getting them to work together is easy.
  46. Questions?