Bridging Unstructured & Structured Data with Hadoop and Vertica
Glenn Gebhart, ggebhart@vertica.com
Steve Watt, swatt@hp.com
Contents
Accelerating and monitoring Apache Hadoop deployments with HP CMU
I have my Apache Hadoop cluster deployed… Now what?
Sample application scenario with Apache Hadoop and Vertica
Managing Scale Out with HP CMU
11 Years Experience
Proven with clusters of 3500+ nodes
Deployment and Management
Clone a Node (Hadoop Slave) and Deploy to an entire Logical Group.
Provision applications and dependencies with parallel distributed copy (pdcp) and parallel distributed shell (pdsh)
Command-line or GUI-based cluster-wide configuration
Manage a node individually or manage a cluster as a whole
Monitoring
Scalable Non-intrusive Monitoring across a wide set of infrastructure metrics
Extensible through Collectl integration
Tech Bubble? What does the data say?
HP Confidential. Attribution: CC PascalTerjan via Flickr
But what if I could turn that into this?
And see how the amount invested this year differs from previous years?
Where is the money going?
What type of startups get the most investment funding?
Amount invested in Software Startups by Zip Code
How did you do that?
Attribution: CC Colin_K on Flickr
Apache Nutch
Identify optimal seed URLs and crawl to a depth of 2
http://www.crunchbase.com/companies?c=a&q=privately_held
Crawl data is stored in segment directories on HDFS
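The idea of crawling outward from seed URLs to a fixed depth can be sketched as a breadth-first walk over a link graph. This is a minimal illustration, not Nutch's implementation: the class name `DepthLimitedCrawl`, the in-memory `links` map, and the page names are all hypothetical, and depth is counted here as link hops from a seed, which may not match how Nutch itself counts depth.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: breadth-first crawl over an in-memory link graph.
final class DepthLimitedCrawl {

    /**
     * Returns every URL reachable from the seeds within maxDepth link hops.
     * The links map stands in for "fetch the page and extract its outlinks".
     */
    static Set<String> crawl(List<String> seeds, Map<String, List<String>> links, int maxDepth) {
        Map<String, Integer> depthOf = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        for (String seed : seeds) {
            if (depthOf.putIfAbsent(seed, 0) == null) {
                queue.add(seed);
            }
        }
        while (!queue.isEmpty()) {
            String url = queue.remove();
            int depth = depthOf.get(url);
            if (depth == maxDepth) {
                continue; // do not follow links beyond the depth limit
            }
            for (String next : links.getOrDefault(url, List.of())) {
                // putIfAbsent returns null only the first time a URL is seen
                if (depthOf.putIfAbsent(next, depth + 1) == null) {
                    queue.add(next);
                }
            }
        }
        return new LinkedHashSet<>(depthOf.keySet());
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "seed", List.of("a"),
            "a", List.of("b"),
            "b", List.of("c"));
        // "c" is three hops from the seed, so it is not visited.
        System.out.println(crawl(List.of("seed"), links, 2)); // prints [seed, a, b]
    }
}
```

Choosing good seeds matters because the depth limit bounds how far the crawl can wander from them.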
Making the data STRUCTURED
Retrieve the HTML
Preliminary filtering on URL
Populate a Company POJO, then write it out tab-delimited (\t)
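The filter-then-emit step could look roughly like the sketch below. It is an assumption-laden illustration rather than the presenters' code: the class name `CompanyRecord`, the `/company/` URL pattern, and the sample values are hypothetical, the field list is borrowed from the Pig schema used later in the deck, and the HTML parsing that would populate the object is left out.

```java
// Hypothetical sketch of a Company POJO that serializes itself as one TSV line.
final class CompanyRecord {
    final String company, city, state, sector, round, investor;
    final int month, year, amount;

    CompanyRecord(String company, String city, String state, String sector, String round,
                  int month, int year, String investor, int amount) {
        this.company = company;
        this.city = city;
        this.state = state;
        this.sector = sector;
        this.round = round;
        this.month = month;
        this.year = year;
        this.investor = investor;
        this.amount = amount;
    }

    /** Preliminary filter: keep only URLs that look like company pages (assumed pattern). */
    static boolean isCompanyUrl(String url) {
        return url != null && url.contains("/company/");
    }

    /** Replaces embedded tabs and line breaks so a field cannot break the column layout. */
    private static String clean(String field) {
        return field == null ? "" : field.replaceAll("[\\t\\r\\n]+", " ").trim();
    }

    /** One record per line, fields separated by a tab character. */
    String toTsv() {
        return String.join("\t",
            clean(company), clean(city), clean(state), clean(sector), clean(round),
            Integer.toString(month), Integer.toString(year),
            clean(investor), Integer.toString(amount));
    }

    public static void main(String[] args) {
        CompanyRecord r = new CompanyRecord("Example Co", "Austin", "TX", "Software",
            "Series A", 6, 2011, "Example Ventures", 5000000);
        System.out.println(r.toTsv());
    }
}
```

Cleaning tabs out of each field keeps every output line at exactly nine columns, which is what a tab-delimited loader downstream expects.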
Aargh! My viz tool requires zip codes to plot geospatially!
Apache Pig script to join on City to get the zip code and write the results to Vertica
    -- Input files are assumed to be tab-delimited, matching the \t output of the previous step
    ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t')
        AS (State:chararray, City:chararray, ZipCode:int);
    CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t')
        AS (Company:chararray, City:chararray, State:chararray, Sector:chararray,
            Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);
    CrunchBaseZip = JOIN CrunchBase BY (City, State), ZipCodes BY (City, State);
    -- Column types follow the LOAD schema above: Investor is text, Amount is numeric
    STORE CrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor varchar(40), Amount int)}'
        USING com.vertica.pig.VerticaStorer('VerticaServer', 'OSCON', '5433', 'dbadmin', '');
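To make the join semantics concrete, here is a small plain-Java sketch of an inner equi-join on (City, State). It is only an illustration of what the Pig JOIN computes, not how Pig executes it; the class name `CityStateJoin`, the array layouts, and the sample rows are hypothetical. One thing it makes visible: a city with several zip codes produces one output row per zip code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: inner hash join of investments to zip codes on (City, State).
final class CityStateJoin {

    /** Composite key; a tab is a safe separator because the inputs are tab-delimited. */
    static String key(String city, String state) {
        return city + "\t" + state;
    }

    /**
     * investments rows: {Company, City, State, Amount}
     * zipCodes rows:    {State, City, ZipCode}
     * output rows:      {Company, City, State, Amount, ZipCode}
     */
    static List<String[]> join(List<String[]> investments, List<String[]> zipCodes) {
        // Build side: index every zip code by its (City, State) key.
        Map<String, List<String>> zipsByCityState = new HashMap<>();
        for (String[] z : zipCodes) {
            zipsByCityState.computeIfAbsent(key(z[1], z[0]), k -> new ArrayList<>()).add(z[2]);
        }
        // Probe side: emit one row per matching zip code; unmatched rows are dropped.
        List<String[]> out = new ArrayList<>();
        for (String[] inv : investments) {
            for (String zip : zipsByCityState.getOrDefault(key(inv[1], inv[2]), List.of())) {
                out.add(new String[] {inv[0], inv[1], inv[2], inv[3], zip});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> investments = List.of(
            new String[] {"Example Co", "Austin", "TX", "5000000"},
            new String[] {"Other Co", "Nowhere", "ZZ", "100"});
        List<String[]> zipCodes = List.of(
            new String[] {"TX", "Austin", "78701"},
            new String[] {"TX", "Austin", "78702"});
        for (String[] row : join(investments, zipCodes)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

Because the match is exact, differences in spelling or capitalization between the two files would silently drop rows, and multi-zip cities would repeat the same investment, so totals by zip code deserve a sanity check.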
The Story So Far
Used Nutch to retrieve investment data from the web site.
Used Hadoop to extract and structure the data.
Used Pig to add zip code data.
End result is a collection of relations describing investment activity.
We've got raw data; now we need to understand it.
Why Vertica?
Vertica and Hadoop are complementary technologies.
Hadoop's strengths:
  Analysis of unstructured data (screen scraping, natural language recognition)
  Non-numeric operations (graphics preparation)
Vertica's strengths:
  Counting, adding, grouping, sorting, …
  Rich suite of advanced analytic functions
  All at TB+ scales.
Built from the Ground Up: The Four C's of Vertica
Columnar storage and execution: achieve best data query performance with unique Vertica column store
Continuous performance: query and load 24x7 with zero administration
Clustering: linear scaling by adding more resources on the fly
Compression: store more data, provide more views, use less hardware
Getting Data From Here To There
Connecting Vertica And Hadoop
Vertica provides connectors for Hadoop 0.20.2 and Pig 0.7.
Vertica acts as a passive component; Hadoop/Pig connect to Vertica to read/write data.
Input is retrieved from Vertica using a standard SQL query.
Output is written to a Vertica table.
Vertica As a M/R Data Source
    // Set up the configuration and job objects
    Configuration conf = getConf();
    Job job = new Job(conf);

    // Set the input format to retrieve data from Vertica
    job.setInputFormatClass(VerticaInputFormat.class);

    // Set the query to retrieve data from the Vertica DB
    VerticaInputFormat.setInput(
        job,
        "SELECT * FROM foo WHERE bar = 'baz'"
    );

Vertica As a M/R Data Sink
    // Set up the configuration and job objects
    Configuration conf = getConf();
    Job job = new Job(conf);

    // Set the output format to write data to Vertica
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(VerticaRecord.class);
    job.setOutputFormatClass(VerticaOutputFormat.class);

    // Define the table which will hold the output
    VerticaOutputFormat.setOutput(
        job,
        <table name>,
        <truncate table?>,
        <col 1 def>, <col 2 def>, …, <col N def>
    );
Reading Data Via Pig
    -- Read some tuples
    A = LOAD 'sql://< Your query here >'
        USING com.vertica.pig.VerticaLoader(
            'server1,server2,server3',
            '< DB Name >', '5433', '< user >', '< password >'
        );
Writing Data Via Pig
    -- Write some tuples
    STORE < some var > INTO '{ < table name > (< col 1 def >, < col 2 def >, … ) }'
        USING com.vertica.pig.VerticaStorer(
            '< server >', '< DB >', '5433', '< user >', '< password >'
        );
Reporting And Data Visualization
Does My Favorite Application Work With Vertica?
Vertica is an ANSI SQL99 compliant DB.
Comes with drivers for ODBC, JDBC, and ADO.NET.
If your tool uses a SQL DB and speaks one of these protocols, it'll work just fine.
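As a rough illustration of the JDBC route, the sketch below issues an ordinary SQL query through the standard `java.sql` API. Several things are assumptions: the class name `VerticaJdbcSketch` is hypothetical, the host, database, port, and user are simply the values from the Pig example earlier in the deck, the year 2011 is an arbitrary sample value, the query runs against the `CrunchBaseZip` table that example writes, and the Vertica JDBC driver must be on the classpath for a real connection to succeed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical sketch: querying Vertica like any other SQL database over JDBC.
final class VerticaJdbcSketch {

    /** Number of funding rounds per sector for one year, busiest sector first. */
    static final String ROUNDS_BY_SECTOR =
        "SELECT Sector, COUNT(*) AS Deals FROM CrunchBaseZip "
        + "WHERE Year = ? GROUP BY Sector ORDER BY Deals DESC";

    /** Builds a connection URL of the form jdbc:vertica://host:port/database. */
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:vertica://" + host + ":" + port + "/" + database;
    }

    /** Runs the query on an open connection and prints one line per sector. */
    static void printRoundsBySector(Connection conn, int year) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(ROUNDS_BY_SECTOR)) {
            ps.setInt(1, year);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("Sector") + "\t" + rs.getLong("Deals"));
                }
            }
        }
    }

    public static void main(String[] args) {
        // Connection details reuse the values from the Pig example; adjust for a real cluster.
        String url = jdbcUrl("VerticaServer", 5433, "OSCON");
        try (Connection conn = DriverManager.getConnection(url, "dbadmin", "")) {
            printRoundsBySector(conn, 2011);
        } catch (SQLException e) {
            System.err.println("Could not query Vertica: " + e.getMessage());
        }
    }
}
```

Nothing in the query code is Vertica-specific apart from the URL prefix, which is the point of the slide: a tool that already speaks JDBC needs only a driver and a connection string.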
Traditional Reports
Integrates smoothly with reporting frontends such as Jasper and Pentaho.
Scriptable via the vsql command-line tool.
C/C++ SDK for parallelized, in-DB computation.
But… you have to know what questions you want to ask.
In Closing…
Solutions leveraging Vertica in conjunction with Hadoop are capable of solving a tremendous range of analytical challenges.
Hadoop is great for dealing with unstructured data, while Vertica is a superior platform for working with structured/relational data.
Getting them to work together is easy.