Bridging Unstructured & Structured Data with Hadoop and Vertica
Glenn Gebhart, ggebhart@vertica.com
Steve Watt, swatt@hp.com
Contents
Accelerating and monitoring Apache Hadoop deployments with HP CMU
I have my Apache Hadoop cluster deployed... Now what?
Sample application scenario with Apache Hadoop and Vertica
Managing Scale Out with HP CMU
11 Years Experience
Proven with clusters of 3500+ nodes
Deployment and Management
Clone a Node (Hadoop Slave) and Deploy to an entire Logical Group.
Provision applications and dependencies with parallel distributed copy (pdcp) and parallel distributed shell (pdsh)
Command Line or GUI based cluster wide configuration
Manage a node individually or manage a cluster as a whole
Monitoring
Scalable Non-intrusive Monitoring across a wide set of infrastructure metrics
Extensible through Collectl integration
6 HP Confidential Tech Bubble?  What does the Data Say? Attribution: CC PascalTerjan via Flickr
But what if I could turn that into this?
And see how the amount invested this year differs from previous years?
Where is the money going?
What type of startups get the most investment funding?
Amount invested in Software Startups by Zip Code
How did you do that? Attribution: CC Colin_K on Flickr
Apache Nutch
Identify Optimal Seed URLs & Crawl to a depth of 2
http://www.crunchbase.com/companies?c=a&q=privately_held
Crawl data is stored in segment dirs on the HDFS
Making the data STRUCTURED
Retrieving HTML
Preliminary filtering on URL
Company POJO, then tab-delimited (\t) output
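The "Company POJO, then tab-delimited output" step might look roughly like the sketch below. The `Company` class, its field names, and the `clean` helper are hypothetical; the deck does not show the real POJO. The field order is an assumption borrowed from the schema in the Pig script on the following slide.

```java
/** Hypothetical POJO for one funding record (not the class used in the talk). */
final class Company {
    final String name, city, state, sector, round;
    final int month, year;
    final String investor;
    final int amount;

    Company(String name, String city, String state, String sector, String round,
            int month, int year, String investor, int amount) {
        this.name = name; this.city = city; this.state = state;
        this.sector = sector; this.round = round;
        this.month = month; this.year = year;
        this.investor = investor; this.amount = amount;
    }

    /** A tab or newline inside a field would corrupt the row, so replace runs of them with one space. */
    private static String clean(String s) {
        return s == null ? "" : s.replaceAll("[\\t\\r\\n]+", " ").trim();
    }

    /** One tab-delimited line in the order Company, City, State, Sector, Round, Month, Year, Investor, Amount. */
    String toTsv() {
        return String.join("\t",
                clean(name), clean(city), clean(state), clean(sector), clean(round),
                Integer.toString(month), Integer.toString(year),
                clean(investor), Integer.toString(amount));
    }
}
```

Scrubbing tabs and newlines before writing matters here because a tab-delimited loader has no quoting mechanism to fall back on.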
Aargh! My viz tool requires zipcodes to plot geospatially!
Apache Pig Script to Join on City and State to get Zip Code and Write the results to Vertica
ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('') AS (State:chararray, City:chararray, ZipCode:int);
CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('') AS (Company:chararray, City:chararray, State:chararray, Sector:chararray, Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);
CrunchBaseZip = JOIN CrunchBase BY (City,State), ZipCodes BY (City,State);
STORE CrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor varchar(40), Amount int)}' USING com.vertica.pig.VerticaStorer('VerticaServer','OSCON','5433','dbadmin','');
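For readers less familiar with Pig, the JOIN on (City, State) above is an inner equi-join on a composite key. The following is a minimal in-memory sketch of that computation in plain Java, not how Pig executes it on a cluster. The class `CityStateJoin` and the array layouts are assumptions made for illustration. Unlike Pig, which emits every field from both relations, this sketch appends only the zip code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CityStateJoin {
    /** Composite key; the NUL separator is assumed never to occur in city or state names. */
    static String key(String city, String state) {
        return city + "\u0000" + state;
    }

    /**
     * companyRows: {company, city, state, ...}; zipRows: {state, city, zip}.
     * Returns each matching company row with one zip code appended.
     */
    static List<String[]> join(List<String[]> companyRows, List<String[]> zipRows) {
        // Build side: index zip codes by (city, state). A city can have several zip codes.
        Map<String, List<String>> zipsByCityState = new HashMap<>();
        for (String[] z : zipRows) {
            zipsByCityState.computeIfAbsent(key(z[1], z[0]), k -> new ArrayList<>()).add(z[2]);
        }
        // Probe side: look up each company; unmatched companies are dropped (inner join).
        List<String[]> out = new ArrayList<>();
        for (String[] c : companyRows) {
            List<String> zips = zipsByCityState.get(key(c[1], c[2]));
            if (zips == null) {
                continue;
            }
            for (String zip : zips) {
                String[] row = Arrays.copyOf(c, c.length + 1);
                row[c.length] = zip;
                out.add(row);
            }
        }
        return out;
    }
}
```

One consequence worth noticing: a company in a city with several zip codes produces one output row per zip code, so later sums over Amount can overcount unless the zip code file holds a single representative zip code per city.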
The Story So Far
Used Nutch to retrieve investment data from a web site.
Used Hadoop to extract and structure the data.
Used Pig to add zipcode data.
End result is a collection of relations describing investment activity.
We've got raw data, now we need to understand it.
Why Vertica?
Vertica and Hadoop are complementary technologies.
Hadoop's strengths:
Analysis of unstructured data (screen scraping, natural language recognition)
Non-numeric operations (graphics preparation)
Vertica's strengths:
Counting, adding, grouping, sorting, …
Rich suite of advanced analytic functions
All at TB+ scales.
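To make "counting, adding, grouping, sorting" concrete, here is the shape of one such question (total amount invested per sector, sorted by sector) on a tiny in-memory list. In Vertica this would be an ordinary SQL GROUP BY; Java is used here only to show the computation. The `InvestmentTotals` class and its `Row` type are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class InvestmentTotals {
    /** Hypothetical row holding only the two columns the aggregate needs. */
    static final class Row {
        final String sector;
        final long amount;

        Row(String sector, long amount) {
            this.sector = sector;
            this.amount = amount;
        }
    }

    /**
     * In spirit: SELECT sector, SUM(amount) FROM rows GROUP BY sector ORDER BY sector.
     * The TreeMap keeps the result sorted by sector name.
     */
    static Map<String, Long> totalBySector(List<Row> rows) {
        return rows.stream().collect(
                Collectors.groupingBy(r -> r.sector, TreeMap::new,
                        Collectors.summingLong(r -> r.amount)));
    }
}
```

This works for a list that fits in memory on one machine; the point of the slide is that a database built for the job runs the same kind of query over terabytes.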
Built from the Ground Up: The Four C's of Vertica
Columnar storage and execution
Continuous performance
Clustering
Compression
Achieve best data query performance with unique Vertica column store
Linear scaling by adding more resources on the fly
Store more data, provide more views, use less hardware
Query and load 24x7 with zero administration
Getting Data From Here To There
Connecting Vertica And Hadoop
Vertica provides connectors for Hadoop 0.20.2 and Pig 0.7.
Vertica acts as a passive component; Hadoop/Pig connect to Vertica to read/write data.
Input is retrieved from Vertica using a standard SQL query.
Output is written to a Vertica table.
Vertica as an M/R Data Source
// Set up the configuration and job objects
Configuration conf = getConf();
Job job = new Job(conf);
// Set the input format to retrieve data from Vertica
job.setInputFormatClass(VerticaInputFormat.class);
// Set the query to retrieve data from the Vertica DB
VerticaInputFormat.setInput(job, "SELECT * FROM foo WHERE bar = 'baz'");

CloudStudio User manual (basic edition):
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

Bridging Structured and Unstructured Data with Apache Hadoop and Vertica

  • 1. Bridging Unstructured & Structured Data with Hadoop and Vertica. Glenn Gebhart (ggebhart@vertica.com), Steve Watt (swatt@hp.com)
  • 3. Accelerating and monitoring Apache Hadoop deployments with HP CMU
  • 4. I have my Apache Hadoop cluster deployed... Now what?
  • 8. Proven with clusters of 3500+ nodes
  • 10. Clone a Node (Hadoop Slave) and Deploy to an entire Logical Group.
  • 11. Provision applications and dependencies with parallel distributed copy (pdcp) and parallel distributed shell (pdsh)
  • 12. Command Line or GUI based cluster wide configuration
  • 13. Manage a node individually or manage a cluster as a whole
  • 15. Scalable Non-intrusive Monitoring across a wide set of infrastructure metrics
  • 17. 6 HP Confidential Tech Bubble? What does the Data Say? Attribution: CC PascalTerjan via Flickr
  • 19. But what if I could turn that into this?
  • 20. And see how the amount invested this year differs from previous years?
  • 21. Where is the money going?
  • 22. What type of startups get the most investment funding?
  • 23. Amount invested in Software Startups by Zip Code
  • 24. How did you do that? Attribution: CC Colin_K on Flickr
  • 25. Apache Nutch: Identify optimal seed URLs and crawl to a depth of 2, starting from http://www.crunchbase.com/companies?c=a&q=privately_held. Crawl data is stored in segment directories on HDFS.
  • 27. Making the data STRUCTURED: retrieve the HTML, apply preliminary filtering on the URL, populate a Company POJO, then write it out as tab-delimited (\t) text.
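
The "Company POJO, then tab-delimited output" step can be sketched roughly as follows. This is a minimal illustration, not the presenters' code: the `Company` class, its field names (borrowed from the columns in the Pig script on slide 29), the `clean` helper, and the sample values are all assumptions.

```java
import java.util.StringJoiner;

/** Hypothetical Company POJO; fields mirror the columns loaded by the Pig script on slide 29. */
final class Company {
    final String name, city, state, sector, round, investor;
    final int month, year, amount;

    Company(String name, String city, String state, String sector, String round,
            int month, int year, String investor, int amount) {
        this.name = name;
        this.city = city;
        this.state = state;
        this.sector = sector;
        this.round = round;
        this.month = month;
        this.year = year;
        this.investor = investor;
        this.amount = amount;
    }

    /** Collapse tabs and line breaks so a scraped value cannot break the record layout. */
    static String clean(String s) {
        return s == null ? "" : s.replaceAll("[\\t\\r\\n]+", " ").trim();
    }

    /** Emit one tab-delimited record in the column order the Pig LOAD expects. */
    String toTsv() {
        StringJoiner j = new StringJoiner("\t");
        j.add(clean(name)).add(clean(city)).add(clean(state))
         .add(clean(sector)).add(clean(round))
         .add(Integer.toString(month)).add(Integer.toString(year))
         .add(clean(investor)).add(Integer.toString(amount));
        return j.toString();
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only; not real CrunchBase data.
        Company c = new Company("Example Co", "Portland", "OR", "Software", "A",
                                7, 2011, "Example Ventures", 1000000);
        System.out.println(c.toTsv());
    }
}
```

Cleaning each text field before joining matters because scraped HTML often contains stray tabs or newlines, which would otherwise shift columns when the file is later loaded with a tab delimiter.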
  • 28. Aargh! My viz tool requires zip codes to plot geospatially!
  • 29. Apache Pig Script to Join on City to get Zip Code and Write the results to Vertica
        -- Load the zip code lookup table and the extracted CrunchBase records (tab-delimited)
        ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t') AS (State:chararray, City:chararray, ZipCode:int);
        CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t') AS (Company:chararray, City:chararray, State:chararray, Sector:chararray, Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);
        -- Join on the composite key (City, State)
        CrunchBaseZip = JOIN CrunchBase BY (City, State), ZipCodes BY (City, State);
        -- Write the joined relation to a Vertica table
        STORE CrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor varchar(40), Amount int)}' USING com.vertica.pig.VerticaStorer('VerticaServer','OSCON','5433','dbadmin','');
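
To make the join on (City, State) concrete, here is a small in-memory sketch of the same idea in plain Java. It only illustrates inner-join semantics, not how Pig executes the join on a cluster; the class name `ZipJoinSketch`, the array layouts, and the sample rows in `main` are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative hash join on the composite key (City, State). */
final class ZipJoinSketch {

    /** Build the composite key; the separator is a character unlikely to occur in the data. */
    static String key(String city, String state) {
        return city + "\u0001" + state;
    }

    /**
     * zipRows are {State, City, ZipCode}; companyRows are {Company, City, State, ...}.
     * Returns every company row with a matching zip code appended (inner join).
     */
    static List<String[]> join(List<String[]> companyRows, List<String[]> zipRows) {
        // Index the smaller lookup table by (City, State).
        Map<String, List<String>> zipsByCity = new HashMap<>();
        for (String[] z : zipRows) {
            zipsByCity.computeIfAbsent(key(z[1], z[0]), k -> new ArrayList<>()).add(z[2]);
        }
        List<String[]> out = new ArrayList<>();
        for (String[] c : companyRows) {
            List<String> zips = zipsByCity.get(key(c[1], c[2]));
            if (zips == null) {
                continue; // an inner join drops companies whose city has no zip entry
            }
            for (String zip : zips) {
                String[] row = Arrays.copyOf(c, c.length + 1);
                row[c.length] = zip;
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows for illustration.
        List<String[]> zips = new ArrayList<>();
        zips.add(new String[] {"OR", "Portland", "97201"});
        zips.add(new String[] {"OR", "Portland", "97202"});
        List<String[]> companies = new ArrayList<>();
        companies.add(new String[] {"Example Co", "Portland", "OR"});
        for (String[] row : join(companies, zips)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

One consequence worth noticing: a city with several zip codes produces one output row per zip code for each company, so downstream sums over the joined relation can double count unless the lookup table holds a single zip code per (City, State).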
  • 30. The Story So Far: Used Nutch to retrieve investment data from the web site. Used Hadoop to extract and structure the data. Used Pig to add zip code data. The end result is a collection of relations describing investment activity. We've got raw data; now we need to understand it.
  • 31. Why Vertica? Vertica and Hadoop are complementary technologies. Hadoop's strengths: analysis of unstructured data (screen scraping, natural language recognition); non-numeric operations (graphics preparation). Vertica's strengths: counting, adding, grouping, sorting, …; a rich suite of advanced analytic functions; all at TB+ scales.
  • 32. Built from the Ground Up: The Four C's of Vertica. Columnar storage and execution: achieve best data query performance with the unique Vertica column store. Clustering: linear scaling by adding more resources on the fly. Compression: store more data, provide more views, use less hardware. Continuous performance: query and load 24x7 with zero administration.
  • 33. Getting Data From Here To There
  • 34. Connecting Vertica And Hadoop: Vertica provides connectors for Hadoop 0.20.2 and Pig 0.7. Vertica acts as a passive component; Hadoop/Pig connect to Vertica to read and write data. Input is retrieved from Vertica using a standard SQL query. Output is written to a Vertica table.
  • 35. Vertica As a M/R Data Source
        // Set up the configuration and job objects
        Configuration conf = getConf();
        Job job = new Job(conf);
        // Set the input format to retrieve data from Vertica
        job.setInputFormatClass(VerticaInputFormat.class);
        // Set the query to retrieve data from the Vertica DB
        VerticaInputFormat.setInput(job, "SELECT * FROM foo WHERE bar = 'baz'");
  • 36. Vertica As a M/R Data Sink
        // Set up the configuration and job objects
        Configuration conf = getConf();
        Job job = new Job(conf);
        // Set the output format to write data to Vertica
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(VerticaRecord.class);
        job.setOutputFormatClass(VerticaOutputFormat.class);
        // Define the table which will hold the output
        VerticaOutputFormat.setOutput(job, <table name>, <truncate table?>, <col 1 def>, <col 2 def>, …, <col N def>);
  • 37. Reading Data Via Pig
        -- Read some tuples
        A = LOAD 'sql://< Your query here >' USING com.vertica.pig.VerticaLoader('server1,server2,server3', '< DB Name >', '5433', '< user >', '< password >');
  • 38. Writing Data Via Pig
        -- Write some tuples
        STORE < some var > INTO '{ < table name > (< col 1 def >, < col 2 def >, … ) }' USING com.vertica.pig.VerticaStorer('< server >', '< DB >', '5433', '< user >', '< password >');
  • 39. Reporting And Data Visualization
  • 40. Does My Favorite Application Work With Vertica? Vertica is an ANSI SQL-99 compliant database. It comes with drivers for ODBC, JDBC, and ADO.NET. If your tool uses a SQL database and speaks one of these protocols, it'll work just fine.
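
As a rough sketch of what "speaks JDBC" means in practice, the snippet below queries the table written on slide 29 using only the standard `java.sql` API. It assumes the Vertica JDBC driver jar is on the classpath and that the driver accepts URLs of the form `jdbc:vertica://host:port/database`; the server name, database, port, and user are taken from slide 29, while the class name, the `jdbcUrl` helper, and the query itself are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Illustrative JDBC client; requires a Vertica JDBC driver on the classpath to actually connect. */
final class VerticaJdbcSketch {

    /** Build a connection URL (assumed format: jdbc:vertica://host:port/database). */
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:vertica://" + host + ":" + port + "/" + database;
    }

    /** Count funding records per sector in the table populated by the Pig script. */
    static void printRoundsBySector(Connection conn) throws SQLException {
        String sql = "SELECT Sector, COUNT(*) AS Rounds FROM CrunchBaseZip "
                   + "GROUP BY Sector ORDER BY Rounds DESC";
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        // Connection details reuse the values shown in the VerticaStorer call on slide 29.
        String url = jdbcUrl("VerticaServer", 5433, "OSCON");
        try (Connection conn = DriverManager.getConnection(url, "dbadmin", "")) {
            printRoundsBySector(conn);
        }
    }
}
```

Because the code depends only on `java.sql` interfaces, the same pattern is what reporting tools use under the hood: swap the URL and driver and the query logic stays unchanged.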
  • 42. Traditional Reports: Integrates smoothly with reporting frontends such as Jasper and Pentaho. Scriptable via the vsql command-line tool. C/C++ SDK for parallelized, in-DB computation. But… you have to know what questions you want to ask.
  • 45. In Closing… Solutions leveraging Vertica in conjunction with Hadoop are capable of solving a tremendous range of analytical challenges. Hadoop is great for dealing with unstructured data, while Vertica is a superior platform for working with structured/relational data. Getting them to work together is easy.