© Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
HP Vertica: Solving Facebook's Big Data Challenges
Moustafa Soliman / April 1, 2015
Hello!
Infrastructure and Outsourcing
Applications and Business Services
Analytics & Data Management
HP Financial Services
HP Enterprise Group
HP Enterprise Services
HP Software
HP Printing & Personal Systems
Hewlett-Packard Enterprise / HP Inc.
HP Big Data Platform: HAVEn
HP Big Data platform
Transforming Big Data into business solutions
Data sources: Audio, Social Media, Images, Email, Video, Search engine, Documents, Texts, Transactional Data, Mobile, IT/OT
HAVEn: Standard platform with connectors, applications, and engines
• Hadoop/HDFS: Catalog massive volumes of distributed data
• Autonomy IDOL: Process and index all information
• Vertica: Analyze at extreme scale in real-time
• Enterprise Security: Collect & unify machine data
• nApps: Powering HP Software + your apps
The Vertica Technology
HP Vertica: An integrated Big Data ecosystem
HP Vertica – Core Features
• Columnar Storage: Speeds query time by reading only necessary data
• Compression: Lowers costly I/O to boost overall performance
• MPP Scale-Out: Provides high scalability on clusters with no name node or other single point of failure
• Distributed Query: Any node can initiate the queries and use other nodes for work. No single point of failure
• Projections: Combine high availability with special optimizations for query performance
[Diagram: three nodes, each with its own CPU, Memory, and Disk; labels: A B D C E A]
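The columnar storage and compression points above can be illustrated with a toy sketch. It assumes a tiny in-memory table with made-up column names and uses run-length encoding as one example of a compression scheme that works well on sorted columns. It is not a description of Vertica's internal storage format.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# Illustrative only: table contents and function names are hypothetical.

rows = [
    {"user_id": 1, "country": "SE", "clicks": 12},
    {"user_id": 2, "country": "US", "clicks": 7},
    {"user_id": 3, "country": "SE", "clicks": 30},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(records, column):
    """Row layout: every field of every row is read to answer the query."""
    values_read = 0
    total = 0
    for record in records:
        values_read += len(record)
        total += record[column]
    return total, values_read


def sum_column_store(columns, column):
    """Column layout: only the requested column is read."""
    values = columns[column]
    return sum(values), len(values)


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


columns = to_columns(rows)
row_total, row_reads = sum_row_store(rows, "clicks")
col_total, col_reads = sum_column_store(columns, "clicks")
encoded_country = run_length_encode(sorted(columns["country"]))
```

Both layouts return the same sum, but the column layout reads one value per row instead of every field, and the sorted `country` column shrinks to one pair per distinct run.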
Vertica: co-existing to convergence
Vertica - Hadoop
1. Vertica and Hadoop Integration
   1. MapReduce & Pig Connector
   2. HDFS Flat files
   3. HCatalog
[Diagram labels: Hadoop, HDFS, ANSI SQL, HCatalog, Pig, MapReduce, HBase, Billing, Clickstream, Telemetry]
Vertica MapReduce Connector
Allows flexibility & interoperability
Integrate with Hadoop / MapReduce and Pig
• Vertica-aware extension to Hadoop
• Specialized adapter for distributed streaming between Hadoop and Vertica
Developers need access to a fast DBMS that co-exists with Hadoop rather than being embedded
• Operate on different clusters, generally by different groups of people
• Allows customers to scale computation independently of the DBMS
Vertica Connector for HDFS
• External Tables: Query and access data directly on HDFS
• Connector for HDFS: Load files from HDFS into Vertica via the COPY command
• HDFS Storage Location: Store ROS containers on HDFS
Vertica HCatalog Connector
The Vertica HCatalog Connector lets you access data stored in Apache Hive data warehouse software the same way you access data in a native Vertica table.
• Always reflects the current state of data stored in Hive.
• The HCatalog Connector uses the parallel nature of both HP Vertica and Hadoop to process Hive data.
• Since Vertica performs the extraction and parsing of data, the HCatalog Connector does not significantly increase the load on your Hadoop cluster.
2. Vertica SQL-on-Hadoop
• Vertica and Hadoop co-exist on shared hardware
• Vertica uses HDFS for its data storage
[Diagram labels: Hadoop, HDFS, HCatalog, Pig, MapReduce, HBase, Billing, Clickstream, Telemetry]
HP Vertica: Solving Facebook's Big Data Challenges
Use-case: Facebook
Slide sections: What do they do; Challenge; Business Benefit / Outcomes; Technical Pain; Use-Case; Business Challenge
• Lack of query performance
• Restricted data set due to lack of scalability
• Targeted marketing
• Generating new revenue streams from mobile
Leading social media website focused on connecting the world. Largest database in the world (> 400PB), driving revenue through targeted online marketing/revenue from data via web and mobile
• Jobs take 1 day in Hadoop
• X could not scale-up
• Y too expensive
To increase revenue from information, through massive volume and variety of queries and profiling people with the right advertising campaigns
The queries take < 1 minute using HP Vertica.
The company said growth had been fuelled by advertising income, which leapt 66 per cent year-on-year.
Facebook were struggling to generate mobile advertising revenue prior to implementing HP Vertica.
Facebook Production Data Flow
[Diagram labels: Mobile, PC/Laptop, Web Servers, Logs, Hadoop/HDFS, Job Scheduler, Vertica, Logs; load intervals: 15 mins, Hourly, Daily, Legacy]
2 huge Hadoop clusters
• 1.7 exabytes
• 15,000 nodes
• 40,000 nodes
• Schedulers: Data Wormhole and Data Bee (developed at FB)
• 1500B rows/day, 20TB/hour…..500TB/day
• Keep 30-90 days. Currently 30 days
• 2 x 270-node Vertica clusters
  • Distance apart: 60km
  • ETL dual load
  • One primary, one secondary
  • Grow to 500-1000 nodes per cluster in one year
• 600K MR jobs/day
• 50K Informatica jobs/day
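The ingest figures above can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 10**12 bytes), treats the 500TB/day figure as uncompressed volume, and assumes each cluster keeps the full 30-day window because of the dual load. None of these assumptions is stated on the slide.

```python
# Back-of-envelope check of the ingest figures on the slide.
# Assumptions (not from the slide): decimal units, uncompressed volume,
# and a full copy of the retention window on each cluster.

TB = 10**12
PB = 10**15

rows_per_day = 1500 * 10**9   # "1500B rows/day"
bytes_per_hour = 20 * TB      # "20TB/hour"
bytes_per_day = 500 * TB      # "500TB/Day"
retention_days = 30           # "Currently 30 days"
nodes_per_cluster = 270       # "2 x 270-node Vertica cluster"

# 20 TB/hour sustained for 24 hours, to compare with the 500 TB/day figure.
daily_from_hourly_tb = bytes_per_hour * 24 / TB

# Average size of one row implied by the daily volume and row count.
avg_row_bytes = bytes_per_day / rows_per_day

# Raw data held per cluster over the retention window, in total and per node.
retained_pb = bytes_per_day * retention_days / PB
retained_tb_per_node = bytes_per_day * retention_days / nodes_per_cluster / TB

# Sustained ingest rate in rows per second.
rows_per_second = rows_per_day / 86400
```

Under these assumptions the hourly rate gives 480 TB/day, in line with the rounded 500TB/day figure, an average row is roughly 333 bytes, and a 30-day window is about 15 PB of raw data per cluster, or roughly 56 TB per node before compression.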
Thank you!
Contact us:
moustafa.soliman@hp.com
marcus.hansfeldt@hp.com
sten.nikolaev@hp.com
carola.holmstrom@hp.com
Backup
“…for Facebook’s CIO, Tim Campos, to get on stage in Europe and declare that, “A partner like HP Vertica thinks like we do” and is a “key part” of Facebook’s big data capabilities, is one of the best endorsements, err … “likes,” that any modern IT infrastructure vendor could hope for.”
- Dana Gardner, Briefings Direct
Watch the video to see how Vertica empowers Facebook
The Vertica Technology
The 4 Cs
HP Vertica Flex Zone
Challenge: Analyzing semi-structured data is difficult and time consuming
Solution: HP Vertica Flex Zone
Benefits:
• Store and explore semi-structured data cost-effectively
• Avoid creating and maintaining time-consuming schemas
• Gain 10X+ performance with one simple step
• http://www.vertica.com/hp-vertica-products/flexzone/
[Diagram labels: HP Vertica Analytics Platform; Flex Zone (Explore); Enterprise Edition (Optimize); SQL; Extensible Analytics]
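The idea of exploring semi-structured data without declaring a schema first can be sketched as a toy schema-on-read store. The event records, field names, and helper functions below are hypothetical, and the sketch is a generic illustration rather than how Flex Zone is implemented.

```python
import json

# Toy schema-on-read store: raw JSON lines are kept as-is and keys are
# resolved only at query time. All data and names here are hypothetical.

raw_events = [
    '{"user": "a", "action": "click", "device": "mobile"}',
    '{"user": "b", "action": "view"}',
    '{"user": "a", "action": "click", "ad_id": 17}',
]


def load(lines):
    """Parse each line without declaring any columns up front."""
    return [json.loads(line) for line in lines]


def select(records, *keys):
    """Project arbitrary keys; a key missing from a record yields None."""
    return [tuple(record.get(key) for key in keys) for record in records]


def discovered_keys(records):
    """List every key seen so far across all records."""
    return sorted({key for record in records for key in record})


events = load(raw_events)
user_devices = select(events, "user", "device")
all_keys = discovered_keys(events)
```

Records with different fields load side by side, and a query can name any key. The list of discovered keys is the kind of information one might use to decide which fields deserve a fixed, optimized column later.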

Moustafa Soliman "HP Vertica- Solving Facebook Big Data challenges"
