Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?

When working with structured, semi-structured, and unstructured data, there is often a tendency to try to force one tool, either Hadoop or a traditional DBMS, to do all the work. At Vertica, we've found that there are reasons to use Hadoop for some analytics projects and Vertica for others, and the magic comes in knowing when to use which tool and how the two can work together. Join us as we walk through customer use cases for pairing Hadoop with a purpose-built analytics platform for an effective, combined analytics solution.

Presentation Transcript

  • Hadoop vs. RDBMS for Big Data Analytics... Why Choose?
    – Mingsheng Hong, Field CTO, HP Vertica
    – Scott McClellan, VP, HP Emerging Applications
  • Hadoop for Big Data Analytics
    – Scalable
    – Flexible
    – Low cost to try out
    – Strong community
    – But… batch-oriented jobs; less efficient storage; "programmer friendly" (improving)
  • Survey of Big Data Tools (diagram: stats programs, Hadoop, and CEP engines positioned around the Big Data space, with a "?" for the gap between them)
  • Vertica Analytics RDBMS Platform: real-time Big Data with speed, scalability, simplicity
    – Relational DBMS with ACID
    – Real-time analytic reporting with SQL
    – 50–1000x faster than traditional DBs
    – High scalability, elasticity and full parallelism
    – Simple install/use with automatic setup and tuning
    – Industry-standard x86 hardware
    – Advanced in-database analytics
    – Extensible analytics framework
  • We Have a Lot in Common…
    – Purpose-built from scratch for analytics
    – Commodity hardware
    – MPP infrastructure, scaling to 100s of nodes and multiple PBs
    – Robust
    – Diverse use cases with strong market traction
  • … And We Have Differences
    – Interface
    – Tool chain / ecosystem
    – Storage management
    – Run-time optimization
    – Automatic performance tuning
  • Column Store – Column-Based Disk I/O
    – Typical FinServ query: price per stock for one day, e.g. SELECT AVG(price) FROM tickstore WHERE symbol = 'AAPL' AND date = '5/06/09'
    – Column store: reads only the 3 columns the query references (symbol, price, date)
    – Row store: reads all columns of every row
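A minimal Python sketch of the column-pruning idea above; it is not from the deck. The per-column lists stand in for separately stored column files, and the tickstore rows are invented for illustration:

    # Toy illustration: why a column store touches less data than a row store.
    # Each column is its own list; on disk each would be a separately read file.
    tickstore = {
        "symbol":   ["AAPL", "AAPL", "BBY", "BBY"],
        "price":    [143.74, 143.75, 37.03, 37.13],
        "date":     ["5/05/09", "5/06/09", "5/05/09", "5/06/09"],
        "exchange": ["NYSE", "NYSE", "NYSE", "NYSE"],   # never referenced below
    }

    def avg_price_column_store(symbol, date):
        """SELECT AVG(price) FROM tickstore WHERE symbol=? AND date=?
        Only the symbol, date and price columns are scanned."""
        rows = [i for i, (s, d) in enumerate(zip(tickstore["symbol"], tickstore["date"]))
                if s == symbol and d == date]
        prices = [tickstore["price"][i] for i in rows]
        return sum(prices) / len(prices) if prices else None

    print(avg_price_column_store("AAPL", "5/06/09"))   # 143.75

A row store would have to pull every column of every row off disk just to evaluate the same two-column predicate.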
  • Column Store – Sort and Encode for Speed
    – Example table of 16 students with columns Student_ID, Name, Gender, Class, Score, Grade (e.g. 1256678 / Cappiello, Emilia / F / Sophomore / 62 / D)
  • Column Store – Sort and Encode for Speed (cont.)
    – Same table with the columns reordered so the predicate columns (Gender, Class, Grade) come first
    – Columns used in predicates; correlated values "indexed" by preceding column values
  • Column Store – Sort and Encode for Speed (cont.)
    – Same table with the rows sorted on the predicate columns (Gender, then Class, then Grade)
    – Columns used in predicates; correlated values "indexed" by preceding column values
  • Column Store – Sort and Encode for Speed (cont.)
    – Example query: SELECT AVG(Score) FROM example WHERE Class = 'Junior' AND Gender = 'F' AND Grade = 'A'
    – With the data sorted this way, the query resolves the predicates with just four I/Os (1st through 4th), using offsets from each column to narrow the range read from the next rather than scanning whole columns
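A small sketch of why the sort order matters, assuming the rows are already sorted on a (Gender, Class, Grade) key; the bisect-based range lookup stands in for the offset-based I/O described above and is not Vertica's implementation:

    # Once rows are sorted on the predicate columns, the qualifying rows form
    # one contiguous range found with two binary searches instead of a scan.
    from bisect import bisect_left, bisect_right

    # (gender, class, grade) sort key per row, plus the Score column, in sorted order
    keys   = [("F", "Freshman", "A"), ("F", "Freshman", "A"), ("F", "Junior", "A"),
              ("F", "Junior", "A"), ("F", "Junior", "C"), ("F", "Junior", "D")]
    scores = [95, 96, 90, 100, 76, 63]

    def avg_score(gender, klass, grade):
        lo = bisect_left(keys, (gender, klass, grade))
        hi = bisect_right(keys, (gender, klass, grade))
        matching = scores[lo:hi]          # only this slice of Score is read
        return sum(matching) / len(matching) if matching else None

    print(avg_score("F", "Junior", "A"))  # 95.0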
  • Column Store – Column-Based Compression
    – Typical compression ratios by workload: Clickstream 10:1, Audit 10:1, Trading 5:1, SNMP 20:1, Network Logs 60:1, Marketing 20:1, Consumer 30:1, CDR 8:1 (chart compares encoded vs. raw data size)
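A toy run-length encoding example showing why sorted, low-cardinality columns compress so well; RLE is only one of the encodings a column store might apply, and the 16-value Gender column here is invented for illustration:

    # Sorted, repetitive columns collapse into a handful of (value, run_length)
    # pairs, which is where ratios like those above come from.
    from itertools import groupby

    def rle_encode(column):
        return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

    def rle_decode(pairs):
        return [value for value, count in pairs for _ in range(count)]

    gender = ["F"] * 10 + ["M"] * 6       # 16 values...
    encoded = rle_encode(gender)          # ...become 2 pairs: [('F', 10), ('M', 6)]
    assert rle_decode(encoded) == gender
    print(encoded)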
  • Query-Driven Data Segmentation and HA
    – Data is segmented across the nodes of the DB cluster (Segment 1 … Segment N) behind the client-facing network
    – RAID-like functionality within the DB: smart K-safety keeps redundant copies of segments on other nodes
    – Always-on loads & queries
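A generic sketch of hash segmentation with a buddy replica on the next node, just to illustrate the K-safety idea; the placement rule, node names and K value are assumptions, not Vertica's actual logic:

    # Hash a segmentation key to a primary node, then keep K extra copies on
    # the following nodes so the segment survives K node failures.
    import hashlib

    NODES = ["node1", "node2", "node3", "node4"]
    K = 1   # number of extra copies each segment keeps

    def placement(segmentation_key, nodes=NODES, k=K):
        h = int(hashlib.md5(str(segmentation_key).encode()).hexdigest(), 16)
        primary = h % len(nodes)
        return [nodes[(primary + i) % len(nodes)] for i in range(k + 1)]

    print(placement("AAPL"))   # two distinct nodes: loses neither copy to a single failure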
  • Automatic Performance Tuning
    – Optimal data layout (physical schema) yields optimal performance
    – User provides: the logical schema, a sample data set, and typical queries
    – Database Designer generates data layout proposals that optimize query performance, optimize data loading throughput, and minimize storage footprint
    – Workload Analyzer
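A toy illustration of the kind of decision an automatic physical-design tool makes; this is not the Database Designer algorithm, just a sketch that ranks columns by how often the sample queries filter on them and proposes that ranking as a leading sort order:

    import re
    from collections import Counter

    sample_queries = [
        "select avg(score) from students where class = 'Junior' and gender = 'F'",
        "select count(*) from students where gender = 'M'",
        "select name from students where class = 'Senior' and grade = 'A'",
    ]

    def proposed_sort_order(queries):
        # Count columns appearing in equality predicates of the sample workload.
        counts = Counter()
        for q in queries:
            parts = q.lower().split("where", 1)
            if len(parts) == 2:
                counts.update(re.findall(r"(\w+)\s*=", parts[1]))
        return [col for col, _ in counts.most_common()]

    print(proposed_sort_order(sample_queries))   # ['class', 'gender', 'grade']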
  • Database Designer Case Studies
    – Financial services (vs. manual design): queries 4x faster; 50% less storage; design cost of 4 minutes vs. months
    – Marketing & advertising: all queries fully optimized; storage 10% of raw data
    – Retail (vs. manual design): queries 2x faster; storage 33% less
    – News media (vs. manual design): queries comparable; storage 25% less
  • Application Integration (ecosystem diagram)
  • Analytics Feature Comparison
    – Vertica: SQL (graph analytics, Monte Carlo simulation, statistical functions); extended SQL (clickstream analytics, e.g. sessionization; time series analytics; pattern matching; event series join); extensible analytics
    – Hadoop: "everything", but especially HDFS for storing schema-less data, parsing & transforming semi-structured data, machine learning, and multi-language scripts and libraries
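A plain-Python sketch of the sessionization idea mentioned above; Vertica exposes this through its event-series SQL, so the 30-minute gap rule and the click tuples here are assumptions made only for illustration:

    # Assign a new session id whenever a user's gap between consecutive clicks
    # exceeds 30 minutes.
    SESSION_GAP = 30 * 60   # seconds

    def sessionize(clicks):
        """clicks: list of (user_id, epoch_seconds), e.g. from a parsed clickstream."""
        clicks = sorted(clicks)                 # order by user, then time
        sessions, last_user, last_ts, sid = [], None, None, 0
        for user, ts in clicks:
            if user != last_user or ts - last_ts > SESSION_GAP:
                sid += 1                        # start a new session
            sessions.append((user, ts, sid))
            last_user, last_ts = user, ts
        return sessions

    print(sessionize([("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 100)]))
    # [('u1', 0, 1), ('u1', 600, 1), ('u1', 4000, 2), ('u2', 100, 3)]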
  • Combining the Strengths
    – Hadoop for exploratory analysis, especially with existing MR and Pig scripts
    – Vertica for stylized, interactive analysis; for shared features, often faster than Hadoop with a fraction of the hardware resources
    – Vertica's Hadoop connector
  • Hadoop + Vertica Use Case Example (Extract–Transform–Load diagram)
    – Extract/Transform: data arrives in HDFS via Flume and other sources and is processed by Hadoop
    – Load: results move into HP Vertica through the Vertica Hadoop Connector or the SQOOP Connector
  • More Joint Use Cases
    – Parallel import/export to HDFS
    – MR for data transformation, Vertica for optimized storage & retrieval: Apache log parsing, converting JSON into relational tuples, sentiment analysis
    – Advanced analytics: filter, join and aggregation in Vertica, with the intermediate result fed into an MR job
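A sketch of the "convert JSON into relational tuples" pattern above, written in the style of a Hadoop Streaming mapper (stdin to stdout); the field names and the pipe-delimited output format are assumptions, chosen so that a downstream bulk loader could ingest the result:

    import json
    import sys

    COLUMNS = ["user_id", "url", "ts"]   # illustrative target schema

    def to_tuple(line):
        # Flatten one JSON record into a delimited row; missing fields become empty.
        record = json.loads(line)
        return "|".join(str(record.get(col, "")) for col in COLUMNS)

    if __name__ == "__main__":
        for line in sys.stdin:
            line = line.strip()
            if line:
                print(to_tuple(line))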
  • Vertica Extensible Analytics SDK
    – A framework for user-defined functions and transformations
    – C++ based extensible framework
    – Flexible: expresses a wide range of analytic computation
    – In-process, fully parallel execution
  • Vertica Community Edition
    – Join the community: http://www.vertica.com/community
    – Fully featured, 1 TB + 3 nodes (unlimited academic use)
    – Open-source analytic packages on GitHub
  • HP Hadoop Reference Architecture: End-to-End Scalable Information Management Solution
    – (1) HP ProLiant and HP Networking: scale-out ProLiant x86 hardware with large amounts of DAS storage to store and process data
    – (2) Hadoop core on the operating system (e.g. RedHat, SUSE, CentOS): Cloudera Distribution of Apache Hadoop as the execution engine and distributed file system (Map/Reduce and HDFS) for massively parallel processing tasks
    – (3) Cloudera Enterprise: enables users to create and run analytics jobs on unstructured data
    – (4) Systems management: HP CMU real-time monitoring and push-button scale-out
    – (5) Connectors to move subsets of data in and out of Hadoop (Cloudera Flume, Cloudera SQOOP, HP Vertica connector) plus analytics tooling for structured Big Data (HP Vertica, BI/Tableau, Datameer, Karmasphere); sources include RDBMS, SAP, logs, etc.
  • HP Hadoop Reference Architecture: Basic Concepts
    – Starter Kit (development/POC, non-production): 6 nodes (2 management, 4 worker) and 1 switch; optimized for low cost; configurations generally not fully redundant (single network/switch); same hardware as a production cluster
    – Modest-scale production (typically up to two racks): add a redundant network/switch; move management nodes (name node, secondary name node, job tracker) to separate racks; optimized for scale and resiliency; same hardware as the Starter Kit
    – Larger scale ("hundreds of nodes", beyond two racks): upgrade switches for better congestion management; add more management nodes; separate the name nodes, which become very busy and need lots of memory
  • Visualizing Cluster/Hadoop Performance: Basic Concepts
    – Key system statistics per node, e.g. Node 1: CPU 75%, 300 disk reads, 8 map tasks; … Node N: CPU 65%, 315 disk reads, 7 map tasks
    – Displayed as 2-dimensional gauges: CPU, disk reads, map/reduce tasks
    – Visualized as "tubes", where the z-axis is time
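A minimal sketch of the data behind those gauges and "tubes"; the random sampler stands in for real per-node monitoring agents, so the node names and value ranges are assumptions. Each tick adds one (CPU, disk reads, map tasks) sample per node, and the accumulated ticks form the time (z) axis:

    import random
    import time

    NODES = ["node1", "node2", "node3", "node4"]

    def sample_cluster():
        # In a real monitor these values would come from an agent on each node.
        return {n: {"cpu": random.randint(40, 90),
                    "disk_reads": random.randint(200, 400),
                    "map_tasks": random.randint(0, 8)} for n in NODES}

    history = []                 # one entry per tick: the z-axis of the "tube"
    for tick in range(3):
        history.append(sample_cluster())
        time.sleep(0.1)

    latest = history[-1]         # what the 2-D gauges would show right now
    print(latest["node1"])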
  • Visualizing Cluster/Hadoop: Normal Run, No Problems
    – 100 nodes running TeraSort on Hadoop
    – Data read: 7,823 tasks (1 per block), many short-lived tasks
    – Processing: 700 tasks (2 per core), long-lived and CPU intensive
    – Sort and shuffle: intermediate results moved between nodes
    – Write results (1 copy only)
  • Visualizing Cluster/Hadoop: With Network Problems
    – Same 100-node TeraSort run on Hadoop
    – A failing switch caused many network retries
    – Some tasks take a long time to finish, so speculative execution kicks in
    – The job stalls at 90% waiting for the remaining tasks
  • In Closing…
    – Solutions leveraging Vertica in conjunction with Hadoop can solve a tremendous range of analytical challenges
    – Hadoop is great for dealing with unstructured data, while Vertica is a superior platform for working with structured/relational data
    – Getting them to work together is easy
  • Conclusion
    – Join the community: http://www.vertica.com/community
    – Join the core team: http://www.vertica.com/about/careers/