Hybrid MySQL/Hadoop Datawarehouse
Presentation Transcript

  • MySQL/Hadoop Hybrid Datawarehouse: Who are Palomino? (Percona Live NYC 2012)
    › Bespoke Services: we work with and like you.
    › Production Experienced: senior DBAs, admins, engineers.
    › 24x7: globally-distributed on-call staff.
    › One-Month Contracts: not more.
    › Professional Services: ETLs, cluster tooling.
    › Configuration Management (DevOps): Chef, Puppet, Ansible.
    › Big Data Cluster Administration (OpsDev): MySQL, PostgreSQL, Cassandra, HBase, MongoDB, Couchbase.
  • Who am I?
    Tim Ellis, CTO/Principal Architect, Palomino
    Achievements:
    › Palomino Big Data Strategy.
    › Datawarehouse Cluster at Riot Games.
    › Designed/built back-end for Firefox Sync.
    › Led DB team at Digg.com.
    › Harassed the Reddit team at a party.
    Ensured successful business for:
    › Digg, Friendster, Mozilla, StumbleUpon, Riot Games (League of Legends).
  • What Is This Talk? Experiences of a High-Volume DBA
    I've built high-volume Datawarehouses, but am not well-versed in traditional
    Datawarehouse theory. Cube? Snowflake? Star? I'll win a bar bet, but would be
    fired from Oracle.
    I've administered high-volume Datawarehouses and managed a large ETL rollout,
    but haven't written extensive ETLs or reports.
    A high-volume Datawarehouse is by necessity a different design than a
    low-volume Datawarehouse: typically simpler schemas, more complex queries.
  • Why OSS? Freedom at Scale == Economic Sense
    Selling OSS to Management used to be hard...
    › My query tools are limited.
    › The business users know DBMSx.
    › The documentation is lacking.
    ...but then terascale happened one day.
    › Adding 20TB costs HOW MUCH?!
    › Adding 30 machines costs HOW MUCH?!
    › How many sales calls before I push the release?
    › I'll hire an entire team and still be more efficient.
  • How to begin? Take stock of the current system
    Establish a data flow:
    › Who's sending me data? How much?
    › What are the bottlenecks?
    › What's the current ETL process?
    We're looking for typical data flow characteristics:
    › Log data, write-mostly, free-form.
    › Looks tabular, “select * from table.”
    › Size: MB, GB or TB per hour?
    › Who queries this data? How often?
  • What is Hadoop? The Hadoop Ecosystem
    Hadoop Components:
    › HDFS: A filesystem across the whole cluster.
    › Hadoop: A map/reduce implementation.
    › Hive: SQL→Map/Reduce converter.
    › HBase: A column store (and more).
    Most-interesting bits:
    › Hive lets business users formulate SQL!
    › HBase provides a distributed column store!
    › HDFS provides massive I/O and redundancy.
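    To make the Hive point concrete, here is a minimal sketch of the kind of query a
    business user might run from the Hive CLI; the table and column names (pageviews,
    dt) are invented for illustration, not from the talk.

        # Hive compiles this SQL into Map/Reduce jobs behind the scenes.
        # Table and column names are hypothetical placeholders.
        hive -e "
          SELECT dt, COUNT(*) AS views
          FROM   pageviews
          WHERE  dt >= '2012-07-01'
          GROUP  BY dt;"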
  • Should You Use Hadoop? Hadoop Strengths and Weaknesses
    Hadoop/HBase is good for:
    › Scanning large chunks of your data every time.
    › Applying a lot of cluster resource to a query.
    › Very large datasets, multiple tera/petabytes.
    › With HBase, a column store engine.
    Where Hadoop/HBase falls short:
    › Query iteration is typically minutes.
    › Administration is new and unusual.
    › Hadoop is still immature (some say “beta”).
    › Documentation is bad or non-existent.
  • Should You Use MySQL? MySQL Strengths and Weaknesses
    MySQL is good for:
    › Smaller datasets, typically gigabytes.
    › Indexing data automatically and quickly.
    › Short query iteration, even milliseconds.
    › Quick dataloads and processing with MyISAM.
    Where MySQL falls short:
    › Has no column store engine.
    › Documentation for datawarehousing is minimal.
    › You probably know better than I. Trust the DBA.
    › Be honest with management. If Vertica is better...
  • MySQL/Hadoop Hybrid: Common Weaknesses
    So if you combine the weaknesses of these two technologies... what have you got?
    › No built-in end-user-friendly query tools.
    › Immature technology – can crash sometimes.
    › Not too much documentation.
    You'll need buy-in, savvy, and resilience from:
    › ETL/Datawarehouse developers,
    › Business Users,
    › Systems Administrators,
    › Management.
  • Building a Hadoop Cluster: The NameNode
    Typical reasons clusters fail:
    › Cascading failure (distributed fail)
    › Network outage (distributed fail)
    › Bad query executed (distributed fail)
    › NameNode dies? (single point of failure)
    The NameNode failing is not a common failure case. Still, it's good to plan for it:
    › All critical filesystems on RAID 1+0
    › Redundant PSU
    › Redundant NICs to independent routers
  • Building a Hadoop Cluster: Basic Cluster Node Configuration
    So much for the specialised hardware. All non-NameNode nodes in your cluster:
    › RAID-0 or even JBOD.
    › More spindles: linux-1u.net has 8 HDD in 1U.
    › 7200rpm SATA nice, 15Krpm overkill.
    › Multiple TB of storage. ← lots of this!!!
    › 8-24GB RAM.
    › Good/fast network cards!
    A DBA thinks “Database” == RAM. Likewise, “Hadoop Node” == disk spindles, disk
    storage, and network. You lose 2-3x storage to data replication.
  • Building a Hadoop Cluster: Network and Rack Layout
    Network within a rack (top-of-rack switching):
    › Bandwidth for 30 machines going full-tilt.
    › Multiple TOR switches for redundancy.
    › Consider bridging.
    Network between racks (datacentre switching):
    › Inter-rack switches: better than 2Gbit desirable.
    › Hadoop rack awareness reduces inter-rack traffic.
    Need sharp Networking employees on board to help build the cluster. Network
    instability can cause crashes.
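    Rack awareness is typically wired up with a small topology script that maps a
    node's IP or hostname to a rack path; a minimal sketch, assuming the classic
    Hadoop 1.x property name (topology.script.file.name in core-site.xml) and
    made-up subnet-to-rack mappings:

        #!/bin/sh
        # Hypothetical rack-mapping script. Hadoop passes one or more
        # IPs/hostnames as arguments and expects one rack path per argument.
        for node in "$@"; do
          case "$node" in
            10.0.1.*) echo "/dc1/rack1" ;;
            10.0.2.*) echo "/dc1/rack2" ;;
            *)        echo "/default-rack" ;;
          esac
        done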
  • Building a Hadoop Cluster: Monitoring (Trending and Alerting)
    Pick your graphing solution, and put stats into it. In doubt about which stats
    to graph? Try all of them.
    › Every Hadoop stat exposed via JMX.
    › Every HBase stat exposed via JMX.
    › All disk, CPU, RAM, network stats.
    A possible solution:
    › Use collectd's JMX plugin to collect stats.
    › Put stats into Graphite.
    › Or Ganglia if you know how.
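    If Graphite is the destination, its plaintext protocol makes it easy to test the
    pipeline end to end from a shell before wiring up collectd; the hostname and
    metric path below are assumptions for illustration:

        # Push one datapoint into Graphite's plaintext listener (default port 2003).
        # Metric path and Graphite hostname are hypothetical.
        echo "hadoop.namenode.jvm.heap_used_bytes 1234567 $(date +%s)" \
          | nc graphite.example.com 2003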
  • Building a Hadoop Cluster: Palomino Cluster Tool
    Use Configuration Management to build your cluster:
    › Ansible – easiest and quickest.
    › Opscode Chef – most popular, must love Ruby.
    › Puppet – most mature.
    The Palomino Cluster Tool (open source on GitHub) uses the above tools to build
    a cluster for you:
    › Pre-written Configuration Management scripts.
    › Sets up HDFS, Hadoop, HBase, Monitoring.
    › In the future, will also set up alerting and backups.
    › Also sets up MySQL+MHA, which may be relevant.
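    For the Ansible route, a run looks roughly like the sketch below; this is not the
    Palomino Cluster Tool's actual entry point, and the playbook and inventory names
    are invented for illustration:

        # Hypothetical Ansible invocation: apply a Hadoop playbook to the
        # datanode group defined in an inventory file.
        ansible-playbook -i production.ini hadoop-cluster.yml --limit datanodes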
  • Running the Hadoop Cluster: Typical Problems
    Hadoop Clusters are Distributed Systems.
    › Network stressed? Reduce-heavy workload.
    › CPUs stressed? Map-heavy workload.
    › Disks stressed? Map-heavy workload.
    › RAM stressed? This is a DBMS after all!
    Watch your storage subsystems.
    › 120TB is a lot of disk space. Until you put in 120TB of data.
    › 400 spindles is a lot of IOPS. Until you query everything. Ten times.
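    A quick way to keep an eye on HDFS capacity and per-node disk from a shell; the
    dfsadmin form matches the Hadoop 1.x CLI of that era (newer releases use
    `hdfs dfsadmin`), and the local data paths are assumptions:

        # Cluster-wide HDFS capacity, per-datanode usage, and dead nodes.
        hadoop dfsadmin -report | head -40

        # Raw disk usage on a datanode's data directories (whatever dfs.data.dir
        # points at on your nodes; /data/* is a placeholder).
        df -h /data/*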
  • Running the Hadoop Cluster: Administration by Scientific Method
    What did we just learn? Hadoop Clusters are Distributed Systems!
    › Instability on system X? Could be Y's fault.
    › Temporal correlation of ERRORs across nodes.
    › Correlation of WARNINGs and ERRORs.
    › Do log events correlate to graph anomalies?
    The Procedure:
    1. Problems occurring on the cluster?
    2. Formulate hypothesis from input (graphs/logs).
    3. Test hypothesis (tweak configurations).
    4. Go to 1. You're graphing EVERYTHING, right?
  • Running the Hadoop Cluster: Graphing your Logs
    You need to graph everything. How about graphing your logs?
    › grep ERROR | cut <date/hour part> | uniq -c
      2012-07-29 06 15692
      2012-07-29 07 30432
      2012-07-29 08 76943
      2012-07-29 09 54955
      2012-07-29 10 15652
    That's close, but what if that's hundreds of lines? You can put the data into
    LibreOffice Calc, but that slows down the iteration cycle.
  • Running the Hadoop Cluster: Graphing your Logs
    Graphing logs (terminal output) is easier with Palomino's terminal tool
    “distribution,” OSS on Github:
    › grep ERROR | cut <date/hour part> | distribution
      2012-07-29 06|15692 ++++++++++
      2012-07-29 07|30432 +++++++++++++++++++
      2012-07-29 08|76943 ++++++++++++++++++++++++++++++++++++++++++++++++
      2012-07-29 09|54955 ++++++++++++++++++++++++++++++++++
      2012-07-29 10|15652 ++++++++++
    On a quick iteration cycle in the terminal, this is very useful. For presentation
    to the suits later you can import the data into another, prettier tool.
  • Running the Hadoop Cluster: Graphing your Logs
    A real-life (MySQL) example, run against an error log about 2.5GB in size. The
    cut keeps just the date/hour portion; distribution sorts by key frequency by
    default, but we'll want date/hour ordering, hence the final sort:
    root@db49:/var/log/mysql# grep -i error error.log | cut -c 1-9 | distribution | sort -n
      Val       |Ct (Pct)    Histogram
      120601 12|60 (46.15%)  █████████████████████████████████████████████████████████▏
      120601 17|10 (7.69%)   █████████▋
      120601 14|4 (3.08%)    ███▉
      120602 14|2 (1.54%)    ██
      120602 21|4 (3.08%)    ███▉
      120610 13|2 (1.54%)    ██
      120610 14|4 (3.08%)    ███▉
      120611 14|2 (1.54%)    ██
      120612 14|2 (1.54%)    ██
      120613 14|2 (1.54%)    ██
      120616 13|2 (1.54%)    ██
      120630 14|5 (3.85%)    ████▉
    Obvious: noon on June 1st was ugly. But also: what keeps happening at 2pm?
  • Building the MySQL Datawarehouse: Hardware Spec and Layout
    This is a typical OLAP role.
    › Fast non-transactional engine: MyISAM.
    › Data typically time-related: partition by date.
    › Data write-only or read-all? Archive engine.
    › Index-everything schemas.
    Typically beefier hardware is better.
    › Many spindles, many CPUs, much RAM.
    › Reasonably-fast network cards.
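    A minimal sketch of what "MyISAM, partitioned by date, index everything" can look
    like in practice; the database, table, and column names are invented for
    illustration:

        # Hypothetical fact table: MyISAM, RANGE-partitioned by day, indexed for scans.
        # "warehouse" and "fact_events" are placeholder names.
        mysql warehouse <<'SQL'
        CREATE TABLE fact_events (
          event_date DATE NOT NULL,
          user_id    INT UNSIGNED NOT NULL,
          event_type VARCHAR(32) NOT NULL,
          value      BIGINT NOT NULL,
          KEY (event_date), KEY (user_id), KEY (event_type)
        ) ENGINE=MyISAM
        PARTITION BY RANGE (TO_DAYS(event_date)) (
          PARTITION p20121001 VALUES LESS THAN (TO_DAYS('2012-10-02')),
          PARTITION p20121002 VALUES LESS THAN (TO_DAYS('2012-10-03')),
          PARTITION pmax      VALUES LESS THAN MAXVALUE
        );
        SQL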
  • ETL Framework: Getting Data into Hadoop
    Hadoop HDFS at its core is simply a filesystem.
    › Copy straight in: “cat file | hdfs put <filename>”
    › From the network: “scp file | hdfs put <filename>”
    › Streaming: (Logs?)→Flume→HDFS.
    › Table loads: Sqoop (“select * into <hdfsFile>”).
    HBase is not as simple, but can be worth it.
    › Flume→HBase.
    › HBase column family == columnar scans.
    › Beware: no secondary indexes.
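    The slide's commands are shorthand; in real syntax the same moves look roughly
    like the sketch below. The HDFS paths, JDBC URL, credentials, and table name are
    placeholders, not from the talk.

        # Copy a local file into HDFS (/warehouse/raw is a hypothetical path).
        hadoop fs -put access.log /warehouse/raw/access.log

        # Pull a whole MySQL table into HDFS with Sqoop; connection details
        # are placeholders for your own environment.
        sqoop import \
          --connect jdbc:mysql://mysql.example.com/warehouse \
          --username etl -P \
          --table fact_events \
          --target-dir /warehouse/sqoop/fact_events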
  • ETL Framework: Notice when something is wrong
    Don't skimp on ETL alerting! Start with the obvious:
    › Yesterday TableX delta == 150k rows. Today 5k.
    › Yesterday data loads were 120GB. Today 15GB.
    › Yesterday “grep -ci error” == 1k. Today 20k.
    › Yesterday “wc -l etllogs” == 700k. Today 10k.
    › Yesterday ETL process == 8hrs. Today 1hr.
    If you have time, get a bit more sophisticated:
    › Yesterday TableX.ColY was int. Today varchar.
    › Yesterday TableX.ColY compressed at 8x, today it compresses at 2x (or 32x?).
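    A crude sketch of the "yesterday vs. today" row-count check; the table name,
    threshold, and alert address are all assumptions:

        #!/bin/sh
        # Hypothetical ETL sanity check: alert if today's row count for a table
        # dropped to less than a third of yesterday's.
        TODAY=$(mysql -N -e "SELECT COUNT(*) FROM warehouse.fact_events
                             WHERE event_date = CURDATE()")
        YESTERDAY=$(mysql -N -e "SELECT COUNT(*) FROM warehouse.fact_events
                                 WHERE event_date = CURDATE() - INTERVAL 1 DAY")
        if [ "$TODAY" -lt $((YESTERDAY / 3)) ]; then
          echo "fact_events delta dropped: yesterday=$YESTERDAY today=$TODAY" \
            | mail -s "ETL volume alert" dba-team@example.com
        fi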
  • Getting Data Out: Hadoop Reporting Tools
    The oldschool method of retrieving data:
    › select f(col) from table where ... group by ...
    The NoSQL method of retrieving data:
    › select f(col) from table where ... group by ...
    Hadoop includes Hive (SQL→Map/Reduce converter). In my experience, dedicated
    business users can learn to use Hive with little extra training.
    But there is extra training!
  • Getting Data Out: Hadoop Reporting Tools
    It's best if your business users have analytical mindsets, technical backgrounds,
    and no fear of the command line. Hadoop reporting:
    › Tools that submit SQL and receive tabular data.
    › Tableau has a Hadoop connector.
    Most of Hadoop's power is in Map/Reduce:
    › Hive == SQL→Map/Reduce.
    › RHadoop == R→Map/Reduce.
    › HadoopStreaming == Anything→Map/Reduce.
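    Hadoop Streaming is the "anything" escape hatch: any executable that reads stdin
    and writes stdout can serve as mapper or reducer. A minimal sketch along the
    lines of the stock streaming example; the jar location and HDFS paths vary by
    installation and are placeholders here (real jobs usually ship a mapper script
    with -file instead of /bin/cat):

        # Hypothetical streaming job: trivially count lines/words in the input.
        hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar \
          -input  /warehouse/raw/app-logs \
          -output /warehouse/out/line-count \
          -mapper  /bin/cat \
          -reducer /usr/bin/wc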
  • The Hybrid Datawarehouse: Putting it All Together
    The Way I've Always Done It:
    1. Identify a data flow overloading the current DW.
       › Typical == raw data into DW then summarised.
    2. New parallel ETL into Hadoop.
    3. Build ETLs Hadoop→current DW.
       › Typical == equivalent summaries from #1.
       › Once that works, shut off the old data flow.
    4. Give everyone access to Hadoop.
       › They will think of cool new uses for the data.
    5. Work through The Pain of #4.
       › It doesn't come free, but is worth the price.
    6. Go to #1.
  • The Hybrid Datawarehouse: Q&A
    Questions? Some suggestions:
    › What is the average airspeed of a laden sparrow?
    › How can I hire you?
    › No really, I have money, you have skills. Let's make this happen.
    › Where's the coffee? I never thought I could be so sleepy.
    Thank you! Email me if you desire.
    domain: palominodb.com – username: time
    Percona Live NYC 2012