
Elephant in the room: A DBA's Guide to Hadoop


Published in: Technology

  1. 1. The Elephant in the Room: A DBA’s Guide to Hadoop & Big Data
  2. 2. Purpose: Rosetta Stone presentation; high-level overview of Hadoop & Big Data; NOT a deep dive; NOT a demo session; mostly theory & vocabulary; where to learn more
  3. 3. About Me: Manage DBAs for a financial services company; former Data Architect, DBA, developer; Linchpin People teammate; AtlantaMDF Chapter Leader; Infrequent blogger:
  4. 4. About You. Assumed audience: ● mostly developers ● SQL experience ● exposure to database admin & architecture ● little to no experience with Big Data
  5. 5. “Big” Data
  6. 6. Big Data is like teenage sex... Everyone talks about it, Nobody really knows how to do it, Everyone thinks everyone else is doing it, So everyone claims they are doing it… -Dan Ariely
  7. 7. The Four V’s of Big Data: Volume - data is too big to scale up; Velocity - decision window is small; Variety - multiple formats challenge integration; Variability - same data, different interpretations
  8. 8. RDBMS versus Big Data. RDBMS: primarily scale-up; strong typing; normalization; mutable by default; mature. Big Data: primarily scale-out; schemaless; immutable by default; evolving
  9. 9. Big Data Use Cases. Massive Size: PB of info; data warehouse; large clusters; high cost. Complex Analytics: schemaless; investigational; single-node; low cost
  10. 10. Foundations “Gentlemen, this is a football…” - Vince Lombardi
  11. 11. Hadoop Ecosystem (Hortonworks) Hortonworks
  12. 12. Hadoop: scalable, distributed processing framework; open-source; Hortonworks*; Cloudera (proprietary components); Facebook; Yahoo
  13. 13. HDFS: Hadoop Distributed File System; inspired by the Google File System (2002-2003); cluster storage of large files across servers; Yahoo - 10,000 core Hadoop cluster(s); Facebook - 100 PB+ (June, 2012)
  14. 14. HDFS
  15. 15. HDFS: file permissions and authentication; rack aware; fsck: find missing files or blocks; scheduled rebalancing; redundancy & replication; built around MapReduce
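The "cluster storage", "rack aware" and "redundancy & replication" points on the HDFS slides can be pictured with a small Python sketch. It is only an illustration, not HDFS code: the 128 MB block size and replication factor of 3 are assumed as common defaults (both are configurable), the node and rack names are hypothetical, and the placement rule is a simplified version of the usual "one replica on one rack, two on another" policy.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # assumption: a common default; real clusters configure this


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) extents showing how a large file is cut into blocks."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(block_index, nodes_by_rack):
    """Choose three nodes for one block: one on a 'local' rack, two on another rack.

    Simplified, deterministic stand-in for rack-aware placement (hypothetical helper).
    """
    racks = sorted(nodes_by_rack)
    if len(racks) < 2:
        raise ValueError("this sketch needs at least two racks")
    local_rack = racks[block_index % len(racks)]
    remote_rack = racks[(block_index + 1) % len(racks)]
    local_nodes = nodes_by_rack[local_rack]
    remote_nodes = nodes_by_rack[remote_rack]
    if len(remote_nodes) < 2:
        raise ValueError("this sketch needs at least two nodes per rack")
    first = local_nodes[block_index % len(local_nodes)]
    second = remote_nodes[block_index % len(remote_nodes)]
    third = remote_nodes[(block_index + 1) % len(remote_nodes)]
    return [first, second, third]


if __name__ == "__main__":
    cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}  # hypothetical cluster
    for index, (offset, length) in enumerate(split_into_blocks(300 * MB)):
        print(index, offset, length, place_replicas(index, cluster))
```

Losing any single node, or even a whole rack, still leaves at least one copy of every block, which is the point of combining replication with rack awareness.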
  16. 16. MapReduce: “Developed” by Google; paper published in 2004 (patent filed in 2004, granted in 2010); Map - filtering and sorting; Reduce - summarization; inherently distributed
  17. 17. MapReduce
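The map, sort and reduce roles described on the MapReduce slides can be sketched in a few lines of single-process Python. This is a toy illustration of the pattern, not Hadoop's API: the station/temperature records and all function names are hypothetical, and a real job would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(record):
    """Map: filter out unusable records and emit (key, value) pairs."""
    station, temperature = record
    if temperature is not None:  # the filtering step
        yield station, temperature


def shuffle(pairs):
    """Shuffle: group values by key and hand the keys over in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: summarize all values seen for one key (here, the maximum)."""
    return key, max(values)


def map_reduce(records):
    pairs = [pair for record in records for pair in map_phase(record)]
    return [reduce_phase(key, values) for key, values in shuffle(pairs)]


if __name__ == "__main__":
    readings = [("ATL", 31), ("NYC", 25), ("ATL", None), ("NYC", 28), ("ATL", 35)]
    print(map_reduce(readings))
```

Because each map call looks at one record and each reduce call looks at one key, the work splits naturally across machines, which is what "inherently distributed" refers to.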
  18. 18. Hive: HiveQL - SQL-like syntax; DDL scripts define tables; queries are transformed into MapReduce jobs; performance improves as the cluster scales out; Stinger initiative - Microsoft/Hortonworks
  19. 19. Hive
  20. 20. Hive
      create external table price_data (
        stock_exchange string,
        symbol string,
        trade_date string,
        open float,
        high float,
        low float,
        close float,
        volume int,
        adj_close float
      )
      row format delimited fields terminated by ','
      stored as textfile
      location '/user/hue/nyse/nyse_prices';

      select * from price_data where symbol = 'IBM';
  21. 21. Hive
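The external table on the Hive slide does not load or convert anything; the column names and types are applied to the comma-delimited text files when a query reads them. A rough Python analogy, assuming two made-up sample rows (the values are hypothetical, only the column list comes from the slide's DDL):

```python
import csv
import io

# Column names and types copied from the price_data DDL on the slide.
SCHEMA = [
    ("stock_exchange", str), ("symbol", str), ("trade_date", str),
    ("open", float), ("high", float), ("low", float), ("close", float),
    ("volume", int), ("adj_close", float),
]

# Hypothetical sample data standing in for the files under the table's location.
SAMPLE = (
    "NYSE,IBM,2010-01-04,131.18,132.97,130.85,132.45,6155300,129.39\n"
    "NYSE,GE,2010-01-04,15.22,15.50,15.12,15.45,69763000,14.73\n"
)


def read_rows(text_file):
    """Apply SCHEMA to each delimited line as it is read (schema on read)."""
    for fields in csv.reader(text_file):
        if fields:  # skip blank lines
            yield {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}


def select_symbol(text_file, symbol):
    """Rough equivalent of: select * from price_data where symbol = '...'."""
    return [row for row in read_rows(text_file) if row["symbol"] == symbol]


if __name__ == "__main__":
    for row in select_symbol(io.StringIO(SAMPLE), "IBM"):
        print(row)
```

One difference worth keeping in mind: this sketch raises an error on a value that does not match its declared type, whereas Hive typically returns NULL for such a field.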
  22. 22. HCatalog: tight integration with Hive, but supports all Hadoop data access protocols; define relational view into data (DDL); “tables” can be reused by Hive, Pig, Storm...; Tutorial
  23. 23. Pig: data abstraction language; Yahoo (2006); based on Java; supports Python & Ruby; procedural (SQL is declarative); allows for ETL; lazy evaluation
  24. 24. Pig
  25. 25. Pig
  26. 26. Pig: ETL service; useful as “duct tape”. Typical scenario: load data into HDFS; use Pig to scrub the data and pump it to another “db” (e.g., MongoDB); a web service reads from the destination
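The "lazy evaluation" point on the first Pig slide means that a script's statements only describe a plan; nothing is read or processed until a statement such as store (or dump) asks for output. A minimal Python sketch of that idea, with a hypothetical LazyRelation class that is not part of Pig:

```python
class LazyRelation:
    """Records transformation steps; runs them only when store() is called."""

    def __init__(self, source, steps=()):
        self._source = source        # callable that loads the rows
        self._steps = tuple(steps)   # the plan built so far

    def foreach(self, fn):
        """Add a per-row transformation to the plan (nothing runs yet)."""
        return LazyRelation(self._source, self._steps + (("foreach", fn),))

    def filter(self, predicate):
        """Add a row filter to the plan (nothing runs yet)."""
        return LazyRelation(self._source, self._steps + (("filter", predicate),))

    def store(self):
        """Execute the whole plan and return the resulting rows."""
        rows = iter(self._source())
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "foreach" else filter(fn, rows)
        return list(rows)


if __name__ == "__main__":
    def load():
        print("loading...")  # printed only once store() runs
        return [" alpha ", "", " beta"]

    scrubbed = LazyRelation(load).foreach(str.strip).filter(bool)
    print("plan built, nothing loaded yet")
    print(scrubbed.store())
```

Deferring execution lets the engine see the whole pipeline before running it, which is what allows a multi-step scrub-and-export script to be planned as a small number of jobs.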
  27. 27. Hadoop Ecosystem (Hortonworks) Hortonworks
  28. 28. Hadoop = SQL Server: HDFS = Windows Cluster, Database; MapReduce = Query Optimizer; Master Web Interface = SQL Server Management Studio; Hive = SQL; HCatalog = Views; Pig = PowerShell, SSIS
  29. 29. Big Data Administration The possession of facts is knowledge, the use of them is wisdom. – Thomas Jefferson
  30. 30. Big Data Use Cases. Massive Size: PB of info; data warehouse; large clusters; high cost. Complex Analytics: schemaless; investigational; single-node; low cost
  34. 34. Scale-Up Costs (SQL Server). Single server: maximum RAM; SAN. Licenses: Windows; SQL Server; Microsoft Support. Personnel: developers; DBA; SAN admin; network admin. Facilities: minimum footprint
  35. 35. Scale-Out Costs (Hortonworks HDP). Multiple servers: commodity. Licenses: Windows ($$$); Linux ($); HDP Support. Personnel: developer; HDP admin; network admin. Facilities: power; space; air
  36. 36. Performance Tuning [figure: SYSTEM and CODE panels for RDBMS and for HADOOP] Performance Tuning Tips
  37. 37. Hadoop Ecosystem (Hortonworks) Hortonworks
  38. 38. Performance Architecture: Nathan Marz - Twitter, Storm; Lambda Architecture
  39. 39. Performance Architecture
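The Lambda Architecture named on the preceding slide splits the work three ways: a batch layer that recomputes views from the full, immutable data set; a speed layer that covers events the last batch run has not seen yet; and a serving step that merges the two at query time. A minimal Python sketch, assuming hypothetical page-view events (the event shape and all names are illustrative only):

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute page-view counts from the whole master data set."""
    return Counter(event["page"] for event in master_dataset)


class SpeedLayer:
    """Speed layer: incrementally counts events that arrived after the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def ingest(self, event):
        self.realtime_view[event["page"]] += 1


def query(page, batch, realtime):
    """Serving step: answer by merging the batch view with the real-time view."""
    return batch[page] + realtime[page]


if __name__ == "__main__":
    master = [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]
    batch = batch_view(master)          # slow, complete, rebuilt periodically
    speed = SpeedLayer()
    speed.ingest({"page": "/home"})     # fast, covers only recent events
    print(query("/home", batch, speed.realtime_view))
```

When the next batch run absorbs the recent events, the real-time view for that period can be discarded, so errors in the speed layer are temporary.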
  40. 40. Getting Started (Massive Size): 1. Lab environment (virtualized) 2. Set up OS (Windows or Linux) 3. Download (MSI or RPM) 4. Deploy prereqs (Python, Java, C++) 5. Set up master node(s) 6. Set up data node(s)
  41. 41. Windows Installation Tutorial
  42. 42. Big Data Use Cases. Massive Size: PB of info; data warehouse; large clusters; high cost. Complex Analytics: schemaless; investigational; single-node; low cost
  43. 43. Word Count. Problem: count the number of times a word appears in a specific record, e.g. “Lorem ipsum dolor sit amet, consectetur adipiscing elit.”...
  44. 44. Word Count. SQL Server: create UDF to parse strings. Hadoop: Pig script to parse strings
  45. 45. Word Count - SQL Server
      CREATE function WordRepeatedNumTimes
        (@SourceString varchar(max), @TargetWord varchar(8000))
      RETURNS int
      AS
      BEGIN
        DECLARE @NumTimesRepeated int
          ,@CurrentStringPosition int
          ,@LengthOfString int
          ,@PatternStartsAtPosition int
          ,@LengthOfTargetWord int
          ,@NewSourceString varchar(max)
  46. 46. Word Count - SQL Server
        SET @LengthOfTargetWord = len(@TargetWord)
        SET @LengthOfString = len(@SourceString)
        SET @NumTimesRepeated = 0
        SET @CurrentStringPosition = 0
        SET @PatternStartsAtPosition = 0
        SET @NewSourceString = @SourceString
        WHILE len(@NewSourceString) >= @LengthOfTargetWord
        BEGIN
          SET @PatternStartsAtPosition = CHARINDEX(@TargetWord, @NewSourceString)
          IF @PatternStartsAtPosition <> 0
          BEGIN
  47. 47. Word Count - SQL Server
            SET @NumTimesRepeated = @NumTimesRepeated + 1
            SET @CurrentStringPosition = @CurrentStringPosition + @PatternStartsAtPosition + @LengthOfTargetWord
            SET @NewSourceString = substring(@NewSourceString, @PatternStartsAtPosition + @LengthOfTargetWord, @LengthOfString)
          END
          ELSE
          BEGIN
            SET @NewSourceString = ''
          END
        END
        RETURN @NumTimesRepeated
      END
  48. 48. Word Count (Hadoop)
      a = load '/user/hue/word_count_text.txt';
      b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
      c = group b by word;
      d = foreach c generate COUNT(b), group;
      store d into '/user/hue/pig_wordcount';
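For readers who do not know Pig Latin, the relations a through d in the script above can be mirrored step by step in Python. This is a sketch for reading along, not a replacement for the script: the delimiter set is an approximation of what TOKENIZE splits on by default (treated here as an assumption), and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: approximates TOKENIZE's default delimiters
# (whitespace, double quote, comma, parentheses, star).
DELIMITERS = r'[\s",()*]+'


def word_count(lines):
    """Mirror the Pig relations: a = lines, b = words, c = groups, d = counts."""
    # b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
    words = [word for line in lines for word in re.split(DELIMITERS, line) if word]

    # c = group b by word;
    groups = defaultdict(list)
    for word in words:
        groups[word].append(word)

    # d = foreach c generate COUNT(b), group;
    return [(len(bag), word) for word, bag in groups.items()]


if __name__ == "__main__":
    text = ["Lorem ipsum dolor sit amet, lorem ipsum"]
    for count, word in word_count(text):
        print(count, word)
```

Note that grouping is case sensitive, so "Lorem" and "lorem" are counted separately. The script also answers a broader question than the SQL Server UDF on the earlier slides: the UDF counts occurrences of one target word, while the Pig script counts every word in the file.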
  49. 49. Getting Started (Complex Analysis). Either: 1. Lab environment (virtualized) 2. Install Hortonworks Sandbox. Or: 1. Set up Azure account 2. HDInsight
  50. 50. Theoretically, this can scale to PB, but I have no idea what that will cost you. Note that the interface highlights Hive (with Stinger); Pig commands are run through PowerShell
  51. 51. In Conclusion. Lots of vocabulary: HDFS, Pig, Hive, MapReduce; map it to SQL Server (RDBMS) vocabulary. Different use cases: massive data; complex analysis
  52. 52. Questions & Feedback
  53. 53. Contact Me Stuart R. Ainsworth Twitter: @codegumbo Email: SpeakerRate:
  54. 54. Big Data - Dangerous