Elephant in the room: A DBA's Guide to Hadoop
Published in: Technology

Transcript

  • 1. The Elephant in the Room: A DBA’s Guide to Hadoop & Big Data
  • 2. Purpose: Rosetta Stone presentation; high-level overview of Hadoop & Big Data; NOT a deep dive; NOT a demo session; mostly theory & vocabulary; where to learn more
  • 3. About Me: Manage DBAs for a financial services company; former Data Architect, DBA, developer; Linchpin People teammate; AtlantaMDF Chapter Leader; infrequent blogger: http://codegumbo.com
  • 4. About You: Assume that you are mostly developers, with SQL experience, exposure to database admin & architecture, and little to no experience with Big Data
  • 5. “Big” Data
  • 6. Big Data is like teenage sex... Everyone talks about it, Nobody really knows how to do it, Everyone thinks everyone else is doing it, So everyone claims they are doing it… -Dan Ariely
  • 7. The Four V’s of Big Data: Volume - data is too big to scale out; Velocity - decision window is small; Variety - multiple formats challenge integration; Variability - same data, different interpretations. http://goo.gl/6icouZ
  • 8. RDBMS versus Big Data. RDBMS: primarily scale-up; strong typing; normalization; default mutable; mature. Big Data: primarily scale-out; schemaless; default immutable; evolving.
  • 9. Big Data Use Cases. Massive Size: PB of info; data warehouse; large clusters; high cost. Complex Analytics: schemaless; investigational; single-node; low cost.
  • 10. Foundations “Gentlemen, this is a football…” - Vince Lombardi
  • 11. Hadoop Ecosystem (Hortonworks)
  • 12. Hadoop: Scalable, distributed processing framework; open-source; Hortonworks*; Cloudera; proprietary components; Facebook; Yahoo
  • 13. HDFS: Hadoop Distributed File System; inspired by Google File System (2002-2003); cluster storage of large files across servers; Yahoo - 10,000 core Hadoop cluster(s); Facebook - 100 PB+ (June, 2012) http://goo.gl/SpSN
  • 14. HDFS
  • 15. HDFS: File permissions and authentication; rack aware; fsck: find missing files or blocks; scheduled rebalancing; redundancy & replication; built around MapReduce
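Slides 13-15 describe HDFS as storing large files across servers with replication and rack awareness. The following is a minimal sketch of that idea, not HDFS's actual placement policy. The 128 MB block size and the three replicas per block are assumptions (they are not stated on the slides), and the rack and node names are hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes
REPLICAS = 3                     # assumed replication factor

# Hypothetical cluster layout: rack name -> data nodes on that rack.
RACKS = {
    "rack1": ["node1", "node2"],
    "rack2": ["node3", "node4"],
}


def split_into_blocks(file_size):
    """Return the number of blocks needed to store file_size bytes."""
    return max(1, math.ceil(file_size / BLOCK_SIZE))


def place_replicas(block_index):
    """Pick REPLICAS distinct nodes for one block, spanning two racks.

    Simplified rule: the first replica goes to a "local" rack and the
    remaining replicas go to a different rack, so losing a single rack
    never loses every copy of the block.
    """
    rack_names = sorted(RACKS)
    local = rack_names[block_index % len(rack_names)]
    remote = rack_names[(block_index + 1) % len(rack_names)]
    first = RACKS[local][block_index % len(RACKS[local])]
    return [first] + RACKS[remote][:REPLICAS - 1]


def plan(file_size):
    """Map each block number to the nodes holding its replicas."""
    return {i: place_replicas(i) for i in range(split_into_blocks(file_size))}


if __name__ == "__main__":
    for block, nodes in plan(300 * 1024 * 1024).items():
        print(block, nodes)
```

A 300 MB file needs three blocks under these assumptions, and each block ends up on nodes from both racks.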
  • 16. MapReduce: “Developed” by Google; paper published in 2004 (patent filed in 2004, granted in 2010); Map - filtering and sorting; Reduce - summarization; inherently distributed
  • 17. MapReduce
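Slide 16 summarizes Map as filtering and sorting and Reduce as summarization. A small in-memory sketch of those phases may help; it runs on one machine and only imitates what Hadoop distributes across a cluster. The sample records and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input records: (symbol, volume) pairs.
RECORDS = [("IBM", 100), ("MSFT", 50), ("IBM", 25), ("", 10), ("MSFT", 5)]


def map_phase(records):
    """Map: filter out bad records and emit (key, value) pairs."""
    for symbol, volume in records:
        if symbol:  # filtering step: skip records with no symbol
            yield symbol, volume


def shuffle(pairs):
    """Group values by key and sort by key (the framework does this in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: summarize each key's values into a single result."""
    return {key: sum(values) for key, values in grouped}


def run(records):
    """Chain the three phases together."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    print(run(RECORDS))
```

Because each key is reduced independently, the work can be split across nodes, which is what makes the model inherently distributed.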
  • 18. Hive: HiveQL - SQL-like syntax; DDL scripts define tables; query transformed into MapReduce jobs; performance improves as the cluster scales out; Stinger initiative - Microsoft/Hortonworks
  • 19. Hive
  • 20. Hive
        create external table price_data (
          stock_exchange string, symbol string, trade_date string,
          open float, high float, low float, close float,
          volume int, adj_close float)
        row format delimited fields terminated by ','
        stored as textfile
        location '/user/hue/nyse/nyse_prices';
        select * from price_data where symbol = 'IBM';
  • 21. Hive
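The Hive slides define a table over a comma-delimited text file and then filter it by symbol. Conceptually, the table definition is applied to each line when the data is read. The sketch below imitates that on a single machine, assuming the column order from the `price_data` DDL on slide 20; the sample rows are hypothetical, and `csv.reader` is only a stand-in for Hive's delimited-row parsing.

```python
import csv
import io

# Column names and types taken from the price_data DDL on slide 20.
SCHEMA = [
    ("stock_exchange", str), ("symbol", str), ("trade_date", str),
    ("open", float), ("high", float), ("low", float), ("close", float),
    ("volume", int), ("adj_close", float),
]

# Hypothetical sample rows; the real file would live in HDFS.
SAMPLE = (
    "NYSE,IBM,2010-02-08,123.15,123.22,121.74,121.88,5718500,121.88\n"
    "NYSE,GE,2010-02-08,15.9,16.1,15.7,15.8,1000,15.8\n"
)


def read_rows(text):
    """Apply the schema to each comma-delimited line as it is read."""
    for fields in csv.reader(io.StringIO(text)):
        yield {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}


def select_symbol(rows, symbol):
    """Rough equivalent of: select * from price_data where symbol = '<symbol>'."""
    return [row for row in rows if row["symbol"] == symbol]


if __name__ == "__main__":
    for row in select_symbol(read_rows(SAMPLE), "IBM"):
        print(row)
```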
  • 22. HCatalog: Tight integration with Hive, but supports all Hadoop data access protocols; define relational view into data (DDL); “tables” can be reused by Hive, Pig, Storm...; Tutorial
  • 23. Pig: Data abstraction language; Yahoo (2006); based on Java; supports Python & Ruby; procedural (SQL is declarative); allows for ETL; lazy evaluation
  • 24. Pig
  • 25. Pig
  • 26. Pig: ETL service; useful as “duct tape”. Typical scenario: load data into HDFS; use Pig to scrub data, and pump to another “db” (e.g., MongoDB); web service reads from destination
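The Pig slides describe a procedural, lazily evaluated ETL flow: load, scrub, then pump to another store. Python generators give a rough feel for that style, since no step does any work until the final consumer pulls records through. This is an analogy only, not Pig itself; the function names are hypothetical and a plain list stands in for the destination database.

```python
def load(lines):
    """Load step: yield raw records one at a time (nothing runs until consumed)."""
    for line in lines:
        yield line


def scrub(records):
    """Scrub step: trim whitespace, lowercase, and drop empty records."""
    for record in records:
        cleaned = record.strip().lower()
        if cleaned:
            yield cleaned


def pump(records, destination):
    """Pump step: write cleaned records to the destination store."""
    for record in records:
        destination.append(record)
    return destination


if __name__ == "__main__":
    raw = ["  IBM ", "", "msft\n", "   "]
    pipeline = scrub(load(raw))  # lazy: no record has been processed yet
    print(pump(pipeline, []))    # consuming the pipeline triggers the work
```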
  • 27. Hadoop Ecosystem (Hortonworks)
  • 28. Hadoop → SQL Server: HDFS → Windows Cluster, Database; MapReduce → Query Optimizer; Master Web Interface → SQL Server Management Studio; Hive → SQL; HCatalog → Views; Pig → Powershell, SSIS
  • 29. Big Data Administration The possession of facts is knowledge, the use of them is wisdom. – Thomas Jefferson
  • 30. Big Data Use Cases. Massive Size: PB of info; data warehouse; large clusters; high cost. Complex Analytics: schemaless; investigational; single-node; low cost.
  • 31. Chart of performance against application growth: RDBMS
  • 32. Chart of performance against application growth: Big Data
  • 33. Chart of performance against application growth
  • 34. Scale-Up Costs (SQL Server). Single Server: maximum RAM, SAN. Licenses: Windows, SQL Server. Microsoft Support. Personnel: Developers, DBA, SAN Admin, Network Admin. Facilities: minimum footprint.
  • 35. Scale-Out Costs (Hortonworks HDP). Multiple Servers: commodity. Licenses: Windows ($$$), Linux ($). HDP Support. Personnel: Developer, HDP Admin, Network Admin. Facilities: power, space, air.
  • 36. Performance Tuning: system and code (RDBMS); system and code (Hadoop); Performance Tuning Tips
  • 37. Hadoop Ecosystem (Hortonworks)
  • 38. Performance Architecture: Nathan Marz (Twitter, Storm); Lambda Architecture
  • 39. Performance Architecture
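Slides 38-39 name the Lambda Architecture without spelling it out. As commonly described, a batch layer recomputes views from the full event log, a speed layer covers events that arrived since the last batch run, and queries merge the two. The sketch below illustrates that split under those assumptions; the function names and sample events are hypothetical and are not taken from the slides.

```python
from collections import Counter


def batch_view(master_events):
    """Batch layer: recompute counts from the full, immutable event log."""
    return Counter(master_events)


def speed_view(recent_events):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(recent_events)


def query(batch, speed, key):
    """Serving step: merge both views at query time."""
    return batch[key] + speed[key]


if __name__ == "__main__":
    batch = batch_view(["IBM", "MSFT", "IBM"])  # processed by the last batch job
    speed = speed_view(["IBM"])                 # arrived since that job ran
    print(query(batch, speed, "IBM"))
```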
  • 40. Getting Started (Massive Size): 1. Lab Environment (Virtualized); 2. Set up OS (Windows or Linux); 3. Download (MSI or RPM); 4. Deploy Prereqs (Python, Java, C++); 5. Set up Master Node(s); 6. Set up Data Node(s)
  • 41. Windows Installation Tutorial
  • 42. Big Data Use Cases. Massive Size: PB of info; data warehouse; large clusters; high cost. Complex Analytics: schemaless; investigational; single-node; low cost.
  • 43. Word Count. Problem: count the number of times a word appears in a specific record, e.g. “Lorem ipsum dolor sit amet, consectetur adipiscing elit.”...
  • 44. Word Count. SQL Server: create UDF to parse strings. Hadoop: Pig script to parse strings.
  • 45. Word Count - SQL Server
        CREATE function WordRepeatedNumTimes
          (@SourceString varchar(max), @TargetWord varchar(8000))
        RETURNS int
        AS
        BEGIN
          DECLARE @NumTimesRepeated int
                 ,@CurrentStringPosition int
                 ,@LengthOfString int
                 ,@PatternStartsAtPosition int
                 ,@LengthOfTargetWord int
                 ,@NewSourceString varchar(max)
  • 46. Word Count - SQL Server
          SET @LengthOfTargetWord = len(@TargetWord)
          SET @LengthOfString = len(@SourceString)
          SET @NumTimesRepeated = 0
          SET @CurrentStringPosition = 0
          SET @PatternStartsAtPosition = 0
          SET @NewSourceString = @SourceString
          WHILE len(@NewSourceString) >= @LengthOfTargetWord
          BEGIN
            SET @PatternStartsAtPosition = CHARINDEX(@TargetWord, @NewSourceString)
            IF @PatternStartsAtPosition <> 0
            BEGIN
  • 47. Word Count - SQL Server
              SET @NumTimesRepeated = @NumTimesRepeated + 1
              SET @CurrentStringPosition = @CurrentStringPosition + @PatternStartsAtPosition + @LengthOfTargetWord
              SET @NewSourceString = substring(@NewSourceString, @PatternStartsAtPosition + @LengthOfTargetWord, @LengthOfString)
            END
            ELSE
            BEGIN
              SET @NewSourceString = ''
            END
          END
          RETURN @NumTimesRepeated
        END
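The UDF on slides 45-47 repeatedly finds the target, counts it, and continues searching in the text after the match, so it counts non-overlapping substring occurrences from left to right. A rough Python rendering of the same loop is sketched here for comparison. It is an approximation: it does not reproduce SQL Server's collation or trailing-space rules, and the empty-target guard is an addition in this sketch.

```python
def word_repeated_num_times(source, target):
    """Count non-overlapping occurrences of target in source, left to right."""
    if not target:
        return 0  # guard added in this sketch for an empty target
    count = 0
    remaining = source
    while len(remaining) >= len(target):
        position = remaining.find(target)  # -1 when absent (CHARINDEX returns 0)
        if position == -1:
            break
        count += 1
        # Continue searching in the text after the match just found.
        remaining = remaining[position + len(target):]
    return count


if __name__ == "__main__":
    print(word_repeated_num_times("lorem ipsum lorem dolor", "lorem"))
```

Python's built-in `str.count` performs the same non-overlapping count; the explicit loop is kept only to mirror the structure of the UDF.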
  • 48. Word Count (Hadoop)
        a = load '/user/hue/word_count_text.txt';
        b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
        c = group b by word;
        d = foreach c generate COUNT(b), group;
        store d into '/user/hue/pig_wordcount';
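For readers new to Pig, the relations in the script on slide 48 can be mirrored step by step in Python. This is a single-machine sketch for comparison only. It splits on whitespace and commas, which is a simplification of the delimiters `TOKENIZE` uses, and the function name is hypothetical.

```python
import re
from collections import Counter


def word_count(lines):
    """Mirror the Pig relations: tokenize (b), group (c), count (d)."""
    # b: flatten every line into a single stream of words.
    words = [
        token
        for line in lines
        for token in re.split(r"[\s,]+", line)
        if token
    ]
    # c and d: group identical words and count each group.
    return Counter(words)


if __name__ == "__main__":
    text = ["Lorem ipsum dolor sit amet, consectetur adipiscing elit."]
    for word, count in word_count(text).items():
        print(count, word)
```

Unlike the SQL Server UDF, which counts one target word per call, this counts every distinct word in one pass, as the Pig script does.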
  • 49. Getting Started (Complex Analysis): either 1. Lab Environment (Virtualized), 2. Install Hortonworks Sandbox; or 1. Set up Azure account, 2. HDInsight
  • 50. Theoretically, can scale to PB, but no idea what that will cost you. Note that the interface highlights Hive (with Stinger); Pig commands are run through Powershell
  • 51. In Conclusion: Lots of vocabulary (HDFS, Pig, Hive, MapReduce); map to SQL Server (RDBMS) vocabulary; different use cases: Massive Data, Complex Analysis
  • 52. Questions & Feedback
  • 53. Contact Me: Stuart R. Ainsworth; Twitter: @codegumbo; Email: stuart@codegumbo.com; SpeakerRate: http://spkr8.com/t/33521
  • 54. Big Data - Dangerous http://www.thefacehawk.com/