Is Hadoop for You?

Introduction to Hadoop for Oracle Database professionals. Presented at the E4 conference.

Published in: Technology

  • The 1999 model no longer worked in 2007, and Exadata was a huge improvement

    1. Is Hadoop For You? Gwen Shapira, Solutions Architect
    2. About Me • Solutions Architect @ Cloudera • Making our customers successful • Formerly: Database consultant @ Pythian, specializing in Exadata, RAC, replication • Oracle ACED, Oak Table Member • @gwenshap <- Hadoop tips in 140 characters
    3. Agenda • Answer the question: Who needs Hadoop?
    4. In more detail… (bar chart, % of session per topic, axis 0% to 50%) • What's so special about Hadoop • Basic Hadoop Architecture • When to Hadoop • What you need to succeed • Getting Started
    5. What’s so special about Hadoop? Technically Speaking
    6. Databases in 1999 • 1. Buy a really big machine • 2. Install an expensive DBMS on it • 3. Point your workload at it • 4. Hope it doesn’t fail • 5. Ambitious: buy another really big machine as a backup
    7. Problems: • Reliability • Scalability • Storage throughput • Complex Upgrades • Relational only
    8. Exadata: State of the Art - 2007 • 1. Storage and compute in one rack • 2. Cluster with Infiniband interconnect • 3. Balanced architecture • 4. Offloading • 5. Parallelism • 6. Compression
    9. Hadoop • Distributed File System • Programming Framework • Many projects on top • Open Source (This means free)
    10. Designed For: • Reliability • Parallel Processing • Scalability • Flexibility
    11. Reminders: • Disk does a seek for each I/O operation • Seeks are expensive (~10 ms) • Big I/Os mean better throughput • Network is fast inside a rack • Slower between racks
    12. The File System • Files are split into 64 MB blocks • 64 MB!!! • Distributed • Replicated • Write-Once
    13. HDFS Architecture (diagram: one NameNode, four DataNodes) • NameNode holds metadata: paths, filenames, file sizes, block locations, …
    14. HDFS Architecture (diagram: one NameNode, four DataNodes) • DataNodes hold data: blocks, checksums
    15. HDFS Write Path (diagram: Client, NameNode, DN 1 to DN 4 across Rack 1 and Rack 2) • Client: create(“/tmp/myfile”) • Write to [DN4,DN3,DN2] → [DN3,DN2] → [DN2]
    16. HDFS Read Path (diagram: Client, NameNode, DN 1 to DN 4 across Rack 1 and Rack 2) • Client: open(“/tmp/myfile”,“r”) • Read from [DN4,DN3,DN2] • Read data
    17. Map-Reduce • Java Framework • Works on Key-Value pairs • Map: operate on every element; filter or transform; code runs where the data is stored • Shuffle: redistribution of data • Reduce: aggregate or join
    18. MapReduce Architecture (diagram: JobTracker, NameNode, DN 1 to DN 4 and TT 1 to TT 4 across Rack 1 and Rack 2) • JobTracker: gateway for users; assigns tasks to TaskTrackers; tracks job status
    19. MapReduce Architecture (same diagram) • TaskTrackers execute Map and Reduce tasks assigned by JT
    20. Word Count Example • Mapper Input: “The cat sat on the mat”, “The aardvark sat on the sofa” • Mapping: The, 1; cat, 1; sat, 1; on, 1; the, 1; mat, 1 | The, 1; aardvark, 1; sat, 1; on, 1; the, 1; sofa, 1 • Shuffling: aardvark, 1; cat, 1; mat, 1; on [1, 1]; sat [1, 1]; sofa, 1; the [1, 1, 1, 1] • Reducing: aardvark, 1; cat, 1; mat, 1; on, 2; sat, 2; sofa, 1; the, 4 • Final Result: aardvark, 1; cat, 1; mat, 1; on, 2; sat, 2; sofa, 1; the, 4
    21. MapReduce Architecture (same diagram) • Job: wordcount(<files>) • Tasks shown: M1, M2, M3, M4, R1 • Pairs shown: [cat, 1] [dog, 1] [the, 1] [sat, 1]
    22. MapReduce Architecture (same diagram) • Job: wordcount(<files>) • Tasks shown: M5, M6, M7, M8, R1 • Pairs shown: [a, 5] [cat, 2] [dog, 1] [the, 4] [mat, 1]
    23. Compare to Oracle PX • Mappers -> Producers • Reducers -> Consumers • Shuffle -> Re-distribution
    24. In Short • Benefits: Reliable; Scalable; Infinite Flexibility; Cheap • Challenges: New skills; Infinitely Flexible; Feature-completeness; Best practices and examples
    25. Use Cases: When to Hadoop?
    26. When to Hadoop? When Relational Databases Don’t Add Benefits
    27. Non-relational Data • XML • Logs • Geo spatial data • Video
    28. Adding to the Data Warehouse • ETL • History • Some reports • Rocket Data Science
    29. What you Need to Succeed
    30. A Problem
    31. Right Toolset
    32. Toolset
    33. Toolset for DBAs • Hive – Turn SQL to Map-Reduce • Streaming – Map-Reduce in any language • Pig – Write and Execute execution plans • Oozie – Coordinate workflows • Impala – real-time SQL • HBase – key-value real-time data store
    34. Data Model • Partitions • Batch processing • Star Schema • Materialized Views • Sort and Compress • De-normalize • Tune the data • Nested data structures
    35. Right Hardware • If possible – POC with your workload • Sizing by storage • You probably need to over-provision • Machine reliability • Big Data Appliance is a good start
    36. Non-technical Advice • Your team will have to learn a lot • Be ready for a challenge
    37. Getting Started
    38. Why get started? • Hadoop projects are more visible • 48% of Hadoop clusters are owned by DWH team • Big Data == Business pays attention to data • New skills – from coding to cluster administration • Interesting projects • No, you don’t need to learn Java
    39. VM • Cloud • Cluster
    40. Books
    41. More Books
    42. Beginner Projects • Install a 5-node Hadoop cluster in AWS • Load data: complete works of Shakespeare; MovieLens database • Find the 10 most common words in Shakespeare • Find the 10 most recommended movies • Run TPC-H • Cloudera Data Science Challenge • Actual use-case: XML ingestion, ETL process, DWH history
    43. Need Help? • I can help: @gwenshap; gshapira@cloudera.com • Hadoop Community: http://community.cloudera.com; user@hadoop.apache.org; Google group: CDH Users
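
Slides 11 and 12 pair two facts: every I/O pays a seek of roughly 10 ms, and HDFS stores files in 64 MB blocks. The following is a rough back-of-the-envelope sketch of why large blocks help. The 100 MB/s sequential transfer rate and the 8 KB and 1 MB comparison sizes are illustrative assumptions, not figures from the slides, and `effective_throughput` is a hypothetical helper name.

```python
SEEK_SECONDS = 0.010        # ~10 ms per seek, as stated on slide 11
TRANSFER_MB_PER_S = 100.0   # ASSUMPTION: illustrative sequential disk rate


def effective_throughput(io_size_mb: float) -> float:
    """Return MB/s achieved when each I/O of io_size_mb pays one seek first."""
    transfer_seconds = io_size_mb / TRANSFER_MB_PER_S
    return io_size_mb / (SEEK_SECONDS + transfer_seconds)


# ASSUMPTION: 8 KB and 1 MB are comparison points chosen for illustration;
# only the 64 MB block size comes from the slides.
for size_mb in (0.008, 1, 64):
    rate = effective_throughput(size_mb)
    print(f"{size_mb:>7} MB per I/O -> {rate:6.1f} MB/s")
```

Under these assumptions a 1 MB I/O spends as long seeking as transferring and reaches only half the disk's sequential rate, while a 64 MB I/O makes the seek cost almost negligible.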

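The Map, Shuffle, and Reduce steps from slide 17 and the Word Count example from slide 20 can be sketched in plain Python. This is a single-process illustration of the data flow only, not Hadoop's Java API; the function names are hypothetical. Words are lowercased here as an assumption, because the slide's final result counts "The" and "the" together.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    # ASSUMPTION: lowercase so that "The" and "the" share one key.
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values by key, sorted by key."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reduce: aggregate the list of counts for each word into a total."""
    return {word: sum(counts) for word, counts in grouped.items()}


lines = ["The cat sat on the mat", "The aardvark sat on the sofa"]

# Each line is mapped independently, as separate mappers would do.
mapped = [pair for line in lines for pair in map_phase(line)]
shuffled = shuffle_phase(mapped)
result = reduce_phase(shuffled)

for word, total in result.items():
    print(word, total)
```

In a real cluster the map calls run on the nodes that hold the input blocks and the shuffle moves data across the network; here all three phases run in one process so the intermediate values are easy to inspect.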