Hadoop Fundamentals I

IBM Innovation Center DACH/Zurich, Romeo Kienzler

  1. © 2013 IBM Corporation. AVNET – Hadoop Fundamentals I. Romeo Kienzler, IBM Innovation Center Zurich
  2. Agenda
     1) Welcome
     2) What is big data?
     3) Introduction to Hadoop
     4) BigInsights
     5) Hadoop architecture
     6) Lab 1 – Core Hadoop
     7) MapReduce
     8) Lab 2 – MapReduce
     9) Pig, Jaql, Hive, BigSQL, SystemT/AQL
     10) Lab 3 – Pig, Hive, and Jaql
     11) Certification on BigDataUniversity
  3. What is BIG data?
  4. Traditional Business Intelligence / Data Warehousing: "...60 percent, were unsatisfied with their data warehousing system."¹
     ¹http://www.information-management.com/issues/20010601/3494-1.html
  5. What is BIG data?
  6. What is BIG data?
  7. What is BIG data? Big Data, Hadoop
  8. What is BIG data? Business Intelligence, Data Warehouse
  9. Map-Reduce → Hadoop → BigInsights
  10. Why is Big Data important?
      Chart labels: Data AVAILABLE to an organization; data an organization can PROCESS; Missed opportunity
      Enterprises are "more blind" to new opportunities.
      Organizations are able to process less and less of the available data.
      100 million tweets are posted every day, 35 hours of video are being uploaded every minute, 6.1 x 10^12 text messages were sent in 2011, and 247 x 10^9 e-mails passed through the net. 80 % spam and viruses.
      => Prefiltering is more and more important.
  11. Why is Big Data important?
  12. Why is Big Data important?
  13. Why is Big Data important?
  14. What is BIG data? Volume * Variety * Velocity = Value
      Volume: Terabytes, petabytes, even exabytes
      Variety: All kinds of data; all kinds of analytics
      Velocity: Analyze data in hours instead of days, days instead of weeks
      Agility: Dynamically responsive; rapid data exploration
      Traditional / non-traditional data sources: Store, Analyze, Explore
  15. BigData Analytics
  16. BigData Analytics – Predictive Analytics
  17. BigData Analytics – Predictive Analytics
  18. BigData Analytics – Correlation / Text / NLP
  19. BigData Analytics – Feature Extraction: "Feature extraction involves simplifying the amount of resources required to describe a large set of data accurately"¹
      ¹: Wikipedia
  20. BigData Analytics – Predictive Analytics: Storage / Data, CPUs / Algorithm, Business Value / Insight
  21. BigData Analytics – Predictive Analytics: "sometimes it's not who has the best algorithm that wins; it's who has the most data." (C) Google Inc., The Unreasonable Effectiveness of Data¹
      No Sampling => Work with full dataset => Long Tail Distributions
      ¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf
  22. Realtime / In-Memory Computing: InfoSphere Streams / Watson
  26. The Paris Hilton Problem. Watson Workshop: What is Watson?
  27. Introduction to Hadoop
  29. BigInsights
  31. BigInsights Demonstration
  32. Hadoop Architecture
  35. HDFS – Hadoop Distributed File System
  54. Lab 1 – Hadoop Architecture
      1) Start from chapter 1.2
      2) Replace /home/biadmin with /home/biadminX where X is your user ID
      3) In chapter 1.3 skip task 1.3.1._1 and go to http://10.199.20.51:8080 instead
      4) Skip 1.3.5
      5) In chapter 1.3.6._30 use any file you like on your desktop computer
  55. Map-Reduce
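
The slides for this section survive only as a title, so the following is a minimal sketch of the general map, shuffle, reduce pattern rather than anything taken from the deck. It runs in a single Python process; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions and not part of the Hadoop API, where the framework performs the shuffle across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "hadoop processes big data"]))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases could be spread over many machines without changing the result.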
  97. Data Parallelism
  98. Aggregated Bandwidth between CPU, Main Memory and Hard Drive
      Time to read 1 TB (at 10 GByte/s per node):
      - 1 Node - 100 sec
      - 10 Nodes - 10 sec
      - 100 Nodes - 1 sec
      - 1000 Nodes - 100 msec
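
The figures on this slide follow from a single division, which the sketch below reproduces. It assumes decimal units (1 TB = 10^12 bytes, 1 GByte = 10^9 bytes), that the 10 GByte/s applies to each node, and that the data is spread evenly so bandwidth adds up linearly; the function name `scan_time_seconds` is hypothetical.

```python
TB = 10**12  # bytes, decimal units (assumption)
GB = 10**9   # bytes, decimal units (assumption)


def scan_time_seconds(data_bytes, per_node_bandwidth, nodes):
    """Time to read the data once when every node reads its share in parallel."""
    aggregated_bandwidth = per_node_bandwidth * nodes
    return data_bytes / aggregated_bandwidth


if __name__ == "__main__":
    for nodes in (1, 10, 100, 1000):
        seconds = scan_time_seconds(1 * TB, 10 * GB, nodes)
        print(f"{nodes:>4} nodes: {seconds:g} s")
```

Real clusters fall short of this ideal because of skewed data placement, network transfer, and coordination overhead, so the numbers are best read as an upper bound on the speed-up.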
  103. Lab 2 - MapReduce
       1) Skip task 1.1._1, use putty to connect to biadmin@10.199.20.51 instead
       2) Replace /home/biadmin with /home/biadminX where X is your user ID
       3) In 1.1._4 - 1.1._6 replace output with /home/biadminX/output where X is your user ID
       4) Skip chapter 1.2
       5) Chapter 1.3 is optional (using your local virtual machine), maybe during lunch break :)
  104. Pig, Jaql, Hive, BigSQL, SystemT/AQL
  133. SQL for BigInsights
       ▪ Data warehouse augmentation is a very common use case for Hadoop
       ▪ While highly scalable, MapReduce is notoriously difficult to use
         – Java API is tedious and requires programming expertise
         – Unfamiliar languages (e.g. Pig) also requiring expertise
         – Many different file formats, storage mechanisms, configuration options, etc.
         – Joins, grouping, sorting tedious to orchestrate
       ▪ SQL support opens the data to a much wider audience
         – Familiar, widely known syntax
         – Common catalog for identifying data and structure
         – Clear separation of defining the what (you want) vs. the how (to get it)
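
The "what vs. how" point can be illustrated with a small sketch that computes the same per-group total twice: once with hand-written grouping code, once with a declarative SQL statement. Python's built-in `sqlite3` module stands in for an SQL engine here purely for illustration; it is not Big SQL, and the `sales` table and its columns are made up for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, amount)
rows = [("EU", 10.0), ("US", 20.0), ("EU", 5.0)]

# The "how": grouping and summing spelled out step by step,
# as a hand-written job would have to do.
totals = defaultdict(float)
for region, amount in rows:
    totals[region] += amount
manual = dict(totals)

# The "what": the same result stated declaratively; the engine decides how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
declarative = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
conn.close()

print(manual == declarative)
```

The SQL version says nothing about iteration order, intermediate data structures, or parallelism, which is what leaves an optimizer free to choose an execution plan.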
  134. Query Processing
       ▪ Big SQL consists of two query processing engines
         – The SQL optimization engine
         – Jaql as the query execution engine
       [Diagram: Client, SQL Engine, SQL Optimizer, Jaql, Runtime]
  135. Big SQL vs. Alternatives
       ▪ There are a number of SQL solutions; where does Big SQL fit in?
       ▪ Hive
         – Open source
           • Established Hadoop component
           • Active development community
         – Restrictive SQL syntax
           • No subqueries (Hive 0.11 adds non-correlated subquery support)
           • No windowed aggregates (Hive 0.11 adds windowed aggregate support)
           • ANSI join syntax only
         – Limited type support
           • No varchar(n), decimal(p,s), etc.
         – Poor client support
           • Limited JDBC and ODBC drivers
         – Poor low-latency query support (via local MapReduce)
  136. Big SQL vs. Alternatives (cont.)
       ▪ Impala
         – Recently open sourced
         – Achieves low latency by bypassing MapReduce infrastructure
           • Installs a completely separate execution infrastructure
           • Can lead to resource scheduling conflicts
         – Execution engine is C++
           • Great for performance, makes extending difficult (e.g. UDFs and UDAs)
           • Support for limited set of file formats
         – Currently limited to broadcast joins
           • All tables must fit in memory (aggregate cluster memory)
           • Scalability limitation for larger clusters
         – Uses Hive 0.9 query syntax (more limitations than the current Hive)
         – Uses Hive 0.9 type system (more limitations than the current Hive)
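
To make the memory limitation of broadcast joins concrete, here is a minimal single-process sketch of the general technique: the smaller table is turned into an in-memory lookup that every node would hold in full, while the larger table stays partitioned. This illustrates the idea only and is not Impala's implementation; `broadcast_join` and the sample tables are hypothetical, and the join key is assumed to be the first column of each row.

```python
def broadcast_join(partitions, small_table):
    """Join partitioned rows against a small table on the first column."""
    # Build the lookup once. In a cluster, every node would keep a full
    # copy of this dictionary in memory, which is why the broadcast side
    # has to fit in memory.
    lookup = {row[0]: row[1:] for row in small_table}

    joined = []
    for partition in partitions:  # each partition stands for one node's local data
        for row in partition:
            match = lookup.get(row[0])
            if match is not None:  # inner join: drop rows without a partner
                joined.append(row + match)
    return joined


if __name__ == "__main__":
    orders = [
        [(1, "book"), (2, "pen")],   # partition on node A
        [(1, "lamp"), (3, "desk")],  # partition on node B
    ]
    customers = [(1, "Ada"), (2, "Grace")]
    print(broadcast_join(orders, customers))
```

No rows of the large table ever move between partitions, which keeps latency low; the cost is that the copied table's size is bounded by available memory.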
  141. Lab 3 – Querying Data with Pig, Hive, Jaql
       1) Use putty to connect to biadmin@10.199.20.51
       2) Skip task 1.1._2, start jaql shell using command /opt/ibm/biginsights/jaql/bin/jaqlshell
       3) In 1.1._5 replace biadmin with biadminX where X is your user ID
       4) Skip chapter 1.2 (optional using virtual machine)
       5) In 1.3._2 replace biadmin with biadminX where X is your user ID
       6) Instead of task 1.3._2 type /opt/ibm/biginsights/pig/bin/pig
       7) In 1.3._4 replace sampleData/NewsGroups.csv with /user/biadminX/sampleData/NewsGroups.csv
       8) Skip chapter 1.4 (optional using virtual machine)
       9) Skip 1.5._12 and _13 and type /opt/ibm/biginsights/hive/bin/hive instead
       10) Type "use biadminX" where X is your user ID
       11) Continue with task _14
  142. NoSQL Databases
       ▪ Column Store
         – Hadoop / HBase
         – Cassandra
         – Amazon SimpleDB
       ▪ JSON / Document Store
         – MongoDB
         – CouchDB
       ▪ Key / Value Store
         – Amazon DynamoDB
         – Voldemort
       ▪ Graph DBs
         – DB2 SPARQL Extension
         – Neo4J
       ▪ MP RDBMS
         – DB2 DPF, DB2 pureScale, PureData for Operational Analytics
         – Oracle RAC
         – Greenplum
       http://nosql-database.org/ (> 150)
  143. CAP Theorem / Brewer's Theorem¹
       ▪ It is impossible for a distributed computer system to simultaneously guarantee all 3 properties
         – Consistency (all nodes see the same data at the same time)
         – Availability (a guarantee that every request receives a response about whether it was successful or failed)
         – Partition tolerance (the system continues to operate despite failure of part of the system)
       ▪ What about ACID?
         – Atomicity
         – Consistency
         – Isolation
         – Durability
       ▪ BASE, the new ACID
         – Basically Available
         – Soft state
         – Eventual consistency
           • Monotonic Read Consistency
           • Monotonic Write Consistency
           • Read Your Own Writes
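
As a rough illustration of eventual consistency and the "Read Your Own Writes" guarantee, the toy model below has one primary copy and one replica that only catches up when `sync()` is called. Every class and method name here is a hypothetical design for the example, not the API of any of the databases listed above.

```python
class Replica:
    """One copy of the data: key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key, (0, None))


class Store:
    """Toy store with a primary and one lazily updated replica."""

    def __init__(self):
        self.primary = Replica()
        self.replica = Replica()
        self.version = 0

    def write(self, key, value):
        """Writes go to the primary only and return their version number."""
        self.version += 1
        self.primary.data[key] = (self.version, value)
        return self.version

    def sync(self):
        """Propagate writes; until this runs, the replica may be stale."""
        self.replica.data.update(self.primary.data)


class Session:
    """Client session that enforces read-your-own-writes."""

    def __init__(self, store):
        self.store = store
        self.last_written = {}  # key -> version written by this session

    def write(self, key, value):
        self.last_written[key] = self.store.write(key, value)

    def read(self, key):
        version, value = self.store.replica.get(key)
        if version < self.last_written.get(key, 0):
            # The replica has not seen our own write yet: use the primary.
            version, value = self.store.primary.get(key)
        return value


if __name__ == "__main__":
    store = Store()
    alice, bob = Session(store), Session(store)
    alice.write("x", "v1")
    print(alice.read("x"))  # alice sees her own write
    print(bob.read("x"))    # bob may still see the old state
    store.sync()
    print(bob.read("x"))    # after replication, bob sees it too
```

The session-level version check is one common way to provide the guarantee; other designs pin a session to a single replica instead.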
  144. Certification
       ▪ Go to www.bigdatauniversity.com
       ▪ Search for "hadoop fundamentals"
       ▪ Choose "Hadoop Fundamentals I – Version 2"
       ▪ Sign up
       ▪ Login with existing account or one of the following:
       ▪ Take the test:
  145. Questions?