Introduction to Hadoop Ecosystem

This presentation covers the fundamentals of the Hadoop ecosystem, including popular technologies such as HDFS, YARN, Hive, Spark, and Flink.

Published in: Technology

  1. 1. © Introduction to Hadoop Ecosystem
  3. 3. © StreamRock™
  4. 4. © StreamRock™
  12. 12. © A Definition For Your Daddy
  13. 13. ©
      $ hdfs dfs -ls /user/tiger
      $ hdfs dfs -put songs.txt /user/tiger
      $ hdfs dfs -cat /user/tiger/songs.txt
      $ hdfs dfs -mkdir songs
      $ hdfs dfs -mv songs.txt songs
      $ hdfs dfs -rm -r songs
  14. 14. © $ hdfs dfs -put songs.txt /user/tiger Question?
  15. 15. © Answer!
  16. 16. © Image source: http://pixgood.com/slicing-bread.html
  17. 17. © $ hdfs dfs -cat /user/tiger/songs.txt Question?
  18. 18. © Answer!
  25. 25. ©
      1. Offers compute resources such as CPU and RAM
      2. Runs tasks of the applications submitted by users
      3. Reports to the Master
  26. 26. ©
      1. Knows about all Slaves
      2. Knows about available and occupied resources on each Slave
      3. Schedules jobs submitted by clients
  27. 27. © A user can submit any type of application that is supported by YARN
  28. 28. ©
      1. Started and overseen by Resource Manager
      2. Coordinates the execution of all tasks within an application
      3. Asks for resources needed to run its tasks
      4. Runs on the Node Manager
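The division of labour on slides 25 to 28 can be illustrated with a toy model. This is only a sketch under stated assumptions: `Node`, `Master`, `allocate`, and `release` are hypothetical names invented for illustration, not YARN classes or APIs, and the first-fit placement rule is a simplification of what a real scheduler does.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy worker: offers CPU cores and RAM (hypothetical, not a YARN class)."""
    name: str
    free_cores: int
    free_mem_gb: int


@dataclass
class Master:
    """Toy master: knows every worker and how much room each has left."""
    nodes: list = field(default_factory=list)

    def allocate(self, cores, mem_gb):
        """Reserve a container on the first node with enough room (first fit)."""
        for node in self.nodes:
            if node.free_cores >= cores and node.free_mem_gb >= mem_gb:
                node.free_cores -= cores
                node.free_mem_gb -= mem_gb
                return node.name
        return None  # no node can host the container right now

    def release(self, node_name, cores, mem_gb):
        """Give a finished container's resources back to its node."""
        for node in self.nodes:
            if node.name == node_name:
                node.free_cores += cores
                node.free_mem_gb += mem_gb


master = Master([Node("node-1", 4, 16), Node("node-2", 8, 32)])
first = master.allocate(3, 12)    # fits on node-1
second = master.allocate(3, 12)   # node-1 is nearly full, so node-2 is used
third = master.allocate(6, 8)     # no node has 6 free cores left
master.release("node-1", 3, 12)   # the first container finishes
fourth = master.allocate(4, 16)   # node-1 is empty again
```

The master only keeps bookkeeping; the work itself would run inside the reserved containers on the workers.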
  29. 29. © Containers are dynamically created and deleted
  32. 32. © Large volume of data. Computation, e.g. a JAR file
  33. 33. ©
      1. NodeManagers should be co-located with DataNodes
      2. The Resource Manager tries to schedule tasks on the node closest to the data
      3. Large volumes of data don’t have to be sent over the network
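The "closest to the data" preference on slide 33 can be sketched as a simple three-level choice: a node that holds the data, then a node in the same rack, then any free node. The function `pick_node` and the node and rack names are hypothetical and chosen for illustration; a real scheduler weighs more factors than this.

```python
def pick_node(data_nodes, candidates, rack_of):
    """Return the candidate closest to the data, or None if there are no candidates.

    data_nodes: set of nodes that store the input data
    candidates: nodes with free resources, in any order
    rack_of:    mapping from node name to rack name
    """
    # 1. Best case: a free node that already stores the data.
    for node in candidates:
        if node in data_nodes:
            return node
    # 2. Next best: a free node in the same rack as some copy of the data.
    data_racks = {rack_of[node] for node in data_nodes}
    for node in candidates:
        if rack_of[node] in data_racks:
            return node
    # 3. Otherwise any free node; the data then travels over the network.
    return candidates[0] if candidates else None


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
same_node = pick_node({"n1"}, ["n3", "n2", "n1"], rack_of)
same_rack = pick_node({"n1"}, ["n3", "n2"], rack_of)
anywhere = pick_node({"n1"}, ["n3", "n4"], rack_of)
```

Only the third case moves the data across racks, which is the situation the slide says to avoid for large volumes.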
  35. 35. © Their reality. Their conclusion.
  36. 36. © Query: SELECT trackid, COUNT(*) AS cnt FROM stream GROUP BY trackid ORDER BY cnt DESC;
      SOME MAGIC:
      1. Parses query
      2. Plans execution
      3. Submits jobs
      4. Monitors jobs
      5. Returns results
      Execution: HADOOP (MR, MR)
      Results
  37. 37. © Query: SELECT trackid, COUNT(*) AS cnt FROM stream GROUP BY trackid ORDER BY cnt DESC;
      APACHE HIVE:
      1. Parses query
      2. Plans execution
      3. Submits jobs
      4. Monitors jobs
      5. Returns results
      Execution: HADOOP (MR, MR)
      Results
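What the query on slides 36 and 37 computes can be sketched in plain Python as a map step, a reduce step, and a final sort. This is an illustration of the idea only, assuming a tiny in-memory `stream` list with made-up track ids; the function names are hypothetical, and the jobs Hive actually generates depend on its version and settings.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (trackid, 1) for every record, like the map side of GROUP BY."""
    for record in records:
        yield record["trackid"], 1


def reduce_phase(pairs):
    """Sum the ones per key, like COUNT(*) on the reduce side."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def top_tracks(records):
    """GROUP BY trackid, then ORDER BY cnt DESC."""
    counts = reduce_phase(map_phase(records))
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Made-up sample rows standing in for the `stream` table.
stream = [
    {"trackid": "a"}, {"trackid": "b"}, {"trackid": "a"},
    {"trackid": "c"}, {"trackid": "a"}, {"trackid": "b"},
]
result = top_tracks(stream)
```

Here `result` is `[("a", 3), ("b", 2), ("c", 1)]`. The point of the slide is that the user writes only the SQL and the tool handles planning, job submission, and monitoring.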
  43. 43. © RDBMS. Hive Metastore: stores Hive metadata; manages metadata about databases, tables and views
  44. 44. © Hive Shell CLI. RDBMS. Hive Metastore
  45. 45. © Hive Shell CLI. BeesWax. HUE. RDBMS. Hive Metastore. Hive Server 2: acts as a proxy for “light” clients. JDBC/ODBC. Beeline CLI
  49. 49. © Job 1: HDFS read, cache in memory. Job 2: memory read. It is possible to cache a dataset in the cluster’s (distributed) memory to read it faster in subsequent jobs.
  50. 50. © Job 1: HDFS read, cache in memory. Job 2: memory read. It is possible to cache a dataset in the cluster’s (distributed) memory to read it faster in subsequent jobs. Great fit for iterative algorithms and interactive queries!
  51. 51. © Interactive queries: Input; Query 1, Query 2, Query 3. Iterative algorithms: Input; Iteration 1, Iteration 2. Distributed Memory
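The benefit of caching shown on slides 49 to 51 can be illustrated with a plain-Python analogy. This is not Spark code: `Dataset`, `cache`, `rows`, and `disk_reads` are hypothetical names, loosely modelled on the idea that a dataset marked for caching is read from storage once and served from memory afterwards.

```python
class Dataset:
    """Toy dataset that counts how often it is read from slow storage."""

    def __init__(self, loader):
        self._loader = loader        # function that simulates a slow storage read
        self._cache_enabled = False
        self._cached = None
        self.disk_reads = 0

    def cache(self):
        """Mark the dataset for caching; nothing is loaded until first use."""
        self._cache_enabled = True
        return self

    def rows(self):
        """Return the data, from memory if cached, else from storage."""
        if self._cached is not None:
            return self._cached
        self.disk_reads += 1
        data = list(self._loader())
        if self._cache_enabled:
            self._cached = data
        return data


def load_plays():
    """Stand-in for an expensive read of made-up play counts."""
    return [3, 1, 4, 1, 5]


cached = Dataset(load_plays).cache()
total = sum(cached.rows())     # job 1: reads storage, fills the cache
biggest = max(cached.rows())   # job 2: served from memory

uncached = Dataset(load_plays)
sum(uncached.rows())
max(uncached.rows())           # reads storage a second time
```

With the cache, two jobs cost one storage read; without it, each job pays for its own read, which is why repeated queries and iterations benefit most.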
  52. 52. © Client. Resource Manager. NodeManager: YARN Container (Spark Application Master, Spark Driver). NodeManager: YARN Container (Spark Executor, Spark Task). NodeManager: YARN Container (Spark Executor, Spark Task)
  53. 53. ©
      ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
          --master yarn \
          --deploy-mode cluster \
          --driver-memory 4g \
          --executor-memory 20g \
          --executor-cores 3 \
          lib/spark-examples*.jar \
          10
  54. 54. ©
      Spark Core
      Spark SQL
      Spark Streaming (near real-time, micro-batch)
      MLlib (machine learning)
      GraphFrames (graph processing)
      SparkR (R on Spark)
  55. 55. © <- INGEST <- STORE <- MANAGE <- ANALYZE
  56. 56. © StreamRock™
  57. 57. © Non-stop: each event or each minute or each user session. Real-time event collection. Stream processing
  59. 59. © StreamRock™
