Apache Hadoop Ecosystem (based on an exemplary data-driven…

Introduction to Apache Hadoop Ecosystem based on some exemplary data-driven company that wants to store and process large amounts of data.

  1. Vademecum Big Data (Adam Kawa, Spotify, Compendium CE)
  2. About Me: Spotify/Compendium, WHUG/SHUG, HakunaMapData.com, +2.5Y
  3. And The 20-Minute Story About ... (Image source: http://www.containsmoderateperil.com/wp-content/uploads/2012/09/Dev-Diary-Epic-Story.jpg)
  4. A Really Data-Driven Company … (Image source: http://wwwimg.roku.com/hero-images/home2_1.jpg)
  5. And Some Inevitable Problems ... (Image source: http://www.digitalnewsasia.com/sites/default/files/images/digital%20economy/data%20explosion.jpg)
  6. And Some Inevitable Problems ... (Image source: http://p.alejka.pl/i2/p_new/36/42/grosz-na-szczescie-ze-zlota-m-1z-doskonaly-na-kazda-okazje_0_b.jpg)
  7. And Some Inevitable Problems ... (Image source: http://25.media.tumblr.com/d1038e7831eae86f5e84d0d09a2e6fad/tumblr_mfh5srmNAR1s06a3to1_500.jpg)
  8. Start!
  9. The First Approach Works Fine ...
  10. Until Data Gets Bigger ...
  11. And More Diverse ...
  12. The Data Monster Becomes A Problem (Image source: http://cloudtimes.org/wp-content/uploads/2012/05/big-data.jpg)
  13. Apache Hadoop Becomes A Solution (Image source: http://gigaom2.files.wordpress.com/2012/06/shutterstock_60414424.jpg)
  14. Orchestra Of Nodes (Image source: http://www.dsn.jhu.edu/images/orchestra.gif)
  15. Fault-Tolerant Orchestra Of Nodes
  16. Untypical Orchestra Of Typical* Nodes (* however, having very cheap nodes is false economy)
  17. Highly Scalable Orchestra Of Nodes
  18. Hadoop Distributed File System (HDFS) (Image source: http://www.wallcoo.net/car/Trucks/images/Big_Truck_on_Road_.jpg)
  19. HDFS Blocks And Replication
  20. HDFS Self-Healing Features (Image source: http://www.mwctoys.com/images/review_hydra_3.jpg)
  21. HDFS Scales And Shines With MapReduce (Image source: http://www.kkkp.pl/graph/gr_kdz_char3.jpg)
  22. MapReduce Is A Change (diagram labels: DATA, Map And Reduce) (Image source: http://2.bp.blogspot.com/-Kl1ADjd3_7I/T6a8ZQV7ITI/AAAAAAAAKfE/qVyTQdJl2Do/s1600/make-big-changes-in-small-steps.png)
  23. Map And Reduce Functions
  24. MapReduce Paradigm
  25. Artist Count Example
  26. Sending Computation To Data (diagram labels: Data Is Here!, Computation) (Image source: http://www.conservationmagazine.org/wp-content/uploads/2011/03/ElephantAndMouse1.jpg)
  27. MapReduce Implementation (Image source: http://i3.mirror.co.uk/incoming/article1360046.ece/ALTERNATES/s615/Male+drones+tend+to+honeycomb+cells+in+a+bee+colony)
  28. First Success: 5-Node Hadoop Cluster (Image source: http://www.smallbiztechnology.com/wp-content/uploads/2012/12/success.jpg)
  29. Apache Whirr And The Cloud
      ===== hadoop.properties =============
      whirr.cluster-name=production_cluster
      whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,4 hadoop-datanode+hadoop-tasktracker
      whirr.provider=aws-ec2 # or Rackspace cloudservers-us
      ...
      =====================================
      $ whirr launch-cluster --config hadoop.properties
      $ whirr destroy-cluster --config hadoop.properties
  30. First Sad (Non-Java Speaking) Developers (Image source: http://www.shivayanaturals.com/wp-content/uploads/2012/01/Unhappy.jpg)
  31. Hadoop Streaming For Scripting Languages (Image source: http://www.mightystreamradio.com/PHOTOS/STREAM%20PHOTO%202.jpg)
  32. Apache Hive Makes You Feel Younger (Image source: http://majapszczolka.blox.pl/resource/Pszczolka_Maja_Baje_Pl_6.jpg)
  33. Speak ~SQL, But Run As MapReduce
  34. HUE - Browser-Based Environment (Image source: http://www.sentric.ch/wp-content/uploads/2013/01/Create-table-in-Hive.png)
  35. Hive Is Based On & Limited By Hadoop
  36. Apache Pig Makes Them Happier! (Image source: http://vetnolimits.files.wordpress.com/2012/02/pumba.jpg)
  37. Pig Accelerates Development
  38. Need To Add More Relational Data To HDFS (Based on the image from http://blog.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/)
  39. SQL To Hadoop = Sqoop (Image source: http://3.bp.blogspot.com/_uuOo8x3WXWE/SuNV4y7qzeI/AAAAAAAAkYM/6RUExOMQPno/s400/pumpkin_eating_elephant.jpg)
  40. Sqoop Import/Export Data Using MR (Image source: http://blog.cloudera.com/blog/2011/10/apache-sqoop-overview/)
  41. Apache Oozie For Defining Workflows (Image source: Apache Oozie website)
  42. Apache Oozie For Scheduling (Image source: http://risingtechies.files.wordpress.com/2012/05/schedule.jpg)
  43. Need To Add Even More Logs To HDFS (Based on the image from http://blog.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/)
  44. Apache Flume For Data Collection, e.g. JDBC, Memory, File (Image source: Apache Flume website)
  45. How To Manage A Larger Cluster
  46. Apache Avro + Snappy/Deflate_6 (Image source: http://www.funkydiva.pl/wp-content/uploads/2012/10/lego-tapety-na-pulpit-duze-zdjecia-16.jpg)
  47. When Latency Is Too High (Image source: http://www.pharmacyowners.com/Portals/37772/images/It-can-be-a-LONG-wait-at-the-pharmacy-resized-600.jpg)
  48. Cloudera Impala – Real-Time ~SQL Queries (Image source: http://static.cargurus.com/images/site/2010/07/02/12/24/1969_chevrolet_impala-pic-2868587530424686499.jpeg)
  49. Apache HBase - Random, Real-Time Access To Big Data (Image source: http://www.superhqwallpapers.com/wp-content/uploads/2012/01/Super-Ferrari.jpg)
  50. YARN – Hadoop Cluster More Robust (Image source: http://globeattractions.com/wp-content/uploads/2012/01/green-leaf-drops-green-hd-leaf-nature-wet.jpg)
  51. Hadoop Is Successfully Deployed (Image source: http://bogdankipko.com/wp-content/uploads/2012/03/lessons-learned.jpg)
  52. Learn More About Apache Hadoop?
  53. Use Hadoop To Solve Real-World Problems?
  54. Oozie And YARN At WHUG, Today @18:00
  55. Thank You! Any Questions About Them? (Image source: http://xn--gryprzegldarkowe-43b.com.pl/wp-content/uploads/2012/05/me-free-zoo1.jpg)
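
The slides titled Map And Reduce Functions, MapReduce Paradigm, and Artist Count Example name the idea, but their content did not survive in this transcript. What follows is only a rough sketch of how an artist count could look, simulated in plain Python on one machine rather than on Hadoop. The record layout (tab-separated user, artist, track) and every function name here are assumptions made for illustration, not taken from the deck.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_play(record: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (artist, 1) for one play record.

    Assumed (hypothetical) layout: user<TAB>artist<TAB>track.
    Records without an artist field are skipped.
    """
    fields = record.rstrip("\n").split("\t")
    if len(fields) >= 2 and fields[1]:
        yield fields[1], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(artist: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the ones emitted for a single artist."""
    return artist, sum(counts)


def artist_count(records: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input records."""
    pairs = (pair for record in records for pair in map_play(record))
    return dict(reduce_counts(a, c) for a, c in shuffle(pairs).items())


if __name__ == "__main__":
    plays = [
        "u1\tColdplay\tYellow",
        "u2\tOasis\tWonderwall",
        "u3\tColdplay\tClocks",
    ]
    print(artist_count(plays))
```

On a real cluster the map and reduce functions would run in parallel on the nodes that hold the data blocks, and the grouping done here by `shuffle` would be handled by the framework.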
