Hadoop Successes and Failures to Drive Deployment Evolution


Published in: Technology

  1. Hadoop Hands On: Successes and failures to drive evolution. Benoit PERROUD, Software Engineer @Verisign & Apache Committer. GITI BigData, EPFL, November 6, 2012
  2. Disclaimer
     • I apologize for speaking “Frenglish”
     • The views and statements expressed in this talk do not necessarily reflect the views of VeriSign, Inc. The company and any other person involved in it do not warrant the accuracy, reliability, currency or completeness of those views or statements and do not accept any legal liability whatsoever arising from any reliance on the views, statements and subject matter of the talk.
     • Apache, Apache Hadoop, Hadoop, Cassandra, Apache Cassandra, Solr, Apache Solr, HBase, Apache HBase, Tomcat, Apache Tomcat, ZooKeeper, Apache ZooKeeper, Lucene, Apache Lucene and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
     • Java, Glassfish and the Java logo are registered trademarks of Oracle and/or its affiliates.
     • Python and the Python logo are either registered trademarks or trademarks of the Python Software Foundation.
     • MongoDB, Mongo and the leaf logo are registered trademarks of 10gen, Inc.
     • All other marks are the property of their respective owners.
     Verisign Public
  3. Let’s talk about Hadoop!
  4. Hadoop 10k Feet View
     1. MapReduce Processing Framework
        • Map → Combine → Shuffle → Reduce
     2. Distributed File System (HDFS)
     Credit: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
  5. Your first Hadoop Deployment
     • Pseudo-distributed mode on a single node
  6. Going Distributed
     • TaskTracker (TT) and DataNode (DN) are moved to a dedicated box
  7. NameNode Single Point of Failure
     • NameNode crashes. Configuring PNN and SNN
     NFS HA setup is not detailed here.
  8. Bringing Data into the Cluster
     • Data could be internal to the company, but also external.
     Data Retrieval and Stream Ingestion are oversimplified.
  9. Dealing with API Changes
     • Integration/Validation Cluster setup
     The Validation Cluster is omitted from further slides for clarity.
  10. Cluster Is Growing
  11. Add Monitoring
  12. Turn On Rack Awareness
  13. Split the Cluster to Production and Research
  14. Data Retrieval through REST End Point
  15. Data Retrieval with Search Features
  16. Data Retrieval: Add Cache
  17. Data Visualization Tools
  18. Upstream Updates Channel
  19. Realtime Updates
  20. Future Evolutions
      • Hadoop Next Gen
        • YARN (2.0)
      • Graph processing
        • Neo4J
        • Google Pregel / Apache Hama
      • Incremental Updates
      • Real time ad hoc queries
        • Cloudera Impala / Google Dremel
  21. Conclusion
      • Hadoop has gained huge momentum
      • Technologies (around Hadoop) are evolving really fast
      • There is no “One size fits all” solution
      • Design is heavily driven by customer needs
      • Data quality is a hidden requirement
  22. Conclusion #2
      • Data Scientists cost a lot
      • Running on commodity hardware still costs a lot
      • No one has the full understanding of the full data flow
      • And you need several FTEs just to track the architecture
      • You have a high risk of misusing this software
      • Hiring engineers with deep knowledge (meaning: hands-on experience) of some of this software is already a challenge
  23. Recommended Reading: Hadoop in Practice, by Alex Holmes, Senior Software Engineer @Verisign
  24. Q&A. Benoit PERROUD, bperroud@verisign.com
  25. Thank You. © 2012 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United States and in foreign countries. All other trademarks are property of their respective owners.
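
Slide 4 names the MapReduce stages (Map, Combine, Shuffle, Reduce) without showing how they fit together. The following is a minimal single-process sketch of that flow using word count as the job. It is not Hadoop code: the function names (`map_phase`, `combine_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the idea of passing input splits as lists of lines are assumptions made purely for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combine_phase(pairs):
    """Combine: pre-aggregate counts locally, before anything is shuffled."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())


def shuffle_phase(mapper_outputs):
    """Shuffle: group the values of each key across all mapper outputs."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single result."""
    return key, sum(values)


def word_count(splits):
    """Run the four stages over a list of input splits (lists of lines)."""
    mapper_outputs = []
    for split in splits:  # each split stands in for one map task
        pairs = []
        for line in split:
            pairs.extend(map_phase(line))
        mapper_outputs.append(combine_phase(pairs))
    grouped = shuffle_phase(mapper_outputs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count([["the quick fox", "the dog"], ["the end"]]))
```

In this sketch the combiner shrinks each mapper's output before the shuffle, which is the reason the stage exists: less data has to cross the network between the map and reduce sides.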

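
Slide 12 mentions turning on rack awareness but does not show how. Hadoop resolves racks by calling an external script (configured through `topology.script.file.name` in Hadoop 1.x, `net.topology.script.file.name` in 2.x) with one or more node addresses as arguments and reading one rack path per address from its output. The sketch below shows what such a script might look like; the subnet-to-rack table and the rack names are hypothetical and do not come from the talk.

```python
import sys

# Hypothetical mapping from the first three octets of a node's IP address
# to a rack path. A real deployment would load this from its own inventory.
RACKS = {
    "10.0.1": "/dc1/rack1",
    "10.0.2": "/dc1/rack2",
}

# Rack path Hadoop itself uses for nodes it cannot place.
DEFAULT_RACK = "/default-rack"


def resolve_rack(address):
    """Return the rack path for one IP address, or the default rack."""
    prefix = address.rsplit(".", 1)[0]  # drop the last octet
    return RACKS.get(prefix, DEFAULT_RACK)


def main(argv):
    """Print one rack path per address passed on the command line."""
    for address in argv:
        print(resolve_rack(address))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved as an executable file, the script could be tried by hand with a couple of addresses as arguments before pointing the cluster configuration at it. Unknown addresses fall back to the default rack rather than failing, so a node missing from the table is still usable.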