Hadoop Successes and Failures to Drive Deployment Evolution
 

    Presentation Transcript

    • Hadoop Hands On: Successes and failures to drive evolution. Benoit PERROUD, Software Engineer @Verisign & Apache Committer. GITI BigData, EPFL, November 6, 2012
    • Disclaimer • I apologize for speaking “Frenglish”. • The views and statements expressed in this talk do not necessarily reflect the views of VeriSign, Inc and any other person involved in the company do not warrant the accuracy, reliability, currency or completeness of those views or statements and do not accept any legal liability whatsoever arising from any reliance on the views, statements and subject matter of the talk. • Apache, Apache Hadoop, Hadoop, Cassandra, Apache Cassandra, Solr, Apache Solr, Hbase, Apache Hbase, Tomcat, Apache Tomcat, Zookeeper, Apache Zookeeper, Lucene, Apache Lucene and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. • Java, Glassfish and the Java logo are registered trademarks of Oracle and/or its affiliates. • Python and the Python logo are either registered trademarks or trademarks of the Python Software Foundation. • MongoDB, Mongo and the leaf logo are registered trademarks of 10gen, Inc. • All other marks are the property of their respective owners. Verisign Public
    • Let’s talk about Hadoop!
    • Hadoop 10k Feet View 1. MapReduce Processing Framework • Map → Combine → Shuffle → Reduce 2. Distributed File System (HDFS) Credit: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
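
The Map → Combine → Shuffle → Reduce chain named on this slide can be illustrated with a small in-memory word count. This is only a sketch of the data flow, not Hadoop code: the function names and the idea of passing "splits" as Python lists are assumptions made for illustration, and everything a real cluster does (distribution, sorting, spilling to disk) is left out.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combine_phase(pairs):
    """Combine: pre-aggregate counts locally, before anything leaves the mapper."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle_phase(mapper_outputs):
    """Shuffle: group the values coming from all mappers by key."""
    grouped = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single result."""
    return key, sum(values)


def word_count(splits):
    """Run the four phases over a list of input splits (each a list of lines)."""
    mapper_outputs = []
    for split in splits:
        pairs = [pair for line in split for pair in map_phase(line)]
        mapper_outputs.append(combine_phase(pairs))
    grouped = shuffle_phase(mapper_outputs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count([["hadoop hdfs", "hadoop yarn"], ["hdfs hadoop"]]))
```

Each inner list stands in for one input split handled by one mapper; the combiner is what keeps the first split from sending two separate pairs for "hadoop" across the shuffle.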
    • Your first Hadoop Deployment • Pseudo-distributed mode on a single node
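
As a rough idea of what pseudo-distributed mode involves, the sketch below renders the three site files with the property names used in the Hadoop 1.x single-node setup guide. That the talk used a 1.x release is an assumption (suggested by the TaskTracker terminology), the host and port values are that guide's example values, and `render_site_xml` is a hypothetical helper, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Assumed Hadoop 1.x property names; host and ports are example values only.
PSEUDO_DISTRIBUTED = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},  # single node, so one replica
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}


def render_site_xml(properties):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    for filename, properties in PSEUDO_DISTRIBUTED.items():
        print(filename)
        print(render_site_xml(properties))
```

All daemons then run on the same machine and talk to each other over localhost, which is why the replication factor is lowered to 1.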
    • Going Distributed • TaskTracker (TT) and DataNode (DN) are moved to a dedicated box
    • NameNode Single Point of Failure • NameNode crashes. Configuring the PNN and SNN NFS HA setup is not detailed here.
    • Bringing Data into the Cluster • Data could be internal to the company, but also external. Data Retrieval and Stream Ingestion are oversimplified.
    • Dealing with API Changes • Integration/Validation Cluster setup. The Validation Cluster is omitted in further slides for clarity.
    • Cluster Is Growing
    • Add Monitoring
    • Turn On Rack Awareness
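
Rack awareness is typically switched on by pointing Hadoop at a topology script (`topology.script.file.name` in Hadoop 1.x, `net.topology.script.file.name` in 2.x). Hadoop passes one or more host names or IP addresses as arguments and reads one rack path per argument from standard output. The script below is a minimal sketch of such a resolver; the subnet-to-rack mapping is entirely hypothetical and not taken from the talk.

```python
#!/usr/bin/env python3
import sys

# Hypothetical mapping from IP prefix to rack path; adapt to the real network.
RACKS = {
    "10.0.1.": "/dc1/rack1",
    "10.0.2.": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve_rack(host):
    """Return the rack path for one host, or the default rack if unknown."""
    for prefix, rack in RACKS.items():
        if host.startswith(prefix):
            return rack
    return DEFAULT_RACK


if __name__ == "__main__":
    # One rack path per argument, in the same order as the arguments.
    print(" ".join(resolve_rack(host) for host in sys.argv[1:]))
```

The trailing dot in each prefix matters: without it, `10.0.1` would also match hosts in `10.0.10.x`.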
    • Split the Cluster into Production and Research
    • Data Retrieval through REST End Point
    • Data Retrieval with Search Features
    • Data Retrieval: Add Cache
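
The slide does not say which cache was used or where it sits. As a generic illustration of the read-through idea only, the sketch below puts an in-process LRU cache in front of a slow lookup; `fetch_from_backend` and `get_record` are hypothetical names, and a real deployment would more likely rely on a shared cache service.

```python
from functools import lru_cache


def fetch_from_backend(key):
    """Hypothetical stand-in for a slow lookup against the cluster's data store."""
    return f"record-for-{key}"


@lru_cache(maxsize=1024)
def get_record(key):
    """Read-through cache: only cache misses reach the backend."""
    return fetch_from_backend(key)


if __name__ == "__main__":
    get_record("example.com")   # miss: goes to the backend
    get_record("example.com")   # hit: served from memory
    print(get_record.cache_info())
```

An in-process cache like this never notices when the underlying data changes, so it only suits data that is immutable or tolerates staleness.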
    • Data Visualization Tools
    • Upstream Updates Channel
    • Realtime Updates
    • Future Evolutions • Hadoop Next Gen • YARN (2.0) • Graph processing • Neo4j • Google Pregel / Apache Hama • Incremental Updates • Real time ad hoc queries • Cloudera Impala / Google Dremel
    • Conclusion • Hadoop has gained huge momentum • Technologies (around Hadoop) are evolving really fast • There is no “One size fits all” solution • Design is heavily driven by customer needs • Data quality is a hidden requirement
    • Conclusion #2 • Data Scientists cost a lot • Running on commodity hardware still costs a lot • No one fully understands the whole data flow • And you need several FTEs just to track the architecture • You have a high risk of misusing this software • Hiring engineers with deep knowledge (meaning: hands-on experience) of some of this software is already a challenge
    • Recommended Reading: Hadoop in Practice by Alex Holmes, Senior Software Engineer @Verisign
    • Q&A Benoit PERROUD bperroud@verisign.com
    • Thank You © 2012 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United States and in foreign countries. All other trademarks are property of their respective owners.