14.05.2012 Opening the tool box: Development, testing and deployment in the Hadoop ecosystem (Jean-Pierre König, MeMo News AG)

  • 1. OPENING THE TOOL BOX: DEVELOPMENT, TESTING AND DEPLOYMENT IN THE HADOOP ECOSYSTEM. Jean-Pierre König, MeMo News AG, 14.05.12 (photo: http://www.flickr.com/photos/theaucitron/5810163712/sizes/l/in/photostream/)
  • 2. Development: THE APPLICATION (photo: http://www.flickr.com/photos/oskay/2523189273/sizes/l/in/photostream/)
  • 3. Development: The Application is a ...
      • Distributed newsagent
      • GUI-less Java application
      • Spring-based 2-layer architecture
      • Services and data access objects
      • Client of Hadoop
      • Dependencies on ZooKeeper and HBase
  • 4. Development (2): We use Maven 3 for
      • Project structure: corporate POM and modules
      • Dependency management
      • Building the artifact
      [Diagram: Corporate POM, global, newsagent, tools, mapred, Loader (Client), Infrastructure, Model, Utils, Services, Data Access Objects]
  • 5. Development: MAPREDUCE JOBS (photo: http://www.flickr.com/photos/elasticsoul/61062372/sizes/l/in/photostream/)
  • 6. MapReduce
      • Java MR jobs for business processes
      • Input and output paths are either HDFS or HBase
      • MR job chaining by Azkaban
      • Pig and Hive for ad-hoc queries
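The job chaining mentioned on this slide comes down to running each job only after the jobs it depends on have finished. The following is a minimal plain-Java sketch of that ordering idea, not Azkaban's API or configuration format. The class `JobChain` and the job names in the usage note are hypothetical and do not come from the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: orders jobs so that each runs after its dependencies. */
final class JobChain {
    // Job name -> names of the jobs that must finish first (insertion-ordered).
    private final Map<String, List<String>> dependencies = new LinkedHashMap<>();

    /** Registers a job together with the jobs it depends on. */
    JobChain add(String job, String... dependsOn) {
        dependencies.put(job, Arrays.asList(dependsOn));
        return this;
    }

    /** Returns one valid execution order (depth-first topological sort). */
    List<String> executionOrder() {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String job : dependencies.keySet()) {
            visit(job, done, inProgress, order);
        }
        return order;
    }

    private void visit(String job, Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(job)) {
            return; // already scheduled
        }
        if (!dependencies.containsKey(job)) {
            throw new IllegalArgumentException("Unknown job: " + job);
        }
        if (!inProgress.add(job)) {
            throw new IllegalStateException("Dependency cycle at: " + job);
        }
        for (String dependency : dependencies.get(job)) {
            visit(dependency, done, inProgress, order);
        }
        inProgress.remove(job);
        done.add(job);
        order.add(job); // all dependencies are already in the list
    }
}
```

Usage, with made-up job names: `new JobChain().add("export-to-hbase", "aggregate-signals").add("aggregate-signals", "extract-articles").add("extract-articles").executionOrder()` lists `extract-articles` first and `export-to-hbase` last. A scheduler additionally handles retries, timing and failure reporting, which this sketch leaves out.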
  • 7. Development: HBASE (photo: http://www.flickr.com/photos/isherwoodchris/6902155937/sizes/l/in/photostream/)
  • 8. HBase
      • HBase Schema Manager: github.com/jkoenig/hbase-schema-manager
      • Utilities to copy/move/rename column families and to copy complete tables with their data: github.com/memonews/hbase-utils
      • Stargate REST API without compression: github.com/memonews/hbase-stargate
  • 9. Hadoop, HBase, ZooKeeper: TESTING (photo: http://www.flickr.com/photos/42106306@N00/4380803535/sizes/m/in/photostream/)
  • 10. HBase
      • We use the Apache HBaseTestingUtility
      • It's in-memory: a complete Hadoop instance with DFS, ZK and HBase
      • It's very slow: consider these long-running integration tests (IT)

        public class ConfigurableHBaseClient {

            protected static HBaseTestingUtility TEST_UTIL;

            static {
                final Configuration conf = HBaseConfiguration.create();
                conf.addResource("hbase-default-test.xml");
                try {
                    TEST_UTIL = HBaseTestingUtilityFactory.getMiniCluster(1, conf);
                } catch (final Exception e) {
                    fail("Could not start hadoop mini cluster.");
                }
            }
        }
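Because the mini cluster is slow to start, the slide keeps it in a static field so that it is started once and then reused. The same idea can be sketched without any Hadoop dependency. `SharedFixture` below is a hypothetical helper, not part of HBase or of the talk, and the "cluster" in the usage note is just a placeholder string.

```java
import java.util.function.Supplier;

/** Hypothetical helper: starts an expensive test resource once and shares it afterwards. */
final class SharedFixture<T> {
    private final Supplier<T> starter;
    private T instance;
    private boolean started;

    SharedFixture(Supplier<T> starter) {
        this.starter = starter;
    }

    /**
     * Returns the shared resource, starting it on first use.
     * If the starter throws, nothing is cached and the next call tries again.
     */
    synchronized T get() {
        if (!started) {
            instance = starter.get();
            started = true;
        }
        return instance;
    }
}
```

Usage, under the same assumptions: a test base class could hold `static final SharedFixture<String> CLUSTER = new SharedFixture<>(() -> "mini-cluster");` and call `CLUSTER.get()` from every test. Unlike a static initializer, the start-up cost is only paid when a test actually asks for the resource.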
  • 11. MapReduce
      • Since business logic is involved, we use hadoop-mrunit for testing Map/Reduce jobs
      • It's in-memory testing
      • Parameterized Mapper/Reducer with a driver

        @Test
        public void reduceShouldWriteExactlyOneLinePerMap() throws IOException {
            final List<DoubleWritable> values = new ArrayList<DoubleWritable>();
            values.add(new DoubleWritable(399287729));
            this.driver.withInput(new Text("de.t-online/nachrichten"), values);
            this.driver.run();
            assertEquals(1, this.driver.getCounters().findCounter(MeMoCounters.SIGNALS_WRITTEN).getValue());
        }
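The driver pattern on this slide (hand the reducer one input, run it in memory, then assert on output and counters) can be illustrated without Hadoop on the classpath. The sketch below is a plain-Java stand-in, not the MRUnit API: `SignalReducer`, `InMemoryReduceDriver` and `SummingReducer` are hypothetical names, and summing the values is an assumed piece of business logic chosen only to have something to test.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reducer contract: one key with its values in, lines and counters out. */
interface SignalReducer {
    void reduce(String key, List<Double> values, List<String> output, Map<String, Long> counters);
}

/** Minimal in-memory driver: feeds one input to a reducer and records what it did. */
final class InMemoryReduceDriver {
    private final SignalReducer reducer;
    private final List<String> output = new ArrayList<>();
    private final Map<String, Long> counters = new HashMap<>();
    private String key;
    private List<Double> values;

    InMemoryReduceDriver(SignalReducer reducer) {
        this.reducer = reducer;
    }

    /** Sets the single key and its values that the next run() passes to the reducer. */
    InMemoryReduceDriver withInput(String key, List<Double> values) {
        this.key = key;
        this.values = values;
        return this;
    }

    /** Runs the reducer once and returns the lines it wrote. */
    List<String> run() {
        reducer.reduce(key, values, output, counters);
        return output;
    }

    /** Returns a counter value, or 0 if the reducer never touched it. */
    long counter(String name) {
        return counters.getOrDefault(name, 0L);
    }
}

/** Example reducer (assumed logic): writes one tab-separated line per key with the sum. */
final class SummingReducer implements SignalReducer {
    @Override
    public void reduce(String key, List<Double> values, List<String> output, Map<String, Long> counters) {
        double sum = 0;
        for (double value : values) {
            sum += value;
        }
        output.add(key + "\t" + sum);
        counters.merge("SIGNALS_WRITTEN", 1L, Long::sum);
    }
}
```

A test then reads much like the one on the slide: create the driver around the reducer, call `withInput(...)`, call `run()`, and assert that exactly one line was written and that the `SIGNALS_WRITTEN` counter equals 1.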
  • 12. ZooKeeper
      • We use the Apache ZooKeeper ClientBase
      • It's not in-memory but runs against the staging cluster
      • Prefix paths, e.g.: /test/memo/subscribers

        @Test
        public void getNumberOfSubscribersShouldSetWatchFlag() throws KeeperException, InterruptedException {
            final SubscriberDaoImpl subscriberDao = new SubscriberDaoImpl(zookeeperDao, DIR, null);
            subscriberDao.getNumberOfSubscribers(listener);
            verify(this.zookeeper, times(1)).getChildren(eq(DIR), eq(subscriberDao));
        }
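Since these tests run against a shared staging cluster, prefixing every znode path keeps test data apart from real data. A minimal sketch of such a prefixing helper follows. `TestPaths` is a hypothetical class, and reading the slide's example path as the prefix `/test` plus `/memo/subscribers` is an assumption.

```java
/** Hypothetical helper: maps application znode paths into a test-only subtree. */
final class TestPaths {
    private final String prefix;

    /** @param prefix absolute path without a trailing slash, e.g. "/test" */
    TestPaths(String prefix) {
        if (!prefix.startsWith("/") || prefix.length() < 2 || prefix.endsWith("/")) {
            throw new IllegalArgumentException("Prefix must look like /name: " + prefix);
        }
        this.prefix = prefix;
    }

    /** Returns the given absolute path relocated under the test prefix. */
    String resolve(String path) {
        if (!path.startsWith("/")) {
            throw new IllegalArgumentException("Path must be absolute: " + path);
        }
        // The root maps to the prefix itself so the result never ends in a slash.
        return "/".equals(path) ? prefix : prefix + path;
    }
}
```

Usage: `new TestPaths("/test").resolve("/memo/subscribers")` yields `/test/memo/subscribers`. ZooKeeper clients also accept a chroot suffix in the connect string, which gives a similar separation without touching the paths in code. Whether the team used that is not stated in the talk.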
  • 13. Deployment: THE APPLICATION (photo: http://www.flickr.com/photos/navalsurfaceforces/5553412190/sizes/l/in/photostream/)
  • 14. The Application
      • Automated build and restart via Capistrano
      • Build on every machine
      • There is a .m2 repository everywhere

        set :deploy_to, "/usr/share/memo-newsagent"
        set :keep_releases, 1

        after "deploy:setup" do
          run "mkdir -p /var/run/memo #{shared_path}/logs /var/log/memo/"
          ...
        end

        after "deploy:update_code" do
          run "cd #{current_release} && mvn install -Pfast > #{shared_path}/logs/build.log"
        end

        after "deploy", "rowlog:stop", "newsagent:restart", "rowlog:start"
  • 15. Deployment: MAPREDUCE JOBS (photo: http://www.flickr.com/photos/navalsurfaceforces/6257239933/sizes/l/in/photostream/)
  • 16. MapReduce Jobs
      • We use a Maven Hadoop plugin
          • hadoop:pack, a la mvn:package
          • hadoop:deploy: HDFS and target folder
      • All dependencies are packed in. Careful: huge JARs without dependency management
      • See github.com/memonews/maven-hadoop
  • 17. DevOps: OTHER TOOLS IN USE (photo: http://www.flickr.com/photos/damongman/4979871047/sizes/l/in/photostream/)
  • 18. Other Tools
      • Staging environment in-house, a 1-to-1 copy of production (virtualized)
      • Azkaban for MR job scheduling
      • Jenkins for (integration) tests and metrics
      • Git
      • Icinga for monitoring and alerting
      • Ganglia / Graphite for Hadoop metrics
      • Fliwi for automated cluster provisioning
  • 19. THANKS! jean-pierre.koenig@menonews.com