14.05.2012 Opening the tool box: Development, testing and deployment in the Hadoop ecosystem (Jean-Pierre König, MeMo News AG)

Presentation Transcript

  • Jean-Pierre König, MeMo News AG
    OPENING THE TOOL BOX: DEVELOPMENT, TESTING AND DEPLOYMENT IN THE HADOOP ECOSYSTEM
    14.05.12
    Image: http://www.flickr.com/photos/theaucitron/5810163712/sizes/l/in/photostream/
  • Development: THE APPLICATION
    Image: http://www.flickr.com/photos/oskay/2523189273/sizes/l/in/photostream/
  • Development: The application is a ...
    • Distributed newsagent
    • GUI-less Java application
    • Spring-based 2-layer architecture
    • Services and data access objects
    • Client of Hadoop
    • Dependencies on ZooKeeper and HBase
  • Development (2): We use Maven 3 for
    • Project structure: corporate POM and modules
    • Dependency management
    • Building the artifact
    Diagram labels: Corporate POM; global; newsagent; tools; mapred; Loader (Client); Infrastructure; Model; Utils; Services; Data Access Objects
  • Development: MAPREDUCE JOBS
    Image: http://www.flickr.com/photos/elasticsoul/61062372/sizes/l/in/photostream/
  • MapReduce
    • Java MR jobs for business processes
    • Input and output paths are either HDFS or HBase
    • MR jobs are chained with Azkaban
    • Pig and Hive for ad-hoc queries
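The job chaining mentioned on this slide can be pictured as running jobs strictly in order and stopping at the first failure. The following is a minimal plain-Java sketch of that idea only. It does not use Azkaban or Hadoop APIs, and the class `JobChain` and the job names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical illustration of sequential job chaining; not Azkaban's API. */
class JobChain {
    // Insertion-ordered map: job name -> action that returns true on success.
    private final Map<String, BooleanSupplier> steps = new LinkedHashMap<>();

    public JobChain then(String name, BooleanSupplier job) {
        steps.put(name, job);
        return this;
    }

    /** Runs the jobs in order and returns the names of the jobs that succeeded. */
    public List<String> run() {
        List<String> completed = new ArrayList<>();
        for (Map.Entry<String, BooleanSupplier> step : steps.entrySet()) {
            if (!step.getValue().getAsBoolean()) {
                break; // a failed job stops the chain; later jobs never start
            }
            completed.add(step.getKey());
        }
        return completed;
    }

    public static void main(String[] args) {
        List<String> done = new JobChain()
                .then("import", () -> true)
                .then("aggregate", () -> false) // simulated failure
                .then("export", () -> true)
                .run();
        System.out.println(done); // prints [import]
    }
}
```

A real scheduler additionally handles retries, parallel branches and time-based triggers; the sketch covers only the ordering and fail-fast behaviour.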
  • Development: HBASE
    Image: http://www.flickr.com/photos/isherwoodchris/6902155937/sizes/l/in/photostream/
  • HBase
    • HBase Schema Manager
      • github.com/jkoenig/hbase-schema-manager
    • Utilities to copy/move/rename column families and to copy complete tables with their data
      • github.com/memonews/hbase-utils
    • Stargate REST API without compression
      • github.com/memonews/hbase-stargate
  • Hadoop, HBase, ZooKeeper: TESTING
    Image: http://www.flickr.com/photos/42106306@N00/4380803535/sizes/m/in/photostream/
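  • HBase
    • We use the Apache HBaseTestingUtility
    • It runs in-memory: a complete Hadoop instance with DFS, ZK and HBase
    • It is very slow; consider it a long-running integration test (IT)

      import static org.junit.Assert.fail;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.HBaseTestingUtility;

      // HBaseTestingUtilityFactory is not part of HBase itself; presumably a
      // project helper whose import is not shown on the slide.
      public class ConfigurableHBaseClient {
          // Shared mini cluster, started once when the class is loaded.
          protected static HBaseTestingUtility TEST_UTIL;

          static {
              final Configuration conf = HBaseConfiguration.create();
              conf.addResource("hbase-default-test.xml");
              try {
                  TEST_UTIL = HBaseTestingUtilityFactory.getMiniCluster(1, conf);
              } catch (final Exception e) {
                  fail("Could not start hadoop mini cluster.");
              }
          }
      }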
  • MapReduce
    • Since business logic is involved, we use hadoop-mrunit for testing Map/Reduce jobs
    • It is in-memory testing
    • Parameterized Mapper/Reducer with a driver

      @Test
      public void reduceShouldWriteExactlyOneLinePerMap() throws IOException {
          // One input key with a single value.
          final List<DoubleWritable> values = new ArrayList<DoubleWritable>();
          values.add(new DoubleWritable(399287729));
          this.driver.withInput(new Text("de.t-online/nachrichten"), values);
          this.driver.run();
          // The reducer must have counted exactly one written signal.
          assertEquals(1, this.driver.getCounters().findCounter(MeMoCounters.SIGNALS_WRITTEN).getValue());
      }
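To illustrate what "in-memory testing with a driver" amounts to, here is a plain-Java stand-in: the driver hands one key and its values directly to the reducer and records output and counters, with no cluster involved. This is a sketch of the concept, not the MRUnit API. `InMemoryReduceDriver`, its `Reducer` interface and the example reducer are hypothetical, and only the key and counter name are borrowed from the slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java stand-in for an in-memory reducer test driver; not the MRUnit API. */
class InMemoryReduceDriver {
    /** Hypothetical reducer contract: consume one key together with all its values. */
    interface Reducer {
        void reduce(String key, List<Double> values, List<String> output, Map<String, Long> counters);
    }

    private final Reducer reducer;
    private final Map<String, Long> counters = new HashMap<>();
    private String key;
    private List<Double> values;

    public InMemoryReduceDriver(Reducer reducer) {
        this.reducer = reducer;
    }

    public InMemoryReduceDriver withInput(String key, List<Double> values) {
        this.key = key;
        this.values = values;
        return this;
    }

    /** Invokes the reducer directly, without any cluster, and returns its output lines. */
    public List<String> run() {
        List<String> output = new ArrayList<>();
        reducer.reduce(key, values, output, counters);
        return output;
    }

    /** Returns the counter value, or 0 if the reducer never touched it. */
    public long counter(String name) {
        return counters.getOrDefault(name, 0L);
    }

    public static void main(String[] args) {
        // Hypothetical reducer: writes one line per key and counts it.
        Reducer oneLinePerKey = (k, vs, out, ctr) -> {
            out.add(k + "\t" + vs.size());
            ctr.merge("SIGNALS_WRITTEN", 1L, Long::sum);
        };
        InMemoryReduceDriver driver = new InMemoryReduceDriver(oneLinePerKey)
                .withInput("de.t-online/nachrichten", List.of(399287729.0));
        System.out.println(driver.run().size());               // prints 1
        System.out.println(driver.counter("SIGNALS_WRITTEN")); // prints 1
    }
}
```

Because the reducer is called as an ordinary method, such a test runs in milliseconds, which is the main contrast with the mini-cluster approach used for HBase.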
  • ZooKeeper
    • We use the Apache ZooKeeper ClientBase
    • It does not run in-memory but against the staging cluster
    • Paths are prefixed, e.g.: /test/memo/subscribers

      @Test
      public void getNumberOfSubscribersShouldSetWatchFlag() throws KeeperException, InterruptedException {
          final SubscriberDaoImpl subscriberDao = new SubscriberDaoImpl(zookeeperDao, DIR, null);
          subscriberDao.getNumberOfSubscribers(listener);
          // The DAO must register itself as the watcher when listing children.
          verify(this.zookeeper, times(1)).getChildren(eq(DIR), eq(subscriberDao));
      }
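Since these tests run against a shared staging cluster, prefixing every znode path keeps test data apart from real data. The helper below is a minimal sketch of such prefixing in plain Java. `ZnodePaths` is a hypothetical name, and the unprefixed path `/memo/subscribers` is an assumption inferred from the slide's example `/test/memo/subscribers`.

```java
/** Hypothetical helper that keeps test znodes under their own prefix. */
class ZnodePaths {
    private final String prefix;

    public ZnodePaths(String prefix) {
        // Normalize: "" for no prefix, otherwise a leading slash and no trailing slash.
        String p = prefix == null ? "" : prefix.trim();
        while (p.endsWith("/")) {
            p = p.substring(0, p.length() - 1);
        }
        if (!p.isEmpty() && !p.startsWith("/")) {
            p = "/" + p;
        }
        this.prefix = p;
    }

    /** Returns the absolute znode path for the given path under this prefix. */
    public String resolve(String path) {
        String absolute = path.startsWith("/") ? path : "/" + path;
        return prefix + absolute;
    }

    public static void main(String[] args) {
        ZnodePaths test = new ZnodePaths("/test");
        ZnodePaths unprefixed = new ZnodePaths("");
        System.out.println(test.resolve("/memo/subscribers"));       // prints /test/memo/subscribers
        System.out.println(unprefixed.resolve("/memo/subscribers")); // prints /memo/subscribers
    }
}
```

Injecting the prefix through configuration lets the same data access code run unchanged in tests and elsewhere.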
  • Deployment: THE APPLICATION
    Image: http://www.flickr.com/photos/navalsurfaceforces/5553412190/sizes/l/in/photostream/
  • The Application
    • Automated build and restart via Capistrano
    • Build on every machine
    • There is a .m2 repository everywhere

      set :deploy_to, "/usr/share/memo-newsagent"
      set :keep_releases, 1

      # Create runtime and log directories once, during setup.
      after "deploy:setup" do
        run "mkdir -p /var/run/memo #{shared_path}/logs /var/log/memo/"
        ...
      end

      # Build the checked-out release with Maven on the target machine.
      after "deploy:update_code" do
        run "cd #{current_release} && mvn install -Pfast > #{shared_path}/logs/build.log"
      end

      after "deploy", "rowlog:stop", "newsagent:restart", "rowlog:start"
  • Deployment: MAPREDUCE JOBS
    Image: http://www.flickr.com/photos/navalsurfaceforces/6257239933/sizes/l/in/photostream/
  • MapReduce Jobs
    • We use a Maven Hadoop plugin
      • hadoop:pack, à la mvn package
      • hadoop:deploy: to HDFS and the target folder
    • All dependencies are packed in. Careful: huge JARs without dependency management
    • See github.com/memonews/maven-hadoop
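To make "all dependencies packed in" concrete, the sketch below builds a job JAR in memory with the job's own classes at the root and each dependency JAR under `lib/`. It assumes the common Hadoop convention that JARs under `lib/` inside a job JAR are put on the task classpath; whether the plugin on the slide uses exactly this layout is an assumption. `FatJobJar` and the entry names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;

/** Hypothetical sketch of a self-contained job JAR layout (assumed lib/ convention). */
class FatJobJar {
    /** Packs job classes at the JAR root and every dependency JAR under lib/. */
    public static byte[] pack(Map<String, byte[]> classes, Map<String, byte[]> dependencyJars) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> e : classes.entrySet()) {
                write(jar, e.getKey(), e.getValue());
            }
            for (Map.Entry<String, byte[]> e : dependencyJars.entrySet()) {
                write(jar, "lib/" + e.getKey(), e.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    private static void write(JarOutputStream jar, String name, byte[] content) throws IOException {
        jar.putNextEntry(new JarEntry(name));
        jar.write(content);
        jar.closeEntry();
    }

    /** Lists the entry names of a JAR in the order they were written. */
    public static List<String> entryNames(byte[] jarBytes) {
        List<String> names = new ArrayList<>();
        try (JarInputStream in = new JarInputStream(new ByteArrayInputStream(jarBytes))) {
            for (JarEntry e = in.getNextJarEntry(); e != null; e = in.getNextJarEntry()) {
                names.add(e.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Placeholder bytes stand in for real class files and dependency JARs.
        byte[] jar = pack(
                Map.of("com/example/SignalJob.class", new byte[] {1}),
                Map.of("some-dependency.jar", new byte[] {2}));
        System.out.println(entryNames(jar));
    }
}
```

The slide's warning follows directly from this layout: every dependency is copied into every job JAR, so the artifacts grow quickly and duplicate or conflicting library versions are not resolved for you.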
  • DevOps: OTHER TOOLS IN USE
    Image: http://www.flickr.com/photos/damongman/4979871047/sizes/l/in/photostream/
  • Other Tools
    • Staging environment in-house, a 1-to-1 copy of production (virtualized)
    • Azkaban for MR job scheduling
    • Jenkins for (integration) tests and metrics
    • Git
    • Icinga for monitoring and alerting
    • Ganglia / Graphite for Hadoop metrics
    • Fliwi for automated cluster provisioning
  • THANKS!
    jean-pierre.koenig@menonews.com