14.05.2012 Opening the tool box: Development, testing and deployment in the Hadoop ecosystem (Jean-Pierre König, MeMo News AG)

Jean-Pierre König, MeMo News AG

OPENING THE TOOL BOX
DEVELOPMENT, TESTING AND DEPLOYMENT IN THE HADOOP
ECOSYSTEM

14.05.12

http://www.flickr.com/photos/theaucitron/5810163712/sizes/l/in/photostream/

Development

THE APPLICATION

http://www.flickr.com/photos/oskay/2523189273/sizes/l/in/photostream/

Development

The Application is a ...
• Distributed newsagent
• GUI-less Java Application
• Spring-based 2-layer architecture
• Services and data access objects
• Client of Hadoop
• Dependencies to Zookeeper and HBase
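
The two-layer split between services and data access objects could look roughly like the sketch below. It is not taken from the MeMo News code base: `SubscriptionDao`, `InMemorySubscriptionDao`, `SubscriptionService` and the keyword normalisation rule are hypothetical names chosen for illustration, plain constructor injection stands in for Spring wiring, and an in-memory map stands in for HBase or ZooKeeper.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Data access layer: hides where subscriptions are stored.
interface SubscriptionDao {
    void save(String subscriberId, String keyword);
    List<String> findKeywords(String subscriberId);
}

// In-memory stand-in for a real HBase- or ZooKeeper-backed DAO (assumption).
final class InMemorySubscriptionDao implements SubscriptionDao {
    private final Map<String, List<String>> store = new HashMap<>();

    @Override
    public void save(String subscriberId, String keyword) {
        store.computeIfAbsent(subscriberId, k -> new ArrayList<>()).add(keyword);
    }

    @Override
    public List<String> findKeywords(String subscriberId) {
        // Return a copy so callers cannot modify the stored list.
        return new ArrayList<>(store.getOrDefault(subscriberId, new ArrayList<>()));
    }
}

// Service layer: business rules only; the DAO is injected via the constructor.
final class SubscriptionService {
    private final SubscriptionDao dao;

    SubscriptionService(SubscriptionDao dao) {
        this.dao = dao;
    }

    // Hypothetical rule: keywords are trimmed and lower-cased, blanks are rejected.
    boolean subscribe(String subscriberId, String keyword) {
        String normalized = keyword == null ? "" : keyword.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return false;
        }
        dao.save(subscriberId, normalized);
        return true;
    }

    List<String> keywordsOf(String subscriberId) {
        return dao.findKeywords(subscriberId);
    }
}
```

Because the service only sees the DAO interface, a unit test can pass in the in-memory implementation and leave cluster access to the slower integration tests.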

Development (2)

We use Maven 3 for
• Project structure - Corporate POM & Modules
• Dependency Management
• Build the artifact

[Diagram: Corporate POM with modules global, newsagent, tools, mapred; components shown: Loader (Client), Infrastructure, Model, Utils, Services, Data Access Objects]

Development

MAPREDUCE JOBS

http://www.flickr.com/photos/elasticsoul/61062372/sizes/l/in/photostream/

MapReduce

• Java MR jobs for business processes
• Input and output paths either HDFS or HBase
• MR job chaining by Azkaban
• PIG, HIVE for ad-hoc queries

Development

HBASE

http://www.flickr.com/photos/isherwoodchris/6902155937/sizes/l/in/photostream/

HBase

• HBase Schema Manager
• github.com/jkoenig/hbase-schema-manager
• Utilities to copy/move/rename column families
and copy complete tables with their data
• github.com/memonews/hbase-utils
• Stargate REST API without compression
• github.com/memonews/hbase-stargate

Hadoop, HBase, Zookeeper

TESTING

http://www.flickr.com/photos/42106306@N00/4380803535/sizes/m/in/photostream/

HBase

• We use the Apache HBaseTestingUtility
• It's in-memory: a complete Hadoop instance
with DFS, ZooKeeper and HBase
• It's very slow: consider these long-running integration tests (IT)
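import static org.junit.Assert.fail;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HBaseTestingUtility;

public class ConfigurableHBaseClient {
    // Shared mini cluster, started once when the class is loaded because startup is slow.
    protected static HBaseTestingUtility TEST_UTIL;
    static {
        final Configuration conf = HBaseConfiguration.create();
        conf.addResource("hbase-default-test.xml");
        try {
            // HBaseTestingUtilityFactory is presumably a project helper, not an HBase class.
            TEST_UTIL = HBaseTestingUtilityFactory.getMiniCluster(1, conf);
        } catch (final Exception e) {
            fail("Could not start hadoop mini cluster.");
        }
    }
}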

MapReduce

• Since business logic is involved, we use
hadoop-mrunit for testing Map/Reduce jobs
• It’s in-memory testing
• Parameterized Mapper/Reducer with a driver

@Test
public void reduceShouldWriteExactlyOneLinePerMap() throws IOException {
    final List<DoubleWritable> values = new ArrayList<DoubleWritable>();
    values.add(new DoubleWritable(399287729));
    this.driver.withInput(new Text("de.t-online/nachrichten"), values);
    this.driver.run();
    assertEquals(1, this.driver.getCounters().findCounter(
            MeMoCounters.SIGNALS_WRITTEN).getValue());
}
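
The driver idea (feed input to a reducer, run it in memory, then inspect output and counters) can be sketched without any Hadoop dependency. The block below is not MRUnit and not the project's reducer: `SimpleReducer`, `InMemoryReduceDriver` and `SumReducer` are hypothetical stand-ins, and only the counter name `SIGNALS_WRITTEN` and the sample key are borrowed from the test above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Hadoop reducer: one key, many values, emits lines.
interface SimpleReducer {
    void reduce(String key, List<Double> values, List<String> output, Map<String, Long> counters);
}

// Tiny driver in the spirit of a reduce driver: collect input, run, inspect results.
final class InMemoryReduceDriver {
    private final SimpleReducer reducer;
    private final Map<String, List<Double>> input = new LinkedHashMap<>();
    private final List<String> output = new ArrayList<>();
    private final Map<String, Long> counters = new HashMap<>();

    InMemoryReduceDriver(SimpleReducer reducer) {
        this.reducer = reducer;
    }

    // Registers one key with its grouped values; returns this for chaining.
    InMemoryReduceDriver withInput(String key, List<Double> values) {
        input.put(key, new ArrayList<>(values));
        return this;
    }

    // Calls the reducer once per registered key and returns all emitted lines.
    List<String> run() {
        for (Map.Entry<String, List<Double>> entry : input.entrySet()) {
            reducer.reduce(entry.getKey(), entry.getValue(), output, counters);
        }
        return output;
    }

    long counter(String name) {
        return counters.getOrDefault(name, 0L);
    }
}

// Example reducer: writes one tab-separated line per key with the sum of its values.
final class SumReducer implements SimpleReducer {
    @Override
    public void reduce(String key, List<Double> values, List<String> output, Map<String, Long> counters) {
        double sum = 0;
        for (double value : values) {
            sum += value;
        }
        output.add(key + "\t" + sum);
        counters.merge("SIGNALS_WRITTEN", 1L, Long::sum);
    }
}
```

A test would then build `new InMemoryReduceDriver(new SumReducer())`, add one key via `withInput`, call `run()`, and assert that exactly one line was written and the counter equals 1.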

Zookeeper

• We use the Apache ZooKeeper ClientBase
• It's not in-memory; tests run against the staging
cluster
• Prefix paths, e.g. /test/memo/subscribers

@Test
public void getNumberOfSubscribersShouldSetWatchFlag()
        throws KeeperException, InterruptedException {
    final SubscriberDaoImpl subscriberDao =
            new SubscriberDaoImpl(zookeeperDao, DIR, null);
    subscriberDao.getNumberOfSubscribers(listener);
    verify(this.zookeeper, times(1)).getChildren(eq(DIR), eq(subscriberDao));
}
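
Since these tests run against a shared staging cluster, path prefixing is what keeps them away from real znodes. A minimal sketch of such a helper follows; `TestPaths`, `TEST_ROOT` and `prefixed` are hypothetical names, and the `/test` root is only inferred from the example path on the slide.

```java
// Hypothetical helper: namespaces ZooKeeper paths so tests stay under one test root.
final class TestPaths {
    // Assumed test root, taken from the example path /test/memo/subscribers.
    static final String TEST_ROOT = "/test";

    private TestPaths() {
    }

    // Returns the given absolute path moved under TEST_ROOT; already-prefixed paths pass through.
    static String prefixed(String path) {
        if (path == null || !path.startsWith("/")) {
            throw new IllegalArgumentException("ZooKeeper paths must be absolute: " + path);
        }
        if (path.equals(TEST_ROOT) || path.startsWith(TEST_ROOT + "/")) {
            return path;
        }
        return path.equals("/") ? TEST_ROOT : TEST_ROOT + path;
    }
}
```

A test would pass `TestPaths.prefixed("/memo/subscribers")` as the directory argument of the DAO under test, so that cleanup can simply delete everything below the test root.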

Deployment

THE APPLICATION

http://www.flickr.com/photos/navalsurfaceforces/5553412190/sizes/l/in/photostream/

The Application

• Automated build and restart via capistrano
• Build on every machine
• There is a .m2 repository everywhere

set :deploy_to, "/usr/share/memo-newsagent"
set :keep_releases, 1

after "deploy:setup" do
run "mkdir -p /var/run/memo #{shared_path}/logs /var/log/memo/"
...
end

after "deploy:update_code" do
  run "cd #{current_release} && mvn install -Pfast > #{shared_path}/logs/build.log"
end

after "deploy", "rowlog:stop", "newsagent:restart", "rowlog:start"

Deployment

MAPREDUCE JOBS

http://www.flickr.com/photos/navalsurfaceforces/6257239933/sizes/l/in/photostream/

Map Reduce Jobs

• We use a Maven Hadoop plugin
  hadoop:pack, à la mvn package
  hadoop:deploy to HDFS and the target folder
• All dependencies are packed in. Careful: huge
JARs without dependency management

see github.com/memonews/maven-hadoop

DevOps

OTHER TOOLS IN USE

http://www.flickr.com/photos/damongman/4979871047/sizes/l/in/photostream/

Other Tools

• Staging environment in-house, 1 to 1 copy
from production (virtualized)
• Azkaban for MR job scheduling
• Jenkins for (Integration-) Tests and Metrics
• GIT
• Icinga for Monitoring & Alerting
• Ganglia / Graphite for Hadoop Metrics
• Fliwi for automated cluster provisioning

jean-pierre.koenig@menonews.com

THANKS!
