Test Driven Elephants

Test Driven Elephants: ETL Validation with Cascading on Elastic MapReduce

Presentation from Data Week 2013, Vijay Ramesh, Change.org

  1. Test Driven Elephants: ETL Validation with Cascading on Elastic MapReduce
  2. The problem
     Difficulty with TDD on Hadoop:
     · Slow feedback cycle
     · End-to-end smoke tests only
     · Requires Hadoop install in your CI environment
  3. Cascading’s solution
     Organize map/reduce jobs into flows, connecting:
     · taps: source and sink - where your data is coming from, where it is going to (e.g., on HDFS or S3)
     · pipe assemblies: what types of transformations/operations to run against it, specified independently of the data sources they process
     (from http://docs.cascading.org/impatient/)
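The value of keeping pipe assemblies independent of taps can be illustrated without Cascading at all. The sketch below is a plain-Java analogy and not Cascading's API: `FlowSketch`, `upperCaseAssembly`, and `run` are hypothetical names, the "taps" are just a `Supplier` and a `Consumer`, and rows are assumed to be simple strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Plain-Java analogy only; none of these types are Cascading classes.
final class FlowSketch {

    /** "Pipe assembly": a transformation that knows nothing about where rows come from. */
    static Function<List<String>, List<String>> upperCaseAssembly() {
        return rows -> {
            List<String> out = new ArrayList<>();
            for (String row : rows) {
                out.add(row.toUpperCase());
            }
            return out;
        };
    }

    /** "Flow": bind a source tap and a sink tap to an assembly, then run it once. */
    static void run(Supplier<List<String>> sourceTap,
                    Function<List<String>, List<String>> assembly,
                    Consumer<List<String>> sinkTap) {
        sinkTap.accept(assembly.apply(sourceTap.get()));
    }
}
```

Because the assembly never sees where its rows come from, a test can bind it to an in-memory list while production binds the same assembly to HDFS or S3.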
  4. And this buys us?
     Lots of stuff:
     · Topological scheduling
     · Reusable/parameterizable components
     · Debug facilities (traps, intermediate sinks, logging, etc.)
     · LocalFlowConnector platform to run in-memory flows
     · and... Testable Components!
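Topological scheduling means a job runs only after every job it depends on has finished. As a rough illustration of the idea (not Cascading's actual planner), the sketch below orders a dependency graph with Kahn's algorithm. The class name `TopologicalOrderSketch` and any job names passed to it are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; illustrates dependency ordering, not Cascading internals.
final class TopologicalOrderSketch {

    /**
     * dependsOn maps each job name to the jobs that must finish before it.
     * Assumption: every job appears as a key, even if its dependency list is empty.
     */
    static List<String> order(Map<String, List<String>> dependsOn) {
        Map<String, Integer> remaining = new TreeMap<>();        // unmet dependency count per job
        Map<String, List<String>> dependents = new HashMap<>();  // reverse edges
        for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
            remaining.put(e.getKey(), e.getValue().size());
            for (String dep : e.getValue()) {
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(e.getKey());
            }
        }

        // Start with the jobs that have no dependencies at all.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }

        List<String> result = new ArrayList<>();
        while (!ready.isEmpty()) {
            String job = ready.remove();
            result.add(job);
            // Finishing a job may unblock the jobs that were waiting on it.
            for (String next : dependents.getOrDefault(job, new ArrayList<>())) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (result.size() != dependsOn.size()) {
            throw new IllegalStateException("cycle or unknown dependency in job graph");
        }
        return result;
    }
}
```

For a simple chain in which each job depends on the previous one, the returned order is the chain itself, starting from the job with no dependencies.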
  5. An example application
     Verify* extract data from a variety of production sources (MySQL, MongoDB, etc.)
     *meaning data creation/update patterns fall within acceptable (experimentally determined) thresholds
     “Medium” data:
     · 60M+ users, 200M+ signatures, 1.7B+ emails
     · 100M+ rows a week for more active sources
     · Scale with number of sources & traffic patterns
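One way to read "fall within acceptable thresholds" is as a simple comparison of per-bucket gap statistics (such as the average and maximum delta that appear later in the deck) against limits. The sketch below assumes exactly that. `ThresholdSketch`, its parameter names, and the idea of two separate limits are hypothetical; the deck does not show how the real thresholds are stored or applied.

```java
// Hypothetical check; the deck does not show the real verification logic.
final class ThresholdSketch {

    /**
     * Returns true when both statistics are at or below their limits.
     * All four values are assumed to share the same unit as the timestamps.
     */
    static boolean withinThresholds(double averageDelta, double maxDelta,
                                    double averageLimit, double maxLimit) {
        return averageDelta <= averageLimit && maxDelta <= maxLimit;
    }
}
```

With made-up limits of 60 for the average and 600 for the maximum, a bucket with an average delta of 7.2 and a max delta of 37 passes, while one with 720.6 and 3599 fails.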
  6. Cascading Flow
     Initial source standardization:

         Pipe source = new Rename(
             new Retain(
                 new Pipe(sourceName + "/continuity"),
                 sourceField
             ),
             sourceField,
             new Fields("timestamp")
         );
  7. Cascading Flow
     Custom Pipe Assemblies:

         @VisibleForTesting
         Pipe timestampContinuityAggregate(Pipe source) {
             return new Every(
                 new GroupBy(
                     source,
                     new Fields("time_bucket"),
                     new Fields("timestamp")
                 ),
                 new Fields("timestamp", "time_bucket"),
                 new TimestampContinuityAggregator(),
                 new Fields("time_bucket", "average_delta", "max_delta")
             );
         }
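The slides do not show the body of `TimestampContinuityAggregator`. The following is a minimal plain-Java sketch of what such an aggregator might compute for a single `time_bucket`. It assumes that the timestamps arrive sorted in ascending order (as the `GroupBy` sort on `timestamp` suggests), that a delta is the gap between consecutive timestamps, and that a bucket with fewer than two timestamps reports zero for both values. `ContinuitySketch` and `deltas` are hypothetical names.

```java
import java.util.List;

// Hypothetical stand-in for the per-group logic; not the author's aggregator.
final class ContinuitySketch {

    /**
     * Returns {averageDelta, maxDelta} for timestamps already sorted ascending.
     * Assumption: fewer than two timestamps means there are no gaps, so both are 0.
     */
    static double[] deltas(List<Long> sortedTimestamps) {
        if (sortedTimestamps.size() < 2) {
            return new double[] {0.0, 0.0};
        }
        long sum = 0;
        long max = 0;
        for (int i = 1; i < sortedTimestamps.size(); i++) {
            // Gap between this record and the previous one in the same bucket.
            long delta = sortedTimestamps.get(i) - sortedTimestamps.get(i - 1);
            sum += delta;
            max = Math.max(max, delta);
        }
        double average = (double) sum / (sortedTimestamps.size() - 1);
        return new double[] {average, max};
    }
}
```

For the timestamps 0, 10, 40 the gaps are 10 and 30, so this yields an average delta of 20.0 and a max delta of 30.0.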
  8. Pipe assembly: timestampContinuityAggregate
  9. Simple unit tests

         @Test
         public void testContinuityAggregate() throws Exception {
             // Build the assembly under test on a bare pipe.
             Pipe continuityAggregate = controller.timestampContinuityAggregate(
                 new Pipe("timestampContinuityAggregate")
             );
             // Connect test source and sink taps to the assembly on the test platform.
             Flow flow = getPlatform().getFlowConnector().connect(
                 source(new Fields("timestamp", "time_bucket"), "timeBucket"),
                 sink(new Fields("time_bucket", "average_delta", "max_delta"), "timestampContinuityAggregate"),
                 continuityAggregate
             );
             flow.complete();

             // Expect two output tuples with identical delta statistics.
             validateLength(flow, 2, null);
             List<Tuple> values = getSinkAsList(flow);
             for (Tuple value : values) {
                 Assert.assertTrue(
                     value.getString(0).equals("2007-02-23") || value.getString(0).equals("2007-02-24")
                 );
                 Assert.assertTrue(value.getString(1).equals("2717.5925925925926"));
                 Assert.assertTrue(value.getString(2).equals("19331"));
             }
         }
  10. Simple unit tests
      Building the flow:

          @Test
          public void testContinuityAggregate() throws Exception {
              Pipe continuityAggregate = controller.timestampContinuityAggregate(
                  new Pipe("timestampContinuityAggregate")
              );
              Flow flow = getPlatform().getFlowConnector().connect(
                  source(new Fields("timestamp", "time_bucket"), "timeBucket"),
                  sink(new Fields("time_bucket", "average_delta", "max_delta"), "timestampContinuityAggregate"),
                  continuityAggregate
              );
  11. Simple unit tests
      Running the flow and checking the output:

              flow.complete();
              validateLength(flow, 2, null);
              List<Tuple> values = getSinkAsList(flow);
              for (Tuple value : values) {
                  Assert.assertTrue(
                      value.getString(0).equals("2007-02-23") || value.getString(0).equals("2007-02-24")
                  );
                  Assert.assertTrue(value.getString(1).equals("2717.5925925925926"));
                  Assert.assertTrue(value.getString(2).equals("19331"));
              }
          }
  12. Integration testing

          package org.change.ml.extract.verification;

          import ...

          @PlatformRunner.Platform({LocalPlatform.class})
          public class ControllerIntegrationTest extends VerificationTestCase {
              ...
              @Test
              public void testVerifyTimestampContinuity() throws Exception {
                  // One source tap and two sink taps for the flow under test.
                  Tap signaturesSource = source(...);
                  Tap continuitySink = sink(...);
                  Tap verifiedSink = sink(...);

                  Flow verifyContinuity = controller.verifyTimestampContinuity(
                      "signatures",
                      new Fields("created_at"),
                      new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse("2013-02-25 00:00:00"),
                      signaturesSource,
                      continuitySink,
                      verifiedSink
                  );
                  verifyContinuity.complete();

                  // Read both sinks back and compare them with the expected tuples.
                  List<Tuple> continuityTuples = asList(verifyContinuity, continuitySink);
                  List<Tuple> verifiedTuples = asList(verifyContinuity, verifiedSink);
                  Assert.assertTrue(continuityTuples.get(0).toString("t").equals("2013-02-25t7.166666666666667t37"));
                  Assert.assertTrue(continuityTuples.get(1).toString("t").equals("2013-02-26t720.6t3599"));
                  Assert.assertTrue(verifiedTuples.get(0).toString("t").equals(
                      "2013-02-25,2013-02-26t2013-02-26t2013-02-25,2013-02-26t2013-02-26")
                  );
              }
              ...
          }
  13. Integration testing
      Declaring the source and sink taps:

          package org.change.ml.extract.verification;

          import ...

          @PlatformRunner.Platform({LocalPlatform.class})
          public class ControllerIntegrationTest extends VerificationTestCase {
              ...
              @Test
              public void testVerifyTimestampContinuity() throws Exception {
                  Tap signaturesSource = source(...);
                  Tap continuitySink = sink(...);
                  Tap verifiedSink = sink(...);
  14. Integration testing
      Building the verification flow:

                  Flow verifyContinuity = controller.verifyTimestampContinuity(
                      "signatures",
                      new Fields("created_at"),
                      new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse("2013-02-25 00:00:00"),
                      signaturesSource,
                      continuitySink,
                      verifiedSink
                  );
  15. Integration testing
      Running the flow and checking both sinks:

                  verifyContinuity.complete();
                  List<Tuple> continuityTuples = asList(verifyContinuity, continuitySink);
                  List<Tuple> verifiedTuples = asList(verifyContinuity, verifiedSink);
                  Assert.assertTrue(continuityTuples.get(0).toString("t").equals("2013-02-25t7t37"));
                  Assert.assertTrue(continuityTuples.get(1).toString("t").equals("2013-02-26t720.6t3599"));
                  Assert.assertTrue(verifiedTuples.get(0).toString("t").equals(
                      "2013-02-25,2013-02-26t2013-02-26t2013-02-25,2013-02-26t2013-02-26")
                  );
  16. Running tests locally
  17. is awesome!
      Check out https://github.com/Cascading/cascading.samples for examples using Gradle to drive Cascading
  18. Now what?
      · Continuous Deployment
      · Gradle on CircleCI build to S3
      · Elastic MapReduce scalable
      · DataPipeline provisioned clusters
      · Production Monitoring
      · SNS to PagerDuty integration on failures
  19. We’re hiring!
      Thanks to my colleagues at Change.org for all their help with this presentation.
      If you’re interested in building large-scale distributed systems for data processing and recommendations (along with lots of other cool stuff that helps empower people everywhere to create the change they want to see), we’re hiring! Drop me an email (vijay@change.org) or visit our website (http://www.change.org/hiring).
