How AdMobius uses Cascading in
AdTech Stack
Jyotirmoy Sundi
Sr. Data Engineer at Lotame
(AdMobius was acquired by Lotame in March 2014)
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
Cascading in the AdTech stack of AdMobius (acquired by Lotame, 2014)

Published in: Data & Analytics
Transcript of "Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)"

  1. How AdMobius uses Cascading in its AdTech stack. Jyotirmoy Sundi, Sr. Data Engineer at Lotame (AdMobius was acquired by Lotame in March 2014).
  2. What does AdMobius do
      * AdMobius is a Mobile Audience Management Platform (MAMP). It helps advertisers identify mobile audiences by demographics and interests through standard, custom, and private segments, and reach them at scale.
  3. Target effectively across all platforms and multiple devices: laptop, mobile, iPod, iPad, wearables.
  4. Topics
      * Device graph building and scoring device links
      * Cascading Taps for Hive, MySQL, HBase
      * Modularized testing
      * Optimal config setups
      * Running in YARN
      * Conclusion
  5. AdMobius Stack
      Cascading | Hive | HBase | Giraph
      Hadoop | (Experimental Spark)
      Rackspace
      YARN | MR1
      Custom Workflows
  6. Why Cascading
      − Easy custom aggregators.
          • In the existing MR framework it was very difficult to write a series of complex aggregation logic and run it at scale before making sure of its correctness. You can do that in Hive with UDFs or UDAFs, but we found it much easier in Cascading.
      − Easy for Java developers to understand.
          • Visualize and write complicated workflows through the concepts of pipes, taps, and tuples.
  7. Workflow for audience profile scoring
  8. Driven: https://driven.cascading.io/index.html#/apps/D818DD
  9. Audience Profiling
      * Cascading is used for
          − complex aggregations
          − creating the multi-dimensional device vectors
          − device pair scoring based on the vectors
          − rule-engine-based filters
      * Size
          − Total number of mobile devices: ~2.7B
          − ~500M devices in the Giraph computation
  10. Example: Parallel aggregation of values across multiple fields.
  11. Aggregations
      * No need to know group modes as in a UDAF
      * Buffer
          − use for more complex grouping operations
          − output multiple tuples per group
      * Aggregator (simple aggregations; prebuilt aggregators like SumBy, CountBy)
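  12. Buffer example (Graph, Node, and scoring(...) are the author's own helpers and are not shown on the slide):

      import java.nio.ByteBuffer;
      import java.util.Iterator;
      import cascading.flow.FlowProcess;
      import cascading.operation.BaseOperation;
      import cascading.operation.Buffer;
      import cascading.operation.BufferCall;
      import cascading.tuple.Tuple;
      import cascading.tuple.TupleEntry;

      public class MinGraphScoring extends BaseOperation implements Buffer {
          @Override
          public void operate(FlowProcess flowProcess, BufferCall bufferCall) {
              // All argument tuples of the current group
              Iterator<TupleEntry> arguments = bufferCall.getArgumentsIterator();
              Graph g = new Graph();
              while (arguments.hasNext()) {
                  TupleEntry tpe = arguments.next();
                  // use Kryo serialization
                  ByteBuffer b = ByteBuffer.wrap((byte[]) tpe.getObject("field1"));
                  g.put(b);
              }
              Node[] nodes = g.nodes;
              // For each pair of nodes i, j: emit one scored tuple
              for (int i = 0; i < nodes.length; i++) {
                  for (int j = i + 1; j < nodes.length; j++) {
                      double minmaxscore = scoring(g, i, j);
                      Tuple t1 = new Tuple(nodes[i].id, nodes[j].id, minmaxscore);
                      bufferCall.getOutputCollector().add(t1);
                  }
              }
          }
      }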
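The Buffer on slide 12 sees every value of a group at once and may emit any number of output tuples. The following is a minimal plain-Java sketch of that pattern, written without the Cascading library so it runs on its own. `PairBufferSketch`, `operate`, and `score` are hypothetical names, and `score` (the smaller of two weights) is only a placeholder, since the slide does not show what `scoring(g, i, j)` computes.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of a Buffer-style operation (not the Cascading API).
class PairBufferSketch {

    // Placeholder for the slide's scoring(g, i, j); an assumption for illustration only.
    static double score(double a, double b) {
        return Math.min(a, b);
    }

    // Receives all values of one group and emits one row {i, j, score} per unordered pair.
    static List<double[]> operate(double[] groupValues) {
        List<double[]> out = new ArrayList<>();
        for (int i = 0; i < groupValues.length; i++) {
            for (int j = i + 1; j < groupValues.length; j++) {
                out.add(new double[] { i, j, score(groupValues[i], groupValues[j]) });
            }
        }
        return out;
    }
}
```

A group of n values yields n(n-1)/2 output rows, which is the kind of one-to-many output a simple aggregator cannot produce.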
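  13. Aggregator example:

      import cascading.flow.FlowProcess;
      import cascading.operation.Aggregator;
      import cascading.operation.AggregatorCall;
      import cascading.operation.BaseOperation;
      import cascading.tuple.TupleEntry;

      public class PotentialMatchAggregator
              extends BaseOperation<PotentialMatchAggregator.IDList>
              implements Aggregator<PotentialMatchAggregator.IDList> {

          @Override
          public void start(FlowProcess flowProcess, AggregatorCall<IDList> aggregatorCall) {
              // Called once per group: create a fresh context
              IDList idList = new IDList();
              aggregatorCall.setContext(idList);
          }

          @Override
          public void aggregate(FlowProcess flowProcess, AggregatorCall<IDList> aggregatorCall) {
              // Called once per tuple in the group
              TupleEntry arguments = aggregatorCall.getArguments();
              IDList idList = aggregatorCall.getContext();
              // amid and match are read from arguments; the slide does not show how
              idList.updateDev(amid, match);
          }

          @Override
          public void complete(FlowProcess flowProcess, AggregatorCall<IDList> aggregatorCall) {
              // Called once per group, after the last tuple
              IDList idList = aggregatorCall.getContext();
              …...
          }
      }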
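The Aggregator on slide 13 follows a start / aggregate / complete lifecycle with one context object per group. As a rough illustration, the sketch below imitates that lifecycle in plain Java without the Cascading library. `AggregatorLifecycleSketch`, `GroupAggregator`, `IdList`, and `COUNT_IDS` are hypothetical names, and grouping is done with an in-memory map here purely for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the start/aggregate/complete lifecycle (not the Cascading API).
class AggregatorLifecycleSketch {

    // Hypothetical per-group context, playing the role of IDList on the slide.
    static class IdList {
        final List<String> ids = new ArrayList<>();
    }

    // Hypothetical interface mirroring the three lifecycle calls.
    interface GroupAggregator<C, R> {
        C start();                           // once per group: create the context
        void aggregate(C ctx, String value); // once per value in the group
        R complete(C ctx);                   // once per group: produce the result
    }

    // Drives an aggregator over rows of {groupKey, value}.
    static <C, R> Map<String, R> run(List<String[]> rows, GroupAggregator<C, R> agg) {
        Map<String, C> contexts = new LinkedHashMap<>();
        for (String[] row : rows) {
            C ctx = contexts.computeIfAbsent(row[0], k -> agg.start());
            agg.aggregate(ctx, row[1]);
        }
        Map<String, R> results = new LinkedHashMap<>();
        for (Map.Entry<String, C> e : contexts.entrySet()) {
            results.put(e.getKey(), agg.complete(e.getValue()));
        }
        return results;
    }

    // Example aggregator: collect the ids of a group and report how many were seen.
    static final GroupAggregator<IdList, Integer> COUNT_IDS =
            new GroupAggregator<IdList, Integer>() {
                public IdList start() { return new IdList(); }
                public void aggregate(IdList ctx, String value) { ctx.ids.add(value); }
                public Integer complete(IdList ctx) { return ctx.ids.size(); }
            };
}
```

Keeping all state in the context rather than in fields of the aggregator is what lets one operation instance be reused across groups.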
  14. Joins
      * CoGroup
          − when neither of the two pipes fits into memory
      * HashJoin
          − when one of the pipes fits into memory

            Pipe jointermsPipe = new HashJoin(
                termsPipe, new Fields("term_token"),
                dictionary, new Fields("word"),
                new Fields("app", "term_token", "score", "d_count", "index", "word"),
                new InnerJoin());
      * Custom joins and BloomJoin
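The reason HashJoin on slide 14 needs one side to fit into memory can be seen in a plain-Java sketch of an inner hash join: the small side is loaded into a hash map, and the large side is streamed past it once. This is not the Cascading implementation. `HashJoinSketch` and `innerJoin` are hypothetical names, and the sketch assumes each dictionary word appears only once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of an inner hash join (not the Cascading API).
class HashJoinSketch {

    // Joins streamed {term, score} rows against a small {word, index} table held in memory.
    static List<String> innerJoin(List<String[]> terms, List<String[]> dictionary) {
        // Build phase: load the small side into a hash map keyed by the join field.
        Map<String, String> indexByWord = new HashMap<>();
        for (String[] entry : dictionary) {
            indexByWord.put(entry[0], entry[1]);
        }
        // Probe phase: stream the large side once and look up each key.
        List<String> joined = new ArrayList<>();
        for (String[] term : terms) {
            String index = indexByWord.get(term[0]);
            if (index != null) { // inner join: rows without a match are dropped
                joined.add(term[0] + "\t" + term[1] + "\t" + index);
            }
        }
        return joined;
    }
}
```

When neither side fits into memory, the map in the build phase is not an option, which is the case the slide assigns to CoGroup.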
  15. Custom Src/Sink Taps
      * Cascading has good support for reading from and writing to different kinds of data sources. Slight tuning or changes might be required, but most of the code already exists.
          − Hive (with different file formats), HBase, MySQL
          − http://www.cascading.org/extensions/
          − Set proper config parameters when reading from a source tap. For example, when reading from an HBase tap:

            String tableName = "device_ids";
            String[] familyNames = new String[] { "id:type1", "id:type2", "id:type3", ..., "id:typen" };
            Scan scan = new Scan();
            scan.setCacheBlocks(false);
            scan.setCaching(10000);
            scan.setBatch(10000);
  16. Hive Src Taps

      ExampleWorkflow.java

          Tap dmTap = new HiveTableTap(HiveTableTap.SchemeType.SEQUENCE_FILE,
              admoFPbase, admoFPBasePartitions, dmFullFilter);

      HiveTableTap.java

          public class HiveTableTap extends GlobHfs {
              static Scheme getScheme(SchemeType st) {
                  if (st.equals(SchemeType.SEQUENCE_FILE))
                      return new AdmobiusWritableSequenceFile(new Fields("value"), BytesWritable.class);
                  else if (st.equals(SchemeType.TEXT_TSV))
                      return new TextDelimited();
                  else
                      return null;
              }
              …..
          }
  17. Hive Sink Taps

      ExampleWorkflow.java

          Tap srcDstIdsSinkTap = new Hfs(
              new AdmobiusWritableSequenceFile(new Fields("value"), (Class<? extends Writable>) Text.class),
              "/tmp/srcDstIdsSinkTap", SinkMode.REPLACE);

      HiveTableTap.java

          public class HiveTableTap extends GlobHfs {
              static Scheme getScheme(SchemeType st) {
                  if (st.equals(SchemeType.SEQUENCE_FILE))
                      return new AdmobiusWritableSequenceFile(new Fields("value"), BytesWritable.class);
                  else if (st.equals(SchemeType.TEXT_TSV))
                      return new TextDelimited();
                  else
                      return null;
              }
              …..
          }

          conf.setOutputFormat(SequenceFileOutputFormat.class);
          valueValue = (Writable) (new Text(tupleEntry.getObject(0).toString().getBytes()));
  18. Hive table

      CREATE TABLE CASCADING_HIVE_INTER
      (
        admo_id string,
        segments string
      )
      PARTITIONED BY ( batch_id STRING )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS SEQUENCEFILE
  19. Good Practices
      * Use checkpointing optimally.
      * Use subassemblies instead of rewriting logic. For further control, pass additional parameters to subassemblies.
      * Use compression and SequenceFile() in sink taps to chain multiple Cascading workflows.
      * Use failure traps to filter faulty records.
      * Avoid creating workflows that are too small or too long. Chain them in Oozie or a similar workflow management engine.
          − Example: workflows with 10-20 MR jobs are good.
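The failure-trap advice on slide 19 comes down to diverting records that fail processing into a side output instead of failing the whole job. A minimal plain-Java sketch of that idea follows. It does not use the Cascading trap API. `FailureTrapSketch`, `scores`, and `trapped` are hypothetical names, and parsing a score from a string stands in for whatever operation might fail.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the failure-trap idea (not the Cascading API).
class FailureTrapSketch {
    final List<Double> scores = new ArrayList<>();  // successfully processed records
    final List<String> trapped = new ArrayList<>(); // faulty records kept for inspection

    // Processes every record; a failure on one record does not stop the others.
    void process(List<String> rawRecords) {
        for (String raw : rawRecords) {
            try {
                scores.add(Double.parseDouble(raw.trim()));
            } catch (RuntimeException e) {
                // Divert the offending record instead of aborting the run.
                trapped.add(raw);
            }
        }
    }
}
```

The trapped list can then be written out and examined separately, which is usually cheaper than rerunning a long workflow because of a handful of malformed rows.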
  20. Some Properties for Optimal Performance
  21. Problems with improper configuration
      1. Compression parameters: if they are not set, jobs run slowly and may sometimes take double the time. Set the correct compression type based on the cluster configs.
      2. mapred.reduce.tasks: it must be set manually depending on the size of your job. Keeping it too low slows down the reducers.
      3. Small-file issue: the input splits read by mappers would be too small, eventually bringing up more mappers than required.
      4. Any custom configuration parameters: set them here and use getProperty to access them anywhere in the data workflow.

          properties.setProperty("min_cutoff_score", "0.7");
          FlowConnector flowConnector = new HadoopFlowConnector(properties);
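Point 4 on slide 21 relies on `java.util.Properties`: a value is set once when the workflow is configured and read back later with `getProperty`. The sketch below shows that round trip with the standard library only. `WorkflowConfigSketch`, the reducer count of 200, and the fallback of 0.5 are assumptions for illustration and are not values from the talk.

```java
import java.util.Properties;

// Sketch of setting and reading workflow properties with java.util.Properties.
class WorkflowConfigSketch {

    // Builds the properties that would be handed to the flow connector.
    static Properties buildProperties() {
        Properties properties = new Properties();
        properties.setProperty("min_cutoff_score", "0.7");    // custom parameter from the slide
        properties.setProperty("mapred.reduce.tasks", "200"); // assumed value; size it to the job
        return properties;
    }

    // Reads the custom parameter, falling back to an assumed default when it is missing.
    static double minCutoffScore(Properties properties) {
        return Double.parseDouble(properties.getProperty("min_cutoff_score", "0.5"));
    }
}
```

Supplying a default in `getProperty` keeps operations from failing with a null value when a workflow is started without the custom parameter.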
  22. Running in YARN
      * YARN deployment is smooth with Cascading 2.5.
          − Make sure the config properties are set as per YARN, as they are different from MR1.
          − When running in workflow engines like Oozie, make sure that
              • mapred.job.classpath.files and mapred.cache.files are set with all dependency files in colon-separated format.
  23. Cascading DSLs in other languages
      * Scalding (Scala)
      * PyCascading (Python)
      * cascading.jruby (JRuby)
      * Cascalog (Clojure)
  24. Thank you for your time. Q & A