Ming Yuan / Alyssa Romeo
Capital One
May 24th, 2016
Simplifying Apache Cascading
At Capital One, I built a small framework on top of Apache Cascading. We have found the framework can significantly reduce the effort in developing and enhance the maintainability of Cascading applications.

Apache Cascading

• Open source framework implementing the "chain of responsibility" design pattern
• Abstraction over the MapReduce, Tez, or Flink processing engine when transforming big data sets on Hadoop
• APIs for constructing and executing data-processing flows
PDS Framework on Cascading

A light-weight layer on top of Apache Cascading to
– Manage metadata for inputs and outputs in properties files
– Define data processing rules in properties files
– Support development in a parallel manner
– Make testing easier and more flexible
Case Studies

Cascading application 1 – 60% code reduction

  Source code                  Using Cascading directly   After rewriting on the framework
  TranOptimizerTrxnDtl.java    473                        134
  TrxnDtlTransformation.java   278                        81
  PlanTypeCdeCalculation.java  152                        144
  MyMain.java                  –                          12
  Total                        903                        371

Cascading application 2 – 70% code reduction

  Source code                  Using Cascading directly   After rewriting on the framework
  PmsmJoin.java                210                        87
  JoinFunc.java                257                        38
  MyMain.java                  –                          12
  Total                        467                        137
Anatomy of a Data Processing Step

[Diagram: a data-processing step reads from sources and writes to a sink, applying data-processing rules; the root configuration points to a schema file for each source and sink, and to a file of processing rules.]
Managing Multiple Steps on the Framework

[Diagram: an application initiator starts the application controller, which runs the transformation steps in sequence; the root configuration, the schema files, and a processing-rules file per step drive each transformation step.]
Root Configuration

Root file entries configure application-level components, including
– Hadoop configurations
– Global configuration entries for the application
– Definitions for File Taps (location and schema)
– Definitions for Hive Taps

Example root configuration:

ATPT_SCHEME_PATH=/devl/rwa/prtnrshp/prtntshp_whirl/whirl_atpt_mntry_dq_schema.txt
ATPT_RETAIN_FIELDS_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/ATPT_retain_schema.txt
ATPT_DATA_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/whirl_atpt_mntry_vldtd_hive_extract_us
ATGT_SCHEME_PATH=/devl/rwa/prtnrshp/prtntshp_whirl/whirl_atgt_mntry_dq_schema.txt
ATGT_RETAIN_FIELDS_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/ATGT_retain_schema.txt
ATGT_DATA_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/whirl_atgt_mntry_vldtd_hive_extract_us
HADOOP_PROPS_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hadoop.properties
FIRST_HIVE_TAP=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hiveone.properties
SECOND_HIVE_TAP=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hivetwo.properties
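The root file is a flat KEY=VALUE listing, so it can be read with the standard java.util.Properties loader. The framework's own loader isn't shown in the deck; the class and method names below are ours, a minimal sketch of the lookup behavior the deck describes for getFromConfig (the value if the key is present, null otherwise):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class RootConfig {
    private final Properties props = new Properties();

    // Load KEY=VALUE entries such as the root configuration above.
    public static RootConfig fromString(String text) throws IOException {
        RootConfig config = new RootConfig();
        config.props.load(new StringReader(text));
        return config;
    }

    // Mirrors the documented getFromConfig behavior:
    // returns the value if the key is present, null otherwise.
    public String get(String key) {
        return props.getProperty(key);
    }

    public static void main(String[] args) throws IOException {
        String root = "HADOOP_PROPS_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hadoop.properties\n";
        RootConfig config = RootConfig.fromString(root);
        System.out.println(config.get("HADOOP_PROPS_PATH"));
    }
}
```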
Schema Configuration – FileTap

Root configuration entries:

ATPT_SCHEME_PATH=/devl/rwa/prtnrshp/prtntshp_whirl/whirl_atpt_mntry_dq_schema.txt
ATPT_DATA_PATH=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/whirl_atpt_mntry_vldtd_hive_extract_us

Schema file (one field per line, pipe-delimited attributes, column name first):

atgt_org|decimal|FALSE|1|NA
atgt_acct|string|FALSE|1|NA
atgt_rec_nbr|decimal|FALSE|1|NA
atgt_logo|decimal|FALSE|1|NA
atgt_type|string|FALSE|1|NA
atgt_mt_eff_date|decimal|FALSE|1|NA

Names-only schema file:

atgt_org|
atgt_acct|
atgt_rec_nbr|
atgt_logo|
atgt_type|
atgt_mt_eff_date|

Constructing the tap:

Tap pmsmTap = new Hfs(
    getTextDelimitedFromConfig("ATPT_SCHEME_PATH", null, false, " "),
    getFromConfigure("ATPT_DATA_PATH"));
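A file-tap schema lists one field per line, with the column name before the first pipe. As a self-contained illustration (the class is ours, not part of the framework), the ordered column names that a Fields object would be built from can be extracted like this:

```java
import java.util.ArrayList;
import java.util.List;

public class SchemaFile {
    // Return the ordered column names from pipe-delimited schema text,
    // e.g. "atgt_org|decimal|FALSE|1|NA" -> "atgt_org".
    // Also accepts the names-only variant, e.g. "atgt_org|".
    public static List<String> fieldNames(String schemaText) {
        List<String> names = new ArrayList<>();
        for (String line : schemaText.split("\n")) {
            if (line.trim().isEmpty()) {
                continue;                        // skip blank lines
            }
            names.add(line.split("\\|")[0].trim());
        }
        return names;
    }

    public static void main(String[] args) {
        String schema = "atgt_org|decimal|FALSE|1|NA\n"
                      + "atgt_acct|string|FALSE|1|NA\n";
        System.out.println(fieldNames(schema));   // [atgt_org, atgt_acct]
    }
}
```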
Schema Configuration – HiveTap

Root configuration entry:

SECOND_HIVE_TAP=/devl/rwa/prtnrshp/prtnrshp_whirl_trxn_optmzn/hivetwo.properties

Schema file:

DATA_BASE=dhdp_coaf
APP_COLUMN_NAMES=app_id, created_dt,…
APP_COLUMN_TYPES=Bigint, String, …
TABLE=MyTable
PARTITION_KEYS=odate
SER_LIB=org.apache.hadoop… (optional; defaults to ParquetHiveSerDe)
APP_PATH=hdfs://….

Constructing the tap:

HiveTap hiveTap = getHiveTapFromConfig("SECOND_HIVE_TAP", sinkMode, booleanValue);
Data Processing Rules

• Processing rules are documented as properties
• Out-of-the-box macros define the transformation logic
• The framework translates the processing rules into Cascading API calls on the fly

Processing rules:

ARRMT_ID_CHAIN obj(atpt_chain)
TRXN_SEQ_NUM atpt_mt_hi_tran_trk_id
POST_DT str(atpt_mt_posting_date)
TRXN_CD int(atpt_mt_txn_code)
AGT_ID substr(atpt_mt_hi_rep_id, 2, 4)

Equivalent Cascading calls generated by the framework:

result.set(outputFields.getPos("ARRMT_ID_CHAIN"), argument.getObject(new Fields("atpt_chain")));
result.set(outputFields.getPos("TRXN_SEQ_NUM"), argument.getObject(new Fields("atpt_mt_hi_tran_trk_id")));
result.set(outputFields.getPos("POST_DT"), argument.getString(new Fields("atpt_mt_posting_date")));
result.set(outputFields.getPos("TRXN_CD"), argument.getInteger(new Fields("atpt_mt_txn_code")));
result.set(outputFields.getPos("AGT_ID"), argument.getString(new Fields("atpt_mt_hi_rep_id")).substring(2, 4));
Data Processing Rules -- Macros

obj: TARGET obj(SOURCE)
    result.set(outputFields.getPos("TARGET"), argument.getObject(new Fields("SOURCE")));

default: TARGET SOURCE
    result.set(outputFields.getPos("TARGET"), argument.getObject(new Fields("SOURCE")));

as-is: TARGET asis(default)
    result.set(outputFields.getPos("TARGET"), default);

string: TARGET str(SOURCE)
    result.set(outputFields.getPos("TARGET"), argument.getString(new Fields("SOURCE")));

int: TARGET int(SOURCE)
    result.set(outputFields.getPos("TARGET"), argument.getInteger(new Fields("SOURCE")));

sub-string: TARGET substr(SOURCE, 2, 4)
    result.set(outputFields.getPos("TARGET"), argument.getString(new Fields("SOURCE")).substring(2, 4));

replace: TARGET replace(SOURCE, A, B, C, D, default)
    String rawValue = argument.getString(new Fields("SOURCE"));
    if rawValue equals A, result.set(outputFields.getPos("TARGET"), B);
    else if rawValue equals C, result.set(outputFields.getPos("TARGET"), D);
    else result.set(outputFields.getPos("TARGET"), "default");

replace null: TARGET repnull(SOURCE, default)
    String rawValue = argument.getString(new Fields("SOURCE"));
    if rawValue is null, result.set(outputFields.getPos("TARGET"), "default");
    else result.set(outputFields.getPos("TARGET"), rawValue);

replace null with whitespace: TARGET repnullws(SOURCE)
    String rawValue = argument.getString(new Fields("SOURCE"));
    if rawValue is null, result.set(outputFields.getPos("TARGET"), " ");
    else result.set(outputFields.getPos("TARGET"), rawValue);

not null: TARGET notnull(SOURCE)
    String rawValue = argument.getString(new Fields("SOURCE"));
    if rawValue is null, throw a RuntimeException;
    else result.set(outputFields.getPos("TARGET"), rawValue);

convert date: TARGET dateconv(SOURCE, yyyymmdd, dd-mm-yyyy)
    String rawValue = argument.getString(new Fields("SOURCE"));
    targetValue = rawValue converted from yyyymmdd to dd-mm-yyyy;
    result.set(outputFields.getPos("TARGET"), targetValue);

move decimal: TARGET movedeci(SOURCE, -2)
    double rawValue = argument.getDouble(new Fields("SOURCE"));
    result.set(outputFields.getPos("TARGET"), rawValue / (10 ^ -2));
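To make the rule-to-code translation concrete without pulling in Cascading, here is a small stand-alone interpreter for a few of the macros above (obj, str, int, substr, repnull, plus the pass-through "default" form). The class is ours and rows are plain Maps rather than Cascading TupleEntry objects; it is a sketch of the idea, not the framework's implementation:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleEngine {
    // Matches "macro(arg1, arg2, ...)" on the right-hand side of a rule.
    private static final Pattern MACRO = Pattern.compile("(\\w+)\\(([^)]*)\\)");

    // Apply one "TARGET macro(SOURCE, ...)" rule to a source row and
    // return the value that would be set on the TARGET field.
    public static Object apply(String rule, Map<String, Object> row) {
        String expr = rule.trim().split("\\s+", 2)[1];
        Matcher m = MACRO.matcher(expr);
        if (!m.matches()) {
            return row.get(expr);               // "default" macro: copy SOURCE as-is
        }
        String[] args = m.group(2).split("\\s*,\\s*");
        Object raw = row.get(args[0]);
        switch (m.group(1)) {
            case "obj":     return raw;
            case "str":     return String.valueOf(raw);
            case "int":     return Integer.valueOf(raw.toString());
            case "substr":  return raw.toString().substring(
                                Integer.parseInt(args[1]), Integer.parseInt(args[2]));
            case "repnull": return raw == null ? args[1] : raw;
            default: throw new IllegalArgumentException("unknown macro: " + m.group(1));
        }
    }

    public static void main(String[] args) {
        Map<String, Object> row = Map.of("atpt_mt_hi_rep_id", "AB1234");
        // substr(SOURCE, 2, 4) keeps the characters at indexes 2 and 3
        System.out.println(apply("AGT_ID substr(atpt_mt_hi_rep_id, 2, 4)", row));   // 12
    }
}
```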
Exception Handling

"Whenever an operation fails and throws an exception, if there is an associated trap, the offending Tuple is saved to the resource specified by the trap Tap." -- Cascading documentation

FlowDef flowDef = FlowDef.flowDef()
    .addSource(ipAmcpPipe, ipAmcpInTap)
    .addSource(ipAtptPipe, ipAtptInTap)
    .addTailSink(transformPipe, outTap)
    .addTrap(ipAtptPipe, badRecordsTap);
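Stripped of Cascading specifics, a trap simply diverts the offending record to a side output instead of failing the whole flow. A minimal stand-alone sketch of that semantics (the class and method names are ours):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class TrapDemo {
    // Apply op to each record; a record whose operation throws is saved to
    // the trap list instead of aborting the run -- the behavior a trap Tap
    // gives a Pipe in Cascading.
    public static <T, R> List<R> runWithTrap(List<T> records, Function<T, R> op, List<T> trap) {
        List<R> out = new ArrayList<>();
        for (T record : records) {
            try {
                out.add(op.apply(record));
            } catch (RuntimeException e) {
                trap.add(record);               // the "offending Tuple"
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> trap = new ArrayList<>();
        List<Integer> parsed = runWithTrap(Arrays.asList("1", "x", "3"), Integer::valueOf, trap);
        System.out.println(parsed + " trapped: " + trap);   // [1, 3] trapped: [x]
    }
}
```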
How to Adopt the Framework

• Create a root configuration file
• Create a schema file for each input and output (or reuse DQ schema files)
• Define processing rules
• Add all of the files to HDFS
• Subclass PDSBaseFunction for each processing step

@Override
protected void operate(FlowProcess flowProcess, FunctionCall<Tuple> functionCall) {
    this.populateTupleSet(functionCall);
    TupleEntry argument = functionCall.getArguments();
    Tuple result = functionCall.getContext();
    Fields outputFields = functionCall.getDeclaredFields();
    result.set(outputFields.getPos("CHK_NUM"), check_number_calculation(argument));
    functionCall.getOutputCollector().add(result);
}

@Override
protected String getConfigPath() {
    return "/path/to/rulesfile";
}
How to Adopt the Framework (continued)

• Subclass the PDSBaseDriver class and implement the "transform" method
• Create a "main" class
• Run tests

@Override
protected FlowDef transform() {
    // Keys such as PMAM_SCHEME_PATH and OUTPUT_DATA_PATH are entries
    // in the root configuration file
    Fields pmamfields = getFieldsFromConfigEntry("PMAM_SCHEME_PATH");
    String apparrFilePath = this.getFromConfigure("OUTPUT_DATA_PATH");
    Tap pmsmTap = new Hfs(
        this.getTextDelimitedFromConfig("PMSM_SCHEME_PATH", null, false, fieldDelimiter),
        apparrFilePath);
    FlowDef flowDef = FlowDef.flowDef()
        .addSource(ipAmcpPipe, ipAmcpInTap)
        .addTailSink(transformPipe, outTap)
        .addTrap(ipAtptPipe, badRecordsTap);
    return flowDef;
}

public class TestHarness {
    public static void main(String[] args) {
        new MyDriverImp().process("/path/to/rootconfig");
    }
}
Conclusion

• Benefits
  – Reduce the total effort of developing and testing Cascading applications
    • Provide a re-usable layer to reduce the amount of "plumbing" code
    • Make Cascading modules configurable
  – Improve the code quality
    • Modularize Cascading applications and support best practices in Java coding
    • Support additional features (such as logging and exception handling)
  – Build an open architecture for future extension and integration
• Technical specification
  – Compatible with JDK 1.5 and above; the jar file was compiled with JDK 1.7
  – Tested with Cascading 2.5
For questions, please reach out to Ming.Yuan@capitalone.com
Appendix: PDSBaseDriver Class

process(String path) [do not override]
    Takes the path to the root configuration file, initializes all required configurations, invokes "transform()" in its subclass, and executes the Cascading flows.

init(String path) [do not override]
    Takes the path to the root configuration file, parses the file, and stores configuration entries accordingly.

getFromConfig(String key) [do not override]
    Takes a String-typed key and returns the String-typed value if the key has been used in the root configuration file; returns null otherwise.

getFieldsFromConfigEntry(String key) [do not override]
    If, in the root configuration file, the key has been assigned the path to a schema file, returns a Fields object built from all column names in the schema file. This Fields object is automatically cached.

getFieldsFromConfigEntry(String key, String[] appendences) [do not override]
    If the key has been assigned the path to a schema file, returns a Fields object with all column names in the schema file plus all names in the input string array. This Fields object is NOT cached.

getTextDelimitedFromConfig(String key, String[] appendences, boolean hasHeader, String delimiter) [do not override]
    Creates and returns a TextDelimited object from a configuration key in the root configuration file. The second parameter appends column names programmatically; the third and fourth parameters describe the input/output files.

transform() [override]
    Subclasses should build a FlowDef object carrying the application's processing flow in this method.
Appendix: PDSBaseFunction Class

prepare(FlowProcess f, OperationCall<Tuple> call) [do not override]
    Overrides the same method in the Cascading BaseOperation class.

cleanup(FlowProcess f, OperationCall<Tuple> call) [do not override]
    Overrides the same method in the Cascading BaseOperation class.

init(String key, String filePath) [do not override]
    Parses a mapping-rules file and initializes the PDSBaseFunction object.

populateTupleSet(FunctionCall<Tuple> call) [do not override]
    Populates values in the output tuple based on the input values and the pre-defined processing rules.

getConfigPath() [override]
    Returns the path to the processing-rules file in HDFS.

operate(FlowProcess flowProcess, FunctionCall<Tuple> functionCall) [override]
    Should invoke "populateTupleSet()" to execute the pre-defined transformation rules, and invoke any additional custom transformation methods for complex logic.
Appendix: Class Diagram

* Yellow color indicates components from the Cascading package
