Configuration Driven Reporting On Large Dataset Using Apache Spark

In the financial world, petabytes of transactional data must be stored, processed, and distributed across global customers and partners in a secure, compliant, and accurate way, with high availability, resiliency, and observability. At American Express, we need to generate hundreds of different kinds of reports and distribute them to thousands of partners on different schedules, based on billions of daily transactions. Our next-generation reporting framework is a highly configurable enterprise framework that caters to different reporting needs with zero additional development. This reusable framework entails dynamically scheduling partner-specific reports and transforming, aggregating, and filtering the data into different DataFrames using built-in as well as user-defined Spark functions, leveraging Spark's in-memory and parallel processing capabilities. It also encompasses applying business rules and converting the output into different formats by embedding template engines such as FreeMarker and Mustache into the framework.



  1. Configuration Driven Reporting On a Large Dataset Using Apache Spark
     Arvind Das, Senior Engineer, American Express (connect with me @ http://linkedin.com/in/arvind-das-a8720b49)
     Zheng Gu, Engineer, American Express (connect with me @ http://linkedin.com/in/zheng-gu-895bb4157)
  2. Agenda
     • Introduction
     • Need for Dynamic Configuration-Based Reporting
     • Overall Design
     • Components Involved
     • Transformation at Scale
     • Templating at Scale
  3. Introduction: What is the Reporting Framework?
     The reporting framework entails dynamically scheduling partner-specific reports and transforming, aggregating, and filtering the data into different DataFrames using built-in as well as user-defined functions, leveraging Spark's in-memory and parallel processing capabilities. It also encompasses applying business rules and converting the output into different formats by embedding FreeMarker as the template engine in the framework.
  4. Statistics and General Need
  5. PATTERN: Need for Configuration-Based Reporting
     Generation of the different reports and feeds follows a common pattern:
     • How the input dataset is read
     • Optional enhancement of the dataset with a referential data lookup
     • A sequence of transformation rules
     • Application of a template on the final data
     The different reports and feeds differ in:
     • Partner/stakeholder configurations
     • Frequency of generation
     • Input dataset and schema definition
     • Aggregation rules
     • Templates
     A common reporting framework therefore controls the varying parameters of reporting as dynamic configuration, external to the actual framework.
  6. Technical Components
     • The configuration driving the reporting is maintained in a config management system outside of the framework
     • The core reporting framework is a sequence of activities that runs as a Spark job
     • A K8s-based scheduler app manages job scheduling and frequencies based on partner/downstream contracts
     • The FreeMarker template engine, embedded into the framework, reads an externally provided template file
     • The framework publishes the final report to an S3 object store
  7. A Sample Configuration File
     The configuration captures the report metadata, the schema tied to the report, the step transformation rules, the schema elements, the report template, and the lookup datasets:

       {
         "report-name": {
           "title": "",
           "type": "",
           "id": "",
           "schema": "sample-schema",
           "look-up-dataset": ["", ""],
           "transformation-rule": {
             "step1": "",
             "step2": ""
           },
           "report-template": ["report.ftlh"],
           "sample-schema": []
         }
       }
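     As an illustrative sketch only (not the framework's actual code), a configuration file in this shape could be loaded with a JSON library such as Jackson; the class and field names below are assumptions.

       // Hypothetical loader for the report configuration above (assumption, not the actual framework API).
       import com.fasterxml.jackson.core.type.TypeReference;
       import com.fasterxml.jackson.databind.ObjectMapper;
       import java.io.File;
       import java.util.List;
       import java.util.Map;

       public class ReportConfigLoader {
           @SuppressWarnings("unchecked")
           public static void main(String[] args) throws Exception {
               ObjectMapper mapper = new ObjectMapper();
               // Read the whole file into a nested map keyed by report name.
               Map<String, Object> root = mapper.readValue(
                       new File("report-config.json"), new TypeReference<Map<String, Object>>() {});
               Map<String, Object> report = (Map<String, Object>) root.get("report-name");
               List<String> schemaColumns = (List<String>) report.get("sample-schema");
               Map<String, String> transformationRules = (Map<String, String>) report.get("transformation-rule");
               System.out.println("Columns: " + schemaColumns + ", steps: " + transformationRules.keySet());
           }
       }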
  8. Deep-Dive
  9. Apply Schema Stage
     • Create a Spark SQL query from the schema provided
     • Filter out columns that are not needed for the report using Spark SQL
     • Reduce the size of the DataFrame
     Relevant configuration: "schema": "sample-schema", "sample-schema": ["name", "sex"]

     DF1:
       ID | Name  | Sex    | Birth
       1  | name1 | male   | 01/01/1970
       2  | name2 | female | 01/01/1970

     Query: select name, sex from DF1

     DF2:
       Name  | Sex
       name1 | male
       name2 | female
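     A minimal sketch of how this stage could look in Java Spark, assuming the column list comes from the "sample-schema" entry of the configuration; the class and method names are illustrative, not the framework's actual API.

       // Hypothetical Apply Schema stage: build a SELECT from the configured columns
       // and use it to shrink the input DataFrame to only what the report needs.
       import java.util.List;
       import org.apache.spark.sql.Dataset;
       import org.apache.spark.sql.Row;
       import org.apache.spark.sql.SparkSession;

       public class ApplySchemaStage {
           public static Dataset<Row> apply(SparkSession spark, Dataset<Row> input, List<String> schemaColumns) {
               input.createOrReplaceTempView("DF1");                        // expose the input to Spark SQL
               String query = "select " + String.join(", ", schemaColumns)  // e.g. "select name, sex from DF1"
                       + " from DF1";
               return spark.sql(query);                                     // smaller DataFrame with only report columns
           }
       }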
  10. Data Lookup Stage
     • Reports might need data that is not part of the input, for example static data
     • Join that data with the input dataset based on a common key
     Relevant configuration: "look-up-dataset": ["sample-lookup"]

     DF1:
       ID | Name  | Sex
       1  | name1 | male
       2  | name2 | female

     sample-lookup:
       ID | Phone
       1  | 123-456-7890
       2  | 098-765-4321

     DF after lookup:
       ID | Name  | Sex    | Phone
       1  | name1 | male   | 123-456-7890
       2  | name2 | female | 098-765-4321
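     A sketch of the lookup join, assuming the lookup dataset is registered as a Spark table and that the common key is ID; names are illustrative.

       // Hypothetical Data Lookup stage: enrich the input DataFrame with a configured
       // lookup dataset by joining on a common key ("ID" is assumed here for illustration).
       import org.apache.spark.sql.Dataset;
       import org.apache.spark.sql.Row;
       import org.apache.spark.sql.SparkSession;

       public class DataLookupStage {
           public static Dataset<Row> apply(SparkSession spark, Dataset<Row> input, String lookupName) {
               Dataset<Row> lookup = spark.table(lookupName);   // e.g. the registered "sample-lookup" dataset
               return input.join(lookup, "ID");                 // yields ID, Name, Sex, Phone
           }
       }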
  11. Apply Transformation Rules Stage
     • A report needs aggregations at different levels, so several transformation rules are applied, and each transformation returns a DataFrame
     Configured rules:
       "transformation-rule": {
         "step1": "select * from original_DF where txn_id <= 10",
         "step2": "select cat_id, sum(amount) as amount, count(amount) as count from step1 group by cat_id",
         "step3": "select txn_type, sum(amount) as amount, count(amount) as count from step1 group by txn_type"
       }

     original DF:
       txn_ID | txn_type | cat_ID | amount
       1      | Credit   | 100    | 50
       2      | Debit    | 102    | 30
       3      | Credit   | 100    | 20
       4      | Credit   | 102    | 10
       5      | Credit   | 105    | 100
       6      | Debit    | 102    | 20
       7      | Credit   | 102    | 30
       8      | Credit   | 105    | 50
       9      | Credit   | 105    | 60
       10     | Debit    | 100    | 10
       11     | Debit    | 104    | 5

     step1 DF: the original DF filtered to txn_id <= 10 (rows 1 through 10)

     step2 DF:
       cat_id | amount | count
       100    | 80     | 3
       102    | 90     | 4
       105    | 210    | 3

     step3 DF:
       txn_type | amount | count
       Credit   | 320    | 7
       Debit    | 60     | 3
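     A sketch of how the configured steps could be executed in order, assuming an ordered map of step name to SQL; each result is registered as a temp view so that later steps can read earlier ones. The names are illustrative, not the framework's actual API.

       // Hypothetical Apply Transformation Rules stage: run the configured SQL steps in order,
       // registering each result so step2/step3 can select from step1, and so on.
       import java.util.LinkedHashMap;
       import java.util.Map;
       import org.apache.spark.sql.Dataset;
       import org.apache.spark.sql.Row;
       import org.apache.spark.sql.SparkSession;

       public class TransformationStage {
           public static Map<String, Dataset<Row>> apply(SparkSession spark, Dataset<Row> original,
                                                         Map<String, String> rules) {   // insertion-ordered: step1, step2, ...
               original.createOrReplaceTempView("original_DF");
               Map<String, Dataset<Row>> results = new LinkedHashMap<>();
               for (Map.Entry<String, String> rule : rules.entrySet()) {
                   Dataset<Row> df = spark.sql(rule.getValue());    // e.g. the configured step2 aggregation
                   df.createOrReplaceTempView(rule.getKey());       // later steps can reference this view by name
                   results.put(rule.getKey(), df);
               }
               return results;
           }
       }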
  12. Apply Transformation Rules (continued)
     Along with Spark SQL functions, UDFs (user-defined functions) provide customized querying abilities. Some sample UDFs:
     • Decimalize: calculates the actual transaction amount; parameters: transaction amount, decimalization factor
       select decimalize(amount, decimal_factor) from DF;
     • Signage: applies rules to change the transaction amount to a positive or negative value; can be used for further aggregation
       select signage(amount, xx, xx…) from DF;
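     A sketch of how a UDF such as decimalize might be registered so the configured SQL steps can call it; the scaling logic shown is an assumption, not the actual business rule.

       // Hypothetical registration of a "decimalize" UDF: scale a raw amount by a decimalization factor.
       import java.math.BigDecimal;
       import org.apache.spark.sql.SparkSession;
       import org.apache.spark.sql.api.java.UDF2;
       import org.apache.spark.sql.types.DataTypes;

       public class ReportUdfs {
           public static void register(SparkSession spark) {
               spark.udf().register("decimalize",
                   (UDF2<Long, Integer, BigDecimal>) (amount, decimalFactor) ->
                       BigDecimal.valueOf(amount).movePointLeft(decimalFactor),   // e.g. 12345 with factor 2 -> 123.45
                   DataTypes.createDecimalType(18, 6));
               // After registration, configured SQL such as
               // "select decimalize(amount, decimal_factor) from DF" can use it.
           }
       }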
  13. Apply Template
     • Choose a template engine
     • The DataFrame needs to be transformed into a format that the template can refer to
     • Each DataFrame is converted: DataFrame -> Dataset<T> -> List<T>
     Relevant configuration: "report-template": ["sample-report.ftlh"]

     Template:
       <#list step2 as step2>
       State: ${step2.state}
       Count: ${step2.count}
       </#list>

     Data: List<T> step2, passed to the template engine

     Output:
       State: CA Count: 4
       State: MA Count: 3
       State: AZ Count: 3

     Template engine options considered: Velocity, Thymeleaf, StringTemplate, FreeMarker. We selected FreeMarker because:
     • It is a generic, general-purpose template engine
     • It is supported by the Apache Software Foundation (ASF) and widely used across Apache projects
     • It has frequent new releases; the latest release was in February 2021
     • Good documentation is available
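     A minimal sketch of the templating step with the FreeMarker Java API, assuming the List<T> built from a step DataFrame is exposed to the template under the step's name; the class and parameter names are illustrative.

       // Hypothetical Apply Template stage: render an externally supplied FreeMarker template
       // against the data collected from a step DataFrame.
       import java.io.File;
       import java.io.StringWriter;
       import java.util.HashMap;
       import java.util.List;
       import java.util.Map;
       import freemarker.template.Configuration;
       import freemarker.template.Template;

       public class ApplyTemplateStage {
           public static String render(File templateDir, String templateName,
                                       String stepName, List<?> stepRows) throws Exception {
               Configuration cfg = new Configuration(Configuration.VERSION_2_3_31);
               cfg.setDirectoryForTemplateLoading(templateDir);   // templates are provided outside the framework
               Template template = cfg.getTemplate(templateName); // e.g. "sample-report.ftlh"

               Map<String, Object> model = new HashMap<>();
               model.put(stepName, stepRows);                     // e.g. "step2" -> List<T> built from the DataFrame

               StringWriter out = new StringWriter();
               template.process(model, out);                      // produces the final report content
               return out.toString();
           }
       }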
  14. Success Metrics
     • Time to build a new report reduced from a month to a week
     • Consumers of the reports do not have to know the internal nuances of how a report is built
     • A highly skilled technical team is no longer needed to build a report
     • Processing performance of up to 10 million records per report achieved
  15. Thank you
  16. Find your place in technology on #TeamAmex https://jobs.americanexpress.com/tech
