Hadoop Live Project
Payment Gateway Data Analytics
Project Overview
• Domain: Payment Gateway, Finance (Visa, MasterCard, Amex).
• Clients: 2000+ (banks and credit unions).
• Duration: Phase 1 has 4 modules (2-year project).
• Cost/Revenue: 50 million USD/year (30% yearly growth).
• Data: 50-200 GB/day, 5 TB/month.
• Prod cluster: 50-70 nodes running on Dell/HP servers.
Project Execution Details
• Agile project scope details: user stories, Scrum cycles.
• 9 use cases covered in Phase 1.
• Technology stack details for each module.
• Implemented on a Linux VM-based Apache Hadoop cluster.
• Recorded sessions shared via Google Drive.
• Participants will receive the source code, DDL (database scripts),
execution scripts, and design docs for each module.
Phase 1: Data Transformation/Staging
• Analyze the payment data in XML and JSON form (from FTP and MQ jobs).
• Parse the XML data using a technology of choice (DOM, JAXB, etc.).
• Load the data into RDBMS tables in incremental mode (Oracle/MySQL
RAC cluster).
• Schedule the preprocessing jobs at fixed intervals (Quartz Java
scheduler for source 1: every 15 min; crontab for source 2: every
1 hour).
• Add a multithreaded/parallel processing model to handle large
volumes.
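The XML parsing step can be sketched with the JDK's built-in DOM parser. The element and attribute names below (`transaction`, `id`, `amount`, `source`) are hypothetical stand-ins, since the real payment feed schema is not shown here:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PaymentXmlParser {

    // Extracts <transaction id=".." amount=".." source=".."/> records from a
    // payment feed using the stdlib DOM parser; the schema is an assumption.
    public static List<String[]> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList txns = doc.getElementsByTagName("transaction");
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < txns.getLength(); i++) {
            Element t = (Element) txns.item(i);
            // One staged row per transaction element: id, amount, source.
            rows.add(new String[] {
                t.getAttribute("id"), t.getAttribute("amount"), t.getAttribute("source")
            });
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        String feed = "<payments>"
                + "<transaction id=\"t1\" amount=\"120.50\" source=\"visa\"/>"
                + "<transaction id=\"t2\" amount=\"75.00\" source=\"amex\"/>"
                + "</payments>";
        for (String[] row : parse(feed)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

In the staging flow, each parsed row would then be bound to a JDBC batch insert for the incremental RDBMS load; JAXB would replace this class if the team prefers schema-bound binding over DOM traversal.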
Phase 2: Data Migration
• Build a data migration flow from the RDBMS into Hadoop/Hive using
Apache Sqoop MapReduce jobs.
• Create import tables in Hive using Apache Sqoop features.
• Create Sqoop-to-Hive data import scripts with optimal tuning
parameters.
• Audit the data migrated into HDFS for archival.
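A Sqoop incremental import into Hive might look like the following command sketch. The connection string, table, and check column are placeholders, and `--num-mappers`/`--split-by` are the kind of tuning parameters the bullets above refer to:

```shell
# Incremental import of new payment rows from the RDBMS into a Hive table.
# Host, credentials, table, and column names below are placeholders.
sqoop import \
  --connect jdbc:mysql://rdbms-host:3306/payments \
  --username etl_user --password-file /user/etl/.db_pass \
  --table transactions \
  --hive-import --hive-table payments.transactions \
  --incremental append \
  --check-column txn_id \
  --last-value 0 \
  --num-mappers 8 \
  --split-by txn_id
```

In practice the `--last-value` would be fed from the previous run's saved state (or a Sqoop saved job), so each run picks up only rows newer than the last imported `txn_id`.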
Phase 3: Data Analytics System
• Design and execute Apache Hive/Impala/Pig analytic queries and
store the output in result tables.
• Execute Hive joins for complex queries involving multiple data
sets.
• Write UDFs for data normalization.
• Use Apache Sqoop scripts to export data from Hive back to the RDBMS.
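The normalization UDF's core logic can be sketched as a plain Java method; in the real project this body would sit inside a class extending Hive's UDF API. The normalization rules here (trimming, upper-casing, and mapping card-network spellings to canonical codes) are illustrative assumptions:

```java
public class NormalizeSource {

    // Normalizes free-form card-network strings ("visa ", "Master Card")
    // to canonical codes. In a Hive deployment this logic would live in the
    // evaluate() method of a UDF subclass; the mapping is an assumption.
    public static String evaluate(String raw) {
        if (raw == null) return null;
        String s = raw.trim().toUpperCase().replaceAll("\\s+", "");
        switch (s) {
            case "VISA":            return "VISA";
            case "MASTERCARD":      return "MC";
            case "AMEX":
            case "AMERICANEXPRESS": return "AMEX";
            default:                return "OTHER";
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate(" Master Card "));  // MC
        System.out.println(evaluate("amex"));           // AMEX
    }
}
```

Keeping the logic in a pure static method like this also makes the UDF unit-testable without a running Hive cluster.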
Phase 4: Data Visualization
• Visualize the output data in RDBMS tables using open-source (JFreeChart/
Google Charts) or commercial tools such as Tableau/QlikView.
• Create a bar-graph report showing trends in payment gateway
issues across different sources.
• Create a pie-chart report showing the distribution of payment gateway
issues across multiple RCAs (issue types).
• Use HiveServer2 to connect and generate live analytic results.
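The pie-chart report reduces to computing each RCA's share of total issues. That data-prep step can be sketched in plain Java before handing the percentages to JFreeChart or Google Charts; the RCA names and counts below are made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RcaDistribution {

    // Converts raw issue counts per RCA (root-cause category) into the
    // percentage slices a pie chart needs; preserves insertion order.
    public static Map<String, Double> toPercentages(Map<String, Long> counts) {
        long total = counts.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Double> slices = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            slices.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return slices;
    }

    public static void main(String[] args) {
        // Hypothetical issue counts pulled from the Hive/RDBMS result table.
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("TIMEOUT", 50L);
        counts.put("AUTH_DECLINE", 30L);
        counts.put("DUPLICATE_TXN", 20L);
        toPercentages(counts).forEach((rca, pct) ->
                System.out.printf("%s: %.1f%%%n", rca, pct));
    }
}
```

In the live setup, the `counts` map would instead be populated from a `GROUP BY rca` query run over HiveServer2's JDBC interface.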
Project Hardware and Deployment Details
• DEV -> TEST -> PROD life cycle in Hadoop projects (code movement,
deployment strategy, etc.).
• PROD environment details (cluster size, CPUs, RAM, storage,
network, server details, etc.).
• Best practices and lessons learnt in Hadoop cluster deployment.
• Key issues faced and the associated resolution approaches.
• Project support work after the prod launch.
