
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase


As one of the few closed-loop payment platforms, PayPal is uniquely positioned to provide merchants with insights aimed at identifying opportunities to grow and manage their business. PayPal processes billions of data events every day across our users, risk, payments, web behavior, and identity. We are motivated to use this data to build solutions that help our merchants maximize successful transactions (checkout conversion), better understand who their customers are, and find additional opportunities to grow and attract new customers.

As part of Merchant Data Analytics, we have built a platform that serves low-latency, scalable analytics and insights, leveraging established and emerging technologies to best realize returns on the many business objectives at PayPal.

Join us to learn how we leveraged platforms and technologies like Spark, Hive, Druid, Elasticsearch, and HBase to process large-scale data for impactful merchant solutions. We'll share the architecture of our data pipelines, some real dashboards, and the challenges involved.

Speakers
Kasiviswanathan Natarajan, Member of Technical Staff, PayPal
Deepika Khera, Senior Manager - Merchant Data Analytics, PayPal


PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase

  1. PayPal Merchant ecosystem using Spark, Hive, Druid, HBase & Elasticsearch
  2. Who We Are
     Deepika Khera
     • Big Data technologist for over a decade
     • Focused on building scalable platforms with the Hadoop ecosystem: MapReduce, HBase, Spark, Elasticsearch, Druid
     • Senior Engineering Manager, Merchant Analytics at PayPal
     • Contributed the Spark Streaming integration to Druid
     Kasi Natarajan
     • 15+ years of industry experience
     • Spark engineer, Merchant Analytics at PayPal
     • Builds solutions using Apache Spark, Scala, Hive, HBase, Druid, and Spark ML
     • Passionate about providing analytics at scale from Big Data platforms
  3. Agenda
     • PayPal Data & Scale
     • Merchant Use Case Review
     • Data Pipeline
     • Learnings – Spark & HBase
     • Tools & Utilities: Behavioral Driven Development; Data Quality Tool using Spark
     • BI with Druid & Tableau
  4. PayPal Data & Scale
  5. PayPal is more than a button: Loyalty, Faster Conversion, Reduction in Cart Abandonment, Credit, Customer Acquisition, APV Lift, Invoicing, Offers, CBT, Mobile, In-Store, Online
  6. PayPal Datasets: Social Media, Demographics, Marketing, Activity, Email, Application Logs, Invoice, Credit, Reversals, Disputes, CBT, Risk, Consumer, Merchants, Partners, Location, Payment Products, Transaction, Spending
  7. The power of our platform: dedicated to a customer-focused, high-performance, highly scalable, continuously available platform
     • PayPal operates one of the largest private clouds in the world*
     • 42 petabytes of data*
     • 200+ markets
     • 237M active customer accounts**
     • 7.6 billion payments in 2017**
     • 19M merchants
     • ~600 payments/second at peak*
     • One of the top five Kafka deployments in the world, handling over 200 billion messages per day
     • One of the largest Hadoop deployments in the world: a 1,600-node cluster with 230TB of memory and 78PB of storage, running 50,000 jobs per day
  8. Merchant Use Case Review
  9. Use Case Overview
     Insights
     • Revenue & transaction trends
     • Cross-border insights
     • Customer shopping segments
     • Product performance
     • Checkout funnel
     • Behavior analysis
     Marketing solutions
     • Help merchants engage their customers through a personalized shopping experience
     • Offers & campaigns
     • Shopper insights
     • Measuring effectiveness
     Served through PayPal Analytics (paypal.com), backed by the Merchant Data Platform:
     1. A fast processing platform crunching multi-terabytes of data
     2. A scalable, highly available, low-latency serving platform
  10. Technologies: Processing, Serving, Movement
  11. Merchant Analytics: the Merchant Data Platform serves pre-aggregated cubes and denormalized schemas for analytics
  12. Data Pipeline
  13. Data Pipeline Architecture: Data Sources → Data Ingestion (PayPal replication) → Data Lake → Data Processing → Data Serving (SQL) → Visualization (custom UI on web servers)
  14. Learnings – Spark & HBase
  15. Design Considerations for Spark: a best-practices checklist (a configuration sketch follows below)
     Data serialization
     • Use the Kryo serializer with SparkConf; it is faster and more compact
     • Tune the Kryo serializer buffer to hold large objects
     Garbage collection
     • Tuned the concurrent abortable preclean time from 10s to 30s to push out stop-the-world GC
     Memory management
     • Avoided using executors with too much memory
     Parallelism
     • Optimize the number of cores & partitions*
     Actions & transformations
     • Minimize shuffles on join() by broadcasting the smaller collection
     • Optimize wide transformations as much as possible*
     Caching & persisting
     • Clean up cached/persisted collections when they are no longer needed
     • Used the MEMORY_AND_DISK storage level for caching large datasets
     • Repartition data before persisting to HDFS for better performance in downstream jobs
     *Specific examples later
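A minimal sketch of the serialization and caching items above; the input path, app name, and buffer size are assumptions, not from the deck:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Kryo instead of Java serialization, with a larger buffer for big objects.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "512m") // tune to hold large objects

val spark = SparkSession.builder().config(conf).appName("merchant-analytics").getOrCreate()

// Cache a large dataset to memory, spilling to disk instead of recomputing it.
val events = spark.read.parquet("/data/events") // hypothetical path
events.persist(StorageLevel.MEMORY_AND_DISK)

// ... run several jobs over `events` ...

events.unpersist() // release the cache once it is no longer needed
```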
  16. Learnings: Spark job failures with fetch exceptions and long shuffle read times
     Observations
     • Executors spent a long time on shuffle reads, then timed out and terminated, failing the job
     • Resource constraints on the executor nodes delayed the executors
     Resolution: to address the memory constraints, tuned
     1. the executor config from 200 executors × 4 cores to 400 executors × 2 cores
     2. the executor memory allocation (reduced)
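In configuration terms the change looks roughly like this; the memory value is an assumption, since the deck does not state one:

```scala
import org.apache.spark.SparkConf

// More, smaller executors: 400 x 2 cores instead of 200 x 4 cores,
// each with a reduced memory allocation.
val conf = new SparkConf()
  .set("spark.executor.instances", "400")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "6g") // reduced; exact figure not given in the deck
```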
  17. Learnings: Hive parallelism for long-running jobs
     Observations
     • A series of left joins on large datasets caused shuffle exceptions
     Resolution
     1. Split the work into small jobs and run them in parallel: one job per look-back window (7, 30, 60, 90, and 180 days) over the time-series source, each joining the other data sources, with the results unioned in Hive across multiple partitions
     2. This gives faster reprocessing and fail-fast jobs (a sketch follows below)
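One way to express that split inside a single Spark application is with Scala futures, one per window; every table, column, and path name here is hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val spark = SparkSession.builder().getOrCreate()
val events    = spark.table("events")    // time-series source (hypothetical)
val merchants = spark.table("merchants") // smaller dimension to join with

// One small job per look-back window, each writing its own location,
// instead of one giant multi-join job that shuffles everything at once.
val windows = Seq(7, 30, 60, 90, 180)
val jobs = windows.map { days =>
  Future {
    events
      .filter(col("event_date") >= date_sub(current_date(), days))
      .join(broadcast(merchants), "merchant_id")
      .groupBy("merchant_id").agg(sum("amount").as("total_amount"))
      .write.mode("overwrite")
      .parquet(s"/data/merchant_agg/window_days=$days") // unioned later in Hive
  }
}
Await.result(Future.sequence(jobs), Duration.Inf) // fail fast if any window fails
```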
  18. Learnings: tuning between the Spark driver and executors
     Observations
     • The Spark driver was left with too many executor heartbeat requests to process, even after the job was complete
     • YARN killed the Spark job after waiting on the driver to finish processing the heartbeats
     Resolution
     • The setting spark.executor.heartbeatInterval was set too low; increasing it to 50s fixed the issue
     • Allocate more memory to the driver to handle overheads beyond the typical driver processes
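Shown here as SparkConf settings for brevity (driver memory is normally passed to spark-submit before the driver JVM starts); the timeout and memory values are assumptions:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "50s") // raised from the 10s default
  .set("spark.network.timeout", "600s")           // should stay well above the heartbeat interval
  .set("spark.driver.memory", "8g")               // extra driver headroom; value assumed
```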
  19. Learnings: optimize joins for efficient use of cluster resources (memory, CPU, etc.)
     Observation
     • With the default of 200 shuffle partitions, the join stage (read table 1, read table 2, start the join) ran with too many tasks, causing performance overhead
     Resolution
     • Reduce the spark.sql.shuffle.partitions setting to a lower threshold
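The setting itself is one line, assuming an existing SparkSession `spark`; the threshold and table names below are illustrative, not from the deck:

```scala
// Fewer, larger shuffle tasks for the join stage (the default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "64") // threshold is illustrative

val joined = spark.table("t1").join(spark.table("t2"), "c1") // hypothetical tables
```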
  20. Learnings: optimize wide transformations
     Observations
     • Expensive left outer joins, e.g. a 1-billion-row table (T1) left-joined to a 7-billion-row table (T2) and then filtered on whether T2 is NULL or NOT NULL
     • Results of the sub-joins were being sent back to the driver, causing poor performance
     Resolution
     • Convert expensive left joins into combinations of lighter-weight joins plus except/union:
       – T1 left join T2 on T1.C1 = T2.C1, filtered on T2 NOT NULL / NULL, rewritten as an inner join (the matching rows) plus an except (the non-matching rows)
       – T1 left join T2 (25 million rows each) on T1.C1 = T2.C1 OR T1.C2 = T2.C2, rewritten as two single-condition left joins (on C1, and on C2) unioned into the result (T3)
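A hedged DataFrame sketch of both rewrites; table and column names are hypothetical, and `left_anti` stands in for the slide's except step on the non-matching side:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val t1 = spark.table("t1") // hypothetical tables and columns
val t2 = spark.table("t2")

// Expensive: one left join with an OR condition cannot use a single equi-join plan.
val slow = t1.join(t2, t1("c1") === t2("c1") || t1("c2") === t2("c2"), "left_outer")

// Cheaper, per the slide: two plain equi-joins unioned together. This assumes a
// row matches at most one of the two conditions; otherwise the union duplicates it.
val fast = t1.join(t2, t1("c1") === t2("c1"), "left_outer")
  .union(t1.join(t2, t1("c2") === t2("c2"), "left_outer"))

// For the NULL / NOT NULL split, the "inner" and "left_anti" join types give the
// matching and non-matching sides directly, instead of filtering one big left join.
val matched   = t1.join(t2, t1("c1") === t2("c1"), "inner")
val unmatched = t1.join(t2, t1("c1") === t2("c1"), "left_anti")
```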
  21. Learnings: optimize throughput for the HBase–Spark connection
     Observations
     • Batch puts and gets were slow because HBase connections were overloaded
     • Since our HBase rows were wide, HBase operations for partitions containing larger groups were slow
     Resolution
     • Implemented a sliding window over each RDD partition for HBase operations, to reduce the connection overload: repartition the RDD, then for each partition process one window at a time, opening an HBase connection, performing the batch read or write, and closing the connection. Pseudocode from the slide, lightly corrected:

         val repartitionedRDD: RDD[Event] = filledRDD.repartition(2000)
         …
         groupedEventRDD.mapPartitions { p =>
           p.sliding(2000, 2000).foreach { window =>
             // create HBase connection
             // batch HBase read or write for `window`
             // close HBase connection
           }
           …
         }
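Filled out into a runnable shape, the write path might look like the following; the Event type, table name, column family, and window size are assumptions, and error handling is elided:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

case class Event(rowKey: String, qualifier: String, value: String) // hypothetical

def writeToHBase(events: RDD[Event]): Unit = {
  events.repartition(2000).foreachPartition { partition =>
    // Process each partition in fixed-size windows so no single batch
    // holds an overloaded HBase connection across too many operations.
    partition.sliding(2000, 2000).foreach { window =>
      val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("merchant_events")) // assumed name
      val puts = window.map { e =>
        new Put(Bytes.toBytes(e.rowKey))
          .addColumn(Bytes.toBytes("d"), Bytes.toBytes(e.qualifier), Bytes.toBytes(e.value))
      }
      table.put(java.util.Arrays.asList(puts: _*)) // one batched write per window
      table.close()
      conn.close()
    }
  }
}
```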
  22. Tools & Utilities
  23. Behavioral Driven Development (BDD)
     • While unit tests are more about the implementation, BDD emphasizes the behavior of the code
     • "Specifications" are written in pseudo-English
     • Enables testing at the external touch points of your application

     Feature: Identify the activity related to an event
     Scenario: Should perform an iteration on events, join to the activity table, and identify the activity name
       Given I have a set of events
         | cookie_id:String | page_id:String | last_actvty:String |
         | 263FHFBCBBCBV    | login_provide  | review_next_page   |
         | HFFDJFLUFBFNJL   | home_page      | provide_credent    |
       And I have an Activity table
         | last_activity_id:String | activity_id:String | activity_name:String |
         | review_next_page        | 1494300886856      | Reviewing Next Page  |
         | provide_credent         | 2323232323232      | Provide Credentials  |
       When I implement Event Activity joins
       Then the final result is
         | cookie_id:String | activity_id:String | activity_name:String |
         | 263FHFBCBBCBV    | 1494300886856      | Reviewing Next Page  |
         | HFFDJFLUFBFNJL   | 2323232323232      | Provide Credentials  |

     Step definitions (pseudocode from the slide, lightly corrected):

         import cucumber.api.scala.{EN, ScalaDsl}
         import cucumber.api.DataTable
         import org.scalatest.Matchers

         Given("""^I have a set of events$""") { data: DataTable =>
           eventdataDF = dataTableToDataFrame(data)
         }
         Given("""^I have an Activity table$""") { data: DataTable =>
           activityDataDF = dataTableToDataFrame(data)
         }
         When("""^I implement Event Activity joins$""") { () =>
           eventActivityDF = Activity.findAct(eventdataDF, activityDataDF)
         }
         Then("""^the final result is$""") { expectedData: DataTable =>
           val expectedDF = dataTableToDataFrame(expectedData)
           eventActivityDF.except(expectedDF).count shouldBe 0
         }
  24. Data Quality Tool
     • A config-driven automated tool written in Spark for quality control
     • Used extensively during functional testing of the application; once live, used as the quality check for our data pipeline
     • Can compare tables (schema-agnostic and at scale) for data validation, helping engineers troubleshoot effectively

     Quality Tool flow
     1. Define the source query, target query, and test operation in a config file (SQL format), e.g.:
        | Operation | Source Query | Target Query              | Key Column |
        | Count     | SELECT c1 …  | SELECT c1, c2, c3 …       | C1         |
        | Values    | SELECT c1 …  | SELECT c1, c2, c3 FROM t1 | C1         |
     2. A Spark job takes the config and runs the test cases against the source and output tables, producing reports and alerts
     Supported quality operations: Count, Aggregation, DuplicateRows, MissingInLookup, Values
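A hedged sketch of one config-driven check in the spirit of the slide; the case class, function name, and the exact semantics of each operation are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// One row of the config file: an operation plus the two queries to compare.
case class Check(operation: String, sourceSql: String, targetSql: String, keyCol: String)

def runCheck(spark: SparkSession, check: Check): Boolean = {
  val source = spark.sql(check.sourceSql)
  val target = spark.sql(check.targetSql)
  check.operation match {
    case "Count" => // both queries must return the same number of rows
      source.count() == target.count()
    case "Values" => // symmetric difference on the key column must be empty
      val s = source.select(check.keyCol)
      val t = target.select(check.keyCol)
      s.except(t).count() == 0 && t.except(s).count() == 0
    case "DuplicateRows" => // no key may appear more than once in the target
      target.groupBy(col(check.keyCol)).count().filter(col("count") > 1).count() == 0
    case other =>
      sys.error(s"unsupported operation: $other")
  }
}
```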
  25. Druid integration with BI: visualization at scale
     • Druid is an open-source time-series data store designed for sub-second queries on real-time and historical data; it is primarily used for business intelligence queries on event data*
     • Traditional databases did not scale and perform well enough for Tableau dashboards (for many use cases)
     • We enable Tableau dashboards with Druid as the serving platform
     • A live connection from Tableau to Druid avoids being limited by storage at any layer
     Architecture: our datasets are batch-ingested from Hadoop HDFS into the Druid cluster; historicals serve data segments backed by deep storage (HDFS), and the Druid broker exposes Druid SQL to SQL clients, custom apps, and visualization tools
     * from http://druid.io
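For clients other than Tableau, Druid SQL is reachable over JDBC through the broker via the Avatica driver; a hedged sketch, where the broker host, datasource, and columns are all assumptions:

```scala
import java.sql.DriverManager

// Druid's SQL endpoint on the broker, exposed over Avatica JDBC.
val url  = "jdbc:avatica:remote:url=http://druid-broker:8082/druid/v2/sql/avatica/"
val conn = DriverManager.getConnection(url)
try {
  val rs = conn.createStatement().executeQuery(
    """SELECT merchant_id, SUM(amount) AS total
      |FROM merchant_txn
      |WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
      |GROUP BY merchant_id""".stripMargin)
  while (rs.next()) println(s"${rs.getString(1)} -> ${rs.getDouble(2)}")
} finally conn.close()
```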
  26. Conclusion
     • Spark applications run on YARN (Hortonworks distribution)
     • Spark jobs were easy to write and performed well, though a little hard to troubleshoot
     • Spark–HBase optimization improved performance
     • Pushed pre-aggregated datasets to Elasticsearch
     • Pushed the lowest-granularity denormalized datasets to Druid
     • Behavior Driven Development is a great add-on for product-backed applications
  27. QUESTIONS?
