Building a Hadoop Powered Commerce Data Pipeline

Published in: Technology

  1. © 2014 PayPal Inc. All rights reserved. Confidential and proprietary. Building A Hadoop Powered Commerce Data Pipeline • Jay Tang • June 2014
  2. Introduction • Hadoop Data Platform • Data Repository • Data Processing • Data Service • Future Work
  3. About Me • Director of Big Data Platform & Analytics, PayPal - Hadoop, graph mining, real-time analytics, ML - Hadoop-based data pipeline • Member of the original Hadoop team @Yahoo • Built MPP data warehouse, relational database, and OLAP products @Yahoo, Oracle/Hyperion, IBM Informix, DB2
  4. PayPal • Enables online, offline, and mobile payments • 148M customers in 193 markets • $52B payment volume in Q1 2014 • 834M transactions in Q1 2014 • 1 in every 6 e-commerce dollars is spent via PayPal • Petabyte Data & Growing
  5. Big Data Powers PayPal Analytics • Detect and prevent fraud • Assess credit risk • Make relevant offers to customers • Improve user experience • Provide better merchant insights
  6. Hadoop Data Platform
  7. New Center of Gravity • Hadoop powers next generation data architecture • Data Lake • Data Hub • New Center of Gravity for data work • Data refining/ETL • Analytics • Data Science/Machine Learning • Enterprise Features – security, HA, management, integration
  8. Modern Data Architecture [diagram labels: Site, Internal App, Data As A Service, Data Repository & Processing, Data Service]
  9. Data Repository
  10. Data Central: A Central Data Store on Hadoop delivered as a Service to ease data analytics • All data under one roof on Hadoop • Single data store for both real-time and historical data • Organize the data for easy discovery and analytics • Harness the power of ALL DATA and reduce data fragmentation and data innovation cost • Reduce latency from raw data to actionable biz insights • Offload heavy lifting from other data platforms
  11. Data Central • Land RAW DATA on Hadoop • Structured and unstructured data • Reliability – fault tolerance/HA, SLA adherence • Quality – guaranteed data quality/integrity • Low latency – data pipeline latency measured in minutes • Analytics Ready – consumption-friendly format via standard tools
  12. Data Ingestion Architecture [diagram labels: R&D, Production, Staging DB, Site DB, Collection Tier, Ingestion Tier, SSL, Flume, Production Data Center, Analytical Data Center]
  13. Site To Hadoop Pipeline • Transaction tables in Oracle forklifted to Hadoop Hive tables • Continuous streaming via Flume • Latency measured in minutes • HA built into the architecture • Oracle grid • Flume pipeline • Dual load into two Hadoop clusters • System monitoring via Ganglia • Moving large amounts of data across data centers • Separate network topology for user traffic and data transfer traffic • Ensure there is sufficient WAN bandwidth
  14. Landing Data On Hadoop • Data quality check • Create Hive partitions from the minute-level data stream • Compact small-interval partitions into a daily partition • HCatalog integration • Pig/MR data access via HCatalog API • Plan to move to ORC file format • Developers familiar with Oracle schema can innovate quickly
  15. Data Processing
  16. Beyond Raw Data • Raw data on Hadoop is NOT optimized for analytics • Every downstream application must implement raw data decoding logic • Single upstream data change – all dependent applications must be updated simultaneously • Unlikely Scenario In A Big Enterprise
  17. Making Raw Data More Digestible • Create foundational ETL pipeline • Single abstraction to decode raw data • Raw Data → Analytics Ready Data [diagram labels: Data Central, Site, Log, 3rd Party, Foundation ETL, Analytics Ready Data]
  18. Analytics Ready Data -- Hive • Analytics Ready Data exposed as Hive tables • Leverage existing data transformation logic • Pre-join and pre-aggregate frequently accessed metrics • New time-based partition added periodically • HCatalog integration for Pig/MR • Hive table model fits nicely with insert-only data for batch access • Source tables with updates? • Quick access for a slice of data?
  19. Analytics Ready Data -- HBase [diagram labels: HDFS, RCFile, HCatalog, Merchant Data Store, Consumer Data Store]
  20. Merchant & Consumer Data Store • Data store for all information related to merchants and consumers • Point-in-time data representation • Data continuously added from Hive via MR jobs • Access time < 100 ms • Data latency < 1 hour • Commonly used metrics computed and stored in a time series • Sales by Time, Geo, Mobile/Web • Pluggable architecture -- new metrics can be quickly generated
  21. Connected Data • PayPal data form a graph • Infer new insights through mining entity relationships [diagram labels: User1, User2, Merchant, BUY, BUY, P2P Money Transfer]
  22. Titan Graph Database • Distributed, scalable, highly available open source graph database • Storing and querying graph data • Billions of nodes & edges • Sits on top of HBase
  23. PayPal Graph Database [diagram labels: HDFS, RCFile, HCatalog, BatchGraph, Giraph Input, Titan]
  24. Data Service
  25. Insights to Users • How to deliver insights to users & applications outside of Hadoop • Batch loading of insight data to application data store • Exposing Data As A Service • BI tools with direct Hadoop integration
  26. Batch Loading [diagram label: ETL Tool]
  27. Data As A Service [diagram labels: HADOOP, THRIFT, RESTFUL WEB SERVICE]
  28. Tool With Hadoop Integration -- Platfora [diagram labels: BIG DATA PROCESSING, DISTRIBUTED IN-MEMORY ACCELERATION, INTERACTIVE VISUAL ANALYTICS, RESTful Web Service]
  29. Hadoop Data Platform Architecture [diagram labels: Data Central, Graph Services, Search Service, Merchant Data Service, Consumer Data Service, Merchant Insights, Consumer Credit, Risk Mitigation, PP Site, Log, 3rd Party, Foundation ETL]
  30. Moving from Hadoop 1.0 to 2.0 • Two Hadoop 1.0 clusters – R&D and Production • Create a 3rd Production cluster running Hadoop 2.0 • Users certify applications against Hadoop 2.0 • Recompile code with Hadoop 2.0 • Some minor backward compatibility issues • Migrate jobs and input/output pipeline from 1.0 to 2.0 cluster over a period of time • Very hard to get everyone to migrate at the same time • Create data abstraction via HCatalog
  31. Future Work • In-memory computing • Spark/Shark • HDFS in-memory cache • Stream processing • Kafka • Storm
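
The "Landing Data On Hadoop" slide mentions creating Hive partitions from a minutes-level stream and compacting them into a daily partition. The deck does not show how this is done, so the following is only a minimal sketch of the planning step. It assumes a hypothetical partition naming scheme of `YYYYMMDDHHMM`; the function name and the naming scheme are illustrative and not taken from the talk.

```python
from collections import defaultdict
from datetime import datetime


def plan_daily_compaction(minute_partitions):
    """Group minute-level partition names by calendar day.

    Assumption: partitions are named 'YYYYMMDDHHMM' (hypothetical scheme).
    Returns a dict mapping 'YYYY-MM-DD' to the sorted list of small
    partitions that would be merged into that day's partition.
    """
    by_day = defaultdict(list)
    for name in minute_partitions:
        # strptime raises ValueError on a malformed name, which doubles
        # as a very basic data quality check.
        ts = datetime.strptime(name, "%Y%m%d%H%M")
        by_day[ts.strftime("%Y-%m-%d")].append(name)
    return {day: sorted(parts) for day, parts in sorted(by_day.items())}


if __name__ == "__main__":
    plan = plan_daily_compaction(
        ["201406010005", "201406010000", "201406020010"]
    )
    for day, parts in plan.items():
        print(day, "<-", parts)
```

In a real pipeline the plan would drive a job that rewrites the small files into one daily partition and then drops the minute-level ones; that step depends on the cluster setup and is left out here.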
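
The "Merchant & Consumer Data Store" slide describes commonly used metrics stored as time series (sales by time, geo, mobile/web) and a pluggable architecture for adding new metrics. As a rough illustration of what "pluggable" could mean, the sketch below registers each metric as a small function and aggregates transactions per merchant. Everything here is an assumption made for the example: the record fields (`merchant`, `ts`, `channel`, `amount_cents`), the registry, and the use of an in-memory dict in place of HBase.

```python
from collections import defaultdict

# Registry of metric name -> function that maps a transaction to a bucket key.
# Adding a new metric means registering one more function.
METRICS = {}


def metric(name):
    """Decorator that registers a bucketing function under a metric name."""
    def register(fn):
        METRICS[name] = fn
        return fn
    return register


@metric("sales_by_hour")
def by_hour(txn):
    # Truncate the epoch-seconds timestamp to the start of its hour.
    return txn["ts"] // 3600 * 3600


@metric("sales_by_channel")
def by_channel(txn):
    return txn["channel"]  # e.g. "mobile" or "web"


def aggregate(transactions):
    """Sum amounts per (merchant, metric name) and bucket key."""
    totals = defaultdict(lambda: defaultdict(int))
    for txn in transactions:
        for name, key_fn in METRICS.items():
            totals[(txn["merchant"], name)][key_fn(txn)] += txn["amount_cents"]
    return totals


if __name__ == "__main__":
    txns = [
        {"merchant": "m1", "ts": 3600, "channel": "mobile", "amount_cents": 500},
        {"merchant": "m1", "ts": 3700, "channel": "web", "amount_cents": 250},
    ]
    for key, buckets in aggregate(txns).items():
        print(key, dict(buckets))
```

Keying the totals by merchant and metric name loosely mirrors a row-per-entity layout, but the actual HBase schema is not described in the slides.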
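
The "Connected Data" slide shows users and a merchant linked by BUY and P2P money transfer edges, and says new insights can be inferred by mining entity relationships. The sketch below illustrates one such inference on a plain edge list: finding pairs of users who bought from the same merchant. It does not use the Titan API; the edge tuple format and function name are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def users_sharing_merchant(edges):
    """Return (user_a, user_b, merchant) for users who bought from the same merchant.

    Assumption: each edge is a (source, label, target) tuple, and a purchase
    is an edge labelled "BUY" from a user to a merchant.
    """
    buyers = defaultdict(set)
    for src, label, dst in edges:
        if label == "BUY":
            buyers[dst].add(src)

    pairs = set()
    for merchant, users in buyers.items():
        # Sort so each unordered pair is emitted once, in a stable order.
        for a, b in combinations(sorted(users), 2):
            pairs.add((a, b, merchant))
    return pairs


if __name__ == "__main__":
    edges = [
        ("User1", "BUY", "Merchant"),
        ("User2", "BUY", "Merchant"),
        ("User1", "P2P", "User2"),
    ]
    print(users_sharing_merchant(edges))
```

At the scale the deck mentions (billions of nodes and edges), this kind of traversal would run inside a graph database or a batch graph framework rather than in a single Python process.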
