SlideShare a Scribd company logo
1 of 27
PayPal Merchant ecosystem using Spark,
Hive, Druid, HBase & Elasticsearch
2
Who we are?
Deepika Khera
Kasi Natarajan
• Big Data Technologist for over a decade
• Focused on building scalable platforms with Hadoop ecosystem – Map Reduce,
HBase, Spark, Elasticsearch, Druid
• Senior Engineering Manager - Merchant Analytics at PayPal
• Contributed to Druid for the Spark Streaming integration
• 15+ years of industry experience
• Spark Engineer @PayPal Merchant Analytics
• Building solutions using Apache Spark, Scala, Hive, HBase, Druid and Spark ML.
• Passionate about providing Analytics at scale from Big Data platforms
3
Agenda
PayPal Data & Scale
Merchant Use Case Review
Data Pipeline
Learnings - Spark & HBase
Tools & Utilities
Behavioral Driven Development
Data Quality Tool using Spark
BI with Druid & Tableau
PayPal Data & Scale
4
PayPal is more than a button
Loyalty
Faster
Conversion
Reduction
in Cart
Abandonment
Credit
Customer
Acquisition
APV Lift
Invoicing
Offers
CBT Mobile In-Store Online
5
PayPal Datasets
6
Social
Media
Demo-
graphics
Marketing
Activity
Email Applicatio
n Logs
Invoice
Credit
Reversals
Disputes
CBT
Risk
Consumer
Merchants
Partners
Location
Payment
Products
Transactio
n
Spending
7
PayPal operates one of
the largest
PRIVATE
CLOUDSin the world*
petabytes
of data*
42markets active customer
accounts**
237M
payments in
2017**
7.6
BILLION
merchants
19Mpayments/
second at peak*
~60
0
our platform
Dedicated to with a
customer focused,
strong performance,
highly scalable,
continuously available
PLATFORM.
PayPal has one of the top five Kafka
deployments in the world, handling over
200 billion messages per day
200
+
PayPal operates one of the largest Hadoop
deployments in the world.
A 1600 Node Hadoop Cluster
with 230TB of Memory, 78PB of Storage
Running 50,000 Jobs Per day
The power of
Merchant Use Case Review
8
9
Use Case Overview
INSIGHTS MARKETING SOLUTIONS
• Help Merchants engage their
customers by personalized
shopping experience
• Offers & Campaigns
• Shoppers Insights
• Revenue & transaction trends
• Cross-Border Insights
• Customer Shopping Segments
• Products performance
• Checkout Funnel
• Behavior analysis
• Measuring effectiveness
PAYPAL ANALYTICS.com
Merchant Data Platform
1. Fast Processing platform crunching multi-terabytes of data
2. Scalable, Highly available, Low latency Serving platform
10
Technologies
Processing
Serving
Movement
11
Merchant Analytics
Merchant
Data
Platform
Pre-aggregated
Cubes
Denormalized
Schema
Analytics
Data Pipeline
12
13
Data Ingestion
PayPal
Replication
Data Lake
Data Processing
Data Serving
SQL
Data PipelineData Sources Visualization
Custom UI
Web
Servers
Data Pipeline Architecture
Learnings – Spark & HBase
14
Design Considerations for Spark
Data Serialization  Use Kyro Serializer with SparkConf, which is faster and compact
 Tune kyroserializer buffer to hold large objects
Garbage Collection
Memory Management
Parallelism
Action-Transformation
Spark Best Practices Checklist
Caching & Persisting
 Clean up cached/persisted collections when they are no longer needed
 Tuned concurrent abortable preclean time from 10sec to 30sec to push out stop the world GC
 Avoided using executors with too much memory
 Optimize number of cores & partitions*
 Minimize shuffles on join() by broadcasting the smaller collection
 Optimize wider transformations as much as possible*
 Used MEMORY_AND_DISK storage level for caching large
 Repartition data before persisting to HDFS for better performance in downstream jobs
*Specific examples later
© 2018 PayPal Inc. Confidential and proprietary. 15
Learnings
• Executor spends long time on shuffle reads. Then times out , terminates and results in job failure
• Resource constraints on executor nodes causing delay in executor node
Observations
Resolution
To address memory constraints, tuned
1. config from 200 executor * 4 cores to 400 executor * 2 cores
2. executor memory allocation (reduced)
16© 2018 PayPal Inc. Confidential and proprietary.
Spark job failures with Fetch Exceptions
Long shuffle read times
Learnings
• Series of left joins on large datasets cause shuffle exceptions
Observations
Resolution
1. Split into Small jobs and run them in parallel
2. Faster reprocessing and fail fast jobs
7day
30day
60day
90day
180da
y
Union Hive
7day
30day
60day
90day
180da
y
Job
Job1
Job2
Job3
Job4
Job5
(Multiple Partitions)
Time Series data source
Other Data sources to Join with
17© 2018 PayPal Inc. Confidential and proprietary.
Hive
Parallelism for long running jobs
Learnings
• Spark Driver was left with too many heartbeat requests to process even after the job was complete
• Yarn kills the Spark job after waiting on the Driver to complete processing the Heartbeats
• The setting “spark.executor.heartbeatInterval” was set too low. Increasing it to 50s fixed the issue
• Allocate more memory to Driver to handle overheads other than typical Driver processes
Resolution
Tuning between Spark driver and executors
Driver
Executor
Executor Heartbeats
Observations
18© 2018 PayPal Inc. Confidential and proprietary.
Executor
Executorheartbeats
Yarn RM
Waiting on Driver
With the default shuffle partitions of 200, the Join Stage was running with too many tasks causing performance
overhead
Observation
Resolution
Reduce the spark.sql.shuffle.partitions settings to a lower threshold
19© 2018 PayPal Inc. Confidential and proprietary.
Learnings
Optimize joins for efficient use of cluster resources (Memory, CPU etc..,)
Read
Table
1
Read
Table 2
Start
Join
Process
T2T1
20© 2018 PayPal Inc. Confidential and proprietary.
Learnings
Optimize wide transformations
Left Outer Join Left Outer Join with OR Operators
Resolution
• Convert expensive left joins to combination of light weight join and except/union etc..,
Results of the Sub-Joins are
being sent back to Driver causing
poor performance
left
join
T2 is NULL
7 billion rows1 billion rows
T2 is NOT NULL
T2T1
join
T2 is NOT NULL
except
T2 is NULL
rewritten as
T2T1
left join
OR
T3
25 million rows25 million
rows
T2T1
left
join
rewritten as
On T1.C1 = T2.C1 On T1.C2=T2.C2
On T1.C1 =
T2.C1
T2T1
left
join On T1.C2=T2.C2
union
T3
• Batch puts and gets slow due to HBase overloaded connections
• Since our HBase row was wide, HBase operations for partitions containing larger groups were slow
Observations
Resolution
• Implemented sliding window for HBase Operations to reduce HBase connection overload
21© 2018 PayPal Inc. Confidential and proprietary.
Learnings
Optimize throughput for HBase Spark Connection
Val rePartitionedRDD: RDD[Event] =
filledRDD.repartition(2000)
…..
…..
groupedEventRDD.mapPartitions( p =>
p.sliding(2000,2000)..foreach(
Create Hbase Connection
Batch Hbase Read or Write
Close Hbase Connection
)
)
Repartition
RDD
For each
RDD partition
Perform HBase
batch operation
For each Sliding
Window
Example
Tools & Utilities
22
Behavioral Driven Development
Feature : Identify the activity related to an event
Scenario: Should perform an iteration on events and join to activity table and identify
the activity name
Given I have a set of events
|cookie_id:String |page_id:String|last_actvty:String|
|263FHFBCBBCBV|login_provide |review_next_page|
|HFFDJFLUFBFNJL|home_page |provide_credent|
And I have a Activity table
|last_activity_id:String|activity_id:String|activity_name:String|
|review_next_page | 1494300886856 |Reviewing Next Page |
|provide_credent | 2323232323232 |Provide Credentials |
When I implement Event Activity joins
Then the final result is
|cookie_id:String |activity_id:String|activity_name:String|
|last_activity_id:String|activity_id:String|activity_name:String|
|263FHFBCBBCBV | 1494300886856 |Reviewing Next Page |
|HFFDJFLUFBFNJL | 2323232323232 |Provide Credentials |
• While Unit tests are more about the implementation, BDD emphasizes more on the behavior of the code
• Writing “Specifications” in pseudo-English.
• Enables testing at external touch-points of your application
© 2018 PayPal Inc. Confidential and proprietary. 23
import cucumber.api.scala.{EN, ScalaDsl}
import cucumber.api.DataTable
import org.scalatest.Matchers
Given("""^I have a set of events$""") { (data:DataTable) =>
eventdataDF = dataTableToDataFrame(data)
}
Given("""^I have a Activity table$""") { (data:DataTable) =>
activityDataDF = dataTableToDataFrame(data)
}
When("""^I implement Event Activity joins$"""){ () =>
eventActivityDF = Activity.findAct(eventdataDF, activityDataDF) } }
Then("""^the final result is $"""){ (expectedData:DataTable) =>
val expectedDf = dataTableToDataFrame(expectedData)
val resultDF = eventActivityDF
resultDF.except(expectedDF).count
pseudo code
Data Quality Tool
1. Define Source ,Target Query, Test Operation in Config file
Source Tables Output Table
2. Spark Job that takes the config and runs the test cases
Config in SQL
format
Operation Source
Query
Target
Query
Key Column
Count Select c1
…
Select c1,c2,c3
….
C1
Values Select
c1….…
Select c1,c2,c3
from t1
C1
……..
Reports
Alerts
Quality operations
Count Aggreagtion
DuplicateRows MissingInLookup
Values
© 2018 PayPal Inc. Confidential and proprietary.
24
• Config driven automated tool written in Spark for Quality Control
• Used extensively during functional testing of the application and once live, used as quality check for our data pipeline
• Feature to compare tables (schema agnostic and at Scale) for data validation and helping engineers troubleshoot effectively
Quality Tool Flow
Batch
Ingestion
Druid Integration with BI
• Druid is an open-source time series data store designed for sub-second queries on real-time and historical data. It is
primarily used for business intelligence queries on event data*
• Traditional Databases did not scale and perform with Tableau dashboards (for many use cases)
• Enable Tableau dashboards with Druid as the serving platform
• Live connection from tableau to druid avoids getting limited by storage at any layer.
*.from http://Druid.io
Our Datasets
Druid Cluster
Historicals serving Data
Segments
Druid
Broker
Deep Storage
(HDFS)
Visualization
Druid SQL
SQL Client
Custom App
© 2018 PayPal Inc. Confidential and proprietary.
25
Hadoop
HDFS
Visualization at scale
ConclusionConclusion
© 2018 PayPal Inc. Confidential and proprietary. 26
 Spark Applications on Yarn (Hortonworks distribution).
 Spark jobs were easy to write and had excellent performance (though little hard to troubleshoot)
 Spark-HBase optimization improved performance
 Pre-aggregated datasets to Elasticsearch
 Denormalized datasets to Druid
 Pushed lowest-granularity denormalized datasets to Druid
 Behavior Driven Development a great add-on for Product-backed applications
QUESTIONS?

More Related Content

What's hot

Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for ExperimentationGleb Kanterov
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 
When NOT to use Apache Kafka?
When NOT to use Apache Kafka?When NOT to use Apache Kafka?
When NOT to use Apache Kafka?Kai Wähner
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Databricks
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
 Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra... Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...HostedbyConfluent
 
Millions of Regions in HBase: Size Matters
Millions of Regions in HBase: Size MattersMillions of Regions in HBase: Size Matters
Millions of Regions in HBase: Size MattersDataWorks Summit
 
Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019
Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019
Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019VMware Tanzu
 
Real-time Data Streaming from Oracle to Apache Kafka
Real-time Data Streaming from Oracle to Apache Kafka Real-time Data Streaming from Oracle to Apache Kafka
Real-time Data Streaming from Oracle to Apache Kafka confluent
 
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...confluent
 
Batch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache FlinkBatch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache FlinkVasia Kalavri
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Hdp security overview
Hdp security overview Hdp security overview
Hdp security overview Hortonworks
 
Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Kent Graziano
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Apache Kafka as Data Hub for Crypto, NFT, Metaverse (Beyond the Buzz!)
Apache Kafka as Data Hub for Crypto, NFT, Metaverse (Beyond the Buzz!)Apache Kafka as Data Hub for Crypto, NFT, Metaverse (Beyond the Buzz!)
Apache Kafka as Data Hub for Crypto, NFT, Metaverse (Beyond the Buzz!)Kai Wähner
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
 

What's hot (20)

Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
When NOT to use Apache Kafka?
When NOT to use Apache Kafka?When NOT to use Apache Kafka?
When NOT to use Apache Kafka?
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
 Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra... Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
 
Millions of Regions in HBase: Size Matters
Millions of Regions in HBase: Size MattersMillions of Regions in HBase: Size Matters
Millions of Regions in HBase: Size Matters
 
Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019
Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019
Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019
 
Real-time Data Streaming from Oracle to Apache Kafka
Real-time Data Streaming from Oracle to Apache Kafka Real-time Data Streaming from Oracle to Apache Kafka
Real-time Data Streaming from Oracle to Apache Kafka
 
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
 
Batch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache FlinkBatch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache Flink
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Scaling HBase for Big Data
Scaling HBase for Big DataScaling HBase for Big Data
Scaling HBase for Big Data
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Hdp security overview
Hdp security overview Hdp security overview
Hdp security overview
 
Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Apache Kafka as Data Hub for Crypto, NFT, Metaverse (Beyond the Buzz!)
Apache Kafka as Data Hub for Crypto, NFT, Metaverse (Beyond the Buzz!)Apache Kafka as Data Hub for Crypto, NFT, Metaverse (Beyond the Buzz!)
Apache Kafka as Data Hub for Crypto, NFT, Metaverse (Beyond the Buzz!)
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 

Similar to PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase

HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineData Con LA
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analyticskgshukla
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Value Association
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Tech Triveni
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore IndexSolidQ
 
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...Deepak Chandramouli
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Big Data with SQL Server
Big Data with SQL ServerBig Data with SQL Server
Big Data with SQL ServerMark Kromer
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
Realtime Analytics on AWS
Realtime Analytics on AWSRealtime Analytics on AWS
Realtime Analytics on AWSSungmin Kim
 
SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15SnappyData
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...AboutYouGmbH
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeDatabricks
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 

Similar to PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase (20)

HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICS
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Nike tech talk.2
Nike tech talk.2Nike tech talk.2
Nike tech talk.2
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Big Data with SQL Server
Big Data with SQL ServerBig Data with SQL Server
Big Data with SQL Server
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Realtime Analytics on AWS
Realtime Analytics on AWSRealtime Analytics on AWS
Realtime Analytics on AWS
 
SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data Lake
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 

Recently uploaded (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 

PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase

  • 1. PayPal Merchant ecosystem using Spark, Hive, Druid, HBase & Elasticsearch
  • 2. 2 Who we are? Deepika Khera Kasi Natarajan • Big Data Technologist for over a decade • Focused on building scalable platforms with Hadoop ecosystem – Map Reduce, HBase, Spark, Elasticsearch, Druid • Senior Engineering Manager - Merchant Analytics at PayPal • Contributed to Druid for the Spark Streaming integration • 15+ years of industry experience • Spark Engineer @PayPal Merchant Analytics • Building solutions using Apache Spark, Scala, Hive, HBase, Druid and Spark ML. • Passionate about providing Analytics at scale from Big Data platforms
  • 3. 3 Agenda PayPal Data & Scale Merchant Use Case Review Data Pipeline Learnings - Spark & HBase Tools & Utilities Behavioral Driven Development Data Quality Tool using Spark BI with Druid & Tableau
  • 4. PayPal Data & Scale 4
  • 5. PayPal is more than a button Loyalty Faster Conversion Reduction in Cart Abandonment Credit Customer Acquisition APV Lift Invoicing Offers CBT Mobile In-Store Online 5
  • 6. PayPal Datasets 6 Social Media Demo- graphics Marketing Activity Email Applicatio n Logs Invoice Credit Reversals Disputes CBT Risk Consumer Merchants Partners Location Payment Products Transactio n Spending
  • 7. 7 PayPal operates one of the largest PRIVATE CLOUDSin the world* petabytes of data* 42markets active customer accounts** 237M payments in 2017** 7.6 BILLION merchants 19Mpayments/ second at peak* ~60 0 our platform Dedicated to with a customer focused, strong performance, highly scalable, continuously available PLATFORM. PayPal has one of the top five Kafka deployments in the world, handling over 200 billion messages per day 200 + PayPal operates one of the largest Hadoop deployments in the world. A 1600 Node Hadoop Cluster with 230TB of Memory, 78PB of Storage Running 50,000 Jobs Per day The power of
  • 8. Merchant Use Case Review 8
  • 9. 9 Use Case Overview INSIGHTS MARKETING SOLUTIONS • Help Merchants engage their customers by personalized shopping experience • Offers & Campaigns • Shoppers Insights • Revenue & transaction trends • Cross-Border Insights • Customer Shopping Segments • Products performance • Checkout Funnel • Behavior analysis • Measuring effectiveness PAYPAL ANALYTICS.com Merchant Data Platform 1. Fast Processing platform crunching multi-terabytes of data 2. Scalable, Highly available, Low latency Serving platform
  • 13. 13 Data Ingestion PayPal Replication Data Lake Data Processing Data Serving SQL Data PipelineData Sources Visualization Custom UI Web Servers Data Pipeline Architecture
  • 14. Learnings – Spark & HBase 14
  • 15. Design Considerations for Spark Data Serialization  Use Kyro Serializer with SparkConf, which is faster and compact  Tune kyroserializer buffer to hold large objects Garbage Collection Memory Management Parallelism Action-Transformation Spark Best Practices Checklist Caching & Persisting  Clean up cached/persisted collections when they are no longer needed  Tuned concurrent abortable preclean time from 10sec to 30sec to push out stop the world GC  Avoided using executors with too much memory  Optimize number of cores & partitions*  Minimize shuffles on join() by broadcasting the smaller collection  Optimize wider transformations as much as possible*  Used MEMORY_AND_DISK storage level for caching large  Repartition data before persisting to HDFS for better performance in downstream jobs *Specific examples later © 2018 PayPal Inc. Confidential and proprietary. 15
  • 16. Learnings • Executor spends long time on shuffle reads. Then times out , terminates and results in job failure • Resource constraints on executor nodes causing delay in executor node Observations Resolution To address memory constraints, tuned 1. config from 200 executor * 4 cores to 400 executor * 2 cores 2. executor memory allocation (reduced) 16© 2018 PayPal Inc. Confidential and proprietary. Spark job failures with Fetch Exceptions Long shuffle read times
  • 17. Learnings • Series of left joins on large datasets cause shuffle exceptions Observations Resolution 1. Split into Small jobs and run them in parallel 2. Faster reprocessing and fail fast jobs 7day 30day 60day 90day 180da y Union Hive 7day 30day 60day 90day 180da y Job Job1 Job2 Job3 Job4 Job5 (Multiple Partitions) Time Series data source Other Data sources to Join with 17© 2018 PayPal Inc. Confidential and proprietary. Hive Parallelism for long running jobs
  • 18. Learnings • Spark Driver was left with too many heartbeat requests to process even after the job was complete • Yarn kills the Spark job after waiting on the Driver to complete processing the Heartbeats • The setting “spark.executor.heartbeatInterval” was set too low. Increasing it to 50s fixed the issue • Allocate more memory to Driver to handle overheads other than typical Driver processes Resolution Tuning between Spark driver and executors Driver Executor Executor Heartbeats Observations 18© 2018 PayPal Inc. Confidential and proprietary. Executor Executorheartbeats Yarn RM Waiting on Driver
  • 19. With the default shuffle partitions of 200, the Join Stage was running with too many tasks causing performance overhead Observation Resolution Reduce the spark.sql.shuffle.partitions settings to a lower threshold 19© 2018 PayPal Inc. Confidential and proprietary. Learnings Optimize joins for efficient use of cluster resources (Memory, CPU etc..,) Read Table 1 Read Table 2 Start Join Process
  • 20. T2T1 20© 2018 PayPal Inc. Confidential and proprietary. Learnings Optimize wide transformations Left Outer Join Left Outer Join with OR Operators Resolution • Convert expensive left joins to combination of light weight join and except/union etc.., Results of the Sub-Joins are being sent back to Driver causing poor performance left join T2 is NULL 7 billion rows1 billion rows T2 is NOT NULL T2T1 join T2 is NOT NULL except T2 is NULL rewritten as T2T1 left join OR T3 25 million rows25 million rows T2T1 left join rewritten as On T1.C1 = T2.C1 On T1.C2=T2.C2 On T1.C1 = T2.C1 T2T1 left join On T1.C2=T2.C2 union T3
  • 21. • Batch puts and gets slow due to HBase overloaded connections • Since our HBase row was wide, HBase operations for partitions containing larger groups were slow Observations Resolution • Implemented sliding window for HBase Operations to reduce HBase connection overload 21© 2018 PayPal Inc. Confidential and proprietary. Learnings Optimize throughput for HBase Spark Connection Val rePartitionedRDD: RDD[Event] = filledRDD.repartition(2000) ….. ….. groupedEventRDD.mapPartitions( p => p.sliding(2000,2000)..foreach( Create Hbase Connection Batch Hbase Read or Write Close Hbase Connection ) ) Repartition RDD For each RDD partition Perform HBase batch operation For each Sliding Window Example
  • 23. Behavioral Driven Development Feature : Identify the activity related to an event Scenario: Should perform an iteration on events and join to activity table and identify the activity name Given I have a set of events |cookie_id:String |page_id:String|last_actvty:String| |263FHFBCBBCBV|login_provide |review_next_page| |HFFDJFLUFBFNJL|home_page |provide_credent| And I have a Activity table |last_activity_id:String|activity_id:String|activity_name:String| |review_next_page | 1494300886856 |Reviewing Next Page | |provide_credent | 2323232323232 |Provide Credentials | When I implement Event Activity joins Then the final result is |cookie_id:String |activity_id:String|activity_name:String| |last_activity_id:String|activity_id:String|activity_name:String| |263FHFBCBBCBV | 1494300886856 |Reviewing Next Page | |HFFDJFLUFBFNJL | 2323232323232 |Provide Credentials | • While Unit tests are more about the implementation, BDD emphasizes more on the behavior of the code • Writing “Specifications” in pseudo-English. • Enables testing at external touch-points of your application © 2018 PayPal Inc. Confidential and proprietary. 23 import cucumber.api.scala.{EN, ScalaDsl} import cucumber.api.DataTable import org.scalatest.Matchers Given("""^I have a set of events$""") { (data:DataTable) => eventdataDF = dataTableToDataFrame(data) } Given("""^I have a Activity table$""") { (data:DataTable) => activityDataDF = dataTableToDataFrame(data) } When("""^I implement Event Activity joins$"""){ () => eventActivityDF = Activity.findAct(eventdataDF, activityDataDF) } } Then("""^the final result is $"""){ (expectedData:DataTable) => val expectedDf = dataTableToDataFrame(expectedData) val resultDF = eventActivityDF resultDF.except(expectedDF).count pseudo code
  • 24. Data Quality Tool 1. Define Source ,Target Query, Test Operation in Config file Source Tables Output Table 2. Spark Job that takes the config and runs the test cases Config in SQL format Operation Source Query Target Query Key Column Count Select c1 … Select c1,c2,c3 …. C1 Values Select c1….… Select c1,c2,c3 from t1 C1 …….. Reports Alerts Quality operations Count Aggreagtion DuplicateRows MissingInLookup Values © 2018 PayPal Inc. Confidential and proprietary. 24 • Config driven automated tool written in Spark for Quality Control • Used extensively during functional testing of the application and once live, used as quality check for our data pipeline • Feature to compare tables (schema agnostic and at Scale) for data validation and helping engineers troubleshoot effectively Quality Tool Flow
  • 25. Batch Ingestion Druid Integration with BI • Druid is an open-source time series data store designed for sub-second queries on real-time and historical data. It is primarily used for business intelligence queries on event data* • Traditional Databases did not scale and perform with Tableau dashboards (for many use cases) • Enable Tableau dashboards with Druid as the serving platform • Live connection from tableau to druid avoids getting limited by storage at any layer. *.from http://Druid.io Our Datasets Druid Cluster Historicals serving Data Segments Druid Broker Deep Storage (HDFS) Visualization Druid SQL SQL Client Custom App © 2018 PayPal Inc. Confidential and proprietary. 25 Hadoop HDFS Visualization at scale
  • 26. ConclusionConclusion © 2018 PayPal Inc. Confidential and proprietary. 26  Spark Applications on Yarn (Hortonworks distribution).  Spark jobs were easy to write and had excellent performance (though little hard to troubleshoot)  Spark-HBase optimization improved performance  Pre-aggregated datasets to Elasticsearch  Denormalized datasets to Druid  Pushed lowest-granularity denormalized datasets to Druid  Behavior Driven Development a great add-on for Product-backed applications

Editor's Notes

  1. Today PayPal is much more than a button on a website. We have an extensive portfolio of products & services. Enabling CBT, easy Mobile & Web access, Credit Options to customers, Marketing solutions for merchants and many more help merchants grow their business and enable customers to safe digital commerce.
  2. All of these also translate into a rich set of data that PayPal has to inform strategic and operational decisions
  3. Concurrent Mark Sweep – If it doesn’t finish garbage collection, it starts stop the world GC. Tuned it from 10s ec to 30seconds CMSMaxAbortablePrecleanTime
  4. https://community.hortonworks.com/questions/44950/spark-memory-issue.html org.apache.spark.shuffle.MetadataFetchFailedException Running this job with 4 cores and 200 executors. Although there could be multiple reasons for delay like skewness in data . For us it turned out that the datanode that the executor was running on was busy , a lot of times this happened with nodes with limited capacity having more number of tasks in per executor theoretically puts more pressure on the executor where if there are memory constraints the chances of having an executor failure increases metafetchfailed happens usually due to executor failure or due to executor termination
  5. https://engineering.paypalcorp.com/confluence/display/EDS/Muse+Visitor+Count+Job+Split+Design
  6. Combinations
  7. Points The tool was completely customizable for each project The tool was build to be schema agnostic of the table and scalable to run on datasets of large size Report was generated on Match/Mismatch count by Key Columns like Product and Geography as needed