agenda
✓ datasapiens introduction
✓ our infrastructure journey
✓ our observations
✓ a reproducible example using TPC-DS
✓ general implications for cloud data lake architectures
datasapiens introduction
about datasapiens
3 products
15 countries
over 100m weekly transactions
DATA SCIENCE
BUSINESS
EXPERIENCE
TECHNOLOGY
our tools help to turn data into profit
BUSINESS INSIGHTS
Drive your business through data.
Our BI with an embedded insights framework helps you to become a data-driven company.
BRANDS INSIGHTS
Commercialize your data assets.
We provide an industry-leading BI platform to drive cooperation with your suppliers.
internal users, suppliers
PERSONALISED OFFERS
Increase customer engagement, sales and margin.
We introduce a simple plan-do-review process to continuously measure and improve the loyalty lifecycle of your customers.
customers
some users of our products
our infrastructure journey
data platform: 1st generation
✓ simple to set up and use
✓ storage-compute coupled architecture
✓ long query compilation times
✓ vendor lock-in
data platform: 2nd generation, version 1
✓ storage-compute separation benefits
✓ faster performance due to reduced compilation times
✓ cost decrease below expectations due to increased S3 API costs
data platform: 2nd generation, version 2
✓ co-located Alluxio-Presto clusters
✓ faster and more consistent performance
✓ S3 API cost decreased to a negligible level
our observations
Performance tests on a Presto cluster
✓ Docker Swarm cluster as a Docker stack deployment
✓ various levels of usage concurrency
✓ cluster setup
▪ master node (1x m5.2xlarge)
▪ worker nodes (10x c5.4xlarge)
▪ Docker stack services
• PrestoSQL v317
• Apache Hive v2.3.5
Drastic increase of S3 API costs
✓ S3 API costs during the performance tests
Performance tests – per day stats
✓ query costs
Date        Total query count  Total S3 costs ($)  Total S3 request count  Avg. costs per query ($)  Avg. S3 requests per query
2019-08-28             15,086              172.73             197,549,860                   0.01145                      13,095
2019-08-29             29,997              124.87             138,210,969                   0.00416                       4,607
2019-08-30              6,394               26.27              22,638,580                   0.00411                       3,541
2019-08-31             36,815               80.47              66,406,798                   0.00219                       1,804
2019-09-01             13,939               62.54              56,195,976                   0.00449                       4,032
2019-09-03                752                4.72               3,331,228                   0.00628                       4,430
2019-09-06              6,806               28.59              18,005,207                   0.00420                       2,645
2019-09-09             75,687              231.74             175,443,269                   0.00306                       2,318
Total                 185,476              731.93             677,781,887                   0.00395                       3,654
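The two averages in the Total row follow directly from the other totals. A minimal sketch of that arithmetic, using only figures from the table (the variable and function names are illustrative, not from the slides):

```python
# Totals copied from the per-day query cost table.
total_queries = 185_476
total_s3_cost_usd = 731.93
total_s3_requests = 677_781_887


def per_query(total, query_count):
    """Average of a total over a number of executed queries."""
    return total / query_count


avg_cost_per_query = per_query(total_s3_cost_usd, total_queries)
avg_requests_per_query = per_query(total_s3_requests, total_queries)

print(f"avg. cost per query:        ${avg_cost_per_query:.5f}")
print(f"avg. S3 requests per query: {avg_requests_per_query:,.0f}")
```

Rounded to the table's precision, this gives 0.00395 dollars and 3,654 requests per query.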
Performance tests – per day stats
✓ infrastructure costs
Date        Total S3 costs ($)  Total EC2 costs ($)  S3 API costs - % of total costs
2019-08-28              172.73                45.25                           79.24%
2019-08-29              124.87               107.21                           53.80%
2019-08-30               26.27                27.18                           49.15%
2019-08-31               80.47                29.42                           73.23%
2019-09-01               62.54                23.69                           72.53%
2019-09-03                4.72                11.47                           29.15%
2019-09-06               28.59                73.05                           28.13%
2019-09-09              231.74               102.97                           69.24%
Total                   731.94               420.24                           63.53%
Alluxio as a remedy
✓ co-located Alluxio-Presto cluster
✓ faster and more consistent performance
✓ S3 API cost dropped to below $1/day
a reproducible example using TPC-DS
Cluster setup
✓ Alluxio-Presto cluster
▪ Hardware setup
• master node (1x m5.2xlarge)
• worker nodes (6x c5.4xlarge)
▪ Software deployment setup
• Alluxio 2.2.0
• PrestoSQL 337
✓ EMR-Presto cluster
▪ Hardware setup
• master node (1x m5.2xlarge)
• worker nodes (6x c5.4xlarge)
▪ Software deployment setup
• PrestoDB 0.227
Test setup
✓ dataset
▪ TPC-DS dataset with scale factor 100
▪ stored in an S3 bucket
✓ query execution
▪ set of queries:
• TPC-DS suite excluding query no. 72
▪ query executions:
• number of repeats: 10
• concurrency level: 1
✓ measurement
▪ Alluxio:
• logical operations: 'File Infos Got'
• RPC invocations: 'GetFileInfo Operations'
▪ S3:
• total request counts per request type
• total request costs per request type
Results from the Alluxio-Presto cluster
✓ 10 most API-request-expensive queries
Query name  File Infos Got - avg  GetFileInfo Operations - avg
q14_1                  159,200.1                     127,576.9
q09                    137,031.0                     109,669.0
q14_2                  110,933.8                      88,732.6
q75                    101,468.4                      81,166.3
q64                     75,148.3                      60,099.4
q88                     73,224.0                      58,584.0
q23_1                   61,313.6                      49,054.3
q23_2                   60,566.2                      48,457.6
q95                     56,518.0                      45,212.0
q28                     54,810.0                      43,866.0
Results from the Alluxio-Presto cluster
✓ cumulative request counts
Operation type          Cumulative count
File Infos Got                24,089,740
GetFileInfo Operations        19,287,627
✓ S3 API costs for caching the dataset into Alluxio
Request type  Cumulative count  Cumulative cost ($)
ListBucket              28,324                 0.14
GetObject               24,033                 0.01
HeadObject              44,581                 0.02
Total                   96,938                 0.17
Results from the EMR-Presto cluster
✓ cumulative request counts
Request type  Cumulative count  Cumulative cost ($)
ListBucket           5,771,219                28.86
GetObject           29,254,280                11.70
HeadObject             133,888                 0.05
Total               35,159,387                40.61
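The cumulative costs in both request tables can be reproduced from the request counts. The sketch below assumes S3 Standard request prices of $0.005 per 1,000 list requests and $0.0004 per 1,000 GET/HEAD requests; the slides do not state the prices used, so these rates are an assumption that happens to match the reported figures.

```python
# ASSUMPTION: S3 Standard request prices in USD per 1,000 requests.
# These rates are not given on the slides.
PRICE_PER_1000_USD = {
    "ListBucket": 0.005,
    "GetObject": 0.0004,
    "HeadObject": 0.0004,
}

# Cumulative request counts copied from the two result tables.
EMR_PRESTO_COUNTS = {"ListBucket": 5_771_219, "GetObject": 29_254_280, "HeadObject": 133_888}
ALLUXIO_PRESTO_COUNTS = {"ListBucket": 28_324, "GetObject": 24_033, "HeadObject": 44_581}


def request_costs(counts):
    """Cost in USD per request type for the given request counts."""
    return {name: n / 1000 * PRICE_PER_1000_USD[name] for name, n in counts.items()}


emr_costs = request_costs(EMR_PRESTO_COUNTS)
alluxio_costs = request_costs(ALLUXIO_PRESTO_COUNTS)

for label, costs in (("EMR-Presto", emr_costs), ("Alluxio-Presto", alluxio_costs)):
    for name, cost in costs.items():
        print(f"{label:15s} {name:11s} {cost:8.2f}")
    print(f"{label:15s} {'Total':11s} {sum(costs.values()):8.2f}")
```

Under these assumed rates the list requests dominate the EMR-Presto bill even though GetObject requests are about five times more numerous, because a list request is priced 12.5 times higher.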
Per-query cost estimations
✓ 10 most API-request-expensive queries
Query name  S3 API cost ($)
q14_1                0.2684
q09                  0.2310
q14_2                0.1870
q75                  0.1711
q64                  0.1267
q88                  0.1234
q23_1                0.1034
q23_2                0.1021
q95                  0.0953
q28                  0.0924
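The slides do not say how the per-query estimates were derived. One apportionment that reproduces them is to spread the EMR-Presto S3 API cost (40.61 dollars) across queries in proportion to each query's average 'File Infos Got' count on the Alluxio-Presto cluster. The sketch below is that assumption only, not a confirmed description of the authors' method; the function name is illustrative.

```python
# Totals copied from the result tables.
EMR_S3_API_COST_USD = 40.61        # EMR-Presto cluster, cumulative
TOTAL_FILE_INFOS_GOT = 24_089_740  # Alluxio-Presto cluster, cumulative

# Average 'File Infos Got' per execution for the ten most expensive queries.
FILE_INFOS_GOT_AVG = {
    "q14_1": 159_200.1, "q09": 137_031.0, "q14_2": 110_933.8, "q75": 101_468.4,
    "q64": 75_148.3, "q88": 73_224.0, "q23_1": 61_313.6, "q23_2": 60_566.2,
    "q95": 56_518.0, "q28": 54_810.0,
}


def estimate_query_cost(file_infos_got):
    """ASSUMED method: share of the total cost proportional to File Infos Got."""
    return file_infos_got * EMR_S3_API_COST_USD / TOTAL_FILE_INFOS_GOT


estimates = {name: estimate_query_cost(n) for name, n in FILE_INFOS_GOT_AVG.items()}

for name, cost in estimates.items():
    print(f"{name:6s} {cost:.4f}")
```

Rounded to four decimals, the output agrees with every row of the table, which suggests the estimates represent what each query would cost in S3 API requests without the Alluxio layer.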
Costs comparison
✓ 10 most API-request-expensive queries
✓ S3 API costs form 0.58% of total costs when using Alluxio
✓ S3 API costs form 48.83% of total costs when not using Alluxio
Cluster                 Infrastructure costs ($)  S3 API costs ($)
Alluxio+Presto cluster                     29.02              0.17
EMR+Presto cluster                         42.55             40.61
general implications for cloud data lake architectures
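The two percentages quoted above are consistent with treating total costs as infrastructure costs plus S3 API costs. A short sketch under that assumption (the function name is illustrative):

```python
def s3_share_percent(infrastructure_cost_usd, s3_api_cost_usd):
    """S3 API costs as a percentage of infrastructure + S3 API costs."""
    total = infrastructure_cost_usd + s3_api_cost_usd
    return 100 * s3_api_cost_usd / total


# Figures copied from the costs comparison table.
alluxio_presto_share = s3_share_percent(29.02, 0.17)
emr_presto_share = s3_share_percent(42.55, 40.61)

print(f"Alluxio+Presto: {alluxio_presto_share:.2f}%")
print(f"EMR+Presto:     {emr_presto_share:.2f}%")
```

The same formula, applied to the S3 and EC2 columns of the per-day infrastructure table, matches its percentage column, for example 79.24% for 2019-08-28.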
Use an intermediate storage layer
✓ pricing for storage across cloud providers is similar
✓ common analytical workloads are far more storage/compute intense than in our example
✓ no intermediate data storage layer will lead to higher costs
Q&A

Reducing large S3 API costs using Alluxio at Datasapiens
