Scaling up data science applications
How switching to Spark improved performance, reliability, and reduced cost
Kexin Xie
Director, Data Science
kexin.xie@salesforce.com
@realstraw
Yacov Salomon
VP, Data Science
ysalomon@salesforce.com
Marketers want to find more customers like their loyal customers: Lookalikes
Model: Naive Bayes framework
Feature selection: Linear Discriminant Analysis
Science / Art: correct for autocorrelation in the feature space (paper pending)
Prepare → Train → Classify

Prepare:   10^9 users, 10^5 segments, 10^5 features   O(n)     → ~10^14 rows
Train:     ~10^14 rows, 10^5 segments                 O(n^2)   → 10^5 models
Classify:  ~10^14 rows, 10^5 models                   O(nm)    → 10^9 scores
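Back-of-the-envelope check: roughly 10^9 users x 10^5 segments/features means the prepare step alone emits on the order of 10^9 x 10^5 = 10^14 (user, feature) rows, which is why off-the-shelf, single-machine tooling such as scikit-learn was never an option here.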
What grew with scale: # jobs, # failures, cost
Where the pain came from: MapReduce, disk, complexity
Number of features
Total population
Segment populations
Segment population overlap
// Number of features
userSegments
  .flatMap(_.segments)
  .distinct
  .count

// Total population
userSegments.count

// Segment populations
userSegments
  .flatMap(r => r.segments.map(_ -> 1L))
  .reduceByKey(_ + _)

// Segment population overlap: self-join the (userId, segment) pairs
val userSegmentPairs = userSegments
  .flatMap(r => r.segments.map(r.userId -> _))

userSegmentPairs
  .join(userSegmentPairs)
  .map { case (_, (feat1, feat2)) => (feat1, feat2) -> 1L }
  .reduceByKey(_ + _)
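Once these four counts exist, they plug into a Naive Bayes score. The sketch below only illustrates that step; it is not the production model from the talk (which adds Linear Discriminant Analysis feature selection and the autocorrelation correction), and all names and the Laplace smoothing choice are assumptions.

// Score one user against one segment from the four counts above (illustrative only)
def naiveBayesLogScore(
    userFeatures: Set[String],
    numFeatures: Long,            // number of features
    totalPop: Long,               // total population
    segmentPop: Long,             // this segment's population
    overlap: Map[String, Long]    // segment/feature overlap counts for this segment
): Double = {
  val prior = math.log(segmentPop.toDouble / totalPop)
  // Laplace-smoothed per-feature log-likelihoods
  val likelihood = userFeatures.toSeq.map { f =>
    math.log((overlap.getOrElse(f, 0L) + 1.0) / (segmentPop + numFeatures))
  }.sum
  prior + likelihood
}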
Reality: Data in many S3 prefixes/folders
val inputData = Seq(
  "s3://my-bucket/some-path/prefix1/",
  "s3://my-bucket/some-path/prefix2/",
  "s3://my-bucket/some-path/prefix3/",
  ...
  "s3://my-bucket/some-path/prefix2000/"
)
How about this?
val myRdd = inputData
.map(sc.textFile)
.reduceLeft(_ ++ _)
Or this?
val myRdd = sc.union(inputData.map(sc.textFile))
Solution
import scala.collection.JavaConverters._
import scala.io.Source
import scala.util.Random
import com.amazonaws.services.s3.AmazonS3Client

// get the s3 objects on the driver (for listings of more than 1000 keys,
// page through the result with listNextBatchOfObjects)
val s3Objects = new AmazonS3Client()
  .listObjects("my-bucket", "some-path")
  .getObjectSummaries()
  .asScala
  .map(_.getKey())
  .filter(hasPrefix1to2000)

// send the keys to slave nodes and retrieve the content there
val myRdd = sc
  .parallelize(Random.shuffle(s3Objects.toSeq), parallelismFactor)
  .flatMap { key =>
    Source
      .fromInputStream(
        new AmazonS3Client().getObject("my-bucket", key).getObjectContent
      )
      .getLines
  }
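A possible refinement, not on the original slides: reuse a single S3 client per partition with mapPartitions instead of constructing one per key, and derive the parallelism factor from the cluster size as the talk suggests (about 3x the total number of cores). Everything else is the same assumption as above.

// One S3 client per partition; parallelismFactor roughly 3x total cores
val parallelismFactor = 3 * sc.defaultParallelism
val myRdd = sc
  .parallelize(Random.shuffle(s3Objects.toSeq), parallelismFactor)
  .mapPartitions { keys =>
    val s3 = new AmazonS3Client()  // created on the executor, reused for the whole partition
    keys.flatMap { key =>
      Source
        .fromInputStream(s3.getObject("my-bucket", key).getObjectContent)
        .getLines
    }
  }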
Reality: Large Scale Overlap
// Naive approach: self-join every (userId, segment) pair on the user id
val userSegmentPairs = userSegments
  .flatMap(r => r.segments.map(r.userId -> _))

userSegmentPairs
  .join(userSegmentPairs)
  .map { case (_, (feat1, feat2)) => (feat1, feat2) -> 1L }
  .reduceByKey(_ + _)
Worked example. Input (user → segments):

  user1  a, b, c
  user2  a, b, c
  user3  a, b, c
  user4  a, c
  user5  a, c

Step 1: flatMap to (user, segment) pairs, 13 rows in total (user1 a, user1 b, user1 c, ..., user5 c).
Step 2: self-join on the user id and map each match to a segment pair, 11 rows in total (user1 a b, user1 a c, user1 b c, ..., user5 a c), each emitted as ((seg1, seg2), 1).
Step 3: reduceByKey yields the overlaps:

  a b  3
  a c  5
  b c  3
The condensed version of the same example: group users by their exact segment set first, count them, and key everything by a hash of that set.

  hash1 = hash({a, b, c})  3 users
  hash2 = hash({a, c})     2 users

Expanding each group to (hash, (segment, count)) gives only 5 rows (hash1 a 3, hash1 b 3, hash1 c 3, hash2 a 2, hash2 c 2). Self-joining on the hash and reducing produces the same overlaps from far less data:

  hash1: (a, b) 3, (a, c) 3, (b, c) 3
  hash2: (a, c) 2

  a b  3
  a c  5
  b c  3
Solution
// Reduce the user space: count the users that share exactly the same segment
// set, then key each (segment, count) by a hash of that set
val aggrUserSegmentPairs = userSegments
  .map(r => r.segments -> 1L)
  .reduceByKey(_ + _)
  .flatMap { case (segments, count) =>
    segments.map(s => hash(segments) -> (s, count))
  }

// Self-join on the hash instead of the user id
aggrUserSegmentPairs
  .join(aggrUserSegmentPairs)
  .map { case (_, ((seg1, count), (seg2, _))) =>
    (seg1, seg2) -> count
  }
  .reduceByKey(_ + _)
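The hash function is not defined in the deck; a minimal sketch of one possible choice, assuming the segment set arrives as a Seq[String] that can be put into a canonical (sorted) order:

import scala.util.hashing.MurmurHash3

// Hash a segment set in an order-independent way by sorting first
def hash(segments: Seq[String]): Int =
  MurmurHash3.orderedHash(segments.sorted)

Note that a 32-bit hash admits collisions; using the sorted segment list itself as the join key would be collision-free at the cost of larger keys.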
Reality: Perform Join on Skewed Data
  data1 (unique keys)      data2 (skewed foreign key)
  user1  a                 user1  one
  user2  b                 user1  two
  user3  c                 user1  three
  user4  d                 user1  four
  user5  e                 user1  five
                           user1  six
                           user3  seven
                           user3  eight
                           user4  nine
                           user5  ten

  data1.join(data2)   ✗  (fails: most of data2 shares the single key user1)
With a plain data1.join(data2), all rows that share a key are shuffled to the same executor:

  Executor 1: user1 a, plus every user1 row from data2 (one, two, three, four, five, six)
  Executors 2 and 3: the remaining keys (user2 b, user3 c, user4 d, user5 e) and their handful of data2 rows (user3 seven, user3 eight, user4 nine, user5 ten)
Salting the key spreads user1's data2 rows across executors, with data1's single row for user1 replicated once per salt value:

  Executor 1: (user1, salt1) a    (user1, salt1) one, two
  Executor 2: (user1, salt2) a    (user1, salt2) three, four
  Executor 3: (user1, salt3) a    (user1, salt3) five, six
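The deck only illustrates salting, so the code below is a minimal sketch of one way to do it, not the production implementation: replicate each data1 row once per salt value, tag each data2 row with a random salt, join on the salted key, then drop the salt.

// Key salting sketch (illustrative); numSalts and the random salting are assumptions
val numSalts = 3
val saltedData1 = data1.flatMap { case (k, v) =>
  (0 until numSalts).map(salt => (k, salt) -> v)      // replicate the small side
}
val saltedData2 = data2.map { case (k, v) =>
  (k, scala.util.Random.nextInt(numSalts)) -> v       // spread the skewed side
}
val joined = saltedData1
  .join(saltedData2)
  .map { case ((k, _), vs) => k -> vs }               // drop the salt again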
The hybrid approach: broadcast only the skewed slice of data1 (user1 → a) and match user1's data2 rows (one ... six) against it locally, then run a normal join for everything else (user2 b, user3 c, user4 d, user5 e against user3 seven, user3 eight, user4 nine, user5 ten).
Solution
// Find the most skewed keys in data2 (here: the top 10 by row count)
val topKeys = data2
  .mapValues(_ => 1L)
  .reduceByKey(_ + _)
  .takeOrdered(10)(Ordering.by[(String, Long), Long](_._2).reverse)
  .map(_._1)
  .toSet

// Broadcast the small slice of data1 that carries the skewed keys
val topData1 = sc.broadcast(
  data1.filter(r => topKeys.contains(r._1)).collect.toMap
)
val bottomData1 = data1.filter(r => !topKeys.contains(r._1))

// Map-side join for the skewed keys, regular join for the rest
val topJoin = data2.flatMap { case (k, v2) =>
  topData1.value.get(k).map(v1 => k -> (v1, v2))
}
topJoin ++ bottomData1.join(data2)
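The speaker notes call this a broadcast hash join, which Spark SQL can also do with an explicit hint. For comparison, a DataFrame sketch, assuming df1 and df2 are DataFrame versions of data1 and data2 with a userId column (these names are not in the original deck):

import org.apache.spark.sql.functions.broadcast

// Spark SQL builds a hash table from the broadcast side and streams df2 past it
val joined = df2.join(broadcast(df1), Seq("userId"))

This only works when the whole broadcast side fits in executor memory; the hybrid code above broadcasts just the skewed slice.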
Smarter retrieval of data from S3
Condensed overlap algorithm
Hybrid join algorithm
Clients with more than 2000 S3 prefixes/folders: before 5 hours, after 20 minutes
100x faster and 10x less data for segment overlap
Able to process joins for highly skewed data
Hadoop to Spark: maintainable codebase
Before vs. after: failure rate, performance, cost

Editor's Notes

  • #2 Good morning everyone, I am Kexin, and with me is my colleague Yacov. We are here today to share our experience using Spark as the production engine for large-scale data science applications. We joined Salesforce at the end of last year through the acquisition of a startup called Krux.
  • #3 I'm sure everyone here has heard of Salesforce, the CRM enterprise software company that shuts down San Francisco once a year. But here are some facts you are probably less familiar with. Salesforce is the fastest growing enterprise software company in the world, projected to cross 10B in revenue this year, and it does not look like it's going to slow down anytime soon.
  • #4 This growth is due to the fact that Salesforce nowadays is not just a B2B CRM system. We have one of the largest e-commerce platforms, the leading service platform, IoT, and of course marketing. And with the introduction of Einstein, an Artificial Intelligence platform deeply integrated into all the products, Salesforce is the smartest CRM system out there.
  • #5 We both work on the Salesforce DMP, formerly known as Krux. DMP is short for data management platform. For our clients we securely collect, store, unify, analyze and activate people data. Think of users visiting a brand's site, being exposed to marketing campaigns and ads, interacting with the brand on social media or in the app, purchasing products, and so on.
  • #6 Our clients are some of the largest publishers and marketers in the world. And thanks to these top clients, we process and analyze a big portion of Internet traffic. Let's put some numbers to this statement.
  • #7 Here is what happens within a minute on the Internet: 500 thousand tweets, 900 thousand Facebook logins, and 3.5 million search queries on Google. Fascinating, big numbers, right?
  • #8 Here is how much data we are processing: 4.2 million user match queries, and nearly 5 million data capture events. And we see more than 4 billion unique users each month. We currently process a couple of tens of petabytes of data on behalf of our clients, more than about 60 Libraries of Congress worth of data.
  • #9 In order to process data at this scale, we built very large scale systems that handle the data collection events and server-to-server integrations, which get ingested into different parts of the data processing systems through distributed queues like Kafka. We use AWS Data Pipeline to handle the orchestration, and have developed an open source library called Hyperion to help developers easily define scheduling, task dependencies, fault tolerance and resource management. At any given moment, we have more than three thousand instances running with hundreds of EMR clusters. These jobs range from very simple cleaning, normalization, and aggregation workflows, to complex data science, machine learning and AI products.
  • #10 To motivate the problem, consider the following use case. Marketers want to find more customers that look like their loyal customers. That is, given a segment of valuable users and their associated features, train a model that captures their characteristics. Then, using the model, answer for every other user in the system: how similar are they to the original segment, given their unique features?
  • #11 Let's see it in a quick demo. I'm in the Northern Trail demo account. I can select a user segment, for example this one, and I can see a chart telling me how many users I can reach given a certain similarity threshold. Obviously, if you want to reach more users, you have to select a lower similarity.
  • #12 At a high level, here is some of the data science machinery we are using. For a model framework we use the trusted and loved Naive Bayes. Now, as we all know, more important than the model is the data we feed into it. Signal to noise is very important. At the scale we operate, as you can imagine, our feature space is very large, so we must run feature selection. We use standard linear discriminant analysis here. Finally, for some special sauce, and to overcome the shortcomings of Naive Bayes, we have developed an algorithm to correct for autocorrelation in the feature space.
  • #14 This talk is not about the details here, but about Spark. If you are interested to learn more, come and see us later. So let's get into some implementation details.
  • #15 Here are the basic tasks that every machine learning application performs: prepare data, train models, and use them to classify. The prepare task links, normalizes, and cleans the data; the data then gets fed into the trainer task to produce models, and the classifier takes the models generated by the trainer and produces results.
  • #16 Let's do some basic back-of-the-envelope calculation to assess the scale. The prepare job needs to get all the users, their segments and features; we process billions of unique users, and we have hundreds of thousands of segments and features in our system. The prepare task has a running complexity of order n and produces results on the order of ten to the fourteenth. The trainer takes the produced results and computes a model for each segment; it has an O(n squared) running complexity. The classifier takes all the models and performs classifications for each user, so it has about the same running complexity as the trainer. Clearly, with the amount of data that we need to process, none of the off-the-shelf software or libraries like scikit-learn was going to work; we had to write custom algorithms and deploy the jobs on a large distributed system.
  • #17 So we did just that, we implemented the jobs and deployed them ...
  • #20 And ... it did not work so great. The challenge was scaling up the data science application.
  • #21 We had lots of failures... stack overflow exceptions, nodes dying in succession and eventually killing the cluster. And cost went through the roof... slow execution, poor parallelism and utilization of task nodes, tasks taking a long time to complete, failures and retries. The first thing we examined was the framework we were using. You see, at the time we were using map-reduce. Our algorithm, however, was cyclical: the same very large data set was used to produce multiple parameters in the model, and with the map-reduce framework this resulted in high overhead from data serialization formats and managing storage on clusters, not to mention the complexity of the code base.
  • #22 Let me show you what we mean. Take the trainer for example: these are the four basic numbers required for a simple Naive Bayes model, with none of the feature selection or autocorrelation correction mentioned before. Total number of features, total population, segment populations, segment population overlaps.
  • #23 It turns out you need this to get just those four numbers. Not to mention the code to put those four numbers together, serialize the intermediate results, and chain the map-reduce tasks. Remember, these are just the basic parameters.
  • #24 We then introduced Spark. You just need to write these for those four numbers: plain and simple Scala code that you would expect to write when working with a typical collection object, [insert some explanation of the code here depending on time] without worrying too much about the underlying distribution or data serialization between stages. It handles distributed memory management, task coordination, data serialization, the works.
  • #25 Even with Spark, we were still experiencing issues. First of all, we use AWS S3 heavily; to prepare data we need to be able to load data that is spread across a very large number of S3 directories.
  • #26 Well, you can just load them all as RDDs, and then union them one at a time. That does not work: because of the way UnionRDD works and how tasks are serialized, you'll get a StackOverflowException if you have a lot of RDDs in the union.
  • #27 Ah, I know, Spark has a special method to union a large number of RDDs together. This surely works. It does work, in the sense that the code runs and does not trigger the StackOverflowException anymore. However, it's painfully slow. The problem now is not the union operation, but retrieving the data required for the RDD. You see, Spark uses the HDFS API to interact with compatible storage systems like S3. To plan the work, it issues listStatus calls to all those S3 directories in order to gather the files. However, strictly speaking, S3 is not a file system but a key-value store. Also, listStatus is a network call, executed on the master node in serial fashion; each of these calls takes at least 2 seconds to return a result, so thousands of these calls take hours to complete, while all the slave nodes sit idle doing nothing.
  • #28 The key is to minimize the number of network calls. We can use the S3 APIs directly on the master node, filter to get all the files we want, distribute the file URIs, and let the executors use the S3 APIs to fetch and parse the content. Just remember that if your file sizes vary a lot, shuffling before parallelize will avoid a small number of executors getting all the big files. Also make sure you have a reasonable parallelism factor [may need to explain parallelism factor]. We found that 3 times the total number of executors or CPU cores performs best. OK, that's it.
  • #29 Remember that one of the key numbers is the segment population overlap? We could easily compute it with a self-join like this.
  • #30 Given data like this, we first normalize it to (user, segment) pairs, perform a self-join on the user id, replace the user id with 1s, and aggregate to get the overlaps. Well, as in all database problems, the join is the most time-consuming operation. So it takes a long time for this to complete, which in turn costs a lot of money.
  • #31 If you examine this closely, you see that what matters here are the segment combinations: if we have 2 users that belong to both segment a and c, we can already aggregate them before the join. Remember, a naive join operation without an index has O(n^2) complexity, and even a sort-merge join is not cheap. You will get a massive performance boost if you push joins to the end and reduce the data as much as possible before the join operation; this is one of the most basic database query optimization techniques, and I'm sure a lot of you are aware of it. So here is what we did: first we group and aggregate the number of users that belong to exactly the same segments, generate a hash for the segment set, and then perform the join on the hash instead of the user id. As a result, you can see that the data shrinks and the join is much faster.
  • #32 Here is how the code looks: condense the data and then join.
  • #33 Another very common operation is joining on a foreign key, meaning a join between a data set that has unique keys and another one whose foreign key is not unique. This typically is not a problem; however, in some scenarios the foreign key on data2 is highly skewed, meaning a very large portion of the data belongs to a relatively small number of keys. In this example most of the rows in data2 have the key user1.
  • #34 When Spark performs such a join, the data that share the same key end up on the same executor. This causes the majority of the data to be shuffled to executor 1, which may not have sufficient memory to deal with it. That causes a failure; when the tasks are shifted to other executors, it causes them to fail one by one as well, and then you'll experience this slow and painful death of a cluster.
  • #35 One way to deal with this is to salt the keys, adding an additional component to break up the skew. The downside is that you need to inflate the number of rows in data1, so it's not ideal.
  • #36 Alternatively, if data1 is small enough, you can simply copy it to all executors and perform a match on each row. This is called a broadcast hash join in Spark SQL. So for each row in data2, you perform a lookup on the local copy of data1 and save the result. If it's not small enough, we can use a hybrid approach: broadcast only the part of data1 with the skewed keys, do the matching, and then perform a normal join on the rest of the data.
  • #37 To make data available on all executors, you can use the broadcast call. You pay an overhead cost to find the skewed keys, but this approach makes joining on such data possible.
  • #38 To recap, moving from Hadoop to Spark gave us a much more maintainable codebase, and we can add more complex features more easily and quickly. The smarter S3 retrieval greatly improves the performance of getting data from a large number of S3 directories. The condensed overlap algorithm improves the join performance by up to 100x, and the hybrid join makes joining on skewed data possible.
  • #39 As a result, we have a much lower failure rate,
  • #40 with up to 65% performance improvement,
  • #41 and about a third of the cost.
  • #42 This is just one small portion of the problems related to Spark. We are tackling a lot of data science problems at massive scale. Did we mention we had to implement most of the algorithms ourselves, tweaking and heavily improving classic data mining and machine learning models for this kind of data? We are also working on a lot of other interesting Einstein products, like journey insights and machine-discovered segments. And this concludes our presentation. Thank you, any questions? [Answer questions] Talk to us offline if you are interested or have other questions.