Spark1

Using Spark to Ignite Data Analytics
by eBay Global Data Infrastructure Analytics Team on 05/28/2014
[http://www.ebaytechblog.com/2014/05/28/using-spark-to-ignite-data-analytics/]
in Data Infrastructure and Services, Machine Learning, Open Source
At eBay we want our customers to have the best experience possible. We use data analytics to
improve user experiences, provide relevant offers, optimize performance, and create many, many
other kinds of value. One way eBay supports this value creation is by utilizing data processing
frameworks that enable, accelerate, or simplify data analytics. One such framework is Apache Spark.
This post describes how Apache Spark fits into eBay’s Analytic Data Infrastructure.
What is Apache Spark?
The Apache Spark web site describes Spark as “a fast and general engine for large-scale data
processing.” Spark is a framework that enables parallel, distributed data processing. It offers a simple
programming abstraction that provides powerful cache and persistence capabilities. The Spark
framework can be deployed through Apache Mesos, Apache Hadoop via Yarn, or Spark’s own cluster
manager. Developers can use the Spark framework via several programming languages including
Java, Scala, and Python. Spark also serves as a foundation for additional data processing frameworks
such as Shark, which provides SQL functionality for Hadoop.
Spark is an excellent tool for iterative processing of large datasets. One way Spark is suited for this
type of processing is through its Resilient Distributed Dataset (RDD). In the paper titled Resilient
Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, RDDs are
described as “…fault-tolerant, parallel data structures that let users explicitly persist intermediate
results in memory, control their partitioning to optimize data placement, and manipulate them using
a rich set of operators.” By using RDDs, programmers can pin their large data sets to memory,
thereby supporting high-performance, iterative processing. Compared to reading a large data set
from disk for every processing iteration, the in-memory solution is obviously much faster.
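As a rough illustration of the point above, the sketch below caches an RDD once and then runs several passes over it. The object name, the method name, and the toy computation (summing squares) are assumptions made for this example and are not taken from eBay's jobs; the `local[*]` master is only there so the sketch can run on a single machine with Spark on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachedIteration {
  // Sum the squares of 1..n several times over the same cached RDD.
  def run(sc: SparkContext, n: Int, iterations: Int): Seq[Long] = {
    // cache() marks the RDD to be kept in memory once it has been computed
    val squares = sc.parallelize(1 to n).map(i => i.toLong * i).cache()
    // The first pass computes and stores the partitions; later passes reuse them
    (1 to iterations).map(_ => squares.reduce(_ + _))
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM (assumption for a standalone demo)
    val conf = new SparkConf().setAppName("cached-iteration").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc, 1000, 3)) finally sc.stop()
  }
}
```

Without the `cache()` call the result would be the same, but every pass would recompute the `map` from the source data.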
The diagram below shows a simple example of using Spark to read input data from HDFS, perform a
series of iterative operations against that data using RDDs, and write the subsequent output back to
HDFS.
In the case of the first map operation into RDD(1), not all of the data could fit within the memory
space allowed for RDDs. In such a case, the programmer is able to specify what should happen to the
data that doesn’t fit. The options include spilling the computed data to disk and recreating it upon
read. We can see in this example how each processing iteration is able to leverage memory for the
reading and writing of its data. This method of leveraging memory can be up to 100X faster than
other methods that rely purely on disk storage for intermediate results.
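The choice of what happens to data that does not fit in memory is expressed through an RDD's storage level. The following is a minimal sketch, assuming Spark is on the classpath; the object, the method names, and the line-length pipeline are hypothetical and only serve to show where the storage level is set.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object SpillPolicy {
  // MEMORY_ONLY (what cache() uses): partitions that do not fit are dropped
  // and recomputed when they are read again.
  // MEMORY_AND_DISK: partitions that do not fit are written to local disk instead.
  def persistWithSpill[T](rdd: RDD[T]): RDD[T] =
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

  // Hypothetical pipeline: read from HDFS, keep an intermediate RDD, write back to HDFS.
  def lineLengths(sc: SparkContext, inputPath: String, outputPath: String): Unit = {
    val lengths = persistWithSpill(sc.textFile(inputPath).map(_.length))
    println(s"lines: ${lengths.count()}") // first action materializes the persisted RDD
    lengths.saveAsTextFile(outputPath)    // second action reuses the persisted partitions
  }
}
```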
Apache Spark at eBay
Today Spark is most commonly leveraged at eBay through Hadoop via Yarn. Yarn manages the
Hadoop cluster’s resources and allows Hadoop to extend beyond traditional map and reduce jobs by
employing Yarn containers to run generic tasks. Through the Hadoop Yarn framework, eBay’s Spark
users are able to leverage clusters approaching the range of 2000 nodes, 100TB of RAM, and 20,000
cores.
The following example illustrates Spark on Hadoop via Yarn.
The user submits the Spark job to Hadoop. The Spark application master starts within a single Yarn
container, then begins working with the Yarn resource manager to spawn Spark executors – as many
as the user requested. These Spark executors will run the Spark application using the specified
amount of memory and number of CPU cores. In this case, the Spark application is able to read and
write to the cluster’s data residing in HDFS. This model of running Spark on Hadoop illustrates
Hadoop’s growing ability to provide a singular, foundational platform for data processing over
shared data.
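The executor count, memory, and cores mentioned above are ordinary Spark configuration properties. The sketch below shows one way they might be set programmatically; the object name, the application name, and the sizing values are assumptions for illustration, and the master is deliberately left unset on the assumption that the launcher supplies it when the job is submitted to the cluster.

```scala
import org.apache.spark.SparkConf

object YarnJobConf {
  // Hypothetical sizing helper; real values depend on the job and the cluster queue.
  def build(executors: Int, memory: String, cores: Int): SparkConf =
    new SparkConf()
      .setAppName("seller-clustering")                     // assumed application name
      .set("spark.executor.instances", executors.toString) // number of executors requested
      .set("spark.executor.memory", memory)                // memory per executor, e.g. "4g"
      .set("spark.executor.cores", cores.toString)         // CPU cores per executor
}
```

For example, `YarnJobConf.build(50, "4g", 2)` would describe a job asking for 50 executors with 4 GB of memory and 2 cores each.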
The eBay analyst community includes a strong contingent of Scala users. Accordingly, many of eBay’s
Spark users are writing their jobs in Scala. These jobs are supporting discovery through interrogation
of complex data, data modelling, and data scoring, among other use cases. Below is a code snippet
from a Spark Scala application. This application uses Spark’s machine learning library, MLlib, to
cluster eBay’s sellers via KMeans. The seller attribute data is stored in HDFS.
import org.apache.hadoop.io.Text
import org.apache.spark.mllib.clustering.KMeans

/**
 * read input files and turn into usable records
 */
var table = new SellerMetric()
val model_data = sc.sequenceFile[Text, Text](
  input_path,
  classOf[Text],
  classOf[Text],
  num_tasks.toInt
).map(
  v => parseRecord(v._2, table)
).filter(
  v => v != null
).cache()
....
/**
 * build training data set from sample and summary data
 */
val train_data = sample_data.map( v =>
  Array.tabulate[Double](field_cnt)(
    i => zscore(v._2(i), sample_mean(i), sample_stddev(i))
  )
).cache()
/**
 * train the model
 */
val model = KMeans.train(train_data, CLUSTERS, ITERATIONS)
/**
 * score the data
 */
val results = grouped_model_data.map(
  v => (
    v._1,
    model.predict(
      Array.tabulate[Double](field_cnt)(
        i => zscore(v._2(i), sample_mean(i), sample_stddev(i))
      )
    )
  )
)
results.saveAsTextFile(output_path)
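The snippet calls a `zscore` helper whose definition is not shown, and relies on the model's `predict` to assign each seller to a cluster. The plain-Scala sketch below shows what such steps typically look like; it is an assumption about the general technique, not eBay's actual helper or MLlib's internal code, and all names in it are hypothetical.

```scala
object ScoringSketch {
  // Standard score: how many standard deviations x lies from the mean.
  // Returning 0.0 when the deviation is zero is an assumption to avoid dividing by zero.
  def zscore(x: Double, mean: Double, stddev: Double): Double =
    if (stddev == 0.0) 0.0 else (x - mean) / stddev

  // Standardize one record field by field, mirroring the Array.tabulate calls in the snippet.
  def standardize(row: Array[Double], means: Array[Double], stddevs: Array[Double]): Array[Double] =
    Array.tabulate(row.length)(i => zscore(row(i), means(i), stddevs(i)))

  // Squared Euclidean distance between two points of equal length.
  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Index of the closest centroid, which is what a k-means prediction amounts to.
  def nearestCentroid(point: Array[Double], centroids: Array[Array[Double]]): Int =
    centroids.indices.minBy(i => squaredDistance(point, centroids(i)))
}
```

Standardizing each field first keeps attributes with large numeric ranges from dominating the distance calculation, which is presumably why the snippet applies `zscore` both when training and when scoring.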
In addition to Spark Scala users, several folks at eBay have begun using Spark with Shark to
accelerate their Hadoop SQL performance. Many of these Shark queries are easily running 5X faster
than their Hive counterparts. While Spark at eBay is still in its early stages, usage is
expanding from experimental to everyday as the number of Spark users at eBay continues to
grow.
The Future of Spark at eBay
Spark is helping eBay create value from its data, and so the future is bright for Spark at eBay. Our
Hadoop platform team has started gearing up to formally support Spark on Hadoop. Additionally,
we’re keeping our eyes on how Hadoop continues to evolve in its support for frameworks like Spark,
how the community is able to use Spark to create value from data, and how companies like
Hortonworks and Cloudera are incorporating Spark into their portfolios. Some groups within eBay
are looking at spinning up their own Spark clusters outside of Hadoop. These clusters would either
leverage more specialized hardware or be application-specific. Other folks are working on
incorporating eBay’s already strong data platform language extensions into the Spark model to make
it even easier to leverage eBay’s data within Spark. In the meantime, we will continue to see adoption
of Spark increase at eBay. This adoption will be driven by chats in the hall, newsletter blurbs, product
announcements, industry chatter, and Spark’s own strengths and capabilities.
5 thoughts on “Using Spark to Ignite Data Analytics”
Rajnish
11/29/2014 at 3:41AM
Thanks for sharing this nice article; it helps us understand the technology stack used at big
companies. What is your opinion on using Cassandra as data storage for Spark instead of Hadoop?
I haven’t used Cassandra with Spark so I don’t have much to offer.
I look at Spark as an additional tool for leveraging an existing big data platform investment,
and hence the storage solution already in place. Hadoop offers a fair amount of general-purpose
platform capabilities, which is nice.
To create a platform exclusively for Spark, or to use Spark in a way that doesn’t need to piggy-back
on an existing platform, I would look for a storage solution superior to HDFS.
John
12/01/2014 at 10:51AM