Using Spark to Ignite Data Analytics
by eBay Global Data Infrastructure Analytics Team on 05/28/2014 [http://www.ebaytechblog.com/2014/05/28/using-spark-to-ignite-data-analytics/]
in Data Infrastructure and Services, Machine Learning, Open Source
At eBay we want our customers to have the best experience possible. We use data analytics to
improve user experiences, provide relevant offers, optimize performance, and create many, many
other kinds of value. One way eBay supports this value creation is by utilizing data processing
frameworks that enable, accelerate, or simplify data analytics. One such framework is Apache Spark.
This post describes how Apache Spark fits into eBay’s Analytic Data Infrastructure.
What is Apache Spark?
The Apache Spark web site describes Spark as “a fast and general engine for large-scale data
processing.” Spark is a framework that enables parallel, distributed data processing. It offers a simple
programming abstraction that provides powerful cache and persistence capabilities. The Spark
framework can be deployed through Apache Mesos, Apache Hadoop via YARN, or Spark’s own cluster
manager. Developers can use the Spark framework via several programming languages including
Java, Scala, and Python. Spark also serves as a foundation for additional data processing frameworks
such as Shark, which provides SQL functionality for Hadoop.
Spark is an excellent tool for iterative processing of large datasets. One reason it suits this
type of processing is its Resilient Distributed Dataset (RDD) abstraction. In the paper titled Resilient
Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, RDDs are
described as “…fault-tolerant, parallel data structures that let users explicitly persist intermediate
Using Spark to Ignite Data Analytics | eBay Tech Blog http://www.ebaytechblog.com/2014/05/28/using-s...
1 of 6 08/18/2015 05:03 PM
results in memory, control their partitioning to optimize data placement, and manipulate them using
a rich set of operators.” By using RDDs, programmers can pin their large data sets to memory,
thereby supporting high-performance, iterative processing. Reading a large data set from memory on
each processing iteration, rather than rereading it from disk, is typically much faster.
The diagram below shows a simple example of using Spark to read input data from HDFS, perform a
series of iterative operations against that data using RDDs, and write the subsequent output back to
HDFS.
In the case of the first map operation into RDD(1), not all of the data could fit within the memory
space allowed for RDDs. In such a case, the programmer is able to specify what should happen to the
data that doesn’t fit. The options include spilling the computed data to disk and recreating it upon
read. We can see in this example how each processing iteration is able to leverage memory for the
reading and writing of its data. This method of leveraging memory can be up to 100X faster than
other methods that rely purely on disk storage for intermediate results.
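The choice between spilling to disk and recomputing corresponds to Spark's storage levels. Below is a minimal sketch, assuming a local master and a generated range of numbers standing in for the HDFS input. The object name, method name, app name, and data are hypothetical and only for illustration; this is not eBay's actual job.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Runs `passes` iterations over a persisted RDD and returns the last sum.
  def iterate(sc: SparkContext, level: StorageLevel, passes: Int): Long = {
    // Stand-in for input that a real job would read from HDFS.
    val doubled = sc.parallelize(1 to 1000000).map(_.toLong * 2).persist(level)
    var total = 0L
    for (_ <- 1 to passes) {
      // Each pass reuses the persisted partitions rather than rebuilding them.
      total = doubled.reduce(_ + _)
    }
    doubled.unpersist()
    total
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for illustration only.
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // MEMORY_ONLY: partitions that don't fit are recomputed when next read.
      println(iterate(sc, StorageLevel.MEMORY_ONLY, passes = 3))
      // MEMORY_AND_DISK: partitions that don't fit are spilled to disk.
      println(iterate(sc, StorageLevel.MEMORY_AND_DISK, passes = 3))
    } finally sc.stop()
  }
}
```

Both calls compute the same result; the storage level only changes what happens to partitions that exceed the available memory.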
Apache Spark at eBay
Today Spark is most commonly leveraged at eBay through Hadoop via YARN. YARN manages the
Hadoop cluster’s resources and allows Hadoop to extend beyond traditional map and reduce jobs by
employing YARN containers to run generic tasks. Through the Hadoop YARN framework, eBay’s Spark
users are able to leverage clusters approaching the range of 2,000 nodes, 100 TB of RAM, and 20,000
cores.
The following example illustrates Spark on Hadoop via YARN.
The user submits the Spark job to Hadoop. The Spark application master starts within a single YARN
container, then begins working with the YARN resource manager to spawn Spark executors – as many
as the user requested. These Spark executors will run the Spark application using the specified
amount of memory and number of CPU cores. In this case, the Spark application is able to read and
write to the cluster’s data residing in HDFS. This model of running Spark on Hadoop illustrates
Hadoop’s growing ability to provide a single, foundational platform for data processing over
shared data.
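The executor count, memory, and cores that the user requests are ordinary Spark settings. Below is a minimal sketch, assuming the `spark.executor.*` configuration keys documented for current Spark releases. The values, object name, and app name are made up for illustration and are not eBay's settings. In practice these settings are often passed on the `spark-submit` command line instead.

```scala
import org.apache.spark.SparkConf

object YarnConfSketch {
  // Builds a configuration that requests a fixed number of executors,
  // each with the given memory and CPU cores.
  def build(executors: Int, memory: String, cores: Int): SparkConf =
    new SparkConf()
      .setAppName("seller-clustering-sketch") // hypothetical app name
      .set("spark.executor.instances", executors.toString)
      .set("spark.executor.memory", memory)
      .set("spark.executor.cores", cores.toString)

  def main(args: Array[String]): Unit = {
    // Illustrative values only.
    val conf = build(executors = 50, memory = "8g", cores = 4)
    println(conf.toDebugString)
  }
}
```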
The eBay analyst community includes a strong contingent of Scala users. Accordingly, many of eBay’s
Spark users are writing their jobs in Scala. These jobs are supporting discovery through interrogation
of complex data, data modelling, and data scoring, among other use cases. Below is a code snippet
from a Spark Scala application. This application uses Spark’s machine learning library, MLlib, to
cluster eBay’s sellers via KMeans. The seller attribute data is stored in HDFS.
/**
* read input files and turn into usable records
*/
var table = new SellerMetric()
val model_data = sc.sequenceFile[Text,Text](
  input_path
  ,classOf[Text]
  ,classOf[Text]
  ,num_tasks.toInt
).map(
  v => parseRecord(v._2,table)
).filter(
  v => v != null
).cache
....
/**
* build training data set from sample and summary data
*/
val train_data = sample_data.map( v =>
  Array.tabulate[Double](field_cnt)(
    i => zscore(v._2(i),sample_mean(i),sample_stddev(i))
  )
).cache
/**
* train the model
*/
val model = KMeans.train(train_data,CLUSTERS,ITERATIONS)
/**
* score the data
*/
val results = grouped_model_data.map(
  v => (
    v._1
    ,model.predict(
      Array.tabulate[Double](field_cnt)(
        i => zscore(v._2(i),sample_mean(i),sample_stddev(i))
      )
    )
  )
)
results.saveAsTextFile(output_path)
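The snippet calls a `zscore` helper whose definition is not shown. Below is a minimal sketch of what such a helper might look like, assuming the usual standardization formula (value minus mean, divided by standard deviation). The zero-variance guard, the `summarize` function, and the sample values are hypothetical additions for illustration.

```scala
object ZScoreSketch {
  // Standardizes a value. Returning 0.0 for a zero-variance field is an assumption.
  def zscore(value: Double, mean: Double, stddev: Double): Double =
    if (stddev == 0.0) 0.0 else (value - mean) / stddev

  // Per-field mean and population standard deviation for a small local sample.
  def summarize(rows: Seq[Array[Double]]): (Array[Double], Array[Double]) = {
    val n = rows.size.toDouble
    val fieldCnt = rows.head.length
    val mean = Array.tabulate(fieldCnt)(i => rows.map(_(i)).sum / n)
    val stddev = Array.tabulate(fieldCnt) { i =>
      math.sqrt(rows.map(r => math.pow(r(i) - mean(i), 2)).sum / n)
    }
    (mean, stddev)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample: two fields, the second with no variance.
    val sample = Seq(Array(1.0, 10.0), Array(2.0, 10.0), Array(3.0, 10.0))
    val (mean, stddev) = summarize(sample)
    val scaled = sample.map { r =>
      Array.tabulate(r.length)(i => zscore(r(i), mean(i), stddev(i)))
    }
    scaled.foreach(r => println(r.mkString(",")))
  }
}
```

Standardizing each field this way puts the attributes on a comparable scale, which matters because k-means clusters by distance and would otherwise be dominated by the fields with the largest raw values.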
In addition to Spark Scala users, several folks at eBay have begun using Spark with Shark to
accelerate their Hadoop SQL performance. Many of these Shark queries are easily running 5X faster
than their Hive counterparts. While Spark at eBay is still in its early stages, usage is
expanding from experimental to everyday as the number of Spark users at eBay continues to
grow.
The Future of Spark at eBay
Spark is helping eBay create value from its data, and so the future is bright for Spark at eBay. Our
Hadoop platform team has started gearing up to formally support Spark on Hadoop. Additionally,
we’re keeping our eyes on how Hadoop continues to evolve in its support for frameworks like Spark,
how the community is able to use Spark to create value from data, and how companies like
Hortonworks and Cloudera are incorporating Spark into their portfolios. Some groups within eBay
are looking at spinning up their own Spark clusters outside of Hadoop. These clusters would either
leverage more specialized hardware or be application-specific. Other folks are working on
incorporating eBay’s already strong data platform language extensions into the Spark model to make
it even easier to leverage eBay’s data within Spark. In the meantime, we will continue to see adoption
of Spark increase at eBay. This adoption will be driven by chats in the hall, newsletter blurbs, product
announcements, industry chatter, and Spark’s own strengths and capabilities.
5 thoughts on “Using Spark to Ignite Data Analytics”
Rajnish
11/29/2014 at 3:41AM
Thanks for sharing this nice article that is helping us to know the technology stack used in big
companies. What is your opinion on using Cassandra as data storage for Spark instead of Hadoop?
I haven’t used Cassandra with Spark so I don’t have much to offer.
I look at Spark as an additional tool for leveraging an existing big data platform investment,
and hence the storage solution already in place. Hadoop offers a fair amount of general
purpose platform capabilities, which is nice.
To create a platform exclusively for Spark, or to use Spark in a way that doesn’t need to piggy-back
on an existing platform, I would look for a storage solution superior to HDFS.
John
12/01/2014 at 10:51AM

More Related Content

What's hot

Webinar: Big Data & Hadoop - When not to use Hadoop
Webinar: Big Data & Hadoop - When not to use HadoopWebinar: Big Data & Hadoop - When not to use Hadoop
Webinar: Big Data & Hadoop - When not to use HadoopEdureka!
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopEdureka!
 
Learning How to Learn Hadoop
Learning How to Learn HadoopLearning How to Learn Hadoop
Learning How to Learn HadoopSilicon Halton
 
Spark Will Replace Hadoop ! Know Why
Spark Will Replace Hadoop ! Know Why Spark Will Replace Hadoop ! Know Why
Spark Will Replace Hadoop ! Know Why Edureka!
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!Edureka!
 
Big Data Engineering for Machine Learning
Big Data Engineering for Machine LearningBig Data Engineering for Machine Learning
Big Data Engineering for Machine LearningVasu S
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystemrohitraj268
 
Lighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureLighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureJen Stirrup
 
Hadoop Career Path and Interview Preparation
Hadoop Career Path and Interview PreparationHadoop Career Path and Interview Preparation
Hadoop Career Path and Interview PreparationEdureka!
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoopjeffturner
 
Hadoop distributions - ecosystem
Hadoop distributions - ecosystemHadoop distributions - ecosystem
Hadoop distributions - ecosystemJakub Stransky
 
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...Edureka!
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheSandeepTaksande
 

What's hot (20)

Webinar: Big Data & Hadoop - When not to use Hadoop
Webinar: Big Data & Hadoop - When not to use HadoopWebinar: Big Data & Hadoop - When not to use Hadoop
Webinar: Big Data & Hadoop - When not to use Hadoop
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Learning How to Learn Hadoop
Learning How to Learn HadoopLearning How to Learn Hadoop
Learning How to Learn Hadoop
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Spark Will Replace Hadoop ! Know Why
Spark Will Replace Hadoop ! Know Why Spark Will Replace Hadoop ! Know Why
Spark Will Replace Hadoop ! Know Why
 
Hadoop white papers
Hadoop white papersHadoop white papers
Hadoop white papers
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!
 
Big Data Engineering for Machine Learning
Big Data Engineering for Machine LearningBig Data Engineering for Machine Learning
Big Data Engineering for Machine Learning
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Lighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureLighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in Azure
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Hadoop Career Path and Interview Preparation
Hadoop Career Path and Interview PreparationHadoop Career Path and Interview Preparation
Hadoop Career Path and Interview Preparation
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
Unit 5-apache hive
Unit 5-apache hiveUnit 5-apache hive
Unit 5-apache hive
 
Hadoop distributions - ecosystem
Hadoop distributions - ecosystemHadoop distributions - ecosystem
Hadoop distributions - ecosystem
 
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
 
Hadoop
HadoopHadoop
Hadoop
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs Apache
 

Similar to Spark1

Big Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxBig Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxKnoldus Inc.
 
Big Data Transformations Powered By Spark
Big Data Transformations Powered By SparkBig Data Transformations Powered By Spark
Big Data Transformations Powered By SparkKnoldus Inc.
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogsprateek kumar
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Databricks
 
Machine Learning with SparkR
Machine Learning with SparkRMachine Learning with SparkR
Machine Learning with SparkROlgun Aydın
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
5 reasons why spark is in demand!
5 reasons why spark is in demand!5 reasons why spark is in demand!
5 reasons why spark is in demand!Edureka!
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch ProcessingEdureka!
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdfMaheshPandit16
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!Edureka!
 
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...Megha Shah
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark ZaranTech LLC
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsMiklos Christine
 

Similar to Spark1 (20)

Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
Apache spark
Apache sparkApache spark
Apache spark
 
Big Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxBig Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptx
 
Big Data Transformations Powered By Spark
Big Data Transformations Powered By SparkBig Data Transformations Powered By Spark
Big Data Transformations Powered By Spark
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogs
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 
Machine Learning with SparkR
Machine Learning with SparkRMachine Learning with SparkR
Machine Learning with SparkR
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
5 reasons why spark is in demand!
5 reasons why spark is in demand!5 reasons why spark is in demand!
5 reasons why spark is in demand!
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 

Recently uploaded

ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupJonathanParaisoCruz
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.arsicmarija21
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxUnboundStockton
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 

Recently uploaded (20)

ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docx
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 

Spark1

  • 1. Using Spark to Ignite Data Analytics by eBay Global Data Infrastructure Analytics Team on 05/28/2014 [http://www.ebaytechblog.com /2014/05/28/using-spark-to-ignite-data-analytics/] in Data Infrastructure and Services, Machine Learning, Open Source At eBay we want our customers to have the best experience possible. We use data analytics to improve user experiences, provide relevant o�ers, optimize performance, and create many, many other kinds of value. One way eBay supports this value creation is by utilizing data processing frameworks that enable, accelerate, or simplify data analytics. One such framework is Apache Spark. This post describes how Apache Spark �ts into eBay’s Analytic Data Infrastructure. What is Apache Spark? The Apache Spark web site describes Spark as “a fast and general engine for large-scale data processing.” Spark is a framework that enables parallel, distributed data processing. It o�ers a simple programming abstraction that provides powerful cache and persistence capabilities. The Spark framework can be deployed through Apache Mesos, Apache Hadoop via Yarn, or Spark’s own cluster manager. Developers can use the Spark framework via several programming languages including Java, Scala, and Python. Spark also serves as a foundation for additional data processing frameworks such as Shark, which provides SQL functionality for Hadoop. Spark is an excellent tool for iterative processing of large datasets. One way Spark is suited for this type of processing is through its Resilient Distributed Dataset (RDD). In the paper titled Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, RDDs are described as “…fault-tolerant, parallel data structures that let users explicitly persist intermediate Using Spark to Ignite Data Analytics | eBay Tech Blog http://www.ebaytechblog.com/2014/05/28/using-s... 1 of 6 08/18/2015 05:03 PM
  • 2. results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.” By using RDDs, programmers can pin their large data sets to memory, thereby supporting high-performance, iterative processing. Compared to reading a large data set from disk for every processing iteration, the in-memory solution is obviously much faster. The diagram below shows a simple example of using Spark to read input data from HDFS, perform a series of iterative operations against that data using RDDs, and write the subsequent output back to HDFS. In the case of the �rst map operation into RDD(1), not all of the data could �t within the memory space allowed for RDDs. In such a case, the programmer is able to specify what should happen to the data that doesn’t �t. The options include spilling the computed data to disk and recreating it upon read. We can see in this example how each processing iteration is able to leverage memory for the reading and writing of its data. This method of leveraging memory is likely to be 100X faster than other methods that rely purely on disk storage for intermittent results. Apache Spark at eBay Today Spark is most commonly leveraged at eBay through Hadoop via Yarn. Yarn manages the Hadoop cluster’s resources and allows Hadoop to extend beyond traditional map and reduce jobs by employing Yarn containers to run generic tasks. Through the Hadoop Yarn framework, eBay’s Spark users are able to leverage clusters approaching the range of 2000 nodes, 100TB of RAM, and 20,000 cores. Using Spark to Ignite Data Analytics | eBay Tech Blog http://www.ebaytechblog.com/2014/05/28/using-s... 2 of 6 08/18/2015 05:03 PM
  • 3. The following example illustrates Spark on Hadoop via Yarn. The user submits the Spark job to Hadoop. The Spark application master starts within a single Yarn container, then begins working with the Yarn resource manager to spawn Spark executors – as many as the user requested. These Spark executors will run the Spark application using the speci�ed amount of memory and number of CPU cores. In this case, the Spark application is able to read and write to the cluster’s data residing in HDFS. This model of running Spark on Hadoop illustrates Hadoop’s growing ability to provide a singular, foundational platform for data processing over shared data. The eBay analyst community includes a strong contingent of Scala users. Accordingly, many of eBay’s Spark users are writing their jobs in Scala. These jobs are supporting discovery through interrogation of complex data, data modelling, and data scoring, among other use cases. Below is a code snippet from a Spark Scala application. This application uses Spark’s machine learning library, MLlib, to cluster eBay’s sellers via KMeans. The seller attribute data is stored in HDFS. /** * read input files and turn into usable records */ var table = new SellerMetric() Using Spark to Ignite Data Analytics | eBay Tech Blog http://www.ebaytechblog.com/2014/05/28/using-s... 3 of 6 08/18/2015 05:03 PM
  • 4.
        val model_data = sc.sequenceFile[Text,Text](
            input_path
           ,classOf[Text]
           ,classOf[Text]
           ,num_tasks.toInt
          ).map(
            v => parseRecord(v._2,table)
          ).filter(
            v => v != null
          ).cache
        ....

        /**
         * build training data set from sample and summary data
         */
        val train_data = sample_data.map(
            v => Array.tabulate[Double](field_cnt)(
              i => zscore(v._2(i),sample_mean(i),sample_stddev(i))
            )
          ).cache

        /**
         * train the model
         */
        val model = KMeans.train(train_data,CLUSTERS,ITERATIONS)

        /**
         * score the data
         */
        val results = grouped_model_data.map(
            v => (
              v._1
             ,model.predict(
                Array.tabulate[Double](field_cnt)(
                  i => zscore(v._2(i),sample_mean(i),sample_stddev(i))
                )
              )
            )
          )
        results.saveAsTextFile(output_path)
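The snippet above calls a `zscore` helper that the post does not show. The following is a minimal sketch of what such a helper might look like, assuming it computes the standard score (value minus mean, divided by standard deviation). The object name `ZScoreSketch`, the `standardize` function, the zero-deviation fallback, and the sample numbers are all assumptions for illustration, not eBay’s implementation.

```scala
object ZScoreSketch {
  /**
   * Standard score of x. Returning 0.0 when stddev is 0 is an assumption
   * made here to avoid division by zero for constant fields.
   */
  def zscore(x: Double, mean: Double, stddev: Double): Double =
    if (stddev == 0.0) 0.0 else (x - mean) / stddev

  /** Standardizes every field of one record, using the same Array.tabulate pattern as the snippet. */
  def standardize(record: Array[Double],
                  means: Array[Double],
                  stddevs: Array[Double]): Array[Double] =
    Array.tabulate[Double](record.length)(i => zscore(record(i), means(i), stddevs(i)))

  def main(args: Array[String]): Unit = {
    // Made-up values: two fields on very different scales.
    val scaled = standardize(Array(10.0, 200.0), Array(8.0, 100.0), Array(2.0, 50.0))
    println(scaled.mkString(","))
  }
}
```

Standardizing each field this way is a common step before k-means, because the algorithm measures Euclidean distance and a field with a large numeric range would otherwise dominate the clustering.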
  • 5. In addition to Spark Scala users, several folks at eBay have begun using Spark with Shark to accelerate their Hadoop SQL performance. Many of these Shark queries are easily running 5X faster than their Hive counterparts. While Spark at eBay is still in its early stages, usage is in the midst of expanding from experimental to everyday as the number of Spark users at eBay continues to accelerate.

    The Future of Spark at eBay

    Spark is helping eBay create value from its data, and so the future is bright for Spark at eBay. Our Hadoop platform team has started gearing up to formally support Spark on Hadoop. Additionally, we’re keeping our eyes on how Hadoop continues to evolve in its support for frameworks like Spark, how the community is able to use Spark to create value from data, and how companies like Hortonworks and Cloudera are incorporating Spark into their portfolios. Some groups within eBay are looking at spinning up their own Spark clusters outside of Hadoop. These clusters would either leverage more specialized hardware or be application-specific. Other folks are working on incorporating eBay’s already strong data platform language extensions into the Spark model to make it even easier to leverage eBay’s data within Spark. In the meantime, we will continue to see adoption of Spark increase at eBay. This adoption will be driven by chats in the hall, newsletter blurbs, product announcements, industry chatter, and Spark’s own strengths and capabilities.

    5 thoughts on “Using Spark to Ignite Data Analytics”

    Pingback: Data Science and Big Data | A Computer Box
    Pingback: Spark and Scala and Big Data in General | A Computer Box

    Rajnish 11/29/2014 at 3:41AM
  • 6. Thanks for sharing this nice article that is helping us to know the technology stack used in big companies. What is your opinion on using Cassandra as data storage for Spark instead of Hadoop?

    John 12/01/2014 at 10:51AM
    I haven’t used Cassandra with Spark so I don’t have much to offer. I look at Spark as an additional tool for leveraging an existing big data platform investment, and hence the storage solution already in place. Hadoop offers a fair amount of general-purpose platform capabilities, which is nice. To create a platform exclusively for Spark, or to use Spark in a way that doesn’t need to piggyback on an existing platform, I would look for a storage solution superior to HDFS.

    Pingback: Quick Start Spark with Scala API | datasciencedreams