Productionalizing a Spark application

Productionalizing an application on a frequently
evolving framework like Spark

● Shashank L
● Big data consultant and trainer at
datamantra.io
● www.shashankgowda.com

Agenda
● Financial analytics
● Requirements
● Architecture
● Initial solution
● RDD to DataFrame API
● Code quality and testing
● Architectural changes
● Future improvements
● Lookback

Financial Analytics
Financial analytics is used to predict a company's stock
prices from its historical price information

Architecture
[Diagram: stocks data (daily basis), SQL Server, ETL pipeline, HDFS, data preprocessing, data analytics, NoSQL, frontend (dashboard)]

Our team
● Data scientists
○ Coming up with the new magic
● Data engineers
○ Productionalizing the magic on large datasets
● Front-end developer
○ Consumes the results and makes them presentable to clients

Requirements
● Developers spread across geographies
● A variety of developers on the team
● Better code quality
● Better testing mechanisms
● Easier team expansion
● Less infrastructure maintenance overhead
● Use the latest available libraries

Iteration 1
● Data scientists
○ They were well versed in Python or SQL
○ They did their analysis in Python with pandas DataFrame code
○ Analyses were tested only on small data sets
● Data engineers
○ Used Spark 0.9
○ Ported the Python code to the Scala RDD API to scale the analysis to big data
○ Built a custom framework that can read from and write to multiple sources (file, Hive table, S3, JDBC)
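
One way to picture a framework that reads from and writes to several kinds of sources is a small sealed hierarchy. The sketch below is only an illustration in plain Scala: every name in it (DataSource, FileSource, HiveTable, S3Source, JdbcSource, Sources.location) is hypothetical and not taken from the original framework.

```scala
// Hypothetical sketch: these names are assumptions, not the original framework's API.
sealed trait DataSource
final case class FileSource(path: String) extends DataSource
final case class HiveTable(database: String, table: String) extends DataSource
final case class S3Source(bucket: String, key: String) extends DataSource
final case class JdbcSource(url: String, table: String) extends DataSource

object Sources {
  // Render each source as one location string that a reader or writer could dispatch on.
  // Because DataSource is sealed, the compiler warns if a source type is left unhandled.
  def location(source: DataSource): String = source match {
    case FileSource(path)       => s"file://$path"
    case HiveTable(db, table)   => s"hive://$db.$table"
    case S3Source(bucket, key)  => s"s3://$bucket/$key"
    case JdbcSource(url, table) => s"$url#$table"
  }
}
```

Modelling the sources as a sealed trait keeps the list of supported back ends in one place, so adding a new one forces every pattern match to be revisited.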

Architecture
[Diagram: stocks data (daily basis), SQL Server, ETL pipeline, HDFS, data preprocessing, analysis (Python), frontend (dashboard); team labels: data engineers, data scientists]

Challenges
● Framework challenges
○ Porting code from one language to another introduced many inaccuracies
○ Differences in language constructs and APIs forced changes in code design
● Architectural challenges
○ Clusters used by the team were created and maintained manually
○ Intermediate data was saved in a text-based CSV format

Iteration 2
RDD API to DataFrame API

Iteration 2
● Upgrade to Spark 1.3
● Data scientists
○ The DataFrame API was introduced, a more familiar interface for data scientists
○ The SQL API made simple operations easier for data scientists
○ Zeppelin for data scientists to prototype analytical algorithms
● Data engineers
○ Moved the intermediate format from CSV to Parquet
○ Amazon EMR-based Hadoop cluster running Spark

Architecture
[Diagram: stocks data, ETL, HDFS, data preprocessing, data engineering cluster, data science cluster, Zeppelin, data analytics (PySpark), dashboard; team label: data engineer]

Challenges
● Quality challenges
○ Productionalizing multiple analyses required expanding the data engineering team
○ Team expansion introduced code quality issues and bugs
○ There were no unit tests for individual pieces of functionality
○ There was no review process for code changes

Iteration 3
Code quality and testing

Iteration 3
● Created unit test cases for all analyses
● More readable test suite using ScalaTest (http://www.scalatest.org/)
● Unit tests for small pieces of functionality, and flow tests that run the full ETL flow on sampled data
● Review process for code changes through GitHub pull requests
● Daily Jenkins build to test the flow and the functionality every day

ScalaTest
  import scala.collection.mutable.Stack
  import org.scalatest.{FlatSpec, Matchers}

  // FlatSpec-style spec (ScalaTest 2.x/3.0 API)
  class ExampleSpec extends FlatSpec with Matchers {

    "A Stack" should "pop values in last-in-first-out order" in {
      val stack = new Stack[Int]
      stack.push(1)
      stack.push(2)
      stack.pop() should be (2)
      stack.pop() should be (1)
    }

    it should "throw NoSuchElementException if an empty stack is popped" in {
      val emptyStack = new Stack[Int]
      a [NoSuchElementException] should be thrownBy {
        emptyStack.pop()
      }
    }
  }
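
The same idea applies to small pieces of analysis logic. The sketch below is a hypothetical example, not code from the project: dailyReturns is an assumed helper, and it is checked with plain assert calls rather than ScalaTest so that it runs without any extra dependency.

```scala
object Returns {
  // Simple daily return: (today - yesterday) / yesterday for each consecutive pair of prices.
  // Hypothetical helper used only to illustrate unit-testing a small pure function.
  def dailyReturns(prices: Seq[Double]): Seq[Double] =
    prices.zip(prices.drop(1)).map { case (prev, curr) => (curr - prev) / prev }
}

object ReturnsCheck {
  def main(args: Array[String]): Unit = {
    val returns = Returns.dailyReturns(Seq(100.0, 110.0, 99.0))
    // Three prices give two returns.
    assert(returns.length == 2)
    // Compare doubles with a tolerance instead of exact equality.
    assert(math.abs(returns(0) - 0.10) < 1e-9)
    assert(math.abs(returns(1) + 0.10) < 1e-9)
    // A single price has no previous day, so there is nothing to return.
    assert(Returns.dailyReturns(Seq(42.0)).isEmpty)
  }
}
```

Keeping the calculation in a pure function, separate from any Spark job, is what makes this kind of fast unit test possible; a flow test would then exercise the whole pipeline on sampled data.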

Challenges
● Architectural challenges
○ Cluster resources were a bottleneck for the teams
○ Amazon EMR clusters could not be treated as throwaway clusters because data was stored in HDFS
○ Upgrading the Spark version on the cluster was difficult
○ There was no proper infrastructure for scheduled jobs, and Jenkins was not the best way to schedule them
○ Stability issues with Zeppelin

Iteration 4
Architectural changes

Iteration 4
● Moved the data storage from HDFS to S3
● Moved to the Databricks cloud environment (https://databricks.com/product/databricks)
● Databricks cloud provides a notebook-based interface for writing Spark code in Scala, Java, Python and R
● Encouraged data scientists to use the Scala API
● Travis for deployment and testing

Databricks cloud
● Cluster config
○ Launch, configure, scale and terminate

Databricks cloud
● Jobs
○ Schedule complex workflows

Databricks cloud
● Notebooks
○ Explore, visualize and share

Improvements
● Data engineers
○ The cluster bottleneck was solved by creating multiple throwaway clusters when needed
○ No need to keep a cluster for a long time, since primary data storage was S3
○ Terminating clusters when they are not in use is cost-efficient
○ Multiple clusters with different versions of Spark let users try out the latest Spark features
○ Reduced cluster maintenance and tuning overhead

Improvements
● Data engineers
○ Shorter turnaround time for understanding bottlenecks in the workflows
○ Databricks cloud Jobs can be used for scheduling workflows and daily runs
○ Travis enabled strict and immediate code testing
● Data scientists
○ Data scientists can easily share notebooks and analysis results with the team
○ Ability to write in multiple languages

Architecture
[Diagram: stocks data, ETL, S3, NoSQL, dashboard; Databricks cloud with Jobs, a data science cluster with a notebook (R/Python), and two data engineering clusters (DataEngg cluster1, DataEngg cluster2), each with a notebook (Scala)]

Challenges
● Framework challenges
○ Schema is static and doesn’t change frequently
○ The DataFrame API has no static (compile-time) schema check
○ The pipeline fails midway through processing if there is any change in the data
○ The current window analysis uses Scala constructs to load a specific set of data into memory and run ML on top of it
○ Domain-object-based functions are currently called from inside UDFs
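
The in-memory window analysis mentioned above can be pictured with ordinary Scala collections. This is a minimal sketch under stated assumptions: movingAverage is a hypothetical function, and a moving average stands in for whatever model the team actually ran over each window.

```scala
object WindowAnalysis {
  // Moving average over a fixed-size window; yields one value per full window.
  // Assumption: the prices are already sorted by date and fit in memory.
  def movingAverage(prices: Seq[Double], window: Int): Seq[Double] = {
    require(window > 0, "window must be positive")
    // `sliding` would return one partial group for short input, so guard against it.
    if (prices.length < window) Seq.empty
    else prices.sliding(window).map(w => w.sum / window).toList
  }
}
```

The drawback the slide points at is visible here: the whole series for a window has to be collected into local memory before anything runs, which does not scale the way a distributed window function does.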

Iteration 5 (Future iteration)
● Data engineers
○ Port the analysis from the DataFrame API to the Dataset API (in Spark 2.0)
○ With the Dataset API, we get a static schema check
○ Reuse the existing domain-object-based functions
● Data scientists
○ Move from Scala window-based analysis to SparkSQL window analytics
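
The difference a static schema check makes can be shown with a plain-Scala analogy that needs no Spark dependency. This is not Spark code: the Map-based row only mimics by-name column access on a DataFrame row, the case class mimics a typed Dataset record, and StockPrice with its fields is a hypothetical domain object.

```scala
// Hypothetical domain object; the field names are assumptions for illustration.
final case class StockPrice(symbol: String, close: Double)

object SchemaCheck {
  // Untyped access, similar in spirit to looking up a column by name in a DataFrame row:
  // a wrong column name or type only fails when this line actually runs.
  def closeUntyped(row: Map[String, Any]): Double =
    row("close").asInstanceOf[Double]

  // Typed access, similar in spirit to a Dataset[StockPrice]:
  // a misspelled field such as `price.clse` is rejected by the compiler.
  def closeTyped(price: StockPrice): Double = price.close

  // A domain-object function can be applied directly to typed records.
  def isAbove(threshold: Double)(price: StockPrice): Boolean = price.close > threshold
}
```

With the untyped version, a renamed column surfaces as a runtime failure in the middle of a pipeline run; with the typed version the same mistake stops the build instead.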

Lookback
● Spark version
○ 0.9 -> 1.6.0
● API
○ RDD -> Dataframe -> Dataset
● Deployment
○ EC2 -> EMR -> DB cloud
● Scheduling
○ Jenkins -> DB cloud Jobs
● Language
○ Scala

Lookback
● Data format
○ Text -> Parquet
● Storage
○ HDFS -> S3
● Deployment
○ Jenkins -> Travis

