How Spark is Making an Impact at Goldman Sachs by Vincent Saulys

Goldman Sachs
Engineering
GS.com/Engineering

Vice President - Technology
Vincent Saulys

Goldman Sachs Engineering
The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law.
These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved.
Early Big Data Tooling
• Multiple data sources
• Various APIs to get data
• Processing data with Java MapReduce and Pig scripts
[Diagram components: RDBMS, Apache Hive, Apache HBase, OpenTSDB, Hadoop HDFS, Pig scripts, Java MapReduce, REST]

Challenges
• Lots of Java code
• Debugging Pig when things go wrong
• Code-compile-deploy-debug (rinse and repeat)
Source: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Source: http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-pig/

Spark’s Attraction
Benefits
• Language support: Scala, Java, Python, R
• In-memory: Faster than other solutions
• SQL, Stream Processing, Machine Learning, Graphs
[Logos: Scala, Python, Java, R]

Spark Arrives on the Scene
• Strata+Hadoop World 2014
  • Many sessions on Apache Spark
• Internal blog posts on Spark
[Images: blog posts]

Viral Adoption
Added to Data Science Toolkit
• Works with GS Hadoop clusters out of the box
• Supporting YARN, Hive, and SparkR (with v1.4.0)
• Integrated with proprietary tools
Community Supported
• Online forums, meetups, references, examples, sharing best practices

Data Science Toolkit with Spark Support
• Data Science Toolkit
  • Java, Scala, Python and R
  • Open source analytic packages (ggplot2, pandas, scikit-learn, Theano, etc.)
  • IDEs (and notebooks) for turnkey developer setup
• Mirrors Spark
  • Same language support
[Diagram components: Scala, R, Python, Java, Hadoop HDFS, scikit-learn, Theano, pandas, RStudio Desktop, IPython Notebook, ggplot2, Markdown, Shiny, H2O]

Today
• Languages: Scala and Java
• Data Processing (Batch and Stream)
  • Bring processing to the data
  • Massage data for report generation
  • Micro-batch approach works for streaming
  • Easy to process Kafka streams
Simpler and Faster Code
• Spark Job-Server
  • REST job server for sharing RDDs across jobs
  • https://github.com/spark-jobserver/spark-jobserver (open-source)
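The micro-batch bullet on this slide can be illustrated without any Spark dependency. The sketch below is plain Python, not the Spark Streaming API: it groups timestamped events into fixed-length intervals and runs the same batch function on each group, which is the idea behind treating a stream as a sequence of small batch jobs. The names `micro_batches` and `count_words`, the one-second interval, and the sample events are assumptions made for illustration. Unlike a real streaming engine, this version simply skips intervals that contain no events.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# An event is (timestamp in seconds, text payload). Hypothetical shape.
Event = Tuple[float, str]


def micro_batches(events: Iterable[Event], interval: float) -> List[List[Event]]:
    """Group events into consecutive fixed-length time intervals.

    Intervals with no events are skipped in this simplified sketch.
    """
    buckets: Dict[int, List[Event]] = defaultdict(list)
    for ts, payload in events:
        # Events whose timestamps fall in the same interval share a bucket.
        buckets[int(ts // interval)].append((ts, payload))
    return [buckets[key] for key in sorted(buckets)]


def count_words(batch: List[Event]) -> Dict[str, int]:
    """An ordinary batch computation, applied unchanged to each micro-batch."""
    counts: Dict[str, int] = defaultdict(int)
    for _, payload in batch:
        for word in payload.split():
            counts[word] += 1
    return dict(counts)


if __name__ == "__main__":
    stream = [(0.2, "buy sell"), (0.9, "buy"), (1.1, "sell sell"), (2.5, "buy")]
    for number, batch in enumerate(micro_batches(stream, interval=1.0)):
        print(number, count_words(batch))
```

The point of the design is that `count_words` knows nothing about streaming: the same function would work on a full historical data set, which is why one code base can serve both batch and stream processing.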

Challenges Encountered, Lessons Learned
Windows driver / Linux executors
• When spark-shell --master yarn does not work…
JVM-based deployments
• Scala or Java based, no Python or R (yet)
Secure data clusters
• Supported by YARN
Compute and HDFS on separate machines
• One Data Lake (shared HDFS)
• YARN for compute/processing
• Multiple YARN clusters for different workloads using the same HDFS

On the Horizon
Beyond Scala and Java
• R: manipulate and reduce data for local analysis, plotting, and reporting
• Python: run simulations (written in Python), collect results
‘Burst out’ Cluster Compute Support
• On-demand YARN clusters built with needed libraries (with version support)
Machine Learning
• Need: Full development lifecycle integration
• What’s different from traditional SDLC: model training, deployment, monitoring and maintenance
Data Reproducibility
• Data Provenance: A framework to keep code and the data it produced together
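The data provenance item on this slide names a goal rather than a design, so the following is only one possible reading of it: a minimal Python sketch that runs a job from its source text and stores SHA-256 fingerprints of the code, the input, and the output next to the result. Everything here is hypothetical, including the function names, the record layout, and the sample job; it is not the framework referred to in the talk.

```python
import hashlib
import json
from typing import Any, Dict, List


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def canonical(obj: Any) -> bytes:
    """Serialize JSON-compatible data deterministically so equal data hashes equally."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def run_with_provenance(code_text: str, entry_point: str, rows: List[dict]) -> Dict[str, Any]:
    """Run `entry_point` defined in `code_text` on `rows`; return output plus fingerprints."""
    namespace: Dict[str, Any] = {}
    exec(code_text, namespace)  # execute exactly the text that gets hashed
    output = namespace[entry_point](rows)
    return {
        "code_sha256": sha256_hex(code_text.encode("utf-8")),
        "input_sha256": sha256_hex(canonical(rows)),
        "output_sha256": sha256_hex(canonical(output)),
        "output": output,
    }


# Hypothetical job and data, for illustration only.
CODE = '''
def total_by_desk(rows):
    totals = {}
    for row in rows:
        totals[row["desk"]] = totals.get(row["desk"], 0) + row["amount"]
    return totals
'''
ROWS = [
    {"desk": "fx", "amount": 10},
    {"desk": "rates", "amount": 7},
    {"desk": "fx", "amount": 5},
]

if __name__ == "__main__":
    record = run_with_provenance(CODE, "total_by_desk", ROWS)
    print(json.dumps(record, indent=2, sort_keys=True))
```

Because the job is executed from the same string that is hashed, the stored `code_sha256` always describes the code that actually produced the output. Any edit to the code or the input changes the corresponding fingerprint.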

Wish List
Continued DataFrame Enhancements
• Easier preparation for Machine Learning
  > Easy one-hot encoding
  > Pivot (a.k.a. cast/melt in R)
• Functional parity with R and Pandas (just bigger data)
Formulas!
• Ex: “response ~ feature_1 + feature_2 + …”
• Less data preparation code for machine learning (ex: one-hot for categorical data columns)
• Introduced in 1.5 but only for R and Python
IDE for Spark with Plots and Publish
• Beyond notebooks
• Simpler than a Java/Scala IDE
• Integrated with enterprise software development lifecycle
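To make the one-hot encoding and formula items on this slide concrete, here is a small plain-Python sketch of what a formula such as `response ~ feature_1 + feature_2` saves the user from writing by hand. It is not Spark's or R's implementation: it only understands `+` between column names, adds no intercept, and keeps every category level instead of dropping a reference level as R does. The function names and the sample rows are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def one_hot(values: Sequence[str]) -> Tuple[List[str], List[List[int]]]:
    """Return the sorted category labels and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        indicators = [0] * len(categories)
        indicators[position[value]] = 1
        encoded.append(indicators)
    return categories, encoded


def parse_formula(formula: str) -> Tuple[str, List[str]]:
    """Split 'response ~ a + b' into ('response', ['a', 'b'])."""
    response, _, right_side = formula.partition("~")
    features = [term.strip() for term in right_side.split("+") if term.strip()]
    if not response.strip() or not features:
        raise ValueError(f"cannot parse formula: {formula!r}")
    return response.strip(), features


def design_matrix(formula: str, rows: List[Row]) -> Tuple[List[str], List[List[float]], List[float]]:
    """Build column names, feature rows X, and response y from a formula."""
    response, features = parse_formula(formula)
    names: List[str] = []
    matrix: List[List[float]] = [[] for _ in rows]
    for feature in features:
        column = [row[feature] for row in rows]
        if all(isinstance(value, str) for value in column):
            # String columns are treated as categorical and one-hot encoded.
            categories, encoded = one_hot(column)
            names.extend(f"{feature}={category}" for category in categories)
            for target, indicators in zip(matrix, encoded):
                target.extend(indicators)
        else:
            # Everything else is assumed to be numeric.
            names.append(feature)
            for target, value in zip(matrix, column):
                target.append(float(value))
    y = [float(row[response]) for row in rows]
    return names, matrix, y


if __name__ == "__main__":
    sample = [
        {"pnl": 1.0, "desk": "fx", "size": 10},
        {"pnl": 2.0, "desk": "rates", "size": 20},
        {"pnl": 3.0, "desk": "fx", "size": 30},
    ]
    names, X, y = design_matrix("pnl ~ desk + size", sample)
    print(names)
    for features_row, target in zip(X, y):
        print(features_row, target)
```

With a formula, the caller states which columns matter and the library decides how each one is encoded, which is the reduction in data preparation code that the slide asks for.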

Learn more at GS.com/Engineering
