SlideShare a Scribd company logo
Goldman Sachs
Engineering
GS.com/Engineering
Vice President - Technology
Vincent Saulys
How Spark is Making an Impact
Goldman Sachs Engineering
The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law.
These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not
be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman
Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved.
4
Early Big Data Tooling
 Multiple data sources
 Various APIs to get data
 Processing data with Java map-reduce and PIG scripts
RDBMS
Apache
Hive
Apache
HBASE
OpenTSDB
Hadoop HDFS
PIG
Scripts
Java
Map-Reduce
REST
Goldman Sachs Engineering
The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law.
These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not
be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman
Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved.
5
Challenges
 Lots of Java code
 Debugging PIG when things go wrong
 Code-compile-deploy-debug (rinse and repeat)
Source: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html Source: http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-pig/
Goldman Sachs Engineering
The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law.
These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not
be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman
Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved.
6
Spark’s Attraction
Benefits
 Language support: Scala, Java, Python, R
 In memory: Faster than other solutions
 SQL, Stream Processing, Machine Learning, Graphs
Scala
Python
Java
R
Goldman Sachs Engineering
The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law.
These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not
be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman
Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved.
 Strata+Hadoop World 2014
• Many sessions on Apache Spark
 Internal blog posts on Spark
7
Spark arrives on the scene
Blog Post
Images
Goldman Sachs Engineering
The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law.
These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not
be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman
Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved.
8
Viral Adoption
Added to Data Science Toolkit
 Works with GS Hadoop clusters out of box
 Supporting YARN, Hive, and sparkr (with v1.4.0)
 Integrated with proprietary tools
Community Supported
 On-line forums, meet-ups, references, examples, sharing best practices
Goldman Sachs Engineering
The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law.
These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not
be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman
Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved.
 Data Science Toolkit
• Java, Scala, Python and R
• Open source analytic packages (ggplot2, pandas, scikit-learn, theano, etc.)
• IDEs (and notebooks) for turnkey developer setup
 Mirrors Spark
• Same language support
9
Data Science Toolkit with Spark Support
Scala
R
Python
Hadoop
HDFSJava
Scikit-Learn
Theano
Pandas
RStudio Desktop
IPython Notebook
ggplot2
Markdown
Shiny
H2O
Goldman Sachs Engineering
The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law.
These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not
be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman
Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved.
 Languages: Scala and Java
 Data Processing (Batch and Stream)
• Bring process to the data
• Massage data for report generation
• Micro-batch approach works for streaming
• Easy to process kafka streams
Simpler and Faster Code
 Spark Job-Server
• REST job server for sharing RDDs across jobs
• https://github.com/spark-jobserver/spark-jobserver (open-source)
10
Today
Goldman Sachs Engineering
The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law.
These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not
be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman
Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved.
Windows driver / Linux executors
• When spark-shell –master yarn does not work….
JVM based deployments
• Scala or Java based, no Python or R (yet)
Secure data clusters
• Supported by YARN
Compute and HDFS on separate machines
• One Data Lake (shared HDFS)
• YARN for compute/processing
• Multiple YARN clusters for different workloads using same HDFS
11
Challenges Encountered, Lessons Learned
Goldman Sachs Engineering
The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law.
These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not
be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman
Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved.
Beyond Scala and Java
• R: manipulate and reduce data for local analysis, plotting, and reporting
• Python: run simulations (written in Python), collect results
‘Burst out’ Cluster Compute Support
• On demand YARN clusters built with needed libraries (with version support)
Machine Learning
• Need: Full development lifecycle integration
• What’s different from traditional SDLC: model training, deployment, monitoring and
maintenance
Data Reproducibility
• Data Provenance: A framework to keep code and the data it produced together
12
On the Horizon
Goldman Sachs Engineering
The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law.
These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not
be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman
Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved.
Continued Data Frame Enhancements
• Easier preparation for Machine Learning
> Easy one hot encoding
> Pivot (a.k.a. cast/melt in R)
• Functional parity with R and Pandas (just bigger data)
Formulas!
• Ex: “response ~ feature_1 + feature_2 + …”
• Less data preparation code for machine learning (ex: on-hot for categorical data columns)
• Introduced in 1.5 but only for R and Python
IDE for Spark with Plots and Publish
• Beyond notebooks
• Simpler than a Java/Scala IDE
• Integrated with enterprise software development lifecycle
13
Wish List
Goldman Sachs Engineering
The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law.
These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not
be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman
Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved.
Learn more at GS.com/Engineering

More Related Content

What's hot

Data mining 1 - Introduction (cheat sheet - printable)
Data mining 1 - Introduction (cheat sheet - printable)Data mining 1 - Introduction (cheat sheet - printable)
Data mining 1 - Introduction (cheat sheet - printable)
yesheeka
 
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseApache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Mo Patel
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
confluent
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
Tao Feng
 
Monitoring dynamique : Grafana et Microsoft
Monitoring dynamique : Grafana et MicrosoftMonitoring dynamique : Grafana et Microsoft
Monitoring dynamique : Grafana et Microsoft
Christophe Villeneuve
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysis
Amazon Web Services
 
Big Data and Analytics on AWS
Big Data and Analytics on AWS Big Data and Analytics on AWS
Big Data and Analytics on AWS
Amazon Web Services
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Agile Testing Alliance
 
Spark overview
Spark overviewSpark overview
Spark overview
Lisa Hua
 
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Edureka!
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
Gokhan Atil
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
DataWorks Summit/Hadoop Summit
 
What Is RDD In Spark? | Edureka
What Is RDD In Spark? | EdurekaWhat Is RDD In Spark? | Edureka
What Is RDD In Spark? | Edureka
Edureka!
 
Rethinking State Management in Cloud-Native Streaming Systems
Rethinking State Management in Cloud-Native Streaming SystemsRethinking State Management in Cloud-Native Streaming Systems
Rethinking State Management in Cloud-Native Streaming Systems
Yingjun Wu
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
Edureka!
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Stanley Wang
 
Data-Driven @ Netflix
Data-Driven @ NetflixData-Driven @ Netflix
Data-Driven @ Netflix
Michelle Ufford
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache Spark
Reynold Xin
 

What's hot (20)

Data mining 1 - Introduction (cheat sheet - printable)
Data mining 1 - Introduction (cheat sheet - printable)Data mining 1 - Introduction (cheat sheet - printable)
Data mining 1 - Introduction (cheat sheet - printable)
 
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseApache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
Monitoring dynamique : Grafana et Microsoft
Monitoring dynamique : Grafana et MicrosoftMonitoring dynamique : Grafana et Microsoft
Monitoring dynamique : Grafana et Microsoft
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysis
 
Big Data and Analytics on AWS
Big Data and Analytics on AWS Big Data and Analytics on AWS
Big Data and Analytics on AWS
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
 
What Is RDD In Spark? | Edureka
What Is RDD In Spark? | EdurekaWhat Is RDD In Spark? | Edureka
What Is RDD In Spark? | Edureka
 
Rethinking State Management in Cloud-Native Streaming Systems
Rethinking State Management in Cloud-Native Streaming SystemsRethinking State Management in Cloud-Native Streaming Systems
Rethinking State Management in Cloud-Native Streaming Systems
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Data-Driven @ Netflix
Data-Driven @ NetflixData-Driven @ Netflix
Data-Driven @ Netflix
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache Spark
 

Viewers also liked

M&A Pitch Book- Costco and Target Proposed Merger
M&A Pitch Book- Costco and Target Proposed MergerM&A Pitch Book- Costco and Target Proposed Merger
M&A Pitch Book- Costco and Target Proposed Merger
Mohammad Al Sabeeh
 
GOLDMAN SACHS
GOLDMAN SACHS GOLDMAN SACHS
GOLDMAN SACHS
Asgar Hussain Inamdar
 
Vito Gamberale - Presentazione f2i
Vito Gamberale - Presentazione f2iVito Gamberale - Presentazione f2i
Vito Gamberale - Presentazione f2i
Vito Gamberale
 
Inter-service communication
Inter-service communicationInter-service communication
Inter-service communication
Steve Upton
 
EVCache at Netflix
EVCache at NetflixEVCache at Netflix
EVCache at Netflix
Shashi Shekar Madappa
 
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Chris Fregly
 
University of Westminster Skills Academy CV 2016
University of Westminster Skills Academy CV 2016University of Westminster Skills Academy CV 2016
University of Westminster Skills Academy CV 2016
Louise Bamford
 
Managing your workload
Managing your workloadManaging your workload
Managing your workload
Simon Ahern
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
Dongmin Yu
 
Sql Stream Intro
Sql Stream IntroSql Stream Intro
Sql Stream Intro
Chris Clabaugh
 
Ovum Fireside Chat: Governing the data lake - Understanding what's in there
Ovum Fireside Chat: Governing the data lake - Understanding what's in thereOvum Fireside Chat: Governing the data lake - Understanding what's in there
Ovum Fireside Chat: Governing the data lake - Understanding what's in there
Zaloni
 
Zeromq anatomy & jeromq
Zeromq anatomy & jeromqZeromq anatomy & jeromq
Zeromq anatomy & jeromq
Dongmin Yu
 
Fantastic DSL in Python
Fantastic DSL in PythonFantastic DSL in Python
Fantastic DSL in Python
kwatch
 
Business Track: Building a Private Cloud to Empower the Business at Goldman ...
Business Track: Building a Private Cloud  to Empower the Business at Goldman ...Business Track: Building a Private Cloud  to Empower the Business at Goldman ...
Business Track: Building a Private Cloud to Empower the Business at Goldman ...
MongoDB
 
Jp morgan chase and bear stearns
Jp morgan chase and bear stearnsJp morgan chase and bear stearns
Jp morgan chase and bear stearns
Elvia Quiñonez
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
Jesus Rodriguez
 
2014 Biopharmaceutical Partnering Survey
2014 Biopharmaceutical Partnering Survey2014 Biopharmaceutical Partnering Survey
2014 Biopharmaceutical Partnering Survey
Boston Consulting Group
 
The 2nd Space Revolution
The 2nd Space RevolutionThe 2nd Space Revolution
The 2nd Space Revolution
Fabernovel
 
Death to the DevOps team - Agile Cambridge 2014
Death to the DevOps team - Agile Cambridge 2014Death to the DevOps team - Agile Cambridge 2014
Death to the DevOps team - Agile Cambridge 2014
Matthew Skelton
 

Viewers also liked (20)

M&A Pitch Book- Costco and Target Proposed Merger
M&A Pitch Book- Costco and Target Proposed MergerM&A Pitch Book- Costco and Target Proposed Merger
M&A Pitch Book- Costco and Target Proposed Merger
 
TIP - Investor Meeting 2012
TIP - Investor Meeting 2012TIP - Investor Meeting 2012
TIP - Investor Meeting 2012
 
GOLDMAN SACHS
GOLDMAN SACHS GOLDMAN SACHS
GOLDMAN SACHS
 
Vito Gamberale - Presentazione f2i
Vito Gamberale - Presentazione f2iVito Gamberale - Presentazione f2i
Vito Gamberale - Presentazione f2i
 
Inter-service communication
Inter-service communicationInter-service communication
Inter-service communication
 
EVCache at Netflix
EVCache at NetflixEVCache at Netflix
EVCache at Netflix
 
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
 
University of Westminster Skills Academy CV 2016
University of Westminster Skills Academy CV 2016University of Westminster Skills Academy CV 2016
University of Westminster Skills Academy CV 2016
 
Managing your workload
Managing your workloadManaging your workload
Managing your workload
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
Sql Stream Intro
Sql Stream IntroSql Stream Intro
Sql Stream Intro
 
Ovum Fireside Chat: Governing the data lake - Understanding what's in there
Ovum Fireside Chat: Governing the data lake - Understanding what's in thereOvum Fireside Chat: Governing the data lake - Understanding what's in there
Ovum Fireside Chat: Governing the data lake - Understanding what's in there
 
Zeromq anatomy & jeromq
Zeromq anatomy & jeromqZeromq anatomy & jeromq
Zeromq anatomy & jeromq
 
Fantastic DSL in Python
Fantastic DSL in PythonFantastic DSL in Python
Fantastic DSL in Python
 
Business Track: Building a Private Cloud to Empower the Business at Goldman ...
Business Track: Building a Private Cloud  to Empower the Business at Goldman ...Business Track: Building a Private Cloud  to Empower the Business at Goldman ...
Business Track: Building a Private Cloud to Empower the Business at Goldman ...
 
Jp morgan chase and bear stearns
Jp morgan chase and bear stearnsJp morgan chase and bear stearns
Jp morgan chase and bear stearns
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
 
2014 Biopharmaceutical Partnering Survey
2014 Biopharmaceutical Partnering Survey2014 Biopharmaceutical Partnering Survey
2014 Biopharmaceutical Partnering Survey
 
The 2nd Space Revolution
The 2nd Space RevolutionThe 2nd Space Revolution
The 2nd Space Revolution
 
Death to the DevOps team - Agile Cambridge 2014
Death to the DevOps team - Agile Cambridge 2014Death to the DevOps team - Agile Cambridge 2014
Death to the DevOps team - Agile Cambridge 2014
 

Similar to How Spark is Making an Impact at Goldman Sachs by Vincent Saulys

Hunk: Splunk Analytics for Hadoop
Hunk: Splunk Analytics for HadoopHunk: Splunk Analytics for Hadoop
Hunk: Splunk Analytics for Hadoop
Georg Knon
 
Hadoop Turns a Corner and Sees the Future
Hadoop Turns a Corner and Sees the FutureHadoop Turns a Corner and Sees the Future
Hadoop Turns a Corner and Sees the Future
DataWorks Summit
 
SplunkSummit 2015 - Real World Big Data Architecture
SplunkSummit 2015 -  Real World Big Data ArchitectureSplunkSummit 2015 -  Real World Big Data Architecture
SplunkSummit 2015 - Real World Big Data Architecture
Splunk
 
SAP HANA Cloud Platform - The big picture
SAP HANA Cloud Platform - The big picture SAP HANA Cloud Platform - The big picture
SAP HANA Cloud Platform - The big picture
Matthias Steiner
 
Operationalizing security intelligence for the mid market - Rafal Los - RSA C...
Operationalizing security intelligence for the mid market - Rafal Los - RSA C...Operationalizing security intelligence for the mid market - Rafal Los - RSA C...
Operationalizing security intelligence for the mid market - Rafal Los - RSA C...
Rafal Los
 
AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...
AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...
AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...
Sandesh Rao
 
Building an Analytics Enables SOC
Building an Analytics Enables SOCBuilding an Analytics Enables SOC
Building an Analytics Enables SOC
Splunk
 
The New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the CloudThe New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the Cloud
Inside Analysis
 
SaaSfocus_Profile_s
SaaSfocus_Profile_sSaaSfocus_Profile_s
SaaSfocus_Profile_s
Shawn Murray
 
Managing Successful Data Projects: Technology Selection and Team Building
Managing Successful Data Projects: Technology Selection and Team BuildingManaging Successful Data Projects: Technology Selection and Team Building
Managing Successful Data Projects: Technology Selection and Team Building
Cloudera, Inc.
 
SAP HANA SPS09 - SAP River
SAP HANA SPS09 - SAP RiverSAP HANA SPS09 - SAP River
SAP HANA SPS09 - SAP River
SAP Technology
 
Ryder SAP
Ryder SAPRyder SAP
Ryder SAP
CALSTART
 
Hadoop Summit Keynote 2014
Hadoop Summit Keynote 2014Hadoop Summit Keynote 2014
Hadoop Summit Keynote 2014
Merv Adrian
 
RAC Troubleshooting and Diagnosability Sangam2016
RAC Troubleshooting and Diagnosability Sangam2016RAC Troubleshooting and Diagnosability Sangam2016
RAC Troubleshooting and Diagnosability Sangam2016
Sandesh Rao
 
Spark forspringdevs springone_final
Spark forspringdevs springone_finalSpark forspringdevs springone_final
Spark forspringdevs springone_final
sdeeg
 
Webinar: Is Spark Hadoop's Friend or Foe?
Webinar: Is Spark Hadoop's Friend or Foe? Webinar: Is Spark Hadoop's Friend or Foe?
Webinar: Is Spark Hadoop's Friend or Foe?
Zaloni
 
Advanced blockchain
Advanced blockchain Advanced blockchain
Advanced blockchain
TechMeetups
 
import data from Oracle Database into Python Pandas Dataframe
import data from Oracle Database into Python Pandas Dataframeimport data from Oracle Database into Python Pandas Dataframe
import data from Oracle Database into Python Pandas Dataframe
Johan Louwers
 
Getting Started with Splunk Enterprise
Getting Started with Splunk EnterpriseGetting Started with Splunk Enterprise
Getting Started with Splunk Enterprise
Splunk
 
DOAG Big Data Days 2017 - Cloud Journey
DOAG Big Data Days 2017 - Cloud JourneyDOAG Big Data Days 2017 - Cloud Journey
DOAG Big Data Days 2017 - Cloud Journey
Harald Erb
 

Similar to How Spark is Making an Impact at Goldman Sachs by Vincent Saulys (20)

Hunk: Splunk Analytics for Hadoop
Hunk: Splunk Analytics for HadoopHunk: Splunk Analytics for Hadoop
Hunk: Splunk Analytics for Hadoop
 
Hadoop Turns a Corner and Sees the Future
Hadoop Turns a Corner and Sees the FutureHadoop Turns a Corner and Sees the Future
Hadoop Turns a Corner and Sees the Future
 
SplunkSummit 2015 - Real World Big Data Architecture
SplunkSummit 2015 -  Real World Big Data ArchitectureSplunkSummit 2015 -  Real World Big Data Architecture
SplunkSummit 2015 - Real World Big Data Architecture
 
SAP HANA Cloud Platform - The big picture
SAP HANA Cloud Platform - The big picture SAP HANA Cloud Platform - The big picture
SAP HANA Cloud Platform - The big picture
 
Operationalizing security intelligence for the mid market - Rafal Los - RSA C...
Operationalizing security intelligence for the mid market - Rafal Los - RSA C...Operationalizing security intelligence for the mid market - Rafal Los - RSA C...
Operationalizing security intelligence for the mid market - Rafal Los - RSA C...
 
AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...
AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...
AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...
 
Building an Analytics Enables SOC
Building an Analytics Enables SOCBuilding an Analytics Enables SOC
Building an Analytics Enables SOC
 
The New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the CloudThe New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the Cloud
 
SaaSfocus_Profile_s
SaaSfocus_Profile_sSaaSfocus_Profile_s
SaaSfocus_Profile_s
 
Managing Successful Data Projects: Technology Selection and Team Building
Managing Successful Data Projects: Technology Selection and Team BuildingManaging Successful Data Projects: Technology Selection and Team Building
Managing Successful Data Projects: Technology Selection and Team Building
 
SAP HANA SPS09 - SAP River
SAP HANA SPS09 - SAP RiverSAP HANA SPS09 - SAP River
SAP HANA SPS09 - SAP River
 
Ryder SAP
Ryder SAPRyder SAP
Ryder SAP
 
Hadoop Summit Keynote 2014
Hadoop Summit Keynote 2014Hadoop Summit Keynote 2014
Hadoop Summit Keynote 2014
 
RAC Troubleshooting and Diagnosability Sangam2016
RAC Troubleshooting and Diagnosability Sangam2016RAC Troubleshooting and Diagnosability Sangam2016
RAC Troubleshooting and Diagnosability Sangam2016
 
Spark forspringdevs springone_final
Spark forspringdevs springone_finalSpark forspringdevs springone_final
Spark forspringdevs springone_final
 
Webinar: Is Spark Hadoop's Friend or Foe?
Webinar: Is Spark Hadoop's Friend or Foe? Webinar: Is Spark Hadoop's Friend or Foe?
Webinar: Is Spark Hadoop's Friend or Foe?
 
Advanced blockchain
Advanced blockchain Advanced blockchain
Advanced blockchain
 
import data from Oracle Database into Python Pandas Dataframe
import data from Oracle Database into Python Pandas Dataframeimport data from Oracle Database into Python Pandas Dataframe
import data from Oracle Database into Python Pandas Dataframe
 
Getting Started with Splunk Enterprise
Getting Started with Splunk EnterpriseGetting Started with Splunk Enterprise
Getting Started with Splunk Enterprise
 
DOAG Big Data Days 2017 - Cloud Journey
DOAG Big Data Days 2017 - Cloud JourneyDOAG Big Data Days 2017 - Cloud Journey
DOAG Big Data Days 2017 - Cloud Journey
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
z6osjkqvd
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
keesa2
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
Alireza Kamrani
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
nhutnguyen355078
 
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
1tyxnjpia
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
uevausa
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
agdhot
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
osoyvvf
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
actyx
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
eoxhsaa
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
asyed10
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
ugydym
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
NABLAS株式会社
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
KiriakiENikolaidou
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 

Recently uploaded (20)

[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
 
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
 

How Spark is Making an Impact at Goldman Sachs by Vincent Saulys

  • 2. Vice President - Technology Vincent Saulys
  • 3. How Spark is Making an Impact
  • 4. Goldman Sachs Engineering The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law. These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved. 4 Early Big Data Tooling  Multiple data sources  Various APIs to get data  Processing data with Java map-reduce and PIG scripts RDBMS Apache Hive Apache HBASE OpenTSDB Hadoop HDFS PIG Scripts Java Map-Reduce REST
  • 5. Goldman Sachs Engineering The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law. These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved. 5 Challenges  Lots of Java code  Debugging PIG when things go wrong  Code-compile-deploy-debug (rinse and repeat) Source: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html Source: http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-pig/
  • 6. Goldman Sachs Engineering The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law. These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved. 6 Spark’s Attraction Benefits  Language support: Scala, Java, Python, R  In memory: Faster than other solutions  SQL, Stream Processing, Machine Learning, Graphs Scala Python Java R
  • 7. Goldman Sachs Engineering The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law. These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved.  Strata+Hadoop World 2014 • Many sessions on Apache Spark  Internal blog posts on Spark 7 Spark arrives on the scene Blog Post Images
  • 8. Goldman Sachs Engineering The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law. These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved. 8 Viral Adoption Added to Data Science Toolkit  Works with GS Hadoop clusters out of box  Supporting YARN, Hive, and sparkr (with v1.4.0)  Integrated with proprietary tools Community Supported  On-line forums, meet-ups, references, examples, sharing best practices
  • 9. Goldman Sachs Engineering The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law. These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved.  Data Science Toolkit • Java, Scala, Python and R • Open source analytic packages (ggplot2, pandas, scikit-learn, theano, etc.) • IDEs (and notebooks) for turnkey developer setup  Mirrors Spark • Same language support 9 Data Science Toolkit with Spark Support Scala R Python Hadoop HDFSJava Scikit-Learn Theano Pandas RStudio Desktop IPython Notebook ggplot2 Markdown Shiny H2O
  • 10. Goldman Sachs Engineering The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law. These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved.  Languages: Scala and Java  Data Processing (Batch and Stream) • Bring process to the data • Massage data for report generation • Micro-batch approach works for streaming • Easy to process kafka streams Simpler and Faster Code  Spark Job-Server • REST job server for sharing RDDs across jobs • https://github.com/spark-jobserver/spark-jobserver (open-source) 10 Today
  • 11. Goldman Sachs Engineering The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law. These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved. Windows driver / Linux executors • When spark-shell –master yarn does not work…. JVM based deployments • Scala or Java based, no Python or R (yet) Secure data clusters • Supported by YARN Compute and HDFS on separate machines • One Data Lake (shared HDFS) • YARN for compute/processing • Multiple YARN clusters for different workloads using same HDFS 11 Challenges Encountered, Lessons Learned
  • 12. Goldman Sachs Engineering The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law. These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved. Beyond Scala and Java • R: manipulate and reduce data for local analysis, plotting, and reporting • Python: run simulations (written in Python), collect results ‘Burst out’ Cluster Compute Support • On demand YARN clusters built with needed libraries (with version support) Machine Learning • Need: Full development lifecycle integration • What’s different from traditional SDLC: model training, deployment, monitoring and maintenance Data Reproducibility • Data Provenance: A framework to keep code and the data it produced together 12 On the Horizon
  • 13. Goldman Sachs Engineering The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law. These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved. Continued Data Frame Enhancements • Easier preparation for Machine Learning > Easy one hot encoding > Pivot (a.k.a. cast/melt in R) • Functional parity with R and Pandas (just bigger data) Formulas! • Ex: “response ~ feature_1 + feature_2 + …” • Less data preparation code for machine learning (ex: on-hot for categorical data columns) • Introduced in 1.5 but only for R and Python IDE for Spark with Plots and Publish • Beyond notebooks • Simpler than a Java/Scala IDE • Integrated with enterprise software development lifecycle 13 Wish List
  • 14. Goldman Sachs Engineering The term ‘engineer’ referenced in this section is neither a licensed engineer nor an individual offering engineering services to the general public under applicable law. These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2015 The Goldman Sachs Group, Inc. All rights reserved. Learn more at GS.com/Engineering

Editor's Notes

  1. Introductions Member of firmwide machine learning working group Data Science Community Organizer Organize the internal data science community GS is a large organization with DS across the organization
  2. Thought hard on the topic for the keynote Wanted to be useful to others getting started with adoption So… How spark came into Goldman What we use it for today Future uses of spark (and work still needed)
  3. Multiple Sources of data Databases -> Right into HDFS Took tables from RDBMS and copied to Hive/Hbase (tables again) NoSQL for query Levering tech atop HBASE (like opentsdb) Multiple ways to get the data SQL query, REST API Still need data in relational stores Usually finished with some Java Hadoop 1.x and 2.x, YARN Data into hadoop, data out for reports
  4. Lots of java code Java map-Reduce was not so simple Macros to generate skeleton, but still code to write (so went to PIG) PIG scripts hard to debug when things go wrong Another language to learn, another to support in the tech stack (SDLC, etc.) Still in code-compile-deploy-debug loop Common use case: parse, query (group by), summarize
  5. Technical Attraction Java! Portable skillset, many java programmers at GS, spark is another library to use Scala: still JVM based for easy deployment to clusters (and functional for code) Python: growing in acceptance and popularity R: used for analysis, can use Big Data for analytics In memory processing (FAST!) Easiest path: Re-tool developers (Java) then move into Scala Hobby projects existed through-out the organization No firmwide supported Compiling spark themselves and then experimenting with it Early use was ETL
  6. Many sessions using Spark at Strata - WOW, who knew map-reduce could be so simple and fast on Hadoop! Many could see how it can be applied to their problem Did not solve just one problem, but seen as solving every big data problem Our technology organization is over 9000 people Many programmers with opinions, but resonated with many of our coders CIO asking for configuration help was the impetus to formally support Spark in the organization Configuring Spark for Windows to Linux YARN was the first ‘bug’ we discovered - Interoperability between windows drivers and Linux executors (YARN) did not work until Spark 1.3 Post: “I’ve seen the future and its Apache Spark” Big Data was suddenly easy
  7. Keys to going viral Made it easy to try it out with their own data and problems Community support to help with questions No team “owns” spark (too new to have a center of excellence) Anyone can help, and does! People helping people Tools help for reference and immediate help Confluence, GSConnect, Symphony
  8. Latest version of the toolkit supports the same languages as spark does Its like we planned it, but we see it as validating choices Open source based Supported on Linux and Windows
  9. Scala and Java No need to ‘install spark on cluster’ when its JVM based Jars submitted to YARN Java has inertia, already know the language; Scala is growing Scala has REPL and the code is cleaner to see where work is being done (driver v. executor) Recommended and used widely for ETL Transforming data in batch jobs Stream Processing: micro batch of kafka sources Spark preferred to PIG, slowly replacing existing PIG scripts Spark is faster than PIG Why change code if it works? (not replacing what already works and does not need changing) Spark job server Opensource project! Shared RDDs across jobs Why persist to disk only to re-load multiple times for other jobs?
  10. Recurring hickups when windows is the driver and Linux is the executor Be careful with file path separators Issue occurs in development, production is all linux Lesson: Test with multiple environments JVM based deployments best Data clusters are shared Python and R (with libraries) not on all clusters Will solve with “YARN on Demand” feature for internal cloud All data is secured (kerberized) Only YARN cluster managers We do not deploy stand-alone spark clusters Moving to separate compute from data Different workloads HDFS does not have enough compute, so creating compute only clusters Preference for teams to own their machines Teams can manage the software deployed to the nodes
  11. Python and R workloads look to be different More analysis driven Running of simulations by non-programmers (e.g. not IT but from office) YARN on Demand For bursty work Why hold compute resources idle Resources are temporarily dedicated (but pulled from a virualized pool of compute) Machine Learning Data AND code produce the model, need to version control BOTH Development lifecycle challenges
  12. As a Data Scientist…. Data Frames Less code more data frame operations (UDAFs are a start) Seamless operation from Python and R (familiar operations and capabilities) Formulas It just works from a data scientists viewpoint Lowers entry if not a programmer by trade IDE Separate code, console and plotting You need to have plots if doing data science work
  13. Our online Engineering Hub (gs.com/engineering) is regularly updated with profiles of our latest projects, our engineers and our activities in the community.