Vadim Solovey
vadim@doit-intl.com
Google Cloud Dataproc
Spark and Hadoop with super-fast start-up,
easy management, and per-minute billing.
Copyright 2015 Google Inc
Google Developer Expert & Trainer
CTO of DoIT International
Agenda
01 Google Dataproc Overview
02 Features
03 Demo
04 Roadmap
05 Q&A
06 Try Google Dataproc
Cloud Dataproc
Google Cloud Dataproc is a fast, easy-to-use,
low-cost, fully managed service that lets you
run Spark and Hadoop on Google Cloud Platform.
Confidential & Proprietary - Google Cloud Platform
Management
Mobile
Services
Compute
Big Data
Storage
Developer Tools
Dataproc 101
Easy to Use
Easily create and scale clusters to run native:
• Spark
• PySpark
• Spark SQL
• MapReduce
• Hive
• Pig
• More with Initialization Actions (IAs)

Integrated
Integration with Cloud Platform provides immense scalability, ease of use, and multiple channels for cluster interaction and management.

Low Cost
Low-cost data processing with:
• Low and fixed price
• Minute-by-minute billing
• Fast cluster provisioning, execution, and removal
• Ability to manually scale clusters based on needs
• Preemptible instances
Competitive Highlights

Product characteristic: Cluster start time (elapsed time from cluster creation until it is ready)
Cloud Dataproc: < 90 seconds
Amazon EMR: ~360 seconds
Customer impact: Faster data processing workflows because less time is spent waiting for clusters to provision and start executing applications.

Product characteristic: Billing unit of measure (increment used for billing the service when active)
Cloud Dataproc: Minute
Amazon EMR: Hourly
Customer impact: Reduced costs for running Spark and Hadoop because you pay for what you actually use, not a cost which has been rounded up.

Product characteristic: Preemptible VMs (clusters can utilize preemptible VMs)
Cloud Dataproc: Yes
Amazon EMR: Kind of :-)
Customer impact: Lower total operating costs for Spark and Hadoop processing by leveraging the cost benefits of preemptibles.

Product characteristic: Job output & cancellation (job output is easy to find and jobs are cancelable without SSH)
Cloud Dataproc: Yes
Amazon EMR: No
Customer impact: Higher productivity because job output does not necessitate reviewing log files and canceling jobs does not require SSH.
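The effect of the billing unit can be illustrated with a small calculation. This is a minimal sketch, not an official pricing formula: the $10 per hour cluster rate is a hypothetical figure (not taken from the slides), and any minimum charge a provider may apply is ignored. The 35-minute runtime matches the job used in the pricing example later in the deck.

```python
import math


def billed_cost(hourly_rate, runtime_minutes, increment_minutes):
    """Cost when the runtime is rounded up to whole billing increments."""
    increments = math.ceil(runtime_minutes / increment_minutes)
    billed_minutes = increments * increment_minutes
    return billed_minutes * hourly_rate / 60.0


HOURLY_RATE = 10.0     # hypothetical cluster cost in USD per hour
RUNTIME_MINUTES = 35   # job duration

per_minute = billed_cost(HOURLY_RATE, RUNTIME_MINUTES, increment_minutes=1)
per_hour = billed_cost(HOURLY_RATE, RUNTIME_MINUTES, increment_minutes=60)

print(f"Billed by the minute: ${per_minute:.2f}")
print(f"Billed by the hour:   ${per_hour:.2f}")
```

With hourly increments the 35-minute job is charged as a full hour, while per-minute billing charges only the 35 minutes used.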
02 Features
Packaging & Versioning
● Spark 1.5.2 with PySpark & Spark SQL
● Hadoop 2.7.1
● Pig 0.15
● Hive 1.2.1
● YARN Resource Manager
● Debian 8-based OS
● Google connectors for Cloud Storage, BigQuery, Bigtable, etc.
Features
Integrated: Integrated with Cloud Storage, Cloud Logging, BigQuery, and more.
Anytime Scaling: Manually scale clusters up or down based on need, even when jobs are running.
Tools: UI, API & CLI for rapid development, including Initialization Actions & Job Output Driver.
Global Availability: Available in every Google Cloud zone in the United States, Europe, and Asia.
Initialization Action Example
#!/bin/bash
# Installs IPython Notebook and points it at the cluster's Spark.
# The /root paths below assume the script runs as root.
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
# Only run on the master node
if [[ "${ROLE}" == 'Master' ]]; then
  # -y keeps apt-get non-interactive; there is no terminal to answer prompts
  apt-get update
  apt-get install -y build-essential python-dev libpng-dev libfreetype6-dev libxft-dev pkg-config python-matplotlib python-requests
  curl https://bootstrap.pypa.io/get-pip.py | python
  mkdir -p IPythonNB
  pip install "ipython[notebook]"
  ipython profile create default
  echo "c = get_config()" > /root/.ipython/profile_default/ipython_notebook_config.py
  echo "c.NotebookApp.ip = '*'" >> /root/.ipython/profile_default/ipython_notebook_config.py
  # Setup script for IPython Notebook so it uses the cluster's Spark
  cat > /root/.ipython/profile_default/startup/00-pyspark-setup.py <<'_EOF'
import os
import sys
spark_home = '/usr/lib/spark/'
os.environ["SPARK_HOME"] = spark_home
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
_EOF
  # Quote the wildcard so the shell does not glob-expand it
  nohup ipython notebook --no-browser --ip='*' --port=8123 > /var/log/python_notebook.log 2>&1 &
fi
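An initialization action like the one above is attached when the cluster is created. The sketch below only assembles a `gcloud dataproc clusters create` command line and does not execute it. The cluster name and the `gs://` path are hypothetical placeholders, and depending on the gcloud version further flags (for example a region or zone) may be required.

```python
import shlex


def build_create_command(cluster_name, init_action_uri, num_workers):
    """Return the argv for creating a cluster with one initialization action."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        "--num-workers", str(num_workers),
        # Cloud Storage URI of the script to run on each node at start-up
        "--initialization-actions", init_action_uri,
    ]


# Hypothetical names: replace with your own cluster name and bucket path.
cmd = build_create_command(
    cluster_name="demo-cluster",
    init_action_uri="gs://my-bucket/ipython-notebook.sh",
    num_workers=2,
)

# Print a shell-safe version of the command for review before running it.
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues once it has been reviewed.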
Off-the-Shelf Initialization Actions
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions
Pull Requests are Welcome!
Jupyter, Facebook Presto, Zeppelin, Kafka, ZooKeeper
Available Datastores
BigQuery, Bigtable, Cloud SQL, Datastore
Cloud Storage Nearline
GCS Connector Performance (I)
Recommendation Engine Use-Case (1 file, 500GB)
GCS Connector Performance (II)
Sessionization Use-Case (14,800 files, 1GB each)
GCS Connector Performance (III)
Document Clustering Use-Case (31,000 files, 250MB each)
Additional Integrations
Cloud Logging, Cloud Monitoring
Spark & BigQuery Integration Example
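import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable

// Assumes a spark-shell session on the cluster, where `sc` is predefined
// and the BigQuery connector is on the classpath.
val conf = sc.hadoopConfiguration
// Assumption: the cluster's project ID is available as fs.gs.project.id.
val projectId = conf.get("fs.gs.project.id")

val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema = "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"
// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)
// Use the systemBucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)
// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf,
  fullyQualifiedOutputTableId, outputTableSchema)
val fieldName = "word"
// Read the table as an RDD of (row key, JSON row) pairs.
val tableData = sc.newAPIHadoopRDD(conf,
  classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
tableData.cache()
tableData.count()
// Show the first 10 rows as strings.
tableData.map(entry => (entry._1.toString(), entry._2.toString())).take(10)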
03 Demo
Pricing Example
35-minute Spark job running on
14 × 16-core workers (224 cores)
[ Crunching 3 TB TeraSort ]
Pricing
Pricing Example
Function | Machine Type | # in Cluster | vCPUs | Instances Price | Dataproc Price
Master Node | n1-standard-4 | 1 | 4 | $0.2 | $0.04
Worker Nodes | n1-highmem-16 | 4 | 64 | $4.032 | $0.64
Worker Nodes (Preemptible) | n1-highmem-16 | 10 | 160 | $3.8 | $1.6
Cluster Total | n/a | 15 | 224 | $4.88
Pricing Details
Dataproc price per Compute Engine vCPU (any machine type): $0.01 per hour (USD)
35% to 300% less than AWS EMR
(c3.2xlarge | m2.4xlarge)
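The Dataproc Price column in the pricing example follows from the per-vCPU rate. The sketch below recomputes only that column from the vCPU counts in the table; it assumes the fee is simply vCPUs multiplied by $0.01 per hour and does not attempt to reproduce the instance prices or the cluster total.

```python
# Dataproc fee from the slide: $0.01 per Compute Engine vCPU per hour.
DATAPROC_FEE_PER_VCPU_HOUR = 0.01

# (function, total vCPUs) as listed in the pricing example table.
NODE_GROUPS = [
    ("Master Node", 4),
    ("Worker Nodes", 64),
    ("Worker Nodes (Preemptible)", 160),
]


def dataproc_fee_per_hour(vcpus):
    """Hourly Dataproc fee in USD for a group with the given vCPU count."""
    return round(vcpus * DATAPROC_FEE_PER_VCPU_HOUR, 2)


fees = {function: dataproc_fee_per_hour(vcpus) for function, vcpus in NODE_GROUPS}

for function, fee in fees.items():
    print(f"{function}: ${fee:.2f} per hour")
```

The results match the $0.04, $0.64 and $1.6 shown in the Dataproc Price column.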
04 Roadmap
Roadmap (Q1 2015)
More Pre-Installed Engines, Frameworks & Tools (via Initialization Scripts)
Mahout, Hue, Cloudera, MapR and others
Performance
Further improve performance on jobs running directly on Google
Cloud Storage. The ultimate goal is to make GCS the default storage
for Dataproc and provide 2x the performance of local HDFS (when not
using Local SSD).
More Native Datastores
Spanner, Google ML
06 Try Google Dataproc in 2015
AWS EMR Customer?
Get $1,000
To test Google Dataproc
Not an AWS EMR Customer?
Get $1,000*
To test Google Dataproc
* Agree to 1-hour meeting
@ Google Tel-Aviv
to discuss your Big Data needs
goo.gl/mFwCYa
promo code is “1K-Dataproc”
05 Q&A
goo.gl/mFwCYa

Software Quality Assurance Interview Questions
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 

Spark on Dataproc - Israel Spark Meetup at Taboola

  • 1. Vadim Solovey vadim@doit-intl.com Google Cloud Dataproc Spark and Hadoop with superfast start-up, easy management and billed by the minute.
  • 2. Copyright 2015 Google Inc <vadim@doit-intl.com> Google Developer Expert & Trainer CTO of DoIT International
  • 4. Google Cloud Dataproc is a fast, easy to use, low cost and fully-managed service that lets you run Spark and Hadoop on Google Cloud Platform. Cloud Dataproc
  • 5. Google Cloud Platform (Confidential & Proprietary): Management, Mobile, Services, Compute, Big Data, Storage, Developer Tools
  • 6. Dataproc 101
      Easy to Use. Easily create and scale clusters to run native:
        • Spark
        • PySpark
        • Spark SQL
        • MapReduce
        • Hive
        • Pig
        • More with Initialization Actions (IAs)
      Integrated. Integration with Cloud Platform provides immense scalability, ease of use, and multiple channels for cluster interaction and management.
      Low Cost. Low-cost data processing with:
        • Low and fixed price
        • Minute-by-minute billing
        • Fast cluster provisioning, execution, and removal
        • Ability to manually scale clusters based on needs
        • Preemptible instances
  • 7. Competitive Highlights (Cloud Dataproc vs. Amazon EMR)
      Cluster start time (elapsed time from cluster creation until it is ready)
        Cloud Dataproc: < 90 seconds
        Amazon EMR: ~360 seconds
        Customer impact: Faster data processing workflows because less time is spent waiting for clusters to provision and start executing applications.
      Billing unit of measure (increment used for billing the service when active)
        Cloud Dataproc: Minute
        Amazon EMR: Hourly
        Customer impact: Reduced costs for running Spark and Hadoop because you pay for what you actually use, not a cost which has been rounded up.
      Preemptible VMs (clusters can utilize preemptible VMs)
        Cloud Dataproc: Yes
        Amazon EMR: Kind of :-)
        Customer impact: Lower total operating costs for Spark and Hadoop processing by leveraging the cost benefits of preemptibles.
      Job output & cancellation (job output is easy to find and jobs are cancelable without SSH)
        Cloud Dataproc: Yes
        Amazon EMR: No
        Customer impact: Higher productivity because job output does not necessitate reviewing log files and canceling jobs does not require SSH.
  • 9. Packaging & Versioning
        • Spark 1.5.2 w/ PySpark & Spark SQL
        • Hadoop 2.7.1
        • Pig 0.15
        • Hive 1.2.1
        • YARN Resource Manager
        • Debian 8 based O/S
        • Google Connectors for Cloud Storage, BigQuery, BigTable, etc.
  • 10. Features
      Integrated: Integrated with Cloud Storage, Cloud Logging, BigQuery, and more.
      Anytime Scaling: Manually scale clusters up or down based on need, even when jobs are running.
      Tools: UI, API & CLI for rapid development, including Initialization Actions & Job Output Driver.
      Global Availability: Available in every Google Cloud zone in the United States, Europe, and Asia.
  • 11. Initialization Action Example

        # Only run on the master node
        ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
        if [[ "${ROLE}" == 'Master' ]]; then
          # -y added here: the script runs unattended, so apt-get cannot prompt
          apt-get install -y build-essential python-dev libpng-dev libfreetype6-dev \
            libxft-dev pkg-config python-matplotlib python-requests
          curl https://bootstrap.pypa.io/get-pip.py | python
          mkdir IPythonNB
          pip install "ipython[notebook]"
          ipython profile create default
          echo "c = get_config()" > /root/.ipython/profile_default/ipython_notebook_config.py
          echo "c.NotebookApp.ip = '*'" >> /root/.ipython/profile_default/ipython_notebook_config.py
          # Setup script for iPython Notebook so it uses the cluster's Spark
          cat > /root/.ipython/profile_default/startup/00-pyspark-setup.py <<'_EOF'
        import os
        import sys
        spark_home = '/usr/lib/spark/'
        os.environ["SPARK_HOME"] = spark_home
        sys.path.insert(0, os.path.join(spark_home, 'python'))
        sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
        execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
        _EOF
          nohup ipython notebook --no-browser --ip=* --port=8123 > /var/log/python_notebook.log &
        fi
  • 12. Off-the-Shelf Initialization Actions: Jupyter, Facebook Presto, Zeppelin, Kafka, Zookeeper. https://github.com/GoogleCloudPlatform/dataproc-initialization-actions (Pull Requests are Welcome!)
  • 13. Available Datastores: BigQuery, BigTable, CloudSQL, Datastore, Cloud Storage, Nearline
  • 14. GCS Connector Performance (I): Recommendation Engine Use-Case (1 file, 500GB)
  • 15. GCS Connector Performance (II): Sessionization Use-Case (14,800 files, 1GB each)
  • 16. GCS Connector Performance (III): Document Clustering Use-Case (31,000 files, 250MB each)
  • 17. Additional Integrations: Cloud Logging, Cloud Monitoring
  • 18. Spark & BigQuery Integration Example

        // Imports and the conf/projectId definitions are not on the slide; they are
        // assumed here so the snippet can be pasted into spark-shell (which provides sc).
        import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
        import com.google.gson.JsonObject
        import org.apache.hadoop.io.LongWritable

        val conf = sc.hadoopConfiguration
        val projectId = conf.get("fs.gs.project.id")

        val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
        val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
        val outputTableSchema = "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
        val jobName = "wordcount"

        // Set the job-level projectId.
        conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)

        // Use the systemBucket for temporary BigQuery export data used by the InputFormat.
        val systemBucket = conf.get("fs.gs.system.bucket")
        conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)

        // Configure input and output for BigQuery access.
        BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
        BigQueryConfiguration.configureBigQueryOutput(conf, fullyQualifiedOutputTableId, outputTableSchema)

        val fieldName = "word"
        val tableData = sc.newAPIHadoopRDD(conf, classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
        tableData.cache()
        tableData.count()
        tableData.map(entry => (entry._1.toString(), entry._2.toString())).take(10)
  • 20. Pricing Example: 35-minute Spark job running on 14x 16-core workers (224 cores), crunching a 3TB TeraSort
  • 21. Pricing

      Pricing Example
        Function                    Machine Type   # in Cluster  vCPUs  Instances Price  Dataproc Price
        Master Node                 n1-standard-4  1             4      $0.2             $0.04
        Worker Nodes                n1-highmem-16  4             64     $4.032           $0.64
        Worker Nodes (Preemptible)  n1-highmem-16  10            160    $3.8             $1.6
        Cluster Total: n/a, 15 in cluster, 224 vCPUs, $4.88

      Pricing Details
        Dataproc per-hour price (USD), per Compute Engine vCPU (any machine type): $0.01
        35% to 300% less than AWS EMR (c3.2xlarge | m2.4xlarge)
  • 23. Roadmap (Q1 2015)
      More Pre-Installed Engines, Frameworks & Tools (via Initialization Scripts): Mahout, Hue, Cloudera, MapR and others.
      Performance: Further improve performance on jobs running directly on Google Cloud Storage. The ultimate goal is to make GCS the default storage for Dataproc and provide 2x the performance of local HDFS (when not using LocalSSD).
      More Native Datastores: Spanner, Google ML.
  • 24. 06 Try Google Dataproc in 2015
  • 25. AWS EMR Customer? Get $1,000 to test Google Dataproc
  • 26. Not an AWS EMR Customer? Get $1,000* to test Google Dataproc
  • 27. * Agree to a 1-hour meeting @ Google Tel-Aviv to discuss your Big Data needs
  • 28. goo.gl/mFwCYa promo code is “1K-Dataproc”
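
As a rough check on the pricing slide (21), the Dataproc fee can be worked out from the stated rate of $0.01 per vCPU per hour and the per-minute billing described on slide 7. The sketch below is illustrative only: the function names are hypothetical, the 228 vCPUs are the sum of the per-row counts on the slide (4 + 64 + 160), and `hourly_rounded_fee` simply rounds the same rate up to whole hours for comparison. It is not a model of Amazon EMR pricing, and it ignores the Compute Engine instance charges.

```shell
#!/usr/bin/env bash
# Dataproc fee sketch. The rate and cluster shape are taken from the slides;
# everything else (function names, the hourly comparison) is an assumption.

RATE_PER_VCPU_HOUR=0.01   # USD per vCPU per hour, from "Pricing Details"

# dataproc_fee VCPUS MINUTES -> Dataproc fee in USD, billed by the minute
dataproc_fee() {
  awk -v v="$1" -v m="$2" -v r="$RATE_PER_VCPU_HOUR" \
    'BEGIN { printf "%.2f\n", v * r * m / 60 }'
}

# hourly_rounded_fee VCPUS MINUTES -> same rate, rounded up to whole hours
hourly_rounded_fee() {
  awk -v v="$1" -v m="$2" -v r="$RATE_PER_VCPU_HOUR" \
    'BEGIN { h = int((m + 59) / 60); printf "%.2f\n", v * r * h }'
}

dataproc_fee 228 60        # full hour: matches $0.04 + $0.64 + $1.6
dataproc_fee 228 35        # the 35-minute job from slide 20
hourly_rounded_fee 228 35  # same job if billing were rounded up to the hour
```

Under these assumptions the three calls print 2.28, 1.33 and 2.28, so for the 35-minute job per-minute billing brings the Dataproc fee from $2.28 down to $1.33.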