Cloud Dataproc

Google Cloud Dataproc is a fast, easy-to-use, low-cost, fully managed service that lets you run Spark and Hadoop on Google Cloud Platform.
Dataproc 101

Easy to Use
Easily create and scale clusters to run native:
• Spark
• PySpark
• Spark SQL
• MapReduce
• Hive
• Pig
• More via Initialization Actions

Integrated
Integration with Cloud Platform provides immense scalability, ease of use, and multiple channels for cluster interaction and management.

Low Cost
Low-cost data processing with:
• Low and fixed pricing
• Minute-by-minute billing
• Fast cluster provisioning, execution, and removal
• Ability to manually scale clusters based on need (see the sketch after this list)
• Preemptible instances
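
Creating, scaling, and deleting a cluster from the CLI is a one-liner each. A minimal sketch; the cluster name, zone, and worker counts are placeholder assumptions, and flag spellings may vary by gcloud release:

# Create a small cluster that mixes regular and preemptible workers.
gcloud dataproc clusters create demo-cluster \
    --zone us-central1-a \
    --num-workers 2 \
    --num-preemptible-workers 2

# Manually scale the cluster up when more capacity is needed.
gcloud dataproc clusters update demo-cluster --num-workers 10

# Delete the cluster when finished; minute-by-minute billing stops here.
gcloud dataproc clusters delete demo-cluster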
Product Characteristics: Competitive Highlights

Cluster start time (elapsed time from cluster creation until it is ready)
• Cloud Dataproc: < 90 seconds
• Amazon EMR: ~360 seconds
• Customer impact: faster data processing workflows, because less time is spent waiting for clusters to provision and start executing applications.

Billing unit of measure (increment used to bill the service when active)
• Cloud Dataproc: per minute
• Amazon EMR: per hour
• Customer impact: reduced costs for running Spark and Hadoop, because you pay for what you actually use rather than a cost rounded up to the hour.

Preemptible VMs (whether clusters can utilize preemptible VMs)
• Cloud Dataproc: yes
• Amazon EMR: kind of :-)
• Customer impact: lower total operating costs for Spark and Hadoop processing by leveraging the cost benefits of preemptibles.

Job output & cancellation (whether job output is easy to find and jobs are cancelable without SSH)
• Cloud Dataproc: yes
• Amazon EMR: no
• Customer impact: higher productivity, because finding job output does not require digging through log files and canceling jobs does not require SSH (see the sketch below).
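
For example, submitting a job and later canceling it, without ever SSHing into the cluster, might look like this. A minimal sketch; the cluster name, script path, and job ID are placeholder assumptions:

# Submit a PySpark job; driver output streams straight back to the terminal.
gcloud dataproc jobs submit pyspark gs://my-bucket/wordcount.py \
    --cluster demo-cluster

# Look up the job ID, then cancel the job, all without SSH.
gcloud dataproc jobs list --cluster demo-cluster
gcloud dataproc jobs kill <job-id>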
Features

Integrated
Integrated with Cloud Storage, Cloud Logging, BigQuery, and more.

Anytime Scaling
Manually scale clusters up or down based on need, even when jobs are running.

Tools
UI, API & CLI for rapid development, including Initialization Actions & the Job Output Driver (see the sketch after this list).

Global Availability
Available in every Google Cloud zone in the United States, Europe, and Asia.
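
The Tools item above spans more than the gcloud CLI: the same clusters can be driven over the REST API. A minimal sketch of listing clusters over HTTP; the v1 endpoint shape and the global region are assumptions, since the API surface has changed across releases:

# List Dataproc clusters via the REST API, authenticating with a gcloud token.
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://dataproc.googleapis.com/v1/projects/<your-project-id>/regions/global/clusters"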
Initialization Action Example

#!/bin/bash
# Only run on the master node.
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
# Install build tools and the Python libraries the notebook depends on;
# -y keeps apt-get non-interactive so the init action does not hang.
apt-get install -y build-essential python-dev libpng-dev libfreetype6-dev libxft-dev pkg-config python-matplotlib python-requests
curl https://bootstrap.pypa.io/get-pip.py | python
# Directory the notebook server will run from.
mkdir IPythonNB
cd IPythonNB
pip install "ipython[notebook]"
ipython profile create default
# Listen on all interfaces so the notebook is reachable from outside the VM.
echo "c = get_config()" > /root/.ipython/profile_default/ipython_notebook_config.py
echo "c.NotebookApp.ip = '*'" >> /root/.ipython/profile_default/ipython_notebook_config.py
# Startup script so the IPython Notebook uses the cluster's Spark.
cat > /root/.ipython/profile_default/startup/00-pyspark-setup.py <<'_EOF'
import os
import sys
spark_home = '/usr/lib/spark/'
os.environ["SPARK_HOME"] = spark_home
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
_EOF
# Quote the wildcard so the shell does not glob-expand it.
nohup ipython notebook --no-browser --ip='*' --port=8123 > /var/log/python_notebook.log 2>&1 &
fi
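
To attach an initialization action like this to a new cluster, stage the script in Cloud Storage and reference it at creation time. A minimal sketch; the bucket and file names are placeholder assumptions:

# Upload the script, then point cluster creation at it.
gsutil cp ipython-init.sh gs://my-bucket/ipython-init.sh
gcloud dataproc clusters create demo-cluster \
    --initialization-actions gs://my-bucket/ipython-init.sh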
Spark & BigQuery Integration Example
// Assumes a spark-shell session on the cluster, where sc is predefined,
// and the BigQuery connector is on the classpath (it ships with Dataproc).
import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable

val projectId = "<your-project-id>"
val conf = sc.hadoopConfiguration

val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema = "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"

// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)

// Use the systemBucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)

// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf,
  fullyQualifiedOutputTableId, outputTableSchema)

// Column to count (used by the full word-count job; unused in this excerpt).
val fieldName = "word"

// Read the BigQuery table as an RDD of (row number, JSON record) pairs.
val tableData = sc.newAPIHadoopRDD(conf,
  classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
tableData.cache()
tableData.count()
tableData.map(entry => (entry._1.toString(), entry._2.toString())).take(10)
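
One way to run this snippet is interactively: SSH to the cluster's master node (Dataproc names it <cluster-name>-m) and paste the code into spark-shell. A minimal sketch, assuming the demo-cluster name used earlier:

# Open a shell on the cluster's master node.
gcloud compute ssh demo-cluster-m --zone us-central1-a
# Then, on the master, start an interactive Spark shell and paste the snippet.
spark-shell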
Roadmap (Q1 2015)

More Pre-Installed Engines, Frameworks & Tools (via Initialization Scripts)
Mahout, Hue, Cloudera, MapR, and others

Performance
Further improve performance for jobs running directly against Google Cloud Storage. The ultimate goal is to make GCS the default storage for Dataproc and to provide 2x the performance of local HDFS (when not using Local SSD).

More Native Datastores
Spanner, Google ML