Cloud Dataproc

Google Cloud Dataproc is a fast, easy-to-use, low-cost, fully managed service that lets you run Spark and Hadoop on Google Cloud Platform.
Dataproc 101

Easy to Use
Easily create and scale clusters to run native:
• Spark
• PySpark
• Spark SQL
• MapReduce
• Hive
• Pig
• More via Initialization Actions

Integrated
Integration with Cloud Platform provides immense scalability, ease of use, and multiple channels for cluster interaction and management.

Low Cost
Low-cost data processing with:
• Low and fixed pricing
• Minute-by-minute billing
• Fast cluster provisioning, execution, and removal
• Ability to manually scale clusters based on need (see the sketch after this list)
• Preemptible instances
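
Creating, scaling, and deleting a cluster from the CLI is a one-liner each. A minimal sketch; the cluster name, zone, and worker counts are placeholder assumptions, and flag spellings may vary by gcloud release:

# Create a small cluster that mixes regular and preemptible workers.
gcloud dataproc clusters create demo-cluster \
    --zone us-central1-a \
    --num-workers 2 \
    --num-preemptible-workers 2

# Manually scale the cluster up when more capacity is needed.
gcloud dataproc clusters update demo-cluster --num-workers 10

# Delete the cluster when finished; minute-by-minute billing stops here.
gcloud dataproc clusters delete demo-cluster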
Product Characteristics: Competitive Highlights

Cluster start time (elapsed time from cluster creation until it is ready)
• Cloud Dataproc: < 90 seconds
• Amazon EMR: ~360 seconds
• Customer impact: faster data processing workflows, because less time is spent waiting for clusters to provision and start executing applications.

Billing unit of measure (increment used to bill the service when active)
• Cloud Dataproc: per minute
• Amazon EMR: per hour
• Customer impact: reduced costs for running Spark and Hadoop, because you pay for what you actually use rather than a cost rounded up to the hour.

Preemptible VMs (whether clusters can utilize preemptible VMs)
• Cloud Dataproc: yes
• Amazon EMR: kind of :-)
• Customer impact: lower total operating costs for Spark and Hadoop processing by leveraging the cost benefits of preemptibles.

Job output & cancellation (whether job output is easy to find and jobs are cancelable without SSH)
• Cloud Dataproc: yes
• Amazon EMR: no
• Customer impact: higher productivity, because finding job output does not require digging through log files and canceling jobs does not require SSH (see the sketch below).
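
For example, submitting a job and later canceling it, without ever SSHing into the cluster, might look like this. A minimal sketch; the cluster name, script path, and job ID are placeholder assumptions:

# Submit a PySpark job; driver output streams straight back to the terminal.
gcloud dataproc jobs submit pyspark gs://my-bucket/wordcount.py \
    --cluster demo-cluster

# Look up the job ID, then cancel the job, all without SSH.
gcloud dataproc jobs list --cluster demo-cluster
gcloud dataproc jobs kill <job-id>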
Features

Integrated
Integrated with Cloud Storage, Cloud Logging, BigQuery, and more.

Anytime Scaling
Manually scale clusters up or down based on need, even when jobs are running.

Tools
UI, API & CLI for rapid development, including Initialization Actions & the Job Output Driver (see the sketch after this list).

Global Availability
Available in every Google Cloud zone in the United States, Europe, and Asia.
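
The Tools item above spans more than the gcloud CLI: the same clusters can be driven over the REST API. A minimal sketch of listing clusters over HTTP; the v1 endpoint shape and the global region are assumptions, since the API surface has changed across releases:

# List Dataproc clusters via the REST API, authenticating with a gcloud token.
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://dataproc.googleapis.com/v1/projects/<your-project-id>/regions/global/clusters"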
Initialization Action Example

#!/bin/bash
# Only run on the master node.
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
# Install build tools and the Python libraries the notebook depends on;
# -y keeps apt-get non-interactive so the init action does not hang.
apt-get install -y build-essential python-dev libpng-dev libfreetype6-dev libxft-dev pkg-config python-matplotlib python-requests
curl https://bootstrap.pypa.io/get-pip.py | python
# Directory the notebook server will run from.
mkdir IPythonNB
cd IPythonNB
pip install "ipython[notebook]"
ipython profile create default
# Listen on all interfaces so the notebook is reachable from outside the VM.
echo "c = get_config()" > /root/.ipython/profile_default/ipython_notebook_config.py
echo "c.NotebookApp.ip = '*'" >> /root/.ipython/profile_default/ipython_notebook_config.py
# Startup script so the IPython Notebook uses the cluster's Spark.
cat > /root/.ipython/profile_default/startup/00-pyspark-setup.py <<'_EOF'
import os
import sys
spark_home = '/usr/lib/spark/'
os.environ["SPARK_HOME"] = spark_home
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
_EOF
# Quote the wildcard so the shell does not glob-expand it.
nohup ipython notebook --no-browser --ip='*' --port=8123 > /var/log/python_notebook.log 2>&1 &
fi
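
To attach an initialization action like this to a new cluster, stage the script in Cloud Storage and reference it at creation time. A minimal sketch; the bucket and file names are placeholder assumptions:

# Upload the script, then point cluster creation at it.
gsutil cp ipython-init.sh gs://my-bucket/ipython-init.sh
gcloud dataproc clusters create demo-cluster \
    --initialization-actions gs://my-bucket/ipython-init.sh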
Spark & BigQuery Integration Example
// Assumes a spark-shell session on the cluster, where sc is predefined,
// and the BigQuery connector is on the classpath (it ships with Dataproc).
import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable

val projectId = "<your-project-id>"
val conf = sc.hadoopConfiguration

val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema = "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"

// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)

// Use the systemBucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)

// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf,
  fullyQualifiedOutputTableId, outputTableSchema)

// Column to count (used by the full word-count job; unused in this excerpt).
val fieldName = "word"

// Read the BigQuery table as an RDD of (row number, JSON record) pairs.
val tableData = sc.newAPIHadoopRDD(conf,
  classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
tableData.cache()
tableData.count()
tableData.map(entry => (entry._1.toString(), entry._2.toString())).take(10)
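
One way to run this snippet is interactively: SSH to the cluster's master node (Dataproc names it <cluster-name>-m) and paste the code into spark-shell. A minimal sketch, assuming the demo-cluster name used earlier:

# Open a shell on the cluster's master node.
gcloud compute ssh demo-cluster-m --zone us-central1-a
# Then, on the master, start an interactive Spark shell and paste the snippet.
spark-shell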
Roadmap (Q1 2015)

More Pre-Installed Engines, Frameworks & Tools (via Initialization Scripts)
Mahout, Hue, Cloudera, MapR, and others

Performance
Further improve performance for jobs running directly against Google Cloud Storage. The ultimate goal is to make GCS the default storage for Dataproc and to provide 2x the performance of local HDFS (when not using Local SSD).

More Native Datastores
Spanner, Google ML