Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop

2,717 views


At Google Cloud Platform, we're combining the Apache Spark and Hadoop ecosystem with our software and hardware innovations. We want to make these awesome tools easier, faster, and more cost-effective, from 3 to 30,000 cores. This presentation will showcase how Google Cloud Platform is innovating with the goal of bringing the Hadoop ecosystem to everyone.

Bio: "I love data because it surrounds us - everything is data. I also love open source software, because it shows what is possible when people come together to solve common problems with technology. While they are awesome on their own, I am passionate about combining the power of open source software with the potentially unlimited uses of data. That's why I joined Google. I am a product manager for Google Cloud Platform and manage Cloud Dataproc and Apache Beam (incubating). I've previously spent time hanging out at Disney and Amazon. Beyond Google, I love data, amateur radio, Disneyland, photography, running and Legos."

Published in: Technology

  1. James Malone, Product Manager. More data. Zero headaches. Making the Spark and Hadoop ecosystem fast, easy, and cost-effective.
  2. Google Cloud Platform: Cloud Dataproc features and benefits
  3. Apache Spark and Apache Hadoop should be fast, easy, and cost-effective.
  4. Easy, fast, cost-effective. Fast: things take seconds to minutes, not hours or weeks. Easy: be an expert with your data, not your data infrastructure. Cost-effective: pay for exactly what you use.
  5. Running Hadoop on Google Cloud. [Comparison diagram] Four deployment options: On Premise, Vendor Hadoop, bdutil (Free OSS Toolkit), and Dataproc (Managed Hadoop). Each option stacks the same layers, marked as either Google Managed or Customer Managed: Custom Code, Monitoring/Health, Dev Integration, Scaling (Manual Scaling in one of the columns), Job Submission, GCP Connectivity, Deployment, Creation.
  6. Cloud Dataproc - integrated. Cloud Dataproc is natively integrated with several Google Cloud Platform products as part of an integrated data platform: Storage, Operations, Data.
  7. Where Cloud Dataproc fits into GCP: Google Bigtable (HBase), Google BigQuery (Analytics, Data warehouse), Stackdriver Logging (Logging Ops.), Google Cloud Dataflow (Batch/Stream Processing), Google Cloud Storage (HCFS/HDFS), Stackdriver Monitoring (Monitoring).
  8. Most time can be spent with data, not tooling. More time can be dedicated to examining data for actionable insights. Less time is spent with clusters, since creating, resizing, and destroying clusters is easily done. [Chart labels: Hands-on with data; Cloud Dataproc setup and customization]
  9. Lift and shift workloads to Cloud Dataproc. (1) Copy data to GCS: copy your data to Google Cloud Storage (GCS) by installing the connector or by copying manually. (2) Update file prefix: update the file location prefix in your scripts from hdfs:// to gs:// to access your data in GCS. (3) Use Cloud Dataproc: create a Cloud Dataproc cluster and run your job on the cluster against the data you copied to GCS. Done.
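
The "update file prefix" step on this slide can be sketched as a small text rewrite. This is only an illustration, assuming script paths look like hdfs://host:port/path or hdfs:///path; the function name `to_gcs` and the bucket name are hypothetical and not part of the talk.

```python
import re

# Assumption: HDFS paths appear as hdfs://<authority>/<path> (authority may be empty).
_HDFS_PREFIX = re.compile(r"hdfs://[^/\s'\"]*/")


def to_gcs(text: str, bucket: str) -> str:
    """Replace each hdfs://<authority>/ prefix with gs://<bucket>/."""
    return _HDFS_PREFIX.sub(f"gs://{bucket}/", text)


# Hypothetical script line and bucket name, for illustration only.
script = "LOAD DATA INPATH 'hdfs://namenode:8020/data/trips/' INTO TABLE trips;"
print(to_gcs(script, "my-bucket"))
```

A rewrite like this only changes where scripts look for data; the data itself still has to be copied to the bucket first, as in step 1.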
  10. How does Google Cloud Dataproc help me?
  11. Traditional Spark and Hadoop clusters
  12. Google Cloud Dataproc
  13. Cloud example - slow vs. fast. Cloud Dataproc: things take seconds to minutes, not hours or weeks. Traditional clusters: scaling can take hours, days, or weeks to perform. [Charts: capacity needed and capacity used over time t, annotated with the time needed to obtain new capacity]
  14. Cloud example - hard vs. easy. Cloud Dataproc: be an expert with your data, not your data infrastructure. Traditional clusters: need experts to optimize utilization and deployment. [Charts: cluster utilization over time t, with a "Cluster Inactive" period; cluster 1 and cluster 2]
  15. Cloud example - costly vs. cost-effective. Cloud Dataproc: pay for exactly what you use. Traditional clusters: you (probably) pay for more capacity than actually used. [Charts: cost over time for each]
  16. Google Cloud Dataproc - under the hood. [Diagram: Google Cloud Services, Dataproc Cluster] Cloud Dataproc uses GCP: Compute Engine, Cloud Storage, and Stackdriver tools.
  17. Google Cloud Dataproc - under the hood. [Diagram: Cloud Dataproc Agent, Google Cloud Services, Dataproc Cluster] Cloud Dataproc clusters have an agent to manage the Cloud Dataproc cluster. Dataproc uses Compute Engine, Cloud Storage, and Cloud Ops tools.
  18. Google Cloud Dataproc - under the hood. [Diagram: Spark & Hadoop OSS, Cloud Dataproc Agent, Google Cloud Services, Dataproc Cluster] Spark, Hadoop, Hive, Pig, and other OSS components execute on the cluster. Cloud Dataproc clusters have an agent to manage the Cloud Dataproc cluster. Dataproc uses Compute Engine, Cloud Storage, and Cloud Ops tools.
  19. Google Cloud Dataproc - under the hood. [Diagram: Dataproc Jobs (Spark, PySpark, Spark SQL, MapReduce, Pig, Hive), Spark & Hadoop OSS, Cloud Dataproc Agent, Google Cloud Services, Dataproc Cluster]
  20. Google Cloud Dataproc - under the hood. [Diagram: Applications on the cluster, Dataproc Jobs (Spark, PySpark, Spark SQL, MapReduce, Pig, Hive), GCP Products, Dataproc Cluster, Spark & Hadoop OSS, Cloud Dataproc Agent, Google Cloud Services; Features, Data, Outputs]
  21. How can I use Cloud Dataproc?
  22. Google Developers Console: https://console.developers.google.com/
  23. Google Cloud SDK: https://cloud.google.com/sdk/
  24. Cloud Dataproc REST API: https://cloud.google.com/dataproc/reference/rest/
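
As a rough sketch of what a call to the REST API might look like, the snippet below only builds the URL and JSON body for a cluster-creation request and does not send anything. The project, region, cluster name, and worker count are hypothetical; the field names follow the v1 API as best understood and should be checked against the reference linked on the slide. A real request would also need an OAuth 2.0 bearer token.

```python
import json

API_ROOT = "https://dataproc.googleapis.com/v1"


def create_cluster_request(project_id, region, cluster_name, num_workers):
    """Return (url, json_body) for a POST that would create a cluster."""
    url = f"{API_ROOT}/projects/{project_id}/regions/{region}/clusters"
    body = {
        "projectId": project_id,
        "clusterName": cluster_name,
        # Minimal config: only the worker count is set; other fields use defaults.
        "config": {"workerConfig": {"numInstances": num_workers}},
    }
    return url, json.dumps(body)


# Hypothetical values, for illustration only.
url, body = create_cluster_request("my-project", "us-central1", "demo-cluster", 2)
print("POST", url)
print(body)
```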
  25. Let’s see an example - Cloud Dataproc demo
  26. Google Cloud Dataproc - demo overview. In this demo we are going to do a few things: create a cluster; query a large set of data stored in Google Cloud Storage; review the output of the queries; delete the cluster.
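
The demo steps above (create, query, delete) could be scripted with the Google Cloud SDK. The sketch below only assembles the `gcloud dataproc` command lines and prints them; it is not the presenter's demo script. The cluster name, region, and worker count are hypothetical, and flag availability may vary by SDK version.

```python
import shlex


def demo_commands(cluster, region, num_workers, query):
    """Build gcloud argv lists for: create cluster, run a Hive query, delete cluster."""
    create = ["gcloud", "dataproc", "clusters", "create", cluster,
              f"--region={region}", f"--num-workers={num_workers}"]
    submit = ["gcloud", "dataproc", "jobs", "submit", "hive",
              f"--cluster={cluster}", f"--region={region}", f"--execute={query}"]
    # --quiet skips the interactive confirmation prompt on delete.
    delete = ["gcloud", "dataproc", "clusters", "delete", cluster,
              f"--region={region}", "--quiet"]
    return [create, submit, delete]


# Hypothetical names; the query is the first one from the demo slides.
commands = demo_commands("demo-cluster", "us-central1", 2,
                         "SELECT cab_type, count(*) FROM trips GROUP BY cab_type;")
for argv in commands:
    print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` once the SDK is installed and authenticated.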
  27. What just happened? YARN Cores: 1,600. YARN RAM: 4.7 TB. Spark & Hadoop: 100%. Click: 1.
  28. NYC taxi data. The New York City Taxi & Limousine Commission and Uber released a dataset of trips from 2009-2015. The original dataset is in CSV format and contains over 20 columns of data and about 1.2 billion trips. The dataset is about 270 gigabytes.
  29. CREATE EXTERNAL TABLE trips (
        trip_id INT,
        vendor_id STRING,
        pickup_datetime TIMESTAMP,
        dropoff_datetime TIMESTAMP,
        store_and_fwd_flag STRING,
        ...(44 other columns)...,
        dropoff_puma STRING)
      STORED AS orc
      LOCATION 'gs://taxi-nyc-demo/trips/'
      TBLPROPERTIES (
        "orc.compress"="SNAPPY",
        "orc.stripe.size"="536870912",
        "orc.row.index.stride"="50000");
  30. SELECT cab_type, count(*)
      FROM trips
      GROUP BY cab_type;
      SELECT passenger_count, avg(total_amount)
      FROM trips
      GROUP BY passenger_count;
      SELECT passenger_count, year(pickup_datetime), count(*)
      FROM trips
      GROUP BY passenger_count, year(pickup_datetime);
      SELECT passenger_count, year(pickup_datetime) trip_year, round(trip_distance), count(*) trips
      FROM trips
      GROUP BY passenger_count, year(pickup_datetime), round(trip_distance)
      ORDER BY trip_year, trips DESC;
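
To make the first two queries concrete, here is a plain-Python sketch of what a GROUP BY with count(*) and avg() computes. The three rows are made-up sample values, not taken from the taxi dataset, and the helper names are hypothetical; on the cluster this work is done by Hive, not by code like this.

```python
from collections import defaultdict

# Made-up sample rows, for illustration only.
rows = [
    {"cab_type": "yellow", "passenger_count": 1, "total_amount": 10.0},
    {"cab_type": "green", "passenger_count": 2, "total_amount": 20.0},
    {"cab_type": "yellow", "passenger_count": 1, "total_amount": 14.0},
]


def count_by(rows, key):
    """Equivalent of: SELECT key, count(*) ... GROUP BY key."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return dict(counts)


def avg_by(rows, key, value):
    """Equivalent of: SELECT key, avg(value) ... GROUP BY key."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for row in rows:
        totals[row[key]][0] += row[value]
        totals[row[key]][1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


print(count_by(rows, "cab_type"))
print(avg_by(rows, "passenger_count", "total_amount"))
```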
  31. Demo recap. Dataset: 270 GB. Trips: 1.2 B. Queries: 4. Apache ecosystem: 100%.
  32. $12.85 (vs $77.58, $41.54)
  33. If you’re processing data, you may also want to consider...
  34. Google Cloud Dataflow & Apache Beam. The Cloud Dataflow SDK, based on Apache Beam, is a collection of SDKs for building streaming data processing pipelines. Cloud Dataflow is a fully managed (no-ops) and integrated service for executing optimized parallelized data processing pipelines.
  35. [Diagram] MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Cloud Dataflow, Cloud Dataproc, Apache Beam
  36. Joining several threads into Beam. [Diagram] MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Cloud Dataflow, Cloud Dataproc, Apache Beam
  37. Google BigQuery. Fully-managed Analytics Data Warehouse. Virtually unlimited resources, but you only pay for what you use. Highly Available, Encrypted, Durable.
  38. Google Cloud Bigtable. Google Cloud Bigtable offers companies a fast, fully managed, infinitely scalable NoSQL database service with an HBase-compliant API included. Unlike comparable market offerings, Bigtable is the only fully-managed database where organizations don’t have to sacrifice speed, scale or cost-efficiency when they build applications. Google Cloud Bigtable has been battle-tested at Google for 10 years as the database driving all major applications including Google Analytics, Gmail and YouTube.
  39. Wrapping things up
  40. Cloud Dataproc - get started today. (1) Create a Google Cloud project. (2) Open Developers Console. (3) Visit Dataproc section. (4) Create cluster in 1 click, 90 sec.
  41. If you only remember 3 things... Cloud Dataproc is easy: Cloud Dataproc offers a number of tools to easily interact with clusters and jobs so you can be hands-on with your data. Cloud Dataproc is fast: Cloud Dataproc clusters start in under 90 seconds on average so you spend less time and money waiting for your clusters. Cloud Dataproc is cost-effective: Cloud Dataproc is easy on the pocketbook, with low pricing of just 1c per vCPU per hour and minute-by-minute billing.
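
The pricing claim on this slide (1c per vCPU per hour, billed by the minute) can be turned into a quick estimate. This sketch covers only the Cloud Dataproc fee and assumes runtime is rounded up to whole minutes with no minimum charge; Compute Engine, storage, and any other charges are excluded, and the function name is hypothetical.

```python
import math

RATE_CENTS_PER_VCPU_HOUR = 1  # from the slide: 1c per vCPU per hour


def dataproc_fee_usd(vcpus, runtime_seconds):
    """Estimate the Dataproc fee in dollars under per-minute billing."""
    # Assumption: partial minutes are rounded up to the next whole minute.
    billed_minutes = math.ceil(runtime_seconds / 60)
    cents = vcpus * RATE_CENTS_PER_VCPU_HOUR * billed_minutes / 60
    return cents / 100


# Hypothetical cluster: 100 vCPUs running for 90 minutes.
print(dataproc_fee_usd(100, 90 * 60))
```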
  42. Thank You
