Big data in the cloud


Ben Sullins

Comparison of Google BigQuery and Amazon Redshift


2TB XL Node
High Storage Extra Large (XL) DW Node:
CPU: 2 virtual cores - Intel Xeon E5
Memory: 15 GiB
Storage: 3 HDD with 2TB of local attached storage
Network: Moderate
Disk I/O: Moderate
API: dw.hs1.xlarge

16TB 8XL Node
High Storage Eight Extra Large (8XL) DW Node:
CPU: 16 virtual cores - Intel Xeon E5
Memory: 120 GiB
Storage: 24 HDD with 16TB of local attached storage
Network: 10 Gigabit Ethernet with support for cluster placement groups
Disk I/O: Very High
API: dw.hs1.8xlarge

On-Demand Pricing

DW Node Class (On-Demand)             Hourly
XL Node - 2TB storage (Per Node)      $0.850 per Hour
8XL Node - 16TB storage (Per Node)    $6.800 per Hour

Reserved Instance 1yr (41% savings)

DW Node Class (Reserved)              Up front    Hourly
XL Node - 2TB storage (Per Node)      $2,500      $0.215 per Hour
8XL Node - 16TB storage (Per Node)    $20,000     $1.720 per Hour

Reserved Instance 3yr (73% savings)

DW Node Class (Reserved)              Up front    Hourly
XL Node - 2TB storage (Per Node)      $3,000      $0.114 per Hour
8XL Node - 16TB storage (Per Node)    $24,000     $0.912 per Hour

Web Interface
Fully Managed

Automated Backups
Fault Tolerant

AES-256 bit Encryption

Amazon VPC Firewall

“Dremel can scan 35 billion rows without an index in tens of seconds”
– Solutions Architect, Google Cloud Solutions Team

On-Demand Pricing

Resource               Pricing
Storage                $80 (per TB/month)
Interactive Queries    $35 (per TB processed)
Batch Queries          $20 (per TB processed)

Packaged Pricing

Data        Cost
100 TB      $3,300 per month ($33 per TB)
400 TB      $12,000 per month ($30 per TB)
1,500 TB    $40,500 per month ($27 per TB)
4,000 TB    $100,000 per month ($25 per TB)

• Packages are billed in full at the end of each month, whether the package is used or not.
• If you use more data than the amount in your chosen package, on-demand rates apply for any additional data.

Cloud Big Data Sources Comparison

Amazon Redshift                        Google BigQuery
Columnar + MPP                         Columnar + Tree
Petabytes in Scale                     Infinite Scalability
Easy management interface              No Management Required
Straightforward billing ($1K/TB/Yr)    Confusing Pricing Model
Great connectivity w/ BI Tools         Fair Connectivity w/ BI Tools



1. Big Data in the Cloud @bensullins

3. Columnar DB MPP Architecture Speed!


9. BigQuery


11. Columnar DB Tree Architecture Speed!


Editor's Notes

Optimized for Data Warehousing – Amazon Redshift uses a variety of innovations to obtain very high query performance on datasets ranging in size from hundreds of gigabytes to a petabyte or more. It uses columnar storage, data compression, and zone maps to reduce the amount of IO needed to perform queries. Amazon Redshift has a massively parallel processing (MPP) architecture, parallelizing and distributing SQL operations to take advantage of all available resources. The underlying hardware is designed for high performance data processing, using local attached storage to maximize throughput between the Intel Xeon E5 processor and drives, and a 10GigE mesh network to maximize throughput between nodes.
Scalable – With a few clicks of the AWS Management Console or a simple API call, you can easily scale the number of nodes in your data warehouse up or down as your performance or capacity needs change. Amazon Redshift enables you to start with as little as a single 2TB XL node and scale up all the way to a hundred 16TB 8XL nodes for 1.6PB of compressed user data. Amazon Redshift will place your existing cluster into read-only mode, provision a new cluster of your chosen size, and then copy data from your old cluster to your new one in parallel. You can continue running queries against your old cluster while the new one is being provisioned. Once your data has been copied to your new cluster, Amazon Redshift will automatically redirect queries to your new cluster and remove the old cluster.
No Up-Front Costs – You pay only for the resources you provision. You can choose On-Demand pricing with no up-front costs or long-term commitments, or obtain significantly discounted rates with Reserved Instance pricing. On-Demand pricing starts at just $0.85 per hour for a single node 2TB data warehouse, scaling linearly with cluster size. With Reserved Instance pricing, you can lower your effective price to $0.228 per hour for a single 2TB node, or under $1,000 per TB per year. To see more details, visit the Amazon Redshift Pricing page.
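The effective rates and savings percentages quoted on the pricing slide can be reproduced with a little arithmetic. The sketch below is only an illustration: it assumes the up-front fee is spread evenly over the term and uses 8,760 hours per year, and the function names are hypothetical rather than part of any AWS tool.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours; leap years ignored (assumption)


def effective_hourly(upfront, hourly, years):
    """Spread the up-front fee over the whole term and add the hourly rate."""
    return upfront / (years * HOURS_PER_YEAR) + hourly


def savings_vs_on_demand(effective, on_demand):
    """Fraction saved relative to the on-demand hourly rate."""
    return 1 - effective / on_demand


ON_DEMAND_XL = 0.85  # $ per hour for one XL node, from the slide

one_year = effective_hourly(2500, 0.215, years=1)
three_year = effective_hourly(3000, 0.114, years=3)

# An XL node holds 2 TB, so divide the yearly cost by 2 for a per-TB figure.
cost_per_tb_year = three_year * HOURS_PER_YEAR / 2

print(f"1yr effective: ${one_year:.3f}/h, "
      f"savings {savings_vs_on_demand(one_year, ON_DEMAND_XL):.0%}")
print(f"3yr effective: ${three_year:.3f}/h, "
      f"savings {savings_vs_on_demand(three_year, ON_DEMAND_XL):.0%}")
print(f"3yr cost per TB per year: ${cost_per_tb_year:,.2f}")
```

Under these assumptions the three-year figure comes out at about $0.228 per hour and just under $1,000 per TB per year, which agrees with the numbers in the note above.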
Get Started in Minutes – With a few clicks in the AWS Management Console or simple API calls, you can create a cluster, specifying its size, underlying node type, and security profile. Amazon Redshift will provision your nodes, configure the connections between them, and secure the cluster. Your data warehouse should be up and running in minutes.
Fully Managed – Amazon Redshift handles all the work needed to manage, monitor, and scale your data warehouse, from monitoring cluster health and taking backups to applying patches and upgrades. You can easily add or remove nodes from your cluster as your performance and capacity needs change. By handling all these time-consuming, labor-intensive tasks, Amazon Redshift frees you up to focus on your data and business.
Fault Tolerant – Amazon Redshift has multiple features that enhance the reliability of your data warehouse cluster. All data written to a node in your cluster is automatically replicated to other nodes within the cluster and all data is continuously backed up to Amazon S3. Amazon Redshift continuously monitors the health of the cluster and automatically re-replicates data from failed drives and replaces nodes as necessary.
Automated Backups – Amazon Redshift’s automated snapshot feature continuously backs up new data on the cluster to Amazon S3. Snapshots are automated, incremental, and continuous. Amazon Redshift stores your snapshots for a user-defined period, which can be from one to thirty-five days. You can also take your own snapshots at any time, which leverage all existing system snapshots and are retained until you explicitly delete them. Once you delete a cluster, your system snapshots are removed but your user snapshots are available until you explicitly delete them.
Easy Restores – You can use any system or user snapshot to restore your cluster using the AWS Management Console or the Amazon Redshift APIs. Your cluster is available as soon as the system metadata has been restored and you can start running queries while user data is spooled down in the background.
Encryption – With just a couple of parameter settings, you can set up Amazon Redshift to use SSL to secure data in transit and hardware-accelerated AES-256 encryption for data at rest. If you choose to enable encryption of data at rest, all data written to disk will be encrypted as well as any backups.
Isolation – Amazon Redshift enables you to configure firewall rules to control network access to your data warehouse cluster. You can also run Amazon Redshift inside Amazon Virtual Private Cloud (Amazon VPC) to isolate your data warehouse cluster in your own virtual network and connect it to your existing IT infrastructure using industry-standard encrypted IPsec VPN.
SQL – Amazon Redshift is a SQL data warehouse and uses industry standard ODBC and JDBC connections and Postgres drivers. Many popular software vendors are certifying Amazon Redshift with their offerings to enable you to continue to use the tools you do today. See the Amazon Redshift partner page for details.
Designed for use with other AWS Services – Amazon Redshift is integrated with other AWS services and has built in commands to load data in parallel to each node from Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB. AWS Data Pipeline enables easy, programmatic integration between Amazon Redshift, Amazon Elastic MapReduce (Amazon EMR), and Amazon Relational Database Service (Amazon RDS).
BigQuery is Google’s Cloud Big Data solution based on the Dremel platform. Dremel has been in development for over 6 years and powers much of Google’s Cloud Platform. It’s worth mentioning that for this course I’m going to cover BigQuery at a high level, and later we’ll connect Tableau to it to see how to use it in practice. If you’d like to dive deeper into BigQuery, Lynn Langit has a course on here which goes into much greater detail and is definitely worth checking out.
Let’s start by taking a look at their homepage.
On their homepage they proclaim that you can analyze terabytes of data with just a click of a button. That sounds promising, if it weren’t for Amazon Redshift offering petabytes in scale.
You’ll also notice a query editor and result pane previewed. This is encouraging; however, for non-SQL developers this can be a scary sight.
Similar to Amazon Redshift, Google BigQuery stores data in a columnar database format, which is great for data compression and query speeds.
Google BigQuery differs from Amazon Redshift, however, in that it uses a tree structure. This is similar to an MPP database, but it spreads the data extremely wide and, for queries, creates execution “trees” which can scan tens of thousands of servers or leaf nodes containing the data and return results in milliseconds.
Like Redshift, this all adds up to speed! Google is trying to differentiate from MPP solutions with BigQuery by providing what they call full-scan results. This is essentially done by creating a query tree of every possible combination of query you can run. In their whitepaper titled “An Inside Look at Google BigQuery”, Kazunori Sato states that “BigQuery solves the parallel disk I/O problem by utilizing the cloud platform’s economy of scale. You would need to run 10,000 disk drives and 5,000 processors simultaneously to execute the full scan of 1TB of data within one second.” Impressive.
The quote from this whitepaper tat
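The whitepaper's full-scan figure can be sanity-checked with simple division. This is only a back-of-the-envelope sketch: it assumes a decimal terabyte and that the data is spread evenly across all drives, neither of which the quote states, and the function name is made up for illustration.

```python
TB = 10**12  # decimal terabyte in bytes (assumption; the quote does not say)
MB = 10**6


def per_disk_throughput_mb_s(total_bytes, disks, seconds):
    """Throughput each disk must sustain if the scan is split evenly."""
    return total_bytes / disks / seconds / MB


rate = per_disk_throughput_mb_s(1 * TB, disks=10_000, seconds=1.0)
print(f"{rate:.0f} MB/s per disk")  # 100 MB/s per disk
```

In other words, each of the 10,000 drives would only need to read about 100 MB in that second, which is why spreading the data very wide makes a one-second full scan plausible.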
Dremel is the platform on which BigQuery is based.
Scalability with Google BigQuery is a bit of a mystery, to be honest. Since they handle all of the administration and data distribution for you, scalability is really limited only by how much you can afford. Once you upload your data to BigQuery, it handles the rest; you only need to worry about how much data is going to be processed in your queries, which brings us to their pricing model.
Big Data analysis engine without operating a data center
• Managed service means no additional capital costs
• Ability to terminate service and remove your data at any time
Transparency in pricing and usage
• Simplicity: only 2 pricing components (query processing, storage)
• Flexibility: choice to pay-by-the-month for what you use
Full Visibility and Control
• Monthly billing: Monitor and throttle what you use
• Tools to optimize usage/costs: best practices, tooling, samples
Since you’re charged by the amount of data processed, this can be very expensive if you are using a “chatty” query tool like Tableau. Google recommends sharding data into separate tables by timestamp and setting your queries to filter to a specific date range to minimize query costs.
In my view this is the only issue with BigQuery. Let’s say you have a query which pulls back something like sales for the west region by month for the past year. This will return 24 data points: 12 integers for sales, and 12 date values corresponding to the month of sales. To get to these 24 data points your query may have to scan millions or billions of rows (imagine Amazon’s detailed sales transactions), aggregate the data, and then return your results. Since you’re paying for all the data scanned, a single query could really rack up the bills. Now, if you were building a focused application and not doing visual analytics with a tool like Tableau, you could probably handle this quite well; in this case, however, it can be cost prohibitive to store your data here. I have a friend who was testing this, and one of his analysts actually ran a single query that cost them $400!
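To make the per-query cost concrete, here is a rough estimate based on the on-demand rates from the pricing slide. It is a sketch under stated assumptions: the 12 monthly shards of 1 TB each are a made-up example, the function name is hypothetical, and cost is modelled simply as terabytes scanned times the per-TB rate.

```python
TB = 10**12  # decimal terabyte in bytes (assumption)

INTERACTIVE_PER_TB = 35.0  # $ per TB processed, from the slide
BATCH_PER_TB = 20.0        # $ per TB processed, from the slide


def query_cost(bytes_scanned, rate_per_tb=INTERACTIVE_PER_TB):
    """Estimated charge for a single query that scans bytes_scanned bytes."""
    return bytes_scanned / TB * rate_per_tb


# Hypothetical sales table split into 12 monthly shards of 1 TB each.
full_year = query_cost(12 * TB)   # query touches every shard
one_month = query_cost(1 * TB)    # query filtered to a single shard

# Data volume at which one interactive query reaches $400.
tb_for_400 = 400 / INTERACTIVE_PER_TB

print(f"full year: ${full_year:.2f}, one month: ${one_month:.2f}")
print(f"a $400 query scans about {tb_for_400:.2f} TB")
```

With these numbers, a query over the whole year costs $420 while the same query restricted to one monthly shard costs $35, which illustrates why filtering to a date range matters. A single $400 interactive query corresponds to roughly 11.4 TB scanned.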
