SlideShare a Scribd company logo
1 of 44
Download to read offline
Data Management in the Cloud
Introduction
(Lecture 1)
1
LOGISTICS AND ORGANIZATION
Data Management in the Cloud
2
Personnel
• Kristin Tufte
– FAB 115-09
– Email: tufte@pdx.edu
– Office hours: right after class
• David Maier
– FAB 115-14
– Email: maier@cs.pdx.edu
– Office hours: TBA
3
Thanks to Michael Grossniklaus
4
Course Resources
• Web site
– Piazza: https://piazza.com/pdx/spring2015/cs410510cloud/home
• Lecture note slides
– online by lecture time
• Literature list
– Readings associated with most lectures – links on the course web site
• Book - Required
– NoSQL Distilled: A Brief Guide to the Emerging world of Polyglot
Persistence
5
Course Goals
• Understand the basic concepts of cloud computing and cloud
data management
• Learn how to design data models and algorithms for managing
data in the cloud
• Experiment with cloud data management systems
• Work with cloud computing platforms
• Compare and discuss results
• Hopefully, have a good time doing so!
6
Planned Course Schedule
I. Introduction and Basics
– motivation, challenges, concepts, …
– storage, distributed file systems, and map/reduce, …
II. Data Models and Systems
– key/value, document, column families, graph, array, …
III. Data Processing Paradigms
– SCOPE, Pig Latin, Hive, …
IV. Scalable SQL
– Microsoft SQL Azure, VoltDB, …
V. Advanced and Research Topics
– SQLShare, benchmarking, …
7
Course Prerequisites
• Programming skills
– Java, C#, C/C++
– algorithms and data structures
– some distributed systems, e.g. client/server
• Database management systems
– physical storage
– query processing
– optimization
8
Assignments & Project
• Assignments (48% of Grade)
– Up to 6 assignments
– Individual work
– Some question/answer on readings or class discussions, some
implementation
• Course Project (48% of Grade)
– Part 1: Data Modeling (Written) (12%)
– Part 2: System Profile (Written) (12%)
– Part 3: Application Design (Presentation) (12%)
– Part 4: Application Implementation (Coding & Presentation) (12%)
– Part 1 is done in pairs; parts 2, 3 and 4 are done in groups of 4-5 students
• Class Participation (4% of Grade)
• Application Implementation presentations will be done during the
Finals time slot (no Final Exam)
• Assigned readings – on course web site 9
INTRODUCTION
Data Management in the Cloud
10
Outline
• Motivation
– what is cloud computing?
– what is cloud data management?
• Challenges, opportunities and limitations
– what makes data management in the cloud difficult?
• New solutions
– key/value, document, column family, graph, array, and object databases
– scalable SQL databases
• Application
– graph data and algorithms
– usage scenarios
11
What is Cloud Computing?
• Different definitions for “Cloud Computing” exist
– http://tech.slashdot.org/article.pl?sid=08/07/17/2117221
• Common ground of many definitions
– processing power, storage and software are commodities that are
readily available from large infrastructure
– service-based view: “everything as a service (*aaS)”, where only
“Software as a Service (SaaS)” has a precise and agreed-upon definition
– utility computing: pay-as-you-go model
12
Service-Based View on Computing
Server Hardware
Client Software
Infrastructure
(IaaS)
Platform
(PaaS)
Software
(SaaS)
Machine Interface
User Interface
Services
Components
Network
Computation Storage
Application
Developer
System
Administrator
End User
Source: Wikipedia (http://www.wikipedia.org)
13
Terminology
• Term cloud computing usually refers to both
– SaaS: applications delivered over the Internet as services
– The Cloud: data center hardware and systems software
• Public clouds
– available in a pay-as-you-go manner to the public
– service being sold is utility computing
– Amazon Web Service, Microsoft Azure, Google AppEngine
• Private clouds
– internal data centers of businesses or organizations
– normally not included under cloud computing
Based on: “Above the Clouds: A Berkeley View of Cloud Computing”, RAD Lab, UC Berkeley
14
Utility Computing
• Illusion of infinite computing resources
– available on demand
– no need for users to plan ahead for provisioning
• No up-front cost or commitment by users
– companies can start small (demand unknown in advance)
– increase resources only when there is an increase in need (demand
varies with time)
• Pay for use on short-term basis as needed
– processors by the hour and storage by the day
– release them as needed, reward conservation
• “Cost associativity”
– 1000 EC2 machines for 1 hour = 1 EC2 machine for 1000 hours
Based on: “Above the Clouds: A Berkeley View of Cloud Computing”, RAD Lab, UC Berkeley
15
Cloud Computing Users and Providers
Picture credit: “Above the Clouds: A Berkeley View of Cloud Computing”, RAD Lab, UC Berkeley
16
Virtualization
• Virtual resources abstract from physical resources
– hardware platform, software, memory, storage, network
– fine-granular, lightweight, flexible and dynamic
• Relevance to cloud computing
– centralize and ease administrative tasks
– improve scalability and work loads
– increase stability and fault-tolerance
– provide standardized, homogenous computing platform through
hardware virtualization, i.e. virtual machines
17
Spectrum of Virtualization
• Computation virtualization
– Instruction set VM (Amazon EC2, 3Tera)
– Byte-code VM (Microsoft Azure)
– Framework VM (Google AppEngine, Force.com)
• Storage virtualization
• Network virtualization
EC2 Azure AppEngine Force.com
Lower-level,
Less management
Higher-level,
More management
Slide Credit: RAD Lab, UC Berkeley
18
Table credit: “Above the Clouds: A Berkeley View of Cloud Computing”, RAD Lab, UC Berkeley
19
Unused resources
Economics of Cloud Users
• Pay by use instead of provisioning for peak
Static data center Data center in the cloud
Demand
Capacity
Time
Resources
Demand
Capacity
Time
Resources
Slide Credit: RAD Lab, UC Berkeley
20
Unused resources
Economics of Cloud Users
• Risk of over-provisioning: underutilization
Static data center
Demand
Capacity
Time
Resources
Slide Credit: RAD Lab, UC Berkeley
21
Economics of Cloud Users
• Heavy penalty for under-provisioning
Lost revenue Lost users
Resources
Demand
Capacity
Time (days)
1 2 3
Resources
Demand
Capacity
Time (days)
1 2 3
Resources
Demand
Capacity
Time (days)
1 2 3
Slide Credit: RAD Lab, UC Berkeley
22
Economics of Cloud Providers
• Cloud computing is 5-7x cheaper than traditional in-house
computing
• Power/cooling costs: approx double cost of storage, CPU, network
• Added benefits (to cloud providers)
– utilize off-peak capacity (Amazon)
– sell .NET tools (Microsoft)
– reuse existing infrastructure (Google)
Resource
Cost in Medium
Data Center
Cost in Very Large
Data Center
Ratio
Network $95/Mbps/month $13/Mbps/month 7.1x
Storage $2.20/GB/month $0.40/GB/month 5.7x
Administration ≈140 servers/admin >1000 servers/admin 7.1x
Slide Credit: RAD Lab, UC Berkeley
Source: James Hamilton (http://perspectives.mvdirona.com)
23
What is Cloud Data Management?
• Data management applications are potential candidates for
deployment in the cloud
– industry: enterprise database system have significant up-front cost that
includes both hardware and software costs
– academia: manage, process and share mass-produced data in the cloud
• Many “Cloud Killer Apps” are in fact data-intensive
– Batch Processing as with map/reduce
– Online Transaction Processing (OLTP) as in automated business
applications
– Online Analytical Processing (OLAP) as in data mining or machine
learning
24
Scientific Data Management Applications
• Old model
– “Query the world”
– data acquisition coupled to a specific hypothesis
• New model
– “Download the world”
– data acquired en masse, in support of many hypotheses
• E-science examples
– astronomy: high-resolution, high-frequency sky surveys, …
– oceanography: high-resolution models, cheap sensors, satellites, …
– biology: lab automation, high-throughput sequencing, ...
Slide Credit: Bill Howe, U Washington
25
Scaling Databases
• Flavors of database scalability
– lots of (small) transactions
– lots of copies of the data
– lots of processors running on a single query (compute intensive tasks)
– extremely large data set for one query (data intensive tasks)
• Data replication
– move data to where it is needed
– managed replication for availability and reliability
26
Revisit Cloud Characteristics
• Compute power is elastic, but only if workload is parallelizable
– transactional database management systems do not typically use a
shared-nothing architecture
– shared-nothing is a good match for analytical data management
– some things parallelize well (i.e. sum), some do not (i.e. median)
– Think about: Google gmail, Amazon web site – easy? Difficult?
– Google App Engine – API forces ability to run in shared nothing
• Scalability
– in the past: out-of-core, works even if data does not fit in main memory
– in the present: exploits thousands of (cheap) nodes in parallel
Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009.
27
Parallel Database Architectures
interconnect
…
…
interconnect
interconnect
…
Shared nothing Shared disc Shared memory
processor memory disk
Source: D. DeWitt and J. Gray: “Parallel Database Systems: The Future of
High Performance Database Processing”, CACM 36(6), pp. 85-98, 1992. 28
Revisit Cloud Characteristics
• Data is stored at an untrusted host
– there are risks with respect to privacy and security in storing
transactional data on an untrusted host
– particularly sensitive data can be left out of analysis or anonymized
– sharing and enabling access is often precisely the goal of using the
cloud for scientific data sets
– where exactly is your data? and what are that country’s laws?
Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009.
29
Revisit Cloud Characteristics
• Data is replicated, often across large geographic distances
– it is hard to maintain ACID guarantees in the presence of large-scale
replication
– full ACID guarantees are typically not required in analytical applications
• Virtualizing large data collections is challenging
– data loading takes more time than starting a VM
– storage cost vs. bandwidth cost
– online vs. offline replication
Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009.
30
Challenges
• Trade-off between functionality and operational cost
– restricted interface, minimalist query language, limited consistency
guarantees
– pushes more programming burden on developers
– enables predictable services and service level agreements
• Manageability
– limited human intervention, high-variance workloads, and a variety of
shared infrastructures
– need for self-managing and adaptive database techniques
Based on: “The Claremont Report on Database Research”, 2008
31
Challenges
• Scalability
– today’s SQL databases cannot scale to the thousands of nodes deployed
in the cloud context
– hard to support multiple, distributed updaters to the same data set
– hard to replicate huge data sets for availability, due to capacity (storage,
network bandwidth, …)
– storage: different transactional implementation techniques, different
storage semantics, or both
– query processing and optimization: limitations on either the plan space
or the search will be required
– programmability: express programs in the cloud
Based on: “The Claremont Report on Database Research”, 2008
32
Challenges
• Data privacy and security
– protect from other users and cloud providers
– specifically target usage scenarios in the cloud with practical incentives
for providers and customers
• New applications: “mash up” interesting data sets
– expect services pre-loaded with large data sets, stock prices, web
crawls, scientific data
– data sets from private or public domain
– might give rise to federated cloud architectures
Based on: “The Claremont Report on Database Research”, 2008
33
Transactional Data Management – Cloud or not?
• Transactional Data Management
– Banking, airline reservation, e-commerce, etc…
– Require ACID, write-intensive
• Features
– Do not typically use shared-nothing architectures (changing a bit)
– Hard to maintain ACID guarantees in the face of data replication over
large geographic distances
– There are risks in storing transactional data on an untrusted host
• Conclusion: not appropriate for the cloud
Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009.
34
Analytical Data Management – Cloud or not?
• Analytical Data Management
– Query data from a data store for planning, problem solving, decision
support
– Large scale
– Read-mostly
• Features
– Shared-nothing architecture is a good match for analytical data
management (Teradata, Greenplum, Vertica…)
– ACID guarantees typically not needed
– Particularly sensitive data left out of analysis
• Conclusion: appropriate for the cloud
Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009.
35
Cloud DBMS Wish List
• Efficiency
• Fault tolerance (query restart not required, commodity hw)
• Heterogeneous environment (performance of compute nodes
not consistent)
• Operate on encrypted data
• Interface with business intelligence products
Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009.
36
Option 1: MapReduce-like software
• Fault tolerance (yes, commodity hw)
• Heterogeneous environment (yes, by design)
• Operate on encrypted data (no)
• Interface with business intelligence products (no, not SQL-
compliant, no standard)
• Efficiency (up for debate)
– Questionable results in the MapReduce paper
– Absence of a loading phase (no indices, materialized views)
Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009.
37
Option 2: Shared-Nothing Parallel Database
• Interface with business intelligence products (yes, by design)
• Efficiency (yes)
• Fault tolerance (no - query restart required)
• Heterogeneous environment (no)
• Operate on encrypted data (no)
Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009.
38
Option 3: A Hybrid Solution
• HadoopDB (http://db.cs.yale.edu/hadoopdb/hadoopdb.html)
– A hybrid of DBMS and MapReduce technologies that targets analytical
workloads
– Designed to run on a shared-nothing cluster of commodity machines, or
in the cloud
– An attempt to fill the gap in the market for a free and open source
parallel DBMS
– Much more scalable than currently available parallel database systems
and DBMS/MapReduce hybrid systems.
– As scalable as Hadoop, while achieving superior performance on
structured data analysis workload
• Commercialized as Hadapt (hadapt.com)
39
Why NoSQL?
• Value of relational databases
– Persistent data
– Concurrency/transactions
– Integration
– (Mostly) Standard Model
• Impedance Mismatch
• Application vs. Integration databases
Based on: “NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence”, 2013
40
Attack of the Clusters
• Data growth (links, social networks, logs, users)
• Need to scale to accommodate growth
• Traditional RDBMS (Oracle / Microsoft SQL Server) – shared
disk – don’t scale well
• “technical issues are exacerbated by licensing costs”
– Google, Amazon influential
• “The interesting thing about Cloud Computing is that we’ve
redefined Cloud Computing to include everything that we
already do… I don’t’ understand what we would do differently
in the light of Cloud Computing other than change the wording
of some of our ads.” – Larry Ellison
Based on: “NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence”, 2013
41
Emergence of NoSQL
• No strong definition, but…
– Do not use SQL
– Typically open-source
– Typically oriented towards clusters (but not all)
– Operate without a schema
• Various types (in order of complexity)
– Key-value stores
– Document Stores
– Extensible Record Stores
Based on: “NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence”, 2013
42
What were we talking about?
• Cloud Computing
– Utility Computing
– Virtualization
– Economics (pay as you go)
• Data management in the cloud
– Cloud characteristics (elasticity if parallelizable, untrusted host, large
distances)
– Transactional vs. Analytical
– Wish List
– Map Reduce vs. Shared-Nothing -> Hybrid
• DB vs. NoSQL in two lines…
– Database: complex / concurrent
– NoSQL: simple / scalable
43
References
• M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A.
Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, M. Zaharia:
Above the Clouds: A Berkeley View of Cloud Computing. Tech. Rep.
No. UCB/EECS-2009-28, 2009.
• D. J. Abadi: Data Management in the Cloud: Limitations and
Opportunities. IEEE Data Eng. Bull. 32(1), pp. 3—12, 2009.
• R. Agrawal, A. Ailamaki, P. A. Bernstein, E. A. Brewer, M. J. Carey, S.
Chaudhuri, A. Doan, D. Florescu, M. J. Franklin, H. Garcia‐ Molina, J.
Gehrke, L. Gruenwald, L. M. Haas, A. Y. Halevy, J. M. Hellerstein, Y. E.
Ioannidis, H. F. Korth, D. Kossmann, S. Madden, R. Magoulas, B. Chin
Ooi, T. O’Reilly, R. Ramakrishnan, S. Sarawagi, M. Stonebraker, A. S.
Szalay, G. Weikum: The Claremont Report on Database Research.
2008.
• P. Sadalage, M. Fowler. NoSQL Distilled: A Brief Guide to the
Emerging World of Polyglot Persistence. 2013
44

More Related Content

What's hot

3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and DeltaDatabricks
 
VTU 6th Sem Elective CSE - Module 4 cloud computing
VTU 6th Sem Elective CSE - Module 4  cloud computingVTU 6th Sem Elective CSE - Module 4  cloud computing
VTU 6th Sem Elective CSE - Module 4 cloud computingSachin Gowda
 
How to Set Up ApsaraDB for RDS on Alibaba Cloud
How to Set Up ApsaraDB for RDS on Alibaba CloudHow to Set Up ApsaraDB for RDS on Alibaba Cloud
How to Set Up ApsaraDB for RDS on Alibaba CloudAlibaba Cloud
 
Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...
Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...
Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...Data Con LA
 
AliCloud Object Storage Service (OSS) Core Features
AliCloud Object Storage Service (OSS) Core FeaturesAliCloud Object Storage Service (OSS) Core Features
AliCloud Object Storage Service (OSS) Core FeaturesAlibaba Cloud
 
Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)Ankit Gupta
 
Choosing the right Cloud Database
Choosing the right Cloud DatabaseChoosing the right Cloud Database
Choosing the right Cloud DatabaseJanakiram MSV
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Precisely
 
Big Data Quickstart Series 3: Perform Data Integration
Big Data Quickstart Series 3: Perform Data IntegrationBig Data Quickstart Series 3: Perform Data Integration
Big Data Quickstart Series 3: Perform Data IntegrationAlibaba Cloud
 
Caching for Microservices Architectures: Session I
Caching for Microservices Architectures: Session ICaching for Microservices Architectures: Session I
Caching for Microservices Architectures: Session IVMware Tanzu
 
Cloud Design Pattern part1
Cloud Design Pattern part1Cloud Design Pattern part1
Cloud Design Pattern part1Masashi Narumoto
 
VTU 6th Sem Elective CSE - Module 5 cloud computing
VTU 6th Sem Elective CSE - Module 5 cloud computingVTU 6th Sem Elective CSE - Module 5 cloud computing
VTU 6th Sem Elective CSE - Module 5 cloud computingSachin Gowda
 
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, SparkBuilding Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, SparkTodd Fritz
 
Cloudlytics: In Depth S3 & CloudFront Log Analysis - Featuring Reports
Cloudlytics: In Depth S3 & CloudFront Log Analysis - Featuring ReportsCloudlytics: In Depth S3 & CloudFront Log Analysis - Featuring Reports
Cloudlytics: In Depth S3 & CloudFront Log Analysis - Featuring ReportsBlazeclan Technologies Private Limited
 
Presentation on Databases in the Cloud
Presentation on Databases in the CloudPresentation on Databases in the Cloud
Presentation on Databases in the Cloudmoshfiq
 
Simulating Heterogeneous Resources in CloudLightning
Simulating Heterogeneous Resources in CloudLightningSimulating Heterogeneous Resources in CloudLightning
Simulating Heterogeneous Resources in CloudLightningCloudLightning
 
Webinar: Considering Hyperconverged for Your Enterprise? 3 Key Questions to Ask
Webinar: Considering Hyperconverged for Your Enterprise? 3 Key Questions to AskWebinar: Considering Hyperconverged for Your Enterprise? 3 Key Questions to Ask
Webinar: Considering Hyperconverged for Your Enterprise? 3 Key Questions to AskStorage Switzerland
 
How to power microservices with MariaDB
How to power microservices with MariaDBHow to power microservices with MariaDB
How to power microservices with MariaDBMariaDB plc
 

What's hot (20)

3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
 
VTU 6th Sem Elective CSE - Module 4 cloud computing
VTU 6th Sem Elective CSE - Module 4  cloud computingVTU 6th Sem Elective CSE - Module 4  cloud computing
VTU 6th Sem Elective CSE - Module 4 cloud computing
 
How to Set Up ApsaraDB for RDS on Alibaba Cloud
How to Set Up ApsaraDB for RDS on Alibaba CloudHow to Set Up ApsaraDB for RDS on Alibaba Cloud
How to Set Up ApsaraDB for RDS on Alibaba Cloud
 
Working and Features of HTML5 and PhoneGap - An Overview
Working and Features of HTML5 and PhoneGap - An OverviewWorking and Features of HTML5 and PhoneGap - An Overview
Working and Features of HTML5 and PhoneGap - An Overview
 
Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...
Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...
Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...
 
AliCloud Object Storage Service (OSS) Core Features
AliCloud Object Storage Service (OSS) Core FeaturesAliCloud Object Storage Service (OSS) Core Features
AliCloud Object Storage Service (OSS) Core Features
 
Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)
 
Choosing the right Cloud Database
Choosing the right Cloud DatabaseChoosing the right Cloud Database
Choosing the right Cloud Database
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?
 
Big Data Quickstart Series 3: Perform Data Integration
Big Data Quickstart Series 3: Perform Data IntegrationBig Data Quickstart Series 3: Perform Data Integration
Big Data Quickstart Series 3: Perform Data Integration
 
Caching for Microservices Architectures: Session I
Caching for Microservices Architectures: Session ICaching for Microservices Architectures: Session I
Caching for Microservices Architectures: Session I
 
Cloud Design Pattern part1
Cloud Design Pattern part1Cloud Design Pattern part1
Cloud Design Pattern part1
 
VTU 6th Sem Elective CSE - Module 5 cloud computing
VTU 6th Sem Elective CSE - Module 5 cloud computingVTU 6th Sem Elective CSE - Module 5 cloud computing
VTU 6th Sem Elective CSE - Module 5 cloud computing
 
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, SparkBuilding Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
 
Cloudlytics: In Depth S3 & CloudFront Log Analysis - Featuring Reports
Cloudlytics: In Depth S3 & CloudFront Log Analysis - Featuring ReportsCloudlytics: In Depth S3 & CloudFront Log Analysis - Featuring Reports
Cloudlytics: In Depth S3 & CloudFront Log Analysis - Featuring Reports
 
Presentation on Databases in the Cloud
Presentation on Databases in the CloudPresentation on Databases in the Cloud
Presentation on Databases in the Cloud
 
Simulating Heterogeneous Resources in CloudLightning
Simulating Heterogeneous Resources in CloudLightningSimulating Heterogeneous Resources in CloudLightning
Simulating Heterogeneous Resources in CloudLightning
 
Aneka platform
Aneka platformAneka platform
Aneka platform
 
Webinar: Considering Hyperconverged for Your Enterprise? 3 Key Questions to Ask
Webinar: Considering Hyperconverged for Your Enterprise? 3 Key Questions to AskWebinar: Considering Hyperconverged for Your Enterprise? 3 Key Questions to Ask
Webinar: Considering Hyperconverged for Your Enterprise? 3 Key Questions to Ask
 
How to power microservices with MariaDB
How to power microservices with MariaDBHow to power microservices with MariaDB
How to power microservices with MariaDB
 

Similar to Cloud

advance computing and big adata analytic.pptx
advance computing and big adata analytic.pptxadvance computing and big adata analytic.pptx
advance computing and big adata analytic.pptxTeddyIswahyudi1
 
Cloud Computing Berkeley.pdf
Cloud Computing Berkeley.pdfCloud Computing Berkeley.pdf
Cloud Computing Berkeley.pdfAtaulAzizIkram
 
Clould Computing and its application in Libraries
Clould Computing and its application in LibrariesClould Computing and its application in Libraries
Clould Computing and its application in LibrariesAmit Shaw
 
Lecture 3.31 3.32.pptx
Lecture 3.31  3.32.pptxLecture 3.31  3.32.pptx
Lecture 3.31 3.32.pptxRATISHKUMAR32
 
Scientific Computing in the Cloud
Scientific Computing in the CloudScientific Computing in the Cloud
Scientific Computing in the CloudAdianto Wibisono
 
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...Qian Lin
 
سکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرسکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرdatastack
 
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)Robert Grossman
 
Shaping the Role of a Data Lake in a Modern Data Fabric Architecture
Shaping the Role of a Data Lake in a Modern Data Fabric ArchitectureShaping the Role of a Data Lake in a Modern Data Fabric Architecture
Shaping the Role of a Data Lake in a Modern Data Fabric ArchitectureDenodo
 
Cloud Overview
Cloud OverviewCloud Overview
Cloud Overviewiasaglobal
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computingiasaglobal
 
Module-2_HADOOP.pptx
Module-2_HADOOP.pptxModule-2_HADOOP.pptx
Module-2_HADOOP.pptxShreyasKv13
 
BIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptxBIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptxVishalBH1
 
Metail and Elastic MapReduce
Metail and Elastic MapReduceMetail and Elastic MapReduce
Metail and Elastic MapReduceGareth Rogers
 
PPT_CLOUD COMPUTING_UNIT 1.pptx.pdf
PPT_CLOUD COMPUTING_UNIT 1.pptx.pdfPPT_CLOUD COMPUTING_UNIT 1.pptx.pdf
PPT_CLOUD COMPUTING_UNIT 1.pptx.pdfVineet446350
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformDATAVERSITY
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 

Similar to Cloud (20)

advance computing and big adata analytic.pptx
advance computing and big adata analytic.pptxadvance computing and big adata analytic.pptx
advance computing and big adata analytic.pptx
 
Cloud Computing Berkeley.pdf
Cloud Computing Berkeley.pdfCloud Computing Berkeley.pdf
Cloud Computing Berkeley.pdf
 
Clould Computing and its application in Libraries
Clould Computing and its application in LibrariesClould Computing and its application in Libraries
Clould Computing and its application in Libraries
 
Lecture 3.31 3.32.pptx
Lecture 3.31  3.32.pptxLecture 3.31  3.32.pptx
Lecture 3.31 3.32.pptx
 
Scientific Computing in the Cloud
Scientific Computing in the CloudScientific Computing in the Cloud
Scientific Computing in the Cloud
 
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
 
سکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرسکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابر
 
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
 
Shaping the Role of a Data Lake in a Modern Data Fabric Architecture
Shaping the Role of a Data Lake in a Modern Data Fabric ArchitectureShaping the Role of a Data Lake in a Modern Data Fabric Architecture
Shaping the Role of a Data Lake in a Modern Data Fabric Architecture
 
Cloud Overview
Cloud OverviewCloud Overview
Cloud Overview
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Module-2_HADOOP.pptx
Module-2_HADOOP.pptxModule-2_HADOOP.pptx
Module-2_HADOOP.pptx
 
BIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptxBIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptx
 
Metail and Elastic MapReduce
Metail and Elastic MapReduceMetail and Elastic MapReduce
Metail and Elastic MapReduce
 
PPT_CLOUD COMPUTING_UNIT 1.pptx.pdf
PPT_CLOUD COMPUTING_UNIT 1.pptx.pdfPPT_CLOUD COMPUTING_UNIT 1.pptx.pdf
PPT_CLOUD COMPUTING_UNIT 1.pptx.pdf
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics Platform
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
ppt2.pdf
ppt2.pdfppt2.pdf
ppt2.pdf
 

Recently uploaded

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 

Recently uploaded (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

Cloud

  • 1. Data Management in the Cloud Introduction (Lecture 1) 1
  • 2. LOGISTICS AND ORGANIZATION Data Management in the Cloud 2
  • 3. Personnel • Kristin Tufte – FAB 115-09 – Email: tufte@pdx.edu – Office hours: right after class • David Maier – FAB 115-14 – Email: maier@cs.pdx.edu – Office hours: TBA 3
  • 4. Thanks to Michael Grossniklaus 4
  • 5. Course Resources • Web site – Piazza: https://piazza.com/pdx/spring2015/cs410510cloud/home • Lecture note slides – online by lecture time • Literature list – Readings associated with most lectures – links on the course web site • Book - Required – NoSQL Distilled: A Brief Guide to the Emerging world of Polyglot Persistence 5
  • 6. Course Goals • Understand the basic concepts of cloud computing and cloud data management • Learn how to design data models and algorithms for managing data in the cloud • Experiment with cloud data management systems • Work with cloud computing platforms • Compare and discuss results • Hopefully, have a good time doing so! 6
  • 7. Planned Course Schedule I. Introduction and Basics – motivation, challenges, concepts, … – storage, distributed file systems, and map/reduce, … II. Data Models and Systems – key/value, document, column families, graph, array, … III. Data Processing Paradigms – SCOPE, Pig Latin, Hive, … IV. Scalable SQL – Microsoft SQL Azure, VoltDB, … V. Advanced and Research Topics – SQLShare, benchmarking, … 7
  • 8. Course Prerequisites • Programming skills – Java, C#, C/C++ – algorithms and data structures – some distributed systems, e.g. client/server • Database management systems – physical storage – query processing – optimization 8
  • 9. Assignments & Project • Assignments (48% of Grade) – Up to 6 assignments – Individual work – Some question/answer on readings or class discussions, some implementation • Course Project (48% of Grade) – Part 1: Data Modeling (Written) (12%) – Part 2: System Profile (Written) (12%) – Part 3: Application Design (Presentation) (12%) – Part 4: Application Implementation (Coding & Presentation) (12%) – Part 1 is done in pairs; parts 2, 3 and 4 are done in groups of 4-5 students • Class Participation (4% of Grade) • Application Implementation presentations will be done during the Finals time slot (no Final Exam) • Assigned readings – on course web site 9
  • 11. Outline • Motivation – what is cloud computing? – what is cloud data management? • Challenges, opportunities and limitations – what makes data management in the cloud difficult? • New solutions – key/value, document, column family, graph, array, and object databases – scalable SQL databases • Application – graph data and algorithms – usage scenarios 11
  • 12. What is Cloud Computing? • Different definitions for “Cloud Computing” exist – http://tech.slashdot.org/article.pl?sid=08/07/17/2117221 • Common ground of many definitions – processing power, storage and software are commodities that are readily available from large infrastructure – service-based view: “everything as a service (*aaS)”, where only “Software as a Service (SaaS)” has a precise and agreed-upon definition – utility computing: pay-as-you-go model 12
  • 13. Service-Based View on Computing Server Hardware Client Software Infrastructure (IaaS) Platform (PaaS) Software (SaaS) Machine Interface User Interface Services Components Network Computation Storage Application Developer System Administrator End User Source: Wikipedia (http://www.wikipedia.org) 13
  • 14. Terminology • Term cloud computing usually refers to both – SaaS: applications delivered over the Internet as services – The Cloud: data center hardware and systems software • Public clouds – available in a pay-as-you-go manner to the public – service being sold is utility computing – Amazon Web Service, Microsoft Azure, Google AppEngine • Private clouds – internal data centers of businesses or organizations – normally not included under cloud computing Based on: “Above the Clouds: A Berkeley View of Cloud Computing”, RAD Lab, UC Berkeley 14
  • 15. Utility Computing • Illusion of infinite computing resources – available on demand – no need for users to plan ahead for provisioning • No up-front cost or commitment by users – companies can start small (demand unknown in advance) – increase resources only when there is an increase in need (demand varies with time) • Pay for use on short-term basis as needed – processors by the hour and storage by the day – release them as needed, reward conservation • “Cost associativity” – 1000 EC2 machines for 1 hour = 1 EC2 machine for 1000 hours Based on: “Above the Clouds: A Berkeley View of Cloud Computing”, RAD Lab, UC Berkeley 15
  • 16. Cloud Computing Users and Providers Picture credit: “Above the Clouds: A Berkeley View of Cloud Computing”, RAD Lab, UC Berkeley 16
  • 17. Virtualization • Virtual resources abstract from physical resources – hardware platform, software, memory, storage, network – fine-granular, lightweight, flexible and dynamic • Relevance to cloud computing – centralize and ease administrative tasks – improve scalability and work loads – increase stability and fault-tolerance – provide standardized, homogenous computing platform through hardware virtualization, i.e. virtual machines 17
  • 18. Spectrum of Virtualization • Computation virtualization – Instruction set VM (Amazon EC2, 3Tera) – Byte-code VM (Microsoft Azure) – Framework VM (Google AppEngine, Force.com) • Storage virtualization • Network virtualization EC2 Azure AppEngine Force.com Lower-level, Less management Higher-level, More management Slide Credit: RAD Lab, UC Berkeley 18
  • 19. Table credit: “Above the Clouds: A Berkeley View of Cloud Computing”, RAD Lab, UC Berkeley 19
  • 20. Unused resources Economics of Cloud Users • Pay by use instead of provisioning for peak Static data center Data center in the cloud Demand Capacity Time Resources Demand Capacity Time Resources Slide Credit: RAD Lab, UC Berkeley 20
  • 21. Unused resources Economics of Cloud Users • Risk of over-provisioning: underutilization Static data center Demand Capacity Time Resources Slide Credit: RAD Lab, UC Berkeley 21
  • 22. Economics of Cloud Users • Heavy penalty for under-provisioning Lost revenue Lost users Resources Demand Capacity Time (days) 1 2 3 Resources Demand Capacity Time (days) 1 2 3 Resources Demand Capacity Time (days) 1 2 3 Slide Credit: RAD Lab, UC Berkeley 22
  • 23. Economics of Cloud Providers • Cloud computing is 5-7x cheaper than traditional in-house computing • Power/cooling costs: approx double cost of storage, CPU, network • Added benefits (to cloud providers) – utilize off-peak capacity (Amazon) – sell .NET tools (Microsoft) – reuse existing infrastructure (Google) Resource Cost in Medium Data Center Cost in Very Large Data Center Ratio Network $95/Mbps/month $13/Mbps/month 7.1x Storage $2.20/GB/month $0.40/GB/month 5.7x Administration ≈140 servers/admin >1000 servers/admin 7.1x Slide Credit: RAD Lab, UC Berkeley Source: James Hamilton (http://perspectives.mvdirona.com) 23
  • 24. What is Cloud Data Management? • Data management applications are potential candidates for deployment in the cloud – industry: enterprise database system have significant up-front cost that includes both hardware and software costs – academia: manage, process and share mass-produced data in the cloud • Many “Cloud Killer Apps” are in fact data-intensive – Batch Processing as with map/reduce – Online Transaction Processing (OLTP) as in automated business applications – Online Analytical Processing (OLAP) as in data mining or machine learning 24
  • 25. Scientific Data Management Applications • Old model – “Query the world” – data acquisition coupled to a specific hypothesis • New model – “Download the world” – data acquired en masse, in support of many hypotheses • E-science examples – astronomy: high-resolution, high-frequency sky surveys, … – oceanography: high-resolution models, cheap sensors, satellites, … – biology: lab automation, high-throughput sequencing, ... Slide Credit: Bill Howe, U Washington 25
  • 26. Scaling Databases • Flavors of database scalability – lots of (small) transactions – lots of copies of the data – lots of processors running on a single query (compute intensive tasks) – extremely large data set for one query (data intensive tasks) • Data replication – move data to where it is needed – managed replication for availability and reliability 26
  • 27. Revisit Cloud Characteristics • Compute power is elastic, but only if workload is parallelizable – transactional database management systems do not typically use a shared-nothing architecture – shared-nothing is a good match for analytical data management – some things parallelize well (i.e. sum), some do not (i.e. median) – Think about: Google gmail, Amazon web site – easy? Difficult? – Google App Engine – API forces ability to run in shared nothing • Scalability – in the past: out-of-core, works even if data does not fit in main memory – in the present: exploits thousands of (cheap) nodes in parallel Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009. 27
  • 28. Parallel Database Architectures interconnect … … interconnect interconnect … Shared nothing Shared disc Shared memory processor memory disk Source: D. DeWitt and J. Gray: “Parallel Database Systems: The Future of High Performance Database Processing”, CACM 36(6), pp. 85-98, 1992. 28
  • 29. Revisit Cloud Characteristics • Data is stored at an untrusted host – there are risks with respect to privacy and security in storing transactional data on an untrusted host – particularly sensitive data can be left out of analysis or anonymized – sharing and enabling access is often precisely the goal of using the cloud for scientific data sets – where exactly is your data? and what are that country’s laws? Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009. 29
  • 30. Revisit Cloud Characteristics • Data is replicated, often across large geographic distances – it is hard to maintain ACID guarantees in the presence of large-scale replication – full ACID guarantees are typically not required in analytical applications • Virtualizing large data collections is challenging – data loading takes more time than starting a VM – storage cost vs. bandwidth cost – online vs. offline replication Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009. 30
  • 31. Challenges • Trade-off between functionality and operational cost – restricted interface, minimalist query language, limited consistency guarantees – pushes more programming burden on developers – enables predictable services and service level agreements • Manageability – limited human intervention, high-variance workloads, and a variety of shared infrastructures – need for self-managing and adaptive database techniques Based on: “The Claremont Report on Database Research”, 2008 31
  • 32. Challenges • Scalability – today’s SQL databases cannot scale to the thousands of nodes deployed in the cloud context – hard to support multiple, distributed updaters to the same data set – hard to replicate huge data sets for availability, due to capacity (storage, network bandwidth, …) – storage: different transactional implementation techniques, different storage semantics, or both – query processing and optimization: limitations on either the plan space or the search will be required – programmability: express programs in the cloud Based on: “The Claremont Report on Database Research”, 2008 32
  • 33. Challenges • Data privacy and security – protect from other users and cloud providers – specifically target usage scenarios in the cloud with practical incentives for providers and customers • New applications: “mash up” interesting data sets – expect services pre-loaded with large data sets, stock prices, web crawls, scientific data – data sets from private or public domain – might give rise to federated cloud architectures Based on: “The Claremont Report on Database Research”, 2008 33
  • 34. Transactional Data Management – Cloud or not? • Transactional Data Management – Banking, airline reservation, e-commerce, etc… – Require ACID, write-intensive • Features – Do not typically use shared-nothing architectures (changing a bit) – Hard to maintain ACID guarantees in the face of data replication over large geographic distances – There are risks in storing transactional data on an untrusted host • Conclusion: not appropriate for the cloud Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009. 34
  • 35. Analytical Data Management – Cloud or not? • Analytical Data Management – Query data from a data store for planning, problem solving, decision support – Large scale – Read-mostly • Features – Shared-nothing architecture is a good match for analytical data management (Teradata, Greenplum, Vertica…) – ACID guarantees typically not needed – Particularly sensitive data left out of analysis • Conclusion: appropriate for the cloud Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009. 35
  • 36. Cloud DBMS Wish List • Efficiency • Fault tolerance (query restart not required, commodity hw) • Heterogeneous environment (performance of compute nodes not consistent) • Operate on encrypted data • Interface with business intelligence products Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009. 36
  • 37. Option 1: MapReduce-like software • Fault tolerance (yes, commodity hw) • Heterogeneous environment (yes, by design) • Operate on encrypted data (no) • Interface with business intelligence products (no, not SQL- compliant, no standard) • Efficiency (up for debate) – Questionable results in the MapReduce paper – Absence of a loading phase (no indices, materialized views) Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009. 37
  • 38. Option 2: Shared-Nothing Parallel Database • Interface with business intelligence products (yes, by design) • Efficiency (yes) • Fault tolerance (no - query restart required) • Heterogeneous environment (no) • Operate on encrypted data (no) Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009. 38
  • 39. Option 3: A Hybrid Solution • HadoopDB (http://db.cs.yale.edu/hadoopdb/hadoopdb.html) – A hybrid of DBMS and MapReduce technologies that targets analytical workloads – Designed to run on a shared-nothing cluster of commodity machines, or in the cloud – An attempt to fill the gap in the market for a free and open source parallel DBMS – Much more scalable than currently available parallel database systems and DBMS/MapReduce hybrid systems. – As scalable as Hadoop, while achieving superior performance on structured data analysis workload • Commercialized as Hadapt (hadapt.com) 39
  • 40. Why NoSQL? • Value of relational databases – Persistent data – Concurrency/transactions – Integration – (Mostly) Standard Model • Impedance Mismatch • Application vs. Integration databases Based on: “NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence”, 2013 40
  • 41. Attack of the Clusters • Data growth (links, social networks, logs, users) • Need to scale to accommodate growth • Traditional RDBMS (Oracle / Microsoft SQL Server) – shared disk – don’t scale well • “technical issues are exacerbated by licensing costs” – Google, Amazon influential • “The interesting thing about Cloud Computing is that we’ve redefined Cloud Computing to include everything that we already do… I don’t’ understand what we would do differently in the light of Cloud Computing other than change the wording of some of our ads.” – Larry Ellison Based on: “NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence”, 2013 41
  • 42. Emergence of NoSQL • No strong definition, but… – Do not use SQL – Typically open-source – Typically oriented towards clusters (but not all) – Operate without a schema • Various types (in order of complexity) – Key-value stores – Document Stores – Extensible Record Stores Based on: “NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence”, 2013 42
  • 43. What were we talking about? • Cloud Computing – Utility Computing – Virtualization – Economics (pay as you go) • Data management in the cloud – Cloud characteristics (elasticity if parallelizable, untrusted host, large distances) – Transactional vs. Analytical – Wish List – Map Reduce vs. Shared-Nothing -> Hybrid • DB vs. NoSQL in two lines… – Database: complex / concurrent – NoSQL: simple / scalable 43
  • 44. References • M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, M. Zaharia: Above the Clouds: A Berkeley View of Cloud Computing. Tech. Rep. No. UCB/EECS-2009-28, 2009. • D. J. Abadi: Data Management in the Cloud: Limitations and Opportunities. IEEE Data Eng. Bull. 32(1), pp. 3—12, 2009. • R. Agrawal, A. Ailamaki, P. A. Bernstein, E. A. Brewer, M. J. Carey, S. Chaudhuri, A. Doan, D. Florescu, M. J. Franklin, H. Garcia‐ Molina, J. Gehrke, L. Gruenwald, L. M. Haas, A. Y. Halevy, J. M. Hellerstein, Y. E. Ioannidis, H. F. Korth, D. Kossmann, S. Madden, R. Magoulas, B. Chin Ooi, T. O’Reilly, R. Ramakrishnan, S. Sarawagi, M. Stonebraker, A. S. Szalay, G. Weikum: The Claremont Report on Database Research. 2008. • P. Sadalage, M. Fowler. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. 2013 44