This document discusses the key aspects of cloud computing. It begins by outlining the massive scale of today's clouds, with companies like Facebook, Microsoft, and Amazon operating clouds with tens or hundreds of thousands of servers. It then discusses the main characteristics of cloud computing, including on-demand access in a pay-as-you-go model, the data-intensive nature of workloads involving terabytes and petabytes of data, and new programming paradigms like MapReduce. The document also covers the differences between public, private, and academic clouds, and factors to consider in choosing between outsourcing to a public cloud or operating your own private cloud.
Cloud computing is a type of Internet-based computing that provides shared computer processing resources and data to computers and other devices on demand. It is a model for enabling ubiquitous, on-demand access to a shared pool of configurable computing resources (e.g., computer networks, servers, storage, applications, and services).
3. The Hype!
• Forrester in 2010 – Cloud computing will go from $40.7 billion in 2010 to $241 billion in 2020.
• Goldman Sachs says cloud computing will grow at an annual rate of 30% from 2013-2018.
• Hadoop market to reach $20.8 B by 2018: Transparency Market Research
• Companies and even Federal/state governments using cloud computing now: fbo.gov
4. Many Cloud Providers
• AWS: Amazon Web Services
– EC2: Elastic Compute Cloud
– S3: Simple Storage Service
– EBS: Elastic Block Storage
• Microsoft Azure
• Google Cloud/Compute Engine/AppEngine
• Rightscale, Salesforce, EMC, Gigaspaces, 10gen, Datastax, Oracle, VMWare, Yahoo, Cloudera
• And many many more!
5. Two Categories of Clouds
• Can be either a (i) public cloud, or (ii) private cloud
• Private clouds are accessible only to company employees
• Public clouds provide service to any paying customer:
– Amazon S3 (Simple Storage Service): store arbitrary datasets, pay per GB-month stored
• As of 2019: 0.4 c to 3 c per GB-month
– Amazon EC2 (Elastic Compute Cloud): upload and run arbitrary OS images, pay per CPU hour used
• As of 2019: 0.2 c per CPU hour to $7.2 per CPU hour (depending on strength)
– Google cloud: similar pricing as above
– Google AppEngine/Compute Engine: develop applications within their appengine framework, upload data that will be imported into their format, and run
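As a rough illustration of the pay-as-you-go pricing above, here is a back-of-the-envelope bill in Python. The workload figures (5 TB stored, 1000 CPU-hours) and the specific price points chosen within the slide's ranges are hypothetical:

```python
# Hypothetical monthly bill using price points within the slide's 2019 ranges.
gb_month_price = 0.023   # assumed: mid-range of the 0.4c-3c per GB-month figure
cpu_hour_price = 0.10    # assumed: within the 0.2c-$7.2 per CPU-hour range

storage_tb = 5           # hypothetical workload: 5 TB stored for one month
cpu_hours = 1000         # hypothetical workload: 1000 CPU-hours of compute

bill = storage_tb * 1000 * gb_month_price + cpu_hours * cpu_hour_price
print(f"${bill:.2f}")    # $215.00 for the month, with zero upfront commitment
```

The point of the exercise: costs scale linearly with usage, which is exactly the "renting a cab" model discussed later in the deck.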
6. Customers Save Time and $$$
• Dave Power, Associate Information Consultant at Eli Lilly and Company: “With AWS, Powers said, a new server can be up and running in three minutes (it used to take Eli Lilly seven and a half weeks to deploy a server internally) and a 64-node Linux cluster can be online in five minutes (compared with three months internally). … It's just shy of instantaneous.”
• Ingo Elfering, Vice President of Information Technology Strategy, GlaxoSmithKline: “With Online Services, we are able to reduce our IT operational costs by roughly 30% of what we’re spending”
• Jim Swartz, CIO, Sybase: “At Sybase, a private cloud of virtual servers inside its datacenter has saved nearly $US2 million annually since 2006, Swartz says, because the company can share computing power and storage resources across servers.”
• 100s of startups in Silicon Valley can harness large computing resources without buying their own machines.
8. What is a Cloud?
• It’s a cluster!
• It’s a supercomputer!
• It’s a datastore!
• It’s superman!
• None of the above
• All of the above
• Cloud = Lots of storage + compute cycles nearby
9. What is a Cloud?
• A single-site cloud (aka “Datacenter”) consists of
– Compute nodes (grouped into racks) (2)
– Switches, connecting the racks
– A network topology, e.g., hierarchical
– Storage (backend) nodes connected to the network (3)
– Front-end for submitting jobs and receiving client requests (1)
– (1-3: Often called “three-tier architecture”)
– Software Services
• A geographically distributed cloud consists of
– Multiple such sites
– Each site perhaps with a different structure and services
11. “A Cloudy History of Time”
• A timeline from 1940 to 2012: the first datacenters, the timesharing companies & data processing industry, PCs (not distributed!), clusters, peer-to-peer systems, grids, and finally clouds and datacenters.
12. “A Cloudy History of Time”
• First large datacenters: ENIAC, ORDVAC, ILLIAC — many used vacuum tubes and mechanical relays
• Data Processing Industry: 1968: $70 M; 1978: $3.15 Billion
• Timesharing Industry (1975):
– Market Share: Honeywell 34%, IBM 15%, Xerox 10%, CDC 10%, DEC 10%, UNIVAC 10%
– Systems: Honeywell 6000 & 635, IBM 370/168, Xerox 940 & Sigma 9, DEC PDP-10, UNIVAC 1108
• Clusters: Berkeley NOW Project, supercomputers, server farms (e.g., Oceano)
• P2P Systems (90s-00s): many millions of users, many GB per day
• Grids (1980s-2000s):
– GriPhyN (1970s-80s)
– Open Science Grid and Lambda Rail (2000s)
– Globus & other standards (1990s-2000s)
• 2012: Clouds
13. Trends: Technology
• Doubling Periods – storage: 12 mos, bandwidth: 9 mos, and (what law is this?) cpu compute capacity: 18 mos
• Then and Now
– Bandwidth
• 1985: mostly 56Kbps links nationwide
• 2015: Tbps links widespread
– Disk capacity
• Today’s PCs have TBs, far more than a 1990 supercomputer
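The doubling periods compound dramatically over time; a minimal Python sketch shows the capacity multipliers over one decade (the 18-month CPU doubling is Moore's law, answering the slide's quiz):

```python
def growth(years: float, doubling_months: float) -> float:
    # Capacity multiplier after `years`, with one doubling every `doubling_months`.
    return 2 ** (years * 12 / doubling_months)

# Over one decade, the slide's doubling periods compound very differently:
print(round(growth(10, 12)))   # storage (12-month doubling): 1024x
print(round(growth(10, 9)))    # bandwidth (9-month doubling): ~10321x
print(round(growth(10, 18)))   # CPU, i.e. Moore's law (18-month doubling): ~102x
```

This is why bandwidth and storage, not CPU, reshaped system design: the slower-doubling resource (compute) becomes relatively cheaper to waste than the data it must move.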
14. Trends: Users
• Then and Now
– Biologists:
• 1990: were running small single-molecule simulations
• Today: CERN’s Large Hadron Collider producing many PB/year
15. Prophecies
• In 1965, MIT's Fernando Corbató and the other designers of the Multics operating system envisioned a computer facility operating “like a power company or water company”.
• Plug your thin client into the computing Utility and Play your favorite Intensive Compute & Communicate Application
– Have today’s clouds brought us closer to this reality? Think about it.
16. Four Features New in Today’s Clouds
I. Massive scale.
II. On-demand access: Pay-as-you-go, no upfront commitment.
– And anyone can access it
III. Data-intensive Nature: What was MBs has now become TBs, PBs and XBs.
– Daily logs, forensics, Web data, etc.
– Humans have data numbness: Wikipedia (large) compressed is only about 10 GB!
IV. New Cloud Programming Paradigms: MapReduce/Hadoop, NoSQL/Cassandra/MongoDB and many others.
– High in accessibility and ease of programmability
– Lots of open-source
Combination of one or more of these gives rise to novel and unsolved distributed computing problems in cloud computing.
17. I. Massive Scale
• Facebook [GigaOm, 2012]
– 30K in 2009 -> 60K in 2010 -> 180K in 2012
• Microsoft [NYTimes, 2008]
– 150K machines
– Growth rate of 10K per month
– 80K total running Bing
– In 2013, Microsoft Cosmos had 110K machines (4 sites)
• Yahoo! [2009]:
– 100K
– Split into clusters of 4000
• AWS EC2 [Randy Bias, 2009]
– 40K machines
– 8 cores/machine
• eBay [2012]: 50K machines
• HP [2012]: 380K in 180 DCs
• Google [2011, Data Center Knowledge] : 900K
19. Quiz: Where is the World’s Largest Datacenter?
• (2018) China Telecom. 10.7 Million sq. ft.
• (2017) “The Citadel” Nevada. 7.2 Million sq. ft.
• (2015) In Chicago!
– 350 East Cermak, Chicago, 1.1 MILLION sq. ft.
– Shared by many different “carriers”
– Critical to Chicago Mercantile Exchange
• See:
– https://www.gigabitmagazine.com/top10/top-10-biggest-data-centres-world
– https://www.racksolutions.com/news/data-center-news/top-10-largest-data-centers-world/
20. What does a datacenter look like from inside?
• A virtual walk through a datacenter
• Reference: http://gigaom.com/cleantech/a-rare-look-inside-facebooks-oregon-data-center-photos-video/
22. Power
• Power can come from off-site (the grid) or be generated on-site.
• WUE = Annual Water Usage / IT Equipment Energy (L/kWh) – low is good
• PUE = Total facility Power / IT Equipment Power – low is good (e.g., Google ~1.1)
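PUE is just a ratio, and a tiny Python sketch with made-up numbers (a facility drawing 2.2 MW overall to run 2.0 MW of IT equipment) makes the metric concrete:

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    # Power Usage Effectiveness: total facility power over IT equipment power.
    # The gap between the two is overhead (cooling, power distribution, lighting).
    return total_facility_kw / it_equipment_kw

# Hypothetical facility: 2.2 MW total draw, 2.0 MW reaching IT equipment.
print(round(pue(2200.0, 2000.0), 2))  # 1.1, comparable to Google's reported ~1.1
```

A PUE of 1.0 would mean every watt goes to servers; real datacenters pay a cooling and distribution tax on top, which is what this metric exposes.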
23. Cooling
• Air sucked in from the top (also, bug zappers)
• Water purified, then sprayed into the air
• 15 motors per server bank
24. Extra - Fun Videos to Watch
• Microsoft GFS Datacenter Tour (Youtube)
– http://www.youtube.com/watch?v=hOxA1l1pQIw
• Timelapse of a Datacenter Construction on the Inside (Fortune 500 company)
– http://www.youtube.com/watch?v=ujO-xNvXj3g
25. II. On-demand access: *aaS Classification
• On-demand: renting a cab vs. (previously) renting a car, or buying one. E.g.:
– AWS Elastic Compute Cloud (EC2): a few cents to a few $ per CPU hour
– AWS Simple Storage Service (S3): a few cents per GB-month
• HaaS: Hardware as a Service
– You get access to barebones hardware machines and do whatever you want with them. Ex: your own cluster
– Not always a good idea because of security risks
• IaaS: Infrastructure as a Service
– You get access to flexible computing and storage infrastructure. Virtualization is one way of achieving this (cgroups, Kubernetes, Docker, VMs, …). Often said to subsume HaaS.
– Ex: Amazon Web Services (AWS: EC2 and S3), OpenStack, Eucalyptus, Rightscale, Microsoft Azure, Google Cloud.
26. II. On-demand access: *aaS Classification
• PaaS: Platform as a Service
– You get access to flexible computing and storage infrastructure, coupled with a software platform (often tightly coupled)
– Ex: Google’s AppEngine (Python, Java, Go)
• SaaS: Software as a Service
– You get access to software services, when you need them. Often said to subsume SOA (Service Oriented Architectures).
– Ex: Google docs, MS Office 365 Online
27. III. Data-intensive Computing
• Computation-Intensive Computing
– Example areas: MPI-based, High-performance computing, Grids
– Typically run on supercomputers (e.g., NCSA Blue Waters)
• Data-Intensive
– Typically store data at datacenters
– Use compute nodes nearby
– Compute nodes run computation services
• In data-intensive computing, the focus shifts from computation to the data: CPU utilization is no longer the most important resource metric; instead I/O is (disk and/or network)
28. IV. New Cloud Programming Paradigms
• Easy to write and run highly parallel programs in new cloud programming paradigms:
– Google: MapReduce and Sawzall
– Amazon: Elastic MapReduce service (pay-as-you-go)
– Google (MapReduce)
• Indexing: a chain of 24 MapReduce jobs
• ~200K jobs processing 50PB/month (in 2006)
– Yahoo! (Hadoop + Pig)
• WebMap: a chain of several MapReduce jobs
• 300 TB of data, 10K cores, many tens of hours (~2008)
– Facebook (Hadoop + Hive)
• ~300TB total, adding 2TB/day (in 2008)
• 3K jobs processing 55TB/day
– Similar numbers from other companies, e.g., Yieldex, eharmony.com, etc.
– NoSQL: MySQL is an industry standard, but Cassandra is 2400 times faster!
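The MapReduce style named above can be sketched in a few lines of plain Python. This is a toy word count, not the Hadoop or Google API: a map phase emits (key, value) pairs, and a reduce phase aggregates per key (the framework's shuffle step is folded into the reducer here):

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reducer: group pairs by key and sum the counts for each word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the cloud", "the data the cloud"]
print(reduce_phase(map_phase(docs)))  # {'the': 3, 'cloud': 2, 'data': 1}
```

The appeal is exactly what the slide claims: the programmer writes two small pure functions, and the framework handles parallelism, distribution, and fault tolerance across thousands of machines.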
29. Two Categories of Clouds
• Can be either a (i) public cloud, or (ii) private cloud
• Private clouds are accessible only to company employees
• Public clouds provide service to any paying customer
• You’re starting a new service/company: should you use a public cloud
or purchase your own private cloud?
30. Single site Cloud: to Outsource or Own?
• Medium-sized organization: wishes to run a service for M months
– Service requires 128 servers (1024 cores) and 524 TB
– Same as UIUC CCT (Cloud Computing Testbed) cloud site (bought in 2009, now decommissioned)
• Outsource (e.g., via AWS): monthly cost
– S3 costs: $0.12 per GB month. EC2 costs: $0.10 per CPU hour (costs from 2009)
– Storage = $0.12 X 524 X 1000 ~ $62 K
– Total = Storage + CPUs = $62 K + $0.10 X 1024 X 24 X 30 ~ $136 K
• Own: monthly cost
– Storage ~ $349 K / M
– Total ~ $1555 K / M + 7.5 K (includes 1 sysadmin / 100 nodes)
– Using a 0.45:0.4:0.15 split for hardware:power:network and 3-year lifetime of hardware
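The slide's arithmetic can be double-checked with a few lines of Python. The $1555 K hardware figure and $7.5 K/month operations cost are taken from the slide; everything else follows from the stated 2009 prices:

```python
# Outsource-vs-own arithmetic from the slide (2009 prices).
GB = 524 * 1000                       # 524 TB of storage, expressed in GB
s3_month = 0.12 * GB                  # ~$62.9 K/month for storage on S3
ec2_month = 0.10 * 1024 * 24 * 30     # ~$73.7 K/month for 1024 cores on EC2
outsource = s3_month + ec2_month      # ~$136.6 K/month total

def own_monthly(months: int) -> float:
    # ~$1555 K of hardware amortized over the service lifetime,
    # plus ~$7.5 K/month of ongoing costs (sysadmins, etc.).
    return 1555_000 / months + 7_500

# Owning wins once its amortized monthly cost drops below outsourcing:
months = 1
while own_monthly(months) >= outsource:
    months += 1
print(months)  # 13: owning is cheaper from month 13 on, matching M > 12
```

Running this reproduces the slide's overall breakeven of about 12 months, which is why short-lived startups rent while long-running services eventually justify owning.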
31. Single site Cloud: to Outsource or Own?
• Breakeven analysis: more preferable to own if:
– $349 K / M < $62 K (storage)
– $1555 K / M + 7.5 K < $136 K (overall)
• Breakeven points:
– M > 5.55 months (storage)
– M > 12 months (overall)
• As a result:
– Startups use clouds a lot
– Cloud providers benefit monetarily most from storage
34. Public Research Clouds
• Accessible to researchers with a qualifying grant
• Chameleon Cloud: https://www.chameleoncloud.org/
– HaaS
– OpenStack (~AWS)
• CloudLab: https://www.cloudlab.us/
– Build your own cloud on their hardware
35. Summary
• Clouds build on many previous generations of distributed systems
– Especially the timesharing and data processing industry of the 1960-70s.
• Need to identify unique aspects of a problem to classify it as a new cloud computing problem
– Scale, on-demand access, data-intensive, new programming
• Otherwise, the solutions to your problem may already exist!
• Next: MapReduce!