This document introduces cloud storage and cloud computing models, then discusses how big data's volume, variety, and velocity require distributed systems that scale out across many commodity machines. It surveys well-known cloud products and their underlying technologies, and closes with an overview of the company's enterprise big-data offerings built on cloud technologies, including its data store, object storage, MapReduce, and compute cloud services.
David Loureiro - Presentation at HP's HPC & OSL TES (SysFera)
David Loureiro, SysFera CEO, talks about "Managing large-scale, heterogeneous infrastructures: from DIET to SysFera-DS" at HP's High Performance Computing and Open Source & Linux Technical Excellence Symposium, held 19-23 March 2012 in Grenoble, France.
We present a software model built on the Apache Big Data Stack (ABDS) that is widely used in modern cloud computing, and we enhance it with HPC concepts to derive HPC-ABDS.
We discuss the layers in this stack.
We give examples of integrating ABDS with HPC.
We discuss how to implement this in a world of multiple infrastructures and evolving software environments for users, developers, and administrators.
We present Cloudmesh as supporting Software-Defined Distributed System as a Service (SDDSaaS) with multiple services on multiple clouds/HPC systems.
We explain the functionality of Cloudmesh as well as the three administrator and three user modes it supports.
Bringing Structure, Scalability, and Services to Cloud-Scale Storage (MapR Technologies)
Deploying storage with a forklift is so 1990s, right? Today's applications and infrastructure demand systems and services that scale. Customers require performance and capacity that fit the use case and workloads, not the other way around. Architects need multi-temperature, multi-location, highly available, and compliance-friendly platforms that grow with the generational shift in data growth and utility.
A comparison of the big data processing platforms RDBMS, Hadoop, and Spark: the pros and cons of each platform are discussed, and business use cases are included.
High Performance Computing and Big Data (Geoffrey Fox)
We propose a hybrid software stack with large-scale data systems for both research and commercial applications, running on the commodity (Apache) Big Data Stack (ABDS) with High Performance Computing (HPC) enhancements, typically to improve performance. We give several examples taken from bio- and financial informatics.
We look in detail at parallel and distributed run-times including MPI from HPC and Apache Storm, Heron, Spark and Flink from ABDS stressing that one needs to distinguish the different needs of parallel (tightly coupled) and distributed (loosely coupled) systems.
We also study "Java Grande", i.e., the principles that allow Java codes to perform as fast as codes written in more traditional HPC languages. We also note the differences between capability (individual jobs using many nodes) and capacity (lots of independent jobs) computing.
We discuss how this HPC-ABDS concept allows one to discuss convergence of Big Data, Big Simulation, Cloud and HPC Systems. See http://hpc-abds.org/kaleidoscope/
1. Big Data Analytics
- Big Data
- Spark: Big Data Analytics
- Resilient Distributed Datasets (RDD)
- Spark libraries (SQL, DataFrames, MLlib for machine learning, GraphX, and Streaming)
- PFP: Parallel FP-Growth
2. Ubiquitous Computing
- Edge Computing
- Cloudlet
- Fog computing
- Internet of Things (IoT)
- Virtualization
- Virtual Conferencing
- Virtual Events (2D, 3D, and Hybrid)
Cloud architecture and deployment: The Kognitio checklist, Nigel Sanctuary, K... (CloudOps Summit)
CloudOps Summit 2012, Frankfurt, 20.9.2012 Track 2 - Build and Run
by Nigel Sanctuary, VP Propositions at Kognitio (www.kognitio.com)
http://cloudops.de/sprecher/#nigelsanctuary
Find the video of this talk at http://youtu.be/wQrHQNOMlKc
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments (MapR Technologies)
SAP HANA is an increasingly popular platform for various analytical and transactional use cases with its in-memory architecture. If you’re an SAP customer you’ve experienced the benefits.
However, the underlying storage for SAP HANA is painfully expensive. This slows down your ability to grow your SAP HANA footprint and serve up more applications.
Danny Quilton from Capacitas presented a paper, ‘Capacity Management and the Cloud’. The presentation made the case for capacity management of cloud-based services, highlighting the critical role of capacity management in controlling cloud cost. The presentation referenced a number of client engagement case studies to debunk some of the myths surrounding cloud:
Capacity can be turned up instantaneously
Capacity planning discipline is no longer required
Cloud capacity is cheap
Bottlenecks can be alleviated by expanding cloud capacity
Capacity management can be delegated to the cloud provider
Performance is guaranteed by the cloud provider
HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs (Schubert Zhang)
HFile mimics Google's SSTable and is now available in Hadoop HBase 0.20.0. Previous releases of HBase temporarily used an alternative file format, MapFile, a common file format in the Hadoop IO package. I think HFile should also become a common file format once it matures, and should be moved into Hadoop's common IO package in the future.
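The core idea of a block-indexed format like HFile can be shown in a few lines: records are kept sorted and grouped into blocks, and a small index records the first key of each block, so a lookup is a binary search over the index followed by a scan of a single block. The sketch below is our own in-memory illustration (names, block size, and structure are ours, not HBase's actual on-disk layout):

```python
# Illustrative sketch (not the real HFile code): a block-indexed store of
# sorted key-value pairs. The index holds the first key of each block, so
# a lookup touches the index plus exactly one block.
import bisect

BLOCK_SIZE = 3  # records per block; real HFile blocks are ~64 KB of bytes

def build(sorted_pairs):
    """Split sorted (key, value) pairs into blocks and build a block index."""
    blocks = [sorted_pairs[i:i + BLOCK_SIZE]
              for i in range(0, len(sorted_pairs), BLOCK_SIZE)]
    index = [block[0][0] for block in blocks]  # first key of each block
    return blocks, index

def get(blocks, index, key):
    """Locate the only block that can contain `key`, then scan it."""
    pos = bisect.bisect_right(index, key) - 1
    if pos < 0:
        return None  # key sorts before the first block's first key
    for k, v in blocks[pos]:
        if k == key:
            return v
    return None

pairs = sorted({"apple": 1, "cherry": 3, "durian": 4, "fig": 5,
                "grape": 6, "kiwi": 7, "mango": 8}.items())
blocks, index = build(pairs)
print(get(blocks, index, "fig"))     # 5
print(get(blocks, index, "banana"))  # None
```

Because the index stays small relative to the data, it can be held in memory while blocks stay on disk, which is what makes the format attractive for HBase-style random reads.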
Cassandra Compression and Performance Evaluation (Schubert Zhang)
Even though we have abandoned Cassandra in all our products, we would like to share our work here.
Why did we abandon Cassandra in our products? Because:
(1) There is a major flaw in Cassandra's implementation, especially in its local storage engine layer, i.e., SSTable and indexing.
(2) Combining Bigtable and Dynamo was a mistake. Dynamo's hash-ring architecture is an obsolete technology for scaling, and its consistency and replication policies are also unusable for big data storage.
Big data represents a real challenge: technical, business, and societal. Exploiting massive data opens up possibilities for radical transformation of both enterprises and usage patterns, at least provided we are technically capable of it, because the acquisition, storage, and exploitation of massive quantities of data pose real technical challenges.
A big data architecture enables the creation and administration of all the technical systems that allow the data to be properly exploited.
A great many different tools exist for manipulating massive quantities of data, for storage, analysis, or distribution, for example. But how do you assemble these different tools into an architecture that scales, tolerates failures, and is easily extensible, all without exploding costs?
The success of big data depends on its architecture, on the right infrastructure, and on the use made of it: "Data into Information into Value".
A big data architecture is composed of four major parts: Integration, Data Processing & Storage, Security, and Operations.
We are in the midst of a computing revolution. As the cost of provisioning hardware and software stacks grows, and the cost of securing and administering these complex systems grows even faster, we're seeing a shift towards computing clouds. For cloud service providers, there is efficiency from amortizing costs and averaging usage peaks. Internet portals like Yahoo! have long offered application services, such as email for individuals and organizations. Companies are now offering services such as storage and compute cycles, enabling higher-level services to be built on top. In this talk, I will discuss Yahoo!'s vision of cloud computing, and describe some of the key initiatives, highlighting the technical challenges involved in designing hosted, multi-tenanted data management systems.
2. Who am I
• Schubert Zhang (张松波)
• Chief Architect and Director of Big Data Engineering and Cloud
• Researching cloud technologies and developing cloud projects and products since 2007
• Led the core development team of CMCC "Big Cloud" @Hanborq
• 10 years of telecom product development and technical management @UTStarcom
3. Agenda
• Introduction of Cloud Storage and Computing
• Big Data and Cloud
• Our Big-Data/Cloud Products and Solutions
• Anything for Discussion …
5. A Popular Definition of Cloud …
• Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
• Cloud storage is a model of networked online storage where data is stored on multiple servers. Hosting companies operate large data centers, which provide resources according to customer requirements and expose them as storage pools that customers can use to store files or data objects. Physically, a resource may span multiple servers and/or data centers.
• It promotes availability and is composed of five essential characteristics, three service models, and four deployment models.
6. A Popular Definition of Cloud …
• Deployment Models: Private Cloud, Community Cloud, Public Cloud, Hybrid Clouds
• Service Models: Software as a Service (SaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS)
• Essential Characteristics: On-Demand Self-Service, Broad Network Access, Rapid Elasticity, Resource Pooling, Measured Service
• Common Characteristics: Massive Scale, Elastic Computing, Homogeneity, Geographic Distribution, Virtualization, Service Orientation, Low-Cost Software, Advanced Security
7. Examples of Famous Cloud Products
• Google (Techs: GFS2/Bigtable/MapReduce/Megastore/Spanner/Pregel/Dremel …)
– Google AppEngine (Storage for Database, etc.)
– Google Storage (Storage for Objects)
• Amazon AWS (Techs: Web-Service-Protocol/Bitstore/Keymap/Dynamo …)
– Simple Storage Service – S3 (Storage for Objects)
– Cloud Drive (Online Storage for Individuals)
– SimpleDB (Storage for Database)
– Elastic Compute Cloud – EC2 (Compute)
• Rackspace (Techs: OpenStack …)
– Cloud Servers (Compute)
– Cloud Files (Storage for Objects)
• Facebook (Techs: Hive/Scribe/Haystack/Hadoop …)
– Messages
– Photo Storage
• Cloudera
– Hadoop …
8. We Focus on the Technologies Behind the Cloud
Storage:
• High Scalability
– Shared-Nothing
– Object-Oriented
– NoSQL
– …
• High Availability
– Failure Detection
– Server Clustering
– Replication
– Eventual Consistency
– …
• Big Data
– PB-level storage
– Structured or non-structured data
– Information Retrieval
– Indexing
– Automatic re-sharding/re-partitioning
– Automatic load balancing
– …
• High Throughput / Low Latency
– Optimized IO and data write/read models
Computing:
• High Scalability
• Parallel Computing Frameworks
– MR: MapReduce
– BSP: Bulk Synchronous Parallel
• Job/Task Scheduling
• Failure Rework
• PDM: Parallel Data Analysis/Mining Algorithms
– Simple Statistics/Analysis
– Classification/Clustering …
– For Recommendation and Advertising
– …
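Of the two parallel computing frameworks named above, BSP (Bulk Synchronous Parallel) is the less familiar: work proceeds in supersteps, each consisting of local computation, message exchange, and a global barrier. A toy single-process simulation can make the shape of the model visible (the worker count, names, and structure here are our illustration, not any particular BSP framework's API):

```python
# Illustrative sketch of the BSP model: superstep = local compute, then
# message exchange, then a barrier before the next superstep. Four
# simulated "workers" cooperatively sum a list.
def bsp_sum(values, workers=4):
    chunk = (len(values) + workers - 1) // workers
    partitions = [values[i * chunk:(i + 1) * chunk] for i in range(workers)]

    # Superstep 1: local computation -- each worker sums its own partition.
    partials = [sum(part) for part in partitions]

    # Communication phase: every worker sends its partial sum to worker 0.
    inbox = {0: partials}

    # Barrier: all messages are delivered before the next superstep starts.
    # Superstep 2: worker 0 combines the received partial sums.
    return sum(inbox[0])

print(bsp_sum(list(range(10))))  # 45
```

In a real BSP engine (Pregel-style systems, for example) the workers run on separate machines and the barrier is a cluster-wide synchronization point, but the compute/communicate/barrier rhythm is the same.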
10. Big Data
• Immutable Laws of Big Data
– Volume
– Variety
– Velocity
• Needs …
– Distributed systems
• Many, many commodity machines
– Scale-Out vs. Scale-Up
• Scale-out: automatic vs. manual
11. Big Data, Big Business
[Figure: acquisition and funding amounts ($2.25B, $400M, $1.7B, $250M, $263M, $2.35B, >>$30.5M VC) for vendors of storage products/solutions (NAS, limited scale-out) and data warehouses (MPP)]
12. The Next Decade in Data Management
A stable system capable of supporting a variety of apps is necessary.
Innovations in databases are a requirement.
New data stores are necessary.
Differentiation between programs will continue until key innovations in data management platforms become uniform.
17. Products and Features
Cloud API
Cloud Services: DataStore Cloud | ObjectStorage Cloud | MapReduce Cloud | Compute Cloud
CloudOS Stack: SandStor | PebStor | MapReduce Cloud | vCompute
Hardware & OS
• CloudOS: Distributed cloud platform on commodity hardware with cluster management; High Scalability; High Reliability (Data Replication); High Availability; Load Balancing; Global Data Access; Global File System
• SandStor: Distributed structured data management; common features of CloudOS; high-efficiency indexing; Strong Consistency; Multi-level Cache; High Throughput; Compression; fast random access with low latency; Flexible Schema; simplifies application complexity; High Durability, no data loss
• PebStor: Distributed blob data management; common features of CloudOS; efficient indexes and metadata management; efficient storage space management; de-duplication; unlimited blob size
• MapReduce Cloud: Flexible parallel data processing framework; common features of CloudOS; highly parallelized; locality computing; simple programming model; abundant high-level languages and toolkits; seamlessly integrated with the storage system
• vCompute: Virtual machines and computing resources management; multi-VM support; elastic VMs; large-scale provisioning; auto-scale
July 3, 2012
18. Cloud Service Platform
Cloud Services and comparable products/services:
• ObjectStorage Cloud Service: Amazon S3, Google Storage for Developers, Rackspace Files/OpenStack Swift, Google BlobStore
• DataStore Cloud Service: Amazon SimpleDB, Google DataStore
• MapReduce Cloud Service: Amazon MapReduce, Hadoop
• Video Media Cloud Service: video delivery/streaming/transcoding/time-shifting/analytics
• Multi-Level Cloud Services: Infrastructure, Platform, Applications
Cloud Services API:
– Web-based, accessible everywhere
– RESTful style, simple and easy to use
– SDKs provided for multiple languages
– APIs conform to industry standards and conventions
Characteristics of Cloud Services:
– Users need not care about the implementation
– Accessible everywhere
– High data reliability
– Strong scalability
– High availability (99.9%)
– Pay for actual usage
– Simple and easy to use
– Rich management and monitoring tools
– Strict yet flexible security policies
– AAA services integrating multiple cloud services
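To make the "RESTful style, simple and easy to use" claim concrete, here is a sketch of how a client might form an S3-style object upload request. The host name, bucket, and header set are hypothetical, patterned on typical object-storage services; nothing is actually sent over the network, we only assemble the request so its shape is visible:

```python
# Assemble (but do not send) an HTTP PUT for an object-storage service.
# Endpoint and names are illustrative, not any real service's API.
from email.utils import formatdate

def build_put_request(bucket, key, body, host="objectstorage.example.com"):
    path = "/%s/%s" % (bucket, key)       # object addressed as a URL path
    headers = {
        "Host": host,
        "Date": formatdate(usegmt=True),  # HTTP-date header value
        "Content-Length": str(len(body)),
        "Content-Type": "application/octet-stream",
    }
    request_line = "PUT %s HTTP/1.1" % path
    return request_line, headers

request_line, headers = build_put_request("photos", "2012/cat.jpg", b"...")
print(request_line)               # PUT /photos/2012/cat.jpg HTTP/1.1
print(headers["Content-Length"])  # 3
```

The appeal of this style is exactly what the slide lists: every object is a URL, so any language with an HTTP client gets an SDK almost for free, and GET/PUT/DELETE map directly onto read/write/remove.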
19. Object Storage Platform: "Build another S3"
The RockStor object storage system provides object storage infrastructure services with guaranteed efficiency, robustness, and load balancing.
• Object Access Layer: provides the client library; object-oriented; high availability
• MetaStore Layer: DHT-based consistent overlay network; high scalability
• Data Chunk Store Layer: autonomous overlay network of clustered storage nodes; huge capacity
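The consistent hashing that typically underlies a DHT-based overlay network like the MetaStore layer's can be sketched briefly (this is our generic illustration, not RockStor's actual code). Nodes and keys hash onto the same ring; a key belongs to the first node at or after its hash position, so adding or removing a node only remaps the keys nearest to it:

```python
# Generic consistent-hashing ring, as used in DHT-style overlay networks.
# Virtual nodes ("replicas") smooth out the key distribution.
import bisect
import hashlib

def _h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, replicas=64):
        # Each physical node appears at `replicas` points on the ring.
        self.ring = sorted((_h("%s#%d" % (n, i)), n)
                           for n in nodes for i in range(replicas))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key):
        # First ring point clockwise from the key's hash (wrapping around).
        pos = bisect.bisect(self.points, _h(key)) % len(self.ring)
        return self.ring[pos][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("object-123") in ("node-a", "node-b", "node-c"))  # True
```

The scalability property the slide claims follows from this: membership changes touch only neighboring key ranges, so the overlay can grow or shrink without a global reshuffle.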
24. CloudNAS + MagicBox Enterprise Solution
Office/SOHO networks connect over the company LAN or WAN: a NAS proxy on premises serves files via CIFS/NFS/FTP, while the enterprise-private BigdataCloud is accessed through a Web Service RESTful API; the MagicBox service and MagicBox client run on the same infrastructure.
• CloudNAS (NAS proxy + NAS in BigdataCloud)
– File Server
– Archive Server
– Backup Server
• MagicBox (Backup/Sync/Sharing/Versioning)
– Documents Backup
– Collaboration
25. Parallel Computing Platform
Applications launch jobs with a dataset as input; the dataset is partitioned into splits according to a user-defined policy. The MapReduce JobTracker assigns the map and reduce tasks:
Data Split-1 → Map-1
Data Split-2 → Map-2 → Reduce-1 → Output-1
Data Split-3 → Map-3
Data Split-4 → Map-4 → Reduce-2 → Output-2
Data Split-5 → Map-5
Supported frameworks: MapReduce, BSP
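The split → map → shuffle → reduce flow in the figure can be simulated in a few lines of plain Python, using word count as the classic example. On the real platform the splits are distributed to JobTracker-assigned tasks on many machines; here each stage is just a loop, so only the data flow is illustrated:

```python
# Single-process simulation of the MapReduce data flow in the figure.
from collections import defaultdict

def map_phase(split):
    # One (word, 1) pair per word in the split.
    return [(word, 1) for word in split.split()]

def shuffle(mapped):
    # Group intermediate values by key, as the framework does between
    # the map and reduce phases.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)

splits = ["big data big cloud", "cloud data", "big"]   # Data Split-1..3
mapped = [kv for s in splits for kv in map_phase(s)]   # map tasks
reduced = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(reduced["big"])    # 3
print(reduced["cloud"])  # 2
```

The platform's job is everything this toy omits: scheduling map tasks near their splits (the "locality computing" feature above), moving shuffle traffic across the network, and rerunning failed tasks.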