Ceph Day Berlin: Ceph on All Flash Storage - Breaking Performance Barriers
1. Ceph on All-Flash Storage – Breaking Performance Barriers
Axel Rosenberg
Sr. Technical Marketing Manager
April 28, 2015
2. Forward-Looking Statements
During our meeting today we will make forward-looking statements.
Any statement that refers to expectations, projections or other characterizations of future events or
circumstances is a forward-looking statement, including those relating to market growth, industry
trends, future products, product performance and product capabilities. This presentation also
contains forward-looking statements attributed to third parties, which reflect their projections as of the
date of issuance.
Actual results may differ materially from those expressed in these forward-looking statements due
to a number of risks and uncertainties, including the factors detailed under the caption “Risk Factors”
and elsewhere in the documents we file from time to time with the SEC, including our annual and
quarterly reports.
We undertake no obligation to update these forward-looking statements, which speak only as
of the date hereof or as of the date of issuance by a third party, as the case may be.
3. Designed for Big Data Workloads @ PB Scale
CONTENT REPOSITORIES
• Mixed media container, active-archiving, backup, locality of data
• Large containers with application SLAs
BIG DATA ANALYTICS
• Internet of Things, sensor analytics
• Time-to-Value and Time-to-Insight
• Hadoop, NoSQL, Cassandra, MongoDB
MEDIA SERVICES
• High read-intensive access from billions of edge devices
• Hi-Def video driving even greater demand for capacity and performance
• Surveillance systems, analytics
4. InfiniFlash System
• Ultra-dense All-Flash Appliance
- 512TB in 3U
- Best-in-class $/IOPS/TB
• Scale-out software for massive capacity
- Unified content: block, object
- Flash-optimized software with programmable interfaces (SDK)
• Enterprise-class storage features
- Snapshots, replication, thin provisioning
IF500 with InfiniFlash OS (Ceph)
Ideal for large-scale object storage use cases
6. Innovating Performance @ Massive Scale
InfiniFlash OS: Ceph transformed for flash performance and contributed back to the community
• 10x improvement for block reads, 2x improvement for object reads
Major improvements to enhance parallelism (a locking sketch follows this slide)
• Removed single dispatch-queue bottlenecks in the OSD and client (librados) layers
• Sharded thread pool implementation
• Major lock reordering
• Improved lock granularity – reader/writer locks
• Granular locks at the object level
• Optimized OpTracking path in the OSD, eliminating redundant locks
Messenger performance enhancements
• Message signing
• Socket read-aheads
• Resolved severe lock contentions
Backend optimizations – XFS and flash
• Reduced CPU usage by ~2 cores with improved file-path resolution from object ID
• CPU- and lock-optimized fast path for reads
• Disabled throttling for flash
• Index Manager caching and shared FDCache in FileStore
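The lock-granularity work above replaces coarse, layer-wide mutexes with reader/writer locks held per object. A minimal C++ sketch of that pattern, assuming a simple in-memory object map; names such as ObjectStore and ObjectState are illustrative, not Ceph source:

// Illustrative sketch (not Ceph code): one brief map lock plus a per-object
// reader/writer lock, so reads of different (or the same) objects no longer
// serialize behind writers on a single coarse mutex.
#include <map>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>

struct ObjectState {
    std::shared_mutex lock;   // reader/writer lock, one per object
    std::string data;         // stand-in for on-flash object contents
};

class ObjectStore {
    std::mutex map_lock;      // protects only the map itself, held briefly
    std::map<std::string, std::shared_ptr<ObjectState>> objects;

    std::shared_ptr<ObjectState> lookup(const std::string& oid) {
        std::lock_guard<std::mutex> g(map_lock);
        auto& obj = objects[oid];
        if (!obj) obj = std::make_shared<ObjectState>();
        return obj;
    }

public:
    std::string read(const std::string& oid) {
        auto obj = lookup(oid);
        std::shared_lock<std::shared_mutex> r(obj->lock);  // many readers in parallel
        return obj->data;
    }

    void write(const std::string& oid, std::string v) {
        auto obj = lookup(oid);
        std::unique_lock<std::shared_mutex> w(obj->lock);  // exclusive per object only
        obj->data = std::move(v);
    }
};

int main() {
    ObjectStore store;
    store.write("rbd_data.1", "hello");
    return store.read("rbd_data.1").size() == 5 ? 0 : 1;
}

With this layout, only writers to the same object take an exclusive lock; everything else proceeds concurrently.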
17. Open Source with SanDisk Advantage
InfiniFlash OS – Enterprise-Level Hardened Ceph
• Innovation and speed of open source with the trustworthiness of enterprise-grade, web-scale testing and hardware optimization
• Performance optimization for flash and hardware tuning
• Hardened and tested for hyperscale deployments and workloads
• Enterprise-class support and services from SanDisk
• Risk mitigation through long-term support and a reliable long-term roadmap
• Continual contribution back to the community
Enterprise-level hardening, testing at hyperscale, and failure testing:
• 9,000 hours of cumulative IO tests
• 1,100+ unique test cases
• 1,000 hours of cluster-rebalancing tests
• 1,000 hours of IO on iSCSI
• Over 100 server-node clusters
• Over 4PB of flash storage
• 2,000-cycle node reboots
• 1,000 abrupt node power cycles
• 1,000 storage failures
• 1,000 network failures
• IO for 250 hours at a stretch
18. IFOS on InfiniFlash
[Diagram: a compute farm of client applications, each consuming LUNs through SCSI targets and RBDs / RGW, connected over SAS to a storage farm of InfiniFlash HSEB A/B enclosures hosting the OSDs; read and write IO flows between the two tiers.]
Disaggregated architecture
• Compute and storage disaggregation leads to optimal resource utilization
• Independent scaling of compute and storage
Optimized for performance
• Software and hardware configurations tuned for performance
Reduced costs
• Reduce the replica count with the higher reliability of flash
• Choice of full replicas or an erasure-coded storage pool on flash (see the worked example below)
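As a rough capacity illustration (the erasure-coding profile here is an assumed example, not from the slide): storing 96PB of usable data with three full replicas consumes 96 x 3 = 288PB of raw flash, while an erasure-coded pool with k=8 data chunks and m=3 coding chunks consumes 96 x 11/8 = 132PB, roughly 1.4x overhead instead of 3x, while still tolerating three concurrent failures. That arithmetic is the lever behind reducing replica counts or using erasure coding on flash.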
19. Flash + HDD with Data Tiering
Flash performance with the TCO of HDD
• InfiniFlash OS performs automatic data placement and movement between tiers, transparent to applications
• User-defined policies govern data placement on the tiers
• Can be combined with erasure coding to further reduce TCO
Benefits
• Flash-based performance with HDD-like TCO
• Lower performance requirements on the HDD tier enable denser and cheaper SMR drives
• Denser and lower power than an HDD-only solution
[Diagram: compute farm served by InfiniFlash for high-activity data and by SMR-drive servers (60+ HDD per server) for low-activity data.]
20. Flash Primary + HDD Replicas
Flash performance with the TCO of HDD
[Diagram: compute farm with the primary replica on InfiniFlash, an HDD-based data node holding the 2nd local replica, and another HDD-based data node holding the 3rd DR replica.]
• Higher affinity of the primary replica ensures that much of the compute runs against InfiniFlash data
• The 2nd and 3rd replicas on HDDs are primarily for data protection
• The high throughput of InfiniFlash handles data protection and movement for all replicas without impacting application IO
• Eliminates the cascading data-propagation requirement for HDD replicas
• Flash-accelerated object performance on replica 1 allows denser and cheaper SMR HDDs for replicas 2 and 3
21. TCO Example - Object Storage
Scale-out flash benefits at the TCO of HDD
Note that operational/maintenance costs and performance benefits are not accounted for in these models.
@Scale operational costs demand flash
• Weekly failure rate for a 100PB deployment: 15-35 HDDs vs. 1 InfiniFlash card
• HDDs cannot handle simultaneous egress/ingress
• Long rebuild times, multiple failures
• Rebalancing petabytes of data results in service disruption
• Flash provides guaranteed and consistent SLAs
• Flash capacity utilization is far higher than HDD's due to reliability and operational factors
[Chart: 3-year TCO comparison (TCA plus 3-year opex) and total rack counts for 96PB of object storage, comparing a traditional object store on HDD, an InfiniFlash object store with 3 full replicas on flash, InfiniFlash with erasure coding (all flash), and InfiniFlash with flash primary and HDD copies.]
22. InfiniFlash System
All-flash 3U storage system
• Capacity: 512TB* raw
• 64 x 8TB flash cards with Pfail (power-fail) protection
• 8 SAS ports total
Scalable performance**
• 780K IOPS
• 7GB/s throughput
• Upgrade to 12GB/s in Q3 2015
Flash card performance**
• Read throughput > 400MB/s
• Read IOPS > 20K
• Random read/write @ 4K, 90/10 mix: > 15K IOPS
Flash card integration
• Alerts and monitoring
• Latching integrated and monitored
• Integrated air-temperature sampling
Operational efficiency and resiliency
• Hot-swappable components, easy FRU
• Low power: 450W (average), 750W (active)
• MTBF 1.5+ million hours
* 1TB = 1,000,000,000,000 bytes. Actual user capacity is less.
** Based on internal testing of InfiniFlash 100. Test report available.
25. Messenger layer
• Removed the Dispatcher and introduced a "fast path" mechanism for read/write requests (sketched below)
- The same mechanism is now present on the client side (librados) as well
• Fine-grained locking in the message-transmit path
• Introduced an efficient buffering mechanism for improved throughput
• Configuration options to disable message signing, CRC checks, etc.
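A minimal sketch of the fast-path idea described above: instead of pushing every incoming message onto a shared dispatch queue guarded by one lock, the reader thread invokes the handler directly for request types that are safe to process inline. Names and structure are illustrative only, not the actual Ceph Messenger code:

// Illustrative only: contrast a queued dispatch path with a "fast dispatch"
// path that skips the shared queue (and its lock) for read/write requests.
#include <condition_variable>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>

struct Message { int type; /* payload omitted */ };

class Connection {
public:
    // Slow path: every message funnels through one queue + one mutex,
    // which becomes the bottleneck at flash-level IOPS.
    // (The worker loop that drains dispatch_queue is omitted for brevity.)
    void queue_dispatch(std::shared_ptr<Message> m) {
        {
            std::lock_guard<std::mutex> g(queue_lock);
            dispatch_queue.push_back(std::move(m));
        }
        queue_cv.notify_one();
    }

    // Fast path: the reader thread calls the handler directly for
    // latency-sensitive op types, bypassing the shared queue entirely.
    void fast_dispatch(std::shared_ptr<Message> m) {
        handler(std::move(m));   // runs in the reader thread's context
    }

    void receive(std::shared_ptr<Message> m) {
        if (can_fast_dispatch(m->type))
            fast_dispatch(std::move(m));
        else
            queue_dispatch(std::move(m));
    }

    std::function<void(std::shared_ptr<Message>)> handler;

private:
    static bool can_fast_dispatch(int type) {
        return type == /*read*/ 1 || type == /*write*/ 2;
    }

    std::mutex queue_lock;
    std::condition_variable queue_cv;
    std::deque<std::shared_ptr<Message>> dispatch_queue;
};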
26. OSD Request Processing
• Running with the Memstore backend revealed a bottleneck in the OSD thread-pool code
• The OSD worker thread-pool mutex was heavily contended
• Implemented a sharded worker thread pool; requests are sharded by their PG (placement group) identifier (see the sketch below)
• Configuration options set the number of shards and the number of worker threads per shard
• Optimized the OpTracking path (sharded queue, removed redundant locks)
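A small sketch of the PG-sharded worker pool idea, assuming a generic work queue (this is not Ceph's ShardedThreadPool itself): each shard owns its own queue and lock, and a request's placement-group id picks the shard, so unrelated PGs never contend on one mutex.

#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class ShardedWorkQueue {
    struct Shard {
        std::mutex lock;
        std::condition_variable cv;
        std::deque<std::function<void()>> q;
    };
    std::vector<Shard> shards;
    std::vector<std::thread> workers;
    std::atomic<bool> stopping{false};

public:
    // num_shards / threads_per_shard mirror the configurable knobs mentioned above.
    ShardedWorkQueue(size_t num_shards, size_t threads_per_shard)
        : shards(num_shards) {
        for (size_t s = 0; s < num_shards; ++s)
            for (size_t t = 0; t < threads_per_shard; ++t)
                workers.emplace_back([this, s] { run(s); });
    }

    void enqueue(uint64_t pg_id, std::function<void()> op) {
        Shard& sh = shards[pg_id % shards.size()];   // PG id selects the shard
        {
            std::lock_guard<std::mutex> g(sh.lock);
            sh.q.push_back(std::move(op));
        }
        sh.cv.notify_one();
    }

    ~ShardedWorkQueue() {
        stopping = true;
        for (auto& sh : shards) {
            { std::lock_guard<std::mutex> g(sh.lock); }  // pair with waiters' predicate check
            sh.cv.notify_all();
        }
        for (auto& w : workers) w.join();
    }

private:
    void run(size_t s) {
        Shard& sh = shards[s];
        std::unique_lock<std::mutex> lk(sh.lock);
        while (true) {
            sh.cv.wait(lk, [&] { return stopping || !sh.q.empty(); });
            if (sh.q.empty()) return;                // stopping and drained
            auto op = std::move(sh.q.front());
            sh.q.pop_front();
            lk.unlock();
            op();                                    // execute outside the shard lock
            lk.lock();
        }
    }
};

int main() {
    ShardedWorkQueue wq(/*num_shards=*/8, /*threads_per_shard=*/2);
    for (uint64_t pg = 0; pg < 64; ++pg)
        wq.enqueue(pg, [] { /* process one op for this PG */ });
}   // destructor drains remaining ops and joins the workers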
27. FileStore improvements
• Took backend storage out of the picture by using a small workload (FileStore served data from the page cache)
• Severe lock contention in the LRU FD (file descriptor) cache; implemented a sharded version of the LRU cache (see the sketch below)
• A CollectionIndex (per-PG) object was being created on every IO request; implemented a cache for it, since PG info doesn't change often
• Optimized the "object name to XFS file name" mapping function
• Removed redundant snapshot-related checks in the parent-read processing path
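A minimal sketch of the sharded LRU idea (not the actual FileStore FDCache): an LRU cache of open file descriptors is split into shards, each with its own lock, so lookups for different objects stop contending on a single LRU mutex.

#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

class ShardedFDCache {
    struct Shard {
        std::mutex lock;
        std::list<std::pair<std::string, int>> lru;   // front = most recently used
        std::unordered_map<std::string,
                           std::list<std::pair<std::string, int>>::iterator> index;
    };
    std::vector<Shard> shards;
    size_t per_shard_capacity;

public:
    ShardedFDCache(size_t num_shards, size_t per_shard_capacity)
        : shards(num_shards), per_shard_capacity(per_shard_capacity) {}

    // Returns a cached fd for the object, or opens one via open_fn and caches it.
    int get(const std::string& oid,
            const std::function<int(const std::string&)>& open_fn) {
        Shard& sh = shards[std::hash<std::string>{}(oid) % shards.size()];
        std::lock_guard<std::mutex> g(sh.lock);       // only this shard is locked

        auto it = sh.index.find(oid);
        if (it != sh.index.end()) {                   // hit: move entry to the front
            sh.lru.splice(sh.lru.begin(), sh.lru, it->second);
            return it->second->second;
        }
        if (sh.lru.size() >= per_shard_capacity) {    // evict the least-recently used
            // A real cache would close() the evicted fd here.
            sh.index.erase(sh.lru.back().first);
            sh.lru.pop_back();
        }
        int fd = open_fn(oid);                        // miss: open the backing file
        sh.lru.emplace_front(oid, fd);
        sh.index[oid] = sh.lru.begin();
        return fd;
    }
};

int main() {
    ShardedFDCache cache(/*num_shards=*/16, /*per_shard_capacity=*/128);
    int fd = cache.get("obj.0000", [](const std::string&) { return 42; /* pretend open() */ });
    return fd == 42 ? 0 : 1;
}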
28. Inconsistent Performance Observation
• Large performance variations on different pools across multiple clients
• The first client to run after a cluster restart gets maximum performance, irrespective of the pool
• Clients starting later see continued degraded performance
• The issue was also observed on read I/O against unpopulated RBD images, ruling out filesystem issues
• Performance counters show up to a 3x increase in latency through the I/O path, with no particular bottleneck
29. Issue with TCmalloc
perf top shows a rapid increase in time spent in TCmalloc functions:
  14.75%  libtcmalloc.so.4.1.2  [.] tcmalloc::CentralFreeList::FetchFromSpans()
   7.46%  libtcmalloc.so.4.1.2  [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
• I/O from a different client causes new threads in the sharded thread pool to process I/O
• This causes memory movement between thread caches and increases alloc/free latency
• JEmalloc and glibc malloc do not exhibit this behavior
• A JEmalloc build option was added to Ceph Hammer
• Setting the TCmalloc tunable TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to a larger value (64M) alleviates the issue (see the note below)
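Because tcmalloc reads this variable when the allocator initializes, it has to be present in the daemon's environment at startup (for example via the service's init or unit file). A hypothetical launcher sketch that does the same thing; the wrapper itself is illustrative, only the environment-variable name comes from the slide:

// Launch a target program (e.g., an OSD daemon) with the TCmalloc total
// thread-cache limit raised to 64 MB, the value cited on the slide.
#include <cstdlib>
#include <unistd.h>

int main(int argc, char** argv) {
    // Must be in the environment before tcmalloc initializes in the child.
    setenv("TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES", "67108864", /*overwrite=*/1);
    if (argc < 2) return 1;          // usage: wrapper <program> [args...]
    execvp(argv[1], &argv[1]);       // replace ourselves with the target program
    return 2;                        // only reached if execvp failed
}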
30. Client Optimizations
• Ceph, by default, turns Nagle's algorithm off
• The RBD kernel driver ignored the TCP_NODELAY setting
• This caused large latency variations at lower queue depths
• Changes to the RBD driver were submitted upstream (see the sketch below)
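An illustrative userspace sketch of what honoring TCP_NODELAY means: disable Nagle's algorithm on a connected TCP socket so small writes (for example 4K IOs at low queue depth) are sent immediately instead of being held back for coalescing. This shows the socket option itself, not the kernel RBD driver change.

#include <cstdio>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int one = 1;  // non-zero disables Nagle's algorithm on this socket
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) != 0) {
        perror("setsockopt(TCP_NODELAY)");
        close(fd);
        return 1;
    }

    // ... connect() and issue small writes; with Nagle off they are not delayed
    // waiting for ACKs of earlier segments, which is what produced the latency
    // spikes described above for clients ignoring TCP_NODELAY.
    close(fd);
    return 0;
}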