WatsonBruce Ikadega sample 1

Introduction to DirectPath subsystems
PROPRIETARY and CONFIDENTIAL. NDA REQUIRED. 8/14/2001 10:54 AM Page 1
Copyright Ikadega, Inc. All rights reserved.
(docstechpubsinternal_docssubsystem_intro.doc)
This document contains overview information on the DirectPath™ subsystems. The
information comes from the Ikadega online documentation; the online component, and
not this document, will be the version that will be kept current. This document will be
updated from time to time from the online documents.
Information is currently available for a set of subsystems. More will follow over time.
Note: DirectPath is an evolving and changing system. This document describes the future
vision for each subsystem – how it is expected to look at some future point (such as
when the product first ships). Many parts of the design as described in this document
have not yet been implemented.
Note: Underlined terms are defined in the Ikadega glossary.
Document contents
Internet delivery subsystem
  Component life cycles
The TV delivery and MPEG platform subsystems
  How hospitality systems work
  How ad insertion works
  The jukebox model
  The interactive model (hospitality only)
The volume and file access subsystems
  File system service layers
  Typical uses of the file system
  The UNIX file system
  The file access subsystem
    Access to smaller, named files
    Block search services for the volume access subsystem
  The volume access subsystem
    Aggregation
    Striping
  The hardware layer
  Checkpoints
    A simple example
    Checkpointing states and transitions
      Checkpointing states
      State transitions
  Replication
    Replication and checkpointing
Content transfer engine subsystem
  Inside the CTE

This document describes the subsystems in the groupings shown in the Introduction to
DirectPath:
[Figure: DirectPath subsystem groupings. Group labels: Application subsystems, Core
services, Platforms. Subsystems shown: Internet delivery, TV (MPEG) delivery, Manager,
File access, Volume access, IP messaging, Traffic & array control, ITM, Content transfer
engine (CTE), Content transfer engine extension (CTEX), Controller event engine (CEE),
Open source environment (OSE), MPEG platform (MPP).]
See the introductory document for high-level descriptions of the subsystems. This
document contains more detailed descriptions.
Internet delivery subsystem
The Internet delivery subsystem drives the process of sending content to Internet users.
This picture shows its important components:
[Figure: Internet delivery subsystem, containing HTTP CTDs and XCTDs, FTP CTDs and
XCTDs, RTP CTDs and XCTDs, and so on.]
The subsystem contains a large number of content transfer daemons (CTDs) – generally a
separate one for each end-user session. (End users can have multiple sessions at the same
time.) The CTDs all run the same code, but each has its own event queue and a small
amount of private memory in external RAM.
A CTD’s primary function is data transfer. It has a limited set of commands and
functions, but this allows it to perform them very efficiently. Most of the traffic it handles
is bound for clients outside the DirectPath system. Its main task is to receive data from
the fabric, then place this data into outgoing message frames for the Internet. The daemon
is optimized for data transfer and does only minimal processing of the data.
Most CTDs have corresponding extended content transfer daemons (XCTDs). XCTDs
handle the non-routine processing that the CTDs do not do – the more complex error and
exception processing. CTDs and XCTDs communicate with each other via either ITM or
IP messaging.
The CTDs run in an FPGA on an IP access node. XCTDs run in a supplemental
processor, which is either on an IP access node or a supplemental processor node. The
content transfer engine (CTE) is the logical platform for the CTDs. For the XCTDs, the
platform is the content transfer engine extension (CTEX).
[Figure: the CTD runs in the FPGA, on the CTE platform; the XCTD runs in the supplemental
processor, on the CTEX platform.]
In some applications there is one XCTD per CTD, but in other applications an XCTD
might oversee several CTDs. For example, in a streaming video application, one XCTD
might work with the following CTDs, each of which processes a different type of
information:
[Figure: one XCTD working with an RTSP CTD, an HTML CTD, and an RTP CTD. Captions: for
selecting titles to download; for negotiating transfer parameters (speed, format, etc.);
for downloading selected titles.]
There are several categories of CTDs and XCTDs – one category for each Internet
service supported by the system (HTTP, FTP, RTP, etc.). Event engines in the content
transfer engine (CTE) alert CTDs when there are events for them to process. This is the
flow of control for a single CTD/XCTD pair:
[Figure: flow of control. An event engine in the CTE sends a dispatch signal to the CTD
in the Internet delivery subsystem. Labels between the CTD and the XCTD: exceptions and
complex tasks; commands and replies.]

This is the overall environment in which a single CTD and XCTD pair operates to handle
one user session:
[Figure: environment of a single CTD/XCTD pair. Elements shown: CTD, XCTD, Web server
daemons, OSE OS, file access session, volume access session, traffic/array control, and
the client (e.g., end user browser). The legend distinguishes components of the Internet
delivery subsystem, Ikadega-developed DirectPath objects not in the Internet delivery
subsystem, third-party components not developed by Ikadega, and external resources.]
The Web server daemons run in the DirectPath system’s open source environment (OSE).
They must have a very specific server configuration to run effectively with the rest of the
system.
The Web delivery client is an Internet application such as a Web browser or FTP
program. In certain cases, there may also be one or more external resources for the
subsystem to deal with. One example of this is a credit card validation/approval system in
an e-commerce application. Depending on how complex the processing is, the interface
to an external resource could be handled by either the CTD (if only simple processing is
needed, such as reading cookies) or the XCTD (for more complicated processing).
In the initial versions of the system, the XCTD communicates directly with the Web
server daemons. In future versions, this communication may instead go through the
OSE’s operating system.
Component life cycles
The system creates a fixed-size pool of CTDs at boot time, which it allocates one by one
for each new end-user session. The number of possible CTDs generally remains fixed. If
the system exhausts the CTD supply, it cannot create new end-user sessions until a CTD
is de-allocated. This protects the system against denial-of-service (DoS) attacks. By
limiting the number of possible sessions, the system can continue running if it receives
numerous session requests, though it may be temporarily unable to allow new sessions.
(System operators can set the size of the CTD pool in the Web-based configuration
utility.)
Since XCTDs run on a supplemental processor, which has an operating system and a
richer execution environment, there is not a fixed set of XCTDs. The system creates new
ones as needed.
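
The pool behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Ikadega code: the class name CtdPool, its methods, and the pool size of 2 are hypothetical.

```python
class CtdPool:
    """Hypothetical fixed-size pool of CTD slots, created once at boot."""

    def __init__(self, size):
        # All CTD slots exist up front; none are created later.
        self._free = list(range(size))
        self._in_use = set()

    def allocate(self):
        """Return a free CTD id, or None if the pool is exhausted."""
        if not self._free:
            return None  # the new session is refused, but the system keeps running
        ctd_id = self._free.pop()
        self._in_use.add(ctd_id)
        return ctd_id

    def release(self, ctd_id):
        """De-allocate a CTD so a later session can reuse it."""
        self._in_use.remove(ctd_id)
        self._free.append(ctd_id)


pool = CtdPool(size=2)
first = pool.allocate()
second = pool.allocate()
refused = pool.allocate()   # pool exhausted, so no CTD is returned
pool.release(first)
reused = pool.allocate()    # succeeds again after a de-allocation
```

A flood of session requests can only empty the pool; it cannot grow the number of CTDs, which is the protective property the text describes.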

This is the life cycle of a CTD/XCTD pair supporting a typical end user HTTP session
(assuming a 1:1 relationship between CTDs and XCTDs):
1. When the system receives a request for a new user session, it allocates a CTD and an
XCTD (creating a new XCTD if necessary). The system does handshaking with the
client to determine the session type (HTTP, FTP, RTP, etc.). It configures the
CTD/XCTD pair and initializes a context accordingly.
2. As described above, either the XCTD or CTD might take part in authorizing and
validating the end user session.
3. Through an event engine in the content transfer engine (CTE), the CTD receives a
client request to transfer a file. To the CTD, this is simply a command it is not
programmed to process, so it passes it to the XCTD.
4. The XCTD receives the file transfer request and attempts to validate the transfer –
checking to see if the requested file exists and if the end user is authorized to receive
it, etc. If it successfully validates the request, the XCTD generates a handle for the
requested file and passes it to the CTD. Then it tells the CTD to transfer the file.
5. The CTD begins the process of requesting data and preparing it to go out to the
Internet, using various services from other subsystems. At this stage, the XCTD only
becomes involved if there is an error or an exception, or if processing is needed that
the CTD does not know how to do.
During the file transfer, the XCTD knows what file is being downloaded but does not
know any details about the download, such as how much data has been sent so far.
The CTD knows these details but does not know what file it is processing.
6. The CTD notifies the XCTD when the file transfer is done.
7. The previous steps repeat for each subsequent file download requested by the client.
8. When the client sends a request to end the session, the CTD passes it to the XCTD
(again since the CTD is not programmed to process the request). The XCTD de-
allocates itself and the CTD.
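
The division of labor in steps 3 through 6 can be sketched as follows. This is a simplified illustration, not the actual design: the class names, the "GET" command, the frame size, and the shared store dictionary (standing in for the services of other subsystems) are all hypothetical.

```python
class Xctd:
    """Hypothetical XCTD: handles what the CTD is not programmed to do."""

    def __init__(self, files, authorized_users, store):
        self.files = files                    # file name -> content bytes
        self.authorized_users = authorized_users
        self.store = store                    # handle -> bytes (stand-in for storage)
        self.handles = {}                     # handle -> file name, known only here
        self.completed = []

    def handle_request(self, ctd, user, command, name):
        if command != "GET":
            return "unsupported"
        # Step 4: validate the transfer before involving the CTD.
        if name not in self.files or user not in self.authorized_users:
            return "denied"
        handle = len(self.handles) + 1
        self.handles[handle] = name
        self.store[handle] = self.files[name]
        ctd.transfer(handle, self)            # tell the CTD to transfer the file
        return "ok"

    def transfer_done(self, handle):
        # Step 6: the CTD reports completion.
        self.completed.append(self.handles[handle])


class Ctd:
    """Hypothetical CTD: moves bytes and tracks progress, but sees no file names."""

    def __init__(self, store, frame_size=4):
        self.store = store
        self.frame_size = frame_size
        self.bytes_sent = 0
        self.frames = []

    def on_event(self, xctd, user, command, name):
        # Step 3: the CTD does not process the request itself; it passes it on.
        return xctd.handle_request(self, user, command, name)

    def transfer(self, handle, xctd):
        # Step 5: place the data into outgoing frames with minimal processing.
        data = self.store[handle]
        for i in range(0, len(data), self.frame_size):
            frame = data[i:i + self.frame_size]
            self.frames.append(frame)
            self.bytes_sent += len(frame)
        xctd.transfer_done(handle)


store = {}
xctd = Xctd({"index.html": b"<html>hi</html>"}, {"alice"}, store)
ctd = Ctd(store)
status = ctd.on_event(xctd, "alice", "GET", "index.html")
denied = ctd.on_event(xctd, "bob", "GET", "index.html")
```

Note that only the XCTD holds the mapping from handle to file name, while only the CTD holds the byte count, mirroring the split of knowledge described in step 5.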

The TV delivery and MPEG platform subsystems
The DirectPath TV delivery and MPEG platform subsystems are closely tied together.
This document describes them both. These two subsystems deliver digital video to
support two DirectPath applications:
• Hospitality – A hospitality system provides in-room, on-demand video content to
local end users. (Future hospitality systems may also support in-room Web
browsing, as described later in this document.)
• Ad insertion – In this application, a customer such as a cable TV provider uses a
DirectPath system to insert their own advertisements or other content into a video
signal sent to cable subscribers.
Media server is Ikadega’s name for a DirectPath system used in either of these
applications.
How hospitality systems work
In a typical hospitality application, one or more DirectPath media servers deliver digital
movies to hotel guests. The content can also include things like short advertising videos
for other nearby businesses.
The following picture shows the devices involved in hospitality delivery. The only
Ikadega-supplied component is the media server. The customer supplies and manages the
rest.
[Figure: devices involved in hospitality delivery. Elements shown: media server
(DirectPath system), end user agent, facility cable plant, and the TV set in the end
user's room.]

To watch a movie, the user interacts with the customer’s end user agent system rather
than with DirectPath. This is the normal sequence of events when an end user wants to
use hospitality services:
1. The end user, on an in-room TV, makes a request to watch a movie or other program
(via an input device such as a remote control).
2. The customer’s end user agent receives this request and queries the media server for
information on the available content.
3. The media server sends the end user agent data on all the selections available,
including the title, running time, description, rating, etc. for each content file.
4. The end user agent takes this information to display menus and help the end user
make a selection. The agent also makes any necessary billing arrangements.
5. When the guest makes a selection, the end user agent directs the media server to
begin playback of the requested program to a specific media server port. The end user
agent tunes the user’s TV to the correct channel to receive the program. This channel
change is invisible to the end user – it does not change the channel number displayed
on the TV.
6. The media server plays the selection as requested. It sends the signal directly to the
end user’s TV through the building’s cable plant. The user may pause or halt
playback at any time. The only involvement the end user agent has during this phase
is to pass any pause/restart/stop commands to the media server.
7. The media server notifies the end user agent when playback is done.
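
The exchange between the end user agent and the media server in steps 2 through 6 might be modeled like this. It is a rough sketch only: the class and method names, the metadata fields, and the room and port numbers are hypothetical and do not describe a real DirectPath interface.

```python
class MediaServer:
    """Hypothetical stand-in for a DirectPath media server (jukebox model)."""

    def __init__(self, catalog):
        self.catalog = catalog          # title -> metadata dict
        self.playing = {}               # port -> title

    def list_content(self):
        # Step 3: return data on all available selections.
        return [dict(title=t, **meta) for t, meta in sorted(self.catalog.items())]

    def play(self, title, port):
        # Steps 5-6: begin playback of a title on a specific port.
        if title not in self.catalog:
            raise KeyError(title)
        self.playing[port] = title

    def stop(self, port):
        self.playing.pop(port, None)


class EndUserAgent:
    """Hypothetical customer-side agent; it owns menus, billing, and tuning."""

    def __init__(self, server):
        self.server = server
        self.tuned = {}                 # room -> port the TV is tuned to

    def menu(self):
        # Steps 2 and 4: query the server, then build a menu from the reply.
        return [item["title"] for item in self.server.list_content()]

    def select(self, room, title, port):
        # Step 5: direct the server to play, then tune the room's TV.
        self.server.play(title, port)
        self.tuned[room] = port


server = MediaServer({
    "Movie A": {"minutes": 95, "rating": "PG"},
    "Movie B": {"minutes": 120, "rating": "R"},
})
agent = EndUserAgent(server)
titles = agent.menu()
agent.select(room=101, title="Movie B", port=3)
```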
Communication from the agent goes through an end user agent proxy. This is an
application that runs in the open source environment. It handles communication between
the end user agent and the DirectPath controller (DPC). The DPC takes action as
appropriate, which often affects the TV access node.
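[Figure: an external system communicates with a proxy task running in the OSE; the proxy
task communicates with the DPC, which acts on the TV access node. The proxy task, DPC,
and TV access node are inside the media server.]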

How ad insertion works
Ad insertion allows a local cable company to substitute its own commercials (usually for
local businesses) for those in the input broadcast (which are often made for a national
audience). This picture shows the major components involved:
[Figure: ad insertion components. Elements shown: ad source, content loader, ad
scheduler, media server (DirectPath system), cable TV head end, A/B switch, and
subscribers. Labels on the connections: new ads, "Go" command, signal.]
Numerous channels of content arrive at the cable TV head end. If there is no ad insertion
happening for a particular channel, that channel’s signal passes unchanged through the
A/B switch and on to the subscribers watching that channel. However, when the head end
receives notification that a commercial is about to start, it signals the ad scheduler
system.
The ad scheduler has a list of the commercials stored on the media server. It decides
whether to replace the national ad with one of these commercials, and then it chooses the
commercial to run. The scheduler sends a command to the media server to play that ad on
the specified channel. The ad scheduler also uses a proxy task to communicate with the
media server.
The media server immediately begins to play the commercial. The A/B switch replaces
the signal coming from the head end with the media server’s output signal. The
subscribers watching that channel see the ad being played by the media server.
From time to time, the content loader receives new digital ads. It passes them on to the
media server for storage, and it also notifies the ad scheduler, so the scheduler has a
current list of which commercials are stored in the media server. The ad scheduler and
content loader can run on the same machine or different machines.
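
The decision flow above can be sketched as follows. The document does not say how the scheduler chooses an ad, so the round-robin policy here is purely an assumption, as are the names AdScheduler, ad_loaded, on_break, and ab_switch.

```python
class AdScheduler:
    """Hypothetical ad scheduler; the selection policy is an assumption."""

    def __init__(self):
        self.stored_ads = []        # kept current by content loader notifications
        self._next = 0

    def ad_loaded(self, ad_name):
        # The content loader notifies the scheduler about each new ad.
        self.stored_ads.append(ad_name)

    def on_break(self, channel, local_channels):
        """Return the ad to play on this channel, or None to keep the national ad."""
        if channel not in local_channels or not self.stored_ads:
            return None
        ad = self.stored_ads[self._next % len(self.stored_ads)]
        self._next += 1
        return ad


def ab_switch(head_end_signal, media_server_signal):
    """Pass the head-end signal through unless the media server is playing an ad."""
    return head_end_signal if media_server_signal is None else media_server_signal


scheduler = AdScheduler()
scheduler.ad_loaded("local-diner")
scheduler.ad_loaded("car-dealer")

chosen = scheduler.on_break(channel=5, local_channels={5, 7})
skipped = scheduler.on_break(channel=9, local_channels={5, 7})
out_with_ad = ab_switch("national-ad", chosen)
out_without_ad = ab_switch("national-ad", skipped)
```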

The jukebox model
The early versions of the media server are designed around a jukebox model, in which it
plays the selection it’s told to by an outside system (the end user agent or ad scheduler).
A later section of this document describes the interactive model, to be implemented
sometime in the future. Whether it’s playing movies or inserting ads, the DirectPath
system has hardware and software in its TV access nodes to support video playback:
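[Figure: TV access node, showing sub-node 0 and sub-node 1. Each sub-node contains a
microprocessor (running MPEG drivers, OS-9, DAVID, and the application) and an MPEG
decoder (and related components) that produces the TV signal. A node-fabric interface
carries control data and the MPEG image stream. The legend distinguishes Ikadega
components from third-party components.]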
There are eight sub-nodes on a TV access node, each of which produces one video signal.
Notice that one node-fabric interface (NFIF) handles fabric communication for all of
them. Most of the data arriving at the NFIF is the video data, which it passes directly to
the appropriate MPEG decoder rather than to the microprocessor. (This is similar in
philosophy to how DirectPath storage nodes pass data directly to access nodes without
going through the DirectPath controller.) The data going from the microprocessor to the
node-fabric interface includes requests for more content data from the storage nodes.
The Ikadega application running in the microprocessor would work on tasks such as
closed captioning, providing visuals to accompany audio-only content, and
superimposing text or graphics over the video for weather warnings, logos and other
images. The MPEG decoder’s “related components” from the previous drawing include
logic to support internal MPEG transport, superimposing, and audio-video mixing.
Subsystem information: the Ikadega microprocessor application and end user agent proxy
are part of the TV delivery subsystem, while the other sub-node components are in the
MPEG platform subsystem.

The interactive model (hospitality only)
In future versions of hospitality systems, the user will interact with a Web browser
running in the media server. This provides a more appealing and functional selection
system than that provided by the original end user agent, which tends to be
character-oriented. The design might look like this:
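[Figure: TV access node in the interactive model, showing sub-node 0 and sub-node 1.
Each sub-node contains a microprocessor (running drivers, OS-9, DAVID, a browser, and an
applet created by the VAR) and an MPEG decoder (and related components) that produces the
TV signal. A node-fabric interface carries control data and the MPEG image stream. The
legend distinguishes Ikadega components from third-party components.]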
These are the major differences in the interactive model:
• End users will be able to go on the Internet from their rooms.
• End users who would rather watch a movie than go on the Internet will select
content via an applet running in a browser in the microprocessor. The customer
or VAR will probably create this applet.
• Since the end user will use the browser to make content selections, the end user
agent has a reduced role – it simply passes keystrokes between the end user and
the browser.
• There is no interactive model for ad insertion.
Subsystem information: the applet and end user agent proxy are in the TV delivery
subsystem, while the other sub-node components are in the MPEG platform subsystem.

The volume and file access subsystems
The volume access subsystem and file access subsystem are the DirectPath file system.
These subsystems support the reading and writing of data on storage node hard drives.
The file system can accommodate a wide range of uses. In some applications, such as
hospitality systems that primarily play back movies to locally connected TV sets, the file
system holds a relatively small number (in the range of hundreds) of very large files.
These files do not change very frequently, and owners load new files relatively
infrequently (say on a daily or weekly basis). Other customers, however, will use
DirectPath to host and deliver Web sites. These customers need a file system that can
handle large numbers (in the tens of thousands) of small files that change relatively
frequently. Between these two extremes are customers like an online music service, who
must deliver one set of small files (say the Web pages where users select songs to
download) and another set of fairly large ones (the actual MP3 song files). The
DirectPath file system has flexibility to accommodate these varying uses in one design.
The file system consists of two DirectPath subsystems:
• Volume access subsystem – In DirectPath, a volume is a logically continuous set
of disk sectors. The volume access subsystem is unaware that some volumes
contain multiple files.
• File access subsystem – A DirectPath file is a named portion of a volume. File
accesses go through the volume access subsystem.
Notice from this drawing that all disk accesses go through the volume access subsystem,
either directly or through the file access subsystem:
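[Figure: a client task accessing a volume goes directly to the volume access subsystem; a
client task accessing a file goes to the file access subsystem, which in turn uses the
volume access subsystem.]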
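
The relationship between the two subsystems can be illustrated with a small model in which a file is nothing more than a named range within a volume. The class names, the 512-byte sector size, and the simple append-only allocation are assumptions made for this sketch.

```python
class Volume:
    """Hypothetical volume: a logically continuous run of sectors, unaware of files."""

    SECTOR = 512  # assumed sector size in bytes

    def __init__(self, sector_count):
        self.data = bytearray(sector_count * self.SECTOR)

    def write(self, sector, payload):
        start = sector * self.SECTOR
        self.data[start:start + len(payload)] = payload

    def read(self, sector, length):
        start = sector * self.SECTOR
        return bytes(self.data[start:start + length])


class FileAccess:
    """Hypothetical file layer: gives names to portions of one volume."""

    def __init__(self, volume):
        self.volume = volume
        self.names = {}          # name -> (first sector, length in bytes)
        self.next_sector = 0

    def create(self, name, payload):
        sectors_needed = -(-len(payload) // Volume.SECTOR)  # round up
        self.names[name] = (self.next_sector, len(payload))
        self.volume.write(self.next_sector, payload)        # all I/O goes via the volume
        self.next_sector += sectors_needed

    def read(self, name):
        sector, length = self.names[name]
        return self.volume.read(sector, length)


volume = Volume(sector_count=8)
files = FileAccess(volume)
files.create("a.html", b"<p>hello</p>")
files.create("b.gif", b"GIF89a")
```

Both read paths end at the volume: a client can call volume.read directly, or go through files.read, which translates a name into a volume read.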

File system service layers
You can think of the DirectPath file system as a collection of services divided into the
following layers:
• Applications – request and work with the data.
• File subsystem – identifies and manages named data files. Implemented here: block
search services for the volume layer. Used here: checkpoints. (The drawing shows
the UNIX file system alongside this layer.)
• Volume subsystem – locates and places the data on disk. Implemented here:
aggregation, replication, striping, checkpoints. Used here: block search services
from the file layer.
• Hardware layer – reads and writes the data.
The remaining introductory pages describe these components, from the higher-level
directory layer and UNIX file system to the low-level hardware layer.
Typical uses of the file system
DirectPath can actually support multiple file systems running concurrently. The Ikadega-
supplied file system can run together with the UNIX file system. It can also exist in the
same machine as an optional customer-defined file system.
Below are some examples of how customers could use the Ikadega-supplied file
system:
• Local large content delivery, where the system delivers very large files to nearby
users – for example, movies to hotel guests. In this scheme, there usually is only
one company providing the content. Since the data for a content file is not likely
to change, the content rarely if ever goes through different versions. What does
change over time is the set of movies available – new ones are added and older
ones might be removed. The volume access subsystem provides the services for
this type of use.
In this type of system, the file access subsystem exists but is essentially empty –
it just passes I/O requests to the volume access subsystem with little or no
processing.

• Internet delivery, where the system hosts numerous Web sites containing various
file types, from small files to movies. The content on these sites comes from a
number of content providers, and from time to time the system owner may need
to find out who created a certain file. While some of these files may be as large
as the movies described above, there are probably also a number of small files.
The system must be able to locate and process all of these files. It must also be
able to deal with them being replaced frequently. In applications like this, the
system relies on the services of the volume and file access subsystems.
The file access subsystem processes these files.
• UNIX file access, described in the next section.
Most DirectPath systems have a mixture of these file types.
The UNIX file system
A complete UNIX file system may exist in DirectPath to support legacy applications and
other programs that need UNIX services. One example of this is the system event
logging, which could be implemented by using UNIX logging services. Also, the
Manager subsystem, designed to be as UNIX-like as possible, uses UNIX services.
The UNIX file system can perform reads and writes on its own private disk (a disk
invisible to the other file systems), or it can use the volume access subsystem for disk
access, or it can do both. The private disk is currently used in system booting. When it
uses the volume access subsystem for disk access, the UNIX file system has its own
volume, which it thinks is an entire disk. It isn’t aware that the volume access subsystem
is even there.
Possible future directions: In media server applications that mainly deliver digital video,
it may be possible to use the UNIX file system as the only file service, without using the
volume or file access subsystem. (The file access subsystem doesn’t do much in these
applications anyway.) It’s also possible, though, that the file access subsystem might take
over all the functions of the UNIX file system in future versions of the system.
The file access subsystem
The file access subsystem is the highest layer in the file system hierarchy. The nature of
its processing depends on the type of data being processed. The subsystem is mostly
transparent when processing large content files such as digital movies or music – the
volume access subsystem does most of the work on these files. The file access subsystem
becomes important when the system works with numerous, small content files. For
example, if a customer uses a DirectPath system for hosting Web sites, each site will
have a number of relatively small files, and the file access subsystem would process the
individual files in the site (HTML, GIF, etc.).
One key feature of the DirectPath file system is that virtually all of the content transfer
from disk happens in the volume access subsystem rather than the file access subsystem.
This gives the system a speed advantage over traditional file servers.

Access to smaller, named files
The volume access subsystem sees large blocks of data with no internal structure – for
example, digital movie files that are delivered to users from beginning to end. The file
access subsystem gives the system access to many smaller named files, such as the files
that make up a Web site.
[Figure: where the volume access subsystem sees one large volume, the file access
subsystem might see a number of smaller named files (f1, f2, f3, f4, f5, ... fn).]
Block search services for the volume access subsystem
To find files, the file system has several different directories:
• Inode directory – an inode is a system data structure that describes a file. An
application references a file by giving the file system an inode number.
• URL directory – this is a table that maps URLs to inodes, in effect providing
URL “names” for the inodes.
• Traditional file system directory – another inode mapping table, but one that
mimics the hierarchical tree structure of subdirectories and files commonly used
in PCs and UNIX machines. These directories also point to (and “name”) inodes.
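
As a rough illustration of how the three directories relate, the sketch below resolves both a URL and a hierarchical path to the same inode number. The inode fields, the example URL, and the function names are hypothetical; the document does not specify the actual structures.

```python
# Inode directory: inode number -> data structure describing the file.
# The fields shown are invented for this sketch.
inodes = {7: {"size": 1200, "first_sector": 64}}

# URL directory: maps URLs to inode numbers.
url_directory = {"http://example.com/index.html": 7}

# Traditional directory: a tree of subdirectories whose leaves are inode numbers.
tree_directory = {"www": {"index.html": 7}}


def lookup_url(url):
    """Return the inode number named by a URL, or None if it is unknown."""
    return url_directory.get(url)


def lookup_path(path):
    """Walk the directory tree and return the inode number, or None."""
    node = tree_directory
    for part in path.strip("/").split("/"):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, int) else None


by_url = lookup_url("http://example.com/index.html")
by_path = lookup_path("/www/index.html")
missing = lookup_path("/www/missing.html")
```

Both directories are just different ways of naming the same inode, which is why an application can ultimately reference a file by inode number alone.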
There are three basic methods for reading content, depending on the nature of the files
involved:
• Locate method – the client task wants to read from a certain offset into a volume,
which it has a handle to. This method is for large files. Here is a typical sequence
of events:
[Figure: locate method. Numbered messages 1 through 11 pass among the client, the file
system session, the volume system session, and the storage node, as listed below.]

1: Open request (client sends either a file name or URL). 2: Open reply (returns a handle to the
file, if found). 3: Locate request. 4: Locate reply (returns a map of the file’s block segments). 5:
Volume read request. 6: Sector read request. 7: Data transfer to client buffer (an RDMA transfer).
8 & 9: Request replies. 10: Close request. 11: Close reply.
Steps 5 through 9 repeat until the client has received the entire file.
• Whole file method – for quick access to files small enough to be fully retrieved in
one read operation (such as Web site files). One benefit of this method is that
there are no file open or close operations.
[Figure: whole file method. Numbered messages 1 through 7 pass among the client, the file
system session, the volume system session, and the storage node, as listed below.]
1: Read file request (client sends either a file name or URL). 2: File system session passes read
request along. 3: Sector read request. 4: Data transfer to client buffer (an RDMA transfer). 5, 6, 7:
Request replies.
• Traditional method (with Ikadega enhancements) – this method supports file
reads as done on a UNIX system. The method also supports traditional file
operations such as renaming, setting permissions, etc., and it supports DAFS.
[Figure: traditional method. Numbered messages 1 through 11 pass among the client, the
file system session, the volume system session, and the storage node, as listed below.]
1: File open request. 2: Open request reply. 3: File read request. 4: Read request passed along. 5:
Sector read request. 6: Data transfer to client buffer (an RDMA transfer). 7, 8, 9: Request replies.
10: File close request. 11: File close reply.

The “Ikadega enhancements” mentioned above include the direct RDMA content
transfer from the storage node to the client. Traditional file systems would send
the content to the client through the volume and file system sessions.
For added flexibility, clients may shift between the locate and traditional methods with
the same file handle.
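
The difference between the locate and whole file methods can be sketched as follows. This is a simplified model: it collapses the volume system session and the storage node into one object, uses an ordinary return value in place of the RDMA transfer, and all class and function names are hypothetical.

```python
class StorageNode:
    """Hypothetical storage node holding fixed-size sectors."""

    def __init__(self, sectors):
        self.sectors = sectors                  # list of bytes objects

    def read(self, first, count):
        return b"".join(self.sectors[first:first + count])


class FileSession:
    """Hypothetical file system session with a fully cached directory."""

    def __init__(self, node, directory):
        self.node = node
        self.directory = directory              # name -> [(first sector, count), ...]
        self.open_handles = {}
        self._next_handle = 1

    def open(self, name):
        if name not in self.directory:
            return None
        handle = self._next_handle
        self._next_handle += 1
        self.open_handles[handle] = name
        return handle

    def locate(self, handle):
        # Return the map of the file's block segments.
        return list(self.directory[self.open_handles[handle]])

    def close(self, handle):
        del self.open_handles[handle]

    def read_whole(self, name):
        # Whole file method: one request, no open or close.
        return b"".join(self.node.read(f, c) for f, c in self.directory[name])


def read_with_locate(session, name):
    handle = session.open(name)                 # steps 1-2
    if handle is None:
        return None
    segments = session.locate(handle)           # steps 3-4
    chunks = [session.node.read(first, count)   # steps 5-9, once per segment
              for first, count in segments]
    session.close(handle)                       # steps 10-11
    return b"".join(chunks)


node = StorageNode([b"aa", b"bb", b"cc", b"dd"])
session = FileSession(node, {"movie": [(0, 2), (3, 1)], "page.html": [(2, 1)]})
movie = read_with_locate(session, "movie")
page = session.read_whole("page.html")
```

The locate path costs more round trips but lets the client read a large file piece by piece; the whole file path trades that flexibility for a single request.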
Note: These drawings assume that the directories are fully cached in memory and that
files are stored contiguously on disk.
The volume access subsystem
The volume access subsystem supports the file access subsystem and UNIX file system.
Volumes are logical collections of sectors, often organized into categories of content
stored on the system, such as the top N most popular titles and the other less-popular
files. Most volumes contain large content files sized from the hundreds of megabytes to
gigabytes and beyond. Volumes generally have fewer attributes than files – they do not
have items such as access permission data, modification and access dates, checkpoint
information, etc.
The volume access subsystem is where you first start to see disk organization. A disk has
one or more disk slices, each of which contains partitions. Partitions cannot cross disk
slice boundaries. There also is a partition descriptor for each partition in a slice.
[Figure: disks divided at disk slice boundaries into disk slices; each slice holds
partitions and their partition descriptors.]
Disk slices help support the work of offline utility applications. One example of such an
application is a program that pre-loads content before disks are shipped. The system
formats disk slices like conventional operating system partitions.
Note to readers who are familiar with the system’s traffic shaping components: You can
think of the volume access subsystem as a part of the components responsible for storage
array control and fabric traffic.

Aggregation
One method for splitting volumes into partitions is aggregation. This is simply breaking
the content into partitions, which can reside on different disks or storage nodes.
[Figure: an original volume of 100 gigabytes aggregated as a 40 GB partition on disk 2
and a 60 GB partition on disk 8.]
With checkpointing or replication, the aggregation can have different boundaries for each
version:
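[Figure: the same volume with two different sets of boundaries: 40 GB + 60 GB in one
version, and 40 GB + 30 GB + 30 GB in another.]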
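
To make aggregation concrete, the sketch below maps a byte offset within a volume to the partition that holds it, using the 40 GB and 60 GB split from the example. The function name, the decimal definition of a gigabyte, and the assumption that the disk 2 partition holds the start of the volume are all illustrative choices.

```python
GB = 10**9  # decimal gigabyte, assumed for this sketch

# Ordered partitions of one aggregated volume: (location, size in bytes).
partitions = [("Disk 2", 40 * GB), ("Disk 8", 60 * GB)]


def locate(volume_offset, partitions):
    """Map a byte offset in the volume to (disk, offset within that partition)."""
    if volume_offset < 0:
        raise ValueError("offset must not be negative")
    remaining = volume_offset
    for disk, size in partitions:
        if remaining < size:
            return disk, remaining
        remaining -= size
    raise ValueError("offset is beyond the end of the volume")


start = locate(0, partitions)
boundary = locate(40 * GB, partitions)
last_byte = locate(100 * GB - 1, partitions)
```

A second version of the volume with different boundaries would simply use a different partition list with the same lookup.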

Striping
Striping is another way the system splits content files into partitions. Striping is a disk
storage technique that helps to protect against lost content. It splits up a content file into
equal-length blocks called stripes. The system stores these stripes on N different storage
nodes (here N = 4), along with an additional stripe described below:
[Figure: the original volume contents are split into stripes 0 through 7 and stored on
storage nodes 1 through 4. Storage node 5 holds the parity stripes 0^1^2^3 and 4^5^6^7,
where ^ means exclusive OR.]
Each byte in the parity stripe (at the bottom of the drawing) is the result of an exclusive
OR logic operation on the bytes in the corresponding stripes. For example, the first byte
of the parity stripe is the result of an exclusive OR performed on the first bytes of stripes
0 through 3. If the system can’t read one of the stripes (say if there is a disk or storage
node error), it can re-create the lost data by performing an exclusive OR on the parity
stripe and the remaining stripes.
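
The parity calculation and recovery can be demonstrated in a few lines. This sketch uses tiny four-byte stripes and hypothetical function names; it shows only the exclusive OR arithmetic, not how DirectPath lays stripes out on disk.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise exclusive OR of two equal-length stripes."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(stripes):
    """Parity stripe: the exclusive OR of all the data stripes."""
    return reduce(xor_bytes, stripes)


def rebuild(surviving_stripes, parity_stripe):
    """Re-create the one missing stripe from the parity and the survivors."""
    return reduce(xor_bytes, surviving_stripes, parity_stripe)


# Stripes 0 through 3, one per storage node (N = 4).
stripes = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity_stripe = parity(stripes)

lost = 2                                        # pretend this node failed
survivors = stripes[:lost] + stripes[lost + 1:]
recovered = rebuild(survivors, parity_stripe)
```

Recovery works because XORing a value with itself cancels it out: combining the parity with every surviving stripe leaves only the missing one. The same arithmetic cannot recover two lost stripes from a single parity stripe.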
The hardware layer
This layer contains the hard disks and their controlling hardware and software, all located
on storage nodes. The layer doesn’t know the meaning of the data it reads and writes. It
just responds to specific commands. Most of the work it does is read operations, but it
does write to disk as well, to load new content or make copies of volumes.
Every storage node has multiple sub-nodes (two of them at present), each of which
controls one ATA-type hard disk drive. The sub-nodes have custom disk-controlling
hardware as well as interfaces to the fabric. Since the whole DirectPath system is
designed to keep the disk drives as busy as possible, the system makes heavy demands on
the drives, and there is very little room for malfunctions or disk errors. Field service
people can replace disks “on the fly” (while the rest of the system continues to run and
deliver content) to remove a faulty disk, install a drive with more capacity, or insert a
disk pre-filled with new content.

To assist the disk activity scheduler, DirectPath maintains a set of performance history
data on each disk. This data reflects the actual performance of each individual drive
(rather than the specifications for the drive type).
Checkpoints
Checkpoints allow the system to keep multiple file versions on disk, mostly to ensure
read consistency – each end user getting files from the same file set. This is useful in
many applications, especially those with frequently updated files such as Web files. If a
site is popular, a number of people are likely to be using it when one of its periodic file
updates takes place. If a site changes frequently, at some point there will be users with
files open from several revisions ago, especially users with slow Internet connections.
With checkpointing, the new and older versions of the site co-exist while the
system loads new files to the disk drives.
The files for a new checkpoint become available to users when the file system commits
them. A commit operation updates the partition descriptors for the volume. New users
aren’t able to use the new checkpoint until all of the new files are committed
successfully. Users who had site files open at the start of the update continue to see only
files from the checkpoint that was most current when they started their sessions. If the
system halts during the
loading stage (before it can commit a new checkpoint), the checkpoint and its new
content are lost. The system retains the previous committed checkpoints, though.
The DirectPath customer can specify how many checkpoints to keep for each volume.
The system generally re-uses the storage space of expired versions. This can be a rapid
process – on some systems that change content very quickly, the resources for a replaced
checkpoint may be re-used in as little as 4 minutes.
The checkpoint feature is implemented in the volume access subsystem, but DirectPath
only uses it on the smaller named files processed by the file access subsystem. At any
given moment, a checkpointed volume has files from 0 to n checkpoints available, and it
may also have a new checkpoint in progress.
A simple example
To understand checkpointing, take an example volume that only has five files. (The
example is small and unrealistic, but it demonstrates the basics of how checkpointing
works.) Suppose there is a DirectPath system that hosts Web sites, and it receives five
files for the initial version of a site. The files are a.html, b.html, c.html, d.gif, and e.gif.

When the new files arrive, if they are to go in a new volume, the application software
allocates a certain amount of space for the volume. At this point there is officially nothing
in the volume – none of the blocks is committed, though the file system may have started
loading the files to disk.
[Figure: data being stored to disk]
At this stage, the files for the example Web site are on their way to disk, but they are not
yet available to users.
When the files are stored successfully, the file system can commit the checkpoint. After it
does this, new users open the files from this first checkpoint:
a.html (1..latest)
b.html (1..latest)
c.html (1..latest)
d.gif (1..latest)
e.gif (1..latest)
Most recently committed: 1
Oldest retained checkpoint: 1
The (m..n) notation indicates the checkpoints each file belongs to. (The system does not
store this information with the files, however – it maintains it in memory only.) Oldest
retained checkpoint and Most recently committed are two variables the system uses to
keep track of a volume’s checkpoints.

Now, suppose sometime later there is a change to the a.html file, where a completely
new version of the file replaces the first version. The system stores the new version in the
first available free space, and then commits it:
a.html (1)            Invalidated
b.html (1..latest)
c.html (1..latest)
d.gif (1..latest)
e.gif (1..latest)
a.html (2..latest)    New
Oldest retained: 1
At this point, the volume contains files from checkpoints 1 and 2. The commit invalidates
the first version of a.html, which means that the file is still there but is no longer in the
latest checkpoint. However, the file is still valid for sessions using checkpoint 1, if any. If
certain conditions are met later (see below), the file system could eventually re-allocate
the storage used by this first version of a.html. The users are not aware of the
checkpointing or the different file versions.

Suppose now that there are two file changes for the next checkpoint. The site owner changes
the b.html file, which had been the only file referencing d.gif. The new b.html no longer
uses the graphic file. In addition to invalidating the old b.html, the system invalidates
d.gif – the file does not apply to the new checkpoint or the ones that follow it (unless one
of the HTML files is changed to refer to it again). d.gif and the original b.html are still
valid for users of checkpoints 1 and 2. Here’s what the volume looks like after the
commit:
a.html (1)
b.html (1..2)
c.html (1..latest)
d.gif (1..2)
e.gif (1..latest)
a.html (2..latest)
b.html (3..latest)
Oldest retained: 1
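The (m..n) ranges in the listings above are enough to work out which version of each file a session sees. The following sketch models the volume after checkpoint 3 commits; the tuple layout, the LATEST marker, and the visible function are assumptions made for illustration, not part of the DirectPath file system.

```python
LATEST = None  # stands for "..latest" in the (m..n) notation

# (name, first checkpoint, last checkpoint) after checkpoint 3 commits.
# a.html (1) is written as the range 1..1.
versions = [
    ("a.html", 1, 1),
    ("b.html", 1, 2),
    ("c.html", 1, LATEST),
    ("d.gif", 1, 2),
    ("e.gif", 1, LATEST),
    ("a.html", 2, LATEST),
    ("b.html", 3, LATEST),
]


def visible(versions, checkpoint):
    """Map each file name to the (first, last) range seen by a session at checkpoint."""
    view = {}
    for name, first, last in versions:
        if first <= checkpoint and (last is LATEST or checkpoint <= last):
            view[name] = (first, last)
    return view
```

A session that started at checkpoint 2 gets the new a.html but the old b.html and d.gif, while a session at checkpoint 3 does not see d.gif at all.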
What eventually happens to the previous version of b.html, d.gif, and other invalidated
files depends on the customer’s file allocation policies. The system’s operators probably
want to keep files from at least some of the previous checkpoints, in which case these
files would remain there unchanged. However, to keep disk clutter down, most customers
also want to limit the number of checkpoints remaining on disk. So if a customer chooses
a checkpoint limit, it affects what the file system does when there is a new checkpoint. If
the following conditions are both true, then the file system could mark a file reclaimable
(discarded and available for re-use by new files):
• There are currently no sessions using the checkpoint in question, and
• The file’s checkpoint number is no greater than the new checkpoint’s number minus
the checkpoint limit (for example, with a limit of 5 and committing checkpoint 8,
a file from checkpoint 3, 2, or 1 qualifies)
The file system marks a file as reclaimable only if both conditions are true for it. If the
file system marks a file as reclaimable, its blocks no longer contain valid data, though the
system might not re-use them right away.
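The two conditions can be written as a single check. This is a sketch under stated assumptions: the function name and its parameters are hypothetical, active_checkpoints stands in for whatever session tracking the system really uses, and the comparison follows the worked example in the text (limit 5, committing checkpoint 8, files from checkpoints 3, 2, or 1).

```python
def is_reclaimable(file_checkpoint, new_checkpoint, limit, active_checkpoints):
    """Apply both conditions from the text to one invalidated file.

    file_checkpoint    -- checkpoint number associated with the file
    new_checkpoint     -- number of the checkpoint being committed
    limit              -- how many checkpoints the customer chose to keep
    active_checkpoints -- checkpoints that still have user sessions (assumed input)
    """
    no_sessions = file_checkpoint not in active_checkpoints
    too_old = file_checkpoint <= new_checkpoint - limit
    return no_sessions and too_old
```

With a limit of 5 and checkpoint 8 being committed, checkpoints 4 through 8 stay on disk, and only files from checkpoints 1 through 3 can be reclaimed, provided no session still uses them.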

Checkpointing states and transitions
This document describes some of the data the DirectPath file system uses to process
checkpoints. It also shows how these variables change in response to various checkpoint
events.
Note: The information in this document applies only to customer environments with
relatively small numbers of content providers. This document does not apply to
environments where there are numerous content providers.
To support checkpointing, the volume access subsystem and the file access subsystem
maintain the following variables:
• For each volume:
o OldestRetained – this is the number of the earliest checkpoint the system
must still honor (retain the files for). Recall from About checkpoints that
some end users could still be using files from one or more previous
checkpoints.
o LatestCommitted – the number of the most recently committed checkpoint
for a volume.
o OldestBeingUsed – the oldest checkpoint that still has active user sessions.
• For each file:
o Modified – this is the number of the checkpoint containing the latest version
of a particular file.
o Invalidated – the checkpoint when the file was removed from the latest
checkpoint, though it still may be in use by active user sessions.
Note: The DirectPath file system currently tracks these two variables for each
file. It could in the future track them by block instead.
Checkpointing states
The following table shows the different states a file can be in. Note, though, that the
DirectPath file system (DPFS) does not store these file states anywhere. The state of
each file is implied by the
values of the above variables. The reason for this is system performance and consistency
– if a checkpoint that affects, say, 200 files aborts before committing, the system would
be slowed down by first marking and then un-marking all 200 files. Also, if the system
halts in the middle of a 200-file update, the file system would not be able to correctly tell
which files are members of each checkpoint when it comes back up.

The tables use two abbreviations: F for file and V for volume. For example,
F.Modified means the Modified variable for a file, and V.LatestCommitted is the
LatestCommitted value for a volume.
Implied state         Values causing that state           Comment

Free                  F.Invalidated <= V.OldestRetained   When the DirectPath file system wants to create
                                                          new files for a checkpoint, it first allocates
                                                          free space for them.

Newly allocated       F.Modified > V.LatestCommitted;     This is the status of a new file being created
                      F.Invalidated == NULL               for a future (not yet committed) checkpoint. If
                                                          the file system commits the checkpoint, the
                                                          file’s state becomes In-use retained. If the
                                                          system instead aborts the checkpoint, it makes
                                                          the file’s resources free again for re-use.

In-use retained       F.Modified <= V.LatestCommitted;    This is a normal file state. It means that the
                      F.Invalidated == NULL               file is part of the volume’s most recently
                                                          committed checkpoint.

Invalidation pending  F.Invalidated > V.LatestCommitted   In this state, the file system is in the process
                                                          of creating a checkpoint that, when committed,
                                                          will invalidate the file. If the commit
                                                          operation completes, the status changes to
                                                          Retained. If the system does not commit the new
                                                          checkpoint, it eventually returns the file to
                                                          the In-use retained state (sometime before it
                                                          commits the next checkpoint).

Retained              F.Invalidated > V.OldestRetained;   A file in this state has been invalidated. It is
                      F.Invalidated <= V.LatestCommitted  no longer in the volume’s latest checkpoint, but
                                                          it is still part of an older checkpoint that has
                                                          current users. When all of this checkpoint’s
                                                          users end their sessions, the system could
                                                          delete the file (put it in the Free state, in a
                                                          sense) and re-use its resources, depending on
                                                          its checkpoint retention policy.
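Because the states are implied rather than stored, a file's state can be computed on demand from the variables. The sketch below follows the conditions in the table; the function name and the use of None for NULL are assumptions, and so is the handling of a file with neither value set, which the table does not cover.

```python
from typing import Optional


def implied_state(modified: Optional[int], invalidated: Optional[int],
                  oldest_retained: int, latest_committed: int) -> str:
    """Derive a file's implied state from F.Modified, F.Invalidated,
    V.OldestRetained, and V.LatestCommitted."""
    if invalidated is not None:
        if invalidated <= oldest_retained:
            return "Free"
        if invalidated <= latest_committed:
            return "Retained"
        return "Invalidation pending"
    if modified is None:
        # Assumption: a file with neither variable set holds no data.
        return "Free"
    if modified > latest_committed:
        return "Newly allocated"
    return "In-use retained"
```

Nothing is written per file when a checkpoint commits: incrementing V.LatestCommitted alone changes what this function returns for every file in the pending checkpoint.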
State transitions
This drawing shows the checkpoint states a typical file goes through during its life span:
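[Figure: state transition diagram with the states Newly allocated, In-use retained, Newly invalidated, Retained, and Free. Transitions are labeled "commit". A decision box, "Too old to retain?" (Is F.Invalidated < the smaller of V.OldestRetained or V.OldestBeingUsed?), leads to Retained on No and to Free on Yes.]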

The following table shows how the system updates variables and changes implied states
for individual files in response to miscellaneous checkpoint events.
From state            To state          Triggering event            Variables modified

Free                  Newly allocated   File allocation             F.Modified := V.LatestCommitted + 1

Newly allocated       Free              File delete                 F.Invalidated := V.LatestCommitted + 1

Newly allocated       Free              Checkpoint abort            F.Modified := NULL

Newly allocated       In-use retained   Commit                      Increment V.LatestCommitted

In-use retained       Retained          Commit, when                Increment V.LatestCommitted
                                        V.OldestRetained is still
                                        < F.Invalidated

In-use retained       Free              Commit, when
                                        V.OldestRetained is now
                                        >= F.Invalidated

Retained              Free              Commit, when
                                        V.OldestRetained is now
                                        >= F.Invalidated

Invalidation pending  In-use retained   Checkpoint abort            F.Invalidated := NULL
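A few of these rows can be modeled with a small class. This is a sketch, not the DirectPath design: the Volume class and its method names are hypothetical, and the invalidate method is an assumption that reuses the F.Invalidated := V.LatestCommitted + 1 assignment, which the table lists only for file delete.

```python
class Volume:
    """Toy model of the allocation, commit, and abort rows of the table."""

    def __init__(self):
        self.latest_committed = 0   # V.LatestCommitted
        self.files = {}             # name -> {"modified": ..., "invalidated": ...}

    def allocate(self, name):
        # Free -> Newly allocated: F.Modified := V.LatestCommitted + 1
        self.files[name] = {"modified": self.latest_committed + 1,
                            "invalidated": None}

    def invalidate(self, name):
        # Assumption: mark a committed file for removal in the pending checkpoint.
        self.files[name]["invalidated"] = self.latest_committed + 1

    def commit(self):
        # Newly allocated -> In-use retained: only the volume counter changes.
        self.latest_committed += 1

    def abort(self):
        pending = self.latest_committed + 1
        for f in self.files.values():
            if f["modified"] == pending:
                f["modified"] = None       # Newly allocated -> Free
            if f["invalidated"] == pending:
                f["invalidated"] = None    # Invalidation pending -> In-use retained
```

In this model a commit touches a single counter, while an abort resets only the files that belonged to the pending checkpoint.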
Replication
Replication is a disk storage method for protecting against lost content by making
complete copies of volumes on different storage nodes. It is also useful when one copy of a
content file is not enough to meet the peak demand for the file, and it improves the
system’s peak throughput. System managers can use replication to place copies of
popular content near the physical outer edges of disk drives, where data is read faster
(since more bytes pass under the read/write head in the same amount of time than near
the middle). A replication strategy is the approach a customer chooses for replicating volumes.
Replication of content is always a relatively low-priority activity, not important enough
to interfere with sending out content. As a result, there is lagged replication – the system
usually finishes the copy operation somewhat after finishing the storage of the original
content.

Replication and checkpointing
On systems using checkpointing, lagged replication follows behind the commitment of
each checkpoint. For example, when the system originally allocates space for a volume,
the replicated copy doesn’t exist yet:
[Figure: the original volume, with no copy yet]
As the original volume grows and changes, the copy might lag behind like this:
Original volume       Copy
CPs included: 1       (no checkpoints yet)
CPs: 1, 2             CPs: 1
CPs: 1, 2, 3          CPs: 1, 2
At this point, if it needed data from checkpoints 1 or 2, the file system could get it from
either the original or the copy (assuming that the file in question has been invalidated in
either of these checkpoints).
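The effect of lagged replication on where data can be read from can be sketched as a simple lookup. The replicas dictionary and the sources_for function are hypothetical names used only for this illustration; the checkpoint sets come from the last stage of the example.

```python
# Checkpoints held by each copy in the last stage of the example.
replicas = {
    "original": {1, 2, 3},
    "copy": {1, 2},
}


def sources_for(replicas, checkpoint):
    """Return the names of the replicas that already hold a given checkpoint."""
    return sorted(name for name, cps in replicas.items() if checkpoint in cps)
```

Data from checkpoint 1 or 2 could come from either replica, while checkpoint 3 is available only from the original until the copy catches up.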

Content transfer engine subsystem
A content transfer engine (CTE) is a platform that supports a number of content transfer
daemons (CTDs). In the initial implementation, it works with CTDs in the Internet
delivery subsystem. Together the two subsystems send data to Internet users.
A supplemental processor initializes and oversees the CTE. Sometimes these two
components are on the same access node:
[Figure: an IP access node containing both the content transfer engine and the supplemental processor, connected to the fabric and the Internet]
Here the supplemental processor is on a separate supplemental processor node:
[Figure: the supplemental processor on a supplemental processor node and the content transfer engine on an IP access node, connected to the fabric and the Internet]
The CTE is implemented as a set of fixed-function engines and re-programmable micro-
engines, all on a field-programmable gate array (FPGA). The engine’s major functions
are buffer management and two communication interfaces going to the fabric and the
Internet.

Inside the CTE
This drawing shows the CTE’s internal structure:
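[Figure: Content Transfer Engine internals. A node-fabric interface engine connects to the fabric, and an external network interface engine connects to the external network. The event engine and the event queues (EventQ) sit between them, and a buffer and memory control block connects to external RAM.]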
There are multiple event engines – more than one, but perhaps only a few. Each event
engine has its own queue. The event engines receive notices about events to be processed
by particular CTDs. The event engine dispatches an event by waking up the correct CTD
to process it:
[Figure: in the content transfer engine subsystem, event engines n-1 and n, each with its own queue (EventQ n-1, EventQ n), send dispatch signals to CTDs in the Internet delivery subsystem]
The CTE assigns an arriving event to an event engine in this way:
• If the CTE currently has no other pending events for the particular CTD, the
event goes to a randomly chosen event engine.
• However, if the event queues already contain at least one event for that CTD,
then all events for that CTD must go through the same event engine, until they
are all dispatched.
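The assignment rule above can be modeled in software, even though the CTE itself is implemented on an FPGA. In this sketch the EventRouter class, its method names, and the per-CTD counter are all hypothetical; they exist only to show the rule.

```python
import random
from collections import deque


class EventRouter:
    """Toy model of assigning arriving events to event engine queues."""

    def __init__(self, engine_count, rng=None):
        self.queues = [deque() for _ in range(engine_count)]
        self.pending = {}  # CTD id -> (engine index, undispatched event count)
        self.rng = rng or random.Random()

    def assign(self, ctd, event):
        """Queue an event and return the index of the engine chosen."""
        if ctd in self.pending:
            # Events are already queued for this CTD: reuse the same engine.
            engine, count = self.pending[ctd]
        else:
            # No pending events for this CTD: pick an engine at random.
            engine, count = self.rng.randrange(len(self.queues)), 0
        self.queues[engine].append((ctd, event))
        self.pending[ctd] = (engine, count + 1)
        return engine

    def dispatch(self, engine):
        """Remove and return the next (ctd, event) pair from one engine's queue."""
        ctd, event = self.queues[engine].popleft()
        _, count = self.pending[ctd]
        if count == 1:
            del self.pending[ctd]   # CTD is free to move to another engine
        else:
            self.pending[ctd] = (engine, count - 1)
        return ctd, event
```

Keeping all pending events for one CTD on a single queue preserves their arrival order, since each queue is processed first in, first out.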
One of the critical issues for the content transfer engine is memory management. The
CTE makes a minimum of memory transfers as it receives data from storage nodes and
sends it to the external network. It does this by initially storing data from the storage
nodes in external RAM buffers, then sending it to the Internet from those same buffers.
A supplemental processor runs a component called the content transfer engine extension
(CTEX). CTEX runs extended content transfer daemons (XCTDs), which extend the
content processing functions beyond the CTE’s limited processing scope. For example,
XCTDs have more flexible access to shared data than CTDs do.
