Basic Steps of Video Processing
UNIT IV
By B. Shiny Vidya, Assistant Professor
Part 2
Contents
• Time-Varying Image Formation models
- Three-Dimensional Motion Models
- Geometric Image Formation
- Photometric Image Formation
- Sampling of Video signals
- Filtering operations
Time-Varying Image Formation models
• A time-varying image is represented by a function of three continuous variables,
Sc(x1, x2, t), which is formed by projecting a time-varying three-dimensional (3-D)
spatial scene into the two-dimensional (2-D) image plane.
• The temporal variations in the 3-D scene are usually due to the movements of objects
in the scene.
• Time-varying images reflect a projection of 3-D moving objects into the 2-D image
plane as a function of time.
• Digital video corresponds to a spatio-temporally sampled version of the time-
varying image.
Time-Varying Image Formation models
• “3-D scene modeling” refers to modeling the motion and structure of objects in 3-D.
• “Image formation,” which includes geometric and photometric image formation, refers
to mapping the 3-D scene into an image plane intensity distribution.
• Geometric image formation considers the projection of the 3-D scene into the 2-D
image plane.
• Photometric image formation models variations in the image plane intensity distribution
due to changes in the scene illumination in time as well as the photometric effects of the
3-D motion.
Fig: Digital video formation
Three - Dimensional Motion Models
• It comprises modeling the relative 3-D motion between the camera and the objects in the
scene.
• This includes the 3-D motion of the objects in the scene, such as translation and
rotation, as well as the 3-D motion of the camera, such as zooming and panning.
• Models are presented to describe the relative motion of a set of 3-D object points and the
camera, in the Cartesian coordinate system (X1, X2, X3) and the homogeneous
coordinate system (kX1, kX2, kX3, k), respectively.
• The depth X3 of each point appears as a free parameter in the mathematical expressions.
• According to classical kinematics, 3-D motion can be classified as
- Rigid motion: the relative distances between the set of 3-D points remain fixed as the
object evolves in time.
- Non-rigid motion: a deformable surface model is utilized in modeling the 3-D
structure.
Three - Dimensional Motion Models
Rigid Motion in the Cartesian Coordinates
It is well known that the 3-D displacement of a rigid object in the Cartesian coordinates can be
modeled by an affine transformation of the form
X' = R X + T
where R is a 3 x 3 rotation matrix and T = [T1, T2, T3]^T is a 3-D translation vector.
That is, the 3-D displacement can be expressed as the sum of a 3-D rotation and a 3-D translation.
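To make the rigid-motion model concrete, here is a minimal NumPy sketch (not from the slides) that applies X' = R X + T to a single point; the rotation angle, translation vector, and test point are arbitrary illustrative values.

```python
import numpy as np

# Rotation by 10 degrees about the X3 axis (one of the Eulerian rotations)
theta = np.deg2rad(10.0)
R = np.array([[ np.cos(theta), np.sin(theta), 0.0],
              [-np.sin(theta), np.cos(theta), 0.0],
              [ 0.0,           0.0,           1.0]])
T = np.array([1.0, -2.0, 0.5])          # 3-D translation vector (T1, T2, T3)

X = np.array([3.0, 4.0, 12.0])          # a 3-D object point (X1, X2, X3)
X_new = R @ X + T                       # rigid displacement X' = R X + T

print("displaced point:", X_new)
# Rigid motion preserves distances: |X' - T| equals |X|
print(np.linalg.norm(X_new - T), np.linalg.norm(X))
```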
Fig: Eulerian angles of rotation.
Three-dimensional rotation in the Cartesian coordinates can be
characterized either by
- The Eulerian angles of rotation about the three coordinate axes.
- An axis of rotation and an angle about this axis.
Three - Dimensional Motion Models
The Rotation Matrix
Fig: Rotation about an arbitrary axis.
Three - Dimensional Motion Models
The Eulerian angles of rotation about the three coordinate axes
The matrices that describe clockwise rotation about the individual axes, by angles θ, ψ, and φ
about the X1, X2, and X3 axes respectively, can be written as
R(θ) = [1 0 0; 0 cos θ sin θ; 0 -sin θ cos θ]
R(ψ) = [cos ψ 0 -sin ψ; 0 1 0; sin ψ 0 cos ψ]
R(φ) = [cos φ sin φ 0; -sin φ cos φ 0; 0 0 1]
(each matrix is written row by row; the signs of the sine terms depend on the chosen
rotation-direction convention). The composite rotation is obtained by multiplying the three matrices.
For finite rotation angles the order of multiplication matters, since rotation matrices do not
commute; only under the infinitesimal-rotation approximation is the composite rotation
independent of the order.
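As a numerical illustration of the remark on multiplication order, the sketch below (using the sign conventions written above) composes two of the elementary rotations in both orders; for finite angles the results differ, while for frame-to-frame sized angles they nearly coincide. The angles are illustrative.

```python
import numpy as np

def rot_x1(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0], [0, c, s], [0, -s, c]])

def rot_x3(phi):
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, s, 0], [-s, c, 0], [0, 0, 1]])

# Finite angles: the two orderings give visibly different composite rotations.
a, b = np.deg2rad(30.0), np.deg2rad(45.0)
print(np.max(np.abs(rot_x1(a) @ rot_x3(b) - rot_x3(b) @ rot_x1(a))))

# Infinitesimal angles (frame-to-frame video motion): the orderings nearly agree.
a, b = np.deg2rad(0.5), np.deg2rad(0.5)
print(np.max(np.abs(rot_x1(a) @ rot_x3(b) - rot_x3(b) @ rot_x1(a))))
```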
Three - Dimensional Motion Models
Rotation about an arbitrary axis in the Cartesian coordinates
An alternative characterization of the rotation
matrix results if the 3-D rotation is described by an
angle about an arbitrary axis through the origin,
specified by the directional cosines n1, n2, and n3.
In video imagery, the assumption of infinitesimal rotation
usually holds, since the time difference between successive
frames is on the order of 1/30 of a second.
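Since the rotation here is described by an angle θ about a unit axis with direction cosines (n1, n2, n3), a standard way to write it is the Rodrigues formula; the sketch below compares it with the first-order (infinitesimal-rotation) approximation R ≈ I + θN, where N is the skew-symmetric matrix built from the direction cosines. Axis and angle values are illustrative.

```python
import numpy as np

def skew(n):
    """Cross-product (skew-symmetric) matrix of a unit axis n = (n1, n2, n3)."""
    return np.array([[0, -n[2], n[1]],
                     [n[2], 0, -n[0]],
                     [-n[1], n[0], 0]])

def rodrigues(n, theta):
    """Exact rotation by angle theta about the unit axis n (Rodrigues formula)."""
    N = skew(n)
    return np.eye(3) + np.sin(theta) * N + (1 - np.cos(theta)) * (N @ N)

n = np.array([1.0, 2.0, 2.0]); n /= np.linalg.norm(n)   # direction cosines n1, n2, n3
theta = np.deg2rad(0.5)                                  # frame-to-frame sized rotation

R_exact = rodrigues(n, theta)
R_linear = np.eye(3) + theta * skew(n)                   # infinitesimal-rotation approximation

print(np.max(np.abs(R_exact - R_linear)))                # tiny difference for small theta
```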
Three - Dimensional Motion Models
Then the affine transformation in the Cartesian coordinates can be expressed as a linear
transformation in the homogeneous coordinates.
Rigid Motion in the Homogeneous Coordinates
X'h = A Xh, where
A = [a11 a12 a13 T1; a21 a22 a23 T2; a31 a32 a33 T3; 0 0 0 1]
The upper-left 3 x 3 submatrix [aij] carries the rotation (more generally, a scaled rotation,
[aij] = SR), and the last column carries the translation T.
(An affine transform is a linear mapping that preserves points, straight lines, and planes.)
Three - Dimensional Motion Models
Translation in the Homogeneous Coordinates
Translation can be represented as a matrix multiplication in the homogeneous coordinates,
with the translation matrix
[1 0 0 T1; 0 1 0 T2; 0 0 1 T3; 0 0 0 1]
Three - Dimensional Motion Models
Rotation in the Homogeneous Coordinates
Rotation in the homogeneous coordinates is likewise represented by a matrix multiplication,
with the 3 x 3 rotation matrix R embedded in the upper-left block of a 4 x 4 matrix whose last
row and column are (0, 0, 0, 1).
Zooming in the Homogeneous Coordinates
Zooming (scaling) is represented by multiplication with a diagonal matrix, diag(s1, s2, s3, 1).
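A small NumPy sketch of how the homogeneous-coordinate matrices above compose: translation, rotation, and zooming each become a 4 x 4 matrix, so an arbitrary sequence of them collapses into a single matrix multiplication. The specific angle, translation, and zoom factors are illustrative.

```python
import numpy as np

def translation_h(T):
    """4x4 homogeneous translation matrix."""
    M = np.eye(4); M[:3, 3] = T
    return M

def rotation_h(R):
    """4x4 homogeneous rotation matrix embedding a 3x3 rotation R."""
    M = np.eye(4); M[:3, :3] = R
    return M

def zoom_h(s1, s2, s3):
    """4x4 homogeneous zoom (scaling) matrix."""
    return np.diag([s1, s2, s3, 1.0])

theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), np.sin(theta), 0.0],
              [-np.sin(theta), np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])

# One composite matrix: zoom, then rotate, then translate (applied right to left).
A = translation_h([1.0, -2.0, 0.5]) @ rotation_h(R) @ zoom_h(1.2, 1.2, 1.2)

X_h = np.array([3.0, 4.0, 12.0, 1.0])      # homogeneous coordinates (k = 1)
print(A @ X_h)                              # transformed homogeneous point
```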
Three - Dimensional Motion Models
Deformable Motion
• Modeling the 3-D structure and motion of nonrigid objects is a complex task.
• Analysis and synthesis of nonrigid motion using deformable models is an active research area
today. In theory, according to the mechanics of deformable bodies,
the model can be extended to include 3-D nonrigid motion as
X’= (D+R)X+T
where D is an arbitrary deformation matrix.
Note that the elements of the rotation matrix are constrained to be related to the sines and cosines of
the respective angles, whereas the deformation matrix is not constrained in any way.
• Some examples of the proposed 3-D nonrigid motion models include those based on free
vibration or deformation models, and those based on constraints induced by intrinsic and
extrinsic forces.
Non rigid Motion modelling
Geometric Image Formation
Imaging systems capture 2-D projections of a time-varying 3-D scene. This projection can be
represented by a mapping from a 4-D space to a 3-D space,
where (X1, X2, X3), the 3-D world coordinates, (x1, x2), the 2-D image plane coordinates, and t,
time, are continuous variables.
There are two types of projection
- Perspective (central)
- Orthographic (parallel)
Geometric Image Formation
• Perspective projection reflects image formation using an
ideal pinhole camera according to the principles of
geometrical optics.
• All the rays from the object pass through the center of
projection, which corresponds to the center of the lens.
It is also known as “central projection.”
• The algebraic relations that describe the perspective
transformation for the configuration shown in Figure can
be obtained based on similar triangles formed by drawing
perpendicular lines from the object point (X1, X2, X3) and
the image point (x1, x2, 0) to the X3 axis, respectively.
• This leads to
Perspective Projection
Fig: Perspective projection model.
f denotes the distance from the
center of projection to the image
plane.
Geometric Image Formation
If we move the center of projection to coincide with the origin of the world coordinates, a simple change
of variables yields the following equivalent expressions:
x1 = f X1 / X3 and x2 = f X2 / X3.
Fig: Simplified perspective projection model.
We note that the perspective projection is nonlinear in the
Cartesian coordinates since it requires division by the X3
coordinate. However, it can be expressed as a linear mapping in
the homogeneous coordinates, as
xh = P Xh, with P = [f 0 0 0; 0 f 0 0; 0 0 1 0],
where Xh = (k X1, k X2, k X3, k) and xh = (l x1, l x2, l) denote the world and image plane points,
respectively, in the homogeneous coordinates (taking l = k X3 recovers the Cartesian expressions above).
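A minimal sketch of the two equivalent forms of perspective projection discussed above: the nonlinear Cartesian form x = f (X1, X2)/X3 and the linear homogeneous form xh = P Xh followed by normalization. The focal distance and test point are arbitrary.

```python
import numpy as np

f = 2.0                                            # focal distance (image plane at X3 = f)
P = np.array([[f, 0, 0, 0],
              [0, f, 0, 0],
              [0, 0, 1, 0]])                       # perspective projection in homogeneous coords

X = np.array([3.0, 4.0, 12.0])                     # world point (X1, X2, X3)

# Cartesian form: nonlinear (division by X3)
x_cart = f * X[:2] / X[2]

# Homogeneous form: a single linear mapping, followed by normalization
xh = P @ np.append(X, 1.0)
x_hom = xh[:2] / xh[2]

print(x_cart, x_hom)                               # identical results
```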
Geometric Image Formation
Orthographic Projection
• Orthographic projection is an approximation of the actual
imaging process where it is assumed that all the rays from the 3-
D object (scene) to the image plane travel parallel to each other.
• It is also called the “parallel projection.”
Fig: Orthographic projection model.
• Provided that the image plane is parallel to the X1 – X2 plane of
the world coordinate system, the orthographic projection can be
described in Cartesian coordinates as x1 = X1 and x2 = X2.
The distance of the object from the camera does
not affect the image plane intensity distribution in
orthographic projection. That is, the object always
yields the same image no matter how far away it
is from the camera.
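The depth-independence of orthographic projection can be checked directly; the sketch below projects the same (X1, X2) at two different depths under both models (all values are illustrative).

```python
import numpy as np

f = 2.0

def perspective(X):
    return f * X[:2] / X[2]      # x = f * (X1, X2) / X3

def orthographic(X):
    return X[:2]                 # x = (X1, X2), independent of depth X3

near = np.array([3.0, 4.0, 10.0])
far  = np.array([3.0, 4.0, 100.0])   # same (X1, X2), ten times farther away

print(perspective(near), perspective(far))     # image shrinks with distance
print(orthographic(near), orthographic(far))   # image unchanged with distance
```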
Photometric Image Formation
• Image intensities can be modeled as proportional to the
amount of light reflected by the objects in the scene.
• The scene reflectance function is generally assumed to
contain a Lambertian and a specular component.
• Surfaces where the specular component can be neglected
are called Lambertian surfaces.
Fig: Photometric image formation model
Lambertian Reflection Model
If a Lambertian surface is illuminated by a single point source
with uniform intensity (in time), the resulting image intensity is
given by
sc(x1, x2, t) = ρ N(t) · L
where ρ denotes the surface albedo, i.e., the fraction of the light
reflected by the surface,
L = (L1, L2, L3) is the unit vector in the mean illuminant direction,
and N(t) is the unit surface normal of the scene, at spatial location
(X1, X2, X3(X1, X2)) and time t, given by
N = (-p, -q, 1) / sqrt(1 + p^2 + q^2)
where p = ∂X3/∂x1 and q = ∂X3/∂x2 are the partial
derivatives of the depth X3(x1, x2) with respect
to the image coordinates x1 and x2,
respectively, under the orthographic
projection.
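A minimal sketch of the Lambertian model above: given the depth gradient (p, q) and the illuminant direction L, the intensity is the albedo times the cosine between N and L. Clamping negative values to zero (a surface facing away from the light) is a practical detail added here, not part of the slide's formula; the numeric values are illustrative.

```python
import numpy as np

def lambertian_intensity(p, q, L, albedo=1.0):
    """Image intensity of a Lambertian surface patch with depth gradient (p, q)."""
    N = np.array([-p, -q, 1.0]) / np.sqrt(1.0 + p**2 + q**2)   # unit surface normal
    L = np.asarray(L, dtype=float); L = L / np.linalg.norm(L)  # unit illuminant direction
    return albedo * max(0.0, N @ L)                            # clamp: no light from behind

L = (0.3, 0.2, 0.9)                          # mean illuminant direction
print(lambertian_intensity(0.0, 0.0, L))     # surface facing the camera
print(lambertian_intensity(1.0, 0.5, L))     # tilted surface: dimmer
```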
Photometric Image Formation
The illuminant direction can also be expressed in terms of the tilt angle τ and the slant angle σ as
L = (cos τ sin σ, sin τ sin σ, cos σ).
• As an object moves in 3-D, the surface normal changes as a function of time; so do the
photometric properties of the surface.
• Assuming that the mean illuminant direction L remains constant, we can express the change in
the intensity due to the photometric effects of the motion as d sc(x1, x2, t)/dt = ρ (dN(t)/dt) · L,
i.e., it is driven entirely by the change of the surface normal.
Sampling of Video signals
• The very first step of any digital video processing task is the conversion of an intrinsically
continuous video signal to a digital one.
• The digitization process consists of two steps: sampling and quantization.
• This could be implemented by a digital camera, which directly digitizes the video of a continuous
physical scene, or by digitizing an analog signal produced by an analog camera.
• We also frequently need to convert a digital video signal from one format (in terms of spatial and
temporal resolution) to another, e.g., converting a video recorded in the PAL format to the NTSC
format.
Sampling of Video signals
• For 1D signals, samples are usually taken at regular spacing.
• With 2D signals, samples are taken in a rectangular grid. In fact, one can also take
samples on a non-rectangular grid, as long as the grid has a structure that allows the
specification of the grid points using integer vectors. Mathematically, this type of
grid is known as a lattice.
Definition: A lattice Λ in K dimensions is the set of all integer linear combinations of K linearly
independent basis vectors v1, ..., vK, i.e., Λ = { n1 v1 + n2 v2 + ... + nK vK : n1, ..., nK integers }.
The matrix V = [v1 ... vK] whose columns are the basis vectors is called a generating matrix of the
lattice; the sampling density is 1/|det V|, and the reciprocal lattice has generating matrix (V^-1)^T.
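A short sketch of the lattice idea: pick a generating matrix V whose columns are the basis vectors, enumerate integer combinations to get lattice points, and compute the sampling density 1/|det V| and the reciprocal-lattice generating matrix (V^-1)^T. The rectangular and hexagonal bases below are illustrative choices matching the figure captions.

```python
import numpy as np

def lattice_points(V, n_range):
    """Generate lattice points n1*v1 + n2*v2 for integer n1, n2 in n_range."""
    pts = []
    for n1 in n_range:
        for n2 in n_range:
            pts.append(V @ np.array([n1, n2]))
    return np.array(pts)

# Rectangular lattice and a hexagonal lattice (columns are the basis vectors)
V_rect = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
V_hex  = np.array([[1.0, 0.5],
                   [0.0, np.sqrt(3) / 2]])

for name, V in [("rectangular", V_rect), ("hexagonal", V_hex)]:
    density = 1.0 / abs(np.linalg.det(V))          # samples per unit area
    V_recip = np.linalg.inv(V).T                    # generating matrix of the reciprocal lattice
    print(name, "density:", density)
    print(V_recip)
```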
Sampling of Video signals
Fig: Example of lattices and their reciprocals:
(a) a rectangular lattice; (b) a hexagonal lattice;
(c) the reciprocal of the rectangular lattice;
(d) the reciprocal of the hexagonal lattice.
Fig: Determining the Voronoi cell by drawing
equidistant lines.
Sampling of Video signals
• The two spatial dimensions and the temporal dimension are not symmetric in that they have
different characteristics and that the visual sensitivities to the spatial and temporal frequencies are
different.
• Although a video signal is continuously varying in space and time, cameras of today cannot capture
the entire signal continuously in all three dimensions.
• Most movie cameras sample the video in the temporal direction, and store a sequence of analog
video frames on films.
• On the other hand, most TV cameras capture a video sequence by sampling it in temporal and
vertical directions. The resulting signal is stored in a 1D raster scan, which is a concatenation of the
color variations along successive horizontal scan lines.
• In designing a video sampling system, two questions to be answered are:
• i) What are the necessary sampling rates for video, and
• ii) Which sampling lattice is the most efficient under a given total sampling rate.
Sampling of Video signals
• The spatial and temporal sampling resolutions must be determined first when designing a video
sampling system.
• This is governed by several factors,
- The frequency content of the underlying signal
- Visual spatial and temporal cut-off frequencies
- The capture and display device characteristics
- The affordable processing, storage, and transmission cost
• Based on the sampling theorem, the sampling rate in each dimension should be at least twice
the highest frequency along that direction.
• Human beings cannot observe spatial and temporal variations beyond certain high frequencies.
• The highest spatial and temporal frequencies that can be observed by the HVS (human visual
system) should be the driving factor in determining the sampling rates for video.
Sampling of Video signals
• For TV, which is viewed on a relatively bright display, the visual temporal and spatial thresholds call
for a frame rate of over 70 Hz and a spatial resolution of at least 30 cpd (cycles per degree).
• At a normal viewing distance of 3 times the screen height, a spatial frequency of 25 cpd translates to
about 500 line/frame.
• To sample each line, the horizontal sampling interval should approximately match the vertical
interval, so that the resulting pixels are square (i.e., have a PAR of 1). This leads to about 670
pixel/line for a display with 500 lines and an IAR (Image Aspect Ratio) of 4:3. Such sampling rates
were too high to implement practically; a back-of-the-envelope version of this arithmetic is sketched
after this list.
• In order to reduce the data rate and consequently the cost for video capture, transmission, and
display, interlaced scan was developed, which trades-off the vertical resolution for increased
temporal resolution, for a given total data rate (product of the frame rate and line rate).
• But for a high-motion scene, or one containing certain fine line patterns, interlacing can lead to
the notorious "interlacing artifacts."
• The interlaced format is retained mainly for compatibility with the analog TV system.
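A back-of-the-envelope check of the numbers quoted above; the input values are the slide's, while the raw pixel-rate line is an extra illustrative step.

```python
# Rough sampling-rate check for the quoted TV numbers (assumed illustrative values).
lines_per_frame = 500          # from the ~25 cpd spatial-resolution argument
iar = 4.0 / 3.0                # image aspect ratio
frame_rate = 70.0              # Hz, from the flicker (temporal) threshold

pixels_per_line = lines_per_frame * iar      # square pixels (PAR = 1)
print(round(pixels_per_line))                # ~667 pixel/line, i.e. about 670

raw_rate = lines_per_frame * pixels_per_line * frame_rate
print(f"{raw_rate / 1e6:.1f} Mpixel/s")      # the implied luminance sampling rate
```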
Sampling of Video signals
• For movies, because of the reduced visual sensitivity
in a movie theater where the ambient brightness is
kept low, a frame rate of 24 fps (progressive) is
usually sufficient.
• Although the original movie is captured at 24 fps,
when played back, a rotating blade is placed before
the projection lens which rotates 3 times per frame, so
that the effective playback rate is 72 fps. This is to
suppress the flicker artifacts that may be experienced
by some acute viewers.
• Old movies are made using a screen IAR of 4:3. For
more dramatic visual impact, newer movies are
usually made with an IAR of up to 2:1.
The HDTV systems further enhance the visual impact by employing a wider screen size with an IAR of
16:9, a sampling resolution of 60 frame/s, and 720 line/frame. Again, for compatibility purposes, an
interlaced format with 60 field/s and 540 line/field can also be used.
• For computer displays, much higher
temporal and spatial sampling rates are
needed.
• For example, an SVGA (Super Video
Graphics Array) class display has a frame rate
of 72 fps (progressive) and a spatial
resolution of 1024x768 pixels.
• This is to accommodate the very close
viewing distance (normally between 1
and 2 times of the picture height) and the
high-frequency content of the displayed
material (line graphics and text).
Sampling of Video signals
Sampling Video in 2D: Progressive vs. Interlaced Scans
Consider the video signal as a 2D signal in the
space spanned by the temporal and
vertical directions (the horizontal direction is
ignored).
Let Δt be the field interval and Δy the line
interval. The sampling lattices
employed by the progressive and
interlaced scans are then as shown in the figure.
In these figures, we also indicate the basis
vectors for generating each lattice.
Fig: Comparison of progressive and interlaced scans: (a)
sampling lattice for progressive scan; (b) sampling lattice
for the interlaced scan
Sampling of Video signals
Sampling Video in 2D: Progressive vs. Interlaced Scans
Fig: Comparison of progressive and interlaced scans; (c) reciprocal
lattice for progressive scan; (d) reciprocal lattice for interlaced scan.
Filled circles in (c) and (d) indicate the nearest aliasing components.
From these basis vectors, we derive
the following generating matrices for
the original lattices and the reciprocal
lattices (columns are the basis vectors; these are consistent with the observations on the next slide):
- Progressive: V1 = [2Δt 0; 0 Δy], with reciprocal U1 = [1/(2Δt) 0; 0 1/Δy]
- Interlaced: V2 = [Δt 0; Δy 2Δy], with reciprocal U2 = [1/Δt -1/(2Δt); 0 1/(2Δy)]
Note that when drawing the lattices, the spatial and
temporal dimensions are scaled in such a way that
the temporal sampling rate equals the vertical
sampling rate, i.e., 1/Δt = 1/Δy.
Sampling of Video signals
Sampling Video in 2D: Progressive vs. Interlaced Scans
Comparing the original and reciprocal lattices of these two scans, we arrive at the following
observations:
1. They have the same sampling density, i.e., d1(Λ) = d2(Λ) = 1/(2ΔtΔy).
2. They have the same nearest aliases, at 1/Δy, along the vertical frequency axis.
3. They have different nearest aliases along the temporal frequency axis (hence large-area flicker is
more of a concern for the progressive scan). For the progressive scan, the first alias occurs at 1/(2Δt),
while for the interlaced scan, it occurs at 1/Δt.
4. They have different mixed aliases.
The mixed alias is defined as the nearest off-axis alias component. A frequency component that is close to
the mixed alias causes inter-line flicker and line crawl.
For the progressive scan, the mixed alias occurs at (1/(2Δt), 1/Δy), while for the interlaced scan, it occurs at (1/(2Δt), 1/(2Δy)).
Because the interlaced scan has a mixed alias that is closer to the origin, the inter-line flicker and line crawl artifacts are
more visible in interlaced scans. These are the notorious interlacing artifacts.
5. For a signal with isotropic spectral support, the interlaced scan is more efficient.
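The observations above can be verified numerically from the generating matrices; the sketch below (with Δt and Δy normalized to 1, as in the drawing note) computes the density and the nearest reciprocal-lattice points, i.e., the nearest aliases, for both scans.

```python
import numpy as np

dt, dy = 1.0, 1.0        # field interval and line interval (normalized so 1/dt = 1/dy)

# Generating matrices (columns = basis vectors) for the two 2D sampling lattices
V_prog = np.array([[2 * dt, 0.0],
                   [0.0,     dy]])
V_int  = np.array([[dt,  0.0],
                   [dy, 2 * dy]])

for name, V in [("progressive", V_prog), ("interlaced", V_int)]:
    density = 1.0 / abs(np.linalg.det(V))        # observation 1: both equal 1/(2*dt*dy)
    U = np.linalg.inv(V).T                       # reciprocal-lattice generating matrix
    # enumerate a few reciprocal-lattice points and report the nearest nonzero ones
    pts = [U @ np.array([k, l]) for k in range(-2, 3) for l in range(-2, 3) if (k, l) != (0, 0)]
    pts.sort(key=np.linalg.norm)
    print(name, "density =", density,
          "nearest aliases (ft, fy):", [p.round(2).tolist() for p in pts[:4]])
```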
Sampling of Video signals
Sampling of a raster Scan: BT.601 format
• A raster is a 1D signal consisting of successive horizontal scan-lines in successive frames (fields for an
interlaced raster). Therefore, the sampling interval along the scan line directly determines the
horizontal sampling interval.
• To determine the sampling interval, several factors need to be taken into account.
- First, the resulting horizontal sampling spacing should match the vertical spacing
between scan lines, so that the sampling frequencies in the horizontal and vertical directions are
similar.
- Second, the resulting samples in the 3D space should follow the desired sampling lattice.
• A monochrome video raster has a single luminance component.
• For a color video raster, including one luminance and two chrominance components, a straightforward
approach is to use the same sampling frequency for all components.
Note: To sample a composite color video, one needs to first separate the individual color components and then
perform sampling.
Sampling of Video signals
Sampling Video in 3D
• In the sampling schemes described in 2D, the horizontal samples are aligned vertically in all fields.
Such a sampling scheme is by no means optimal.
• One can also take samples in an interlaced or more complicated pattern over the x-y plane. More
generally, one can directly sample the 3D space using a desired lattice.
• It is difficult to make a camera that can implement a complicated 3D sampling structure. However,
one can first acquire samples using a dense but simply structured lattice and then down-convert it
to the desired lattice.
• We will assume that the frequency axes fx, fy, ft are normalized by the maximum frequencies of the
signal, i.e., divided by fx,max, fy,max, and ft,max, respectively. The horizontal, vertical, and temporal
axes x, y, t are correspondingly scaled (multiplied) by fx,max, fy,max, and ft,max. (The following
discussion is in terms of these scaled, unit-less variables.)
Sampling of Video signals
Sampling Video in 3D
• Consider the sampling of a progressively scanned raster with frame interval Δt and line interval Δy. If
the samples are aligned vertically with a horizontal interval of Δx, then the equivalent 3D sampling
lattice is simply cubic or orthorhombic (denoted by ORT).
• To avoid aliasing in these normalized units (where the maximum signal frequency is 1 along each axis),
we need Δx = Δy = Δt = 1/2, which gives a sampling density d(ORT) = 1/(Δx Δy Δt) = 8. A quick
numerical check is given after the figure caption below.
Fig: 3D video sampling lattices and their reciprocals. Left is the sampling lattice, and right is the
reciprocal lattice. The matrix V indicates the generating matrix for the sampling lattice.
Refer to Appendix for more info.
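A quick numerical check of the ORT lattice figures above, assuming the normalized Nyquist intervals Δx = Δy = Δt = 1/2:

```python
import numpy as np

# Orthorhombic (simple cubic) 3D sampling lattice with the Nyquist intervals above.
dx = dy = dt = 0.5
V = np.diag([dx, dy, dt])                 # generating matrix of the ORT lattice

density = 1.0 / abs(np.linalg.det(V))     # samples per unit volume of (x, y, t)
print(density)                            # 8.0, i.e. d(ORT) = 8

U = np.linalg.inv(V).T                    # reciprocal lattice: aliases every 2 along each axis
print(np.diag(U))                         # [2. 2. 2.] -- outside the unit-frequency baseband
```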
Filtering Operations
Filtering Operations in Cameras
• Consider a camera that samples a continuously varying scene at regular intervals Δx, Δy, and Δt in the
horizontal, vertical, and temporal directions, respectively. This corresponds to using a simple cubic lattice.
• The sampling frequencies are fs,x = 1/Δx, fs,y = 1/Δy, and fs,t = 1/Δt.
• The ideal pre-filter should be a low-pass filter with cut-off frequencies at half of the sampling frequencies.
• A video camera typically accomplishes a certain degree of pre-filtering in the capturing process.
• The intensity values read out at any frame instant are not the sensed values at that time, but rather the
average of the sensed signal over a certain time interval, Δe, known as the exposure time.
• Therefore, the camera is applying a pre-filter in the temporal domain with an impulse response of
the form h(t) = 1/Δe for 0 ≤ t ≤ Δe, and 0 otherwise (a box function of duration Δe).
1. Temporal Aperture
Filtering Operations
• We can see that H(ft) reaches zero at ft = 1/Δe.
• Recall that 1/Δt is the temporal sampling rate and that the ideal pre-filter for
this task is a low-pass filter with cut-off frequency at half of the sampling rate.
• By choosing Δe close to Δt, the camera can suppress temporal alias components near
the sampling rate (the zero of H(ft) at 1/Δe then falls near 1/Δt).
• But too large Δe will make the signal blurred.
• In practice, the effect of blurring is sometimes more noticeable than aliasing.
• Therefore, the exposure time Δe must be chosen to reach a proper trade-off
between aliasing and blurring effects.
1. Temporal Aperture
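A small sketch of the temporal-aperture trade-off discussed above: the box (averaging) aperture of duration Δe has the magnitude response |sinc(f Δe)|, so a longer exposure suppresses components near the sampling rate at the cost of extra in-band attenuation. The 60 Hz sampling rate and the two exposure choices are illustrative.

```python
import numpy as np

dt = 1.0 / 60.0                      # temporal sampling interval (e.g., 60 fields per second)

def box_aperture_response(f, de):
    """Magnitude response of the box (averaging) temporal aperture of duration de."""
    return np.abs(np.sinc(f * de))   # np.sinc(x) = sin(pi*x)/(pi*x); first zero at f = 1/de

for de in (dt, dt / 4.0):            # full exposure vs. a short (1/4) exposure
    print(f"exposure {de * 1e3:.2f} ms: "
          f"|H(30 Hz)| = {box_aperture_response(30.0, de):.3f} (in-band), "
          f"|H(60 Hz)| = {box_aperture_response(60.0, de):.3f} (near sampling rate)")
```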
Filtering Operations
• In addition to temporal integration, the camera also performs spatial integration.
• The value read out at any pixel (a position on a scan line with a tube-based camera, or a sensor
element in a CCD [Charge Coupled Device] camera) is not the optical signal at that point alone, but rather a
weighted integration of the signals in a small window surrounding it, called the aperture.
• The shape of the aperture and the weighting values constitute the camera spatial aperture function.
This aperture function serves as the spatial pre-filter, and its Fourier transform is known as the
modulation transfer function (MTF) of the camera.
• With most cameras, the spatial aperture function can be approximated by a circularly symmetric
Gaussian function, h(x, y) = (1/(2πσ²)) exp(−(x² + y²)/(2σ²)), whose Fourier transform (the MTF) is
also Gaussian.
2. Spatial Aperture
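A minimal sketch of a Gaussian spatial aperture and its MTF (the width σ is an assumed, illustrative value); it shows how the same filter only partially attenuates frequencies near the folding (half-sampling) frequency while largely preserving low frequencies, which is the pass-band/stop-band compromise discussed on the next slide.

```python
import numpy as np

sigma = 0.5                          # aperture width in pixel units (assumed value)

def gaussian_aperture(x, y, sigma):
    """Circularly symmetric Gaussian spatial aperture function h(x, y)."""
    return np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)

def mtf(fx, fy, sigma):
    """Its Fourier transform (the camera MTF), also a Gaussian."""
    return np.exp(-2.0 * (np.pi * sigma)**2 * (fx**2 + fy**2))

# Attenuation at the folding (half-sampling) frequency, 0.5 cycle/pixel:
print(mtf(0.5, 0.0, sigma))          # partial attenuation: some blur, some residual aliasing
print(mtf(0.1, 0.0, sigma))          # low frequencies inside the pass-band are nearly preserved
```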
Filtering Operations
3. Combined Aperture
Combined Aperture: the overall camera aperture function (pre-filter) is the separable product of the
temporal and spatial apertures, h(x, y, t) = ha(x, y) ht(t).
The impulse response of a camera with Δe = Δt = 1/60 second and fs,y = 480 (line/picture-height) is
shown in Fig. (a); its frequency response is given in Fig. (b). Only the fy – ft plane at fx = 0 is shown.
Fig: The aperture function of
a typical camera. (a) The
impulse response; (b) The
frequency response.
Filtering Operations
• On the one hand, the combined aperture attenuates frequency components inside the desired pass-
band, thereby reducing the signal resolution unnecessarily; on the other hand, it does not
completely remove the frequency components in the desired stop-band, which causes
aliasing in the sampled signal.
• It has been found that the human eye is more annoyed by the loss of resolution than the aliasing
artifact.
• This is partly because the aliasing artifact only causes noticeable visual artifacts when the image
contains high-frequency periodic patterns that are close to the lowest aliasing frequency, which are
rare in natural scene imagery. For this reason, preservation of the signal in the pass-band is more
important than suppression of the signal outside the pass-band.
3. Combined Aperture
Filtering Operations
Filtering Operations in Displays
• In a CRT display, an electron gun sweeps an electron beam across the screen line by line, striking phosphors with
intensities proportional to the intensity of the video signal at the corresponding locations.
• To display a color image, three beams are emitted by three separate guns, striking red, green, and blue
phosphors with the desired intensity combination at each location.
• The beam thickness essentially determines the vertical filtering: a very thin beam will make the image
look sharper but will also cause the scan lines to become perceptible if the viewer sits too close to the
screen; on the other hand, a thicker beam will blur the image.
• Normally, to minimize the loss of spatial resolution, thin beams are used.
• Temporal filtering is determined by the phosphors used. The P22 phosphors used in color television
decay to less than 10 percent of their peak responses in 10 μs to 1 ms.
• Fortunately, the human visual system (HVS) has a low-pass or band-pass characteristic, depending on the regime of
the temporal and spatial frequencies in the imagery.
• Therefore, the eye performs to some degree the required interpolation task. For improved performance, one can use
a digital filter to first up-convert the sampled signal to a higher resolution, which is then fed to a high resolution
display system.
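As an illustration of the up-conversion idea in the last bullet, here is a minimal sketch of temporal up-conversion by linear interpolation between successive frames; real systems may use longer interpolation filters or motion-compensated interpolation, and the frame sizes and conversion factor below are arbitrary.

```python
import numpy as np

def upconvert_temporal(frames, factor):
    """Temporal up-conversion by linear interpolation between successive frames.

    frames: array of shape (T, H, W); returns roughly factor*T frames.
    A simple interpolation filter; practical systems may use longer filters
    or motion-compensated interpolation instead.
    """
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        for k in range(factor):
            w = k / factor
            out.append((1.0 - w) * a + w * b)   # blend neighboring frames
    out.append(frames[-1])
    return np.array(out)

frames = np.random.rand(24, 4, 4)               # a tiny 24-frame test clip
print(upconvert_temporal(frames, 3).shape)      # -> (70, 4, 4): ~72 frame/s from 24 frame/s
```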
Appendix
Sampling Video in 3D: additional figures showing 3D video sampling lattices and their reciprocal lattices (figures only).
