For more Details, Feel free to contact us at any time.
Ph: 9841103123, 044-42607879, Website: http://www.tsys.co.in/
Mail Id: tsysglobalsolutions2014@gmail.com.
IEEE TRANSACTIONS ON MULTIMEDIA 2016 TOPICS
Hybrid Zero Block Detection for High Efficiency Video Coding
Abstract - In this paper we propose an efficient hybrid zero block early detection method for
high efficiency video coding (HEVC). Our method detects both genuine zero blocks (GZBs) and
pseudo zero blocks (PZBs). For GZB detection, we use two sum-of-absolute-difference bounds
and one sum-of-absolute-transformed-difference threshold to decrease the GZB detection
complexity. A fast rate-distortion estimation algorithm for HEVC is proposed to improve the
PZB detection rate. Experimental results on the HM platform show that the proposed method
saves about 50% of the rate-distortion optimization (RDO) time, with negligible Bjøntegaard
delta bit rate loss. Our method is faster than other state-of-the-art ZB detection methods for
HEVC by 10%-30%.
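The two-stage idea (cheap SAD bounds first, a transformed-difference check only for borderline blocks) can be sketched as below. This is a toy Python illustration; the thresholds are invented, not the paper's quantization-dependent bounds, and real encoders use a Hadamard transform for the SATD step.

```python
def sad(block, pred):
    # Sum of absolute differences between a block and its prediction.
    return sum(abs(a - b) for a, b in zip(block, pred))

def is_zero_block(block, pred, sad_low=4, sad_high=64, satd_thresh=16):
    """Toy early zero-block test: cheap SAD bounds first, a transformed-
    difference check only when the SAD falls between the two bounds.
    All threshold values here are illustrative."""
    s = sad(block, pred)
    if s <= sad_low:       # residual certainly quantizes to all zeros
        return True
    if s >= sad_high:      # residual certainly keeps nonzero coefficients
        return False
    # Borderline case: a real encoder computes SATD via a Hadamard
    # transform here; we reuse the SAD value as a simple stand-in.
    satd = s
    return satd < satd_thresh

blocks = [([1, 2, 3, 4], [1, 2, 3, 4]),      # identical: zero block
          ([10, 0, 10, 0], [0, 10, 0, 10])]  # large residual: not zero
flags = [is_zero_block(b, p) for b, p in blocks]
```

The point of the two SAD bounds is that most blocks are decided by the cheap test alone, so the costlier transformed-difference check runs only on the ambiguous middle band.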
IEEE Transactions on Multimedia (March 2016)
Consistent Coding Scheme for Single-Image Super-Resolution Via Independent
Dictionaries
Abstract - In this paper, we present a unified frame based on collaborative representation (CR)
for single-image super-resolution (SR), which learns low-resolution (LR) and high-resolution
(HR) dictionaries independently in the training stage and adopts a consistent coding scheme
(CCS) to guarantee the prediction accuracy of HR coding coefficients during SR reconstruction.
The independent LR and HR dictionaries are learned based on CR with l2-norm regularization,
which can well describe the corresponding LR and HR patch space, respectively. Furthermore, a
mapping function is learned to map LR coding coefficients onto the corresponding HR coding
coefficients. Propagation filtering can achieve smoothing over an image while preserving image
context like edges or textural regions. Moreover, to preserve the edge structures of a super-
resolved image and suppress artifacts, a propagation filtering-based constraint and image
nonlocal self-similarity regularization are introduced into the SR reconstruction framework.
Experimental comparison with state-of-the-art single image SR algorithms validates the
effectiveness of the proposed approach.
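The l2-regularized collaborative-representation coding step has the closed form a = (DᵀD + λI)⁻¹Dᵀy. A minimal sketch for a hypothetical two-atom dictionary, solved by Cramer's rule (the learned LR/HR dictionaries, mapping function, and filtering constraints of the full method are omitted):

```python
def cr_coefficients(D, y, lam=0.1):
    """Collaborative-representation coding with l2 regularization:
    solve (D^T D + lam*I) a = D^T y for a two-atom dictionary.
    D is a list of atom columns; y is the signal to encode."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    g00 = dot(D[0], D[0]) + lam
    g01 = dot(D[0], D[1])
    g11 = dot(D[1], D[1]) + lam
    b0, b1 = dot(D[0], y), dot(D[1], y)
    det = g00 * g11 - g01 * g01          # 2x2 Cramer's rule
    return ((b0 * g11 - b1 * g01) / det, (g00 * b1 - g01 * b0) / det)

D = [[1.0, 0.0], [0.0, 1.0]]   # orthonormal toy dictionary
a = cr_coefficients(D, [2.0, 3.0], lam=0.0)
```

With λ = 0 and an orthonormal dictionary the coefficients reduce to plain inner products, which makes the toy case easy to check by hand.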
IEEE Transactions on Multimedia (March 2016)
Joint Inference of Objects and Scenes With Efficient Learning of Text-Object-Scene
Relations
Abstract - The rapid growth of web images presents new challenges as well as opportunities to
the task of image understanding. Conventional approaches rely heavily on fine-grained
annotations, such as bounding boxes and semantic segmentations, which are not available for
web-scale images. In general, images over the Internet are accompanied with descriptive texts,
which are relevant to their contents. To bridge the gap between textual and visual analysis for
image understanding, this paper presents an algorithm to learn the relations between scenes,
objects, and texts with the help of image-level annotations. In particular, the relation between the
texts and objects is modeled as the matching probability between the nouns and the object
classes, which can be solved via a constrained bipartite matching problem. On the other hand, the
relations between the scenes and objects/texts are modeled as the conditional distributions of
their co-occurrence. Built upon the learned cross-domain relations, an integrated model brings
together scenes, objects, and texts for joint image understanding, including scene classification,
object classification and localization, and the prediction of object cardinalities. The proposed
cross-domain learning algorithm and the integrated model elevate the performance of image
understanding for web images in the context of textual descriptions. Experimental results show
that the proposed algorithm significantly outperforms conventional methods in various computer
vision tasks.
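The noun-to-object-class assignment described above is a bipartite matching problem. A brute-force sketch over a hypothetical 3x3 matrix of matching probabilities (the paper solves a constrained version at scale; exhaustive search is viable only for tiny examples):

```python
from itertools import permutations

def best_matching(score):
    """Maximize the total noun-to-object matching score by brute force,
    a stand-in for the constrained bipartite matching in the abstract."""
    n = len(score)
    best, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(score[i][perm[i]] for i in range(n))
        if total > best:
            best, best_perm = total, perm
    return best_perm, best

# Hypothetical matching probabilities between 3 nouns and 3 object classes.
score = [[0.9, 0.1, 0.0],
         [0.2, 0.8, 0.1],
         [0.0, 0.3, 0.7]]
assignment, total = best_matching(score)
```

For realistic vocabulary sizes one would swap the factorial loop for the Hungarian algorithm, which solves the same assignment problem in polynomial time.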
IEEE Transactions on Multimedia (March 2016)
Blind Quality Assessment of Tone-Mapped Images Via Analysis of Information,
Naturalness, and Structure
Abstract - High dynamic range (HDR) imaging techniques have long been used effectively for
fault detection and disease diagnosis in the astronomical and medical fields, and they have
recently also gained considerable attention from the digital image processing
and computer vision communities. While HDR imaging devices are starting to have friendly
prices, HDR display devices are still out of reach of typical consumers. Due to the limited
availability of HDR display devices, in most cases tone mapping operators (TMOs) are used to
convert HDR images to standard low dynamic range (LDR) images for visualization. But
existing TMOs cannot work effectively for all kinds of HDR images, with their performance
largely depending on brightness, contrast, and structure properties of a scene. To accurately
measure and compare the performance of distinct TMOs, in this paper we develop an effective and
efficient no-reference objective quality metric which can automatically assess LDR images
created by different TMOs without access to the original HDR images. Our model is shown to be
statistically superior to recent full- and no-reference quality measures on the existing tone-
mapped image database and a new relevant database built in this work.
IEEE Transactions on Multimedia (March 2016)
Semi-Supervised Bi-Dictionary Learning for Image Classification With Smooth
Representation-Based Label Propagation
Abstract - In this paper, we propose semi-supervised bi-dictionary learning for image
classification with smooth representation-based label propagation (SRLP). Natural images
contain complex contents of multiple objects with complicated background, clutter, and
occlusions, which prevents image features from belonging to a specific category. Therefore, we
employ reconstruction-based classification to implement discriminative dictionary learning in a
probabilistic manner. We jointly learn a discriminative dictionary called anchor in the feature
space and its corresponding soft label called anchor label in the label space, where the
combination of anchor and anchor label is referred to as bi-dictionary. The learnt bi-dictionary is
utilized to bridge the semantic gap in image classification. First, SRLP constructs smoothed
reconstruction problems for bi-dictionary learning. Then, SRLP produces the reconstruction
coefficients in the feature space over the anchor to infer soft labels of samples in the label space.
Experimental results demonstrate that the proposed method is capable of learning a pair of
discriminative dictionaries for image classification in the feature and label spaces and
outperforms state-of-the-art reconstruction-based classification methods.
IEEE Transactions on Multimedia (March 2016)
A Distance-Computation-Free Search Scheme for Binary Code Databases
Abstract - Recently, binary codes have been widely used in many multimedia applications to
approximate high-dimensional multimedia features for practical similarity search due to the
highly compact data representation and efficient distance computation. While the majority of the
hashing methods aim at learning more accurate hash codes, only a few of them focus on indexing
methods to accelerate the search for binary code databases. Among these indexing methods,
most of them suffer from extremely high memory cost or extensive Hamming distance
computations. In this paper, we propose a new Hamming distance search scheme for large scale
binary code databases, which is free of Hamming distance computations to return the exact
results. Without the necessity to compare database binary codes with queries, the search
performance can be improved and databases can be externally maintained. More specifically, we
adopt the inverted multi-index data structure to index binary codes. Importantly, the Hamming
distance information embedded in the structure is utilized in the designed search scheme such
that the verification of exact results no longer relies on Hamming distance computations. As a
step further, we optimize the performance of the inverted multi-index structure by taking the
code distributions among different bits into account for index construction. Empirical results on
large-scale binary code databases demonstrate the superiority of our method over existing
approaches in terms of both memory usage and search efficiency.
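For contrast, the baseline the scheme improves on is an exact linear scan that computes one Hamming distance per database code; the paper's inverted multi-index is designed to avoid exactly this per-code computation. A sketch of that baseline on 4-bit codes stored as integers:

```python
def hamming(a, b):
    # Hamming distance of two binary codes stored as Python ints.
    return bin(a ^ b).count("1")

def linear_scan(db, query, radius):
    """Baseline exact r-neighbor search by explicit Hamming distance,
    i.e. the per-code computation an index-based scheme can avoid."""
    return [code for code in db if hamming(code, query) <= radius]

db = [0b0000, 0b0011, 0b1111, 0b0111]
hits = linear_scan(db, 0b0001, radius=1)
```

The XOR-plus-popcount distance is already very cheap per pair; the scan is expensive only because it touches every database code, which is what indexing removes.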
IEEE Transactions on Multimedia (March 2016)
QoE Evaluation of Multimedia Services Based on Audiovisual Quality and User Interest
Abstract - Quality of experience (QoE) has a significant influence on whether or not a user will
choose a service or product in a competitive market. For multimedia services, various factors in a
communication ecosystem act on users together: they stimulate different senses, induce
multidimensional perceptions of the services, and inevitably increase the difficulty of measuring
and estimating the user's QoE. In this paper, a user-centric objective
QoE evaluation model (QAVIC model for short) is proposed to estimate the user's overall QoE
for audiovisual services, which takes account of perceptual audiovisual quality (QAV) and user
interest in audiovisual content (IC) amongst influencing factors on QoE such as technology,
content, context, and user in the communication ecosystem. To predict the user interest, a
number of general viewing behaviors are considered to formulate the IC evaluation model.
Subjective tests have been conducted for training and validation of the QAVIC model. The
experimental results show that the proposed QAVIC model can estimate the user's QoE
reasonably accurately using a 5-point scale absolute category rating scheme.
IEEE Transactions on Multimedia (March 2016)
A Locality Sensitive Low-Rank Model for Image Tag Completion
Abstract - Many visual applications have benefited from the outburst of web images, yet the
imprecise and incomplete tags arbitrarily provided by users, as the thorn of the rose, may hamper
the performance of retrieval or indexing systems relying on such data. In this paper, we propose
a novel locality sensitive low-rank model for image tag completion, which approximates the
global nonlinear model with a collection of local linear models. To effectively infuse the idea of
locality sensitivity, a simple and effective pre-processing module is designed to learn suitable
representation for data partition, and a global consensus regularizer is introduced to mitigate the
risk of overfitting. Meanwhile, low-rank matrix factorization is employed as local models, where
the local geometry structures are preserved for the low-dimensional representation of both tags
and samples. Extensive empirical evaluations conducted on three datasets demonstrate the
effectiveness and efficiency of the proposed method, where our method outperforms previous
ones by a large margin.
IEEE Transactions on Multimedia (March 2016)
Compressed-Sensed-Domain L1-PCA Video Surveillance
Abstract - We consider the problem of foreground and background extraction from compressed-
sensed (CS) surveillance videos that are captured by a static CS camera. We propose, for the first
time in the literature, a principal component analysis (PCA) approach that computes directly in
the CS domain the low-rank subspace of the background scene. Rather than computing the
conventional L2-norm-based principal components, which are simply the dominant left singular
vectors of the CS-domain data matrix, we compute the principal components under an L1-norm
maximization criterion. The background scene is then obtained by projecting the CS
measurement vector onto the L1 principal components followed by total-variation (TV)
minimization image recovery. The proposed L1-norm procedure directly carries out low-rank
background representation without reconstructing the video sequence and, at the same time,
exhibits significant robustness against outliers in CS measurements compared to L2-norm PCA.
An adaptive CS-L1-PCA method is also developed for low-latency video surveillance. Extensive
experimental studies described in this paper illustrate and support the theoretical developments.
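For the first L1 principal component there is a known exact construction: r = Xb*/||Xb*||₂, where b* maximizes ||Xb||₂ over sign vectors b in {-1, +1}^N. A brute-force sketch (exponential in the sample count, so illustrative only; the toy data show the component staying on the dominant axis despite an outlier):

```python
from itertools import product

def l1_pc(X):
    """First L1 principal component via exhaustive binary search:
    r = X b* / ||X b*||, with b* = argmax over b in {-1,+1}^N of ||X b||_2.
    Exponential in the number of samples; fine for a toy example only."""
    d, n = len(X), len(X[0])
    best_norm2, best_v = -1.0, None
    for b in product((-1, 1), repeat=n):
        v = [sum(X[i][j] * b[j] for j in range(n)) for i in range(d)]
        norm2 = sum(x * x for x in v)
        if norm2 > best_norm2:
            best_norm2, best_v = norm2, v
    norm = best_norm2 ** 0.5
    return [x / norm for x in best_v]

# Four samples along the x-axis, with one outlier in y.
X = [[2.0, -2.0, 3.0, 0.0],   # x-coordinates of the samples
     [0.0, 0.0, 0.0, 1.0]]    # y-coordinates (last sample is an outlier)
r = l1_pc(X)
```

Unlike the L2 principal component, the L1 solution's direction is barely tilted by the single outlier, which is the robustness property the abstract relies on.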
IEEE Transactions on Multimedia (March 2016)
User-Service Rating Prediction by Exploring Social Users' Rating Behaviors
Abstract - With the boom of social media, it is a very popular trend for people to share what
they are doing with friends across various social networking platforms. Nowadays, we have a
vast amount of descriptions, comments, and ratings for local services. The information is
valuable for new users to judge whether the services meet their requirements before using them. In
this paper, we propose a user-service rating prediction approach by exploring social users' rating
behaviors. In order to predict user-service ratings, we focus on users' rating behaviors. In our
opinion, the rating behavior in recommender system could be embodied in these aspects: 1)
when user rated the item, 2) what the rating is, 3) what the item is, 4) what the user interest that
we could dig from his/her rating records is, and 5) how the user's rating behavior diffuses among
his/her social friends. Therefore, we propose a concept of the rating schedule to represent users'
daily rating behaviors. In addition, we propose the factor of interpersonal rating behavior
diffusion to gain a deeper understanding of users' rating behaviors. In the proposed user-service rating
prediction approach, we fuse four factors-user personal interest (related to user and the item's
topics), interpersonal interest similarity (related to user interest), interpersonal rating behavior
similarity (related to users' rating behavior habits), and interpersonal rating behavior diffusion
(related to users' behavior diffusions)-into a unified matrix-factorized framework. We conduct a
series of experiments on the Yelp and Douban Movie datasets. Experimental results show
the effectiveness of our approach.
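The unified framework is matrix-factorization based. A minimal SGD sketch of the plain base model follows; the four behavioral factors the paper fuses in are omitted, and the tiny rating data are invented:

```python
def train_mf(ratings, n_users, n_items, k=2, lr=0.01, reg=0.02, epochs=200):
    """Plain matrix factorization by stochastic gradient descent: the base
    model that the paper extends with interest and rating-behavior factors
    (those extensions are left out of this sketch)."""
    P = [[0.1] * k for _ in range(n_users)]   # user latent factors
    Q = [[0.1] * k for _ in range(n_items)]   # item latent factors
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            e = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (e * qi - reg * pu)
                Q[i][f] += lr * (e * pu - reg * qi)
    return P, Q

# Invented (user, item, rating) triples: item 0 is liked, item 1 is not.
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0)]
P, Q = train_mf(ratings, n_users=2, n_items=2)

def predict(u, i):
    return sum(P[u][f] * Q[i][f] for f in range(2))
```

Even this bare model recovers the preference ordering in the toy data; the paper's contribution is the extra behavioral terms folded into the same factorized objective.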
IEEE Transactions on Multimedia (March 2016)
A Novel Lip Descriptor for Audio-Visual Keyword Spotting Based on Adaptive Decision
Fusion
Abstract - Keyword spotting remains a challenge when applied to real-world environments with
dramatically changing noise. In recent studies, audio-visual integration methods have
demonstrated superiorities since visual speech is not influenced by acoustic noise. However, for
visual speech recognition, individual utterance mannerisms can lead to confusion and false
recognition. To solve this problem, a novel lip descriptor is presented involving both geometry-
based and appearance-based features in this paper. Specifically, a set of geometry-based features
is proposed based on an advanced facial landmark localization method. In order to obtain robust
and discriminative representation, a spatiotemporal lip feature is put forward concerning
similarities among textons and mapping the feature to intra-class subspace. Moreover, a parallel
two-step keyword spotting strategy based on decision fusion is proposed in order to make the
best use of audio-visual speech and adapt to diverse noise conditions. Weights generated using a
neural network combine acoustic and visual contributions. Experimental results on the OuluVS
dataset and PKU-AV dataset demonstrate that the proposed lip descriptor shows competitive
performance compared to the state of the art. Additionally, the proposed audio-visual keyword
spotting (AV-KWS) method based on decision-level fusion significantly improves noise
robustness, attains better performance than feature-level fusion, and remains capable of adapting
to various noisy conditions.
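At its core, decision-level fusion is a noise-adaptive weighted combination of the audio and visual classifier scores. The paper generates the weights with a neural network; the linear SNR ramp below is purely an illustrative stand-in:

```python
def fuse(audio_score, visual_score, snr_db):
    """Adaptive decision fusion toy: the visual stream's weight grows as the
    acoustic SNR drops. The linear ramp and the 30 dB knee are invented;
    the paper learns these weights with a neural network."""
    w_audio = min(1.0, max(0.0, snr_db / 30.0))  # clean audio: trust audio
    return w_audio * audio_score + (1 - w_audio) * visual_score

quiet = fuse(0.9, 0.6, snr_db=30)   # clean conditions: audio dominates
noisy = fuse(0.2, 0.6, snr_db=0)    # heavy noise: visual dominates
```

The design rationale is the one stated in the abstract: visual speech is unaffected by acoustic noise, so its relative weight should rise exactly when the audio channel degrades.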
IEEE Transactions on Multimedia (March 2016)
Collaborative Wireless Freeview Video Streaming With Network Coding
Abstract - Free viewpoint video (FVV) offers compelling interactive experience by allowing
users to switch to any viewing angle at any time. An FVV is composed of a large number of
camera-captured anchor views, with virtual views (not captured by any camera) rendered from
their nearby anchors using techniques such as depth-image-based rendering (DIBR). We
consider a group of wireless users who may interact with an FVV by independently switching
views. We study a novel live FVV streaming network where each user pulls a subset of anchors
from the server via a primary channel. To enhance anchor availability at each user, a user
generates network-coded (NC) packets using some of its anchors and broadcasts them to its
direct neighbors via a secondary channel. Given limited primary and secondary channel
bandwidths at the devices, we seek to maximize the received video quality (i.e., minimize
distortion) by jointly optimizing the set of anchors each device pulls and the anchor combination
to generate NC packets. To the best of our knowledge, this is among the first works addressing
such a joint optimization problem for wireless live FVV streaming with NC-based collaboration.
We first formulate the problem and show that it is NP-hard. We then propose a scalable and
effective algorithm called PAFV (Peer-Assisted Freeview Video). In PAFV, each node
collaboratively and distributedly decides on the anchors to pull and NC packets to share so as to
minimize video distortion in its neighborhood. Extensive simulation studies show that PAFV
outperforms other algorithms, achieving substantially lower video distortion (often by more than
20-50%) with significantly less redundancy (by as much as 70%). Our Android-based video
experiment further confirms the effectiveness of PAFV over comparison schemes.
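The collaboration step rests on network coding: a broadcast packet combines several anchors, and a neighbor that already holds all but one of them can recover the missing anchor. A minimal XOR sketch (practical NC schemes typically use random linear coding over a finite field rather than plain XOR):

```python
def nc_packet(anchors):
    """XOR-combine equal-length anchor packets into one network-coded
    broadcast packet (the simplest possible coding, for illustration)."""
    out = bytes(len(anchors[0]))
    for a in anchors:
        out = bytes(x ^ y for x, y in zip(out, a))
    return out

# A neighbor holding anchor A recovers anchor B from the coded packet.
coded = nc_packet([b"AA", b"BB"])
recovered = bytes(x ^ y for x, y in zip(coded, b"AA"))
```

One broadcast thus serves neighbors with different missing anchors at once, which is why choosing the anchor combination per coded packet becomes an optimization problem.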
IEEE Transactions on Multimedia (March 2016)
A Decision-Tree-Based Perceptual Video Quality Prediction Model and Its Application in
FEC for Wireless Multimedia Communications
Abstract - With the exponential growth of video traffic over wireless networked and embedded
devices, mechanisms are needed to predict and control perceptual video quality to meet the
quality of experience (QoE) requirements in an energy-efficient way. This paper proposes an
energy-efficient QoE support framework for wireless video communications. It consists of two
components: 1) a perceptual video quality model that allows the prediction of video quality in
real-time and with low complexity, and 2) an application layer energy-efficient and content-
aware forward error correction (FEC) scheme for preventing quality degradation caused by
network packet losses. The perceptual video quality model characterizes factors related to video
content as well as distortion caused by compression and transmission. Prediction of perceptual
quality is achieved through a decision tree using a set of observable features from the
compressed bitstream and the network. The proposed model can achieve prediction accuracy of
88.9% and 90.5% on two distinct testing sets. Based on the proposed quality model, a novel FEC
scheme is introduced to protect video packets from losses during transmission. Given a user-
defined perceptual quality requirement, the FEC scheme adjusts the level of protection for
different components in a video stream to minimize network overhead. Simulation results show
that the proposed FEC scheme can enhance the perceptual quality of videos. Compared to
conventional FEC methods for video communications, the proposed FEC scheme can reduce
network overhead by 41% on average.
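The protection mechanism is forward error correction. A single-parity XOR group is its simplest instance: one repair packet lets the receiver rebuild any one lost packet in the group. The paper's scheme additionally adapts the protection level per video component, which this sketch omits:

```python
def add_parity(packets):
    """Single-parity FEC across a group of equal-length packets: append
    one XOR repair packet (a minimal stand-in for content-aware FEC)."""
    parity = bytes(len(packets[0]))
    for p in packets:
        parity = bytes(a ^ b for a, b in zip(parity, p))
    return packets + [parity]

def recover(group, lost_index):
    # XOR of all surviving packets (including parity) rebuilds the lost one.
    rebuilt = bytes(len(group[0]))
    for j, p in enumerate(group):
        if j != lost_index:
            rebuilt = bytes(a ^ b for a, b in zip(rebuilt, p))
    return rebuilt

group = add_parity([b"ab", b"cd", b"ef"])
restored = recover(group, lost_index=1)
```

Content-aware FEC then varies the group size or code strength: perceptually critical packets get more repair symbols, filler packets fewer, which is how overhead is reduced at equal quality.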
IEEE Transactions on Multimedia (April 2016)
mDASH: A Markov Decision-Based Rate Adaptation Approach for Dynamic HTTP
Streaming
Abstract - Dynamic adaptive streaming over HTTP (DASH) has recently been widely deployed
in the Internet. It, however, does not impose any adaptation logic for selecting the quality of
video fragments requested by clients. In this paper, we propose a novel Markov decision-based
rate adaptation scheme for DASH aiming to maximize the quality of user experience under time-
varying channel conditions. To this end, our proposed method takes into account those key
factors that make a critical impact on visual quality, including video playback quality, video rate
switching frequency and amplitude, buffer overflow/underflow, and buffer occupancy. Besides,
to reduce computational complexity, we propose a low-complexity sub-optimal greedy algorithm
which is suitable for real-time video streaming. Our experiments in network test-bed and real-
world Internet all demonstrate the good performance of the proposed method in both objective
and subjective visual quality.
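A one-step greedy rate pick keeps the flavor of the low-complexity algorithm: choose the highest sustainable level and back off when the buffer runs low. The bitrate ladder, thresholds, and back-off rule below are invented for illustration; the paper's greedy algorithm optimizes a Markov-decision reward over the factors listed above instead:

```python
def greedy_rate(levels, throughput, buffer_s, target_buffer=10.0):
    """Toy one-step rate adaptation: pick the highest level sustainable at
    the measured throughput, stepping down one level when the playout
    buffer is low (all constants here are illustrative)."""
    sustainable = [r for r in levels if r <= throughput]
    choice = max(sustainable) if sustainable else min(levels)
    if buffer_s < 0.3 * target_buffer and choice != min(levels):
        # Low buffer: step down one level to guard against stalls.
        ladder = sorted(levels)
        choice = ladder[ladder.index(choice) - 1]
    return choice

levels = [350, 600, 1000, 2000, 3000]  # kbps ladder (hypothetical)
pick_ok = greedy_rate(levels, throughput=1500, buffer_s=8.0)
pick_low = greedy_rate(levels, throughput=1500, buffer_s=1.0)
```

A Markov-decision formulation generalizes this by scoring whole switching trajectories (quality, switch amplitude, stall risk) rather than one step at a time.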
IEEE Transactions on Multimedia (April 2016)
Complexity Control Based on a Fast Coding Unit Decision Method in the HEVC Video
Coding Standard
Abstract - The emerging high-efficiency video coding standard achieves higher coding
efficiency than previous standards by virtue of a set of new coding tools such as the quadtree
coding structure. In this novel structure, the pixels are organized into coding units (CU),
prediction units, and transform units, the sizes of which can be optimized at every level
following a tree configuration. These tools allow highly flexible data representation; however,
they incur a very high computational complexity. In this paper, we propose an effective
complexity control (CC) algorithm based on a hierarchical approach. An early termination
condition is defined at every CU size to determine whether subsequent CU sizes should be
explored. The actual encoding times are also considered to satisfy the target complexity in real
time. Moreover, all parameters of the algorithm are estimated on the fly to adapt its behavior to
the video content, the encoding configuration, and the target complexity over time. The
experimental results prove that our proposal is able to achieve a target complexity reduction of
up to 60% with respect to full exploration, with notable accuracy and limited losses in coding
performance. It was compared with a state-of-the-art CC method and shown to achieve a
significantly better trade-off between coding complexity and efficiency as well as higher
accuracy in reaching the target complexity. Furthermore, a comparison with a state-of-the-art
complexity reduction method highlights the advantages of our CC framework. Finally, we show
that the proposed method performs well when the target complexity varies over time.
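The hierarchical early-termination idea can be sketched as a recursion that stops exploring smaller CU sizes once the cost at the current size is already good enough. Both the cost model and the threshold below are invented; the paper estimates its parameters on the fly from the content and the target complexity:

```python
def encode_cu(depth, cost_fn, max_depth=3, threshold=100.0):
    """Hierarchical CU decision with early termination: if the rate-
    distortion cost at the current size is below a threshold, deeper
    splits are skipped (cost model and threshold are illustrative)."""
    cost = cost_fn(depth)
    tested = [depth]
    if cost < threshold or depth == max_depth:
        return cost, tested                 # early termination: stop here
    sub_cost, sub_tested = encode_cu(depth + 1, cost_fn, max_depth, threshold)
    tested += sub_tested
    return min(cost, sub_cost), tested

# Toy cost model: the cost drops by 4x with each split level.
best, visited = encode_cu(0, lambda d: 300.0 / (4 ** d))
```

Complexity control then amounts to tuning the threshold online so the fraction of skipped depths hits the encoding-time budget.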
IEEE Transactions on Multimedia (April 2016)
A Low-Power Video Recording System With Multiple Operation Modes for H.264 and
Light-Weight Compression
Abstract - An increasing demand for mobile video recording systems makes it important to
reduce power consumption and to increase battery lifetime. The H.264/AVC compression is
widely used for many video recording systems because of its high compression efficiency;
however, the complex coding structure of H.264/AVC compression requires large power
consumption. A light-weight video compression (LWC), based on discrete wavelet transform
and set partitioning in hierarchical trees, consumes less power than H.264/AVC compression
thanks to its relatively simple coding structure, although its compression efficiency is lower than
that of H.264/AVC compression. This paper proposes a low-power video recording system that
combines both the H.264/AVC encoder with high compression efficiency and LWC with low
power consumption. The LWC is used to compress video data for temporal storage while the
H.264/AVC encoder is used for permanent storage of data when some events are detected. For
further power reduction, a down-sampling operation is utilized for permanent data storage. For
an effective use of the two compressions with the down-sampling operation, an appropriate
scheme is selected according to the proportion of long-term to short-term storage and the target
bitrate. The proposed system reduces power consumption by up to 72.5% compared to that in a
conventional video recording system.
IEEE Transactions on Multimedia (April 2016)
Human Visual System-Based Saliency Detection for High Dynamic Range Content
Abstract - The human visual system (HVS) attempts to select salient areas to reduce cognitive
processing efforts. Computational models of visual attention try to predict the most relevant and
important areas of videos or images viewed by the human eye. Such models, in turn, can be
applied to areas such as computer graphics, video coding, and quality assessment. Although
several models have been proposed, only one of them is applicable to high dynamic range (HDR)
image content, and no work has been done for HDR videos. Moreover, the main shortcoming of
the existing models is that they cannot simulate the characteristics of HVS under the wide
luminous range found in HDR content. This paper addresses these issues by presenting a
computational approach to model the bottom-up visual saliency for HDR input by combining
spatial and temporal visual features. An analysis of eye movement data affirms the effectiveness
of the proposed model. Comparisons employing three well-known quantitative metrics show that
the proposed model substantially improves predictions of visual attention for HDR content.
IEEE Transactions on Multimedia (April 2016)
Multimodal Personality Recognition in Collaborative Goal-Oriented Tasks
Abstract - Incorporating research on personality recognition into computers, both from a
cognitive as well as an engineering perspective, would facilitate the interactions between humans
and machines. Previous attempts on personality recognition have focused on a variety of
different corpora (ranging from text to audiovisual data), scenarios (interviews, meetings),
channels of communication (audio, video, text), and different subsets of personality traits (out of
the five ones from the Big Five Model). Our study uses simple acoustic and visual nonverbal
features extracted from multimodal data, which have been recorded in previously uninvestigated
scenarios, and considers all five personality traits rather than just a subset. First, we look at the
human-machine interaction scenario, where we introduce the display of different “collaboration
levels.” Second, we look at the contribution of the human-human interaction (HHI) scenario to
the emergence of personality traits. Investigating the HHI scenario creates a stronger basis for
future human-agent interactions. Our goal is to study, from a computational perspective, the
emergence degree of the five personality traits in these two scenarios. The results demonstrate
the relevance of each of the two scenarios when it comes to the degree of emergence of certain
traits and the feasibility to automatically recognize personality under different conditions.
IEEE Transactions on Multimedia (April 2016)
Core Failure Mitigation in Integer Sum-of-Product Computations on Cloud Computing
Systems
Abstract - The decreasing mean-time-to-failure estimates in cloud computing systems indicate
that multimedia applications running on such environments should be able to mitigate an
increasing number of core failures at runtime. We propose a new roll-forward failure-mitigation
approach for integer sum-of-product computations, with emphasis on generic matrix
multiplication (GEMM) and convolution/crosscorrelation (CONV) routines. Our approach is
based on the production of redundant results within the numerical representation of the outputs
via the use of numerical packing. This differs from all existing roll-forward solutions that require
a separate set of checksum (or duplicate) results. Our proposal imposes a 37.5% reduction in the
maximum output bitwidth supported in comparison to integer sum-of-product realizations
performed on 32-bit integer representations, which is comparable to the bitwidth requirement of
checksum methods for multiple core failure mitigation. Experiments with state-of-the-art
GEMM and CONV routines running on a c4.8xlarge compute-optimized instance of Amazon
Web Services Elastic Compute Cloud (AWS EC2) demonstrate that the proposed approach is able
to mitigate up to one quadcore failure while achieving processing throughput that is: 1)
comparable to that of the conventional, failure-intolerant, integer GEMM and CONV routines, 2)
substantially superior to that of the equivalent roll-forward failure-mitigation method based on
checksum streams. Furthermore, when used within an image retrieval framework deployed over
a cluster of AWS EC2 spot (i.e., low-cost albeit terminatable) instances, our proposal leads to: 1)
16%-23% cost reduction against the equivalent checksum-based method and 2) more than 70%
cost reduction against conventional failure-intolerant processing on AWS EC2 on-demand (i.e.,
higher-cost albeit guaranteed) instances.
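The core trick is numerical packing: two independent results ride in one wider integer, which is where the bitwidth reduction comes from, since headroom must cover the worst-case sums. A toy sketch for nonnegative sums with a 16-bit split (the split position and data are illustrative, not the paper's GEMM/CONV packing):

```python
def packed_sums(pairs):
    """Numerical-packing sketch: two independent sums share one integer
    (low and high halves), so a redundant accumulator can cross-check
    another core's output. Valid only while both sums are nonnegative
    and stay below 2**SHIFT, i.e. no carry crosses the boundary."""
    SHIFT = 16
    acc = 0
    for x, y in pairs:
        acc += x + (y << SHIFT)        # one addition carries both operands
    low = acc & ((1 << SHIFT) - 1)     # first sum
    high = acc >> SHIFT               # second sum
    return low, high

pairs = [(1, 10), (2, 20), (3, 30)]
low, high = packed_sums(pairs)
```

Because the redundancy lives inside the output representation itself, no separate checksum stream is needed, at the cost of the reduced usable bitwidth the abstract quantifies.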
IEEE Transactions on Multimedia (April 2016)
Factorization Algorithms for Temporal Psychovisual Modulation Display
Abstract - Temporal psychovisual modulation (TPVM) is a new information display technology
which aims to generate multiple visual percepts for different viewers on a single display
simultaneously. In a TPVM system, the viewers wearing different active liquid crystal (LC)
glasses with varying transparency levels can see different images (called personal views). The
viewers without LC glasses can also see a semantically meaningful image (called shared view).
The display frames and weights for the LC glasses in the TPVM system can be computed
through nonnegative matrix factorization (NMF) with three additional constraints: the values of
images and modulation weights should have an upper bound (i.e., limited luminance of the display
and transparency level of the LC); the shared view without using viewing devices should be
considered (i.e., the sum of all basis images should be a meaningful image); and the sparsity of
modulation weights should be considered due to the material property of LC. In this paper, we
propose to solve the constrained NMF problem by a modified version of hierarchical alternating
least squares (HALS) algorithms. Through experiments, we analyze the choice of parameters in
the setup of a TPVM system. This work serves as a guideline for the practical implementation of a
TPVM display system.
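The HALS core alternates closed-form nonnegative least-squares updates on the factors. A rank-1 sketch on an exactly rank-1 toy matrix (the paper's upper-bound, shared-view, and sparsity constraints are left out of this bare core):

```python
def hals_rank1(V, iters=20):
    """Rank-1 HALS for the nonnegative factorization V ~ w h^T: each factor
    gets a closed-form least-squares update, clipped at zero. The paper
    modifies this core with bound, shared-view, and sparsity constraints."""
    m, n = len(V), len(V[0])
    w = [1.0] * m
    h = [1.0] * n
    for _ in range(iters):
        hh = sum(x * x for x in h)
        w = [max(0.0, sum(V[i][j] * h[j] for j in range(n)) / hh)
             for i in range(m)]
        ww = sum(x * x for x in w)
        h = [max(0.0, sum(V[i][j] * w[i] for i in range(m)) / ww)
             for j in range(n)]
    return w, h

V = [[2.0, 4.0], [1.0, 2.0]]   # exactly rank-1: outer product of [2,1] and [1,2]
w, h = hals_rank1(V)
err = sum((V[i][j] - w[i] * h[j]) ** 2 for i in range(2) for j in range(2))
```

On an exactly rank-1 input the alternation converges in a couple of sweeps; in the TPVM setting the basis vectors play the role of display frames and the coefficients the role of LC-glasses weights.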
IEEE Transactions on Multimedia (April 2016)
Free-Energy Principle Inspired Video Quality Metric and Its Use in Video Coding
Abstract - In this paper, we extend the free-energy principle to video quality assessment (VQA)
by incorporating recent psychophysical findings on human visual speed perception
(HVSP). A novel video quality metric, namely the free-energy principle inspired video quality
metric (FePVQ), is therefore developed and applied to perceptual video coding optimization. The
free-energy principle suggests that the human visual system (HVS) can actively predict “orderly”
information and avoid “disorderly” information for image perception. Basically, “orderly” is
associated with the skeletons and edges of objects, and “disorderly” mostly concerns textures in
images. Based on this principle, an image is separated into orderly and disorderly regions, and
processed differently in image quality assessment. For videos, visual attention, or fixation, is
associated with objects exhibiting significant motion according to HVSP, resulting in a motion
strength factor in the FePVQ, so that the free-energy principle is extended into the spatio-temporal
domain for VQA. In addition, we investigate the application of the FePVQ in perceptual rate
distortion optimization (RDO). For this purpose, the FePVQ is realized with low computational
cost by using the relative total variation model and the block-wise motion vectors of video
coding to simulate the free-energy principle and the HVSP, respectively. The experimental
results indicate that the proposed FePVQ is highly consistent with human visual perception. The
linear correlation coefficient and Spearman's rank-order correlation coefficient are up to 0.8324
and 0.8281 on the LIVE video database. Better perceptual quality of encoded video sequences is
achieved by FePVQ-motivated RDO in video coding.
IEEE Transactions on Multimedia (April 2016)
Holons Visual Representation for Image Retrieval
Abstract - Along with the growth of image collections, conventional local features, such as
SIFT, are ineffective for representation or indexing, and more compact visual representations are
required. Due to its intrinsic mechanism, the state-of-the-art vector of locally aggregated
descriptors (VLAD) has a few limitations. Based on this, we propose a new descriptor named holons
visual representation (HVR). The proposed HVR is a derivative mutational self-contained
combination of global and local information. It exploits both global characteristics and the
statistical information of local descriptors in the image dataset. It also takes advantage of the local
features of each image and computes their distribution with respect to the entire local descriptor
space. Accordingly, the HVR is computed by a two-layer hierarchical scheme, which splits the
local feature space and obtains raw partitions, as well as the corresponding refined partitions.
Then, according to the distances from the centroids of partition spaces to local features and their
spatial correlation, we assign the local features into their nearest raw partitions and refined
partitions to obtain the global description of an image. Compared with VLAD, HVR holds
critical structure information and enhances the discriminative power of individual representation
with a small amount of computation cost, while using the same memory overhead. Extensive
experiments on several benchmark datasets demonstrate that the proposed HVR outperforms
conventional approaches in terms of scalability as well as retrieval accuracy for images with
similar intra local information.
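The two-layer assignment-and-aggregation step can be sketched roughly as below. This is a simplified, VLAD-style residual aggregation assuming precomputed raw and refined centroids; it is not the authors' exact HVR computation, and the spatial-correlation weighting is omitted.

```python
import numpy as np

def two_layer_aggregate(descs, coarse, refined):
    """Aggregate local descriptors into one global vector via a two-layer
    partition of descriptor space (a sketch of the raw/refined idea).

    descs:   (n, d) local descriptors
    coarse:  (Kc, d) raw-partition centroids
    refined: (Kc, Kr, d) refined centroids inside each raw partition
    Returns an L2-normalised vector of length Kc*Kr*d.
    """
    Kc, Kr, d = refined.shape
    agg = np.zeros((Kc, Kr, d))
    for x in descs:
        c = np.argmin(np.linalg.norm(coarse - x, axis=1))      # raw partition
        r = np.argmin(np.linalg.norm(refined[c] - x, axis=1))  # refined partition
        agg[c, r] += x - refined[c, r]                         # residual, VLAD-style
    v = agg.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Toy usage with one descriptor and hand-made centroids.
descs = np.array([[0.9, 0.1]])
coarse = np.array([[1.0, 0.0], [0.0, 1.0]])
refined = np.array([[[1.0, 0.0], [0.5, 0.0]],
                    [[0.0, 1.0], [0.0, 0.5]]])
v = two_layer_aggregate(descs, coarse, refined)
```

Compared with single-level VLAD, the refined layer stores residuals against finer centroids, which is one way the extra structural information mentioned above can be retained without extra memory per dimension.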
IEEE Transactions on Multimedia (April 2016)
Query-Adaptive Small Object Search Using Object Proposals and Shape-Aware
Descriptors
Abstract - While there has been a significant amount of work on object search and image
retrieval, the focus has primarily been on establishing effective models for whole images,
scenes, and objects occupying a large portion of an image. In this paper, we propose to leverage
object proposals to identify small and smooth-structured objects in a large image database.
Unlike popular methods that explore coarse image-level pairwise similarity, the search is
designed to exploit similarity measures at the proposal level. An effective graph-based query
expansion strategy is designed to assess each of the better-matched proposals against all its
neighbors within the same image for precise localization. Combined with a shape-aware feature
descriptor EdgeBoW, a set of more insightful edge-weights and node-utility measures, the
proposed search strategy can handle varying view angles, illumination conditions, deformation,
and occlusion efficiently. Experiments performed on a number of benchmark datasets show
the powerful and superior generalization ability of this single integrated framework in dealing
with both clutter-intensive real-life images and poor-quality binary document images with equal
dexterity.
IEEE Transactions on Multimedia (April 2016)
Folksonomy-Based Visual Ontology Construction and Its Applications
Abstract - An ontology hierarchically encodes concepts and concept relationships, and has a
variety of applications such as semantic understanding and information retrieval. Previous work
for building ontologies has primarily relied on labor-intensive human contributions or focused on
text-based extraction. In this paper, we consider the problem of automatically constructing a
folksonomy-based visual ontology (FBVO) from user-generated annotated images. A
systematic framework is proposed, consisting of three stages: concept discovery, concept
relationship extraction, and concept hierarchy construction. The noise in user-generated
tags is carefully addressed to guarantee the quality of the derived FBVO. The
constructed FBVO finally consists of 139 825 concept nodes and millions of concept
relationships by mining more than 2.4 million Flickr images. Experimental evaluations show that
the derived FBVO is of high quality and consistent with human perception. We further
demonstrate the utility of the derived FBVO in applications of complex visual recognition and
exploratory image search.
IEEE Transactions on Multimedia (April 2016)
Learning Personalized Models for Facial Expression Analysis and Gesture Recognition
Abstract - Facial expression and gesture recognition algorithms are key enabling technologies
for human-computer interaction (HCI) systems. State-of-the-art approaches for the automatic
detection of body movements and the analysis of emotions from facial features rely heavily on
advanced machine learning algorithms. Most of these methods are designed for the average user,
but the assumption “one-size-fits-all” ignores diversity in cultural background, gender, ethnicity,
and personal behavior, and limits their applicability in real-world scenarios. A possible solution
is to build personalized interfaces, which practically implies learning person-specific classifiers
and usually collecting a significant amount of labeled samples for each novel user. As data
annotation is a tedious and time-consuming process, in this paper we present a framework for
personalizing classification models which does not require labeled target data. Personalization is
achieved by devising a novel transfer learning approach. Specifically, we propose a regression
framework which exploits auxiliary (source) annotated data to learn the relation between person-
specific sample distributions and parameters of the corresponding classifiers. Then, when
considering a new target user, the classification model is computed by simply feeding the
associated (unlabeled) sample distribution into the learned regression function. We evaluate the
proposed approach in different applications: pain recognition and action unit detection using
visual data and gestures classification using inertial measurements, demonstrating the generality
of our method with respect to different input data types and basic classifiers. We also show the
advantages of our approach in terms of accuracy and computational time both with respect to
user-independent approaches and to previous personalization techniques.
IEEE Transactions on Multimedia (April 2016)
Scalable Video Event Retrieval by Visual State Binary Embedding
Abstract - With the exponential increase of media data on the web, fast media retrieval is
becoming a significant research topic in multimedia content analysis. Among the variety of
techniques, learning binary embedding (hashing) functions is one of the most popular approaches
that can achieve scalable information retrieval in large databases, and it is mainly used in the
near-duplicate multimedia search. However, most existing hashing methods are specifically
designed for near-duplicate retrieval at the visual level rather than the semantic level. In this
paper, we propose a Visual State Binary Embedding (VSBE) model to encode the video frames,
which can preserve the essential semantic information in binary matrices, to facilitate fast video
event retrieval in unconstrained cases. Compared with other video binary embedding models,
one advantage of our proposed VSBE model is that it only needs a limited number of key frames
from the training videos for hash function training, so the computational complexity is much
lower in the training phase. At the same time, we apply the pair-wise constraints generated from
the visual states to sketch the local properties of the events at the semantic level, so accuracy is
also ensured. We conducted extensive experiments on the challenging TRECVID MED dataset,
and have proved the superiority of our proposed VSBE model.
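For reference, the basic mechanism by which binary embedding enables fast retrieval can be illustrated with a generic sign-of-projection hashing baseline, shown below. VSBE instead learns its hash functions from key frames with pairwise semantic constraints; this sketch does not do that, and all names here are ours.

```python
import numpy as np

def train_projections(dim, bits, seed=0):
    """Random hyperplane projections. VSBE would learn these from key
    frames under pairwise constraints; random planes are a baseline."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((bits, dim))

def embed(X, P):
    """Map feature vectors (n, dim) to binary codes (n, bits) of 0/1."""
    return (X @ P.T > 0).astype(np.uint8)

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.count_nonzero(a != b))

# Near-duplicate frames should land close together in Hamming space.
P = train_projections(dim=64, bits=32)
rng = np.random.default_rng(1)
x = rng.standard_normal(64)
near = x + 0.001 * rng.standard_normal(64)  # slightly perturbed frame
far = rng.standard_normal(64)               # unrelated frame
cx, cn, cf = embed(np.stack([x, near, far]), P)
```

Because codes are short bit strings, ranking a large database by Hamming distance is orders of magnitude cheaper than comparing raw descriptors, which is the scalability argument the abstract relies on.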
IEEE Transactions on Multimedia (April 2016)
Link Adaptation for High-Quality Uncompressed Video Streaming in 60-GHz Wireless
Networks
Abstract - The emerging 60-GHz multigigabit-per-second wireless technology enables the
streaming of high-quality “uncompressed” video, which has been impossible with other existing
wireless technologies. To support such a resource-hungry uncompressed video streaming service
with limited wireless resources, it is necessary to design efficient link adaptation policies
selecting suitable transmission rates for the 60-GHz wireless channel environment, thus
optimizing video quality and resource management. For proper design of the link adaptation
policies, we propose a new metric, called expected peak signal-to-noise ratio (ePSNR), to
numerically estimate the video streaming quality. By using the ePSNR as a criterion, we propose
two link adaptation policies with different objectives considering unequal error protection (UEP).
The proposed link adaptation policies attempt to 1) maximize the video quality for given wireless
resources, or 2) minimize the required wireless resources while meeting a target video quality. From
the link adaptation policies, we provide a distributed resource management scheme for multiple
users to maintain satisfactory video streaming quality. Our extensive simulation results
demonstrate that the newly proposed ePSNR metric well represents the level of video
quality. It is also shown that the proposed link adaptation policies can enhance the resource
efficiency while achieving acceptable quality of the video streaming.
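The rate-selection loop driven by an ePSNR-like criterion can be sketched as below. The quality estimator here is a toy stand-in, since the paper derives ePSNR from the channel model and UEP configuration; the rate table and numbers are illustrative only.

```python
def pick_rate(rates, psnr_estimator):
    """Choose the transmission rate maximizing estimated video quality
    (an ePSNR-like criterion). 'rates' maps a rate name to a
    (throughput, loss probability) pair for the current channel."""
    return max(rates, key=lambda r: psnr_estimator(*rates[r]))

def toy_epsnr(throughput_mbps, loss_prob):
    # Illustrative only: quality grows with throughput (capped at the
    # ~3 Gb/s an uncompressed 1080p stream needs) and drops with loss.
    base = min(throughput_mbps / 3000.0, 1.0) * 45.0
    return base - 200.0 * loss_prob

# A faster rate is not always better once its loss probability is high.
rates = {"mcs1": (1000, 0.001), "mcs2": (3000, 0.01), "mcs3": (4000, 0.08)}
best = pick_rate(rates, toy_epsnr)
```

The same argmax structure serves both policies in the abstract: maximize quality for fixed resources, or scan rates for the cheapest one whose estimated quality clears a threshold.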
IEEE Transactions on Multimedia (April 2016)
Multiview and 3D Video Compression Using Neighboring Block Based Disparity Vectors
Abstract - Compression of the statistical redundancy among different viewpoints, i.e., inter-view
redundancy, is a fundamental and critical problem in multiview and three-dimensional (3D)
video coding. To exploit the inter-view redundancy, disparity vectors are required to identify
pixels of the same objects within two different views; in this way, the enhancement coding tools
can be efficiently employed as new modes in block-based video codecs to achieve higher
compression efficiency. Although disparity can be converted from depth, this is not possible in
multiview video coding, where depth information is not available. Even when depth information
is coded, relying on it breaks the so-called multiview compatibility, wherein texture views can be
decoded without depth information. To resolve this problem, in this paper, a neighboring block-based
disparity vector derivation (NBDV) method is proposed. The basic concept of NBDV is to derive
a disparity vector (DV) of a current block by utilizing the motion information of spatially and
temporally neighboring blocks predicted from another view. Through extensive experiments and
analysis, it is shown that the proposed NBDV method achieves efficient DV derivation in
state-of-the-art video codecs while maintaining multiview compatibility at relatively low
complexity. The proposed method has become an essential part of the 3D video standard
extensions of H.264/AVC and HEVC.
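The neighbor-scanning idea behind NBDV can be sketched as below. The checking order and data layout are simplified assumptions for illustration, not the normative 3D-HEVC procedure.

```python
def derive_nbdv(spatial_neighbors, temporal_neighbors):
    """Return the first available disparity vector (DV) found among
    neighboring blocks, temporal neighbors first then spatial ones
    (order simplified here relative to the standardized derivation).

    Each neighbor is a dict; a block predicted from another view
    carries a 'dv' entry (dx, dy), otherwise it has none.
    """
    for block in list(temporal_neighbors) + list(spatial_neighbors):
        if block is not None and 'dv' in block:
            return block['dv']
    return (0, 0)  # fallback zero DV when no neighbor is inter-view predicted

# Usage: one spatial neighbor was predicted from another view.
spatial = [{'mv': (1, 0)}, {'dv': (-3, 0)}]
temporal = [None, {'mv': (0, 2)}]
dv = derive_nbdv(spatial, temporal)
```

The point of the design is visible even in this toy: the DV is recovered purely from already-decoded motion information, so no depth map is needed and multiview compatibility is preserved.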
IEEE Transactions on Multimedia (April 2016)
Predicting the Performance in Decision-Making Tasks: From Individual Cues to Group
Interaction
Abstract - This paper addresses the problem of predicting the performance of decision-making
groups. Towards this goal, we evaluate the predictive power of group attributes and discussion
dynamics by using automatically extracted features, such as group members' aural and visual
cues, interaction between team members, and influence of each team member, as well as self-
reported features such as personality- and perception-related cues, hierarchical structure of the
group, and individual- and group-level task performances. We tackle the inference problem from
two angles depending on the way that features are extracted: 1) a holistic approach based on the
entire meeting, and 2) a sequential approach based on thin slices of the meeting. In the
former, key factors affecting the group performance are identified and the prediction is achieved
by support vector machines. As for the latter, we compare and contrast the classification
performance of a novel influence-model-based classifier with that of a hidden Markov model
(HMM). Experimental results indicate that the group looking cues and the influence cues are
major predictors of group performance and the influence model outperforms the HMM in almost
all experimental conditions. We also show that combining classifiers that cover distinct aspects of
the data improves the classification performance.
IEEE Transactions on Multimedia (April 2016)
Comparison and Evaluation of Sonification Strategies for Guidance Tasks
Abstract - This paper aims to reveal the efficiency of sonification strategies in terms of rapidity,
precision, and overshooting in the case of a one-dimensional guidance task. The sonification
strategies are based on the four main perceptual attributes of a sound (pitch, loudness,
duration/tempo, and timbre) and classified with respect to the presence or not of one or several
auditory references. Perceptual evaluations are used to display the strategies in a
precision/rapidity space and enable prediction of user behavior for a chosen sonification strategy.
The evaluation of sonification strategies constitutes a first step toward general guidelines for
sound design in interactive multimedia systems that involve guidance issues.
IEEE Transactions on Multimedia (April 2016)
3D Ear Identification Using Block-wise Statistics based Features and LC-KSVD
Abstract - Biometric authentication has proven to be an effective method for
recognizing a person’s identity with high confidence. In this field, the use of 3D ear shape is a
recent trend. As a biometric identifier, the ear has several inherent merits. However, although a great
deal of effort has been devoted, there is still large room for improvement in developing a
highly effective and efficient 3D ear identification approach. In this paper, we attempt to fill this
gap to some extent by proposing a novel 3D ear classification scheme that makes use of the label
consistent K-SVD (LC-KSVD) framework. As an effective supervised dictionary learning
algorithm, LC-KSVD learns a single compact discriminative dictionary for sparse coding and a
multi-class linear classifier simultaneously. To use the LC-KSVD framework, one key issue is
how to extract feature vectors from 3D ear scans. To this end, we propose a block-wise statistics
based feature extraction scheme. Specifically, we divide a 3D ear ROI into uniform blocks and
extract a histogram of surface types from each block; histograms from all blocks are then
concatenated to form the desired feature vector. Feature vectors extracted in this way are highly
discriminative and are robust to minor misalignment between samples. Experiments demonstrate
that our approach can achieve better recognition accuracy than the other state-of-the-art methods.
More importantly, its computational complexity is extremely low, making it quite suitable for
large-scale identification applications. Matlab source code is publicly available online at
http://sse.tongji.edu.cn/linzhang/LCKSVDEar/LCKSVDEar.htm.
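The block-wise feature extraction can be sketched as follows. The block grid and the number of surface types are illustrative choices of ours, not the paper's settings; the input is assumed to be a per-pixel surface-type label map computed beforehand.

```python
import numpy as np

def blockwise_histograms(surface_types, blocks=(4, 4), n_types=8):
    """Divide a 2-D map of per-pixel surface-type labels (0..n_types-1)
    into uniform blocks, histogram each block, and concatenate the
    per-block histograms into one feature vector."""
    H, W = surface_types.shape
    by, bx = blocks
    feats = []
    for i in range(by):
        for j in range(bx):
            patch = surface_types[i * H // by:(i + 1) * H // by,
                                  j * W // bx:(j + 1) * W // bx]
            hist = np.bincount(patch.ravel(), minlength=n_types).astype(float)
            feats.append(hist / max(patch.size, 1))  # normalise per block
    return np.concatenate(feats)

# Toy ROI: the bottom half carries a different surface type.
roi = np.zeros((16, 16), dtype=int)
roi[8:, :] = 3
f = blockwise_histograms(roi)  # length 4 * 4 * 8 = 128
```

Because each block is histogrammed independently, a label shifting by a few pixels usually stays within the same block, which is the intuition behind the robustness-to-misalignment claim.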
IEEE Transactions on Multimedia (May 2016)
Sketch-based Image Retrieval by Salient Contour Reinforcement
Abstract - This paper presents a sketch-based image retrieval (SBIR) algorithm. One of the main
challenges in SBIR is measuring the similarity between a sketch
and an image. To tackle this problem, we propose an SBIR approach based on salient contour
reinforcement. In our approach, we divide the image contour into two types. The first is the
global contour map. The second, called the salient contour map, helps find objects in images
that are similar to the query. In addition, based on the two contour maps, we propose a
new descriptor, the angular radial orientation partitioning (AROP) feature. It fully utilizes
the edge pixels’ orientation information in contour maps to identify spatial relationships. Our
AROP feature, based on the two candidate contour maps, is both efficient and effective at
discovering false matches of local features between the sketch and images, and can greatly improve
retrieval performance. A retrieval system based on this algorithm has been built.
The experiments on the image dataset with 0.3 million images show the effectiveness of the
proposed method, and comparisons with other algorithms are also given. Compared to the baseline,
the proposed method achieves 10% higher precision in the top-5 results.
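An angular/radial/orientation partitioning of edge pixels, in the spirit of the AROP feature, might look like the sketch below. The bin counts and the radius normalization are illustrative assumptions of ours, not the authors' exact descriptor.

```python
import numpy as np

def angular_radial_hist(edge_xy, orientations, center,
                        n_ang=8, n_rad=3, n_ori=4, r_max=1.0):
    """Histogram edge pixels over angular sectors, radial rings and
    edge-orientation bins around a center point, then flatten.

    edge_xy:      iterable of (x, y) edge-pixel positions
    orientations: edge orientation (radians) per pixel
    """
    hist = np.zeros((n_ang, n_rad, n_ori))
    cx, cy = center
    for (x, y), theta in zip(edge_xy, orientations):
        dx, dy = x - cx, y - cy
        ang = (np.arctan2(dy, dx) + np.pi) / (2 * np.pi)        # in [0, 1]
        a = min(int(ang * n_ang), n_ang - 1)                    # angular sector
        r = min(int(np.hypot(dx, dy) / r_max * n_rad), n_rad - 1)  # radial ring
        o = min(int((theta % np.pi) / np.pi * n_ori), n_ori - 1)   # orientation bin
        hist[a, r, o] += 1
    return hist.ravel()

# Two edge pixels with different positions and orientations.
edges = [(0.5, 0.0), (0.0, 0.5)]
oris = [0.0, np.pi / 2]
h = angular_radial_hist(edges, oris, center=(0.0, 0.0))
```

Encoding where an edge pixel sits (sector and ring) together with how it is oriented is what lets such a descriptor reject matches that agree in appearance but disagree in spatial layout.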
IEEE Transactions on Multimedia (May 2016)
Democratic Diffusion Aggregation for Image Retrieval
Abstract - Content-based image retrieval is an important research topic in the multimedia field.
In large-scale image search using local features, image features are encoded and aggregated into
a compact vector to avoid indexing each feature individually. In the aggregation step, sum-
aggregation is widely used in much existing work and demonstrates promising performance.
However, it is based on a strong and implicit assumption that the local descriptors of an image
are identically and independently distributed in descriptor space and on the image plane. To address this
problem, we propose a new aggregation method named democratic diffusion aggregation with
weak spatial context embedded. The main idea of our aggregation method is to re-weight the
embedded vectors before sum-aggregation by considering the relevance among local descriptors.
Different from previous work, by conducting a diffusion process on the improved kernel matrix,
we calculate the weighting coefficients more efficiently without any iterative optimization.
Besides, considering the relevance of local descriptors from different images, we also discuss an
efficient query fusion strategy which uses the initial top-ranked image vectors to enhance the
retrieval performance. Experimental results show that our aggregation method exhibits much
higher efficiency (about ×14 faster) and better retrieval accuracy compared with previous
methods, and the query fusion strategy consistently improves the retrieval quality.
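A minimal sketch of re-weighting before sum-aggregation is shown below. The inverse-similarity weights here are a simple non-iterative stand-in for the paper's diffusion-derived coefficients; the intent is only to show how bursty descriptors get down-weighted before the sum.

```python
import numpy as np

def reweighted_sum_aggregate(X):
    """Sum-aggregate row-normalised local descriptors after re-weighting
    each one by the inverse of its total similarity to the others, so
    bursty (highly mutually similar) descriptors contribute less."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    K = np.clip(Xn @ Xn.T, 0.0, None)   # non-negative similarity kernel
    w = 1.0 / K.sum(axis=1)             # descriptors in dense bursts get small weight
    v = (w[:, None] * Xn).sum(axis=0)
    return v / np.linalg.norm(v)

# Three near-duplicate descriptors plus one distinct descriptor: the
# distinct one keeps roughly the same influence as the whole burst.
X = np.array([[1.0, 0.0], [0.99, 0.01], [1.0, 0.01], [0.0, 1.0]])
v = reweighted_sum_aggregate(X)
```

Plain sum-aggregation of this example would be dominated by the three near-duplicates; after re-weighting, both directions contribute comparably, which is exactly the failure of the i.i.d. assumption the abstract points at.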
IEEE Transactions on Multimedia (May 2016)
Tag based Image Search by Social Re-Ranking
Abstract - Social media sharing websites like Flickr allow users to annotate images with free
tags, which significantly contribute to the development of the web image retrieval and
organization. Tag-based image search is an important method to find images contributed by
social users in such social websites. However, making the top-ranked results both relevant and
diverse is challenging. In this paper, we propose a social re-ranking system for tag-based
image retrieval that considers both the relevance and diversity of images. We aim to re-rank
images according to their visual information, semantic information, and social clues. The initial
results include images contributed by different social users, and each user usually contributes several
images. First, we sort these images by inter-user re-ranking: users with a higher contribution
to the given query rank higher. Then we sequentially apply intra-user re-ranking to each
ranked user’s image set, and only the most relevant image from each user’s image set is selected.
These selected images compose the final retrieved results. We build an inverted index structure
for the social image dataset to accelerate the searching process. Experimental results on Flickr
dataset show that our social re-ranking method is effective and efficient.
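The inter-user then intra-user re-ranking pipeline can be sketched as below, assuming relevance scores have already been computed from visual, semantic, and social cues. The "contribution" notion used for ordering users is a simple assumption of ours (each user's best score), not necessarily the paper's measure.

```python
def social_rerank(results):
    """Inter-user then intra-user re-ranking sketch.

    results: list of (user, image, relevance_score) tuples.
    Returns one image per user, users ordered by their contribution.
    """
    by_user = {}
    for user, image, score in results:
        by_user.setdefault(user, []).append((score, image))
    # Inter-user: order users by their best score for the query.
    ranked_users = sorted(by_user,
                          key=lambda u: max(s for s, _ in by_user[u]),
                          reverse=True)
    # Intra-user: keep only each user's single most relevant image.
    return [max(by_user[u])[1] for u in ranked_users]

results = [("alice", "a1", 0.9), ("alice", "a2", 0.4),
           ("bob", "b1", 0.7), ("carol", "c1", 0.8)]
ranking = social_rerank(results)   # one image per contributor
```

Because at most one image per user survives, the final list is diverse by construction, while the per-user and per-image scoring keeps it relevant.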
IEEE Transactions on Multimedia (May 2016)
Learning Geographical Hierarchy Features via a Compositional Model
Abstract - Image location prediction is to estimate the geolocation where an image is taken,
which is important for many image applications, such as image retrieval, image browsing and
organization. Since social image contains heterogeneous contents, such as visual content and
textual content, effectively incorporating these contents to predict location is nontrivial.
Moreover, it is observed that image content patterns and the locations where they may appear
correlate hierarchically. Traditional image location prediction methods mainly adopt a single-
level architecture and assume images are independently distributed in geographical space, which
is not directly adaptable to the hierarchical correlation. In this paper, we propose a
Geographically Hierarchical Bi-modal Deep Belief Network model (GH-BDBN), which is a
compositional learning architecture that integrates a multi-modal deep learning model with a non-
parametric hierarchical prior model. GH-BDBN learns a joint representation capturing the
correlations among different types of image content using a bi-modal DBN, with a
geographically hierarchical prior over the joint representation to model the hierarchical
correlation between image content and location. Then, an efficient inference algorithm is
proposed to learn the parameters and the hierarchical structure of geographical
locations. Experimental results demonstrate the superiority of our model for image location
prediction.
IEEE Transactions on Multimedia (May 2016)
Semantic Discriminative Metric Learning for Image Similarity Measurement
Abstract - With the arrival of the multimedia era, multimedia data has replaced textual data
for transferring information in various fields. As an important form of multimedia data, images are
widely used by many applications, such as face recognition and image classification.
Therefore, accurately annotating each image in a large image set is of vital
importance but challenging. To perform these tasks well, it is crucial to extract suitable features
that characterize the visual content of images and to learn an appropriate distance metric to measure
the similarities between images. Unfortunately, existing feature operators, such as the histogram of
gradients, local binary patterns, and color histograms, capture the visual character of images
but lack the ability to distinguish semantic information. Similarities between such features cannot
reflect the real category correlations due to the well-known semantic gap. To solve this
problem, this paper proposes a regularized distance metric framework called Semantic
Discriminative Metric Learning (SDML). SDML combines geometric mean with normalized
divergences and separates images from different classes simultaneously. The learned distance
metric treats all images from different classes equally, and distinctions between visually similar
classes with entirely different semantic content are emphasized by SDML. This procedure
ensures the consistency between dissimilarities and semantic distinctions, and avoids inaccurate
similarities incurred by the unbalanced distribution of samples. Various experiments on benchmark
image datasets show the excellent performance of the proposed method.
IEEE Transactions on Multimedia (May 2016)
6-DOF Image Localization from Massive Geo-tagged Reference Images
Abstract - The 6-DOF (Degrees Of Freedom) image localization, which aims to calculate the
spatial position and rotation of a camera, is a challenging problem for most location-based
services. In existing approaches, this problem is often tackled by finding the matches between
2D image points and 3D structure points so as to derive the location information via direct linear
transformation algorithm. However, as these 2D-to-3D based approaches need to reconstruct the
3D structure points of the scene, they may not be flexible enough to exploit massive and growing geo-
tagged data. To this end, this paper presents a novel approach for 6-DOF image localization by
fusing candidate poses relative to reference images. In this approach, we propose to localize an
input image according to the position and rotation information of multiple geo-tagged images
retrieved from a reference dataset. From the reference images, an efficient relative pose
estimation algorithm is proposed to derive a set of candidate poses for the input image. Each
candidate pose encodes the relative rotation and direction of the input image with respect to a
specific reference image. Finally, these candidate poses can be fused together by minimizing a
well-defined geometric error so that the 6-DOF location of the input image is effectively derived.
Experimental results show that our method can obtain satisfactory localization accuracy. In
addition, the proposed relative pose estimation algorithm is much faster than existing work.
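One standard way to fuse direction-only candidate poses into a position, consistent with the description above, is a least-squares ray intersection. This is a generic formulation (solve for the point minimizing summed squared distance to all rays), not necessarily the paper's exact geometry error.

```python
import numpy as np

def fuse_positions(centers, directions):
    """Least-squares intersection of rays: each geo-tagged reference
    camera at center c_i sees the query camera along unit direction d_i
    (the directions would come from relative pose estimation). Solves
    (sum_i P_i) p = sum_i P_i c_i with P_i = I - d_i d_i^T."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(centers, directions):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)   # projector orthogonal to the ray
        A += P
        b += P @ c
    return np.linalg.solve(A, b)

# Two references at known positions both "see" the query at (1, 1, 0).
centers = [np.array([0.0, 0.0, 0.0]), np.array([2.0, 0.0, 0.0])]
directions = [np.array([1.0, 1.0, 0.0]), np.array([-1.0, 1.0, 0.0])]
p = fuse_positions(centers, directions)
```

With more than two references the system is overdetermined and the solve returns the point closest to all rays, so individual noisy relative-pose estimates average out.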
IEEE Transactions on Multimedia (May 2016)
Delay-Optimized Video Traffic Routing in Software-Defined Interdatacenter Networks
Abstract - Many video streaming applications operate their geo-distributed services in the cloud,
taking advantage of superior connectivities between datacenters to push content closer to users or
to relay live video traffic between end users at a higher throughput. In the meantime, inter-
datacenter networks also carry high volumes of other types of traffic, including service
replication and data backups, e.g., for storage and email services. It is an important research topic
to optimally engineer and schedule inter-datacenter traffic, taking into account the stringent
latency requirements of video flows when transmitted along inter-datacenter links shared with
other types of traffic. Since inter-datacenter networks are usually overprovisioned, unlike prior
work that mainly aims to maximize link utilization, we propose a delay-optimized traffic routing
scheme to explicitly differentiate path selection for different sessions according to their delay
sensitivities, leading to a software-defined inter-datacenter networking overlay implemented at
the application layer. We show that our solution can yield sparse path selection by only solving
linear programs, and thus, in contrast to prior traffic engineering solutions, does not lead to
overly fine-grained traffic splitting, further reducing packet resequencing overhead and the
number of forwarding rules to be installed in each forwarding unit. Real-world experiments
based on a deployment on six globally distributed Amazon EC2 datacenters have shown that our
system can effectively prioritize and improve the delay performance of inter-datacenter video
flows at a low cost.
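For a single delay-sensitive flow in isolation, delay-differentiated path selection reduces to a minimum-delay path search. The sketch below uses Dijkstra's algorithm over per-link delays and deliberately ignores the capacity coupling between flows that the paper's linear programs handle; the topology and delays are made-up examples.

```python
import heapq

def min_delay_path(links, src, dst):
    """Dijkstra over per-link delays between datacenters.

    links: list of (a, b, one_way_delay) undirected edges.
    Returns (total_delay, [node, ...]) or (inf, []) if unreachable.
    """
    graph = {}
    for a, b, delay in links:
        graph.setdefault(a, []).append((b, delay))
        graph.setdefault(b, []).append((a, delay))
    best = {src: (0.0, [src])}
    heap = [(0.0, src, [src])]
    while heap:
        d, node, path = heapq.heappop(heap)
        if node == dst:
            return d, path
        if d > best.get(node, (float("inf"),))[0]:
            continue  # stale heap entry
        for nxt, w in graph.get(node, []):
            nd = d + w
            if nd < best.get(nxt, (float("inf"),))[0]:
                best[nxt] = (nd, path + [nxt])
                heapq.heappush(heap, (nd, nxt, path + [nxt]))
    return float("inf"), []

# Inter-datacenter links as (a, b, one-way delay in ms).
links = [("us-east", "eu", 80), ("us-east", "us-west", 60),
         ("us-west", "ap", 100), ("eu", "ap", 160)]
d, path = min_delay_path(links, "us-east", "ap")
```

In the paper's setting the interesting part is exactly what this sketch omits: multiple flows with different delay sensitivities sharing link capacity, which is why an LP over path variables is used instead of independent shortest-path searches.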
IEEE Transactions on Multimedia (May 2016)
Multiple Human Identification and Cosegmentation: A Human-Oriented CRF Approach
with Poselets
Abstract - Localizing, identifying and extracting humans with consistent appearance jointly
from a personal photo stream is an important problem and has wide applications. The strong
variations in foreground and background and irregularly occurring foreground humans make this
realistic problem challenging. Inspired by the advance in object detection, scene understanding
and image cosegmentation, in this paper we explore explicit constraints to label and segment
human objects rather than other non-human objects and “stuff”. We refer to such a problem as
Multiple Human Identification and Cosegmentation (MHIC). To identify specific human
subjects, we propose an efficient human instance detector by combining an extended color line
model with a poselet-based human detector. Moreover, to capture high level human shape
information, a novel soft shape cue is proposed. It is initialized by the human detector, then
further enhanced through a generalized geodesic distance transform, and refined finally with a
joint bilateral filter. We also propose to capture the rich feature context around each pixel by
using an adaptive cross region data structure, which gives a higher discriminative power than a
single pixel-based estimation. The high-level object cues from the detector and the shape are
then integrated with the low-level pixel cues and mid-level contour cues into a principled
conditional random field (CRF) framework, which can be efficiently solved by using fast graph
cut algorithms. We evaluate our method over a newly created NTU-MHIC human dataset, which
contains 351 images with manually annotated ground-truth segmentation. Both visual and
quantitative results demonstrate that our method achieves state-of-the-art performance for the
MHIC task.
IEEE Transactions on Multimedia (May 2016)
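The CRF framework above combines unary cues (detector, shape) with pairwise contour cues. A tiny illustrative energy of the kind graph cuts minimize, with hypothetical pixel names and costs, looks like this:

```python
# Tiny illustrative CRF energy (unary costs + Potts pairwise smoothness) of
# the kind minimized by graph cuts; a toy, not the paper's MHIC model.

def crf_energy(labels, unary, neighbors, pairwise_weight=1.0):
    """labels: {pixel: 0/1}; unary: {pixel: (cost_bg, cost_fg)};
    neighbors: list of (p, q) neighboring-pixel pairs."""
    e = sum(unary[p][labels[p]] for p in labels)
    # Potts term: penalize label disagreement between neighbors.
    e += sum(pairwise_weight for p, q in neighbors if labels[p] != labels[q])
    return e

unary = {"a": (0.1, 0.9), "b": (0.2, 0.8), "c": (0.9, 0.1)}
neighbors = [("a", "b"), ("b", "c")]
smooth = crf_energy({"a": 0, "b": 0, "c": 1}, unary, neighbors)
noisy = crf_energy({"a": 0, "b": 1, "c": 0}, unary, neighbors)
print(smooth, noisy)  # the smooth labeling has lower energy
```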
Game Theoretic Resource Allocation in Media Cloud with Mobile Social Users
Abstract - Due to the rapid increases in both the population of mobile social users and the
demand for quality of experience (QoE), providing mobile social users with satisfied multimedia
services has become an important issue. The media cloud has been shown to be an efficient solution
to this issue, allowing mobile social users to connect to it through a group of distributed
brokers. However, as the resource in the media cloud is limited, how to allocate resource
among the media cloud, brokers, and mobile social users becomes a new challenge. Therefore, in this
paper, we propose a game-theoretic resource allocation scheme for the media cloud to allocate
resource to mobile social users through brokers. First, a framework of resource allocation
among the media cloud, brokers, and mobile social users is presented. The media cloud can dynamically
determine the price of its resource and allocate it to brokers. Each mobile social user can select
a broker to connect to the media cloud, adjusting his strategy to achieve the maximum revenue,
based on the social features in the community. Next, we formulate the interactions among media
cloud, brokers and mobile social users by a four-stage Stackelberg game. In addition, through the
backward induction method, we propose an iterative algorithm to implement the proposed
scheme and obtain the Stackelberg equilibrium. Finally, simulation results show that each player
in the game can obtain its optimal strategy, at which the Stackelberg equilibrium exists stably.
IEEE Transactions on Multimedia (May 2016)
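Backward induction in a Stackelberg game means the leader anticipates the follower's best response before committing. A minimal two-stage pricing toy (an illustration with an assumed quadratic follower utility, not the paper's four-stage game) can be sketched as:

```python
# Minimal backward-induction sketch of a two-stage Stackelberg pricing game.
# Assumed follower utility: U(d) = v*d - d^2/2 - p*d  =>  best response
# d*(p) = max(v - p, 0). The leader then picks p to maximize revenue p*d*(p).

def follower_best_response(price, v=10.0):
    return max(v - price, 0.0)

def leader_best_price(v=10.0, step=0.01):
    # Backward induction: evaluate the anticipated response for each price.
    best_p, best_rev = 0.0, -1.0
    p = 0.0
    while p <= v:
        rev = p * follower_best_response(p, v)
        if rev > best_rev:
            best_p, best_rev = p, rev
        p += step
    return round(best_p, 2)

print(leader_best_price())  # analytic Stackelberg equilibrium price is v/2 = 5.0
```

The grid search stands in for the iterative algorithm; with the quadratic utility the equilibrium price is v/2.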
DPcode: Privacy-Preserving Frequent Visual Patterns Publication on Cloud
Abstract - Nowadays, cloud has become a promising multimedia data processing and sharing
platform. Many institutes and companies plan to outsource and share their large-scale video and
image datasets on cloud for scientific research and public interest. Among various video
applications, the discovery of frequent visual patterns over graphical data is an exploratory and
important technique. However, privacy concerns over the leakage of sensitive information
contained in the videos/images impede further implementation. Although the frequent visual
patterns mining (FVPM) algorithm aggregates summaries over individual frames and seems not to
pose a privacy threat, the private information contained in individual frames may still be leaked
from the statistical result. In this paper, we study the problem of privacy-preserving publishing of
graphical data FVPM on cloud. We propose the first differentially private frequent visual
patterns mining algorithm for graphical data, named DPcode. We propose a novel mechanism
that integrates the privacy-preserving visual word conversion with the differentially private
mechanism under the noise allocation strategy of the sparse vector technique. The optimized
algorithms properly allocate the privacy budgets among the different phases of the FVPM algorithm over
images and reduce the corresponding data distortion. Extensive experiments are conducted based
on datasets commonly used in visual mining algorithms. The results show that our approach
achieves high utility while satisfying a practical privacy requirement.
IEEE Transactions on Multimedia (May 2016)
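The building blocks of such a scheme are Laplace noise calibrated to sensitivity and a split of the privacy budget across phases. A hedged sketch of just these two pieces (DPcode additionally uses visual-word conversion and the sparse vector technique, which are not reproduced here):

```python
import math
import random

# Hedged sketch: Laplace mechanism over frequency counts with the privacy
# budget epsilon split across phases. Illustrative only; not DPcode itself.

def laplace_noise(scale, rng):
    u = rng.random() - 0.5
    while abs(u) >= 0.5:                  # guard against u == -0.5
        u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_counts(counts, epsilon, phases=2, rng=None):
    """Spend epsilon/phases on this phase; count queries have sensitivity 1."""
    rng = rng or random.Random(0)
    eps_phase = epsilon / phases
    return {k: v + laplace_noise(1.0 / eps_phase, rng) for k, v in counts.items()}

noisy = dp_counts({"patternA": 120, "patternB": 40}, epsilon=1.0)
print(noisy)  # counts perturbed with Laplace noise of scale 2
```

Allocating more budget to the phases that dominate error is exactly the kind of tuning the abstract refers to as reducing data distortion.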
Audio recapture detection with convolutional neural networks
Abstract - In this work, we investigate how features can be effectively learned by deep neural
networks for audio forensic problems. By providing a preliminary feature preprocessing based
on Electric Network Frequency (ENF) analysis, we propose a convolutional neural network
(CNN) for training and classification of genuine and recaptured audio recordings. Hierarchical
representations which contain levels of details of the ENF components are learned from the deep
neural networks and can be used for further classification. The proposed method works for audio
clips as short as 2 seconds, where state-of-the-art methods may fail. Experimental
results demonstrate that the proposed network yields high detection accuracy
with each ENF harmonic component represented as a single-channel input. The performance can
be further improved by a combined input representation which incorporates both the fundamental
ENF and its harmonics. The convergence property of the network and the effect of using analysis
windows of various sizes are also studied. A performance comparison against the support tensor
machine demonstrates the advantage of using CNN for the task of audio recapture detection.
Moreover, visualization of the intermediate feature maps provides some insight into what the
deep neural networks actually learn and how they make decisions.
IEEE Transactions on Multimedia (May 2016)
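The hierarchical representations mentioned above are built by stacking convolution, nonlinearity, and pooling. A single such stage over a hypothetical ENF deviation sequence, written in plain Python as an illustration (not the paper's network):

```python
# Illustrative sketch: one 1-D convolution + ReLU + max-pool stage over an
# ENF frequency-deviation sequence -- the basic block a CNN stacks to learn
# hierarchical ENF features. Toy values; not the paper's architecture.

def conv1d(signal, kernel):
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) for i in range(n - k + 1)]

def relu(xs):
    return [max(0.0, x) for x in xs]

def max_pool(xs, size=2):
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]

enf_deviation = [0.01, 0.02, -0.01, 0.05, 0.04, -0.02, 0.00, 0.03]
edge_kernel = [1.0, -1.0]  # responds to frame-to-frame ENF changes
features = max_pool(relu(conv1d(enf_deviation, edge_kernel)))
print(features)
```

In the paper the kernels are learned rather than hand-set, and each ENF harmonic can feed a separate input channel.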
A Context-aware Framework for Reducing Bandwidth Usage of Mobile Video Chats
Abstract - Mobile video chat apps offer users an approachable way to communicate with others.
As high-speed 4G networks are deployed worldwide, the number of mobile video chat app
users keeps increasing. However, video chatting on mobile devices raises financial concerns for users,
since streaming video demands high bandwidth and can use up a large amount of data in dozens
of minutes. Lowering the bandwidth usage of mobile video chats is challenging since video
quality may be compromised. In this paper, we attempt to tame this challenge. Technically, we
propose a context-aware frame rate adaption framework, named LBVC (Low-bandwidth Video
Chat). It follows a sender-receiver cooperative principle that smartly handles the trade-off
between lowering bandwidth usage and maintaining video quality. We implement LBVC by
modifying an open-source app - Linphone and evaluate it with both objective experiments and
subjective studies.
IEEE Transactions on Multimedia (May 2016)
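The sender-side half of such a frame-rate adaptation loop can be sketched as a simple policy; the thresholds and rates below are illustrative assumptions, not LBVC's actual parameters:

```python
# Hedged sketch of context-aware frame-rate adaptation in the spirit of LBVC:
# lower the frame rate when the scene is static or bandwidth is scarce,
# trading bandwidth against perceived quality. Thresholds are assumptions.

def pick_frame_rate(motion_level, bandwidth_kbps):
    """motion_level in [0, 1]; returns a target frame rate in fps."""
    if bandwidth_kbps < 200:   # scarce uplink: save aggressively
        return 5
    if motion_level < 0.2:     # near-static talking head
        return 10
    if motion_level < 0.6:     # moderate motion
        return 20
    return 30                  # high motion needs the full rate

print(pick_frame_rate(0.1, 1000))  # 10
print(pick_frame_rate(0.8, 100))   # 5
```

In the full system the receiver cooperates, e.g., by reporting its viewing context back to the sender.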
Resource Allocation With Video Traffic Prediction in Cloud-Based Space Systems
Abstract - This paper considers the resource allocation problems for video transmission in
space-based information networks. The queueing system analyzed in this study is constituted by
multiple users and a single server. The server is operated as a cloud that can sense the traffic
arrivals to each user's queue and then allocates the transmission resource and service rate for
users. The objectives are to make configurations over time to minimize the time average cost of
the system, and to minimize the waiting time of packets after they enter the queue. Meanwhile,
the constraints on the queue stability of the system must be satisfied. In this paper, we introduce
a predictive backpressure algorithm, which takes future arrivals within a certain prediction
window into account when allocating resources and deciding which packets to
serve first. In addition, this paper designs a multiresolution wavelet decomposition-based
backpropagation network for the prediction of video traffic, which exhibits the long-range
dependence property. Simulation results indicate that the delay of the queueing system can be
reduced through this prediction-based resource allocation, and the prediction accuracy for the
video traffic is improved according to the proposed prediction system.
IEEE Transactions on Multimedia (May 2016)
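The predictive backpressure idea, weighting each queue by its backlog plus forecast arrivals inside the prediction window, can be sketched in a few lines. This is a toy scheduling decision under assumed forecasts, not the paper's algorithm:

```python
# Toy sketch of a prediction-augmented backpressure decision: serve the user
# whose backlog plus predicted arrivals (within a lookahead window) is
# largest, rather than backlog alone. Illustrative assumptions throughout.

def pick_user(queues, predicted, window=2):
    """queues: {user: backlog}; predicted: {user: per-slot arrival forecasts}."""
    def weight(u):
        return queues[u] + sum(predicted[u][:window])
    return max(queues, key=weight)

queues = {"u1": 5, "u2": 4}
predicted = {"u1": [0, 0, 9], "u2": [3, 2, 0]}  # hypothetical forecasts
print(pick_user(queues, predicted))  # u2: 4+3+2 = 9 beats u1: 5+0+0 = 5
```

In the paper the forecasts come from the wavelet-decomposition backpropagation network, which captures the long-range dependence of video traffic.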
SALIC: Social Active Learning for Image Classification
Abstract - In this paper we present SALIC, an active learning method that is placed in the
context of social networks and focuses on selecting the samples that are most appropriate to
expand the training set of a binary classifier. The process of active learning can be fully
automated in this social context by replacing the human oracle with the user tagged images
obtained from social networks. However, the noisy nature of user-contributed tags adds further
complexity to the problem of sample selection since, apart from their informativeness (i.e. how
much they are expected to inform the classifier if we knew their label), our confidence about
their actual content should also be maximized (i.e. how certain the oracle is on its decision about
the contents of an image). The main contribution of this work is in proposing a probabilistic
approach for jointly maximizing the two aforementioned quantities with a view to automate the
process of active learning. Based on this approach the training set is expanded with samples that
maximize the joint probability of selecting a sample given its informativeness and our
confidence for its true content. In the examined noisy context, the oracle’s confidence is
necessary to provide a contextual-based indication of the images’ true contents, while the
samples’ informativeness is required to reduce the computational complexity and minimize the
mistakes of the unreliable oracle. We demonstrate experimentally the validity and superiority of
SALIC over various baselines and state-of-the-art methods. In addition, we show that SALIC allows us
to select training data as effectively as typical active learning, without the cost of manual
annotation. Finally, we argue that the speed-up achieved when learning actively in this social
context (where labels can be obtained without the cost of human annotation) is necessary to cope
with the continuously growing requirements of large-scale applications. In this respect, we show
experimentally that SALIC requires 10 times less training data to reach exactly the
same performance as a straightforward informativeness-agnostic learning approach.
IEEE Transactions on Multimedia (May 2016)
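The joint selection rule described above, maximizing informativeness together with oracle confidence, reduces in its simplest form to ranking candidates by the product of the two quantities. A toy stand-in for SALIC's probabilistic formulation, with hypothetical scores:

```python
# Sketch of joint sample selection: expand the training set with the sample
# maximizing informativeness x oracle (tag) confidence. A toy stand-in for
# SALIC's probabilistic formulation; scores below are hypothetical.

def select_sample(candidates):
    """candidates: list of (sample_id, informativeness, tag_confidence)."""
    return max(candidates, key=lambda c: c[1] * c[2])[0]

pool = [
    ("img1", 0.9, 0.2),  # very informative, but its social tags are noisy
    ("img2", 0.6, 0.8),  # good balance -> highest joint score (0.48)
    ("img3", 0.3, 0.9),  # reliable tags, but the classifier learns little
]
print(select_sample(pool))  # img2
```

Ranking by the product rather than either factor alone is what keeps the unreliable social oracle from dominating the selection.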
Efficient Image Sharpness Assessment Based on Content Aware Total Variation
Abstract - State-of-the-art sharpness assessment methods are mostly based on edge width,
gradient, high-frequency energy, or pixel intensity variation. Such methods take little account
of image content variation in the sharpness assessment, which makes the
sharpness metric less effective across images with different content. In this paper, we propose an
efficient no-reference image sharpness assessment called Content Aware Total Variation
(CATV) by considering the importance of image content variation in sharpness measurement. By
parameterizing the image TV statistics using Generalized Gaussian Distribution (GGD), the
sharpness measure is identified by the standard deviation, and the image content variation
evaluator is indicated by the shape parameter. However, the standard deviation is content
dependent, differing across regions with strong edges, high-frequency textures, low-frequency
textures, and blank areas. By incorporating the shape parameter to moderate the
standard deviation, we propose a content-aware sharpness metric. The experimental results show
that the proposed method is highly correlated with the human vision system and has better
sharpness assessment results than the state-of-the-art techniques on the blurred subset images of
LIVE, TID2008, CSIQ, and IVC databases. Also, our method has very low computational
complexity, making it suitable for online applications. The correlations with the subjective
scores of the four databases and a statistical significance analysis reveal that our method
outperforms previous techniques.
IEEE Transactions on Multimedia (May 2016)
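The idea of a content-moderated spread of total-variation (TV) statistics can be illustrated on a 1-D signal. The sketch below uses a crude dispersion-ratio proxy in place of a fitted GGD shape parameter, so it is an assumption-laden illustration of the principle, not the paper's procedure:

```python
import math

# Toy sketch of content-aware TV sharpness: the spread (std) of local TV
# values is the sharpness cue, moderated by a shape-like content cue.
# The shape proxy here (MAD/std) merely stands in for GGD fitting.

def tv_values(row):
    return [abs(row[i + 1] - row[i]) for i in range(len(row) - 1)]

def sharpness_score(row):
    tv = tv_values(row)
    mean = sum(tv) / len(tv)
    std = math.sqrt(sum((x - mean) ** 2 for x in tv) / len(tv))
    mad = sum(abs(x - mean) for x in tv) / len(tv)
    shape = mad / std if std > 0 else 1.0  # smaller for edge-dominated content
    return std * shape                     # content-moderated spread

sharp_edge = [0, 0, 0, 200, 200, 200]  # one strong edge
blurry = [0, 40, 80, 120, 160, 200]    # smooth ramp
print(sharpness_score(sharp_edge) > sharpness_score(blurry))  # True
```

A sharp edge concentrates the TV mass in a few large values (high spread), while blur smears it into many similar small values, which is what the CATV statistics exploit.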
SUPPORT OFFERED TO REGISTERED STUDENTS:
1. IEEE base paper.
2. Review material as per the individual's university guidelines.
3. Future enhancement.
4. Assistance in answering all critical questions.
5. Training on the programming language.
6. Complete source code.
7. Final report / document.
8. International conference / international journal publication on your project.
FOLLOW US ON FACEBOOK @ TSYS Academic Projects
Emr a scalable graph based ranking model for content-based image retrieval
 
Query adaptive image search with hash codes
Query adaptive image search with hash codesQuery adaptive image search with hash codes
Query adaptive image search with hash codes
 
Implementation of Fuzzy Logic for the High-Resolution Remote Sensing Images w...
Implementation of Fuzzy Logic for the High-Resolution Remote Sensing Images w...Implementation of Fuzzy Logic for the High-Resolution Remote Sensing Images w...
Implementation of Fuzzy Logic for the High-Resolution Remote Sensing Images w...
 
Enhancement and Segmentation of Historical Records
Enhancement and Segmentation of Historical RecordsEnhancement and Segmentation of Historical Records
Enhancement and Segmentation of Historical Records
 
A Review on Matching For Sketch Technique
A Review on Matching For Sketch TechniqueA Review on Matching For Sketch Technique
A Review on Matching For Sketch Technique
 

Recently uploaded

Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.arsicmarija21
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
 

Recently uploaded (20)

Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 

IEEE MultiMedia 2016 Title and Abstract

For more details, feel free to contact us at any time. Ph: 9841103123, 044-42607879, Website: http://www.tsys.co.in/ Mail Id: tsysglobalsolutions2014@gmail.com.

IEEE TRANSACTIONS ON MULTIMEDIA 2016 TOPICS

Hybrid Zero Block Detection for High Efficiency Video Coding

Abstract - In this paper, we propose an efficient hybrid zero-block early detection method for high efficiency video coding (HEVC). Our method detects both genuine zero blocks (GZBs) and pseudo zero blocks (PZBs). For GZB detection, we use two sum-of-absolute-difference (SAD) bounds and one sum-of-absolute-transformed-difference (SATD) threshold to reduce the GZB detection complexity. A fast rate-distortion estimation algorithm for HEVC is proposed to improve the PZB detection rate. Experimental results on the HM platform show that the proposed method saves about 50% of the rate-distortion optimization (RDO) time with negligible Bjøntegaard delta bit rate loss, and is 10%-30% faster than other state-of-the-art zero-block detection methods for HEVC.

IEEE Transactions on Multimedia (March 2016)

Consistent Coding Scheme for Single-Image Super-Resolution Via Independent Dictionaries

Abstract - In this paper, we present a unified framework based on collaborative representation (CR) for single-image super-resolution (SR), which learns low-resolution (LR) and high-resolution (HR) dictionaries independently in the training stage and adopts a consistent coding scheme (CCS) to guarantee the prediction accuracy of HR coding coefficients during SR reconstruction. The independent LR and HR dictionaries are learned based on CR with l2-norm regularization, which can well describe the corresponding LR and HR patch spaces, respectively. Furthermore, a mapping function is learned to map LR coding coefficients onto the corresponding HR coding coefficients. Propagation filtering can achieve smoothing over an image while preserving image context such as edges and textured regions.
Moreover, to preserve the edge structures of a super-resolved image and suppress artifacts, a propagation-filtering-based constraint and image nonlocal self-similarity regularization are introduced into the SR reconstruction framework. Experimental comparison with state-of-the-art single-image SR algorithms validates the effectiveness of the proposed approach.

IEEE Transactions on Multimedia (March 2016)
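The l2-regularized collaborative representation step described above has a closed-form ridge solution. A minimal sketch in Python, using a hypothetical two-atom dictionary so the 2x2 inverse can be written out explicitly (names and sizes are illustrative, not from the paper):

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cr_code(d1, d2, y, lam):
    """Closed-form l2-regularized (ridge) coding of signal y over a
    two-atom dictionary [d1, d2]: a = (D^T D + lam*I)^-1 D^T y."""
    g11 = dot(d1, d1) + lam          # Gram matrix plus regularizer
    g12 = dot(d1, d2)
    g22 = dot(d2, d2) + lam
    b1, b2 = dot(d1, y), dot(d2, y)  # correlations D^T y
    det = g11 * g22 - g12 * g12      # invert the 2x2 system directly
    return ((g22 * b1 - g12 * b2) / det,
            (g11 * b2 - g12 * b1) / det)

# With orthonormal atoms and lam = 0, coding recovers the signal's
# coordinates exactly; lam > 0 shrinks the coefficients.
a1, a2 = cr_code([1.0, 0.0], [0.0, 1.0], [3.0, 4.0], 0.0)
```

In the CCS setting, LR patches would be coded this way over the LR dictionary, and the learned mapping function would translate those coefficients to HR coefficients over the HR dictionary.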
Joint Inference of Objects and Scenes With Efficient Learning of Text-Object-Scene Relations

Abstract - The rapid growth of web images presents new challenges as well as opportunities for image understanding. Conventional approaches rely heavily on fine-grained annotations, such as bounding boxes and semantic segmentations, which are not available for web-scale images. In general, images on the Internet are accompanied by descriptive texts relevant to their contents. To bridge the gap between textual and visual analysis for image understanding, this paper presents an algorithm that learns the relations between scenes, objects, and texts with the help of image-level annotations. In particular, the relation between texts and objects is modeled as the matching probability between nouns and object classes, which can be solved as a constrained bipartite matching problem. The relations between scenes and objects/texts, on the other hand, are modeled as the conditional distributions of their co-occurrence. Built upon the learned cross-domain relations, an integrated model brings together scenes, objects, and texts for joint image understanding, including scene classification, object classification and localization, and the prediction of object cardinalities. The proposed cross-domain learning algorithm and the integrated model elevate the performance of image understanding for web images in the context of textual descriptions. Experimental results show that the proposed algorithm significantly outperforms conventional methods in various computer vision tasks.
IEEE Transactions on Multimedia (March 2016)

Blind Quality Assessment of Tone-Mapped Images Via Analysis of Information, Naturalness, and Structure

Abstract - High dynamic range (HDR) imaging techniques have long been applied effectively to fault detection and disease diagnosis in the astronomical and medical fields, and they have recently gained much attention from the digital image processing and computer vision communities. While HDR imaging devices are becoming affordable, HDR display devices remain out of reach of typical consumers. Due to this limited availability, tone-mapping operators (TMOs) are in most cases used to convert HDR images to standard low dynamic range (LDR) images for visualization. Existing TMOs, however, cannot work effectively for all kinds of HDR images, with their performance
largely depending on the brightness, contrast, and structure properties of a scene. To accurately measure and compare the performance of distinct TMOs, in this paper we develop an effective and efficient no-reference objective quality metric that can automatically assess LDR images created by different TMOs without access to the original HDR images. Our model is shown to be statistically superior to recent full- and no-reference quality measures on the existing tone-mapped image database and on a new database built in this work.

IEEE Transactions on Multimedia (March 2016)

Semi-Supervised Bi-Dictionary Learning for Image Classification With Smooth Representation-Based Label Propagation

Abstract - In this paper, we propose semi-supervised bi-dictionary learning for image classification with smooth representation-based label propagation (SRLP). Natural images contain complex contents of multiple objects with complicated background, clutter, and occlusions, which prevents image features from belonging to a single specific category. We therefore employ reconstruction-based classification to implement discriminative dictionary learning in a probabilistic manner. We jointly learn a discriminative dictionary, called an anchor, in the feature space and its corresponding soft label, called an anchor label, in the label space; the combination of anchor and anchor label is referred to as a bi-dictionary. The learnt bi-dictionary is used to bridge the semantic gap in image classification. First, SRLP constructs smoothed reconstruction problems for bi-dictionary learning. Then, SRLP produces the reconstruction coefficients in the feature space over the anchor to infer soft labels of samples in the label space.
Experimental results demonstrate that the proposed method is capable of learning a pair of discriminative dictionaries for image classification in the feature and label spaces and outperforms state-of-the-art reconstruction-based classification methods.

IEEE Transactions on Multimedia (March 2016)

A Distance-Computation-Free Search Scheme for Binary Code Databases

Abstract - Recently, binary codes have been widely used in many multimedia applications to approximate high-dimensional multimedia features for practical similarity search, owing to their highly compact data representation and efficient distance computation. While the majority of hashing methods aim at learning more accurate hash codes, only a few focus on indexing methods that accelerate search over binary code databases. Among these indexing methods,
most suffer from extremely high memory cost or extensive Hamming distance computations. In this paper, we propose a new Hamming distance search scheme for large-scale binary code databases that returns exact results while being free of Hamming distance computations. Without the need to compare database binary codes with queries, search performance can be improved and databases can be maintained externally. More specifically, we adopt the inverted multi-index data structure to index binary codes. Importantly, the Hamming distance information embedded in the structure is exploited by the search scheme so that verification of exact results no longer relies on Hamming distance computations. As a further step, we optimize the performance of the inverted multi-index structure by taking the code distributions among different bits into account during index construction. Empirical results on large-scale binary code databases demonstrate the superiority of our method over existing approaches in terms of both memory usage and search efficiency.

IEEE Transactions on Multimedia (March 2016)

QoE Evaluation of Multimedia Services Based on Audiovisual Quality and User Interest

Abstract - Quality of experience (QoE) has a significant influence on whether a user will choose a service or product in the competitive era. For multimedia services, various factors in a communication ecosystem act together on users, stimulating different senses and inducing multidimensional perceptions of the services, which inevitably increases the difficulty of measuring and estimating a user's QoE.
In this paper, a user-centric objective QoE evaluation model (QAVIC model for short) is proposed to estimate the user's overall QoE for audiovisual services. It takes into account perceptual audiovisual quality (QAV) and user interest in audiovisual content (IC) among the factors influencing QoE, such as technology, content, context, and user, in the communication ecosystem. To predict user interest, a number of general viewing behaviors are considered in formulating the IC evaluation model. Subjective tests have been conducted for training and validation of the QAVIC model. The experimental results show that the proposed QAVIC model can estimate the user's QoE reasonably accurately using a 5-point absolute category rating scale.

IEEE Transactions on Multimedia (March 2016)

A Locality Sensitive Low-Rank Model for Image Tag Completion
Abstract - Many visual applications have benefited from the outburst of web images, yet the imprecise and incomplete tags arbitrarily provided by users, as the thorn of the rose, may hamper the performance of retrieval or indexing systems relying on such data. In this paper, we propose a novel locality sensitive low-rank model for image tag completion, which approximates the global nonlinear model with a collection of local linear models. To effectively infuse the idea of locality sensitivity, a simple and effective pre-processing module is designed to learn suitable representations for data partition, and a global consensus regularizer is introduced to mitigate the risk of overfitting. Meanwhile, low-rank matrix factorization is employed for the local models, where local geometric structures are preserved in the low-dimensional representations of both tags and samples. Extensive empirical evaluations on three datasets demonstrate the effectiveness and efficiency of the proposed method, which outperforms previous ones by a large margin.

IEEE Transactions on Multimedia (March 2016)

Compressed-Sensed-Domain L1-PCA Video Surveillance

Abstract - We consider the problem of foreground and background extraction from compressed-sensed (CS) surveillance videos captured by a static CS camera. We propose, for the first time in the literature, a principal component analysis (PCA) approach that computes the low-rank subspace of the background scene directly in the CS domain. Rather than computing the conventional L2-norm-based principal components, which are simply the dominant left singular vectors of the CS-domain data matrix, we compute the principal components under an L1-norm maximization criterion.
The background scene is then obtained by projecting the CS measurement vector onto the L1 principal components, followed by total-variation (TV) minimization image recovery. The proposed L1-norm procedure carries out low-rank background representation directly, without reconstructing the video sequence, and at the same time exhibits significant robustness against outliers in CS measurements compared to L2-norm PCA. An adaptive CS-L1-PCA method is also developed for low-latency video surveillance. Extensive experimental studies described in this paper illustrate and support the theoretical developments.

IEEE Transactions on Multimedia (March 2016)

User-Service Rating Prediction by Exploring Social Users' Rating Behaviors
Abstract - With the boom of social media, it is a very popular trend for people to share what they are doing with friends across various social networking platforms. Nowadays we have a vast amount of descriptions, comments, and ratings for local services, and this information is valuable for new users to judge whether a service meets their requirements before partaking. In this paper, we propose a user-service rating prediction approach that explores social users' rating behaviors. In our view, rating behavior in a recommender system is embodied in these aspects: 1) when the user rated the item, 2) what the rating is, 3) what the item is, 4) what user interest can be mined from his/her rating records, and 5) how the user's rating behavior diffuses among his/her social friends. We therefore propose the concept of a rating schedule to represent users' daily rating behaviors, and introduce the factor of interpersonal rating behavior diffusion to deepen the understanding of users' rating behaviors. In the proposed approach, we fuse four factors into a unified matrix-factorization framework: user personal interest (related to the user and the item's topics); interpersonal interest similarity (related to user interest); interpersonal rating behavior similarity (related to users' rating habits); and interpersonal rating behavior diffusion (related to users' behavior diffusion). We conduct a series of experiments on the Yelp dataset and the Douban Movie dataset. Experimental results show the effectiveness of our approach.
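The unified matrix-factorization framework above fuses several behavioral factors. As a much-simplified illustration of the underlying machinery only (the paper's model additionally fuses the four interpersonal factors), a plain rating matrix factorization trained by stochastic gradient descent can be sketched as:

```python
import random

def train_mf(ratings, n_users, n_items, k=2, lr=0.05, epochs=2000):
    """Factorize observed (user, item, rating) triples into latent
    factor matrices P (users) and Q (items) by SGD on squared error."""
    random.seed(0)
    P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * err * qi   # gradient step on user factor
                Q[i][f] += lr * err * pu   # gradient step on item factor
    return P, Q

# Toy data: two users rating two items on a 1-5 scale.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 2.0)]
P, Q = train_mf(ratings, n_users=2, n_items=2)
predict = lambda u, i: sum(P[u][f] * Q[i][f] for f in range(2))
```

A production model would also regularize the factors and add bias terms; here everything beyond plain SGD is omitted for brevity.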
IEEE Transactions on Multimedia (March 2016)

A Novel Lip Descriptor for Audio-Visual Keyword Spotting Based on Adaptive Decision Fusion

Abstract - Keyword spotting remains a challenge when applied to real-world environments with dramatically changing noise. In recent studies, audio-visual integration methods have demonstrated their superiority, since visual speech is not influenced by acoustic noise. However, for visual speech recognition, individual utterance mannerisms can lead to confusion and false recognition. To solve this problem, a novel lip descriptor involving both geometry-based and appearance-based features is presented in this paper. Specifically, a set of geometry-based features is proposed based on an advanced facial landmark localization method. To obtain a robust and discriminative representation, a spatiotemporal lip feature is put forward that considers similarities among textons and maps the feature to an intra-class subspace. Moreover, a parallel
two-step keyword spotting strategy based on decision fusion is proposed to make the best use of audio-visual speech and adapt to diverse noise conditions. Weights generated by a neural network combine the acoustic and visual contributions. Experimental results on the OuluVS and PKU-AV datasets demonstrate that the proposed lip descriptor shows competitive performance compared to the state of the art. Additionally, the proposed audio-visual keyword spotting (AV-KWS) method based on decision-level fusion significantly improves noise robustness and attains better performance than feature-level fusion, while also adapting to various noise conditions.

IEEE Transactions on Multimedia (March 2016)

Collaborative Wireless Freeview Video Streaming With Network Coding

Abstract - Free viewpoint video (FVV) offers a compelling interactive experience by allowing users to switch to any viewing angle at any time. An FVV is composed of a large number of camera-captured anchor views, with virtual views (not captured by any camera) rendered from their nearby anchors using techniques such as depth-image-based rendering (DIBR). We consider a group of wireless users who may interact with an FVV by independently switching views. We study a novel live FVV streaming network in which each user pulls a subset of anchors from the server via a primary channel. To enhance anchor availability at each user, a user generates network-coded (NC) packets from some of its anchors and broadcasts them to its direct neighbors via a secondary channel. Given the limited primary and secondary channel bandwidths at the devices, we seek to maximize the received video quality (i.e., minimize distortion) by jointly optimizing the set of anchors each device pulls and the anchor combination used to generate NC packets.
To the best of our knowledge, this is among the first works addressing such a joint optimization problem for wireless live FVV streaming with NC-based collaboration. We first formulate the problem and show that it is NP-hard. We then propose a scalable and effective algorithm called PAFV (Peer-Assisted Freeview Video). In PAFV, each node collaboratively and distributedly decides which anchors to pull and which NC packets to share so as to minimize video distortion in its neighborhood. Extensive simulation studies show that PAFV outperforms other algorithms, achieving substantially lower video distortion (often by more than 20-50%) with significantly less redundancy (by as much as 70%). Our Android-based video experiment further confirms the effectiveness of PAFV over comparison schemes.
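The neighbor-sharing idea behind NC packets can be illustrated with the simplest form of network coding over GF(2): a node XOR-combines several anchor packets into one broadcast, and a neighbor that already holds all but one of the inputs recovers the missing packet by XOR-ing again. A minimal sketch (the paper's NC scheme and packet formats are more involved; packet contents here are placeholders):

```python
def xor_combine(packets):
    """XOR equal-length packets together (network coding over GF(2))."""
    out = bytearray(len(packets[0]))
    for p in packets:
        for i, byte in enumerate(p):
            out[i] ^= byte
    return bytes(out)

# Three equal-length anchor packets; the node broadcasts one coded packet.
a = b"anchor_view_01"
b = b"anchor_view_02"
c = b"anchor_view_03"
coded = xor_combine([a, b, c])

# A neighbor holding a and b recovers c from the single broadcast,
# since XOR is its own inverse: (a ^ b ^ c) ^ a ^ b == c.
recovered = xor_combine([coded, a, b])
```

One broadcast can thus serve neighbors missing different anchors, which is what makes NC sharing bandwidth-efficient on the secondary channel.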
IEEE Transactions on Multimedia (March 2016)

A Decision-Tree-Based Perceptual Video Quality Prediction Model and Its Application in FEC for Wireless Multimedia Communications

Abstract - With the exponential growth of video traffic over wireless networked and embedded devices, mechanisms are needed to predict and control perceptual video quality so as to meet quality of experience (QoE) requirements in an energy-efficient way. This paper proposes an energy-efficient QoE support framework for wireless video communications. It consists of two components: 1) a perceptual video quality model that allows video quality to be predicted in real time and with low complexity, and 2) an application-layer, energy-efficient, content-aware forward error correction (FEC) scheme for preventing quality degradation caused by network packet losses. The perceptual video quality model characterizes factors related to video content as well as distortion caused by compression and transmission. Perceptual quality is predicted by a decision tree using a set of observable features from the compressed bitstream and the network. The proposed model achieves prediction accuracies of 88.9% and 90.5% on two distinct test sets. Based on the proposed quality model, a novel FEC scheme is introduced to protect video packets from losses during transmission. Given a user-defined perceptual quality requirement, the FEC scheme adjusts the level of protection for different components in a video stream to minimize network overhead. Simulation results show that the proposed FEC scheme can enhance the perceptual quality of videos and, compared to conventional FEC methods for video communications, reduce network overhead by 41% on average.
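What makes a decision tree attractive for real-time quality prediction is that inference is just a short chain of threshold comparisons. A toy sketch of such inference (the tree shape, feature names, and thresholds below are invented for illustration and are not the paper's learned model):

```python
def tree_predict(node, features):
    """Walk a decision tree stored as nested dicts until a leaf
    (a plain quality label) is reached."""
    while isinstance(node, dict):
        go_left = features[node["feature"]] <= node["thresh"]
        node = node["left"] if go_left else node["right"]
    return node

# Hypothetical two-level tree over bitstream/network features:
# first split on packet loss, then on quantization parameter (QP).
toy_tree = {
    "feature": "packet_loss_rate", "thresh": 0.01,
    "left": {"feature": "qp", "thresh": 32, "left": "good", "right": "fair"},
    "right": "poor",
}
label = tree_predict(toy_tree, {"packet_loss_rate": 0.002, "qp": 28})
```

Each prediction costs only a few comparisons, which is why such a model can run per-stream on an embedded device and drive the FEC protection level online.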
IEEE Transactions on Multimedia (April 2016)

mDASH: A Markov Decision-Based Rate Adaptation Approach for Dynamic HTTP Streaming

Abstract - Dynamic adaptive streaming over HTTP (DASH) has recently been widely deployed on the Internet. It does not, however, impose any adaptation logic for selecting the quality of the video fragments requested by clients. In this paper, we propose a novel Markov decision-based rate adaptation scheme for DASH that aims to maximize the quality of user experience under time-varying channel conditions. To this end, the proposed method takes into account the key factors that critically affect visual quality, including video playback quality, video rate
switching frequency and amplitude, buffer overflow/underflow, and buffer occupancy. Besides, to reduce computational complexity, we propose a low-complexity sub-optimal greedy algorithm which is suitable for real-time video streaming. Our experiments on a network test-bed and the real-world Internet both demonstrate the good performance of the proposed method in both objective and subjective visual quality.

IEEE Transactions on Multimedia (April 2016)

Complexity Control Based on a Fast Coding Unit Decision Method in the HEVC Video Coding Standard

Abstract - The emerging high-efficiency video coding standard achieves higher coding efficiency than previous standards by virtue of a set of new coding tools such as the quadtree coding structure. In this novel structure, the pixels are organized into coding units (CUs), prediction units, and transform units, the sizes of which can be optimized at every level following a tree configuration. These tools allow highly flexible data representation; however, they incur very high computational complexity. In this paper, we propose an effective complexity control (CC) algorithm based on a hierarchical approach. An early termination condition is defined at every CU size to determine whether subsequent CU sizes should be explored. The actual encoding times are also considered to satisfy the target complexity in real time. Moreover, all parameters of the algorithm are estimated on the fly to adapt its behavior to the video content, the encoding configuration, and the target complexity over time. The experimental results prove that our proposal is able to achieve a target complexity reduction of up to 60% with respect to full exploration, with notable accuracy and limited losses in coding performance.
It was compared with a state-of-the-art CC method and shown to achieve a significantly better trade-off between coding complexity and efficiency as well as higher accuracy in reaching the target complexity. Furthermore, a comparison with a state-of-the-art complexity reduction method highlights the advantages of our CC framework. Finally, we show that the proposed method performs well when the target complexity varies over time.

IEEE Transactions on Multimedia (April 2016)

A Low-Power Video Recording System With Multiple Operation Modes for H.264 and Light-Weight Compression
Abstract - An increasing demand for mobile video recording systems makes it important to reduce power consumption and to increase battery lifetime. H.264/AVC compression is widely used in many video recording systems because of its high compression efficiency; however, the complex coding structure of H.264/AVC requires large power consumption. A light-weight video compression (LWC), based on the discrete wavelet transform and set partitioning in hierarchical trees, consumes less power than H.264/AVC compression thanks to its relatively simple coding structure, although its compression efficiency is lower than that of H.264/AVC. This paper proposes a low-power video recording system that combines both the H.264/AVC encoder, with high compression efficiency, and LWC, with low power consumption. The LWC is used to compress video data for temporary storage, while the H.264/AVC encoder is used for permanent storage of data when certain events are detected. For further power reduction, a down-sampling operation is utilized for permanent data storage. For effective use of the two compressions with the down-sampling operation, an appropriate scheme is selected according to the proportion of long-term to short-term storage and the target bitrate. The proposed system reduces power consumption by up to 72.5% compared to a conventional video recording system.

IEEE Transactions on Multimedia (April 2016)

Human Visual System-Based Saliency Detection for High Dynamic Range Content

Abstract - The human visual system (HVS) attempts to select salient areas to reduce cognitive processing efforts. Computational models of visual attention try to predict the most relevant and important areas of videos or images viewed by the human eye.
Such models, in turn, can be applied to areas such as computer graphics, video coding, and quality assessment. Although several models have been proposed, only one of them is applicable to high dynamic range (HDR) image content, and no work has been done for HDR videos. Moreover, the main shortcoming of the existing models is that they cannot simulate the characteristics of the HVS under the wide luminance range found in HDR content. This paper addresses these issues by presenting a computational approach to model the bottom-up visual saliency for HDR input by combining spatial and temporal visual features. An analysis of eye movement data affirms the effectiveness of the proposed model. Comparisons employing three well-known quantitative metrics show that the proposed model substantially improves predictions of visual attention for HDR content.
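The core of many bottom-up saliency models is a center-surround contrast: a pixel is salient to the extent that it differs from its local surround. A toy, stdlib-only sketch of that single ingredient (far simpler than the spatio-temporal HDR model above; the 3x3 surround and the tiny test image are made up for illustration):

```python
# Toy bottom-up saliency sketch: center-surround luminance contrast.
# Each pixel's saliency is the absolute difference between its value and
# the mean of its 3x3 neighbourhood. Generic illustration only.

def saliency(img):
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            neigh = [img[j][i]
                     for j in range(max(0, y - 1), min(h, y + 2))
                     for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = abs(img[y][x] - sum(neigh) / len(neigh))
    return out

img = [[0, 0, 0],
       [0, 9, 0],
       [0, 0, 0]]
sal = saliency(img)
# The bright centre pixel stands out most from its dark surround.
assert sal[1][1] == max(v for row in sal for v in row)
```

Full models combine several such feature contrasts (intensity, color, orientation, motion) across scales; for HDR input the luminance range itself must additionally be modeled, which is the point of the paper above.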
IEEE Transactions on Multimedia (April 2016)

Multimodal Personality Recognition in Collaborative Goal-Oriented Tasks

Abstract - Incorporating research on personality recognition into computers, from both a cognitive and an engineering perspective, would facilitate the interactions between humans and machines. Previous attempts at personality recognition have focused on a variety of different corpora (ranging from text to audiovisual data), scenarios (interviews, meetings), channels of communication (audio, video, text), and different subsets of personality traits (out of the five in the Big Five Model). Our study uses simple acoustic and visual nonverbal features extracted from multimodal data, which have been recorded in previously uninvestigated scenarios, and considers all five personality traits, not just a subset. First, we look at the human-machine interaction scenario, where we introduce the display of different “collaboration levels.” Second, we look at the contribution of the human-human interaction (HHI) scenario to the emergence of personality traits. Investigating the HHI scenario creates a stronger basis for future human-agent interactions. Our goal is to study, from a computational approach, the degree of emergence of the five personality traits in these two scenarios. The results demonstrate the relevance of each of the two scenarios when it comes to the degree of emergence of certain traits and the feasibility of automatically recognizing personality under different conditions.
IEEE Transactions on Multimedia (April 2016)

Core Failure Mitigation in Integer Sum-of-Product Computations on Cloud Computing Systems

Abstract - The decreasing mean-time-to-failure estimates in cloud computing systems indicate that multimedia applications running on such environments should be able to mitigate an increasing number of core failures at runtime. We propose a new roll-forward failure-mitigation approach for integer sum-of-product computations, with emphasis on generic matrix multiplication (GEMM) and convolution/cross-correlation (CONV) routines. Our approach is based on the production of redundant results within the numerical representation of the outputs via the use of numerical packing. This differs from all existing roll-forward solutions, which require a separate set of checksum (or duplicate) results. Our proposal imposes a 37.5% reduction in the maximum output bitwidth supported in comparison to integer sum-of-product realizations performed on 32-bit integer representations, which is comparable to the bitwidth requirement of
checksum methods for multiple core failure mitigation. Experiments with state-of-the-art GEMM and CONV routines running on a c4.8xlarge compute-optimized instance of Amazon Web Services Elastic Compute Cloud (AWS EC2) demonstrate that the proposed approach is able to mitigate up to one quad-core failure while achieving processing throughput that is: 1) comparable to that of the conventional, failure-intolerant, integer GEMM and CONV routines, and 2) substantially superior to that of the equivalent roll-forward failure-mitigation method based on checksum streams. Furthermore, when used within an image retrieval framework deployed over a cluster of AWS EC2 spot (i.e., low-cost albeit terminatable) instances, our proposal leads to: 1) a 16%-23% cost reduction against the equivalent checksum-based method and 2) a more than 70% cost reduction against conventional failure-intolerant processing on AWS EC2 on-demand (i.e., higher-cost albeit guaranteed) instances.

IEEE Transactions on Multimedia (April 2016)

Factorization Algorithms for Temporal Psychovisual Modulation Display

Abstract - Temporal psychovisual modulation (TPVM) is a new information display technology which aims to generate multiple visual percepts for different viewers on a single display simultaneously. In a TPVM system, viewers wearing different active liquid crystal (LC) glasses with varying transparency levels can see different images (called personal views). Viewers without LC glasses can also see a semantically meaningful image (called the shared view).
The display frames and weights for the LC glasses in the TPVM system can be computed through nonnegative matrix factorization (NMF) with three additional constraints: the values of images and modulation weights should have upper bounds (i.e., the limited luminance of the display and transparency levels of the LC); the shared view seen without viewing devices should be considered (i.e., the sum of all basis images should be a meaningful image); and the sparsity of modulation weights should be considered due to the material properties of the LC. In this paper, we propose to solve the constrained NMF problem with a modified version of the hierarchical alternating least squares (HALS) algorithm. Through experiments, we analyze the choice of parameters in the setup of the TPVM system. This work serves as a guideline for the practical implementation of a TPVM display system.

IEEE Transactions on Multimedia (April 2016)

Free-Energy Principle Inspired Video Quality Metric and Its Use in Video Coding
Abstract - In this paper, we extend the free-energy principle to video quality assessment (VQA) by incorporating the recent psychophysical study on human visual speed perception (HVSP). A novel video quality metric, namely the free-energy principle inspired video quality metric (FePVQ), is therefore developed and applied to perceptual video coding optimization. The free-energy principle suggests that the human visual system (HVS) can actively predict “orderly” information and avoid “disorderly” information for image perception. Basically, “orderly” is associated with the skeletons and edges of objects, and “disorderly” mostly concerns textures in images. Based on this principle, an image is separated into orderly and disorderly regions, and processed differently in image quality assessment. For videos, visual attention, or fixation, is associated with objects with significant motion according to HVSP, resulting in a motion strength factor in the FePVQ so that the free-energy principle is extended into the spatio-temporal domain for VQA. In addition, we investigate the application of the FePVQ in perceptual rate-distortion optimization (RDO). For this purpose, the FePVQ is realized with low computational cost by using the relative total variation model and the block-wise motion vectors of video coding to simulate the free-energy principle and HVSP, respectively. The experimental results indicate that the proposed FePVQ is highly consistent with HVS perception. The linear correlation coefficient and Spearman's rank-order correlation coefficient are up to 0.8324 and 0.8281 on the LIVE video database. Better perceptual quality of encoded video sequences is achieved by FePVQ-motivated RDO in video coding.
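The two agreement figures quoted above (0.8324 and 0.8281) are the standard way quality metrics are validated against subjective scores: Pearson's linear correlation coefficient and Spearman's rank-order correlation. A stdlib-only sketch of how both are computed, with made-up example scores (real evaluations use databases such as LIVE):

```python
# Pearson's linear correlation and Spearman's rank-order correlation,
# the two agreement measures commonly reported for quality metrics.
# Ties in the data are not handled here, for brevity.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def spearman(x, y):
    # Spearman = Pearson computed on the ranks of the data.
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(x), ranks(y))

metric_scores = [1.0, 2.0, 3.0, 4.0]   # hypothetical objective scores
subjective    = [1.1, 1.9, 3.2, 3.8]   # hypothetical subjective scores
assert pearson(metric_scores, subjective) > 0.99
assert spearman(metric_scores, subjective) > 0.999
```

Pearson measures linear agreement; Spearman only asks whether the metric ranks videos in the same order as viewers do, so it is insensitive to any monotonic nonlinearity in the metric's scale.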
IEEE Transactions on Multimedia (April 2016)

Holons Visual Representation for Image Retrieval

Abstract - With the enlargement of image scale, conventional local features, such as SIFT, are ineffective for representation or indexing, and more compact visual representations are required. Due to its intrinsic mechanism, the state-of-the-art vector of locally aggregated descriptors (VLAD) has a few limitations. Based on this, we propose a new descriptor named holons visual representation (HVR). The proposed HVR is a self-contained combination of global and local information. It exploits both global characteristics and the statistical information of local descriptors in the image dataset. It also takes advantage of the local features of each image and computes their distribution with respect to the entire local descriptor space. Accordingly, the HVR is computed by a two-layer hierarchical scheme, which splits the
local feature space and obtains raw partitions, as well as the corresponding refined partitions. Then, according to the distances from the centroids of the partition spaces to the local features and their spatial correlation, we assign the local features to their nearest raw partitions and refined partitions to obtain the global description of an image. Compared with VLAD, HVR holds critical structure information and enhances the discriminative power of the individual representation with a small computational cost, while using the same memory overhead. Extensive experiments on several benchmark datasets demonstrate that the proposed HVR outperforms conventional approaches in terms of scalability as well as retrieval accuracy for images with similar intra-local information.

IEEE Transactions on Multimedia (April 2016)

Query-Adaptive Small Object Search Using Object Proposals and Shape-Aware Descriptors

Abstract - While there has been a significant amount of work on object search and image retrieval, the focus has primarily been on establishing effective models for whole images, scenes, and objects occupying a large portion of an image. In this paper, we propose to leverage object proposals to identify small and smooth-structured objects in a large image database. Unlike popular methods that explore a coarse image-level pairwise similarity, the search is designed to exploit similarity measures at the proposal level. An effective graph-based query expansion strategy is designed to assess each of these better-matched proposals against all its neighbors within the same image for precise localization.
Combined with EdgeBoW, a shape-aware feature descriptor, and a set of more insightful edge weights and node-utility measures, the proposed search strategy can handle varying view angles, illumination conditions, deformation, and occlusion efficiently. Experiments performed on a number of benchmark datasets show the powerful and superior generalization ability of this single integrated framework in dealing with both clutter-intensive real-life images and poor-quality binary document images with equal dexterity.

IEEE Transactions on Multimedia (April 2016)

Folksonomy-Based Visual Ontology Construction and Its Applications

Abstract - An ontology hierarchically encodes concepts and concept relationships, and has a variety of applications such as semantic understanding and information retrieval. Previous work
for building ontologies has primarily relied on labor-intensive human contributions or focused on text-based extraction. In this paper, we consider the problem of automatically constructing a folksonomy-based visual ontology (FBVO) from user-generated annotated images. A systematic framework is proposed, consisting of three stages: concept discovery, concept relationship extraction, and concept hierarchy construction. The noise issues of user-generated tags are carefully addressed to guarantee the quality of the derived FBVO. The constructed FBVO finally consists of 139 825 concept nodes and millions of concept relationships, mined from more than 2.4 million Flickr images. Experimental evaluations show that the derived FBVO is of high quality and consistent with human perception. We further demonstrate the utility of the derived FBVO in applications of complex visual recognition and exploratory image search.

IEEE Transactions on Multimedia (April 2016)

Learning Personalized Models for Facial Expression Analysis and Gesture Recognition

Abstract - Facial expression and gesture recognition algorithms are key enabling technologies for human-computer interaction (HCI) systems. State-of-the-art approaches for automatic detection of body movements and analyzing emotions from facial features heavily rely on advanced machine learning algorithms. Most of these methods are designed for the average user, but the “one-size-fits-all” assumption ignores diversity in cultural background, gender, ethnicity, and personal behavior, and limits their applicability in real-world scenarios. A possible solution is to build personalized interfaces, which practically implies learning person-specific classifiers and usually collecting a significant amount of labeled samples for each novel user.
As data annotation is a tedious and time-consuming process, in this paper we present a framework for personalizing classification models which does not require labeled target data. Personalization is achieved by devising a novel transfer learning approach. Specifically, we propose a regression framework which exploits auxiliary (source) annotated data to learn the relation between person-specific sample distributions and the parameters of the corresponding classifiers. Then, when considering a new target user, the classification model is computed by simply feeding the associated (unlabeled) sample distribution into the learned regression function. We evaluate the proposed approach in different applications: pain recognition and action unit detection using visual data, and gesture classification using inertial measurements, demonstrating the generality
of our method with respect to different input data types and basic classifiers. We also show the advantages of our approach in terms of accuracy and computational time with respect to both user-independent approaches and previous personalization techniques.

IEEE Transactions on Multimedia (April 2016)

Scalable Video Event Retrieval by Visual State Binary Embedding

Abstract - With the exponential increase of media data on the web, fast media retrieval is becoming a significant research topic in multimedia content analysis. Among the variety of techniques, learning binary embedding (hashing) functions is one of the most popular approaches that can achieve scalable information retrieval in large databases, and it is mainly used in near-duplicate multimedia search. However, until now most hashing methods have been specifically designed for near-duplicate retrieval at the visual level rather than the semantic level. In this paper, we propose a Visual State Binary Embedding (VSBE) model to encode video frames, which can preserve the essential semantic information in binary matrices, to facilitate fast video event retrieval in unconstrained cases. Compared with other video binary embedding models, one advantage of our proposed VSBE model is that it only needs a limited number of key frames from the training videos for hash function training, so the computational complexity is much lower in the training phase. At the same time, we apply the pair-wise constraints generated from the visual states to sketch the local properties of the events at the semantic level, so accuracy is also ensured. We conducted extensive experiments on the challenging TRECVID MED dataset and have proved the superiority of our proposed VSBE model.
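The general binary-embedding idea behind such retrieval systems can be sketched with the classic random-hyperplane hash: each bit is the sign of the feature vector's dot product with a random direction, and search ranks items by Hamming distance between codes. This stdlib-only sketch illustrates only the generic hashing mechanism, not VSBE's semantically trained, visual-state-constrained embedding; the toy 4-dimensional vectors are made up:

```python
# Generic binary embedding sketch: random-hyperplane hashing.
# Each bit = sign of a dot product with a random Gaussian direction;
# similar vectors tend to agree on most bits (small Hamming distance).
import random

def make_hash(dim, bits, seed=0):
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
    def embed(v):
        return tuple(int(sum(p * x for p, x in zip(plane, v)) >= 0)
                     for plane in planes)
    return embed

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

embed = make_hash(dim=4, bits=32)
q    = embed([1.0, 0.9, 0.0, 0.1])
near = embed([0.9, 1.0, 0.1, 0.0])    # similar vector
far  = embed([-1.0, 0.0, 1.0, -0.9])  # dissimilar vector
assert hamming(q, near) < hamming(q, far)
```

Hamming distances between short binary codes can be computed with bitwise operations and popcounts, which is what makes hashing-based search scale to very large databases.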
IEEE Transactions on Multimedia (April 2016)

Link Adaptation for High-Quality Uncompressed Video Streaming in 60-GHz Wireless Networks

Abstract - The emerging 60-GHz multigigabit-per-second wireless technology enables the streaming of high-quality “uncompressed” video, which has been impossible with other existing wireless technologies. To support such a resource-hungry uncompressed video streaming service with limited wireless resources, it is necessary to design efficient link adaptation policies that select suitable transmission rates for the 60-GHz wireless channel environment, thus optimizing video quality and resource management. For a proper design of the link adaptation policies, we propose a new metric, called expected peak signal-to-noise ratio (ePSNR), to
numerically estimate the video streaming quality. Using the ePSNR as a criterion, we propose two link adaptation policies with different objectives considering unequal error protection (UEP). The proposed link adaptation policies attempt to 1) maximize the video quality for given wireless resources, or 2) minimize the required wireless resources while meeting the video quality requirement. From the link adaptation policies, we provide a distributed resource management scheme for multiple users to maintain satisfactory video streaming quality. Our extensive simulation results demonstrate that the newly proposed metric, ePSNR, well represents the level of video quality. It is also shown that the proposed link adaptation policies can enhance resource efficiency while achieving acceptable video streaming quality.

IEEE Transactions on Multimedia (April 2016)

Multiview and 3D Video Compression Using Neighboring Block Based Disparity Vectors

Abstract - Reducing the statistical redundancy among different viewpoints, i.e., inter-view redundancy, is a fundamental and critical problem in multiview and three-dimensional (3D) video coding. To exploit the inter-view redundancy, disparity vectors are required to identify pixels of the same objects within two different views; in this way, enhancement coding tools can be efficiently employed as new modes in block-based video codecs to achieve higher compression efficiency. Although disparity can be converted from depth, this is not possible in multiview video coding, since depth information is not considered there. Even when depth information is coded, relying on it breaks the so-called multiview compatibility, wherein texture views can be decoded without depth information.
To resolve this problem, in this paper a neighboring block-based disparity vector derivation (NBDV) method is proposed. The basic concept of NBDV is to derive a disparity vector (DV) for a current block by utilizing the motion information of spatially and temporally neighboring blocks predicted from another view. Through extensive experiments and analysis, it is shown that the proposed NBDV method achieves efficient DV derivation in state-of-the-art video codecs, and it preserves multiview compatibility with relatively low complexity. The proposed method has become an essential part of the 3D video standard extensions of H.264/AVC and HEVC.

IEEE Transactions on Multimedia (April 2016)

Predicting the Performance in Decision-Making Tasks: From Individual Cues to Group Interaction
Abstract - This paper addresses the problem of predicting the performance of decision-making groups. Towards this goal, we evaluate the predictive power of group attributes and discussion dynamics by using automatically extracted features, such as group members' aural and visual cues, interaction between team members, and the influence of each team member, as well as self-reported features such as personality- and perception-related cues, the hierarchical structure of the group, and individual- and group-level task performances. We tackle the inference problem from two angles depending on the way features are extracted: 1) a holistic approach based on the entire meeting, and 2) a sequential approach based on thin slices of the meeting. In the former, key factors affecting group performance are identified and the prediction is achieved by support vector machines. As for the latter, we compare and contrast the classification performance of a novel influence-model-based classifier with that of a hidden Markov model (HMM). Experimental results indicate that group looking cues and influence cues are major predictors of group performance, and that the influence model outperforms the HMM in almost all experimental conditions. We also show that combining classifiers covering unique aspects of the data improves classification performance.

IEEE Transactions on Multimedia (April 2016)

Comparison and Evaluation of Sonification Strategies for Guidance Tasks

Abstract - This paper aims to reveal the efficiency of sonification strategies in terms of rapidity, precision, and overshooting in the case of a one-dimensional guidance task.
The sonification strategies are based on the four main perceptual attributes of a sound (pitch, loudness, duration/tempo, and timbre) and are classified according to whether one or several auditory references are present. Perceptual evaluations are used to place the strategies in a precision/rapidity space and enable prediction of user behavior for a chosen sonification strategy. The evaluation of sonification strategies constitutes a first step toward general guidelines for sound design in interactive multimedia systems that involve guidance issues.

IEEE Transactions on Multimedia (April 2016)

3D Ear Identification Using Block-wise Statistics based Features and LC-KSVD

Abstract - Biometric authentication has proven to be an effective method for recognizing a person's identity with high confidence. In this field, the use of 3D ear shape is a recent trend. As a biometric identifier, the ear has several inherent merits. However, although a great
deal of effort has been devoted, there is still considerable room for improvement in developing a highly effective and efficient 3D ear identification approach. In this paper, we attempt to fill this gap to some extent by proposing a novel 3D ear classification scheme that makes use of the label-consistent K-SVD (LC-KSVD) framework. As an effective supervised dictionary learning algorithm, LC-KSVD learns a single compact discriminative dictionary for sparse coding and a multi-class linear classifier simultaneously. To use the LC-KSVD framework, one key issue is how to extract feature vectors from 3D ear scans. To this end, we propose a block-wise statistics-based feature extraction scheme. Specifically, we divide a 3D ear ROI into uniform blocks and extract a histogram of surface types from each block; the histograms from all blocks are then concatenated to form the desired feature vector. Feature vectors extracted in this way are highly discriminative and are robust to minor misalignment between samples. Experiments demonstrate that our approach can achieve better recognition accuracy than other state-of-the-art methods. More importantly, its computational complexity is extremely low, making it quite suitable for large-scale identification applications. Matlab source code is publicly available online at http://sse.tongji.edu.cn/linzhang/LCKSVDEar/LCKSVDEar.htm.

IEEE Transactions on Multimedia (May 2016)

Sketch-based Image Retrieval by Salient Contour Reinforcement

Abstract - This paper presents a sketch-based image retrieval algorithm. One of the main challenges in sketch-based image retrieval (SBIR) is to measure the similarity between a sketch and an image. To tackle this problem, we propose an SBIR approach based on salient contour reinforcement.
In our approach, we divide the image contour into two types. The first is the global contour map. The second, called the salient contour map, helps find the objects in images that are similar to the query. In addition, based on the two contour maps, we propose a new descriptor, namely the angular radial orientation partitioning (AROP) feature. It fully utilizes the edge pixels' orientation information in the contour maps to identify spatial relationships. Our AROP feature, based on the two candidate contour maps, is both efficient and effective at discovering false matches of local features between sketches and images, and can greatly improve retrieval performance. A retrieval application based on this algorithm has been built. Experiments on an image dataset with 0.3 million images show the effectiveness of the
proposed method, and comparisons with other algorithms are also given. Compared to the baseline, the proposed method achieves 10% higher precision in the top 5 results.

IEEE Transactions on Multimedia (May 2016)

Democratic Diffusion Aggregation for Image Retrieval

Abstract - Content-based image retrieval is an important research topic in the multimedia field. In large-scale image search using local features, image features are encoded and aggregated into a compact vector to avoid indexing each feature individually. In the aggregation step, sum-aggregation is widely used in much existing work and demonstrates promising performance. However, it is based on a strong and implicit assumption that the local descriptors of an image are independently and identically distributed in descriptor space and the image plane. To address this problem, we propose a new aggregation method named democratic diffusion aggregation with weak spatial context embedded. The main idea of our aggregation method is to re-weight the embedded vectors before sum-aggregation by considering the relevance among local descriptors. Different from previous work, by conducting a diffusion process on the improved kernel matrix, we calculate the weighting coefficients more efficiently without any iterative optimization. Besides, considering the relevance of local descriptors from different images, we also discuss an efficient query fusion strategy which uses the initial top-ranked image vectors to enhance the retrieval performance. Experimental results show that our aggregation method exhibits much higher efficiency (about ×14 faster) and better retrieval accuracy compared with previous methods, and the query fusion strategy consistently improves retrieval quality.
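The re-weighting-before-sum-aggregation idea can be sketched in its simplest form: down-weight each local descriptor in proportion to its total similarity to the others, so that a burst of redundant descriptors does not dominate the aggregate. This stdlib-only sketch is a simplified stand-in for the paper's diffusion-based weighting (the toy 2-D descriptors are made up):

```python
# Sketch of re-weighted sum-aggregation: each descriptor's weight is the
# inverse of its total (non-negative) similarity to all descriptors,
# including itself, so mutually redundant descriptors share one "vote".
# Simplified illustration, not the paper's diffusion process.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def democratic_aggregate(descs):
    weights = []
    for d in descs:
        total_sim = sum(max(dot(d, e), 0.0) for e in descs)  # includes self
        weights.append(1.0 / total_sim)
    dim = len(descs[0])
    return [sum(w * d[k] for w, d in zip(weights, descs)) for k in range(dim)]

# Three near-duplicate descriptors plus one distinct descriptor:
descs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
agg = democratic_aggregate(descs)
# Plain sum-aggregation would give [3.0, 1.0]; re-weighting balances the
# burst of duplicates against the single distinct descriptor.
assert abs(agg[0] - 1.0) < 1e-9 and abs(agg[1] - 1.0) < 1e-9
```

The naive per-descriptor weighting above is quadratic in the number of descriptors; the paper's contribution is computing such coefficients efficiently via a diffusion process on the kernel matrix, without iterative optimization.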
IEEE Transactions on Multimedia (May 2016)

Tag based Image Search by Social Re-Ranking

Abstract - Social media sharing websites like Flickr allow users to annotate images with free tags, which significantly contribute to the development of web image retrieval and organization. Tag-based image search is an important method to find images contributed by social users on such websites. However, making the top-ranked results both relevant and diverse is challenging. In this paper, we propose a social re-ranking system for tag-based image retrieval that considers both image relevance and diversity. We aim at re-ranking images according to their visual information, semantic information, and social clues. The initial results include images contributed by different social users. Usually each user contributes several images. First, we sort these images by inter-user re-ranking. Users with higher contributions to the given query rank higher.
Then we sequentially apply intra-user re-ranking to each ranked user's image set, selecting only the most relevant image from each user's set. These selected images compose the final retrieved results. We build an inverted index structure for the social image dataset to accelerate the searching process. Experimental results on a Flickr dataset show that our social re-ranking method is effective and efficient.
IEEE Transactions on Multimedia (May 2016)

Learning Geographical Hierarchy Features via a Compositional Model
Abstract - Image location prediction estimates the geolocation at which an image was taken, which is important for many image applications, such as image retrieval, image browsing and organization. Since a social image contains heterogeneous content, such as visual and textual content, effectively incorporating this content to predict location is nontrivial.
Moreover, it is observed that image content patterns and the locations where they may appear correlate hierarchically. Traditional image location prediction methods mainly adopt a single-level architecture and assume that images are independently distributed in geographical space, which does not directly capture this hierarchical correlation. In this paper, we propose a Geographically Hierarchical Bi-modal Deep Belief Network model (GH-BDBN), a compositional learning architecture that integrates a multi-modal deep learning model with a non-parametric hierarchical prior model. GH-BDBN learns a joint representation capturing the correlations among different types of image content using a bi-modal DBN, with a geographically hierarchical prior over the joint representation to model the hierarchical correlation between image content and location. Then, an efficient inference algorithm is proposed to learn the parameters and the hierarchical structure of geographical locations. Experimental results demonstrate the superiority of our model for image location prediction.
IEEE Transactions on Multimedia (May 2016)

Semantic Discriminative Metric Learning for Image Similarity Measurement
Abstract - With the arrival of the multimedia era, multimedia data has replaced textual data for transferring information in various fields. As an important form of multimedia data, images are widely used in many applications, such as face recognition and image classification. Therefore, how to accurately annotate each image in a large set of images is of vital importance but challenging. To perform these tasks well, it is crucial to extract suitable features that characterize the visual contents of images and to learn an appropriate distance metric to measure the similarities between images.
Unfortunately, existing feature descriptors, such as the histogram of oriented gradients, local binary patterns and color histograms, capture the visual character of images but lack the ability to distinguish semantic information. Similarities between such features cannot reflect the real category correlations due to the well-known semantic gap. To solve this problem, this paper proposes a regularized distance metric framework called Semantic Discriminative Metric Learning (SDML). SDML combines the geometric mean with normalized divergences and simultaneously separates images from different classes. The learned distance metric treats images from all classes equally, and distinctions between similar classes with entirely different semantic contents are emphasized. This procedure
ensures consistency between dissimilarities and semantic distinctions and avoids the inaccurate similarities incurred by unbalanced locations of samples. Various experiments on benchmark image datasets show the excellent performance of the proposed method.
IEEE Transactions on Multimedia (May 2016)

6-DOF Image Localization from Massive Geo-tagged Reference Images
Abstract - 6-DOF (degrees of freedom) image localization, which aims to calculate the spatial position and rotation of a camera, is a challenging problem for most location-based services. In existing approaches, this problem is often tackled by finding matches between 2D image points and 3D structure points and deriving the location information via the direct linear transformation algorithm. However, as these 2D-to-3D approaches need to reconstruct the 3D structure points of the scene, they may not be flexible enough to exploit massive and ever-growing geo-tagged data. To this end, this paper presents a novel approach to 6-DOF image localization that fuses candidate poses relative to reference images. In this approach, we propose to localize an input image according to the position and rotation information of multiple geo-tagged images retrieved from a reference dataset. From the reference images, an efficient relative pose estimation algorithm derives a set of candidate poses for the input image. Each candidate pose encodes the relative rotation and direction of the input image with respect to a specific reference image. Finally, these candidate poses are fused by minimizing a well-defined geometric error, so that the 6-DOF location of the input image is effectively derived. Experimental results show that our method obtains satisfactory localization accuracy.
In addition, the proposed relative pose estimation algorithm is much faster than existing work.
IEEE Transactions on Multimedia (May 2016)

Delay-Optimized Video Traffic Routing in Software-Defined Interdatacenter Networks
Abstract - Many video streaming applications operate their geo-distributed services in the cloud, taking advantage of the superior connectivity between datacenters to push content closer to users or to relay live video traffic between end users at a higher throughput. In the meantime, inter-datacenter networks also carry high volumes of other types of traffic, including service replication and data backups, e.g., for storage and email services. It is an important research topic to optimally engineer and schedule inter-datacenter traffic, taking into account the stringent latency requirements of video flows when they are transmitted along inter-datacenter links shared with
other types of traffic. Since inter-datacenter networks are usually overprovisioned, unlike prior work that mainly aims to maximize link utilization, we propose a delay-optimized traffic routing scheme that explicitly differentiates path selection for different sessions according to their delay sensitivities, leading to a software-defined inter-datacenter networking overlay implemented at the application layer. We show that our solution can yield sparse path selection by solving only linear programs and thus, in contrast to prior traffic engineering solutions, does not lead to overly fine-grained traffic splitting, further reducing packet resequencing overhead and the number of forwarding rules to be installed in each forwarding unit. Real-world experiments based on a deployment across six globally distributed Amazon EC2 datacenters show that our system can effectively prioritize and improve the delay performance of inter-datacenter video flows at a low cost.
IEEE Transactions on Multimedia (May 2016)

Multiple Human Identification and Cosegmentation: A Human-Oriented CRF Approach with Poselets
Abstract - Jointly localizing, identifying and extracting humans with consistent appearance from a personal photo stream is an important problem with wide applications. The strong variations in foreground and background and the irregularly occurring foreground humans make this realistic problem challenging. Inspired by advances in object detection, scene understanding and image cosegmentation, in this paper we explore explicit constraints to label and segment human objects rather than non-human objects and "stuff". We refer to this problem as Multiple Human Identification and Cosegmentation (MHIC).
To identify specific human subjects, we propose an efficient human instance detector that combines an extended color line model with a poselet-based human detector. Moreover, to capture high-level human shape information, a novel soft shape cue is proposed: it is initialized by the human detector, enhanced through a generalized geodesic distance transform, and finally refined with a joint bilateral filter. We also propose to capture the rich feature context around each pixel using an adaptive cross-region data structure, which gives higher discriminative power than single-pixel estimation. The high-level object cues from the detector and the shape are then integrated with the low-level pixel cues and mid-level contour cues into a principled conditional random field (CRF) framework, which can be efficiently solved using fast graph
cut algorithms. We evaluate our method on the newly created NTU-MHIC human dataset, which contains 351 images with manually annotated ground-truth segmentations. Both visual and quantitative results demonstrate that our method achieves state-of-the-art performance on the MHIC task.
IEEE Transactions on Multimedia (May 2016)

Game Theoretic Resource Allocation in Media Cloud with Mobile Social Users
Abstract - Due to the rapid increase in both the population of mobile social users and the demand for quality of experience (QoE), providing mobile social users with satisfactory multimedia services has become an important issue. The media cloud has been shown to be an efficient solution to this issue, allowing mobile social users to connect to it through a group of distributed brokers. However, as the resources in a media cloud are limited, how to allocate them among the media cloud, brokers and mobile social users becomes a new challenge. Therefore, in this paper, we propose a game-theoretic resource allocation scheme in which the media cloud allocates resources to mobile social users through brokers. First, a framework for resource allocation among the media cloud, brokers and mobile social users is presented: the media cloud dynamically determines the price of resources and allocates them to brokers, while each mobile social user selects a broker through which to connect to the media cloud, adjusting his strategy to maximize his revenue based on the social features of the community. Next, we formulate the interactions among the media cloud, brokers and mobile social users as a four-stage Stackelberg game. In addition, using backward induction, we propose an iterative algorithm that implements the proposed scheme and obtains the Stackelberg equilibrium.
Finally, simulation results show that each player in the game can obtain its optimal strategy at a stable Stackelberg equilibrium.
IEEE Transactions on Multimedia (May 2016)

DPcode: Privacy-Preserving Frequent Visual Patterns Publication on Cloud
Abstract - Nowadays, the cloud has become a promising platform for multimedia data processing and sharing. Many institutes and companies plan to outsource and share their large-scale video and image datasets on the cloud for scientific research and public interest. Among various video applications, the discovery of frequent visual patterns over graphical data is an exploratory and important technique. However, privacy concerns over the leakage of sensitive information contained in the videos and images impede its further implementation. Although the frequent visual
patterns mining (FVPM) algorithm aggregates summaries over individual frames and seems to pose no privacy threat, private information contained in individual frames may still be leaked from the statistical results. In this paper, we study the problem of privacy-preserving publication of FVPM results for graphical data on the cloud. We propose the first differentially private frequent visual patterns mining algorithm for graphical data, named DPcode. We propose a novel mechanism that integrates privacy-preserving visual word conversion with a differentially private mechanism under the noise allocation strategy of the sparse vector technique. The optimized algorithms properly allocate the privacy budgets among the different phases of the FVPM algorithm over images and reduce the corresponding data distortion. Extensive experiments are conducted on datasets commonly used in visual mining algorithms. The results show that our approach achieves high utility while satisfying a practical privacy requirement.
IEEE Transactions on Multimedia (May 2016)

Audio recapture detection with convolutional neural networks
Abstract - In this work, we investigate how features can be effectively learned by deep neural networks for audio forensic problems. By providing preliminary feature preprocessing based on Electric Network Frequency (ENF) analysis, we propose a convolutional neural network (CNN) for training and classification of genuine and recaptured audio recordings. Hierarchical representations that contain different levels of detail of the ENF components are learned by the deep neural network and can be used for further classification. The proposed method works on small audio clips of 2 seconds' duration, where state-of-the-art methods may fail.
Experimental results demonstrate that the proposed network yields high detection accuracy with each ENF harmonic component represented as a single-channel input. The performance can be further improved by a combined input representation that incorporates both the fundamental ENF and its harmonics. The convergence properties of the network and the effect of analysis windows of various sizes are also studied. A performance comparison against the support tensor machine demonstrates the advantage of using a CNN for the task of audio recapture detection. Moreover, visualization of the intermediate feature maps provides some insight into what the deep neural networks actually learn and how they make decisions.
IEEE Transactions on Multimedia (May 2016)

A Context-aware Framework for Reducing Bandwidth Usage of Mobile Video Chats
Abstract - Mobile video chat apps offer users an approachable way to communicate with others. As high-speed 4G networks are deployed worldwide, the number of mobile video chat app users keeps increasing. However, video chatting on mobile devices raises financial concerns for users, since streaming video demands high bandwidth and can use up a large amount of data in dozens of minutes. Lowering the bandwidth usage of mobile video chats is challenging, since video quality may be compromised. In this paper, we attempt to address this challenge. Technically, we propose a context-aware frame-rate adaptation framework named LBVC (Low-bandwidth Video Chat). It follows a sender-receiver cooperative principle that smartly handles the trade-off between lowering bandwidth usage and maintaining video quality. We implement LBVC by modifying an open-source app, Linphone, and evaluate it with both objective experiments and subjective studies.
IEEE Transactions on Multimedia (May 2016)

Resource Allocation With Video Traffic Prediction in Cloud-Based Space Systems
Abstract - This paper considers resource allocation problems for video transmission in space-based information networks. The queueing system analyzed in this study consists of multiple users and a single server. The server is operated as a cloud that senses the traffic arrivals at each user's queue and then allocates the transmission resources and service rates for the users. The objectives are to configure the system over time so as to minimize its time-average cost and to minimize the waiting time of packets after they enter the queue. Meanwhile, the constraints on the queue stability of the system must be satisfied.
In this paper, we introduce a predictive backpressure algorithm that incorporates future arrivals, within a certain prediction window, into resource allocation decisions about which packets to serve first. In addition, we design a multiresolution wavelet-decomposition-based backpropagation network for the prediction of video traffic, which exhibits the long-range dependence property. Simulation results indicate that the delay of the queueing system can be reduced through this prediction-based resource allocation, and that the proposed prediction system improves the prediction accuracy for video traffic.
IEEE Transactions on Multimedia (May 2016)

SALIC: Social Active Learning for Image Classification
Abstract - In this paper we present SALIC, an active learning method that is placed in the context of social networks and focuses on selecting the samples that are most appropriate for expanding the training set of a binary classifier. The process of active learning can be fully automated in this social context by replacing the human oracle with user-tagged images obtained from social networks. However, the noisy nature of user-contributed tags adds further complexity to the problem of sample selection, since, apart from their informativeness (i.e., how much they are expected to inform the classifier if we knew their labels), our confidence about their actual content (i.e., how certain the oracle is in its decision about the contents of an image) should also be maximized. The main contribution of this work is a probabilistic approach for jointly maximizing these two quantities with a view to automating the process of active learning. Based on this approach, the training set is expanded with the samples that maximize the joint probability of selection given their informativeness and our confidence in their true content. In the examined noisy context, the oracle's confidence is necessary to provide a context-based indication of the images' true contents, while the samples' informativeness is required to reduce the computational complexity and minimize the mistakes of the unreliable oracle. We experimentally demonstrate the validity and superiority of SALIC over various baselines and state-of-the-art methods. In addition, we show that SALIC allows us to select training data as effectively as typical active learning, without the cost of manual annotation.
Finally, we argue that the speed-up achieved when learning actively in this social context (where labels can be obtained without the cost of human annotation) is necessary to cope with the continuously growing requirements of large-scale applications. In this respect, we prove experimentally that SALIC requires 10 times less training data to reach exactly the same performance as a straightforward, informativeness-agnostic learning approach.
IEEE Transactions on Multimedia (May 2016)

Efficient Image Sharpness Assessment Based on Content Aware Total Variation
Abstract - State-of-the-art sharpness assessment methods are mostly based on edge width, gradient, high-frequency energy or pixel intensity variation. Such methods give little consideration to image content variation in conjunction with sharpness assessment, which makes the resulting sharpness metrics less effective across images with different content. In this paper, we propose an
efficient no-reference image sharpness assessment method called Content Aware Total Variation (CATV), which takes into account the importance of image content variation in sharpness measurement. By parameterizing the image total variation (TV) statistics with a generalized Gaussian distribution (GGD), the standard deviation identifies the sharpness measure and the shape parameter indicates image content variation. However, the standard deviation is content dependent: it differs among regions with strong edges, high-frequency textures, low-frequency textures, and blank areas. By incorporating the shape parameter to moderate the standard deviation, we propose a content-aware sharpness metric. Experimental results show that the proposed method correlates highly with the human visual system and achieves better sharpness assessment results than state-of-the-art techniques on the blurred subsets of the LIVE, TID2008, CSIQ and IVC databases. Our method also has very low computational complexity, making it suitable for online applications. The correlations with the subjective scores of the four databases and a statistical significance analysis show that our method outperforms previous techniques.
IEEE Transactions on Multimedia (May 2016)

SUPPORT OFFERED TO REGISTERED STUDENTS:
1. IEEE base paper.
2. Review material as per the individual's university guidelines.
3. Future enhancement.
4. Assistance in answering all critical questions.
5. Training on the programming language.
6. Complete source code.
7. Final report / document.
8. International conference / international journal publication on your project.
FOLLOW US ON FACEBOOK @ TSYS Academic Projects