Video Editing: Retargeting, Replay, Repainting and Reshuffling



Video editing, retargeting, seam carving, saliency map, replaying, video skim, key frame, story board, video summary, video repainting, video-based modeling, video-based rendering, time lapse, HDR, colorization, video cartooning, video stylization, video texture, video synopsis, video rewrite, video matting, compositing, inpainting, texture synthesis, video completion, video annotation, augmented reality, video segmentation, visual surveillance and monitoring, HCII, graph cut, belief propagation



Usage Rights

© All Rights Reserved

    Presentation Transcript

    • Yu Huang Sunnyvale, CA 94089 yu.huang07@gmail.com
    • Analog film: Hollywood; Digital photography: Photoshop; Digital video camcorder: Premiere, Final Cut, Media Composer. Video manipulation and interaction: ◦ Spatial: Retargeting; ◦ Temporal: Replaying; ◦ Luminance: Repainting; ◦ Hybrid: Reshuffling/Reproduction. 2
    • Video retargeting is the process of transforming an existing video to fit the dimensions of an arbitrary display. Top-down: ◦ Object-based; ◦ Regions of interest. Bottom-up: ◦ Saliency; ◦ Features. Goal: keep the prominent objects untouched and distort only the homogeneous regions. 3
    •   Minimization of information loss based on image saliency, objects saliency and detected objects (e.g. faces). Cropping-and-scaling. 4
    • Segment the frame into background and foreground using motion and color information; Scale each of them independently and then recombine them to produce the retargeted frame. 5
    • Non-homogeneous Content-Driven Video Retargeting [Wolf et al., ICCV’07] The saliency map comprises spatial edges, face detection and motion detection; Nonlinear global warping. 6
    •      Seam carving: intelligently removing pixels from (or adding pixels to) the frame that carry little importance [Avidan et al., Siggraph07]. A seam is an optimal 8-connected path of pixels from top to bottom, or left to right, where optimality is defined by an image energy function. The selection and order of seams protect the content of the image, as defined by the energy function. In spatial temporal volume, a seam must be monotonic and connected. 2D surfaces to be removed from the 3D video cube. 7
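The seam definition above can be sketched for a single frame as a dynamic program. This is only a minimal illustration under stated assumptions: frames are grayscale nested lists, the energy is a simple gradient magnitude (the slide does not fix a particular energy function), and the function names are hypothetical. It removes one vertical 8-connected seam from a 2D frame and does not attempt the 2D-surface removal from the 3D video cube.

```python
def gradient_energy(img):
    """Assumed energy: |horizontal difference| + |vertical difference| per pixel."""
    h, w = len(img), len(img[0])
    e = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            dy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            e[y][x] = abs(dx) + abs(dy)
    return e


def find_vertical_seam(e):
    """Return one column index per row for the minimum-energy top-to-bottom seam."""
    h, w = len(e), len(e[0])
    cost = [row[:] for row in e]
    # Forward pass: cheapest cumulative cost of reaching each pixel from the top row,
    # moving only to one of the three neighbours below (8-connectivity).
    for y in range(1, h):
        for x in range(w):
            lo, hi = max(x - 1, 0), min(x + 1, w - 1)
            cost[y][x] += min(cost[y - 1][lo:hi + 1])
    # Backward pass: start at the cheapest bottom pixel and walk upwards.
    seam = [0] * h
    seam[-1] = min(range(w), key=lambda c: cost[-1][c])
    for y in range(h - 2, -1, -1):
        x = seam[y + 1]
        lo, hi = max(x - 1, 0), min(x + 1, w - 1)
        seam[y] = min(range(lo, hi + 1), key=lambda c: cost[y][c])
    return seam


def remove_vertical_seam(img, seam):
    """Drop the seam pixel from every row, shrinking the width by one."""
    return [row[:x] + row[x + 1:] for row, x in zip(img, seam)]
```

Calling `find_vertical_seam(gradient_energy(frame))` and then `remove_vertical_seam(frame, seam)` repeatedly narrows the frame by one column per iteration.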
    • Industry Products: Intelligent thumbnail generation; Image retargeting in Photoshop: ◦ Seam carving for content-aware scaling; Video retargeting in YouTube: ◦ Seam carving for aspect ratio change. 8
    • Video abstraction (replaying): ◦ Frames->Shots->Scenes->Events; ◦ Video Summary (still story board, dynamic slideshows); ◦ Video Skimming (dynamic preview): Highlight; Summary sequence. ◦ Time-lapse: visualizing motions and processes that evolve too slowly to be perceived in real time; ◦ Scrubbing: controlling the video frame time by mouse motion along a time line or slider; ◦ Object tracking. 9
    •   Audio and video info. extraction to create a "skim" video with a very short synopsis; Integration of language and image understanding techniques. 10
    • Computational time-lapse video [Bennett, Siggraph’07] Non-uniform sampling to select salient output frames; Non-linear integration of multiple frames for norm. long exposures. 11
    • Select the salient ROIs and resize them according to their saliencies; Seamlessly arrange these ROIs on a given canvas while preserving the temporal structure of the video content. Figure panels: A - 2D collage; B - video sequence; C - 1D collage; D - selected key frames. 12
    • Video Abstraction @ Huawei 13
    • Industry Products  Video search result in Bing (Microsoft) 14
    •  Pixel level: ◦ High dynamic range (HDR) video; ◦ Color transfer/Re-colorization; ◦ Video stylization/cartooning.  Object/Layer level: ◦ Rigid: planar homography; ◦ Non-rigid (deformable): texture map; ◦ 3-d modeling. 15
    • Reproduce new videos that depict, as far as possible, the full visual dynamics of real-world scenes through tone reproduction (tone mapping); Tone mapping provides a method of scaling (mapping) luminance values in the real world to a displayable range, either spatially varying (local) or spatially uniform (global); Image->Video (temporal consistency). 16
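A spatially uniform (global) tone mapping with a simple form of temporal consistency can be sketched as follows. This is an illustration, not a method named on the slide: it swaps in the well-known global operator L/(1+L) with a log-average luminance scale, and the exponential smoothing of that average across frames is an assumption made here to avoid flicker. The parameter names `key` and `smooth` are hypothetical.

```python
import math


def log_average(lum, eps=1e-6):
    """Log-average (geometric mean) luminance of one frame given as nested lists."""
    n = sum(len(row) for row in lum)
    return math.exp(sum(math.log(eps + v) for row in lum for v in row) / n)


def tone_map_video(frames, key=0.18, smooth=0.9):
    """Map HDR luminance frames to [0, 1) with a globally uniform curve.

    The per-frame adaptation level is smoothed over time (assumption) so that
    the mapping changes gradually from frame to frame.
    """
    out, avg = [], None
    for lum in frames:
        cur = log_average(lum)
        avg = cur if avg is None else smooth * avg + (1.0 - smooth) * cur
        scale = key / avg
        # Global operator: scaled luminance compressed by L / (1 + L).
        out.append([[(scale * v) / (1.0 + scale * v) for v in row] for row in lum])
    return out
```

With `smooth=0.0` each frame is mapped independently, which is the frame-by-frame case that tends to flicker when scene brightness changes.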
    • HDR Video [Kang et al., Siggraph03] Statistics from temporally neighboring frames to produce tone-mapped images that smoothly vary in time;  Accurately registering neighboring frames and selecting the most trustworthy pixels for radiance map computation.  17
    • Display Adaptive Tone Mapping [Mantiuk et al., Siggraph2008] Display adaptive tone-mapping to produce the least distorted image (visible contrast distortions weighted by the HVS model);  Extension with temporal coherence to video sequences. 18
    • Patch-Based High Dynamic Range Video [N K Kalantari et al., Siggraph Asia 2012] Off-the-shelf camera alternates between M different exposures, capturing only one specific exposure at each frame; Missing exposures are reconstructed at each frame by patch search/vote on the two neighboring frames; Combination of patch-based synthesis and optic flow for continuity.  19
    • Colorization is generally used for increasing the visual appeal of grayscale images and perceptually enhancing various single-band medical/scientific images (pseudo color). Example-based: ◦ Luminance statistics to colorize grayscale images (target) using some reference color images (source); ◦ Transfer the entire color “mood” by matching luminance and texture information [Welsh et al., Siggraph02]. Stroke-based: ◦ The user scribbles some colors (strokes) in the image as constraints; ◦ Neighboring pixels having similar intensities in the monochrome data should have similar colors in the chroma channels [Levin et al., Siggraph04]; ◦ Luminance-weighted chrominance blending for high resolution colorization and fast intrinsic distance computations [Yatziv & Sapiro, IEEE T-IP 06]. 20
    • Color Harmonization   Adjust combination of colors in order to enhance their visual harmony. Optimization in the Hue space, histogram shift or adjustment. 21
    •    Stylized rendering of video is an active area of research in nonphotorealistic rendering (NPR); Cartoon animations are typically composed of large regions which are semantically meaningful and highly abstracted by artists. A region may simply be constantly colored as in most cel animation systems, or it may be rendered in some other consistent style. 22
    • Video Tooning [Wang et al., Siggraph04] The user simply outlines objects on key frames. A mean shift algorithm is then employed to create 3-d semantic regions by interpolation between the key frames; Maintain smooth trajectories along the time dimension. 23
    •   Abstracts imagery by modifying the contrast of visually important features, namely luminance and color opponents; Soft color quantization to create cartoon-like effects with temporal coherence. 24
    •  Automatically recover a cloud of 3D scene points from a video sequence [Voodoo Camera Tracker, DigiLab08]; ◦ Vulnerable to ambiguities in the image data, degeneracies in camera motion, and a lack of discernible features on the model surface.    Manual intervention in the modeling process [Video Trace Siggraph07]. 3-D modeling of the non-rigid structure has proved surprisingly troublesome. Idea: recover the object’s texture map, rather than its 3D shape [Rav-Acha et al., Siggraph08 ]; ◦ Accompanying the recovered texture map will be a 2D-to-2D mapping describing the texture map’s projection to the images, and a sequence of binary masks modeling occlusion. 25
    • Partially recover planar regions and exploit them to perform perspective correction; Add/remove objects by copying them from other video streams and distorting them perspectively. 26
    • 27
    •  Rearranging the content for repurposing: ◦ Spatio-temporal-luminance domain; ◦ Regions, layers or objects; ◦ Motions, events.  Motion/Events: ◦ Video Texture; ◦ Video Synopsis; ◦ Video Rewrite.  Objects/Regions: ◦ Video Matting & Compositing; ◦ Video Completion; ◦ Video Annotation. 28
    • “Video Textures”: find transitions within a video that are not noticeable, in order to extend playing time indefinitely [Schödl et al., Siggraph00]; “Panoramic Video Textures (PVT)”: a video being stitched into a single, wide field of view and playing continuously [Agarwala et al., Siggraph05]; “Evolving Time Fronts” (Dynamosaics): sweep the space-time video volume with a time front surface and generate time slices in a new video sequence [Rav-Acha et al., Siggraph05]; Video synopsis: allowing events to occur simultaneously and/or out of chronological order [Rav-Acha et al., Siggraph06] [Pritch et al., ICCV08]; ◦ Segmenting the input video into objects and activities rather than frames. 29
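The video-texture idea of looping a clip through unobtrusive transitions can be sketched with a frame-distance search. This is a minimal sketch under assumptions: frames are flattened lists of pixel values, similarity is plain L2 distance, and a jump from frame i to frame j is rated by how closely frame j resembles the natural successor i+1. The function names are hypothetical, and refinements such as preserving motion dynamics are left out.

```python
def frame_distance(a, b):
    """L2 distance between two frames given as flat lists of pixel values."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def best_transitions(frames, top_k=3):
    """Rank jumps i -> j by how well frame j matches the natural successor i + 1.

    Returns up to top_k tuples (cost, i, j), cheapest first. A low cost means
    playing frame j right after frame i should be hard to notice.
    """
    n = len(frames)
    candidates = []
    for i in range(n - 1):
        for j in range(n):
            if j == i + 1:
                continue  # the natural successor is not a jump
            candidates.append((frame_distance(frames[i + 1], frames[j]), i, j))
    candidates.sort()
    return candidates[:top_k]
```

For a clip that repeats itself, the cheapest transitions are the ones that jump back by exactly one period, which is what allows indefinite playback.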
    • 30
    • Fixed camera Moving camera 31
    •  Digital matting or pulling a matte: accurately separating a foreground object from the background involves determining both full and partial pixel coverage; ◦ Trimap; ◦ Grabcut;  Video matting, extracting dynamic objects from video sequences is more challenging: ◦ large data size; ◦ temporal coherence; ◦ fast motion vs. low temporal resolution.  Compositing: seamlessly combining multiple image or video regions. 32
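Partial pixel coverage is usually expressed through the compositing equation I = alpha * F + (1 - alpha) * B, where alpha is the matte. The sketch below assumes single-channel images stored as nested lists and uses hypothetical function names. It shows compositing with a known matte and the inverse step of solving for alpha when foreground and background are both known, which is the easy case and not the full matting problem.

```python
def composite(fg, bg, alpha):
    """Blend foreground over background per pixel: I = a*F + (1 - a)*B."""
    return [
        [a * f + (1.0 - a) * b for f, b, a in zip(f_row, b_row, a_row)]
        for f_row, b_row, a_row in zip(fg, bg, alpha)
    ]


def solve_alpha(image, fg, bg, eps=1e-9):
    """Recover alpha where F and B are known and differ; clamp to [0, 1].

    Pixels where F and B are (nearly) equal carry no coverage information,
    so alpha is set to 0.0 there by convention (assumption).
    """
    result = []
    for i_row, f_row, b_row in zip(image, fg, bg):
        row = []
        for i, f, b in zip(i_row, f_row, b_row):
            if abs(f - b) < eps:
                row.append(0.0)
            else:
                row.append(min(1.0, max(0.0, (i - b) / (f - b))))
        result.append(row)
    return result
```

In real matting F, B and alpha are all unknown for pixels in the uncertain band of the trimap, which is why extra priors or user input are needed.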
    • Video Matting of Complex Scenes [Chuang et al., Siggraph02] Bayesian matting to get trimap; Background-foreground segmentation; optic flow to temporally propagate the trimaps. 33
    •   An adaptive trimap propagation mechanism for setting up the initial matting conditions in a temporally-coherent way; A temporal matte filter which can improve the temporal coherence of the mattes while maintaining the matte structures on individual frames. 34
    •     To produce spatiotemporally coherent clusters of moving foreground pixels in video matting; Non-local principles: kNN Laplacian matrix; Motion (optic flow) used for clustering; Trimap propagation for temporal coherence. 35
    •  Image completion: ◦ Inpainting; ◦ Texture synthesis; ◦ Shift map.  Video completion refers to the process of filling in missing pixels or replacing undesirable pixels in a video; ◦ the amount of data is much larger; ◦ temporal consistency is a necessity.  Simply completing video sequences frame by frame using image completion leads to flickering. 36
    • Shift-map represents the selected label for each output pixel. ◦ A data term indicates constraints such as the change in image size, object rearrangement, a possible saliency map, etc. ◦ A smoothness term minimizes the new discontinuities caused by discontinuities in the shift-map. This labeling problem is solved by graph cuts. Figure panels: Input Image; Mask; Shift-Map; Result. 37
    •   Iteratively fill the missing video portions pixel by pixel using spatio-temporal pyramids; Multiple target fragments are considered at different locations for the unknown pixel. 38
    • Fill in missing video parts by sampling spatio-temporal patches of local motion instead of directly sampling color; Color can be propagated to produce a seamless hole-free video. 39
    • Input: video, mask for object to be removed, mask for dynamic objects to remain; Align other candidate frames in which parts of the missing region are visible; Intensity differences between sources are smoothed using gradient domain fusion; Homography-based frame alignment with epipolar constraints. 40
    • Video annotation is the task of associating graphical objects with moving objects on the screen. ◦ Metadata-based interaction; ◦ Schematic storyboard (path arrow); ◦ Hyperlinks; ◦ Balloon or speech. 41
    • Annotation Vocabulary: lexicon/thesaurus, taxonomy, ontology; ◦ Key words, textual; ◦ Well-defined semantics. Metadata Type (annotation dimension): ◦ content descriptive metadata addressing subject matter information; ◦ structural metadata describing spatial, temporal and spatio-temporal decomposition aspects; ◦ media metadata referring to low-level features; ◦ administrative metadata covering descriptions regarding the creation date of the annotation, the annotation creator, etc. Granularity: describes the whole content or specific parts of it. ◦ Annotation may refer to the entire video, temporal segments (shots), frames, regions within frames, or even to moving regions. Localization: the way a part of interest is localized within a content asset. ◦ automatically; ◦ manually. Annotation expressivity: level of expressivity supported w.r.t. the annotation vocabulary, such as ontology for subject matter descriptions; ◦ may support only concept-based annotation, or may also create annotations representing relations among concepts. Application Type: a web-based or a stand-alone application. License: conditions under which the tool operates, e.g. open source, etc. Collaboration: concurrent annotations by multiple users or not.
    • Image annotation methods: ◦ top-down approach: generating language through combinations of object detections and language models; ◦ bottom-up method: propagation of keyword tags from training images to test images through prob. or NN techniques. Video annotation will take scene/event information into account: ◦ temporal variation/articulation and dependencies. Topic models for activities: ◦ hidden topic Markov model; ◦ frame-by-frame Markov topic models; ◦ Action Bank: ties high-level actions to constituent low-level action detections. Model language and visual topics jointly: ◦ Extract high-level concepts, like faces, humans, tables; ◦ generate language descriptions, then model them jointly with visual topics.
    • The Video and Image Annotation (VIA) tool by the MK-Lab; IBM VideoAnnEx annotation tool; Ontolog: Norwegian U. of Science and Technology; Advene (Annotate Digital Video, Exchange on the NEt): LIRIS laboratory at University Claude Bernard Lyon; Elan: Max Planck Institute for Psycholinguistics, primarily for linguistic purposes; Anvil: primarily designed for linguistic purposes, German Research Center for Artificial Intelligence (DFKI); Semantic Video Annotation Suite (SVAS): Joanneum Research, Institute of Information Systems & Information Management; Vannotea: collaborative indexing, browsing, annotation of video content, University of Queensland; ProjectPad: web-based system for collaborative media annotation & management at Northwestern U.; The Video Performance Evaluation Resource Kits Ground Truth (ViPER-GT): Language And Media Processing (LAMP) lab, University of Maryland; ……
    • 45 CONT
    • Allows annotating object category, shape, motion, and activity information in real-world videos; Augmented with knowledge from static image databases to infer occlusion and depth information.
    • Enhance weak annotation; Requires only one label per video, i.e., whether it contains the class or not, and does not even assume that all frames in the video contain the target class; Learn the object class from both videos and still images jointly as a domain adaptation task, i.e. learning while shifting the domain from videos to images.
    •  A hybrid system consisting of a low level multimodal latent topic model for initial keyword annotation, a middle level of concept detectors and a high level module to produce final lingual descriptions.
    • 49
    • Video-based Rendering: Rely on a 2-d image sequence of a scene to render sequences of novel views of this scene. ◦ Light-Field/Lumigraph: sample the set of light rays in the world; ◦ Panorama (mosaicing): a big image; ◦ Depth-image-based (3-d warping): image + depth, for free-viewpoint rendering; ◦ View morphing: view interpolation. 50
    • Augmented Reality: AR allows the user to see the real world, with virtual objects superimposed upon or composited with the real world; AR can be thought of as the “middle ground” between VE (completely synthetic) and tele-presence (completely real), and is one form of Mixed Reality; Key issues in AR: ◦ Focus and contrast; ◦ Portability: Mobile AR; ◦ Tracking and registration; ◦ Environment modeling. 51
    • Intelligent Content Manipulation  A convenient and natural interaction with the computer; ◦ Facial Expression; ◦ Hand Gesture; ◦ Body Movement;  Applications: Touch screen, Game console, Access control, Auto driver assistance, Non-verbal communication for social network. 52
    • Clickable Video Definition: allow Internet users to click on various aspects of a video to pull up additional information and link to external sites, engage consumers to incite an immediate response.  Hypervideo, or hyperlinked video, is interactive video that is akin to hypertext and allows for non-linear navigation.  The process of creating hypervideo content is known as authoring;   VideoClix: the premier and only commercially available technology for creating clickable videos; Interactive TV (iTV) is a popular application area of hyperlinked video with the convergence between broadcast and network communications. 53
    • Visual Surveillance & Monitoring (VSAM) Detect, recognize and track certain objects from image sequences, and more generally to understand and describe object behaviors (indoor, outdoor or aerial).   Applications: security guard for communities and important buildings, traffic surveillance in cities and expressways, detection of military targets etc. 54
    •   Temporally: sub-shot, shot, scene and story units; Spatially: motion segmentation or ‘object’ segmentation; ◦ Generative layered models; ◦ Graph-based models; ◦ Mean-shift clustering; ◦ Manifold-embedding and eigen decomposition;  ISOMAP, spectral clustering.  Temporal volume, motion and occlusion for spatial segmentation. ◦ Motion history and object trajectories: region tracking.
    • Generative Model: MRF. Random Field: F={F1,F2,…FM} is a family of random variables on a set S in which each Fi takes a value fi in a label set L. Markov Random Field: F is said to be an MRF on S w.r.t. a neighborhood N if and only if it satisfies the Markov property. ◦ Generative model for joint probability p(x); ◦ allows no direct probabilistic interpretation; ◦ define potential functions Ψ on maximal cliques A: map a joint assignment to a non-negative real number, requires normalization. An MRF is an undirected graphical model. 56
    • Generative Model: HMM. A Hidden Markov Model (HMM) is a statistical Markov model: the modeled system is a Markov process with unobserved (hidden) states; In an HMM, the state is not visible, but the output, dependent on the state, is visible. ◦ Each state has a probability distribution over the possible output tokens; ◦ The sequence of tokens generated by an HMM gives some information about the sequence of states. Note: the adjective 'hidden' refers to the state sequence through which the model passes, not to the parameters of the model; An HMM can be considered a generalization of a mixture model where the hidden variables are related through a Markov process; Inference: probability of an observed sequence by the Forward-Backward algorithm and the most likely state trajectory by the Viterbi algorithm (DP); Learning: optimize state transition and output probabilities by the Baum-Welch algorithm (special case of EM). 57
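The Viterbi step mentioned above (most likely state trajectory by dynamic programming) can be sketched as follows. This is a minimal sketch assuming discrete observations, integer-indexed states, and plain probabilities rather than log-probabilities, so it suits short sequences only. The argument names are hypothetical.

```python
def viterbi(obs, start, trans, emit):
    """Most likely hidden state path for a discrete HMM.

    obs:   list of observation indices
    start: start[s]    = P(state_0 = s)
    trans: trans[r][s] = P(state_t = s | state_{t-1} = r)
    emit:  emit[s][o]  = P(observation o | state s)
    Returns (path, probability of that path jointly with the observations).
    """
    n = len(start)
    delta = [start[s] * emit[s][obs[0]] for s in range(n)]
    backpointers = []
    for o in obs[1:]:
        prev, delta, ptr = delta, [], []
        for s in range(n):
            # Best predecessor of state s at this time step.
            best = max(range(n), key=lambda r: prev[r] * trans[r][s])
            ptr.append(best)
            delta.append(prev[best] * trans[best][s] * emit[s][o])
        backpointers.append(ptr)
    # Trace the best path backwards from the most probable final state.
    last = max(range(n), key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]
```

For long sequences the products underflow, so a practical version would work with sums of log-probabilities instead.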
    • Discriminative Model: CRF. Conditional, not joint, probabilistic sequential model p(y|x); Allows arbitrary, non-independent features on the observation sequence X; Specifies the probability of possible label sequences given an observation sequence; The probability of a transition between labels may depend on past/future observations; Relaxes strong independence assumptions, no p(x) required; A CRF is an MRF plus “external” variables, where the “internal” variables Y of the MRF are unobserved and the “external” variables X are observed; Linear-chain CRF: transition score depends on the current observation; Optimization for learning CRF: discriminative model ◦ Inference by DP like HMM, learning by forward-backward as HMM ◦ Conjugate gradient, stochastic gradient,… 58
    • A flow network G(V, E) is defined as a directed graph with a source s and a sink t, where each edge (u,v) in E has a non-negative capacity c(u,v) >= 0; The max-flow problem is to find the flow of maximum value on a flow network G; An s-t cut, or simply cut, of a flow network G is a partition of V into S and T = V-S, such that s is in S and t is in T; A minimum cut of a flow network is a cut whose capacity is the least over all the s-t cuts of the network; Methods for max-flow or min-cut: ◦ Ford-Fulkerson method; ◦ "Push-Relabel" method.
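The relation between max-flow and min-cut can be sketched with the Ford-Fulkerson method using breadth-first augmenting paths (the Edmonds-Karp variant). This is a minimal sketch assuming a dense capacity matrix and integer vertex indices. The function name is hypothetical. The source side S of a minimum cut is read off as the set of vertices still reachable from s in the final residual graph.

```python
from collections import deque


def max_flow_min_cut(capacity, s, t):
    """Return (max flow value, source-side vertex set S of a minimum s-t cut)."""
    n = len(capacity)
    residual = [row[:] for row in capacity]
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path in the residual graph.
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            break  # no augmenting path left: the flow is maximal
        # Bottleneck capacity along the path found.
        bottleneck = float("inf")
        v = t
        while v != s:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push flow: reduce forward residuals, increase reverse residuals.
        v = t
        while v != s:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck
    # Vertices reached by the last (failed) search form the source side of a min cut.
    source_side = {v for v in range(n) if parent[v] != -1}
    return flow, source_side
```

Summing the original capacities of edges leaving `source_side` gives the cut capacity, which equals the returned flow value.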
    • Labeling is mostly solved as an energy minimization problem; Two common energy models: ◦ Potts Interaction Energy Model; ◦ Linear Interaction Energy Model. Graph G contains two kinds of vertices: p-vertices and i-vertices; ◦ all the edges in the neighborhood N, called n-links; ◦ edges between the p-vertices and the i-vertices, called t-links. In the multiple labeling case, the multi-way cut should leave each p-vertex connected to one i-vertex; The minimum cost multi-way cut will minimize the energy function where the severed n-links would correspond to the boundaries of the labeled vertices; The approximation algorithms to find this multi-way cut: ◦ "alpha-expansion" algorithm; ◦ "alpha-beta swap" algorithm.
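The energy that such a multi-way cut minimizes can be made concrete for the Potts model. This is a minimal sketch assuming a 4-connected pixel grid, a per-pixel data cost table, and a single smoothness weight `lam`, none of which is fixed by the slide. It only evaluates the energy of a given labeling and does not implement alpha-expansion or alpha-beta swap.

```python
def potts_energy(labels, data_cost, lam=1.0):
    """Energy of a labeling under the Potts interaction model.

    labels[y][x]       : label index assigned to pixel (y, x)
    data_cost[y][x][l] : cost of giving label l to pixel (y, x)
    lam                : penalty for every pair of 4-neighbours with different labels
    """
    h, w = len(labels), len(labels[0])
    energy = 0.0
    for y in range(h):
        for x in range(w):
            label = labels[y][x]
            energy += data_cost[y][x][label]
            # Count each neighbouring pair once: right and down neighbours only.
            if x + 1 < w and labels[y][x + 1] != label:
                energy += lam
            if y + 1 < h and labels[y + 1][x] != label:
                energy += lam
    return energy
```

Each neighbouring pair with different labels corresponds to a severed n-link, so the smoothness part of the energy counts the length of the label boundaries.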
    • Belief propagation (BP) on a simplified Bayes net: it propagates information throughout a graphical model via a series of messages sent iteratively between neighboring nodes; it is likely to converge to a consensus that determines the marginal probabilities of all the variables; messages estimate the cost (or energy) of a configuration of a clique given all other cliques; the messages are then combined to compute a belief (marginal or maximum probability); Two types of BP methods: ◦ max-product; ◦ sum-product. BP provides an exact solution when there are no loops in the graph; it is equivalent to dynamic programming/Viterbi in these cases; Loopy belief propagation: still provides an approximate (but often good) solution.
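The loop-free case, where sum-product message passing is exact, can be sketched on a chain. This is a minimal sketch assuming one pairwise compatibility table shared by all edges and a unary evidence vector per node. The function names are hypothetical. A brute-force enumeration is included only so the exactness claim can be checked on small chains.

```python
import itertools


def chain_marginals(unary, pairwise):
    """Exact marginals on a chain by sum-product message passing.

    unary[i][a]    : local evidence for node i being in state a
    pairwise[a][b] : compatibility of neighbouring states a (left) and b (right)
    """
    n, k = len(unary), len(unary[0])
    fwd = [[1.0] * k for _ in range(n)]  # message arriving at node i from the left
    bwd = [[1.0] * k for _ in range(n)]  # message arriving at node i from the right
    for i in range(1, n):
        for b in range(k):
            fwd[i][b] = sum(fwd[i - 1][a] * unary[i - 1][a] * pairwise[a][b]
                            for a in range(k))
    for i in range(n - 2, -1, -1):
        for a in range(k):
            bwd[i][a] = sum(bwd[i + 1][b] * unary[i + 1][b] * pairwise[a][b]
                            for b in range(k))
    marginals = []
    for i in range(n):
        # Belief = local evidence times all incoming messages, then normalized.
        belief = [fwd[i][a] * unary[i][a] * bwd[i][a] for a in range(k)]
        z = sum(belief)
        marginals.append([v / z for v in belief])
    return marginals


def brute_force_marginals(unary, pairwise):
    """Reference marginals by enumerating every joint configuration."""
    n, k = len(unary), len(unary[0])
    totals = [[0.0] * k for _ in range(n)]
    for config in itertools.product(range(k), repeat=n):
        weight = 1.0
        for i, a in enumerate(config):
            weight *= unary[i][a]
            if i + 1 < n:
                weight *= pairwise[a][config[i + 1]]
        for i, a in enumerate(config):
            totals[i][a] += weight
    return [[v / sum(row) for v in row] for row in totals]
```

Replacing the sums in the message updates by maxima gives the max-product variant, which on a chain coincides with the Viterbi recursion.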
    • Generalized BP for pairwise MRFs ◦ Hidden variables xi and xj are connected through a compatibility function $\psi_{ij}(x_i, x_j)$; ◦ Hidden variables xi are connected to observable variables yi by the local “evidence” function $\phi_i(x_i, y_i)$; The joint probability of {x} is given by $p(\{x\}) = \frac{1}{Z} \prod_{(i,j)} \psi_{ij}(x_i, x_j) \prod_i \phi_i(x_i, y_i)$, where Z is the normalization constant. To improve inference by taking into account higher-order interactions among the variables: ◦ An intuitive way is to define messages that propagate between groups of nodes rather than just single nodes; ◦ This is the intuition in Generalized Belief Propagation (GBP).
    • MCMC Sampling for Optimization. Markov chain: a stochastic process in which future states are independent of past states given the present state. ◦ A Markov chain will typically converge to a stable distribution. Markov Chain Monte Carlo: sampling using ‘local’ information ◦ Devise a Markov chain whose stationary distribution is the target. An ergodic MC must be aperiodic, irreducible, and positive recurrent. ◦ Monte Carlo integration to get quantities of interest. Metropolis-Hastings method: sampling from a target distribution ◦ Create a Markov chain whose transition matrix does not depend on the normalization term. ◦ Make sure the chain has a stationary distribution and that it is equal to the target distribution (acceptance ratio). ◦ After a sufficient number of iterations, the chain will converge to the stationary distribution. Gibbs sampling is a special case of M-H sampling. ◦ The Hammersley-Clifford Theorem: get the joint distribution from the complete conditional distributions. 63
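The Metropolis-Hastings steps can be sketched for a one-dimensional target known only up to its normalization term. This is a minimal sketch assuming a symmetric Gaussian random-walk proposal, so the acceptance ratio reduces to the ratio of target densities. The function and parameter names are hypothetical, and the fixed seed is there only to make runs repeatable.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    """Draw samples whose stationary distribution is proportional to exp(log_target).

    log_target: unnormalized log-density; the normalization term is never needed
    step:       standard deviation of the symmetric random-walk proposal (assumption)
    """
    rng = random.Random(seed)
    x, log_p = x0, log_target(x0)
    samples = []
    for _ in range(n_samples):
        candidate = x + rng.gauss(0.0, step)
        log_p_candidate = log_target(candidate)
        # Accept with probability min(1, p(candidate) / p(x)); 1 - random() lies in (0, 1].
        if math.log(1.0 - rng.random()) < log_p_candidate - log_p:
            x, log_p = candidate, log_p_candidate
        samples.append(x)  # a rejected move repeats the current state
    return samples
```

For example, `metropolis_hastings(lambda x: -0.5 * (x - 3.0) ** 2, 0.0, 20000)` targets a unit-variance Gaussian centered at 3; averaging the samples after discarding an initial burn-in is the Monte Carlo integration step.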
    • Thanks! 64