The 11th European Conference on Computer Vision (ECCV 2010) was held September 5-10, 2010 in Hersonissos, Heraklion, Crete, Greece. The conference included a tutorial on statistical and structural recognition of human actions. Motion capture and analysis have been of interest since early Renaissance studies of human anatomy and biomechanics. Modern applications of human action recognition include motion capture for animation, video editing, unusual activity detection in surveillance video, and large-scale video search enabled by advances in computer vision techniques.
ECCV2010 tutorial: statistical and structural recognition of human actions, part I
1. 11th European Conference on Computer Vision
Hersonissos, Heraklion, Crete, Greece
September 5, 2010
Tutorial on Statistical and Structural Recognition of Human Actions
Ivan Laptev and Greg Mori
4. Motivation I: Artistic Representation
Early studies were motivated by human representations in Arts.
Da Vinci: “it is indispensable for a painter, to become totally familiar with the anatomy of nerves, bones, muscles, and sinews, such that he understands, for their various motions and stresses, which sinews or which muscle causes a particular motion”
“I ask for the weight [pressure] of this man for every segment of motion when climbing those stairs, and for the weight he places on b and on c. Note the vertical line below the center of mass of this man.”
Leonardo da Vinci (1452–1519): A man going upstairs, or up a ladder.
5. Motivation II: Biomechanics
• The emergence of biomechanics
• Borelli applied to biology the analytical and geometrical methods developed by Galileo Galilei
• He was the first to understand that bones serve as levers and muscles function according to mathematical principles
• His physiological studies included muscle analysis and a mathematical discussion of movements, such as running or jumping
Giovanni Alfonso Borelli (1608–1679)
6. Motivation III: Motion perception
Etienne-Jules Marey (1830–1904) made chronophotographic experiments influential for the emerging field of cinematography.
Eadweard Muybridge (1830–1904) invented a machine for displaying the recorded series of images. He pioneered motion pictures and applied his technique to movement studies.
7. Motivation III: Motion perception
• Gunnar Johansson [1973] pioneered studies on the use of image sequences for a programmed human motion analysis
• “Moving Light Displays” (LED) enable identification of familiar people and of gender, and inspired many works in computer vision
Gunnar Johansson, Perception and Psychophysics, 1973
8. Human actions: Historic overview
• 15th century: studies of anatomy
• 17th century: emergence of biomechanics
• 19th century: emergence of cinematography
• 1973: studies of human motion perception
• Modern computer vision
11. Modern applications: Video editing
Space-Time Video Completion
Y. Wexler, E. Shechtman and M. Irani, CVPR 2004
12. Modern applications: Video editing
Space-Time Video Completion
Y. Wexler, E. Shechtman and M. Irani, CVPR 2004
13. Modern applications: Video editing
Recognizing Action at a Distance
Alexei A. Efros, Alexander C. Berg, Greg Mori, Jitendra Malik, ICCV 2003
14. Modern applications: Video editing
Recognizing Action at a Distance
Alexei A. Efros, Alexander C. Berg, Greg Mori, Jitendra Malik, ICCV 2003
15. Applications: Unusual Activity Detection
e.g. for surveillance
Detecting Irregularities in Images and in Video
Boiman & Irani, ICCV 2005
16. Applications: Video Search
• Huge amount of video is available and growing
• TV channels recorded since the 60s
• >34K hours of video uploaded every day
• ~30M surveillance cameras in the US => ~700K video hours/day
17. Applications: Video Search
• Useful for TV production, entertainment, education, social studies, security, …
• Home videos: e.g. “My daughter climbing”
• TV & Web: e.g. “Fight in a parliament”
• Sociology research: e.g. manually analyzed smoking actions in 900 movies
• Surveillance: e.g. “Woman throws cat into wheelie bin” (260K views in 7 days)
• … and it’s mainly about people and human actions
21. Goal
Get familiar with:
• Problem formulations
• Mainstream approaches
• Particular existing techniques
• Current benchmarks
• Available baseline methods
• Promising future directions
22. Course overview
• Definitions
• Benchmark datasets
• Early silhouette and tracking-based methods
• Motion-based similarity measures
• Template-based methods
• Local space-time features
• Bag-of-Features action recognition
• Weakly-supervised methods
• Pose estimation and action recognition
• Action recognition in still images
• Human interactions and dynamic scene models
• Conclusions and future directions
23. What is Action Recognition?
• Terminology
– What is an “action”?
• Output representation
– What do we want to say about an image/video?
Unfortunately, neither question has a satisfactory answer yet
24. Terminology
• The terms “action recognition”, “activity recognition”, and “event recognition” are used inconsistently
– Finding a common language for describing videos
is an open problem
25. Terminology Example
• “Action” is a low-level primitive with semantic meaning
– E.g. walking, pointing, placing an object
• “Activity” is a higher-level combination with some
temporal relations
– E.g. taking money out from ATM, waiting for a bus
• “Event” is a combination of activities, often
involving multiple individuals
– E.g. a soccer game, a traffic accident
• This is contentious
– No standard, rigorous definition exists
26. Output Representation
• Given this image what is the desired output?
• This image contains a man walking
– Action classification / recognition
• The man walking is here
– Action detection
27. Output Representation
• Given this image what is the desired output?
• This image contains 5
men walking, 4 jogging,
2 running
• The 5 men walking are
here
• This is a soccer game
28. Output Representation
• Given this image what is the desired output?
• Frames 1-20: the man ran to the left; frames 21-25: he ran away from the camera
• Is this an accurate description?
• Are labels and video frames in 1-1 correspondence?
30. Dataset: KTH-Actions
• 6 action classes by 25 persons in 4 different scenarios
• Total of 2391 video samples
• Specified train, validation, test sets
• Performance measure: average accuracy over all classes
Schuldt, Laptev, Caputo, ICPR 2004
31. UCF-Sports
• 10 different action classes
• 150 video samples in total
• Evaluation method: leave-one-out
• Performance measure: average accuracy over all classes
Classes include Diving, Kicking, Walking, Skateboarding, High-Bar-Swinging, Golf-Swinging
Rodriguez, Ahmed, and Shah, CVPR 2008
32. UCF - YouTube Action Dataset
• 11 categories, 1168 videos
• Evaluation method: leave-one-out
• Performance measure: average accuracy over all classes
Liu, Luo and Shah, CVPR 2009
33. Semantic Description of Human Activities
(ICPR 2010)
• 3 challenges: interaction, aerial view, wide-area
• Interaction
– 6 classes, 120 instances over ~20 min. video
– Classification and detection tasks (+/- bounding boxes)
– Evaluation method: leave-one-out
Ryoo et al., ICPR 2010 challenge
34. Hollywood2
• 12 action classes from 69 Hollywood movies
• 1707 video sequences in total
• Separate movies for training / testing
• Performance measure: mean average precision (mAP)
over all classes
GetOutCar AnswerPhone Kiss
HandShake StandUp DriveCar
Marszałek, Laptev, Schmid CVPR 2009
35. TRECVid Surveillance Event
Detection
• 10 actions: person runs, take picture, cell to ear, …
• 5 cameras, ~100h video from LGW airport
• Detection (in time, not space); multiple detections count as false positives
• Evaluation method: specified training / test videos, evaluation at NIST
• Performance measure: statistics on DET curves
Smeaton, Over, Kraaij, TRECVid
36. Dataset Desiderata
• Clutter
• Not choreographed by dataset collectors
– Real-world variation
• Scale
– Large amount of video
• Rarity of actions
– Detection harder than classification
– Chance performance should be very low
• Clear definition of training/test split
– Validation set for parameter tuning?
– Reproducing / comparing to other methods?
37. Datasets Summary

Dataset               | Clutter? | Choreographed? | Scale                      | Rarity of actions              | Training/testing split
KTH                   | No       | Yes            | 2391 videos                | Classification - one per video | Defined - by actors
UCF Sports            | Yes      | No             | 150 videos                 | Classification - one per video | Undefined - LOO
UCF Youtube           | Yes      | No             | 1168 videos                | Classification - one per video | Undefined - LOO
SDHA-ICPR Interaction | No       | Yes            | 20 minutes, 120 instances  | Classification / detection     | Undefined - LOO
Hollywood2            | Yes      | No             | 69 movies, ~1600 instances | Detection, ~xx actions/h       | Defined - by videos
TRECVid               | Yes      | No             | ~100h                      | Detection, ~20 actions/h       | Defined - by time
39. Action understanding: Key components
Image measurements: foreground segmentation, image gradients, optical flow, local space-time features
Prior knowledge: deformable contour models, 2D/3D body models, motion priors, background models
Association: learning associations from strong / weak supervision; automatic inference of action labels
40. Foreground segmentation
Image differencing: a simple way to measure motion/change:
| I(x, y, t) − I(x, y, t−1) | > const
Better background / foreground separation methods exist (see the sketch below):
• Modeling of color variation at each pixel with Gaussian Mixtures
• Dominant motion compensation for sequences with moving camera
• Motion layer separation for scenes with non-static backgrounds
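As a rough illustration, both measures above can be sketched in a few lines with OpenCV; the video path and the difference threshold are illustrative placeholders, not part of the original slides.

```python
# Sketch only: frame differencing |I_t - I_{t-1}| > const next to a
# per-pixel Gaussian-mixture background model.
import cv2

cap = cv2.VideoCapture("input.mp4")          # placeholder input video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

mog = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff_mask = cv2.absdiff(gray, prev_gray) > 25  # simple differencing
    mog_mask = mog.apply(frame)                    # Gaussian-mixture model
    prev_gray = gray
cap.release()
```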
41. Temporal Templates
Idea: summarize motion in video in a Motion History Image (MHI); see the sketch below
Descriptor: Hu moments of different orders
[A.F. Bobick and J.W. Davis, PAMI 2001]
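A minimal sketch of the MHI idea, assuming NumPy/OpenCV and a crude frame-differencing silhouette; the window length and threshold are illustrative, not values from the paper.

```python
# Sketch only: build a Motion History Image from frame-difference
# silhouettes, then take Hu moments as the action descriptor.
import cv2
import numpy as np

TAU = 15  # temporal window (in frames) retained in the MHI

cap = cv2.VideoCapture("input.mp4")  # placeholder input video
ok, prev = cap.read()
prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
mhi = np.zeros(prev.shape, np.float32)

t = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    t += 1
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    moving = cv2.absdiff(gray, prev) > 25   # crude silhouette
    mhi[moving] = t                         # stamp moving pixels
    mhi[mhi < t - TAU] = 0                  # forget old motion
    prev = gray

# Hu moments of the (time-normalized) MHI form the descriptor
hu = cv2.HuMoments(cv2.moments(mhi / max(t, 1))).flatten()
```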
43. Temporal Templates: Summary
Pros:
+ Simple and fast
+ Works in controlled settings
Cons:
- Prone to errors of background subtraction (variations in light, shadows, clothing… what is the background here?)
- Does not capture interior motion and shape (a silhouette tells little about actions)
Idea: not all shapes are valid; restrict the space of admissible silhouettes
44. Active Shape Models
Point Distribution Model
• Represent the shape of samples by a set of corresponding points or landmarks
• Assume each shape x can be represented by a linear combination of basis shapes, x ≈ x̄ + Φ b, with mean shape x̄, basis matrix Φ, and some parameter vector b (see the sketch below)
[Cootes et al. 1995]
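A minimal sketch of a point-distribution model via PCA, assuming scikit-learn; the data file and variance threshold are hypothetical.

```python
# Sketch only: a point-distribution shape model via PCA. `shapes` is a
# hypothetical (N, 2K) array of K aligned landmarks per training shape.
import numpy as np
from sklearn.decomposition import PCA

shapes = np.load("aligned_shapes.npy")   # placeholder training data

pdm = PCA(n_components=0.98)             # keep 98% of shape variance
b = pdm.fit_transform(shapes)            # parameter vector per shape

# Reconstruction implements x ~ x_mean + Phi b
x_hat = pdm.mean_ + b[0] @ pdm.components_
```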
45. Active Shape Models
• Distribution of eigenvalues: a small fraction of basis shapes (eigenvectors) accounts for most of the shape variation (=> landmarks are redundant)
• Three main modes of lips-shape variation:
46. Active Shape Models: effect of regularization
• Projection onto the shape-space serves as a regularization
47. Person Tracking
Learning flexible models from image sequences
[A. Baumberg and D. Hogg, ECCV 1994]
48. Active Shape Models: Summary
Pros:
+ Shape prior helps overcome segmentation errors
+ Fast optimization
+ Can handle interior/exterior dynamics
Cons:
- Optimization gets trapped in local minima
- Re-initialization is problematic
Possible improvements:
• Learn and use motion priors, possibly specific to
different actions
49. Motion priors
• Accurate motion models can be used both to:
– help accurate tracking
– recognize actions
• Goal: formulate motion models for different types of actions
and use such models for action recognition
Example: drawing with 3 action modes
– line drawing
– scribbling
– idle
[M. Isard and A. Blake, ICCV 1998]
50. Dynamics with discrete states
Joint tracking and
gesture recognition in
the context of a visual
white-board interface
[M.J. Black and A.D. Jepson, ECCV 1998]
51. Motion priors & Tracking: Summary
Pros:
+ more accurate tracking using specific motion models
+ Simultaneous tracking and motion recognition with
discrete state dynamical models
Cons:
- Local minima are still an issue
- Re-initialization is still an issue
52. Shape and Appearance vs. Motion
• Shape and appearance in images depend on many factors: clothing, illumination, contrast, image resolution, etc.
[Efros et al. 2003]
• Motion field (in theory) is invariant to shape and can be used
directly to describe human actions
54. Motion estimation: Optical Flow
• Classical problem of computer vision [Gibson 1955]
• Goal: estimate the motion field
• How? We only have access to image pixels: estimate pixel-wise correspondence between frames = Optical Flow (see the sketch below)
• Brightness constancy assumption: corresponding pixels preserve their intensity (color), i.e. I(x+u, y+v, t+1) = I(x, y, t)
– a useful assumption in many cases
– breaks at occlusions and illumination changes
– physical and visual motion may be different
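One possible instantiation, using OpenCV's Farneback estimator (one of many flow methods, not the one on the slides); the video path is a placeholder.

```python
# Sketch only: dense optical flow with the Farneback method.
import cv2

cap = cv2.VideoCapture("input.mp4")  # placeholder input video
ok, prev = cap.read()
prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # flow[y, x] = (u, v): displacement of each pixel between frames
    flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    prev = gray
cap.release()
```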
55. Parameterized Optical Flow
1. Compute standard Optical Flow for many examples
2. Stack the velocity components of each flow field into one vector
3. Do PCA and obtain the most informative PCA flow basis vectors
[Black, Yacoob, Jepson, Fleet, CVPR 1997]
56. Parameterized Optical Flow
• Estimated coefficients of the PCA flow bases, plotted over frame numbers, can be used as action descriptors (see the sketch below)
• Optical flow seems to be an interesting descriptor for motion/action recognition
[Black, Yacoob, Jepson, Fleet, CVPR 1997]
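A minimal sketch of the parameterized-flow idea, assuming scikit-learn; the flow array and basis size are hypothetical.

```python
# Sketch only: learn a PCA flow basis from stacked (u, v) fields and
# use the projection coefficients as per-frame motion descriptors.
# `training_flows.npy` is a hypothetical (T, H, W, 2) array.
import numpy as np
from sklearn.decomposition import PCA

flows = np.load("training_flows.npy")
X = flows.reshape(len(flows), -1)     # one long flow vector per frame

basis = PCA(n_components=10).fit(X)   # the "PCA flow bases"
coeffs = basis.transform(X)           # action descriptors per frame
```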
58. Spatio-Temporal Motion Descriptor
Compare sequences A and B over a temporal extent E: compute the frame-to-frame similarity matrix (on blurred motion channels) and aggregate it into a motion-to-motion similarity matrix (see the sketch below).
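A minimal sketch of the aggregation step under the assumption that per-frame motion descriptors have already been computed; the descriptors and extent E are hypothetical inputs.

```python
# Sketch only: from per-frame motion descriptors of sequences A and B,
# build the frame-to-frame similarity matrix and sum it along
# diagonals within a temporal extent E (motion-to-motion similarity).
import numpy as np

def motion_similarity(desc_a, desc_b, E=5):
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    S = a @ b.T                       # frame-to-frame similarities
    M = np.zeros_like(S)
    Ta, Tb = S.shape
    for i in range(Ta):
        for j in range(Tb):
            lo = -min(E, i, j)
            hi = min(E, Ta - 1 - i, Tb - 1 - j)
            M[i, j] = sum(S[i + k, j + k] for k in range(lo, hi + 1))
    return M                          # motion-to-motion similarities
```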
59. Football Actions: matching
Input sequence → matched frames
[Efros, Berg, Mori, Malik, ICCV 2003]
62. Classifying Tennis Actions
6 actions; 4600 frames; 7-frame motion descriptor
Woman player used for training, man for testing.
[Efros, Berg, Mori, Malik, ICCV 2003]
63.
64. Motion recognition without motion estimation
• Motion estimation from video is often noisy/unreliable
• Measure motion consistency between a template and test video
[Shechtman and Irani, PAMI 2007]
65. Motion recognition without motion estimation
• Motion estimation from video is often noisy/unreliable
• Measure motion consistency between a template and test video
Test video, template video, correlation result
[Shechtman and Irani, PAMI 2007]
67. Motion-based template matching
Pros:
+ Depends less on variations in appearance
Cons:
- Can be slow
- Does not model negatives
Improvements possible using discriminatively-trained
template-based action classifiers
68. Action Dataset and Annotation
Manual annotation of drinking actions in movies: “Coffee and Cigarettes”; “Sea of Love”
“Drinking”: 159 annotated samples
“Smoking”: 149 annotated samples
Temporal annotation: first frame, keyframe, last frame
Spatial annotation: head rectangle, torso rectangle
70. Actions == space-time objects?
“Stable-view” objects vs. “atomic” actions: car exit, phoning, smoking, hand shaking, drinking
Objective: take advantage of space-time shape
71. Actions == Space-Time Objects?
HOG features; Histograms of Optic Flow (HOF)
72. Histogram features
HOG: histograms of oriented gradient (4 gradient orientation bins)
HOF: histograms of optic flow (4 OF direction bins + 1 bin for no motion)
~10^7 cuboid features; choosing 10^3 randomly
73. Action learning
AdaBoost: an efficient discriminative classifier [Freund & Schapire’97]
• Good performance for face detection [Viola & Jones’01]: pre-aligned samples, Haar features, optimal-threshold weak classifiers
• Here: boosting selects histogram features, with Fisher discriminant weak classifiers (see the sketch below); see [Laptev BMVC’06] for more details
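A minimal sketch of boosting over histogram features, assuming scikit-learn; its default decision-stump weak learner stands in for the Fisher-discriminant weak classifiers of the slides, and the data files are hypothetical.

```python
# Sketch only: AdaBoost over cuboid histogram features.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

X = np.load("cuboid_histograms.npy")  # hypothetical (N, D) features
y = np.load("labels.npy")             # hypothetical binary labels

clf = AdaBoostClassifier(n_estimators=200)  # default weak learner: depth-1 stump
clf.fit(X, y)
scores = clf.decision_function(X)           # boosted classifier scores
```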
74. Drinking action detection
Test episodes from the movie “Coffee and Cigarettes”
[I. Laptev and P. Pérez, ICCV 2007]
75. Where are we so far?
Temporal templates: + simple, fast; - sensitive to segmentation errors
Active shape models: + shape regularization; - sensitive to initialization and tracking failures
Tracking with motion priors: + improved tracking and simultaneous action recognition; - sensitive to initialization and tracking failures
Motion-based recognition: + generic descriptors, less dependent on appearance; - sensitive to localization/tracking errors
76. Course overview
• Definitions
• Benchmark datasets
• Early silhouette and tracking-based methods
• Motion-based similarity measures
• Template-based methods
• Local space-time features
• Bag-of-Features action recognition
• Weakly-supervised methods
• Pose estimation and action recognition
• Action recognition in still images
• Human interactions and dynamic scene models
• Conclusions and future directions
77. How to handle real complexity?
Common problems:
• Complex & changing background
• Changes in appearance
• Large variations in motion
Common methods:
• Camera stabilization?
• Segmentation?
• Tracking?
• Template-based methods?
Avoid global assumptions!
79. Relation to local image features
Airplanes, Motorbikes, Faces, Wild Cats, Leaves, People, Bikes
80. Course overview
• Definitions
• Benchmark datasets
• Early silhouette and tracking-based methods
• Motion-based similarity measures
• Template-based methods
• Local space-time features
• Bag-of-Features action recognition
• Weakly-supervised methods
• Pose estimation and action recognition
• Action recognition in still images
• Human interactions and dynamic scene models
• Conclusions and future directions
81. Space-Time Interest Points
What neighborhoods to consider? Distinctive neighborhoods ⇒ high image variation in space and time ⇒ look at the distribution of the gradient
Definitions:
• f: the original image sequence
• g: a space-time Gaussian with covariance (spatial scale σ², temporal scale τ²)
• Gaussian derivatives of L = g ∗ f, e.g. Lx = ∂x(g ∗ f); space-time gradient ∇L = (Lx, Ly, Lt)ᵀ
• second-moment matrix μ = g ∗ (∇L (∇L)ᵀ)
[Laptev, IJCV 2005]
82. Space-Time Interest Points: Detection
Properties of μ: it defines a second-order approximation of the local distribution of ∇L within a neighborhood
• rank(μ) = 1 ⇒ 1D space-time variation of ∇L, e.g. a moving bar
• rank(μ) = 2 ⇒ 2D space-time variation of ∇L, e.g. a moving ball
• rank(μ) = 3 ⇒ 3D space-time variation of ∇L, e.g. a jumping ball
Large eigenvalues of μ can be detected at the local maxima of H over (x, y, t):
H = det(μ) − k · trace³(μ)
(similar to the Harris operator [Harris and Stephens, 1988]; see the sketch below)
[Laptev, IJCV 2005]
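A bare-bones sketch of the operator above, assuming SciPy/NumPy; the scale values and the constant k are illustrative, not the paper's settings.

```python
# Sketch only: space-time Harris response H = det(mu) - k * trace(mu)^3
# for a grayscale video volume V of shape (T, H, W).
import numpy as np
from scipy import ndimage

def space_time_harris(V, sigma=2.0, tau=1.5, k=0.005):
    L = ndimage.gaussian_filter(V.astype(np.float32), (tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)                  # space-time gradients
    g = lambda A: ndimage.gaussian_filter(A, (2 * tau, 2 * sigma, 2 * sigma))
    # smoothed entries of the 3x3 second-moment matrix mu
    xx, yy, tt = g(Lx * Lx), g(Ly * Ly), g(Lt * Lt)
    xy, xt, yt = g(Lx * Ly), g(Lx * Lt), g(Ly * Lt)
    det = (xx * (yy * tt - yt * yt)
           - xy * (xy * tt - yt * xt)
           + xt * (xy * yt - yy * xt))
    trace = xx + yy + tt
    return det - k * trace ** 3   # keep local maxima over (x, y, t)
```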
83. Space-Time Interest Points: Examples
Motion event detection on synthetic sequences: appearance/disappearance, accelerations, split/merge
[Laptev, IJCV 2005]
88. Space-Time Features: Descriptor
Multi-scale space-time patches from the corner detector are described by:
• Histograms of oriented spatial gradients (HOG): 3x3x2x4-bin descriptor
• Histograms of optical flow (HOF): 3x3x2x5-bin descriptor
Public code available at www.irisa.fr/vista/actions
[Laptev, Marszałek, Schmid, Rozenfeld, CVPR 2008]
89. Visual Vocabulary: K-means clustering
Group similar points in the space of image descriptors using K-means clustering; select significant clusters (c1, c2, c3, c4); see the sketch below
Clustering → Classification
[Laptev, IJCV 2005]
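A minimal sketch of the vocabulary step, assuming scikit-learn; the descriptor file and vocabulary size are hypothetical.

```python
# Sketch only: build a visual vocabulary by K-means over local
# descriptors and turn a video's features into a word histogram.
import numpy as np
from sklearn.cluster import KMeans

train_desc = np.load("train_descriptors.npy")  # hypothetical (N, D)
K = 4000
vocab = KMeans(n_clusters=K, n_init=1).fit(train_desc)

def bof_histogram(desc):
    words = vocab.predict(desc)                       # nearest cluster
    h = np.bincount(words, minlength=K).astype(float)
    return h / h.sum()                                # normalized BoF
```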
92. Periodic Motion
• Periodic views of a sequence can be approximately treated
as stereopairs
[Laptev, Belongie, Pérez, Wills, ICCV 2005]
93. Periodic Motion
• Periodic views of a sequence can be approximately treated
as stereopairs
The fundamental matrix is generally time-dependent
Periodic motion estimation ~ sequence alignment
[Laptev, Belongie, Pérez, Wills, ICCV 2005]
94. Sequence alignment
Generally a hard problem:
• unknown positions and motions of cameras
• unknown temporal offset
• possible time warping
Prior work treats special cases:
• Caspi and Irani, “Spatio-temporal alignment of sequences”, PAMI 2002
• Rao et al., “View-invariant alignment and matching of video sequences”, ICCV 2003
• Tuytelaars and Van Gool, “Synchronizing video sequences”, CVPR 2004
Useful for reconstruction and recognition of dynamic scenes
[Laptev, Belongie, Pérez, Wills, ICCV 2005]
95. Sequence alignment: constant translation
Assume the camera is translating with constant velocity relative to the object
⇒ corresponding points of the sequences are related by a fixed translation
⇒ all corresponding periodic points are on the same epipolar line
[Laptev, Belongie, Pérez, Wills, ICCV 2005]
96. Periodic motion detection
1. Corresponding points have similar descriptors
2. Same period p for all features
3. The spatial arrangement of features across periods satisfies the epipolar constraint
Use RANSAC to estimate F and p (see the sketch below)
[Laptev, Belongie, Pérez, Wills, ICCV 2005]
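A minimal sketch of the RANSAC step, assuming OpenCV and that matched feature points one period apart are already available; searching over the period p would wrap this in an outer loop. File names are placeholders.

```python
# Sketch only: RANSAC fundamental-matrix estimation over periodic
# feature matches; inliers satisfy the epipolar constraint.
import cv2
import numpy as np

pts1 = np.load("points_period_i.npy").astype(np.float32)   # hypothetical
pts2 = np.load("points_period_i1.npy").astype(np.float32)  # matches

F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0)
periodic = inlier_mask.ravel() == 1   # features passing x2^T F x1 = 0
```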
99. Periodic motion segmentation
Assume periodic objects are planar
⇒ periodic points can be related by a dynamic homography, linear in time
[Laptev, Belongie, Pérez, Wills, ICCV 2005]
100. Periodic motion segmentation
Assume periodic objects are planar
⇒ periodic points can be related by a dynamic homography, linear in time
⇒ RANSAC estimation of H and p
103. Segmentation
[I. Laptev, S.J. Belongie, P. Pérez and J. Wills, ICCV 2005]
104. Course overview
• Definitions
• Benchmark datasets
• Early silhouette and tracking-based methods
• Motion-based similarity measures
• Template-based methods
• Local space-time features
• Bag-of-Features action recognition
• Weakly-supervised methods
• Pose estimation and action recognition
• Action recognition in still images
• Human interactions and dynamic scene models
• Conclusions and future directions
106. Action recognition framework
Bag of space-time features + SVM [Schuldt’04, Niebles’06, Zhang’07, …]:
1. extraction of local features (space-time patches)
2. feature description
3. feature quantization (K-means clustering)
4. occurrence histogram of visual words
5. non-linear SVM with χ² kernel (see the sketch below)
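A minimal sketch of the classification stage, assuming scikit-learn and that BoF histograms are already computed (see the earlier vocabulary sketch); `chi2_kernel` is scikit-learn's exponential chi-square kernel, and the data files are hypothetical.

```python
# Sketch only: chi-square-kernel SVM over BoF histograms.
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

H_train = np.load("train_histograms.npy")  # hypothetical (N, K)
y_train = np.load("train_labels.npy")
H_test = np.load("test_histograms.npy")

K_train = chi2_kernel(H_train)             # train-vs-train kernel
svm = SVC(kernel="precomputed").fit(K_train, y_train)

K_test = chi2_kernel(H_test, H_train)      # test-vs-train kernel rows
pred = svm.predict(K_test)
```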
107. The spatio-temporal features/descriptors
• Detectors:
– Harris3D [I. Laptev, IJCV 2005]
– Dollar [P. Dollar et al., VS-PETS 2005]
– Hessian [G. Willems et al., ECCV 2008]
– Regular sampling [H. Wang et al., BMVC 2009]
• Descriptors:
– HoG/HoF [I. Laptev et al., CVPR 2008]
– Dollar [P. Dollar et al., VS-PETS 2005]
– HoG3D [A. Klaeser et al., BMVC 2008]
– Extended SURF [G. Willems et al., ECCV 2008]
112. Improved BoF action classification
Goals:
• Inject additional supervision into BoF
• Improve local descriptors with region-level information
Pipeline: ambiguous local features → features with labels disambiguated by regions (R1, R2) → visual vocabulary → histogram representation → SVM classification
113. Video Segmentation
• Spatio-temporal grids (R1, R2, R3)
• Static action detectors [Felzenszwalb’08], trained from ~100 web images per class
• Object and person detectors (upper body) [Felzenszwalb’08]
115. Multi-channel chi-square kernel
Use SVMs with a multi-channel chi-square kernel for classification:
K(Hi, Hj) = exp( − Σ_c (1/A_c) · D_c(Hi, Hj) )
• Channel c corresponds to a particular region segmentation
• D_c(Hi, Hj) is the chi-square distance between histograms
• A_c is the mean value of the distances between all training samples
• The best set of channels C for a given training set is found with a greedy approach (a sketch of the kernel follows below)
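A minimal sketch of the kernel above in NumPy; the channel lists and normalization constants A_c are hypothetical inputs computed elsewhere.

```python
# Sketch only: multi-channel chi-square kernel
# K = exp(-sum_c D_c / A_c) over per-channel histogram distances.
import numpy as np

def chi2_dist(H1, H2, eps=1e-10):
    # pairwise chi-square distances between rows of H1 and rows of H2
    d = H1[:, None, :] - H2[None, :, :]
    s = H1[:, None, :] + H2[None, :, :] + eps
    return 0.5 * np.sum(d * d / s, axis=2)

def multichannel_kernel(channels_i, channels_j, A):
    # channels_*: list of histogram matrices, one entry per channel c
    # A[c]: mean training distance for channel c (normalization)
    total = sum(chi2_dist(Hi, Hj) / A[c]
                for c, (Hi, Hj) in enumerate(zip(channels_i, channels_j)))
    return np.exp(-total)
```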
116. Hollywood-2 action classification

Attributed feature                                   | Performance (mean AP)
BoF                                                  | 48.55
Spatiotemporal grid (24 channels)                    | 51.83
Motion segmentation                                  | 50.39
Upper body                                           | 49.26
Object detectors                                     | 49.89
Action detectors                                     | 52.77
Spatiotemporal grid + Motion segmentation            | 53.20
Spatiotemporal grid + Upper body                     | 53.18
Spatiotemporal grid + Object detectors               | 52.97
Spatiotemporal grid + Action detectors               | 55.72
Spatiotemporal grid + Motion segmentation + Upper body + Object detectors + Action detectors | 55.33

[Ullah, Parizi, Laptev, BMVC 2009]
118. Course overview
• Definitions
• Benchmark datasets
• Early silhouette and tracking-based methods
• Motion-based similarity measures
• Template-based methods
• Local space-time features
• Bag-of-Features action recognition
• Weakly-supervised methods
• Pose estimation and action recognition
• Action recognition in still images
• Human interactions and dynamic scene models
• Conclusions and future directions
120. Why is action recognition hard?
Lots of diversity in the data (view-points, appearance, motion, lighting…)
Drinking Smoking
Lots of classes and concepts
121. The positive effect of data
• The performance of current visual recognition methods heavily
depends on the amount of available training data
Scene recognition: SUN database [J. Xiao et al., CVPR 2010]; object recognition: Caltech 101 / 256 [Griffin et al., Caltech tech. rep.]
Action recognition [Laptev et al., CVPR 2008; Marszałek et al., CVPR 2009]:
• Hollywood (~29 samples / class): mAP 38.4%
• Hollywood-2 (~75 samples / class): mAP 50.3%
122. The positive effect of data
• The performance of current visual recognition methods heavily
depends on the amount of available training data
⇒ Need to collect substantial amounts of data for training
⇒ Current algorithms may not scale well / be optimal for large datasets
• See also the article “The Unreasonable Effectiveness of Data” by A. Halevy, P. Norvig, and F. Pereira, Google, IEEE Intelligent Systems
123. Why is data collection difficult?
Frequency of object classes in (a subset of) the LabelMe dataset [Russel et al. IJCV 2008]:
Car: 4441; Person: 2524; Umbrella: 118; Dog: 37; Tower: 11; Pigeon: 6; Garage: 5
124. Why is data collection difficult?
• A few classes are very frequent, but most classes are very rare
• Similar phenomena have been observed for non-visual data, e.g. word counts in natural language. Such phenomena follow Zipf’s empirical law: class rank = F(1 / class frequency)
• Manual supervision is very costly, especially for video
Example: common actions such as Kissing, Hand Shaking and Answering Phone appear only 3-4 times in typical movies
⇒ ~42 hours of video needs to be inspected to collect 100 samples for each new action class
125. Learning Actions from Movies
• Realistic variation of human actions
• Many classes and many examples per class
Problems:
• Typically only a few class-samples per movie
• Manual annotation is very time consuming
126. Automatic video annotation
with scripts
• Scripts available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com, …
• Subtitles (with time info) are available for most movies
• Can transfer time to scripts by text alignment (see the sketch below)
Example: subtitle 1172 (01:20:17,240 --> 01:20:20,437), “Why weren't you honest with me? Why'd you keep your marriage a secret?”, matches the script lines “RICK: Why weren't you honest with me? Why did you keep your marriage a secret?”, transferring the interval 01:20:17 – 01:20:23 to the script passage “Rick sits down with Ilsa. ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.”
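A toy sketch of the word-level matching behind this transfer; Python's difflib stands in for the dynamic-programming alignment used in practice, and the word lists are illustrative.

```python
# Sketch only: align timed subtitle words with untimed script words,
# then propagate subtitle timestamps through the matched spans.
from difflib import SequenceMatcher

subtitle_words = "why weren't you honest with me".split()   # timed words
script_words = "RICK why weren't you honest with me".lower().split()

sm = SequenceMatcher(None, subtitle_words, script_words)
for block in sm.get_matching_blocks():
    # each matched block maps a timed subtitle span onto the script
    print(subtitle_words[block.a:block.a + block.size])
```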
127. Script alignment
RICK: “All right, I will. Here's looking at you, kid.” [01:21:50 – 01:21:59]
ILSA: “I wish I didn't love you so much.” [01:22:00 – 01:22:03]
She snuggles closer to Rick.
CUT TO:
EXT. RICK'S CAFE - NIGHT
Laszlo and Carl make their way through the darkness toward a side entrance of Rick's. They run inside the entryway. The headlights of a speeding police car sweep toward them. They flatten themselves against a wall to avoid detection. The lights move past them.
CARL: “I think we lost them.” [01:22:15 – 01:22:17]
…
[Laptev, Marszałek, Schmid, Rozenfeld 2008]
128. Script alignment: Evaluation
• Annotate action samples in text
• Do automatic script-to-video alignment
• Check the correspondence of actions in scripts and movies
(a: quality of subtitle-script matching)
Example of a “visual false positive”: “A black car pulls up, two army officers get out.”
129. Text-based action retrieval
• Large variation of action expressions in text:
GetOutCar action: “… Will gets out of the Chevrolet. …”; “… Erin exits her new truck …”
Potential false positives: “… About to sit down, he freezes …”
• ⇒ Supervised text classification approach
130. Hollywood-2 actions dataset
Training and test samples are obtained from 33 and 36 distinct movies, respectively.
The Hollywood-2 dataset is on-line: http://www.irisa.fr/vista/actions/hollywood2
• Learn a vision-based classifier from the automatic training set
• Compare performance to the manual training set
131. Bag-of-Features Recognition
1. extraction of local features (space-time patches)
2. feature description
3. feature quantization (K-means clustering, k=4000)
4. occurrence histogram of visual words
5. non-linear SVM with χ² kernel
132. Spatio-temporal bag-of-features
Use global spatio-temporal grids:
• In the spatial domain: 1x1 (standard BoF); 2x2 and o2x2 (50% overlap); h3x1 (horizontal) and v1x3 (vertical); 3x3
• In the temporal domain: t1 (standard BoF), t2, t3
133. KTH actions dataset
Sample frames from KTH action dataset for six classes (columns) and
four scenarios (rows)
134. Robustness to noise in training
Proportion p of wrong labels in training:
• up to p=0.2 the performance decreases insignificantly
• at p=0.4 the performance decreases by around 10%
135. Action recognition in movies
• Real data is hard!
• False positives (FP) and true positives (TP) are often visually similar; note the suggestive FPs: hugging vs. answering the phone
• False negatives (FN) are often particularly difficult: getting out of a car, or handshaking
136. Results on Hollywood-2 dataset
Class Average Precision (AP) and mean AP for
• Clean training set
• Automatic training set (with noisy labels)
• Random performance
137. Action classification
Test episodes from movies “The Graduate”, “It’s a Wonderful Life”,
“Indiana Jones and the Last Crusade” [Laptev et al. CVPR 2008]
138. Actions in Context (CVPR 2009)
• Human actions are frequently correlated with particular scene classes
Reasons: physical properties and particular purposes of scenes
Examples: eating -- kitchen, eating -- cafe; running -- road, running -- street
139. Mining scene captions
ILSA
I wish I didn't love you so much.
01:22:00
01:22:03 She snuggles closer to Rick.
CUT TO:
EXT. RICK'S CAFE - NIGHT
Laszlo and Carl make their way through the darkness toward a
side entrance of Rick's. They run inside the entryway.
The headlights of a speeding police car sweep toward them.
They flatten themselves against a wall to avoid detection.
The lights move past them.
CARL
01:22:15 I think we lost them.
01:22:17 …
140. Mining scene captions
INT. TRENDY RESTAURANT - NIGHT
INT. MARSELLUS WALLACE’S DINING ROOM - MORNING
EXT. STREETS BY DORA’S HOUSE - DAY
INT. MELVIN'S APARTMENT, BATHROOM - NIGHT
EXT. NEW YORK CITY STREET NEAR CAROL'S RESTAURANT - DAY
INT. CRAIG AND LOTTE'S BATHROOM - DAY
• Maximize word frequency: street, living room, bedroom, car, …
• Merge words with similar senses using WordNet: taxi -> car, cafe -> restaurant
• Measure the correlation of words with actions (in scripts) and re-sort words by the entropy of P = p(action | word); see the sketch below
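A minimal sketch of the entropy-based re-sorting, assuming co-occurrence counts have been gathered from aligned scripts; the counters are hypothetical.

```python
# Sketch only: score caption words by the entropy of p(action | word);
# a low entropy means the word predicts actions sharply.
import numpy as np
from collections import Counter

cooc = Counter()        # (word, action) co-occurrence counts in scripts
word_count = Counter()  # word occurrence counts
# ... fill both counters from the aligned scripts ...

def entropy(word, actions):
    p = np.array([cooc[(word, a)] / max(word_count[word], 1)
                  for a in actions])
    p = p[p > 0]
    return -(p * np.log(p)).sum()
```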
143. Automatic gathering of relevant scene classes and visual samples
Source: 69 movies aligned with their scripts
The Hollywood-2 dataset is on-line: http://www.irisa.fr/vista/actions/hollywood2
145. Classification with the help of context
New action score = action classification score + weight × scene classification score, with the weight estimated from text
146. Results: actions and scenes (jointly)
• Actions in the context of scenes
• Scenes in the context of actions
147. Weakly-Supervised Temporal Action Annotation
[Duchenne et al., ICCV 2009]
• Answer the questions: WHAT actions, and WHEN did they happen? (e.g. knock on the door, fight, kiss)
• Train visual action detectors and annotate actions with minimal manual supervision
148. WHAT actions?
• Automatic discovery of action classes in text (movie scripts)
– Text processing: part-of-speech (POS) tagging; named entity recognition (NER); WordNet pruning; visual noun filtering
– Search action patterns (counts below; a counting sketch follows the table): Person+Verb, Person+Verb+Prep., Person+Verb+Prep.+Vis.Noun
3725 /PERSON .* is 989 /PERSON .* looks .* at 41 /PERSON .* sits .* in .* chair
2644 /PERSON .* looks 384 /PERSON .* is .* in 37 /PERSON .* sits .* at .* table
1300 /PERSON .* turns 363 /PERSON .* looks .* up 31 /PERSON .* sits .* on .* bed
916 /PERSON .* takes 234 /PERSON .* is .* on 29 /PERSON .* sits .* at .* desk
840 /PERSON .* sits 215 /PERSON .* picks .* up 26 /PERSON .* picks .* up .* phone
829 /PERSON .* has 196 /PERSON .* is .* at 23 /PERSON .* gets .* out .* car
807 /PERSON .* walks 139 /PERSON .* sits .* in 23 /PERSON .* looks .* out .* window
701 /PERSON .* stands 138 /PERSON .* is .* with 21 /PERSON .* looks .* around .* room
622 /PERSON .* goes 134 /PERSON .* stares .* at 18 /PERSON .* is .* at .* desk
591 /PERSON .* starts 129 /PERSON .* is .* by 17 /PERSON .* hangs .* up .* phone
585 /PERSON .* does 126 /PERSON .* looks .* down 17 /PERSON .* is .* on .* phone
569 /PERSON .* gets 124 /PERSON .* sits .* on 17 /PERSON .* looks .* at .* watch
552 /PERSON .* pulls 122 /PERSON .* is .* of 16 /PERSON .* sits .* on .* couch
503 /PERSON .* comes 114 /PERSON .* gets .* up 15 /PERSON .* opens .* of .* door
493 /PERSON .* sees 109 /PERSON .* sits .* at 15 /PERSON .* walks .* into .* room
462 /PERSON .* are/VBP 107 /PERSON .* sits .* down 14 /PERSON .* goes .* into .* room
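A toy sketch of this pattern counting with regular expressions; the pattern family and the tagged-script file are illustrative stand-ins for the full POS/NER pipeline.

```python
# Sketch only: count Person+Verb+Prep(+Noun) patterns in tagged
# script text, where /PERSON marks a resolved person name.
import re
from collections import Counter

pattern = re.compile(
    r"/PERSON .*? (sits|gets|looks|picks) .*? (in|at|on|up|out) .*? (\w+)")

counts = Counter()
with open("tagged_scripts.txt") as f:    # hypothetical tagged scripts
    for line in f:
        for m in pattern.finditer(line):
            counts[" ".join(m.groups())] += 1

for pat, n in counts.most_common(10):
    print(n, pat)
```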
149. WHEN: Video Data and Annotation
• Want to target realistic video data
• Want to avoid manual video annotation for training
⇒ Use movies + scripts for automatic annotation of training samples; the script-derived intervals carry temporal uncertainty (e.g. 24:25 – 24:51)
[Duchenne, Laptev, Sivic, Bach, Ponce, ICCV 2009]
150. Overview
Input:
• action type, e.g. Person Opens Door
• videos + aligned scripts
Output: automatic collection of training clips → clustering of positive segments → training a classifier → sliding-window-style temporal action localization
[Duchenne, Laptev, Sivic, Bach, Ponce, ICCV 2009]
151. Action clustering
Spectral clustering [Lihi Zelnik-Manor and Michal Irani, CVPR 2001]: descriptor space, clustering results, ground truth
152. Action clustering
Our data: standard clustering methods do not work on this data
153. Action clustering
Our view of the problem (feature space vs. video space):
• the nearest-neighbor solution is wrong
• use negative samples: random video samples; there are lots of them, and they have a very low chance of being positives
[Duchenne, Laptev, Sivic, Bach, Ponce, ICCV 2009]
154. Action clustering: formulation
[Xu et al., NIPS’04], [Bach & Harchaoui, NIPS’07]
Discriminative cost in feature space: loss on the parameterized positive samples + loss on the negative samples
Optimization: SVM solution for the classifier; coordinate descent on the positive-sample parameterization (see the sketch below)
[Duchenne, Laptev, Sivic, Bach, Ponce, ICCV 2009]
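A minimal sketch of the coordinate-descent idea, assuming scikit-learn; the candidate-segment and negative-sample arrays, the initialization, and the iteration count are hypothetical.

```python
# Sketch only: alternate (1) training an SVM on the currently selected
# positive segments vs. random negatives, and (2) re-selecting, inside
# each uncertain script interval, the segment the SVM scores highest.
import numpy as np
from sklearn.svm import LinearSVC

candidates = np.load("candidate_segments.npy")  # (num_clips, S, D)
negatives = np.load("random_negatives.npy")     # (M, D)
pos_idx = np.zeros(len(candidates), dtype=int)  # initial guesses

for _ in range(10):  # coordinate-descent iterations
    X_pos = candidates[np.arange(len(candidates)), pos_idx]
    X = np.vstack([X_pos, negatives])
    y = np.r_[np.ones(len(X_pos)), -np.ones(len(negatives))]
    svm = LinearSVC().fit(X, y)                 # SVM step
    scores = candidates @ svm.coef_.ravel()     # (num_clips, S)
    pos_idx = scores.argmax(axis=1)             # re-assignment step
```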
156. Detection results
Drinking actions in “Coffee and Cigarettes”:
• training a Bag-of-Features classifier
• temporal sliding-window classification
• non-maximum suppression
Detection trained on simulated clusters
Test set: 25 min from “Coffee and Cigarettes” with 38 ground-truth drinking actions
157. Detection results
Drinking actions in “Coffee and Cigarettes”:
• training a Bag-of-Features classifier
• temporal sliding-window classification
• non-maximum suppression
Detection trained on automatic clusters
Test set: 25 min from “Coffee and Cigarettes” with 38 ground-truth drinking actions