Automated Video Analysis and Reporting for Construction Sites
September
Objectives
• Primary Goal
• Develop an automated system for analyzing construction site videos
and generating comprehensive reports.
• Specific Objectives
• Customize Detectron2:
• Train with a dataset labeled specifically for construction site objects (a fine-tuning sketch follows this list).
• Customize MMAction2:
• Train with a dataset focused on construction-related actions.
• Integrate Outputs:
• Combine detections and actions to understand context.
• Generate Reports:
• Use generative models to create human-readable reports from
analyzed data.
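A hedged sketch of the Detectron2 customization step is shown below: fine-tuning a COCO-pretrained detector on a hypothetical COCO-format construction dataset. The dataset name, annotation paths, class count, and solver settings are placeholders, and the Faster R-CNN base config is one reasonable choice rather than a decision from these slides.

```python
import os

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register a hypothetical COCO-format dataset of construction site images.
register_coco_instances(
    "construction_train", {}, "annotations/train.json", "images/train")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("construction_train",)
cfg.DATASETS.TEST = ()
# Start from COCO-pretrained weights and fine-tune on the custom labels.
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 20  # placeholder: machinery, materials, PPE, workers
cfg.SOLVER.MAX_ITER = 5000            # placeholder schedule

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```

Inference on site footage would then use Detectron2's DefaultPredictor with the trained weights.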
Datasets
•Possible objects:
•Heavy Machinery:
•Excavators
•Bulldozers
•Cranes
•Loaders
•Dump Trucks
•Graders
•Backhoes
•Concrete Mixers
Datasets
•Possible objects:
•Building Materials:
•Bricks/Concrete blocks
•Steel beams
•Cement bags
•Wooden planks
•Pipes
•Construction Tools:
•Power drills
•Hammers
•Saws
•Jackhammers
•Welding machines
Datasets
•Possible objects:
•Safety Equipment:
•Safety cones
•Barricades
•Scaffolding
•Helmets (PPE)
•High-visibility vests
•Workers:
•Construction workers performing various tasks
•Supervisors/engineers on-site
Model
Preliminary Results
Motion Detection
Human Action Recognition
UCF-101
• http://www.thumos.info/
Applications of Video Event Detection
• Video search
• Surveillance video analysis
• Driver activity detection
• Drone action recognition
Complex Video Events
• Basic event or action detection
• Complex or high-level event detection (e.g., a wedding ceremony)
Detect Complex (High-level) Events
• Jiang et al., High-Level Event Recognition in Unconstrained Videos, 2012
TRECVID Contest
• NIST TRECVID Multimedia Event Detection (MED) contest
• Detecting complex events in around 100,000 videos
Challenges of Video Event Detection
• Events consist of interactions among humans, objects, and scenes
• Actions contain both spatial and temporal data
• Temporal data are voluminous and noisy
• Action classes are hard to define
Optical Flow for Temporal Information
https://devblogs.nvidia.com/an-introduction-to-the-nvidia-optical-flow-sdk/
Estimate the Direction and Distance of Motion
• Optical flow can be used for motion estimation
https://nanonets.com/blog/optical-flow/
What is Optical Flow?
• The motion of objects between consecutive frames of a sequence
• Caused by relative movement between the object and the camera
https://nanonets.com/blog/optical-flow/
Optical Flow
• Brightness constancy: I(x, y, t) = I(x + ∆x, y + ∆y, t + ∆t)
• First-order Taylor series: I(x + ∆x, y + ∆y, t + ∆t) ≈ I(x, y, t) + (∂I/∂x)∆x + (∂I/∂y)∆y + (∂I/∂t)∆t
• Truncating higher-order terms and dividing by ∆t gives the optical flow constraint I_x u + I_y v + I_t = 0, where u = ∆x/∆t and v = ∆y/∆t
Sparse vs. Dense Optical Flow
https://nanonets.com/blog/optical-flow/
Selecting Feature Points
• Shi-Tomasi Corner Detector
Lucas-Kanade
• Take a small 3x3 window around each feature detected by Shi-Tomasi and assume all nine pixels share the same motion
Solve the 3x3 Optical Flow Equation
• The nine brightness-constancy constraints form an over-determined linear system Av = b
Least Square Fitting
• The least-squares solution is v = (AᵀA)⁻¹Aᵀb
Beyond 3x3 Window
• OpenCV adopts pyramid for LK optical flow
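A minimal OpenCV sketch of this pipeline, Shi-Tomasi corners fed into pyramidal Lucas-Kanade, is shown below; the video path is a placeholder and the parameter values follow common OpenCV defaults rather than anything specified in these slides.

```python
import cv2

cap = cv2.VideoCapture("site_video.mp4")  # placeholder path
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Shi-Tomasi corner detection selects good features to track.
p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                             qualityLevel=0.01, minDistance=10)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade: solves the local flow equations at each level.
    p1, status, err = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, p0, None, winSize=(15, 15), maxLevel=3)
    good_new = p1[status.flatten() == 1]
    # ... draw tracks or accumulate motion statistics here ...
    prev_gray, p0 = gray, good_new.reshape(-1, 1, 2)
```

The pyramid (maxLevel) is what lets the small-window assumption hold for larger motions, which is why OpenCV's implementation is pyramidal.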
Farneback Optical Flow (Dense)
• Approximates each pixel neighborhood in both frames by a quadratic polynomial
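Dense Farneback flow is a one-call operation in OpenCV; the sketch below uses the parameter values from the standard OpenCV examples, with placeholder frame files.

```python
import cv2

prev = cv2.imread("frame_0.png", cv2.IMREAD_GRAYSCALE)  # placeholder frames
curr = cv2.imread("frame_1.png", cv2.IMREAD_GRAYSCALE)

# Dense flow: one (dx, dy) vector per pixel, shape (H, W, 2).
flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

# Convert to magnitude/direction, e.g., for visualization or motion statistics.
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
```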
FlowNet: Learning Optical Flow with
Convolutional Networks
• The correlation layer compares each patch from ConvNet branch 1 with each patch from ConvNet branch 2.
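An illustrative PyTorch re-implementation of that idea is sketched below: channel-wise dot products between one feature map and displaced windows of the other. This is a readable stand-in, not FlowNet's optimized CUDA correlation operator.

```python
import torch

def correlation(f1, f2, max_disp=4):
    """Compare each location of f1 with displaced locations of f2.

    f1, f2: (B, C, H, W) feature maps from the two ConvNet branches.
    Returns a (B, (2*max_disp+1)**2, H, W) correlation volume.
    """
    B, C, H, W = f1.shape
    f2p = torch.nn.functional.pad(f2, (max_disp,) * 4)
    maps = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = f2p[:, :, dy:dy + H, dx:dx + W]
            # Normalized channel-wise dot product = patch similarity.
            maps.append((f1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(maps, dim=1)
```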
Synthetic Dataset
• FlowNet is trained on synthetic data (the FlyingChairs dataset), since dense ground-truth flow is hard to obtain for real video
Dense Trajectories
• H. Wang, et al. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 2013
• Tracking densely sampled points has been shown to outperform KLT (corner) features and SIFT points
Extracting Dense Trajectories
1. Sample feature points on a dense grid with a 5-pixel step
2. Remove points in homogeneous areas (low corner response)
3. Track points through optical flow smoothed by a 3x3 median filter, over 15 frames
4. Extract HOG, HOF, and MBH descriptors along each trajectory
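Steps 1-3 can be sketched with OpenCV as follows; Farneback flow stands in for the paper's flow estimator, and the grid step and homogeneity threshold are illustrative values.

```python
import cv2
import numpy as np

def sample_grid_points(gray, step=5, quality=0.001):
    # Step 1: dense grid with a 5-pixel step. Step 2: drop points in
    # homogeneous areas via the smaller eigenvalue of the structure tensor.
    eig = cv2.cornerMinEigenVal(gray, blockSize=3)
    thresh = quality * eig.max()
    ys, xs = np.mgrid[step // 2:gray.shape[0]:step,
                      step // 2:gray.shape[1]:step]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1)
    keep = eig[pts[:, 1], pts[:, 0]] > thresh
    return pts[keep].astype(np.float32)

def step_trajectories(prev_gray, gray, pts):
    # Step 3: advance each point by the median-filtered dense flow.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx = cv2.medianBlur(np.ascontiguousarray(flow[..., 0]), 3)
    fy = cv2.medianBlur(np.ascontiguousarray(flow[..., 1]), 3)
    xi = pts[:, 0].round().astype(int).clip(0, gray.shape[1] - 1)
    yi = pts[:, 1].round().astype(int).clip(0, gray.shape[0] - 1)
    return pts + np.stack([fx[yi, xi], fy[yi, xi]], axis=1)
```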
Motion Boundary Histogram (MBH)
• First proposed by Dalal & Triggs
 “Human Detection Using Oriented Histograms of Flow and Appearance”, ECCV, 2006
• Showed the best results on HMDB51 & TRECVID 2011*
• Based on oriented histograms of differential optical flow
• Can effectively cancel camera motion
* A. Tamrakar et al., Evaluation of Low-Level Features and their Combinations for Complex Event Detection in Open Source Videos, CVPR, 2012
Improved Trajectories
• Cancels camera motion by matching SURF points and dense optical flow between frames
• A human detector is also used, so human motion does not bias the camera-motion estimate
Recent DL Models for Action Recognition
• Yi Zhu et al., “A Comprehensive Study of Deep Video Action Recognition,” Amazon
Web Services, 2020
Action Recognition Datasets
YouTube-8M
• https://research.google.com/youtube8m/
DeepMind Kinetics Dataset
• https://deepmind.com/research/open-source/kinetics
Learning Optical Flow with Deep Learning?
• Karpathy et al., “Large-scale Video Classification with Convolutional
Neural Networks”, 2014
http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review
Two-stream Convolutional Neural Networks
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. NIPS
State of the Art
• LRCN [15], C3D [16], Conv3D & Attention [17], TwoStreamFusion [18], TSN [19], ActionVlad [20], HiddenTwoStream [1], I3D [21], and T3D [22]
http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review
Long-term Recurrent Convolutional Networks
• Donahue et al., “Long-term Recurrent Convolutional Networks for Visual Recognition and Description,” 2014 (arXiv link)
• Key Contributions:
 Builds on previous work by using RNNs rather than stream-based designs
 Extends the encoder-decoder architecture to video representations
 Proposes an end-to-end trainable architecture for action recognition
Long-term Recurrent Convolutional Networks
3D Convolutional Networks (C3D)
• Du Tran et al., “Learning Spatiotemporal Features with 3D Convolutional Networks,” 2014 (arXiv link)
• Key Contributions
 Repurposing 3D convolutional networks as feature extractors
 Extensive search for the best 3D convolutional kernel and architecture
 Using deconvolutional layers to interpret model decisions
3D Convolutional Networks (C3D)
• Extracts features on 2-second clips
• C3D tends to focus on spatial appearance in the first few frames and track motion in the subsequent frames
Factorized Spatio-Temporal Convolutional Networks
• Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks
Conv3D + Attention
• Yao et al., Describing Videos by Exploiting Temporal Structure, 2015
Temporal Segment Networks (TSN)
• Wang et al., “Temporal Segment Networks: Towards Good Practices
for Deep Action Recognition”, 2016
• Samples clips sparsely across the video to better model long-range temporal structure, as sketched below
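The sparse-sampling idea is simple enough to sketch directly: split the video into equal segments and draw one snippet index per segment. This is a minimal illustration, not MMAction2's actual sampler.

```python
import numpy as np

def sparse_sample_indices(num_frames, num_segments=3, train=True):
    """TSN-style sampling: one frame index per equal-length segment."""
    edges = np.linspace(0, num_frames, num_segments + 1).astype(int)
    if train:
        # Random snippet within each segment (data augmentation at train time).
        return [np.random.randint(lo, max(lo + 1, hi))
                for lo, hi in zip(edges[:-1], edges[1:])]
    # Deterministic segment centers at test time.
    return [(lo + hi) // 2 for lo, hi in zip(edges[:-1], edges[1:])]

print(sparse_sample_indices(300, num_segments=3, train=False))  # [50, 150, 250]
```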
Hidden Two Stream
• Zhu et al., “Hidden Two-Stream Convolutional Networks for Action
Recognition,” 2017
• Novel architecture for generating optical flow input on-the-fly using a
separate network
5 Important Action CNN Models
Efficient Two-stream Action Recognition on FPGA
• CVPR 2021 ECV Workshop
• Contributions:
 10x fewer operations than other models with little accuracy drop (<2%)
 Uses only 2D CNN operations
 Processes the spatial and temporal streams in parallel
TF-Hub Action Recognition Model
• https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub
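The linked tutorial loads a Kinetics-400 I3D model from TF-Hub; a condensed version of that flow is sketched below, with a random tensor standing in for a real preprocessed clip.

```python
import tensorflow as tf
import tensorflow_hub as hub

# I3D model pre-trained on Kinetics-400, as used in the linked tutorial.
i3d = hub.load("https://tfhub.dev/deepmind/i3d-kinetics-400/1").signatures["default"]

# Input: (batch, frames, 224, 224, 3), RGB scaled to [0, 1].
video = tf.random.uniform((1, 32, 224, 224, 3))  # stand-in for a real clip
logits = i3d(video)["default"]
probabilities = tf.nn.softmax(logits)  # scores over the 400 Kinetics classes
print(tf.argmax(probabilities, axis=-1))
```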
References
• https://nanonets.com/blog/optical-flow/
• https://neurohive.io/en/datasets/new-datasets-for-action-recognition/
• http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review
• Yi Zhu et al., “A Comprehensive Study of Deep Video Action Recognition,” Amazon Web Services, 2020
MMAction2 Overview
• What is MMAction2?
• An open-source toolbox for action recognition and temporal action detection.
• Developed by the OpenMMLab team, based on PyTorch.
• Key Features
• Algorithm Support:
• Includes a wide range of models like TSN, TSM, SlowFast.
• Flexible Framework:
• Modular design for easy customization.
• Rich Dataset Support:
• Compatible with popular video datasets and formats.
• Why MMAction2?
• Suitable for recognizing complex, domain-specific actions.
• Offers tools for efficient data loading and preprocessing.
Customizing MMAction2
•Dataset Preparation
•Video Collection:
•Record or collect videos depicting various construction activities.
•Ensure a balanced representation of all relevant actions.
•Annotation:
•Label action segments using tools like VGG Image Annotator (VIA).
•Define action categories such as welding, drilling, lifting, inspecting.
•Training Process
•Model Selection:
•Choose models that suit the action types and computational resources (e.g.,
SlowFast for detailed temporal dynamics).
•Hyperparameter Tuning:
•Adjust learning rate schedules, optimizer settings, and augmentation
strategies.
•Fine-tuning:
•Start with models pre-trained on large datasets (e.g., Kinetics) and fine-tune on the custom dataset (see the config sketch after this list).
•Evaluation
•Use metrics such as top-1 and top-5 accuracy, plus confusion matrices, to evaluate performance.
•Analyze misclassifications to refine the model.
•Expected Outcomes
•High-precision recognition of construction-related actions.
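As a hedged sketch, the fine-tuning step could look like the following MMAction2-style config (MMEngine-era field names; the base config path, checkpoint URL, data paths, and class count are placeholders to adapt).

```python
# Hedged sketch of an MMAction2 fine-tuning config (MMEngine-style fields).
# The base config, checkpoint URL, paths, and class count are placeholders.
_base_ = ['configs/recognition/slowfast/slowfast_r50_8xb8-4x16x1-256e_kinetics400-rgb.py']

# Resize the classification head to the construction action categories.
model = dict(cls_head=dict(num_classes=10))  # welding, drilling, lifting, ...

# Point the training loader at the custom annotations (placeholder paths).
train_dataloader = dict(dataset=dict(
    ann_file='data/construction/train_list.txt',
    data_prefix=dict(video='data/construction/videos')))

# Initialize from a Kinetics-pretrained checkpoint; fine-tune with a lower LR.
load_from = 'https://download.openmmlab.com/mmaction/...'  # placeholder URL
optim_wrapper = dict(optimizer=dict(lr=1e-3))
```

A run would then be launched with MMAction2's standard tools/train.py entry point.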
Integrating Outputs
•Combining Object and Action Data
•Temporal Alignment:
•Synchronize detections and actions based on video timestamps.
•Spatial Correlation:
•Use bounding box coordinates to link objects with actions.
•Data Fusion Techniques
•Association Algorithms:
•Implement methods like the Hungarian algorithm for object-action pairing (see the sketch after this list).
•Contextual Analysis:
•Use scene context to enhance association accuracy.
•Purpose of Integration
•Gain comprehensive insights by understanding who is doing what with which
objects.
•Improve the quality of information fed into the report generation model.
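A minimal sketch of the Hungarian-style pairing with SciPy is shown below, using IoU between detector boxes and (hypothetical) actor boxes from the action stream as the matching cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    # Boxes as [x1, y1, x2, y2].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def pair_objects_actions(object_boxes, action_boxes, min_iou=0.3):
    """Match detections to action tubes in one frame via Hungarian assignment."""
    cost = np.array([[1.0 - iou(o, a) for a in action_boxes]
                     for o in object_boxes])
    rows, cols = linear_sum_assignment(cost)
    # Keep only pairs that overlap enough to be a plausible association.
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_iou]
```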
Generative Pre-trained Models
•Overview
•Generative models like GPT-3 generate coherent text based on input data.
•Capable of understanding context and producing human-like language.
•Role in the System
•Convert structured data (from Detectron2 and MMAction2) into narrative reports.
•Model Selection
•Evaluate models based on:
•Capability: Ability to handle domain-specific language.
•Accessibility: Availability for fine-tuning and deployment.
•Consider open-source options if proprietary models are not feasible.
•Customization
•Fine-tune the model with construction industry reports and terminology.
•Include safety guidelines and compliance language to enrich output.
Report Generation Process
• Input Data
• Structured outputs containing:
• Detected objects and their locations.
• Recognized actions and their durations.
• Generation Steps
• Data Formatting:
• Organize inputs into templates or prompts for the generative model (a formatting sketch follows this list).
• Model Inference:
• Feed the formatted data into the model to produce text.
• Post-processing:
• Correct any grammatical errors or inconsistencies.
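A small sketch of the data-formatting step is shown below; the event schema and prompt wording are hypothetical, chosen only to illustrate turning fused detections and actions into a prompt.

```python
# Illustrative prompt construction; the event fields are a hypothetical schema.
events = [
    {"time": "09:00-10:00", "actors": 3, "action": "assembling scaffolding",
     "ppe_missing": ["harness"], "location": "east wing"},
]

def format_prompt(events):
    lines = ["Write a concise construction site report from these observations:"]
    for e in events:
        lines.append(
            f"- {e['time']}: {e['actors']} worker(s) {e['action']} "
            f"at the {e['location']}; missing PPE: "
            f"{', '.join(e['ppe_missing']) or 'none'}.")
    lines.append("Flag any safety violations and required actions.")
    return "\n".join(lines)

prompt = format_prompt(events)  # then fed to the generative model for inference
```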
Report Generation Process
• Output
• Comprehensive Reports:
• Summaries of daily activities, safety compliance, and incidents.
• Customization:
• Adjust report style to suit different stakeholders (e.g., managers, safety officers).
• Example
• "On September 18th, between 9:00 AM and 10:00 AM, three workers were observed assembling
scaffolding without harnesses near the east wing. Immediate attention is required to enforce
safety protocols."
Expected Outcomes
•Enhanced Monitoring
•Real-time detection and reporting enable proactive management.
•Automated Reporting
•Reduces administrative workload and speeds up information dissemination.
•Safety Improvements
•Early warning system for potential hazards.
•Ensures compliance with safety regulations.
•Operational Efficiency
•Data-driven insights help optimize resource allocation.
•Identifies bottlenecks and areas for improvement.
Potential Applications
•Safety Compliance
•Automatic detection of PPE usage and safety violations.
•Progress Tracking
•Monitoring task completion rates and project milestones.
•Resource Management
•Analysis of equipment usage patterns for maintenance scheduling.
•Incident Investigation
•Detailed reconstruction of events leading to accidents.
•Training and Development
•Use of annotated videos for employee training programs.
Challenges and Considerations
•Data Quality
• Noise and Variability:
• Dealing with low-light conditions, occlusions, and motion blur.
• Dataset Bias:
• Ensuring the dataset represents all relevant scenarios.
•Computational Resources
• High-performance hardware required for training and real-time
inference.
• Consideration of cloud computing services for scalability.
•Privacy Concerns
• Compliance with privacy laws (e.g., GDPR).
• Implementing data anonymization techniques.
•Scalability
• Adapting the system to different site sizes and multiple locations.
•Ethical Considerations
• Transparency in surveillance practices.
• Addressing potential worker concerns about monitoring.
