FAST OBJECT INSTANCE
SEARCH FROM ONE EXAMPLE
NAYAN SETH
JINGJING MENG, JUNSONG YUAN, YAP-PENG TAN, GANG WANG, "FAST OBJECT INSTANCE SEARCH IN VIDEOS FROM
ONE EXAMPLE"
ARIZONA STATE UNIVERSITY
OBJECTIVE
▸ To locate the object jointly across video frames using
spatiotemporal search.
Source: Object Localization Via Deep Neural Network
WHY?
▸ Surveillance
▸ Combining with AI technologies to help improve crop
productivity [Video]
▸ Autonomous cars [Video]
▸ Robots that can catch moving objects
PREVIOUS WORK
▸ Bounding Box (made efficient using Branch & Bound)
▸ Max Path (uses Sliding Window)
Tran, D., Yuan, J., & Forsyth, D. (2014). Video Event Detection: From Subvolume Localization to
Spatiotemporal Path Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2),
404-416.
FEW MORE METHODS
▸ Video Google
▸ TRECVID
RANDOMISED VISUAL PHRASE
▸ Generates frame-wise confidence maps
▸ The method works on individual frames
▸ Why?
▸ Handles scale variations of the object and provides robustness
▸ No need to rely on image segmentation
RANDOMISED VISUAL PHRASE (CONTD …)
▸ Local invariant features are extracted
▸ The image is randomly partitioned into patches
▸ Each patch is bundled with visual phrases
▸ For each RVP, a similarity score is computed with respect to the
query object independently
▸ The score is used as the voting weight for the corresponding patch
▸ The final confidence score of each pixel is computed from all the
voting weights cast for that pixel
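The voting scheme above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the paper's implementation: `patch_score` is a hypothetical callback standing in for the RVP similarity computation, and each "random partition" here is a simple randomly placed grid.

```python
import random

def rvp_confidence_map(width, height, patch_score, rounds=10, grid=4):
    """Sketch of randomized-visual-phrase voting (hypothetical helper names).

    patch_score(x0, y0, x1, y1) -> similarity of a patch to the query object.
    Each round randomly partitions the image into a grid of patches; each
    patch's similarity score is cast as a vote for every pixel it covers.
    """
    conf = [[0.0] * width for _ in range(height)]
    for _ in range(rounds):
        # One random partition: grid boundaries at random positions
        xb = [0] + sorted(random.sample(range(1, width), grid - 1)) + [width]
        yb = [0] + sorted(random.sample(range(1, height), grid - 1)) + [height]
        for gy in range(grid):
            for gx in range(grid):
                x0, x1 = xb[gx], xb[gx + 1]
                y0, y1 = yb[gy], yb[gy + 1]
                w = patch_score(x0, y0, x1, y1)  # voting weight for this patch
                for y in range(y0, y1):
                    for x in range(x0, x1):
                        conf[y][x] += w
    # Average over rounds -> final per-pixel confidence
    return [[v / rounds for v in row] for row in conf]
```

Averaging votes across many random partitions is what gives the method its robustness: a pixel on the object keeps landing in high-similarity patches regardless of how the partition boundaries fall.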
Yuning Jiang, Jingjing Meng, Junsong Yuan, and Jiebo Luo, “Randomized spatial context for object
search,” IEEE Transactions on Image Processing, to appear, 2015.
LOCALISATION
▸ Object localisation is done using RVP
▸ Drawbacks
▸ First, RVP depends on a heuristic segmentation coefficient
α (alpha) to locate target objects in an image
▸ Second, its performance drops with insufficient rounds of
partition, as the confidence map is not salient enough
▸ Hence RVP is used along with Max Path for better accuracy
ALGORITHM
▸ A video V = {F1, F2, …, Fn}, where Fi is the i-th frame
▸ Assumption: trajectories are non-overlapping
▸ V is split into chunks {V1, V2, …, Vm}
▸ Ti = {Ti1, Ti2, …, Til}, where l is the total number of object trajectories in chunk Vi
▸ Tij = {Bij1, Bij2, …, Bijk}, where k is the total number of frames in trajectory Tij
▸ To find the best trajectory: Ti* = argmax s(Tij), where Tij ∈ Ti [equation 1]
▸ s(Tij) = ∑ s(B), where B ∈ Tij [equation 2]
▸ Once we have the best trajectory Ti* for each video chunk, we can then return
the ranked results of all trajectories.
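Equations 1 and 2 amount to scoring each trajectory by summing its per-box confidences and taking the argmax. A minimal sketch, assuming trajectories are encoded as lists of (box, score) pairs (an illustrative encoding, not the paper's data structure):

```python
def trajectory_score(traj):
    """s(Tij) = sum of the per-box confidence scores s(B)  [equation 2]."""
    return sum(score for _box, score in traj)

def best_trajectory(trajectories):
    """Ti* = argmax over Tij in Ti of s(Tij)  [equation 1]."""
    return max(trajectories, key=trajectory_score)
```

The hard part is not this argmax itself but enumerating candidate trajectories efficiently, which is where Max Path comes in.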
WHERE DOES MAX PATH COME
INTO PLAY…
to solve equation 1
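A simplified view of the dynamic program behind a Max-Path search: the real algorithm tracks 2-D bounding boxes across frames, but the same recurrence can be shown in 1-D, where a path picks one position per frame and may move at most `step` positions between consecutive frames. This is a sketch of the idea, not the paper's implementation.

```python
def max_path(score_maps, step=1):
    """Max-scoring path through per-frame confidence maps (1-D positions).

    score_maps[t][x] is the confidence at position x in frame t.
    Returns (path, score): one position per frame and the total score.
    """
    T, W = len(score_maps), len(score_maps[0])
    best = list(score_maps[0])   # best[x]: max path score ending at x so far
    back = [[-1] * W]            # backpointers for traceback
    for t in range(1, T):
        cur, bk = [], []
        for x in range(W):
            lo, hi = max(0, x - step), min(W, x + step + 1)
            p = max(range(lo, hi), key=lambda i: best[i])  # best predecessor
            cur.append(best[p] + score_maps[t][x])
            bk.append(p)
        best = cur
        back.append(bk)
    # Trace back from the best end position
    x = max(range(W), key=lambda i: best[i])
    path = [x]
    for t in range(T - 1, 0, -1):
        x = back[t][x]
        path.append(x)
    path.reverse()
    return path, max(best)
```

The DP visits each position once per frame, which is what makes the joint spatiotemporal search tractable compared with scoring every possible path.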
ARIZONA STATE UNIVERSITY
THERE’S A CATCH
COARSE TO FINE SEARCH
▸ Build two file indexes
▸ Coarsely sampled frames (used to filter out low-confidence
video chunks)
▸ Full dataset frames (computationally expensive)
▸ A ranking is generated from the confidence scores of the
coarsely sampled frames
▸ For the top-ranking chunks, per-frame confidence maps are generated
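The two-stage flow above can be sketched as follows; `coarse_score` and `fine_search` are hypothetical placeholders for the cheap ranking over coarsely sampled frames and the expensive full-frame search, respectively.

```python
def coarse_to_fine_search(chunks, coarse_score, fine_search, top_k=5):
    """Sketch of the two-stage search (hypothetical helper names).

    coarse_score(chunk) ranks chunks cheaply on coarsely sampled frames;
    only the top-k chunks pay for the expensive fine_search on all frames.
    """
    ranked = sorted(chunks, key=coarse_score, reverse=True)
    results = []
    for chunk in ranked[:top_k]:   # low-confidence chunks are filtered out
        results.extend(fine_search(chunk))
    return results
```

The design point is that the expensive per-frame voting and Max-Path search run only on the few chunks that survive the coarse filter, which is what keeps query time low on hours of video.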
OTHER METHODS DEPLOYED
▸ Hessian Affine Detectors [4]
▸ FLANN [5]
RESULTS
▸ K = 200 rounds of partition for the coarse round, with α = 3
▸ K = 50 rounds of partition for the fine round, with α = 3
▸ Max Path is run at a 1:1 aspect ratio with a 3×3 local
neighborhood, a spatial step size of 10, and a temporal step of 1 frame
▸ β = −2 (the negative coefficient)
EFFICIENCY
▸ Dual quad-core processors at 2.3 GHz, 32 GB RAM (no GPU)
▸ Coarse ranking & filtering of 10 objects takes 0.833 seconds
on average
▸ 28.738 seconds to obtain the top 100 trajectories for each
query (3.833 seconds for the frame-wise voting maps and 24.866
seconds for the Max-Path search)
▸ 29.57 seconds in total, excluding I/O, on a 5.5-hour video dataset
OBJECTS QUERIED
Jingjing Meng, Junsong Yuan, Yap-Peng Tan, Gang Wang, "Fast Object Instance Search In
Videos From One Example"
ACCURACY
REFERENCES
1. Tran, D., Yuan, J., & Forsyth, D. (2014). Video Event Detection: From Subvolume
Localization to Spatiotemporal Path Search. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 36(2), 404-416.
2. Yuning Jiang, Jingjing Meng, Junsong Yuan, and Jiebo Luo, “Randomized spatial context
for object search,” IEEE Transactions on Image Processing, to appear, 2015.
3. Jingjing Meng, Junsong Yuan, Yap-Peng Tan, Gang Wang, "Fast Object Instance Search In
Videos From One Example"
4. Michal Perd’och, Ondrej Chum, and Jiri Matas, “Efficient representation of local
geometry for large scale object retrieval,” in Proc. IEEE Conf. on Computer Vision and
Pattern Recognition. IEEE, 2009, pp. 9–16.
5. Marius Muja and David G. Lowe, “Fast approximate nearest neighbors with automatic
algorithm configuration,” in International Conference on Computer Vision Theory and
Applications (VISAPP’09), 2009, pp. 331–340, INSTICC Press.
