[A-GIST Presentation] Crowdsourced 3D Mapping: A combined Multi-View Geometry and Self-Supervised Learning Approach
1. 2020 Presentation
Crowdsourced 3D Mapping: A combined Multi-View
Geometry and Self-Supervised Learning Approach
Jehong Lee
2020.08.25
Hemang Chawla et al.
NavInfo Europe, Netherlands
IROS 2020, Las Vegas, USA
2. 2 / 18
Problem statements (HD map)
HD map mapper
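Estimated price: 200 to 300 million KRW
Building an HD map
• Collect data over many repeated drives
• Remove dynamic objects (e.g. people, vehicles)
• Correct the shapes of structures
• Done
Data processing
Data collection
Map information that needs real-time updates
• Road accident information, natural disaster information, construction information
• Various event information
Even setting map creation aside, effort is needed to reduce
the human, material, and time costs
of updating HD maps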
[Figure: HD map example with labeled elements: traffic light, wall, lane drive path, pole, traffic sign, lane arrow, crosswalk, lane width, stop line]
Components of an HD map
• Traffic Light
• Traffic Sign
• Wall
• Lane Features
• Cross Walk
• Stop Line
• Pole
• ... And so on
HD(High Definition) Map
3. 3 / 18
Crowdsourced 3D map
Model of Central Rome
[Schönberger et al., CVPR '16] – COLMAP
Crowdsourced Image data
[Agarwal et al., ICCV '09]
OR
Flickr
YouTube
...
Structure-from-motion algorithm
Inverse projection from depth map with odometry
HD map update
4. 4 / 18
Approach
3D positioning of traffic signs using an open-source dataset for autonomous vehicles
• Self-calibration of the camera
• Multi-view geometry based vs. deep learning based vs. hybrid
for depth map and ego-motion estimation and mapping
Camera self-calibration
Geometric method
vs Deep learning method
vs Mixed method
Est. of Traffic sign
Ground Truth (GT)
of Traffic sign
Vehicle’s Path
Aerial Image
5. 5 / 18
KITTI
image sequences
KITTI GPS data
Focal lengths
Principal points
Pinhole camera model
Geometric: ORB-SLAM [1]
Learning: VITW [3], Monodepth2 [4]
Geometric: COLMAP [2]
Learning: VITW [3], Monodepth2 [4]
Framework
Input
Output of the main component
Traffic sign positioning data coming
from different vehicles
[1] Mur-Artal et al., "ORB-SLAM: A Versatile and Accurate Monocular SLAM System," IEEE Transactions on Robotics, 2015
[2] Schönberger et al., "Structure-from-Motion Revisited," CVPR, 2016
[3] Gordon et al., "Depth from Videos in the Wild: Unsupervised Monocular Depth Learning from Unknown Cameras," ICCV, 2019
[4] Godard et al., "Digging into Self-Supervised Monocular Depth Estimation," ICCV, 2019
Estimated trajectory (Alg.)
Scaled trajectory
Measured trajectory (GPS)
6. 6 / 18
Traffic Sign Detection
• In each image sequence, detect the bounding boxes of traffic signs, then classify them
• Bounding boxes at the image edge are excluded
because they may be occluded (inter-frame sign association)
• The paper does not say which detection algorithm was used; manual annotation?
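The edge-exclusion rule above can be sketched as a simple filter. This is a minimal illustration, not the paper's implementation: the `Box` container, the `drop_edge_boxes` helper, the 5-pixel margin, and the 1242 x 375 image size are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical container)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    label: str


def drop_edge_boxes(boxes, width, height, margin=5.0):
    """Keep only boxes that lie at least `margin` pixels inside the image.

    A box touching the border may show only part of a sign, so it is
    discarded before signs are associated across frames.
    """
    kept = []
    for b in boxes:
        inside = (
            b.x_min >= margin
            and b.y_min >= margin
            and b.x_max <= width - margin
            and b.y_max <= height - margin
        )
        if inside:
            kept.append(b)
    return kept


# Two made-up detections on an image of assumed size 1242 x 375.
boxes = [
    Box(400, 100, 440, 140, "speed_limit"),  # fully inside the frame
    Box(1225, 90, 1242, 150, "yield"),       # touches the right border
]
kept = drop_edge_boxes(boxes, width=1242, height=375)
kept_labels = [b.label for b in kept]
print(kept_labels)
```

The margin is a tuning knob: a larger value drops more partially visible signs at the cost of fewer observations per sign.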
Occlusion
Detected
Traffic Sign
7. 7 / 18
Camera self-calibration
(Unsupervised) Deep learning [3]
• Input: Single image
• Output: Intrinsic parameters, camera pose, depth map
Frames from YouTube and learned disparity maps.
The camera intrinsics are learned. [3]
Geometric [2]
• Input: Camera poses for each frame + Matched Feature points on Image plane
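Epipolar Geometry
$\mathbf{x}'^{\top} F \, \mathbf{x} = 0$
$E = [\mathbf{t}]_{\times} R, \qquad F = K^{-\top} E \, K^{-1}$
$\begin{bmatrix} u' & v' & 1 \end{bmatrix} F \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = 0$
$K$: camera intrinsic parameter
Solved analytically or algebraically
$\begin{bmatrix} u' & v' & 1 \end{bmatrix} K^{-\top} E \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = 0$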
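As a numerical check of the epipolar constraint on this slide, the sketch below builds the fundamental matrix from a known pose and shows that the residual vanishes only when the correct intrinsics are used. It assumes the convention X2 = R X1 + t with the same K in both views; the intrinsic values, the pose, and the helper names are made up for the example.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def fundamental_from_pose(K, R, t):
    """F = K^{-T} [t]_x R K^{-1}, assuming X2 = R @ X1 + t and a shared K."""
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t) @ R @ K_inv


def project(K, X):
    """Pinhole projection of a camera-frame point to homogeneous pixels."""
    x = K @ X
    return x / x[2]


def max_epipolar_residual(K_assumed, K_true, R, t, points):
    """Largest |x2^T F x1| when F is built from K_assumed."""
    F = fundamental_from_pose(K_assumed, R, t)
    return max(abs(project(K_true, R @ X + t) @ F @ project(K_true, X))
               for X in points)


# Made-up intrinsics and relative pose (small yaw, mostly forward motion).
K = np.array([[720.0, 0.0, 620.0],
              [0.0, 720.0, 187.0],
              [0.0, 0.0, 1.0]])
angle = np.deg2rad(5.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.2, 0.0, 1.0])

rng = np.random.default_rng(0)
points = rng.uniform([-5.0, -2.0, 8.0], [5.0, 2.0, 30.0], size=(20, 3))

# Correct intrinsics: residual is zero up to rounding.
max_residual = max_epipolar_residual(K, K, R, t, points)

# Focal length off by 10 percent: the constraint is visibly violated.
K_bad = K.copy()
K_bad[0, 0] *= 1.1
K_bad[1, 1] *= 1.1
max_residual_bad = max_epipolar_residual(K_bad, K, R, t, points)
print(max_residual, max_residual_bad)
```

Geometric self-calibration runs this relation in the other direction: given matched points across frames, it searches for the K that makes the residuals small.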
8. 8 / 18
Approach A for 3D positioning
Geometric based
• Align the ORB-SLAM pose trajectory to the KITTI GPS data with a similarity transformation
→ Scaled camera pose seq.
• For the frames in which a traffic sign is observed, optimize with bundle adjustment → Traffic sign's pos.
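The similarity transformation in the first bullet can be estimated in closed form with Umeyama's method, which the implementation slide also names. The sketch below is a NumPy version run on synthetic data; the trajectory, the scale of 12.5, and the function name are illustrative assumptions, not values from the paper.

```python
import numpy as np


def umeyama_alignment(src, dst):
    """Least-squares similarity transform with dst ≈ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding positions.
    Returns the scale s, rotation R (3x3), and translation t (3,).
    """
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = (D * np.diag(S)).sum() / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


# Synthetic stand-ins: an up-to-scale SLAM trajectory and its metric GPS track.
rng = np.random.default_rng(0)
slam = rng.normal(size=(50, 3))
yaw = np.deg2rad(30.0)
R_true = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                   [np.sin(yaw), np.cos(yaw), 0.0],
                   [0.0, 0.0, 1.0]])
gps = 12.5 * slam @ R_true.T + np.array([100.0, -40.0, 2.0])

s_est, R_est, t_est = umeyama_alignment(slam, gps)
scaled = s_est * slam @ R_est.T + t_est  # scaled camera positions
print(round(s_est, 6))
```

With real data the GPS positions are noisy and the fit is only approximate; a trajectory with no turns also leaves the rotation about the driving direction poorly constrained.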
Camera
Intrinsic
Scaled
Camera Pose
Seq.
Mid-point
Algorithm
Bundle
Adjustment
Pos. of Sign
on Images
Traffic Sign’s
Position
Initial position
of Traffic sign
i: Sign # (Association)
j : Frame #
$C_{i,j}$: Pixel of sign i in frame j
[Garbled formula: a quantity indexed by i,j is constrained to be > 0]
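The mid-point algorithm that provides the initial sign position can be sketched as a least-squares intersection of viewing rays. This is one common multi-ray formulation, assumed here for illustration; the intrinsics, camera centers, and helper names are made up, and the paper may use a different variant.

```python
import numpy as np


def pixel_ray(K, R_wc, pixel):
    """Unit viewing ray in the world frame for a pixel.

    R_wc rotates camera-frame vectors into the world frame.
    """
    d_cam = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    d_world = R_wc @ d_cam
    return d_world / np.linalg.norm(d_world)


def midpoint_triangulate(centers, directions):
    """Point minimizing the summed squared distance to all rays.

    Each ray is (center c, unit direction d). The normal equations are
    sum(I - d d^T) X = sum((I - d d^T) c).
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(centers, directions):
        P = np.eye(3) - np.outer(d, d)  # projector orthogonal to the ray
        A += P
        b += P @ c
    return np.linalg.solve(A, b)


# Made-up setup: a sign ahead of a camera that drives forward along z.
K = np.array([[720.0, 0.0, 620.0],
              [0.0, 720.0, 187.0],
              [0.0, 0.0, 1.0]])
sign = np.array([4.0, -1.5, 20.0])
R_wc = np.eye(3)
centers = [np.array([0.0, 0.0, 6.0 * j]) for j in range(3)]

directions = []
for c in centers:
    X_cam = R_wc.T @ (sign - c)          # sign in the camera frame
    uvw = K @ X_cam
    directions.append(pixel_ray(K, R_wc, uvw[:2] / uvw[2]))

estimate = midpoint_triangulate(centers, directions)
print(np.round(estimate, 6))
```

The estimate from this step would then serve as the starting point for bundle adjustment, which refines it against the reprojection error.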
9. 9 / 18
Approach B for 3D positioning
Deep learning based
• Track the pose relative to the first frame with a deep learning based algorithm
• Similarity transformation using the KITTI GPS data → Scaled depth maps
• Compute the relative position in each frame, then align to the absolute position of the first frame and average
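The inverse projection and hypothesis fusion steps of Approach B can be sketched as follows. Each frame back-projects the sign pixel with its scaled depth, the result is moved into the first frame's coordinates, and the hypotheses are averaged. The pose convention (R_0j, t_0j map frame j coordinates into frame 0), the plain mean as the fusion rule, and all numeric values are assumptions for the example.

```python
import numpy as np


def back_project(K, pixel, depth, scale):
    """Camera-frame 3D point: scale * depth * K^{-1} [u, v, 1]^T."""
    ray = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    return scale * depth * ray


def fuse_hypotheses(K, observations):
    """Average per-frame sign positions expressed in frame 0.

    observations: list of (pixel, depth, scale, R_0j, t_0j), where
    X_0 = R_0j @ X_j + t_0j maps frame j coordinates into frame 0.
    """
    hypotheses = [R_0j @ back_project(K, pixel, depth, scale) + t_0j
                  for pixel, depth, scale, R_0j, t_0j in observations]
    return np.mean(hypotheses, axis=0)


# Made-up setup: the network depth is off by a constant factor of 10.
K = np.array([[720.0, 0.0, 620.0],
              [0.0, 720.0, 187.0],
              [0.0, 0.0, 1.0]])
sign_in_frame0 = np.array([2.0, -1.0, 20.0])
true_scale = 10.0

observations = []
for j in range(4):
    R_0j = np.eye(3)
    t_0j = np.array([0.0, 0.0, 1.5 * j])      # camera j position in frame 0
    X_j = R_0j.T @ (sign_in_frame0 - t_0j)    # sign in frame j
    uvw = K @ X_j
    pixel = uvw[:2] / uvw[2]
    unscaled_depth = X_j[2] / true_scale       # what the network would output
    observations.append((pixel, unscaled_depth, true_scale, R_0j, t_0j))

fused = fuse_hypotheses(K, observations)
print(np.round(fused, 6))
```

With noisy depth maps the per-frame hypotheses disagree, and averaging over the k frames reduces the variance of the final position.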
Camera
Intrinsic
Scaled Depth
Maps
Inverse
Projection
Hypothesis
Fusion
Pos. of Sign
on Images
Traffic Sign’s
Position
i: Sign # (Association)
j : Frame #
$C_{i,j}$: Pixel of sign i in frame j
$s_{d_j}$: Scale for the depth map at frame j
k: Number of frames
10. 10 / 18
Result 1: Sensitivity to Camera Calibration
• Inject a fixed amount of error into the true camera intrinsic parameters and
check the mean error of the relative positions obtained with Approach A
• Run the test 10 times on KITTI Seq. 5 and 7 and select the smallest value
KITTI Seq. 5
Loop Closure ↑
KITTI Seq. 7
Loop Closure ↓
11. 11 / 18
Result 1: Sensitivity to Camera Calibration
KITTI Seq. 5 KITTI Seq. 7
12. 12 / 18
Result 2-1: Self-Calibration
Self-Calibration
• Monodepth2 and VITW were trained on 44 sequences of the KITTI dataset (city, residential, road)
• COLMAP performs best
• On Seq. 4, which contains only straight driving, neither the geometric nor the deep learning based method
can self-calibrate the camera
KITTI Seq. 4
13. 13 / 18
Result 2-2: Ego-motion Estimation
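• ORB-SLAM with loop closure performs best
• However, ORB-SLAM fails on sequences such as KITTI Seq. 01, which has a slight curve at the start and then only straight driving
ATE-5: split the trajectory into 5-frame segments, match each segment, then compare
Definition of ATE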
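A minimal sketch of ATE-5 as described above: the trajectory is cut into 5-frame snippets, each estimated snippet is matched to ground truth, and the RMSE of the position error is averaged over snippets. The matching used here (anchor both snippets at their first frame and fit one scale factor) follows a convention common in self-supervised depth work; whether the paper uses exactly this alignment, and whether snippets overlap, are assumptions.

```python
import numpy as np


def ate_rmse(est, gt):
    """RMSE of position error after first-frame anchoring and scale fitting.

    est, gt: (N, 3) arrays of corresponding camera positions.
    """
    est0 = est - est[0]
    gt0 = gt - gt[0]
    denom = (est0 ** 2).sum()
    # Least-squares scale that best maps the estimate onto ground truth.
    scale = (gt0 * est0).sum() / denom if denom > 0 else 1.0
    err = scale * est0 - gt0
    return float(np.sqrt((err ** 2).sum(axis=1).mean()))


def ate_5(est, gt, window=5, step=1):
    """Mean and standard deviation of snippet-wise ATE.

    step=1 gives overlapping snippets; step=window gives disjoint ones.
    """
    errors = [ate_rmse(est[i:i + window], gt[i:i + window])
              for i in range(0, len(est) - window + 1, step)]
    return float(np.mean(errors)), float(np.std(errors))


# Synthetic check: a pure scale error is removed by the matching step,
# while added noise is not.
rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(size=(40, 3)), axis=0)
est_scaled = gt / 7.0
est_noisy = est_scaled + rng.normal(scale=0.01, size=gt.shape)

mean_clean, _ = ate_5(est_scaled, gt)
mean_noisy, _ = ate_5(est_noisy, gt)
print(mean_clean, mean_noisy)
```

Because each snippet is aligned independently, ATE-5 measures local motion accuracy and says little about long-range drift, which is where loop closure helps.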
KITTI Seq. 1
14. 14 / 18
Result 2-3: Depth Estimation
• Monodepth2 performs slightly better than VITW, but
it needs prior information on the dataset's average camera parameters during training, so
the authors suggest that VITW is better suited to crowdsourced mapping
Error Accuracy
15. 15 / 18
Result 3-1: 3D Positioning Analysis
• Approach A performs better
• However, the ORB-SLAM failure cases need to be addressed.
$e_f$: Average relative sign positioning error using full trajectory
$e_s$: Average relative sign positioning error using short trajectory
$n$: Average # of successfully positioned signs
$e$: Relative sign position error using depth maps
16. 16 / 18
Result 3-2: 3D Positioning Analysis (Hybrid)
RDP
algorithm
KITTI Seq. 1
ORB-SLAM
failure case
KITTI Seq. 4
Colmap
failure case
KITTI Seq. 8
Best to combine the geometric and deep learning methods appropriately!
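The RDP (Ramer-Douglas-Peucker) algorithm named on this slide simplifies a polyline by keeping only the vertices that deviate from a straight chord by more than a tolerance. The sketch below applies it to 2D paths; using the number of retained vertices as a cue for whether a trajectory contains turns is an assumption about how the hybrid approach uses it, and the tolerance and paths are made up.

```python
import numpy as np


def rdp(points, epsilon):
    """Ramer-Douglas-Peucker simplification of an (N, 2) polyline."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points

    start, end = points[0], points[-1]
    chord = end - start
    chord_len = np.linalg.norm(chord)
    rel = points - start
    if chord_len == 0.0:
        # Closed path: fall back to the distance from the start point.
        dists = np.linalg.norm(rel, axis=1)
    else:
        # Perpendicular distance of every point to the start-end chord.
        dists = np.abs(chord[0] * rel[:, 1] - chord[1] * rel[:, 0]) / chord_len

    idx = int(np.argmax(dists))
    if dists[idx] <= epsilon:
        return np.vstack([start, end])

    # Split at the farthest point and simplify both halves.
    left = rdp(points[: idx + 1], epsilon)
    right = rdp(points[idx:], epsilon)
    return np.vstack([left[:-1], right])


# A straight path collapses to its endpoints; a path with one 90-degree
# turn keeps the corner as an extra vertex.
straight = np.array([[float(x), 0.0] for x in range(51)])
turn = np.vstack([straight, [[50.0, float(y)] for y in range(1, 51)]])

straight_simplified = rdp(straight, epsilon=1.0)
turn_simplified = rdp(turn, epsilon=1.0)
print(len(straight_simplified), len(turn_simplified))
```

A sequence that simplifies to two points is effectively straight, which matches the failure cases listed on this slide and the earlier finding that straight-only driving prevents self-calibration.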
17. 17 / 18
Contributions
• Evaluated how sensitive 3D position triangulation is to the accuracy of camera self-calibration
• Compared the multi-view geometry approach (COLMAP) with the
deep learning approaches (Monodepth2, VITW) as camera self-calibration techniques
• Performed crowdsourced 3D traffic sign positioning using only a commercial GPS and
a monocular color camera, without prior information
• Combined the accuracy of multi-view geometry with the simplicity of deep learning and
confirmed that a 3D map can be built with stable accuracy
• Built and released 3D traffic sign GT for KITTI:
https://github.com/hemangchawla/3d-groundtruth-traffic-sign-positions
18. 18 / 18
For Implementation
• ORB-SLAM2 without Stereo ≒ ORB-SLAM 1 (C++) [link]
• Camera self-calibration using Colmap for KITTI (C++) [link]
• Camera self-calibration using VITW (Python with TF 1.15) [link]
• Detection of Traffic sign using YOLOv4 (C++) [link]
• Migration for Waymo dataset [link]
(3D lidar GT for Traffic sign, 2D image with GT for Traffic sign)
• Similarity transformation with Umeyama’s algorithm using Eigen (C++) [link]
• Bundle adjustment using OpenMVG (C++ with Ceres solver) [link]
• Other utilities such as inverse projection, the RDP algorithm, and the mid-point algorithm
can be written from scratch or taken from publicly available code