Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization

Jiayun Wang
UC Berkeley & Caltech
Yubei Chen
UC Davis
Stella X. Yu
UC Berkeley & U Michigan

ECCV 2024 (Oral)

Paper | Code | Dataset | Poster

We capture two aspects of object recognition through self-supervised learning (SSL): what the object is and how it is presented. Our training data are unlabeled image triplets with small pose changes, sampled from viewpoint trajectories, without any semantic or pose labels.


Abstract

Learning visual features from unlabeled images has proven successful for semantic categorization, often by mapping different views of the same object to the same feature to achieve recognition invariance. However, visual recognition involves not only identifying what an object is but also understanding how it is presented. For example, seeing a car from the side versus head-on is crucial for deciding whether to stay put or jump out of the way. While unsupervised feature learning for downstream viewpoint reasoning is important, it remains under-explored, partly due to the lack of a standardized evaluation method and benchmarks.

We introduce a new dataset of adjacent image triplets obtained from a viewpoint trajectory, without any semantic or pose labels. We benchmark both semantic classification and pose estimation accuracies on the same visual feature. Additionally, we propose a viewpoint trajectory regularization loss for learning features from unlabeled image triplets. Our experiments demonstrate that this approach helps develop a visual representation that encodes object identity and organizes objects by their poses, retaining semantic classification accuracy while achieving emergent global pose awareness and better generalization to novel objects.


SSL for Both Object Semantics and Pose

We seek SSL representations that capture both object semantics and pose: the learned representations should discriminate objects by semantics and by pose, achieving high accuracy on both semantic classification and pose estimation. Notably, we expect global pose understanding to emerge from local pose changes (Left). Our method excels over existing methods at both semantic classification and pose estimation (Right).


Benchmark Dataset

We provide a benchmark dataset for joint learning of object semantics and pose. For semantics, we use 13 in-domain semantic categories and 11 non-overlapping out-of-domain categories; we visualize the in-domain and out-of-domain classes with PCA-projected Word2Vec features (Left). For pose, we adopt absolute and relative pose estimation as tasks. Notably, relative pose estimation enables testing SSL generalization on out-of-domain data, as it eliminates the need for a category-specific canonical pose (Right).
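
To make the relative-pose protocol concrete, here is a minimal PyTorch sketch of a frozen-encoder evaluation head that predicts the relative azimuth bin between two views from their concatenated features. The class name RelativePoseHead, the feature dimension, and the binning scheme are illustrative assumptions, not the paper's exact implementation.

    import torch
    import torch.nn as nn

    class RelativePoseHead(nn.Module):
        # Small probe trained on top of a frozen SSL encoder. Because it
        # predicts only the *relative* pose between two views, it needs no
        # category-specific canonical pose and thus transfers to
        # out-of-domain categories.
        def __init__(self, feat_dim=2048, num_bins=36):  # assumed dimensions
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(2 * feat_dim, 512),
                nn.ReLU(inplace=True),
                nn.Linear(512, num_bins),  # e.g., 36 azimuth bins of 10 degrees
            )

        def forward(self, feat_a, feat_b):
            return self.mlp(torch.cat([feat_a, feat_b], dim=-1))

    # Usage, with the encoder frozen:
    #   logits = head(encoder(img_a), encoder(img_b))
    #   loss = F.cross_entropy(logits, relative_azimuth_bin)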


Method: SSL with Trajectory Loss

In addition to an invariance SSL loss (e.g., SimCLR), we propose a trajectory loss to prevent the representation from collapsing on pose. We enforce the representations of adjacent views of an object, z_L, z_C, z_R, to form a geodesic trajectory.
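
To make this concrete, below is a minimal PyTorch sketch of one simple way to encourage such a trajectory: penalizing misalignment between the two step vectors z_C - z_L and z_R - z_C so the triplet stays locally colinear. The colinearity formulation, the function names (trajectory_loss, total_loss), and the weight lam are illustrative assumptions, not necessarily the exact loss used in the paper.

    import torch
    import torch.nn.functional as F

    def trajectory_loss(z_l, z_c, z_r, eps=1e-8):
        # Encourage embeddings of three adjacent views (left, center, right)
        # to be colinear: the step z_c - z_l should point in the same
        # direction as the step z_r - z_c, so the triplet lies on a smooth
        # local trajectory in feature space.
        step1 = z_c - z_l  # (B, D) first step along the trajectory
        step2 = z_r - z_c  # (B, D) second step along the trajectory
        cos = F.cosine_similarity(step1, step2, dim=-1, eps=eps)
        return (1.0 - cos).mean()  # zero when the steps are perfectly aligned

    def total_loss(invariance_loss, z_l, z_c, z_r, lam=1.0):
        # Combine an invariance SSL loss (e.g., SimCLR on augmented views)
        # with the trajectory regularizer; lam is a hypothetical weight.
        return invariance_loss + lam * trajectory_loss(z_l, z_c, z_r)

For L2-normalized embeddings and small pose changes, this local colinearity constraint can serve as a first-order approximation to a geodesic on the hypersphere.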


Results: Better Generalization

Our trajectory regularization consistently achieves higher relative pose estimation accuracy on in-domain and out-of-domain categories, as well as on in-domain and out-of-domain poses. Our SSL method is on par with, and sometimes outperforms, supervised methods on out-of-domain data.


Visualizing the Learned Representation

We learn a joint semantic-pose embedding: images cluster by semantics, and within each semantic cluster they form mini-clusters by pose. The representation is organized by semantic category, with images of the same category forming clusters (Left). Zooming into one category, airplane, we visualize 200 instances with different poses; as the azimuth changes, their representations trace out a trajectory (Right).
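
A visualization like the one described above can be reproduced with a simple PCA projection of the learned features. The sketch below is an assumed workflow (the function name plot_embedding and the colormap choice are illustrative), not the paper's actual plotting code.

    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt

    def plot_embedding(feats, labels, title):
        # feats: (N, D) array of learned features; labels: (N,) integer ids
        # (semantic categories, or azimuth bins when zooming into a category).
        xy = PCA(n_components=2).fit_transform(feats)
        plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="tab20", s=4)
        plt.title(title)
        plt.axis("off")
        plt.show()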


Representation for a Single Category (Airplanes)

Embeddings of renderings of multiple airplanes under pose changes demonstrate the improved representation of our method (Left) over the baseline (VICReg) (Right). In each figure, different dots denote different airplane instances at the same pose. As airplane poses change, their representations form a trajectory in the feature space. While the baseline without the trajectory loss can differentiate some views, it fails to form a trajectory, which may partially explain its worse pose estimation performance.


Citation

@inproceedings{wang2024pose,
  title     = {Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization},
  author    = {Wang, Jiayun and Chen, Yubei and Yu, Stella X.},
  booktitle = {European Conference on Computer Vision},
  year      = {2024},
}

Acknowledgement

This project was supported, in part, by NSF 2215542, NSF 2313151, and Bosch gift funds to S. Yu at UC Berkeley and the University of Michigan. The authors thank Zezhou Cheng and Quentin Garrido for helpful discussions.