Publications
Asterisk indicates equal contribution.
An up-to-date list is available on Google Scholar.
Preprints
2025
- [arXiv] Ultrasound Lung Aeration Map via Physics-Aware Neural Operators. Jiayun Wang, Oleksii Ostras, Masashi Sode, Bahareh Tolooshams, Zongyi Li, Kamyar Azizzadenesheli, and 2 more authors. In submission to Nature, 2025.
Diagnostic lung ultrasound (LUS) is common in clinics for the evaluation of acute and chronic lung diseases. The ultrasound transducer emits a diagnostic pulse, receives pressure waves at its surface, and transforms them into radio-frequency (RF) data, which the ultrasound machine beamforms into B-mode images; radiologists read these images for diagnosis and monitoring of lung disease. However, interpreting LUS B-mode images is challenging due to the ill-posedness of the problem, which stems from ultrasound's inability to penetrate air and the complex physics of wave propagation in the lung. This interpretation difficulty reduces repeatability and increases variance in LUS-based diagnosis and monitoring. In this work, we propose a surrogate model for the inverse problem of ultrasound propagation in the lung and reconstruct lung aeration maps (the source) directly from the ultrasound RF data, avoiding the indirect and challenging interpretation of B-mode images.
- [arXiv] Beyond Closure Models: Learning Chaotic-Systems via Physics-Informed Neural Operators. In submission to ICLR, 2025.
Accurately predicting the long-term behavior of chaotic systems is crucial for various applications such as climate modeling. However, achieving such predictions typically requires iterative computations over a dense spatiotemporal grid to account for the unstable nature of chaotic systems, which is expensive and impractical in many real-world situations. An alternative to such a fully-resolved simulation is using a coarse grid and then correcting its errors through a closure model, which approximates the overall information from fine scales not captured in the coarse-grid simulation. Recently, ML approaches have been used for closure modeling, but they typically require a large number of training samples from expensive fully-resolved simulations (FRS). In this work, we prove an even more fundamental limitation: the standard approach to learning closure models suffers from a large approximation error for generic problems, no matter how large the model is, and this error stems from the non-uniqueness of the mapping. We propose an alternative end-to-end learning approach using a physics-informed neural operator (PINO) that overcomes this limitation by using neither a closure model nor a coarse-grid solver.
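Schematically, the closure-model setup the paper analyzes can be written as follows (notation ours): a coarse-grid solver evolves the coarse state $\bar{u}$, and a learned closure $C_\theta$ supplies the missing fine-scale contribution,

$$
\frac{\partial \bar{u}}{\partial t} = \bar{S}(\bar{u}) + C_\theta(\bar{u}),
$$

where $\bar{S}$ denotes the coarse-grid dynamics. The approximation-error result follows because the true fine-scale correction is not a function of $\bar{u}$ alone: two fine-scale states with the same coarse projection can require different corrections, so no single choice of $C_\theta(\bar{u})$ can fit both.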
- [arXiv] Open Vocabulary Monocular 3D Object Detection. Jin Yao, Hao Gu, Xuweiyi Chen, Jiayun Wang, and Zezhou Cheng. In submission, 2025.
In this work, we pioneer the study of open-vocabulary monocular 3D object detection, a novel task that aims to detect and localize objects in 3D space from a single RGB image without limiting detection to a predefined set of categories. We formalize this problem, establish baseline methods, and introduce a class-agnostic approach that leverages open-vocabulary 2D detectors and lifts 2D bounding boxes into 3D space. Our approach decouples the recognition and localization of objects in 2D from the task of estimating 3D bounding boxes, enabling generalization across unseen categories. Additionally, we propose a target-aware evaluation protocol to address inconsistencies in existing datasets, improving the reliability of model performance assessment. Extensive experiments on the Omni3D dataset demonstrate the effectiveness of the proposed method in zero-shot 3D detection for novel object categories, validating its robust generalization capabilities. Our method and evaluation protocols contribute towards the development of open-vocabulary object detection models that can effectively operate in real-world, category-diverse environments.
Conference & Journal Articles
2024
- [ECCV] Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization. Jiayun Wang, Yubei Chen, and Stella Yu. In European Conference on Computer Vision, 2024.
The paper was selected as an oral presentation (2.3%).
Learning visual features from unlabeled images has proven successful for semantic categorization, often by mapping different views of the same object to the same feature to achieve recognition invariance. However, visual recognition involves not only identifying what an object is but also understanding how it is presented. For example, seeing a car from the side versus head-on is crucial for deciding whether to stay put or jump out of the way. While unsupervised feature learning for downstream viewpoint reasoning is important, it remains under-explored, partly due to the lack of a standardized evaluation method and benchmarks. We introduce a new dataset of adjacent image triplets obtained from a viewpoint trajectory, without any semantic or pose labels. We benchmark both semantic classification and pose estimation accuracies on the same visual feature. Additionally, we propose a viewpoint trajectory regularization loss for learning features from unlabeled image triplets. Our experiments demonstrate that this approach helps develop a visual representation that encodes object identity and organizes objects by their poses, retaining semantic classification accuracy while achieving emergent global pose awareness and better generalization to novel objects.
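For concreteness, one plausible instantiation of such a regularizer (our assumption; the paper's exact loss may differ) penalizes the discrete second difference of features along the viewpoint trajectory, encouraging them to vary smoothly as the viewpoint changes:

```python
import torch

def trajectory_regularizer(f_prev, f_mid, f_next):
    """Hypothetical viewpoint trajectory regularizer (illustrative, not the
    paper's exact loss): features of three adjacent views along a trajectory
    should vary smoothly, so penalize the discrete second difference."""
    return (f_prev - 2 * f_mid + f_next).pow(2).mean()
```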
- [Compress @ NeurIPS] A Unified Model for Compressed Sensing MRI Across Undersampling Patterns. Armeet Jatyani*, Jiayun Wang*, Zihui Wu, Miguel Liu-Schiaffini, Bahareh Tolooshams, and Anima Anandkumar. In NeurIPS Workshop, 2024.
The full paper is in submission to CVPR 2025.
Compressed Sensing MRI (CS-MRI) reconstructs images of the body’s internal anatomy from undersampled and compressed measurements, thereby reducing scan times and minimizing the duration patients need to remain still. Recently, deep neural networks have shown great potential for reconstructing high-quality images from highly undersampled measurements. However, since deep neural networks operate on a fixed discretization, one needs to train multiple models for different measurement subsampling patterns and image resolutions. This approach is highly impractical in clinical settings, where subsampling patterns and image resolutions are frequently varied to accommodate different imaging and diagnostic requirements. We propose a unified model that is robust to different subsampling patterns and image resolutions in CS-MRI. Our model is based on neural operators, a discretization-agnostic architecture. We use neural operators in both image and measurement (frequency) space, which capture local and global image features for MRI reconstruction. Empirically, we achieve consistent performance across different subsampling rates and patterns, with up to 4x lower NMSE and 5 dB PSNR improvements over the state-of-the-art method. We also show the model is agnostic to image resolutions with zero-shot super-resolution results. Our unified model is a promising tool that is agnostic to measurement subsampling and imaging resolutions in MRI, offering significant utility in clinical settings where flexibility and adaptability are essential for efficient and reliable imaging.
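To give a flavor of the discretization-agnostic building block, here is a generic Fourier-layer sketch (not the authors' architecture; channel and mode counts are illustrative). Because its weights live on frequency modes rather than grid points, the same layer applies to inputs of any resolution:

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """Generic neural-operator layer: mix a fixed number of low-frequency
    Fourier modes with learned complex weights (illustrative sketch)."""

    def __init__(self, channels, modes=16):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        self.w = nn.Parameter(scale * torch.randn(channels, channels, modes, dtype=torch.cfloat))

    def forward(self, x):
        # x: (batch, channels, n) sampled on an arbitrary grid of size n.
        xf = torch.fft.rfft(x)                      # (batch, channels, n//2 + 1)
        out = torch.zeros_like(xf)
        m = min(self.modes, xf.shape[-1])
        # Apply per-mode channel mixing to the retained low frequencies.
        out[..., :m] = torch.einsum("bim,iom->bom", xf[..., :m], self.w[..., :m])
        return torch.fft.irfft(out, n=x.shape[-1])
```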
- [MICCAI] Insight: A Multi-Modal Diagnostic Pipeline using LLMs for Ocular Surface Disease Diagnosis. Chun-Hsiao Yeh, Jiayun Wang, Andrew D. Graham, Andrea J. Liu, Bo Tan, Yubei Chen, and 2 more authors. In Medical Image Computing and Computer-Assisted Intervention (MICCAI) Proceedings, 2024.
Accurate diagnosis of ocular surface diseases is critical in optometry and ophthalmology, and hinges on integrating clinical data sources (e.g., meibography imaging and clinical metadata). Traditional human assessments lack precision in quantifying clinical observations, while current machine-based methods often treat diagnoses as multi-class classification problems, limiting diagnoses to a predefined closed set of curated answers without reasoning about the clinical relevance of each variable to the diagnosis. To tackle these challenges, we introduce an innovative multi-modal diagnostic pipeline (MDPipe) that employs large language models (LLMs) for ocular surface disease diagnosis. We first employ a visual translator to interpret meibography images by converting them into quantifiable morphology data, facilitating their integration with clinical metadata and enabling the communication of nuanced medical insight to LLMs. To further advance this communication, we introduce an LLM-based summarizer to contextualize the insight from the combined morphology and clinical metadata and generate clinical report summaries. Finally, we refine the LLMs' reasoning ability with domain-specific insight from real-life clinician diagnoses. Our evaluation across diverse ocular surface disease diagnosis benchmarks demonstrates that MDPipe outperforms existing standards, including GPT-4, and provides clinically sound rationales for diagnoses.
- [ML4H] Multi-Modal Self-Supervised Learning for Surgical Feedback Effectiveness Assessment. Arushi Gupta, Rafal Kocielnik, Jiayun Wang, Firdavs Nasriddinov, Cherine Yang, Elyssa Wong, and 2 more authors. In Machine Learning for Health, PMLR, 2024.
The paper won the best paper award at the 2024 Machine Learning for Health Conference.
During surgical training, real-time feedback from trainers to trainees is important for preventing errors and enhancing long-term skill acquisition. Accurately predicting the effectiveness of this feedback, specifically whether it leads to a change in trainee behavior, is crucial for developing methods to improve surgical training and education. However, relying on human annotations to assess feedback effectiveness is labor-intensive and prone to biases, highlighting the need for an automated, scalable, and objective approach. Developing such a system is challenging, as it requires understanding both the verbal feedback delivered by the trainer and the visual context of the real-time surgical scene. To address this, we propose a method that integrates information from transcribed verbal feedback and corresponding surgical video to predict feedback effectiveness.
- [Heliyon] A Machine Learning Approach to Predicting Dry Eye-Related Signs, Symptoms and Diagnoses. Andrew D. Graham, Tejasvi Kothapalli, Jiayun Wang, Jennifer Ding, Vivien Tse, Penny Asbell, and 2 more authors. Heliyon, 2024.
This study used artificial intelligence to identify relationships between morphological characteristics of the Meibomian glands (MGs), subject factors, clinical outcomes, and subjective symptoms of dry eye. A deep learning model was trained to take meibography as input, segment the individual MGs in the images, and learn their detailed morphological features. Morphological characteristics were then combined with clinical and symptom data in prediction models of MG function, tear film stability, ocular surface health, and subjective discomfort and dryness. The models were analyzed to identify the most heavily weighted features used by the algorithm for predictions. Machine learning-derived MG morphological characteristics were found to be important in predicting multiple signs, symptoms, and diagnoses related to MG dysfunction and dry eye. This deep learning method illustrates the rich clinical information that detailed morphological analysis of the MGs can provide, and shows promise in advancing our understanding of the role of MG morphology in ocular surface health.
- [Nature Portfolio] Artificial Intelligence Models Utilize Lifestyle Factors to Predict Dry Eye Related Outcomes. Andrew D. Graham, Jiayun Wang, Tejasvi Kothapalli, Jennifer Ding, Helen Tasho, Alisa Molina, and 4 more authors. Nature Scientific Reports, 2024.
Purpose: To examine and interpret machine learning models that predict dry eye (DE)-related clinical signs, subjective symptoms, and clinician diagnoses by heavily weighting lifestyle factors in the predictions. Methods: Machine learning models were trained to take clinical assessments of the ocular surface, eyelids, and tear film, combined with symptom scores from validated questionnaire instruments for DE and clinician diagnoses of ocular surface diseases, and perform a classification into DE-related outcome categories. Outcomes are presented for which the data-driven algorithm identified subject characteristics, lifestyle, behaviors, or environmental exposures as heavily weighted predictors. Models were assessed by 5-fold cross-validation accuracy and class-wise statistics of the predictors. Conclusions: The results emphasize the importance of lifestyle, subject, and environmental characteristics in the etiology of ocular surface disease. Lifestyle factors should be taken into account in clinical research and care to a far greater extent than has been the case to date.
2023
- [WACV] Compact and Optimal Deep Learning with Recurrent Parameter Generators. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023.
Deep learning has achieved tremendous success by training increasingly large models, which are then compressed for practical deployment. We propose a drastically different approach to compact and optimal deep learning: we decouple the degrees of freedom (DoF) from the actual number of parameters of a model, and optimize a small DoF under predefined random linear constraints for a large model of arbitrary architecture in one-stage, end-to-end learning. Specifically, we create a recurrent parameter generator (RPG), which repeatedly fetches parameters from a ring and unpacks them onto a large model with random permutation and sign flipping to promote parameter decorrelation. We show that gradient descent can automatically find the best model under constraints, with faster convergence. Our extensive experimentation reveals a log-linear relationship between model DoF and accuracy. Our RPG demonstrates remarkable DoF reduction and can be further pruned and quantized for additional run-time performance gain. For example, in terms of top-1 accuracy on ImageNet, RPG achieves 96% of ResNet18's performance with only 18% of the DoF (the equivalent of one convolutional layer) and 52% of ResNet34's performance with only 0.25% of the DoF. Our work shows significant potential for constrained neural optimization in compact and optimal deep learning.
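As a rough illustration of the ring idea (our toy sketch, not the authors' implementation; shapes and the seeding scheme are made up for the example):

```python
import torch
import torch.nn as nn

class RecurrentParameterGenerator(nn.Module):
    """Toy sketch: generate each layer's weights from one shared parameter
    ring via fixed random index draws and sign flips (illustrative only)."""

    def __init__(self, ring_size, layer_shapes, seed=0):
        super().__init__()
        self.ring = nn.Parameter(0.02 * torch.randn(ring_size))  # the model's DoF
        g = torch.Generator().manual_seed(seed)
        self.meta = []
        for shape in layer_shapes:
            n = int(torch.tensor(shape).prod())
            # Fetch indices from the ring (with wraparound when n > ring_size).
            idx = torch.randint(ring_size, (n,), generator=g)
            sign = (torch.randint(0, 2, (n,), generator=g) * 2 - 1).float()
            self.meta.append((shape, idx, sign))

    def weight(self, i):
        shape, idx, sign = self.meta[i]
        # Random index assignment and sign flips promote parameter decorrelation.
        return (self.ring[idx] * sign).view(shape)
```

Gradients from every layer flow back into the shared ring, so the degrees of freedom stay fixed at `ring_size` regardless of the model's nominal parameter count.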
- [ML4H] Deep Multimodal Fusion for Surgical Feedback Classification. Rafal Kocielnik, Elyssa Y. Wong, Timothy N. Chu, Lydia Lin, De-An Huang, Jiayun Wang, and 2 more authors. In Machine Learning for Health, PMLR, 2023.
The paper won the best paper award at the 2023 Machine Learning for Health Conference.
Quantification of real-time informal feedback delivered by an experienced surgeon to a trainee during surgery is important for skill improvement in surgical training. Such feedback in the live operating room is inherently multimodal, consisting of verbal conversations (e.g., questions and answers) as well as non-verbal elements (e.g., visual cues like pointing to anatomic elements). In this work, we leverage a clinically-validated five-category classification of surgical feedback: “Anatomic”, “Technical”, “Procedural”, “Praise”, and “Visual Aid”. We then develop a multi-label machine learning model to classify these five categories of surgical feedback from inputs of text, audio, and video modalities. The ultimate goal of our work is to help automate the annotation of real-time contextual surgical feedback at scale. This work offers an important first look at the feasibility of automated classification of real-world live surgical feedback based on text, audio, and video modalities.
2022
- [TPAMI] Open Long-Tailed Recognition in a Dynamic World. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
Real world data often exhibits a long-tailed and open-ended (i.e., with unseen classes) distribution. A practical recognition system must balance between majority (head) and minority (tail) classes, generalize across the distribution, and acknowledge novelty upon the instances of unseen classes (open classes). We define Open Long-Tailed Recognition++ (OLTR++) as learning from such naturally distributed data and optimizing for the classification accuracy over a balanced test set which includes both known and open classes. OLTR++ handles imbalanced classification, few-shot learning, open-set recognition, and active learning in one integrated algorithm, whereas existing classification approaches often focus only on one or two aspects and deliver poorly over the entire spectrum. The key challenges are: 1) how to share visual knowledge between head and tail classes, 2) how to reduce confusion between tail and open classes, and 3) how to actively explore open classes with learned knowledge. Our algorithm, OLTR++, maps images to a feature space such that visual concepts can relate to each other through a memory association mechanism and a learned metric (dynamic meta-embedding) that both respects the closed world classification of seen classes and acknowledges the novelty of open classes. Additionally, we propose an active learning scheme based on visual memory, which learns to recognize open classes in a data-efficient manner for future expansions. On three large-scale open long-tailed datasets we curated from ImageNet (object-centric), Places (scene-centric), and MS1M (face-centric) data, as well as three standard benchmarks (CIFAR-10-LT, CIFAR-100-LT, and iNaturalist-18), our approach, as a unified framework, consistently demonstrates competitive performance. Notably, our approach also shows strong potential for the active exploration of open classes and the fairness analysis of minority groups.
- [ECCV] Unsupervised Scene Sketch to Photo Synthesis. Jiayun Wang, Sangryul Jeon, Stella Yu, Xi Zhang, Himanshu Arora, and Yu Lou. In European Conference on Computer Vision, 2022.
Sketches are an intuitive and powerful form of visual expression, as they are quickly executed freehand drawings. We present a method for synthesizing realistic photos from scene sketches. Without the need for sketch-photo pairs, our framework directly learns from readily available large-scale photo datasets in an unsupervised manner. To this end, we introduce a standardization module that provides pseudo sketch-photo pairs during training by converting photos and sketches to a standardized domain, i.e., the edge map. The reduced domain gap between sketch and photo also allows us to disentangle them into two components: holistic scene structure and low-level visual styles such as color and texture. Taking advantage of this, we synthesize a photo-realistic image by combining the structure of a sketch with the visual style of a reference photo. Extensive experimental results on perceptual similarity metrics and human perceptual studies show that the proposed method can generate realistic photos with high fidelity from scene sketches and outperforms state-of-the-art photo synthesis baselines. We also demonstrate that our framework facilitates controllable manipulation of photo synthesis by editing strokes of corresponding sketches, delivering more fine-grained details than previous approaches that rely on region-level editing.
- [Nature Portfolio] Predicting Demographics from Meibography Using Deep Learning. Nature Scientific Reports, 2022.
This study introduces a deep learning approach to predicting demographic features from meibography images. A total of 689 meibography images with corresponding subject demographic data were used to develop a deep learning model for predicting gland morphology and demographics from images. The model achieved on average 77%, 76%, and 86% accuracy for predicting Meibomian gland morphological features, subject age, and ethnicity, respectively. The model was further analyzed to identify the most heavily weighted gland morphological features used by the algorithm to predict demographic characteristics. The two most important gland morphological features for predicting age were the percent area of gland atrophy and the percentage of ghost glands. The two most important morphological features for predicting ethnicity were gland density and the percentage of ghost glands. The approach offers an alternative to traditional associative modeling for identifying relationships between Meibomian gland morphological features and subject demographic characteristics. This deep learning methodology can currently predict demographic features from de-identified meibography images with better than 75% accuracy, a figure likely to improve in future models trained on larger datasets; this has significant implications for patient privacy in biomedical imaging.
- [DIRA @ ECCV] 3D Shape Reconstruction from Free-Hand Sketches. In European Conference on Computer Vision, 2022.
The paper was selected as a spotlight presentation.
- [MI @ NeurIPS] Tracking the Dynamics of the Tear Film Lipid Layer. Tejasvi Kothapalli, Charlie Shou, Jennifer Ding, Jiayun Wang, Andrew D. Graham, Tatyana Svitova, and 2 more authors. In NeurIPS Workshop, 2022.
Dry Eye Disease (DED) is one of the most common ocular diseases: over five percent of US adults suffer from DED [4]. Tear film instability is a known factor in DED and is thought to be regulated in large part by the thin lipid layer that covers and stabilizes the tear film. To aid eye-related disease diagnosis, this work proposes a novel paradigm that uses computer vision techniques to numerically analyze the spread of the tear film lipid layer (TFLL). Eleven videos of the TFLL spread are collected with a micro-interferometer, and a subset is annotated. A tracking algorithm built on core computer vision techniques is developed.
2021
- [TPAMI] Spatial Transformer for 3D Point Clouds. Jiayun Wang, Rudrasis Chakraborty, and Stella Yu. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
Deep neural networks can efficiently process 3D point clouds. At each point convolution layer, local features are learned from local neighborhoods of the point cloud and combined for further processing in order to extract the semantic information encoded in the point cloud. Previous networks adopt the same local neighborhoods at every layer, as they apply the same metric on fixed input point coordinates to define them. This is easy to implement but not necessarily optimal: ideally, local neighborhoods should differ across layers so as to adapt to layer dynamics for efficient feature learning. One way to achieve this is to learn a different transformation of the input point cloud at each layer, and then extract features from local neighborhoods defined on the transformed coordinates. In this work, we propose a novel end-to-end approach that learns different non-rigid transformations of the input point cloud for different local neighborhoods at each layer. We propose both linear (affine) and non-linear (projective and deformable) spatial transformers for 3D point clouds.
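For intuition, a minimal sketch of the linear (affine) variant as we understand it (our own rendering, not the paper's code): each layer learns an affine warp of the input coordinates and builds k-nearest-neighbor groupings in the warped space:

```python
import torch
import torch.nn as nn

class AffinePointTransformer(nn.Module):
    """Sketch: a per-layer learnable affine transform of point coordinates;
    neighborhoods are then defined on the transformed points (illustrative)."""

    def __init__(self, k=16):
        super().__init__()
        self.A = nn.Parameter(torch.eye(3))    # learnable linear part
        self.b = nn.Parameter(torch.zeros(3))  # learnable translation
        self.k = k

    def forward(self, xyz):
        # xyz: (B, N, 3) input point coordinates.
        warped = xyz @ self.A.T + self.b                     # transformed cloud
        dist = torch.cdist(warped, warped)                   # (B, N, N) distances
        knn_idx = dist.topk(self.k, largest=False).indices   # layer-specific neighborhoods
        return warped, knn_idx
```

Because each layer carries its own `A` and `b`, neighborhoods can differ across layers and adapt to layer dynamics, which is the core point of the abstract.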
- [OVS] Quantifying Meibomian Gland Morphology Using Artificial Intelligence. Jiayun Wang, Shixuan Li, Thao N. Yeh, Rudrasis Chakraborty, Andrew D. Graham, Stella Yu, and 1 more author. Optometry and Vision Science, 2021.
Quantifying meibomian gland morphology from meibography images is used for the diagnosis, treatment, and management of meibomian gland dysfunction in clinics. Morphological abnormality of the glands is a common clinical sign of meibomian gland dysfunction, yet no automated methods exist that provide standard quantifications of morphological features for individual glands. This study introduces a novel, automated artificial intelligence approach to segmenting individual meibomian gland regions in infrared meibography images and analyzing their morphological features.
2020
- [CVPR] Orthogonal Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
Further performance improvements in deep convolutional neural networks are hindered by training instability and feature redundancy. A promising solution is to impose orthogonality on convolutional filters. We develop an efficient approach to imposing filter orthogonality on a convolutional layer based on the doubly block-Toeplitz matrix representation of the convolutional kernel, instead of the common kernel orthogonality approach, which we show is necessary but not sufficient for ensuring orthogonal convolutions. Our proposed orthogonal convolution requires no additional parameters and little computational overhead. It consistently outperforms the kernel orthogonality alternative on a wide range of tasks such as image classification and inpainting under supervised, semi-supervised, and unsupervised settings. Further, it learns more diverse and expressive features with better training stability, robustness, and generalization.
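A minimal sketch of the idea (adapted for illustration; padding and stride handling simplified relative to the paper): self-convolving the kernel should produce a delta response, i.e., identity across output channels at the center tap and zero elsewhere, which constrains the doubly block-Toeplitz convolution matrix rather than the flattened kernel alone:

```python
import torch
import torch.nn.functional as F

def orth_conv_penalty(kernel):
    """Illustrative orthogonal-convolution regularizer for stride-1 conv.

    kernel: (out_ch, in_ch, k, k). Correlating the kernel with itself should
    yield a delta: identity across output channels at the center position.
    """
    out_ch, _, k, _ = kernel.shape
    self_conv = F.conv2d(kernel, kernel, padding=k - 1)  # (out_ch, out_ch, 2k-1, 2k-1)
    target = torch.zeros_like(self_conv)
    c = k - 1  # center tap of the self-correlation map
    target[torch.arange(out_ch), torch.arange(out_ch), c, c] = 1.0
    return (self_conv - target).pow(2).sum()
```

Adding `lam * orth_conv_penalty(conv.weight)` to the task loss is one simple way to use such a term during training.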
2019
- [CVPR] Large-Scale Long-Tailed Recognition in an Open World. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
The paper was selected as an oral presentation (2.5%).
Real-world data often have a long-tailed and open-ended distribution. A practical recognition system must classify among majority and minority classes, generalize from a few known instances, and acknowledge novelty upon a never-seen instance. We define Open Long-Tailed Recognition (OLTR) as learning from such naturally distributed data and optimizing the classification accuracy over a balanced test set which includes head, tail, and open classes. OLTR must handle imbalanced classification, few-shot learning, and open-set recognition in one integrated algorithm, whereas existing classification approaches focus only on one aspect and deliver poorly over the entire class spectrum. The key challenges are how to share visual knowledge between head and tail classes and how to reduce confusion between tail and open classes. We develop an integrated OLTR algorithm that maps an image to a feature space such that visual concepts can easily relate to each other based on a learned metric that respects the closed-world classification while acknowledging the novelty of the open world.
- [PBVS @ CVPR] Sur-Real: Fréchet Mean and Distance Transform for Complex-Valued Deep Learning. Rudrasis Chakraborty, Jiayun Wang, and Stella Yu. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019.
The paper won the best paper award at the 2019 CVPR PBVS workshop.
We develop a novel deep learning architecture for naturally complex-valued data, which are often subject to complex scaling ambiguity. We treat each sample as a field in the space of complex numbers. With the polar form of a complex number, the general group that acts on this space is the product of planar rotations and non-zero scalings. This perspective allows us to develop not only a novel convolution operator using the weighted Fréchet mean (wFM) on a Riemannian manifold, but also a novel fully-connected layer operator using the distance to the wFM, with natural equivariance to non-zero scaling and planar rotations for the former and invariance for the latter. We demonstrate our method on the widely used MSTAR SAR dataset and the RadioML dataset.
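For reference, the weighted Fréchet mean underlying both operators is the standard manifold generalization of a weighted average: for points $z_1, \dots, z_n$ on a Riemannian manifold $\mathcal{M}$ with geodesic distance $d$ and convex weights $w_i$,

$$
\mathrm{wFM}(\{z_i\}, \{w_i\}) = \operatorname*{arg\,min}_{m \in \mathcal{M}} \sum_{i=1}^{n} w_i \, d^2(z_i, m), \qquad w_i \ge 0, \quad \sum_{i=1}^{n} w_i = 1.
$$

The convolution operator replaces the weighted sum of ordinary convolution with this wFM, and the fully-connected operator uses the distance to the wFM, which yields the equivariance and invariance properties noted above.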
- [Nature Portfolio] Insights and Approaches Using Deep Learning to Classify Wildlife. Zhongqi Miao, Kaitlyn M. Gaynor, Jiayun Wang, Ziwei Liu, Oliver Muellerklein, Mohammad Sadegh Norouzzadeh, and 5 more authors. Nature Scientific Reports, 2019.
The implementation of intelligent software to identify and classify objects and individuals in visual fields is a technology of growing importance to operatives in many fields, including wildlife conservation and management. This study applies advanced software to classify wildlife species using camera-trap data. We trained a convolutional neural network (CNN) to classify 20 African species with 87.5% accuracy from 111,467 images. Gradient-weighted class activation mapping (Grad-CAM) revealed key features, and mutual information methods identified neurons responding strongly to specific species, exposing dataset biases. Hierarchical clustering produced a visual similarity dendrogram, and we evaluated the model’s ability to recognize known and unfamiliar species from images outside the training set.
- [TVST] A Deep Learning Approach for Meibomian Gland Atrophy Evaluation in Meibography Images. Translational Vision Science & Technology, 2019.
Purpose: To develop a deep learning approach to digitally segmenting meibomian gland atrophy area and computing percent atrophy in meibography images. Methods: A total of 706 meibography images with corresponding meiboscores were collected and annotated for each one with eyelid and atrophy regions. The dataset was then divided into the development and evaluation sets. The development set was used to train and tune the deep learning model, while the evaluation set was used to evaluate the performance of the model. Conclusions: The proposed deep learning approach can automatically segment the total eyelid and meibomian gland atrophy regions, as well as compute percent atrophy with high accuracy and consistency. This provides quantitative information of the gland atrophy severity based on meibography images.
2018
- [PR] Deep Ranking Model by Large Adaptive Margin Learning for Person Re-Identification. Jiayun Wang, Sanping Zhou, Jinjun Wang, and Qiqi Hou. Pattern Recognition, 2018.
Person re-identification aims to match images of the same person across disjoint camera views, which is a challenging problem in video surveillance. The major challenge lies in how to preserve the similarity of the same person against large variations caused by complex backgrounds, mutual occlusions, and different illuminations, while discriminating between different individuals. In this paper, we present a novel deep ranking model with feature learning and fusion that learns a large adaptive margin between the intra-class distance and inter-class distance to solve the person re-identification problem. Specifically, we organize the training images into a batch of pairwise samples. Treating these pairwise samples as inputs, we build a novel part-based deep convolutional neural network (CNN) to learn layered feature representations while preserving a large adaptive margin. As a result, the final learned model can effectively find the match for an anchor image among a number of candidates in the gallery image set by learning discriminative and stable feature representations. Overcoming the weaknesses of conventional fixed-margin loss functions, our adaptive margin loss function is better suited to the dynamic feature space.
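As a toy rendering of the adaptive-margin idea (our own instantiation; the paper's formulation differs in detail), the margin can be made a function of the current distances rather than a constant:

```python
import torch

def adaptive_margin_loss(d_pos, d_neg, alpha=0.5, beta=1.0):
    """Illustrative ranking loss with a sample-adaptive margin: the required
    gap grows with the current intra-class distance (toy version only)."""
    margin = beta + alpha * d_pos.detach()       # margin adapts per pair
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```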
2017
- [CVPR] Point to Set Similarity Based Deep Feature Learning for Person Re-Identification. Sanping Zhou, Jinjun Wang, Jiayun Wang, Yihong Gong, and Nanning Zheng. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Person re-identification (Re-ID) remains a challenging problem due to significant appearance changes caused by variations in view angle, background clutter, illumination conditions, and mutual occlusion. To address these issues, conventional methods usually focus on proposing robust feature representations or learning metric transformations based on pairwise similarity, using a Fisher-type criterion. Recent deep learning based approaches address the two processes jointly and have achieved promising progress. One of the key issues for deep learning based person Re-ID is the selection of a proper similarity comparison criterion, and the performance of features learned with existing pairwise-similarity criteria is still limited, because mostly only Point-to-Point (P2P) distances are considered. In this paper, we present a novel person Re-ID method based on Point-to-Set similarity comparison. The Point-to-Set (P2S) metric can jointly minimize the intra-class distance and maximize the inter-class distance, while back-propagating the gradient to optimize parameters of the deep model.
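Schematically (our illustrative rendering, not the paper's exact loss), a point-to-set comparison aggregates distances from an anchor feature to whole sets of same-identity and different-identity features before applying a hinge:

```python
import torch

def p2s_loss(anchor, pos_set, neg_set, margin=1.0):
    """Illustrative point-to-set hinge loss (not the paper's exact form).

    anchor:  (D,) feature of one image.
    pos_set: (P, D) features of the same identity.
    neg_set: (Q, D) features of other identities.
    """
    d_pos = (pos_set - anchor).norm(dim=1).mean()  # point-to-set intra-class distance
    d_neg = (neg_set - anchor).norm(dim=1).mean()  # point-to-set inter-class distance
    # Jointly shrink intra-class and grow inter-class distance by a margin.
    return torch.clamp(d_pos - d_neg + margin, min=0.0)
```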
- [arXiv] Successive Embedding and Classification Loss for Aerial Image Classification. Jiayun Wang, Patrick Virtue, and Stella Yu. arXiv preprint arXiv:1712.01511, 2017.
Deep neural networks can be an effective means of automatically classifying aerial images, but they easily overfit the training data. It is critical for trained neural networks to be robust to the variations that exist between training and test environments. To address the overfitting problem in aerial image classification, we view the neural network as successive transformations of an input image into embedded feature representations and ultimately into a semantic class label, and we train the network to optimize image representations in the embedded space in addition to the final classification score. We demonstrate that networks trained with this dual embedding and classification loss outperform networks with a classification loss only. We also find that moving the embedding loss from the commonly used feature space to the classifier space, i.e., the space just before the softmax nonlinearity, leads to the best classification performance for aerial images. Visualizations of the network's embedded representations reveal that the embedding loss encourages greater separation between target class clusters for both the training and testing partitions of two aerial image classification benchmark datasets.
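Concretely, a minimal sketch of the dual objective (names and the specific embedding term are ours; the paper's loss may differ), with the embedding loss applied in the pre-softmax classifier space as the abstract suggests:

```python
import torch
import torch.nn.functional as F

def dual_loss(logits, labels, lam=0.1):
    """Classification loss plus an embedding loss in classifier space.

    logits: (B, C) pre-softmax outputs, treated as the embedded representation.
    On top of cross-entropy, pull same-class embeddings together (toy term).
    """
    ce = F.cross_entropy(logits, labels)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (B, B) same-class pairs
    same = same & ~torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    dist = torch.cdist(logits, logits)                 # pairwise distances
    embed = dist[same].mean() if same.any() else dist.sum() * 0.0
    return ce + lam * embed
```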