Posted by Weicheng Kuo and Anelia Angelova, Research Scientists, Google Research

Detection is a fundamental vision task that aims to localize and recognize objects in an image. However, the data collection process of manually annotating bounding boxes or instance masks is tedious and costly, which limits the modern detection vocabulary size to roughly 1,000 object classes. This is orders of magnitude smaller than the vocabulary people use to describe the visual world and leaves out many categories. Recent vision and language models (VLMs), such as CLIP, have demonstrated improved open-vocabulary visual recognition capabilities through learning from Internet-scale image-text pairs. These VLMs are applied to zero-shot classification using frozen model weights without the need for fine-tuning, which stands in stark contrast to the existing paradigms used for retraining or fine-tuning VLMs for open-vocabulary detection tasks.

Intuitively, to align the image content with the text description during training, VLMs may learn region-sensitive and discriminative features that are transferable to object detection. Surprisingly, features of a frozen VLM contain rich information that is both region sensitive for describing object shapes (second column below) and discriminative for region classification (third column below). In fact, feature grouping can nicely delineate object boundaries without any supervision. This motivates us to explore the use of frozen VLMs for open-vocabulary object detection with the goal to expand detection beyond the limited set of annotated categories.

We explore the potential of frozen vision and language features for open-vocabulary detection. The K-Means feature grouping reveals rich semantic and region-sensitive information where object boundaries are nicely delineated (column 2). The same frozen features can classify groundtruth (GT) regions well without fine-tuning (column 3).

In “F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models”, presented at ICLR 2023, we introduce a simple and scalable open-vocabulary detection approach built upon frozen VLMs. F-VLM reduces the training complexity of an open-vocabulary detector to below that of a standard detector, obviating the need for knowledge distillation, detection-tailored pre-training, or weakly supervised learning. We demonstrate that by preserving the knowledge of pre-trained VLMs completely, F-VLM maintains a similar philosophy to ViTDet and decouples detector-specific learning from the more task-agnostic vision knowledge in the detector backbone. We are also releasing the F-VLM code along with a demo on our project page.

Learning upon frozen vision and language models

We desire to retain the knowledge of pretrained VLMs as much as possible, with a view to minimizing the effort and cost needed to adapt them for open-vocabulary detection. We use a frozen VLM image encoder as the detector backbone and a text encoder for caching the detection text embeddings of the offline dataset vocabulary. We take this VLM backbone and attach a detector head, which predicts object regions for localization and outputs detection scores that indicate the probability of a detected box being of a certain category. The detection scores are the cosine similarity of region features (a set of bounding boxes that the detector head outputs) and category text embeddings. The category text embeddings are obtained by feeding the category names through the text model of the pretrained VLM (which has both image and text models).

The VLM image encoder consists of two parts: 1) a feature extractor and 2) a feature pooling layer. We adopt the feature extractor for detector head training, which is the only step we train (on standard detection data); this allows us to directly use the frozen weights, inheriting rich semantic knowledge (e.g., long-tailed categories like martini, fedora hat, pennant) from the VLM backbone. The detection losses include box regression and classification losses.

At training time, F-VLM is simply a detector with the last classification layer replaced by base-category text embeddings.

The ability to perform open-vocabulary recognition at region level (i.e., bounding box level as opposed to image level) is integral to F-VLM. Since the backbone features are frozen, they do not overfit to the training categories (e.g., donut, zebra) and can be directly cropped for region-level classification. F-VLM performs this open-vocabulary classification only at test time. To obtain the VLM features for a region, we apply the feature pooling layer on the cropped backbone output features. Because the pooling layer requires fixed-size inputs, e.g., 7x7 for the ResNet50 (R50) CLIP backbone, we crop and resize the region features with the ROI-Align layer (shown below).
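To make the region scoring concrete, here is a minimal sketch, not the released F-VLM code: it uses torchvision's ROI-Align to crop and resize frozen backbone features to a fixed size, substitutes simple average pooling for the VLM's own feature pooling layer, and scores each region against cached category text embeddings by cosine similarity. The function name `score_regions` and all shapes and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align


def score_regions(backbone_features, boxes, text_embeddings,
                  output_size=7, spatial_scale=1.0 / 32):
    """Cosine-similarity detection scores for candidate boxes.

    backbone_features: [N, C, H, W] frozen VLM feature map.
    boxes: list of N tensors, each [K_i, 4] with (x1, y1, x2, y2) in image coordinates.
    text_embeddings: [num_categories, C] cached category text embeddings.
    Assumes the pooled feature dimension matches the text embedding dimension,
    as it would when the VLM's own pooling/projection layer is used.
    """
    # Crop and resize each region to a fixed size (e.g., 7x7 for an R50 CLIP backbone).
    regions = roi_align(backbone_features, boxes, output_size=output_size,
                        spatial_scale=spatial_scale, aligned=True)  # [K, C, 7, 7]

    # Stand-in for the VLM feature pooling layer: global average pooling.
    region_embeddings = regions.mean(dim=(2, 3))                    # [K, C]

    # Detection scores: cosine similarity between region and text embeddings.
    region_embeddings = F.normalize(region_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    return region_embeddings @ text_embeddings.t()                  # [K, num_categories]


if __name__ == "__main__":
    feats = torch.randn(1, 512, 32, 32)                 # frozen backbone output
    boxes = [torch.tensor([[0.0, 0.0, 224.0, 224.0],    # two candidate boxes
                           [64.0, 64.0, 448.0, 448.0]])]
    text_emb = torch.randn(1203, 512)                   # e.g., an LVIS-sized vocabulary
    print(score_regions(feats, boxes, text_emb).shape)  # torch.Size([2, 1203])
```

In the actual approach the VLM's own pooling layer is applied to the cropped features, which is what places the region embeddings in the same space as the text embeddings; the average pooling above merely stands in for it.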