How An AI Understands Scenes: Panoptic Scene Graph Generation

Written by ainur | Published 2023/10/11
Tech Story Tags: deep-learning | computer-vision | ai | scene-understanding | panoptic-segmentation | neural-networks | image-analysis | transformers


A holy grail of computer vision is the complete understanding of visual scenes. Progress in perceptual tasks began with identifying what's in an image, for example, whether there's a man or not.

Next, if there's a man, we want to know where he is in the image. AI can draw a bounding box around him or label every pixel according to whether it belongs to the man or not. These tasks are called image classification, object detection, and image segmentation, respectively.
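
To make the distinction concrete, here's a minimal Python sketch of how these three kinds of output might be represented. All labels, coordinates, and sizes below are made up for illustration.

```python
import numpy as np

# Hypothetical outputs for one image, illustrating the three tasks.

# Image classification: a single label for the whole image.
classification = "man"

# Object detection: a label plus a bounding box (x_min, y_min, x_max, y_max).
detection = {"label": "man", "box": (120, 40, 260, 390)}

# Segmentation: a per-pixel mask, True where a pixel belongs to the man.
segmentation_mask = np.zeros((480, 640), dtype=bool)
segmentation_mask[40:390, 120:260] = True  # toy mask roughly matching the box
```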

What else can we say about the image? Was the man talking to a friend or running with a ball? To answer this question, AI needs to understand the interactions and relationships between objects in an image. This task is called Scene Graph Generation (SGG).

Let's look at how relationships between objects can be described. A relationship connects two objects, and it can be an action (jumping over), a spatial relation (is behind), a verb (wear), a preposition (with), a comparative (taller than), or a prepositional phrase (drive on). Combining the objects and relationships, we get a directed graph representation of the scene (figure below).

Objects can be represented by bounding boxes from object detection or by masks from panoptic segmentation. Panoptic segmentation has advantages: it covers the full scene, including the background, and its masks are more accurate than coarse bounding boxes. So Panoptic Scene Graph Generation (PSG) gives us a deeper scene understanding.
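
As a rough illustration, a scene graph can be stored as a plain list of (subject, relationship, object) triplets and turned into a directed adjacency structure. The objects and relationships below are toy examples, not taken from any dataset.

```python
# A toy scene graph: nodes are objects (here just labels, in PSG they'd
# carry panoptic masks), edges are (subject, relationship, object) triplets.
objects = ["man", "ball", "grass", "friend"]

triplets = [
    ("man", "running with", "ball"),
    ("man", "talking to", "friend"),
    ("man", "standing on", "grass"),
]

# Build a directed adjacency structure from the triplets.
graph = {}
for subj, rel, obj in triplets:
    graph.setdefault(subj, []).append((rel, obj))

print(graph["man"])
# [('running with', 'ball'), ('talking to', 'friend'), ('standing on', 'grass')]
```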

Understanding scenes would enable important applications such as image search, question answering, and robotic interactions. Although it seems easy for us, there's a big perception gap between robotic systems and humans, and this technology would let machines take on more intellectually demanding work.

Now, let’s dive into how this AI looks under the hood.

We expect a set of triplets as the output of our neural network. Each triplet consists of two objects and their relationship. Such an output raises a couple of challenges for the network architecture. For example, should we predict all triplets simultaneously or one by one? Can we solve the task in two stages, first detecting objects and then predicting pairwise relationships between them? Or is it better to use a one-stage, end-to-end neural net?

There are multiple ways to approach this task. Let's look at a promising one that uses a Transformer. The Transformer has established itself as a surprisingly versatile architecture, and it suits our task because its attention mechanism can capture relationships between sequence elements that are far apart.
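
For intuition, here is a minimal NumPy sketch of scaled dot-product self-attention, the core mechanism inside a Transformer: each element's output is a weighted sum over all elements, so even distant elements interact in a single step.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Every query attends to every key, so relationships between
    distant sequence elements are modeled in one step."""
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # weighted sum of values

# Toy example: 5 sequence elements with 8-dimensional features.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out = scaled_dot_product_attention(x, x, x)  # self-attention
print(out.shape)  # (5, 8)
```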

As shown in the figure above, the PSGTR model first extracts image features with a CNN backbone, then feeds those features, together with trainable queries and a positional encoding, into a transformer encoder-decoder. The queries are expected to learn representations of scene graph triplets: for each triplet query, the subject, relationship, and object predictions are extracted by three individual feed-forward networks (FFNs), and the segmentation masks for the two objects are produced by two panoptic heads.
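
Below is a heavily simplified PyTorch sketch of this idea, not the authors' implementation (see the OpenPSG repo for that). It skips the positional encoding and the panoptic mask heads, uses a single conv layer as a stand-in backbone, and picks illustrative hyperparameters, but it shows the flow: image features plus learned triplet queries go through a transformer, and three FFN heads read off subject, relationship, and object classes per query.

```python
import torch
import torch.nn as nn

class TinyPSGTR(nn.Module):
    """Simplified PSGTR-style sketch: CNN features + learned triplet
    queries -> transformer encoder-decoder -> per-triplet heads.
    Positional encoding and panoptic mask heads are omitted."""
    def __init__(self, num_queries=10, d_model=256,
                 num_obj_classes=80, num_rel_classes=50):
        super().__init__()
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # stand-in CNN
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))    # triplet queries
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2)
        # Three FFNs per triplet query: subject, relationship, object.
        self.subj_head = nn.Linear(d_model, num_obj_classes)
        self.rel_head = nn.Linear(d_model, num_rel_classes)
        self.obj_head = nn.Linear(d_model, num_obj_classes)

    def forward(self, images):                   # images: (B, 3, H, W)
        feats = self.backbone(images)            # (B, d, H/16, W/16)
        src = feats.flatten(2).permute(2, 0, 1)  # (HW, B, d) token sequence
        tgt = self.queries.unsqueeze(1).expand(-1, images.size(0), -1)  # (Q, B, d)
        hs = self.transformer(src, tgt)          # (Q, B, d) decoded triplet queries
        return self.subj_head(hs), self.rel_head(hs), self.obj_head(hs)

model = TinyPSGTR()
subj, rel, obj = model(torch.randn(2, 3, 64, 64))
print(subj.shape, rel.shape, obj.shape)  # each (10, 2, num_classes)
```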

To train the model, the authors created a dataset of object/relationship/object triplets. The model predicts all triplets at once and is trained end-to-end with a set-based loss that performs bipartite matching between predictions and ground truth.
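
The matching step is typically solved with the Hungarian algorithm. Here is a minimal sketch, assuming a precomputed cost matrix between predicted and ground-truth triplets (in practice the cost mixes classification and mask terms):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: cost[i, j] = how badly predicted triplet i
# matches ground-truth triplet j (random values for illustration).
rng = np.random.default_rng(0)
cost = rng.uniform(size=(10, 4))   # 10 predictions, 4 ground-truth triplets

# Hungarian matching finds the one-to-one assignment with minimal total
# cost; unmatched predictions are trained toward a "no triplet" class.
pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx, gt_idx)))                  # matched (pred, gt) pairs
print("matched cost:", cost[pred_idx, gt_idx].sum())
```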

This is one way an AI understands a scene. Check out the paper and code linked below for a deeper understanding.

Reference

  1. Jingkang Yang, Yi Zhe Ang, Zujin Guo, Kaiyang Zhou, Wayne Zhang, Ziwei Liu. Panoptic Scene Graph Generation. ECCV 2022.
  2. Code: https://github.com/Jingkang50/OpenPSG
  3. Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Li Fei-Fei. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. IJCV 2017.

