#APaperADay 1. Neural Best-Buddies: Sparse Cross-Domain Correspondence
This article is a high-level summary of a paper I read as part of a 30-day challenge to read papers in the field of Machine Learning. This is day 1.
The paper reports work by a group of researchers interested in how computers can be taught to take two images of different objects, like a lion and an eagle, and create a hybrid image of the two.
A lot of progress has been made in teaching computers how to differentiate between any two objects without knowing any specifics about them. Initially, this was done by comparing colors and pixel intensities. More recently, it has been done by a technique similar to looking at the image through different levels of magnification. These magnification levels are called layers.
The first layer has the highest level of magnification, and the magnification decreases as we add layers. At each layer, the computer learns to detect different features in an image, such as straight lines, curves, and edges, and it also learns how important each detected feature is. This layered approach to understanding objects is called a convolutional neural network, or CNN.
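If you are curious what these layers look like in code, here is a minimal sketch using PyTorch and a pretrained VGG19 (the paper builds on a network of this kind; the specific layers picked below are my own illustrative choice, not the authors' configuration):

```python
import torch
from torchvision import models

# A CNN pre-trained on ImageNet; its stack of layers gives us
# the different "magnification levels" described above.
vgg = models.vgg19(pretrained=True).features.eval()

def extract_layer_features(image, layer_indices=(3, 8, 17, 26, 35)):
    """Run an image through the CNN, keeping the activations
    at a few chosen layers. `image` is a (1, 3, H, W) tensor
    normalized the way ImageNet models expect."""
    features = []
    x = image
    with torch.no_grad():
        for i, module in enumerate(vgg):
            x = module(x)
            if i in layer_indices:  # keep this "magnification level"
                features.append(x)
    return features
```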
Recently, two techniques have been in the news with respect to detecting features in images that contain humans. The first is called human pose estimation, while the second is called landmark detection.
With human pose estimation, the computer learns to look at an image and tell you what the human is doing; for example, raising one finger or touching a finger to their nose.
With landmark detection, the computer learns the location of important features on a human face; for example, the location of the eyes, nose, and mouth, and whether they are open or closed.
These two are important because you can compare two images of humans and ask where the eyes are in the first image and where they are in the second image. Consequently, you can manipulate the images to show how the first can become the second. All of this is correspondence. But it is correspondence in one domain!
When you try to find similar features between, say, a man and his favorite pet, that becomes a cross-domain problem. When you are only interested in a small number of features, rather than all possible features, it becomes a sparse problem.
Recall our earlier CNN and its layers. To find corresponding features, like where an eye is on this lion and where an eye is on this eagle, the authors of the paper extracted features at every layer of the CNN they were using.
At each layer, for every feature in one image, they asked, “which feature in the other image is most similar to this one?” When two features each picked the other as their closest match, they were formed into a pair. These pairs are called best buddies; because they are found using a neural network, the paper calls them neural best buddies.
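To make that pairing rule concrete, here is a minimal sketch of mutual nearest neighbors in plain NumPy. It illustrates the pairing idea only; the actual method compares CNN activations within local search regions, and the cosine similarity used here is my own assumption:

```python
import numpy as np

def best_buddies(feats_a, feats_b):
    """Return index pairs (i, j) where feature i in image A and
    feature j in image B are each other's nearest neighbor."""
    # Cosine similarity between every feature in A and every feature in B.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T

    nn_of_a = sim.argmax(axis=1)  # best match in B for each feature in A
    nn_of_b = sim.argmax(axis=0)  # best match in A for each feature in B

    # Keep only mutual matches: i picks j AND j picks i.
    return [(i, j) for i, j in enumerate(nn_of_a) if nn_of_b[j] == i]
```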
They then ranked these pairs by how important the CNN said the features were.
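As a rough sketch of what that ranking might look like: the paper uses an activation-based significance score, and the feature-magnitude proxy below is my own simplification of it.

```python
import numpy as np

def rank_pairs(pairs, feats_a, feats_b):
    """Sort buddy pairs so those with the strongest CNN responses
    (approximated here by feature magnitude) come first."""
    def strength(pair):
        i, j = pair
        return np.linalg.norm(feats_a[i]) + np.linalg.norm(feats_b[j])
    return sorted(pairs, key=strength, reverse=True)
```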
This is, of course, a non-technical simplification of what they did. I have chosen to leave out technical terms.
The whole point of this research was to teach computers to take two dissimilar images and either merge them or convert one into the other.
If you are interested in reading the original paper, you can find it here.