DistillNet: a novel framework for color constancy

The project is in progress and more content is coming soon.

Description

Color constancy in the human visual system (HVS) is an essential prerequisite for many vision tasks: it compensates for the effect of the illuminant color on the perceived colors of objects by dynamically adjusting the activities of the cone photoreceptors in the retina. Some computer vision applications, e.g., image retrieval, image classification, object recognition, and image segmentation, are designed to extract comprehensive semantic information from the intrinsic colors of objects, and therefore require the input images to be device-independent and color-unbiased.

Compared with the visual recognition and understanding tasks performed by the HVS, color constancy has several unique properties:

High priority - As stated above, color constancy is a prerequisite for other, more complex vision tasks. It is generally suggested that retinal ganglion cells (RGCs) with center-surround receptive fields are the basis of the color processing mechanisms behind human color constancy: they respond to the activations of different cone photoreceptors in a color-opponent fashion at the very first stage of the HVS. Therefore, in the context of computer vision, it is more reasonable to perform computational color constancy before exploring further semantic information.

Condensation and superficiality - The useful clues for color constancy are highly condensed: if the image contains neutral or highlight/specular regions, it suffices to pool these regions in the spatial domain; if it contains shading regions, it suffices to extract them in the gradient domain. Instead of learning very deep representations, as many visual recognition tasks must, color constancy can be resolved at a much shallower level (see the first sketch after this list).

Spatial structure insensitivity - In typical CNNs trained for image classification and recognition tasks, large numbers of 3×3 convolution kernels and max-pooling layers are stacked to extract deep features from the images. For computational color constancy, however, it is the intensity relationship among pixels, rather than their structural relationship, that provides the most prominent clues for illuminant estimation. It is therefore reasonable to replace some of the 3×3 convolution kernels with 1×1 ones, which makes the network more efficient and compact (see the second sketch after this list).
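
To make the condensation point concrete, the first sketch below shows two classic statistics-based estimates that need only shallow operations: a White-Patch-style estimate that pools the brightest (near-neutral/specular) responses in the spatial domain, and a Grey-Edge-style estimate that pools gradient magnitudes. This is illustrative background only, not part of DistillNet; the percentile and Minkowski norm values are arbitrary choices.

```python
import numpy as np

def white_patch_estimate(img, percentile=97.5):
    """Spatial-domain pooling: treat the brightest (near-specular/neutral)
    responses per channel as reflecting the illuminant color.
    img: HxWx3 linear RGB array. Returns a unit-norm illuminant estimate."""
    flat = img.reshape(-1, 3)
    illum = np.percentile(flat, percentile, axis=0)  # robust per-channel max
    return illum / np.linalg.norm(illum)

def grey_edge_estimate(img, p=6):
    """Gradient-domain pooling: a Minkowski norm over per-channel gradient
    magnitudes (shading/edge regions) yields another illuminant estimate."""
    gy, gx = np.gradient(img.astype(np.float64), axis=(0, 1))
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)                      # HxWx3
    illum = (grad_mag.reshape(-1, 3) ** p).mean(axis=0) ** (1.0 / p)
    return illum / np.linalg.norm(illum)
```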
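
The second sketch illustrates the spatial structure insensitivity point: a 1×1 convolution mixes channel intensities at each pixel without looking at its spatial neighbors, using roughly one ninth of the parameters of a 3×3 kernel with the same channel counts. The layer sizes below are generic placeholders, not the actual DistillNet configuration.

```python
import torch
import torch.nn as nn

channels_in, channels_out = 64, 64

# A standard 3x3 conv aggregates a spatial neighborhood around each pixel.
conv3x3 = nn.Conv2d(channels_in, channels_out, kernel_size=3, padding=1)

# A 1x1 conv only mixes channel intensities at each pixel location,
# which is sufficient when intensity (not structure) carries the signal.
conv1x1 = nn.Conv2d(channels_in, channels_out, kernel_size=1)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(conv3x3))  # 64*64*3*3 + 64 = 36928
print(n_params(conv1x1))  # 64*64*1*1 + 64 = 4160

x = torch.randn(1, channels_in, 32, 32)
assert conv3x3(x).shape == conv1x1(x).shape == (1, channels_out, 32, 32)
```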

In this paper we propose a novel CNN-based framework for computational color constancy. To efficiently exploit the intensity relationships in different levels of feature maps, we design a "distillation module" that distills useful information from the feature maps and suppresses less useful responses through a minimum-thresholding mechanism; it is highly efficient and effective for color-related applications while remaining parameter-economic. To reduce the redundancy of conventional deep chained CNN architectures in intensity-oriented tasks, we remove all the fully connected (FC) layers that follow the conv/pool layers; instead, inspired by the neutral-pixel-statistics-based methods, we perform spatial maximum operations on the outputs of every module in a dichotomous network and concatenate these channel-wise descriptors into a 1-D vector, from which the illuminant color is estimated by a fully connected regression.
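
A rough sketch of how these ingredients could fit together is given below. The module internals (channel widths, the exact form of the minimum-thresholding, the number of modules) are placeholders chosen for illustration and should not be read as the precise DistillNet design; only the overall flow described above is followed: per-module feature maps, a spatial maximum per module, concatenation into a 1-D channel-wise descriptor, and a fully connected regression of the illuminant color.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationModule(nn.Module):
    """Hypothetical sketch: a 1x1-conv block whose responses below a
    learnable per-channel minimum threshold are suppressed."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.threshold = nn.Parameter(torch.zeros(out_ch))  # learnable minimum

    def forward(self, x):
        y = F.relu(self.conv(x))
        t = self.threshold.view(1, -1, 1, 1)
        return torch.where(y >= t, y, torch.zeros_like(y))  # suppress weak responses

class ToyIlluminantNet(nn.Module):
    """Spatial max over every module output, concatenated into a 1-D
    channel-wise descriptor, then FC regression of the RGB illuminant."""
    def __init__(self, widths=(3, 32, 64, 128)):
        super().__init__()
        self.blocks = nn.ModuleList(
            DistillationModule(i, o) for i, o in zip(widths[:-1], widths[1:]))
        self.fc = nn.Linear(sum(widths[1:]), 3)

    def forward(self, x):
        descriptors = []
        for block in self.blocks:
            x = block(x)
            descriptors.append(x.amax(dim=(2, 3)))  # spatial maximum -> (N, C)
        v = torch.cat(descriptors, dim=1)           # concatenated 1-D descriptor
        return F.normalize(self.fc(v), dim=1)       # unit-norm illuminant estimate

# est = ToyIlluminantNet()(torch.rand(4, 3, 64, 64))  # -> tensor of shape (4, 3)
```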

Figures

Part 1: NUS 8 Camera Dataset:

Left: raw images (dark current removed). Right: white-balanced images. The squares in the left images are the extracted subimages fed into the CNN; the scores inside them denote the corresponding prediction variances (zoom in if necessary). All images have been compressed for loading speed.




Part 2: ZJU Passport Dataset:

Left: raw images (dark current removed). Right: white-balanced images. The squares in the left images are the extracted subimages fed into the CNN; the scores inside them denote the corresponding prediction variances (zoom in if necessary). All images have been compressed for loading speed.




Part 3: Gehler-Shi Dataset:

Coming soon.

Results

All the scores in the following tables are angular errors (in degrees) between the predicted illuminant colors and the ground-truth colors. References are coming soon.
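
For reference, the angular error is the angle between the predicted and ground-truth illuminant vectors in RGB space; a minimal NumPy implementation is shown here.

```python
import numpy as np

def angular_error(pred, gt):
    """Angle in degrees between a predicted and a ground-truth illuminant,
    both given as RGB vectors (scale-invariant by construction)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    cos = np.dot(pred, gt) / (np.linalg.norm(pred) * np.linalg.norm(gt))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Example: a perfectly neutral prediction vs. a slightly warm ground truth.
# print(angular_error([1, 1, 1], [1.1, 1.0, 0.9]))  # ~4.7 degrees
```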

Part 1: Gehler-Shi Dataset:

Algorithm Mean Median Trimean Worst 25%
Support Vector Regression 8.08 6.73 7.19 14.89
Grey-world 6.36 6.28 6.28 10.58
Pixels-based Gamut 4.20 2.33 2.91 10.72
Local Surface Reflectance Statistics 3.45 2.51 2.70 7.32
Exemplar-based 2.89 2.27 2.42 5.97
Bayesian 4.82 3.46 3.88 10.49
CCC 1.95 1.22 1.38 4.76
FC4 1.77 1.11 1.29 4.29
Fast Fourier 1.61 0.86 1.02 4.27
DistillNet 1.54 1.00 1.17 3.27

Part 2: NUS 8 Camera Dataset:

Algorithm Mean Median Trimean Worst 25%
Grey-world 4.59 3.46 3.81 9.85
Pixels-based Gamut 5.27 4.26 4.45 11.16
Local Surface Reflectance Statistics 3.45 2.51 2.70 7.32
Bayesian 3.50 2.36 2.57 8.02
CCC 2.38 1.48 1.69 5.85
FC4 2.12 1.53 1.67 4.78
Fast Fourier 1.99 1.31 1.43 4.75
DistillNet 1.21 0.94 1.10 3.62

Part 3: ZJU Passport Dataset:

Coming soon.


Feel free to contact jqx1991(at)gmail(dot)com with any suggestions/corrections/comments.