# Abstract

This study proposes PILOT, a CNN-based Pixel Intensity driven iLluminant cOlor esTimation framework. The framework consists of a local illuminant estimation module and an illuminant uncertainty prediction module, trained with a 3-phase approach. The network combines a distillation building block at the microarchitecture level with a bifurcated organization at the macroarchitecture level, which gives it strong representational capacity for color-relevant vision tasks. It achieves a >20% relative improvement over prior algorithms and state-of-the-art illuminant estimation accuracy on benchmark datasets. The framework is also computationally efficient and parameter-economic, making it suitable for deployment on mobile platforms. Its interpretability further allows PILOT to guide the design of statistics-based models for low-end devices with tight power and compute budgets.

# Intro

Color constancy in the human visual system (HVS) is an essential prerequisite for many vision tasks: it compensates for the effect of the illuminant color on the perceived color of objects by adjusting the activity of the cone photoreceptors in the retina. Some computer vision applications, e.g., image retrieval, fine-grained classification, object recognition, and image segmentation, extract information from the intrinsic colors of objects and therefore require input images that are device-independent and color-unbiased.

Compared with visual recognition tasks in the HVS, color constancy has several unique properties:

High priority - As stated above, color constancy is a prerequisite for other, more complex vision tasks. It is generally suggested that the retinal ganglion cells (RGCs), which respond to the activations of cone photoreceptors in a color-opponent fashion at the very first stage of the HVS, are the basis of human color constancy. It therefore makes sense to perform computational color constancy before extracting further semantic information.

Condensation and superficiality - The useful clues for color constancy are highly condensed: if the image contains neutral or specular regions, all the algorithm needs to do is pool these regions in the spatial domain; if it contains shading regions, the clues can be extracted from these regions in the gradient domain. Unlike many visual recognition tasks, which try to learn very deep representations, color constancy can be solved at a much more superficial level.
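The two kinds of clue described above can be illustrated with a minimal sketch of two classical statistics-based estimators. These are stand-ins for illustration only, not PILOT itself: `pool_brightest` averages the brightest pixels (a white-patch-style pooling that assumes they are neutral or specular), and `pool_gradients` averages absolute pixel differences (a gray-edge-style estimate for shading regions). The function names, the `fraction` parameter, and the toy image are all hypothetical.

```python
def normalize(rgb):
    """Scale an RGB triple so its channels sum to 1 (a chromaticity)."""
    total = sum(rgb)
    if total == 0:
        raise ValueError("cannot normalize an all-zero color")
    return tuple(c / total for c in rgb)


def pool_brightest(image, fraction=0.05):
    """Spatial-domain clue: average the brightest pixels,
    which are assumed to be neutral or specular."""
    pixels = sorted((px for row in image for px in row), key=sum, reverse=True)
    keep = max(1, int(len(pixels) * fraction))
    top = pixels[:keep]
    return normalize(tuple(sum(px[c] for px in top) / keep for c in range(3)))


def pool_gradients(image):
    """Gradient-domain clue: sum the absolute horizontal and vertical
    differences per channel, which is informative in shading regions."""
    sums = [0.0, 0.0, 0.0]
    height, width = len(image), len(image[0])
    for y in range(height):
        for x in range(width):
            for c in range(3):
                if x + 1 < width:
                    sums[c] += abs(image[y][x + 1][c] - image[y][x][c])
                if y + 1 < height:
                    sums[c] += abs(image[y + 1][x][c] - image[y][x][c])
    return normalize(tuple(sums))


# Toy scene: gray surfaces of varying brightness under one colored illuminant.
illuminant = (0.8, 1.0, 0.6)
gray_levels = [[0.2, 0.4, 0.9], [0.3, 0.5, 1.0]]
image = [[tuple(g * c for c in illuminant) for g in row] for row in gray_levels]

print(pool_brightest(image))  # chromaticity estimated from the brightest pixel
print(pool_gradients(image))  # chromaticity estimated from the shading gradients
```

On this idealized scene both estimators recover the chromaticity of the illuminant, which is the sense in which the clues are "condensed": no deep representation is needed when such regions exist.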

Spatial structure insensitivity - Within typical CNNs trained for image classification and recognition tasks, CONV layers are stacked to extract deep features from the images. For computational color constancy, however, it is the relationships among pixel intensities, rather than structural features, that provide the prominent clues for illuminant estimation. In this context, it is reasonable to replace some of the convolution kernels that have large receptive fields with $1\times1$ ones, which makes the network more efficient and compact.
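To make the role of the $1\times1$ kernels concrete, here is a small sketch showing that a $1\times1$ convolution is the same linear map applied to every pixel independently, so shuffling the pixel positions only permutes the outputs and leaves any pooled statistic unchanged. The weights, bias, and pixel values are made-up illustration values, not parameters of PILOT.

```python
import math
import random


def conv1x1(pixels, weights, bias):
    """Apply a 1x1 convolution: one linear map per pixel, no spatial context."""
    return [
        tuple(sum(w * v for w, v in zip(row, px)) + b
              for row, b in zip(weights, bias))
        for px in pixels
    ]


def global_average(features):
    """Average each output channel over all pixels."""
    n = len(features)
    return tuple(sum(f[k] for f in features) / n
                 for k in range(len(features[0])))


# Hypothetical layer: 3 input channels (RGB) -> 2 output channels.
weights = [(0.5, -0.25, 0.1), (0.2, 0.3, -0.4)]
bias = (0.01, -0.02)

pixels = [(0.2, 0.4, 0.1), (0.9, 0.8, 0.7), (0.3, 0.3, 0.3), (0.6, 0.1, 0.5)]
shuffled = pixels[:]
random.Random(0).shuffle(shuffled)  # destroy the spatial arrangement

out_original = conv1x1(pixels, weights, bias)
out_shuffled = conv1x1(shuffled, weights, bias)

# The same set of per-pixel responses is produced, only in a different order.
assert sorted(out_original) == sorted(out_shuffled)

# Pooled features agree up to floating-point rounding.
pooled_a = global_average(out_original)
pooled_b = global_average(out_shuffled)
assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(pooled_a, pooled_b))
```

A kernel with a larger receptive field would not have this property, since its output depends on which pixels are neighbors.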

In this study, a Pixel Intensity driven iLluminant cOlor esTimation framework, PILOT, is proposed, which consists of a patch-based local illuminant estimation module and an uncertainty prediction module. The former is trained to predict the local illuminant colors for a set of image patches randomly sampled from the full-resolution image, and the latter is trained to predict the uncertainties of the local estimates produced by the first module. In the aggregation stage, the global illuminant color is determined by calculating the median of the local estimates with the highest confidences (lowest uncertainties).
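The aggregation stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the share of patches to keep (`keep_fraction`), the use of a per-channel median, and the final normalization are all assumptions, and the patch values are invented.

```python
from statistics import median


def aggregate_illuminant(estimates, uncertainties, keep_fraction=0.5):
    """Combine local illuminant estimates into one global estimate.

    estimates     : list of (r, g, b) local illuminant estimates, one per patch
    uncertainties : list of predicted uncertainties, one per patch
    keep_fraction : hypothetical share of most-confident patches to keep
    """
    if not estimates or len(estimates) != len(uncertainties):
        raise ValueError("need one uncertainty per local estimate")

    # Rank patches from most confident (lowest uncertainty) to least.
    ranked = sorted(zip(uncertainties, estimates), key=lambda pair: pair[0])
    keep = max(1, int(len(ranked) * keep_fraction))
    confident = [estimate for _, estimate in ranked[:keep]]

    # Per-channel median of the retained estimates (an assumption).
    rgb = [median(est[c] for est in confident) for c in range(3)]
    total = sum(rgb)
    return tuple(v / total for v in rgb)


# Four hypothetical patches; the last one is an outlier with high uncertainty.
local_estimates = [
    (0.30, 0.45, 0.25),
    (0.32, 0.44, 0.24),
    (0.31, 0.46, 0.23),
    (0.60, 0.30, 0.10),
]
local_uncertainties = [0.02, 0.05, 0.03, 0.90]

global_estimate = aggregate_illuminant(
    local_estimates, local_uncertainties, keep_fraction=0.75
)
print(global_estimate)
```

With `keep_fraction=0.75` the outlier patch is discarded before the median is taken, so the result stays close to the three consistent estimates.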

# Figures

## Part 1: NUS 8 Camera Dataset:

Left: raw images (dark current removed). Right: white-balanced images. The squares in the left images mark the extracted subimages fed into the CNN; the score in each square is the corresponding prediction variance (zoom in if necessary). All images have been compressed for loading speed.

## Part 2: ZJU Passport Dataset:

Left: raw images (dark current removed). Right: white-balanced images. The squares in the left images mark the extracted subimages fed into the CNN; the score in each square is the corresponding prediction variance (zoom in if necessary). All images have been compressed for loading speed.

# Results

All scores in the following tables are angular errors between the predicted illuminant colors and the ground-truth colors.
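For reference, the angular error and the four summary statistics used as column headers can be computed as in the sketch below. It assumes the errors are reported in degrees (customary for this metric), uses Tukey's trimean, and takes "Worst 25%" to be the mean of the largest quarter of the errors. The quartile interpolation method is also an assumption, and the function names are hypothetical.

```python
import math
from statistics import mean, median, quantiles


def angular_error_deg(predicted, ground_truth):
    """Angle in degrees between two illuminant color vectors."""
    dot = sum(p * g for p, g in zip(predicted, ground_truth))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_g = math.sqrt(sum(g * g for g in ground_truth))
    cosine = dot / (norm_p * norm_g)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cosine))


def summarize(errors):
    """Mean, median, trimean and worst-25% mean of a list of angular errors."""
    ordered = sorted(errors)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    worst_count = max(1, math.ceil(0.25 * len(ordered)))
    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "trimean": (q1 + 2 * q2 + q3) / 4,  # Tukey's trimean
        "worst_25": mean(ordered[-worst_count:]),
    }


# Toy example with made-up errors, not values from the tables.
stats = summarize([1, 2, 3, 4, 5, 6, 7, 8])
print(stats)
```

The error depends only on the direction of the two vectors, so scaling either illuminant by a positive constant does not change it.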

## Part 1: Gehler-Shi Dataset:

| Algorithm | Mean | Median | Trimean | Worst 25% |
| --- | --- | --- | --- | --- |
| Support Vector Regression | 8.08 | 6.73 | 7.19 | 14.89 |
| Grey-world | 6.36 | 6.28 | 6.28 | 10.58 |
| Pixels-based Gamut | 4.20 | 2.33 | 2.91 | 10.72 |
| Local Surface Reflectance Statistics | 3.45 | 2.51 | 2.70 | 7.32 |
| Exemplar-based | 2.89 | 2.27 | 2.42 | 5.97 |
| Bayesian | 4.82 | 3.46 | 3.88 | 10.49 |
| CCC | 1.95 | 1.22 | 1.38 | 4.76 |
| FC4 | 1.65 | 1.18 | 1.27 | 3.78 |
| Fast Fourier | 1.61 | 0.86 | 1.02 | 4.27 |
| PILOT, Module 1 only | 1.27 | 0.84 | 0.92 | 3.16 |
| PILOT, Module 1+2 | 1.23 | 0.78 | 0.91 | 2.83 |

## Part 2: NUS 8 Camera Dataset:

| Algorithm | Mean | Median | Trimean | Worst 25% |
| --- | --- | --- | --- | --- |
| Grey-world | 4.59 | 3.46 | 3.81 | 9.85 |
| Pixels-based Gamut | 5.27 | 4.26 | 4.45 | 11.16 |
| Local Surface Reflectance Statistics | 3.45 | 2.51 | 2.70 | 7.32 |
| Bayesian | 3.50 | 2.36 | 2.57 | 8.02 |
| CCC | 2.38 | 1.48 | 1.69 | 5.85 |
| FC4 | 2.23 | 1.57 | 1.72 | 5.15 |
| Fast Fourier | 1.99 | 1.31 | 1.43 | 4.75 |
| PILOT, Module 1 only | 1.37 | 0.95 | 1.12 | 3.72 |
| PILOT, Module 1+2 | 1.40 | 0.85 | 0.94 | 3.02 |

## Part 3: ZJU Passport Dataset:

| Algorithm | Mean | Median | Trimean | Worst 25% |
| --- | --- | --- | --- | --- |
| Pixels-based Gamut | 6.18 | 4.73 | 5.31 | 13.58 |
| Grey-world | 5.17 | 4.43 | 4.57 | 10.06 |
| White-patch | 5.71 | 4.39 | 4.89 | 12.45 |
| 2nd-order Gray-edge | 5.16 | 3.84 | 4.22 | 11.52 |
| 1st-order Gray-Edge | 5.02 | 3.84 | 4.21 | 11.02 |
| Bayesian | 5.10 | 3.99 | 4.17 | 11.34 |
| Local Surface Reflectance Statistics | 2.85 | 2.07 | 2.45 | 6.28 |
| FC4 | 1.95 | 1.47 | 1.59 | 4.21 |
| PILOT, Module 1 only | 1.38 | 0.92 | 1.05 | 3.70 |
| PILOT, Module 1+2 | 1.38 | 0.84 | 0.98 | 3.25 |
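As a quick check of the relative improvement stated in the abstract, the sketch below compares the mean angular error of PILOT (Module 1+2) with the lowest mean error among the other algorithms listed for each dataset. Reading "relative improvement" as the reduction in mean error against the best listed prior method is an assumption; the numbers are copied from the three tables above.

```python
# Mean angular errors from the tables: best listed prior method vs. PILOT, Module 1+2.
mean_errors = {
    "Gehler-Shi": {"best_prior": ("Fast Fourier", 1.61), "pilot": 1.23},
    "NUS 8 Camera": {"best_prior": ("Fast Fourier", 1.99), "pilot": 1.40},
    "ZJU Passport": {"best_prior": ("FC4", 1.95), "pilot": 1.38},
}


def relative_improvement(baseline, proposed):
    """Fraction by which the proposed error is lower than the baseline error."""
    return (baseline - proposed) / baseline


improvements = {
    dataset: relative_improvement(row["best_prior"][1], row["pilot"])
    for dataset, row in mean_errors.items()
}

for dataset, value in improvements.items():
    prior_name = mean_errors[dataset]["best_prior"][0]
    print(f"{dataset}: {value:.1%} lower mean error than {prior_name}")
```

Under this reading the reduction in mean error is about 23.6% on Gehler-Shi, 29.6% on NUS 8 Camera, and 29.2% on ZJU Passport.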

Feel free to contact jqx1991(at)gmail(dot)com with any suggestions/corrections/comments.