A Year in Computer Vision

王佳亮

Introduction

Computer Vision typically refers to the scientific discipline of giving machines the ability of sight, or perhaps more colourfully, enabling machines to visually analyse their environments and the stimuli within them. This process typically involves the evaluation of an image, images or video. The British Machine Vision Association (BMVA) defines Computer Vision as “the automatic extraction, analysis and understanding of useful information from a single image or a sequence of images.”

The term understanding provides an interesting counterpoint to an otherwise mechanical definition of vision, one which serves to demonstrate both the significance and complexity of the Computer Vision field. True understanding of our environment is not achieved through visual representations alone. Rather, visual cues travel through the optic nerve to the primary visual cortex and are interpreted by the brain, in a highly stylised sense. The interpretations drawn from this sensory information encompass the near-totality of our natural programming and subjective experiences, i.e. how evolution has wired us to survive and what we learn about the world throughout our lives.

In this respect, vision only relates to the transmission of images for interpretation, while computing said images is more analogous to thought or cognition, drawing on a multitude of the brain’s faculties. Hence, many believe that Computer Vision, a true understanding of visual environments and their contexts, paves the way for future iterations of Strong Artificial Intelligence, due to its cross-domain mastery.

However, put down the pitchforks, as we’re still very much in the embryonic stages of this fascinating field. This piece simply aims to shed some light on 2016’s biggest Computer Vision advancements, and hopefully to ground some of these advancements in a healthy mix of expected near-term societal interactions and, where applicable, tongue-in-cheek prognostications of the end of life as we know it.

While our work is always written to be as accessible as possible, sections within this particular piece may be oblique at times due to the subject matter. We do provide rudimentary definitions throughout; however, these only convey a facile understanding of key concepts. In keeping our focus on work produced in 2016, omissions are often made in the interest of brevity.

One such glaring omission relates to the functionality of Convolutional Neural Networks (hereafter CNNs or ConvNets), which are ubiquitous within the field of Computer Vision.

The success of AlexNet in 2012, a CNN architecture which blindsided ImageNet competitors, proved the instigator of a de facto revolution within the field, with numerous researchers adopting neural network-based approaches as part of Computer Vision’s new period of ‘normal science’.

Over four years later, CNN variants still make up the bulk of new neural network architectures for vision tasks, with researchers reconstructing them like Lego bricks; a working testament to the power of both open source information and Deep Learning. However, an explanation of CNNs could easily span several postings and is best left to those with a deeper expertise on the subject and an affinity for making the complex understandable.

For casual readers who wish to gain a quick grounding before proceeding we recommend the first two resources below. For those who wish to go further still, we have ordered the resources below to facilitate that:

● What a Deep Neural Network thinks about your #selfie from Andrej Karpathy is one of our favourites for helping people understand the applications and functionalities behind CNNs.

● Quora: “what is a convolutional neural network?” - Has no shortage of great links and explanations. Particularly suited to those with no prior understanding.

● CS231n: Convolutional Neural Networks for Visual Recognition from Stanford University is an excellent resource for more depth.

● Deep Learning (Goodfellow, Bengio & Courville, 2016) provides detailed explanations of CNN features and functionality in Chapter 9. The textbook has been kindly made available for free in HTML format by the authors.

For those wishing to understand more about Neural Networks and Deep Learning in general we suggest:

● Neural Networks and Deep Learning (Nielsen, 2017) is a free online textbook which provides the reader with a really intuitive understanding of the complexities of Neural Networks and Deep Learning. Even just completing chapter one should greatly illuminate the subject matter of this piece for first-timers.

As a whole this piece is disjointed and spasmodic, a reflection of the authors’ excitement and the spirit in which it was intended to be utilised, section by section. Information is partitioned using our own heuristics and judgements, a necessary compromise due to the cross-domain influence of much of the work presented.

We hope that readers benefit from our aggregation of the information here to further their own knowledge, regardless of previous experience.

Part One: Classification/Localisation, Object Detection, Object Tracking

Classification/Localisation

The task of classification, when it relates to images, generally refers to assigning a label to the whole image, e.g. ‘cat’. Assuming this, Localisation may then refer to finding where the object is in said image, usually denoted by the output of some form of bounding box around the object. Current classification/localisation techniques on ImageNet9 have likely surpassed an ensemble of trained humans.10 For this reason, we place greater emphasis on subsequent sections of the blog.

However, the introduction of larger datasets with an increased number of classes11 will likely provide new metrics for progress in the near future. On that point, François Chollet, the creator of Keras,12 has applied new techniques, including the popular architecture Xception, to an internal Google dataset with over 350 million multi-label images containing 17,000 classes.

Interesting takeaways from the ImageNet LSVRC (2016):

● Scene Classification refers to the task of labelling an image with a certain scene class like ‘greenhouse’, ‘stadium’, ‘cathedral’, etc. ImageNet held a Scene Classification challenge last year with a subset of the Places215 dataset: 8 million images for training with 365 scene categories. Hikvision16 won with a 9% top-5 error with an ensemble of deep Inception-style networks, and not-so-deep residual networks.

● Trimps-Soushen won the ImageNet Classification task with 2.99% top-5 classification error and 7.71% localisation error. The team employed an ensemble for classification (averaging the results of Inception, Inception-Resnet, ResNet and Wide Residual Networks models17) and Faster R-CNN for localisation based on the labels.18 The dataset was distributed across 1000 image classes with 1.2 million images provided as training data. The partitioned test data comprised a further 100 thousand unseen images.

● ResNeXt by Facebook came a close second in top-5 classification error with 3.03% by using a new architecture that extends the original ResNet architecture.
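The top-5 error figures and the ensemble averaging mentioned in the takeaways above can be made concrete with a small sketch. This is only an illustration under stated assumptions: the function names `top_k_error` and `ensemble_average` are our own, the score arrays are toy data rather than real model outputs, and the challenge's official evaluation code is not reproduced here.

```python
import numpy as np


def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is NOT among the k highest scores.

    scores: (n_samples, n_classes) array of class scores or probabilities.
    labels: (n_samples,) array of integer class indices.
    """
    # Indices of the k largest scores in each row (order within the k is irrelevant).
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()


def ensemble_average(score_list):
    """Average the per-class scores of several models (hypothetical stand-ins)."""
    return np.mean(np.stack(score_list, axis=0), axis=0)


# Toy data: 4 samples, 10 classes, two hypothetical models.
rng = np.random.default_rng(0)
labels = np.array([9, 5, 4, 0])
model_a = rng.random((4, 10))
model_b = rng.random((4, 10))

combined = ensemble_average([model_a, model_b])
print("top-5 error of the averaged scores:", top_k_error(combined, labels, k=5))
```

Averaging scores is only one of several ways to combine models; it is used here because it matches the "averaging the results" wording above.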

Object Detection

As one can imagine, the process of Object Detection does exactly that: it detects objects within images. The definition provided for object detection by the ILSVRC 2016 includes outputting bounding boxes and labels for individual objects.

This differs from the classification/localisation task by applying classification and localisation to many objects instead of just a single dominant object.

One of 2016’s major trends in Object Detection was the shift towards quicker, more efficient detection systems. This was visible in approaches like YOLO, SSD and R-FCN as a move towards sharing computation on a whole image, which differentiates them from the costly subnetworks associated with Fast/Faster R-CNN techniques.

This is typically referred to as ‘end-to-end training/learning’ and features throughout this piece. The rationale generally is to avoid having separate algorithms focus on their respective subproblems in isolation, as this typically increases training time and can lower network accuracy. That being said, this end-to-end adaptation of networks typically takes place after initial sub-network solutions and, as such, is a retrospective optimisation. However, Fast/Faster R-CNN techniques remain highly effective and are still used extensively for object detection.

● SSD: Single Shot MultiBox Detector utilises a single Neural Network which encapsulates all the necessary computation and eliminates the costly proposal generation of other methods. It achieves “75.1% mAP, outperforming a comparable state of the art Faster R-CNN model” (Liu et al. 2016).

● One of the most impressive systems we saw in 2016 was from the aptly named “YOLO9000: Better, Faster, Stronger”, which introduces the YOLOv2 and YOLO9000 detection systems.24 YOLOv2 vastly improves the initial YOLO model from mid-2015,25 and is able to achieve better results at very high FPS (up to 90 FPS on low resolution images using the original GTX Titan X). In addition to completion speed, the system outperforms Faster R-CNN with ResNet and SSD on certain object detection datasets.

● Feature Pyramid Networks for Object Detection27 comes from FAIR28 and capitalises on the “inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost”, meaning that representations remain powerful without compromising speed or memory. Lin et al. (2016) achieve state-of-the-art (hereafter SOTA) single-model results on COCO29.

● R-FCN: Object Detection via Region-based Fully Convolutional Networks: This is another method that avoids applying a costly per-region subnetwork hundreds of times over an image by making the region-based detector fully convolutional and sharing computation on the whole image. “Our result is achieved at a test-time speed of 170ms per image, 2.5-20x faster than the Faster R-CNN counterpart” (Dai et al., 2016).

Huang et al. (2016) present a paper which provides an in-depth performance comparison between R-FCN, SSD and Faster R-CNN. Due to the issues around accurate comparison of Machine Learning (ML) techniques, we’d like to point to the merits of producing a standardised approach here. They view these architectures as ‘meta-architectures’ since they can be combined with different kinds of feature extractors such as ResNet or Inception.
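For readers new to the detection output described earlier in this section (a bounding box plus a label per object), the sketch below shows one plausible way to represent a detection and to score its overlap with a ground-truth box using intersection-over-union (IoU). The `Detection` class, the `(x_min, y_min, x_max, y_max)` box layout and the 0.5 threshold are assumptions made for illustration; 0.5 is a commonly used convention, but each benchmark defines its own matching rules.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


@dataclass
class Detection:
    """One detected object: a class label, a confidence score and a box."""
    label: str
    score: float
    box: Box


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(det: Detection, gt_label: str, gt_box: Box, threshold: float = 0.5) -> bool:
    """A detection counts as correct if the label agrees and the boxes overlap enough."""
    return det.label == gt_label and iou(det.box, gt_box) >= threshold


det = Detection(label="cat", score=0.92, box=(10.0, 10.0, 50.0, 50.0))
print(is_match(det, "cat", (12.0, 8.0, 52.0, 48.0)))
```

Metrics such as mAP are built on top of matches like this one, aggregated over confidence thresholds and classes.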

Part Two: Segmentation, Super-res/Colourisation/Style Transfer, Action Recognition

Segmentation

Central to Computer Vision is the process of Segmentation, which divides whole images into pixel groupings which can then be labelled and classified. Moreover, Semantic Segmentation goes further by trying to semantically understand the role of each pixel in the image e.g. is it a cat, car or some other type of class? Instance Segmentation takes this even further by segmenting different instances of classes e.g. labelling three different dogs with three different colours. It is one of a barrage of Computer Vision applications currently employed in autonomous driving technology suites.
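The difference between semantic and instance segmentation described above can be shown with two small label maps. In this sketch the semantic map stores a class id per pixel, and the instance map gives each separate object of that class its own id. Deriving instances with a connected-components pass is our simplification: it only works when objects of the same class do not touch, and it is not how the learned instance segmentation models of 2016 operate. The function name and the toy array are hypothetical.

```python
from collections import deque

import numpy as np


def instances_from_semantic(semantic, target_class):
    """Give each 4-connected region of target_class its own id (1, 2, 3, ...)."""
    height, width = semantic.shape
    instance = np.zeros((height, width), dtype=int)
    next_id = 0
    for r in range(height):
        for c in range(width):
            if semantic[r, c] != target_class or instance[r, c] != 0:
                continue
            # Found an unlabelled pixel of the class: flood-fill its region.
            next_id += 1
            instance[r, c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and semantic[ny, nx] == target_class
                            and instance[ny, nx] == 0):
                        instance[ny, nx] = next_id
                        queue.append((ny, nx))
    return instance, next_id


DOG = 1  # class id 0 is background in this toy example
semantic = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
])
instance, count = instances_from_semantic(semantic, DOG)
print(count)
print(instance)
```

The semantic map says only "these pixels are dog"; the instance map additionally says which dog each pixel belongs to.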

Super-resolution, Style Transfer & Colourisation

Not all research in Computer Vision serves to extend the pseudo-cognitive abilities of machines, and often the fabled malleability of neural networks, as well as other ML techniques, lends itself to a variety of other novel applications that spill into the public space.

Amortised MAP Inference for Image Super-resolution proposes a method for calculating Maximum a Posteriori (MAP) inference using a Convolutional Neural Network. Their research presents three approaches for optimisation, of which the GAN-based approach performs markedly better on real image data at present.

Undoubtedly, Style Transfer epitomises a novel use of neural networks that has spilled into the public domain, specifically through last year’s Facebook integrations and companies like Prisma74 and Artomatix75. Style transfer is an older technique, but it was converted to neural networks in 2015 with the publication of A Neural Algorithm of Artistic Style.76 Since then, the concept of style transfer has been expanded upon by Nikulin and Novak and also applied to video,78 as is the common progression within Computer Vision.

Finally, Iizuka, Simo-Serra and Ishikawa85 demonstrate a colourisation model also based upon CNNs. The work outperformed the existing SOTA, and we [the team] feel that this work is also qualitatively the best, appearing to be the most realistic. Figure 10 provides comparisons; however, the image is taken from Iizuka et al.

“... Furthermore, our architecture can process images of any resolution, unlike most existing approaches based on CNN.”

In a test to see how natural their colourisation was, users were given a random image from their models and were asked, "does this image look natural to you?"

Action Recognition

The task of action recognition refers to both the classification of an action within a given video frame and, more recently, algorithms which can predict the likely outcomes of interactions given only a few frames before the action takes place. In this respect we see recent research attempt to embed context into algorithmic decisions, similar to other areas of Computer Vision.

Part Three: Toward a 3D understanding of the world

“A key goal of Computer Vision is to recover the underlying 3D structure from 2D observations of the world.” - Rezende et al. (2016, p. 1)

In Computer Vision, the classification of scenes, objects and activities, along with the output of bounding boxes and image segmentation, is, as we have seen, the focus of much new research.

3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction98 - creates a reconstruction of an object ‘in the form of a 3D occupancy grid using single or multiple images of object instance from arbitrary viewpoints.’ Mappings from images of objects to 3D shapes are learned using primarily synthetic data, and the network can train and test without requiring ‘any image annotations or object class labels’. The network comprises a 2D-CNN, a 3D Convolutional LSTM (an architecture newly created for the purpose) and a 3D Deconvolutional Neural Network. How these different components interact and are trained together end-to-end is a perfect illustration of the layering possible with Neural Networks.

3D Shape Induction from 2D Views of Multiple Objects100 uses “Projective Generative Adversarial Networks” (PrGANs), which train a deep generative model allowing accurate representation of 3D shapes, with the discriminator only being shown 2D images. The projection module captures the 3D representations and converts them to 2D images before passing to the discriminator. Through iterative training cycles the generator improves projections by improving the 3D voxel shapes it generates.
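Both papers above work with voxel representations, and the second relies on turning a 3D shape into 2D images. The sketch below builds a small occupancy grid and projects it to binary silhouettes along two axes. It is a minimal illustration only: the max-along-an-axis projection and the name `project_silhouette` are our assumptions, and the differentiable projection module actually used in PrGANs is not described in this piece and is not reproduced here.

```python
import numpy as np


def project_silhouette(voxels, axis=2):
    """Orthographic projection of an occupancy grid to a 2D silhouette.

    A pixel is 1 if any voxel along the viewing ray (the chosen axis) is occupied.
    """
    return voxels.max(axis=axis)


# A 4x4x4 occupancy grid indexed as (x, y, z); 1 = occupied, 0 = empty.
grid = np.zeros((4, 4, 4), dtype=np.uint8)
grid[1:3, 1:3, 0:2] = 1  # a 2x2x2 cube sitting at the front of the volume

front_view = project_silhouette(grid, axis=2)  # looking along z, shape (x, y)
side_view = project_silhouette(grid, axis=0)   # looking along x, shape (y, z)

print(front_view)
print(side_view)
```

A discriminator that only ever sees images like `front_view` and `side_view` can still push a generator towards plausible 3D grids, which is the intuition behind training on 2D views.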

Reconstruction

Fusion4D: Real-time Performance Capture of Challenging Scenes veers towards the domain of Computer Graphics; however, the interplay between Computer Vision and Graphics cannot be overstated. The authors’ approach uses RGB-D and Segmentation as inputs to form a real-time, multi-view reconstruction which is output using voxels.

Real-Time 3D Reconstruction and 6-DoF Tracking with an Event Camera won best paper at the European Conference on Computer Vision (ECCV) in 2016. The authors propose a novel algorithm capable of tracking 6D motion and various reconstructions in real-time using a single Event Camera.

Other uncategorised 3D

IM2CAD120 describes the process of transferring an ‘image to CAD model’, CAD meaning computer-aided design, which is a prominent method used to create 3D scenes for architectural depictions, engineering, product design and many other fields.

“Given a single photo of a room and a large database of furniture CAD models, our goal is to reconstruct a scene that is as similar as possible to the scene depicted in the photograph, and composed of objects drawn from the database.”

The authors present an automatic system which ‘iteratively optimizes object placements and scales’ to best match input from real images. The rendered scenes are validated against the original images using metrics trained using deep CNNs.

In summation

In summation, we believe that SLAM is not likely to be completely replaced by Deep Learning. However, it is entirely likely that the two approaches may become complements to each other going forward.

Part Four: ConvNet Architectures, Datasets, Ungroupable Extras

ConvNet Architectures

● Densely Connected Convolutional Networks133 or “DenseNets” take direct inspiration from the identity/skip connections of ResNets. The approach extends this concept to ConvNets by having each layer connect to every other layer in a feed forward fashion, sharing feature maps from previous layers as inputs, thus creating DenseNets.

“DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters”.
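The dense connectivity pattern described above, where each layer receives the feature maps of all preceding layers as input, can be sketched in a few lines. This is a toy forward pass under stated assumptions: the function name `dense_block`, the `growth_rate` of 2 and the random per-pixel linear maps are illustrative stand-ins for the trained convolutional layers in the paper, which are not reproduced here.

```python
import numpy as np


def dense_block(x, num_layers, growth_rate, rng):
    """Toy dense block on a (channels, height, width) feature map.

    Each layer takes the concatenation of ALL earlier feature maps as input
    and contributes growth_rate new channels.
    """
    features = [x]
    for _ in range(num_layers):
        stacked = np.concatenate(features, axis=0)  # (channels so far, H, W)
        # Stand-in for a learned layer: random per-pixel linear map plus ReLU.
        weights = rng.standard_normal((growth_rate, stacked.shape[0])) * 0.1
        new_maps = np.maximum(0.0, np.einsum("oc,chw->ohw", weights, stacked))
        features.append(new_maps)
    return np.concatenate(features, axis=0)


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))  # 3 input channels, 8x8 spatial size
out = dense_block(x, num_layers=4, growth_rate=2, rng=rng)
print(out.shape)  # 3 input channels + 4 layers * 2 new channels each
```

Because every layer adds only a few channels and reuses everything before it, the input is still present unchanged in the output, which is one way to picture the "feature reuse" mentioned in the quote.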

Moving towards equivariance in ConvNets

“To improve the statistical efficiency of machine learning methods, many have sought to learn invariant representations. In deep learning, however, intermediate layers should not be fully invariant, because the relative pose of local features must be preserved for further layers. Thus, one is led to the idea of equivariance: a network is equivariant if the representations it produces transform in a predictable linear manner under transformations of the input. In other words, equivariant networks produce representations that are steerable. Steerability makes it possible to apply filters not just in every position (as in a standard convolution layer), but in every pose, thus allowing for increased parameter sharing.”
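The notion of equivariance in the quote above can be checked numerically for the simplest case, translation. In the sketch below, shifting an image and then filtering it gives the same result as filtering first and then shifting, whereas the same check with a 90 degree rotation fails for a generic filter. The wrap-around boundary handling and the helper name `circular_correlate` are assumptions chosen to keep the identity exact; this is not code from the quoted paper.

```python
import numpy as np


def circular_correlate(image, kernel):
    """2D cross-correlation with wrap-around (circular) boundaries."""
    out = np.zeros_like(image, dtype=float)
    k_height, k_width = kernel.shape
    for dy in range(k_height):
        for dx in range(k_width):
            # np.roll by (-dy, -dx) places image[y + dy, x + dx] at position (y, x).
            out += kernel[dy, dx] * np.roll(image, shift=(-dy, -dx), axis=(0, 1))
    return out


rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6))
kernel = rng.standard_normal((3, 3))
shift = (2, 1)

# Translation: the order of "shift" and "filter" does not matter.
shift_then_filter = circular_correlate(np.roll(image, shift, axis=(0, 1)), kernel)
filter_then_shift = np.roll(circular_correlate(image, kernel), shift, axis=(0, 1))
translation_equivariant = np.allclose(shift_then_filter, filter_then_shift)

# Rotation: a standard filter does not commute with rotating the input.
rotate_then_filter = circular_correlate(np.rot90(image), kernel)
filter_then_rotate = np.rot90(circular_correlate(image, kernel))
rotation_equivariant = np.allclose(rotate_then_filter, filter_then_rotate)

print(translation_equivariant, rotation_equivariant)
```

The failed rotation check is the gap that equivariant and steerable architectures aim to close by applying filters "in every pose" rather than only in every position.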

Datasets

In 2016, traditional datasets such as ImageNet169, Common Objects in Context (COCO), the CIFARs171 and MNIST172 were joined by a host of new entries. We also noted the rise of synthetic datasets spurred on by progress in graphics. Synthetic datasets are an interesting work-around of the large data requirements for Artificial Neural Networks (ANNs).

● CMPlaces is a cross-modal scene dataset from MIT. The task is to recognize scenes across many different modalities beyond natural images and in the process hopefully transfer that knowledge across modalities too. Some of the modalities are: Real, Clip Art, Sketches, Spatial Text (words written which correspond to spatial locations of objects) and natural language descriptions. The paper also discusses methods for how to deal with this type of problem with cross-modal convolutional neural networks.

Omissions based on forthcoming publications

There is also considerable, and increasing, overlap between Computer Vision techniques and other domains in Machine Learning and Artificial Intelligence. These other domains and hybrid use cases are the subject of The M Tank’s forthcoming publications and, as with the whole of this piece, we partitioned content based on our own heuristics.

In the final section, we’ll offer some concluding remarks and a recapitulation of some of the trends we identified. We would hope that we were comprehensive enough to show a bird’s-eye view of where the Computer Vision field is loosely situated and where it is headed in the near-term. We also would like to draw particular attention to the fact that our work does not cover January-April 2017. The blistering pace of research output means that much of this work could be outdated already; we encourage readers to go and find out whether it is for themselves. But this rapid pace of growth also brings with it lucrative opportunities, as the Computer Vision hardware and software markets are expected to reach $48.6 Billion by 2022.

Conclusion

In conclusion we’d like to highlight some of the trends and recurring themes that cropped up repeatedly throughout our research review process. First and foremost, we’d like to draw attention to the Machine Learning research community’s voracious pursuit of optimisation. This is most notable in the year on year changes in accuracy rates, but especially in the intra-year changes in accuracy.

We’d like to underscore this point and return to it in a moment. Error rates are not the only fanatically optimised parameter, with researchers working on improving speed, efficiency and even the algorithm’s ability to generalise to other tasks and problems in completely new ways. We are acutely aware of the research coming to the fore with approaches like one-shot learning, generative modelling, transfer learning and, as of recently, evolutionary learning, and we feel that these research principles are gradually exerting greater influence on the approaches of the best performing work.

While this last point is unequivocally meant in commendation of, rather than denigration of, this trend, one can’t help but cast one’s mind toward the (very) distant spectre of Artificial General Intelligence, whether it merits a thought or not. Far from being alarmist, we just wish to highlight to both experts and laypersons that this concern arises from here, from the startling progress that’s already evident in Computer Vision and other AI subfields. Properly articulated concerns from the public can only come through education about these advancements and their impacts in general. This may then in turn quell the power of media sentiment and misinformation in AI.

We chose to focus on a one year timeline for two reasons. The first relates to the sheer volume of work being produced. Even for people who follow the field very closely, it is becoming increasingly difficult to remain abreast of research as the number of publications grows exponentially. The second brings us back to our point on intra-year changes.

In taking a single year snapshot of progress, the reader can begin to comprehend the pace of research at present. We see improvement after improvement in such short time spans, but why? Researchers have cultivated a global community where building on previous approaches (architectures, meta-architectures, techniques, ideas, tips, wacky hacks, results, etc.), and infrastructures (libraries like Keras, TensorFlow and PyTorch, GPUs, etc.), is not only encouraged but also celebrated. It is a predominantly open source community with few parallels, one which is continuously attracting new researchers and having its techniques reappropriated by fields like economics, physics and countless others.

It’s important to understand, for those who have yet to notice, that among the already frantic chorus of divergent voices proclaiming divine insight into the true nature of this technology, there is at least agreement; agreement that this technology will alter the world in new and exciting ways. However, much disagreement still comes over the timeline on which these alterations will unfold.

Until such a time as we can accurately model the progress of these developments we will continue to provide information to the best of our abilities. With this resource we hoped to cater to the spectrum of AI experience, from researchers playing catch-up to anyone who simply wishes to obtain a grounding in Computer Vision and Artificial Intelligence. With this our project hopes to have added some value to the open source revolution that quietly hums beneath the technology of a lifetime.
