Advances in Computer Vision-Based Civil Infrastructure Inspection and Monitoring

1. Introduction

Much of the critical infrastructure that serves society today, including bridges, dams, highways, lifeline systems, and buildings, was erected several decades ago and is well past its design life. For example, in the United States, according to the 2017 Infrastructure Report Card published by the American Society of Civil Engineers, there are over 56 000 structurally deficient bridges, requiring a massive 123 billion USD for rehabilitation [1]. The economic implications of repair necessitate systematic prioritization achieved through a careful understanding of the current state of infrastructure.

Civil infrastructure condition assessment is performed by leveraging the information obtained by inspection and/or monitoring processes. Traditional techniques to assess the condition of civil infrastructure typically involve visual inspection by trained inspectors, combined with relevant decision-making criteria (e.g., ATC-20 [2], national bridge inspection standards [3]). However, such inspection can be time-consuming, laborious, expensive, and/or dangerous (Fig. 1) [4]. Monitoring can be used to obtain quantitative understanding of the current state of the structure through measurement of physical quantities, such as accelerations, strains, and/or displacements; such approaches offer the ability to continuously observe structural integrity in real time, with the goal of enhanced safety and reliability, and reduced maintenance and inspection costs [5], [6], [7], [8]. While these methods have been shown to produce reliable data, they typically have limited spatial resolution or require installation of dense sensor arrays. Another issue is that once installed, access to the sensors is often limited, making regular system maintenance challenging. If only occasional monitoring is required, the installation of contact sensors is difficult and time consuming. To address some of these problems, improved inspection and monitoring approaches with less intervention from humans, lower cost, and higher spatial resolution must be developed and tested to advance and realize the full benefits of automated civil infrastructure condition assessment.

Computer vision techniques have been recognized in the civil engineering field as a key component of improved inspection and monitoring. Images and videos are two major modes of data analyzed by computer vision techniques. Images capture visual information similar to that obtained by human inspectors. Because of this similarity, computer-based structural inspection analogous to visual inspection by human inspectors is anticipated. In addition, images can encode information from the entire field of view in a non-contact manner, potentially addressing the challenges of monitoring using contact sensors. Videos are a sequence of images in which the extra dimension of time provides important information for both inspection and monitoring applications, ranging from the assimilation of context when images are collected from multiple views, to the dynamic response of the structure when high sampling rates are used. A significant amount of research in the civil engineering community has focused on developing and adapting computer vision techniques for inspection and monitoring tasks. Moreover, such vision-based approaches, used in conjunction with cameras and unmanned aerial vehicles (UAVs), offer the potential for rapid and automated inspection and monitoring for civil infrastructure condition assessment.

The present paper reviews recent research on vision-based condition assessment of civil infrastructure. To put the research described in this paper in its proper technical perspective, a brief history of computer vision research is first provided in Section 2. Section 3 then reviews in detail several recent efforts dealing with inspection applications of computer vision techniques for civil infrastructure assessment. Section 4 focuses on monitoring applications, and Section 5 outlines challenges for the realization of automated structural inspection and monitoring. Section 6 discusses ongoing work by the authors toward the goal of automated inspections. Finally, Section 7 provides conclusions.

2. A brief history of computer vision research

Computer vision is an interdisciplinary scientific field concerned with the automatic extraction of useful information from image data in order to understand or represent the underlying physical world, either qualitatively or quantitatively. Computer vision methods can be used to automate tasks of the human visual cortex. Initial efforts toward applying computer vision methods began in the 1960s, and sought to extract shape information about objects using edges and primitive shapes (e.g., boxes) [9]. Computer vision methods began to consider more complex perception problems with the development of different representations of image patterns. Optical character recognition (OCR) was of major interest, as characters and digits in any font needed to be recognized for the purpose of increased automation in the United States Postal Service [10], license plate recognition [11], and so forth. Facial recognition, where the input image is evaluated in feature space obtained by applying hand-crafted or learned filters with the aim of detecting patterns representing human faces, has also been an active area of research [12], [13]. Other object detection problems, such as pedestrian detection and car detection, have begun to show significant improvement in recent years (e.g., Ref. [14]), motivated by increased demand for surveillance and traffic monitoring. Computer vision techniques have also been used in sports broadcasting for applications such as ball tracking and virtual replays [15].

Recent advances in computer vision techniques have largely been fueled through end-to-end learning using artificial neural networks (ANNs) and convolutional neural networks (CNNs). In ANNs and CNNs, a complex input–output relation of data is approximated by a parametrized nonlinear function defined using units called nodes [16]. The output of each ANN node is computed as follows:

$y_{n} = \sigma_{n} (w_{n}^{T} x_{n} + b_{n})$ (1)

where $x_{n}$ is the vector of inputs to node $n$; $y_{n}$ is the scalar output from the node; $w_{n}$ is the vector of weight parameters; and $b_{n}$ is the scalar bias parameter. $\sigma_{n}$ is a nonlinear activation function, such as the sigmoid function or the rectifier (rectified linear unit, or ReLU [17]). Similarly, for CNNs, each node applies convolution, followed by a nonlinear activation function:

$y_{n} = \sigma_{n} (W_{n} * x_{n} + b_{n})$ (2)

where $*$ denotes convolution and $W_{n}$ is the convolution kernel. The final layer of CNNs is typically a fully connected layer (FCL), which has dense connections to the output, similar to the layers of an ANN. The CNN is particularly effective for image and video data, because recognition using CNNs is robust to translation while requiring only a limited number of parameters. By increasing the number of nodes connected with each other, an arbitrarily complex parametrization of the input–output relation can be realized (e.g., multilayer perceptron with many hidden layers and/or many nodes in each layer, deep convolutional neural networks (DCNNs), etc.). The parameters of the ANNs/CNNs are optimized using a collection of input and output data (training data) (e.g., Refs. [18], [19]).
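As an illustration of Eqs. (1) and (2), the following minimal sketch evaluates a single ANN node and a single convolutional node in plain Python. The function names, the toy image, and the difference kernel are assumptions made for this example only. As in most deep learning libraries, the kernel is slid over the image without flipping (cross-correlation), and only the "valid" output region is computed.

```python
import math


def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def ann_node(x, w, b, activation=relu):
    """Eq. (1): y = sigma(w^T x + b) for one node."""
    return activation(sum(wi * xi for wi, xi in zip(w, x)) + b)


def conv_node(image, kernel, b, activation=relu):
    """Eq. (2): y = sigma(W * x + b), evaluated at every valid position.

    Assumption: the kernel is applied without flipping (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(kernel[a][c] * image[i + a][j + c]
                    for a in range(kh) for c in range(kw))
            row.append(activation(s + b))
        out.append(row)
    return out


# One ANN node: w^T x + b = 0.5 - 2.0 + 0.25 = -1.25, which ReLU clips to 0.
y = ann_node([1.0, 2.0], [0.5, -1.0], 0.25)

# One convolutional node with a horizontal difference kernel (hypothetical).
kernel = [[-1, 1]]
image = [[0, 0, 1, 1]] * 3          # dark-to-bright edge between columns 1 and 2
shifted = [[0, 1, 1, 1]] * 3        # same edge, moved one pixel to the left
fmap = conv_node(image, kernel, 0.0)
fmap_shifted = conv_node(shifted, kernel, 0.0)
```

In this toy case the same two kernel weights respond to the edge wherever it appears, and the response simply moves with the edge. This is the weight sharing that keeps the parameter count small and underlies the translation behavior mentioned above.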

These algorithms have achieved remarkable success in building perception systems for highly complex visual problems. CNNs have achieved more than 99.5% accuracy on the Modified National Institute of Standards and Technology (MNIST) handwritten digit classification problem (Fig. 2(a)) [20]. Moreover, state-of-the-art CNN architectures have achieved less than 5% top-five error (the fraction of samples for which the true class is not among the five highest classification scores) [21] on the 1000-class ImageNet classification problem (Fig. 2(b)) [22].
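The top-five error metric can be made concrete with a short sketch. The function name `top_k_error` and the toy scores are assumptions for illustration; with four classes, the example uses k = 2 in place of k = 5.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is not among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Toy classifier output: 3 samples, 4 classes (values are made up).
scores = [
    [0.10, 0.60, 0.20, 0.10],   # true class 1: ranked first
    [0.50, 0.10, 0.30, 0.10],   # true class 2: ranked second
    [0.25, 0.25, 0.10, 0.40],   # true class 2: ranked last
]
labels = [1, 2, 2]

top1 = top_k_error(scores, labels, k=1)   # two of three samples missed
top2 = top_k_error(scores, labels, k=2)   # one of three samples missed
```

The second sample counts as an error under the top-one criterion but not under the top-two criterion, which is why top-k error is never larger than top-one error.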

Use of CNNs is not limited to image classification (i.e., inferring a single label per image). A DCNN applies multiple nonlinear filters and computes a map of filter responses ( $F$ in Fig. 3, which is called a “feature map”). Instead of using all filter responses simultaneously to get a per-image class (upper flow in Fig. 3), filter responses at each location in the map can be used separately to extract information about both object categories and their locations. Using the feature map, semantic segmentation algorithms assign an appropriate label to each pixel of the image [23], [24], [25], [26]. Object detection algorithms use the feature map to detect and localize objects of interest, typically by drawing their bounding boxes [27], [28], [29], [30], [31]. Instance-level segmentation algorithms [32], [33], [34] further process the feature map to differentiate each instance of an object (e.g., assigning a separate label for each person in an image, instead of assigning the same label to all people in the input image). While dealing with video data, additional temporal information can also be used to conduct spatiotemporal analysis [35], [36], [37] for segmentation.
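The difference between the two uses of the feature map can be sketched as follows. This is a simplified illustration, not the architecture of any cited work: it assumes global average pooling followed by a linear classifier for the per-image label, and the same linear classifier applied at each location for the per-pixel labels. The feature values, weights, and function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def classify_image(feature_map, class_weights):
    """One label per image: pool all locations, then score each class."""
    cells = [f for row in feature_map for f in row]
    n_channels = len(cells[0])
    pooled = [sum(f[c] for f in cells) / len(cells) for c in range(n_channels)]
    return argmax([dot(w, pooled) for w in class_weights])


def segment_image(feature_map, class_weights):
    """One label per location: score each class from the local features only."""
    return [[argmax([dot(w, f) for w in class_weights]) for f in row]
            for row in feature_map]


# Toy 2 x 3 feature map with two channels per location:
# channel 0 responds to intact surface, channel 1 to crack-like texture.
F = [
    [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9]],
    [[0.9, 0.0], [0.7, 0.1], [0.1, 0.8]],
]
class_weights = [[1.0, 0.0],   # class 0: intact
                 [0.0, 1.0]]   # class 1: crack

image_label = classify_image(F, class_weights)
pixel_labels = segment_image(F, class_weights)
```

Here the pooled features are dominated by the intact surface, so the image-level label is class 0, while the location-wise labels still mark the right-hand column as class 1. This is the sense in which the feature map retains information about both categories and locations.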

The Achilles heel of supervised learning techniques is the need for high-quality labeled data (i.e., images in which the objects are already identified) that are used for training purposes. While many software applications have been created to help ease the labeling process (e.g., Refs. [38], [39]), manual labeling still remains a very cumbersome task. Weakly supervised training has been proposed to perform object detection and localization tasks without pixel-level or object-level labeling of images [40], [41]; here, a CNN is trained on image-wise labels to get the object category and approximate location in the image.

Unsupervised learning techniques further reduce the need for labeled data by identifying the underlying probabilistic structure in the observed data. For example, clustering algorithms (e.g., k-means algorithm [11]) assume that the data (e.g., image patches) are generated by multiple sources (e.g., different material types), and allocate each data sample to one of the sources based on the maximum likelihood (ML) framework. DeGol et al. [42], for instance, used the k-means algorithm to perform material recognition for imaged surfaces. More complex probabilistic structures can be extracted by fitting parametrized probabilistic models to the observed data (e.g., Gaussian mixture model (GMM) [16], Boltzmann machines [43], [44]). In the image processing context, CNN-based architectures for unsupervised learning have been actively investigated, such as auto-encoders [45], [46], [47] and generative adversarial networks (GANs) [48], [49], [50]. These methods can automatically learn a compact representation of the input image and/or an image recovery/generation process from compact representations without manually labeled ground truth. A thorough and concise review of different supervised and unsupervised learning algorithms can be found in Ref. [51].
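A minimal k-means sketch is given below to make the clustering idea concrete. It is a generic implementation, not the pipeline of Ref. [42]. The two-dimensional patch features and the initial centers are made up for this example. Assigning each sample to the nearest center corresponds to a maximum likelihood assignment under the additional assumption that every source is an isotropic Gaussian with the same variance.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points]


def kmeans(points, initial_centers, max_iter=50):
    """Alternate nearest-center assignment and center update until stable."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)   # keep an empty cluster where it was
        if new_centers == centers:
            break
        centers = new_centers
    return assign(points, centers), centers


# Hypothetical patch features: (mean intensity, intensity spread).
patches = [(0.10, 0.10), (0.15, 0.12), (0.20, 0.10),   # smooth, dark material
           (0.80, 0.50), (0.85, 0.55), (0.90, 0.50)]   # rough, bright material
labels, centers = kmeans(patches, initial_centers=[patches[0], patches[3]])
```

No labels are supplied at any point; the two groups emerge from the feature values alone. In practice the result depends on the initial centers, so several random restarts are commonly used.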

Optical flow techniques are another set of algorithms that has spurred significant advances in computer vision and artificial intelligence (AI) across many applications. Optical flow estimates a motion field through pixel correspondences between two image frames. Four main classes of algorithms can be used to compute optical flow: ① differential methods, ② region matching, ③ energy methods, and ④ phase-based techniques, for which details and references can be found in Ref. [52]. Optical flow has wide-ranging applications in processing video data, from video compression [53] to video segmentation [54], motion magnification [55], and vision-based navigation of UAVs [56].
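As a hedged illustration of the differential class of methods, the sketch below estimates a single translation (u, v) for a whole frame by solving the linearized brightness constancy equation, $I_x u + I_y v + I_t = 0$, in the least-squares sense (a Lucas-Kanade-style estimate over one global window). The synthetic texture, the frame size, and the function names are assumptions for this example; practical implementations work on local windows and image pyramids.

```python
import math


def make_frame(width, height, dx=0.0, dy=0.0):
    """Smooth synthetic texture translated by (dx, dy) pixels."""
    return [[math.sin(0.2 * (x - dx)) + math.cos(0.3 * (y - dy))
             for x in range(width)] for y in range(height)]


def global_flow(frame1, frame2):
    """Least-squares solution of Ix*u + Iy*v + It = 0 over all interior pixels."""
    height, width = len(frame1), len(frame1[0])
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            ix = (frame1[y][x + 1] - frame1[y][x - 1]) / 2.0   # spatial gradient in x
            iy = (frame1[y + 1][x] - frame1[y - 1][x]) / 2.0   # spatial gradient in y
            it = frame2[y][x] - frame1[y][x]                   # temporal difference
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("texture is too uniform to determine the motion")
    # Normal equations: [sxx sxy; sxy syy] [u; v] = [-sxt; -syt]
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v


frame_a = make_frame(64, 64)
frame_b = make_frame(64, 64, dx=0.3, dy=-0.2)   # sub-pixel shift of the scene
u, v = global_flow(frame_a, frame_b)            # close to (0.3, -0.2)
```

The linearization holds only for small displacements, which is why the example uses a sub-pixel shift. The estimate is approximate rather than exact because both the gradients and the temporal difference are finite-difference approximations.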

With these advances, computer vision techniques have been used to realize a wide variety of cutting-edge applications. For example, computer vision techniques are used in self-driving cars (Fig. 4) [57], [58], [59] to identify and react to potential risks encountered during driving. Accurate face recognition algorithms empower social media [60] and are also used in surveillance applications (e.g., law enforcement in airports [61]). Other successful applications include automated urban mapping [62] and enhanced medical imaging [63]. The significant improvements and successful applications of computer vision techniques in many fields provide increasing motivation for scholars to develop computer vision solutions to civil engineering problems. Indeed, using computer vision is a natural step toward improved monitoring and inspection of civil infrastructure. With this brief history as background, the following sections describe research efforts to adapt and further develop computer vision techniques for the inspection and monitoring of civil infrastructure.

3. Inspection applications

Researchers frequently envision an automated inspection framework that consists of two main steps: ① utilizing UAVs for remote automated data acquisition; and ② data processing and inspection using computer vision techniques. Intelligent UAVs are no longer a thing of the future, and the rapid growth in the drone industry over the last few years has made UAVs a viable option for data acquisition. Indeed, UAVs are being deployed by several federal and state agencies, as well as other research organizations in the United States (e.g., Minnesota Department of Transportation [64], [65], Florida Department of Transportation [66], University of Florida [67], Michigan Department of Transportation [68], South Dakota State University [69]). These efforts have primarily focused on taking photographs and videos that are used for onsite evaluation or subsequent virtual inspections by engineers. The ability to automatically and robustly convert images or video data into actionable information is still challenging. Toward this goal, the first major subsection below reviews literature in damage detection, and the second reviews structural component recognition. The third major subsection briefly reviews a demonstration that combines both these aspects: damage detection with structure-level consistency.

3.1. Damage detection

Automated damage detection is a crucial component of any automated or semi-automated inspection system. When characterized by the ratio of pixels representing damage to those representing the undamaged portion of the structure’s surface, the presence of defects in images of a structure can be considered a relatively rare occurrence. Thus, the detection of visual defects with high precision and recall is a challenging task. This problem is further complicated by the presence of damage-like features (e.g., dark edges such as a groove can be mistaken for a crack). As described below, a great deal of research has been devoted to developing methods and techniques to reliably identify different visual defects, including concrete cracks, concrete spalling and delamination, fatigue cracks, steel corrosion, and asphalt cracks. Three different approaches for damage detection are discussed below: ① heuristic feature-extraction methods, ② deep learning-based damage detection, and ③ change detection.
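The following sketch illustrates why precision and recall, rather than overall pixel accuracy, are the relevant measures when damage pixels are rare. The 10 × 10 masks, the simulated groove, and the function names are made up for this example.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels labeled correctly (damaged or undamaged)."""
    pairs = [(p, t) for p_row, t_row in zip(pred, truth)
             for p, t in zip(p_row, t_row)]
    return sum(1 for p, t in pairs if p == t) / len(pairs)


def precision_recall(pred, truth):
    """Precision and recall of the damage class (label 1)."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def empty_mask(size=10):
    return [[0] * size for _ in range(size)]


# Ground truth: a 5-pixel crack in a 100-pixel image (5% damage).
truth = empty_mask()
for col in range(2, 7):
    truth[5][col] = 1

# Detector A: labels every pixel as undamaged.
pred_none = empty_mask()

# Detector B: finds 4 of the 5 crack pixels, plus 2 pixels on a groove.
pred_b = empty_mask()
for col in range(2, 6):
    pred_b[5][col] = 1
pred_b[2][1] = pred_b[2][2] = 1

acc_none = pixel_accuracy(pred_none, truth)          # 0.95 despite finding nothing
pr_none = precision_recall(pred_none, truth)         # (0.0, 0.0)
acc_b = pixel_accuracy(pred_b, truth)                # 0.97
pr_b = precision_recall(pred_b, truth)               # (4/6, 4/5)
```

A detector that reports no damage at all still scores 95% pixel accuracy in this toy case, whereas its recall is zero. The two false positives on the groove barely change the accuracy of the second detector but reduce its precision to two thirds.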

3.1.1. Heuristic feature-extraction methods

Researchers have developed different heuristic methods for damage detection using image data. In principle, these methods work by applying a threshold or a machine learning classifier to the output of a hand-crafted filter for the particular damage type (DT) of interest. This section describes some of the key DTs for which heuristic feature-extraction methods have been developed.

(1) Concrete cracks. Much of the early work on vision-based damage detection focused on identifying concrete cracks based on heuristic filters (e.g., Refs. [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80]). Edge detection filters were the first type of heuristics to be used (e.g., Ref. [70]). An early survey of approaches can be found in Ref. [71]. Jahanshahi and Masri [72] used morphological features, together with classifiers (neural networks and support vector machines), to identify cracks of different thicknesses. The results from this study are presented in Fig. 5 [72], [81], where the first column shows the original images used in the study and the subsequent columns show the results from the application of the bottom-hat method, Canny method, and the algorithm from Ref. [72]. The same paper also puts forth a method for quantifying crack thickness by identifying the centerline of each crack and computing the distance to the edges. Nishikawa et al. [74] proposed multiple sequential image filtering for crack detection and property estimation. Other researchers have also developed methods to estimate the properties of concrete cracks. Liu et al. [79] proposed a method for automated crack assessment using adaptive image processing, in which a median filter was used to decompose a crack into its skeleton and edges. Depth and three-dimensional (3D) information was also incorporated to conduct quantitative damage evaluation in Refs. [81], [80]. Erkal and Hajjar [82] developed and evaluated a clustering process to automatically classify defects such as cracks, corrosion, ruptures, and spalling in colorized laser scan data using surface normal-based damage detection. In many of the methods discussed here, binarization is a step typically employed in crack-detection pipelines. Kim et al. [83] compared different binarization methods for crack detection. These methods have been applied to a variety of civil infrastructure, including bridges (e.g., Refs. [84], [85]), tunnel linings (e.g., Ref. [76]), and post-earthquake building assessment (e.g., Ref. [86]).
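A generic version of the filter-then-threshold pattern can be sketched with the bottom-hat transform (morphological closing minus the original image), followed by binarization. This is an illustrative sketch only and does not reproduce the algorithm of Ref. [72] or any other cited method. The square structuring element, the threshold of 100, the toy image, and the function names are assumptions.

```python
def window_op(image, radius, op):
    """Apply op (max or min) over a square window clipped at the image border."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            values = [image[yy][xx]
                      for yy in range(max(0, y - radius), min(height, y + radius + 1))
                      for xx in range(max(0, x - radius), min(width, x + radius + 1))]
            row.append(op(values))
        out.append(row)
    return out


def bottom_hat(image, radius=1):
    """Closing (dilation, then erosion) minus the image: highlights thin dark features."""
    closed = window_op(window_op(image, radius, max), radius, min)
    return [[c - p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(closed, image)]


def detect_dark_lines(image, radius=1, threshold=100):
    """Binarize the bottom-hat response to obtain a crack-candidate mask."""
    return [[1 if v > threshold else 0 for v in row]
            for row in bottom_hat(image, radius)]


# Toy grayscale image: bright surface, a 1-pixel-wide dark line at column 3,
# and a 3-pixel-wide dark band (e.g., a shadow) at columns 6 to 8.
BRIGHT, DARK = 200, 50
row = [BRIGHT] * 3 + [DARK] + [BRIGHT] * 2 + [DARK] * 3
image = [list(row) for _ in range(5)]

mask = detect_dark_lines(image)
```

With a 3 × 3 window, the closing fills the thin line but leaves the wider band dark, so only column 3 exceeds the threshold. The window size therefore acts as a manually tuned bound on the crack width that can be detected.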

(2) Concrete spalling. Methods have also been proposed to identify other defects in concrete, such as spalling. A novel orthogonal transformation approach combined with a bridge condition index was used by Adhikari et al. [87] to quantify degradation and subsequently map to condition ratings. The authors were able to achieve a reasonable accuracy of 85% for the detection of spalling in their dataset, but were unable to address situations in which both cracks and spalling were present. Paal et al. [88] employed a combination of segmentation, template matching, and morphological preprocessing for both spall detection and concrete column assessment.

(3) Fatigue cracks in steel. Fatigue cracks are a critical problem for steel deck bridges because they can significantly shorten the lifespan of a structure. However, research on steel fatigue crack detection in civil infrastructure has been fairly limited. Yeum and Dyke [89] manually created defects on a steel beam to give the appearance of fatigue cracks (Fig. 6). They then used a combination of region localization by object detection and filtering techniques to identify the created fatigue-crack-like defects. The authors made an interesting and useful assumption that fatigue cracks generally develop around bolt holes; however, this assumption may not be valid for other steel structures, for which critical members are usually welded (e.g., navigational infrastructure such as miter gates [90]). Jahanshahi et al. [91] proposed a region growing method to segment microcracks on internal components of nuclear reactors.

(4) Steel corrosion. Researchers have used textural, spectral, and color information for the identification of corrosion. Ghanta et al. [92] proposed the use of wavelet features together with principal component analysis for corrosion percent estimation in images. Jahanshahi and Masri [93] parametrically evaluated the performance of wavelet-based corrosion algorithms. Methods using textural and color features have also been proposed and evaluated (e.g., Refs. [94], [95]). Automated algorithms for robotic and smartphone-based maintenance systems have also been proposed for image-based corrosion detection (e.g., Refs. [96], [97]). A survey of corrosion detection approaches using computer vision can be found in Ref. [98].

(5) Asphalt defects. Numerous techniques exist for the detection and assessment of asphalt pavement cracks and defects using heuristic feature-extraction techniques [99], [100], [101], [102], [103], [104], [105]. Hu and Zhao [101] used a local binary pattern (LBP)-based algorithm to identify pavement cracks. Salman et al. [100] proposed the use of Gabor filtering. Koch and Brilakis [99] used histogram shape-based thresholding to automatically detect potholes in pavements. In addition to RGB data (where RGB refers to three color channels representing red, green, and blue wavelengths of light), depth data has been used for the condition assessment of roads. For example, Chen et al. [106] reported the use of an inexpensive RGB-D sensor (Microsoft Kinect) to detect, quantify, and localize pavement defects. A detailed review of methods for asphalt defect detection can be found in Ref. [107].
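To indicate what an LBP feature looks like, the sketch below computes the basic 8-neighbor LBP code for interior pixels and collects the codes into a histogram. This is the textbook form of the operator and is not necessarily the variant used in Ref. [101]. The neighbor ordering, the toy patches, and the function names are assumptions.

```python
# Neighbor offsets (dy, dx), clockwise from the top-left; bit i is set when
# neighbor i is at least as bright as the center pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, y, x):
    """8-bit local binary pattern of the interior pixel (y, x)."""
    center = image[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if image[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Histogram of LBP codes over all interior pixels (a 256-bin texture feature)."""
    hist = [0] * 256
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            hist[lbp_code(image, y, x)] += 1
    return hist


# Flat pavement patch: every neighbor equals the center, so all bits are set.
flat = [[200] * 3 for _ in range(3)]
flat_code = lbp_code(flat, 1, 1)            # 255

# Bright pixel with a dark vertical crack immediately to its right:
# the three right-hand neighbors (bits 2, 3, 4) are darker than the center.
edge = [[200, 200, 50] for _ in range(3)]
edge_code = lbp_code(edge, 1, 1)            # 255 - (4 + 8 + 16) = 227
```

Because each code depends only on whether a neighbor is brighter or darker than the center, the feature is unchanged by uniform brightness offsets. A thresholding rule or classifier would then operate on the code histogram of each patch.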

For further study on identification methods for some of these defects, Koch et al. [108] provided an excellent review of computer vision defect detection techniques developed prior to 2015, classified based on the structure to which they are applied.

3.1.2. Deep learning-based damage detection

The studies and techniques described thus far can be categorized as either using machine learning techniques or relying on a combination of heuristic features together with a classifier. In essence, however, the application of such techniques in an automated structural inspection environment is limited, because these techniques do not employ the contextual information that is available in the regions around the defect, such as the nature of the material or structural components. These heuristic filtering-based techniques need to be manually or semi-automatically tuned, depending on the appearance of the target structure being monitored. Real-world situations vary extensively, and hand-crafting an algorithm that succeeds in general cases is quite difficult. The recent success of deep learning for computer vision [21], [51] in a number of fields, such as general image classification [109], autonomous transportation systems [57], and medical imaging [63], has driven its application in civil infrastructure inspection and monitoring. Deep learning has greatly extended the capability and robustness of traditional vision-based damage detection for a wide variety of visual defects, ranging from cracks and spalling to corrosion. Different approaches for detection have been studied, including ① image classification, ② object detection or region-proposal methods, and ③ semantic segmentation methods. These applications are discussed below.

(1) Image classification. CNNs have been employed for the application of crack detection in steel decks [110], asphalt pavements [111], and concrete surfaces [112], with very high accuracy being achieved in all cases. Kim et al. [113] proposed a classification framework for identifying cracks in the presence of crack-like patterns using a CNN and speeded-up robust features (SURF), and determined the pixel-wise location using image binarization. Architectures such as AlexNet have been fine-tuned for crack detection [114], [115] and GoogLeNet has been similarly fine-tuned for spalling [116]. Atha and Jahanshahi [117] evaluated different deep learning techniques for corrosion detection, and Chen and Jahanshahi [118] proposed the use of naïve Bayes data fusion with a CNN for crack detection. Yeum [119] utilized CNNs for the extraction of important regions of interest in highway truss structures in order to ease the inspection process.

Xu et al. [110], [120] systematically investigated the detection of steel fatigue cracks for long-span bridges using deep learning neural networks, including a restricted Boltzmann machine and a fusion CNN. The novel fusion CNN proposed was able to identify minor cracks at multiple scales, with high accuracy, and under complex backgrounds present during in-field testing. Maguire et al. [121] developed a concrete crack image dataset for machine learning applications containing 56 000 images that were classified as either having a crack or not.