Abstract
The United States Navy (USN) intends to increase the number of uncrewed aircraft in a carrier air wing. To support this increase, carrier-based uncrewed aircraft will be required to have some level of autonomy, as there will be situations where a human cannot be in/on the loop. However, there is no existing and approved method to certify autonomy within Naval Aviation. In support of generating certification evidence for autonomy, the United States Naval Academy (USNA) has created a training and evaluation system (TES) to provide quantifiable metrics for feedback performance in autonomous systems. The preliminary use case for this work focuses on autonomous aerial refueling. Prior demonstrations of autonomous aerial refueling have leveraged a deep neural network (DNN) for processing visual feedback to approximate the relative position of an aerial refueling drogue. The training and evaluation system proposed in this work simulates the relative motion between the aerial refueling drogue and feedback camera system using industrial robotics. Ground-truth measurements of the pose between the camera and drogue are obtained using a commercial motion capture system. Preliminary results demonstrate calibration methods providing ground-truth measurements with millimeter precision. Leveraging this calibration, the proposed system is capable of providing large-scale datasets for DNN training and evaluation against a precise ground truth.
1 Introduction
The United States Navy (USN) has publicly stated that the air wing of the future will be 40% uncrewed [1]. As part of this effort, the USN is preparing to field the Boeing MQ-25 Stingray, an uncrewed refueling aircraft [2]. The MQ-25 will be the first large, uncrewed aircraft to operate regularly from the flight deck of a United States aircraft carrier. The currently planned operations for the MQ-25 can be considered automation, as there will be a human in/on the loop acting as the air vehicle operator. As the USN expands uncrewed aerial systems beyond automation, reliance on autonomy will increase. However, an approved method to certify autonomy within naval aviation does not currently exist. In response to this need, the United States Naval Academy (USNA) has created a training and evaluation system (TES) to quantify feedback performance for autonomous systems.
In coordination with the Office of Naval Research, the Naval Air Systems Command, and the National Airworthiness Council Artificial Intelligence Working Group (NAWCAIWG), this work focuses on an unclassified use case for certification using feedback derived from deep neural network (DNN) processing of visual imagery. In this case, the autonomous aircraft acts as the receiver in an aerial refueling operation (i.e., approaching and coupling with a refueling drogue), and feedback must reliably approximate the relative position of the drogue from visual imagery. This will enable an uncrewed aircraft to complete an autonomous task, a first for naval aviation. Details of the use case can be found in [3]. Prior work has shown that an uncrewed aircraft, in live testing and simulation, can perform autonomous refueling under ideal conditions with a human closely monitoring all aspects [4,5]. However, a fleet-wide flight clearance without a human in/on the loop requires extensive performance quantification to assess risk. Further, standards or methods of compliance do not exist to certify this level of autonomy within naval aviation. The described use case offers several unique challenges, most notably the lack of a universal standard for measuring the accuracy and performance of DNN feedback. The proposed TES provides tools for data acquisition, ground-truth labeling for DNN training, and performance quantification.
The use of supervised learning to train DNNs requires an established “ground truth” defining a known input/output correspondence. In practice, creating ground-truth correspondence using nonsimulated imagery represents the most labor-intensive part of training. This effort is justified as valid ground-truth labeling of large datasets is required for accurate and predictable DNN performance [6].
While this task is relatively unskilled, lapses in labeling accuracy can introduce training error that reduces network accuracy. Prior research shows that a trained DNN can identify an aerial refueling drogue from a relatively small dataset [7]. However, generating a dataset of accurately labeled images is resource-intensive. As such, the TES is designed to serve both as an autolabeling tool to generate large sets of ground-truth correspondences and as an evaluation tool. To accomplish this, the TES incorporates industrial robots to provide automated articulation between the designated sensor (for this application, a machine vision camera) and the target (a KC-130 refueling drogue). Early results showed the ability to track the relative pose across the combined manipulation workspace and to project bounding boxes defining salient target features in image space without human interaction.
This paper will highlight the process we used in establishing the TES and is structured as follows. Section 2 highlights related work, the use case, and the motivation for the TES. Section 3 details the TES design, calibration, and automation of data acquisition. Section 4 describes the preliminary evaluation of the TES and presents the results generated. Finally, Sec. 5 summarizes conclusions and discusses future work.
2 Background
The prevalence and accuracy of DNNs, specifically deep convolutional neural networks, have expanded notably since the adoption of graphics processing unit computing [8]. Government and commercial entities continue to adopt this technology with applications ranging from license plate identification [9,10] to identifying humans in distorted images [11]. Regardless of application, these algorithms require large datasets for training before they can be effective.
In 2021, Zhan et al. published a survey paper on machine-learning techniques for autolabeling across a breadth of data formats (video, audio, and text). They examined numerous papers describing methods for generating large datasets used for training machine learning algorithms [6]. However, none of the papers surveyed offered a method for automatically labeling data with quantified labeling accuracy. Given the certification and safety considerations associated with this work, these autolabeling techniques, while qualitatively effective, lack the critical information needed to support certification and safety approvals for flight clearance. Additional work on developing autolabeling methods is ongoing. Some methods include having a human manually label images while an algorithm learns from their actions. The algorithm is then able to continue the task with minimal supervision [12–15].
This paper focuses on developing a method for generating large sets of ground-truth correspondences between images and known features for use in DNN training and evaluation. The specific application for this preliminary work is the tracking of an aerial refueling drogue in simulated flight configurations. Following Refs. [4,5], visual feedback from a camera fixed to the simulated receiving aircraft will be processed to define its distance and relative position from the drogue using salient geometric features on the drogue (e.g., the relative size of drogue features in pixels). Assuming a precise training set, acceptable DNN selection, and acceptable DNN parameter tuning, prior work [7] suggests that the DNN will accurately track relative position during the final approach to contact with the drogue. To produce the precise training set, this paper introduces the idea of using ground-truth data to automatically label bounding boxes for DNN training vice another technique. Existing autolabeling and hand-labeling techniques do not offer a quantitative accuracy metric when generating ground-truth correspondences for DNN training and evaluation. The method presented in this paper differs by providing quantifiable tracking metrics for the labeling of training and evaluation data. For this use case, the quantified tracking accuracy generated in the training and evaluation of the DNN can provide safe operating bounds for control systems leveraging DNN feedback. As mentioned above, these measures are also critical in assessing the safety and eventual certification of flight autonomy leveraging DNN feedback.
With the advent of increased computing power (notably graphics processing unit computing), machine learning techniques such as DNNs have been widely adopted as a cost-effective method of processing large datasets across a range of sensor modalities. An extensive body of work explores the application of machine learning algorithms to identify objects in images, spanning various fields. Machine learning methods have been applied to anomaly detection in medical imaging, with extensive work related to lung cancer detection [16–20], to fire detection in camera footage [21–23], and as a feedback modality for self-driving cars [24–26].
The use of computer vision as a method for feedback in autonomous aerial refueling has been studied by multiple organizations. The Air Force Institute of Technology has an active program identifying the receiver aircraft and estimating its pose relative to a synthetic refueling drogue [27–31]. The work at the Air Force Institute of Technology relies on a tanker-based vision system vice one hosted by the receiver aircraft. The approach simulated in this paper focuses on a receiver-based computer vision system. Various iterations of this approach and its utility for aerial refueling have appeared in the literature [32–36]. While all of these approaches demonstrate that an uncrewed vehicle can complete the task, none has been vetted within naval aviation as part of a safety-of-flight clearance.
Fielding large uncrewed platforms within naval aviation has become the logical next step. However, most of the functionality within these platforms is only certified safe for flight when a human is in/on the loop. As these platforms are fielded within a carrier air wing, a need has been identified for them to complete their missions without human oversight; the platforms will need to exhibit some level of autonomous functionality. Before certifying this autonomous functionality, the USN needs to formally establish specifications and methods of compliance that certification officials can use to make informed risk decisions.
To enable an early dialog between industry, academia, and United States military certification officials, the NAWCAIWG sponsored two working groups to study the issue and make recommendations [37,38]. These working groups elected to use an unclassified use case: an autonomous aircraft acting as the receiver during aerial refueling through the use of a DNN. This use case was presented at Xpotential 2022 [39], detailed in the International Test and Evaluation Association Journal [3], and discussed in the Systems Engineering Journal. This paper is another step along this research line.
This paper details the development, calibration, and application of the TES to generate ground-truth correspondences for DNN training and evaluation, specifically for an autonomous aerial refueling task. The overarching goal of this work is to develop a framework for providing highly accurate and quantifiable ground-truth measurements applicable to the training and evaluation of autonomy, with the eventual goal of providing information for the certification of autonomy.
3 Test and Evaluation System Overview
This section provides an overview of the proposed TES, associated calibration, and automated data acquisition. Section 3.1 details the TES design including hardware specification and software development. Section 3.2 overviews the required calibration of the TES to provide values for static, unknown transformations. Section 3.3 describes the current approach to automated data acquisition both for calibration and autolabeling in the context of this work.
3.1 Design.
The TES proposed in this work was developed within the Department of Weapons, Robotics, & Control Engineering's Vision Integration in Polymanual Robotics (VIPER) lab at the USNA. The VIPER lab was established as a research facility to explore computer vision and sensor uncertainty leveraging multi-arm industrial robotic manipulation and commercial motion capture. The VIPER lab currently houses three multi-degree-of-freedom (DoF) industrial manipulators: (1) a Yaskawa SIA20F (7-DoF, 1090-mm horizontal reach, 20 kg payload); (2) a Universal Robots UR10 (6-DoF, 1300-mm horizontal reach, 10 kg payload); and (3) a Universal Robots UR5 (6-DoF, 850-mm horizontal reach, 5 kg payload). Within the VIPER lab, the Yaskawa SIA20F is anchored to the lab floor, and the UR5 and UR10 are mounted on individual mobile bases. The VIPER lab integrates a 12-camera motion capture (MoCap) constellation (OptiTrack PrimeX 41, Natural Point Inc.) providing an advertised ±0.10-mm 3D tracking accuracy.
The TES design goals are as follows: (1) simulate, at a minimum, the final 4.5 m approach for an aerial refueling task given a 1.8 m distance between the camera and the center of the drogue at an approach distance of 0 m (simulating the approximate position of the camera when the receiver makes contact as discussed in [7]); (2) simulate off-axis misalignment of at least ±1.5 m during approach; (3) integrate a KC-130 aerial refueling drogue following the North Atlantic Treaty Organization (NATO) “probe-and-drogue” standard [40] as a representative target for training and evaluation; (4) integrate an interchangeable system for mounting machine vision cameras; and (5) provide precise measurements for defining the position and orientation of drogue features relative to the camera.
Given the 14 kg weight of the KC-130 aerial refueling drogue assembly, mounting options to address design goal (3) are limited to (a) static mounting or (b) mounting to the SIA20F. Given design goal (2), the SIA20F was selected enabling ±850 mm off-axis articulation of the drogue (after accounting for pedestal collision). Mounting for the interchangeable machine vision cameras to address design goal (4) is accomplished using a rigid 80/20 interface between the UR10 end effector and a standard 1/4 in camera ball head. The UR10 is selected for the camera mount to address the remaining off-axis misalignment requirement of design goal (2) enabling a combined misalignment capability exceeding ±1.8 m. Design goal (1) is addressed by the mobility of the UR10 base relative to the rigidly fixed SIA20F with restrictions imposed by available floor space limiting the maximum approach distance to 6 m. To address design goal (5), reflective markers are rigidly fixed to components of the system to define frames that can be tracked using the MoCap. Figure 1 highlights the components of the TES.
To keep the KC-130 drogue “inflated” during imaging, the drogue is lightly modified to incorporate tension cables and turnbuckles at six locations evenly spaced about the drogue's center axis. This results in an inflated, but notably hexagonal drogue configuration. Figure 2 highlights the difference between an inflated KC-130 drogue in-flight, and an image captured by the TES. This discrepancy between the current hexagonal shape and the desired circular shape is discussed in Sec. 5 and will be addressed in future work.
The TES is interfaced using ROS and MATLAB wrappers. The SIA20F is interfaced using the ROS-Industrial “Motoman” package [41], the UR10 is interfaced using the ROS-Industrial “UR Modern Driver” package [42], and the MoCap is interfaced using the ROS “VRPN Client” package [43]. The MATLAB wrappers [44–46] utilize the MATLAB ROS Toolbox and allow users to command the robots, query robot state, and query feedback from the MoCap. Cameras are interfaced directly in MATLAB using the Image Acquisition Toolbox.
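As a point of reference, the following is a minimal sketch of querying a single MoCap rigid-body pose through the ROS "VRPN Client" interface from MATLAB using the ROS Toolbox. The ROS master address, rigid-body topic name, and meters-to-millimeters conversion are placeholder assumptions that depend on the local VRPN client configuration; this sketch does not reproduce the wrappers cited above.

```matlab
% Minimal sketch (assumed master IP and topic name); the VRPN client typically
% publishes geometry_msgs/PoseStamped messages for each tracked rigid body.
rosinit('192.168.1.10');                                   % connect to the ROS master
sub = rossubscriber('/vrpn_client_node/drogue/pose', ...   % assumed rigid-body topic
                    'geometry_msgs/PoseStamped');
msg = receive(sub, 5);                                     % wait up to 5 s for a pose message
p = msg.Pose.Position;  q = msg.Pose.Orientation;
H_ty_w = [quat2rotm([q.W, q.X, q.Y, q.Z]), 1000*[p.X; p.Y; p.Z]; 0 0 0 1];  % pose in mm
rosshutdown
```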
Table 1 describes the coordinate frames defined as part of the TES, and Table 2 describes the transformations required for the proposed effort.
Frame | Frame description | Information source |
---|---|---|
eu | Fixed relative to UR10 end-effector | UR10 controller |
ou | Fixed relative to UR10 base | UR10 controller |
ey | Fixed relative to SIA20F end-effector | SIA20F controller |
oy | Fixed relative to SIA20F base | SIA20F controller |
tu | Fixed relative to UR10 end-effector | MoCap |
bu | Fixed relative to UR10 base | MoCap |
ty | Fixed relative to SIA20F end-effector | MoCap |
by | Fixed relative to SIA20F base | MoCap |
w | Fixed relative to MoCap “world” | MoCap |
c | Fixed to camera focal point | Camera |
d | Drogue salient feature frame | User/CAD |
m | Fixed to upper left of digital image | Camera |
Transform description | Group | Linear units | Information source
---|---|---|---
Pose of eu relative to ou | SE(3) | mm | UR10 controller
Pose of ey relative to oy | SE(3) | mm | SIA20F controller
Pose of tu relative to w | SE(3) | mm | MoCap
Pose of bu relative to w | SE(3) | mm | MoCap
Pose of ty relative to w | SE(3) | mm | MoCap
Pose of by relative to w | SE(3) | mm | MoCap
Pose of eu relative to tu | SE(3) | mm | Static, unknown
Pose of ou relative to bu | SE(3) | mm | Static, unknown
Pose of ey relative to ty | SE(3) | mm | Static, unknown
Pose of oy relative to by | SE(3) | mm | Static, unknown
Pose of tu relative to c | SE(3) | mm | Static, unknown
Pose of ty relative to d | SE(3) | mm | Static, unknown
Undistorted projection of c to m | Intrinsic | pixels | Static, unknown
Here, ${}^{c}R_{d}$ defines the orientation of frame $d$ relative to frame $c$; ${}^{c}\mathbf{t}_{d}$ defines the position of frame $d$ relative to frame $c$; $\tilde{\mathbf{z}}$ defines the z-distance of the $N$ salient features ($P$) relative to the camera frame (typically referred to as "scale" with variable $s$); $\mathbf{x}_{m}$ and $\mathbf{y}_{m}$ represent the pixel coordinates of the salient features $P$ within the undistorted digital image (relative to frame $m$); and $\oslash$ denotes element-wise (i.e., Hadamard) division. For distorted images, points projected using the pinhole model (Eq. (4)) must be distorted using the applicable lens/camera model (e.g., Brown–Conrady or fisheye).
The two methods presented for calculating the pose of the drogue frame relative to the camera frame (Eqs. (1) and (2)) rely on information gathered from different components of the TES. Manufacturer specifications prescribe a MoCap accuracy of ±0.1 mm, SIA20F "repeatability" of ±0.1 mm, and UR10 "repeatability" of ±0.1 mm. Repeatability in this context describes the position error bounds associated with repeated movement to a fixed waypoint anywhere within the robot's workspace. Assuming a one-to-one relationship between repeatability and the accuracy of the position information reported by the SIA20F and UR10 controllers suggests an accuracy of ±0.1 mm for both robots. Ignoring the error introduced in the recovery of the static, unknown transformations described in Table 2, a best-case error associated with Eqs. (1) and (2) can be approximated by convolving the individual measurement error distributions. For Eq. (1), relying on two measured transformations with ±0.1 mm accuracy, the approximate best-case error should be within ±0.3 mm. For Eq. (2), relying on five measured transformations with ±0.1 mm accuracy, the approximate best-case error should be within ±0.9 mm. This approximation follows intuition and suggests that Eq. (1) will yield a more precise result.
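To make the pose chain and pinhole projection concrete, the following is a minimal MATLAB sketch with illustrative placeholder values. The variable names, transforms, and intrinsic matrix are assumptions for illustration only; in practice they come from MoCap feedback, the calibrated static transformations, and camera calibration.

```matlab
% Illustrative values only; in practice these come from MoCap, calibration, and CAD.
H_tu_w = [eye(3), [0; 0; 1500]; 0 0 0 1];        % camera MoCap body (tu) relative to world (w), mm
H_ty_w = [eye(3), [0; 0; 6000]; 0 0 0 1];        % drogue MoCap body (ty) relative to world (w), mm
H_c_tu = eye(4);                                 % camera frame (c) relative to tu (calibrated, static)
H_d_ty = eye(4);                                 % drogue frame (d) relative to ty (calibrated, static)

% Pose of the drogue feature frame (d) relative to the camera frame (c)
H_d_c = (H_tu_w*H_c_tu) \ (H_ty_w*H_d_ty);

% Salient drogue features defined in frame d (3xN, mm); here a 609.9 mm diameter circle
theta = linspace(0, 2*pi, 37); theta(end) = [];
P_d = [(609.9/2)*cos(theta); (609.9/2)*sin(theta); zeros(1, 36)];

% Transform features into the camera frame and apply the pinhole model
P_c  = H_d_c*[P_d; ones(1, size(P_d, 2))];
A    = [1400 0 640; 0 1400 480; 0 0 1];          % example intrinsic matrix (pixels)
sXY  = A*P_c(1:3, :);                            % scale times homogeneous pixel coordinates
x_m  = sXY(1:2, :)./sXY(3, :);                   % undistorted pixel coordinates (2xN)
```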
3.2 Calibration.
The goal of TES calibration is to accurately establish values for the static, unknown transformations described in Table 2. To do so, two calibration fiducials are introduced to provide extrinsic information describing the fiducial pose relative to the camera frame (frame c). For convenience, these are defined as a 2D checkerboard (frame f) fixed to a unique MoCap rigid body (frame g); and a 235 mm AprilTag (frame a) rigidly fixed to the SIA20F base frame. These fiducials were selected to provide compatibility with the “Camera Calibration” and “Read AprilTag” tools available in the MATLAB Computer Vision Toolbox. Figure 3 provides images of the fiducials highlighting the MoCap rigid body markers, and Table 3 provides a summary of the transformations introduced by the fiducials.
Transform description | Group | Linear units | Information source
---|---|---|---
Pose of f relative to c | SE(3) | mm | Camera (camera calibration)
Pose of a relative to c | SE(3) | mm | Camera (read AprilTag)
Pose of g relative to f | SE(3) | mm | Static, unknown
Pose of by relative to a | SE(3) | mm | Static, unknown
Frames u and v are introduced generically and are replaced by the applicable frame labels in this work. Assuming n calibration samples, i and j denote discrete samples taken at unique instances of time ($t_i$ and $t_j$). Thus, $A_{i,j}$ represents the pose of frame u in sample i relative to frame u in sample j, and $B_{i,j}$ represents the pose of frame v in sample i relative to frame v in sample j. Given that the unknown transformation $X$ is static, $A_{i,j}X = XB_{i,j}$. A summary of the transformations used to recover the static, unknown transformations is provided in Table 4. Error approximations for the recovered transformations can be defined as ±0.7 mm for A and B values derived from MoCap and/or robot feedback. Values leveraging fiducial extrinsics (the poses of frames f and a relative to frame c) are dependent on camera calibration accuracy.
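For reference, the following is a minimal MATLAB sketch of one common least-squares approach to the AX = XB problem: an orthogonal Procrustes fit on the rotation-log axes followed by linear least squares for translation. It assumes cell arrays A and B of 4x4 relative poses built from the sample pairs described above, and it is an illustrative solver rather than the exact method used in this work.

```matlab
function X = solveAXXB(A, B)
% Recover the static transform X from pairs A{i}*X = X*B{i} (4x4 homogeneous).
n = numel(A);
alpha = zeros(3, n);  beta = zeros(3, n);
for i = 1:n
    alpha(:, i) = vee(logm(A{i}(1:3, 1:3)));    % rotation axis*angle of A{i}
    beta(:, i)  = vee(logm(B{i}(1:3, 1:3)));    % rotation axis*angle of B{i}
end
[U, ~, V] = svd(alpha*beta.');                  % fit R_X such that alpha ~ R_X*beta
R_X = U*diag([1, 1, det(U*V.')])*V.';
C = zeros(3*n, 3);  d = zeros(3*n, 1);
for i = 1:n                                     % (R_Ai - I)*t_X = R_X*t_Bi - t_Ai
    C(3*i-2:3*i, :) = A{i}(1:3, 1:3) - eye(3);
    d(3*i-2:3*i)    = R_X*B{i}(1:3, 4) - A{i}(1:3, 4);
end
t_X = C\d;                                      % linear least-squares translation
X = [R_X, t_X; 0 0 0 1];
end

function v = vee(S)
% Map a 3x3 skew-symmetric matrix to its 3x1 vector representation.
v = real([S(3, 2); S(1, 3); S(2, 1)]);
end
```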
Though assumed static, environmental factors such as temperature will impact the transformations recovered by the TES calibration. As a result, the TES calibration must be evaluated prior to, and potentially intermittently during, data acquisition using the methods described in Sec. 4.2. In practice, the TES calibration accuracy is evaluated using new images of the checkerboard and AprilTag fiducials to compare extrinsics calculated using the recovered static transformations and MoCap measurements with extrinsics recovered by the camera directly. If and when the mean extrinsic errors derived from this evaluation fall outside of the desired accuracy, the TES must be recalibrated before further use.
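The comparison described above reduces to a mean distance between fiducial corner positions predicted by the two extrinsic sources. A minimal sketch is given below; the poses and corner grid shown are illustrative assumptions, not measured data.

```matlab
% Illustrative check of the mean extrinsic error used to trigger recalibration.
H_f_c_cam   = [eye(3), [ 10;  20; 1000]; 0 0 0 1];   % extrinsics recovered by the camera (mm)
H_f_c_mocap = [eye(3), [ 11;  19; 1002]; 0 0 0 1];   % extrinsics from MoCap + static transforms (mm)

[X, Y] = meshgrid(0:30:180, 0:30:120);               % example checkerboard corner grid (mm)
P_f = [X(:).'; Y(:).'; zeros(1, numel(X))];          % corner locations in the fiducial frame

P_cam   = H_f_c_cam  *[P_f; ones(1, size(P_f, 2))];
P_mocap = H_f_c_mocap*[P_f; ones(1, size(P_f, 2))];
meanExtrinsicError = mean(vecnorm(P_cam(1:3, :) - P_mocap(1:3, :)))   % mm
```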
3.3 Automated Data Acquisition.
Automated data acquisition with the TES is accomplished using the hardware interfaces described in Sec. 3.1 and with a simulation environment developed as a “digital twin” for the physical TES. The simulation environment utilizes the MATLAB Robotics System Toolbox to model and visualize the UR10 and SIA20F, and to define applicable collision geometries. MoCap feedback and static transformations from calibration (Sec. 3.2) provide relative pose information for the placement of the UR10, SIA20F, fiducial frames, and drogue. A camera simulation based on camera parameters recovered during calibration and fiducial visualizations defined by known design properties are incorporated into the TES simulation using tools from [48–51].
The resultant TES simulation environment provides:
Collision detection for the UR10, SIA20F, pedestal, and mounting geometry, and components mounted to end-effectors (i.e., camera and drogue).
Forward and inverse kinematics for the UR10 and SIA20F including tool transformation offsets.
Information regarding visibility of desired features within the camera's field of view (e.g., checkerboard fiducial, AprilTag fiducial, and drogue features).
Simulated camera images for validation and debugging.
Given a desired number of samples, features of interest, and a prescribed sampling region, the initial “automated waypoint generation for data acquisition” (daqWaypoints) algorithm is defined in Algorithm 1. The functions isIkin, isCollision, and isVisible used in Algorithm 1 are built into the TES simulation environment providing tools to check if an inverse kinematic solution exists, if the system is in a collision state, and if the features of interest are within the camera field of view.
Once a set of waypoints q is defined, a collision-free path between waypoints can be defined using established methods (e.g., [52]). For this preliminary effort, the defined set of waypoints (defined in joint space) is sorted by distance from a starting configuration, each adjacent pair of sorted waypoints is connected via linear interpolation, and the TES simulation environment evaluates the interpolated points. If no collisions are found, the interpolated points are added to the path. If a collision is found, the second element of the adjacent pair is replaced by the interpolated point prior to the collision. This produces a safe, reachable, collision-free path. However, waypoints replaced in this method do not guarantee that any/all desired features are visible in the camera field of view. The result is a dataset that may be smaller than the value prescribed by the user.
1: procedure daqWaypoints(n, F, C) ▷ n defines desired samples, F defines features, C defines search region
2:  q ← ∅, i ← 0
3:  while i < n do
4:   X ← sample(C)
5:   if ¬isIkin(X) then continue
6:   end if
7:   q_i ← ikin(X)
8:   if isCollision(q_i) or ¬isVisible(q_i, F) then continue
9:   end if
10:  q ← q ∪ {q_i}
11:  i ← i + 1
12:  end while
13:  return q ▷ q defines waypoints in joint space
14: end procedure
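A minimal MATLAB rendering of Algorithm 1 is sketched below, assuming hypothetical helper functions sample, ikin, isIkin, isCollision, and isVisible corresponding to the TES simulation environment capabilities described above; the sampling strategy and function signatures are illustrative assumptions rather than the exact implementation.

```matlab
function q = daqWaypoints(nSamples, F, C)
% nSamples - desired number of samples
% F        - features that must be visible in the camera field of view
% C        - prescribed sampling region (e.g., task-space bounds)
q = {};                                            % accepted waypoints (joint space)
while numel(q) < nSamples
    X = sample(C);                                 % candidate pose drawn from C (assumed helper)
    if ~isIkin(X)                                  % no inverse kinematic solution exists
        continue
    end
    q_i = ikin(X);                                 % joint-space configuration (assumed helper)
    if isCollision(q_i) || ~isVisible(q_i, F)      % reject collisions / features out of view
        continue
    end
    q{end+1} = q_i;                                %#ok<AGROW> accept the waypoint
end
end
```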
With a collision-free path defined, the calibrated TES is capable of providing several options for ground-truth correspondence. Some examples include:
Project drogue features into image space following Eqs. (1) and (4)
Project drogue features into image space following Eqs. (1) and (4), and define bounding boxes for specific features (see the sketch following this list)
Define a full or reduced parametrization describing the pose of the drogue relative to the camera frame following Eq. (1) (e.g., drogue position and yaw/pitch/roll relative to the camera frame, drogue yaw/pitch relative to the camera frame, or drogue position only relative to the camera frame)
Define a full or reduced parametrization describing the pose of the drogue relative to a user-defined frame fixed to the camera frame using an extension of Eq. (1) (e.g., the refueling probe frame)
Beyond the options described above, combinations including image projections and full/reduced parametrization of the relative pose are possible. The advantage of the TES over existing methods is the breadth of information available. This provides flexibility in labeling modality.
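As an example of the bounding-box option noted above, the following minimal sketch computes an axis-aligned bounding box from drogue feature points already projected into the image. The pixel coordinates shown are illustrative assumptions, and the [x, y, width, height] format is one common convention; the exact label format depends on the DNN framework.

```matlab
% x_m holds 2xN projected pixel coordinates of a drogue feature (illustrative values).
x_m  = [612 668 705 668 612 575;
        431 455 512 568 545 489];
bbox = [min(x_m, [], 2).', max(x_m, [], 2).' - min(x_m, [], 2).'];   % [x, y, width, height]
```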
4 Results
4.1 Automated Data Acquisition.
Following a coarse initial calibration of the TES, the automated data acquisition approach described in Sec. 3.3 was used to capture 2500 waypoints with the features of interest defined as the corners of the calibration checkerboard, and 2500 waypoints with the features of interest defined as the corners of the AprilTag fixed to the SIA20F base frame. The total of 2500 waypoints per dataset was selected to limit the data acquisition time to approximately 4 h per dataset. This 4-h approximation assumes a conservative average acquisition rate of 6 s per image, limited by a move-to-waypoint, stop, capture-image, capture-pose data acquisition strategy. The 4-h-per-dataset limit was chosen to restrict the supervised test duration to a total of 8 h.
A collision-free path was generated for each set of waypoints using the method described in Sec. 3.3. The waypoints and collision-free paths generated for both the checkerboard and AprilTag fiducials are shown in Fig. 4 (left), and the joint space paths and waypoints missed when generating the collision-free path are shown in Fig. 4 (right). Of the 2500 waypoints generated, 2367 checkerboard waypoints were reached using the collision-free path, and 1442 AprilTag waypoints were reached using the collision-free path with unreachable waypoints replaced by the closest reachable configuration. The data acquisition time using the TES to capture checkerboard images was approximately 3 h and 40 min for the 2500 waypoints associated with the checkerboard (averaging approximately 5.3 s per image). The data acquisition time using the TES to capture AprilTag images was approximately 3 h and 30 min for the 2500 waypoints associated with the AprilTag (averaging approximately 5.0 s per image).
The resultant dataset yielded 2326 viable checkerboard image/pose correspondences and 1613 AprilTag image/pose correspondences. Note that the decrease in viable image/pose correspondences for the checkerboard dataset indicates that 41 images contained a partial or obstructed checkerboard view, and the increase in viable image/pose correspondences for the AprilTag dataset indicates that 171 “missed-waypoints” yielded an in-view AprilTag.
4.2 Calibration Results.
The image/pose correspondences described in Sec. 4.1 were separated into calibration and validation subsets. Camera calibration was performed using 1163 (odd index values) of the 2326 checkerboard images collected, and evaluated against the remaining 1163 images not used for calibration (even index values). Both the calibration and evaluation datasets yielded a mean reprojection error of 0.10 pixels. Using tools from the MATLAB Camera Calibration toolbox, the calibration processing time was approximately 2 h and 20 min for each of the 1163 image datasets. These results were achieved using a Windows 10 operating system running on a PC with a 3.80 GHz processor (Intel Xeon Processor E3-1270 v6 8M Cache), 32 GB RAM, and an NVIDIA Quadro P1000 graphics card. The MATLAB version used was R2021b. The camera calibration results are shown in Fig. 5.
Using the checkerboard and AprilTag fiducial extrinsics recovered with the camera calibration parameters and the AprilTag library, the "AX = XB" solution was applied to recover the remaining static, unknown transformations. Using MATLAB R2021b on the same Windows 10 PC used for camera calibration, the processing time for AprilTag pose recovery and the "AX = XB" solution was approximately 40 min for the 807-image calibration dataset, and the AprilTag pose recovery for the 806-image evaluation dataset took approximately 10 min.
Figure 6 shows the "mean extrinsic error" associated with the static transformations recovered using the checkerboard fiducial, and Fig. 7 shows the "mean extrinsic error" associated with the static transformations recovered using the AprilTag fiducial. The term "mean extrinsic error" in this context is defined as the mean error between the 3D positions of the checkerboard or AprilTag corners defined using extrinsics recovered by camera calibration or the AprilTag library, and the corner positions computed using MoCap measurements and the recovered static transformations. For both the checkerboard and AprilTag static transformation recovery, the use of odd-index images for calibration yields substantial mean extrinsic error for both calibration and evaluation (Figs. 6 and 7, top). To account for both the redundant images highlighted in Fig. 5 and possible non-Gaussian noise associated with the VRPN MoCap feedback, calibration using 6000 individual subsets of 100 randomly selected images was explored. The lowest-error calibration results from the checkerboard and AprilTag random subsets are shown in Figs. 6 and 7 (bottom).
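A sketch of this random-subset strategy is shown below; calibrateSubset and evalExtrinsicError are hypothetical wrappers around the calibration and mean-extrinsic-error evaluation steps described in Secs. 3.2 and 4.2, and nImages and evalIdx are assumed to be defined from the collected dataset.

```matlab
% Repeatedly calibrate on small random subsets and keep the lowest-error result.
nTrials = 6000;  subsetSize = 100;
bestErr = inf;
for k = 1:nTrials
    idx = randperm(nImages, subsetSize);        % random 100-image calibration subset
    X   = calibrateSubset(idx);                 % recover static transform(s) from the subset
    err = evalExtrinsicError(X, evalIdx);       % mean extrinsic error on held-out images (mm)
    if err < bestErr
        bestErr = err;  bestX = X;  bestIdx = idx;
    end
end
```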
4.3 Auto-Labeling Results.
Autolabeling of the drogue is performed using the static camera transformation recovered in Sec. 4.2, MoCap feedback defining the camera and drogue rigid-body poses, the recovered camera intrinsics, a user-defined approximation of the drogue feature frame relative to its MoCap rigid body, and a user-defined set of salient drogue features Pd. For this work, the drogue features consist of a 609.9 mm (24 in) diameter circle offset 596.9 mm (23.5 in) from a coaxial 101.6 mm (4 in) diameter circle. These circles define the drogue's inflated cloth drag component and coupler. To show orientation discrepancies associated with this approach, the circles are connected using 36 equally spaced segments approximating the 3D "spokes" connecting the cloth drag component to the coupler.
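A minimal sketch of constructing this user-defined feature set is shown below; the direction of the 596.9 mm axial offset and the variable names are assumptions for illustration.

```matlab
% Salient drogue features in frame d (mm): a 609.9 mm diameter circle (cloth
% drag component) offset 596.9 mm along the drogue axis from a coaxial
% 101.6 mm diameter circle (coupler), connected by 36 "spokes".
theta   = linspace(0, 2*pi, 37);  theta(end) = [];            % 36 equally spaced angles
canopy  = [(609.9/2)*cos(theta); (609.9/2)*sin(theta); zeros(1, 36)];
coupler = [(101.6/2)*cos(theta); (101.6/2)*sin(theta); -596.9*ones(1, 36)];
P_d     = [canopy, coupler];                                  % 3x72 feature points
spokes  = [1:36; 37:72].';                                    % index pairs defining spokes
```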
5 Conclusion and Future Work
The automated data acquisition results presented in Sec. 4.1 show that the methods presented in Sec. 3.3 provide a viable approach to safely acquiring large datasets using the TES and TES simulation environment. Of the 2500 desired waypoints per dataset, 2326 (approximately 93%) of the checkerboard waypoints yielded viable images and 1613 (approximately 65%) of the AprilTag waypoints yielded viable images. This discrepancy between the desired number of waypoints and the number of waypoints yielding viable images suggests a need to improve the efficiency of the data acquisition approach. As described in Sec. 3.3, future work will implement collision-free motion planning approaches from the literature, and alternatives to the random sampling approach will be explored.
Data acquisition times for the checkerboard and AprilTag datasets were approximately 5.3 and 5.0 s per image, respectively. The variation in acquisition times results largely from the waypoint spacing in the random sampling defined by Algorithm 1 and the replacement of waypoints during the collision-free motion planning process. The data acquisition time for the TES can be further reduced by refining the waypoint order to minimize the total distance traveled and by removing waypoints where the fiducial of interest is not in the field of view. These improvements will be addressed in future work.
The processing required for calibration is time-consuming and does not scale well as the image sets become very large. For camera calibration, processing times were approximately 2 h and 20 min for each 1163-image dataset. For AprilTag calibration, this time reduces to approximately 40 min for 807 images, and for AprilTag evaluation, this time drops to approximately 10 min. While this processing time is extensive, calibration and evaluation datasets can be processed "overnight" without the operator supervision required to ensure hardware safety during acquisition. This processing time can be reduced in the near term by migrating the calibration from a MATLAB environment to a compiled language (e.g., C++).
The camera calibration results presented in Sec. 4.2 suggest that the data acquisition capabilities of the TES provide excellent camera calibration: a mean reprojection error of 0.10 pixels for both the calibration and evaluation data for 1280 × 960 pixel images. Further analysis of this calibration and evaluation data shows that many of the images collected using the automated data acquisition technique are effectively redundant (i.e., providing a fiducial pose relative to the camera frame that nearly matches the pose of an existing waypoint). This further emphasizes the need to improve the automated data acquisition efficiency to ensure a wide variety of unique samples within the workspace of the TES.
Results from the static transformation recovery described in Sec. 4.2 highlight an oversight in both the TES design and the automated data acquisition approach described in this work. Specifically, the use of odd-index images for both the checkerboard and AprilTag fiducials (Figs. 6 and 7, top) yields unacceptably high mean calibration and evaluation extrinsic errors, with values exceeding 305 mm (12 in.), while the use of 100-image subsets for calibration drastically reduces both calibration and evaluation extrinsic errors (Figs. 6 and 7, bottom).
Rerunning portions of the automated data acquisition on the TES shows intermittent, discontinuous "jumps" in the poses reported by the VRPN MoCap interface, primarily in the reported pose of the MoCap rigid body associated with frame g. Viewing the same rerun automated data acquisition using the native manufacturer software interface (Motive) shows that markers associated with this MoCap rigid body (frame g) are intermittently lost, and the rigid body is either not tracked or assigned an incorrect pose. The presence of this extraneous MoCap data within the data collected for both the checkerboard and AprilTag fiducials explains the improvement seen with the 100-image-subset approach to calibration. Further, identifying 214 of the checkerboard images and 171 of the AprilTag images as outliers reduces the evaluation extrinsic error to < 3.4 mm (< 0.13 in) for the checkerboard fiducial and < 4.1 mm (< 0.16 in) for the AprilTag fiducial. While these final errors are more than 10 times the ±0.3 mm estimated in the TES design (Sec. 3.1), they are well within the ground-truth tracking accuracy required for the application proposed in this work.
To improve static transformation recovery during calibration, three near-term methods are proposed for future work: (1) improve the camera MoCap rigid body design by increasing marker spacing, decreasing symmetry in marker placement, and defining a marker placement that can be tracked reliably in all orientations; (2) leverage MoCap manufacturer software tools (e.g., NatNet SDK) to capture individual marker visibility information during automated data acquisition; and (3) record native MoCap sessions during automated data acquisition for outlier detection in postprocessing. These changes should improve TES data quality by providing the information necessary to automatically remove erroneous tracking information. Beyond these near-term improvements, future work will explore extensions to the automated waypoint generation algorithm (Algorithm 1) to enable waypoint generation throughout the TES workspace. Instead of searching arbitrarily defined subregions to reduce computational overhead, these future methods will explore alternative constraints such as uniform endpoint spacing in task space, movement of feature points throughout the camera's field of view, etc. This proposed improvement to automated waypoint generation will provide a more complete exploration of the viable data acquisition space and may provide improvements when used to augment the TES calibration.
The autolabeling results shown in Fig. 8 qualitatively demonstrate the performance of the TES in the context of aerial refueling. Both the inflated cloth drag component and the coupler of the drogue appear to align well with the user-defined model despite the poor lighting of the coupler in the collected images. The added overlay of the drogue "spokes" shows marginal misalignment along the drogue's center axis. Future work will replace the manual definition of the drogue feature frame and Pd with values defined using a MoCap digitizing wand to trace drogue features. This will eliminate the manual definition of these values by the user and should improve autolabeling accuracy.
The TES presented in this work is a viable tool for generating large, autolabeled datasets of an aerial refueling drogue. Results demonstrate calibration methods providing ground-truth measurements with a mean precision of ±4.1 mm, and autolabeling of drogue images appears to be qualitatively accurate. Unlike traditional methods for establishing ground-truth labeling of images used to train a DNN, use of the TES (1) requires no time-consuming manual labeling or manual evaluation of ground truth and (2) provides a quantitative performance metric describing the precision of labeling in linear units. While this preliminary work was conducted under operator supervision to ensure hardware safety, supervision requirements can be relaxed or removed as the TES is further refined. Future work is proposed to address the identified limitations of and improvements to the system.