
Plant recognition of maize seedling stage in UAV remote sensing images based on H-RT-DETR

Abstract

The real-time monitoring and counting of maize plants at the seedling stage is of great significance for seed quality detection, field management and yield estimation. Traditional manual monitoring and counting is time-consuming, cumbersome and error-prone. In order to quickly and accurately identify and count maize seedlings in a complex field environment, this study proposes an end-to-end maize seedling detection model, H-RT-DETR (Hierarchical-Real-Time DEtection TRansformer), based on hierarchical feature extraction and RT-DETR (Real-Time DEtection TRansformer). H-RT-DETR uses Hierarchical Feature Representation and Efficient Self-Attention as the backbone network for feature extraction, thereby improving the network's ability to extract features of maize seedlings in UAV remote sensing images. In experiments on a UAV remote sensing dataset of maize at the seedling stage, the mean Average Precision values mAP0.5–0.95, mAP0.5 and mAP0.75 of the improved H-RT-DETR model reached 51.2%, 94.7% and 48.1%, respectively, and the Average Recall (AR) reached 68.5%. To verify the efficiency of the proposed method, H-RT-DETR was compared with widely used, state-of-the-art object detection methods. The results show that the detection accuracy of H-RT-DETR is better than that of the compared methods. In terms of detection speed, the H-RT-DETR model requires no Non-Maximum Suppression (NMS) post-processing, and its Frames Per Second (FPS) on the test dataset reaches 84 f/s, which is 19, 12, 11 and 21 f/s higher than YOLOv5, YOLOv7, YOLOv8 and YOLOX, respectively, under the same hardware environment. The model can provide technical support for real-time detection of maize seedlings in UAV remote sensing images in terms of both detection accuracy and speed (see https://github.com/wylSUGAR/H-RT-DETR for the model implementation and results).

Introduction

As one of the main grain crops in China, maize has a large planting area, accounting for 37.1% of the total grain planting area. It is mainly concentrated in the northeast, north, northwest and southwest regions, and contributes greatly to the increase in China's grain production [1]. Therefore, maintaining high and stable maize yields in China is essential to ensure food security. The emergence rate of maize at the seedling stage is an important parameter affecting maize planting, cultivation and subsequent yield, and is a key metric for evaluating the quality of maize varieties [2]. Traditional manual monitoring and counting of maize seedlings (regular manual inspection, counting and recording of seedling growth status) is time-consuming, cumbersome and error-prone, especially in the field environment, where manual methods are even less practical. Therefore, it is crucial to accurately identify seedling-stage maize in the field.

Efficient and accurate agriculture in the field environment requires more advanced monitoring technologies, and in recent years UAVs have developed rapidly and have been widely used in agricultural observation [3]. UAV remote sensing is low-cost, easy to operate and has good anti-interference ability; it can carry visible-light RGB, multispectral, hyperspectral and thermal infrared cameras to collect field crop growth information efficiently and non-destructively, making it a good choice for monitoring the growth of crop seedlings [4]. However, the data captured by drones generally contain a large number of small targets, especially for crops in the field. Owing to complex terrain and environmental constraints, the detection task faces challenges such as small target size and many interfering objects, all of which restrict the effective analysis and recognition of UAV remote sensing images. Therefore, accurately and efficiently processing UAV remote sensing imagery to identify small crop-seedling targets is both crucial and challenging.

At present, researchers at home and abroad have adopted deep learning methods to process crop image data and have made progress in crop recognition and counting. By using deep neural networks to automatically learn complex feature representations from image data, these methods have shown good results in traditional object detection tasks [5,6,7,8]. Most studies use a YOLO detector as the basic model for crop seedling recognition. Although these detection models achieve high accuracy, prior (anchor) boxes usually need to be set manually before training, and appropriate Non-Maximum Suppression (NMS) post-processing needs to be carefully selected [9]. As a result, the computational cost is high, real-time performance is poor, and it is difficult to meet the real-time processing needs of UAVs. In this context, Facebook proposed DETR in 2020, an end-to-end object detection algorithm based on the Transformer. DETR reformulates object detection as a set prediction problem, eliminating the threshold filtering and non-maximum suppression used in traditional dense post-processing steps. However, because of its large number of parameters, DETR's real-time detection performance is limited. To overcome the high computational cost caused by the large number of DETR parameters, Zhao et al. proposed RT-DETR, a real-time end-to-end detector [10]. Although RT-DETR achieves higher training accuracy with fewer iterations and excels in real-time detection, it is still inefficient in multi-scale feature recognition, and the fusion of features at different levels needs improvement. Wei et al. used a scattered-power decomposition method based on GD to suppress pseudo-power components under complex backgrounds and reduce background interference, and proposed DV-DETR, an improved detection model based on RT-DETR optimized for small-target detection in high-density scenes [11]. Liu et al. introduced a novel convolutional module, PfConv, into RT-DETR to enhance its ability to detect small and medium-sized objects in low-quality images [12]. Based on RT-DETR, Yang et al. introduced EDSR super-resolution and extensive image augmentation to propose ISTD-DETR, which significantly improved the detection accuracy of small infrared targets [13].

The above results show that there is room to optimize RT-DETR for small-target detection. From an application perspective, under the high viewing angle and complex backgrounds of UAV remote sensing, maize plants at the seedling stage are small and inconspicuous, making detection difficult. From a technical perspective, the feature extraction network of RT-DETR uses the CNN-based PResNet network, whose ability to extract global information is weak; when facing interference from the complex field environment, extracting global information helps improve detection. Although the Transformer-based Vision Transformer (ViT) can perform global modeling, its global self-attention makes the computational complexity grow quadratically with image size, so it is difficult to process high-resolution images. In addition, object detection usually requires multi-scale feature maps to capture objects of different sizes, but the traditional ViT produces only a single-resolution feature map and cannot meet this need. Replacing the feature extraction network of RT-DETR with a Transformer-based multi-scale global modeling network, while keeping the model complexity and computation within a manageable range, is therefore an urgent technical challenge.

In view of the above situation, we designed H-RT-DETR, a multi-level, multi-scale feature extraction object detection model based on RT-DETR. The improved H-RT-DETR uses a hierarchical Transformer to extract coarse features at high resolution and fine-grained features at low resolution, combined with an efficient self-attention mechanism, to enhance the model's ability to extract features of objects of different sizes. The multi-level, multi-scale feature layers are then fused in the encoder to capture the coarse- and fine-grained features of maize seedlings more comprehensively and accurately, thereby enhancing the accuracy and applicability of the model, as shown in Fig. 1. With this design, we expect the model to improve small-target detection accuracy while preserving real-time performance. Experiments verify that H-RT-DETR not only maintains real-time recognition of maize seedlings but also improves detection accuracy, providing technical support for maize cultivation, planting and yield improvement.

Fig. 1

Remote sensing images of maize seedlings are collected by UAV. After data preprocessing, hierarchical feature extraction is carried out through four Transformer Block modules (Efficient Self-Attention modules and mixed feed-forward networks), which extract coarse- and fine-grained features. The last three extracted feature layers are used as the input of RT-DETR to identify maize seedlings. Finally, compared with common object recognition networks, H-RT-DETR not only outperforms the other networks in recognition accuracy but also maintains a high FPS (Frames Per Second)

Materials and methods

Experimental site

The study area is located in the Jiangsu Agricultural Expo Park, Jurong City, Zhenjiang City, Jiangsu Province (32°1′24.25″N, 119°15′6.29″E), and the soil type is sandy loam, as shown in Fig. 2. The experimental area has a subtropical monsoon climate. Maize (variety Suyu 161) was planted on June 17, 2024. The experimental field size was 44 m × 56 m, the planting density was 57,000 plants/hm², and the row and plant spacing was 30 cm.

Fig. 2

The experimental area map

Image data acquisition and preprocessing

The images were collected on July 4, 2024, 17 days after sowing (DAS). Maize seedlings were at the V3–V5 true leaf stage, there were a few weeds in the field, and no obvious seedling adhesion was found. The UAV used for image acquisition was a DJI Mini 4 Pro, equipped with a 1/1.3-inch 48-megapixel visible-light sensor. The acquisition time was 11:30 a.m., in clear and windless weather. The flight height was 10 m and the overlap rate was 50%. A total of 661 original images of maize seedlings with a resolution of 8192 × 5460 pixels were obtained, and the original images were then cropped to 640 × 640 pixels.

The maize seedlings in the images were labeled using Labelme software (https://github.com/labelmeai/labelme), and the dataset was divided into training, evaluation and test sets in a ratio of 7:2:1. The training and evaluation sets were used for model training, and the test set was used for testing model performance. In addition, to improve the generalization ability of the model and compensate for the limited amount of data, the training and evaluation sets were augmented by changing the contrast (coefficients of 0.3 and 1.1), brightness (coefficients of 0.4 and 1.2) and color (coefficients of 0.3 and 1.3), and by adding motion blur and Gaussian noise. Finally, 3360 training images, 960 evaluation images and 480 test images were obtained. Figure 3 shows an original image of maize seedlings under UAV remote sensing, its Labelme annotation, and the images produced by the different data augmentation methods; a sketch of the augmentation pipeline is given below.
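The augmentation described above can be sketched in a few lines of Python. This is a minimal illustration using Pillow and NumPy, not the authors' code: the enhancement coefficients follow the values reported in the text, while the motion-blur kernel size, the noise standard deviation and the function name `augment_variants` are illustrative assumptions.

```python
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

def augment_variants(img: Image.Image) -> dict:
    """Generate augmented copies of one 640x640 training image (illustrative)."""
    out = {}
    # Contrast, brightness and color coefficients as reported in the text.
    for name, factor in [("contrast_0.3", 0.3), ("contrast_1.1", 1.1)]:
        out[name] = ImageEnhance.Contrast(img).enhance(factor)
    for name, factor in [("brightness_0.4", 0.4), ("brightness_1.2", 1.2)]:
        out[name] = ImageEnhance.Brightness(img).enhance(factor)
    for name, factor in [("color_0.3", 0.3), ("color_1.3", 1.3)]:
        out[name] = ImageEnhance.Color(img).enhance(factor)

    # Horizontal motion blur: 5x5 kernel with one non-zero row (assumed kernel size).
    kernel = [0.0] * 25
    kernel[10:15] = [1.0] * 5
    out["motion_blur"] = img.filter(ImageFilter.Kernel((5, 5), kernel, scale=5))

    # Additive Gaussian noise (sigma = 15 is an assumed value).
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, 15.0, arr.shape)
    out["gaussian_noise"] = Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))
    return out
```

Because these transformations are photometric, the Labelme bounding boxes can be reused unchanged for the augmented copies.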

Fig. 3

Images of maize seedlings under different conditions

Target detection method of maize seedlings

RT-DETR target detection architecture

YOLO-based object detectors are widely used in various practical scenarios, but they produce many overlapping detection boxes, which require NMS post-processing and slow down inference [14,15,16,17]. To remove operations such as hand-crafted anchors and NMS post-processing, Transformer-based end-to-end detectors have been proposed [18, 19].

The RT-DETR model is an end-to-end real-time object detection model based on the Transformer. It is mainly composed of a backbone, an efficient hybrid encoder and a Transformer decoder with auxiliary prediction heads [20]. The features output by the last three stages of the backbone network are used as the input of the encoder. The efficient hybrid encoder transforms the multi-scale features into a sequence of image features through attention-based intra-scale feature interaction (AIFI) and CNN-based cross-scale feature fusion (CCFF). Next, the uncertainty-minimal query selection module selects a fixed number of encoder features as the initial object queries for the decoder. The decoder, with its auxiliary prediction heads, iteratively refines the object queries and finally outputs the categories and bounding boxes. A conceptual sketch of this data flow is given below.
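The following is a conceptual sketch only, assuming placeholder modules named after the components described above (backbone, AIFI, CCFF, query selection, decoder); it does not reproduce the actual RT-DETR source code, and the number of object queries shown is an assumed typical value.

```python
# Conceptual data flow of RT-DETR as described in the text (not the reference code).
def rt_detr_forward(image, backbone, aifi, ccff, query_selection, decoder, num_queries=300):
    # 1. Backbone: keep the feature maps from the last three stages (S3, S4, S5).
    s3, s4, s5 = backbone(image)

    # 2. Efficient hybrid encoder: attention-based intra-scale interaction on the
    #    deepest map, then CNN-based cross-scale feature fusion.
    s5 = aifi(s5)
    memory = ccff([s3, s4, s5])          # flattened multi-scale feature sequence

    # 3. Uncertainty-minimal query selection picks the top encoder features
    #    as initial object queries (num_queries is an assumed typical setting).
    queries = query_selection(memory, k=num_queries)

    # 4. The decoder iteratively refines the queries; auxiliary heads supervise the
    #    intermediate layers, and the final layer outputs classes and boxes.
    boxes, classes = decoder(queries, memory)
    return boxes, classes
```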

The RT-DETR model has an efficient hybrid encoder and eliminates NMS post-processing, which has great potential in the field of real-time target detection. Therefore, this study chooses to design and improve the RT-DETR model.

Improved H-RT-DETR target detection network

The recognition of maize seedlings is a small-target detection task; maize seedlings in UAV remote sensing images are even smaller targets, which increases the difficulty of detection and recognition. To improve the feature extraction ability of the RT-DETR model for maize seedlings, the improved H-RT-DETR model replaces the feature extraction backbone with a multi-scale Transformer feature extraction network. Figure 4 shows the network structure of the improved H-RT-DETR model.

Fig. 4

The network structure of the H-RT-DETR model

The backbone of the H-RT-DETR model uses hierarchical feature representation (HFR) to perform multi-level, multi-scale feature extraction on maize seedlings, providing high-resolution coarse features and low-resolution fine-grained features [21]. The input of the backbone is a remote sensing maize seedling image of size \(H \times W \times 3\). It first passes through the Overlap Patch Embeddings module, which applies a convolutional layer and controls the edge overlap through K, S and P (K stands for kernel size, S for stride and P for padding), as shown in Fig. 5a. It then passes through four Transformer Blocks, producing feature layers \(F_{i}\) of different scales, where the size of each feature layer is \(\frac{H}{{2^{i + 1} }} \times \frac{W}{{2^{i + 1} }} \times C_{i}\), \(i \in \{ 1,2,3,4\}\).

Fig. 5

a Overlap Patch Embeddings/Merging; b Transformer Block

Each Transformer Block contains three modules: an Efficient Self-Attention module, a mixed feed-forward network (Mix-FFN) and an overlapped patch merging module. Each Transformer Block takes the output of the previous layer as input, passes it through N Efficient Self-Attention and Mix-FFN modules, and finally through an overlapped patch merging module, as shown in Fig. 5b. The Efficient Self-Attention module shortens the key/value sequence by a fixed ratio within the conventional self-attention computation, which reduces the computational complexity, as shown in Eqs. (1) and (2).

$$Attention(Q,K,V) = {\text{Softmax}}\left(\frac{QK^{T}}{\sqrt{d_{head}}}\right)V$$
(1)
$$\hat{K} = {\text{Reshape}}\left(\frac{N}{R},\ C \cdot R\right)(K), \qquad K = {\text{Linear}}\left(C \cdot R,\ C\right)(\hat{K})$$
(2)

where the Reshape operation reshapes K into a sequence of size \(\frac{N}{R} \times (C \cdot R)\), the Linear operation maps a \(C \cdot R\)-dimensional tensor to a C-dimensional tensor, and R denotes the reduction ratio. The final size of K is \(\frac{N}{R} \times C\), so the complexity of the self-attention mechanism is reduced from \(O(N^{2})\) to \(O(\frac{N^{2}}{R})\).
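The following PyTorch sketch implements Eqs. (1) and (2) directly: the key/value sequence is reshaped to N/R tokens and projected back to C channels before standard scaled dot-product attention. It is a minimal sketch, assuming a single attention head by default and that N is divisible by R; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Self-attention with sequence reduction, following Eqs. (1)-(2)."""

    def __init__(self, dim: int, num_heads: int = 1, reduction: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.reduction = reduction
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.sr = nn.Linear(dim * reduction, dim)   # Linear(C*R, C) in Eq. (2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape                            # N must be divisible by R
        R = self.reduction
        # Eq. (2): reshape the token sequence to (N/R, C*R), then project back to C.
        x_red = self.sr(x.reshape(B, N // R, C * R))
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        kv = self.kv(x_red).reshape(B, N // R, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)             # each: (B, heads, N/R, head_dim)
        # Eq. (1): scaled dot-product attention over the shortened keys/values.
        attn = (q @ k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```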

The main function of the mixed feed-forward network module is to remove the positional encoding used in the Transformer. This module uses a convolution operation to obtain the positional information of each feature layer, combined with a nonlinear activation function and fully connected operations, as in Eq. (3).

$${\text{x}}_{out} = MLP(GELU(Conv_{3 \times 3} (MLP(x_{in} )))) + x_{in}$$
(3)

where \(x_{in}\) represents the output of the Efficient Self-Attention module. Each Mix-FFN passes \(x_{in}\) through a fully connected layer, a convolution operation and a GELU activation function, and finally adds \(x_{in}\) as a residual. This operation not only effectively provides positional information but also removes the need for the fixed-resolution positional encoding required by the Transformer.
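A minimal PyTorch sketch of Eq. (3) is given below. The depthwise form of the 3 × 3 convolution and the expansion factor of 4 are SegFormer-style assumptions not stated in the text.

```python
import torch
import torch.nn as nn

class MixFFN(nn.Module):
    """Mix-FFN from Eq. (3): MLP -> 3x3 conv -> GELU -> MLP, plus a residual."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        B, N, C = x.shape                        # N == H * W tokens
        h = self.fc1(x)                          # inner MLP
        h = h.transpose(1, 2).reshape(B, -1, H, W)
        h = self.dwconv(h)                       # 3x3 conv leaks positional information
        h = h.flatten(2).transpose(1, 2)
        h = self.fc2(self.act(h))                # GELU + outer MLP
        return x + h                             # residual connection (the "+ x_in" term)
```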

The overlapped patch merging module halves the resolution of the feature layer through a convolution operation while increasing the number of channels, realizing the size and channel transformation between multi-level features, as shown in Eq. (4). By controlling the stride and padding between two blocks, this module preserves continuity between blocks and thus enables the extraction of both large-scale (coarse-grained) and small-scale (fine-grained) features of maize seedlings.

$$F_{i}\,\left( {\frac{H}{{2^{i + 1} }} \times \frac{W}{{2^{i + 1} }} \times C_{i} } \right) \to F_{i + 1}\,\left( {\frac{H}{{2^{i + 2} }} \times \frac{W}{{2^{i + 2} }} \times C_{i + 1} } \right),\quad i \in \{ 1,2,3\}$$
(4)
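Both the initial Overlap Patch Embedding and the overlapped patch merging of Eq. (4) can be realized as a strided convolution whose kernel is larger than its stride, so that neighbouring patches overlap. The sketch below follows this interpretation; the class name and the LayerNorm after the projection are illustrative assumptions rather than details stated in the text.

```python
import torch
import torch.nn as nn

class OverlapPatchMerging(nn.Module):
    """Overlapped patch embedding/merging realised as a strided convolution.

    With K > S the receptive fields of neighbouring patches overlap, keeping local
    continuity between blocks. K=3, S=2, P=1 halves the resolution as in Eq. (4);
    the first embedding layer of the backbone uses K=7, S=4, P=3.
    """

    def __init__(self, in_ch: int, out_ch: int, K: int = 3, S: int = 2, P: int = 1):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=K, stride=S, padding=P)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x: torch.Tensor):
        x = self.proj(x)                        # (B, C_out, H/S, W/S)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C_out) token sequence
        return self.norm(tokens), H, W
```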

The outputs of the last three Transformer Blocks, of size \(\frac{H}{{2^{i + 1} }} \times \frac{W}{{2^{i + 1} }} \times C_{i},\ i \in \{ 2,3,4\}\), are used as the input of the Efficient Hybrid Encoder, which transforms the multi-scale features into a sequence of image features through Attention-based Intra-scale Feature Interaction (AIFI) and CNN-based Cross-scale Feature Fusion (CCFF). A fixed number of image feature sequences are then selected by the Uncertainty-minimal Query Selection module and converted into the object queries of the decoder. Finally, the decoder iteratively refines these queries to generate the predicted categories and boxes [10]. A sketch assembling the backbone stages is given below.
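The sketch below assembles the modules sketched above into a four-stage backbone and returns the last three feature maps for the encoder, as described in this section. It is a structural illustration only (layer normalization and other details are omitted), and the class name `HierarchicalBackbone` is hypothetical.

```python
import torch
import torch.nn as nn

class HierarchicalBackbone(nn.Module):
    """Four-stage hierarchical feature extractor (a sketch, not the authors' code).

    Each stage = overlapped patch embedding/merging followed by `depth` pairs of
    EfficientSelfAttention + MixFFN. The last three feature maps are returned
    for the Efficient Hybrid Encoder.
    """

    def __init__(self, dims=(32, 64, 160, 256), depths=(2, 2, 2, 2), reduction=4):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for i, (dim, depth) in enumerate(zip(dims, depths)):
            # Stage 1 embeds with K=7, S=4, P=3; later stages merge with K=3, S=2, P=1.
            merge = OverlapPatchMerging(in_ch, dim, *((7, 4, 3) if i == 0 else (3, 2, 1)))
            blocks = nn.ModuleList(
                [nn.ModuleList([EfficientSelfAttention(dim, reduction=reduction),
                                MixFFN(dim)]) for _ in range(depth)]
            )
            self.stages.append(nn.ModuleList([merge, blocks]))
            in_ch = dim

    def forward(self, x: torch.Tensor):
        feats = []
        for merge, blocks in self.stages:
            tokens, H, W = merge(x)
            for attn, ffn in blocks:
                tokens = tokens + attn(tokens)   # residual attention (norms omitted)
                tokens = ffn(tokens, H, W)       # Mix-FFN already adds its residual
            x = tokens.transpose(1, 2).reshape(x.shape[0], -1, H, W)
            feats.append(x)
        return feats[1:]                         # F2, F3, F4 go to the encoder
```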

Training and implementation of H-RT-DETR

The model was trained on a cloud computing server. The server ran Ubuntu 20.04.6 LTS, with an NVIDIA GeForce RTX 4090 GPU (24 GB of video memory) and a 20-core CPU with 80 GB of memory. PyTorch 2.0 was used as the deep learning framework.

The input to the H-RT-DETR backbone was an image of size \(640 \times 640 \times 3\), which first passed through the Overlap Patch Embeddings module with K = 7, S = 4, P = 3. It then passed through four Transformer Block modules, each producing features of size \(\frac{H}{{2^{i + 1} }} \times \frac{W}{{2^{i + 1} }} \times C_{i}\), where \(H = W = 640\), \(C_{i} \in \{ 32,64,160,256\}\), \(i \in \{ 1,2,3,4\}\). The number of Efficient Self-Attention and Mix-FFN modules in the four Transformer Blocks was set to N = [2, 2, 2, 2], and the reduction ratio (R) of each Efficient Self-Attention module was set to 4. In addition, the hyperparameters of the Efficient Hybrid Encoder and the Uncertainty-minimal Query Selection of H-RT-DETR used in the experiments are given in Table 1.
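With the hyperparameters listed above, the illustrative backbone sketch can be instantiated and its output shapes checked for a 640 × 640 × 3 input; the expected sizes follow \(\frac{H}{{2^{i + 1} }} \times \frac{W}{{2^{i + 1} }} \times C_{i}\) for \(i \in \{2,3,4\}\). This is only a sanity check of the sketch, not the authors' configuration script.

```python
import torch

# Channel widths 32/64/160/256, N = [2, 2, 2, 2], reduction ratio R = 4.
backbone = HierarchicalBackbone(dims=(32, 64, 160, 256), depths=(2, 2, 2, 2), reduction=4)

with torch.no_grad():                      # shape check only; stage-1 attention is memory-heavy
    feats = backbone(torch.zeros(1, 3, 640, 640))
for f in feats:
    print(tuple(f.shape))
# Expected: (1, 64, 80, 80), (1, 160, 40, 40), (1, 256, 20, 20)
```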

Table 1 Details of H-RT-DETR and the compared methods

The model was trained without pre-trained weights for a total of 130 epochs with a batch size of 8. The AdamW optimizer was used to improve convergence efficiency, with an initial learning rate of 0.001 and a minimum learning rate of 0.00001.
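A minimal training-loop sketch with the stated settings (AdamW, initial learning rate 0.001, 130 epochs, batch size 8) is shown below. The decay schedule between the initial and minimum learning rates is not specified in the text, so cosine annealing down to 0.00001 is assumed here, and the loss interface of `model` is a placeholder.

```python
import torch

def train(model: torch.nn.Module, train_loader, epochs: int = 130) -> None:
    """Training loop sketch with the reported hyperparameters (AdamW, lr 1e-3, batch 8).

    `model` is assumed to return a scalar detection loss when called with a batch of
    images and their targets; the cosine schedule is an assumption, not a stated detail.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs, eta_min=1e-5)

    for epoch in range(epochs):
        for images, targets in train_loader:     # batches of 8 cropped 640x640 images
            loss = model(images, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                         # decay the learning rate once per epoch
```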

Model performance evaluation

The compared methods

To demonstrate the efficiency of the H-RT-DETR model, comparative tests were conducted against the following object detection methods. The training and evaluation sets for all networks were the 3360 and 960 RGB images of 640 × 640 × 3 pixels obtained in section “Image data acquisition and preprocessing”, respectively. All networks were trained for 130 epochs with a batch size of 8.

1. Research has shown that the YOLOv5, YOLOv7, YOLOv8 and YOLOX networks can accurately and efficiently identify and count small-target crops in UAV remote sensing images [22,23,24,25]. Therefore, these networks were used as comparison methods. None of them was trained with pre-trained weights; the backbone of YOLOv5 was CSPDarknet53 [26], YOLOv7 used the tiny version, YOLOv8 used the YOLOv8-s version, and YOLOX used the YOLOX-s version. These networks all used 3 sets of anchors: [10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119] and [116, 90, 156, 198, 373, 326]. Other training hyperparameters are shown in Table 1.

2. H-RT-DETR optimizes the feature extraction network of RT-DETR, so RT-DETR itself was also used as a comparison network. In this work, the feature extraction network of RT-DETR was PResNet [27] (a variant of ResNet) with a depth of 50 and 4 stages, and the last three output feature maps were used as the input of the Efficient Hybrid Encoder. Other training hyperparameters were the same as for H-RT-DETR.

Evaluation metrics

To verify the performance of the H-RT-DETR model, metrics based on Precision, Average Recall (AR), mean Average Precision (mAP) and Frames Per Second (FPS) were used to quantitatively evaluate all network models. Precision, Recall, AR and mAP are defined as follows:

$${\text{Precision }} = \, \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}$$
(5)
$${\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
(6)
$${\text{Average Recall(AR) }} = \, \frac{{1}}{{\text{N}}}\sum\limits_{i = 1}^{N} {{\text{Recall}}_{i} }$$
(7)
$${\text{mAP}} = \frac{{\sum\nolimits_{1}^{n} {\int_{0}^{1} {{\text{Precision}}({\text{Recall}})\,d({\text{Recall}})} } }}{n}$$
(8)

In Eqs. (5) and (6), TP, FP and FN represent the numbers of true positives, false positives and false negatives, respectively. In Eq. (7), N represents the number of samples evaluated by the network model. In Eq. (8), AP represents the area under the precision–recall (P-R) curve, and mAP represents the mean of the AP values over categories. In this experiment there is only one category, maize seedlings, so n = 1. mAP0.5 is the mAP at an IoU threshold of 0.5, mAP0.75 is the mAP at an IoU threshold of 0.75, and mAP0.5–0.95 is the average mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05. In addition, FPS is the number of image frames the detection network can process per second, a performance metric for evaluating the speed of the detection algorithm.
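The sketch below shows how the AP integral of Eq. (8), the mAP0.5–0.95 average and the FPS figure can be computed. In practice such metrics are usually obtained with standard COCO-style evaluation tools, so this NumPy version is purely illustrative and its function names are hypothetical.

```python
import time
import numpy as np

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    """Area under the P-R curve (Eq. 8), using all-point interpolation."""
    order = np.argsort(recall)
    r = np.concatenate(([0.0], recall[order], [1.0]))
    p = np.concatenate(([0.0], precision[order], [0.0]))
    # Make precision monotonically non-increasing before integrating.
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.sum(np.diff(r) * p[1:]))

def map_50_95(ap_per_iou: dict) -> float:
    """mAP0.5-0.95: mean AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = np.arange(0.5, 0.96, 0.05)
    return float(np.mean([ap_per_iou[round(t, 2)] for t in thresholds]))

def measure_fps(model, images) -> float:
    """Frames per second: images processed per second of wall-clock inference time."""
    start = time.perf_counter()
    for img in images:
        model(img)
    return len(images) / (time.perf_counter() - start)
```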

Results and discussion

Model training results

To verify the effect of the improved H-RT-DETR, the 3360 training images and 960 evaluation images described in section “The compared methods” were used to train a total of six detection models: YOLOv5, YOLOv7, YOLOv8, YOLOX, RT-DETR and H-RT-DETR. Figure 6 shows the curves of AR, mAP0.5, mAP0.75 and mAP0.5–0.95 for all models on the evaluation dataset. All models were trained for 130 epochs, and the evaluation metrics of each model stabilized, indicating that all models converged on the evaluation dataset. At around 20 epochs, H-RT-DETR does not yet have an advantage in AR, mAP0.5, mAP0.75 or mAP0.5–0.95 for maize seedling detection in UAV remote sensing images; the likely reason is that its feature extraction backbone is a multi-level, multi-scale Transformer network that must extract both high-resolution coarse features and low-resolution fine-grained features, and therefore needs more data, time and computation. After 20 epochs, however, the evaluation metrics of H-RT-DETR begin to outperform those of the other models, and all metrics finish in the lead except mAP0.75, which falls slightly behind YOLOX after 90 epochs.

Fig. 6

Comparison of evaluation metrics of the different detection models: a Average Recall (AR) curves for each model; b mAP0.5 curve for each model; c mAP0.75 curve for each model; d mAP0.5–0.95 curve for each model

Test results

To verify the generalization ability and recognition speed of the models, the 480 test images from section “Image data acquisition and preprocessing” were run on the trained models in the same environment as in section “Training and implementation of H-RT-DETR”; the results are shown in Table 2. The H-RT-DETR model performs best on the mAP0.5–0.95, mAP0.5, mAP0.75 and AR metrics, reaching 51.2%, 94.7%, 48.1% and 68.5%, respectively. The YOLOv7 model performs worst on these metrics, reaching 33.8%, 83.7%, 17.1% and 42.6%, respectively. YOLOX matches H-RT-DETR only on the mAP0.75 metric, on which both outperform the other models.

Table 2 Performance of the models over the test dataset

RT-DETR performs best in terms of FPS, reaching 87 f/s, meaning that in this experimental environment the RT-DETR model can detect and recognize maize seedlings in 87 test images per second. The FPS of H-RT-DETR is second only to RT-DETR, reaching 84 f/s. The likely reason is the hierarchical feature extraction backbone: obtaining the coarse-grained and fine-grained features of maize seedlings more accurately takes more time than in RT-DETR.

To further validate the recognition ability of the proposed H-RT-DETR model for maize seedlings, 10 original images of 8192 × 5460 × 3 pixels were randomly selected from the original remote sensing images underlying the 480 test images (with no obvious overlapping areas) and cropped to 5440 × 5440 × 3 pixels for a maize seedling counting experiment. By manual counting, these 10 test images contain a total of 2613 maize seedlings. The 10 images were then input into the 6 models for recognition and counting. The confidence threshold of all models was set to 0.5, meaning that a detection with a probability above 0.5 is counted as a maize seedling; a minimal sketch of this counting step follows. The experimental results are shown in Table 3. Of the 2613 maize seedlings in the 10 remote sensing images, the H-RT-DETR model detected 2585, of which 2582 were correctly identified; its Precision and Recall are 99.88% and 98.81%, respectively, better than those of the other models. YOLOX is second only to H-RT-DETR, with Precision and Recall of 99.84% and 96.59%, respectively.
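The counting step itself is simple: detections whose confidence is below 0.5 are discarded and the remainder are counted. The sketch below assumes the detections for one image are available as (box, score) pairs, which is an assumption about the output format rather than the authors' interface.

```python
def count_seedlings(detections, conf_threshold: float = 0.5) -> int:
    """Count maize seedlings in one image from (box, score) detection pairs.

    Only detections with a confidence above the threshold are counted, matching
    the 0.5 confidence setting used in the counting experiment.
    """
    return sum(1 for _box, score in detections if score > conf_threshold)
```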

Table 3 The counting results of maize seedlings by each model

Figure 7 shows the recognition and counting results of the H-RT-DETR and YOLOX models on one of the test images, in which 259 maize seedlings were counted manually. Figure 7a shows the result of the H-RT-DETR model, which detects a total of 256 maize seedlings, of which 255 are correctly recognized and 1 is misrecognized, while 4 are missed. Figure 7b shows the result of the YOLOX model, which detects 249 maize seedlings, all of which are correctly recognized, while 10 are missed.

Fig. 7

Comparison of maize seedling recognition and counting with the H-RT-DETR and YOLOX models: a Recognition and counting of maize seedlings by the H-RT-DETR model (red circles indicate missed seedlings, and blue indicates misrecognized ones); b Recognition and counting of maize seedlings by the YOLOX model (red circles indicate seedlings missed by both H-RT-DETR and YOLOX, and yellow circles indicate seedlings missed by YOLOX but recognized by H-RT-DETR); c Recognition of maize seedlings in the original 8192 × 5460 × 3 image by the H-RT-DETR model (red circles indicate missed seedlings)

Considering all the experimental metrics, the proposed H-RT-DETR model can not only recognize and detect maize seedlings in UAV remote sensing images more accurately, but also performs very well in terms of recognition speed. At the same time, H-RT-DETR places no limit on the image size: as shown in Fig. 7c, the H-RT-DETR network also recognizes maize seedlings well in the original 8192 × 5460 × 3 images captured by the UAV.

Discussion

Innovation of study

The complex field environment, coupled with the small size of maize seedlings in UAV remote sensing images, makes it difficult to cope with such challenges using manual counting or simple deep learning recognition methods [28]. Researchers mostly use YOLO-based object detection models for crop detection and recognition [29]. However, YOLO-based detection networks require NMS post-processing to remove overlapping detection boxes, which increases detection and recognition time. Few studies have applied Transformer-based detection models to crop recognition in UAV remote sensing images, and even fewer of these have focused on improving recognition speed to meet real-time requirements. The feature extraction backbone of RT-DETR is PResNet, a variant of ResNet. Although it effectively alleviates gradient vanishing, it mainly addresses local gradients, and its ability to model the global context is still weaker than that of a Transformer. UAV remote sensing images may contain complex noise, and maize seedlings are small targets, so deeper feature extraction and global modeling capabilities are needed to capture seedling features better. The feature extraction of H-RT-DETR is a multi-layer Transformer-based network with stronger global dependency modeling; combined with the Efficient Self-Attention module, the computational complexity is reduced and the speed of feature extraction is guaranteed.

The H-RT-DETR detection model proposed in this work takes RT-DETR, a Transformer-based real-time detector, and replaces its feature extraction network with a hierarchical feature representation. This Transformer-based multi-layer network has stronger global dependency modeling and enhances the extraction of coarse- and fine-grained features, while the Efficient Self-Attention mechanism reduces the computational complexity and preserves the speed of feature extraction. As a result, the H-RT-DETR model further improves the detection accuracy of maize seedlings in UAV remote sensing images while maintaining the detection speed of the network.

Ablation experiment

To further verify the improvement in detection performance brought by the proposed feature extraction scheme, several ablation experiments were carried out on the test set. These experiments mainly test the influence of the number of Efficient Self-Attention and Mix-FFN modules in each Transformer Block (i.e., the value of N, as shown in Fig. 5b) on model performance. The same hyperparameters and training strategy were used in each experiment. The results are shown in Table 4.

Table 4 Ablation experimental results

As can be seen from Table 4, four experimental settings are compared. Exp1 represents N = [2, 2, 2, 2], the configuration adopted in this paper, i.e., two Efficient Self-Attention and Mix-FFN modules in each of the four Transformer Blocks. The other three settings were N = [1, 1, 1, 1], N = [3, 3, 3, 3] and N = [1, 1, 3, 3]. The results show that with N = [3, 3, 3, 3], mAP and AR are the best, with mAP0.5–0.95, mAP0.5, mAP0.75 and AR reaching 53.7%, 95.6%, 49.4% and 69.7%, respectively, but the FPS is only 77 f/s. With N = [1, 1, 1, 1], mAP0.5–0.95, mAP0.5, mAP0.75 and AR are the worst, but the FPS is the best, reaching 89 f/s. With N = [1, 1, 3, 3], mAP0.5–0.95, mAP0.5, mAP0.75 and AR are slightly worse than with N = [3, 3, 3, 3] but better than with N = [2, 2, 2, 2]. The likely reason is that the four Transformer Blocks extract the large-scale (coarse-grained) and small-scale (fine-grained) features of the data hierarchically from front to back; maize seedlings are small targets that place higher demands on small-scale features, so increasing N in the last two Transformer Blocks markedly improves recognition, at the cost of FPS. Considering both accuracy and real-time performance, N is set to [2, 2, 2, 2] in this paper.

The ablation experiments show that the proposed H-RT-DETR model can be adapted to different scenarios by adjusting the value of N. If the overall context of the data or the recognition of large-scale features of the objects is of greater concern, the values of N in the first two Transformer Blocks can be increased appropriately; the values of N in the last two Transformer Blocks lean more toward the extraction of small-scale features. Naturally, the value of N also directly affects the FPS of the model.

Limitations of the study

Although the proposed H-RT-DETR model shows good performance, further improvements and extensive field testing are needed to verify its robustness under different operating conditions. For example, the sample images collected in this experiment did not cover more noise (such as more weeds or debris in the field, foggy weather, etc.) to verify the anti-interference ability of the model. In terms of flight height, only samples taken at 10 m were studied, without a detailed comparison against other flight heights (such as 15 m or 30 m). Moreover, although the H-RT-DETR model achieves better detection accuracy than RT-DETR, it does not improve real-time performance, and its frame rate is in fact slightly lower than that of RT-DETR.

To address these limitations, future work will test the model under more varied conditions, including different weather, longer observation periods and more diverse datasets, and will further study and optimize the computational complexity of the feature extraction module, so as to comprehensively validate the model and adapt it to practical applications.

Potential application

This work helps to achieve real-time and accurate monitoring of crops in UAV remote sensing images in complex farmland environments. Although UAV remote sensing can quickly capture images of field crops, the crops in the images are usually small targets, which poses a considerable challenge to detection models, and obtaining the growth information of farmland crops quickly or even in real time is crucial. The H-RT-DETR model proposed in this work not only achieves accurate recognition of maize seedling plants in UAV remote sensing images, but also maintains a high recognition speed. Although in the seedling counting test (Table 3) the improvements in Precision and Recall are numerically small (one to two decimal places), this is because the counting test contains only 2613 seedlings and the sample size is small; in absolute terms, H-RT-DETR correctly identified noticeably more seedlings than the other models. With a larger test sample, the gap may become more obvious, especially for applications with higher statistical requirements, which would better reflect the value of the model.

The H-RT-DETR model can be applied not only to the recognition of maize seedlings; its application to other agricultural scenarios, including weed recognition, crop pest and disease recognition, and fruit picking, is also worth studying. In addition, the recognition speed of the H-RT-DETR model meets real-time detection requirements, so it could be deployed on edge computing devices to further explore its use in real-time UAV remote sensing detection and robotic automatic picking.

Conclusions

In this study, a real-time recognition model, H-RT-DETR, for maize seedlings in UAV remote sensing images was proposed. Through a feature extraction network with hierarchical feature representation and an efficient self-attention mechanism, the H-RT-DETR model significantly improves the recognition accuracy of maize seedlings while maintaining the recognition speed. The results show that the H-RT-DETR model can accurately recognize maize seedlings in UAV remote sensing images (mAP0.5–0.95 = 51.2%, mAP0.5 = 94.7%, mAP0.75 = 48.1%, AR = 68.5%, FPS = 84 f/s). In addition, H-RT-DETR was compared with widely used detection models and methods, and the results show that it offers better detection performance and is an efficient and fast recognition tool. Finally, the maize seedling counting experiment on the test dataset shows that H-RT-DETR achieves more accurate recognition than the other models (Precision = 99.88%, Recall = 98.81%). Therefore, H-RT-DETR has great potential for accurate, real-time recognition of crops such as maize seedlings.

Data availability

The datasets used or analysed during the current study are available from the corresponding author on reasonable request.

References

1. NBSC. Statistical bulletin of national economic and social development of the People's Republic of China in 2023. Chin Stats. 2024;3:9–26.

2. Doebley J, Stec A, Hubbard L. The evolution of apical dominance in maize. Nature. 1997;386(6624):485–8.

3. Shuaibing L, Dameng Y, et al. Estimating maize seedling number with UAV RGB images and advanced image processing methods. Precision Agric. 2022;23(5):1604–32.

4. Jin X, Liu S, Baret F, Hemerlé M, Comar A. Estimates of plant density of wheat crops at emergence from very low altitude UAV imagery. Remote Sens Environ. 2017;198:105–14.

5. Sun J, Jia H, Ren Z, et al. Accurate rice grain counting in natural morphology: a method based on image classification and object detection. Comput Electron Agric. 2024;227(1):109490.

6. Tang B, Zhou J, Pan Y, et al. Recognition of maize seedling under weed disturbance using improved YOLOv5 algorithm. Measurement. 2025;242:115938.

7. Yuyun P, Nengzhi Z, Lu D, et al. Identification and counting of sugarcane seedlings in the field using improved Faster R-CNN. Remote Sens. 2022;14(22):5846.

8. Han Z, Cai Y, Liu A, Zhao Y, Lin C. MS-YOLOv8-based object detection method for pavement diseases. Sensors. 2024;24:4569. https://doi.org/10.3390/s24144569.

9. Carion N, Massa F, Synnaeve G, et al. End-to-end object detection with transformers. Cham: Springer International Publishing; 2020.

10. Zhao Y, Lv W, et al. DETRs beat YOLOs on real-time object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. p. 16965–74.

11. Wei X, Yin L, Zhang L, Wu F. DV-DETR: improved UAV aerial small target detection algorithm based on RT-DETR. Sensors. 2024;24:7376. https://doi.org/10.3390/s24227376.

12. Liu Z, Sun C, Wang X. DST-DETR: image dehazing RT-DETR for safety helmet detection in foggy weather. Sensors. 2024;24:4628. https://doi.org/10.3390/s24144628.

13. Yang H, Wang J, Bo Y, et al. ISTD-DETR: a deep learning algorithm based on DETR and super-resolution for infrared small target detection. Neurocomputing. 2025.

14. Jocher G. YOLOv5 release v7.0. https://github.com/ultralytics/yolov5/tree/v7.0. 2022.

15. Jocher G. YOLOv8. https://github.com/ultralytics/ultralytics/tree/main. 2023.

16. Wang CY, Bochkovskiy A, Liao HYM. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 7464–75.

17. Ge Z. YOLOX: exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430. 2021.

18. Chen Q, Chen X, et al. Group DETR: fast training convergence with decoupled one-to-many label assignment. arXiv preprint arXiv:2207.13085. 2022.

19. Chen Q, Wang J, et al. Group DETR v2: strong object detector with encoder-decoder pretraining. arXiv preprint arXiv:2211.03594. 2022.

20. Xie E, Wang W, Yu Z, et al. SegFormer: simple and efficient design for semantic segmentation with transformers. Adv Neural Inf Process Syst. 2021;34:12077–90. https://doi.org/10.48550/arXiv.2105.15203.

21. Liu Z, Lin Y, Cao Y, et al. Swin Transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. p. 10012–22.

22. Peng H, Li Z, Zou X, et al. Research on litchi image detection in orchard using UAV based on improved YOLOv5. Expert Syst Appl. 2025. https://doi.org/10.1016/j.eswa.2024.125828.

23. Li Z, Zhu Y, Sui S, et al. Real-time detection and counting of wheat ears based on improved YOLOv7. Comput Electron Agric. 2024. https://doi.org/10.1016/j.compag.2024.108670.

24. Niu S, Nie Z, Li G, et al. Multi-altitude corn tassel detection and counting based on UAV RGB imagery and deep learning. Drones. 2024;8(5):198. https://doi.org/10.3390/drones8050170.

25. Chao-yu S, Fan Z, Jian-sheng L, et al. Detection of maize tassels for UAV remote sensing image with an improved YOLOX model. J Integr Agric. 2023;22(6):1671–83. https://doi.org/10.1016/j.jia.2022.09.021.

26. Wang CY, Liao HYM, Wu YH, et al. CSPNet: a new backbone that can enhance learning capability of CNN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020. https://doi.org/10.48550/arXiv.1911.11929.

27. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. CoRR. 2015;abs/1512.03385.

28. Wang L, Wang G, Yang S, et al. Research on improved YOLOv8n based potato seedling detection in UAV remote sensing images. Front Plant Sci. 2024. https://doi.org/10.3389/fpls.2024.1387350.

29. Li H, Wu J. LSOD-YOLOv8s: a lightweight small object detection model based on YOLOv8 for UAV aerial images. Eng Lett. 2024. https://doi.org/10.1002/tee.24195.


Funding

This work was supported by the National Key Research and Development Program of China (2023YFD1900704).

Author information

Authors and Affiliations

Authors

Contributions

Yunlong Wu: Writing – review & editing, Writing – original draft, Investigation, Visualization, Methodology, Conceptualization. Shouqi Yuan: Methodology, Conceptualization. Lingdi Tang: Writing – review & editing, Writing – original draft, Visualization, Supervision, Conceptualization. All authors reviewed the manuscript.

Corresponding author

Correspondence to Lingdi Tang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Wu, Y., Yuan, S. & Tang, L. Plant recognition of maize seedling stage in UAV remote sensing images based on H-RT-DETR. Plant Methods 21, 60 (2025). https://doi.org/10.1186/s13007-025-01382-9


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13007-025-01382-9

Keywords