Comparison of YOLO-based sorghum spike identification detection models and monitoring at the flowering stage
Plant Methods volume 21, Article number: 20 (2025)
Abstract
Monitoring sorghum during the flowering stage is essential for effective fertilization management and improving yield quality, with spike identification serving as the core component of this process. Factors such as varying heights and weather conditions significantly influence the accuracy of sorghum spike detection models, and few comparative studies exist on model performance under different conditions. YOLO (You Only Look Once) is a deep learning object detection algorithm. In this research, images of sorghum during the flowering stage were captured at two heights (15 m and 30 m) in 2023 via a UAV and used to train and evaluate variants of YOLOv5, YOLOv8, YOLOv9, and YOLOv10. This investigation aimed to assess the impact of dataset size on model accuracy and to predict sorghum flowering stages. The results indicated that YOLOv5, YOLOv8, YOLOv9, and YOLOv10 achieved mAP@50 values of 0.971, 0.968, 0.967, and 0.965, respectively, with dataset sizes ranging from 200 to 350. YOLOv8m performed best on the 15sunny and 15cloudy datasets and, overall, exhibited superior adaptability and generalizability. Predictions of the flowering stage using YOLOv8m were more accurate at heights between 12 and 15 m, with R2 values ranging from 0.88 to 0.957 and rRMSE values between 0.111 and 0.396. This research addresses a significant gap in the comparative evaluation of models for sorghum spike detection, identifies YOLOv8m as the most effective model, and advances flowering stage monitoring. These findings provide theoretical and technical foundations for the application of YOLO models in sorghum spike detection and flowering stage monitoring, offering a practical means for timely and efficient management of sorghum flowering.
Introduction
Sorghum (Sorghum bicolor (L.) Moench) ranks as the fifth most widely cultivated crop globally, covering 42 million hectares, and serves as a critical food source [1, 2]. Effective monitoring of sorghum growth, along with timely management optimization and yield prediction, is essential for ensuring high and stable production levels [3]. Conventional monitoring methods are often labor intensive, inefficient, and less precise [4]. In recent years, the rapid development of computer vision [5] in crop growth monitoring [6, 7] has led to the widespread application of plant phenotyping monitoring based on unmanned aerial vehicle (UAV) imagery [8, 9, 10]. UAVs offer distinct advantages, such as cost-effectiveness, ease of operation, and high efficiency [11]. Moreover, UAVs are capable of carrying diverse types of sensing equipment, including RGB cameras [12], thermal infrared devices [13], multispectral sensors [14], hyperspectral sensors [15], and LiDAR systems [16]. These capabilities have established UAVs as versatile and widely utilized platforms for crop growth monitoring in recent years [17].
The sorghum flowering stage, defined as the stage when 50% of sorghum plants begin flowering, is a crucial physiological stage in sorghum growth [18] that directly impacts the final yield and quality. Monitoring the flowering stage facilitates reasonable flowering stage management and harvest timing, thereby maximizing economic returns. Pan et al. [19] successfully identified the initial anthesis of soybeans using UAV-acquired multispectral time series images, achieving RMSE and MAE values of 3.79 days and 3.00 days, respectively. Similarly, Guo et al. [20] developed a metric that combines spectral indices (NGBDI, GBDI) and textural features (contrast) to detect maize staminate flowers using UAV-captured maize RGB images, achieving an RMSE of 5.766 days. However, reliance on spectral data collection poses challenges because of its dependence on sunny weather conditions, which significantly limits operational flexibility.
In recent years, the use of unmanned aerial vehicle (UAV) platforms to acquire high-resolution RGB images, combined with deep learning algorithms to rapidly identify crop ears [21] and monitor flowering conditions [22], has shown great potential. Compared with traditional methods, deep learning algorithms offer greater accuracy, can identify targets in complex backgrounds, and can meet the needs of large-scale, efficient, and rapid crop growth monitoring. YOLO, a deep learning model based on convolutional neural networks developed by Redmon et al. [23], employs a classification/regression framework for target detection. It rapidly predicts both object classes and bounding box coordinates [24] and has gained significant attention in plant phenotype monitoring. For example, Zhao et al. [25, 26] utilized an improved YOLOv5-based oriented and small wheat spike detection model (OSWSDet) to monitor wheat spikes, achieving an average accuracy of 90.5% at maturity; their WheatNet detection model recorded average accuracies of 90.1% during the filling stage and 88.6% at maturity. Yu et al. [27] proposed an enhanced maize tassel detection model, PConv-YOLOv8, derived from YOLOv8, and compared it with seven advanced deep learning approaches; among these, PConv-YOLOv8×6 demonstrated the optimal balance between detection accuracy and parameter efficiency. Gao et al. [28] developed an improved maize tassel detection model, YOLOv5-Tassel (YOLOv5-T), which achieved an average detection accuracy of 98.70%. These findings indicate that YOLO-based models exhibit significant potential for crop tassel detection.
YOLO has undergone multiple iterations since its development. YOLOv5 is widely adopted for its rapid detection speed and high accuracy [29]. YOLOv8 has garnered attention for its anchor-free detection model and decoupled head architecture, facilitating easier optimization [30]. YOLOv9 incorporates programmable gradient information (PGI), addressing the issue of information loss and enhancing model learning efficiency [31]. The latest version, YOLOv10, introduces a continuous dual allocation strategy that eliminates the need for NMS training, significantly boosting performance while reducing latency [32]. YOLO iterations have different focuses and do not necessarily lead to directly increased accuracy [24]. Furthermore, the high density of panicles in UAV-acquired images of sorghum during flowering, along with significant occlusion and overlap, presents challenges for different YOLO models. Therefore, comparing the adaptability and generalizability of different models to select the optimal model is fundamental for flowering stage monitoring and model improvement. Additionally, YOLO requires a large amount of training data to ensure accuracy, which is data that is often difficult to obtain and requires significant time for annotation [33]. However, previous studies rarely report the impact of dataset size on model accuracy or compare model performance. Therefore, evaluating the impact of dataset size on the performance of YOLO series models in sorghum panicle identification and flowering stage monitoring, as well as assessing model adaptability and generalizability, is highly important for the application of YOLO in sorghum flowering stage monitoring.
In summary, YOLO shows excellent potential for sorghum panicle identification and flowering stage monitoring. In this study, we tested 21 variants of four models to investigate model adaptability and generalizability, as well as the appropriate dataset size. The objectives of this study are as follows: (1) To test and compare model performance under different dataset sizes in order to determine the most suitable dataset size, reduce the workload of target labeling, better control the cost of sorghum flowering stage monitoring, and improve efficiency and accuracy. (2) To evaluate the adaptability and generalizability of the models by testing YOLO accuracy on UAV remote sensing images acquired at different altitudes and under varying weather conditions, thereby selecting the most suitable model for monitoring the sorghum flowering stage and providing researchers with an appropriate baseline for algorithm improvement. (3) To provide a basis for sorghum flowering stage management by rapidly and accurately identifying the number of panicles needed to determine the flowering stage, laying a foundation for the scientific and efficient management of sorghum.
Materials and methods
Overview of the study area and experimental design
The images for the 2023 sorghum flowering stage in this study were collected from Lianxing Village, Jichang Township, Xixiu District, Anshun City, Guizhou Province (26°5’54.15"N, 106°6’3.258"E) (see Fig. 1). The varieties included Qiangao 8 and Hongyingzi. A randomized block design was employed, featuring three replications for each variety and six nitrogen fertilizer treatments across 36 plots. The nitrogen fertilizer gradients ranged from 0 to 300 kg/hm2 on a pure N basis, comprising N1 (0 kg/hm2), N2 (60 kg/hm2), N3 (120 kg/hm2), N4 (180 kg/hm2), N5 (240 kg/hm2), and N6 (300 kg/hm2). Each experimental plot measured 5 m in length and 4 m in width, covering an area of 20 m2, with a row spacing of 0.6 m and a plant spacing of 0.3 m. Seedlings were planted on April 16, 2023, and transplanted on May 7, 2023, with all other management practices consistent with local standards. The 2024 sorghum flowering stage images were obtained from Ludi Village, Shiban Town, Huaxi District, Guiyang City, Guizhou Province (26°25’52.63"N, 106°33’59"E) (see Fig. 1). The varieties used were Qiangao 8 and Hongyingzi. This experiment also followed a randomized block design with three replications for each variety and four nitrogen fertilizer treatments across 24 plots. The nitrogen fertilizer gradient ranged from 0 to 300 kg/hm2 on a pure N basis, comprising N1 (0 kg/hm2), N2 (100 kg/hm2), N3 (200 kg/hm2), and N4 (300 kg/hm2). Each plot measured 5.5 m in length and 4 m in width, covering an area of 22 m2, with a row spacing of 0.6 m and a plant spacing of 0.3 m. Seedlings were planted on April 12, 2023, and thinned on June 14, 2023, with all other management practices aligned with local agricultural standards.
Data acquisition and pre-processing
Data acquisition
In 2023, a DJI M300 RTK UAV equipped with a Zenmuse L1 sensor was used to collect RGB images at altitudes of 15 m (both cloudy and sunny), 30 m, and 45 m. The UAV operated in autoflight mode with a bypass overlap rate of 70%, a heading overlap rate of 70%, an airspeed of 1 m/s, and a resolution of 5472 × 3648 pixels. Additionally, a Zenmuse H20T infrared-visible camera was used to capture wide-angle and zoomed RGB images at 15 m altitude on sunny days, also in autoflight mode. These images were acquired with a bypass overlap rate of 80%, a heading overlap rate of 70%, and an airspeed of 1 m/s. The wide-angle images had a resolution of 4056 × 3040 pixels, while the zoom images were 5184 × 3888 pixels, and all the images were stored in JPG format. In 2024, sorghum flowering stage images were collected via a DJI MAVIC 3 M UAV at altitudes of 12 m, 15 m, 30 m, and 45 m. The UAV operated in automatic flight mode with a bypass overlap rate of 80%, a heading overlap rate of 70%, a resolution of 5280 × 3956 pixels, and stored the images in JPG format.
Data pre-processing
From the wide-angle RGB images acquired by the Zenmuse H20T at 15 m altitude in 2023, 152 orthoimages of field trial plots were selected, and 600 images of 640 × 640 pixels were cropped using Photoshop CS5. To improve model stability and robustness, data augmentation was performed with 90° and 180° rotations and flipping, expanding the dataset to 800 images for training. The 800 images were randomly numbered to ensure random distribution within the dataset and minimize differences between datasets. Additionally, the Zenmuse L1 images acquired at 15 m (cloudy and sunny) and 30 m (sunny) were cropped into 80, 80, and 30 images of 640 × 640 pixels, respectively, and 80 zoom photos from the Zenmuse H20T at 15 m altitude were used to construct datasets for evaluating model adaptability and generalization. The images collected during 2024 were processed using Metashape 2.0.0 to create stitched orthomosaics, which were cropped with Photoshop CS5 into 24 images containing complete plots for spike counting and flowering stage determination.
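For illustration, the sketch below shows how a large plot orthoimage can be tiled into non-overlapping 640 × 640 crops; it is a minimal Pillow-based example with hypothetical file names, not the Photoshop workflow actually used in the study.

```python
from pathlib import Path
from PIL import Image

def tile_image(src_path, out_dir, tile=640):
    """Cut a large orthoimage into non-overlapping tile x tile crops and save them as JPG."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    img = Image.open(src_path)
    w, h = img.size
    count = 0
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            crop = img.crop((left, top, left + tile, top + tile))
            crop.save(out_dir / f"{Path(src_path).stem}_{top}_{left}.jpg")
            count += 1
    return count

# Hypothetical usage: tile one plot orthoimage into 640 x 640 patches.
# n_tiles = tile_image("plot_orthoimage.jpg", "tiles_640")
```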
Image annotation and dataset construction
Each sorghum spike was annotated with a manually drawn bounding box using Labelimg software, as illustrated in Fig. 2. During the annotation process, occluded and overlapping spikes were annotated in their complete forms. A total of 39,888 spike targets were annotated across 800 images collected at 15 m altitude. The same annotation method was consistently applied throughout the process.
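For reference, annotation tools such as LabelImg in YOLO mode write one text file per image, with each spike stored as a class index followed by the normalized box centre and size. The helper below, a small sketch with hypothetical values, converts a pixel-space bounding box into that label format.

```python
def to_yolo_line(cls_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel bounding box to a YOLO label line:
    '<class> <x_center> <y_center> <width> <height>', all normalized to [0, 1]."""
    x_c = (x_min + x_max) / 2.0 / img_w
    y_c = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{cls_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# A hypothetical 100 x 80 px spike box in a 640 x 640 crop (single class 0 = spike):
# print(to_yolo_line(0, 270, 300, 370, 380, 640, 640))
```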
Based on the annotated images, six datasets—100train, 200train, 350train, 500train, 650train, and 800train—were constructed, containing 100, 200, 350, 500, 650, and 800 images, respectively. These datasets were randomly divided into training, validation, and test sets at an 8:1:1 ratio to prevent overfitting during model training. Additional datasets included 15 m sunny (15sunny), 15 m cloudy (15cloudy), and 15 m zoom (15zoom), each comprising 80 images, and a 30 m sunny (30sunny) dataset with 30 images. After labeling, the model’s adaptability to different conditions was verified, with a total of 14,324 targets labeled across the four datasets. The dataset details are presented in Table 1, with example images shown in Fig. 3.
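A minimal sketch of the random 8:1:1 split described above is shown below; the directory layout and file extensions are assumptions for illustration rather than the exact scripts used in the study.

```python
import random
import shutil
from pathlib import Path

def split_dataset(image_dir, label_dir, out_dir, ratios=(0.8, 0.1, 0.1), seed=42):
    """Randomly assign image/label pairs to train, val, and test subsets at the given ratios."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n_train = int(len(images) * ratios[0])
    n_val = int(len(images) * ratios[1])
    splits = {
        "train": images[:n_train],
        "val": images[n_train:n_train + n_val],
        "test": images[n_train + n_val:],
    }
    for name, files in splits.items():
        img_out = Path(out_dir) / name / "images"
        lbl_out = Path(out_dir) / name / "labels"
        img_out.mkdir(parents=True, exist_ok=True)
        lbl_out.mkdir(parents=True, exist_ok=True)
        for img in files:
            shutil.copy(img, img_out / img.name)
            lbl = Path(label_dir) / (img.stem + ".txt")
            if lbl.exists():
                shutil.copy(lbl, lbl_out / lbl.name)

# Hypothetical usage for the 350-image dataset:
# split_dataset("350train/images", "350train/labels", "datasets/350train")
```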
Model training
The experiments were conducted on a desktop computer with an Intel I9-14900KF processor, 64 GB of RAM, and an Nvidia RTX 4070 SUPER graphics card with 12 GB of video memory. The YOLOv5, YOLOv8, YOLOv9, and YOLOv10 environments were configured using Anaconda 3 on a Windows 11 operating system. The input image sizes for training were standardized to 640 × 640 pixels, with training epochs set to 150 and batch sizes ranging from 8 to 16. All other hyperparameters were left at default values. A total of 126 training results were generated to evaluate the accuracy of the models across different dataset sizes. The procedure is illustrated in Fig. 4. The model with the highest accuracy was validated on the test set and tested on the 15zoom, 15cloudy, 15sunny, and 30sunny datasets to assess adaptability and generalizability across different altitudes and weather conditions. The best-performing model was ultimately selected for monitoring the flowering stage of sorghum.
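To make the training setup concrete, the sketch below reproduces the reported settings (640-pixel inputs, 150 epochs, batch size in the 8–16 range, default hyperparameters) through the Ultralytics Python API; the dataset YAML path is hypothetical, and the YOLOv5, YOLOv9, and YOLOv10 variants were trained analogously from their respective repositories.

```python
from ultralytics import YOLO

# Fine-tune a pretrained YOLOv8m checkpoint on the sorghum spike dataset.
model = YOLO("yolov8m.pt")
model.train(
    data="sorghum_spikes.yaml",  # hypothetical dataset config (train/val/test paths, one 'spike' class)
    imgsz=640,                   # input image size used in this study
    epochs=150,                  # training epochs
    batch=16,                    # batch size within the 8-16 range
)

# Evaluate on the held-out test split and report mAP@50.
metrics = model.val(split="test")
print("mAP@50:", metrics.box.map50)
```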
Introduction of the YOLO series
The YOLO (You Only Look Once) algorithm, developed by Joseph Redmon and his team in 2015, is a target detection algorithm capable of performing real-time detection and classification of multiple objects. Its key advantage is its ability to directly output object categories and bounding boxes in a single operation. Owing to its open nature and ease of further development, YOLO has been widely adopted and improved for various target detection tasks. Since its initial release, the algorithm has undergone numerous iterations to improve accuracy and adaptability across diverse scenarios, with the latest version being YOLOv10.
YOLOv5, an enhancement of YOLOv4 developed by Glenn Jocher, is currently the most widely used YOLO model and is structured with three main components: Backbone, Neck, and Head, as illustrated in Fig. 5. During the input stage, images are pre-processed by the algorithm. The Backbone performs information compression and feature combination to improve the efficiency of feature extraction. The Neck enhances the model’s ability to represent features and expands the receptive field, thereby boosting detection performance before the information is forwarded to the Head. The Head detects features, generates bounding boxes, classifies objects, and outputs the results, completing the target detection process. YOLOv5 is available in ten size variants: l, m, n, s, x, and l6, m6, n6, s6, x6. This study focuses on comparing the performance of five variants: l, m, n, s, and x.
The YOLOv8 network architecture comprises three components: Backbone, Neck, and Head. The Backbone replaces YOLOv5’s C3 module with the more gradient-efficient C2f module, which adjusts the channel counts on the basis of the model scale. The Neck uses a PAFPN module for feature fusion, facilitating the processing of feature maps from varying scales. The Head employs a Decoupled-Head structure, separating classification and detection tasks while adopting an anchor-free design. This anchor-free approach reduces the number of predicted bounding boxes, accelerates non-maximum suppression (NMS), and improves detection efficiency. YOLOv8 is available in five size variants: l, m, n, s, and x.
YOLOv9 introduces the Programmable Gradient Information (PGI) concept, which addresses the need for adaptable deep networks that can handle multiple target tasks. PGI provides comprehensive input information for calculating objective functions and reliable gradients for updating network weights. A lightweight architecture, the Generalized Efficient Layer Aggregation Network (GELAN), is implemented using traditional convolutional operators to improve parameter efficiency. This approach demonstrates the effectiveness of PGI, particularly in lightweight models. Improvements in YOLOv9 address the issue of information loss, leading to increased detection accuracy. YOLOv9 offers five size variants: c, e, m, s, and t.
YOLOv10, the most recent version of the YOLO series, is optimized for real-time, end-to-end object detection. It introduces a continuous dual assignment strategy, enabling NMS-free training that significantly enhances performance and reduces latency. Key innovations include a lightweight classification head employing depthwise separable convolutions to lower computational demands without compromising performance; spatial-channel decoupled downsampling to improve efficiency while minimizing information loss through separate spatial and channel operations; and rank-guided block design, which optimizes parameter use by tailoring block complexity to the redundancy at different stages. YOLOv10 includes six size variants: b, l, m, n, s, and x. This study employs the original models without any modifications to the code.
In this study, P (precision), R (recall rate), mAP (mean average precision), and F1 (harmonic average) were utilized to assess the performance of the YOLO model. Their definitions are provided in Eqs. (1, 2, 3, 4 and 5), where TP refers to a true positive (a correctly predicted spike), FP indicates a false positive (a spike predicted where none exists in the image), FN denotes a false negative (the number of missed sorghum spikes), N represents the total number of images, and Nt specifies the number of detected categories. AP is the area under the precision-recall curve, whereas mAP is defined as the average of the mean accuracies for all dataset categories. As the detection in this study involves only a single target class, the AP is equivalent to the mAP.
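In their standard form, and consistent with the definitions above, these metrics can be written as:

\(P = \frac{TP}{TP + FP}\) (1)

\(R = \frac{TP}{TP + FN}\) (2)

\(F1 = \frac{2 \times P \times R}{P + R}\) (3)

\(AP = \int_{0}^{1} P(R)\,dR\) (4)

\(mAP = \frac{1}{N_t}\sum_{i=1}^{N_t} AP_i\) (5)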
The coefficient of determination (R2), root mean square error (RMSE), and relative root mean square error (rRMSE) were employed as evaluation metrics, as defined in Eqs. (6, 7 and 8), to evaluate the model’s performance on the sorghum spike counting results.
where \(Y_i\) is the number of annotated sorghum spikes in the i-th image, \(\bar{Y}\) denotes the average number of annotated sorghum spikes, \(X_i\) is the number of sorghum spikes predicted by the model, and \(n\) is the total number of test images. The results were statistically analyzed using Excel 2021 and OriginPro 2021.
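In their standard form, using this notation, the three metrics are:

\(R^2 = 1 - \frac{\sum_{i=1}^{n}{(X_i - Y_i)}^2}{\sum_{i=1}^{n}{(Y_i - \bar{Y})}^2}\) (6)

\(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{(X_i - Y_i)}^2}\) (7)

\(rRMSE = \frac{RMSE}{\bar{Y}}\) (8)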
Results and analysis
Analysis of dataset size differences between models
The key evaluation parameters for deep learning model accuracy include mAP@50, mAP@50–95, P, R, and F1, with mAP@50 being the most widely used metric for accuracy assessment. This study evaluates the performance of various models trained on datasets of different sizes, selecting optimal results on the basis of mAP@50 for further analysis. The performance differences for YOLOv5, YOLOv8, YOLOv9, and YOLOv10 across dataset sizes (100train, 200train, 350train, 500train, 650train, and 800train) are shown in Fig. 6, while the numerical results are presented in Appendix Table 1. For YOLOv5, the mAP@50 ranges from 0.933 to 0.971, with the highest mAP@50 values for models l, m, n, s, and x observed at dataset sizes of 350, 800, 650, 350, and 350, respectively. YOLOv5n achieves the highest accuracy among these models.
In YOLOv8, the mAP@50 varies between 0.934 and 0.968, with maximum values for models l, m, n, s, and x occurring at dataset sizes of 650, 800, 800, 650, and 800, respectively, where YOLOv8n achieves the highest accuracy.
YOLOv9 yields mAP@50 values ranging from 0.861 to 0.967. Models c, e, m, s, and t reach their maximum mAP@50 at dataset sizes of 800, 650, 800, 800, and 650, respectively, with YOLOv9m and YOLOv9e achieving the highest accuracy at an mAP@50 of 0.967.
In YOLOv10, the mAP@50 ranges from 0.905 to 0.965, with the highest values for models b, l, m, n, s, and x observed at dataset sizes of 350, 800, 650, 800, 650, and 800, respectively. Among these methods, YOLOv10b achieves the highest model accuracy.
The results demonstrate that, except for YOLOv5, the accuracies of YOLOv8, YOLOv9, and YOLOv10 generally improve as the dataset size increases. Notably, for YOLOv8l, the lowest mAP@50 occurs at a dataset size of 500, whereas for all other models, the mAP@50 minima are observed at a dataset size of 100. This trend indicates that expanding the dataset size beyond 100 enhances model accuracy. Based on these observations, the top 21 training results, out of a total of 126 results, were selected for subsequent validation and testing, with selection criteria focused on achieving the highest mAP@50 values.
Validation of the accuracy of the test sets
To evaluate model performance in detecting flowering-stage spikes in unseen data, the selected models were validated on the test set. The mAP@50 results for the test and validation sets, along with their absolute differences, are presented in Fig. 7. On the validation set, mAP@50 ranged from 0.960 to 0.967 for YOLOv5, 0.956 to 0.968 for YOLOv8, 0.954 to 0.968 for YOLOv9, and 0.952 to 0.964 for YOLOv10. All the models achieved mAP@50 values exceeding 0.95 on the test set, and the absolute differences between the validation and test sets were less than 0.015. These results indicate robust detection capabilities for unknown data across all the models and their variants.
Comparison of computational resources and inference times for different models
The YOLO series is a real-time detection model, and the size of its parameters directly affects GPU memory usage, which in turn impacts the computational workload, training time, and inference time. This is a critical factor that must be considered when selecting and deploying the model. Fig. 8 shows the model sizes, training times, and inference times for YOLOv5, YOLOv8, YOLOv9, and YOLOv10. For all the models except YOLOv9s, the training and inference times increased with model size. The overall ranking for training time, from highest to lowest, was YOLOv9 > YOLOv10 > YOLOv8 > YOLOv5, whereas for inference time, it was YOLOv9 > YOLOv8 > YOLOv10 > YOLOv5. The model size ranking was YOLOv5 > YOLOv9 > YOLOv8 > YOLOv10.
Among the models, YOLOv5n demonstrated the smallest inference time, training time, and model size. For a dataset size of 350, YOLOv5n required only 3.6 min for training, had a compact model size of 3.7 MB, and achieved an inference time of just 5.5 ms.
The training progress of the models, depicted in Fig. 9, illustrates the trends of mAP@50, mAP@50–95, precision (P), and recall (R) as the number of epochs increases. YOLOv5n, YOLOv8n, and YOLOv9t reach stable performance after 30, 25, and 60 epochs, respectively. In contrast, YOLOv10n demonstrates a rapid increase in these metrics before epoch 9, followed by a gradual increase after epoch 10. In terms of training speed, YOLOv8n > YOLOv10n > YOLOv5n > YOLOv9t, whereas YOLOv10n exhibits the greatest stability in training outcomes, followed by YOLOv8n, YOLOv5n, and YOLOv9t.
Evaluation of model adaptability and generalizability
Adaptability and generalization, reflecting performance on altered and unseen datasets, were evaluated by testing on datasets with varying weather conditions, flight heights, and camera settings, including 15cloudy, 15sunny, 30sunny, and 15zoom. Metrics such as mAP@50, mAP@50–95, P, R, and F1 scores, presented in Fig. 10, were used to assess performance.
Across all 21 model variants, the mAP@50 at 15sunny consistently exceeds that at 30sunny, indicating reduced accuracy with increased flight altitude. Under 15sunny conditions, the ranking of models by mAP@50, from highest to lowest, is YOLOv8m (0.966) > YOLOv5m (0.944) > YOLOv10x > YOLOv9m (0.921). For 30sunny conditions, the top-performing variants, in descending order of mAP@50, are YOLOv8n > YOLOv5m > YOLOv10n > YOLOv9m. This trend highlights the superior adaptability and generalizability of YOLOv8, especially under variable environmental and operational conditions.
Across the 21 variants of the four models, the mAP@50 values under 15sunny conditions consistently exceed those under 15cloudy, highlighting a notable decline in model performance under 15cloudy. This decline suggests that all the models are sensitive to variations in color and brightness. Under 15cloudy, the ranking of mAP@50 from highest to lowest is YOLOv8m > YOLOv5l > YOLOv10b > YOLOv9m.
Similarly, the mAP@50 under 15sunny also exceeds that under 15zoom, indicating that image compression adversely impacts the object detection accuracy. Under 15zoom, the highest to lowest mAP@50 values are observed in YOLOv10x > YOLOv9m > YOLOv8n > YOLOv5m.
Considering the models’ performance across varying heights, weather conditions, wide-angle views, and zoom scenarios, YOLOv8m emerges as the optimal model.
Tests conducted with YOLOv8m on datasets 15sunny, 15cloudy, 15zoom, and 30sunny, as shown in Fig. 11, demonstrate strong adaptability and generalizability, further validating its suitability for diverse operational conditions.
Spike counting and flowering stage prediction
Based on the results of Sect. 2.4, sorghum flowering images captured from July 21 to July 26, 2024, were analyzed using the YOLOv8m model. The detection results, presented in Fig. 12, reveal a gradual decline in prediction accuracy with increasing flight height. The R2 values across different heights were as follows: 12 m (0.915–0.957) > 15 m (0.880–0.939) > 30 m (0.790–0.935) > 45 m (0.394–0.733). Similarly, the RMSE and rRMSE increased with height, indicating a loss of precision, with rRMSE values following the order 12 m (0.111–0.396) < 15 m (0.177–0.326) < 30 m (0.556–2.676) < 45 m (4.0–9.791). These findings suggest that 12 m is the optimal flight height for UAV monitoring of sorghum flowering, whereas heights above 30 m result in significantly reduced detection performance.
Fig. 13 shows the results of sorghum spike detection at a flight height of 12 m from July 21 to July 26, 2024, demonstrating the model’s ability to accurately identify sorghum spikes during the flowering stage. These findings indicate that YOLOv8m effectively supports precise and reliable monitoring of sorghum flowering at optimal heights.
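As an illustration of the counting step, the sketch below runs a trained YOLOv8m checkpoint over per-plot images and records the number of detected spikes in each; the predicted counts can then be compared with manual counts via Eqs. (6–8). The weight and image paths and the confidence threshold are assumptions for illustration.

```python
from pathlib import Path
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # hypothetical path to the trained YOLOv8m weights

counts = {}
for img in sorted(Path("plots_2024_12m").glob("*.jpg")):  # hypothetical folder of per-plot images
    result = model.predict(str(img), imgsz=640, conf=0.25, verbose=False)[0]
    counts[img.name] = len(result.boxes)  # one box per predicted sorghum spike

for name, n in counts.items():
    print(f"{name}: {n} spikes detected")
```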
Discussion
The performance of YOLO models on smaller datasets is generally suboptimal, as noted by [34]. While increasing the dataset size enhances accuracy, the process of data collection and annotation is time intensive. This study compared the performance of YOLOv5, YOLOv8, YOLOv9, and YOLOv10 across various dataset sizes and achieved a significant improvement in mAP@50 as the dataset size increased from 100 to 200. However, for dataset sizes between 200 and 800, the variation in mAP@50 for YOLOv8, YOLOv9, and YOLOv10 became minimal, with little incremental gain observed. For YOLOv5, the mAP@50 values sometimes decreased within this range. These findings align with Barbedo’s [35] observations, where smaller datasets resulted in limited performance for CNN-based plant disease classification, underscoring the challenge of applying deep learning to such domains. Similarly, Bailly et al. [33] and Verberg [36] reported that increasing the dataset size does not guarantee consistent improvements in model performance. Kulik and Shtanko [37] demonstrated that tiny YOLO and YOLOv3 could achieve sufficient accuracy with relatively small datasets, with diminishing returns as the dataset size increases. These results are consistent with the findings of this study, emphasizing that for sorghum spike detection, a dataset size of 200–350 offers an optimal balance between model accuracy and the effort required for dataset preparation. This balanced approach mitigates the workload while ensuring sufficient accuracy for practical applications.
When specific target categories are detected, larger and more complex models do not necessarily yield better performance. Kulik and Shtanko [37] compared tiny YOLO and YOLOv3 and reported that tiny YOLO, with fewer parameters, achieved better training results with reduced training and inference times. Similarly, in this study, smaller models exhibited comparable or superior accuracy compared with larger models while requiring significantly less time and computational resources, highlighting the importance of selecting an appropriately sized model to balance efficiency and performance. Under clear and stable weather conditions, the small size, fast detection speed, and high detection accuracy of YOLOv5n make it the optimal choice, easily enabling real-time monitoring even on mobile devices. YOLOv8n, owing to its superior performance under sunny conditions (30sunny), is better suited for large-scale monitoring because cameras at higher altitudes can capture a larger area.
For UAV-based detection of sorghum spikes, factors such as target size, image brightness, and color variation substantially influence model performance. This study evaluated the adaptability and generalizability of YOLO models on datasets such as the test set, 15zoom, 15sunny, 15cloudy, and 30sunny. The findings indicate the following: (1) All the models performed robustly on the test set and 15sunny, with mAP@50 absolute differences between the test and validation sets ranging from 0 to 0.013, reflecting strong adaptability to unknown data. (2) Under 15cloudy conditions, the performance of the models exhibited notable variability, with all models showing a marked decline in effectiveness, suggesting a higher rate of missed and false detections. This outcome contradicts the findings reported in other studies. For example, Mirhaji et al. [38] successfully collected images of oranges under diverse lighting conditions and utilized YOLOv4 for training and detection, achieving accurate detection of oranges regardless of lighting variations. In the study by Tian et al. [39], apples were successfully detected in images taken under both sunny and cloudy conditions. This success is attributed to the inclusion of images captured under varying lighting conditions in the training set, a strategy also employed by Tian and Mirhaji [38, 39]. In contrast, the training set used in this study consisted only of images taken on sunny days, limiting the model’s ability to generalize to cloudy conditions; a more diverse training dataset demonstrably leads to better model performance. Additionally, the strong color contrast between the target fruits (apples and oranges) and their backgrounds facilitated detection. In this study, the minimal color contrast between sorghum spikes and leaves under cloudy conditions significantly reduced the detection accuracy. (3) Under 30sunny conditions, the performance of all the models notably decreased because of the increased distance between the camera and the target. This is consistent with findings by Hao et al. [40], who demonstrated that detection accuracy decreases as the object-to-camera distance increases. At 30sunny, sorghum spikes occupy only about one-fourth of the pixel area compared with the training images, which severely impacts the detection precision. (4) Conversely, under 15zoom conditions, the models exhibited a smaller performance decline. This is because the 15zoom images were tested at a resolution of 640 × 640, resulting in the pixel representation of sorghum spikes being similar to that in the training set. This comparable pixel density mitigated the impact of zoom-induced compression, preserving detection accuracy.
During the collection of UAV remote sensing images of sorghum, selecting clear and windless weather conditions is essential. However, monitoring specific crop growth stages, such as the flowering stage, requires continuous time series data collection [41], necessitating image acquisition even under suboptimal weather conditions. In 2024, sorghum UAV images exhibited greater variability in color and brightness than did 2023 images, complicating detection. Additionally, image deformation and posture disturbances introduce distortions during stitching [42], leading to inconsistencies in detection accuracy and weakening the correlation of metrics such as R2, RMSE, and rRMSE with UAV flight altitude. Future research should address weather-related interference, incorporate ground control points (GCPs), and utilize advanced algorithms and stitching software to mitigate image distortions. Furthermore, when R2, RMSE, and rRMSE are calculated by fitting model counts to manual counts, misdetections and missed detections by the model are also included, introducing error. This is a limitation of this method. Therefore, future research needs to thoroughly evaluate model accuracy before prediction to minimize error.
Conclusion
This study trained the YOLOv5, YOLOv8, YOLOv9, and YOLOv10 models on six datasets of varying sizes (100–800), resulting in 126 training outcomes. For YOLOv5, increasing the dataset size improved the accuracy for datasets smaller than 200 but had minimal or even negative effects when the dataset size exceeded 350. For YOLOv8, YOLOv9, and YOLOv10, significant accuracy improvements occurred for dataset sizes below 200, with diminishing returns beyond 200. A dataset size of 200–350 is recommended as a balanced choice between model performance and workload.
Testing across 15zoom, 15sunny, 15cloudy, and 30sunny scenarios revealed that YOLOv8m demonstrated the best adaptability and generalizability, performing optimally under 15sunny and 15cloudy. When monitoring the sorghum flowering stage with YOLOv8m, the model achieved high accuracy (R2: 0.88–0.957; rRMSE: 0.111–0.396) at heights of 12–15 m. At greater heights (30–45 m), accuracy declined (R2: 0.394–0.935; rRMSE: 0.556–26.313), emphasizing the importance of optimal flight altitudes for reliable monitoring.
While we tested the model’s performance under different conditions, the training dataset for this study was collected within a single day and failed to capture a wider range of early sorghum panicle morphologies. Future research could continuously collect images of sorghum panicles during the flowering period to train the model and improve its detection capability for early-stage panicles. Overall, we selected the optimal model capable of accurately monitoring sorghum during the flowering period. This can guide flowering period management in production, laying a foundation for stable and increased sorghum yields, and can also serve as a reference for model algorithm improvement.
Data availability
Data is provided within the supplementary information files.
Abbreviations
- UAV: Unmanned Aerial Vehicle
- YOLO: You Only Look Once
- LiDAR: Light Detection And Ranging
- NGBDI: Normalized Green-Blue Difference Index
- GBDI: Green Blue Difference Index
- NMS: Non-Maximum Suppression
References
Ritter KB, McIntyre CL, Godwin ID, Jordan DR, Chapman SC. An assessment of the genetic relationship between sweet and grain sorghums, within Sorghum bicolor ssp. bicolor (L.) Moench, using AFLP markers. Euphytica. 2007;157:161–76.
Endalamaw C, Semahegn Z. Genetic variability and yield performance of sorghum (sorghum bicolor L.) genotypes grown in semi-arid Ethiopia. Int J Adv Biol Biomed Res. 2020;8:193–213.
Weiss M, Jacob F, Duveiller G. Remote sensing for agricultural applications: a meta-review. Remote Sens Environ. 2020;236:111402.
Gao Z, Lu X, Wang X, Yang Z, Wang R. Study on winter wheat leaf area index inversion employing the PSO-NN-PROSAIL model. Int J Remote Sens. 2024;45:2915–38.
Meraj T, Sharif MI, Raza M, Alabrah A, Kadry S, Gandomi AH. Computer vision-based plants phenotyping: a comprehensive survey. Iscience. 2024; 27.
Madec S, Jin X, Lu H, De Solan B, Liu S, Duyme F, et al. Ear density estimation from high resolution RGB imagery using deep learning technique. Agr for Meteorol. 2019;264:225–34.
Cai E, Luo Z, Baireddy S, Guo J, Yang C, Delp EJ. High-resolution uav image generation for sorghum panicle detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. pp. 1676-85.
Wang S, Zhao J, Cai Y, Li Y, Qi X, Qiu X, et al. A method for small-sized wheat seedlings detection: from annotation mode to model construction. Plant Methods. 2024;20:15.
Liu L, Li P. An improved YOLOv5-based algorithm for small wheat spikes detection. Signal Image Video P. 2023;17:4485–93.
Wang B, Yang G, Yang H, Gu J, Xu S, Zhao D, et al. Multiscale maize tassel identification based on improved retinanet model and UAV images. Remote Sens. 2023;15:2530.
Radoglou-Grammatikis P, Sarigiannidis P, Lagkas T, Moscholios I. A compilation of UAV applications for precision agriculture. Comput Netw. 2020;172:107148.
Lu C, Nnadozie E, Camenzind MP, Hu Y, Yu K. Maize plant detection using UAV-based RGB imaging and YOLOv5. Front Plant Sci. 2024;14:1274813.
Chen H, Chen H, Zhang S, Chen S, Cen F, Zhao Q, et al. Comparison of CWSI and Ts-Ta-VIs in moisture monitoring of dryland crops (sorghum and maize) based on UAV remote sensing. J Integr Agr. 2024;23:2458–75.
Luo S, Jiang X, Jiao W, Yang K, Li Y, Fang S. Remotely sensed prediction of rice yield at different growth durations using UAV multispectral imagery. Agriculture. 2022;12:1447.
Yu F, Bai J, Jin Z, Guo Z, Yang J, Chen C. Combining the critical nitrogen concentration and machine learning algorithms to estimate nitrogen deficiency in rice from UAV hyperspectral data. J Integr Agr. 2023;22:1216–29.
Zhang X, Zhang J, Peng Y, Yu X, Lu L, Liu Y, et al. QTL mapping of maize plant height based on a population of doubled haploid lines using UAV LiDAR high-throughput phenotyping data. J Integr Agr. 2024. https://doi.org/10.1016/j.jia.2024.09.004.
Jin X, Zarco-Tejada PJ, Schmidhalter U, Reynolds MP, Hawkesford MJ, Varshney RK, et al. Synthetic aperture radar image statistical modeling: part one-single-pixel statistical models. IEEE Geosci Remote Sens Mag. 2021;9:200–31.
Wang X, Hunt C, Cruickshank A, Mace E, Hammer G, Jordan D. The impacts of flowering time and tillering on grain yield of sorghum hybrids across diverse environments. Agronomy. 2020;10:135.
Pan D, Li C, Yang G, Ren P, Ma Y, Chen W, et al. Identification of the initial anthesis of soybean varieties based on UAV Multispectral Time-Series images. Remote Sens (Basel). 2023;15:5413.
Guo Y, Fu YH, Chen S, Robin Bryant C, Li X, Senthilnath J, et al. Integrating spectral and textural information for identifying the tasseling date of summer maize using UAV based RGB images. Int J Appl Earth Obs Geoinf. 2021;102:102435.
Qiu S, Li Y, Gao J, Li X, Yuan X, Liu Z et al. Research and implementation of millet ear detection method based on lightweight YOLOv5. Sensors. 2023; 23:9189.
Fan Y, Tohti G, Geni M, Zhang G, Yang J. A marigold corolla detection model based on the improved YOLOv7 lightweight. Signal Image Video P. 2024;18:4703–12.
Redmon J. You only look once: Unified, real-time object detection. Proc IEEE Conf Comput Vis Pattern Recognit. 2016. https://doi.org/10.48550/arXiv.1506.02640.
Parambil MMA, Ali L, Swavaf M, Bouktif S, Gochoo M, Aljassmi H, et al. Navigating the YOLO landscape: a comparative study of object detection models for emotion recognition. IEEE Access. 2024;12:109427–42.
Zhao J, Yan J, Xue T, Wang S, Qiu X, Yao X, et al. A deep learning method for oriented and small wheat spike detection (OSWSDet) in UAV images. Comput Electron Agr. 2022;198:107087. https://doi.org/10.1016/j.compag.2022.107087.
Zhao J, Cai Y, Wang S, Yan J, Qiu X, Yao X, et al. Small and oriented wheat spike detection at the filling and maturity stages based on wheatnet. Plant Phenomics. 2023;5:0109.
Yu X, Yin D, Xu H, Pinto Espinosa F, Schmidhalter U, Nie C, et al. Maize tassel number and tasseling stage monitoring based on near-ground and UAV RGB images by improved YoloV8. Precision Agric. 2024;25:1800–38.
Gao R, Jin Y, Tian X, Ma Z, Liu S, Su Z. YOLOv5-T: a precise real-time detection method for maize tassels based on UAV low altitude remote sensing images. Comput Electron Agr. 2024;221:108991.
Zhang C, Ding H, Shi Q, Wang Y. Grape cluster real-time detection in complex natural scenes based on YOLOv5s deep learning network. Agriculture. 2022;12:1242.
Varghese R, Sambath M. YOLOv8: A novel object detection algorithm with enhanced performance and robustness. 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS). Chennai, India: IEEE; 2024. pp. 1–6.
Wang CY, Yeh IH, Mark Liao HY. YOLOv9: Learning what you want to learn using programmable gradient information. In: Leonardis A, Ricci E, Roth S, Russakovsky O, Sattler T, Varol G, editors. European Conference on Computer Vision. Cham: Springer; 2024. pp. 1–21.
Wang A, Chen H, Liu L, Chen K, Lin Z, Han J, et al. YOLOv10: Real-time end-to-end object detection. arXiv preprint arXiv:2405.14458. 2024. https://doi.org/10.48550/arXiv.2405.14458.
Bailly A, Blanc C, Francis É, Guillotin T, Jamal F, Wakim B, et al. Effects of dataset size and interactions on the prediction performance of logistic regression and deep learning models. Comput Meth Prog Bio. 2022;213:106504.
Diwan T, Anirudh G, Tembhurne JV. Object detection using YOLO: challenges, architectural successors, datasets and applications. Multimed Tools Appl. 2023;82:9243–75.
Barbedo JGA. Impact of dataset size and variety on the effectiveness of deep learning and transfer learning for plant disease classification. Comput Electron Agr. 2018;153:46–53.
Verberg G. The effect of dataset size on neural network performance within systematic reviewing. [Master’s thesis]. Utrecht University; 2021.
Kulik SD, Shtanko AN. Experiments with neural net object detection system YOLO on small training datasets for intelligent robotics. In: Misyurin SA, Arakelian V, Avetisyan A, editors. Advanced Technologies in Robotics and Intelligent Systems: Proceedings of ITR 2019. Cham: Springer; 2020. pp. 157–62.
Mirhaji H, Soleymani M, Asakereh A, Mehdizadeh SA. Fruit detection and load estimation of an orange orchard using the YOLO models through simple approaches in different imaging and illumination conditions. Comput Electron Agr. 2021;191:106533.
Tian Y, Yang G, Wang Z, Wang H, Li E, Liang Z. Apple detection during different growth stages in orchards using the improved YOLO-V3 model. Comput Electron Agr. 2019;157:417–26.
Hao Y, Pei H, Lyu Y, Yuan Z, Rizzo JR, Wang Y, et al. Understanding the impact of image quality and distance of objects to object detection performance. 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE; 2023. pp. 11436–42.
Yang Q, Shi L, Han J, Yu J, Huang K. A near real-time deep learning approach for detecting rice phenology based on UAV images. Agr for Meteorol. 2020;287:107938.
Yu G, Ha C, Shi C, Gong L, Yu L. A fast and robust UAV images mosaic method. In: Wang L, Wu Y, Gong J, editors. Proceedings of the 7th China High Resolution Earth Observation Conference (CHREOC 2020), Lecture Notes in Electrical Engineering. Singapore: Springer; 2022. pp. 229–45.
Funding
This study was funded by the Fundamentals of Guizhou University (2024–35); the Science and Technology Support Program of Guizhou Provincial Science and Technology Department (Qian ke he zhi (2024) yi ban 074); Qian-Ke-He-Platform-talent (BQW[2024]001); the Key Laboratory of Molecular Breeding for Grain and Oil Crops in Guizhou Province (Qiankehezhongyindi (2023) 008); and the Key Laboratory of Functional Agriculture of Guizhou Provincial Higher Education Institutions (Qianjiaoji (2023) 007).
Author information
Authors and Affiliations
Contributions
ZS collected images, analyzed data, conducted experiments, and wrote the manuscript. YY performed data visualization and revised the manuscript. LT conducted experiments. TF supervised and guided the experiments. SC and FC collected images, while SY and QZ provided guidance for the experiments and offered valuable suggestions. ZG conceived the study, guided the entire research, and revised the manuscript. TH reviewed and revised the manuscript, providing valuable comments and suggestions.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, S., Yang, Y., Tu, L. et al. Comparison of YOLO-based sorghum spike identification detection models and monitoring at the flowering stage. Plant Methods 21, 20 (2025). https://doi.org/10.1186/s13007-025-01338-z
DOI: https://doi.org/10.1186/s13007-025-01338-z