
PFLO: a high-throughput pose estimation model for field maize based on YOLO architecture

Abstract

Posture is a critical phenotypic trait that reflects crop growth and serves as an essential indicator for both agricultural production and scientific research. Accurate pose estimation enables real-time tracking of crop growth processes, but in field environments, challenges such as variable backgrounds, dense planting, occlusions, and morphological changes hinder precise posture analysis. To address these challenges, we propose PFLO (Pose Estimation Model of Field Maize Based on YOLO Architecture), an end-to-end model for maize pose estimation, coupled with a novel data processing method to generate bounding boxes and pose skeleton data from a "keypoint-line" annotated phenotypic database, which could mitigate the effects of uneven manual annotations and biases. PFLO also incorporates advanced architectural enhancements to optimize feature extraction and selection, enabling robust performance in complex conditions such as dense arrangements and severe occlusions. On a fivefold validation set of 1,862 images, PFLO achieved 72.2% pose estimation mean average precision (mAP50) and 91.6% object detection mean average precision (mAP50), outperforming current state-of-the-art models. The model demonstrates improved detection of occluded, edge, and small targets, accurately reconstructing skeletal poses of maize crops. PFLO provides a powerful tool for real-time phenotypic analysis, advancing automated crop monitoring in precision agriculture.

Introduction

Maize, one of the three major staple crops worldwide, provides essential calories and nutrients to billions of people while also serving as a critical industrial raw material. Its applications span diverse sectors, including food processing, pharmaceuticals, paper-making, textiles, and bioenergy industries [1, 2]. Given its immense economic value, improving maize yield and quality has become a global agricultural research priority. Phenomics plays a pivotal role in this endeavor. By precisely acquiring and analyzing phenotypic information, researchers can improve maize yield, quality, adaptability, and economic returns through breeding superior varieties.

Recent phenomics advancements integrate remote sensing, robotics, computer vision, and artificial intelligence technologies. Among various phenotypic traits, plant posture has emerged as a key indicator of environmental stress responses and yield potential. Plant posture reflects not only growth and development mechanisms [3, 4], but also overall plant health—healthy plants typically maintain upright postures, while stressed plants show distinctive postural alterations [5, 6]. Furthermore, posture analysis aids in selecting traits for enhanced lodging resistance, contributing to improved crop stability [7].

These insights highlight the importance of plant architectural analysis and motivate the development of accurate, high-throughput methods for phenotypic characterization in maize. Extracting and analyzing phenotypic information in complex field environments—especially in densely planted settings—poses significant challenges due to labor-intensive processes and potential human errors. In recent years, deep learning [8] has demonstrated substantial promise across various phenomic tasks, including classification, detection, and segmentation. For instance, Waheed et al. [9] optimized a DenseNet-based architecture with data augmentation to achieve exceptional accuracy in maize leaf disease identification. Li et al. [10] introduced APNet, a lightweight network tailored for apricot tree disease and pest detection in complex backgrounds, attaining a high accuracy. Similarly, Chadoulis et al. [11] combined 3D Convolutional Neural Networks with Extremely Randomized Trees to detect presymptomatic viral infections in Nicotiana benthamiana via hyperspectral imaging, achieving high accuracies through an innovative patch-based approach that exploited both spectral and spatial information. In the realm of object detection, Liu and Wang [12] enhanced YOLOv3-tiny (You Only Look Once version 3-tiny) with Focal Loss to facilitate real-time detection of broken and intact maize kernels on conveyor belts, while Jing et al. [13] developed MRD-YOLO—a system for melon ripeness detection based on MobileNetV3 and Coordinate Attention—suited for resource-constrained environments. Recent advances in image segmentation have leveraged deep learning techniques for various plant phenotyping tasks. Yang et al. [14] integrated VGG16 with Mask R-CNN to achieve robust leaf segmentation and classification in complex backgrounds. Aich and Stavness [15] proposed a data-driven approach using deconvolutional and convolutional networks for accurate leaf counting. Meanwhile, Jiang et al. [16] employed YOLOv8x with spatial and spectral features to automate the segmentation of leafy potato stems, demonstrating its effectiveness in high-throughput phenotyping. For keypoint detection, He et al. [17] proposed a bottom-up model based on disentangled keypoint regression for soybean pod phenotyping, although such approaches still struggle with occlusions in dense field environments [18]. Notably, the YOLO framework has become a foundational architecture owing to its rapid detection speed, high accuracy, and flexible design.

Concurrently, recent studies have broadened the scope of maize phenotyping. Song et al. [19] employed an RGB camera to capture maize crown images and developed a precise detasseling method by integrating an Oriented R-CNN-based detector with a dual-stage key pixel extraction module. In another study, Hämmerle et al. [20] utilized a low-cost time-of-flight camera to acquire 3D point cloud data, from which maize crop height models and individual plant heights were derived via multi-view synthesis and filtering calibration—thereby obviating the need for prior terrain measurements. Meanwhile, Qi et al. [21] leveraged UAV-based multi-scale imagery to construct a specialized dataset targeting “missed tassels” during the hybrid maize seed detasseling phase. Building on this dataset, they proposed an enhanced YOLOv5 detection framework—MT-YOLO—incorporating the ECANet attention mechanism, depthwise separable convolution, and an SIoU loss function, thereby achieving efficient and accurate tassel detection under complex field conditions.

Despite these advances, existing methods predominantly rely on object detection, segmentation, or classification techniques limited by data constraints and technical complexities. These approaches often fail to directly capture plant pose skeletons and growth trajectories—which provide more intuitive representations of critical traits such as height, leaf angles, branching structures, and stalk curvature. Consequently, these phenotypic characteristics remain underexplored in current research.

Pose skeleton analysis represents an emerging frontier in precision agriculture [22], offering clearer depictions of plant growth trajectories compared to traditional bounding box or segmentation methods. However, occlusions, dense planting arrangements, and morphological variations across growth stages present significant challenges for accurate skeleton extraction. In addition, the high hardware costs associated with certain detection devices impede the widespread adoption of these techniques. In our previous work [23], we established a foundation for in-field maize keypoint detection, but fell short in achieving complete skeleton reconstruction and end-to-end architecture, particularly in high-density and occluded conditions.

To address these limitations, we propose PFLO, a high-throughput pose estimation model based on YOLOv9 [24,25,26,27] with five key enhancements: (1) a novel data preprocessing strategy converting manual "keypoint-line" annotations into uniformly distributed keypoints with associated bounding boxes; (2) integration of a Squeeze-and-Excitation module [28]; (3) a RepPose detection head based on re-parameterized convolution [29]; (4) dynamic sampling-based upsampling [30]; and (5) a Multiple-Scale Separated and Enhanced Attention Module [31].

The remainder of this paper is organized as follows: “Materials and methods” section details the dataset and methodology; “Implementation details” section describes the experimental setup; “Experimental results and discussion” section presents results and comparisons with state-of-the-art methods; and “Conclusion and future work” section concludes with contributions, limitations, and future research directions.

Materials and methods

Image acquisition and annotation

The maize dataset was collected at the Agricultural Genomics Institute at Shenzhen (CAAS) and is publicly available through the MIPDB database [32]. The images were acquired between March 2021 and September 2023 using handheld DSLR cameras (Nikon Z5) from four different perspectives—front, back, left, and right—at distances ranging from 1 to 3 m. Detailed dataset specifications are provided in Table 1.

Table 1 Image acquisition specifications and environmental settings

As shown in Fig. 1, the dataset encompasses multiple variability dimensions: (1) angular variations, (2) growth stages from vegetative (V3) to reproductive (R1) [33], (3) different lighting and weather conditions, and (4) observation challenges including self-occlusion, inter-plant occlusion, edge effects, and scale variations. The most severe occlusion scenarios occurred at V9-R1 stages, with inter-plant occlusion of 40–60% in approximately 30% of images. Leaf and stalk structures of maize plants in the first two rows were annotated using keypoints and connecting lines through Labelme software. Figure 2 illustrates these annotations at different growth stages, showing the original field images, the keypoint-based skeleton visualizations, and the corresponding ground truth. Our approach to analyzing variables across different detection scenarios in the dataset was inspired by the work of Paul et al. [34], which provided valuable insights for comprehensive dataset design.

Fig. 1 Illustration of data collection conditions. Scenarios include: a various imaging angles capturing complete plant structures; b representation of different growth stages (from V3 to R1); c examples of diverse lighting conditions (early morning, noon, evening) and weather; d examples of observation challenges such as self-occlusion, inter-plant occlusion, edge effects, and scale variations

Fig. 2 The images in the MIPDB database and their corresponding manual annotations at different growth stages. Each row corresponds to a different growth stage, and the columns (from left to right) display the original field images, the keypoint-based skeleton visualization results, and the ground truth, respectively

The complete dataset includes 9,800 high-quality DSLR images from the MIPDB database, evenly distributed across different environmental conditions and growth stages. We compiled detailed distribution statistics of the dataset, as shown in Table 2, displaying balanced representation across various capture conditions and growth stages.

Table 2 Statistical summary of maize growth stages in the dataset

To ensure robust model evaluation and generalization, we adopted a 7:2:1 training:validation:testing split, which provides a good balance between sufficient training data and adequate validation and test samples [35]. Each partition maintained the diversity of the original dataset, covering different maize growth stages and various field conditions, including planting density, light intensity, and target size differences. This approach balances comprehensive evaluation with computational efficiency.

Data preprocessing approaches

Original images (6016 × 4016 pixels) were resized to 1504 × 1004 pixels, balancing feature preservation with computational efficiency. We employed a dual-representation framework combining bounding boxes and keypoints to characterize plant morphology. Bounding boxes were generated by calculating the minimum enclosing rectangle around keypoints with a 25-pixel padding, determined to be optimal through systematic evaluation (Fig. 3a, “Hyperparameter tuning and result analysis” section).
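For illustration, a minimal sketch of this box-generation step is given below; the 25-pixel padding and the clamping to the image bounds follow the description above, while the function name and array layout are our own assumptions.

```python
import numpy as np

def keypoints_to_box(keypoints, img_w, img_h, padding=25):
    """Minimum enclosing rectangle around (x, y) keypoints, expanded by a fixed padding.

    keypoints: array of shape (N, 2) in pixel coordinates.
    Returns (x_min, y_min, x_max, y_max), clamped to the image bounds.
    """
    kps = np.asarray(keypoints, dtype=float)
    x_min, y_min = kps.min(axis=0)
    x_max, y_max = kps.max(axis=0)
    return (
        max(0.0, x_min - padding),
        max(0.0, y_min - padding),
        min(float(img_w), x_max + padding),
        min(float(img_h), y_max + padding),
    )
```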

Fig. 3 Data processing workflow. a Bounding box generation and expansion based on annotated keypoints; b keypoint standardization

Manual annotations showed significant inconsistencies (Fig. 4), including variations in keypoint quantity (6.36 ± 2.90 per object) and non-uniform inter-point distances (176.65 ± 129.11 pixels per object) (Table 3). To standardize annotations, we first sorted stalk keypoints by ascending height and leaf keypoints by distance from stalks, then interpolated ten equidistant points between the original annotations (Fig. 3b). Finally, we applied uniform sampling using Eq. (1):

$$n = \frac{k}{r}$$
(1)

where n represents the sampling stride, which indicates the interval between the retained keypoints. k is the total number of annotated points. r is the ratio used to determine the sampling density, controlling the number of keypoints to be retained.
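As a concrete illustration of this standardization, the sketch below densifies each annotated stalk or leaf polyline and then uniformly resamples it to a fixed number of keypoints (17 in this study); the helper names and the linspace-based index selection are our own assumptions, equivalent in effect to retaining every n-th point with the stride of Eq. (1).

```python
import numpy as np

def densify(points, n_between=10):
    """Insert n_between equally spaced points between each pair of consecutive
    annotated keypoints. points: array of shape (k, 2), already sorted along the organ."""
    pts = np.asarray(points, dtype=float)
    dense = [pts[0]]
    for a, b in zip(pts[:-1], pts[1:]):
        t = np.linspace(0.0, 1.0, n_between + 2)[1:]   # skip the segment start, keep the end
        dense.extend(a + (b - a) * t[:, None])
    return np.stack(dense)

def resample_uniform(points, n_keep=17):
    """Uniformly sample n_keep keypoints along the densified polyline."""
    pts = densify(points)
    idx = np.linspace(0, len(pts) - 1, n_keep).round().astype(int)
    return pts[idx]
```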

Fig. 4 Comparison of keypoint annotation data before and after preprocessing. a Keypoint annotation data before preprocessing. b Keypoint annotation data after preprocessing. The same color indicates that keypoints belong to the same stalk or leaf, and the numbers represent the order of the keypoint data

Table 3 Comparison of manual and semi-automatic annotation metrics for each leaf or stalk

To accurately depict the skeleton of the leaf while balancing computational load, we set the number of retained keypoints for each target to 17. We then standardized the definition of different parts of the leaf and stalk by defining the annotation indices (Table 4). As shown in Table 3, the processed keypoints exhibit a more uniform distribution (standard deviation reduced to 0) and more reasonable inter-point distances (49.92 ± 30.97 pixels).

Table 4 The definition of keypoints

Pose estimation model construction

We propose PFLO, a YOLO-based pose estimation model specifically designed for field maize. PFLO integrates YOLOv9 as its backbone, leveraging two key mechanisms: Programmable Gradient Information (PGI) for resolving gradient conflicts, and General Efficient Layer Aggregation Network (GELAN) for enhanced multi-scale feature fusion. PFLO's detection head builds upon YOLOv8-Pose [25] with enhancements inspired by YOLO-Pose [36]. Comparative experiments (“Comparison with state-of-the-art methods” section) confirmed YOLOv9's superior balance between accuracy and computational efficiency compared to YOLOv8, YOLOv10, and YOLO11 [25] variants. These enhancements substantially improve both bounding box localization and keypoint detection while maintaining real-time processing capabilities.

Figure 5 illustrates the complete pipeline and network architecture of PFLO, providing a detailed overview of the model's design, including the sequence of operations and key components involved in the maize pose estimation process.

Fig. 5 The pipeline and network architecture of PFLO. Upper panel: field images undergo data preprocessing (standardizing keypoints and generating bounding boxes) before entering the PFLO model. Lower panel: the architectural design featuring RepNCSPELAN4_SE blocks (yellow), dynamic upsampling modules (Dy_Sample, light blue), Multi-SEAM modules (orange) for occlusion handling, and RepConv-based detection heads (teal) that predict both bounding boxes and keypoints simultaneously

Squeeze and excitation module for enhanced feature extraction in basic block

We incorporated the Squeeze and Excitation mechanism to enhance the model's feature recalibration, improving response adjustment across channels. By embedding this mechanism into RepNCSPELAN4, which combines principles from RepVGG [29] and CSPNet [37], the model effectively highlights critical features in the information flow, enhancing feature extraction and emphasis. The SE module works as follows:

Squeeze Operation: The input feature map \(U\) with dimensions \(H\times W\times C\), where \(H\), \(W\), and \(C\) represent height, width, and channels respectively, undergoes global average pooling, producing a \(1\times 1\times C\) tensor. The computation for each channel \(C\) is shown in Eq. (2):

$$Z_{c} = \frac{1}{H \times W}\,\mathop \sum \limits_{i = 1}^{H} \mathop \sum \limits_{j = 1}^{W} u_{c} \left( {i,j} \right)$$
(2)

where \({u}_{c}(i,j)\) is the feature value at coordinates \((i,j)\) in channel \(C\) of the input \(U\).

Excitation Operation: The excitation operation uses a fully connected feedforward neural network to learn recalibration weights for each channel. This network consists of two layers. The first is a dimensionality reduction layer, a fully connected layer that reduces the number of channels from \(C\) to \(\frac{C}{r}\) using the reduction ratio \(r\). The second is a dimensionality restoration layer, which restores the number of channels from \(\frac{C}{r}\) back to \(C\). The sigmoid activation function [38] is then applied to generate the weight for each channel. This operation can be represented as Eq. (3):

$$s = \sigma \left( {W_{2} \,\delta \left( {W_{1} z} \right)} \right)$$
(3)

where \(\sigma\) is the sigmoid activation function that generates the recalibration weight for each channel, \(\delta\) is the ReLU activation function [39], \({W}_{1}\) and \({W}_{2}\) are the weights of the first and second fully connected layers, respectively, \(z\) is the channel tensor from the squeeze operation, and \(s\) represents the excitation weight for each channel.

Finally, the recalibration weights \(s\) are applied to each channel of the original input feature map U, completing the channel recalibration. The recalibrated feature map for channel \(C\) can be expressed as Eq. (4):

$$\hat{U}_{c} = s_{c} \times U_{c}$$
(4)

In this formula, \({U}_{c}\) represents the input feature map of channel \(C\), part of the original feature map \(U\), \({s}_{c}\) is the excitation weight for channel \(C\), \({\widehat{U}}_{c}\) is the recalibrated feature map of channel \(C\), obtained by applying the weight \({s}_{c}\) to the original feature map \({U}_{c}\).

By modeling nonlinear relationships between channels, the SE module enhances critical feature emphasis such as size, position, and color, while suppressing irrelevant ones. This recalibration improves the model's representational capacity and performance, particularly in dense and multi-target pose estimation tasks, such as those encountered in high-density growth stages of maize.
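For reference, a minimal PyTorch sketch of the squeeze-and-excitation operations in Eqs. (2)–(4) is shown below; the reduction ratio of 16 is a common default and an assumption here, not a value reported for PFLO.

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-Excitation: global average pooling (Eq. 2), a two-layer bottleneck
    with ReLU and sigmoid (Eq. 3), and channel-wise rescaling (Eq. 4)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),  # W1: C -> C/r
            nn.ReLU(inplace=True),                                   # delta
            nn.Linear(channels // reduction, channels, bias=False),  # W2: C/r -> C
            nn.Sigmoid(),                                            # sigma
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)          # z_c in Eq. (2)
        s = self.excite(z).view(b, c, 1, 1)     # s in Eq. (3)
        return u * s                            # recalibrated features, Eq. (4)
```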

The integration of the SE module into the RepNCSPELAN4 block enables the network to adaptively extract critical features, as shown in Fig. 6. The diagram illustrates the integration of Squeeze and Excitation mechanisms with RepNCSP blocks, featuring a main pathway with SELayer components and a parallel skip connection, culminating in feature concatenation and further processing. This enhancement leads to superior accuracy and robustness in complex agricultural environments.

Fig. 6 Structure of the RepNCSPELAN4_SE module

Key improvements to the detection head: efficient detection module based on RepConv

In convolutional neural networks (CNNs), the detection head is essential for generating final detection outputs, such as bounding boxes, class labels, and keypoints. It transforms feature maps produced by the feature extraction network (commonly the backbone or feature pyramid network [40]) into the desired detection results.

To enhance the detection head's efficiency and expressiveness, we integrated the RepConv module [29], a re-parameterized convolution block that improves feature extraction during training and optimizes inference speed. By utilizing convolution kernels of varying sizes, RepConv provides diverse receptive fields, boosting detection accuracy across various conditions. During inference, these multi-branch structures are collapsed into a single convolution layer for greater efficiency.

Specifically, during the training phase, the RepConv module comprises \(1\times 1\) and \(3\times 3\) convolution branches, along with a Batch Normalization layer, as shown in Fig. 7a. These branches extract features at various scales from the input feature map, and during training, their respective parameters are updated through backpropagation.

Fig. 7 The mechanism of the RepPose detection head. a Structure of the RepPose detection head, showing its \(1\times 1\) and \(3\times 3\) convolution branches during the training phase. b The RepPose detection head operates in different modes during training and inference, where convolutional kernels are reparameterized into a single fused kernel for inference

In the inference phase, the multiple convolution kernels (\(3\times 3\), \(1\times 1\), and the identity kernel) are fused to derive an equivalent kernel and bias. The specific fusion process is represented as:

For the \(3\times 3\) and \(1\times 1\) convolutions, the fusion of each convolution kernel with the BatchNorm layer can be expressed as Eq. (5):

$$W_{equiv} = \frac{\gamma }{{\sqrt {\sigma^{2} + \varepsilon } }}W,{ }b_{equiv} = \beta + \frac{{\gamma \left( {b - \mu } \right)}}{{\sqrt {\sigma^{2} + \varepsilon } }}$$
(5)

where \(W\) and \(b\) are the convolution kernel and bias, respectively, \(\gamma\) and \(\beta\) are the scaling and bias parameters of the BatchNorm layer, and \(\mu\) and \({\sigma }^{2}\) are its mean and variance, respectively.

Additionally, an identity convolution layer is obtained by applying the BatchNorm layer to a \(3\times 3\) initial identity convolution kernel, resulting in the identity kernel shown in Eq. (6):

$$W_{equivid} = \frac{\gamma }{{\sqrt {\sigma^{2} + \varepsilon } }}I,{ }b_{equivid} = \beta - \frac{\gamma \mu }{{\sqrt {\sigma^{2} + \varepsilon } }}$$
(6)

where \(I\) is the identity kernel, i.e., a \(3\times 3\) convolution kernel that leaves the input unchanged.

Finally, they are merged through reparameterization to obtain the final equivalent convolution kernel and bias, as shown in Fig. 7b.

Thus, during the training phase, the RepConv module utilizes a more complex multi-branch convolution design to extract features. In the inference phase, the parameters learned during training are fused into a simplified equivalent convolution kernel. This reduces computational complexity and significantly improves inference speed, making it suitable for real-time applications. The re-parameterization technique reduces the number of model parameters and computational load during inference, enabling the model to operate efficiently even on resource-constrained devices.

This balance between training complexity and inference efficiency enables the RepConv-enhanced detection head to perform robustly across diverse detection tasks, particularly in scenarios requiring real-time processing.
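A simplified sketch of the re-parameterization in Eqs. (5) and (6) is shown below; it illustrates how a 3 × 3 branch, a 1 × 1 branch, and an identity branch can each be folded with their BatchNorm statistics and then summed into a single equivalent 3 × 3 kernel. This is a conceptual, single-group, bias-free illustration, not the exact PFLO implementation.

```python
import torch
import torch.nn.functional as F

def fuse_conv_bn(weight, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into a preceding bias-free convolution (Eq. 5)."""
    std = torch.sqrt(var + eps)
    w_equiv = weight * (gamma / std).reshape(-1, 1, 1, 1)
    b_equiv = beta - gamma * mean / std
    return w_equiv, b_equiv

def reparameterize(w3, bn3, w1, bn1, bn_id, channels):
    """Merge the 3x3 branch, the 1x1 branch (zero-padded to 3x3), and the identity
    branch (Eq. 6) into one equivalent 3x3 kernel and bias.
    Each bn* argument is a (gamma, beta, running_mean, running_var) tuple."""
    w3_f, b3_f = fuse_conv_bn(w3, *bn3)
    w1_f, b1_f = fuse_conv_bn(w1, *bn1)
    w1_f = F.pad(w1_f, [1, 1, 1, 1])               # place the 1x1 kernel at the centre of a 3x3
    w_id = torch.zeros(channels, channels, 3, 3)   # identity branch expressed as a 3x3 convolution
    for c in range(channels):
        w_id[c, c, 1, 1] = 1.0
    w_id_f, b_id_f = fuse_conv_bn(w_id, *bn_id)
    return w3_f + w1_f + w_id_f, b3_f + b1_f + b_id_f
```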

Dynamic upsample

Upsampling is a crucial operation in image processing, designed to upscale low-resolution feature maps to high-resolution versions by generating new pixel points between existing ones [41, 42]. This operation is widely used in computer vision tasks to restore details and establish connections between feature maps of different scales.

Conventional upsampling methods, such as those implemented in PyTorch, rely on fixed interpolation and sampling rules, which do not adapt to variations in scene complexity. This limitation reduces their ability to capture fine-grained features in diverse contexts. To address this, we introduced Dy_Sample, a dynamic upsampling module that adjusts sampling points flexibly based on input feature map content, as illustrated in Fig. 8.

Fig. 8 Structure of the dynamic upsample module. The module consists of an offset generation layer and a dynamic scope component that work together to produce adaptive sampling coordinates

Specifically, the Initial Position Generator first generates an initial coordinate grid (init_pos). A predefined 2D convolution module is then applied to extract features from the input map and compute the sampling point offsets (\(offset\)), as shown in Eq. (7):

$$offset = Conv_{offset} \left( {input} \right)$$
(7)

Another 2D convolution module dynamically processes the input image to obtain a scope factor, which adjusts the offset. This relationship is expressed in Eq. (8):

$$scope = Conv_{scope} \left( {input} \right)$$
(8)

During this calculation, group convolution [43] is introduced to enhance computational efficiency and reduce parameter count while ensuring high-quality feature representation. Group convolution divides input channels into multiple groups and performs convolution operations independently on each group, effectively reducing computational complexity while maintaining representational power. The scope and offset act as control mechanisms, dynamically generating the offset magnitude for each position. Using learned convolution parameters, the model fine-tunes the offset based on input map content, which is essential for capturing detailed features in challenging scenarios.

These two variables are then multiplied, followed by pixel shuffling and the application of the Sigmoid function. The result is added to the initial position (\(init\_pos\)) to obtain the final sampling point coordinates (\(\text{final}\_\text{pos}\)), as shown in Eq. (9):

$$final\_pos = pixel\_shuffle\left( {Sigmoid\left( {offset \times scope} \right)} \right) + init\_pos$$
(9)

By performing grid sampling at these points, a dynamically upsampled image is obtained. Dy_Sample's dynamic sampling mechanism adaptively adjusts the sampling position according to the input content, enabling fine-grained feature extraction in complex environments.
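The following is a simplified sketch of such content-aware upsampling, following Eqs. (7)–(9); the 1 × 1 convolutions, the 2× scale factor, and the way offsets are normalized before grid sampling are illustrative assumptions rather than PFLO's exact settings, and group convolution is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicUpsample(nn.Module):
    """Content-aware 2x upsampling: one convolution predicts offsets (Eq. 7), another a
    scope factor (Eq. 8); the modulated offsets displace a regular sampling grid (Eq. 9)."""
    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        out_ch = 2 * scale * scale                         # (x, y) offset per upsampled sub-pixel
        self.conv_offset = nn.Conv2d(channels, out_ch, 1)  # Eq. (7)
        self.conv_scope = nn.Conv2d(channels, out_ch, 1)   # Eq. (8)

    def forward(self, x):
        b, c, h, w = x.shape
        # Offsets modulated by a sigmoid-gated scope factor (simplified form of Eq. 9).
        offset = self.conv_offset(x) * torch.sigmoid(self.conv_scope(x))
        offset = F.pixel_shuffle(offset, self.scale)       # (b, 2, h*scale, w*scale)
        hs, ws = h * self.scale, w * self.scale
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, hs, device=x.device),
            torch.linspace(-1, 1, ws, device=x.device),
            indexing="ij",
        )
        init_pos = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(b, -1, -1, -1)
        norm = torch.tensor([ws, hs], device=x.device, dtype=x.dtype).view(1, 2, 1, 1)
        final_pos = init_pos + offset / norm               # pixel offsets scaled (approximately) to grid units
        grid = final_pos.permute(0, 2, 3, 1)               # (b, hs, ws, 2) layout for grid_sample
        return F.grid_sample(x, grid, align_corners=True)
```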

Multiple-scale separated and enhanced attention module for improving occlusion detection capability

The Separated and Enhanced Attention Module (SEAM) [31] combines depthwise separable convolutions [44] with a channel attention mechanism [28] to improve feature representation in occlusion scenarios. Standard convolutions often struggle to detect occluded objects due to the loss of fine-grained details, whereas SEAM enhances detection accuracy by focusing on critical features in occluded areas.

To further improve its versatility, we extended SEAM to a multi-scale version, resulting in the Multiple-Scale Separated and Enhanced Attention Module (Multi-SEAM). As illustrated in Fig. 9, Multi-SEAM employs depthwise convolution networks with varying kernel sizes (e.g., \(3\times 3\), \(5\times 5\), and \(7\times 7\)), enabling the module to capture feature information across multiple scales. This design strengthens the model's robustness in handling objects of various sizes under complex conditions and improves intra-channel detail extraction. Through multi-scale depthwise separable convolutions, convolution operations are performed independently on each channel, allowing the network to better capture intra-channel details. Subsequently, pointwise convolutions merge the output, strengthening the inter-channel relationships. Residual connections within the module [45] preserve the original feature information. Channel attention scores are then computed using a two-layer fully connected structure and used to fuse features from different channels, enhancing the response to non-occluded objects and compensating for occluded areas. Finally, the exponential function is applied to normalize the channel weights, further improving the model's robustness to occlusion and positional errors. Specifically:

$$y_{0} = DcovN_{3 \times 3} \left( x \right),\quad y_{1} = DcovN_{5 \times 5} \left( x \right),\quad y_{2} = DcovN_{7 \times 7} \left( x \right)$$
(10)

where \(x\) represents the input image tensor, and \(y_{0}\), \(y_{1}\), and \(y_{2}\) denote the feature maps produced by depthwise separable convolutions of varying kernel sizes.

Fig. 9 Structure of the Multi-SEAM module

The original image and these multi-scale feature maps are then processed through a global average pooling layer, generating attention scores that emphasize significant global features for each channel. These scores facilitate the recalibration of features to enhance focus on key regions, improving responses to occluded areas. The aggregated scores are computed as follows:

$$\begin{gathered} Score_{0} = AvgPool\left( {y_{0} } \right)\,Score_{1} = AvgPool\left( {y_{1} } \right)\,Score_{2} = AvgPool\left( {y_{2} } \right) \hfill \\ Score = Avg\,\left( {Score_{0} + Score_{1} + Score_{2} } \right) \hfill \\ \end{gathered}$$
(11)

Subsequently, dimensionality reduction and expansion are performed through fully connected layers, generating the attention weights for each channel as shown in Eq. (12).

$$W = Sigmoid\left( {W_{2} \left( {ReLu\left( {W_{1} x} \right)} \right)} \right)$$
(12)

where \({W}_{1}\) and \({W}_{2}\) are the weight matrices for the two fully connected layers.

The generated weights are applied to the input multi-scale feature maps, producing recalibrated feature representations expressed in Eq. (13).

$$z = W \cdot \,{\text{x}}$$
(13)

Through this mechanism, the attention module dynamically adjusts channel-specific weights, emphasizing important features while suppressing irrelevant ones. The resultant feature maps are recalibrated to highlight significant features more strongly.
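The sketch below illustrates the core idea described above (Eqs. 10–13): multi-scale depthwise-separable branches, pooled channel scores, a two-layer bottleneck, and channel-wise recalibration. Kernel sizes follow the text, while the activation choices, the reduction ratio, and the omission of residual connections and exponential normalization are simplifications of the full Multi-SEAM design.

```python
import torch
import torch.nn as nn

class MultiScaleChannelAttention(nn.Module):
    """Multi-SEAM-style attention: depthwise-separable branches with 3x3/5x5/7x7
    kernels (Eq. 10), averaged pooled scores (Eq. 11), a two-layer bottleneck
    (Eq. 12), and channel-wise reweighting of the input (Eq. 13)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels),  # depthwise
                nn.Conv2d(channels, channels, 1),                                   # pointwise
                nn.GELU(),
            )
            for k in (3, 5, 7)
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        # Average the pooled scores of the three multi-scale branches (Eq. 11).
        score = torch.stack([self.pool(branch(x)).view(b, c) for branch in self.branches]).mean(0)
        w = self.fc(score).view(b, c, 1, 1)      # channel attention weights (Eq. 12)
        return x * w                             # recalibrated feature map (Eq. 13)
```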

Implementation details

Performance evaluation metrics

We conducted comprehensive experiments to evaluate the method's effectiveness in both pose estimation and object detection. Standard metrics including Precision, Recall, and mean Average Precision (mAP) [46] were employed for performance measurement.

For pose estimation, we used the Object Keypoint Similarity (OKS) [47] metric to quantify the spatial accuracy of predicted keypoints compared to ground truth keypoints. As shown in Eq. (14), the OKS metric is defined as:

$$OKS = \frac{{\mathop \sum \nolimits_{i} exp\left( { - \frac{{d_{i}^{2} }}{{2s^{2} \sigma_{i}^{2} }}} \right)\, \cdot \,v_{i} }}{{\mathop \sum \nolimits_{i} v_{i} + \varepsilon }}$$
(14)

In Eq. (14), \({d}_{i}\) represents the Euclidean distance between the predicted and ground truth i-th keypoint, and \(s\) denotes the object scale, typically derived from the bounding box area. The parameter \({\sigma }_{i}\) is the weight assigned to the i-th keypoint, representing its importance and allowing sensitivity adjustments for different keypoints. Since detection difficulty across plant parts is relatively uniform, we assigned equal weights to all keypoints of leaf and stalk. \({v}_{i}\) is the visibility flag for the i-th keypoint. A small constant \(\upvarepsilon\) is introduced to prevent division by zero.

Additionally, we incorporated the Percentage of Correct Keypoints (PCK) metric, which provides a more intuitive measure of localization accuracy. As defined in Eq. (15), PCK evaluates the percentage of predicted keypoints that fall within a specified threshold distance of their corresponding ground truth locations:

$$PCK@\alpha = \frac{{\mathop \sum \nolimits_{p} \mathop \sum \nolimits_{i} \delta \left( {d_{pi} \le \alpha \, \cdot \,d} \right)}}{{\mathop \sum \nolimits_{p} \mathop \sum \nolimits_{i} 1}}$$
(15)

In Eq. (15), \({d}_{pi}\) represents the Euclidean distance between the predicted and ground truth positions for the i-th keypoint of the p-th sample, α is the threshold coefficient, \(d\) is the reference distance (in this study’s implementation, the diagonal length of the plant's bounding box), and \(\delta (\bullet )\) is the indicator function that equals 1 when the condition is satisfied and 0 otherwise. The summations are over all samples \(p\) and all keypoints \(i.\) We primarily utilize PCK@0.2, which considers a keypoint correctly localized if its distance from ground truth is less than 20% of the reference distance. Compared to OKS, PCK offers a more intuitive binary evaluation particularly relevant for maize pose estimation.
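As a worked reference, the sketch below computes OKS (Eq. 14) and PCK@0.2 (Eq. 15) for a single plant; the array shapes and function names are our own, and the bounding box diagonal is used as the PCK reference distance as described above.

```python
import numpy as np

def oks(pred, gt, vis, scale, sigmas, eps=1e-9):
    """Object Keypoint Similarity (Eq. 14) for one object.
    pred, gt: (K, 2) keypoint coordinates; vis: (K,) visibility flags;
    scale: object scale (e.g. derived from the bounding box area); sigmas: (K,) keypoint weights."""
    d2 = np.sum((pred - gt) ** 2, axis=1)
    e = np.exp(-d2 / (2 * scale ** 2 * sigmas ** 2))
    return float(np.sum(e * vis) / (np.sum(vis) + eps))

def pck(pred, gt, box_diag, alpha=0.2):
    """PCK@alpha (Eq. 15): fraction of keypoints within alpha * reference distance,
    with the box diagonal used as the reference distance as in this study."""
    d = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(d <= alpha * box_diag))
```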

For object detection, the Intersection over Union (IoU), which measures the overlap between two bounding boxes A and B, is calculated using Eq. (16):

$$IoU = \frac{A \cap B}{{A \cup B}}$$
(16)

Based on these metrics defined in Eqs. (14)-(16), predictions are classified into the following categories:

True Positive (TP): A detected pose skeleton whose OKS exceeds a predefined threshold (50–95%) or a bounding box whose IoU exceeds the same threshold range. This is reflected in the mAP50 and mAP50 - 95 values.

False Positive (FP): A detection that either does not correspond to any ground truth or has an OKS/IoU below the threshold.

False Negative (FN): Ground truth components that the model fails to detect, especially critical in dense planting with occlusion.

From these definitions, we compute:

Precision (P) is defined in Eq. (17) as the ratio of correctly identified plant components (TP) to the total number of predictions made by the model (TP + FP):

$$P = \frac{TP}{{TP + FP}}$$
(17)

Recall (R) is calculated as shown in Eq. (18) as the ratio of correctly identified components to all actual components (TP + FN):

$$R = \frac{TP}{{TP + FN}}$$
(18)

Average Precision (AP): As defined in Eq. (19), AP is computed as the area under the Precision-Recall curve for a specific class. It provides a single value summarizing the precision-recall performance for each class (leaf or stalk), considering both precision and recall at different detection thresholds. A higher AP indicates better performance in terms of both accuracy and completeness:

$$AP = \,\int_{0}^{1} {P\left( R \right)dR}$$
(19)

mean Average Precision (mAP): The mAP is the mean of AP scores across all classes as shown in Eq. (20), providing a comprehensive metric that balances model performance across different classes:

$$mAP = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} AP_{i}$$
(20)

where \(N\) is the total number of classes, and \({AP}_{i}\) is the Average Precision for the i-th class.

The different thresholds for TP classification determine the corresponding mAP values. For instance, mAP50 - 95 is the average mAP calculated across a range of OKS or IoU thresholds, from 0.5 to 0.95, in increments of 0.05.
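For completeness, a minimal sketch of the IoU computation (Eq. 16) and of precision and recall from TP/FP/FN counts at a fixed threshold (Eqs. 17 and 18) is given below; the function names and box format are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union (Eq. 16) for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(tp, fp, fn):
    """Precision and recall from counts at a fixed IoU/OKS threshold (Eqs. 17 and 18)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r
```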


System environment and implementation details

Experiments were conducted at the Agricultural Genomics Institute, Chinese Academy of Agricultural Sciences, using a server cluster with NVIDIA Tesla V100S PCIe GPUs (32 GB memory). The server environment utilized Linux Kernel 3.10.0, PyTorch 2.2.1, and CUDA 11.1.

To enhance model generalization and robustness, we employed Mosaic Augmentation and HSV Augmentation [48,49,50], techniques validated in prior studies to improve data diversity and adaptability to scale variations. Ablation experiments regarding data augmentation strategies are detailed in “Data augmentation analysis” section.

We conducted sensitivity analyses on hyperparameters (“Hyperparameter optimization” section), identifying a learning rate of 0.01 and batch size of 4 as optimal. The model was trained for 130 epochs based on convergence analysis, which showed performance stabilization after approximately 120 epochs. A patience of 30 epochs prevented premature stopping while allowing the model to overcome temporary performance plateaus.
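Since PFLO is built on the YOLO family, its training interface is assumed here to be Ultralytics-compatible; the sketch below shows how the reported hyperparameters could be passed to such an interface. The model and dataset file names ("pflo.yaml", "maize-pose.yaml") are placeholders rather than released artifacts, while the hyperparameter values mirror those reported in this section and in the "Data augmentation analysis" section.

```python
from ultralytics import YOLO

# Hypothetical invocation assuming an Ultralytics-compatible pose model definition.
model = YOLO("pflo.yaml", task="pose")
model.train(
    data="maize-pose.yaml",              # placeholder dataset config (17 keypoints per instance)
    epochs=130,                          # convergence observed after ~120 epochs
    batch=4,
    lr0=0.01,
    patience=30,                         # early-stopping patience
    mosaic=1.0,                          # Mosaic augmentation probability
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,   # HSV augmentation gains
)
```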

The specific experiment details and parameter configurations are summarized in Table 5. Throughout the training process, key metrics were continuously monitored to evaluate convergence and overall model performance. This rigorous monitoring ensured that the model achieved optimal performance under the given experimental conditions.

Table 5 System environment and training hyperparameter configuration

Statistical significance analysis

To rigorously validate this study’s results, we implemented comprehensive statistical analysis protocols. All experiments utilized fivefold cross-validation with 5 independent runs per fold to ensure robust assessment and minimize the influence of random initialization and data partitioning.

Statistical significance was primarily assessed using Welch's t-test, which accommodates potential variance heterogeneity between experimental conditions—a common occurrence in deep learning experiments due to model stochasticity. We set the significance threshold at p < 0.05, with additional annotation for stronger statistical evidence at p < 0.01 and p < 0.001.

For each key performance metric (mAP50, mAP50 - 95, precision, recall), we report 95% confidence intervals calculated using Student's t-distribution. These intervals provide a clear indication of result reliability and variability across experimental repetitions. The confidence intervals were computed using Eq. (21):

$$CI = \overline{x} \pm t_{\alpha /2} \, \cdot \,\frac{s}{\sqrt n }$$
(21)

where \(\overline{x }\) is the sample mean, \({t}_{\alpha /2}\) is the critical value from Student's t-distribution for the desired confidence level (95%, i.e., \(\alpha = 0.05\)) and degrees of freedom (n − 1), \(s\) is the sample standard deviation, and \(n\) is the sample size.

In comparative analyses with state-of-the-art methods and ablation studies, we conducted pairwise significance tests between experimental conditions to precisely quantify performance differences. All statistical analyses used SciPy [51]'s statistical functions to ensure accuracy and reliability.
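A minimal sketch of these computations using SciPy is given below; the confidence interval follows Eq. (21) with the two-sided critical value, Welch's t-test corresponds to ttest_ind with equal_var=False, and the per-run scores shown are illustrative placeholders rather than the study's raw data.

```python
import numpy as np
from scipy import stats

def confidence_interval(samples, confidence=0.95):
    """Confidence interval of the mean using Student's t-distribution (Eq. 21)."""
    x = np.asarray(samples, dtype=float)
    n = x.size
    mean, sem = x.mean(), x.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    return mean - t_crit * sem, mean + t_crit * sem

# Welch's t-test between two sets of per-run mAP50 scores (illustrative values only).
pflo_runs = [72.0, 72.4, 72.1, 72.3, 72.2]
baseline_runs = [69.7, 70.1, 69.8, 70.0, 69.9]
t_stat, p_value = stats.ttest_ind(pflo_runs, baseline_runs, equal_var=False)
```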

Experimental results and discussion

Hyperparameter tuning and result analysis

Hyperparameter optimization

We conducted a comprehensive analysis of critical hyperparameters to maximize model performance through systematic sensitivity analysis. Figure 10 demonstrates the effect of bounding box padding size (additional pixels added symmetrically to each side of the minimum enclosing rectangle) on detection metrics. The x-axis represents the padding size in pixels, while the y-axis shows the mAP50 performance for various detection tasks, with solid lines representing box detection performance and dashed lines representing pose detection performance for different plant components (all, stalk, and leaf).

Fig. 10 Detection metrics with different padding sizes. The x-axis represents different padding sizes, while the y-axis corresponds to the mAP50 metric

Increasing padding from 0 to 25 pixels substantially improved performance: box detection mAP50 increased by 17% (from 0.781 to 0.916) and pose detection mAP50 by 53% (from 0.469 to 0.722). This improvement stems from enhanced contextual information enabling better feature discrimination. However, padding beyond 25 pixels exhibited diminishing returns while introducing drawbacks: increased computational overhead, compromised localization precision due to feature dilution, and heightened susceptibility to dataset-specific overfitting. Through comprehensive analysis, we established 25 pixels as the optimal padding parameter.

We evaluated three learning rates (0.001, 0.005, and 0.01) over the first 50 epochs (Fig. 11). A learning rate of 0.01 achieved the optimal trade-off between convergence efficiency and training stability. As illustrated in Fig. 11a–d, this rate significantly accelerates convergence while maintaining smooth and stable loss trajectories, effectively mitigating gradient vanishing or oscillations compared to lower rates. The performance heatmap in Fig. 11e highlights the superiority of the 0.01 learning rate, with notable improvements in pose estimation tasks critical for accurate phenotypic analysis.

Fig. 11 Learning rate optimization analysis: a Box Loss curves with different learning rates; b Pose Loss curves with different learning rates; c Box mAP50 - 95 curves with different learning rates; d Pose mAP50 - 95 curves with different learning rates; e The effect of learning rates on model performance metrics

For batch size (Fig. 12), comparing sizes 2 and 4 showed that the larger batch size delivered more stable training with smoother loss curves while achieving slightly better performance metrics. Box mAP50 - 95 improved from 0.6534 to 0.6635 and Pose mAP50 - 95 from 0.4188 to 0.4273, suggesting batch size 4 provided better gradient estimation. Although further improvements were observed with batch size 8 in preliminary experiments, hardware memory constraints limited final experiments to batch size 4 for full-resolution images.

Fig. 12 Batch size optimization analysis: a Box Loss curves with different batch sizes; b Pose Loss curves with different batch sizes; c Box mAP50 - 95 curves with different batch sizes; d Pose mAP50 - 95 curves with different batch sizes; e The effect of batch size on model performance metrics

The model was trained for 130 epochs based on convergence analysis showing performance stabilization after approximately 120 epochs, as evidenced by plateauing curves in Fig. 12c, d. Monitoring validation performance beyond 130 epochs revealed no significant improvements (< 0.1% change in mAP50 - 95) but early signs of overfitting, confirming this training duration ensured complete convergence while preventing overfitting.

Result analysis

We selected YOLOv9e with the YOLOv8-Pose detection head as the baseline and implemented proposed enhancements on this foundation. As shown in Figs. 13 and 14, PFLO consistently outperformed the baseline across all evaluation metrics for both pose estimation and object detection tasks.

Fig. 13 The metrics for pose estimation during the training process: a mAP50 - 95 (%) curve changes of Baseline and PFLO; b Pose loss curve changes of Baseline and PFLO; c Precision-confidence curve; d Precision-recall curve; e Recall-confidence curve; f F1-confidence curve

Fig. 14 The metrics for object detection during the training process: a mAP50 - 95 (%) curve changes of Baseline and PFLO; b Box loss curve changes of Baseline and PFLO; c Precision-confidence curve; d Precision-recall curve; e Recall-confidence curve; f F1-confidence curve

For pose estimation (Fig. 13), PFLO achieved a 7.3% relative improvement in mAP50 - 95 (from 0.398 to 0.427) while demonstrating faster convergence and lower final loss values. The precision-recall curves (Fig. 13d) highlight PFLO's ability to maintain higher precision across the entire recall range, with notable improvements in high-recall regions (0.7–0.9), critical for comprehensive pose analysis in dense field environments. The F1-confidence curves (Fig. 13f) confirm PFLO's enhanced detection capabilities across all confidence thresholds.

Similarly for object detection (Fig. 14), PFLO reached a 3.8% relative improvement in box mAP50 - 95 (from 0.636 to 0.660). Both precision-confidence and recall-confidence curves demonstrate consistent superiority across different operating conditions.

Table 6 provides a detailed breakdown of PFLO's performance by class and dataset split, revealing several important patterns:

Table 6 Evaluation of PFLO metrics by class and dataset split

Class-specific performance: PFLO performs better on leaf structures (test pose mAP50: 75.0%) than stalks (69.4%), suggesting that the model's enhanced feature extraction capabilities are particularly beneficial for capturing diverse leaf morphologies.

Generalization capability: PFLO demonstrates strong generalization with minimal performance degradation between validation and test sets (Box mAP50 from 92.6 to 91.5%, Pose mAP50 from 75.8 to 72.2%).

Spatial accuracy: High PCK@0.2 values (0.803–0.923) across all categories confirm PFLO's strong spatial accuracy in keypoint localization, which is essential for reliable phenotypic measurements.

These results demonstrate PFLO's robust performance across different plant structures and dataset conditions, confirming its effectiveness for high-throughput phenotyping in complex field environments.

Ablation experiments

Data augmentation analysis

To evaluate the contribution of different data augmentation techniques, we conducted a series of ablation experiments summarized in Table 7, with statistical significance assessed using Welch's t-test.

Table 7 Ablation study results for data augmentation techniques

Mosaic augmentation (Group B, probability = 1.0) led to modest but statistically significant improvements. Pose mAP50 increased from 68.6% to 69.7% (p = 0.005), while box mAP50 - 95 improved by 0.6% (p < 0.001), indicating Mosaic augmentation effectively enhances detection and localization in diverse contexts.

HSV color augmentation (Group C, H = 0.015, S = 0.7, V = 0.4) demonstrated more substantial effects. Pose mAP50 increased from 68.6% to 71.2% (p = 0.023), and box mAP50 - 95 rose by 22.6% (p = 0.030), showcasing the technique's effectiveness in enhancing model robustness to variations in color and lighting conditions common in field environments.

When both augmentations were applied together (Group D), the model exhibited the most significant improvements. Box mAP50 rose by 2.7% (p < 0.001), and box mAP50 - 95 increased by 23.4% (p < 0.001). Pose mAP50 improved by 3.6% (p = 0.028), with pose mAP50 - 95 increasing by 4.3% (p = 0.038). This synergistic effect demonstrates that combined augmentation provides complementary benefits outperforming individual techniques.

These results highlight the importance of comprehensive data augmentation in improving model robustness and accuracy, particularly for complex field environments where lighting and perspective variations are common. The substantial improvements in both object detection and pose estimation metrics validate the decision to incorporate both augmentation techniques into the final PFLO model.

Module contribution analysis

To validate each architectural module's contribution, we conducted module ablation experiments summarized in Table 8, with statistical significance assessed using Welch's t-test.

Table 8 The results of the module ablation experiments

The Squeeze and Excitation (SE) layer (Group B) recalibrates channel responses, significantly improving pose estimation performance. Pose mAP50 increased by 0.6% (p = 0.0057) and pose mAP50 - 95 by 0.7% (p = 0.02), demonstrating the effectiveness of adaptive feature recalibration.

The RepPose detection head (Group C), with its multi-branch structure, enhances feature extraction stability and precision, significantly improving pose mAP50 to 70.5% (p = 0.013) and pose mAP50 - 95 to 40.1% (p = 0.011), indicating substantial gains in keypoint detection accuracy.

The Dynamic Upsampling module (Group D) adjusts sampling points to better utilize spatial information, improving pose mAP50 - 95 by 0.6% (p = 0.021) and box mAP50 - 95 by 0.7% (p = 0.023), demonstrating its effectiveness in preserving spatial details.

The Multi-Scale Enhanced Attention Mechanism (Multi-SEAM) (Group E) delivered the most significant single-module improvement. Pose mAP50 increased by 1.3% (p = 0.002) and box mAP50 - 95 by 1.4% (p = 0.023), demonstrating its effectiveness in detecting targets at varying scales and handling occlusion—critical for field maize pose estimation.

When all enhancements were integrated (Group F), performance reached peak values across all metrics with high statistical significance. The complete model improved box mAP50 from 90.7 to 91.6% (p = 0.007) and pose mAP50 from 69.9 to 72.2% (p = 0.002), with corresponding improvements in mAP50 - 95 values. These results demonstrate the synergistic contributions of all modules, with their combined effect exceeding the sum of individual improvements.

Comparison with state-of-the-art methods

To ensure rigorous evaluation, we report PFLO's performance metrics with 95% confidence intervals, calculated using the Student's t-distribution. Statistical significance was assessed using Welch's t-test to account for potential variance heterogeneity.

Comparison of pose estimation performance

We evaluated PFLO against leading state-of-the-art pose estimation methods, spanning both top-down approaches (YOLOv5x–YOLO11x) and bottom-up approaches (HigherHRNet and DEKR). As summarized in Table 9, PFLO consistently outperforms all competing models across every evaluated metric.

Table 9 Comparison of PFLO with state-of-the-art pose estimation models

PFLO achieves 75.2% precision, surpassing the second-best model (YOLO11x) by 1.4% and outperforming the bottom-up methods HigherHRNet and DEKR by 33.6% and 34.2%, respectively. For recall, PFLO attains 70.2%, exceeding YOLOv9e by 1.3% and demonstrating a remarkable 47.5% improvement over DEKR's 22.7%.

In terms of the mAP50 metric—a critical indicator of pose estimation quality—PFLO reaches 72.2%, outperforming YOLOv9e's 69.9% by 2.3% (p < 0.001). The statistical significance confirms that this improvement represents a genuine advancement rather than random variation. Furthermore, in the more stringent mAP50 - 95 evaluation, which emphasizes precise keypoint localization across a range of IoU thresholds, PFLO achieves 42.7%, surpassing YOLOv9e's 39.8% by 2.9% (p = 0.002).

Compared to recent YOLO variants, PFLO demonstrates consistent superiority in pose estimation metrics: outperforming YOLOv5x (mAP50: p < 0.001, mAP50 - 95: p < 0.001), YOLOv6x (mAP50: p < 0.001, mAP50 - 95: p < 0.001), YOLOv8x (mAP50: p = 0.005, mAP50 - 95: p = 0.034), YOLOv10x (mAP50: p = 0.039, mAP50 - 95: p = 0.002), and YOLO11x (mAP50: p < 0.001, mAP50 - 95: p < 0.001). These statistically significant improvements validate the effectiveness of the architectural enhancements.

Comparison of object detection performance with state-of-the-art methods

We further evaluated PFLO's object detection capabilities against state-of-the-art methods, including single-stage approaches (YOLOv5x-YOLO11x) and the two-stage method Faster R-CNN. The results in Table 10 demonstrate PFLO's consistent superiority across all key metrics.

Table 10 Comparison of PFLO with state-of-the-art object detection models

PFLO achieves 86.8% precision and 86.2% recall, exceeding the second-best models by 0.3% and 1.2% respectively. For mAP50, PFLO reaches 91.6%, outperforming YOLO11x (91.2%, p = 0.069) and YOLOv10x (91.2%, p = 0.065). While these improvements approach but do not reach statistical significance at the conventional p < 0.05 threshold, they still represent meaningful advancements in agricultural contexts where even marginal improvements can translate to substantial practical benefits. In the more rigorous mAP50 - 95 evaluation, PFLO demonstrates statistically significant improvements compared to all tested methods, with p-values ranging from 0.049 (versus YOLOv9e) to < 0.001 (versus Faster R-CNN, YOLOv5x, and YOLOv6x), confirming its superior localization precision across various IoU thresholds.

Summary and statistical significance analysis

These results demonstrate that YOLOv9, as a foundational model, maintains excellent object detection capabilities and inference speed while delivering superior pose estimation accuracy. YOLOv9e achieves the highest pose estimation performance (mAP50: 69.9%, mAP50 - 95: 39.8%) among all baseline models tested, outperforming both older and newer variants while offering competitive inference speed (16.7 ms). These empirical results justify selecting YOLOv9 as the baseline for further architectural enhancements in PFLO.

While achieving state-of-the-art pose estimation accuracy, PFLO also delivers precise bounding box localization in object detection tasks while maintaining competitive processing speed. The fivefold cross-validation experiments and statistical significance analysis using Welch's t-test confirmed that PFLO's improvements represent substantial advancements in model design rather than random variations. The synergistic integration of multiple architectural modules enables PFLO to effectively address the challenges of field-based pose estimation. Although a few improvements approach but do not reach statistical significance at the conventional p < 0.05 threshold—likely due to the small variance and marginal gains observed in repeated experiments—they still represent meaningful advancements in agricultural contexts, where even marginal improvements can translate to substantial practical benefits.

Growth stage-specific and occlusion analysis

Growth stage and occlusion level performance analysis

To evaluate PFLO's adaptability throughout the maize phenological cycle, we conducted a systematic analysis correlating developmental stages with occlusion complexity. This approach addresses the intrinsic relationship between growth progression and occlusion severity—as maize plants develop, both morphological complexity and inter-plant occlusion increase from minimal levels in early vegetative stages (V3-V5) to severe conditions in later stages (V9-R1).

Table 11 quantifies PFLO performance across these developmental stages and occlusion conditions. The model exhibited exceptional detection capabilities at the V6-V8 stage (moderate occlusion), with pose mAP50 reaching 79.6% and PCK@0.2 achieving 0.8563. Performance metrics showed a moderate decline in later growth stages (V9-R1 with heavy occlusion), where pose mAP50 decreased to 68.3%—an expected reduction given the substantial increase in structural complexity and inter-plant occlusion (40–60%). Nevertheless, PFLO maintained robust accuracy under these challenging conditions, consistently outperforming the baseline model across all developmental phases.

Table 11 Performance evaluation of PFLO across maize growth stages and occlusion levels

Notably, PFLO demonstrated superior occlusion resilience compared to the baseline. While both models experienced performance declines as occlusion increased from minimal (0–20%) to heavy (40–60%), PFLO's relative performance reduction was considerably smaller (5.7% decrease in pose mAP50) than the baseline (7.3% decrease). This enhanced occlusion resistance represents a significant advancement for field-based phenotyping applications where plant overlap is unavoidable.

The Object Keypoint Similarity (OKS) metric validated PFLO's spatial accuracy in keypoint localization, with values ranging from 0.5040 to 0.5765 across growth stages. Similarly, the PCK@0.2 metric consistently exceeded 0.79 throughout all developmental phases, confirming high-precision localization capabilities. These metrics collectively demonstrate PFLO's versatility and robustness throughout the entire crop development cycle.

Figure 15 illustrates this developmental progression with representative samples arranged by row (V3-V4, V6, V10, and R1), with each row displaying three visualization modes: ground truth annotations (left column), PFLO predictions (middle column), and Grad-CAM (Gradient-weighted Class Activation Mapping) heatmaps (right column). The Grad-CAM visualization reveals regions of model focus through gradient-weighted class activation mapping, with warm colors (red-yellow) indicating areas of higher importance and cooler colors (blue-green) representing lower importance regions.

Fig. 15 PFLO detection performance and attention visualization across maize growth stages. The rows represent maize images at different growth stages, while the columns, from left to right, correspond to the ground truth, PFLO-detected maize posture, and the Grad-CAM-based heatmap visualization of regions of interest during the PFLO detection process. The heatmaps illustrate the model's attention, where warm colors (red-yellow) indicate regions of higher importance, while cooler colors (blue-green) denote less significant areas

The visualization demonstrates PFLO's consistent high detection precision across phenological stages. In early development (V3-V4), the model accurately focused on smaller, developing leaves with concentrated attention hotspots. In mid-vegetative stages (V6), PFLO adapted to expanded leaf structures while maintaining precise stalk detection. During later stages (V10 and R1), despite increased plant height, leaf count, and occlusion, the heatmaps reveal how PFLO effectively distributed attention across complex plant structures while maintaining focus on key architectural elements.

Comparative visual assessment

To evaluate PFLO's detection capabilities against leading models, we conducted comparative tests against YOLOv8x, YOLOv10x, and DEKR on the most severely occluded samples from the V9-R1 growth stages. Figure 16 presents this comparison across three representative field environments (columns A-C), with six visualization rows: input image, ground truth annotation, PFLO prediction, YOLOv8x prediction, YOLOv10x prediction, and DEKR prediction. We used a consistent color scheme where red indicates stalks and other colors represent individual leaves.

Fig. 16

Comparison of detection capabilities among different models across three typical field environments (columns A-C). The rows represent the detection results of different models, while the columns correspond to three distinct field environments

We identified nine specific cases (numbered boxes 1–9) representing three detection challenges: background-row plants, severely occluded specimens, and edge-case detections. Although ground truth annotations were limited to plants in the first two rows, PFLO successfully detected plants in background rows (boxes 1, 2) and plants partially occluded by neighboring specimens (boxes 5, 6, 8, 9). PFLO also detected plants at image edges (boxes 3, 7), whereas competing models either missed these detections entirely or produced significant structural errors.

It is worth noting that all models, including PFLO, face challenges when detecting plants in extremely adverse conditions—those in distant background rows, under complete occlusion, or at extreme image edges. These challenging cases represent targets for our ongoing research efforts.

Compared to other state-of-the-art methods, PFLO demonstrated better performance in occlusion handling, edge detection, and small target identification, producing more accurate pose skeletons that closely align with the actual leaf and stalk structure. This enhanced structural representation provides a more effective reflection of plant growth conditions and physiological status, offering valuable insights for phenotypic analysis and agricultural monitoring.

Conclusion and future work

This study introduces PFLO, a novel approach for maize pose extraction and analysis in complex field environments, integrating data preprocessing techniques with architectural enhancements for robust pose estimation.

In data preprocessing, we developed a keypoint-line annotation standardization approach that transforms manually annotated data into uniformly distributed keypoints. Using Euclidean distance metrics and semantic category labels, this method mitigates inconsistencies and reduces annotator bias in manual annotations.
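
To make the idea concrete, the sketch below resamples an annotated keypoint-line (an ordered sequence of points along a leaf or stalk) into a fixed number of evenly spaced keypoints by Euclidean arc length, and derives a bounding box from the resulting skeleton. This is an illustrative approximation of the described preprocessing, not the exact implementation; the number of keypoints and the box margin are arbitrary choices for the example.

```python
import numpy as np

def resample_polyline(points, n_keypoints):
    """Resample an annotated keypoint-line (ordered 2D points along a leaf or stalk)
    into n_keypoints evenly spaced along its Euclidean arc length."""
    points = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)   # segment lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])           # cumulative arc length
    targets = np.linspace(0.0, cum[-1], n_keypoints)        # evenly spaced positions
    x = np.interp(targets, cum, points[:, 0])               # interpolate coordinates
    y = np.interp(targets, cum, points[:, 1])
    return np.stack([x, y], axis=1)

def bbox_from_keypoints(kpts, pad=0.05):
    """Axis-aligned box around all keypoints of one plant, with a small relative margin."""
    kpts = np.asarray(kpts, dtype=float)
    lo, hi = kpts.min(axis=0), kpts.max(axis=0)
    margin = pad * (hi - lo)
    return np.concatenate([lo - margin, hi + margin])        # [x1, y1, x2, y2]

# e.g. a sparsely annotated leaf midrib resampled to 5 uniform keypoints
leaf = [(10, 200), (40, 150), (90, 130), (160, 140)]
skeleton = resample_polyline(leaf, 5)
print(skeleton, bbox_from_keypoints(skeleton))
```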

Architecturally, PFLO incorporates multiple enhancements: feature recalibration through the SE module, enhanced feature extraction via RepConv, dynamic upsampling for finer detail preservation, and multi-scale attention mechanisms (Multi-SEAM) to handle occlusions.
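
As one concrete example of these components, a minimal squeeze-and-excitation block in PyTorch is sketched below. It follows the standard SE formulation (global average pooling followed by a two-layer gating network); the reduction ratio and its exact placement within PFLO are not restated here, so treat those details as illustrative.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel recalibration; reduction ratio is illustrative."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial context
        self.fc = nn.Sequential(                     # excitation: per-channel gates
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # reweight feature channels

# e.g. recalibrating a 64-channel feature map
feat = torch.randn(2, 64, 80, 80)
print(SEBlock(64)(feat).shape)  # torch.Size([2, 64, 80, 80])
```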

Experimental results demonstrate PFLO's superior performance across growth stages and occlusion conditions, achieving 72.2% mAP50 and 42.7% mAP50-95 for pose estimation, alongside 91.6% mAP50 and 66.0% mAP50-95 for object detection. PFLO consistently outperforms state-of-the-art approaches with statistical significance, particularly excelling in challenging field scenarios with inter-plant occlusion. The model adapts well throughout the entire phenological cycle, showing only gradual performance degradation even under severe occlusion (40–60%). Its end-to-end architecture, which regresses keypoints and bounding boxes in a single forward pass, delivers precise detection while maintaining high throughput.
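
For clarity, mAP50 is the mean average precision at an IoU (or OKS, for pose) threshold of 0.50, while mAP50-95 averages the metric over thresholds from 0.50 to 0.95 in steps of 0.05, following the COCO convention. A small sketch, assuming a per-threshold AP evaluator (`ap_at_iou`, hypothetical) is available:

```python
import numpy as np

def map50_95(ap_at_iou):
    """Average precision averaged over IoU thresholds 0.50:0.05:0.95 (COCO convention).

    ap_at_iou : callable returning the detector's AP at a given IoU threshold
                (a hypothetical hook into an existing evaluator).
    """
    thresholds = np.arange(0.50, 1.00, 0.05)
    return float(np.mean([ap_at_iou(t) for t in thresholds]))

# toy illustration with a made-up AP curve that decays as the threshold tightens
print(map50_95(lambda t: max(0.0, 0.916 - 1.2 * (t - 0.5))))
```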

The performance improvements demonstrated by PFLO have tangible implications for agricultural practices. The 2.3% increase in pose estimation accuracy (mAP50) translates to approximately 23 additional correctly analyzed plants per 1,000 specimens, enabling more precise phenotypic measurements across large field trials. For breeding programs, this enhanced accuracy can accelerate the identification of stress-resistant varieties by more reliably detecting subtle postural changes indicative of environmental adaptation. Furthermore, in precision agriculture applications, PFLO's superior performance under occlusion (a 5.7% decline in pose mAP50 versus 7.3% for the baseline) ensures more consistent monitoring throughout the growing season, potentially reducing the need for repeated imaging sessions and the associated labor costs.

Compared to previous maize vision research, PFLO utilizes widely accessible handheld camera images to achieve precise detection of maize posture in field conditions without requiring specialized phenotyping platforms or expensive equipment (e.g., LiDAR and stereo cameras). This significantly reduces the technological barrier, enhancing practicality and scalability in resource-constrained agricultural environments. The method we proposed for processing plant pose skeleton data offers valuable insights for unbiased and standardized plant pose estimation in future studies. Through rigorous quantitative analysis and systematic validation, this study demonstrates substantial improvement in pose estimation accuracy and establishes a versatile, scalable multi-task plant pose estimation framework, providing robust technical support for crop health monitoring, stress detection, and high-throughput breeding programs.

Despite these advancements, PFLO has several limitations. First, its performance remains challenged by extreme occlusion scenarios and distant background plants, suggesting the need for additional contextual information. Second, although this study’s dataset includes images from multiple growth stages, the lack of annotated data covering diverse soil conditions, geographical regions, and maize varieties has limited validation across these variables. Further evaluation is warranted once additional annotated data become available. Finally, PFLO prioritizes precise detection over inference speed optimization, presenting an opportunity for future efficiency improvements.

Future research will focus on four main directions: (1) integrating multi-angle and multi-modal data, combining DSLR and UAV imagery, to enhance detection accuracy under heavily occluded conditions; (2) expanding evaluations to diverse geographic regions, soil conditions, and maize varieties to thoroughly assess model robustness across agricultural contexts; (3) developing edge-computing optimizations for real-time field deployment on resource-constrained devices, since PFLO was primarily designed for cluster-based inference and on-device efficiency remains a valuable avenue for exploration; and (4) establishing clear connections between plant posture data and key physiological indicators (e.g., health status, nutrient levels, pest or disease pressures), thus enabling in-field physiological assessments based on posture detection. Additionally, we plan to adapt PFLO's architecture to other major crops, including rice and wheat, to address their distinct structural attributes and phenotyping needs. Through these efforts, we aim to facilitate more precise crop management and timely interventions grounded in plant posture detection.

Availability of data and materials

The computer code and data supporting the findings of this study are available in the GitHub repository: https://github.com/Akacaesarp/PFLO. The MIPDB dataset can be accessed at: http://phenomics.agis.org.cn/#/category.

Abbreviations

CNN: Convolutional neural network
CSPDarknet: Cross stage partial darknet
DEKR: Disentangled keypoint regression
DSLR: Digital single-lens reflex camera
Dy_Sample: Dynamic sampling-based upsampling
GELAN: General efficient layer aggregation network
Grad-CAM: Gradient-weighted class activation mapping
HRNet: High-resolution network
HSV: Hue, saturation, value
IoU: Intersection over union
mAP: Mean average precision
Multi-SEAM: Multiple-scale separated and enhanced attention module
OKS: Object keypoint similarity
PCK: Percentage of correct keypoints
PGI: Programmable gradient information
RepConv: Re-parameterized convolution
SE: Squeeze and excitation module
SGD: Stochastic gradient descent
YOLO: You only look once

References

  1. Runge CF, Senauer B. How biofuels could starve the poor. Foreign Aff. 2007;86:41.

  2. Shiferaw B, Prasanna BM, Hellin J, Bänziger M. Crops that feed the world 6. Past successes and future challenges to the role played by maize in global food security. Food Secur. 2011;3:307–27.

  3. Fourcaud T, Zhang X, Stokes A, Lambers H, Körner C. Plant growth modelling and applications: the increasing importance of plant architecture in growth models. Ann Bot. 2008;101(8):1053–63.

  4. Moulia B, Coutand C, Lenne C. Posture control and skeletal mechanical acclimation in terrestrial plants: implications for mechanical modeling of plant architecture. Am J Bot. 2006;93(10):1477–89.

  5. Furbank RT, Tester M. Phenomics–technologies to relieve the phenotyping bottleneck. Trends Plant Sci. 2011;16(12):635–44.

  6. Tardieu F, Simonneau T, Muller B. The physiological basis of drought tolerance in crop plants: a scenario-dependent probabilistic approach. Annu Rev Plant Biol. 2018;69(1):733–59.

  7. Cabrera-Bosquet L, Crossa J, von Zitzewitz J, Serret MD, Luis AJ. High-throughput phenotyping and genomic selection: the frontiers of crop breeding converge. J Integr Plant Biol. 2012;54(5):312–20.

  8. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst. 2012;25:1097–105.

  9. Waheed A, Goyal M, Gupta D, Khanna A, Hassanien AE, Pandey HM. An optimized dense convolutional neural network model for disease recognition and classification in corn leaf. Comput Electron Agric. 2020;175:105456.

  10. Li M, Tao Z, Yan W, Lin S, Feng K, Zhang Z, et al. APNet: lightweight network for apricot tree disease and pest detection in real-world complex backgrounds. Plant Methods. 2025;21(1):4.

  11. Chadoulis R-T, Livieratos I, Manakos I, Spanos T, Marouni Z, Kalogeropoulos C, et al. 3D-CNN detection of systemic symptoms induced by different Potexvirus infections in four Nicotiana benthamiana genotypes using leaf hyperspectral imaging. Plant Methods. 2025;21(1):15.

  12. Liu Z, Wang S. Broken corn detection based on an adjusted YOLO with focal loss. IEEE Access. 2019;7:68281–9.

  13. Jing X, Wang Y, Li D, Pan W. Melon ripeness detection by an improved object detection algorithm for resource constrained environments. Plant Methods. 2024;20(1):127.

  14. Yang K, Zhong W, Li F. Leaf segmentation and classification with a complicated background using deep learning. Agronomy. 2020;10(11):1721.

  15. Aich S, Stavness I. Leaf counting with deep convolutional and deconvolutional networks. Proceedings of the IEEE International Conference on Computer Vision Workshops; 2017.

  16. Jiang H, Gilbert Murengami B, Jiang L, Chen C, Johnson C, Auat Cheein F, et al. Automated segmentation of individual leafy potato stems after canopy consolidation using YOLOv8x with spatial and spectral features for UAV-based dense crop identification. Comput Electron Agric. 2024;219:108795.

  17. He J, Weng L, Xu X, Chen R, Peng B, Li N, et al. DEKR-SPrior: an efficient bottom-up keypoint detection model for accurate pod phenotyping in soybean. Plant Phenomics. 2024;6:0198.

  18. Pishchulin L, Insafutdinov E, Tang S, Andres B, Andriluka M, Gehler PV, et al. DeepCut: joint subset partition and labeling for multi person pose estimation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016.

  19. Song C, Zhang F, Li J, Zhang J. Precise maize detasseling base on oriented object detection for tassels. Comput Electron Agric. 2022;202:107382.

  20. Hämmerle M, Höfle B. Direct derivation of maize plant and crop height from low-cost time-of-flight camera measurements. Plant Methods. 2016;12(1):50.

  21. Qi J, Ding C, Zhang R, Xie Y, Li L, Zhang W, et al. UAS-based MT-YOLO model for detecting missed tassels in hybrid maize detasseling. Plant Methods. 2025;21(1):21.

  22. Araus JL, Cairns JE. Field high-throughput phenotyping: the new crop breeding frontier. Trends Plant Sci. 2014;19(1):52–61.

  23. Liu B, Chang J, Hou D, Pan Y, Li D, Ruan J. Recognition and localization of maize leaf and stalk trajectories in RGB images based on point-line net. Plant Phenomics. 2024;6:0199.

  24. Cao Z, Simon T, Wei S-E, Sheikh Y. Realtime multi-person 2D pose estimation using part affinity fields. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017.

  25. Jocher G, Chaurasia A, Qiu J. Ultralytics YOLO (version 8.0.0). GitHub; 2023.

  26. Newell A, Yang K, Deng J. Stacked hourglass networks for human pose estimation. Cham: Springer International Publishing; 2016.

  27. Wang C-Y, Yeh I-H, Mark Liao H-Y. YOLOv9: learning what you want to learn using programmable gradient information. European Conference on Computer Vision; 2025: Springer.

  28. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018.

  29. Ding X, Zhang X, Ma N, Han J, Ding G, Sun J. RepVGG: making VGG-style ConvNets great again. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021.

  30. Liu W, Lu H, Fu H, Cao Z. Learning to upsample by learning to sample. Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023.

  31. Yu Z, Huang H, Chen W, Su Y, Liu Y, Wang X. YOLO-FaceV2: a scale and occlusion aware face detector. Pattern Recogn. 2024;155:110714.

  32. Wang P, Chang J, Deng W, Liu B, Lai H, Hou Z, et al. MIPDB: a maize image-phenotype database with multi-angle and multi-time characteristics. bioRxiv. 2024. https://doi.org/10.1101/2024.04.26.589844.

  33. Abendroth LJ, Elmore RW, Boyer MJ, Marlay S. Corn growth and development. 2011.

  34. Paul A, Machavaram R, Ambuj, Kumar D, Nagar H. Smart solutions for capsicum harvesting: unleashing the power of YOLO for detection, segmentation, growth stage classification, counting, and real-time mobile identification. Comput Electron Agric. 2024;219:108832.

  35. Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep learning. Cambridge: MIT Press; 2016.

  36. Maji D, Nagori S, Mathew M, Poddar D. YOLO-Pose: enhancing YOLO for multi person pose estimation using object keypoint similarity loss. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022.

  37. Wang C-Y, Liao H-YM, Wu Y-H, Chen P-Y, Hsieh J-W, Yeh I-H. CSPNet: a new backbone that can enhance learning capability of CNN. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops; 2020.

  38. Bishop CM. Neural networks for pattern recognition. Oxford: Oxford University Press; 1995.

  39. Nair V, Hinton GE. Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning (ICML-10); 2010.

  40. Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017.

  41. Dong C, Loy CC, He K, Tang X. Learning a deep convolutional network for image super-resolution. Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part IV; 2014: Springer.

  42. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015.

  43. Zhang X, Zhou X, Lin M, Sun J. ShuffleNet: an extremely efficient convolutional neural network for mobile devices. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018.

  44. Howard AG. MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. 2017.

  45. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016.

  46. Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A. The PASCAL visual object classes (VOC) challenge. Int J Comput Vision. 2010;88:303–38.

  47. Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: common objects in context. Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V; 2014: Springer.

  48. Bochkovskiy A, Wang C-Y, Liao H-YM. YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934. 2020.

  49. Howard AG. Some improvements on deep convolutional neural network based image classification. arXiv preprint arXiv:1312.5402. 2013.

  50. Simard PY, Steinkraus D, Platt JC. Best practices for convolutional neural networks applied to visual document analysis. Edinburgh: ICDAR; 2003.

  51. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17(3):261–72.


Acknowledgements

We would like to express our sincere gratitude to Mrs. Panpan Wang for her contributions to data collection. We would also like to thank Mr. Yongyao Li from Agricultural Genomics Institute at Shenzhen (AGIS) for his valuable assistance with computing resources.

Funding

This work was supported by YNTC 2022530000241008, CNTC110202101039 (JY-16), the National Key Research and Development Program of China (No. 2022YFC3400300), and the Natural Science Foundation of China (No. 32300518).

Author information


Contributions

Y.P. conceptualized the study, designed and implemented the data processing methodology and deep learning models, conducted model training and validation, performed formal analysis, and created visualizations. Y.P. and H.L. wrote the original draft and led the manuscript revision process. J.C., Z.M., and B.L. contributed to investigation, data curation, and formal analysis, and provided critical feedback during manuscript review and editing. J.R., L.W., and H.L. supervised the project, secured funding, and provided administrative support and research resources.

Corresponding authors

Correspondence to Li Wang, Hailin Liu or Jue Ruan.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Pan, Y., Chang, J., Dong, Z. et al. PFLO: a high-throughput pose estimation model for field maize based on YOLO architecture. Plant Methods 21, 51 (2025). https://doi.org/10.1186/s13007-025-01369-6


  • DOI: https://doi.org/10.1186/s13007-025-01369-6

Keywords