DWTFormer: a frequency-spatial features fusion model for tomato leaf disease identification

Xiang, Yuyun; Gao, Shuang; Li, Xiaopeng; Li, Shuqin

doi:10.1186/s13007-025-01349-w

Research
Open access
Published: 11 March 2025

DWTFormer: a frequency-spatial features fusion model for tomato leaf disease identification

Yuyun Xiang¹,
Shuang Gao²,
Xiaopeng Li¹ &
…
Shuqin Li^1,3,4

Plant Methods volume 21, Article number: 33 (2025) Cite this article

473 Accesses
Metrics details

Abstract

Remarkable inter-class similarity and intra-class variability of tomato leaf diseases seriously affect the accuracy of identification models. A novel tomato leaf disease identification model, DWTFormer, based on frequency-spatial feature fusion, was proposed to address this issue. Firstly, a Bneck-DSM module was designed to extract shallow features, laying the groundwork for deep feature extraction. Then, a dual-branch feature mapping network (DFMM) was proposed to extract multi-scale disease features from frequency and spatial domain information. In the frequency branch, a 2D discrete wavelet transform feature decomposition module effectively captured the rich frequency information in the disease image, compensating for spatial domain information. In the spatial branch, a multi-scale convolution and PVT (Pyramid Vision Transformer)-based module was developed to extract the global and local spatial features, enabling comprehensive spatial representation. Finally, a dual-domain features fusion model based on dynamic cross-attention was proposed to fuse the frequency-spatial features. Experimental results on the tomato leaf disease dataset demonstrated that DWTFormer achieved 99.28% identification accuracy, outperforming most existing mainstream models. Furthermore, 96.18% and 99.89% identification accuracies have been obtained on the AI Challenger 2018 and PlantVillage datasets. In-field experiments demonstrated that DWTFormer achieved an identification accuracy of 97.22% and an average inference time of 0.028 seconds in real plant environments. This work has effectively reduced the impact of inter-class similarity and intra-class variability on tomato leaf disease identification. It provides a scalable model reference for fast and accurate disease identification.

Introduction

The tomato crop is regarded as a strategic agricultural resource for many countries worldwide, but the rapid spread of diverse diseases significantly threatens its yield and quality [1]. The diseases first manifest on leaves and then spread to the entire plant. Thus, it is crucial to timely and accurately identify leaf diseases for effective disease management in tomato cultivation. Traditional methods of identifying tomato leaf diseases primarily rely on plant-protection experts with specialized knowledge or experienced farmers. But, these manual approaches are time-consuming, labor-intensive, and prone to subjective biases, increasing the likelihood of misjudgments [2]. In particular, the symptoms of the same tomato leaf disease can vary considerably across different stages of infection (intra-class variability), while multiple diseases may present similar pathological characteristics (inter-class similarity). Consequently, developing automated methods for tomato leaf disease identification using computer vision technology is vital for ensuring efficient tomato production.

In recent years, the rapid development of machine learning has provided new ideas for identifying crop leaf disease [3], such as Support Vector Machine [4], Random Forests [5], and Logistic Regression [6]. While these approaches have achieved some success, their accuracy relies heavily on manually designed feature extraction algorithms [7] to extract excellent disease features. Given the variety of tomato leaf diseases and the complexity of their feature distributions, machine learning-based methods remain relatively inefficient.

To address the above issues, deep-learning methods represented by convolutional neural networks employ an end-to-end structure for disease identification [8, 9], offering advantages such as robust feature extraction capabilities and high adaptability [10]. These methods have achieved outperforming success in identifying crop leaf disease, such as citrus [11, 12], tea [13, 14], apples [15, 16], strawberries [17, 18] and tomatoes [19, 20]. Although these methods offer significant identification accuracy compared to machine-learning approaches, they primarily extract disease features from a single spatial domain information. In datasets with simple feature distributions, these methods effectively focus on disease regions. However, their performance declines in datasets with more complex feature distributions. In such cases, the models tend to prioritize low-frequency component information during the fitting process, often neglecting high-frequency texture details in the images, ultimately reducing the disease identification accuracy.

To address the limitations of relying solely on single spatial domain information, in recent years, many researchers have transformed the spatial domain information of disease images into frequency-domain information, aiming to enhance model performance [21,22,23]. For instance, Li et al. [21] used 2D Discrete Wavelet Transform (DWT) to extract frequency-domain features from tea disease spots, achieving 88% identification accuracy. Zhang et al. [22] designed a frequency-domain attention network to adaptively learn the weights of each frequency, achieving a 98.83% accuracy in citrus disease identification. Li et al. [23] used multispectral channels to convert spatial features into frequency-domain features, achieving an accuracy of 96.7% in identifying tomato leaf diseases. These studies replace spatial features with frequency-domain features as input to the identification models, providing valuable insights into extracting and learning disease features from frequency-domain. However, solely using frequency-domain features risks losing spatial features, which contain rich semantic information. Additionally, the challenges posed by intra-class variability and inter-class similarity of tomato leaf disease highlighted the critical need for comprehensive semantic information. In contrast, the fusion of frequency-spatial features is more capable of capturing semantic information and multi-scale disease features from dual domains to improve the models’ identification accuracy.

To reduce the effect of inter-class similarity and intra-class variability of tomato leaf disease, this paper proposed a frequency-spatial features fusion model, DWTFormer. The frequency-domain information captures the responses of crop leaf regions at various frequencies, providing abstract representations of disease features such as textures and edges. Spatial information describes global and local structures of disease features. By integrating these two types of information, the model can learn richer, multi-scale feature representations, thereby mitigating the effects of inter-class similarity and intra-class variability in tomato leaf diseases, ultimately improving identification accuracy. The main contributions are listed below:

(1) A frequency-spatial features fusion network for tomato leaf disease identification (DWTFormer) was proposed to reduce the impact of inter-class similarity and intra-class variability on the identification accuracy.

(2) A dual-branch feature mapping model (DFMM) was designed to extract multi-scale disease features from the frequency-spatial domain. The frequency branch captured rich frequency features, compensating for spatial features. The spatial branch achieved comprehensive spatial feature mapping.

(3) A dual-domain features fusion model based on dynamic cross-attention (MDFF-DCA) was designed to fuse frequency-spatial features. It assigns different weights to frequency-spatial features based on their attributes, enhancing the network’s focus on disease features.

Materials and methods

Datasets

Tomato leaf disease dataset

The tomato leaf disease dataset was collected from Kaggle (https://www.kaggle.com/datasets). It contains 9 diseases, totaling 39,295 images. These samples were collected from fields, greenhouses, and controlled laboratories with simple and complex backgrounds. Some samples were shown in Fig. 1. Fig. 1a shows that the range of powdery mildew distribution, the thickness of the powdery mildew layer, the clarity of concentric circles of early blight lesions, and other tomato leaf disease features show great intra-class variability depending on different stages of infection. Fig. 1b showed that both mosaic virus disease and leaf mold disease show yellow-green lesions in the early stages. Both grey spot disease and bacterial spot disease show brown lesions and yellow leaves. Both early blight and late blight show dry lesions. These disease features manifest high inter-class similarity.

The self-built dataset

Many studies have highlighted that inter-class similarity and intra-class variability are not unique to tomato diseases but are also observed in other crop leaf diseases, such as apples [24], strawberries [25], etc. As a result, this paper constructed a self-built dataset containing leaf disease of strawberries, apples, and tea to verify the generalizability of DWTFormer. It contains 7,883 samples across 15 classes collected from two sources. The apple and tea leaf disease datasets were obtained from apple and tea experimental demonstration stations in Shaanxi Province. The strawberry leaf disease dataset [26] was sourced from RoboFlow (https://universe.roboflow.com) and Kaggle platforms. Some samples were shown in Fig. 2. All images were captured in greenhouse and open-field environments characterized by complex backgrounds. The self-built dataset was augmented using common operations to address the class imbalance, resulting in an expanded dataset with 52,884 samples.

The AI Challenger 2018 and PlantVillage datasets

To further validate the DWTFormer’s effectiveness, the PlantVillage [27] and AI Challenger 2018 (www.challenger.ai) datasets were used to compare the performance of DWTFormer and some state-of-the-art models. The PlantVillage dataset includes 24 diseases across 14 crops, making the distribution of disease features highly diverse. It contains 54306 samples. In the AI Challenger dataset, 35861 samples were categorized into 61 classes based on disease severity, with significant variability observed among different samples of the same disease.

Analysis of tomato disease features

To design an effective network to reduce the impact of intra-class variability and inter-class similarity, Fig. 3 analyzed the tomato leaf disease samples using digital image processing. A represents the low-frequency component containing global structural information. H, V and D represent high-frequency components containing local information.

Fig. 3a indicates that changes in leaf color at different stages (green-yellow-brown) manifest as only slow brightness variations in the low-frequency component. The disease extent appeared as faint patches blending with the overall leaf shape, making distinguishing challenging. In contrast, during the early stages of the disease, the spot edges in the high-frequency components exhibited simple linear textures. As the disease spread, the spot edges in the H and V components became more intricate and denser, and the D component revealed staggered light and dark variations. These local features emphasized the disease details at various stages of the same disease, playing a critical role in addressing the impact of intra-class variability on disease identification accuracy. Fig. 3b demonstrates that high-frequency components exhibit highly similar edge and texture features of different diseases, providing limited discriminatory information. However, low-frequency components revealed subtle yet crucial differences in brightness distribution, spread range, and position. For example, bacterial spot disease featured slightly darker, evenly distributed disease spots with a small spread range concentrated in the middle of the leaf. In contrast, the disease spots of grey spot disease were significantly darker, more concentrated, and covered a larger area, often along the leaf edge or specific regions. These global features highlight key differences in spread patterns and distribution, making them essential for accurate disease classification.

Based on the above analysis, disease identification models should focus more on multi-scale feature extraction and fusion to reduce the impact of intra-class variability and inter-class similarity on the accuracy of tomato disease identification.

Model overview

As shown in Fig. 4, DWTFormer took MobileNetV3 [28] as its backbone and employed a multi-stage design to extract multi-scale features. Stage 1 to Stage 3 contains three repetitive units. Each unit consists of several Bneck-DSMs, DFMMs, and MDFF-DCAs. To balance the model performance and computational cost, Stage 4 and Stage 5 only consist of different numbers of Bneck-DSM without DFMMs and MDFF-DCAs. Bneck-DSM refers to embedding a Dynamic Shift Max (DSM) activation function [29] in Bneck to replace ReLU6 , improving the model’s nonlinearity. Finally, adaptive mean pooling and linear classifiers integrate the global information to output disease categories.

As shown in Fig. 4, in the first three stages, the input features were first processed by two Bneck-DSMs to extract shallow features, supporting subsequent advanced feature extraction. Then, a DFMM was designed to capture multi-scale and deep disease features from the frequency-spatial domain. Finally, multi-scale and dual-main features were fused by MDFF-DCA while strengthening the focus on disease features. Next, the specific design of DFMM and MDFF-DCA will be summarized.

The dual-domain features mapping model: DFMM

Frequency features mapping branch based on DWFD

Daubechies8 (Db8) discrete wavelet transform (DWT) provides tight support and orthogonality. Thus, the frequency-domain response in different directions is balanced during signal processing. As a result, in the image decomposition, Db8 DWT can capture the image’s global features and respond well to local features. Based on this mathematical characteristic, a frequency features mapping branch based on Db8 DWT, DWFD, was designed, shown in Fig. 5.

Specifically, the input image $x \in {\mathbb {R}}^{H \times C \times W}$ was processed by a Swish activation function to increase the features’ nonlinearity first. Then, Db8 converted it into one low-frequency component and three high-frequency components, as shown in Eqs. (1–4)

$$\begin{aligned} A[i,j]=\, \sum _k A1[k,j]\cdot h[2i - k] \end{aligned}$$

(1)

$$\begin{aligned} H[i,j]=\, \sum _k A1[k,j]\cdot g[2j - k] \end{aligned}$$

(2)

$$\begin{aligned} V[i,j]=\, \sum _k D1[k,j]\cdot h[2i - k] \end{aligned}$$

(3)

$$\begin{aligned} D[i,j]=\, \sum _k D1[k,j]\cdot g[2j - k] \end{aligned}$$

(4)

where, $A$ is low-frequency (approximation factor); $i$ and $j$ are the image’s row and column index; $A 1$ is the horizontal approximation coefficient obtained by Eq. (5); $H$ is horizontal-oriented high-frequency; $h$ is the high-pass filter coefficient; $g$ is the low-pass filter coefficient; $V$ is vertical-oriented high-frequency; $D 1$ is the horizontal detail coefficient obtained by Eq. (6); $D$ is diagonal-oriented high-frequency.

$$\begin{aligned} A1[i,j]=\, \sum _k I_c[i,k]\cdot h[2j - k] \end{aligned}$$

(5)

$$\begin{aligned} D1[i,j]=\, \sum _k I_c[i,k]\cdot g[2j - k] \end{aligned}$$

(6)

Finally, the low-frequency and high-frequency components were mapped to obtain the low-frequency and high-frequency features using a Bottleneck, defined as Eq. (7).

$$\begin{aligned} F_{low}, F_{high} = f_w(x) = \left( \psi (A), \psi (H, V, D) \right) \end{aligned}$$

(7)

where $F_{low}$, $F_{high}$ are low-frequency and high-frequency features; $f_w (\cdot )$is the feature mapping function;$\psi (\cdot )$is the Bottleneck, defined as Eq. (8).

$$\begin{aligned} \psi (x) = \delta _{1 \times 1} \left( \delta _{3 \times 3} \left( \delta _{1 \times 1}(x) \right) \right) + x \end{aligned}$$

(8)

where $\delta _{ k \times k } \left( \, \cdot \, \right)$ - is a convolution operation at the size of $k \times k$.

Spatial domain features mapping branch based on MPVT

To achieve comprehensive spatial feature mapping, the MPVT module was designed based on multi-scale convolutions and PVT (Pyramid Vision Transformer), as shown in Fig. 6. The input features were first divided into many patches by Patch Embedding to preserve spatial continuity and enhance the correlation between local features. Then, the global and local branches captured coarse and delicate features. The specific procedures were as follows:

Patch embedding. For input images $x \in {\mathbb {R}}^{H \times C \times W}$, it was divided into several patches of size $P\times P$ after PE, as shown in Eq. (9). Each patch was fixed to a vector space of dimension $D$ by a linear transformation as shown in Eq. (10):

$$\begin{aligned} P_{i,j}=\, \textrm{Reshape}\left( I[iS : iS + P, jS : jS + P, :] \right) ,\quad P_{i,j} \in {\mathbb {R}}^{P^2 \cdot c} \end{aligned}$$

(9)

$$\begin{aligned} E_{i,j}=\, P_{i,j} W_e + b_e \end{aligned}$$

(10)

where $P_{i,j}$ is the image’s patch; S is the stride; $E_{i,j}$ is the patch’s embedding vector at position i, j; $W_e\in {\mathbb {R}}^{P^2 C \times D}$is the weight matrix of the linear transformation; $b_e \in {\mathbb {R}}^D$ is the bias.

Local spatial features mapping branch. In this branch, convolution kernels of three scales (two regular convolutions with the size of 1×1 and 5×5, a dilated convolution with the size of 3×3 and a dilation rate of 2) were used to capture features under different receptive fields, respectively. The features of different scales extracted from the three branches were fused using channel splicing. A 1×1 convolution was also used to integrate further and adjust the channels. Then, the CBAM (Convolutional Block Attention Module, CBAM) [30] further strengthened the features. Finally, to ensure that the original features are not lost, the residual connection was used to fuse the input feature map with features extracted by multi-scale convolutions to obtain local features.

For the specific derivation, for the input $E_{(i,j)}$, this paper first defined the operation of multi-scale convolution features mapping as Eq. (11).

$$\begin{aligned} f_{\text {mul}}=\, F_{\text {mul}}(E_{(i,j)}) \nonumber \\=\, \delta _{1 \times 1} \left( \text {LR} \left( \text {BN} \left( \delta _{1 \times 1}(E) \right) \right) \right) \nonumber \\ \quad+ \text {LR} \left( \text {BN} \left( \delta _{3 \times 3, \text {rate}=2}(E) \right) \right) \nonumber \\ \quad + \text {LR} \left( \text {BN} \left( \delta _{5 \times 5}(E) \right) \right) \end{aligned}$$

(11)

where $f_{\text {mul}}$ is the features obtained by multi-scale convolutions; $F_{mul}$ is the operation of multi-scale convolutions; $LR\left( \cdot \right)$ is the LeakyReLU; $BN\left( \cdot \right)$ is BatchNorm; $\delta _{k \times k}(\cdot )$ is a normal convolution with the size of k×k;$\delta _{k \times k, rate=i}(\cdot )$ is a dilated convolution with the size of k×k, dilation rate of i.

Then, the operation of the local spatial feature mapping branch was defined as Eq. (12).

$$\begin{aligned} F_{local} = f_{C} + E \end{aligned}$$

(12)

where $F_{local}$ is local spatial features; $f_{C}$ is the features obtained by CBAM, whose operation was defined as Eq. (13).

$$\begin{aligned} f_{C}=F_{\text {CBAM}}(f_{mul}) = \sigma \left( \textrm{MLP}(\textrm{AvgPool}(f_{mul})) \right) \cdot f_{mul} \cdot \sigma \left( \textrm{MLP}(\textrm{MaxPool}(f_{mul})) \right) \end{aligned}$$

(13)

where $MLP\left( \cdot \right)$ is the multiple layers perception.

Global features mapping branch. For features $E \in {\mathbb {R}}^{N \times D}$, the matrices of Key(K), Query(Q), and Value(V) were first obtained by linear projection according to Eq. (14):

$$\begin{aligned} Q=EW_q, K=EW_k, V=EW_v \end{aligned}$$

(14)

where $W_q$,$W_k$,$W_v$ denote the projection matrices.

Then, attention scores were calculated by spatial reduction attention (SRA) as in Eq. (15), and the attention output was obtained after a linear transformation and residual concatenation, defined as Eq. (16):

$$\begin{aligned} \textrm{SRA}(Q, K, V)=\, \textrm{Softmax} \left( \frac{Q K^T}{\sqrt{D}} \right) V \end{aligned}$$

(15)

$$\begin{aligned} Z=\, \textrm{SRA}(Q, K, V) W_o + E \end{aligned}$$

(16)

where $W_o \in {\mathbb {R}}^{D \times D}$ denotes the output projection matrix.

Finally, global features were obtained by two ordinary convolutional layers, one fractional linear activation function, and one point-by-point convolution, as shown in Eq. (17):

$$\begin{aligned} F_{\text {global}} = \delta _{1 \times 1} \left( \sigma _{1 \times 1} \left( \textrm{GELU}(\delta _{1 \times 1}(Z)) \right) \right) + Z \end{aligned}$$

(17)

where $F_{global}$ is the global feature;$\delta _{1 \times 1}(\cdot)$ is a convolution of size 1×1;$\textrm{GELU}(\cdot )$ is a nonlinear activation function;$\sigma _{1 \times 1}(\cdot )$ is a point-by-point convolution operation of size 1×1.

The multi-scale dual-domain features fusion module: MDFF-DCA

The frequency and spatial domain features represent different attributes and detailed features with different semantic information. Therefore, the MDFF-DCA module was designed to integrate the multi-scale frequency-spatial features, as shown in Fig. 7. It contained a multi-scale features alignment (MFA) block based on three sets of asymmetric convolutions and DFF-DCA based on dynamic cross-attention.

First, the MFA module was designed to map the multi-scale frequency-spatial features to a unified scale, as shown in Fig. 7a. Specifically, three groups of asymmetric convolutions, each with different scales, were applied to process the features individually. Following this, a $1\times 1$ convolution was employed to map these processed features to unified-scale Q, KK, and V matrices, which served as the input for the next stage. As a result, high-frequency features, low-frequency features, global spatial features, and local spatial features were all mapped into four groups of matrices, each at a unified scale.

Next, this paper designed a DFF-DCA module to integrate multi-scale frequency-spatial features, as illustrated in Fig. 7b. MDFF-DCA first computed attention by querying key-value pairs of high-frequency and local spatial features, then performed feature weighting to fuse high-frequency and local spatial features. This fusion aimed to enhance the model’s ability to capture local details (such as spot textures and edges), thereby mitigating the impact of intra-class variability. Similarly, a parallel process was employed to fuse low-frequency and global spatial features, enabling the model to extract global structures (such as spot shapes and distribution patterns) in disease images. This strengthens the model’s capability to distinguish diseases with high inter-class similarity. Finally, the two fused feature representations, F1 and F2, were concatenated via channel stitching to form the final output feature.

Through MDFF-DCA, the DWTFormer can dynamically adjust attention weights based on the different attributes of frequency-spatial features. This mechanism ensures that semantically similar features receive higher attention weights while semantically distinct features are suppressed or ignored. As a result, greater emphasis is placed on semantically similar features during the features fusion, leading to improved fusion of frequency-spatial features.

For the specific derivation, for the input feature $x \in {\mathbb {R}}^{H \times C \times W}$, this paper first defined the operation of multi-scale features alignment (MFA) as Eq.(18):

$$\begin{aligned} Q, K, V = F_{mfa}(x) = \delta _{1 \times 1}(\textrm{Concat}(\delta _{1 \times 7}(x), \delta _{7 \times 1}(x), \delta _{1 \times 11}(x), \delta _{11 \times 1}(x), \delta _{1 \times 21}(x), \delta _{21 \times 1}(x))) \end{aligned}$$

(18)

where Q, K, V with the size of $H\times C\times W$ denote the matrices obtained after MFA; $F_{mfa}(\cdot )$ is the MFA function;$\delta _{1 \times 1}$ is a convolution operation of size 1×1; $Concat(\cdot )$ is the features splicing function.

According to the above definition, the $F_{global}$, $F_{local}$, $F_{high}$, and $F_{low}$ features could be mapped to 4 sets of unified scale matrices: $(Q_{high},K_{high},V_{high})$, $(Q_{low},K_{low},V_{low})$, $(Q_{global},K_{global},V_{global})$ and $(Q_{local},K_{local},V_{local})$, by Eq. (19) to Eq.(22).

$$\begin{aligned} Q_{high},K_{high},V_{high}=\, F_{mfa}(F_{high}) \end{aligned}$$

(19)

$$\begin{aligned} Q_{low},K_{low},V_{low}=\, F_{mfa}(F_{low}) \end{aligned}$$

(20)

$$\begin{aligned} Q_{global},K_{global},V_{global}=\, F_{mfa}(F_{global}) \end{aligned}$$

(21)

$$\begin{aligned} Q_{local},K_{local},V_{local}=\, F_{mfa}(F_{local}) \end{aligned}$$

(22)

Then, based on the analysis of the tomato disease features in Fig. 3, high-frequency and local spatial features were integrated by DFF-DCA to capture more detailed features, such as the texture and edges of the lesions, thereby mitigating the impact of intra-class variation. Simultaneously, low-frequency and global spatial features were fused to extract global structures, such as leaf shape and disease distribution patterns, effectively reducing the impact of inter-class similarity. Therefore, according to the matrices: $(Q_{high},K_{high},V_{high})$, $(Q_{low},K_{low},V_{low})$, $(Q_{global},K_{global},V_{global})$ and $(Q_{local},K_{local},V_{local})$, the fusion of the high-frequency features and local spatial features, low-frequency features and global spatial features were shown in Eqs. (23) and (24).

$$\begin{aligned} F_1=\, \textrm{Concat} \left( \delta _{1 \times 1} (\textrm{Atten}(Q_{high}, K_{local}, V_{local})), \delta _{1 \times 1} (\textrm{Atten}(Q_{local}, K_{high}, V_{high})) \right) \end{aligned}$$

(23)

$$\begin{aligned} F_2=\, \textrm{Concat} \left( \delta _{1 \times 1} (\textrm{Atten}(Q_{low}, K_{global}, V_{global})), \delta _{1 \times 1} (\textrm{Atten}(Q_{global}, K_{low}, V_{low})) \right) \end{aligned}$$

(24)

where $F_1$ is the fused features of high-frequency and local spatial features;$F_2$ is the fused features of low-frequency and global spatial features;Atten(Q, K, V)is the operation of dynamic cross-attention, defined as Eq. (25); T is the transpose operation; d is the scaling factor.

$$\begin{aligned} \text {Atten}(Q, K, V) = \text {Softmax}\left( \frac{QK^\top }{\sqrt{d}}\right) V \end{aligned}$$

(25)

Finally, the operation of the MDFF-DCA module was defined as Eq. (26).

$$\begin{aligned} F_{\text {out}} = f_{\text {MDFF}\_\text {DCA}}(F_s, F_f) = \text {Concat}(F_1, F_2) \end{aligned}$$

(26)

where $f_{\text {MDFF}\_\text {DCA}}(\cdot )$ is the operation of MDFF-DCA function; $F_s$ and $F_f$ are the spatial and frequency domain features.

Experimental setting and evaluation indices

The experimental software environment was set to TensorFlow 2.4 and Python 3.8. The server configuration was as follows: Intel Xeon E5-2690 CPU @ 2.90GHz; NVIDIA GeForce RTX 3080Ti; CUDA 11.4.0 and cuDNN 8.4.0. The Adam algorithm and linear decay were used to optimize the network weights, with the learning rate set at 0.0005, the weight decay set at 0.05, and the training epoch set at 100.

All datasets were divided into training, validation, and testing sets in a ratio of 8:1:1. To evaluate the DWTFormer’s performance in tomato disease identification, Precision, Recall, Accuracy, and F1-score, defined as Eq. (27) to Eq. (30), were used as evaluation metrics.

$$\begin{aligned} \text {Precision}=\, \frac{TP}{TP+FP} \end{aligned}$$

(27)

$$\begin{aligned} \text {Recall}=\, \frac{TP}{TP+FN} \end{aligned}$$

(28)

$$\begin{aligned} \text {Accuracy}=\, \frac{TP+TN}{TP+FN+FP+TN} \end{aligned}$$

(29)

$$\begin{aligned} \text {F1-score}=\, \frac{2TP}{2TP+FN+FP} \end{aligned}$$

(30)

where TP, TN, FP, and FN are true positives, true negatives, false positives, and false negatives.

Results and discussion

Identification results on the tomato leaf disease dataset

Based on DWTFormer’s structure, MobileNetV3, Swin Transformer [31], Vit [32], PVTv2 [33], and Mixer [34] were selected to be tested on the tomato leaf disease dataset. All models were retrained on the tomato leaf disease dataset without using pre-trained weight. The results are shown in Table 1.

Table 1 Comparison of different models on the tomato leaf disease dataset

Full size table

Table 1 demonstrates that DWTFormer achieved the highest identification accuracy, reaching 99.28%. Although DWTFormer had 33.69M parameters and 7.96G FLOPs, slightly higher than MobileNetV3-large. However, MobileNetV3, based on the Bneck structure and depthwise separable convolutions, is more adept at capturing local features but neglects effective global feature modeling. This limitation hampers its ability to handle diseases with high inter-class similarity, resulting in an identification accuracy 6.33% lower than DWTFormer. Although Swin Transformer-Base and ViT-Base rely on multi-head self-attention mechanisms, which excel at capturing global features, they struggle to express local features effectively. Consequently, they face constraints when dealing with diseases exhibiting significant intra-class variability, achieving identification accuracies of only 85.03% and 87.69%, respectively. Mixer performed the worst among these models, likely due to its simplistic structure, which inadequately captures global and local features.

Unlike MobileNetV3, Swin Transformer-Base, ViT-Base, and Mixer, DWTFormer effectively captures multi-scale disease features from frequency-spatial domains, including high-frequency and low-frequency components, as well as both global and local spatial features by the DFMM module. Additionally, MDFF-DCA integrates high-frequency features with local spatial features, enhancing the model’s ability to identify diseases with high intra-class variability. Similarly, MDFF-DCA combines low-frequency features with global spatial features to further enhance the model’s capacity to identify diseases with high inter-class similarity. Finally, these fused features are fused again to extract multi-scale disease features.

Although PVTv2 integrates pyramid and Transformer structures to capture multi-scale features, it extracts features from spatial domain information and lacks effective modeling of frequency-domain information, leading to an identification accuracy of only 86.56%. In contrast, DWTFormer extracts rich frequency-domain information in disease images, revealing texture details and periodic features. This facilitates the capture of high-frequency features, such as fine spot variations. Thus, a more detailed representation of disease was provided to address the challenges of intra-class variability and inter-class similarity. In addition, spatial information represents the structural information of disease features, including the shape and size of disease spots. DWTFormer effectively fuses these two types of features through MDFF-DCA, enabling more efficient extraction of multi-scale disease features and mitigating the impact of intra-class variability and inter-class similarity on identification accuracy.

In summary, DWTFormer employs a unique frequency-spatial feature fusion mechanism to extract multi-scale disease features, effectively mitigating the impact of intra-class variability and inter-class similarity on identification accuracy.

Generalizability experiments

Generalizability experiments on the self-built datasets

Experiments on the tomato leaf disease datasets indicated that DWTFormer reached a 99.28% identification accuracy, effectively decreasing the impact of inter-class similarity and intra-class variability. However, inter-class similarity and intra-class variability also exist in other leaf diseases. As a result, the generalization ability of DWTFormer plays an important role. To validate the generalizability of DWTFormer, experiments were conducted on the self-built dataset. The mainstream models mentioned above were chosen for comparison. All models were trained on the tomato leaf disease dataset and fine-tuned on the self-built dataset. Their accuracy, precision, recall, and F1-score are shown in Table 2.

Table 2 Comparison of different models on the self-built dataset

Full size table

Note: The best results are highlighted in bold.

Table 2 shows that although the self-built dataset has more complex feature distributions, such as smaller disease spots of strawberries and complex backgrounds, DWTFormer also achieved the highest accuracy of 97.63%. Its precision (98.01%), recall (97.22%), and F1-score (96.84%) were also optimal. The above findings demonstrate DWTFormer’s generalizability in identifying other crop leaf diseases with inter-class similarity and intra-class variability. This is because DFMM and MDFF-DCA enhance the model’s ability to extract multi-scale features and focus more on the disease spots, thus achieving high generalizability on the self-built dataset containing a complex distribution of disease features.

Generalizability experiments on cross-domain datasets

The results on the self-built datasets demonstrated that DWTFormer exhibited good generalizability in disease identification across multiple crops and images with complex backgrounds. However, the cross-domain generalizability of the model is also critical in application. Therefore, this paper further evaluated DWTFormer’s generalizability on three cross-domain datasets: Caltech UCSD bird (CUB 200) [35], Stanford Dogs [36], and CIFAR10 [37]. The identification accuracies of different models on these datasets are shown in Fig. 8.

Overall, all models achieved the best identification accuracy on CIFAR10, second best on the Stanford Dogs dataset, and worst performance on the CUB 200 dataset. This is due to differences in the difficulty of the datasets themselves. Specifically, the CIFAR10 dataset consists mainly of low-resolution images with significant distinguishing features between different categories and a relatively homogeneous background, allowing models to learn global and local features more efficiently. The Stanford Dogs dataset, on the other hand, has high similarity between some of the dog breeds, as well as high intra-class variability (e.g., diversity in posture, coat color, and background), which increases the difficulty of identification. In contrast, the CUB 200 has the most complex challenge. It does not only contain complex backgrounds and noise interference, but its inter-class similarity is high, with many birds having highly similar appearance characteristics with only minor differences in color, feather pattern, or body size. In addition, birds are unstructured targets with extremely variable morphology, scale, and posture, making it difficult for models to accurately focus on key features, thus reducing identification accuracy.

Analyzing the identification results across individual datasets, DWTFormer demonstrated the best performance on CIFAR10, achieving an accuracy of 96.22%, significantly surpassing all other models. MobileNetV3_large followed with 91.27%, but there remained a substantial gap compared to DWTFormer, while all other models fell below the 90.00% threshold. On the Stanford Dogs dataset, although DWTFormer achieved an accuracy of 89.91%, slightly lower than its performance on CIFAR10, it still maintained a notable advantage over other models. In particular, DWTFormer outperformed MobileNetV3_large (78.31%), PVTv2-B/3 (84.90%), and Swin Transformer-Base (76.11%) by a significant margin, highlighting its superior ability to handle fine-grained classification tasks with high intra-class variation. In the CUB 200 dataset, despite the challenges posed by complex backgrounds, noise interference, high inter-class similarity, and unstructured object variations, DWTFormer achieved the highest identification accuracy of 86.67%. Compared to MobileNetV3_large (77.56%), PVTv2-B/3 (79.10%), Swin Transformer-Base (74.86%), ViT-Base (75.20%), and Mixer-B/16 (72.77%), DWTFormer surpassed them by 9.11%, 7.57%, 11.81%, 11.47%, and 13.90%, respectively. This considerable performance improvement underscores DWTFormer’s superior feature extraction capabilities, allowing it to effectively capture fine-grained details and adapt to highly complex classification tasks.

The remarkable performance improvement of DWTFormer can be primarily attributed to its two-branch feature mapping mechanism, which effectively integrates frequency domain information from images. This approach enhances the model’s ability to capture high-frequency features, such as subtle feature variations and intricate texture details. Additionally, DWTFormer employs a dynamic cross-attention mechanism, which significantly improves classification accuracy by fusing low-frequency features with global features and high-frequency features with local features. This fusion enables the model to efficiently learn global dependencies while simultaneously strengthening its ability to extract fine-grained details. In contrast, MobileNetV3_large primarily relies on depthwise separable convolutions, focusing more on local feature extraction. However, this structure struggles to capture global dependencies and multi-scale features, leading to suboptimal performance in the CUB 200 dataset, which presents challenges such as complex backgrounds and fine-grained feature variations. Compared to MobileNetV3_large, Transformer-based models (Swin Transformer-Base and ViT-Base) leverage multi-head self-attention mechanisms to capture global features, but they lack effective fine-grained modeling of local details, which limits their performance on CUB 200. Although PVTv2 incorporates a spatial pyramid structure along with Transformer architecture, enhancing its ability to model multi-scale features compared to MobileNetV3_large, Swin Transformer-Base, and ViT-Base, it still relies solely on spatial information for feature extraction. This limitation prevents it from fully leveraging frequency domain information for more comprehensive modeling, resulting in an identification accuracy of only 79.10%, failing to surpass DWTFormer.

Comparison with other state-of-the-art models

Generalizability experiments demonstrated that DWTFormer exhibited high generalization in other crop leaf diseases and cross-domain datasets. To further validate its effectiveness, the performance of DWTFormer and some of the latest crop leaf disease identification models were tested on open-source datasets. The results are shown in Table 3.

Table 3 Comparison of other state-of-the-art models on the open-source datasets

Full size table

Table 3 reveals that DWTFormer consistently achieved the highest identification accuracy (96.18% and 99.89%) among the compared models, followed by ConvViT with a remarkable 99.84% accuracy on the PlantVillage dataset. However, it only achieved an identification accuracy of 86.83% in the AI Challenger 2018 dataset, slightly lower than that of DWTFormer. These findings indicate that although ConvViT combines convolution and Transformer structures to learn global dependencies and local details, relying solely on spatial domain features is insufficient to achieve ideal results when diseases of different stages exhibit significant intra-class variability. Compared to ConvViT, DWTFormer leverages the DWFD module to effectively extract frequency-domain information from disease images, revealing texture details and periodic patterns. This enhancement enables the model to capture high-frequency features, including subtle lesion variations and intricate texture details, which compensate for the limitations of using spatial domain information alone in expressing disease features and further enrich the representation of diseases. Furthermore, DWTFormer innovatively combines high-frequency features with local spatial features, enriching the model’s representation of local details such as lesion texture and edges, thus mitigating the effects of intra-class variability on identification accuracy. Additionally, by integrating low-frequency features with global spatial features, DWTFormer strengthens the model’s ability to capture disease distribution patterns and the leaf’s overall morphology, ultimately improving its capacity to differentiate diseases with high inter-class similarity. Finally, the global and local disease features, which include both frequency-domain and spatial-domain information, are fused again, further enhancing the model’s ability to capture multi-scale features. The above analysis further demonstrates that DWTFormer effectively reduces the impact of intra-class variability and inter-class similarity on disease identification accuracy by integrating multi-scale features from frequency-spatial domains.

Ablation experiments

Ablation experiments of each module

To verify the effectiveness of DWFD, MPVT, MFA, and DFF-DCA, ablation experiments were conducted on the tomato leaf disease dataset using the control variables method. The results are shown in Table 4. Firstly, the results of group 1 and group 4 demonstrate that introducing DFMM improves the identification accuracy by about 3.57%. This is because DFMM captures and learns multi-scale disease features from the frequency-spatial dual-domain, reducing the effects of inter-class similarity and intra-class variability on identification accuracy. The comparison of Group 4 and Group 6 shows that the introduction of MDFF-DCA improves the identification accuracy by about 2.36%. It suggests that MDFF-DCA can effectively integrate frequency-spatial features and enhance the model’s focus on the diseased spots.

Table 4 Results of ablation experiments of each module

Full size table

Then, the results of group 1 and group 2 indicate that DWFD improves accuracy by 2.34%. It was noticed that FLOPs were significantly reduced. This is because the DWT transforms disease features into sparse representations, reducing the model’s computation and storage requirements. A comparison of Group 1 and Group 3 shows that MPVT can improve accuracy by about 3.23%. These results show that frequency-spatial feature fusion can help DWTFormer obtain more comprehensive disease features, achieving higher identification accuracy. It also demonstrates the advantage of using frequency-domain features to supplement spatial features in identifying tomato leaf diseases.

Finally, the comparison of group 5 and group 6 shows that DFF-DCA improved the accuracy by 1.42%. The results of group 4 and group 5 show that MFA improved accuracy by 0.94%. This is because MFA effectively bridged the semantic gap between frequency and spatial domain features. Moreover, DFF-DCA assigns greater attention weights to intra-class features with similar semantics while ignoring inter-class features with different semantics by adjusting attention weights according to the feature attributes.

Ablation experiments of different Bneck-DSM numbers and DSM

To verify the effect of different Bneck-DSM numbers and Dynamic Shifted Max (DSM), experiments were conducted to compare the performance of DWTFormer variants with different Bneck-DSM numbers and activation functions. The dataset used was the tomato leaf disease dataset. Except for the settings of Bneck-DSM and activation function, the settings of other modules in each stage remained constant. The results are shown in Table 5.

Table 5 Results of different Bnecks and activation functions

Full size table

Table 5 shows that when the Bneck-DSM numbers of each stage were set to (2,2,3,6,2), the different activation functions had minimal impact on model performance. However, as the number of Bneck-DSM was reduced, the Dynamic Shifted Max (DSM) activation function significantly mitigated performance degradation. For example, when the Bneck-DSM number was set to (2,2,3,2,1), DWTFormer with DSM achieved an accuracy of 99.28%, only 0.34% lower than the numbers of (2,2,3,6,2). Meanwhile, the model’s parameters and FLOPs were decreased by 11.57M and 8.79G, respectively. In contrast, when the Bneck-DSM number was reduced from (2,2,3,6,2) to (2,2,3,2,1), the accuracy of DWTFormer using ReLU and LeakyReLU decreased by 2.43% and 2.87%, respectively. These findings indicate that DWTFormer, with the Bneck-DSM configuration set to (2,2,3,2,1), achieves a significant balance between computational cost and performance.

Ablation experiments of different feature fusion times

To verify the effect of feature fusion times, the performance of DWTFormer with different DWFD and MDFF-DCA numbers was compared. The feature fusion times are equal to the numbers of DWFD and MDFF-DCA. Except for the numbers of DWFD and MDFF-DCA in the model, the settings of other modules were kept constant. The results are shown in Table 6.

Table 6 shows that the model’s identification accuracy increased with fusion times. Compared with SF1, the accuracy of SF3 was improved by 2.40%. The accuracy of SF4 reached up to 99.63%. However, increasing the fusion times to improve accuracy became increasingly expensive in terms of the parameters and FLOPs. SF4 required an additional 8.63M parameters and 2.63G FLOPs compared to SF3. This is because the Transformer structure referenced by the MDFF-DCA has large parameters and FLOPs. Thus, the DWTFormer’s parameters and FLOPs gradually increase as fusion layers increase. These results show that SF3 maintains a good balance between the model performance and resource occupation, which makes it more suitable for tomato leaf disease identification.

Table 6 The performance of DWFD and MDFF-DCA numbers with different numbers

Full size table

Ablation experiments of channel ratio of frequency and spatial domain features

To verify the effect of channel ratio between frequency-domain and spatial domain features on the model’s performance, DWTFormer with varying channel ratios was compared on the tomato leaf disease dataset. The results are shown in Table 7.

Table 7 DWTFormer’s performance under different channel ratios of frequency-spatial domain features

Full size table

Table 7 shows that when the ratio was set as 1, DWTFormer achieved the highest accuracy and F1-score. The accuracy was slightly improved compared with ratio=0.5 and ratio=2.0. When the channels of the frequency-domain feature were small (ratio=0.5), insufficient mining of frequency information decreased DWTFormer’s ability to capture detailed features. On the contrary, when the channel number of the frequency-domain feature was too large (ratio=2.0), the frequency-domain information masked the importance of the spatial features, which led to the difficulty of extracting enough spatial information. The above analysis demonstrates the effectiveness of adjusting the frequency-spatial feature channel numbers to be consistent to improve the disease identification accuracy of DWTFormer.

Visualizations

Visualization of tomato leaf disease with inter-class similarity

To further validate the effectiveness of DWTFormer in reducing the impact of inter-class similarity on identification accuracy, Fig. 9 uses CAM to visualize the focus of different models on tomato leaf diseases exhibiting inter-class similarity. The red regions are considered critical regions from which the model extracted comprehensive features. It indicates that the DWTFormer’s interest regions on tomato leaves are closest to the interest regions of the human eyes. In contrast, the focus areas of the other models are more dispersed, and they contain more redundant or irrelevant features. For example, the RGB disease images show that BS and SLS have remarkably similar small brown spots, but their leaf color differs. As shown in Fig. 9, DWTFormer accurately captures this color difference, better distinguishing between the two diseases. The results suggest that DWTFormer eliminates irrelevant features and decreases the impact of inter-class similarity.

Visualization of tomato leaf disease with intra-class variability

To further verify the effectiveness of DWTFormer in reducing the effect of intra-class variability, Fig. 10 visualizes the identification results of DWTFormer for two representative diseases characterized by intra-class variability: powdery mildew and early blight. For powdery mildew, the intra-class variability mainly manifests in the density and distribution range of the powdery mildew layer, varying according to the temperature, humidity, and plant density. In the case of early blight, the intra-class variability is observed in the size of spots, color depth, and the clarity of concentric circles, which are influenced by the disease severity and environments.

Fig. 10 shows that for powdery mildew, red areas of the model mainly focused on the location of the white powder. The different CAM colors reflect that DWTFormer effectively focuses on the thickness and distribution of the white powder layer. For early blight, the red areas of the model mainly focus on the central part of the spots (especially the areas of the concentric circle) as well as the distribution of leaf spots, which further verifies that the model can accurately capture the key areas of differential characteristics within the early blight. The above results demonstrate that DWTFormer has an excellent ability to process intra-class variability and accurately locate the key characteristic areas of the disease under different conditions, thus providing adequate support for practical applications.

Visualization of stages of frequency-spatial feature fusion

To further validate the effect of frequency-domain features in enhancing the model’s ability to extract multi-scale disease features, Fig. 11 uses grey-scale mapping to visualize the output of Stage 1 to Stage 3, which are the stages of frequency-spatial feature fusion. It compares the critical features learned by the model before and after adding frequency domain features. The white areas are regions with high activation in the feature map. The more concentrated the white area is, the stronger the model extracts information from the region.

Fig. 11 demonstrates that relying solely on spatial features, the model primarily captures the basic edges of the leaf, with insufficient clarity in extracting lesion textures and details. Additionally, the overall feature map exhibits weak contrast, lacking a clear representation of the relationship between lesions and the overall leaf structure. For instance, the weak contrast in the feature map of Stage 2 suggests insufficient integration of global features. The scattered feature distribution in Stage 3 suggests a significant loss of details. In contrast, incorporating frequency-domain features significantly reduces the activation values in the background. It makes the model concentrate more on disease spots and leaf regions, with marked improvements in the edge sharpness and texture detail representation. Moreover, the feature distribution becomes more focused, and the disease distribution patterns and overall leaf morphology are expressed more clearly. These findings demonstrate that incorporating frequency-domain features enhances the model’s ability to capture high-frequency information, such as edges and textures, while complementing the focus of spatial features on low-frequency information, thus improving the model’s capacity for multi-scale feature extraction. Furthermore, frequency-domain features effectively capture specific textures and edge structures of disease spots, enhancing the saliency of target regions.

Real-time identification of tomato leaf disease in the filed

To identify tomato leaf disease in real-time in fields, this paper developed an Android-based application for tomato leaf disease identification named TomatoAPP, as illustrated in Fig. 12. First, the DWTFormer, trained in TensorFlow, was converted to a model in the “.tflite” format using TensorFlow Lite [47]. Since the weight data of CNNs were stored as 32-bit floating-point numbers, an 8-bit quantization technique was applied to reduce the bit-width, successfully compressing DWTFormer to one-quarter of its original size. Subsequently, the converted model (DWTFormer.tflite) was deployed in TomaotoAPP for the real-time identification of tomato leaf disease. TomatoAPP was developed using Android Studio. Its backend functionalities were implemented in Java, covering tasks such as image capture, uploading, identification, and pesticide recommendation. The user interface was designed using XML, including user login, image collection, and display of pesticide recommendations. Data was stored in an SQLite database. Finally, TomatoAPP.apk, built by Android Studio, was deployed on a Xiaomi 10 smartphone to identify tomato leaf disease in real-time in the plant field. Fig. 13 illustrates the user interface of TomatoAPP. Fig. 13a presents the function module for capturing tomato disease images, allowing users to upload images in real time either from their gallery or by taking photos. Fig. 13b displays the function module of identification results, including the predicted disease class, confidence levels, inference time, and pesticide recommendations.

To verify the applicability of TomatoAPP in the real field, its identification accuracy, valid response rate, and inference time were tested under varying lighting and weather conditions, respectively. In-field tests were conducted from February 3 to February 7, 2025, during three distinct periods: 6:00–8:00 (low light), 12:00–14:00 (strong light), and 17:00–18:30 (soft light), with 200 disease images collected in each period. Based on the weather forecast, the first two days of testing were sunny, and the last three days were overcast, so these two weather conditions were selected to be tested. Sunny and overcast are common and representative of typical weather conditions, offering a meaningful evaluation of DWTFormer’s performance under various light conditions. The testing results are shown in Table 8.

Table 8 Identification results of TomatoAPP under different weather and lighting conditions

Full size table

Table 8 highlights the variation in TomatoAPP’s performance under different weather and lighting conditions. Overall, DWTFormer achieved an average identification accuracy of 97.22% under overcast and 94.58% under sunny. Specifically, on sunny days, DWTFormer attained its highest identification accuracy of 97.75% at 17:00–18:30, followed by 94.25% at 6:00–8:00. However, its identification accuracy at 12:00–14:00 was dropped to 91.75%. This decline in performance is attributed to the intense lighting and high contrast during 12:00–14:00. Although disease images captured under strong lighting (12:00–14:00) are sharper, the intense light and contrast increase the risk of overexposure and detail loss in the disease image, leading to misidentification. In contrast, under overcast conditions, the model’s performance remained more consistent across all time periods. This stability is primarily due to the diffused lighting provided by overcast weather, which minimizes glare and shadow interference. The absence of extreme brightness or deep shadows allows for clearer and more balanced image capture, thereby enhancing DWTFormer’s identification accuracy in real-world conditions.

The above results show that TomatoAPP achieves an identification accuracy exceeding 94.00% for tomato leaf diseases in field conditions under both overcast and sunny weather, with an average inference time of less than 45 milliseconds. These findings confirm that DWTFormer can be effectively embedded into mobile applications for high-precision, real-time identification of tomato leaf diseases in actual agricultural environments, underscoring its applicability. However, various lighting significantly impacts DWTFormer’s identification performance, indicating that its stability under complex lighting scenarios still requires to be further optimized. Therefore, future research will incorporate image enhancement algorithms to improve the model’s adaptability to diverse lighting conditions in real environments. Additionally, due to time constraints and weather variations, the paper evaluated DWTFormer’s performance under only two representative weather conditions: overcast and sunny. Future work will expand testing to include additional weather conditions, such as cloudy and foggy days, to further assess the model’s robustness and reliability in real environments.

Discussion

This paper proposed a frequency-spatial feature fusion network, DWTFormer, for tomato leaf disease identification. Testing results showed that DWTFormer achieved fast and accurate identification of tomato leaf diseases, outperforming most existing models. Next, we will discuss the following two aspects: (1) The advantages and limitations of DWTFormer compared to other studies aimed at addressing inter-class similarity and intra-class variability. (2) The limitations of this study and directions for future work.

Comparison of other latest studies addressing inter-class similarity and intra-class variability

Zhao et al. [9] proposed a convolutional attention module combined with residual structures to obtain an average identification accuracy of 96.81% on the tomato leaf disease dataset. However, its ability to identify diseases with inter-class similarity and intra-class variability is weak. Chen et al. [48] introduced the channel attention into ResNet50 and used a two-channel filter to extract main features, achieving 89% identification accuracy in tomato leaf diseases. In comparison, the DWTFormer’s identification accuracy is improved by 10.28%. Astani et al. [49] proposed an integrated classifier for tomato disease identification, achieving a 95.98% identification accuracy slightly lower than DWTFormer’s. Li et al. [50] proposed a transformer module based on spatial convolutional self-attention, achieving a 99.10% accuracy in identifying strawberry disease in a natural scene. However, this work mainly focuses on addressing the impact of complex backgrounds on disease identification accuracy, ignoring the inter-class similarity and intra-class variability of strawberry leaf diseases. The above methods reported good experimental results. However, the performance in practice could be more satisfactory. They usually learn features from the spatial domain information of disease images, which limits the model’s ability to capture multi-scale features. Frequency-spatial feature fusion well compensates for the limitation of single spatial domain features. Thus, DWTFormer outperforms in reducing the impact of intra-class variability and inter-class similarity on the identification accuracy of tomato leaf diseases. It also has higher generalizability in identifying other diseases with intra-class variability and inter-class similarity.

This paper enhanced the disease identification task by fusing rich frequency-spatial features. The identification networks based on frequency-spatial feature fusion are expected to provide a scalable model reference for fast and accurate disease identification.

Limitations and direction for future works

1)
Enhancing the model’s adaptability in field environments. In-field experiment results show that DWTFormer achieved an identification accuracy of 97.22%, lower than the accuracy obtained on the tomato leaf disease and self-built dataset. The primary reason for this discrepancy is that disease images captured in real-time in real planting environments contain much noise, not only complex backgrounds but also disturbances such as light, leaf shading, and noise. These complex factors greatly increase the identification difficulty.

To further mitigate the interference of complex backgrounds in real environments, we propose to combine UNet and VMamba to design a U-shaped Mamba for efficient background segmentation. The effectiveness of VMamba in semantic segmentation tasks has been demonstrated. Specifically, we will apply the Visual State Space Block (VSS Block) of VMamba to the decoding side of UNet to efficiently decode the complex background in disease images.

To address the complex lighting and noise interference in real-time captured images in real environments, we will design an image enhancement algorithm that integrates Retinex with the self-adaptive BayesShrink wavelet threshold method. This algorithm will serve as a preprocessing step for disease images, aiming to improve the identification accuracy of the DWTFormer. As shown in Fig. 14, we will first convert the diseased image from the RGB color space to the HSV space, which allows for a more effective separation of lighting and color information. For the saturation component S, we will apply the segmented stretch logarithmic method to enhance the contrast in low saturation areas while compressing the contrast in high saturation areas. Next, the Retinex method will be applied to improve the V component to effectively eliminate the effects of lighting variation. For the enhanced V component, we will design a self-adaptive BayesShrink method to mitigate noise. Specifically, after performing wavelet transform, global features and contour details are predominantly concentrated in the low-frequency components, while local details and noise reside in the high-frequency components. Consequently, we will introduce global and local thresholds to separately process the low-frequency and high-frequency components, ensuring optimal preservation of image details while effectively removing noise. Finally, the enhanced disease image will be generated by wavelet reconstruction and inverse transformation from HSV back to RGB.

With the optimizations mentioned above, we hope to improve the immunity and adaptability of DWTFormer in the field environment by using image enhancement methods and background segmentation techniques.

1)
Optimizing the framework of DWTFormer. Although various designs have been adopted for DWTFormer to reduce its parameters and FLOPs effectively, its complexity is still higher than that of lightweight CNN models. Therefore, in future research, we will focus on further compressing the model structure to minimize the parameters and FLOPs as much as possible, thereby improving the model’s efficiency and practicality. Specifically, we will design a lightweight and efficient Bi-Mamba module to replace the PVT structure in the spatial feature mapping branch of DWTFormer to capture long-distance dependencies in spatial information, as illustrated in Fig. 15. Specifically, we will design a symmetric convolutional branch without SSM to complement the Mamba structure, providing parallel processing paths for spatial information. This approach aims to mitigate the risk of local information loss during image identification tasks, which may arise from the order constraints of the Mamba structure. In Bi-Mamba, the SSM branch will capture long-term dependencies, while the convolutional branch will focus on extracting local features. Furthermore, we will incorporate parameter pruning techniques to reduce the model’s parameters and computational complexity by eliminating redundant weights and connections. Through these optimizations, we hope DWTFormer will be more adaptable to mobile devices and resource-constrained environments and achieve higher efficiency while maintaining high performance.

Conclusion

This paper proposes a tomato leaf disease model based on frequency-spatial feature fusion named DWTFormer. DWTFormer fully leverages the frequency and spatial dual-domain information in disease images. To address the significant intra-class variability in tomato leaf diseases, we introduce a cross-fusion mechanism of high-frequency and local spatial features, enhancing the model’s ability to capture fine-grained features such as lesion textures. To tackle inter-class similarity, we propose the cross-fusion of low-frequency and global spatial features, improving the model’s ability to capture coarse-grained features such as lesion distribution patterns. Finally, the model’s effectiveness in tomato leaf disease identification is significantly enhanced by fully integrating coarse-grained and fine-grained features. Experimental results demonstrate that DWTFormer achieved accuracy accuracies of 99.28% and 97.63% on the tomato leaf disease and self-built datasets, respectively. In the AI Challenger 2018 and PlantVillage datasets, it achieved accuracies of 96.18% and 99.89%, respectively. In-field experiments show that DWTFormer can be embedded into a mobile app. The TomatoAPP achieved an average identification accuracy of 97.22%, with an average inference time of 0.028 seconds in real planting environments.

This paper demonstrates that the frequency-spatial feature fusion effectively mitigates the impact of intra-class variability and inter-class similarity on the identification accuracy of tomato leaf diseases. Although the frequency and spatial domain information describe different attributes and detailed features of disease images with different semantic information, in practice, these two features can reasonably be fused to improve the effectiveness of crop disease identification in natural environments. The model proposed in this paper is a more generalized form of fused frequency-spatial features, providing new ideas for crop disease identification on embedded devices. Furthermore, it provides a basis for applying frequency-spatial feature fusion to more advanced tasks, such as detection and segmentation.

Data availability

All data generated or analysed during this study are included in this published article and its supplementary information files.

Abbreviations

DWTFormer:: A frequency-spatial features fusion model for tomato leaf disease identification
DFMM:: A dual-domain features mapping model
MDFF-DCA:: A multi-scale, dual-domain features fusion model based on dynamic cross-attention
DSM:: Dynamic Shift Max activation function
DWFD:: Frequency-domain features mapping branch
Db8:: Daubechies8
DWT:: Discrete wavelet transform
PVT:: Pyramid Vision Transformer
CBAM:: Convolution block attention module
MFA:: Multi-scale and dual-domain features align.

References

Abouelmagd LM, Shams MY, Marie HS, Hassanien AE. An optimized capsule neural networks for tomato leaf disease classification. EURASIP JIVP. 2024;2024(1):2.
Google Scholar
Liu J, Wang X. Plant diseases and pests detection based on deep learning: a review. Plant Methods. 2021;17:1–18.
Article Google Scholar
Bhatia A, Chug A, Singh AP. Hybrid SVM-LR classifier for powdery mildew disease prediction in tomato plant. In: 2020 7th International conference on signal processing and integrated networks (SPIN). IEEE; 2020;218–223.
Narla VL, Suresh G. Multiple feature-based tomato plant leaf disease classification using SVM classifier. In: Machine Learning, Image Processing, Network Security and Data Sciences: Select Proceedings of 3rd International Conference on MIND 2021. Springer; 2023;443–455.
Javidan SM, Banakar A, Vakilian KA, Ampatzidis Y. Tomato leaf diseases classification using image processing and weighted ensemble learning. Agron J. 2024;116(3):1029–49.
Article CAS Google Scholar
Mohanty R, Wankhede P, Singh D, Vakhare P. Tomato plant leaves disease detection using machine learning. In: 2022 International Conference on Applied Artificial Intelligence and Computing (ICAAIC). IEEE; 2022; 544–549.
Sahu SK, Pandey M. An optimal hybrid multiclass SVM for plant leaf disease detection using spatial Fuzzy C-Means model. Expert Syst Appl. 2023;214:118989.
Article Google Scholar
Nawaz M, Nazir T, Javed A, Masood M, Rashid J, Kim J, et al. A robust deep learning approach for tomato plant leaf disease localization and classification. Sci Rep. 2022;12(1):18568.
Article PubMed PubMed Central CAS Google Scholar
Zhao S, Peng Y, Liu J, Wu S. Tomato leaf disease diagnosis based on improved convolution neural network by attention module. Agriculture. 2021;11(7):651. https://doiorg.publicaciones.saludcastillayleon.es/10.3390/agriculture11070651.
Article Google Scholar
Syed-Ab-Rahman SF, Hesamian MH, Prasad M. Citrus disease detection and classification using end-to-end anchor-based deep learning model. Appl Intell. 2022;52(1):927–38. https://doiorg.publicaciones.saludcastillayleon.es/10.1007/s10489-021-02452-w.
Article Google Scholar
Yang R, Liao T, Zhao P, Zhou W, He M, Li L. Identification of citrus diseases based on AMSR and MF-RANet. Plant Methods. 2022;18(1):113.
Article PubMed PubMed Central Google Scholar
Zhang X, Xun Y, Chen Y. Automated identification of citrus diseases in orchards using deep learning. Biosyst Eng. 2022;223:249–58.
Article CAS Google Scholar
Li Z, Sun J, Shen Y, Yang Y, Wang X, Wang X, et al. Deep migration learning-based recognition of diseases and insect pests in Yunnan tea under complex environments. Plant Methods. 2024;20(1):101.
Article PubMed PubMed Central Google Scholar
Xu Y, Mao Y, Li H, Sun L, Wang S, Li X, et al. A deep learning model for rapid classification of tea coal disease. Plant Methods. 2023;19(1):98.
Article PubMed PubMed Central CAS Google Scholar
Zheng J, Li K, Wu W, Ruan H. RepDI: a light-weight CPU network for apple leaf disease identification. Comput Electron Agric. 2023;212:108122.
Article Google Scholar
Liu B, Huang X, Sun L, Wei X, Ji Z, Zhang H. MCDCNet: multi-scale constrained deformable convolution network for apple leaf disease detection. Comput Electron Agric. 2024. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.compag.2024.109028.
Article Google Scholar
Zhao S, Liu J, Wu S. Multiple disease detection method for greenhouse-cultivated strawberry based on multiscale feature fusion Faster R\_CNN. Comput Electron Agric. 2022;199:107176.
Article Google Scholar
Chen S, Liao Y, Lin F, Huang B. An improved lightweight YOLOv5 algorithm for detecting strawberry diseases. IEEE Access. 2023;11:54080–92.
Article Google Scholar
Zhang Y, Huang S, Zhou G, Hu Y, Li L. Identification of tomato leaf diseases based on multi-channel automatic orientation recurrent attention network. Comput Electron Agric. 2023;205:107605https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.compag.2022.107605.
Article Google Scholar
Sanida T, Sideris A, Sanida MV, Dasygenis M. Tomato leaf disease identification via two-stage transfer learning approach. Smart Agric Technol. 2023;5:100275.
Article Google Scholar
Li H, Shi H, Du A, Mao Y, Fan K, Wang Y, et al. Symptom recognition of disease and insect damage based on Mask R-CNN, wavelet transform, and F-RNet. Front Plant Sci. 2022;13:922797.
Article PubMed PubMed Central Google Scholar
Zhang F, Jin X, Lin G, Jiang J, Wang M, An S, et al. Hybrid attention network for citrus disease identification. Comput Electron Agric. 2024. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.compag.2024.108907.
Article Google Scholar
Li H, Huang L, Ruan C, Huang W, Wang C, Zhao J. A dual-branch neural network for crop disease recognition by integrating frequency domain and spatial domain information. Comput Electron Agric. 2024;219:108843.
Article Google Scholar
Gao Y, Cao Z, Cai W, Gong G, Zhou G, Li L. Apple leaf disease identification in complex background based on BAM-net. Agronomy. 2023;13(5):1240. https://doiorg.publicaciones.saludcastillayleon.es/10.3390/agronomy13051240.
Article Google Scholar
Karki S, Basak JK, Tamrakar N, Deb NC, Paudel B, Kook JH, et al. Strawberry disease detection using transfer learning of deep convolutional neural networks. Sci Hortic. 2024. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.scienta.2024.113241.
Article Google Scholar
Aghamohammadesmaeilketabforoosh K, Nikan S, Antonini G, Pearce JM. Optimizing strawberry disease and quality detection with vision transformers and attention-based convolutional neural networks. Foods. 2024;13(12):1869.
Article PubMed PubMed Central Google Scholar
Hughes D, Salathé M, et al. An open access repository of images on plant health to enable the development of mobile disease diagnostics. arXiv preprint arXiv:1511.08060. 2015;https://doiorg.publicaciones.saludcastillayleon.es/10.48550/arXiv.1511.08060.
Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, et al. Searching for mobilenetv3. In: Proceedings of the IEEE/CVF international conference on computer vision; 2019;p. 1314–1324.
Li Y, Chen Y, Dai X, Chen D, Liu M, Yuan L, et al. Micronet: improving image recognition with extremely low flops. In: Proceedings of the IEEE/CVF International conference on computer vision; 2021;p. 468–477.
Woo S, Park J, Lee JY, Kweon IS. Cbam: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV); 2018;p. 3–19.
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision; 2021;10012–10022.
Dosovitskiy A. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. 2020.
Wang W, Xie E, Li X, Fan DP, Song K, Liang D, et al. Pvt v2: improved baselines with pyramid vision transformer. Comput Visual Media. 2022;8(3):415–24. https://doiorg.publicaciones.saludcastillayleon.es/10.1007/s41095-022-0274-8.
Article CAS Google Scholar
Tolstikhin IO, Houlsby N, Kolesnikov A, Beyer L, Zhai X, Unterthiner T, et al. Mlp-mixer: an all-mlp architecture for vision. Adv Neural Inf Process Syst 2021;34:24261–72.
Google Scholar
Welinder P, Branson S, Mita T, Wah C, Schroff F, Belongie S, et al. Caltech-UCSD birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology; 2010.
Krause J, Stark M, Deng J, Fei-Fei L. 3d object representations for fine-grained categorization. In: 2013 IEEE international conference on computer vision workshops. IEEE; 2013;554–561.
Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images. Elsevier; 2009.
Ma X, Chen W, Xu Y. ERCP-Net: a channel extension residual structure and adaptive channel attention mechanism for plant leaf disease classification network. Sci Rep. 2024;14(1):4221.
Article PubMed PubMed Central CAS Google Scholar
Ai Y, Sun C, Liu A, Ding F, Tie J. Identification model of crop diseases and insect pests based on convolutional neural network. In: Artificial Intelligence in China: Proceedings of the 2nd International Conference on Artificial Intelligence in China. Springer; 2021;p. 557–563.
Li X, Chen X, Yang J, Li S. Transformer helps identify kiwifruit diseases in complex natural environments. Comput Electron Agric. 2022. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.compag.2022.107258.
Article Google Scholar
Gao R, Wang R, Feng L, Li Q, Wu H. Dual-branch, efficient, channel attention-based crop disease identification. Comput Electron Agric. 2021. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.compag.2021.106410.
Article Google Scholar
Wang X, Cao W. Bit-plane and correlation spatial attention modules for plant disease classification. IEEE Access. 2023.
Wang H, Pan X, Zhu Y, Li S, Zhu R. Maize leaf disease recognition based on TC-MRSN model in sustainable agriculture. Comput Electron Agric. 2024. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.compag.2024.108915.
Article Google Scholar
Guan H, Fu C, Zhang G, Li K, Wang P, Zhu Z. A lightweight model for efficient identification of plant diseases and pests based on deep learning. Front Plant Sci. 2023;14:1227011. https://doiorg.publicaciones.saludcastillayleon.es/10.3389/fpls.2023.1227011.
Article PubMed PubMed Central Google Scholar
Xu J, Li Z, Du B, Zhang M, Liu J. Reluplex made more practical: Leaky ReLU. In: 2020 IEEE Symposium on Computers and communications (ISCC). IEEE; 2020;1–7.
Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings; 2011;315–323.
Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, et al. TensorFlow: a system for Large-Scale machine learning. In: 12th USENIX symposium on operating systems design and implementation (OSDI 16); 2016;265–283.
Chen X, Zhou G, Chen A, Yi J, Zhang W, Hu Y. Identification of tomato leaf diseases based on combination of ABCK-BWTR and B-ARNet. Comput Electron Agric. 2020. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.compag.2020.105730.
Article Google Scholar
Astani M, Hasheminejad M, Vaghefi M. A diverse ensemble classifier for tomato disease recognition. Comput Electron Agric. 2022. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.compag.2022.107054.
Article Google Scholar
Li G, Jiao L, Chen P, Liu K, Wang R, Dong S, et al. Spatial convolutional self-attention-based transformer module for strawberry disease identification under complex background. Comput Electron Agric. 2023. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.compag.2023.108121.
Article Google Scholar

Download references

Acknowledgements

The authors would like to acknowledge the contributions of the participants in this study and the financial support provided by the National Key Research and Development Program of China. We are also grateful to Feifan Guan and Lvwen Huang for providing the tea leaf disease dataset.

Funding

This work was financially supported by the National Key Research and Development Program of China (2022YFD1300201).

Author information

Authors and Affiliations

College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
Yuyun Xiang, Xiaopeng Li & Shuqin Li
Center of Big Data, Data Bureau, Yichang, 44300, Hubei, China
Shuang Gao
Shaanxi Key Laboratory of Agricultural Information Perception and Intelligent Service, Yangling, 712100, Shaanxi, China
Shuqin Li
Key Laboratory of Agricultural Internet of Things, Ministry of Agriculture and Rural Affairs, Yangling, 712100, Shaanxi, China
Shuqin Li

Authors

Yuyun Xiang
View author publications
You can also search for this author inPubMed Google Scholar
Shuang Gao
View author publications
You can also search for this author inPubMed Google Scholar
Xiaopeng Li
View author publications
You can also search for this author inPubMed Google Scholar
Shuqin Li
View author publications
You can also search for this author inPubMed Google Scholar

Contributions

YX: Methodology, Experiment, Visualization, Investigation, Validation, Writing-original draft, review and editing. SG: Data curation, Visualization, Experiment. XL: Formal analysis. SL: Review, Supervision, funding acquisition.

Corresponding author

Correspondence to Shuqin Li.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

All authors have consented to the publication of this manuscript.

Competing interests

The authors declare that they have no conflict of interest that could have appeared to influence to the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Material 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Xiang, Y., Gao, S., Li, X. et al. DWTFormer: a frequency-spatial features fusion model for tomato leaf disease identification. Plant Methods 21, 33 (2025). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13007-025-01349-w

Download citation

Received: 27 December 2024
Accepted: 14 February 2025
Published: 11 March 2025
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13007-025-01349-w

DWTFormer: a frequency-spatial features fusion model for tomato leaf disease identification

Abstract

Introduction

Materials and methods

Datasets

Tomato leaf disease dataset

The self-built dataset

The AI Challenger 2018 and PlantVillage datasets

Analysis of tomato disease features

Model overview

The dual-domain features mapping model: DFMM

Frequency features mapping branch based on DWFD

Spatial domain features mapping branch based on MPVT

The multi-scale dual-domain features fusion module: MDFF-DCA

Experimental setting and evaluation indices

Results and discussion

Identification results on the tomato leaf disease dataset

Generalizability experiments

Generalizability experiments on the self-built datasets

Generalizability experiments on cross-domain datasets

Comparison with other state-of-the-art models

Ablation experiments

Ablation experiments of each module

Ablation experiments of different Bneck-DSM numbers and DSM

Ablation experiments of different feature fusion times

Ablation experiments of channel ratio of frequency and spatial domain features

Visualizations

Visualization of tomato leaf disease with inter-class similarity

Visualization of tomato leaf disease with intra-class variability

Visualization of stages of frequency-spatial feature fusion

Real-time identification of tomato leaf disease in the filed

Discussion

Comparison of other latest studies addressing inter-class similarity and intra-class variability

Limitations and direction for future works

Conclusion

Data availability

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Supplementary Information

Supplementary Material 1.

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Plant Methods

Contact us