Journal of Arid Land  2023, Vol. 15 Issue (2): 191-204    DOI: 10.1007/s40333-023-0094-4
Research article     
Estimation of soil organic matter in the Ogan-Kuqa River Oasis, Northwest China, based on visible and near-infrared spectroscopy and machine learning
ZHOU Qian1,2,3, DING Jianli1,2,3,*, GE Xiangyu1,2,3, LI Ke1,2,3, ZHANG Zipeng1,2,3, GU Yongsheng1,2,3
1College of Geography and Remote Sensing Science, Xinjiang University, Urumqi 830046, China
2Xinjiang Key Laboratory of Oasis Ecology, Xinjiang University, Urumqi 830046, China
3Key Laboratory of Smart City and Environment Modelling of Higher Education Institute, Xinjiang University, Urumqi 830046, China


Visible and near-infrared (vis-NIR) spectroscopy allows fast and efficient determination of soil organic matter (SOM). However, predicting SOM from vis-NIR spectra first requires the effective removal of redundant information. This study therefore compares three wavelength selection strategies for capturing the spectral response characteristics of SOM. The SOM content and spectral information of 110 soil samples from the Ogan-Kuqa River Oasis were measured under laboratory conditions in July 2017. Pearson correlation analysis was used to preselect, from the preprocessed spectra, the wavelengths that passed the significance test at the 0.01 level. The successive projections algorithm (SPA), competitive adaptive reweighted sampling (CARS), and the Boruta algorithm were then used to identify the optimal variables among the preselected wavelengths. Finally, partial least squares regression (PLSR) and random forest (RF) models were built on the optimal wavelengths to estimate the SOM content quantitatively. The results show that the selected variables were mainly located near the spectral absorption features (i.e., 1400.0, 1900.0, and 2200.0 nm), and that CARS and the Boruta algorithm also selected a few visible wavelengths in the range of 480.0-510.0 nm. Both models predicted the SOM content more satisfactorily after wavelength selection, and the RF model was more accurate than the PLSR model. The best model combined the Boruta algorithm with RF: using 23 variables, it achieved a coefficient of determination (R2) of 0.78 and a residual prediction deviation (RPD) of 2.38. The Boruta algorithm effectively removed redundant information and refined the set of wavelengths, which improved the accuracy of the estimated SOM content. Combining vis-NIR spectroscopy with machine learning is therefore an important means of improving the accuracy of SOM prediction in arid land.
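The preselection step described in the abstract (keeping only wavelengths whose Pearson correlation with SOM is significant at the 0.01 level) can be illustrated with a minimal sketch. The function names `pearson_r` and `preselect_bands` are hypothetical, and the p-value uses the Fisher z approximation as a stand-in, since the abstract does not say which significance test was applied.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant band carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def preselect_bands(spectra, som, alpha=0.01):
    """Return indices of bands whose correlation with SOM has p < alpha.

    spectra: list of samples, each a list of band values (samples x bands).
    Assumption: two-sided p-value from the Fisher z approximation.
    """
    n = len(som)
    keep = []
    for j in range(len(spectra[0])):
        band = [row[j] for row in spectra]
        r = pearson_r(band, som)
        r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
        z = math.atanh(r) * math.sqrt(n - 3)
        p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
        if p < alpha:
            keep.append(j)
    return keep
```

With real data, `spectra` would hold the preprocessed reflectance values and `som` the measured SOM contents; the returned indices are the candidates passed on to SPA, CARS, or Boruta.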

Key words: soil organic matter content; vis-NIR spectroscopy; random forest; Boruta algorithm; machine learning
Received: 12 November 2022      Published: 28 February 2023
Corresponding author: *DING Jianli
Cite this article:

ZHOU Qian, DING Jianli, GE Xiangyu, LI Ke, ZHANG Zipeng, GU Yongsheng. Estimation of soil organic matter in the Ogan-Kuqa River Oasis, Northwest China, based on visible and near-infrared spectroscopy and machine learning. Journal of Arid Land, 2023, 15(2): 191-204.


Fig. 1 Overview of the Ogan-Kuqa River Oasis and spatial distribution of sampling sites. DEM, Digital Elevation Model.
Fig. 2 Plot of outliers detected through the Monte Carlo outlier detection (MCOD) method
Sample | Number of samples | Maximum (g/kg) | Minimum (g/kg) | Mean (g/kg) | Standard deviation (g/kg) | Coefficient of variation (%)
Full sample set | 110 | 59.86 | 5.49 | 29.05 | 11.34 | 39.04
Calibration set | 74 | 59.86 | 5.49 | 28.59 | 17.09 | 39.77
Validation set | 36 | 52.91 | 9.94 | 29.99 | 10.97 | 36.57
Table 1 Statistical characteristics of the soil organic matter (SOM) content
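The coefficient of variation reported in Table 1 is the standard deviation expressed as a percentage of the mean. The short check below assumes that 29.05 g/kg in the full-sample row is the mean and 11.34 g/kg the standard deviation; the function name is hypothetical.

```python
def coefficient_of_variation(sd, mean):
    """Coefficient of variation in percent: 100 * SD / mean."""
    return 100.0 * sd / mean


# Full sample set: SD = 11.34 g/kg, mean = 29.05 g/kg (column roles assumed).
print(round(coefficient_of_variation(11.34, 29.05), 2))  # 39.04
```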
Fig. 3 Reflectance curves of the original and preprocessed soil spectra. (a), original spectra; (b), spectra processed by Savitzky-Golay (SG) smoothing and first derivative (FD) processing. Curves of different colors represent the reflectance spectra of different soil samples.
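The SG smoothing and FD processing named in Fig. 3 can be sketched as a pair of convolutions. The window size (5 points) and polynomial order (quadratic) are assumptions made for illustration only, as the page does not state the settings used; the weights are the standard tabulated 5-point quadratic Savitzky-Golay coefficients, and the function names are hypothetical.

```python
# 5-point quadratic Savitzky-Golay weights for offsets -2..2 (assumed settings).
SMOOTH = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]
DERIV1 = [-2 / 10, -1 / 10, 0.0, 1 / 10, 2 / 10]


def sg_convolve(values, weights, step=1.0, order=0):
    """Apply a symmetric-window filter; only interior points are returned."""
    half = len(weights) // 2
    out = []
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        acc = sum(w * v for w, v in zip(weights, window))
        out.append(acc / step ** order)
    return out


def sg_smooth(reflectance):
    """Smoothed reflectance (two points are lost at each end)."""
    return sg_convolve(reflectance, SMOOTH)


def sg_first_derivative(reflectance, step_nm=1.0):
    """First derivative with respect to wavelength, per nm."""
    return sg_convolve(reflectance, DERIV1, step=step_nm, order=1)
```

The derivative filter fits the same local polynomial as the smoother, so it smooths and differentiates in one pass. Whether the study applied the derivative this way or as a simple difference after smoothing is not stated here.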
Fig. 4 Correlation coefficient curves between the soil organic matter (SOM) content and preprocessed soil spectra
Fig. 5 Process of filtering variables by the competitive adaptive reweighted sampling (CARS) algorithm. (a), change in the number of sampled variables with increasing sampling runs; (b), change in the root mean square error of cross-validation (RMSECV) with increasing sampling runs; (c), regression coefficient paths with increasing sampling runs. Curves of different colors trace each variable's regression coefficient across the sampling runs, and the vertical line of asterisks marks the optimal subset of variables, at which the RMSECV reached its minimum over the whole variable selection process.
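The decline in the number of sampled variables shown in panel (a) of Fig. 5 is typically driven by an exponentially decreasing function. The sketch below follows the schedule as it is commonly described for CARS, where the retained fraction is 1 in the first run and 2/P in the last; it is not confirmed by this page, and the run count of 50 in the usage line is an assumed value. It covers only the forced elimination schedule, not the adaptive reweighted sampling or the RMSECV-based choice of the final subset.

```python
import math


def cars_retention_schedule(n_variables, n_runs):
    """Number of variables kept after each sampling run (runs 1..n_runs).

    Retained fraction r_i = a * exp(-k * i), with a and k chosen so that
    r_1 = 1 and r_N = 2 / n_variables.
    """
    a = (n_variables / 2) ** (1 / (n_runs - 1))
    k = math.log(n_variables / 2) / (n_runs - 1)
    return [round(n_variables * a * math.exp(-k * i))
            for i in range(1, n_runs + 1)]


# 442 preselected wavelengths (Table 2); 50 runs is an assumption.
schedule = cars_retention_schedule(442, 50)
print(schedule[0], schedule[-1])  # 442 2
```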
Fig. 6 Process of filtering variables by the successive projections algorithm (SPA). (a), variation in the root mean square error (RMSE) with the number of variables included in the model; (b), distribution of the feature variables on the first calibration object.
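The projection step at the core of SPA can be sketched as follows: starting from one band, each remaining band is projected onto the subspace orthogonal to the band just selected, and the band with the largest remaining norm is chosen next. This is a simplified illustration with a hypothetical function name; choosing the starting band and the number of variables by minimum RMSE, as panel (a) of Fig. 6 suggests, is left out.

```python
def spa_select(spectra, start, n_select):
    """Select n_select band indices by successive orthogonal projections.

    spectra: list of samples, each a list of band values (samples x bands).
    start: index of the first band (assumed given, not optimized here).
    """
    n_bands = len(spectra[0])
    # One vector per band; these are overwritten by their projections.
    cols = [[row[j] for row in spectra] for j in range(n_bands)]
    selected = [start]
    for _ in range(n_select - 1):
        ref = cols[selected[-1]]
        ref_norm2 = sum(v * v for v in ref)
        if ref_norm2 == 0:
            break  # nothing left to project against
        best, best_norm2 = None, -1.0
        for j in range(n_bands):
            if j in selected:
                continue
            # Remove the component of band j that lies along ref.
            coef = sum(c * r for c, r in zip(cols[j], ref)) / ref_norm2
            cols[j] = [c - coef * r for c, r in zip(cols[j], ref)]
            norm2 = sum(c * c for c in cols[j])
            if norm2 > best_norm2:
                best, best_norm2 = j, norm2
        selected.append(best)
    return selected
```

Because a band that is collinear with an already selected band projects to zero, it is never picked next, which is how SPA limits redundancy among the chosen wavelengths.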
Fig. 7 Importance score (Z score) of the different wavelengths identified by the Boruta algorithm
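The idea behind the importance comparison in Fig. 7 is that a real variable should beat shuffled "shadow" copies of the variables. The sketch below shows only that shadow comparison. It deliberately swaps in the absolute Pearson correlation as a stand-in for random forest importance and counts hits instead of running the statistical test Boruta uses, so it is not the Boruta algorithm itself; all names are hypothetical.

```python
import math
import random


def abs_corr(x, y):
    """Absolute Pearson correlation (stand-in importance measure)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def shadow_hits(columns, y, n_iter=50, seed=0):
    """Count, per variable, how often it beats the best shadow variable.

    columns: list of variables, each a list of values over the samples.
    """
    rng = random.Random(seed)
    real_scores = [abs_corr(col, y) for col in columns]
    hits = [0] * len(columns)
    for _ in range(n_iter):
        shadow_scores = []
        for col in columns:
            shadow = list(col)
            rng.shuffle(shadow)  # break any link between shadow and y
            shadow_scores.append(abs_corr(shadow, y))
        threshold = max(shadow_scores)
        for j, score in enumerate(real_scores):
            if score > threshold:
                hits[j] += 1
    return hits
```

A variable that wins in nearly every iteration would be a candidate to keep, and one that rarely wins a candidate to reject.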
Model | Selection method | Variable number | Calibration set (n=74) R2 | Calibration set (n=74) RMSE (g/kg) | Validation set (n=36) R2 | Validation set (n=36) RMSE (g/kg) | Validation set (n=36) RPD
PLSR | Preselected spectrum | 442 | 0.45 | 5.23 | 0.42 | 5.46 | 1.24
PLSR | CARS | 31 | 0.69 | 4.38 | 0.67 | 4.46 | 2.12
PLSR | SPA | 5 | 0.62 | 4.56 | 0.61 | 4.34 | 1.82
PLSR | Boruta algorithm | 23 | 0.65 | 4.30 | 0.63 | 4.28 | 2.08
RF | Preselected spectrum | 442 | 0.52 | 4.96 | 0.54 | 4.83 | 1.64
RF | CARS | 31 | 0.73 | 4.24 | 0.72 | 4.26 | 2.36
RF | SPA | 5 | 0.64 | 4.37 | 0.66 | 4.31 | 1.86
RF | Boruta algorithm | 23 | 0.76 | 4.24 | 0.78 | 4.19 | 2.38
Table 2 Comparison of the coefficient of determination (R2), root mean square error (RMSE), and residual prediction deviation (RPD) obtained from partial least squares regression (PLSR) and random forest (RF) models based on four wavelength selection methods
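The three evaluation metrics in Table 2 can be computed from observed and predicted SOM values as in the sketch below. The exact definitions used in the study are not given on this page, so the forms here are assumptions: R2 as one minus the ratio of residual to total sum of squares, and RPD as the sample standard deviation of the observations (n - 1 denominator) divided by the RMSE. The function name is hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Return R2, RMSE, and RPD for paired observed/predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(sse / n)
    r2 = 1.0 - sse / sst
    sd = math.sqrt(sst / (n - 1))  # sample SD (assumed definition)
    return {"R2": r2, "RMSE": rmse, "RPD": sd / rmse}
```

Under these definitions a larger RPD means the prediction error is small relative to the natural spread of the SOM values, which is why it is reported alongside R2 and RMSE.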
Fig. 8 Distribution of feature variables selected by SPA, CARS, and Boruta algorithms. Note that the numbers on the right side of the figure represent the number of optimal variables selected by SPA, CARS, and Boruta algorithms.