The study area is located at the intersection of the Paleo-Asian and Circum-Pacific tectonic domains (Fig. 1a). It is a superimposed composite tectonic zone that experienced both the evolution of the Paleo-Asian Ocean and the subduction of the Pacific Plate in the Mesozoic (Yu et al., 2012; Wu et al., 2005; Zhang et al., 2004). The regional northwest-trending Gudonghe fault runs through the whole study area and controls the spatial distribution of the major geological formations (Figs. 1a, 1b). The exposed strata occupy 29.44% of the study area and mainly include the Late Archean Ji'nan Formation, the Late Permian Xindongchun and Changren Formations, the Early Cretaceous Changchai, Quanshuichun, and Dalazi Formations, the Late Cretaceous Longjing Formation, the Neocene Chuandishan basalt, and the Holocene alluvium. The widely distributed magmatic rocks are mainly granite, granodiorite, diorite, and gabbro, which form extensively exposed batholiths and stocks (Fig. 1b). Zircon U-Pb ages of the diorites are 173-175 Ma (Wu et al., 2013), indicating that the magmatic rocks were formed by tectono-magmatic activity in the Yanshanian period.
Figure 1. Tectonic location map (a) and the geological map (b) of the study area (modified from Chai and Liu, 2015).
The Yanshanian tectono-magmatic activity provided a continuous heat source and metallogenic materials for polymetallic mineralization (Yan et al., 2015). A total of 14 mineral occurrences have been discovered in the study area (Fig. 2). They are mainly hydrothermal deposits, with a few hydrothermal skarn deposits, and are closely related to multi-stage magmatic activity (Pan et al., 2016; Yan et al., 2015; Wan et al., 2010). The main metallogenic elements are Au, Cu, Co, and Mo. Most of the discovered mineral occurrences are distributed in metamorphic rocks around or at the edges of the magmatic intrusions. Regional deep faults, Archean formations, and the Yanshanian magmatic rocks are the main controlling factors for polymetallic mineralization.
The stream sediment survey recently completed in the study area followed the Geochemical Survey Criteria (No. DZ/T0011-91). This document, a Chinese geological industry standard, can be downloaded at https://wenku.baidu.com/view/17ad51c189eb172ded63b78b.html. A total of 6 999 stream sediment samples were collected from the study area, covering 1 320 km² with a sampling density of 1-2 samples per 0.25 km² (Fig. 2). The Inner Mongolia Mineral Experiment Institute of China analyzed the concentrations of 13 elements in each stream sediment sample. Atomic absorption spectrometry (AAS) was used to determine the concentration of Au, atomic fluorescence spectrophotometry (AFS) was used to determine the concentrations of Hg and As, and inductively coupled plasma mass spectrometry (ICP-MS) was used to determine the concentrations of Ag, Sb, Mo, W, Cu, Pb, Zn, Bi, Ni, and Co.
The mineral occurrence data come from the digital geological survey of the study area (Liu and Zhang, 1999). The mineral occurrence map was saved as a MapInfo Interchange File (*.mif) and used as the base map to generate a unit cell map composed of 200×137 unit cells. Each cell covers 0.228 2×0.229 6 km² and contains no more than one discovered mineral occurrence. Because inverse-distance-to-a-power interpolation smooths the data only slightly, it was used to convert the concentrations of the 6 999 samples for each element into 200×137 grid data. Each grid point in a grid map coincides with the center of the corresponding cell in the unit cell map.
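The gridding step can be illustrated with a minimal sketch of inverse-distance-to-a-power interpolation onto cell centers. The function names, the power of 2, the grid origin (x0, y0), and the use of all samples for every grid point are assumptions made for illustration; the text does not state the power or search settings actually used.

```python
import math

def idw(samples, x, y, power=2.0):
    """Inverse-distance-to-a-power estimate at (x, y).

    samples: iterable of (sx, sy, value) tuples.
    power: distance exponent (assumed to be 2 here).
    """
    num = den = 0.0
    for sx, sy, value in samples:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return value  # grid point coincides with a sample location
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

def grid_from_samples(samples, x0, y0, ncols=200, nrows=137,
                      dx=0.2282, dy=0.2296, power=2.0):
    """Interpolate to the centers of an ncols x nrows unit-cell map.

    (x0, y0) is a hypothetical lower-left corner of the map in km.
    """
    return [[idw(samples, x0 + (c + 0.5) * dx, y0 + (r + 0.5) * dy, power)
             for c in range(ncols)]
            for r in range(nrows)]
```

Evaluating every sample for every grid point is slow at the scale of 6 999 samples and 200×137 cells; a practical implementation would normally restrict each estimate to nearby samples.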
A grid point was defined as a true positive point if the corresponding unit cell contains a mineral occurrence. A total of 14 true positive points were defined in the study area. All other grid points were defined as true negative points. It must be noted that the defined true negative points most likely include some true positive points that have not yet been discovered and are therefore incorrectly labeled. However, a few incorrectly labeled true negative points do not significantly affect the ROC curve analysis results. Because ROC curve analysis is an effective tool for handling imbalanced data, the strong imbalance between the 14 true positive points and the remaining true negative points does not affect the results either.
The grid data of each element were used to predict the mineral potential of the study area. The higher the element concentration at a grid point, the more likely the corresponding cell contains a mineral occurrence. Based on the true positive and true negative points defined in Section 2.2 and the grid data of each element, the AUC value of the element was calculated using Eq. 12 to evaluate how effectively the element concentration predicts mineral potential.
Based on the AUC value of each element, the corresponding standard normal statistic, ZAUC, can be calculated using Eq. 13 to test whether the AUC value is significantly different from 0.5. The AUC value is significantly different from 0.5 only if ZAUC is greater than the threshold value 1.96 at the significance level α=0.05 (Chen and Wu, 2019a, b, 2017c, 2016).
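As an illustration of this test, the sketch below computes the AUC from ranks (the Mann-Whitney form, with ties counted as one half) and a Z statistic based on the Hanley-McNeil standard error. Eqs. 12 and 13 are not reproduced in this section, so both formulas and the function names are assumptions rather than the paper's stated equations.

```python
import math

def auc_rank(values, labels):
    """AUC as the probability that a positive point outranks a negative one.

    values: element concentrations (or anomaly scores) at the grid points.
    labels: 1 for true positive points, 0 for true negative points.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the tied ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lab for _, lab in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def z_auc(auc, n_pos, n_neg):
    """Z statistic for H0: AUC = 0.5, using the Hanley-McNeil standard error."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc * auc)
           + (n_neg - 1) * (q2 - auc * auc)) / (n_pos * n_neg)
    return (auc - 0.5) / math.sqrt(var)

# Cu in Table 1: AUC = 0.7123 with 14 positives and 200*137 - 14 negatives
z_cu = z_auc(0.7123, 14, 200 * 137 - 14)  # about 2.72
```

With 14 positives and 27 386 negatives, this standard error gives Z values of about 2.72 for Cu and 3.52 for Ni, close to the entries in Table 1, which suggests (but does not confirm) that Eq. 13 has this form.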
Table 1 lists the estimated AUCs and ZAUCs for the 13 elements. It shows that the concentrations of Au, Co, Cu, Mo, Ni, and W can effectively predict the mineral potential of the study area, as the ZAUCs of these elements are greater than the threshold value 1.96. Among these six elements, the first five are metallogenic elements, and the last one is an ore-forming associated element. Therefore, these statistical results are strongly consistent with the geological characteristics of the study area.
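| Element | AUC | ZAUC | Element | AUC | ZAUC | Element | AUC | ZAUC |
|---|---|---|---|---|---|---|---|---|
| Ag | 0.515 9 | 0.204 6 | Cu | 0.712 3 | 2.720 8 | Sb | 0.580 2 | 1.003 0 |
| As | 0.610 1 | 1.371 4 | Hg | 0.540 9 | 0.517 7 | W | 0.661 5 | 2.023 6 |
| Au | 0.681 6 | 2.290 2 | Mo | 0.718 1 | 2.805 6 | Zn | 0.638 2 | 1.723 5 |
| Bi | 0.630 7 | 1.629 2 | Ni | 0.763 1 | 3.519 2 | | | |
| Co | 0.687 5 | 2.370 2 | Pb | 0.393 5 | -1.534 4 | | | |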
Table 1. AUCs and ZAUCs for 13 elements
2.1. Geological Characteristics
2.2. Geochemical Data and Data Preprocessing
2.3. Indicator Element Selection
The anomaly score and Mahalanobis distance data generated in Section 3 were used to draw ROC curves and calculate AUCs. Figure 4 shows the four ROC curves of the corresponding bat-optimized models and default-parameter models. The ROC curves of the two bat-optimized models dominate those of the default-parameter counterparts in the ROC space. Therefore, the bat algorithm can improve the performance of the two anomaly detection models.
Table 2 lists the performance evaluation statistics including AUCs, ZAUCs, maximum Youden index (MYI), optimal threshold (OT), percentage of anomaly areas (PAA), Benefit, and data modeling time (DMT). The ZAUCs of the anomaly detection models in Table 2 are much higher than the threshold value 1.96. Therefore, the anomaly detection models can effectively predict the mineral occurrences in the study area. The AUCs of the bat-optimized models are greater than the AUCs of the default-parameter counterparts. Therefore, the bat-optimized models perform better than their default-parameter counterparts in geochemical anomaly detection. However, the bat-optimized models take much more time than their default-parameter counterparts.
| Statistics | AUC | ZAUC | MYI | OT | PAA | Benefit | DMT (s) |
|---|---|---|---|---|---|---|---|
| Default IF | 0.690 8 | 2.415 7 | 0.368 3 | 0.397 6 | 0.315 6 | 0.79 | 25.02 |
| Optimized IF | 0.754 3 | 3.370 9 | 0.401 5 | 0.424 7 | 0.220 2 | 0.79 | 4 873.16 |
| Default EE | 0.751 9 | 3.330 2 | 0.121 1 | 423.96 | 0.019 9 | 0.43 | 23.90 |
| Optimized EE | 0.760 2 | 3.469 3 | 0.323 7 | 338.33 | 0.030 7 | 0.43 | 8 686.93 |

IF, isolation forest; EE, elliptic envelope; MYI, maximum Youden index; OT, optimal threshold; PAA, percentage of anomaly areas; DMT, data modeling time.
Table 2. Performance evaluation statistics of the two bat-optimized models and their default-parameter counterparts
The PAAs and Benefits in Table 2 show that the geochemical anomalies detected by the bat-optimized isolation forest and by the default-parameter isolation forest occupy 22.02% and 31.56% of the study area, respectively, and both contain 79% of the mineral occurrences discovered in the study area. The geochemical anomalies detected by the bat-optimized elliptic envelope and by the default-parameter elliptic envelope occupy 3.07% and 1.99% of the study area, respectively, and both contain 43% of the discovered mineral occurrences. Therefore, the geochemical anomalies detected by the elliptic envelope models occupy a relatively small percentage of the study area (3.07% and 1.99%) and have relatively small benefit values (43%).
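The relationship among MYI, OT, PAA, and Benefit can be sketched as follows, assuming that a cell is flagged as anomalous when its score is at or above the threshold that maximizes the Youden index (true positive rate minus false positive rate), that PAA is the fraction of flagged cells, and that Benefit is the fraction of known mineral occurrences falling in flagged cells. The function name and these definitions are assumptions inferred from the description above, not the paper's stated formulas.

```python
def youden_summary(scores, labels):
    """Return (MYI, OT, PAA, Benefit) for anomaly scores and 0/1 labels.

    A higher score is taken to mean a more anomalous grid point.
    """
    n = len(scores)
    n_pos = sum(labels)
    n_neg = n - n_pos
    order = sorted(range(n), key=lambda k: scores[k], reverse=True)
    tp = fp = 0
    best = (-1.0, None, 0.0, 0.0)
    i = 0
    while i < n:
        thr = scores[order[i]]
        # Add every point tied at this threshold before evaluating it.
        while i < n and scores[order[i]] == thr:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        youden = tp / n_pos - fp / n_neg
        if youden > best[0]:
            best = (youden, thr, (tp + fp) / n, tp / n_pos)
    return best

# Toy example with eight cells and two known occurrences.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0, 0, 0, 0]
myi, ot, paa, benefit = youden_summary(scores, labels)
```

Under these definitions, the Benefit values of 0.79 and 0.43 in Table 2 correspond to 11 and 6 of the 14 discovered mineral occurrences, respectively.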
Figures 1-3 reveal that (a) the geochemical anomalies are mainly distributed on the southwestern Gudonghe fault zone; and (b) most mineral occurrences are located inside the geochemical anomalies detected by the isolation forest models but outside those detected by the elliptic envelope models. This indicates that the complex geological setting and geological evolution history of the study area have produced a complex geochemical background, from which the elliptic envelope model cannot effectively extract geochemical anomalies.
The isolation forest and elliptic envelope models are established in different ways. To establish an isolation forest of size T, a total of T training sets are needed; each training set, composed of S training samples, is randomly selected from the sample population to build one isolation tree. Each sample in the sample population is then passed through the isolation forest to calculate its anomaly score. An elliptic envelope model is established on the whole data set using the MCD algorithm (Rousseeuw, 1984). Therefore, neither the isolation forest nor the elliptic envelope requires a test set in addition to the training set.
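The isolation forest procedure described above can be sketched in plain Python. This is a simplified illustration, not the implementation used in the study: the depth limit of ceil(log2 S), the path-length normalization constant, and all function names follow the commonly described form of the algorithm and are assumptions here.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful binary-tree search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random attribute at a random value."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    dim = rng.randrange(len(rows[0]))
    lo = min(r[dim] for r in rows)
    hi = max(r[dim] for r in rows)
    if lo == hi:
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[dim] < split]
    right = [r for r in rows if r[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(tree, x):
    """Depth at which x reaches a leaf, adjusted for unsplit points in the leaf."""
    depth = 0
    while "dim" in tree:
        tree = tree["left"] if x[tree["dim"]] < tree["split"] else tree["right"]
        depth += 1
    return depth + c_factor(tree["size"])

def isolation_forest_scores(data, T=100, S=256, seed=0):
    """Anomaly score in (0, 1) for every sample; higher means more anomalous."""
    rng = random.Random(seed)
    S = max(2, min(S, len(data)))
    max_depth = math.ceil(math.log2(S))
    # One random training set of S samples per isolation tree, T trees in total.
    trees = [build_tree(rng.sample(data, S), 0, max_depth, rng) for _ in range(T)]
    c = c_factor(S)
    return [2.0 ** (-sum(path_length(t, x) for t in trees) / (T * c))
            for x in data]
```

Every sample in `data` is scored against all T trees, so no separate test set is involved, which matches the point made above.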
The bat algorithm has more parameters than the isolation forest and elliptic envelope models. However, the default parameter values of the bat algorithm can guarantee that the optimization procedure finds the 'best' parameter values for the isolation forest and elliptic envelope models. Even if the 'best' values found in the optimization process are not the true optimum, they can still improve the performance of the two anomaly detection models. To test the influence of the bat algorithm's parameter values on the performance of the isolation forest model, the following experiments were carried out.
To assess the impact of L on the data modeling results, Im was fixed at 20, the remaining six parameter values used in Section 3.1 were kept unchanged, and L was increased from 10 to 50 in steps of 10. Figures 5a-5d show how AUC, PAA, LYI, and DMT change with L. To evaluate the impact of Im on the data modeling results, the parameter Im was varied in the same way as L. Figures 5e-5h show how AUC, PAA, LYI, and DMT change with Im. Figure 5 reveals that changing the values of L and Im has no significant impact on data modeling performance.
Figure 5. Curves of performance evaluation indices varying with L and Im, given fmin=0, fmax=1, Amin=0, and α=γ=0.9: (a) AUC-L curve; (b) PAA-L curve; (c) LYI-L curve; (d) DMT-L curve; (e) AUC-Im curve; (f) PAA-Im curve; (g) LYI-Im curve; and (h) DMT-Im curve.
To test the influence of α on the data modeling performance, α was increased from 0.1 to 0.9 in steps of 0.2, while the remaining seven parameter values used in Section 3.1 were kept unchanged. Figures 6a-6d show how AUC, PAA, LYI, and DMT change with α. The parameter γ was varied in the same way as α to test its influence on the data modeling performance. Figures 6e-6h show how AUC, PAA, LYI, and DMT change with γ. Figure 6 indicates that changing γ between 0.1 and 0.9 has no significant impact on data modeling performance, while increasing α speeds up the convergence of the bat algorithm.
Figure 6. Curves of performance evaluation indices varying with α and γ, given L=50, Im=20, fmin=0, fmax=1, and Amin=0: (a) AUC-α curve; (b) PAA-α curve; (c) LYI-α curve; (d) DMT-α curve; (e) AUC-γ curve; (f) PAA-γ curve; (g) LYI-γ curve; and (h) DMT-γ curve.
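To make the role of the swept parameters concrete, the following is a minimal sketch of a basic bat algorithm. It assumes that L is the number of bats, Im the maximum number of iterations, fmin and fmax the frequency range, Amin the minimum loudness, α the loudness decay factor, and γ the pulse-rate growth factor; these meanings, the initial loudness A0, the initial pulse rate r0, the local-walk step size, and the function name are assumptions, since the algorithm is defined outside this section and published variants differ in detail.

```python
import math
import random

def bat_maximize(objective, bounds, L=20, Im=20, fmin=0.0, fmax=1.0,
                 A0=1.0, Amin=0.0, r0=0.5, alpha=0.9, gamma=0.9, seed=0):
    """Maximize objective(position) inside box bounds [(lo, hi), ...]."""
    rng = random.Random(seed)

    def clip(p):
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(p, bounds)]

    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(L)]
    vel = [[0.0] * len(bounds) for _ in range(L)]
    fit = [objective(p) for p in pos]
    loud = [A0] * L      # loudness of each bat
    rate = [0.0] * L     # pulse emission rate of each bat
    b = max(range(L), key=lambda k: fit[k])
    best_pos, best_fit = pos[b][:], fit[b]
    history = [best_fit]
    for t in range(1, Im + 1):
        mean_loud = sum(loud) / L
        for i in range(L):
            freq = fmin + (fmax - fmin) * rng.random()
            # Sign chosen so that bats move toward the current best position.
            vel[i] = [v + (g - x) * freq
                      for v, x, g in zip(vel[i], pos[i], best_pos)]
            cand = [x + v for x, v in zip(pos[i], vel[i])]
            if rng.random() > rate[i]:
                # Local random walk around the best position (assumed step size).
                cand = [g + rng.uniform(-1.0, 1.0) * mean_loud * 0.1 * (hi - lo)
                        for g, (lo, hi) in zip(best_pos, bounds)]
            cand = clip(cand)
            cand_fit = objective(cand)
            if cand_fit >= fit[i] and rng.random() < loud[i]:
                pos[i], fit[i] = cand, cand_fit
                loud[i] = max(Amin, alpha * loud[i])
                rate[i] = r0 * (1.0 - math.exp(-gamma * t))
            if cand_fit > best_fit:
                best_pos, best_fit = cand[:], cand_fit
        history.append(best_fit)
    return best_pos, best_fit, history
```

In a setting like the one described here, the objective would presumably round a bat's position to integer model parameters, fit the anomaly detection model, and return its AUC. A sensitivity experiment then amounts to calling the optimizer in a loop, for example over `L in range(10, 51, 10)`, and recording the resulting performance indices.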
The parameters T and S of the isolation forest are obtained by rounding spatial coordinates using Eqs. 10 and 11. The rounding process may affect the performance of the isolation forest model. Therefore, the same data modeling process was repeated five times. Table 3 lists the performance evaluation statistics of the bat-optimized isolation forest in the five repetitions and reveals that the data modeling results differ from one repetition to another. Therefore, in applications, it is better to repeat the same data modeling process several times and to identify geochemical anomalies using the result with the highest converged AUC value.
| Repetition | AUC | ZAUC | MYI | OT | PAA | Benefit | DMT (s) |
|---|---|---|---|---|---|---|---|
| 1 | 0.738 3 | 3.111 8 | 0.465 8 | 0.420 6 | 0.226 8 | 0.79 | 2 998.20 |
| 2 | 0.733 3 | 3.034 6 | 0.401 1 | 0.394 7 | 0.285 7 | 0.79 | 3 017.35 |
| 3 | 0.754 3 | 3.370 9 | 0.401 5 | 0.424 7 | 0.220 2 | 0.79 | 4 873.16 |
| 4 | 0.742 7 | 3.182 2 | 0.413 1 | 0.377 4 | 0.339 9 | 0.79 | 1 702.99 |
| 5 | 0.727 3 | 2.943 1 | 0.381 8 | 0.368 8 | 0.368 4 | 0.79 | 2 388.53 |

MYI, maximum Youden index; OT, optimal threshold; PAA, percentage of anomaly areas; DMT, data modeling time.
Table 3. Performance evaluation statistics of the bat-optimized isolation forest in the 5 repetitions of data modeling
In terms of the ROC curve and AUC, the bat-optimized elliptic envelope model performs better than the default-parameter elliptic envelope model in geochemical anomaly detection. However, these two elliptic envelope models do not perform as well as the isolation forest models, because the geochemical anomalies they detect have relatively small benefit values. This may be because the elliptic envelope cannot express the "complex background shape". Therefore, the assumption that the geochemical background population conforms to a multivariate Gaussian distribution is unreasonable for the study area.
Zheng (2019) used four machine learning methods to extract geochemical anomalies from the same geochemical data in the study area. The results show that the isolation forest, OCSVM, and CRBM models perform well, while the local outlier factor (LOF) model (Breunig et al., 2000) performs poorly in geochemical anomaly detection. This indicates that the background population of the study area is complex and does not obey a multivariate Gaussian distribution, so the data are not suitable for the LOF model.