Volume 32 Issue 2
Apr.  2021

Yongliang Chen, Shicheng Wang, Qingying Zhao, Guosheng Sun. Detection of Multivariate Geochemical Anomalies Using the Bat-Optimized Isolation Forest and Bat-Optimized Elliptic Envelope Models. Journal of Earth Science, 2021, 32(2): 415-426. doi: 10.1007/s12583-021-1402-6

Detection of Multivariate Geochemical Anomalies Using the Bat-Optimized Isolation Forest and Bat-Optimized Elliptic Envelope Models

doi: 10.1007/s12583-021-1402-6
  • Abstract: Isolation forest and elliptic envelope were used to detect geochemical anomalies, and the bat algorithm was adopted to optimize the parameters of the two models. The two bat-optimized models and their default-parameter counterparts were used to detect multivariate geochemical anomalies from 1:50 000 scale stream sediment survey data collected from the Helong district, Jilin Province, China. Based on the modeling results, receiver operating characteristic (ROC) curve analysis was performed to evaluate the performance of the two bat-optimized models and their default-parameter counterparts. The results show that the bat algorithm can improve the performance of the two models in geochemical anomaly detection by optimizing their parameters. The optimal threshold determined by the Youden index was used to identify geochemical anomalies among the geochemical data points. Compared with the anomalies detected by the elliptic envelope models, the anomalies detected by the isolation forest models have a stronger spatial association with the mineral occurrences discovered in the study area. From the results of this study and previous work, it can be inferred that the background population of the study area is complex and therefore not suitable for fitting an elliptic envelope model.
  • Breunig, M. M., Kriegel, H. P., Ng, R. T., et al., 2000. LOF: Identifying Density-Based Local Outliers. ACM SIGMOD Conference 2000, Dallas
    Chai, S. L., Liu, Z. H., 2015. Experimental Demonstration on 1:50 000 Scale Mineral Geology Survey of Four Geological Maps in the Helong Area, Jilin Province. Mineral Geology Survey Report (Internal Communication), Jilin University, Changchun. 205 (in Chinese)
    Chen, Y. L., Lu, L. J., Li, X. B., 2014a. Kernel Mahalanobis Distance for Multivariate Geochemical Anomaly Recognition. Journal of Jilin University (Earth Science Edition), 44(1): 396-408 (in Chinese). http://en.cnki.com.cn/Article_en/CJFDTOTAL-CCDZ201401040.htm
    Chen, Y. L., Lu, L. J., Li, X. B., 2014b. Application of Continuous Restricted Boltzmann Machine to Identify Multivariate Geochemical Anomaly. Journal of Geochemical Exploration, 140: 56-63. https://doi.org/10.1016/j.gexplo.2014.02.013
    Chen, Y. L., 2015. Mineral Potential Mapping with a Restricted Boltzmann Machine. Ore Geology Reviews, 71: 749-760. https://doi.org/10.1016/j.oregeorev.2014.08.012
    Chen, Y. L., Wu, W., 2016. A Prospecting Cost-Benefit Strategy for Mineral Potential Mapping Based on ROC Curve Analysis. Ore Geology Reviews, 74: 26-38. https://doi.org/10.1016/j.oregeorev.2015.11.011
    Chen, Y., Wu, W., 2017a. Mapping Mineral Prospectivity by Using One-Class Support Vector Machine to Identify Multivariate Geological Anomalies from Digital Geological Survey Data. Australian Journal of Earth Sciences, 64(5): 639-651. https://doi.org/10.1080/08120099.2017.1328705
    Chen, Y. L., Wu, W., 2017b. Mapping Mineral Prospectivity Using an Extreme Learning Machine Regression. Ore Geology Reviews, 80: 200-213. https://doi.org/10.1016/j.oregeorev.2016.06.033
    Chen, Y. L., Wu, W., 2017c. Application of One-Class Support Vector Machine to Quickly Identify Multivariate Anomalies from Geochemical Exploration Data. Geochemistry: Exploration, Environment, Analysis, 17(3): 231-238. https://doi.org/10.1144/geochem2016-024
    Chen, Y. L., Wu, W., 2019a. Isolation Forest as an Alternative Data-Driven Mineral Prospectivity Mapping Method with a Higher Data-Processing Efficiency. Natural Resources Research, 28(1): 31-46. https://doi.org/10.1007/s11053-018-9375-6
    Chen, Y. L., Wu, W., 2019b. Separation of Geochemical Anomalies from the Sample Data of Unknown Distribution Population Using Gaussian Mixture Model. Computers & Geosciences, 125: 9-18. https://doi.org/10.1016/j.cageo.2019.01.010
    Chen, Y. L., Wu, W., Zhao, Q. Y., 2019a. A Bat Algorithm-Based Data-Driven Model for Mineral Prospectivity Mapping. Natural Resources Research, 29(1): 247-265. https://doi.org/10.1007/s11053-019-09589-z
    Chen, Y. L., Wu, W., Zhao, Q. Y., 2019b. A Bat-Optimized One-Class Support Vector Machine for Mineral Prospectivity Mapping. Minerals, 9(5): 317. https://doi.org/10.3390/min9050317
    Gałuszka, A., 2007. A Review of Geochemical Background Concepts and an Example Using Data from Poland. Environmental Geology, 52(5): 861-870. https://doi.org/10.1007/s00254-006-0528-2
    Goyal, S., Patterh, M. S., 2013. Wireless Sensor Network Localization Based on Bat Algorithm. International Journal of Emerging Technologies in Computational and Applied Sciences (IJETCAS), 4(5): 507-512
    Liu, F. S., Zhang, M. L., 1999. Complete Quality Management of the New-Round Land Resources Survey. Chinese Geology, 267(8): 20-21 (in Chinese)
    Liu, F. T., Ting, K. M., Zhou, Z. H., 2008. Isolation Forest. Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM), 413-422
    Pan, Y. D., Xu, B. J., Sun, Y., et al., 2016. Geological Features of the Jinchengdong Gold Deposit in Helong City, Jilin Province, China. Jilin Geology, 35(1): 30-35 (in Chinese). http://en.cnki.com.cn/Article_en/CJFDTOTAL-JLDZ201601007.htm
    Rousseeuw, P. J., 1984. Least Median of Squares Regression. Journal of the American Statistical Association, 79(388): 871-880. https://doi.org/10.1080/01621459.1984.10477105
    Rousseeuw, P. J., van Driessen, K. V., 1999. A Fast Algorithm for the Minimum Covariance Determinant Estimator. Technometrics, 41(3): 212-223. https://doi.org/10.1080/00401706.1999.10485670
    Sharawi, M., Emary, E., Saroit, I. A., et al., 2012. Bat Swarm Algorithm for Wireless Sensor Networks Lifetime Optimization. International Journal of Science and Research (IJSR), 3(5): 655-664. http://www.researchgate.net/publication/270241192_Bat_Swarm_Algorithm_for_Wireless_Sensor_Networks_Lifetime_Optimization
    Wan, W. Z., Wang, J. B., Feng, X. Y., et al., 2010. Geological Features and Prospecting Directions of the Heanhe Gold Deposit in the Helong Area, Jilin Province, China. Jilin Geology, 29(1): 71-75 (in Chinese). http://en.cnki.com.cn/Article_en/CJFDTotal-JLDZ201004015.htm
    Wu, F., Lin, J., Wilde, S., et al., 2005. Nature and Significance of the Early Cretaceous Giant Igneous Event in Eastern China. Earth and Planetary Science Letters, 233(1/2): 103-119. https://doi.org/10.1016/j.epsl.2005.02.019
    Wu, P. F., Sun, D. Y., Wang, T. H., et al., 2013. Chronology, Geochemical Characteristic and Petrogenesis Analysis of Diorite in Helong of Yanbian Area, Northeastern China. Geological Journal of China Universities, 19(4): 600-610 (in Chinese). http://en.cnki.com.cn/Article_en/CJFDTOTAL-GXDX201304006.htm
    Wu, W., Chen, Y. L., 2018. Application of Isolation Forest to Extract Multivariate Anomalies from Geochemical Exploration Data. Global Geology, 21(1): 36-47. https://doi.org/10.3969/j.issn.1673-9736.2018.01.04
    Xiong, Y. H., Zuo, R. G., 2016. Recognition of Geochemical Anomalies Using a Deep Autoencoder Network. Computers & Geosciences, 86: 75-82. https://doi.org/10.1016/j.cageo.2015.10.006
    Yan, D., Li, N., Xu, M., et al., 2015. Mineralization Characteristics and Genesis of the Bailiping Silver Deposit in Helong City, Jilin Province. Jilin Geology, 34(3): 36-41 (in Chinese). http://www.zhangqiaokeyan.com/academic-journal-cn_jilin-geology_thesis/0201253935009.html
    Yang, X. S., Gandomi, A. H., 2012. Bat Algorithm: A Novel Approach for Global Engineering Optimization. Engineering Computations, 29(5): 464-483. https://doi.org/10.1108/02644401211235834
    Yang, X. S., 2010. A New Metaheuristic Bat-Inspired Algorithm. In: Juan, R. G., David, A. P., Carlos, C., et al., eds., Nature Inspired Cooperative Strategies for Optimization. Springer-Verlag, Berlin. 65-74
    Yu, J. J., Wang, F., Xu, W. L., et al., 2012. Early Jurassic Mafic Magmatism in the Lesser Xing'an-Zhangguangcai Range, NE China, and Its Tectonic Implications: Constraints from Zircon U-Pb Chronology and Geochemistry. Lithos, 142/143: 256-266. https://doi.org/10.1016/j.lithos.2012.03.016
    Zhang, Y. B., Wu, F. Y., Wilde, S. A., et al., 2004. Zircon U-Pb Ages and Tectonic Implications of 'Early Paleozoic' Granitoids at Yanbian, Jilin Province, Northeast China. The Island Arc, 13(4): 484-505. https://doi.org/10.1111/j.1440-1738.2004.00442.x
    Zheng, Z. Y., 2019. A Comparison between Several Machine Learning Methods for Multivariate Geochemical Anomaly Identification in the Helong Area, Jilin Province: [Dissertation]. Jilin University, Changchun. 40-50 (in Chinese with English Abstract)

Figures(6)  / Tables(3)


  • Machine learning methods have been successfully applied to detect geochemical anomalies from a complex geochemical background. These methods include the kernel Mahalanobis distance (Chen et al., 2014a), continuous restricted Boltzmann machine or CRBM (Chen and Wu, 2017a; Chen et al., 2014b), deep autoencoder network (Xiong and Zuo, 2016), ant colony algorithm (Chen and An, 2016), one-class support vector machine or OCSVM (Chen and Wu, 2017b, c), isolation forest (Chen and Wu, 2019a; Wu and Chen, 2018), and Gaussian mixture model (Chen and Wu, 2019b), among others. However, these methods require a set of initialization parameters to be defined before geochemical data modeling, and improper initialization parameter values may degrade the performance of the models. To handle this problem, the bat algorithm (Yang and Gandomi, 2012; Yang, 2010) was adopted to optimize the initialization parameters of machine learning models in geochemical anomaly detection.

    The bat algorithm is a swarm intelligence method for solving global engineering optimization problems (Yang and Gandomi, 2012; Yang, 2010). It has been successfully used in the machine learning field to solve large-scale optimization problems (Goyal and Patterh, 2013; Sharawi et al., 2012). Chen et al. (2019b) used the bat algorithm to optimize the initialization parameters of the OCSVM model, improving its performance in mineral prospectivity mapping, and established a bat-optimized OCSVM model for mapping mineral prospectivity. Chen et al. (2019a) established a data-driven mineral prospectivity mapping model based on the bat algorithm and illustrated its usefulness through a case study.

    Isolation forest can be used to detect geochemical anomalies from a complex geochemical background; however, the elliptic envelope (Rousseeuw, 1984) can only detect geochemical anomalies from a geochemical background that can be fitted by an elliptic envelope. By comparing the performance of these two methods in a study area, it is therefore possible to determine whether the background population of the study area is complex. In this paper, the two methods were used to detect multivariate anomalies from 1 : 50 000 scale stream sediment survey data collected from the Helong district, Jilin Province, China, and the bat algorithm was used to optimize the initialization parameters of the two models. The receiver operating characteristic (ROC) curve analysis (Chen et al., 2019a, b; Chen and Wu, 2019a, b; Wu and Chen, 2018; Chen and Wu, 2017a, b, c, 2016; Chen, 2015) was used to evaluate the performance of the two models initialized with default parameters and with optimized parameters, based on the ROC curve and the area under the ROC curve (AUC) of each model. The Youden index (Chen et al., 2019a, b; Chen and Wu, 2019a, b; Wu and Chen, 2018; Chen and Wu, 2017a, b, c, 2016; Chen, 2015) was used to determine the optimal threshold for separating geochemical anomalies from the background. The main contribution of this paper is to use the bat algorithm to optimize the two machine learning models in geochemical anomaly detection and to show that the bat-optimized models perform better than their default-parameter counterparts.

  • Isolation forest (Chen and Wu, 2019a; Wu and Chen, 2018; Liu et al., 2008) is an efficient outlier detection method in machine learning, which consists of a modeling stage and a testing stage. In the modeling stage, a set of binary decision trees is constructed, and each tree represents the iterative random separation of training samples randomly selected from the sample population. In the testing stage, the anomaly score of each sample is estimated from the path lengths of the sample in all the decision trees it traverses. Each binary decision tree is called an isolation tree, and a set of isolation trees constitutes an isolation forest.

    An isolation tree consists of a root node, a set of internal and external nodes, and edges that connect adjacent nodes. An internal node, connected to its two child nodes by two edges, represents a random separation; an external node represents a sample isolated from the training samples. The number of internal nodes from the root node to the external node is equal to the number of random separations required to isolate the sample from the training samples. It is also equal to the number of edges connecting each pair of parent and child nodes from the root node to the external node. Therefore, the path length of a sample is defined as the number of edges that the sample traverses in the isolation tree (Liu et al., 2008). Outliers have distinctive characteristics that allow them to be isolated from the training samples by a smaller number of random separations, so their path lengths are shorter than those of normal samples.

    In multivariate geochemical anomaly detection, it is assumed that there are n samples in the study area and that m element concentrations are observed for each sample. The matrix $ \boldsymbol{X} = (x_{ij})_{n \times m} $ represents the observed data, where $ x_{ij} $ is the concentration of element j in sample i. The symbol U denotes the percentage of outliers in the sample population. After defining the number of decision trees (T) and the number of training samples (S), the isolation forest model can be constructed by modeling the data matrix X. According to Liu et al. (2008), T=100 and S=256 can be used as the default parameters in most anomaly detection cases.

    Given the values of the parameters T and S, and assuming that the minimum and maximum concentrations of each element in the sample population have been identified, the isolation forest model of the data matrix X can be established through the following four steps.

    Step 1 S training samples are randomly selected from the sample population;

    Step 2 One element is randomly selected from the m elements, and a random value between the minimum and maximum concentration values of the selected element is chosen as the threshold to classify the S training samples into anomalous and normal (background) samples according to their element concentrations;

    Step 3 Step 2 is repeated until all the S training samples are classified as anomalous samples, i.e., until each sample has been isolated;

    Step 4 Steps 1 to 3 are repeated T times, and each repetition generates one binary decision tree.
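
    The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the `max_depth` guard are hypothetical, the split threshold is drawn from the range of the samples remaining at the current node (as in Liu et al., 2008), and the path-length adjustment for unresolved external nodes is omitted.

```python
import random

def build_tree(samples, rng, depth=0, max_depth=50):
    """Recursively split `samples` (lists of concentrations) at random."""
    if len(samples) <= 1 or depth >= max_depth:
        return {"size": len(samples)}               # external node
    j = rng.randrange(len(samples[0]))              # Step 2: random element
    lo = min(s[j] for s in samples)
    hi = max(s[j] for s in samples)
    if lo == hi:                                    # no split possible on this element
        return {"size": len(samples)}
    cut = rng.uniform(lo, hi)                       # Step 2: random threshold
    left = [s for s in samples if s[j] < cut]
    right = [s for s in samples if s[j] >= cut]
    return {"feature": j, "cut": cut,               # Step 3: recurse on both sides
            "left": build_tree(left, rng, depth + 1, max_depth),
            "right": build_tree(right, rng, depth + 1, max_depth)}

def build_forest(data, n_trees=100, n_samples=256, seed=0):
    """Steps 1 and 4: draw a random subsample and grow one tree, T times."""
    rng = random.Random(seed)
    size = min(n_samples, len(data))
    return [build_tree(rng.sample(data, size), rng) for _ in range(n_trees)]

def path_length(tree, x):
    """Number of edges that sample x traverses from the root to an external node."""
    edges = 0
    while "cut" in tree:
        tree = tree["left"] if x[tree["feature"]] < tree["cut"] else tree["right"]
        edges += 1
    return edges

def average_path_length(forest, x):
    return sum(path_length(tree, x) for tree in forest) / len(forest)
```

    With such a forest, a sample lying far from the bulk of the data is expected to have a noticeably shorter average path length than typical background samples.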

    After the isolation forest model is established, the testing stage can begin and the anomaly score of each sample can be calculated. At this stage, each sample traverses all the isolation trees in the isolation forest to obtain its path lengths. The average path length of the sample is then calculated and used to compute the anomaly score, which represents the anomaly intensity of the sample. In order to normalize the average path length of each sample, the following formula is used to estimate the expectation of the average path length (Liu et al., 2008),

    $ c(S) = 2\ln (S - 1) + 1.154\;431\;329\;8 - \frac{{2(S - 1)}}{S} \quad (1) $

    where S is the training sample size, and the real number 1.154 431 329 8 is twice Euler's constant (0.577 215 664 9).

    Based on the average path length of each sample i (i=1, 2, ..., n) and the expectation of the average path length, Liu et al. (2008) defined the anomaly score of sample i as

    $ f(i, S) = {2^{ - \frac{{E(h(i))}}{{c(S)}}}} \quad (2) $

    where S is the training sample size, E(h(i)) is the average path length of sample i, and c(S) is the expectation of the average path length. The value range of the anomaly score f(i, S) is (0, 1]. When f(i, S) is close to 1, sample i is most probably an anomalous sample; when f(i, S) is smaller than 0.5, the sample is most probably a normal (background) sample.
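
    The normalization and the anomaly score can be checked numerically with a short sketch. It assumes the formulas of Liu et al. (2008) as described above, restricted to S > 2; the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant, as quoted in the text

def expected_path_length(s):
    """c(S): expectation of the average path length for S training samples (S > 2)."""
    if s <= 2:
        raise ValueError("this sketch only covers S > 2")
    return 2.0 * math.log(s - 1) + 2.0 * EULER_GAMMA - 2.0 * (s - 1) / s

def anomaly_score(avg_path_length, s):
    """f(i, S) = 2 ** (-E(h(i)) / c(S)); values near 1 indicate anomalies."""
    return 2.0 ** (-avg_path_length / expected_path_length(s))

c256 = expected_path_length(256)       # about 10.24 for the default S = 256
print(round(c256, 4))
print(anomaly_score(c256, 256))        # a sample with an average path scores 0.5
print(anomaly_score(2.0, 256))         # a quickly isolated sample scores higher
```

    A sample whose average path length equals c(S) receives a score of exactly 0.5, which is why 0.5 acts as the dividing value between background-like and anomaly-like scores.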

  • The elliptic envelope method is an extension of the iterative mean ± 2σ statistical method (Gałuszka, 2007) to multi-dimensional space. If the background data come from a multivariate Gaussian distribution, the covariance matrix of the multivariate geochemical data can be represented by an elliptic envelope in the observation space, and the samples that are sufficiently far away from the elliptic envelope are identified as anomalous samples.

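    For Gaussian-distributed multivariate geochemical data, the distance of each sample xi (i=1, 2, ..., n) to the mode of the distribution can be calculated by

    $ d\left({{\boldsymbol{x}_i}} \right) = \sqrt {{{\left({{\boldsymbol{x}_i} - \boldsymbol{\mu }} \right)}^{\rm{T}}}{\boldsymbol{\Sigma }^{ - 1}}\left({{\boldsymbol{x}_i} - \boldsymbol{\mu }} \right)} \quad (3) $

    where μ and Σ are the location and the covariance of the Gaussian distribution.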

    The parameters μ and Σ can be estimated from the geochemical data. The minimum covariance determinant (MCD) estimator (Rousseeuw, 1984) can be used to estimate the covariance matrix of highly contaminated datasets containing up to (n-m-1)/2 anomalous samples, where the integers n and m are the number of samples and the number of variables, respectively. Rousseeuw and van Driessen (1999) proposed a fast algorithm for the MCD estimator. In geochemical anomaly detection, the fast algorithm can be used to calculate standard estimates of the location and covariance of multivariate geochemical data. The Mahalanobis distances obtained from these estimates are used to measure the anomaly intensity of each sample.
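
    The distance itself is straightforward to compute once μ and Σ are available. The sketch below is an illustration for two variables only and, as a loud simplification, uses the classical sample mean and covariance rather than the robust MCD estimates described above; the function names are hypothetical.

```python
import math

def mean_and_cov_2d(points):
    """Classical (non-robust) location and covariance of 2-D points.

    Returns ((mx, my), (sxx, sxy, syy)). An MCD estimator would instead
    compute these on a subset chosen to minimize the covariance determinant.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_2d(p, mu, cov):
    """Mahalanobis distance of point p from mu for a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = p[0] - mu[0], p[1] - mu[1]
    # Uses the closed-form inverse of [[sxx, sxy], [sxy, syy]].
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

print(mahalanobis_2d((3.0, 4.0), (0.0, 0.0), (1.0, 0.0, 1.0)))  # identity covariance
print(mahalanobis_2d((2.0, 0.0), (0.0, 0.0), (4.0, 0.0, 1.0)))  # variance 4 along x
```

    With an identity covariance the distance reduces to the Euclidean distance, while a larger variance along one axis shrinks distances measured along that axis.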

  • The bat algorithm is a simulated iterative process in which virtual bats search for prey and avoid obstacles. By dynamically controlling the switch between global search and local search, the algorithm avoids becoming trapped in a local optimum and performs well in optimizing objective functions (Yang, 2010). The simulation process is controlled by a set of predefined parameters, including the bat population size (L), maximum number of iterations (Im), convergence speed parameters (α and γ), loudness parameter (A), emission rate parameter (r), and detection range parameters (λ and f). In application, the default values can be used for all parameters except for L and Im. According to Yang (2010), parameters α and γ can simply be set to 0.9, and parameters A, r and f can be defined as values between 0 and 1 (Yang and Gandomi, 2012). Amin=0 means that a bat temporarily stops making sounds when it finds prey, and rmin=0 and rmax=1 represent no pulse and the maximum pulse rate, respectively. λmax represents the detectable range, which can be adjusted by changing f because λ×f is constant (Yang, 2010).

    The bat algorithm can be used to optimize an objective function of a group of independent variables. Assume that the objective function depends on m independent variables, each of which has a certain value range. According to the value range of each independent variable, an m-dimensional search space can be defined for the bat population. In the search space, a position z is represented by an m-dimensional vector, which can be used to calculate the value of the objective function. The L virtual bats randomly occupy L positions in the search space at the beginning and then start their iterative search. In each iteration, the bat population alternately completes the global and local search processes.

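    During the global search process of iteration t (0 ≤ t < Im), the L vectors z1, z2, ..., zL are used to calculate the value of the objective function, and the vector z* corresponding to the maximum (or minimum) value of the objective function is identified as the current best position. The frequency $ f_l $, flying speed $ \boldsymbol{v}_l^t $ and position $ \boldsymbol{z}_l^t $ of each bat l (1 ≤ l ≤ L) are then updated as follows (Yang, 2010)

    $ {f_l} = {f_{\min }} + \left({{f_{\max }} - {f_{\min }}} \right)\beta \quad (4) $

    $ \boldsymbol{v}_l^t = \boldsymbol{v}_l^{t - 1} + \left({\boldsymbol{z}_l^{t - 1} - {\boldsymbol{z}^*}} \right){f_l} \quad (5) $

    $ \boldsymbol{z}_l^t = \boldsymbol{z}_l^{t - 1} + \boldsymbol{v}_l^t \quad (6) $

    where β is a uniformly distributed random number within the range [0, 1], fmin and fmax are the minimum and maximum frequencies, z* is the current best position, and t is the iteration index.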

    During the local search process of iteration t (0 ≤ t < Im), a local search is conducted around the current best location. The following stochastic equation is used to generate a new location, which is then tested to determine whether it is better than all the current best positions (Yang, 2010)

    $ {\boldsymbol{z}_{{\rm{new}}}} = {\boldsymbol{z}^*} + \boldsymbol{\varepsilon }\left\langle {{A^t}} \right\rangle \quad (7) $

    where ε is an m-dimensional random vector with each component between -1 and 1, and $ \left\langle {{A^t}} \right\rangle $ is the average loudness of the bat population in iteration t.

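    After each iteration t, the loudness Al and the pulse emission rate rl of each bat l need to be updated as follows

    $ A_l^{t + 1} = \alpha A_l^t \quad (8) $

    $ r_l^{t + 1} = r_l^0\left[{1 - \exp \left({ - \gamma t} \right)} \right] \quad (9) $

    where $ r_l^0 $ is the initial pulse emission rate of bat l, and α and γ are constants, usually set to 0.9 (Yang and Gandomi, 2012).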
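
    To make the interplay of global search, local search, loudness and pulse rate concrete, the following sketch maximizes a toy objective with a basic bat loop. It follows the general scheme of Yang (2010) under several assumptions that are not taken from the paper: positions are clipped to the search bounds, loudness and pulse rate are updated only when a bat accepts a new position, and all names and default values are hypothetical.

```python
import math
import random

def bat_maximize(objective, bounds, n_bats=20, n_iter=100,
                 f_min=0.0, f_max=1.0, alpha=0.9, gamma=0.9, seed=0):
    """Maximize `objective` over the box `bounds` with a basic bat algorithm."""
    rng = random.Random(seed)
    dim = len(bounds)

    def clip(z):
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(z, bounds)]

    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    loud = [1.0] * n_bats                          # loudness A_l
    rate0 = [rng.random() for _ in range(n_bats)]  # initial pulse rates r_l^0
    rate = [0.0] * n_bats                          # current pulse rates r_l
    fit = [objective(z) for z in pos]
    b = max(range(n_bats), key=fit.__getitem__)
    best, best_fit = pos[b][:], fit[b]

    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for l in range(n_bats):
            # Global search: frequency, velocity and position updates.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[l] = [v + (z - zb) * freq
                      for v, z, zb in zip(vel[l], pos[l], best)]
            cand = clip([z + v for z, v in zip(pos[l], vel[l])])
            # Local search: random walk around the current best position.
            if rng.random() > rate[l]:
                cand = clip([zb + rng.uniform(-1.0, 1.0) * mean_loud
                             for zb in best])
            cand_fit = objective(cand)
            # Accept the candidate with a probability tied to the loudness.
            if cand_fit >= fit[l] and rng.random() < loud[l]:
                pos[l], fit[l] = cand, cand_fit
                loud[l] *= alpha
                rate[l] = rate0[l] * (1.0 - math.exp(-gamma * t))
            if cand_fit > best_fit:
                best, best_fit = cand[:], cand_fit
    return best, best_fit

# Toy objective with its maximum (value 0) at (0.3, 0.3).
best, best_fit = bat_maximize(lambda z: -sum((v - 0.3) ** 2 for v in z),
                              [(0.0, 1.0), (0.0, 1.0)])
print(best, best_fit)
```

    In the setting of this paper, the toy objective would be replaced by a function that trains an anomaly detection model with the parameters encoded in the position and returns its AUC value.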

  • As an overall performance measure of a machine learning model in geochemical anomaly detection, the AUC value of the model can be used as the objective function value of the bat algorithm, which depends on the initialization parameter values of the model. The bat algorithm can then be used to automatically optimize the parameters by maximizing the AUC value of the model. In this paper, the bat algorithm was used to optimize the initialization parameters of isolation forest and elliptic envelope in geochemical anomaly detection.

    In order to establish an isolation forest model for geochemical anomaly detection, three key parameters, namely the size of the isolation forest, the number of training samples and the anomaly fraction, must be defined properly. According to the value ranges of these three parameters, a three-dimensional search space is defined for the bat population. The bat algorithm is then used to search iteratively for the values of these three parameters that maximize the AUC value of the isolation forest model. In each iteration, the parameter values determined by the spatial position occupied by each bat l (1 ≤ l ≤ L) are used to initialize the isolation forest model. The isolation forest model is then used to detect geochemical anomalies, and its AUC value is calculated based on the anomaly scores generated by the model. The spatial location corresponding to the largest AUC value is identified as the current best location. Finally, a stochastic equation is used to generate a new location from the current best location, and the new location is tested. As the number of iterations increases, the best AUC value of the isolation forest model increases and tends to converge toward its maximum.

    Among the above three parameters, the size of the isolation forest and the number of training samples must be initialized with integer values greater than zero. However, the search space of the bat population is a multi-dimensional continuous space, and each coordinate of a spatial position is a real number. Therefore, the first two coordinates of a spatial position must be transformed into integers as follows

    $ T = {\rm{int}}\left({{\rm{co}}{{\rm{r}}_1}} \right) \quad (10) $

    $ S = {\rm{int}}\left({{\rm{co}}{{\rm{r}}_2}} \right) \quad (11) $

    where cor1 and cor2 are the first two coordinates of a spatial position, and int() is a function that converts a real number to the nearest integer. The T in Eq.10 and S in Eq.11 can be used to initialize the isolation forest model.
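
    The conversion from a continuous bat position to model parameters can be illustrated as follows. This is a sketch only: the function name, the clipping step and the example bounds are assumptions, not values reported in the paper. Note that Python's built-in `round` rounds halves to the nearest even integer, so `int(v + 0.5)` is used to round positive values half up.

```python
def decode_isolation_forest_position(position, bounds):
    """Map a position (cor1, cor2, cor3) to (T, S, anomaly fraction)."""
    clipped = [min(max(v, lo), hi) for v, (lo, hi) in zip(position, bounds)]
    n_trees = max(1, int(clipped[0] + 0.5))     # T: nearest integer, at least 1
    n_samples = max(1, int(clipped[1] + 0.5))   # S: nearest integer, at least 1
    anomaly_fraction = clipped[2]               # stays a real number
    return n_trees, n_samples, anomaly_fraction

# Hypothetical search ranges for T, S and the anomaly fraction.
example_bounds = [(10, 500), (32, 1024), (0.01, 0.5)]
print(decode_isolation_forest_position((99.6, 255.4, 0.07), example_bounds))
```

    The same idea applies to the elliptic envelope model, whose two parameters are both real-valued fractions and therefore need no integer conversion.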

    A similar strategy can be used to automatically optimize the parameters of the elliptic envelope model in geochemical anomaly detection. According to the value range of support-fraction and contamination-fraction, a two-dimensional search space can be defined for the bat population, and then the bat algorithm can be used to automatically optimize the parameter values of the elliptic envelope model in geochemical anomaly detection.

  • ROC curve analysis is an effective tool for evaluating the performance of a geochemical anomaly detection model (Chen and Wu, 2019a, 2017a, b, 2016; Chen, 2015). Based on the grid geochemical data and the mineral deposits found in the study area, true positive points and true negative points can be defined and the ROC curve analysis can be performed (Chen and Wu, 2019a). The ROC curve shows how the benefit (true positive rate) changes with the cost (false positive rate) under different threshold settings (Chen and Wu, 2017). The better the model performs in geochemical anomaly detection, the closer its ROC curve is to the upper left corner of the ROC space.

    The AUC value is a quantitative expression of the spatial relationship between the continuous explanatory variable and the binary target variable. Its value ranges from 0.5 to 1, corresponding respectively to random and deterministic spatial relationships between the explanatory and target variables. Assume that there are tp true positive points and tn true negative points in the study area. The AUC value of an anomaly detection model can be estimated by

    $ {\rm{AUC}} = \frac{1}{{tp \times tn}}\sum\limits_{i = 1}^{tp} {\sum\limits_{j = 1}^{tn} {\varphi \left({f\left({{\boldsymbol{x}_i}} \right), f\left({{\boldsymbol{y}_j}} \right)} \right)} } \quad (12) $

    with $ \varphi \left({f\left({{\boldsymbol{x}_i}} \right), f\left({{\boldsymbol{y}_j}} \right)} \right) = \left\{ \begin{array}{l} 1, \;\;\;\;\;\;f\left({{\boldsymbol{x}_i}} \right) > f\left({{\boldsymbol{y}_j}} \right)\\ 0.5, \;\;\;\;\;f\left({{\boldsymbol{x}_i}} \right) = f\left({{\boldsymbol{y}_j}} \right)\\ 0, \;\;\;\;\;f\left({{\boldsymbol{x}_i}} \right) < f\left({{\boldsymbol{y}_j}} \right) \end{array} \right. $

    where tp and tn are respectively the number of true positive points and the number of true negative points, f(xi) (i=1, 2, ..., tp) represents the anomaly score at the ith true positive point, and f(yj) (j=1, 2, ..., tn) represents the anomaly score at the jth true negative point.
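
    The pairwise definition of the AUC translates directly into code. The sketch below compares every true positive score with every true negative score, exactly as the function φ prescribes; the function name and the example scores are hypothetical.

```python
def auc_pairwise(pos_scores, neg_scores):
    """AUC as the mean of phi over all (true positive, true negative) pairs."""
    total = 0.0
    for fx in pos_scores:
        for fy in neg_scores:
            if fx > fy:
                total += 1.0      # positive point ranked above negative point
            elif fx == fy:
                total += 0.5      # tie counts as half
    return total / (len(pos_scores) * len(neg_scores))

# Two true positive points and three true negative points.
print(auc_pairwise([0.9, 0.8], [0.1, 0.8, 0.3]))
```

    The double loop costs tp × tn comparisons, which is acceptable here because the number of true positive points is small (14 in this study).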

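    Based on the AUC value in Eq.12, the corresponding ZAUC value can be calculated as follows

    $ {Z_{{\rm{AUC}}}} = \frac{{{\rm{AUC}} - 0.5}}{{{S_{{\rm{AUC}}}}}} \quad (13) $

    where SAUC denotes the standard deviation of AUC, which can be expressed as

    $ {S_{{\rm{AUC}}}} = \sqrt {\frac{{{\rm{AUC}}\left({1 - {\rm{AUC}}} \right) + \left({tp - 1} \right)\left({\frac{{{\rm{AUC}}}}{{2 - {\rm{AUC}}}} - {\rm{AU}}{{\rm{C}}^2}} \right) + \left({tn - 1} \right)\left({\frac{{2{\rm{AU}}{{\rm{C}}^2}}}{{1 + {\rm{AUC}}}} - {\rm{AU}}{{\rm{C}}^2}} \right)}}{{tp \times tn}}} \quad (14) $

    ZAUC is a random variable that follows the standard normal distribution, because it is a function of the random variable AUC (Chen and Wu, 2019a; Chen, 2015).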

    The Youden index is a measure of the spatial relationship between a binary explanatory variable and a binary target variable. Its value is between -1 and +1, respectively representing the deterministic negative and deterministic positive spatial relationships. When the Youden index is zero, it means that there is no spatial relationship between the binary explanatory variable and the binary target variable. In geochemical anomaly detection, the Youden index can be used to determine the optimal threshold for separating geochemical anomalies from the background (Chen and Wu, 2019b, 2017c; Chen et al., 2014b).
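
    Threshold selection with the Youden index can be sketched as below. The sketch assumes the common definition of the index as the true positive rate minus the false positive rate (equivalently, sensitivity plus specificity minus one) and treats a point as anomalous when its score is at or above the threshold; the function name and example scores are hypothetical.

```python
def youden_threshold(pos_scores, neg_scores):
    """Return (threshold, J) maximizing the Youden index J = TPR - FPR."""
    best_j, best_thr = -1.0, None
    # Every distinct score is a candidate threshold.
    for thr in sorted(set(pos_scores) | set(neg_scores)):
        tpr = sum(s >= thr for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= thr for s in neg_scores) / len(neg_scores)
        j = tpr - fpr
        if j > best_j:
            best_j, best_thr = j, thr
    return best_thr, best_j

thr, j = youden_threshold([0.9, 0.8, 0.4], [0.1, 0.2, 0.3, 0.5])
print(thr, j)
```

    In this toy example, the selected threshold keeps all three true positive points above it while admitting only one of the four true negative points.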

  • The study area is located at the intersection of the Paleo-Asian tectonic domain and the Circum-Pacific tectonic domain (Fig. 1a). It is a superimposed composite tectonic zone that experienced the evolution of the Paleo-Asian Ocean and the subduction of the Mesozoic Pacific Plate (Yu et al., 2012; Wu et al., 2005; Zhang et al., 2004). The regional northwest-trending Gudonghe fault runs through the whole study area and controls the spatial distribution of the major geological formations (Figs. 1a, 1b). The exposed strata occupy 29.44% of the study area and mainly include the Late Archean Ji'nan Formation, the Late Permian Xindongchun and Changren Formations, the Early Cretaceous Changchai, Quanshuichun, and Dalazi Formations, the Late Cretaceous Longjing Formation, the Neogene Chuandishan basalt, and the Holocene alluvium. The widely distributed magmatic rocks are mainly composed of granite, granodiorite, diorite, gabbro, etc., forming widely exposed batholiths and stocks (Fig. 1b). Zircon U-Pb ages of the diorites are 173-175 Ma (Wu et al., 2013), indicating that the magmatic rocks were formed by tectono-magmatic activities in the Yanshanian period.

    Figure 1.  Tectonic location map (a) and the geological map (b) of the study area (modified from Chai and Liu, 2015).

    The Yanshanian tectono-magmatic activities provided a continuous heat source and metallogenic materials for polymetallic mineralization (Yan et al., 2015). Fourteen mineral occurrences have been discovered in the study area (Fig. 2). These mineral occurrences are mainly hydrothermal deposits, along with a few hydrothermal skarn deposits, which are closely related to multi-stage magmatic activities (Pan et al., 2016; Yan et al., 2015; Wan et al., 2010). The main metallogenic elements include Au, Cu, Co and Mo. Most of the discovered mineral occurrences are distributed in metamorphic rocks around or at the edge of the magmatic intrusions. Regional deep faults, Archean formations, and the Yanshanian magmatic rocks are the main controlling factors for polymetallic mineralization.

    Figure 2.  Stream sediment sampling locations and known mineral occurrences in the study area.

  • The stream sediment survey recently completed in the study area followed the Geochemical Survey Criteria (No. DZ/T0011-91), a geological industry standard of China; the document can be downloaded at https://wenku.baidu.com/view/17ad51c189eb172ded63b78b.html. A total of 6 999 stream sediment samples were collected from the study area, covering 1 320 km² at a sampling density of 1-2 samples per 0.25 km² (Fig. 2). The Inner Mongolia Mineral Experiment Institute of China analyzed the concentrations of 13 elements in each stream sediment sample. Atomic absorption spectrometry (AAN) was used to determine the concentration of Au, atomic fluorescence spectrophotometry (AFS) was used for Hg and As, and inductively coupled plasma mass spectrometry (ICP-MS) was used for Ag, Sb, Mo, W, Cu, Pb, Zn, Bi, Ni, and Co.

    Mineral occurrence data come from the digital geological survey in the study area (Liu and Zhang, 1999). The mineral occurrence map was saved as a MapInfo Interchange File (*.mif) and used as the base map to generate a unit cell map composed of 200×137 unit cells. Each cell covers 0.228 2×0.229 6 km² and contains no more than one discovered mineral occurrence. Because the inverse-distance-to-a-power interpolation method smooths the data only slightly, it was used to convert the concentrations of the 6 999 samples for each element into 200×137 grid data. Each grid point in a grid map coincides with the center of the corresponding cell in the unit cell map.
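
    The gridding step can be illustrated with a minimal Python sketch of inverse-distance-to-a-power interpolation evaluated at cell centers. The power of 2, the absence of a search radius, the grid origin, and the toy sample coordinates and concentrations are all assumptions made for illustration; the study does not state these settings in this section.

```python
import math

def idw_interpolate(samples, query, power=2.0):
    """Inverse-distance-to-a-power estimate at `query`.

    samples: list of ((x, y), value) pairs; query: (x, y).
    """
    num = 0.0
    den = 0.0
    for (x, y), value in samples:
        d = math.hypot(x - query[0], y - query[1])
        if d == 0.0:
            return value  # the query point coincides with a sample
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

def grid_centers(x0, y0, dx, dy, ncols, nrows):
    """Centers of an ncols x nrows unit-cell map with lower-left origin (x0, y0)."""
    return [(x0 + (c + 0.5) * dx, y0 + (r + 0.5) * dy)
            for r in range(nrows) for c in range(ncols)]

# Toy data: hypothetical coordinates (km) and concentrations.
samples = [((0.0, 0.0), 10.0), ((1.0, 0.0), 20.0), ((0.0, 1.0), 30.0)]
centers = grid_centers(0.0, 0.0, 0.2282, 0.2296, ncols=4, nrows=4)
grid = [idw_interpolate(samples, c) for c in centers]
```

    Each interpolated value is a weighted average of the sample values, so it always lies between the smallest and largest sample concentration.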

    A grid point was defined as a true positive point if the corresponding unit cell contains a mineral occurrence. A total of 14 true positive points were defined in the study area. All other grid points were defined as true negative points. It should be noted that the points defined as true negative most likely include some undiscovered true positive points. However, a few mislabeled points cannot significantly affect the ROC curve analysis results. In addition, because the ROC curve is an effective tool for handling imbalanced (non-equilibrium) data, the imbalance between the two classes does not affect the ROC curve analysis results.

  • The grid data of each element were used to predict the mineral potential of the study area. The higher the element concentration of a grid point, the more likely the corresponding cell contains a mineral occurrence. Based on the true positive points and true negative points defined in Section 2.2 and the grid data of each element, the AUC value of the element was calculated using Eq. 12 to evaluate the effectiveness of the element concentration for the prediction of mineral potential.

    Based on the AUC value of each element, the corresponding standard normal distribution statistic, ZAUC, can be calculated using Eq. 13 to test whether the AUC value of the element is significantly different from 0.5. The AUC value is significantly different from 0.5 only if the value of ZAUC is greater than the critical value 1.96 at the significance level α=0.05 (Chen and Wu, 2019a, b, 2017c, 2016).
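
    As a hedged illustration of this test, the sketch below computes the AUC as a rank statistic and a Z statistic of the form (AUC-0.5)/s, where s is the Hanley-McNeil standard error. Eqs. 12 and 13 are not reproduced in this section, so the exact form of the standard error is an assumption, as are the function names.

```python
import math

def auc_rank(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos) * len(neg))

def z_auc(auc, n_pos, n_neg):
    """Z statistic for H0: AUC = 0.5, with the Hanley-McNeil standard error (assumed form)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return (auc - 0.5) / math.sqrt(var)

# 14 true positive points; 26 786 negatives assumes 26 800 grid points in total.
z_cu = z_auc(0.7123, 14, 26786)
significant = z_cu > 1.96
```

    With these assumed counts, the statistic for an AUC of 0.712 3 comes out at approximately 2.72, above the critical value 1.96.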

    Table 1 lists the estimated AUCs and ZAUCs for the 13 elements. Table 1 shows that the concentrations of Au, Co, Cu, Mo, Ni and W can effectively predict the mineral potential of the study area, as the ZAUCs of these elements are greater than the critical value 1.96. Among these six elements, the first five are metallogenic elements, and the last one is an element associated with ore formation. Therefore, these statistical results are strongly consistent with the geological characteristics of the study area.

    Element   AUC       ZAUC       Element   AUC        ZAUC      Element   AUC       ZAUC
    Ag        0.515 9   0.204 6    Cu        0.712 3    2.720 8   Sb        0.580 2   1.003 0
    As        0.610 1   1.371 4    Hg        0.540 9    0.517 7   W         0.661 5   2.023 6
    Au        0.681 6   2.290 2    Mo        0.718 1    2.805 6   Zn        0.638 2   1.723 5
    Bi        0.630 7   1.629 2    Ni        0.763 1    3.519 2
    Co        0.687 5   2.370 2    Pb        0.393 5   -1.534 4

    Table 1.  AUCs and ZAUCs for 13 elements

  • Liu et al. (2008) showed that when S=256 and T=100, the isolation forest provided enough detail to perform anomaly detection across a wide range of data; Wu and Chen (2018) illustrated that when S=512 and T=100, the isolation forest had the best performance in geochemical anomaly detection; and Chen and Wu (2019a) showed that when S=256 and T=150, the isolation forest performed best in mineral prospectivity mapping. Referring to these results, we selected S=256 and T=100 as the default values of the training sample size and the isolation forest size, respectively. According to the geological characteristics of the study area, U=0.4 was selected as the default value of the anomaly fraction. Using the method discussed in Section 1.1, an isolation forest model was established based on the geochemical data prepared in Section 2.2. The anomaly score of each grid point was calculated using Eq. 3.
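
    For orientation, the following sketch evaluates the anomaly score in the form given by Liu et al. (2008), s = 2^(-E[h]/c(S)), where E[h] is the mean path length of a point over the T isolation trees and c(S) is a normalizing constant for subsample size S. Eq. 3 is not reproduced in this section, so it is an assumption that it takes exactly this form; the function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, sample_size):
    """Score in (0, 1]; shorter mean path lengths give scores closer to 1."""
    return 2.0 ** (-mean_path_length / c_factor(sample_size))

# With the default S=256, a point whose mean path length equals c(256) scores 0.5.
baseline = anomaly_score(c_factor(256), 256)
```

    Points that are isolated after only a few splits receive scores well above 0.5, which is why a threshold on the score can separate anomalies from background.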

    The bat algorithm was used to automatically determine the optimal values of S, T, and U to improve the performance of the isolation forest model in geochemical anomaly detection. The bat algorithm requires eight parameters to be predefined: L (bat population size), Im (the number of iterations), fmin, fmax, Amin, Amax, α, and γ. According to Yang and Gandomi (2012), the parameters fmin, fmax, Amin and Amax can be defined as fmin=0, fmax=1, Amin=0, and Amax=1. The other four parameters need to be defined according to the optimization problem to be solved by the bat algorithm.

    In this study, the eight parameters were defined as L=50, Im=20, fmin=0, fmax=1, Amin=0, Amax=1, and α=γ=0.9. A three-dimensional search space was empirically defined by 1.0≤S≤1 000.0, 5.0≤T≤1 000.0, and 0.000 01≤U≤0.499 99. Fifty bats started their iterative search process from 50 random locations in the search space. The "best" position found after the 20th iteration was used as the optimal parameter values of the isolation forest model. To reduce the possibility that the iterative search converges to a local maximum, the iterative search process was repeated 5 times, and the largest AUC value, 0.754 3, was taken as the global "maximum" value. The corresponding optimal parameter values were S=6, T=66, and U=0.083 7. The isolation forest model initialized with these parameter values was established and used to calculate the anomaly score of each grid point.
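
    The search described above can be sketched with a basic bat algorithm for maximization. This is a simplified reading of the standard formulation, not the authors' code: the update equations are not reproduced in this section, so the velocity sign convention, the local-walk scale `walk`, the initial pulse rate `r0`, starting every bat at loudness Amax, and the toy objective (which stands in for the AUC of a fitted isolation forest) are all assumptions.

```python
import math
import random

def bat_maximize(objective, bounds, n_bats=50, n_iter=20, f_min=0.0, f_max=1.0,
                 a_init=1.0, alpha=0.9, gamma=0.9, r0=0.5, walk=0.01, seed=0):
    """Maximize `objective` over the box `bounds` with a basic bat algorithm."""
    rng = random.Random(seed)
    dim = len(bounds)

    def clip(x):
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    fit = [objective(p) for p in pos]
    loud = [a_init] * n_bats   # loudness A_i, multiplied by alpha on acceptance
    rate = [0.0] * n_bats      # pulse rate r_i, grows toward r0 at a speed set by gamma
    i_best = max(range(n_bats), key=lambda i: fit[i])
    best, best_fit = pos[i_best][:], fit[i_best]
    history = [best_fit]

    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            freq = f_min + (f_max - f_min) * rng.random()
            # Sign chosen so the velocity pulls toward the best position (assumption).
            vel[i] = [v + (g - x) * freq for v, x, g in zip(vel[i], pos[i], best)]
            cand = [x + v for x, v in zip(pos[i], vel[i])]
            if rng.random() > rate[i]:
                # Local random walk around the best position, scaled per axis.
                cand = [g + rng.uniform(-1.0, 1.0) * mean_loud * walk * (hi - lo)
                        for g, (lo, hi) in zip(best, bounds)]
            cand = clip(cand)
            cand_fit = objective(cand)
            if cand_fit > fit[i] and rng.random() < loud[i]:
                pos[i], fit[i] = cand, cand_fit
                loud[i] *= alpha
                rate[i] = r0 * (1.0 - math.exp(-gamma * t))
            if cand_fit > best_fit:
                best, best_fit = cand[:], cand_fit
        history.append(best_fit)
    return best, best_fit, history

def toy_objective(p):
    """Hypothetical stand-in for the AUC; S and T are rounded to integers."""
    s, t, u = round(p[0]), round(p[1]), p[2]
    return -((s - 256) / 1000.0) ** 2 - ((t - 100) / 1000.0) ** 2 - (u - 0.1) ** 2

bounds = [(1.0, 1000.0), (5.0, 1000.0), (0.00001, 0.49999)]
best, best_fit, history = bat_maximize(toy_objective, bounds)
```

    In a real run, the objective would fit an isolation forest with the decoded (S, T, U) and return its AUC, which is why each evaluation is expensive and the optimized models take far longer than their default-parameter counterparts.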

  • According to Rousseeuw and van Driessen (1999), the default parameter values of SF and CF can be expressed as (n+m+1)/(2×n) and (n-m-1)/(2×n), respectively, where n is the number of samples and m is the number of variables. In this study, n=26 800 and m=6.

    Accordingly, SF=0.500 1 and CF=0.499 9 were used as the two default parameter values. Based on the grid data obtained in Section 2.2, the fast algorithm of the MCD estimator was then used to compute the robust estimates of location μ and covariance Σ. The Mahalanobis distance of each grid point to the multivariate Gaussian distribution was computed using Eq. 4.
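
    As a quick arithmetic check of these defaults, with n=26 800 and m=6 the quoted value SF=0.500 1 corresponds to (n+m+1)/(2×n) and CF=0.499 9 corresponds to (n-m-1)/(2×n). The variable names below are illustrative.

```python
n, m = 26800, 6  # number of samples and number of variables

sf_default = (n + m + 1) / (2 * n)
cf_default = (n - m - 1) / (2 * n)

print(round(sf_default, 4), round(cf_default, 4))  # prints: 0.5001 0.4999
```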

    The bat algorithm was used to automatically determine the optimal values of SF and CF to improve the performance of the fast algorithm of the MCD estimator in geochemical anomaly detection. The size of the bat population was defined as L=20, and the other seven parameters were predefined in the same way as for the bat-optimized isolation forest model in Section 3.1. In this section, a two-dimensional search space was empirically defined by 0.1≤SF≤0.9 and 0.000 01≤CF≤0.499 99. Twenty bats started their iterative search process from 20 random locations in the search space. The "best" position found after the 20th iteration was used as the optimal parameter values of the fast algorithm of the MCD estimator. The corresponding optimal parameter values were SF=0.167 35 and CF=0.499 99. The fast algorithm initialized with these parameter values was used to calculate the robust estimates of location μ and covariance Σ. The Mahalanobis distance of each sample to the multivariate Gaussian distribution was computed using Eq. 4.
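
    Once μ and Σ are available, the distance computation itself is simple. The sketch below evaluates the squared Mahalanobis distance (x-μ)ᵀΣ⁻¹(x-μ) by solving a linear system instead of inverting Σ. It takes μ and Σ as given and does not implement the MCD estimator; whether Eq. 4 uses the squared or unsquared form is not shown in this section, so the squared form is an assumption.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance of x from a distribution with mean mu, covariance sigma."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    y = solve(sigma, d)  # y = sigma^{-1} d
    return sum(di * yi for di, yi in zip(d, y))

# Toy check with an identity covariance: the result is the squared Euclidean distance.
d2 = mahalanobis_sq([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])  # 25.0
```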

  • The anomaly scores generated by the isolation forest model and the Mahalanobis distances generated by the elliptic envelope model were used to extract anomalies from the geochemical data by applying a threshold. The key to this process is to determine the optimal threshold, with which the geochemical anomalies extracted from the geochemical data have the highest spatial correlation with the mineral occurrences found in the study area. The Youden index corresponding to a threshold can be used to measure the spatial relationship between the anomalies extracted by the threshold and the mineral occurrences found in the study area (Chen and Wu, 2019a, b; Wu and Chen, 2018; Chen and Wu, 2017a, b, c, 2016; Chen, 2015). Therefore, the Youden index can be calculated for each candidate threshold between the minimum and maximum of the anomaly score (or Mahalanobis distance), and the threshold corresponding to the maximum Youden index can then be identified as the optimal threshold.

    In this study, the optimal threshold for extraction of geochemical anomalies was selected from 1 000 uniformly distributed thresholds between the minimum and maximum of the anomaly scores (or Mahalanobis distances). The optimal thresholds for the bat-optimized isolation forest model and the default-parameter isolation forest model are 0.424 7 and 0.397 6, respectively. These two thresholds were used to extract geochemical anomalies from the corresponding anomaly score data. Figures 3a and 3b show the geochemical anomalies optimally extracted from the anomaly score data generated by the bat-optimized isolation forest model and by the default-parameter isolation forest model, respectively. The optimal thresholds for the bat-optimized elliptic envelope model and the default-parameter elliptic envelope model are 338.33 and 423.96, respectively. These two thresholds were used to extract geochemical anomalies from the Mahalanobis distance data. Figures 3c and 3d show the geochemical anomalies optimally extracted from the Mahalanobis distance data generated by the bat-optimized elliptic envelope model and by the default-parameter elliptic envelope model, respectively.
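
    The threshold search can be sketched as follows, taking the Youden index as the true positive rate minus the false positive rate and scanning 1 000 evenly spaced thresholds. The convention that a point is anomalous when its score is at or above the threshold, and the toy scores and labels, are assumptions for illustration.

```python
def youden_optimal_threshold(scores, labels, n_thresholds=1000):
    """Return (threshold, Youden index) maximizing TPR - FPR over evenly spaced thresholds."""
    lo, hi = min(scores), max(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = lo, -1.0
    for k in range(n_thresholds):
        t = lo + (hi - lo) * k / (n_thresholds - 1)
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # keep the first threshold reaching the maximum
            best_t, best_j = t, j
    return best_t, best_j

# Toy example: two positives with clearly higher scores than three negatives.
toy_scores = [0.1, 0.2, 0.3, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 1]
threshold, youden = youden_optimal_threshold(toy_scores, toy_labels)
```

    In the toy example any threshold above 0.3 and up to 0.8 separates the two classes perfectly, so the maximum Youden index is 1.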

    Figure 3.  Geochemical anomalies detected by (a) the bat-optimized isolation forest, (b) the default-parameter isolation forest, (c) the bat-optimized elliptic envelope and (d) the default-parameter elliptic envelope.

  • The anomaly score and Mahalanobis distance data generated in Section 3 were used to draw ROC curves and calculate AUCs. Figure 4 shows the four ROC curves of the corresponding bat-optimized models and default-parameter models. The ROC curves of the two bat-optimized models dominate those of the default-parameter counterparts in the ROC space. Therefore, the bat algorithm can improve the performance of the two anomaly detection models.

    Figure 4.  ROC curves of the bat-optimized and default-parameter machine learning models.

    Table 2 lists the performance evaluation statistics including AUCs, ZAUCs, maximum Youden index (MYI), optimal threshold (OT), percentage of anomaly areas (PAA), Benefit, and data modeling time (DMT). The ZAUCs of the anomaly detection models in Table 2 are much higher than the threshold value 1.96. Therefore, the anomaly detection models can effectively predict the mineral occurrences in the study area. The AUCs of the bat-optimized models are greater than the AUCs of the default-parameter counterparts. Therefore, the bat-optimized models perform better than their default-parameter counterparts in geochemical anomaly detection. However, the bat-optimized models take much more time than their default-parameter counterparts.

    Model          AUC       ZAUC      MYI       OT        PAA       Benefit   DMT (s)
    Default IF     0.690 8   2.415 7   0.368 3   0.397 6   0.315 6   0.79      25.02
    Optimized IF   0.754 3   3.370 9   0.401 5   0.424 7   0.220 2   0.79      4 873.16
    Default EE     0.751 9   3.330 2   0.121 1   423.96    0.019 9   0.43      23.90
    Optimized EE   0.760 2   3.469 3   0.323 7   338.33    0.030 7   0.43      8 686.93
    IF. isolation forest; EE. elliptic envelope; MYI. maximum Youden index; OT. optimal threshold; PAA. percentage of anomaly areas; DMT. data modeling time.

    Table 2.  Performance evaluation statistics of the two bat-optimized models and their default-parameter counterparts

    The PAAs and Benefits in Table 2 show that the geochemical anomalies detected by the bat-optimized isolation forest and by the default-parameter isolation forest occupy, respectively, 22.02% and 31.56% of the study area, and both contain 79% of the mineral occurrences discovered in the study area. The geochemical anomalies detected by the bat-optimized elliptic envelope and by the default-parameter elliptic envelope occupy, respectively, 3.07% and 1.99% of the study area, and both contain 43% of the mineral occurrences discovered in the study area. Therefore, the geochemical anomalies detected by the elliptic envelope models occupy a relatively small percentage of the study area (3.07% and 1.99%) and have relatively small benefit values (43%).

    Figures 1-3 reveal that (a) the geochemical anomalies are mainly distributed on the southwestern Gudonghe fault zone; and (b) most mineral occurrences are located inside the geochemical anomalies detected by the isolation forest models but outside those detected by the elliptic envelope models. This indicates that the complex geological setting and evolution history of the study area have produced a complex geochemical background, from which the elliptic envelope model cannot effectively extract geochemical anomalies.

  • The isolation forest and elliptic envelope models are established in different ways. To establish an isolation forest of size T, a total of T training sets are needed, and each training set, composed of S training samples, is randomly selected from the sample population to build one isolation tree. Then, each sample in the sample population is passed through the isolation forest to calculate its anomaly score. An elliptic envelope model is established on the whole data set using the MCD algorithm (Rousseeuw, 1984). Therefore, neither the isolation forest nor the elliptic envelope requires a test set in addition to the training set.

    The bat algorithm has more parameters than the isolation forest and elliptic envelope models. However, the default parameter values of the bat algorithm can guarantee that the optimization procedure finds the 'best' parameter values for the isolation forest and elliptic envelope models. Even if the 'best' values found in the optimization process are not the true optimum, they can still improve the performance of the two anomaly detection models. To test the influence of the parameter values of the bat algorithm on the performance of the isolation forest model, the following experiments were carried out.

    To assess the impact of L on the data modeling results, Im=20 and the remaining six parameter values used in Section 3.1 were kept unchanged, while L was increased from 10 to 50 in steps of 10. Figures 5a-5d show the curves of AUC, PAA, LYI and DMT as functions of L. To evaluate the impact of Im on the data modeling results, the parameter Im was varied in the same way as L. Figures 5e-5h show the curves of AUC, PAA, LYI and DMT as functions of Im. Figure 5 reveals that changes in the values of L and Im have no significant impact on data modeling performance.

    Figure 5.  Curves of performance evaluation indices varying with L and Im given fmin=0, fmax=1, Amin=0, α= γ=0.9: (a) AUC. L curve; (b) PAA. L curve; (c) LYI. L curve; (d) DMT. L curve; (e) AUC. Im curve; (f) PAA. Im curve, (g) LYI. Im curve, and (h) DMT. Im curve.

    To test the influence of α on the data modeling performance, α was increased from 0.1 to 0.9 in steps of 0.2, while the remaining seven parameter values used in Section 3.1 were kept unchanged. Figures 6a-6d show the curves of AUC, PAA, LYI and DMT as functions of α. The parameter γ was varied in the same way as α to test the influence of γ on the data modeling performance. Figures 6e-6h show the curves of AUC, PAA, LYI and DMT as functions of γ. Figure 6 indicates that changes in the γ value between 0.1 and 0.9 have no significant impact on data modeling performance, while increasing the α value can speed up the convergence of the bat algorithm.

    Figure 6.  Curves of performance evaluation indices varying with α and γ given L=50, Im=20, fmin=0, fmax=1, Amin=0: (a) AUC. α curve, (b) PAA. α curve, (c) LYI. α curve, (d) DMT. α curve, (e) AUC. γ curve, (f) PAA. γ curve, (g) LYI. γ curve, and (h) DMT. γ curve.

    The parameters T and S of the isolation forest are obtained by rounding spatial coordinates using Eqs. 10 and 11. The rounding process may affect the performance of the isolation forest model. Therefore, the same data modeling process was repeated five times. Table 3 lists the performance evaluation statistics of the bat-optimized isolation forest in the 5 repetitions. Table 3 reveals that the data modeling results of the 5 repetitions differ. Therefore, in applications, it is better to repeat the same data modeling process several times and identify geochemical anomalies using the modeling result with the maximum converged AUC value.
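
    The selection rule in the last sentence amounts to keeping the repetition with the largest AUC. A minimal sketch using the AUC and DMT values reported in Table 3 (the tuple layout and variable names are illustrative):

```python
# (repetition, AUC, data modeling time in seconds), copied from Table 3
runs = [
    (1, 0.7383, 2998.20),
    (2, 0.7333, 3017.35),
    (3, 0.7543, 4873.16),
    (4, 0.7427, 1702.99),
    (5, 0.7273, 2388.53),
]

# Keep the repetition whose converged AUC is largest.
best_rep, best_auc, _ = max(runs, key=lambda r: r[1])

# The price of repeating the search is the summed modeling time.
total_time = sum(r[2] for r in runs)
```

    Here the third repetition is selected, with an AUC of 0.754 3, which matches the optimized isolation forest row in Table 2.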

    Repetition   AUC       ZAUC      MYI       OT        PAA       Benefit   DMT (s)
    1            0.738 3   3.111 8   0.465 8   0.420 6   0.226 8   0.79      2 998.20
    2            0.733 3   3.034 6   0.401 1   0.394 7   0.285 7   0.79      3 017.35
    3            0.754 3   3.370 9   0.401 5   0.424 7   0.220 2   0.79      4 873.16
    4            0.742 7   3.182 2   0.413 1   0.377 4   0.339 9   0.79      1 702.99
    5            0.727 3   2.943 1   0.381 8   0.368 8   0.368 4   0.79      2 388.53
    MYI. maximum Youden index; OT. optimal threshold; PAA. percentage of anomaly areas; DMT. data modeling time.

    Table 3.  Performance evaluation statistics of the bat-optimized isolation forest in the 5 repetitions of data modeling

    In terms of ROC curve and AUC, the bat-optimized elliptic envelope model performs better than the default-parameter elliptic envelope in geochemical anomaly detection. However, these two elliptic envelope models do not perform as well as the isolation forest model, because the geochemical anomalies detected by the elliptic envelope models have relatively small benefit values. This may be because the elliptic envelope cannot express the "complex background shape". Therefore, the assumption that the geochemical background population conforms to multivariate Gaussian distribution is unreasonable.

    Zheng (2019) used four machine learning methods to extract geochemical anomalies from the same geochemical data in the study area. The results show that the isolation forest, OCSVM and CRBM models perform well, while the local outlier factor (LOF) (Breunig et al., 2000) model performs poorly in geochemical anomaly detection, indicating that the background population of the study area is complex and does not obey multivariate Gaussian distribution, so it is not suitable for the LOF model.

  • Combining the isolation forest and elliptic envelope with the bat algorithm, the bat-optimized anomaly detection models were established, and used to detect multivariate anomalies from the stream sediment survey data collected from the Helong district, Jilin Province, China. The results show that the bat algorithm can improve the performance of the two models in geochemical anomaly detection by optimizing the model parameters.

    The geochemical anomalies detected by the bat-optimized isolation forest and by the default-parameter isolation forest occupy a small percentage of the study area but contain most of the mineral occurrences discovered in the study area, indicating that the results of geochemical anomaly detection are strongly consistent with the mineralization characteristics of the study area.

    The geochemical anomalies detected by the bat-optimized elliptic envelope and by the default-parameter elliptic envelope have relatively small benefit values. This indicates that the elliptic envelope models cannot express the 'complex background shape' of the study area, and that it is unreasonable to assume that the geochemical background population in the study area obeys the multivariate Gaussian distribution.

  • This work was supported by the National Natural Science Foundation of China (Nos. 41672322, 41872244). The authors are grateful to Professor Yongqing Chen for inviting us to write a paper in honor of Professor Pengda Zhao's 90th birthday. The authors are grateful to Professor Renguang Zuo for encouraging us to finish this paper. The final publication is available at Springer via https://doi.org/10.1007/s12583-021-1402-6.
