Citation: Hang He, Chao Ma, Shan Ye, Wenqiang Tang, Yuxuan Zhou, Zhen Yu, Jiaxin Yi, Li Hou, Mingcai Hou. Low Resource Chinese Geological Text Named Entity Recognition Based on Prompt Learning. Journal of Earth Science, 2024, 35(3): 1035-1043. doi: 10.1007/s12583-023-1944-8
Geological reports are an important product of geological investigations and scientific research, and they contain rich data and textual information. With the rapid development of science and technology, a large number of textual reports have accumulated in the field of geology. However, mainstream geoscience databases used for geological information mining neglect many less popular topics and non-English-speaking regions, which makes it harder for some researchers to extract the information they need from these texts. Natural Language Processing (NLP) is well suited to processing large amounts of textual data. The objective of this paper is to identify geological named entities in Chinese geological texts using NLP techniques. We propose the RoBERTa-Prompt-Tuning-NER method, which builds on Prompt Learning and requires only a small amount of annotated data to train models that recognize geological named entities well in low-resource dataset configurations. The RoBERTa layer captures context-based information and longer-distance dependencies through dynamic word vectors. We conducted experiments on the constructed Geological Named Entity Recognition (GNER) dataset. The experimental results show that the proposed model achieves an F1 score of 80.64%, the highest among the compared methods, which include four baseline algorithms, demonstrating the reliability and robustness of the model for Named Entity Recognition in geological texts.
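As a rough illustration of the two ideas the abstract relies on, the sketch below shows one common way to frame NER as a prompt (cloze) task and how an entity-level F1 score is computed. It is a minimal sketch under stated assumptions, not the paper's implementation: the template wording, the `[MASK]` token, the entity labels, and the function names `candidate_spans`, `build_prompt`, and `entity_f1` are all hypothetical, and no language model is called.

```python
from typing import List, Set, Tuple

# (start, end_exclusive, label) over character positions; labels are illustrative.
Entity = Tuple[int, int, str]

MASK = "[MASK]"  # assumed mask token, as used by BERT-style vocabularies


def candidate_spans(chars: List[str], max_len: int) -> List[Tuple[int, int]]:
    """Enumerate every span of 1..max_len characters as a candidate entity."""
    return [
        (start, start + length)
        for length in range(1, max_len + 1)
        for start in range(len(chars) - length + 1)
    ]


def build_prompt(chars: List[str], start: int, end: int) -> str:
    """Append a cloze template (hypothetical wording) for one candidate span."""
    sentence = "".join(chars)
    span = "".join(chars[start:end])
    return f"{sentence} {span} is a {MASK} entity."


def entity_f1(gold: Set[Entity], pred: Set[Entity]) -> float:
    """Entity-level F1: a prediction counts only if span and label both match."""
    true_pos = len(gold & pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    chars = list("峨眉山玄武岩")
    # A masked language model would be asked to fill MASK with a label word.
    print(build_prompt(chars, 3, 6))
    gold = {(0, 3, "LOC"), (3, 6, "ROCK")}
    pred = {(0, 3, "LOC"), (4, 6, "ROCK")}
    print(entity_f1(gold, pred))
```

In a prompt-based setup of this kind, the pre-trained model scores label words at the masked position for each candidate span, so only a small labelled set is needed to adapt it; the F1 function reflects the strict matching usually applied when reporting NER results.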