
Volume 34 Issue 5
Oct 2023
Xiaobo Zhang, Hao Li, Qiang Liu, Zhenhua Li, Claire E. Reymond, Min Zhang, Yuangeng Huang, Hongfei Chen, Zhong-Qiang Chen. A New Machine-Learning Extracting Approach to Construct a Knowledge Base: A Case Study on Global Stromatolites over Geological Time. Journal of Earth Science, 2023, 34(5): 1358-1373. doi: 10.1007/s12583-022-1801-3

A New Machine-Learning Extracting Approach to Construct a Knowledge Base: A Case Study on Global Stromatolites over Geological Time

doi: 10.1007/s12583-022-1801-3
  • Corresponding author: Zhenhua Li, zhli@cug.edu.cn
  • Received Date: 30 Jun 2022
  • Accepted Date: 02 Dec 2022
  • Available Online: 14 Oct 2023
  • Issue Publish Date: 30 Oct 2023
  • In every scientific discipline, large amounts of data are buried in literature repositories and archives, and extracting useful information from these data swamps by hand is difficult. Machine-learning data extraction is therefore necessary for big-data-based studies. Here, we develop a new text-mining technique to reconstruct the global database of Precambrian to Recent stromatolites, providing a better understanding of secular changes in stromatolites through geological time. The data extraction proceeds in three steps. First, PDF documents of the stromatolite-bearing literature were collected and converted into text format. Second, a glossary and tag-labeling system built on NLP (Natural Language Processing) software was used to search each sentence of the collected papers for all possible candidate pairs. Third, each candidate pair and its features were represented in a factor graph model, with a series of heuristic procedures used to score the weight of each pair feature. Occurrence data of stromatolites versus stratigraphical units (abbreviated as Strata), facies types, locations, and ages worldwide were extracted from the literature. The extraction accuracies for these four categories are 92%/464, 87%/778, 92%/846, and 93%/405, respectively, from 3 750 scientific abstracts, and 90%/1 734, 86%/2 869, 90%/2 055, and 91%/857, respectively, from 11 932 papers. A total of 10 072 unique data items were identified. The new stromatolite dataset shows that stratigraphical occurrences of stromatolites reached a pronounced peak during the Proterozoic (2 500–541 Ma), followed by a distinct fall during the Early Phanerozoic and overall fluctuations through the Phanerozoic (541–0 Ma). Globally, seven stromatolite hotspots were identified from the new dataset: the western United States, the eastern United States, western Europe, India, South Africa, northern China, and southern China. The proportional occurrences of inland aquatic stromatolites remain rather low (~20%) relative to marine stromatolites from the Precambrian to the Jurassic, and then increase significantly (30%–70%) from the Cretaceous to the present.
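
The second and third steps of the abstract (glossary-based tagging of candidate pairs within each sentence, then feature-based scoring) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the glossary, the feature names, the hand-set weights, and the example sentence are all hypothetical, and a plain logistic score stands in for inference over a learned factor graph.

```python
import math
import re
from itertools import product

# Hypothetical mini-glossary; a real one would hold many more terms.
STROMATOLITE_TERMS = {"stromatolite", "stromatolites"}
STRATA_UNIT_WORDS = {"Formation", "Group", "Member"}
LINK_WORDS = {"in", "from", "within"}

# Hypothetical hand-set weights; the paper scores weights within a factor graph.
WEIGHTS = {"same_sentence": 0.5, "within_5_tokens": 1.0, "link_word_between": 1.5}
BIAS = -1.5


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_mentions(tokens):
    """Return two lists of (token_index, surface): stromatolite and Strata mentions."""
    stromatolites, strata = [], []
    for i, tok in enumerate(tokens):
        if tok.lower() in STROMATOLITE_TERMS:
            stromatolites.append((i, tok))
        # A Strata mention is a capitalised word followed by a unit word.
        elif tok in STRATA_UNIT_WORDS and i > 0 and tokens[i - 1][0].isupper():
            strata.append((i, tokens[i - 1] + " " + tok))
    return stromatolites, strata


def pair_features(tokens, i, j):
    """Binary features for a candidate pair located at token indices i and j."""
    lo, hi = min(i, j), max(i, j)
    return {
        "same_sentence": 1,
        "within_5_tokens": int(hi - lo <= 5),
        "link_word_between": int(any(t in LINK_WORDS for t in tokens[lo + 1:hi])),
    }


def score(features):
    """Logistic score of the weighted feature sum (stand-in for factor graph inference)."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def extract_candidates(text):
    """Return (stromatolite_mention, strata_mention, score) for every same-sentence pair."""
    results = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"[A-Za-z]+", sentence)
        stromatolites, strata = tag_mentions(tokens)
        for (i, s_surface), (j, u_surface) in product(stromatolites, strata):
            results.append((s_surface, u_surface, score(pair_features(tokens, i, j))))
    return results


if __name__ == "__main__":
    # Invented sentence for illustration only; "Example Formation" is not a real unit.
    demo = "Stromatolites occur in the Example Formation. The weather was fine."
    for candidate in extract_candidates(demo):
        print(candidate)
```

Restricting candidates to pairs within one sentence mirrors the sentence-level search described in the abstract. In a fuller pipeline the weights would be learned from labeled or distantly supervised data rather than set by hand.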

     

  • Conflict of Interest
    The authors declare that they have no conflict of interest.
    Figures(12)  / Tables(6)
