Artificial Intelligence Applied to Regulatory Standard Processing in Mining: Development of a Decision Support Tool

Matheus Mendes Damasceno, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
Luan Alysson de Souza, Universidade Federal de Juiz de Fora, Juiz de Fora, Minas Gerais, Brazil
Matheus Mauricio Chaves, Vale S.A., Belo Horizonte, Minas Gerais, Brazil
Darym Junior Ferrari de Campos, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
Giovanna Monique Alelvan, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
Vinicius Resende Domingues, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
Luan Carlos de Sena Monteiro Ozelim, Universidade de Brasília, Brasília, Distrito Federal, Brazil

Abstract

The advancement of Artificial Intelligence (AI) and Natural Language Processing (NLP) has driven innovation in the mining sector, particularly in supporting regulatory and standardization activities. Large Language Models (LLMs), based on the Transformer architecture, demonstrate robustness in text interpretation and generation, but still face challenges in adapting to specific technical domains, such as mining and geotechnics. To mitigate these limitations, strategies such as Retrieval-Augmented Generation (RAG) or other machine learning advancements have been applied to specialize these models, ensuring better adherence to the sector’s technical terminology and regulatory requirements. In this context, this study proposes the development of an AI and NLP-based tool capable of standardizing and optimizing access to regulations and standards within the integrated mining chain, promoting the sector’s digital transformation. The project utilizes advanced techniques for collecting, preprocessing, and tokenizing regulatory texts, which will be integrated via RAG to open-source LLMs, creating a model tailored to the sector’s specifics. This approach enables the tool to understand complex queries, identify patterns, and provide accurate responses to users, optimizing compliance with current legislation and improving decision-making agility.

The research will result in the implementation of an intelligent chatbot integrated with normative and technical databases from the mining industry, facilitating automated inquiries and promoting greater transparency and accessibility to regulations. The performance evaluation of the adjusted model will be
conducted through quantitative metrics, measuring the accuracy, coherence, and relevance of the generated responses by comparing them to a specialized corpus. The tool will serve as essential decision support, enabling operators, researchers, and regulators to quickly access critical information, thereby improving operational efficiency, safety, and ensuring regulatory compliance in the mining sector, which is strategic for the global economy.

Introduction

Mining operations are inherently complex, encompassing multiple phases, from prospecting and exploration to processing, transportation, and eventual mine closure. Each of these stages is governed by an extensive and intricate set of environmental, safety, labour, and technical regulations. The economic
importance of the sector is evident in recent data: in 2024, the Brazilian mineral sector’s revenue reached R$270.8 billion, a 9.1% increase from 2023. The collection of Financial Compensation for the Exploration of Mineral Resources (CFEM) also increased, reaching R$7.4 billion in 2024, a 8.6% rise compared to the previous year. Additionally, the investment forecast for the 2025–2029 quadrennium is R$68.4 billion (IBRAM, 2025). These substantial figures reinforce the critical need for operational efficiency and rigorous regulatory compliance.

Compliance with these norms and legislations is fundamental not only for the legality of operations but also for ensuring operational safety, minimizing environmental impacts, protecting workers’ health, and guaranteeing sustainability during operation. A factor that accentuates the need for efficiency in managing regulatory information is the increased time spam for the development of mining projects. Studies indicate that the time from discovery to production of a mine has increased considerably, from approximately 12 years for projects initiated 15 years ago to about 18 years currently (Manalo, 2024). This temporal dilation introduces a high degree of uncertainty for managers, who need to make strategic decisions based on a regulatory and market scenario that can change substantially by the time the mine becomes operational.

This regulatory “moving target,” combined with long project cycles, creates an urgent demand for tools that can facilitate navigation and continuous adaptation to regulations. Recent advances in Artificial Intelligence (AI) and Natural Language Processing (NLP) are catalysts for innovation in the mining industry. These technologies offer novel approaches to managing large volumes of textual information, particularly for supporting regulatory and standardization activities. AI already demonstrates value in enhancing safety and operational efficiency through applications like predictive maintenance, risk identification, and automation. This trend is corroborated by sectoral analyses that identify AI as fundamental to improving efficiency, data management, and supporting Environmental, Social, and Governance (ESG) reporting in mining.

Despite available technology, mining stakeholders face significant obstacles in managing a growing volume of complex regulatory information. This information is often dispersed across various sources, presented in inconsistent formats, and written in dense, specialized language. The digital transformation needed to mitigate these issues is hindered by organizational resistance to change, concerns over implementation cost and system interoperability challenges. Consequently, these factors compromise efficiency and create significant compliance risks, which can lead to project delays, financial penalties, and reputational damage.

The convergence between increasing regulatory complexity and long project cycles in mining establishes a critical demand for innovative solutions. Concurrently, the maturation of technologies such as Large Language Models (LLMs) and techniques like Retrieval-Augmented Generation (RAG) offer a
unique opportunity to address this demand. The main objective of this paper is to develop and evaluate a functional prototype of an intelligent chatbot. This system should be capable of processing queries formulated in natural language and providing accurate, contextually relevant responses based on documents about specific norms and regulations of the mining industry. The success of such a tool would not only optimize regulatory navigation in mining but could also serve as a model for other highly regulated sectors in Brazil, driving broader digital transformation.

Theoretical Foundation

Large Language Models (LLMs) and Transformer Architecture

Large Language Models (LLMs) represent a milestone in the evolution of Artificial Intelligence, particularly in the field of Natural Language Processing. Their development was significantly driven by the introduction of the Transformer architecture (Vaswani et al., 2017). This innovative architecture moved the
focus from recurrence and convolutions, traditionally employed in sequence models, to relying primarily on attention mechanisms.

The central component of the Transformer architecture is the self-attention mechanism, which allows the model to weigh the importance of different words (tokens) within an input sequence when calculating the representation of each word. Essentially, self-attention assesses how each token relates to all other
tokens in the sequence, efficiently capturing long-range dependencies. An important extension is multihead attention, which allows the model to focus on different positions and aspects of the input sequence simultaneously, learning different types of contextual relationships in parallel. The Transformer
architecture is typically composed of a stack of encoders and decoders. The encoder processes the input sequence and generates a contextualized representation for each token, while the decoder uses this representation to generate the output sequence, token by token (Amatriain et al., 2023).

The impact of the Transformer architecture on NLP has been profound, serving as the basis for the development of a vast range of prominent LLMs, such as BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and their numerous variants. Among these
recent advancements, Mistral 7B has gained significant prominence, an open-source language model also based on the Transformer architecture. Its suitability as a core model for Retrieval-Augmented Generation (RAG) systems is particularly noteworthy. This is justified by its high efficiency, delivering competitive performance despite its relatively small size (7 billion parameters), and its specialized “Instruct” variants, which are fine-tuned to accurately follow instructions and synthesize answers based on provided context. Furthermore, its architecture efficiently handles the long contexts required by RAG tasks, making it a powerful and practical choice for building systems that ground their responses in external knowledge.

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is an advanced AI technique aimed at enhancing the performance and reliability of LLMs, especially in tasks requiring specific and up-to-date knowledge. RAG combines the generative capabilities of LLMs with an information retrieval system that accesses external and relevant data sources. Instead of relying solely on the knowledge internalized during pre-training (which can be vast but static and prone to obsolescence), a RAG system first retrieves pertinent information from a knowledge base—such as a set of documents, databases, or web pages—and then uses this retrieved information to inform and contextualize the LLM’s response generation. This approach is crucial for overcoming the inherent knowledge limitations of LLMs and their adaptation to technical domains, a previously mentioned challenge.

Recent developments in RAG architectures have demonstrated significant applicability in regulationintensive domains. Sun et al. (2025) proposed a compliance-checking RAG framework enhanced with semantic graph integration, achieving state-of-the-art performance on bilingual regulatory datasets. Their work highlights the importance of combining retrieval mechanisms with structured semantic representations to improve accuracy in regulatory compliance tasks. Similarly, Kim and Min (2024) introduced QA-RAG, a RAG-based chatbot tailored for the pharmaceutical sector, which outperforms
conventional generative models in tasks requiring high factual accuracy and regulatory consistency. In addition to these frameworks, Malali (2025) demonstrates in the financial domain that RAG-based architectures substantially improve compliance automation, high accuracy, and reduced manual effort.
The RAG process is initiated by the user’s query. Subsequently, the query encoding stage transforms this textual input into a numerical vector representation, or embedding, through a language model. The information retrieval phase uses the query embedding to probe an external, previously indexed knowledge
base. The most relevant information fragments are extracted and, after ranking and filtering, are combined with the user’s original query, forming an enriched context that is provided as input to the LLM. This, in turn, synthesizes a response in natural language that is more informative and well-founded. Optionally, a
post-processing stage can refine the response.

The “knowledge-intensive” nature of regulatory documents, characterized by informational density and the need for absolute precision, makes RAG a particularly suitable approach. The strength of RAG lies in its ability to generate responses grounded in information explicitly retrieved from authoritative sources, mitigating the risk of “hallucinations” or outdated information, which is essential in a context with severe consequences for misinterpretations.

Design and Methodology of an AI-Powered Regulatory Chatbot for the Mining Sector

General Architecture

The architecture of the proposed solution for the mining regulatory chatbot comprises the following main components: a User Interface (Chatbot) for interaction; a Query Processing Module for preprocessing and converting the query into a vector embedding; a RAG Engine, the core of the tool, composed of a Retrieval Component (Retriever) that accesses the knowledge base and a Generation Component (Generator) that uses an open-source LLM (Mistral 7B) to formulate the response based on the query and retrieved snippets; and a Knowledge Base of Regulatory Documents, a centralized and indexed repository of legal and normative texts. The flow begins with the user’s query, which is processed and sent to the RAG engine. The engine retrieves relevant documents and generates a contextualized response, which is returned to the user via the interface.

Data Acquisition and Preprocessing

Data Sources

Regulatory and normative texts from the mining chain will be obtained from official and publicly accessible sources. This includes portals of government bodies such as the National Mining Agency (Agência Nacional de Mineração—ANM), the Ministry of Mines and Energy (Ministério de Minas e Energia—MME), and the Ministry of Labor and Social Security for Regulatory Norms (Normas Regulamentadoras—NRs), including the Mining Regulatory Norms (Normas Regulamentadoras de Mineração—NRM). The breadth of the knowledge base can be exemplified by the diversity of NRM categories available on the ANMlegis portal, such as “NRM-05: Beneficiation,” “NRM-10: Operations with Explosives and Accessories,” “NRM15: Ventilation,” “NRM-17: Support Systems and Treatments,” and “NRM-20: Dust Prevention,” among many others, indicating the granularity and scope of the documents to be incorporated.

Text Segmentation Strategies (Chunking)

Segmenting extensive documents into smaller pieces (chunks) is essential for the effective functioning of RAG systems, as it facilitates precise information retrieval and adherence to the context limits of LLMs. Simplistic fixed-size segmentation strategies can fragment context, resulting in loss of meaning. For legal and regulatory texts, where the integrity of clauses and specifications is crucial, more sophisticated approaches are necessary:

Recursive Character Text Splitting: This technique uses a hierarchy of separators (e.g., paragraphs, then sentences, then words) to divide the text, attempting to keep semantically cohesive units together. It is particularly useful for documents with a clear structure, such as legal texts.
Semantic Chunking: Utilizes embedding models to identify semantic breaks in the text, grouping sentences or paragraphs that are semantically related. The goal is to create chunks that are internally coherent in terms of meaning. Research indicates that semantic chunking can be highly
effective in ensuring information coherence within chunks. The choice of segmentation strategy, or a hybrid combination, will be guided by the need to preserve the complete meaning of articles, sections, and clauses within regulatory documents. Chunk overlap will also be considered to
maintain contextual continuity between adjacent segments. Although the Mistral 7B LLM has a long context window, intelligent segmentation remains fundamental to optimize the relevance of the context provided to the model.

LLM Integration (Mistral 7B)

The Mistral 7B model was selected as the open-source LLM due to its notable efficiency, its high performance in reasoning and language processing tasks, and, crucially, for its ability to process long contexts. This latter characteristic is particularly advantageous for analyzing extensive and complex
regulatory documents. Prompt engineering will be fundamental to properly format the user’s original query and the retrieved context chunks, guiding the Mistral 7B model to generate precise and well-founded responses.

The overall system design reflects a multi-layered approach to mitigate the inherent weaknesses of LLMs. The Mistral 7B model does not operate in isolation; it is augmented by a knowledge base (through data acquisition and analysis), made accessible via embeddings and vector databases, and focused by the
RAG architecture. This multi-pronged defensive strategy against irrelevance and hallucinations is vital for building trust in an application intended for a high-risk domain, such as the legal/regulatory field.

Performance Evaluation Structure

The primary objective of this evaluation is to rigorously assess the chatbot’s performance in delivering accurate, relevant, and well-founded responses based on its specialized corpus of mining regulations. To achieve this, a curated dataset of questions and answers will be developed, encompassing both common
and complex queries pertinent to Brazilian mining regulations (NRM). This dataset will serve as a “gold standard” against which the chatbot’s generated responses are benchmarked.

The evaluation will be structured around specific metrics designed for Retrieval-Augmented Generation (RAG) systems, using frameworks like RAGs to provide a comprehensive analysis. These metrics dissect the pipeline’s performance into its core components: the quality of the generated answer
and the effectiveness of the retrieval mechanism. The key metrics are defined as follows:

Faithfulness assesses whether the chatbot’s answer is strictly grounded in the retrieved context. It measures the degree of factual consistency between the generated response and the source material provided by the retriever. A high faithfulness score indicates that the model is not “hallucinating” or fabricating information, ensuring that every claim in the answer can be traced back to the regulatory text.
Answer Relevancy measures how directly the generated answer addresses the user’s question. While an answer might be factually correct (faithful), it could still be irrelevant if the retrieved context, though accurate on its own, does not pertain to the specific query. This metric ensures that the chatbot is not only correct but also on-topic and helpful.
Context Precision evaluates the signal-to-noise ratio of the retrieval step. It measures the proportion of retrieved documents (or “chunks”) that were genuinely relevant to answering the question. High precision means the retriever is efficient and accurate, successfully identifying useful context
without including irrelevant information that could confuse the generator.
Context Recall measures the retriever’s ability to find all the necessary information to answer the question. It assesses whether all relevant documents from the knowledge base were successfully retrieved. High recall is crucial for complex questions that require synthesizing information from multiple sources, as it ensures no critical context is missed.

A key enhancement to the retrieval process is the implementation of the similarity function as a “Confidence Level”. This metric indicates the relationship between the user’s query and the retrieved RAG material, with results above 0.5 considered satisfactory, thus providing a transparent measure of relevance.

Ultimately, the demonstration of “satisfactory results” transcends the mere ability to provide an answer. It depends on this multifaceted evaluation of quality, ensuring that the chatbot is reliable, its reasoning is transparent, and its underlying retrieval mechanism is both precise and comprehensive. A welldocumented evaluation framework not only validates the technical viability of this project but also contributes valuable insights to the broader challenge of creating and assessing dependable RAG systems in specialized domains.

Preliminary Results and Technical Discussion

The current development stage of the prototype has yielded technically consistent advancements, which validate the proposed methodology for regulatory intelligence. Through the integration of advanced text segmentation techniques, such as semantic chunking and recursive splitting, alongside the Mistral 7B
language model and a Retrieval-Augmented Generation (RAG) pipeline, initial tests demonstrate promising performance. The architecture, connected to a structured regulatory knowledge base, presents preliminary results that attest to its viability and potential for optimizing compliance processes.

The strategic decision to adopt a RAG architecture, rather than commercial language models accessed via API (such as ChatGPT), is based on criteria essential for regulated corporate environments. A primary factor is the assurance of data sovereignty and privacy, as both query information and the proprietary knowledge base remain within the company’s security perimeter. Such an approach mitigates the risks associated with transferring sensitive data to third-party servers and ensures compliance with legislation such as the Brazilian General Data Protection Law (LGPD).

Additionally, the RAG architecture offers a level of governance that is unattainable with commercial models. It allows for the dynamic, real-time updating of the knowledge base, an indispensable requirement for keeping pace with the constant evolution of regulations. In that regard, commercial models typically
impose limits on the total number of documents that can be uploaded or charge premium rates for document processing based on token usage. Another crucial benefit is the ability to implement granular access control, which filters the retrieved information based on each user’s permissions. This functionality ensures that the responses generated by the system are always appropriate for the individual’s role and access level, reinforcing information security and governance.

Designed to maximize reliability and traceability, the RAG architecture grounds each response in specific, verifiable documents, significantly reducing the risk of “hallucinations”—the generation of factually incorrect information. This capability to cite sources creates a transparent audit trail, which is essential for regulatory compliance. From an economic and scalability standpoint, the RAG approach proves superior, as it avoids the prohibitive costs and high latency associated with queries that require the analysis of extensive document volumes, thereby overcoming the practical limitations of context windows.

The system’s performance, evaluated using rigorous metrics, shows positive results, placing it on par with commercial alternatives in terms of performance, while also offering important additional benefits. Metrics such as Faithfulness and Answer Relevancy have shown strong performance, ensuring the generation of accurate and contextually relevant answers. Concurrently, Context Precision and Context Recall are undergoing continuous optimization. These indicators, intrinsically linked to the quality of the knowledge base and retrieval configurations, demonstrate the technical feasibility and scalability of the solution.

Despite technical progress, the success of the implementation will depend on user adoption and organizational integration. Current efforts are focused on refining the knowledge base, improving retrieval mechanisms, and conducting systematic benchmark evaluations to ensure the system’s robustness and
reliability. A full operational demonstration of the tool, detailing its functionality in regulatory queries within the mining sector, will be conducted during the presentation of the corresponding scientific paper, with a preliminary version already available for consultation (Souza, 2025).

Conclusion

This study detailed the design of an Artificial Intelligence tool to optimize access to and interpretation of regulations and standards in the Brazilian mining industry. The core of the problem lies in the complexity and volume of the normative framework, which imposes significant challenges to operational efficiency and compliance. The proposed solution is an intelligent chatbot, based on the Retrieval-Augmented Generation (RAG) architecture, which integrates an open-source Large Language Model (LLM) with a specialized knowledge base built from regulatory documents of the sector. The use of text embeddings specialized for the legal domain in Portuguese and advanced text segmentation strategies are key components to ensure the relevance and accuracy of the retrieved information.

The expectation of “satisfactory results” is supported by a combination of factors: judicious methodological choices aimed at maximizing system performance in a demanding technical domain; the proposal of a rigorous evaluation structure based on established metrics for RAG systems, which will allow
a quantifiable demonstration of the tool’s effectiveness; and the anticipated positive impact for the mining sector, including efficiency gains, improved compliance, and safety support. The very journey of identifying a critical problem, designing a sophisticated solution with cutting-edge techniques, and
proposing a robust method to evaluate its effectiveness constitutes a significant contribution, whose process and expected results can be considered satisfactory.

However, limitations are acknowledged. The current work describes the development of a prototype. The knowledge base, although comprehensive, will be limited to publicly accessible documents and may not capture all nuances or internal policies of specific companies. The evaluation, although rigorous, will
be based on a curated dataset, and performance in real-world scenarios may vary. The inherent limitations of LLMs, even with RAG, persist regarding true understanding and complex reasoning.

For future work, several branches of development and research are suggested:

Multimodal RAG: Expansion of the system’s capability to incorporate and process information contained in diagrams, schematics, or images present in regulatory documents.
Advanced Reasoning: Evolution of the chatbot beyond questions and answers, enabling it to assist in more complex tasks, such as compliance verification for a given operational scenario.Real-Time Updates: Development of mechanisms for the automatic ingestion and processing of
new or updated regulations, ensuring the knowledge base remains current.
User Feedback Integration: Implementation of a system that allows users to evaluate the quality of responses and provide corrections, enabling a continuous improvement cycle.
Impact Studies and Broad Deployment: Expansion of the tool’s use and the conduct of studies to measure its real impact on operational efficiency and compliance rates in mining companies.
Enhanced Explainability: Improving the system’s ability to explain why it provided a certain answer, more effectively highlighting specific clauses in source documents.
Integration with Other Systems: Connecting the chatbot to internal compliance management systems used by mining companies.

These future directions point towards the evolution of the tool from a passive information retrieval system to a more active and integrated partner in decision support and compliance management in the mining sector.

References

Agência Nacional de Mineração. (n.d.). Normas Reguladoras de Mineração (NRM). ANMlegis. https://anmlegis.datalegis.net/action/ActionDatalegis.php?acao=recuperarTematicasCollapse&cod_modulo=351&cod_menu=6710

Amatriain, X., Sankar, A., Bing, J., Bodigutla, P. K., Hazen, T. J., & Kazi, M. (2023). Transformer models: An introduction and catalog [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2302.07730

Instituto Brasileiro de Mineração. (2025). Relatório anual de atividades. https://ibram.org.br/wpcontent/uploads/2025/02/IBRAM_Relatorio-Anual-2024_completo-2.pdf

Kim, J., & Min, M. (2024). From RAG to QA-RAG: Integrating generative AI for pharmaceutical regulatory compliance. arXiv. https://doi.org/10.48550/arXiv.2402.01717

Malali, N. (2025). The role of retrieval-augmented generation (RAG) in financial document processing: Automating compliance and reporting. International Journal of Management Technology, 12(3), 26–46. https://doi.org/10.37745/ijmt.2013/vol12n32646

Manalo, P. (2024, April 10). Average lead time almost 18 years for mines started in 2020–23. S&P Global Market Intelligence. https://www.spglobal.com/marketintelligence/en/news-insights/research/average-lead-timealmost-18-years-for-mines-started-in-2020-23

Souza, L. (2025). LegisMinerRAGAPI [Web application]. Hugging Face. https://huggingface.co/spaces/luansouza4444/LegisMinerRAGAPI

Sun, J., Luo, Z., & Li, Y. (2025). A compliance checking framework based on retrieval-augmented generation. In Proceedings of the 31st International Conference on Computational Linguistics (pp. 2603–2615). https://aclanthology.org/2025.coling-main.178.pdf

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017).

Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 6000–6010). Curran Associates, Inc. https://papers.nips.cc/paper/7181-attention-is-all-you-need