Implementing ML for Automated Data Tagging and Classification (Activity 6.3.1)
A critical aspect of applying granular controls is understanding the sensitivity and context of the data itself. We previously established the importance of data tagging and classification (Activity 4.4.2) for things like API access control (Activity 5.1.2). However, manual data classification can be a daunting, slow, and error-prone task in a vast enterprise.
Zero Trust Activity 6.3.1: Implement Data Tagging & Classification ML Tools brings the power of Machine Learning (ML) directly to data governance, automating and enhancing the accuracy of data tagging and classification.
The activity mandates that DoD Components utilize existing Data Tagging and Classification standards and requirements to integrate Machine Learning (ML) solution(s)/capability as needed. This means implementing ML-driven tools where they can provide the most value. Crucially, ML solution(s) are implemented by Components, and existing tagged and classified data repositories are used to establish baselines. This leverages past manual efforts to train the machines. The ML solution(s) will then apply data tags in a supervised approach to continually improve analysis, indicating a human-in-the-loop model for accuracy and ongoing refinement.
This activity is vital for scaling data-centric Zero Trust policies. Automated, accurate data tagging allows for dynamic enforcement of “need-to-know” access rules based on data sensitivity, reducing manual overhead and increasing policy consistency.
The outcomes for Activity 6.3.1 highlight the operationalization of ML for data governance:
- Components implement ML capabilities with data tagging and classification.
The ultimate end state signifies a mature, ML-powered data security posture: Machine learning solution is acquired, trained, and implemented in accordance with DoD established Data Tagging and Classification tools. Machines are trained on a high-quality subset of data developed under activity 4.3.1 with human oversight and assessment. This emphasizes the continuous, human-validated improvement of the ML models.
Solutions for Achieving Activity 6.3.1: Implement Data Tagging & Classification ML Tools
Implementing ML for data tagging and classification requires a combination of technology platforms, robust data pipelines, and a structured approach to model training and human feedback:
- Procurement and Implementation of ML-Powered Data Classification Platforms:
- Select data classification and governance platforms that have integrated Machine Learning capabilities. These solutions use algorithms to analyze data content, context, and patterns to automatically suggest or apply data tags (e.g., “Personally Identifiable Information,” “Confidential,” “Financial Data”).
- Trellix Data Loss Prevention (DLP), particularly its Discover module, directly addresses this need. It leverages advanced content inspection techniques, data fingerprinting, and auto-classification powered by machine learning to identify and classify sensitive data (PII, intellectual property, financial data) across various repositories like laptops, shared file servers, and cloud storage. This is a direct implementation of ML for data tagging and classification.
- Ensure the chosen solution can adhere to your existing DoD Data Tagging and Classification standards and requirements.
- Leveraging Existing Tagged Data for Baselines and Training:
- Utilize your existing repositories of manually tagged and classified data as the initial training set for the ML models. The more high-quality, pre-labeled data you have, the better your initial ML model will perform.
- This data may come from previous data discovery efforts, data access governance tools (from Activity 4.4.2), or existing data loss prevention (DLP) systems.
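To make the idea of "existing tagged data as a training baseline" concrete, the sketch below trains a tiny supervised classifier from a handful of pre-labeled documents. It uses a minimal multinomial Naive Bayes over word counts; in practice a platform such as Trellix DLP supplies far more capable models, and the corpus, label names, and documents here are purely illustrative assumptions.

```python
# Minimal sketch: train a supervised baseline from previously tagged documents.
# The corpus and label names ("PII", "Financial", "Public") are illustrative.
import math
from collections import Counter, defaultdict

def train(docs, labels):
    """Return per-class priors, per-class word counts, and the vocabulary."""
    class_words = defaultdict(list)
    for text, label in zip(docs, labels):
        class_words[label].extend(text.lower().split())
    priors = Counter(labels)
    counts = {c: Counter(words) for c, words in class_words.items()}
    vocab = {w for words in counts.values() for w in words}
    return priors, counts, vocab

def classify(text, priors, counts, vocab):
    """Pick the class maximizing log P(class) + sum of log P(word|class)."""
    total = sum(priors.values())
    best, best_score = None, -math.inf
    for c in priors:
        score = math.log(priors[c] / total)
        denom = sum(counts[c].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((counts[c][w] + 1) / denom)  # Laplace smoothing
        if score > best_score:
            best, best_score = c, score
    return best

# Pre-labeled examples standing in for an existing tagged repository.
docs = [
    "ssn 123-45-6789 for employee john smith",
    "employee home address 42 elm street",
    "q3 budget forecast and expense ledger",
    "wire transfer instructions account 9981",
    "cafeteria menu for the week",
    "public press release new facility opening",
]
labels = ["PII", "PII", "Financial", "Financial", "Public", "Public"]

model = train(docs, labels)
tag = classify("wire transfer for vendor account", *model)  # -> "Financial"
```

The key point is that the model never starts from zero: the priors and word statistics are bootstrapped entirely from labels your analysts already assigned, which is exactly the baselining the activity calls for.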
- Implementing a Supervised Learning Approach with Human Oversight:
- The “supervised approach” is key. The ML solution will analyze data and propose tags, but human experts (data owners, compliance officers, security analysts) review and validate these suggestions.
- This human feedback loop is crucial for:
- Correcting Errors: Guiding the ML model to learn from its mistakes (e.g., “this wasn’t really PII”).
- Adapting to New Data Types: Training the model on new types of sensitive information that emerge.
- Building Trust: Ensuring the ML’s classifications are accurate and reliable before policies are enforced based on them.
- This requires a process for human review and a mechanism within the ML solution to incorporate feedback for continuous model improvement.
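One way to picture the supervised review process above is as a triage step between the model and the tag store: high-confidence suggestions are applied automatically, everything else waits for a human, and corrections are recorded for the next training cycle. The threshold, field names, and `TagDecision` structure below are illustrative assumptions, not any product's API.

```python
# Sketch: a human-in-the-loop review queue for ML tag suggestions.
from dataclasses import dataclass
from typing import Optional

AUTO_APPLY_THRESHOLD = 0.90  # assumed policy: auto-apply only confident tags

@dataclass
class TagDecision:
    doc_id: str
    suggested_tag: str
    confidence: float
    final_tag: Optional[str] = None
    reviewed_by_human: bool = False

def triage(suggestions):
    """Split ML suggestions into auto-applied tags and a human review queue."""
    applied, review_queue = [], []
    for s in suggestions:
        if s.confidence >= AUTO_APPLY_THRESHOLD:
            s.final_tag = s.suggested_tag
            applied.append(s)
        else:
            review_queue.append(s)
    return applied, review_queue

def record_review(decision, human_tag):
    """Capture the analyst's verdict; corrections feed the next retraining."""
    decision.final_tag = human_tag
    decision.reviewed_by_human = True
    return decision

suggestions = [
    TagDecision("doc-001", "PII", 0.97),
    TagDecision("doc-002", "Financial", 0.62),
]
applied, queue = triage(suggestions)
corrected = record_review(queue[0], "PII")  # analyst overrides the suggestion
```

Keeping the reviewed decisions as structured records, rather than ad-hoc fixes, is what makes the "continuous improvement" part of the activity operational: every human correction is reusable training data.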
- Optimizing the Data Pipeline for ML (Cribl’s Role):
- While ML platforms like Trellix DLP perform the classification, they need high-quality data. Cribl, as a data observability and routing platform, is instrumental here. It can:
- Aggregate Data: Collect data from diverse sources (file servers, cloud storage, databases) before it’s fed to the ML classification tool.
- Optimize and Transform: Filter out irrelevant data, normalize disparate formats, and enrich logs with additional context (e.g., user IDs, timestamps) to ensure the ML solution receives clean, relevant, and optimized data. This is crucial for the efficiency and accuracy of ML models.
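The filter/normalize/enrich steps above can be sketched generically. In a real deployment Cribl pipelines perform these transformations with their built-in functions; the record fields, noise sources, and user directory below are illustrative assumptions showing the kind of preparation the classifier benefits from.

```python
# Sketch: pipeline-side preparation before records reach the ML classifier.
from datetime import datetime, timezone

IRRELEVANT_SOURCES = {"healthcheck", "debug"}  # assumed noise to filter out

def prepare(record, user_directory):
    """Return a cleaned, enriched record, or None if it should be dropped."""
    if record.get("source") in IRRELEVANT_SOURCES:
        return None  # filter: drop data the classifier does not need
    return {
        "doc_id": record["id"],
        # normalize: one timestamp format regardless of source
        "timestamp": datetime.fromtimestamp(
            record["epoch"], tz=timezone.utc
        ).isoformat(),
        "content": record.get("body", "").strip(),
        # enrich: resolve the raw user ID to directory context
        "user": user_directory.get(record.get("uid"), "unknown"),
    }

users = {"u42": "jsmith@example.mil"}  # assumed directory lookup
raw = [
    {"id": "doc-1", "source": "fileshare", "epoch": 1717600000,
     "body": "  SSN 123-45-6789  ", "uid": "u42"},
    {"id": "doc-2", "source": "healthcheck", "epoch": 1717600001, "body": "ok"},
]
clean = [r for r in (prepare(rec, users) for rec in raw) if r is not None]
```

Dropping the health-check record before classification is a small example of the larger payoff: the ML tool spends its scanning budget only on data that could actually carry sensitive content.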
- Integrating with Data Repositories and Security Tools:
- Ensure the ML solution integrates with your various data repositories (on-premises file shares, cloud storage, databases, collaboration platforms) to access data for scanning and tagging.
- Integrate the ML-driven classification results with your broader Zero Trust ecosystem:
- Feed tags to API Gateways (Activity 5.1.2) for granular API access control.
- Inform Data Rights Management (DRM) policies (Activity 4.4.2).
- Enhance DLP capabilities by precisely identifying sensitive data.
- Feed into SIEM/XDR for enhanced visibility into sensitive data access.
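As a sketch of how consumers like an API gateway might act on ML-applied tags, the snippet below maps tags to an assumed sensitivity ordering and allows access only when the caller's clearance covers every tag on a document. The mapping and the numeric clearance model are illustrative assumptions, not any gateway's policy language.

```python
# Sketch: an access decision driven by ML-applied data tags.
SENSITIVITY = {"Public": 0, "Financial": 2, "PII": 3}  # assumed ordering

def authorize(document_tags, caller_clearance):
    """Allow access only if the caller's clearance covers every tag."""
    required = max((SENSITIVITY.get(t, 0) for t in document_tags), default=0)
    return caller_clearance >= required

allowed = authorize(["PII", "Financial"], caller_clearance=2)  # -> False
```

The important design point is that the policy engine never inspects content itself; it trusts the tags, which is why the human-validated accuracy of the ML classifications matters before enforcement is switched on.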
Key Items to Consider:
- Data Quality is Paramount: The accuracy of ML models is directly tied to the quality and consistency of the training data (existing tagged data). “Garbage in, garbage out” applies here, making data pipeline optimization (Cribl’s strength) essential.
- Defining Granular Data Classifications: Ensure your existing or newly defined data classification standards are granular enough to be useful for policy enforcement and that ML models can accurately differentiate them.
- Human-in-the-Loop Process: Plan for the human effort required for review, validation, and feedback in the supervised learning model. This is an ongoing operational commitment.
- Bias and Fairness: Be aware of potential biases in ML models and ensure they don’t inadvertently misclassify data or create unfair access restrictions. Regular auditing is essential.
- Scalability: The ML solution must be able to scale to handle the volume and velocity of your enterprise’s data.
- Integration with Data Infrastructure: Ensure seamless integration with various data storage, processing, and transfer systems across your enterprise.
- Model Retraining and Maintenance: ML models need continuous retraining with new data and human feedback to maintain accuracy as data types and classifications evolve.
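The retraining consideration above can be sketched as a merge step: human-reviewed corrections override earlier labels for the same documents, and the merged set becomes the input to the next scheduled training run. The data structures here are illustrative assumptions.

```python
# Sketch: folding human-reviewed corrections back into the training set.
def merge_feedback(training_set, reviewed):
    """Human-validated labels override earlier labels for the same document."""
    merged = dict(training_set)  # doc_id -> (text, label)
    for doc_id, text, human_label in reviewed:
        merged[doc_id] = (text, human_label)
    return merged

base = {
    "doc-1": ("SSN 123-45-6789", "PII"),
    "doc-2": ("Q3 budget forecast", "Public"),  # mislabel awaiting correction
}
reviewed = [("doc-2", "Q3 budget forecast", "Financial")]
training = merge_feedback(base, reviewed)
# the model is refit on `training` at the next scheduled retraining cycle
```

Treating retraining as a scheduled, auditable merge (rather than in-place edits to the model) also gives you the record of human oversight that the end state for this activity expects.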
For the Technical Buyer
Activity 6.3.1 is about bringing intelligent automation to consistent and accurate data tagging and classification. For technical buyers, success in this activity means procuring an ML-powered data classification solution, such as Trellix DLP, that aligns with your existing standards and can be trained effectively on your historical data. Crucially, you must ensure your data pipelines, leveraging Cribl’s capabilities, are optimized to feed clean, high-quality data to these ML models. This investment enables you to scale your data governance efforts significantly, providing the precise data context needed to enforce granular, automated Zero Trust policies across your ecosystem.