Required Data Maturity · AI Security Landscape

In data science, a familiar saying claims that 80% of the work is data preparation. The 2016 CrowdFlower Data Science Report put rough numbers behind it: 60% of data scientists' time goes into cleaning and organising data and another 19% into collecting it, about 79% in total, while only 9% goes into mining for patterns. AI in security is no different. Two use cases with similar business value can have very different total cost of ownership, simply because one needs months of data plumbing before it produces anything useful, while the other works on data you already have.

The Y axis captures that effort. From bottom to top, each band needs progressively more data maturity: not just volume, but also quality, labelling, and standardisation.

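LLMs sit in the centre because their training combines all three paradigms: self-supervised pre-training draws on unsupervised learning, fine-tuning on labelled examples is supervised learning, and RLHF (reinforcement learning from human feedback) is reinforcement learning. Semi-Supervised Learning is the named intersection of supervised and unsupervised approaches.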

The five bands

Reinforcement Learning (bottom, data-lean)

Learns by interacting with an environment, not by being trained on a labelled corpus. Fits security problems where the environment itself is the data source: a self-optimising firewall exploring which policies block attack patterns without breaking legitimate traffic, a WAF bypass tester that mutates payloads until something slips through, or a crisis-response simulator that learns escalation strategies through repeated scenarios. The cost is not labelled data; it is building a simulation of the network or system that is safe and realistic enough for the agent to explore.
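To make "the environment itself is the data source" concrete, here is a minimal sketch of an epsilon-greedy bandit, one of the simplest forms of reinforcement learning. The policy names, the success rates, and the `simulate` function are all hypothetical stand-ins for a real simulation; nothing here comes from an actual firewall product.

```python
import random

# Hypothetical firewall policies and the hidden probability that each one
# blocks an attack without breaking legitimate traffic. In a real setup
# these numbers are unknown and only observable through the simulation.
TRUE_SUCCESS_RATE = {"strict": 0.55, "balanced": 0.80, "permissive": 0.30}


def simulate(policy: str, rng: random.Random) -> float:
    """Return reward 1.0 if the policy handled one simulated request well."""
    return 1.0 if rng.random() < TRUE_SUCCESS_RATE[policy] else 0.0


def epsilon_greedy(rounds: int = 5000, epsilon: float = 0.1, seed: int = 0):
    """Learn which policy works best purely from simulated interaction."""
    rng = random.Random(seed)
    counts = {p: 0 for p in TRUE_SUCCESS_RATE}
    values = {p: 0.0 for p in TRUE_SUCCESS_RATE}  # running mean reward
    for _ in range(rounds):
        if rng.random() < epsilon:
            policy = rng.choice(list(TRUE_SUCCESS_RATE))  # explore
        else:
            policy = max(values, key=values.get)  # exploit best estimate
        reward = simulate(policy, rng)
        counts[policy] += 1
        # Incremental update of the mean reward for the chosen policy.
        values[policy] += (reward - values[policy]) / counts[policy]
    return values, counts


values, counts = epsilon_greedy()
best_policy = max(values, key=values.get)
```

No labelled dataset appears anywhere in the sketch. All of the learning signal comes from `simulate`, which is why the fidelity and safety of the simulation carry the cost.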

Unsupervised Learning

Finds structure in raw data without any labels: anomaly detection, clustering, noise reduction. Excellent when telemetry is plentiful but ground-truth examples of bad behaviour are not: CloudTrail and Azure Activity Logs, Lambda invocation metrics, network traffic flows, user action streams for insider-risk detection, duplicate alerts in a SIEM. Most security teams already sit on top of the data; what they lack is a labelled set of incidents that would make supervised approaches possible.
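As a rough illustration of label-free anomaly detection, the sketch below flags outliers in a series of hourly API call counts using the modified z-score (median and median absolute deviation). The sample counts and the 3.5 threshold are assumptions chosen for the example; a production system would more likely use a dedicated library and richer features.

```python
from statistics import median


def mad_anomalies(counts, threshold=3.5):
    """Return the indices whose modified z-score exceeds the threshold."""
    med = median(counts)
    mad = median(abs(c - med) for c in counts)  # median absolute deviation
    if mad == 0:
        return []  # no spread in the data, nothing can be ranked as unusual
    return [
        i
        for i, c in enumerate(counts)
        if 0.6745 * abs(c - med) / mad > threshold
    ]


# Hypothetical hourly call counts for one service account.
hourly_calls = [42, 38, 45, 40, 39, 41, 44, 37, 43, 310, 40, 42]
print(mad_anomalies(hourly_calls))  # [9]
```

The function never sees an example of "bad" behaviour. It only learns what typical looks like from the telemetry itself, so it needs volume but no incident labels.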

Semi-Supervised Learning

A small set of labels combined with a large unlabelled dataset. This matches the reality of security work: a few hundred confirmed incidents or hand-classified policy decisions sit on top of millions of raw events. Practical applications include cloud misconfiguration detection (a handful of well-classified cases plus a large infrastructure to scan), proactive threat hunting (analysts label a few campaigns, the model extends the pattern across the estate), critical attack-path analysis, and compliance drift detection seeded with known control violations.
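The idea of extending a few labels across a large unlabelled set can be sketched as self-training with a nearest-neighbour rule. The two numeric features, the labels, and the `max_dist` cut-off below are illustrative assumptions, not a recommended feature set for misconfiguration detection.

```python
import math


def propagate_labels(labelled, unlabelled, max_dist=1.5):
    """Self-training with a 1-nearest-neighbour rule.

    labelled:   dict mapping a feature tuple to its label
    unlabelled: list of feature tuples
    Returns the extended label dict and the points left unlabelled.
    """
    labelled = dict(labelled)
    remaining = list(unlabelled)
    while remaining:
        # Pick the unlabelled point that lies closest to any labelled point.
        dist, point, label = min(
            (
                (math.dist(u, known), u, lab)
                for u in remaining
                for known, lab in labelled.items()
            ),
            key=lambda t: t[0],
        )
        if dist > max_dist:
            break  # nothing left that is close enough to label confidently
        labelled[point] = label
        remaining.remove(point)
    return labelled, remaining


# Hypothetical features: (exposed ports, publicly readable buckets).
seed_labels = {(0, 0): "ok", (10, 10): "misconfigured"}
unlabelled_resources = [(1, 0), (2, 1), (9, 9), (8, 8), (5, 20)]

labels, needs_review = propagate_labels(seed_labels, unlabelled_resources)
```

Two seed labels end up covering four more resources, while the point that resembles neither seed is left in `needs_review` for an analyst. That division of labour is the practical appeal of the semi-supervised band.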

Supervised Learning

Trained on labelled examples. Classic ML in security: incident classification with next-best-action recommendation, AI-based vendor risk scoring, phishing-susceptibility prediction, automated DLP labelling. Demands a clean, labelled dataset that mirrors what you will see in production. The cost is not the model; it is the months of curation, the agreement on label definitions across teams, and the continuous re-labelling as the threat landscape shifts.
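One way to make "agreement on label definitions across teams" measurable is Cohen's kappa, which compares how often two annotators agree with how often they would agree by chance. The sketch below computes it for two hypothetical teams labelling the same ten incidents; the team names and labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labelled at random with
    # their own label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two teams for the same ten incidents.
soc_team = ["phishing", "phishing", "malware", "benign", "phishing",
            "benign", "malware", "benign", "phishing", "benign"]
ir_team = ["phishing", "malware", "malware", "benign", "phishing",
           "benign", "malware", "phishing", "phishing", "benign"]

kappa = cohens_kappa(soc_team, ir_team)
```

Here the teams agree on 8 of 10 incidents, but kappa comes out at roughly 0.70 because some of that agreement would occur by chance. A low value is a signal to tighten label definitions before spending effort on a model.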

Large Language Models (top, data-rich)

Pre-trained on huge corpora, so the model itself is data-rich by default. The question is how well-organised your domain context is. Three architectural patterns dominate in security, each with a different data-maturity profile.

Pure LLM (unstructured text and code)

GPT-class models reasoning over text and code without external retrieval. Strong for text processing, summarising, and enrichment. In security: STRIDE threat modelling on architecture documents, SAST-finding explanation, log narrative generation, drafting incident reports. Source documents must exist in machine-readable form, but no separate knowledge base is required.

RAG (internal docs and CVEs)

LLM plus a vector database holding internal documentation, CVEs, and prior triage outcomes. Reduces hallucination by grounding answers in retrieved evidence. In security: vulnerability triage hubs, audit and compliance document intelligence, virtual security assistants for SOC playbook automation. Data maturity required is high; the knowledge base itself is the critical asset and needs continuous curation, access control and drift handling.
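The retrieval step can be sketched without any external service. In the example below, bag-of-words cosine similarity stands in for an embedding model and a plain dictionary stands in for the vector database; both are simplifying assumptions. The document IDs and contents are hypothetical, and the call to the LLM itself is deliberately left out.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base: document ID mapped to its text.
KNOWLEDGE_BASE = {
    "playbook-phishing": "Phishing playbook: isolate the mailbox, reset "
                         "credentials, and search for similar messages.",
    "triage-openssl": "Prior triage: the OpenSSL vulnerability on the payment "
                      "gateway was rated critical and patched.",
    "policy-s3": "Storage policy: S3 buckets must block public access and "
                 "enforce encryption at rest.",
}


def tokenize(text):
    return re.findall(r"[a-z0-9-]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, docs, k=2):
    """Return the IDs of the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        docs, key=lambda d: cosine(q, Counter(tokenize(docs[d]))), reverse=True
    )
    return ranked[:k]


def build_prompt(query, docs, k=2):
    """Assemble a grounded prompt from the retrieved documents."""
    context = "\n".join(f"[{d}] {docs[d]}" for d in retrieve(query, docs, k))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"


question = "How do we handle a phishing mailbox incident?"
top_hit = retrieve(question, KNOWLEDGE_BASE, k=1)
prompt = build_prompt(question, KNOWLEDGE_BASE, k=1)
```

The quality of the answer depends on what `retrieve` returns, so stale or badly scoped entries in the knowledge base translate directly into wrong or leaked context.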

Fine-Tuning (proprietary logs and APIs)

Adjusting the base model itself to a specific domain or stack. Used when the language of the problem (proprietary log formats, internal API code, niche compliance terminology) is not adequately covered by pre-training. In security this is rare; mostly relevant for very specialised stacks or large organisations with stable, proprietary tooling. Data maturity required is very high, with significant volumes of hand-curated training examples.
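Much of the fine-tuning effort goes into preparing the training file. The sketch below turns curated pairs of log line and analyst explanation into deduplicated JSONL records. The log format and the `prompt` and `completion` field names are assumptions for illustration; the schema actually required depends on the fine-tuning tooling in use.

```python
import json


def to_jsonl(examples):
    """Convert (raw_log, explanation) pairs into deduplicated JSONL text."""
    seen = set()
    lines = []
    for raw_log, explanation in examples:
        key = raw_log.strip()
        if not key or key in seen:
            continue  # skip empty inputs and repeated log lines
        seen.add(key)
        lines.append(
            json.dumps({"prompt": key, "completion": explanation.strip()})
        )
    return "\n".join(lines)


# Hypothetical hand-curated examples in an invented proprietary log format.
curated = [
    ("FWX|07|DENY|src=10.0.0.5|rule=R-118", "Outbound connection blocked by egress rule R-118."),
    ("FWX|02|ALLOW|src=10.0.0.9|rule=R-004", "Internal traffic allowed by baseline rule R-004."),
    ("FWX|07|DENY|src=10.0.0.5|rule=R-118", "Duplicate entry that should be dropped."),
]

training_file = to_jsonl(curated)
```

Even this small step shows where the data-maturity cost sits: every record needs a human-written explanation, and duplicates or inconsistent explanations have to be caught before training.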

Why this matters: the higher up the Y axis, the more time you'll spend on data infrastructure (pipelines, embeddings, governance) before you see results. Pick approaches that match what you already have, then climb deliberately.