CyberSOCEval: A Cybersecurity Analysis Test Bench for Large Language Models

Among the vendors claiming a presence across the entire AI stack, CrowdStrike is in the spotlight.

The American software maker leaned hard on this angle at its annual conference, showcasing a broad array of integrations, some new, others longer-standing. These include an MCP server made available inside Amazon Bedrock AgentCore (to access Falcon telemetry), the link between Falcon Shield SSPM and Salesforce’s Security Center, and a gateway connecting the Charlotte AI AgentWorks designer to NVIDIA’s Nemotron models.

Also on the list is CyberSOCEval. This benchmark is intended to help evaluate the capabilities of LLMs in two domains: malware analysis and making use of threat intelligence reports. It sits within the CyberSecEval suite, which was launched by Meta and aims to test both the vulnerabilities of LLMs and their defensive capabilities.

A multiple-choice approach with a touch of multimodality

Meta and CrowdStrike collaborated on CyberSOCEval. Both acknowledge that this benchmark is not the first of its kind. They emphasize, however, that it is open, unlike, for example, the benchmark Sophos unveiled last year (tasks covered: transforming natural-language questions into SIEM queries, summarizing incidents from raw SOC data, and assessing incident severity levels). They also stress that it touches on multimodality: in the threat intelligence portion, one of the test configurations involved converting reports into images (one PNG file per page).

To promote reproducibility and interpretability of results, they chose to rely on multiple-choice questions (MCQs) generated by Llama models, then verified by humans and adjusted as needed. CrowdStrike and Meta acknowledge that this approach is not perfectly representative of real SOC operations. They note, however, that they reduced the probability of a lucky guess by allowing multiple correct answers and by increasing the number of options.
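
To get a feel for what that mitigation buys, here is a minimal sketch of the arithmetic. It assumes (the article does not say so) that a question only counts as correct when the selected set matches the answer key exactly, and that a guesser picks uniformly among all non-empty subsets of options. The function name is illustrative; the option counts of 6 and 10 are the maximums the article cites for the two datasets.

```python
from fractions import Fraction


def exact_guess_probability(num_options: int, multiple_correct: bool) -> Fraction:
    """Chance that one uniform random guess matches the answer key exactly.

    Assumption for this sketch: exact-match scoring, no partial credit.
    """
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    if not multiple_correct:
        # Classic MCQ: exactly one of num_options is right.
        return Fraction(1, num_options)
    # Multi-answer MCQ: any non-empty subset of options could be the key.
    return Fraction(1, 2 ** num_options - 1)


single_4 = exact_guess_probability(4, multiple_correct=False)
multi_6 = exact_guess_probability(6, multiple_correct=True)
multi_10 = exact_guess_probability(10, multiple_correct=True)
print(single_4, multi_6, multi_10)  # 1/4 1/63 1/1023
```

Under these assumptions, a blind guess on a 10-option multi-answer question succeeds roughly once in a thousand tries, against once in four for a conventional four-option question.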

Malware analysis: Claude Sonnet comes out on top

For malware analysis, the tests rely on Falcon sandbox execution logs. They cover five threat types:

RATs (remote access Trojans)
Ransomware
Info-stealers
Evasion of antivirus/EDR tools
Malware that tampers with the hooks used to monitor process execution

The dataset comprises 609 questions, each with up to 10 possible answers. They were generated by Llama 3.2 90B from publicly available logs of malware targeting Windows environments. Examples:

Which of these Windows API calls best indicates the process-injection technique?
To verify whether an IP address belongs to a C2 infrastructure, which method should you use first?
Given the libraries loaded, which logging strategy most effectively detects reflective DLL injection?

The models tested were all used with their original parameters, without fine-tuning, though with a few case-by-case system-prompt adjustments to encourage adherence to the desired response format.
Regardless of difficulty, Claude 3.7 Sonnet performed best.
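
As an illustration of why a fixed response format matters for this kind of benchmark, the sketch below parses replies and scores them by exact match. The `Answer: A, C` format, the helper names, and the exact-match rule are all assumptions made for the example; the article does not describe the format or scoring CyberSOCEval actually uses.

```python
import re

# Hypothetical reply format: "Answer: A, C" (letters A to J, for up to 10 options).
ANSWER_PATTERN = re.compile(r"answer\s*:\s*([A-J](?:\s*,\s*[A-J])*)\b", re.IGNORECASE)


def parse_choices(response: str) -> frozenset:
    """Return the set of option letters found in a reply, or an empty set."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        # Off-format replies cannot be scored, hence the prompt adjustments.
        return frozenset()
    return frozenset(letter.upper() for letter in re.findall(r"[A-Ja-j]", match.group(1)))


def exact_match_accuracy(responses: list, answer_keys: list) -> float:
    """Share of replies whose parsed letters equal the answer key exactly."""
    if len(responses) != len(answer_keys):
        raise ValueError("responses and answer_keys must have the same length")
    if not responses:
        return 0.0
    hits = sum(
        parse_choices(reply) == frozenset(key)
        for reply, key in zip(responses, answer_keys)
    )
    return hits / len(responses)


replies = ["Answer: A, C", "answer: b", "I think it is D"]
keys = [{"A", "C"}, {"B", "D"}, {"D"}]
accuracy = exact_match_accuracy(replies, keys)
print(round(accuracy, 3))  # 0.333
```

In this toy run, the second reply misses one of two correct options and the third is off-format, so only the first one scores.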

Although the tests focus on execution logs, Meta and CrowdStrike argue that the findings should generalize to fileless attacks. Their main argument: most system events can be triggered by either method and will leave the same footprint in the logs. They point out, in particular:

Direct command execution in a terminal, which can be carried out programmatically via executables
The possibility of replacing a double-click on an executable with a command that launches it
GET requests, interchangeable with manual downloads

Threat intelligence: no clear edge for reasoning models

The threat intelligence tests were built from 45 reports drawn from four sources (CrowdStrike, CISA, NSA, IC3). Their content was extracted both as images and as text. The questions were generated by Llama 4 Maverick and Llama 3.2 90B, with up to six possible answers. Some were created from fixed categories (e.g., kill chain phases); others came from entity extraction by LLMs (e.g., how does malware Y using vulnerability X affect sector Z?).

The dataset includes 588 questions, among them:

If an attack is attributed to APT29, which indicator is the least reliable?
Which defensive control most effectively mitigates the misuse of Base64-encoded PowerShell commands?
To detect abnormal SSL certificates in the observed infrastructure, what is the best approach?

Across all models tested, results improve when the text modality is included, and even more so when the image modality is excluded. With reasoning models, scaling inference-time compute does not significantly boost performance, particularly compared with the gains this technique is reputed to bring in code and mathematics.

Linux, incident response, entity extraction… The expansion avenues for CyberSOCEval

Several expansion avenues for CyberSOCEval are under study. In particular, there is talk of extending coverage to Linux and mobile environments, adding telemetry for fileless attacks, and broadening the scope of malicious activities (notably exfiltration of sensitive data).

Meta and CrowdStrike also plan to add questions that place more emphasis on exploiting non-textual information. In the threat intelligence dimension, they hope to go beyond MCQs by extracting entities and mapping their relationships. The best-performing models could then potentially be used to automate the ingestion of structured data. However, the ontology issue must be addressed: at present, there is no universal framework to describe and relate security concepts beyond the MITRE ATT&CK matrix.

A benchmark for incident response is also on the roadmap. The challenge here lies in the subjective nature of threat prioritization by SOC analysts. It will also be necessary to provide adequate examples of business context.

