Generating Feature-Rich Emails for Benchmarking LLMs

[Submitted on 26 Nov 2025 (v1), last revised 20 Mar 2026 (this version, v5)]

Paper: The Phish, The Spam, and The Valid: Generating Feature-Rich Emails for Benchmarking LLMs, by Rebeka Toth and 2 other authors

Abstract: In this paper, we introduce a metadata-enriched generation framework (PhishFuzzer) that seeds real emails into Large Language Models (LLMs) to produce 23,100 diverse, structurally consistent email variants across controlled entity and length dimensions. Unlike prior corpora, our dataset uses strict three-class labels (Phishing, Spam, Valid), provides full URL and attachment metadata, and annotates each email with attacker intent. Using this dataset, we benchmark two state-of-the-art LLMs (Qwen-2.5-72B and Gemini-3.1-Pro) under both Basic (body, subject) and Full (+URL, sender, attachment) settings. By applying formal confidence metrics (Task Success Rate and Confidence Index), we analyze model reliability, robustness against linguistic fuzzing, and the impact of structural metadata on detection accuracy. Our fully open-source framework and dataset provide a rigorous foundation for evaluating next-generation email security systems. To support open science, we make the PhishFuzzer Dataset, the generation scripts, and the prompts available on GitHub: this https URL
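To make the Basic and Full settings concrete, the following is a minimal sketch of how one labeled email might be represented and reduced to each view. The class name `EmailRecord`, its field names, and the helper functions `basic_view` and `full_view` are illustrative assumptions, not the schema or code of the PhishFuzzer dataset; only the three labels and the split between (body, subject) and (+URL, sender, attachment) come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# The three classes named in the abstract.
LABELS = ("Phishing", "Spam", "Valid")


@dataclass
class EmailRecord:
    """Hypothetical record layout; the real dataset schema may differ."""
    subject: str
    body: str
    sender: Optional[str] = None
    urls: List[str] = field(default_factory=list)
    attachments: List[str] = field(default_factory=list)
    label: str = "Valid"
    intent: Optional[str] = None  # attacker intent annotation, if any

    def __post_init__(self) -> None:
        # Reject anything outside the strict three-class scheme.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def basic_view(email: EmailRecord) -> Dict[str, object]:
    """Basic setting: only the subject and body are shown to the model."""
    return {"subject": email.subject, "body": email.body}


def full_view(email: EmailRecord) -> Dict[str, object]:
    """Full setting: Basic fields plus URL, sender, and attachment metadata."""
    view = basic_view(email)
    view.update(
        sender=email.sender,
        urls=list(email.urls),
        attachments=list(email.attachments),
    )
    return view


# Made-up example record, for illustration only.
example = EmailRecord(
    subject="Invoice overdue",
    body="Please review the attached invoice.",
    sender="billing@example.com",
    urls=["http://example.com/pay"],
    attachments=["invoice.pdf"],
    label="Phishing",
    intent="credential theft",
)
```

Keeping the label and intent out of both views reflects the assumption that a benchmarked model sees only the input fields, while the annotations are held back for scoring.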

Submission history

From: Rebeka Toth
[v1] Wed, 26 Nov 2025 14:40:06 UTC (588 KB)
[v2] Sat, 3 Jan 2026 10:37:31 UTC (588 KB)
[v3] Mon, 26 Jan 2026 11:12:45 UTC (1 KB) (withdrawn)
[v4] Wed, 11 Feb 2026 15:59:56 UTC (1 KB) (withdrawn)
[v5] Fri, 20 Mar 2026 14:23:00 UTC (413 KB)
