
    How a Neural Network Learned Its Own Fraud Rules: A Neuro-Symbolic AI Experiment

    By Awais · March 18, 2026

    Traditional fraud detection systems inject rules written by humans. But what if a neural network could discover those rules itself?

    In this experiment, I extend a hybrid neural network with a differentiable rule-learning module that automatically extracts IF-THEN fraud rules during training. On the Kaggle Credit Card Fraud dataset (0.17% fraud rate), the model learned interpretable rules such as:

    IF V14 < −1.5σ AND V4 > +0.5σ → Fraud

    where σ denotes the feature standard deviation after normalization.

    The rule learner achieved ROC-AUC 0.933 ± 0.029, while maintaining 99.3% fidelity to the neural network’s predictions.

    Most interestingly, the model independently rediscovered V14 — a feature long known by analysts to correlate strongly with fraud — without being told to look for it.

    This article presents a reproducible neuro-symbolic AI experiment showing how a neural network can discover interpretable fraud rules directly from data.

    Full code: github.com/Emmimal/neuro-symbolic-ai-fraud-pytorch

    What the Model Discovered

    Before the architecture, the loss function, or any training details — here is what came out the other end.

    After up to 80 epochs of training (with early stopping; most seeds converged between epochs 56 and 78), the rule learner produced the following rules in the two seeds where rules emerged clearly:

    Seed 42 — cleanest rule (5 conditions, conf=0.95)


    Learned Fraud Rule — Seed 42 · Rules were never hand-coded

    IF   V14 < −1.5σ
    AND V4  > +0.5σ
    AND V12 < −0.9σ
    AND V11 > +0.5σ
    AND V10 < −0.8σ
    
    THEN FRAUD

    Seed 7 — complementary rule (8 conditions, conf=0.74)

    Learned Fraud Rule — Seed 7 · Rules were never hand-coded

    IF   V14 < −1.6σ
    AND V12 < −1.3σ
    AND V4  > +0.3σ
    AND V11 > +0.5σ
    AND V10 < −1.0σ
    AND V3  < −0.8σ
    AND V17 < −1.5σ
    AND V16 < −1.0σ
    
    THEN FRAUD

    In both cases, low values of V14 sit at the heart of the logic — a striking convergence given zero prior guidance.
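    Read as code, the seed-42 rule is just a predicate over standardized features. A minimal sketch (the thresholds are the learned values quoted above; the example transactions are invented for illustration):

```python
def seed42_rule(x):
    """Seed-42 rule as a predicate over z-scored features (thresholds in units of sigma)."""
    return (x["V14"] < -1.5 and x["V4"] > 0.5 and x["V12"] < -0.9
            and x["V11"] > 0.5 and x["V10"] < -0.8)

# An invented transaction matching the fraud signature fires the rule;
# one near the feature means does not.
fraud_like = {"V14": -2.1, "V4": 0.9, "V12": -1.4, "V11": 0.8, "V10": -1.2}
normal = {"V14": 0.1, "V4": -0.2, "V12": 0.3, "V11": 0.0, "V10": 0.2}
print(seed42_rule(fraud_like), seed42_rule(normal))  # True False
```

    That readability is the entire payoff: the condition list can be audited line by line.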

    The model was never told which feature mattered.

    Yet it independently rediscovered the same feature human analysts have identified for years.

    A neural network discovering its own fraud rules is exactly the promise of neuro-symbolic AI: combining statistical learning with human-readable logic. The rest of this article explains how — and why the gradient kept finding V14 even when told nothing about it.

    From Injected Rules to Learned Rules — Why It Matters

    Every fraud model has a decision boundary. Fraud teams, however, operate using rules. The gap between them, between what the model learned and what analysts can read, audit, and defend to a regulator, is where compliance teams live and die.

    In my previous article in this series, I encoded two analyst rules directly into the loss function: if the transaction amount is unusually high and if the PCA signature is anomalous, treat the sample as suspicious. That approach worked. The hybrid model matched the pure neural net’s detection performance while remaining interpretable.

    But there was an obvious limitation I left unaddressed. I wrote those rules. I chose those two features because they made intuitive sense to me. Hand-coded rules encode what you already know. They are a good solution when fraud patterns are stable and domain knowledge is deep. They are a poor solution when fraud patterns are shifting, when the most important features are anonymized (as they are in this dataset), or when you want the model to surface signals you haven’t thought to look for.

    The natural next question: what features would the gradient choose, if given the freedom to choose?

    This pattern extends beyond fraud. Medical diagnosis systems need rules that doctors can verify before acting. Cybersecurity models need rules that engineers can audit. Anti-money laundering systems operate under regulatory frameworks requiring explainable decisions. In any domain combining rare events, domain expertise, and compliance requirements, the ability to extract auditable IF-THEN rules from a trained neural network is directly valuable.

    Architecturally, the change is surprisingly simple. You are not replacing the MLP; you are adding a second path that learns to express the MLP’s decisions as human-readable symbolic rules. The MLP trains normally. The rule module learns to agree with it, in symbolic form. That is the subject of this article: differentiable rule induction in ~250 lines of PyTorch, with no prior knowledge of which features matter.

    “You are not replacing the neural network. You are teaching it to explain itself.”

    The Architecture: Three Learnable Pieces

    The architecture keeps a standard neural network intact, but adds a second path that learns symbolic rules explaining the network’s decisions. The two paths run in parallel from the same input and their outputs are combined by a learnable weight α:

    [Figure: architecture diagram. Two parallel paths from the same 30-feature input: an MLP path (three fully connected layers with batch normalization) produces mlp_prob; a Learnable Discretizer feeding a Rule Learner produces rule_prob; the outputs merge as α·mlp + (1−α)·rule to give the final fraud probability.]
    The Hybrid Rule Learner runs two paths in parallel from the same 30-feature input. The MLP path handles detection; the rule path learns to explain it. α is a trainable scalar — not a hyperparameter. Image by Author.

    The MLP path is identical to the previous article: three fully connected layers with batch normalization. The rule path is new. Alpha is a learnable scalar that weights the two paths; it starts at 0.5 and is trained by gradient descent like any other parameter. After training, α converged to approximately 0.88 on average across seeds (range: 0.80–0.94): the model learned to weight the neural path at roughly 88% and the rule path at 12%. The rules are not replacing the MLP; they are a structured symbolic summary of what the MLP learned.
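    The α-combination itself is only a few lines. A minimal sketch (the class and variable names here are mine, not from the repository):

```python
import torch
import torch.nn as nn

class AlphaCombiner(nn.Module):
    """Blend MLP and rule probabilities with a learnable scalar alpha (init 0.5)."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # trained like any other parameter

    def forward(self, mlp_prob, rule_prob):
        a = self.alpha.clamp(0.0, 1.0)  # keep the blend a convex combination
        return a * mlp_prob + (1.0 - a) * rule_prob

combine = AlphaCombiner()
out = combine(torch.tensor([0.9]), torch.tensor([0.7]))
print(out)  # at init: 0.5 * 0.9 + 0.5 * 0.7 = 0.8
```

    Because α is an ordinary parameter, gradient descent decides how much weight the symbolic path deserves; nothing forces the 88/12 split reported above.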

    1. Learnable Discretizer

    Rules need binary inputs — is V14 below a threshold? yes or no. Neural networks need continuous, differentiable operations. The soft sigmoid threshold bridges both.

    For each feature f and each learnable threshold t:

    b_{f,t} = \sigma\!\left(\frac{x_f - \theta_{f,t}}{\tau}\right)

    Where:

    • x_f is the value of feature f for this transaction
    • θ_{f,t} is a learnable threshold, initialized randomly, trained by backpropagation
    • τ is the temperature: high early in training (exploratory), low later (crisp)
    • b_{f,t} is the soft binary output: “is feature f above threshold t?”

    The model learns three thresholds per feature, giving it three “cuts” per dimension. Each threshold is independent — the model can spread them across the feature’s range or concentrate them around the most discriminative cutpoint.

    [Figure: sigmoid curves at three learned thresholds (θ=−1.5, 0.0, +1.5), each plotted at τ=5.0 (soft) and τ=0.1 (crisp), with dashed lines marking the threshold positions.]
    The same sigmoid at τ=5.0 (blue) and τ=0.1 (orange), across three learned threshold positions. At high temperature, every feature value produces a gradient. At low temperature, the function is nearly a binary step — readable as a human condition. Image by Author.

    At τ=5.0 (epoch 0): the sigmoid is almost flat. Every feature value produces a gradient. The model explores freely. At τ=0.1 (epoch 79): the sigmoid is nearly a step function. Thresholds have committed. The boundaries are readable as human conditions.

    import torch
    import torch.nn as nn

    class LearnableDiscretizer(nn.Module):
        def __init__(self, n_features, n_thresholds=3):
            super().__init__()
            # One learnable threshold per (feature × bin)
            self.thresholds = nn.Parameter(
                torch.randn(n_features, n_thresholds) * 0.5
            )
            self.n_thresholds = n_thresholds
    
        def forward(self, x, temperature=1.0):
            # x: [B, F] → output: [B, F * n_thresholds] soft binary features
            x_exp = x.unsqueeze(-1)               # [B, F, 1]
            t_exp = self.thresholds.unsqueeze(0)  # [1, F, T]
            soft_bits = torch.sigmoid(
                (x_exp - t_exp) / temperature
            )
            return soft_bits.view(x.size(0), -1)  # [B, F*T]

    2. Rule Learner Layer

    Each rule is a weighted combination of binarized features, passed through a sigmoid:

    \text{rule}_r(x) = \sigma\!\left(\frac{\sum_i w_{r,i} \cdot b_i}{\tau}\right)

    The sign of each weight has a direct interpretation after tanh squashing:

    • w > +0.5 → feature must be HIGH for this rule to fire
    • w < −0.5 → feature must be LOW for this rule to fire
    • |w| < 0.5 → feature is irrelevant to this rule

    Rule extraction follows directly: threshold the absolute weight values after training to identify which features each rule uses. This is how IF-THEN statements emerge from continuous parameters — by reading the weight matrix.
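    A sketch of how such an extraction routine might look (the function name, shapes, and toy weights are assumptions for illustration, not the repository code):

```python
import numpy as np

def extract_rule(rule_w, thresholds, feature_names, w_cut=0.5):
    """Turn one rule's tanh-squashed weights into IF-THEN conditions.

    rule_w: flat weights of shape (n_features * n_thresholds,)
    thresholds: learned threshold values, shape (n_features, n_thresholds)
    """
    n_f, n_t = thresholds.shape
    w = rule_w.reshape(n_f, n_t)
    conds = []
    for f in range(n_f):
        for t in range(n_t):
            if w[f, t] > w_cut:        # bit must be HIGH: feature above threshold
                conds.append(f"{feature_names[f]} > {thresholds[f, t]:+.2f}")
            elif w[f, t] < -w_cut:     # bit must be LOW: feature below threshold
                conds.append(f"{feature_names[f]} < {thresholds[f, t]:+.2f}")
    return "IF " + " AND ".join(conds) + " THEN FRAUD" if conds else "(inactive rule)"

# Toy example: 2 features x 3 thresholds each
w = np.array([-0.8, 0.1, 0.0, 0.7, 0.0, 0.2])
th = np.array([[-1.46, 0.00, 1.00], [0.47, 1.00, 2.00]])
print(extract_rule(w, th, ["V14", "V4"]))  # IF V14 < -1.46 AND V4 > +0.47 THEN FRAUD
```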

    class RuleLearner(nn.Module):
        def __init__(self, n_bits, n_rules=4):
            super().__init__()
            # w_{r,i}: which binarized features matter for each rule
            self.rule_weights = nn.Parameter(
                torch.randn(n_rules, n_bits) * 0.1
            )
            # confidence: relative importance of each rule
            self.rule_confidence = nn.Parameter(torch.ones(n_rules))
    
        def forward(self, bits, temperature=1.0):
            w = torch.tanh(self.rule_weights)        # bounded in (-1, 1)
            logits = bits @ w.T                       # [B, R]
            rule_acts = torch.sigmoid(logits / temperature)  # [B, R]
            conf = torch.softmax(self.rule_confidence, dim=0)
            fraud_prob = (rule_acts * conf.unsqueeze(0)).sum(dim=1, keepdim=True)
            return fraud_prob, rule_acts
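    Wired together, the shapes flow as follows. The two classes are repeated here in compressed form so the snippet runs standalone (a sketch of the article's modules, not the repository code verbatim):

```python
import torch
import torch.nn as nn

class LearnableDiscretizer(nn.Module):
    def __init__(self, n_features, n_thresholds=3):
        super().__init__()
        self.thresholds = nn.Parameter(torch.randn(n_features, n_thresholds) * 0.5)

    def forward(self, x, temperature=1.0):
        soft = torch.sigmoid((x.unsqueeze(-1) - self.thresholds.unsqueeze(0)) / temperature)
        return soft.view(x.size(0), -1)

class RuleLearner(nn.Module):
    def __init__(self, n_bits, n_rules=4):
        super().__init__()
        self.rule_weights = nn.Parameter(torch.randn(n_rules, n_bits) * 0.1)
        self.rule_confidence = nn.Parameter(torch.ones(n_rules))

    def forward(self, bits, temperature=1.0):
        acts = torch.sigmoid(bits @ torch.tanh(self.rule_weights).T / temperature)
        conf = torch.softmax(self.rule_confidence, dim=0)
        return (acts * conf).sum(dim=1, keepdim=True), acts

disc = LearnableDiscretizer(n_features=30)     # 30 input features
rules = RuleLearner(n_bits=30 * 3, n_rules=4)  # 3 thresholds per feature

x = torch.randn(8, 30)                         # batch of 8 transactions
bits = disc(x, temperature=5.0)                # [8, 90] soft binary features
fraud_prob, rule_acts = rules(bits, temperature=5.0)
print(bits.shape, fraud_prob.shape, rule_acts.shape)
```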

    3. Temperature Annealing

    The temperature follows an exponential decay schedule:

    \tau(t) = \tau_{\text{start}} \cdot \left(\frac{\tau_{\text{end}}}{\tau_{\text{start}}}\right)^{t/T}

    With τ_start=5.0, τ_end=0.1, T=80 epochs:

    | Epoch | τ | State |
    |---|---|---|
    | 0 | 5.00 | Rules fully soft — gradient flows everywhere |
    | 40 | 0.69 | Rules tightening — thresholds committing |
    | 79 | 0.10 | Rules near-crisp — readable as IF-THEN |
    [Figure: exponential decay of τ from 5.0 at epoch 0 to 0.1 by epoch 79, annotated at epochs 0 (fully soft), 40 (tightening), and 79 (near-crisp).]
    Temperature τ decays exponentially across 80 epochs, from exploratory softness (τ=5.0) to near-binary crispness (τ=0.1). The shaded area shows the region where gradients are still informative. Image by Author.
    def get_temperature(epoch, total_epochs, tau_start=5.0, tau_end=0.1):
        progress = epoch / max(total_epochs - 1, 1)
        return tau_start * (tau_end / tau_start) ** progress

    Without annealing, the model stays soft and rules never crystallize into anything a fraud analyst can read or a compliance team can sign off on. Annealing is what converts a continuous optimization into a symbolic output.

    Before the loss function — a quick note on where this idea comes from, and what makes this implementation different from prior work.

    Standing on the Shoulders of ∂ILP, NeuRules, and FINRule

    It is worth situating this work in the existing literature, not as a full survey but to clarify what ideas are borrowed and what is new.

    Differentiable Inductive Logic Programming (∂ILP) introduced the core idea that inductive logic programming, traditionally a combinatorial search problem, can be reformulated as a differentiable program trained with gradient descent. The key idea borrowed here is the use of soft logical operators that allow gradients to flow through rule-like structures. However, ∂ILP requires predefined rule templates and background knowledge declarations, which makes it harder to integrate into standard deep learning pipelines.

    Recent work applying differentiable rules to fraud detection, such as FINRule, shows that rule-learning approaches can perform well even on highly imbalanced financial datasets. These studies demonstrate that learned rules can match hand-crafted detection logic while adapting more easily to new fraud patterns.

    Other systems such as RIFF and Neuro-Symbolic Rule Lists introduce decision-tree-style differentiable rules and emphasize sparsity to maintain interpretability. The L1 regularization used in this implementation follows the same principle: encouraging rules to rely on only a few conditions rather than all available features.

    The implementation in this article combines these ideas (differentiable discretization plus conjunction learning) but reduces them to roughly 250 lines of dependency-free PyTorch. No template language. No background knowledge declarations. The goal is a minimal rule-learning module that can be dropped into a standard training loop.

    Three-Part Loss: Detection + Consistency + Sparsity

    The full training objective:

    \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{BCE}} + \lambda_c \cdot \mathcal{L}_{\text{consistency}} + \lambda_s \cdot \mathcal{L}_{\text{sparsity}} + \lambda_{\text{conf}} \cdot \mathcal{L}_{\text{confidence}}

    L_BCE — Weighted Binary Cross-Entropy

    Identical to the previous article. pos_weight = count(y=0) / count(y=1) ≈ 578. One labeled fraud sample generates 578× the gradient of a non-fraud sample. This term is unchanged; the rule path adds no complexity to the core detection objective.
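    In PyTorch this weighting is a single argument. A sketch with the dataset's class counts (284,315 legitimate vs. 492 fraud transactions, so pos_weight ≈ 578; BCEWithLogitsLoss assumed as the concrete form):

```python
import torch
import torch.nn as nn

n_neg, n_pos = 284_315, 492                 # class counts in the dataset
pos_weight = torch.tensor([n_neg / n_pos])  # ≈ 578
print(round(pos_weight.item()))             # 578

# Each labeled fraud sample now contributes ~578x the gradient of a non-fraud one.
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
loss = criterion(torch.tensor([2.0, -1.0]), torch.tensor([1.0, 0.0]))
```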

    L_consistency — The New Term

    Rules should agree with the MLP where the MLP is confident. Operationally: MSE between rule_prob and mlp_prob, masked to predictions where the MLP is either clearly fraud (>0.7) or clearly non-fraud (<0.3):

    confident_mask = (mlp_prob > 0.7) | (mlp_prob < 0.3)
    if confident_mask.sum() > 0:
        consist_loss = F.mse_loss(
            rule_prob.squeeze()[confident_mask],
            mlp_prob.squeeze()[confident_mask].detach()  # ← critical
        )

    The .detach() is critical: we are teaching the rules to follow the MLP, not the other way around. The MLP remains the primary learner. The uncertain region (0.3–0.7) is deliberately excluded; that is where rules might catch something the MLP misses.

    L_sparsity — Keep Rules Simple

    L1 penalty on the raw (pre-tanh) rule weights: mean(|W_rules|). Without this, rules absorb all 30 features and become unreadable. With λ_s=0.25, the optimizer pushes irrelevant features toward zero while leaving genuinely useful features — V14, V4, V12 — at |w| ≈ 0.5–0.8 after tanh squashing.

    L_confidence — Kill Noise Rules

    A small L1 penalty on the confidence logits (λ_conf=0.01) drives low-confidence rules toward zero weight in the output combination, effectively eliminating them. Without this, multiple technically active but meaningless rules appear with confidence 0.02–0.04, obscuring the real signal.

    Final hyperparameters: λ_c=0.3, λ_s=0.25, n_rules=4, λ_conf=0.01.
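    Assembled, a training-step sketch of the full objective (the tensor names are assumptions; in the real loop they come from the hybrid model's forward pass):

```python
import torch
import torch.nn.functional as F

def total_loss(logits, y, mlp_prob, rule_prob, rule_weights, conf_logits,
               pos_weight, lam_c=0.3, lam_s=0.25, lam_conf=0.01):
    # 1) weighted BCE: the core detection objective
    bce = F.binary_cross_entropy_with_logits(logits, y, pos_weight=pos_weight)
    # 2) consistency: rules follow the MLP where it is confident
    mask = (mlp_prob > 0.7) | (mlp_prob < 0.3)
    consist = (F.mse_loss(rule_prob[mask], mlp_prob[mask].detach())
               if mask.any() else logits.new_zeros(()))
    # 3) sparsity: L1 on raw rule weights keeps rules short
    sparsity = rule_weights.abs().mean()
    # 4) confidence: L1 on confidence logits suppresses noise rules
    confidence = conf_logits.abs().mean()
    return bce + lam_c * consist + lam_s * sparsity + lam_conf * confidence

torch.manual_seed(0)
loss = total_loss(
    logits=torch.zeros(4), y=torch.tensor([0.0, 1.0, 0.0, 0.0]),
    mlp_prob=torch.tensor([0.9, 0.8, 0.1, 0.5]),
    rule_prob=torch.tensor([0.85, 0.6, 0.15, 0.4]),
    rule_weights=torch.randn(4, 90) * 0.1, conf_logits=torch.ones(4),
    pos_weight=torch.tensor([578.0]))
print(float(loss))
```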

    With the machinery in place, here is what it produced.

    Results: Does Rule Learning Work — and What Did It Find?

    Experimental Setup

    • Dataset: Kaggle Credit Card Fraud, 284,807 transactions, 0.173% fraud rate
    • Split: 70/15/15 stratified by class label, 5 random seeds [42, 0, 7, 123, 2024]
    • Threshold: F1-maximizing on validation set, applied symmetrically to test set
    • Same evaluation protocol as Article 1
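    The split itself is two stratified calls. A sketch with scikit-learn on toy data (X and y stand in for the real feature matrix and labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_70_15_15(X, y, seed):
    # Stratify on the label at both steps to preserve the rare fraud rate.
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)

# Toy demo: 1,000 samples with a 2% positive rate
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))
y = np.repeat([0, 1], [980, 20])
(_, y_tr), (_, y_val), (_, y_te) = split_70_15_15(X, y, seed=42)
print(len(y_tr), len(y_val), len(y_te))  # 700 150 150
```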

    Detection Performance

    [Figure: bar charts of F1 and PR-AUC for the Pure Neural baseline vs. the Rule Learner across 5 seeds, with standard-deviation error bars.]
    Detection performance across 5 random seeds (mean ± std). The Rule Learner sits approximately 1.5 F1 points below the pure neural baseline — a real but modest cost for a model that now produces auditable IF-THEN rules. Image by Author.
    | Model | F1 (mean ± std) | PR-AUC (mean ± std) | ROC-AUC (mean ± std) |
    |---|---|---|---|
    | Isolation Forest | 0.121 | 0.172 | 0.941 |
    | Pure Neural (Article 1) | 0.804 ± 0.020 | 0.770 ± 0.024 | 0.946 ± 0.019 |
    | Rule Learner (this article) | 0.789 ± 0.032 | 0.721 ± 0.058 | 0.933 ± 0.029 |

    Note: Isolation Forest numbers from Article 1 for reference. All other models evaluated with identical splits, thresholds, and seeds.

    The rule learner sits slightly below the pure neural baseline on all three detection metrics, approximately 1.5 F1 points on average. The tradeoff is explainability. The per-seed breakdown shows the full picture:

    | Seed | NN F1 | RL F1 | NN ROC | RL ROC | Fidelity | Coverage |
    |---|---|---|---|---|---|---|
    | 42 | 0.818 | 0.824 | 0.9607 | 0.9681 | 0.9921 | 0.8243 |
    | 0 | 0.825 | 0.832 | 0.9727 | 0.9572 | 0.9925 | 0.8514 |
    | 7 | 0.779 | 0.776 | 0.9272 | 0.9001 | 0.9955 | 0.7568 |
    | 123 | 0.817 | 0.755 | 0.9483 | 0.8974 | 0.9922 | 0.8108 |
    | 2024 | 0.779 | 0.759 | 0.9223 | 0.9416 | 0.9946 | 0.8108 |

    In seeds 42 and 0, the rule learner exceeds the pure neural baseline on F1. In seed 2024, it exceeds on ROC-AUC. The performance variance across seeds is the honest picture of what gradient-based rule induction produces on a 0.17% imbalanced dataset.

    Rule Quality — The New Contribution

    Three metrics. Each answers a different question a compliance officer would ask.

    Rule Fidelity — can I trust this rule set to represent the model’s actual decisions?

    def rule_fidelity(mlp_probs, rule_probs, threshold=0.5):
        mlp_preds  = (mlp_probs  > threshold).astype(int)
        rule_preds = (rule_probs > threshold).astype(int)
        return (mlp_preds == rule_preds).mean()

    Rule Coverage — what fraction of actual fraud does at least one rule catch?

    def rule_coverage(rule_acts, y_true, threshold=0.5):
        any_rule_fired = (rule_acts > threshold).any(axis=1)
        return any_rule_fired[y_true == 1].mean()

    Rule Simplicity — how many unique feature conditions per rule, after deduplication?

    import numpy as np

    def rule_simplicity(rule_weights_numpy, weight_threshold=0.50):
        # Divide by n_thresholds (=3) to get unique features,
        # the meaningful readability metric. Target: < 8.
        active = (np.abs(rule_weights_numpy) > weight_threshold).sum(axis=1)
        unique_features = np.ceil(active / 3.0)
        unique_features = unique_features[unique_features > 0]
        return float(unique_features.mean()) if len(unique_features) > 0 else 0.0
    | Metric | mean ± std | Target | Status |
    |---|---|---|---|
    | Fidelity | 0.993 ± 0.001 | > 0.85 | Excellent |
    | Coverage | 0.811 ± 0.031 | > 0.70 | Good |
    | Simplicity (unique features/rule) | 1.7 ± 2.1 | < 8 | See note |
    | α (final) | 0.880 ± 0.045 | n/a | MLP dominant |

    Note on simplicity: the mean is dominated by three seeds where the rule path collapsed entirely (simplicity = 0); in the two active seeds, rules used 5 and 8 conditions — comfortably readable.

    This highlights a real tension in differentiable rule learning: strong sparsity regularization produces clean rules when they appear, but can cause the symbolic path to go dark in some initializations. Reporting mean ± std across seeds rather than cherry-picking the best seed is essential precisely because of this variance.

    Fidelity at 0.993 means that in seeds where rules are active, they agree with the MLP on 99.3% of binary decisions — the consistency loss working exactly as designed.

    [Figure: left, validation PR-AUC per epoch for all five seeds; right, the temperature schedule as executed, dropping from 5.0 toward 0.1.]
    Left: validation PR-AUC across all five seeds throughout training. Right: the temperature schedule as actually executed — note that early stopping fired between epochs 56 and 78 depending on seed. Image by Author.

    The Extracted Rules — What the Gradient Found

    [Figure: the extracted seed-42 rule rendered as terminal-style text: IF V4 > 0.471 (+0.5σ) AND V10 < −0.774 (−0.8σ) AND V11 > 0.458 (+0.5σ) AND V12 < −0.861 (−0.9σ) AND V14 < −1.462 (−1.5σ) THEN FRAUD, confidence 0.95.]
    The complete rule extracted from seed 42 — five conditions, confidence 0.95. Every threshold was learned by backpropagation. None were written by hand. Image by Author.

    Both rules are shown in full at the top of this article. The short version: seed 42 produced a tight 5-condition rule (conf=0.95), seed 7 a broader 8-condition rule (conf=0.74). In both, V14 < −1.5σ (or −1.6σ) appears as the leading condition.

    The cross-seed feature analysis tallies how often each feature surfaced across the five seeds:

    | Feature | Appears in | Mean weighted score |
    |---|---|---|
    | V14 | 2/5 seeds | 0.630 |
    | V11 | 2/5 seeds | 0.556 |
    | V12 | 2/5 seeds | 0.553 |
    | V10 | 2/5 seeds | 0.511 |
    | V4 | 1/5 seeds | 0.616 |
    | V17 | 1/5 seeds | 0.485 |

    Even with only two seeds producing visible rules, V14 ranked first or second in both — a striking convergence given zero prior feature guidance. The model did not need to be told what to look for.

    “The model received 30 anonymized features and a gradient signal. It found V14 anyway.”

    What the Model Found — and Why It Makes Sense

    V14 is one of 28 PCA components extracted from anonymized credit card transaction data. Exactly what it represents is not public knowledge — that is the point of the anonymization. What multiple independent analyses have established is that V14 has the highest absolute correlation with the fraud label of any feature in the dataset.

    Why did the rule learner find it? The mechanism is the consistency loss. By training rules to agree with the MLP’s confident predictions, the rule learner is reading the MLP’s internal representations and translating them into symbolic form. The MLP had already learned from the labels that V14 was important. The consistency loss transferred that signal into the rule weight matrix. Temperature annealing then hardened that weight into a crisp threshold condition.

    This is the fundamental difference between Rule Injection (Article 1) and Rule Learning (this article). Rule injection encodes what you already know. Rule learning discovers what you don’t. In this experiment, the discovery was V14 — a signal the gradient found independently, without being told to look for it.

    Across five seeds, readable rules emerged in two — consistently highlighting V14. That is a powerful demonstration that gradient descent can rediscover domain-critical signals without being told to look for them.

    [Figure: histograms of predicted fraud probability; non-fraud mass concentrates sharply near 0, fraud near 1, with slight overlap.]
    Predicted fraud probability distributions for seed 42. The model learned to push non-fraud toward 0 and fraud toward 1 with very little overlap — the bimodal separation that good calibration on imbalanced data looks like. Image by Author.

    A compliance team can now read Rule 1, verify that V14 < −1.5σ makes domain sense, and sign off on it — without opening a single weight matrix. That is what neuro-symbolic rule learning is for.

    Four Things to Watch Before Deploying This

    • Annealing speed is your most sensitive hyperparameter. Too fast: rules crystallize before the MLP has learned anything — you get crisp nonsense. Too slow: τ never falls low enough and rules stay soft. Treat τ_end as the first parameter to tune on a new dataset.
    • n_rules sets your interpretability budget. Above 8–10 rules, you have a lookup table, not an auditable rule set. Below 4, you may miss tail fraud patterns. The sweet spot for compliance use is 4–8 rules.
    • The consistency threshold assumes a calibrated MLP. If your base MLP is poorly calibrated — common on severely imbalanced data — the mask fires too rarely. Run a calibration plot on validation outputs. Consider Platt scaling if calibration is poor.
    • Learned rules need auditing after every retrain. Unlike frozen hand-coded rules, learned rules update whenever the model retrains. The compliance team cannot sign off once and walk away — the sign-off must happen every retrain cycle.
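    For the calibration point, a quick reliability check plus Platt-style rescaling can be sketched with scikit-learn (val_probs and val_y stand in for real validation outputs; the data here is synthetic):

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for validation labels and model probabilities
rng = np.random.default_rng(0)
val_y = (rng.random(2000) < 0.1).astype(int)
val_probs = np.clip(val_y * 0.7 + rng.random(2000) * 0.3, 1e-6, 1 - 1e-6)

# Reliability curve: large gaps between the two arrays signal miscalibration
frac_pos, mean_pred = calibration_curve(val_y, val_probs, n_bins=10)

# Platt scaling: logistic regression on the logit of the raw scores
logit = np.log(val_probs / (1 - val_probs)).reshape(-1, 1)
platt = LogisticRegression(max_iter=1000).fit(logit, val_y)
calibrated = platt.predict_proba(logit)[:, 1]
print(calibrated.shape)
```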

    Rule Injection vs. Rule Learning — When to Use Which

    | Situation | Use |
    |---|---|
    | Strong domain knowledge, stable fraud patterns | Rule Injection (Article 1) |
    | Unknown or shifting fraud patterns | Rule Learning (this article) |
    | Compliance requires auditable, readable rules | Rule Learning |
    | Fast experiment, minimal engineering overhead | Rule Injection |
    | End-to-end interpretability pipeline | Rule Learning |
    | Small dataset (<10k samples) | Rule Injection — consistency loss needs signal |

    The rule learner adds approximately 250 lines of code and a hyperparameter sweep. It is not free. On very small datasets, the consistency loss may not accumulate enough signal to learn meaningful rules — validate fidelity before treating extracted rules as authoritative. The approach is a tool, not a solution.

    One honest observation from the five-seed experiment: in 3 of 5 seeds, strong sparsity pressure drove all rule weights below the extraction threshold. The model converged to the right detection answer but expressed it purely through the MLP path. This variance is real. Single-seed results would give a misleadingly clean picture — which is why multi-seed evaluation is non-negotiable for any paper or article making claims about learned rule behavior.

    The next question in this series is whether these extracted rules can flag concept drift — detecting when fraud patterns have shifted enough that the rules need updating before model performance degrades. When V14’s importance drops in the rule weights while detection metrics hold steady, the fraud distribution may be changing. That early warning signal is the subject of the next article.

    Disclosure

    This article is based on independent experiments using publicly available data (Kaggle Credit Card Fraud dataset, CC-0 Public Domain) and open-source tools (PyTorch, scikit-learn). No proprietary datasets, company resources, or confidential information were used. The results and code are fully reproducible as described, and the GitHub repository contains the complete implementation. The views and conclusions expressed here are my own and do not represent any employer or organization.

    References

    [1] Evans, R., & Grefenstette, E. (2018). Learning Explanatory Rules from Noisy Data. JAIR, 61, 1–64. https://arxiv.org/abs/1711.04574

    [2] Wolfson, B., & Acar, E. (2024). Differentiable Inductive Logic Programming for Fraud Detection. arXiv preprint arXiv:2410.21928. https://arxiv.org/abs/2410.21928

    [3] Martins, J. L., Bravo, J., Gomes, A. S., Soares, C., & Bizarro, P. (2024). RIFF: Inducing Rules for Fraud Detection from Decision Trees. RuleML+RR 2024. arXiv:2408.12989. https://arxiv.org/abs/2408.12989

    [4] Xu, S., Walter, N. P., & Vreeken, J. (2024). Neuro-Symbolic Rule Lists. arXiv preprint arXiv:2411.06428. https://arxiv.org/abs/2411.06428

    [5] Kusters, R., Kim, Y., Collery, M., de Sainte Marie, C., & Gupta, S. (2022). Differentiable Rule Induction with Learned Relational Features. arXiv preprint arXiv:2201.06515. https://arxiv.org/abs/2201.06515

    [6] Dal Pozzolo, A. et al. (2015). Calibrating Probability with Undersampling for Unbalanced Classification. IEEE SSCI. Dataset: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud (CC-0)

    [7] Alexander, E. P. (2026). Hybrid Neuro-Symbolic Fraud Detection. Towards Data Science. https://towardsdatascience.com/hybrid-neuro-symbolic-fraud-detection-guiding-neural-networks-with-domain-rules/

    [8] Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation Forest. In 2008 Eighth IEEE International Conference on Data Mining (ICDM), pp. 413–422. IEEE. https://doi.org/10.1109/ICDM.2008.17

    [9] Paszke, A. et al. (2019). PyTorch. NeurIPS 32. https://pytorch.org

    [10] Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. JMLR, 12, 2825–2830. https://scikit-learn.org

    Code: github.com/Emmimal/neuro-symbolic-ai-fraud-pytorch

    Previous article: Hybrid Neuro-Symbolic Fraud Detection: Guiding Neural Networks with Domain Rules
