The Product Health Score: How I Reduced Critical Incidents by 35% with Unified Monitoring and n8n Automation

For SaaS (software as a service) companies, monitoring and managing their product data is crucial. For those who fail to understand this, by the time they notice an incident. Damage is already done. For struggling companies, this can be fatal.
To prevent this, I built an n8n workflow linked to their database that will analyze the data daily, spot if there is any incident. In this case, a log and notification system will start to investigate as soon as possible. I also built a dashboard so the team could see the results in real time.
Image by Yassin Zehar

Context

A B2B SaaS platform specializing in data visualization and automated reporting serves roughly 4500 customers, dispersed in three segments:

Small business
Mid-market
Enterprise

The weekly product usage exceeds 30000 active accounts with strong dependencies on real-time data (pipelines, APIs, dashboards, background jobs).

The product team works closely with:

Growth (acquisition, activation, onboarding)
Revenue (pricing, ARPU, churn)
SRE/Infrastructure (reliability, availability)
Data Engineering (pipelines, data freshness)
Support and customer success

Last year, the company observed a rising number of incidents. Between October and December, the total of incidents increased from 250 to 450, an 80% increase. With this increase, there were more than 45 High and critical incidents that affected thousands of users. The most affected metrics were:

api_error_rate
checkout_success_rate
net_mrr_delta
data_freshness_lag_minutes
churn_rate

When incidents occur, a company is judged by its customers based on how it handles and reacts. While the product team is esteemed for how they managed it and ensured it won’t happen again.

Having an incident once can happen, but having the same incident twice is a fault.

Business impact

More volatility in net recurring revenue
A noticeable decline in active accounts over several consecutive weeks
Multiple enterprise customers reporting and complaining about the outdated dashboard (up to 45+minutes late)

In total, between 30000 and 60000 users were impacted. Customer confidence in product reliability also suffered. Among non-renewals, 45% pointed out that’s their main reason.

Why is this issue critical?

As a data platform, the company cannot afford to have:

slow or stale data
Api error
Pipeline failures
Missed or delayed synchronizations
Inaccurate dashboard
Churns (downgrades, cancellations)

Internally, the incidents were spread across several systems:

Notions for product tracking
Slack for alerts
PostgreSQL for storage
Even on Google Sheets for customer support

There was not a single source of truth. The product team has to manually cross-reference and double-check all data, searching for trends and piecing together. It was an investigation and resolving a puzzle, making them lose so many hours weekly.

Solution: Automating an incident system alert with N8N and building a data dashboard. So, incidents are detected, tracked, resolved and understood.

Why n8n?

Currently, there are several automation platforms and solutions. But not all are matching the needs and requirements. Selecting the right one following the need is essential.

The specific requirements were to have access to a database without an API needed (n8n supports Api), to have visual workflows and nodes for a non-technical person to understand, custom-coded nodes, self-hosted options and cost-effective at scale. So, amongst the platforms existing like Zapier, Make or n8n, the choice was for the last.

Designing the Product Health Score

First, the key metrics need to be determined and calculated.

Impact score: simple function of severity + delta + scale of users

impact_score = (
            severity_weights[severity] * 10
            + abs(delta_pct) * 0.8
            + np.log1p(affected_users)
        )
        impact_score = round(float(impact_score), 2)

Priority: derived from severity + impact

if severity == "critical" or impact_score > 60:
            priority = "P1"
        elif severity == "high" or impact_score > 40:
            priority = "P2"
        elif severity == "medium":
            priority = "P3"
        else:
            priority = "P4"

Product health score

def compute_product_health_score(incidents, metrics):
    """
    Score = 100 - sum(penalties)
    Production version handles 15+ factors
    """
    # Key insight: penalties have different max weights
    penalties = {
        'volume': min(40, incident_rate * 13),      # 40% max
        'severity': calculate_severity_sum(incidents), # 25% max  
        'users': min(15, log(users) / log(50000) * 15), # 15% max
        'trends': calculate_business_trends(metrics)    # 20% max
    }
    
    score = 100 - sum(penalties.values())
    
    if score >= 80: return score, "🟢 Stable"
    elif score >= 60: return score, "🟡 Under watch"
    else: return score, "🔴 At risk"

Designing the Automated Detection System with n8n

This system is composed of 4 streams:

Stream 1: retrieves recent revenue metrics, identifies unusual spikes in churn MRR, and creates incidents when needed.

const rows = items.map(item => item.json);

if (rows.length < 8) {
  return [];
}

rows.sort((a, b) => new Date(a.date) - new Date(b.date));

const values = rows.map(r => parseFloat(r.churn_mrr || 0));

const lastIndex = rows.length - 1;
const lastRow = rows[lastIndex];
const lastValue = values[lastIndex];

const window = 7;
const baselineValues = values.slice(lastIndex - window, lastIndex);

const mean = baselineValues.reduce((s, v) => s + v, 0) / baselineValues.length;
const variance = baselineValues
  .map(v => Math.pow(v - mean, 2))
  .reduce((s, v) => s + v, 0) / baselineValues.length;
const std = Math.sqrt(variance);

if (std === 0) {
  return [];
}

const z = (lastValue - mean) / std;
const deltaPct = mean === 0 ? null : ((lastValue - mean) / mean) * 100;

if (z > 2) {
  const anomaly = {
    date: lastRow.date,
    metric_name: 'churn_mrr',
    baseline_value: mean,
    actual_value: lastValue,
    z_score: z,
    delta_pct: deltaPct,
    severity:
      deltaPct !== null && deltaPct > 50 ? 'high'
      : deltaPct !== null && deltaPct > 25 ? 'medium'
      : 'low',
  };

  return [{ json: anomaly }];
}

return [];

Stream 2: Monitors feature usage metrics to detect sudden drops in adoption or engagement.
Incidents are logged with severity, context, and alerts to the product team.

Stream 3: For every open incident, collects additional context from the database (e.g., churn by country or plan), uses AI to generate a clear root cause hypothesis and suggested next steps, sends a summarized report to Slack and email and updates the incident

Stream 4: Every morning, the workflow compiles all incidents from the previous day, creates a Notion page for documentation and sends a report to the leadership team

We deployed similar detection nodes for 8 different metrics, adjusting the z-score direction based on whether increases or decreases were problematic.

The AI agent receives additional context through SQL queries (churn by country, by plan, by segment) to generate more accurate root cause hypotheses. And all of this data is gathered and sent in a daily email.

The workflow generates daily summary reports aggregating all incidents by metric and severity, distributed via email and Slack to stakeholders.

The dashboard

The dashboard is consolidating all signals into one place. An automatic product health score with a 0-100 base is calculated with:

incident volume
severity weighting
open vs resolved status
number of users impacted
business trends (MRR)
usage trends (active accounts)

A segment breakdown to identify which customer groups are the most affected:

A weekly heatmap and time series trend charts to identify recurring patterns:

And a detailed incident view composed by:

Business context
Dimension & segment
Root cause hypothesis
Incident type
An AI summary to accelerate communication and diagnoses coming from the n8n workflow

Diagnosis:

The product health score noted the actual product 24/100 with the status “at risk” with:

45 High & Critical incidents
36 incidents during the last 7 days
33,385 estimated affected usersNegative trend in churn and DAU
Several spikes in api_error_rate and drops in checkout_success_rate

Biggest impact per segments:

Enterprise → critical data freshness issues
Mid-Market → recurring incidents on feature adoption
SMB → fluctuations in onboarding & activation

Impact

The goal of this dashboard is not only to analyze incidents and identify patterns but to enable the organization to react faster with a detailed overview.

We noticed a 35% reduction in critical incidents after 2 months. SRE & DATA teams identified the recurring root cause of some major issues, thanks to the unified data, and were able to fix it and monitor the maintenance. Incident response time improved dramatically due to the AI summaries and all the metrics, allowing them to know where to investigate.

An AI-Powered Root Cause Analysis

Using AI can save a lot of time. Especially when an investigation is needed in different databases, and you don’t know where to start. Adding an AI agent in the loop can save you a considerable amount of time thanks to its speed of processing data. To obtain this, a detailed prompt is necessary because the agent will replace a human. So, to have the most accurate results, even the AI needs to understand the context and receive some guidance. Otherwise, it could investigate and draw irrelevant conclusions. Don’t forget to make sure you have a full understanding of the cause of the issue.

You are a Product Data & Revenue Analyst.

We detected an incident:
{{ $json.incident }}

Here is churn MRR by country (top offenders first):
{{ $json.churn_by_country }}

Here is churn MRR by plan:
{{ $json.churn_by_plan }}

1. Summarize what happened in simple business language.
2. Identify the most impacted segments (country, plan).
3. Propose 3-5 plausible hypotheses (product issues, price changes, bugs, market events).
4. Propose 3 concrete next steps for the Product team.

It is essential to note that once the results are obtained, a final check is necessary to ensure the analysis was correctly done. AI is a tool, but it can also go wrong, so do not only on it; it’s a helpful tool. For this system, the AI will suggest the top 3 likely root causes for each incident.

A better alignment with the leadership team and reporting based on the data. Everything became more data-driven with deeper analyses, not intuition or reports by segmentation. This also led to an improved process.

Conclusion & takeaways

In conclusion, building a product health dashboard has several benefits:

Detect negative trends (MRR, DAU, engagement) earlier
Reduce critical incidents by identifying root-cause patterns
Understand real business impact (users affected, churn risk)
Prioritize the product roadmap based on risk and impact
Align Product, Data, SRE, and Revenue around a single source of truth

That’s exactly what many companies lack: a unified data approach.

Using the n8n workflow helped in two ways: being able to resolve the issues as soon as possible and gather the data in one place. The automation tool helped reduce the time spent on this task as the business was still running.

Lessons for Product teams

Start simple: building an automation system and a dashboard needs to be clearly defined. You are not building a product for the customers, you are building a product for your collaborators. It is essential that you understand each team’s needs since they are your core users. With that in mind, have the product that will be your MVP and answer to all your needs first. Then you can improve it by adding features or metrics.
Unified metrics matter more than perfect detection: we have to keep in mind that it will be thanks to them that the time will be saved, along with understanding. Having good detection is essential, but if the metrics are inaccurate, the time saved will be wasted by the teams looking for the metrics scattered across different environments
Automation saves 10 hours per week of manual investigation: by automating some manual and recurring tasks, you will save hours investigating, as with the incident alert workflow, we know directly where to investigate first and the hypothesis of the cause and even some action to take.
Document everything: a proper and detailed documentation is a must and will allow all the parties involved to have a clear understanding and views about what is happening. Documentation is also a piece of data.

Who am I ?

I’m Yassin, a Project Manager who expanded into Data Science to bridge the gap between business decisions and technical systems. Learning Python, SQL, and analytics has enabled me to design product insights and automation workflows that connect what teams need with how data behaves. Let’s connect on Linkedin

What's Hot

At Least 32 People Dead After a Mine Bridge Collapsed Due to Overcrowding

Here’s how I turned a Raspberry Pi into an in-car media server

Beloved SF cat’s death fuels Waymo criticism

The Product Health Score: How I Reduced Critical Incidents by 35% with Unified Monitoring and n8n Automation

Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models

Generalizing Real-World Robot Manipulation via Generative Visual Transfer

CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents

Follow the AI Footpaths | Towards Data Science

Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration

Hallucinations in LLMs Are Not a Bug in the Data

At Least 32 People Dead After a Mine Bridge Collapsed Due to Overcrowding

Here’s how I turned a Raspberry Pi into an in-car media server

Beloved SF cat’s death fuels Waymo criticism

Post, Story, and Reels Dimensions

How nonprofits can build a digital presence that actually drives impact

How Google Profits From Demand You Already Own

Extra-Creamy Deviled Eggs Recipe | Epicurious

Vibe Coding Plugins? Validate With Official WordPress Plugin Checker

Generalizing Real-World Robot Manipulation via Generative Visual Transfer

Most Popular

13 Trending Songs on TikTok in Nov 2025 (+ How to Use Them)

How to watch the 2026 GRAMMY Awards online from anywhere

Corporate Reputation Management Strategies | Sprout Social

Our Picks

At Least 32 People Dead After a Mine Bridge Collapsed Due to Overcrowding

Here’s how I turned a Raspberry Pi into an in-car media server

Beloved SF cat’s death fuels Waymo criticism

Subscribe to Updates

What's Hot

The Product Health Score: How I Reduced Critical Incidents by 35% with Unified Monitoring and n8n Automation

Context

Business impact

Why n8n?

Designing the Product Health Score

Designing the Automated Detection System with n8n

The dashboard

Impact

An AI-Powered Root Cause Analysis

Conclusion & takeaways

Lessons for Product teams

Who am I ?

Related Posts

Subscribe to Updates