
    Production-Grade Observability for AI Agents: A Minimal-Code, Configuration-First Approach

By Awais | December 17, 2025

    As AI agents grow more complex, traditional logging and monitoring fall short. What teams actually need is observability: the ability to trace agent decisions, evaluate response quality automatically, and detect drift over time, without writing and maintaining large amounts of custom evaluation and telemetry code.

    Teams should therefore adopt the right observability platform, and integrate their application with it at minimal overhead to the functional code, so they can focus on the core task of building and improving the agents’ orchestration. In this article, I will demonstrate how you can set up an open-source AI observability platform to do the following with a minimal-code approach:

    • LLM-as-a-Judge: Configure pre-built evaluators to score responses for Correctness, Relevance, Hallucination, and more. Display scores across runs with detailed logs and analytics.
    • Testing at scale: Set up datasets to store regression test cases for measuring accuracy against expected ground truth responses. Proactively detect LLM and agent drift.
    • MELT data: Track metrics (latency, token usage, model drift), events (API calls, LLM calls, tool usage), and logs (user interactions, tool executions, agent decisions) with detailed traces – all without writing detailed telemetry and instrumentation code.

    We will be using Langfuse for observability. It is open-source and framework-agnostic and can work with popular orchestration frameworks and LLM providers.   

    Multi-agent application

    For this demonstration, I have attached the LangGraph code of a Customer Service application. The application accepts tickets from the user, classifies each as Technical, Billing, or Both using a Triage agent, then routes it to the Technical Support agent, the Billing Support agent, or both. A Finalizer agent then synthesizes the agents’ responses into a coherent, more readable format. The flowchart is as follows:

    Customer Service agentic application
    The code is attached here:
    # --------------------------------------------------
    # 0. Load .env
    # --------------------------------------------------
    from dotenv import load_dotenv
    load_dotenv(override=True)
    
    # --------------------------------------------------
    # 1. Imports
    # --------------------------------------------------
    import os
    from typing import TypedDict
    
    from langgraph.graph import StateGraph, END
    from langchain_openai import AzureChatOpenAI
    
    from langfuse import Langfuse
    from langfuse.langchain import CallbackHandler
    
    # --------------------------------------------------
    # 2. Langfuse Client (WORKING CONFIG)
    # --------------------------------------------------
    langfuse = Langfuse(
        host="https://cloud.langfuse.com",
        public_key=os.environ["LANGFUSE_PUBLIC_KEY"] , 
        secret_key=os.environ["LANGFUSE_SECRET_KEY"]  
    )
    langfuse_callback = CallbackHandler()
    os.environ["LANGGRAPH_TRACING"] = "false"
    
    
    # --------------------------------------------------
    # 3. Azure OpenAI Setup
    # --------------------------------------------------
    llm = AzureChatOpenAI(
        azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
        api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
        temperature=0.2,
        callbacks=[langfuse_callback],  # 🔑 enables token usage
    )
    
    # --------------------------------------------------
    # 4. Shared State
    # --------------------------------------------------
    class AgentState(TypedDict, total=False):
        ticket: str
        category: str
        technical_response: str
        billing_response: str
        final_response: str
    
    # --------------------------------------------------
    # 5. Agent Definitions
    # --------------------------------------------------
    
    def triage_agent(state: dict) -> dict:
        with langfuse.start_as_current_observation(
            as_type="span",
            name="triage_agent",
            input={"ticket": state["ticket"]},
        ) as span:
            span.update_trace(name="Customer Service Query - LangGraph Demo") 
    
            response = llm.invoke([
                {
                    "role": "system",
                    "content": (
                        "Classify the query as one of: "
                        "Technical, Billing, Both. "
                        "Respond with only the label."
                    ),
                },
                {"role": "user", "content": state["ticket"]},
            ])
    
            raw = response.content.strip().lower()
    
            if "both" in raw:
                category = "Both"
            elif "technical" in raw:
                category = "Technical"
            elif "billing" in raw:
                category = "Billing"
            else:
                category = "Technical"  # ✅ safe fallback
    
            span.update(output={"raw": raw, "category": category})
    
            return {"category": category}
    
    
    
    def technical_support_agent(state: dict) -> dict:
        with langfuse.start_as_current_observation(
            as_type="span",
            name="technical_support_agent",
            input={
                "ticket": state["ticket"],
                "category": state.get("category"),
            },
        ) as span:
    
            response = llm.invoke([
                {
                    "role": "system",
                    "content": (
                        "You are a technical support specialist. "
                        "Provide a clear, step-by-step solution."
                    ),
                },
                {"role": "user", "content": state["ticket"]},
            ])
    
            answer = response.content
    
            span.update(output={"technical_response": answer})
    
            return {"technical_response": answer}
    
    
    def billing_support_agent(state: dict) -> dict:
        with langfuse.start_as_current_observation(
            as_type="span",
            name="billing_support_agent",
            input={
                "ticket": state["ticket"],
                "category": state.get("category"),
            },
        ) as span:
    
            response = llm.invoke([
                {
                    "role": "system",
                    "content": (
                        "You are a billing support specialist. "
                        "Answer clearly about payments, invoices, or accounts."
                    ),
                },
                {"role": "user", "content": state["ticket"]},
            ])
    
            answer = response.content
    
            span.update(output={"billing_response": answer})
    
            return {"billing_response": answer}
    
    def finalizer_agent(state: dict) -> dict:
        with langfuse.start_as_current_observation(
            as_type="span",
            name="finalizer_agent",
            input={
                "ticket": state["ticket"],
                "technical": state.get("technical_response"),
                "billing": state.get("billing_response"),
            },
        ) as span:
    
            # Collect whichever agent responses are present
            parts = []
            if state.get("technical_response"):
                parts.append(f"Technical:\n{state['technical_response']}")
            if state.get("billing_response"):
                parts.append(f"Billing:\n{state['billing_response']}")
    
            if not parts:
                final = "Error: No agent responses available."
            else:
                response = llm.invoke([
                    {
                        "role": "system",
                        "content": (
                            "Combine the following agent responses into ONE clear, professional, "
                            "customer-facing answer. Do not mention agents or internal labels. "
                            f"Answer the user's query: '{state['ticket']}'."
                        ),
                    },
                    {"role": "user", "content": "\n\n".join(parts)},
                ])
                final = response.content
    
            span.update(output={"final_response": final})
            return {"final_response": final}
    
    
    # --------------------------------------------------
    # 6. LangGraph Construction 
    # --------------------------------------------------
    builder = StateGraph(AgentState)
    
    builder.add_node("triage", triage_agent)
    builder.add_node("technical", technical_support_agent)
    builder.add_node("billing", billing_support_agent)
    builder.add_node("finalizer", finalizer_agent)
    
    builder.set_entry_point("triage")
    
    # Conditional routing
    builder.add_conditional_edges(
        "triage",
        lambda state: state["category"],
        {
            "Technical": "technical",
            "Billing": "billing",
            "Both": "technical",
            "__default__": "technical",  # ✅ never dead-end
        },
    )
    
    # Sequential resolution
    builder.add_conditional_edges(
        "technical",
        lambda state: state["category"],
        {
            "Both": "billing",         # Proceed to billing if Both
            "__default__": "finalizer",
        },
    )
    builder.add_edge("billing", "finalizer")
    builder.add_edge("finalizer", END)
    
    graph = builder.compile()
    
    
    # --------------------------------------------------
    # 7. Main
    # --------------------------------------------------
    if __name__ == "__main__":
    
        print("===============================================")
        print(" Conditional Multi-Agent Support System (Ready)")
        print("===============================================")
        print("Enter 'exit' or 'quit' to stop the program.\n")
        
        while True:
            # Get user input for the ticket
            ticket = input("Enter your support query (ticket): ")
    
            # Check for exit command
            if ticket.lower() in ["exit", "quit"]:
                print("\nExiting the support system. Goodbye!")
                break
    
            if not ticket.strip():
                print("Please enter a non-empty query.")
                continue
                
            try:
                # --- Run the graph with the user's ticket ---
                result = graph.invoke(
                    {"ticket": ticket},
                    config={"callbacks": [langfuse_callback]},
                )
            
                # --- Print Results ---
                category = result.get('category', 'N/A')
                print(f"\n✅ Triage Classification: **{category}**")
                
                # Check which agents were executed based on the presence of a response
                executed_agents = []
                if result.get("technical_response"):
                    executed_agents.append("Technical")
                if result.get("billing_response"):
                    executed_agents.append("Billing")
                
                
                print(f"🛠️ Agents Executed: {', '.join(executed_agents) if executed_agents else 'None (Triage Failed)'}")
    
                print("\n================ FINAL RESPONSE ================\n")
                print(result["final_response"])
                print("\n" + "="*60 + "\n")
    
            except Exception as e:
                # This is important for debugging: print the exception type and message
                print(f"\nAn error occurred during processing ({type(e).__name__}): {e}")
                print("\nPlease try another query.")
                print("\n" + "="*60 + "\n")
    

    Observability Configuration

    To set up Langfuse, go to https://cloud.langfuse.com/ and create an account (a free Hobby tier with generous limits is available), then set up a Project. In the project settings, you can generate the public and secret keys which need to be provided at the beginning of the code. You also need to add an LLM connection, which will be used for the LLM-as-a-Judge evaluation.
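For reference, here is a minimal `.env` sketch for this setup. The Langfuse key names match those read in the demo code; the Azure entries are the standard variables the `AzureChatOpenAI` client reads from the environment, and all values below are placeholders:

```shell
# .env: loaded by load_dotenv(override=True) at the top of the demo script.
# All values are placeholders; substitute your own project keys.
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT_NAME=your-deployment-name
AZURE_OPENAI_API_VERSION=2025-01-01-preview
```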

    Langfuse project set up

    LLM-as-a-Judge setup

    This is the core of the performance-evaluation setup for agents. Here you can configure various pre-built evaluators from the Evaluator Library, which score responses on criteria such as Conciseness, Correctness, Hallucination, and Answer Critic. These should suffice for most use cases; otherwise, custom evaluators can also be set up. Here is a view of the Evaluator Library:

    Evaluator library

    Select the evaluator you wish to use, say Relevance. You can choose to run it for new or existing traces, or for Dataset runs. Review the evaluation prompt to ensure it satisfies your evaluation objective. Most importantly, the query, generation, and other variables should be correctly mapped to their sources (usually the Input and Output of the application trace). In our case, these are the ticket entered by the user and the response generated by the Finalizer agent, respectively. For Dataset runs, you can additionally compare the generated responses to the ground-truth responses stored as expected outputs (explained in the next sections).

    Here is the configuration for the ‘GT Accuracy’ evaluation I set up for new Dataset runs, along with the Variable mapping. The evaluation prompt preview is also depicted. Most of the evaluators score within a range of 0 to 1:

    Evaluator setup
    Evaluator prompt

    For the customer service demo, I have configured three evaluators: Relevance and Conciseness, which run for all new traces, and GT Accuracy, which runs for Dataset runs only.

    Active evaluators

    Datasets setup

    Create a dataset to use as a test-case repository. Here you can store test cases with the input query and the ideal expected response. To populate the dataset, there are three options: create one record at a time, upload a CSV of queries and expected responses, or, quite conveniently, add inputs and outputs directly from application traces whose responses human experts have judged to be of good quality.
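As a hedged sketch, dataset records can also be seeded programmatically; the Langfuse Python SDK documents `create_dataset` and `create_dataset_item` helpers for this. The example tickets and expected outputs below are illustrative placeholders, not the demo's actual records:

```python
# Hypothetical sketch: seeding a "Regression" dataset from code instead of the UI.

# Each test case pairs an input ticket with a ground-truth expected response.
TEST_CASES = [
    {
        "input": {"ticket": "The app crashes every time I open the settings page."},
        "expected_output": "A step-by-step troubleshooting guide for the crash.",
    },
    {
        "input": {"ticket": "I was charged twice for last month's subscription."},
        "expected_output": "An acknowledgement and the refund process for the duplicate charge.",
    },
]

def seed_regression_dataset(cases):
    """Create the dataset and upload each test case as a dataset item."""
    # Lazy import: assumes the langfuse SDK is installed and
    # LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY are set in the environment.
    from langfuse import Langfuse
    client = Langfuse()
    client.create_dataset(name="Regression")
    for case in cases:
        client.create_dataset_item(
            dataset_name="Regression",
            input=case["input"],
            expected_output=case["expected_output"],
        )

# Usage: seed_regression_dataset(TEST_CASES)
```

Seeding from code is convenient for bootstrapping; the trace-based "Add to datasets" flow described later remains the lower-effort option during normal operations.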

    Here is the dataset I have created for the demo. These are a mix of technical, billing, or ‘Both’ queries, and I have created all the records from application traces:

    Dataset view

    That’s it! The configuration is done and we are ready to run observability.

    Observability Results

    The Langfuse Home page is a dashboard of several useful charts. It shows, at a glance, the count of execution traces, scores and their averages, traces over time, model usage, and cost.

    Observability overview dashboard

    MELT data

    The most useful observability data is available under the ‘Tracing’ option, which displays summarized and detailed views of all executions. Here is a view of the dashboard depicting the time, name, input, output, and the crucial latency and token-usage metrics. Note that for every execution of our application, two evaluation traces are generated, one for each of the Conciseness and Relevance evaluators we set up.

    Tracing overview
    Conciseness and Relevance evaluation runs for each application execution

    Let’s look at the details of one execution of the Customer Service application. On the left panel, the agent flow is depicted both as a tree and as a flowchart. It shows the LangGraph nodes (agents) and the LLM calls, along with token usage. If our agents had tool calls or human-in-the-loop steps, they would be depicted here as well. Note that the evaluation scores for Conciseness and Relevance are also shown at the top: 0.40 and 1, respectively, for this run. Clicking on a score shows the reason behind it and a link to the evaluator trace.

    On the right, for each agent, LLM, and tool call, we can see the input and generated output. For instance, here we see that the query was categorized as ‘Both’; accordingly, the left chart shows that both the Technical and Billing Support agents were called, confirming our flow works as expected.

    Multi-agent trace

    At the top of the right-hand panel is the ‘Add to datasets’ button. At any step of the tree, clicking this button opens a panel like the one depicted below, where you can add the input and output of that step directly to a test dataset created in the previous section. This is a useful feature: during normal agent operations, human experts can add frequently occurring user queries and good responses to the dataset, building a regression-test repository with minimal effort. Later, when there is a major upgrade or release of the application, the Regression dataset can be run and the generated outputs scored against the expected outputs (ground truth) recorded here, using the ‘GT Accuracy’ evaluator we created during the LLM-as-a-Judge setup. This helps detect LLM drift (or agent drift) early and take corrective steps.

    Add to Dataset

    Here is one of the evaluation traces (Conciseness) for this application trace. The evaluator provides its reasoning for the 0.4 score it assigned to this response.

    Evaluator reasoning

    Scores

    The Scores option in Langfuse shows a list of all the evaluation runs from the various active evaluators, along with their scores. More pertinent is the Analytics dashboard, where two scores can be selected and metrics such as mean and standard deviation, along with trend lines, can be viewed.

    Scores dashboard
    Score analytics

    Regression testing

    With Datasets in place, we are ready to run regression testing using the test-case repository of queries and expected outputs. We have stored four queries in our Regression dataset, with a mix of technical, billing, and ‘Both’ queries.

    For this, we can run the attached code, which fetches the relevant dataset and runs the experiment. All test runs are logged along with their average scores. We can view the result of a selected test, with Conciseness, GT Accuracy, and Relevance scores for each test case, in one dashboard. As needed, the detailed trace can be accessed to see the reasoning behind a score.

    You can view the code here.
    import os

    from dotenv import load_dotenv
    from langchain_openai import AzureChatOpenAI
    from langfuse import Langfuse

    # Load environment variables
    load_dotenv(override=True)
    
    langfuse = Langfuse(
        host="https://cloud.langfuse.com",
        public_key=os.environ["LANGFUSE_PUBLIC_KEY"] , 
        secret_key=os.environ["LANGFUSE_SECRET_KEY"]  
    )
    
    llm = AzureChatOpenAI(
        azure_deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME"),
        api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
        temperature=0.2,
    )
    
    # Define the task that runs for each dataset item
    def my_task(*, item, **kwargs):
        question = item.input["ticket"]
        response = llm.invoke([{"role": "user", "content": question}])
        return response.content.strip()
     
    # Get dataset from Langfuse
    dataset = langfuse.get_dataset("Regression")
     
    # Run experiment directly on the dataset
    result = dataset.run_experiment(
        name="Production Model Test",
        description="Monthly evaluation of our production model",
        task=my_task # see above for the task definition
    )
     
    # Use format method to display results
    print(result.format())
    Test runs
    Scores for a test run

    Key Takeaways

    • AI observability does not need to be code-heavy.
      Most evaluation, tracing, and regression testing capabilities for LLM agents can be enabled through configuration rather than custom code, significantly reducing development and maintenance effort.
    • Rich evaluation workflows can be defined declaratively.
      Capabilities such as LLM-as-a-Judge scoring (relevance, conciseness, hallucination, ground-truth accuracy), variable mapping, and evaluation prompts are configured directly in the observability platform—without writing bespoke evaluation logic.
    • Datasets and regression testing are configuration-first features.
      Test case repositories, dataset runs, and ground-truth comparisons can be set up and reused through the UI or simple configuration, allowing teams to run regression tests across agent versions with minimal additional code.
    • Full MELT observability comes “out of the box.”
      Metrics (latency, token usage, cost), events (LLM and tool calls), logs, and traces are automatically captured and correlated, avoiding the need for manual instrumentation across agent workflows.
    • Minimal instrumentation, maximum visibility.
      With lightweight SDK integration, teams gain deep visibility into multi-agent execution paths, evaluation results, and performance trends—freeing developers to focus on agent logic rather than observability plumbing.

    Conclusion

    As LLM agents become more complex, observability is no longer optional. Without it, multi-agent systems quickly turn into black boxes that are difficult to evaluate, debug, and improve.

    An AI observability platform shifts this burden away from developers and application code. Using a minimal-code, configuration-first approach, teams can enable LLM-as-a-Judge evaluation, regression testing, and full MELT observability without building and maintaining custom pipelines. This not only reduces engineering effort but also accelerates the path from prototype to production.

    By adopting an open-source, framework-agnostic platform like Langfuse, teams gain a single source of truth for agent performance—making AI systems easier to trust, evolve, and operate at scale.

    Want to know more? The Customer Service agentic application presented here follows a manager-worker architecture pattern, which does not work out of the box in CrewAI. Read about how observability helped me fix this well-known issue with CrewAI’s manager-worker hierarchical process, by tracing agent responses at each step and refining them until the orchestration worked as it should. Full analysis here: Why CrewAI’s Manager-Worker Architecture Fails — and How to Fix It

    Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI

    All images and data used in this article are synthetically generated. Figures and code were created by the author.
