

LLM Observability with Langfuse and Helicone: Production Setup Guide

By Pavan Rangani · May 8, 2026 · AI & ML


Langfuse and Helicone: choosing the right stack

If you are running a non-trivial LLM workload in production without observability, you are flying blind through a thunderstorm. Langfuse versus Helicone is the comparison I get asked about every week, and the honest answer is that they solve overlapping but distinct problems. Therefore, this guide unpacks what each tool does well, where they hurt, and how to design a stack that survives a 10x traffic spike.

I have run both at scale, including a 40M-call-per-month support agent and a research workload chewing through 8 billion tokens monthly. Consequently, my recommendations are grounded in real bills, real incidents, and real on-call pages. The goal is not to declare a winner but to help you pick deliberately.

What LLM observability actually means

LLM observability extends classical observability with three new dimensions. First, prompt and completion capture, since the inputs and outputs are themselves the most important signal. Second, evaluation scores, because correctness is not boolean. Third, cost attribution, since token usage is now a first-class business metric.

Specifically, a production-grade setup must answer five questions on demand: what did the model see, what did it return, how long did it take, how much did it cost, and was the answer good. Furthermore, all five answers must be filterable by user, feature, model version, and prompt version. Anything less and you cannot debug regressions or optimize spend.

A trace tree showing parent-child spans for a multi-step agentic workflow.

Langfuse: open source, trace-first, evaluation-rich

Langfuse is an open-source observability platform that you can self-host or run on their cloud. Its core abstraction is the trace, a tree of nested spans that mirrors the OpenTelemetry model. Moreover, Langfuse natively understands LLM-specific concepts like prompt versions, generation costs, and human-graded scores.

The integration story is excellent. The Python and TypeScript SDKs hook into LangChain, LlamaIndex, and the OpenAI SDK with a single decorator. Additionally, the dataset feature lets you snapshot production traces, label them, and replay them against new prompt versions. As a result, you can answer “did my prompt change degrade quality” with hard data instead of vibes.

from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
from anthropic import Anthropic

langfuse = Langfuse()
client = Anthropic()

@observe(name="contract_qa_pipeline")
def answer_contract_question(contract_id: str, question: str, user_id: str):
    langfuse_context.update_current_trace(
        user_id=user_id,
        session_id=f"contract-${contract_id}",
        tags=["contract-qa", "production"],
        metadata={"contract_id": contract_id}
    )

    contract = load_contract(contract_id)
    prompt = langfuse.get_prompt("contract-qa", label="production")
    compiled = prompt.compile(contract=contract.text, question=question)

    with langfuse_context.start_as_current_observation(
        name="claude-generation",
        as_type="generation",
        model="claude-opus-4-7",
        input=compiled,
        metadata={"prompt_version": prompt.version}
    ) as gen:
        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=1024,
            messages=[{"role": "user", "content": compiled}]
        )
        gen.update(
            output=response.content[0].text,
            usage_details={
                "input": response.usage.input_tokens,
                "output": response.usage.output_tokens,
                "cache_read": response.usage.cache_read_input_tokens
            }
        )

    score = run_eval(question, response.content[0].text, contract.text)
    langfuse_context.score_current_observation(
        name="grounding",
        value=score.grounding,
        comment=score.rationale
    )
    return response.content[0].text

Helicone: proxy-first, zero-touch, cost-obsessed

Helicone takes a fundamentally different approach. Instead of an SDK that wraps your code, Helicone is a proxy that sits between your application and the LLM provider. Therefore, integration is a one-line base URL change, with no instrumentation work required. For teams shipping fast, this is genuinely magical.
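
For illustration, here is what that change looks like with the Anthropic SDK. This is a minimal sketch: the gateway URL, the Helicone-Auth header, and the HELICONE_API_KEY environment variable follow Helicone's documented proxy setup at the time of writing, so confirm them against the current docs before copying.

import os
from anthropic import Anthropic

# Route every call through Helicone's gateway instead of hitting Anthropic directly.
client = Anthropic(
    base_url="https://anthropic.helicone.ai",
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
    },
)

# From here on, each request is logged by Helicone with latency, token
# counts, cost, and cache status, with no further instrumentation work.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize clause 14.2 of this contract."}],
)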

The trade-off is depth. Helicone captures the request and response automatically, including latency, cost, and cache status. However, it has weaker support for nested spans, custom evaluations, and prompt versioning compared to Langfuse. In contrast, its strengths shine in cost analytics, user-level rate limiting, and request caching, where the proxy architecture is a natural fit.

For a team running a high-volume chat product where 90% of the value is “show me cost by user and catch latency regressions,” Helicone reaches that 90% in under an hour. Meanwhile, Langfuse will take a day or two but unlock substantially deeper analysis. Consult the official Helicone documentation and Langfuse documentation for current SDK details.

OpenTelemetry GenAI: the convergence layer

The OpenTelemetry GenAI semantic conventions, finalized in late 2025, define standard span attributes for LLM calls: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, and so on. Both Langfuse and Helicone now ingest OTLP traces that follow this spec, as do LangSmith, Datadog, and Honeycomb.

The practical implication is that you should instrument with OTel first and pick a backend second. As a result, switching vendors becomes a config change rather than a rewrite. Additionally, OpenLLMetry, an open-source library from Traceloop, provides drop-in OTel instrumentation for the major LLM SDKs and emits spans that all of these platforms understand.
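
To make the conventions concrete, here is a hand-rolled span that sets the gen_ai.* attributes directly. In practice OpenLLMetry or a vendor SDK emits the equivalent automatically; the tracer name and example prompt are illustrative, and the span name follows the conventions' "{operation} {model}" recommendation.

from anthropic import Anthropic
from opentelemetry import trace

tracer = trace.get_tracer("contract-qa")
client = Anthropic()

# One provider call wrapped in a span carrying the GenAI semantic-convention
# attributes that any OTLP backend can ingest.
with tracer.start_as_current_span("chat claude-opus-4-7") as span:
    span.set_attribute("gen_ai.system", "anthropic")
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", "claude-opus-4-7")
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=512,
        messages=[{"role": "user", "content": "Summarize clause 14.2"}],
    )
    span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)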

OpenTelemetry GenAI conventions standardize the span attributes across observability vendors.

Designing the trace hierarchy

A common mistake is recording each LLM call as a flat event. However, real applications are chains: retrieve, rerank, generate, validate, possibly retry. Therefore, your trace must mirror that structure with nested spans, each carrying its own latency, cost, and metadata.

The hierarchy I recommend has four levels. First, a session span covering the user’s conversation. Second, a request span per user message. Third, chain spans for retrieval, generation, and validation. Fourth, leaf spans per provider call. Furthermore, attach the prompt version ID and model version to every leaf, so regressions are pinpointable in a single query.
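
A sketch of that nesting with plain OpenTelemetry is below. The span names, the retrieve and generate helpers, and the session.id and app.prompt_version attributes are hypothetical; only the gen_ai.* keys come from the conventions.

from opentelemetry import trace

tracer = trace.get_tracer("support-agent")

def handle_message(session_id: str, user_message: str) -> str:
    # Request-level span; the session level is usually reconstructed later by
    # filtering on session.id rather than held open as a long-lived span.
    with tracer.start_as_current_span("request", attributes={"session.id": session_id}):
        with tracer.start_as_current_span("chain.retrieval"):
            docs = retrieve(user_message)  # hypothetical retrieval helper
        # Leaf span: one provider call, tagged with the versions you will
        # later filter regressions by.
        with tracer.start_as_current_span(
            "chat claude-opus-4-7",
            attributes={
                "gen_ai.request.model": "claude-opus-4-7",
                "app.prompt_version": "contract-qa@12",
            },
        ):
            return generate(user_message, docs)  # hypothetical generation helper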

Evaluation pipelines that catch regressions

Logging is necessary but insufficient. You also need automated evaluations running against curated datasets every time a prompt changes. Specifically, I run three eval categories on every prompt PR: deterministic checks like JSON validity, model-graded checks like grounding and helpfulness, and golden-set regression checks against 200 historical traces.

Langfuse handles this natively with its dataset and experiments features. Conversely, Helicone is typically paired with a separate eval framework such as Promptfoo or Inspect. For background on the broader system context, see my Spring AI RAG production guide and structured output patterns.
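
As a minimal sketch of the deterministic and golden-set categories, assuming a contract-qa-golden dataset and the answer_contract_question pipeline from earlier, the check below replays each item and fails the PR if JSON validity regresses. Model-graded checks and experiment run linking are left out for brevity.

import json
from langfuse import Langfuse

langfuse = Langfuse()

def json_validity(output: str) -> float:
    # Deterministic check: does the output parse as JSON at all?
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

# Replay the golden set against the candidate prompt version.
dataset = langfuse.get_dataset("contract-qa-golden")
scores = []
for item in dataset.items:
    output = answer_contract_question(
        contract_id=item.input["contract_id"],
        question=item.input["question"],
        user_id="eval-runner",
    )
    scores.append(json_validity(output))

pass_rate = sum(scores) / len(scores)
assert pass_rate >= 0.98, f"JSON validity regressed to {pass_rate:.1%}"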

Cost attribution and budget enforcement

Both tools track cost, but the depth differs. Helicone’s strength is real-time per-user cost dashboards and built-in rate limiters that throttle abusive accounts before they blow the budget. Meanwhile, Langfuse excels at cost-per-feature breakdowns when you tag traces with feature flags, which is invaluable for product-led growth analysis.
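
In that setup the guardrails ride along as request headers. A sketch, reusing the Helicone-routed client and the compiled prompt and user_id from the earlier pipeline: the header names, the Helicone-Property-Feature custom property, and the rate-limit policy syntax are taken from Helicone's docs as of this writing, so double-check them before enforcing real budgets.

# Per-request guardrail headers on the Helicone-proxied client from earlier.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[{"role": "user", "content": compiled}],
    extra_headers={
        "Helicone-User-Id": user_id,                  # per-user cost dashboards
        "Helicone-Property-Feature": "contract-qa",   # cost-by-feature breakdowns
        "Helicone-Cache-Enabled": "true",             # cache identical requests
        "Helicone-RateLimit-Policy": "10000;w=3600;s=user",  # 10k requests/hour per user
    },
)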

In practice, I run Helicone as the proxy for cost guardrails and rate limiting, and Langfuse for deep tracing and evaluation. Consequently, the two systems coexist: Helicone catches the runaway loop in real time, and Langfuse explains why it happened the next morning. Notably, both tools support the OpenTelemetry GenAI conventions, so duplicate ingestion is a config flag away.
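
Wiring the two together is mostly plumbing: route the provider client through Helicone and keep the Langfuse decorator on the calling function. A condensed sketch, with the same illustrative gateway URL and header as above:

import os
from anthropic import Anthropic
from langfuse.decorators import observe

# Helicone sees every request in real time via the proxy...
client = Anthropic(
    base_url="https://anthropic.helicone.ai",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

# ...while Langfuse records the same call as a trace for next-morning analysis.
@observe(name="contract_qa_pipeline")
def answer(question: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text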

In conclusion, getting LLM observability right with Langfuse and Helicone is less about picking a tool and more about deciding what questions you must answer in production. Start with OpenTelemetry instrumentation, choose Helicone for fast cost visibility, layer Langfuse on top for deep tracing and evaluation, and never ship a prompt change without a regression eval. As a result, your LLM platform becomes debuggable, predictable, and economically sustainable.
