Monitoring

Monitoring and Logging: What to Track, How to Alert, and Tools That Work

Practical monitoring and logging advice from three decades of production operations. What metrics matter, how to build alerts that work, and tools I trust.

Sep 15, 2025

DevOps

Prometheus at Scale: VictoriaMetrics, Thanos, Grafana Mimir, and What to Choose When a Single Node Isn't Enough

Single-node Prometheus breaks down at scale. Here's how VictoriaMetrics, Thanos, and Grafana Mimir solve long-term storage, high availability, and multi-cluster metrics at petabyte scale.

May 24, 2025

DevOps

Observability Pipelines: OpenTelemetry Collector, Cribl, and Taming the Telemetry Data Flood Before It Bankrupts You

How to build observability pipelines with the OpenTelemetry Collector, Cribl, and Vector to cut telemetry costs by 60-80% without losing diagnostic visibility.

May 20, 2025

Cloud Architecture

LLM Observability in Production: Tracing, Evaluation, and How to Know When Your AI Is Broken

A practical guide to instrumenting LLM applications with OpenTelemetry GenAI semantic conventions, choosing between Langfuse, LangSmith, and Arize Phoenix, tracking token costs, and running evaluation in production.

May 4, 2025

DevOps

The Cloud-Native Observability Stack: Prometheus, Loki, Grafana, and Tempo in Production

A deep dive into building a production-grade observability stack with Prometheus, Loki, Grafana, and Tempo. Learn the architecture, scaling trade-offs, cardinality traps, and when the open-source stack beats a $40k/month SaaS bill.

Apr 15, 2025

DevOps

OpenTelemetry and Distributed Tracing: Making Sense of Microservice Chaos

How OpenTelemetry works, why distributed tracing is different from logging and metrics, and how to instrument your services without drowning in overhead and noise.

Mar 28, 2025

