Running a 24/7 AI Agent: Lessons from 90 Days of Autonomous Operations

What we learned running an OpenClaw agent autonomously for three months — what worked, what broke, what surprised us, and the operational patterns that emerged.

By Maya

We deployed an OpenClaw agent in January and let it run. Not as an experiment — as a production system handling real work. Content production, email management, system monitoring, scheduling, and research.

Three months later, here's what we've learned. Some of it confirms what the marketing material promises. Some of it doesn't.

The Setup

  • Server: Hetzner CX31 (4 vCPU, 8GB RAM), Ubuntu 22.04
  • Agent: OpenClaw with Claude Sonnet as the primary model, Haiku for simple tasks, Opus for complex analysis
  • Channels: Telegram (primary), with cron jobs for automated tasks
  • Skills installed: web search, Google Workspace, weather, healthcheck, humanizer
  • Cron jobs: 12 recurring tasks, from morning briefings to weekly security audits

Month 1: Finding the Edges

The first two weeks were mostly about discovering where the agent interpreted our instructions differently than we expected.

Example: We told the agent to "check email and flag anything important." It flagged everything from clients as important, which was technically correct but not useful. We had to define "important" with specifics — unread emails from a named list of key contacts, emails containing words like "invoice," "deadline," or "issue," and any email marked high priority by the sender.

Specificity made all the difference. Vague instructions produced correct-but-useless behavior. Precise instructions produced genuinely helpful output.
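As an illustration, the three rules for "important" could be written down as a small Python check. This is a minimal sketch, not the agent's actual configuration: the `Email` dataclass, the `is_important` function, and the contact addresses are hypothetical names invented for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical values for illustration; a real list would live in the workspace files.
KEY_CONTACTS = {"client@example.com", "cfo@example.com"}
# Whole-word match (with optional plural) so "tissue" does not trigger on "issue".
KEYWORD_PATTERN = re.compile(r"\b(?:invoice|deadline|issue)s?\b", re.IGNORECASE)


@dataclass
class Email:
    sender: str
    subject: str
    body: str
    unread: bool = True
    high_priority: bool = False


def is_important(email: Email) -> bool:
    """Return True if the email matches any of the three rules."""
    # Rule 1: unread mail from a named key contact.
    from_key_contact = email.unread and email.sender.lower() in KEY_CONTACTS
    # Rule 2: subject or body mentions one of the trigger words.
    has_keyword = bool(KEYWORD_PATTERN.search(f"{email.subject} {email.body}"))
    # Rule 3: the sender marked the message high priority.
    return from_key_contact or has_keyword or email.high_priority
```

Writing the rules this explicitly, whether in code or in prose instructions, is what removes the ambiguity: each condition can be checked and adjusted on its own.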

What broke: The Gmail OAuth token expired after 7 days and nobody noticed for 3 days. The agent was silently skipping email checks because authentication failed. We fixed this by adding a health check that specifically tests Gmail connectivity and alerts on failure.
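The shape of such a health check can be sketched in a few lines of Python. The sketch assumes two callables that are not part of any real library: `probe`, which makes one cheap authenticated Gmail request and raises on failure, and `alert`, which delivers a message to a human (for example via Telegram).

```python
from typing import Callable


def check_gmail_health(
    probe: Callable[[], None],
    alert: Callable[[str], None],
) -> bool:
    """Run one cheap authenticated call and alert on any failure.

    `probe` and `alert` are hypothetical hooks supplied by the caller.
    """
    try:
        probe()
    except Exception as exc:  # broad on purpose: any failure here should reach a human
        alert(f"Gmail health check failed: {type(exc).__name__}: {exc}")
        return False
    return True
```

The point of the design is that a failed check is loud. The original failure was silent because the email task skipped its work instead of reporting that authentication had failed.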

Cost in month 1: $112 (mostly from over-using Opus during setup and testing).

Month 2: Building Routines

By week 5, the operational rhythm settled. The agent had 12 cron jobs running on schedule, a well-defined workstate queue, and clear autonomy zones. Human intervention dropped to about 15 minutes per day: reviewing flagged emails, approving content, and handling the occasional edge case.

What surprised us: The agent started catching things we missed. A server disk filling up at 3 AM. A client email that arrived at 11 PM mentioning a deadline we'd forgotten. A broken link in our published content that the weekly audit found. The 24/7 coverage isn't just a convenience — it catches time-sensitive issues that fall through cracks with human-only monitoring.

The content pipeline stabilized. Posts went through research → drafting → humanization → review → publication with minimal intervention. Quality was consistent. Not every post was brilliant, but none were embarrassing. About 1 in 8 posts needed manual editing beyond what the review agent caught — usually a code example that was slightly off or a claim that needed more nuance.

What broke: A cron job that checked HackerNews for relevant articles got stuck in a loop when the API was temporarily down. It kept retrying every 30 seconds for 4 hours, burning through API tokens on failed requests. We added timeout limits and exponential backoff after that.
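A retry wrapper with a capped attempt count and exponential backoff might look like the following. This is a generic sketch under stated assumptions, not the fix we deployed verbatim: `call_with_backoff` and its parameter defaults are illustrative, and the wrapped function is expected to set its own per-request timeout (for example the `timeout` argument of `urllib.request.urlopen`).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on failure with exponentially growing delays.

    Gives up and re-raises the last error after max_attempts tries,
    so a dead API can never cause an unbounded retry loop.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(delay)
            delay = min(delay * 2, max_delay)  # double the wait, up to the cap
    raise ValueError("max_attempts must be at least 1")
```

With the defaults above, a job that keeps failing waits 1, 2, 4, and 8 seconds between tries and then stops, instead of retrying every 30 seconds for hours.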

Cost in month 2: $87 (after optimizing model selection per task).

Month 3: Optimization

With the basics running smoothly, we focused on efficiency.

Model optimization was the biggest cost reduction. We mapped every task to the cheapest model that could handle it:

| Task | Before | After | Cost Impact |
|------|--------|-------|-------------|
| Email categorization | Sonnet | Haiku | -75% per task |
| Morning briefing | Sonnet | Sonnet | No change (needs quality) |
| Blog draft writing | Sonnet | Sonnet | No change |
| Security audit | Opus | Sonnet | -80% per task |
| File organization | Sonnet | Haiku | -75% per task |

Security audits on Sonnet instead of Opus? It works fine. The audit is mostly checking configuration values against known good practices — it doesn't need deep reasoning. The one task we kept on Opus: reviewing final content before publication, where the model catches subtle quality issues Sonnet misses.
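The resulting task-to-model mapping is simple enough to express as a lookup table. The sketch below is illustrative only: the task keys, `MODEL_FOR_TASK`, and `pick_model` are hypothetical names, and the values are model tiers, not real API model identifiers.

```python
# Hypothetical task names mapped to the model tier each one ended up on.
MODEL_FOR_TASK = {
    "email_categorization": "haiku",
    "morning_briefing": "sonnet",
    "blog_draft": "sonnet",
    "security_audit": "sonnet",
    "file_organization": "haiku",
    "final_content_review": "opus",
}

# Sonnet is the primary model, so unknown tasks fall back to it.
DEFAULT_MODEL = "sonnet"


def pick_model(task: str) -> str:
    """Return the model tier for a task, defaulting to the primary model."""
    return MODEL_FOR_TASK.get(task, DEFAULT_MODEL)
```

Keeping the mapping in one place makes it easy to audit which tasks still run on the most expensive tier and to move one down when testing shows a cheaper model is good enough.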

Prompt optimization cut token usage by about 30%. We trimmed SOUL.md from 3,000 tokens to 1,200 tokens by removing redundant instructions and moving reference material into separate files that load only when needed. Since the system prompt is included in every API call, a shorter prompt is cheaper on every call.
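To see why a shorter system prompt adds up, the saving can be estimated with simple arithmetic. In this sketch the call count and the price per million input tokens are placeholder assumptions, not measured figures or real pricing; only the 3,000 and 1,200 token sizes come from the text.

```python
def prompt_tokens_saved(old_tokens: int, new_tokens: int, calls: int) -> int:
    """Input tokens saved across `calls` API calls after trimming the prompt."""
    return (old_tokens - new_tokens) * calls


def prompt_cost_saved(
    old_tokens: int,
    new_tokens: int,
    calls: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Dollar saving for the same trim at a given (assumed) input price."""
    saved = prompt_tokens_saved(old_tokens, new_tokens, calls)
    return saved * usd_per_million_input_tokens / 1_000_000


# Hypothetical example: 1,000 calls at an assumed $3 per million input tokens.
tokens = prompt_tokens_saved(3_000, 1_200, calls=1_000)
dollars = prompt_cost_saved(3_000, 1_200, calls=1_000, usd_per_million_input_tokens=3.0)
```

Under those assumed numbers the trim saves 1.8 million input tokens per 1,000 calls. Substitute your own call volume and current pricing to get a real estimate.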

What broke: A sub-agent spawned by a content pipeline cron job produced a blog post that confidently cited a study that didn't exist. The review agent missed it because the citation format looked correct. We added a verification step that specifically checks whether cited URLs are real and accessible.
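A verification step of this kind can be sketched with the Python standard library. The function names here are hypothetical, and the check is deliberately narrow: it confirms only that a cited URL responds, not that the page supports the claim. Some servers also reject `HEAD` requests, so a production version would probably fall back to `GET`.

```python
import http.client
import re
import urllib.request
from typing import Callable

# Matches http(s) URLs up to whitespace or a closing bracket/quote.
URL_PATTERN = re.compile(r"https?://[^\s)>\]\"']+")


def extract_urls(text: str) -> list[str]:
    """Return the unique URLs in text, sorted, without trailing punctuation."""
    return sorted({match.rstrip(".,;:") for match in URL_PATTERN.findall(text)})


def url_is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request; treat any 2xx/3xx response as reachable."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "citation-check"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # Covers HTTP errors, DNS failures, timeouts, and malformed URLs.
        return False


def unreachable_citations(
    text: str,
    is_reachable: Callable[[str], bool] = url_is_reachable,
) -> list[str]:
    """Return every cited URL that fails the reachability check."""
    return [url for url in extract_urls(text) if not is_reachable(url)]
```

A draft with a non-empty `unreachable_citations` result would be held for human review instead of going to publication.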

Cost in month 3: $74.

Operational Patterns That Emerged

The Documentation Compound Effect

The more we documented in the agent's workspace files, the better it performed. At day 1, it was a general-purpose assistant. By day 90, it had accumulated so much context — preferred formatting, common mistakes to avoid, key contacts, project details, scheduling preferences — that it operated more like a trained employee than a tool.

This is the non-obvious advantage of file-based memory. Each note, each correction, each preference compounds into an increasingly capable agent.

Context Loss Is the Main Failure Mode

Every time the agent session resets (which happens after a certain amount of conversation), it needs to reload context from files. If the files don't capture the current state well, the agent fumbles.

Our solution: a workstate.md file that the agent updates constantly — what's in progress, what's done, what's blocked, what's next. On session restart, reading this file brings the agent back to speed in seconds.
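One way to keep such a file reliable is to write it atomically and in a fixed layout. The sketch below assumes a simple layout of our own invention (four `##` sections with bullet items); the section names follow the text, but the exact file format and the function names are hypothetical.

```python
import os
from pathlib import Path

# Section names taken from the description above; the layout itself is an assumption.
SECTIONS = ("In progress", "Done", "Blocked", "Next")


def render_workstate(state: dict[str, list[str]]) -> str:
    """Render the state as a small markdown document, one heading per section."""
    lines = ["# Workstate", ""]
    for section in SECTIONS:
        lines.append(f"## {section}")
        lines.extend(f"- {item}" for item in state.get(section, []))
        lines.append("")
    return "\n".join(lines)


def save_workstate(path: Path, state: dict[str, list[str]]) -> None:
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(render_workstate(state), encoding="utf-8")
    os.replace(tmp, path)


def load_workstate(path: Path) -> dict[str, list[str]]:
    """Parse the file back into a dict of section name to list of items."""
    state: dict[str, list[str]] = {section: [] for section in SECTIONS}
    current = None
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
        elif line.startswith("- ") and current in state:
            state[current].append(line[2:].strip())
    return state
```

The atomic rename matters because the file is rewritten constantly: a session reset in the middle of a write should still find the previous complete version.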

Human Oversight Scales Down, Not Away

Week 1: reviewed everything. Week 4: reviewed flagged items only. Week 12: reviewed weekly summaries plus anything the agent escalated.

You don't reach zero oversight. You reach efficient oversight. The agent handles volume; you handle judgment calls.

Cron Jobs Are the Backbone

Autonomous operation isn't about an agent sitting there thinking all day. It's about scheduled tasks that fire reliably. Our 12 cron jobs handle about 85% of the agent's daily work. The remaining 15% is ad-hoc requests through Telegram.

The cron schedule is the operational backbone. When a cron job breaks, that capability goes silent until someone notices. Monitoring cron health is as important as monitoring server health.
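A common way to monitor cron health is a heartbeat: each job records a timestamp when it finishes, and a separate check reports any job whose last heartbeat is too old. The following is a minimal sketch of that pattern, assuming one heartbeat file per job; the function names, the file naming scheme, and the job names are hypothetical.

```python
import time
from pathlib import Path
from typing import Optional


def record_heartbeat(heartbeat_dir: Path, job: str, now: Optional[float] = None) -> None:
    """Called at the end of a successful job run to record when it last worked."""
    heartbeat_dir.mkdir(parents=True, exist_ok=True)
    timestamp = time.time() if now is None else now
    (heartbeat_dir / f"{job}.heartbeat").write_text(str(timestamp))


def stale_jobs(
    heartbeat_dir: Path,
    max_age_seconds: dict[str, float],
    now: Optional[float] = None,
) -> list[str]:
    """Return jobs with no heartbeat, an unreadable one, or one older than allowed."""
    current = time.time() if now is None else now
    stale = []
    for job, max_age in sorted(max_age_seconds.items()):
        path = heartbeat_dir / f"{job}.heartbeat"
        try:
            last_run = float(path.read_text())
        except (FileNotFoundError, ValueError):
            stale.append(job)  # never ran, or the file is corrupt
            continue
        if current - last_run > max_age:
            stale.append(job)
    return stale
```

Because the check lists jobs explicitly in `max_age_seconds`, a job that never ran at all shows up as stale too, which is exactly the silent-failure case described above.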

Metrics After 90 Days

| Metric | Value |
|--------|-------|
| Total blog posts published | 197 |
| Emails categorized | ~5,400 |
| Email drafts generated | ~480 |
| Security audits completed | 12 |
| Cron job executions | ~3,200 |
| Average daily API cost | $2.80 |
| Unplanned downtime | ~6 hours total (3 incidents) |
| Posts requiring manual editing | 24 (~12%) |
| False urgent alerts | 7 |
| Missed genuine issues | 2 (both caught within 24h) |

What We'd Do Differently

Set up monitoring from day 1. We added health checks and alerting in week 3 after the Gmail token expired silently. Should have been there from the start.

Write the AGENTS.md more carefully upfront. We rewrote it four times in the first month as we discovered edge cases. Spending more time on initial definition would have saved rework.

Start with fewer cron jobs. We deployed 8 cron jobs in the first week. Three of them needed significant adjustment. Better to start with 3-4 and add incrementally.

Budget for hallucination checking. AI agents will occasionally invent facts, cite non-existent sources, or misinterpret data. Build verification into the pipeline, not as an afterthought.

Bottom Line

Running an AI agent in production for 90 days cost about $275 in total API fees plus $45 in hosting, roughly $320 in all. It handled work that would have taken a human 150-200 hours. The hourly equivalent: roughly $1.60 to $2.10 per hour, depending on which end of that estimate you use.
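The arithmetic behind the hourly figure can be checked in a few lines, using the monthly API costs reported earlier and the hosting total. The variable and function names are just for illustration.

```python
# Monthly API costs from months 1 to 3, plus hosting for the 90 days (USD).
monthly_api_costs = [112, 87, 74]
hosting_cost = 45

api_total = sum(monthly_api_costs)    # the "about $275" in API fees
total_cost = api_total + hosting_cost


def cost_per_hour(total_usd: float, human_hours: float) -> float:
    """Total spend divided by the human hours the work would have taken."""
    return total_usd / human_hours


low_estimate = cost_per_hour(total_cost, 200)   # if the work equals 200 human hours
high_estimate = cost_per_hour(total_cost, 150)  # if it equals only 150 human hours
```

The lower bound of the hours estimate gives the higher hourly cost, so the 200-hour figure is the optimistic end of the range.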

The quality isn't human-equivalent across all tasks. But for operational throughput — the boring, repetitive, time-sensitive stuff that eats your day — it's hard to beat the economics.

The key insight: autonomous AI operations work when you invest in documentation, monitoring, and gradual trust-building. Skip any of those three, and you get an expensive chatbot that occasionally breaks.

Ready to start? Our VPS setup guide gets you deployed in 25 minutes. Then set up your first cron jobs and let the compound effect begin.