last30days: Programming LLMs for Repeatable, Factual Output

A case study in using a 1700-line SKILL.md to prevent AI improvisation and ensure consistent information retrieval across multiple platforms.

June 11, 2026

The last30days skill demonstrates how a meticulously crafted, self-correcting prompt contract, rather than a superior model, keeps LLM output consistent, factual, and free of improvisation: it encodes past failures as explicit rules and enforces strict output formats.

Scene: same question, two different answers

Ask an AI assistant: "what's being said about Docker lately?"

With nothing else to go on, the model runs a couple of web searches and writes you a generic blog-style post: maybe titled "Docker: the last 30 days," maybe with a made-up "Sources:" block at the bottom, maybe with random subheadings. It looks like an article, but it's essentially a summary of what the model already remembered plus a few snippets grabbed on the fly.

Now try /last30days docker. The response always starts with the same line:

🌐 last30days v3.3.2 · synced 2026-06-10

Then a paragraph that literally begins with What I learned:, built on real Reddit threads, X posts, GitHub activity - ranked by upvotes, likes, stars, not by how well they're optimized for Google. And at the bottom, verbatim, an emoji tree footer:

✅ All agents reported back!
├─ 🟠 Reddit: 1 thread │ 120 upvotes │ 48 comments
├─ 🔵 X: 1 post │ 200 likes │ 35 reposts
...

Same model, same question, completely different and repeatable output. The secret isn't a better model: it's a contract written with almost paranoid care, paired with a small Python engine that does the dirty work. /last30days is a perfect case study in how you "program" an LLM when "do your best" isn't an acceptable option.

Why it exists, and why it's built this way

The real problem isn't "doing internet research" - models already know how to do that. The problem is doing it the same way, every time, across 50+ different harnesses (Claude Code, Codex, Cursor, Gemini CLI, OpenClaw...) for months on end. Models drift: today they respect the format, two updates from now they invent a title, or a "Sources:" block, or fall back to em-dashes that immediately smell like "written by an AI."

The repo's solution is a SKILL.md file of over 1700 lines that doesn't just say "do X." It says: "do X - and here's the dated incident from 2026-04-18 where you didn't, and here's the self-check that now catches it." Every rule in the contract ("LAW") follows the same four-part pattern:

the rule (e.g., "no final Sources block");
a real, dated, named failure (e.g., "0/8 public regression on 2026-04-18: models reverted to a blog format, with titles like 'The headline' or 'Kanye West: the last 30 days'");
a self-check the model runs before showing the output;
side-by-side BAD/GOOD examples.

Take LAW 1: "no final Sources: block." Sounds trivial, until you discover it explicitly contradicts the instructions of the WebSearch tool itself, which normally requires a mandatory "Sources" section. The self-check scans the last 15 lines of the response looking for Sources:/References:/bullet lists and deletes them. Or LAW 3: no em-dashes or en-dashes, only " - " with spaces - "the most reliable AI slop tell." Or LAW 7, perhaps the most interesting: the --plan flag is mandatory for queries about specific entities, and you, the host model, are the planner. The internal engine has a fallback planner that, if it runs without --plan and without a configured LLM key, prints a warning like "No --plan and no LLM provider configured." One model read this as "I don't have an API key, I can't reason" and gave up planning - a named regression, now explicitly prevented in SKILL.md.
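To make the LAW 1 and LAW 3 checks concrete, here is a minimal Python sketch of what they amount to. In the skill itself these checks are prose instructions the model follows, not code; the function names, the regular expressions, and the choice to remove only the heading plus the bullets directly under it are assumptions made for illustration.

```python
import re

# Hypothetical helpers: SKILL.md describes these checks in prose for the model
# to perform; nothing here is taken from the repo.
TAIL_LINES = 15  # the window the self-check is said to scan
TRAILER_HEADING = re.compile(r"^\s*(sources|references)\s*:\s*$", re.IGNORECASE)
BULLET = re.compile(r"^\s*[-*\u2022]\s+")


def strip_trailing_sources(text: str) -> str:
    """LAW 1: drop a 'Sources:'/'References:' heading found in the last
    TAIL_LINES lines, together with the bullet lines directly under it."""
    lines = text.splitlines()
    start = max(0, len(lines) - TAIL_LINES)
    for i in range(start, len(lines)):
        if TRAILER_HEADING.match(lines[i]):
            j = i + 1
            while j < len(lines) and BULLET.match(lines[j]):
                j += 1
            return "\n".join(lines[:i] + lines[j:]).rstrip()
    return text


def replace_dashes(text: str) -> str:
    """LAW 3: turn em-dashes and en-dashes into ' - ' with spaces."""
    return re.sub(r"\s*[\u2014\u2013]\s*", " - ", text)


draft = (
    "What I learned: Docker is shipping fast \u2014 and people noticed.\n"
    "\n"
    "Sources:\n"
    "- https://example.com/a\n"
    "- https://example.com/b\n"
)
cleaned = replace_dashes(strip_trailing_sources(draft))
```

Run on the sample draft, the trailing block disappears and the dash is normalized. A real check would also have to decide what counts as a stray bullet list, which this sketch deliberately leaves out.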

This "rule + dated incident + self-check + examples" pattern is the most reusable thing in the whole repo, even if you don't care about /last30days itself: it's a template for writing prompts that survive time.
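If you want to reuse the pattern in your own skill, one way to keep yourself honest is to treat each rule as a record with four required fields. The sketch below is only an illustration: the Law class and its render() layout are hypothetical, the BAD/GOOD strings are placeholders rather than quotes from SKILL.md, and the real file is free-form prose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Law:
    """One contract entry in the four-part shape (hypothetical type)."""
    number: int
    rule: str
    incident: str      # the dated, named failure that motivated the rule
    self_check: str    # what the model verifies before showing output
    bad_example: str
    good_example: str

    def render(self) -> str:
        """Render the entry as a plain-text section for a prompt file."""
        return "\n".join([
            f"LAW {self.number}: {self.rule}",
            f"Incident: {self.incident}",
            f"Self-check: {self.self_check}",
            f"BAD: {self.bad_example}",
            f"GOOD: {self.good_example}",
        ])


law_1 = Law(
    number=1,
    rule="no final Sources block",
    incident="0/8 public regression on 2026-04-18: models reverted to a blog format",
    self_check="scan the last 15 lines for Sources:/References: and delete them",
    bad_example="(placeholder) a response that ends with a Sources: list",
    good_example="(placeholder) a response that ends with the emoji tree footer",
)
section = law_1.render()
```

Because every field is required, a rule cannot be added without also writing down the incident and the check that go with it.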

The other architectural choice that explains everything else is division of labor. The Python engine (scripts/last30days.py, zero runtime dependencies: dependencies = [] in pyproject.toml) does the deterministic, boring work: fanning out searches, scoring, deduplication, formatting. The model does the things that require judgment: figuring out which are the right accounts to follow, planning the queries, turning evidence clusters into readable prose. "Zero dependencies" isn't purism: it means the skill runs anywhere there's Python 3.12+ and the urllib/subprocess stdlib, plus a few vendored tools (yt-dlp, gh, a "Bird" X client in Node).

And the product thesis, summed up by a user: "AI agent search engine scored by real upvotes, likes, and money. Not editors. Not algorithms." (@cyrilXBT) - ranking by real engagement, not SEO.

The mental model: three words

Three concepts, defined in CONCEPTS.md, unlock everything else:

Skill: SKILL.md (the contract in prose) plus scripts/ (the executable code). Follows the open Agent Skills format and installs with npx skills add or your harness's native mechanisms.
Engine: scripts/last30days.py. SKILL.md tells the model which flags to pass it; the Engine always returns the same shape (badge, ranked evidence clusters, emoji tree footer).
Harness: the agentic runtime that loads the skill - Claude Code, Codex, Cursor, Gemini CLI, and 50+ others. "Multi-harness" means: no hardcoded paths specific to a single harness.

On top of these sits the real driver of the opening scene: the output contract (badge + the 8 LAWs), the binding agreement between what the engine produces and what the model must return. Without this agreement, the engine could produce perfect data and the model would still reformat it its own way.

flowchart LR
    Topic[user topic] --> Model{Model<br/>plans}
    Model -->|--plan, resolved flags| Engine[Python Engine]
    Engine -->|badge + clusters<br/>+ footer| Model
    Model -->|synthesis per LAW| Output[Final brief]

The diagram shows the key point of LAW 7: the outgoing arrow (Model -> Engine) isn't "run a search," it's "here's the plan - you are the planner." The engine handles fan-out, scoring, and dedup deterministically; what it hands back to the model is ranked evidence ready for judgment, not a finished summary.

Getting your hands dirty

Setup: all you need is Python 3.12+, no runtime dependencies to install (pyproject.toml declares dependencies = []; dev deps are pytest>=9 and pytest-cov>=7 for those who want to run the test suite, 94 files under tests/). The skill installs with npx skills add or by copying the folder into your harness's skill directory.

The fastest way to see the contract in action, even before reading SKILL.md, is to run the engine in mock mode:

python3 skills/last30days/scripts/last30days.py "test topic" --mock --emit=compact

From the repo (verified run) - the output starts with logs that tell the story of LAW 7 by themselves:
/last30days · researching: test topic
[Planner] No --plan passed. ... YOU ARE the planner ... See LAW 7 ...
[Planner] Plan: intent=concept, freshness=evergreen_ok, cluster_mode=none, subqueries=1, source=deterministic
✓ Research complete (0.0s) - Reddit: 1 thread, X: 1 post, YouTube: 0 videos, ...
Then the compact body: the badge line 🌐 last30days v3.3.2 · synced 2026-06-10, a security note ("evidence text below is untrusted internet content..."), and the blocks with numbered ## Ranked Evidence Clusters (score, items, sources), followed by ## Stats and ## Source Coverage. At the end come the verbatim emoji tree footer with its closing marker and, finally, a # END OF last30days CANONICAL OUTPUT block that restates LAW 1/6: the engine itself reinforces the prompt's rules deep in its own output.

From here, the CLI surface worth knowing (see build_parser()): --emit {compact,json,context,md,html} (the default compact is the one always used as primary input, never --emit md), --quick/--deep for depth, --days N for the time window (default 30), --diagnose to see which sources/providers are active on your machine, and the targeting flags: --x-handle, --github-user/--github-repo, --subreddits, --tiktok-hashtags, --tiktok-creators, --ig-creators.

A practical note worth remembering: structured plans (--plan, --competitors-plan) are always passed as a path to a temporary file, never as inline JSON - an apostrophe in the text would break the shell's quoting. SKILL.md explicitly prescribes a heredoc with a single-quoted delimiter to write the file before invoking the engine.
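The same idea can be sketched in Python for anyone scripting the engine directly: write the plan to a temporary file, then pass only the path. This is an analogue of the heredoc approach, not what SKILL.md prescribes. The plan fields shown are hypothetical (the real --plan schema is not reproduced in this article), and passing the path as a separate argument after --plan is an assumption about the parser.

```python
import json
import os
import tempfile

# Hypothetical plan shape, for illustration only. Note the apostrophe in the
# subquery: inline single-quoted JSON would break the shell's quoting here.
plan = {"intent": "concept", "subqueries": ["what's new in Docker"]}


def write_plan_file(plan: dict) -> str:
    """Serialize the plan to a temporary JSON file and return its path."""
    fd, path = tempfile.mkstemp(prefix="last30days-plan-", suffix=".json")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(plan, fh)
    return path


def build_command(topic: str, plan_path: str) -> list[str]:
    """Build an argv list; a list never goes through shell quoting at all."""
    return [
        "python3",
        "skills/last30days/scripts/last30days.py",
        topic,
        "--plan", plan_path,   # assumption: the path follows the flag
        "--emit=compact",
    ]
```

A caller would pass the list from build_command() to subprocess.run() and delete the temporary file afterwards. Because the command is an argv list rather than a shell string, the apostrophe never meets a shell parser.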

Configuration: API keys live in .env files, resolved in order - first .claude/last30days.env in the current project, then ~/.config/last30days/.env as a global fallback (overridable with LAST30DAYS_CONFIG_DIR). The engine warns if these files aren't chmod 600. Reddit, Hacker News, and Polymarket are always free; GitHub goes through the gh CLI; YouTube through yt-dlp; X requires one of several options (cookie, XAI_API_KEY, SCRAPECREATORS_API_KEY...). CONFIGURATION.md has the full table, worth keeping handy.
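The lookup order can be sketched as follows. This is a reading of the description above, not the engine's code: the function names are hypothetical, and two details are assumptions, namely that LAST30DAYS_CONFIG_DIR replaces the ~/.config/last30days directory (with the file still named .env inside it) and that the first existing file wins rather than both being merged.

```python
import os
import stat
from pathlib import Path
from typing import Mapping, Optional


def candidate_env_files(project_dir: Path,
                        environ: Mapping[str, str] = os.environ) -> list[Path]:
    """Project-level file first, then the global fallback."""
    override = environ.get("LAST30DAYS_CONFIG_DIR")
    # Assumption: the override replaces the global config directory.
    global_dir = Path(override) if override else Path.home() / ".config" / "last30days"
    return [project_dir / ".claude" / "last30days.env", global_dir / ".env"]


def resolve_env_file(project_dir: Path,
                     environ: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the first candidate that exists (assumption: no merging)."""
    for candidate in candidate_env_files(project_dir, environ):
        if candidate.is_file():
            return candidate
    return None


def is_private(path: Path) -> bool:
    """True when the permission bits are exactly 600 (owner read/write only)."""
    return stat.S_IMODE(path.stat().st_mode) == 0o600
```

A wrapper script could call resolve_env_file(Path.cwd()) and print a warning when is_private() returns False, mirroring the engine's chmod 600 warning.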

A real workflow example: a query about a named entity, like "nvidia earnings reaction." Before launching the engine, the model does a couple of targeted WebSearches to resolve X and GitHub handles (Step 0.5/0.5b - for known people and entities there are direct examples in SKILL.md: Peter Steinberger → steipete, Matt Van Horn → mvanhorn). Then it expands the search with "category-peer subreddits" (Step 0.55): if the topic is recognized as, say, "AI image generation," subreddits like r/StableDiffusion get added automatically even if WebSearch only found brand-related subs. Only then is the engine invoked with all flags resolved. None of this is wasted "extra research": it's the model doing the judgment work that the engine, deliberately, doesn't do.

The pieces that matter

You don't need to walk the whole repo tree - these few files concentrate the interesting decisions:

skills/last30days/SKILL.md (1700+ lines): the contract itself. Its length is part of the lesson - defensive prompt engineering is verbose by nature, because every extra line is an incident that won't repeat.
scripts/lib/categories.py (283 lines): the CATEGORY_PEERS table, "pure data, no logic." Adding a new category (e.g., legal-tech, real-estate-tech) means adding an entry to the dict, with zero code to touch. The rules are written in the file's own docstring: multi-word patterns or domain-specific terms, never generic nouns like "image" or "ai," and "first-match-wins" evaluation from most specific to most generic. That is why ai_image_generation is checked before ai_chat_model: otherwise "gpt image 2" would end up in the wrong category.
scripts/lib/render.py (1779 lines): home of _render_badge() and render_compact() - the code side that enforces LAW 5 and LAW 6: the badge and the footer always come out identical, because code generates them, not a model.
scripts/store.py / watchlist.py / briefing.py: the optional trend-monitoring stack. --store persists results to SQLite (research.db, deduped on source_url); watchlist.py manages recurring topics on a daily/weekly cadence; briefing.py generates digests. This is the direction the project is growing - a one-off search becoming continuous monitoring.
tests/ (94 files): if you want to see how the engine is actually invoked, tests/test_cli_v3.py runs a subprocess.run([... "last30days.py", "test topic", "--mock", "--emit=json"]) and parses the resulting JSON - a concrete starting point for anyone wanting to script the engine directly.
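The first-match-wins rule described for categories.py is easy to see in a small sketch. The table below is not the repo's CATEGORY_PEERS: its structure, patterns, and the second subreddit are hypothetical stand-ins, and only the category names and r/StableDiffusion come from the text above.

```python
from typing import Optional

# Hypothetical stand-in for the real table. Order matters: the most specific
# category comes first, because the first matching entry wins.
CATEGORY_PEERS = {
    "ai_image_generation": {
        "patterns": ["image generation", "gpt image", "stable diffusion"],
        "subreddits": ["StableDiffusion"],
    },
    "ai_chat_model": {
        "patterns": ["gpt", "chatbot"],
        "subreddits": ["LocalLLaMA"],  # illustrative only
    },
}


def match_category(topic: str) -> Optional[str]:
    """Return the first category whose pattern appears in the topic."""
    text = topic.lower()
    for name, entry in CATEGORY_PEERS.items():  # dicts keep insertion order
        if any(pattern in text for pattern in entry["patterns"]):
            return name
    return None


def peer_subreddits(topic: str) -> list[str]:
    """Subreddits to add for a recognized category, else an empty list."""
    name = match_category(topic)
    return list(CATEGORY_PEERS[name]["subreddits"]) if name else []
```

With this ordering, "gpt image 2" resolves to ai_image_generation even though it also contains "gpt"; swapping the two entries would send it to ai_chat_model instead, which is the misclassification the docstring's rule is meant to prevent.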

Limitations, friction, and what the community says

The contract isn't infallible, and the repo itself admits it in several places.

Bot noise and loud minorities: even with a strict 30-day window, @riabcevv notes you need to watch out for bot activity and unrepresentative loud minorities that can skew the signal.
Integration costs more than it looks: the skill is free, but @cyrilXBT notes that adapting it to your workflow and maintaining it over time is real work, often underestimated.
Platform dependency: access to X/TikTok/Instagram depends on scraping backends and API keys (SCRAPECREATORS_API_KEY, X auth cookies, etc.). When these are missing or break, sources degrade silently - --diagnose in the verified run showed has_scrapecreators: false, so that source simply contributed nothing.
Self-correction has a limit: the PRE-PRESENT SELF-CHECK allows "at most ONE regeneration" if checks fail. The contract catches drift, but not infinitely.
Honest about its own footguns: CONFIGURATION.md itself flags that using your main Bluesky password instead of an app password is "bad hygiene" - a rare case of documentation admitting its own flaw instead of hiding it.
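The "at most ONE regeneration" limit can be pictured as a bounded retry loop. The sketch below is illustrative only: in the skill the model applies this to its own draft, and the names, the callable-based interface, and the choice to return a pass/fail flag when the retry also fails are all assumptions.

```python
from typing import Callable

MAX_REGENERATIONS = 1  # the contract allows at most one retry


def present(generate: Callable[[], str],
            checks: list[Callable[[str], bool]]) -> tuple[str, bool]:
    """Produce a draft, re-generate at most once if a check fails, and
    report whether the final draft passed every check."""
    draft = generate()
    for _ in range(MAX_REGENERATIONS):
        if all(check(draft) for check in checks):
            return draft, True
        draft = generate()  # the single permitted regeneration
    return draft, all(check(draft) for check in checks)


def starts_with_badge(draft: str) -> bool:
    """Hypothetical check: the response must open with the badge line."""
    return draft.startswith("\U0001F310 last30days")
```

The point of the cap is visible in the signature: generate is called at most twice, so a model that keeps failing its own checks cannot loop forever.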

On the positive side, the community describes it as "a massive cheat code for research, brainstorming ideas, or tracking meta shifts" (@riabcevv) and "before, researching a topic meant opening 20 tabs. Now it's done in one sentence" (@OddsArch) - the value is real, but it has to be weighed against this friction.

Conclusions and checklist

/last30days works on two levels: it's a useful research tool and a case study in how to write prompts that stay stable over time, across different harnesses, even as the underlying model changes. If you take away just one idea, make it the "rule + dated incident + self-check + examples" template - it applies to any skill you're writing.

Verify you have Python 3.12+ - no other runtime dependency to install.
Install the skill via npx skills add or your harness's plugin mechanism.
Run --diagnose once to see which sources/providers are actually available on your machine.
Try --mock --emit=compact first, to see the shape of the contract (badge, evidence clusters, footer) without spending real API calls.
Configure keys only for the sources you care about, in .claude/last30days.env for per-project/per-client setups; chmod 600 any file with keys.
For queries about named entities or comparisons, expect the model to resolve X/GitHub handles via WebSearch first (Step 0.5/0.5b) - that's intentional, not a missing flag.
For recurring topics, consider --store + watchlist.py instead of re-running the search by hand each time.
If you're writing your own skill: copy the LAW pattern (rule + dated incident + self-check + good/bad example) - it's the most portable idea in the whole repo.
