Executive TL;DR

We built a code review AI Agent that reads every merge request, parses the diff, and posts concise, actionable inline comments, so reviewers get high-signal feedback before they touch a single line of code. For leadership, the agent is a force-multiplier: it enforces basic standards, surfaces security and performance issues, and keeps human reviewers focused on design and edge cases instead of busywork.
This isn’t about replacing engineers; it’s about removing repetitive load, shortening time-to-merge, and creating an auditable trail of why a comment was made. We piloted the workflow on several repos and measured fewer trivial review cycles and faster merges.
Benefits

  • Faster reviews: reduces reviewer time per MR and speeds up delivery.
  • Consistent comments: standardises feedback across teams and repos.
  • Audit trail: every automated comment is timestamped and traceable.
  • Avoid noise: filters low-value suggestions so humans review high-impact issues.

Looking to build something similar? Build AI Agent with us for the best results and support.

Why we built a Code Review AI Agent

We built a code review agent because our teams were drowning in routine merge request work that added no strategic value. Every day we saw small, repetitive issues (style nits, missing null checks, trivial security smells) come up again and again, while reviewers spent their time on low-signal feedback instead of design and architecture. The result: merge requests piling up, reviewers burning out, and meaningful releases delayed by avoidable cycles.

The problem is threefold. First, review bottlenecks slow delivery: a single stalled MR can block downstream work. Second, feedback is inconsistent across reviewers and teams, creating rework and confusion. Third, reviewer fatigue makes it harder to catch real risks consistently. That’s why we built an automated, LLM-powered MR workflow that parses diffs, runs targeted checks, and posts concise inline suggestions, freeing humans to focus on higher-value judgment calls.

The business impact is straightforward: faster time-to-merge, fewer reviewer hours spent on routine checks, and more consistent onboarding for new engineers because the agent enforces baseline standards. That consistency also reduces risk: security and performance weaknesses are flagged earlier, not after production incidents.

Use cases we prioritized:

  1. Small bug-fix MR: catch regression risks and style issues automatically so reviewers can approve faster.
  2. Security-sensitive MR: run focused security heuristics to surface SQLi/XSS patterns before human review.
  3. Documentation-only MR: auto-classify and suppress noise so reviewers aren’t interrupted by non-code changes.

In short, our code review agent targets the drudgery so humans can do the unique, high-impact work only people can do.

Architecture & flow

Key data objects & mapping

  • mergeRequestLink: parsed MR URL components (domain, project path, MR ID, MR URL). Used in webhook parsing, project resolution, and email summaries.
  • projectId: canonical numeric GitLab project ID, resolved dynamically if missing. Used in all GitLab API calls, including diff fetch and discussions.
  • getMRDataChanges: complete MR payload from GitLab containing metadata and file diffs. Used in validation, diff parsing, and commit SHA extraction.
  • changes[]: per-file diff objects with paths, diff hunks, and file state flags. Used in split logic, skip rules, and AI diff processing.
  • originalCode / newCode: sanitized code blocks extracted from diff hunks without +/- markers. Used in the LLM review prompt payload.
  • position[...]: inline comment anchor including path, SHA references, and line numbers. Used in GitLab inline discussion posting.
  • notificationKey: workflow outcome identifier (Reviewed, Blocked, Conflict, etc.). Used in email templates, logs, and the webhook response.
  • gitlabMRReviewPrompt: static or external system prompt defining the AI review rules. Used in AI agent invocation.
  • canReview: boolean toggle to skip review when disabled. Used in pre-review gating logic.
  • has_conflicts: GitLab conflict indicator blocking automated review. Used in the validation gate.
  • merge_status: mergeability status indicating whether the MR can be merged. Used in validation before processing and commenting.
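The workflow's exact parsing code isn't shown, so here is a minimal sketch of how an MR URL could be split into the mergeRequestLink fields listed above. The field names mirror the table; the regex and function name are our own assumptions:

```python
import re
from urllib.parse import urlparse

def parse_mr_link(mr_url: str) -> dict:
    """Split a GitLab MR URL into the components used downstream.

    Expects URLs shaped like:
      https://gitlab.example.com/group/project/-/merge_requests/42
    """
    parsed = urlparse(mr_url)
    match = re.match(r"^/(?P<project_path>.+?)/-/merge_requests/(?P<mr_id>\d+)",
                     parsed.path)
    if not match:
        raise ValueError(f"Not a merge request URL: {mr_url}")
    return {
        "domain": parsed.netloc,               # e.g. gitlab.example.com
        "projectPath": match.group("project_path"),
        "mrId": int(match.group("mr_id")),
        "mrUrl": mr_url,
    }
```

The projectPath is what you would then resolve to the numeric projectId via the GitLab projects API when the webhook payload doesn't carry it.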

Diff parsing

Diffs are the agent’s map: we turn each @@ -a,b +c,d @@ hunk into a tiny, reviewable story that the code review agent can understand. The hunk header means: a is the old-file start line and b is how many lines the old hunk covers; c is the new-file start line and d is the new count. From there we walk each hunk line by line: lines beginning with a space are context (advance both the old and new counters), - lines advance only the old counter (they belong in originalCode), and + lines advance only the new counter (they belong in newCode). This produces sanitized originalCode and newCode blocks (no leading +/- or @@ markers) and an exact mapping from hunk offsets to the old_line / new_line integers used for anchor placement.

Edge cases matter. Skip diffs where deleted_file == true (nothing to anchor), skip renamed_file entries that show no content change, and skip diffs that don’t start with @@ (binary or generated files). Those rules keep the agent from posting unusable comments.
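The walk and the skip rules above can be sketched as follows. This is a simplified illustration, not the production workflow; field names follow GitLab's changes[] objects, and the assumption that context lines land in both sanitized blocks is ours:

```python
import re

HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

def should_skip(change: dict) -> bool:
    """Apply the skip rules: deleted files, content-less renames, non-diff payloads."""
    if change.get("deleted_file"):
        return True                              # nothing to anchor
    if change.get("renamed_file") and not change.get("diff"):
        return True                              # rename with no content change
    if not change.get("diff", "").startswith("@@"):
        return True                              # binary or generated file
    return False

def parse_hunks(diff: str) -> list[dict]:
    """Walk a unified diff, yielding sanitized code plus old/new line mappings."""
    hunks, old_line, new_line = [], None, None
    for line in diff.splitlines():
        header = HUNK_RE.match(line)
        if header:
            old_line, new_line = int(header.group(1)), int(header.group(3))
            hunks.append({"oldStart": old_line, "newStart": new_line,
                          "originalCode": [], "newCode": [],
                          "oldLines": [], "newLines": []})
            continue
        if old_line is None:
            continue                             # text before the first hunk header
        hunk = hunks[-1]
        if line.startswith(" "):                 # context: advance both counters
            hunk["originalCode"].append(line[1:])
            hunk["newCode"].append(line[1:])
            old_line += 1
            new_line += 1
        elif line.startswith("-"):               # removed: old side only
            hunk["originalCode"].append(line[1:])
            hunk["oldLines"].append(old_line)
            old_line += 1
        elif line.startswith("+"):               # added: new side only
            hunk["newCode"].append(line[1:])
            hunk["newLines"].append(new_line)
            new_line += 1
    return hunks
```

Note how the sanitized blocks carry no +/- or @@ markers, while oldLines/newLines preserve the exact integers needed for anchoring.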

Anchoring depends on SHA correctness: GitLab requires the right base_sha, start_sha, and head_sha together with the chosen old_line or new_line. If the MR advances between fetch and post, anchors fail (422). Re-fetch SHAs just before posting or skip posting if they’ve changed. This is why precise line mapping is essential and why SHA revalidation is baked into the workflow.

We unit-test the parser against four scenarios:

  • Single-hunk: a simple @@ -3,4 +3,5 @@ with one + line and one - line; assert originalCode, newCode, and the line mapping.
  • Multi-hunk: two hunks in one file; assert separate hunkMeta entries and independent line mappings.
  • Rename-only: renamed_file == true with no diff; assert the file is skipped.
  • Deleted & binary: a deleted file and a diff without @@; assert both are skipped.

These tests catch the common failures that break inline comments and guard the agent from noisy mistakes.

Prompt design & AI agent behavior

Our code review AI agent succeeds or fails on the prompt and how we shape its behavior. This section explains the goals, the payload we send, the token strategy, and how we clean the agent’s output so it is safe to post to a merge request.

Prompt design goals

We design the system prompt for determinism: the agent must return either Done or a short (1–3 line) actionable comment. We bias the agent toward safety and precision: language heuristics (file extension / path hints to choose idiomatic rules), focused security checks (SQLi, XSS patterns, unsafe eval usage), code-idiom suggestions (e.g., === vs ==), and performance anti-pattern detection (inefficient loops, unbounded allocations). Keep the model settings conservative (low temperature, small max_tokens) so output stays terse and parseable.

Payload composition

Each LLM call should include two parts: the system prompt + a user message containing (a) file path and language hint (from extension), (b) hunk metadata (hunk offsets, start lines), and (c) the sanitized originalCode and newCode blocks. This gives the agent the exact context to decide whether the change introduces a problem, and whether to anchor feedback to old_line or new_line. Keep the payload structured so parsing the agent output is trivial.
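A minimal sketch of the payload composition described above. The hunk_meta field names (old_start, new_start) and the message format are our assumptions; the structure, two messages with path, language hint, hunk metadata, and sanitized code, follows the text:

```python
def build_review_payload(system_prompt: str, file_path: str,
                         hunk_meta: dict,
                         original_code: str, new_code: str) -> list[dict]:
    """Compose the chat messages sent to the LLM for one hunk."""
    # Language hint from the file extension, per the heuristics above.
    language_hint = file_path.rsplit(".", 1)[-1] if "." in file_path else "unknown"
    user_message = (
        f"File: {file_path} (language hint: {language_hint})\n"
        f"Hunk: old start {hunk_meta['old_start']}, "
        f"new start {hunk_meta['new_start']}\n"
        f"--- original ---\n{original_code}\n"
        f"--- updated ---\n{new_code}\n"
        "Reply with 'Done' if the change is fine, "
        "or one short actionable comment."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
```

Keeping the payload this structured means the agent's reply is either the literal token Done or a single comment, which makes downstream parsing trivial.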

Token strategy & multi-hunk handling

Send hunks (plus a handful of context lines) rather than full files. For most repos we recommend ~400–800 tokens per call and tight max_tokens for replies so the agent stays concise. For multi-hunk files you can either: (a) call the agent per-hunk (simpler anchors), or (b) consolidate hunks into a single prompt when cross-hunk reasoning is needed, but then ensure you include clear hunk boundaries and map responses to the correct hunk. Cache by file SHA to avoid repeated costs.

Post-processing & safety nets

After the agent returns, normalize done → Done (case-insensitive), trim whitespace, and truncate multi-line answers to the first actionable sentence. If the output cannot be parsed or the agent returns verbose/ambiguous text, mark that file as agent_error and route it to a human-review notification (email/log). Finally, always re-validate MR SHAs before posting to avoid 422 anchor failures.
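The normalization steps above can be sketched as a single function. The 300-character threshold for "verbose" output is an illustrative assumption:

```python
import re

def normalize_agent_output(raw: str) -> tuple[str, bool]:
    """Clean the agent reply; the second element is False when output is unusable."""
    text = raw.strip()
    if not text:
        return "agent_error", False
    if text.lower() == "done":                   # normalize done -> Done
        return "Done", True
    # Truncate multi-line answers to the first actionable sentence.
    first_line = text.splitlines()[0].strip()
    sentence = re.split(r"(?<=[.!?])\s", first_line)[0]
    if len(sentence) > 300:                      # verbose/ambiguous -> human review
        return "agent_error", False
    return sentence, True
```

Anything that comes back as agent_error is not posted; it is routed to the human-review notification path instead.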


This combination of tight prompts, conservative token strategy, and robust post-processing lets us automate routine checks safely while keeping humans in the loop for ambiguous or high-risk findings.

Error handling & retry patterns

Failures happen; we designed the workflow to fail loudly and safely. We classify errors into four buckets and handle each with clear retries, escalation, and audit logging.

  1. HTTP (GitLab/API): for transient 5xx errors, retry up to 3 times with exponential backoff (1s, 3s, 10s). For 429 rate limits, honor Retry-After when present; otherwise back off starting at 10s. If a POST returns a 422 (SHA/anchor mismatch), re-fetch the MR and skip posting if the SHAs changed.
  2. LLM / agent errors: on timeouts or non-parseable output, retry once with the same prompt (to avoid burning quota). If it still fails or returns verbose/ambiguous text, mark the file agent_error, record the failure in the execution stamp, and route the MR to human review via email. Don’t flood retries; protect cost and quota.
  3. Credential failures: fail fast, set notificationKey to a platform-alert state, and notify ops/admins. Avoid blind retries for auth errors.
  4. Logging / persistence failures: attempt secondary storage (local file or alternate DB), continue the review run, and flag the run as partially audited in the execution row.

Always surface errors in the execution stamp (executionId, errorMessage, status) so operators can triage quickly.
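The HTTP retry rules from bucket 1 can be sketched as a small helper. The send callable returning (status, headers, body) is an illustrative abstraction over whatever HTTP client the workflow uses:

```python
import time

def retry_request(send, max_attempts: int = 3, backoffs=(1, 3, 10)):
    """Retry transient failures: 5xx with exponential backoff, 429 via Retry-After."""
    status, headers, body = None, {}, None
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status < 500 and status != 429:
            return status, body                  # success, or a non-retryable client error
        if status == 429:
            # Honor Retry-After when present; otherwise start at 10s.
            delay = float(headers.get("Retry-After", 10))
        else:
            delay = backoffs[min(attempt, len(backoffs) - 1)]
        if attempt < max_attempts - 1:
            time.sleep(delay)
    return status, body                          # exhausted retries; caller escalates
```

A 422 from a discussion POST deliberately falls through as non-retryable here: retrying cannot fix a stale anchor, so the caller re-fetches SHAs instead.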

Roadmap & planned improvements

  1. Duplicate-comment detection: before posting, search existing discussions for identical or near-identical bodies. Why: prevents spammy repeats and reduces reviewer annoyance. Impact: fewer noise complaints and cleaner threads.
  2. SHA revalidation immediately before post: re-fetch MR SHAs just before posting any inline discussion. Why: avoids 422 anchor failures when new commits arrive. Impact: higher post success rate and fewer skipped comments.
  3. Caching by file SHA: store review results keyed by file SHA to skip identical content. Why: saves LLM calls and cost. Impact: substantial cost and latency reduction on busy repos.
  4. Severity labels: let the agent return a severity label (info/warning/critical). Why: allows filtering auto-comments to only critical items. Impact: reduces noise and focuses human attention.
  5. Dashboard & metrics: expose runs, comment counts, token costs, and failures. Why: makes ops accountable and optimizations measurable. Impact: faster iteration and ROI visibility.
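Duplicate-comment detection (roadmap item 1) is still planned, but one possible shape is a whitespace- and case-insensitive comparison of comment bodies; the function names here are hypothetical:

```python
import re

def normalize_body(body: str) -> str:
    """Collapse whitespace and case so near-identical comments compare equal."""
    return re.sub(r"\s+", " ", body.strip().lower())

def is_duplicate(new_comment: str, existing_bodies: list[str]) -> bool:
    """True if an equivalent comment already exists in the MR's discussions."""
    target = normalize_body(new_comment)
    return any(normalize_body(b) == target for b in existing_bodies)
```

A fuzzier match (e.g. edit distance) would also catch lightly reworded repeats, at the cost of occasional false positives.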

Conclusion

We built the code review agent to deliver measurable outcomes: faster time-to-merge, fewer reviewer hours spent on trivial checks, more consistent feedback, and an auditable trail of why comments were made, all while reducing the chance that obvious security or performance issues reach production. Our risk posture is conservative: the agent flags routine problems and defers ambiguous or high-risk cases to humans; it never blocks merges.

Next step: pilot this on one repository or team, measure reviewer time and comments-per-MR, then iterate. Contact our AI Agent experts if you’d like us to run a two-week pilot and deliver a clear delta in reviewer time and comment quality.

