Description
With the widespread adoption of AI-assisted coding and advanced refactoring tools, the volume of code generation has outpaced human review capacity. As Pull Requests (PRs) grow in size and complexity, reviewers face increased pressure, making it easy to miss critical defects.
Goals
- [ ] Probe AI Integration: Investigate the viability of Google Gemini within existing GitHub workflows.
- [ ] Stress Test: Validate efficacy by running models against large-scale, real-world Pull Requests.
- [ ] Synthesize: Determine if current AI models offer a tangible reduction in review latency or an increase in defect detection, vs. merely adding noise.
Resources
Result
I tested three specific implementations of the Gemini model: Gemini CLI, Gemini in VS Code (Non-Agentic), and the Gemini GitHub App.
1. Gemini CLI
Tested in a local environment without the sandbox (sandbox execution failed).
Verdict: Useful for generation, poor for analysis.
Pros:
- Privacy: Feedback is local; no noise on the public PR thread.
- Generation: Excellent at scaffolding PR descriptions from scratch.
Cons:
- Context Blind: Struggles to parse existing comments or PR metadata. Attempting to fix grammar in a description resulted in hallucinations.
- Token Limits: Fails on large PRs. It correctly identifies the limit and refuses to process (rather than hallucinating), but this limits utility.
- Low Signal: Failed to identify significant issues in the test cases.
Tips:
- Make sure you prompt Gemini to run git without a pager, so you do not have to scroll through the output manually.
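The pager tip can be sketched in Python as below. This is a minimal illustration, not how Gemini CLI invokes git internally: the helper names `git_command`, `pagerless_env`, and `run_git` are hypothetical, while `--no-pager` and the `GIT_PAGER` environment variable are standard git features. Git only pages when writing to a terminal, so the flag matters most for commands the assistant runs in an interactive shell.

```python
import os
import subprocess


def git_command(*args: str) -> list[str]:
    """Build a git invocation that never opens a pager."""
    # --no-pager is a global git option, so it goes before the subcommand.
    return ["git", "--no-pager", *args]


def pagerless_env() -> dict[str, str]:
    """Copy the environment and set the pager to `cat` as a second safeguard."""
    env = dict(os.environ)
    env["GIT_PAGER"] = "cat"
    return env


def run_git(*args: str, cwd: str = ".") -> str:
    """Run git and return stdout as text; raises CalledProcessError on failure."""
    result = subprocess.run(
        git_command(*args),
        cwd=cwd,
        env=pagerless_env(),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical usage, assuming the base branch is called `main`:
#   print(run_git("diff", "--stat", "main...HEAD"))
```

The same effect can be requested in the prompt itself by asking for `git --no-pager diff` instead of `git diff`.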
2. Gemini VS Code (Non-Agentic)
My primary daily driver for code generation.
Verdict: Better for preparing pull requests than for reviewing them.
Pros:
- Privacy: Local feedback loop.
- Patching: Can suggest concrete tests to add for code coverage, or specific patches.
Cons:
- No Git Awareness: Cannot natively read the git index or branch diffs. Requires manual piping of diffs into the chat context.
- Context Window: Useless on medium-to-large PRs due to token window constraints.
- Workflow: Optimized for writing code, not reviewing external diffs.
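One way to work around the manual piping and the context window limits is to split a diff by file and feed it to the chat in pieces. The sketch below is an assumption-laden illustration rather than something tested in this project: the function names are hypothetical, and the 4-characters-per-token ratio is a rough rule of thumb, not a property of any Gemini model.

```python
import re

# Assumption: roughly 4 characters per token. A rule of thumb only;
# real tokenizers vary by model and by content.
CHARS_PER_TOKEN = 4


def split_diff_by_file(diff_text: str) -> list[str]:
    """Split unified `git diff` output into one string per changed file."""
    # Each per-file section starts with a line beginning "diff --git ".
    parts = re.split(r"(?m)^(?=diff --git )", diff_text)
    return [part for part in parts if part.strip()]


def pack_chunks(file_diffs: list[str], max_tokens: int) -> list[str]:
    """Greedily group per-file diffs into chunks under an estimated budget."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    chunks: list[str] = []
    current = ""
    for file_diff in file_diffs:
        # Start a new chunk when adding this file would exceed the budget.
        if current and len(current) + len(file_diff) > max_chars:
            chunks.append(current)
            current = ""
        # A single file larger than the budget still gets a chunk of its own.
        current += file_diff
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be pasted into the chat as a separate message. Splitting by file loses cross-file context, so this helps with size limits but not with issues that span several files.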
3. Gemini GitHub Code Assist Bot
Tested using the free tier on a personal repository (Enterprise Cloud activation failed).
Verdict: The most powerful, but suffers from high noise and non-determinism.
Pros:
- Deep Context: Successfully processed massive PRs (e.g., +32,355 / −41,499 lines).
- Hit Rate: Found the highest number of legitimate issues among all tested methods.
- UX: Native integration via GitHub comments allows for conversational debugging.
Cons:
- Thread Pollution: "Messes up" the PR timeline. It frequently adds multiple high-priority labels to identical suggestions.
- Low Signal-to-Noise: Excessive "appraisals" (praise). As a maintainer, I need diffs and bugs, not emotional validation.
- Non-Deterministic: Running the bot multiple times on the identical codebase yields completely different review points.
- Opinionated Linting: Comments on TODO/FIXME markers only to confirm that they are fine to fix later, which is redundant.
- Adoption Friction: Requires repo-level installation, forcing the tool on all pull requests regardless of reviewer preference.
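The non-determinism noted above could be quantified by comparing which locations each review run commented on. The following is a minimal sketch under stated assumptions: a finding is keyed by (file path, line number), which is a choice made here for illustration, and the function names are hypothetical. No such measurement was done in this project.

```python
from itertools import combinations

# Assumption: a finding is identified by (file path, line number).
# Comment wording differs between runs, so comparing text directly
# would understate the overlap.
Finding = tuple[str, int]


def jaccard(a: set[Finding], b: set[Finding]) -> float:
    """Return |a ∩ b| / |a ∪ b|, or 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def mean_pairwise_overlap(runs: list[set[Finding]]) -> float:
    """Average Jaccard similarity over every pair of review runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A value near 1.0 would mean the runs agree on where the problems are; a value near 0.0 would match the "completely different review points" impression.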
Field Data & Examples
The following PRs illustrate the behavior of the GitHub App across different scales:
- jreidinger/agama/pull/1 (Massive): Proved the bot can handle massive context windows without crashing.
- jreidinger/agama/pull/2 (Small PoC): Code was imperfect/buggy, yet the bot missed obvious logic flaws.
- jreidinger/agama/pull/3 (Massive): Tested configuration tweaks. Highlighted non-determinism: multiple reviews of the same code produced different comments.
- jreidinger/agama/pull/5 (TypeScript): Redundancy bug. The bot flagged the same issue multiple times across the file with identical long-form explanations, cluttering the view.
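The repeated identical explanations seen in pull/5 could be collapsed before reading by grouping comments on their text. This is a hedged sketch, not part of the bot or of this experiment: the comment dictionaries with keys "path", "line", and "body" are a hypothetical shape chosen for illustration, as are the function names.

```python
import re
from collections import defaultdict


def normalize(body: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", body).strip().lower()


def group_duplicates(comments: list[dict]) -> dict[str, list[str]]:
    """Map each distinct comment body to the locations where it was posted.

    Each comment is a dict with the hypothetical keys "path", "line", "body".
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for comment in comments:
        location = f'{comment["path"]}:{comment["line"]}'
        groups[normalize(comment["body"])].append(location)
    return dict(groups)
```

A reviewer would then read each explanation once, followed by the list of places it applies to, instead of scrolling past the same long text several times.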
Other models
I intended to test Copilot and Claude, but Copilot requires a PRO license (unavailable), and Claude currently lacks a workflow optimized specifically for code review integration.
This project is part of:
Hack Week 25
Comments
- about 17 hours ago by mvidner
Just repeating the Examples section but with clickable links
- very big pull request: jreidinger/agama#1
- smaller POC that is far from perfect and has some issues: jreidinger/agama#2
- big one, this time with an attempt to tweak the config and with multiple reviews of the same code (always with different comments): jreidinger/agama#3