AI Evaluation / Benchmarking

AI Code Evaluator

Automated evaluation of AI-generated code, scored across correctness, edge cases, performance, and code quality. Inspired by Labelbox and Scale AI evaluation pipelines.

📅 2025–2026 🔌 Stable 📈 4 Weighted Dimensions 🔗 GitHub

01 The Problem

Large language models can now generate code that looks correct — but appearances are deceiving. A generated function might pass basic tests yet fail catastrophically on edge cases, exhibit quadratic complexity where linear suffices, or lack the documentation and type safety needed for production use. Manual review of AI-generated code doesn't scale, and simple pass/fail testing misses the nuance that separates a toy solution from a shippable one.

The fundamental challenge is multi-dimensional evaluation. A code submission can be correct but slow, fast but undocumented, or well-structured but fragile on edge inputs. Existing tools like Labelbox and Scale AI pioneered automated evaluation for data-labelling quality, but the code domain demanded its own framework. The AI Code Evaluator fills this gap: a systematic, weighted scoring engine that measures AI-generated code across four orthogonal axes and produces a single interpretable quality score.

02 Architecture

The framework is designed as a modular pipeline: test cases are defined in YAML, the scoring engine runs each submission through all four dimensions, and results are aggregated into a 0–100 score. Each dimension is independently configurable, allowing different problem domains to re-weight criteria without modifying core logic.

Evaluator Engine
├── Correctness (40%) — Standard test cases
├── Edge Cases    (25%) — Boundaries, empty, large inputs
├── Performance   (20%) — Execution speed, algorithmic efficiency
└── Code Quality  (15%) — Docstrings, type hints, naming, structure
    │
    ▼
Benchmark Runner → Test Cases + Scoring Engine
    │
    ▼
Report Generator → JSON + Markdown
    
CI: GitHub Actions (automated runs)

⚡ Design Principle

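Evaluation must be deterministic and reproducible. Every test case is serialised in YAML, every run logs the exact environment and seed, and the scoring engine is pure Python with no randomness. Two runs of the same submission in the same environment therefore produce the same correctness, edge-case, and code-quality scores; the performance dimension depends on wall-clock timing, so its run-to-run noise is handled statistically (see Performance Isolation). Reproducibility is critical for benchmarking model iterations and tracking quality over time.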
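One way to make a run reproducible in the sense described here is to store a small manifest next to each score. The sketch below is an illustration rather than the project's actual logging code: `build_run_manifest` and its field names are hypothetical, and it assumes that a SHA-256 fingerprint of the test-case file and of the submission is enough to confirm that two runs saw identical inputs.

```python
import hashlib
import json
import platform
import sys


def build_run_manifest(test_case_text: str, submission_code: str, seed: int = 0) -> dict:
    """Collect the facts needed to reproduce an evaluation run.

    Hypothetical helper: the field names are illustrative assumptions.
    """
    def fingerprint(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": sys.platform,
        "seed": seed,
        "test_cases_sha256": fingerprint(test_case_text),
        "submission_sha256": fingerprint(submission_code),
    }


if __name__ == "__main__":
    manifest = build_run_manifest("cases: []\n", "def f(): return 1\n", seed=42)
    print(json.dumps(manifest, indent=2, sort_keys=True))
```

Comparing two manifests field by field shows at a glance whether a score difference could come from a changed interpreter, a changed test suite, or a changed submission.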

03 Tech Stack

The stack prioritises portability, automation, and clear output. YAML keeps test-case definitions human-readable and version-controllable; GitHub Actions makes evaluation a push-button CI process; and dual-format reporting (JSON + Markdown) serves both automated consumers and human reviewers.

Technology                  | Role                      | Why
----------------------------+---------------------------+------------------------------------------------------------------------
Python 3.11+                | Core runtime              | Rich typing, static analysis hooks, broad ML/AI ecosystem
PyYAML                      | Test-case configuration   | Human-readable, Git-friendly, easy to extend per problem domain
GitHub Actions CI           | Automated evaluation runs | Trigger on push/PR, integrate badge into README, no infra to manage
JSON + Markdown reports     | Dual-format output        | JSON for machine parsing / dashboards; Markdown for PR comments and wikis
Static analysis integration | Code quality signals      | mypy, pylint, flake8 metrics feed into the Code Quality dimension
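The dual-format reporting listed above can be sketched as two small renderers over the same data. This is a minimal illustration, not the project's report generator: the function names, the JSON keys, and the Markdown layout are all assumptions.

```python
import json


def render_json(problem: str, scores: dict[str, float], total: float) -> str:
    """Machine-readable report for dashboards and other tooling."""
    payload = {"problem": problem, "scores": scores, "total": total}
    return json.dumps(payload, indent=2, sort_keys=True)


def render_markdown(problem: str, scores: dict[str, float], total: float) -> str:
    """Human-readable report suitable for a PR comment or wiki page."""
    lines = [
        f"## Evaluation: {problem}",
        "",
        f"**Total score:** {total:.1f} / 100",
        "",
        "| Dimension | Score |",
        "| --- | --- |",
    ]
    lines += [f"| {name} | {value:.1f} |" for name, value in scores.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = {"correctness": 100.0, "edge_cases": 80.0,
               "performance": 70.0, "code_quality": 60.0}
    print(render_json("two_sum", example, 83.0))
    print(render_markdown("two_sum", example, 83.0))
```

Both renderers take the same arguments, so a single evaluation run can feed a dashboard and a pull-request comment without being recomputed.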

04 Key Challenges

🎲 1. AI Code Is Fundamentally Unpredictable

Human-written code tends to follow consistent style patterns; AI-generated code varies wildly between models, prompts, and even runs with identical inputs. A framework that assumes standard indentation, conventional naming, or idiomatic Python will penalise valid-but-unusual solutions. The evaluator must be model-agnostic: it scores the functional properties of the code (does it run? is it fast? is it safe?) without assuming how it was produced. This meant designing the Correctness and Edge Case dimensions to reason about runtime behaviour rather than syntax, and weighting style heuristics conservatively.

🏬 2. Automatic Edge-Case Generation

A robust evaluation must test boundary conditions: empty inputs, negative numbers, Unicode strings, maximum recursion depth, concurrent calls. Hand-writing these for every problem doesn't scale. The framework includes an edge-case generator that analyses the function signature and type hints to produce a combinatorial suite of boundary inputs per problem domain. For example, a sort function receives an empty list, a single-element list, a reversed list, a list of duplicates, a list of mixed types (if untyped), and a list long enough to push a recursive implementation towards Python's recursion limit. The generator is parameterised so domain experts can extend it with domain-specific cases.

🚀 Implementation Detail

Edge-case generation uses a type-based strategy table. For int parameters it generates 0, -1, 2^31-1, and typical values. For str it generates the empty string, a single character, a very long string (~10K chars), and Unicode mixed-script text. The combinatorics are pruned by a configurable limit to keep total test time bounded.
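A type-based strategy table of this kind might look like the sketch below. It is an assumption-laden illustration, not the framework's generator: `STRATEGIES`, `generate_edge_cases`, and the "typical" values (7 and the mixed-script sample string) are hypothetical, and pruning is done by simply truncating the cartesian product.

```python
import inspect
import itertools
import typing

# Boundary values per annotated type. The int and str boundaries follow the
# text; the "typical" values (7, the mixed-script string) are assumptions.
STRATEGIES: dict[type, list] = {
    int: [0, -1, 2**31 - 1, 7],
    str: ["", "a", "x" * 10_000, "héllo мир 世界"],
}


def generate_edge_cases(func, limit: int = 50) -> list[dict]:
    """Build keyword-argument sets from the cartesian product of the
    per-parameter boundary values, keeping at most `limit` combinations."""
    hints = typing.get_type_hints(func)
    names = list(inspect.signature(func).parameters)
    pools = []
    for name in names:
        annotation = hints.get(name)
        if annotation not in STRATEGIES:
            raise TypeError(f"no strategy for parameter {name!r}: {annotation!r}")
        pools.append(STRATEGIES[annotation])
    # islice stops the product early, which bounds total test time.
    combos = itertools.islice(itertools.product(*pools), limit)
    return [dict(zip(names, combo)) for combo in combos]


if __name__ == "__main__":
    def label(text: str, count: int) -> str:
        return f"{text}:{count}"

    for case in generate_edge_cases(label, limit=5):
        print({k: (v if len(str(v)) < 20 else "<long>") for k, v in case.items()})
```

Truncating the product favours the first values of the earlier parameters; a production generator would more likely sample or prioritise combinations, but the truncation keeps the sketch deterministic.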

📊 3. Performance Isolation

Measuring algorithm speed in Python is noisy. Garbage collection, OS scheduling, CPU throttling, and even the phase of the moon can skew timing results by 10–30%. The framework addresses this statistically: each performance test runs multiple iterations (default 10), discards the fastest and slowest, takes the median, and compares against a pre-recorded baseline. A submission is only penalised if the 95% confidence interval of its execution time lies strictly above the baseline. This approach eliminates most noise while catching genuine inefficiency, for example a solution that repeatedly calls list.insert(0, ...) where a collections.deque would do.
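The trimming and interval test described above can be sketched as follows. The text does not say how the 95% interval is computed, so this sketch assumes a normal approximation around the trimmed mean (z = 1.96); `collect_timings` and `summarise` are hypothetical names, and timing collection is kept separate from the statistics so the latter can be checked with fixed numbers.

```python
import math
import statistics
import time


def collect_timings(fn, args: tuple, iterations: int = 10) -> list[float]:
    """Time `fn(*args)` repeatedly with a monotonic high-resolution clock."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return timings


def summarise(timings: list[float], baseline: float, z: float = 1.96) -> dict:
    """Trim the extremes, then test whether the run is reliably slower.

    Assumption: a normal-approximation interval around the trimmed mean.
    """
    if len(timings) < 4:
        raise ValueError("need at least 4 timings to trim and estimate spread")
    trimmed = sorted(timings)[1:-1]  # discard the fastest and the slowest
    mean = statistics.fmean(trimmed)
    half_width = z * statistics.stdev(trimmed) / math.sqrt(len(trimmed))
    lower = mean - half_width
    return {
        "median": statistics.median(trimmed),
        "ci_lower": lower,
        "ci_upper": mean + half_width,
        "penalised": lower > baseline,  # whole interval lies above baseline
    }


if __name__ == "__main__":
    samples = collect_timings(sorted, (list(range(10_000, 0, -1)),))
    print(summarise(samples, baseline=0.01))
```

With only eight samples left after trimming, a t-distribution quantile would give a slightly wider and more cautious interval than 1.96; the `z` parameter leaves that choice open.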

📜 4. Automated Code-Quality Heuristics

Some aspects of code quality are straightforward to measure: presence of docstrings, type hints, meaningful identifiers, cyclomatic complexity. But design patterns — proper use of abstractions, separation of concerns, composability — resist automated measurement. The framework uses a heuristic scoring layer that checks for AST-level patterns (function length < 40 lines, branching factor < 8, consistent return types) and awards partial credit based on a weighted rubric. These heuristics are transparent and configurable per problem, and the scoring rationale is included in the Markdown report so a human reviewer can override if needed.
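A few of the AST-level checks named above can be sketched with the standard `ast` module. This is a simplified illustration under stated assumptions: `quality_signals` and `quality_score` are hypothetical names, only three signals are checked (docstring, full annotations, length under 40 lines), and they are weighted equally rather than through the configurable rubric the text describes.

```python
import ast

MAX_FUNCTION_LINES = 40  # threshold taken from the text


def quality_signals(source: str) -> list[dict]:
    """Return per-function signals: docstring, type hints, and length."""
    signals = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        params = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
        length = node.end_lineno - node.lineno + 1
        signals.append({
            "name": node.name,
            "has_docstring": ast.get_docstring(node) is not None,
            "fully_annotated": (
                node.returns is not None
                and all(p.annotation is not None for p in params)
            ),
            "short_enough": length < MAX_FUNCTION_LINES,
        })
    return signals


def quality_score(source: str) -> float:
    """Average the boolean signals into a 0-100 score (equal weights assumed)."""
    signals = quality_signals(source)
    if not signals:
        return 0.0
    checks = [s[key] for s in signals
              for key in ("has_docstring", "fully_annotated", "short_enough")]
    return 100.0 * sum(checks) / len(checks)


if __name__ == "__main__":
    sample = 'def add(a: int, b: int) -> int:\n    """Add two ints."""\n    return a + b\n'
    print(quality_signals(sample), quality_score(sample))
```

Because the source is parsed and never executed, these checks still work on submissions that crash at runtime. The sketch treats an unannotated `self` parameter as a missing hint, which a fuller version would exempt.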

Code Highlight

The evaluation engine runs candidate code through four scoring dimensions with configurable weights, producing a structured report from each execution.

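def evaluate_solution(
    problem_name: str,
    code: str,
    func_name: str,
    test_cases: list[dict],
    edge_cases: list[dict] | None = None,
    time_limit: float = 2.0,
) -> EvaluationResult:
    """Score one candidate solution across the four weighted dimensions."""
    # EvaluationResult, analyze_code_quality, run_tests and measure_performance
    # are defined elsewhere in the framework.
    result = EvaluationResult(problem_name)

    # Dimension 1: Code Quality (static analysis, runs even if execution fails)
    result.code_quality = analyze_code_quality(code)

    # Start the runtime dimensions at zero so the weighted total below is
    # still defined when the candidate fails to compile or raises while loading.
    result.correctness = result.edge_cases = result.performance = 0.0

    # Execute and test
    try:
        # exec() runs the candidate in-process: sandbox untrusted code first.
        exec_globals: dict = {}
        exec(compile(code, '<string>', 'exec'), exec_globals)
        fn = exec_globals[func_name]  # KeyError if the function is missing

        # Dimension 2: Correctness (standard test cases)
        result.correctness = run_tests(fn, test_cases, time_limit)
        # Dimension 3: Edge cases (boundary conditions)
        result.edge_cases = run_tests(fn, edge_cases or [], time_limit)
        # Dimension 4: Performance (timing analysis)
        result.performance = measure_performance(fn, test_cases)
    except Exception as e:
        result.error = str(e)

    # Weighted total with the default profile: 40% + 25% + 20% + 15%
    result.score = (
        result.correctness * 0.40 +
        result.edge_cases  * 0.25 +
        result.performance * 0.20 +
        result.code_quality * 0.15
    )
    return result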
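The excerpt calls a `run_tests` helper that is not shown. The sketch below is a hypothetical stand-in, not the framework's implementation: it assumes each test case is a dict of the form `{"args": [...], "expected": value}` and that the helper returns the percentage of cases passed on a 0-100 scale.

```python
import time


def run_tests(fn, cases: list[dict], time_limit: float = 2.0) -> float:
    """Return the percentage of cases passed (0-100).

    Assumed case schema: {"args": [...], "expected": value}.
    A case fails if it raises, returns the wrong value, or takes longer than
    `time_limit` seconds. The limit is checked after the call returns, so this
    sketch cannot interrupt an infinite loop; that would need a subprocess.
    """
    if not cases:
        return 0.0  # conservative choice for an empty suite (an assumption)
    passed = 0
    for case in cases:
        start = time.perf_counter()
        try:
            actual = fn(*case["args"])
        except Exception:
            continue  # any exception counts as a failed case
        elapsed = time.perf_counter() - start
        if actual == case["expected"] and elapsed <= time_limit:
            passed += 1
    return 100.0 * passed / len(cases)


if __name__ == "__main__":
    suite = [{"args": [2, 3], "expected": 5}, {"args": [-1, 1], "expected": 0}]
    print(run_tests(lambda a, b: a + b, suite))
```

Catching exceptions per case rather than around the whole suite means one crashing input lowers the score instead of zeroing it, which fits the idea of scoring behaviour on each input separately.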

05 Results

Scoring dimensions: 4
Weights: configurable
Report formats: 2 (JSON + Markdown)
CI: GitHub Actions (automated)

The AI Code Evaluator produces a single 0–100 score from four weighted dimensions, enabling side-by-side comparison of good versus flawed solutions. The dual-format output means the same evaluation run feeds both a CI badge in the project README and a detailed report attached to a PR. The architecture is modelled on the evaluation pipelines of Labelbox and Scale AI, adapted for code rather than data-labelling quality.

06 What I Learned

Building an automated code evaluator revealed just how much tacit knowledge goes into human code review — and how systematically it can be encoded. Three takeaways shaped my approach to AI evaluation infrastructure:

📚 Lesson 1

Evaluation frameworks are products. The consumer of a score is rarely just a machine: it's a developer, a team lead, or a CI pipeline whose result a person has to act on. Every output must be interpretable: not just what the score is, but why. The Markdown report that breaks down each dimension with the specific failing test cases turned out to be more valuable than the aggregate score itself. Transparent, auditable evaluation builds trust in automated scoring.

📚 Lesson 2

Weighting is a policy decision, not a technical one. Should performance be worth 20% or 30%? Should code quality ever outweigh correctness for a bootcamp submission? There is no universal answer. The framework's key architectural insight is making weights external configuration in YAML rather than hard-coded constants. Different organisations, courses, or problem domains can define their own weighting profiles without modifying a single line of Python.

📚 Lesson 3

Automated evaluation is a different problem from human review. When a human reviews code, they can infer intent: "this variable name is confusing but I see what you mean." An automated evaluator has no such luxury. Every heuristic is a leaky abstraction. The most robust parts of the framework are the functional tests (correctness, edge cases) because they operate on behaviour rather than form. The code quality dimension is useful as a signal, but it should always be presented with the caveat that it measures convention compliance, not design quality.