Build a Python Tool That Automates an SEO Audit (Student Project)

2026-03-06 · 9 min read

Project-based tutorial: build a Python CLI SEO audit tool that runs checks, scores issues, and outputs a prioritized remediation list.

If you feel overwhelmed by fragmented SEO checks, slow manual audits, and unclear priorities—this project gives you a practical, code-first answer: build a Python command-line tool that runs common SEO checks, scores findings, and outputs a prioritized remediation list you can act on today.

Why this project matters in 2026

SEO audits in 2026 must handle dynamic sites, Core Web Vitals, entity-based signals, and rising accessibility and privacy expectations. Automated tooling saves time and gives students a real deliverable to show employers: a reproducible audit that can run in CI, on-demand, or as part of a PR check.

What you’ll build

  • A CLI Python tool that accepts one or more URLs and runs a set of checks.
  • Checks include: HTTP status, redirects, robots.txt, sitemap presence, title/meta checks, canonical/hreflang, image alt, structured data detection, and Core Web Vitals via Lighthouse or PageSpeed API.
  • An algorithm that scores and prioritizes remediation with recommended fixes and estimated effort/impact.
  • Outputs: terminal summary, JSON export, and a simple HTML report.

Prerequisites

  • Python 3.11+ (2026 standard for many classrooms)
  • Node (optional) to run Lighthouse CLI for full Core Web Vitals
  • pip packages: requests, beautifulsoup4, playwright (for JS-rendered sites), and rich (for pretty CLI output)
  • Basic familiarity with HTTP, HTML, and command-line tools

Install the basics

python -m venv venv
source venv/bin/activate
pip install requests beautifulsoup4 playwright rich
# If you want Lighthouse-based CWV checks:
# npm i -g lighthouse
# and install Playwright browsers:
playwright install

Project architecture

Keep the tool modular so students can extend it. A simple structure:

seo-audit/
├─ cli.py
├─ audit.py
├─ checks/
│  ├─ http_checks.py
│  ├─ meta_checks.py
│  └─ cwv.py
├─ reporter.py
└─ tests/
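
As a starting point, `cli.py` can be a thin argparse wrapper that hands URLs to the audit module. This is a sketch: `run_audit` here is a placeholder standing in for the real dispatcher in `audit.py`.

```python
# cli.py -- minimal sketch; run_audit is a placeholder for the audit module's entry point.
import argparse
import json

def run_audit(url):
    # Placeholder: the real tool dispatches to the checks/ modules here.
    return [{'id': 'demo', 'url': url, 'priority_score': 1.0,
             'message': 'example finding', 'fix': 'n/a'}]

def main(argv=None):
    parser = argparse.ArgumentParser(description='Run an automated SEO audit.')
    parser.add_argument('urls', nargs='+', help='One or more URLs to audit')
    parser.add_argument('--json', dest='json_path', help='Write findings to a JSON file')
    args = parser.parse_args(argv)

    findings = []
    for url in args.urls:
        findings.extend(run_audit(url))

    if args.json_path:
        with open(args.json_path, 'w') as f:
            json.dump({'findings': findings}, f, indent=2)
    for f in findings:
        print(f"{f['priority_score']:>5}  {f['url']}  {f['message']}")
    return findings

if __name__ == '__main__':
    main()
```

Keeping the CLI this thin makes it trivial to test: every check lives in `checks/`, and `cli.py` only parses arguments and formats output.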

Data model (issues & findings)

Each finding is a small dict with fields like:

  • id: unique key (e.g. missing-title)
  • severity: low/medium/high
  • impact: numeric impact estimate (1-10)
  • effort: numeric effort estimate (1-10)
  • message: human-readable explanation
  • url: affected page
  • fix: recommended remediation
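
A tiny constructor helps keep findings consistent across check modules. This helper is an illustration (not part of the original spec); the field names follow the list above.

```python
def make_finding(finding_id, severity, impact, effort, message, url, fix):
    """Build a finding dict with the fields every check module must emit."""
    assert severity in ('low', 'medium', 'high')
    assert 1 <= impact <= 10 and 1 <= effort <= 10
    return {'id': finding_id, 'severity': severity, 'impact': impact,
            'effort': effort, 'message': message, 'url': url, 'fix': fix}

example = make_finding('missing-title', 'high', 8, 2,
                       'Missing <title> tag.', 'https://example.com',
                       'Add an informative title (50-60 chars).')
```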

Step 1 — Basic HTTP & metadata checks

Start small: fetching the page and checking responses and core on-page signals.

Example: fetch and check status

import requests
from bs4 import BeautifulSoup

def fetch(url, timeout=10):
    resp = requests.get(url, timeout=timeout, allow_redirects=True)
    return resp

def check_status(url):
    resp = fetch(url)
    if resp.status_code >= 400:
        return {
            'id': 'http-4xx-5xx',
            'severity': 'high',
            'impact': 9,
            'effort': 2,
            'message': f'HTTP {resp.status_code} returned for {url}',
            'url': url,
            'fix': 'Fix server response or redirects.'
        }
    return None

Meta tag checks (title & description)

def check_meta(html, url):
    soup = BeautifulSoup(html, 'html.parser')
    # get_text avoids an AttributeError when the <title> tag is present but empty
    title = soup.title.get_text(strip=True) if soup.title else ''
    desc = ''
    tag = soup.find('meta', attrs={'name': 'description'})
    if tag:
        desc = tag.get('content', '').strip()

    findings = []
    if not title:
        findings.append({
            'id': 'missing-title', 'severity': 'high', 'impact': 8, 'effort': 2,
            'message': 'Missing <title> tag.', 'url': url, 'fix': 'Add an informative title (50-60 chars).'
        })
    elif len(title) > 70:
        findings.append({
            'id': 'long-title', 'severity': 'medium', 'impact': 4, 'effort': 1,
            'message': 'Title is long (>70 chars).', 'url': url, 'fix': 'Shorten to ~50-60 chars.'
        })

    if not desc:
        findings.append({
            'id': 'missing-description', 'severity': 'medium', 'impact': 6, 'effort': 2,
            'message': 'Missing meta description.', 'url': url, 'fix': 'Add a unique descriptive meta description.'
        })

    return findings

Step 2 — Handling dynamic sites with Playwright

Many modern sites render content client-side. Use Playwright to get the rendered HTML before running checks.

from playwright.sync_api import sync_playwright

def fetch_rendered(url, timeout=30):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until='networkidle', timeout=timeout * 1000)
        html = page.content()
        browser.close()
    return html

Tip: Only use Playwright when you detect JavaScript-driven content; otherwise prefer fast requests-based fetches.
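
One cheap way to make that decision is a heuristic (an assumption, not a standard): if the static HTML contains very little visible text and has a common SPA mount point, fall back to Playwright. The marker list below is illustrative.

```python
import re

# Common single-page-app mount points (React, Vue, Angular conventions) --
# an illustrative list, extend it for your test set.
SPA_ROOTS = ('id="root"', 'id="app"', '<app-root')

def needs_rendering(raw_html, min_text_chars=200):
    """Heuristic: True when the static HTML looks like an empty SPA shell."""
    # Drop script/style bodies, then strip remaining tags to estimate visible text.
    stripped = re.sub(r'(?s)<(script|style).*?</\1>', '', raw_html)
    text = re.sub(r'<[^>]+>', ' ', stripped)
    text_len = len(''.join(text.split()))
    has_spa_root = any(marker in raw_html for marker in SPA_ROOTS)
    return text_len < min_text_chars and has_spa_root
```

This misclassifies some pages (e.g. content-heavy sites that also use a `#root` div), so treat it as a speed optimization with a per-site override flag, not a correctness guarantee.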

Step 3 — Core Web Vitals (CWV) data

To evaluate performance and page experience, call the Lighthouse CLI (Node) or the Google PageSpeed Insights (PSI) API and parse the JSON output. Lighthouse remains the most complete way to capture lab CWV metrics; PSI provides field and lab data via API.

Call Lighthouse via subprocess (basic)
import subprocess, json

def run_lighthouse(url, out_json='lhr.json'):
    cmd = [
        'lighthouse', url,
        '--quiet',
        '--output=json',
        f'--output-path={out_json}',
        '--chrome-flags=--headless'
    ]
    # Pass the argument list directly; joining to a string with shell=True
    # breaks the quoting of --chrome-flags.
    subprocess.run(cmd, check=True)
    with open(out_json) as f:
        return json.load(f)

def extract_cwv(lhr):
    audits = lhr.get('audits', {})
    return {
        'LCP': audits.get('largest-contentful-paint', {}).get('numericValue'),
        'CLS': audits.get('cumulative-layout-shift', {}).get('numericValue'),
        # FID has been retired in favor of INP; Total Blocking Time is the
        # usual lab proxy for responsiveness.
        'TBT': audits.get('total-blocking-time', {}).get('numericValue')
    }

Note: Lighthouse and PSI evolve often. In late 2025 and into 2026 Google refined lab/field distinctions and added more nuanced metrics; treat CWV as one input to prioritization, not the only signal.
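
For field data, a stdlib-only sketch of a PSI v5 call might look like this. The endpoint is the documented one; the metric key names in `extract_field_cwv` are assumed from the v5 response schema, so verify them against a live response before relying on them.

```python
import json
import urllib.parse
import urllib.request

PSI_ENDPOINT = 'https://www.googleapis.com/pagespeedonline/v5/runPagespeed'

def fetch_psi(url, api_key, strategy='mobile'):
    """Call the PageSpeed Insights v5 API and return the parsed JSON."""
    qs = urllib.parse.urlencode({'url': url, 'strategy': strategy, 'key': api_key})
    with urllib.request.urlopen(f'{PSI_ENDPOINT}?{qs}', timeout=60) as resp:
        return json.load(resp)

def extract_field_cwv(psi_json):
    """Pull field (CrUX) percentiles; key names assumed from the v5 schema."""
    metrics = psi_json.get('loadingExperience', {}).get('metrics', {})
    def pct(name):
        return metrics.get(name, {}).get('percentile')
    return {
        'LCP_ms': pct('LARGEST_CONTENTFUL_PAINT_MS'),
        'CLS_x100': pct('CUMULATIVE_LAYOUT_SHIFT_SCORE'),
        'INP_ms': pct('INTERACTION_TO_NEXT_PAINT'),
    }
```

Merging these field percentiles with the Lighthouse lab numbers gives students both halves of the lab/field picture discussed above.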

Step 4 — Other meaningful checks

  • robots.txt — ensure it isn't blocking pages you expect indexed.
  • sitemap.xml — detect presence and check URLs.
  • canonical — duplicate content prevention.
  • hreflang — international sites should expose correct annotations.
  • image alt — missing alt attributes (accessibility & SEO).
  • structured data — detect JSON-LD, schema.org types; missing or invalid schema reduces rich result opportunities.
  • broken links — crawl internal links and check 4xx/5xx responses.
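
The robots.txt check is a good first exercise because the standard library already parses the format. A sketch (fetching the file is left to the existing `fetch` helper; the finding fields follow the data model above):

```python
import urllib.robotparser

def check_robots(url, robots_txt, user_agent='*'):
    """Flag a URL disallowed by robots.txt; robots_txt is the already-fetched file body."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    if not rp.can_fetch(user_agent, url):
        return {'id': 'robots-blocked', 'severity': 'high', 'impact': 9, 'effort': 3,
                'message': f'robots.txt disallows {url}', 'url': url,
                'fix': 'Adjust robots.txt if this page should be indexable.'}
    return None
```

Taking the file body as a parameter (rather than fetching inside the check) keeps the function trivially testable with a string fixture, which matches the testing approach later in this tutorial.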

Step 5 — Prioritization algorithm

A key learning outcome: turn raw findings into a prioritized remediation list. Use a simple scoring function that balances impact and effort. Students should be able to tweak weights and thresholds.

def compute_priority(finding):
    # Normalize impact/effort (1-10 scale expected)
    impact = finding.get('impact', 5)
    effort = finding.get('effort', 5)
    severity_weight = {'low': 1, 'medium': 1.5, 'high': 2}[finding.get('severity', 'medium')]

    # Higher score => higher priority
    # priority_score = (impact * severity_weight) / effort
    score = (impact * severity_weight) / max(1, effort)
    finding['priority_score'] = round(score, 2)
    return finding

# Example: sort findings
findings = [compute_priority(f) for f in findings]
findings.sort(key=lambda x: x['priority_score'], reverse=True)

This produces an ordered list where high-impact, low-effort fixes bubble to the top—perfect for sprint planning.

Enrich findings with suggested fixes and estimated effort

Where possible, attach a short remediation recipe and an estimated time-to-fix (T-shirt sizing). Example mapping:

  • Missing title — fix in CMS, effort: 15–30 minutes
  • Large LCP image — compress or lazy-load image, effort: 1–4 hours
  • Broken internal links — update links, effort: 30–90 minutes
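
In code, that mapping can be a plain dict keyed by finding id. The table contents here (ids, sizes, estimates) are illustrative examples, not canonical values.

```python
# Hypothetical mapping from finding id to remediation recipe and T-shirt size.
FIX_RECIPES = {
    'missing-title':        {'recipe': 'Add a title in the CMS page settings.',
                             'size': 'S', 'estimate': '15-30 min'},
    'large-lcp-image':      {'recipe': 'Compress, resize, or lazy-load the hero image.',
                             'size': 'M', 'estimate': '1-4 h'},
    'broken-internal-link': {'recipe': 'Update or remove the dead link.',
                             'size': 'S', 'estimate': '30-90 min'},
}

def enrich(finding):
    """Attach the recipe for known finding ids; unknown ids pass through unchanged."""
    recipe = FIX_RECIPES.get(finding['id'])
    if recipe:
        finding.update(recipe)
    return finding
```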

Step 6 — Reporting

Students should produce machine-readable output and a human-friendly HTML summary. At minimum, export a JSON file containing findings and priority scores. For a nicer view, render a small HTML template with grouped sections: Critical, High, Medium, Low.

def write_json_report(findings, path='report.json'):
    import json
    with open(path, 'w') as f:
        json.dump({'findings': findings}, f, indent=2)

# Minimal HTML reporter (concept)
HTML_TEMPLATE = '''<html>
<head><meta charset="utf-8"><title>SEO Audit Report</title></head>
<body>
<h1>SEO Audit Report</h1>
{rows}
</body>
</html>'''

def write_html_report(findings, path='report.html'):
    rows = ''
    for f in findings:
        rows += (
            f"<section><h2>{f['priority_score']} - {f['message']}</h2>"
            f"<p>{f['fix']}</p></section>\n"
        )
    with open(path, 'w') as f:
        f.write(HTML_TEMPLATE.format(rows=rows))

Step 7 — Automation and CI integration

Run audits automatically:

  • On PRs (catch regressions like missing meta or large asset changes)
  • Scheduled (daily/weekly) to monitor trends
  • As a pre-deploy check in staging environments

Use GitHub Actions or GitLab CI to run the CLI and upload the JSON artifact; fail a job only on critical regressions to avoid noisy failures.
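
The "fail only on critical regressions" policy is easy to express as a small gate script the CI job runs after the audit. A sketch (file layout and function name are illustrative): read the JSON artifact and return a non-zero exit code only when high-severity findings are present.

```python
import json
import sys

def gate(report_path='report.json', fail_on_severity='high'):
    """Return 1 when findings at or above the given severity exist, else 0,
    so CI fails on critical regressions without noisy failures."""
    order = {'low': 0, 'medium': 1, 'high': 2}
    with open(report_path) as f:
        findings = json.load(f)['findings']
    critical = [x for x in findings
                if order.get(x.get('severity', 'low'), 0) >= order[fail_on_severity]]
    for x in critical:
        print(f"CRITICAL: {x['id']} on {x['url']}", file=sys.stderr)
    return 1 if critical else 0

if __name__ == '__main__':
    sys.exit(gate(*sys.argv[1:]))
```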

Testing and validation

Teach students to write small tests for each check. Mock HTTP responses for status and HTML snippets for meta checks. Example using pytest:

def test_check_meta_missing_title():
    html = '<html><head><meta name="description" content="desc"/></head><body></body></html>'
    findings = check_meta(html, 'https://example.com')
    assert any(f['id']=='missing-title' for f in findings)

Trends to teach alongside the build

Keep learning outcomes oriented to current trends so the project stays relevant:

  • Entity-based SEO: Search engines increasingly interpret entity relationships. Audits should note structured data and content that supports entity graphs (late 2025 to 2026 trend).
  • AI content & signal detection: With more AI-generated content in 2026, audits should flag thin or templated pages and encourage unique, authoritative content.
  • Performance & privacy: RUM and lab metrics are evolving alongside privacy-first measurement approaches—treat lab CWV from Lighthouse as part of a bigger picture.
  • Accessibility as SEO: Accessibility issues (missing alts, bad contrast) now intersect with SEO performance and should be flagged.
  • Jamstack & dynamic SPAs: Many sites render client-side—include Playwright-driven checks by default for SPAs.

Practical audits balance automated signals with human judgment. The goal is prioritized, actionable fixes, not a laundry list of noise.

Advanced strategies (challenges for students)

  1. Integrate Page Experience field data by calling the PageSpeed Insights API and merging field results with lab metrics.
  2. Add a crawler that respects robots.txt and `crawl-delay`, and performs breadth-first discovery to a configurable depth.
  3. Build a scoring dashboard and trend charts by persisting reports to a simple database and plotting changes over time.
  4. Identify content duplication by computing simhash or tf-idf and flagging near-duplicate pages.
  5. Hook into the CMS (if available) to propose a bulk fix plan or even a draft PR containing meta updates.
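
For challenge 4, a plain-Python tf-idf + cosine-similarity sketch is enough to get started (no sklearn required; the smoothed idf formula here is one common choice, and the 0.9 threshold is an assumption to tune):

```python
import math
import re
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dict per doc) with smoothed idf."""
    tokenized = [re.findall(r'[a-z0-9]+', d.lower()) for d in docs]
    n = len(docs)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    return [
        {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def near_duplicates(texts, threshold=0.9):
    """Return index pairs of pages whose tf-idf cosine similarity meets the threshold."""
    vecs = tfidf_vectors(texts)
    return [(i, j)
            for i in range(len(vecs))
            for j in range(i + 1, len(vecs))
            if cosine(vecs[i], vecs[j]) >= threshold]
```

The pairwise loop is O(n²), which is fine for classroom-sized crawls; simhash is the natural upgrade once students want to scale past a few thousand pages.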

Classroom assignment ideas

  • Mini project: implement 5 checks (status, title, description, canonical, image-alt) and generate a prioritized list.
  • Team project: one team builds crawler and checks, another team builds reporter and UI; combine in CI.
  • Capstone: extend tool to run as a GitHub Action that comments on PRs with critical SEO regressions.

Actionable checklist for students (start here)

  1. Set up the repo and virtual environment.
  2. Implement fetch + simple meta checks using requests + BeautifulSoup.
  3. Add Playwright support for one dynamic page in your test set.
  4. Wire in Lighthouse or PSI and extract CWV metrics for one URL.
  5. Implement the priority scoring and output JSON + simple HTML report.
  6. Write 3–5 unit tests for your checks and run in CI.

Key takeaways

  • Automate the repetitive parts: use scripts for checks and report generation so audits are reproducible.
  • Balance impact and effort: a simple priority formula helps teams pick fixes that move the needle fast.
  • Combine lab and field data: Lighthouse gives lab CWV; use PSI field data where possible.
  • Handle modern architectures: include a JS renderer (Playwright) for SPAs and headless checks.
  • Make findings actionable: each issue should include a clear fix and an estimated effort.

Further reading & sources

Read up on Lighthouse and PageSpeed Insights documentation, recent SEO audit guides (HubSpot and industry blogs), and the Playwright docs for headless rendering. Keep an eye on updates from Google throughout 2026, since ranking-related signals and measurement tools continue to evolve.

Final challenge and call-to-action

Ready to build something you can show a hiring manager? Clone your starter repo, implement the five core checks, and run a 1-page audit. Then extend it: add Lighthouse CWV, produce an HTML report, and open a PR. Share your report with classmates or your instructor and explain the top three prioritized fixes.

Action: Start the project now—create the repo, scaffold the modules from the architecture above, and push your first commit with a README that describes your checks. Tag it for peer review and iterate: automated audits are as much about improving the rules as they are about running them.
