Build a Python Tool That Automates an SEO Audit (Student Project)
Project-based tutorial: build a Python CLI SEO audit tool that runs checks, scores issues, and outputs a prioritized remediation list.
If you feel overwhelmed by fragmented SEO checks, slow manual audits, and unclear priorities, this project gives you a practical, code-first answer: build a Python command-line tool that runs common SEO checks, scores findings, and outputs a prioritized remediation list you can act on today.
Why this project matters in 2026
SEO audits in 2026 must handle dynamic sites, Core Web Vitals, entity-based signals, and rising accessibility and privacy expectations. Automated tooling saves time and gives students a real deliverable to show employers: a reproducible audit that can run in CI, on-demand, or as part of a PR check.
What you’ll build
- A CLI Python tool that accepts one or more URLs and runs a set of checks.
- Checks include: HTTP status, redirects, robots.txt, sitemap presence, title/meta checks, canonical/hreflang, image alt, structured data detection, and Core Web Vitals via Lighthouse or PageSpeed API.
- An algorithm that scores and prioritizes remediation with recommended fixes and estimated effort/impact.
- Outputs: terminal summary, JSON export, and a simple HTML report.
Prerequisites
- Python 3.11+ (2026 standard for many classrooms)
- Node (optional) to run Lighthouse CLI for full Core Web Vitals
- pip packages: requests, beautifulsoup4, playwright (for JS-rendered sites), and rich (for pretty CLI output)
- Basic familiarity with HTTP, HTML, and command-line tools
Install the basics
python -m venv venv
source venv/bin/activate
pip install requests beautifulsoup4 playwright rich
# If you want Lighthouse-based CWV checks:
# npm i -g lighthouse
# and install Playwright browsers:
playwright install
Project architecture
Keep the tool modular so students can extend it. A simple structure:
seo-audit/
├─ cli.py
├─ audit.py
├─ checks/
│ ├─ http_checks.py
│ ├─ meta_checks.py
│ └─ cwv.py
├─ reporter.py
└─ tests/
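A minimal `cli.py` to scaffold this layout could look like the sketch below. The flag names (`--json`, `--html`, `--render`) are illustrative choices, not fixed by the tutorial; students can rename or extend them.

```python
# cli.py — minimal argument parsing for the audit tool (a sketch).
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        prog='seo-audit',
        description='Run SEO checks against one or more URLs.')
    parser.add_argument('urls', nargs='+', help='URL(s) to audit')
    parser.add_argument('--json', metavar='PATH', help='write findings as JSON')
    parser.add_argument('--html', metavar='PATH', help='write an HTML report')
    parser.add_argument('--render', action='store_true',
                        help='use Playwright to render JS-driven pages')
    return parser
```

Keeping the parser in its own function makes it easy to unit-test argument handling without invoking the whole tool.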
Data model (issues & findings)
Each finding is a small dict with fields like:
- id: unique key (e.g. missing-title)
- severity: low/medium/high
- impact: numeric impact estimate (1-10)
- effort: numeric effort estimate (1-10)
- message: human-readable explanation
- url: affected page
- fix: recommended remediation
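To keep every check emitting the same shape, the fields above can be wrapped in a small factory. `make_finding` is a hypothetical helper name, not part of the tutorial's required API:

```python
def make_finding(id, severity, impact, effort, message, url, fix):
    """Build a finding dict with the fields described above,
    validating the value ranges up front."""
    assert severity in ('low', 'medium', 'high')
    assert 1 <= impact <= 10 and 1 <= effort <= 10
    return {'id': id, 'severity': severity, 'impact': impact,
            'effort': effort, 'message': message, 'url': url, 'fix': fix}
```

Centralizing construction means a typo in a field name fails loudly in one place instead of silently breaking the reporter.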
Step 1 — Basic HTTP & metadata checks
Start small: fetching the page and checking responses and core on-page signals.
Example: fetch and check status
import requests
from bs4 import BeautifulSoup

def fetch(url, timeout=10):
    resp = requests.get(url, timeout=timeout, allow_redirects=True)
    return resp

def check_status(url):
    resp = fetch(url)
    if resp.status_code >= 400:
        return {
            'id': 'http-4xx-5xx',
            'severity': 'high',
            'impact': 9,
            'effort': 2,
            'message': f'HTTP {resp.status_code} returned for {url}',
            'url': url,
            'fix': 'Fix server response or redirects.'
        }
    return None
Meta tag checks (title & description)
def check_meta(html, url):
    soup = BeautifulSoup(html, 'html.parser')
    # get_text() avoids an AttributeError when the title tag is empty
    title = soup.title.get_text(strip=True) if soup.title else ''
    desc = ''
    tag = soup.find('meta', attrs={'name': 'description'})
    if tag:
        desc = tag.get('content', '').strip()
    findings = []
    if not title:
        findings.append({
            'id': 'missing-title', 'severity': 'high', 'impact': 8, 'effort': 2,
            'message': 'Missing <title> tag.', 'url': url,
            'fix': 'Add an informative title (50-60 chars).'
        })
    elif len(title) > 70:
        findings.append({
            'id': 'long-title', 'severity': 'medium', 'impact': 4, 'effort': 1,
            'message': 'Title is long (>70 chars).', 'url': url,
            'fix': 'Shorten to ~50-60 chars.'
        })
    if not desc:
        findings.append({
            'id': 'missing-description', 'severity': 'medium', 'impact': 6, 'effort': 2,
            'message': 'Missing meta description.', 'url': url,
            'fix': 'Add a unique, descriptive meta description.'
        })
    return findings
Step 2 — Handling dynamic sites with Playwright
Many modern sites render content client-side. Use Playwright to get the rendered HTML before running checks.
from playwright.sync_api import sync_playwright

def fetch_rendered(url, timeout=30):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until='networkidle', timeout=timeout * 1000)
        html = page.content()
        browser.close()
    return html
Tip: Only use Playwright when you detect JavaScript-driven content; otherwise prefer fast requests-based fetches.
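One rough way to make that decision automatically is a stdlib-only heuristic like the sketch below: a page with script tags but almost no visible text is probably rendered client-side. This is an assumption-laden shortcut, not a robust detector, and the 200-character threshold should be tuned per site:

```python
import re

def looks_js_rendered(html, min_text_chars=200):
    """Rough heuristic: <script> tags present but very little text
    outside of markup suggests client-side rendering."""
    has_scripts = '<script' in html.lower()
    # Crude tag stripping; good enough for a heuristic, not for parsing.
    text = re.sub(r'<script.*?</script>', '', html, flags=re.S | re.I)
    text = re.sub(r'<[^>]+>', '', text)
    return has_scripts and len(text.strip()) < min_text_chars
```

The audit loop can then fall back to `fetch_rendered` only when this returns True, keeping the fast requests path the default.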
Step 3 — Core Web Vitals (CWV) data
To evaluate performance and page experience, call Lighthouse CLI (Node) or Google PageSpeed Insights (PSI) API and parse JSON output. Lighthouse remains the most complete way to capture lab CWV metrics; PSI provides field and lab data via API.
Call Lighthouse via subprocess (basic)
import subprocess, json

def run_lighthouse(url, out_json='lhr.json'):
    cmd = [
        'lighthouse', url,
        '--quiet',
        '--output=json',
        f'--output-path={out_json}',
        '--chrome-flags=--headless'
    ]
    # Pass the argument list directly; shell=True plus string joining
    # invites quoting bugs.
    subprocess.run(cmd, check=True)
    with open(out_json) as f:
        return json.load(f)

def extract_cwv(lhr):
    audits = lhr.get('audits', {})
    return {
        'LCP': audits.get('largest-contentful-paint', {}).get('numericValue'),
        'CLS': audits.get('cumulative-layout-shift', {}).get('numericValue'),
        # Lighthouse has no lab FID/INP measurement; Total Blocking Time
        # is the usual lab proxy for responsiveness.
        'TBT': audits.get('total-blocking-time', {}).get('numericValue')
    }
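The same extraction idea works for PSI field data. A sketch, assuming the PageSpeed Insights v5 endpoint and the CrUX metric keys as documented at the time of writing; `fetch_psi` and `extract_field_cwv` are hypothetical helper names:

```python
import requests

PSI_ENDPOINT = 'https://www.googleapis.com/pagespeedonline/v5/runPagespeed'

def fetch_psi(url, api_key=None, strategy='mobile'):
    """Query the PageSpeed Insights v5 API for lab + field data."""
    params = {'url': url, 'strategy': strategy}
    if api_key:
        params['key'] = api_key
    resp = requests.get(PSI_ENDPOINT, params=params, timeout=60)
    resp.raise_for_status()
    return resp.json()

def extract_field_cwv(psi):
    """Pull CrUX field metrics; these may be absent for low-traffic pages."""
    metrics = psi.get('loadingExperience', {}).get('metrics', {})
    return {
        'LCP_ms': metrics.get('LARGEST_CONTENTFUL_PAINT_MS', {}).get('percentile'),
        'INP_ms': metrics.get('INTERACTION_TO_NEXT_PAINT', {}).get('percentile'),
        'CLS': metrics.get('CUMULATIVE_LAYOUT_SHIFT_SCORE', {}).get('percentile'),
    }
```

Separating the fetch from the extraction means the parser can be unit-tested against a saved JSON fixture without any network access.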
Note: Lighthouse and PSI evolve often; Google replaced FID with INP as a Core Web Vital in 2024, and lab/field distinctions were refined further in late 2025 and into 2026. Treat CWV as one input to prioritization, not the only signal.
Step 4 — Other meaningful checks
- robots.txt — ensure it isn't blocking pages you expect indexed.
- sitemap.xml — detect presence and check URLs.
- canonical — duplicate content prevention.
- hreflang — international sites should expose correct annotations.
- image alt — missing alt attributes (accessibility & SEO).
- structured data — detect JSON-LD, schema.org types; missing or invalid schema reduces rich result opportunities.
- broken links — crawl internal links and check 4xx/5xx responses.
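As a worked example of one of these checks, here is a robots.txt check built on the stdlib's `urllib.robotparser`. The finding fields follow the data model above; the impact/effort numbers are illustrative, and passing the robots.txt content in as a string keeps the check easy to unit-test:

```python
from urllib import robotparser

def check_robots(url, robots_txt):
    """Flag a page that robots.txt disallows for the generic user agent."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    if not rp.can_fetch('*', url):
        return {
            'id': 'robots-blocked', 'severity': 'high', 'impact': 9, 'effort': 1,
            'message': f'robots.txt disallows crawling of {url}',
            'url': url,
            'fix': 'Remove the Disallow rule if the page should be indexed.'
        }
    return None
```

The caller fetches `/robots.txt` once per host and reuses it for every page on that host.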
Step 5 — Prioritization algorithm
A key learning outcome: turn raw findings into a prioritized remediation list. Use a simple scoring function that balances impact and effort. Students should be able to tweak weights and thresholds.
def compute_priority(finding):
    # Normalize impact/effort (1-10 scale expected)
    impact = finding.get('impact', 5)
    effort = finding.get('effort', 5)
    weights = {'low': 1, 'medium': 1.5, 'high': 2}
    # .get with a default avoids a KeyError on an unexpected severity value
    severity_weight = weights.get(finding.get('severity', 'medium'), 1.5)
    # Higher score => higher priority: impact * severity_weight / effort
    score = (impact * severity_weight) / max(1, effort)
    finding['priority_score'] = round(score, 2)
    return finding

# Example: sort findings
findings = [compute_priority(f) for f in findings]
findings.sort(key=lambda x: x['priority_score'], reverse=True)
This produces an ordered list where high-impact, low-effort fixes bubble to the top—perfect for sprint planning.
Enrich findings with suggested fixes and estimated effort
Where possible, attach a short remediation recipe and an estimated time-to-fix (T-shirt sizing). Example mapping:
- Missing title — fix in CMS, effort: 15–30 minutes
- Large LCP image — compress or lazy-load image, effort: 1–4 hours
- Broken internal links — update links, effort: 30–90 minutes
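This mapping can live in a plain dict keyed by finding id, so checks stay lean and remediation text is maintained in one place. The ids, sizes, and estimates below are illustrative:

```python
# Illustrative remediation recipes keyed by finding id.
FIX_RECIPES = {
    'missing-title': {'fix': 'Add a title in the CMS.', 'size': 'S', 'estimate': '15-30 min'},
    'large-lcp-image': {'fix': 'Compress or lazy-load the image.', 'size': 'M', 'estimate': '1-4 h'},
    'broken-internal-link': {'fix': 'Update or remove the link.', 'size': 'S', 'estimate': '30-90 min'},
}

def enrich(finding):
    """Attach a remediation recipe when one exists for the finding id."""
    recipe = FIX_RECIPES.get(finding['id'])
    if recipe:
        finding.setdefault('fix', recipe['fix'])
        finding['size'] = recipe['size']
        finding['estimate'] = recipe['estimate']
    return finding
```

`setdefault` keeps any fix text a check already supplied, only filling the gap when the check emitted none.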
Step 6 — Reporting
Students should produce machine-readable output and a human-friendly HTML summary. At minimum, export a JSON file containing findings and priority scores. For a nicer view, render a small HTML template with grouped sections: Critical, High, Medium, Low.
def write_json_report(findings, path='report.json'):
    import json
    with open(path, 'w') as f:
        json.dump({'findings': findings}, f, indent=2)

# Minimal HTML reporter (concept)
HTML_TEMPLATE = '''<!doctype html>
<html><head><title>SEO Audit</title></head>
<body><h1>SEO Audit Report</h1>
<ul>
{rows}
</ul>
</body></html>'''

def write_html_report(findings, path='report.html'):
    rows = ''
    for item in findings:
        rows += (f"<li><strong>{item['priority_score']}</strong> - "
                 f"{item['message']}<br/>{item['fix']}</li>\n")
    with open(path, 'w') as f:
        f.write(HTML_TEMPLATE.format(rows=rows))
Step 7 — Automation and CI integration
Run audits automatically:
- On PRs (catch regressions like missing meta or large asset changes)
- Scheduled (daily/weekly) to monitor trends
- As a pre-deploy check in staging environments
Use GitHub Actions or GitLab CI to run the CLI and upload the JSON artifact; fail a job only on critical regressions to avoid noisy failures.
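A small gate script can implement that "fail only on critical regressions" policy by reading the JSON report the CLI already produces; the CI job runs the audit, then runs this and fails on a nonzero return. `ci_gate` is a hypothetical helper name:

```python
import json

def ci_gate(report_path='report.json', fail_on=('high',)):
    """Return a nonzero exit code only when critical findings exist,
    so routine medium/low issues don't turn every pipeline red."""
    with open(report_path) as f:
        findings = json.load(f).get('findings', [])
    critical = [x for x in findings if x.get('severity') in fail_on]
    for x in critical:
        print(f"CRITICAL: {x.get('id')} on {x.get('url')}")
    return 1 if critical else 0
```

In a workflow step this becomes something like `python -c "import sys, gate; sys.exit(gate.ci_gate())"`, with the threshold tunable via `fail_on`.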
Testing and validation
Teach students to write small tests for each check. Mock HTTP responses for status and HTML snippets for meta checks. Example using pytest:
def test_check_meta_missing_title():
    html = '<html><head><meta name="description" content="desc"/></head><body></body></html>'
    findings = check_meta(html, 'https://example.com')
    assert any(f['id'] == 'missing-title' for f in findings)
2026 trends and how they shape this project
Keep learning outcomes oriented to current trends so the project stays relevant:
- Entity-based SEO: Search engines increasingly interpret entity relationships. Audits should note structured data and content that supports entity graphs (late 2025 to 2026 trend).
- AI content & signal detection: With more AI-generated content in 2026, audits should flag thin or templated pages and encourage unique, authoritative content.
- Performance & privacy: RUM and lab metrics are evolving alongside privacy-first measurement approaches—treat lab CWV from Lighthouse as part of a bigger picture.
- Accessibility as SEO: Accessibility issues (missing alts, bad contrast) now intersect with SEO performance and should be flagged.
- Jamstack & dynamic SPAs: Many sites render client-side—include Playwright-driven checks by default for SPAs.
Practical audits balance automated signals with human judgment. The goal is prioritized, actionable fixes, not a laundry list of noise.
Advanced strategies (challenges for students)
- Integrate Page Experience field data by calling the PageSpeed Insights API and merging field results with lab metrics.
- Add a crawler that respects robots.txt and `crawl-delay`, and performs breadth-first discovery to a configurable depth.
- Build a scoring dashboard and trend charts by persisting reports to a simple database and plotting changes over time.
- Identify content duplication by computing simhash or tf-idf and flagging near-duplicate pages.
- Hook into the CMS (if available) to propose a bulk fix plan or even a draft PR containing meta updates.
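For the duplication challenge, a much simpler stand-in for simhash or tf-idf is word-shingle Jaccard similarity, which needs only the stdlib. A sketch with an illustrative 0.8 threshold:

```python
import re

def shingles(text, k=3):
    """Word k-shingles of a page's visible text."""
    words = re.findall(r'\w+', text.lower())
    return {' '.join(words[i:i + k]) for i in range(max(0, len(words) - k + 1))}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(pages, threshold=0.8):
    """Return URL pairs whose shingle overlap exceeds the threshold.
    pages maps url -> extracted page text."""
    sh = {url: shingles(text) for url, text in pages.items()}
    urls = sorted(sh)
    return [(u, v) for i, u in enumerate(urls) for v in urls[i + 1:]
            if jaccard(sh[u], sh[v]) >= threshold]
```

This is O(n²) in page count; students tackling large crawls can graduate to simhash with locality-sensitive hashing from here.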
Classroom assignment ideas
- Mini project: implement 5 checks (status, title, description, canonical, image-alt) and generate a prioritized list.
- Team project: one team builds crawler and checks, another team builds reporter and UI; combine in CI.
- Capstone: extend tool to run as a GitHub Action that comments on PRs with critical SEO regressions.
Actionable checklist for students (start here)
- Set up the repo and virtual environment.
- Implement fetch + simple meta checks using requests + BeautifulSoup.
- Add Playwright support for one dynamic page in your test set.
- Wire in Lighthouse or PSI and extract CWV metrics for one URL.
- Implement the priority scoring and output JSON + simple HTML report.
- Write 3–5 unit tests for your checks and run in CI.
Key takeaways
- Automate the repetitive parts: use scripts for checks and report generation so audits are reproducible.
- Balance impact and effort: a simple priority formula helps teams pick fixes that move the needle fast.
- Combine lab and field data: Lighthouse gives lab CWV; use PSI field data where possible.
- Handle modern architectures: include a JS renderer (Playwright) for SPAs and headless checks.
- Make findings actionable: each issue should include a clear fix and an estimated effort.
Further reading & sources
Read up on Lighthouse and PageSpeed Insights documentation, recent SEO audit guides (HubSpot and industry blogs), and the Playwright docs for headless rendering. Keep an eye on updates from Google throughout 2026, since ranking-related signals and measurement tools continue to evolve.
Final challenge and call-to-action
Ready to build something you can show a hiring manager? Clone your starter repo, implement the five core checks, and run a 1-page audit. Then extend it: add Lighthouse CWV, produce an HTML report, and open a PR. Share your report with classmates or your instructor and explain the top three prioritized fixes.
Action: Start the project now—create the repo, scaffold the modules from the architecture above, and push your first commit with a README that describes your checks. Tag it for peer review and iterate: automated audits are as much about improving the rules as they are about running them.