Jerry A. Henley

Static vs. Dynamic Scraping: Choosing the Right Architecture for AI-Generated Code

AI coding assistants and tools like an AI scraper builder have fundamentally changed web scraping. Tasks that once required hours of manual DOM inspection and regex tinkering can now be completed with a single prompt: "Write me a Python script to scrape product prices from this URL."

However, there is a trap in this convenience. While AI is excellent at writing code that works, it isn't always great at writing code that is optimal. If you don't specify the architecture, an AI might generate a heavy, resource-intensive browser automation script for a site that could have been scraped with a simple HTTP request.

Many developers run into edge cases when scraping complex sites like Amazon. Instead of relying entirely on AI-generated selectors, reference open-source Amazon scrapers that demonstrate proven extraction patterns across different product layouts.

This guide explores the choice between static and dynamic extraction. Understanding these trade-offs helps you prompt AI tools to build scrapers that are faster, cheaper, and more scalable.

The Contenders: Static vs. Dynamic

Before optimizing scripts, we need to define the two primary methodologies used in modern web scraping.

Static Extraction (Requests + BeautifulSoup)

This is the traditional, highly efficient method. A library like requests or httpx sends a GET request to a server, which returns the raw HTML. A parser like BeautifulSoup or lxml then processes that text.

  • Pros: Fast, low memory usage, easy to scale.
  • Cons: Cannot execute JavaScript. If a framework like React or Vue renders data after the page loads, this method sees an empty shell.

Dynamic Extraction (Playwright/Selenium)

Dynamic extraction involves controlling a headless browser. Tools like Playwright or Selenium launch an instance of Chromium or Firefox, load the page, and execute JavaScript just like a real user.

  • Pros: Handles any website; can click buttons, scroll pages, and complete complex authentication flows.
  • Cons: Slow, resource-heavy, and difficult to manage at scale.

| Feature | Static (BeautifulSoup) | Dynamic (Playwright) |
| --- | --- | --- |
| Speed | Very fast (ms) | Slow (seconds) |
| Resource Cost | Low (minimal CPU/RAM) | High (browser overhead) |
| JS Execution | No | Yes |
| Complexity | Simple | Intermediate/Advanced |

When to Choose Static Extraction

Treat static extraction as the default choice. In engineering, the goal is to use the simplest tool that solves the problem.

Performance Benefits

When using requests, you only download raw HTML bytes. Playwright downloads the HTML, CSS, JavaScript, and images, then uses significant CPU power to render that page in a virtual window.

If you scrape 10,000 pages, the difference between a 200ms static request and a 5-second browser load is the difference between finishing in 30 minutes or 14 hours.
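
A quick back-of-the-envelope check of that claim, assuming purely sequential requests at the latencies above:

pages = 10_000
static_seconds = pages * 0.2   # ~200 ms per HTTP request
dynamic_seconds = pages * 5.0  # ~5 s per full browser load

print(f"Static:  {static_seconds / 60:.0f} minutes")   # ~33 minutes
print(f"Dynamic: {dynamic_seconds / 3600:.1f} hours")  # ~13.9 hours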

Ideal Scenarios

  • News and Blogs: Most articles are baked into the HTML for SEO.
  • Standard E-commerce: Many large retailers serve product data in the initial HTML response.
  • Government/Wiki Sites: These rarely rely on complex JavaScript frameworks.

Tip: To check if a site supports static extraction, right-click the page and select View Page Source (not Inspect). If the data appears in that text, use a static scraper.
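
You can also run this check programmatically: fetch the raw HTML with requests and search it for a value you can see in the rendered page. A minimal sketch — the URL and the needle string are placeholders:

import requests

def appears_in_static_html(url, needle):
    """Return True if `needle` is present in the raw (unrendered) HTML."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    return needle in response.text

# If a price you can see in the browser is missing here, the site
# renders it with JavaScript and needs a dynamic scraper.
print(appears_in_static_html("https://example.com/product", "$19.99"))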

Prompting AI for Static Extraction

Be explicit with AI assistants. Otherwise, the AI might default to Playwright as a catch-all solution.

Better Prompt: "Write a Python script using requests and BeautifulSoup to extract the titles and prices from this URL. Assume the data is available in the static HTML."

import requests
from bs4 import BeautifulSoup

def scrape_static(url):
    # A realistic User-Agent avoids basic bot filters
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=10)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        products = []
        for item in soup.select('.product-card'):
            name = item.select_one('.title')
            price = item.select_one('.price')
            if name and price:  # Skip cards missing either field
                products.append({
                    'name': name.text.strip(),
                    'price': price.text.strip()
                })
        return products
    return []

When to Choose Dynamic Extraction

Static extraction isn't always possible. Single Page Applications (SPAs) often serve almost empty initial HTML, fetching content via API calls after the page loads.

When Playwright is Necessary

  1. JavaScript Rendering: The data only appears after a few seconds of loading.
  2. User Interaction: You must click a "Show Phone Number" button or scroll to trigger an infinite load (see the scrolling sketch after this list).
  3. Complex Redirects: Sites use JS-based challenges or heavy client-side redirects.
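
For case 2, here is a minimal Playwright sketch for triggering an infinite scroll. The .product-card selector and the scroll limit are assumptions for illustration:

from playwright.sync_api import sync_playwright

def scroll_until_loaded(url, max_scrolls=10):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)

        previous_count = 0
        for _ in range(max_scrolls):
            # Jump to the bottom to trigger the next batch of results
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(1500)  # Give the XHR time to complete

            count = page.locator('.product-card').count()
            if count == previous_count:
                break  # No new items appeared; we've reached the end
            previous_count = count

        html = page.content()
        browser.close()
        return html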

Prompting AI for Dynamic Extraction

When a site is dynamic, tell the AI which specific interaction to handle.

Better Prompt: "Generate a Python script using Playwright. The site loads data dynamically. Wait for the .results-list selector to be visible before parsing, then click the 'Next' button."

from playwright.sync_api import sync_playwright

def scrape_dynamic(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)

        # Wait for the JS to render the content
        page.wait_for_selector('.product-card')

        # Extract data from the rendered DOM; optional chaining (?.)
        # skips cards that are missing a title or price node
        products = page.eval_on_selector_all('.product-card', """
            nodes => nodes.map(n => ({
                name: n.querySelector('.title')?.innerText.trim(),
                price: n.querySelector('.price')?.innerText.trim()
            }))
        """)

        browser.close()
        return products

Benchmarking the Results

A test scraping 50 pages of a standard product listing shows the stark difference in overhead.

| Metric | Static (Requests) | Dynamic (Playwright) |
| --- | --- | --- |
| Avg. Time per Page | 0.45 seconds | 4.2 seconds |
| Total Time (50 pages) | 22.5 seconds | 210 seconds |
| RAM Usage | ~45 MB | ~380 MB |
| CPU Usage | Negligible | 15-25% (per instance) |

In this scenario, the Playwright script was nearly 10 times slower and used 8 times more memory. Running scrapers at scale on cloud providers like AWS or DigitalOcean makes these resource differences a major factor in monthly costs.

Decision Framework: The Prompt-to-Parser Workflow

Follow this three-step workflow to decide which architecture to request from an AI:

1. The No-JS Test

Disable JavaScript in your browser settings and reload the page.

  • Is the data still there? Go Static.
  • Is the page blank? Go Dynamic.

2. Network Inspection

Open Chrome DevTools (F12), go to the Network tab, and filter by Fetch/XHR. Reload the page. If a JSON response contains your data, you can use a static scraper to hit that API directly. This is often faster than parsing HTML.
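
For example, if DevTools shows the page populating itself from a JSON endpoint, you can often call it directly. The endpoint path and field names below are hypothetical:

import requests

# Hypothetical endpoint discovered in the Network tab (Fetch/XHR filter)
api_url = "https://example.com/api/products?page=1"

response = requests.get(api_url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
data = response.json()

# Structured data, no HTML parsing required (field names are assumptions)
for product in data["items"]:
    print(product["name"], product["price"])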

3. Structural Prompting

Once you choose a method, use a structured prompt:

  • Specify the library: "Use httpx and selectolax for performance."
  • Define the wait condition: "Wait for the network to be idle."
  • Handle errors: "Include a try/except block for connection timeouts."
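
Putting the library and error-handling instructions together (the wait condition applies to the dynamic path), here is a sketch of what the AI should return, reusing the same hypothetical .product-card markup from earlier:

import httpx
from selectolax.parser import HTMLParser

def scrape_fast(url):
    try:
        response = httpx.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10.0)
        response.raise_for_status()
    except httpx.TimeoutException:
        print(f"Timed out fetching {url}")
        return []
    except httpx.HTTPError as exc:
        print(f"Request failed: {exc}")
        return []

    # selectolax is a fast C-backed parser; css_first returns None if missing
    tree = HTMLParser(response.text)
    return [
        {
            'name': node.css_first('.title').text(strip=True),
            'price': node.css_first('.price').text(strip=True),
        }
        for node in tree.css('.product-card')
        if node.css_first('.title') and node.css_first('.price')
    ]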

Summary

AI tools remove the barrier to writing code, but they don't replace architectural decisions. An effective scraper uses the least amount of overhead to get the job done.

  • Default to Static: Use Requests and BeautifulSoup to save time and money.
  • Use Dynamic for Interaction: Reserve Playwright for SPAs or sites requiring clicks and scrolls.
  • Inspect Before Prompting: Use the "Disable JS" trick to determine the site's architecture first.
  • Watch the Costs: Dynamic scrapers require significantly more server resources at scale.

By being intentional with prompts, you ensure AI-generated data pipelines are professionally optimized for performance.
