Scraping Dynamic Websites

TL;DR

Dynamic sites load content with JavaScript, so plain HTTP requests return a near-empty HTML shell. Use Playwright or Puppeteer to render pages in a real browser, and wait for the content to appear before extracting it. Expect browser scraping to be at least 10x slower than static scraping.

The Problem

Many modern websites load content with JavaScript. When you fetch the HTML directly, the page looks empty:

# What you get with a simple request
curl https://spa-example.com
# Returns: <div id="root"></div>

The actual content loads after JavaScript runs in the browser. You need browser automation.

How to Detect Dynamic Content

Check if a site needs browser rendering:

View page source (Ctrl+U). Compare to what you see on screen.
If source shows empty divs but page shows content, it is dynamic.
Disable JavaScript in browser. If content disappears, it is dynamic.
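
The manual check above can be roughly automated. The sketch below is a heuristic, not a definitive test: it strips scripts, styles, and tags from the raw HTML and treats the page as likely dynamic when little visible text remains. The names `visibleTextLength` and `looksDynamic` and the 200-character threshold are assumptions chosen for illustration.

```javascript
// Heuristic: a page whose raw HTML carries almost no visible text
// probably renders its content with JavaScript.
function visibleTextLength(html) {
    return html
        .replace(/<script[\s\S]*?<\/script>/gi, ' ') // drop inline scripts
        .replace(/<style[\s\S]*?<\/style>/gi, ' ')   // drop inline styles
        .replace(/<[^>]+>/g, ' ')                    // drop remaining tags
        .replace(/\s+/g, ' ')                        // collapse whitespace
        .trim().length;
}

// minTextLength is an arbitrary illustrative threshold; tune it per site.
function looksDynamic(html, minTextLength = 200) {
    return visibleTextLength(html) < minTextLength;
}

const shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>';
console.log(looksDynamic(shell)); // true: no visible text in the raw HTML
```

In practice you would pass it the body of a plain HTTP response. A page that embeds its data as JSON inside a script tag will also be flagged, which is usually what you want, since that data is not in the rendered markup either.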

Common dynamic sites:

React, Vue, Angular single-page apps
E-commerce product pages
Social media feeds
Infinite scroll content
Behind-authentication dashboards

Playwright vs Puppeteer

Feature	Playwright	Puppeteer
Browsers	Chromium, Firefox, WebKit (Safari's engine)	Chrome/Chromium primarily; Firefox support is newer
Languages	JS, Python, C#, Java	JS only
Auto-wait	Built-in	Manual
Apify Support	Full (recommended)	Full

Recommendation: Use Playwright. It offers better auto-waiting, cross-browser support, and more active development.

Basic Playwright Scraping

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Crawlee has already navigated to request.url at this point,
        // so only wait for the network to settle
        await page.waitForLoadState('networkidle');

        // Extract content
        const title = await page.title();
        const products = await page.$$eval('.product', items =>
            items.map(item => ({
                // Optional chaining avoids a crash when an element is missing
                name: item.querySelector('.name')?.textContent?.trim() ?? null,
                price: item.querySelector('.price')?.textContent?.trim() ?? null,
            }))
        );

        console.log({ url: request.url, title, products });
    },
});

await crawler.run(['https://shop.example.com']);

Waiting Strategies

Dynamic content needs time to load. Choose the right wait method:

Method	Use When	Code
Wait for selector	Known element appears	`page.waitForSelector('.product-list')`
Network idle	All API calls finish	`page.waitForLoadState('networkidle')`
Wait for response	Specific API response	`page.waitForResponse(resp => resp.url().includes('/api/'))`
Fixed timeout	Last resort (discouraged by Playwright)	`page.waitForTimeout(3000)`
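
These methods can be combined. The sketch below shows one possible pattern, not a built-in Playwright or Crawlee feature: wait for a known selector first, and fall back to network idle if the selector never appears. `waitForContent` is a hypothetical helper name; `page.waitForSelector` and `page.waitForLoadState` are the Playwright methods from the table.

```javascript
// Try the most specific wait first, then fall back to a broader one.
// `page` is a Playwright Page; `waitForContent` is an illustrative helper.
async function waitForContent(page, selector, timeout = 10000) {
    try {
        await page.waitForSelector(selector, { timeout });
        return 'selector';
    } catch (error) {
        // The selector did not appear in time: settle for network idle instead.
        await page.waitForLoadState('networkidle', { timeout });
        return 'networkidle';
    }
}
```

The return value tells the caller which wait succeeded, which is useful for logging pages whose markup may have changed.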

Handling Infinite Scroll

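async function scrollToBottom(page, maxScrolls = 50) {
    let previousHeight = 0;

    // Cap the number of scrolls so an endless feed cannot loop forever
    for (let i = 0; i < maxScrolls; i++) {
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        await page.waitForTimeout(1000); // give newly requested items time to render

        const currentHeight = await page.evaluate(() => document.body.scrollHeight);
        if (currentHeight === previousHeight) break; // no new content was added
        previousHeight = currentHeight;
    }
}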

Intercepting API Calls

Sometimes the fastest way is to capture the API response directly:

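import { PlaywrightCrawler } from 'crawlee';

// Products captured per page, keyed by the Playwright Page object
const captured = new WeakMap();

const crawler = new PlaywrightCrawler({
    // Crawlee navigates before requestHandler runs, so the listener must be
    // attached in a pre-navigation hook to see responses from the initial load
    preNavigationHooks: [
        async ({ page }) => {
            const apiData = [];
            captured.set(page, apiData);

            page.on('response', async (response) => {
                if (!response.url().includes('/api/products')) return;
                try {
                    const json = await response.json();
                    apiData.push(...(json.products ?? []));
                } catch {
                    // Skip responses whose body is not valid JSON
                }
            });
        },
    ],
    async requestHandler({ page }) {
        await page.waitForLoadState('networkidle');

        const apiData = captured.get(page) ?? [];
        console.log('Captured products:', apiData.length);
    },
});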
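
Matching on a URL substring alone can also catch responses that are not product JSON, such as error pages. As a hedged sketch, the filter could be tightened with a small predicate like the one below. `isProductApiResponse` is a hypothetical helper, and the `/api/products` path comes from the example above, not from any real site.

```javascript
// Decide whether a response is worth parsing as product JSON.
// Takes plain values so it can be checked without launching a browser.
function isProductApiResponse(url, status, contentType = '') {
    const { pathname } = new URL(url);
    return pathname.startsWith('/api/products')
        && status >= 200 && status < 300
        && contentType.toLowerCase().includes('application/json');
}

console.log(isProductApiResponse('https://shop.example.com/api/products?page=2', 200, 'application/json; charset=utf-8')); // true
console.log(isProductApiResponse('https://shop.example.com/api/products', 404, 'text/html')); // false
```

Inside a Playwright `response` handler it could be called as `isProductApiResponse(response.url(), response.status(), response.headers()['content-type'])`.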

Performance Tips

Block images: Disable image loading to speed up renders
Use headless: Always run headless in production
Reuse contexts: Session persistence reduces login overhead
Limit concurrency: Browser instances use 300MB+ RAM each

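import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 5, // each browser instance needs 300MB+ of RAM
    preNavigationHooks: [
        async ({ page }) => {
            // Abort image requests so they are never downloaded
            await page.route(/\.(png|jpe?g|gif|webp|svg)(\?.*)?$/i, (route) => route.abort());
        },
    ],
    async requestHandler({ page }) {
        // Extraction logic goes here
    },
});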
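
The 300MB figure gives a quick way to choose a concurrency limit. The sketch below is back-of-the-envelope arithmetic under that assumption. `suggestConcurrency` and the 512MB reserved for Node.js and the operating system are hypothetical, and real memory use varies by site.

```javascript
// Rough sizing: how many browser instances fit in a memory budget?
// perBrowserMb = 300 follows the estimate in this article; reservedMb is
// headroom for Node.js and the OS (an illustrative guess).
function suggestConcurrency(totalMb, perBrowserMb = 300, reservedMb = 512) {
    const usableMb = Math.max(0, totalMb - reservedMb);
    return Math.max(1, Math.floor(usableMb / perBrowserMb));
}

console.log(suggestConcurrency(4096)); // 11
console.log(suggestConcurrency(1024)); // 1
```

Treat the result as a starting point and lower it if pages are heavy or the process runs out of memory.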

Common Questions

Q: Why is browser scraping so slow?

A: A browser downloads images and CSS and executes JavaScript for every page. Each page typically takes 2-10 seconds, versus about 0.1 seconds for a plain HTTP request. Use a browser only when the content cannot be fetched any other way.
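
To make the trade-off concrete, here is a rough estimate built on the per-page figures above. `estimateMinutes` is a hypothetical helper, and the model ignores retries, rate limits, and browser startup time.

```javascript
// Estimated wall-clock minutes = pages * secondsPerPage / concurrency / 60
function estimateMinutes(pages, secondsPerPage, concurrency = 1) {
    return (pages * secondsPerPage) / concurrency / 60;
}

// 10,000 pages processed 5 at a time
console.log(estimateMinutes(10000, 5, 5).toFixed(1));   // "166.7" with a browser at 5 s per page
console.log(estimateMinutes(10000, 0.1, 5).toFixed(1)); // "3.3" with plain HTTP at 0.1 s per page
```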

Q: How do I handle CAPTCHAs?

A: Residential proxies reduce how often CAPTCHAs appear. When one does appear, pass it to a CAPTCHA-solving service or solve it manually. Some sites also expose APIs that avoid the CAPTCHA entirely.

Q: Can I run Playwright on Apify?

A: Yes. Apify provides optimized browser images. Use PlaywrightCrawler with the Apify SDK for cloud deployment.