Scraping Dynamic Websites

How to scrape JavaScript-rendered pages. Browser automation, waiting strategies, and handling SPAs.

TL;DR

Dynamic sites load content with JavaScript, so a plain HTTP request often returns nearly empty HTML. Use Playwright or Puppeteer to render pages in a real browser, and wait for the content to appear before extracting it. Expect browser scraping to be roughly 10x slower than static scraping.

The Problem

Many modern websites load content with JavaScript. When you fetch the HTML directly, the page looks empty:

# What you get with a simple request
curl https://spa-example.com
# Returns: <div id="root"></div>

The actual content loads after JavaScript runs in the browser. You need browser automation.

How to Detect Dynamic Content

Check if a site needs browser rendering:

  1. View page source (Ctrl+U). Compare to what you see on screen.
  2. If source shows empty divs but page shows content, it is dynamic.
  3. Disable JavaScript in browser. If content disappears, it is dynamic.
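
The manual check above can be roughly automated. The sketch below is a heuristic, not a definitive test: it strips scripts, styles, and tags from the raw HTML and treats a page with very little visible text as client-rendered. The function name and the 200-character threshold are assumptions chosen for illustration.

```javascript
// Heuristic: remove scripts, styles, and tags, then measure the text that is left.
// minTextLength is an arbitrary illustrative threshold, not a standard value.
function looksClientRendered(html, minTextLength = 200) {
    const visibleText = html
        .replace(/<script[\s\S]*?<\/script>/gi, ' ')
        .replace(/<style[\s\S]*?<\/style>/gi, ' ')
        .replace(/<[^>]+>/g, ' ')
        .replace(/\s+/g, ' ')
        .trim();
    return visibleText.length < minTextLength;
}

console.log(looksClientRendered('<div id="root"></div>')); // true
```

To use it, pass in the body of a plain HTTP response for the target URL. Pages that ship a large static shell (navigation, footer) around a dynamic core can fool this check, so treat the result as a hint.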

Common dynamic sites:

  • React, Vue, Angular single-page apps
  • E-commerce product pages
  • Social media feeds
  • Infinite scroll content
  • Behind-authentication dashboards

Playwright vs Puppeteer

Feature          Playwright                                  Puppeteer
Browsers         Chromium, Firefox, WebKit (Safari engine)   Chrome, Firefox
Languages        JS/TS, Python, C#, Java                     JS/TS only
Auto-wait        Built-in                                    Manual
Apify support    Full (recommended)                          Full

Recommendation: Use Playwright. Better auto-waiting, cross-browser support, and more active development.

Basic Playwright Scraping

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Crawlee has already navigated to request.url before this handler runs,
        // so only wait for the network to settle (no second page.goto needed)
        await page.waitForLoadState('networkidle');

        // Extract content; optional chaining avoids a crash when a field is missing
        const title = await page.title();
        const products = await page.$$eval('.product', items =>
            items.map(item => ({
                name: item.querySelector('.name')?.textContent?.trim() ?? null,
                price: item.querySelector('.price')?.textContent?.trim() ?? null,
            }))
        );

        console.log({ url: request.url, title, products });
    },
});

await crawler.run(['https://shop.example.com']);

Waiting Strategies

Dynamic content needs time to load. Choose the right wait method:

Method              Use when                          Code
Wait for selector   A known element appears           page.waitForSelector('.product-list')
Network idle        All API calls finish              page.waitForLoadState('networkidle')
Wait for response   A specific API response arrives   page.waitForResponse(res => res.url().includes('/api/'))
Fixed timeout       Last resort                       page.waitForTimeout(3000)
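
These methods can be combined. The sketch below tries the most specific wait first (a selector) and falls back to network idle if the selector never shows up. The helper name `waitForContent` and the 10-second default are assumptions for illustration; `page` is expected to be a Playwright page.

```javascript
// Try a selector wait first; fall back to network idle on timeout.
// Returns the name of the strategy that succeeded.
async function waitForContent(page, selector, timeout = 10000) {
    try {
        await page.waitForSelector(selector, { timeout });
        return 'selector';
    } catch (err) {
        // Only a timeout justifies the fallback; rethrow anything else
        if (err.name !== 'TimeoutError') throw err;
        await page.waitForLoadState('networkidle', { timeout });
        return 'networkidle';
    }
}
```

Inside a request handler this would be called as `await waitForContent(page, '.product-list')`. The fixed timeout is deliberately left out, in line with its "last resort" status.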

Handling Infinite Scroll

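async function scrollToBottom(page, maxScrolls = 50) {
    let previousHeight = 0;

    // Cap the number of scrolls so an endless feed cannot loop forever
    for (let i = 0; i < maxScrolls; i++) {
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        // Give newly requested content time to load
        await page.waitForTimeout(1000);

        const currentHeight = await page.evaluate(() => document.body.scrollHeight);
        // Stop once the page height no longer grows
        if (currentHeight === previousHeight) break;
        previousHeight = currentHeight;
    }
}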
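
Page height is not always a reliable signal, for example when a loading spinner changes the height without adding items. A possible variant, sketched below under that assumption, counts the elements matching a selector instead. The names `scrollUntilStable`, `maxRounds`, and `pauseMs` are illustrative, not part of any library.

```javascript
// Scroll until the number of matching items stops growing or maxRounds is reached.
// Returns the final item count.
async function scrollUntilStable(page, itemSelector, { maxRounds = 20, pauseMs = 1000 } = {}) {
    let previousCount = await page.locator(itemSelector).count();
    for (let round = 0; round < maxRounds; round++) {
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        await page.waitForTimeout(pauseMs);
        const currentCount = await page.locator(itemSelector).count();
        if (currentCount === previousCount) break; // nothing new was appended
        previousCount = currentCount;
    }
    return previousCount;
}
```

A call such as `await scrollUntilStable(page, '.product')` would then be followed by the usual extraction step.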

Intercepting API Calls

Sometimes the fastest way is to capture the API response directly:

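import { PlaywrightCrawler } from 'crawlee';

// Captured API data, keyed by page so concurrent requests do not mix
const captured = new WeakMap();

const crawler = new PlaywrightCrawler({
    // Attach the listener before navigation. By the time requestHandler runs,
    // Crawlee has already loaded the page and early responses would be missed.
    preNavigationHooks: [
        async ({ page }) => {
            const apiData = [];
            captured.set(page, apiData);

            page.on('response', async (response) => {
                if (!response.url().includes('/api/products')) return;
                try {
                    const json = await response.json();
                    apiData.push(...(json.products ?? []));
                } catch {
                    // Ignore responses without a JSON body (redirects, preflights)
                }
            });
        },
    ],
    async requestHandler({ page }) {
        await page.waitForLoadState('networkidle');

        const apiData = captured.get(page) ?? [];
        console.log('Captured products:', apiData.length);
    },
});

await crawler.run(['https://shop.example.com']);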
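
When the API call is triggered by a user action (a "load more" button, a filter change), a narrower option is to wait for that single response. The sketch below wraps Playwright's `page.waitForResponse` for this purpose. The helper name `captureJson` and the 15-second default are assumptions for illustration.

```javascript
// Run `trigger` (for example a click) and resolve with the parsed JSON body
// of the first successful response whose URL contains `urlPart`.
async function captureJson(page, urlPart, trigger, timeout = 15000) {
    // Start waiting before triggering so a fast response is not missed
    const responsePromise = page.waitForResponse(
        (response) => response.url().includes(urlPart) && response.ok(),
        { timeout },
    );
    await trigger();
    const response = await responsePromise;
    return response.json();
}
```

A hypothetical call would look like `await captureJson(page, '/api/products', () => page.click('button.load-more'))`, where the button selector is made up for the example.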

Performance Tips

  • Block images: Disable image loading to speed up renders
  • Use headless: Always run headless in production
  • Reuse contexts: Session persistence reduces login overhead
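  • Limit concurrency: Browser instances use 300MB+ RAM each

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 5,
    launchContext: {
        launchOptions: { headless: true },
    },
    preNavigationHooks: [
        // Abort image requests before the browser downloads them
        async ({ page }) => {
            await page.route('**/*', (route) =>
                route.request().resourceType() === 'image' ? route.abort() : route.continue());
        },
    ],
});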
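
The 300MB figure can be turned into a rough starting value for `maxConcurrency`. The sketch below is back-of-the-envelope arithmetic only: the function name and the 512MB reserve for Node.js and the operating system are assumptions, and real memory use varies by site.

```javascript
// Estimate a starting maxConcurrency from a memory budget.
// perBrowserMb defaults to the rough 300MB figure; reserveMb is an assumed
// allowance for Node.js and the operating system.
function estimateMaxConcurrency(totalMemoryMb, perBrowserMb = 300, reserveMb = 512) {
    const usableMb = totalMemoryMb - reserveMb;
    // Always allow at least one browser, even on very small machines
    return Math.max(1, Math.floor(usableMb / perBrowserMb));
}

console.log(estimateMaxConcurrency(4096)); // 11
```

Treat the result as an upper bound to tune downward if pages are heavy or the process starts swapping.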

Common Questions

Q: Why is browser scraping so slow?

A: A browser downloads and executes everything the page needs: images, CSS, and JavaScript. Each page typically takes 2-10 seconds, versus roughly 0.1 seconds for a plain HTTP request. Use a browser only when necessary.

Q: How do I handle CAPTCHAs?

A: Residential proxies reduce how often CAPTCHAs are triggered. When one does appear, a CAPTCHA-solving service can handle it. Some sites also expose APIs that avoid the CAPTCHA entirely.

Q: Can I run Playwright on Apify?

A: Yes. Apify provides optimized browser images. Use PlaywrightCrawler with the Apify SDK for cloud deployment.