TL;DR
Dynamic sites load content with JavaScript, so plain HTTP requests return nearly empty HTML. Use Playwright or Puppeteer to render pages in a real browser, and wait for content to appear before extracting. Expect roughly 10x slower throughput than static scraping.
The Problem
Many modern websites load content with JavaScript. When you fetch the HTML directly, the page looks empty:
# What you get with a simple request
curl https://spa-example.com
# Returns: <div id="root"></div>
The actual content loads after JavaScript runs in the browser. You need browser automation.
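For contrast, here is a minimal sketch of the same fetch through a real browser, which returns the rendered DOM instead of an empty shell (assumes the playwright package is installed; the URL is the placeholder from above):

import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://spa-example.com');
console.log(await page.content()); // the rendered DOM, not just <div id="root">
await browser.close();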
How to Detect Dynamic Content
Check if a site needs browser rendering (a programmatic version of this check follows the list):
- View page source (Ctrl+U). Compare to what you see on screen.
- If source shows empty divs but page shows content, it is dynamic.
- Disable JavaScript in browser. If content disappears, it is dynamic.
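If you'd rather automate the check, one rough heuristic is to compare the raw HTML size against the rendered DOM. This is a sketch, not a robust test: it assumes Node 18+ for the global fetch, and the 2x threshold is arbitrary.

import { chromium } from 'playwright';

async function looksDynamic(url) {
    // Raw HTML as a plain HTTP client sees it
    const rawHtml = await (await fetch(url)).text();

    // The same page after JavaScript has run
    const browser = await chromium.launch();
    const page = await browser.newPage();
    await page.goto(url);
    const renderedHtml = await page.content();
    await browser.close();

    // If rendering adds a lot of markup, the page is probably dynamic
    return renderedHtml.length > rawHtml.length * 2;
}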
Common dynamic sites:
- React, Vue, Angular single-page apps
- E-commerce product pages
- Social media feeds
- Infinite scroll content
- Behind-authentication dashboards
Playwright vs Puppeteer
| Feature | Playwright | Puppeteer |
|---|---|---|
| Browsers | Chromium, Firefox, WebKit | Chromium; experimental Firefox |
| Languages | JS, Python, C#, Java | JS only |
| Auto-wait | Built-in | Manual |
| Apify Support | Full (recommended) | Full |
Recommendation: Use Playwright. Better auto-waiting, cross-browser support, and more active development.
Basic Playwright Scraping
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Crawlee has already navigated to request.url before this
        // handler runs; just wait for the network to settle
        await page.waitForLoadState('networkidle');

        // Extract content
        const title = await page.title();
        const products = await page.$$eval('.product', (items) =>
            items.map((item) => ({
                name: item.querySelector('.name')?.textContent?.trim(),
                price: item.querySelector('.price')?.textContent?.trim(),
            }))
        );

        console.log(request.url, { title, products });
    },
});

await crawler.run(['https://shop.example.com']);
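In a real crawler you would usually store results with Crawlee's Dataset (await Dataset.pushData({ title, products })) instead of logging them.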
Waiting Strategies
Dynamic content needs time to load. Choose the right wait method:
| Method | Use When | Code |
|---|---|---|
| Wait for selector | Known element appears | `page.waitForSelector('.product-list')` |
| Network idle | All API calls finish | `page.waitForLoadState('networkidle')` |
| Wait for response | Specific API response | `page.waitForResponse((res) => res.url().includes('/api/'))` |
| Fixed timeout | Last resort | `page.waitForTimeout(3000)` |
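For example, inside a request handler you might wait for a known element before extracting (.product-list and .product are placeholder selectors):

// Block until the product list has rendered
await page.waitForSelector('.product-list');
const count = await page.$$eval('.product', (items) => items.length);
console.log(`Rendered ${count} products`);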
Handling Infinite Scroll
Scroll repeatedly until the page height stops growing:

async function scrollToBottom(page) {
    let previousHeight = 0;
    while (true) {
        // Jump to the bottom to trigger the next batch of content
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        // Give lazy-loaded content a moment to arrive
        await page.waitForTimeout(1000);
        const currentHeight = await page.evaluate(() => document.body.scrollHeight);
        // Stop once scrolling no longer adds height
        if (currentHeight === previousHeight) break;
        previousHeight = currentHeight;
    }
}
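Call it from the request handler before extracting ('.item' is a placeholder selector):

await scrollToBottom(page);
const itemCount = await page.$$eval('.item', (els) => els.length);
console.log(`Loaded ${itemCount} items after scrolling`);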
Intercepting API Calls
Sometimes the fastest way is to capture the site's own API responses. PlaywrightCrawler navigates before requestHandler runs, so attach the listener in a pre-navigation hook, or the initial responses will be missed:

// Shared across all requests; fine for a single-URL example
const productData = [];

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Register the listener before Crawlee navigates
            page.on('response', async (response) => {
                if (response.url().includes('/api/products')) {
                    const json = await response.json();
                    productData.push(...json.products);
                }
            });
        },
    ],
    async requestHandler({ page }) {
        // Wait until the background API calls have finished
        await page.waitForLoadState('networkidle');
        console.log('Captured products:', productData.length);
    },
});
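If DevTools shows the endpoint returns plain JSON, you can often skip the browser entirely once you know the URL. A sketch, assuming the /api/products endpoint from above and no auth requirements (many sites expect matching cookies or headers, so this will not always work):

const response = await fetch('https://shop.example.com/api/products');
const { products } = await response.json();
console.log('Fetched directly:', products.length);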
Performance Tips
- Block images: Disable image loading to speed up renders
- Use headless: Always run headless in production
- Reuse contexts: Session persistence reduces login overhead
- Limit concurrency: Browser instances use 300MB+ RAM each
const crawler = new PlaywrightCrawler({
    maxConcurrency: 5,
    preNavigationHooks: [
        async ({ page }) => {
            // Abort image requests before navigation to speed up rendering
            await page.route('**/*.{png,jpg,jpeg,gif,webp,svg}', (route) => route.abort());
        },
    ],
});
Common Questions
Q: Why is browser scraping so slow?
A: Browsers download and execute images, CSS, and JavaScript. Each page takes 2-10 seconds versus about 0.1 seconds for a plain HTTP request. Only use a browser when the content actually requires one.
Q: How do I handle CAPTCHAs?
A: Residential proxies reduce how often CAPTCHAs appear; CAPTCHA-solving services can handle the rest. Some sites also expose APIs that skip the CAPTCHA entirely.
Q: Can I run Playwright on Apify?
A: Yes. Apify provides optimized browser images. Use PlaywrightCrawler with the Apify SDK for cloud deployment.