TL;DR
Crawlee is a JavaScript/TypeScript library for building production web scrapers. Install it with npm. It ships with built-in proxy rotation, request queuing, and browser automation, and powers many actors in the Apify Store.
What is Crawlee?
Crawlee is Apify's open-source scraping framework for JavaScript and TypeScript. It handles proxies, retries, concurrency, and storage. You write the extraction logic.
Many actors in the Apify Store are built on Crawlee, so learning it is a direct path into actor development.
Installation
Requires Node.js 18 or higher:
npm install crawlee playwright
Install browser binaries:
npx playwright install
Your First Scraper
A simple website crawler in JavaScript:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, enqueueLinks }) {
const title = $('title').text();
console.log(`Title: ${title}`);
console.log(`URL: ${request.url}`);
// Follow links on the page
await enqueueLinks();
},
});
await crawler.run(['https://example.com']);
Run it:
node scraper.js
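By default, enqueueLinks() follows every same-domain link it finds on the page. You can narrow the crawl with its globs option; a sketch, where the glob pattern and URL are illustrative placeholders:

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        console.log(`Visiting: ${request.url}`);
        // Only enqueue links matching this pattern
        // (hypothetical path -- adjust to your target site)
        await enqueueLinks({
            globs: ['https://example.com/products/**'],
        });
    },
    // Safety cap so a test run cannot crawl forever
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://example.com']);
```

maxRequestsPerCrawl is useful during development: the crawler stops cleanly once the cap is reached, regardless of how many links remain in the queue.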
Crawler Types
| Crawler | Best For | Speed |
|---|---|---|
| HttpCrawler | API endpoints, JSON responses | Fastest |
| CheerioCrawler | Static HTML pages | Fast |
| PlaywrightCrawler | JavaScript-rendered pages | Slow |
| PuppeteerCrawler | Chrome automation | Slow |
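When the data you need is served by a JSON endpoint, HttpCrawler skips HTML parsing entirely. A minimal sketch; the endpoint URL is a placeholder:

```javascript
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ request, body }) {
        // body holds the raw response; parse it as JSON
        const data = JSON.parse(body.toString());
        console.log(`Fetched ${request.url}:`, data);
    },
});

// Placeholder endpoint -- replace with a real API URL
await crawler.run(['https://api.example.com/products?page=1']);
```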
Scraping Dynamic Pages
Use PlaywrightCrawler for JavaScript-heavy sites:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ page, request }) {
// Wait for content to load
await page.waitForSelector('.product-list');
// Extract data
const products = await page.$$eval('.product', (items) =>
items.map((item) => ({
name: item.querySelector('.name')?.textContent,
price: item.querySelector('.price')?.textContent,
}))
);
console.log(products);
},
});
await crawler.run(['https://shop.example.com']);
Saving Data
Use the built-in Dataset:
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $ }) {
const title = $('title').text();
await Dataset.pushData({
url: request.url,
title,
scrapedAt: new Date().toISOString(),
});
},
});
Locally, pushed items are written to ./storage/datasets as JSON files; on the Apify platform, datasets can be exported as JSON, CSV, Excel, and other formats.
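To produce a single export file from a local run, a sketch using the Dataset export helpers (the 'OUTPUT' key name is arbitrary):

```javascript
import { Dataset } from 'crawlee';

// Open the default dataset and export everything it holds.
// Exports are written into the default key-value store.
const dataset = await Dataset.open();
await dataset.exportToJSON('OUTPUT'); // OUTPUT.json
await dataset.exportToCSV('OUTPUT');  // same data as OUTPUT.csv
```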
Proxy Configuration
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
proxyUrls: [
'http://user:pass@proxy1.example.com:8000',
'http://user:pass@proxy2.example.com:8000',
],
});
const crawler = new CheerioCrawler({
proxyConfiguration,
async requestHandler({ request, $ }) {
// Your scraping logic
},
});
TypeScript Support
Crawlee is written in TypeScript. Full type definitions included:
import { CheerioCrawler, Dataset } from 'crawlee';
interface ProductData {
name: string;
price: number;
url: string;
}
const crawler = new CheerioCrawler({
async requestHandler({ request, $ }) {
const data: ProductData = {
name: $('.product-name').text(),
// Strip non-numeric characters first; parseFloat('$19.99') returns NaN
price: parseFloat($('.price').text().replace(/[^0-9.]/g, '')),
url: request.url,
};
await Dataset.pushData(data);
},
});
Creating an Apify Actor
# Install CLI
npm install -g apify-cli
# Create from template
apify create my-actor --template=ts-crawlee-cheerio
# Test locally
cd my-actor
apify run
# Deploy to Apify
apify push
Common Questions
Q: Crawlee vs Puppeteer directly?
A: Crawlee wraps Puppeteer/Playwright. It adds request queuing, proxy rotation, and error handling. Use raw Puppeteer only for simple scripts.
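The error handling Crawlee adds is configurable per crawler. A sketch of the retry-related options (values are illustrative):

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Retry each failing request up to 3 times before giving up
    maxRequestRetries: 3,
    async requestHandler({ request, $ }) {
        console.log(`OK: ${request.url}`);
    },
    // Called once a request has exhausted all its retries
    async failedRequestHandler({ request }) {
        console.error(`Gave up on ${request.url}`);
    },
});
```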
Q: CheerioCrawler vs PlaywrightCrawler?
A: Start with Cheerio. It is typically an order of magnitude faster because it never launches a browser. Switch to Playwright only if the content needs JavaScript to render.
Q: Can I run Crawlee without Apify?
A: Yes. Crawlee is standalone. Data saves locally. Only deploy to Apify if you want cloud execution.
Q: How do I handle rate limits?
A: Use the maxRequestsPerMinute option. Crawlee throttles automatically, delaying excess requests instead of dropping them.
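A minimal sketch combining the throttling options (the values are illustrative, not recommendations):

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Hard ceiling on request rate across the whole crawler
    maxRequestsPerMinute: 60,
    // Also bound parallelism; Crawlee autoscales within this range
    minConcurrency: 1,
    maxConcurrency: 10,
    async requestHandler({ request, $ }) {
        console.log(request.url);
    },
});
```

Rate and concurrency limits compose: the crawler never exceeds either bound, whichever is hit first.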