Crawlee for JavaScript

Build web scrapers in JavaScript or TypeScript using Crawlee, the most popular framework for Apify development.

TL;DR

Crawlee is the JavaScript/TypeScript library for building production web scrapers. Install with npm. Built-in proxy rotation, request queuing, and browser automation. Powers most Apify actors.

What is Crawlee?

Crawlee is Apify's open-source scraping framework for JavaScript and TypeScript. It handles proxies, retries, concurrency, and storage. You write the extraction logic.

Most actors in the Apify Store use Crawlee. Learning it unlocks actor development.

Installation

Requires Node.js 18 or higher:

npm install crawlee playwright

Install browser binaries:

npx playwright install

Your First Scraper

Simple website crawler in JavaScript:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        const title = $('title').text();
        console.log(`Title: ${title}`);
        console.log(`URL: ${request.url}`);

        // Follow links on the page
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);

Run it:

node scraper.js
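
The example uses top-level import and await, so it must run as an ES module. A minimal package.json that makes Node treat .js files as ESM (the package name is just a placeholder):

{
    "name": "my-scraper",
    "version": "1.0.0",
    "type": "module"
}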

Crawler Types

Crawler           | Best for                      | Speed
------------------|-------------------------------|--------
HttpCrawler       | API endpoints, JSON responses | Fastest
CheerioCrawler    | Static HTML pages             | Fast
PlaywrightCrawler | JavaScript-rendered pages     | Slow
PuppeteerCrawler  | Chrome automation             | Slow
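
HttpCrawler is not shown elsewhere on this page, so here is a minimal sketch for a JSON API (the endpoint URL is a placeholder):

import { HttpCrawler, Dataset } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ request, body }) {
        // body holds the raw HTTP response; parse it for JSON endpoints
        const data = JSON.parse(body.toString());

        await Dataset.pushData({
            url: request.url,
            itemCount: Array.isArray(data) ? data.length : 1,
        });
    },
});

await crawler.run(['https://api.example.com/items']);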

Scraping Dynamic Pages

Use PlaywrightCrawler for JavaScript-heavy sites:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Wait for content to load
        await page.waitForSelector('.product-list');

        // Extract data
        const products = await page.$$eval('.product', (items) =>
            items.map((item) => ({
                name: item.querySelector('.name')?.textContent,
                price: item.querySelector('.price')?.textContent,
            }))
        );

        console.log(products);
    },
});

await crawler.run(['https://shop.example.com']);

Saving Data

Use the built-in Dataset:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        const title = $('title').text();

        await Dataset.pushData({
            url: request.url,
            title,
            scrapedAt: new Date().toISOString(),
        });
    },
});

Scraped data can be exported as JSON, CSV, or Excel.
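
Crawlee also exposes export helpers on the Dataset class. A short sketch continuing the example above (the 'results' record key is arbitrary):

// Run the crawl, then write the default dataset to the key-value store
await crawler.run(['https://example.com']);

await Dataset.exportToJSON('results');
await Dataset.exportToCSV('results');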

Proxy Configuration

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://user:pass@proxy1.example.com:8000',
        'http://user:pass@proxy2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, $ }) {
        // Your scraping logic
    },
});
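
Crawlee rotates through the listed proxies automatically. If you want to verify which proxy handled a request, the handler context exposes proxyInfo; a small sketch reusing the proxyConfiguration and imports from above:

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, $, proxyInfo }) {
        // proxyInfo describes the proxy chosen for this request
        console.log(`${request.url} fetched via ${proxyInfo?.url}`);
    },
});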

TypeScript Support

Crawlee is written in TypeScript, so full type definitions ship with the package:

import { CheerioCrawler, Dataset } from 'crawlee';

interface ProductData {
    name: string;
    price: number;
    url: string;
}

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        const data: ProductData = {
            name: $('.product-name').text(),
            price: parseFloat($('.price').text()),
            url: request.url,
        };

        await Dataset.pushData(data);
    },
});

Creating an Apify Actor

# Install CLI
npm install -g apify-cli

# Create from template
apify create my-actor --template=ts-crawlee-cheerio

# Test locally
cd my-actor
apify run

# Deploy to Apify
apify push
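
Inside the generated actor, the Crawlee code is wrapped in the Apify SDK lifecycle. Roughly, the main file looks like this sketch (the input shape with startUrls is an assumption; templates vary):

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Read actor input; fall back to a default start URL (input shape is an assumption)
const input = (await Actor.getInput()) ?? {};
const startUrls = input.startUrls ?? ['https://example.com'];

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // On the platform this writes to the actor's default dataset
        await Actor.pushData({ url: request.url, title: $('title').text() });
    },
});

await crawler.run(startUrls);

await Actor.exit();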

Common Questions

Q: Crawlee vs Puppeteer directly?

A: Crawlee wraps Puppeteer/Playwright. It adds request queuing, proxy rotation, and error handling. Use raw Puppeteer only for simple scripts.

Q: CheerioCrawler vs PlaywrightCrawler?

A: Start with CheerioCrawler. It is roughly ten times faster because it never launches a browser. Switch to PlaywrightCrawler only if the content needs JavaScript to render.

Q: Can I run Crawlee without Apify?

A: Yes. Crawlee works as a standalone library, and data is saved locally under the ./storage directory. Deploy to Apify only if you want cloud execution.

Q: How do I handle rate limits?

A: Use the maxRequestsPerMinute option. Crawlee respects the limit automatically and keeps excess requests waiting in the queue.
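
A minimal sketch of the throttling options (the numbers are arbitrary):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestsPerMinute: 60, // cap at roughly one request per second
    maxConcurrency: 10,       // never run more than 10 requests in parallel
    async requestHandler({ request, $ }) {
        console.log(`Fetched ${request.url}`);
    },
});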