TL;DR
Crawlee is a JavaScript/TypeScript library for building production web scrapers. Install it with npm. It ships with built-in proxy rotation, request queuing, and browser automation, and powers many actors in the Apify Store.
What is Crawlee?
Crawlee is Apify's open-source scraping framework for JavaScript and TypeScript. It handles proxies, retries, concurrency, and storage. You write the extraction logic.
Many actors in the Apify Store are built on Crawlee, so learning it is a direct path into actor development.
Installation
Requires Node.js 18 or higher:
npm install crawlee playwright
Install browser binaries:
npx playwright install
Your First Scraper
A simple website crawler in JavaScript:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, enqueueLinks }) {
const title = $('title').text();
console.log(`Title: ${title}`);
console.log(`URL: ${request.url}`);
// Follow links on the page
await enqueueLinks();
},
});
await crawler.run(['https://example.com']);
Run it:
node scraper.js
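By default, enqueueLinks() follows every same-domain link it finds on the page. You can narrow the crawl with its globs option; a sketch, where the glob pattern and URL are illustrative placeholders:

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        console.log(`Visiting: ${request.url}`);
        // Only enqueue links matching this pattern
        // (hypothetical path -- adjust to your target site)
        await enqueueLinks({
            globs: ['https://example.com/products/**'],
        });
    },
    // Safety cap so a test run cannot crawl forever
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://example.com']);
```

maxRequestsPerCrawl is useful during development: the crawler stops cleanly once the cap is reached, regardless of how many links remain in the queue.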
Crawler Types
| Crawler | Best For | Speed |
|---|---|---|
| HttpCrawler | API endpoints, JSON responses | Fastest |
| CheerioCrawler | Static HTML pages | Fast |
| PlaywrightCrawler | JavaScript-rendered pages | Slow |
| PuppeteerCrawler | Chrome automation | Slow |
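When the data you need is served by a JSON endpoint, HttpCrawler skips HTML parsing entirely. A minimal sketch; the endpoint URL is a placeholder:

```javascript
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ request, body }) {
        // body holds the raw response; parse it as JSON
        const data = JSON.parse(body.toString());
        console.log(`Fetched ${request.url}:`, data);
    },
});

// Placeholder endpoint -- replace with a real API URL
await crawler.run(['https://api.example.com/products?page=1']);
```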
Scraping Dynamic Pages
Use PlaywrightCrawler for JavaScript-heavy sites:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ page, request }) {
// Wait for content to load
await page.waitForSelector('.product-list');
// Extract data
const products = await page.$$eval('.product', (items) =>
items.map((item) => ({
name: item.querySelector('.name')?.textContent,
price: item.querySelector('.price')?.textContent,
}))
);
console.log(products);
},
});
await crawler.run(['https://shop.example.com']);
Saving Data
Use the built-in Dataset:
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $ }) {
const title = $('title').text();
await Dataset.pushData({
url: request.url,
title,
scrapedAt: new Date().toISOString(),
});
},
});
Locally, pushed items are written to ./storage/datasets as JSON files; on the Apify platform, datasets can be exported as JSON, CSV, Excel, and other formats.
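To produce a single export file from a local run, a sketch using the Dataset export helpers (the 'OUTPUT' key name is arbitrary):

```javascript
import { Dataset } from 'crawlee';

// Open the default dataset and export everything it holds.
// Exports are written into the default key-value store.
const dataset = await Dataset.open();
await dataset.exportToJSON('OUTPUT'); // OUTPUT.json
await dataset.exportToCSV('OUTPUT');  // same data as OUTPUT.csv
```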
Proxy Configuration
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
proxyUrls: [
'http://user:pass@proxy1.example.com:8000',
'http://user:pass@proxy2.example.com:8000',
],
});
const crawler = new CheerioCrawler({
proxyConfiguration,
async requestHandler({ request, $ }) {
// Your scraping logic
},
});
TypeScript Support
Crawlee is written in TypeScript. Full type definitions included:
import { CheerioCrawler, Dataset } from 'crawlee';
interface ProductData {
name: string;
price: number;
url: string;
}
const crawler = new CheerioCrawler({
async requestHandler({ request, $ }) {
const data: ProductData = {
name: $('.product-name').text(),
// Strip non-numeric characters first; parseFloat('$19.99') returns NaN
price: parseFloat($('.price').text().replace(/[^0-9.]/g, '')),
url: request.url,
};
await Dataset.pushData(data);
},
});
Creating an Apify Actor
# Install CLI
npm install -g apify-cli
# Create from template
apify create my-actor --template=ts-crawlee-cheerio
# Test locally
cd my-actor
apify run
# Deploy to Apify
apify push
Common Questions
Q: Crawlee vs Puppeteer directly?
A: Crawlee wraps Puppeteer/Playwright. It adds request queuing, proxy rotation, and error handling. Use raw Puppeteer only for simple scripts.
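The error handling Crawlee adds is configurable per crawler. A sketch of the retry-related options (values are illustrative):

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Retry each failing request up to 3 times before giving up
    maxRequestRetries: 3,
    async requestHandler({ request, $ }) {
        console.log(`OK: ${request.url}`);
    },
    // Called once a request has exhausted all its retries
    async failedRequestHandler({ request }) {
        console.error(`Gave up on ${request.url}`);
    },
});
```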
Q: CheerioCrawler vs PlaywrightCrawler?
A: Start with Cheerio. It is typically an order of magnitude faster because it never launches a browser. Switch to Playwright only if the content needs JavaScript to render.
Q: Can I run Crawlee without Apify?
A: Yes. Crawlee is standalone. Data saves locally. Only deploy to Apify if you want cloud execution.
Q: How do I handle rate limits?
A: Use the maxRequestsPerMinute option. Crawlee throttles automatically, delaying excess requests instead of dropping them.
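A minimal sketch combining the throttling options (the values are illustrative, not recommendations):

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Hard ceiling on request rate across the whole crawler
    maxRequestsPerMinute: 60,
    // Also bound parallelism; Crawlee autoscales within this range
    minConcurrency: 1,
    maxConcurrency: 10,
    async requestHandler({ request, $ }) {
        console.log(request.url);
    },
});
```

Rate and concurrency limits compose: the crawler never exceeds either bound, whichever is hit first.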