## TL;DR

Crawlee is Apify's open-source scraping library for Python. Install with pip. Write scrapers that handle proxies, retries, and browser automation. Deploy to Apify or run locally.
## What is Crawlee?

Crawlee is a Python library for building web scrapers. It handles the hard parts: proxy rotation, request queuing, browser automation, and error handling. You focus on extracting data.

Crawlee is open source. Use it locally for free, or deploy to Apify for cloud execution.
## Installation

Crawlee requires Python 3.9 or higher. Install it with pip:

```bash
pip install crawlee
```

For browser automation with Playwright, install the extra and the browser binaries:

```bash
pip install "crawlee[playwright]"
playwright install
```
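To confirm the package is installed, ask pip for its metadata:

```bash
pip show crawlee
```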
## Your First Scraper

This example scrapes the title from each page and follows links from a starting URL:

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext):
        title = context.soup.title.string if context.soup.title else 'No title'
        print(f'Title: {title}')

        # Find and follow links on the page
        await context.enqueue_links()

    await crawler.run(['https://example.com'])


asyncio.run(main())
```
Save it as `scraper.py` and run it:

```bash
python scraper.py
```
## Crawlee Crawler Types

Pick a crawler class based on how the target site renders its content:
| Crawler | Use Case | Speed |
|---|---|---|
| HttpCrawler | Simple HTTP requests, API calls | Fastest |
| BeautifulSoupCrawler | Static HTML pages | Fast |
| ParselCrawler | Static HTML with XPath support | Fast |
| PlaywrightCrawler | JavaScript-rendered pages | Slow |
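For static pages where you prefer XPath over CSS selectors, ParselCrawler exposes a parsel `Selector`. A minimal sketch, assuming the same module layout as the BeautifulSoup example above:

```python
import asyncio

from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext


async def main():
    crawler = ParselCrawler()

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext):
        # XPath query against the parsed document
        title = context.selector.xpath('//title/text()').get()
        print(f'Title: {title}')

    await crawler.run(['https://example.com'])


asyncio.run(main())
```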
## Scraping Dynamic Pages

Use PlaywrightCrawler for pages that render their content with JavaScript:

```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext):
        # Wait for the JavaScript-rendered content to load
        await context.page.wait_for_selector('.product-list')

        # Extract data from the rendered DOM
        products = await context.page.query_selector_all('.product')
        for product in products:
            name = await product.text_content()
            print(name)

    await crawler.run(['https://shop.example.com'])


asyncio.run(main())
```
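While developing, it helps to watch the browser work. PlaywrightCrawler takes browser options at construction time; a sketch, assuming the `browser_type` and `headless` constructor parameters from the Crawlee docs:

```python
# Debugging variant: a visible Firefox window instead of headless Chromium.
crawler = PlaywrightCrawler(
    browser_type='firefox',  # 'chromium' (default), 'firefox', or 'webkit'
    headless=False,          # show the browser window while developing
)
```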
## Saving Data

Crawlee includes a Dataset for storing results. Call `context.push_data()` from any handler:

```python
@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext):
    title = context.soup.title.string if context.soup.title else 'No title'
    await context.push_data({
        'url': context.request.url,
        'title': title,
    })
```

Data is saved to the `storage/datasets/default` folder as JSON.
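To collect everything into a single file after the crawl, recent Crawlee versions provide an `export_data` helper on the crawler (the filename here is just an example):

```python
# After the crawl finishes, export the default dataset to one file.
await crawler.run(['https://example.com'])
await crawler.export_data('results.json')
```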
## Proxy Configuration

Rotate proxies to reduce the chance of getting blocked:

```python
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.proxy_configuration import ProxyConfiguration

proxy_config = ProxyConfiguration(
    proxy_urls=[
        'http://user:pass@proxy1.example.com:8000',
        'http://user:pass@proxy2.example.com:8000',
    ]
)

crawler = BeautifulSoupCrawler(proxy_configuration=proxy_config)
```
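If you mix cheap and expensive proxies, Crawlee also supports tiered proxies: it starts at the lowest tier and escalates only for domains that keep blocking. A sketch, assuming the `tiered_proxy_urls` option described in the Crawlee proxy docs (the URLs are placeholders):

```python
from crawlee.proxy_configuration import ProxyConfiguration

# Tier 0 is tried first; Crawlee moves a domain to a higher tier
# only when the lower tier keeps getting blocked.
proxy_config = ProxyConfiguration(
    tiered_proxy_urls=[
        ['http://datacenter.example.com:8000'],   # cheap tier
        ['http://residential.example.com:8000'],  # expensive tier
    ]
)
```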
## Deploying to Apify

Package your scraper as an Apify Actor:

```bash
# Install the Apify CLI (distributed via npm, requires Node.js)
npm install -g apify-cli

# Create an Actor from a template
apify create my-scraper --template=python-crawlee

# Deploy to Apify
cd my-scraper
apify push
```

Your scraper now runs in the cloud on Apify's infrastructure.
## Common Questions
Q: Is Crawlee free?
A: Yes. Crawlee is open source (Apache 2.0). Run it locally for free. Cloud deployment on Apify has costs.
Q: Crawlee vs Scrapy?
A: Scrapy is more mature and has a larger ecosystem. Crawlee has tighter Playwright integration and one-command Apify deployment. Choose Scrapy for complex, established projects; choose Crawlee when you need browser automation or cloud deployment.
Q: Can I use my existing BeautifulSoup code?
A: Yes. BeautifulSoupCrawler gives you a standard BeautifulSoup object. Your parsing code works unchanged.
Q: How do I handle authentication?
A: Use session management in PlaywrightCrawler: log in once, then scrape while the browser context keeps the cookies. See the sketch below.
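A minimal sketch of that pattern. The login URL, form selectors, and credentials are placeholders, and it relies on PlaywrightCrawler's default configuration where pages share a browser context, so cookies set during login carry over to later requests:

```python
import asyncio

from crawlee import Request
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.handler('LOGIN')
    async def login_handler(context: PlaywrightCrawlingContext):
        # Fill the (placeholder) login form, then queue pages that
        # require an authenticated session.
        await context.page.fill('#username', 'user@example.com')
        await context.page.fill('#password', 'secret')
        await context.page.click('button[type=submit]')
        await context.page.wait_for_load_state('networkidle')
        await context.add_requests(['https://example.com/account'])

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext):
        # Runs after login; the shared browser context keeps the cookies.
        print(await context.page.title())

    # Route the first request to the LOGIN handler via its label.
    await crawler.run([Request.from_url('https://example.com/login', label='LOGIN')])


asyncio.run(main())
```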