## TL;DR

Crawlee is Apify's open-source scraping library for Python. Install with pip. Write scrapers that handle proxies, retries, and browser automation. Deploy to Apify or run locally.
## What is Crawlee?

Crawlee is a Python library for building web scrapers. It handles the hard parts: proxy rotation, request queuing, browser automation, and error handling. You focus on extracting data.

Crawlee is open source. Use it locally for free, or deploy to Apify for cloud execution.
## Installation

Crawlee requires Python 3.9 or higher. Install it with pip:

```bash
pip install crawlee
```

For browser automation with Playwright, install the extra and the browser binaries:

```bash
pip install "crawlee[playwright]"
playwright install
```
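To confirm the package is installed, ask pip for its metadata:

```bash
pip show crawlee
```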
## Your First Scraper

This example scrapes the title from each page and follows links from a starting URL:

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext):
        title = context.soup.title.string if context.soup.title else 'No title'
        print(f'Title: {title}')

        # Find and follow links on the page
        await context.enqueue_links()

    await crawler.run(['https://example.com'])


asyncio.run(main())
```
Save it as `scraper.py` and run it:

```bash
python scraper.py
```
## Crawlee Crawler Types

Pick a crawler class based on how the target site renders its content:
| Crawler | Use Case | Speed |
|---|---|---|
| HttpCrawler | Simple HTTP requests, API calls | Fastest |
| BeautifulSoupCrawler | Static HTML pages | Fast |
| ParselCrawler | Static HTML with XPath support | Fast |
| PlaywrightCrawler | JavaScript-rendered pages | Slow |
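For static pages where you prefer XPath over CSS selectors, ParselCrawler exposes a parsel `Selector`. A minimal sketch, assuming the same module layout as the BeautifulSoup example above:

```python
import asyncio

from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext


async def main():
    crawler = ParselCrawler()

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext):
        # XPath query against the parsed document
        title = context.selector.xpath('//title/text()').get()
        print(f'Title: {title}')

    await crawler.run(['https://example.com'])


asyncio.run(main())
```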
## Scraping Dynamic Pages

Use PlaywrightCrawler for pages that render their content with JavaScript:

```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext):
        # Wait for the JavaScript-rendered content to load
        await context.page.wait_for_selector('.product-list')

        # Extract data from the rendered DOM
        products = await context.page.query_selector_all('.product')
        for product in products:
            name = await product.text_content()
            print(name)

    await crawler.run(['https://shop.example.com'])


asyncio.run(main())
```
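While developing, it helps to watch the browser work. PlaywrightCrawler takes browser options at construction time; a sketch, assuming the `browser_type` and `headless` constructor parameters from the Crawlee docs:

```python
# Debugging variant: a visible Firefox window instead of headless Chromium.
crawler = PlaywrightCrawler(
    browser_type='firefox',  # 'chromium' (default), 'firefox', or 'webkit'
    headless=False,          # show the browser window while developing
)
```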
## Saving Data

Crawlee includes a Dataset for storing results. Call `context.push_data()` from any handler:

```python
@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext):
    title = context.soup.title.string if context.soup.title else 'No title'
    await context.push_data({
        'url': context.request.url,
        'title': title,
    })
```

Data is saved to the `storage/datasets/default` folder as JSON.
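To collect everything into a single file after the crawl, recent Crawlee versions provide an `export_data` helper on the crawler (the filename here is just an example):

```python
# After the crawl finishes, export the default dataset to one file.
await crawler.run(['https://example.com'])
await crawler.export_data('results.json')
```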
## Proxy Configuration

Rotate proxies to reduce the chance of getting blocked:

```python
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.proxy_configuration import ProxyConfiguration

proxy_config = ProxyConfiguration(
    proxy_urls=[
        'http://user:pass@proxy1.example.com:8000',
        'http://user:pass@proxy2.example.com:8000',
    ]
)

crawler = BeautifulSoupCrawler(proxy_configuration=proxy_config)
```
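If you mix cheap and expensive proxies, Crawlee also supports tiered proxies: it starts at the lowest tier and escalates only for domains that keep blocking. A sketch, assuming the `tiered_proxy_urls` option described in the Crawlee proxy docs (the URLs are placeholders):

```python
from crawlee.proxy_configuration import ProxyConfiguration

# Tier 0 is tried first; Crawlee moves a domain to a higher tier
# only when the lower tier keeps getting blocked.
proxy_config = ProxyConfiguration(
    tiered_proxy_urls=[
        ['http://datacenter.example.com:8000'],   # cheap tier
        ['http://residential.example.com:8000'],  # expensive tier
    ]
)
```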
## Deploying to Apify

Package your scraper as an Apify Actor:

```bash
# Install the Apify CLI (distributed via npm, requires Node.js)
npm install -g apify-cli

# Create an Actor from a template
apify create my-scraper --template=python-crawlee

# Deploy to Apify
cd my-scraper
apify push
```

Your scraper now runs in the cloud on Apify's infrastructure.
## Common Questions
Q: Is Crawlee free?
A: Yes. Crawlee is open source (Apache 2.0). Run it locally for free. Cloud deployment on Apify has costs.
Q: Crawlee vs Scrapy?
A: Scrapy is more mature and has a larger ecosystem. Crawlee has tighter Playwright integration and one-command Apify deployment. Choose Scrapy for complex, established projects; choose Crawlee when you need browser automation or cloud deployment.
Q: Can I use my existing BeautifulSoup code?
A: Yes. BeautifulSoupCrawler gives you a standard BeautifulSoup object. Your parsing code works unchanged.
Q: How do I handle authentication?
A: Use session management in PlaywrightCrawler: log in once, then scrape while the browser context keeps the cookies. See the sketch below.
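A minimal sketch of that pattern. The login URL, form selectors, and credentials are placeholders, and it relies on PlaywrightCrawler's default configuration where pages share a browser context, so cookies set during login carry over to later requests:

```python
import asyncio

from crawlee import Request
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.handler('LOGIN')
    async def login_handler(context: PlaywrightCrawlingContext):
        # Fill the (placeholder) login form, then queue pages that
        # require an authenticated session.
        await context.page.fill('#username', 'user@example.com')
        await context.page.fill('#password', 'secret')
        await context.page.click('button[type=submit]')
        await context.page.wait_for_load_state('networkidle')
        await context.add_requests(['https://example.com/account'])

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext):
        # Runs after login; the shared browser context keeps the cookies.
        print(await context.page.title())

    # Route the first request to the LOGIN handler via its label.
    await crawler.run([Request.from_url('https://example.com/login', label='LOGIN')])


asyncio.run(main())
```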