Processing Scraped Data

Clean, transform, and analyze your scraped data: deduplication, normalization, validation, and export formats.

TL;DR

Raw scraped data is messy. Clean it by removing duplicates, normalizing formats, and validating fields. Export to CSV, JSON, or database. Automate processing in your scraper or post-processing pipeline.

Why Processing Matters

Scraped data often has problems:

  • Duplicate entries from pagination or retries
  • Inconsistent formats (dates, phones, currencies)
  • Missing required fields
  • HTML artifacts and extra whitespace
  • Wrong data types (numbers as strings)

Clean data before analysis. Garbage in, garbage out.

Common Data Issues

Problem              Example                           Solution
Duplicates           Same product appears 3 times      Dedupe by unique ID or URL
Inconsistent dates   "Jan 1, 2026" vs "2026-01-01"     Parse and normalize to ISO
Price formatting     "$1,299.99" as string             Extract the number, store as float
Extra whitespace     " Product Name \n"                Trim and collapse spaces
Missing fields       Phone number null on some rows    Set defaults or filter out
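
Date normalization is the one fix from this table that the snippets below do not cover, so here is a minimal sketch in JavaScript. Built-in Date parsing is runtime-dependent and interprets strings like "Jan 1, 2026" in the local timezone, so a dedicated date library is safer in production:

// Normalize assorted date strings to an ISO 8601 date (sketch only;
// local-timezone parsing can shift the date for some inputs)
function normalizeDate(input) {
    const parsed = new Date(input.trim());
    if (Number.isNaN(parsed.getTime())) return null; // unparseable input
    return parsed.toISOString().slice(0, 10); // e.g. "2026-01-01"
}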

Processing in JavaScript

Clean data before saving in Crawlee:

import { CheerioCrawler, Dataset } from 'crawlee';

function cleanProduct(raw) {
    return {
        // Normalize name
        name: raw.name?.trim() || 'Unknown',

        // Extract number from price string
        price: parseFloat(raw.price?.replace(/[$,]/g, '')) || 0,

        // Normalize URL
        url: raw.url?.split('?')[0],

        // ISO date format
        scrapedAt: new Date().toISOString(),

        // Boolean conversion
        inStock: raw.availability?.toLowerCase().includes('in stock') || false,
    };
}

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        const rawData = {
            name: $('.product-title').text(),
            price: $('.price').text(),
            url: request.url,
            availability: $('.stock-status').text(),
        };

        const cleaned = cleanProduct(rawData);
        await Dataset.pushData(cleaned);
    },
});

// Hypothetical start URL; replace with the listing pages you scrape
await crawler.run(['https://example.com/products']);

Processing in Python

import json
import re
from datetime import datetime

def clean_product(raw):
    # Extract price as float
    price_str = raw.get('price', '$0')
    price = float(re.sub(r'[^\d.]', '', price_str) or 0)

    return {
        'name': raw.get('name', '').strip(),
        'price': price,
        'url': raw.get('url', '').split('?')[0],
        'scraped_at': datetime.now().isoformat(),
        'in_stock': 'in stock' in raw.get('availability', '').lower(),
    }

# Process dataset
with open('dataset.json') as f:
    raw_data = json.load(f)

cleaned = [clean_product(item) for item in raw_data]

# Write the cleaned records back out (filename is arbitrary)
with open('cleaned.json', 'w') as f:
    json.dump(cleaned, f, indent=2)

Deduplication

Remove duplicate entries:

// JavaScript - dedupe by URL
function deduplicate(items) {
    const seen = new Set();
    return items.filter(item => {
        if (seen.has(item.url)) return false;
        seen.add(item.url);
        return true;
    });
}

// Python - dedupe by URL
def deduplicate(items):
    seen = set()
    unique = []
    for item in items:
        if item['url'] not in seen:
            seen.add(item['url'])
            unique.append(item)
    return unique
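
Not every source exposes a stable URL or ID. When there is no unique key, you can derive one by hashing the fields that define identity. A minimal sketch in JavaScript (treating name plus price as the identity fields is an assumption; pick whatever uniquely identifies your records):

// JavaScript - dedupe by content hash when no stable ID exists
import { createHash } from 'node:crypto';

function contentKey(item) {
    // Hypothetical identity fields; adjust to your data
    return createHash('sha256')
        .update(JSON.stringify([item.name, item.price]))
        .digest('hex');
}

function deduplicateByKey(items, keyFn = contentKey) {
    const seen = new Set();
    return items.filter(item => {
        const key = keyFn(item);
        if (seen.has(key)) return false;
        seen.add(key);
        return true;
    });
}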

Data Validation

Check required fields and data types:

function validateProduct(item) {
    const errors = [];

    if (!item.name || item.name.length < 2) {
        errors.push('Invalid name');
    }

    if (typeof item.price !== 'number' || item.price < 0) {
        errors.push('Invalid price');
    }

    if (!item.url?.startsWith('http')) {
        errors.push('Invalid URL');
    }

    return {
        isValid: errors.length === 0,
        errors,
        data: item,
    };
}

// Filter to valid items only
const validItems = items
    .map(validateProduct)
    .filter(result => result.isValid)
    .map(result => result.data);

Export Formats

Format     Best For                   Notes
JSON       APIs, nested data          Preserves types and structure
CSV        Excel, flat data           Flatten nested objects first
Excel      Business users             Max ~1 million rows per sheet
Database   Large datasets, queries    PostgreSQL, MongoDB common
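
CSV is the format that needs preparation: nested objects have to be flattened into single rows first. A minimal sketch of doing that by hand (a dedicated CSV library handles edge cases more robustly):

// Flatten nested objects into dot-separated keys for CSV export
function flatten(obj, prefix = '') {
    const out = {};
    for (const [key, value] of Object.entries(obj)) {
        const path = prefix ? `${prefix}.${key}` : key;
        if (value && typeof value === 'object' && !Array.isArray(value)) {
            Object.assign(out, flatten(value, path));
        } else {
            out[path] = Array.isArray(value) ? value.join('; ') : value;
        }
    }
    return out;
}

// Serialize flattened rows to CSV, quoting fields that need it
function toCsv(items) {
    const rows = items.map(item => flatten(item));
    const headers = [...new Set(rows.flatMap(row => Object.keys(row)))];
    const escape = value => {
        const s = value == null ? '' : String(value);
        return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
    };
    return [
        headers.join(','),
        ...rows.map(row => headers.map(h => escape(row[h])).join(',')),
    ].join('\n');
}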

Apify Dataset Features

Apify handles export automatically:

  • Format conversion: Download as JSON, CSV, Excel, XML
  • Streaming: Handle datasets larger than memory
  • Webhooks: Trigger processing when run completes
  • API access: Fetch results programmatically

// Fetch and process via API
const response = await fetch(
    'https://api.apify.com/v2/datasets/DATASET_ID/items?format=json',
    { headers: { Authorization: 'Bearer YOUR_TOKEN' } }
);
const items = await response.json();
const cleaned = items.map(cleanProduct);

Common Questions

Q: Clean during scraping or after?

A: Both. Do basic cleaning during scraping (trim whitespace). Do heavy processing after (deduplication, validation) when you have all data.

Q: What about very large datasets?

A: Stream processing. Do not load everything into memory. Use a database for storage. Process in chunks, as in the sketch below.
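
For Apify datasets, the items endpoint accepts offset and limit parameters, so you can page through results in fixed-size chunks. A minimal sketch reusing cleanProduct from above (DATASET_ID and YOUR_TOKEN are placeholders):

// Fetch and clean a large dataset in chunks instead of all at once
async function* fetchChunks(datasetId, token, chunkSize = 1000) {
    for (let offset = 0; ; offset += chunkSize) {
        const res = await fetch(
            `https://api.apify.com/v2/datasets/${datasetId}/items` +
                `?format=json&offset=${offset}&limit=${chunkSize}`,
            { headers: { Authorization: `Bearer ${token}` } },
        );
        const items = await res.json();
        if (items.length === 0) return;
        yield items;
    }
}

for await (const chunk of fetchChunks('DATASET_ID', 'YOUR_TOKEN')) {
    const cleaned = chunk.map(cleanProduct);
    // ...write each chunk to your database here, not to an in-memory array
}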

Q: How do I handle encoding issues?

A: Always decode as UTF-8. Replace or remove invalid characters. Watch for HTML entities (&amp; → &).
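
A sketch of both fixes in JavaScript. The entity map below covers only the most common entities (an illustrative assumption); use a full HTML parser or entity library for anything beyond these:

// Decode raw bytes as UTF-8 (invalid sequences become U+FFFD),
// strip the replacement characters, then decode common HTML entities
const COMMON_ENTITIES = {
    '&amp;': '&', '&lt;': '<', '&gt;': '>',
    '&quot;': '"', '&#39;': "'", '&nbsp;': ' ',
};

function cleanText(buffer) {
    const text = new TextDecoder('utf-8').decode(buffer);
    return text
        .replace(/\uFFFD/g, '')
        .replace(/&(?:amp|lt|gt|quot|#39|nbsp);/g, m => COMMON_ENTITIES[m]);
}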