TL;DR
Raw scraped data is messy. Clean it by removing duplicates, normalizing formats, and validating fields. Export to CSV, JSON, or a database. Automate processing in your scraper or post-processing pipeline.
Why Processing Matters
Scraped data often has problems:
- Duplicate entries from pagination or retries
- Inconsistent formats (dates, phones, currencies)
- Missing required fields
- HTML artifacts and extra whitespace
- Wrong data types (numbers as strings)
Clean data before analysis. Garbage in, garbage out.
Common Data Issues
| Problem | Example | Solution |
|---|---|---|
| Duplicates | Same product appears 3 times | Dedupe by unique ID or URL |
| Inconsistent dates | "Jan 1, 2026" vs "2026-01-01" | Parse and normalize to ISO |
| Price formatting | "$1,299.99" as string | Extract number, store as float |
| Extra whitespace | " Product Name \n" | Trim and collapse spaces |
| Missing fields | Phone number null on some rows | Set defaults or filter out |
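Two of these fixes, whitespace collapsing and date normalization, show up in almost every pipeline. A minimal JavaScript sketch (the built-in Date parser handles common formats like "Jan 1, 2026", but for ambiguous or localized dates a parsing library such as dayjs or date-fns is more reliable):

// Trim and collapse runs of whitespace
function cleanText(value) {
  return (value || '').replace(/\s+/g, ' ').trim();
}

// Normalize a date string to ISO (YYYY-MM-DD); returns null if unparseable.
// Note: toISOString() converts to UTC, which can shift the calendar date
// when the runtime's timezone is ahead of UTC.
function toIsoDate(value) {
  const parsed = new Date(value);
  return Number.isNaN(parsed.getTime()) ? null : parsed.toISOString().slice(0, 10);
}

cleanText('  Product Name \n');   // 'Product Name'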
Processing in JavaScript
Clean data before saving in Crawlee:
import { CheerioCrawler, Dataset } from 'crawlee';
function cleanProduct(raw) {
  return {
    // Normalize name
    name: raw.name?.trim() || 'Unknown',
    // Extract number from price string
    price: parseFloat(raw.price?.replace(/[$,]/g, '')) || 0,
    // Normalize URL
    url: raw.url?.split('?')[0],
    // ISO date format
    scrapedAt: new Date().toISOString(),
    // Boolean conversion
    inStock: raw.availability?.toLowerCase().includes('in stock') || false,
  };
}
const crawler = new CheerioCrawler({
  async requestHandler({ $, request }) {
    const rawData = {
      name: $('.product-title').text(),
      price: $('.price').text(),
      url: request.url,
      availability: $('.stock-status').text(),
    };
    const cleaned = cleanProduct(rawData);
    await Dataset.pushData(cleaned);
  },
});
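To round out the example, a typical run plus export might look like this; the start URL is a placeholder, and the Dataset.exportToCSV helper is available in recent Crawlee versions:

await crawler.run(['https://example.com/products']);

// Write the cleaned dataset to the default key-value store as CSV
await Dataset.exportToCSV('products');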
Processing in Python
import json
import re
from datetime import datetime
def clean_product(raw):
    # Extract price as float
    price_str = raw.get('price', '$0')
    price = float(re.sub(r'[^\d.]', '', price_str) or 0)
    return {
        'name': raw.get('name', '').strip(),
        'price': price,
        'url': raw.get('url', '').split('?')[0],
        'scraped_at': datetime.now().isoformat(),
        'in_stock': 'in stock' in raw.get('availability', '').lower(),
    }
# Process dataset
with open('dataset.json') as f:
    raw_data = json.load(f)

cleaned = [clean_product(item) for item in raw_data]
Deduplication
Remove duplicate entries:
// JavaScript - dedupe by URL
function deduplicate(items) {
  const seen = new Set();
  return items.filter(item => {
    if (seen.has(item.url)) return false;
    seen.add(item.url);
    return true;
  });
}
// Python - dedupe by URL
def deduplicate(items):
    seen = set()
    unique = []
    for item in items:
        if item['url'] not in seen:
            seen.add(item['url'])
            unique.append(item)
    return unique
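Deduping by URL assumes every item has exactly one canonical URL. When that does not hold, the same pattern works with a composite key; the name-plus-price key below is just an illustration:

// JavaScript - dedupe by an arbitrary key
function deduplicateBy(items, keyFn) {
  const seen = new Set();
  return items.filter(item => {
    const key = keyFn(item);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

const unique = deduplicateBy(items, item => `${item.name}|${item.price}`);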
Data Validation
Check required fields and data types:
function validateProduct(item) {
  const errors = [];
  if (!item.name || item.name.length < 2) {
    errors.push('Invalid name');
  }
  if (typeof item.price !== 'number' || item.price < 0) {
    errors.push('Invalid price');
  }
  if (!item.url?.startsWith('http')) {
    errors.push('Invalid URL');
  }
  return {
    isValid: errors.length === 0,
    errors,
    data: item,
  };
}
// Filter to valid items only
const validItems = items
  .map(validateProduct)
  .filter(result => result.isValid)
  .map(result => result.data);
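The same validation results can double as a rejection log, which makes selector drift easier to spot; a small sketch:

// Keep rejected items and their errors instead of discarding them silently
const rejected = items
  .map(validateProduct)
  .filter(result => !result.isValid);

console.warn(`Rejected ${rejected.length} items`, rejected.slice(0, 5));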
Export Formats
| Format | Best For | Notes |
|---|---|---|
| JSON | APIs, nested data | Preserves types and structure |
| CSV | Excel, flat data | Flatten nested objects first |
| Excel | Business users | Max 1,048,576 rows per sheet |
| Database | Large datasets, queries | PostgreSQL, MongoDB common |
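"Flatten nested objects first" for CSV usually means turning nested fields into prefixed columns; a minimal sketch that handles one level of nesting (field names are illustrative):

// { seller: { name: 'Acme' }, tags: ['a', 'b'] } -> { 'seller.name': 'Acme', tags: 'a; b' }
function flattenForCsv(item) {
  const flat = {};
  for (const [key, value] of Object.entries(item)) {
    if (Array.isArray(value)) {
      flat[key] = value.join('; ');            // arrays become a delimited string
    } else if (value && typeof value === 'object') {
      for (const [subKey, subValue] of Object.entries(value)) {
        flat[`${key}.${subKey}`] = subValue;   // nested object becomes prefixed columns
      }
    } else {
      flat[key] = value;
    }
  }
  return flat;
}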
Apify Dataset Features
Apify handles export automatically:
- Format conversion: Download as JSON, CSV, Excel, XML
- Streaming: Handle datasets larger than memory
- Webhooks: Trigger processing when run completes
- API access: Fetch results programmatically
// Fetch and process via API
const response = await fetch(
  'https://api.apify.com/v2/datasets/DATASET_ID/items?format=json',
  { headers: { Authorization: 'Bearer YOUR_TOKEN' } }
);
const items = await response.json();
const cleaned = items.map(cleanProduct);
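For larger datasets, the items endpoint also accepts offset and limit query parameters, so results can be fetched page by page instead of in one request; the page size below is arbitrary:

const PAGE_SIZE = 1000;
const allItems = [];
let offset = 0;

while (true) {
  const res = await fetch(
    `https://api.apify.com/v2/datasets/DATASET_ID/items?format=json&offset=${offset}&limit=${PAGE_SIZE}`,
    { headers: { Authorization: 'Bearer YOUR_TOKEN' } }
  );
  const page = await res.json();
  allItems.push(...page.map(cleanProduct));
  if (page.length < PAGE_SIZE) break;   // last page reached
  offset += PAGE_SIZE;
}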
Common Questions
Q: Clean during scraping or after?
A: Both. Do basic cleaning during scraping (trim whitespace). Do heavy processing after (deduplication, validation) when you have all data.
Q: What about very large datasets?
A: Stream processing. Do not load everything into memory. Use a database for storage. Process in chunks, as sketched below.
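For example, if the dataset is exported as JSONL (one JSON object per line), Node's readline can stream it without holding the whole file in memory; the file name, batch size, and saveBatch function are placeholders:

import fs from 'node:fs';
import readline from 'node:readline';

const rl = readline.createInterface({ input: fs.createReadStream('dataset.jsonl') });

let batch = [];
for await (const line of rl) {
  if (!line.trim()) continue;
  batch.push(cleanProduct(JSON.parse(line)));
  if (batch.length >= 1000) {
    await saveBatch(batch);   // hypothetical helper: write the chunk to your database
    batch = [];
  }
}
if (batch.length > 0) await saveBatch(batch);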
Q: How do I handle encoding issues?
A: Always decode as UTF-8. Replace or remove invalid characters. Watch for HTML entities (&amp; → &).
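A sketch of that cleanup in JavaScript, assuming you are willing to add the small he package for entity decoding (Cheerio's .text() generally decodes entities already, so this matters mostly for values pulled from attributes or raw HTML):

import he from 'he';   // npm install he

function fixText(value) {
  return he.decode(value || '')   // '&amp;' -> '&', '&#8217;' -> '’'
    .normalize('NFC')             // consistent Unicode composition
    .replace(/\uFFFD/g, '')       // drop U+FFFD replacement characters from bad decoding
    .trim();
}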