Scaling scrapers to scrape 10 million items
Web scraping is an interesting job: finding imperfections or loopholes in a site that let you scrape it without hitting rate limits. With current web stacks like React or Vue, it gets quite difficult to work with just the requests module, since the DOM content only appears after JavaScript renders. It usually ends up with either using Selenium to scrape the page content or digging through the network tab for any open APIs that can be used.
But what's the next step after building a scraper? Deployment!
The issue with deployment is that most of the scrapers I have seen or worked on aren't scalable past a certain point. Sure, they can scrape 10k records from Facebook or Weibo without getting blocked, but that's it. Scaling them vertically is not an option, since everything stays bound to the machine's single IP address, which sooner or later leads to a 429 (Too Many Requests) error or an outright block.
I have heard of many solutions where the links are divided into batches and executed on different machines. That works, but if a machine goes down or I want to add an extra machine, I have to start over.
tl;dr: Celery is the solution
Celery is a simple, flexible, and reliable distributed system to process vast amounts of messages, while providing operations with the tools required to maintain such a system. It’s a task queue with focus on real-time processing, while also supporting task scheduling.
- Celery's homepage
In one of my recent projects, I had to scrape a website with around 338 search pages, each page containing 100 items, and each item having around 30-500 comments.
The idea is to divide the scraping of a single item into basic Celery tasks that can be sent to a queue and processed as atomic functions on different machines.
A trigger (job adder) script makes a search query on the main website and sends the search result page links to the SCRAP_SEARCH_PAGES worker, i.e.,
1. domain.com/?q=iphone&page=1
2. domain.com/?q=iphone&page=2
3. domain.com/?q=iphone&page=3
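A minimal sketch of what that trigger could look like, assuming the Celery tasks live in a tasks.py module like the one sketched after the pipeline description below; the query, domain and page count are just example values:

```python
# trigger.py - job adder sketch; tasks.scrap_search_page and the queue name
# are assumptions about how the project wires things up, not the exact code.
from tasks import scrap_search_page

QUERY = "iphone"
TOTAL_PAGES = 338  # number of search result pages for this query

for page in range(1, TOTAL_PAGES + 1):
    url = f"https://domain.com/?q={QUERY}&page={page}"
    # push one atomic task per search page onto its dedicated queue
    scrap_search_page.apply_async(args=[url], queue="SCRAP_SEARCH_PAGES")
```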
SCRAP_SEARCH_PAGES scrapes all product links from the passed search page URL and sends them to the next worker, SCRAP_PRODUCT_INFO.
SCRAP_PRODUCT_INFO scrapes the item's details like title, description, seller etc. and sends the generated item JSON to the INSERT_INTO_DB worker.
At the same time, it sends the item to the SCRAP_ITEM_COMMENTS worker so the comments get processed as well.
SCRAP_ITEM_COMMENTS scrapes comments from the item's detail page and sends them as JSON to the INSERT_INTO_DB worker.
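Roughly, the pipeline tasks could look like the sketch below. The CSS selectors, field names and broker address are placeholders rather than the real site's structure; the point is that each task fetches one page and pushes its output onto the next queue:

```python
# tasks.py - sketch of the scraping pipeline; selectors and URLs are placeholders.
import requests
from bs4 import BeautifulSoup
from celery import Celery

app = Celery("proj", broker="amqp://user:password@rabbitmq-host//")

@app.task
def scrap_search_page(search_url):
    soup = BeautifulSoup(requests.get(search_url).text, "html.parser")
    for link in soup.select("a.product-link"):  # placeholder selector
        scrap_product_info.apply_async(args=[link["href"]], queue="SCRAP_PRODUCT_INFO")

@app.task
def scrap_product_info(product_url):
    soup = BeautifulSoup(requests.get(product_url).text, "html.parser")
    item = {
        "url": product_url,
        "title": soup.select_one("h1.title").get_text(strip=True),        # placeholder
        "description": soup.select_one("div.desc").get_text(strip=True),  # placeholder
        "seller": soup.select_one("span.seller").get_text(strip=True),    # placeholder
    }
    # fan out: persist the item and scrape its comments in parallel
    insert_into_db.apply_async(args=[item], queue="INSERT_INTO_DB")
    scrap_item_comments.apply_async(args=[product_url], queue="SCRAP_ITEM_COMMENTS")

@app.task
def scrap_item_comments(product_url):
    soup = BeautifulSoup(requests.get(product_url).text, "html.parser")
    comments = [c.get_text(strip=True) for c in soup.select("div.comment")]  # placeholder
    insert_into_db.apply_async(args=[{"url": product_url, "comments": comments}],
                               queue="INSERT_INTO_DB")

@app.task
def insert_into_db(doc):
    ...  # the dedicated DB worker, sketched in the next section
```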
Instead of linking every worker to the database, I used a single DB worker, because scaling out the scraper workers would otherwise also scale the number of database connections, increasing the load on the database server and creating a bottleneck.
I measured a single DB worker running on a t3.micro instance against a MongoDB server running on a t3.medium, and it reached around ~420 entries per second, which was good enough for the project.
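A minimal sketch of that dedicated DB worker, assuming MongoDB via pymongo; the hostnames and database/collection names are made up. The MongoClient lives once per worker process and is shared by every task it executes, so adding more scraper workers never adds database connections:

```python
# db_worker.py - sketch of the INSERT_INTO_DB worker; connection details are placeholders.
from celery import Celery
from pymongo import MongoClient

app = Celery("proj", broker="amqp://user:password@rabbitmq-host//")

# one client per worker process, reused across all tasks it runs
client = MongoClient("mongodb://mongo-host:27017/")
collection = client["scraper"]["items"]

@app.task
def insert_into_db(doc):
    # upsert on the product URL so a re-queued task doesn't create duplicates
    collection.update_one({"url": doc["url"]}, {"$set": doc}, upsert=True)
```

It runs pinned to its own queue, e.g. celery -A db_worker worker -Q INSERT_INTO_DB.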
Picking resources
I used RabbitMQ instead of Redis for the Celery queue. While Redis comes with an easy plug-and-play config, I don't think it could have handled 10+ million queue items without creating latency issues for 35+ workers running at once. I went with RabbitMQ for performance reasons, and after adding memory swap and resizing the in-memory block allocated for queues, it ran without any issues on a single t3.nano instance.
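For reference, a hedged sketch of the broker configuration this setup implies; the RabbitMQ host, credentials and routing are assumptions, not the project's actual values. The late-ack settings are what let an unfinished task go back to the queue if a worker dies, rather than being lost:

```python
# celeryconfig.py - broker and routing sketch; values are placeholders.
broker_url = "amqp://user:password@rabbitmq-host:5672//"

# route each task to its own named queue so workers can subscribe selectively
task_routes = {
    "tasks.scrap_search_page": {"queue": "SCRAP_SEARCH_PAGES"},
    "tasks.scrap_product_info": {"queue": "SCRAP_PRODUCT_INFO"},
    "tasks.scrap_item_comments": {"queue": "SCRAP_ITEM_COMMENTS"},
    "tasks.insert_into_db": {"queue": "INSERT_INTO_DB"},
}

# acknowledge a task only after it finishes, so a crashed worker's task is requeued
task_acks_late = True
task_reject_on_worker_lost = True
```

This would be loaded on the Celery app with app.config_from_object("celeryconfig").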
For the workers, it depends on what kind of task is being done. If it's image processing or uploading to an S3 bucket, I'd pick a medium or large instance with 2-4 workers. If it's making simple HTTP requests with the requests module and parsing with bs4, micro instances work well. I used both micro and large instances with AWS auto scaling to scale out instances depending on the size of the job queues.
For easier and faster deployment, I built a Docker image and pushed everything to AWS ECR. I wrote a bash script and put it in the user data, so on every start a machine pulls the latest image and starts a worker listening on the queue passed in the $QUEUE env variable, using the command below:
celery -A proj worker -Q $QUEUE
Conclusion
There are not many solutions for distributed scraping on the market. I have tested Crawlab and Scrapyd, but they aren't very suitable for large jobs. While this solution is complicated to set up, it guarantees that no jobs are lost if the system crashes or workers restart. With this setup, the entire site was scraped in less than a day; the only limit was how many spot instances could be allocated.
What we didn't do
- Kubernetes setup for better CI/CD
- Automatic scaling based on average cluster load or queue size
- Cluster monitoring with Grafana to get proper utilization metrics for the instances
- Centralized logging to debug failing tasks