8 min read

Take your web scraping to a new level: Let's play with scrapy

Changing my code to scrapy, a web scraping framework for Python, was challenging but reaped many dividends.
Take your web scraping to a new level: Let's play with scrapy
Photo by Егор Камелев / Unsplash
☝🏼
Key takeaways:
Web scraping is a useful and unique project that is good for beginners.
Scrapy makes it easier to operationalise your web scraping and to implement them at scale, by providing the ability to reuse code and features that are useful for web scraping.
Making a transition to the scrapy framework is not always straightforward, but it will pay dividends.

Web scraping is fundamental to any data science journey. There's a lot of information out there on the world wide web. Very few of them are presented in a pretty interface which allows you just to take it. By scraping information off websites, you get structured information. It's a unique challenge that is doable for a beginner.

There are thus a lot of articles which teach you how to scrape websites — here’s mine.

Automate Boring Stuff: Get Python and your Web Browser to download your judgements
This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural languge processing, data extraction and processing! Update 13 June 2020: “At least until the PDPC breaks its website.” How prescient… about three months after I wrote

After spending gobs of time plying through books and web articles, I created a web scraper that did the job right.

GitHub - houfu/pdpc-decisions: Data Protection Enforcement Cases in Singapore
Data Protection Enforcement Cases in Singapore. Contribute to houfu/pdpc-decisions development by creating an account on GitHub.
The code repository of the original web scraper is available on GitHub.

I was pretty pleased with the results and started feeling ambition in my loins. I wanted to take it to a new level.

Levelling up... is not easy.

Turning your hobby project production-ready isn’t so straightforward, though. If you plan to scan websites regularly, update your data or do several operations, you run into a brick wall.

To run it continuously, you will need a server that can schedule your scrapes, store the data and report the results.

Then, in production mode, problems like being blocked and burdening the websites you are dealing with become more significant.

Finally, scrapers share many standard features. I wasn’t prepared to write the same code over and over again. Reusing the code would be very important if I wanted to scrape several websites at scale.

Enter scrapy

The solutions to these problems are well-known, such as using multithreading, asynchronous operations, web proxies or throttling or randomising your web requests. Writing all these solutions from scratch? Suddenly your hobby project has turned into a chore.

Enter scrapy.

This post is for subscribers only