Web scraping is a useful and approachable project for beginners.
Scrapy makes it easier to operationalise your web scraping and to run it at scale, by letting you reuse code and by providing built-in features that web scrapers commonly need.
Transitioning to the Scrapy framework is not always straightforward, but it pays dividends.
Web scraping is fundamental to any data science journey. There is a lot of information out there on the web, but very little of it is presented in a tidy interface that lets you simply take it. By scraping websites, you turn that information into structured data. It's a challenge that is genuinely doable for a beginner.
There are thus a lot of articles which teach you how to scrape websites — here’s mine.
After spending gobs of time poring over books and web articles, I created a web scraper that did the job right.
I was pretty pleased with the results, and ambition started stirring. I wanted to take it to the next level.
Levelling up... is not easy.
Making your hobby project production-ready isn't so straightforward, though. If you plan to scrape websites regularly, update your data or chain several operations together, you soon hit a brick wall.
To run it continuously, you will need a server that can schedule your scrapes, store the data and report the results.
Then, in production, problems like getting blocked and overburdening the websites you scrape become much more significant.
Finally, scrapers share many standard features. I wasn't prepared to write the same code over and over again. Reusing code becomes essential if you want to scrape several websites at scale.
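The reuse idea can be sketched in a framework-agnostic way: shared plumbing (cleaning, driving the extraction) lives in one base class, while each site only supplies its own extraction rules. The class and method names here are hypothetical illustrations, not part of any particular library.

```python
class BaseScraper:
    """Shared logic reused by every site-specific scraper."""

    def clean(self, text: str) -> str:
        # Normalise whitespace the same way for every website.
        return " ".join(text.split()).strip()

    def scrape(self, raw_pages: list[str]) -> list[str]:
        # Shared driver: subclasses only define extract().
        return [
            self.clean(item)
            for page in raw_pages
            for item in self.extract(page)
        ]

    def extract(self, page: str) -> list[str]:
        raise NotImplementedError


class QuoteScraper(BaseScraper):
    """Only the site-specific extraction rule differs per website."""

    def extract(self, page: str) -> list[str]:
        # Toy rule for illustration: keep non-empty lines.
        return [line for line in page.splitlines() if line.strip()]
```

Scrapy formalises exactly this pattern: common crawling machinery lives in the framework, and each spider you write contributes only the site-specific parts.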
The solutions to these problems are well known: multithreading, asynchronous operations, web proxies, and throttling or randomising your web requests. But writing all of these from scratch? Suddenly your hobby project has turned into a chore.
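Scrapy ships the throttling and randomisation pieces as configuration rather than code you write yourself. As a minimal sketch, a polite crawl might set something like the following in its settings; the specific values here are illustrative assumptions, not recommended defaults.

```python
# Illustrative Scrapy settings for polite, throttled crawling.
# The setting names are real Scrapy settings; the values are examples.
POLITE_SETTINGS = {
    "DOWNLOAD_DELAY": 1.0,                # base pause between requests
    "RANDOMIZE_DOWNLOAD_DELAY": True,     # jitter the delay to look less robotic
    "CONCURRENT_REQUESTS_PER_DOMAIN": 4,  # cap the load on any one site
    "AUTOTHROTTLE_ENABLED": True,         # adapt delay to server response times
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "RETRY_TIMES": 2,                     # retry transient failures a few times
}
```

Instead of hand-rolling thread pools and sleep calls, you declare the behaviour you want and Scrapy's asynchronous engine handles the scheduling underneath.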