Beginnings of Automation with Scrapy

Kelvin Arellano
4 min read · Jun 13, 2021
Photo by Christopher Gower on Unsplash

Last week I mentioned how I wanted to start getting better at web scraping and eventually be able to automate more of my general processes with an army of Python bots. This is me chronicling my first steps toward achieving that goal.

For this, I wanted to start learning new software and new approaches to things I was already familiar with: enter Scrapy.

Scrapy is a Python framework for crawling websites, extracting information, and storing it. It's basically an all-in-one, Swiss Army-style tool for web scraping. It can use both XPath and CSS selectors to parse websites, but for this project I'll be using CSS selectors, along with the built-in shell, to create crawlers and extractors.

What I found is that Scrapy centers its programs and methodology around spiders. From what I understand, spiders and crawlers work in conjunction to extract data: crawlers primarily handle navigating through pages and websites, while spiders sift through the information on those pages.

Scrapy ships with a set of default shell commands that are available for every project you create. Once you fetch a URL, it returns a response. This is the same response you would get from using Beautiful Soup, but with Scrapy it's stored as a built-in object instead of you having to save it to your own variable.
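As a rough sketch of what that looks like in the shell (I'm using the quotes.toscrape.com practice site here as a stand-in, not the page from my project):

```python
# Launched with: scrapy shell "https://quotes.toscrape.com"
# fetch() downloads a page and stores the result in the built-in `response` object
fetch("https://quotes.toscrape.com/page/1/")

response          # <200 https://quotes.toscrape.com/page/1/>
response.status   # 200
response.text     # the raw HTML, similar to what Beautiful Soup would hand you
```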

Now, using the response, you can parse through the different tags and lists present in the HTML page.
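For example, something like this; the selectors are assumptions about the practice page's markup rather than anything specific to my project:

```python
# CSS selectors pull individual tags and lists of tags out of the response
response.css("title::text").get()         # the page title as a string
response.css("a::attr(href)").getall()    # every link href on the page, as a list
response.css("span.text::text").getall()  # text from each <span class="text">

# the same idea works with XPath if you prefer it
response.xpath("//title/text()").get()
```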

This is usually my go-to example when explaining to people how different languages and programs work together, or even rely on each other, to realize their full potential. You need to know how to key into objects using Python and how websites are structured using CSS and HTML.

Then, what I like about Scrapy, and what I wasn't able to figure out how to integrate using just Beautiful Soup, is that it comes with regular expression integration. Regular expressions are sequences of characters that specify a search pattern. Using them in conjunction with Scrapy's selectors can be very powerful for filtering your initial data and keeping it organized.

For example, I wanted to find instances of words on the webpage that contained the word "scraping", and then words that started with the letter "s".
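A sketch of how that looks with Scrapy's .re() method, which runs a regular expression over whatever the selector matched (the selector here is a guess at the page's structure):

```python
# words that contain "scraping" anywhere in the matched text
response.css("div.post ::text").re(r"\w*scraping\w*")

# words that start with the letter s
response.css("div.post ::text").re(r"\bs\w+")
```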

With this I think I can perform better NLP projects, or add some additional functionality to my former projects, such as saving book snippets as JSON objects and then parsing through them. To do so, I think I would first need to get better at filling out forms on websites automatically, or at dictating my bots' behavior more precisely.

First things first, however: in order to take that step with full confidence, I first had to be able to do everything I had done with Beautiful Soup, but with Scrapy.

In one of my last projects I had to parse through information on a website and store it in a CSV file. The most difficult part was correctly picking the tags and setting rules for what should happen if those tags were not found. Now I wanted to do the same thing with Scrapy, and luckily I didn't struggle nearly as much, drawing on my past experience and on the functionality of Scrapy's spiders.

So, for each post or paragraph on this website, I wanted to get the date of publication, the title, and the post text.
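The spider ended up looking roughly like the sketch below; the spider name, URL, and selectors are placeholders rather than the exact ones from my project, but the .get(default=...) calls are how I set the behavior for tags that aren't found:

```python
import scrapy

class PostSpider(scrapy.Spider):
    name = "posts"
    start_urls = ["https://example-blog.com/"]  # placeholder for the site I scraped

    def parse(self, response):
        # each <article> is one post on the page
        for post in response.css("article"):
            yield {
                "date": post.css("time::text").get(default="no date"),
                "title": post.css("h2::text").get(default="no title"),
                "text": " ".join(post.css("p::text").getall()),
            }
```

Running scrapy crawl posts -o posts.csv then writes everything out to a CSV file, the same way I stored results before.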

I would still like to integrate this into one of my projects, but with this, I believe the functionality I was able to achieve matches what I did previously using Beautiful Soup.
