Scrapy Tutorial – Complete Guide

Welcome to Zenva’s tutorial on Scrapy, an incredibly useful Python library that allows you to create web scrapers with ease. Whether you’re new to coding or an experienced programmer, this guide will show you just how engaging Scrapy can be, demonstrating its practical value while making the learning process as accessible as possible.

What Is Scrapy?

Scrapy is a high-level, open-source framework specially designed for web scraping. It’s developed in Python, a programming language popular for its readability and ease of use.

Web scraping often involves retrieving data from websites and storing this information for analysis, or perhaps integrating it into an app or a game. Scrapy makes this process much more straightforward. Here’s why learning Scrapy would be valuable:

  • As a Python framework, it is beginner-friendly yet powerful for more advanced coders.
  • It has broad utility, from data science to game development, enhancing your current skill set.
  • Automating data retrieval saves you loads of time compared with manual data collection.

We recommend learning Scrapy if you are interested in efficiently collecting accurate data from web sources or if you wish to expand your Python programming skills.


Getting Started with Scrapy – Basic Example

Before starting to code, ensure that Scrapy is installed on your local machine. If not, you can easily install Scrapy with pip by typing in your terminal:

pip install scrapy

Basic Spider Example

In Scrapy, spiders are classes that define how a website should be scraped. Let’s start with a simple example. We’ll create a new spider that will scrape quotes from “http://quotes.toscrape.com”.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
            }

In the example above:

  • We defined a QuotesSpider spider class extending scrapy.Spider.
  • We provided the start_urls list to target the pages we aim to scrape.
  • We parsed the data from the targeted URLs using CSS selectors in our parse method.

Running Your Spider

Once your spider is set, you can run it using Scrapy’s command line interface.

scrapy crawl quotes

The code above should start the crawling process based on your QuotesSpider class.

Storing Your Scraped Data

Often, you would want to store the data you’ve scraped. Scrapy allows you to save data into many formats, including JSON, XML, and CSV. Here’s how you can save the scraped data as a JSON file:

scrapy crawl quotes -o quotes.json

In the example above, “-o quotes.json” instructs Scrapy to store the resulting data in a JSON file named “quotes.json”. Each time you run the command, the quotes.json file is appended to, not overwritten (use “-O” instead if you want to overwrite). Note that you can replace “.json” with “.xml” or “.csv” to store data in a different format.
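The exported JSON file is an array with one object per scraped item, so you can load it back with Python’s standard json module for analysis. Here’s a sketch where an inline string stands in for the contents of quotes.json:

```python
import json

# Stand-in for the contents of a file produced by `scrapy crawl quotes -o quotes.json`:
# a JSON array with one object per yielded item (sample data, not real scraped quotes)
sample = '''[
    {"text": "Quote one", "author": "Author A"},
    {"text": "Quote two", "author": "Author B"}
]'''

quotes = json.loads(sample)  # with a real file: json.load(open("quotes.json"))
for item in quotes:
    print(f"{item['author']}: {item['text']}")
```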

Now you know the basics of creating, running, and storing information from a Scrapy spider!

Adding Multiple Pages to Your Spider

Often, data you need isn’t on just one page. Let’s expand our QuotesSpider to crawl through all the pages on our quote website.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

In this spider, we added code that detects the link to the next page and follows it until no more pages are found.
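Note that the href extracted from the “Next” button is a relative link such as “/page/2/”. response.follow resolves it against the current page’s URL for you, much like the standard library’s urljoin does:

```python
from urllib.parse import urljoin

# response.follow() resolves relative links against the current page's URL.
# The href below matches the relative format used by the site's "Next" button.
current_url = 'http://quotes.toscrape.com/page/1/'
next_href = '/page/2/'

print(urljoin(current_url, next_href))  # http://quotes.toscrape.com/page/2/
```

This is why the spider can pass next_page to response.follow as-is, without building the absolute URL by hand.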

Using Scrapy Shell

Scrapy Shell is an interactive shell where you can test and debug your scraping code quickly without running a full spider. It’s a great feature Scrapy provides to avoid unnecessary bugs in your spider. Here’s how you can use it:

scrapy shell 'http://quotes.toscrape.com'

After running this command, Scrapy fetches the page and allows you to test it interactively. For example:

In [1]: response.css('title')
Out[1]: [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

Using Scrapy with BeautifulSoup

BeautifulSoup is another Python library used for web scraping, designed to pull data out of HTML and XML files. We can use BeautifulSoup inside a Scrapy spider if we prefer its API for selecting data from a webpage.

import scrapy
from bs4 import BeautifulSoup

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    
    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        for quote in soup.find_all('div', class_='quote'):
            yield {
                'text': quote.find('span', class_='text').text,
                'author': quote.find('small', class_='author').text,
            }
        next_page = soup.find('li', class_='next')
        if next_page is not None:
            next_page = next_page.find('a')['href']
            yield response.follow(next_page, self.parse)

In this spider, we use BeautifulSoup instead of Scrapy’s built-in selectors to parse the HTML content of the page.
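The find and find_all calls above can be tried on a static snippet without running the spider at all. Here’s a minimal sketch (the HTML fragment is made up for illustration, and it uses Python’s built-in “html.parser” so no lxml is needed):

```python
from bs4 import BeautifulSoup

# Made-up fragment mirroring the markup the spider above parses
html = '''
<div class="quote">
    <span class="text">“Simplicity is the soul of efficiency.”</span>
    <small class="author">Austin Freeman</small>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')  # stdlib parser; no lxml required
quote = soup.find('div', class_='quote')
print(quote.find('span', class_='text').text)
print(quote.find('small', class_='author').text)
```

Experimenting on a saved snippet like this is a fast way to get your find/find_all calls right before wiring them into a parse method.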

By now, you should have immersed yourself in the world of Scrapy and its extensive capabilities for web scraping. Happy data hunting!

Where to Go Next – Continuing Your Journey

Scrapy is just one of the versatile tools you can master with the Python programming language. There’s a vast world to explore with Python, opening innumerable opportunities in various fields like game development, data science, app development, and more.

If you’re interested in digging deeper into Python and its applications, our Python Mini-Degree could be the perfect next step for you. Our Python Mini-Degree offers a comprehensive suite of Python programming courses. Beginners will learn coding basics, algorithms, and object-oriented programming. At the same time, more experienced programmers can delve into game and app development in Python.

Our courses are designed to provide maximum convenience and flexibility. You can access the course material anytime, learn at your own pace, and reinforce what you’ve learned through interactive lessons, coding challenges, and quizzes.

Upon completion of the courses, you will be awarded a certificate. Many of our students have leveraged their newfound skills to publish games, land jobs, or even start their own businesses.

In case you’re looking for a broader selection of Python courses, feel free to explore our Python courses collection.

Conclusion

We’ve had an engaging journey into the world of Scrapy, covering the basics and how to effectively harness its capabilities to your advantage. Such skills aren’t just useful; they’re in high demand in today’s coding landscape. Remember, the most essential aspect of learning is to keep the experience interactive and enjoyable.

At Zenva, we strive to provide high-quality content that’s accessible anytime, from anywhere, and on any device. We look forward to joining you in your next steps. Perhaps you’re fascinated by Python’s many applications, in which case, we’d like to invite you to further your journey with our Python Mini-Degree. You can continue to develop your Python skills, diving deeper into its rich world of opportunities. Happy coding!

Did you come across any errors in this tutorial? Please let us know by completing this form and we’ll look into it!
