Scrapy

Scrapy is an excellent Python framework for writing web crawlers.

Quick setup

scrapy startproject [name]
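For example, scrapy startproject quotes generates a project skeleton roughly like this (details vary slightly by Scrapy version):

quotes/
    scrapy.cfg
    quotes/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py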

Use the scrapy shell to test out selectors

scrapy shell [URL]
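For example, against the quotes.toscrape.com demo site used in the Scrapy tutorial (output indicative):

scrapy shell 'https://quotes.toscrape.com'
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> response.css('.author::text').get()
'Albert Einstein'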

Storing the output, e.g. as JSON (-O overwrites the file; the lowercase -o appends instead)

scrapy crawl [spidername] -O output.json

Create a new spider with

scrapy genspider spidername domain.com
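This generates a spider skeleton roughly like the following (the exact template varies by Scrapy version):

import scrapy

class SpidernameSpider(scrapy.Spider):
    name = 'spidername'
    allowed_domains = ['domain.com']
    start_urls = ['https://domain.com']

    def parse(self, response):
        pass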

Useful recipes

follow and follow_all

Chain additional requests with

response.follow(url, callback=handler)

or follow a whole list of URLs with

response.follow_all(urls, handler)

Note that the arguments do not have to be explicit URLs; selectors for tags with an href attribute (such as <a> elements) also work.

Example use:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'
    # ... 
    def parse(self, response):
        # grab the link that sits next to each author name
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

    def parse_author(self, response):
        # handle page
        ...

Downloading images

Downloading images requires defining an Item class and enabling the Images Pipeline.

In brief, enable the pipeline by modifying settings.py and adding

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

Next, define a location for the images to be stored with

IMAGES_STORE = '/path/to/directory'

The image store can also be an FTP server, an Amazon S3 bucket, or Google Cloud Storage.
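For instance, an S3 store looks like this (bucket name hypothetical):

IMAGES_STORE = 's3://my-bucket/images/'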

After the pipeline has been enabled, create a class in items.py

import scrapy

class ImageItem(scrapy.Item):
    # ... other relevant fields 
    image_urls = scrapy.Field()
    images = scrapy.Field()

To download an image, your spider then simply returns an instance of this item; the pipeline reads the URLs from image_urls and records the download results in images:

from project.items import ImageItem
# ... 

class ImageSpider(scrapy.Spider):
    # ... 

    def parse_image(self, response):
        # take the first image's src; urljoin resolves relative URLs
        image_url = response.urljoin(response.xpath("//img")[0].attrib["src"])

        image = ImageItem()
        image["image_urls"] = [image_url]
        return image
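Once the crawl runs, the pipeline downloads each URL in image_urls and writes the results back into the images field. Files land under IMAGES_STORE with names derived from a hash of the URL, so the populated item looks roughly like:

{'image_urls': ['https://example.com/pic.jpg'],
 'images': [{'url': 'https://example.com/pic.jpg',
             'path': 'full/<sha1-of-url>.jpg',
             'checksum': '<md5-of-contents>'}]}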

Command line arguments

Arguments passed on the command line with -a are set as attributes on the spider instance (they are handed to __init__), so you can read them anywhere in the spider class, e.g. in start_requests:

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # ...
    def start_requests(self):
        # ...
        page = int(getattr(self, 'page', 1))

If the page attribute was not passed in, getattr falls back to the default, in this case 1.

When launching the spider from the command line, we can now pass in the argument with

scrapy crawl [spidername] -a page=10
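Putting it together, a minimal sketch of a spider that uses the argument (the paging URL scheme is borrowed from the quotes.toscrape.com tutorial site):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        # fall back to page 1 when no -a page=... was given
        page = int(getattr(self, 'page', 1))
        yield scrapy.Request(f'https://quotes.toscrape.com/page/{page}/', callback=self.parse)

    def parse(self, response):
        # extract the text of each quote on the page
        for quote in response.css('.quote'):
            yield {'text': quote.css('.text::text').get()}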

Useful resources