Scrapy

Scrapy is written in pure Python and depends on a few key Python packages (among others):

  • lxml, an efficient XML and HTML parser
  • parsel, an HTML/XML data extraction library written on top of lxml,
  • w3lib, a multi-purpose helper for dealing with URLs and web page encodings
  • twisted, an asynchronous networking framework
  • cryptography and pyOpenSSL, to deal with various network-level security needs

Official docs

Run scrapy startproject tutorial to create the tutorial project, then set up a venv yourself and pip3 install scrapy inside it.

Getting started

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

The core idea is to write a class that subclasses scrapy.Spider and implement the required attributes and methods, such as name, start_requests, and parse, each of which has a specific job.

Then run scrapy crawl quotes from the project root, which produces the two HTML files below:

.
├── bin
├── include
├── lib
├── lib64 -> lib
├── pyvenv.cfg
├── quotes-1.html
├── quotes-2.html
├── scrapy.cfg
├── share
└── tutorial

start_requests can be replaced by a start_urls list, as follows:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

PS: the f prefix means the string supports inline interpolation: values are passed in via curly braces {} (in the example above, page is '1' or '2').
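
A self-contained sketch of the f-string mechanism (the values here are just illustrative):

page = '1'
filename = f'quotes-{page}.html'   # {page} is replaced with the variable's value at runtime
print(filename)                    # quotes-1.html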

Extracting data

Everything crawled so far is raw HTML; the next step is to extract data from it.

Let's experiment in the Scrapy shell first.

Send a request and drop into the shell:

scrapy shell 'http://quotes.toscrape.com/page/1/'

Now let's inspect the title in the response:

  1. Select it directly:
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
  2. ::text strips the <title> tags from the data:
>>> response.css('title::text')
[<Selector xpath='descendant-or-self::title/text()' data='Quotes to Scrape'>]
  3. getall() drops the wrapping Selector objects:
>>> response.css('title::text').getall()
['Quotes to Scrape']
  4. get() returns just the first result:

The other thing is that the result of calling .getall() is a list: it is possible that a selector returns more than one result, so we extract them all. When you know you just want the first result, as in this case, you can do:

>>> response.css('title::text').get()
'Quotes to Scrape'

Or, equivalently:

>>> response.css('title::text')[0].get()
'Quotes to Scrape'

Indexing has a drawback, though: if the selector list is empty it raises an IndexError, while get() simply returns None (and when there are multiple selectors, get() returns the first one by default), so using get() directly is usually the better choice.
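
A self-contained sketch of the difference, using parsel (the library Scrapy's selectors are built on); the HTML snippet is made up and deliberately has no <title>:

from parsel import Selector

sel = Selector(text='<html><head></head><body></body></html>')

print(sel.css('title::text').get())    # None: get() fails gracefully when nothing matches
# sel.css('title::text')[0].get()      # would raise IndexError, since there is nothing to index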

Besides the getall() and get() methods, there is also a re() method that extracts data using a regular expression.
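
Continuing the same scrapy shell session on page 1, a quick sketch of re(); the expected output assumes the title is still 'Quotes to Scrape':

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']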

Example:

Let's crawl a quotations site. First, make a request in the shell and look at the response:

scrapy shell 'http://quotes.toscrape.com'

The HTML we get back looks roughly like this (for the first quote on the page):

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Using response.css to select the <div> elements whose class is "quote" returns a list of selectors, because every quote on the page has this structure; the snippet above is just one of them.

response.css("div.quote")
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 ...]

We can pick out the first quote:

>>> quote = response.css("div.quote")[0]

and then extract pieces from it:

>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'

Get every tag under div.tags:

>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

Having experimented in the shell, we can now loop over all the quotes:

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").get()
...     author = quote.css("small.author::text").get()
...     tags = quote.css("div.tags a.tag::text").getall()
...     print(dict(text=text, author=author, tags=tags))
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
...

The loop extracts every quote on the page.

Going back to our first spider, we can rewrite it as follows:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

As the official documentation puts it:

If you run this spider, it will output the extracted data with the log:

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

Storing the scraped data

The simplest way is to use Feed exports, e.g. JSON, JSON Lines, XML, or CSV:

scrapy crawl quotes -O quotes.json

That will generate a quotes.json file containing all scraped items, serialized in JSON.

The -O command-line switch overwrites any existing file; use -o instead to append new content to any existing file (my guess for why appending breaks JSON: a JSON export wraps everything in one pair of square brackets, while JSON Lines has no wrapper and is written as a "stream" of one object per line). However, appending to a JSON file makes the file contents invalid JSON. When appending to a file, consider using a different serialization format, such as JSON Lines:

scrapy crawl quotes -o quotes.jl
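
Because every line of a JSON Lines file is a standalone JSON object, the export is easy to consume incrementally. A minimal sketch, assuming the quotes.jl produced by the command above:

import json

with open('quotes.jl', encoding='utf-8') as f:
    for line in f:
        item = json.loads(line)              # one scraped quote per line
        print(item['author'], item['tags'])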

Q: What if crawling the current page isn't enough and you want to follow the Next button to crawl page after page automatically?

The pagination button looks like this:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

Enter the following in the shell:

>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

This gets the anchor element, but we want the attribute href. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:

>>> response.css('li.next a::attr(href)').get()
'/page/2/'

There is also an attrib property available (see Selecting element attributes for more):

>>> response.css('li.next a').attrib['href']
'/page/2/'

Let’s see now our spider modified to recursively follow the link to the next page, extracting data from it:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

urljoin stitches the relative href onto the current URL to get the next page's address, and the callback keeps the crawl going page by page.

My understanding of urljoin: it is not plain string concatenation; it resolves the href against the response's URL following standard URL resolution rules, so an href beginning with '/' replaces the whole path, while a relative href is resolved against the current page's directory.
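
A quick sketch with the standard library; response.urljoin(next_page) behaves essentially like urljoin(response.url, next_page):

from urllib.parse import urljoin

base = 'http://quotes.toscrape.com/page/1/'

print(urljoin(base, '/page/2/'))   # http://quotes.toscrape.com/page/2/          (absolute path replaces the old path)
print(urljoin(base, 'page/2/'))    # http://quotes.toscrape.com/page/1/page/2/   (relative path resolved against the current directory)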

A shortcut for creating Requests

Use response.follow in place of scrapy.Request:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Comparing the two approaches, response.follow skips the explicit urljoin step because it supports relative URLs directly:

        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

You can also pass a selector to response.follow instead of a string; this selector should extract necessary attributes:

for href in response.css('ul.pager a::attr(href)'):
    yield response.follow(href, callback=self.parse)

For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:

for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)

To create multiple requests from an iterable, you can use response.follow_all instead:

anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)

or, shortening it further:

yield from response.follow_all(css='ul.pager a', callback=self.parse)

More examples and patterns

What follows is a pile of shorthand patterns; the gist is to yield each item as soon as it has been extracted. Just read the code, I'm too lazy to explain it line by line.

Here is another spider that illustrates callbacks and following links, this time for scraping author information:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

This spider will start from the main page; it will follow all the links to the author pages, calling the parse_author callback for each of them, and also the pagination links with the parse callback as we saw before.

Here we’re passing callbacks to response.follow_all as positional arguments to make the code shorter; it also works for Request.

The parse_author callback defines a helper function to extract and cleanup the data from a CSS query and yields the Python dict with the author data.

Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting DUPEFILTER_CLASS.
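
As a hedged sketch of the two usual knobs (the class path below is the default shipped with recent Scrapy versions; check the docs for your version):

# settings.py: swap the duplicate filter project-wide
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'   # Scrapy's default request-fingerprint filter

# or let a single request bypass deduplication:
#   yield scrapy.Request(url, callback=self.parse, dont_filter=True)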

Hopefully by now you have a good understanding of how to use the mechanism of following links and callbacks with Scrapy.

As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class for a generic spider that implements a small rules engine that you can use to write your crawlers on top of it.

Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks.
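
One way to do that (a sketch, not part of this tutorial: parse_author_page and the partially built item are illustrative names) is Request.cb_kwargs, which response.follow also accepts in recent Scrapy versions; the two methods below would live inside a spider like the ones above:

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = {'text': quote.css('span.text::text').get()}
            author_href = quote.css('.author + a::attr(href)').get()
            # carry the partially built item over to the author page's callback
            yield response.follow(author_href, self.parse_author_page, cb_kwargs={'item': item})

    def parse_author_page(self, response, item):
        item['birthdate'] = response.css('.author-born-date::text').get()
        yield item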

Using spider arguments

You can provide command line arguments to your spiders by using the -a option when running them:

scrapy crawl quotes -O quotes-humor.json -a tag=humor

These arguments are passed to the Spider’s __init__ method and become spider attributes by default.

In this example, the value provided for the tag argument will be available via self.tag. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

If you pass the tag=humor argument to this spider, you’ll notice that it will only visit URLs from the humor tag, such as http://quotes.toscrape.com/tag/humor.

You can learn more about handling spider arguments in the official documentation.

Next steps

This tutorial covered only the basics of Scrapy, but there’s a lot of other features not mentioned here. Check the What else? section in Scrapy at a glance chapter for a quick overview of the most important ones.

You can continue from the section Basic concepts to know more about the command-line tool, spiders, selectors and other things the tutorial hasn’t covered like modeling the scraped data. If you prefer to play with an example project, check the Examples section.

Licensed under CC BY-NC-SA 4.0