XPath Key Concepts
1. Introduction
XPath is a language for finding information in XML documents. BeautifulSoup does not support XPath, but many Python libraries (such as lxml, selenium, and scrapy) do. It is typically used in the same way as a CSS selector (e.g. mytag#idname). XPath was originally designed to work with well-formed XML documents rather than HTML.
2. Setup
This article uses scrapy shell as the environment for practicing XPath syntax; if you are already familiar with scrapy, feel free to skip this part. The parsing target is http://quotes.toscrape.com/, the site used in scrapy's own tutorial. scrapy can be installed as follows:
$ pipenv install          # create a virtual environment and install the dependencies in Pipfile
$ pipenv shell            # enter the virtual environment
$ pipenv install scrapy   # install scrapy
scrapy shell is an interactive shell for extracting data from fetched web pages. After installing scrapy, open it with: $ scrapy shell http://quotes.toscrape.com/. You should see output similar to the following:
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7f7ca6803978>
[s] item {}
[s] request <GET http://quotes.toscrape.com/>
[s] response <200 http://quotes.toscrape.com/>
[s] settings <scrapy.settings.Settings object at 0x7f7ca6812b00>
[s] spider <DefaultSpider 'default' at 0x7f7ca5002c18>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]: response.headers
Out[1]:
{b'Content-Type': b'text/html; charset=utf-8',
b'Date': b'Fri, 02 Feb 2018 13:39:41 GMT',
b'Server': b'nginx/1.12.1',
b'X-Upstream': b'spidyquotes-master_web'}
In scrapy shell, response is a predefined variable holding the response received from the server. response.headers returns its headers. From here, XPath expressions can be used to extract and parse the content of response.
3. Four Key Concepts in XPath Syntax
Root and non-root nodes
/div selects a div node only if it is at the root of the document
//div selects all div nodes anywhere in the document, root or not
Selecting nodes by attribute
//@href selects every href attribute in the document
//a[@href='http://google.com'] selects all links on the page that point to Google
Selecting nodes by position
//a[3] selects the third link in the document
//table[last()] selects the last table in the document
//a[position()<3] selects the first two links in the document (position()<3 matches positions 1 and 2)
Strictly speaking, positional predicates count within each parent node; wrap the path in parentheses, e.g. (//a)[3], to index across the whole document.
The asterisk (*) matches any node or attribute and can be used in several ways:
//table/tr/* selects all child nodes of every tr tag inside a table
//div[@*] selects all div tags that carry at least one attribute
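These four patterns can be tried out locally with lxml; the HTML snippet below is made up purely to exercise them (note the parenthesized (//a)[3], which indexes over the whole document rather than within each parent):

```python
from lxml import etree

# A small, made-up HTML snippet used only to exercise the four patterns.
doc = etree.HTML("""
<html><body>
  <div id="nav"><a href="http://google.com">Google</a></div>
  <div>
    <a href="http://example.com/1">One</a>
    <a href="http://example.com/2">Two</a>
    <a href="http://google.com">Google again</a>
  </div>
</body></html>
""")

# Attribute selection: every link pointing at Google.
print(len(doc.xpath("//a[@href='http://google.com']")))   # 2

# Position selection: the third link in the whole document.
print(doc.xpath("(//a)[3]/text()"))                       # ['Two']

# Wildcard over children: all element children of every div.
print(len(doc.xpath("//div/*")))                          # 4

# Wildcard over attributes: divs that carry any attribute at all.
print(len(doc.xpath("//div[@*]")))                        # 1
```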
4. Using XPath in Scrapy
Back in the scrapy shell opened earlier, the following command selects the page's title node:
In [17]: response.xpath('//title')  # select the title node
Out[17]: [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
In [18]: response.xpath('//title/text()').extract_first()  # select the title node's text and extract it
Out[18]: 'Quotes to Scrape'
Here text() locates the text content of the tag.
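scrapy shell is not required to see what text() does; the same expressions run under lxml. The HTML string below is a minimal stand-in for the real page:

```python
from lxml import etree

# A minimal stand-in for the real page; only the <title> matters here.
doc = etree.HTML("<html><head><title>Quotes to Scrape</title></head></html>")

# Without text(), XPath returns the element node itself...
nodes = doc.xpath('//title')
print(nodes[0].tag)        # title

# ...with text(), it returns the text content inside the node.
texts = doc.xpath('//title/text()')
print(texts[0])            # Quotes to Scrape
```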
The following command extracts the href attribute value of every a tag whose content is "(about)" (the author links on the page). Note that @href locates the value of the href attribute.
In [6]: response.xpath("/html/body/div/div[2]/div[1]//span//a/@href").extract()
Out[6]:
['/author/Albert-Einstein',
'/author/J-K-Rowling',
'/author/Albert-Einstein',
'/author/Jane-Austen',
'/author/Marilyn-Monroe',
'/author/Albert-Einstein',
'/author/Andre-Gide',
'/author/Thomas-A-Edison',
'/author/Eleanor-Roosevelt',
'/author/Steve-Martin']
The following command extracts every quote on the page:
In [74]: response.xpath("/html/body/div/div[2]/div[1]/div//span[@class='text']/text()").extract()
Out[74]:
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
'“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
'“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
'“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
"“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
'“Try not to become a man of success. Rather become a man of value.”',
'“It is better to be hated for what you are than to be loved for what you are not.”',
"“I have not failed. I've just found 10,000 ways that won't work.”",
"“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
'“A day without sunshine is like, you know, night.”']
Note that //span[@class='text'] means: find all span tags whose class attribute equals 'text'; the leading // means the span may appear anywhere in the document, not only at the root.
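The long absolute path used above is brittle; since class attributes already identify the quotes, a shorter relative expression gives the same result. A sketch against a trimmed, made-up fragment mimicking the site's markup (the real page has ten quotes):

```python
from lxml import etree

# A trimmed, made-up fragment mimicking quotes.toscrape.com's markup.
page = etree.HTML("""
<div class="quote">
  <span class="text">“The world as we have created it...”</span>
  <small class="author">Albert Einstein</small>
</div>
<div class="quote">
  <span class="text">“It is our choices, Harry...”</span>
  <small class="author">J.K. Rowling</small>
</div>
""")

# Anchor on the quote container by class; the rest of the path is relative.
quotes = page.xpath("//div[@class='quote']/span[@class='text']/text()")
print(len(quotes))   # 2
```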
5. Exercise
We now know how to practice XPath in scrapy shell. But what about running XPath against a local file? lxml can help:
from lxml import etree
html="""
<!DOCTYPE html>
<html>
<head lang="en">
<title>测试</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<div id="content">
<ul id="ul">
<li>NO.1</li>
<li>NO.2</li>
<li>NO.3</li>
</ul>
<ul id="ul2">
<li>one</li>
<li>two</li>
</ul>
</div>
<div id="url">
<a href="http://www.58.com" title="58">58</a>
<a href="http://www.csdn.net" title="CSDN">CSDN</a>
</div>
</body>
</html>
"""
selector = etree.HTML(html)
content = selector.xpath('//div[@id="content"]/ul[@id="ul"]/li/text()')  # the id attributes pin down which div and ul are matched; text() grabs the text content
for i in content:
    print(i)
The example above extracts the three elements NO.1 through NO.3. Can you try extracting all of the hyperlinks in the HTML above?
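If you get stuck on the exercise, one possible answer is sketched below; only the link-bearing div from the document above is reproduced, and the @href step mirrors the text() step used earlier:

```python
from lxml import etree

# Only the link-bearing div from the exercise document is reproduced here.
html = """
<div id="url">
    <a href="http://www.58.com" title="58">58</a>
    <a href="http://www.csdn.net" title="CSDN">CSDN</a>
</div>
"""
sel = etree.HTML(html)

# @href lands on the attribute's value, just as text() lands on a tag's text.
links = sel.xpath('//div[@id="url"]/a/@href')
for link in links:
    print(link)
```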
6. Debugging XPath and CSS Selectors with Chrome Developer Tools
Chrome's built-in developer tools can be used to debug XPath or CSS selectors. There are two ways to do this:
- the search feature of the Elements panel
- the console commands $x("some_xpath") and $$("css-selectors") in the Console panel
For the Elements panel: press Ctrl+F, then type an XPath expression into the search box, e.g. //div[@class="info"] or //span[contains(@class, "a b")], or a CSS selector, e.g. #info. The number of matches is shown on the right, and you can jump between the results.
For the Console panel, enter one of the following commands:
// xpath
$x('//div[@class="info"]')
$x('//span[contains(@class, "a b")]')
// css selector
$$('#info')
Clicking an entry in the returned results jumps to its position in the Elements panel.
Reference: https://yizeng.me/2014/03/23/evaluate-and-validate-xpath-css-selectors-in-chrome-developer-tools/
7. More XPath Resources:
- Use XPath with Scrapy Selectors: https://docs.scrapy.org/en/latest/topics/selectors.html#topics-selectors
- Learn XPath through examples: http://zvon.org/comp/r/tut-XPath_1.html
- How to think in XPath: http://plasmasturm.org/log/xpath101/
- W3School tutorial: http://www.w3school.com.cn/xpath/
- Microsoft XPath syntax reference: https://msdn.microsoft.com/en-us/library/ms256471
- A practice site for scraping: http://books.toscrape.com/