Common operations of the Python BeautifulSoup library
Author: Focuson
Beautiful Soup is a Python library for extracting data from HTML and XML files. Working with the parser of your choice, it provides idiomatic ways to navigate, search, and modify the parse tree. This article gives a brief introduction to the most common BeautifulSoup operations.
The BeautifulSoup library
0. Shared setup for all the demos
from bs4 import BeautifulSoup

# The first few demos all take this html_doc as their argument, so it is
# defined once here (the later demos define their own HTML fragments).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
1. Basic usage
def demo01(html_doc):
    """Basic usage: demo01"""
    # Parse html_doc with the lxml parser; missing tags are repaired so the
    # resulting tree is well-formed
    soup = BeautifulSoup(html_doc, "lxml")
    # Pretty-print the repaired document to inspect the result
    print(soup.prettify())
    # Result: The Dormouse's story
    print(soup.title.string)
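As a side note, if lxml is not installed, Beautiful Soup's built-in `html.parser` also works (the exact way missing tags are repaired can differ slightly between parsers). A minimal sketch:

```python
from bs4 import BeautifulSoup

# "html.parser" ships with the standard library, so no extra install is
# needed; the missing </body></html> tags below are repaired automatically.
doc = "<html><head><title>The Dormouse's story</title></head><body><p>Hi</p>"
soup = BeautifulSoup(doc, "html.parser")
title_text = soup.title.string
print(title_text)  # The Dormouse's story
```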
2. Node selectors
def demo02(html_doc):
    """Node selectors: demo02"""
    soup = BeautifulSoup(html_doc, "lxml")
    # Select the title tag from html_doc
    # Result: <title>The Dormouse's story</title>
    print(soup.title)
    # Check the corresponding type
    # Result: <class 'bs4.element.Tag'>
    print(type(soup.title))
    # Result: The Dormouse's story
    print(soup.title.string)
    # Result: <head><title>The Dormouse's story</title></head>
    print(soup.head)
    # Result: <p class="title"><b>The Dormouse's story</b></p>
    print(soup.p)
    # Result: <class 'bs4.element.Tag'>
    print(type(soup.p))
    # Result: <a class="sister" href="http://example.com/elsie" rel="external nofollow" id="link1">Elsie</a>
    # (when several tags match, the first one is returned by default)
    print(soup.a)
3. Extracting node information
def demo03(html_doc):
    """Extracting node information: demo03"""
    soup = BeautifulSoup(html_doc, "lxml")
    # <a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1">Elsie</a>
    tag = soup.a
    # 1. Get the tag name
    # Result: a
    print(tag.name)
    # 2. Get attribute values
    # Result:
    #   class value: ['sister']
    #   href value: http://example.com/elsie
    print("class value:", tag.attrs["class"])
    print("href value:", tag.attrs["href"])
    # 3. Get the text content
    # Result: Elsie
    print(tag.string)
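One caveat worth knowing about `.string`: it returns None whenever the tag contains more than one child node, whereas `get_text()` concatenates the text of all descendants. A small sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Bold</b> and plain</p>", "html.parser")
p = soup.p

# .string is None here because <p> has two children: <b> and a text node
only_string = p.string
# get_text() joins the text of every descendant
all_text = p.get_text()
print(only_string, all_text)  # None Bold and plain
```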
4. Getting child node information
def demo04(html_doc):
    """Getting child node information: demo04"""
    soup = BeautifulSoup(html_doc, "lxml")
    # 1. First get the head tag
    # Result: <head><title>The Dormouse's story</title></head>
    print(soup.head)
    # 2. Then get the title tag inside head
    # Result: <title>The Dormouse's story</title>
    print(soup.head.title)
    # 3. Get the text inside head > title
    # Result: The Dormouse's story
    print(soup.head.title.string)
5. Selecting related nodes
1. Getting child nodes: contents
def demo05():
    """Selecting related nodes: demo05 (descending) -- the contents attribute.

    Sometimes you cannot reach the node you want in a single step. Instead,
    you select some node first and then use it as a base from which to select
    its children, parent, siblings, and so on.
    """
    # Note: the first p tag is written without a line break before </p>
    html_doc01 = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="Dormouse"><b>The Dormouse's story</b></p>
    <p class="story">...</p>
    """
    # Identical to html_doc01 except that </p> is on its own line
    html_doc02 = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="Dormouse"><b>The Dormouse's story</b>
    </p>
    <p class="story">...</p>
    """
    # Get the child nodes with the contents attribute
    soup01 = BeautifulSoup(html_doc01, "lxml")
    # Result: [<b>The Dormouse's story</b>]
    print(soup01.p.contents)
    soup02 = BeautifulSoup(html_doc02, "lxml")
    # Note the extra newline character in this result
    # Result: [<b>The Dormouse's story</b>, '\n']
    print(soup02.p.contents)
2. Getting child nodes: children
def demo06():
    """Selecting related nodes: demo06 (descending) -- the children attribute."""
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" rel="external nofollow" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # children is an iterator, not a list
    # Result: <list_iterator object at 0x000002B35915BFA0>
    print(soup.p.children)
    # Result: [
    #   '\n Once upon a time there were three little sisters; and their names were\n ',
    #   <a class="sister" href="http://example.com/elsie" rel="external nofollow" id="link1">Elsie</a>,
    #   ',\n ',
    #   <a class="sister" href="http://example.com/lacie" rel="external nofollow" id="link2">Lacie</a>,
    #   ' and\n ',
    #   <a class="sister" href="http://example.com/tillie" rel="external nofollow" id="link3">Tillie</a>,
    #   ';\n and they lived at the bottom of a well.\n '
    # ]
    print(list(soup.p.children))
    for item in soup.p.children:
        print(item)
3. Getting descendant nodes: descendants
def demo07():
    """Selecting related nodes: demo07 (descending) -- the descendants attribute
    (children, grandchildren, and deeper)."""
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1"><span>Elsie</span>Elsie</a>,
    <a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" rel="external nofollow" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # descendants is a generator
    # Result: <generator object Tag.descendants at 0x000001C0E79DCC10>
    print(soup.p.descendants)
    # Result: [
    #   'Once upon a time there were three little sisters; and their names were\n ',
    #   <a class="sister" href="http://example.com/elsie" rel="external nofollow" id="link1"><span>Elsie</span>Elsie</a>,
    #   <span>Elsie</span>,
    #   'Elsie',
    #   'Elsie',
    #   ',\n ',
    #   <a class="sister" href="http://example.com/lacie" rel="external nofollow" id="link2">Lacie</a>,
    #   'Lacie',
    #   ' and\n ',
    #   <a class="sister" href="http://example.com/tillie" rel="external nofollow" id="link3">Tillie</a>,
    #   'Tillie',
    #   ';\n and they lived at the bottom of a well.'
    # ]
    print(list(soup.p.descendants))
    # for item in soup.p.descendants:
    #     print(item)
4. Getting the parent node (parent) and ancestor nodes (parents)
def demo08():
    """Selecting related nodes: demo08 (ascending).

    parent  -- get the parent node
    parents -- get all ancestor nodes
    """
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1">Elsie</a>
        <p>
            <a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a>
        </p>
    </p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # Prints everything inside the <body> tag, including the child p tags
    # and the grandchild a tags
    print(soup.p.parent)
    # Prints the parent of the first a tag (its enclosing p tag), including
    # this a tag's own text
    print(soup.a.parent)
    print("=======================")
    # parents is a generator
    # Result: <generator object PageElement.parents at 0x000001403E6ECC10>
    print(soup.a.parents)
    for i, parent in enumerate(soup.a.parents):
        print(i, parent)
5. Getting sibling nodes
def demo09():
    """Selecting related nodes: demo09 (siblings).

    Available attributes:
    1. next_sibling
    2. previous_sibling
    3. next_siblings
    4. previous_siblings
    """
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1">Elsie</a>hello
    <a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" rel="external nofollow" class="sister" id="link3">Tillie</a>
    <a href="http://example.com/a" rel="external nofollow" class="sister" id="link3">a</a>
    <a href="http://example.com/b" rel="external nofollow" class="sister" id="link3">b</a>
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # 1. next_sibling
    # Result: hello
    print(soup.a.next_sibling)
    # 2. next_siblings (a generator)
    # Result: <generator object PageElement.next_siblings at 0x00000241CA26CC10>
    print(soup.a.next_siblings)
    # print(list(soup.a.next_siblings))
    # 3. previous_sibling
    # Result: Once upon a time there were three little sisters; and their names were
    print(soup.a.previous_sibling)
    # 4. previous_siblings (a generator)
    # Result: <generator object PageElement.previous_siblings at 0x000001F4E6E6CBA0>
    print(soup.a.previous_siblings)
    # print(list(soup.a.previous_siblings))
6. Method selectors
1. find_all()
def demo10():
    """Method selector: find_all() -- returns all matches as a list.

    find_all(name, attrs={}, recursive=True, string, limit)
      1. name:      tag name to search for
      2. attrs:     dictionary of attribute filters
      3. recursive: search descendants recursively (default True)
      4. string:    search by text content
      5. limit:     maximum number of results
    """
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="Dormouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" rel="external nofollow" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # 1. [Basic usage] Find all a tags
    # Result: [
    #   <a class="sister" href="http://example.com/elsie" rel="external nofollow" id="link1">Elsie</a>,
    #   <a class="sister" href="http://example.com/lacie" rel="external nofollow" id="link2">Lacie</a>,
    #   <a class="sister" href="http://example.com/tillie" rel="external nofollow" id="link3">Tillie</a>
    # ]
    print(soup.find_all("a"))
    # for item in soup.find_all("a"):
    #     print(item.string)
    # 2. [Attribute search] Find elements matching an attribute dictionary;
    # here, elements whose class is "sister"
    print(soup.find_all(attrs={"class": "sister"}))
    # Same effect as above ("class" is a Python keyword, hence "class_")
    print(soup.find_all(class_="sister"))
    # Result: [] -- no element in html_doc has the class "hi"
    print(soup.find_all(class_="hi"))
    # 3. [Text search] Find text nodes equal to "Elsie"
    print(soup.find_all(string="Elsie"))
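The `limit` and `recursive` parameters are listed in the docstring but not exercised above. A small sketch on a made-up fragment showing both:

```python
from bs4 import BeautifulSoup

html = '<div><a id="a1">one</a><p><a id="a2">two</a></p><a id="a3">three</a></div>'
soup = BeautifulSoup(html, "html.parser")

# limit=2 stops the search after the first two matches (in document order)
first_two = [t["id"] for t in soup.find_all("a", limit=2)]

# recursive=False looks only at the direct children of <div>,
# so the <a> nested inside <p> is skipped
direct_only = [t["id"] for t in soup.div.find_all("a", recursive=False)]

print(first_two)    # ['a1', 'a2']
print(direct_only)  # ['a1', 'a3']
```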
2. find()
def demo11():
    """Method selector: find() -- returns a single element (the first match)."""
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="Dormouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1"><span>Elsie</span></a>,
    <a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2"><span>Lacie</span></a> and
    <a href="http://example.com/tillie" rel="external nofollow" class="sister" id="link3"><span>Tillie</span></a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # Result: <a class="sister" href="http://example.com/elsie" rel="external nofollow" id="link1"><span>Elsie</span></a>
    print(soup.find("a"))
3. Other method selectors
"""
Other method selectors:
find_parents():           returns all ancestor nodes
find_parent():            returns the parent of the current node
find_next_siblings():     returns all siblings after the current node
find_next_sibling():      returns the first sibling after the current node
find_previous_siblings(): returns all siblings before the current node
find_previous_sibling():  returns the first sibling before the current node
"""
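The selectors above can be sketched on a small hypothetical fragment:

```python
from bs4 import BeautifulSoup

html = '<div><p id="p1">one</p><p id="p2">two</p><p id="p3">three</p></div>'
soup = BeautifulSoup(html, "html.parser")
second = soup.find("p", id="p2")

parent_name = second.find_parent().name                        # div
next_id = second.find_next_sibling()["id"]                     # p3
prev_id = second.find_previous_sibling()["id"]                 # p1
all_next = [t["id"] for t in second.find_next_siblings()]      # ['p3']
all_prev = [t["id"] for t in second.find_previous_siblings()]  # ['p1']
```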
7. CSS selectors: select()
def demo12():
    """CSS selectors: the select() method."""
    html_doc = """
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello World</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
        </div>
    </div>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # 1. Get the node whose class is panel-heading
    # Result: [<div class="panel-heading">
    # <h4>Hello World</h4>
    # </div>]
    print(soup.select(".panel-heading"))
    # 2. Get the li nodes under ul nodes
    # Result: [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    print(soup.select("ul li"))
    # 3. Get the li nodes under the node with id list-2
    # Result: [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    print(soup.select("#list-2 li"))
    # 4. Get all ul nodes
    # Result: [<ul class="list" id="list-1">
    # <li class="element">Foo</li>
    # <li class="element">Bar</li>
    # <li class="element">Jay</li>
    # </ul>, <ul class="list list-small" id="list-2">
    # <li class="element">Foo</li>
    # <li class="element">Bar</li>
    # <li class="element">Jay</li>
    # </ul>]
    print(soup.select("ul"))
    # select() returns Tag objects
    # Result: <class 'bs4.element.Tag'>
    print(type(soup.select("ul")[0]))
Notes:

1. Selecting all descendant nodes
   When the CSS passed to select(css) names several nodes separated by spaces, descendant nodes are matched. For example, soup.select("div p") finds all <p> descendants of every <div> node.

2. Selecting direct children only, not grandchildren
   Separating node names with ">" matches direct children only. For example, soup.select("div > p") finds all <p> nodes that are direct children of a <div>, excluding grandchildren.

3. Selecting all same-level nodes after a given node
   Connecting two node names with "~" matches every later sibling at the same level. For example, soup.select("div ~ p") finds all <p> siblings that follow a <div>.

4. Selecting the first same-level node after a given node
   Connecting two node names with "+" matches only the immediately following sibling at the same level. For example, soup.select("div + p") finds the first <p> sibling right after a <div>.

(The combinators are written here with spaces around them for readability; current Beautiful Soup versions, which delegate CSS matching to Soup Sieve, also accept them without spaces.)
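The four combinators above can be checked against a small hypothetical fragment:

```python
from bs4 import BeautifulSoup

html = """
<div>
  <p id="child">direct child</p>
  <section><p id="grandchild">nested deeper</p></section>
</div>
<p id="after1">first sibling after the div</p>
<p id="after2">second sibling after the div</p>
"""
soup = BeautifulSoup(html, "html.parser")

descendants = [t["id"] for t in soup.select("div p")]  # child and grandchild
children = [t["id"] for t in soup.select("div > p")]   # direct child only
siblings = [t["id"] for t in soup.select("div ~ p")]   # every following sibling
adjacent = [t["id"] for t in soup.select("div + p")]   # first following sibling
```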
8. Nested selection: select()
def demo13():
    """Nested selection with the select() method."""
    html_doc = """
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello World</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
        </div>
    </div>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # select() can be called again on each Tag in the result, so selectors
    # can be nested
    # Result: [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    #         [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    for ul in soup.select("ul"):
        print(ul.select("li"))
9. Getting attributes
def demo14():
    """Getting attributes (two ways)."""
    html_doc = """
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello World</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
        </div>
    </div>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    for ul in soup.select("ul"):
        # 1. Index the tag directly
        print(ul["id"])
        # 2. Go through the attrs dictionary
        print(ul.attrs["id"])
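Two details worth knowing when reading attributes: `class` is a multi-valued attribute and comes back as a list, and `get()` avoids a KeyError for attributes that may be absent. A small sketch on a made-up tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul class="list small" id="menu"></ul>', "html.parser")
ul = soup.ul

# class is multi-valued, so indexing returns a list of the class names
classes = ul["class"]
# get() returns None (or a supplied default) instead of raising KeyError
missing = ul.get("data-x")
ident = ul.get("id")
print(classes, missing, ident)  # ['list', 'small'] None menu
```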
This concludes the overview of common operations in the Python BeautifulSoup library.