
A Detailed Guide to Using BeautifulSoup

news 2025/4/28 13:35:27

BeautifulSoup is a Python library for parsing HTML and XML documents. It provides a simple API for extracting and manipulating the data in a document. This guide walks through its main features and how to use them.

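I. Installing BeautifulSoup

First, make sure BeautifulSoup (the beautifulsoup4 package) and lxml (a fast parser) are installed. Both can be installed with the following command: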

bash

pip install beautifulsoup4 lxml
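
If lxml cannot be installed in your environment, Python's built-in html.parser is an alternative that needs no extra package. The sketch below shows one possible way to prefer lxml and fall back when it is missing. The helper name make_soup is hypothetical and is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # BeautifulSoup raises this when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)  # Hello
```

Different parsers can repair broken markup differently, so it is usually safer to use the same parser in development and production.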

II. Parsing HTML Documents

(1) Parsing an HTML string

Pass the HTML string directly to the BeautifulSoup constructor and name the parser to use.

Python

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')

(2) Parsing an HTML file

You can also read the HTML from a file and parse it.

Python

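# Read the file explicitly as UTF-8 so the result does not depend on the platform default
with open('example.html', 'r', encoding='utf-8') as file:
    html_doc = file.read()

soup = BeautifulSoup(html_doc, 'lxml')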
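
BeautifulSoup also accepts an open file object, so the separate read() call is optional. A minimal sketch, assuming a throwaway file created with tempfile purely for illustration and the built-in html.parser so that no extra parser is required:

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Create a small sample file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('<html><head><title>Demo page</title></head><body></body></html>')
    path = tmp.name

# Pass the open file handle straight to BeautifulSoup.
with open(path, 'r', encoding='utf-8') as file:
    file_soup = BeautifulSoup(file, 'html.parser')

os.remove(path)  # clean up the sample file

page_title = file_soup.title.string
print(page_title)  # Demo page
```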

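III. Navigating and Searching the Document Tree

(1) Navigating the tree

BeautifulSoup exposes the parsed document as a tree of objects, so tags, attributes, and content can be reached with plain attribute access. Dotted access such as soup.title returns the first tag with that name.

1. Accessing tags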

Python

print(soup.title)  # <title>The Dormouse's story</title>
print(soup.title.name)  # title
print(soup.title.string)  # The Dormouse's story

2. Accessing attributes

Python

print(soup.a)  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.a['href'])  # http://example.com/elsie

3. Accessing content

Python

print(soup.p)  # <p class="title"><b>The Dormouse's story</b></p>
print(soup.p.text)  # The Dormouse's story
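
Beyond dotted access to the first matching tag, each tag also links to its neighbours in the tree. The sketch below illustrates .parent, .children, and find_next_sibling on a hypothetical two-paragraph snippet (not the document used above), parsed with html.parser:

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p>first</p><p>second</p></div>'
nav_soup = BeautifulSoup(html, 'html.parser')

first = nav_soup.p

# Move up: the tag that encloses the first <p>.
parent_id = first.parent['id']

# Move sideways: the next <p> at the same level.
sibling_text = first.find_next_sibling('p').get_text()

# Move down: the direct children of the <div>.
child_names = [child.name for child in nav_soup.div.children]

print(parent_id)     # box
print(sibling_text)  # second
print(child_names)   # ['p', 'p']
```

In real pages, whitespace between tags shows up as text nodes among the children, which is why the snippet here is written on one line.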

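(2) Searching the tree

BeautifulSoup offers several search methods, including find, find_all, and select. find returns the first match (or None), while find_all returns a list of every match.

1. Using `find` and `find_all`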

Python

# Find all <a> tags
links = soup.find_all('a')
for link in links:
    print(link['href'])  # print each link's href attribute

# Find the first <p> tag
first_paragraph = soup.find('p')
print(first_paragraph.text)  # print the text of the first paragraph
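
Both methods also accept filters on attributes. The following sketch, which assumes a shortened two-link version of the sample document, shows the class_, id, and limit arguments, and the None that find returns when nothing matches:

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '</p>'
)
search_soup = BeautifulSoup(html, 'html.parser')

# class_ (with a trailing underscore) filters on the CSS class.
sisters = search_soup.find_all('a', class_='sister')

# limit stops the search after the given number of matches.
first_only = search_soup.find_all('a', class_='sister', limit=1)

# Keyword arguments such as id filter on that attribute.
link2 = search_soup.find('a', id='link2')

# find returns None when there is no match, so check before using the result.
table = search_soup.find('table')

print(len(sisters), len(first_only))  # 2 1
print(link2.get_text())               # Lacie
print(table)                          # None
```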

2. Using CSS selectors

BeautifulSoup supports CSS selector syntax through the select method, which returns every match. select_one returns only the first match.

Python

# Find all <a> tags with class="sister"
sister_links = soup.select('a.sister')
for link in sister_links:
    print(link['href'])  # print each link's href attribute

# Find the tag with id="link1"
link1 = soup.select_one('#link1')
print(link1.text)  # print the link's text
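
Selectors can also combine tags, classes, and attribute tests. A short sketch, assuming a two-link excerpt of the sample document, that uses a child combinator and an attribute suffix selector:

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
css_soup = BeautifulSoup(html, 'html.parser')

# "p.story > a" matches <a> tags that are direct children of <p class="story">.
child_links = [a.get_text() for a in css_soup.select('p.story > a')]

# [href$="lacie"] matches tags whose href ends with "lacie".
lacie = css_soup.select_one('a[href$="lacie"]')

# select_one returns None when nothing matches.
missing = css_soup.select_one('#link9')

print(child_links)  # ['Elsie', 'Lacie']
print(lacie['id'])  # link2
print(missing)      # None
```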

IV. Extracting Data

(1) Extracting text

Use .text or .get_text() to extract the text content of a tag and everything nested inside it.

Python

print(soup.get_text())  # extract all text in the document
print(soup.p.get_text())  # extract the text of the first <p> tag
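
By default get_text() keeps the whitespace found in the markup. The sketch below, on a hypothetical snippet, shows the separator and strip arguments and the stripped_strings generator, which are often handy for cleaning scraped text:

```python
from bs4 import BeautifulSoup

html = '<div><p> Hello </p><p> World </p></div>'
text_soup = BeautifulSoup(html, 'html.parser')

# Strip each text fragment and join the fragments with a single space.
joined = text_soup.get_text(separator=' ', strip=True)

# stripped_strings yields each non-empty fragment with whitespace removed.
fragments = list(text_soup.stripped_strings)

print(joined)     # Hello World
print(fragments)  # ['Hello', 'World']
```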

(2) Extracting attributes

Tag attributes can be read with dictionary-style access.

Python

print(soup.a['href'])  # read the href attribute; raises KeyError if it is missing
print(soup.a.get('href'))  # read the href attribute; returns None if it is missing
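
The difference between the two access styles matters when an attribute may be absent. A small sketch on a hypothetical tag, which also shows that multi-valued attributes such as class come back as a list:

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup(
    '<a href="/elsie" class="sister featured">Elsie</a>', 'html.parser'
).a

href = tag['href']                               # present, so indexing works
title = tag.get('title')                         # absent, so get returns None
title_or_default = tag.get('title', 'untitled')  # a default can be supplied
classes = tag['class']                           # class is returned as a list
has_id = tag.has_attr('id')                      # explicit presence check

try:
    tag['title']
    raised = False
except KeyError:
    # Indexing a missing attribute raises KeyError, like a dictionary.
    raised = True

print(href, title, title_or_default, classes, has_id, raised)
```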

(3) Extracting tags

Use .find and .find_all to extract specific tags.

Python

# Extract all <a> tags
links = soup.find_all('a')
for link in links:
    print(link['href'])  # print each link's href attribute

# Extract the first <p> tag
first_paragraph = soup.find('p')
print(first_paragraph.text)  # print the text of the first paragraph
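
Extracted tags are often turned into plain Python records for later processing. A sketch under the assumption of a hypothetical snippet in which one link has no href; passing href=True to find_all keeps only tags that carry the attribute:

```python
from bs4 import BeautifulSoup

html = (
    '<p><a href="http://example.com/elsie">Elsie</a> '
    '<a>no target</a> '
    '<a href="http://example.com/lacie"> Lacie </a></p>'
)
link_soup = BeautifulSoup(html, 'html.parser')

# href=True matches only <a> tags that have an href attribute.
records = [
    {'text': a.get_text(strip=True), 'href': a['href']}
    for a in link_soup.find_all('a', href=True)
]

for record in records:
    print(record['text'], record['href'])
```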

V. Modifying the Document Tree

(1) Adding new tags

Create a tag with new_tag, then attach it to the tree with .append or .insert.

Python

new_tag = soup.new_tag('a', href='http://example.com/new')
soup.p.append(new_tag)
print(soup.p)
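
A new tag is empty until it is given text. The sketch below, on a hypothetical one-paragraph document, sets .string on the new tag, appends it at the end, and uses .insert to place another tag at a specific position:

```python
from bs4 import BeautifulSoup

edit_soup = BeautifulSoup('<p><b>one</b></p>', 'html.parser')

# Create a link, give it visible text, and append it as the last child.
new_link = edit_soup.new_tag('a', href='http://example.com/new')
new_link.string = 'New link'
edit_soup.p.append(new_link)

# insert takes a position; 0 places the tag before all existing children.
leading = edit_soup.new_tag('i')
leading.string = 'zero'
edit_soup.p.insert(0, leading)

result = str(edit_soup.p)
print(result)
# <p><i>zero</i><b>one</b><a href="http://example.com/new">New link</a></p>
```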

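(2) Removing tags

Use .decompose or .extract to remove a tag. decompose destroys the tag and its contents, while extract removes the tag from the tree and returns it so it can still be used.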

Python

# Remove the first <a> tag
soup.a.decompose()
print(soup.p)
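
The two removal methods differ in what happens to the removed tag. A minimal sketch on a hypothetical two-link paragraph:

```python
from bs4 import BeautifulSoup

remove_soup = BeautifulSoup(
    '<p><a id="a1">one</a><a id="a2">two</a></p>', 'html.parser'
)

# extract detaches the tag from the tree and returns it for further use.
removed = remove_soup.find('a', id='a1').extract()
removed_text = removed.get_text()

# decompose detaches the tag and destroys it; nothing useful is returned.
remove_soup.find('a', id='a2').decompose()

remaining = str(remove_soup.p)
print(removed_text)  # one
print(remaining)     # <p></p>
```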

(3) Modifying tags

Assign to an attribute directly to change its value.

Python

# Change the href attribute of the first <a> tag
soup.a['href'] = 'http://example.com/updated'
print(soup.a)
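
Text and other attributes can be changed in the same direct style. The sketch below, on a hypothetical single link, updates an attribute, replaces the text through .string, and deletes an attribute with del:

```python
from bs4 import BeautifulSoup

modify_soup = BeautifulSoup(
    '<a href="http://example.com/old" id="link1">Old</a>', 'html.parser'
)
link = modify_soup.a

link['href'] = 'http://example.com/updated'  # overwrite an attribute
link.string = 'Updated'                      # replace the tag's text
del link['id']                               # remove an attribute entirely

modified = str(link)
print(modified)  # <a href="http://example.com/updated">Updated</a>
```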

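VI. Complete Example: Parsing a 1688 Product Detail Page

The following complete example shows how BeautifulSoup can be combined with requests to fetch and parse a 1688 product detail page.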

Python

import requests
from bs4 import BeautifulSoup


def get_html(url):
    """Download the page and return its HTML text."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    # A timeout keeps the call from hanging if the server never responds
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail early on HTTP error responses
    return response.text


def parse_html(html):
    """Extract the product fields from the page HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    product_info = {}

    # The tag and class names below come from this example; adjust them to the real page markup
    product_name = soup.find('h1', class_='product-title').text.strip()
    product_info['product_name'] = product_name

    product_price = soup.find('span', class_='price').text.strip()
    product_info['product_price'] = product_price

    product_description = soup.find('div', class_='product-description').text.strip()
    product_info['product_description'] = product_description

    product_image = soup.find('img', class_='main-image')['src']
    product_info['product_image'] = product_image

    return product_info


def main():
    url = "https://detail.1688.com/offer/123456789.html"
    html = get_html(url)
    if html:
        product_info = parse_html(html)
        print("Product name:", product_info['product_name'])
        print("Product price:", product_info['product_price'])
        print("Product description:", product_info['product_description'])
        print("Product image:", product_info['product_image'])


if __name__ == "__main__":
    main()
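
Because find returns None when a tag is absent, chaining .text onto it raises AttributeError whenever the page layout differs from what the code expects. The sketch below is one possible defensive variant that runs offline. The sample HTML, the helper text_or_none, and the function parse_product are hypothetical, and the class names simply mirror the ones used in the example above rather than the real 1688 markup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; note that the description block is missing.
SAMPLE_HTML = """
<h1 class="product-title"> Sample product </h1>
<span class="price">12.50</span>
<img class="main-image" src="http://example.com/item.jpg">
"""


def text_or_none(soup, name, class_name):
    """Return the stripped text of the first matching tag, or None."""
    tag = soup.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag is not None else None


def parse_product(html):
    """Collect product fields, leaving None for anything not found."""
    soup = BeautifulSoup(html, 'html.parser')
    image = soup.find('img', class_='main-image')
    return {
        'product_name': text_or_none(soup, 'h1', 'product-title'),
        'product_price': text_or_none(soup, 'span', 'price'),
        'product_description': text_or_none(soup, 'div', 'product-description'),
        'product_image': image.get('src') if image is not None else None,
    }


info = parse_product(SAMPLE_HTML)
for key, value in info.items():
    print(key, '=', value)
```

Returning None for missing fields lets the caller decide whether to skip the record, retry, or log the page for inspection.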