Python: Using the jieba Library

Contents

- Overview of jieba
- The three segmentation modes
- Installing jieba
- How jieba segments text
- Common jieba functions
- Example: word frequency statistics

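Overview of jieba

jieba is a popular third-party Python library for Chinese word segmentation.

Chinese text has to be segmented before individual words can be extracted from it.
jieba is not part of the standard library, so it must be installed separately.
It provides three segmentation modes; for the simplest use, a single function is enough.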

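The three segmentation modes

jieba supports precise mode, full mode, and search engine mode.

Precise mode: splits the text precisely into words, with no redundant words.
Full mode: scans out every possible word in the text, so the result contains redundancy.
Search engine mode: starts from the precise-mode result and splits long words again.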

Installing jieba

From the command line (cmd): pip install jieba

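How jieba segments text

jieba uses a Chinese word dictionary to estimate how likely adjacent characters are to belong together.
Character sequences with a high probability are grouped into words, and these words form the segmentation result.
Besides segmentation, users can also add their own custom words.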
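The idea above can be sketched with a toy example. The code below is a simplified illustration, not jieba's actual implementation: the dictionary `TOY_FREQ` and its frequencies are made up, and the helper names `build_dag` and `segment` are hypothetical. It lists every dictionary word that starts at each position, then uses dynamic programming to pick the split with the highest overall probability. The real library additionally handles words missing from its dictionary with a statistical model, which this sketch leaves out.

```python
import math

# Hypothetical toy dictionary: word -> frequency (invented numbers, not jieba's data).
TOY_FREQ = {
    "努力": 50, "学习": 60, "力学": 5, "努力学习": 20,
    "在": 100, "努": 1, "力": 10, "学": 10, "习": 2,
}
TOTAL = sum(TOY_FREQ.values())


def build_dag(sentence, freq):
    """For each start index, list every end index whose slice is a dictionary word."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in freq]
        dag[i] = ends or [i]  # unknown character: fall back to a one-character word
    return dag


def segment(sentence, freq=TOY_FREQ, total=TOTAL):
    """Return the split of `sentence` with the highest total log-probability."""
    dag = build_dag(sentence, freq)
    n = len(sentence)
    # best[i] = (best log-probability of sentence[i:], end index of its first word)
    best = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (math.log(freq.get(sentence[i:j + 1], 1)) - math.log(total) + best[j + 1][0], j)
            for j in dag[i]
        )
    # Walk forward, following the best first word at each position.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


print(segment("在努力学习"))  # ['在', '努力学习']
print(segment("学习力学"))    # ['学习', '力学']
```

With these invented frequencies, the single long word "努力学习" is more probable than "努力" followed by "学习", so it is kept whole. Changing the numbers changes the result, which is why the quality of the dictionary matters.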

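Common jieba functions

Function	Description
jieba.cut(s)	Precise mode; returns an iterable (a generator)
jieba.cut(s, cut_all=True)	Full mode; yields every possible word in the text s
jieba.cut_for_search(s)	Search engine mode; results suited to building a search index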

'''
@Author: yjy
@Time: 2024/11/12
'''
import jieba

s = 'yjy在努力学习Python'
print(jieba.cut(s)) # <generator object Tokenizer.cut at 0x0000021EFBCA4040>
print(jieba.cut(s,cut_all=True)) # <generator object Tokenizer.cut at 0x000001A6DE434040>
print(jieba.cut_for_search(s)) # <generator object Tokenizer.cut_for_search at 0x000002BDA3E73890>
print(list(jieba.cut(s))) # ['yjy', '在', '努力学习', 'Python']
print(list(jieba.cut(s,cut_all=True))) # ['yjy', '在', '努力', '努力学习', '力学', '学习', 'Python']
print(list(jieba.cut_for_search(s))) # ['yjy', '在', '努力', '力学', '学习', '努力学习', 'Python']


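Example: word frequency statistics

Problem analysis: text word frequency statistics

Requirement: given an article, which words appear in it, and which appear most often?
How do we go about it?

We start with the English text at
https://python123.io/resources/pye/hamlet.txt
"Hamlet" English word frequency count: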

# Denoise and normalize the text
def getText():
    with open("hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()
    # Replace punctuation with spaces so that split() separates words cleanly
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")
    return txt

# Use a dictionary to record word frequencies
if __name__ == '__main__':
    hamletTxt = getText()
    words = hamletTxt.split()
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1  # count occurrences of each word
    items = list(counts.items())
    # print(items)  # ('the', 1138), ('tragedy', 3), ('of', 669), ('hamlet', 462), ...
    items.sort(key=lambda x: x[1], reverse=True)
    for i in range(10):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))

"""
Basic format-string syntax
A replacement field is written as {}, and a format specifier follows the colon (:).
The specifier can set alignment, fill character, width, precision, and more.

Alignment:
    <   left-align
    >   right-align
    ^   center
Width:
    The minimum field width. If the content is shorter than the width,
    it is padded with spaces or with the fill character you specify.

Examples
    print("{0:<10}".format("hello"))   # 'hello     '  left-aligned, padded on the right
    print("{0:>10}".format("hello"))   # '     hello'  right-aligned, padded on the left
    print("{0:^10}".format("hello"))   # '  hello   '  centered: 2 spaces left, 3 right

Fill character
    A fill character can be given before the alignment flag; the default is a space.
    print("{0:*<10}".format("hello"))  # 'hello*****'
    print("{0:*>10}".format("hello"))  # '*****hello'
    print("{0:*^10}".format("hello"))  # '**hello***'
"""
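The same count can be written more compactly with the standard library. This is an alternative sketch rather than the approach used above: it relies on `collections.Counter` and `string.punctuation`, and the function name `top_words` and the sample sentence are made up for illustration. Note that `string.punctuation` also contains the apostrophe, so a word like "don't" would be split in two.

```python
import string
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent lowercase words in text as (word, count) pairs."""
    # Map every ASCII punctuation character to a space, then split on whitespace.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return Counter(words).most_common(n)


# Hypothetical sample text; in the Hamlet example this would be the file contents.
sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print(f"{word:<10}{count:>5}")  # same alignment as "{0:<10}{1:>5}".format(...)
```

`Counter.most_common(n)` replaces the manual `items.sort(...)` and the indexed loop, and it does not fail when the text has fewer than `n` distinct words.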


Chinese text: finding the most frequently mentioned characters in "Romance of the Three Kingdoms" (《三国演义》): https://python123.io/resources/pye/threekingdoms.txt

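'''
Segment Chinese text and use a dictionary to record word frequencies
'''
import jieba

# Local copy of the text; adjust the filename to match your own file
with open("实验文本.txt", "r", encoding="utf-8") as f:
    text = f.read()
words = jieba.lcut(text)  # precise mode, returns a list
counts = {}
for word in words:
    if len(word) == 1:  # skip single characters and punctuation
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:15]:
    print("{0:<10}{1:>5}".format(word, count))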

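The results show that the same person is counted separately under different names and titles, and that some frequent words are not names at all. The code therefore needs to merge the aliases and filter out the non-name words: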

import jieba

# Local copy of the text; adjust the filename to match your own file
with open("实验文本.txt", "r", encoding="utf-8") as f:
    txt = f.read()
# High-frequency words that are not character names
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}
words = jieba.lcut(txt)  # precise mode
counts = {}
for word in words:
    if len(word) == 1:
        continue
    # Merge different names and titles of the same person
    elif word == '诸葛亮' or word == '孔明曰':
        rword = '孔明'
    elif word == '关公' or word == '云长':
        rword = "关羽"
    elif word == '玄德' or word == '玄德曰':
        rword = '刘备'
    elif word == '孟德' or word == '丞相':
        rword = '曹操'
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)  # pop avoids a KeyError if the word never appeared
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:10]:
    print(word, count)
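As the alias list grows, the chain of `elif` branches becomes hard to maintain. One possible refactoring, sketched here as an assumption rather than as part of the original program, keeps the aliases in a dictionary and does the counting in a function. The names `ALIASES`, `EXCLUDES`, and `count_characters` are hypothetical, and the token list is hand-written so the example runs without the novel; in practice it would come from `jieba.lcut(txt)`.

```python
from collections import Counter

# Alias -> canonical name, taken from the elif chain above.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
# Frequent words that are not character names.
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}


def count_characters(words, aliases=ALIASES, excludes=EXCLUDES):
    """Count tokens after merging aliases, skipping single characters and excluded words."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue
        name = aliases.get(word, word)  # fall back to the token itself
        if name not in excludes:
            counts[name] += 1
    return counts


# Hand-written sample tokens standing in for jieba.lcut(txt).
tokens = ["孔明", "诸葛亮", "孔明曰", "将军", "曹操", "丞相", "曰"]
print(count_characters(tokens).most_common(2))  # [('孔明', 3), ('曹操', 2)]
```

Adding a new alias or excluded word then only requires editing the data at the top, not the counting logic.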