After getting the link to the sports meet results website, the page looks as shown below:
Right-click and choose "Inspect": there is no result data in the page at first; the data only loads once we click open the sections such as "第一单元" (Unit 1), "第二单元" (Unit 2), and so on.
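If you want to confirm this programmatically rather than by eye, a quick sketch like the following should find no result tables in the raw server response. This is my own addition, not part of the original workflow: the URL is a placeholder, requests is an extra dependency, and the XPath simply mirrors the one used by the parsing code later.

# Optional sanity check: fetch the page without a browser and confirm that the
# result tables are not present in the server-rendered HTML.
# The URL below is a placeholder -- substitute the real results page.
import requests
from lxml import etree

resp = requests.get("https://example.com/sports_meeting_results")  # placeholder URL
tree = etree.HTML(resp.text)
tables = tree.xpath('//*[@id="div_Result"]/ul/li/div[2]/div/div/table/tbody')
print(len(tables))  # prints 0 if the tables are only injected after a unit is clicked open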
Since we only need to analyze this single page and pull the data out of it, the simplest route is to click open every "unit" and every event's result table by hand, so that the site loads all of the data, as shown below:
Next, stay right in the browser: right-click and choose "Inspect" again:
Click the topmost <html> node, right-click, hover over "Copy", and choose "Copy element":
Paste the copied HTML into a text file; the data can then be parsed easily with XPath expressions. A minimal sketch of the lxml pattern comes first, followed by the full parsing code:
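The sketch below only shows the mechanics of etree.HTML plus xpath() on a made-up snippet; the real page structure is different, so treat the tags and the id here as illustrative only.

# Minimal lxml/XPath pattern used by the script that follows, on a made-up snippet.
from lxml import etree

snippet = """
<div id="div_Result">
  <ul>
    <li><table><tbody><tr><td>男子100米</td><td>11.58</td></tr></tbody></table></li>
  </ul>
</div>
"""

tree = etree.HTML(snippet)                        # parse the (possibly partial) HTML
rows = tree.xpath('//*[@id="div_Result"]//tr')    # select rows with an XPath expression
for row in rows:
    print([td.text for td in row.xpath('./td')])  # -> ['男子100米', '11.58']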
# Columns: 比赛时间, 项目, 姓名, 学院, 组/道, 名次, 成绩, 得分, 备注
from lxml import etree

EVENT_YEAR = "2024.11.15-16"


class ReadHtmlTable(object):
    def __init__(self, html_content: str):
        self.html_content = html_content
        self.table_col = ("比赛时间", "项目", "姓名", "学院", "组/道", "名次", "成绩", "得分", "备注")

    def get_different_units(self) -> list[etree._Element]:
        # each <li> under div_Result is one "unit" of the meet
        tree = etree.HTML(self.html_content)
        units = tree.xpath(r'//*[@id="div_Result"]/ul/li')
        return units

    def get_all_tables(self) -> list[etree._Element]:
        # collect the result table body of every unit
        units = self.get_different_units()
        tables = []
        for unit in units:
            table = unit.xpath(r"./div[2]/div/div/table/tbody")
            tables.extend(table)
        return tables

    def get_table_record(self, table_element: etree._Element) -> list[tuple]:
        # a raw row looks like: [None, '2/4', 'XX名字', '日语', '16.97', '27', None, None]
        rows = table_element.xpath("./tr[2]/td/table/tbody/tr")[1:]
        records = []
        try:
            # an empty table has no event name in its header cell, so return early
            event_name = table_element.xpath("./tr[1]/td[1]/text()")[0].strip().split('\xa0')[1]
        except IndexError:
            return records
        for rank, row in enumerate(rows, 1):
            # target order: ("项目", "姓名", "学院", "组/道", "名次", "成绩", "得分", "备注")
            items = [i.text.strip() if i.text is not None else "" for i in row.xpath(r"./td")]
            records.append((EVENT_YEAR, event_name, items[2], items[3], items[1], rank,
                            items[4], items[5], items[6]))
        return records

    def write_in_csv(self):
        with open("./sports_meeting_res.csv", "w", encoding="utf-8") as f:
            f.write(",".join(self.table_col) + "\n")
            for table in self.get_all_tables():
                for record in self.get_table_record(table):
                    # join the record into one CSV line and write it out
                    csv_line = ",".join(map(str, record)) + "\n"
                    f.write(csv_line)


if __name__ == "__main__":
    with open("./sports_meeting_res2024.txt", "r", encoding="utf-8") as f:
        html_content = f.read()
    read_html_table = ReadHtmlTable(html_content)
    read_html_table.write_in_csv()
Note that in the code above the global variable EVENT_YEAR holds the date of the meet; change it to the real date of your meet. You also need to adjust the input path, "./sports_meeting_res2024.txt" in the code, which is the file holding the HTML I copied out of the browser.
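If you would rather not edit the source each time, here is a small optional variant (my own sketch, not part of the original script): a drop-in replacement for the __main__ block above that reads both values from the command line. It assumes ReadHtmlTable and EVENT_YEAR are defined in the same file, and the flag names are my own choice.

if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Parse copied sports-meet HTML into a CSV file")
    parser.add_argument("--event-year", default=EVENT_YEAR, help="date string written into every record")
    parser.add_argument("--html-file", default="./sports_meeting_res2024.txt", help="text file with the copied HTML")
    args = parser.parse_args()

    EVENT_YEAR = args.event_year  # override the module-level default used by get_table_record
    with open(args.html_file, "r", encoding="utf-8") as f:
        html_content = f.read()
    ReadHtmlTable(html_content).write_in_csv()

You would then run it as, for example, python parse_results.py --event-year "2025.11.14-15" --html-file ./sports_meeting_res2025.txt (the script name is whatever you saved the file as).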
Finally, everything is written out as a CSV file in the specified format. Pretty easy!
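One caveat: the script joins fields with plain commas, so if any field (the 备注 column, say) ever contains a comma itself, the rows will mis-align; switching the writing step to the standard library's csv.writer would avoid that. To spot-check the output file, something like the following (standard library only) works:

import csv

with open("./sports_meeting_res.csv", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print("header:", rows[0])                  # the nine column names
print("records:", len(rows) - 1)           # number of result rows
print("first:", rows[1] if len(rows) > 1 else "none")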