I have been learning the Python modules for working with web pages, so let's take a look at this "beautiful soup" together.
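Overview
Beautiful Soup is a Python library for extracting data from HTML or XML files. Working with the parser of your choice, it provides idiomatic ways to navigate, search, and modify the document.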
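Install Beautiful Soup (either command works):
$ easy_install beautifulsoup4
$ pip install beautifulsoup4

Install a parser (either command works):
$ easy_install lxml
$ pip install lxml

Workflow: 1. Fetch the page with the requests library -> 2. Create a soup object with BeautifulSoup -> 3. Use bs4 to parse it and extract the content you need.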
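The parser installed above is selected by name through the second argument of `BeautifulSoup`. As a minimal sketch, the helper below tries lxml first and falls back to Python's built-in `html.parser` when lxml is not installed. The helper name `make_soup` and the sample markup are hypothetical, used only for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: build a soup with the preferred parser if available."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed;
        # html.parser ships with Python, so it is always available.
        return BeautifulSoup(markup, "html.parser")


sample = "<html><head><title>Page title</title></head><body><p>Hi</p></body></html>"
soup = make_soup(sample)
print(soup.title.string)
```

Different parsers can repair malformed markup in different ways, so it may be worth naming the same parser everywhere in a project.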
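Example
#coding:utf-8
from bs4 import BeautifulSoup

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']

# ''.join(doc) turns the list doc into a single string.
# Naming the parser explicitly avoids the "no parser was explicitly specified" warning.
soup = BeautifulSoup(''.join(doc), 'lxml')

print(soup.title)                        # the <title> tag itself
print(soup.title.name)                   # the tag's name
print(soup.title.string)                 # the text inside the tag
print(soup.p)                            # the first <p> tag
print(soup.p['id'])                      # the id attribute of the first <p>
print(soup.find_all('p'))                # every <p> tag
print(soup.find_all(id="secondpara"))    # every tag whose id is "secondpara"
print(soup.get_text())                   # all text in the document, with tags stripped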
Output:
C:PythonPython36python.exe D:/2.codes/PycharmProjects/PyReptilian/beautysoap.py
<title>Page title</title>
title
Page title
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
firstpara
[<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>, <p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>]
[<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>]
Page titleThis is paragraph one.This is paragraph two.

Process finished with exit code 0
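The example above navigates and searches the document. Modifying it works through the same tag objects. The following is a minimal sketch, assuming a single well-formed paragraph and the built-in `html.parser`; the replacement values ("left", "ONE", " (edited)") are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
p["align"] = "left"          # change an existing attribute
del p["id"]                  # remove an attribute
p.b.string = "ONE"           # replace the text inside <b>

note = soup.new_tag("i")     # create a new <i> tag owned by this soup
note.string = " (edited)"
p.append(note)               # add it as the last child of <p>

print(soup)
```

Attributes behave like dictionary entries on the tag, which is why plain assignment and `del` are enough here.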
An example that pulls information out of a live web page, based on the page's content:
#coding:utf-8
from bs4 import BeautifulSoup
import requests


class Html():
    soup = None

    def __init__(self):
        url = 'http://news.baidu.com/'
        html = requests.get(url).content          # fetch the home page's HTML
        self.soup = BeautifulSoup(html, 'lxml')   # build the soup object

    def getTitle(self):
        # title = self.soup.title  # this would return the text wrapped in <title> </title>
        title = self.soup.title.string
        return title

    def getH2(self):
        # Renamed from getH1: the method has always selected h2 elements.
        try:
            h2 = self.soup.select("h2")   # get the h2 elements; the results include the h2 tags
            if (len(h2) > 1):
                # print(''.join(["Uh-oh, ", str(len(h2)), " h2 tags, bad for SEO"]))  # list to str
                print("%d h2 tags in total" % len(h2))
        except AttributeError:
            return "h2 does not exist"
        return h2


demo = Html()
print("Title: %s " % (demo.getTitle()))
print("h2: %s" % (demo.getH2()))
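One detail in the class above: `select` returns an empty list when nothing matches, so it does not raise `AttributeError`, whereas `soup.title` is `None` on a page without a `<title>` and calling `.string` on it would fail. The sketch below handles both cases on an HTML string, so it runs without network access. The function name `summarize` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def summarize(html):
    """Hypothetical helper: return (title text or None, list of h2 texts)."""
    soup = BeautifulSoup(html, "html.parser")

    title_tag = soup.title                      # None when the page has no <title>
    title = title_tag.string if title_tag is not None else None

    # select() returns an empty list when no h2 matches, so no exception handling is needed
    h2_texts = [h2.get_text(strip=True) for h2 in soup.select("h2")]
    return title, h2_texts


sample = "<html><head><title>News</title></head><body><h2>A</h2><h2> B </h2></body></html>"
with_headings = summarize(sample)
without_headings = summarize("<p>no headings here</p>")
print(with_headings)
print(without_headings)
```

To use this with a live page, the bytes returned by `requests.get(url).content` can be passed in as `html`.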