
[Python Data Analysis] Working with Excel in Python 3 (Part 2): Solving and Optimizing Some Remaining Problems

   
Following the previous post, [Python Data Analysis] Working with Excel in Python 3: the Douban Books Top250 as an Example, in which I scraped the Douban Books Top250, a few problems were still unsolved, so I dug further and discussed them around. Along the way I received a lot of help and inspiration from the user 一只尼玛. Many thanks!

The problems left over from last time:

1. Writing could not continue past a certain point.

2. Results that printed correctly in the Python IDLE came out garbled once written to Excel.

These two problems pushed me to change the Excel-handling module, since xlwt reportedly supports only up to Excel 2003 and could easily cause trouble.
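For context, xlwt targets the old binary .xls format, and writing unicode with it requires passing an encoding explicitly. A minimal sketch (the file and sheet names are arbitrary; whether this setting alone would have cured the original garbling, I can't say):

import xlwt

# xlwt writes the legacy .xls (Excel 97-2003) format; non-ASCII text
# needs the workbook encoding set explicitly.
book = xlwt.Workbook(encoding='utf-8')
sheet = book.add_sheet('demo')
sheet.write(0, 0, '红楼梦')   # a non-ASCII cell value
book.save('legacy.xls')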

   
Although 一只尼玛 did provide a Validate function, it is for stripping characters that are illegal in Windows filenames (it reappears in the final code below) and has nothing to do with the garbled Excel output, so I still decided to change modules.

   

Switching to the xlsxwriter module

   
This time I switched to the xlsxwriter module, https://pypi.python.org/pypi/XlsxWriter. It too installs automatically with pip3 install xlsxwriter, quick and painless. Some usage examples:

import xlsxwriter

# Create a new Excel file and add a worksheet.
workbook = xlsxwriter.Workbook('demo.xlsx')
worksheet = workbook.add_worksheet()

# Widen the first column to make the text clearer.
worksheet.set_column('A:A', 20)

# Add a bold format to use to highlight cells.
bold = workbook.add_format({'bold': True})

# Write some simple text.
worksheet.write('A1', 'Hello')

# Text with formatting.
worksheet.write('A2', 'World', bold)

# Write some numbers, with row/column notation.
worksheet.write(2, 0, 123)
worksheet.write(3, 0, 123.456)

# Insert an image.
worksheet.insert_image('B5', 'logo.png')

workbook.close()

Without further ado I swapped in the new Excel-writing code. The result:

[Image 1: result screenshot]

And sure enough, everything comes out right: links are links, and it writes whatever characters you throw at it; after all, it's unicode.
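To make that concrete, a minimal sketch (the file name unicode_demo.xlsx is made up): write() stores any unicode string, and write_url() writes a real clickable link with optional display text:

import xlsxwriter

workbook = xlsxwriter.Workbook('unicode_demo.xlsx')
worksheet = workbook.add_worksheet()

# Non-ASCII strings go in as-is; xlsxwriter is unicode throughout.
worksheet.write(0, 0, '红楼梦 (Dream of the Red Chamber)')

# URLs can be written as clickable links, with optional display text.
worksheet.write_url(1, 0, 'http://book.douban.com/top250?', string='Douban Top250')

workbook.close()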

So: pick the right module, pick the right module, pick the right module! (Important things get said three times.)

If the content you're scraping isn't all clean, well-formed strings and numbers, I'm done with xlwt.

A comparison of four Python modules for writing Excel can be found here: http://ju.outofmemory.cn/entry/56671

I clipped a comparison chart below; for the details see that article, it is extremely thorough!

[Image 2: module comparison chart]


Pressing on

Since writing went this smoothly, and the module can even insert images, why don't we give that a try?

The goal: replace the image-link column with the actual cover images!

It is actually simple: we already have the images' paths on disk from earlier, so we just insert them.

    # Path where the cover was saved earlier by download_img
    the_img = "I:\\douban\\image\\"+bookName+".jpg"
    writelist=[i+j,bookName,nickname,rating,nums,the_img,bookurl,notion,tag]
    for k in range(0,9):
        if k == 5:                                  # column 5 is the cover column
            worksheet.insert_image(i+j,k,the_img)   # embed the image itself
        else:
            worksheet.write(i+j,k,writelist[k])
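As a side note (not something the code above does): insert_image also accepts an options dict, so if the covers sit badly in the cells they can be scaled and padded at insertion time; the factors below are illustrative guesses:

# Illustrative only: shrink the cover and nudge it off the cell border.
worksheet.insert_image(i+j, k, the_img,
                       {'x_scale': 0.5, 'y_scale': 0.5,   # scale to 50%
                        'x_offset': 2, 'y_offset': 2})    # padding in pixels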

The result looks like this. It is clearly not pretty, so we should adjust the row heights a bit and center the contents:

[Image 3: spreadsheet before formatting]

Consulting the xlsxwriter documentation shows that row/column dimensions and centering can be set like this (these tweaks could of course be made directly in Excel, probably faster than writing code, but I wanted to exercise the module a bit more):

format = workbookx.add_format()
format.set_align('justify')
format.set_align('center')      # horizontal centering
format.set_align('vjustify')
format.set_align('vcenter')     # vertical centering
format.set_text_wrap()          # wrap long text within cells

worksheet.set_row(0,12,format)          # header row
for i in range(1,251):
    worksheet.set_row(i,70)             # tall rows to hold the cover images
worksheet.set_column('A:A',3,format)
worksheet.set_column('B:C',17,format)
worksheet.set_column('D:D',4,format)
worksheet.set_column('E:E',7,format)
worksheet.set_column('F:F',10,format)
worksheet.set_column('G:G',19,format)
worksheet.set_column('H:I',40,format)
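Two asides: set_row heights are in points while set_column widths are in character units, and a format can also be attached to a single write call rather than to a whole row or column:

# A format object can also be applied per cell at write time:
worksheet.write(0, 1, 'centered cell', format)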

That completes the Excel output. The one sore point is that formatting is genuinely fiddly: you keep re-tuning spacing and sizes, so doing that part inside Excel itself is simpler.

The final code:

# -*- coding:utf-8 -*-
import requests
import re
import xlsxwriter
from bs4 import BeautifulSoup
from datetime import datetime
import codecs

now = datetime.now()             # start timing
print(now)

def validate(title):                        # from nima: strip characters illegal in Windows filenames
    rstr = r"[\/\\\:\*\?\"\<\>\|]"          # '/\:*?"<>|-'
    new_title = re.sub(rstr, "", title)
    return new_title

txtfile = codecs.open("top2501.txt",'w','utf-8')
url = "http://book.douban.com/top250?"

header = { "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.13 Safari/537.36",
           "Referer": "http://book.douban.com/"
           }

image_dir = "I:\\douban\\image\\"
# download a cover image into image_dir
def download_img(imageurl,imageName = "xxx.jpg"):
    rsp = requests.get(imageurl, stream=True)
    image = rsp.content
    path = image_dir + imageName +'.jpg'
    #print(path)
    with open(path,'wb') as file:
        file.write(image)

# create the Excel workbook
workbookx = xlsxwriter.Workbook('I:\\douban\\btop250.xlsx')
worksheet = workbookx.add_worksheet()
format = workbookx.add_format()
format.set_align('justify')
format.set_align('center')
format.set_align('vjustify')
format.set_align('vcenter')
format.set_text_wrap()

worksheet.set_row(0,12,format)
for i in range(1,251):
    worksheet.set_row(i,70)
worksheet.set_column('A:A',3,format)
worksheet.set_column('B:C',17,format)
worksheet.set_column('D:D',4,format)
worksheet.set_column('E:E',7,format)
worksheet.set_column('F:F',10,format)
worksheet.set_column('G:G',19,format)
worksheet.set_column('H:I',40,format)

item = ['Title','Alias','Rating','Ratings','Cover','Book link','Publication info','Tags']
for i in range(1,9):
    worksheet.write(0,i,item[i-1])

s = requests.Session()      # open a session
s.get(url,headers=header)

for i in range(0,250,25):
    geturl = url + "/start=" + str(i)                     # page address to fetch
    print("Now to get " + geturl)
    postData = {"start":i}                                # post data
    res = s.post(url,data = postData,headers = header)    # post
    soup = BeautifulSoup(res.content.decode(),"html.parser")       # parse with BeautifulSoup
    table = soup.findAll('table',{"width":"100%"})        # find the table holding each book's info
    sz = len(table)                                       # sz = 25, 25 books per page
    for j in range(1,sz+1):                               # j = 1~25
        sp = BeautifulSoup(str(table[j-1]),"html.parser") # parse one book's info

        imageurl = sp.img['src']                          # cover image link
        bookurl = sp.a['href']                            # book page link
        bookName = sp.div.a['title']
        nickname = sp.div.span                            # alias, if any
        if(nickname):                                     # store the alias if present, otherwise an empty string
            nickname = nickname.string.strip()
        else:
            nickname = ""

        notion = str(sp.find('p',{"class":"pl"}).string)   # publication info; note that .string is not a real str yet
        rating = str(sp.find('span',{"class":"rating_nums"}).string)    # rating
        nums = sp.find('span',{"class":"pl"}).string                    # number of ratings
        nums = nums.replace('(','').replace(')','').replace('\n','').strip()
        nums = re.findall('(\d+)人评价',nums)[0]          # matches Douban's "N人评价" label
        download_img(imageurl,bookName)                     # download the cover
        book = requests.get(bookurl)                        # open the book's own page
        sp3 = BeautifulSoup(book.content,"html.parser")     # parse it
        taglist = sp3.find_all('a',{"class":"  tag"})       # find the tag links
        tag = ""
        lis = []
        for tagurl in taglist:
            sp4 = BeautifulSoup(str(tagurl),"html.parser")  # parse each tag
            lis.append(str(sp4.a.string))

        tag = ','.join(lis)        # join with commas
        the_img = "I:\\douban\\image\\"+bookName+".jpg"
        writelist=[i+j,bookName,nickname,rating,nums,the_img,bookurl,notion,tag]
        for k in range(0,9):
            if k == 5:
                worksheet.insert_image(i+j,k,the_img)
            else:
                worksheet.write(i+j,k,writelist[k])
            txtfile.write(str(writelist[k]))
            txtfile.write('\t')
        txtfile.write(u'\r\n')

end = datetime.now()    # stop timing
print(end)
print("Elapsed time: " + str(end-now))
txtfile.close()
workbookx.close()

A run produces the following output:

2016-03-28 11:40:50.525635
Now to get http://book.douban.com/top250?/start=0
Now to get http://book.douban.com/top250?/start=25
Now to get http://book.douban.com/top250?/start=50
Now to get http://book.douban.com/top250?/start=75
Now to get http://book.douban.com/top250?/start=100
Now to get http://book.douban.com/top250?/start=125
Now to get http://book.douban.com/top250?/start=150
Now to get http://book.douban.com/top250?/start=175
Now to get http://book.douban.com/top250?/start=200
Now to get http://book.douban.com/top250?/start=225
2016-03-28 11:48:14.946184
Elapsed time: 0:07:24.420549

[Image 5: final spreadsheet]

All 250 books crawled without a hitch. As far as the results go, this crawl can be called a success!

This run took 7 minutes 24 seconds, which still feels too slow. The next step should be figuring out how to improve efficiency.
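One obvious direction (an idea sketch, not something from this run): most of the time goes into the per-book page fetch and the cover download, so running the downloads in parallel should help. A minimal sketch reusing download_img from above, where pairs is a hypothetical list of (imageurl, bookName) tuples collected while parsing:

from concurrent.futures import ThreadPoolExecutor

def download_all(pairs, max_workers=8):
    # Download covers in parallel; pairs is [(imageurl, bookName), ...].
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for imageurl, bookName in pairs:
            pool.submit(download_img, imageurl, bookName)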
