Code Collection

Author: 手机游戏网站   Published: 2019-10-08

DC学院   systematic course: "Python Crawler: Beginner to Advanced"

Lesson 3: Downloading the Baidu Homepage

import requests

data = requests.get('https://www.baidu.com/')
data.encoding='utf-8'

print(data.text)
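The `data.encoding = 'utf-8'` line matters because requests guesses the charset from the HTTP headers, and a wrong guess garbles Chinese text. A minimal offline sketch of the same decode step (the sample string is illustrative, not fetched from Baidu):

```python
# Bytes as they arrive on the wire, UTF-8 encoded Chinese text.
raw = '百度一下,你就知道'.encode('utf-8')

# What a bad charset guess produces (mojibake) vs. the correct decode,
# which is what setting data.encoding = 'utf-8' achieves in requests.
wrong = raw.decode('latin-1')
right = raw.decode('utf-8')

print(right)
```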

  

Lesson 1: Scraping Douban Movies with Requests + XPath

1. Scrape a single element

import requests
from lxml import etree

url = 'https://movie.douban.com/subject/1292052/'
data = requests.get(url).text
s=etree.HTML(data)

file=s.xpath('//*[@id="content"]/h1/span[1]/text()')
print(file)
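The `xpath()` call above returns a list of matching text nodes. The same pattern can be seen without a network request on a toy HTML snippet (the markup below is made up for illustration, shaped like the Douban page):

```python
from lxml import etree

# Toy HTML standing in for the Douban movie page.
html = '<div id="content"><h1><span>肖申克的救赎 The Shawshank Redemption</span></h1></div>'
s = etree.HTML(html)

# xpath() always returns a list, even when there is only one match.
film = s.xpath('//*[@id="content"]/h1/span[1]/text()')
print(film)       # one-element list
print(film[0])    # index into it to get the plain string
```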

2. Scrape multiple elements

import requests
from lxml import etree

url = 'https://movie.douban.com/subject/1292052/'
data = requests.get(url).text
s=etree.HTML(data)

film=s.xpath('//*[@id="content"]/h1/span[1]/text()')
director=s.xpath('//*[@id="info"]/span[1]/span[2]/a/text()')
actor=s.xpath('//*[@id="info"]/span[3]/span[2]/a/text()')
time=s.xpath('//*[@id="info"]/span[13]/text()')

print('电影名称:',film)
print('导演:',director)
print('主演:',actor)
print('片长:',time)
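Because `xpath()` returns plain Python lists, multi-valued fields such as the actor list can be cleaned up with ordinary list operations before printing. A small sketch with sample data shaped like the XPath results above:

```python
# Sample values shaped like the xpath() results (illustrative data, not scraped).
actor = ['蒂姆·罗宾斯', '摩根·弗里曼', '鲍勃·冈顿']
length = ['  142分钟 ']   # text nodes often carry stray whitespace

actor_line = ' / '.join(actor)   # join multiple matches into one readable line
length_line = length[0].strip()  # take the single match and trim whitespace

print('主演:', actor_line)
print('片长:', length_line)
```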

  

Lesson 4: Scraping the Douban Books Top 250

from lxml import etree
import requests
import time

for a in range(10):
    url = 'https://book.douban.com/top250?start={}'.format(a*25)
    data = requests.get(url).text

    s=etree.HTML(data)
    file=s.xpath('//*[@id="content"]/div/div[1]/div/table')
    time.sleep(3)

    for div in file:
        title = div.xpath("./tr/td[2]/div[1]/a/@title")[0]
        href = div.xpath("./tr/td[2]/div[1]/a/@href")[0]
        score=div.xpath("./tr/td[2]/div[2]/span[2]/text()")[0]
        num=div.xpath("./tr/td[2]/div[2]/span[3]/text()")[0].strip("(").strip().strip(")").strip()
        scrible=div.xpath("./tr/td[2]/p[2]/span/text()")

        if len(scrible) > 0:
            print("{},{},{},{},{}\n".format(title,href,score,num,scrible[0]))
        else:
            print("{},{},{},{}\n".format(title,href,score,num))
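The chained `strip()` calls on the rating count peel off the parentheses and the whitespace around them, in that order. A self-contained sketch with a sample string shaped like the raw node text (the value is illustrative):

```python
# Raw text as it appears in the page: parentheses wrapping padded text.
raw_num = '(\n                2438734人评价\n            )'

# strip('(') removes the leading paren, strip() the inner whitespace,
# strip(')') the trailing paren, and a final strip() the rest.
num = raw_num.strip('(').strip().strip(')').strip()
print(num)
```

The order matters: a plain `strip('()')` alone would leave the inner newlines and padding in place.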

  

Lesson 5: Scraping Xiaozhu Short-Term Rental Listings

from lxml import etree
import requests
import time

for a in range(1,6):
    url = 'http://cd.xiaozhu.com/search-duanzufang-p{}-0/'.format(a)
    data = requests.get(url).text

    s=etree.HTML(data)
    file=s.xpath('//*[@id="page_list"]/ul/li')
    time.sleep(3)

    for div in file:
        title=div.xpath("./div[2]/div/a/span/text()")[0]
        price=div.xpath("./div[2]/span[1]/i/text()")[0]
        scrible=div.xpath("./div[2]/div/em/text()")[0].strip()
        pic=div.xpath("./a/img/@lazy_src")[0]

        print("{}   {}   {}   {}\n".format(title,price,scrible,pic))

  

Lesson 6: Saving Scraped Data Locally

1. Store the Xiaozhu rental data

from lxml import etree
import requests
import time

with open('/Users/mac/Desktop/xiaozhu.csv','w',encoding='utf-8') as f:
    for a in range(1,6):
        url = 'http://cd.xiaozhu.com/search-duanzufang-p{}-0/'.format(a)
        data = requests.get(url).text

        s=etree.HTML(data)
        file=s.xpath('//*[@id="page_list"]/ul/li')
        time.sleep(3)

        for div in file:
            title=div.xpath("./div[2]/div/a/span/text()")[0]
            price=div.xpath("./div[2]/span[1]/i/text()")[0]
            scrible=div.xpath("./div[2]/div/em/text()")[0].strip()
            pic=div.xpath("./a/img/@lazy_src")[0]

            f.write("{},{},{},{}\n".format(title,price,scrible,pic))

2. Store the Douban Books Top 250 data

from lxml import etree
import requests
import time

with open('/Users/mac/Desktop/top250.csv','w',encoding='utf-8') as f:
    for a in range(10):
        url = 'https://book.douban.com/top250?start={}'.format(a*25)
        data = requests.get(url).text

        s=etree.HTML(data)
        file=s.xpath('//*[@id="content"]/div/div[1]/div/table')
        time.sleep(3)

        for div in file:
            title = div.xpath("./tr/td[2]/div[1]/a/@title")[0]
            href = div.xpath("./tr/td[2]/div[1]/a/@href")[0]
            score=div.xpath("./tr/td[2]/div[2]/span[2]/text()")[0]
            num=div.xpath("./tr/td[2]/div[2]/span[3]/text()")[0].strip("(").strip().strip(")").strip()
            scrible=div.xpath("./tr/td[2]/p[2]/span/text()")

            if len(scrible) > 0:
                f.write("{},{},{},{},{}\n".format(title,href,score,num,scrible[0]))
            else:
                f.write("{},{},{},{}\n".format(title,href,score,num))
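Joining fields with `f.write` and literal commas breaks as soon as a title or quote itself contains a comma. The standard-library `csv` module quotes such fields automatically. A hedged alternative sketch (the rows are sample data, and `io.StringIO` stands in for the real file so the snippet runs anywhere):

```python
import csv
import io

# Sample rows shaped like the scraped book data (values are illustrative).
rows = [
    ('追风筝的人', 'https://book.douban.com/subject/1770782/', '8.8', '1000000人评价', '为你,千千万万遍'),
    ('小王子', 'https://book.douban.com/subject/1084336/', '9.0', '900000人评价', '献给长成了大人的孩子们'),
]

# Swap in open('top250.csv', 'w', newline='', encoding='utf-8') for a real file.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['title', 'href', 'score', 'num', 'quote'])  # header row
writer.writerows(rows)

print(buf.getvalue())
```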

  

Lesson 7: Scraping Douban Movies by Category and Handling Dynamically Loaded Pages

import requests
import json
import time

for a in range(3):
    url_visit = 'https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={}'.format(a*20)
    file = requests.get(url_visit).json()   # unlike the earlier lessons, this endpoint returns JSON
    time.sleep(2)

    for i in range(20):
        movie=file['data'][i]   # the i-th movie under the 'data' key (avoid shadowing the built-in dict)
        urlname=movie['url']
        title=movie['title']
        rate=movie['rate']
        cast=movie['casts']

        print('{}  {}  {}  {}\n'.format(title,rate,'  '.join(cast),urlname))
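The endpoint returns a JSON object of the form `{"data": [ {...movie...}, ... ]}`, which `.json()` parses into nested dicts and lists. A self-contained sketch with a made-up payload of the same shape shows the extraction step without hitting the network:

```python
import json

# Made-up payload mimicking the Douban response shape (illustrative values).
payload = '''
{"data": [{"url": "https://movie.douban.com/subject/1292052/",
           "title": "肖申克的救赎",
           "rate": "9.7",
           "casts": ["蒂姆·罗宾斯", "摩根·弗里曼"]}]}
'''

file = json.loads(payload)   # this is what requests.get(url).json() does for you
movie = file['data'][0]      # first movie under the 'data' key
line = '{}  {}  {}  {}'.format(movie['title'], movie['rate'],
                               '  '.join(movie['casts']), movie['url'])
print(line)
```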

     

DC学院: "Python Crawler: Beginner to Advanced"

Fast learning path
Start from real-world cases and learn concrete skills through hands-on practice.

Learning materials for every lesson
The most useful materials are selected for you, so you can focus on practice instead of wasting time collecting and filtering resources.

Many case studies covering mainstream sites
Over a dozen website case studies, including Sohu, Tmall, Weibo, Qunar, and recruitment sites.

Learn several anti-anti-crawler techniques
Easily handle IP limits, dynamically loaded pages, CAPTCHAs, and other anti-crawler measures.

Advance to distributed crawlers
Master distributed techniques, build a crawler framework, and crawl and store data at scale.

  

"Python Crawler: Beginner to Advanced" (join on PC to start learning right away)

Scan the QR code with your phone to join and start learning immediately.

     
