Homedepot家居優選購 · General · 2025-05-05

Getting Started with Scrapy

About Scrapy

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.

Installing Scrapy

Install directly with pip

Windows: open cmd, type pip install scrapy, and press Enter.

pip install scrapy

Install from the Tsinghua mirror

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple scrapy

Other mirror URLs

Tsinghua University mirror: https://pypi.tuna.tsinghua.edu.cn/simple/

Alibaba Cloud: http://mirrors.aliyun.com/pypi/simple/

USTC mirror: https://pypi.mirrors.ustc.edu.cn/simple/

Douban mirror: http://pypi.douban.com/simple/

USTC mirror (HTTP): http://pypi.mirrors.ustc.edu.cn/simple/

Creating a Scrapy Project

scrapy startproject <project name>

--- Example: scrapy startproject spider_demo01

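Structure of a Scrapy Project

spiders/
    __init__.py
    your_spider.py --> created by you; the file that implements the core crawling logic
__init__.py
items.py --> where data structures are defined; holds a class that inherits from scrapy.Item
middlewares.py --> middleware, for example proxies
pipelines.py --> pipeline file; the generated file contains a single class that post-processes the downloaded data
    The default priority is 300; the smaller the value, the higher the priority (values conventionally range from 0 to 1000)
settings.py --> configuration, for example whether to obey robots.txt and how the User-Agent is defined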
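
To make the priority rule for pipelines concrete, here is a minimal sketch in plain Python of how ascending priority values translate into execution order. The class paths in the dictionary are hypothetical, and the sorting function only illustrates the ordering rule; it is not Scrapy's internal implementation.

```python
# Hypothetical ITEM_PIPELINES setting: keys are made-up class paths,
# values are priorities (a smaller value runs earlier).
ITEM_PIPELINES = {
    "myproject.pipelines.SaveToDatabasePipeline": 800,
    "myproject.pipelines.CleanHtmlPipeline": 300,
    "myproject.pipelines.DeduplicatePipeline": 500,
}


def execution_order(pipelines):
    """Return pipeline paths sorted by ascending priority value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


for path in execution_order(ITEM_PIPELINES):
    print(path)
```

With these values an item would pass through the cleaning pipeline first, then deduplication, and reach the database pipeline last.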

Creating a Spider File

scrapy genspider <spider name> <domain>

--- Example: scrapy genspider baidu www.baidu.com

Running a Spider

scrapy crawl <spider name>

--- Example: scrapy crawl baidu

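Scrapy's Architecture

Engine --> runs automatically and needs no attention; it organizes all request objects and dispatches them to the downloader.

Downloader --> fetches the data once it receives a request object from the engine.

Spiders --> a Spider class defines how to crawl a site (or a group of sites), including the crawling actions (for example, whether to follow links) and how to extract structured data (items) from page content. In other words, the Spider is where you define the crawling behavior and the parsing for a page (or a set of pages).

Scheduler --> has its own scheduling rules and needs no attention.

Item Pipeline --> the pipeline that finally processes the data, with hooks reserved for our own processing. After an item is collected in a Spider, it is passed to the Item Pipeline, where several components process it in a defined order. Each item pipeline component (sometimes simply called an "Item Pipeline") is a Python class that implements a simple method. It receives an item, acts on it, and decides whether the item continues through the pipeline or is dropped and processed no further.

Typical uses of an item pipeline:

Cleaning HTML data
Validating scraped data (checking that items contain certain fields)
Checking for duplicates (and dropping them)
Storing the scraped results in a database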
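
As an illustration of the validation and duplicate-checking uses, the following is a minimal sketch of a pipeline component. The class name and the choice of `name` as the required field are assumptions made for this example. `DropItem` is the exception Scrapy provides for discarding an item; the fallback class is only a stand-in so the sketch also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem  # the real exception when Scrapy is installed
except ImportError:
    class DropItem(Exception):
        """Stand-in so the sketch runs without Scrapy."""


class ValidateAndDeduplicatePipeline:
    """Hypothetical pipeline: drops items without a name and repeated names."""

    def open_spider(self, spider):
        self.seen_names = set()

    def process_item(self, item, spider):
        name = item.get("name")
        if not name:
            raise DropItem("missing name")
        if name in self.seen_names:
            raise DropItem(f"duplicate item: {name}")
        self.seen_names.add(name)
        return item  # returning the item passes it to the next pipeline


# Exercise the pipeline by hand, with plain dicts standing in for items.
pipeline = ValidateAndDeduplicatePipeline()
pipeline.open_spider(spider=None)
kept = []
for raw in [{"name": "A"}, {"name": ""}, {"name": "A"}, {"name": "B"}]:
    try:
        kept.append(pipeline.process_item(raw, spider=None))
    except DropItem:
        pass  # inside Scrapy, a dropped item is not passed to later pipelines
print(kept)
```

Only the first "A" and "B" survive; the empty name fails validation and the second "A" is treated as a duplicate.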

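How Scrapy Works

1. The engine asks the spiders for URLs.
2. The engine hands the URLs to be crawled to the scheduler.
3. The scheduler turns each URL into a request object and places it in a queue.
4. A request is taken off the queue.
5. The engine passes the request to the downloader.
6. The downloader sends the request and fetches the data from the internet.
7. The downloader returns the data to the engine.
8. The engine passes the data back to the spiders.
9. The spiders parse the data with XPath and obtain either data or URLs.
10. The spiders hand the data or URLs to the engine.
11. The engine checks whether it received data or a URL: data goes to the item pipeline, a URL goes to the scheduler.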
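
The loop described above can be imitated in a few lines of plain Python. This is a toy model, not Scrapy's engine: `FAKE_WEB`, `download`, `parse`, and `run_engine` are made-up names, the "internet" is a dictionary, and everything runs synchronously, whereas Scrapy is asynchronous.

```python
from collections import deque

# Made-up "internet": URL -> (data on the page, links found on the page).
FAKE_WEB = {
    "page1": ("data-1", ["page2", "page3"]),
    "page2": ("data-2", ["page3"]),
    "page3": ("data-3", []),
}


def download(url):
    """Downloader stand-in: return the response for a request."""
    return FAKE_WEB[url]


def parse(response):
    """Spider stand-in: yield extracted data and follow-up URLs."""
    data, links = response
    yield ("item", data)
    for link in links:
        yield ("url", link)


def run_engine(start_urls):
    scheduler = deque(start_urls)  # the engine hands start URLs to the scheduler
    seen = set(start_urls)         # this sketch skips URLs it has already queued
    pipeline_output = []
    while scheduler:
        url = scheduler.popleft()            # a request leaves the queue
        response = download(url)             # the downloader fetches it
        for kind, value in parse(response):  # the spider parses the response
            if kind == "item":
                pipeline_output.append(value)  # data goes to the item pipeline
            elif value not in seen:
                seen.add(value)
                scheduler.append(value)        # URLs go back to the scheduler
    return pipeline_output


print(run_engine(["page1"]))
```

Starting from one URL, the loop keeps running until the scheduler's queue is empty, which is also when a real crawl finishes.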

Examples

PS: these examples are for learning purposes only!

汽車之家 (Autohome)

# Before running, set ROBOTSTXT_OBEY = False in settings.py (or comment the setting out)
import scrapy


class CarSpider(scrapy.Spider):
    name = "car"
    allowed_domains = ["cn.58.com"]
    start_urls = ["https://cn.58.com/dazhong/?PGTID=0d30001d-008d-2709-b9ed-9477389f1bd7"]

    def parse(self, response):
        # response.text is the page body as a string
        # content = response.text
        # Select the data-original attribute of every listing image
        content = response.xpath("//div[@id='list']//img[@class='info_pic']/@data-original")
        for i in content:
            print(i.extract())
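
To see what the XPath expression in this spider selects without crawling anything, here is a small sketch that runs a comparable query over a made-up HTML fragment. It uses the standard library's `xml.etree.ElementTree` instead of Scrapy's selectors, which is a swapped-in technique: ElementTree supports only a subset of XPath and cannot select attributes with `/@name`, so the attribute is read in a second step. The markup and URLs are invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup imitating the structure the spider expects.
SAMPLE = """
<html><body>
  <div id="list">
    <ul>
      <li><img class="info_pic" data-original="https://example.com/a.jpg" /></li>
      <li><img class="info_pic" data-original="https://example.com/b.jpg" /></li>
      <li><img class="logo" data-original="https://example.com/logo.png" /></li>
    </ul>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())
# Match the <img class="info_pic"> elements anywhere under <div id="list">,
# then read the data-original attribute from each one.
images = root.findall(".//div[@id='list']//img[@class='info_pic']")
urls = [img.get("data-original") for img in images]
print(urls)
```

The image with class `logo` is skipped because the predicate `[@class='info_pic']` does not match it.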

當當網 (Dangdang)

# This file only reads some data and has not been polished much
import scrapy

from ..items import SpiderDemo03Item


class DangSpider(scrapy.Spider):
    name = "dang"
    allowed_domains = ["category.dangdang.com", "product.dangdang.com"]
    start_urls = ["https://category.dangdang.com/cp01.43.19.00.00.00.html"]
    base_url = 'https://category.dangdang.com/pg'
    page = 1

    def parse(self, response):
        li_list = response.xpath("//ul[@id='component_59']/li")
        for li in li_list:
            # Prefer data-original; the first image has none, but its src is usable
            src = li.xpath(".//img/@data-original").extract_first()
            if not src:
                src = li.xpath(".//img/@src").extract_first()
            name = li.xpath(".//img/@alt").extract_first()
            price = li.xpath(".//p[@class='price']/span[@class='search_now_price']/text()").extract_first()
            book = SpiderDemo03Item(src=src, name=name, price=price)
            # Hand the item to the pipelines
            yield book

        # Page 2 is at https://category.dangdang.com/pg2-cp01.43.19.00.00.00.html
        # Crawl 100 pages
        # if self.page < 100:
        #     self.page += 1
        #     url = self.base_url + str(self.page) + '-cp01.43.19.00.00.00.html'
        #     yield scrapy.Request(url=url, callback=self.parse)
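
The commented-out pagination logic builds each next URL by concatenating the base URL, the page number, and a fixed suffix. The sketch below isolates that string construction so it can be checked on its own; `page_urls` and `CATEGORY_SUFFIX` are names introduced here for illustration, and whether the site still serves pages at these addresses is not verified.

```python
BASE_URL = "https://category.dangdang.com/pg"
CATEGORY_SUFFIX = "-cp01.43.19.00.00.00.html"


def page_urls(first_page, last_page):
    """Yield listing URLs for pages first_page..last_page (inclusive)."""
    for page in range(first_page, last_page + 1):
        yield f"{BASE_URL}{page}{CATEGORY_SUFFIX}"


for url in page_urls(2, 4):
    print(url)
```

In the spider, each of these URLs would be wrapped in a `scrapy.Request` with `callback=self.parse`, so the same parsing method handles every page.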

# items.py
import scrapy


class SpiderDemo03Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    src = scrapy.Field()
    name = scrapy.Field()
    price = scrapy.Field()
    time = scrapy.Field()

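# pipelines.py
import json
import urllib.request


class SpiderDemo03Pipeline:
    # Open the output file once, when the spider starts
    def open_spider(self, spider):
        self.fp = open('dang.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write each item as one JSON object per line
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    # Close the file when the spider finishes
    def close_spider(self, spider):
        self.fp.close()


# Second pipeline: download the cover images
# Remember to create a books folder in the current directory first
class dangPipeline:
    def process_item(self, item, spider):
        # src is expected to be protocol-relative (//host/path), so prepend the scheme
        url = 'http:' + item.get('src')
        filename = './books/' + item.get('name') + '.jpg'
        urllib.request.urlretrieve(url, filename)
        return item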
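
The image pipeline concatenates `'http:'` with the `src` value and uses the book name directly as a file name. Both steps can fail on unexpected input, for instance when `src` already carries a scheme or the name contains a slash. The following is a hedged sketch of two helper functions that could make those steps more tolerant; `absolute_image_url`, `safe_filename`, and the example host are assumptions, not part of the original project.

```python
import re
from urllib.parse import urljoin


def absolute_image_url(src, page_url="https://category.dangdang.com/"):
    """Resolve protocol-relative, relative, or absolute src values against the page URL."""
    return urljoin(page_url, src)


def safe_filename(name, extension=".jpg"):
    """Replace characters that are invalid in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", name).strip("_")
    return (cleaned or "untitled") + extension


print(absolute_image_url("//img.example.com/covers/1.jpg"))
print(absolute_image_url("http://img.example.com/covers/2.jpg"))
print(safe_filename("a/b:c"))
```

`urljoin` leaves an already absolute URL unchanged and gives a protocol-relative one the scheme of the page it came from.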

# settings.py
# Uncomment the pipelines you need
ITEM_PIPELINES = {
    # Pipeline that downloads the images
    # "spider_demo03.pipelines.dangPipeline": 301,
    # Pipeline that saves the JSON file
    "spider_demo03.pipelines.SpiderDemo03Pipeline": 300,
}

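CrawlSpider

CrawlSpider lets you define rules: while parsing HTML content, it extracts the links that match those rules and then sends requests to them.

So whenever links need to be followed, that is, a page is crawled and the links extracted from it must be crawled in turn, CrawlSpider is a very good fit.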

Example: storing 讀書網 (dushu.com) data in a database

Create the project:

scrapy startproject spider_demo04

Create the spider class:

scrapy genspider -t crawl read www.dushu.com

Configure the database:

# settings.py
DB_HOST = 'localhost'
DB_PORT = 3306
DB_USER = 'root'
DB_PASSWORD = '123456'
DB_NAME = 'spider'
DB_CHARSET = 'utf8'

# Do not obey robots.txt
ROBOTSTXT_OBEY = False

# Write the log to a file (optional)
LOG_FILE = "log.log"

Write the spider file:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import SpiderDemo04Item


class ReadSpider(CrawlSpider):
    name = "read"
    allowed_domains = ["www.dushu.com"]
    start_urls = ["https://www.dushu.com/book/1617_1.html"]

    # Regular expression for the page URLs to follow; check the site for the actual URL pattern
    rules = (
        Rule(LinkExtractor(allow=r"/book/1617_\d+"),
             callback="parse_item",
             follow=True),
    )

    def parse_item(self, response):
        img_list = response.xpath("//div[@class='book-info']//a/img")
        for i in img_list:
            src = i.xpath("./@data-original").extract_first()
            name = i.xpath("./@alt").extract_first()
            book = SpiderDemo04Item(name=name, src=src)
            yield book

# items.py
import scrapy


class SpiderDemo04Item(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    src = scrapy.Field()
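
The `allow` argument in the spider's rule is a regular expression that candidate links are checked against. As a rough illustration of which URLs the pattern `/book/1617_\d+` would let through, the sketch below applies the same pattern with Python's `re` module to a handful of made-up links. It assumes the pattern is searched for anywhere in the absolute URL and leaves out the other filters a link extractor applies (for example allowed domains).

```python
import re

# The same pattern the rule passes to LinkExtractor(allow=...).
ALLOW = re.compile(r"/book/1617_\d+")

# Made-up links such as a listing page might contain.
candidate_links = [
    "https://www.dushu.com/book/1617_2.html",
    "https://www.dushu.com/book/1617_13.html",
    "https://www.dushu.com/book/1618_1.html",
    "https://www.dushu.com/news/",
]

# A link is kept when the pattern is found somewhere in its URL.
followed = [url for url in candidate_links if ALLOW.search(url)]
print(followed)
```

Only the two pagination links of category 1617 match, so only those pages would be requested and passed to `parse_item`.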

Write the pipeline file:

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import json

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
# Load the project settings
from scrapy.utils.project import get_project_settings
import pymysql


class SpiderDemo04Pipeline:
    def open_spider(self, spider):
        self.fp = open('dushu.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write each item as one JSON object per line
        self.fp.write(json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()


class MysqlPipeline:
    # Connect to MySQL
    def open_spider(self, spider):
        settings = get_project_settings()
        self.conn = pymysql.connect(
            host=settings['DB_HOST'],
            port=settings['DB_PORT'],
            user=settings['DB_USER'],
            password=settings['DB_PASSWORD'],
            database=settings['DB_NAME'],
            charset=settings['DB_CHARSET']
        )
        self.cursor = self.conn.cursor()
        print('Connected to the database')

    def process_item(self, item, spider):
        # Parameterized query: the driver escapes the values
        sql = 'insert into book(name,src) values(%s,%s)'
        self.cursor.execute(sql, (item['name'], item['src']))
        self.conn.commit()
        return item

    # Close the connection
    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
        print('Closed the database connection')
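
The database pipeline follows a fixed lifecycle: connect when the spider opens, insert one row per item, close when the spider finishes. To try that lifecycle without a MySQL server, here is a sketch that swaps in the standard library's `sqlite3` with an in-memory database. This is a substitute for illustration, not the article's setup: the class name is made up, the `book` table schema is an assumption (the article does not show it), and `sqlite3` uses `?` placeholders where `pymysql` uses `%s`.

```python
import sqlite3


class SqlitePipeline:
    """Same open/process/close lifecycle as the MySQL pipeline, backed by SQLite."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect(":memory:")
        self.cursor = self.conn.cursor()
        # Assumed schema: two text columns matching the item's fields.
        self.cursor.execute("CREATE TABLE IF NOT EXISTS book (name TEXT, src TEXT)")

    def process_item(self, item, spider):
        # Parameterized insert; the values are never formatted into the SQL string.
        self.cursor.execute(
            "INSERT INTO book (name, src) VALUES (?, ?)",
            (item["name"], item["src"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()


# Drive the pipeline by hand with a plain dict standing in for an item.
pipeline = SqlitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "Example Book", "src": "https://example.com/cover.jpg"}, spider=None)
rows = pipeline.cursor.execute("SELECT name, src FROM book").fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

For the MySQL version, a `book` table with `name` and `src` columns has to exist in the configured database before the spider runs.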

Remember to enable the pipelines in the settings file:

ITEM_PIPELINES = {
    "spider_demo04.pipelines.MysqlPipeline": 301,
    "spider_demo04.pipelines.SpiderDemo04Pipeline": 300,
}

POST Requests in Scrapy

import json

import scrapy


class DushuSpider(scrapy.Spider):
    name = "dushu"
    allowed_domains = ["fanyi.baidu.com"]
    start_urls = ["https://fanyi.baidu.com/sug/"]

    # Override start_requests to send the POST; start_urls is then not used
    def start_requests(self):
        url = "https://fanyi.baidu.com/sug/"
        data = {
            "kw": "hello"
        }
        yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse_second)

    def parse_second(self, response):
        content = response.text
        obj = json.loads(content)
        print(obj)
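
`FormRequest` sends the `formdata` dictionary as a URL-encoded form body. The sketch below shows what that encoding looks like for the dictionary used in the spider, using the standard library's `urlencode`; it illustrates the wire format only and is not Scrapy's own code. The second call uses a made-up non-ASCII keyword.

```python
from urllib.parse import urlencode

data = {"kw": "hello"}
body = urlencode(data)
print(body)

# Non-ASCII values are percent-encoded as UTF-8 bytes.
print(urlencode({"kw": "你好"}))
```

The server receives this string as the request body, which is why form values are plain strings rather than nested structures.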

Proxies

In settings.py, enable this option:

DOWNLOADER_MIDDLEWARES = {
    'postproject.middlewares.Proxy': 543,
}

Then write the middleware in middlewares.py:

class Proxy:
    def process_request(self, request, spider):
        # Route the request through the proxy
        request.meta['proxy'] = 'https://113.68.202.10:9999'
        return None
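
A downloader middleware of this kind can be checked outside Scrapy, since `process_request` only touches the request's `meta` dictionary. The sketch below wraps the method in a class named after the `Proxy` entry in the settings and calls it with a stand-in object; `SimpleNamespace` replaces a real `scrapy.Request` here, and the proxy address is the one from the article, which may no longer be reachable.

```python
from types import SimpleNamespace


class Proxy:
    """Downloader middleware that routes every request through one proxy."""

    PROXY_URL = "https://113.68.202.10:9999"  # address from the article; may be offline

    def process_request(self, request, spider):
        request.meta["proxy"] = self.PROXY_URL
        return None  # None lets Scrapy continue handling the request normally


# SimpleNamespace stands in for a request object that exposes a meta dict.
fake_request = SimpleNamespace(url="https://example.com/", meta={})
result = Proxy().process_request(fake_request, spider=None)
print(result, fake_request.meta)
```

Keeping the address in a class attribute makes it easy to replace with a value read from the settings later.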

Article link: http://m.gantiao.com.cn/post/19259227.html
