Scrapy Crawler Framework

AsosMarketplace時尚鋪 · General · 2025-06-07 · 450


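1. Definition

Scrapy is an open-source web crawling framework written in Python for extracting structured data from websites quickly and efficiently. It provides a complete toolset that lets developers build and deploy crawlers with little effort.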

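2. Components

Engine: controls the data flow of the whole system.
Spiders: define the rules for extracting data from a site: the start URLs, how to extract data from each page, and how to follow links.
Items: define the structure of the data to scrape, that is, the fields you care about. Scraped data is wrapped in Item objects for further processing and storage. Each Item behaves much like a dictionary keyed by field name.
Item Pipelines: process the Item objects that spiders extract. A chain of pipelines can preprocess, filter, and store items, for example cleaning data, removing duplicates, or saving to a database or file.
Downloader: downloads page content and returns it to the spider. Scrapy's built-in downloader is asynchronous (built on Twisted) rather than multi-threaded; settings control concurrency, proxies, and other download behaviour.
Scheduler: manages the queue of URLs waiting to be crawled and decides which request comes next. Requests leave the scheduler and are passed on to the downloader, which fetches the corresponding pages.
Middleware: hooks into the processing of requests and responses, so you can inject custom operations that modify outgoing requests and incoming responses.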
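The way these components hand work to one another can be sketched with a toy loop in plain Python. This is only an illustration, not how Scrapy is implemented: the names FAKE_WEB, download, parse, pipeline, and crawl are made up here, and the "downloader" is just a dictionary lookup.

```python
from collections import deque

# Hypothetical stand-in for the network: URL -> page content
FAKE_WEB = {
    "/a": {"links": ["/b", "/c"], "title": "Page A"},
    "/b": {"links": ["/c"], "title": "Page B"},
    "/c": {"links": ["/a"], "title": "Page C"},
}


def download(url):
    """Downloader: fetch the 'page' for a URL."""
    return FAKE_WEB[url]


def parse(url, page):
    """Spider: yield one item per page, then the links to follow."""
    yield {"url": url, "title": page["title"]}
    for link in page["links"]:
        yield link


def pipeline(item, store):
    """Item pipeline: clean the item, then store it."""
    item["title"] = item["title"].upper()
    store.append(item)


def crawl(start_url):
    """Engine: move requests and items between the other components."""
    queue = deque([start_url])  # scheduler: pending requests
    seen = {start_url}          # scheduler: duplicate filter
    store = []
    while queue:
        url = queue.popleft()
        for result in parse(url, download(url)):
            if isinstance(result, dict):
                pipeline(result, store)   # items go to the pipeline
            elif result not in seen:
                seen.add(result)          # new links go back to the scheduler
                queue.append(result)
    return store


items = crawl("/a")
print([item["url"] for item in items])
```

Each page is visited once even though the links form a cycle, because the scheduler's duplicate filter drops URLs it has already queued.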

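3. Installation and project creation

3.1 Installation

pip install scrapy

3.2 Creating a Scrapy project

scrapy startproject <project_name>

The spiders folder holds the spider programs, which define the rules and logic for crawling specific sites. items.py defines the data structures to extract and save. pipelines.py processes the data extracted by the spiders (cleaning, validation, storage). settings.py configures the Scrapy framework and the crawl parameters. middlewares.py modifies how requests and responses are processed, for example by adding a proxy or setting the User-Agent.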
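As an illustration of what might go in middlewares.py, here is a minimal sketch of a downloader middleware that sets the User-Agent on each request. The class name RandomUserAgentMiddleware and the USER_AGENTS list are hypothetical; the sketch relies only on the process_request(request, spider) hook that Scrapy calls on downloader middlewares.

```python
import random

# Hypothetical pool of User-Agent strings; replace with real ones
USER_AGENTS = [
    "ExampleBrowser/1.0",
    "ExampleBrowser/2.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a User-Agent on each outgoing request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # request.headers is dict-like, so item assignment replaces the header
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets Scrapy continue processing the request normally
        return None
```

To take effect, a class like this would need to be listed in the DOWNLOADER_MIDDLEWARES setting in settings.py, keyed by its import path with an order number of your choice.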

cd <project_name>

scrapy genspider <spider_name> <target_domain>

3.3 Crawling Bilibili data

Project directory

upinfo_spider.py

import scrapy
from scrapy.http import TextResponse

from bilibli_upinfo.items import UserBasicItem, UserOfficialItem, UserVipItem


class UpinfoSpider(scrapy.Spider):
    name = "upinfo_spider"
    allowed_domains = ["api.bilibili.com"]
    # IDs (mid) of the UP accounts to crawl
    uuids = ["1899427463", "268975831", "478548672", "510027392"]
    start_url = "https://api.bilibili.com/x/space/wbi/acc/info?mid=%(uuid)s"

    def start_requests(self):
        # One request per account; dont_filter=True bypasses the duplicate filter
        for userId in self.uuids:
            yield scrapy.Request(url=self.start_url % {"uuid": userId}, dont_filter=True)

    def parse(self, response: TextResponse, **kwargs):
        # The endpoint returns JSON; the profile sits under the "data" key
        up_data = response.json()['data']

        # Basic user information
        user_basic_item = UserBasicItem()
        user_basic_item['up_name'] = up_data['name']
        user_basic_item['up_mid'] = up_data['mid']
        user_basic_item['up_gender'] = up_data['sex']
        user_basic_item['up_avatar'] = up_data['face']
        user_basic_item['up_intro'] = up_data['sign']
        user_basic_item['up_level'] = up_data['level']
        user_basic_item['up_registered_time'] = up_data['jointime']
        user_basic_item['up_coins'] = up_data['coins']
        # Note: the "login log" field is filled from the fans_badge value
        user_basic_item['up_login_log'] = up_data['fans_badge']
        user_basic_item['up_fans_medal'] = up_data['fans_medal']['show']

        # Official verification information
        user_official_item = UserOfficialItem()
        user_official_item['up_mid'] = up_data['mid']
        user_official_item['official_role'] = up_data['official']['role']
        user_official_item['official_title'] = up_data['official']['title']
        user_official_item['official_desc'] = up_data['official']['desc']
        user_official_item['official_type'] = up_data['official']['type']

        # VIP membership information
        user_vip_item = UserVipItem()
        user_vip_item['up_mid'] = up_data['mid']
        user_vip_item['vip_type'] = up_data['vip']['type']
        user_vip_item['vip_status'] = up_data['vip']['status']
        user_vip_item['vip_due_date'] = up_data['vip']['due_date']
        user_vip_item['vip_pay_type'] = up_data['vip']['vip_pay_type']
        user_vip_item['vip_theme_type'] = up_data['vip']['theme_type']
        user_vip_item['vip_label_path'] = up_data['vip']['label']['path']
        user_vip_item['vip_label_text'] = up_data['vip']['label']['text']
        user_vip_item['vip_label_theme'] = up_data['vip']['label']['label_theme']
        user_vip_item['vip_label_text_color'] = up_data['vip']['label']['text_color']
        user_vip_item['vip_label_bg_style'] = up_data['vip']['label']['bg_style']
        user_vip_item['vip_label_bg_color'] = up_data['vip']['label']['bg_color']
        user_vip_item['vip_label_border_color'] = up_data['vip']['label']['border_color']
        user_vip_item['vip_label_use_img'] = up_data['vip']['label']['use_img_label']
        user_vip_item['vip_label_img_uri_hans'] = up_data['vip']['label']['img_label_uri_hans']
        user_vip_item['vip_label_img_uri_hant'] = up_data['vip']['label']['img_label_uri_hant']
        user_vip_item['vip_label_img_uri_hans_static'] = up_data['vip']['label']['img_label_uri_hans_static']
        user_vip_item['vip_label_img_uri_hant_static'] = up_data['vip']['label']['img_label_uri_hant_static']

        # Each yielded item is routed to the item pipeline
        yield user_vip_item
        yield user_basic_item
        yield user_official_item

        # # Basic user information table
        # print(f"UP name: {up_name}")
        # print(f"UP ID: {up_mid}")
        # print(f"UP gender: {up_gender}")
        # print(f"UP avatar URL: {up_avatar}")
        # print(f"UP bio: {up_intro}")
        # print(f"UP level: {up_level}")
        # print(f"UP registration time: {up_registered_time}")
        # print(f"UP coins: {up_coins}")
        # print(f"UP login log: {up_login_log}")
        # print(f"UP fan medal: {up_fans_medal}")
        # # Official verification table
        # print(f"UP ID: {up_mid}")
        # print(f"UP official verification role: {official_role}")
        # print(f"UP official verification title: {official_title}")
        # print(f"UP official verification description: {official_desc}")
        # print(f"UP official verification type: {official_type}")
        # # VIP membership table
        # print(f"UP ID: {up_mid}")
        # print(f"UP VIP type: {vip_type}")
        # print(f"UP VIP status: {vip_status}")
        # print(f"UP VIP due date: {vip_due_date}")
        # print(f"UP VIP pay type: {vip_pay_type}")
        # print(f"UP VIP theme type: {vip_theme_type}")
        # print(f"UP VIP label path: {vip_label_path}")
        # print(f"UP VIP label text: {vip_label_text}")
        # print(f"UP VIP label theme: {vip_label_theme}")
        # print(f"UP VIP label text color: {vip_label_text_color}")
        # print(f"UP VIP label background style: {vip_label_bg_style}")
        # print(f"UP VIP label background color: {vip_label_bg_color}")
        # print(f"UP VIP label border color: {vip_label_border_color}")
        # print(f"UP VIP label uses image: {vip_label_use_img}")
        # print(f"UP VIP label image URI (Simplified Chinese): {vip_label_img_uri_hans}")
        # print(f"UP VIP label image URI (Traditional Chinese): {vip_label_img_uri_hant}")
        # print(f"UP VIP label image URI (Simplified Chinese, static): {vip_label_img_uri_hans_static}")
        # print(f"UP VIP label image URI (Traditional Chinese, static): {vip_label_img_uri_hant_static}")
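The parse method indexes nested keys directly, so a single missing or null field in the response would raise an exception and lose all three items for that account. One possible alternative, sketched below, is a table-driven extraction with a forgiving lookup. The helper dig, the OFFICIAL_FIELD_PATHS mapping, and the sample payload are all hypothetical and only mirror the fields the spider reads.

```python
def dig(data, path, default=None):
    """Follow a dotted path through nested dicts, returning default on any miss."""
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical mapping: item field -> path inside the API's "data" object
OFFICIAL_FIELD_PATHS = {
    "up_mid": "mid",
    "official_role": "official.role",
    "official_title": "official.title",
    "official_desc": "official.desc",
    "official_type": "official.type",
}


def extract_fields(up_data, field_paths):
    """Build a flat dict of item fields from the nested response data."""
    return {field: dig(up_data, path) for field, path in field_paths.items()}


# Invented sample with "desc" deliberately missing
sample = {"mid": 1, "official": {"role": 0, "title": "", "type": -1}}
fields = extract_fields(sample, OFFICIAL_FIELD_PATHS)
print(fields)
```

Since scrapy.Item accepts keyword arguments, the resulting dict could then be passed on as UserOfficialItem(**fields); missing values arrive as None instead of aborting the callback.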

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class UserBasicItem(scrapy.Item):
    up_name = scrapy.Field()
    up_mid = scrapy.Field()
    up_gender = scrapy.Field()
    up_avatar = scrapy.Field()
    up_intro = scrapy.Field()
    up_level = scrapy.Field()
    up_registered_time = scrapy.Field()
    up_coins = scrapy.Field()
    up_login_log = scrapy.Field()
    up_fans_medal = scrapy.Field()


class UserOfficialItem(scrapy.Item):
    up_mid = scrapy.Field()
    official_role = scrapy.Field()
    official_title = scrapy.Field()
    official_desc = scrapy.Field()
    official_type = scrapy.Field()


class UserVipItem(scrapy.Item):
    up_mid = scrapy.Field()
    vip_type = scrapy.Field()
    vip_status = scrapy.Field()
    vip_due_date = scrapy.Field()
    vip_pay_type = scrapy.Field()
    vip_theme_type = scrapy.Field()
    vip_label_path = scrapy.Field()
    vip_label_text = scrapy.Field()
    vip_label_theme = scrapy.Field()
    vip_label_text_color = scrapy.Field()
    vip_label_bg_style = scrapy.Field()
    vip_label_bg_color = scrapy.Field()
    vip_label_border_color = scrapy.Field()
    vip_label_use_img = scrapy.Field()
    vip_label_img_uri_hans = scrapy.Field()
    vip_label_img_uri_hant = scrapy.Field()
    vip_label_img_uri_hans_static = scrapy.Field()
    vip_label_img_uri_hant_static = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from traceback import format_exc

import pymysql
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.utils.project import get_project_settings

from bilibli_upinfo.items import UserBasicItem, UserOfficialItem, UserVipItem


class BilibliUpinfoPipeline:
    def __init__(self):
        # Read the MySQL configuration from settings.py
        settings = get_project_settings()
        mysql_host = settings.get('MYSQL_HOST')
        mysql_port = settings.get('MYSQL_PORT')
        mysql_database = settings.get('MYSQL_DATABASE')
        mysql_user = settings.get('MYSQL_USER')
        mysql_password = settings.get('MYSQL_PASSWORD')

        # Connect to the MySQL database
        self.connection = pymysql.connect(
            host=mysql_host,
            port=mysql_port,
            database=mysql_database,
            user=mysql_user,
            password=mysql_password,
            charset='utf8mb4',
            cursorclass=pymysql.cursors.DictCursor
        )

    def process_item(self, item, spider):
        try:
            if isinstance(item, UserBasicItem):
                # Basic user information table
                with self.connection.cursor() as cursor:
                    sql = '''
                        INSERT INTO user_basic_info (up_name, up_mid, up_gender, up_avatar, up_intro, up_level, up_registered_time,
                        up_coins, up_login_log, up_fans_medal)
                        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                    '''
                    cursor.execute(sql, (
                        item['up_name'], item['up_mid'], item['up_gender'], item['up_avatar'], item['up_intro'],
                        item['up_level'],
                        item['up_registered_time'], item['up_coins'], item['up_login_log'], item['up_fans_medal']
                    ))
                self.connection.commit()
            elif isinstance(item, UserOfficialItem):
                # Official verification table
                with self.connection.cursor() as cursor:
                    sql = '''
                        INSERT INTO user_official_info (up_mid, official_role, official_title, official_desc, official_type)
                        VALUES (%s, %s, %s, %s, %s)
                    '''
                    cursor.execute(sql, (
                        item['up_mid'], item['official_role'], item['official_title'], item['official_desc'],
                        item['official_type']
                    ))
                self.connection.commit()
            elif isinstance(item, UserVipItem):
                # VIP membership table
                with self.connection.cursor() as cursor:
                    sql = '''
                        INSERT INTO user_vip_info (up_mid, vip_type, vip_status, vip_due_date, vip_pay_type, vip_theme_type,
                        vip_label_path, vip_label_text, vip_label_theme, vip_label_text_color, vip_label_bg_style, vip_label_bg_color,
                        vip_label_border_color, vip_label_use_img, vip_label_img_uri_hans, vip_label_img_uri_hant,
                        vip_label_img_uri_hans_static, vip_label_img_uri_hant_static)
                        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                    '''
                    cursor.execute(sql, (
                        item['up_mid'], item['vip_type'], item['vip_status'], item['vip_due_date'], item['vip_pay_type'],
                        item['vip_theme_type'], item['vip_label_path'], item['vip_label_text'], item['vip_label_theme'],
                        item['vip_label_text_color'], item['vip_label_bg_style'], item['vip_label_bg_color'],
                        item['vip_label_border_color'], item['vip_label_use_img'], item['vip_label_img_uri_hans'],
                        item['vip_label_img_uri_hant'], item['vip_label_img_uri_hans_static'],
                        item['vip_label_img_uri_hant_static']
                    ))
                self.connection.commit()
        except Exception:
            # Undo the failed statement, then print the item and the traceback
            self.connection.rollback()
            print(item)
            print(format_exc())
        # Return the item so that any later pipeline can process it too
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes; release the connection
        self.connection.close()
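The three branches of process_item differ only in the table name and the column list, so the SQL text could be generated rather than written by hand. The following is a minimal sketch of that idea; build_insert, insert_params, and OFFICIAL_FIELDS are hypothetical names, and the sample item is invented.

```python
def build_insert(table, fields):
    """Return an INSERT statement with one %s placeholder per field.

    table and fields must be trusted constants from the code, never scraped
    data, because they are interpolated into the SQL text itself.
    """
    columns = ", ".join(fields)
    placeholders = ", ".join(["%s"] * len(fields))
    return f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"


def insert_params(item, fields):
    """Pull the values out of a dict-like item in column order."""
    return tuple(item[field] for field in fields)


OFFICIAL_FIELDS = ("up_mid", "official_role", "official_title",
                   "official_desc", "official_type")

sql = build_insert("user_official_info", OFFICIAL_FIELDS)
params = insert_params(
    {"up_mid": 1, "official_role": 0, "official_title": "",
     "official_desc": "", "official_type": -1},
    OFFICIAL_FIELDS,
)
print(sql)
print(params)
```

The values still travel separately as parameters to cursor.execute(sql, params), so the database driver handles their escaping exactly as in the hand-written version.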

settings.py

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {  # request headers
    'origin': "https://space.bilibili.com",
    'referer': "https://space.bilibili.com/",
    'user-agent': "********* your own User-Agent *********",
    'Host': "api.bilibili.com"
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "bilibli_upinfo.pipelines.BilibliUpinfoPipeline": 300,
}

MYSQL_HOST = '127.0.0.1'  # host name
MYSQL_PORT = 3306  # port number
MYSQL_DATABASE = 'bilibili_users'  # database name
MYSQL_USER = 'root'  # user name
MYSQL_PASSWORD = '123456'  # password
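Since settings.py is ordinary Python, the database credentials do not have to be hard-coded. A possible variant, sketched here, reads them from environment variables and falls back to the tutorial's values. The BILI_* variable names are made up for this sketch; DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN are standard Scrapy settings, and the values chosen for them are only examples.

```python
import os

# Connection settings read from the environment, with the tutorial's values
# as fallbacks. The BILI_* variable names are hypothetical.
MYSQL_HOST = os.environ.get("BILI_MYSQL_HOST", "127.0.0.1")
MYSQL_PORT = int(os.environ.get("BILI_MYSQL_PORT", "3306"))
MYSQL_DATABASE = os.environ.get("BILI_MYSQL_DATABASE", "bilibili_users")
MYSQL_USER = os.environ.get("BILI_MYSQL_USER", "root")
MYSQL_PASSWORD = os.environ.get("BILI_MYSQL_PASSWORD", "")

# Politeness settings (example values)
DOWNLOAD_DELAY = 1                  # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # cap on simultaneous requests per domain
```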

main.py

import subprocess


def run_cmd_Popen_fileno(cmd_string):
    """
    Run a shell command and return its exit code; the command's output is
    printed to the Python console.

    :param cmd_string: the command to run, e.g. 'adb devices'
    :return: the exit code of the process
    """
    print('Running command: {}'.format(cmd_string))
    return subprocess.Popen(cmd_string, shell=True).wait()


if __name__ == '__main__':
    run_cmd_Popen_fileno("scrapy crawl upinfo_spider")

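3.3.1 Cursors

When working with a database from Python, the cursor is central. A cursor is the object through which you talk to the database: it executes SQL statements and gives access to their results.

Through a cursor you can run inserts, queries, updates, and deletes. It also provides methods for reading query results, such as fetchone() for a single row and fetchall() for all remaining rows.

In more detail:

Executing multiple SQL statements: one cursor can execute several statements in turn (inserts, updates, deletes, and so on), which are then committed to the database.
Committing transactions: database operations normally run inside a transaction, a sequence of operations that either all succeed or all fail. In pymysql, commit() is a method of the connection rather than the cursor, which is why the pipeline above calls self.connection.commit() after cursor.execute().
Fetching results: after a query, fetchone() returns the next row and fetchall() returns all remaining rows.
Error handling: a failed cursor.execute() raises an exception, for example on a duplicate key or a value that exceeds the column length. Catching it helps locate the problem quickly and respond appropriately.

In short, a cursor is how the code interacts with the database, reads results, and handles errors. Used together with transactions, it keeps database operations under control and helps keep the data safe and consistent.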
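The cursor workflow described above can be tried without a MySQL server. The sketch below uses sqlite3 from the standard library as a stand-in: it follows the same Python DB-API as pymysql, but takes ? placeholders instead of %s, and its cursors are closed here with contextlib.closing. The table layout is a cut-down, hypothetical version of user_official_info.

```python
import sqlite3
from contextlib import closing

# In-memory database so the sketch runs without a server
connection = sqlite3.connect(":memory:")

with closing(connection.cursor()) as cursor:
    cursor.execute(
        "CREATE TABLE user_official_info (up_mid INTEGER PRIMARY KEY, official_title TEXT)"
    )
    # Values are passed separately from the SQL text
    cursor.execute("INSERT INTO user_official_info VALUES (?, ?)", (1, "first"))
    cursor.execute("INSERT INTO user_official_info VALUES (?, ?)", (2, "second"))
connection.commit()  # commit belongs to the connection, not the cursor

with closing(connection.cursor()) as cursor:
    try:
        # Duplicate primary key: execute() raises and the transaction is rolled back
        cursor.execute("INSERT INTO user_official_info VALUES (?, ?)", (1, "dup"))
        connection.commit()
    except sqlite3.IntegrityError:
        connection.rollback()

    cursor.execute("SELECT up_mid, official_title FROM user_official_info ORDER BY up_mid")
    first_row = cursor.fetchone()   # the next single row
    other_rows = cursor.fetchall()  # every row not yet fetched

print(first_row, other_rows)
connection.close()
```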

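3.3.2 Parsers

A parser is a tool or program that parses and processes documents. In a crawler, its main job is to parse page content and extract the data you need.

The common choices are XPath and CSS selectors.

XPath: a query language for navigating XML or HTML documents and extracting data from them. It uses path expressions to locate and select nodes by their position in the hierarchy, by attribute, and so on. Nodes can be matched by name, attribute, or text content, and a rich set of functions and operators supports filtering and extraction. In Scrapy, XPath makes it convenient to pull data out of a page.
CSS selectors: select nodes with the same selector syntax that stylesheets use for page styling. A crawler can borrow that syntax to locate and extract page elements. CSS selectors are more concise than XPath and easier to read and use. Scrapy supports them as well, so data can also be located and extracted with CSS selectors.

Which one to use depends on personal preference and the use case. Both are powerful and flexible, and using them well makes it faster to locate and extract the data you need, which improves the crawler's efficiency and accuracy.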
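To make path expressions concrete, here is a small sketch that uses only the standard library. xml.etree.ElementTree supports a limited subset of XPath and needs well-formed markup, so this illustrates the idea rather than Scrapy's own selectors; the PAGE markup and the extract_ups helper are invented. In a Scrapy callback, roughly the same selection would be written as response.xpath("//li[@class='up']/a/text()").getall() or response.css("li.up a::text").getall().

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample page
PAGE = """
<html>
  <body>
    <ul id="ups">
      <li class="up"><a href="/space/1">Alice</a></li>
      <li class="up"><a href="/space/2">Bob</a></li>
      <li class="ad"><a href="/promo">Sponsored</a></li>
    </ul>
  </body>
</html>
"""


def extract_ups(markup):
    """Select the links inside <li class="up"> and return their text and href."""
    root = ET.fromstring(markup.strip())
    # Path expression: any <li> with class="up", then its direct <a> child
    links = root.findall(".//li[@class='up']/a")
    return [{"name": a.text, "url": a.get("href")} for a in links]


ups = extract_ups(PAGE)
print(ups)
```

The attribute predicate filters out the sponsored entry, which is the kind of structural filtering that both XPath and CSS selectors are used for.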

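To sum up, the crawler above uses the Scrapy framework to download responses, parse them, and extract data. Because the target here is a JSON API, the spider reads fields with response.json(); for HTML pages, XPath and CSS selectors play the same role. Custom request headers shape the outgoing requests, and an Item Pipeline stores the extracted data in MySQL. Together these pieces make for an efficient and extensible web crawler.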
