爬取京东商品评论

打开京东主页,找到需要爬取评论的商品:

image-20200812110125490

右键 检查网页,将记载的文件先清空,找到商品评论处 ,当切换不同页的评论时,可以看到加载出新的文件:

image-20200812110417205

Request URL 复制到浏览器可以看到本页评论的内容;

https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100005207111&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&rid=0&fold=1

url 分析:https://club.jd.com/comment/productPageComments.action? 后面的部分是请求所需的参数,在下面的 Query String Parameters 中也可以看到。在发送请求的时候 将 https://club.jd.com/comment/productPageComments.action? 作为 url, 后面的部分作为参数进行传递。

参数解析:

image-20200812111236090

  • callback: 固定参数
  • productId: 商品 ID
  • score: 不同的评价类型,
    • 0 : 全部评价
    • 1:差评
    • 2:中评
    • 3:好评
  • sortType: 评论排序方式:推荐排序(5),时间排序(6)
  • page: 评论页,通过改变评论页,可以实现不同页面评论的循环爬取
  • pageSize: 固定参数
  • isShadowSku:
  • rid:
  • fold:

测试:

import requests
import json
import re
import pandas as pd
import time

url = 'https://club.jd.com/comment/productPageComments.action?'


for i in range(50):
    # 请求参数
    params = {
        'callback': 'fetchJSON_comment98',  # 固定值
        'productId': '100005207111',        # 产品 ID
        'score': '0',   # 0-5 分别对应不同的评价,好评、中评、全部评价等
        'sortType': '5',    # 排序方式
        'page': i,    # 当前的评论页
        'pageSize': '10',   # 每页评论的数量
        'isShadowSku': '0',
        'rid': '0',
        'fold': '1'
    }

    headers = {
        'Referer': 'https://item.jd.com/100005207111.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
        'Cookie': '__jdu=1568007721111834248469; shshshfpa=6814de79-d912-fb65-2445-e0c1001418be-1568007725; shshshfpb=eK6h%2FTTG3UOQIjI3nubc4rg%3D%3D; pinId=uKm5zMomrBzAmLNKou2uKDqrJX175ugD; unpl=V2_ZzNtbRcFShclDkVTchwOAGIDR1VKV0EWcwxCBnobD1IzURFaclRCFnQUR11nGF4UZgsZX0RcQRNFCEVkexhdBGAAE19BVXMlRQtGZHopXAFgChNcRFFAFXUIRl15HF8AbgYVVXJnRBV8OHYfIl9VAmIAE1hAVEcldg1BUX8bbARXARNcQVBKFXwJRmQwd11IZwcVVENWRRN2CEZUexBeAGQGG1hFX3MVdAlHVXseXQZmMxE%3d; __jdv=122270672|google|tw6|cpc|not set|1596435399183; areaId=12; ipLoc-djd=12-904-3373-0; PCSYCityID=CN_320000_320100_0; __jdc=122270672; __jda=122270672.1568007721111834248469.1568007721.1596435399.1597198601.61; shshshfp=220f18c9406df7935c99c0b11d700617; 3AB9D23F7A4B3C9B=IXUELWJK3WKMGCI6K6VNEMY5JTJRYFIN7UACHEU53PFRGL26M3Z4EPHTURYSBFSZOFLSHWLCALKBM2HL2Z6PVCSWZ4; jwotest_product=99; wlfstk_smdl=ehul02b6y6mrpp3c0io9rgen0mvqvz09; TrackID=1odMbOfz9qDnsQUriX0MI0xuEBMGOvndQJls-Cnyljpewp5cEF-8coPrSiWOAgRyjr8am57HCcwV9xQf_rr3xTPy5hoA-MhDYUFaH7zOoUlO458mlGY3vk0rpwZPB_jrZ; thor=E720FC5049F725A474B2F674576F76565C554ECB4F3557F926E8468A57A9F7319EE2F9D0A1DB99CB745E9A3DD4FB61D813FCC037D93474C5510BE4250A6DA7A54A946F6619D1239AE6A096341CF9A38620295216FAF185554394D80EC04D8B5A7B46EF7FC4618FA645AD46ED4BD7383928B3013693A6F5931E4F5AA7EFCBBA78; pin=%E4%B8%B6%E5%8C%97%E5%9F%8E%E9%B1%BC%E5%AE%89%E7%94%9F; unick=___%E6%A2%A6%E5%AF%90___; ceshi3.com=000; _tp=6KbAcOO79y3rwvtuJvHoHRBSFSbWIqCOxKNAM%2FSmxbwMBsbJ8KsKQWmVHGQC11rU1S2%2FwNGWzgqqKzflVhJMlw%3D%3D; _pst=%E4%B8%B6%E5%8C%97%E5%9F%8E%E9%B1%BC%E5%AE%89%E7%94%9F; shshshsID=8ed6a639183b7cc4d65115ddd4ec2c4b_5_1597199326879; __jdb=122270672.8.1568007721111834248469|61.1597198601; JSESSIONID=C4161CA90DF9781E4CF679DB423DF2D6.s1'
    }

    res = requests.get(url=url,params=params,headers=headers)
    print(res.status_code)

    # 返回的内容并不是规范的json 格式文件, 利用正则表达式将其转化为规范的 json 文件
    html_json = re.search('(?<=fetchJSON_comment98\().*(?=\);)',res.text).group(0)
    j = json.loads(html_json)
    for jj in range(len(j['comments'])):
        data1 = [j['comments'][jj]['content']]
        data2 = pd.DataFrame(data1)
        # print(data1)
        data2.to_csv('jd.csv',header=False,index=False,mode='a+')

    time.sleep(3)
    print('page'+str(1+i)+' has done')

上面是商品的全部评价,当然也可以只爬取当前商品的评价;

刷新网页,检查,清空。 选择只看当前商品评价:

image-20200812113908758

image-20200812114016651

image-20200812114045752

我们同样可以 发送Request URL 的 request 请求,并传递相应的参数获取评论。

测试:

# -*- encoding: utf-8 -*-
"""
@software: PyCharm
@file : jd2.py
@time : 2020/8/12 
"""
import requests
import json
import re
import pandas as pd
import time

url = 'https://club.jd.com/comment/skuProductPageComments.action?'


for i in range(25):
    # 请求参数, 与全部评价相比少了一个 'rid' 参数
    params = {
        'callback': 'fetchJSON_comment98',  # 固定值
        'productId': '100005207111',        # 产品 ID
        'score': '0',   # 0-5 分别对应不同的评价,好评、中评、全部评价等
        'sortType': '5',    # 排序方式
        'page': i,    # 当前的评论页
        'pageSize': '10',   # 每页评论的数量
        'isShadowSku': '0',
        'fold': '1'
    }

    headers = {
        'Referer': 'https://item.jd.com/100005207111.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
        'Cookie': '__jdu=1568007721111834248469; shshshfpa=6814de79-d912-fb65-2445-e0c1001418be-1568007725; shshshfpb=eK6h%2FTTG3UOQIjI3nubc4rg%3D%3D; pinId=uKm5zMomrBzAmLNKou2uKDqrJX175ugD; unpl=V2_ZzNtbRcFShclDkVTchwOAGIDR1VKV0EWcwxCBnobD1IzURFaclRCFnQUR11nGF4UZgsZX0RcQRNFCEVkexhdBGAAE19BVXMlRQtGZHopXAFgChNcRFFAFXUIRl15HF8AbgYVVXJnRBV8OHYfIl9VAmIAE1hAVEcldg1BUX8bbARXARNcQVBKFXwJRmQwd11IZwcVVENWRRN2CEZUexBeAGQGG1hFX3MVdAlHVXseXQZmMxE%3d; __jdv=122270672|google|tw6|cpc|not set|1596435399183; areaId=12; ipLoc-djd=12-904-3373-0; PCSYCityID=CN_320000_320100_0; __jdc=122270672; __jda=122270672.1568007721111834248469.1568007721.1596435399.1597198601.61; shshshfp=220f18c9406df7935c99c0b11d700617; 3AB9D23F7A4B3C9B=IXUELWJK3WKMGCI6K6VNEMY5JTJRYFIN7UACHEU53PFRGL26M3Z4EPHTURYSBFSZOFLSHWLCALKBM2HL2Z6PVCSWZ4; jwotest_product=99; wlfstk_smdl=ehul02b6y6mrpp3c0io9rgen0mvqvz09; TrackID=1odMbOfz9qDnsQUriX0MI0xuEBMGOvndQJls-Cnyljpewp5cEF-8coPrSiWOAgRyjr8am57HCcwV9xQf_rr3xTPy5hoA-MhDYUFaH7zOoUlO458mlGY3vk0rpwZPB_jrZ; thor=E720FC5049F725A474B2F674576F76565C554ECB4F3557F926E8468A57A9F7319EE2F9D0A1DB99CB745E9A3DD4FB61D813FCC037D93474C5510BE4250A6DA7A54A946F6619D1239AE6A096341CF9A38620295216FAF185554394D80EC04D8B5A7B46EF7FC4618FA645AD46ED4BD7383928B3013693A6F5931E4F5AA7EFCBBA78; pin=%E4%B8%B6%E5%8C%97%E5%9F%8E%E9%B1%BC%E5%AE%89%E7%94%9F; unick=___%E6%A2%A6%E5%AF%90___; ceshi3.com=000; _tp=6KbAcOO79y3rwvtuJvHoHRBSFSbWIqCOxKNAM%2FSmxbwMBsbJ8KsKQWmVHGQC11rU1S2%2FwNGWzgqqKzflVhJMlw%3D%3D; _pst=%E4%B8%B6%E5%8C%97%E5%9F%8E%E9%B1%BC%E5%AE%89%E7%94%9F; shshshsID=8ed6a639183b7cc4d65115ddd4ec2c4b_5_1597199326879; __jdb=122270672.8.1568007721111834248469|61.1597198601; JSESSIONID=C4161CA90DF9781E4CF679DB423DF2D6.s1'
    }

    res = requests.get(url=url,params=params,headers=headers)
    print(res.status_code)

    # 返回的内容并不是规范的json 格式文件, 利用正则表达式将其转化为规范的 json 文件
    html_json = re.search('(?<=fetchJSON_comment98\().*(?=\);)',res.text).group(0)
    j = json.loads(html_json)
    for jj in range(len(j['comments'])):
        data1 = [j['comments'][jj]['content']]
        data2 = pd.DataFrame(data1)
        # print(data1)
        data2.to_csv('jd_dq.csv',header=False,index=False,mode='a+')

    time.sleep(3)
    print('page'+str(1+i)+'has done')

image-20200812115202981

参考

Update time: 2020-08-15

results matching ""

    No results matching ""