1. Background
When scraping web pages, you sometimes need to use scrapy.FormRequest to submit data to the target site (a form submission). Following the scrapy official documentation, the standard way to write this is:
# Header information
unicornHeader = {
    'Host': 'www.example.com',
    'Referer': 'http://www.example.com/',
}
# Data to be submitted with the form
myFormData = {'name': 'John Doe', 'age': '27'}
# Custom data, passed down to the response
customerData = {'key1': 'value1', 'key2': 'value2'}

yield scrapy.FormRequest(
    url = "http://www.example.com/post/action",
    headers = unicornHeader,
    method = 'POST',              # GET or POST
    formdata = myFormData,        # data submitted with the form
    meta = customerData,          # custom data passed on to the response
    callback = self.after_post,
    errback = self.error_handle,
    # If the same URL is submitted to more than once, dont_filter must be set,
    # otherwise the later requests are filtered out as duplicates
    dont_filter = True
)
But what if the form data myFormData is a dict nested inside a dict? How should it be written then?
2. Case study: a dict-valued parameter
While scraping Amazon, I found that when you enter a seller's storefront and crawl the list of products in that store, the page loads the list with an ajax request that returns JSON data.
The request and response information (screenshots omitted here) show that the Form Data of the request contains a dictionary:
marketplaceID:ATVPDKIKX0DER
seller:A2FE6D62A4WM6Q
productSearchRequestData:{"marketplace":"ATVPDKIKX0DER","seller":"A2FE6D62A4WM6Q","url":"/sp/ajax/products","pageSize":12,"searchKeyword":"","extraRestrictions":{},"pageNumber":"1"}

# formdata must be constructed as follows:
myFormData = {
    'marketplaceID': 'ATVPDKIKX0DER',
    'seller': 'A2FE6D62A4WM6Q',
    # Note the next line: the inner dict is passed as a single string
    'productSearchRequestData': '{"marketplace":"ATVPDKIKX0DER","seller":"A2FE6D62A4WM6Q","url":"/sp/ajax/products","pageSize":12,"searchKeyword":"","extraRestrictions":{},"pageNumber":"1"}'
}
The construction actually used for the Amazon crawl is as follows:
def sendRequestForProducts(self, response):
    ajaxParam = response.meta
    for pageIdx in range(1, ajaxParam['totalPageNum'] + 1):
        ajaxParam['isFirstAjax'] = False
        ajaxParam['pageNumber'] = pageIdx
        unicornHeader = {
            'Host': 'www.amazon.com',
            'Origin': 'https://www.amazon.com',
            'Referer': ajaxParam['referUrl'],
        }
        '''
        Form Data observed in the browser:
        marketplaceID:ATVPDKIKX0DER
        seller:AYZQAQRQKEXRP
        productSearchRequestData:{"marketplace":"ATVPDKIKX0DER","seller":"AYZQAQRQKEXRP","url":"/sp/ajax/products","pageSize":12,"searchKeyword":"","extraRestrictions":{},"pageNumber":1}
        '''
        # The inner dict is assembled as one string
        productSearchRequestData = ('{"marketplace": "ATVPDKIKX0DER", "seller": "' + ajaxParam["sellerID"]
                                    + '", "url": "/sp/ajax/products", "pageSize": 12, "searchKeyword": "", '
                                    + '"extraRestrictions": {}, "pageNumber": "' + str(pageIdx) + '"}')
        formdataProduct = {
            'marketplaceID': ajaxParam['marketplaceID'],
            'seller': ajaxParam['sellerID'],
            'productSearchRequestData': productSearchRequestData
        }
        productAjaxMeta = ajaxParam
        # Request the product list of the store
        yield scrapy.FormRequest(
            url = 'https://www.amazon.com/sp/ajax/products',
            headers = unicornHeader,
            formdata = formdataProduct,
            method = 'POST',
            meta = productAjaxMeta,
            callback = self.solderProductAjax,
            errback = self.error,      # handle HTTP errors
            dont_filter = True,        # this parameter is required here
        )
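Instead of concatenating the JSON string by hand, the inner dict can also be built as a normal dict and serialized with json.dumps. A minimal sketch, reusing the names from the code above (ajaxParam, pageIdx) and assuming the server accepts the same field layout:

import json

# Build the nested payload as a dict, then serialize it to a JSON string so
# that formdata only ever contains string values.
productSearchRequestData = json.dumps({
    "marketplace": "ATVPDKIKX0DER",
    "seller": ajaxParam["sellerID"],
    "url": "/sp/ajax/products",
    "pageSize": 12,
    "searchKeyword": "",
    "extraRestrictions": {},
    "pageNumber": str(pageIdx),
})

formdataProduct = {
    'marketplaceID': ajaxParam['marketplaceID'],
    'seller': ajaxParam['sellerID'],
    'productSearchRequestData': productSearchRequestData,
}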
3. How it works
As an example, suppose we have the following piece of data:
formdata = {
    'Field': {"pageIdx": 99, "size": "10"},
    'func': 'nextPage',
}
In the browser, the request payload sent by the page looks like this:
Field=%7B%22pageIdx%22%3A99%2C%22size%22%3A%2210%22%7D&func=nextPage
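Decoding that payload with the standard library confirms that the value of Field is the JSON text of the dict, not the dict itself (a quick check, independent of scrapy):

from urllib.parse import unquote, parse_qs

payload = 'Field=%7B%22pageIdx%22%3A99%2C%22size%22%3A%2210%22%7D&func=nextPage'

print(unquote(payload))
# Field={"pageIdx":99,"size":"10"}&func=nextPage

print(parse_qs(payload))
# {'Field': ['{"pageIdx":99,"size":"10"}'], 'func': ['nextPage']}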
First approach: sending the request as follows gives the correct result:
yield scrapy.FormRequest(
    url = 'https://www.example.com/sp/ajax',
    headers = unicornHeader,
    formdata = {
        'Field': '{"pageIdx":99, "size":"10"}',
        'func': 'nextPage',
    },
    method = 'POST',
    callback = self.handleFunc,
)
# Request body sent: Field=%7B%22pageIdx%22%3A99%2C%22size%22%3A%2210%22%7D&func=nextPage
Second approach: sending the request as follows is wrong; the correct data cannot be retrieved:
yield scrapy.FormRequest(
    url = 'https://www.example.com/sp/ajax',
    headers = unicornHeader,
    formdata = {
        'Field': {"pageIdx": 99, "size": "10"},
        'func': 'nextPage',
    },
    method = 'POST',
    callback = self.handleFunc,
)
# After the incorrect encoding, the request actually sent is: Field=size&Field=pageIdx&func=nextPage
Let's step into the scrapy source code to see why:
# E:/Miniconda/Lib/site-packages/scrapy/http/request/form.py
# FormRequest
class FormRequest(Request):

    def __init__(self, *args, **kwargs):
        formdata = kwargs.pop('formdata', None)
        if formdata and kwargs.get('method') is None:
            kwargs['method'] = 'POST'

        super(FormRequest, self).__init__(*args, **kwargs)

        if formdata:
            items = formdata.items() if isinstance(formdata, dict) else formdata
            querystr = _urlencode(items, self.encoding)
            if self.method == 'POST':
                self.headers.setdefault(b'Content-Type',
                                        b'application/x-www-form-urlencoded')
                self._set_body(querystr)
            else:
                self._set_url(self.url + ('&' if '?' in self.url else '?') + querystr)


# The key function: _urlencode
def _urlencode(seq, enc):
    values = [(to_bytes(k, enc), to_bytes(v, enc))
              for k, vs in seq
              for v in (vs if is_listlike(vs) else [vs])]
    return urlencode(values, doseq=1)
The analysis proceeds as follows:
# Step 1: items = formdata.items() if isinstance(formdata, dict) else formdata
# After items() runs, the original dict becomes the following list-like view:
dict_items([('func', 'nextPage'), ('Field', {'size': '10', 'pageIdx': 99})])

# Step 2: _urlencode then turns these items into:
[(b'func', b'nextPage'), (b'Field', b'size'), (b'Field', b'pageIdx')]

So the problem appears in the call to _urlencode: after it runs, a dict-valued field keeps only keys (when a value is itself a dict, only the keys of that inner dict survive and its values are silently dropped).
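The same behaviour can be reproduced outside scrapy. A minimal sketch that mimics the is_listlike branch of _urlencode using only the standard library:

from urllib.parse import urlencode

formdata = {'Field': {"pageIdx": 99, "size": "10"}, 'func': 'nextPage'}

# A dict has __iter__ and is not str/bytes, so _urlencode treats it as
# list-like and iterates over it, which yields only its keys.
pairs = [(k, v)
         for k, vs in formdata.items()
         for v in (vs if hasattr(vs, '__iter__') and not isinstance(vs, (str, bytes)) else [vs])]

print(urlencode(pairs, doseq=1))
# e.g. Field=pageIdx&Field=size&func=nextPage -- the inner dict's values are lost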
Solution: treat the dict as an ordinary string, encode it (convert it to bytes), and send it as such; the server decodes it back into the dict string and then parses that string as a dict.
Extension: other special data types can be packed into strings and transmitted the same way.
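For example, a list-valued field can be packed the same way (the field names below are made up purely for illustration):

import json

# Hypothetical example: a field whose value is a list of ids.
# Serializing it keeps every formdata value a plain string.
formdata = {
    'func': 'batchDelete',
    'ids': json.dumps([101, 102, 103]),   # sent as the string "[101, 102, 103]"
}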
4. Supplement 1: parameter types
The values in formdata must be unicode, str, or bytes objects; an integer is not accepted.
Example:
yield FormRequest(
    url = 'https://www.amztracker.com/unicorn.php',
    headers = unicornHeader,
    # the values in formdata must be strings
    formdata = {'rank': 10, 'category': productDetailInfo['topCategory']},
    method = 'GET',
    meta = {'productDetailInfo': productDetailInfo},
    callback = self.amztrackerSale,
    errback = self.error,      # in this project the vast majority of failures end up here
    dont_filter = True,        # in principle this parameter should not be needed
)

# The following ERROR is raised:
Traceback (most recent call last):
  File "E:\Miniconda\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "E:\Miniconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "E:\Miniconda\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in
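The fix is simply to convert the non-string value before it goes into formdata, for example:

# Cast the integer rank to str so every formdata value is a string.
formdata = {'rank': str(10), 'category': productDetailInfo['topCategory']}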