python - Scrapy error processing URL -


Hi, I'm new to Python and Scrapy. I'm trying to code a spider but I can't find the error, or a solution for the error, while processing the starting URL. I don't know if it's a problem with the XPath or something else. Most of the threads I found talk about wrong indentation, but that's not the case here. The code:

import scrapy
from scrapy.exceptions import CloseSpider
from scrapy_crawls.items import Vino


class BodebocaSpider(scrapy.Spider):
    name = "bodeboca"
    allowed_domains = ["bodeboca.com"]
    start_urls = (
        'http://www.bodeboca.com/vino/espana',
    )
    counter = 1
    next_url = ""

    vino = None

    def __init__(self):
        self.next_url = self.start_urls[0]

    def parse(self, response):
        for sel in response.xpath(
                '//div[@id="venta-main-wrapper"]/div[@id="venta-main"]/div/div/div/div/div/div/span'):

            #print sel
            # href
            a_href = sel.xpath('.//a/@href').extract()
            the_href = a_href[0]
            print the_href
            yield scrapy.Request(the_href, callback=self.parse_item,
                                 headers={'Referer': response.url.encode('utf-8'),
                                          'Accept-Language': 'es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3'})

        # siguiente url (next url)
        results = response.xpath(
            '//div[@id="wrapper"]/article/div[@id="article-inner"]/div[@id="default-filter-form-wrapper"]/div[@id="venta-main-wrapper"]/div[@class="bb-product-info-sort bb-sort-behavior-attached"]/div[@clsas="bb-product-info"]/span[@class="bb-product-info-count"]').extract()

        if not results:
            raise CloseSpider
        else:
            #self.next_url = self.next_url.replace(str(self.counter), str(self.counter + 1))
            #self.counter += 1
            self.next_url = response.xpath('//div[@id="venta-main-wrapper"]/div[@class="item-list"]/ul[@class="pager"]/li[@class="pager-next"]/a/@href').extract()[0]
            yield scrapy.Request(self.next_url, callback=self.parse,
                                 headers={'Referer': self.allowed_domains[0],
                                          'Accept-Language': 'es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3'})

The error:

2017-03-28 12:29:08 [scrapy.core.engine] INFO: Spider opened
2017-03-28 12:29:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-28 12:29:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.bodeboca.com/robots.txt> (referer: None)
2017-03-28 12:29:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.bodeboca.com/vino/espana> (referer: None)
/vino/terra-cuques-2014
2017-03-28 12:29:08 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.bodeboca.com/vino/espana> (referer: None)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/gerardo/proyectos/vinos-diferentes-crawl/scrapy_crawls/spiders/bodeboca.py", line 36, in parse
    'Accept-Language': 'es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3'})
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: /vino/terra-cuques-2014
2017-03-28 12:29:08 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-28 12:29:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 449,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 38558,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 3, 28, 10, 29, 8, 951654),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/ValueError': 1,
 'start_time': datetime.datetime(2017, 3, 28, 10, 29, 8, 690948)}
2017-03-28 12:29:08 [scrapy.core.engine] INFO: Spider closed (finished)

The simple answer: you are extracting a page-relative URL, e.g. /vino/terra-cuques-2014.
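You can see why Scrapy rejects it by parsing the scraped href with the standard library: a relative URL has no scheme and no host, which is exactly what the "Missing scheme in request url" ValueError complains about. A minimal check (stdlib only, Python 3 syntax):

```python
from urllib.parse import urlparse

# The href scraped from the page is relative; it carries only a path.
relative = "/vino/terra-cuques-2014"
parts = urlparse(relative)

# No scheme ("http") and no network location ("www.bodeboca.com"),
# so scrapy.Request has nothing to send the request to.
```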

In order to make a Scrapy Request, the URL needs to be a full one: http://www.bodeboca.com/vino/terra-cuques-2014. You can build the full URL using Scrapy's response.urljoin() method, e.g.: full_url = response.urljoin(url).
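response.urljoin(url) resolves the relative href against response.url. The standard library's urljoin shows the same resolution, using the URLs from the question (a sketch for illustration; inside a spider you would call response.urljoin directly):

```python
from urllib.parse import urljoin

# Resolve the relative href against the page it was scraped from,
# mirroring what response.urljoin(url) does inside the spider.
base = "http://www.bodeboca.com/vino/espana"
full_url = urljoin(base, "/vino/terra-cuques-2014")
```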

Also, try not to use XPath expressions like /div[@id="venta-main"]/div/div/div/div/div/div/span - they are hard to read and easily broken by the slightest change in the page. Instead, you can use an XPath based on a class: //a[@class="verficha"].
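The point about class-based matching can be illustrated with the standard library alone: matching anchors by a stable class attribute keeps working no matter how many wrapper divs the page nests them in, whereas a deep positional XPath breaks as soon as one level changes. A small stdlib sketch (the sample HTML is made up; the class name "verficha" is the one from the page):

```python
from html.parser import HTMLParser

class VerfichaLinkParser(HTMLParser):
    """Collect hrefs of <a class="verficha"> regardless of nesting depth."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "verficha":
            self.hrefs.append(attrs.get("href"))

# Deeply nested wrappers do not matter: we match on the class, not the path.
sample = ('<div><div><div><span>'
          '<a class="verficha" href="/vino/terra-cuques-2014">Ver ficha</a>'
          '</span></div></div></div>')
parser = VerfichaLinkParser()
parser.feed(sample)
```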

You can rewrite that part of the spider like this:

def parse(self, response):
    links = response.xpath('//a[@class="verficha"]')
    for link in links:
        url = link.xpath('@href').extract_first()
        full_url = response.urljoin(url)
        yield scrapy.Request(full_url, callback=callback)  # set callback to your item handler

If you want to extract the URL of the next page, you can use the XPath next_page = response.xpath('//li[@class="pager-next"]/a/@href').extract_first(), then again call response.urljoin(next_page), etc.
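extract_first() returns None when the pager has no next link, which gives a natural stop condition and avoids both the IndexError from extract()[0] and the need for the CloseSpider check in the original code. A small helper sketching that logic (the function name and structure are my own, not from Scrapy):

```python
from urllib.parse import urljoin

def next_page_url(current_url, next_page_href):
    """Return the absolute URL of the next listing page,
    or None when the pager yielded no href (end of the listing)."""
    if not next_page_href:
        return None
    return urljoin(current_url, next_page_href)

# Mid-listing: a relative pager href resolves to a full URL.
mid = next_page_url("http://www.bodeboca.com/vino/espana", "/vino/espana?page=1")
# Last page: extract_first() returned None, so pagination stops.
end = next_page_url("http://www.bodeboca.com/vino/espana", None)
```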

