Response object in Scrapy is not complete


Sorry if this question is stupid; I couldn't find an answer yet.

I am trying to prepare a script to extract data from a website using the "scrapy shell" command, but the response object I get back is not complete.

What I suspect is that the web server first serves the static data and then fills in the dynamic data in the page. I guess this is managed through JavaScript on the web page.

If my understanding is correct, what needs to happen is that Scrapy needs to wait a little bit before returning the result.

Could someone help me here?

Thanks!

Here is a working example using Selenium and PhantomJS as a headless webdriver in a downloader middleware.

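    from scrapy.http import HtmlResponse
    from selenium import webdriver


    class JsDownload(object):

        @check_spider_middleware
        def process_request(self, request, spider):
            # Load the page in headless PhantomJS so the JavaScript gets executed.
            driver = webdriver.PhantomJS(executable_path=r'd:\phantomjs.exe')
            driver.get(request.url)
            body = driver.page_source.encode('utf-8')
            driver.quit()  # shut the browser down so processes do not pile up
            return HtmlResponse(request.url, encoding='utf-8', body=body)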

I wanted the ability to tell different spiders which middleware to use, so I implemented this wrapper:

    import functools
    import logging


    def check_spider_middleware(method):
        @functools.wraps(method)
        def wrapper(self, request, spider):
            msg = '%%s %s middleware step' % (self.__class__.__name__,)
            # Only run the step for spiders that list this middleware class.
            if self.__class__ in spider.middleware:
                spider.log(msg % 'executing', level=logging.DEBUG)
                return method(self, request, spider)
            else:
                spider.log(msg % 'skipping', level=logging.DEBUG)
                return None

        return wrapper

settings.py:

    DOWNLOADER_MIDDLEWARES = {'myproj.middleware.middlewaremodule.middlewareclass': 500}

For the wrapper to work, all spiders must have at a minimum:

    middleware = set([])

To include a middleware:

    middleware = set([myproj.middleware.modulename.classname])
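To make the opt-in behaviour concrete, here is a minimal sketch that runs without Scrapy or Selenium. `FakeJsDownload`, `PlainSpider` and `JsSpider` are hypothetical stand-ins, not classes from the answer, and the `getattr` default is an added assumption so that spiders without a `middleware` attribute are simply skipped. Logging is left out to keep it short.

```python
import functools


def check_spider_middleware(method):
    """Run the wrapped middleware step only for spiders that opt in."""
    @functools.wraps(method)
    def wrapper(self, request, spider):
        # Assumption: a spider without a `middleware` attribute opts out.
        if self.__class__ in getattr(spider, 'middleware', set()):
            return method(self, request, spider)
        # Returning None lets Scrapy carry on with its normal download.
        return None
    return wrapper


class FakeJsDownload(object):
    """Hypothetical stand-in for the Selenium-backed middleware."""

    @check_spider_middleware
    def process_request(self, request, spider):
        return 'rendered:' + request


class PlainSpider(object):
    middleware = set([])  # opts in to nothing


class JsSpider(object):
    middleware = set([FakeJsDownload])  # opts in to the JS middleware


mw = FakeJsDownload()
skipped = mw.process_request('http://example.com', PlainSpider())
rendered = mw.process_request('http://example.com', JsSpider())
```

Here `skipped` ends up as `None`, while `rendered` holds the stand-in result, which is the same gate the real middleware goes through for each request.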

You could have implemented this in a request callback (in the spider), but then the HTTP request would be happening twice. This isn't a foolproof solution, but it works for stuff that loads on .ready(). If you spend some time reading up on Selenium, you can wait for a specific event to trigger before saving the page source.
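As a rough illustration of waiting before saving the page source, here is a hand-rolled polling loop. It is a sketch, not Selenium's own mechanism: Selenium ships explicit waits (`WebDriverWait`) for this purpose, and the helper name `page_source_when_present`, its parameters and the defaults are all made up for the example. It only assumes the driver exposes `find_elements(by, value)` and `page_source`, as Selenium webdrivers do.

```python
import time


def page_source_when_present(driver, css_selector, timeout=10.0, poll=0.5):
    """Poll until css_selector matches something, then return the page source.

    Raises TimeoutError if nothing matches within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
        if driver.find_elements("css selector", css_selector):
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError('no element matched %r' % css_selector)
        time.sleep(poll)
```

In the middleware, a call such as `page_source_when_present(driver, '#results')` could take the place of reading `driver.page_source` directly, where `'#results'` is a placeholder for whatever element the page's JavaScript fills in last.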

Another example: https://github.com/scrapinghub/scrapyjs

Cheers!

