Response object in Scrapy is not complete
Sorry if this is a stupid question; I couldn't find an answer yet.

I am trying to prepare a script that extracts data from a website, using the "scrapy shell" command.

When I enter the URL in a web browser (e.g. "http://www.testsite.com/data_to_extract"), I see the data I want to extract. The page contains static data plus dynamic data.

When I run the command "scrapy shell http://www.testsite.com/data_to_extract" and then issue the command "view(response)", I see in the web browser only the static data of the page, not the dynamic data.

What I suspect is that the web server serves the static data first and then fills in the dynamic data on the page. I guess this is managed through JavaScript on the web page.

If my understanding is correct, Scrapy needs to wait a little bit before returning the result.

Could anybody help me here?

Thanks!
Here is a working example using Selenium and the PhantomJS headless webdriver in a download handler middleware.
    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class JsDownload(object):

        @check_spider_middleware
        def process_request(self, request, spider):
            # render the page in PhantomJS so JavaScript-filled content is present
            driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs.exe')
            driver.get(request.url)
            return HtmlResponse(request.url, encoding='utf-8',
                                body=driver.page_source.encode('utf-8'))
I wanted the ability to tell different spiders which middleware to use, so I implemented this wrapper:
    import functools
    from scrapy import log

    def check_spider_middleware(method):
        @functools.wraps(method)
        def wrapper(self, request, spider):
            msg = '%%s %s middleware step' % (self.__class__.__name__,)
            if self.__class__ in spider.middleware:
                spider.log(msg % 'executing', level=log.DEBUG)
                return method(self, request, spider)
            else:
                spider.log(msg % 'skipping', level=log.DEBUG)
                return None
        return wrapper
settings.py:
    DOWNLOADER_MIDDLEWARES = {'myproj.middleware.middlewaremodule.MiddlewareClass': 500}
For the wrapper to work, all spiders must have at a minimum:
middleware = set([])
To include a middleware:
    middleware = set([myproj.middleware.modulename.ClassName])
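As a sketch of how the decorator and the per-spider `middleware` set interact, the skip/execute logic can be exercised without Scrapy at all. `FakeSpider` is a hypothetical stand-in for a real spider, and this `JsDownload` is a simplified demo class, not the full middleware above:

```python
import functools

class FakeSpider:
    """Hypothetical stand-in for a Scrapy spider carrying the `middleware` set."""
    def __init__(self, middleware=()):
        self.middleware = set(middleware)
        self.messages = []

    def log(self, msg, level=None):
        self.messages.append(msg)

def check_spider_middleware(method):
    @functools.wraps(method)
    def wrapper(self, request, spider):
        msg = '%%s %s middleware step' % (self.__class__.__name__,)
        if self.__class__ in spider.middleware:
            # the spider opted in to this middleware class, so run it
            spider.log(msg % 'executing')
            return method(self, request, spider)
        # spider did not opt in: skip and let Scrapy fall through
        spider.log(msg % 'skipping')
        return None
    return wrapper

class JsDownload:
    @check_spider_middleware
    def process_request(self, request, spider):
        return 'downloaded:%s' % request

# A spider that opts in gets the middleware; one that doesn't is skipped.
enabled = FakeSpider(middleware={JsDownload})
disabled = FakeSpider()
mw = JsDownload()
mw.process_request('http://example.com', enabled)   # executes
mw.process_request('http://example.com', disabled)  # skipped, returns None
```

Returning `None` from `process_request` is what lets Scrapy continue with its normal download path for spiders that did not opt in.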
You could have implemented this in a request callback (in the spider), but then the HTTP request would happen twice. This isn't a foolproof solution, but it works for content that loads on .ready(). If you spend some time reading up on Selenium, you can wait for specific events to trigger before saving the page source.
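"Waiting for a specific event" boils down to polling a condition until it becomes truthy or a timeout expires; Selenium packages this as `WebDriverWait(...).until(...)`. A plain-Python sketch of the same pattern (`wait_for` is a hypothetical helper, not a Selenium API) makes the mechanism clear without needing a browser:

```python
import time

def wait_for(condition, timeout=5.0, poll=0.1):
    """Poll `condition` until it returns a truthy value, or raise TimeoutError.

    This mirrors the loop WebDriverWait(driver, timeout).until(...) runs.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1fs' % timeout)
        time.sleep(poll)

# With a real Selenium driver you would poll the rendered DOM, e.g.:
#   wait_for(lambda: driver.find_elements_by_css_selector('#dynamic-data'))
#   body = driver.page_source   # now includes the JavaScript-filled content
```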
Another example: https://github.com/scrapinghub/scrapyjs

Cheers!