java - HtmlUnit fails when it tries to load dead JavaScript links. Is there a way to tell it not to load specific URLs?
I'm trying to do a little scraping on this site to programmatically find polling info. I tried Python first, and it worked great for loading the site and navigating around the ASPX forms, but I couldn't extract the embedded map data (since no Python packages, as of yet, handle JavaScript). So I've opted to dust off my Java skills and break out HtmlUnit. However, I instantly hit a snag.
It appears as though there are dead links to JavaScript files on the site, files that don't exist. When HtmlUnit tries to load them, it gets a 404 and self-destructs.
The specific error:
Jul 21, 2013 9:51:22 PM com.gargoylesoftware.htmlunit.html.HtmlPage loadExternalJavaScriptFile
SEVERE: Error loading JavaScript [http://www.eci-polldaymonitoring.nic.in/psl/googlemapforaspnet.ascx/jsdebug].
com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException: 404 Not Found for http://www.eci-polldaymonitoring.nic.in/psl/googlemapforaspnet.ascx/jsdebug
    at com.gargoylesoftware.htmlunit.WebClient.throwFailingHttpStatusCodeExceptionIfNecessary(WebClient.java:544)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadJavaScriptFromUrl(HtmlPage.java:1119)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:1059)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:399)
    at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:260)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:276)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:676)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:635)
    at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072)
    at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206)
    at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3074)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2041)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:918)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:892)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:241)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:187)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:268)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:156)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:434)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:309)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:374)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:359)
    at ScrapeTest$.main(ScrapeTest.scala:12)
    at ScrapeTest.main(ScrapeTest.scala)
Is there a way to tell it to either (a) ignore 404 errors completely, or (b) ignore specific JavaScript URLs?
My code so far (Scala):
import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.html.HtmlPage

object ScrapeTest {
  def main(args: Array[String]): Unit = {
    val pageUrl = "http://www.eci-polldaymonitoring.nic.in/psl/"
    val client = new WebClient(BrowserVersion.INTERNET_EXPLORER_8)
    val response: HtmlPage = client.getPage(pageUrl)
    println(response.asText())
  }
}
A brief look at the HtmlUnit Javadoc seems to indicate that you should be able to use WebClientOptions#setExceptionOnFailingStatusCode(boolean).
e.g.,
import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.html.HtmlPage

object ScrapeTest {
  def main(args: Array[String]): Unit = {
    val pageUrl = "http://www.eci-polldaymonitoring.nic.in/psl/"
    val client = new WebClient(BrowserVersion.INTERNET_EXPLORER_8)
    // Don't throw an exception on a failing status code
    client.getOptions.setExceptionOnFailingStatusCode(false)
    val response: HtmlPage = client.getPage(pageUrl)
    println(response.asText())
  }
}
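With that option disabled, HtmlUnit still logs the 404 (the SEVERE line above) but keeps parsing the page instead of throwing. If the log noise bothers you, here is a sketch for silencing it, assuming the default setup where HtmlUnit's logging falls through to java.util.logging (which matches the log format in the error above):

import java.util.logging.{Level, Logger}

// Turn off HtmlUnit's java.util.logging output entirely.
Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF)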
If that doesn't work, there are a couple of other things worth trying, sketched below.
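First, HtmlUnit throws on JavaScript execution errors as well as on failing status codes, and a broken script can trip that check even after the 404 itself is tolerated. WebClientOptions has a separate switch for it:

// Also tolerate JavaScript execution errors.
client.getOptions.setThrowExceptionOnScriptError(false)

Second, for option (b), skipping specific URLs outright, HtmlUnit lets you intercept every request with a WebConnectionWrapper. This is a minimal sketch, not from the original answer; the /jsdebug suffix check is an assumption based on the failing URL above:

import com.gargoylesoftware.htmlunit.{StringWebResponse, WebRequest, WebResponse}
import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper

// Constructing the wrapper with the client installs it as the
// client's web connection.
new WebConnectionWrapper(client) {
  override def getResponse(request: WebRequest): WebResponse = {
    if (request.getUrl.toString.endsWith("/jsdebug"))
      new StringWebResponse("", request.getUrl) // serve an empty script instead of the 404
    else
      super.getResponse(request)
  }
}

This answers the dead script URL with an empty body before the request ever goes over the wire, so HtmlUnit never sees the 404 at all.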