python - BeautifulSoup replaces single quotes with double quotes
In BeautifulSoup 4 (Python), if I execute the following commands:

    soup = BeautifulSoup("<a href='http://somelink'>link</a>")
    print soup

the output is:

    <a href="http://somelink">link</a>

BeautifulSoup replaces the single quotes with double quotes, and I don't want that. How can I cancel or override this behaviour?
Clarification:
I use urllib2 to get the HTML of the following page: http://www.download3000.com/ and then use BeautifulSoup 4 to extract part of that HTML.
I have made a function that takes a document (not necessarily HTML) and samples of what it needs to catch, and returns a regular expression. I feed the function the following samples:
    samples = [
        '/showarticles-1-0-date.html',
        '/showarticles-2-0-date.html',
        '/showarticles-3-0-date.html'
    ]

Given the HTML code of the http://www.download3000.com/ page and the samples above, the function returns the following regular expression:

    \w\w><li><a href="(.*?)">\w\w\w\w\w
If I apply this regex to the original HTML code of download3000, it won't find any match. That's because the links are surrounded by single quotes in the original HTML, but BeautifulSoup replaces the single quotes with double quotes, so the generated regular expression only works on the HTML as modified by BeautifulSoup.
That's why I need to force BeautifulSoup not to replace single quotes with double quotes, so that the generated regular expression is \w\w><li><a href='(.*?)'>\w\w\w\w\w, which extracts what I need from the page.
I could use a dumb solution and replace the quote characters in the regex with ["\'], but then the regex would also catch links I don't want.
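To make the mismatch concrete, here is a minimal sketch using only the standard `re` module. The two HTML snippets are invented for illustration (they are not taken from the real page), and the names `strict` and `relaxed` are hypothetical. It shows the double-quoted pattern failing on single-quoted markup, and the `["']` character-class workaround matching both forms.

```python
import re

# Hypothetical snippets, invented for illustration only: the same link as it
# might appear in the raw page (single quotes) and after BeautifulSoup has
# re-serialized it (double quotes).
raw_html = "<ul><li><a href='/showarticles-1-0-date.html'>Title</a></li></ul>"
soup_html = '<ul><li><a href="/showarticles-1-0-date.html">Title</a></li></ul>'

# The regex generated from BeautifulSoup's output: double quotes only.
strict = re.compile(r'\w\w><li><a href="(.*?)">\w\w\w\w\w')

# The workaround: accept either quote character around the attribute value.
relaxed = re.compile(r'''\w\w><li><a href=["'](.*?)["']>\w\w\w\w\w''')

strict_on_raw = strict.search(raw_html)
strict_on_soup = strict.search(soup_html)
relaxed_on_raw = relaxed.search(raw_html)
relaxed_on_soup = relaxed.search(soup_html)

print(strict_on_raw)              # None: the raw page uses single quotes
print(strict_on_soup.group(1))    # /showarticles-1-0-date.html
print(relaxed_on_raw.group(1))    # /showarticles-1-0-date.html
print(relaxed_on_soup.group(1))   # /showarticles-1-0-date.html
```

The relaxed pattern is looser by construction, which is exactly the drawback described above: it may also match links that the strict, single-quoted pattern would have excluded.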
This works with BeautifulSoup 3.2. I assume what is happening is that the single quotes are converted to &quot; and the parser surrounds them with " on output, so that the pattern "' or '" occurs:

    >>> c="<a href='http://somelink'>"
    >>> from BeautifulSoup import BeautifulSoup
    >>> import re
    >>> d=re.sub("'","&quot;",c)
    >>> e=BeautifulSoup(d)
    >>> def qfix(x): return re.sub("\'\"|\"'","'",x)
    >>> qfix(str(e))

You might be able to use a similar "qfix" as a formatter in BeautifulSoup 4.
Or it might not work at all :)
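For reference, here is the quote-fixing step on its own as a Python 3 sketch that needs only the standard library. The `serialized` string is written by hand and is only an assumption about what the parser emits after the &quot; substitution; it is not produced by running BeautifulSoup here.

```python
import re


def qfix(markup):
    """Collapse each adjacent mixed quote pair ('" or "') into a single '."""
    return re.sub(r"'\"|\"'", "'", markup)


# Assumed serializer output, written by hand for illustration: the attribute
# value contains literal double quotes, so it is wrapped in single quotes.
serialized = "<a href='\"http://somelink\"'>link</a>"

fixed = qfix(serialized)
print(fixed)  # <a href='http://somelink'>link</a>
```

Note that this substitution is purely textual: it would also rewrite any '" or "' pair that legitimately appears elsewhere in the document, so it is best applied only to markup where such pairs are not expected.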