python - BeautifulSoup replaces single quotes with double quotes
In BeautifulSoup 4 for Python, if I execute the following commands:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<a href='http://somelink'>link</a>")
print soup
the output is:
<a href="http://somelink">link</a>
BeautifulSoup replaces the single quotes with double quotes, and I don't want that. How can I cancel or override this behaviour?
Clarification:
I use urllib2 to get the HTML of the following page: http://www.download3000.com/ and then use BeautifulSoup 4 to extract part of the HTML.
I have made a function that takes a document (not only HTML) and samples of what it needs to catch, and returns a regular expression. I feed the function the following samples:
samples = [ '/showarticles-1-0-date.html', '/showarticles-2-0-date.html', '/showarticles-3-0-date.html' ]
Given the HTML code of http://www.download3000.com/ and the samples above, the function returns the following regular expression:
\w\w><li><a href="(.*?)">\w\w\w\w\w
If I apply this regex to the HTML code of download3000, it won't find any match. That's because the links are surrounded by single quotes in the original HTML, but BeautifulSoup replaces the single quotes with double quotes, so the generated regular expression only works on the HTML as modified by BeautifulSoup.
That's why I need to force BeautifulSoup not to replace single quotes with double quotes, so that the generated regular expression will be
\w\w><li><a href='(.*?)'>\w\w\w\w\w
and will extract what I need from the page.
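The mismatch described above can be reproduced with the standard library alone. The sketch below is only an illustration: `raw_html` and `reserialized_html` are hand-written, hypothetical strings standing in for the page as served and for the double-quoted output, not markup copied from download3000.com.

```python
import re

# Hypothetical stand-ins: the markup as served (single quotes) and the
# same markup after being re-serialized with double quotes.
raw_html = "<ul><li><a href='/showarticles-1-0-date.html'>Latest</a></li></ul>"
reserialized_html = '<ul><li><a href="/showarticles-1-0-date.html">Latest</a></li></ul>'

# The pattern generated from the re-serialized document (from the question).
generated = r'\w\w><li><a href="(.*?)">\w\w\w\w\w'

matches_raw = re.findall(generated, raw_html)
matches_reserialized = re.findall(generated, reserialized_html)

print(matches_raw)           # no match: the raw page uses single quotes
print(matches_reserialized)  # the link is captured
```

The pattern is tied to whichever quote character appeared in the document it was generated from, which is why it has to be applied to the same serialization it was built on.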
I could use a dumb solution of replacing the quotes in the regex with ["\'], but then the regex would catch links I don't want.
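For reference, a slightly stricter variant of that quote-agnostic idea is sketched here. It assumes a hypothetical `html` string mixing both quote styles, and it uses a backreference so that the closing quote must be the same character as the opening one. It does not by itself address the concern about catching unwanted links; it only stops the pattern from depending on the quote style.

```python
import re

# Hypothetical sample mixing both quote styles.
html = (
    "<ul><li><a href='/showarticles-1-0-date.html'>First</a></li>"
    '<li><a href="/showarticles-2-0-date.html">Second</a></li></ul>'
)

# (["']) captures the opening quote; \1 requires the same closing quote.
pattern = re.compile(r"""\w\w><li><a href=(["'])(.*?)\1>\w\w\w\w\w""")

# Group 1 is the quote character, group 2 is the link itself.
links = [m.group(2) for m in pattern.finditer(html)]
print(links)
```

Note that the link moves from group 1 to group 2, so any code consuming the generated regex would need to read the second group.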
This works with BeautifulSoup 3.2. I assume what is happening is that the single quotes are converted to &quot; and the parser then surrounds them with " on output, so the pattern "' or '" occurs:
>>> c = "<a href='http://somelink'>"
>>> from BeautifulSoup import BeautifulSoup
>>> import re
>>> d = re.sub("'", "&quot;", c)
>>> e = BeautifulSoup(d)
>>> def qfix(x): return re.sub("\'\"|\"'", "'", x)
...
>>> qfix(str(e))
You might be able to use something similar to "qfix" as a formatter in BeautifulSoup 4, or it might not work at all :)