python - BeautifulSoup replaces single quotes with double quotes -


In BeautifulSoup 4 for Python, if I execute the following commands:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<a href='http://somelink'>link</a>")
print soup

The output is:

<a href="http://somelink">link</a> 

BeautifulSoup replaces the single quotes with double quotes, and I don't want that. How can I cancel or override this behaviour?

Clarification:

I use urllib2 to get the HTML of the following page: http://www.download3000.com/, and then use BeautifulSoup 4 to extract part of the HTML.

I have made a function that takes a document (not necessarily HTML) and samples of what it needs to catch, and returns a regular expression. I feed the function the following samples:

samples = [
    '/showarticles-1-0-date.html',
    '/showarticles-2-0-date.html',
    '/showarticles-3-0-date.html'
]

Given the HTML code of the http://www.download3000.com/ page and the samples above, the function returns the following regular expression: \w\w><li><a href="(.*?)">\w\w\w\w\w

If I apply that regex to the HTML code of download3000, it won't find any match. That's because the links are surrounded by single quotes in the original HTML, but BeautifulSoup replaces the single quotes with double quotes, so the generated regular expression only works on the HTML as modified by BeautifulSoup.
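To make the mismatch concrete, here is a minimal sketch using only the standard re module. The two HTML fragments are invented stand-ins (the real download3000 markup is not reproduced here): raw_html uses single quotes as the page does, and souped_html imitates the same markup after it has been re-serialized with double quotes.

```python
import re

# Hypothetical fragment, shaped to fit the generated pattern.
raw_html = "<ul><li><a href='/showarticles-1-0-date.html'>Dates</a></li></ul>"

# Stand-in for BeautifulSoup's output: same markup, double-quoted attributes.
souped_html = raw_html.replace("'", '"')

# The regular expression generated from the BeautifulSoup-modified HTML.
pattern = re.compile(r'\w\w><li><a href="(.*?)">\w\w\w\w\w')

def find_links(html):
    """Return every href value the generated pattern captures."""
    return pattern.findall(html)

print(find_links(souped_html))  # matches the double-quoted version
print(find_links(raw_html))     # finds nothing in the single-quoted original
```

The pattern is tied to the quote character that happened to be in the document it was generated from, which is why it fails on the page as downloaded.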

That's why I need to force BeautifulSoup not to replace single quotes with double quotes, so that the generated regular expression is \w\w><li><a href='(.*?)'>\w\w\w\w\w, which extracts what I need from the page.

I could use a dumb solution and replace the quotes in the regex with ["\'], but then the regex would also catch links I don't want.
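If the bracket-class workaround is used anyway, it can be made a little stricter by capturing the opening quote and requiring the same character to close the value. This is only a sketch: the pattern is a hand-edited variant of the generated one, not output of the function described above, and the sample HTML is invented. It still accepts both quoting styles, so on its own it does not remove the risk of catching unwanted links.

```python
import re

# (["']) captures the opening quote; \1 requires the same quote to close it.
pattern = re.compile(r'''\w\w><li><a href=(["'])(.*?)\1>\w\w\w\w\w''')

def extract_links(html):
    """Return href values; group 1 is the quote character, group 2 the value."""
    return [m.group(2) for m in pattern.finditer(html)]

# Hypothetical fragments for illustration.
single = "<ul><li><a href='/showarticles-1-0-date.html'>Dates</a></li></ul>"
double = '<ul><li><a href="/showarticles-2-0-date.html">Dates</a></li></ul>'
mixed = "<ul><li><a href='/showarticles-3-0-date.html\">Dates</a></li></ul>"

print(extract_links(single))
print(extract_links(double))
print(extract_links(mixed))  # mismatched quotes are rejected
```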

This works with BeautifulSoup 3.2. I assume what is happening is that the single quotes are converted to &quot;, and the parser then surrounds them with " on output, so the pattern "' or '" occurs:

>>> c = "<a href='http://somelink'>"
>>> from BeautifulSoup import BeautifulSoup
>>> import re
>>> d = re.sub("'", "&quot;", c)
>>> e = BeautifulSoup(d)
>>> def qfix(x):
...     return re.sub("\'\"|\"'", "'", x)
...
>>> qfix(str(e))

You might be able to use something similar to "qfix" as a formatter in BeautifulSoup 4.

Or it might not work at all :)
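In the same spirit as qfix, a fallback that does not depend on the BeautifulSoup version is to post-process the serialized string and switch attribute quotes back to single quotes. The sketch below uses only re and treats BeautifulSoup's output as a plain string. The name to_single_quotes is a hypothetical helper, and the pattern is deliberately naive: it rewrites any name="value" run whose value holds no quote characters, including any that happen to appear in text content, so treat it as a starting point rather than a robust serializer.

```python
import re

# Matches name="value" where the value contains neither kind of quote.
ATTR_DOUBLE_QUOTED = re.compile(r'([\w:-]+)="([^"\']*)"')

def to_single_quotes(html):
    """Rewrite double-quoted attribute values to single-quoted ones."""
    return ATTR_DOUBLE_QUOTED.sub(r"\1='\2'", html)

# The string as printed by BeautifulSoup in the question.
serialized = '<a href="http://somelink">link</a>'
print(to_single_quotes(serialized))  # <a href='http://somelink'>link</a>
```

Values that already contain a single quote are left alone, since rewriting them would produce broken markup.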

