regex - How to extract a string that both matches some pattern and rests between two other strings
Sorry if this is a duplicate; it's not clear to me what's already available on how to perform this specific task.
My goal is to find the filename of a zipped file inside some HTML code. The filename is inside an <a href=...> HTML block, so it's easy for a human to find.
Here's code to reproduce what I'm looking at:
# a character vector with two strings from the HTML file
string.examples <- c(
"anes time series cumulative data file</b><br /><a href=\"../cdf/cdf.htm\"> study page</a> | <a href=\"../cdf/cdf_errata.htm\">errata</a> | <a href=\"../data/cdf/anes_cdf.zip\" onclick=\"javascript: _gaq.push(['_trackpageview','/downloads/cdf-ascii']);\">download ascii data files <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | <a href=\"../data/cdf/anes_cdfpor.zip\" onclick=\"javascript: _gaq.push(['_trackpageview','/downloads/cdf-por']);\">download .por file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | <a href=\"../data/cdf/anes_cdfdta.zip\" onclick=\"javascript: _gaq.push(['_trackpageview','/downloads/cdf-dta']);\">download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | august 25, 2011 version </td></tr>",
"anes 2012 time series study</b><br /><a href=\"../anes_timeseries_2012/anes_timeseries_2012.htm\">study page</a> | <a href=\"../anes_timeseries_2012/anes_timeseries_2012_errata.htm\">errata</a> | <a href=\"../data/anes_timeseries_2012/anes2012ts.zip\" onclick=\"javascript: _gaq.push(['_trackpageview','/downloads/2012ts-ascii']);\">download ascii data files <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | <a href=\"../data/anes_timeseries_2012/anes2012ts_sav.zip\" onclick=\"javascript: _gaq.push(['_trackpageview','/downloads/2012ts-sav']);\">download .sav file</a> <a href=\"../data/anes_timeseries_2012/anes2012ts_sav.zip\"><img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | <a href=\"../data/anes_timeseries_2012/anes2012ts_dta.zip\" onclick=\"javascript: _gaq.push(['_trackpageview','/downloads/2012ts-dta']);\">download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | july 1, 2013 version<br />"
)
Buried deep in the first string, there's the text <a href=\"../data/cdf/anes_cdfdta.zip\", and in the second string, there's the text <a href=\"../data/anes_timeseries_2012/anes2012ts_dta.zip\".
From these two strings, I want to extract ../data/cdf/anes_cdfdta.zip and ../data/anes_timeseries_2012/anes2012ts_dta.zip, because they contain the text dta.zip, and because they start after <a href=\" and end with \".
I'd want something where:
x <- some.regex.function( string.examples )
produces a character vector of length 2 with:
> x
[1] "../data/cdf/anes_cdfdta.zip" "../data/anes_timeseries_2012/anes2012ts_dta.zip"
Here I assume that the pattern you're looking for starts after a href=\" and ends with dta.zip. The idea is to use a greedy search through a href=\" until dta.zip. Also, capture each portion and replace the searched string with the required capture.
gsub("(.*a href=\\\")(.*dta\\.zip)(.*)$", "\\2", string.examples)
The .*a href=\\\" part is the "greedy" search mentioned before (I had to escape the \ and the "). By following it with .*dta\\.zip, we restrict the greedy search so that it does not go beyond the point we require. This is the pattern we're interested in, so we make sure to capture it as well. The rest is obvious: replace the whole match with the second capture.
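For comparison, here is a minimal sketch that runs the gsub call above next to one possible alternative, which matches only the href value instead of replacing the whole string. The html vector, the pattern variable, and the shortened strings are made up for illustration (abbreviated stand-ins for string.examples); the alternative assumes a lookbehind pattern used with base R's regexpr and regmatches and perl = TRUE.

```r
# Shortened, hypothetical stand-ins for the two HTML strings in the question
# (the real strings are much longer but have the same shape).
html <- c(
  '<a href="../cdf/cdf.htm">study page</a> | <a href="../data/cdf/anes_cdf.zip">ascii</a> | <a href="../data/cdf/anes_cdfdta.zip" onclick="x">dta</a> | august',
  '<a href="../data/anes_timeseries_2012/anes2012ts_sav.zip">sav</a> | <a href="../data/anes_timeseries_2012/anes2012ts_dta.zip" onclick="x">dta</a> | july'
)

# The approach from the answer: capture the href value in group 2
# and replace the entire string with that capture.
by_gsub <- gsub("(.*a href=\\\")(.*dta\\.zip)(.*)$", "\\2", html)

# Alternative sketch: match only the href value itself.
# (?<=a href=") is a lookbehind, so perl = TRUE is needed;
# [^"]* cannot cross a closing quote, so the match stays inside one attribute.
pattern <- '(?<=a href=")[^"]*dta\\.zip'
by_regmatches <- regmatches(html, regexpr(pattern, html, perl = TRUE))

print(by_gsub)
print(by_regmatches)
```

One difference to keep in mind: for a string with no match, gsub returns the input unchanged, while regmatches with regexpr drops that element, so the result can be shorter than the input vector.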