regex - How to extract a string that both matches some pattern and rests between two other strings


Sorry if this is a duplicate; it's not clear to me what's already available on how to perform this specific task.

My goal is to find the filename of a zipped file inside some HTML code. The filename is inside an <a href=...> HTML block, so it's easy for a human to find.

Here's code to reproduce what I'm looking at:

# character vector of 2 strings from the html file
string.examples <-
    c(
        "anes time series cumulative data file</b><br /><a href=\"../cdf/cdf.htm\"> study page</a>&nbsp; | &nbsp;<a href=\"../cdf/cdf_errata.htm\">errata</a>&nbsp; | &nbsp;<a href=\"../data/cdf/anes_cdf.zip\" onclick=\"javascript: _gaq.push(['_trackpageview','/downloads/cdf-ascii']);\">download ascii data files  <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/cdf/anes_cdfpor.zip\" onclick=\"javascript: _gaq.push(['_trackpageview','/downloads/cdf-por']);\">download .por file  <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/cdf/anes_cdfdta.zip\" onclick=\"javascript: _gaq.push(['_trackpageview','/downloads/cdf-dta']);\">download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;august 25, 2011 version </td></tr>",
        "anes 2012 time series study</b><br /><a href=\"../anes_timeseries_2012/anes_timeseries_2012.htm\">study page</a>&nbsp; | &nbsp;<a href=\"../anes_timeseries_2012/anes_timeseries_2012_errata.htm\">errata</a>&nbsp; |  &nbsp;<a href=\"../data/anes_timeseries_2012/anes2012ts.zip\" onclick=\"javascript: _gaq.push(['_trackpageview','/downloads/2012ts-ascii']);\">download ascii data files <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/anes_timeseries_2012/anes2012ts_sav.zip\" onclick=\"javascript: _gaq.push(['_trackpageview','/downloads/2012ts-sav']);\">download .sav file</a> <a href=\"../data/anes_timeseries_2012/anes2012ts_sav.zip\"><img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/anes_timeseries_2012/anes2012ts_dta.zip\" onclick=\"javascript: _gaq.push(['_trackpageview','/downloads/2012ts-dta']);\">download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" 
/></a>&nbsp; | &nbsp;july 1, 2013 version<br />"
    )

Buried deep in the first line, there's the text <a href=\"../data/cdf/anes_cdfdta.zip\" , and in the second line, there's the text <a href=\"../data/anes_timeseries_2012/anes2012ts_dta.zip\"

From these two lines, I want to extract ../data/cdf/anes_cdfdta.zip and ../data/anes_timeseries_2012/anes2012ts_dta.zip because they contain the text dta.zip and because they start with <a href=\" and end with \"

I'd want a function where:

x <- some.regex.function( string.examples ) 

produces a character vector of length 2 with:

> x
[1] "../data/cdf/anes_cdfdta.zip"                     "../data/anes_timeseries_2012/anes2012ts_dta.zip"

Here I assume that the pattern you're looking for starts after a href=\" and ends with dta.zip. The idea is to use a greedy search all the way through a href until dta.zip. Also, we capture each portion and replace the searched string with just the required capture.

gsub("(.*a href=\\\")(.*dta\\.zip)(.*)$", "\\2", string.examples) 

The .*a href=\\\" part, as mentioned before, searches "greedily" for the pattern (the \ and the " had to be escaped). By using .*dta\\.zip, we restrict the greedy search so that it does not go beyond the point we require. This is the pattern we're interested in, so make sure to capture it as well. The rest should be obvious: replace the whole match with just the second capture.
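
To see how the capture groups behave, here is a minimal sketch on a shortened input. The vector short.example and the helper extract.dta.link are made up for illustration and are not part of the original question. The helper is only one possible alternative: it assumes the link never contains a double quote, and it uses regexpr() with perl = TRUE plus regmatches() so that a line without a dta.zip link yields NA. With gsub(), such a line comes back unchanged, because nothing was replaced.

```r
# hypothetical shortened input, not the full html from the question
short.example <- c(
  "<a href=\"../cdf/cdf.htm\">study page</a> | <a href=\"../data/cdf/anes_cdfdta.zip\" onclick=\"x\">download .dta file</a>",
  "<a href=\"../cdf/cdf.htm\">no stata file on this line</a>"
)

# the gsub() approach: keep capture group 2, drop everything else
by.gsub <- gsub("(.*a href=\\\")(.*dta\\.zip)(.*)$", "\\2", short.example)
# an element without a match is returned unchanged

# alternative sketch (hypothetical helper): match only the link itself
extract.dta.link <- function(x) {
  # lookbehind for a href=", then non-quote characters ending in dta.zip
  m <- regexpr("(?<=a href=\")[^\"]*dta\\.zip", x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  # regmatches() returns matches only for the elements that matched
  out[m != -1] <- regmatches(x, m)
  out
}

by.regmatches <- extract.dta.link(short.example)
```

Checking for NA in by.regmatches makes it possible to tell lines with a dta.zip link apart from lines without one, which the gsub() result alone does not show.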

