performance - Algorithm for making two histograms proportional, minimizing units removed -
imagine have 2 histograms equal number of bins. n observations distributed among bins. each bin has between 0 , n observations.
what algorithm appropriate determining minimum number of observations remove both histograms in order make them proportional? not need equal in absolute number, proportional each other. is, there must common factor bins in 1 histogram can multiplied in order make equal other histogram.
for example, imagine following 2 histograms, item in each histogram refers number of observations in bin respective histogram.
histogram 1: 4, 7, 4, 9
histogram 2: 2, 0, 2, 1
for these histograms, solution remove histogram 1 7 observations in bin 2 , 7 observations bin 4, such (histogram 1)*2 = histogram 2.
but general algorithm used find subsets of 2 histograms maximized number of total observations between them while making them proportional? can drop observations both histograms or one.
thanks!
seems me problem equivalent (if consider each histogram n-dimensional vector), minimizing manhattan length |r|, r=xa-b, , b 'vectors' , x proportional scale.
|r| has single minimum (not integer) can find rapidly using simple bisection algorithm (or akin newton's method).
then, assuming want solution proportion integer, test 2 cases ceil(x), , floor(x), find has smallest manhattan length (and number of observations need remove).
proof problem not np-hard:
consider inefficient 'solution' whereby removed n observations bins. both a , b equal 'zero' histogram 0 = (0,0,0,...). 2 histograms equal , proportional 0 = s * 0 proportional values s, hard maximum number of observations remove n.
now assume more efficient solution exists assitions/removals < n , proportional scale s > 2*n (i.e after removal of observations a = n * b or b=n * a ). if both a = 0 , b = 0, have previous solution n removals (which contradicts assumption there less n removals). if a = 0 , b ≠ 0 there no s <> 0 such 0 = s * b , no s such s * 0 = b (with similar argument b = 0 , s ≠ 0). must case both a ≠ 0 , b ≠ 0. assume moment a histogram scaled (so a * s = b), a must have @ least 1 non-zero entry a[i] minimum value 1 (after removal of observations), when scaled have minimum value ≥. therefore equivalent entry b[i] must have @ least 2*n observations. total number of observations n, have needed add @ least n observations b[i], contradicts assumption improved solution had less n additions/removals. no 'efficient' solution requires proportional scale greater n.
so find efficient solution requires, @ worst, testing 'best fit' solution scaling factors in range 0-n.
the 'best fit' solution scaling factor s in a = s * b, a , b have m bins each requires
sum(i=1 m) of { abs(a[i]- s * b[i]) mod s + abs(a[i]- s * b[i]) div s } additions/removals.
this order m operation, test each scaling factor in range 0-n algorithm of order o(m*n)
i (but haven't got formal proof), scale factor cannot exceed number of observations in filled bin. in practice typically smaller. 2 histograms 2 hundred bins , randomly chosen 30-300 observations per bin: if there na > nb total observations in bins of a , b respectively scaling factor either found in range na/nb-4 < s < na/nb + 4, (or s = 0 if na >> nb).
Comments
Post a Comment