SQL-like window functions in Pandas: Row Numbering in a Python Pandas DataFrame


I come from a SQL background and use the following data processing steps frequently:

  1. Partition a table of data by one or more fields
  2. For each partition, add a row number to each of its rows that ranks the row by one or more other fields, with the analyst specifying ascending or descending

Example:

import pandas as pd

df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
                   'data1' : [1,2,2,3,3],
                   'data2' : [1,10,2,3,30]})

df

   data1  data2 key1
0      1      1    a
1      2     10    a
2      2      2    a
3      3      3    b
4      3     30    a

I'm looking for the pandas equivalent of this SQL window function:

RN = ROW_NUMBER() OVER (PARTITION BY Key1, Key2 ORDER BY Data1 ASC, Data2 DESC)

   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4

I've tried the following, which I've gotten to work where there are no 'partitions':

def row_number(frame, orderby_columns, orderby_direction, name):
    frame.sort_index(by=orderby_columns, ascending=orderby_direction, inplace=True)
    frame[name] = list(xrange(len(frame.index)))
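(Aside: on recent pandas and Python 3, the same idea would need sort_values in place of the old sort_index(by=...) API and range in place of xrange; a rough sketch of that version, keeping the same arguments and the same 0-based numbering:)

import pandas as pd

def row_number(frame, orderby_columns, orderby_direction, name):
    # sort_values replaces the old sort_index(by=...) API; avoid in-place
    # mutation and return the sorted, numbered copy instead
    frame = frame.sort_values(by=orderby_columns, ascending=orderby_direction)
    # number rows 0..n-1 in the sorted order (range replaces xrange on Python 3)
    frame[name] = range(len(frame.index))
    return frame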

I tried to extend this idea to work with partitions (groups, in pandas) with the following, but it didn't work:

df1 = df.groupby('key1').apply(lambda t: t.sort_index(by=['data1', 'data2'], ascending=[True, False], inplace=True)).reset_index()

def nf(x):
    x['rn'] = list(xrange(len(x.index)))

df1['rn1'] = df1.groupby('key1').apply(nf)

But I got a lot of NaNs when I did this.

Ideally, there'd be a succinct way to replicate the window function capability of SQL (I've figured out the window-based aggregates... that's a one-liner in pandas). Can anyone share with me the idiomatic way to number rows in pandas?

You can do this by using groupby twice, along with the rank method:

In [11]: g = df.groupby('key1')

Use the min method argument so that values which share the same data1 get the same rn:

In [12]: g['data1'].rank(method='min')
Out[12]:
0    1
1    2
2    2
3    1
4    4
dtype: float64

In [13]: df['rn'] = g['data1'].rank(method='min')

Then groupby these results and add the rank with respect to data2:

In [14]: g1 = df.groupby(['key1', 'rn'])

In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0    0
1    0
2    1
3    0
4    0
dtype: float64

In [16]: df['rn'] += g1['data2'].rank(ascending=False) - 1

In [17]: df
Out[17]:
   data1  data2 key1  rn
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4

It feels like there ought to be a native way to do this (there may well be!...).
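For what it's worth, one candidate for that native way (assuming a pandas version that has sort_values and groupby.cumcount) is to sort by the ORDER BY columns first and then number the rows within each key1 group, reusing the df built above:

# Sort by the ORDER BY columns, then number rows within each key1 partition.
# cumcount is 0-based, so add 1 to mimic ROW_NUMBER(); the result aligns back
# to df on the original index.
df['rn'] = (df.sort_values(['data1', 'data2'], ascending=[True, False])
              .groupby('key1')
              .cumcount() + 1)

On the example frame this gives rn values 1, 2, 3, 1, 4, matching the SQL output shown in the question.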

