Bcolz Groupby Discussion Document

Imagine you have a table like this:
A1  A2  A3  M1   M2
A   Y   3   100  30.0
C   Z   2   50   22.34
A   X   3   25   10.0
A   X   4   12   2.0
C   X   1   98   5.45
B   Z   2   150  20.12
A   Z   3   200  30.45
C   Y   2   225  20.0
B   Z   4   203  34.5
Etc.

And we want to aggregate it.
So basically our input can look like this (this is used for the example
throughout):
• Group by columns list
  • E.g.: ['A1', 'A2', 'A3']
• Measure columns set
  • E.g.: {'M1_sum': ['M1', 'sum'], 'M1_avg': ['M1', 'avg'], 'M2_sum': ['M2', 'sum']}
• Where statement or boolarray to filter the existing rows
  • If None, the entire table is scanned; otherwise only the selected rows
• A rootdir option is also needed to specify an in-core or out-of-core
result. NB: intermediate results (factorization and counted sort results)
should perhaps not follow the specified outcome but instead follow
whether the input ctable is in-memory or on-disk
• NB: things like factor caching and parallelism are mostly meant as ideas
for the future (they might greatly accelerate the groupby, though)
• I have left sorting the end result out for now
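To pin down what the result should contain, here is a naive pure-Python reference that takes the inputs described above and groups with a plain dict. It is only a sketch for checking expected output on the nine sample rows: the name `naive_groupby`, its signature, and the `AGGREGATIONS` table are hypothetical and not part of bcolz, and plain tuples stand in for a ctable.

```python
from collections import defaultdict

# The nine sample rows from the table above (the "Etc." rows are left out).
COLUMNS = ['A1', 'A2', 'A3', 'M1', 'M2']
ROWS = [
    ('A', 'Y', 3, 100, 30.0),
    ('C', 'Z', 2, 50, 22.34),
    ('A', 'X', 3, 25, 10.0),
    ('A', 'X', 4, 12, 2.0),
    ('C', 'X', 1, 98, 5.45),
    ('B', 'Z', 2, 150, 20.12),
    ('A', 'Z', 3, 200, 30.45),
    ('C', 'Y', 2, 225, 20.0),
    ('B', 'Z', 4, 203, 34.5),
]

# Hypothetical mapping from aggregation name to a function over a list.
AGGREGATIONS = {
    'sum': sum,
    'avg': lambda values: sum(values) / len(values),
}


def naive_groupby(rows, columns, groupby_cols, measure_cols, where=None):
    """Hypothetical reference implementation, not the proposed algorithm."""
    col_pos = {name: i for i, name in enumerate(columns)}
    groups = defaultdict(list)
    for row_nr, row in enumerate(rows):
        # `where` is an optional boolean list; None means scan every row.
        if where is not None and not where[row_nr]:
            continue
        key = tuple(row[col_pos[c]] for c in groupby_cols)
        groups[key].append(row)
    result = {}
    for key, members in groups.items():
        result[key] = {
            out_name: AGGREGATIONS[op]([m[col_pos[src]] for m in members])
            for out_name, (src, op) in measure_cols.items()
        }
    return result


measures = {'M1_sum': ['M1', 'sum'], 'M1_avg': ['M1', 'avg'], 'M2_sum': ['M2', 'sum']}
by_a1 = naive_groupby(ROWS, COLUMNS, ['A1'], measures)
by_all = naive_groupby(ROWS, COLUMNS, ['A1', 'A2', 'A3'], measures)
```

Grouping by all three columns gives one group per sample row, because every A1/A2/A3 combination in the sample is unique.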

Logic pipeline
A1, A2, A3 -> Factorize (each column) -> A1 Factorized, A2 Factorized, A3 Factorized
• Factorizing of each carray can be parallel / multi-threaded
• Factorizations of carrays can potentially be cached next to the
original carray until the next carray delete / update / insert
• The worst-case cost of the cache would be tripling the size (in case
of unique integer columns ;)
A1 Factorized, A2 Factorized, A3 Factorized -> Combine individual indexes into a unique new one -> A1/A2/A3 Combined -> Factorize -> A1/A2/A3 Factorized
• The combination step is only needed in case of a groupby over multiple
columns, else take the factorized carray directly
A1/A2/A3 Factorized -> Counted Sort -> A1/A2/A3 Index
Create Empty ctable
• The ctable can be based on the length from the combined factor + the
dtypes of the input
• Can be done in parallel
Fill the ctable columns: A1, A2, A3, M1_Sum, M1_Avg, M2_Sum
• Can run in parallel
• Groupby columns have to be filled by deriving A1/A2/A3 Factorized
back into the original values for lookup
• Measure columns have to use the index to filter the original measure
carray and perform the aggregation for each A1/A2/A3 combination
• You can also parallelize aggregations!

So we first factorize the groupby columns
A1 (input):  A C A A C B A C B Etc.
A1 Values:   A C B
A1 Index:    0 1 0 0 1 2 0 1 2 Etc.
• While factorizing, you do not yet know how many unique values you will
get (the entire column might be unique), so you start out with 2 carrays
of equal length to the input
• The hashing is done in-memory (klib), but this should be okay for
almost all cases (memory usage is limited to the number of unique values)
• At the end you can resize the Values carray to its actual size
• In case of WORM (write once, read many) it can be very beneficial to
cache this result already in the carray (meaning we end up with
three carrays on-disk)
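The factorization step above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the proposed implementation: a `dict` stands in for the klib hash table, Python lists stand in for the two carrays, and the values list simply grows instead of being preallocated and resized. The function name `factorize` is chosen here for illustration.

```python
def factorize(column):
    """Return (index, values).

    values: each distinct item, in order of first appearance.
    index:  for every input row, the position of its item in values.
    """
    seen = {}    # in-memory hash table: item -> position in values
    values = []  # grows only with the number of unique items
    index = []   # always ends up as long as the input
    for item in column:
        pos = seen.get(item)
        if pos is None:
            # First time we meet this item: give it the next free position.
            pos = len(values)
            seen[item] = pos
            values.append(item)
        index.append(pos)
    return index, values


a1 = ['A', 'C', 'A', 'A', 'C', 'B', 'A', 'C', 'B']
a1_index, a1_values = factorize(a1)
```

Memory for `seen` and `values` scales with the number of unique items, while `index` has one entry per input row.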

So we end up with 3 factor results
A1 Values:  A C B                    (3 unique values)
A1 Index:   0 1 0 0 1 2 0 1 2 Etc.
A2 Values:  Y Z X                    (3 unique values)
A2 Index:   0 1 2 2 2 1 1 0 1 Etc.
A3 Values:  3 2 4 1                  (4 unique values)
A3 Index:   0 1 0 2 3 1 0 1 2 Etc.
• The number of unique values is important for the next step, which is
combining the indexes into one
• If there is only one column we group by, no additional step is needed

How to combine the factorized carrays
into unique values
• So we have 3 * 3 * 4 = 36 unique combinations; every combination of index values takes one place in the range 0..35
• We can create this range by calculating a multiplier for each column: start at the last column with a
multiplier of 1, and then for each preceding column multiply the previous multiplier by the
number of unique values of the column that multiplier belongs to:
Column    # of values  Multiplier  "Start"  "End"  Second value of all  Random
A1        3            12          0        2      1                    2
A2        3            4           0        2      1                    1
A3        4            1           0        3      1                    3
Combined                           0        35     17                   31
• So 2*12 + 2*4 + 3*1 = 35 (the largest index of each column) and 2*12 + 1*4 + 3*1 = 31
• We calculate this for each row and end up with a new carray that contains all
combined values
• You can also calculate this back (for instance for 31) by doing:
• Val1 = floor(31/12) = 2
• Val2 = floor((31-val1*12)/4) = 1
• Val3 = floor((31-val1*12-val2*4)/1) = 3
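The multiplier arithmetic on this slide can be checked with a short sketch. The helper names `multipliers`, `encode` and `decode` are made up for illustration; the only assumption is that each column's index is zero-based, as in the factorization examples.

```python
def multipliers(unique_counts):
    """Multiplier per column: the last column gets 1, every earlier column
    gets the product of the unique counts of all columns after it."""
    result = [1] * len(unique_counts)
    for i in range(len(unique_counts) - 2, -1, -1):
        result[i] = result[i + 1] * unique_counts[i + 1]
    return result


def encode(indexes, mults):
    """Combine one row's per-column indexes into a single integer."""
    return sum(i * m for i, m in zip(indexes, mults))


def decode(combined, mults):
    """Recover the per-column indexes from a combined value."""
    indexes = []
    for m in mults:
        indexes.append(combined // m)  # floor division, as on the slide
        combined %= m                  # keep the remainder for the next column
    return indexes


mults = multipliers([3, 3, 4])      # unique counts of A1, A2, A3
random_example = encode([2, 1, 3], mults)
round_trip = decode(random_example, mults)
```

Because the multipliers are derived from the unique counts, two different index combinations can never map to the same combined value.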

So we create a groupby index & values
like this
A1 Index (* 12):  0 1 0 0 1 2 0 1 2 Etc.
A2 Index (* 4):   0 1 2 2 2 1 1 0 1 Etc.
A3 Index (* 1):   0 1 0 2 3 1 0 1 2 Etc.
Calculate (numexpr can do this very nicely):
Groupby Input:    0 17 8 10 23 29 4 13 30 Etc.
Factorize:
Groupby Index:    0 1 2 3 4 5 6 7 8 Etc.
Groupby Values:   0 17 8 10 23 29 4 13 30
The length of the groupby values is the length of the ctable output!
(Okay, slightly crappy example as everything is unique here ;)
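The calculation on this slide, written out for the nine example rows. This is only a sketch: a list comprehension stands in for the numexpr expression, and a dict stands in for the factorization of the combined carray.

```python
a1_index = [0, 1, 0, 0, 1, 2, 0, 1, 2]
a2_index = [0, 1, 2, 2, 2, 1, 1, 0, 1]
a3_index = [0, 1, 0, 2, 3, 1, 0, 1, 2]

# Row-wise combination with the multipliers 12, 4 and 1.
groupby_input = [
    i1 * 12 + i2 * 4 + i3 * 1
    for i1, i2, i3 in zip(a1_index, a2_index, a3_index)
]

# Factorize the combined values: each new value gets the next free position.
positions = {}
groupby_index = [positions.setdefault(v, len(positions)) for v in groupby_input]
groupby_values = list(positions)  # dicts keep insertion order (Python 3.7+)
```

`len(groupby_values)` is the number of rows the output ctable will have.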

Create the new ctable
• @Valentin: it’s probably better to just create the carrays on
the go from iterations right? (no need to first create an
empty one)
• We know the length from the groupby values carray size and
the dtypes from the input carrays

Sort
Groupby Index:        0 1 2 2 0 3 0 1 2 3
Groupby Values:       0 17 8 10
Groupby Value Count:  3 2 3 2
Groupby Row Index:    0 4 6 1 7 Etc.
Counted sort gives, per value, a count and a sorted carray
which gives the (zero-based) row indices
(NB: we have this Cython function already through Pandas)
I changed the example from slide 8 to make it more understandable ;)
So now you can select rows from the original carrays
using index lookups
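A counted (counting) sort over the groupby index can be sketched as below. This is a plain-Python illustration, not the Cython routine mentioned on the slide; the function name `counted_sort` and the extra `starts` list are assumptions made here for clarity.

```python
def counted_sort(groupby_index, n_groups):
    """Return (counts, starts, row_index) for a factorized groupby index."""
    # Pass 1: count how many rows fall into each group.
    counts = [0] * n_groups
    for g in groupby_index:
        counts[g] += 1

    # Running total: where each group's block begins in row_index.
    starts = [0] * n_groups
    for g in range(1, n_groups):
        starts[g] = starts[g - 1] + counts[g - 1]

    # Pass 2: drop every row number into the next free slot of its group.
    row_index = [0] * len(groupby_index)
    next_slot = list(starts)
    for row_nr, g in enumerate(groupby_index):
        row_index[next_slot[g]] = row_nr
        next_slot[g] += 1
    return counts, starts, row_index


counts, starts, row_index = counted_sort([0, 1, 2, 2, 0, 3, 0, 1, 2, 3], 4)
rows_of_group_2 = row_index[starts[2]:starts[2] + counts[2]]
```

The rows of group `g` are `row_index[starts[g]:starts[g] + counts[g]]`, so a single pass over the counts is enough to slice out every group.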

Create groupby & measure columns
• Don’t have time for this slide anymore but using the
previous slides we should be okay I hope ;)
• Basically, create the groupby columns by calculating each groupby value
back into its per-column indexes and looking up the original value in the
values carrays
• Create the measure columns by selecting, via the row index, the values
that belong to each groupby value and applying the aggregation
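A small end-to-end sketch of this last step, using a groupby over A1 and A2 only so that some groups actually contain more than one row. Everything here is illustrative: plain lists stand in for carrays, a dict of row lists stands in for the counted sort result, and the output is a dict of lists rather than a ctable.

```python
# Factor results for A1 and A2, plus the M1 measure, for the nine sample rows.
a1_values, a1_index = ['A', 'C', 'B'], [0, 1, 0, 0, 1, 2, 0, 1, 2]
a2_values, a2_index = ['Y', 'Z', 'X'], [0, 1, 2, 2, 2, 1, 1, 0, 1]
m1 = [100, 50, 25, 12, 98, 150, 200, 225, 203]

# Two groupby columns: A2 gets multiplier 1, A1 gets the unique count of A2.
mults = [len(a2_values), 1]
groupby_input = [i1 * mults[0] + i2 * mults[1]
                 for i1, i2 in zip(a1_index, a2_index)]

# Row numbers per combined value (stand-in for the counted sort output).
rows_per_group = {}
for row_nr, key in enumerate(groupby_input):
    rows_per_group.setdefault(key, []).append(row_nr)
groupby_values = list(rows_per_group)

out = {'A1': [], 'A2': [], 'M1_sum': [], 'M1_avg': []}
for key in groupby_values:
    # Groupby columns: decode the combined value, then look up the originals.
    i1, rest = divmod(key, mults[0])
    i2 = rest // mults[1]
    out['A1'].append(a1_values[i1])
    out['A2'].append(a2_values[i2])
    # Measure columns: select the rows of this group, then aggregate.
    selected = [m1[r] for r in rows_per_group[key]]
    out['M1_sum'].append(sum(selected))
    out['M1_avg'].append(sum(selected) / len(selected))
```

Each group is independent of the others, which is why the aggregations lend themselves to running in parallel.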
