2. Imagine you have a table like this:
A1  A2  A3  M1   M2
A   Y   3   100  30.0
C   Z   2   50   22.34
A   X   3   25   10.0
A   X   4   12   2.0
C   X   1   98   5.45
B   Z   2   150  20.12
A   Z   3   200  30.45
C   Y   2   225  20.0
B   Z   4   203  34.5
Etc.
3. And we want to aggregate it
So basically our input can look like this (will use this for the
example):
• Group by columns list
• E.g.: ['A1', 'A2', 'A3']
• Measure columns set
• E.g.: {'M1_sum': ['M1', 'sum'], 'M1_avg': ['M1', 'avg'], 'M2_sum': ['M2', 'sum']}
• Where statement or boolean array to filter the existing rows
• If None, the entire table is scanned; otherwise only the selected rows
• Rootdir option is also needed to specify an in-core or out-of-core
result. NB: intermediate results (factorized & counted sort results)
should perhaps not follow the specified outcome but instead
whether the input ctable is in-memory or on-disk
• NB: factor caching and parallelism are mostly meant as ideas
for the future (they might greatly accelerate the groupby though)
• I have left sorting the end result out too for now
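As a reference point for the input spec above, here is a minimal sketch of the result we are after, written as a naive dict-based groupby in plain Python rather than with carrays. The names `naive_groupby`, `AGGREGATIONS`, `ROWS` and `COLUMNS` are hypothetical and only for illustration; the groupby list, the measure dict and the optional boolean filter follow the slide.

```python
# The example rows from slide 2 (only the rows shown there).
COLUMNS = ["A1", "A2", "A3", "M1", "M2"]
ROWS = [
    ("A", "Y", 3, 100, 30.0),
    ("C", "Z", 2, 50, 22.34),
    ("A", "X", 3, 25, 10.0),
    ("A", "X", 4, 12, 2.0),
    ("C", "X", 1, 98, 5.45),
    ("B", "Z", 2, 150, 20.12),
    ("A", "Z", 3, 200, 30.45),
    ("C", "Y", 2, 225, 20.0),
    ("B", "Z", 4, 203, 34.5),
]

# Hypothetical aggregation helpers, keyed by the names used in the measure spec.
AGGREGATIONS = {
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
}


def naive_groupby(rows, columns, groupby_cols, measure_cols, where=None):
    """Reference groupby: one dict entry per group, no factorization."""
    col_pos = {name: i for i, name in enumerate(columns)}
    groups = {}  # group key -> rows in that group (dicts keep insertion order)
    for row_nr, row in enumerate(rows):
        if where is not None and not where[row_nr]:
            continue  # row filtered out by the boolean array
        key = tuple(row[col_pos[c]] for c in groupby_cols)
        groups.setdefault(key, []).append(row)
    result = []
    for key, members in groups.items():
        out = dict(zip(groupby_cols, key))
        for out_name, (src_col, agg) in measure_cols.items():
            values = [m[col_pos[src_col]] for m in members]
            out[out_name] = AGGREGATIONS[agg](values)
        result.append(out)
    return result


result = naive_groupby(
    ROWS,
    COLUMNS,
    groupby_cols=["A1", "A2", "A3"],
    measure_cols={
        "M1_sum": ["M1", "sum"],
        "M1_avg": ["M1", "avg"],
        "M2_sum": ["M2", "sum"],
    },
)
```

This keeps every row of a group in memory, which is exactly what the factorize / counted sort pipeline on the following slides tries to avoid; it is only meant as something to check results against.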
4. Logic pipeline
Pipeline (from the diagram):
A1, A2, A3 -> Factorize (each column) -> A1 Factorized, A2 Factorized, A3 Factorized
• Factorizing of each carray can be parallel / multi-threaded
• Factorizations of carrays can potentially be cached next to the original carray until the next carray delete / update / insert
• Worst-case cost of the cache would be tripling the size (in case of unique integer columns ;)
-> Combine individual indexes into a unique new one -> A1/A2/A3 Combined
• Combination step is only needed in case of a groupby over multiple columns, else take the factorized carray directly
-> Factorize -> A1/A2/A3 Factorized
-> Counted Sort -> A1/A2/A3 Index, and Create Empty ctable
• Ctable can be based on the length from the combined factor + dtypes input
• Can be done in parallel
-> Fill the output columns: A1, A2, A3, M1_Sum, M1_Avg, M2_Sum
• Can run in parallel
• Groupby columns have to be filled by deriving A1/A2/A3 Factorized back into the original values for lookup
• Measure columns have to use the index to filter the original measure carray and perform the aggregation for each A1/A2/A3 combination
• You can also parallelize aggregations!
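The "factorize each column in parallel" idea can be sketched with a thread pool, since the per-column factorizations are independent of each other. This is only an illustration of the structure: `factorize` here is a plain-Python stand-in (not the actual carray routine), and in CPython pure-Python threads will not give a real speed-up; the gain would have to come from a native factorizer that releases the GIL.

```python
from concurrent.futures import ThreadPoolExecutor


def factorize(column):
    """Return (values, index): unique values in first-seen order and, per row,
    the position of that row's value in `values`."""
    positions = {}  # value -> position in `values`
    values, index = [], []
    for item in column:
        if item not in positions:
            positions[item] = len(values)
            values.append(item)
        index.append(positions[item])
    return values, index


# Groupby columns of the example table from slide 2.
columns = {
    "A1": ["A", "C", "A", "A", "C", "B", "A", "C", "B"],
    "A2": ["Y", "Z", "X", "X", "X", "Z", "Z", "Y", "Z"],
    "A3": [3, 2, 3, 4, 1, 2, 3, 2, 4],
}

# One factorization task per groupby column; pool.map returns the results
# in the same order as the inputs, so they can be zipped back to the names.
with ThreadPoolExecutor(max_workers=len(columns)) as pool:
    factorized = dict(zip(columns, pool.map(factorize, columns.values())))
```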
5. So we first factorize the groupby columns
A1 -> A1 Values + A1 Index

A1    A1 Index
A     0
C     1
A     0
A     0
C     1
B     2
A     0
C     1
B     2
Etc.  Etc.

A1 Values: A, C, B
• While factorizing, you do not yet know how many unique values you will get (the entire column might be unique), so you start out with 2 carrays of equal length to the input
• The hashing is done in-memory (klib), but this should be okay for almost all cases (memory usage is limited to the number of unique values)
• At the end you can resize the Values carray to its actual size
• In case of WORM (write once, read many) it can be very beneficial to cache this result already in the carray (meaning we end up with three carrays on-disk)
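The allocate-then-shrink approach described above can be sketched like this. Plain Python lists stand in for the two carrays and a dict stands in for the in-memory hash table; `factorize_preallocated` is a hypothetical name, not an existing function.

```python
def factorize_preallocated(column):
    """Factorize with both outputs preallocated to the input length."""
    n = len(column)
    values = [None] * n  # worst case: every row holds a distinct value
    index = [0] * n      # one entry per input row
    positions = {}       # in-memory hash table: value -> slot in `values`
    n_unique = 0
    for row_nr, item in enumerate(column):
        slot = positions.get(item)
        if slot is None:
            # First time we see this value: give it the next free slot.
            slot = n_unique
            positions[item] = slot
            values[slot] = item
            n_unique += 1
        index[row_nr] = slot
    del values[n_unique:]  # shrink to the number of unique values found
    return values, index


a1_values, a1_index = factorize_preallocated(
    ["A", "C", "A", "A", "C", "B", "A", "C", "B"]
)
```

Only `positions` grows with the number of unique values, which is the memory bound the slide refers to.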
6. So we end up with 3 factor results
A1 Values: A, C, B (3 unique values)
A2 Values: Y, Z, X (3 unique values)
A3 Values: 3, 2, 4, 1 (4 unique values)

A1 Index  A2 Index  A3 Index
0         0         0
1         1         1
0         2         0
0         2         2
1         2         3
2         1         1
0         1         0
1         0         1
2         1         2
Etc.      Etc.      Etc.
• The number of unique values is important for the next step, which is combining the indexes into one
• If there is only one column we group by, no additional step would be needed
7. How to combine the factorized carrays into unique values
• So we have 3 * 3 * 4 = 36 unique combinations; any row can take a place in that range
• We can create this range by calculating a multiplier for each column: start at the last column with a multiplier of 1, and then for each preceding column multiply the multiplier of the column after it by that column's number of unique values:

Column          # of values  multiplier  "start"  "end"  second value of all  random
A1              3            12          0        3      1                    2
A2              3            4           0        3      1                    1
A3              4            1           0        4      1                    3
Combined value                           0        52     17                   31

• So 3*12 + 3*4 + 4*1 = 52 and 2*12 + 1*4 + 3*1 = 31
• Note that the "end" example plugs in the number of values; since the indexes are zero-based, the largest value that actually occurs is 2*12 + 2*4 + 3*1 = 35, which matches the 36 combinations (0 to 35)
• We calculate this for each row and end up with a new carray that contains all combined values
• You can also calculate this back (for instance for 31) by doing:
• Val1 = floor(31/12) = 2
• Val2 = floor((31 - Val1*12)/4) = 1
• Val3 = floor((31 - Val1*12 - Val2*4)/1) = 3
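The multiplier arithmetic above can be checked with a few lines of plain Python. The function names `multipliers`, `encode` and `decode` are hypothetical and only used for this sketch.

```python
def multipliers(n_unique):
    """One multiplier per column: the last column gets 1, and each earlier
    column gets the next column's multiplier times its unique count."""
    result = [1] * len(n_unique)
    for i in range(len(n_unique) - 2, -1, -1):
        result[i] = result[i + 1] * n_unique[i + 1]
    return result


def encode(indexes, mults):
    """Combine one factor index per column into a single integer."""
    return sum(i * m for i, m in zip(indexes, mults))


def decode(code, mults):
    """Recover the per-column factor indexes from a combined integer."""
    indexes = []
    for m in mults:
        indexes.append(code // m)  # floor division, as in Val1..Val3
        code -= indexes[-1] * m    # remove this column's contribution
    return indexes


mults = multipliers([3, 3, 4])        # unique counts of A1, A2, A3
random_code = encode([2, 1, 3], mults)
largest_code = encode([2, 2, 3], mults)
roundtrip = decode(random_code, mults)
```

With zero-based indexes every combination maps to a distinct integer below the product of the unique counts, so the decode step is unambiguous.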
8. So we create a groupby index & values like this
A1 Index (* 12)  A2 Index (* 4)  A3 Index (* 1)  Groupby Input  Groupby Index
0                0               0               0              0
1                1               1               17             1
0                2               0               8              2
0                2               2               10             3
1                2               3               23             4
2                1               1               29             5
0                1               0               4              6
1                0               1               13             7
2                1               2               30             8
Etc.             Etc.            Etc.            Etc.           Etc.

Groupby Values: 0, 17, 8, 10, 23, 29, 4, 13, 30

• Calculate: Groupby Input = A1 Index * 12 + A2 Index * 4 + A3 Index * 1 (numexpr can do this very nicely)
• Factorize: the Groupby Input is then factorized into the Groupby Index and the Groupby Values
• The length of the groupby values is the length of the ctable output!
(Okay, slightly crappy example as everything is unique here ;)
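A minimal sketch of this step in plain Python, using the three index columns from slide 6. The slide suggests numexpr for the multiplication; this sketch deliberately uses a list comprehension instead to stay dependency-free, so it shows the logic rather than the fast path.

```python
a1_index = [0, 1, 0, 0, 1, 2, 0, 1, 2]
a2_index = [0, 1, 2, 2, 2, 1, 1, 0, 1]
a3_index = [0, 1, 0, 2, 3, 1, 0, 1, 2]

# Calculate: one combined integer per row, using the multipliers 12, 4 and 1.
groupby_input = [
    a1 * 12 + a2 * 4 + a3 * 1
    for a1, a2, a3 in zip(a1_index, a2_index, a3_index)
]

# Factorize: unique combined values in first-seen order, plus a per-row index.
positions = {}
groupby_values, groupby_index = [], []
for code in groupby_input:
    if code not in positions:
        positions[code] = len(groupby_values)
        groupby_values.append(code)
    groupby_index.append(positions[code])

# The output ctable gets one row per unique combination.
n_output_rows = len(groupby_values)
```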
9. Create the new ctable
• @Valentin: it’s probably better to just create the carrays on
the go from iterations right? (no need to first create an
empty one)
• We know the length from the groupby values carray size and
the dtypes from the input carrays
10. Sort
Groupby Index: 0, 1, 2, 2, 0, 3, 0, 1, 2, 3
Groupby Values: 0, 17, 8, 10
(I changed the example from slide 8 to make it more understandable ;)

Counted sort gives a count per value and a sorted carray
that holds the row indices
(NB: we already have this Cython function through pandas)

Groupby Value Count: 3, 2, 3, 2
Groupby Row Index: 0, 4, 6, 1, 7, Etc.
So now you can select rows from the original carrays
using index lookups
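A counting sort over the groupby index can be sketched as follows. This is a plain-Python illustration with a hypothetical name (`counted_sort`), not the Cython routine mentioned above; row numbers are zero-based.

```python
def counted_sort(groupby_index, n_groups):
    """Return (counts, row_index): the number of rows per group and the
    row numbers ordered by group, keeping the original order within a group."""
    counts = [0] * n_groups
    for g in groupby_index:
        counts[g] += 1

    # Start offset of each group in the sorted output (exclusive prefix sum).
    starts = [0] * n_groups
    for g in range(1, n_groups):
        starts[g] = starts[g - 1] + counts[g - 1]

    row_index = [0] * len(groupby_index)
    next_slot = list(starts)  # next free output position per group
    for row_nr, g in enumerate(groupby_index):
        row_index[next_slot[g]] = row_nr
        next_slot[g] += 1
    return counts, row_index


counts, row_index = counted_sort([0, 1, 2, 2, 0, 3, 0, 1, 2, 3], n_groups=4)

# Rows of the first group are the first counts[0] entries of row_index.
rows_of_group_0 = row_index[:counts[0]]
```

Both passes are linear in the number of rows, and the rows of each group end up contiguous in `row_index`.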
11. Create groupby & measure columns
• Don’t have time for this slide anymore but using the
previous slides we should be okay I hope ;)
• Basically, create the groupby columns by looking up the correct
value in the values carrays, deriving the position from the
groupby input
• Create each measure column by index-selecting the values for
each groupby value and applying the aggregation
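Putting the previous slides together, the last step could look roughly like the sketch below. It is plain Python over lists instead of carrays, and `groupby`, `factorize`, `AGGREGATIONS` and `TABLE` are hypothetical names; treat it as one possible reading of the pipeline, not as the final implementation.

```python
def factorize(column):
    """Unique values in first-seen order plus a per-row index into them."""
    positions, values, index = {}, [], []
    for item in column:
        if item not in positions:
            positions[item] = len(values)
            values.append(item)
        index.append(positions[item])
    return values, index


AGGREGATIONS = {"sum": sum, "avg": lambda v: sum(v) / len(v)}


def groupby(table, groupby_cols, measure_cols):
    # Slides 5-6: factorize every groupby column.
    factors = [factorize(table[c]) for c in groupby_cols]
    # Slide 7: multipliers, the last column gets 1.
    mults = [1] * len(factors)
    for i in range(len(factors) - 2, -1, -1):
        mults[i] = mults[i + 1] * len(factors[i + 1][0])
    # Slide 8: one combined code per row, then factorize the codes.
    row_codes = [
        sum(idx * m for idx, m in zip(row, mults))
        for row in zip(*(index for _, index in factors))
    ]
    group_codes, group_index = factorize(row_codes)
    # Slide 10: counted sort gives counts and row numbers ordered by group.
    counts = [0] * len(group_codes)
    for g in group_index:
        counts[g] += 1
    starts = [sum(counts[:g]) for g in range(len(counts))]
    row_order = [0] * len(group_index)
    cursor = list(starts)
    for row_nr, g in enumerate(group_index):
        row_order[cursor[g]] = row_nr
        cursor[g] += 1
    # Slide 11: groupby columns, decoding each group code back to values.
    out = {c: [] for c in groupby_cols}
    for code in group_codes:
        for c, (values, _), m in zip(groupby_cols, factors, mults):
            out[c].append(values[code // m])
            code %= m
    # Slide 11: measure columns, aggregating the rows of each group.
    for name, (src, agg) in measure_cols.items():
        out[name] = [
            AGGREGATIONS[agg]([table[src][r] for r in row_order[s:s + n]])
            for s, n in zip(starts, counts)
        ]
    return out


TABLE = {
    "A1": ["A", "C", "A", "A", "C", "B", "A", "C", "B"],
    "A2": ["Y", "Z", "X", "X", "X", "Z", "Z", "Y", "Z"],
    "A3": [3, 2, 3, 4, 1, 2, 3, 2, 4],
    "M1": [100, 50, 25, 12, 98, 150, 200, 225, 203],
    "M2": [30.0, 22.34, 10.0, 2.0, 5.45, 20.12, 30.45, 20.0, 34.5],
}
MEASURES = {"M1_sum": ["M1", "sum"], "M1_avg": ["M1", "avg"], "M2_sum": ["M2", "sum"]}

result = groupby(TABLE, ["A1", "A2", "A3"], MEASURES)
```

Each measure aggregation only reads `row_order`, `starts` and `counts`, so the measure columns could be filled independently of each other, which is where the "parallelize aggregations" idea from slide 4 would fit.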