Groupby Thoughts
Imagine you have a table like this:
A1  A2  A3  M1   M2
A   Y   3   100  30.0
C   Z   2   50   22.34
A   X   3   25   10.0
A   X   4   12   2.0
C   X   1   98   5.45
B   Z   2   150  20.12
A   Z   3   200  30.45
C   Y   2   225  20.0
B   Z   4   203  34.5
Etc.
And we want to aggregate it
So basically our input can look like this (I will use this for the example):
• Group-by columns list
  • E.g.: ['A1', 'A2', 'A3']
• Measure columns set
  • E.g.: {'M1_sum': ['M1', 'sum'], 'M1_avg': ['M1', 'avg'], 'M2_sum': ['M2', 'sum']}
• Where statement or boolarray to filter the existing rows
  • If None, the entire table is scanned; otherwise only the selected rows
• A rootdir option is also needed to specify an in-core or out-of-core result. NB: intermediate results (factorization and counted-sort results) should perhaps not follow the specified output location, but instead follow whether the input ctable is in-memory or on-disk
• NB: things like factor caching and parallelism are mostly meant as ideas for the future (they might greatly accelerate the groupby, though)
• I have also left sorting of the end result out for now
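To make the input format concrete, here is a minimal plain-Python sketch that feeds the example inputs to a naive, dictionary-based groupby. It is not the factorize / counted-sort pipeline described in this document; it only shows what result the pipeline should produce. The names `naive_groupby` and `AGGREGATIONS`, and the choice to pass `where` as a list of booleans, are assumptions made for illustration.

```python
from collections import defaultdict

# The nine example rows from the table, stored column-wise.
table = {
    "A1": ["A", "C", "A", "A", "C", "B", "A", "C", "B"],
    "A2": ["Y", "Z", "X", "X", "X", "Z", "Z", "Y", "Z"],
    "A3": [3, 2, 3, 4, 1, 2, 3, 2, 4],
    "M1": [100, 50, 25, 12, 98, 150, 200, 225, 203],
    "M2": [30.0, 22.34, 10.0, 2.0, 5.45, 20.12, 30.45, 20.0, 34.5],
}

groupby_cols = ["A1", "A2", "A3"]
measure_cols = {
    "M1_sum": ["M1", "sum"],
    "M1_avg": ["M1", "avg"],
    "M2_sum": ["M2", "sum"],
}

# Hypothetical registry mapping an aggregation name to a function.
AGGREGATIONS = {
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
}


def naive_groupby(table, groupby_cols, measure_cols, where=None):
    """Reference groupby: collect row numbers per key, then aggregate."""
    n_rows = len(table[groupby_cols[0]])
    rows_per_group = defaultdict(list)
    for row in range(n_rows):
        # `where` is an optional list of booleans; None means "scan all rows".
        if where is not None and not where[row]:
            continue
        key = tuple(table[col][row] for col in groupby_cols)
        rows_per_group[key].append(row)

    result = {}
    for key, rows in rows_per_group.items():
        result[key] = {
            name: AGGREGATIONS[op]([table[col][r] for r in rows])
            for name, (col, op) in measure_cols.items()
        }
    return result


result = naive_groupby(table, groupby_cols, measure_cols)
by_a1 = naive_groupby(table, ["A1"], measure_cols)
only_a = naive_groupby(
    table, groupby_cols, measure_cols,
    where=[value == "A" for value in table["A1"]],
)
print(len(result))                  # 9 (every row is its own group here)
print(by_a1[("A",)]["M1_sum"])      # 337
print(len(only_a))                  # 4
```

Grouping on all three columns gives nine groups because every combination in the sample rows is unique; grouping on `A1` alone shows an actual aggregation.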
Logic pipeline
[Diagram] A1, A2, A3 -> Factorize (per column) -> A1 Factorized, A2 Factorized, A3 Factorized -> Combine individual indexes into a unique new one -> A1/A2/A3 Combined -> Factorize -> A1/A2/A3 Factorized -> Counted Sort (A1/A2/A3 Index) and Create Empty ctable -> output columns (groupby columns, M1_Sum, M1_Avg, M2_Sum)
Factorize:
• Factorizing of each carray can be parallel / multi-threaded
• Factorizations of carrays can potentially be cached next to the original carray until the next carray delete / update / insert
• The worst-case cost of the cache would be tripling the size (in case of unique integer columns ;)
Combine:
• The combination step is only needed for a groupby over multiple columns; otherwise take the factorized carray directly
Create empty ctable / counted sort:
• The ctable can be based on the length of the combined factor + the input dtypes
• Can be done in parallel
Fill the output columns:
• Can run in parallel
• Groupby columns have to be filled by deriving the A1/A2/A3 Factorized values back into the original values for lookup
• Measure columns have to use the index to filter the original measure carray and perform the aggregation for each A1/A2/A3 combination
• You can also parallelize the aggregations!
So we first factorize the groupby columns
A1 -> A1 Values + A1 Index
A1     A1 Index
A      0
C      1
A      0
A      0
C      1
B      2
A      0
C      1
B      2
Etc.   Etc.
A1 Values: A, C, B
• While factorizing, you do not yet know how many unique values you will get (the entire column might be unique), so you start out with 2 carrays of the same length as the input
• The hashing is done in-memory (klib), but this should be okay for almost all cases (memory usage is limited to the number of unique values)
• At the end you can resize the Values carray to its actual size
• In case of WORM (write once, read many) it can be very beneficial to cache this result alongside the carray (meaning we end up with three carrays on-disk)
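The factorization step can be sketched in plain Python as follows. This is only an illustration of the idea, not the hash-table implementation the slide refers to: a `dict` stands in for the in-memory hash table, and the values list simply grows instead of being preallocated and resized. The function name `factorize` is an assumption.

```python
def factorize(column):
    """Return (values, index).

    values: the unique values in order of first appearance.
    index:  for each input row, the position of its value in `values`.
    """
    positions = {}               # value -> position in `values` (the hash table)
    values = []                  # grows only with the number of unique values
    index = [0] * len(column)    # same length as the input
    for row, value in enumerate(column):
        position = positions.get(value)
        if position is None:
            # First time we see this value: give it the next free position.
            position = len(values)
            positions[value] = position
            values.append(value)
        index[row] = position
    return values, index


a1 = ["A", "C", "A", "A", "C", "B", "A", "C", "B"]
a1_values, a1_index = factorize(a1)
print(a1_values)  # ['A', 'C', 'B']
print(a1_index)   # [0, 1, 0, 0, 1, 2, 0, 1, 2]
```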
So we end up with 3 factor results
Column  Values      Index                    Unique values
A1      A, C, B     0 1 0 0 1 2 0 1 2 Etc.   3
A2      Y, Z, X     0 1 2 2 2 1 1 0 1 Etc.   3
A3      3, 2, 4, 1  0 1 0 2 3 1 0 1 2 Etc.   4
• The number of unique values is important for the next step, which is combining the indexes into one
• If we group by only one column, no additional step is needed
How to combine the factorized carrays into unique values
• So we have 3 * 3 * 4 = 36 unique combinations; every combined value falls somewhere in the range 0 to 35
• We can create this range by calculating a multiplier for each column: start with a multiplier of 1 for the last column, then work backwards, giving each column the multiplier of the column after it times that column's number of unique values:
Column    # of values  Multiplier  Value example "start"  Value example "end"  Value example "second value of all"  Value example "random"
A1        3            12          0                      2                    1                                    2
A2        3            4           0                      2                    1                                    1
A3        4            1           0                      3                    1                                    3
Combined                           0                      35                   17                                   31
• So 2*12 + 2*4 + 3*1 = 35 (the indexes are zero-based, so the largest index of a column is one less than its number of values) and 2*12 + 1*4 + 3*1 = 31
• We calculate this for each row and end up with a new carray that contains all the combined values
• You can also calculate this back (for instance for 31) by doing:
  • val1 = floor(31 / 12) = 2
  • val2 = floor((31 - val1*12) / 4) = 1
  • val3 = floor((31 - val1*12 - val2*4) / 1) = 3
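As a small worked sketch of this arithmetic, the following plain-Python functions compute the multipliers from the unique-value counts, combine one set of column indexes into a single value, and decode it back. The function names `multipliers`, `encode` and `decode` are hypothetical.

```python
def multipliers(unique_counts):
    """Multiplier per column: 1 for the last column, then the running
    product of the unique-value counts of all columns to its right."""
    result = [1] * len(unique_counts)
    for i in range(len(unique_counts) - 2, -1, -1):
        result[i] = result[i + 1] * unique_counts[i + 1]
    return result


def encode(indexes, mults):
    """Combine one zero-based index per column into a single integer."""
    return sum(index * mult for index, mult in zip(indexes, mults))


def decode(combined, mults):
    """Recover the per-column indexes from a combined integer."""
    indexes = []
    for mult in mults:
        indexes.append(combined // mult)   # floor division, as on the slide
        combined %= mult                   # remainder goes to the next column
    return indexes


mults = multipliers([3, 3, 4])
print(mults)                      # [12, 4, 1]
print(encode([2, 1, 3], mults))   # 31
print(encode([1, 1, 1], mults))   # 17
print(encode([2, 2, 3], mults))   # 35
print(decode(31, mults))          # [2, 1, 3]
```

With zero-based indexes the largest combined value is 35, one less than the 36 possible combinations, so the combined values fit in the range 0 to 35.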
So we create a groupby index & values like this
A1 Index (* 12)   A2 Index (* 4)   A3 Index (* 1)   Groupby Input
0                 0                0                0
1                 1                1                17
0                 2                0                8
0                 2                2                10
1                 2                3                23
2                 1                1                29
0                 1                0                4
1                 0                1                13
2                 1                2                30
Etc.              Etc.             Etc.             Etc.
Calculate the Groupby Input (numexpr can do this very nicely), then factorize it:
Groupby Index    0  1   2  3   4   5   6  7   8   Etc.
Groupby Values   0  17  8  10  23  29  4  13  30
The length of the groupby values is the length of the ctable output!
(Okay, slightly crappy example as everything is unique here ;)
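A plain-Python sketch of this step, using the three index columns and the multipliers 12, 4 and 1 from the example. The slide suggests numexpr for the calculation; a list comprehension is used here only to keep the sketch dependency-free, and the second factorization uses a `dict` as a stand-in for the hash table.

```python
a1_index = [0, 1, 0, 0, 1, 2, 0, 1, 2]
a2_index = [0, 1, 2, 2, 2, 1, 1, 0, 1]
a3_index = [0, 1, 0, 2, 3, 1, 0, 1, 2]

# Combine the three per-column indexes into one integer per row.
groupby_input = [
    i1 * 12 + i2 * 4 + i3 * 1
    for i1, i2, i3 in zip(a1_index, a2_index, a3_index)
]
print(groupby_input)   # [0, 17, 8, 10, 23, 29, 4, 13, 30]

# Factorize the combined column: unique combined values + per-row group index.
positions = {}         # combined value -> group number
groupby_values = []
groupby_index = []
for combined in groupby_input:
    if combined not in positions:
        positions[combined] = len(groupby_values)
        groupby_values.append(combined)
    groupby_index.append(positions[combined])

print(groupby_values)  # [0, 17, 8, 10, 23, 29, 4, 13, 30]
print(groupby_index)   # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

The number of entries in `groupby_values` is the number of rows the output table will have.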
Create the new ctable
• @Valentin: it’s probably better to just create the carrays on the go from iterations, right? (No need to first create an empty one.)
• We know the length from the size of the groupby values carray and the dtypes from the input carrays
Sort
Groupby Index:       0 1 2 2 0 3 0 1 2 3
Groupby Values:      0 17 8 10
Counted sort gives, per value, a count and a sorted carray that holds the row indices (NB: we already have this Cython function through pandas)
Groupby Value Count: 3 2 3 2
Groupby Row Index:   0 4 6 1 7 Etc. (zero-based row numbers, grouped per groupby value)
(I changed the example from slide 8 to make it more understandable ;)
So now you can select rows from the original carrays using index lookups
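A counted sort over the groupby index can be sketched in plain Python as below. This is an illustrative version, not the Cython function mentioned above; the name `counted_sort` and its signature are assumptions. Row numbers are zero-based.

```python
def counted_sort(groupby_index, n_groups):
    """Return (counts, row_index).

    counts:    number of rows per group.
    row_index: row numbers ordered by group, original order kept in a group.
    """
    # Pass 1: count the rows of each group.
    counts = [0] * n_groups
    for group in groupby_index:
        counts[group] += 1

    # Start offset of each group in the sorted output (running sum of counts).
    starts = [0] * n_groups
    for group in range(1, n_groups):
        starts[group] = starts[group - 1] + counts[group - 1]

    # Pass 2: drop every row number into the next free slot of its group.
    next_slot = list(starts)
    row_index = [0] * len(groupby_index)
    for row, group in enumerate(groupby_index):
        row_index[next_slot[group]] = row
        next_slot[group] += 1
    return counts, row_index


counts, row_index = counted_sort([0, 1, 2, 2, 0, 3, 0, 1, 2, 3], 4)
print(counts)      # [3, 2, 3, 2]
print(row_index)   # [0, 4, 6, 1, 7, 2, 3, 8, 5, 9]
```

The first three entries of `row_index` are the rows of group 0, the next two the rows of group 1, and so on, so each group can be read as one contiguous slice.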
Create groupby & measure columns
• I don’t have time for this slide anymore, but using the previous slides we should be okay, I hope ;)
• Basically, create the groupby columns by looking up the correct values in the values carrays, deriving the positions from the groupby input
• Create each measure column by index-selecting the values for each groupby value and applying the aggregation
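Since this last step is only described briefly, here is a hedged plain-Python sketch of how it could look, reusing the groupby values, counts and row index from the sort example together with the multipliers 12, 4 and 1. The measure values in `m1` are made up for illustration, and the output layout (a dict of lists) is an assumption, not the ctable API.

```python
# Factorization results for the groupby columns (from the earlier example).
a1_values = ["A", "C", "B"]
a2_values = ["Y", "Z", "X"]
a3_values = [3, 2, 4, 1]
mults = [12, 4, 1]

# Results of the sort step for a 10-row input with 4 groups.
groupby_values = [0, 17, 8, 10]
counts = [3, 2, 3, 2]
row_index = [0, 4, 6, 1, 7, 2, 3, 8, 5, 9]

# Made-up measure column, one value per input row.
m1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

out = {"A1": [], "A2": [], "A3": [], "M1_sum": [], "M1_avg": []}

# Groupby columns: decode each combined value back into per-column
# positions and look up the original values.
for combined in groupby_values:
    i1, rest = divmod(combined, mults[0])
    i2, i3 = divmod(rest, mults[1])
    out["A1"].append(a1_values[i1])
    out["A2"].append(a2_values[i2])
    out["A3"].append(a3_values[i3])

# Measure columns: each group owns a contiguous slice of row_index.
start = 0
for count in counts:
    rows = row_index[start:start + count]
    selected = [m1[row] for row in rows]
    out["M1_sum"].append(sum(selected))
    out["M1_avg"].append(sum(selected) / count)
    start += count

print(out["A1"])       # ['A', 'C', 'A', 'A']
print(out["A2"])       # ['Y', 'Z', 'X', 'X']
print(out["A3"])       # [3, 2, 3, 4]
print(out["M1_sum"])   # [13, 10, 16, 16]
```

Each group's aggregation only touches its own slice of `row_index`, which is why the aggregations can be run independently of each other.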

Bcolz Groupby Discussion Document
