Dapper Tool - A Bundle to Make your ECL Neater

2019 HPCC Systems® Community Day
Challenge Yourself – Challenge the Status Quo

Rob Mansfield
Senior Data Scientist
Proagrica

Please ask questions!

Who thinks ECL can be a little verbose?

Engineers on big projects may need this level of control. But what about:
• QAs
• Analysts
• Developers
• Data Scientists

For these people, ECL syntax is a bit of a trial!
Dedup
• DEDUP(SORT(DISTRIBUTE(x, HASH(y)), y, LOCAL), y, LOCAL);
One column transform
• PROJECT(x, TRANSFORM(RECORDOF(LEFT), SELF.y := LEFT.y+1; SELF := LEFT;));
Named output
• OUTPUT(x, NAMED('x'));
Write to CSV
• OUTPUT(x, , '~ROB::TEMP::x', CSV(HEADING(SINGLE), SEPARATOR(','), TERMINATOR('\n'), QUOTE('"')));
Grouped count
• [I ran out of space]

How does this stuff work in other languages? Well, R is nice!
library(dplyr)
df <- read.csv('x')
df <- select(df, col1, col2)
df <- mutate(df, col3 = col1 + col2)
df <- group_by(df, col3)
df <- summarise(df, col5 = n())
write.csv(df, file='output.csv')

How does this stuff work in other languages? Well, R is nice!
library(dplyr)
df <- read.csv('x') %>%
  select(col1, col2) %>%
  mutate(col3 = col1 + col2) %>%
  group_by(col3) %>%
  summarise(col5 = n()) %>%
  write.csv(file='output.csv')

SQL is also lovely, but can be hard to arrange into a single call
SELECT COUNT(col2), col1 FROM TABLE GROUP BY col1;

….and Python is, as always, Python
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
…

Enter Dapper…

Let’s work through an example
I don’t know about you but I’ve always wanted to know
Jabba the Hutt’s Body Mass Index…

Load Data
IMPORT dapper.ExampleData;
IMPORT dapper.TransformTools as tt;

View Data
//load data
StarWars := ExampleData.starwars;
// Look at the data
tt.nrows(StarWars);
tt.head(StarWars);

Fill in some blanks
//Fill blank species with unknown
fillblankHome := tt.mutate(StarWars, species, IF(species = '', 'Unkn.', species));
tt.head(fillblankHome);

That’s right, we don’t need LEFT or SELF!!!
What sorcery is this?!?!?

Okay, we now need to make our BMI column!
//make height meters
heightMeters := tt.mutate(fillblankHome, height, height/100);
//Create a BMI for each character
bmi := tt.append(heightMeters, REAL, BMI, mass/(height^2));
//Look at just the new column and name
bmiSelect := tt.select(bmi, 'name, bmi');
tt.head(bmiSelect);
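
For readers who think in Python, here is a minimal sketch of the same three steps (convert height to metres, append a BMI column, keep only name and BMI). It is plain Python, not Dapper or hpycc. The rows are hypothetical values made up for illustration, and it assumes height is stored in centimetres and mass in kilograms.

```python
# Hypothetical rows: names and measurements are made up for illustration,
# not taken from the Star Wars example data.
characters = [
    {"name": "Character A", "height": 200, "mass": 100},
    {"name": "Character B", "height": 150, "mass": 90},
]


def add_bmi(row):
    """Return a copy of the row with height in metres and a new bmi field."""
    height_m = row["height"] / 100        # assumed centimetres -> metres
    bmi = row["mass"] / (height_m ** 2)   # assumed kilograms per square metre
    return {**row, "height": height_m, "bmi": bmi}


# Equivalent in spirit to the mutate + append steps
with_bmi = [add_bmi(row) for row in characters]

# Equivalent in spirit to selecting just the name and bmi columns
name_and_bmi = [{"name": r["name"], "bmi": r["bmi"]} for r in with_bmi]

for row in name_and_bmi:
    print(row["name"], round(row["bmi"], 1))
```

With these made-up rows the loop prints 25.0 for Character A and 40.0 for Character B.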

Sort!
//Find the highest
sortedBMI := tt.arrange(bmiSelect, '-bmi');
tt.head(sortedBMI);

Lovely, I feel that’s one of life’s great questions answered.
I do of course have other questions on Star Wars.

Has anyone else noticed the lack of diversity in the Star Wars universe?
//How many of each species are there?
species := tt.countn(sortedBMI, 'species');
sortedspecies := tt.arrange(species, '-n');
tt.head(sortedspecies);
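
As a rough point of comparison, a grouped count followed by a descending sort can be sketched in plain Python with `collections.Counter`. This only illustrates the idea; it says nothing about how Dapper implements it. The species values below are hypothetical.

```python
from collections import Counter

# Hypothetical species values, made up for illustration.
species_column = ["Human", "Droid", "Human", "Unkn.", "Human", "Droid"]

# Count how many rows fall into each species group
counts = Counter(species_column)

# most_common() returns (value, count) pairs with the largest count first,
# which mirrors counting by species and then arranging by '-n'.
sorted_species = [{"species": s, "n": n} for s, n in counts.most_common()]

for row in sorted_species:
    print(row["species"], row["n"])
```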

There are some pretty exciting eye colours though!
//Finally let's look at unique eye colours:
colourData := tt.select(StarWars, 'eye_color');
uniqueColours := tt.distinct(colourData, 'eye_color');
//see arrangedistinct() for fancy sort/dedup
tt.head(uniqueColours);
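
The base ECL pattern shown earlier (sort, then drop adjacent duplicates) can be sketched in plain Python as follows. This illustrates the sort-then-dedup idea only and is not a claim about what the distinct call does internally. The eye colour values are hypothetical.

```python
from itertools import groupby

# Hypothetical eye colour values, made up for illustration.
eye_colours = ["blue", "yellow", "blue", "red", "yellow"]

# Sort first so equal values sit next to each other,
# then keep one key per run of identical neighbours.
unique_colours = [key for key, _ in groupby(sorted(eye_colours))]

print(unique_colours)
```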

Save
//and save our results
tt.to_csv(sortedBMI, 'ROB::TEMP::STARWARSCSV');
tt.to_thor(sortedBMI, 'ROB::TEMP::STARWARS');

Let’s do a quick side-by-side

Dapper
IMPORT dapper.ExampleData;
IMPORT dapper.TransformTools as tt;
//load data
StarWars := ExampleData.starwars;
// Look at the data
tt.nrows(StarWars);
tt.head(StarWars);
fillblankHome := tt.mutate(StarWars, species, IF(species = '', 'Unkn.', species));
tt.head(fillblankHome);
bmi := tt.append(fillblankHome, REAL, BMI, mass/height^2);
tt.head(bmi);
//Find the highest
sortedBMI := tt.arrange(bmi, '-bmi');
tt.head(sortedBMI);
//Jabba should probably go on a diet.

Base ECL
IMPORT dapper.ExampleData;
//load data
StarWars := ExampleData.starwars;
// Look at the data
OUTPUT(COUNT(StarWars), NAMED('COUNTstarWars'));
OUTPUT(StarWars, NAMED('starWars'));
fillblankHomeAndBMI :=
  PROJECT(StarWars,
    TRANSFORM({RECORDOF(LEFT); REAL BMI;},
      SELF.BMI := LEFT.mass / LEFT.Height^2;
      SELF.species := IF(LEFT.species = '', 'Unkn.', LEFT.species);
      SELF := LEFT;));
OUTPUT(fillblankHomeAndBMI, NAMED('fillblankHomeAndBMI'));
//Find the highest
sortedBMI := SORT(fillblankHomeAndBMI, -bmi);
OUTPUT(sortedBMI, NAMED('sortedBMI'));
//Jabba should probably go on a diet.

Dapper
species := tt.countn(sortedBMI, 'species');
sortedspecies := tt.arrange(species, '-n');
tt.head(sortedspecies);
//Finally let's look at eye colour:
colourData := tt.select(StarWars, 'eye_color');
uniqueColours := tt.distinct(colourData, 'eye_color');
//see arrangedistinct() for fancy sort/dedup
tt.head(uniqueColours);
tt.to_csv(sortedBMI, 'ROB::TEMP::STARWARSCSV');

Base ECL
CountRec := RECORD
  STRING Species := sortedBMI.species;
  INTEGER n := COUNT(GROUP);
END;
species := TABLE(sortedBMI, CountRec, species);
sortedspecies := SORT(species, -n);
OUTPUT(sortedspecies, NAMED('sortedspecies'));
//Finally let's look at unique eye colour:
colourData := TABLE(sortedBMI, {eye_color});
uniqueColours := DEDUP(SORT(DISTRIBUTE(colourData, HASH(eye_color)),
                            eye_color, LOCAL), eye_color, LOCAL);
OUTPUT(COUNT(uniqueColours), NAMED('COUNTuniqueColours'));
OUTPUT(uniqueColours, NAMED('uniqueColours'));
OUTPUT(sortedBMI, , 'ROB::TEMP::STARWARSCSV',
       CSV(HEADING(SINGLE), SEPARATOR(','),
           TERMINATOR('\n'), QUOTE('"')));

…and we still haven’t even scratched the surface…

Interested? You can install from our GitHub:
ecl bundle install https://github.com/OdinProAgrica/dapper.git
There’s also a more in-depth walkthrough (and infographic) here:
https://hpccsystems.com/blog/dapper-bundle
Similar projects? Yes, yes we have!
https://github.com/OdinProAgrica

Bonus deck! We would like to introduce you to hpycc

Hpycc is a Python package that builds on the ideas of Dapper
That is:
How can we make HPCC Systems more usable for the Data Scientist?
How can this translate to engineering and development?

Things I find overly taxing
• Spraying new data
• Running scripts that I can customise easily
• Getting the results of queries and files
• ECL dev when I’m offsite

What if you could run all this from a Python notebook?
Now you can!

For the purposes of this demo I’ve made a throwaway function

I’m dev-ing locally so I’ll need HPCC Systems running
…then create a connection to my server

Let’s grab the raw Star Wars dataset…

What if we have more than one output?

Interested? You can install from PyPI:
pip install hpycc
There’s also more info on our GitHub:
https://github.com/OdinProAgrica/hpycc

Watch this space for our most recent project: Wally!

A little flavour of what we have already…

Interested? You can install from our GitHub:
pip install hpycc
There’s also more info on our GitHub:
https://github.com/OdinProAgrica/wally

Oh, and Dapper has some string tools!

…we are also building stringtools as part of the Dapper bundle
IMPORT dapper.stringtools as st;
source := 'No1 e-xp-ec-t-s t809he [S]pammish ReQuIsiTion';
target := 'nobody expects the spanish inquisition';

IMPORT STD;
one := TRIM(std.Str.ToLowerCase(source), LEFT, RIGHT);
two := REGEXREPLACE('1', one, 'body');
three := REGEXREPLACE('[^a-z ]', two, '');
four := REGEXREPLACE('mm', three, 'n');
five := REGEXREPLACE('req', four, 'inq');
six := REGEXREPLACE('\\s+', five, ' ');
six;

IMPORT dapper.stringtools as st;
regexDS := DATASET([
{'1' , 'body'},
{'[^a-z ]', '' },
{'mm' , 'n' },
{'req' , 'inq' },
{'\\s+' , ' ' }
], {STRING Regex; STRING Repl;});
st.regexLoop(source, regexDS);
target;
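
The idea of looping over an ordered table of regex/replacement pairs can be sketched in plain Python with `re.sub`. The helper name `regex_loop` is hypothetical and is not the Dapper implementation. The sketch lowercases and trims the input first, as the base ECL version does; whether the Dapper call does this for you is an assumption left open here. The whitespace rule is written as `\s+`, which is what the final step appears intended to be.

```python
import re


def regex_loop(text, rules):
    """Apply each (pattern, replacement) pair to the text, in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


source = 'No1 e-xp-ec-t-s t809he [S]pammish ReQuIsiTion'
target = 'nobody expects the spanish inquisition'

# Same ordered rules as the dataset of regex/replacement pairs
rules = [
    (r'1', 'body'),
    (r'[^a-z ]', ''),
    (r'mm', 'n'),
    (r'req', 'inq'),
    (r'\s+', ' '),
]

# Lowercase and trim first, mirroring the first step of the base ECL version
result = regex_loop(source.lower().strip(), rules)
print(result == target)
```

Order matters: the digit `1` has to become `body` before the `[^a-z ]` rule strips every remaining non-letter.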

Questions?
Rob Mansfield
Senior Data Scientist
Proagrica, RBI
Rob.Mansfield@proagrica.com

View this presentation on YouTube (20:46):
https://www.youtube.com/watch?v=jOORZdOWnxk&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=5&t=0s
