User Defined Functions 
… new in Apache Cassandra 3.0 
CASSANDRA-7395 + many more…
Me, Robert… 
• Contribute code to Apache Cassandra (UDFs, row cache + more) 
• Help customers to build Cassandra solutions 
• Freelancer, Coder 
Robert Stupp, Cologne, Germany 
@snazy snazy@snazy.de
Disclaimer 
• Apache Cassandra 3.0 is the next major release 
• Everything is under development 
• Things may change 
• Things may be different in final 3.0 release
Apache Cassandra 3.0 
• … will bring a lot of cool, new features ! 
• … will bring a lot of great improvements ! 
• UDFs is just one of these features :)
UDF - 
What’s that??
UDF 
• UDF means User Defined Function 
• You write the code that’s executed on Cassandra nodes 
• Functions are distributed transparently to the whole cluster 
• You may not have to wait for a new release for new functionality :)
UDF Characteristics 
• „Pure“ 
• just input parameters 
• no state, side effects, dependencies to other code, etc 
• Usually deterministic
Consider a Java function like… 
import nothing; 
public final class MyClass 
{ 
public static int myFunction ( int argument ) 
{ 
return argument * 42; 
} 
} 
This would be your UDF
CREATE FUNCTION sinPlusFoo 
( 
valueA double, 
valueB double 
arguments 
return type 
) 
RETURNS double 
LANGUAGE java 
AS ’return Math.sin(valueA) + valueB;’; 
Java works out of the box! 
Example 
define the UDF language 
Java code
Next Example 
CREATE FUNCTION sin ( 
value double ) 
RETURNS double 
LANGUAGE javascript 
AS ’Math.sin(value);’; 
JavaScript works out of the box! 
Cassandra 3.0 targets Java8 - so it’s „Nashorn“ 
JavaScript works, too! 
Javascript code
JSR 223 
• “Scripting for the Java Platform“ 
• UDFs can be written in Java and JavaScript 
• Optionally: Groovy, JRuby, Jython, Scala 
• Not: Clojure (JSR 223 implementation’s wrong)
Behind the scenes 
• Builds Java (or script) source 
• Compiles that code (Java class, or compiled script) 
• Loads the compiled code 
• Migrates the function to all other nodes 
• Done - UDF is executable on any node
Types for UDFs 
• Support for all Cassandra types for arguments and return value 
• All means 
• Primitives (boolean, int, double, uuid, etc) 
• Collections (list, set, map) 
• Tuple types, User Defined Types
UDF - 
For what?
UDF invocation 
SELECT sumThat ( colA, colB ) 
Now your application can 
sum two values in one row - 
or create the sin of a value! 
GREAT NEW FEATURES! 
Okay - not really… 
FROM myTable 
WHERE key = ... 
SELECT sin ( foo ) 
FROM myCircle 
WHERE pk = ...
UDFs are good for… 
• UDFs on their own are just „nice to have“ 
• Nothing you couldn’t do better in your application
Real Use Case for UDFs ?
User Defined Aggregates ! 
CASSANDRA-8053
User Defined Aggregates 
Use UDFs to code your own aggregation functions 
(Aggregates are things like SUM, AVG, MIN, MAX, etc) 
Aggregates : 
consume values from multiple rows & 
produce a single result
Example 
name of the aggregate 
function argument types 
CREATE AGGREGATE minimum ( int ) 
STYPE int 
SFUNC minimumState; 
name of the “state“ UDF 
Syntax similar to Postgres. 
state type
How an aggregate works 
SELECT minimum ( val ) FROM foo … 
1. Initial state is set to null 
2. for each row the state function is called with 
current state and column value - returns new state 
3. After all rows the aggregate returns the last state
More sophisticated 
CREATE AGGREGATE average ( int ) 
SFUNC averageState 
STYPE tuple<long,int> 
FINALFUNC averageFinal 
INITCOND (0, 0); 
UDF called after last 
row 
FINALFUNC + INITCOND are optional initial state value
How that aggregate works 
SELECT average ( val ) FROM foo … 
1. Initial state is set to (0,0) 
2. for each row the state function is called with 
current state + column value - returns new state 
3. After all rows the final function is called with last state 
4. final function calculates the aggregate
Now everybody can execute evil 
code on your cluster :)
UDF permissions 
• There will be permissions to restrict (allow) 
• UDF creation (DDL) 
• UDF execution (DML) 
CASSANDRA-7557
Built in functions 
• All known built-in functions are called native functions 
• Native functions belong to SYSTEM keyspace 
• Native functions cannot be modified (or dropped) 
Note: you already know native functions like 
now, count, unixtimestampof
UDF belong to a keyspace 
• User Defined Functions and 
• User Defined Aggregate 
• belong to a keyspace 
• SYSTEM keyspace is searched first for functions 
(then the current keyspace) 
if function/aggregate is not fully qualified
UDF - some final words… 
Keep in mind: 
• JSR-223 has overhead - Java UDFs are much faster 
• Do not allow everyone to create UDFs (in production) 
• Keep your UDFs “pure“ 
• Test your UDFs and user defined aggregates thoroughly
For the geeks :) 
• UDFs and user defined aggregates are executed on the coordinator node 
• Prefer to use Java-UDFs for performance reasons
Let a man dream… 
UDFs could be useful for… 
• Functional indexes 
• Partial indexes 
• Filtering 
• Distributed GROUP BY 
• etc etc
Q & A 
THANK YOU FOR YOUR ATTENTION :) 
Robert Stupp, Cologne, Germany 
@snazy snazy@snazy.de
User defined-functions-cassandra-summit-eu-2014

User defined-functions-cassandra-summit-eu-2014

  • 1.
    User Defined Functions … new in Apache Cassandra 3.0 CASSANDRA-7395 + many more…
  • 2.
    Me, Robert… •Contribute code to Apache Cassandra (UDFs, row cache + more) • Help customers to build Cassandra solutions • Freelancer, Coder Robert Stupp, Cologne, Germany @snazy snazy@snazy.de
  • 3.
    Disclaimer • ApacheCassandra 3.0 is the next major release • Everything is under development • Things may change • Things may be different in final 3.0 release
  • 4.
    Apache Cassandra 3.0 • … will bring a lot of cool, new features ! • … will bring a lot of great improvements ! • UDFs is just one of these features :)
  • 5.
  • 6.
    UDF • UDFmeans User Defined Function • You write the code that’s executed on Cassandra nodes • Functions are distributed transparently to the whole cluster • You may not have to wait for a new release for new functionality :)
  • 7.
    UDF Characteristics •„Pure“ • just input parameters • no state, side effects, dependencies to other code, etc • Usually deterministic
  • 8.
    Consider a Javafunction like… import nothing; public final class MyClass { public static int myFunction ( int argument ) { return argument * 42; } } This would be your UDF
  • 9.
    CREATE FUNCTION sinPlusFoo ( valueA double, valueB double arguments return type ) RETURNS double LANGUAGE java AS ’return Math.sin(valueA) + valueB;’; Java works out of the box! Example define the UDF language Java code
  • 10.
    Next Example CREATEFUNCTION sin ( value double ) RETURNS double LANGUAGE javascript AS ’Math.sin(value);’; JavaScript works out of the box! Cassandra 3.0 targets Java8 - so it’s „Nashorn“ JavaScript works, too! Javascript code
  • 11.
    JSR 223 •“Scripting for the Java Platform“ • UDFs can be written in Java and JavaScript • Optionally: Groovy, JRuby, Jython, Scala • Not: Clojure (JSR 223 implementation’s wrong)
  • 12.
    Behind the scenes • Builds Java (or script) source • Compiles that code (Java class, or compiled script) • Loads the compiled code • Migrates the function to all other nodes • Done - UDF is executable on any node
  • 13.
    Types for UDFs • Support for all Cassandra types for arguments and return value • All means • Primitives (boolean, int, double, uuid, etc) • Collections (list, set, map) • Tuple types, User Defined Types
  • 14.
    UDF - Forwhat?
  • 15.
    UDF invocation SELECTsumThat ( colA, colB ) Now your application can sum two values in one row - or create the sin of a value! GREAT NEW FEATURES! Okay - not really… FROM myTable WHERE key = ... SELECT sin ( foo ) FROM myCircle WHERE pk = ...
  • 16.
    UDFs are goodfor… • UDFs on their own are just „nice to have“ • Nothing you couldn’t do better in your application
  • 17.
    Real Use Casefor UDFs ?
  • 18.
    User Defined Aggregates! CASSANDRA-8053
  • 19.
    User Defined Aggregates Use UDFs to code your own aggregation functions (Aggregates are things like SUM, AVG, MIN, MAX, etc) Aggregates : consume values from multiple rows & produce a single result
  • 20.
    Example name ofthe aggregate function argument types CREATE AGGREGATE minimum ( int ) STYPE int SFUNC minimumState; name of the “state“ UDF Syntax similar to Postgres. state type
  • 21.
    How an aggregateworks SELECT minimum ( val ) FROM foo … 1. Initial state is set to null 2. for each row the state function is called with current state and column value - returns new state 3. After all rows the aggregate returns the last state
  • 22.
    More sophisticated CREATEAGGREGATE average ( int ) SFUNC averageState STYPE tuple<long,int> FINALFUNC averageFinal INITCOND (0, 0); UDF called after last row FINALFUNC + INITCOND are optional initial state value
  • 23.
    How that aggregateworks SELECT average ( val ) FROM foo … 1. Initial state is set to (0,0) 2. for each row the state function is called with current state + column value - returns new state 3. After all rows the final function is called with last state 4. final function calculates the aggregate
  • 24.
    Now everybody canexecute evil code on your cluster :)
  • 25.
    UDF permissions •There will be permissions to restrict (allow) • UDF creation (DDL) • UDF execution (DML) CASSANDRA-7557
  • 26.
    Built in functions • All known built-in functions are called native functions • Native functions belong to SYSTEM keyspace • Native functions cannot be modified (or dropped) Note: you already know native functions like now, count, unixtimestampof
  • 27.
    UDF belong toa keyspace • User Defined Functions and • User Defined Aggregate • belong to a keyspace • SYSTEM keyspace is searched first for functions (then the current keyspace) if function/aggregate is not fully qualified
  • 28.
    UDF - somefinal words… Keep in mind: • JSR-223 has overhead - Java UDFs are much faster • Do not allow everyone to create UDFs (in production) • Keep your UDFs “pure“ • Test your UDFs and user defined aggregates thoroughly
  • 29.
    For the geeks:) • UDFs and user defined aggregates are executed on the coordinator node • Prefer to use Java-UDFs for performance reasons
  • 30.
    Let a mandream… UDFs could be useful for… • Functional indexes • Partial indexes • Filtering • Distributed GROUP BY • etc etc
  • 31.
    Q & A THANK YOU FOR YOUR ATTENTION :) Robert Stupp, Cologne, Germany @snazy snazy@snazy.de