DataFu (now Apache DataFu) is a collection of user-defined functions (UDFs) for working with large-scale data in Hadoop and Pig. The library was born out of the need for a stable, well-tested set of UDFs for data mining and statistics. It is used at LinkedIn in many of our offline workflows for data-derived products like “People You May Know” and “Skills”. It contains functions for:
* PageRank
* Quantiles (median), variance, etc.
* Sessionization
* Convenience bag functions (e.g., set operations, enumerating bags)
* Convenience utility functions (e.g., assertions, easier writing of EvalFuncs)
* and more…
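
To make one item on the list concrete, here is a minimal Python sketch of what sessionization means: grouping each user's events into sessions separated by a period of inactivity. This is only an illustration of the idea, not DataFu's implementation (DataFu provides this as a Pig UDF); the function name `sessionize`, the 30-minute gap, and the sample events are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def sessionize(events, gap=timedelta(minutes=30)):
    """Assign a per-user session index to each (user, timestamp) event.

    A new session starts for a user whenever the time since that user's
    previous event exceeds `gap` (hypothetical default: 30 minutes).
    """
    last_seen = {}   # user -> timestamp of that user's previous event
    session_no = {}  # user -> index of that user's current session
    out = []
    # Process events per user in chronological order.
    for user, ts in sorted(events, key=lambda e: (e[0], e[1])):
        prev = last_seen.get(user)
        if prev is None:
            session_no[user] = 0          # first event seen for this user
        elif ts - prev > gap:
            session_no[user] += 1         # inactivity gap exceeded: new session
        last_seen[user] = ts
        out.append((user, ts, session_no[user]))
    return out


# Made-up sample data for illustration only.
events = [
    ("alice", datetime(2024, 1, 1, 9, 0)),
    ("alice", datetime(2024, 1, 1, 9, 10)),
    ("alice", datetime(2024, 1, 1, 10, 30)),
    ("bob", datetime(2024, 1, 1, 9, 5)),
]
sessions = sessionize(events)
```

In this sample, the third event for `alice` arrives 80 minutes after her previous one, so it opens a second session, while `bob` has a single session of his own.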
Check out the project page:
http://datafu.incubator.apache.org/