Michal Malohlava discusses the magic behind the math, showing how the open source big data analysis platform H2O uses Scala to get work done, and demos how users can interact with Scala to get the most out of data analysis.
2. Why Scala & H2O?
● H2O ~ fast, distributed, large scale computation platform providing rich Java API
  – But low-level and for many users too complicated
public class ShuffleTask extends MRTask2<ShuffleTask> {
  @Override public void map(Chunk ic, Chunk oc) {
    if (ic._len==0) return;
    // Each vector is shuffled in the same way
    Random rng = Utils.getRNG(0xe031e74f321f7e29L + (ic.cidx() << 32L));
    oc.set0(0,ic.at0(0));
    for (int row=1; row<ic._len; row++) {
      int j = rng.nextInt(row+1); // inclusive upper bound <0,row>
      if (j!=row) oc.set0(row, oc.at0(j));
      oc.set0(j, ic.at0(row));
    }
  }
}
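The loop in the map method above is an inside-out Fisher-Yates shuffle. A minimal sketch of the same loop in plain Scala, without any H2O types, may make the pattern easier to see. The names `ShuffleSketch` and `shuffled` are hypothetical and plain arrays stand in for chunks.

```scala
import scala.util.Random

// Hypothetical stand-alone sketch; plain arrays stand in for H2O chunks.
object ShuffleSketch {
  // Inside-out Fisher-Yates: builds a shuffled copy without mutating the input.
  def shuffled(in: Array[Double], seed: Long): Array[Double] = {
    val out = new Array[Double](in.length)
    if (in.isEmpty) return out
    val rng = new Random(seed)
    out(0) = in(0)
    for (row <- 1 until in.length) {
      val j = rng.nextInt(row + 1)    // uniform in 0..row, inclusive
      if (j != row) out(row) = out(j) // move the displaced value to the new slot
      out(j) = in(row)                // place the incoming value at position j
    }
    out
  }
}
```

Using the same seed for every column gives every column the same permutation, which is the point of the "shuffled in the same way" comment in the Java code.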
3. What we provide
● ScAlH2O - Scala library providing a DSL
  – Abstraction of the H2O low-level API
  – Easy data manipulation and distributed computation
  – BUT still inside JVM
● Scala REPL integration into H2O
  – Console for experimenting with ScAlH2O
5. Frame operations
● Parse data
val f = parse("smalldata/cars.csv")
● Basic slicing
  – Column/Rows selectors, append
val f1 = f("name") ++ f(*, 5 to 7)
● Scalar operations
val f2 = f1("year") + 1900
● Support head/tail/ncols/nrows/...
● Cooperation with H2O distributed KV store
  – Load/save operations
val g = load("cars.hex")
val g1 = g ++ g("year") > 80
save("cars.hex", g1)
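To illustrate how such a syntax can be expressed in Scala at all, here is a toy in-memory sketch. It is not the ScAlH2O implementation. `FrameSketch`, `Frame`, and `AllRows` are hypothetical names, and columns are plain vectors of doubles rather than distributed H2O vectors.

```scala
// Hypothetical toy model of the selector/append/scalar syntax; not the ScAlH2O code.
object FrameSketch {
  sealed trait AllRows
  object * extends AllRows // stands for "every row" in f(*, cols)

  final case class Frame(names: Vector[String], cols: Vector[Vector[Double]]) {
    // Select a single column by name.
    def apply(name: String): Frame = {
      val i = names.indexOf(name)
      require(i >= 0, s"unknown column: $name")
      Frame(Vector(name), Vector(cols(i)))
    }
    // Select all rows of a range of columns.
    def apply(rows: AllRows, range: Range): Frame =
      Frame(range.map(names).toVector, range.map(cols).toVector)
    // Append the columns of another frame.
    def ++(that: Frame): Frame = Frame(names ++ that.names, cols ++ that.cols)
    // Scalar operations applied to every value.
    def +(k: Double): Frame = copy(cols = cols.map(_.map(_ + k)))
    def >(k: Double): Frame = copy(cols = cols.map(_.map(v => if (v > k) 1.0 else 0.0)))
  }
}
```

One point worth knowing when chaining these operators: Scala gives `++` higher precedence than `>`, so `g ++ g("year") > 80` parses as `(g ++ g("year")) > 80`. Write `g ++ (g("year") > 80)` if the comparison should apply only to the year column.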
6. Map/filter/collect operations
● Map
  – Per value/row
● Filter
● Collect
// Returns a boolean vector
val fm = f map ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4;
});
// Collect all cars with more than 4 cylinders
val ff = f filter ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4;
});
// Compute the sum of column 2 (zero-based index, the cylinders column)
val fc = f collect ( 0.0, new CDOp() {
  def apply(acc:scala.Double, rhs:Array[scala.Double]) =
    acc + rhs(2)
  def reduce(l:scala.Double,r:scala.Double) = l+r
} )
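The semantics of the three operations can be sketched on ordinary Scala collections. This is a local model under assumptions, not the distributed implementation. `RowOpsSketch`, `RowPredicate`, and `Collector` are hypothetical names, and a list of row groups stands in for the chunks held by different nodes.

```scala
// Hypothetical local model of map/filter/collect; not the ScAlH2O classes.
object RowOpsSketch {
  type Row = Array[Double]

  trait RowPredicate { def apply(row: Row): Boolean }
  trait Collector {
    def apply(acc: Double, row: Row): Double // fold one row into a partial result
    def reduce(l: Double, r: Double): Double // combine two partial results
  }

  // One boolean per row.
  def map(rows: Seq[Row], op: RowPredicate): Seq[Boolean] = rows.map(r => op(r))

  // Keep only rows for which the predicate holds.
  def filter(rows: Seq[Row], op: RowPredicate): Seq[Row] = rows.filter(r => op(r))

  // Fold each group of rows separately, then combine the partial results.
  def collect(chunks: Seq[Seq[Row]], zero: Double, op: Collector): Double =
    chunks
      .map(chunk => chunk.foldLeft(zero)((acc, row) => op(acc, row)))
      .reduceOption((l, r) => op.reduce(l, r))
      .getOrElse(zero)
}
```

The separate `reduce` step is what lets partial sums computed on different chunks be merged, which is why the collect operation on the slide takes both `apply` and `reduce`.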
8. Internals
● No magic, BUT there are key tricks
  – connect H2O classloaders with Scala ecosystem
    ● Make sure that all distributed objects are correctly iced
  – make translation of Scala code into calls of Java API
    ● Pass operations around the cloud
    ● Create H2O MR tasks
    ● Create new frames
  – preserve primitive types
    ● do not introduce overhead of boxing/unboxing
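One standard Scala 2 mechanism for the last point is the `@specialized` annotation, which asks the compiler to generate primitive variants of a generic trait so that calls on `Double` or `Int` values need not box. The slides do not say whether ScAlH2O uses it, so the sketch below is only an illustration of the technique. `SpecializedSketch` and `RowFn` are hypothetical names.

```scala
// Hypothetical illustration of @specialized; not confirmed to be what ScAlH2O does.
object SpecializedSketch {
  // The compiler emits extra variants of apply for the listed primitive types.
  trait RowFn[@specialized(Int, Double) A, @specialized(Int, Double, Boolean) B] {
    def apply(a: A): B
  }

  val moreThan4: RowFn[Double, Boolean] = new RowFn[Double, Boolean] {
    def apply(c: Double): Boolean = c > 4
  }

  // Counts the values accepted by fn, iterating over a primitive array.
  def count(values: Array[Double], fn: RowFn[Double, Boolean]): Int = {
    var n = 0
    var i = 0
    while (i < values.length) {
      if (fn(values(i))) n += 1
      i += 1
    }
    n
  }
}
```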
9. Internals – translation to H2O MR tasks
// Collect all cars with more than 4 cylinders
val f5 = f filter ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4;
});
// T_A2B_Transf has to be water.Freezable
def filter(af: T_A2B_Transf[scala.Double]):T = {
  val f = frame()
  val mrt = new MRTask2() {
    override def map(in:Array[Chunk], out:Array[NewChunk]) = {
      val rlen = in(0)._len
      val tmprow = new Array[scala.Double](in.length)
      for (row:Int <- 0 until rlen ) {
        if (af(Utils.readRow(in,row,tmprow))) {
          for (i:Int <- 0 until in.length) out(i).addNum(tmprow(i))
        }
      }
    }
  }
  mrt.doAll(f.numCols(), f)
  val result = mrt.outputFrame(f.names(), f.domains())
  apply(result) // return the DFrame
}
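The body of the map method above can be tried without an H2O cloud. In this sketch plain arrays stand in for `Chunk` and `ArrayBuffer` for `NewChunk`. The `readRow` helper here is a hypothetical local stand-in for `Utils.readRow`, and its behaviour (copy one row into a reusable buffer and return it) is an assumption based on how the slide uses it.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical local sketch: arrays stand in for Chunk, ArrayBuffer for NewChunk.
object ChunkFilterSketch {
  // Copies row `row` from column-major input into `buf` and returns `buf`.
  def readRow(in: Array[Array[Double]], row: Int, buf: Array[Double]): Array[Double] = {
    var i = 0
    while (i < in.length) { buf(i) = in(i)(row); i += 1 }
    buf
  }

  // Keeps the rows accepted by `keep`, writing them out column by column.
  def filterChunk(in: Array[Array[Double]],
                  keep: Array[Double] => Boolean): Array[ArrayBuffer[Double]] = {
    val out = Array.fill(in.length)(ArrayBuffer.empty[Double])
    if (in.isEmpty) return out
    val rlen = in(0).length
    val tmprow = new Array[Double](in.length) // one buffer reused for every row
    for (row <- 0 until rlen) {
      if (keep(readRow(in, row, tmprow))) {
        for (i <- in.indices) out(i) += tmprow(i)
      }
    }
    out
  }
}
```

Data is stored per column, so each row has to be assembled before the user's row-wise predicate can see it. Reusing a single buffer avoids allocating an array per row.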
11. Towards Scalding-like API
● Vision is to provide Scalding-like syntax
// input scheme -> output scheme, followed by the transformation
f map ( ('name, 'cylinders) -> ('name, 'moreThan4) )
  { (n:String, c:Int) => (n, if (c>4) 1 else 0) }
● But so far the DSL is still ugly
f map (f, ('name, 'cylinders) -> ('name, 'moreThan4) )
  { new IcedFunctor2to2[Double,Int,Double,Int] {
      def apply(n:Double, c:Int) = (n, if (c>4) 1 else 0) }
  }
12. Try and contribute!
> git clone git@github.com:0xdata/h2o.git
> cd h2o
> git checkout -b h2oscala origin/h2oscala
> cd h2o-scala && ./depl.sh # or sbt compile
=== Welcome to the world of ScAlH2O ===
Type `help` or `example` to begin...
h2o>
Thank you!