Michal Malohlava presents: Open Source H2O and Scala


Michal Malohlava discusses the magic behind the math: how the open source big data analysis platform H2O uses Scala to get work done, and demos how users can interact with Scala to get the most out of data analysis.


val ScAlH2O = Scala ++ H2O
San Francisco Data Science

Why Scala & H2O?
● H2O ~ fast, distributed, large-scale computation platform providing a rich Java API
– But low-level and for many users too complicated

public class ShuffleTask extends MRTask2<ShuffleTask> {
  @Override public void map(Chunk ic, Chunk oc) {
    if (ic._len==0) return;
    // Each vector is shuffled in the same way
    Random rng = Utils.getRNG(0xe031e74f321f7e29L + (ic.cidx() << 32L));
    oc.set0(0,ic.at0(0));
    for (int row=1; row<ic._len; row++) {
      int j = rng.nextInt(row+1); // inclusive upper bound <0,row>
      if (j!=row) oc.set0(row, oc.at0(j));
      oc.set0(j, ic.at0(row));
    }
  }
}
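For comparison, the per-chunk logic of the Java task above is an inside-out Fisher-Yates shuffle. The sketch below restates only that loop in plain Scala over an in-memory array. It does not use the H2O API; `ShuffleSketch`, `shuffled`, and the `seed` parameter are hypothetical names introduced for illustration.

```scala
import scala.util.Random

object ShuffleSketch {
  // Inside-out Fisher-Yates: builds a shuffled copy of `in` and leaves `in` untouched.
  // A plain array stands in for an H2O chunk here (an assumption of this sketch).
  def shuffled(in: Array[Double], seed: Long): Array[Double] = {
    val out = new Array[Double](in.length)
    if (in.nonEmpty) {
      val rng = new Random(seed)
      out(0) = in(0)
      for (row <- 1 until in.length) {
        val j = rng.nextInt(row + 1) // uniform in 0..row, inclusive
        if (j != row) out(row) = out(j) // move the displaced value to the new slot
        out(j) = in(row)
      }
    }
    out
  }
}
```

Calling `ShuffleSketch.shuffled(Array(1.0, 2.0, 3.0), 7L)` returns a permutation of the input; the same seed always gives the same permutation.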

What we provide
● ScAlH2O - Scala library providing a DSL
– Easy data manipulation and distributed computation
– Abstracting of H2O low-level API
– BUT still inside JVM
● Scala REPL integration into H2O
– Console for experimenting with ScAlH2O

Basic concepts
● First-class entities
– Scalars
– Frames
● Scala expressions
● Access to H2O algos
– And still preserving access to the low-level H2O API

Frame operations
● Parse data
val f = parse("smalldata/cars.csv")
● Basic slicing
– Column/Rows selectors, append
val f1 = f("name") ++ f(*, 5 to 7)
● Scalar operations
val f2 = f1("year") + 1900
● Support head/tail/ncols/nrows/...
● Cooperation with H2O distributed KV store
– Load/save operations
val g = load("cars.hex")
val g1 = g ++ g("year") > 80
save("cars.hex", g1)
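To make the selector, append, and scalar operations above concrete, here is a minimal in-memory sketch of how such operators could be expressed in Scala. It is not the ScAlH2O implementation: the `Frame` case class, its column representation, and the simplified range selector are assumptions made for illustration only.

```scala
object FrameSketch {
  // Hypothetical miniature frame: an ordered list of named columns of doubles.
  final case class Frame(cols: Vector[(String, Vector[Double])]) {
    // Select a single column by name.
    def apply(name: String): Frame = Frame(cols.filter(_._1 == name))
    // Select columns by zero-based index range (a simplification of f(*, 5 to 7)).
    def apply(range: Range): Frame = Frame(range.map(i => cols(i)).toVector)
    // Append the columns of another frame.
    def ++(other: Frame): Frame = Frame(cols ++ other.cols)
    // Add a scalar to every value.
    def +(x: Double): Frame = Frame(cols.map { case (n, v) => (n, v.map(_ + x)) })
    def ncols: Int = cols.length
    def nrows: Int = cols.headOption.map(_._2.length).getOrElse(0)
  }
}
```

With a frame `f` holding columns `id`, `year`, and `cylinders`, the expression `f("id") ++ f(1 to 2)` yields a three-column frame, and `f("year") + 1900` shifts every year value by 1900.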

Map/filter/collect operations
● Map
– Per value/row
● Filter
● Collect

// Returns a boolean vector
val fm = f map ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4;
});

// Collect all cars with more than 4 cylinders
val ff = f filter ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4;
});

// Compute the sum of the column at index 2
val fc = f collect ( 0.0, new CDOp() {
  def apply(acc:scala.Double, rhs:Array[scala.Double]) =
    acc + rhs(2)
  def reduce(l:scala.Double,r:scala.Double) = l+r
} )
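The three operations above can be modeled on ordinary Scala collections, which may help when reading the DSL snippets. The sketch below is only an illustration: `RowPredicate` and `RowFold` are hypothetical stand-ins for the slide's `FAOp` and `CDOp`, and a sequence of partitions stands in for distributed chunks. It assumes the initial value is neutral for `reduce` (as 0.0 is for a sum), because every partition starts from it.

```scala
object RowOpsSketch {
  type Row = Array[Double]

  // Hypothetical stand-in for FAOp: a row-to-boolean operation.
  trait RowPredicate { def apply(row: Row): Boolean }

  // Hypothetical stand-in for CDOp: a per-row accumulate step plus a merge of partial results.
  trait RowFold {
    def apply(acc: Double, row: Row): Double
    def reduce(l: Double, r: Double): Double
  }

  // map: one boolean per row.
  def mapRows(rows: Seq[Row], op: RowPredicate): Seq[Boolean] =
    rows.map(r => op(r))

  // filter: keep only the rows accepted by the predicate.
  def filterRows(rows: Seq[Row], op: RowPredicate): Seq[Row] =
    rows.filter(r => op(r))

  // collect: fold each partition on its own, then merge the partial results.
  def collectRows(parts: Seq[Seq[Row]], zero: Double, op: RowFold): Double =
    parts
      .map(p => p.foldLeft(zero)((acc, r) => op(acc, r)))
      .reduceOption((l, r) => op.reduce(l, r))
      .getOrElse(zero)
}
```

Folding per partition and merging afterwards mirrors the split between `apply` and `reduce` on the slide: `apply` runs locally on each piece of data, `reduce` combines the partial results.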

Internals
It's magic

Internals
● No magic, BUT there are key tricks
– connect H2O classloaders with Scala ecosystem
  ● Make sure that all distributed objects are correctly iced
– make translation of Scala code into calls of Java API
  ● Pass operations around the cloud
  ● Create H2O MR tasks
  ● Create new frames
– preserve primitive types
  ● do not introduce overhead of boxing/unboxing

Internals – translation to H2O MR tasks

// Collect all cars with more than 4 cylinders
val f5 = f filter ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4;
});

T_A2B_Transf has to be water.Freezable

def filter(af: T_A2B_Transf[scala.Double]):T = {
  val f = frame()
  val mrt = new MRTask2() {
    override def map(in:Array[Chunk], out:Array[NewChunk]) = {
      val rlen = in(0)._len
      val tmprow = new Array[scala.Double](in.length)
      for (row:Int <- 0 until rlen ) {
        if (af(Utils.readRow(in,row,tmprow))) {
          for (i:Int <- 0 until in.length) out(i).addNum(tmprow(i))
        }
      }
    }
  }
  mrt.doAll(f.numCols(), f)
  val result = mrt.outputFrame(f.names(), f.domains())
  apply(result) // return the DFrame
}
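The shape of that translation can be tried out without a running H2O cloud. The following sketch mimics the `map` body above on plain arrays: each chunk is modeled as one array per column, rows are read into a reusable buffer, and accepted rows are appended to per-column output buffers. All names here (`FilterTaskSketch`, `mapChunk`, the array-based `Chunk` type) are assumptions for illustration and not part of the H2O API.

```scala
import scala.collection.mutable.ArrayBuffer

object FilterTaskSketch {
  // A chunk is modeled as one array of values per column, all of equal length.
  type Chunk = Array[Array[Double]]

  // Copies one row into the reusable buffer, in the spirit of Utils.readRow on the slide.
  def readRow(in: Chunk, row: Int, buf: Array[Double]): Array[Double] = {
    var i = 0
    while (i < in.length) { buf(i) = in(i)(row); i += 1 }
    buf
  }

  // The per-chunk "map" step: keep only rows accepted by the predicate.
  def mapChunk(in: Chunk, keep: Array[Double] => Boolean): Chunk = {
    val out = Array.fill(in.length)(ArrayBuffer.empty[Double])
    val rlen = if (in.isEmpty) 0 else in(0).length
    val tmprow = new Array[Double](in.length)
    for (row <- 0 until rlen) {
      if (keep(readRow(in, row, tmprow))) {
        for (i <- in.indices) out(i) += tmprow(i)
      }
    }
    out.map(_.toArray)
  }

  // Runs mapChunk on every chunk and concatenates the outputs column by column.
  def filter(chunks: Seq[Chunk], keep: Array[Double] => Boolean): Chunk = {
    val parts = chunks.map(c => mapChunk(c, keep))
    val ncols = chunks.headOption.map(_.length).getOrElse(0)
    Array.tabulate(ncols)(c => parts.flatMap(p => p(c).toSeq).toArray)
  }
}
```

Reusing a single `tmprow` buffer per chunk follows the slide's code and avoids allocating a new array for every row.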

Party demo time!

Towards Scalding-like API
● Vision is to provide Scalding-like syntax
– Input scheme: ('name, 'cylinders)
– Output scheme: ('name, 'moreThan4)
– Transformation: the function applied to the input fields
f map ( ('name, 'cylinders) -> ('name, 'moreThan4) )
{ (n:String, c:Int) => (n, if (c>4) 1 else 0) }
● But so far the DSL is still ugly
f map (f, ('name, 'cylinders) -> ('name, 'moreThan4) )
{ new IcedFunctor2to2[Double,Int,Double,Int] {
def apply(n:Double, c:Int) = (n, if (c>4) 1 else 0) }
}
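The input scheme / output scheme / transformation idea can be sketched on plain Scala maps. This is only a toy model of the intended syntax, not ScAlH2O or Scalding code: `SchemeMapSketch`, `mapFields`, and the `Map[String, Any]` row representation are hypothetical, field names are plain strings rather than symbols, and the casts are unchecked, so the caller must pass a function whose parameter types match the stored values.

```scala
object SchemeMapSketch {
  // Hypothetical row representation: field name -> value.
  type Row = Map[String, Any]

  // Reads the two input fields, applies the transformation,
  // and emits a new row containing only the two output fields.
  def mapFields[A, B, C, D](rows: Seq[Row], in: (String, String), out: (String, String))(
      f: (A, B) => (C, D)): Seq[Row] =
    rows.map { r =>
      // Unchecked casts: the field types are an assumption the caller must honor.
      val (c, d) = f(r(in._1).asInstanceOf[A], r(in._2).asInstanceOf[B])
      Map[String, Any](out._1 -> c, out._2 -> d)
    }
}
```

A call then reads close to the vision on the slide: `mapFields(rows, ("name", "cylinders"), ("name", "moreThan4")) { (n: String, c: Int) => (n, if (c > 4) 1 else 0) }`.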

Try and contribute!

> git clone git@github.com:0xdata/h2o.git
> cd h2o
> git checkout -b h2oscala origin/h2oscala
> cd h2o-scala && ./depl.sh # or sbt compile
=== Welcome to the world of ScAlH2O ===
Type `help` or `example` to begin...
h2o>

Thank you!
