Internship final report
@Treasure Data Inc. (Aug 1 - Sep 30, 2016)
ITO Ryuichi
Outline
• Who am I?
• What I did
• About Hivemall
• Benchmark
• Add several new features
Who am I?
• ITO Ryuichi (@amaya382)
• Graduate School of Information Science and Technology,
Osaka University ('16-)
• Accelerating a graph processing engine:
concurrency control, hardware-aware optimization
• (a little) Natural language processing:
a conversation system with context consistency
❤ Scala, C#
What I did
• About Hivemall
• Benchmark
• Add several new features
About Hivemall
• A scalable machine learning library running on Apache Hive (+ Spark, Pig)
• Developed by @myui and others as OSS
• Joined the Apache Incubator 🎉
• Many features are available via HQL (Hive Query Language, similar to SQL)
• Classification
• Perceptron, AdaGradRDA, Soft Confidence Weighted, etc.
• Recommendation
• Matrix Factorization, Factorization Machines, etc.
• Utilities
• Feature engineering, additional array operations, etc.
• etc.
About Hivemall (cont.)
• How does Hivemall work on Hive?
• Hivemall is a set of UDFs (User-Defined Functions)
• UDF: projection, one entry -> one entry
• UDTF (table-generating): some entries -> some entries
• UDAF (aggregate): all entries -> one entry
• Features are defined as UDFs in Java, following the interfaces provided by Hive
• Loading the Hivemall jar file then makes the extra functions usable from HQL
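A minimal sketch of how the functions become available in a Hive session (the jar path is illustrative; the class name is the one used in Hivemall 0.4.x, and in practice all functions are registered at once with the bundled define-all.hive script):

-- make a Hivemall UDF available in this session (path is illustrative)
add jar /path/to/hivemall-with-dependencies.jar;
CREATE TEMPORARY FUNCTION logress AS 'hivemall.regression.LogressUDTF';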
About Hivemall(cont.)
• Example: training by logistic regression
• Only HQL; no need to be familiar with programming (HQL/Hive is already close to the data!)
CREATE TABLE model AS
SELECT
feature, AVG(weight) AS weight
FROM (
SELECT logress(features, label, ...)
AS (feature, weight)
FROM train_data) t
GROUP BY feature
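A hedged sketch of the matching prediction step, a common Hivemall pattern (table and column names are illustrative; it assumes the test data was already exploded into (rowid, feature, value) rows):

SELECT
  t.rowid,
  sigmoid(SUM(m.weight * t.value)) AS prob
FROM test_exploded t
LEFT OUTER JOIN model m ON (t.feature = m.feature)
GROUP BY t.rowid;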
What I did
• About Hivemall
• Benchmark
• Add several new features
Benchmark
• Based on benchm-ml (https://github.com/szilard/benchm-ml)
• Several pre-defined test cases w/ prepared data sets
1. Logistic Regression (tried)
2. Random Forest (tried)
• Several hyperparameters
3. Boosting
4. Deep Learning
• Already benchmarked with several tools (e.g. R, Python scikit-learn, Spark, etc.)
NOTE: basically a common environment is used, but some cases use different environments
For more details, see the benchm-ml project
Benchmark (cont.)
• Environment
• Amazon Web Services
• EMR (Elastic MapReduce)
• m3.xlarge*3 + c3.xlarge*3
• Hadoop: Amazon 2.7.2
• Tez: 0.8.4
• Hive: 2.1.0
• Hivemall: 0.4.2-RC2
• Misc.
• Basically six-way parallel processing, matching the number of instances
Benchmark - Logistic Regression
• Using logress() on Hivemall
• Hivemall is relatively slow and has a lower AUC (NOTE: Hivemall's logress uses SGD)
• But its scalability is confirmed
• High initial overhead caused by Hive
[Chart omitted: Time [sec] / AUC [%] per tool and data size; callouts on the chart: 10x, 12.5x, 1.3x, 4.9x, 3.9x, ±0, +0.4, and one failed run (✖)]
Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/1-linear/6-hivemall
Benchmark - Random Forest (1)
• Using train_randomforest_classifier() on Hivemall
• (1) Setting: 500 trees, three variables
• Hivemall is almost good up to 0.1M rows, but cannot process 1M
• The environment and parameters need tuning
[Chart omitted: Time [sec] / AUC [%]; one result annotated "Amazing…"]
Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/2-rf/7-hivemall
Benchmark - Random Forest (2)
• Using train_randomforest_classifier() on Hivemall
• (2) Setting: 100 trees, max depth 20
• Hivemall is good up to 1M rows, but cannot process 10M
• The environment and parameters need tuning
[Chart omitted: Time [sec] / AUC [%]]
Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/z-other-tools/10-hivemall
What I did
• About Hivemall
• Benchmark
• Add several new features ← Main topic!
Add several new features
• systemtest module
• Feature binning
• Feature selection
• Some Spark integrations
Add new features - systemtest
• What's systemtest?
• A testing framework for UDFs
• It can also be applied to other applications based on UDFs
• Don't tests already exist? Why is this needed?
• Yes, but the existing tests...
• Do not actually run on Hive; they only run as plain Java programs
• Make it difficult to write exhaustive tests
• e.g. a UDAF has several work flows depending on the kind of function, the data set, and the environment
• Make it difficult to use existing resources
• Have low extensibility, etc.
Add new features - systemtest
• Example: part of an existing test (a lot is omitted)
final SignalNoiseRatioUDAF snr = new SignalNoiseRatioUDAF();
// Useless and long initialization
final ObjectInspector[] OIs = new ObjectInspector[] {
    ObjectInspectorFactory.getStandardListObjectInspector(
        PrimitiveObjectInspectorFactory.writableDoubleObjectInspector),
    ObjectInspectorFactory.getStandardListObjectInspector(
        PrimitiveObjectInspectorFactory.writableIntObjectInspector)};
final SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator evaluator =
    (SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator) snr.getEvaluator(
        new SimpleGenericUDAFParameterInfo(OIs, false, false));
evaluator.init(GenericUDAFEvaluator.Mode.PARTIAL1, OIs);
final SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator.SignalNoiseRatioAggregationBuffer agg =
    (SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator.SignalNoiseRatioAggregationBuffer)
        evaluator.getNewAggregationBuffer();
evaluator.reset(agg);
...
// Many useless conversions
for (int i = 0; i < features.length; i++) {
    final List<IntWritable> labelList = new ArrayList<IntWritable>();
    for (int label : labels[i]) {
        labelList.add(new IntWritable(label));
    }
    evaluator.iterate(agg, new Object[] {WritableUtils.toWritableList(features[i]), labelList});
}
final List<DoubleWritable> resultObj = (List<DoubleWritable>) evaluator.terminate(agg);
...
Assert.assertArrayEquals(answer, result, 1e-5);
• And it does not actually run on Hive - it is only a logical test!!
Add new features - systemtest
• Solution
• A new module based on JUnit, HiveRunner, and td-client-java
• What can it do?
• Short and unified initialization
• Write and combine HQL
• Run on local Hive and on remote Treasure Data with the same code
• The testbed is prepared and cleaned up automatically
• Easy use of external resources, e.g. TSV files
• Queries are defined literally (as HQL), but can still be tested with a debugger
• A useful DSL
Add new features - systemtest
• How does it work?
[Diagram omitted: the user writes test code; SystemTestTeam drives implementations of the SystemTestRunner interface - HiveSystemTestRunner (backed by HiveRunner) and TDSystemTestRunner (backed by Treasure Data)]
1. Write tests based on the SystemTestRunner interface
2. Read the initialization and execute it via the SystemTestRunner implementations (based on JUnit @ClassRule): a database specialized for each test class is prepared, using external resources as needed
3. Execute the first test (based on JUnit @Rule): the DSL & HQL are rewritten for each environment, run as HQL, and the return values are checked
4. Reset the testbeds (based on JUnit @Rule): temporary tables are dropped
5. Execute the second test
6. Reset the testbeds
…repeat for all tests
7. Finalize the test (based on JUnit @ClassRule): drop the temporary database and disconnect
Add new features - systemtest
• Example: initialization (no omission!)
// Common initialization with external data (shared by both runners)
private static SystemTestCommonInfo ci = new SystemTestCommonInfo(HogeTest.class);
private static HQBase createIrisTable = HQ.uploadByResourcePathAsNewTable(
    "iris0", ci.initDir + "iris0.csv",
    new LinkedHashMap<String, String>() {{
        put("a", "double");
        put("b", "double");
        put("c", "double");
        put("d", "double");
        put("c0", "int");
        put("c1", "int");
        put("c2", "int");}});

// Testbed-specific initialization
@ClassRule
public static HiveSystemTestRunner hRunner = new HiveSystemTestRunner(ci) {{
    initBy(createIrisTable);
    initBy(HQ.fromStatements("" +
        "CREATE TEMPORARY FUNCTION transpose_and_dot as 'hivemall.tools.matrix.TransposeAndDotUDAF';" +
        "CREATE TEMPORARY FUNCTION array_sum as 'hivemall.tools.array.ArraySumUDAF';" +
        "CREATE TEMPORARY FUNCTION array_avg as 'hivemall.tools.array.ArrayAvgGenericUDAF';" +
        "CREATE TEMPORARY FUNCTION chi2 as 'hivemall.ftvec.selection.ChiSquareUDF';" +
        "CREATE TEMPORARY FUNCTION snr as 'hivemall.ftvec.selection.SignalNoiseRatioUDAF';"));}};

@ClassRule
public static TDSystemTestRunner tRunner = new TDSystemTestRunner(ci) {{
    initBy(createIrisTable);}};

// Set common runner
@Rule
public SystemTestTeam team = new SystemTestTeam(hRunner);
Add new features - systemtest
• Example: test cases (1) (no omission!)
• Tests run on clean testbeds, using the database created by the initialization
// Run on HiveRunner only
@Test
public void snr() throws Exception {
    team.set(HQ.fromStatement("" +
        "WITH iris AS (" +
        " SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0)" +
        "SELECT snr(X, Y)" +
        "FROM iris"), "$ANSWER");
    team.run();
}

// Run on HiveRunner and Treasure Data
@Test
public void chi2() throws Exception {
    team.add(tRunner);
    team.set(HQ.fromStatement("" +
        "WITH iris AS (" +
        " SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0)," +
        "stats AS (" +
        " SELECT" +
        " transpose_and_dot(Y, X) AS observed," +
        " array_sum(X) AS feature_count," +
        " array_avg(Y) AS class_prob" +
        " FROM" +
        " iris)," +
        "test AS (" +
        " SELECT" +
        " transpose_and_dot(class_prob, feature_count) AS expected" +
        " FROM" +
        " stats)" +
        "SELECT" +
        " chi2(observed, expected) AS x " +
        "FROM" +
        " test JOIN stats"), "$ANSWER");
    team.run();
}
Add new features - systemtest
• Example: test cases (2) (no omission!)
@Test
public void someTest0() throws Exception {
    final String tableName = "color";
    // Test-specific initialization (it can also be chained)
    team.initBy(HQ.uploadByResourcePathAsNewTable(
        tableName, ci.initDir + "color.tsv",
        new LinkedHashMap<String, String>() {{
            put("name", "string");
            put("red", "int");
            put("green", "int");
            put("blue", "int");}}));
    team.set(HQ.fromStatement("" +
        "SELECT CONCAT('rgb(', red, ',', green, ',', blue, ')') FROM " +
        tableName +
        " u LEFT JOIN color c on u.favorite_color = c.name"),
        "rgb(255,165,0)\trgb(255,192,203)");
    team.run();
}

@Test
public void someTest1() throws Exception {
    // Use HQL and answers written in external files
    team.set(HQ.autoMatchingByFileName("hoge"), ci);
    team.run();
}
Add new features - systemtest
• More details?
• https://github.com/myui/hivemall/issues/323
• https://github.com/myui/hivemall/pull/336
• And systemtest/README.md
Add new features - feature binning
• What’s feature binning?
• A method to divide quantitative variables
into meaningful categorical variables
Add new features - feature binning
• How does it work?
• [UDAF] build_bins(weight, num_of_bins[, auto_shrink])
• [UDF] feature_binning(features, quantiles_map) / (weight, quantiles)
[Diagram omitted: build_bins produces the quantiles that feature_binning consumes]
Add new features - feature binning
• [UDAF] build_bins(weight, num_of_bins[, auto_shrink])
• Uses percentile internally, making all bins uniform (equal-frequency)
Add new features - feature binning
• [UDAF] build_bins(weight, num_of_bins[, auto_shrink])
• What's auto_shrink?
• A small or skewed data set sometimes produces void (empty) bins, which causes an exception
• With auto_shrink enabled, such void bins are shrunk away automatically instead
[Figure omitted: example of void bins produced by a small/skewed data set]
Add new features - feature binning
• [UDF] feature_binning(features, quantiles_map) / (weight, quantiles)
• Distributes variables into bins by their value (see the usage sketch below)
• Example: Age:17 - since 17 is between -Infinity and 18.0, it falls into bin 0
[Figure omitted: bins 0, 1, 2 and their quantile boundaries]
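A minimal usage sketch combining the two functions (the users table and its age column are hypothetical; signatures follow the slides, with the scalar feature_binning(weight, quantiles) overload returning the bin index):

WITH bins AS (
  -- build 3 equal-frequency bins over age
  SELECT build_bins(age, 3) AS quantiles
  FROM users
)
SELECT
  u.age,
  feature_binning(u.age, b.quantiles) AS age_bin
FROM users u
CROSS JOIN bins b;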
Add new features - feature binning
• More details?
• https://github.com/myui/hivemall/issues/319
• https://github.com/myui/hivemall/pull/322
Add new features - feature selection
• What's feature selection?
• A generic term for methods that select meaningful features
• Used as preprocessing for machine learning
• Why is it used?
• To enhance results
• To shorten learning time
• To make a set of features human-understandable
Add new features - feature selection
• Kinds of feature selection
• Using variance
• Using the Chi-square value (implemented)
• Using SNR (Signal-to-Noise Ratio) (implemented)
• mRMR (minimum Redundancy Maximum Relevance)
• etc.
Add new features - feature selection (Chi-square)
• Feature selection using the Chi-square value
• Calculating the Chi-square value needs both observed values and expected values (= the hypothesis)
• Observed: features aggregated per class
• Expected: expected values calculated under the assumption that each feature and each class are independent
• Calculate the Chi-square value
• Select the top-k features
Add new features - feature selection (Chi-square)
• How does it work on Hivemall?
• [UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>
• [UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>
• [UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double>
Add new features - feature selection (Chi-square)
• [UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>
• A utility for matrix calculation; a generic UDAF
• Computes the matrix product Xᵀ·Y, where X and Y are formed by stacking the per-row arrays (see the sketch below)
• Matrix multiplication may look like it requires repeated passes over the data, but it is calculated incrementally!
[Figure omitted: Xᵀ·Y]
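For instance, the observed class-by-feature matrix used in the chi2 test case earlier is obtained like this (iris0 columns as defined in the initialization example above):

WITH iris AS (
  SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0
)
-- per class (elements of Y), sum the feature values of X: here a 3x4 matrix
SELECT transpose_and_dot(Y, X) AS observed
FROM iris;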
Add new features - feature selection (Chi-square)
• [UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>
• Calculates the Chi-square value and the p-value (formula below)
• The p-value is derived from the Chi-square value and the Chi-square distribution
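The Chi-square statistic referred to above is the standard one, computed per feature from the observed and expected class counts:

\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}

where O_i are the observed values and E_i the expected values; the p-value then follows from the Chi-square distribution.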
Add new features - feature selection (Chi-square)
• [UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double>
• Selects the top-k elements of X according to importance_list (see the sketch below)
• A generic UDF
NOTE: the current implementation expects every importance_list and every k to be identical across rows
[Figure omitted: example with k = 2]
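A small illustrative call (the literal arrays are made-up numbers; the importance values play the role of the chi2 scores computed above; Hive 0.13+ allows SELECT without FROM):

-- keep the 2 elements of X whose importance values are highest (here the 3rd and 4th)
SELECT select_k_best(array(5.1, 3.5, 1.4, 0.2), array(10, 2, 57, 29), 2) AS selected;
-- expected: the two elements with the highest importance, i.e. [1.4, 0.2]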
Add new features - feature selection (SNR)
• Feature selection using SNR
• Aggregate the mean and variance of each feature for each class
• On termination, calculate the Signal-to-Noise Ratio between all pairs of classes, for each feature
• Sum up the Signal-to-Noise Ratios per feature
Add new features - feature selection (SNR)
• How does it work on Hivemall?
• [UDAF] snr(X::array<number>, label::array<int>)::array<double>
Add new features - feature selection (SNR)
• [UDAF] snr(X::array<number>, label::array<int>)::array<double>
• Aggregates the variance using Chan's method
• Calculates the Signal-to-Noise Ratios and sums them up per feature (see the sketch below)
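Usage mirrors the snr() test case shown earlier (iris0 columns as in the initialization example; the second argument is a one-hot class encoding):

WITH iris AS (
  SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0
)
-- one SNR score per feature; larger means more discriminative
SELECT snr(X, Y) AS snr_per_feature
FROM iris;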
Add new features - feature selection
• More details?
• https://github.com/myui/hivemall/issues/338
• https://github.com/myui/hivemall/pull/352
Add new features - spark integration
• Integrated feature selection into the Spark module
• Improved the build flow to resolve binary incompatibility between spark-1.6 and spark-2.0
Thank you for listening!
Any questions?
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
 
Introduction to Ansible - Jan 28 - Austin MeetUp
Introduction to Ansible - Jan 28 - Austin MeetUpIntroduction to Ansible - Jan 28 - Austin MeetUp
Introduction to Ansible - Jan 28 - Austin MeetUp
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted Cloud
 
Storm distributed processing
Storm distributed processingStorm distributed processing
Storm distributed processing
 
Open Platform for AI & ML modeling
Open Platform for AI & ML modelingOpen Platform for AI & ML modeling
Open Platform for AI & ML modeling
 
Getting started with Riak in the Cloud
Getting started with Riak in the CloudGetting started with Riak in the Cloud
Getting started with Riak in the Cloud
 
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamMalo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
YARN
YARNYARN
YARN
 
Drupal performance
Drupal performanceDrupal performance
Drupal performance
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 

More from Ryuichi ITO

scala.collection 再入門 (改)
scala.collection 再入門 (改)scala.collection 再入門 (改)
scala.collection 再入門 (改)Ryuichi ITO
 
ゼロから始めるScala文法
ゼロから始めるScala文法ゼロから始めるScala文法
ゼロから始めるScala文法Ryuichi ITO
 
ゼロから始めるScalaプロジェクト
ゼロから始めるScalaプロジェクトゼロから始めるScalaプロジェクト
ゼロから始めるScalaプロジェクトRyuichi ITO
 
サクサクアンドロイド
サクサクアンドロイドサクサクアンドロイド
サクサクアンドロイドRyuichi ITO
 

More from Ryuichi ITO (7)

scala.collection 再入門 (改)
scala.collection 再入門 (改)scala.collection 再入門 (改)
scala.collection 再入門 (改)
 
ゼロから始めるScala文法
ゼロから始めるScala文法ゼロから始めるScala文法
ゼロから始めるScala文法
 
ゼロから始めるScalaプロジェクト
ゼロから始めるScalaプロジェクトゼロから始めるScalaプロジェクト
ゼロから始めるScalaプロジェクト
 
OUCC LT会2
OUCC LT会2OUCC LT会2
OUCC LT会2
 
サクサクアンドロイド
サクサクアンドロイドサクサクアンドロイド
サクサクアンドロイド
 
getstartedc#_2
getstartedc#_2getstartedc#_2
getstartedc#_2
 
getstartedc#_1
getstartedc#_1getstartedc#_1
getstartedc#_1
 

Recently uploaded

Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 

Recently uploaded (20)

Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 

Internship final report@Treasure Data Inc.

  • 1. Internship final report @Treasure Data Inc. (2016 8/1-9/30) ITO Ryuichi
  • 2. Outline • Who am I? • What I did? • About Hivemall • Benchmark • Add several new features
  • 4. Who am I? • ITO Ryuichi(@amaya382) • Graduate School of Information Science and Technology, Osaka University(’16-) • Accelerating graph processing engine:
 concurrency control, hardware-aware optimization • (a little) Natural language processing:
 conversation system with context consistency ❤ Scala, C#
  • 6. What I did? • About Hivemall • Benchmark • Add several new features
  • 7. About Hivemall • A scalable machine learning library running on Apache Hive(+Spark, Pig) • Developed by @myui and others as an OSS • Joined Apache Incubator 🎉 • Can use many features via HQL(Hive Query Language, like SQL) • Classification • Perceptron, AdaGradRDA, Soft Confidence Weighted, etc. • Recommendation • Matrix Factorisation, Factorisation Machine, etc. • Utilities • Feature engineering, Additional array operations, etc. • etc.
  • 11. About Hivemall(cont.) • How does Hivemall work on Hive? • Hivemall is a set of UDFs(User-Defined Functions) • UDF: projection, one entry -> one entry • UDTF(Table-generating): some entries -> some entries • UDAF(Aggregate): all entries -> one entry • Define features as UDFs following interfaces in Java prepared by Hive • And by loading Hivemall jar file, enable to use extra functions in HQL
  • 12. About Hivemall(cont.) • Example: Training by logistic regression • Only HQL, no need to be familiar with programming. (Already, HQL(Hive) is close to data!) CREATE TABLE model AS SELECT feature, AVG(weight) AS weight FROM ( SELECT logress(features, label, ...) AS (feature, weight) FROM train_data) t GROUP BY feature
  • 13. What I did? • About Hivemall • Benchmark • Add several new features
  • 14-15. Benchmark • Based on benchm-ml (https://github.com/szilard/benchm-ml) • Several pre-defined test cases w/ a prepared data set 1. Logistic Regression (tried) 2. Random Forest (tried) • Several hyper parameters 3. Boosting 4. Deep Learning • Already tested with several tools (e.g. R, Python-sklearn, Spark, etc.) NOTE: basically a common environment is used, but some cases use different environments. For more details, see the benchm-ml project
  • 16. Benchmark(cont.) • Environment • Amazon Web Service • EMR(Elastic MapReduce) • m3.xlarge*3 + c3.xlarge*3 • Hadoop: Amazon 2.7.2 • Tez: 0.8.4 • Hive: 2.1.0 • Hivemall: 0.4.2-RC2 • Misc. • Basically, six-way parallel processing is used, matching the number of instances
  • 17-26. Benchmark - Logistic Regression • Using logress() on Hivemall • Hivemall is relatively slow and gives a lower AUC (NOTE: Hivemall's logress uses SGD) • But its scalability is confirmed • The chart overlays on these slides (✖, 10x, 12.5x, 1.3x, 4.9x, 3.9x, ±0, +0.4) compare training time and AUC against the other tools; a large part of the time gap is the high initial overhead caused by Hive • Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/1-linear/6-hivemall (Time[sec] / AUC[%])
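To make the measurement concrete, the sketch below shows one way a model trained with logress (as in the earlier training query) can score test rows so that AUC can be computed; the table layouts model(feature, weight) and test_exploded(rowid, feature, value) are assumptions for illustration, not the exact tables used in the benchmark:

 -- score each test example with the learned weights (assumed table names)
 SELECT
   t.rowid,
   1.0 / (1.0 + exp(-SUM(m.weight * t.value))) AS prob
 FROM
   test_exploded t
   LEFT OUTER JOIN model m ON (t.feature = m.feature)
 GROUP BY
   t.rowid;

The predicted probabilities are then compared against the true labels to obtain the AUC.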
  • 27-29. Benchmark - Random Forest(1) • Using train_randomforest_classifier() on Hivemall • (1) Regulation: 500 trees, three variables • Hivemall does reasonably well up to 0.1M rows, but cannot process 1M rows, which is a surprisingly bad result • Environment and parameters need tuning • Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/2-rf/7-hivemall (Time[sec] / AUC[%])
  • 30-31. Benchmark - Random Forest(2) • Using train_randomforest_classifier() on Hivemall • (2) Regulation: 100 trees, max depth 20 • Hivemall is good up to 1M rows, but cannot process 10M rows • Environment and parameters need tuning • Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/z-other-tools/10-hivemall (Time[sec] / AUC[%])
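For reference, a minimal sketch of the training call shape used in these cases follows; '<options>' is only a placeholder for the option string that controls the number of trees, candidate variables and maximum depth (the exact flag names are not shown on the slides, so check the Hivemall documentation):

 -- train a Random Forest model with the UDTF named above (option string is a placeholder)
 CREATE TABLE rf_model AS
 SELECT
   train_randomforest_classifier(features, label, '<options>')
 FROM
   train_data;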
  • 32-33. What I did? • About Hivemall • Benchmark • Add several new features (Main topic!)
  • 34. Add several new features • systemtest module • Feature binning • Feature selection • Some spark integrations
  • 35. Add new features - systemtest • What's systemtest? • A testing framework for UDFs • It can also be applied to other applications built on UDFs • Tests already exist, so why is it needed? • Because the existing tests... • Do not actually run on Hive; they only run as plain Java programs • Make it hard to write exhaustive tests • e.g. a UDAF has several workflows depending on the kind of function, the data set and the environment • Make it hard to reuse existing resources • Have low extensibility, etc.
  • 36-40. Add new features - systemtest • Example: a part of an existing test (a lot is omitted)
 final SignalNoiseRatioUDAF snr = new SignalNoiseRatioUDAF();
 final ObjectInspector[] OIs = new ObjectInspector[] {
     ObjectInspectorFactory.getStandardListObjectInspector(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector),
     ObjectInspectorFactory.getStandardListObjectInspector(PrimitiveObjectInspectorFactory.writableIntObjectInspector)};
 final SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator evaluator =
     (SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator) snr.getEvaluator(
         new SimpleGenericUDAFParameterInfo(OIs, false, false));
 evaluator.init(GenericUDAFEvaluator.Mode.PARTIAL1, OIs);
 final SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator.SignalNoiseRatioAggregationBuffer agg =
     (SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator.SignalNoiseRatioAggregationBuffer) evaluator.getNewAggregationBuffer();
 evaluator.reset(agg);
 ...
 for (int i = 0; i < features.length; i++) {
     final List<IntWritable> labelList = new ArrayList<IntWritable>();
     for (int label : labels[i]) {
         labelList.add(new IntWritable(label));
     }
     evaluator.iterate(agg, new Object[] {WritableUtils.toWritableList(features[i]), labelList});
 }
 final List<DoubleWritable> resultObj = (List<DoubleWritable>) evaluator.terminate(agg);
 ...
 Assert.assertArrayEquals(answer, result, 1e-5);
 Problems pointed out on the slides: a pointlessly long initialization, many pointless conversions, and the test does not actually run on Hive -- it is only a logical test.
  • 41. Add new features - systemtest • Solution • A new module based on JUnit, HiveRunner and td-client-java • What can it do? • Short and unified initialization • Write and combine HQL • Run against local Hive and remote Treasure Data with the same code • The testbed is prepared and cleaned up automatically • Easy use of external resources, e.g. TSV files • Queries are defined as HQL literals, yet tests can be stepped through with a debugger • A useful DSL
  • 42. Add new features - systemtest(1) • How does it work? SystemTestRunner TDSystemTestRunner Treasure Data HiveRunner Test code User 1.Write tests based on SystemTestRunner interface ImplementationInterface SystemTestTeam HiveSystemTestRunner
  • 43. Add new features - systemtest(2) • How does it work? SystemTestRunner TDSystemTestRunner Treasure Data HiveRunner Test code User ImplementationInterface SystemTestTeam HiveSystemTestRunner 2. Read initialization and execute via impls of SystemTestRunner It works based on JUnit @ClassRule Prepare database specialized for each test class Use external resources depending on needs
  • 44. Add new features - systemtest(3) • How does it work? SystemTestRunner TDSystemTestRunner Treasure Data HiveRunner Test code User ImplementationInterface SystemTestTeam HiveSystemTestRunner 3. Execute first test It works based on JUnit @Rule Run as HQL, and check return values Rewrite DSL & HQL for each env
  • 45. Add new features - systemtest(4) • How does it work? SystemTestRunner TDSystemTestRunner Treasure Data HiveRunner Test code User ImplementationInterface SystemTestTeam HiveSystemTestRunner 4. Reset testbeds It works based on JUnit @Rule Drop temporary tables
  • 46. Add new features - systemtest(5,6…) • How does it work? SystemTestRunner TDSystemTestRunner Treasure Data HiveRunner Test code User ImplementationInterface SystemTestTeam HiveSystemTestRunner 5. Execute second test 6. Reset testbeds …repeat all tests It works based on JUnit @Rule
  • 47. Add new features - systemtest(7) • How does it work? SystemTestRunner TDSystemTestRunner Treasure Data HiveRunner Test code User ImplementationInterface SystemTestTeam HiveSystemTestRunner 7. Finalize test Drop temporary database and disconnect It works based on JUnit @ClassRule
  • 48-53. Add new features - systemtest • Example: initialization (no omission!)
 private static SystemTestCommonInfo ci = new SystemTestCommonInfo(HogeTest.class);
 // Common initialization with external data
 private static HQBase createIrisTable = HQ.uploadByResourcePathAsNewTable(
     "iris0", ci.initDir + "iris0.csv",
     new LinkedHashMap<String, String>() {{
         put("a", "double");
         put("b", "double");
         put("c", "double");
         put("d", "double");
         put("c0", "int");
         put("c1", "int");
         put("c2", "int");}});
 @ClassRule
 public static HiveSystemTestRunner hRunner = new HiveSystemTestRunner(ci) {{
     initBy(createIrisTable);
     // Testbed-specific initialization
     initBy(HQ.fromStatements("" +
         "CREATE TEMPORARY FUNCTION transpose_and_dot as 'hivemall.tools.matrix.TransposeAndDotUDAF';" +
         "CREATE TEMPORARY FUNCTION array_sum as 'hivemall.tools.array.ArraySumUDAF';" +
         "CREATE TEMPORARY FUNCTION array_avg as 'hivemall.tools.array.ArrayAvgGenericUDAF';" +
         "CREATE TEMPORARY FUNCTION chi2 as 'hivemall.ftvec.selection.ChiSquareUDF';" +
         "CREATE TEMPORARY FUNCTION snr as 'hivemall.ftvec.selection.SignalNoiseRatioUDAF';"));}};
 @ClassRule
 public static TDSystemTestRunner tRunner = new TDSystemTestRunner(ci) {{
     initBy(createIrisTable);}};
 // Set the common runner
 @Rule
 public SystemTestTeam team = new SystemTestTeam(hRunner);
  • 54-59. Add new features - systemtest • Example: test cases(1) (no omission!) • Tests run on clean testbeds using the database created during initialization
 @Test
 public void snr() throws Exception {
     // Runs on HiveRunner only
     team.set(HQ.fromStatement("" +
         "WITH iris AS (" +
         " SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0)" +
         "SELECT snr(X, Y)" +
         "FROM iris"), "$ANSWER");
     team.run();
 }
 @Test
 public void chi2() throws Exception {
     // Runs on both HiveRunner and Treasure Data
     team.add(tRunner);
     team.set(HQ.fromStatement("" +
         "WITH iris AS (" +
         " SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0)," +
         "stats AS (" +
         " SELECT" +
         " transpose_and_dot(Y, X) AS observed," +
         " array_sum(X) AS feature_count," +
         " array_avg(Y) AS class_prob" +
         " FROM" +
         " iris)," +
         "test AS (" +
         " SELECT" +
         " transpose_and_dot(class_prob, feature_count) AS expected" +
         " FROM" +
         " stats)" +
         "SELECT" +
         " chi2(observed, expected) AS x " +
         "FROM" +
         " test JOIN stats"), "$ANSWER");
     team.run();
 }
  • 60-63. Add new features - systemtest • Example: test cases(2) (no omission!)
 @Test
 public void someTest0() throws Exception {
     final String tableName = "color";
     // Test-specific initialization (it can also be chained)
     team.initBy(HQ.uploadByResourcePathAsNewTable(
         tableName, ci.initDir + "color.tsv",
         new LinkedHashMap<String, String>() {{
             put("name", "string");
             put("red", "int");
             put("green", "int");
             put("blue", "int");}}));
     team.set(HQ.fromStatement("" +
         "SELECT CONCAT('rgb(', red, ',', green, ',', blue, ')') FROM " +
         tableName +
         " u LEFT JOIN color c on u.favorite_color = c.name"),
         "rgb(255,165,0)\trgb(255,192,203)");
     team.run();
 }
 @Test
 public void someTest1() throws Exception {
     // Uses HQL and expected answers written in external files
     team.set(HQ.autoMatchingByFileName("hoge"), ci);
     team.run();
 }
  • 64. Add new features - systemtest • More details? • https://github.com/myui/hivemall/issues/323 • https://github.com/myui/hivemall/pull/336 • And systemtest/README.md
  • 65. Add new features - feature binning • What’s feature binning? • A method to divide quantitative variables into meaningful categorical variables
  • 66. Add new features - feature binning • How does it work? • [UDAF] build_bins(weight, num_of_bins[, auto_shrink]) • [UDF] feature_binning(features, quantiles_map) / feature_binning(weight, quantiles) • build_bins builds the quantiles (bin boundaries) that feature_binning then uses
  • 67. Add new features - feature binning • [UDAF] build_bins(weight, num_of_bins[, auto_shrink]) • Uses percentiles internally, so every bin covers a uniform share of the data
  • 68-71. Add new features - feature binning • [UDAF] build_bins(weight, num_of_bins[, auto_shrink]) • What's auto_shrink? • A small or skewed data set can sometimes produce void (empty) bins • Without auto_shrink this raises an exception; with auto_shrink enabled the void bins are shrunk away automatically
  • 72-76. Add new features - feature binning • [UDF] feature_binning(features, quantiles_map) / (weight, quantiles) • Distributes variables into bins by their values • Example from the slides: with bins 0, 1 and 2, Age:17 falls between -Infinity and 18.0, so it goes into bin 0
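A minimal usage sketch of the two functions above, using the (weight, quantiles) overload; the table name users and the column age are assumptions for illustration only:

 -- build the bin boundaries once, then assign each value to its bin index
 WITH quantiles AS (
   SELECT build_bins(age, 3) AS q
   FROM users
 )
 SELECT
   u.age,
   feature_binning(u.age, q.q) AS age_bin
 FROM
   users u
   CROSS JOIN quantiles q;

build_bins aggregates the whole column into three bin boundaries, and feature_binning then maps each age to its bin index.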
  • 77. Add new features - feature binning • More details? • https://github.com/myui/hivemall/issues/319 • https://github.com/myui/hivemall/pull/322
  • 78. Add new features - feature selection • What's feature selection? • A generic term for methods that select meaningful features • Used as a preprocessing step for machine learning • Why is it used? • To enhance results • To shorten learning time • To make the set of features human-understandable
  • 79-80. Add new features - feature selection • Kinds of feature selection • Using variance • Using the Chi-square value (implemented) • Using the SNR (Signal Noise Ratio) (implemented) • mRMR (minimum Redundancy Maximum Relevance) • etc.
  • 81. Add new features - feature selection • Feature selection using the Chi-square value • To calculate the Chi-square value, both observed values and expected values (= the hypothesis) are needed • Observed: the features aggregated per class • Expected: the values expected under the assumption that features and classes are independent • Calculate the Chi-square value • Select the top-k features Chi-square
  • 82. Add new features - feature selection • How does it work on Hivemall? • [UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>> • [UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>> • [UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double> Chi-square
  • 83-87. Add new features - feature selection • [UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>> • A utility for matrix calculation; a generic UDAF • Computes X^T · Y over the aggregated rows • Matrix multiplication may look like it requires keeping all rows around, but it is calculated incrementally inside the aggregation Chi-square
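As a small worked note on why this yields the observed matrix: if each row i contributes a one-hot class indicator vector y_i and a feature vector x_i, then

 \left(Y^{\top} X\right)_{kj} = \sum_{i} y_{ik}\, x_{ij} = \sum_{i \,:\, \mathrm{class}(i) = k} x_{ij}

i.e. entry (k, j) is the sum of feature j over all rows of class k, which is exactly the "features aggregated per class" used as the observed values in the chi2 query shown earlier.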
  • 88. Add new features - feature selection • [UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>> • Calculates the Chi-square value and the p-value • For each feature j, \chi^2_j = \sum_{k} (observed_{kj} - expected_{kj})^2 / expected_{kj} • The p-value is then obtained from this value and the Chi-square distribution Chi-square
  • 89-90. Add new features - feature selection • [UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double> • Selects the top-k elements of X according to importance_list • A generic UDF • NOTE: the current implementation expects importance_list and k to be identical across all rows (k = 2 in the slide example) Chi-square
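A minimal usage sketch of select_k_best on the iris0 table used earlier; the literal importance list below is purely illustrative (in practice it would come from the chi2 or snr ranking):

 -- keep the two features with the highest importance values (here c and d)
 SELECT
   select_k_best(array(a, b, c, d), array(10, 1, 50, 30), 2) AS selected
 FROM
   iris0;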
  • 91. Add new features - feature selection • Feature selection using the SNR • Aggregates the mean and variance of each feature for each class • On termination, calculates the Signal Noise Ratio between every pair of classes, for each feature • Sums up the Signal Noise Ratios per feature Signal Noise Ratio
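Assuming the SNR here is the definition commonly used for feature selection (the slides do not spell it out), for feature j and a class pair (a, b) it would be

 \mathrm{SNR}_j(a, b) = \frac{\lvert \mu_{j,a} - \mu_{j,b} \rvert}{\sigma_{j,a} + \sigma_{j,b}}

where \mu and \sigma are the per-class mean and standard deviation of feature j; the per-feature score is the sum of this quantity over all class pairs.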
  • 92. Add new features - feature selection • How does it work on Hivemall? • [UDAF] snr(X::array<number>, label::array<int>)::array<double> Signal Noise Ratio
  • 93. Add new features - feature selection • [UDAF] snr(X::array<number>, label::array<int>)::array<double> • Aggregates the variance with Chan's method • Calculates the Signal Noise Ratios and sums them up per feature Signal Noise Ratio
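For reference, Chan's method combines the statistics of two partial aggregates A and B (sizes n_A and n_B, means \mu_A and \mu_B, sums of squared deviations M_{2,A} and M_{2,B}) without revisiting the rows:

 \delta = \mu_B - \mu_A, \quad
 \mu_{AB} = \mu_A + \delta \cdot \frac{n_B}{n_A + n_B}, \quad
 M_{2,AB} = M_{2,A} + M_{2,B} + \delta^2 \cdot \frac{n_A \, n_B}{n_A + n_B}

which is what allows the UDAF to merge partial results computed on different parts of the data; the variance is then M_2 divided by the (possibly bias-corrected) count.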
  • 94. Add new features - feature selection • More details? • https://github.com/myui/hivemall/issues/338 • https://github.com/myui/hivemall/pull/352
  • 95. Add new features - spark integration • Integrated feature selection into the spark module • Improved the build flow to resolve the binary incompatibility between spark-1.6 and spark-2.0
  • 96. Thank you for listening!
  • 97. Thank you for listening! Any questions?