Internship final report@Treasure Data Inc.

Internship final report
@Treasure Data Inc. (2016 8/1-9/30)
ITO Ryuichi

Outline
• Who am I?
• What did I do?
  • About Hivemall
  • Benchmark
  • Add several new features

Who am I?
• ITO Ryuichi (@amaya382)
• Graduate School of Information Science and Technology, Osaka University (’16-)
  • Accelerating a graph processing engine: concurrency control, hardware-aware optimization
  • (a little) Natural language processing: conversation system with context consistency
❤ Scala, C#

What did I do?
• About Hivemall
• Benchmark

About Hivemall
• A scalable machine learning library running on Apache Hive (+ Spark, Pig)
• Developed by @myui and others as OSS
• Joined Apache Incubator 🎉
• Many features are available via HQL (Hive Query Language, similar to SQL)
  • Classification
    • Perceptron, AdaGradRDA, Soft Confidence Weighted, etc.
  • Recommendation
    • Matrix Factorisation, Factorisation Machine, etc.
  • Utilities
    • Feature engineering, additional array operations, etc.
  • etc.

About Hivemall(cont.)
• How does Hivemall work on Hive?
  • Hivemall is a set of UDFs (User-Defined Functions)
    • UDF: projection, one entry -> one entry
    • UDTF (Table-generating): some entries -> some entries
    • UDAF (Aggregate): all entries -> one entry
  • Features are defined as UDFs that implement the Java interfaces provided by Hive
  • Loading the Hivemall JAR file makes these extra functions available in HQL
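
The three function shapes above can be illustrated with a plain-Java analogy. This is only a sketch: it uses no Hive classes, and the names UdfShapesSketch, square, explodeWords and average are hypothetical, chosen for illustration rather than taken from Hive or Hivemall.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy of the three function shapes; no Hive interfaces are involved.
final class UdfShapesSketch {
    // UDF-like: one entry in, one entry out.
    static double square(double x) {
        return x * x;
    }

    // UDTF-like: each input entry may produce zero or more output entries.
    static List<String> explodeWords(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            for (String w : line.trim().split("\\s+")) {
                if (!w.isEmpty()) {
                    out.add(w);
                }
            }
        }
        return out;
    }

    // UDAF-like: all entries are folded into a single entry.
    static double average(List<Double> xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return xs.isEmpty() ? 0.0 : sum / xs.size();
    }

    public static void main(String[] args) {
        System.out.println(square(3.0));
        System.out.println(explodeWords(Arrays.asList("a b", "c")));
        System.out.println(average(Arrays.asList(1.0, 2.0, 3.0)));
    }
}
```

A real Hive UDF, UDTF or UDAF additionally has to follow Hive's own interfaces and object inspectors, which this sketch deliberately leaves out.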

About Hivemall(cont.)
• Example: Training by logistic regression
  • HQL only; no need to be familiar with programming (and HQL on Hive is already close to the data!)
CREATE TABLE model AS
SELECT
  feature, AVG(weight) AS weight
FROM (
  SELECT logress(features, label, ...)
    AS (feature, weight)
  FROM train_data) t
GROUP BY feature
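
What the outer AVG(weight) ... GROUP BY feature does can be sketched in plain Java, assuming the inner query emits several (feature, weight) rows for the same feature, for example one per parallel learner. The class and method names below are hypothetical, and the sketch shows only the averaging step, not the training itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the averaging step only: rows of (feature, weight) are grouped by
// feature and their weights are averaged into one model weight per feature.
final class ModelAveragingSketch {
    static Map<String, Double> averageWeights(String[] features, double[] weights) {
        // feature -> {sum of weights, number of rows}
        Map<String, double[]> acc = new LinkedHashMap<>();
        for (int i = 0; i < features.length; i++) {
            double[] sumAndCount = acc.computeIfAbsent(features[i], k -> new double[2]);
            sumAndCount[0] += weights[i];
            sumAndCount[1] += 1.0;
        }
        Map<String, Double> model = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : acc.entrySet()) {
            model.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return model;
    }

    public static void main(String[] args) {
        String[] features = {"age", "age", "height"};
        double[] weights = {0.5, 1.5, 2.0};
        System.out.println(averageWeights(features, weights));
    }
}
```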

Benchmark
• Based on benchm-ml (https://github.com/szilard/benchm-ml)
• Several pre-defined test cases with prepared data sets
  1. Logistic Regression (tried)
  2. Random Forest (tried)
    • Several hyperparameters
  3. Boosting
  4. Deep Learning
• Already tested with several tools (e.g. R, Python-sklearn, Spark, etc.)
NOTE: A common environment is used in principle, but some cases use different environments.
For more details, see the benchm-ml project.

Benchmark(cont.)
• Environment
  • Amazon Web Services
    • EMR (Elastic MapReduce)
    • m3.xlarge*3 + c3.xlarge*3
  • Hadoop: Amazon 2.7.2
  • Tez: 0.8.4
  • Hive: 2.1.0
  • Hivemall: 0.4.2-RC2
• Misc.
  • Basically six-way parallel processing, matching the number of instances

Benchmark - Logistic Regression
• Using logress() on Hivemall
• Hivemall is relatively slow and reaches a lower AUC (NOTE: Hivemall’s logress uses SGD)
• But we can be confident in its scalability
Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/1-linear/6-hivemall
(Time[sec] / AUC[%])

[Figure: results table (Time[sec] / AUC[%]) with ✖ marks and comparison annotations 10x, 12.5x, 1.3x, 4.9x, 3.9x, ±0, +0.4; callout: "High initial overhead caused by Hive"]

Benchmark - Random Forest(1)
• Using train_randomforest_classifier() on Hivemall
• Setting (1): 500 trees, three variables
• Hivemall does reasonably well up to 0.1M, but cannot process 1M
• The environment and parameters need tuning
Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/2-rf/7-hivemall

Benchmark - Random Forest(2)
• Setting (2): 100 trees, max depth 20
• Hivemall does well up to 1M, but cannot process 10M
Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/z-other-tools/10-hivemall

What did I do?
• About Hivemall
• Benchmark
• Add several new features (main topic!)

Add several new features
• systemtest module
• Feature binning
• Feature selection
• Some Spark integration

Add new features - systemtest
• What’s systemtest?
  • A testing framework for UDFs
  • Can also be applied to other UDF-based applications
• Tests already exist, don't they? Why is this needed?
  • Yes, but the existing tests...
    • Do not actually run on Hive; they only run as plain Java programs
    • Make it difficult to write comprehensive tests
      • e.g. a UDAF has several workflows depending on the kind of function, the data set and the environment
    • Make it difficult to use existing resources
    • Have low extensibility, etc.

• Example: a part of an existing test
final SignalNoiseRatioUDAF snr = new SignalNoiseRatioUDAF();
final ObjectInspector[] OIs = new ObjectInspector[] {
    ObjectInspectorFactory.getStandardListObjectInspector(
        PrimitiveObjectInspectorFactory.writableDoubleObjectInspector),
    ObjectInspectorFactory.getStandardListObjectInspector(
        PrimitiveObjectInspectorFactory.writableIntObjectInspector)};
final SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator evaluator =
    (SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator) snr.getEvaluator(
        new SimpleGenericUDAFParameterInfo(OIs, false, false));
evaluator.init(GenericUDAFEvaluator.Mode.PARTIAL1, OIs);
final SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator.SignalNoiseRatioAggregationBuffer agg =
    (SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator.SignalNoiseRatioAggregationBuffer)
        evaluator.getNewAggregationBuffer();
evaluator.reset(agg);
...
for (int i = 0; i < features.length; i++) {
    final List<IntWritable> labelList = new ArrayList<IntWritable>();
    for (int label : labels[i]) {
        labelList.add(new IntWritable(label));
    }
    evaluator.iterate(agg, new Object[] {WritableUtils.toWritableList(features[i]),
        labelList});
}
final List<DoubleWritable> resultObj = (List<DoubleWritable>) evaluator.terminate(agg);
...
Assert.assertArrayEquals(answer, result, 1e-5);

Problems with the existing test (a lot of the code is omitted):
• Long, useless initialization
• Many useless conversions
• And it does not run on Hive; it is only a logic test!

• Solution
  • A new module based on JUnit, HiveRunner and td-client-java
• What can it do?
  • Short and unified initialization
  • Write and combine HQL
  • Run on local Hive and on remote Treasure Data with the same code
  • Testbeds are prepared and cleaned up automatically
  • Easy to use external resources, e.g. TSV files
  • Tests are defined as literals (HQL), but can still be run with a debugger
  • Useful DSL

Add new features - systemtest
• How does it work?
[Diagram: User, Test code, SystemTestTeam, the SystemTestRunner interface, and its implementations HiveSystemTestRunner (HiveRunner) and TDSystemTestRunner (Treasure Data)]
1. Write tests based on the SystemTestRunner interface
2. Read the initialization and execute it via the implementations of SystemTestRunner
  • Works based on JUnit @ClassRule
  • Prepares a database specialized for each test class
  • Uses external resources as needed
3. Execute the first test
  • Works based on JUnit @Rule
  • Runs as HQL and checks the return values
  • Rewrites DSL & HQL for each environment
4. Reset testbeds
  • Drops temporary tables
5. Execute the second test
6. Reset testbeds
…repeat for all tests
7. Finalize the test
  • Drops the temporary database and disconnects
  • Works based on JUnit @ClassRule

• Example: initialization
private static SystemTestCommonInfo ci = new SystemTestCommonInfo(HogeTest.class);
private static HQBase createIrisTable = HQ.uploadByResourcePathAsNewTable(
    "iris0", ci.initDir + "iris0.csv",
    new LinkedHashMap<String, String>() {{
        put("a", "double");
        put("b", "double");
        put("c", "double");
        put("d", "double");
        put("c0", "int");
        put("c1", "int");
        put("c2", "int");}});

@ClassRule
public static HiveSystemTestRunner hRunner = new HiveSystemTestRunner(ci) {{
    initBy(createIrisTable);
    initBy(HQ.fromStatements("" +
        "CREATE TEMPORARY FUNCTION transpose_and_dot as 'hivemall.tools.matrix.TransposeAndDotUDAF';" +
        "CREATE TEMPORARY FUNCTION array_sum as 'hivemall.tools.array.ArraySumUDAF';" +
        "CREATE TEMPORARY FUNCTION array_avg as 'hivemall.tools.array.ArrayAvgGenericUDAF';" +
        "CREATE TEMPORARY FUNCTION chi2 as 'hivemall.ftvec.selection.ChiSquareUDF';" +
        "CREATE TEMPORARY FUNCTION snr as 'hivemall.ftvec.selection.SignalNoiseRatioUDAF';"));}};
@ClassRule
public static TDSystemTestRunner tRunner = new TDSystemTestRunner(ci) {{
    initBy(createIrisTable);}};

@Rule
public SystemTestTeam team = new SystemTestTeam(hRunner);

Notes on the initialization example (no omission!):
• Common initialization with external data (createIrisTable)
• Testbed-specific initialization (the initBy calls of each runner)
• Set the common runner (SystemTestTeam)

• Example: test cases(1)
@Test
public void snr() throws Exception {
    team.set(HQ.fromStatement("" +
        "WITH iris AS (" +
        " SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0)" +
        "SELECT snr(X, Y) " +
        "FROM iris"), "$ANSWER");
    team.run();
}
@Test
public void chi2() throws Exception {
    team.add(tRunner);
    team.set(HQ.fromStatement("" +
        "WITH iris AS (" +
        " SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0)," +
        "stats AS (" +
        " SELECT" +
        " transpose_and_dot(Y, X) AS observed," +
        " array_sum(X) AS feature_count," +
        " array_avg(Y) AS class_prob" +
        " FROM" +
        " iris)," +
        "test AS (" +
        " SELECT" +
        " transpose_and_dot(class_prob, feature_count) AS expected" +
        " FROM" +
        " stats)" +
        "SELECT" +
        " chi2(observed, expected) AS x " +
        "FROM" +
        " test JOIN stats"), "$ANSWER");
    team.run();
}

Notes on test cases(1) (no omission!):
• Tests are executed on clean testbeds, using the database created by the initialization
• snr() runs on HiveRunner
• chi2() runs on HiveRunner and Treasure Data

• Example: test cases(2)
@Test
public void someTest0() throws Exception {
    final String tableName = "color";
    team.initBy(HQ.uploadByResourcePathAsNewTable(
        tableName, ci.initDir + "color.tsv",
        new LinkedHashMap<String, String>() {{
            put("name", "string");
            put("red", "int");
            put("green", "int");
            put("blue", "int");}}));
    team.set(HQ.fromStatement("" +
        "SELECT CONCAT('rgb(', red, ',', green, ',', blue, ')') FROM " +
        tableName +
        " u LEFT JOIN color c on u.favorite_color = c.name"),
        "rgb(255,165,0)\trgb(255,192,203)");
    team.run();
}
@Test
public void someTest1() throws Exception { // method name assumed; the original line is missing
    team.set(HQ.autoMatchingByFileName("hoge"), ci);
    team.run();
}

Notes on test cases(2) (no omission!):
• Test-specific initialization (it can also be chained)
• HQL and answers written in external files can be used

• More details?
• https://github.com/myui/hivemall/issues/323
• https://github.com/myui/hivemall/pull/336
• And systemtest/README.md

Add new features - feature binning
• What’s feature binning?
  • A method to divide quantitative variables into meaningful categorical variables
• [UDAF] build_bins(weight, num_of_bins[, auto_shrink])
• [UDF] feature_binning(features, quantiles_map) / feature_binning(weight, quantiles)
[Figure: build_bins and feature_binning]

• [UDAF] build_bins(weight, num_of_bins[, auto_shrink])
  • Uses percentiles internally, so that all areas become uniform
  • What’s auto_shrink?
    • A small or skewed data set sometimes produces empty bins
    • Without auto_shrink this raises an exception; with it, the number of bins is reduced instead
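
A minimal sketch of the idea behind build_bins and auto_shrink, under stated assumptions: the boundaries are taken from nearest-rank quantiles (the exact quantile rule used by Hivemall is not given here), the outer boundaries are -Infinity and Infinity, and a duplicated quantile is treated as an empty bin. The class BuildBinsSketch and its method are hypothetical names, not the Hivemall implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: quantile-based bin boundaries with an auto_shrink-like switch.
final class BuildBinsSketch {
    static double[] buildBins(double[] values, int numBins, boolean autoShrink) {
        if (values.length == 0 || numBins < 2) {
            throw new IllegalArgumentException("need at least one value and two bins");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        List<Double> bounds = new ArrayList<>();
        bounds.add(Double.NEGATIVE_INFINITY);
        for (int i = 1; i < numBins; i++) {
            // Nearest-rank quantile at probability i / numBins (assumed rule),
            // computed with an integer ceiling to avoid rounding surprises.
            int rank = (i * sorted.length + numBins - 1) / numBins;
            double q = sorted[rank - 1];
            double last = bounds.get(bounds.size() - 1);
            if (q == last) {
                // The same quantile twice means the bin in between would be empty.
                if (!autoShrink) {
                    throw new IllegalStateException("empty bin at quantile " + q);
                }
                continue; // auto_shrink: drop the duplicate, ending up with fewer bins
            }
            bounds.add(q);
        }
        bounds.add(Double.POSITIVE_INFINITY);
        double[] out = new double[bounds.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = bounds.get(i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(buildBins(new double[] {1, 2, 3, 4, 5, 6}, 3, false)));
        System.out.println(Arrays.toString(buildBins(new double[] {1, 1, 1, 1, 1, 9}, 3, true)));
    }
}
```

With the skewed input in main, three bins are requested but only two survive when the switch is on; with the switch off the same call throws.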

• [UDF] feature_binning
  • Distributes each variable into a bin according to its value
  • Example: Age:17 with bins 0, 1 and 2
    • 17 is between -Infinity and 18.0, so it goes to bin 0
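
The bin lookup in the Age:17 example can be sketched as follows. The boundary 18.0 comes from the slide; the boundary 30.0, the rule that a value equal to a boundary goes to the lower bin, and the names FeatureBinningSketch and binOf are assumptions made for illustration.

```java
// Sketch: find the bin of a value, given sorted bin boundaries.
final class FeatureBinningSketch {
    // Returns the index of the first bin whose upper boundary is >= value.
    // Treating the upper boundary as inclusive is an assumption.
    static int binOf(double value, double[] bounds) {
        for (int i = 0; i + 1 < bounds.length; i++) {
            if (value <= bounds[i + 1]) {
                return i;
            }
        }
        throw new IllegalArgumentException("value is not covered by the bins: " + value);
    }

    public static void main(String[] args) {
        // 18.0 is from the slide; 30.0 is a hypothetical second boundary.
        double[] bounds = {Double.NEGATIVE_INFINITY, 18.0, 30.0, Double.POSITIVE_INFINITY};
        System.out.println(binOf(17.0, bounds));
    }
}
```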

• More details?

Add new features - feature selection
• What’s feature selection?
  • A generic term for methods that select meaningful features
  • Used as preprocessing for machine learning
• Why is it used?
  • Enhance results
  • Shorten learning time
  • Make the set of features human-understandable

• Kinds of feature selection
  • Use variance
  • Use Chi-square value (implemented)
  • Use SNR (Signal Noise Ratio) (implemented)
  • mRMR (minimum Redundancy Maximum Relevance)
  • etc.

• Feature selection using Chi-square value
  • Calculating the Chi-square value needs both observed values and expected values (= hypothesis)
    • Observed: aggregated features of each class
    • Expected: values calculated under the assumption that features and classes are independent
  • Calculate the Chi-square value
  • Select the top-k features
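
The observed/expected steps above can be sketched in plain Java, assuming non-negative feature values, one-hot class labels, and the usual statistic (sum of (observed - expected)^2 / expected per feature). The p-value step is left out, and ChiSquareSketch is a hypothetical name rather than the Hivemall code.

```java
import java.util.Arrays;

// Sketch: per-feature Chi-square statistic from features x and one-hot labels y.
final class ChiSquareSketch {
    static double[] chi2(double[][] x, int[][] y) {
        int n = x.length;
        int f = x[0].length;
        int c = y[0].length;
        double[][] observed = new double[c][f]; // per-class sums of each feature
        double[] featureCount = new double[f];  // total of each feature
        double[] classProb = new double[c];     // share of rows in each class
        for (int r = 0; r < n; r++) {
            for (int j = 0; j < f; j++) {
                featureCount[j] += x[r][j];
            }
            for (int k = 0; k < c; k++) {
                classProb[k] += (double) y[r][k] / n;
                for (int j = 0; j < f; j++) {
                    observed[k][j] += y[r][k] * x[r][j];
                }
            }
        }
        double[] stat = new double[f];
        for (int j = 0; j < f; j++) {
            for (int k = 0; k < c; k++) {
                // Expected value if feature j and class k were independent.
                // A zero expected value would yield NaN; not handled in this sketch.
                double expected = classProb[k] * featureCount[j];
                double d = observed[k][j] - expected;
                stat[j] += d * d / expected;
            }
        }
        return stat;
    }

    public static void main(String[] args) {
        double[][] x = {{1, 2}, {1, 2}, {3, 2}, {3, 2}};
        int[][] y = {{1, 0}, {1, 0}, {0, 1}, {0, 1}};
        System.out.println(Arrays.toString(chi2(x, y)));
    }
}
```

In the toy data, the first feature differs between the two classes and the second does not, so only the first gets a non-zero statistic.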

• How does it work on Hivemall? (Chi-square)
  • [UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>
  • [UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>
  • [UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double>

• [UDAF] transpose_and_dot (Chi-square)
  • Utility for matrix calculation, generic UDF
  • Calculates Y^T X
  • You might think matrix multiplication requires repetition… but it is calculated incrementally!
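
Why Y^T X can be calculated incrementally: it equals the sum, over rows, of the outer product of that row of Y with the same row of X, so each row can be folded into a running matrix and partial results can be added together. A minimal sketch follows; TransposeAndDotSketch and its iterate/merge/terminate methods are hypothetical names that only mirror the usual UDAF life cycle.

```java
import java.util.Arrays;

// Sketch: accumulate Y^T X one row pair at a time.
final class TransposeAndDotSketch {
    private final double[][] acc; // running Y^T X, shape (columns of Y) x (columns of X)

    TransposeAndDotSketch(int yCols, int xCols) {
        acc = new double[yCols][xCols];
    }

    // Consume one row pair: add the outer product of y and x to the running result.
    void iterate(double[] y, double[] x) {
        for (int i = 0; i < y.length; i++) {
            for (int j = 0; j < x.length; j++) {
                acc[i][j] += y[i] * x[j];
            }
        }
    }

    // Combine a partial result that was computed on another subset of rows.
    void merge(TransposeAndDotSketch other) {
        for (int i = 0; i < acc.length; i++) {
            for (int j = 0; j < acc[i].length; j++) {
                acc[i][j] += other.acc[i][j];
            }
        }
    }

    double[][] terminate() {
        return acc;
    }

    public static void main(String[] args) {
        TransposeAndDotSketch t = new TransposeAndDotSketch(2, 2);
        t.iterate(new double[] {1, 0}, new double[] {1, 2});
        t.iterate(new double[] {0, 1}, new double[] {3, 4});
        System.out.println(Arrays.deepToString(t.terminate()));
    }
}
```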

• [UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>
  • Calculates the Chi-square value and the p-value
  • Chi-square value: sum of (observed - expected)^2 / expected
  • The p-value is calculated from the value above and the Chi-square distribution

• [UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double>
  • Selects the top-k elements from X according to importance_list
  • Generic UDF
NOTE: The current implementation expects importance_list and k to be the same for every row
[Figure: example with k = 2]
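
A sketch of top-k selection by importance. Two points are assumptions: the selected elements are returned in their original order, and the importances are taken as doubles for simplicity. SelectKBestSketch is a hypothetical name, not the Hivemall UDF itself.

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch: keep the k elements of x that have the highest importance.
final class SelectKBestSketch {
    static double[] selectKBest(double[] x, double[] importance, int k) {
        if (x.length != importance.length || k < 0 || k > x.length) {
            throw new IllegalArgumentException("inconsistent arguments");
        }
        Integer[] idx = new Integer[x.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        // Sort indices by importance, highest first (the sort is stable for ties).
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> -importance[i]));
        int[] top = new int[k];
        for (int i = 0; i < k; i++) {
            top[i] = idx[i];
        }
        Arrays.sort(top); // back to the original element order (an assumption)
        double[] out = new double[k];
        for (int i = 0; i < k; i++) {
            out[i] = x[top[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] x = {5.1, 3.5, 1.4, 0.2};
        double[] importance = {10.8, 3.7, 116.2, 67.2};
        System.out.println(Arrays.toString(selectKBest(x, importance, 2)));
    }
}
```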

• Feature selection using SNR (Signal Noise Ratio)
  • Aggregate the mean and variance of each feature for each class
  • On termination, calculate the Signal Noise Ratio between every combination of classes for each feature
  • Sum up the Signal Noise Ratios for each feature

• How does it work on Hivemall? (Signal Noise Ratio)
  • [UDAF] snr(X::array<number>, label::array<int>)::array<double>
    • Aggregates variance by Chan’s method
    • Calculates the Signal Noise Ratios and sums them up for each feature
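
A sketch of both steps for a single feature. The merge step follows the pairwise update of Chan et al. for combining counts, means and sums of squared deviations. The SNR of two classes is assumed here to be |mean_i - mean_j| / (sd_i + sd_j) with the sample standard deviation; the slides do not spell out that formula, and SnrSketch and Stats are hypothetical names.

```java
// Sketch: mergeable per-class statistics and the summed pairwise SNR of one feature.
final class SnrSketch {
    // Running count, mean and sum of squared deviations (m2) for one class.
    static final class Stats {
        long n;
        double mean;
        double m2;

        static Stats of(double... xs) {
            Stats s = new Stats();
            for (double x : xs) {
                s.add(x);
            }
            return s;
        }

        // Single-value update (Welford's form).
        void add(double x) {
            n++;
            double d = x - mean;
            mean += d / n;
            m2 += d * (x - mean);
        }

        // Pairwise merge of two partial aggregates (Chan et al.).
        void merge(Stats o) {
            if (o.n == 0) {
                return;
            }
            if (n == 0) {
                n = o.n;
                mean = o.mean;
                m2 = o.m2;
                return;
            }
            long total = n + o.n;
            double delta = o.mean - mean;
            m2 += o.m2 + delta * delta * n * o.n / total;
            mean += delta * o.n / total;
            n = total;
        }

        double sd() {
            return n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0;
        }
    }

    // Sum of |mean_i - mean_j| / (sd_i + sd_j) over all class pairs (assumed definition).
    static double snr(Stats[] perClass) {
        double sum = 0.0;
        for (int i = 0; i < perClass.length; i++) {
            for (int j = i + 1; j < perClass.length; j++) {
                double denom = perClass[i].sd() + perClass[j].sd();
                if (denom > 0.0) {
                    sum += Math.abs(perClass[i].mean - perClass[j].mean) / denom;
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Stats a = Stats.of(1, 2, 3);
        Stats b = Stats.of(5, 6, 7);
        System.out.println(snr(new Stats[] {a, b}));
    }
}
```

Keeping only (n, mean, m2) per class and feature is what lets partial aggregates from different workers be merged without revisiting the rows.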

• More details?

Add new features - spark integration
• Integrated feature selection into the Spark module
• Improved the build flow to resolve binary incompatibility between spark-1.6 and spark-2.0

Thank you for listening!
Any questions?
