6. Bulk Data Loader
High Throughput&Reliability
Embulk
Written in Java/JRuby
http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed
http://www.embulk.org/
22. Queue/Worker, Scheduler
• Treasure Data: multi-tenant data analytics service
• executes many jobs in shared clusters (queries,
imports, ...)
• CORE: queues-workers & schedulers
• Clusters have queues/scheduler... it's not enough
• resource limitations for each price plans
• priority queues for job types
• and many others
24. PerfectSched
• Provides periodical/scheduled queries for customers
• it's like reliable "cron"
• Highly available distributed scheduler using RDBMS
• Written in CRuby
• At-least-once semantics
• PerfectSched enqueues jobs into PerfectQueue
25. Jobs in TD
LOST Duplicated
Retried
for
Errors
Throughput
Execution
time
DATA
import/
export
NG NG
OK
or
NG
HIGH SHORT
(secs-mins)
QUERY NG OK OK LOW
SHORT
(secs)
or
LONG
(mins-hours)
28. Features
• Priorities for query types
• Resource limits per accounts
• Graceful restarts
• Queries must run long time (<= 1d)
• New worker code should be loaded, besides
running job with older code
29. PerfectQueue
• Highly available distributed queue using RDBMS
• Enqueue by INSERT INTO
• Dequeue/Commit by UPDATE
• Using transactions
• Flexible scheduling rather than scalability
• Workers does many things
• Plazmadb operations (including importing data)
• Building job parameters
• Handling results of jobs + kicking other jobs
• Using Amazon RDS (MySQL) internally (+ Workers on EC2)
30. Builing Jobs/Parameters
• Parameters
• for job types, accounts, price plans and clusters
• to control performance/parallelism, permissions
and data types
• ex: Java properties
• Jobs
• to prepare for customers' queries
• to make queries safer/faster
• ex: Hive Queries (HiveQL, a variety of SQL)
33. Example: Hive job (cont)
ADD JAR 'td-hadoop-1.0.jar';
CREATE DATABASE IF NOT EXISTS `db`;
USE `db`;
CREATE TABLE tagomoris (`v` MAP<STRING,STRING>, `time` INT)
STORED BY 'com.treasure_data.hadoop.hive.mapred.TDStorageHandler'
WITH SERDEPROPERTIES ('msgpack.columns.mapping'='*,time')
TBLPROPERTIES (
'td.storage.user'='221',
'td.storage.database'='dfc',
'td.storage.table'='users_20100604_080812_ce9203d0',
'td.storage.path'='221/dfc/users_20100604_080812_ce9203d0',
'td.table_id'='2',
'td.modifiable'='true',
'plazma.data_set.name'='221/dfc/users_20100604_080812_ce9203d0'
);
CREATE TABLE tbl1 (
`uid` INT,
`key` STRING,
`time` INT
)
STORED BY 'com.treasure_data.hadoop.hive.mapred.TDStorageHandler'
WITH SERDEPROPERTIES ('msgpack.columns.mapping'='uid,key,time')
TBLPROPERTIES (
'td.storage.user'='221',
'td.storage.database'='dfc',
34. ADD JAR 'td-hadoop-1.0.jar';
CREATE DATABASE IF NOT EXISTS `db`;
USE `db`;
CREATE TABLE tagomoris (`v` MAP<STRING,STRING>, `time` INT)
STORED BY 'com.treasure_data.hadoop.hive.mapred.TDStorageHandler'
WITH SERDEPROPERTIES ('msgpack.columns.mapping'='*,time')
TBLPROPERTIES (
'td.storage.user'='221',
'td.storage.database'='dfc',
'td.storage.table'='users_20100604_080812_ce9203d0',
'td.storage.path'='221/dfc/users_20100604_080812_ce9203d0',
'td.table_id'='2',
'td.modifiable'='true',
'plazma.data_set.name'='221/dfc/users_20100604_080812_ce9203d0'
);
CREATE TABLE tbl1 (
`uid` INT,
`key` STRING,
`time` INT
)
STORED BY 'com.treasure_data.hadoop.hive.mapred.TDStorageHandler'
WITH SERDEPROPERTIES ('msgpack.columns.mapping'='uid,key,time')
TBLPROPERTIES (
'td.storage.user'='221',
'td.storage.database'='dfc',
'td.storage.table'='contests_20100606_120720_96abe81a',
'td.storage.path'='221/dfc/contests_20100606_120720_96abe81a',
'td.table_id'='4',
'td.modifiable'='true',
'plazma.data_set.name'='221/dfc/contests_20100606_120720_96abe81a'
);
USE `db`;
35. USE `db`;
CREATE TEMPORARY FUNCTION MSGPACK_SERIALIZE AS
'com.treasure_data.hadoop.hive.udf.MessagePackSerialize';
CREATE TEMPORARY FUNCTION TD_TIME_RANGE AS
'com.treasure_data.hadoop.hive.udf.GenericUDFTimeRange';
CREATE TEMPORARY FUNCTION TD_TIME_ADD AS
'com.treasure_data.hadoop.hive.udf.UDFTimeAdd';
CREATE TEMPORARY FUNCTION TD_TIME_FORMAT AS
'com.treasure_data.hadoop.hive.udf.UDFTimeFormat';
CREATE TEMPORARY FUNCTION TD_TIME_PARSE AS
'com.treasure_data.hadoop.hive.udf.UDFTimeParse';
CREATE TEMPORARY FUNCTION TD_SCHEDULED_TIME AS
'com.treasure_data.hadoop.hive.udf.GenericUDFScheduledTime';
CREATE TEMPORARY FUNCTION TD_X_RANK AS
'com.treasure_data.hadoop.hive.udf.Rank';
CREATE TEMPORARY FUNCTION TD_FIRST AS
'com.treasure_data.hadoop.hive.udf.GenericUDAFFirst';
CREATE TEMPORARY FUNCTION TD_LAST AS
'com.treasure_data.hadoop.hive.udf.GenericUDAFLast';
CREATE TEMPORARY FUNCTION TD_SESSIONIZE AS
'com.treasure_data.hadoop.hive.udf.UDFSessionize';
CREATE TEMPORARY FUNCTION TD_PARSE_USER_AGENT AS
'com.treasure_data.hadoop.hive.udf.GenericUDFParseUserAgent';
CREATE TEMPORARY FUNCTION TD_HEX2NUM AS
'com.treasure_data.hadoop.hive.udf.UDFHex2num';
CREATE TEMPORARY FUNCTION TD_MD5 AS
'com.treasure_data.hadoop.hive.udf.UDFmd5';
CREATE TEMPORARY FUNCTION TD_RANK_SEQUENCE AS
'com.treasure_data.hadoop.hive.udf.UDFRankSequence';
CREATE TEMPORARY FUNCTION TD_STRING_EXPLODER AS
'com.treasure_data.hadoop.hive.udf.GenericUDTFStringExploder';
CREATE TEMPORARY FUNCTION TD_URL_DECODE AS
36. CREATE TEMPORARY FUNCTION TD_URL_DECODE AS
'com.treasure_data.hadoop.hive.udf.UDFUrlDecode';
CREATE TEMPORARY FUNCTION TD_DATE_TRUNC AS
'com.treasure_data.hadoop.hive.udf.UDFDateTrunc';
CREATE TEMPORARY FUNCTION TD_LAT_LONG_TO_COUNTRY AS
'com.treasure_data.hadoop.hive.udf.UDFLatLongToCountry';
CREATE TEMPORARY FUNCTION TD_SUBSTRING_INENCODING AS
'com.treasure_data.hadoop.hive.udf.GenericUDFSubstringInEncoding';
CREATE TEMPORARY FUNCTION TD_DIVIDE AS
'com.treasure_data.hadoop.hive.udf.GenericUDFDivide';
CREATE TEMPORARY FUNCTION TD_SUMIF AS
'com.treasure_data.hadoop.hive.udf.GenericUDAFSumIf';
CREATE TEMPORARY FUNCTION TD_AVGIF AS
'com.treasure_data.hadoop.hive.udf.GenericUDAFAvgIf';
CREATE TEMPORARY FUNCTION hivemall_version AS
'hivemall.HivemallVersionUDF';
CREATE TEMPORARY FUNCTION perceptron AS
'hivemall.classifier.PerceptronUDTF';
CREATE TEMPORARY FUNCTION train_perceptron AS
'hivemall.classifier.PerceptronUDTF';
CREATE TEMPORARY FUNCTION train_pa AS
'hivemall.classifier.PassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_pa1 AS
'hivemall.classifier.PassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_pa2 AS
'hivemall.classifier.PassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_cw AS
'hivemall.classifier.ConfidenceWeightedUDTF';
CREATE TEMPORARY FUNCTION train_arow AS
'hivemall.classifier.AROWClassifierUDTF';
CREATE TEMPORARY FUNCTION train_arowh AS
'hivemall.classifier.AROWClassifierUDTF';
37. CREATE TEMPORARY FUNCTION train_arowh AS
'hivemall.classifier.AROWClassifierUDTF';
CREATE TEMPORARY FUNCTION train_scw AS
'hivemall.classifier.SoftConfideceWeightedUDTF';
CREATE TEMPORARY FUNCTION train_scw2 AS
'hivemall.classifier.SoftConfideceWeightedUDTF';
CREATE TEMPORARY FUNCTION adagrad_rda AS
'hivemall.classifier.AdaGradRDAUDTF';
CREATE TEMPORARY FUNCTION train_adagrad_rda AS
'hivemall.classifier.AdaGradRDAUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_perceptron AS
'hivemall.classifier.multiclass.MulticlassPerceptronUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_pa AS
'hivemall.classifier.multiclass.MulticlassPassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_pa1 AS
'hivemall.classifier.multiclass.MulticlassPassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_pa2 AS
'hivemall.classifier.multiclass.MulticlassPassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_cw AS
'hivemall.classifier.multiclass.MulticlassConfidenceWeightedUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_arow AS
'hivemall.classifier.multiclass.MulticlassAROWClassifierUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_scw AS
'hivemall.classifier.multiclass.MulticlassSoftConfidenceWeightedUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_scw2 AS
'hivemall.classifier.multiclass.MulticlassSoftConfidenceWeightedUDTF';
CREATE TEMPORARY FUNCTION cosine_similarity AS
'hivemall.knn.similarity.CosineSimilarityUDF';
CREATE TEMPORARY FUNCTION cosine_sim AS
'hivemall.knn.similarity.CosineSimilarityUDF';
CREATE TEMPORARY FUNCTION jaccard AS
'hivemall.knn.similarity.JaccardIndexUDF';
38. CREATE TEMPORARY FUNCTION jaccard AS
'hivemall.knn.similarity.JaccardIndexUDF';
CREATE TEMPORARY FUNCTION jaccard_similarity AS
'hivemall.knn.similarity.JaccardIndexUDF';
CREATE TEMPORARY FUNCTION angular_similarity AS
'hivemall.knn.similarity.AngularSimilarityUDF';
CREATE TEMPORARY FUNCTION euclid_similarity AS
'hivemall.knn.similarity.EuclidSimilarity';
CREATE TEMPORARY FUNCTION distance2similarity AS
'hivemall.knn.similarity.Distance2SimilarityUDF';
CREATE TEMPORARY FUNCTION hamming_distance AS
'hivemall.knn.distance.HammingDistanceUDF';
CREATE TEMPORARY FUNCTION popcnt AS 'hivemall.knn.distance.PopcountUDF';
CREATE TEMPORARY FUNCTION kld AS 'hivemall.knn.distance.KLDivergenceUDF';
CREATE TEMPORARY FUNCTION euclid_distance AS
'hivemall.knn.distance.EuclidDistanceUDF';
CREATE TEMPORARY FUNCTION cosine_distance AS
'hivemall.knn.distance.CosineDistanceUDF';
CREATE TEMPORARY FUNCTION angular_distance AS
'hivemall.knn.distance.AngularDistanceUDF';
CREATE TEMPORARY FUNCTION jaccard_distance AS
'hivemall.knn.distance.JaccardDistanceUDF';
CREATE TEMPORARY FUNCTION manhattan_distance AS
'hivemall.knn.distance.ManhattanDistanceUDF';
CREATE TEMPORARY FUNCTION minkowski_distance AS
'hivemall.knn.distance.MinkowskiDistanceUDF';
CREATE TEMPORARY FUNCTION minhashes AS 'hivemall.knn.lsh.MinHashesUDF';
CREATE TEMPORARY FUNCTION minhash AS 'hivemall.knn.lsh.MinHashUDTF';
CREATE TEMPORARY FUNCTION bbit_minhash AS
'hivemall.knn.lsh.bBitMinHashUDF';
CREATE TEMPORARY FUNCTION voted_avg AS
'hivemall.ensemble.bagging.VotedAvgUDAF';
39. CREATE TEMPORARY FUNCTION voted_avg AS
'hivemall.ensemble.bagging.VotedAvgUDAF';
CREATE TEMPORARY FUNCTION weight_voted_avg AS
'hivemall.ensemble.bagging.WeightVotedAvgUDAF';
CREATE TEMPORARY FUNCTION wvoted_avg AS
'hivemall.ensemble.bagging.WeightVotedAvgUDAF';
CREATE TEMPORARY FUNCTION max_label AS
'hivemall.ensemble.MaxValueLabelUDAF';
CREATE TEMPORARY FUNCTION maxrow AS 'hivemall.ensemble.MaxRowUDAF';
CREATE TEMPORARY FUNCTION argmin_kld AS
'hivemall.ensemble.ArgminKLDistanceUDAF';
CREATE TEMPORARY FUNCTION mhash AS
'hivemall.ftvec.hashing.MurmurHash3UDF';
CREATE TEMPORARY FUNCTION sha1 AS 'hivemall.ftvec.hashing.Sha1UDF';
CREATE TEMPORARY FUNCTION array_hash_values AS
'hivemall.ftvec.hashing.ArrayHashValuesUDF';
CREATE TEMPORARY FUNCTION prefixed_hash_values AS
'hivemall.ftvec.hashing.ArrayPrefixedHashValuesUDF';
CREATE TEMPORARY FUNCTION polynomial_features AS
'hivemall.ftvec.pairing.PolynomialFeaturesUDF';
CREATE TEMPORARY FUNCTION powered_features AS
'hivemall.ftvec.pairing.PoweredFeaturesUDF';
CREATE TEMPORARY FUNCTION rescale AS 'hivemall.ftvec.scaling.RescaleUDF';
CREATE TEMPORARY FUNCTION rescale_fv AS
'hivemall.ftvec.scaling.RescaleUDF';
CREATE TEMPORARY FUNCTION zscore AS 'hivemall.ftvec.scaling.ZScoreUDF';
CREATE TEMPORARY FUNCTION normalize AS
'hivemall.ftvec.scaling.L2NormalizationUDF';
CREATE TEMPORARY FUNCTION conv2dense AS
'hivemall.ftvec.conv.ConvertToDenseModelUDAF';
CREATE TEMPORARY FUNCTION to_dense_features AS
'hivemall.ftvec.conv.ToDenseFeaturesUDF';
40. CREATE TEMPORARY FUNCTION to_dense_features AS
'hivemall.ftvec.conv.ToDenseFeaturesUDF';
CREATE TEMPORARY FUNCTION to_dense AS
'hivemall.ftvec.conv.ToDenseFeaturesUDF';
CREATE TEMPORARY FUNCTION to_sparse_features AS
'hivemall.ftvec.conv.ToSparseFeaturesUDF';
CREATE TEMPORARY FUNCTION to_sparse AS
'hivemall.ftvec.conv.ToSparseFeaturesUDF';
CREATE TEMPORARY FUNCTION quantify AS
'hivemall.ftvec.conv.QuantifyColumnsUDTF';
CREATE TEMPORARY FUNCTION vectorize_features AS
'hivemall.ftvec.trans.VectorizeFeaturesUDF';
CREATE TEMPORARY FUNCTION categorical_features AS
'hivemall.ftvec.trans.CategoricalFeaturesUDF';
CREATE TEMPORARY FUNCTION indexed_features AS
'hivemall.ftvec.trans.IndexedFeatures';
CREATE TEMPORARY FUNCTION quantified_features AS
'hivemall.ftvec.trans.QuantifiedFeaturesUDTF';
CREATE TEMPORARY FUNCTION quantitative_features AS
'hivemall.ftvec.trans.QuantitativeFeaturesUDF';
CREATE TEMPORARY FUNCTION amplify AS
'hivemall.ftvec.amplify.AmplifierUDTF';
CREATE TEMPORARY FUNCTION rand_amplify AS
'hivemall.ftvec.amplify.RandomAmplifierUDTF';
CREATE TEMPORARY FUNCTION addBias AS 'hivemall.ftvec.AddBiasUDF';
CREATE TEMPORARY FUNCTION add_bias AS 'hivemall.ftvec.AddBiasUDF';
CREATE TEMPORARY FUNCTION sortByFeature AS
'hivemall.ftvec.SortByFeatureUDF';
CREATE TEMPORARY FUNCTION sort_by_feature AS
'hivemall.ftvec.SortByFeatureUDF';
CREATE TEMPORARY FUNCTION extract_feature AS
'hivemall.ftvec.ExtractFeatureUDF';
41. CREATE TEMPORARY FUNCTION extract_feature AS
'hivemall.ftvec.ExtractFeatureUDF';
CREATE TEMPORARY FUNCTION extract_weight AS
'hivemall.ftvec.ExtractWeightUDF';
CREATE TEMPORARY FUNCTION add_feature_index AS
'hivemall.ftvec.AddFeatureIndexUDF';
CREATE TEMPORARY FUNCTION feature AS 'hivemall.ftvec.FeatureUDF';
CREATE TEMPORARY FUNCTION feature_index AS
'hivemall.ftvec.FeatureIndexUDF';
CREATE TEMPORARY FUNCTION tf AS 'hivemall.ftvec.text.TermFrequencyUDAF';
CREATE TEMPORARY FUNCTION train_logregr AS
'hivemall.regression.LogressUDTF';
CREATE TEMPORARY FUNCTION train_pa1_regr AS
'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION train_pa1a_regr AS
'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION train_pa2_regr AS
'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION train_pa2a_regr AS
'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION train_arow_regr AS
'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION train_arowe_regr AS
'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION train_arowe2_regr AS
'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION train_adagrad_regr AS
'hivemall.regression.AdaGradUDTF';
CREATE TEMPORARY FUNCTION train_adadelta_regr AS
'hivemall.regression.AdaDeltaUDTF';
CREATE TEMPORARY FUNCTION train_adagrad AS
'hivemall.regression.AdaGradUDTF';
42. CREATE TEMPORARY FUNCTION train_adagrad AS
'hivemall.regression.AdaGradUDTF';
CREATE TEMPORARY FUNCTION train_adadelta AS
'hivemall.regression.AdaDeltaUDTF';
CREATE TEMPORARY FUNCTION logress AS 'hivemall.regression.LogressUDTF';
CREATE TEMPORARY FUNCTION pa1_regress AS
'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION pa1a_regress AS
'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION pa2_regress AS
'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION pa2a_regress AS
'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION arow_regress AS
'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION arowe_regress AS
'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION arowe2_regress AS
'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION adagrad AS 'hivemall.regression.AdaGradUDTF';
CREATE TEMPORARY FUNCTION adadelta AS 'hivemall.regression.AdaDeltaUDTF';
CREATE TEMPORARY FUNCTION float_array AS
'hivemall.tools.array.AllocFloatArrayUDF';
CREATE TEMPORARY FUNCTION array_remove AS
'hivemall.tools.array.ArrayRemoveUDF';
CREATE TEMPORARY FUNCTION sort_and_uniq_array AS
'hivemall.tools.array.SortAndUniqArrayUDF';
CREATE TEMPORARY FUNCTION subarray_endwith AS
'hivemall.tools.array.SubarrayEndWithUDF';
CREATE TEMPORARY FUNCTION subarray_startwith AS
'hivemall.tools.array.SubarrayStartWithUDF';
CREATE TEMPORARY FUNCTION collect_all AS
43. CREATE TEMPORARY FUNCTION collect_all AS
'hivemall.tools.array.CollectAllUDAF';
CREATE TEMPORARY FUNCTION concat_array AS
'hivemall.tools.array.ConcatArrayUDF';
CREATE TEMPORARY FUNCTION subarray AS 'hivemall.tools.array.SubarrayUDF';
CREATE TEMPORARY FUNCTION array_avg AS
'hivemall.tools.array.ArrayAvgGenericUDAF';
CREATE TEMPORARY FUNCTION array_sum AS
'hivemall.tools.array.ArraySumUDAF';
CREATE TEMPORARY FUNCTION to_string_array AS
'hivemall.tools.array.ToStringArrayUDF';
CREATE TEMPORARY FUNCTION map_get_sum AS
'hivemall.tools.map.MapGetSumUDF';
CREATE TEMPORARY FUNCTION map_tail_n AS 'hivemall.tools.map.MapTailNUDF';
CREATE TEMPORARY FUNCTION to_map AS 'hivemall.tools.map.UDAFToMap';
CREATE TEMPORARY FUNCTION to_ordered_map AS
'hivemall.tools.map.UDAFToOrderedMap';
CREATE TEMPORARY FUNCTION sigmoid AS
'hivemall.tools.math.SigmoidGenericUDF';
CREATE TEMPORARY FUNCTION taskid AS 'hivemall.tools.mapred.TaskIdUDF';
CREATE TEMPORARY FUNCTION jobid AS 'hivemall.tools.mapred.JobIdUDF';
CREATE TEMPORARY FUNCTION rowid AS 'hivemall.tools.mapred.RowIdUDF';
CREATE TEMPORARY FUNCTION generate_series AS
'hivemall.tools.GenerateSeriesUDTF';
CREATE TEMPORARY FUNCTION convert_label AS
'hivemall.tools.ConvertLabelUDF';
CREATE TEMPORARY FUNCTION x_rank AS 'hivemall.tools.RankSequenceUDF';
CREATE TEMPORARY FUNCTION each_top_k AS 'hivemall.tools.EachTopKUDTF';
CREATE TEMPORARY FUNCTION tokenize AS 'hivemall.tools.text.TokenizeUDF';
CREATE TEMPORARY FUNCTION is_stopword AS
'hivemall.tools.text.StopwordUDF';
CREATE TEMPORARY FUNCTION split_words AS
44. CREATE TEMPORARY FUNCTION split_words AS
'hivemall.tools.text.SplitWordsUDF';
CREATE TEMPORARY FUNCTION normalize_unicode AS
'hivemall.tools.text.NormalizeUnicodeUDF';
CREATE TEMPORARY FUNCTION lr_datagen AS
'hivemall.dataset.LogisticRegressionDataGeneratorUDTF';
CREATE TEMPORARY FUNCTION f1score AS 'hivemall.evaluation.FMeasureUDAF';
CREATE TEMPORARY FUNCTION mae AS
'hivemall.evaluation.MeanAbsoluteErrorUDAF';
CREATE TEMPORARY FUNCTION mse AS
'hivemall.evaluation.MeanSquaredErrorUDAF';
CREATE TEMPORARY FUNCTION rmse AS
'hivemall.evaluation.RootMeanSquaredErrorUDAF';
CREATE TEMPORARY FUNCTION mf_predict AS 'hivemall.mf.MFPredictionUDF';
CREATE TEMPORARY FUNCTION train_mf_sgd AS
'hivemall.mf.MatrixFactorizationSGDUDTF';
CREATE TEMPORARY FUNCTION train_mf_adagrad AS
'hivemall.mf.MatrixFactorizationAdaGradUDTF';
CREATE TEMPORARY FUNCTION fm_predict AS
'hivemall.fm.FMPredictGenericUDAF';
CREATE TEMPORARY FUNCTION train_fm AS
'hivemall.fm.FactorizationMachineUDTF';
CREATE TEMPORARY FUNCTION train_randomforest_classifier AS
'hivemall.smile.classification.RandomForestClassifierUDTF';
CREATE TEMPORARY FUNCTION train_rf_classifier AS
'hivemall.smile.classification.RandomForestClassifierUDTF';
CREATE TEMPORARY FUNCTION train_randomforest_regr AS
'hivemall.smile.regression.RandomForestRegressionUDTF';
CREATE TEMPORARY FUNCTION train_rf_regr AS
'hivemall.smile.regression.RandomForestRegressionUDTF';
CREATE TEMPORARY FUNCTION tree_predict AS
'hivemall.smile.tools.TreePredictByStackMachineUDF';
45. CREATE TEMPORARY FUNCTION tree_predict AS
'hivemall.smile.tools.TreePredictByStackMachineUDF';
CREATE TEMPORARY FUNCTION vm_tree_predict AS
'hivemall.smile.tools.TreePredictByStackMachineUDF';
CREATE TEMPORARY FUNCTION rf_ensemble AS
'hivemall.smile.tools.RandomForestEnsembleUDAF';
CREATE TEMPORARY FUNCTION train_gradient_boosting_classifier AS
'hivemall.smile.classification.GradientTreeBoostingClassifierUDTF';
CREATE TEMPORARY FUNCTION guess_attribute_types AS
'hivemall.smile.tools.GuessAttributesUDF';
CREATE TEMPORARY FUNCTION tokenize_ja AS
'hivemall.nlp.tokenizer.KuromojiUDF';
CREATE TEMPORARY MACRO max2(x DOUBLE, y DOUBLE) if(x>y,x,y);
CREATE TEMPORARY MACRO min2(x DOUBLE, y DOUBLE) if(x<y,x,y);
CREATE TEMPORARY MACRO rand_gid(k INT) floor(rand()*k);
CREATE TEMPORARY MACRO rand_gid2(k INT, seed INT) floor(rand(seed)*k);
CREATE TEMPORARY MACRO idf(df_t DOUBLE, n_docs DOUBLE) log(10, n_docs /
max2(1,df_t)) + 1.0;
CREATE TEMPORARY MACRO tfidf(tf FLOAT, df_t DOUBLE, n_docs DOUBLE) tf *
(log(10, n_docs / max2(1,df_t)) + 1.0);
SELECT time, COUNT(1) AS cnt FROM tbl1
WHERE TD_TIME_RANGE(time, '2015-12-11', '2015-12-12', 'JST');
47. PQ written in Ruby
• Building jobs/parameters is so complex!
• using data from many configurations (YAML, JSON),
internal APIs and RDBMSs
• with many ext syntaxes/rules to tune performance,
override configurations for tests, ...
• Ruby empower to write fat/complex worker code
• Testing!
• Unit tests using Rspec
• System tests (executing real queries/jobs) using
Rspec
48. For Further improvement
about workers
• More performance for more customers and less
costs
• More scalability for many other kind jobs
• Better and well-controlled tests (indented here
documents!)