Dots20161029 myui

Keynote talk at https://eventdots.jp/event/602633

Published in: Data & Analytics


  1. Apache Hivemall: Machine Learning Library for Apache Hive/Spark/Pig. Makoto YUI (@myui), Research Engineer, <myui@treasure-data.com>. 2016/10/29 @Dots
  2. Little about me …
     • 2015.04~ Research Engineer at Treasure Data, Inc.
       - My mission is developing ML-as-a-Service in a Hadoop-as-a-service company
     • 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan
       - Developed Hivemall as a personal research project
     • 2009.03 Ph.D. in Computer Science from NAIST
       - Majored in Parallel Data Processing, not ML then
     • Visiting scholar at CWI, Amsterdam and Univ. Edinburgh
  3. Treasure Data
     • Hiro Yoshikawa, CEO: open source business veteran
     • Kaz Ota, CTO: founder of the world’s largest Hadoop group
     • Sada Furuhashi, Chief Architect: invented Fluentd and MessagePack
     Timeline
     • 2012: Founded in Mountain View, CA
     • 2013: New office in Tokyo, Japan
     • 2015: New office in Seoul, Korea
     • Today: 100+ employees, 30M+ funding
     Investors
     • Jerry Yang (Yahoo! founder)
     • Bill Tai (angel investor)
     • Yukihiro Matsumoto (Ruby inventor)
     • Sierra Ventures, Tim Guleri (enterprise software)
     • Scale Ventures, Andy Vitus (B2B SaaS)
  4. Big Data Stats in Treasure Data
  5. We Open-source! TD invented ..
     • Streaming log collector
     • Bulk data import/export
     • Efficient binary serialization
     • Streaming query processor
     • Machine learning on Hadoop
     • Workflow engine (Beta): digdag.io
  6. Treasure Data’s Solution
  7. Agenda
     1. What is Hivemall (introduction)
     2. How to use Hivemall
     3. Roadmap and coming new features
  8. Hivemall entered Apache Incubator on Sept 13, 2016 🎉 hivemall.incubator.apache.org @ApacheHivemall
  9. Initial committers
     • Makoto Yui <Treasure Data>
     • Takeshi Yamamuro <NTT>
       - Hivemall on Apache Spark
     • Daniel Dai <Hortonworks>
       - Hivemall on Apache Pig
       - Apache Pig PMC member
     • Tsuyoshi Ozawa <NTT>
       - Apache Hadoop PMC member
     • Kai Sasaki <Treasure Data>
  10. Project mentors (champion and nominated mentors)
     • Reynold Xin <Databricks, ASF member>: Apache Spark PMC member
     • Markus Weimer <Microsoft, ASF member>: Apache REEF PMC member
     • Xiangrui Meng <Databricks, ASF member>: Apache Spark PMC member
     • Roman Shaposhnik <Pivotal, ASF member>: Apache Bigtop/Incubator PMC member
  11. What is Apache Hivemall: a scalable machine learning library built as a collection of Hive UDFs
     • Multi/Cross platform
     • Versatile
     • Scalable
     • Ease-of-use
  12. Hivemall is easy and scalable … (Ease-of-use, Scalable)
     • Classification with Mahout
     • ML made easy for SQL developers
     • Born to be parallel and scalable
       CREATE TABLE lr_model AS
       SELECT
         feature, -- reducers perform model averaging in parallel
         avg(weight) as weight
       FROM (
         SELECT logress(features,label,..) as (feature,weight)
         FROM train
       ) t -- map-only task
       GROUP BY feature; -- shuffled to reducers
     • This SQL query automatically runs in parallel on a Hadoop cluster
  13. Hivemall is a multi/cross-platform ML library
     • HiveQL
     • SparkSQL/Dataframe API
     • Pig Latin
     Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive
  14. Hivemall’s Technology Stack
  15. Hivemall on Apache Hive
  16. Hivemall on Apache Spark Dataframe
  17. Hivemall on SparkSQL
  18. Hivemall on Apache Pig
  19. Hivemall is a Versatile library ..
     ✓ Hivemall is not only for Machine Learning
     ✓ Hivemall provides a bunch of generic utility functions
     Each organization has its own set of UDFs for data preprocessing! Don’t Repeat Yourself!
  20. Hivemall generic functions
     • Array and Map
     • Bit and compress
     • String and NLP
     We welcome contributions of your generic UDFs to Hivemall!
  21. List of supported Algorithms
     Classification
       ✓ Perceptron
       ✓ Passive Aggressive (PA, PA1, PA2)
       ✓ Confidence Weighted (CW)
       ✓ Adaptive Regularization of Weight Vectors (AROW)
       ✓ Soft Confidence Weighted (SCW)
       ✓ AdaGrad+RDA
       ✓ Factorization Machines
       ✓ RandomForest Classification
     Regression
       ✓ Logistic Regression (SGD)
       ✓ AdaGrad (logistic loss)
       ✓ AdaDELTA (logistic loss)
       ✓ PA Regression
       ✓ AROW Regression
       ✓ Factorization Machines
       ✓ RandomForest Regression
     Notes
       • SCW is a good first choice
       • Try RandomForest if SCW does not work
       • Logistic regression is good for getting the probability of the positive class
       • Factorization Machines work well where features are sparse and categorical
  22. List of Algorithms for Recommendation
     K-Nearest Neighbor
       ✓ Minhash and b-Bit Minhash (LSH variant)
       ✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
     Matrix Completion
       ✓ Matrix Factorization
       ✓ Factorization Machines (regression)
     The each_top_k function of Hivemall is useful for recommending top-k items
  23. Top-k query processing: list top-2 students for each class
     Input:
       student  class  score
       1        b      70
       2        a      80
       3        a      90
       4        b      50
       5        a      70
       6        b      60
     Output:
       student  class  score
       3        a      90
       2        a      80
       1        b      70
       6        b      60
  24. Top-k query processing: list top-2 students for each class (same input table), using rank()
       SELECT * FROM (
         SELECT
           *,
           rank() over (partition by class order by score desc) as rank
         FROM table
       ) t
       WHERE rank <= 2
  25. Top-k query processing: list top-2 students for each class (same input table), using each_top_k
       SELECT
         each_top_k(
           2, class, score,
           class, student
         ) as (rank, score, class, student)
       FROM (
         SELECT * FROM table
         DISTRIBUTE BY class SORT BY class
       ) t
  26. Top-k query processing by RANK OVER()
     On each node: partition by class → sort by class, score → rank over() → filter rank <= 2
  27. Top-k query processing by EACH_TOP_K
     On each node: distribute by class → sort by class → each_top_k → output only K items
  28. Comparison between RANK and EACH_TOP_K
     • RANK: sort by class, score (sorting is heavy) → rank over() (needs to process all rows) → filter rank <= 2
     • EACH_TOP_K: distribute by class → sort by class → each_top_k → output only K items
     each_top_k is very efficient where the number of classes is large. A bounded priority queue is utilized.
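The bounded priority queue idea behind each_top_k can be sketched outside Hive. The following Python is a minimal illustration, not Hivemall's implementation: the function name merely mirrors the UDF, the per-class dictionary of heaps replaces Hive's "DISTRIBUTE BY class SORT BY class" streaming, and the tie handling (a later row with an equal score does not displace an earlier one) is an assumption.

```python
import heapq
from collections import defaultdict


def each_top_k(k, rows):
    """Return (rank, score, class, student) for the k best scores per class.

    rows: iterable of (student, class, score).
    A min-heap of at most k entries is kept per class, so memory stays
    bounded by k per class and the full input is never sorted.
    """
    heaps = defaultdict(list)  # class -> min-heap of (score, student)
    for student, cls, score in rows:
        heap = heaps[cls]
        if len(heap) < k:
            heapq.heappush(heap, (score, student))
        elif score > heap[0][0]:
            # New score beats the current k-th best: evict the smallest.
            heapq.heapreplace(heap, (score, student))
    out = []
    for cls in sorted(heaps):
        ranked = sorted(heaps[cls], reverse=True)  # sorts only k items
        for rank, (score, student) in enumerate(ranked, start=1):
            out.append((rank, score, cls, student))
    return out


# The student table from the slides.
rows = [(1, "b", 70), (2, "a", 80), (3, "a", 90),
        (4, "b", 50), (5, "a", 70), (6, "b", 60)]
result = each_top_k(2, rows)
for row in result:
    print(row)
```

Each input row costs at most O(log k), which is why this approach scales when there are many classes and only a few items per class are wanted.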
  29. Performance reported by TD customer
     • 1,000 students in each class
     • 20 million classes
     The RANK over() query does not finish within 24 hours; EACH_TOP_K finishes in 2 hours.
     For details, see https://speakerdeck.com/kaky0922/hivemall-meetup-20160908
  30. Other Supported Algorithms
     Anomaly Detection
       ✓ Local Outlier Factor (LoF)
     Feature Engineering
       ✓ Feature Hashing
       ✓ Feature Scaling (normalization, z-score)
       ✓ TF-IDF vectorizer
       ✓ Polynomial Expansion (Feature Pairing)
       ✓ Amplifier
     NLP
       ✓ Basic English text Tokenizer
       ✓ Japanese Tokenizer (Kuromoji)
  31. Industry use cases of Hivemall
     • CTR prediction of Ad click logs
       - Algorithm: Logistic regression
       - Freakout Inc., Smartnews, and more
     • Gender prediction of Ad click logs
       - Algorithm: Classification
       - Scaleout Inc.
     http://www.slideshare.net/eventdotsjp/hivemall
  32. Industry use cases of Hivemall (continued)
     • Item/User recommendation
       - Algorithm: Recommendation
       - Wish.com, GMO pepabo (minne.com)
  33. Industry use cases of Hivemall (continued)
     • Value prediction of Real estates
       - Algorithm: Regression
       - Livesense
  34. Industry use cases of Hivemall (continued)
     • User score calculation
       - Algorithm: Regression
       - Klout (klout.com, influencer marketing), bit.ly/klout-hivemall
  35. Churn Detection of Monthly Payment Service
     OISIX, a leading food delivery service company in Japan, used Hivemall’s Logistic Regression to get churn probability. The churn rate dropped by almost half after giving gift points to customers predicted to leave.
  36. Agenda
     1. What is Hivemall (introduction)
     2. How to use Hivemall
     3. Roadmap and coming new features
  37. How to use Hivemall: Data preparation
     [Diagram: Training (machine learning) turns labeled feature vectors into a prediction model; Prediction applies the model to feature vectors and outputs labels.]
  38. How to use Hivemall - Data preparation
     Define a Hive table for training/testing data
       Create external table e2006tfidf_train (
         rowid int,
         label float,
         features ARRAY<STRING>
       )
       ROW FORMAT DELIMITED
         FIELDS TERMINATED BY '\t'
         COLLECTION ITEMS TERMINATED BY ","
       STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
  39. How to use Hivemall
  40. How to use Hivemall: Feature Engineering
     [Same training/prediction diagram as in the data preparation step.]
  41. How to use Hivemall - Feature Engineering
     Applying a Min-Max Feature Normalization: transforming a label value to a value between 0.0 and 1.0
       create view e2006tfidf_train_scaled
       as
       select
         rowid,
         rescale(target,${min_label},${max_label}) as label,
         features
       from
         e2006tfidf_train;
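Min-max normalization as used in the view above maps a value v to (v - min) / (max - min). A minimal Python sketch of that formula follows; the function name mirrors the UDF only for readability, and returning 0.5 when min equals max is an assumption of this sketch.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map value from [min_value, max_value] to [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption of this sketch: a degenerate range maps to the midpoint.
        return 0.5
    return (value - min_value) / (max_value - min_value)


labels = [-3.0, -1.0, 1.0]
lo, hi = min(labels), max(labels)
scaled = [rescale(v, lo, hi) for v in labels]
print(scaled)
```

In the Hive query, ${min_label} and ${max_label} play the role of lo and hi and have to be computed beforehand over the training labels.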
  42. How to use Hivemall: Training
     [Same training/prediction diagram as in the data preparation step.]
  43. How to use Hivemall - Training
     Training by logistic regression
       CREATE TABLE lr_model AS
       SELECT
         feature,
         avg(weight) as weight
       FROM (
         SELECT logress(features,label,..) as (feature,weight)
         FROM train
       ) t
       GROUP BY feature
     • The inner SELECT is a map-only task that learns a prediction model
     • Map outputs are shuffled to reducers by feature
     • Reducers perform model averaging in parallel
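The map/shuffle/reduce flow of this training query can be imitated in plain Python. The sketch below is an illustration under stated assumptions, not Hivemall's logress: each "mapper" runs a simple SGD logistic regression on its own shard (the learning rate, epoch count, and function names are hypothetical), and the "reducer" averages the emitted weights per feature, as avg(weight) ... GROUP BY feature does.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_shard(rows, eta=0.1, epochs=20):
    """Mapper side: learn weights on one shard, emit (feature, weight) pairs.

    rows: list of (features, label); features is {name: value}, label is 0 or 1.
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            p = sigmoid(sum(w[f] * v for f, v in features.items()))
            for f, v in features.items():
                w[f] += eta * (label - p) * v  # gradient step on the logistic loss
    return list(w.items())


def average_models(shard_outputs):
    """Reducer side: average the weights emitted for each feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for pairs in shard_outputs:
        for feature, weight in pairs:
            sums[feature] += weight
            counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}


shards = [
    [({"a": 1.0}, 1), ({"b": 1.0}, 0)],
    [({"a": 1.0}, 1), ({"b": 1.0}, 0), ({"c": 1.0}, 1)],
]
model = average_models(train_shard(shard) for shard in shards)
print(model)
```

Because every shard is trained independently, the mapper step parallelizes trivially; only the small (feature, weight) pairs are shuffled.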
  44. How to use Hivemall - Training
     Training of the Confidence Weighted (CW) classifier
       CREATE TABLE news20b_cw_model1 AS
       SELECT
         feature,
         voted_avg(weight) as weight
       FROM
         (SELECT
            train_cw(features,label) as (feature,weight)
          FROM
            news20b_train
         ) t
       GROUP BY feature
     voted_avg votes on whether to use the negative or the positive weights for the average, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
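The voting idea described on this slide can be sketched as follows. This is an illustration based only on the slide's one-line description: the majority sign wins and only the weights with that sign are averaged. The tie-breaking rule (negatives win a tie) and the return value for empty input are assumptions of this sketch.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: on a tie, the negative side is used.
    winners = positives if len(positives) > len(negatives) else negatives
    if not winners:
        return 0.0  # assumption: no signed weights at all
    return sum(winners) / len(winners)


# The example weights from the slide: four positive votes against one negative.
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
print(result)  # approximately 0.475, the mean of the four positive weights
```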
  45. How to use Hivemall: Prediction
     [Same training/prediction diagram as in the data preparation step.]
  46. How to use Hivemall - Prediction
       CREATE TABLE lr_predict
       as
       SELECT
         t.rowid,
         sigmoid(sum(m.weight)) as prob
       FROM
         testing_exploded t LEFT OUTER JOIN
         lr_model m ON (t.feature = m.feature)
       GROUP BY
         t.rowid
     • Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model
     • No need to load the entire model into memory
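What the join-based prediction computes can be shown with a small Python sketch: for each rowid, look up the weight of every feature the row contains, sum the weights, and apply the sigmoid. The function name and sample data are hypothetical. One difference is an assumption of this sketch: a row whose features are all missing from the model gets probability 0.5 here, whereas the SQL sum over only NULLs would yield NULL.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(testing_exploded, model):
    """testing_exploded: iterable of (rowid, feature); model: {feature: weight}."""
    sums = defaultdict(float)
    for rowid, feature in testing_exploded:
        # LEFT OUTER JOIN: a feature absent from the model adds nothing to the sum.
        sums[rowid] += model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in sums.items()}


model = {"a": 2.0, "b": -1.0}
testing_exploded = [(1, "a"), (1, "b"), (2, "b"), (2, "unseen")]
pred = predict(testing_exploded, model)
print(pred)
```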
  47. Real-time prediction
     [Diagram: batch training on Hadoop produces a prediction model; the model is exported, and online prediction runs on an RDBMS.]
     bit.ly/hivemall-rtp
  48. Export Prediction Model to a RDBMS
     • TD export works with any RDBMS
     • Periodical export is very easy in Treasure Data
     Prediction Model:
       103  -0.4896543622016907
       104  -0.0955817922949791
       105   0.12560302019119263
       106   0.09214721620082855
  49. Real-time Prediction on MySQL
       SELECT
         sigmoid(sum(t.value * m.weight)) as prob
       FROM
         testing_exploded t LEFT OUTER JOIN
         prediction_model m ON (t.feature = m.feature)
     Index lookups are very efficient in RDBMSs!
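The exported-model lookup can be tried locally with Python's built-in sqlite3 module standing in for MySQL. This is only a sketch: the table layouts, the reading of the exported rows as (feature, weight) pairs, and the test row are assumptions, and the sigmoid is applied in Python because it is not a built-in SQL function here.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed layout of the exported model: one row per feature, indexed by primary key.
conn.execute("CREATE TABLE prediction_model (feature INTEGER PRIMARY KEY, weight REAL)")
conn.executemany(
    "INSERT INTO prediction_model VALUES (?, ?)",
    [(103, -0.4896543622016907), (104, -0.0955817922949791),
     (105, 0.12560302019119263), (106, 0.09214721620082855)],
)

# One hypothetical test example, exploded into (feature, value) rows.
conn.execute("CREATE TABLE testing_exploded (feature INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO testing_exploded VALUES (?, ?)",
    [(103, 1.0), (105, 2.0), (999, 1.0)],  # feature 999 is not in the model
)

row = conn.execute(
    """
    SELECT sum(t.value * m.weight)
    FROM testing_exploded t
    LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
    """
).fetchone()
score = row[0]  # unmatched features yield NULL products, which sum() skips
prob = 1.0 / (1.0 + math.exp(-score))
print(score, prob)
```

Each joined row is resolved through the primary-key index on feature, which is the index lookup the slide refers to.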
  50. Online Prediction by Apache Streaming
  51. RandomForest in Hivemall: Ensemble of Decision Trees
  52. Training of RandomForest
  53. Prediction of RandomForest
  54. Agenda
     1. What is Hivemall (introduction)
     2. How to use Hivemall
     3. Roadmap and coming new features
  55. Roadmap
     • IP clearance and project/repository site setup
     • Create contribution guidelines
     • Move repository from GitHub to ASF
     • Add more tests and documentation
     • Initial Apache release will be in Dec or Jan
  56. Anomaly/Change-point Detection by ChangeFinder
     Efficient algorithm for finding change points and outliers from time series data
     J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
  57. Anomaly/Change-point Detection by ChangeFinder (continued)
  58. Change-point detection by Singular Spectrum Transformation
     Fewer hyperparameters than ChangeFinder
     T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
     T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
  59. Evaluation Metrics
  60. Feature Engineering: Feature Binning
     Maps quantitative variables to a fixed number of bins based on quantiles/distribution. Example: map ages into 3 bins.
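Quantile-based binning, as in the "ages into 3 bins" example, can be sketched with the Python standard library. The helper names and the sample ages are hypothetical, and the exact quantile definition (the default of statistics.quantiles) is an assumption; this is not Hivemall's implementation.

```python
import bisect
import statistics


def quantile_cut_points(values, num_bins):
    """Return num_bins - 1 cut points that split values into equal-frequency bins."""
    return statistics.quantiles(values, n=num_bins)


def to_bin(value, cut_points):
    """Return the 0-based index of the bin that value falls into."""
    return bisect.bisect_right(cut_points, value)


ages = [18, 22, 25, 31, 38, 45, 52, 60, 71]
cut_points = quantile_cut_points(ages, 3)
bins = [to_bin(age, cut_points) for age in ages]
print(cut_points, bins)
```

With these nine ages, the two cut points separate the three youngest, the three middle, and the three oldest values.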
  61. Feature Selection: Signal Noise Ratio
  62. Feature Selection: Chi-Square
  63. Feature Transformation: Onehot encoding
     Maps a categorical variable to a unique number starting from 1
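The mapping described here, one unique number per category starting from 1, can be sketched in a few lines of Python. The function name and the sample data are hypothetical, and numbering categories in order of first appearance is an assumption of this sketch.

```python
def build_category_index(values):
    """Assign each distinct category a unique integer, starting from 1."""
    index = {}
    for value in values:
        if value not in index:
            index[value] = len(index) + 1  # next unused number
    return index


colors = ["red", "green", "red", "blue"]
index = build_category_index(colors)
encoded = [index[c] for c in colors]
print(index, encoded)
```

Each resulting number can then serve as the feature index of a sparse one-hot vector.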
  64. Other new features to come
     ✓ Spark 2.0 Dataframe support
     ✓ XGBoost Integration
     ✓ Field-aware Factorization Machines
     ✓ Generalized Linear Model
       • Optimizer framework including ADAM
       • L1/L2 regularization
  65. Conclusion and Takeaway
     Hivemall is a machine learning library that is …
     • Multi/Cross platform
     • Versatile
     • Scalable
     • Ease-of-use
     Hivemall’s Positioning
     • For Data Engineers who need ML
     • Deep Learning is out of scope
     • Recommendation is high-priority for us
     We welcome your contributions to Apache Hivemall: hivemall.incubator.apache.org
  66. Any questions or comments?
