SlideShare a Scribd company logo
Hivemall meets DigDag
Machine Learning Pipeline by SQL queries
Research Engineer, Treasure Data
Makoto YUI @myui
@ApacheHivemall
12018/2/17 HackerTackle
Ø 2015.04~ Research Engineer at Treasure Data, Inc.
• My mission is developing ML-as-a-Service in a Hadoop-as-a-service company
Ø 2010.04-2015.03 Senior Researcher at National Institute of Advanced
Industrial Science and Technology, Japan.
• Developed Hivemall as a personal research project
Ø 2009.03 Ph.D. in Computer Science from NAIST
• Majored in Parallel Data Processing, not ML then
Ø Visiting scholar in CWI, Amsterdam and Univ. Edinburgh
About me …
2018/2/17 HackerTackle 2
•
•
let $succ := function($x) { $x+1 } return (for $i in (10,20,30) return $succ($i))
slideshare.net/myui/icde2010-nbgclock
About me …
2018/2/17 HackerTackle 3
ü Ocaml (for/let, type inference)
ü Lisp (every object is a sequence/atomization)
ü XPath
influenced by
2018/2/17 HackerTackle 4
We Open-source! TD invented ..
Streaming log collector Bulk data import/export Efficient binary serialization
Machine learning on Hadoop Workflow EngineEmbedded version of Fluentd
Plan of the talk
1. Introduction to Hivemall
2. ML workflow using Digdag
2018/2/17 HackerTackle 5
Hivemall entered Apache Incubator
on Sept 13, 2016
Since then, we invited 3 contributors as new committers (a
committer has been voted as PPMC). Currently, we are working
toward the first Apache release (v0.5.0).
hivemall.incubator.apache.org
62018/2/17 HackerTackle
2018/2/17 HackerTackle 7
2018/2/17 HackerTackle
Industry use cases of Hivemall
Ø T-mobile.au
Ø Klout – influencer marketing
bit.ly/klout-hivemall
bit.ly/2whJCQj
Ø Subaru
8
https://www.treasuredata.co.jp/customers/subaru/
Ø CTR prediction of Ad click logs
• Freakout Inc., Fan communication, and more
• Replaced Spark MLlib w/ Hivemall at company X
Industry use cases of Hivemall
9
http://www.slideshare.net/masakazusano75/sano-hmm-20150512
2018/2/17 HackerTackle
2018/2/17 HackerTackle 10
Industry use cases of Hivemall
Minne (Japanese version of Etsy.com) uses Hivemall for Item
recommendation
https://speakerdeck.com/monochromegane/pepabo-minne-matrix-factorization-in-hivemall
11
ØGender prediction of Ad click logs
•Scaleout Inc. and Fan commutations
http://eventdots.jp/eventreport/458208
Industry use cases of Hivemall
2018/2/17 HackerTackle
12
Industry use cases of Hivemall
Ø Value prediction of Real estates
•Livesense
http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall
2018/2/17 HackerTackle
13
ØChurn Detection
•OISIX
Industry use cases of Hivemall
http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix
2018/2/17 HackerTackle
Web
Mobile
User attributes
User action log
Claim histories
Referrers
Services used
Direct countermeasure
In-direct countermeasure
Giving points Call to care
Guide to SuccessUI Change
Data used for Prediction
Find customers likely to
churn using Hivemall
Feedback
Loop
Customers
likely to leave
What is Apache Hivemall
Scalable machine learning library built
as a collection of Hive UDFs
Multi/Cross
platform
VersatileScalableEase-of-use
142018/2/17 HackerTackle
Hivemall is easy and scalable …
ML made easy for SQL developers
Born to be parallel and scalable
Ease-of-use
Scalable
100+ lines
of code
CREATE TABLE lr_model AS
SELECT
feature, -- reducers perform model averaging in parallel
avg(weight) as weight
FROM (
SELECT logress(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This query automatically runs in parallel on Hadoop
152018/2/17 HackerTackle
Hivemall is a multi/cross-platform ML library
HiveQL SparkSQL/Dataframe API Pig Latin
Hivemall is Multi/Cross platform ..
Multi/Cross
platform
prediction models built by Hive can be used from Spark, and conversely,
prediction models build by Spark can be used from Hive
162018/2/17 HackerTackle
Hadoop HDFS
MapReduce
(MRv1)
Hivemall
Apache YARN
Apache Tez
DAG processing
Machine Learning
Query Processing
Parallel Data
Processing Framework
Resource Management
Distributed File System
Cloud Storage
SparkSQL
Apache Spark
MESOS
Hive Pig
MLlib
Hivemall’s Technology Stack
Amazon S3
172018/2/17 HackerTackle
Hivemall on Apache Hive
182018/2/17 HackerTackle
Hivemall on Apache Spark Dataframe
192018/2/17 HackerTackle
Hivemall on SparkSQL
202018/2/17 HackerTackle
Hivemall on Apache Pig
212018/2/17 HackerTackle
Online Prediction by Apache Streaming
222018/2/17 HackerTackle
23
Generic Classifier/Regressor
OLD Style New Style from v0.5.0
2018/2/17 HackerTackle
24
•Squared Loss
•Quantile Loss
•Epsilon Insensitive Loss
•Squared Epsilon Insensitive
Loss
•Huber Loss
Generic Classifier/Regressor
Available Loss functions
•HingeLoss
•LogLoss (synonym: logistic)
•SquaredHingeLoss
•ModifiedHuberLoss
• L1
• L2
• ElasticNet
• RDA
Other options
For Binary Classification:
For Regression:
• SGD
• AdaGrad
• AdaDelta
• ADAM
Optimizer
• Iteration support
• mini-batch
• Early stopping
Regularization
2018/2/17 HackerTackle
Versatile
Hivemall is a Versatile library ..
ü Not only for Machine Learning
ü provides a bunch of generic utility functions
Each organization has own sets of
UDFs for data preprocessing
Don’t Repeat Yourself!
Don’t Repeat Yourself!
252018/2/17 HackerTackle
Hivemall generic functions
Array and Map Bit and compress String and NLP
Brickhouse UDFs are merged in v0.5.2 release.
We welcome contributing your generic UDFs to Hivemall
Geo Spatial
Top-k processing
> TF/IDF
> TILE
> MAP_URL
262018/2/17 HackerTackle
2018/2/17 HackerTackle
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
List top-2 students for each class
SELECT * FROM (
SELECT
*,
rank() over (partition by class order by score desc)
as rank
FROM table
) t
WHERE rank <= 2
RANK over() query does not finishes in 24 hours L
where 20 million MOOCs classes and avg 1,000 students in each classes
27
2018/2/17 HackerTackle
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
List top-2 students for each class
SELECT
each_top_k(
2, class, score,
class, student
) as (rank, score, class, student)
FROM (
SELECT * FROM table
DISTRIBUTE BY class SORT BY class
) t
EACH_TOP_K finishes in 2 hours J
28
Map tiling functions
292018/2/17 HackerTackle
Tile(lat,lon,zoom)
= xtile(lon,zoom) + ytile(lat,zoom) * 2^n
Map tiling functions
Zoom=10
Zoom=15
302018/2/17 HackerTackle
31
SELECT count(distinct id) FROM data
More useful functions (Sketch, NLP)
SELECT approx_count_distinct(id) FROM data
select tokenize_ja(“ ",
"normal", null, null, "https://s3.amazonaws.com/td-
hivemall/dist/kuromoji-user-dict-neologd.csv.gz");
[“ ”, "," "," "]
2018/2/17 HackerTackle
List of Supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight
Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓Logistic Regression (SGD)
✓AdaGrad (logistic loss)
✓AdaDELTA (logistic loss)
✓PA Regression
✓AROW Regression
✓Factorization Machines
✓RandomForest Regression
SCW is a good first choice
Try RandomForest if SCW does not
work
Logistic regression is good for getting a
probability of a positive class
Factorization Machines is good where
features are sparse and categorical ones
322018/2/17 HackerTackle
RandomForest in Hivemall
Ensemble of Decision Trees
332018/2/17 HackerTackle
Training of RandomForest
34
Sparse Vector Input (Libsvm format) is
supported since v0.5.0 in addition Dense
Vector!
2018/2/17 HackerTackle
Prediction of RandomForest
352018/2/17 HackerTackle
36
Decision Tree Visualization
2018/2/17 HackerTackle
37
Decision Tree Visualization
2018/2/17 HackerTackle
38
SELECT train_xgboost_classifier(features, label) as (model_id, model)
FROM training_data
XGBoost support in Hivemall (beta version)
SELECT rowed, AVG(predicted) as predicted
FROM (
-- predict with each model
SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
-- join each test record with each model
FROM xgboost_models CROSS JOIN test_data_with_id
) t
GROUP BY rowid;
2018/2/17 HackerTackle
Supported Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash
(LSH variant)
✓ Similarity Search on Vector Space
(Euclid/Cosine/Jaccard/Angular)
✓ DIMSUM (Cosine similarity)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines
(regression)
each_top_k function of Hivemall is useful for
recommending top-k items
392018/2/17 HackerTackle
Other Supported Algorithms
Feature Engineering
✓Feature Hashing
✓Feature Scaling
(normalization, z-score)
✓ Feature Binning
✓ TF-IDF vectorizer
✓ Polynomial Expansion
✓ Amplifier
NLP
✓Basic Englist text Tokenizer
✓Japanese Tokenizer
Evaluation metrics
✓AUC, nDCG, logloss, precision
recall@K, and etc
402018/2/17 HackerTackle
Evaluation Metrics
412018/2/17 HackerTackle
Other Supported Features
Anomaly Detection
✓Local Outlier Factor (LoF)
✓ChangeFinder
Clustering / Topic models
✓Online mini-batch LDA
✓Online mini-batch PLSA
Change Point Detection
✓ChangeFinder
✓Singular Spectrum
Transformation
422018/2/17 HackerTackle
Efficient algorithm for finding change point and outliers from
time-series data
J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE transactions on
Knowledge and Data Engineering, pp.482-492, 2006.
Anomaly/Change-point Detection by ChangeFinder
432018/2/17 HackerTackle
Take this…
Anomaly/Change-point Detection by ChangeFinder
442018/2/17 HackerTackle
Anomaly/Change-point Detection by ChangeFinder
…and do this!
452018/2/17 HackerTackle
Efficient algorithm for finding change point and outliers from
timeseries data
Anomaly/Change-point Detection by ChangeFinder
J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE transactions on
Knowledge and Data Engineering, pp.482-492, 2006.
462018/2/17 HackerTackle
ü Word2Vec support
ü Multi-class Logistic Regression
ü Field-aware Factorization Machines
ü SLIM recommendation
ü More efficient XGBoost support
ü LightGBM support
ü DecisionTree prediction tracing
ü Gradient Boosting
Future work for v0.5.2 and later
47
PR#91
PR#116
PR#58
PR#111
2018/2/17 HackerTackle
48
ML workflows often be really complex…
2018/2/17 HackerTackle 49
Real-world ML pipelines (could be more complex)
Join
Extract Feature
Datasource
#1
Datasource
#2
Datasource
#3
Extract Feature
Feature Scaling
Feature Hashing
Feature Engineering
Feature Selection
Train by
Logistic Regression
Train by
RandomForest
Train by
Factorization Machines
Ensemble
Evaluate
Predict
502018/2/17 HackerTackle
Hivemall Digdag
Technology Trends for 2017
2018/2/17 HackerTackle 51
https://www.thoughtworks.com/radar
2018/2/17 HackerTackle 52
Why Digdag?
Ø Manage workflows by codes (simple YAML syntax)
Ø REST API endpoints
Ø Parallel/Sequential execution
Ø SLA, error notification
Ø Secrets Managing
Ø Docker support
Ø TD, EMR, Bigquery/Slack operators
Ø Embedded Javascript engine
Programmer Friendly, Revision management
Plugin scheme for defining custom operator
2018/2/17 HackerTackle 53
Digdag features
SLA and error handling Nestable, Parallel/Sequential Execution
Embedded Javascript engine
542018/2/17 HackerTackle
Machine Learning Workflow using Digdag
552018/2/17 HackerTackle
Machine Learning Workflow using Digdag
2018/2/17 HackerTackle 56
Use case: CTR/CVR prediction
2018/2/17 HackerTackle 57
Workflow execution timeline
DEMO
Conclusion and Takeaway
Hivemall is a multi/cross-platform ML library
providing a collection of machine learning algorithms as Hive UDFs/UDTFs
The first Apache release (v0.5.0) will appear soon!
We welcome your contributions to Apache Hivemall J
582018/2/17 HackerTackle
Digdag is a great workflow engine for managing complex ML pipelines
Any feature request or questions?
592018/2/17 HackerTackle

More Related Content

What's hot

Spark ML Pipeline serving
Spark ML Pipeline servingSpark ML Pipeline serving
Spark ML Pipeline serving
Stepan Pushkarev
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Databricks
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Anyscale
 
Advanced Neo4j Use Cases with the GraphAware Framework
Advanced Neo4j Use Cases with the GraphAware FrameworkAdvanced Neo4j Use Cases with the GraphAware Framework
Advanced Neo4j Use Cases with the GraphAware Framework
Michal Bachman
 
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Databricks
 
Hydrosphere.io Platform for AI/ML Operations Automation
Hydrosphere.io Platform for AI/ML Operations AutomationHydrosphere.io Platform for AI/ML Operations Automation
Hydrosphere.io Platform for AI/ML Operations Automation
Rustem Zakiev
 
Machine learning pipeline with spark ml
Machine learning pipeline with spark mlMachine learning pipeline with spark ml
Machine learning pipeline with spark ml
datamantra
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
Itzhak Kameli
 
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger KingContext-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
Databricks
 
AI and ML 101
AI and ML 101AI and ML 101
AI and ML 101
Rustem Zakiev
 
The Quest for an Open Source Data Science Platform
 The Quest for an Open Source Data Science Platform The Quest for an Open Source Data Science Platform
The Quest for an Open Source Data Science Platform
QAware GmbH
 
Serverless machine learning operations
Serverless machine learning operationsServerless machine learning operations
Serverless machine learning operations
Stepan Pushkarev
 
Scalable Automatic Machine Learning in H2O
Scalable Automatic Machine Learning in H2OScalable Automatic Machine Learning in H2O
Scalable Automatic Machine Learning in H2O
Sri Ambati
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorch
Databricks
 
“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOps“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOps
Rui Quintino
 
BSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 Sessions
BigML, Inc
 
R4ML: An R Based Scalable Machine Learning Framework
R4ML: An R Based Scalable Machine Learning FrameworkR4ML: An R Based Scalable Machine Learning Framework
R4ML: An R Based Scalable Machine Learning Framework
Alok Singh
 
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at ScaleData Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Databricks
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Databricks
 
Data Warehousing with Spark Streaming at Zalando
Data Warehousing with Spark Streaming at ZalandoData Warehousing with Spark Streaming at Zalando
Data Warehousing with Spark Streaming at Zalando
Databricks
 

What's hot (20)

Spark ML Pipeline serving
Spark ML Pipeline servingSpark ML Pipeline serving
Spark ML Pipeline serving
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
Advanced Neo4j Use Cases with the GraphAware Framework
Advanced Neo4j Use Cases with the GraphAware FrameworkAdvanced Neo4j Use Cases with the GraphAware Framework
Advanced Neo4j Use Cases with the GraphAware Framework
 
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
 
Hydrosphere.io Platform for AI/ML Operations Automation
Hydrosphere.io Platform for AI/ML Operations AutomationHydrosphere.io Platform for AI/ML Operations Automation
Hydrosphere.io Platform for AI/ML Operations Automation
 
Machine learning pipeline with spark ml
Machine learning pipeline with spark mlMachine learning pipeline with spark ml
Machine learning pipeline with spark ml
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
 
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger KingContext-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
 
AI and ML 101
AI and ML 101AI and ML 101
AI and ML 101
 
The Quest for an Open Source Data Science Platform
 The Quest for an Open Source Data Science Platform The Quest for an Open Source Data Science Platform
The Quest for an Open Source Data Science Platform
 
Serverless machine learning operations
Serverless machine learning operationsServerless machine learning operations
Serverless machine learning operations
 
Scalable Automatic Machine Learning in H2O
Scalable Automatic Machine Learning in H2OScalable Automatic Machine Learning in H2O
Scalable Automatic Machine Learning in H2O
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorch
 
“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOps“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOps
 
BSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 Sessions
 
R4ML: An R Based Scalable Machine Learning Framework
R4ML: An R Based Scalable Machine Learning FrameworkR4ML: An R Based Scalable Machine Learning Framework
R4ML: An R Based Scalable Machine Learning Framework
 
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at ScaleData Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
 
Data Warehousing with Spark Streaming at Zalando
Data Warehousing with Spark Streaming at ZalandoData Warehousing with Spark Streaming at Zalando
Data Warehousing with Spark Streaming at Zalando
 

Similar to Hivemall meets Digdag @Hackertackle 2018-02-17

Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
Makoto Yui
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
Makoto Yui
 
University of Applied Science Esslingen @ Scilab Conference 2018
University of Applied Science Esslingen @ Scilab Conference 2018University of Applied Science Esslingen @ Scilab Conference 2018
University of Applied Science Esslingen @ Scilab Conference 2018
Scilab
 
Hivemall Talk at TD tech talk #3
Hivemall Talk at TD tech talk #3Hivemall Talk at TD tech talk #3
Hivemall Talk at TD tech talk #3
Makoto Yui
 
Introduction to Hivemall
Introduction to HivemallIntroduction to Hivemall
Introduction to Hivemall
Treasure Data, Inc.
 
Jfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocksJfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocks
Jim Dowling
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Jason Dai
 
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
Bart Vandewoestyne
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
GoDataDriven
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
Srivatsan Ramanujam
 
Arcelormittal @ Scilab Conference 2018
Arcelormittal @ Scilab Conference 2018Arcelormittal @ Scilab Conference 2018
Arcelormittal @ Scilab Conference 2018
Scilab
 
Open Source AI - News and examples
Open Source AI - News and examplesOpen Source AI - News and examples
Open Source AI - News and examples
Luciano Resende
 
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3
DataWorks Summit
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
Sanket Shikhar
 
Brainomics - CrEDIBLE 2013
Brainomics - CrEDIBLE 2013Brainomics - CrEDIBLE 2013
Brainomics - CrEDIBLE 2013
Vincent Michel
 
BRAINOMICS A management system for exploring and merging heterogeneous brain ...
BRAINOMICS A management system for exploring and merging heterogeneous brain ...BRAINOMICS A management system for exploring and merging heterogeneous brain ...
BRAINOMICS A management system for exploring and merging heterogeneous brain ...
Logilab
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
Seldon
 
Mulvery Detail - English
Mulvery Detail - EnglishMulvery Detail - English
Mulvery Detail - English
Daichi Teruya
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Srivatsan Ramanujam
 
Inteligencia artificial, open source e IBM Call for Code
Inteligencia artificial, open source e IBM Call for CodeInteligencia artificial, open source e IBM Call for Code
Inteligencia artificial, open source e IBM Call for Code
Luciano Resende
 

Similar to Hivemall meets Digdag @Hackertackle 2018-02-17 (20)

Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
University of Applied Science Esslingen @ Scilab Conference 2018
University of Applied Science Esslingen @ Scilab Conference 2018University of Applied Science Esslingen @ Scilab Conference 2018
University of Applied Science Esslingen @ Scilab Conference 2018
 
Hivemall Talk at TD tech talk #3
Hivemall Talk at TD tech talk #3Hivemall Talk at TD tech talk #3
Hivemall Talk at TD tech talk #3
 
Introduction to Hivemall
Introduction to HivemallIntroduction to Hivemall
Introduction to Hivemall
 
Jfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocksJfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocks
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
 
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
 
Arcelormittal @ Scilab Conference 2018
Arcelormittal @ Scilab Conference 2018Arcelormittal @ Scilab Conference 2018
Arcelormittal @ Scilab Conference 2018
 
Open Source AI - News and examples
Open Source AI - News and examplesOpen Source AI - News and examples
Open Source AI - News and examples
 
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Brainomics - CrEDIBLE 2013
Brainomics - CrEDIBLE 2013Brainomics - CrEDIBLE 2013
Brainomics - CrEDIBLE 2013
 
BRAINOMICS A management system for exploring and merging heterogeneous brain ...
BRAINOMICS A management system for exploring and merging heterogeneous brain ...BRAINOMICS A management system for exploring and merging heterogeneous brain ...
BRAINOMICS A management system for exploring and merging heterogeneous brain ...
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
 
Mulvery Detail - English
Mulvery Detail - EnglishMulvery Detail - English
Mulvery Detail - English
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
Inteligencia artificial, open source e IBM Call for Code
Inteligencia artificial, open source e IBM Call for CodeInteligencia artificial, open source e IBM Call for Code
Inteligencia artificial, open source e IBM Call for Code
 

More from Makoto Yui

Apache Hivemall and my OSS experience
Apache Hivemall and my OSS experienceApache Hivemall and my OSS experience
Apache Hivemall and my OSS experience
Makoto Yui
 
Introduction to Apache Hivemall v0.5.2 and v0.6
Introduction to Apache Hivemall v0.5.2 and v0.6Introduction to Apache Hivemall v0.5.2 and v0.6
Introduction to Apache Hivemall v0.5.2 and v0.6
Makoto Yui
 
Idea behind Apache Hivemall
Idea behind Apache HivemallIdea behind Apache Hivemall
Idea behind Apache Hivemall
Makoto Yui
 
What's new in Hivemall v0.5.0
What's new in Hivemall v0.5.0What's new in Hivemall v0.5.0
What's new in Hivemall v0.5.0
Makoto Yui
 
What's new in Apache Hivemall v0.5.0
What's new in Apache Hivemall v0.5.0What's new in Apache Hivemall v0.5.0
What's new in Apache Hivemall v0.5.0
Makoto Yui
 
Revisiting b+-trees
Revisiting b+-treesRevisiting b+-trees
Revisiting b+-trees
Makoto Yui
 
Apache Hivemall @ Apache BigData '17, Miami
Apache Hivemall @ Apache BigData '17, MiamiApache Hivemall @ Apache BigData '17, Miami
Apache Hivemall @ Apache BigData '17, Miami
Makoto Yui
 
機械学習のデータ並列処理@第7回BDI研究会
機械学習のデータ並列処理@第7回BDI研究会機械学習のデータ並列処理@第7回BDI研究会
機械学習のデータ並列処理@第7回BDI研究会
Makoto Yui
 
Podling Hivemall in the Apache Incubator
Podling Hivemall in the Apache IncubatorPodling Hivemall in the Apache Incubator
Podling Hivemall in the Apache Incubator
Makoto Yui
 
Dots20161029 myui
Dots20161029 myuiDots20161029 myui
Dots20161029 myui
Makoto Yui
 
Hadoopsummit16 myui
Hadoopsummit16 myuiHadoopsummit16 myui
Hadoopsummit16 myui
Makoto Yui
 
HadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myuiHadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myui
Makoto Yui
 
3rd Hivemall meetup
3rd Hivemall meetup3rd Hivemall meetup
3rd Hivemall meetup
Makoto Yui
 
Recommendation 101 using Hivemall
Recommendation 101 using HivemallRecommendation 101 using Hivemall
Recommendation 101 using Hivemall
Makoto Yui
 
Hivemall dbtechshowcase 20160713 #dbts2016
Hivemall dbtechshowcase 20160713 #dbts2016Hivemall dbtechshowcase 20160713 #dbts2016
Hivemall dbtechshowcase 20160713 #dbts2016
Makoto Yui
 
Introduction to Hivemall
Introduction to HivemallIntroduction to Hivemall
Introduction to Hivemall
Makoto Yui
 
Tdtechtalk20160425myui
Tdtechtalk20160425myuiTdtechtalk20160425myui
Tdtechtalk20160425myui
Makoto Yui
 
Tdtechtalk20160330myui
Tdtechtalk20160330myuiTdtechtalk20160330myui
Tdtechtalk20160330myui
Makoto Yui
 
Datascientistsymp1113
Datascientistsymp1113Datascientistsymp1113
Datascientistsymp1113
Makoto Yui
 
2nd Hivemall meetup 20151020
2nd Hivemall meetup 201510202nd Hivemall meetup 20151020
2nd Hivemall meetup 20151020
Makoto Yui
 

More from Makoto Yui (20)

Apache Hivemall and my OSS experience
Apache Hivemall and my OSS experienceApache Hivemall and my OSS experience
Apache Hivemall and my OSS experience
 
Introduction to Apache Hivemall v0.5.2 and v0.6
Introduction to Apache Hivemall v0.5.2 and v0.6Introduction to Apache Hivemall v0.5.2 and v0.6
Introduction to Apache Hivemall v0.5.2 and v0.6
 
Idea behind Apache Hivemall
Idea behind Apache HivemallIdea behind Apache Hivemall
Idea behind Apache Hivemall
 
What's new in Hivemall v0.5.0
What's new in Hivemall v0.5.0What's new in Hivemall v0.5.0
What's new in Hivemall v0.5.0
 
What's new in Apache Hivemall v0.5.0
What's new in Apache Hivemall v0.5.0What's new in Apache Hivemall v0.5.0
What's new in Apache Hivemall v0.5.0
 
Revisiting b+-trees
Revisiting b+-treesRevisiting b+-trees
Revisiting b+-trees
 
Apache Hivemall @ Apache BigData '17, Miami
Apache Hivemall @ Apache BigData '17, MiamiApache Hivemall @ Apache BigData '17, Miami
Apache Hivemall @ Apache BigData '17, Miami
 
機械学習のデータ並列処理@第7回BDI研究会
機械学習のデータ並列処理@第7回BDI研究会機械学習のデータ並列処理@第7回BDI研究会
機械学習のデータ並列処理@第7回BDI研究会
 
Podling Hivemall in the Apache Incubator
Podling Hivemall in the Apache IncubatorPodling Hivemall in the Apache Incubator
Podling Hivemall in the Apache Incubator
 
Dots20161029 myui
Dots20161029 myuiDots20161029 myui
Dots20161029 myui
 
Hadoopsummit16 myui
Hadoopsummit16 myuiHadoopsummit16 myui
Hadoopsummit16 myui
 
HadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myuiHadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myui
 
3rd Hivemall meetup
3rd Hivemall meetup3rd Hivemall meetup
3rd Hivemall meetup
 
Recommendation 101 using Hivemall
Recommendation 101 using HivemallRecommendation 101 using Hivemall
Recommendation 101 using Hivemall
 
Hivemall dbtechshowcase 20160713 #dbts2016
Hivemall dbtechshowcase 20160713 #dbts2016Hivemall dbtechshowcase 20160713 #dbts2016
Hivemall dbtechshowcase 20160713 #dbts2016
 
Introduction to Hivemall
Introduction to HivemallIntroduction to Hivemall
Introduction to Hivemall
 
Tdtechtalk20160425myui
Tdtechtalk20160425myuiTdtechtalk20160425myui
Tdtechtalk20160425myui
 
Tdtechtalk20160330myui
Tdtechtalk20160330myuiTdtechtalk20160330myui
Tdtechtalk20160330myui
 
Datascientistsymp1113
Datascientistsymp1113Datascientistsymp1113
Datascientistsymp1113
 
2nd Hivemall meetup 20151020
2nd Hivemall meetup 201510202nd Hivemall meetup 20151020
2nd Hivemall meetup 20151020
 

Recently uploaded

一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 

Recently uploaded (20)

一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 

Hivemall meets Digdag @Hackertackle 2018-02-17

  • 1. Hivemall meets DigDag Machine Learning Pipeline by SQL queries Research Engineer, Treasure Data Makoto YUI @myui @ApacheHivemall 12018/2/17 HackerTackle
  • 2. Ø 2015.04~ Research Engineer at Treasure Data, Inc. • My mission is developing ML-as-a-Service in a Hadoop-as-a-service company Ø 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan. • Developed Hivemall as a personal research project Ø 2009.03 Ph.D. in Computer Science from NAIST • Majored in Parallel Data Processing, not ML then Ø Visiting scholar in CWI, Amsterdam and Univ. Edinburgh About me … 2018/2/17 HackerTackle 2 • • let $succ := function($x) { $x+1 } return (for $i in (10,20,30) return $succ($i)) slideshare.net/myui/icde2010-nbgclock
  • 3. About me … 2018/2/17 HackerTackle 3 ü Ocaml (for/let, type inference) ü Lisp (every object is a sequence/atomization) ü XPath influenced by
  • 4. 2018/2/17 HackerTackle 4 We Open-source! TD invented .. Streaming log collector Bulk data import/export Efficient binary serialization Machine learning on Hadoop Workflow EngineEmbedded version of Fluentd
  • 5. Plan of the talk 1. Introduction to Hivemall 2. ML workflow using Digdag 2018/2/17 HackerTackle 5
  • 6. Hivemall entered Apache Incubator on Sept 13, 2016 Since then, we invited 3 contributors as new committers (a committer has been voted as PPMC). Currently, we are working toward the first Apache release (v0.5.0). hivemall.incubator.apache.org 62018/2/17 HackerTackle
  • 8. 2018/2/17 HackerTackle Industry use cases of Hivemall Ø T-mobile.au Ø Klout – influencer marketing bit.ly/klout-hivemall bit.ly/2whJCQj Ø Subaru 8 https://www.treasuredata.co.jp/customers/subaru/
  • 9. Ø CTR prediction of Ad click logs • Freakout Inc., Fan communication, and more • Replaced Spark MLlib w/ Hivemall at company X Industry use cases of Hivemall 9 http://www.slideshare.net/masakazusano75/sano-hmm-20150512 2018/2/17 HackerTackle
  • 10. 2018/2/17 HackerTackle 10 Industry use cases of Hivemall Minne (Japanese version of Etsy.com) uses Hivemall for Item recommendation https://speakerdeck.com/monochromegane/pepabo-minne-matrix-factorization-in-hivemall
  • 11. 11 ØGender prediction of Ad click logs •Scaleout Inc. and Fan commutations http://eventdots.jp/eventreport/458208 Industry use cases of Hivemall 2018/2/17 HackerTackle
  • 12. 12 Industry use cases of Hivemall Ø Value prediction of Real estates •Livesense http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall 2018/2/17 HackerTackle
  • 13. 13 ØChurn Detection •OISIX Industry use cases of Hivemall http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix 2018/2/17 HackerTackle Web Mobile User attributes User action log Claim histories Referrers Services used Direct countermeasure In-direct countermeasure Giving points Call to care Guide to SuccessUI Change Data used for Prediction Find customers likely to churn using Hivemall Feedback Loop Customers likely to leave
  • 14. What is Apache Hivemall Scalable machine learning library built as a collection of Hive UDFs Multi/Cross platform VersatileScalableEase-of-use 142018/2/17 HackerTackle
  • 15. Hivemall is easy and scalable … ML made easy for SQL developers Born to be parallel and scalable Ease-of-use Scalable 100+ lines of code CREATE TABLE lr_model AS SELECT feature, -- reducers perform model averaging in parallel avg(weight) as weight FROM ( SELECT logress(features,label,..) as (feature,weight) FROM train ) t -- map-only task GROUP BY feature; -- shuffled to reducers This query automatically runs in parallel on Hadoop 152018/2/17 HackerTackle
  • 16. Hivemall is a multi/cross-platform ML library HiveQL SparkSQL/Dataframe API Pig Latin Hivemall is Multi/Cross platform .. Multi/Cross platform prediction models built by Hive can be used from Spark, and conversely, prediction models build by Spark can be used from Hive 162018/2/17 HackerTackle
  • 17. Hadoop HDFS MapReduce (MRv1) Hivemall Apache YARN Apache Tez DAG processing Machine Learning Query Processing Parallel Data Processing Framework Resource Management Distributed File System Cloud Storage SparkSQL Apache Spark MESOS Hive Pig MLlib Hivemall’s Technology Stack Amazon S3 172018/2/17 HackerTackle
  • 18. Hivemall on Apache Hive 182018/2/17 HackerTackle
  • 19. Hivemall on Apache Spark Dataframe 192018/2/17 HackerTackle
  • 21. Hivemall on Apache Pig 212018/2/17 HackerTackle
  • 22. Online Prediction by Apache Streaming 222018/2/17 HackerTackle
  • 23. 23 Generic Classifier/Regressor OLD Style New Style from v0.5.0 2018/2/17 HackerTackle
  • 24. 24 •Squared Loss •Quantile Loss •Epsilon Insensitive Loss •Squared Epsilon Insensitive Loss •Huber Loss Generic Classifier/Regressor Available Loss functions •HingeLoss •LogLoss (synonym: logistic) •SquaredHingeLoss •ModifiedHuberLoss • L1 • L2 • ElasticNet • RDA Other options For Binary Classification: For Regression: • SGD • AdaGrad • AdaDelta • ADAM Optimizer • Iteration support • mini-batch • Early stopping Regularization 2018/2/17 HackerTackle
  • 25. Versatile Hivemall is a Versatile library .. ü Not only for Machine Learning ü provides a bunch of generic utility functions Each organization has own sets of UDFs for data preprocessing Don’t Repeat Yourself! Don’t Repeat Yourself! 252018/2/17 HackerTackle
  • 26. Hivemall generic functions Array and Map Bit and compress String and NLP Brickhouse UDFs are merged in v0.5.2 release. We welcome contributing your generic UDFs to Hivemall Geo Spatial Top-k processing > TF/IDF > TILE > MAP_URL 262018/2/17 HackerTackle
  • 27. 2018/2/17 HackerTackle student class score 1 b 70 2 a 80 3 a 90 4 b 50 5 a 70 6 b 60 Top-k query processing List top-2 students for each class SELECT * FROM ( SELECT *, rank() over (partition by class order by score desc) as rank FROM table ) t WHERE rank <= 2 RANK over() query does not finishes in 24 hours L where 20 million MOOCs classes and avg 1,000 students in each classes 27
  • 28. 2018/2/17 HackerTackle student class score 1 b 70 2 a 80 3 a 90 4 b 50 5 a 70 6 b 60 Top-k query processing List top-2 students for each class SELECT each_top_k( 2, class, score, class, student ) as (rank, score, class, student) FROM ( SELECT * FROM table DISTRIBUTE BY class SORT BY class ) t EACH_TOP_K finishes in 2 hours J 28
  • 30. Tile(lat,lon,zoom) = xtile(lon,zoom) + ytile(lat,zoom) * 2^n Map tiling functions Zoom=10 Zoom=15 302018/2/17 HackerTackle
  • 31. 31 SELECT count(distinct id) FROM data More useful functions (Sketch, NLP) SELECT approx_count_distinct(id) FROM data select tokenize_ja(“ ", "normal", null, null, "https://s3.amazonaws.com/td- hivemall/dist/kuromoji-user-dict-neologd.csv.gz"); [“ ”, "," "," "] 2018/2/17 HackerTackle
  • 32. List of Supported Algorithms Classification ✓ Perceptron ✓ Passive Aggressive (PA, PA1, PA2) ✓ Confidence Weighted (CW) ✓ Adaptive Regularization of Weight Vectors (AROW) ✓ Soft Confidence Weighted (SCW) ✓ AdaGrad+RDA ✓ Factorization Machines ✓ RandomForest Classification Regression ✓Logistic Regression (SGD) ✓AdaGrad (logistic loss) ✓AdaDELTA (logistic loss) ✓PA Regression ✓AROW Regression ✓Factorization Machines ✓RandomForest Regression SCW is a good first choice Try RandomForest if SCW does not work Logistic regression is good for getting a probability of a positive class Factorization Machines is good where features are sparse and categorical ones 322018/2/17 HackerTackle
  • 33. RandomForest in Hivemall Ensemble of Decision Trees 332018/2/17 HackerTackle
  • 34. Training of RandomForest 34 Sparse Vector Input (Libsvm format) is supported since v0.5.0 in addition Dense Vector! 2018/2/17 HackerTackle
  • 38. 38 SELECT train_xgboost_classifier(features, label) as (model_id, model) FROM training_data XGBoost support in Hivemall (beta version) SELECT rowed, AVG(predicted) as predicted FROM ( -- predict with each model SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted) -- join each test record with each model FROM xgboost_models CROSS JOIN test_data_with_id ) t GROUP BY rowid; 2018/2/17 HackerTackle
  • 39. Supported Algorithms for Recommendation K-Nearest Neighbor ✓ Minhash and b-Bit Minhash (LSH variant) ✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular) ✓ DIMSUM (Cosine similarity) Matrix Completion ✓ Matrix Factorization ✓ Factorization Machines (regression) each_top_k function of Hivemall is useful for recommending top-k items 392018/2/17 HackerTackle
  • 40. Other Supported Algorithms Feature Engineering ✓Feature Hashing ✓Feature Scaling (normalization, z-score) ✓ Feature Binning ✓ TF-IDF vectorizer ✓ Polynomial Expansion ✓ Amplifier NLP ✓Basic Englist text Tokenizer ✓Japanese Tokenizer Evaluation metrics ✓AUC, nDCG, logloss, precision recall@K, and etc 402018/2/17 HackerTackle
  • 42. Other Supported Features Anomaly Detection ✓Local Outlier Factor (LoF) ✓ChangeFinder Clustering / Topic models ✓Online mini-batch LDA ✓Online mini-batch PLSA Change Point Detection ✓ChangeFinder ✓Singular Spectrum Transformation 422018/2/17 HackerTackle
  • 43. Efficient algorithm for finding change point and outliers from time-series data J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE transactions on Knowledge and Data Engineering, pp.482-492, 2006. Anomaly/Change-point Detection by ChangeFinder 432018/2/17 HackerTackle
  • 44. Take this… Anomaly/Change-point Detection by ChangeFinder 442018/2/17 HackerTackle
  • 45. Anomaly/Change-point Detection by ChangeFinder …and do this! 452018/2/17 HackerTackle
  • 46. Efficient algorithm for finding change point and outliers from timeseries data Anomaly/Change-point Detection by ChangeFinder J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE transactions on Knowledge and Data Engineering, pp.482-492, 2006. 462018/2/17 HackerTackle
  • 47. ü Word2Vec support ü Multi-class Logistic Regression ü Field-aware Factorization Machines ü SLIM recommendation ü More efficient XGBoost support ü LightGBM support ü DecisionTree prediction tracing ü Gradient Boosting Future work for v0.5.2 and later 47 PR#91 PR#116 PR#58 PR#111 2018/2/17 HackerTackle
  • 48. 48 ML workflows often be really complex…
  • 49. 2018/2/17 HackerTackle 49 Real-world ML pipelines (could be more complex) Join Extract Feature Datasource #1 Datasource #2 Datasource #3 Extract Feature Feature Scaling Feature Hashing Feature Engineering Feature Selection Train by Logistic Regression Train by RandomForest Train by Factorization Machines Ensemble Evaluate Predict
  • 51. Technology Trends for 2017 2018/2/17 HackerTackle 51 https://www.thoughtworks.com/radar
  • 52. 2018/2/17 HackerTackle 52 Why Digdag? Ø Manage workflows by codes (simple YAML syntax) Ø REST API endpoints Ø Parallel/Sequential execution Ø SLA, error notification Ø Secrets Managing Ø Docker support Ø TD, EMR, Bigquery/Slack operators Ø Embedded Javascript engine Programmer Friendly, Revision management Plugin scheme for defining custom operator
  • 53. 2018/2/17 HackerTackle 53 Digdag features SLA and error handling Nestable, Parallel/Sequential Execution Embedded Javascript engine
  • 56. 2018/2/17 HackerTackle 56 Use case: CTR/CVR prediction
  • 57. 2018/2/17 HackerTackle 57 Workflow execution timeline DEMO
  • 58. Conclusion and Takeaway Hivemall is a multi/cross-platform ML library providing a collection of machine learning algorithms as Hive UDFs/UDTFs The first Apache release (v0.5.0) will appear soon! We welcome your contributions to Apache Hivemall J 582018/2/17 HackerTackle Digdag is a great workflow engine for managing complex ML pipelines
  • 59. Any feature request or questions? 592018/2/17 HackerTackle