Real World Machine Learning in Java 8 at Fumankaitori.com
Mathieu Dumoulin, Chief Data Scientist at fumankaitori.com, Data Science Team Manager at en-japan
Today’s menu
● About me and 不満買取センター
● The business problem: Post pricing
● Project Overview
○ Why use ML
○ How to use ML in projects
○ How we used ML in this project
● Results
● Live code (depends on time)
● Conclusion
Presentation goals
● Any Java engineer can do machine learning
● Java is a great programming language for real-world machine learning systems
● New ML APIs make it easy to focus on the problem and the data, and get a well-performing model “for free”
● You don’t need a Ph.D. to use machine learning, just some self-study, good tools and libraries, and experience built one project at a time
About me
[Map: Quebec City]
My Work: Java SE, Hadoop Engineer, Data Scientist
● Launched in March 2015 as web, Android, and iOS applications.
● An application that collects data about people's dissatisfaction.
● Features:
○ Users can post about any dissatisfaction with any product or service.
○ Users get points as a reward for their posts. Points can be exchanged for coupon codes at e-commerce (EC) sites.
● 250,000 users and 1,500,000 posts in total (as of the end of Nov 2015)
Problem statement: post point value prediction
● Fuman user posts have monetary value
● We want to give more points for “good” posts
● At first, operations staff checked all posts, but they can’t check 10,000 posts each day...
We made rules, but the point values they assigned were worse:
● Rules can’t check the content of the posts
● Rules always miss something
● Making hundreds or thousands of rules by hand is ridiculous
ML is the best solution for 不満買取センター
● ML problem: Estimate the point value of a user post (0-25)
● Project goal: Estimate the value of posts to within 5 points of human judgement
● Data: All user posts and user profile data
● Data with known output (labels): staff had already set points manually for 200k posts
This is a classic case of supervised learning.
Predicting a price requires a regression model because the prediction is a number, as opposed to a classification model, which predicts which of a fixed set of classes each post belongs to.
Real world ML project overview
● Machine Learning Workflow
● Data Scientist and Java Engineer roles
● Java for production ML
● Java 8 benefits
● Our point prediction system details
● Results
Machine Learning Workflow
Training: Load data → (data, labels) → Extract Features → (feature vectors, labels) → Train Model → (prediction, labels) → Evaluate vs. business goal → iterate until the best model is found
Prediction: Load new data → (data) → Extract Features (the same as in training) → (feature vectors) → Predict using the best model → (predictions) → Act on prediction
Workflow for machine learning system
1. Set a goal with business value
2. Get data (fuman user posts) with a price already set
3. Transform data for input into the machine learning algorithm
4. Train and evaluate machine learning models until the goal is reached
5. Deploy the best model
Data Scientist’s role
Within the workflow above:
● Choose features
● Build many models
Software Engineer’s role
Implement and integrate into the production system. Within the workflow above:
● Get data from the data source
● Implement production code
But we don’t have a data scientist...
You can outsource!
Java for production ML
● Easy integration with Java applications
● Fast (vs. Python or R)
● Easy to program (vs. C++)
● Most common enterprise programming language, IDE support and excellent
support libraries
● Many state-of-the-art machine learning libraries have a Java API
Machine Learning libraries
Benefits of Java 8
● Java 8’s functional style is a very good match for ML operations
○ Feature extraction: data in → transform → data out
● Java 8’s streams and lambdas
○ Code is easier to understand and less verbose
● Easy parallel code
○ Faster “for free”
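The "data in → transform → data out" idea can be sketched with a stream pipeline. This is a minimal illustration, not the project's code: the class name and the two features (`lengthFeature`, `alnumRatio`) are hypothetical and chosen only to keep the example short. Switching `stream()` to `parallelStream()` is the only change needed to run the transformation in parallel.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class StreamFeatureSketch {

    // Hypothetical feature: number of Unicode code points in a post.
    static double lengthFeature(String post) {
        return post.codePointCount(0, post.length());
    }

    // Hypothetical feature: share of code points that are letters or digits.
    static double alnumRatio(String post) {
        long total = post.codePoints().count();
        if (total == 0) {
            return 0.0;
        }
        long alnum = post.codePoints().filter(Character::isLetterOrDigit).count();
        return (double) alnum / total;
    }

    // data in -> transform -> data out: each post becomes one feature vector.
    // Collecting to a list keeps the input order even with a parallel stream.
    static List<double[]> extract(List<String> posts) {
        return posts.parallelStream()
                .map(p -> new double[] { lengthFeature(p), alnumRatio(p) })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> posts = Arrays.asList("slow delivery!", "too expensive");
        extract(posts).forEach(v -> System.out.println(Arrays.toString(v)));
    }
}
```

Whether the parallel version is actually faster depends on the data size and the cost of each feature, so it is worth measuring before relying on it.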
Post point prediction system: step by step
Fuman DB → Feature Extraction → (CSV format: posts, label) → DataRobot (DR) → DR Prediction API → Prediction Service
DataRobot steps:
● Train/Test split
● Categorical features transformation
● Select best features
● Try many algorithms
● Tune algorithms
● Evaluate models
● REST Prediction API
Iterate until results meet business goals
Feature Extraction details
● We added character and word statistics about each fuman user post
○ Number of hiragana, katakana, kanji, and alphabet characters and words
○ Number of words, length of words
○ Ratio of hiragana, katakana, kanji, and alphabet words to the number of tokens in a post
● User profile information
○ age, gender, job category, etc.
● Bag-of-words models:
○ Words using Tf-Idf, removing stopwords (これ "this", あれ "that over there", それ "that", です "is", など "etc.", …)
○ Part-of-speech (名詞 noun, 動詞 verb, 形容詞 adjective, …)
○ Word type features (hiragana word, katakana word, kanji word, …)
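The Tf-Idf bag-of-words item above can be sketched as follows. The talk does not say which Tf-Idf formula was used, so this example assumes a smoothed variant, idf(t) = ln((1 + N) / (1 + df(t))) + 1, with term frequency normalized by post length. The class name, the toy corpus, and the very short stopword list (taken from the slide) are illustrative only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class TfIdfSketch {

    // Stopwords taken from the slide; a real list would be much longer.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("これ", "あれ", "それ", "です", "など"));

    // Smoothed inverse document frequency (an assumption, see lead-in):
    // idf(t) = ln((1 + N) / (1 + df(t))) + 1
    static double idf(String term, List<List<String>> corpus) {
        long df = corpus.stream().filter(doc -> doc.contains(term)).count();
        return Math.log((1.0 + corpus.size()) / (1.0 + df)) + 1.0;
    }

    // TF-IDF weight for each non-stopword term of one tokenized post.
    static Map<String, Double> tfIdf(List<String> tokens, List<List<String>> corpus) {
        List<String> kept = tokens.stream()
                .filter(t -> !STOPWORDS.contains(t))
                .collect(Collectors.toList());
        Map<String, Long> counts = kept.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        Map<String, Double> weights = new TreeMap<>();
        counts.forEach((term, n) ->
                weights.put(term, n.doubleValue() / kept.size() * idf(term, corpus)));
        return weights;
    }

    public static void main(String[] args) {
        // Toy corpus of already tokenized posts (tokenization itself is not shown).
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("ポテト", "揚げたて", "です"),
                Arrays.asList("ポテト", "冷たい"));
        System.out.println(tfIdf(corpus.get(0), corpus));
    }
}
```

Terms that appear in fewer posts get a larger weight, which is the property that makes Tf-Idf more informative than raw counts.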
Feature Extraction: Example
マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。
(“I asked for freshly fried McDonald's fries, but they weren't freshly fried.”)
Feature Example: MeCab analyzer
マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。
マック 名詞,固有名詞,一般,*,*,*,マック,マック,マック
の 助詞,連体化,*,*,*,*,の,ノ,ノ
ポテト 名詞,一般,*,*,*,*,ポテト,ポテト,ポテト
揚げたて 名詞,一般,*,*,*,*,揚げたて,アゲタテ,アゲタテ
で 助詞,格助詞,一般,*,*,*,で,デ,デ
お願い 名詞,サ変接続,*,*,*,*,お願い,オネガイ,オネガイ
し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
のに 助詞,接続助詞,*,*,*,*,のに,ノニ,ノニ
、 記号,読点,*,*,*,*,、,、,、
揚げたて 名詞,一般,*,*,*,*,揚げたて,アゲタテ,アゲタテ
じゃ 助詞,副助詞,*,*,*,*,じゃ,ジャ,ジャ
なかっ 助動詞,*,*,*,特殊・ナイ,連用タ接続,ない,ナカッ,ナカッ
た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
。 記号,句点,*,*,*,*,。,。,。
EOS
Feature Extraction: Example
Character counts
Hiragana: 20
Katakana: 6
Kanji: 3
Alpha: 0
Digits: 0
Marks (!,?): 0
Token type counts
Hiragana: 8
Katakana: 2
Kanji: 3
Alpha: 0
Digits: 0
Marks: 0
Token length
1: 5
2: 2
3: 4
4: 2
5+: 0
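The character counts above can be reproduced with the JDK alone. This sketch assumes that classifying each code point by `Character.UnicodeScript` is an acceptable definition of hiragana, katakana, and kanji (`HAN`); the talk does not show how the original counts were computed, and the class name is hypothetical.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

final class CharStatsSketch {

    // Count code points per Unicode script (HIRAGANA, KATAKANA, HAN for kanji, ...).
    static Map<UnicodeScript, Integer> scriptCounts(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        text.codePoints()
            .mapToObj(UnicodeScript::of)
            .forEach(script -> counts.merge(script, 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        String post = "マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。";
        Map<UnicodeScript, Integer> c = scriptCounts(post);
        // Prints Hiragana: 20, Katakana: 6, Kanji: 3, matching the slide.
        System.out.println("Hiragana: " + c.getOrDefault(UnicodeScript.HIRAGANA, 0));
        System.out.println("Katakana: " + c.getOrDefault(UnicodeScript.KATAKANA, 0));
        System.out.println("Kanji: " + c.getOrDefault(UnicodeScript.HAN, 0));
    }
}
```

Punctuation such as 、 and 。 falls under the `COMMON` script, as does the long-vowel mark ー, so a production version would need an explicit rule for characters shared between scripts.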
Training and evaluation of our model
We reached the project goal!
● DataRobot’s best model
○ eXtreme Gradient Boosted Trees
○ RMSE: 3.54
○ MSE: 12.53
Business result:
● Higher quality evaluation than rules
● Operations staff no longer need to check posts manually
● We can validate points every day
Our result: about 3.5 points of difference (RMSE 3.54) from human judgement, within the 5-point project goal
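The two reported metrics are related: RMSE is the square root of MSE, so an MSE of 12.53 corresponds to an RMSE of about 3.54 points. The sketch below shows how both could be computed from staff labels and model predictions. The class name is hypothetical and the sample values in `main` are made up for illustration; they are not project data.

```java
import java.util.Locale;

final class ErrorMetricsSketch {

    // Mean squared error between human-assigned points and predicted points.
    static double mse(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sum += diff * diff;
        }
        return sum / actual.length;
    }

    // Root mean squared error, expressed in the same unit as the label (points).
    static double rmse(double[] actual, double[] predicted) {
        return Math.sqrt(mse(actual, predicted));
    }

    public static void main(String[] args) {
        // Made-up labels and predictions, only to exercise the functions.
        double[] staffPoints = { 10, 20, 5 };
        double[] modelPoints = { 13, 16, 5 };
        System.out.println(rmse(staffPoints, modelPoints));

        // Relationship between the reported figures: prints 3.54
        System.out.println(String.format(Locale.ROOT, "%.2f", Math.sqrt(12.53)));
    }
}
```

Because RMSE is in points, it can be compared directly with the project goal of staying within 5 points of human judgement.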
Deployment issues
● Problem: The Prediction API was very slow (>1s / post), so we had to run it as a batch process each night.
● We want: Local predictions with low latency, without losing the good predictive performance we already have.
We solved this problem using the excellent open source, distributed machine learning library H2O by H2O.ai.
Co-founder: Cliff Click, who wrote the Java HotSpot Server Compiler
Post point prediction system: Current system
Training: Fuman DB → Feature Extraction → (CSV format: posts, label) → Train Production Model: H2O → Prediction POJO
H2O steps:
● Train/Test split
● Categorical features transformation
● Distributed, fast and state-of-the-art algorithms
● POJO prediction class generation
Prediction: Fuman Webapp → get new post values → make feature vectors → Prediction Service (uses the Prediction POJO)
Overview: Making Predictions
● Use the prediction POJO generated by H2O
● For each new post, query the Prediction Service
○ Convert to vector (Double[] for H2O)
○ Get prediction from prediction POJO (Double value, round to integer)
○ Update database with predicted price
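The per-post prediction step can be sketched as below. To keep the example self-contained, `PointModel` is a hypothetical stand-in interface and is not the class H2O generates; in the real system the call would go to the generated prediction POJO. Clamping the result to the 0-25 point range is also an assumption based on the range stated earlier, and the database update is left out.

```java
final class PredictionServiceSketch {

    // Hypothetical stand-in for the generated prediction class (not H2O's API).
    interface PointModel {
        double predict(double[] features);
    }

    // Get the raw prediction, round it to an integer, and keep it within 0-25.
    static int predictPoints(PointModel model, double[] features) {
        double raw = model.predict(features);
        long rounded = Math.round(raw);
        return (int) Math.max(0L, Math.min(25L, rounded));
    }

    public static void main(String[] args) {
        // Toy model used only for illustration: 2 points plus 0.1 per character.
        PointModel toyModel = f -> 2.0 + 0.1 * f[0];
        double[] featureVector = { 31.0 }; // e.g. the character count of a post
        System.out.println(predictPoints(toyModel, featureVector));
    }
}
```

Keeping the model behind a small interface like this also makes it easy to swap a newly trained model into the Prediction Service without touching the calling code.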
We reached the business goal!
Project goal: Get similar performance from H2O as from DataRobot
H2O is not ideal for exploring different models and features, but for production it is FAST, with similar predictive performance. It is implemented in pure Java (source on GitHub).
● H2O: Train a new model for
production
○ GBM (Gradient Boosting Machine)
○ MSE: 12.8
● DataRobot’s best model
○ eXtreme Gradient Boosted Trees
○ RMSE: 3.54
○ MSE: 12.53
Real world ML loves Java!
● Java is a top choice for making production machine
learning systems
● Java 8’s features make Java fun and relevant again
● Integration in a Java web application was not hard
● Java is not a good choice for experimentation
○ Start with a Python prototype with Scikit-learn
○ Use a Machine Learning service like DataRobot.com
You can use ML in your projects!
● Web API services are like a personal data
scientist
○ No need for Data Scientist for simple use of ML
○ But harder datasets will need expertise
● Real-world ML projects need engineers:
○ Get data to train a good model (log files, sales results,
mail campaign results,…)
○ Transform data into input for ML library or web service
○ Deploy and integrate into production
● Most steps are just normal programming
○ Get data from DB
○ Transform data into a CSV
○ Call a REST API or Java POJO to make predictions
○ Integrate with the system that needs predictions
Questions?
Live code
Feature engineering with streams and lambdas
The goal is to take raw data from the DB and create arrays of numerical or categorical features.
1. Get Fuman user post data from DB -> UserPost
2. Learn the vocabulary of word types across all user posts
3. Create the dataset:
a. For each post,
i. Add the statistics features
ii. Add the word types features
4. Transform to CSV output (for DataRobot)
Instances are Weka SparseInstance objects (sparse vectors for memory efficiency). In retrospect, I think a specialized vector library would have been better: Weka is a terrible production library.
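Steps 2 to 4 above can be sketched with streams and lambdas. This is not the live-coded program: the `UserPost` class here is a hypothetical stand-in with only two fields, the statistics features are omitted, plain strings are used instead of Weka instances, and the CSV writer assumes that values contain no commas or quotes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class CsvDatasetSketch {

    // Hypothetical stand-in for the UserPost type; the real class is not shown in the talk.
    static final class UserPost {
        final List<String> wordTypes; // one entry per token, e.g. "hiragana", "kanji"
        final int points;             // label set by staff

        UserPost(List<String> wordTypes, int points) {
            this.wordTypes = wordTypes;
            this.points = points;
        }
    }

    // Step 2: learn the vocabulary, sorted so that the column order is stable.
    static List<String> vocabulary(List<UserPost> posts) {
        return posts.stream()
                .flatMap(p -> p.wordTypes.stream())
                .distinct()
                .sorted()
                .collect(Collectors.toList());
    }

    // Step 3: one row per post, with a count per vocabulary entry followed by the label.
    static String toCsvRow(UserPost post, List<String> vocab) {
        Map<String, Long> counts = post.wordTypes.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return Stream.concat(
                    vocab.stream().map(w -> String.valueOf(counts.getOrDefault(w, 0L))),
                    Stream.of(String.valueOf(post.points)))
                .collect(Collectors.joining(","));
    }

    // Step 4: header line plus one line per post.
    static String toCsv(List<UserPost> posts) {
        List<String> vocab = vocabulary(posts);
        String header = Stream.concat(vocab.stream(), Stream.of("points"))
                .collect(Collectors.joining(","));
        return Stream.concat(Stream.of(header), posts.stream().map(p -> toCsvRow(p, vocab)))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        List<UserPost> posts = Arrays.asList(
                new UserPost(Arrays.asList("kanji", "hiragana", "kanji"), 10),
                new UserPost(Arrays.asList("katakana"), 3));
        System.out.println(toCsv(posts));
    }
}
```

The vocabulary is learned once and reused for every row, so training and prediction data end up with the same columns in the same order.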

More Related Content

What's hot

Big Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsBig Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source Toolkits
DataWorks Summit
 
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
 Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark... Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
Databricks
 
SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015
Lance Co Ting Keh
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
inside-BigData.com
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Databricks
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Databricks
 

What's hot (20)

Big Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsBig Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source Toolkits
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
 
Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ...
 Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ... Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ...
Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ...
 
Productive Use of the Apache Spark Prompt with Sam Penrose
Productive Use of the Apache Spark Prompt with Sam PenroseProductive Use of the Apache Spark Prompt with Sam Penrose
Productive Use of the Apache Spark Prompt with Sam Penrose
 
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDL
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
 
MapR and Machine Learning Primer
MapR and Machine Learning PrimerMapR and Machine Learning Primer
MapR and Machine Learning Primer
 
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at ScaleData Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
 
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
 
Fast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL ReleasesFast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL Releases
 
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
 Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark... Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
 
SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in Spark
 

Viewers also liked

プログラム初心者がWebサービスをリリースして運営するまで
プログラム初心者がWebサービスをリリースして運営するまでプログラム初心者がWebサービスをリリースして運営するまで
プログラム初心者がWebサービスをリリースして運営するまで
Tomoaki Iwasaki
 
Reactive Webアプリケーション - そしてSpring 5へ #jjug_ccc #ccc_ef3
Reactive Webアプリケーション - そしてSpring 5へ #jjug_ccc #ccc_ef3Reactive Webアプリケーション - そしてSpring 5へ #jjug_ccc #ccc_ef3
Reactive Webアプリケーション - そしてSpring 5へ #jjug_ccc #ccc_ef3
Toshiaki Maki
 

Viewers also liked (20)

よくある業務開発の自動化事情 #jjug_ccc #ccc_cd3
よくある業務開発の自動化事情 #jjug_ccc #ccc_cd3よくある業務開発の自動化事情 #jjug_ccc #ccc_cd3
よくある業務開発の自動化事情 #jjug_ccc #ccc_cd3
 
【こっそり始める】Javaプログラマコーディングマイグレーション
【こっそり始める】Javaプログラマコーディングマイグレーション【こっそり始める】Javaプログラマコーディングマイグレーション
【こっそり始める】Javaプログラマコーディングマイグレーション
 
プログラム初心者がWebサービスをリリースして運営するまで
プログラム初心者がWebサービスをリリースして運営するまでプログラム初心者がWebサービスをリリースして運営するまで
プログラム初心者がWebサービスをリリースして運営するまで
 
日本 Java ユーザーグループ JJUG CCC 2015 Fall by ソラコム 片山
日本 Java ユーザーグループ JJUG CCC 2015 Fall  by ソラコム 片山 日本 Java ユーザーグループ JJUG CCC 2015 Fall  by ソラコム 片山
日本 Java ユーザーグループ JJUG CCC 2015 Fall by ソラコム 片山
 
Javaにおけるネイティブコード連携の各種手法の紹介
Javaにおけるネイティブコード連携の各種手法の紹介Javaにおけるネイティブコード連携の各種手法の紹介
Javaにおけるネイティブコード連携の各種手法の紹介
 
Java8 Stream APIとApache SparkとAsakusa Frameworkの類似点・相違点
Java8 Stream APIとApache SparkとAsakusa Frameworkの類似点・相違点Java8 Stream APIとApache SparkとAsakusa Frameworkの類似点・相違点
Java8 Stream APIとApache SparkとAsakusa Frameworkの類似点・相違点
 
Java8移行から始めた技術的負債との戦い(jjug ccc 2015 fall)
Java8移行から始めた技術的負債との戦い(jjug ccc 2015 fall)Java8移行から始めた技術的負債との戦い(jjug ccc 2015 fall)
Java8移行から始めた技術的負債との戦い(jjug ccc 2015 fall)
 
デバッガのしくみ(JDI)を学んでみよう
デバッガのしくみ(JDI)を学んでみようデバッガのしくみ(JDI)を学んでみよう
デバッガのしくみ(JDI)を学んでみよう
 
Reactive Webアプリケーション - そしてSpring 5へ #jjug_ccc #ccc_ef3
Reactive Webアプリケーション - そしてSpring 5へ #jjug_ccc #ccc_ef3Reactive Webアプリケーション - そしてSpring 5へ #jjug_ccc #ccc_ef3
Reactive Webアプリケーション - そしてSpring 5へ #jjug_ccc #ccc_ef3
 
Java EEハンズオン資料 JJUG CCC 2015 Fall
Java EEハンズオン資料 JJUG CCC 2015 FallJava EEハンズオン資料 JJUG CCC 2015 Fall
Java EEハンズオン資料 JJUG CCC 2015 Fall
 
マイクロサービスアーキテクチャ - アーキテクチャ設計の歴史を背景に
マイクロサービスアーキテクチャ - アーキテクチャ設計の歴史を背景にマイクロサービスアーキテクチャ - アーキテクチャ設計の歴史を背景に
マイクロサービスアーキテクチャ - アーキテクチャ設計の歴史を背景に
 
タイムマシン採用:明日のエンタープライズJavaの世界を予想する -Java EE7/クラウド/Docker/etc.-
タイムマシン採用:明日のエンタープライズJavaの世界を予想する -Java EE7/クラウド/Docker/etc.-タイムマシン採用:明日のエンタープライズJavaの世界を予想する -Java EE7/クラウド/Docker/etc.-
タイムマシン採用:明日のエンタープライズJavaの世界を予想する -Java EE7/クラウド/Docker/etc.-
 
Getting start Java EE Action-Based MVC with Thymeleaf
Getting start Java EE Action-Based MVC with ThymeleafGetting start Java EE Action-Based MVC with Thymeleaf
Getting start Java EE Action-Based MVC with Thymeleaf
 
VMの歩む道。 Dalvik、ART、そしてJava VM
VMの歩む道。 Dalvik、ART、そしてJava VMVMの歩む道。 Dalvik、ART、そしてJava VM
VMの歩む道。 Dalvik、ART、そしてJava VM
 
Java8移行は怖くない~エンタープライズ案件でのJava8移行事例~
Java8移行は怖くない~エンタープライズ案件でのJava8移行事例~Java8移行は怖くない~エンタープライズ案件でのJava8移行事例~
Java8移行は怖くない~エンタープライズ案件でのJava8移行事例~
 
Kotlin is charming; The reasons Java engineers should start Kotlin.
Kotlin is charming; The reasons Java engineers should start Kotlin.Kotlin is charming; The reasons Java engineers should start Kotlin.
Kotlin is charming; The reasons Java engineers should start Kotlin.
 
U-NEXT学生インターン、過激なJavaの学び方と過激な要求
U-NEXT学生インターン、過激なJavaの学び方と過激な要求U-NEXT学生インターン、過激なJavaの学び方と過激な要求
U-NEXT学生インターン、過激なJavaの学び方と過激な要求
 
Java libraries you can't afford to miss
Java libraries you can't afford to missJava libraries you can't afford to miss
Java libraries you can't afford to miss
 
Jjug ccc
Jjug cccJjug ccc
Jjug ccc
 
2017spring jjug ccc_f2
2017spring jjug ccc_f22017spring jjug ccc_f2
2017spring jjug ccc_f2
 

Similar to Real world machine learning with Java for Fumankaitori.com

From science to engineering, the process to build a machine learning product
From science to engineering, the process to build a machine learning productFrom science to engineering, the process to build a machine learning product
From science to engineering, the process to build a machine learning product
Bruce Kuo
 
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfSlides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
vitm11
 
Vitalii Bondarenko and Eugene Berko "Cloud AI Platform as an accelerator of e...
Vitalii Bondarenko and Eugene Berko "Cloud AI Platform as an accelerator of e...Vitalii Bondarenko and Eugene Berko "Cloud AI Platform as an accelerator of e...
Vitalii Bondarenko and Eugene Berko "Cloud AI Platform as an accelerator of e...
Lviv Startup Club
 
Sf big analytics: bighead
Sf big analytics: bigheadSf big analytics: bighead
Sf big analytics: bighead
Chester Chen
 

Similar to Real world machine learning with Java for Fumankaitori.com (20)

DevOps Days Rockies MLOps
DevOps Days Rockies MLOpsDevOps Days Rockies MLOps
DevOps Days Rockies MLOps
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
 
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or reality
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makers
 
From science to engineering, the process to build a machine learning product
From science to engineering, the process to build a machine learning productFrom science to engineering, the process to build a machine learning product
From science to engineering, the process to build a machine learning product
 
Python and data analytics
Python and data analyticsPython and data analytics
Python and data analytics
 
Bridging the gap in enterprise AI
Bridging the gap in enterprise AIBridging the gap in enterprise AI
Bridging the gap in enterprise AI
 
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfPyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
 
BSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 Sessions
 
Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher Scientific
Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher ScientificEnabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher Scientific
Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher Scientific
 
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfSlides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
 
Vitalii Bondarenko and Eugene Berko "Cloud AI Platform as an accelerator of e...
Vitalii Bondarenko and Eugene Berko "Cloud AI Platform as an accelerator of e...Vitalii Bondarenko and Eugene Berko "Cloud AI Platform as an accelerator of e...
Vitalii Bondarenko and Eugene Berko "Cloud AI Platform as an accelerator of e...
 
From Data Science to MLOps
From Data Science to MLOpsFrom Data Science to MLOps
From Data Science to MLOps
 
Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning
 
World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018
 
Sf big analytics: bighead
Sf big analytics: bigheadSf big analytics: bighead
Sf big analytics: bighead
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?
 
Machine learning
Machine learningMachine learning
Machine learning
 
Overcome a Frontier
Overcome a FrontierOvercome a Frontier
Overcome a Frontier
 

More from Mathieu Dumoulin

Introduction aux algorithmes map reduce
Introduction aux algorithmes map reduceIntroduction aux algorithmes map reduce
Introduction aux algorithmes map reduce
Mathieu Dumoulin
 

More from Mathieu Dumoulin (8)

State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016
 
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
 
Introduction aux algorithmes map reduce
Introduction aux algorithmes map reduceIntroduction aux algorithmes map reduce
Introduction aux algorithmes map reduce
 
MapReduce: Traitement de données distribué à grande échelle simplifié
MapReduce: Traitement de données distribué à grande échelle simplifiéMapReduce: Traitement de données distribué à grande échelle simplifié
MapReduce: Traitement de données distribué à grande échelle simplifié
 
Presentation Hadoop Québec
Presentation Hadoop QuébecPresentation Hadoop Québec
Presentation Hadoop Québec
 
Introduction à Hadoop
Introduction à HadoopIntroduction à Hadoop
Introduction à Hadoop
 

Recently uploaded

Maher Othman Interior Design Portfolio..
Maher Othman Interior Design Portfolio..Maher Othman Interior Design Portfolio..
Maher Othman Interior Design Portfolio..
MaherOthman7
 
electrical installation and maintenance.
electrical installation and maintenance.electrical installation and maintenance.
electrical installation and maintenance.
benjamincojr
 
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
drjose256
 
Seizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networksSeizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networks
IJECEIAES
 
Online crime reporting system project.pdf
Online crime reporting system project.pdfOnline crime reporting system project.pdf
Online crime reporting system project.pdf
Kamal Acharya
 
21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx
rahulmanepalli02
 

Recently uploaded (20)

Maher Othman Interior Design Portfolio..
Maher Othman Interior Design Portfolio..Maher Othman Interior Design Portfolio..
Maher Othman Interior Design Portfolio..
 
Software Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfSoftware Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdf
 
handbook on reinforce concrete and detailing
handbook on reinforce concrete and detailinghandbook on reinforce concrete and detailing
handbook on reinforce concrete and detailing
 
CLOUD COMPUTING SERVICES - Cloud Reference Modal
CLOUD COMPUTING SERVICES - Cloud Reference ModalCLOUD COMPUTING SERVICES - Cloud Reference Modal
CLOUD COMPUTING SERVICES - Cloud Reference Modal
 
electrical installation and maintenance.
electrical installation and maintenance.electrical installation and maintenance.
electrical installation and maintenance.
 
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisSeismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
 
Lab Manual Arduino UNO Microcontrollar.docx
Lab Manual Arduino UNO Microcontrollar.docxLab Manual Arduino UNO Microcontrollar.docx
Lab Manual Arduino UNO Microcontrollar.docx
 
AI in Healthcare Innovative use cases and applications.pdf
AI in Healthcare Innovative use cases and applications.pdfAI in Healthcare Innovative use cases and applications.pdf
AI in Healthcare Innovative use cases and applications.pdf
 
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
 
Seizure stage detection of epileptic seizure using convolutional neural networks
Real world machine learning with Java for Fumankaitori.com

  • 1. Real World Machine Learning in Java 8 at Fumankaitori.com Mathieu Dumoulin, Chief Data Scientist fumankaitori.com, Data Science Team manager at en-japan
  • 2. Today’s menu ● About me and 不満買取センター (Fuman Kaitori Center) ● The business problem: Post pricing ● Project Overview ○ Why use ML ○ How to use ML in projects ○ How we used ML in this project ● Results ● Live code (depends on time) ● Conclusion
  • 3. Presentation goals ● Any Java engineer can do machine learning ● Java is a great programming language for real-world machine learning systems ● New ML APIs make it easy to focus on the problem and the data, and get a well-performing model “for free” ● You don’t need a Ph.D. to use machine learning, just some self-study, good tools and libraries, and experience built one project at a time
  • 5. Google map for Quebec City here!
  • 6. My Work: Java SE, Hadoop Engineer, Data Scientist
  • 7. ● Launched in Mar 2015. Provides web/Android/iOS applications. ● An application to collect data about people's dissatisfactions. ● Features: ○ Users can post any dissatisfaction with any product or service. ○ Users get points as a reward for their posts. Points can be exchanged for coupon codes on e-commerce (EC) sites. ● 250,000 users with 1,500,000 posts (accumulated) (end of Nov 2015)
  • 8. Problem statement: post point value prediction ● Fuman user posts have a money value ● We want to give more points for “good” posts ● At first, operations staff checked all posts, but they can’t check 10,000 posts each day... We made rules, but the resulting point values were worse: ● Rules can’t check the content of the posts ● Rules always miss something ● Making hundreds or thousands of rules by hand is ridiculous
  • 9. ML is the best solution for 不満買取センター ● ML Problem: Estimate the point value of a user post (0-25) ● Project goal: Estimate the value of posts with less than 5 points of difference from human judgement ● Data: All user posts and user profile data ● Data with known output (labels): staff already set points for 200k posts manually. This is a classic case of supervised learning (Wiki; another reference from Microsoft). Predicting a price requires building a regression model because the prediction is a number, as opposed to a classification problem, which predicts which of a fixed set of classes each post belongs to.
  • 10. Real world ML project overview ● Machine Learning Workflow ● Data Scientist and Java Engineer roles ● Java for production ML ● Java 8 benefits ● Our point prediction system details ● Results
  • 11. Machine Learning Workflow. Training: Load data → (data, labels (known result)) → Extract Features → (feature vectors, labels) → Train Model → (prediction, labels) → Evaluate vs. business goal; iterate. Prediction: Load new data → (data) → Extract Features (the same as in training) → (feature vectors) → Predict using model (the best model from training) → (predictions) → Act on prediction.
  • 12. Workflow for machine learning system 1. Set a goal with business value 2. Get data (fuman user posts) with a price already set 3. Transform data for input into the machine learning algorithm 4. Train and evaluate machine learning models until the goal is reached 5. Deploy the best model
  • 13. Data Scientist’s role 1. Set a goal with business value 2. Get data (fuman user posts) with a price already set 3. Transform data for input into the machine learning algorithm 4. Train and evaluate machine learning models until the goal is reached 5. Deploy the best model. Data scientist tasks: choose features; build many models.
  • 14. Software Engineer’s role: implement and integrate into the production system 1. Set a goal with business value 2. Get data (fuman user posts) with a price already set 3. Transform data for input into the machine learning algorithm 4. Train and evaluate machine learning models until the goal is reached 5. Deploy the best model. Engineer tasks: get data from the data source; implement production code.
  • 15. But we don’t have a data scientist...
  • 17. Java for production ML ● Easy integration with Java applications ● Fast (vs. Python or R) ● Easy to program (vs. C++) ● Most common enterprise programming language, IDE support and excellent support libraries ● Lots of state of the art machine learning libraries have a Java API
  • 19. Benefits of Java 8 ● Java 8’s functional style is a very good match with ML operations a. Feature extraction: data in → transform → data out ● Java 8’s streams and Lambdas a. Code is easier to understand and less verbose ● Easy parallel code a. Faster “for free”
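The "data in → transform → data out" style on this slide can be sketched with a Java 8 parallel stream. This is a minimal illustration, not code from the talk: the class `StreamFeatures`, its method names, and the three features (length, word count, digit count) are assumptions chosen to keep the example small.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class StreamFeatures {
    /** Hypothetical transform: one post in, one numeric feature vector out. */
    static double[] toFeatures(String post) {
        String trimmed = post.trim();
        // Whitespace word count is only a placeholder; Japanese text needs a tokenizer.
        long words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        long digits = post.chars().filter(Character::isDigit).count();
        return new double[] { post.length(), words, digits };
    }

    /** Maps every post to its feature vector; parallelStream() spreads the work over cores. */
    static List<double[]> extract(List<String> posts) {
        return posts.parallelStream()
                .map(StreamFeatures::toFeatures)
                .collect(Collectors.toList()); // encounter order is preserved
    }

    public static void main(String[] args) {
        List<String> posts = Arrays.asList("too slow 2 days", "fries were cold");
        extract(posts).forEach(v -> System.out.println(Arrays.toString(v)));
    }
}
```

Because `toFeatures` has no side effects, switching between `stream()` and `parallelStream()` does not change the result, which is what makes the speed-up "for free".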
  • 20. Post point prediction system: step by step. Diagram labels: Fuman DB; Feature Extraction; Prediction Service; DR Prediction API. Data passed: posts, label; CSV format. Modeling steps: ● Train/Test split ● Categorical features transformation ● Select best features ● Try many algorithms ● Tune algorithms ● Evaluate models ● REST Prediction API. Iterate until results meet business goals.
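The train/test split listed on this slide can be sketched in a few lines of plain Java. In the talk this step happens inside the modeling service, so the class `TrainTestSplit`, the 80/20 ratio, and the seed below are illustrative assumptions only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class TrainTestSplit {
    /**
     * Shuffles a copy of the rows with a fixed seed and returns [train, test].
     * testFraction is the share of rows held out for evaluation, e.g. 0.2.
     */
    static <T> List<List<T>> split(List<T> rows, double testFraction, long seed) {
        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed)); // same seed, same split
        int testSize = (int) Math.round(shuffled.size() * testFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(testSize, shuffled.size()))); // train
        parts.add(new ArrayList<>(shuffled.subList(0, testSize)));               // test
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) rows.add(i);
        List<List<Integer>> parts = split(rows, 0.2, 42L);
        System.out.println("train=" + parts.get(0) + " test=" + parts.get(1));
    }
}
```

Holding out rows the model never sees during training is what makes the later error figures comparable to the business goal.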
  • 21. Feature Extraction details ● We added character and word statistics about each fuman user post ○ Number of hiragana, katakana, kanji, alphabet characters and words ○ Number of words, length of words ○ Ratio of hiragana, katakana, kanji, alphabet words to the number of tokens in a post ● User profile information ○ age, gender, job category, etc. ● Bag-of-words models: ○ Words using Tf-Idf, removing stopwords (これ, あれ, それ, です, など, …; roughly “this”, “that over there”, “that”, “is”, “etc.”) ○ Part-of-speech (名詞, 動詞, 形容詞, …; noun, verb, adjective) ○ Word type features (hiragana word, katakana word, kanji word, …)
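A small sketch of the Tf-Idf weighting mentioned on this slide follows. The talk does not say which Tf-Idf variant was used, so the formula here (raw term count times ln(N / document frequency)) is an assumption, as are the class `TfIdf`, its method names, and the romanized example tokens.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

class TfIdf {
    /** Drops stopwords from an already tokenized post. */
    static List<String> removeStopwords(List<String> tokens, Set<String> stopwords) {
        return tokens.stream().filter(t -> !stopwords.contains(t)).collect(Collectors.toList());
    }

    /** Inverse document frequency per term: ln(N / number of posts containing the term). */
    static Map<String, Double> idf(List<List<String>> docs) {
        final double n = docs.size();
        Map<String, Long> docFreq = docs.stream()
                .flatMap(doc -> doc.stream().distinct()) // count each term once per post
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return docFreq.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey, e -> Math.log(n / e.getValue())));
    }

    /** Tf-Idf of one term in one post: raw count in the post times the term's idf. */
    static double tfIdf(String term, List<String> doc, Map<String, Double> idf) {
        long tf = doc.stream().filter(term::equals).count();
        return tf * idf.getOrDefault(term, 0.0);
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("desu", "kore"));
        List<List<String>> docs = Arrays.asList(
                removeStopwords(Arrays.asList("poteto", "agetate", "poteto", "desu"), stopwords),
                removeStopwords(Arrays.asList("agetate", "osoi"), stopwords));
        Map<String, Double> idf = idf(docs);
        System.out.println("poteto in post 0: " + tfIdf("poteto", docs.get(0), idf));
    }
}
```

A term that appears in every post gets an idf of zero under this variant, so it contributes nothing to the feature vector, which is the same intuition behind removing stopwords up front.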
  • 23. Feature Example: MeCab analyzer. Input: マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。 (“I asked for my McDonald's fries freshly fried, but they weren't freshly fried.”) Output, one token per line followed by its features:
       マック 名詞,固有名詞,一般,*,*,*,マック,マック,マック
       の 助詞,連体化,*,*,*,*,の,ノ,ノ
       ポテト 名詞,一般,*,*,*,*,ポテト,ポテト,ポテト
       揚げたて 名詞,一般,*,*,*,*,揚げたて,アゲタテ,アゲタテ
       で 助詞,格助詞,一般,*,*,*,で,デ,デ
       お願い 名詞,サ変接続,*,*,*,*,お願い,オネガイ,オネガイ
       し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
       た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
       のに 助詞,接続助詞,*,*,*,*,のに,ノニ,ノニ
       、 記号,読点,*,*,*,*,、,、,、
       揚げたて 名詞,一般,*,*,*,*,揚げたて,アゲタテ,アゲタテ
       じゃ 助詞,副助詞,*,*,*,*,じゃ,ジャ,ジャ
       なかっ 助動詞,*,*,*,特殊・ナイ,連用タ接続,ない,ナカッ,ナカッ
       た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
       。 記号,句点,*,*,*,*,。,。,。
       EOS
  • 24. Feature Extraction: Example ● Character counts: Hiragana 20, Katakana 6, Kanji 3, Alpha 0, Digits 0, Marks (!,?) 0 ● Token type counts: Hiragana 8, Katakana 2, Kanji 3, Alpha 0, Digits 0, Marks 0 ● Token length: 1: 5, 2: 2, 3: 4, 4: 2, 5+: 0
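The character counts on this slide can be reproduced with the JDK alone. The sketch below uses `Character.UnicodeScript` to classify each code point of the example sentence (マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。); the class `ScriptCounts` and its members are hypothetical names, and the sentence is written with Unicode escapes only so the source file stays ASCII. Kanji map to the `HAN` script in Java.

```java
import java.lang.Character.UnicodeScript;

class ScriptCounts {
    /** The example sentence from the slide, spelled with Unicode escapes. */
    static final String EXAMPLE =
            "\u30DE\u30C3\u30AF\u306E\u30DD\u30C6\u30C8"             // makku no poteto
            + "\u63DA\u3052\u305F\u3066\u3067"                       // agetate de
            + "\u304A\u9858\u3044\u3057\u305F\u306E\u306B\u3001"     // onegai shita noni,
            + "\u63DA\u3052\u305F\u3066\u3058\u3083"                 // agetate ja
            + "\u306A\u304B\u3063\u305F\u3002";                      // nakatta.

    /** Counts the code points of the text that belong to the given Unicode script. */
    static long count(String text, UnicodeScript script) {
        return text.codePoints()
                .filter(cp -> UnicodeScript.of(cp) == script)
                .count();
    }

    public static void main(String[] args) {
        System.out.println("Hiragana: " + count(EXAMPLE, UnicodeScript.HIRAGANA));
        System.out.println("Katakana: " + count(EXAMPLE, UnicodeScript.KATAKANA));
        System.out.println("Kanji:    " + count(EXAMPLE, UnicodeScript.HAN));
    }
}
```

For the example sentence this gives 20 hiragana, 6 katakana, and 3 kanji characters, matching the slide. One caveat: the long-vowel mark ー is classified as `COMMON` rather than `KATAKANA` by Java, so a production version would likely need to handle it explicitly.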
  • 25. Training and evaluation of our model
  • 26. We reached the project goal! ● DataRobot’s best model ○ eXtreme Gradient Boosted Trees ○ RMSE: 3.54 ○ MSE: 12.53 Business result: ● Higher quality evaluation than rules ● Operations staff don’t need to manually check posts ● We can validate points every day. Our result: about 3.5 points of difference from human judgement (RMSE), within the 5-point goal.
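The two error figures on this slide are related by RMSE = √MSE, which is why an MSE of 12.53 reads as roughly 3.5 points of error. A minimal sketch of both metrics follows; the class `RegressionMetrics` and the sample numbers in `main` are made up for illustration and are not data from the project.

```java
import java.util.stream.IntStream;

class RegressionMetrics {
    /** Mean squared error between staff-assigned and predicted point values. */
    static double mse(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        return IntStream.range(0, actual.length)
                .mapToDouble(i -> (predicted[i] - actual[i]) * (predicted[i] - actual[i]))
                .average()
                .getAsDouble();
    }

    /** Root mean squared error: same unit as the label (points), so comparable to the goal. */
    static double rmse(double[] actual, double[] predicted) {
        return Math.sqrt(mse(actual, predicted));
    }

    public static void main(String[] args) {
        double[] staffPoints = {10, 5, 20};   // hypothetical labels
        double[] modelPoints = {12, 5, 17};   // hypothetical predictions
        System.out.println("MSE:  " + mse(staffPoints, modelPoints));
        System.out.println("RMSE: " + rmse(staffPoints, modelPoints));
        System.out.println("sqrt(12.53) = " + Math.sqrt(12.53));
    }
}
```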
  • 27. Deployment issues ● Problem: The Prediction API was very slow (>1s / post), so we had to run it as a batch process each night. ● Goal: Make predictions locally with low latency, without losing the good prediction performance we already have. We solved this problem using the excellent open source, distributed machine learning library H2O by H2O.ai. Co-founder: Cliff Click, who made the Java HotSpot Server Compiler
  • 28. Post point prediction system: Current system. Diagram labels: Fuman DB; Feature Extraction; Prediction Service; Prediction POJO; Fuman Webapp. Data passed: posts, label; CSV format; get new post values; make feature vectors. Modeling steps: ● Train/Test split ● Categorical features transformation ● Distributed, fast and state of the art algorithms ● POJO prediction class generation
  • 30. Overview: Making Predictions ● Use the prediction POJO generated by H2O ● For each new post, query the Prediction Service ○ Convert to vector (Double[] for H2O) ○ Get prediction from prediction POJO (Double value, round to integer) ○ Update database with predicted price
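The per-post prediction step can be sketched as below. This does not use H2O's actual generated classes: the interface `PointModel` is a hypothetical stand-in for the prediction POJO, and clamping the result to the 0-25 range is an assumption based on the point range stated earlier in the talk.

```java
class PostPricing {
    /** Hypothetical stand-in for the generated prediction POJO. */
    interface PointModel {
        double predict(double[] features);
    }

    static final int MIN_POINTS = 0;
    static final int MAX_POINTS = 25;

    /** Scores one feature vector, rounds to an integer, and keeps it inside the valid range. */
    static int predictPoints(PointModel model, double[] features) {
        long rounded = Math.round(model.predict(features));
        return (int) Math.max(MIN_POINTS, Math.min(MAX_POINTS, rounded));
    }

    public static void main(String[] args) {
        // Toy model for demonstration only: half a point per character of the post.
        PointModel toyModel = features -> 0.5 * features[0];
        double[] featureVector = {25.0, 4.0, 1.0};
        System.out.println("Predicted points: " + predictPoints(toyModel, featureVector));
    }
}
```

Keeping the model behind a small interface like this also makes the service easy to test with a stub before the real model class is plugged in.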
  • 31. We reached the business goal! Project goal: Get similar performance from H2O as from DataRobot. H2O is not ideal for exploring different models and features, but for production, it is FAST with similar predictive performance. It is implemented in pure Java (Github). ● H2O: Train a new model for production ○ GBM (Gradient Boosting Machine) ○ MSE: 12.8 ● DataRobot’s best model ○ eXtreme Gradient Boosted Trees ○ RMSE: 3.54 ○ MSE: 12.53
  • 32. Real world ML loves Java! ● Java is a top choice for making production machine learning systems ● The benefits of Java 8 make Java fun and relevant again ● Integration in a Java web application was not hard ● Java is not a good choice for experimentation ○ Start with a Python prototype with Scikit-learn ○ Use a Machine Learning service like DataRobot.com
  • 33. You can use ML in your projects! ● Web API services are like a personal data scientist ○ No need for a Data Scientist for simple use of ML ○ But harder datasets will need expertise ● Real world ML projects need Engineers: ○ Get data to train a good model (log files, sales results, mail campaign results,…) ○ Transform data into input for an ML library or web service ○ Deploy and integrate into production ● Most steps are just normal programming ○ Get data from DB ○ Transform data into a CSV ○ Call a REST API or Java POJO to make predictions ○ Integrate with the system that needs predictions
  • 36. Feature engineering with streams and lambdas. The goal is to take raw data from the DB and create arrays of numerical or categorical features. 1. Get Fuman user post data from DB -> UserPost 2. Learn the vocabulary of word types across all user posts 3. Create the dataset: a. For each post, i. Add the statistics features ii. Add the word type features 4. Transform to CSV output (for DataRobot). Instances are Weka SparseInstance (sparse vectors for memory efficiency), but in retrospect, I think a specialized vector library would have been better. Weka is a terrible production library.
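Steps 2 to 4 above can be sketched with streams and lambdas as follows. This is an illustrative outline rather than the talk's live code: it uses plain dense CSV rows instead of Weka's `SparseInstance`, omits the statistics features, and the class `CsvDataset`, the `points` column name, and the sample tokens are all assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class CsvDataset {
    /** Step 2: learn the vocabulary, here the sorted set of distinct tokens over all posts. */
    static List<String> vocabulary(List<List<String>> posts) {
        return posts.stream()
                .flatMap(List::stream)
                .distinct()
                .sorted()
                .collect(Collectors.toList());
    }

    /** CSV header: one column per vocabulary entry plus the label column. */
    static String header(List<String> vocab) {
        return String.join(",", vocab) + ",points";
    }

    /** Steps 3 and 4: count each vocabulary entry in one post and emit a CSV row with its label. */
    static String row(List<String> tokens, int points, List<String> vocab) {
        Map<String, Long> counts = tokens.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        String features = vocab.stream()
                .map(word -> String.valueOf(counts.getOrDefault(word, 0L)))
                .collect(Collectors.joining(","));
        return features + "," + points;
    }

    public static void main(String[] args) {
        // Hypothetical word type tokens for two posts.
        List<String> post1 = Arrays.asList("KATAKANA", "HIRAGANA", "KATAKANA");
        List<String> post2 = Arrays.asList("KANJI", "HIRAGANA");
        List<String> vocab = vocabulary(Arrays.asList(post1, post2));
        System.out.println(header(vocab));
        System.out.println(row(post1, 12, vocab));
        System.out.println(row(post2, 3, vocab));
    }
}
```

Fixing the vocabulary order once and reusing it for every row is the important design point: the same column must mean the same feature at training time and at prediction time.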