
Real world machine learning with Java for



For the Japan Java User Group Fall 2015 conference

Published in: Engineering


  1. 1. Real World Machine Learning in Java 8. Mathieu Dumoulin, Chief Data Scientist and Data Science Team Manager at en-japan
  2. 2. Today’s menu ● About me and 不満買取センター ● The business problem: Post pricing ● Project Overview ○ Why use ML ○ How to use ML in projects ○ How we used ML in this project ● Results ● Live code (depends on time) ● Conclusion
  3. 3. Presentation goals ● Any Java engineer can do machine learning ● Java is a great programming language for real-world machine learning systems ● New ML APIs make it easy to focus on the problem and the data, and get a well-performing model “for free” ● You don’t need a Ph.D. to use machine learning, just some self-study, good tools and libraries, and experience built up one project at a time
  4. 4. About me
  5. 5. Google map for Quebec City here!
  6. 6. My Work: Java SE, Hadoop Engineer, Data Scientist
  7. 7. ● Launched in March 2015 as web, Android, and iOS applications. ● An application that collects data about people's dissatisfactions. ● Features: ○ Users can post about any dissatisfaction with any product or service. ○ Users earn points as a reward for their posts, and the points can be exchanged for coupon codes at e-commerce (EC) sites. ● 250,000 users and 1,500,000 posts in total (as of the end of November 2015)
  8. 8. Problem statement: post point value prediction ● Fuman user posts have monetary value ● We want to give more points for “good” posts ● At first, operations staff checked every post, but they can’t check 10,000 posts each day... We made rules, but the point values they produced were worse: ● Rules can’t check the content of the posts ● Rules always miss something ● Making hundreds or thousands of rules by hand is ridiculous
  9. 9. ML is the best solution for 不満買取センター ● ML problem: Estimate the point value of a user's post (0-25) ● Project goal: Estimate the value of posts to within 5 points of human judgement ● Data: All user posts and user profile data ● Data with known output (labels): staff have already set points manually for 200k posts. This is a classic case of supervised learning (Wiki). Another reference from Microsoft. Predicting a price requires a regression model because the prediction is a number, as opposed to a classification problem, which predicts which of a fixed set of classes each post belongs to.
  10. 10. Real world ML project overview ● Machine Learning Workflow ● Data Scientist and Java Engineer roles ● Java for production ML ● Java 8 benefits ● Our point prediction system details ● Results
  11. 11. Machine Learning Workflow ● Training: Load data → (data, labels (known result)) → Extract Features → (feature vectors, labels) → Train Model → (prediction, labels) → Evaluate vs. business goal, then iterate ● Prediction: Load new data → (data) → Extract Features (the same) → (feature vectors) → Predict using model (best model) → (predictions) → Act on prediction
  12. 12. Workflow for machine learning system 1. Set a goal with business value 2. Get data (fuman user posts) with a price already set 3. Transform data for input into machine learning algorithm 4. Train and evaluate machine learning models until the goal is reached 5. Deploy best model
  13. 13. Data Scientist’s role 1. Set a goal with business value 2. Get data (fuman user posts) with a price already set 3. Transform data for input into machine learning algorithm 4. Train and evaluate machine learning models until the goal is reached 5. Deploy best model ● Choose features ● Build many models
  14. 14. Software Engineer’s role: Implement and integrate into production system 1. Set a goal with business value 2. Get data (fuman user posts) with a price already set 3. Transform data for input into machine learning algorithm 4. Train and evaluate machine learning models until the goal is reached 5. Deploy best model ● Get data from data source ● Implement production code
  15. 15. But we don’t have a data scientist...
  16. 16. You can outsource!
  17. 17. Java for production ML ● Easy integration with Java applications ● Fast (vs. Python or R) ● Easy to program (vs. C++) ● Most common enterprise programming language, IDE support and excellent support libraries ● Lots of state of the art machine learning libraries have a Java API
  18. 18. Machine Learning libraries
  19. 19. Benefits of Java 8 ● Java 8’s functional style is a very good match with ML operations a. Feature extraction: data in → transform → data out ● Java 8’s streams and Lambdas a. Code is easier to understand and less verbose ● Easy parallel code a. Faster “for free”
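The "data in → transform → data out" style can be sketched as follows. This is a minimal illustration, not code from the talk: the `postLength` feature, the class name, and the sample posts are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class StreamFeatureSketch {

    // Hypothetical feature: the number of Unicode code points in a post.
    static int postLength(String post) {
        return post.codePointCount(0, post.length());
    }

    // Data in -> transform -> data out, with no explicit loop or mutable state.
    // Switching to parallelStream() is the only change needed to run in parallel.
    static List<Integer> extractLengths(List<String> posts, boolean parallel) {
        return (parallel ? posts.parallelStream() : posts.stream())
                .map(StreamFeatureSketch::postLength)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample posts are made up for illustration.
        List<String> posts = Arrays.asList("too slow", "高い", "rude staff");
        System.out.println(extractLengths(posts, false));
        System.out.println(extractLengths(posts, true));
    }
}
```

Because the mapping function has no side effects, the sequential and parallel versions return the same list in the same order. Whether the parallel version is actually faster depends on the data size and the cost of each feature.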
  20. 20. Post point prediction system: step by step ● Components: Fuman DB, Feature Extraction, DR Prediction API, Prediction Service ● Data exchanged: posts and labels, in CSV format ● Train/Test split ● Categorical features transformation ● Select best features ● Try many algorithms ● Tune algorithms ● Evaluate models ● REST Prediction API ● Iterate until results meet business goals
  21. 21. Feature Extraction details ● We added character and word statistics about each fuman user post ○ Number of hiragana, katakana, kanji, and alphabet characters and words ○ Number of words, length of words ○ Ratio of hiragana, katakana, kanji, and alphabet words to the number of tokens in a post ● User profile information ○ age, gender, job category, etc. ● Bag-of-words models: ○ Words using Tf-Idf, removing stopwords (これ、あれ、それ、です、など、 …, roughly “this”, “that over there”, “that”, “is”, “etc.”) ○ Part-of-speech (名詞、動詞、形容詞、 …, that is, noun, verb, adjective, …) ○ Word type features (hiragana word, katakana word, kanji word, …)
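The Tf-Idf bag-of-words feature can be sketched as below. This is one common variant (term frequency normalized by post length, natural-log inverse document frequency) and is an assumption, since the talk does not say which formula or library was used. The class name, method names, and sample tokens are hypothetical, and posts are assumed to be tokenized already.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class TfIdfSketch {

    // Stopwords listed on the slide; a real list would be longer.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("これ", "あれ", "それ", "です", "など"));

    static List<String> removeStopwords(List<String> tokens) {
        return tokens.stream()
                .filter(token -> !STOPWORDS.contains(token))
                .collect(Collectors.toList());
    }

    // Document frequency: in how many posts each token appears at least once.
    static Map<String, Integer> documentFrequency(List<List<String>> posts) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> tokens : posts) {
            for (String token : new HashSet<>(tokens)) {
                df.merge(token, 1, Integer::sum);
            }
        }
        return df;
    }

    // tf-idf(t, d) = (count of t in d / tokens in d) * ln(N / df(t)).
    // Assumes every token of the post was included when df was computed.
    static Map<String, Double> tfIdf(List<String> tokens, Map<String, Integer> df, int totalPosts) {
        Map<String, Double> weights = new HashMap<>();
        for (String token : tokens) {
            weights.merge(token, 1.0 / tokens.size(), Double::sum);
        }
        weights.replaceAll((token, tf) -> tf * Math.log((double) totalPosts / df.get(token)));
        return weights;
    }

    public static void main(String[] args) {
        // Made-up, pre-tokenized posts for illustration.
        List<List<String>> posts = Arrays.asList(
                removeStopwords(Arrays.asList("ポテト", "です", "冷たい")),
                removeStopwords(Arrays.asList("ポテト", "高い")));
        Map<String, Integer> df = documentFrequency(posts);
        System.out.println(tfIdf(posts.get(0), df, posts.size()));
    }
}
```

A token that appears in every post gets a weight of zero under this formula, which is why words shared by all posts contribute nothing to the feature vector.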
  22. 22. Feature Extraction: Example マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。 (“I asked for freshly fried McDonald's fries, but they weren't freshly fried.”)
  23. 23. Feature Example: MeCab analyzer マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。
      マック 名詞,固有名詞,一般,*,*,*,マック,マック,マック
      の 助詞,連体化,*,*,*,*,の,ノ,ノ
      ポテト 名詞,一般,*,*,*,*,ポテト,ポテト,ポテト
      揚げたて 名詞,一般,*,*,*,*,揚げたて,アゲタテ,アゲタテ
      で 助詞,格助詞,一般,*,*,*,で,デ,デ
      お願い 名詞,サ変接続,*,*,*,*,お願い,オネガイ,オネガイ
      し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
      た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
      のに 助詞,接続助詞,*,*,*,*,のに,ノニ,ノニ
      、 記号,読点,*,*,*,*,、,、,、
      揚げたて 名詞,一般,*,*,*,*,揚げたて,アゲタテ,アゲタテ
      じゃ 助詞,副助詞,*,*,*,*,じゃ,ジャ,ジャ
      なかっ 助動詞,*,*,*,特殊・ナイ,連用タ接続,ない,ナカッ,ナカッ
      た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
      。 記号,句点,*,*,*,*,。,。,。
      EOS
  24. 24. Feature Extraction: Example ● Character counts: Hiragana: 20, Katakana: 6, Kanji: 3, Alpha: 0, Digits: 0, Marks (!,?): 0 ● Token type counts: Hiragana: 8, Katakana: 2, Kanji: 3, Alpha: 0, Digits: 0, Marks: 0 ● Token length: 1: 5, 2: 2, 3: 4, 4: 2, 5+: 0
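The character counts for the example post can be reproduced with the JDK's `Character.UnicodeScript`, where kanji fall under the `HAN` script. This is a sketch of one way to compute them, not the talk's actual code; the class and method names are hypothetical.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

class CharacterStatsSketch {

    // Counts code points per Unicode script (HIRAGANA, KATAKANA, HAN for kanji, ...).
    // Punctuation such as 、 and 。 belongs to the COMMON script, as does the
    // prolonged sound mark ー, which a production version may want to treat as katakana.
    static Map<UnicodeScript, Integer> scriptCounts(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        text.codePoints()
            .mapToObj(UnicodeScript::of)
            .forEach(script -> counts.merge(script, 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        String post = "マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。";
        Map<UnicodeScript, Integer> counts = scriptCounts(post);
        System.out.println("Hiragana: " + counts.getOrDefault(UnicodeScript.HIRAGANA, 0));
        System.out.println("Katakana: " + counts.getOrDefault(UnicodeScript.KATAKANA, 0));
        System.out.println("Kanji: " + counts.getOrDefault(UnicodeScript.HAN, 0));
    }
}
```

For the example sentence this gives 20 hiragana, 6 katakana, and 3 kanji characters, matching the slide.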
  25. 25. Training and evaluation of our model
  26. 26. We reached the project goal! ● DataRobot’s best model ○ eXtreme Gradient Boosted Trees ○ RMSE: 3.54 ○ MSE: 12.53 ● Our result: about 3.5 points difference from human judgement (goal: less than 5) Business result: ● Higher-quality evaluation than rules ● Operations staff don’t need to check posts manually ● We can validate points every day
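The two error figures are related: RMSE is the square root of MSE, and √12.53 ≈ 3.54. A minimal sketch of both metrics follows. The class name and the sample point values are made up for illustration and are not data from the project.

```java
class RegressionErrorSketch {

    // Mean squared error between staff-assigned and predicted point values.
    static double meanSquaredError(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sum += diff * diff;
        }
        return sum / actual.length;
    }

    // Root mean squared error: same unit as the points themselves.
    static double rootMeanSquaredError(double[] actual, double[] predicted) {
        return Math.sqrt(meanSquaredError(actual, predicted));
    }

    public static void main(String[] args) {
        // Hypothetical values: points set by staff vs. points predicted by a model.
        double[] staffPoints = {10, 5, 20, 0};
        double[] modelPoints = {12, 5, 17, 1};
        System.out.println("MSE:  " + meanSquaredError(staffPoints, modelPoints));
        System.out.println("RMSE: " + rootMeanSquaredError(staffPoints, modelPoints));
    }
}
```

RMSE is the more natural number to compare against the "less than 5 points" goal because it is expressed in points rather than squared points.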
  27. 27. Deployment issues ● Problem: The Prediction API was very slow (>1s / post), so we had to run it as a batch process each night. ● We want: To make predictions locally with low latency, without losing the good prediction performance we already have. We solved this problem using the excellent open source, distributed machine learning library H2O. Co-founder: Cliff Click, who wrote the Java HotSpot Server Compiler
  28. 28. Post point prediction system: Current system ● Components: Fuman DB, Feature Extraction, Prediction Service, Prediction POJO, Fuman Webapp ● Data exchanged: posts and labels, in CSV format ● Train/Test split ● Categorical features transformation ● Distributed, fast and state of the art algorithms ● POJO prediction class generation ● Flow labels: get new post values, make feature vectors
  29. 29. Train Production Model: H2O
  30. 30. Overview: Making Predictions ● Use the prediction POJO generated by H2O ● For each new post, query the Prediction Service ○ Convert to vector (Double[] for H2O) ○ Get prediction from prediction POJO (Double value, round to integer) ○ Update database with predicted price
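The "get prediction, round to integer" step can be sketched as below. This does not use H2O's generated classes: the model is represented by a plain `ToDoubleFunction<double[]>` as a hypothetical stand-in for the prediction POJO, and the class and method names are made up. Clamping the result to the 0-25 range is also an assumption, based on the point range given in the problem statement.

```java
import java.util.function.ToDoubleFunction;

class PredictionStepSketch {

    // Point range from the problem statement (0-25); clamping is an assumption.
    static final int MIN_POINTS = 0;
    static final int MAX_POINTS = 25;

    // Round the raw regression output to the nearest integer and keep it in range.
    static int toPoints(double rawPrediction) {
        long rounded = Math.round(rawPrediction);
        return (int) Math.max(MIN_POINTS, Math.min(MAX_POINTS, rounded));
    }

    // 'model' stands in for the generated prediction class; it is not H2O's API.
    static int predictPoints(ToDoubleFunction<double[]> model, double[] featureVector) {
        return toPoints(model.applyAsDouble(featureVector));
    }

    public static void main(String[] args) {
        // Hypothetical model that always predicts 12.6 points.
        ToDoubleFunction<double[]> constantModel = features -> 12.6;
        System.out.println(predictPoints(constantModel, new double[] {20, 6, 3}));
    }
}
```

Keeping the model behind a small functional interface means the web application code does not change when the model is retrained and regenerated.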
  31. 31. We reached the business goal! Project goal: Get similar performance from H2O as from DataRobot. H2O is not ideal for exploring different models and features, but for production it is FAST, with similar predictive performance. It is implemented in pure Java (Github). ● H2O: Train a new model for production ○ GBM (Gradient Boosting Machine) ○ MSE: 12.8 ● DataRobot’s best model ○ eXtreme Gradient Boosted Trees ○ RMSE: 3.54 ○ MSE: 12.53
  32. 32. Real world ML loves Java! ● Java is a top choice for making production machine learning systems ● Java 8 makes Java fun and relevant again ● Integration in a Java web application was not hard ● Java is not a good choice for experimentation ○ Start with a Python prototype with Scikit-learn ○ Use a Machine Learning service like
  33. 33. You can use ML in your projects! ● Web API services are like a personal data scientist ○ No need for a Data Scientist for simple use of ML ○ But harder datasets will need expertise ● Real world ML projects need Engineers: ○ Get data to train a good model (log files, sales results, mail campaign results,…) ○ Transform data into input for ML library or web service ○ Deploy and integrate into production ● Most steps are just normal programming ○ Get data from DB ○ Transform data into a CSV ○ Call a REST API or Java POJO to make predictions ○ Integrate with the system that needs predictions
  34. 34. Questions?
  35. 35. Live code
  36. 36. Feature engineering with streams and lambdas. The goal is to take raw data from the DB and create arrays of numerical or categorical features. 1. Get Fuman user post data from DB -> UserPost 2. Learn the vocabulary of word types from all user posts 3. Create the dataset: for each post, a. add the statistics features b. add the word type features 4. Transform to CSV output (for DataRobot). Instances are Weka SparseInstance objects (sparse vectors for memory efficiency), but in retrospect I think a specialized vector library would have been better. Weka is a terrible production library.
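Steps 2 to 4 can be sketched with streams and lambdas as follows. This is an illustration under stated assumptions, not the live-coded program: the `UserPost` class here is a hypothetical stand-in holding an id and pre-tokenized text, the word type rule (any kanji makes a "KANJI" word, otherwise the script of the first character) is inferred from the example counts, and plain strings replace Weka's `SparseInstance`. Tokens are assumed to be non-empty.

```java
import java.lang.Character.UnicodeScript;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class FeatureCsvSketch {

    // Hypothetical stand-in for the talk's UserPost: an id plus pre-tokenized text.
    static final class UserPost {
        final int id;
        final List<String> tokens;
        UserPost(int id, List<String> tokens) { this.id = id; this.tokens = tokens; }
    }

    // Assumed rule: a token containing any kanji is a "KANJI" word; otherwise it is
    // named after the script of its first character (HIRAGANA, KATAKANA, COMMON, ...).
    static String wordType(String token) {
        boolean hasKanji = token.codePoints()
                .anyMatch(cp -> UnicodeScript.of(cp) == UnicodeScript.HAN);
        return hasKanji ? "KANJI" : UnicodeScript.of(token.codePointAt(0)).name();
    }

    // Step 2: learn the vocabulary (sorted, distinct word types over all posts).
    static List<String> learnVocabulary(List<UserPost> posts) {
        return posts.stream()
                .flatMap(post -> post.tokens.stream())
                .map(FeatureCsvSketch::wordType)
                .distinct()
                .sorted()
                .collect(Collectors.toList());
    }

    // Step 3: one row per post: id, token count, then one count per vocabulary entry.
    static String toCsvRow(UserPost post, List<String> vocabulary) {
        Map<String, Long> typeCounts = post.tokens.stream()
                .collect(Collectors.groupingBy(FeatureCsvSketch::wordType, Collectors.counting()));
        String counts = vocabulary.stream()
                .map(type -> String.valueOf(typeCounts.getOrDefault(type, 0L)))
                .collect(Collectors.joining(","));
        return post.id + "," + post.tokens.size() + "," + counts;
    }

    // Step 4: header plus rows, joined into CSV text.
    static String toCsv(List<UserPost> posts) {
        List<String> vocabulary = learnVocabulary(posts);
        String header = "id,tokens," + String.join(",", vocabulary);
        String rows = posts.stream()
                .map(post -> toCsvRow(post, vocabulary))
                .collect(Collectors.joining("\n"));
        return header + "\n" + rows;
    }

    public static void main(String[] args) {
        List<UserPost> posts = Arrays.asList(
                new UserPost(1, Arrays.asList("マック", "の", "ポテト")),
                new UserPost(2, Arrays.asList("お願い", "し", "た")));
        System.out.println(toCsv(posts));
    }
}
```

Learning the vocabulary once and reusing it for every row keeps the column order fixed, which matters because the same column layout has to be produced again at prediction time.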