
Talk about Hivemall at Data Scientist Organization on 2015/09/17

Talk about Hivemall at Data Scientist Organization
http://eventdots.jp/event/569291

Published in: Data & Analytics

1. Introduction to Machine Learning using Hivemall. Makoto YUI (@myui) <myui@treasure-data.com>, Research Engineer. 2015/09/17 Talk@Japan DataScientist Society
  2. 2. Ø 2015.04  Joined  Treasure  Data,  Inc. 1st Research  Engineer  in  Treasure  Data My  mission  in  TD  is  developing  ML-­‐as-­‐a-­‐Service Ø 2010.04-­‐2015.03  Senior  Researcher  at  National   Institute  of  Advanced  Industrial  Science  and   Technology,  Japan.   Worked  on  a  large-­‐scale  Machine  Learning  project   and  Parallel  Databases   Ø 2009.03  Ph.D.  in  Computer  Science  from  NAIST Ø Super  programmer  award  from  the  MITOU   Foundation   Super  creators  in  TD:    Sada  Furuhashi,  Keisuke  Nishida Who  am    I  ? 2014/09/17  Talk@Japan  DataScientist  Society 2
3. Agenda: 1. What is Hivemall; 2. Why Hivemall (motivations etc.); 3. Hivemall internals; 4. How to use Hivemall (logistic regression with RDBMS integration, matrix factorization, anomaly detection demo, random forest demo).
4. What is Hivemall: a scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2. https://github.com/myui/hivemall
5. What is Hivemall (where it sits in the stack): Hivemall (machine learning) runs on top of Hive / Pig (query processing), which run on MapReduce v1, Apache Tez (DAG processing), or MR v2 (parallel data processing frameworks), under Apache YARN (resource management), over Hadoop HDFS (distributed file system).
6. MapReduce vs. a DAG engine (Tez / Spark): plain MapReduce writes intermediate results to HDFS between every map/reduce stage, while a DAG engine chains the stages with no intermediate HDFS reads/writes. (figure: job pipelines with and without intermediate HDFS I/O)
7. Won IDG InfoWorld's Bossie Awards 2014: The best open source big data tools. InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem. bit.ly/hivemall-award
8. List of features in Hivemall v0.3.2 (Treasure Data supports Hivemall v0.3.2-3). Classification (both binary and multi-class): Perceptron, Passive Aggressive (PA), Confidence Weighted (CW), Adaptive Regularization of Weight Vectors (AROW), Soft Confidence Weighted (SCW), AdaGrad+RDA. Regression: Logistic Regression (SGD), PA Regression, AROW Regression, AdaGrad, AdaDelta. kNN and recommendation: MinHash and b-Bit MinHash (LSH variants), similarity search using k-NN (Euclid/Cosine/Jaccard/Angular), Matrix Factorization. Feature engineering: feature hashing, feature scaling (normalization, z-score), TF-IDF vectorizer, polynomial expansion. Anomaly detection: Local Outlier Factor.
9. Hivemall supports state-of-the-art online learning algorithms (for classification and regression). Classification accuracy on news20.binary (higher is better): Perceptron 0.9460; Passive-Aggressive (a.k.a. online SVM) 0.9604; LibLinear 0.9636; LibSVM/TinySVM 0.9643; Confidence Weighted (CW) 0.9656; AROW [1] 0.9660; SCW [2] 0.9662. CW variants are very smart online ML algorithms.
10. Why are CW variants so good? Suppose a binary classification setting: classify sentences as positive or negative, learning a weight for each word (each word is a feature). Example: "I like this author" (positive); "I like this author, but found this book dull" (negative). A naive update reduces both W_like and W_dull at the same rate; CW variants adjust the weights at different rates.
11. Why are CW variants so good? (cont.) A plain online learner adjusts only a feature's weight; CW variants adjust both the weight and a per-feature confidence (covariance), so how far a weight moves depends on how confident the model already is about that feature. (figure: adjusting a weight vs. adjusting a weight and its confidence)
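The per-feature confidence idea above can be sketched in a few lines of Python. This is an illustrative AROW-like update rule under simplified assumptions, not Hivemall's actual implementation; the function name and parameters are hypothetical.

```python
# Simplified sketch of the confidence-weighted idea: each feature keeps a
# weight AND a variance (confidence). Frequently seen features have low
# variance and receive small updates; rare features keep high variance and
# move fast. Illustrative AROW-like rule, not Hivemall's exact code.
from collections import defaultdict

def arow_like_update(w, sigma, x, y, r=1.0):
    """One update for a binary example. x: {feature: value}, y: +1 or -1."""
    margin = y * sum(w[f] * v for f, v in x.items())
    if margin >= 1.0:                      # confident correct prediction: skip
        return
    conf = sum(sigma[f] * v * v for f, v in x.items())
    beta = 1.0 / (conf + r)
    alpha = (1.0 - margin) * beta
    for f, v in x.items():
        w[f] += alpha * y * sigma[f] * v          # low-variance features move less
        sigma[f] -= beta * (sigma[f] * v) ** 2    # seeing a feature shrinks its variance

w = defaultdict(float)
sigma = defaultdict(lambda: 1.0)
# "I like this author" -> positive; "... found this book dull" -> negative
arow_like_update(w, sigma, {"i": 1, "like": 1, "this": 1, "author": 1}, +1)
arow_like_update(w, sigma, {"i": 1, "like": 1, "this": 1, "book": 1, "dull": 1}, -1)
print(w["like"], w["dull"])  # "like" is penalized less than the fresh feature "dull"
```

After the negative sentence, the previously seen word "like" moves by less than the new word "dull", exactly the behavior the slide attributes to CW variants.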
12. Features to be supported from Hivemall v0.4 (planned for release in October): 1. RandomForest (classification, regression); 2. Gradient Tree Boosting (classification, regression); 3. Factorization Machine (classification, regression/factorization); 4. Online LDA (topic modeling, clustering). Gradient boosting and factorization machines are often used by data science competition winners (very important for practitioners).
13. Factorization Machine vs. Matrix Factorization (figure).
14. Factorization Machine: context information (e.g., time) can be taken into account. Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf
15. Factorization Machine: the factorization model with degree=2 (2-way interactions) combines a global bias, a regression coefficient for each j-th variable, and factorized pairwise interactions: y(x) = w0 + sum_j w_j x_j + sum_{i<j} <v_i, v_j> x_i x_j.
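The degree-2 model on this slide can be evaluated efficiently: the pairwise term reduces to 0.5 * sum_f ((sum_j v_{j,f} x_j)^2 - sum_j v_{j,f}^2 x_j^2), which is O(k*n) instead of O(n^2). A small sketch with made-up parameter values (all names and numbers here are illustrative, not Hivemall's API):

```python
# Degree-2 factorization machine prediction (Rendle 2010):
#   y(x) = w0 + sum_j w_j x_j + sum_{i<j} <v_i, v_j> x_i x_j
# computed via the O(k*n) reformulation of the pairwise term.
def fm_predict(w0, w, V, x):
    """w0: global bias, w: per-feature weights, V: n x k factor matrix."""
    linear = sum(w[j] * x[j] for j in range(len(x)))
    k = len(V[0])
    pairwise = 0.0
    for f in range(k):
        s = sum(V[j][f] * x[j] for j in range(len(x)))
        s2 = sum((V[j][f] * x[j]) ** 2 for j in range(len(x)))
        pairwise += 0.5 * (s * s - s2)   # equals sum over i<j of <v_i,v_j> x_i x_j
    return w0 + linear + pairwise

w0 = 0.1
w = [0.2, -0.1, 0.0]
V = [[0.1, 0.2], [0.3, -0.1], [0.2, 0.4]]   # 3 features, k=2 latent factors
x = [1.0, 2.0, 0.5]
print(fm_predict(w0, w, V, x))  # 0.19 (up to float rounding)
```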
  16. 16. Ø CTR  prediction  of  Ad  click  logs • Algorithm:  Logistic  regression • Freakout Inc.  and  more Ø Gender  prediction  of  Ad  click  logs • Algorithm:  Classification • Scaleout Inc. Ø Churn  Detection • Algorithm:  Regression • OISIX  and  more Ø Item/User  recommendation • Algorithm:  Recommendation  (Matrix  Factorization  /  kNN)   • Wish.com,  Adtech Company,  Real-­‐estate  Portal,  and  more Ø Value  prediction  of  Real  estates • Algorithm:    Regression • Livesense Industry  use  cases  of  Hivemall 162014/09/17  Talk@Japan  DataScientist  Society
17. Agenda: 1. What is Hivemall; 2. Why Hivemall (motivations etc.); 3. Hivemall internals; 4. How to use Hivemall.
18. Why Hivemall: 1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was inefficient and annoying, and Python is not as scalable as Hive. 2. Why not run ML algorithms inside Hive? Fewer components to manage, and more scalable. That's why I built Hivemall.
19. Data movement in data analytics: event data flows through data collection, into a data lake (Amazon S3), through data processing (Amazon EMR), into a data mart (Redshift, Amazon RDS), and finally into data analysis, yielding insights and decisions. Data engineers own the collection and processing stages; data scientists own the analysis.
20. Data movement in data analytics (cont.): what data scientists actually do vs. what data scientists should do. Hive is a great data-preprocessing tool thanks to its ease of use, efficiency, and scalability for joins, filtering, and selection.
21. How I used to do ML projects before Hivemall: given raw data stored on Hadoop HDFS / S3, run Extract-Transform-Load to produce a feature-vector file (e.g., height:173cm, weight:60kg, age:34, gender:man, ...), then run machine learning on it.
22. How I used to do ML projects before Hivemall (cont.): the ETL step requires expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory).
23. How I used to do ML projects before Hivemall (cont.): the ML step does not scale, and you have to learn R/Python APIs.
24. How I used to do ML before Hivemall (cont.): this workflow did not meet my needs in terms of scalability, ML algorithms, and usability. And I ❤ scalable SQL queries.
25. Survey of existing ML frameworks and their user interfaces: Mahout: Java API programming. Spark MLlib/MLI: Scala API programming, Scala shell (REPL). H2O: R programming, GUI. Cloudera Oryx: HTTP REST API programming. Vowpal Wabbit (w/ Hadoop streaming): C++ API programming, command line. Existing distributed machine learning frameworks are NOT easy to use.
26. Motivation: people are saying that machine learning needs to become easier for developers (especially data engineers)!
27. Hivemall's vision: ML on SQL. Machine learning made easy for SQL developers (ML for the rest of us), with interactive and stable APIs behind a SQL abstraction. Instead of, e.g., classification with Mahout, you write:

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

This SQL query automatically runs in parallel on Hadoop.
28. Agenda: 1. What is Hivemall; 2. Why Hivemall (motivations etc.); 3. Hivemall internals; 4. How to use Hivemall.
29. How Hivemall works in training: machine learning algorithms are implemented as user-defined table-generating functions (UDTFs); a UDTF is a function that returns a relation. Training tuples <label, array<features>> are fed to parallel train UDTF instances, shuffled by feature, and merged (param-mix) into the prediction model, a relation of <feature, weight>. The numbers of mappers and reducers are configurable. Parallelism is powerful.
30. Why not a UDAF? Machine learning as an aggregate function forms a merge tree over the training table: e.g., 4 train ops run in parallel, then 2 merge ops (array<sum of weight>, array<count>), then a final merge with no parallelism. The final merge is the bottleneck: throughput is limited by its fan-in, memory consumption grows, and parallelism decreases.
31. A problem I faced: iterations. Iterations are mandatory to get a good prediction model, but MapReduce is not suited for them because the input/output of each MR job goes through HDFS (iter. 1: HDFS read, HDFS write; iter. 2: HDFS read, HDFS write; ...). Spark avoids this with in-memory computation.
32. Training with iterations in Spark: the logistic regression example of Spark:

val data = spark.textFile(...).map(readPoint).cache()  // each node loads data in memory once
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)  // repeated MapReduce steps to do gradient descent
  w -= gradient
}

But this is just a toy example! Why? The input to the gradient computation should be shuffled for each iteration; without shuffling, more iterations are required.
33. What does MLlib actually do? Mini-batch gradient descent with sampling (GradientDescent.scala, bit.ly/spark-gd):

val data = ..
for (i <- 1 to numIterations) {
  val sampled = ..   // sample a subset of the data (partitioned RDD)
  val gradient = ..  // average the subgradients over the sampled data using Spark MapReduce
  w -= gradient
}

Iterations are mandatory for convergence because each iteration uses only a small fraction of the data.
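The sample-then-average loop above can be sketched as a toy single-machine analogue. This is plain Python for illustration only; MLlib distributes the gradient computation with Spark MapReduce, and all names and constants here are made up.

```python
# Toy mini-batch gradient descent for logistic regression: each iteration
# samples a fraction of the data and averages the subgradients over the
# sample, mirroring the MLlib loop sketched above.
import math, random

def train(data, num_iterations=200, fraction=0.5, lr=0.5, seed=42):
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(num_iterations):
        sampled = [p for p in data if rng.random() < fraction]  # sample a subset
        if not sampled:
            continue
        grad = [0.0] * len(w)
        for x, y in sampled:  # accumulate subgradients over the mini-batch
            margin = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-margin))
            for j, xj in enumerate(x):
                grad[j] += (p - y) * xj
        w = [wi - lr * g / len(sampled) for wi, g in zip(w, grad)]
    return w

# Linearly separable toy data: label is 1 iff x0 > x1.
data = [([1.0, 0.2], 1), ([0.9, 0.1], 1), ([0.2, 1.0], 0), ([0.1, 0.8], 0)]
w = train(data)
print(w)  # w[0] > 0 > w[1]
```

Because each step sees only a random fraction of the data, many iterations are needed for convergence, which is exactly the slide's point.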
34. An alternative approach in Hivemall: Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without multiple MapReduce steps:

SET hivevar:xtimes=3;
CREATE VIEW training_x3 as
SELECT * FROM (
  SELECT amplify(${xtimes}, *) as (rowid, label, features)
  FROM training
) t
CLUSTER BY rand()
35. Map-only shuffling and amplifying: the rand_amplify UDTF randomly shuffles the input rows within each map task:

CREATE VIEW training_x3 as
SELECT rand_amplify(${xtimes}, ${shufflebuffersize}, *)
  as (rowid, label, features)
FROM training;
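What amplification buys can be sketched in a few lines: duplicating the training rows xtimes and shuffling them lets a single-pass (online) learner see the data roughly as if it had run xtimes epochs, without extra MapReduce jobs. Illustrative Python only; in Hivemall this happens inside the Hive query plan.

```python
# Sketch of amplify + CLUSTER BY rand(): emit each training row xtimes,
# then shuffle the amplified rows into a random order.
import random

def amplify(rows, xtimes, seed=7):
    out = [r for r in rows for _ in range(xtimes)]  # emit each row xtimes
    random.Random(seed).shuffle(out)                # CLUSTER BY rand() analogue
    return out

rows = [(1, "+1", "a:1 b:1"), (2, "-1", "a:1 c:1"), (3, "+1", "b:1")]
x3 = amplify(rows, 3)
print(len(x3))  # 9: every row appears exactly 3 times, in random order
```

rand_amplify does the same thing, but the shuffle happens within each map task's buffer (map-local) instead of globally, avoiding the extra shuffle stage.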
36. Detailed plan with map-local shuffle: the rand_amplify operator is interleaved between the table scan and the training operator, so scanned entries are amplified and then shuffled as a pipeline operation. Each map task runs table scan, rand amplifier, logress UDTF, partial aggregate, and map write; outputs are shuffled (distributed by feature) to reduce tasks, which merge, aggregate, and write.
37. Performance effects of amplifiers (elapsed time, AUC): plain: 89.718 s, 0.734805; amplifier + CLUSTER BY (a.k.a. global shuffle): 479.855 s, 0.746214; rand_amplify (a.k.a. map-local shuffle): 116.424 s, 0.743392. With the map-local shuffle, prediction accuracy improved with an acceptable overhead.
38. Agenda: 1. What is Hivemall; 2. Why Hivemall (motivations etc.); 3. Hivemall internals; 4. How to use Hivemall.
39. How to use Hivemall, step 1: data preparation. Training takes labeled feature vectors and produces a prediction model; prediction applies that model to new feature vectors to assign labels.
40. How to use Hivemall, data preparation: define a Hive table for the training/testing data:

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
41. How to use Hivemall, step 2: feature engineering, applied before training.
42. How to use Hivemall, feature engineering: applying min-max normalization to transform a label value to a value between 0.0 and 1.0:

create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(target, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
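The rescale step is simple min-max scaling; a one-function sketch of the arithmetic (the degenerate-range fallback is a choice made for this sketch, not necessarily Hivemall's behavior):

```python
# Min-max rescaling: map a value into [0.0, 1.0] given the observed
# minimum and maximum label values.
def rescale(value, min_value, max_value):
    if max_value == min_value:   # degenerate range: arbitrary midpoint here
        return 0.5
    return (value - min_value) / (max_value - min_value)

print(rescale(3.0, 1.0, 5.0))  # 0.5
```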
43. How to use Hivemall, step 3: training, which learns a prediction model from the labeled feature vectors.
44. How to use Hivemall, training by logistic regression:

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train -- map-only task to learn a prediction model
) t
GROUP BY feature -- map outputs are shuffled to reducers by feature
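The query's GROUP BY feature / avg(weight) pattern is per-feature model averaging over the mappers' partial models. A minimal sketch of that reduce side (illustrative Python; names are made up):

```python
# Model averaging as done by the reducers: each map task emits a partial
# (feature, weight) relation; grouping by feature and averaging yields
# the final model.
from collections import defaultdict

def average_models(partial_models):
    sums, counts = defaultdict(float), defaultdict(int)
    for model in partial_models:          # one (feature, weight) relation per mapper
        for feature, weight in model:
            sums[feature] += weight
            counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}

mapper1 = [("a", 0.25), ("b", -0.25)]
mapper2 = [("a", 0.75), ("c", 0.125)]
print(average_models([mapper1, mapper2]))  # {'a': 0.5, 'b': -0.25, 'c': 0.125}
```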
45. How to use Hivemall, training a Confidence Weighted classifier:

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight -- vote on whether to average the negative or positive weights (e.g., +0.7, +0.3, +0.2, -0.1, +0.7)
FROM (
  SELECT train_cw(features, label) as (feature, weight) -- training for the CW classifier
  FROM news20b_train
) t
GROUP BY feature
46. Ensemble learning for stable prediction performance: just stack prediction models with UNION ALL:

create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from (
  select train_multiclass_cw(addBias(features), label) as (label, feature, weight)
  from news20mc_train_x3
  union all
  select train_multiclass_arow(addBias(features), label) as (label, feature, weight)
  from news20mc_train_x3
  union all
  select train_multiclass_scw(addBias(features), label) as (label, feature, weight)
  from news20mc_train_x3
) t
group by label, feature;
47. How to use Hivemall, step 4: prediction, which applies the prediction model to new feature vectors.
48. How to use Hivemall, prediction: prediction is done by a LEFT OUTER JOIN between the test data and the prediction model; there is no need to load the entire model into memory:

CREATE TABLE lr_predict as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid
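The join-based prediction can be sketched as follows, assuming binary features for simplicity (real feature vectors carry values that would multiply the weights; this toy version and its names are illustrative):

```python
# Prediction by join: explode each test row into (rowid, feature) pairs,
# join against the (feature, weight) model, sum matched weights per rowid,
# and apply a sigmoid -- mirroring the LEFT OUTER JOIN query above.
import math
from collections import defaultdict

def predict(test_rows, model):
    """test_rows: {rowid: [features]}, model: {feature: weight}."""
    scores = defaultdict(float)
    for rowid, features in test_rows.items():
        for f in features:                      # exploded (rowid, feature) pairs
            scores[rowid] += model.get(f, 0.0)  # LEFT OUTER JOIN: missing -> 0
    return {r: 1.0 / (1.0 + math.exp(-s)) for r, s in scores.items()}

model = {"a": 2.0, "b": -1.0}
probs = predict({1: ["a", "b"], 2: ["b", "unknown"]}, model)
print(probs[1] > 0.5, probs[2] < 0.5)  # True True
```

Because the join streams over the model table, only the matching weights are touched per row, which is why the full model never has to fit in memory.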
49. How to use Hivemall, exporting the prediction model: batch training runs on Hadoop, the prediction model is exported, and online prediction runs on an RDBMS.
50. Real-time prediction on Treasure Data: run the batch training job periodically, export the model periodically, and serve real-time predictions from an RDBMS.
51. Agenda: 1. What is Hivemall; 2. Why Hivemall (motivations etc.); 3. Hivemall internals; 4. How to use Hivemall.
52. Supervised learning: recommendation. Rating prediction of a matrix, applicable to user/item recommendation.
53. Matrix factorization: factorize a matrix into a product of matrices with k latent factors.
54. Criteria of biased MF: the predicted rating combines the mean rating, a bias for each user/item, and the factorization term, and the objective adds regularization: minimize the squared error of r(u,i) = mu + b_u + b_i + p_u . q_i plus a regularization penalty.
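The biased-MF scoring rule named on the slide is easy to state concretely. A sketch with illustrative numbers (all values are made up):

```python
# Biased matrix factorization prediction:
#   rating(u, i) = mu + b_u + b_i + dot(p_u, q_i)
# i.e., mean rating + user bias + item bias + latent-factor interaction.
def predict_rating(mu, b_u, b_i, p_u, q_i):
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

mu = 3.5                 # global mean rating
b_u, b_i = 0.2, -0.4     # user and item biases
p_u = [0.5, -0.1]        # k=2 latent factors for the user
q_i = [0.4, 0.3]         # k=2 latent factors for the item
print(predict_rating(mu, b_u, b_i, p_u, q_i))  # about 3.47
```

Training minimizes the squared error between observed ratings and this prediction, plus a regularization term over the biases and factor vectors.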
55. Training of matrix factorization: iterative training is supported using a local disk cache.
56. Prediction of matrix factorization.
  57. 57. ØAlgorithm  is  different Spark:  ALS-­‐WR   (considers  regularization) Hivemall:  Biased-­‐MF   (considers  regularization  and  biases) ØUsability Spark:  100+  line  Scala  coding Hivemall:  SQL ØPrediction  Accuracy Almost  same  for  MovieLens 10M  datasets 2014/09/17  Talk@Japan  DataScientist  Society 57 Comparison  to  Spark  MLlib
58. Unsupervised learning: anomaly detection (sensor data etc.); anomaly detection runs as a series of SQL queries. Example input (rowid, features): 1: ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]; 2: ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]; 3: ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]; ...
59. Anomalies in sensor data. Source: https://codeiq.jp/q/207
60. Local Outlier Factor (LOF): the basic idea of LOF is to compare the local density of a point with the densities of its neighbors. Image source: https://en.wikipedia.org/wiki/Local_outlier_factor
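The density-comparison idea can be made concrete with a minimal LOF computation. This is a plain-Python sketch of the standard LOF definition (k-distance, reachability distance, local reachability density), not Hivemall's SQL implementation; the data points are made up.

```python
# Minimal Local Outlier Factor: a point's LOF is the ratio of its neighbors'
# local reachability density to its own. LOF >> 1 means the point sits in a
# sparser region than its neighbors, i.e., it is likely an outlier.
import math

def knn(points, i, k):
    d = sorted((math.dist(points[i], p), j) for j, p in enumerate(points) if j != i)
    return [j for _, j in d[:k]], d[k - 1][0]     # neighbor ids, k-distance

def lof(points, i, k=3):
    neighbors, _ = knn(points, i, k)
    def lrd(j):  # local reachability density of point j
        nbrs, _ = knn(points, j, k)
        reach = [max(knn(points, m, k)[1], math.dist(points[j], points[m]))
                 for m in nbrs]                   # reachability distances
        return k / sum(reach)
    return sum(lrd(j) for j in neighbors) / (k * lrd(i))

# A tight cluster plus one far-away point:
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = [lof(pts, i) for i in range(len(pts))]
print(scores[4] > max(scores[:4]))  # True: the distant point has the largest LOF
```

Cluster points score LOF near 1 (their density matches their neighbors'), while the isolated point at (10, 10) scores far above 1.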
61. DEMO: Local Outlier Factor on the sensor dataset shown above (rowid, features rows).
62. RandomForest in Hivemall v0.4: an ensemble of decision trees. Already available on a development (smile) branch, and its usage is explained in the project wiki.
63. Training of RandomForest.
64. Out-of-bag tests and variable importance.
65. Prediction of RandomForest.
66. Jupyter integration (DEMO).
67. Conclusion and takeaway. Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs: for SQL users who need ML, for those already using Hive, designed with ease of use and scalability in mind. It requires no coding, packaging, or compiling, and introduces no new programming language or APIs. Hivemall's positioning: v0.4 will make a developmental leap.
68. Announcement: Hivemall meetup. The first meetup on 5/12 featured use-case talks by FreakOut and Scaleout; the second on 10/20 (Tue) will feature use-case talks by OISIX and Livesense. Registration opens soon on dots.
69. Beyond Query-as-a-Service! We ❤ open source! We invented .. We are hiring machine learning engineers!
