The Naive Bayes classifier provided by Apache Spark out of the box is quite limited. It was intended primarily for document classification. It does not automatically handle continuous features, strings, null values, or any sort of visualization of the resulting model. In this session, you’ll learn about changes made to allow it to work on datasets with any sort of column types. More importantly, you’ll get a demonstration of how the model can be visualized in order to gain insight, and interactively apply it to new data.
Hear how, in order to bin continuous columns, ESI Group employs the open source Minimum Description Length Principal (MDLP) library developed by Sergio Ramirez based on a paper by Fayyad and Irani. It is capable of using Spark to do entropy-based binning on many columns simultaneously with respect to a specific target. Once continuous columns have been binned and indexed, they can be processed by Spark’s Naive Bayes algorithm.