Exploring Language Classification with Spark and the Spark Notebook

In this presentation and the linked notebooks, we learn the basics of creating a machine learning classifier from scratch, using language classification as a running example. We start by implementing the naive intuition that letter frequency could provide a model for language classification, and then implement the n-gram approach from the paper by Cavnar and Trenkle.
In the corresponding notebook, we create a Spark ML Transformer from the n-gram model that can be used to classify text in a Dataset or DataFrame.
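
The naive letter-frequency intuition can be sketched in a few lines. The linked notebooks are written in Scala for the Spark Notebook; what follows is only a standalone, plain-Python illustration and not the notebooks' implementation. The function names (`letter_frequencies`, `distance`, `classify_by_letters`), the choice of Euclidean distance, and the two toy training sentences are all assumptions made for this example.

```python
import math
from collections import Counter
from typing import Dict


def letter_frequencies(text: str) -> Dict[str, float]:
    """Relative frequency of each alphabetic character in `text` (case-folded)."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {ch: n / total for ch, n in counts.items()}


def distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Euclidean distance between two frequency profiles (an assumed metric)."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def classify_by_letters(text: str, profiles: Dict[str, Dict[str, float]]) -> str:
    """Return the label whose profile is closest to the text's letter frequencies."""
    freqs = letter_frequencies(text)
    return min(profiles, key=lambda lang: distance(freqs, profiles[lang]))


# Toy training samples, made up for illustration only.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and then the fox sleeps",
    "es": "el veloz zorro marrón salta sobre el perro perezoso y luego duerme",
}
profiles = {lang: letter_frequencies(text) for lang, text in samples.items()}
```

With realistic training text per language, `classify_by_letters(some_text, profiles)` picks the language whose letter distribution is nearest. Samples this small are far too short to give reliable profiles.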

Published in: Software

  1. Exploring Language Classification With Apache Spark and the Spark Notebook: A practical introduction to interactive Data Engineering. Gerard Maas
  2. Gerard Maas: Lead Engineer @ Kensu; Computer Engineer; Scala Programmer; Early Spark Adopter; Spark Notebook Dev; Cassandra MVP (2015, 2016); Stack Overflow Top Contributor (Spark, Spark Streaming, Scala); Wannabe IoT Hacker; Arduino Enthusiast. @maasg, https://github.com/maasg, https://www.linkedin.com/in/gerardmaas/, https://stackoverflow.com/users/764040/maasg
  3. DATA SCIENCE GOVERNANCE: Adalog helps enterprises ensure that data pipelines keep delivering their value by combining the contextual information captured when a pipeline was created with the evolving environment in which the pipeline executes. CONNECT - COLLECT - LEARN
  4. Language Classification
  5. Language Classification: Some inspiration...
  6. What is a language? How is it composed?
  7. Letter Frequency: Could we characterize a language by calculating the relative frequency of letters in some text? [Figure: Spanish vs English letter frequency]
  8. n-grams of "cavnar and trenkle" (with "_" standing for a space). bi-grams: ca,av,vn,na,ar,r_,_a,an,nd,d_,_t,tr,re,en,nk,kl,le,e_ tri-grams: cav,avn,vna,nar,ar_,r_a,_an,and,nd_,d_t,_tr,tre,ren,enk,nkl,kle,le_ quad-grams: cavn,... Could we characterize a language by calculating the relative frequency of sequences of letters in some text? Paper: http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf
  9. Tech
  10. Spark APIs. RDD (Resilient Distributed Dataset): lazy, functional-oriented, low-level API; basis for execution of all high-level libraries. DataFrame: column-oriented, SQL-inspired DSL; many optimizations under the hood (Catalyst, Tungsten). Dataset: best of both worlds (except …)
  11. Spark Notebook: A dynamic and visual web-based notebook for Spark with Scala
  12. Spark Notebook - Open Source Roadmap 2017 (Q1, Q2, Q3): GIT, Kerberos, Project Generator. Announcements: blog.kensu.io
  13. Notebooks: Notebooks for this presentation are located at https://github.com/maasg/spark-notebooks - have fun!
  14. Notebook 1: Naive Language Classification. https://github.com/maasg/spark-notebooks/languageclassification/language-detection-letter-freq.snb Implements the idea of using a letter frequency model to classify the language of a document. Uses the dataset found in https://github.com/maasg/spark-notebooks/languageclassification/data/ It also produces a training set of sampled strings that is reused for the n-gram classifier. (Note: this notebook is missing a function that is left as an exercise to the reader. The folder /solutions contains the full working version.)
  15. Notebook 2: n-gram Language Classification. https://github.com/maasg/spark-notebooks/languageclassification/n-gram-language-classification.snb Implements the n-gram algorithm described in the Cavnar and Trenkle paper. Uses the dataset found in https://github.com/maasg/spark-notebooks/languageclassification/data/ Uses the resulting classifier to implement a custom Spark ML Transformer that can be easily used to classify new texts. Transformers can be combined into Spark ML Pipelines of arbitrary complexity. (Note: this notebook is missing a function that is left as an exercise to the reader. The folder /solutions contains the full working version.)
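
The n-gram idea from slide 8 and the rank-based comparison of profiles can be sketched as follows. This is a standalone, plain-Python illustration under stated assumptions, not the code from the Scala notebooks. The trailing "_" padding is inferred from the n-gram listing on slide 8. The names `ngrams`, `profile`, `out_of_place` and `classify_by_ngrams`, the parameters `max_n=3` and `top=300`, the penalty used for n-grams missing from a language profile, and the toy training sentences are all hypothetical choices for this example.

```python
from collections import Counter
from typing import Dict, List


def ngrams(text: str, n: int) -> List[str]:
    """Character n-grams of `text`, with spaces shown as "_" and a trailing "_"."""
    padded = "_".join(text.lower().split()) + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def profile(text: str, max_n: int = 3, top: int = 300) -> Dict[str, int]:
    """Map each of the `top` most frequent n-grams (n = 1..max_n) to its rank."""
    counts = Counter(g for n in range(1, max_n + 1) for g in ngrams(text, n))
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(doc: Dict[str, int], lang: Dict[str, int]) -> int:
    """Sum of rank differences; n-grams absent from `lang` get a fixed penalty."""
    max_penalty = len(lang)  # assumed penalty for a missing n-gram
    return sum(
        abs(rank - lang[gram]) if gram in lang else max_penalty
        for gram, rank in doc.items()
    )


def classify_by_ngrams(text: str, language_profiles: Dict[str, Dict[str, int]]) -> str:
    """Return the language whose profile has the smallest out-of-place distance."""
    doc = profile(text)
    return min(language_profiles,
               key=lambda lang: out_of_place(doc, language_profiles[lang]))


# Toy training samples, made up for illustration only.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and then the fox sleeps",
    "es": "el veloz zorro marrón salta sobre el perro perezoso y luego duerme",
}
language_profiles = {lang: profile(text) for lang, text in samples.items()}
```

Calling `ngrams("cavnar and trenkle", 2)` and `ngrams("cavnar and trenkle", 3)` reproduces the bi-gram and tri-gram lists from slide 8. In a Spark setting, a function like `classify_by_ngrams` is the kind of logic that a custom Transformer would apply to a text column.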
