Advanced Spark Meetup - Jan 12, 2016

Michelle Casbon

Training & Serving NLP/Spark ML Models in a Distributed Cloud-based Infrastructure

Michelle Casbon
January 12, 2016 – Advanced Apache Spark Meetup
Training & Serving NLP Models in a Distributed Cloud-based Infrastructure

What do we do?
• Idibon creates adaptive machine intelligence that can analyze text in any language
[Diagram labels: natural language text, social media, structured insights]

Agenda
• Background
• Platform description
• Why we chose Spark
• How we’re using Spark ML & MLlib
• Challenges of adopting Spark in a distributed NLP system

What are our use cases?
• Intent to purchase
• Global health trends
• Interactive Voice Response
• Multilingual news
• SMS Prioritization
• Supply Chain Risk
• Change reception

Adaptive learning
• Fewer annotations
• Lower costs
• Less time spent training
• Higher accuracy
• Improves over time
[Diagram labels: labeled training set, human annotation, intelligent queuing & machine learning, unlabeled pool]
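One common way to realize the intelligent queuing step of an adaptive learning loop is uncertainty sampling: the items the current model is least sure about are sent to human annotators first. The slides do not say which strategy Idibon uses, so the function name `queue_for_annotation`, the item ids, and the "closest to 0.5" scoring rule below are all assumptions made for illustration.

```python
def queue_for_annotation(pool_probs, batch_size):
    """Return ids of the unlabeled items the model is least sure about.

    pool_probs maps item id -> predicted probability of the positive class.
    Uncertainty is measured as closeness to 0.5 (an illustrative choice).
    """
    ranked = sorted(pool_probs, key=lambda item: abs(pool_probs[item] - 0.5))
    return ranked[:batch_size]


# Hypothetical model scores for four unlabeled items.
pool = {"tweet-1": 0.97, "tweet-2": 0.52, "tweet-3": 0.10, "tweet-4": 0.41}
print(queue_for_annotation(pool, 2))  # ['tweet-2', 'tweet-4']
```

Annotating the borderline items first is what would let a system like this reach a given accuracy with fewer annotations than labeling the pool in arbitrary order.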

How do we do it?
1. Goal Definition
Dataset
2. Identification
3. Cleansing
4. Training data creation
5. Quality Control
Models
6. Creation
7. Hyperparameter Tuning
8. Intelligent Queueing
9. Rule Creation
10. Unseen Data Prediction

What does our platform look like?

Why are we using Spark?
• Wide variety of algorithms
• Active development
• Open source
• Industry-standard algorithm implementation
• Intended for use in enterprise applications
• Scalability

How are we using Spark?
• Feature Extraction
  • TF-IDF
  • Word2Vec
  • Dimensionality reduction
• Training
  • Logistic Regression
  • SVM
  • Naïve Bayes
  • LDA
• Prediction
  • Evaluation metrics
Diagram stages: Feature Extraction, Training, Prediction (example labeled point: [1.0, [1.0, 0.0, 3.0]])
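To make the TF-IDF item concrete, here is a small plain-Python sketch of the weighting, written without Spark so it runs anywhere. It uses raw term counts and the smoothed inverse document frequency log((N + 1) / (df + 1)), which is the form the Spark MLlib documentation describes; the function name `tf_idf` and the toy documents are assumptions, and this is not the deck's own code.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((n_docs + 1) / (count + 1)) for t, count in df.items()}
    # Term frequency is the raw count of the term within the document.
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


# Toy corpus, invented for illustration.
docs = [["spark", "is", "fast"], ["spark", "ml"], ["nlp", "is", "hard"]]
weights = tf_idf(docs)
```

Terms that occur in many documents ("spark", "is") receive a lower weight than terms confined to one document ("ml", "fast").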

Feature Extraction
Extract Content → Tokenize → Bigrams, Trigrams → Feature Lookup → Vector [1.0, 0.0, 3.0]
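The feature extraction flow (tokenize, build bigrams and trigrams, look each n-gram up in a feature index, emit a vector) can be sketched in plain Python as follows. The whitespace tokenizer, the inclusion of unigrams, the `feature_index` contents, and the sample sentence are assumptions chosen so that the result matches the example vector on the slide; the real system presumably uses language-aware tokenization.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text, feature_index):
    """Turn raw text into a dense count vector over a fixed feature index."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    grams = tokens + ngrams(tokens, 2) + ngrams(tokens, 3)
    vector = [0.0] * len(feature_index)
    for gram in grams:
        position = feature_index.get(gram)  # feature lookup
        if position is not None:            # n-grams outside the index are dropped
            vector[position] += 1.0
    return vector


# Hypothetical feature index mapping n-grams to vector positions.
feature_index = {"want": 0, "to buy": 1, "phone": 2}
print(extract_features("Want a phone phone phone", feature_index))  # [1.0, 0.0, 3.0]
```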

Training
Vector [1.0, 0.0, 3.0] → Add classification → LabeledPoint [1.0, [1.0, 0.0, 3.0]] → LogisticRegressionWithLBFGS → LogisticRegressionModel → Model Storage
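The training step pairs each feature vector with its label and fits a logistic regression model. The sketch below shows the same idea in plain Python. It is not MLlib: the `LabeledPoint` here is a local namedtuple stand-in, and the optimizer is simple batch gradient descent rather than the L-BFGS optimizer named on the slide. The second training example and the hyperparameters are made up.

```python
import math
from collections import namedtuple

# Stand-in for MLlib's LabeledPoint: a label paired with a feature vector.
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(points, epochs=500, learning_rate=0.1):
    """Fit weights and an intercept by batch gradient descent on log loss."""
    n_features = len(points[0].features)
    weights, intercept = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for p in points:
            z = intercept + sum(w * x for w, x in zip(weights, p.features))
            error = sigmoid(z) - p.label
            grad_b += error
            for j, x in enumerate(p.features):
                grad_w[j] += error * x
        intercept -= learning_rate * grad_b / len(points)
        weights = [w - learning_rate * g / len(points)
                   for w, g in zip(weights, grad_w)]
    return weights, intercept


training_set = [
    LabeledPoint(1.0, [1.0, 0.0, 3.0]),  # the labeled vector from the slide
    LabeledPoint(0.0, [0.0, 2.0, 0.0]),  # made-up negative example
]
weights, intercept = train_logistic_regression(training_set)
```

The returned weights and intercept play the role of the stored model: they are all that is needed to score new vectors later.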

Prediction
New tweet → Extract Content → Tokenize → Bigrams, Trigrams → Feature Lookup → Vector [0.0, 1.0, 4.0] → Model Lookup → Predict → Classification Lookup
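At prediction time the vector for a new document is scored against a stored model and the numeric class is translated back into a label. A minimal plain-Python sketch of those three steps (model lookup, predict, classification lookup) follows. The store layout, task name, weights, and labels are all hypothetical.

```python
import math

# Hypothetical stored models keyed by task name: (weights, intercept).
MODEL_STORE = {"intent-to-purchase": ([0.8, -1.2, 0.5], -0.3)}
# Hypothetical mapping from numeric class to a human-readable label.
CLASS_LABELS = {1.0: "intent to purchase", 0.0: "no intent"}


def predict(task, vector, threshold=0.5):
    """Score one feature vector and translate the class into a label."""
    weights, intercept = MODEL_STORE[task]              # model lookup
    z = intercept + sum(w * x for w, x in zip(weights, vector))
    probability = 1.0 / (1.0 + math.exp(-z))            # predict
    predicted_class = 1.0 if probability >= threshold else 0.0
    return CLASS_LABELS[predicted_class], probability   # classification lookup


label, probability = predict("intent-to-purchase", [0.0, 1.0, 4.0])
print(label)  # intent to purchase
```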

How do we provide online predictions with Spark?
Task                   Time in µs
Vector prediction      300
DataFrame prediction   7800
DataFrames are slow ... if you have small data
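The two timings can be put in perspective with a little arithmetic. The snippet below only restates the slide's numbers; the throughput figures assume strictly sequential predictions on one thread with no other overhead, which is a simplification.

```python
timings_us = {"Vector prediction": 300, "DataFrame prediction": 7800}

# How many times slower the DataFrame path is for a single prediction.
slowdown = timings_us["DataFrame prediction"] / timings_us["Vector prediction"]

# Sequential predictions per second on one thread, ignoring all other overhead.
throughput = {task: 1_000_000 / us for task, us in timings_us.items()}

print(slowdown)  # 26.0
```

Under those assumptions the vector path handles roughly 3,333 predictions per second against roughly 128 for the DataFrame path, which is why per-request overhead matters for online serving of small inputs.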

How do we fit Spark into our existing system?
[Diagram labels: Core functionality, Idibon custom ML, …, REST API, ML persistence layer]

How does a persistence layer enable us to use Spark?
• Real-time operationalization of many, many models
• Embed within different platforms
• Single save/load framework
• Rapidly incorporate new NLP features
• Logging/monitoring standardized & abstracted
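A single save/load framework is easiest to embed in different platforms when it reads and writes streams rather than fixed file paths, a contrast the editor's notes also mention. The sketch below illustrates that design in plain Python. The JSON format, the `version` field, and the function names are assumptions and do not describe Idibon's actual persistence layer.

```python
import io
import json


def save_model(model, stream):
    """Write a model (weights and intercept) to any binary output stream."""
    payload = {"version": 1,
               "weights": model["weights"],
               "intercept": model["intercept"]}
    stream.write(json.dumps(payload).encode("utf-8"))


def load_model(stream):
    """Read a model back from any binary input stream."""
    payload = json.loads(stream.read().decode("utf-8"))
    if payload["version"] != 1:
        raise ValueError("unsupported model format version")
    return {"weights": payload["weights"], "intercept": payload["intercept"]}


model = {"weights": [0.8, -1.2, 0.5], "intercept": -0.3}
buffer = io.BytesIO()  # could equally be a file, socket, or blob-store stream
save_model(model, buffer)
buffer.seek(0)
restored = load_model(buffer)
```

Because the functions only see a stream, the same code path can serve a container, a serverless function, or an on-device deployment without change.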

Summary
• Analyzing human language is hard
• We’re using the most exciting parts of Spark to build performant NLP systems that are faster & better than ever before

Questions?
Michelle Casbon
michelle@idibon.com
@texasmichelle

Editor's Notes

Use cases:
• Detecting intent-to-buy in Twitter data
• Extracting insights about global health trends
• Making the call center IVR smarter
• Informing investors of company performance based on news sentiment
• Identifying supply chain risk in news articles
• Understanding user reception of code pushes in online games
• Prioritizing urgent SMS messages for UNICEF
API: Document uploads, model training, cross-validation, annotation aggregation, queueing, topic modeling, prediction, IAA
Persistence layer: Docker, Lambda, offline, on-device; input/output streams vs. flat files; slf4j
