The Snake in Your Data
How Python is Used Today by Data Science Teams
Matt Price
Principal Research Engineer
2019.09.24
2SLIDE
Agenda
● About ZeroFOX
● The Data Science Lifecycle
● Data Science at ZeroFOX
● Data Science Tools
● Prodigy Demo
● Q & A
3
About ZeroFOX
It’s a Digital World. Engage Securely.
Our Mission
ZeroFOX exists to protect digital engagement
Our Story
ZeroFOX was founded with the goal of creating
customer champions
With global reach and operation centers in the
United States, United Kingdom, Chile and India,
ZeroFOX provides best in class software, support
and services to organizations of all sizes.
Most Recognized. Most Awarded.
4
Social and Digital Channels
Your Organization
Domains | Executives | VIPs | Employees | Brands | Locations
AI-Driven Analysis
Automated Analysis | Alerts | Reporting
Human-Driven Analysis
ZeroFOX OnWatch™ | ZeroFOX Alpha Team
Remediation
Takedown-as-a-Service™
Complete Digital Visibility & Protection
The ZeroFOX Platform
Identify: Risks on social and digital platforms
Protect: What matters to your organization
Remediate: Threats to your brand and business
Protection | Identification | Analysis | Remediation
6SLIDE
The Data Science Lifecycle
● Each stage builds on the previous stages
● Most of the effort goes into data collection
● Iterative process
● Python is used throughout the
entire workflow
8SLIDE
ZeroFOX AI
[Diagram labels: Artificial Intelligence, Machine Learning, Deep Learning, NLP, CV]
Artificial Intelligence (AI)
The simulation of intelligent behavior in machines
AI Techniques
Machine Learning (ML)
Study and use of algorithms and statistical models that learn from data
Deep Learning
A technique within ML that uses “large” neural networks
9SLIDE
ZeroFOX Data Science Architecture
● Tied into production data ingest
● Feedback loop from analysts
● Labeling is open to the entire
company
● Architecture is optimized for quick
iterations
11SLIDE
Python Tooling Categories
Data manipulation
Data structures and data transformations
Data visualization
Understanding what the data is
Modeling
Teaching machines to learn the underlying patterns in the data
Deployment
Integrating with the platform and making models available to the end customer
12SLIDE
Data Manipulation Tools
NumPy
● Multi-dimensional arrays and matrices
● High level mathematical functions
● Fast, vectorized operations
Pandas
● Labeled two-dimensional data wrapped in DataFrames
● Time series logic and operations
● Data analysis functions and tools
OpenCV
● CV and ML library
● Fast operations, with a focus on real-time video
● Low level operations
Pillow
● PIL fork
● General image processing library
● High level operations
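The array and DataFrame bullets above can be illustrated with a minimal sketch using NumPy and pandas. The alert counts and dates are hypothetical values chosen for illustration, not data from the talk.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one alert count per hour for two days (48 values).
hourly_alerts = np.arange(48, dtype=float)

# Vectorized operation: standardize every element at once, no Python loop.
normalized = (hourly_alerts - hourly_alerts.mean()) / hourly_alerts.std()

# Wrap the array in a DataFrame with an hourly time index.
index = pd.date_range("2019-09-23", periods=48, freq=pd.Timedelta(hours=1))
frame = pd.DataFrame({"alerts": hourly_alerts}, index=index)

# Time series logic: aggregate the hourly rows into daily totals.
daily_totals = frame["alerts"].resample("D").sum()
print(daily_totals)
```

The same standardization written as a Python loop would touch each element individually; the vectorized form hands the whole array to compiled code in one call.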
13SLIDE
ZeroFOX Data Science Architecture
[Architecture diagram; tools highlighted: NumPy, OpenCV, Pillow, Pandas]
14SLIDE
Data Visualization Tools
Jupyter
● Interactive computing via notebooks
● Kernels run code and return output
● Focus on scientific computing
Matplotlib
● Plotting library
● Low level plotting interface
● Compatible with a number of GUI toolkits
Seaborn
● Built on top of matplotlib
● High level plotting interface
● Categorical variable support
Plotly
● Framework for building data visualization apps
● Open source and enterprise versions
● Interactive charts
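As a rough illustration of the low level plotting interface described above, here is a minimal Matplotlib sketch. The categories and counts are hypothetical, and the off-screen "Agg" backend is an assumption made so the example runs without a GUI toolkit.

```python
import io

import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen, no GUI toolkit needed
import matplotlib.pyplot as plt

# Hypothetical alert counts per category, for illustration only.
categories = ["phishing", "impersonation", "spam"]
counts = [12, 7, 21]

# Build the figure explicitly: one Figure, one Axes, one bar per category.
fig, ax = plt.subplots()
ax.bar(categories, counts)
ax.set_xlabel("Alert type")
ax.set_ylabel("Count")
ax.set_title("Alerts by type (hypothetical data)")

# Write the chart to an in-memory PNG instead of a window or a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```

In a Jupyter notebook the figure would normally render inline, so the backend selection and the in-memory buffer would not be needed.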
15SLIDE
ZeroFOX Data Science Architecture
[Architecture diagram; tools highlighted: Jupyter, Matplotlib, Seaborn, Plotly]
16SLIDE
Modeling Tools
Prodigy
● Solves the labeling problem
● Enables active learning
● Programmatic workflow definitions
● Extremely flexible
Scikit-learn
● Machine learning and data analysis library
● Built on top of NumPy, SciPy, LIBSVM, and matplotlib
● A number of other scikits are available
Keras + TensorFlow
● High level deep learning library
● Serves as an interface to lower level backends
● TensorFlow supplies low level building blocks
● Pre-defined models
spaCy
● Production-focused NLP framework
● Deep learning models powered by Thinc
● Define a pipeline that outputs annotated documents
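To make the modeling step concrete, here is a minimal scikit-learn sketch of a text classifier. The example posts, the two label names, and the choice of TF-IDF features with logistic regression are assumptions for illustration; the talk does not say which models ZeroFOX uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled posts; real training data would come from annotation.
texts = [
    "verify your account now to avoid suspension",
    "click this link to claim your free prize",
    "urgent: confirm your password immediately",
    "you have won a gift card, send your details",
    "our quarterly report is now available",
    "join us for the community meetup on tuesday",
    "new blog post about python tooling",
    "thanks for attending the conference talk",
]
labels = ["malicious"] * 4 + ["benign"] * 4

# Chain feature extraction and the classifier into one estimator.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Score posts the model has not seen before.
new_posts = ["claim your free gift card now", "slides from the python meetup"]
predictions = model.predict(new_posts)
probabilities = model.predict_proba(new_posts)
print(list(zip(new_posts, predictions)))
```

Bundling the vectorizer and the classifier in one pipeline means the same preprocessing is applied at training and prediction time, which helps when the model is later deployed behind a web service.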
17SLIDE
ZeroFOX Data Science Architecture
[Architecture diagram; tools highlighted: Prodigy, Scikit-learn, Keras + TensorFlow, spaCy]
18SLIDE
Deployment
Sanic
● Web server and framework focused on high performance
● Secondarily focused on ease of use
● Flask-like framework API
● Decent extension ecosystem
● Python 3.6+ (relies heavily on async/await)
Django
● MVC web framework
● Focused on easing development of database-driven websites
● Large extension ecosystem
● CRUD interface for administrative tasks
19SLIDE
ZeroFOX Data Science Architecture
[Architecture diagram; tools highlighted: Sanic, Django]
21SLIDE
Prodigy
● Created by Explosion.AI (Matthew Honnibal and Ines Montani)
○ Same company that develops spaCy and Thinc
● Designed to make annotating data simple but can do much more
● A commercial tool (Python package) that you purchase
● Why Prodigy?
○ Solves the “hardest” problem in applied data science
○ Can programmatically define entire model workflow in a recipe
○ Out of the box support for spaCy
○ Supports computer vision annotation
○ Exports trained models as Python packages
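The active learning idea behind the annotation workflow can be sketched generically: ask annotators about the examples the model is least sure of. The snippet below is not Prodigy's API. The function name `most_uncertain` and the scores are hypothetical, and uncertainty sampling is only one possible selection strategy.

```python
import numpy as np


def most_uncertain(probabilities, batch_size):
    """Return indices of the examples whose score is closest to 0.5."""
    scores = np.asarray(probabilities, dtype=float)
    # Distance from 0.5: a small distance means the model is unsure.
    distance = np.abs(scores - 0.5)
    # Sort ascending so the least certain examples come first.
    return np.argsort(distance)[:batch_size].tolist()


# Hypothetical model scores for five unlabeled examples.
model_scores = [0.97, 0.52, 0.10, 0.45, 0.80]
to_annotate = most_uncertain(model_scores, batch_size=2)
print(to_annotate)  # [1, 3]
```

Labels collected for the selected examples would then be fed back into training, and the loop repeats with the updated model.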
22SLIDE
Prodigy Live Demo
24SLIDE
Questions?

Python meetup
