Open Source Framework for Deploying Data Science Models and Cloud Based Applications
Pivotal Data Science Team
What happened?
What should I do about it?
This is where Data Science comes in
What will happen next?
What Thought Leaders Have In Common
• Large amounts of structured and unstructured data
• Deep personal knowledge of their audience
• Quantified understanding of their products
• Data-driven culture
• User experience optimized by data science
Internal Data Sources: Viewership, Advertisements, Merchandise, Sales & Finance
Typical External Sources: Market Research & Competitive Information, Audience Demographics
Semi/Unstructured Data: Clickstream, Social Media, Content
Data Science Impact
Business Motivation
• Increase Demand
• Build Brand Equity
• Increase Production Efficiency
• Optimize Ad Spend Efficiency
• Increase Customer Engagement
Data Science Opportunities
• Campaign Optimization
• Marketing Mix Models
• Customer segmentation
• Affinity analysis
• Social media analytics
• Supply/Demand forecasting
Increase Revenue
Reduce Cost
Example Use Case: Ratings Prediction
Use Case: Increase ratings across viewer demographics
How:
• Data: Viewership, transcripts and show data combined in big data platform
• Model: Machine learning used to identify the impact of production decisions on viewership
Insights
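A minimal sketch of how the impact of one production decision on viewership might be estimated. This is a much simpler stand-in for the machine learning model the slide mentions: it only compares mean viewership between segments with and without a binary feature. The field names (`on_screen_argument`, `viewers`) and all data values are hypothetical.

```python
from statistics import mean

def feature_effect(rows, feature, target="viewers"):
    """Difference in mean target between rows with and without a binary feature."""
    with_feature = [r[target] for r in rows if r[feature]]
    without_feature = [r[target] for r in rows if not r[feature]]
    return mean(with_feature) - mean(without_feature)

# Hypothetical per-segment records; the numbers are made up for illustration.
segments = [
    {"on_screen_argument": 1, "viewers": 90},
    {"on_screen_argument": 1, "viewers": 110},
    {"on_screen_argument": 0, "viewers": 120},
    {"on_screen_argument": 0, "viewers": 140},
]

# A negative value means segments with the feature drew fewer viewers on average.
effect = feature_effect(segments, "on_screen_argument")
print(effect)
```

A real model would control for other factors (time slot, demographics, show) rather than compare raw means.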
Models → Insights → Actions
Models are built to answer business questions
e.g. What makes viewers tune in and tune out?
Data Scientists interpret models for answers
e.g. On-screen arguments make viewers tune out
Delivery channels to the End User: Report, Dashboard, BI Tool, Email, Presentation, Cloud App
A good insight drives action that will generate value for stakeholders
Revisiting the Ratings Prediction Use Case
Model exposed to end users via a cloud application that allows what-if scenario building
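One way a what-if scenario could work behind such an application, sketched under strong assumptions: the model is taken to be linear, and the coefficients and lever names below are hypothetical, not results from the talk. The user changes one or more business levers and sees the change in predicted rating.

```python
# Hypothetical coefficients, as if exported from a fitted linear model.
COEFFICIENTS = {
    "intercept": 2.0,
    "on_screen_argument_minutes": -0.125,
    "guest_appearances": 0.25,
}

def predict_rating(levers):
    """Predicted rating for one set of lever values under the linear model."""
    rating = COEFFICIENTS["intercept"]
    for name, value in levers.items():
        rating += COEFFICIENTS[name] * value
    return rating

def what_if(baseline, changes):
    """Change in predicted rating when `changes` override the baseline levers."""
    scenario = {**baseline, **changes}
    return predict_rating(scenario) - predict_rating(baseline)

baseline = {"on_screen_argument_minutes": 4, "guest_appearances": 2}
# Scenario: remove on-screen arguments entirely.
delta = what_if(baseline, {"on_screen_argument_minutes": 0})
print(delta)
```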
Characteristics Of Actionable Insights
Real-time
Scalable
Social
Relevant
Accessible
Open
Benefits Of Cloud Based Applications
• Service failure or data loss at scale → Resilient, scale-out messaging and processing
• Long innovation cycles → Agile development with cloud based data services
• Poor experience at scale → Low-latency, in-memory computing
Open Source Analytics Ecosystem
Media companies benefit from algorithmic breadth and scalability for building and socializing data science models
MLlib
PL/X
Algorithms Visualization
Best of breed in-memory and in-database tools for an MPP platform
Example Scalable Open Source Platform
Hadoop++: Complementing the Hadoop platform are Data Science modeling tools.
SQL on Hadoop (e.g. HAWQ), Python/R interfaces to SQL, Apache Spark etc.
http://opendataplatform.org/
Apps
Data
Analytics
Leading Media companies are moving towards a platform with Hadoop at the core.
Data Science Pipeline On Hadoop++
MLlib
PL/X
Data Lake
Hadoop++
Structured + Unstructured Data
Open Source Framework For Ratings Prediction
Data Lake (contains structured + unstructured data)
Insights and Model Results
Ratings Predictions
Business Levers
Hosted on
What-if Scenario Application
MLlib
PL/X
Expanding The Framework To Include Impression Forecasting Modeling
Gather video ads impression stats
Data Lake
Ingest
Message Broker
Simulate Ad Server Behavior
Impression Forecasts
Business Levers
Hosted on
Business Metrics Dashboard
MLlib
PL/X
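A rough sketch of the ingest, simulate and forecast flow on this slide. An in-process `queue.Queue` stands in for the message broker, the simulated ad server simply replays a hypothetical schedule, and the forecast is a naive moving average rather than whatever model the team actually used. All names and values here are assumptions.

```python
import queue
from collections import Counter

def simulate_ad_server(broker, schedule):
    """Publish one impression event per (hour, ad_id) pair in the schedule."""
    for hour, ad_id in schedule:
        broker.put({"hour": hour, "ad_id": ad_id})

def consume_impressions(broker):
    """Drain the broker and count impressions per hour."""
    per_hour = Counter()
    while not broker.empty():
        event = broker.get()
        per_hour[event["hour"]] += 1
    return per_hour

def forecast_next_hour(per_hour, window=2):
    """Naive forecast: mean of the most recent `window` hourly counts."""
    recent = [per_hour[h] for h in sorted(per_hour)[-window:]]
    return sum(recent) / len(recent)

broker = queue.Queue()  # stand-in for a real message broker
schedule = [
    (0, "ad-a"),
    (1, "ad-a"), (1, "ad-b"),
    (2, "ad-a"), (2, "ad-b"), (2, "ad-c"), (2, "ad-a"),
]
simulate_ad_server(broker, schedule)
counts = consume_impressions(broker)
forecast = forecast_next_hour(counts)
print(dict(counts), forecast)
```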
Measuring Audience Engagement: Workflow
Twitter Decahose (~55 million tweets/day)
Source: http
Sink: hdfs
HDFS
External Tables (PXF)
Parallel Parsing of JSON (PL/Python)
Topic Analysis through MADlib pLDA
Unsupervised Sentiment Analysis (PL/Python)
Nightly Cron Jobs
Hosted on
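The JSON parsing and sentiment steps could look roughly like the plain Python below, which is the kind of logic a PL/Python function body might wrap. This is a sketch only: the tiny word lexicon is hypothetical, and lexicon scoring is just one unsupervised approach, not necessarily the one used in the talk. The sample tweets are invented.

```python
import json

# Hypothetical sentiment lexicon; a real one would be far larger.
POSITIVE = {"love", "great", "funny"}
NEGATIVE = {"boring", "hate", "awful"}

def parse_tweet(line):
    """Return the tweet text from one JSON line, or None if the line is unusable."""
    try:
        return json.loads(line).get("text")
    except (json.JSONDecodeError, AttributeError):
        return None

def sentiment_score(text):
    """Count of positive words minus count of negative words."""
    words = [w.strip(".,!?#@").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Invented sample input, one JSON document per line.
raw_lines = [
    '{"id": 1, "text": "Love the new episode, so funny!"}',
    '{"id": 2, "text": "That finale was boring."}',
    'not valid json',
]
texts = [t for t in map(parse_tweet, raw_lines) if t is not None]
scores = [sentiment_score(t) for t in texts]
print(scores)
```

Skipping malformed lines instead of failing matters at this volume, since a single bad record should not stop a nightly batch.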
Key Takeaways
• Blended data sets lead to richer models and more valuable insights
• Turn Data Science models and insights into value-generating actions through data-driven applications
• Open source = power and flexibility
• Platform extensibility is key to supporting Data Science
• Turnkey PaaS is available through CloudFoundry, including infrastructure monitoring, server configuration and scalability
THANK YOU!

Open Source Framework for Deploying Data Science Models and Cloud Based Applications by Noelle Sio of Pivotal
