APACHE SPARK FOR MACHINE LEARNING
WITH HIGH DIMENSIONAL LABELS
Michael Zargham and Stefan Panayotov
Cadent, Data Science & Engineering Research
© 2016 Cadent. All rights reserved.
Data Technology Company specializing in Television Advertising
• Cadent has a bicoastal data science and engineering team
- Our business runs on internally developed software
- Hybrid cloud Apache Spark infrastructure
- Analytical rather than rule driven algorithms
- Machine Learning APIs and custom mathematics in decision optimizations
- Collaborations with IBM Research (Spark TC) and Product team (Data Science Experience)
Cadent: Data Empowered Television Advertising
[Diagram: Data, Infrastructure, Engineering, Science, Decisions, Analytics]
Motivation
• Business Model
– 2 sided business
– Upfront Sales sell Impressions
– Fulfill with Scatter Purchases based on subscribers
– Impressions = ratings * subscribers
• Relevant Scales
– Weather-like View
• Shows
• Twitter trends
• Spectacle Events
– Climate-like View
• Seasonality
• Subscriber trends
• Daypart Variation
Theoretical Approach
Features
• unique combinations of targetable characteristics
• Network, Age, Gender, Category, Season, etc.
Rating Vectors
• 96 positive real values, one per Quarter Hour of Day
[Figure: Rating for Quarter Hour vs. Quarter Hour of Day]
Daily Patterns: Mean & Variance
Values in Log-like coordinate system:
value 0 = rating 0
value 3 = rating 10^(-5)
value 5 = rating 10^(-3)
[Figure panels: mean, variance]
Label Dimensionality Reduction
Features
• unique combinations of targetable characteristics
• Network, Age, Gender, Category, Season, etc.
Rating Vectors
• 96 positive real values, one per Quarter Hour of Day
Coefficients of Principal Components
• J real values
[Figure: Captured Variance]
Warning:
Uncaptured Variance is strictly
lost from the predictive model
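The label reduction and the warning above can be illustrated with a minimal NumPy sketch. It is not Cadent's Spark code: the synthetic 96-dimensional rating vectors, the choice J = 3, and every variable name are assumptions made for illustration. It projects each label vector onto J principal components and measures how much variance is captured and how much is lost on the way back.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for rating vectors: 500 feature combinations x 96 quarter hours,
# built from 3 latent daily patterns plus small noise (an assumption for illustration).
n_rows, n_qh, n_latent = 500, 96, 3
patterns = rng.normal(size=(n_latent, n_qh))
weights = rng.normal(size=(n_rows, n_latent))
Y = weights @ patterns + 0.05 * rng.normal(size=(n_rows, n_qh))

# PCA via SVD of the mean-centered label matrix.
mean = Y.mean(axis=0)
U, s, Vt = np.linalg.svd(Y - mean, full_matrices=False)

J = 3                              # number of principal components kept (J << 96)
components = Vt[:J]                # J x 96 basis of the reduced label space
coefs = (Y - mean) @ components.T  # n_rows x J coefficients: the new regression labels

# Fraction of total label variance captured by the first J components.
captured = float((s[:J] ** 2).sum() / (s ** 2).sum())

# Mapping back to 96 dimensions recovers only the captured part.
Y_back = coefs @ components + mean
lost = float(((Y - Y_back) ** 2).sum() / ((Y - mean) ** 2).sum())

print(f"captured variance: {captured:.4f}, lost: {lost:.4f}")
```

The captured and lost fractions sum to one, so whatever the first J components miss cannot be recovered by any regressor trained on the coefficients.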
Why Reduce Label Dimension
• The correlations between values captured by reducing to principal components add more value than the variance lost in the “climate-like” view
• The Apache Spark ML API doesn’t support n-dimensional regression, so we fit J one-dimensional regressions instead, which is computationally efficient for J << n
Coordinate Systems Matter
• Regression works well when…
– Euclidean distance fits well with the human sense of “sameness”
– The labels being predicted are well conditioned
• A big part of our Methodology is understanding the
mathematical spaces our data lives in and using ‘change of
coordinate’ techniques
[Figure: time of day from 0:00 to 23:59 on a line, and on a circle with 0:00 and 12:00 marked]
Define points on unit circle:
Using 2D (x,y) coordinates
Unknowns Imputed at (0,0)
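A small sketch of the unit-circle encoding described above, in plain Python. The function names, the use of minutes since midnight as input, and the angle convention (0:00 at angle zero, counter-clockwise) are assumptions; the slides only state that points lie on the unit circle and that unknowns are imputed at (0,0).

```python
import math

def time_of_day_to_xy(minutes):
    """Map minutes since midnight (0..1439) to a point on the unit circle.

    None stands for an unknown time and is imputed at the origin (0, 0),
    which is equidistant from every valid time of day.
    """
    if minutes is None:
        return (0.0, 0.0)
    angle = 2.0 * math.pi * (minutes % 1440) / 1440.0
    return (math.cos(angle), math.sin(angle))

def distance(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

midnight = time_of_day_to_xy(0)
noon = time_of_day_to_xy(12 * 60)
late = time_of_day_to_xy(23 * 60 + 59)
unknown = time_of_day_to_xy(None)

# 23:59 and 0:00 are far apart as raw numbers but adjacent on the circle.
print(distance(late, midnight))
print(distance(noon, midnight))
print(distance(unknown, noon))
```

With this encoding, Euclidean distance agrees with the intuition that one minute before midnight is close to midnight, which a single 0 to 23:59 axis does not give.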
Custom Log-Like Coordinates
This coordinate system is used to eliminate bias in error metrics.
In the domain space, the errors in large-value ratings swamp those of small-value ratings.
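The slides do not give the formula for the log-like coordinates. The sketch below assumes one possible form, log10(1 + 10^8 * rating), chosen only because it roughly reproduces the stated reference points (value 0 = rating 0, value 3 = rating 10^(-5), value 5 = rating 10^(-3)). The constant and the function names are hypothetical.

```python
import math

SCALE = 1e8  # assumed: chosen so that rating 1e-5 -> ~3 and rating 1e-3 -> ~5

def to_loglike(rating):
    """Forward transform: 0 maps to 0, positive ratings to roughly log10(rating) + 8."""
    return math.log10(1.0 + SCALE * rating)

def from_loglike(value):
    """Inverse transform back to the rating domain."""
    return (10.0 ** value - 1.0) / SCALE

for r in (0.0, 1e-5, 1e-3):
    print(r, round(to_loglike(r), 3))

# The same 10% relative error at two very different rating scales.
small_true, small_pred = 1e-5, 1.1e-5
large_true, large_pred = 1e-3, 1.1e-3

# In the domain space the large rating dominates the error.
domain_err_small = abs(small_pred - small_true)
domain_err_large = abs(large_pred - large_true)

# In log-like coordinates both errors have nearly the same size.
log_err_small = abs(to_loglike(small_pred) - to_loglike(small_true))
log_err_large = abs(to_loglike(large_pred) - to_loglike(large_true))

print(domain_err_small, domain_err_large)
print(log_err_small, log_err_large)
```

Under this assumed transform, an error metric computed in the new coordinates weighs a 10% miss on a small rating about as heavily as a 10% miss on a large one.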
Predictor-Corrector Method
• Predictor-Corrector is a form of ensemble
– Build a naïve model and an estimator of that model’s bias function,
pipelining them together to create a PCM
Naïve model: X → y, giving yHat
Correction model: X → e = yHat - y, giving eHat
Prediction: yPred = yHat - eHat
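The predictor-corrector idea can be sketched in a few lines of plain Python. The two models here are deliberately trivial stand-ins (a global mean and a per-feature mean error), and the toy data and names are hypothetical; the point is only the wiring yPred = yHat - eHat.

```python
from collections import defaultdict
from statistics import mean

# Toy training data: (feature, label). The feature is a hypothetical daypart name.
train = [("morning", 1.0), ("morning", 1.2), ("prime", 3.0), ("prime", 3.4), ("late", 0.4)]

# Step 1: naive model. It ignores X and predicts the global mean (a deliberately biased stand-in).
global_mean = mean(y for _, y in train)

def naive_predict(x):
    return global_mean

# Step 2: correction model. Learn the naive model's error e = yHat - y as a function of X.
errors_by_x = defaultdict(list)
for x, y in train:
    errors_by_x[x].append(naive_predict(x) - y)
e_hat_by_x = {x: mean(es) for x, es in errors_by_x.items()}

def error_predict(x):
    # Unseen feature values get no correction.
    return e_hat_by_x.get(x, 0.0)

# Step 3: pipeline the two models. yPred = yHat - eHat.
def pcm_predict(x):
    return naive_predict(x) - error_predict(x)

def mse(predict):
    """In-sample mean squared error on the toy training data."""
    return mean((predict(x) - y) ** 2 for x, y in train)

print("naive MSE:", mse(naive_predict))
print("PCM MSE:  ", mse(pcm_predict))
```

The comparison is in-sample and only shows the mechanics; a real evaluation would use held-out data and stronger models for both stages.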
Implemented Workflow
• Naïve Estimator Model (NEM): Domain Space Forecast
• Correction Estimator Model (CEM): Logspace Local Coordinate Forecast
Both NEM and CEM are regressors in reduced dimensional vector spaces created using PCA linear subspace reductions to find efficient coordinate systems.
Principal Component Analysis (PCA): to
reduce the dimensionality of the problem
Features
• unique combinations of targetable characteristics
• Network, Age, Gender, Category, Season, etc.
Pipeline of J single label regression models
• PCA Transform
• Vector Disassemble: Component 1 … Component J
• Train GBT Regressor 1 … Train GBT Regressor J
• Vector Assembler
• PCA pseudo-inverse
Inverse PCA: Transform forecasts into domain space
Gradient Boosted Tree Regressor
Pipeline: Stages = PCA Coefficients
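A local NumPy sketch of the pipeline shape described above: PCA transform of the labels, one single-label regressor per coefficient, reassembly, and a pseudo-inverse back to the domain space. Ordinary least squares is swapped in for the gradient boosted tree regressor to keep the example dependency-free, and the synthetic data, dimensions, and names are all assumptions rather than the Spark implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumption): 400 rows, 5 numeric features, 96-dimensional labels
# that depend linearly on the features through a low-rank map.
n, d, n_qh, J = 400, 5, 96, 4
X = rng.normal(size=(n, d))
B = rng.normal(size=(d, J)) @ rng.normal(size=(J, n_qh))
Y = X @ B + 0.01 * rng.normal(size=(n, n_qh))

# PCA transform of the labels: 96 values -> J coefficients.
y_mean = Y.mean(axis=0)
_, _, Vt = np.linalg.svd(Y - y_mean, full_matrices=False)
W = Vt[:J].T                      # 96 x J projection matrix
coefs = (Y - y_mean) @ W          # n x J

# "Vector disassemble": one single-label regressor per component.
# Least squares stands in for the GBT regressor named in the slides.
X1 = np.hstack([X, np.ones((n, 1))])   # add an intercept column
models = []
for j in range(J):
    beta_j, *_ = np.linalg.lstsq(X1, coefs[:, j], rcond=None)
    models.append(beta_j)

# "Vector assemble": stack the J scalar predictions back into a vector.
coef_pred = np.column_stack([X1 @ beta_j for beta_j in models])

# PCA pseudo-inverse: map predicted coefficients back to the 96-dimensional domain space.
Y_pred = coef_pred @ np.linalg.pinv(W) + y_mean

rmse = float(np.sqrt(np.mean((Y_pred - Y) ** 2)))
print(f"domain-space RMSE: {rmse:.4f}")
```

Because the columns of W are orthonormal, its pseudo-inverse is simply its transpose, which is why the inverse step is cheap.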
Evaluation of Rating Feed Vectors
Exploring Model Performance
Big Data Results
• X: y, pred y (vectors)
Unpivot Vector
• X, qh: RatingforQuarterHour, PredRatingforQuarterHour, ErrorforQuarterHour
Small Data Artifacts
• Visualizations & Performance Statistics
*Like Estimators, Evaluators in Spark ML are 1 dimensional
Ensure performance quality of our predictive models:
Steps:
• UDF Composition
• Data Wrangling
• Machine Learning Evaluation
UDF Composition:
• Perform element-wise calculations on Vectors
• Zip and flatten relevant vectors
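The two UDF steps above, sketched in plain Python rather than as Spark UDFs. The row layout, field names, and the 4-entry vectors (real rating vectors have 96 entries) are hypothetical; in Spark the same shape would presumably come from a UDF followed by an explode, but that mapping is an assumption.

```python
# Hypothetical wide rows: one feature key plus actual and predicted rating vectors.
# Real vectors have 96 entries; 4 are used here to keep the example short.
wide_rows = [
    {"network": "A", "y": [0.010, 0.020, 0.030, 0.040], "pred": [0.012, 0.018, 0.030, 0.050]},
    {"network": "B", "y": [0.001, 0.002, 0.003, 0.004], "pred": [0.001, 0.003, 0.002, 0.004]},
]

def elementwise_error(y, pred):
    """Element-wise calculation on two equal-length vectors: predicted minus actual."""
    return [p - a for a, p in zip(y, pred)]

def unpivot(row):
    """Zip the vectors and flatten them into one long-format record per quarter hour."""
    err = elementwise_error(row["y"], row["pred"])
    return [
        {"network": row["network"], "qh": qh, "rating": a, "pred_rating": p, "error": e}
        for qh, (a, p, e) in enumerate(zip(row["y"], row["pred"], err))
    ]

long_rows = [rec for row in wide_rows for rec in unpivot(row)]

for rec in long_rows[:3]:
    print(rec)
```

Each long-format record carries a single scalar rating, prediction, and error, which is the shape a one-dimensional evaluator can consume.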
Data Wrangling:
• Pivot & Aggregate using summary statistics
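A plain-Python sketch of the aggregation step, assuming long-format records of (feature key, quarter hour, error). The records and the particular summary statistics (mean error, MAE, RMSE) are illustrative choices; the slides do not say which statistics were used.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format records: (feature key, quarter hour, error).
long_rows = [
    ("A", 0, 0.002), ("A", 1, -0.002), ("A", 2, 0.000),
    ("B", 0, 0.000), ("B", 1, 0.001), ("B", 2, -0.001),
]

# Group errors by quarter hour across all feature combinations.
errors_by_qh = defaultdict(list)
for _, qh, err in long_rows:
    errors_by_qh[qh].append(err)

# Summarize each quarter hour with a few small-data statistics.
summary = {
    qh: {
        "mean_error": mean(errs),                   # signed bias
        "mae": mean(abs(e) for e in errs),          # mean absolute error
        "rmse": mean(e * e for e in errs) ** 0.5,   # root mean squared error
    }
    for qh, errs in sorted(errors_by_qh.items())
}

for qh, stats in summary.items():
    print(qh, stats)
```

The result is one small row per quarter hour, suitable for plotting or tabulating outside the cluster.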
Future Work
• Program Schedule based short term refinements
– While our sales teams work with the “climate-like”
ratings forecasts generated months in advance,
operations buys media with weeks lead time
• Rentrak Integrations & Sensor Fusion
– Nielsen Ratings are Panel driven and Rentrak is
census based, but both are fundamentally
observations of the same underlying phenomenon
Contributors
• Michael Zargham
– Director, Data Science @ Cadent
– PhD in Optimization and Decision Theory from UPenn
– Founder of Cadent Data Science Team
– Architect of Information and Decision systems
• Stefan Panayotov
– Sr. Data Engineer @ Cadent Technology
– PhD in Computer Science from Bulgarian Academy of Sciences
– Implemented the Big Data platform to support the data science and business intelligence teams at Cadent
– Built ETL & ELT processes and worked on creating ML model pipelines for predicting ratings
• Joshua Jodesty
– Jr. Data Engineer @ Cadent Technology
– Award-winning Learning Analytics researcher
– B.S. in Information Science & Technology from Temple University
Broader Data Team @ Cadent
• Stephanie Mitchko-Beal, CTO/COO – Driver of Cadent’s Data Driven Transition
• Dr. Joe Matarese – Chief Technologist, General Manager, Silicon Valley Office
– Former VP & GM of ARRIS On Demand, SVP Advanced Technology at C-COR and CTO nCUBE
– Experience in high performance computing applied to big data problems in seismology and geophysical inverse theory
• Dr. David Sisson – VP Strategic Technology
– Research in computational neuroscience and signal processing, data platform architect at Cadent Network
• Chris Frazier – VP Business Intelligence
• Mark Sun – VP Software Development
– MS, Computer Science; BS, Nuclear Engineering and leader of Cadent DAI platform development team
• Dr. Yun Huang – Data Engineer & Director, Software Development
• Matthew Plourde – Sr. Analytics Engineer & Lead Machine Learning Developer
• Team has state of the art skills
– Over a dozen engineers with Apache Spark Big Data platform development experience
– 8 Engineers and analysts with Machine Learning experience
– Expertise in a wide array of languages, including Python, R, SQL, Java, Scala, and C#
– Across our ranks, the data team has 6 PhDs, including from top universities like Penn, MIT, & Caltech
Special Thanks to Databricks
Databricks’ Spark platform provided:
• the necessary stability and scalability for work of
this sophistication
• made accessible to us by a quality support staff
• at a cost that a mid-sized business can afford
Thank You.
Contact Us
Mike: mzargham@cadent.tv
Stefan: spanayotov@cadent.tv
Josh: jjodesty@cadent.tv
Interested in our team?
http://cadent.tv/careers/

More Related Content

What's hot

Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Wee Hyong Tok
 
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowImproving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Databricks
 
Superworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and FugueSuperworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and Fugue
Databricks
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSABuilding the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Databricks
 

What's hot (20)

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowImproving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Spark and Online Analytics: Spark Summit East talky by Shubham Chopra
Spark and Online Analytics: Spark Summit East talky by Shubham ChopraSpark and Online Analytics: Spark Summit East talky by Shubham Chopra
Spark and Online Analytics: Spark Summit East talky by Shubham Chopra
 
Tailored for Spark
Tailored for SparkTailored for Spark
Tailored for Spark
 
DASK and Apache Spark
DASK and Apache SparkDASK and Apache Spark
DASK and Apache Spark
 
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
 
Superworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and FugueSuperworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and Fugue
 
What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0
 
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
 
Spark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark GroupSpark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark Group
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSABuilding the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
 
H2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudH2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks Cloud
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena Lazovik
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
 

Viewers also liked

Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Spark Summit
 
IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...
IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...
IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...
Spark Summit
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Spark Summit
 

Viewers also liked (20)

Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...
R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...
R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
 
IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...
IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...
IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...
 
Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...
Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...
Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...
 
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
 
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
 
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
 
FIS: Accelerating Digital Intelligence in FinTech: Spark Summit East talk by...
 FIS: Accelerating Digital Intelligence in FinTech: Spark Summit East talk by... FIS: Accelerating Digital Intelligence in FinTech: Spark Summit East talk by...
FIS: Accelerating Digital Intelligence in FinTech: Spark Summit East talk by...
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
 
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...
 
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
 
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
 

Similar to Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit East talk by Stefan Panayotov and Michael Zargham

Nikhila Marripati Resume - BI/DataEngineer
Nikhila Marripati Resume -  BI/DataEngineerNikhila Marripati Resume -  BI/DataEngineer
Nikhila Marripati Resume - BI/DataEngineer
bnikhila43
 
Bluegranite AA Webinar FINAL 28JUN16
Bluegranite AA Webinar FINAL 28JUN16Bluegranite AA Webinar FINAL 28JUN16
Bluegranite AA Webinar FINAL 28JUN16
Andy Lathrop
 
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
XanGwaps
 

Similar to Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit East talk by Stefan Panayotov and Michael Zargham (20)

Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptx
 
Spark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
Spark ML with High Dimensional Labels Michael Zargham and Stefan PanayotovSpark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
Spark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
 
Resume anh chu data analyst
Resume anh chu data analystResume anh chu data analyst
Resume anh chu data analyst
 
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
 
Pratik Patel Python/ Big Data Analyst
Pratik Patel Python/ Big Data AnalystPratik Patel Python/ Big Data Analyst
Pratik Patel Python/ Big Data Analyst
 
Pratik Patel resume
Pratik Patel  resumePratik Patel  resume
Pratik Patel resume
 
SOP Planning and Optimization Solution-as-a-Service.pdf
SOP Planning and Optimization Solution-as-a-Service.pdfSOP Planning and Optimization Solution-as-a-Service.pdf
SOP Planning and Optimization Solution-as-a-Service.pdf
 
S&OP as a Service.pdf
S&OP as a Service.pdfS&OP as a Service.pdf
S&OP as a Service.pdf
 
Vadlamudi saketh30 (ml)
Vadlamudi saketh30 (ml)Vadlamudi saketh30 (ml)
Vadlamudi saketh30 (ml)
 
Resume
ResumeResume
Resume
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
 
Rahul Chauhan Resume - Data Scientist.pdf
Rahul Chauhan Resume - Data Scientist.pdfRahul Chauhan Resume - Data Scientist.pdf
Rahul Chauhan Resume - Data Scientist.pdf
 
Resume anh chu
Resume anh chuResume anh chu
Resume anh chu
 
Nikhila Marripati Resume - BI/DataEngineer
Nikhila Marripati Resume -  BI/DataEngineerNikhila Marripati Resume -  BI/DataEngineer
Nikhila Marripati Resume - BI/DataEngineer
 
Innovate 2014 - Customizing Your Rational Insight Deployment (workshop)
Innovate 2014 - Customizing Your Rational Insight Deployment (workshop)Innovate 2014 - Customizing Your Rational Insight Deployment (workshop)
Innovate 2014 - Customizing Your Rational Insight Deployment (workshop)
 
Bluegranite AA Webinar FINAL 28JUN16
Bluegranite AA Webinar FINAL 28JUN16Bluegranite AA Webinar FINAL 28JUN16
Bluegranite AA Webinar FINAL 28JUN16
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analytics
 
Presented at useR! 2010
Presented at useR! 2010Presented at useR! 2010
Presented at useR! 2010
 
Vivek Adithya Mohankumar Resume
Vivek Adithya Mohankumar ResumeVivek Adithya Mohankumar Resume
Vivek Adithya Mohankumar Resume
 
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
 

More from Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...

Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit East talk by Stefan Panayotov and Michael Zargham

  • 1. APACHE SPARK FOR MACHINE LEARNING WITH HIGH DIMENSIONAL LABELS Michael Zargham and Stefan Panayotov Cadent, Data Science & Engineering Research
  • 2. © 2016 Cadent. All rights reserved. Data Technology Company specializing in Television Advertising § Cadent has a bicoastal data science and engineering team - Our business runs on internally developed software - Hybrid cloud Apache Spark infrastructure - Analytical rather than rule-driven algorithms - Machine Learning APIs and custom mathematics in decision optimizations - Collaborations with IBM Research (Spark TC) and Product team (Data Science Experience) Cadent: Data Empowered Television Advertising Data Infrastructure Engineering Science Decisions Analytics
  • 3. Motivation • Business Model – 2 sided business – Upfront Sales sell Impressions – Fulfill with Scatter Purchases based on subscribers – Impressions = ratings * subscribers • Relevant Scales – Weather-like View • Shows • Twitter trends • Spectacle Events – Climate-like View • Seasonality • Subscriber trends • Daypart Variation
  • 4. Theoretical Approach Features • unique combinations of targetable characteristics • Network, Age, Gender, Category, Season, etc. Rating Vectors • 96 positive real values, one per Quarter Hour of Day (plot axes: Quarter Hour of Day vs. Rating for Quarter Hour)
  • 5. Daily Patterns: Mean & Variance Values in Log-like coordinate system: value 0 = rating 0 value 3 = rating 10^(-5) value 5 = rating 10^(-3) mean variance
  • 6. Label Dimensionality Reduction Features • unique combinations of targetable characteristics • Network, Age, Gender, Category, Season, etc. Rating Vectors (by Quarter Hour of Day) • 96 positive real values Coefficients of Principal Components • J real values
  • 7. Captured Variance Warning: Uncaptured Variance is strictly lost from the predictive model
  • 8. Why Reduce Label Dimension • The correlations between values captured by reducing to principal components add more value than the variance lost in the “climate-like” view • The Apache Spark ML API doesn’t support nDim regression, so J-dimensional regression is computationally efficient for J<<n
  • 9. Coordinate Systems Matter • Regression works well when… – Euclidean distance fits well with the human sense of “sameness” – The labels being predicted are well conditioned • A big part of our methodology is understanding the mathematical spaces our data lives in and using ‘change of coordinate’ techniques • Example (time of day, 0:00 to 23:59): define points on the unit circle using 2D (x,y) coordinates, with unknowns imputed at (0,0)
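
The unit-circle idea on slide 9 can be illustrated with a minimal sketch. The slide does not say where the angle starts or which axis carries which component, so the choices below (angle measured from midnight, `cos` for x, `sin` for y) and the function names are assumptions made for illustration only.

```python
import math

MINUTES_PER_DAY = 24 * 60


def time_to_unit_circle(minute_of_day):
    """Map a minute of the day (0..1439) to an (x, y) point on the unit circle.

    Unknown times (None) are imputed at the origin (0, 0), which is at
    distance 1 from every point on the circle.
    """
    if minute_of_day is None:
        return (0.0, 0.0)
    angle = 2.0 * math.pi * (minute_of_day % MINUTES_PER_DAY) / MINUTES_PER_DAY
    return (math.cos(angle), math.sin(angle))


def euclidean(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


late = time_to_unit_circle(23 * 60 + 59)  # 23:59
midnight = time_to_unit_circle(0)         # 0:00
noon = time_to_unit_circle(12 * 60)       # 12:00
unknown = time_to_unit_circle(None)       # imputed at (0, 0)
```

On a raw 0 to 1439 scale, 23:59 and 0:00 look maximally different; in these coordinates `euclidean(late, midnight)` is tiny while `euclidean(midnight, noon)` is the full diameter, which matches the intuitive sense of "sameness" the slide asks for.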
  • 10. Custom Log-Like Coordinates This coordinate system is used to eliminate bias in error metrics. In the domain space, the errors in large-value ratings swamp those of small-value ratings
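
The deck does not give the formula for its log-like coordinates. The sketch below is one hypothetical transform that happens to agree with the three reference points on slide 5 (value 0 = rating 0, value 3 = rating 10^(-5), value 5 = rating 10^(-3)); the offset of 8, the clamping at zero, and the function names are all assumptions, not the authors' implementation.

```python
import math

# Assumed offset: chosen only so that rating 1e-5 maps to 3 and 1e-3 maps to 5.
OFFSET = 8.0


def to_loglike(rating):
    """Map a non-negative rating to a log-like value; rating 0 maps to 0."""
    if rating <= 0.0:
        return 0.0
    # Ratings below 10**(-OFFSET) are clamped to 0 in this sketch.
    return max(0.0, math.log10(rating) + OFFSET)


def from_loglike(value):
    """Inverse of to_loglike for values above 0; value 0 maps back to rating 0."""
    if value <= 0.0:
        return 0.0
    return 10.0 ** (value - OFFSET)


# Doubling a large rating and doubling a small rating give the same
# error in log-like coordinates, although the raw errors differ 100-fold.
large_error = to_loglike(2e-3) - to_loglike(1e-3)
small_error = to_loglike(2e-5) - to_loglike(1e-5)
```

In the raw domain the first error is 100 times the second, so it would dominate any aggregate metric; in these coordinates both are the same size, which is the bias the slide describes removing.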
  • 11. Predictor-Corrector Method • Predictor-Corrector is a form of ensemble – Build a naïve model and an estimator of that model’s bias function, pipelining them together to create a PCM • Naïve model: X -> yHat (trained on y) • Bias estimator: X -> eHat (trained on e = yHat - y) • Combined prediction: yPred = yHat - eHat
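
A toy sketch of the predictor-corrector pattern from slide 11 follows. It keeps the slide's sign convention (e = yHat - y, yPred = yHat - eHat) but swaps in deliberately trivial stand-in models: a global-mean predictor and a per-key mean of the residuals. The deck's actual models are regressors in PCA-reduced spaces; the function names and data here are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_naive(rows):
    """Naive model: predict the global mean label for every input."""
    global_mean = mean(y for _, y in rows)
    return lambda x: global_mean


def fit_corrector(rows, naive):
    """Bias estimator: mean of e = yHat - y for each feature key."""
    errors = defaultdict(list)
    for x, y in rows:
        errors[x].append(naive(x) - y)
    e_hat = {x: mean(es) for x, es in errors.items()}
    # Feature keys never seen in training receive no correction.
    return lambda x: e_hat.get(x, 0.0)


def make_pcm(naive, corrector):
    """Pipeline the two models: yPred = yHat - eHat."""
    return lambda x: naive(x) - corrector(x)


# Made-up training data: (feature key, label).
train = [
    ("primetime", 5.0),
    ("primetime", 5.2),
    ("overnight", 3.0),
    ("overnight", 3.2),
]

naive = fit_naive(train)
pcm = make_pcm(naive, fit_corrector(train, naive))
```

The naive model predicts the same value everywhere, so it is biased low for `"primetime"` and high for `"overnight"`; the corrector learns those biases and the combined model removes them.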
  • 12. Implemented Workflow Naïve Estimator Model (NEM) Domain Space Forecast Correction Estimator Model (CEM) Logspace Local Coordinate Forecast Both NEM and CEM are Regressors in reduced dimensional vector spaces created using PCA linear subspace reductions to find efficient coordinate systems.
  • 13. Principal Component Analysis (PCA): to reduce the dimensionality of the problem Features • unique combinations of targetable characteristics • Network, Age, Gender, Category, Season, etc. Pipeline of J single-label regression models: PCA Transform -> Vector Disassemble -> Train GBT Regressor 1 … Train GBT Regressor J (one per Component 1 … Component J) -> Vector Assembler -> PCA pseudo-inverse
  • 14. Inverse PCA: Transform forecasts into domain space
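
The forward and inverse transforms on slides 13 and 14 can be sketched in a few lines. This toy example assumes the principal components have already been computed and are orthonormal, in which case the pseudo-inverse is simply the transpose. The 4-dimensional vectors, the two hand-picked components, and the function names are illustrative assumptions; the deck works with 96-dimensional rating vectors.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pca_transform(y, mean_vec, components):
    """Project a centered label vector onto J orthonormal components."""
    centered = [a - m for a, m in zip(y, mean_vec)]
    return [dot(w, centered) for w in components]


def pca_inverse(coefs, mean_vec, components):
    """Map J coefficients back to the original label space."""
    recon = list(mean_vec)
    for c, w in zip(coefs, components):
        recon = [r + c * wi for r, wi in zip(recon, w)]
    return recon


# Hypothetical setup: n = 4 label dimensions reduced to J = 2 coefficients.
mean_vec = [1.0, 1.0, 1.0, 1.0]
components = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
]

# This vector lies in the span of the components, so it is recovered exactly.
in_span = [3.0, 3.0, 1.0, 1.0]
in_span_recon = pca_inverse(pca_transform(in_span, mean_vec, components),
                            mean_vec, components)

# This vector's deviation from the mean is orthogonal to both components,
# so its coefficients are zero and the reconstruction collapses to the mean.
off_span = [2.0, 0.0, 1.0, 1.0]
off_span_coefs = pca_transform(off_span, mean_vec, components)
off_span_recon = pca_inverse(off_span_coefs, mean_vec, components)
```

In the workflow described on slide 13, the J coefficients fed to the inverse step would come from J separately trained regressors rather than from a direct projection. The second example shows the warning on slide 7 in miniature: whatever the components do not capture cannot be restored by the inverse transform.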
  • 15. Gradient Boosted Tree Regressor Pipeline: Stages = PCA Coefficients
  • 16. Evaluation of Rating Feed Vectors
  • 17. Exploring Model Performance Big Data Results (X, y, pred y as vectors) -> Unpivot Vector -> one row per quarter hour (X, qh, Rating for Quarter Hour, Pred Rating for Quarter Hour, Error for Quarter Hour) -> Small Data Artifacts -> Visualizations & Performance Statistics *Like Estimators, Evaluators in Spark ML are 1-dimensional
  • 18. Evaluation of Rating Feed Vectors Ensure performance quality of our predictive models: Steps: • UDF Composition • Data Wrangling • Machine Learning Evaluation
  • 19. Evaluation of Rating Feed Vectors UDF Composition: • Perform element-wise calculations on Vectors
  • 20. Evaluation of Rating Feed Vectors UDF Composition: • Zip and flatten relevant vectors
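
The transcript does not include the UDF code shown on slides 19 and 20, so the following is only a plain-Python sketch of the two steps they name: an element-wise calculation on vectors, then zipping and flattening the vectors into one record per quarter hour. The record field names and the three-element vectors are hypothetical, and the error sign follows slide 11 (prediction minus actual).

```python
def elementwise_error(actual, predicted):
    """Element-wise error between two equal-length vectors (pred - actual)."""
    return [p - a for a, p in zip(actual, predicted)]


def unpivot(row):
    """Expand one (features, actual, predicted) row into one record per quarter hour."""
    features, actual, predicted = row
    errors = elementwise_error(actual, predicted)
    return [
        {"features": features, "qh": qh, "rating": a, "pred": p, "error": e}
        for qh, (a, p, e) in enumerate(zip(actual, predicted, errors))
    ]


# Shortened example vectors; the deck's rating vectors have 96 entries.
records = unpivot(("net_a", [1.0, 2.0, 4.0], [1.5, 2.0, 3.0]))
```

After this step each record carries a single scalar rating, prediction, and error, which is the shape a one-dimensional evaluator can consume.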
  • 21. Evaluation of Rating Feed Vectors Data Wrangling:
  • 22. Evaluation of Rating Feed Vectors Data Wrangling: • Pivot & Aggregate using summary statistics
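
As a rough illustration of the "pivot and aggregate using summary statistics" step on slide 22, the sketch below groups per-quarter-hour error records and computes a few statistics for each group. The record layout, the chosen statistics, and the sample numbers are assumptions; the slide does not list which statistics were used.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_qh(records):
    """Group errors by quarter hour and compute summary statistics per group."""
    by_qh = defaultdict(list)
    for rec in records:
        by_qh[rec["qh"]].append(rec["error"])
    return {
        qh: {
            "count": len(errors),
            "mean_error": mean(errors),
            "mean_abs_error": mean(abs(e) for e in errors),
        }
        for qh, errors in sorted(by_qh.items())
    }


# Made-up unpivoted records: one error per (feature combination, quarter hour).
records = [
    {"features": "net_a", "qh": 0, "error": 0.5},
    {"features": "net_b", "qh": 0, "error": -0.5},
    {"features": "net_a", "qh": 1, "error": 1.0},
    {"features": "net_b", "qh": 1, "error": 3.0},
]

summary = summarize_by_qh(records)
```

Reporting both the signed mean and the mean absolute error is a deliberate choice in this sketch: quarter hour 0 has zero mean error only because its two errors cancel, which the absolute statistic exposes.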
  • 23. Evaluation of Rating Feed Vectors
  • 24. Future Work • Program Schedule based short term refinements – While our sales teams work with the “climate-like” ratings forecasts generated months in advance, operations buys media with weeks lead time • Rentrak Integrations & Sensor Fusion – Nielsen Ratings are Panel driven and Rentrak is census based, but both are fundamentally observations of the same underlying phenomenon
  • 25. Contributors • Michael Zargham – Director, Data Science @ Cadent – PhD in Optimization and Decision Theory from UPenn – Founder of Cadent Data Science Team – Architect of Information and Decision systems • Stefan Panayotov – Sr. Data Engineer @ Cadent Technology – PhD in Computer Science from Bulgarian Academy of Sciences – Implemented the Big Data platform to support the data science and business intelligence teams at Cadent – Built ETL & ELT processes and worked on creating ML model pipelines for predicting ratings • Joshua Jodesty – Jr. Data Engineer @ Cadent Technology – Award-winning Learning Analytics researcher – B.S. in Information Science & Technology from Temple University
  • 26. Broader Data Team @ Cadent • Stephanie Mitchko-Beal, CTO/COO – Driver of Cadent’s Data Driven Transition • Dr. Joe Matarese – Chief Technologist, General Manager, Silicon Valley Office – Former VP & GM of ARRIS On Demand, SVP Advanced Technology at C-COR and CTO nCUBE – Experience in high performance computing applied to big data problems in seismology and geophysical inverse theory • Dr. David Sisson – VP Strategic Technology – Research in computational neuroscience and signal processing, data platform architect at Cadent Network • Chris Frazier – VP Business Intelligence • Mark Sun – VP Software Development – MS, Computer Science; BS, Nuclear Engineering and leader of Cadent DAI platform development team • Dr. Yun Huang – Data Engineer & Director, Software Development • Matthew Plourde – Sr. Analytics Engineer & Lead Machine Learning Developer • Team has state-of-the-art skills – Over a dozen engineers with Apache Spark Big Data platform development experience – 8 engineers and analysts with Machine Learning experience – Expertise in a wide array of languages including Python, R, SQL, Java, Scala, C# – Across our ranks the data team has 6 PhDs, including from top universities like Penn, MIT, & Caltech
  • 27. Special Thanks to Databricks Databricks’ Spark platform provided: • the necessary stability and scalability for work of this sophistication • made accessible to us by a quality support staff • at a cost that a mid-sized business can afford
  • 28. Thank You. Contact Us Mike: mzargham@cadent.tv Stefan: spanayotov@cadent.tv Josh: jjodesty@cadent.tv Interested in our team? http://cadent.tv/careers/