The document discusses analyzing YouTube trending video data to estimate the initial number of views needed for a video to be featured on YouTube's trending list. It cleans the data by removing duplicate trending days for videos and outliers. It then takes simple random samples and stratified samples by video category to estimate the mean initial views. It finds that a proportional allocation stratified sample after adjusting for design effects provides the most precise estimate near the true mean initial views of around 94,000.
The document analyzes metadata from trending YouTube videos collected between November 2017 and March 2018 to estimate the initial mean number of views for videos to be on YouTube's trending list. It explores using simple random sampling and stratified proportional allocation to estimate the mean views. The proportional allocation after adjusting for design effect is found to have the lowest average absolute difference from the true mean and highest percentage of the true mean falling within the confidence interval. The analysis also finds that "Music" and "Comedy" categories may be harder to trend while "Nonprofits & Activism" and "News & Politics" may be easier.
Estimating the initial mean number of views for videos to be on youtube's trending list
Yao Yao
MSDS 6370 Section 403
Class Project
Estimating the Initial Mean Number of Views for Videos to be on Youtube's Trending List
Introduction
From Kaggle, the metadata of top trending YouTube videos from the United States, Canada, Great Britain, Germany, and France was collected daily from November 14, 2017 to March 20, 2018 (127 days) via the YouTube API [1]. The YouTube algorithm for videos to start trending is driven by the social interactions of users who view, share, comment on, and like certain videos. In the dataset, the gap between a video's publish time and its first appearance on the trending list ranges from a few hours to several years, with some trending videos published as early as 2010. Once a video starts trending on YouTube, it can stay on the trending list for multiple days while gaining more views. We are aware that the time of day at which the data are collected matters, and that YouTube may weigh other algorithmic factors, including social interactions on YouTube and on other platforms, when deciding whether a video trends [2]. For a rough estimate of how many views a video must garner to be on the YouTube trending list, we decided to base our estimation approach on view counts alone, without these extended interactions. The numbers of shares, comments, and likes are largely a consequence of how many views a video has, and a publisher (for example, Apple with its commercials) can disable embedding, comments, and likes for certain videos, which would skew those measures even further. The view count is therefore the true initial indicator from which all subsequent user interaction follows, and it is the primary figure for studying how a YouTube video reaches the top trending list.
Data Cleaning
Within the scope of our analysis, we are only concerned with estimating the initial number of views a video must acquire before being featured on YouTube's top trending list. Therefore, when a video is featured on multiple days, only the view count from its first trending day is counted in the analysis. Due to the limited time frame of the data collection, older videos with higher view counts may have gotten their initial boost in views while trending outside of the 127-day collection period. We therefore kept only the videos whose view counts fall within 1.5 times the interquartile range of the whole dataset, giving a more accurate basis for estimating the mean number of views a video needs to get on the trending list (Figure 1).
Figure 1a: Uncleaned Box Plot Distribution of Video Views (N = 49,841)
Figure 1b: Cleaned Box Plot Distribution of Video Views (N = 44,506)
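The cleaning steps above can be sketched in a few lines of pandas. The file names, column names (video_id, trending_date, views), and date format follow the Kaggle trending-video files but are assumptions here; the original workflow may have used different tooling.

```python
import pandas as pd

# Combine the five country files (file and column names assumed from the
# Kaggle dataset: video_id, trending_date, views).
files = ["USvideos.csv", "CAvideos.csv", "GBvideos.csv",
         "DEvideos.csv", "FRvideos.csv"]
videos = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# Parse the trending date so sorting is chronological (the yy.dd.mm format
# is an assumption about these files).
videos["trending_date"] = pd.to_datetime(videos["trending_date"],
                                         format="%y.%d.%m", errors="coerce")

# Keep only each video's first trending day, so duplicate trending days do
# not contribute extra rows.
videos = (videos.sort_values("trending_date")
                .drop_duplicates(subset="video_id", keep="first"))

# Drop outliers outside 1.5 * IQR of the initial view counts.
q1, q3 = videos["views"].quantile([0.25, 0.75])
iqr = q3 - q1
cleaned = videos[videos["views"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Summary statistics reported in Figure 2: count, mean, standard deviation,
# and standard error of the mean for the cleaned data.
n = len(cleaned)
print(n, cleaned["views"].mean(), cleaned["views"].std(),
      cleaned["views"].std() / n ** 0.5)
```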
Figure 2 shows numerically how drastically the figures change once the data are cleaned: the true population mean decreases from 224,403 views to 93,891.70 views, the standard deviation decreases from 731,176 views to 109,607 views, and the standard error decreases from 3,275.13 views to 519.55 views for the uncleaned and cleaned datasets, respectively. The statistics from the cleaned dataset are used for the estimation procedures that follow.
Figure 2a: Uncleaned Quartile Distribution of Video Views (N = 49,841)
Figure 2b: Cleaned Quartile Distribution of Video Views (N = 44,506)
Task 1: Simple Random Sample
Using a 95% confidence level with a margin of error of 5,000 views for the cleaned dataset, we take a simple random sample of 1,847 observations. The FPC adjustment is ignored because the sample is roughly 4% of the cleaned dataset (which itself retains 89.29% of the original observations), well below the 10% threshold.
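As a quick check on the stated sample size, the usual margin-of-error calculation with the cleaned standard deviation reported above (109,607 views) and z = 1.96 reproduces n ≈ 1,847; this is a sketch, not necessarily the original computation.

```python
import math

z = 1.96       # z-value for 95% confidence
s = 109_607    # standard deviation of the cleaned view counts (Figure 2)
e = 5_000      # desired margin of error, in views

n = math.ceil((z * s / e) ** 2)
print(n)       # -> 1847
```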
For the simple random sample with sample size 1,847 and seed 1847, the mean estimate is 91,932.82 views with a standard error of 2,490.11 views at the 95% confidence level; the true mean of 93,891.70 views falls within the resulting interval around the survey estimate (Figure 5).
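A minimal sketch of the simple random sample estimate follows, reusing the cleaned data frame from the cleaning sketch above; the original analysis was presumably run in survey software, so the seed mechanics here will not reproduce the exact figures.

```python
import numpy as np

def srs_estimate(views, n, seed):
    """Draw a simple random sample of size n and return (mean, standard error)."""
    rng = np.random.default_rng(seed)
    sample = rng.choice(views, size=n, replace=False)
    return sample.mean(), sample.std(ddof=1) / np.sqrt(n)  # FPC omitted, as above

mean, se = srs_estimate(cleaned["views"].to_numpy(), n=1847, seed=1847)
print(mean, se, (mean - se, mean + se))
```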
Figure 3a: Stratified YouTube Views Box Plot Distribution by Category
Figure 3b: Stratified YouTube Views Quartile Distribution by Category
Task 1: Proportional Allocation of Stratified Sample
We stratified the videos by category, to see if there is a difference in mean and distributions for how a
YouTube video becomes trending on YouTube (Figure 3). 'Entertainment' and 'People & Blogs' have the
most observations while 'Nonprofits & Activism' and 'Travel & Events' have the least observations for
the dataset collected. 'Nonprofits & Activism' and 'News and Politics' require the least mean views for a
video to be on trending while 'Music' and 'Comedy' require the most mean views for a video to be on
trending. 'Entertainment' and 'People & Blogs' have the lowest standard error for their mean views to
be on trending while 'Pets & Animals' and 'Travel & Events' have the highest standard error for their
mean views to be on trending.
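The per-category counts, means, and standard errors behind Figure 3 can be tabulated with a groupby; the category name column is an assumption (the Kaggle files carry a numeric category_id that has to be mapped to names via the accompanying JSON).

```python
# Per-stratum summary of the cleaned data (column name "category" assumed).
strata = (cleaned.groupby("category")["views"]
                 .agg(Nh="size", mean_views="mean", sd="std"))
strata["se"] = strata["sd"] / strata["Nh"] ** 0.5
print(strata.sort_values("mean_views"))
```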
For proportional allocation of a sample of size 1,847 across the category strata, the allocations for the categories with the smallest sample sizes ('Nonprofits & Activism' and 'Travel & Events') were rounded up so that the sampled total equals 1,847 (Figure 4).
Stratum | Observations (Nh) | Nh/N | 1847*Nh/N | Sample Size (n = 1847) | 1893*Nh/N | Sample Size (n = 1893)
Autos & Vehicles | 898 | 0.020177 | 37.26702 | 37 | 38.19516 | 38
Comedy | 2732 | 0.061385 | 113.3781 | 113 | 116.2018 | 116
Education | 1165 | 0.026176 | 48.34753 | 48 | 49.55163 | 50
Entertainment | 13136 | 0.295151 | 545.1443 | 545 | 558.7213 | 559
Film & Animation | 2307 | 0.051836 | 95.74055 | 96 | 98.12499 | 98
Gaming | 1806 | 0.040579 | 74.94904 | 75 | 76.81567 | 77
Howto & Style | 2824 | 0.063452 | 117.1961 | 117 | 120.1149 | 120
Music | 2467 | 0.055431 | 102.3806 | 102 | 104.9304 | 105
News & Politics | 4991 | 0.112142 | 207.1266 | 207 | 212.2852 | 212
Nonprofits & Activism | 201 | 0.004516 | 8.341505 | 9 | 8.549252 | 9
People & Blogs | 6522 | 0.146542 | 270.6631 | 271 | 277.4041 | 277
Pets & Animals | 399 | 0.008965 | 16.55851 | 17 | 16.9709 | 17
Science & Technology | 1043 | 0.023435 | 43.28452 | 43 | 44.36254 | 44
Sports | 3740 | 0.084034 | 155.2101 | 155 | 159.0756 | 159
Travel & Events | 275 | 0.006179 | 11.41251 | 12 | 11.69674 | 12
Total | 44506 | | | 1847 | | 1893
Figure 4: Proportional Allocation of Stratified Views by Category for Sample Size 1847 and, After the Design Effect Adjustment, Sample Size 1893
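The allocations in Figure 4 follow n_h = n * Nh / N rounded to the nearest integer, with the two smallest strata rounded up so the totals reach exactly 1,847 and 1,893; a sketch of that rule using the stratum counts from the table:

```python
import math

# Stratum counts (Nh) taken from Figure 4.
counts = {
    "Autos & Vehicles": 898, "Comedy": 2732, "Education": 1165,
    "Entertainment": 13136, "Film & Animation": 2307, "Gaming": 1806,
    "Howto & Style": 2824, "Music": 2467, "News & Politics": 4991,
    "Nonprofits & Activism": 201, "People & Blogs": 6522,
    "Pets & Animals": 399, "Science & Technology": 1043,
    "Sports": 3740, "Travel & Events": 275,
}

def proportional_allocation(counts, n, round_up=()):
    """Allocate n sampled units across strata proportionally to their sizes."""
    N = sum(counts.values())
    return {h: (math.ceil(n * Nh / N) if h in round_up else round(n * Nh / N))
            for h, Nh in counts.items()}

alloc = proportional_allocation(
    counts, 1847, round_up={"Nonprofits & Activism", "Travel & Events"})
print(sum(alloc.values()))  # -> 1847
```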
For the proportional allocation stratified sample with sample size 1,847 and seed 1847, the mean estimate is 96,607.95 views with a standard error of 2,551.78 views at the 95% confidence level; the true mean of 93,891.70 views does not fall within the resulting interval around the survey estimate.
To account for the design effect, comparing the standard errors of the simple random sample and the proportional allocation stratified sample shows that the simple random estimate is the more precise of the two. Adjusting for this design effect raises the proportional allocation sample size from 1,847 to 1,893 observations (Figure 4).
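A minimal sketch of the adjustment in R; the report's adjusted size of 1,893 is reproduced by scaling the sample size by the ratio of the two standard errors, though the design effect is more commonly defined as the ratio of variances, which would give a somewhat larger sample:
#sketch: inflate the stratified sample size for the estimated design effect
seSRS  <- 2490.11                 #SE of the simple random sample (Figure 5)
seProp <- 2551.78                 #SE of the proportional allocation sample (Figure 5)
deff   <- seProp / seSRS          #~1.0248; (seProp/seSRS)^2 is the variance-ratio form
nAdj   <- ceiling(1847 * deff)    #= 1893
nAdj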
For the proportional allocation stratified sample after the design effect adjustment, with sample size 1,893 and seed 1893, the mean estimate is 95,055.87 views with a standard error of 2,484.97 views at the 95% confidence level; the true mean of 93,891.70 views falls within the resulting confidence interval for the survey mean estimate.
Task 1: Comparisons of Sampling Procedures x1
Sample Procedure                              Sample Size   Mean Estimate   Standard Error   Lower CI    Upper CI    True Mean Within CI?   Abs Diff From True Mean
Simple Random                                 1847          91932.82        2490.11          89442.71    94422.93    Yes                    1958.88
Proportional Allocation                       1847          96607.95        2551.78          94056.17    99159.73    No                     2716.25
Proportional Allocation After Design Effect   1893          95055.87        2484.97          92570.9     97540.84    Yes                    1164.17
Figure 5: Comparisons of Mean Estimate for Task 1 Sampling Procedures for True Mean of 93891.7
For sample size 1,847, the simple random sample performed better than proportional allocation: the true mean was within the confidence interval for the simple random sample but not for proportional allocation. Once corrected for the design effect with a sample size of 1,893, the proportional allocation mean estimate is closer to the true mean than that of the simple random sample (Figure 5), and its standard error is the smallest of the three procedures.
Task 2: Comparisons of Sampling Procedures x5
Sample Procedure                              Sample Size   Mean Estimate   Standard Error   Lower CI    Upper CI    True Mean Within CI?   Abs Diff From True Mean
Simple Random                                 1847          91494.1         2511.972         88982.13    94006.07    Yes                    2397.6
Simple Random                                 1847          95733.22        2562.7           93170.52    98295.92    Yes                    1841.52
Simple Random                                 1847          94591.21        2575.722         92015.49    97166.93    Yes                    699.51
Simple Random                                 1847          93210.51        2568.106         90642.4     95778.62    Yes                    681.19
Simple Random                                 1847          96893.45        2609.723         94283.73    99503.17    No                     3001.75
Proportional Allocation                       1847          98122.43        2600.976         95521.45    100723.4    No                     4230.73
Proportional Allocation                       1847          92762.59        2533.739         90228.85    95296.33    Yes                    1129.11
Proportional Allocation                       1847          95200.77        2577.481         92623.29    97778.25    Yes                    1309.07
Proportional Allocation                       1847          91645.58        2441.759         89203.82    94087.34    Yes                    2246.12
Proportional Allocation                       1847          96482.08        2465.507         94016.57    98947.59    No                     2590.38
Proportional Allocation After Design Effect   1893          97117.92        2547.008         94570.91    99664.93    No                     3226.22
Proportional Allocation After Design Effect   1893          92751.74        2451.187         90300.55    95202.93    Yes                    1139.96
Proportional Allocation After Design Effect   1893          91844.2         2463.036         89381.16    94307.24    Yes                    2047.5
Proportional Allocation After Design Effect   1893          93422.21        2506.443         90915.77    95928.65    Yes                    469.49
Proportional Allocation After Design Effect   1893          95806.36        2549.604         93256.76    98355.96    Yes                    1914.66
Figure 6a: Comparisons of Mean Estimate for Task 2 Sampling Procedures for True Mean of 93891.7
After running the Task 1 sampling procedures for 5 independent seeds, every sampling procedure has at least one replicate in which the true mean is not within the 95% confidence interval; proportional allocation has two (Figure 6a).
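The replicates per procedure can be generated by looping over seeds and calling the estimation functions defined in the appendix; a minimal sketch in R for the simple random sample, where the seed values are illustrative only (the report does not state which seeds were used for Task 2):
#sketch: repeat the SRS estimate over several seeds and collect the results
seeds <- 1:5
srsRuns <- t(sapply(seeds, function(s) {
  est <- SrsMeanEstimate(Seed = s, SampSize = 1847, printOutput = FALSE)
  c(seed = s, mean = est[[1]], se = est[[2]], lowerCI = est[[3]], upperCI = est[[4]])
}))
srsRuns
colMeans(srsRuns[, -1])   #average mean estimate, SE, and CI bounds over the runs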
Average by Sample Procedure                   Sample Size   Mean Estimate   Standard Error   Lower CI    Upper CI    % True Mean Within CI   Abs Diff From True Mean
Simple Random                                 1847          94384.5         2565.645         91818.86    96950.15    80%                     492.8
Proportional Allocation                       1847          94842.69        2523.892         92318.8     97366.58    60%                     950.99
Proportional Allocation After Design Effect   1893          94188.49        2503.456         91685.03    96691.95    80%                     296.79
Figure 6b: Average Comparisons of Mean Estimate for Task 2 Sampling Procedures for True Mean of
93891.7
Averaging the replicates by procedure for sample size 1,847, the simple random sample performed better than proportional allocation, with an average estimated mean closer to the true mean. Averaging the proportional allocation replicates after the design effect adjustment for sample size 1,893, the proportional allocation mean estimate is closer to the true mean than that of the simple random sample, and its average standard error is the smallest of the three procedures. Proportional allocation after the design effect adjustment is therefore the best of the methods considered for estimating the mean views, roughly 94,188, needed for a video to get on YouTube Trending.
References
[1] "Trending YouTube Video Statistics: Daily statistics for trending YouTube videos," Kaggle, 2018.
[Online]. Available: https://www.kaggle.com/datasnaek/youtube-new [Accessed 23-Mar-2018]
[2] "Trending on YouTube," Google, 2018. [Online]. Available:
https://support.google.com/youtube/answer/7239739 [Accessed 23-Mar-2018]
Appendix: R Code
cat("014")
options(warn=1)
require(survey)
require(dplyr)
require(lattice)
#read in, remove columns, from https://www.kaggle.com/datasnaek/youtube-new/data
setwd("C:/Users/Yao/Desktop/you")
youtubeRawUS <- read.csv(file="USvideos.csv", header=TRUE, sep=",")
youtubeRawUS$country <- rep("US",nrow(youtubeRawUS))
youtubeRawUS <- youtubeRawUS[-c(3:4,7,12:16)]
youtubeRawCA <- read.csv(file="CAvideos.csv", header=TRUE, sep=",")
youtubeRawCA$country <- rep("CA",nrow(youtubeRawCA))
youtubeRawCA <- youtubeRawCA[-c(3:4,7,12:16)]
youtubeRawDE <- read.csv(file="DEvideos.csv", header=TRUE, sep=",")
youtubeRawDE$country <- rep("DE",nrow(youtubeRawDE))
youtubeRawDE <- youtubeRawDE[-c(3:4,7,12:16)]
youtubeRawFR <- read.csv(file="FRvideos.csv", header=TRUE, sep=",")
youtubeRawFR$country <- rep("FR",nrow(youtubeRawFR))
youtubeRawFR <- youtubeRawFR[-c(3:4,7,12:16)]
youtubeRawGB <- read.csv(file="GBvideos.csv", header=TRUE, sep=",")
youtubeRawGB$country <- rep("GB",nrow(youtubeRawGB))
youtubeRawGB <- youtubeRawGB[-c(3:4,7,12:16)]
youtubeRaw<- rbind(youtubeRawUS, youtubeRawCA)
youtubeRaw<- rbind(youtubeRaw, youtubeRawDE)
youtubeRaw<- rbind(youtubeRaw, youtubeRawFR)
youtubeRaw<- rbind(youtubeRaw, youtubeRawGB)
head(youtubeRaw)
youtubeRaw2 <- youtubeRaw
#remove duplicate video_ids because a video can appear on trending multiple times; keep the row with the fewest views (the views needed to first get on trending)
youtubeRaw2 = youtubeRaw2[order(youtubeRaw2[,'video_id'],youtubeRaw2[,'views']),]
youtubeRaw2 = youtubeRaw2[!duplicated(youtubeRaw2$video_id),]
head(youtubeRaw2)
write.csv(youtubeRaw2, file = "youtubeRaw2.csv")
#replace strata categories into real names
youtubeRaw2$category_id2[youtubeRaw2$category_id == '1'] <- 'Film & Animation'
youtubeRaw2$category_id2[youtubeRaw2$category_id == '2'] <- 'Autos & Vehicles'
youtubeRaw2$category_id2[youtubeRaw2$category_id == '10'] <- 'Music'
youtubeRaw2$category_id2[youtubeRaw2$category_id == '15'] <- 'Pets & Animals'
youtubeRaw2$category_id2[youtubeRaw2$category_id == '17'] <- 'Sports'
youtubeRaw2$category_id2[youtubeRaw2$category_id == '19'] <- 'Travel & Events'
youtubeRaw2$category_id2[youtubeRaw2$category_id == '20'] <- 'Gaming'
youtubeRaw2$category_id2[youtubeRaw2$category_id == '22'] <- 'People & Blogs'
youtubeRaw2$category_id2[youtubeRaw2$category_id == '23'] <- 'Comedy'
youtubeRaw2$category_id2[youtubeRaw2$category_id == '24'] <- 'Entertainment'
youtubeRaw2$category_id2[youtubeRaw2$category_id == '25'] <- 'News & Politics'
youtubeRaw2$category_id2[youtubeRaw2$category_id == '26'] <- 'Howto & Style'
youtubeRaw2$category_id2[youtubeRaw2$category_id == '27'] <- 'Education'
youtubeRaw2$category_id2[youtubeRaw2$category_id == '28'] <- 'Science & Technology'
youtubeRaw2$category_id2[youtubeRaw2$category_id == '29'] <- 'Nonprofits & Activism'
youtubeRaw2$category_id2[youtubeRaw2$category_id == '43'] <- 'Science & Technology'
youtubeRaw2$category_id[youtubeRaw2$category_id == '43'] <- '28'
#sanity check, remove column names
head(youtubeRaw2)
rownames(youtubeRaw2) <- c()
youtubeRaw3 <- youtubeRaw2[order(youtubeRaw2$category_id2),]
head(youtubeRaw3)
write.csv(youtubeRaw3, file = "youtubeRaw3.csv", row.names=FALSE)
#use sas for more youtubeRaw3.csv dataset stats
#check initial distribution and number of rows
boxplot(youtubeRaw3$views, main="Uncleaned Boxplot Distribution of Videos Views", xlab="Trending Videos", ylab="Number of Views")
nrow(youtubeRaw3)
bwplot(views ~ category_id2, data = youtubeRaw3, scales=list(x=list(rot=45)), main="Uncleaned Boxplot Distribution of Videos Views", xlab="Trending Videos Categories", ylab="Number of Views")
#solve for view count without cleaning data
sum(as.numeric(youtubeRaw3$views))
max(youtubeRaw3$views)
min(youtubeRaw3$views)
mean(youtubeRaw3$views)
#remove outliers more than 1.5 quant, save into new clean dataset
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
youtubeClean <- youtubeRaw3
youtubeClean$views <- remove_outliers(youtubeRaw3$views)
#check new distribution, only keep data within distribution
boxplot(youtubeClean$views, main="Cleaned Boxplot Distribution of Videos Views", xlab="Trending Videos", ylab="Number of Views")
bwplot(views ~ category_id2, data = youtubeClean, scales=list(x=list(rot=45)), main="Cleaned Boxplot Distribution of Videos Views", xlab="Trending Videos Categories", ylab="Number of Views")
youtubeClean2 <- youtubeClean[complete.cases(youtubeClean), ]
write.csv(youtubeClean2, file = "youtubeClean2.csv", row.names=FALSE)
#solve for actual target average view count and number of rows for strata
sum(as.numeric(youtubeClean2$views))
nrow(youtubeClean2)
max(youtubeClean2$views)
min(youtubeClean2$views)
mean(youtubeClean2$views)
sd(youtubeClean2$views)
#average amount of views for the outliers removed
(sum(as.numeric(youtubeRaw3$views)) - sum(as.numeric(youtubeClean2$views))) /
(nrow(youtubeRaw3) - nrow(youtubeClean2))
max(youtubeRaw3$views)-max(youtubeClean2$views)
min(youtubeRaw3$views)-min(youtubeClean2$views)
mean(youtubeRaw3$views)-mean(youtubeClean2$views)
nrow(youtubeClean2)/nrow(youtubeRaw3)
#use sas for more youtubeClean2.csv dataset stats
#how do we choose MOE? currently, MOE = 5000 views
#do we remove outliers per stratum or for the whole dataset? currently, we remove for the whole dataset
#after removing outliers, ~89% of the dataset is kept; do we ignore the fpc adjustment? currently we ignore it because the sampling fraction n/N is well under 10%
n0srs <- ceiling((1.96^2*sd(youtubeClean2$views)^2)/(5000^2))
n0srs
#SRS function
SrsMeanEstimate<-function(Seed, SampSize, printOutput= TRUE){
set.seed(Seed)
youtubeClean2.SRSSampled = sample_n(youtubeClean2,SampSize)
if(printOutput == TRUE){
print(nrow(youtubeClean2.SRSSampled))
print(bwplot(views ~ category_id2, data = youtubeClean2.SRSSampled, scales=list(x=list(rot=45)), main="SRS Boxplot Distribution of Videos Views", xlab="Trending Videos Categories", ylab="Number of Views"))
}
mydesign <- svydesign(id = ~1, data = youtubeClean2.SRSSampled)
srsMean = svymean(~views, design = mydesign)
srsSE = SE(srsMean)
srsCI = confint(srsMean)
rm(youtubeClean2.SRSSampled)
rm(mydesign)
return(list(as.numeric(srsMean[1]),
as.numeric(srsSE),
as.numeric(srsCI[1]),
as.numeric(srsCI[2])))
}
srsMean <- SrsMeanEstimate(n0srs, n0srs)
print(paste('The Mean Estimate =', srsMean[[1]]))
print(paste('The Standard Error =', srsMean[[2]]))
mean(youtubeClean2$views)
#Proportional Strata
PropMeanEstimate<-function(Seed, SampSize, printOutput= TRUE){
set.seed(Seed)
# Identify Frequency of category_id2 Stratum
PropFreq <- as.data.frame(table(youtubeClean2[,c("category_id2")]))
names(PropFreq)[1] = 'category_id2'
PropFreq
PropFreq$N = nrow(youtubeClean2)
PropFreq$p = PropFreq$Freq/PropFreq$N
PropFreq$SampSizeh = (PropFreq$p * SampSize)
PropFreq$SampSizehRounded = round(PropFreq$SampSizeh)
youtubeClean2.PropSampled <- NULL
for (i in 1:nrow(PropFreq)){
youtubeClean2.PropSampled<-rbind(youtubeClean2.PropSampled,
sample_n(youtubeClean2[(youtubeClean2$category_id2 ==
PropFreq[i,"category_id2"]),]
,PropFreq[i,"SampSizehRounded"]))