1. Modeling Techniques in Predictive Analytics:
Business Problems and Solutions with R
TEXT ANALYTICS
2. Objective of Case Study
To analyze trends in movies released over the years, and how they differ from decade to decade, using
text analytics tools and methods.
3. Methodology
We have data on movies released over the last 100 years in a single file. We read all of
the text in that file and store it as a text corpus. We then format the text and keep only
the information relevant to our analysis. We use the R programming language for the
statistical analysis.
The Internet Movie Database (IMDb.com) is a good source of information about movies
and is freely available on the Internet. We downloaded the information as a text file.
For our example, we chose a smaller text file from IMDb, the tagline file.
Text analytics, like predictive analytics, is a numbers game, but with words rather than
numbers as the raw input. We turn words into numbers for analysis.
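As a rough illustration of turning words into numbers, the sketch below builds a small document-by-term count matrix in base R. The taglines are invented for the example, and the `tokenize` helper is a hypothetical name, not something taken from the case study; the tm package listed later offers its own tools for this step.

```r
# Hypothetical taglines, invented for illustration (not from the IMDb file).
taglines <- c(
  "A love story for the ages",
  "The truth is out there",
  "One life, one love, one story"
)

# Hypothetical helper: lower-case, replace non-letters with spaces, split on spaces.
tokenize <- function(text) {
  cleaned <- gsub("[^a-z ]", " ", tolower(text))
  words <- strsplit(cleaned, " +")[[1]]
  words[nzchar(words)]
}

tokens <- lapply(taglines, tokenize)
vocabulary <- sort(unique(unlist(tokens)))

# Document-by-term matrix: one row per tagline, one column per word.
doc_term <- t(vapply(tokens, function(words) {
  as.integer(table(factor(words, levels = vocabulary)))
}, integer(length(vocabulary))))
dimnames(doc_term) <- list(paste0("tagline", seq_along(taglines)), vocabulary)

print(doc_term)
```

Each cell holds how often a word occurs in a tagline, which is the kind of numeric input that later steps such as clustering can work with.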
4. Data Preprocessing
This is how the unstructured text file looks.
We must process and clean the text before we can understand what it says.
We use this formatting to parse the tagline file for entry into a text database.
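A minimal parsing sketch follows. It assumes a layout in which each movie starts with a header line of the form `# Title (Year)` followed by tab-indented tagline lines; that layout, the sample lines, and the `parse_taglines` helper are assumptions made for illustration, so check them against the actual file before reuse.

```r
# Hypothetical excerpt in the assumed layout: a "# Title (Year)" header line,
# then one or more tab-indented taglines.
raw_lines <- c(
  "# Example Movie (1999)",
  "\tOne night changes everything.",
  "\tNothing stays buried.",
  "",
  "# Another Example (2005)",
  "\tThe truth has a price."
)

# Hypothetical helper: returns one row per tagline with its title and year.
parse_taglines <- function(lines) {
  is_header <- grepl("^# ", lines)
  header_index <- cumsum(is_header)  # which header each line falls under
  keep <- grepl("^\t", lines) & header_index > 0

  headers <- lines[is_header]
  pattern <- "^# (.*) \\(([0-9]{4})[^)]*\\).*$"
  title <- sub(pattern, "\\1", headers)
  # Headers without a four-digit year become NA (with a coercion warning).
  year <- as.integer(sub(pattern, "\\2", headers))

  data.frame(
    title = title[header_index[keep]],
    year = year[header_index[keep]],
    tagline = sub("^\t", "", lines[keep]),
    stringsAsFactors = FALSE
  )
}

movies <- parse_taglines(raw_lines)
print(movies)
```

Carrying the header index forward with `cumsum` avoids an explicit loop and keeps a movie's several taglines attached to the same title and year.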
This is how the structured data looks:
Packages Used
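library(tm)            # text mining: corpora and term-document matrices
library(stringr)       # string manipulation
library(grid)          # low-level graphics layout
library(ggplot2)       # plotting
library(latticeExtra)  # lattice extensions, including horizon plots
library(cluster)       # cluster analysis
library(proxy)         # distance and similarity measures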
5. Visualization using HISTOGRAM
To determine the range of years to consider in our study, we look at the distribution of release
dates in the movie taglines data. The histogram shows more than one hundred movies a year
from the mid-1970s through 2013 and more than one thousand movies a year from 2003 to 2013.
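The year-selection step can be sketched as below. The release years are simulated, and the threshold `min_movies` is a hypothetical parameter chosen for the example; neither comes from the tagline data.

```r
set.seed(1)

# Hypothetical release years, simulated for illustration only.
year_levels <- 1914:2013
release_year <- sample(year_levels, size = 5000, replace = TRUE,
                       prob = seq_along(year_levels)^3)

# Count movies per year; factor() keeps years that have no movies at all.
movies_per_year <- as.integer(table(factor(release_year, levels = year_levels)))
names(movies_per_year) <- year_levels

# Keep only the years that reach a minimum count (assumed threshold).
min_movies <- 100
years_kept <- year_levels[movies_per_year >= min_movies]

# One-year bins centred on each year; draw only in an interactive session.
h <- hist(release_year, breaks = seq(1913.5, 2013.5, by = 1), plot = FALSE)
if (interactive()) plot(h, main = "Movies per year", xlab = "Release year")

range(years_kept)
```

Reading the counts alongside the histogram makes the cut-off explicit, so the choice of study period can be reproduced rather than judged by eye.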
6. Understanding the Trend Using Plot
We use a horizon plot to visualize text measures over time.
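One way to picture a text measure in time is as a yearly score: the share of tagline words that fall in a given word list, averaged by year. The sketch below assumes that definition; the data frame, the word list, and the `score_tagline` helper are hypothetical, and the case study may compute its measures differently.

```r
# Hypothetical structured data: one row per tagline with its release year.
movies <- data.frame(
  year = c(2001, 2001, 2002, 2002, 2003),
  tagline = c("A love story", "The truth hurts", "Love never dies",
              "A true story of survival", "Based on the truth"),
  stringsAsFactors = FALSE
)

# Hypothetical word list standing in for one text measure.
measure_words <- c("truth", "true")

# Hypothetical helper: share of a tagline's words that are in the word list.
score_tagline <- function(text, word_list) {
  words <- strsplit(gsub("[^a-z ]", " ", tolower(text)), " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(0)
  mean(words %in% word_list)
}

movies$score <- vapply(movies$tagline, score_tagline, numeric(1),
                       word_list = measure_words, USE.NAMES = FALSE)

# Average by year; ts() assumes the years are consecutive with no gaps.
yearly <- aggregate(score ~ year, data = movies, FUN = mean)
measure_ts <- ts(yearly$score, start = min(yearly$year), frequency = 1)
print(measure_ts)

# Draw a horizon plot only when a display and latticeExtra are available.
if (interactive() && requireNamespace("latticeExtra", quietly = TRUE)) {
  print(latticeExtra::horizonplot(measure_ts))
}
```

With several measures, the yearly series could be combined column-wise so that each measure gets its own band in the plot.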
We identify five common groups, or clusters, of words, which define the text measures that we call
LOVED, WORLDS, TRUTH, LIFE, and STORY.
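Grouping words into clusters can be sketched with base R's `dist`, `hclust`, and `cutree`, used here as a stand-in because the source does not say which clustering routine it relied on. The term-by-document counts are invented, and two clusters are cut instead of five only because the toy vocabulary is so small.

```r
# Hypothetical term-by-document counts: rows are words, columns are taglines.
term_doc <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 0,
    0, 0, 1, 1,
    0, 0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("love", "heart", "truth", "lies"), paste0("doc", 1:4))
)

# Binary distance: words that appear in the same taglines are close together.
word_dist <- dist(term_doc, method = "binary")
word_tree <- hclust(word_dist, method = "average")

# Cut the tree into k groups of words.
word_cluster <- cutree(word_tree, k = 2)
print(split(names(word_cluster), word_cluster))
```

Each resulting group of words can then be given a label and used as the word list behind one text measure.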
7. INTERPRETATION/EVALUATION
Story-based movies have been produced in greater numbers, with fluctuations.
Autobiographical movies have been produced in greater numbers since 2000, with a prominent increase
from 2000 to 2010.
Non-fiction movies have been produced in greater numbers, with a prominent increase from 1998 to 2010.
Movies about natural geography, wildlife, and the world have been produced in greater numbers, with
fluctuations: somewhat up and down until 1980, up in 1989, down until 2002, and increasing in 2010.
Movies with a love story as their subject have fluctuated considerably: up until 1979, down in 1980-81,
high in 1982, low until 1986, then high, mostly low until 2003, high until 2005, then low, with a high
trend in 2010.
8. CONCLUSION
Based on our analysis, the current trend in movies is toward non-fiction.
As movie production is directly proportional to revenue, producers would do well to invest in
non-fiction movies: the “Truth” category is considerably higher, and overall movie production has also
increased since 2000.
According to our analysis, producers can earn higher revenue if they make non-fiction movies.