data science and analytics in computer science

NADAR SARASWATHI COLLEGE OF ARTS AND SCIENCE
DATA SCIENCE & ANALYTICS
A.Uthra Devi
II M.Sc Computer Science

COLLECT RAW TEXT
➢ The data science team investigates the problem,
understands the necessary data sources, and formulates
initial hypotheses.
➢ Data must be collected before anything can happen.
➢ The data science team starts by actively monitoring
various websites for user-generated content. The
content being collected could be related articles from
news portals and blogs, comments on ACME’s products
from online shops or review sites, or social media
posts that contain the keywords bPhone or bEbook.

➢ Regardless of the data source, the team would deal
with semi-structured data such as HTML web pages,
Really Simple Syndication (RSS) feeds, XML, or
JavaScript Object Notation (JSON) files.
➢ Enough structure needs to be imposed to find the part
of the raw text that the team really cares about.
➢ Many news portals and blogs provide data feeds in an
open standard format, such as RSS or XML.
➢ Regular expressions can find words and strings that
match particular patterns in the text effectively and
efficiently.
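
The collection step can be sketched in Python as below. This is a minimal illustration, not the team's actual pipeline: the RSS snippet, the function name matching_items, and the spelling of the product keywords (bPhone, bEbook) are assumptions made for the example. It parses a feed with the standard library and uses a regular expression to keep only the items that mention a keyword.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical RSS snippet standing in for a real news or blog feed.
RSS_SAMPLE = """<rss version="2.0"><channel>
<title>Example Tech Blog</title>
<item><title>First look at the bPhone</title>
<description>The new bPhone ships next month.</description></item>
<item><title>Weekly roundup</title>
<description>Nothing about ACME this week.</description></item>
<item><title>Reading on the go</title>
<description>We compare the bEbook with two rivals.</description></item>
</channel></rss>"""

# Case-insensitive pattern for the two assumed product keywords,
# matched as whole words only.
KEYWORDS = re.compile(r"\b(bphone|bebook)\b", re.IGNORECASE)

def matching_items(rss_text):
    """Return (title, matched keywords) for feed items that mention a keyword."""
    root = ET.fromstring(rss_text)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        body = item.findtext("description", default="")
        found = KEYWORDS.findall(title + " " + body)
        if found:
            # Deduplicate and lowercase the keywords found in this item.
            results.append((title, sorted({k.lower() for k in found})))
    return results

hits = matching_items(RSS_SAMPLE)
```

Here the XML parser imposes the structure (items, titles, descriptions), and the regular expression then searches only the parts of the text the team cares about.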

REPRESENT TEXT
➢ In this data representation step, raw text is first
transformed with text normalization techniques:
• Tokenization
• Case folding
Tokenization, or tokenizing, is the task of separating
words from the body of text.
After tokenization, raw text is converted into a
collection of tokens, where each token is generally a word.
A common approach is tokenizing on spaces, for
example with the tweet shown previously.

• TOKENIZATION:
A common approach is tokenizing on spaces.
Example: text analysis sometimes called text analytics
Another way is to tokenize the text based on
punctuation marks and spaces.
Example: “Data science and big data analytics” has
become well accepted across academia and the
industry.
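
The two tokenization approaches can be compared with a short Python sketch. The function names and the sample sentence are illustrative assumptions; the second function treats any run of letters, digits, or underscores as a token, which is only one of several possible punctuation rules.

```python
import re

def tokenize_on_spaces(text):
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()

def tokenize_on_punctuation_and_spaces(text):
    """Keep only runs of letters, digits, and underscores as tokens."""
    return re.findall(r"\w+", text)

sentence = "Text analysis, sometimes called text analytics."

space_tokens = tokenize_on_spaces(sentence)
# ['Text', 'analysis,', 'sometimes', 'called', 'text', 'analytics.']

punct_tokens = tokenize_on_punctuation_and_spaces(sentence)
# ['Text', 'analysis', 'sometimes', 'called', 'text', 'analytics']
```

Splitting on spaces leaves "analysis," and "analytics." with punctuation attached, while splitting on punctuation as well produces clean words but would also break a contraction such as "we'll" into "we" and "ll".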

• CASE FOLDING:
It reduces all letters to lowercase.
Example: Text Analysis Sometimes Called Text
Analytics = text analysis sometimes called text
analytics.
One needs to be cautious when applying case folding
to tasks such as information extraction, sentiment
analysis, and machine translation.
If case folding must be applied, one way to
reduce problems is to create a lookup table of words
not to be case folded.
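
A lookup table of protected words might look like the following sketch. The table contents and the name fold_case are hypothetical; in practice the list would be built for the task at hand.

```python
# Hypothetical lookup table of tokens whose capitalization carries meaning.
KEEP_CASE = {"ACME", "WHO", "US"}

def fold_case(tokens, keep=KEEP_CASE):
    """Lowercase every token except those listed in the lookup table."""
    return [t if t in keep else t.lower() for t in tokens]

folded = fold_case(["The", "WHO", "praised", "ACME", "Phones"])
# ['the', 'WHO', 'praised', 'ACME', 'phones']
```

Without the table, "WHO" (the organization) would become "who" and be indistinguishable from the pronoun.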

TF-IDF
TF-IDF stands for Term Frequency-Inverse Document
Frequency.
It measures how relevant a word is to a document
within a collection, or corpus, of documents.
➢ The relevance increases in proportion to the number
of times the word appears in the document, but is offset
by how frequently the word appears across the corpus
(dataset).
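
A minimal Python sketch of this weighting is shown below. It assumes TF is the raw count of a term in a document and IDF is log(N / DF), where N is the number of documents and DF the number of documents containing the term; the slides do not fix a formula, and other variants are common. The function name tf_idf and the toy corpus are made up for the example.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: score} dict per document.

    Assumes TF is the raw count of a term in a document and
    IDF is log(N / DF); other weightings are common.
    """
    n_docs = len(corpus)
    # DF: number of documents that contain each term at least once.
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

# Toy corpus: each document is already tokenized and case folded.
docs = [
    ["the", "bphone", "screen", "is", "great"],
    ["the", "bebook", "battery", "is", "weak"],
    ["the", "bphone", "battery", "is", "great"],
]
weights = tf_idf(docs)
```

In this toy corpus, "the" appears in every document and so scores 0 everywhere, while "bebook" appears in only one document and receives the highest weight there.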

IDF
➢ Besides stop words, words that are more general in
meaning tend to appear more often, thus having
higher term frequencies.
➢ An additional variable, the inverse document
frequency, should reduce the effect of the term
frequency as the term appears in more documents.
• The highest corpus-wide term frequencies (TF)
• The highest document frequencies (DF)
• The highest inverse document frequencies (IDF)
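
One common way to write these quantities is given below. The logarithmic form of IDF is an assumption, since the slides do not state a formula; here D is the corpus, N is the number of documents in it, t is a term, and d is a document.

```latex
\mathrm{DF}(t) = \left|\{\, d \in D : t \in d \,\}\right|, \qquad
\mathrm{IDF}(t) = \log\frac{N}{\mathrm{DF}(t)}, \qquad
\text{TF-IDF}(t, d) = \mathrm{TF}(t, d)\cdot\mathrm{IDF}(t)
```

Under this form, a term that appears in every document has IDF = log(N/N) = 0, while a term that appears in only 1 of 1000 documents has IDF = ln 1000, roughly 6.9 with the natural logarithm, so rare terms are weighted up and general terms are weighted down.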
