DATA SCIENCE
IN BIG DATA
UNIT - 1
SYLLABUS
UNIT I - INTRODUCTION TO DATA SCIENCE AND BIG DATA
Data Science - Fundamentals and Components – Data Scientist – Terminologies Used in Big Data Environments -
Types of Digital Data - Classification of Digital Data - Introduction to Big Data - Characteristics of Data - Evolution of Big
Data - Big Data Analytics - Classification of Analytics.
UNIT II - DESCRIPTIVE ANALYTICS USING STATISTICS
Types of Data – Mean, Median and Mode – Standard Deviation and Variance – Probability – Probability Density
Function – Types of Data Distribution – Percentiles and Moments – Correlation and Covariance – Conditional Probability –
Bayes Theorem – Introduction to Univariate, Bivariate and Multivariate Analysis.
UNIT III - PREDICTIVE MODELING AND MACHINE LEARNING
Linear Regression – Polynomial Regression – Multivariate Regression – Multi Level Models – Data warehousing
overview – Bias / variance trade off – K Fold cross validation – Data Cleaning and Normalization – Cleaning web log Data –
Normalizing numerical Data – Detecting Outliers – Introduction to Supervised and Unsupervised learning.
SYLLABUS
UNIT IV - DATA ANALYTICAL FRAMEWORKS
Introducing Hadoop: - Hadoop Overview - RDBMS versus Hadoop - HDFS (Hadoop Distributed File
System): Components and block replication – Processing Data with Hadoop - Introduction to MapReduce –
Features of MapReduce – Introduction to NoSQL: CAP theorem, MongoDB.
UNIT V - DATA SCIENCE USING PYTHON
Introduction to essential data science packages: NumPy, SciPy, Jupyter, Statsmodels and pandas Package –
Introduction to Data Munging, Data pipeline and Machine learning in Python - Data visualization using matplotlib –
Interactive visualization with advanced data learning representation in Python.
Data Science – Definition
Data Science is the discipline that combines computer science, statistics, machine learning,
visualization, and human-computer interaction to collect, clean, integrate, analyze, visualize,
and interact with data in order to create data products.
Goal of Data Science - Turn data into data products.
How is data science related to big data? Data Science is a field of study that blends Computer
Science, Business, and Statistics. Big Data is a set of techniques for collecting, maintaining, and
processing huge volumes of information; it is concerned with the collection, processing, analysis,
and use of data in various operations.
Data Science Process
1. Discovery:
The discovery step involves acquiring data from all identified internal and external sources. The
data can be: logs from web servers, data gathered from social media, census datasets, or data
streamed from online sources using APIs.
2. Preparation:
Data can have many inconsistencies, such as missing values, blank columns, and incorrect data
formats, which need to be cleaned. You need to process, explore, and condition data before
modeling. The cleaner your data, the better your predictions.
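A minimal pandas sketch of this cleaning step (the file name and column names are hypothetical, for illustration only):

```python
import pandas as pd

# Hypothetical raw export with missing values and inconsistent formats.
df = pd.read_csv("web_logs.csv")

# Drop columns that are entirely blank.
df = df.dropna(axis=1, how="all")

# Fill missing numeric values with each column's median.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Coerce an inconsistently formatted date column; unparseable rows become NaT.
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
df = df.dropna(subset=["timestamp"])
```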
3. Model Planning:
In this stage, you need to determine the methods and techniques for drawing relationships between
input variables. Planning for a model is performed using different statistical formulas and
visualization tools. SQL Analysis Services, R, and SAS/ACCESS are some of the tools used for this purpose.
Data Science Process
4. Model Building:
In this step, the actual model building process starts. Here, the data scientist splits the dataset
into training and testing sets. Techniques like association, classification, and clustering are
applied to the training data set. Once prepared, the model is tested against the “testing” dataset.
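A minimal scikit-learn sketch of this split-train-test cycle, using a toy dataset in place of the prepared data from the earlier steps:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset stands in for data produced by the discovery/preparation steps.
X, y = load_iris(return_X_y=True)

# Split the dataset into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit a classification model on the training set only.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Evaluate the prepared model against the held-out "testing" dataset.
print("test accuracy:", model.score(X_test, y_test))
```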
5. Operationalize:
In this stage, you deliver the final baselined model with reports, code, and technical documents.
The model is deployed into a real-time production environment after thorough testing.
6. Communicate Results:
In this stage, the key findings are communicated to all stakeholders. This helps you decide whether
the results of the project are a success or a failure based on the inputs from the model.
DATA SCIENCE COMPONENTS
Statistics:
Statistics is the most critical unit of Data Science basics. It is the method of collecting and
analyzing numerical data in large quantities to get useful insights.
Visualization:
Visualization techniques help you present huge amounts of data as easy-to-understand,
digestible visuals.
Data Scientist:
Role: A Data Scientist is a professional who manages enormous amounts of data
to come up with compelling business insights by using various tools, techniques,
methodologies, algorithms, etc.
Languages: R, SAS, Python, SQL, Hive, MATLAB, Pig, Spark.
A data scientist’s work typically involves making sense of messy, unstructured
data, from sources such as smart devices, social media feeds, and emails that
don’t neatly fit into a database.
Data scientists are analytical experts who utilize their skills in both technology
and social science to find trends and manage data.
Terminologies Used in Big Data Environment
5 V’s of Big Data:
•Volume – a large amount of data.
•Velocity – the speed of data processing.
•Variety – large data diversity.
•Veracity – the trustworthiness and quality of data.
•Value – what big data can bring to the user.
Terminologies Used in Big Data Environment
VOLUME:
The name ‘Big Data’ itself relates to enormous size. Volume refers to the huge amount of data.
The size of data plays a crucial role in determining its value: whether particular data can
actually be considered Big Data depends largely on its volume.
Hence, ‘Volume’ is a necessary characteristic to consider when dealing with Big Data.
Terminologies Used in Big Data Environment
VELOCITY:
Velocity refers to the high speed at which data accumulates.
In Big Data, data flows in at velocity from sources like machines, networks, social media, mobile
phones, etc.
There is a massive and continuous flow of data. Velocity determines the potential of data: how
fast the data is generated and processed to meet demands.
Sampling data can help in dealing with issues like ‘velocity’.
Terminologies Used in Big Data Environment
VARIETY:
Variety refers to the nature of data: structured, semi-structured, and unstructured.
It also refers to heterogeneous sources.
Variety is basically the arrival of data from new sources both inside and outside an
enterprise. It can be structured, semi-structured, or unstructured.
Terminologies Used in Big Data Environment
◦ Structured data: This is basically organized data. It generally refers to data whose length
and format are defined.
◦ Semi-structured data: This is basically semi-organised data. It is generally a form of data
that does not conform to the formal structure of data. Log files are examples of this type
of data.
◦ Unstructured data: This basically refers to unorganized data. It generally refers to data
that doesn’t fit neatly into the traditional row-and-column structure of a relational database.
Texts, pictures, videos, etc. are examples of unstructured data, which can’t be stored in the
form of rows and columns.
Terminologies Used in Big Data Environment
Veracity:
Veracity refers to inconsistencies and uncertainty in data; that is, available data can sometimes
get messy, and its quality and accuracy are difficult to control.
Big Data is also variable because of the multitude of data dimensions resulting from multiple
disparate data types and sources.
Example: Data in bulk could create confusion, whereas a smaller amount of data could convey only
half or incomplete information.
Terminologies Used in Big Data Environment
Value:
After taking the other four V’s into account, there comes one more V, which stands for Value. Bulk
data with no value is of no good to a company unless it is turned into something useful.
Data in itself is of no use or importance; it needs to be converted into something valuable in
order to extract information. Hence, Value is regarded as the most important of the 5 V’s.
Terminologies Used In Big Data Environments
◦ As-a-service infrastructure
Data-as-a-service, software-as-a-service, platform-as-a-service – all refer to the idea that rather
than selling data, licences to use data, or platforms for running Big Data technology as products,
they can be provided “as a service”. This reduces the upfront capital investment necessary for
customers to begin putting their data, or platforms, to work for them, as the provider bears all of
the costs of setting up and hosting the infrastructure. For a customer, as-a-service infrastructure
can greatly reduce the initial cost and setup time of getting Big Data initiatives up and running.
Data science
Data science is the professional field that deals with turning data into value such as
new insights or predictive models. It brings together expertise from fields including
statistics, mathematics, computer science, communication as well as domain expertise
such as business knowledge. Data scientist has recently been voted the No. 1 job in the
U.S., based on current demand, salary, and career opportunities.
Data mining
Data mining is the process of discovering insights from data. In terms of Big Data,
because it is so large, this is generally done by computational methods in an
automated way using methods such as decision trees, clustering analysis and, most
recently, machine learning. This can be thought of as using the brute mathematical
power of computers to spot patterns in data which would not be visible to the human
eye due to the complexity of the dataset.
Hadoop
Hadoop is a framework for Big Data computing which has been released into the public
domain as open source software, and so can be freely used by anyone. It consists of a
number of modules, each tailored for a different vital step of the Big Data process – from file
storage (the Hadoop Distributed File System, HDFS) to databases (HBase) to carrying out data
operations (Hadoop MapReduce – see below). It has become so popular due to its power and
flexibility that it has developed its own industry of retailers (selling tailored versions),
support service providers, and consultants.
Predictive modelling
At its simplest, this is predicting what will happen next based on data about what has
happened previously. In the Big Data age, because there is more data around than ever
before, predictions are becoming more and more accurate. Predictive modelling is a core
component of most Big Data initiatives, which are formulated to help us choose the course
of action which will lead to the most desirable outcome. The speed of modern computers
and the volume of data available mean that predictions can be made based on a huge
number of variables, allowing an ever-increasing number of variables to be assessed for the
probability that they will lead to success.
MapReduce
MapReduce is a computing procedure for working with large datasets, which was
devised due to the difficulty of reading and analysing really Big Data using conventional
computing methodologies. As its name suggests, it consists of two procedures – mapping
(sorting information into the format needed for analysis – e.g. sorting a list of people
according to their age) and reducing (performing an operation, such as checking the age of
everyone in the dataset to see who is over 21).
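A tiny Python sketch of the two procedures using the age example above (the data values are made up):

```python
from functools import reduce

# Hypothetical dataset: (name, age) pairs.
people = [("Asha", 34), ("Ben", 19), ("Chen", 25), ("Dana", 17)]

# Map step: sort the records into the format needed for analysis (by age).
mapped = sorted(people, key=lambda person: person[1])

# Reduce step: aggregate over the mapped records,
# e.g. count how many people are over 21.
over_21 = reduce(lambda count, person: count + (person[1] > 21), mapped, 0)
print(over_21)  # -> 2
```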
NoSQL
NoSQL refers to database formats designed to hold more than just data arranged into
tables, rows, and columns, as is the case in a conventional relational database. This
database format has proven very popular in Big Data applications because Big Data is
often messy and unstructured and does not easily fit into traditional database frameworks.
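A minimal PyMongo sketch (the database, collection, and document here are hypothetical, and a local MongoDB server is assumed to be running) showing how a NoSQL document holds nested, schema-free data; MongoDB itself is covered in Unit IV:

```python
from pymongo import MongoClient

# Connect to a locally running MongoDB instance (assumed available).
client = MongoClient("mongodb://localhost:27017/")
posts = client["demo_db"]["posts"]

# One document mixes scalar fields, an array, and a nested sub-document -
# no fixed table schema is declared up front.
posts.insert_one({
    "user": "asha",
    "text": "Big Data is messy",
    "tags": ["bigdata", "nosql"],
    "meta": {"likes": 12, "shared": True},
})
print(posts.find_one({"user": "asha"}))
```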
 Python
Python is a programming language which has become very popular in the Big Data space due to
its ability to work very well with large, unstructured datasets (see Part II for the difference
between structured and unstructured data). It is considered to be easier to learn for a data
science beginner than other languages such as R (see also Part II), and more flexible.
 R Programming
R is another programming language commonly used in Big Data, and can be thought of as more
specialised than Python, being geared towards statistics. Its strength lies in its powerful handling of
structured data. Like Python, it has an active community of users who are constantly expanding and
adding to its capabilities by creating new libraries and extensions.
Recommendation engine
A recommendation engine is basically an algorithm, or collection of algorithms, designed to
match an entity (for example, a customer) with something they are looking for.
Recommendation engines used by the likes of Netflix or Amazon heavily rely on Big Data
technology to gain an overview of their customers and, using predictive modelling, match them
with products to buy or content to consume. The economic incentives offered by
recommendation engines have been a driving force behind a lot of commercial Big Data
initiatives and developments over the last decade.
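A toy sketch of the matching idea, assuming a tiny made-up ratings matrix and cosine similarity between users (real engines are far more elaborate):

```python
import numpy as np

# Hypothetical user-item ratings (rows: users, columns: items; 0 = unrated).
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4]], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Recommend for user 0: find the most similar other user...
sims = [cosine(R[0], R[u]) for u in range(1, len(R))]
nearest = 1 + int(np.argmax(sims))

# ...and suggest items that user rated which user 0 has not rated yet.
suggestions = np.where((R[0] == 0) & (R[nearest] > 0))[0]
print("recommend item indices:", suggestions)  # -> [2]
```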
Real-time
Real-time means “as it happens” and in Big Data refers to a system or process which is able to
give data-driven insights based on what is happening at the present moment. Recent years
have seen a large push for the development of systems capable of processing and offering
insights in real-time (or near-real-time), and advances in computing power as well as
development of techniques such as machine learning have made it a reality in many
applications today.
Reporting
The crucial “last step” of many Big Data initiatives involves getting the right
information to the people who need it to make decisions, at the right time. When
this step is automated, analytics is applied to the insights themselves to ensure that
they are communicated in a way that they will be understood and easy to act on.
This will usually involve creating multiple reports based on the same data or insights
but each intended for a different audience (for example, in-depth technical analysis
for engineers, and an overview of the impact on the bottom line for C-level
executives).
Spark
Spark is another open source framework like Hadoop, but more recently developed
and more suited to handling cutting-edge Big Data tasks involving real-time analytics
and machine learning. Unlike Hadoop, it does not include its own filesystem, though
it is designed to work with Hadoop’s HDFS or a number of other options. However,
for certain data-related processes it is able to calculate at over 100 times the speed
of Hadoop, thanks to its in-memory processing capability. This means it is becoming
an increasingly popular choice for projects involving deep learning, neural networks
and other compute-intensive tasks.
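A minimal PySpark sketch in local mode (the data values are made up; a real job would typically read from HDFS or another store) illustrating the cached, in-memory style of computation described above:

```python
from pyspark.sql import SparkSession

# Start a local Spark session; real deployments run on a cluster.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 19), ("carol", 25)], ["name", "age"])

df.cache()                      # keep the dataset in memory across operations
df.filter(df.age > 21).show()   # this query runs against the cached copy

spark.stop()
```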
DIGITAL DATA:
Digital data can be classified into three forms: unstructured, semi-structured, and structured.
Today, data undoubtedly is an invaluable asset of any enterprise (big or small). Even
though professionals work with data all the time, the understanding, management and
analysis of data from heterogeneous sources remains a serious challenge.
• In this lecture, the various formats of digital data (structured, semi-structured and
unstructured data), data storage mechanism, data access methods, management of data,
the process of extracting desired information from data, challenges posed by various
formats of data, etc. will be explained.
• Data growth has seen exponential acceleration since the advent of the computer and
Internet.
TYPES OF DATA
Big data is divided into three different types: structured, unstructured, and semi-structured.
1. Structured:
Structured data is one of the types of big data. By structured data, we mean data that can
be processed, stored, and retrieved in a fixed format. It refers to highly organized information that
can be readily and seamlessly stored in, and accessed from, a database by simple search engine
algorithms. For instance, the employee table in a company database will be structured: the
employee details, job positions, salaries, etc., will be present in an organized manner.
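A minimal pandas sketch of such a structured employee table (the values are illustrative):

```python
import pandas as pd

# Fixed columns and types make the data easy to store, query, and retrieve.
employees = pd.DataFrame({
    "emp_id":   [101, 102, 103],
    "name":     ["Asha", "Ben", "Chen"],
    "position": ["Analyst", "Engineer", "Manager"],
    "salary":   [55000, 62000, 78000],
})

# A simple query over the fixed schema.
print(employees[employees["salary"] > 60000])
```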
TYPES OF DATA
2. Unstructured:
Unstructured data refers to data that lacks any specific form or structure whatsoever.
This makes it very difficult and time-consuming to process and analyze. Email is an example of
unstructured data. Structured and unstructured are two important types of big data.
Examples include memos, chat rooms, PowerPoint presentations, images, videos, letters, research
papers, white papers, the body of an email, etc.
3. Semi-structured:
Semi-structured data is the third type of big data. Semi-structured data pertains to data
containing both of the formats mentioned above, that is, structured and unstructured data. To be
precise, it refers to data that, although not classified under a particular repository (database),
contains vital information or tags that segregate individual elements within the data.
Examples include emails, XML, markup languages like HTML, etc. Metadata for this data is available
but is not sufficient.
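A short Python sketch of the idea: a semi-structured record (a hypothetical JSON document) carries self-describing tags that segregate its elements, yet no fixed schema is enforced:

```python
import json

# Tags name each element, but fields may vary freely from record to record.
record = json.loads("""
{
  "from": "alice@example.com",
  "subject": "Quarterly report",
  "attachments": [{"name": "q3.pdf", "size_kb": 212}]
}
""")
print(record["subject"], "-", record["attachments"][0]["name"])
```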
Characteristics of Unstructured Data
How to Store Unstructured Data?
UIMA
 UIMA (Unstructured Information Management Architecture) is an open-source platform from IBM which
integrates different kinds of analysis engines to provide a complete solution for knowledge discovery from
unstructured data.
 In UIMA, the analysis engines enable the integration and analysis of unstructured information and bridge the gap
between structured and unstructured data.
 UIMA stores information in a structured format. The structured resources can be mined, searched, and put to
other uses. The information obtained from structured sources is also used for subsequent analysis of
unstructured data.
 Various analysis engines analyze unstructured data in different ways, such as:
 – Breaking up documents into separate words.
 – Grouping and classifying according to taxonomy.
 – Detecting parts of speech, grammar, and synonyms.
 – Detecting events and times.
 – Detecting relationships between various elements.
 CAS (Content Addressable Storage): It stores data based on its metadata. It assigns a unique name to every
object stored in it.
Advantages of structured data (easy to work with)
• It is easy to work with structured data. The advantages are:
• Storage: Both defined and user-defined data types help with the storage of structured data.
• Scalability: Scalability is not generally an issue with an increase in data.
• Security: Ensuring security is easy.
• Update and Delete: Updating, deleting, etc. are easy due to the structured form.
• Transaction Properties: ACID (Atomicity, Consistency, Isolation, Durability) properties are supported.
TYPES OF DATA
4 Types of Data: Nominal, Ordinal, Discrete, Continuous
1. Nominal:
These are sets of values that don’t possess a natural ordering. Let’s understand this
with some examples. The color of a smartphone can be considered a nominal data type, as we
can’t compare one color with another.
It is not possible to state that ‘Red’ is greater than ‘Blue’. The gender of a person is
another example, where we can’t rank male, female, or others. Mobile phone category,
whether midrange, budget segment, or premium, is also a nominal data type.
TYPES OF DATA
2. Ordinal:
These types of values have a natural ordering while maintaining their class of
values. If we consider the sizes of a clothing brand, we can easily sort them according
to their name tag in the order small < medium < large. The grading system used while
marking candidates in a test can also be considered an ordinal data type, where A+ is
definitely better than a B grade.
These categories help us decide which encoding strategy can be applied to
which type of data, as the sketch below shows.
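A small pandas sketch of the encoding choices these categories suggest (the column values are made up): one-hot encoding for nominal data, ordered integer codes for ordinal data:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "Blue", "Red"],        # nominal: no natural order
    "size":  ["small", "large", "medium"],  # ordinal: small < medium < large
})

# Nominal -> one-hot encoding, which implies no ordering between values.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal -> integer codes that respect the natural order.
size_order = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
df["size_code"] = df["size"].astype(size_order).cat.codes

print(pd.concat([df, one_hot], axis=1))
```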
TYPES OF DATA
3. Discrete:
Numerical values that are integers or whole numbers are placed under this
category. The number of speakers in a phone, the number of cameras, the number of cores
in the processor, and the number of SIMs supported are some examples of the discrete
data type.
4. Continuous:
Fractional numbers are considered continuous values. These can take the form of
the operating frequency of the processors, the Android version of the phone, the Wi-Fi
frequency, the temperature of the cores, and so on.
What is Big Data?
According to Gartner, the definition of Big Data is:
“Big data is high-volume, high-velocity, and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and decision making.”
This definition clearly answers the “What is Big Data?” question – Big Data refers to complex and
large data sets that have to be processed and analyzed to uncover valuable information that can
benefit businesses and organizations.
Introduction to Big Data
Big data is a collection of massive and complex data sets, involving huge quantities of
data, data management capabilities, social media analytics, and real-time data.
Big data analytics is the process of examining large amounts of data. There exist large
amounts of heterogeneous digital data. Big data is about data volumes and large data sets measured
in terms of terabytes or petabytes. This phenomenon is called big data.
High volumes of data that traditional computing tools cannot process are being collected
daily. We refer to these high volumes of data as big data.
BIG DATA
The process of analyzing large volumes of diverse data sets using advanced analytic
techniques is referred to as Big Data Analytics.
These diverse data sets include structured, semi-structured, and unstructured data, from
different sources, and in different sizes from terabytes to zettabytes. We also reckon these as big
data.
Big Data is a term used for data sets whose size or type is beyond the capturing,
managing, and processing ability of traditional relational databases. The database required to
process big data should have the low latency that traditional databases don’t have.
Big data has one or more of the following characteristics: high volume, high velocity, and high
variety.
Classification of Analytics
Big data analytics is categorized into four subcategories:
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
Classification of Analytics
1. Descriptive Analytics:
Descriptive Analytics is considered a useful technique for
uncovering patterns within a certain segment of customers. It simplifies
data and summarizes past data into a readable form.
It provides insights into what has occurred in the past, along with
trends to dig into for more detail. This helps in creating reports on things like a
company’s revenue, profits, and sales.
Examples of descriptive analytics include summary statistics, clustering,
and the association rules used in market basket analysis.
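A minimal pandas sketch of descriptive analytics as summary statistics (the sales figures are made up):

```python
import pandas as pd

# Hypothetical past sales records.
sales = pd.DataFrame({
    "region":  ["N", "S", "N", "E", "S", "N"],
    "revenue": [120, 95, 130, 80, 101, 125],
})

# Summarize what has occurred: count, mean, std, quartiles.
print(sales["revenue"].describe())

# A simple readable report: past revenue by region.
print(sales.groupby("region")["revenue"].sum())
```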
Classification of Analytics
2. Diagnostic Analytics:
Diagnostic Analytics, as the name suggests, gives a diagnosis of a problem. It gives a
detailed and in-depth insight into the root cause of a problem. Data scientists turn to this type of
analytics when seeking the reason behind a particular event.
Techniques like drill-down, data mining, data discovery, churn reason analysis, and
customer health score analysis are all examples of diagnostic analytics. In business terms,
diagnostic analytics is useful when you are researching the reasons behind leading churn indicators
and usage trends among your most loyal customers.
Classification of Analytics
3. Predictive Analytics:
Predictive Analytics, as can be discerned from the name itself, is concerned with
predicting future incidents. These future incidents can be market trends, consumer trends,
and many other market-related events.
This type of analytics makes use of historical and present data to predict future
events. It is the most commonly used form of analytics among businesses.
Predictive analytics doesn’t only work for service providers but also for
consumers. It keeps track of our past activities and, based on them, predicts what we may
do next.
Classification of Analytics
4. Prescriptive Analytics:
Prescriptive analytics is the most valuable yet underused form of analytics. It is the next
step beyond predictive analytics. Prescriptive analysis explores several possible actions and
suggests actions depending on the results of descriptive and predictive analytics of a given dataset.
Prescriptive analytics is a combination of data and various business rules. The data for
prescriptive analytics can be both internal (organizational inputs) and external (social media
insights).
Examples of prescriptive analytics for customer retention are next-best-action and next-best-offer
analysis.
FOUR TYPES OF ANALYTICS
Introduction
The optimum utilization of data with analytics is helping organizations scale their business to the
next level. With data being the new currency, more and more companies are becoming
data-driven. Data analytics helps organizations understand their consumers, enhance their advertising
campaigns, personalize their content, and improve their products to meet the desired goal.
While raw data have immense potential, you cannot leverage data’s advantages without the
proper data analytics tools and types of analytics processes. As a Business or Data Analyst, you
need data analytics to maximize your efforts to grow a business and achieve its goals.
What Is Data Analytics?
Data Analytics refers to the process of analyzing datasets to draw out the insights they contain. Data Analytics
empowers Business Analysts to take raw data and reveal patterns to extract significant knowledge. Business
Analysts use Data Analytics techniques in their work to make smart business decisions. Using Data Analytics in
Business Analysis can help organizations better understand their consumers’ patterns and needs. Ultimately,
organizations can use various types of data analytics to boost business performance and improve their
products.
There are mainly 4 broad categories of analytics. These different types of analytics used by Business Analysts
empower them with insights that can help them improve business performance. Let’s take a detailed look at the
four types of analytics.
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
Descriptive Analytics
It is the most straightforward of the top categories of analytics. Descriptive analytics sifts
through raw data from various data sources to give meaningful insights into the past, i.e., it helps
you understand the impact of past actions. However, these discoveries can only signal whether
something is right or not, without explaining why. Therefore, Business Analysts don’t recommend
that highly data-driven organizations settle for descriptive analytics only; they’d preferably
combine it with other types of analytics.
It is a significant step in making raw data understandable to stakeholders, investors, and leaders.
This way, it becomes simple to recognize and address shortcomings that require attention. Data
aggregation and data mining are the two fundamental procedures in descriptive analytics. It is to be
noted that this technique is beneficial for understanding underlying behavior, not for making
estimations.
Example of Descriptive Analytics
Traffic and Engagement Reports – to analyze and understand website traffic and other
engagement metrics.
Financial Statement Analysis – Used to obtain a holistic view of the company’s financial health.
Diagnostic Analytics
Diagnostic Analytics is one of the 4 broad categories of analytics, utilized to determine
why something occurred in the past. It is characterized by techniques like drill-down,
data discovery, data mining, and correlations. Diagnostic Analytics investigates data
to comprehend the main drivers of events. It is useful in figuring out what
elements and events led to a specific outcome. It generally utilizes probabilities,
likelihoods, and the distribution of results for the analysis.
It gives comprehensive insights into a particular problem. At the same time, it requires
that an organization have detailed data available.
Examples Of Diagnostic Analytics
Examining Market Demand – Used to analyze market demands beforehand and
meet the supply accordingly.
Explaining Customer Behavior – Very helpful in understanding customer needs and
necessities and planning business operations accordingly.
Identifying Technology Issues – Utilized to run tests and identify technological issues.
Improving Company Culture – Ideally done by the HR department, where the necessary
employee data is collected to observe employee behavior.
Predictive Analytics
Predictive analytics is one of the four types of data analytics used by Business
Analysts; it determines what will probably occur. It utilizes the discoveries of
descriptive and diagnostic analytics to distinguish groups and exceptional cases
and to anticipate future patterns, making it an essential tool for forecasting.
One of the primary applications of predictive analytics is sentiment analysis. All
the opinions posted via online media are gathered and analyzed (existing text
data) to forecast the individual’s opinion on a specific subject as positive,
negative, or neutral (future prediction). Hence, predictive analytics comprises
designing and validating models that render precise predictions.
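As a toy illustration of the sentiment-analysis idea (a naive keyword score, standing in for the validated models a real system would use):

```python
# Naive keyword-based sentiment score; production systems train real models.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"bad", "hate", "poor"}

def sentiment(text: str) -> str:
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this phone, excellent camera"))  # -> positive
print(sentiment("battery life is bad"))                  # -> negative
```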
Examples Of Predictive Analytics
Finance: Forecasting Future Cash Flow – Used to predict and maintain the financial
need and health of the organization
Entertainment & Hospitality: Determining Staffing Needs – Used to fulfill the
staffing needs based on the influx and outflux of the customers.
Marketing: Behavioral Targeting – Leveraging the data obtained from consumer
behaviors for creating stronger marketing strategies.
Manufacturing: Preventing Malfunction – Used to predict a probable malfunction
or breakdown and avoid the same to save time and money.
Prescriptive Analytics
Predictive analytics is the basis for this type of data analytics used in Business
Analysis. Still, prescriptive analytics goes beyond the other three categories of analytics
mentioned above to recommend future solutions. It can recommend all favorable outcomes
per a predefined game plan and propose a different course of action to achieve a
specific result. Therefore, it utilizes a robust feedback system that continually
learns and updates the connection between actions and outcomes.
Prescriptive analytics utilizes emerging technologies and tools, such as
Machine Learning, Deep Learning, and Artificial Intelligence algorithms,
which makes it demanding to execute and oversee. Furthermore, this cutting-edge type of
data analytics requires internal as well as external past data to provide users
with favorable outcomes. That is why Business Analysts suggest weighing the needed
effort against the demanded added value before applying prescriptive analytics to
any business system.
Examples Of Prescriptive Analytics
Venture Capital: Investment Decisions – Often made on gut feeling, these decisions
can also be supported with the necessary algorithms.
Sales: Lead Scoring – Used to analyze and predict the probability of a lead resulting in a
successful conversion.
Content Curation: Algorithmic Recommendations – Used to predict and curate the
content needed to keep consumers engaged and interested.
Banking: Fraud Detection – Used to detect and flag fraudulent actions that might have
occurred in banking transactions.
Product Management: Development and Improvement – Here, the necessary data can be
collected and collated to derive the inputs needed regarding a product and its development and improvement.
Conclusion
Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, and Prescriptive Analytics
are the 4 types of analytics used by Business Analysts to unlock raw data’s potential in
order to improve business performance.

More Related Content

Similar to 1 UNIT-DSP.pptx

Big-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfBig-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfrajsharma159890
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3varshakumar21
 
INTRODUCTION TO BIG DATA AND HADOOP
INTRODUCTION TO BIG DATA AND HADOOPINTRODUCTION TO BIG DATA AND HADOOP
INTRODUCTION TO BIG DATA AND HADOOPDr Geetha Mohan
 
BigData Analytics_1.7
BigData Analytics_1.7BigData Analytics_1.7
BigData Analytics_1.7Rohit Mittal
 
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...Experfy
 
IRJET- Big Data: A Study
IRJET-  	  Big Data: A StudyIRJET-  	  Big Data: A Study
IRJET- Big Data: A StudyIRJET Journal
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviationranjit banshpal
 
IRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth EnhancementIRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth EnhancementIRJET Journal
 
Real World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining ToolsReal World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining Toolsijsrd.com
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applicationsSubrat Swain
 
L3 Big Data and Application.pptx
L3  Big Data and Application.pptxL3  Big Data and Application.pptx
L3 Big Data and Application.pptxShambhavi Vats
 
Data minig with Big data analysis
Data minig with Big data analysisData minig with Big data analysis
Data minig with Big data analysisPoonam Kshirsagar
 
An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.ijceronline
 
What Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfWhat Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfPridesys IT Ltd.
 
Big data (word file)
Big data  (word file)Big data  (word file)
Big data (word file)Shahbaz Anjam
 

Similar to 1 UNIT-DSP.pptx (20)

Research paper on big data and hadoop
Research paper on big data and hadoopResearch paper on big data and hadoop
Research paper on big data and hadoop
 
Big-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfBig-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdf
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3
 
INTRODUCTION TO BIG DATA AND HADOOP
INTRODUCTION TO BIG DATA AND HADOOPINTRODUCTION TO BIG DATA AND HADOOP
INTRODUCTION TO BIG DATA AND HADOOP
 
365 Data Science
365 Data Science365 Data Science
365 Data Science
 
BigData Analytics_1.7
BigData Analytics_1.7BigData Analytics_1.7
BigData Analytics_1.7
 
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
 
IRJET- Big Data: A Study
IRJET-  	  Big Data: A StudyIRJET-  	  Big Data: A Study
IRJET- Big Data: A Study
 
M.Florence Dayana
M.Florence DayanaM.Florence Dayana
M.Florence Dayana
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviation
 
IRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth EnhancementIRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth Enhancement
 
Real World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining ToolsReal World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining Tools
 
Data Science
Data ScienceData Science
Data Science
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applications
 
L3 Big Data and Application.pptx
L3  Big Data and Application.pptxL3  Big Data and Application.pptx
L3 Big Data and Application.pptx
 
Data minig with Big data analysis
Data minig with Big data analysisData minig with Big data analysis
Data minig with Big data analysis
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.
 
What Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfWhat Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdf
 
Big data (word file)
Big data  (word file)Big data  (word file)
Big data (word file)
 

Recently uploaded

SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 

Recently uploaded (20)

SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 

1 UNIT-DSP.pptx

  • 1. DATA SCIENCE IN BIG DATA U N I T - 1
  • 2. SYLLABUS UNIT I - INTRODUCTION TO DATASCIENCE AND BIG DATA Data Science - Fundamentals and Components – Data Scientist – Terminologies Used in Big Data Environments - Types of Digital Data - Classification of Digital Data - Introduction to Big Data - Characteristics of Data - Evolution of Big Data - Big Data Analytics - Classification of Analytics. UNIT II - DESCRIPTIVE ANALYTICS USING STATISTICS Types of Data – Mean, Median and Mode – Standard Deviation and Variance – Probability – Probability Density Function – Types of Data Distribution – Percentiles and Moments – Correlation and Covariance – Conditional Probability – Bayes Theorem – Introduction to Univariate, Bivariate and Multivariate Analysis. UNIT III - PREDICTIVE MODELING AND MACHINE LEARNING Linear Regression – Polynomial Regression – Multivariate Regression – Multi Level Models – Data warehousing overview – Bias / variance trade off – K Fold cross validation – Data Cleaning and Normalization – Cleaning web log Data – Normalizing numerical Data – Detecting Outliers – Introduction to Supervised and Unsupervised learning.
  • 3. SYLLABUS UNIT IV - DATAANALYTICAL FRAMEWORKS Introducing Hadoop: - Hadoop Overview - RDBMS versus Hadoop - HDFS (Hadoop Distributed File System): Components and block replication – Processing Data with Hadoop - Introduction to MapReduce – Features of MapReduce – Introduction to NoSQL: CAP theorem, MongoDB. UNIT V - DATA SCIENCE USING PYTHON Introduction to essential data science packages: NumPy, SciPy, Jupyter, Statsmodels and pandas Package – Introduction to Data Munging, Data pipeline and Machine learning in Python - Data visualization using matplotlib – Interactive visualization with advanced data learning representation in Python.
  • 4. Data Science – Definition Data Science is the science which uses computer science, statistics and machine learning, visualization and human-computer interactions to collect, clean, integrate, analyze, visualize, interact with data to create data products. Goal of Data Science - Turn data into data products. How is data science related to big data: It is a blend of the field of Computer Science, Business and Statistics together. Data Science is an area. Big Data is a technique to collect, maintain and process the huge information. It is about collection, processing, analyzing and utilizing of data into various operations.
  • 7. Data Science Process 1. Discovery: Discovery step involves acquiring data from all the identified internal & external sources . The data can be: Logs from webservers, Data gathered from social media, Census datasets Data streamed from online sources using APIs. 2. Preparation: Data can have lots of inconsistencies like missing value, blank columns, incorrect data format which needs to be cleaned. You need to process, explore, and condition data before modeling. The cleaner your data, the better are your predictions. 3. Model Planning: In this stage, you need to determine the method and technique to draw the relation between input variables. Planning for a model is performed by using different statistical formulas and visualization tools. SQL analysis services, R, and SAS/access are some of the tools used for this purpose.
  • 8. Data Science Process 4. Model Building: In this step, the actual model building process starts. Here, Data scientist distributes datasets for training and testing. Techniques like association, classification, and clustering are applied to the training data set. The model once prepared is tested against the “testing” dataset. 5. Operationalize: In this stage, you deliver the final baselined model with reports, code, and technical documents. Model is deployed into a real-time production environment after thorough testing. 6. Communicate Results In this stage, the key findings are communicated to all stakeholders. This helps you to decide if the results of the project are a success or a failure based on the inputs from the model.
  • 10. DATA SCIENCE COMPONENTS Statistics: Statistics is the most critical unit of Data Science basics. It is the method or of collecting and analyzing numerical data in large quantities to get useful insights. Visualization: Visualization technique helps you to access huge amounts of data in easy to understand and digestible visuals.
  • 11. Data Scientist: Role: A Data Scientist is a professional who manages enormous amounts of data to come up with compelling business visions by using various tools, techniques, methodologies, algorithms, etc. Languages: R, SAS, Python, SQL, Hive, Matlab, Pig, Spark. A data scientist’s work typically involves making sense of messy, unstructured data, from sources such as smart devices, social media feeds, and emails that don’t neatly fit into a database. Data scientists are analytical experts who utilize their skills in both technology and social science to find trends and manage data.
  • 12. Terminologies Used in Big Data Environment 5 V’s of Big Data: •Volume – a large amount of data. •Velocity – the speed of data processing. •Variety – large data diversity. •Veracity – verification of data. •Value – what big data can bring to the user.
  • 13. Terminologies Used in Big Data Environment VOLUME: The name ‘Big Data’ itself is related to a size which is enormous. Volume is a huge amount of data. To determine the value of data, size of data plays a very crucial role. If the volume of data is very large then it is actually considered as a ‘Big Data’. This means whether a particular data can actually be considered as a Big Data or not, is dependent upon the volume of data. Hence while dealing with Big Data it is necessary to consider a characteristic ‘Volume’.
  • 14. Terminologies Used in Big Data Environment VELOCITY: Velocity refers to the high speed of accumulation of data. In Big Data velocity data flows in from sources like machines, networks, social media, mobile phones etc. There is a massive and continuous flow of data. This determines the potential of data that how fast the data is generated and processed to meet the demands. Sampling data can help in dealing with the issue like ‘velocity’.
  • 15. Terminologies Used in Big Data Environment VARIETY: It refers to nature of data that is structured, semi-structured and unstructured data. It also refers to heterogeneous sources. Variety is basically the arrival of data from new sources that are both inside and outside of an enterprise. It can be structured, semi-structured and unstructured.
  • 16. Terminologies Used in Big Data Environment ◦ Structured data: This data is basically an organized data. It generally refers to data that has defined the length and format of data. ◦ Semi- Structured data: This data is basically a semi-organised data. It is generally a form of data that do not conform to the formal structure of data. Log files are the examples of this type of data. ◦ Unstructured data: This data basically refers to unorganized data. It generally refers to data that doesn’t fit neatly into the traditional row and column structure of the relational database. Texts, pictures, videos etc. are the examples of unstructured data which can’t be stored in the form of rows and columns.
  • 17. Terminologies Used in Big Data Environment Veracity: It refers to inconsistencies and uncertainty in data, that is data which is available can sometimes get messy and quality and accuracy are difficult to control. Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources. Example: Data in bulk could create confusion whereas less amount of data could convey half or Incomplete Information.
  • 18. Terminologies Used in Big Data Environment Value: After having the 4 V’s into account there comes one more V which stands for Value!. The bulk of Data having no Value is of no good to the company, unless you turn it into something useful. Data in itself is of no use or importance but it needs to be converted into something valuable to extract Information. Hence, you can state that Value! is the most important V of all the 5V’s.
  • 19. Terminologies Used In Big Data Environments ◦ As-a-service infrastructure Data-as-a-service, software-as-a-service, platform-as-a-service – all refer to the idea that rather than selling data, licences to use data, or platforms for running Big Data technology, it can be provided “as a service”, rather than as a product. This reduces the upfront capital investment necessary for customers to begin putting their data, or platforms, to work for them, as the provider bears all of the costs of setting up and hosting the infrastructure. As a customer, as-a- service infrastructure can greatly reduce the initial cost and setup time of getting Big Data initiatives up and running.
  • 20. Data science Data science is the professional field that deals with turning data into value such as new insights or predictive models. It brings together expertise from fields including statistics, mathematics, computer science, communication as well as domain expertise such as business knowledge. Data scientist has recently been voted the No 1 job in the U.S., based on current demand and salary and career opportunities. Data mining Data mining is the process of discovering insights from data. In terms of Big Data, because it is so large, this is generally done by computational methods in an automated way using methods such as decision trees, clustering analysis and, most recently, machine learning. This can be thought of as using the brute mathematical power of computers to spot patterns in data which would not be visible to the human eye due to the complexity of the dataset.
  • 21. Hadoop Hadoop is a framework for Big Data computing which has been released into the public domain as open source software, and so can freely be used by anyone. It consists of a number of modules all tailored for a different vital step of the Big Data process – from file storage (Hadoop File System HDFS) to database (HBase) to carrying out data operations (Hadoop MapReduce – see below). It has become so popular due to its power and flexibility that it has developed its own industry of retailers (selling tailored versions), support service providers and consultants. Predictive modelling At its simplest, this is predicting what will happen next based on data about what has happened previously. In the Big Data age, because there is more data around than ever before, predictions are becoming more and more accurate. Predictive modelling is a core component of most Big Data initiatives, which are formulated to help us choose the course of action which will lead to the most desirable outcome. The speed of modern computers and the volume of data available means that predictions can be made based on a huge number of variables, allowing an ever-increasing number of variables to be assessed for the probability that it will lead to success.
  • 22. MapReduce MapReduce is a computing procedure for working with large datasets, devised because of the difficulty of reading and analysing really Big Data using conventional computing methodologies. As its name suggests, it consists of two procedures – mapping (sorting information into the format needed for analysis, e.g. sorting a list of people according to their age) and reducing (performing an operation, such as checking the age of everyone in the dataset to see who is over 21). NoSQL NoSQL refers to a family of database formats designed to hold more than just data arranged into tables, rows and columns, as in a conventional relational database. This database format has proven very popular in Big Data applications because Big Data is often messy, unstructured and does not easily fit into traditional database frameworks.
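  The mapping and reducing procedures above can be imitated in a few lines of plain Python. This is only a single-machine sketch of the idea, not Hadoop's actual API, and the people listed are invented:

  from collections import defaultdict

  people = [("alice", 34), ("bob", 19), ("carol", 25), ("dave", 17)]

  # Map: emit (key, value) pairs in the shape the analysis needs --
  # here, a flag recording whether each person is over 21.
  mapped = [(1 if age > 21 else 0, name) for name, age in people]

  # Shuffle/sort: group the intermediate pairs by key.
  groups = defaultdict(list)
  for flag, name in mapped:
      groups[flag].append(name)

  # Reduce: collapse each group into a result.
  print(groups[1])  # ['alice', 'carol'] -- everyone over 21

  In real MapReduce the map and reduce steps run in parallel across many machines, with the framework handling the grouping in between.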
  • 23.  Python Python is a programming language which has become very popular in the Big Data space due to its ability to work very well with large, unstructured datasets (see Part II for the difference between structured and unstructured data). It is considered to be easier to learn for a data science beginner than other languages such as R (see also Part II) and more flexible.  R Programming R is another programming language commonly used in Big Data, and can be thought of as more specialised than Python, being geared towards statistics. Its strength lies in its powerful handling of structured data. Like Python, it has an active community of users who are constantly expanding and adding to its capabilities by creating new libraries and extensions.
  • 24. Recommendation engine A recommendation engine is basically an algorithm, or collection of algorithms, designed to match an entity (for example, a customer) with something they are looking for. Recommendation engines used by the likes of Netflix or Amazon rely heavily on Big Data technology to gain an overview of their customers and, using predictive modelling, match them with products to buy or content to consume. The economic incentives offered by recommendation engines have been a driving force behind a lot of commercial Big Data initiatives and developments over the last decade.
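  As an illustration of the matching idea, here is a toy content-based sketch: recommend the catalogue item whose feature vector is most similar (by cosine similarity) to a customer's taste. NumPy is assumed, and the films, features and scores are all invented; production engines at the scale of Netflix or Amazon are far more elaborate:

  import numpy as np

  def cosine(a, b):
      return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

  customer_taste = np.array([5, 1, 4])  # e.g. action, romance, sci-fi scores
  catalogue = {
      "Film A": np.array([5, 0, 5]),
      "Film B": np.array([1, 5, 0]),
      "Film C": np.array([4, 2, 4]),
  }

  # Recommend the item whose feature vector best matches the customer.
  best = max(catalogue, key=lambda title: cosine(customer_taste, catalogue[title]))
  print(best)  # Film A -- the closest match to this customer's taste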
  • 25. Reporting The crucial “last step” of many Big Data initiatives involves getting the right information to the people who need it to make decisions, at the right time. When this step is automated, analytics is applied to the insights themselves to ensure that they are communicated in a way that will be understood and easy to act on. This usually involves creating multiple reports based on the same data or insights, each intended for a different audience (for example, in-depth technical analysis for engineers, and an overview of the impact on the bottom line for C-level executives). Spark Spark is another open-source framework like Hadoop, but more recently developed and more suited to cutting-edge Big Data tasks involving real-time analytics and machine learning. Unlike Hadoop it does not include its own filesystem, though it is designed to work with Hadoop’s HDFS or a number of other options. For certain data-related processes, however, it can calculate at over 100 times the speed of Hadoop, thanks to its in-memory processing capability. This means it is becoming an increasingly popular choice for projects involving deep learning, neural networks and other compute-intensive tasks.
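  Since Unit V of this course uses Python, here is what a MapReduce-style word count looks like in Spark's Python API. It assumes the pyspark package is installed and runs locally; on a cluster the same code would be distributed:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()

  lines = spark.sparkContext.parallelize(["big data", "big ideas"])
  counts = (lines.flatMap(lambda line: line.split())  # map to words
                 .map(lambda word: (word, 1))         # emit (word, 1) pairs
                 .reduceByKey(lambda a, b: a + b))    # reduce per word
  print(counts.collect())  # e.g. [('big', 2), ('data', 1), ('ideas', 1)]
  spark.stop()

  Because intermediate results stay in memory rather than being written to disk between steps, chains of such operations are where Spark's speed advantage over Hadoop MapReduce comes from.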
  • 26. Digital data can be classified into three forms: – unstructured – semi-structured – structured
  • 27. DIGITAL DATA: Today, data undoubtedly is an invaluable asset of any enterprise (big or small). Even though professionals work with data all the time, the understanding, management and analysis of data from heterogeneous sources remains a serious challenge. • In this lecture, the various formats of digital data (structured, semi-structured and unstructured data), data storage mechanism, data access methods, management of data, the process of extracting desired information from data, challenges posed by various formats of data, etc. will be explained. • Data growth has seen exponential acceleration since the advent of the computer and Internet.
  • 28. TYPES OF DATA Digital data is divided into three different types: Structured, Unstructured, Semi-structured. 1. Structured: Structured data is one of the types of big data. By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily and seamlessly stored in, and accessed from, a database by simple search-engine algorithms. For instance, the employee table in a company database will be structured, as the employee details, job positions, salaries, etc., will be present in an organized manner.
  • 29. TYPES OF DATA 2. Unstructured: Unstructured data refers to data that lacks any specific form or structure whatsoever. This makes it very difficult and time-consuming to process and analyze. Structured and unstructured are two important types of big data. Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research papers, white papers, the body of an email, etc. 3. Semi-structured: Semi-structured is the third type of big data. Semi-structured data contains elements of both formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although not classified under a particular repository (database), still contains vital information or tags that segregate individual elements within the data. Examples: emails, XML, markup languages like HTML, etc. Metadata for this data is available but is not sufficient.
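  A short standard-library sketch of the contrast between the first and third categories, using invented records: structured data fits a fixed schema, while semi-structured data carries its own tags but no fixed schema:

  import csv, io, json

  # Structured: fixed columns, trivially loaded into rows.
  table = io.StringIO("id,name,salary\n1,Asha,52000\n2,Ravi,61000\n")
  rows = list(csv.DictReader(table))
  print(rows[0]["name"])  # Asha -- every row has the same fields

  # Semi-structured: tagged fields, but records may differ in shape.
  record = json.loads('{"id": 3, "name": "Mei", "skills": ["SQL", "R"]}')
  print(record.get("skills", []))  # tags segregate individual elements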
  • 31. How to Store Unstructured Data?
  • 32. UIMA  UIMA (Unstructured Information Management Architecture) is an open-source platform from IBM which integrates different kinds of analysis engines to provide a complete solution for knowledge discovery from unstructured data.  In UIMA, the analysis engines integrate and analyze unstructured information, and bridge the gap between structured and unstructured data.  UIMA stores information in a structured format. The structured resources can be mined, searched, and put to other uses. The information obtained from structured sources is also used for subsequent analysis of unstructured data.  Various analysis engines analyze unstructured data in different ways, such as:  – Breaking up documents into separate words.  – Grouping and classifying according to a taxonomy.  – Detecting parts of speech, grammar, and synonyms.  – Detecting events and times.  – Detecting relationships between various elements.  CAS (Content Addressable Storage): It stores data based on its metadata and assigns a unique name to every object stored in it.
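  UIMA itself is a Java framework, so the following is only an illustrative Python sketch (standard library only, invented sentence) of the first analysis-engine step listed above: breaking a document into separate words and giving the result a structured, queryable form:

  import re
  from collections import Counter

  document = "Unstructured text hides structure. Text analysis finds it."
  words = re.findall(r"[a-z]+", document.lower())  # break into words
  frequencies = Counter(words)                     # a structured view
  print(frequencies.most_common(2))  # e.g. [('text', 2), ('unstructured', 1)]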
  • 33. Advantages of structured data (easy to work with) • It is easy to work with structured data. The advantages are: • Storage: Both defined and user-defined data types help with the storage of structured data. • Scalability: Scalability is generally not an issue as data volumes increase. • Security: Ensuring security is easy. • Update and Delete: Updating, deleting, etc. are easy due to the structured form. • Transaction Properties: ACID properties are supported.
  • 34. TYPES OF DATA 4 Types of Data: Nominal, Ordinal, Discrete, Continuous 1. Nominal: These are sets of values that don’t possess a natural ordering. Let’s understand this with some examples. The color of a smartphone can be considered nominal data, as we can’t compare one color with another; it is not possible to state that ‘Red’ is greater than ‘Blue’. The gender of a person is another example, where we can’t rank male, female, or others. Mobile phone categories – midrange, budget segment, or premium – are also a nominal data type.
  • 35. TYPES OF DATA 2. Ordinal: These types of values have a natural ordering while maintaining their class of values. If we consider the sizes of a clothing brand, we can easily sort them according to their name tags in the order small < medium < large. The grading system used when marking candidates in a test can also be considered an ordinal data type, where A+ is definitely better than a B grade. These categories help us decide which encoding strategy can be applied to which type of data (see the sketch after the next slide).
  • 36. TYPES OF DATA 3. Discrete: Numerical values that are integers or whole numbers are placed under this category. The number of speakers in a phone, the number of cameras, the number of cores in the processor and the number of SIMs supported are all examples of the discrete data type. 4. Continuous: Fractional numbers are considered continuous values. These can take the form of the operating frequency of the processor, the Android version of the phone, the Wi-Fi frequency, the temperature of the cores, and so on.
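  As a brief illustration of why ordinal values deserve an order-aware encoding while nominal values do not, here is a sketch assuming pandas is installed; the sizes and colours are invented:

  import pandas as pd

  sizes = pd.Categorical(["medium", "small", "large", "small"],
                         categories=["small", "medium", "large"],
                         ordered=True)
  print(sizes.codes)      # [1 0 2 0]: integer codes respecting small < medium < large
  print(sizes > "small")  # [ True False  True False]: comparison uses category order

  colours = pd.Categorical(["red", "blue", "red"])  # nominal: unordered
  # colours > "blue" would raise a TypeError, because no order is
  # defined -- exactly the defining property of nominal data.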
  • 37. What is Big Data? According to Gartner, the definition of Big Data – “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” This definition clearly answers the “What is Big Data?” question – Big Data refers to complex and large data sets that have to be processed and analyzed to uncover valuable information that can benefit businesses and organizations.
  • 38. Introduction to Big Data Big data is a collection of massive and complex data sets, involving huge quantities of data, data-management capabilities, social media analytics and real-time data. Big data analytics is the process of examining such large amounts of heterogeneous digital data. Big data is about data volume: large data sets measured in terms of terabytes or petabytes. High volumes of data that traditional computing tools cannot process are being collected daily; we refer to these high volumes of data as big data.
  • 39. BIG DATA The process of analyzing large volumes of diverse data sets using advanced analytic techniques is referred to as Big Data Analytics. These diverse data sets include structured, semi-structured, and unstructured data, from different sources, and in different sizes from terabytes to zettabytes; we also reckon them as big data. Big Data is a term used for data sets whose size or type is beyond the capturing, managing, and processing ability of traditional relational databases. The database required to process big data should have the low latency that traditional databases don’t have. Big data has one or more of the characteristics of high volume, high velocity, and high variety.
  • 40. Classification of Analytics Big data analytics is categorized into four subcategories: Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, Prescriptive Analytics
  • 41. Classification of Analytics 1. Descriptive Analytics: Descriptive Analytics is considered a useful technique for uncovering patterns within a certain segment of customers. It simplifies data and summarizes past data into a readable form. It provides insights into what has occurred in the past, along with trends to dig into for more detail. This helps in creating reports on a company’s revenue, profits, sales, and so on. Examples of descriptive analytics include summary statistics, clustering, and the association rules used in market basket analysis.
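  A one-call example of the "summarize past data into a readable form" step, assuming pandas and using invented revenue figures:

  import pandas as pd

  sales = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr"],
                        "revenue": [120, 135, 128, 150]})
  # Summary statistics: count, mean, std, min, quartiles, max.
  print(sales["revenue"].describe())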
  • 42. Classification of Analytics 2. Diagnostic Analytics: Diagnostic Analytics, as the name suggests, gives a diagnosis of a problem. It gives a detailed and in-depth insight into the root cause of a problem. Data scientists turn to this analytics when they need the reason behind a particular happening. Techniques like drill-down, data mining, data discovery, churn-reason analysis, and customer health score analysis are all examples of diagnostic analytics. In business terms, diagnostic analytics is useful when you are researching the reasons behind leading churn indicators and usage trends among your most loyal customers.
  • 43. Classification of Analytics 3. Predictive Analytics: Predictive Analytics, as can be discerned from the name itself, is concerned with predicting future incidents. These future incidents can be market trends, consumer trends, and many such market-related events. This type of analytics makes use of historical and present data to predict future events. This is the most commonly used form of analytics among businesses. Predictive analytics doesn’t only work for the service providers but also for the consumers. It keeps track of our past activities and based on them, predicts what we may do next.
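  A minimal predictive sketch in the spirit of this slide: fit a model on historical figures, then forecast the next period. scikit-learn, NumPy and the numbers are all assumptions, not part of the slide:

  import numpy as np
  from sklearn.linear_model import LinearRegression

  months = np.array([[1], [2], [3], [4]])   # past periods
  revenue = np.array([120, 135, 128, 150])  # what happened in them

  model = LinearRegression().fit(months, revenue)  # learn from history
  print(model.predict(np.array([[5]])))            # forecast month 5 (~154)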
  • 44. Classification of Analytics 4. Prescriptive Analytics: Prescriptive analytics is the most valuable yet underused form of analytics. It is the next step after predictive analytics. Prescriptive analysis explores several possible actions and suggests actions depending on the results of the descriptive and predictive analytics of a given dataset. Prescriptive analytics is a combination of data and various business rules. The data for prescriptive analytics can be both internal (organizational inputs) and external (social media insights). Examples of prescriptive analytics for customer retention are next-best-action and next-best-offer analysis.
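  A toy sketch of the "data plus business rules" combination described above, applied to next-best-action for retention. The churn probability is assumed to come from a predictive model, and the thresholds and actions are invented:

  def next_best_action(churn_probability, customer_value):
      # Business rules layered on top of a model's prediction.
      if churn_probability > 0.7 and customer_value == "high":
          return "assign account manager + retention offer"
      if churn_probability > 0.7:
          return "send discount offer"
      if churn_probability > 0.4:
          return "send re-engagement email"
      return "no action"

  print(next_best_action(0.82, "high"))  # assign account manager + retention offer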
  • 45. FOUR TYPES OF ANALYTICS
  • 48. Introduction The optimum utilization of data with analytics is helping organizations scale their business to the next level. With data being the new currency, more and more companies are becoming data- driven. Data analytics help organizations understand their consumers, enhance their advertising campaigns, personalize their content, and improve their products to meet the desired goal. While raw data have immense potential, you cannot leverage data’s advantages without the proper data analytics tools and types of analytics processes. As a Business or Data Analyst, you need data analytics to maximize your efforts to grow a business and achieve its goals.
  • 49. What Is Data Analytics? Data Analytics refers to the process of analyzing datasets to draw out the insights they contain. Data Analytics empowers Business Analysts to take raw data and reveal patterns to extract significant knowledge. Business Analysts use Data Analytics techniques in their work to make smart business decisions. Using Data Analytics in Business Analysis can help organizations better understand their consumers’ patterns and needs. Ultimately, organizations can use various types of data analytics to boost business performance and improve their products. There are mainly 4 broad categories of analytics. These different types of analytics used by Business Analysts empower them with insights that can help them improve business performance. Let’s take a detailed look at the four types of analytics: Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, Prescriptive Analytics
  • 50. Descriptive Analytics It is the most straightforward of the four categories of analytics. Descriptive analytics summarizes raw data from various data sources to give meaningful insights into the past, i.e., it helps you understand the impact of past actions. However, these findings can only signal whether something went right or wrong, without explaining why. Therefore, Business Analysts don’t recommend that highly data-driven organizations settle for descriptive analytics alone; they’d preferably combine it with other types of analytics.
  • 51. It is a significant step in making raw data understandable to stakeholders, investors, and leaders. This way, it becomes simple to recognize and address shortcomings that require attention. Data aggregation and data mining are the two fundamental procedures in descriptive analytics. It is to be noted that this technique is beneficial for understanding underlying behavior, not for making estimations.
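  Data aggregation, the first of the two procedures just mentioned, sketched with pandas on invented transactions: raw rows are rolled up into per-region figures a stakeholder can read at a glance:

  import pandas as pd

  txns = pd.DataFrame({"region": ["North", "South", "North", "South"],
                       "amount": [250, 310, 125, 90]})
  # Aggregate the raw rows into a readable summary per region.
  summary = txns.groupby("region")["amount"].agg(["sum", "mean", "count"])
  print(summary)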
  • 52. Example of Descriptive Analytics Traffic and Engagement Reports – to analyze and understand website traffic and other engagement metrics. Financial Statement Analysis – Used to obtain a holistic view of the company’s financial health.
  • 53. Diagnostic Analytics Diagnostic Analytics is one of the 4 broad categories of analytics, utilized to determine why something occurred in the past. It is characterized by techniques like drill-down, data discovery, data mining, and correlations. Diagnostic Analytics investigates data to comprehend the main drivers of events. It is useful in figuring out what elements and events led to a specific outcome, and it generally utilizes probabilities, likelihoods, and the distribution of results in the analysis. It gives comprehensive insights into a particular problem; at the same time, the organization must have detailed data available to it.
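  Correlation, one of the diagnostic techniques named above, in a small pandas sketch with invented figures; a spreadsheet-sized example of asking why an outcome moved:

  import pandas as pd

  df = pd.DataFrame({"ad_spend":    [10, 15, 12, 20, 18],
                     "site_visits": [800, 1150, 900, 1500, 1400],
                     "churn_rate":  [5.1, 5.0, 5.2, 4.8, 4.9]})
  # Pairwise correlations: spend moves with visits and against churn,
  # a starting point (not proof) for a root-cause investigation.
  print(df.corr())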
  • 54. Examples Of Diagnostic Analytics Examining Market Demand – Used to analyze market demands beforehand and meet the supply accordingly. Explaining Customer Behavior – Very helpful in understanding customer needs and necessities and planning business operations accordingly Identifying Technology Issues – Utilized to run tests and identify technological issues Improving Company Culture – Ideally done by the HR department, the necessary employee data is collected to observe employee behavior.
  • 55. Predictive Analytics Predictive analytics is one of the four types of data analytics used by Business Analysts that determine what will probably occur. It utilizes the discoveries of descriptive and diagnostic analytics to distinguish groups and exceptional cases and anticipate future patterns, making it an essential tool for forecasting. One of the primary applications of predictive analytics is sentiment analysis. All the opinions posted via online media are gathered and analyzed (existing text data) to forecast the individual’s opinion on a specific subject as positive, negative, or neutral (future prediction). Hence, predictive analytics comprises designing and validating models that render precise predictions.
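  A deliberately naive sentiment sketch in plain Python. Real sentiment analysis uses trained models rather than keyword lists, but the input/output shape is the one described above: text in, positive/negative/neutral out. The word lists are invented:

  POSITIVE = {"great", "love", "excellent"}
  NEGATIVE = {"poor", "hate", "broken"}

  def sentiment(text):
      words = set(text.lower().split())
      score = len(words & POSITIVE) - len(words & NEGATIVE)
      return "positive" if score > 0 else "negative" if score < 0 else "neutral"

  print(sentiment("love this phone and its excellent battery"))  # positive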
  • 56. Examples Of Predictive Analytics Finance: Forecasting Future Cash Flow – Used to predict and maintain the financial need and health of the organization Entertainment & Hospitality: Determining Staffing Needs – Used to fulfill the staffing needs based on the influx and outflux of the customers. Marketing: Behavioral Targeting – Leveraging the data obtained from consumer behaviors for creating stronger marketing strategies. Manufacturing: Preventing Malfunction – Used to predict a probable malfunction or breakdown and avoid the same to save time and money.
  • 57. Prescriptive Analytics Prescriptive analytics builds on predictive analytics, but goes past the other three categories of analytics mentioned above to recommend future solutions. It can recommend all favorable outcomes per a predefined game plan, and propose different courses of action to achieve a specific result. Therefore, it utilizes a robust feedback system that continually learns and updates the connection between actions and outcomes.
  • 58. Prescriptive analytics utilizes emerging technologies and tools, such as Machine Learning, Deep Learning, and Artificial Intelligence algorithms, which makes it demanding to implement and oversee. Furthermore, this cutting-edge type of data analytics requires both internal (organizational) and external historical data to provide users with favorable outcomes. That is why Business Analysts suggest weighing the effort needed against the value added before implementing prescriptive analytics in any business system.
  • 59. Examples Of Prescriptive Analytics Venture Capital: Investment Decisions – Often made by gut feeling, these decisions can sometimes also be supported with the necessary algorithms. Sales: Lead Scoring – Used to analyze and predict the probability of a lead resulting in a successful conversion. Content Curation: Algorithmic Recommendations – Used to decide what content to create to keep consumers engaged and interested. Banking: Fraud Detection – Used to detect and flag fraudulent actions that might have occurred in banking transactions. Product Management: Development and Improvement – Here, the necessary data can be collected and collated to derive the inputs needed regarding a product and its development.
  • 60. Conclusion Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, and Prescriptive Analytics are the 4 types of analytics used by Business Analysts to unlock raw data’s potential in order to improve business performance. If you’re someone who loves to play with data and wants to build a successful career in Business Analytics, check out our Integrated Program In Business Analytics (IPBA) in collaboration with IIM Indore. It is a 10-month-long online Future Leaders Program aimed at senior executives and mid-career professionals to help them give their careers a significant boost.