DATA SCIENCE
IN BIG DATA
UNIT - 1
SYLLABUS
UNIT I - INTRODUCTION TO DATA SCIENCE AND BIG DATA
Data Science - Fundamentals and Components – Data Scientist – Terminologies Used in Big Data Environments -
Types of Digital Data - Classification of Digital Data - Introduction to Big Data - Characteristics of Data - Evolution of Big
Data - Big Data Analytics - Classification of Analytics.
UNIT II - DESCRIPTIVE ANALYTICS USING STATISTICS
Types of Data – Mean, Median and Mode – Standard Deviation and Variance – Probability – Probability Density
Function – Types of Data Distribution – Percentiles and Moments – Correlation and Covariance – Conditional Probability –
Bayes Theorem – Introduction to Univariate, Bivariate and Multivariate Analysis.
UNIT III - PREDICTIVE MODELING AND MACHINE LEARNING
Linear Regression – Polynomial Regression – Multivariate Regression – Multi Level Models – Data warehousing
overview – Bias / variance trade off – K Fold cross validation – Data Cleaning and Normalization – Cleaning web log Data –
Normalizing numerical Data – Detecting Outliers – Introduction to Supervised and Unsupervised learning.
SYLLABUS
UNIT IV - DATA ANALYTICAL FRAMEWORKS
Introducing Hadoop: - Hadoop Overview - RDBMS versus Hadoop - HDFS (Hadoop Distributed File
System): Components and block replication – Processing Data with Hadoop - Introduction to MapReduce –
Features of MapReduce – Introduction to NoSQL: CAP theorem, MongoDB.
UNIT V - DATA SCIENCE USING PYTHON
Introduction to essential data science packages: NumPy, SciPy, Jupyter, Statsmodels and pandas Package –
Introduction to Data Munging, Data pipeline and Machine learning in Python - Data visualization using matplotlib –
Interactive visualization with advanced data learning representation in Python.
Data Science – Definition
Data Science is the discipline that combines computer science, statistics, machine learning,
visualization, and human-computer interaction to collect, clean, integrate, analyze, visualize,
and interact with data in order to create data products.
Goal of Data Science - Turn data into data products.
How is data science related to big data? Data Science is a field of study that blends Computer
Science, Business, and Statistics. Big Data is a set of techniques for collecting, maintaining, and
processing huge volumes of information; it is concerned with the collection, processing, analysis,
and use of data in various operations.
Data Science Process
1. Discovery:
The discovery step involves acquiring data from all identified internal and external sources. The
data can be: logs from web servers, data gathered from social media, census datasets, or data
streamed from online sources using APIs.
2. Preparation:
Data can have many inconsistencies, such as missing values, blank columns, and incorrect data
formats, which need to be cleaned. You need to process, explore, and condition data before
modeling. The cleaner your data, the better your predictions.
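A minimal pandas sketch of this cleaning step (the file name and column names are hypothetical, for illustration only):

```python
import pandas as pd

# Hypothetical raw export with missing values and inconsistent formats.
df = pd.read_csv("web_logs.csv")

# Drop columns that are entirely blank.
df = df.dropna(axis=1, how="all")

# Fill missing numeric values with each column's median.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Coerce an inconsistently formatted date column; unparseable rows become NaT.
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
df = df.dropna(subset=["timestamp"])
```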
3. Model Planning:
In this stage, you need to determine the methods and techniques for drawing relationships between
input variables. Planning for a model is performed using different statistical formulas and
visualization tools. SQL Analysis Services, R, and SAS/ACCESS are some of the tools used for this purpose.
Data Science Process
4. Model Building:
In this step, the actual model building process starts. Here, the data scientist splits the dataset
into training and testing sets. Techniques like association, classification, and clustering are
applied to the training data set. Once prepared, the model is tested against the “testing” dataset.
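A minimal scikit-learn sketch of this split-train-test cycle, using a toy dataset in place of the prepared data from the earlier steps:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset stands in for data produced by the discovery/preparation steps.
X, y = load_iris(return_X_y=True)

# Split the dataset into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit a classification model on the training set only.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Evaluate the prepared model against the held-out "testing" dataset.
print("test accuracy:", model.score(X_test, y_test))
```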
5. Operationalize:
In this stage, you deliver the final baselined model with reports, code, and technical documents.
The model is deployed into a real-time production environment after thorough testing.
6. Communicate Results:
In this stage, the key findings are communicated to all stakeholders. This helps you decide whether
the results of the project are a success or a failure based on the inputs from the model.
DATA SCIENCE COMPONENTS
Statistics:
Statistics is the most critical unit of Data Science basics. It is the method of collecting and
analyzing numerical data in large quantities to get useful insights.
Visualization:
Visualization techniques help you present huge amounts of data as easy-to-understand,
digestible visuals.
Data Scientist:
Role: A Data Scientist is a professional who manages enormous amounts of data
to come up with compelling business insights by using various tools, techniques,
methodologies, algorithms, etc.
Languages: R, SAS, Python, SQL, Hive, MATLAB, Pig, Spark.
A data scientist’s work typically involves making sense of messy, unstructured
data, from sources such as smart devices, social media feeds, and emails that
don’t neatly fit into a database.
Data scientists are analytical experts who utilize their skills in both technology
and social science to find trends and manage data.
Terminologies Used in Big Data Environment
5 V’s of Big Data:
•Volume – a large amount of data.
•Velocity – the speed of data processing.
•Variety – large data diversity.
•Veracity – the trustworthiness and quality of data.
•Value – what big data can bring to the user.
Terminologies Used in Big Data Environment
VOLUME:
The name ‘Big Data’ itself relates to enormous size. Volume refers to the huge amount of data.
The size of data plays a crucial role in determining its value: whether particular data can
actually be considered Big Data depends largely on its volume.
Hence, ‘Volume’ is a necessary characteristic to consider when dealing with Big Data.
Terminologies Used in Big Data Environment
VELOCITY:
Velocity refers to the high speed at which data accumulates.
In Big Data, data flows in at velocity from sources like machines, networks, social media, mobile
phones, etc.
There is a massive and continuous flow of data. Velocity determines the potential of data: how
fast the data is generated and processed to meet demands.
Sampling data can help in dealing with issues like ‘velocity’.
Terminologies Used in Big Data Environment
VARIETY:
Variety refers to the nature of data: structured, semi-structured, and unstructured.
It also refers to heterogeneous sources.
Variety is basically the arrival of data from new sources both inside and outside an
enterprise. It can be structured, semi-structured, or unstructured.
Terminologies Used in Big Data Environment
◦ Structured data: This is basically organized data. It generally refers to data whose length
and format are defined.
◦ Semi-structured data: This is basically semi-organised data. It is generally a form of data
that does not conform to the formal structure of data. Log files are examples of this type
of data.
◦ Unstructured data: This basically refers to unorganized data. It generally refers to data
that doesn’t fit neatly into the traditional row-and-column structure of a relational database.
Texts, pictures, videos, etc. are examples of unstructured data, which can’t be stored in the
form of rows and columns.
Terminologies Used in Big Data Environment
Veracity:
Veracity refers to inconsistencies and uncertainty in data; that is, available data can sometimes
get messy, and its quality and accuracy are difficult to control.
Big Data is also variable because of the multitude of data dimensions resulting from multiple
disparate data types and sources.
Example: Data in bulk could create confusion, whereas a smaller amount of data could convey only
half or incomplete information.
Terminologies Used in Big Data Environment
Value:
After taking the other four V’s into account, there comes one more V, which stands for Value. Bulk
data with no value is of no good to a company unless it is turned into something useful.
Data in itself is of no use or importance; it needs to be converted into something valuable in
order to extract information. Hence, Value is regarded as the most important of the 5 V’s.
Terminologies Used In Big Data Environments
◦ As-a-service infrastructure
Data-as-a-service, software-as-a-service, platform-as-a-service – all refer to the idea that rather
than selling data, licences to use data, or platforms for running Big Data technology as products,
they can be provided “as a service”. This reduces the upfront capital investment necessary for
customers to begin putting their data, or platforms, to work for them, as the provider bears all of
the costs of setting up and hosting the infrastructure. For a customer, as-a-service infrastructure
can greatly reduce the initial cost and setup time of getting Big Data initiatives up and running.
Data science
Data science is the professional field that deals with turning data into value such as
new insights or predictive models. It brings together expertise from fields including
statistics, mathematics, computer science, communication as well as domain expertise
such as business knowledge. Data scientist has recently been voted the No. 1 job in the
U.S., based on current demand, salary, and career opportunities.
Data mining
Data mining is the process of discovering insights from data. In terms of Big Data,
because it is so large, this is generally done by computational methods in an
automated way using methods such as decision trees, clustering analysis and, most
recently, machine learning. This can be thought of as using the brute mathematical
power of computers to spot patterns in data which would not be visible to the human
eye due to the complexity of the dataset.
Hadoop
Hadoop is a framework for Big Data computing which has been released into the public
domain as open source software, and so can be freely used by anyone. It consists of a
number of modules, each tailored for a different vital step of the Big Data process – from file
storage (the Hadoop Distributed File System, HDFS) to databases (HBase) to carrying out data
operations (Hadoop MapReduce – see below). It has become so popular due to its power and
flexibility that it has developed its own industry of retailers (selling tailored versions),
support service providers, and consultants.
Predictive modelling
At its simplest, this is predicting what will happen next based on data about what has
happened previously. In the Big Data age, because there is more data around than ever
before, predictions are becoming more and more accurate. Predictive modelling is a core
component of most Big Data initiatives, which are formulated to help us choose the course
of action which will lead to the most desirable outcome. The speed of modern computers
and the volume of data available mean that predictions can be made based on a huge
number of variables, allowing an ever-increasing number of variables to be assessed for the
probability that they will lead to success.
MapReduce
MapReduce is a computing procedure for working with large datasets, which was
devised due to the difficulty of reading and analysing really Big Data using conventional
computing methodologies. As its name suggests, it consists of two procedures – mapping
(sorting information into the format needed for analysis – e.g. sorting a list of people
according to their age) and reducing (performing an operation, such as checking the age of
everyone in the dataset to see who is over 21).
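A tiny Python sketch of the two procedures using the age example above (the data values are made up):

```python
from functools import reduce

# Hypothetical dataset: (name, age) pairs.
people = [("Asha", 34), ("Ben", 19), ("Chen", 25), ("Dana", 17)]

# Map step: sort the records into the format needed for analysis (by age).
mapped = sorted(people, key=lambda person: person[1])

# Reduce step: aggregate over the mapped records,
# e.g. count how many people are over 21.
over_21 = reduce(lambda count, person: count + (person[1] > 21), mapped, 0)
print(over_21)  # -> 2
```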
NoSQL
NoSQL refers to database formats designed to hold more than just data arranged into
tables, rows, and columns, as is the case in a conventional relational database. This
database format has proven very popular in Big Data applications because Big Data is
often messy and unstructured and does not easily fit into traditional database frameworks.
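A minimal PyMongo sketch (the database, collection, and document here are hypothetical, and a local MongoDB server is assumed to be running) showing how a NoSQL document holds nested, schema-free data; MongoDB itself is covered in Unit IV:

```python
from pymongo import MongoClient

# Connect to a locally running MongoDB instance (assumed available).
client = MongoClient("mongodb://localhost:27017/")
posts = client["demo_db"]["posts"]

# One document mixes scalar fields, an array, and a nested sub-document -
# no fixed table schema is declared up front.
posts.insert_one({
    "user": "asha",
    "text": "Big Data is messy",
    "tags": ["bigdata", "nosql"],
    "meta": {"likes": 12, "shared": True},
})
print(posts.find_one({"user": "asha"}))
```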
 Python
Python is a programming language which has become very popular in the Big Data space due to
its ability to work very well with large, unstructured datasets (see Part II for the difference
between structured and unstructured data). It is considered to be easier to learn for a data
science beginner than other languages such as R (see also Part II), and more flexible.
 R Programming
R is another programming language commonly used in Big Data, and can be thought of as more
specialised than Python, being geared towards statistics. Its strength lies in its powerful handling of
structured data. Like Python, it has an active community of users who are constantly expanding and
adding to its capabilities by creating new libraries and extensions.
Recommendation engine
A recommendation engine is basically an algorithm, or collection of algorithms, designed to
match an entity (for example, a customer) with something they are looking for.
Recommendation engines used by the likes of Netflix or Amazon heavily rely on Big Data
technology to gain an overview of their customers and, using predictive modelling, match them
with products to buy or content to consume. The economic incentives offered by
recommendation engines have been a driving force behind a lot of commercial Big Data
initiatives and developments over the last decade.
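A toy sketch of the matching idea, assuming a tiny made-up ratings matrix and cosine similarity between users (real engines are far more elaborate):

```python
import numpy as np

# Hypothetical user-item ratings (rows: users, columns: items; 0 = unrated).
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4]], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Recommend for user 0: find the most similar other user...
sims = [cosine(R[0], R[u]) for u in range(1, len(R))]
nearest = 1 + int(np.argmax(sims))

# ...and suggest items that user rated which user 0 has not rated yet.
suggestions = np.where((R[0] == 0) & (R[nearest] > 0))[0]
print("recommend item indices:", suggestions)  # -> [2]
```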
Real-time
Real-time means “as it happens” and in Big Data refers to a system or process which is able to
give data-driven insights based on what is happening at the present moment. Recent years
have seen a large push for the development of systems capable of processing and offering
insights in real-time (or near-real-time), and advances in computing power as well as
development of techniques such as machine learning have made it a reality in many
applications today.
Reporting
The crucial “last step” of many Big Data initiatives involves getting the right
information to the people who need it to make decisions, at the right time. When
this step is automated, analytics is applied to the insights themselves to ensure that
they are communicated in a way that they will be understood and easy to act on.
This will usually involve creating multiple reports based on the same data or insights
but each intended for a different audience (for example, in-depth technical analysis
for engineers, and an overview of the impact on the bottom line for C-level
executives).
Spark
Spark is another open source framework like Hadoop, but more recently developed
and more suited to handling cutting-edge Big Data tasks involving real-time analytics
and machine learning. Unlike Hadoop, it does not include its own filesystem, though
it is designed to work with Hadoop’s HDFS or a number of other options. However,
for certain data-related processes it is able to calculate at over 100 times the speed
of Hadoop, thanks to its in-memory processing capability. This means it is becoming
an increasingly popular choice for projects involving deep learning, neural networks
and other compute-intensive tasks.
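A minimal PySpark sketch in local mode (the data values are made up; a real job would typically read from HDFS or another store) illustrating the cached, in-memory style of computation described above:

```python
from pyspark.sql import SparkSession

# Start a local Spark session; real deployments run on a cluster.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 19), ("carol", 25)], ["name", "age"])

df.cache()                      # keep the dataset in memory across operations
df.filter(df.age > 21).show()   # this query runs against the cached copy

spark.stop()
```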
DIGITAL DATA:
Digital data can be classified into three forms: unstructured, semi-structured, and structured.
Today, data undoubtedly is an invaluable asset of any enterprise (big or small). Even
though professionals work with data all the time, the understanding, management and
analysis of data from heterogeneous sources remains a serious challenge.
• In this lecture, the various formats of digital data (structured, semi-structured and
unstructured data), data storage mechanism, data access methods, management of data,
the process of extracting desired information from data, challenges posed by various
formats of data, etc. will be explained.
• Data growth has seen exponential acceleration since the advent of the computer and
Internet.
TYPES OF DATA
Big data is divided into three different types: structured, unstructured, and semi-structured.
1. Structured:
Structured data is one of the types of big data. By structured data, we mean data that can
be processed, stored, and retrieved in a fixed format. It refers to highly organized information that
can be readily and seamlessly stored in, and accessed from, a database by simple search engine
algorithms. For instance, the employee table in a company database will be structured: the
employee details, job positions, salaries, etc., will be present in an organized manner.
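A minimal pandas sketch of such a structured employee table (the values are illustrative):

```python
import pandas as pd

# Fixed columns and types make the data easy to store, query, and retrieve.
employees = pd.DataFrame({
    "emp_id":   [101, 102, 103],
    "name":     ["Asha", "Ben", "Chen"],
    "position": ["Analyst", "Engineer", "Manager"],
    "salary":   [55000, 62000, 78000],
})

# A simple query over the fixed schema.
print(employees[employees["salary"] > 60000])
```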
TYPES OF DATA
2. Unstructured:
Unstructured data refers to data that lacks any specific form or structure whatsoever.
This makes it very difficult and time-consuming to process and analyze. Email is an example of
unstructured data. Structured and unstructured are two important types of big data.
Examples include memos, chat rooms, PowerPoint presentations, images, videos, letters, research
papers, white papers, the body of an email, etc.
3. Semi-structured:
Semi-structured data is the third type of big data. Semi-structured data pertains to data
containing both of the formats mentioned above, that is, structured and unstructured data. To be
precise, it refers to data that, although not classified under a particular repository (database),
contains vital information or tags that segregate individual elements within the data.
Examples include emails, XML, markup languages like HTML, etc. Metadata for this data is available
but is not sufficient.
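A short Python sketch of the idea: a semi-structured record (a hypothetical JSON document) carries self-describing tags that segregate its elements, yet no fixed schema is enforced:

```python
import json

# Tags name each element, but fields may vary freely from record to record.
record = json.loads("""
{
  "from": "alice@example.com",
  "subject": "Quarterly report",
  "attachments": [{"name": "q3.pdf", "size_kb": 212}]
}
""")
print(record["subject"], "-", record["attachments"][0]["name"])
```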
Characteristics of Unstructured Data
How to Store Unstructured Data?
UIMA
 UIMA (Unstructured Information Management Architecture) is an open-source platform from IBM which
integrates different kinds of analysis engines to provide a complete solution for knowledge discovery from
unstructured data.
 In UIMA, the analysis engines enable the integration and analysis of unstructured information and bridge the gap
between structured and unstructured data.
 UIMA stores information in a structured format. The structured resources can be mined, searched, and put to
other uses. The information obtained from structured sources is also used for subsequent analysis of
unstructured data.
 Various analysis engines analyze unstructured data in different ways, such as:
 – Breaking up documents into separate words.
 – Grouping and classifying according to taxonomy.
 – Detecting parts of speech, grammar, and synonyms.
 – Detecting events and times.
 – Detecting relationships between various elements.
 CAS (Content Addressable Storage): It stores data based on its metadata. It assigns a unique name to every
object stored in it.
Advantages of structured data (easy to work with)
• It is easy to work with structured data. The advantages are:
• Storage: Both defined and user-defined data types help with the storage of structured data.
• Scalability: Scalability is not generally an issue with an increase in data.
• Security: Ensuring security is easy.
• Update and Delete: Updating, deleting, etc. are easy due to the structured form.
• Transaction Properties: ACID (Atomicity, Consistency, Isolation, Durability) properties are supported.
TYPES OF DATA
4 Types of Data: Nominal, Ordinal, Discrete, Continuous
1. Nominal:
These are sets of values that don’t possess a natural ordering. Let’s understand this
with some examples. The color of a smartphone can be considered a nominal data type, as we
can’t compare one color with another.
It is not possible to state that ‘Red’ is greater than ‘Blue’. The gender of a person is
another example, where we can’t rank male, female, or others. Mobile phone category,
whether midrange, budget segment, or premium, is also a nominal data type.
TYPES OF DATA
2. Ordinal:
These types of values have a natural ordering while maintaining their class of
values. If we consider the sizes of a clothing brand, we can easily sort them according
to their name tag in the order small < medium < large. The grading system used while
marking candidates in a test can also be considered an ordinal data type, where A+ is
definitely better than a B grade.
These categories help us decide which encoding strategy can be applied to
which type of data, as the sketch below shows.
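A small pandas sketch of the encoding choices these categories suggest (the column values are made up): one-hot encoding for nominal data, ordered integer codes for ordinal data:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "Blue", "Red"],        # nominal: no natural order
    "size":  ["small", "large", "medium"],  # ordinal: small < medium < large
})

# Nominal -> one-hot encoding, which implies no ordering between values.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal -> integer codes that respect the natural order.
size_order = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
df["size_code"] = df["size"].astype(size_order).cat.codes

print(pd.concat([df, one_hot], axis=1))
```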
TYPES OF DATA
3. Discrete:
Numerical values that are integers or whole numbers are placed under this
category. The number of speakers in a phone, the number of cameras, the number of cores
in the processor, and the number of SIMs supported are some examples of the discrete
data type.
4. Continuous:
Fractional numbers are considered continuous values. These can take the form of
the operating frequency of the processors, the Android version of the phone, the Wi-Fi
frequency, the temperature of the cores, and so on.
What is Big Data?
According to Gartner, the definition of Big Data is:
“Big data is high-volume, high-velocity, and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and decision making.”
This definition clearly answers the “What is Big Data?” question – Big Data refers to complex and
large data sets that have to be processed and analyzed to uncover valuable information that can
benefit businesses and organizations.
Introduction to Big Data
Big data is a collection of massive and complex data sets, involving huge quantities of
data, data management capabilities, social media analytics, and real-time data.
Big data analytics is the process of examining large amounts of data. There exist large
amounts of heterogeneous digital data. Big data is about data volumes and large data sets measured
in terms of terabytes or petabytes. This phenomenon is called big data.
High volumes of data that traditional computing tools cannot process are being collected
daily. We refer to these high volumes of data as big data.
BIG DATA
The process of analyzing large volumes of diverse data sets using advanced analytic
techniques is referred to as Big Data Analytics.
These diverse data sets include structured, semi-structured, and unstructured data, from
different sources, and in different sizes from terabytes to zettabytes. We also reckon these as big
data.
Big Data is a term used for data sets whose size or type is beyond the capturing,
managing, and processing ability of traditional relational databases. The database required to
process big data should have the low latency that traditional databases don’t have.
Big data has one or more of the following characteristics: high volume, high velocity, and high
variety.
Classification of Analytics
Big data analytics is categorized into four subcategories:
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
Classification of Analytics
1. Descriptive Analytics:
Descriptive Analytics is considered a useful technique for
uncovering patterns within a certain segment of customers. It simplifies
data and summarizes past data into a readable form.
It provides insights into what has occurred in the past, along with
trends to dig into for more detail. This helps in creating reports on things like a
company’s revenue, profits, and sales.
Examples of descriptive analytics include summary statistics, clustering,
and the association rules used in market basket analysis.
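A minimal pandas sketch of descriptive analytics as summary statistics (the sales figures are made up):

```python
import pandas as pd

# Hypothetical past sales records.
sales = pd.DataFrame({
    "region":  ["N", "S", "N", "E", "S", "N"],
    "revenue": [120, 95, 130, 80, 101, 125],
})

# Summarize what has occurred: count, mean, std, quartiles.
print(sales["revenue"].describe())

# A simple readable report: past revenue by region.
print(sales.groupby("region")["revenue"].sum())
```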
Classification of Analytics
2. Diagnostic Analytics:
Diagnostic Analytics, as the name suggests, gives a diagnosis of a problem. It gives a
detailed and in-depth insight into the root cause of a problem. Data scientists turn to this type of
analytics when seeking the reason behind a particular event.
Techniques like drill-down, data mining, data discovery, churn reason analysis, and
customer health score analysis are all examples of diagnostic analytics. In business terms,
diagnostic analytics is useful when you are researching the reasons behind leading churn indicators
and usage trends among your most loyal customers.
Classification of Analytics
3. Predictive Analytics:
Predictive Analytics, as can be discerned from the name itself, is concerned with
predicting future incidents. These future incidents can be market trends, consumer trends,
and many other market-related events.
This type of analytics makes use of historical and present data to predict future
events. It is the most commonly used form of analytics among businesses.
Predictive analytics doesn’t only work for service providers but also for
consumers. It keeps track of our past activities and, based on them, predicts what we may
do next.
Classification of Analytics
4. Prescriptive Analytics:
Prescriptive analytics is the most valuable yet underused form of analytics. It is the next
step beyond predictive analytics. Prescriptive analysis explores several possible actions and
suggests actions depending on the results of descriptive and predictive analytics of a given dataset.
Prescriptive analytics is a combination of data and various business rules. The data for
prescriptive analytics can be both internal (organizational inputs) and external (social media
insights).
Examples of prescriptive analytics for customer retention are next-best-action and next-best-offer
analysis.
FOUR TYPES OF ANALYTICS
Introduction
The optimum utilization of data with analytics is helping organizations scale their business to the
next level. With data being the new currency, more and more companies are becoming
data-driven. Data analytics helps organizations understand their consumers, enhance their advertising
campaigns, personalize their content, and improve their products to meet the desired goal.
While raw data have immense potential, you cannot leverage data’s advantages without the
proper data analytics tools and types of analytics processes. As a Business or Data Analyst, you
need data analytics to maximize your efforts to grow a business and achieve its goals.
What Is Data Analytics?
Data Analytics refers to the process of analyzing datasets to draw out the insights they contain. Data Analytics
empowers Business Analysts to take raw data and reveal patterns to extract significant knowledge. Business
Analysts use Data Analytics techniques in their work to make smart business decisions. Using Data Analytics in
Business Analysis can help organizations better understand their consumers’ patterns and needs. Ultimately,
organizations can use various types of data analytics to boost business performance and improve their
products.
There are mainly 4 broad categories of analytics. These different types of analytics used by Business Analysts
empower them with insights that can help them improve business performance. Let’s take a detailed look at the
four types of analytics.
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
Descriptive Analytics
It is the most straightforward of the top categories of analytics. Descriptive analytics sifts
through raw data from various data sources to give meaningful insights into the past, i.e., it helps
you understand the impact of past actions. However, these discoveries can only signal whether
something is right or not, without explaining why. Therefore, Business Analysts don’t recommend
that highly data-driven organizations settle for descriptive analytics only; they’d preferably
combine it with other types of analytics.
It is a significant step in making raw data understandable to stakeholders, investors, and leaders.
This way, it becomes simple to recognize and address shortcomings that require attention. Data
aggregation and data mining are the two fundamental procedures in descriptive analytics. It is to be
noted that this technique is beneficial for understanding underlying behavior, not for making
estimations.
Example of Descriptive Analytics
Traffic and Engagement Reports – to analyze and understand website traffic and other
engagement metrics.
Financial Statement Analysis – Used to obtain a holistic view of the company’s financial health.
Diagnostic Analytics
Diagnostic Analytics is one of the 4 broad categories of analytics, utilized to determine
why something occurred in the past. It is characterized by techniques like drill-down,
data discovery, data mining, and correlations. Diagnostic Analytics investigates data
to comprehend the main drivers of events. It is useful in figuring out what
elements and events led to a specific outcome. It generally utilizes probabilities,
likelihoods, and the distribution of results for the analysis.
It gives comprehensive insights into a particular problem. At the same time, it requires
that an organization have detailed data available.
Examples Of Diagnostic Analytics
Examining Market Demand – Used to analyze market demands beforehand and
meet the supply accordingly.
Explaining Customer Behavior – Very helpful in understanding customer needs and
necessities and planning business operations accordingly.
Identifying Technology Issues – Utilized to run tests and identify technological issues.
Improving Company Culture – Ideally done by the HR department, where the necessary
employee data is collected to observe employee behavior.
Predictive Analytics
Predictive analytics is one of the four types of data analytics used by Business
Analysts; it determines what will probably occur. It utilizes the discoveries of
descriptive and diagnostic analytics to distinguish groups and exceptional cases
and to anticipate future patterns, making it an essential tool for forecasting.
One of the primary applications of predictive analytics is sentiment analysis. All
the opinions posted via online media are gathered and analyzed (existing text
data) to forecast the individual’s opinion on a specific subject as positive,
negative, or neutral (future prediction). Hence, predictive analytics comprises
designing and validating models that render precise predictions.
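As a toy illustration of the sentiment-analysis idea (a naive keyword score, standing in for the validated models a real system would use):

```python
# Naive keyword-based sentiment score; production systems train real models.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"bad", "hate", "poor"}

def sentiment(text: str) -> str:
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this phone, excellent camera"))  # -> positive
print(sentiment("battery life is bad"))                  # -> negative
```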
Examples Of Predictive Analytics
Finance: Forecasting Future Cash Flow – Used to predict and maintain the financial
need and health of the organization
Entertainment & Hospitality: Determining Staffing Needs – Used to fulfill the
staffing needs based on the influx and outflux of the customers.
Marketing: Behavioral Targeting – Leveraging the data obtained from consumer
behaviors for creating stronger marketing strategies.
Manufacturing: Preventing Malfunction – Used to predict a probable malfunction
or breakdown and avoid the same to save time and money.
Prescriptive Analytics
Predictive analytics is the basis for this type of data analytics used in Business
Analysis. Still, prescriptive analytics goes beyond the other three categories of analytics
mentioned above to recommend future solutions. It can recommend all favorable outcomes
per a predefined game plan and propose a different course of action to achieve a
specific result. Therefore, it utilizes a robust feedback system that continually
learns and updates the connection between actions and outcomes.
Prescriptive analytics utilizes emerging technologies and tools, such as
Machine Learning, Deep Learning, and Artificial Intelligence algorithms,
which makes it demanding to execute and oversee. Furthermore, this cutting-edge type of
data analytics requires internal as well as external past data to provide users
with favorable outcomes. That is why Business Analysts suggest weighing the needed
effort against the demanded added value before applying prescriptive analytics to
any business system.
Examples Of Prescriptive Analytics
Venture Capital: Investment Decisions – Often made on gut feeling, these decisions
can also be supported with the necessary algorithms.
Sales: Lead Scoring – Used to analyze and predict the probability of a lead resulting in a
successful conversion.
Content Curation: Algorithmic Recommendations – Used to predict and curate the
content needed to keep consumers engaged and interested.
Banking: Fraud Detection – Used to detect and flag fraudulent actions that might have
occurred in banking transactions.
Product Management: Development and Improvement – Here, the necessary data can be
collected and collated to derive the inputs needed regarding a product and its development and improvement.
Conclusion
Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, and Prescriptive Analytics
are the 4 types of analytics used by Business Analysts to unlock raw data’s potential in
order to improve business performance.

More Related Content

Similar to 1 UNIT-DSP.pptx

Big-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfBig-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfrajsharma159890
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3varshakumar21
 
INTRODUCTION TO BIG DATA AND HADOOP
INTRODUCTION TO BIG DATA AND HADOOPINTRODUCTION TO BIG DATA AND HADOOP
INTRODUCTION TO BIG DATA AND HADOOPDr Geetha Mohan
 
BigData Analytics_1.7
BigData Analytics_1.7BigData Analytics_1.7
BigData Analytics_1.7Rohit Mittal
 
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...Experfy
 
IRJET- Big Data: A Study
IRJET-  	  Big Data: A StudyIRJET-  	  Big Data: A Study
IRJET- Big Data: A StudyIRJET Journal
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviationranjit banshpal
 
IRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth EnhancementIRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth EnhancementIRJET Journal
 
Real World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining ToolsReal World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining Toolsijsrd.com
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applicationsSubrat Swain
 
L3 Big Data and Application.pptx
L3  Big Data and Application.pptxL3  Big Data and Application.pptx
L3 Big Data and Application.pptxShambhavi Vats
 
Data minig with Big data analysis
Data minig with Big data analysisData minig with Big data analysis
Data minig with Big data analysisPoonam Kshirsagar
 
An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.ijceronline
 
What Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfWhat Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfPridesys IT Ltd.
 
Big data (word file)
Big data  (word file)Big data  (word file)
Big data (word file)Shahbaz Anjam
 

Similar to 1 UNIT-DSP.pptx (20)

Research paper on big data and hadoop
Research paper on big data and hadoopResearch paper on big data and hadoop
Research paper on big data and hadoop
 
Big-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfBig-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdf
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3
 
INTRODUCTION TO BIG DATA AND HADOOP
INTRODUCTION TO BIG DATA AND HADOOPINTRODUCTION TO BIG DATA AND HADOOP
INTRODUCTION TO BIG DATA AND HADOOP
 
365 Data Science
365 Data Science365 Data Science
365 Data Science
 
BigData Analytics_1.7
BigData Analytics_1.7BigData Analytics_1.7
BigData Analytics_1.7
 
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
 
IRJET- Big Data: A Study
IRJET-  	  Big Data: A StudyIRJET-  	  Big Data: A Study
IRJET- Big Data: A Study
 
M.Florence Dayana
M.Florence DayanaM.Florence Dayana
M.Florence Dayana
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviation
 
IRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth EnhancementIRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth Enhancement
 
Real World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining ToolsReal World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining Tools
 
Data Science
Data ScienceData Science
Data Science
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applications
 
L3 Big Data and Application.pptx
L3  Big Data and Application.pptxL3  Big Data and Application.pptx
L3 Big Data and Application.pptx
 
Data minig with Big data analysis
Data minig with Big data analysisData minig with Big data analysis
Data minig with Big data analysis
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.An Comprehensive Study of Big Data Environment and its Challenges.
An Comprehensive Study of Big Data Environment and its Challenges.
 
What Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdfWhat Is Big Data How Big Data Works.pdf
What Is Big Data How Big Data Works.pdf
 
Big data (word file)
Big data  (word file)Big data  (word file)
Big data (word file)
 

Recently uploaded

SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 

Recently uploaded (20)

SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 

1 UNIT-DSP.pptx

  • 1. DATA SCIENCE IN BIG DATA U N I T - 1
  • 2. SYLLABUS UNIT I - INTRODUCTION TO DATASCIENCE AND BIG DATA Data Science - Fundamentals and Components – Data Scientist – Terminologies Used in Big Data Environments - Types of Digital Data - Classification of Digital Data - Introduction to Big Data - Characteristics of Data - Evolution of Big Data - Big Data Analytics - Classification of Analytics. UNIT II - DESCRIPTIVE ANALYTICS USING STATISTICS Types of Data – Mean, Median and Mode – Standard Deviation and Variance – Probability – Probability Density Function – Types of Data Distribution – Percentiles and Moments – Correlation and Covariance – Conditional Probability – Bayes Theorem – Introduction to Univariate, Bivariate and Multivariate Analysis. UNIT III - PREDICTIVE MODELING AND MACHINE LEARNING Linear Regression – Polynomial Regression – Multivariate Regression – Multi Level Models – Data warehousing overview – Bias / variance trade off – K Fold cross validation – Data Cleaning and Normalization – Cleaning web log Data – Normalizing numerical Data – Detecting Outliers – Introduction to Supervised and Unsupervised learning.
  • 3. SYLLABUS UNIT IV - DATAANALYTICAL FRAMEWORKS Introducing Hadoop: - Hadoop Overview - RDBMS versus Hadoop - HDFS (Hadoop Distributed File System): Components and block replication – Processing Data with Hadoop - Introduction to MapReduce – Features of MapReduce – Introduction to NoSQL: CAP theorem, MongoDB. UNIT V - DATA SCIENCE USING PYTHON Introduction to essential data science packages: NumPy, SciPy, Jupyter, Statsmodels and pandas Package – Introduction to Data Munging, Data pipeline and Machine learning in Python - Data visualization using matplotlib – Interactive visualization with advanced data learning representation in Python.
  • 4. Data Science – Definition Data Science is the science which uses computer science, statistics and machine learning, visualization and human-computer interactions to collect, clean, integrate, analyze, visualize, interact with data to create data products. Goal of Data Science - Turn data into data products. How is data science related to big data: It is a blend of the field of Computer Science, Business and Statistics together. Data Science is an area. Big Data is a technique to collect, maintain and process the huge information. It is about collection, processing, analyzing and utilizing of data into various operations.
  • 7. Data Science Process 1. Discovery: Discovery step involves acquiring data from all the identified internal & external sources . The data can be: Logs from webservers, Data gathered from social media, Census datasets Data streamed from online sources using APIs. 2. Preparation: Data can have lots of inconsistencies like missing value, blank columns, incorrect data format which needs to be cleaned. You need to process, explore, and condition data before modeling. The cleaner your data, the better are your predictions. 3. Model Planning: In this stage, you need to determine the method and technique to draw the relation between input variables. Planning for a model is performed by using different statistical formulas and visualization tools. SQL analysis services, R, and SAS/access are some of the tools used for this purpose.
  • 8. Data Science Process 4. Model Building: In this step, the actual model building process starts. Here, Data scientist distributes datasets for training and testing. Techniques like association, classification, and clustering are applied to the training data set. The model once prepared is tested against the “testing” dataset. 5. Operationalize: In this stage, you deliver the final baselined model with reports, code, and technical documents. Model is deployed into a real-time production environment after thorough testing. 6. Communicate Results In this stage, the key findings are communicated to all stakeholders. This helps you to decide if the results of the project are a success or a failure based on the inputs from the model.
  • 10. DATA SCIENCE COMPONENTS Statistics: Statistics is the most critical unit of Data Science basics. It is the method or of collecting and analyzing numerical data in large quantities to get useful insights. Visualization: Visualization technique helps you to access huge amounts of data in easy to understand and digestible visuals.
  • 11. Data Scientist: Role: A Data Scientist is a professional who manages enormous amounts of data to come up with compelling business visions by using various tools, techniques, methodologies, algorithms, etc. Languages: R, SAS, Python, SQL, Hive, Matlab, Pig, Spark. A data scientist’s work typically involves making sense of messy, unstructured data, from sources such as smart devices, social media feeds, and emails that don’t neatly fit into a database. Data scientists are analytical experts who utilize their skills in both technology and social science to find trends and manage data.
  • 12. Terminologies Used in Big Data Environment 5 V’s of Big Data: •Volume – a large amount of data. •Velocity – the speed of data processing. •Variety – large data diversity. •Veracity – verification of data. •Value – what big data can bring to the user.
  • 13. Terminologies Used in Big Data Environment VOLUME: The name ‘Big Data’ itself is related to a size which is enormous. Volume is a huge amount of data. To determine the value of data, size of data plays a very crucial role. If the volume of data is very large then it is actually considered as a ‘Big Data’. This means whether a particular data can actually be considered as a Big Data or not, is dependent upon the volume of data. Hence while dealing with Big Data it is necessary to consider a characteristic ‘Volume’.
  • 14. Terminologies Used in Big Data Environment VELOCITY: Velocity refers to the high speed of accumulation of data. In Big Data velocity data flows in from sources like machines, networks, social media, mobile phones etc. There is a massive and continuous flow of data. This determines the potential of data that how fast the data is generated and processed to meet the demands. Sampling data can help in dealing with the issue like ‘velocity’.
  • 15. Terminologies Used in Big Data Environment VARIETY: It refers to nature of data that is structured, semi-structured and unstructured data. It also refers to heterogeneous sources. Variety is basically the arrival of data from new sources that are both inside and outside of an enterprise. It can be structured, semi-structured and unstructured.
  • 16. Terminologies Used in Big Data Environment ◦ Structured data: This data is basically an organized data. It generally refers to data that has defined the length and format of data. ◦ Semi- Structured data: This data is basically a semi-organised data. It is generally a form of data that do not conform to the formal structure of data. Log files are the examples of this type of data. ◦ Unstructured data: This data basically refers to unorganized data. It generally refers to data that doesn’t fit neatly into the traditional row and column structure of the relational database. Texts, pictures, videos etc. are the examples of unstructured data which can’t be stored in the form of rows and columns.
  • 17. Terminologies Used in Big Data Environment Veracity: It refers to inconsistencies and uncertainty in data, that is data which is available can sometimes get messy and quality and accuracy are difficult to control. Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources. Example: Data in bulk could create confusion whereas less amount of data could convey half or Incomplete Information.
  • 18. Terminologies Used in Big Data Environment Value: After having the 4 V’s into account there comes one more V which stands for Value!. The bulk of Data having no Value is of no good to the company, unless you turn it into something useful. Data in itself is of no use or importance but it needs to be converted into something valuable to extract Information. Hence, you can state that Value! is the most important V of all the 5V’s.
  • 19. Terminologies Used In Big Data Environments ◦ As-a-service infrastructure Data-as-a-service, software-as-a-service, platform-as-a-service – all refer to the idea that rather than selling data, licences to use data, or platforms for running Big Data technology, it can be provided “as a service”, rather than as a product. This reduces the upfront capital investment necessary for customers to begin putting their data, or platforms, to work for them, as the provider bears all of the costs of setting up and hosting the infrastructure. As a customer, as-a- service infrastructure can greatly reduce the initial cost and setup time of getting Big Data initiatives up and running.
  • 20. Data science Data science is the professional field that deals with turning data into value such as new insights or predictive models. It brings together expertise from fields including statistics, mathematics, computer science, communication as well as domain expertise such as business knowledge. Data scientist has recently been voted the No 1 job in the U.S., based on current demand and salary and career opportunities. Data mining Data mining is the process of discovering insights from data. In terms of Big Data, because it is so large, this is generally done by computational methods in an automated way using methods such as decision trees, clustering analysis and, most recently, machine learning. This can be thought of as using the brute mathematical power of computers to spot patterns in data which would not be visible to the human eye due to the complexity of the dataset.
  • 21. Hadoop Hadoop is a framework for Big Data computing which has been released into the public domain as open source software, and so can freely be used by anyone. It consists of a number of modules all tailored for a different vital step of the Big Data process – from file storage (Hadoop File System HDFS) to database (HBase) to carrying out data operations (Hadoop MapReduce – see below). It has become so popular due to its power and flexibility that it has developed its own industry of retailers (selling tailored versions), support service providers and consultants. Predictive modelling At its simplest, this is predicting what will happen next based on data about what has happened previously. In the Big Data age, because there is more data around than ever before, predictions are becoming more and more accurate. Predictive modelling is a core component of most Big Data initiatives, which are formulated to help us choose the course of action which will lead to the most desirable outcome. The speed of modern computers and the volume of data available means that predictions can be made based on a huge number of variables, allowing an ever-increasing number of variables to be assessed for the probability that it will lead to success.
  • 22. MapReduce MapReduce is a computing procedure for working with large datasets, devised because of the difficulty of reading and analysing really Big Data using conventional computing methodologies. As its name suggests, it consists of two procedures – mapping (sorting information into the format needed for analysis, e.g. sorting a list of people according to their age) and reducing (performing an operation, such as checking the age of everyone in the dataset to see who is over 21). NoSQL NoSQL refers to a family of database formats designed to hold more than just data arranged into tables, rows and columns, as in a conventional relational database. This database format has proven very popular in Big Data applications because Big Data is often messy, unstructured and does not easily fit into traditional database frameworks.
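  The mapping and reducing procedures above can be imitated in a few lines of plain Python. This is only a single-machine sketch of the idea, not Hadoop's actual API, and the people listed are invented:

  from collections import defaultdict

  people = [("alice", 34), ("bob", 19), ("carol", 25), ("dave", 17)]

  # Map: emit (key, value) pairs in the shape the analysis needs --
  # here, a flag recording whether each person is over 21.
  mapped = [(1 if age > 21 else 0, name) for name, age in people]

  # Shuffle/sort: group the intermediate pairs by key.
  groups = defaultdict(list)
  for flag, name in mapped:
      groups[flag].append(name)

  # Reduce: collapse each group into a result.
  print(groups[1])  # ['alice', 'carol'] -- everyone over 21

  In real MapReduce the map and reduce steps run in parallel across many machines, with the framework handling the grouping in between.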
  • 23.  Python Python is a programming language which has become very popular in the Big Data space due to its ability to work very well with large, unstructured datasets (see Part II for the difference between structured and unstructured data). It is considered to be easier to learn for a data science beginner than other languages such as R (see also Part II) and more flexible.  R Programming R is another programming language commonly used in Big Data, and can be thought of as more specialised than Python, being geared towards statistics. Its strength lies in its powerful handling of structured data. Like Python, it has an active community of users who are constantly expanding and adding to its capabilities by creating new libraries and extensions.
  • 24. Recommendation engine A recommendation engine is basically an algorithm, or collection of algorithms, designed to match an entity (for example, a customer) with something they are looking for. Recommendation engines used by the likes of Netflix or Amazon rely heavily on Big Data technology to gain an overview of their customers and, using predictive modelling, match them with products to buy or content to consume. The economic incentives offered by recommendation engines have been a driving force behind a lot of commercial Big Data initiatives and developments over the last decade.
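  As an illustration of the matching idea, here is a toy content-based sketch: recommend the catalogue item whose feature vector is most similar (by cosine similarity) to a customer's taste. NumPy is assumed, and the films, features and scores are all invented; production engines at the scale of Netflix or Amazon are far more elaborate:

  import numpy as np

  def cosine(a, b):
      return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

  customer_taste = np.array([5, 1, 4])  # e.g. action, romance, sci-fi scores
  catalogue = {
      "Film A": np.array([5, 0, 5]),
      "Film B": np.array([1, 5, 0]),
      "Film C": np.array([4, 2, 4]),
  }

  # Recommend the item whose feature vector best matches the customer.
  best = max(catalogue, key=lambda title: cosine(customer_taste, catalogue[title]))
  print(best)  # Film A -- the closest match to this customer's taste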
  • 25. Reporting The crucial “last step” of many Big Data initiatives involves getting the right information to the people who need it to make decisions, at the right time. When this step is automated, analytics is applied to the insights themselves to ensure that they are communicated in a way that will be understood and easy to act on. This usually involves creating multiple reports based on the same data or insights, each intended for a different audience (for example, in-depth technical analysis for engineers, and an overview of the impact on the bottom line for C-level executives). Spark Spark is another open-source framework like Hadoop, but more recently developed and more suited to cutting-edge Big Data tasks involving real-time analytics and machine learning. Unlike Hadoop it does not include its own filesystem, though it is designed to work with Hadoop’s HDFS or a number of other options. For certain data-related processes, however, it can calculate at over 100 times the speed of Hadoop, thanks to its in-memory processing capability. This means it is becoming an increasingly popular choice for projects involving deep learning, neural networks and other compute-intensive tasks.
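  Since Unit V of this course uses Python, here is what a MapReduce-style word count looks like in Spark's Python API. It assumes the pyspark package is installed and runs locally; on a cluster the same code would be distributed:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()

  lines = spark.sparkContext.parallelize(["big data", "big ideas"])
  counts = (lines.flatMap(lambda line: line.split())  # map to words
                 .map(lambda word: (word, 1))         # emit (word, 1) pairs
                 .reduceByKey(lambda a, b: a + b))    # reduce per word
  print(counts.collect())  # e.g. [('big', 2), ('data', 1), ('ideas', 1)]
  spark.stop()

  Because intermediate results stay in memory rather than being written to disk between steps, chains of such operations are where Spark's speed advantage over Hadoop MapReduce comes from.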
  • 26. Digital data can be classified into three forms: – unstructured – semi-structured – structured
  • 27. DIGITAL DATA: Today, data undoubtedly is an invaluable asset of any enterprise (big or small). Even though professionals work with data all the time, the understanding, management and analysis of data from heterogeneous sources remains a serious challenge. • In this lecture, the various formats of digital data (structured, semi-structured and unstructured data), data storage mechanism, data access methods, management of data, the process of extracting desired information from data, challenges posed by various formats of data, etc. will be explained. • Data growth has seen exponential acceleration since the advent of the computer and Internet.
  • 28. TYPES OF DATA Digital data is divided into three different types: Structured, Unstructured, Semi-structured. 1. Structured: Structured data is one of the types of big data. By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily and seamlessly stored in, and accessed from, a database by simple search-engine algorithms. For instance, the employee table in a company database will be structured, as the employee details, job positions, salaries, etc., will be present in an organized manner.
  • 29. TYPES OF DATA 2. Unstructured: Unstructured data refers to data that lacks any specific form or structure whatsoever. This makes it very difficult and time-consuming to process and analyze. Structured and unstructured are two important types of big data. Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research papers, white papers, the body of an email, etc. 3. Semi-structured: Semi-structured is the third type of big data. Semi-structured data contains elements of both formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although not classified under a particular repository (database), still contains vital information or tags that segregate individual elements within the data. Examples: emails, XML, markup languages like HTML, etc. Metadata for this data is available but is not sufficient.
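  A short standard-library sketch of the contrast between the first and third categories, using invented records: structured data fits a fixed schema, while semi-structured data carries its own tags but no fixed schema:

  import csv, io, json

  # Structured: fixed columns, trivially loaded into rows.
  table = io.StringIO("id,name,salary\n1,Asha,52000\n2,Ravi,61000\n")
  rows = list(csv.DictReader(table))
  print(rows[0]["name"])  # Asha -- every row has the same fields

  # Semi-structured: tagged fields, but records may differ in shape.
  record = json.loads('{"id": 3, "name": "Mei", "skills": ["SQL", "R"]}')
  print(record.get("skills", []))  # tags segregate individual elements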
  • 31. How to Store Unstructured Data?
  • 32. UIMA  UIMA (Unstructured Information Management Architecture) is an open-source platform from IBM which integrates different kinds of analysis engines to provide a complete solution for knowledge discovery from unstructured data.  In UIMA, the analysis engines integrate and analyze unstructured information, and bridge the gap between structured and unstructured data.  UIMA stores information in a structured format. The structured resources can be mined, searched, and put to other uses. The information obtained from structured sources is also used for subsequent analysis of unstructured data.  Various analysis engines analyze unstructured data in different ways, such as:  – Breaking up documents into separate words.  – Grouping and classifying according to a taxonomy.  – Detecting parts of speech, grammar, and synonyms.  – Detecting events and times.  – Detecting relationships between various elements.  CAS (Content Addressable Storage): It stores data based on its metadata and assigns a unique name to every object stored in it.
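  UIMA itself is a Java framework, so the following is only an illustrative Python sketch (standard library only, invented sentence) of the first analysis-engine step listed above: breaking a document into separate words and giving the result a structured, queryable form:

  import re
  from collections import Counter

  document = "Unstructured text hides structure. Text analysis finds it."
  words = re.findall(r"[a-z]+", document.lower())  # break into words
  frequencies = Counter(words)                     # a structured view
  print(frequencies.most_common(2))  # e.g. [('text', 2), ('unstructured', 1)]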
  • 33. Advantages of structured data (easy to work with) • It is easy to work with structured data. The advantages are: • Storage: Both defined and user-defined data types help with the storage of structured data. • Scalability: Scalability is generally not an issue as data volumes increase. • Security: Ensuring security is easy. • Update and Delete: Updating, deleting, etc. are easy due to the structured form. • Transaction Properties: ACID properties are supported.
  • 34. TYPES OF DATA 4 Types of Data: Nominal, Ordinal, Discrete, Continuous 1. Nominal: These are sets of values that don’t possess a natural ordering. Let’s understand this with some examples. The color of a smartphone can be considered nominal data, as we can’t compare one color with another; it is not possible to state that ‘Red’ is greater than ‘Blue’. The gender of a person is another example, where we can’t rank male, female, or others. Mobile phone categories – midrange, budget segment, or premium – are also a nominal data type.
  • 35. TYPES OF DATA 2. Ordinal: These types of values have a natural ordering while maintaining their class of values. If we consider the sizes of a clothing brand, we can easily sort them according to their name tags in the order small < medium < large. The grading system used when marking candidates in a test can also be considered an ordinal data type, where A+ is definitely better than a B grade. These categories help us decide which encoding strategy can be applied to which type of data (see the sketch after the next slide).
  • 36. TYPES OF DATA 3. Discrete: Numerical values that are integers or whole numbers are placed under this category. The number of speakers in a phone, the number of cameras, the number of cores in the processor and the number of SIMs supported are all examples of the discrete data type. 4. Continuous: Fractional numbers are considered continuous values. These can take the form of the operating frequency of the processor, the Android version of the phone, the Wi-Fi frequency, the temperature of the cores, and so on.
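  As a brief illustration of why ordinal values deserve an order-aware encoding while nominal values do not, here is a sketch assuming pandas is installed; the sizes and colours are invented:

  import pandas as pd

  sizes = pd.Categorical(["medium", "small", "large", "small"],
                         categories=["small", "medium", "large"],
                         ordered=True)
  print(sizes.codes)      # [1 0 2 0]: integer codes respecting small < medium < large
  print(sizes > "small")  # [ True False  True False]: comparison uses category order

  colours = pd.Categorical(["red", "blue", "red"])  # nominal: unordered
  # colours > "blue" would raise a TypeError, because no order is
  # defined -- exactly the defining property of nominal data.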
  • 37. What is Big Data? According to Gartner, the definition of Big Data – “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” This definition clearly answers the “What is Big Data?” question – Big Data refers to complex and large data sets that have to be processed and analyzed to uncover valuable information that can benefit businesses and organizations.
  • 38. Introduction to Big Data Big data is a collection of massive and complex data sets, involving huge quantities of data, data-management capabilities, social media analytics and real-time data. Big data analytics is the process of examining such large amounts of heterogeneous digital data. Big data is about data volume: large data sets measured in terms of terabytes or petabytes. High volumes of data that traditional computing tools cannot process are being collected daily; we refer to these high volumes of data as big data.
  • 39. BIG DATA The process of analyzing large volumes of diverse data sets using advanced analytic techniques is referred to as Big Data Analytics. These diverse data sets include structured, semi-structured, and unstructured data, from different sources, and in different sizes from terabytes to zettabytes; we also reckon them as big data. Big Data is a term used for data sets whose size or type is beyond the capturing, managing, and processing ability of traditional relational databases. The database required to process big data should have the low latency that traditional databases don’t have. Big data has one or more of the characteristics of high volume, high velocity, and high variety.
  • 40. Classification of Analytics Big data analytics is categorized into four subcategories: Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, Prescriptive Analytics
  • 41. Classification of Analytics 1. Descriptive Analytics: Descriptive Analytics is considered a useful technique for uncovering patterns within a certain segment of customers. It simplifies data and summarizes past data into a readable form. It provides insights into what has occurred in the past, along with trends to dig into for more detail. This helps in creating reports on a company’s revenue, profits, sales, and so on. Examples of descriptive analytics include summary statistics, clustering, and the association rules used in market basket analysis.
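  A one-call example of the "summarize past data into a readable form" step, assuming pandas and using invented revenue figures:

  import pandas as pd

  sales = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr"],
                        "revenue": [120, 135, 128, 150]})
  # Summary statistics: count, mean, std, min, quartiles, max.
  print(sales["revenue"].describe())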
  • 42. Classification of Analytics 2. Diagnostic Analytics: Diagnostic Analytics, as the name suggests, gives a diagnosis of a problem. It gives a detailed and in-depth insight into the root cause of a problem. Data scientists turn to this analytics when they need the reason behind a particular happening. Techniques like drill-down, data mining, data discovery, churn-reason analysis, and customer health score analysis are all examples of diagnostic analytics. In business terms, diagnostic analytics is useful when you are researching the reasons behind leading churn indicators and usage trends among your most loyal customers.
  • 43. Classification of Analytics 3. Predictive Analytics: Predictive Analytics, as can be discerned from the name itself, is concerned with predicting future incidents. These future incidents can be market trends, consumer trends, and many such market-related events. This type of analytics makes use of historical and present data to predict future events. This is the most commonly used form of analytics among businesses. Predictive analytics doesn’t only work for the service providers but also for the consumers. It keeps track of our past activities and based on them, predicts what we may do next.
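  A minimal predictive sketch in the spirit of this slide: fit a model on historical figures, then forecast the next period. scikit-learn, NumPy and the numbers are all assumptions, not part of the slide:

  import numpy as np
  from sklearn.linear_model import LinearRegression

  months = np.array([[1], [2], [3], [4]])   # past periods
  revenue = np.array([120, 135, 128, 150])  # what happened in them

  model = LinearRegression().fit(months, revenue)  # learn from history
  print(model.predict(np.array([[5]])))            # forecast month 5 (~154)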
  • 44. Classification of Analytics 4. Prescriptive Analytics: Prescriptive analytics is the most valuable yet underused form of analytics. It is the next step after predictive analytics. Prescriptive analysis explores several possible actions and suggests actions depending on the results of the descriptive and predictive analytics of a given dataset. Prescriptive analytics is a combination of data and various business rules. The data for prescriptive analytics can be both internal (organizational inputs) and external (social media insights). Examples of prescriptive analytics for customer retention are next-best-action and next-best-offer analysis.
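  A toy sketch of the "data plus business rules" combination described above, applied to next-best-action for retention. The churn probability is assumed to come from a predictive model, and the thresholds and actions are invented:

  def next_best_action(churn_probability, customer_value):
      # Business rules layered on top of a model's prediction.
      if churn_probability > 0.7 and customer_value == "high":
          return "assign account manager + retention offer"
      if churn_probability > 0.7:
          return "send discount offer"
      if churn_probability > 0.4:
          return "send re-engagement email"
      return "no action"

  print(next_best_action(0.82, "high"))  # assign account manager + retention offer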
  • 45. FOUR TYPES OF ANALYTICS
  • 48. Introduction The optimum utilization of data with analytics is helping organizations scale their business to the next level. With data being the new currency, more and more companies are becoming data- driven. Data analytics help organizations understand their consumers, enhance their advertising campaigns, personalize their content, and improve their products to meet the desired goal. While raw data have immense potential, you cannot leverage data’s advantages without the proper data analytics tools and types of analytics processes. As a Business or Data Analyst, you need data analytics to maximize your efforts to grow a business and achieve its goals.
  • 49. What Is Data Analytics? Data Analytics refers to the process of analyzing datasets to draw out the insights they contain. Data Analytics empowers Business Analysts to take raw data and reveal patterns to extract significant knowledge. Business Analysts use Data Analytics techniques in their work to make smart business decisions. Using Data Analytics in Business Analysis can help organizations better understand their consumers’ patterns and needs. Ultimately, organizations can use various types of data analytics to boost business performance and improve their products. There are mainly 4 broad categories of analytics. These different types of analytics used by Business Analysts empower them with insights that can help them improve business performance. Let’s take a detailed look at the four types of analytics: Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, Prescriptive Analytics
  • 50. Descriptive Analytics It is the most straightforward of the four categories of analytics. Descriptive analytics summarizes raw data from various data sources to give meaningful insights into the past, i.e., it helps you understand the impact of past actions. However, these findings can only signal whether something went right or wrong, without explaining why. Therefore, Business Analysts don’t recommend that highly data-driven organizations settle for descriptive analytics alone; they’d preferably combine it with other types of analytics.
  • 51. It is a significant step in making raw data understandable to stakeholders, investors, and leaders. This way, it becomes simple to recognize and address shortcomings that require attention. Data aggregation and data mining are the two fundamental procedures in descriptive analytics. It is to be noted that this technique is beneficial for understanding underlying behavior, not for making estimations.
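  Data aggregation, the first of the two procedures just mentioned, sketched with pandas on invented transactions: raw rows are rolled up into per-region figures a stakeholder can read at a glance:

  import pandas as pd

  txns = pd.DataFrame({"region": ["North", "South", "North", "South"],
                       "amount": [250, 310, 125, 90]})
  # Aggregate the raw rows into a readable summary per region.
  summary = txns.groupby("region")["amount"].agg(["sum", "mean", "count"])
  print(summary)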
  • 52. Example of Descriptive Analytics Traffic and Engagement Reports – to analyze and understand website traffic and other engagement metrics. Financial Statement Analysis – Used to obtain a holistic view of the company’s financial health.
  • 53. Diagnostic Analytics Diagnostic Analytics is one of the 4 broad categories of analytics, utilized to determine why something occurred in the past. It is characterized by techniques like drill-down, data discovery, data mining, and correlations. Diagnostic Analytics investigates data to comprehend the main drivers of events. It is useful in figuring out what elements and events led to a specific outcome, and it generally utilizes probabilities, likelihoods, and the distribution of results in the analysis. It gives comprehensive insights into a particular problem; at the same time, the organization must have detailed data available to it.
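  Correlation, one of the diagnostic techniques named above, in a small pandas sketch with invented figures; a spreadsheet-sized example of asking why an outcome moved:

  import pandas as pd

  df = pd.DataFrame({"ad_spend":    [10, 15, 12, 20, 18],
                     "site_visits": [800, 1150, 900, 1500, 1400],
                     "churn_rate":  [5.1, 5.0, 5.2, 4.8, 4.9]})
  # Pairwise correlations: spend moves with visits and against churn,
  # a starting point (not proof) for a root-cause investigation.
  print(df.corr())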
  • 54. Examples Of Diagnostic Analytics Examining Market Demand – Used to analyze market demands beforehand and meet the supply accordingly. Explaining Customer Behavior – Very helpful in understanding customer needs and necessities and planning business operations accordingly Identifying Technology Issues – Utilized to run tests and identify technological issues Improving Company Culture – Ideally done by the HR department, the necessary employee data is collected to observe employee behavior.
  • 55. Predictive Analytics Predictive analytics is one of the four types of data analytics used by Business Analysts that determine what will probably occur. It utilizes the discoveries of descriptive and diagnostic analytics to distinguish groups and exceptional cases and anticipate future patterns, making it an essential tool for forecasting. One of the primary applications of predictive analytics is sentiment analysis. All the opinions posted via online media are gathered and analyzed (existing text data) to forecast the individual’s opinion on a specific subject as positive, negative, or neutral (future prediction). Hence, predictive analytics comprises designing and validating models that render precise predictions.
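  A deliberately naive sentiment sketch in plain Python. Real sentiment analysis uses trained models rather than keyword lists, but the input/output shape is the one described above: text in, positive/negative/neutral out. The word lists are invented:

  POSITIVE = {"great", "love", "excellent"}
  NEGATIVE = {"poor", "hate", "broken"}

  def sentiment(text):
      words = set(text.lower().split())
      score = len(words & POSITIVE) - len(words & NEGATIVE)
      return "positive" if score > 0 else "negative" if score < 0 else "neutral"

  print(sentiment("love this phone and its excellent battery"))  # positive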
  • 56. Examples Of Predictive Analytics Finance: Forecasting Future Cash Flow – Used to predict and maintain the financial need and health of the organization Entertainment & Hospitality: Determining Staffing Needs – Used to fulfill the staffing needs based on the influx and outflux of the customers. Marketing: Behavioral Targeting – Leveraging the data obtained from consumer behaviors for creating stronger marketing strategies. Manufacturing: Preventing Malfunction – Used to predict a probable malfunction or breakdown and avoid the same to save time and money.
  • 57. Prescriptive Analytics Prescriptive analytics builds on predictive analytics, but goes past the other three categories of analytics mentioned above to recommend future solutions. It can recommend all favorable outcomes per a predefined game plan, and propose different courses of action to achieve a specific result. Therefore, it utilizes a robust feedback system that continually learns and updates the connection between actions and outcomes.
  • 58. Prescriptive analytics utilizes emerging technologies and tools, such as Machine Learning, Deep Learning, and Artificial Intelligence algorithms, which makes it demanding to implement and oversee. Furthermore, this cutting-edge type of data analytics requires both internal (organizational) and external historical data to provide users with favorable outcomes. That is why Business Analysts suggest weighing the effort needed against the value added before implementing prescriptive analytics in any business system.
  • 59. Examples Of Prescriptive Analytics Venture Capital: Investment Decisions – Often made by gut feeling, these decisions can sometimes also be supported with the necessary algorithms. Sales: Lead Scoring – Used to analyze and predict the probability of a lead resulting in a successful conversion. Content Curation: Algorithmic Recommendations – Used to decide what content to create to keep consumers engaged and interested. Banking: Fraud Detection – Used to detect and flag fraudulent actions that might have occurred in banking transactions. Product Management: Development and Improvement – Here, the necessary data can be collected and collated to derive the inputs needed regarding a product and its development.
  • 60. Conclusion Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, and Prescriptive Analytics are the 4 types of analytics used by Business Analysts to unlock raw data’s potential in order to improve business performance. If you’re someone who loves to play with data and wants to build a successful career in Business Analytics, check out our Integrated Program In Business Analytics (IPBA) in collaboration with IIM Indore. It is a 10-month-long online Future Leaders Program aimed at senior executives and mid-career professionals to help them give their careers a significant boost.