About
Evolution of Data; Data Science; Business Analytics; Applications; AI, ML, DL and Data Science relationship; Tools for Data Science; life cycle of Data Science with a case study; algorithms for Data Science; Data Science research areas; future of Data Science.
How to Implement Data Governance Best Practice (DATAVERSITY)
Data Governance Best Practice is defined as the basis and guidelines for suggested governing activities. Organizations define best practices to be used as a point of comparison when determining their readiness, willingness, and the actions necessary to put a Data Governance program in place. But what are the best practices, and how can they be implemented? This webinar will address these questions and more.
In this RWDG webinar, Bob Seiner will talk about how to create, validate, assess, and implement Data Governance Best Practice with immediate impact on present and future Data Governance activities. The result of a Best Practice assessment is a thorough, actionable plan focused on demonstrating value from your Data Governance program. This webinar will cover:
• Two Criteria for Data Governance Best Practice Development
• How to Assess against Best Practice to Build Program Success
• Examples of Industry Selected DG Best Practice
• How to Communicate DG Best Practice in a Non-Threatening Way
• How to Build DG Best Practice into Daily Operations
Data Science Training | Data Science For Beginners | Data Science With Python... (Simplilearn)
This Data Science presentation will help you understand what Data Science is, who a Data Scientist is, what a Data Scientist does, and how Python is used for Data Science. Data science is an interdisciplinary field of scientific methods, processes, algorithms, and systems used to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining. This Data Science tutorial will help you build your skills in analytical techniques using Python. With this Data Science video, you’ll learn the essential concepts of Data Science with Python programming and also understand how data acquisition, data preparation, data mining, model building and testing, and data visualization are done. This Data Science tutorial is ideal for beginners who aspire to become Data Scientists.
This Data Science presentation will cover the following topics:
1. What is Data Science?
2. Who is a Data Scientist?
3. What does a Data Scientist do?
This Data Science with Python course will establish your mastery of data science and analytics techniques using Python. With this Python for Data Science Course, you’ll learn the essential concepts of Python programming and become an expert in data analytics, machine learning, data visualization, web scraping and natural language processing. Python is a required skill for many data science positions, so jumpstart your career with this interactive, hands-on course.
Why learn Data Science?
Data Scientists are being deployed in all kinds of industries, creating a huge demand for skilled professionals. Data scientist is the pinnacle rank in an analytics organization. Glassdoor ranked data scientist first in its 25 Best Jobs for 2016, and good data scientists are scarce and in great demand. As a data scientist you will be required to understand the business problem, design the analysis, collect and format the required data, apply algorithms or techniques using the correct tools, and finally make recommendations backed by data.
You can gain in-depth knowledge of Data Science by taking our Data Science with Python certification training course. With Simplilearn’s Data Science certification training course, you will prepare for a career as a Data Scientist as you master all the concepts and techniques. Those who complete the course will be able to:
1. Gain an in-depth understanding of data science processes: data wrangling, data exploration, data visualization, hypothesis building, and testing. You will also learn the basics of statistics.
2. Install the required Python environment and other auxiliary tools and libraries.
3. Understand the essential concepts of Python programming, such as data types, tuples, lists, dicts, basic operators, and functions.
4. Perform high-level mathematical computing using the NumPy package and its large library of mathematical functions.
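As a small illustration of the kind of NumPy computing the course outline mentions, the following sketch (using hypothetical sales figures, not course material) shows vectorized operations replacing explicit Python loops:

```python
import numpy as np

# Monthly sales figures (hypothetical data, for illustration only)
sales = np.array([120.0, 135.5, 98.2, 150.0, 142.3, 160.1])

# Vectorized aggregation: no explicit loop needed
total = sales.sum()
mean = sales.mean()

# Element-wise arithmetic: percent change month over month
growth = np.diff(sales) / sales[:-1] * 100

print(f"total={total:.1f}, mean={mean:.2f}")
print("growth %:", np.round(growth, 1))
```

The same computations written with plain Python lists would need loops and accumulator variables; with NumPy each statement operates on the whole array at once.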
Learn more at: https://www.simplilearn.com
While mathematicians have used graph theory since the 18th century to solve problems, the software patterns for graph data are new to most developers. To enable "mass adoption" of graph technology, we need to establish the right abstractions, access APIs, and data models.
RDF triples, while of paramount importance in establishing RDF graph semantics, are a low-level abstraction, much like using assembly language. For practical and productive “graph programming” we need something different.
Similarly, existing declarative graph query languages (such as SPARQL and Cypher) are not always the best way to access graph data, and sometimes you need a simpler interface (e.g., GraphQL), or even a different approach altogether (e.g., imperative traversals such as with Gremlin).
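To make the "triples are like assembly language" point concrete, here is a minimal pure-Python triple store (an illustrative sketch only, not any real RDF library or the speaker's code): raw pattern matching over triples is verbose, and an imperative, Gremlin-style traversal helper is one possible higher-level abstraction on top of it.

```python
# Minimal in-memory triple store; None acts as a wildcard in match patterns.
class TripleStore:
    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        # Low-level access: every query is spelled out triple by triple.
        return [
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ]

    def traverse(self, start, *predicates):
        # Imperative hop along a chain of predicates, Gremlin-style.
        frontier = {start}
        for pred in predicates:
            frontier = {o for node in frontier
                        for _, _, o in self.match(s=node, p=pred)}
        return frontier

store = TripleStore()
store.add("alice", "knows", "bob")
store.add("bob", "knows", "carol")
store.add("bob", "worksAt", "acme")

# "Who do Alice's acquaintances know?" -- one call with the traversal helper
print(store.traverse("alice", "knows", "knows"))  # {'carol'}
```

The same question phrased directly against `match` would require nested loops over intermediate results, which is exactly the productivity gap the talk describes.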
Ora Lassila is a Principal Graph Technologist in the Amazon Neptune graph database group. He has long experience with graphs, graph databases, ontologies, and knowledge representation. He was a co-author of the original RDF specification, as well as of the seminal article on the Semantic Web.
A 3-day examination-preparation course, including a live sitting of the examinations, for students who wish to attain the DAMA Certified Data Management Professional (CDMP) qualification.
chris.bradley@dmadvisors.co.uk
Data Modelling 101 half day workshop presented by Chris Bradley at the Enterprise Data and Business Intelligence conference London on November 3rd 2014.
Chris Bradley is a leading independent information strategist.
Contact chris.bradley@dmadvisors.co.uk
Targeted toward the health and human services communities, this presentation covers the importance of a data-driven culture, how to identify areas where data can be used to innovate, and how to recognize the operational processes you must have in place to fully utilize your data.
RWDG Slides: A Complete Set of Data Governance Roles & Responsibilities (DATAVERSITY)
Roles and responsibilities are the backbone of a successful Data Governance program. The way you define and utilize the roles will be the biggest factor in program success. From Data Stewards to the steering committee and everyone in between, people will need to understand the role they play, why they are in the role, and how the role fits in with their existing job.
Join Bob Seiner for this RWDG webinar where he will provide a complete and detailed set of Data Governance roles and responsibilities. Bob will share an Operating Model of Roles and Responsibilities that can be customized to address the specific needs of your organization.
In this webinar, Bob will discuss:
- Executive, Strategic, Tactical, Operational, and Support Level Roles
- How to customize an Operating Model to fit your organization
- Detailed Responsibilities for each level
- Defining who participates at each level
- Using working teams to implement tactical solutions
John Easton, Director of Product Management & Strategic Relations at Maximizer, and Craig Vivier from Vineyardsoft Corporation provide an overview of how to transform your business into a data-driven organization.
A data-driven organization is one in which critical business data automatically drives the decisions and actions of the business. It is about giving voice to your data, with the goal of moving away from wading through volumes of reports or making decisions on gut feel.
Learn to Use Databricks for the Full ML Lifecycle (Databricks)
Machine learning development brings many new complexities beyond the traditional software development lifecycle. Unlike traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. In this talk, learn how to operationalize ML across the full lifecycle with Databricks Machine Learning.
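The tracking problem this talk describes can be illustrated with a toy run logger (a pure-Python sketch of the concept only; Databricks itself builds on MLflow, whose actual API differs): each trial of an algorithm records its parameters and metrics, so the best result can be found and reproduced later.

```python
import json
import time

class RunTracker:
    """Toy experiment tracker: records params and metrics per run
    so results can be compared and reproduced later."""
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        self.runs.append({"time": time.time(),
                          "params": params,
                          "metrics": metrics})

    def best_run(self, metric, maximize=True):
        # Pick the run with the best value of the given metric
        sign = 1 if maximize else -1
        return max(self.runs, key=lambda r: sign * r["metrics"][metric])

tracker = RunTracker()
# Hypothetical results from trying two algorithms on the same data
tracker.log_run({"algo": "logreg", "C": 1.0}, {"accuracy": 0.87})
tracker.log_run({"algo": "random_forest", "n_trees": 200}, {"accuracy": 0.91})

best = tracker.best_run("accuracy")
print(json.dumps(best["params"]))  # the settings needed to reproduce the winner
```

A real tracking system adds artifact storage, code versions, and environment capture on top of this core idea, which is why dedicated tooling exists.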
Got data?… now what? An introduction to modern data platforms (JamesAnderson599331)
What are Data Analytics Platforms? What decision points are necessary in creating a modern, unified analytics data platform? What benefits are there to building your analytics data platform on Google Cloud Platform? Susan Pierce walks us through it all.
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service that brings together enterprise data warehousing and Big Data analytics in a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots, so you can see exactly how it works.
BI Consultancy - Data, Analytics and Strategy (Shivam Dhawan)
The presentation describes my views on the data we encounter in digital businesses, including:
- Common data collection methodologies
- Common issues within the decision support system and optimization lifecycle
- Where most of us are failing
and, most importantly, how to connect the dots and move from Data to Strategy.
I work with all facets of Web Analytics and Business Strategy, examining the structures and governance models of various domains to establish and analyze the key performance indicators that give you a 360º overview of your online and offline multi-channel environment.
Beyond my experience with the leading analytics tools in the market, such as Google Analytics, Omniture, and BI tools for Big Data, I develop new solutions to solve complex digital and business problems.
As a resourceful consultant, I can connect with your team in whatever modality or form meets your needs and solves any data or strategy problem.
Tackling Data Quality problems requires more than a series of tactical, one-off improvement projects. By their nature, many Data Quality problems extend across and often beyond an organization. Addressing these issues requires a holistic architectural approach combining people, process, and technology. Join Nigel Turner and Donna Burbank as they provide practical ways to control Data Quality issues in your organization.
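The kinds of issues such a program has to control can be illustrated with a minimal profiling sketch (hypothetical records and field names, standard library only): completeness, uniqueness, and consistency checks are the tactical starting point any holistic approach still rests on.

```python
from collections import Counter

# Hypothetical customer records, for illustration only
records = [
    {"id": 1, "email": "a@example.com", "country": "UK"},
    {"id": 2, "email": None,            "country": "uk"},
    {"id": 2, "email": "c@example.com", "country": "US"},
]

# Completeness: how many records are missing each field?
missing = Counter(k for r in records for k, v in r.items() if v is None)

# Uniqueness: duplicated keys point at upstream process problems
ids = [r["id"] for r in records]
dupes = {i for i, n in Counter(ids).items() if n > 1}

# Consistency: the same value spelled different ways ("UK" vs "uk")
countries = {r["country"].upper() for r in records if r["country"]}

print(dict(missing), dupes, sorted(countries))
```

Checks like these find the symptoms; fixing the root causes is where the people-and-process side of the architecture comes in.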
Look no further than our comprehensive Data Science Training program in Chandigarh. Designed to equip individuals with the skills and knowledge required to thrive in today's data-centric world, our course offers a unique blend of theoretical foundations and hands-on practical experience.
This presentation was prepared by one of our renowned tutors, "Suraj".
If you are interested in learning more about Big Data, Hadoop, or Data Science, join our free introduction class on 14 Jan at 11 AM GMT. To register your interest, email us at info@uplatz.com.
In this presentation, we look at what Data Science is and its applications, and discuss the most common use cases of Data Science.
I presented this at the LSPE-IN meetup held on 10th March 2018 at Walmart Global Technology Services.
Data Science - An Emerging Stream of Science with its Spreading Reach & Impact (Dr. Sunil Kr. Pandey)
This is my presentation on the topic "Data Science - An Emerging Stream of Science with its Spreading Reach & Impact". I have compiled and collected statistics and data from different sources. It may be useful for students and anyone interested in this field of study.
The slides aim to help you understand, and provide insights on, the following topics:
* Overview for Data Science
* Definition of Data and Information
* Types of Data and Representation
* Data Value Chain: Data Acquisition, Data Analysis, Data Curation, Data Storage, Data Usage
* Basic concepts of Big Data
Defining Data Science
• What Does a Data Science Professional Do?
• Data Science in Business
• Use Cases for Data Science
• Installation of R and RStudio
Advanced Analytics and Data Science Expertise (SoftServe)
An overview of SoftServe's Data Science service line.
- Data Science Group
- Data Science Offerings for Business
- Machine Learning Overview
- AI & Deep Learning Case Studies
- Big Data & Analytics Case Studies
Visit our website to learn more: http://www.softserveinc.com/en-us/
1. Data Science and Business Analytics
Dr. M. Inbavalli
Vice Principal & Head, Research Department of Computer Science
Marudhar Kesari Jain College for Women
Vaniyambadi - 635751
2. Overview
• Evolution of Data
• Data Science
• Business Analytics
• Applications
• AI, ML, DL, Data science – Relationship
• Tools for Data Science
• Life cycle of data science with case study
• Algorithms for Data Science
• Data Science Research Areas
• Future of Data Science
3. Data All Around
• Data has become the most abundant resource today
• Explosion of data in pretty much every domain
• Lots of data is being collected and warehoused:
• Web data, e-commerce
• Financial transactions, bank/credit transactions
• Online trading and purchasing
• Social networks
4. Data All Around (cont.)
• Sensing devices and sensor networks that can monitor everything 24/7, from temperature to pollution to vital signs
• Increasingly sophisticated smartphones
• The Internet and social networks make it easy to publish data
• Scientific experiments and simulations produce astronomical volumes of data
• Internet of Things (IoT)
• Datafication: taking all aspects of life and turning it into data (e.g., what you like/enjoy is turned into a stream of your "likes")
8. How Much Data Do We Have?
• Data volumes are expected to grow much larger
• Over 2.5 quintillion bytes of data are created every single day
9. How Much Data Do We Have?
• Example: traffic prediction, combining crowdsourcing + physical modeling + sensing + data assimilation (from the Institute for Transportation Studies)
• What can you do with traffic prediction data?
10. How to Handle That Data?
• Data is like crude oil: valuable, but of little use unrefined. Just as oil must be changed into gas, plastic, chemicals, etc. to create valuable products that drive profitable activity, data must be broken down and analyzed for it to have value.
• How do we extract interesting, actionable insights and scientific knowledge?
11. Data Science: Why the Excitement?
• Data Science is the science that uses computer science, statistics, machine learning, visualization and human-computer interaction to collect, clean, integrate, analyze, visualize and interact with data to create data products.
• Turn data into data products.
12. Data Science: Why the Excitement? (cont.)
Theories and techniques from many fields and disciplines are used to investigate and analyze large amounts of data to help decision makers in industries such as science, engineering, economics, politics, finance, and education:
• Computer Science: pattern recognition, visualization, data warehousing, high-performance computing, databases, AI
• Mathematics: mathematical modeling
• Statistics: statistical and stochastic modeling, probability
Data science (DS) is a multidisciplinary field of study whose goal is to address the challenges in big data.
13. Data Science: Why the Excitement? (cont.)
• Data Science blends tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data.
• It focuses on statistical modeling, machine learning, management and analysis of data sets, and data acquisition.
• Data Science makes use of several statistical procedures, ranging from data transformations and data modeling to statistical operations (descriptive and inferential statistics) and machine learning modeling.
• To obtain predictive responses from the models, it is essential to understand the underlying patterns of the data. Furthermore, optimization techniques can be used to meet the business requirements of the user.
14. Data Science: Why the Excitement? (cont.)
• Using various statistical tools, a Data Scientist develops models. With the help of these models, they support their clients in the decision-making process. Furthermore, these models support demand-generation initiatives.
Data Science also covers:
• Data Integration.
• Distributed Architecture.
• Automating Machine learning.
• Data Visualization.
• Dashboards and BI.
• Data Engineering.
• Deployment in production mode
• Automated, data-driven decisions.
15. Example: Search
• Google earns around $50 bn/year from advertising, about 97% of the company's revenue.
• Sponsored search uses an auction: a pure competition for marketers trying to win access to consumers.
• In other words, a competition for models of consumers (their likelihood of responding to the ad) and for determining the right bid for the item.
• There are around 30 billion search requests a month, perhaps a trillion events of history across search providers.
• Google AdWords and AdSense
16. Data Science Applications
• Transaction databases → recommender systems (Netflix), fraud detection (security and privacy)
• Wireless sensor data → smart homes, real-time monitoring, Internet of Things
• Text data, social media data → product reviews and consumer satisfaction (Facebook, Twitter, LinkedIn), e-discovery
• Software log data → automatic troubleshooting (Splunk)
• Genotype and phenotype data → Epic, 23andMe, patient-centered care, personalized medicine
17. Other Applications
• Banks make smarter decisions through fraud detection, management of customer data, risk modeling, real-time predictive analytics, customer segmentation, etc.
• Fraud detection covers credit card, insurance, and accounting fraud.
• Banks can analyze customers' investment patterns and cycles and suggest offers that suit them accordingly.
• Risk modeling through data science lets banks assess their overall performance.
• In real-time and predictive analytics, banks use machine learning algorithms to improve their analytics strategy.
18. Other Applications (cont.)
• Customer sentiment analysis techniques can boost social media interaction, increase feedback and analyze customer reviews.
• Manufacturing: IoT has enabled companies to predict potential problems, monitor systems and analyze a continuous stream of data.
• Uber uses data science for price optimization and for providing better experiences to its customers. Using powerful predictive tools, it predicts prices based on parameters such as weather patterns, availability of transport, customers, etc.
19. Data
• Measurable units of information gathered or captured from the activity of people, places and things.
• Data is generated from different sources such as financial logs, text files, multimedia, sensors, and instruments.
• We need to understand:
• which data to use
• how to organize the data, and so on.
• We prepare the structured and unstructured data to be used by the analytics team for model building.
• Types of Data
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
• Social Network, Semantic Web (RDF), …
• Streaming Data
20. What Do We Do with the Data?
• Aggregation and statistics
• Data warehousing and OLAP
• Indexing, searching, and querying
• Keyword-based search
• Pattern matching (XML/RDF)
• Knowledge discovery
• Data mining
• Statistical modeling
• Example: companies learn your secrets, shopping patterns, and preferences
• E.g., can we tell whether a child likes animation games, even if they don't want us to know?
• Building and maintaining a data warehouse is a key skill which a Data Engineer must have.
21. • Data Engineers build pipelines which extract data from multiple sources and then manipulate it to make it usable.
• Business analytics (BA) is the practice of iterative, methodical exploration of an organization's data, with an emphasis on statistical analysis. Business analytics is used by companies committed to data-driven decision-making.
• BA activities must be anchored to a strategically relevant business question to be answered using data analysis.
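The pipeline idea above can be illustrated with a minimal extract-transform-load sketch in Python; all field names and records here are invented for illustration:

```python
# Minimal, illustrative ETL pipeline: extract raw records,
# transform them into a usable shape, and load them into a store.
# All field names and values are made up for illustration.

def extract():
    # "Extract": in practice this might read from files, APIs or databases.
    return [
        {"id": "1", "amount": "250.00", "channel": "web"},
        {"id": "2", "amount": "99.50",  "channel": "branch"},
        {"id": "3", "amount": "bad",    "channel": "web"},   # dirty record
    ]

def transform(records):
    # "Transform": convert types and drop records that cannot be parsed.
    clean = []
    for r in records:
        try:
            clean.append({"id": int(r["id"]),
                          "amount": float(r["amount"]),
                          "channel": r["channel"]})
        except ValueError:
            continue  # skip unusable rows
    return clean

def load(records, store):
    # "Load": write the cleaned records into the target store.
    for r in records:
        store[r["id"]] = r
    return store

warehouse = load(transform(extract()), {})
```

Real pipelines add scheduling, logging and retries, but they follow this same three-stage shape.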
22. Data Science and Business Analytics
• Data science, or analytics, is the process of deriving insights from data in order to make optimal decisions.
• It employs techniques such as basic statistics, regression, simulation and optimization modeling, data mining and machine learning, text analytics, artificial intelligence and visualization.
• Data science focuses on data modelling and data warehousing to track the ever-growing data set. The information extracted through data science applications is used to guide business processes and reach organisational goals.
24. Databases vs. Data Science
• Data volume: modest (databases) vs. massive (data science)
• Examples: bank records, personnel records, census, medical records (databases) vs. online clicks, GPS logs, tweets, building sensor readings (data science)
• Priorities: consistency, error recovery, auditability (databases) vs. speed, availability, query richness (data science)
• Structure: strongly structured (schema) vs. weakly structured or none (text)
• Properties: transactions, ACID (databases) vs. CAP theorem (2 of 3), eventual consistency (data science)
• Realizations: SQL (databases) vs. NoSQL: MongoDB, CouchDB, HBase, Cassandra, Riak, Memcached, Apache River, … (data science)
25. Business Intelligence (BI) vs. Data Science
• Data sources: structured (usually SQL, often a data warehouse) for BI vs. both structured and unstructured (logs, cloud data, SQL, NoSQL, text) for data science
• Approach: statistics and visualization for BI vs. statistics, machine learning, graph analysis, natural language processing (NLP) for data science
• Focus: past and present for BI vs. present and future for data science
• Tools: Pentaho, Microsoft BI, QlikView, R for BI vs. RapidMiner, BigML, Weka, R for data science
28. Data Science vs. ML vs. AI
• Tools:
• Data Science: SAS, Tableau, Apache Spark, MATLAB, SQL
• ML: Amazon Lex, IBM Watson Studio, Microsoft Azure ML Studio
• AI: TensorFlow, scikit-learn, Keras, Amazon Lex, Google Cloud Platform, DataRobot
• Data handled: Data Science deals with structured and unstructured data; Machine Learning uses statistical models; Artificial Intelligence uses logic and decision trees.
• Popular examples: fraud detection and healthcare analysis (Data Science); recommendation systems such as Spotify, and facial recognition (ML); chatbots and voice assistants (AI).
• Main applications:
• Data Science: credit card fraud, ATM theft, disease prediction, pattern identification, etc.
• ML: online recommender systems, Google search algorithms, Facebook auto friend-tagging suggestions, etc.
• AI: Siri, customer support using chatbots, expert systems, online game playing, intelligent humanoid robots, etc.
29. Relationship between Data Science, Artificial Intelligence and Machine Learning
• Machine learning for predictive reporting: studying transactional data to make valuable predictions. Also known as supervised learning; implemented to suggest the most effective courses of action for a company.
• Machine learning for pattern discovery: setting parameters in various data reports; unsupervised learning, where there are no pre-decided parameters.
• Artificial Intelligence represents an action-planning feedback loop of perception:
Perception > Planning > Action > Feedback of Perception
• Data Science uses different parts of this pattern, or loop, to solve specific problems.
30. • For instance, in the first step, Perception, data scientists try to identify patterns with the help of the data.
• In Planning, there are two aspects:
• finding all possible solutions
• finding the best solution among all solutions
• Machine learning should not be taken as a standalone subject; it is understood in the context of its environment.
• AI is the tool that helps data science get results and solutions for specific problems, and machine learning is what helps in achieving that goal.
• Example: Google's search engine is a product of data science. It uses predictive analysis, a system used by artificial intelligence, to deliver intelligent results to users.
32. Tools for Data Science
• Reporting and business intelligence
• Predictive modelling and machine learning
• Artificial intelligence
• Data science tools for big data (Volume):
• 1 GB to 10 GB of data: traditional databases (Excel, Access, SQL, etc.)
• more than 10 GB: Hadoop, Hive
• Tools for handling Variety
• Tools for Handling Variety
33. Tools for Handling Variety (cont.)
• Varied data is voluminous: customer feedback, for example, may vary in length, sentiment, and other factors.
• Examples of SQL databases are Oracle, MySQL and SQLite, whereas NoSQL includes popular databases like MongoDB, Cassandra, etc.
• These NoSQL databases are seeing huge adoption because of their ability to scale and handle dynamic data.
34. Tools for Handling Velocity
• Velocity is the speed at which data is captured; it includes both real-time and non-real-time data.
• Examples of real-time data:
• sensor data collected by self-driving cars (automatic actions)
• CCTV
• stock trading
• fraud detection for credit card transactions
• network data from social media (Facebook, Twitter, etc.)
• Tools:
• Apache Kafka: real-time data pipelines
• Apache Storm: processes up to 1 million tuples per second and is highly scalable
• Amazon Kinesis: licensed and powerful
• Apache Flink: high performance, fault tolerance, and efficient memory management
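As a toy illustration of the velocity problem (not using any of the tools above), a sliding-window counter can report how many events arrived in the last few seconds of a stream; the class and the simulated timestamps are invented for this sketch:

```python
from collections import deque

class SlidingWindowCounter:
    """Toy real-time metric: count events seen in the last `window` seconds."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # event timestamps, oldest first

    def record(self, timestamp):
        # Ingest one event from the stream.
        self.events.append(timestamp)

    def count(self, now):
        # Evict events that have fallen out of the window, then count the rest.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)

# Simulated stream: events arriving at t = 0, 1, 2 and 8 seconds.
counter = SlidingWindowCounter(window_seconds=5)
for t in [0, 1, 2, 8]:
    counter.record(t)
```

Systems like Storm or Flink generalize exactly this idea (windowed aggregation) to distributed, fault-tolerant streams.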
35. Tools by Category
• Reporting and BI tools: Excel, QlikView, Tableau, MicroStrategy, Power BI, Google Analytics, Dundas, Sisense, etc.
• Predictive analytics and machine learning tools: Python, R, Apache Spark, Julia, Jupyter Notebooks; SAS, SPSS and MATLAB (licensed)
• Frameworks for deep learning: TensorFlow, PyTorch, Keras and Caffe
• AI tools: AutoKeras, Google Cloud AutoML, IBM Watson, DataRobot, H2O's Driverless AI, and Amazon's Lex
37. Role of a Data Scientist
• Identifying the data-analytics problems that offer the greatest opportunities to the
organization
• Determining the correct data sets and variables
• Collecting large sets of structured and unstructured data from disparate sources
• Cleaning and validating the data to ensure accuracy, completeness, and uniformity
• Devising and applying models and algorithms to mine the stores of big data
• Analyzing the data to identify patterns and trends
• Interpreting the data to discover solutions and opportunities
• Communicating findings to stakeholders using visualization and other means
38. Phase 1: Discovery
• Establish the various specifications, requirements, priorities and required budget.
• Requires the ability to ask the right questions.
• You need to frame the business problem and formulate initial hypotheses (IH) to test.
Phase 2: Data Preparation
• Data cleaning, transformation, and visualization. This will help you spot the outliers and establish relationships between the variables (e.g., in R).
Phase 3: Model Planning
• Determine the methods and techniques to draw out the relationships between variables.
• These relationships will set the base for the algorithms in the next phase.
• Apply Exploratory Data Analysis (EDA) using various statistical formulas and visualization tools.
39. • R has a complete set of modeling capabilities and provides a good environment for
building interpretive models.
• SQL Analysis services can perform in-database analytics using common data mining
functions and basic predictive models.
• SAS/ACCESS can be used to access data from Hadoop and is used for creating
repeatable and reusable model flow diagrams.
40. Phase 4: Model Building
• Develop datasets for training and testing purposes.
• Use various learning techniques like classification, association and clustering to build the model.
Examples:
1. Classification (decision trees)
2. Clustering (K-means, Fuzzy C-means, hierarchical clustering, DBSCAN)
3. Association rules
4. Advanced supervised machine learning algorithms (Naive Bayes, k-NN, SVM)
5. Ensemble learning algorithms (Random Forest, Gradient Boosting)
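As a sketch of one of the techniques listed above, here is a minimal one-dimensional K-means loop in plain Python; the data and the choice of k=2 are assumptions for illustration:

```python
# Minimal 1-D K-means: repeatedly assign points to the nearest
# centroid, then move each centroid to the mean of its points.
# Toy data and k=2 are assumptions for illustration.

def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]        # two obvious groups
centers, groups = kmeans_1d(data, [0.0, 10.0])
```

Library implementations (e.g., in a statistics package) add smarter initialization and convergence checks, but the two alternating steps are the whole algorithm.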
42. Phase 5: Operationalize
• Analyze the data to identify patterns and trends.
• Interpret the results.
• Deliver final reports, briefings, code and technical documents.
• Run a pilot project.
Phase 6: Communicate Results
• Identify all the key findings, communicate them to the stakeholders and determine whether the results of the project are a success or a failure.
43. Basic statistics
1. Random variables, sampling
2. Distributions and statistical measures
3. Hypothesis testing
Overview of linear algebra
1. Linear algebra and matrix computations
2. Functions, derivatives, convexity
Modeling techniques: regression
1. Mathematical modeling process 2. Linear regression 3. Logistic regression
Data visualization and visual analytics
1. Visual analytics 2. Visualizations in Python and visual analytics in IBM Watson Analytics
44. Data mining and machine learning
1. Classification (decision trees) 2. Clustering (K-means, Fuzzy C-means, hierarchical clustering, DBSCAN) 3. Association rules 4. Advanced supervised machine learning algorithms (Naive Bayes, k-NN, SVM) 5. Intro to ensemble learning algorithms (Random Forest, Gradient Boosting)
Simulation modeling
1. Random number generation 2. Monte Carlo simulations 3. Simulation in IPython
45. Real-Time Example
• Case Study: Diabetes Prevention
• What if we could predict the occurrence of diabetes and take appropriate measures beforehand to prevent it?
• You can refer to the sample data below.
Step 1: Discovery
• Attributes:
• npreg – number of times pregnant
• glucose – plasma glucose concentration
• bp – blood pressure
• skin – triceps skinfold thickness
• bmi – body mass index
• ped – diabetes pedigree function
• age – age
• income – income
46. Step 2: Data Preparation
• Once we have the data, we need to clean and prepare it for analysis.
• The data has a lot of inconsistencies, like missing values, blank columns, abrupt values and incorrect data formats, which need to be cleaned.
• We organize the data into a single table under the different attributes, making it more structured.
47. Step 2 (cont.)
• This data has a lot of inconsistencies:
• In the column npreg, "one" is written in words, whereas it should be in numeric form, like 1.
• In the column bp, one of the values is 6600, which is impossible (at least for humans), as bp cannot reach such a huge value.
• The income column is blank and makes no sense in predicting diabetes; it is redundant and should be removed from the table.
• We clean and preprocess this data by removing the outliers, filling in the null values and normalizing the data types: this is data preprocessing.
• Finally, we get clean data which can be used for analysis.
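The cleaning steps described above can be sketched in plain Python; the sample records below are invented to mirror the exact issues in the text ("one" written as a word, an impossible bp of 6600, a useless income column), and the plausibility threshold is an assumption:

```python
# Toy preprocessing for the diabetes table described above.
WORDS_TO_NUMBERS = {"one": 1, "two": 2, "three": 3}
BP_MAX = 300  # assumed plausibility threshold for blood pressure

raw = [
    {"npreg": "one", "glucose": 148, "bp": 72,   "income": ""},
    {"npreg": "2",   "glucose": 85,  "bp": 6600, "income": ""},  # outlier
    {"npreg": "0",   "glucose": 183, "bp": 64,   "income": ""},
]

def preprocess(rows):
    clean = []
    for row in rows:
        r = dict(row)
        r.pop("income", None)              # drop the irrelevant column
        npreg = r["npreg"]
        if isinstance(npreg, str):         # "one" -> 1, "2" -> 2
            r["npreg"] = WORDS_TO_NUMBERS.get(npreg.lower())
            if r["npreg"] is None:
                r["npreg"] = int(npreg)
        if r["bp"] > BP_MAX:               # remove impossible outliers
            continue
        clean.append(r)
    return clean

cleaned = preprocess(raw)
```

In practice this is done with a data-frame library, but the operations (type normalization, outlier removal, column dropping) are the same.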
48. Step 3: Model Planning
• Load the data into the analytical sandbox and apply various statistical functions.
• R has functions like describe, which gives us the number of missing values and unique values.
• We can also use the summary function, which gives us statistical information like the mean, median, range, min and max values.
• Then, we use visualization techniques like histograms, line graphs and box plots to get a fair idea of the distribution of the data.
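An equivalent of those summary functions can be sketched with Python's standard library; the column values below are illustrative, not real patient data:

```python
import statistics

def summarize(values):
    """Toy equivalent of R's summary(): basic statistics for one column."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),  # like describe()'s NA count
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }

glucose = [148, 85, 183, None, 137]   # made-up column with one missing value
stats = summarize(glucose)
```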
49. Step 4: Model Building
• We use a supervised learning technique to build the model here.
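As a toy sketch of supervised learning for this case (not the model from the slides; the training rows, labels and single-feature setup are invented), a decision stump learns one glucose threshold from labeled examples:

```python
# A decision stump: the simplest supervised classifier, predicting
# diabetes (1) or not (0) from a single glucose threshold.
# Training data is invented for illustration.

train = [(148, 1), (85, 0), (183, 1), (89, 0), (137, 0), (190, 1)]

def fit_stump(data):
    # Try every observed glucose value as a threshold;
    # keep the one with the highest training accuracy.
    best_t, best_acc = None, -1.0
    for t, _ in data:
        acc = sum((x >= t) == bool(y) for x, y in data) / len(data)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def predict(threshold, glucose):
    return 1 if glucose >= threshold else 0

t = fit_stump(train)
```

A decision tree is just many such stumps stacked: each internal node picks the best feature/threshold split on its portion of the data.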
51. Step 5: Deliver the Model
• Check with sample data.
Data:
○ Data tables and data types
○ Operations on tables
○ Basic plotting
○ Tidy data / the ER model
○ Relational operations
○ SQL
Data wrangling:
○ Data acquisition (load and scrape)
○ EDA visualization / grammar of graphics
○ Data cleaning (text, dates)
○ EDA: summary statistics
○ Data analysis with optimization (derivatives)
○ Data transformations
○ Missing data
52. Modeling
○ Univariate probability and statistics
○ Hypothesis testing
○ Multivariate probability and statistics (joint and conditional probability, Bayes' theorem)
○ Data analysis with geometry (vectors, inner products, gradients and matrices)
○ Linear regression
○ Logistic regression
○ Gradient descent (batch and stochastic)
○ Trees and random forests
○ k-NN
○ Naïve Bayes
○ Clustering
○ PCA
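One item from the list above, batch gradient descent, can be sketched in a few lines, minimizing a simple quadratic as a stand-in for a real loss function (the loss, starting point and learning rate are all invented):

```python
# Batch gradient descent on a toy loss f(x) = (x - 3)^2,
# whose gradient is f'(x) = 2 * (x - 3). The learning rate is hand-picked.

def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)  # move against the gradient
    return x

minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Stochastic gradient descent works the same way, except each step uses the gradient of the loss on a single (or small batch of) training example(s).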
53. Sample Algorithms for Data Science Analytics
Regression
• The most popular technique for this algorithm is least squares. This method calculates the best-fitting line, based on historical data.
Examples:
• Weather forecasting
• Assessing risk
Tools:
• TensorFlow and PyTorch
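The least-squares fit of a line y = a + b·x can be written directly from the closed-form formulas, with no library at all (the data points are invented and lie exactly on a known line so the fit can be checked):

```python
# Ordinary least squares for a single feature: the best-fitting line
# y = a + b*x has slope b = cov(x, y) / var(x) and intercept
# a = mean(y) - b * mean(x).

def least_squares(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Points lying exactly on y = 2x + 1, so the fit should recover a=1, b=2.
a, b = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
```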
54. Logistic Regression
• Logistic regression is similar to linear regression, but it is used when the output is binary (i.e. when the outcome can take only two possible values). The prediction for the final output is a non-linear S-shaped function called the logistic function, g().
• Example: a graph of a logistic regression curve showing the probability of passing an exam versus hours of study.
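The logistic function itself is one line; the exam-passing example can be sketched with invented coefficients (a and b below are illustrative, not fitted to real data):

```python
import math

def logistic(z):
    """The S-shaped logistic function g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

# A logistic regression prediction is g(a + b*x); the coefficients
# below are invented for the "hours studied" example in the text.
def pass_probability(hours, a=-4.0, b=1.5):
    return logistic(a + b * hours)
```

Note g(0) = 0.5, so the decision boundary sits where a + b·x = 0: here, at about 2.7 hours of study.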
55. Decision Trees
• Decision trees can be used for both regression and classification tasks.
• Categorical-variable decision tree: predict whether a customer will pay their renewal premium with an insurance company (yes/no).
• Continuous-variable decision tree: predict customer income based on occupation, product, and various other variables.
• Examples: C4.5, CART
Naive Bayes
• A classification technique.
• It measures the probability of each class, and the conditional probability of each class given values of x. This algorithm is used for classification problems to reach a binary yes/no outcome.
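A minimal categorical Naive Bayes classifier can be written straight from that definition; the single feature ("occupation"), the yes/no renewal labels and all training rows are invented for this sketch:

```python
from collections import Counter, defaultdict

# Tiny Naive Bayes over one categorical feature, predicting a yes/no
# renewal outcome. All training data is invented.

train = [("teacher", "yes"), ("teacher", "yes"), ("driver", "no"),
         ("driver", "no"), ("teacher", "no"), ("driver", "yes")]

priors = Counter(label for _, label in train)   # class counts -> P(class)
cond = defaultdict(Counter)                     # counts for P(x | class)
for x, label in train:
    cond[label][x] += 1

def predict(x):
    # Pick the class maximizing P(class) * P(x | class).
    def score(label):
        return (priors[label] / len(train)) * \
               (cond[label][x] / priors[label])
    return max(priors, key=score)
```

With several features, the "naive" assumption multiplies one conditional probability per feature, which is what keeps the model cheap to train.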
57. ANN (Artificial Neural Networks)
• Feed-forward networks: multilayer perceptrons
• Convolutional neural networks: classification, object detection, or even image segmentation; hierarchical object extractors.
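The feed-forward idea can be sketched as a tiny two-layer network; the weights below are hand-picked (not learned) so that the network computes XOR, which a single-layer perceptron cannot:

```python
# A minimal feed-forward network: two hidden units and one output unit
# with step activations. Weights are set by hand to compute XOR,
# illustrating the forward pass only (no training).

def step(z):
    return 1 if z > 0 else 0

def forward(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # fires if at least one input is 1
    h2 = step(x1 + x2 - 1.5)        # fires only if both inputs are 1
    return step(h1 - h2 - 0.5)      # XOR: "at least one, but not both"
```

Real multilayer perceptrons use smooth activations (sigmoid, ReLU) so the weights can be learned by gradient descent rather than set by hand.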
58. What do Data Scientists do?
• National Security
• Cyber Security
• Business Analytics
• Engineering
• Healthcare
• And more ….
59. A Data Scientist Must Possess
• Mathematics and applied mathematics
• Applied statistics / data analysis
• Solid programming skills (R, Python, Julia, SQL)
• Data mining
• Database storage and management
• Machine learning and discovery
60. Data Science Research Areas
• Machine learning
• Artificial intelligence
• Deep learning
• Databases
• Statistics
• Optimization
• Natural language processing
• Computer vision
• Speech processing
• Privacy
• Ethics
• Energy consumption
• Cloud computing
• IoT
• Social media
• Blockchain, etc.