SlideShare a Scribd company logo
1 of 69
Download to read offline
1
1
Overview of Data Analytics Processes
Demeke A. Ayele
(demekeayele@gmail.com)
School of Computer Science
Emeraled International School
Data Analytics
“...what enables the wise commander to strike and conquer, and
achieve things beyond the reach of ordinary men, is
foreknowledge. Now, this foreknowledge cannot be elicited
from spirits...”
The Art of War, Sun Tzu (as seen in Giles, 1994)
2
Peter Drucker first spoke of the “knowledge economy” (1969).
The knowledge economy refers to the use of knowledge “to
generate tangible and intangible value.”Nearly 50 years later,
organizations have virtually transformed them-selves to meet this
challenge, and data and analytics have become central to that
transformation.
Data Analytics
◆ What is data?
● At its most general form, data is simply infor-mation that has
been stored for later use.
◆ What is analytics?
● A structured approach to data-driven problem solving—one
that helps us break up problems through careful consideration
of the facts.
3
But, analytics is one of the most overused yet least understood terms in
business. For some, it relates to the technologies used to“beat data into
submission,” or it is simply an extension of business intelligence and
data warehousing. And yet for others, it relates to the statistical,
mathematical, or quantitative methods used in the development of
models.
Goal
◆ To introduce you the basic data science and analytics
technologies
● Enables you to know the practices used and the potential
use of large scale data analytics.
● To ascertain you to know the fundamental techniques and
tools used to design and analyze large volumes of data.
● Enables you to practice the analysis and visualization of
big data
● It enables you to understand the challenges in big data and
possible technological solutions
4
Outline
5
◆ Course title: Data Exploration and Visulization
● Chapter 1: Overview of Data Anlytics, technologies and techniques
● Chapter 2: Data analytics methods and Tools such as Python and R
Programming, and associated technologies
● Chapter 3: (Big) Data analytics platforms and environments, such as
hadoop and spark ecosystem, and associated technologies and
applications
● Chapter 4: Data analytics techninques and Hadoop and spark
ecosystem technologies
● Reference: Lots of books on the web! ask me also!
● Course Evaluation: Project (50%), Final Exam(50%)
Scope of this Course
◆ The theoritical and practical part of each
contributing technologies, programming models,
tools, configurations and techniques will be discussed
in the lecture session
◆ tool and package installations, configurations, various
programmings and experimentations will be dealt by
the students
◆ Critical topic reviews and programming assignments
will be given to the students
6
Data Analytics Life cycle
7
Data Analytics Process (Lifecycle)
◆ The volume of data has exploded to unimaginable levels in
the past decade,
◆ In contrast, the price of data storage has systematically
reduced.
◆ Private companies and research institutions capture
terabytes of data about their users’ interactions, business,
social media, and also sensors from devices such as mobile
phones and automobiles.
◆ The challenge is to make sense of this ocean of data. This is
where data analytics comes into picture.
8
Data Analytics Process (Lifecycle)
9
◆ Data Analytics largely involves collecting data from
different sources, munge it in a way that it becomes
available to be consumed by analysts and finally deliver
data products useful to the organization's business.
◆ The process of converting large amounts of unstructured
raw data, retrieved from different sources, to a data
product useful for organizations forms the core of Data
Analytics.
Traditional Data Mining Process (Life Cycle)
◆ In order to provide a framework to organize the work
needed by an organization and deliver clear insights from
Big Data, it’s useful to think of as a cycle/process with
different stages.
◆ It is by no means linear, meaning all the stages are
related with each other. This cycle has superficial
similarities with the more traditional data mining cycle as
described in CRISP methodology.
10
Data Analytics - Data Life Cycle
11
◆ Data science projects differ from BI projects:
● More exploratory in nature
● Critical to have a project process
● Participants should be thorough and rigorous
◆ Break large projects into smaller pieces
◆ Spend more time to plan and scope the work
◆ Documenting adds rigor and credibility…
Data Analytics Process (Lifecycle)
12
◆ Data Analytics Lifecycle Overview
●Phase 1: Discovery
●Phase 2: Data Preparation
●Phase 3: Model Planning
●Phase 4: Model Building Phase
●Phase 5: Communicate Results
●Phase 6: Operationalize
◆ Case Study: GINA
Data Analytics Lifecycle Overview
13
• The data analytic lifecycle is designed for Big Data
problems and data science projects
• With six phases, the project work can occur in several
phases simultaneously
• The cycle is iterative to portray a real project
• Work can return to earlier phases as new information
is uncovered
Key Roles for a Successful Analytics Project
14
15
◆ Key Roles for a Successful Analytics Project
● Business User – understands the domain area
● Project Sponsor – provides requirements
● Project Manager – ensures meeting objectives
● Business Intelligence Analyst – provides business domain expertise
based on deep understanding of the data
● Database Administrator (DBA) – creates DB environment
● Data Engineer – provides technical skills, assists data management
and extraction, supports analytic sandbox
● Data Scientist – provides analytic techniques and modeling
Data Analytics Lifecycle Overview
16
◆ Background and Overview of Data Analytics Lifecycle
● Data Analytics Lifecycle defines the analytics process and best
practices from discovery to project completion
◆ The Lifecycle employs aspects of
◆ Scientific method
◆ Cross Industry Standard Process for Data Mining (CRISP-DM) -
Process model for data mining
◆ Davenport’s DELTA framework
◆ Hubbard’s Applied Information Economics (AIE) approach
◆ MAD Skills: New Analysis Practices for Big Data by Cohen et al.
Data Analytics Lifecycle Overview
Overview of Data Analytics Lifecycle
17
Phase I: Discovery
18
19
Data Analytics Lifecycle: Discovery
◆ Learning the Business Domain
◆ Resources
◆ Framing the Problem
◆ Identifying Key Stakeholders
In data discovery, stakeholders constantly analyze the business trends,
similar data analytics case studies, and domain of the business industry.
An assessment is done on the in-house resources, in-house infrastructure,
and technology. Once the evaluation is complete, the stakeholders begin
to build the hypothesis for resolving the critical business conundrums
keeping in view of the customer experience and market.
◆ Interviewing the Analytics
Sponsor
◆ Developing Initial Hypotheses
◆ Identifying Potential Data
Sources
What, how, when and why?
tools, packages, platforms,
environments, techniques and
technologies we will use for Discovery
phase?
20
Phase 2: Data Preparation
21
22
Data Analytics Lifecycle: Data Preparation
◆ Includes steps to explore, preprocess, and condition data
◆ Create robust environment – analytics sandbox
◆ Data preparation tends to be the most labor-intensive
step in the analytics lifecycle - often at least 50% of the
data science project’s time
◆ The data preparation phase is generally the most iterative
and the one that teams tend to underestimate most often
23
Data Analytics Lifecycle: Data Preparation
◆ Preparing the Analytic Sandbox
● Create the analytic sandbox (also called workspace)
● Allows team to explore data without interfering with live
production data
● Sandbox collects all kinds of data (expansive approach)
● The sandbox allows organizations to undertake ambitious
projects beyond traditional data analysis and BI to perform
advanced predictive analytics
● Although the concept of an analytics sandbox is relatively
new, this concept has become acceptable to data science
teams and IT groups
24
◆ Performing ETL (Extract, Transform, Load, )
● In ETL, users perform extract, transform, load
● In the sandbox, the process is often ELT – early load
preserves the raw data which can be useful to examine
● EG. in credit card fraud detection, outliers can represent
high-risk transactions that might be inadvertently
filtered out or transformed before being loaded into the
database
● Hadoop is often used here
Data Analytics Lifecycle: Data Preparation
25
◆ Modern technology has changed most organizations’ approach
to ELT, for several reasons. The biggest is the advent of
powerful analytics warehouses like Amazon Redshift and
Google BigQuery. These newer cloud-based analytics databases
have the horsepower to perform transformations in place rather
than requiring a special staging area.
Data Analytics Lifecycle: Data Preparation
26
◆ Learning about the Data
Data Analytics Lifecycle: Data Preparation
● Becoming familiar with the data is critical
● This activity accomplishes several goals:
- Determines the data available to the team early in the
project
- Highlights gaps – identifies data not currently available
- Identifies data outside the organization that might be useful
◆ Learning about the Data Sample Dataset
Inventory
27
Data Analytics Lifecycle: Data Preparation
28
◆ Data Conditioning
● includes cleaning data, normalizing datasets, and
performing transformations
● Often viewed as a preprocessing step prior to data
analysis, it might be performed by data owner, IT
department, DBA, etc.
● Best to have data scientists involved
● Data science teams prefer more data than too little
Data Analytics Lifecycle: Data Preparation
29
◆ Data Conditioning Additional questions and
considerations
● What are the data sources? Target fields?
● How clean is the data?
● How consistent are the contents and files? Missing or
inconsistent values?
● Assess the consistence of the data types – numeric,
alphanumeric?
● Review the contents to ensure the data makes sense
● Look for evidence of systematic error
Data Analytics Lifecycle: Data Preparation
30
Data Analytics Lifecycle: Data Preparation
◆ Survey and Visualize
● Leverage data visualization tools to gain an overview of
the data
● Shneiderman’s mantra:
- “Overview first, zoom and filter, then details-on-
demand”
- This enables the user to find areas of interest, zoom and
filter to find more detailed information about a
particular area, then find the detailed data in that area
31
◆ Survey and Visualize Guidelines and Considerations
● Review data to ensure calculations are consistent
● Does the data distribution stay consistent?
● Assess the granularity of the data, the range of values, and the
level of aggregation of the data
● Does the data represent the population of interest?
● Check time-related variables – daily, weekly, monthly? Is this
good enough?
● Is the data standardized/normalized? Scales consistent?
● For geospatial datasets, are state/country abbreviations
consistent
Data Analytics Lifecycle: Data Preparation
32
Data Analytics Lifecycle: Data Preparation
◆ Common Tools for Data Preparation
● Hadoop can perform parallel ingest and analysis
● Alpine Miner provides a graphical user interface for
creating analytic workflows
● OpenRefine (formerly Google Refine) is a free, open
source tool for working with messy data
● Similar to OpenRefine, Data Wrangler is an interactive
tool for data cleansing and transformation
33
What, how, when and why?
tools, packages, platforms,
environments, techniques and
technologies we will use for Data
Preparation phase?
Phase 3: Model Planning
34
35
Data Analytics Lifecycle: Model Planning
◆ Activities to consider
● Assess the structure of the data – this dictates the tools and
analytic techniques for the next phase
● Ensure the analytic techniques - enable the team to meet
the business objectives and accept or reject the working
hypotheses
● Determine if the situation warrants a single model or a
series of techniques as part of a larger analytic workflow
● Research and understand how other analysts have
approached this kind or similar kind of problem
36
Data Analytics Lifecycle: Model Planning
◆ Model Planning in Industry Verticals
● Example of other analysts approaching a
similar problem
◆ Data Exploration and Variable Selection
● Explore the data to understand the relationships
among the variables to inform selection of the
variables and methods
● A common way to do this is to use data
visualization tools
● Often, stakeholders and subject matter experts
may have ideas - for example, some hypothesis
that led to the project
● Aim for capturing the most essential predictors
and variables - this often requires iterations and
testing to identify key variables
● If the team plans to run regression analysis,
identify the candidate predictors and outcome
37
Data Analytics Lifecycle: Model Planning
38
◆ Model Selection
● The main goal is to choose an analytical
technique, or several candidates, based on the
end goal of the project
● We observe events in the real world and attempt
to construct models that emulate this behavior
with a set of rules and conditions - a model is
simply an abstraction from reality
● Determine whether to use techniques best suited
for structured data, unstructured data, or a
hybrid approach
● Teams often create initial models using statistical
software packages such as R, SAS, or Matlab,
Which may have limitations when applied to very
large datasets
Data Analytics Lifecycle: Model Planning
◆ Common Tools for the Model Planning
Phase
● R has a complete set of modeling capabilities
- R contains about 5000 packages for data
analysis and graphical presentation
● SQL Analysis services can perform in-
database analytics of common data mining
functions, involved aggregations, and basic
predictive models
● SAS/ACCESS provides integration between
SAS and the analytics sandbox via multiple
39
Data Analytics Lifecycle: Model Planning
40
What, ho, when and why?
tools, packages, platforms,
environments, techniques and
technologies we will use for model
plannng phase?
Phase 4: Model Building
41
◆ Execute the models defined in Phase 3
◆ Develop datasets for training, testing, and
production
◆ Develop analytic model on training data, test on
test data
◆ Questions to consider
● Does the model appear valid and accurate on the test
data?
● Does the model output/behavior make sense to the
domain experts?
● Do the parameter values make sense in the context of
the domain?
● Is the model sufficiently accurate to meet the goal?
● Does the model avoid intolerable mistakes?
● Are more data or inputs needed?
● Will the kind of model chosen support the runtime
42
Data Analytics Lifecycle: Model Building
◆ Common Tools for the Model Building
Phase
●Commercial Tools
- SAS Enterprise Miner – built for enterprise-
level computing and analytics
- SPSS Modeler (IBM) – provides enterprise-
level computing and analytics
- Matlab – high-level language for data
analytics, algorithms, data exploration
- Alpine Miner – provides GUI frontend for
backend analytics tools
- STATISTICA and MATHEMATICA – popular
data mining and analytics tools 43
Data Analytics Lifecycle: Model Building
◆ Free or Open Source Tools
- R and PL/R - PL/R is a procedural language for
PostgreSQL with R
- Octave – language for computational modeling
- WEKA – data mining software package with
analytic workbench
- Python – language providing toolkits for
machine learning and analysis
- SQL – in-database implementations provide an
alternative tool (see Chap 11)
44
Data Analytics Lifecycle: Model Building
45
What, ho, when and why?
tools, packages, platforms,
environments, techniques and
technologies we will use for model
building phase?
Phase 5: Communicate Results
46
Data Analytics Lifecycle: Communicating the Results
◆ Determine if the team succeeded or failed
in its objectives
◆ Assess if the results are statistically
significant and valid
● If so, identify aspects of the results that
present salient findings
● Identify surprising results and those in
line with the hypotheses
◆ Communicate and document the key
findings and major insights derived from
the analysis
● This is the most visible portion of the
47
48
What, ho, when and why?
tools, packages, platforms,
environments, techniques and
technologies we will use for result
communicating phase?
Phase 6: Operationalize
49
Data Analytics Lifecycle: Operationalize
◆ In this last phase, the team communicates the
benefits of the project more broadly and sets
up a pilot project to deploy the work in a
controlled way
◆ Risk is managed effectively by undertaking
small scope, pilot deployment before a wide-
scale rollout
◆ During the pilot project, the team may need to
execute the algorithm more efficiently in the
database rather than with in-memory tools like
R, especially with larger datasets
◆ To test the model in a live setting, consider
running the model in a production environment
for a discrete set of products or a single line of
50
Data Analytics Lifecycle: Operationalize
51
◆ Key outputs from successful analytics
project
52
◆ Key outputs from successful analytics
project
● Business user – tries to determine business
benefits and implications
● Project sponsor – wants business impact,
risks, ROI
● Project manager – needs to determine if
project completed on time, within budget,
goals met
● Business intelligence analyst – needs to know
if reports and dashboards will be impacted
and need to change
● Data engineer and DBA – must share code and
document
Data Analytics Lifecycle: Operationalize
◆ Four main deliverables - Although the
seven roles represent many interests,
the interests overlap and can be met
with four main deliverables
● Presentation for project sponsors – high-
level takeaways for executive level
stakeholders
● Presentation for analysts – describes
business process changes and reporting
changes, includes details and technical
graphs
● Code for technical people 53
Data Analytics Lifecycle: Operationalize
54
What, how, when and why?
tools, packages, platforms,
environments, techniques and
technologies we will use for
Operationalizing phase?
Case Study:
Global Innovation
Network and Analysis
(GINA)
55
Case Study: GINA
56
◆ In 2012, EMC’s new director wanted to
improve the company’s engagement of
employees across the global centers of
excellence (GCE) to drive innovation,
research, and university partnerships
◆ This project was created to accomplish
● Store formal and informal data
● Track research from global technologists
● Mine the data for patterns and insights to
improve the team’s operations and strategy
Phase 1: Discovery
57
◆ Team members and roles
● Business user, project sponsor, project
manager – Vice President from Office of
CTO
● BI analyst – person from IT
● Data engineer and DBA – people from IT
● Data scientist – distinguished engineer
◆ The data fell into two categories
● Five years of idea submissions from
internal innovation contests
● Minutes and notes representing innovation
and research activity from around the
Phase 1: Discovery
◆ Hypotheses grouped into two categories
● Descriptive analytics of what is happening
to spark further creativity, collaboration,
and asset generation
● Predictive analytics to advise executive
management of where it should be
investing in the future
58
Phase 2: Data Preparation
59
◆ Set up an analytics sandbox
◆ Discovered that certain data needed
conditioning and normalization and that
missing datasets were critical
◆ Team recognized that poor quality data
could impact subsequent steps
◆ They discovered many names were
misspelled and problems with extra spaces
◆ These seemingly small problems had to be
addressed
Phase 3: Model Planning
60
◆ The study included the following
considerations
● Identify the right milestones to achieve the
goals
● Trace how people move ideas from each
milestone toward the goal
● Tract ideas that die and others that reach the
goal
● Compare times and outcomes using a few
different methods
Phase 4: Model Building
61
◆ Several analytic method were employed
● NLP on textual descriptions
● Social network analysis using R and
Rstudio
● Developed social graphs and
visualizations
62
◆ Social graph of data submitters and
finalists
Phase 4: Model Building
◆ Social graph of top innovation influencers
63
Phase 4: Model Building
Phase 5: Communicate
Results
64
◆ Study was successful in in identifying
hidden innovators
● Found high density of innovators in Cork,
Ireland
◆ The CTO office launched longitudinal
studies
Phase 6: Operationalize
65
◆ Deployment was not really discussed
◆ Key findings
● Need more data in future
● Some data were sensitive
● A parallel initiative needs to be created
to improve basic BI activities
● A mechanism is needed to continually
reevaluate the model after deployment
Phase 6: Operationalize
66
Summary
67
◆ The Data Analytics Lifecycle is an
approach to managing and executing
analytic projects
◆ Lifecycle has six phases
◆ Bulk of the time usually spent on
preparation – phases 1 and 2
◆ Seven roles needed for a data science
team
Theme of the course
68
What Techniques, Tools and
Technologies are required at each
phase of the data analytics Process?
and why they are required there? how
and when they will be used?
Some of the technologies you need to explorer by
yourself as a data science professional:
69
◆ File system technologies and techniques
◆ Distributed processing and storage
◆ Parallel processing technologies and techniques
◆ Cloud, Grid and Utility computing technologies and
techniques
◆ Data centers and virtualizations

More Related Content

Similar to 1. Overview_of_data_analytics (1).pdf

Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)SayyedYusufali
 
data science training and placement
data science training and placementdata science training and placement
data science training and placementSaiprasadVella
 
online data science training
online data science trainingonline data science training
online data science trainingDIGITALSAI1
 
Data science online training in hyderabad
Data science online training in hyderabadData science online training in hyderabad
Data science online training in hyderabadVamsiNihal
 
data science online training in hyderabad
data science online training in hyderabaddata science online training in hyderabad
data science online training in hyderabadVamsiNihal
 
Best data science training in Hyderabad
Best data science training in HyderabadBest data science training in Hyderabad
Best data science training in HyderabadKumarNaik21
 
Data science training Hyderabad
Data science training HyderabadData science training Hyderabad
Data science training HyderabadNithinsunil1
 
Data science training in hyd ppt converted (1)
Data science training in hyd ppt converted (1)Data science training in hyd ppt converted (1)
Data science training in hyd ppt converted (1)SayyedYusufali
 
Data science training in hyd pdf converted (1)
Data science training in hyd pdf converted (1)Data science training in hyd pdf converted (1)
Data science training in hyd pdf converted (1)SayyedYusufali
 
Data science training in hydpdf converted (1)
Data science training in hydpdf  converted (1)Data science training in hydpdf  converted (1)
Data science training in hydpdf converted (1)SayyedYusufali
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Denodo
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification courseKumarNaik21
 
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docx
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docxDATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docx
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docxrandyburney60861
 
How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)Denodo
 
MODULE 1_Introduction to Data analytics and life cycle..pptx
MODULE 1_Introduction to Data analytics and life cycle..pptxMODULE 1_Introduction to Data analytics and life cycle..pptx
MODULE 1_Introduction to Data analytics and life cycle..pptxnikshaikh786
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...Denodo
 
Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Denodo
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackAnant Corporation
 
Team Data Science Process Presentation (TDSP), Aug 29, 2017
Team Data Science Process Presentation (TDSP), Aug 29, 2017Team Data Science Process Presentation (TDSP), Aug 29, 2017
Team Data Science Process Presentation (TDSP), Aug 29, 2017Debraj GuhaThakurta
 
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Betacowork
 

Similar to 1. Overview_of_data_analytics (1).pdf (20)

Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)
 
data science training and placement
data science training and placementdata science training and placement
data science training and placement
 
online data science training
online data science trainingonline data science training
online data science training
 
Data science online training in hyderabad
Data science online training in hyderabadData science online training in hyderabad
Data science online training in hyderabad
 
data science online training in hyderabad
data science online training in hyderabaddata science online training in hyderabad
data science online training in hyderabad
 
Best data science training in Hyderabad
Best data science training in HyderabadBest data science training in Hyderabad
Best data science training in Hyderabad
 
Data science training Hyderabad
Data science training HyderabadData science training Hyderabad
Data science training Hyderabad
 
Data science training in hyd ppt converted (1)
Data science training in hyd ppt converted (1)Data science training in hyd ppt converted (1)
Data science training in hyd ppt converted (1)
 
Data science training in hyd pdf converted (1)
Data science training in hyd pdf converted (1)Data science training in hyd pdf converted (1)
Data science training in hyd pdf converted (1)
 
Data science training in hydpdf converted (1)
Data science training in hydpdf  converted (1)Data science training in hydpdf  converted (1)
Data science training in hydpdf converted (1)
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification course
 
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docx
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docxDATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docx
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docx
 
How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)
 
MODULE 1_Introduction to Data analytics and life cycle..pptx
MODULE 1_Introduction to Data analytics and life cycle..pptxMODULE 1_Introduction to Data analytics and life cycle..pptx
MODULE 1_Introduction to Data analytics and life cycle..pptx
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 
Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
 
Team Data Science Process Presentation (TDSP), Aug 29, 2017
Team Data Science Process Presentation (TDSP), Aug 29, 2017Team Data Science Process Presentation (TDSP), Aug 29, 2017
Team Data Science Process Presentation (TDSP), Aug 29, 2017
 
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez
 

Recently uploaded

原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 

Recently uploaded (20)

原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 

1. Overview_of_data_analytics (1).pdf

  • 1. 1 1 Overview of Data Analytics Processes Demeke A. Ayele (demekeayele@gmail.com) School of Computer Science Emeraled International School
  • 2. Data Analytics “...what enables the wise commander to strike and conquer, and achieve things beyond the reach of ordinary men, is foreknowledge. Now, this foreknowledge cannot be elicited from spirits...” The Art of War, Sun Tzu (as seen in Giles, 1994) 2 Peter Drucker first spoke of the “knowledge economy” (1969). The knowledge economy refers to the use of knowledge “to generate tangible and intangible value.”Nearly 50 years later, organizations have virtually transformed them-selves to meet this challenge, and data and analytics have become central to that transformation.
  • 3. Data Analytics ◆ What is data? ● At its most general form, data is simply infor-mation that has been stored for later use. ◆ What is analytics? ● A structured approach to data-driven problem solving—one that helps us break up problems through careful consideration of the facts. 3 But, analytics is one of the most overused yet least understood terms in business. For some, it relates to the technologies used to“beat data into submission,” or it is simply an extension of business intelligence and data warehousing. And yet for others, it relates to the statistical, mathematical, or quantitative methods used in the development of models.
  • 4. Goal ◆ To introduce you the basic data science and analytics technologies ● Enables you to know the practices used and the potential use of large scale data analytics. ● To ascertain you to know the fundamental techniques and tools used to design and analyze large volumes of data. ● Enables you to practice the analysis and visualization of big data ● It enables you to understand the challenges in big data and possible technological solutions 4
  • 5. Outline 5 ◆ Course title: Data Exploration and Visulization ● Chapter 1: Overview of Data Anlytics, technologies and techniques ● Chapter 2: Data analytics methods and Tools such as Python and R Programming, and associated technologies ● Chapter 3: (Big) Data analytics platforms and environments, such as hadoop and spark ecosystem, and associated technologies and applications ● Chapter 4: Data analytics techninques and Hadoop and spark ecosystem technologies ● Reference: Lots of books on the web! ask me also! ● Course Evaluation: Project (50%), Final Exam(50%)
  • 6. Scope of this Course ◆ The theoritical and practical part of each contributing technologies, programming models, tools, configurations and techniques will be discussed in the lecture session ◆ tool and package installations, configurations, various programmings and experimentations will be dealt by the students ◆ Critical topic reviews and programming assignments will be given to the students 6
  • 8. Data Analytics Process (Lifecycle) ◆ The volume of data has exploded to unimaginable levels in the past decade, ◆ In contrast, the price of data storage has systematically reduced. ◆ Private companies and research institutions capture terabytes of data about their users’ interactions, business, social media, and also sensors from devices such as mobile phones and automobiles. ◆ The challenge is to make sense of this ocean of data. This is where data analytics comes into picture. 8
  • 9. Data Analytics Process (Lifecycle) 9 ◆ Data Analytics largely involves collecting data from different sources, munge it in a way that it becomes available to be consumed by analysts and finally deliver data products useful to the organization's business. ◆ The process of converting large amounts of unstructured raw data, retrieved from different sources, to a data product useful for organizations forms the core of Data Analytics.
  • 10. Traditional Data Mining Process (Life Cycle) ◆ In order to provide a framework to organize the work needed by an organization and deliver clear insights from Big Data, it’s useful to think of as a cycle/process with different stages. ◆ It is by no means linear, meaning all the stages are related with each other. This cycle has superficial similarities with the more traditional data mining cycle as described in CRISP methodology. 10
  • 11. Data Analytics - Data Life Cycle 11 ◆ Data science projects differ from BI projects: ● More exploratory in nature ● Critical to have a project process ● Participants should be thorough and rigorous ◆ Break large projects into smaller pieces ◆ Spend more time to plan and scope the work ◆ Documenting adds rigor and credibility…
  • 12. Data Analytics Process (Lifecycle) 12 ◆ Data Analytics Lifecycle Overview ●Phase 1: Discovery ●Phase 2: Data Preparation ●Phase 3: Model Planning ●Phase 4: Model Building Phase ●Phase 5: Communicate Results ●Phase 6: Operationalize ◆ Case Study: GINA
  • 13. Data Analytics Lifecycle Overview 13 • The data analytic lifecycle is designed for Big Data problems and data science projects • With six phases, the project work can occur in several phases simultaneously • The cycle is iterative to portray a real project • Work can return to earlier phases as new information is uncovered
  • 14. Key Roles for a Successful Analytics Project 14
  • 15. 15 ◆ Key Roles for a Successful Analytics Project ● Business User – understands the domain area ● Project Sponsor – provides requirements ● Project Manager – ensures meeting objectives ● Business Intelligence Analyst – provides business domain expertise based on deep understanding of the data ● Database Administrator (DBA) – creates DB environment ● Data Engineer – provides technical skills, assists data management and extraction, supports analytic sandbox ● Data Scientist – provides analytic techniques and modeling Data Analytics Lifecycle Overview
  • 16. 16 ◆ Background and Overview of Data Analytics Lifecycle ● Data Analytics Lifecycle defines the analytics process and best practices from discovery to project completion ◆ The Lifecycle employs aspects of ◆ Scientific method ◆ Cross Industry Standard Process for Data Mining (CRISP-DM) - Process model for data mining ◆ Davenport’s DELTA framework ◆ Hubbard’s Applied Information Economics (AIE) approach ◆ MAD Skills: New Analysis Practices for Big Data by Cohen et al. Data Analytics Lifecycle Overview
  • 17. Overview of Data Analytics Lifecycle 17
  • 19. 19 Data Analytics Lifecycle: Discovery ◆ Learning the Business Domain ◆ Resources ◆ Framing the Problem ◆ Identifying Key Stakeholders In data discovery, stakeholders constantly analyze the business trends, similar data analytics case studies, and domain of the business industry. An assessment is done on the in-house resources, in-house infrastructure, and technology. Once the evaluation is complete, the stakeholders begin to build the hypothesis for resolving the critical business conundrums keeping in view of the customer experience and market. ◆ Interviewing the Analytics Sponsor ◆ Developing Initial Hypotheses ◆ Identifying Potential Data Sources
  • 20. What, how, when and why? tools, packages, platforms, environments, techniques and technologies we will use for Discovery phase? 20
  • 21. Phase 2: Data Preparation 21
  • 22. 22 Data Analytics Lifecycle: Data Preparation ◆ Includes steps to explore, preprocess, and condition data ◆ Create robust environment – analytics sandbox ◆ Data preparation tends to be the most labor-intensive step in the analytics lifecycle - often at least 50% of the data science project’s time ◆ The data preparation phase is generally the most iterative and the one that teams tend to underestimate most often
  • 23. 23 Data Analytics Lifecycle: Data Preparation ◆ Preparing the Analytic Sandbox ● Create the analytic sandbox (also called workspace) ● Allows team to explore data without interfering with live production data ● Sandbox collects all kinds of data (expansive approach) ● The sandbox allows organizations to undertake ambitious projects beyond traditional data analysis and BI to perform advanced predictive analytics ● Although the concept of an analytics sandbox is relatively new, this concept has become acceptable to data science teams and IT groups
  • 24. 24 ◆ Performing ETL (Extract, Transform, Load, ) ● In ETL, users perform extract, transform, load ● In the sandbox, the process is often ELT – early load preserves the raw data which can be useful to examine ● EG. in credit card fraud detection, outliers can represent high-risk transactions that might be inadvertently filtered out or transformed before being loaded into the database ● Hadoop is often used here Data Analytics Lifecycle: Data Preparation
  • 25. 25 ◆ Modern technology has changed most organizations’ approach to ELT, for several reasons. The biggest is the advent of powerful analytics warehouses like Amazon Redshift and Google BigQuery. These newer cloud-based analytics databases have the horsepower to perform transformations in place rather than requiring a special staging area. Data Analytics Lifecycle: Data Preparation
  • 26. 26 ◆ Learning about the Data Data Analytics Lifecycle: Data Preparation ● Becoming familiar with the data is critical ● This activity accomplishes several goals: - Determines the data available to the team early in the project - Highlights gaps – identifies data not currently available - Identifies data outside the organization that might be useful
  • 27. ◆ Learning about the Data Sample Dataset Inventory 27 Data Analytics Lifecycle: Data Preparation
  • 28. 28 ◆ Data Conditioning ● includes cleaning data, normalizing datasets, and performing transformations ● Often viewed as a preprocessing step prior to data analysis, it might be performed by data owner, IT department, DBA, etc. ● Best to have data scientists involved ● Data science teams prefer more data than too little Data Analytics Lifecycle: Data Preparation
  • 29. 29 ◆ Data Conditioning Additional questions and considerations ● What are the data sources? Target fields? ● How clean is the data? ● How consistent are the contents and files? Missing or inconsistent values? ● Assess the consistence of the data types – numeric, alphanumeric? ● Review the contents to ensure the data makes sense ● Look for evidence of systematic error Data Analytics Lifecycle: Data Preparation
  • 30. 30 Data Analytics Lifecycle: Data Preparation ◆ Survey and Visualize ● Leverage data visualization tools to gain an overview of the data ● Shneiderman’s mantra: - “Overview first, zoom and filter, then details-on- demand” - This enables the user to find areas of interest, zoom and filter to find more detailed information about a particular area, then find the detailed data in that area
  • 31. 31 ◆ Survey and Visualize Guidelines and Considerations ● Review data to ensure calculations are consistent ● Does the data distribution stay consistent? ● Assess the granularity of the data, the range of values, and the level of aggregation of the data ● Does the data represent the population of interest? ● Check time-related variables – daily, weekly, monthly? Is this good enough? ● Is the data standardized/normalized? Scales consistent? ● For geospatial datasets, are state/country abbreviations consistent Data Analytics Lifecycle: Data Preparation
  • 32. 32 Data Analytics Lifecycle: Data Preparation ◆ Common Tools for Data Preparation ● Hadoop can perform parallel ingest and analysis ● Alpine Miner provides a graphical user interface for creating analytic workflows ● OpenRefine (formerly Google Refine) is a free, open source tool for working with messy data ● Similar to OpenRefine, Data Wrangler is an interactive tool for data cleansing and transformation
  • 33. 33 What, how, when and why? tools, packages, platforms, environments, techniques and technologies we will use for Data Preparation phase?
  • 34. Phase 3: Model Planning 34
  • 35. 35 Data Analytics Lifecycle: Model Planning ◆ Activities to consider ● Assess the structure of the data – this dictates the tools and analytic techniques for the next phase ● Ensure the analytic techniques - enable the team to meet the business objectives and accept or reject the working hypotheses ● Determine if the situation warrants a single model or a series of techniques as part of a larger analytic workflow ● Research and understand how other analysts have approached this kind or similar kind of problem
  • 36. 36 Data Analytics Lifecycle: Model Planning ◆ Model Planning in Industry Verticals ● Example of other analysts approaching a similar problem
  • 37. ◆ Data Exploration and Variable Selection ● Explore the data to understand the relationships among the variables to inform selection of the variables and methods ● A common way to do this is to use data visualization tools ● Often, stakeholders and subject matter experts may have ideas - for example, some hypothesis that led to the project ● Aim for capturing the most essential predictors and variables - this often requires iterations and testing to identify key variables ● If the team plans to run regression analysis, identify the candidate predictors and outcome 37 Data Analytics Lifecycle: Model Planning
  • 38. 38 ◆ Model Selection ● The main goal is to choose an analytical technique, or several candidates, based on the end goal of the project ● We observe events in the real world and attempt to construct models that emulate this behavior with a set of rules and conditions - a model is simply an abstraction from reality ● Determine whether to use techniques best suited for structured data, unstructured data, or a hybrid approach ● Teams often create initial models using statistical software packages such as R, SAS, or Matlab, Which may have limitations when applied to very large datasets Data Analytics Lifecycle: Model Planning
  • 39. ◆ Common Tools for the Model Planning Phase ● R has a complete set of modeling capabilities - R contains about 5000 packages for data analysis and graphical presentation ● SQL Analysis services can perform in- database analytics of common data mining functions, involved aggregations, and basic predictive models ● SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple 39 Data Analytics Lifecycle: Model Planning
  • 40. 40 What, ho, when and why? tools, packages, platforms, environments, techniques and technologies we will use for model plannng phase?
  • 41. Phase 4: Model Building 41
  • 42. ◆ Execute the models defined in Phase 3 ◆ Develop datasets for training, testing, and production ◆ Develop analytic model on training data, test on test data ◆ Questions to consider ● Does the model appear valid and accurate on the test data? ● Does the model output/behavior make sense to the domain experts? ● Do the parameter values make sense in the context of the domain? ● Is the model sufficiently accurate to meet the goal? ● Does the model avoid intolerable mistakes? ● Are more data or inputs needed? ● Will the kind of model chosen support the runtime 42 Data Analytics Lifecycle: Model Building
  • 43. ◆ Common Tools for the Model Building Phase ●Commercial Tools - SAS Enterprise Miner – built for enterprise- level computing and analytics - SPSS Modeler (IBM) – provides enterprise- level computing and analytics - Matlab – high-level language for data analytics, algorithms, data exploration - Alpine Miner – provides GUI frontend for backend analytics tools - STATISTICA and MATHEMATICA – popular data mining and analytics tools 43 Data Analytics Lifecycle: Model Building
  • 44. ◆ Free or Open Source Tools - R and PL/R - PL/R is a procedural language for PostgreSQL with R - Octave – language for computational modeling - WEKA – data mining software package with analytic workbench - Python – language providing toolkits for machine learning and analysis - SQL – in-database implementations provide an alternative tool (see Chap 11) 44 Data Analytics Lifecycle: Model Building
  • 45. 45 What, ho, when and why? tools, packages, platforms, environments, techniques and technologies we will use for model building phase?
  • 46. Phase 5: Communicate Results 46
  • 47. Data Analytics Lifecycle: Communicating the Results ◆ Determine if the team succeeded or failed in its objectives ◆ Assess if the results are statistically significant and valid ● If so, identify aspects of the results that present salient findings ● Identify surprising results and those in line with the hypotheses ◆ Communicate and document the key findings and major insights derived from the analysis ● This is the most visible portion of the 47
  • 48. 48 What, ho, when and why? tools, packages, platforms, environments, techniques and technologies we will use for result communicating phase?
  • 50. Data Analytics Lifecycle: Operationalize ◆ In this last phase, the team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way ◆ Risk is managed effectively by undertaking small scope, pilot deployment before a wide- scale rollout ◆ During the pilot project, the team may need to execute the algorithm more efficiently in the database rather than with in-memory tools like R, especially with larger datasets ◆ To test the model in a live setting, consider running the model in a production environment for a discrete set of products or a single line of 50
  • 51. Data Analytics Lifecycle: Operationalize 51 ◆ Key outputs from successful analytics project
  • 52. 52 ◆ Key outputs from successful analytics project ● Business user – tries to determine business benefits and implications ● Project sponsor – wants business impact, risks, ROI ● Project manager – needs to determine if project completed on time, within budget, goals met ● Business intelligence analyst – needs to know if reports and dashboards will be impacted and need to change ● Data engineer and DBA – must share code and document Data Analytics Lifecycle: Operationalize
  • 53. ◆ Four main deliverables - Although the seven roles represent many interests, the interests overlap and can be met with four main deliverables ● Presentation for project sponsors – high- level takeaways for executive level stakeholders ● Presentation for analysts – describes business process changes and reporting changes, includes details and technical graphs ● Code for technical people 53 Data Analytics Lifecycle: Operationalize
  • 54. 54 What, how, when and why? tools, packages, platforms, environments, techniques and technologies we will use for Operationalizing phase?
  • 55. Case Study: Global Innovation Network and Analysis (GINA) 55
  • 56. Case Study: GINA 56 ◆ In 2012, EMC’s new director wanted to improve the company’s engagement of employees across the global centers of excellence (GCE) to drive innovation, research, and university partnerships ◆ This project was created to accomplish ● Store formal and informal data ● Track research from global technologists ● Mine the data for patterns and insights to improve the team’s operations and strategy
  • 57. Phase 1: Discovery 57 ◆ Team members and roles ● Business user, project sponsor, project manager – Vice President from Office of CTO ● BI analyst – person from IT ● Data engineer and DBA – people from IT ● Data scientist – distinguished engineer ◆ The data fell into two categories ● Five years of idea submissions from internal innovation contests ● Minutes and notes representing innovation and research activity from around the
  • 58. Phase 1: Discovery ◆ Hypotheses grouped into two categories ● Descriptive analytics of what is happening to spark further creativity, collaboration, and asset generation ● Predictive analytics to advise executive management of where it should be investing in the future 58
  • 59. Phase 2: Data Preparation 59 ◆ Set up an analytics sandbox ◆ Discovered that certain data needed conditioning and normalization and that missing datasets were critical ◆ Team recognized that poor quality data could impact subsequent steps ◆ They discovered many names were misspelled and problems with extra spaces ◆ These seemingly small problems had to be addressed
  • 60. Phase 3: Model Planning 60 ◆ The study included the following considerations ● Identify the right milestones to achieve the goals ● Trace how people move ideas from each milestone toward the goal ● Tract ideas that die and others that reach the goal ● Compare times and outcomes using a few different methods
  • 61. Phase 4: Model Building 61 ◆ Several analytic method were employed ● NLP on textual descriptions ● Social network analysis using R and Rstudio ● Developed social graphs and visualizations
  • 62. 62 ◆ Social graph of data submitters and finalists Phase 4: Model Building
  • 63. ◆ Social graph of top innovation influencers 63 Phase 4: Model Building
  • 64. Phase 5: Communicate Results 64 ◆ Study was successful in in identifying hidden innovators ● Found high density of innovators in Cork, Ireland ◆ The CTO office launched longitudinal studies
  • 65. Phase 6: Operationalize 65 ◆ Deployment was not really discussed ◆ Key findings ● Need more data in future ● Some data were sensitive ● A parallel initiative needs to be created to improve basic BI activities ● A mechanism is needed to continually reevaluate the model after deployment
  • 67. Summary 67 ◆ The Data Analytics Lifecycle is an approach to managing and executing analytic projects ◆ Lifecycle has six phases ◆ Bulk of the time usually spent on preparation – phases 1 and 2 ◆ Seven roles needed for a data science team
  • 68. Theme of the course 68 What Techniques, Tools and Technologies are required at each phase of the data analytics Process? and why they are required there? how and when they will be used?
  • 69. Some of the technologies you need to explorer by yourself as a data science professional: 69 ◆ File system technologies and techniques ◆ Distributed processing and storage ◆ Parallel processing technologies and techniques ◆ Cloud, Grid and Utility computing technologies and techniques ◆ Data centers and virtualizations