INTRODUCTION TO BIG DATA
ANALYTICS
Utkarsh Sharma
Asst. Prof. (CSE)
Jaypee University Of Engineering & Technology
Big Data Overview
Several industries have led the way in developing their ability to
gather and exploit data:
• Credit card companies monitor every purchase their customers make and
can identify fraudulent purchases with a high degree of accuracy using
rules derived by processing billions of transactions.
• Mobile phone companies analyze subscribers' calling patterns to
determine whether an attractive promotion from a rival network might
cause a subscriber to defect.
• For companies such as LinkedIn and Facebook, data itself is their primary
product.
Big Data Overview
Three attributes stand out as defining Big Data characteristics:
• Huge volume of data: Rather than thousands or millions of rows, Big Data can be billions of rows
and millions of columns.
• Complexity of data types and structures: Big Data reflects the variety of new data sources, formats,
and structures, including digital traces being left on the web and other digital repositories for
subsequent analysis.
• Speed of new data creation and growth: Big Data can describe high-velocity data, with rapid data
ingestion and near-real-time analysis.
Another definition of Big Data comes from a 2011 McKinsey Global Institute report:
• Big Data is data whose scale, distribution, diversity, and/or timeliness
require the use of new technical architectures and analytics to enable
insights that unlock new sources of business value.
McKinsey's definition of Big Data implies that organizations will need new data architectures and
analytic sandboxes, new tools, new analytical methods, and an integration of multiple skills into the
new role of the data scientist.
Data Deluge
An Example (Genomic sequencing)
While the volume of data has grown, the cost of generating it has fallen dramatically. The cost to sequence one
human genome fell from $100 million in 2001 to $10,000 in 2011, and the cost continues to drop. Websites
such as 23andMe now offer genotyping for less than $100.
Data Structures
• Big Data can come in multiple forms, including structured and
unstructured data such as financial data, text files, multimedia
files, and genetic mappings.
• Most Big Data is unstructured or semi-structured in
nature, which requires different techniques and tools to process
and analyze.
• Distributed computing environments and massively parallel
processing (MPP) architectures that enable parallelized data
ingest and analysis are the preferred approach to process such
complex data.
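The partition, process, and combine pattern behind parallel processing can be sketched on a single machine. This is only an illustration under stated assumptions: the input lines and the `count_words` helper are hypothetical, and a thread pool on one machine stands in for the many nodes of a real MPP or distributed system.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Process one partition: count the words in its lines."""
    return sum(len(line.split()) for line in chunk)

# Hypothetical input; in practice this would be data spread across many nodes.
lines = [
    "big data analytics",
    "massively parallel processing",
    "data ingest",
    "analysis",
]

# Partition the data into chunks of two lines each.
chunks = [lines[i:i + 2] for i in range(0, len(lines), 2)]

# Process the partitions concurrently, then combine the partial results.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_words, chunks))

total_words = sum(partial_counts)
print(partial_counts, total_words)
```

The combine step works here because a word count is additive across partitions; analyses that are not additive need a different merge step.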
Data Structures
Structured Data
• Data containing a defined data type, format, and structure (for example, transaction data, online analytical
processing [OLAP] data cubes, traditional RDBMS, CSV files, and even simple spreadsheets).
Semi-structured data
• Textual data files with a discernible pattern that enables parsing (such as Extensible Markup
Language [XML] data files that are self-describing and defined by an XML schema).
Quasi-structured data
• Textual data with erratic data formats that can be formatted with effort, tools, and time (for instance,
web clickstream data that may contain inconsistencies in data values and formats).
• Consider the following example. A user attends the EMC World conference and subsequently runs
a Google search online to find information related to EMC and Data Science. This would produce a
URL such as https://www.google.com/#q=EMC+data+science
• After doing this search, the user may choose the second link, to read more about the headline "Data
Scientist - EMC Education, Training, and Certification." This brings the user to an emc.com site
focused on this topic and a new URL, https://education.emc.com/guest/campaign/data_science.aspx
• Arriving at this site, the user may decide to click to learn more about the process of becoming
certified in data science. The user chooses a link toward the top of the page on Certifications,
bringing the user to a new URL:
https://education.emc.com/guest/certification/framework/stf/data_science.aspx
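As a rough sketch of how quasi-structured clickstream data can be given structure, the three URLs from the example (in cleaned-up form) can be parsed with Python's standard `urllib.parse` module. The `parse_click` helper and the output field names are assumptions made for illustration, not part of the original example.

```python
from urllib.parse import urlparse, parse_qs

# Cleaned-up forms of the three URLs in the example, in visiting order.
clickstream = [
    "https://www.google.com/#q=EMC+data+science",
    "https://education.emc.com/guest/campaign/data_science.aspx",
    "https://education.emc.com/guest/certification/framework/stf/data_science.aspx",
]

def parse_click(url):
    """Turn one raw URL into a small structured record (hypothetical schema)."""
    parts = urlparse(url)
    # The search URL carries its query after '#', so fall back to the fragment.
    raw_query = parts.query or parts.fragment
    terms = parse_qs(raw_query).get("q", [])
    return {
        "host": parts.netloc,
        "path_segments": [seg for seg in parts.path.split("/") if seg],
        "search_terms": terms[0].split() if terms else [],
    }

records = [parse_click(url) for url in clickstream]
for record in records:
    print(record)
```

Real clickstream logs are messier than this (inconsistent encodings, missing fields), which is why the slide notes that formatting them takes effort, tools, and time.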
Unstructured data
• Data that has no inherent structure, which may include text
documents, PDFs, images, and video.
• All of these heterogeneous types of data structures created the need
for specialized data storage and retrieval techniques, such as
data warehouses and analytics sandboxes.
Data Warehouse
• A data warehouse is a central repository of information that can be analyzed to make more informed
decisions.
• Data flows into a data warehouse from transactional systems, relational databases, and other sources,
typically on a regular cadence.
• Business analysts, data engineers, data scientists, and decision makers access the data
through business intelligence (BI) tools, SQL clients, and other analytics applications.
Intro. to Data Warehouse
• The term "Data Warehouse" was coined by Bill Inmon in 1990. According to Inmon, a data
warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This
data helps analysts make informed decisions in an organization.
• An operational database undergoes frequent changes on a daily basis because of the
transactions that take place, whereas a data warehouse also keeps historical data.
• A data warehouse provides generalized and consolidated data in a multidimensional view.
Along with this generalized and consolidated view of data, a data warehouse also provides Online
Analytical Processing (OLAP) tools.
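A consolidated, multidimensional view can be illustrated with a small aggregation query. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `sales` table, its columns, and its rows are all hypothetical, and a real warehouse would use a dedicated engine rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (year INTEGER, region TEXT, product TEXT, amount REAL)"
)
# Hypothetical historical rows loaded from operational systems.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (2022, "North", "A", 100.0),
        (2022, "North", "B", 50.0),
        (2022, "South", "A", 70.0),
        (2023, "North", "A", 120.0),
        (2023, "South", "B", 90.0),
    ],
)

# Consolidate individual transactions along two dimensions: year and region.
summary = conn.execute(
    "SELECT year, region, SUM(amount) FROM sales "
    "GROUP BY year, region ORDER BY year, region"
).fetchall()

for row in summary:
    print(row)
conn.close()
```

Grouping by a different pair of columns, such as `year` and `product`, gives another slice of the same data, which is the idea behind OLAP-style analysis.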
Understanding a Data Warehouse
• A data warehouse is a database, which is kept separate from the organization's operational
database.
• A data warehouse is not updated frequently.
• It holds consolidated historical data, which helps the organization analyze its business.
• A data warehouse helps executives organize, understand, and use their data to make strategic
decisions.
• Data warehouse systems help integrate diverse application systems.
• A data warehouse system supports consolidated historical data analysis.
Analytics sandbox
• A workspace in which data assets are gathered from multiple sources
and technologies for analysis.
• To lessen the performance burden of the analysis, the workspace may
use in-database processing and is considered to be owned by the
analysts rather than database administrators.
• Often, this workspace is created by using a sampling of the dataset
rather than the entire dataset.
• The sandbox may also reduce the stove-piped and partial versions of
the true data that may have been developed in business units.
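Building a sandbox from a sample rather than the full dataset might look like the sketch below. The `sample_rows` helper, the 1% fraction, and the synthetic customer records are assumptions for illustration; simple random sampling is only one of several possible sampling schemes.

```python
import random

def sample_rows(rows, fraction, seed=0):
    """Return a simple random sample of the rows (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)

# Synthetic stand-in for a large production dataset.
full_dataset = [{"customer_id": i, "spend": i % 50} for i in range(10_000)]

# Keep 1% of the rows for exploratory work in the sandbox.
sandbox = sample_rows(full_dataset, 0.01)
print(len(sandbox))
```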
Analytics sandbox
Types of Data Repositories
Business Intelligence vs Data Science
Examples of Big Data Analytics
• As mentioned earlier, Big Data presents many opportunities to improve sales and marketing
analytics.
• An example of this is the U.S. retailer Target. After analysing consumer purchasing behavior,
Target's statisticians determined that the retailer made a great deal of money from three main life-
event situations.
• Marriage, when people tend to buy many new products.
• Divorce, when people buy new products and change their spending habits.
• Pregnancy, when people have many new things to buy and have an urgency to buy them.
• Target determined that the most lucrative of these life-events is the third situation: pregnancy. Using
data collected from shoppers, Target was able to identify this fact and predict which of its shoppers
were pregnant. In one case, Target knew a female shopper was pregnant even before her family
knew.
Data Science Project Lifecycle
Data Science Project Lifecycle
• 1. Obtain Data
• Skills required
• You will need to know how to query databases such as MySQL, PostgreSQL, or MongoDB.
• 2. Scrub Data
• Skills required
• You will need scripting tools like Python or R to help you to scrub the data.
• 3. Explore Data
• Skills required
• If you are using Python, you will need NumPy, Matplotlib, pandas, or SciPy; if you are using R,
ggplot2 or dplyr, the Swiss Army knife of data exploration. On top of that, you need knowledge
of and skills in inferential statistics and data visualization.
• 4. Model Data
• Skills required
• You will need to know both supervised and unsupervised machine learning algorithms.
• 5. Interpret Data
• Skills required
• You will need strong business domain knowledge to present your findings in a way that can
answer the business questions you set out to answer.
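The "Scrub Data" step can be sketched in plain Python. The records, field names, and cleaning rules below are hypothetical; real projects often use pandas for this, but the standard library is enough to show the idea.

```python
# Hypothetical raw records with stray whitespace, inconsistent casing,
# missing values, and a duplicate.
raw = [
    {"name": " Alice ", "age": "34"},
    {"name": "bob", "age": ""},
    {"name": "Alice", "age": "34"},
    {"name": "Carol", "age": "n/a"},
]

def scrub(records):
    """Normalize names, convert ages to int or None, and drop duplicates."""
    cleaned, seen = [], set()
    for rec in records:
        name = rec["name"].strip().title()
        age = int(rec["age"]) if rec["age"].isdigit() else None
        key = (name, age)
        if key in seen:
            continue  # skip rows that duplicate an earlier cleaned row
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned

clean = scrub(raw)
print(clean)
```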
The Analytics Process
An analytics process contains all or some of the following phases:
• Business understanding: Identifying and understanding the business objectives.
• Data collection: Collecting data from different sources and representing it
in terms of its application.
• Data preparation: Removing unnecessary and unwanted data.
• Data modelling: Creating a model to analyse the relationships between
the objects.
• Data evaluation: Evaluating the results and preparing the analysis report.
• Deployment: Finalizing the plan for deployment.
Types of Analytics
On the basis of problem description, four types of data analytics are used:
• Descriptive Analytics
• Diagnostic Analytics
• Predictive Analytics
• Prescriptive Analytics
Descriptive analytics: What is happening?
• This is the most common form of analytics. In business, it gives the analyst a view of
key metrics and measures within the business.
• Descriptive analytics combines raw data from
multiple data sources to give valuable insights
into the past.
• However, these findings simply signal that something
is wrong or right, without explaining why.
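A minimal sketch of descriptive analytics is a handful of summary statistics over past data. The daily sales figures below are made up for illustration; the functions come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical daily sales for one week.
daily_sales = [120, 135, 128, 90, 150, 142, 131]

summary = {
    "mean": statistics.mean(daily_sales),
    "median": statistics.median(daily_sales),
    "min": min(daily_sales),
    "max": max(daily_sales),
}
print(summary)
```

The summary shows that one day (90) sits well below the rest, but says nothing about why.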
Diagnostic: Why is it happening?
• At this stage, historical data can be measured against other data to answer the question
of why something happened.
• Diagnostic analytics gives in-depth insights into a
particular problem.
• Starting from the descriptive data, diagnostic
analytical tools let an analyst drill down
and isolate the root cause of a problem.
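Drilling down can be as simple as breaking a headline number into its components and comparing periods. In this sketch the regions and sales figures are hypothetical, and the "root cause" is only narrowed to a region, not fully explained.

```python
# Hypothetical weekly sales by region.
last_week = {"North": 400, "South": 380, "East": 410}
this_week = {"North": 395, "South": 210, "East": 405}

# Drill down from the overall drop to the change in each region.
changes = {region: this_week[region] - last_week[region] for region in last_week}

# The region with the largest decline is where to look first.
worst_region = min(changes, key=changes.get)
print(changes, worst_region)
```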
Predictive: What is likely to happen?
• Predictive analytics tells what is likely to happen. It uses the findings
of descriptive and diagnostic analytics to detect clusters and
exceptions, and to predict future trends.
• Predictive models typically use
a variety of variable data to make
the prediction.
• Predictive analytics is a form of
advanced analytics and enables
sophisticated analysis based on
machine learning or deep learning.
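One of the simplest predictive models is a straight line fitted by least squares and extended one step into the future. The data points and the `fit_line` helper below are hypothetical; real predictive work would typically use a library and validate the model on held-out data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: month number and units sold.
months = [1, 2, 3, 4, 5]
units = [10, 12, 14, 16, 18]

slope, intercept = fit_line(months, units)
forecast_month_6 = slope * 6 + intercept
print(slope, intercept, forecast_month_6)
```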
Prescriptive: What do I need to do?
• The purpose of prescriptive analytics is to literally prescribe what action to take to
eliminate a future problem or take full advantage of a promising trend.
• The prescriptive model utilizes an understanding of what has
happened, why it has happened and a variety of
“what-might-happen” analysis to help the user determine
the best course of action to take.
• Prescriptive analytics requires not
only historical internal data but also external information, because
of the nature of the algorithms it is based on.
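A toy version of prescribing an action is to score each candidate action by its expected value and pick the best. Every number below (retention probabilities, costs, customer value) is invented for illustration; in practice the probabilities would come from a predictive model.

```python
# Hypothetical candidate actions for a customer at risk of leaving.
actions = {
    "no_offer": {"retain_prob": 0.60, "cost": 0.0},
    "discount": {"retain_prob": 0.80, "cost": 15.0},
    "free_upgrade": {"retain_prob": 0.85, "cost": 30.0},
}
customer_value = 100.0  # assumed value of keeping the customer

def expected_value(action):
    """Expected payoff: chance of retention times value, minus the action's cost."""
    return action["retain_prob"] * customer_value - action["cost"]

scores = {name: expected_value(spec) for name, spec in actions.items()}
best_action = max(scores, key=scores.get)
print(scores, best_action)
```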
Big Data Analytics (One more categorization)
• Basic Analytics
Slicing & Dicing
Basic monitoring
Anomaly identification
• Advanced Analytics
Predictive Modelling
Text Analytics
Statistics and data mining algorithms
• Operational Analytics
• Monetized Analytics
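Anomaly identification, listed under basic analytics, can be sketched with a z-score rule: flag values that lie far from the mean in units of standard deviation. The readings and the threshold of 2.0 are assumptions; other detection methods exist and may suit skewed data better.

```python
import statistics

# Hypothetical metric readings with one obvious outlier.
readings = [10, 11, 9, 10, 12, 10, 50]

mean = statistics.mean(readings)
stdev = statistics.pstdev(readings)  # population standard deviation

threshold = 2.0  # assumed cut-off; tune for the data at hand
anomalies = [x for x in readings if abs(x - mean) / stdev > threshold]
print(anomalies)
```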
Data Analytics Lifecycle
Brief Overview
• The Data Analytics Lifecycle is designed specifically for Big Data problems and data
science projects.
• The lifecycle has six phases, and project work can occur in several phases at once.
• For most phases in the lifecycle, the movement can be either forward or backward.
• In recent years, substantial attention has been placed on the emerging role of the data
scientist.
• Despite this strong focus on the emerging role of the data scientist specifically, there are
actually seven key roles that need to be fulfilled for a high-functioning data science team
to execute analytic projects successfully.
Key Roles for a Successful Analytics Project
• For a small, versatile team, the seven roles may be fulfilled by as few as three people, but a very large
project may require 20 or more people. The seven roles follow:
Key Roles for a Successful Analytics Project
• Business User :- business analyst, line manager, or deep subject matter expert in the project
domain.
• Project Sponsor :- Provides the funding and gauges the degree of value from the final outputs of the working team.
• Project Manager :- Ensures that key milestones and objectives are met on time and at the expected
quality.
• Business Intelligence Analyst :- Provides business domain expertise based on a deep
understanding of the data and key performance indicators (KPIs).
• Database Administrator (DBA) :- Provisions and configures the database environment to support
the analytics needs of the working team.
• Data Engineer :- Leverages deep technical skills to assist with tuning SQL queries for data
management and data extraction, and provides support for data ingestion into the analytic sandbox.
• Data Scientist :- Provides subject matter expertise for analytical techniques, data modeling, and
applying valid analytical techniques to given business problems.
Data Analytics Lifecycle
Phase 1- Discovery
• Learning the Business Domain
• Resources
• Framing the Problem
• Identifying Key Stakeholders
• Interviewing the Analytics Sponsor
• Developing Initial Hypotheses
Phase 2: Data Preparation
• Preparing the Analytic Sandbox
• Performing ETLT
• Learning About the Data
• Data Conditioning
• Survey and Visualize
Phase 3: Model Planning
• Data Exploration and Variable Selection
• Model Selection
Phase 4: Model Building
• The team develops data sets for testing, training, and production purposes.
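Separating training and testing data sets can be sketched as a shuffled split. The `split_train_test` helper, the 80/20 ratio, and the seed are assumptions for illustration; the source does not say how the team partitions its data.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of 100 record ids.
dataset = list(range(100))
train_set, test_set = split_train_test(dataset)
print(len(train_set), len(test_set))
```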
Phase 5: Communicate Results
• The team, in collaboration with major stakeholders, determines if the results of the project
are a success or a failure based on the criteria developed in Phase 1.
Phase 6: Operationalize
• The team delivers final reports, briefings, code, and technical documents.
• In addition, the team may run a pilot project to implement the models in a production
environment.
Key Outputs from a Successful Analytic Project
Big Data Pre-processing
• The set of techniques applied before a data mining
method is run is called data preprocessing.
• Larger amounts of collected data require more sophisticated
mechanisms to analyze them.
• Data preprocessing adapts the data to the requirements
posed by each data mining algorithm, making it possible to process data that
would otherwise be infeasible to handle.
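Two common preprocessing steps, filling in missing values and rescaling to a fixed range, can be sketched as follows. The input values are hypothetical, and mean imputation with min-max scaling is just one reasonable choice; which steps are needed depends on the data mining algorithm being used.

```python
def impute_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature column with one missing entry.
raw_feature = [2.0, None, 4.0, 6.0]

imputed = impute_mean(raw_feature)
scaled = min_max_scale(imputed)
print(imputed, scaled)
```

Note that `min_max_scale` assumes the column is not constant; a column where every value is equal would need special handling to avoid division by zero.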
Introduction to Big Data Analytics

More Related Content

What's hot

Visual Analytics in Big Data
Visual Analytics in Big DataVisual Analytics in Big Data
Visual Analytics in Big DataSaurabh Shanbhag
 
Data Analytics PowerPoint Presentation Slides
Data Analytics PowerPoint Presentation SlidesData Analytics PowerPoint Presentation Slides
Data Analytics PowerPoint Presentation SlidesSlideTeam
 
Advanced Analytics Platform for Big Data Analytics
Advanced Analytics Platform for Big Data AnalyticsAdvanced Analytics Platform for Big Data Analytics
Advanced Analytics Platform for Big Data AnalyticsArvind Sathi
 
Introduction to data analytics
Introduction to data analyticsIntroduction to data analytics
Introduction to data analyticsUmasree Raghunath
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data WarehousingEyad Manna
 
Warehouse Planning and Implementation
Warehouse Planning and ImplementationWarehouse Planning and Implementation
Warehouse Planning and ImplementationSHIKHA GAUTAM
 
Data warehouse
Data warehouseData warehouse
Data warehouseRishabh Dogra
 
Data warehousing
Data warehousingData warehousing
Data warehousingJuhi Mahajan
 
Big data unit i
Big data unit iBig data unit i
Big data unit iNavjot Kaur
 
Business analytics
Business analyticsBusiness analytics
Business analyticsDinakar nk
 
Big Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesBig Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesAshraf Uddin
 
Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture Rajesh Kumar
 
How to Strengthen Enterprise Data Governance with Data Quality
How to Strengthen Enterprise Data Governance with Data QualityHow to Strengthen Enterprise Data Governance with Data Quality
How to Strengthen Enterprise Data Governance with Data QualityDATAVERSITY
 
The Data Warehouse Lifecycle
The Data Warehouse LifecycleThe Data Warehouse Lifecycle
The Data Warehouse Lifecyclebartlowe
 

What's hot (20)

Visual Analytics in Big Data
Visual Analytics in Big DataVisual Analytics in Big Data
Visual Analytics in Big Data
 
Data Analytics PowerPoint Presentation Slides
Data Analytics PowerPoint Presentation SlidesData Analytics PowerPoint Presentation Slides
Data Analytics PowerPoint Presentation Slides
 
Advanced Analytics Platform for Big Data Analytics
Advanced Analytics Platform for Big Data AnalyticsAdvanced Analytics Platform for Big Data Analytics
Advanced Analytics Platform for Big Data Analytics
 
Introduction to data analytics
Introduction to data analyticsIntroduction to data analytics
Introduction to data analytics
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
 
Warehouse Planning and Implementation
Warehouse Planning and ImplementationWarehouse Planning and Implementation
Warehouse Planning and Implementation
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
Big data unit i
Big data unit iBig data unit i
Big data unit i
 
Business analytics
Business analyticsBusiness analytics
Business analytics
 
Big Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesBig Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture Capabilities
 
Data warehouse
Data warehouse Data warehouse
Data warehouse
 
Data Analytics Life Cycle
Data Analytics Life CycleData Analytics Life Cycle
Data Analytics Life Cycle
 
Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture
 
Data mining
Data mining Data mining
Data mining
 
How to Strengthen Enterprise Data Governance with Data Quality
How to Strengthen Enterprise Data Governance with Data QualityHow to Strengthen Enterprise Data Governance with Data Quality
How to Strengthen Enterprise Data Governance with Data Quality
 
The Data Warehouse Lifecycle
The Data Warehouse LifecycleThe Data Warehouse Lifecycle
The Data Warehouse Lifecycle
 

Similar to Introduction to Big Data Analytics

Introduction to Business and Data Analysis Undergraduate.pdf
Introduction to Business and Data Analysis Undergraduate.pdfIntroduction to Business and Data Analysis Undergraduate.pdf
Introduction to Business and Data Analysis Undergraduate.pdfAbdulrahimShaibuIssa
 
Business Analytics and Data mining.pdf
Business Analytics and Data mining.pdfBusiness Analytics and Data mining.pdf
Business Analytics and Data mining.pdfssuser0413ec
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3varshakumar21
 
Data Mining & Data Warehousing
Data Mining & Data WarehousingData Mining & Data Warehousing
Data Mining & Data WarehousingAAKANKSHA JAIN
 
TOPIC.pptx
TOPIC.pptxTOPIC.pptx
TOPIC.pptxinfinix8
 
Introductions to Business Analytics
Introductions to Business Analytics Introductions to Business Analytics
Introductions to Business Analytics Venkat .P
 
ERP technology Areas.pptx
ERP technology Areas.pptxERP technology Areas.pptx
ERP technology Areas.pptxssuserdd904d
 
9. Data Warehousing & Mining.pptx
9. Data Warehousing & Mining.pptx9. Data Warehousing & Mining.pptx
9. Data Warehousing & Mining.pptxCallplanetsDeveloper
 
Cognos datawarehouse
Cognos datawarehouseCognos datawarehouse
Cognos datawarehousessuser7fc7eb
 

Similar to Introduction to Big Data Analytics (20)

Data mining
Data miningData mining
Data mining
 
Introduction to Business and Data Analysis Undergraduate.pdf
Introduction to Business and Data Analysis Undergraduate.pdfIntroduction to Business and Data Analysis Undergraduate.pdf
Introduction to Business and Data Analysis Undergraduate.pdf
 
Ch~2.pdf
Ch~2.pdfCh~2.pdf
Ch~2.pdf
 
Business Analytics and Data mining.pdf
Business Analytics and Data mining.pdfBusiness Analytics and Data mining.pdf
Business Analytics and Data mining.pdf
 
Big data overview
Big data overviewBig data overview
Big data overview
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3
 
Big_Data.pptx
Big_Data.pptxBig_Data.pptx
Big_Data.pptx
 
Data Mining & Data Warehousing
Data Mining & Data WarehousingData Mining & Data Warehousing
Data Mining & Data Warehousing
 
TOPIC.pptx
TOPIC.pptxTOPIC.pptx
TOPIC.pptx
 
Ch_2.pdf
Ch_2.pdfCh_2.pdf
Ch_2.pdf
 
Data mining
Data miningData mining
Data mining
 
Data mining
Data miningData mining
Data mining
 
KIT601 Unit I.pptx
KIT601 Unit I.pptxKIT601 Unit I.pptx
KIT601 Unit I.pptx
 
Introductions to Business Analytics
Introductions to Business Analytics Introductions to Business Analytics
Introductions to Business Analytics
 
ERP technology Areas.pptx
ERP technology Areas.pptxERP technology Areas.pptx
ERP technology Areas.pptx
 
Abstract
AbstractAbstract
Abstract
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
9. Data Warehousing & Mining.pptx
9. Data Warehousing & Mining.pptx9. Data Warehousing & Mining.pptx
9. Data Warehousing & Mining.pptx
 
Data Science in Python.pptx
Data Science in Python.pptxData Science in Python.pptx
Data Science in Python.pptx
 
Cognos datawarehouse
Cognos datawarehouseCognos datawarehouse
Cognos datawarehouse
 

More from Utkarsh Sharma

Model validation
Model validationModel validation
Model validationUtkarsh Sharma
 
Introduction to statistics
Introduction to statisticsIntroduction to statistics
Introduction to statisticsUtkarsh Sharma
 
Web mining: Concepts and applications
Web mining: Concepts and applicationsWeb mining: Concepts and applications
Web mining: Concepts and applicationsUtkarsh Sharma
 
Time series analysis
Time series analysisTime series analysis
Time series analysisUtkarsh Sharma
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data AnalyticsUtkarsh Sharma
 
Evaluating classification algorithms
Evaluating classification algorithmsEvaluating classification algorithms
Evaluating classification algorithmsUtkarsh Sharma
 
Principle Component Analysis
Principle Component AnalysisPrinciple Component Analysis
Principle Component AnalysisUtkarsh Sharma
 
Density based Clustering Algorithms(DB SCAN, Mean shift )
Density based Clustering Algorithms(DB SCAN, Mean shift )Density based Clustering Algorithms(DB SCAN, Mean shift )
Density based Clustering Algorithms(DB SCAN, Mean shift )Utkarsh Sharma
 
Association rule mining
Association rule miningAssociation rule mining
Association rule miningUtkarsh Sharma
 

More from Utkarsh Sharma (10)

Model validation
Model validationModel validation
Model validation
 
Introduction to statistics
Introduction to statisticsIntroduction to statistics
Introduction to statistics
 
Web mining: Concepts and applications
Web mining: Concepts and applicationsWeb mining: Concepts and applications
Web mining: Concepts and applications
 
Time series analysis
Time series analysisTime series analysis
Time series analysis
 
Text analytics
Text analyticsText analytics
Text analytics
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
Evaluating classification algorithms
Evaluating classification algorithmsEvaluating classification algorithms
Evaluating classification algorithms
 
Principle Component Analysis
Principle Component AnalysisPrinciple Component Analysis
Principle Component Analysis
 
Density based Clustering Algorithms(DB SCAN, Mean shift )
Density based Clustering Algorithms(DB SCAN, Mean shift )Density based Clustering Algorithms(DB SCAN, Mean shift )
Density based Clustering Algorithms(DB SCAN, Mean shift )
 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
 

Recently uploaded

Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.arsicmarija21
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Atmosphere science 7 quarter 4 .........
Atmosphere science 7 quarter 4 .........Atmosphere science 7 quarter 4 .........
Atmosphere science 7 quarter 4 .........LeaCamillePacle
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Planning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxPlanning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxLigayaBacuel1
 
Quarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up FridayQuarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up FridayMakMakNepo
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 

Recently uploaded (20)

Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
AmericanHighSchoolsprezentacijaoskolama.

Introduction to Big Data Analytics

  • 1. INTRODUCTION TO BIG DATA ANALYTICS Utkarsh Sharma Asst. Prof. (CSE) Jaypee University Of Engineering & Technology
  • 2. Big Data Overview Several industries have led the way in developing their ability to gather and exploit data: • Credit card companies monitor every purchase their customers make and can identify fraudulent purchases with a high degree of accuracy using rules derived by processing billions of transactions. • Mobile phone companies analyze subscribers' calling patterns to identify subscribers who might defect if a rival network offers an attractive promotion. • For companies such as LinkedIn and Facebook, data itself is their primary product.
  • 3. Big Data Overview Three attributes stand out as defining Big Data characteristics: • Huge volume of data: Rather than thousands or millions of rows, Big Data can be billions of rows and millions of columns. • Complexity of data types and structures: Big Data reflects the variety of new data sources, formats, and structures, including digital traces being left on the web and other digital repositories for subsequent analysis. • Speed of new data creation and growth: Big Data can describe high velocity data, with rapid data ingestion and near real time analysis.
  • 4. Another definition of Big Data comes from the McKinsey Global report from 2011: • Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value. McKinsey's definition of Big Data implies that organizations will need new data architectures and analytic sandboxes, new tools, new analytical methods, and an integration of multiple skills into the new role of the data scientist.
  • 6. An Example (Genomic Sequencing) While data has grown, the cost to perform this work has fallen dramatically. The cost to sequence one human genome fell from $100 million in 2001 to $10,000 in 2011, and the cost continues to drop. Now, websites such as 23andMe offer genotyping for less than $100.
  • 7. Data Structures • Big data can come in multiple forms, including structured and non-structured data such as financial data, text files, multimedia files, and genetic mappings. • Most of the Big Data is unstructured or semi-structured in nature, which requires different techniques and tools to process and analyze. • Distributed computing environments and massively parallel processing (MPP) architectures that enable parallelized data ingest and analysis are the preferred approach to process such complex data.
  • 9. Structured Data • Data containing a defined data type, format, and structure (that is, transaction data, online analytical processing [OLAP] data cubes, traditional RDBMS, CSV files, and even simple spreadsheets).
  • 10. Semi-structured data • Textual data files with a discernible pattern that enables parsing (such as Extensible Markup Language [XML] data files that are self-describing and defined by an XML schema).
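The "discernible pattern that enables parsing" can be illustrated with a small sketch. The XML document below is hypothetical (the `orders`, `order`, `customer`, and `total` names are invented for illustration); the point is only that self-describing tags let a standard parser turn the text into typed records.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML data; element and attribute names are invented for illustration.
XML_TEXT = """
<orders>
  <order id="1001"><customer>Asha</customer><total>250.00</total></order>
  <order id="1002"><customer>Ravi</customer><total>99.50</total></order>
</orders>
""".strip()

def parse_orders(xml_text):
    """Return one dict per <order> element, using the tags as field names."""
    root = ET.fromstring(xml_text)
    orders = []
    for order in root.findall("order"):
        orders.append({
            "id": int(order.get("id")),
            "customer": order.findtext("customer"),
            "total": float(order.findtext("total")),
        })
    return orders

orders = parse_orders(XML_TEXT)
print(orders)
```

Because the tags name each field, no custom pattern matching is needed; the same parser works for any well-formed XML file.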
  • 11. Quasi-structured data • Textual data with erratic data formats that can be formatted with effort, tools, and time (for instance, web clickstream data that may contain inconsistencies in data values and formats). • Consider the following example. A user attends the EMC World conference and subsequently runs a Google search online to find information related to EMC and Data Science. This would produce a URL such as https://www.google.com/#q=EMC+data+science • After doing this search, the user may choose the second link, to read more about the headline "Data Scientist - EMC Education, Training, and Certification." This brings the user to an emc.com site focused on this topic and a new URL, https://education.emc.com/guest/campaign/data_science.aspx • Arriving at this site, the user may decide to click to learn more about the process of becoming certified in data science. The user chooses a link toward the top of the page on Certifications, bringing the user to a new URL: https://education.emc.com/guest/certification/framework/stf/data_science.aspx
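As a rough sketch of the "effort, tools, and time" involved, the three clickstream URLs from this example can be broken into fields with Python's standard `urllib.parse` module. The output field names (`host`, `path_segments`, `search_terms`) are assumptions chosen for illustration, not part of any standard clickstream schema.

```python
from urllib.parse import urlsplit, parse_qs

# The three URLs visited in the clickstream example.
clicks = [
    "https://www.google.com/#q=EMC+data+science",
    "https://education.emc.com/guest/campaign/data_science.aspx",
    "https://education.emc.com/guest/certification/framework/stf/data_science.aspx",
]

def parse_click(url):
    """Split one URL into host, path segments, and any search terms."""
    parts = urlsplit(url)
    # The search URL carries its terms in the fragment (#q=...);
    # other sites may use the query string instead, so check both.
    params = parse_qs(parts.query or parts.fragment)
    return {
        "host": parts.netloc,
        "path_segments": [s for s in parts.path.split("/") if s],
        "search_terms": params.get("q", [""])[0].split(),
    }

parsed = [parse_click(u) for u in clicks]
for row in parsed:
    print(row)
```

Each site encodes information differently (search terms in a fragment, topic in the path), which is why clickstream data needs per-source handling before it fits a fixed schema.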
  • 12. Unstructured data • Data that has no inherent structure, which may include text documents, PDFs, images, and video. • These heterogeneous data structures created the need for specialized data storage and retrieval approaches, such as data warehouses and analytics sandboxes.
  • 13. Data Warehouse • A data warehouse is a central repository of information that can be analyzed to make more informed decisions. • Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence. • Business analysts, data engineers, data scientists, and decision makers access the data through business intelligence (BI) tools, SQL clients, and other analytics applications.
  • 14. Intro. to Data Warehouse • The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts make informed decisions in an organization. • An operational database changes frequently, day to day, as transactions take place, whereas a data warehouse also keeps historical data. • A data warehouse provides generalized and consolidated data in a multidimensional view. Along with this view of the data, a data warehouse also provides Online Analytical Processing (OLAP) tools.
  • 15. Understanding a Data Warehouse • A data warehouse is a database that is kept separate from the organization's operational database. • A data warehouse is not updated frequently. • It holds consolidated historical data, which helps the organization analyze its business. • A data warehouse helps executives organize, understand, and use their data to make strategic decisions. • Data warehouse systems help integrate diverse application systems. • A data warehouse system supports analysis of consolidated historical data.
  • 16. Analytics sandbox • A workspace in which data assets are gathered from multiple sources and technologies for analysis. • To lessen the performance burden of the analysis, the workspace may use in-database processing and is considered to be owned by the analysts rather than database administrators. • Often, this workspace is created by using a sampling of the dataset rather than the entire dataset. • The sandbox may also reduce the stove-piped and partial versions of the true data that may have been developed in business units.
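The idea of building the sandbox from a sample rather than the entire dataset can be sketched as follows. This is a minimal illustration using simple random sampling with a fixed seed; the `full_dataset` records and the `sample_rows` helper are hypothetical, and a real sandbox might instead sample inside the database.

```python
import random

def sample_rows(rows, fraction, seed=42):
    """Return a reproducible simple random sample of the given fraction of rows."""
    rng = random.Random(seed)          # fixed seed so analysts get the same sample
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)

# Hypothetical full dataset: 10,000 customer records with made-up spend values.
full_dataset = [{"customer_id": i, "spend": (i * 37) % 500} for i in range(10_000)]

# Load only 1% of the records into the sandbox.
sandbox = sample_rows(full_dataset, 0.01)
print(len(sandbox))
```

Fixing the seed keeps the sample stable across runs, so results obtained in the sandbox can be reproduced by other members of the team.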
  • 18. Types of Data Repositories
  • 19. Business Intelligence vs Data Science
  • 20. Examples of Big Data Analytics • As mentioned earlier, Big Data presents many opportunities to improve sales and marketing analytics. • An example of this is the U.S. retailer Target. After analyzing consumer purchasing behavior, Target's statisticians determined that the retailer made a great deal of money from three main life-event situations. • Marriage, when people tend to buy many new products. • Divorce, when people buy new products and change their spending habits. • Pregnancy, when people have many new things to buy and have an urgency to buy them. • Target determined that the most lucrative of these life events is the third situation: pregnancy. Using data collected from shoppers, Target was able to identify this fact and predict which of its shoppers were pregnant. In one case, Target knew a female shopper was pregnant even before her family knew.
  • 21. Data Science Project Lifecycle
  • 22. Data Science Project Lifecycle • 1. Obtain Data • Skills required • Knowing how to use MySQL, PostgreSQL, or MongoDB. • 2. Scrub Data • Skills required • Scripting tools such as Python or R to help you scrub the data. • 3. Explore Data • Skills required • If you are using Python: NumPy, Matplotlib, pandas, or SciPy; if you are using R: ggplot2 or dplyr, the Swiss Army knife of data exploration. On top of that, you need knowledge of and skills in inferential statistics and data visualization. • 4. Model Data • Skills required • In machine learning, you will need both supervised and unsupervised algorithms. • 5. Interpret Data • Skills required • Strong business domain knowledge, so that you can present your findings in a way that answers the business questions you set out to answer.
  • 23. The Analytics Process An Analysis process contains all or some of the following phases: • Business understanding: Identifying and understanding the business objectives • Data Collection: Collection of data from different sources and its representation in terms of its application. • Data Preparation: Removing the unnecessary and unwanted data • Data Modelling: Create a model to analyse the different relationships between the objects. • Data Evaluation: Evaluation and preparation of analysis report • Deployment: Finalizing the plan for deployment
  • 24. Types of Analytics On the basis of problem description, four types of data analytics are used: • Descriptive Analytics • Diagnostic Analytics • Predictive Analytics • Prescriptive Analytics
  • 25. Descriptive analytics : What is happening? • This is the most common of all forms. In business it provides the analyst a view of key metrics and measures within the business. • Descriptive analytics juggles raw data from multiple data sources to give valuable insights into the past. • However, these findings simply signal that something is wrong or right, without explaining why.
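A minimal sketch of descriptive analytics, assuming a made-up week of daily sales figures: the summary below reports what happened (including an unusually low day) but, as the slide notes, says nothing about why.

```python
import statistics

# Hypothetical daily sales for one week; the numbers are invented for illustration.
daily_sales = [120, 135, 128, 90, 142, 138, 131]

# Key metrics that describe past performance.
summary = {
    "count": len(daily_sales),
    "mean": statistics.mean(daily_sales),
    "median": statistics.median(daily_sales),
    "min": min(daily_sales),
    "max": max(daily_sales),
    "stdev": statistics.stdev(daily_sales),  # sample standard deviation
}

for metric, value in summary.items():
    print(f"{metric}: {value}")
```

The gap between the minimum and the median signals that one day was unusual; explaining it is the job of diagnostic analytics.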
  • 26. Diagnostic: Why is it happening? • At this stage, historical data can be measured against other data to answer the question of why something happened. • Diagnostic analytics gives in-depth insights into a particular problem. • On assessment of the descriptive data, diagnostic analytical tools will empower an analyst to drill down and in so doing isolate the root-cause of a problem.
  • 27. Predictive: What is likely to happen? • Predictive analytics tells what is likely to happen. It uses the findings of descriptive and diagnostic analytics to detect clusters and exceptions, and to predict future trends. • Predictive models typically utilize a variety of variable data to make the prediction. • Predictive analytics belongs to advanced analytics types and brings many advantages like sophisticated analysis based on machine or deep learning.
  • 28. Prescriptive: What do I need to do? • The purpose of prescriptive analytics is to literally prescribe what action to take to eliminate a future problem or take full advantage of a promising trend. • The prescriptive model utilizes an understanding of what has happened, why it has happened and a variety of “what-might-happen” analysis to help the user determine the best course of action to take. • Besides, this state-of-the-art type of data analytics requires not only historical internal data but also external information due to the nature of algorithms it’s based on.
  • 29. Big Data Analytics(One more categorization) • Basic Analytics Slicing & Dicing Basic monitoring Anomaly identification • Advanced Analytics Predictive Modelling Text Analytics Statistics and data mining algorithms • Operational Analytics • Monetized Analytics
  • 30. Data Analytics Lifecycle Brief Overview • The Data Analytics Lifecycle is designed specifically for Big Data problems and data science projects. • The lifecycle has six phases, and project work can occur in several phases at once. • For most phases in the lifecycle, the movement can be either forward or backward. • In recent years, substantial attention has been placed on the emerging role of the data scientist. • Despite this strong focus on the emerging role of the data scientist specifically, there are actually seven key roles that need to be fulfilled for a high-functioning data science team to execute analytic projects successfully.
  • 31. Key Roles for a Successful Analytics Project • For a small, versatile team, the seven roles may be fulfilled by only 3 people, but a very large project may require 20 or more people. The seven roles follow:
  • 32. Key Roles for a Successful Analytics Project • Business User: Business analyst, line manager, or deep subject matter expert in the project domain. • Project Sponsor: provides the funding and gauges • Project Manager: Ensures that key milestones and objectives are met on time and at the expected quality. • Business Intelligence Analyst: Provides business domain expertise based on a deep understanding of the data, key performance indicators (KPIs). • Database Administrator (DBA): Provisions and configures the database environment to support the analytics needs of the working team. • Data Engineer: Leverages deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data ingestion into the analytic sandbox. • Data Scientist: Provides subject matter expertise for analytical techniques, data modeling, and applying valid analytical techniques to given business problems.
  • 34. Phase 1- Discovery • Learning the Business Domain • Resources • Framing the Problem • Identifying Key Stakeholders • Interviewing the Analytics Sponsor • Developing Initial Hypotheses
  • 35. Phase 2: Data Preparation • Preparing the Analytic Sandbox • Performing ETLT • Learning About the Data • Data Conditioning • Survey and Visualize
  • 36. Phase 3: Model Planning • Data Exploration and Variable Selection • Model Selection Phase 4: Model Building • The team develops data sets for testing, training, and production purposes.
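One common way to develop separate training and testing data sets in Phase 4 is a shuffled split. The sketch below is an assumption about how a team might do this, not a method prescribed by the lifecycle; the `split_dataset` helper, the 80/20 ratio, and the placeholder records are all hypothetical.

```python
import random

def split_dataset(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and split it into training and testing sets."""
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder records; in practice these would be rows from the analytic sandbox.
records = list(range(100))
train, test = split_dataset(records)
print(len(train), len(test))
```

Keeping the test set separate from the data used to fit the model gives an honest estimate of how the model will behave on production data.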
  • 37. Phase 5: Communicate Results • The team, in collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in Phase 1. Phase 6: Operationalize • The team delivers final reports, briefings, code, and technical documents. • In addition, the team may run a pilot project to implement the models in a production environment.
  • 38. Key Outputs from a Successful Analytic Project
  • 39. Big Data Pre-processing • The set of techniques applied before a data mining method is run is called data preprocessing. • The larger the amount of data collected, the more sophisticated the mechanisms needed to analyze it. • Data preprocessing adapts the data to the requirements of each data mining algorithm, making it possible to process data that would otherwise be infeasible to handle.
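Two typical preprocessing steps, filling in missing values and rescaling a feature, can be sketched as below. These are illustrative choices (mean imputation and min-max scaling), assumed here as examples; the slides do not name specific techniques, and the `raw` values are made up.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature column with one missing reading.
raw = [10.0, None, 30.0, 20.0]
clean = impute_mean(raw)
scaled = min_max_scale(clean)
print(clean)
print(scaled)
```

Many data mining algorithms cannot accept missing values and are sensitive to feature scale, so steps like these adapt the data to what the algorithm requires.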