DATA SCIENCE
BASIC INTRODUCTION
What is Data Science?
 Data Science as a multi-disciplinary subject encompasses
the use of mathematics, statistics, and computer science to
study and evaluate data.
 Key objective - To extract valuable information for use in
strategic decision making, product development, trend
analysis, and forecasting.
Data Science – Standard definition.
 Data science is a multi-disciplinary field that uses scientific
methods, processes, algorithms and systems to extract
knowledge and insights from structured and unstructured
data.
 Data science is the science that uses computer science,
statistics, machine learning, visualization, and human-computer
interaction to collect, clean, integrate, analyze, visualize,
and interact with data to create data products.
Data science Skills.
 Data science, as a multi-disciplinary field, revolves around reading
and processing data and pulling knowledge from that data.
Evolution of Data Science.
 Initially rooted in statistics and data analysis, it has evolved into a
multi-faceted discipline, incorporating advanced techniques like
machine learning and deep learning.
 Over the years, as data has grown exponentially, so too has the
need for sophisticated tools and methods to process and analyze
it.
Where do we obtain data from?
 Lots of data is being collected
and warehoused
 Web data, e-commerce
 Financial transactions, bank/credit transactions
 Online trading and purchasing
 Social networks
How much data do we have?
Google processes 20 PB a day (2008)
Facebook has 60 TB of daily logs
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
1000 genomes project: 200 TB
Cost of 1 TB of disk: $35
Time to read 1 TB disk: 3 hrs (100 MB/s)
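The read-time figure above follows from dividing capacity by throughput. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant sequential read rate; the function name is illustrative only:

```python
def hours_to_read(capacity_bytes: float, throughput_bytes_per_s: float) -> float:
    """Time, in hours, to read a disk end to end at a constant rate."""
    seconds = capacity_bytes / throughput_bytes_per_s
    return seconds / 3600

TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

read_hours = hours_to_read(1 * TB, 100 * MB)
print(f"{read_hours:.2f} hours")  # about 2.78 hours, which the slide rounds to 3
```

At the same rate, the 20 PB a day quoted for Google would take a single disk years to read, which is why such volumes are processed in parallel.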
What actually is Data Science?
 An area that manages, manipulates, extracts, and interprets
knowledge from tremendous amounts of data
 Data science (DS) is a multidisciplinary field of study whose
goal is to address the challenges in big data
 Data science principles apply to all data, big and small
 Theories and techniques from many fields and disciplines
are used to investigate and analyze large amounts of data
to help decision makers in many industries such as
science, engineering, economics, politics, finance, and
education
 Computer Science
 Pattern recognition, visualization, data warehousing,
High performance computing, Databases, AI
 Mathematics
 Mathematical Modeling
 Statistics
 Statistical and Stochastic modeling, Probability.
Scope of Data Science
 Healthcare
 Predictive analytics
 Personalized medicine
 Improving patient care
 For example, algorithms can predict patient outcomes based on historical data, allowing for more proactive
treatment plans.
 Finance
 Risk management
 Fraud detection
 Investment strategies.
 Machine learning models can analyze market trends and assist in making informed trading decisions.
 Marketing
 Segmenting audiences
 Personalizing campaigns
 Predicting customer behavior
 This helps companies target the right customers with the right message at the right time.
 Technology
 Recommendation systems
 Enhance user experience
 Improve operational efficiency.
 Think of how Netflix suggests shows you might like: that’s data science in action.
Different milestones in Data Science
R.A. Fisher, W.E. Deming, Peter Luhn, Howard Dresner
What are the expectations according to Gartner’s 2014 Hype Cycle?
What we invest in Data Science
Data Science Process
 Data Science is the process of analysing and interpreting data to
uncover hidden trends, correlations and insights that can support
decision-making and strategic planning.
 It involves manipulating raw data using analytical and
computational techniques to transform it into valuable
information.
Data Science Process Life Cycle
1. Data Collection
 Gathering relevant data from multiple sources such as databases, APIs, surveys,
logs, sensors or web scraping.
2. Data Cleaning
 Raw data often contains missing values, inconsistencies, duplicates and noise.
 Data cleaning focuses on correcting errors, handling missing data,
removing irrelevant records and converting data into a structured format
suitable for analysis.
3. Exploratory Data Analysis (EDA)
 Understand the data in depth by applying descriptive statistics and
visualization techniques.
 It helps identify trends, outliers, correlations and relationships between
variables and guides decisions related to feature selection and modeling
strategies.
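Steps 2 and 3 above can be sketched with the Python standard library alone. The records, field names, and the choice of median imputation are assumptions made for illustration; real projects usually pick cleaning rules per column and often use a dataframe library instead.

```python
import statistics

# Hypothetical raw records: (customer_id, age, monthly_spend). None marks a missing value.
raw_records = [
    (1, 25, 120.0),
    (2, None, 80.0),
    (3, 41, 200.0),
    (3, 41, 200.0),  # exact duplicate of the previous row
    (4, 33, None),
    (5, 52, 260.0),
    (6, 29, 150.0),
]

def clean(records):
    """Drop exact duplicates, then fill missing numeric fields with the column median."""
    unique = list(dict.fromkeys(records))  # keeps the first occurrence, preserves order
    age_fill = statistics.median(r[1] for r in unique if r[1] is not None)
    spend_fill = statistics.median(r[2] for r in unique if r[2] is not None)
    return [
        (cid,
         age if age is not None else age_fill,
         spend if spend is not None else spend_fill)
        for cid, age, spend in unique
    ]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

cleaned = clean(raw_records)
ages = [r[1] for r in cleaned]
spends = [r[2] for r in cleaned]

# Descriptive statistics and one pairwise relationship, as a first look at the data.
print("rows:", len(cleaned))
print("mean age:", statistics.fmean(ages), "stdev:", round(statistics.stdev(ages), 2))
print("age/spend correlation:", round(pearson(ages, spends), 2))
```

A strong correlation at this stage is only a hint about which variables might be useful features; it does not by itself justify a modeling choice.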
Data Science Process Life Cycle (contd.)
4. Model Building
 Suitable machine learning algorithms are selected and trained on
historical data.
 The goal is to identify patterns that allow the model to make
accurate predictions or classifications on unseen data.
5. Model Deployment
 After validation, the trained model is deployed into a production
environment.
 Its performance is continuously monitored and updates are made as
new data becomes available or conditions change.
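Steps 4 and 5 can be illustrated with a deliberately small model. The sketch below fits a straight line by ordinary least squares on synthetic data, validates it on held-out points, and then applies a simple monitoring rule. The data, the 80/20 split, the function names, and the retraining threshold are all assumptions for illustration, not a prescribed method.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def mean_abs_error(model, xs, ys):
    """Average absolute prediction error of the fitted line on (xs, ys)."""
    slope, intercept = model
    errors = [abs(slope * x + intercept - y) for x, y in zip(xs, ys)]
    return statistics.fmean(errors)

def needs_retraining(model, xs, ys, baseline_mae, tolerance=2.0):
    """Flag the model when its error on new data exceeds tolerance * baseline (assumed rule)."""
    return mean_abs_error(model, xs, ys) > tolerance * baseline_mae

# Synthetic "historical" data: y = 3x + 5 plus Gaussian noise.
rng = random.Random(42)
xs = [float(i) for i in range(100)]
ys = [3.0 * x + 5.0 + rng.gauss(0.0, 2.0) for x in xs]

# Hold out 20% of the points so the model is judged on unseen data.
indices = list(range(100))
rng.shuffle(indices)
train_idx, test_idx = indices[:80], indices[80:]
train_xs, train_ys = [xs[i] for i in train_idx], [ys[i] for i in train_idx]
test_xs, test_ys = [xs[i] for i in test_idx], [ys[i] for i in test_idx]

model = fit_line(train_xs, train_ys)
baseline_mae = mean_abs_error(model, test_xs, test_ys)

# Simulate changed conditions after deployment: every outcome shifts upward by 50.
shifted_ys = [y + 50.0 for y in test_ys]
drift_detected = needs_retraining(model, test_xs, shifted_ys, baseline_mae)
print("slope:", round(model[0], 2), "retrain:", drift_detected)
```

The same pattern carries over to larger models: keep a validation baseline from training time and compare it against errors measured on fresh production data.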
Real Life Examples
 Companies learn your secrets, shopping patterns, and
preferences
 Data science and elections (2008, 2012)
 …that was just one of several ways that Mr. Obama’s
campaign operations, some unnoticed by Mr.
Romney’s aides in Boston, helped save the president’s
candidacy. In Chicago, the campaign recruited a team
of behavioral scientists to build an extraordinarily
sophisticated database
 …that allowed the Obama campaign not only to alter
the very nature of the electorate, making it younger and
less white, but also to create a portrait of shifting voter
allegiances. The power of this operation stunned Mr.
Romney’s aides on election night, as they saw voters
they never even knew existed turn out in places like
Osceola County, Fla.
-- New York Times, Wed Nov 7, 2012
Real life examples (contd.)
 Exciting, effective new
applications of data analytics
 Example: Google Flu Trends:
detecting outbreaks two weeks
ahead of CDC data
 New models are estimating
which cities are most at risk
for spread of the Ebola virus.
 Prediction models are built on
various data sources, types, and
analyses.
PageRank: The web as a behavioral dataset
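The slide names PageRank without describing it. As a hedged illustration, the sketch below uses the common textbook formulation: power iteration with a damping factor, where each page shares its rank equally among the pages it links to. The four-page graph, the damping value of 0.85, and the iteration count are assumptions for the example, and this is not a description of any production search system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links maps each page to the list of pages it links to; every link target
    must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: C is linked to by A, B and D.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(page, round(score, 3))
```

Links act as recorded behavior here: each one is a choice made by a page author, and the ranking aggregates those choices.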
Sponsored search
 Google earns around $50 bn/year from
marketing, 97% of the company's revenue.
 Sponsored search uses an auction – a pure
competition for marketers trying to win
access to consumers.
 In other words, a competition for models of
consumers – their likelihood of responding to
the ad – and of determining the right bid for
the item.
 There are around 30 billion search requests a
month. Perhaps a trillion events of history
between search providers.
 Google AdWords and AdSense
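The idea that the auction is "a competition for models of consumers" can be made concrete with a toy example. In the sketch below each advertiser bids the expected value of one ad showing, computed from its own estimate of the response probability. The advertiser names, the numbers, and the second-price rule are assumptions chosen for simplicity; real sponsored-search auctions are more elaborate.

```python
from dataclasses import dataclass

@dataclass
class Advertiser:
    name: str
    p_response: float          # modeled probability that the consumer responds to the ad
    value_per_response: float  # what one response is worth to the advertiser

    def bid(self) -> float:
        """Bid the expected value of showing the ad once."""
        return self.p_response * self.value_per_response

def second_price_auction(advertisers):
    """Highest bidder wins and pays the runner-up's bid (simplifying assumption)."""
    ranked = sorted(advertisers, key=lambda a: a.bid(), reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    return winner.name, runner_up.bid()

# Hypothetical advertisers competing for one search query.
advertisers = [
    Advertiser("shoe_shop", p_response=0.04, value_per_response=20.0),
    Advertiser("sports_store", p_response=0.02, value_per_response=30.0),
    Advertiser("outlet", p_response=0.01, value_per_response=25.0),
]

winner, price = second_price_auction(advertisers)
print(winner, round(price, 2))
```

An advertiser with a more accurate response model bids closer to the true value of each impression, which is the sense in which the models themselves compete.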
Other data science applications
 Transaction Databases → Recommender systems
(Netflix), Fraud Detection (Security and Privacy)
 Wireless Sensor Data → Smart Home, Real-time
Monitoring, Internet of Things
 Text Data, Social Media Data → Product Review
and Consumer Satisfaction (Facebook, Twitter,
LinkedIn), E-discovery
 Software Log Data → Automatic Troubleshooting
(Splunk)
 Genotype and Phenotype Data → Epic, 23andMe,
Patient-Centered Care, Personalized Medicine
What can you do with the data?
Traffic Prediction and Earthquake Warning
Crowdsourcing + physical modeling + sensing + data assimilation
to produce:
From Alex Bayen, UCB, Director, Institute for Transportation Studies
Who are data scientists?
 Data scientists are a new breed of
analytical data experts who have the
technical skills to solve complex
problems and the curiosity to explore
what problems need to be solved.
 They find stories and extract knowledge.
They are not reporters.
 Data scientists are the key to realizing
the opportunities presented by big data.
They bring structure to it, find
compelling patterns in it, and advise
executives on the implications for
products, processes, and decisions.
Duties of Data Scientists.
There's not a definitive job description when it comes to a
data scientist role. But here are a few things you'll likely
be doing:
 Collecting large amounts of unruly data and
transforming it into a more usable format.
 Staying on top of analytical techniques such as machine
learning, deep learning and text analytics.
 Solving business-related problems using data-driven
techniques.
 Communicating and collaborating with both IT and
business.
 Working with a variety of programming languages,
including SAS, R and Python.
 Looking for order and patterns in data, as well as
spotting trends that can help a business’s bottom line.
 Having a solid grasp of statistics, including statistical
tests and distributions.
What are the tools of Data Scientists?
 Data visualization: the presentation of data in a
pictorial or graphical format so it can be easily
analyzed.
 Pattern recognition: technology that
recognizes patterns in data (often used
interchangeably with machine learning).
 Machine learning: a branch of artificial
intelligence based on mathematical algorithms
and automation.
 Data preparation: the process of converting
raw data into another format so it can be more
easily consumed.
 Deep learning: an area of machine learning
research that uses data to model complex
abstractions.
 Text analytics: the process of examining
unstructured data to glean key business insights.
Companies that use Data Science.
 Accenture
 Fidelity Investments
 Bank of America
 Google
 Facebook
 Tata Consultancy Services
 Intel
 Many more……..
Contrast between Databases and Data Science
Aspect | Databases | Data Science
Data value | “Precious” | “Cheap”
Data volume | Modest | Massive
Examples | Bank records, personnel records, census, medical records | Online clicks, GPS logs, tweets, building sensor readings
Priorities | Consistency, error recovery, auditability | Speed, availability, query richness
Structured | Strongly (schema) | Weakly or none (text)
Properties | Transactions, ACID* | CAP* theorem (2/3), eventual consistency
Realizations | SQL | NoSQL: MongoDB, CouchDB, HBase, Cassandra, Riak, Memcached, Apache River, …
Contrast: Machine Learning vs. Data Science
Machine Learning | Data Science
Develop new (individual) models | Explore many models, build and tune hybrids
Prove mathematical properties of models | Understand empirical properties of models
Improve/validate on a few, relatively clean, small datasets | Develop/use tools that can handle massive datasets
Publish a paper | Take action!
Requirements for being a Data Scientist.
 Mathematics and Applied Mathematics
 Applied Statistics/Data Analysis
 Solid Programming Skills (R, Python, Julia, SQL)
 Data Mining
 Database Storage and Management
 Machine Learning and Discovery
Data Science – A Visual Definition
THANK YOU…..
Q & A
