Lecturer: Dr. Ahmed Hussein Elmi
DATA SCIENCE – ONE DEFINITION
WHAT IS DATA?
Data is raw facts about people, things,
places, etc.
Example:
Item No. | Item Name | Price
103      | Mobile    | $499
What is Science?
 A branch of study that deals with a
connected body of demonstrated truths or
with observed facts systematically classified
and more or less comprehended by general
laws, and incorporating trustworthy
methods (now esp. those involving the
scientific method and which incorporate
falsifiable hypotheses) for the discovery of
new truth in its own domain.
What is Science?
 1. Generate a hypothesis
 2. Generate data through observation
and/or experiment
 3. Assess whether the data are
consistent with the hypothesis or not.
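The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coin-flip experiment, the helper name `binomial_two_sided_p`, and the 0.05 cut-off are all invented for this sketch and do not come from the lecture.

```python
from math import comb

def binomial_two_sided_p(heads: int, flips: int, p: float = 0.5) -> float:
    """Probability, if the hypothesis is true, of any outcome at most as likely as the observed one."""
    observed = comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)
    total = 0.0
    for k in range(flips + 1):
        prob_k = comb(flips, k) * p**k * (1 - p) ** (flips - k)
        # Small tolerance so outcomes exactly as likely as the observed one are counted.
        if prob_k <= observed * (1 + 1e-9):
            total += prob_k
    return total

# 1. Generate a hypothesis: the coin is fair (p = 0.5).
# 2. Generate data: a hypothetical experiment of 20 flips that gave 16 heads.
flips, heads = 20, 16

# 3. Assess whether the data are consistent with the hypothesis.
p_value = binomial_two_sided_p(heads, flips)
consistent = p_value >= 0.05  # assumed significance threshold
print(f"p-value = {p_value:.4f}; consistent with a fair coin: {consistent}")
```

A small p-value means the observed data would be rare if the hypothesis were true, so the hypothesis is called into question; it does not prove any alternative.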
WHAT IS DATA SCIENCE?
It is the skill of extracting knowledge from data
and using that knowledge to predict the unknown.
Data science is the application of computational and statistical techniques
to address or gain insight into some problem in the real world.

What is Data Science?
 An area that manages, manipulates,
extracts, and interprets knowledge from
tremendous amounts of data
 Data science (DS) is a multidisciplinary
field of study whose goal is to address the
challenges of big data
 Data science principles apply to all data
– big and small
What is Data Science?
 Theories and techniques from many fields and
disciplines are used to investigate and analyze a
large amount of data to help decision makers in
many industries such as science, engineering,
economics, politics, finance, and education
 Computer Science
 Pattern recognition, visualization, data warehousing, High
performance computing, Databases, AI
 Mathematics
 Mathematical Modeling
 Statistics
 Statistical and Stochastic modeling, Probability.
Contrast: Databases vs. Data Science
Databases         | Data Science
Querying the past | Querying the future
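The contrast can be made concrete with a small sketch. The monthly sales figures below are invented for illustration, and a straight-line fit stands in for whatever predictive model a real project would use.

```python
# Hypothetical monthly sales records: (month index, units sold).
sales = [(1, 100), (2, 110), (3, 120), (4, 130)]

# "Querying the past": a database-style aggregate over what already happened.
total_sold = sum(units for _, units in sales)

# "Querying the future": fit a least-squares line and extrapolate to month 5.
n = len(sales)
mean_x = sum(month for month, _ in sales) / n
mean_y = sum(units for _, units in sales) / n
slope = sum((m - mean_x) * (u - mean_y) for m, u in sales) / sum(
    (m - mean_x) ** 2 for m, _ in sales
)
intercept = mean_y - slope * mean_x
forecast_month_5 = intercept + slope * 5

print(f"Sold so far: {total_sold}; forecast for month 5: {forecast_month_5:.1f}")
```

The first answer is exact because the data already exist; the second is an estimate whose quality depends on how well the assumed model matches reality.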
Business intelligence (BI) is the transformation of raw
data into meaningful and useful information for
business analysis purposes. BI can handle enormous
amounts of unstructured data to help identify, develop,
and otherwise create new strategic business opportunities.
Big Data and Data Science
 “… the sexy job in the next 10 years will be
statisticians,” Hal Varian, Google Chief Economist
 The U.S. will need 140,000-190,000 predictive
analysts and 1.5 million managers/analysts by
2018 (McKinsey Global Institute, June 2011)
 New Data Science institutes being created or
repurposed – NYU, Columbia, Washington, UCB,...
 New degree programs, courses, boot-camps:
 e.g., at Berkeley: Stats, I-School, CS, Astronomy…
 One proposal (elsewhere) for an MS in “Big Data Science”
FOR EXAMPLE, THIS RAW DATA
THIS PUZZLE
STEP 1: Prepare
STEP 2: Analyze
STEP 3: Finally, get insights
The data science pipeline
Data Science
Concentration in Data Science
 Mathematics and Applied Mathematics
 Applied Statistics/Data Analysis
 Solid Programming Skills (R, Python, Julia, SQL)
 Data Mining
 Database Storage and Management
 Machine Learning and Discovery
WHY IS PYTHON PREFERRED OVER OTHER
DATA SCIENCE TOOLS?
 Easy to learn
 Scalability
 Choice of data science libraries
 Python community
 Graphics and visualization
Data Scientist’s Practice
[Figure: cycle diagram with the stages Digging Around in Data; Clean, prep; Hypothesize, Model; Evaluate, Interpret; Large Scale Exploitation]
What Does Data Science Do?
 A typical data science process looks like this,
and it can be modified for a specific use case:
 ● Understand the business
 ● Collect & explore the data
 ● Prepare & process the data
 ● Build & validate the models
 ● Deploy & monitor the performance
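The process above can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the study-hours dataset, the straight-line model, the train/test split, and the retraining threshold are not part of the lecture.

```python
# Understand the business: (hypothetical goal) predict an exam score from hours studied.

# Collect & explore: hypothetical (hours_studied, exam_score) records; None marks a missing value.
raw = [(1, 52), (2, 55), (3, None), (4, 65), (5, 70), (6, 74), (None, 60), (8, 85)]

# Prepare & process: drop incomplete records.
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# Build: fit a least-squares line on a training split.
train, test = clean[:4], clean[4:]

def fit_line(points):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(train)

def predict(hours):
    return intercept + slope * hours

# Validate: mean absolute error on records the model has not seen.
mae = sum(abs(predict(x) - y) for x, y in test) / len(test)

# Deploy & monitor: flag the model for retraining if the error drifts too high.
needs_retraining = mae > 5.0  # assumed threshold
print(f"MAE on held-out data: {mae:.2f}; retrain: {needs_retraining}")
```

In practice each stage is far larger than one line, but the order and the hand-off between stages stay the same.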
Data Scientists
 A data scientist is a person who is better
at statistics than any programmer and
better at programming than any
statistician.
 Data scientists are the key to realizing
the opportunities presented by big
data. They bring structure to it, find
compelling patterns in it, and advise
executives on the implications for
products, processes, and decisions
What do Data Scientists do?
 National Security
 Cyber Security
 Business Analytics
 Engineering
 Healthcare
 And more ….
What Do Data Scientists Spend the
Most Time On?
The Kinds of Data Scientist
 Data science for humans: the consumers of
the output are decision makers such as
executives, product managers, designers, or
clinicians.
 Data science for machines: the consumers
of the output are computers, which consume
data in the form of training data, models,
and algorithms.
Data All Around
 Lots of data is being collected
and warehoused
 Web data, e-commerce
 Financial transactions, bank/credit
transactions
 Online trading and purchasing
 Social Network
How Much Data Do We Have?
 Google processes 20 PB a day (2008)
 Facebook has 60 TB of daily logs
 eBay has 6.5 PB of user data + 50
TB/day (5/2009)
 1000 genomes project: 200 TB
 Cost of 1 TB of disk: $35
 Time to read 1 TB disk: 3 hrs
(100 MB/s)
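The read-time figure follows from simple arithmetic, which the sketch below reproduces. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); the last two lines apply the same arithmetic to the Google figure purely as an illustration.

```python
TB = 10**12  # bytes, decimal units assumed
PB = 10**15
MB = 10**6

read_rate = 100 * MB              # bytes per second, from the slide
seconds_per_tb = TB / read_rate   # 10,000 seconds
hours_per_tb = seconds_per_tb / 3600

# Illustration: the sustained rate needed to process 20 PB in one day,
# and how many 100 MB/s disks would have to be read in parallel.
bytes_per_second = 20 * PB / 86_400
disks_in_parallel = bytes_per_second / read_rate

print(f"1 TB at 100 MB/s: {seconds_per_tb:.0f} s (about {hours_per_tb:.1f} hours)")
print(f"20 PB/day needs about {disks_in_parallel:.0f} such disks in parallel")
```

Reading one disk sequentially takes hours, which is why data at this scale is spread over many machines and processed in parallel.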
Types of Data We Have
 Relational Data
(Tables/Transaction/Legacy Data)
 Non-Relational Data, e.g. Big Data
 Text Data (Web)
 Semi-structured Data (XML)
 Graph Data
 Social Network, Semantic Web (RDF, the
Resource Description Framework), …
 Streaming Data
 You can afford to scan the data once
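Because streaming data can only be scanned once, statistics have to be updated incrementally rather than computed over a stored dataset. The sketch below uses Welford's online algorithm for the running mean and variance; the function names and sensor readings are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

def running_stats(stream: Iterable[float]) -> Iterator[Tuple[int, float, float]]:
    """Yield (count, mean, population variance) after each item, storing no past items."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the updated mean (Welford's method)
        yield count, mean, m2 / count

def sensor_readings() -> Iterator[float]:
    """Hypothetical stream; a real one might come from a socket or a message queue."""
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

latest = None
for latest in running_stats(sensor_readings()):
    pass  # each reading is seen exactly once and then discarded

count, mean, variance = latest
print(f"{count} readings, mean {mean:.2f}, variance {variance:.2f}")
```

Memory use stays constant no matter how long the stream runs, which is the property that single-pass processing requires.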
Why Is Data Science Important?
Every business has data, but the business value of that data
depends on how much the business knows about it. Data science
has gained importance in recent times because it can help
businesses increase the value of their available data, which in
turn can give them a competitive advantage. It can help us know
our customers better, optimize our processes, and make better
decisions. Because of data science, data has become a strategic
asset.
Aspects in Data Science
Step 1. Statistics, Math, Linear Algebra
Step 2. Programming (Python)
DATA: An Enterprise Asset
Data and information are the lifeblood of the 21st-century
economy.
“Organizations that do not understand the
overwhelming importance of managing data and
information as tangible assets in the new economy will
not survive” (Tom Peters, 2001).
Assets are resources with recognized value under the
control of an individual or organization.
DATA: An Enterprise Asset
Organizations rely on their data assets to make more
informed and more competitive decisions.
Through a partnership of business leadership and
technical expertise, the data management function can
effectively provide and control data and information
assets.
Data, Information, Knowledge
 Data: representation of facts.
 Information: data in context. This context includes:
◦ The business meaning of data elements and related terms
◦ The format in which data is presented
◦ The timeframe represented by the data
◦ The relevance of the data to a given usage
 Knowledge: understanding, awareness, and recognition of
a situation and familiarity with its complexity.
The Data Lifecycle
Data is created or acquired, stored and maintained, used,
and finally destroyed.
Data has value when it is actually used, or can be useful in
the future.
All data lifecycle stages are associated with costs and risks,
but only the “use” stage adds business value.
The Data Management Function
DM is the business function of planning, controlling
and delivering the data and information assets.
This function includes:
The disciplines of development, execution, and supervision of
plans, policies, programs, projects, processes, practices, and
procedures that control, protect, deliver and enhance the value
of data and information assets.
Data and Information
DATA: Facts concerning people, objects, events, or
other entities. Databases store data.
INFORMATION: Data presented in a form suitable
for interpretation.
Data is converted into information by programs
and queries. Data may be stored in files or in
databases. Neither one stores information.
KNOWLEDGE: Insights into appropriate actions
based on interpreted data.
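The step from data to information "by programs and queries" can be sketched with Python's built-in `sqlite3` module. The first row is the item from the earlier example slide; the other two rows and the final recommendation rule are hypothetical additions for illustration.

```python
import sqlite3

# DATA: raw facts stored in a database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (item_no INTEGER, item_name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        (103, "Mobile", 499.0),   # row from the lecture's example
        (104, "Laptop", 899.0),   # hypothetical row
        (105, "Headset", 59.0),   # hypothetical row
    ],
)

# INFORMATION: a query presents the data in a form suitable for interpretation.
(avg_price,) = conn.execute("SELECT AVG(price) FROM items").fetchone()
above_avg = [
    name
    for (name,) in conn.execute(
        "SELECT item_name FROM items WHERE price > ? ORDER BY item_name",
        (avg_price,),
    )
]
conn.close()

# KNOWLEDGE: an insight into an appropriate action (assumed business rule).
action = f"Review the pricing of: {', '.join(above_avg)}"
print(f"Average price ${avg_price:.2f}. {action}")
```

The table itself holds only data; the average and the list of above-average items exist only once a query computes them.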
Knowledge Generation
[Figure: diagram with the labels DATA and INFORMATION]
Analytics and the DIKW Pipeline
 Data goes through a pipeline:
Raw data → Data → Information → Knowledge →
Wisdom → Decisions
 Each link is enabled by a filter, which is “business logic”
or “analytics”
 We are interested in filters that involve “sophisticated
analytics”, which require nontrivial parallel algorithms
 Improve the state of the art in both algorithm quality and
(parallel) performance
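The pipeline can be pictured as a chain of filters, each taking the previous stage's output. The sketch below is deliberately simple and sequential, not the sophisticated parallel analytics the slide refers to; the sensor strings, stage functions, and the 1.3 threshold are all hypothetical.

```python
from statistics import mean

# Hypothetical raw sensor strings, including unreadable entries.
raw = ["21.5", "22.0", "n/a", "35.5", "23.0", ""]

def to_data(raw_values):
    """Raw data -> data: parse values and drop entries that cannot be read."""
    data = []
    for value in raw_values:
        try:
            data.append(float(value))
        except ValueError:
            pass
    return data

def to_information(data):
    """Data -> information: summarize the readings."""
    return {"mean": mean(data), "max": max(data)}

def to_knowledge(info):
    """Information -> knowledge: recognize the situation (assumed rule)."""
    return "spike" if info["max"] > 1.3 * info["mean"] else "normal"

def to_decision(knowledge):
    """Knowledge -> decision: choose an action."""
    return "inspect sensor" if knowledge == "spike" else "no action"

# Each link in the pipeline is one filter applied to the previous result.
result = raw
for stage in (to_data, to_information, to_knowledge, to_decision):
    result = stage(result)

print(result)
```

Swapping a simple filter for a more sophisticated one changes the quality of the result without changing the shape of the pipeline.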
[Figure: pipeline diagram with the labels Data, Analytics, Information, More Analytics, Knowledge]
ASSIGNMENT
1- What is the difference between data, information, knowledge, and wisdom?
2- What is data science?
3- What do data scientists do?
4- Why is data science important?
5- Why is Python preferred over other data science tools?
6- List the stages of the data science lifecycle.
7- Write the steps of data analysis.
8- List the types of data we have.
9- Describe the aspects of data science.
10- What do data scientists spend the most time on?
11- How much data do we have?
12- Discuss the data science pipeline.
13- What is data management?


Editor's Notes

  • #1 Data Science: Dealing with both unstructured and structured data, data science is a field that comprises everything related to data cleansing, preparation, and analysis. It combines statistics, mathematics, programming, problem-solving, capturing data in ingenious ways, the ability to look at things differently, and the activity of cleansing, preparing, and aligning the data. In simple terms, it is the umbrella of techniques used when trying to extract insights and information from data. Big Data: Big data refers to volumes of data so large that they cannot be processed effectively with traditional applications. The processing of big data begins with raw data that is not aggregated and is most often impossible to store in the memory of a single computer. A buzzword used to describe immense volumes of data, both unstructured and structured, big data inundates a business on a day-to-day basis. Big data can be analyzed for insights that lead to better decisions and strategic business moves. The definition of big data given by Gartner is: “Big data is high-volume, and high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”
  • #8 Pattern recognition is the process of recognizing patterns by using a machine learning algorithm. It can be defined as the classification of data based on knowledge already gained or on statistical information extracted from patterns and/or their representation. Visualization is any technique for creating images, diagrams, or animations to communicate a message. Visualization through visual imagery has been an effective way to communicate both abstract and concrete ideas since the dawn of humanity. A data product is digital information that can be purchased. According to the McKinsey Global Institute, data is a $300 billion-a-year industry. In e-commerce, data brokers can use customer analytics to aggregate information about particular customer segments for marketing purposes. The information is then sold as a product to businesses that want to grow sales and target their advertising dollars. Brokers can also collect information about specific consumers from a variety of public and non-public sources, including courthouse records, census data, and loyalty card programs.
  • #11 Querying the future: What happens if I show this ad? Or recommend this product? Or filter this email? Microsoft lost an estimated $1.7B on Surface computers (past) but what do they expect to make in future?
  • #12 University of California, Berkeley – A well-known university in California. The statement "e.g., at Berkeley: Stats, I-School, CS, Astronomy" refers to various academic disciplines or departments at the University of California, Berkeley (UCB). Here's what each abbreviation means: Stats – Statistics Department, which focuses on data analysis, probability, and statistical methods. I-School – The School of Information, which studies data science, technology, and information management. CS – Computer Science, a major department known for its contributions to artificial intelligence, software engineering, and computing. Astronomy – The Astronomy Department, which researches space, astrophysics, and celestial bodies.
  • #20 Julia is a high-level, general-purpose dynamic programming language that was originally designed to address the needs of high-performance numerical analysis and computational science, without the typical need for separate compilation to be fast. It is also usable for client and server web use, low-level systems programming, or as a specification language.
  • #22 Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. Source: https://www.techopedia.com/definition/25827/knowledge-discovery-in-databases-kdd
  • #33 The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model (source: www.ontotext.com). Streaming data is data that is continuously generated by different sources. Such data should be processed incrementally using stream processing techniques, without having access to all of the data.
  • #43 Data is a collection of facts in a raw or unorganized form such as numbers or characters. Information is the next building block of the DIKW Pyramid. This is data that has been “cleaned” of errors and further processed in a way that makes it easier to measure, visualize and analyze for a specific purpose. Wisdom: “How” is the information, derived from the collected data, relevant to our goals? “How” are the pieces of this information connected to other pieces to add more meaning and value? And, maybe most importantly, “how” can we apply the information to achieve our goal? Wisdom is the top of the DIKW hierarchy and to get there, we must answer questions such as ‘why do something’ and ‘what is best’. In other words, wisdom is knowledge applied in action.