WHAT IS DATA?
Data is raw facts about people, things,
places, etc.
Example:
Item No.   Item Name   Price
103        Mobile      $499
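As a rough illustration, the row above could be held in a program as a plain record. This is only a sketch: the field names item_no, item_name, and price_usd are assumptions made for this example and do not come from the slide.

```python
# Hypothetical field names; the slide only gives the column headings.
record = {"item_no": 103, "item_name": "Mobile", "price_usd": 499}

# The raw facts carry no interpretation until something reads them in context.
for field, value in record.items():
    print(f"{field}: {value}")
```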
4.
What is Science?
A branch of study that deals with a
connected body of demonstrated truths or
with observed facts systematically classified
and more or less comprehended by general
laws, and incorporating trustworthy
methods (now esp. those involving the
scientific method and which incorporate
falsifiable hypotheses) for the discovery of
new truth in its own domain.
5.
What is Science?
1. Generate a hypothesis
2. Generate data through observation
and/or experiment
3. Assess whether the data are
consistent with the hypothesis or not.
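The three steps above can be sketched with a coin-flip example. Everything here is assumed for illustration: the hypothesis (the coin is fair), the made-up observation (60 heads in 100 flips), and the conventional 0.05 threshold. The helper two_sided_p_value is a hypothetical name, not part of any library.

```python
from math import comb

def two_sided_p_value(heads: int, flips: int) -> float:
    """Probability, under a fair coin, of a count at least as far from flips/2 as `heads`."""
    expected = flips / 2
    distance = abs(heads - expected)
    total = 0
    for k in range(flips + 1):
        if abs(k - expected) >= distance:
            total += comb(flips, k)  # number of flip sequences with exactly k heads
    return total / 2 ** flips

# 1. Hypothesis: the coin is fair.
# 2. Data (made up for illustration): 60 heads in 100 flips.
p = two_sided_p_value(60, 100)

# 3. Assess: a small p-value means the data are hard to reconcile with the hypothesis.
consistent = p >= 0.05  # 0.05 is an assumed, conventional threshold
print(f"p-value = {p:.4f}")
print("consistent with hypothesis" if consistent else "not consistent with hypothesis")
```

Note that a result that is "consistent" does not prove the hypothesis; it only fails to falsify it.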
6.
WHAT IS DATA SCIENCE?
It is the skill of extracting knowledge from data
Using knowledge to predict the unknown
Data science is the application of computational and statistical techniques
to address or gain insight into some problem in the real world.
7.
What is Data Science?
An area that manages, manipulates,
extracts, and interprets knowledge from
tremendous amounts of data
Data science (DS) is a multidisciplinary
field of study with the goal of addressing the
challenges in big data
Data science principles apply to all data
– big and small
8.
What is Data Science?
Theories and techniques from many fields and
disciplines are used to investigate and analyze a
large amount of data to help decision makers in
many industries such as science, engineering,
economics, politics, finance, and education
Computer Science
Pattern recognition, visualization, data warehousing, High
performance computing, Databases, AI
Mathematics
Mathematical Modeling
Statistics
Statistical and Stochastic modeling, Probability.
11.
Contrast: Databases
Databases             Data Science
Querying the past     Querying the future
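The contrast can be sketched in a few lines. The sales figures below are made up, and the straight-line extrapolation is just one simple assumed way to "query the future"; the slide does not prescribe a method.

```python
# Made-up monthly sales figures, purely for illustration.
sales = [100.0, 110.0, 120.0, 130.0]

# Querying the past: aggregate what has already happened.
total_so_far = sum(sales)

# Querying the future: fit a least-squares line y = intercept + slope * x and extrapolate.
n = len(sales)
xs = range(n)
mean_x = sum(xs) / n
mean_y = sum(sales) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, sales)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x
next_month = intercept + slope * n  # prediction for the month after the last one

print(total_so_far, next_month)
```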
Business intelligence (BI) is the transformation of raw
data into meaningful and useful information for
business analysis purposes. BI can handle enormous
amounts of unstructured data to help identify, develop
and otherwise create new strategic business
12.
Big Data and Data Science
“… the sexy job in the next 10 years will be
statisticians,” Hal Varian, Google Chief Economist
The U.S. will need 140,000-190,000 predictive
analysts and 1.5 million managers/analysts by
2018 (McKinsey Global Institute, June 2011)
New Data Science institutes being created or
repurposed – NYU, Columbia, Washington, UCB,...
New degree programs, courses, boot-camps:
e.g., at Berkeley: Stats, I-School, CS, Astronomy…
One proposal (elsewhere) for an MS in “Big Data Science”
Concentration in Data Science
Mathematics and Applied Mathematics
Applied Statistics/Data Analysis
Solid Programming Skills (R, Python, Julia, SQL)
Data Mining
Database Storage and Management
Machine Learning and discovery
21.
WHY IS PYTHON PREFERRED OVER OTHER
DATA SCIENCE TOOLS?
Easy to learn
Scalability
Choice of data science libraries
Python community
Graphics and visualization
What Does Data Science Do?
A typical data science process looks like this;
it can be modified for a specific use case:
● Understand the business
● Collect & explore the data
● Prepare & process the data
● Build & validate the models
● Deploy & monitor the performance
25.
Data Scientists
A data scientist is a person who is better
at statistics than any programmer and
better at programming than any
statistician.
Data scientists are the key to realizing
the opportunities presented by big
data. They bring structure to it, find
compelling patterns in it, and advise
executives on the implications for
products, processes, and decisions
26.
What do Data Scientists do?
National Security
Cyber Security
Business Analytics
Engineering
Healthcare
And more ….
The Kinds of Data Scientist
Data science for humans: here the consumers of
the output are decision makers such as
executives, product managers, designers, or
clinicians.
Data science for machines: here the consumers
of the output are computers, which consume
data in the form of training data, models, and
algorithms.
31.
Data All Around
Lots of data is being collected
and warehoused
Web data, e-commerce
Financial transactions, bank/credit
transactions
Online trading and purchasing
Social Network
32.
How Much Data Do We Have?
Google processes 20 PB a day (2008)
Facebook has 60 TB of daily logs
eBay has 6.5 PB of user data + 50
TB/day (5/2009)
1000 genomes project: 200 TB
Cost of 1 TB of disk: $35
Time to read 1 TB disk: about 3 hrs
(100 MB/s)
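The read-time figure can be checked with simple arithmetic. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which the slide does not state.

```python
TB = 10**12  # bytes in a decimal terabyte (assumed)
MB = 10**6   # bytes in a decimal megabyte (assumed)

read_rate = 100 * MB      # bytes per second, from the slide
seconds = TB / read_rate  # time to scan the whole disk once
hours = seconds / 3600

print(f"{seconds:.0f} s = {hours:.2f} h")  # prints: 10000 s = 2.78 h
```

So a sequential scan takes a little under three hours, which is why a single machine cannot keep up with the daily volumes listed above.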
33.
Types of Data We Have
Relational Data
(Tables/Transaction/Legacy Data)
Non-relational Data, e.g. Big Data
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF:
Resource Description Framework), …
Streaming Data
You can afford to scan the data once
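The "scan once" constraint on streaming data can be illustrated with a one-pass running mean and variance (Welford's method). This is a sketch: running_stats and sensor_readings are hypothetical names, and the readings are made up.

```python
from typing import Iterable, Iterator, Tuple

def running_stats(stream: Iterable[float]) -> Iterator[Tuple[int, float, float]]:
    """Yield (count, mean, population variance) after each item, in a single pass."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        yield count, mean, m2 / count

def sensor_readings() -> Iterator[float]:
    # Stand-in for an unbounded source; the values are made up.
    yield from (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)

for count, mean, variance in running_stats(sensor_readings()):
    pass  # each item is seen exactly once and never stored

print(count, mean, variance)
```

Only three numbers are kept in memory regardless of how long the stream runs.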
34.
Why Is Data Science Important?
Every business has data, but the business value of that data
depends on how much the business knows about it. Data science
has gained importance in recent times because it helps
businesses increase the value of their available data, which in
turn gives them a competitive advantage. It helps us know our
customers better, optimize our processes, and make better
decisions. Because of data science, data has become a strategic
asset.
Why Is Data Science Important?
1-Every business has data, but its business value
depends on how much the business knows about the data
it has.
2-Data science has gained importance in recent times
because it helps businesses increase the value of
their available data, which in turn gives them a
competitive advantage.
3-It helps us know our customers better, optimize our
processes, and make better decisions. Because of data
science, data has become a strategic asset.
36.
Aspects in Data Science
Step 1. Statistics, Math, Linear Algebra
Step 2. Programming (Python)
37.
DATA: An Enterprise Asset
Data and information are the lifeblood of the 21st-century
economy.
“Organizations that do not understand the
overwhelming importance of managing data and
information as tangible assets in the new economy will
not survive” (Tom Peters, 2001).
Assets are resources with recognized value under the
control of an individual or organization.
38.
DATA: An Enterprise Asset
Organizations rely on their data assets to make more
informed and more competitive decisions.
Through a partnership of business leadership and
technical expertise, the data management function can
effectively provide and control data and information
assets.
39.
Data, Information, Knowledge
Data: representation of facts.
Information: data in context.
◦ This context includes:
The business meaning of data elements and related terms
The format in which data is presented
The timeframe represented by the data
The relevance of the data to a given usage
Knowledge: understanding, awareness, and recognition of
a situation and familiarity with its complexity.
40.
The Data Lifecycle
Data is created or acquired, stored and maintained, used,
and finally destroyed.
Data has value when it is actually used, or can be useful in
the future.
All data lifecycle stages are associated with costs and risks,
but only the “use” stage adds business value.
41.
The Data Management Function
Data management (DM) is the business function of planning,
controlling, and delivering data and information assets.
This function includes:
The disciplines of development, execution, and supervision of
plans, policies, programs, projects, processes, practices, and
procedures that control, protect, deliver and enhance the value
of data and information assets.
Data and Information
DATA: Facts concerning people, objects, events or
other entities. Databases store data.
INFORMATION: Data presented in a form suitable
for interpretation.
Data is converted into information by programs
and queries. Data may be stored in files or in
databases. Neither one stores information.
KNOWLEDGE: Insights into appropriate actions
based on interpreted data.
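The idea that a query converts stored data into information can be sketched with Python's built-in sqlite3 module. The table name, columns, and rows below are made up for illustration.

```python
import sqlite3

# Data: individual sale records stored in a database (in-memory for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Mobile", 499.0), ("Mobile", 499.0), ("Charger", 19.0)],
)

# Information: the same data presented in a form suitable for interpretation.
rows = conn.execute(
    "SELECT item, COUNT(*), SUM(price) FROM sales GROUP BY item ORDER BY item"
).fetchall()
for item, units, revenue in rows:
    print(f"{item}: {units} sold, revenue {revenue:.2f}")
conn.close()
```

The database holds only the three raw rows; the per-item summary exists only once the query runs.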
Analytics and the DIKW Pipeline
Data goes through a pipeline:
Raw data → Data → Information → Knowledge →
Wisdom → Decisions
Each link is enabled by a filter, which is “business logic”
or “analytics”
We are interested in filters that involve “sophisticated
analytics”, which require nontrivial parallel algorithms
Improve the state of the art in both algorithm quality and
(parallel) performance
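The pipeline of filters can be sketched as a chain of small functions, each turning one stage into the next. This is a deliberately trivial, sequential illustration with made-up readings and a hypothetical business rule; it does not attempt the sophisticated or parallel analytics the slide refers to, and it skips the wisdom stage.

```python
from statistics import mean

# Raw data: made-up temperature readings; "n/a" marks a failed sensor read.
raw = ["21.5", "n/a", "22.0", "35.5", "36.0", "n/a", "37.5"]

def to_data(readings):
    """Raw data -> data: keep only readings that parse as numbers."""
    cleaned = []
    for r in readings:
        try:
            cleaned.append(float(r))
        except ValueError:
            continue
    return cleaned

def to_information(values):
    """Data -> information: summarize early versus recent readings."""
    half = len(values) // 2
    return {"early_mean": mean(values[:half]), "recent_mean": mean(values[half:])}

def to_knowledge(info, threshold=5.0):
    """Information -> knowledge: is there a meaningful upward shift? (threshold assumed)"""
    return info["recent_mean"] - info["early_mean"] > threshold

def to_decision(rising):
    """Knowledge -> decision: a hypothetical business rule."""
    return "inspect cooling system" if rising else "no action"

decision = to_decision(to_knowledge(to_information(to_data(raw))))
print(decision)
```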
[Figure: Data → Analytics → Information → More Analytics → Knowledge]
46.
ASSIGNMENT
1-What is the difference between data, information, knowledge, and wisdom?
2-What is data science?
3-What do data scientists do?
4-Why is data science important?
5-Why is Python preferred over other data science tools?
6-List the stages of the data science lifecycle.
7-Write the steps of data analysis.
8-List the types of data we have.
9-Describe the aspects of data science.
10-What do data scientists spend the most time on?
11-How much data do we have?
12-Discuss the data science pipeline.
13-What is data management?
Editor's Notes
#1 Data Science: Dealing with unstructured and structured data, Data Science is a field that comprises everything related to data cleansing, preparation, and analysis.
Data Science is the combination of statistics, mathematics, programming, problem-solving, capturing data in ingenious ways, the ability to look at things differently, and the activity of cleansing, preparing and aligning the data.
In simple terms, it is the umbrella of techniques used when trying to extract insights and information from data.
Big Data: Big Data refers to humongous volumes of data that cannot be processed effectively with the traditional applications that exist. The processing of Big Data begins with the raw data that isn’t aggregated and is most often impossible to store in the memory of a single computer.
A buzzword that is used to describe immense volumes of data, both unstructured and structured, Big Data inundates a business on a day-to-day basis. Big Data is something that can be used to analyze insights which can lead to better decisions and strategic business moves.
The definition of Big Data, given by Gartner is, “Big data is high-volume, and high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation”
#8 Pattern recognition is the process of recognizing patterns by using a machine learning algorithm. Pattern recognition can be defined as the classification of data based on knowledge already gained or on statistical information extracted from patterns and/or their representation.
Visualization is any technique for creating images, diagrams, or animations to communicate a message. Visualization through visual imagery has been an effective way to communicate both abstract and concrete ideas since the dawn of humanity.
1. A data product is digital information that can be purchased.
According to the McKinsey Global Institute, data is a $300 billion-a-year industry. In e-commerce, data brokers can use customer analytics to aggregate information about particular customer segments for marketing purposes. The information is then sold as a product to businesses that want to grow sales and target their advertising dollars. Brokers can also collect information about specific consumers from a variety of public and non-public sources including courthouse records, census data and loyalty card programs.
#11 Querying the future:
What happens if I show this ad?
Or recommend this product?
Or filter this email?
Microsoft lost an estimated $1.7B on Surface computers (past) but what do they expect to make in future?
#12 University of California, Berkeley – A well-known university in California.
The statement "e.g., at Berkeley: Stats, I-School, CS, Astronomy" refers to various academic disciplines or departments at the University of California, Berkeley (UCB).
Here's what each abbreviation means:
Stats – Statistics Department, which focuses on data analysis, probability, and statistical methods.
I-School – The School of Information, which studies data science, technology, and information management.
CS – Computer Science, a major department known for its contributions to artificial intelligence, software engineering, and computing.
Astronomy – The Astronomy Department, which researches space, astrophysics, and celestial bodies.
#20 Julia is a high-level, general-purpose, dynamic programming language that was originally designed to address the needs of high-performance numerical analysis and computational science, without the typical need for separate compilation to be fast. It is also usable for client and server web use, low-level systems programming, or as a specification language.
#22 Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data.
Source: https://www.techopedia.com/definition/25827/knowledge-discovery-in-databases-kdd
#33 www.ontotext.com
The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model.
Streaming data is data that is continuously generated by different sources. Such data should be processed incrementally using stream processing techniques without having access to all of the data.
#43 Data is a collection of facts in a raw or unorganized form such as numbers or characters.
Information is the next building block of the DIKW Pyramid. This is data that has been “cleaned” of errors and further processed in a way that makes it easier to measure, visualize and analyze for a specific purpose.
Knowledge: “How” is the information, derived from the collected data, relevant to our goals? “How” are the pieces of this information connected to other pieces to add more meaning and value? And, maybe most importantly, “how” can we apply the information to achieve our goal?
Wisdom is the top of the DIKW hierarchy and to get there, we must answer questions such as ‘why do something’ and ‘what is best’. In other words, wisdom is knowledge applied in action.