2. Capgemini Internal 2
1. What is Data?
2. What is Information?
3. What is Science?
Data is a raw, unorganized set of facts that needs to be
processed to have meaning.
Information is data that has been processed, organized, structured,
or presented in a given context so as to make it useful.
Science is a systematic enterprise that builds and
organizes knowledge in the form of
testable explanations and predictions about the universe.
Is data science just a rebranding?
3. Capgemini Internal 3
1. Structured
2. Unstructured
3. Semi-structured
Structured data is easily searchable by basic algorithms.
Ex: spreadsheets and data from machine sensors.
Unstructured data is more like human language. It doesn't fit
neatly into relational (SQL) databases, and searching it
with traditional algorithms ranges from difficult to
completely impossible.
Ex: emails, text documents (Word docs, PDFs, etc.), social
media posts, videos, audio files, and images.
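As a rough illustration of the contrast above, the sketch below filters structured records by a field and then runs a naive keyword search over free text. The sensor readings, the email strings, and the helper names are all made up for this example, and real unstructured search needs far more than lowercase keyword matching.

```python
import re

# Structured data: every record has the same named fields (hypothetical readings).
readings = [
    {"sensor": "A", "temp_c": 21.5},
    {"sensor": "B", "temp_c": 38.2},
    {"sensor": "C", "temp_c": 40.1},
]

def hot_sensors(rows, threshold):
    """Return sensor names whose temperature exceeds the threshold."""
    return [row["sensor"] for row in rows if row["temp_c"] > threshold]

# Unstructured data: free text with no fixed fields (hypothetical emails).
emails = [
    "Hi team, sensor B looked really hot this morning.",
    "Lunch on Friday?",
    "The readings seem to be overheating again, please check.",
]

def mentions(texts, keywords):
    """Return indexes of texts that contain any keyword as a whole word."""
    wanted = {k.lower() for k in keywords}
    hits = []
    for i, text in enumerate(texts):
        words = set(re.findall(r"[a-z]+", text.lower()))
        if words & wanted:
            hits.append(i)
    return hits

# A field comparison gives an exact answer on structured data.
print(hot_sensors(readings, 35.0))
# On free text, the result depends on guessing which words people used.
print(mentions(emails, ["hot", "overheating"]))
```

The structured query works because the meaning of `temp_c` is fixed in advance. The text search only finds emails that happen to use the chosen words, which is one reason unstructured data is harder to search.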
4. Capgemini Internal 4
Unstructured data is growing at the rate of 62% per year.
By 2022, 93% of all data in the digital universe was
unstructured.
Data volume is set to grow 800% over the next 5 years, and 80%
of it will reside as unstructured data.
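To get a feel for what these growth figures mean, the sketch below works through the compounding arithmetic. It makes two assumptions that the slide does not state: growth compounds annually, and "grow 800%" means ending at nine times the starting volume. The function names are illustrative only.

```python
def growth_factor(annual_rate, years):
    """Total multiplier after compounding annual_rate for the given years."""
    return (1 + annual_rate) ** years

def implied_annual_rate(total_growth, years):
    """Annual rate that compounds to total_growth (8.0 means 800%)."""
    return (1 + total_growth) ** (1 / years) - 1

# Unstructured data growing 62% per year, compounded over 5 years.
factor_5y = growth_factor(0.62, 5)
# Overall data volume growing 800% in 5 years, expressed as a yearly rate.
rate = implied_annual_rate(8.0, 5)

print(f"62% per year for 5 years multiplies volume by about {factor_5y:.1f}")
print(f"800% growth over 5 years implies about {rate:.0%} per year")
```

Note that the two figures describe different quantities (unstructured data versus total data volume), so they are not expected to match exactly.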
5. 5
A Data Analyst usually explains what is going on by processing
the history of the data.
A Data Scientist not only does exploratory analysis to discover
insights from the data, but also looks ahead to what is likely to happen next.
7. 7
Features | Business Intelligence | Data Science
Data Sources | Structured (usually SQL, often a data warehouse) | Both structured and unstructured (logs, cloud data, SQL, NoSQL, text)
Approach | Statistics and visualization | Statistics, machine learning, graph analysis, Natural Language Processing (NLP)
Focus | Past and present | Present and future
Tools | Microsoft BI, QlikView, R, etc. | R, Python, SAS, Scala & Spark
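The Focus row of the comparison can be made concrete with a small sketch: one function summarizes what already happened, the other fits a straight line to the history and extends it one step. The sales numbers are invented, and a least-squares line is only a stand-in for the statistical and machine learning methods listed under Approach.

```python
from statistics import mean

# Hypothetical monthly sales figures, oldest first.
sales = [100.0, 110.0, 120.0, 130.0, 140.0]

def describe(history):
    """Past and present: summarize what already happened."""
    return {"mean": mean(history), "min": min(history), "max": max(history)}

def forecast_next(history):
    """Present and future: fit a least-squares line and extend it one step."""
    n = len(history)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(history)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, history)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    # The next period has index n (one past the last observed index).
    return intercept + slope * n

print(describe(sales))
print(forecast_next(sales))
```

The summary answers "what happened?", while the forecast answers "what is likely next?" under the strong assumption that the trend stays linear.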
8. Capgemini Internal 8
Supervised: All data is labeled and the algorithms learn to predict the
output from the input data.
Unsupervised: All data is unlabeled and the algorithms learn the inherent
structure from the input data.
Semi-supervised: Some data is labeled but most of it is unlabeled and a
mixture of supervised and unsupervised techniques can be used.
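The difference between the first two settings can be sketched in a few lines of plain Python. The fruit weights are invented, and the two methods used here (a nearest-centroid classifier and a tiny one-dimensional k-means with two groups) are just simple examples of each category, not the only options.

```python
from statistics import mean

# Supervised: every input comes with a label (hypothetical weights in grams).
labeled = [(150, "apple"), (160, "apple"), (170, "apple"),
           (1000, "melon"), (1100, "melon"), (1200, "melon")]

def train_centroids(examples):
    """Learn one average weight per label from labeled examples."""
    by_label = {}
    for weight, label in examples:
        by_label.setdefault(label, []).append(weight)
    return {label: mean(ws) for label, ws in by_label.items()}

def predict(centroids, weight):
    """Predict the label whose learned average is closest to the input."""
    return min(centroids, key=lambda label: abs(centroids[label] - weight))

# Unsupervised: the same inputs without labels; find two groups instead.
def two_means(values, rounds=10):
    """Tiny 1-D k-means with k=2; assumes the values are not all identical."""
    lo, hi = float(min(values)), float(max(values))
    for _ in range(rounds):
        near_lo = [v for v in values if abs(v - lo) <= abs(v - hi)]
        near_hi = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = mean(near_lo), mean(near_hi)
    return sorted(near_lo), sorted(near_hi)

model = train_centroids(labeled)
print(predict(model, 180))      # uses the labels learned during training

unlabeled = [w for w, _ in labeled]
print(two_means(unlabeled))     # finds groups, but cannot name them
```

A semi-supervised approach would sit between the two, for example by grouping all points first and then naming each group from the few labeled points it contains.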
13. Capgemini Internal 13
Data science can be performed using many tools:
R (popular these days; it's free and open source, with lots of free help
online, so it keeps gaining popularity)
SAS (the old but powerful giant of analytics; very expensive, but you
can now download SAS University Edition for practice)
Tableau (great for visual analytics on small to mid-sized data sets; it's
expensive but very easy to use, and popular according to Gartner, a leading
analytics research and ranking firm)
Python (popular and in competition with R; it has many fans and followers,
and IT folks and coders tend to prefer it)
Scala & Spark (great for data sets exceeding 300 MB, and certainly for
data sets of 1 GB or more)
16. Capgemini Internal 16
SOCIAL MEDIA
Suggest new connections on LinkedIn.
Suggest new people to follow on
Facebook / Instagram / Twitter.
Select content for Facebook's
personal feed.
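One simple way to think about connection suggestions is to count mutual connections. The sketch below does only that, on an invented five-person network; it is not a description of how LinkedIn, Facebook, or any other platform actually ranks suggestions.

```python
from collections import Counter

# Hypothetical undirected network: person -> set of direct connections.
network = {
    "ana": {"ben", "cara"},
    "ben": {"ana", "cara", "dev"},
    "cara": {"ana", "ben", "dev", "eli"},
    "dev": {"ben", "cara"},
    "eli": {"cara"},
}

def suggest_connections(graph, person, limit=3):
    """Rank people who are not yet connected by number of mutual connections."""
    direct = graph[person]
    mutual = Counter()
    for friend in direct:
        for candidate in graph[friend]:
            if candidate != person and candidate not in direct:
                mutual[candidate] += 1
    # Sort by mutual count (descending), then by name for a stable order.
    ranked = sorted(mutual.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:limit]]

print(suggest_connections(network, "ana"))
```

Here "dev" ranks ahead of "eli" for "ana" because two of her connections know him, while only one knows "eli".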
18. Capgemini Internal 18
BIOINFORMATICS
Discover relations between DNA
sequences and disease.
URBAN PLANNING
Resolve bus/train crowding issues.
19. Capgemini Internal 19
PUBLIC HEALTH
Predict disease outbreaks.
SPORTS
Predict game results based on team,
player, environment, and opponent
features.
20. Capgemini Internal 20
Data Science is not Magic
Data Science is not Easy
Data Science is not a Fad
Data Science is not Sexy
Data Science itself is not predictable
21. Capgemini Internal 21
Data Scientist:
A person who is better at statistics than any software
engineer and better at software engineering than any
statistician.