What is Data Science?
Data Science as a multi-disciplinary subject encompasses
the use of mathematics, statistics, and computer science to
study and evaluate data.
Key objective: to extract valuable information for use in
strategic decision making, product development, trend
analysis, and forecasting.
Data Science: Standard Definition
Data science is a multi-disciplinary field that uses scientific
methods, processes, algorithms and systems to extract
knowledge and insights from structured and unstructured
data.
Data science draws on computer science, statistics, machine
learning, visualization, and human-computer interaction to
collect, clean, integrate, analyze, visualize, and interact
with data in order to create data products.
Data Science Skills
Data science as a multi-disciplinary field revolves around reading
and processing data, and pulling knowledge from that data.
Evolution of Data Science
Initially rooted in statistics and data analysis, data science has evolved into a
multi-faceted discipline, incorporating advanced techniques like
machine learning and deep learning.
Over the years, as data has grown exponentially, so too has the
need for sophisticated tools and methods to process and analyze
it.
Where do we obtain data from?
Lots of data is being collected
and warehoused
Web data, e-commerce
Financial transactions, bank/credit transactions
Online trading and purchasing
Social Network
How much data do we have?
Google processes 20 PB a day (2008)
Facebook has 60 TB of daily logs
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
1000 genomes project: 200 TB
Cost of 1 TB of disk: $35
Time to read 1 TB disk: 3 hrs (100 MB/s)
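The read-time figure above is simple arithmetic, and the same calculation shows why volumes such as 20 PB a day cannot be handled by a single disk. The sketch below assumes decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes) and a constant sequential throughput; the function name is illustrative only.

```python
import math

# Assumed decimal units; binary units (TiB, MiB) would shift the result slightly.
MB = 10**6
TB = 10**12
PB = 10**15

def read_time_hours(size_bytes, throughput_bytes_per_s):
    """Hours needed to read size_bytes sequentially at a constant throughput."""
    return size_bytes / throughput_bytes_per_s / 3600

# One 1 TB disk at 100 MB/s, as quoted on the slide.
one_disk_hours = read_time_hours(1 * TB, 100 * MB)
print(f"1 TB at 100 MB/s: {one_disk_hours:.2f} hours")

# How many such disks, read in parallel, would it take to get through 20 PB in one day?
seconds_per_day = 24 * 3600
disks_needed = math.ceil(20 * PB / (100 * MB * seconds_per_day))
print(f"Disks needed to read 20 PB in a day: {disks_needed}")
```

The first result rounds to the 3 hours quoted above; the second suggests why data at this scale is spread across thousands of machines.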
What actually is Data Science?
An area that manages, manipulates, extracts, and interprets
knowledge from tremendous amounts of data.
Data science (DS) is a multidisciplinary field of study whose
goal is to address the challenges of big data.
Data science principles apply to all data, big and small.
Theories and techniques from many fields and disciplines
are used to investigate and analyze large amounts of data
to help decision makers in many industries such as
science, engineering, economics, politics, finance, and
education.
Computer Science: pattern recognition, visualization, data warehousing,
high-performance computing, databases, AI
Mathematics: mathematical modeling
Statistics: statistical and stochastic modeling, probability
Scope of Data Science
Healthcare
Predictive analytics
Personalized medicine
Improving patient care
For example, algorithms can predict patient outcomes based on historical data, allowing for more proactive
treatment plans.
Finance
Risk management
Fraud detection
Investment strategies.
Machine learning models can analyze market trends and assist in making informed trading decisions.
Marketing
Segment audiences
Personalize campaigns
Predict customer behavior
This helps companies target the right customers with the right message at the right time.
Technology
Recommendation systems
Enhance user experience
Improve operational efficiency.
Think of how Netflix suggests shows you might like: that is data science in action.
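To make the recommendation idea concrete, here is a minimal sketch of one simple approach: suggest titles watched by users whose history overlaps with yours. It is an illustration only and not a description of how Netflix works; the user names, titles, and the recommend function are all made up.

```python
from collections import Counter

# Made-up viewing histories; users and titles are placeholders.
watched = {
    "ana":   {"Drama A", "Thriller B", "Comedy C"},
    "ben":   {"Drama A", "Thriller B", "Sci-Fi D"},
    "chloe": {"Thriller B", "Sci-Fi D"},
    "dev":   {"Comedy C"},
}

def recommend(user, histories, top_n=2):
    """Suggest unseen titles, weighted by how much other users overlap with this one."""
    seen = histories[user]
    scores = Counter()
    for other, titles in histories.items():
        if other == user:
            continue
        overlap = len(seen & titles)      # shared titles act as a similarity score
        for title in titles - seen:       # only titles the user has not seen yet
            scores[title] += overlap
    return [title for title, score in scores.most_common(top_n) if score > 0]

print(recommend("chloe", watched))
```

Real systems use far richer signals (ratings, viewing time, content features), but the principle of scoring unseen items by the behavior of similar users is the same.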
Data Science Process
Data Science is the process of analysing and interpreting data to
uncover hidden trends, correlations and insights that can support
decision-making and strategic planning.
It involves manipulating raw data using analytical and
computational techniques to transform it into valuable
information.
Data Science Process Life Cycle
1. Data Collection
Gathering relevant data from multiple sources such as databases, APIs, surveys,
logs, sensors or web scraping.
2. Data Cleaning
Raw data often contains missing values, inconsistencies, duplicates and noise.
Data cleaning focuses on correcting errors, handling missing data,
removing irrelevant records and converting data into a structured format
suitable for analysis.
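The cleaning step can be sketched in a few lines of plain Python. The records, field names, and the choice to fill missing ages with the median are all assumptions made for illustration; a real project would choose these rules to fit its own data.

```python
from statistics import median

# Hypothetical raw records: field names and values are made up.
raw_records = [
    {"id": 1, "age": "34", "city": " Pune "},
    {"id": 2, "age": "", "city": "Delhi"},       # missing age
    {"id": 2, "age": "", "city": "Delhi"},       # exact duplicate
    {"id": 3, "age": "41", "city": "delhi"},     # inconsistent casing
    {"id": 4, "age": "n/a", "city": "Mumbai"},   # unparseable age
]

def parse_age(value):
    """Return the age as an int, or None when it is missing or not a number."""
    value = value.strip()
    return int(value) if value.isdigit() else None

def clean(records):
    # 1. Drop exact duplicates while keeping the original order.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)

    # 2. Convert types and normalise inconsistent text.
    parsed = [
        {"id": r["id"], "age": parse_age(r["age"]), "city": r["city"].strip().title()}
        for r in unique
    ]

    # 3. Fill missing ages with the median of the known ages (one possible policy).
    fill = median(r["age"] for r in parsed if r["age"] is not None)
    for r in parsed:
        if r["age"] is None:
            r["age"] = fill
    return parsed

cleaned = clean(raw_records)
for row in cleaned:
    print(row)
```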
3. Exploratory Data Analysis (EDA)
Understand the data in depth by applying descriptive statistics and
visualization techniques.
It helps identify trends, outliers, correlations and relationships between
variables and guides decisions related to feature selection and modeling
strategies.
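As a small illustration of EDA, the sketch below computes descriptive statistics, flags outliers, and measures the correlation between two variables. The numbers are made up, and the 1.5 * IQR outlier rule is one common convention among several.

```python
from statistics import mean, median, stdev, quantiles

# Hypothetical sample: advertising spend vs. units sold (made-up numbers).
spend = [10, 12, 15, 18, 20, 22, 25, 90]
sales = [25, 30, 34, 41, 44, 50, 55, 60]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def iqr_outliers(xs):
    """Values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = quantiles(xs, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < low or x > high]

print("mean spend:", mean(spend), "| median spend:", median(spend))
print("std dev of spend:", round(stdev(spend), 2))
print("outliers in spend:", iqr_outliers(spend))
print("correlation(spend, sales):", round(pearson(spend, sales), 2))
```

The gap between the mean and the median already hints at the outlier that the IQR rule then flags, which is the kind of observation EDA is meant to surface before modeling.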
4. Model Building
Suitable machine learning algorithms are selected and trained on
historical data.
The goal is to identify patterns that allow the model to make
accurate predictions or classifications on unseen data.
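A minimal sketch of this step, assuming a simple linear relationship and made-up data: fit a line to the historical rows by ordinary least squares, then check the error on rows the model has not seen. Holding out the last rows is a simplification; a random split or cross-validation is more usual in practice.

```python
# Hypothetical historical data: hours studied -> exam score (made-up numbers).
history = [(1, 52), (2, 55), (3, 61), (4, 64), (5, 70),
           (6, 73), (7, 79), (8, 82), (9, 88), (10, 91)]

# Hold out the last three rows to stand in for unseen data.
train, test = history[:7], history[7:]

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_absolute_error(points, slope, intercept):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)

slope, intercept = fit_line(train)
test_mae = mean_absolute_error(test, slope, intercept)
print(f"fitted model: score = {slope:.2f} * hours + {intercept:.2f}")
print(f"mean absolute error on held-out rows: {test_mae:.2f}")
```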
5. Model Deployment
After validation, the trained model is deployed into a production
environment.
Its performance is continuously monitored and updates are made as
new data becomes available or conditions change.
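Continuous monitoring can be as simple as tracking accuracy over the most recent predictions and raising a flag when it drops. The class below is a hypothetical helper written for this illustration; the window size, threshold, and prediction stream are made-up values.

```python
from collections import deque

class AccuracyMonitor:
    """Tracks accuracy over the most recent predictions (hypothetical helper)."""

    def __init__(self, window=5, threshold=0.7):
        self.outcomes = deque(maxlen=window)  # True where the prediction was right
        self.threshold = threshold

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Only judge once the window is full, to avoid reacting to a handful of rows.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.threshold

monitor = AccuracyMonitor(window=5, threshold=0.7)

# Made-up stream of (predicted, actual) labels in which the model degrades over time.
stream = [(1, 1), (0, 0), (1, 1), (1, 1), (0, 0), (1, 0), (0, 1), (1, 0)]
for predicted, actual in stream:
    monitor.record(predicted, actual)
    if monitor.needs_retraining():
        print(f"accuracy dropped to {monitor.accuracy():.2f}; schedule retraining")
```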
Real Life Examples
Companies learn your secrets, shopping patterns, and
preferences
Data science and elections (2008, 2012)
…that was just one of several ways that Mr. Obama’s
campaign operations, some unnoticed by Mr.
Romney’s aides in Boston, helped save the president’s
candidacy. In Chicago, the campaign recruited a team
of behavioral scientists to build an extraordinarily
sophisticated database
…that allowed the Obama campaign not only to alter
the very nature of the electorate, making it younger and
less white, but also to create a portrait of shifting voter
allegiances. The power of this operation stunned Mr.
Romney’s aides on election night, as they saw voters
they never even knew existed turn out in places like
Osceola County, Fla.
-- New York Times, Wed Nov 7, 2012
Real Life Examples (contd.)
Exciting and effective new
applications of data analytics
Example: Google Flu Trends:
Detecting outbreaks two weeks
ahead of CDC data
New models are estimating
which cities are most at risk
for spread of the Ebola virus.
The prediction model is built on
various data sources, data types and
analyses.
Sponsored search
Google earns around $50 bn/year from
marketing, 97% of the company's revenue.
Sponsored search uses an auction – a pure
competition for marketers trying to win
access to consumers.
In other words, a competition for models of
consumers – their likelihood of responding to
the ad – and of determining the right bid for
the item.
There are around 30 billion search requests a
month. Perhaps a trillion events of history
between search providers.
Google AdWords and AdSense
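The link between a model of the consumer and the right bid can be illustrated with a toy auction. This is a simplified sketch and not a description of Google's actual system: it assumes ads are ranked by bid times predicted click probability and that the winner pays a second-price style amount. All advertisers, bids, and probabilities are made up.

```python
# Made-up advertisers, per-click bids, and predicted click probabilities.
ads = [
    {"advertiser": "A", "bid": 2.00, "p_click": 0.03},
    {"advertiser": "B", "bid": 1.20, "p_click": 0.08},
    {"advertiser": "C", "bid": 3.00, "p_click": 0.01},
]

def expected_value(ad):
    """Expected revenue per impression: bid per click * predicted click probability."""
    return ad["bid"] * ad["p_click"]

# Rank by expected value, so a better consumer model can beat a higher raw bid.
ranked = sorted(ads, key=expected_value, reverse=True)
winner, runner_up = ranked[0], ranked[1]

# Second-price style charge: the smallest per-click price that would still
# keep the winner ranked above the runner-up.
price_per_click = expected_value(runner_up) / winner["p_click"]

print("winner:", winner["advertiser"])
print(f"price per click: {price_per_click:.2f}")
```

In this toy setup the lowest-ranked ad carries the highest bid, which shows why predicting the likelihood of a response matters as much as the bid itself.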
Other Data Science Applications
Transaction databases: recommender systems (Netflix), fraud detection (security and privacy)
Wireless sensor data: smart home, real-time monitoring, Internet of Things
Text data, social media data: product review and consumer satisfaction (Facebook, Twitter, LinkedIn), e-discovery
Software log data: automatic troubleshooting (Splunk)
Genotype and phenotype data: Epic, 23andMe, patient-centered care, personalized medicine
What can you do with the data?
Traffic Prediction and Earthquake Warning
Crowdsourcing + physical modeling + sensing + data assimilation
to produce:
From Alex Bayen, UCB, Director, Institute for Transportation Studies
Who are Data Scientists?
Data scientists are a new breed of
analytical data experts who have the
technical skills to solve complex
problems, and the curiosity to explore
what problems need to be solved.
They find stories and extract knowledge.
They are not reporters.
Data scientists are the key to realizing
the opportunities presented by big data.
They bring structure to it, find
compelling patterns in it, and advise
executives on the implications for
products, processes, and decisions.
Duties of Data Scientists
There's not a definitive job description when it comes to a
data scientist role. But here are a few things you'll likely
be doing:
Collecting large amounts of unruly data and
transforming it into a more usable format.
Staying on top of analytical techniques such as machine
learning, deep learning and text analytics.
Solving business-related problems using data-driven
techniques.
Communicating and collaborating with both IT and
business.
Working with a variety of programming languages,
including SAS, R and Python.
Looking for order and patterns in data, as well as
spotting trends that can help a business’s bottom line.
Having a solid grasp of statistics, including statistical
tests and distributions.
What are the Tools of Data Scientists?
Data visualization: the presentation of data in a
pictorial or graphical format so it can be easily
analyzed.
Pattern recognition: technology that
recognizes patterns in data (often used
interchangeably with machine learning).
Machine learning: a branch of artificial
intelligence based on mathematical algorithms
and automation.
Data preparation: the process of converting
raw data into another format so it can be more
easily consumed.
Deep learning: an area of machine learning
research that uses data to model complex
abstractions.
Text analytics: the process of examining
unstructured data to glean key business insights.
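As a small example of text analytics, the sketch below counts the most frequent meaningful words in a few reviews. The reviews and the stop-word list are made up for illustration; real text analytics usually goes further (stemming, sentiment, topic models).

```python
import re
from collections import Counter

# Made-up product reviews standing in for unstructured text.
reviews = [
    "Battery life is great, but the screen is dim.",
    "Great screen! Battery drains fast though.",
    "The battery is the best part of this phone.",
]

# A tiny, hand-picked stop-word list (an assumption, not a standard one).
STOPWORDS = {"is", "the", "but", "of", "this", "though", "a", "an"}

def term_counts(texts):
    """Count lower-cased words across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts

counts = term_counts(reviews)
print(counts.most_common(3))
```

Even this crude count shows which product aspects customers mention most, which is the kind of business insight the definition above refers to.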
Companies that use Data Science
Accenture
Fidelity Investments
Bank of America
Google
Facebook
Tata Consultancy Services
Intel
And many more
Contrast between Databases and Data Science
Aspect | Databases | Data Science
Data value | “Precious” | “Cheap”
Data volume | Modest | Massive
Examples | Bank records, personnel records, census, medical records | Online clicks, GPS logs, tweets, building sensor readings
Priorities | Consistency, error recovery, auditability | Speed, availability, query richness
Structured | Strongly (schema) | Weakly or none (text)
Properties | Transactions, ACID* | CAP* theorem (2/3), eventual consistency
Realizations | SQL | NoSQL: MongoDB, CouchDB, HBase, Cassandra, Riak, Memcached, Apache River, …
Contrast: Machine Learning vs. Data Science
Machine Learning
Develop new (individual) models
Prove mathematical properties of models
Improve/validate on a few, relatively clean, small datasets
Publish a paper
Data Science
Explore many models, build and tune hybrids
Understand empirical properties of models
Develop/use tools that can handle massive datasets
Take action!
Requirements for being a Data Scientist
Mathematics and Applied Mathematics
Applied Statistics/Data Analysis
Solid Programming Skills (R, Python, Julia, SQL)
Data Mining
Database Storage and Management
Machine Learning and Discovery