Colleen M. Farrelly
 Wikipedia:
◦ “An all-encompassing term for any collection of data
sets so large and complex that it becomes difficult to
process using on-hand data management tools or
traditional data processing applications”
 Defined by volume, variety, velocity
 2008 computer scientist predictions:
◦ Big Data will “transform the activities of companies,
scientific researchers, medical practitioners, and our
nation’s defense and intelligence operations”
 According to the New York Times:
◦ Big data science “typically means applying the tools of
artificial intelligence, like machine
learning, to vast new troves of data beyond that
captured in standard databases”
 Wider
 Longer
 Wider and Longer
 Complex
subgroupings
within wider or
longer sets
 Many correlations
 Noisy
 Missing data
 Computational challenges of storage and
statistical program memory
◦ R holds data in memory, so its workspace on a laptop is limited
by RAM (e.g., about 2 GB) unless more RAM is added
◦ Algorithm computing time grows according to scaling
rules, many of which are superlinear (polynomial or exponential).
Under quadratic scaling, for example, 2 GB takes 4
minutes, and 4 GB then takes 16 minutes…
 Statistical challenges from data structure
◦ Wide data violates many statistical assumptions.
◦ Correlations among predictors also violate statistical
assumptions and create problems with the underlying
linear algebra calculation methods.
◦ Potential for lots of informative missing data that can’t
be imputed using existing statistical methods.
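The two numeric points on this slide can be checked with a short sketch. It is illustrative only: the function names, the power-law form (time proportional to data size raised to an exponent), and the toy predictor values are assumptions, not taken from the slides. The first function reproduces the 4-minute/16-minute example under quadratic scaling. The second hints at why correlated predictors trouble the linear algebra: regression has to invert the matrix X'X, and its determinant is zero when one predictor is an exact multiple of another.

```python
def projected_minutes(base_minutes, base_gb, new_gb, exponent):
    """Project run time, assuming time grows as (data size) ** exponent."""
    return base_minutes * (new_gb / base_gb) ** exponent


def gram_determinant(x1, x2):
    """Determinant of the 2x2 matrix X'X for two predictor columns."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    return s11 * s22 - s12 * s12


# Doubling the data from 2 GB to 4 GB under two assumed scaling rules.
quadratic = projected_minutes(4, 2, 4, exponent=2)
linear = projected_minutes(4, 2, 4, exponent=1)

x1 = [1, 2, 3, 4]
collinear = [2 * v for v in x1]   # perfectly correlated with x1
unrelated = [4, 1, 3, 2]          # hypothetical second predictor

det_collinear = gram_determinant(x1, collinear)   # zero: X'X cannot be inverted
det_unrelated = gram_determinant(x1, unrelated)   # nonzero: X'X is invertible
```

Real predictors are rarely exact multiples of each other, but strong correlation pushes the determinant toward zero, which makes the inverse numerically unstable.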
 More computing resources
◦ Expensive
◦ Cloud computing
◦ Does not solve statistical issues posed by big data
 New statistical methods
◦ Rely on a new set of tools from computer science
◦ Work around limitations of existing multivariate
data analysis methods
◦ Don’t always scale as big data grows
 Still have computational issues
 Need for larger and larger training sets for good
performance
 Hadoop
◦ Open-source software for storage and processing of big data across
computer cores/clusters
◦ Compatible with existing statistical software
 MapReduce
◦ Distributed computing strategy for big data processing and analyses
◦ Compute problem in parallel and combine final answers for shorter
compute times
 SQL/NoSQL
◦ Database query languages (SQL for relational databases, NoSQL for
non-relational stores) for:
 Database construction/modifications
 Pulling pieces of data for further analyses/reporting
 R
◦ Free open-source software with existing machine learning algorithms and
coding environment to create and test new machine learning algorithms
 Simulations
◦ Use data structure and relationship rules to create a dataset with a
pre-specified structure
◦ Allows for testing and validation of new algorithms against datasets with
known answers
◦ Useful for comparing existing algorithms with new algorithms
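Two of the ideas above, simulation and MapReduce, can be combined in one small sketch. It is a toy under stated assumptions: the function names, the linear model y = 1 + 3x plus noise, and the chunk size are all hypothetical, and the "map" step runs sequentially on one machine rather than across a Hadoop cluster. The simulated data have a known slope, each chunk is mapped to a handful of partial sums, and the reduce step adds those sums so the slope can be estimated from the combined result.

```python
import random
from functools import reduce

TRUE_SLOPE = 3.0       # the "known answer" built into the simulated data
TRUE_INTERCEPT = 1.0


def simulate(n, seed=0):
    """Create (x, y) pairs that follow y = 1 + 3x plus Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        y = TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0, 1)
        data.append((x, y))
    return data


def map_chunk(chunk):
    """Map step: summarize one chunk as the sums a regression slope needs."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxy = sum(x * y for x, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    return (n, sx, sy, sxy, sxx)


def combine(a, b):
    """Reduce step: add the partial sums from two chunks."""
    return tuple(u + v for u, v in zip(a, b))


def slope_from_sums(sums):
    """Least-squares slope computed from the combined sums."""
    n, sx, sy, sxy, sxx = sums
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)


data = simulate(1000)
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]
partials = map(map_chunk, chunks)   # sequential here; one call per worker on a cluster
estimated_slope = slope_from_sums(reduce(combine, partials))
```

Because the true slope is known, the estimate can be checked directly, which is the point of validating an algorithm against simulated data.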
 Statistics
◦ Hypothesis testing (parametric and nonparametric) and
experimental design
◦ Generalized linear models
◦ Longitudinal, time series, and survival models
◦ Bayesian methods
 Mathematics
◦ Multivariable calculus
◦ Linear algebra
◦ Probability theory
◦ Optimization
◦ Graph theory/discrete math
◦ Real analysis/topology
 Machine learning
◦ Technically, considered a branch of statistics
◦ Supervised, unsupervised, and semi-supervised models
◦ Serve to extend statistical models and relax assumptions on data
◦ Includes algorithms from topological data analysis and network
analysis
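The difference between supervised and unsupervised models can be sketched on six numbers. This is a minimal illustration, not a recommended method: the data, labels, and function names are made up, the supervised model is a nearest-class-mean rule, and the unsupervised one is a bare-bones two-cluster k-means that assumes at least two distinct values.

```python
def class_means(values, labels):
    """Supervised: use the known labels to compute one mean per class."""
    means = {}
    for label in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == label]
        means[label] = sum(members) / len(members)
    return means


def predict(value, means):
    """Assign a new value to the class with the nearest mean."""
    return min(means, key=lambda label: abs(value - means[label]))


def two_means(values, iterations=10):
    """Unsupervised: split values into two clusters without any labels."""
    low, high = min(values), max(values)   # initial cluster centers
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return low, high


values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["a", "a", "a", "b", "b", "b"]   # outcome, used only by the supervised model

means = class_means(values, labels)   # fit to the outcome
prediction = predict(4.5, means)      # classify a new observation
centers = two_means(values)           # explore structure, no outcome needed
```

Here both approaches find the same two groups, but only the supervised model can name them, because only it saw the labels.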
 A professional who blends several different
areas of expertise to draw insights from
disparate data sources (particularly big data)
such that inference can be made about
specific problems/decisions within the field
of application
 Data science blends statistics, machine
learning, computer science, mathematics,
and domain knowledge to leverage data for
decision-making in a given domain (business,
medical, social media…).
 Discuss problem with leadership to understand the
problem and how results might be used.
◦ Providing a predictive algorithm that performs well but doesn’t
provide insight into the problem might not be useful.
◦ There may be related items that leadership hasn’t considered,
items that can enrich the project.
 Define data that needs to be pulled.
◦ May exist in database.
◦ May need to find elsewhere.
 Pull and clean data.
◦ Examine for errors or bias.
◦ Deal with missing data.
 Perform analyses and interpret output.
◦ Can be supervised (fit to outcome) or unsupervised (exploratory).
◦ Typically involves visualization of important results.
 Compile summary of actionable insights for leadership.
◦ Simplification
◦ Business value (no point in doing analysis if it can’t be
implemented!)
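The "pull and clean data" steps can be sketched with Python's built-in sqlite3 module. Everything specific here is hypothetical: the visits table, its columns, the sample rows, and the 0 to 120 age range used to flag errors. The sketch pulls the needed columns with SQL, flags an implausible value, counts missing entries, and keeps only complete, plausible rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE visits (patient_id INTEGER, age INTEGER, cost REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, 34, 120.0), (2, None, 80.0), (3, 51, None), (4, 47, 200.0), (5, 230, 95.0)],
)

# Pull only the columns needed for the analysis.
rows = conn.execute(
    "SELECT patient_id, age, cost FROM visits ORDER BY patient_id"
).fetchall()

# Examine for errors: flag implausible ages (the range is an assumption).
errors = [r for r in rows if r[1] is not None and not 0 <= r[1] <= 120]

# Deal with missing data: count it, then keep complete, plausible rows.
missing = [r for r in rows if r[1] is None or r[2] is None]
clean = [r for r in rows if r not in errors and r not in missing]

mean_cost = sum(r[2] for r in clean) / len(clean)
conn.close()
```

Dropping incomplete rows is only the simplest option. If the missingness is informative, as noted earlier in the deck, this choice can bias the results.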
 Mathematical/Statistical Background
◦ Graduate degree, typically in mathematics/statistics,
computer science, or engineering
◦ Training in machine learning and algorithm design
◦ Experience with R and SAS statistical languages/programs
 Computer Science Background
◦ Python/MATLAB/other high-level computing languages
◦ Hadoop/MapReduce concepts
◦ SQL or NoSQL coding for database extraction/management
◦ Experience with structured or unstructured data
◦ Data mining/algorithm design
 Field of Application Expertise
◦ Intellectual curiosity
◦ Understanding of the industry of application (marketing,
medical, finance…)
◦ Communication skills to relate findings to non-technical
leaders
 From a quick
Indeed.com search:
◦ Allstate Insurance
◦ Sprint
◦ Twitter
◦ APS Healthcare
◦ XOR Security
◦ LinkedIn
◦ IBM
◦ Intel
 Indeed.com search
continued:
◦ Roche
Pharmaceuticals
◦ Amazon
◦ Capital One
 According to NewVantage and others:
◦ 2016 revenue gained from data science is estimated at
$130.1 billion.
◦ This is expected to grow to $203 billion by 2020.
 Individual company results vary according to:
◦ Team talent and expertise
◦ Data collected (and quality of data)
◦ Competitor strengths in data science
 Current and projected shortages of those with
analytics talent will impact the market.
◦ Hubs of data science are emerging outside California—
Boston, New York, Austin, Chicago, Jacksonville, Tampa,
Charlotte, Atlanta…
◦ Across industries—healthcare, tech, finance, energy…
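The two revenue estimates above imply a yearly growth rate, which a few lines of arithmetic can recover. This sketch assumes constant compound growth between 2016 and 2020, which the sources may not have assumed, and the function name is made up for illustration.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Constant yearly growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted above, in billions of dollars.
rate = compound_annual_growth_rate(130.1, 203.0, years=2020 - 2016)
print(f"Implied annual growth: {rate:.1%}")
```

Under that assumption the implied growth is a little under 12% per year.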

Big data and data science overview


Editor's Notes

  1. http://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/
     Bryant, R., Katz, R. H., & Lazowska, E. D. (2008). Big-data computing: creating revolutionary breakthroughs in commerce, science and society.
     Lohr, S. (2012). How big data became so big. New York Times, 11.
     Cuzzocrea, A., Song, I. Y., & Davis, K. C. (2011, October). Analytics over large-scale multidimensional data: the big data revolution!. In Proceedings of the ACM 14th international workshop on Data Warehousing and OLAP (pp. 101-104). ACM.
     Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt.
     Brown, B., Chui, M., & Manyika, J. (2011). Are you ready for the era of ‘big data’. McKinsey Quarterly, 4, 24-35.
  2. Heidema, A. G., Boer, J. M., Nagelkerke, N., Mariman, E. C., & Feskens, E. J. (2006). The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC genetics, 7(1), 23.
     Draper, N. R., Smith, H., & Pownell, E. (1966). Applied regression analysis (Vol. 3). New York: Wiley.
     Gopalkrishnan, V., Steier, D., Lewis, H., & Guszcza, J. (2012, August). Big data, big business: bridging the gap. In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (pp. 7-11). ACM.
  3. Bekkerman, R., Bilenko, M., & Langford, J. (Eds.). (2011). Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press.
     Christopher K. Riesbeck. From conceptual analyzer to Direct Memory Access Parsing: an overview., chapter 8. Ellis Horwood Limited, 1986.
     M. W. Berry. Large-scale sparse singular value computations. The International Journal of Supercomputer Applications, 6(1):13–49, Spring, 1992.
     Caporaso, J. G., Baumgartner Jr, W. A., Kim, H., Lu, Z., Johnson, H. L., Medvedeva, O., ... & Hunter, L. (2006). Concept Recognition, Information Retrieval, and Machine Learning in Genomics Question-Answering. In TREC.
     Madden, S. (2012). From databases to big data. IEEE Internet Computing, 16(3), 4-6.
     Agrawal, D., Das, S., & El Abbadi, A. (2011, March). Big data and cloud computing: current state and future opportunities. In Proceedings of the 14th International Conference on Extending Database Technology (pp. 530-533). ACM.
  4. http://www.kdnuggets.com/2014/11/9-must-have-skills-data-scientist.html