SlideShare a Scribd company logo

EMOS 2018 Big Data methods and techniques

Piet J.H. Daas
Piet J.H. Daas
Piet J.H. DaasSenior-Methodologist & lead Data Scientist at Centraal Bureau voor de Statistiek

Slides of the EMOS 2018 Webinar on Big Data methods and techniques

EMOS 2018 Big Data methods and techniques

1 of 52
EMOS Webinar
Big Data Methods and Techniques
Piet Daas & Marco Puts
Statistics Netherlands
Center for Big Data Statistics
March 21, 2018 16:15 – 17:30
Piet Daas
• Work
– Senior methodologist CBS
– Lead Data Scientist CBDS
– Project leader Big Data research CBS
– Involved in ESSnet Big Data
• Trainer
– ESTP training course leader
– EMOS Big Data trainer Univ. Utrecht
– CBDS trainer
– For lot’s and lot’s of students
– And 2 dogs and 8 hamsters
@pietdaas
Marco Puts
• Work
– Methodologist CBS
– Lead Data Scientist CBDS
– Big Data Task Force member
– Involved in ESSnet Big Data
– Vegan
• Trainer
– ESTP training course leader
– CBDS trainer
– 4 tortoises
Overview of Webinar
• Properties of Big Data
• Big Data processes
– Collect
– Process
– Analyse
– Disseminate
• Wrap up & Discussion
• Questions in between topics
Big Data
• What is Big Data?
– How do we use this term?
• Big Data is a source of data that is:
– Rapidly available
– Usually available in large amounts
– Often generated by an unknown population
– May have poor quality metadata
– Usually has low information content
– Requires processing prior to use
– Unknown design
Big Data’s most important property
Information content can be rather low for Big Data

Recommended

Big data as a source for official statistics
Big data as a source for official statisticsBig data as a source for official statistics
Big data as a source for official statisticsEdwin de Jonge
 
Strata Big data presentation
Strata Big data presentationStrata Big data presentation
Strata Big data presentationPiet J.H. Daas
 
Using Road Sensor Data for Official Statistics: towards a Big Data Methodology
Using Road Sensor Data for Official Statistics: towards a Big Data MethodologyUsing Road Sensor Data for Official Statistics: towards a Big Data Methodology
Using Road Sensor Data for Official Statistics: towards a Big Data MethodologyPiet J.H. Daas
 
Political Science and Machine Learning - Neural Ideal Point Estimation Network
Political Science and Machine Learning - Neural Ideal Point Estimation NetworkPolitical Science and Machine Learning - Neural Ideal Point Estimation Network
Political Science and Machine Learning - Neural Ideal Point Estimation NetworkNAVER Engineering
 
Introduction to Computational Statistics
Introduction to Computational StatisticsIntroduction to Computational Statistics
Introduction to Computational StatisticsSetia Pramana
 
Efficient Query Processing Infrastructures
Efficient Query Processing InfrastructuresEfficient Query Processing Infrastructures
Efficient Query Processing InfrastructuresCrai Macdonald
 
Big Data, the Future of Statistics: Experiences at Statistics Netherlands
Big Data, the Future of Statistics: Experiences at Statistics NetherlandsBig Data, the Future of Statistics: Experiences at Statistics Netherlands
Big Data, the Future of Statistics: Experiences at Statistics NetherlandsPiet J.H. Daas
 
Opportunities and methodological challenges of Big Data for official statist...
Opportunities and methodological challenges of  Big Data for official statist...Opportunities and methodological challenges of  Big Data for official statist...
Opportunities and methodological challenges of Big Data for official statist...Piet J.H. Daas
 

More Related Content

Similar to EMOS 2018 Big Data methods and techniques

Predictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use CasesPredictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use CasesKimberley Mitchell
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big datahktripathy
 
Big Data and official statistics with examples of their use
Big Data and official statistics with examples of their useBig Data and official statistics with examples of their use
Big Data and official statistics with examples of their usePiet J.H. Daas
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesRukshan Batuwita
 
Responsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics NetherlandsResponsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics NetherlandsPiet J.H. Daas
 
Experimental transformation of ABS data into Data Cube Vocabulary (DCV) form...
Experimental transformation of  ABS data into Data Cube Vocabulary (DCV) form...Experimental transformation of  ABS data into Data Cube Vocabulary (DCV) form...
Experimental transformation of ABS data into Data Cube Vocabulary (DCV) form...Alistair Hamilton
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data AnalyticsS P Sajjan
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning Gopal Sakarkar
 
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...israel edem
 
Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introductionamiyadash
 
IT Architectures for Handling Big Data in Official Statistics: the Case of Sc...
IT Architectures for Handling Big Data in Official Statistics: the Case of Sc...IT Architectures for Handling Big Data in Official Statistics: the Case of Sc...
IT Architectures for Handling Big Data in Official Statistics: the Case of Sc...Istituto nazionale di statistica
 
WWV2015: Jibes Paul van der Hulst big data
WWV2015: Jibes Paul van der Hulst big dataWWV2015: Jibes Paul van der Hulst big data
WWV2015: Jibes Paul van der Hulst big datawebwinkelvakdag
 
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)Piet J.H. Daas
 
Introductions to Business Analytics
Introductions to Business Analytics Introductions to Business Analytics
Introductions to Business Analytics Venkat .P
 
DATA preprocessing.pptx
DATA preprocessing.pptxDATA preprocessing.pptx
DATA preprocessing.pptxChandra Meena
 

Similar to EMOS 2018 Big Data methods and techniques (20)

Predictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use CasesPredictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use Cases
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
Big Data and official statistics with examples of their use
Big Data and official statistics with examples of their useBig Data and official statistics with examples of their use
Big Data and official statistics with examples of their use
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our Lives
 
Responsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics NetherlandsResponsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics Netherlands
 
Experimental transformation of ABS data into Data Cube Vocabulary (DCV) form...
Experimental transformation of  ABS data into Data Cube Vocabulary (DCV) form...Experimental transformation of  ABS data into Data Cube Vocabulary (DCV) form...
Experimental transformation of ABS data into Data Cube Vocabulary (DCV) form...
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data Analytics
 
dwdm unit 1.ppt
dwdm unit 1.pptdwdm unit 1.ppt
dwdm unit 1.ppt
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
 
Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introduction
 
IT Architectures for Handling Big Data in Official Statistics: the Case of Sc...
IT Architectures for Handling Big Data in Official Statistics: the Case of Sc...IT Architectures for Handling Big Data in Official Statistics: the Case of Sc...
IT Architectures for Handling Big Data in Official Statistics: the Case of Sc...
 
WWV2015: Jibes Paul van der Hulst big data
WWV2015: Jibes Paul van der Hulst big dataWWV2015: Jibes Paul van der Hulst big data
WWV2015: Jibes Paul van der Hulst big data
 
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
 
Modern Information Systems
Modern Information SystemsModern Information Systems
Modern Information Systems
 
Big Data
Big DataBig Data
Big Data
 
Big Data.pptx
Big Data.pptxBig Data.pptx
Big Data.pptx
 
Introductions to Business Analytics
Introductions to Business Analytics Introductions to Business Analytics
Introductions to Business Analytics
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
DATA preprocessing.pptx
DATA preprocessing.pptxDATA preprocessing.pptx
DATA preprocessing.pptx
 

More from Piet J.H. Daas

IT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsIT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsPiet J.H. Daas
 
Use of social media for official statistics
Use of social media for official statisticsUse of social media for official statistics
Use of social media for official statisticsPiet J.H. Daas
 
Isi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and biasIsi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and biasPiet J.H. Daas
 
CBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONSCBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONSPiet J.H. Daas
 
Ntts2017 presentation 45
Ntts2017 presentation 45Ntts2017 presentation 45
Ntts2017 presentation 45Piet J.H. Daas
 
Big Data presentation Mannheim
Big Data presentation MannheimBig Data presentation Mannheim
Big Data presentation MannheimPiet J.H. Daas
 
Extracting information from ' messy' social media data
Extracting information from ' messy' social media dataExtracting information from ' messy' social media data
Extracting information from ' messy' social media dataPiet J.H. Daas
 
Big data cbs_piet_daas
Big data cbs_piet_daasBig data cbs_piet_daas
Big data cbs_piet_daasPiet J.H. Daas
 
Gebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiekGebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiekPiet J.H. Daas
 
Profiling Big Data sources to assess their selectivity
Profiling Big Data sources to assess their selectivityProfiling Big Data sources to assess their selectivity
Profiling Big Data sources to assess their selectivityPiet J.H. Daas
 
Big Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in EindhovenBig Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in EindhovenPiet J.H. Daas
 
Big Data presentation for Statistics Canada
Big Data presentation for Statistics CanadaBig Data presentation for Statistics Canada
Big Data presentation for Statistics CanadaPiet J.H. Daas
 
Quality challenges in modernising business statistics
Quality challenges in modernising business statisticsQuality challenges in modernising business statistics
Quality challenges in modernising business statisticsPiet J.H. Daas
 
Quality Approaches to Big Data
Quality Approaches to Big DataQuality Approaches to Big Data
Quality Approaches to Big DataPiet J.H. Daas
 
Social media sentiment and consumer confidence
Social media sentiment and consumer confidenceSocial media sentiment and consumer confidence
Social media sentiment and consumer confidencePiet J.H. Daas
 
Bi dutch meeting data science
Bi dutch meeting data scienceBi dutch meeting data science
Bi dutch meeting data sciencePiet J.H. Daas
 
Piet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningenPiet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningenPiet J.H. Daas
 

More from Piet J.H. Daas (20)

IT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsIT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics Netherlands
 
Use of social media for official statistics
Use of social media for official statisticsUse of social media for official statistics
Use of social media for official statistics
 
Isi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and biasIsi 2017 presentation on Big Data and bias
Isi 2017 presentation on Big Data and bias
 
CBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONSCBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONS
 
Ntts2017 presentation 45
Ntts2017 presentation 45Ntts2017 presentation 45
Ntts2017 presentation 45
 
Big Data presentation Mannheim
Big Data presentation MannheimBig Data presentation Mannheim
Big Data presentation Mannheim
 
Extracting information from ' messy' social media data
Extracting information from ' messy' social media dataExtracting information from ' messy' social media data
Extracting information from ' messy' social media data
 
Big data cbs_piet_daas
Big data cbs_piet_daasBig data cbs_piet_daas
Big data cbs_piet_daas
 
Gebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiekGebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiek
 
Big Data @ CBS
Big Data @ CBSBig Data @ CBS
Big Data @ CBS
 
Profiling Big Data sources to assess their selectivity
Profiling Big Data sources to assess their selectivityProfiling Big Data sources to assess their selectivity
Profiling Big Data sources to assess their selectivity
 
Big Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in EindhovenBig Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in Eindhoven
 
Big Data presentation for Statistics Canada
Big Data presentation for Statistics CanadaBig Data presentation for Statistics Canada
Big Data presentation for Statistics Canada
 
Quality challenges in modernising business statistics
Quality challenges in modernising business statisticsQuality challenges in modernising business statistics
Quality challenges in modernising business statistics
 
Quality Approaches to Big Data
Quality Approaches to Big DataQuality Approaches to Big Data
Quality Approaches to Big Data
 
Social media sentiment and consumer confidence
Social media sentiment and consumer confidenceSocial media sentiment and consumer confidence
Social media sentiment and consumer confidence
 
Big data @ CBS
Big data @ CBSBig data @ CBS
Big data @ CBS
 
Big data Big impact?
Big data Big impact?Big data Big impact?
Big data Big impact?
 
Bi dutch meeting data science
Bi dutch meeting data scienceBi dutch meeting data science
Bi dutch meeting data science
 
Piet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningenPiet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningen
 

Recently uploaded

Improved Seed for Nutrition and Food Security for Farming Communities in Mada...
Improved Seed for Nutrition and Food Security for Farming Communities in Mada...Improved Seed for Nutrition and Food Security for Farming Communities in Mada...
Improved Seed for Nutrition and Food Security for Farming Communities in Mada...SeedSystemsGroup
 
2024: The FAR, Federal Acquisition Regulations - Part 6
2024: The FAR, Federal Acquisition Regulations - Part 62024: The FAR, Federal Acquisition Regulations - Part 6
2024: The FAR, Federal Acquisition Regulations - Part 6JSchaus & Associates
 
UN Report: State of the Worlds Migratory Species report_E.pdf
UN Report: State of the Worlds Migratory Species report_E.pdfUN Report: State of the Worlds Migratory Species report_E.pdf
UN Report: State of the Worlds Migratory Species report_E.pdfEnergy for One World
 
A Simple Case for Re-building Trust (WEF Davos 2024)_ Energy, Transition, Cli...
A Simple Case for Re-building Trust (WEF Davos 2024)_ Energy, Transition, Cli...A Simple Case for Re-building Trust (WEF Davos 2024)_ Energy, Transition, Cli...
A Simple Case for Re-building Trust (WEF Davos 2024)_ Energy, Transition, Cli...Energy for One World
 
2024 IEA Ministerial Communique- Paris 14th February 2024
2024 IEA Ministerial Communique- Paris 14th February 20242024 IEA Ministerial Communique- Paris 14th February 2024
2024 IEA Ministerial Communique- Paris 14th February 2024Energy for One World
 
Trading standards: How exposure to global trade shapes our living standards
Trading standards: How exposure to global trade shapes our living standardsTrading standards: How exposure to global trade shapes our living standards
Trading standards: How exposure to global trade shapes our living standardsResolutionFoundation
 
Monroe Downtown Master Plan Public Meeting 2 Presentation
Monroe Downtown Master Plan Public Meeting 2 PresentationMonroe Downtown Master Plan Public Meeting 2 Presentation
Monroe Downtown Master Plan Public Meeting 2 Presentationabbystainfield
 
MC_forecasts_finals series 17_feb2024.pdf
MC_forecasts_finals series 17_feb2024.pdfMC_forecasts_finals series 17_feb2024.pdf
MC_forecasts_finals series 17_feb2024.pdfARCResearch
 
Observance of the International Mother Language Day (IMLD) 2024.pdf
Observance of the International Mother Language Day (IMLD) 2024.pdfObservance of the International Mother Language Day (IMLD) 2024.pdf
Observance of the International Mother Language Day (IMLD) 2024.pdfChristina Parmionova
 
PM 24: TENTATIVE TECHNICAL PROGRAM STRUCTURE
PM 24: TENTATIVE TECHNICAL PROGRAM STRUCTUREPM 24: TENTATIVE TECHNICAL PROGRAM STRUCTURE
PM 24: TENTATIVE TECHNICAL PROGRAM STRUCTUREnissamant
 
DLL MATHEMATICS 6 DLL Quarter 3 Week4.docx
DLL MATHEMATICS 6 DLL Quarter 3 Week4.docxDLL MATHEMATICS 6 DLL Quarter 3 Week4.docx
DLL MATHEMATICS 6 DLL Quarter 3 Week4.docxSaRah862902
 
PM International Conference 24 Poster Template
PM International Conference 24 Poster TemplatePM International Conference 24 Poster Template
PM International Conference 24 Poster Templatenissamant
 
PMAI PM24: Guidelines for the Poster Presentation
PMAI PM24: Guidelines for the Poster PresentationPMAI PM24: Guidelines for the Poster Presentation
PMAI PM24: Guidelines for the Poster Presentationnissamant
 
Extract Slides: Munich Security Conference 2024_ Lose- Lose_ (3).pdf
Extract Slides: Munich Security Conference 2024_ Lose- Lose_ (3).pdfExtract Slides: Munich Security Conference 2024_ Lose- Lose_ (3).pdf
Extract Slides: Munich Security Conference 2024_ Lose- Lose_ (3).pdfEnergy for One World
 
Manuscript Template Name of the Author PM24
Manuscript Template Name of the Author PM24Manuscript Template Name of the Author PM24
Manuscript Template Name of the Author PM24nissamant
 
501 c3 - Determination Letter - First Tee Puerto Rico
501 c3 - Determination Letter - First Tee Puerto Rico501 c3 - Determination Letter - First Tee Puerto Rico
501 c3 - Determination Letter - First Tee Puerto RicoFirst Tee Puerto Rico
 
""THE BARANGAY NUTRITION COMMITTEE.pptx"
""THE BARANGAY NUTRITION COMMITTEE.pptx"""THE BARANGAY NUTRITION COMMITTEE.pptx"
""THE BARANGAY NUTRITION COMMITTEE.pptx"rodilyAcua
 

Recently uploaded (20)

Improved Seed for Nutrition and Food Security for Farming Communities in Mada...
Improved Seed for Nutrition and Food Security for Farming Communities in Mada...Improved Seed for Nutrition and Food Security for Farming Communities in Mada...
Improved Seed for Nutrition and Food Security for Farming Communities in Mada...
 
World Radio Day 2024
World Radio Day 2024World Radio Day 2024
World Radio Day 2024
 
2024: The FAR, Federal Acquisition Regulations - Part 6
2024: The FAR, Federal Acquisition Regulations - Part 62024: The FAR, Federal Acquisition Regulations - Part 6
2024: The FAR, Federal Acquisition Regulations - Part 6
 
Revised Investigative Report on City Officials
Revised Investigative Report on City OfficialsRevised Investigative Report on City Officials
Revised Investigative Report on City Officials
 
UN Report: State of the Worlds Migratory Species report_E.pdf
UN Report: State of the Worlds Migratory Species report_E.pdfUN Report: State of the Worlds Migratory Species report_E.pdf
UN Report: State of the Worlds Migratory Species report_E.pdf
 
A Simple Case for Re-building Trust (WEF Davos 2024)_ Energy, Transition, Cli...
A Simple Case for Re-building Trust (WEF Davos 2024)_ Energy, Transition, Cli...A Simple Case for Re-building Trust (WEF Davos 2024)_ Energy, Transition, Cli...
A Simple Case for Re-building Trust (WEF Davos 2024)_ Energy, Transition, Cli...
 
2024 IEA Ministerial Communique- Paris 14th February 2024
2024 IEA Ministerial Communique- Paris 14th February 20242024 IEA Ministerial Communique- Paris 14th February 2024
2024 IEA Ministerial Communique- Paris 14th February 2024
 
Trading standards: How exposure to global trade shapes our living standards
Trading standards: How exposure to global trade shapes our living standardsTrading standards: How exposure to global trade shapes our living standards
Trading standards: How exposure to global trade shapes our living standards
 
Monroe Downtown Master Plan Public Meeting 2 Presentation
Monroe Downtown Master Plan Public Meeting 2 PresentationMonroe Downtown Master Plan Public Meeting 2 Presentation
Monroe Downtown Master Plan Public Meeting 2 Presentation
 
MC_forecasts_finals series 17_feb2024.pdf
MC_forecasts_finals series 17_feb2024.pdfMC_forecasts_finals series 17_feb2024.pdf
MC_forecasts_finals series 17_feb2024.pdf
 
Observance of the International Mother Language Day (IMLD) 2024.pdf
Observance of the International Mother Language Day (IMLD) 2024.pdfObservance of the International Mother Language Day (IMLD) 2024.pdf
Observance of the International Mother Language Day (IMLD) 2024.pdf
 
PM 24: TENTATIVE TECHNICAL PROGRAM STRUCTURE
PM 24: TENTATIVE TECHNICAL PROGRAM STRUCTUREPM 24: TENTATIVE TECHNICAL PROGRAM STRUCTURE
PM 24: TENTATIVE TECHNICAL PROGRAM STRUCTURE
 
DLL MATHEMATICS 6 DLL Quarter 3 Week4.docx
DLL MATHEMATICS 6 DLL Quarter 3 Week4.docxDLL MATHEMATICS 6 DLL Quarter 3 Week4.docx
DLL MATHEMATICS 6 DLL Quarter 3 Week4.docx
 
PM International Conference 24 Poster Template
PM International Conference 24 Poster TemplatePM International Conference 24 Poster Template
PM International Conference 24 Poster Template
 
PMAI PM24: Guidelines for the Poster Presentation
PMAI PM24: Guidelines for the Poster PresentationPMAI PM24: Guidelines for the Poster Presentation
PMAI PM24: Guidelines for the Poster Presentation
 
Extract Slides: Munich Security Conference 2024_ Lose- Lose_ (3).pdf
Extract Slides: Munich Security Conference 2024_ Lose- Lose_ (3).pdfExtract Slides: Munich Security Conference 2024_ Lose- Lose_ (3).pdf
Extract Slides: Munich Security Conference 2024_ Lose- Lose_ (3).pdf
 
Manuscript Template Name of the Author PM24
Manuscript Template Name of the Author PM24Manuscript Template Name of the Author PM24
Manuscript Template Name of the Author PM24
 
501 c3 - Determination Letter - First Tee Puerto Rico
501 c3 - Determination Letter - First Tee Puerto Rico501 c3 - Determination Letter - First Tee Puerto Rico
501 c3 - Determination Letter - First Tee Puerto Rico
 
""THE BARANGAY NUTRITION COMMITTEE.pptx"
""THE BARANGAY NUTRITION COMMITTEE.pptx"""THE BARANGAY NUTRITION COMMITTEE.pptx"
""THE BARANGAY NUTRITION COMMITTEE.pptx"
 
PPT LESSON G-9.pptx
PPT LESSON G-9.pptxPPT LESSON G-9.pptx
PPT LESSON G-9.pptx
 

EMOS 2018 Big Data methods and techniques

  • 1. EMOS Webinar Big Data Methods and Techniques Piet Daas & Marco Puts Statistics Netherlands Center for Big Data Statistics March 21, 2018 16:15 – 17:30
  • 2. Piet Daas • Work – Senior methodologist CBS – Lead Data Scientist CBDS – Project leader Big Data research CBS – Involved in ESSnet Big Data • Trainer – ESTP training course leader – EMOS Big Data trainer Univ. Utrecht – CBDS trainer – For lot’s and lot’s of students – And 2 dogs and 8 hamsters @pietdaas
  • 3. Marco Puts • Work – Methodologist CBS – Lead Data Scientist CBDS – Big Data Task Force member – Involved in ESSnet Big Data – Vegan • Trainer – ESTP training course leader – CBDS trainer – 4 tortoises
  • 4. Overview of Webinar • Properties of Big Data • Big Data processes – Collect – Process – Analyse – Disseminate • Wrap up & Discussion • Questions in between topics
  • 5. Big Data • What is Big Data? – How do we use this term? • Big Data is a source of data that is: – Rapidly available – Usually available in large amounts – Often generated by an unknown population – May have poor quality metadata – Usually has low information content – Requires processing prior to use – Unknown design
  • 6. Big Data’s most important property Information content can be rather low for Big Data
  • 7. Paradigm shift • Need of change in the way (official) statisticians look at data https://en.wikipedia.org/wiki/Rabbit%E2%80%93duck_illusion#/media/File:Kaninchen_und_Ente.png
  • 8. Question 1 • What sources do you consider Big Data? – Social media messages – Product prices on web sites – Satellite pictures – Sensor data of cows – Persons register of China – Activity tracker data of 8 persons
  • 9. Big Data uses • Big Data was introduced in previous Webinars • A number of examples have been shown • Big Data: how can it be used?! (not only its potential) • General view on Big Data processes
  • 10. Big Data processes • In GSBPM terms, 4 phases 1. Collect Get (access to) data 2. Process Check and convert data 3. Analyse Learn from and extract knowledge from data 4. Disseminate Release results In this Webinar, we discuss Big Data as the main source of input for official statistics Alternatives uses are: as an additional source, to impute missing data and in model. *GSPBM = Generic Statistical Business Process Model
  • 11. In this Webinar • Focus is Big Data as the main source of input for official statistics • Alternative uses are: – as an additional source, – to impute missing data, – in a calibration model.
  • 12. Big Data processes (2) • Data driven: cycles in every stage and in between Collect Process Analyse Disseminate Developed from input to output (in cycles) input output
  • 13. Big Data processes (3) • Phases will be illustrated with examples – Road intensity statistics (published) • Using road sensor data – Innovative company statistics (work in progress) • Using text on main page of company web sites – Energy: solar panel detection (work in progress) • Using areal pictures to identify solar panels There are three types of ‘data’ that can be used
  • 14. 1. Collect phase • Assure stable access to data – Long-term access is essential – Often data from private companies • Try to get some data – Learn from the data (trail-and-error) – Check various potential uses • Include domain expert knowledge – But not to early (ideas come first)
  • 15. 1. Collect phase (2) • Road sensor data – Maintained by organisation paid by the government – Statistics Netherlands law gives us access • Web sites of companies – Scrape data for studies (scrape once, store raw data) – In the future: inform companies in advance • Solar panels work – Free access to areal pictures (updated once a year) – Is that frequent enough?
  • 16. Questions 2 • What is the biggest problem when creating a statistics fully based on Big Data? • Stability of the source content • Stability of access • Stability of population included • All of the above • No problems at all
  • 17. 2. Process phase • Composed of multiple steps – Preprocessing • Perform some very (time) efficient initial checks • Convert and adjust the data prior to use • May involve visualizations – Cleaning (a.k.a. editing) • Clean data when the quality for a specific use is not sufficient • Should be fully automatic (no manual checks) • May involve visualizations
  • 18. 2. Process phase - preprocessing • Raw data is usually preprocessed: – After receiving it at the office (in a secure way) Automatic Information System data (GPS of ships), Areal Images, web scraping • Dataset transmitted can be huge • High infrastructural needs – At the location of data maintainer Road sensor data, Mobile phone data • Use infrastructure of data maintainer • Smaller dataset is transferred • May solve privacy issue • Process chain should still be in control of NSI
  • 19. 2. Process phase - preprocessing • Increasing information content !! – Remove unneeded and clearly erroneous records • Size and population reduction – Remove unneeded and clearly erroneous values • Less columns, only keep what is needed – To convert event-based to unit-based data • Create a more suited dataset • E.g. combining check-in and check-out data to trips – Tailor data to needs of NSI • Convert values to ranges used by NSI – Extract features • Construct new variables For texts, very important choices (remove stop words, stemming, ….) Discuss road sensor data as example
  • 20. Population and variables • In general it is important to consider: – The units included in the source vs target population • Which units are included? How to identify them? • Topic of research at our office • Extract features indicative for background characteristics – Definition alignment • Correspondence between definition used by data maintainer (if known) and NSI • Similar to administrative data
  • 21. Example: Road sensor data • Data generated by 20.000 road sensors – Every minute data is produced by each sensor (1440 records per sensor per day) – A total of 48 variables are included in each record of which only 13 are needed – Some records contain clearly erroneous data: -1 vehicles, no measurement, error flag on, location not on road, ….
  • 22. Raw road sensor data Data of a single sensor during 196 subsequent days
  • 23. Implemented quality indicators - O: Number of zero measurements |O| D: S:
  • 24. Road sensor data indicators • Relation between indicators, for 12 million records L (number of measurements) versus B (block indicator)
  • 25. Question 3 • What is good quality? • What are the advantages of Big Data in this context?
  • 26. Data Cleaning • Why clean the data? – Many data sources are very noisy – Doing analysis on noisy data is difficult: X = sin(seq(from=0,to=2*pi,by=0.01)) Y1 = X+2*runif(length(x)) Y2 = X+2*runif(length(x)) Print(cor(y1,y2)) will give a correlation of 0.6!!!! – Erroneous data have a negative effect on the quality of the estimates
  • 27. The Signal and the Data What is noise? information noise data data = information + noise Noise is that part of the data that is not relevant!
  • 28. The Signal and the Data (2)
  • 29. Road Sensor Data • Counts per minute of vehicles • Arrivals of vehicles at a road sensor • (semi-) Poisson process
  • 30. Implemented quality indicators - O: Number of zero measurements |O| D: S:
  • 32. Cleaning the Data Recursive Bayesian Estimation
  • 33. Cleaning the Data Recursive Bayesian Estimation
  • 35. Example: Solar Panels • Detect solar panels on rooftops • Preprocessing by cutting out roofs
  • 36. 3. Analyse • Getting insights from the data Analysis Model Driven Data Driven
  • 37. Making sense of the data Survey statistics Target population (unit=person) Sample of Persons Questionnaire Based on demographics Frame Selection Measurements Weights Traffic statistics Roads (unit=km) Road sensors Sensor data Based on location …
  • 38. Modeling the network of Road Sensors
  • 39. Modeling the network of Road Sensors
  • 40. • Dutch Highways • Main routes only Modeling the network of Road Sensors
  • 42. Projections • Project road sensors on main routes • Determine points of bifurcation for all entrance and exit ramps Entrance ramp
  • 43. Calibration of road sensors (2) 3000 2000 2500 20 7 Traffic flow Count 3000 3000 2000 2000 2500 Length 1055 5*3000 + 5*3000 + 10*2000 + 20*2000 + 7*2500 = 107,500 vehicle-km
  • 44. 3. Analyse (2) • When only a part of the data is composed of records of interest – This amount may vary between 50% - 1% – For example: • Innovative companies (9%) • Social tension (‘rare event’) • Company accounts on social media (3%) • Identify Belgian users (18%) • All about modelling an imbalanced dataset Analysis Model Driven Data Driven
  • 45. 3. Analyse (3) • Dealing with imbalanced datasets – Manually classify a sample (~1000 or more) • Multiple persons, write down instructions! • Result: training and test set – Try various approaches to test what works best • Logistic regression, Naive Bayes, Random Forest, SVM, NN,… – Add features (try as many as possible) • This adds domain knowledge – Check effect of preprocessing steps • Especially relevant for texts – Sometimes over- or under-sampling training set works • Results may vary, add more positive cases – Visualize findings
  • 46. Example: innovative companies • Companies from innovation statistics survey – 3000 innovative companies – 3000 non-innovative companies Downside: only companies with 10 or more working persons! • Scraped websites – Language detection, remove stop words, stemming – Words: unigram, bigram, trigram, word embeddings – Features: language, URL’s, email addresses, phone numbers, address • Checked various approaches – Logistic regression, Naïve Bayes, Random Forest, NN
  • 47. Example: innovative companies (2) • Evaluated model on: – Dutch SME innovation top 100 (various years) and list of Dutch start-up's • SME use different definition, nearly all start-up’s are innovative • Model focusses on technological innovation – 1 million company web sites and created detailed maps • Way to reveals curious behaviour of model • Able to create very detailed maps, at 4-digit zip-code level • Around 9% is innovative (according to our model)
  • 48. Question 4 • Name a few ways of dealing with imbalanced datasets when modelling • Under sample negative cases • Add more positive cases • Change evaluation metric • Cross validate 10x
  • 49. 4. Dissemination • Very much resembles standard output • Visualization are particularly important for Big Data based statistics • A few examples – Dot maps – AIS journeys
  • 52. Questions? Thank you for your attention !!