This document discusses using big data as a source for official statistics and outlines some key challenges:
1. Big data is often noisy, dirty, and unstructured, requiring methods to extract useful information and reduce noise. Visualization tools help explore large datasets.
2. Big data sources are selective and contain events rather than full population coverage, requiring methods to convert events to units and correct for selectivity.
3. Beyond simple correlation, additional analysis is needed to establish causality between big data findings and other data sources.
4. Privacy and security laws must be followed, requiring anonymization of sensitive microdata or use of aggregates within a secure environment. Addressing these methodological and legal challenges will help realize big data's potential as a source for official statistics.
1. Big Data as a source for Official Statistics
Edwin de Jonge and Piet Daas
November 12, London
2. Overview
• Big Data
  ‐ Research 'theme' at Statistics Netherlands
  ‐ Data-driven approach
• Visualization as a tool
  ‐ Why?
  ‐ Examples in our office
• Issues & challenges
  ‐ From an official statistics perspective
  ‐ Focus on the methodological and legal ones
6. Anscombe's quartet
– Mean of x1, x2, x3, x4: all equal 9
– Variance of x1, x2, x3, x4: all equal 11
– Mean of y1, y2, y3, y4: all equal 7.50
– Variance of y1, y2, y3, y4: all equal 4.1
– Correlation for ds1, ds2, ds3, ds4: all equal 0.816
– Linear regression for ds1, ds2, ds3, ds4: all equal y = 3.00 + 0.500x
Looks the same, right?
9. Why visualization?
A tool for data analysis:
– Effective display of information
– Summary of the data
– Shows outliers / patterns
– Helps explore the data
– Helps check assumptions
10. Often maps
Many visualizations are maps
– Positive:
  ‐ Familiar
  ‐ Attractive
– But a map only makes sense:
  ‐ When the data is geographically distributed
  ‐ When locality is meaningful
  ‐ When the data is correctly normalized
13. Many maps are just population maps!
A better map:
‐ Takes population size into account (e.g. by making figures relative)
‐ May plot the difference w.r.t. an expected value
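Normalization is the whole trick: raw counts mostly mirror population size, so rates (or deviations from an expected value) are what should be mapped. A minimal sketch with hypothetical regional figures:

```python
# Normalizing map data: plot rates instead of raw counts.
# The region names and numbers below are made up for illustration.
regions = {                  # region: (event_count, population)
    "Amsterdam": (5200, 870_000),
    "Utrecht":   (2100, 360_000),
    "Zeeland":   (230,   38_000),
}

for name, (events, pop) in regions.items():
    per_1000 = 1000 * events / pop        # rate, comparable across regions
    print(f"{name}: {per_1000:.1f} events per 1,000 inhabitants")
```

On raw counts Amsterdam dominates simply because it is big; per 1,000 inhabitants the small region can turn out to be the outlier.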
14. Visualization is not easy
– Creating good visualizations is hard
– "Easy reading" is not "easy writing"
A visualization must be:
– Faithful
– Objective
It must therefore not introduce perceptual bias.
15. Visualization
– Use an appropriate chart
– Use appropriate scales
  ‐ x, y, color, time
– Use an appropriate granularity
Research: what works for which data?
17. Example: the Virtual Census
‐ Every 10 years a census needs to be conducted
‐ No longer with surveys in the Netherlands
  • The last traditional census was held in 1971
‐ Now done by (re-)using existing information
  • Linking administrative sources and available sample survey data at a large scale
‐ How to check the result? With a visualisation method: the tableplot
18. Making the tableplot
1. Load the file (17 million records)
2. Sort the records according to a key variable (age in this example)
3. Combine the records into 100 groups (170,000 records each)
   • Numeric variables: calculate the average (e.g. average age)
   • Categorical variables: the ratio between the categories present (e.g. male vs. female)
4. Plot the figure for a select number of variables (up to 12)
   • The colours used are important
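The binning step behind the tableplot can be sketched in a few lines, assuming a simple list of (age, sex) records rather than the actual 17-million-record census file:

```python
# Tableplot binning sketch: sort on the key variable, split into 100 equal
# bins, then summarize each bin (average for numeric, ratio for categorical).
import random

random.seed(1)
records = [(random.randint(0, 99), random.choice("MF")) for _ in range(10_000)]

records.sort(key=lambda r: r[0])        # step 2: sort on key variable (age)
n_bins = 100
size = len(records) // n_bins           # step 3: 100 groups

rows = []
for i in range(n_bins):
    group = records[i * size:(i + 1) * size]
    avg_age = sum(r[0] for r in group) / len(group)             # numeric: average
    male_ratio = sum(r[1] == "M" for r in group) / len(group)   # categorical: ratio
    rows.append((avg_age, male_ratio))

print(len(rows))  # 100
```

The result is 100 summary rows, one per bin, which can then be drawn as horizontal bars; because the data was sorted first, the key-variable column is monotone and the other columns show their relation to it.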
20. October 1st 2013, Statistics Netherlands tableplot of the census test file
21. Tableplot: monitoring data quality
– All data in the office passes through stages:
  ‐ Raw data (as collected)
  ‐ Preprocessed (technically correct)
  ‐ Edited (completed data)
  ‐ Final (removal of outliers, etc.)
24.
– Contains all financial data on jobs, benefits and pensions in the Netherlands
  ‐ Collected by the Dutch Tax Office
  ‐ A total of 20 million records each month
– How to obtain insight into so much data? With a visualisation method: a heat map
27. Visualization helps with the volume of data
– Summarize by "binning":
  ‐ Tableplot
  ‐ Histogram
  ‐ Heatmap (2D histogram)
– Smoothing?
– Detect unexpected patterns
– We use it as a tool to check, explore and communicate data
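A heat map in this sense is just a 2D histogram: bin two numeric variables on a grid and count records per cell. A minimal sketch on synthetic data (no plotting library needed to see the idea):

```python
# 2D histogram ("heat map") sketch: 100,000 synthetic (x, y) pairs binned
# onto a 20x20 grid; each cell holds a record count, later mapped to colour.
import random

random.seed(7)
pairs = [(random.gauss(50, 15), random.gauss(30, 10)) for _ in range(100_000)]

nx = ny = 20
grid = [[0] * ny for _ in range(nx)]
for x, y in pairs:
    i = min(max(int(x * nx / 100), 0), nx - 1)   # clamp into [0, nx)
    j = min(max(int(y * ny / 60), 0), ny - 1)    # clamp into [0, ny)
    grid[i][j] += 1                              # each cell counts records

# Every record lands in exactly one cell, so the grid preserves the total.
print(sum(map(sum, grid)))  # 100000
```

This is why binning scales: however many records go in, the object to draw stays a fixed 20x20 grid.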
29. Big Data: issues & challenges
During our exploratory studies we identified a number of issues & challenges.
Focussing on the methodological and legal ones, we found that there is a need to:
1) deal with noisy and dirty data
2) deal with selectivity
3) go beyond correlation
4) cope with privacy and security issues
We have only solved some of them (partially).
30. 1) Deal with noisy and dirty data
– Big Data is often:
  ‐ noisy
  ‐ dirty
  ‐ redundant
  ‐ unstructured (e.g. texts, images)
– How to extract information from Big Data in the best/most efficient way?
31. Noisy and dirty data
– Examples: social media sentiment, traffic loop data
– Approaches: aggregate, apply filters (Poisson/Kalman), try to exclude noisy records, use models (to capture structure), the 'Google approach' (80/20 rule)
– Preferably do NOT use samples!
33. Noise reduction
Social media: daily & weekly sentiment in Dutch messages
34. Noise reduction
Social media: daily, weekly & monthly sentiment in Dutch messages
35. Noise reduction
Social media: monthly sentiment in Dutch messages
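The noise reduction shown on slides 33-35 can be imitated with a plain moving average: aggregating a noisy daily series to weekly and monthly windows. A sketch on synthetic data (the actual CBS sentiment series and filters are not reproduced here):

```python
# Noise reduction by aggregation: smooth a noisy daily series with a moving
# average (weekly ~ 7-day window, monthly ~ 30-day window).
import math
import random

random.seed(42)
daily = [math.sin(d / 30) + random.gauss(0, 0.5) for d in range(180)]  # signal + noise

def moving_average(series, window):
    # Average each value over its trailing `window`-sized neighbourhood.
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

weekly = moving_average(daily, 7)
monthly = moving_average(daily, 30)
# The wider the window, the smoother (lower-variance) the series: the
# underlying trend survives while the day-to-day noise averages out.
```

This is the simplest possible filter; the slides' Poisson/Kalman filters serve the same purpose with an explicit noise model.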
36. Social media sentiment & consumer confidence
Social media: monthly sentiment in Dutch messages & consumer confidence
Corr: 0.88
39. Correct for dirty data
Use data from the same location from the previous/next minute (5 min. window)
– Before: total = ~295 million vehicles
– After: total = ~330 million vehicles (+12%)
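The traffic-loop correction can be sketched as simple neighbour imputation: when a minute's count is missing, fill it from the same loop's surrounding minutes (a 5-minute window). The counts below are made up for illustration:

```python
# Impute missing minute counts from the same location's neighbouring minutes.
counts = [31, 28, None, 30, 27, None, None, 25, 29]  # vehicles per minute

def impute(series, half_window=2):
    # half_window=2 on each side gives the slide's 5-minute window.
    filled = list(series)
    for i, v in enumerate(series):
        if v is None:
            lo, hi = max(0, i - half_window), min(len(series), i + half_window + 1)
            known = [series[j] for j in range(lo, hi) if series[j] is not None]
            filled[i] = sum(known) / len(known)   # average of known minutes
    return filled

print(impute(counts))  # no gaps left; the total goes up, as on the slide
```

Filling gaps this way is what pushed the measured total from ~295 to ~330 million vehicles: the missing minutes were real traffic, not zero traffic.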
40. 2) Deal with selectivity
– Big Data sources are selective (they do NOT cover the entire population considered)
  ‐ Some probably more than others
– AND: all Big Data sources studied so far contain events!
  ‐ E.g. social media messages created, calls made and vehicles detected
  ‐ Events are probably the reason why these sources are so big
– When there is a need to correct for selectivity:
  1) Convert events to units
     ‐ How to identify units?
  2) Correct for the selectivity of the units included
     ‐ How to cope with units that are truly absent yet part of the population under study?
41. Units / events
– Big Data contains events
  ‐ Social media messages are generated by usernames
  ‐ Traffic loops count vehicles (Dutch roads are the units)
  ‐ Call detail records belong to mobile phone IDs
– Convert events to units by profiling
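Converting events to units by profiling amounts to grouping event records by an identifier and summarizing per unit. A toy sketch with hypothetical message events:

```python
# Events -> units: group events by an identifier and build a per-unit profile.
# The (user_id, day) events below are made up for illustration.
from collections import defaultdict

events = [
    ("u1", 1), ("u1", 1), ("u1", 2),
    ("u2", 1),
    ("u3", 5), ("u3", 6), ("u3", 6),
]

profiles = defaultdict(lambda: {"events": 0, "days": set()})
for user, day in events:
    profiles[user]["events"] += 1       # how active is this unit?
    profiles[user]["days"].add(day)     # on how many distinct days?

for user, p in sorted(profiles.items()):
    print(user, p["events"], len(p["days"]))  # u1 3 2 / u2 1 1 / u3 3 2
```

Seven events collapse to three units; the per-unit profile (activity, spread over time) is what a selectivity correction would then work with.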
43. Travel behaviour of active mobile phones
Mobility of very active mobile phone users during a 14-day period
Based on:
– Call and text activity multiple times a day
– Location based on phone masts
Clearly selective:
– The north and south-west of the country are hardly included
44. 3) Go beyond correlation
– You will very likely use correlation to check Big Data findings against those in other (survey) data
– When the correlation is high:
  1) Try falsifying it first (is it coincidental/spurious?); correlation ≠ causation
  2) If this fails, you may have found something interesting!
  3) Perform additional analysis to look for causality (cointegration, structural time-series approach)
Use common sense!
45. An illustrative example
Official unemployment percentage vs. the number of social media messages including the word "unemployment"
Corr: 0.90?
46. 4) Privacy and security issues
– Dutch privacy and security law allows the study of privacy-sensitive data for scientific and statistical research
– Still, appropriate measures need to be taken:
  • Prior to new research studies, check the privacy sensitivity of the data
  • In the case of privacy-sensitive data:
    ‐ Try to anonymize the microdata or use aggregates
    ‐ Use a secure environment
– The legal issues around enabling the use of Big Data for official statistics production are currently being looked at
  ‐ No problems for Big Data that can be considered 'administrative data', i.e. Big Data that is managed by a (semi-)governmentally funded organisation
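The two measures named on the slide, anonymizing microdata and working with aggregates, can be sketched in a few lines. The names, salt, and regions below are hypothetical; real statistical disclosure control involves far more than this:

```python
# Two privacy measures in miniature: pseudonymize the unit identifier with a
# salted one-way hash, and release only aggregates.
import hashlib
from collections import Counter

SALT = b"keep-this-secret"   # assumption: salt never leaves the secure environment

records = [("alice", "Utrecht"), ("bob", "Utrecht"), ("carol", "Zeeland")]

def pseudonymize(name):
    return hashlib.sha256(SALT + name.encode()).hexdigest()[:12]

micro = [(pseudonymize(n), region) for n, region in records]   # no raw names left
aggregate = Counter(region for _, region in micro)             # safe(r) to publish

print(aggregate)  # Counter({'Utrecht': 2, 'Zeeland': 1})
```

The pseudonymized microdata still allows linking within the secure environment, while only the aggregate counts leave it.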
47. Conclusions
– Big Data is a very interesting data source
  ‐ Also for official statistics
– Visualisation is a great way of creating insight
  ‐ Not only for data exploration
– A number of fundamental issues need to be resolved:
  ‐ Methodological
  ‐ Legal
  ‐ Technical (not discussed here)
– We expect great things in the near future!