This document summarizes research conducted by the ESRC on public views regarding the use of alternative data sources.
The ESRC conducted public dialogues in 2013 and 2015 to understand concerns about linking administrative data and using private sector data for research. The dialogues found support for these practices if the data is used for socially beneficial purposes, is fully anonymized and kept secure, and no commercial gain is involved. However, participants needed extensive information to feel these conditions would be met.
The dialogues uncovered several public concerns including lack of transparency about data collection and use, data being passed to third parties, ensuring data security, and intentions for long-term data storage. Key factors affecting public acceptance included trust in governance processes,
Plans for the online 2021 Census with increased use of administrative and sur... (UKDSCensus)
Following the Government’s endorsement of the National Statistician’s recommendation on ‘the census and future provision of population statistics in England and Wales’, the ONS Beyond 2011 Programme has been closed and replaced by the new Census Transformation Programme. The new programme is focusing on developing the strategies and plans needed for delivery of the following major strands of work:- 1. An online census in 2021; 2. Integrated statistical outputs that make use of administrative data and surveys in conjunction with the census; 3. A recommendation for the future provision of population statistics beyond 2021. This presentation will outline ONS plans for Strands 1 and 2: to deliver a predominantly digital census while making the most effective use of administrative and survey data in its design, operation and outputs. It will cover the challenges of providing a census in 2021 that is 'digital by default', while building on the successes and lessons from the 2011 Census. Main areas that will be outlined include plans to address the challenge of digital exclusion while maximising the benefits of electronic data collection such as data quality, real-time response information and reducing processing time. Strand 2 is new for 2021, and looks at enhancing the traditional census building on the understanding of the opportunities and limitations of administrative data gained in Strand 3. Challenges include considering the most effective use of administrative and survey data in: optimising census data collection operations, estimating missing data, quality assuring results, reducing respondent burden or expanding topics covered.
Delivering early benefits and trial outputs using administrative data (UKDSCensus)
Following the Government’s endorsement of the National Statistician’s recommendation on ‘The census and future provision of population statistics in England and Wales’, the ONS Beyond 2011 Programme has been closed and replaced by the new Census Transformation Programme. The new programme is focusing on developing the strategies and plans needed for delivery of the following major strands of work:- • an online census in 2021; • integrated statistical outputs that make use of administrative data and surveys in conjunction with the census; • a recommendation for the future provision of population statistics beyond 2021. Strand 3 continues with research carried out in the Beyond 2011 Programme exploring the potential of administrative data and surveys as a future alternative to traditional Census taking beyond 2021. Building upon the concept of ‘Statistical Population Datasets’ derived through anonymous linkage of multiple administrative sources, the ONS plans to release a series of annual ‘trial output’ statistics to deliver early benefits and engage users with the development and evaluation of methods. ‘Trial outputs’ are intended to illustrate what might be realised from administrative data, in particular the range and frequency of outputs, and the potential for small area statistics. The first release will focus on local authority population counts at age/sex level. Subsequent annual releases will aspire to produce smaller area population counts and additional outputs on households, income and ethnicity, subject to data access and quality. This presentation will outline ONS plans to deliver trial outputs in the run up to the 2021 Census.
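The abstract above mentions anonymous linkage of multiple administrative sources but does not describe how it is done. As a rough illustration only, the Python sketch below links two made-up sources on a keyed hash of normalised identifiers. The key, the choice of fields, the normalisation rules and the exact-match strategy are all assumptions made for this example; none of them is taken from the ONS.

```python
import hashlib
import hmac

# Hypothetical secret key; a real system would manage this very differently.
SECRET_KEY = b"demo-key-not-for-real-use"


def pseudonymise(name: str, dob: str, postcode: str) -> str:
    """Return a keyed SHA-256 hash of normalised identifiers (assumed match key)."""
    # Normalise by trimming, upper-casing and removing internal spaces.
    parts = (p.strip().upper().replace(" ", "") for p in (name, dob, postcode))
    message = "|".join(parts).encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def link_sources(source_a, source_b):
    """Exact-match linkage on the pseudonymised key of each record."""
    keys_a = {pseudonymise(*record) for record in source_a}
    keys_b = {pseudonymise(*record) for record in source_b}
    return {
        "both": keys_a & keys_b,    # present in both sources
        "only_a": keys_a - keys_b,  # present only in the first source
        "only_b": keys_b - keys_a,  # present only in the second source
    }


# Made-up records: (name, date of birth, postcode).
health = [("Ann Lee", "1980-01-02", "AB1 2CD"), ("Bob Ray", "1975-05-06", "EF3 4GH")]
tax = [("ann lee", "1980-01-02", "ab12cd"), ("Cy Dow", "1990-09-09", "IJ5 6KL")]

result = link_sources(health, tax)
print(len(result["both"]), len(result["only_a"]), len(result["only_b"]))
```

Exact matching on a hash is deliberately simplistic: a single typo in a name breaks the link, which is one reason real linkage work tends to need more tolerant matching and careful quality assessment.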
Evaluating the feasibility of using administrative data in the context of cen... (UKDSCensus)
Following the Government’s endorsement of the National Statistician’s recommendation on ‘The census and future provision of population statistics in England and Wales’, the ONS Beyond 2011 Programme has been closed and replaced by the new Census Transformation Programme. The new programme is focusing on developing the strategies and plans needed for delivery of the following major strands of work:- 1. an online census in 2021; 2. integrated statistical outputs that make use of administrative data and surveys in conjunction with the census; 3. a recommendation for the future provision of population statistics beyond 2021. Strand 3 is continuing with research carried out in the Beyond 2011 Programme to develop an evaluation framework for assessing the suitability of using administrative data in the context of population statistics. By linking individual records between administrative sources and to Census data, a more informative view of data quality can be formed with particular focus on the statistical outputs being targeted. This presentation will highlight with examples the strengths and weaknesses of using administrative data to produce statistics about the population and its characteristics. Our results focus on the interpretation of cross-source and longitudinal linkage to demonstrate the extent to which the locational accuracy of administrative data can be relied upon to record individuals at their current place of residence. In addition, we present some of the challenges of producing statistics from differing statistical definitions, for example households and ethnicity, as well as variability in operational processes underpinning the collection and maintenance of administrative data.
ONS presentation at RSS South Wales poverty & inequality stats event (Richard Tonkin)
Update on ONS data for poverty statistics & research. Presentation given at RSS South Wales event: Poverty & Inequality in Wales - Statistics for Action (28th Sept 2016)
Big data and macroeconomic nowcasting from data access to modelling (Dario Buono)
Parallel advances in IT and in the social use of Internet-related applications provide the general public with access to a vast amount of information. The associated Big Data are potentially very useful for a variety of applications, ranging from marketing to reducing fiscal evasion.
From the point of view of official statistics, the main question is whether and to what extent Big Data are a field worth investing in to expand, check and improve the data production process, and which types of partnerships will have to be formed for this purpose. Nowcasting of macroeconomic indicators represents a well-identified field where Big Data has the potential to play a decisive role in the future.
In this paper we present the results and main recommendations from the Eurostat-funded project “Big Data and macroeconomic nowcasting”, implemented by GOPA Consultants, which benefits from the cooperation and work of the Eurostat task force on Big Data and a few external academic experts.
Going Digital: Using Mobile Data Collection to Monitor ECD in South Africa presented by Khulisa Management Services Associate Leticia Taimo at the 6th SAMEA Conference 2017 held in Johannesburg, South Africa
This project, carried out on behalf of the Société d’habitation du Québec (SHQ), established links between the various data sources in order to produce a statistical profile of the residents of social housing in the Ville de Québec (Quebec City).
Geospatial Intelligence Middle East 2013_Big Data_Steven Ramage (Steven Ramage)
Some initial considerations and discussion points around geospatial big data. Location adds context and relevance. Need to consider a number of V factors including Value.
Big Data with IOT approach and trends with case study (Sharjeel Imtiaz)
Big Data with an IoT approach and trends. It gives a complete overview of the data science process and shows, step by step, how that process is used to explore big data in a TripAdvisor case study.
Power Decision-making at Scale with Address-based Spatial Data Science (Precisely)
Location intelligence can add critical context to your data science projects and lead to better business decisions. Organizations in numerous industries including retail, real estate, insurance, construction, telecommunications, and government can reduce risks and speed responses to situations with superior location intelligence. Unfortunately, many organizations are unable to leverage precise, location-aware information because data is too hard to access, process, and interpret. As organizations expand globally and confront complex addresses and related data, they struggle even more.
Join this TDWI webinar to learn how you can simplify and accelerate location-aware data science processes. Speakers will discuss how trends such as cloud-native technologies and use of prebuilt location data sets can power data science and give decision makers important perspectives on risk, property decisions, situation response, and more.
Topics to be covered include:
TDWI perspectives on location intelligence and address-based spatial data science
The value of big data and cloud-native technologies for adding valuable context to business addresses
How to use high-precision location data to estimate risk
Guidance for leveraging location to extract actionable insights in data science projects and sharing results visually throughout your organization
MUNICIPAL3 Mobile Recycling Inventory, Jerrard Whitten (MassRecycleR32014)
Jerrard Whitten, Merrimack Valley Planning Commission, discusses how his regional commission used ArcGIS and related apps to keep track of curbside recycling, enforcement, etc. MA Planning Commissions should offer similar offerings.
Interesting ways Big Data is used today (Daniel Sârbe)
An overview of the Big Data field, interesting patterns in how data is used for data mining, predictive analytics and machine learning, and an overview of the jobs generated by the demand for Big Data.
We provide real-time big data training in Chennai, delivered by industry experts using real-time scenarios.
Our advanced topics will give students high-level knowledge of Big Data technology.
For more information, reach our Big Data technical team at +91 96677211551/56.
The experience of a team of Big Data training experts.
www.thecreatingexperts.com
SAP BEST INSTITUTES IN CHENNAI
http://www.youtube.com/watch?v=UpWthI0P-7g
Data science and visualization lab presentation (iHub Research)
The Data Science and Visualization Lab! This product is based on a component of research that delves into and innovates on the processes of data science – collection, storage/management, analysis and visualization. You have probably come across one of our amazing info-graphics. What else can you do with data?
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong... (IT Network marcus evans)
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong Value-Adding Proposition
by Patrick Hadley, Australian Bureau of Statistics at the Australian CIO Summit 2014
Presentation by Derek Silva from the National Geospatial-Intelligence Agency and Greg Brunner from ESRI for the ESRI Federal GIS Conference, February 8, 2015
High Performance Data Analytics and a Java Grande Run Time (Geoffrey Fox)
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However, the same is not so true for data-intensive applications, even though commercial clouds devote many more resources to data analytics than supercomputers devote to simulations.
Here we use a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on the Apache software stack that is well used in modern cloud computing.
We give some examples including clustering, deep-learning and multi-dimensional scaling.
One suggestion from this work is the value of a high performance Java (Grande) runtime that supports simulations and big data.
Similar to Opportunities for alternative data sources (20)
This presentation covers the key question: Why dashboards? Local authorities and other public bodies have largely stopped publishing reports and now produce dashboards. What are the factors that have contributed to this change?
This is the first presentation from our Workshop on 21 September 2023 on Dashboards, APIs and PowerBI.
ONS Local has been established by the Office for National Statistics (ONS) to support evidence-based decision-making at the local level. We aim to host insightful events that connect our users with exciting developments happening in subnational statistics and analysis at the ONS and across other organisations.
In April 2022, as the impact of increases in the Cost of Living really came to the forefront, Public Health & Communities, Suffolk County Council published a Cost of Living profile as part of the Joint Strategic Needs Assessment.
Alongside a written Cost of Living report ‘Making ends meet: The cost of living in Suffolk’, an interactive dashboard was also created using Power BI. In addition to internal data flows, publicly available data from sources such as the ONS have been used to provide a rich picture of the current situation for the local community.
The dashboard was developed in order to:
• Provide up to date data and information on the Cost of Living for Suffolk County Council, partner organisations, and members of the public.
• Deliver an interactive tool to allow users to focus on areas most relevant to them.
• Demonstrate that, while increases in the cost of living affect everyone, impact will be greatest for those who are already under financial pressure, exacerbating inequalities.
• Provide a source of actionable insight to support the system with the evidence base needed to support project development, drive change and really make a difference in the community.
Features of the dashboard:
• Place-focused - published at smaller geographies where possible
• Collaborative - Includes local data from across the system such as data shared by Citizens Advice and other system partners.
• Automated - Most data sources have automated connections, meaning there is little manual intervention required.
• Self-Service - Making the report publicly available puts data at the fingertips of colleagues, system partners and members of the public.
• Live - The dashboard is a living report which is frequently updated.
This session will:
• Provide a demonstration of Suffolk County Council’s Cost of Living dashboard
• Give an overview of data sources
• Explore opportunities for automation using Power BI
• Discuss how the data dashboard is used locally
This event is open to all; however, we anticipate it will be of most interest to anyone working on cost of living dashboards at the local level.
If you have any questions, please contact ons.local@ons.gov.uk.
ONS Local has been established by the Office for National Statistics (ONS) to promote evidence-based decision-making at the local level. We aim to host insightful workshops which will provide practical, technical support to help users make the most of ONS data. The Cross-Government Data Science Community brings together data scientists and analysts to build data science capability across the UK governments and public sector.
We are delighted to welcome you to our inaugural Workshop in our new series, entitled: 'How to use APIs'. The session will cover what Application Programming Interfaces (APIs) are, the advantages in using them and a practical demonstration of how they can be used. The journey of two Local Authority analysts as they begin using APIs in place of manual processes will be showcased to the audience. The session will conclude by explaining the plan for the forthcoming series of Workshops that will begin in September and introducing the Slack channel that ONS Local and Cross-Government DS community will be using to support users' technical questions going forward.
This event is open to all; however, we anticipate it will be of most interest to anyone working at a local level on creating data dashboards for internal or external use.
If you have any questions, please contact ons.local@ons.gov.uk.
ONS Local has been established by the Office for National Statistics (ONS) to support evidence-based decision-making at the local level. We aim to host insightful events that connect our users with exciting developments happening in subnational statistics and analysis at the ONS and across other organisations.
From 1 August 2019, the Secretary of State for Education delegated responsibility for the commissioning, delivery and management of London’s Adult Education Budget (AEB) to the Mayor of London. The AEB helps Londoners to get the skills they need to progress both in life and work. The overarching aim of London’s AEB is to make adult education in London even more accessible, impactful and locally relevant.
In this presentation, the Greater London Authority will be going through the results of the pioneering 2021/22 London Learner Survey (LLS). The survey’s objective is to gain insight into the outcomes of learners to inform and improve policy. The LLS consists of two linked surveys of learners who participated in GLA-funded Adult Education Budget (AEB) learning in the academic year 2021/22.
In the LLS, learners are surveyed before their course and again 5-7 months after completing it, to estimate the economic and social changes that learners experience following an AEB course.
In particular, the presentation will show the economic impact broken down by:
• Progression into employment
• Progression within work
• Progression into further learning
The social impact will be explored by looking at changes in:
• Health and wellbeing
• Improved self-efficacy
• Improved social integration
• Participation in volunteering
The presentation will also cover how outcomes vary by funding type, breaking down the results by Community Learning and Adult Skills.
This event is open to all; however, we anticipate it will be of most interest to anyone working at a local level on skills, education and employment.
If you have any questions, please contact ons.local@ons.gov.uk.
Adjusting OpenMP PageRank : SHORT REPORT / NOTES (Subhajit Sahu)
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
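To make the idea of PageRank "primitives" more concrete, here is a minimal sequential sketch in Python rather than OpenMP, so it does not reproduce the multi-threaded uniform or hybrid experiments. The names `multiply` and `sum_at` echo the primitives named in the notes, but their exact semantics in the original code are assumed here, as are the damping factor, tolerance and dangling-vertex handling.

```python
def multiply(a, b):
    """Element-wise product of two equal-length vectors (assumed semantics)."""
    return [x * y for x, y in zip(a, b)]


def sum_at(values, indices):
    """Sum of the entries of `values` at the given indices (assumed semantics)."""
    return sum(values[i] for i in indices)


def pagerank(in_edges, out_degree, damping=0.85, tol=1e-10, max_iter=500):
    """Pull-based power iteration; in_edges[v] lists the vertices linking to v."""
    n = len(in_edges)
    # Per-vertex scaling: each vertex shares its damped rank among its out-links.
    factor = [damping / d if d > 0 else 0.0 for d in out_degree]
    dangling_vertices = [v for v in range(n) if out_degree[v] == 0]
    ranks = [1.0 / n] * n
    for _ in range(max_iter):
        contrib = multiply(ranks, factor)
        # Rank held by vertices with no out-links is spread evenly over all vertices.
        dangling = damping * sum_at(ranks, dangling_vertices) / n
        base = (1.0 - damping) / n + dangling
        new_ranks = [base + sum_at(contrib, in_edges[v]) for v in range(n)]
        error = sum(abs(a - b) for a, b in zip(new_ranks, ranks))
        ranks = new_ranks
        if error < tol:
            break
    return ranks


# Small made-up graph with edges 0->1, 0->2, 1->2, 2->0.
in_edges = [[2], [0], [0, 1]]
out_degree = [2, 1, 1]
ranks = pagerank(in_edges, out_degree)
print([round(r, 4) for r in ranks])
```

In a threaded version, the loop that builds `new_ranks` is the natural candidate for parallelisation because each vertex's new rank depends only on the previous iteration's values.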
Analysis insight about a Flyball dog competition team's performance (roli9797)
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Learn SQL from basic queries to Advance queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world, where data privacy and compliance are top priorities for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences. (3) They are context-aware, encoding a different set of transformations for different use cases. (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfEnterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a Trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
3. Data sources for official statistics
• Surveys – eg of businesses and households
• Census – every 10 years
• Administrative data – by-product of
Government process
• Big Data?
‘Data that is difficult to collect, store or process within
the conventional systems of statistical organizations.
Either, their volume, velocity, structure or variety
requires the adoption of new statistical software
processing techniques and/or IT infrastructure to
enable cost-effective insights to be made.’
(UNECE, 2013)
5. Twitter
Rationale: Using geo-located
Twitter data to gain new insights into
mobility and migration
• 7 months of geo-located tweets
within Great Britain (about 100
million data points)
• Methodology to infer place of usual
residence:
- Identify user ‘anchor points’ by
clustering tweets using a DBSCAN
algorithm
- Identify residential anchor points using
AddressBase and nearest neighbour
analysis
Geolocated
penetration rates
by local
authority
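The clustering step in the methodology above can be sketched as follows. This is a minimal pure-Python DBSCAN, not the ONS implementation: the coordinates are assumed to be already projected to a planar grid, and `eps`, `min_pts` and the sample points are hypothetical values chosen only for illustration. The later step of matching anchor points to residential addresses with AddressBase is not shown.

```python
from math import hypot

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself).
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbours(j)
            if len(nearby) >= min_pts:  # j is a core point, keep expanding
                queue.extend(nearby)
    return labels


def anchor_points(points, labels):
    """Centroid and tweet count for every cluster, ignoring noise."""
    groups = {}
    for (x, y), label in zip(points, labels):
        if label != NOISE:
            groups.setdefault(label, []).append((x, y))
    return {
        label: (sum(x for x, _ in pts) / len(pts),
                sum(y for _, y in pts) / len(pts),
                len(pts))
        for label, pts in groups.items()
    }


# Hypothetical tweets from one user: two dense spots and one stray point.
tweets = [(0, 0), (1, 0), (0, 1), (1, 1),
          (100, 100), (101, 100), (100, 101),
          (50, 50)]
labels = dbscan(tweets, eps=2, min_pts=3)
anchors = anchor_points(tweets, labels)
```

Each entry in `anchors` is a candidate anchor point for the user; the stray tweet is labelled as noise and does not form an anchor.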
13. Using data from property websites and
aerial imagery to support the Census
Karen Gask
ONS Big Data team
14. Plan for today
• How property website data can be used to
improve statistics in areas with caravan parks
• What analysis of aerial imagery can tell us
about caravan parks
• Your feedback
15. Challenges for Census enumeration -
where Big Data can help
• Address intelligence is required to effectively
plan enumeration resources
• Understanding where knowledge gaps exist
• Help identify where there are access issues
or new builds
18. Potential benefits of property data for
Census
Improve understanding of small areas by identifying:
• High proportions of rental properties
• Unusual properties which may not be captured well in the
Address Register (house boats, caravan homes, beach
huts)
• Areas where there may be access issues or new builds
Provide some limited information on tenure of private
sector housing for Administrative Data Census
19. Work undertaken
Aim: investigate methods of machine learning
which could accurately identify, or distinguish
between, traits of interest within property data
in an automated way
Rationale: automated classification could allow
targeted field work and inform enumeration
resource allocation
21. Collecting data from these websites
• Could have ‘web scraped’ html code behind
websites to capture this data
− But many websites prohibit this
• Zoopla provides (limited) data for free via an
API (Application Programming Interface)
− Successfully collected data about 60,000
properties for sale or for rent
22. Early results
• Identified caravan homes with good accuracy
using price, property description, number of
bedrooms and property type
• Distinguished between holiday and residential
caravan homes with reasonable accuracy
(although sample size is small)
• Currently working on analysing property
description to identify gated communities
23. Identifying caravan homes (1)
• Developed machine learning methods such
as logistic regression, decision trees and
support vector machines
Example decision tree:
Is property type “mobile/park home”?
- yes: Caravan home (n=180)
- no: Does property description contain “holiday park”?
  - yes: Caravan home (n=55)
  - no: Not a caravan home (n=17,501)
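The two questions in the decision tree on this slide can be written as a short rule. This is only a sketch: the argument names `property_type` and `description` are hypothetical stand-ins for fields in the property listing data, and the sample listings are invented.

```python
def is_caravan_home(property_type, description):
    """Apply the two-question decision tree from the slide."""
    # First split: is the listed property type "mobile/park home"?
    if property_type.strip().lower() == "mobile/park home":
        return True
    # Second split: does the description mention "holiday park"?
    return "holiday park" in description.lower()


# Invented listings, for illustration only.
listings = [
    ("Mobile/park home", "Two bedroom park home on a quiet site"),
    ("Bungalow", "Static caravan on a popular Holiday Park"),
    ("Terraced house", "Three bedroom house close to the station"),
]
results = [is_caravan_home(ptype, text) for ptype, text in listings]
```

A hand-written rule like this is easy to audit, which is one reason a shallow tree is a useful baseline before trying more complex models.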
24. Identifying caravan homes (2)
• Support vector machines were the most accurate
method
Figure panels: linear hyperplane and support vectors (axes x, y);
non-linear hyperplane (axes x, y)
Unseen testing set of 7,230:
                             Predicted: not a caravan home   Predicted: caravan home
Actual: not a caravan home   7,113                           18
Actual: caravan home         0                               99
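The four counts in the test-set table can be turned into summary metrics. The sketch below assumes the table is read with actual classes as rows and predicted classes as columns; under that reading the classifier missed no caravan homes and wrongly flagged 18 other properties.

```python
# Counts from the slide, reading rows as actual and columns as predicted
# (this orientation is an assumption about the original table layout).
true_negative = 7113   # not a caravan home, predicted not a caravan home
false_positive = 18    # not a caravan home, predicted caravan home
false_negative = 0     # caravan home, predicted not a caravan home
true_positive = 99     # caravan home, predicted caravan home

total = true_negative + false_positive + false_negative + true_positive

accuracy = (true_positive + true_negative) / total
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)

print(f"total={total} accuracy={accuracy:.3f} "
      f"precision={precision:.3f} recall={recall:.3f}")
```

Accuracy is about 99.8%, but precision for the caravan class is about 84.6% (99 of 117 flagged properties), which shows why overall accuracy alone can flatter a classifier when the class of interest is rare.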
25. Distinguishing between holiday and
residential caravan homes
• Classified 500 caravan
home descriptions
• Split descriptions into
words then correlated
each word with holiday
/ residential
classification
• Small sample size so
there is some
overfitting
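One way to correlate each word with the holiday / residential label is the phi coefficient between "word is present" and "description is a holiday home". The slide does not say which correlation measure was used, so this choice is an assumption, and the four descriptions below are invented.

```python
from math import sqrt


def phi(present, is_holiday):
    """Phi coefficient between two equal-length lists of booleans."""
    pairs = list(zip(present, is_holiday))
    n11 = sum(1 for p, h in pairs if p and h)
    n10 = sum(1 for p, h in pairs if p and not h)
    n01 = sum(1 for p, h in pairs if not p and h)
    n00 = sum(1 for p, h in pairs if not p and not h)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


def word_scores(descriptions, is_holiday):
    """Score every word by its correlation with the holiday label."""
    tokens = [set(text.lower().split()) for text in descriptions]
    vocabulary = set().union(*tokens)
    return {word: phi([word in t for t in tokens], is_holiday)
            for word in vocabulary}


# Invented examples: True = holiday caravan, False = residential.
descriptions = [
    "static caravan on holiday park with pool",
    "holiday lodge with site fees and pool",
    "residential park home for over 50s",
    "park home on residential site",
]
is_holiday = [True, True, False, False]
scores = word_scores(descriptions, is_holiday)
```

With so few examples several words score exactly +1 or -1 by chance, which illustrates the overfitting risk the slide mentions for a small sample.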
26. Gated communities
• Currently exploring
use of Natural
Language Processing
on property
description
• Want “set in a private
gated development”
but not “…gated side
access to the garden”
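A simple pattern-matching baseline shows why the two phrases need to be told apart. This is a regular-expression sketch, not the Natural Language Processing approach being explored; the list of nouns after "gated" (other than "development", which comes from the slide) is an assumption.

```python
import re

# "gated" must be followed directly by a noun describing the whole site.
GATED_COMMUNITY = re.compile(
    r"\bgated\s+(?:development|community|estate|complex)\b",
    re.IGNORECASE,
)


def mentions_gated_community(description):
    """True if the description refers to a gated site, not a garden gate."""
    return bool(GATED_COMMUNITY.search(description))


wanted = mentions_gated_community("Set in a private gated development")
unwanted = mentions_gated_community("Benefits from gated side access to the garden")
```

A fixed pattern like this will miss unanticipated wordings, which is the gap that language-aware methods are meant to close.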
27. Limitations
• Only provides data
about properties for
sale or rent
• Zoopla does not cover
all properties for sale or
rent
• Some properties have
no description or a very
small description
• Sample of data we have
collected is small for
unusual properties
28. Next steps
• Data shows promise but we have collected all
the free data we can (nearly 60,000 records)
• Soon to issue a tender to purchase data for
Census Test areas to test methods in 2017
• Understand how this data could improve
statistics in areas with caravan parks
30. Potential benefits of image data for
Census
• Similar to property data – image data could
help fill knowledge gaps by identifying:
• the number of properties in a given area
• properties which are similar / different
• properties with particular features
• Images can be more timely than field
intelligence
• Images can provide more cost effective
insight than field intelligence
31. Work undertaken
Aim: To explore the utility of aerial and satellite
imagery for official statistics through a pilot
study of caravan site images
Rationale: This could improve statistics in areas
with caravan parks, which are historically
considered 'hard to count' within the Census
Address Register
32. Collecting data
• Data are obtained from Google's API
(Application Programming Interface) for free
• There is a limit on image dimensions and on the amount of
data one can get for free (e.g. downloading
images of the New Forest took 2 days)
• All downloaded images have the same
dimensions and the same level of magnification
• Google takes care of some pre-processing:
blending images together, adjusting colours
33. Pre-processing
Machine learning requires
‘training data’ where the objects
of interest are correctly labelled
- Circa 60 images were
manually labelled before
analysis
To artificially increase the size
of the dataset images were
augmented by
• rotation,
• flipping and
• translation.
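The three augmentation operations can be sketched on a tiny image stored as a list of rows. The function names, the 90-degree rotation step and the one-pixel shift are illustrative assumptions; the slides do not give the angles or offsets actually used.

```python
def rotate90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def translate_right(image, shift, fill=0):
    """Shift pixels right by `shift` columns, padding with `fill`."""
    width = len(image[0])
    shift = max(0, min(shift, width))
    return [[fill] * shift + row[:width - shift] for row in image]


def augment(image):
    """Return the original image plus five augmented variants."""
    variants = [image]
    rotated = image
    for _ in range(3):  # 90, 180 and 270 degrees
        rotated = rotate90(rotated)
        variants.append(rotated)
    variants.append(flip_horizontal(image))
    variants.append(translate_right(image, 1))
    return variants


patch = [[1, 2],
         [3, 4]]
augmented = augment(patch)
```

In practice any manual labels attached to an image would need the same transformation applied so that they still line up with the objects of interest.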
34. Object recognition
Used the following machine learning techniques
(plus others):
• Logistic regression
• Random Forests
• Support Vector Machines
• But artificial neural networks worked best
35. Output
• Heat map of
probabilities
that there is a
caravan at a
given
spot/patch of
the image
• Accuracy (for
single patches)
97%
36. Limitations
Limitations of the free data:
• Quality of the images, consistency of colours (white
balance, season)
• Timeliness of the data (e.g. Google satellite imagery
is up to 3 years old)
Algorithm limitations:
• Humans can't get it 100% right, so 97% seems good
• But even small error rates lead to a large number of
false positives when the classifier is deployed over a
large area
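The false-positive point can be made concrete with a little arithmetic. All figures below are hypothetical: the sketch assumes the 97% patch accuracy applies equally to patches with and without caravans, and invents an area of 1,000,000 patches of which 1,000 contain a caravan.

```python
def expected_detections(n_patches, n_caravan, sensitivity, specificity):
    """Expected true positives, false positives and precision."""
    true_positives = n_caravan * sensitivity
    false_positives = (n_patches - n_caravan) * (1 - specificity)
    precision = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, precision


# Hypothetical deployment: caravans are rare across a large area.
tp, fp, precision = expected_detections(
    n_patches=1_000_000,
    n_caravan=1_000,
    sensitivity=0.97,
    specificity=0.97,
)
print(f"true positives: {tp:.0f}, false positives: {fp:.0f}, "
      f"precision: {precision:.1%}")
```

Under these assumptions roughly 30,000 empty patches are flagged against fewer than 1,000 real caravans, so only about 3% of flagged patches are correct even though 97% of all patches are classified correctly.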
37. So what can we do?
• Identify deficiencies in the Address Register
used for Census
• Maybe the accuracy is not good enough for
individual caravans, but it can still help with
caravan parks
• Focus on large clusters and compare them
with Address Register
38. For example these sites
Address Register: 21 caravans
Algorithm: 188 caravans
Address Register: 3 caravans
Algorithm: 121 caravans
39. Or these sites
Address Register: 0 caravans
Algorithm: 61 caravans
Address Register: 0 caravans
Algorithm: 21 caravans?
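The comparison on these two slides amounts to ranking sites by the gap between the algorithm's count and the Address Register. The counts below come from the slides; the site names and the `min_gap` threshold of 50 are hypothetical.

```python
# (site name, Address Register count, algorithm count)
sites = [
    ("site A", 21, 188),
    ("site B", 3, 121),
    ("site C", 0, 61),
    ("site D", 0, 21),  # the slide marks this algorithm count as uncertain
]


def flag_discrepancies(sites, min_gap):
    """Sites where the algorithm finds at least min_gap more caravans."""
    flagged = [(name, algorithm - register)
               for name, register, algorithm in sites
               if algorithm - register >= min_gap]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


flagged = flag_discrepancies(sites, min_gap=50)
```

Setting a minimum gap keeps the focus on large clusters, where individual misclassified patches matter less than the overall discrepancy.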
40. Similar housing
• Find similar
buildings (e.g.
terraced
houses)
• Heat map of
similarity of
each spot/patch
to the central
one
41. Next steps
Short term:
• Use discrepancies between algorithm and
Address Register in Census Tests
• Include more data sources, e.g. LIDAR which
captures the height of mapped objects
Long term / other applications:
• Land use classification (sustainable
development, crop types)
• Population density estimation
43. Your feedback please
• Questions or comments on this work
• Can you think of other applications for this
type of data or methods?
• Is there similar work happening elsewhere?
• Can you think of other ‘big data’ sources we
haven’t considered yet for Census?
Comments / questions are welcome now or
Jane and I will be around for lunch and the
rest of the day (or
ons.big.data.project@ons.gov.uk)
44. What do we understand about public acceptability
of using administrative and other data for
research?
Vanessa Cuthill, Deputy Director, ESRC
47. Dialogues – what have we done?
• ESRC
2013 Public dialogues on using admin data (ESRC,
ONS, Ipsos Mori)
2015 Big Data: Public views on the use of private sector
data for social research (ESRC, Hopkins Van Mil)
• Others:
2014 Powers and perils of data (Ipsos Mori)
2014 Public attitudes to the use and sharing of their
data (RSS and Ipsos Mori)
2015 Private Lives? (MRS Report)
49. Why did ESRC embark on the dialogues on
data in 2013?
• Rapidly evolving data landscape - Administrative
Data Taskforce
• We wanted to:
Better understand people’s views on the linking of admin
data
Begin the process of creating a terminology describing the
re-use of administrative data and data linking that is
understandable to the general public
Help inform the development of the governance and
operational procedures of the ADRCs and provide data
on public attitudes to inform their future strategies and
priorities
50. Background
• Throughout October - November 2013 public dialogues
were held in 7 locations across the UK (Ipsos MORI led)
• The aims were:
To better understand the cultural barriers around linking
administrative data
To begin the process of creating a language that is
meaningful and accessible to the public.
To test the public perceptions of the rules that ESRC
ADRCs will be subject to and to provide the ADRCs
with data on public attitudes and appetite for
engagement
(To provide ONS with more detailed evidence on public
views of their current front-running option for Beyond
2011)
51. Support for initiative IF:
1. The data is linked for socially beneficial purposes
“As long as it’s used for good, like to develop things,
improve services, improve knowledge.” Belfast
2. It is fully de-identified – partial vs full postcodes
3. It is kept secure at all times – concerns around remote access
4. No commercial gain for business including commercial access
BUT participants needed extensive information and discussions with
experts and researchers in order to be satisfied that these conditions
would be met under the ADRN plans.
So - simply publicising these conditions may not be enough to
ensure that the general public are reassured about or support the work
of the ADRN.
53. Impact of this dialogue
• Report shared with ADRN
• Informed decisions of the ADRN Management Committee for the
ADRN policies and procedures
▶ Lay membership in ADRN Board and Approvals Panel
• Short animated videos on ADRN website to help explain:
Data linkage https://www.youtube.com/watch?v=E3e4D2bHxa8&feature=youtu.be
Protecting Privacy https://www.youtube.com/watch?v=nnxz3_XGMAE&feature=youtu.be
• Clear policies and the ‘5 Safes’ protocols: Safe People, Safe Projects,
Safe Settings, Safe Outputs, Safe Data
54. Big Data: Public views on the use of
private sector data for social research
A Findings Report for the Economic and Social
Research Council
55. Aim and objectives of the dialogue
To explore public views on access to and
the use of data from private sector
organisations for research purposes in the
context of three Data Research Centres
funded by the ESRC.
o To identify areas of public concern about
confidentiality and privacy impact
o To start creating a language around private
sector data and the use for research purposes
o To test public understanding of: data ownership,
data acquisition, data access, using/ re-using
private sector data, data storage and preservation
57. Public views on data collection by
the private sector
“It’s just the way we live”
• Internet
• GPS tracking devices
• Cards
• Public places
“No way you can opt out of giving data unless you live like
a hermit in the middle of an island.”
58. Particular Concerns
• Lack of transparency and information
• Passing/selling data to others
• Keeping data safe
• Linking data
• Intentions for data collection and use
• DPA: principles and sanctions
59. Examples of public concerns
“What people are worried about is that it’s not going to be kept just
within. It might get sold to insurance companies, employers and this
is where people want to know that it’s going to be safe.”
“The more data you add the more it’s creating this sense of identity
of each person, so it’s almost like everyone’s got this data avatar
that’s building up as we get older.”
“I try to avoid using the internet for purchases specifically because I
don’t want my data collected and then used for sales purposes
afterwards.”
(on DPA) “Things like ‘used in a way that is adequate, relevant and
not excessive’ what does that mean? Who decides what that is? Do
you get to decide that yourself when you’re doing research?”
61. Data acquisition
• Only acquire data from trustworthy companies
• Ensure accuracy and relevance of data
• Work as much as possible with anonymised data
….
• Test the provenance of data sets
• Little support for payment for data sets
• Improved information
….
“Accurate and up to date data is vital”
“It is acceptable to share data for the good of society – but we
need to know how it will be used”
“I don’t think it’s right that they [the Data Centres] should buy
data, it’s public money so they should be spending it on the public”
62. Data access
• Trust in processes
• Approval for access procedures researchers
• Avoid clash of interests
....
• Secure setting favoured over virtual environment
• Consent for use of personal data
• Clear communication about purpose of research
....
“Big data is a broad term, at what point is it not personal
anymore, when does it lose your name?”
“The process to secure it and who’s actually getting to use it and
stuff like that […] seems to be set up that it’s pretty secure and
that not just anybody can walk in and get this information.”
63. Data storage
• More information: what/ / for how long/ security
• Physical storage favoured over virtual environment
• Ensure systems are accessible in future
“We don’t know where the information’s stored and who’s in
control of it.”
64. Data ownership
• Low familiarity
• More information: type of data owned
• More information: data you don’t use
“They are going to have data coming in, they’re going to be
processing it, so although it originally came from a source, the new
information that’s been collected, is that owned by them or does
[ownership] still go back to the original?”
“If it’s about me, surely I own the data?”
65. Transparency – why it matters?
“Just educate the public about the Data
Centres. If the public are aware of what’s
happening then they may not mind so much”
“A lot of the stigma that comes with data
sharing comes with people not knowing and not
being educated about the facts of how the
data’s being used.”
66. Value of private sector data for knowledge
and society
“Show the greater good of using my data – the
benefits of my data for the greater good.”
“I’d be really willing to sign up for this only if I
saw the benefits that my data provided. […] If
you see the impact that it has not only on your
life but on the life of the NHS as well and then
they are going to change their services, that’s
the greater good of it”.
68. Conclusions from the dialogue
o Sharing of information about how the Data Centres operate
significantly increased levels of understanding of the benefits of
social research using private sector data.
o Improved communication and education about the processes by
which the Data Centres acquire, store, own and access private sector
data is vitally important to establish greater credibility.
o Data Centres should use case studies to demonstrate how the use of
private sector data in social research can lead to policy or service
improvements.
o The more reassurances are given about the Data Centre processes
and the benefits of using private sector data the more buy-in can be
expected from members of the public.
73. Giving the last word to the participants…
Primary contact at ESRC:
Maria.Sigala@esrc.ac.uk
Thank you!
Vanessa Cuthill ESRC Deputy Director
(Evidence, Impact and Strategic Partnerships)