The document describes progress on a project to quality-assure UK postcode directories from 1980 to the present and to create historic snapshots from them. It outlines the following:
1) Four phases of work completed so far, including loading the raw data, auditing it for errors, and developing a methodology to verify instances and reconcile inconsistencies based on temporal and spatial thresholds.
2) Common error types identified, including instances with the same introduction and termination dates (Type I) and instances lacking termination dates or with inconsistent timelines (Type II).
3) Plans to finalize the quality-assurance rules, update the instance database, and then derive the historic snapshots from the quality-controlled data.
4) Outstanding issues include reconciling remaining
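The verification and reconciliation logic summarised above can be sketched as a minimal check, assuming each instance record carries an introduction date, an optional termination date and a national grid reference. The field names, the 1 km spatial threshold and the `directory_closed` flag are illustrative assumptions, not the project's actual rules:

```python
import math
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class PostcodeInstance:
    postcode: str
    introduced: date
    terminated: Optional[date]  # None = no termination date recorded
    easting: int                # national grid reference, metres
    northing: int

def classify_error(inst: PostcodeInstance,
                   directory_closed: bool = False) -> Optional[str]:
    """Return the QA error type for one instance, or None if it looks clean."""
    if inst.terminated is not None and inst.introduced == inst.terminated:
        return "Type I"   # introduced and terminated on the same date
    if inst.terminated is None and directory_closed:
        return "Type II"  # missing termination date in a period that has ended
    if inst.terminated is not None and inst.terminated < inst.introduced:
        return "Type II"  # inconsistent timeline
    return None

def spatially_consistent(a: PostcodeInstance, b: PostcodeInstance,
                         threshold_m: float = 1000.0) -> bool:
    """True if two instances of the same postcode fall within the spatial threshold."""
    return math.hypot(a.easting - b.easting, a.northing - b.northing) <= threshold_m
```

Running every instance through checks of this shape is one way the audit phase could flag Type I and Type II records before the snapshot derivation step.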
2. Overview
• About EDINA
• Project Background and Context
• Progress To Date
• Plans for coming months
• Outstanding Issues
3. EDINA
• A JISC funded national data centre based at Edinburgh University Data
Library.
• Provides the UK tertiary education and research community online access to a
library of data, information and research resources.
• Its largest section (Geo Data Services), comprising GIS
Specialists and Software Engineers, provides access to 2 key online services -
Digimap & UKBORDERS.
• We and our user community have an interest in both contemporary and
historical postcode products.
4. Background & Context
• What are the historical postcode directories? - datasets which list all unit
postcodes within the UK and assign to them a national grid reference,
geographic lookups and counts of assigned addresses.
• ESRC has purchased Gridlinked versions of AFPD (2001-2006) for use by the
academic community. This community also has an interest in historic versions
of the AFPD and thus ONS supplied to ESRC historic postcode directories
(1980-2000) for free on the basis that ESRC would QA the historic versions.
• At this point all versions of postcode directories received by ESRC have been
available to users through the EDINA UKBORDERS service since October
2004.
• Steady stream of user downloads. Data for census years most popular but
interestingly significant interest in non-census years.
5. Deliverables
• Objectives/Deliverables of the QA set out formally in August 2004 MOU
between ESRC & ONS:
• Key Deliverable is a Quality Controlled postcode instance database spanning
1980 to present day. From this ESRC will derive snapshot historical versions of
the postcode directories replacing the versions of unknown quality that are
currently in existence.
• Postcode Instance - defined as the existence of a postcode for a certain period
of time which is unique on both postcode label and date of introduction.
• Postcode Instance = Postcode Label + Date of Introduction
• Instance db will have a number of fields - DOI, DOT, most recent easting &
northing and higher geography lookups (1991 ED/OA; 1998 Ward; 2001 OA).
• The ONS Ward History Database will be used to check the veracity of ward
codes within the historic versions of the postcode directories.
6. Progress to Date
• 4 sequential work phases to complete these objectives:
• I. Data Loading (complete)
• II. Quality Assurance I - Audit (complete)
• III. Quality Assurance II - Verification (in progress)
• IV. Production of Historic Snapshots
• At this point first 2 of these are complete and we are currently engaged in
the verification phase.
• ... Taking each phase in turn
7. Phase I – Data Loading
• Postcode directories were supplied by ONS from 1980 to present day.
• Origin of data varies:
• Central Postcode Directories: 1980 - 1990 (except 1989)
• AFPDs: 1991 - 1998 (except 1996 & 1997)
• NHSPD: 1996 & 1997
• AFPD (NHS Variant): 1999
• AFPD (Gridlink version): 2000
• + Gridlink versions of AFPD from 2001 to current release.
• With the exception of 1989, a complete set - quite remarkable given that
digital curation & preservation are a fairly recent concern.
8. Phase I – Data Loading
• We took each historic version, loaded it into its own
database table (the database used is PostgreSQL) &
then merged each year's table into a super table
giving all postcodes from all versions of the AFPD.
• Given the differing origins of the year tables and the
tendency for number of attributes to increase over
time, the harmonisation of these snapshots itself
was an "interesting" data management challenge.
For practical purposes fields were distilled down to a
core set.
• The super table was reduced to a table with distinct
postcode labels (giving the labels of all postcodes
since 1980) and then to the more valuable postcode
instance table.
• Composite merged table - 50,986,078 rows
• Distinct postcode unit table - 2,330,886 rows
• Postcode Instance table - 2,763,839 rows
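The distillation described above can be sketched in a few lines of Python. This is a toy illustration with made-up rows and field names; the real work was done in PostgreSQL over the ~51 million row merged table:

```python
# Hypothetical sketch of the Phase I distillation: merged yearly rows are
# reduced first to distinct postcode labels, then to postcode instances
# (unique on label + date of introduction). All values are illustrative.

rows = [
    # (postcode, date_of_introduction, easting, northing) - toy data
    ("EH8 9LW", "1980-01", 325000, 673000),   # same instance seen in two years
    ("EH8 9LW", "1980-01", 325000, 673000),
    ("EH8 9LW", "1995-06", 325100, 673050),   # a later re-introduction
    ("G12 8QQ", "1982-03", 256000, 667000),
]

# Distinct postcode unit table: one row per label.
distinct_labels = sorted({postcode for postcode, *_ in rows})

# Postcode instance table: unique on (label, date of introduction).
instances = sorted({(postcode, doi) for postcode, doi, *_ in rows})

print(len(distinct_labels), len(instances))  # 2 labels, 3 instances
```

The key point is that an instance table can be larger than the distinct label table, since a recycled label contributes one row per introduction.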
9. Phase I – Data Loading
• By itself Date of Introduction only tells us when a postcode was instantised.
In order to be able to examine the lifecycle of each instance we also need to
know if this instance has been terminated or is still live.
• To each instance we attempted to add a Date Of Termination (DOT) by
searching through each of the historic AFPD version tables and determining if
the instance was terminated. Not a trivial task given volumes of data and
number of searches required.
• At the same time each instance also had its latest grid reference associated
with it.
• Instance database is therefore quite rich as it holds both the temporal and
spatial history for the instances associated with a postcode.
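A minimal sketch of the DOT search, assuming each yearly table is represented here as a set of live postcode labels (the actual search ran across the historic AFPD version tables in PostgreSQL):

```python
# Hedged sketch of attaching a Date of Termination (DOT) to each instance:
# scan the yearly tables in order and record the first year in which a
# previously live postcode no longer appears. Structures are illustrative.

yearly_tables = {
    1980: {"EH8 9LW", "G12 8QQ"},
    1981: {"EH8 9LW", "G12 8QQ"},
    1982: {"G12 8QQ"},            # EH8 9LW disappears -> terminated
    1983: {"G12 8QQ"},
}

def date_of_termination(postcode, doi_year, tables):
    """Return the first year after introduction in which the postcode is
    absent, or None if it is still live in the latest table."""
    for year in sorted(tables):
        if year > doi_year and postcode not in tables[year]:
            return year
    return None

print(date_of_termination("EH8 9LW", 1980, yearly_tables))  # 1982
print(date_of_termination("G12 8QQ", 1980, yearly_tables))  # None (still live)
```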
10. Phase II – Quality Assurance
(Audit)
• Rationale for Quality Assurance - the quality of the instance database will be
propagated to derived products, so it is essential that we understand which
instances are genuine, which can be regarded as spurious, and which may
need to be fixed or weeded out.
• First Step – Analysis of the frequency of instances associated with distinct postcodes.
• Frequency of instances associated with distinct postcodes:
Num of postcode instances : Frequency
1 : 2,379,140
2 : 343,995
3 : 34,986
4 : 4,839
5 : 571
6 : 85
7 : 27
8 : 26
9 : 138
10 : 18
11 : 8
12 : 2
13 : 4
• Straight away we can see that in some cases distinct postcodes have multiple
instances associated with them.
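The frequency analysis above can be reproduced in miniature with Python's `Counter` (toy instance data, not the real 2.7 million row table):

```python
from collections import Counter

# Sketch of the audit's first step: count how many instances each distinct
# postcode has, then tabulate the frequency of those counts. Toy data only.

instances = [
    ("EH8 9LW", "1980-01"), ("EH8 9LW", "1995-06"),
    ("G12 8QQ", "1982-03"),
    ("AB1 0AA", "1980-01"), ("AB1 0AA", "1985-02"), ("AB1 0AA", "1991-07"),
]

instances_per_postcode = Counter(label for label, _ in instances)
frequency = Counter(instances_per_postcode.values())

for n, freq in sorted(frequency.items()):
    print(f"{n} : {freq}")
```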
11. Phase II – Quality Assurance
(Audit)
• Majority of postcodes represented by only a single instance. But significant
number of postcodes have multiple instances associated with them – why?
• Genuine Postcode Recycling
• Spurious Instances due to imputation problems or systematic tablewide
update procedures in past versions (e.g. the update for all Scottish 1973
instances in the 1980 table).
• Expected vs. Divergent Cases.
14. Phase II – Quality Assurance
(Audit)
• Programmatic tests were designed to flag cases in the Instance database
which diverged from what we expected.
• Do this by taking each postcode in turn and examining the timelines
associated with its instances. Errors grouped into 3 types:
• Type I - in which the DOI = DOT (the instance is instantised & terminated at
the same point in time)
• Type II – (A) in which all instances of the postcode are live or (B) there are
other inconsistencies within the timeline such as blank dates of termination
within a sequence of instances.
• Type III - postcode instantised once but has multiple dates of
termination
Name of these errors is a convenience – not to be confused with Type I/II errors
in Statistics!
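The three error types can be sketched as a classification function over one postcode's timeline. The tuple layout and label strings are illustrative, not the project's actual test code:

```python
def classify(instances):
    """Classify one postcode's instance timeline against the audit rules.

    instances: chronological list of (doi, dot) tuples, dot is None when live.
    Returns a set of error labels; empty means the timeline looks as expected.
    (The labels echo the presentation's naming, not statistical errors.)
    """
    errors = set()
    dots = [dot for _, dot in instances]
    if any(doi == dot for doi, dot in instances):
        errors.add("Type I")                  # introduced & terminated at once
    if len(instances) > 1 and all(dot is None for dot in dots):
        errors.add("Type II-A")               # all instances still live
    elif None in dots[:-1]:
        errors.add("Type II-B")               # blank DOT inside the sequence
    dois = {doi for doi, _ in instances}
    if len(dois) == 1 and len({d for d in dots if d}) > 1:
        errors.add("Type III")                # one DOI, multiple DOTs
    return errors

# An expected timeline: terminated instance followed by a live one -> no flags.
print(classify([("1980-01", "1985-01"), ("1991-01", None)]))  # set()
```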
16. Phase II – Quality Assurance
(Audit)
• As we can see, the Type II error cases represent the bulk of the errors, so
effort has been directed at identifying different varieties of this type of error.
We will spend a few minutes examining two such examples now.
17. Phase II – Quality Assurance
(Audit)
• Case A
• 6 instances, none with a date of termination - the conflict arises immediately
after the first instance.
• Is it valid for there to be so many postcodes which have multiple live
instances?
• Are all of these cases a result of postcode recycling or are they in fact due to
inconsistencies within the dataset itself?
18. Phase II – Quality Assurance
(Audit)
• Case B
• Again we have 6 instances - this time there is a blank date of termination
within the timeline (which conflicts with the latter 2 instances)
19. Phase II – Quality Assurance
(Audit)
• Why are these a problem? - when we create the historic cuts we don't want
any ambiguity.
• need to be sure that all live postcodes are truly live (and should not have
been terminated).
• that where a postcode has multiple instances associated with it, these are
genuine and not a result of problems with how the data was created or
updated.
• that all data is as consistent as possible.
• How to reconcile these Spurious cases?
20. Phase III – QA - Verification
• Type I errors - unclear - we can't see any logic behind this - to which we ask:
is it valid for an instance to be introduced and terminated in the same month?
• Type II errors - problem less clear cut as we have already seen - different
species of the same problem causing instances to diverge from the expected
norm.
• Type III errors - multiple dates of termination - As a rule, pick either the
earliest OR latest and apply to all cases
• The rest of the presentation is mainly concerned with dealing with the Type II errors.
• Key Assumption – Instance database holds information about the location of
each instance in space and time. Instances which are similar in both these
respects can be merged.
22. Phase III – QA - Verification
• Time - According to Royal Mail:
• A postcode is only supposed to be reused after a minimum period of 3 years
has elapsed & residential postcodes are never reused.
• On this basis where we have 2 instances which are instantised within less
than 3 years of one another we can assume that they are referring to the
same thing.
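A minimal sketch of the temporal rule, assuming instances merge transitively when successive dates of introduction fall within the 3-year threshold (dates and the day-count approximation are illustrative):

```python
from datetime import date

# Sketch of the 3-year temporal rule: successive instances introduced less
# than 3 years apart are assumed to refer to the same postcode and merged.
# The threshold comes from Royal Mail's minimum reuse period.

def merge_by_time(dois, years=3):
    """Group chronologically sorted dates of introduction: a new group starts
    only when the gap from the previous DOI is at least `years` years."""
    groups = []
    for doi in sorted(dois):
        if groups and (doi - groups[-1][-1]).days < years * 365:
            groups[-1].append(doi)     # within threshold: same instance
        else:
            groups.append([doi])       # genuine reuse (or first instance)
    return groups

dois = [date(1980, 1, 1), date(1981, 6, 1), date(1990, 3, 1)]
print(len(merge_by_time(dois)))  # 2 groups: 1980/1981 merge, 1990 separate
```

Note the design choice: the gap is measured from the last member of the current group, so a chain of closely spaced introductions collapses into a single instance.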
23. Phase III – QA - Verification
Space (Geography)
• Nearby things tend to be more similar than things that are farther
apart.
• Instances located close to one another likely reference the same set of
addresses. Instances located farther apart may represent recycling
events.
• For each postcode we can see how its instances change in position over
time - are they spatially stationary or more dynamic?
• How do we quantify this within the instance table? - for each set of instances
associated with a postcode unit, compute the change in easting & northing
between instances.
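The easting/northing comparison can be sketched as follows, with illustrative coordinates standing in for the grid references held in the instance table:

```python
import math

# Sketch of the spatial measure: for successive instances of one postcode,
# compute the planar distance (in metres) between their national grid
# references. Coordinates below are made up for illustration.

def location_changes(points):
    """Distance between each consecutive pair of (easting, northing)."""
    return [math.hypot(e2 - e1, n2 - n1)
            for (e1, n1), (e2, n2) in zip(points, points[1:])]

points = [(325000, 673000), (325030, 673040), (425000, 673000)]
changes = location_changes(points)
print([round(c) for c in changes])  # [50, 99970] - small shift, then a jump
```

A small first value is consistent with re-georeferencing; the large second value is the kind of jump that might signal a recycling event.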
24. Phase III – QA - Verification
• BUT we need to be aware of the spatial accuracy issue. Accuracy with which
grid references have been assigned to postcodes has increased over time as
methodologies have changed with technology advances.
• An overall increase in accuracy of georeferencing over time.
• Instance location change may therefore operate at multiple scales – a local
change due to changes in georeferencing plus a larger change brought about
by recycling.
25. Phase III – QA - Verification
• Summary statistics for all instances:
• 75% of postcodes with multiple instances record no change in location
whatsoever.
• Of those that do exhibit location change, in 90% of cases this was between
1m and 3km with the remaining cases exhibiting a change of up to 500km.
• Clearly it would be useful if we had a spatial threshold (like the 3 year
temporal threshold) that we could use to decide whether 2 instances should
be merged or kept separate as genuine reuses.
• We argue that using a combination of temporal & spatial measures of
similarity it is possible to discriminate between genuine and spurious
instances.
26. Phase III – QA - Verification
• Research has only recently begun to engage with this problem; progress has
been hindered by the size of the datasets involved and the pain involved in
isolating indicative cases.
• Significant time has been invested in exploring the problem but we are by no
means experts - we need feedback - does this methodology seem
appropriate - are our core assumptions logical?
• Plans are to explore the effects of applying different threshold values - using
known cases of reuse to inform selection of threshold value.
• Pick a threshold value - determine the effects of applying this to the dataset
as a whole in terms of e.g. the number of merges it yields, taking samples
to determine the validity of results - are instances inappropriately merged?
28. Phase III – QA - Verification
• Demonstrate application of these rules by going back to the Spurious cases
we looked at earlier.
• Case A - using our temporal rule of 3 years - these 6 could be compressed to
3 instances. Using our spatial rule (assuming that our upper spatial threshold
exceeds 100m) these could be compressed to a single instance.
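Combining both rules, a Case A style reconciliation might look like the sketch below. The dates, coordinates and the 100 m threshold are illustrative assumptions, not the project's chosen values:

```python
from datetime import date
import math

# Sketch combining the temporal (3-year) and spatial rules: successive
# instances merge only when introduced within the temporal threshold AND
# located within the spatial threshold of the current group.

def merge(instances, years=3, metres=100):
    """instances: chronological (doi, easting, northing) tuples."""
    groups = []
    for doi, e, n in sorted(instances):
        if groups:
            pdoi, pe, pn = groups[-1][-1]
            close_in_time = (doi - pdoi).days < years * 365
            close_in_space = math.hypot(e - pe, n - pn) <= metres
            if close_in_time and close_in_space:
                groups[-1].append((doi, e, n))
                continue
        groups.append([(doi, e, n)])
    return groups

case_a = [(date(1980, 1, 1), 325000, 673000),
          (date(1981, 2, 1), 325020, 673010),   # near in time and space
          (date(1990, 6, 1), 325010, 673005)]   # too late: kept separate
print(len(merge(case_a)))  # 2
```

Tightening the spatial threshold (e.g. `metres=10`) leaves all three toy instances separate, which is exactly the inclusive-versus-exclusive trade-off discussed below.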
29. Phase III – QA - Verification
• Case B - the inconsistent instance must either be terminated or merged with
another instance. Applying the temporal rule it could be merged with the
following instance. However its location is quite different, so we might decide
that this falls outside our threshold and instead terminate it with
the start date of the following instance.
30. Phase IV – Create QA Instance DB
At some point, in order to move forward, we will have to implement the rules
from Phase III and carry out the updates to the instance database.
• In doing this we run the risk of going in one of two directions - we can
either be too inclusive, leading to too many instances being merged together,
or not inclusive enough, with too few instances merged together.
• We intend to be pragmatic about this - we simply cannot have so many
possibly false instances associated with each postcode. Unlikely that we are
going to be able to resolve all cases.
• Once the rules are in place, implementation of them should be fairly straight
forward.
31. Creation of Historic Snapshots
• With Quality Controlled Instance database in place, yearly historic version of
the postcode directories can then be derived by pulling out all instances that
exist within a particular time slice.
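The time-slice extraction can be sketched as a simple filter over the quality controlled instance table. The tuple layout and dates are illustrative:

```python
from datetime import date

# Sketch of snapshot derivation: a yearly historic version contains every
# instance whose lifespan covers the snapshot date, i.e. DOI <= date and
# (DOT is blank or DOT > date). Field layout is illustrative.

def snapshot(instances, on):
    """instances: list of (postcode, doi, dot) with dot None when live."""
    return [(pc, doi, dot) for pc, doi, dot in instances
            if doi <= on and (dot is None or dot > on)]

instances = [
    ("EH8 9LW", date(1980, 1, 1), date(1986, 5, 1)),
    ("EH8 9LW", date(1995, 6, 1), None),
    ("G12 8QQ", date(1982, 3, 1), None),
]
print([pc for pc, _, _ in snapshot(instances, date(1985, 1, 1))])
# ['EH8 9LW', 'G12 8QQ']
```

Note that the 1985 slice picks up the first EH8 9LW instance while a later slice would pick up its 1995 successor, which is why unresolved duplicate instances would make the cuts ambiguous.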
32. Outstanding Issues
• Reconciling the spurious instances still an ongoing task.
• We would welcome comments/feedback about the
assumptions/methodologies we have chosen to adopt, both from ONS and
from other expert users of the AFPD.
• Is there any documentation which might shed light on procedures used to
update the datasets in the past & might explain some of the systematic
inconsistencies we have discovered?
33. Conclusions
• 1. Historical & Contemporary postcode directory datasets are being accessed
by academic users through UKBORDERS.
• 2. QA process data has been received and loaded - raw instance database
has been created.
• 3. Quality Assurance Audit has been carried out - quality of dataset has been
assessed.
• 4. Significant Progress has been made in reconciling inconsistencies, but work
remains before derived data can be created and exposed to user community.
• 5. Feedback on work to date and input from other users is requested in
order to bring the work to a close.