Meta Data and Quality of Data for OGD Platform India


Presentation on Meta Data and Quality of Data to be contributed on Open Government Data Platform India

Published in: Government & Nonprofit
  1. Open Government Data Platform India: Meta Data and Quality of Data. By: Sunil Babbar, Scientist-C, NIC
  2. Data Contributors and Their Role
     • Nominated by the Chief Data Officer
     • Coordinate and identify datasets which can be contributed
     • Prepare the datasets
       – Getting them cleaned
       – Preparing metadata for the datasets in the predefined format
       – Ensuring the quality and correctness of the datasets of his/her unit/division
     • Contribute Catalogs/Resources (Datasets) through the predefined workflow (Data Contributor → Chief Data Officer (CDO) for review and publishing → PMU to publish on the OGD Platform)
  3. Resources (Datasets / Apps)
     • A data set (or dataset) is a collection of data
     • A data set corresponds to the contents of a single table or statistical data matrix, where
       – every column represents a particular variable, and
       – each row corresponds to a given member of the data set in question
     • Open Data Formats: CSV, XLS, ODF, XML/RDF, JSON, RSS/Atom, KML/GML
  4. Catalog
     • A catalog is a grouping of similar resources (Datasets/Apps)
     • A catalog represents a collection of resources that you group together
     • Acts like a directory of information about resources
     • Benefits of a Catalog
       – Facilitates data access by users who are first interested in a particular kind of data
       – Helps in grouping resources with the same theme/subject, and thus helps the user search for a specific dataset/resource easily
       – Ministries/Departments need less effort to upload the same set of resources, or to update a dataset for a new period, without writing the metadata again and again
       – Makes navigation and searching for relevant data easier for users
  5. Catalog Formation
     • Catalog with the same resource for different time periods (Annual, Quarterly, Monthly, Weekly and Daily)
       – E.g. Annual Rainfall Data
     • Catalog with the same resource but for different jurisdictions (India, States, Districts, Block, Village)
       – E.g. States/UTs-wise Forest and Tree Cover
     • Catalog with the same resource but for different categories (Scheduled Caste, Scheduled Tribe, General, Religion etc.)
       – E.g. District-wise crimes committed against Scheduled Castes
     • Catalog with similar types of resources under the same report: resources of a similar nature from the same report/survey can be grouped under the same catalog
       – E.g. Primary Census Abstract 2011 - India and States
  6. MetaData
     • Is the information that describes the data
       – What the data is (About Data)
       – Data source
       – Who created it
       – When it was created
       – Etc.
     • Metadata allows the data to be traced to its origin and its quality to be assessed
  7. Metadata Elements for Catalogs
     • Title (Required): A unique name for the catalog (group of resources)
       – Should contain general terms which describe the essential properties/characteristics of the datasets/resources
       – Should be in plain English and include sufficient detail to facilitate search and discovery
       – The time period should normally not be mentioned in the catalog title, so that similar resources containing the same type of data for the next time period or periodic update can be accommodated in the same catalog
       – In exceptional cases, however, it can contain the time period, particularly for periodic surveys/censuses which contain a huge number of datasets/resources belonging to the same period/year
       – E.g. Current Population Survey, Consumer Price Index, Variety-wise Daily Market Prices Data, State-wise Construction of Deep Tubewells over the years, etc.
     • Description (Required): Provide a detailed description of the catalog
       – An abstract describing the nature and purpose of the catalog
       – Contains the names of the variables which are available in the datasets
       – Can also contain the definitions of some variables
  8. Metadata Elements for Catalogs
     • Keywords (Required): A list of terms, separated by commas, describing and indicating the content of the catalog. Example: rainfall, weather, monthly statistics.
       – Keywords help users discover your dataset; please include terms that would be used by technical and non-technical users.
     • Group Name: An optional field to provide a Group Name to multiple catalogs in order to show that they may be presented as a group or a set.
     • Sector & Sub Sector (Required): Choose the sector(s)/subsector(s) that most closely apply to your catalog.
     • Asset Jurisdiction (Required): A required field to identify the exact location or area to which the catalog and resources (datasets/apps) cater, viz. entire country, state/province, district, city, etc.
  9. Example - Creation of catalog
     • Catalog Title:
       – Company Master Data 2015 (Incorrect: contains a time frame, so if in future we want to add data under this catalog, e.g. Company Master Data for 2016, it would not be possible to upload it under this catalog)
       – Company Master Data (Correct)
     • Catalog Description:
       – Get data of Company master data..?? (Incorrect: does not contain detailed information. The description should contain the names of the variables which are available in the datasets)
       – Get data on master details of any company registered with Registrar of Companies (RoC). Data contains various information like Corporate Identification Number (CIN), Company Name, Company Status, Company Class, Company Category, Authorized Capital in INR, Paid-up Capital in INR, Date of Registration, Registered State, Registrar of Companies, Principal Business Activity, Registered Office Address and Sub Category. (Correct)
     • Keywords:
       – Company Master Data, ….?? (Incorrect: keywords should be a list of terms describing and indicating the content of the catalog; all the possible search keywords should be included)
       – Registered Companies, Company Master Data, Company Data, Indian Companies, Company, Company Details, Corporate Identification Number, CIN, Company Address (Correct)
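The catalog metadata rules can be sketched as a small Python check. This is only an illustration and not part of the OGD Platform: the `CatalogMetadata` class, the `check_catalog` function, the year-matching heuristic and the placeholder sector value are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class CatalogMetadata:
    """Hypothetical container for the catalog fields named on the slides."""
    title: str
    description: str
    keywords: List[str]
    sectors: List[str]
    asset_jurisdiction: str
    group_name: str = ""  # optional field


# Heuristic (an assumption): a standalone four-digit year suggests a time period.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def check_catalog(meta: CatalogMetadata) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    required = {
        "title": meta.title,
        "description": meta.description,
        "asset jurisdiction": meta.asset_jurisdiction,
    }
    for name, value in required.items():
        if not value.strip():
            problems.append(f"{name} is required")
    if not [k for k in meta.keywords if k.strip()]:
        problems.append("keywords are required")
    if not [s for s in meta.sectors if s.strip()]:
        problems.append("sector is required")
    if YEAR_RE.search(meta.title):
        problems.append("title mentions a year; normally leave the time period out")
    return problems


catalog = CatalogMetadata(
    title="Company Master Data 2015",
    description="Master details of companies registered with Registrar of Companies (RoC).",
    keywords=["Registered Companies", "Company Master Data", "CIN"],
    sectors=["Example Sector"],  # placeholder, not a real OGD sector name
    asset_jurisdiction="India",
)
print(check_catalog(catalog))
```

Because the slides allow a time period in the title in exceptional cases (periodic surveys and censuses), the year check is best treated as a warning for a human reviewer rather than a hard failure.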
  10. Metadata Elements for Resources
     • Title (Required): A unique name for the resource
       – Should be self-explanatory, viz. Consumer Price Index for <Month/Year> etc.
       – The resource title should contain the time frame so that no duplication occurs in future, e.g. Consumer Price Index from April-2000 to April-2015, Rainfall of the year 2012
     • Access Method (Required): How the user is going to get the data
       – Upload a Dataset, or
       – Single Click Link to Dataset
     • Category (Required): Is it a Dataset or an Application
     • Reference URLs: These may include a description of the study design, instrumentation, implementation, limitations, and appropriate use of the dataset or tool. In the case of multiple documents or URLs, please delimit with commas or enter on separate lines.
  11. Metadata Elements for Resources
     • If Resource Category is Dataset
       – Granularity of Data: The time interval over which the data inside the dataset is collected/updated on a regular basis (one-time, annual, hourly, etc.)
       – Frequency (Required): The time interval at which the dataset is published on the OGD Platform on a regular basis (one-time, annual, hourly, etc.)
       – Access Type: The type of access, viz. Open, Priced, Registered Access or Restricted Access (G2G)
     • If Resource Category is App
       – App Type (Required): The type of App being contributed, viz. Web App, Web Service, Mobile App, Web Map Service, RSS, APIs etc.
  12. Metadata Elements for Resources
     • Date Released: The release date of the Dataset/App.
     • Note: Any further information the contributor/Chief Data Officer wishes to provide to the data consumer about the resource
       – The resource note should contain proper explanations of any special characters/notations like *, #, NA etc. which are used in the datasets
       – Other relevant information regarding the dataset should also be provided in the note section
       – Information regarding figures in the data should also be provided, e.g. Figures are in numbers, Unit: (Rs./qtl.)
       – Footnotes available under a report should be part of the Resource Note
     • NDSAP Policy Compliance: This field indicates whether the dataset is in conformity with the National Data Sharing and Accessibility Policy of the Govt. of India.
  13. Example - Creation of Resource
     • Resource Title:
       – Number of Registered Motor Vehicles (Transport & Non-Transport) in Delhi (Incorrect: the resource title should contain the time frame so that no duplication occurs in future)
       – Number of Registered Motor Vehicles (Transport & Non-Transport) in Delhi during 2009-2010 (Correct)
     • Resource Note:
       – NIL (Incorrect: no note is given, but the dataset contains special notations like *, # etc., some cells contain NA, and other relevant information is also present for this dataset)
       – Figures are in numbers; NA: Not available; $: Category-wise data not received; *: Included in cars; Totals are provisional, representing summation of available data (Correct)
     • Resource Category:
       – Application (Incorrect: it is a dataset, not an application)
       – Dataset (Correct)
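The Resource Note rule (every special notation used in the dataset must be explained in the note) lends itself to an automated check. The Python sketch below is an illustration under stated assumptions: the marker list, the function name `unexplained_markers` and the sample rows are invented for this example, and the substring test against the note is deliberately crude.

```python
import csv
import io
from typing import List

# Assumption: the notations to look for. The slides mention *, #, NA and $.
SPECIAL_MARKERS = ("NA", "*", "#", "$")


def unexplained_markers(csv_text: str, note: str) -> List[str]:
    """Return markers that appear as whole cell values but are absent from the note."""
    used = set()
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    for row in reader:
        for cell in row:
            if cell.strip() in SPECIAL_MARKERS:
                used.add(cell.strip())
    # Crude test: a marker counts as explained if it occurs anywhere in the note.
    return sorted(marker for marker in used if marker not in note)


# Made-up sample rows, for illustration only.
sample = "Category,2009-10\nCars,100\nJeeps,*\nTaxis,NA\n"

print(unexplained_markers(sample, "NIL"))
print(unexplained_markers(sample, "Figures are in numbers; NA: Not available; *: Included in cars"))
```

With the note "NIL" the check reports both `*` and `NA` as unexplained; with the corrected note it reports nothing.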
  14. Quality of Datasets
     • Data Compositeness/Completeness/Consistency
       – Check for the constituent elements (variables) within the dataset
       – The dataset should be well explained, in terms of the variables present in it, through descriptive metadata
       – The metadata should clearly describe the time period, units, definitions, frequency, data source and jurisdiction, and include notes on anything in the dataset that needs special mention
       – Time series data should be continuous in nature
     • Data Coverage
       – Datasets should be made available at the lowest possible level to allow users to correctly describe the phenomena being measured
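One way to test the requirement that time series data be continuous is to look for gaps between the first and last period. The following Python sketch assumes annual data identified by integer years; the function name `missing_years` and the sample years are hypothetical.

```python
from typing import Iterable, List


def missing_years(years: Iterable[int]) -> List[int]:
    """Return the years absent between the earliest and latest year supplied."""
    present = set(years)
    if not present:
        return []
    return [y for y in range(min(present), max(present) + 1) if y not in present]


# Made-up example: 2011 is missing from an otherwise annual series.
print(missing_years([2009, 2010, 2012, 2013]))
```

The same idea extends to monthly or quarterly series by enumerating the expected periods instead of years.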
  15. Quality of Datasets
     • Standard process of “data cleansing”:
       – Assign the appropriate type (string, date, character or number) to each field
       – Abbreviations and acronyms are to be replaced by their full forms
       – No special characters or blank cells in the matrix (replace them with NA)
       – Column headers should be self-explanatory
       – Uniform font size, with no formulas and no merged columns
       – The dataset should be de-normalized, without any merged column
       – No formula or calculated column/row, such as a Total or Average of the available columns or rows, should appear in the dataset
       – Above all, it must be in a machine-readable format, viz. CSV, XML, JSON, ODS, XLS etc.
       – The file name should not contain any special character except _ and -, and should not contain blank spaces
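Several of the cleansing rules above are mechanical and can be sketched in Python. This is a minimal illustration, not an official OGD tool: the helper names are hypothetical, the file-name pattern assumes a single dot before the extension is allowed, and summary rows are detected only by a first cell of "Total" or "Average".

```python
import re
from typing import List

# Letters, digits, _ and - only, followed by one extension (assumption: the dot is allowed).
FILENAME_RE = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")


def is_valid_filename(name: str) -> bool:
    """True if the name has no blank spaces or special characters other than _ and -."""
    return FILENAME_RE.fullmatch(name) is not None


def fill_blanks(rows: List[List[str]]) -> List[List[str]]:
    """Trim every cell and replace blank cells with NA."""
    return [[cell.strip() or "NA" for cell in row] for row in rows]


def drop_summary_rows(rows: List[List[str]]) -> List[List[str]]:
    """Remove calculated rows whose first cell is 'Total' or 'Average'."""
    summary_labels = {"total", "average"}
    return [r for r in rows if not (r and r[0].strip().lower() in summary_labels)]


# Made-up rows, for illustration only.
rows = [["Delhi", ""], ["Goa", " 12 "], ["Total", "12"]]
print(fill_blanks(drop_summary_rows(rows)))
print(is_valid_filename("rainfall_2012.csv"), is_valid_filename("rainfall 2012.csv"))
```

Rules such as self-explanatory column headers or expanding abbreviations still need human review; only the mechanical checks are covered here.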
