IDIC:C1001 A student data file comprises student records. Each record may contain several fields. Each field or data item contains one piece of information, for example, Student Name or Student Number. All the data items about one student form that student’s record.
IDIC:C1001 Alternatively, in a database environment, a file is called an entity set, a record an entity, and a field an attribute.
IDIC:C1001 The physical file refers to how data is held on the physical storage medium, for example, on which tracks and cylinders of a hard disk a file’s contents are stored. The logical file is the contents a viewer actually sees, together with the processing method, for example, processing data serially or using an index.
IDIC:C1001 A key field is a unique data item which is used to find and read or process that particular record, for example, a student number is used to access that student’s record and update it.
IDIC:C1001 The length of a record and number of records in a file determine the file size in terms of bytes or characters. Fixed length records can be designed for stock items, books, pharmaceuticals and other inanimate objects. However, files that contain information about human beings are usually of variable-length records, e.g. patients’ medical records, employee records and student records. One reason is human ‘data’ is of variable length. Names and addresses, and other personal information are seldom of fixed length.
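As a rough illustration of how record length and record count determine file size, the sketch below packs fixed-length student records with Python's `struct` module. The field widths (8 bytes for the student number, 30 for the name) and the sample names are assumptions made for this example only.

```python
import struct

# Hypothetical layout: 8-byte student number + 30-byte name = 38 bytes per record.
RECORD_FORMAT = "<8s30s"
RECORD_LENGTH = struct.calcsize(RECORD_FORMAT)

def pack_student(number: str, name: str) -> bytes:
    """Pack one record; struct pads short values with null bytes."""
    return struct.pack(RECORD_FORMAT, number.encode("ascii"), name.encode("ascii"))

students = [
    ("S0000001", "Tan Mei Ling"),
    ("S0000002", "Alexander Montgomery-Smith"),
]
data = b"".join(pack_student(number, name) for number, name in students)

# Fixed-length file size = record length x number of records.
file_size = RECORD_LENGTH * len(students)

# Padding bytes in the name field: the storage a fixed-length design wastes.
wasted = sum(30 - len(name) for _, name in students)
```

The short name wastes most of its 30-byte field, which is the trade-off the notes describe for personal data such as names and addresses.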
IDIC:C1001 Data can be written to a hard disk in order to save it for future reference. Insertion allows records to be written to any location in an existing file. Deleting a record from a file erases its presence.
IDIC:C1001 Updating is to alter the contents of a record to reflect its latest status. Sorting is to re-organise records in a file according to a certain order, e.g. first to last, or last to first. Merging combines data from two or more files to output one data file.
IDIC:C1001 Matching reads records from two files and checks whether they are referring to the same entity, e.g. records from student file are compared with student fee payment file to highlight who has paid (or failed to pay). Searching needs a search expression where e.g. a student’s (actual) number (key value) is keyed in and the search program uses this number to find a match. Appending is contrasted with Insertion. Appending a record adds it to the end of the data file.
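The matching operation described above can be sketched as follows. The record layouts, the field name `student_no`, and the sample data are hypothetical; the point is only that records from two files are compared on a common key.

```python
def match_files(student_file, payment_file):
    """Return (paid, unpaid) lists of student numbers by comparing keys."""
    paid_keys = {payment["student_no"] for payment in payment_file}
    paid, unpaid = [], []
    for student in student_file:
        target = paid if student["student_no"] in paid_keys else unpaid
        target.append(student["student_no"])
    return paid, unpaid

student_file = [
    {"student_no": 101, "name": "Ali"},
    {"student_no": 102, "name": "Bee"},
    {"student_no": 103, "name": "Chen"},
]
payment_file = [
    {"student_no": 101, "amount": 500},
    {"student_no": 103, "amount": 500},
]

paid, unpaid = match_files(student_file, payment_file)
```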
IDIC:C1001 The hit rate for a monthly payroll file will be high as all employees get their pay. The hit rate for a supermarket’s stock file would be much lower, as a supermarket tends to stock a large number of items. The number of items actually sold daily, weekly or monthly may be far smaller, so the hit rate would be low.
IDIC:C1001 Data files are categorised according to the nature of their contents and their processing purpose. They include master file, transaction file, work file, transition file, security file and audit file.
IDIC:C1001 A master file may be updated daily (sales data), weekly (stock item movements) or monthly (payroll). The master file works hand in hand with the transaction file to identify which records need to be amended and updated. Work files can be created during processing so that records can be set aside or marked for reading and processing. Sorting is a common process where work files are created for temporary use.
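A minimal sketch of a master file update, assuming a stock master file already sorted by a hypothetical `item_no` key and an unsorted transaction file. The sorted copy of the transactions plays the role of a temporary work file; all field names and figures are illustrative.

```python
def update_master(master_file, transaction_file):
    """Apply transactions to a sorted master file and return the new master."""
    # Work file: a temporary sorted copy of the transactions.
    work_file = sorted(transaction_file, key=lambda t: t["item_no"])
    new_master = []
    i = 0
    for record in master_file:  # assumed sorted by item_no
        updated = dict(record)
        # Skip transactions whose key has no matching master record.
        while i < len(work_file) and work_file[i]["item_no"] < record["item_no"]:
            i += 1
        # Apply every transaction that matches this master record.
        while i < len(work_file) and work_file[i]["item_no"] == record["item_no"]:
            updated["qty"] += work_file[i]["change"]
            i += 1
        new_master.append(updated)
    return new_master

master_file = [
    {"item_no": 1, "qty": 10},
    {"item_no": 2, "qty": 5},
    {"item_no": 3, "qty": 0},
]
transaction_file = [
    {"item_no": 3, "change": 20},
    {"item_no": 1, "change": -3},
    {"item_no": 1, "change": -2},
]

new_master = update_master(master_file, transaction_file)
```

The old master file is left unchanged, so it can serve as the previous generation if the update has to be repeated.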
IDIC:C1001 Transition files may be created in some applications where data records from two or more files are read and perhaps merged. The new transition file is created to reflect the merged data. E.g. Electricity bills may include customer records from a Master file, and records showing that customer’s consumption in a particular month. The output is shown as that month’s customer bill. A backup copy of important data is an additional copy that is an ‘insurance’ against any unforeseen corruption, damage or even loss of data. Auditors need to verify if transactions were correctly processed when computer software is used for transaction processing.
IDIC:C1001 Data is stored in storage media under file names. The method of storing or saving data is called data or file organisation. Serial and sequential file organisation can be used on both magnetic tapes and disks. However, indexed-sequential and random organisation can be used only on Direct Access Storage Devices like hard discs, floppy discs, and compact discs.
IDIC:C1001 In the serial method, data is saved in the order in which transactions take place. No thought is given to sorting the data. This method is used when customer orders are accepted in an organisation. Each order is a transaction, and a transaction number may be created to refer to each customer order. Transaction files created in this manner are unsorted. Records saved serially are also accessed serially, i.e. one after another.
IDIC:C1001 Serial files are not suitable for fast search of individual records. Since records are not sorted, searching for a record means beginning the search from the first record and reading one after another till the desired record is found.
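The cost of searching a serial file can be sketched as below. The order records and the `order_no` field are hypothetical; the function counts how many records must be read before the wanted one is found.

```python
def serial_search(records, key_value):
    """Read records one after another from the start until the key matches."""
    reads = 0
    for record in records:
        reads += 1
        if record["order_no"] == key_value:
            return record, reads
    return None, reads  # whole file was read without a match

# Orders in the order they arrived (unsorted by key).
orders = [
    {"order_no": 507, "customer": "C12"},
    {"order_no": 112, "customer": "C03"},
    {"order_no": 903, "customer": "C44"},
    {"order_no": 250, "customer": "C03"},
]

found, reads_needed = serial_search(orders, 250)
missing, reads_for_missing = serial_search(orders, 999)
```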
IDIC:C1001 Sequential files are sorted (ordered) according to a record key e.g. student number for Student file. An unsorted serial Transaction file of Customer Orders can be sorted to produce or output a sorted Sequential file called Customer Order file. The Customer Order file can be sorted based on the Customer Order Number as the primary record key, and Customer Number as the secondary record key.
IDIC:C1001 Since a sequential file is similar to a serial file, its advantages are much the same. However, because the records are held in key order, searching for specific records based on the record key is possible.
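One way to see the benefit of key order is that a search can stop as soon as it passes the place where the key would be. The sketch below assumes a hypothetical student file sorted by `student_no`.

```python
def sequential_search(sorted_records, key_value):
    """Search a file sorted by student_no, stopping once the key is passed."""
    reads = 0
    for record in sorted_records:
        reads += 1
        if record["student_no"] == key_value:
            return record, reads
        if record["student_no"] > key_value:
            break  # the key cannot appear later in a sorted file
    return None, reads

student_file = [
    {"student_no": 101},
    {"student_no": 104},
    {"student_no": 107},
    {"student_no": 110},
]

hit, hit_reads = sequential_search(student_file, 104)
miss, miss_reads = sequential_search(student_file, 105)
```

In a serial file the unsuccessful search would have had to read all four records.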
IDIC:C1001 Data records are kept in a Sequential file sorted according to the primary record key (records can have several keys e.g. Employee No., Identity Card No., Income Tax No., Social Security No., but one key has to be chosen to act as the primary key). An Index file is generated by the File Management System based on the Record Key. This Index file stores the locations of all the data records. The Indexed Sequential method thus uses 2 separate files, one to store data records, and the second to store record addresses. An indexed sequential file has three parts. The Prime area is the data storage area where records are stored when they are first created. The Overflow area is used to store records added later, if and when the Prime area is full. The Index area stores the Index file containing all the addresses of the records.
IDIC:C1001 Overflow areas are reserved on each track and cylinder in the hard disk to accept records that may be added later to the same filename. Pointers are indexes that link the System to the actual location of data i.e. in terms of disk surface number, track number and cylinder number. File reorganisation is the adjustment made to the file when records are deleted or new records are added. At the same time the Index file has to be re-adjusted so that new addresses can be kept up-to-date. Records stored in overflow areas could be ‘recovered’ and stored together where possible (similar to the principle of defragmenting disks).
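The interplay of prime area, overflow area and index can be sketched with a toy in-memory model. This is not how any particular file management system implements it: the class name, the track size of 3, and the rule that every later insertion goes to overflow are all assumptions made for illustration.

```python
import bisect

class IndexedSequentialFile:
    """Toy model: prime area split into tracks, one overflow list per track."""

    def __init__(self, records, track_size=3):
        self.track_size = track_size
        self._build(sorted(records, key=lambda r: r[0]))

    def _build(self, sorted_records):
        size = self.track_size
        # Prime area: records in key sequence, grouped into tracks.
        self.prime = [sorted_records[i:i + size]
                      for i in range(0, len(sorted_records), size)]
        # Index area: the highest key held on each track.
        self.index = [track[-1][0] for track in self.prime]
        # Overflow area: empty after a (re)build.
        self.overflow = [[] for _ in self.prime]

    def _track_for(self, key):
        # First track whose highest key is >= key; larger keys use the last track.
        return min(bisect.bisect_left(self.index, key), len(self.prime) - 1)

    def insert(self, key, data):
        # Assumption: the prime area is full, so later additions go to overflow.
        self.overflow[self._track_for(key)].append((key, data))

    def find(self, key):
        track = self._track_for(key)
        for k, data in self.prime[track] + self.overflow[track]:
            if k == key:
                return data
        return None

    def reorganise(self):
        # Recover overflow records into key sequence and rebuild the index.
        everything = [r for t, o in zip(self.prime, self.overflow) for r in t + o]
        self._build(sorted(everything, key=lambda r: r[0]))

isf = IndexedSequentialFile([(k, f"rec{k}") for k in (10, 20, 30, 40, 50, 60)])
isf.insert(25, "rec25")          # lands in the overflow of the first track
found_before = isf.find(25)
isf.reorganise()                 # overflow record is merged back, index rebuilt
```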
IDIC:C1001 Sequential processing is where all records are processed one after another. Selective sequential is where a group of records within a large file is chosen, and this group is processed one by one. E.g. a specific class of Economics major students in a university is selected for sequential processing. Direct access could include separating a large number of records into groups, and then sub-dividing each group into smaller groups. The sub-division continues till the record is found.
IDIC:C1001 Random organisation does not save data in any particular order. A calculation is performed on the Record Key to generate a location or address. The record is then directly addressable.
IDIC:C1001 Indexed sequential is a very popular method of file organisation and access as it offers three different processing methods. File re-organisation techniques help to save valuable disk space. Access speed is acceptable, and transaction records need not be sorted. Storage space today is so much cheaper, thus the cost of storage media is not an important factor, especially considering the numerous advantages.
IDIC:C1001 Records are accessed directly in random organisation. However, access speed may be slower if the file has many records.
IDIC:C1001 Record keys are used to calculate the addresses (key transformation techniques). Division remainder method could generate the same address for two record keys (creating synonyms).
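A small sketch of the division remainder method and how synonyms arise. The divisor 7 and the sample keys are arbitrary choices for illustration; a real file would use a much larger prime close to the number of available addresses.

```python
from collections import defaultdict

def division_remainder(key, divisor=7):
    """Address = remainder after dividing the key by a prime number."""
    return key % divisor

def find_synonyms(keys, divisor=7):
    """Group keys by generated address and keep addresses shared by 2+ keys."""
    by_address = defaultdict(list)
    for key in keys:
        by_address[division_remainder(key, divisor)].append(key)
    return {addr: ks for addr, ks in by_address.items() if len(ks) > 1}

keys = [1001, 1003, 1008]
addresses = [division_remainder(k) for k in keys]
synonyms = find_synonyms(keys)
```

Keys 1001 and 1008 both leave remainder 0, so they are synonyms competing for the same address.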
IDIC:C1001 Synonyms can be generated. Also some locations may not be assigned addresses thereby creating gaps.
IDIC:C1001 There is no facility to re-organise files so records are always scattered - this is unsuitable for large files. Access time is slow if the number of data records in the file is large. Synonyms can cause data loss as two records cannot occupy the same space. Gaps or empty spaces that are not used to store data show that this organisation method is not quite efficient.
IDIC:C1001 To verify data in the ‘olden’ days, double punching method required two different data entry clerks to enter data at different times using the same source documents. The two file copies were compared for errors. If errors existed, they were manually corrected. Sight verification is where data entry clerks visually check the data they enter against source documents for possible errors.
IDIC:C1001 Data validation is done by programs to test whether data contains errors. A presence check identifies whether a particular record exists in a file, e.g. before the record can be deleted, or before the same record is added again. A size check tests the length of a record or data item, i.e. its number of characters; in some cases it limits how long a data item may be, e.g. the number of characters in an Identity Card number. A range check limits the value of a data item, e.g. exam marks should be within 1 to 100. A character check ensures that numbers are not accepted in place of letters. A format check covers how data is presented, e.g. whether a date is DD/MM/YY or MM/DD/YY, and the accepted number of decimal places when displaying real numbers. A reasonableness check ensures quantities are not abnormally high or low, e.g. 0.0 gm of rice is not reasonable. A check digit is a single digit that is attached to a numeric data item in order to make it self-checking. It is constructed in such a way that it has a unique relationship with the rest of the digits.
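The checks above might be sketched as small functions like these. The limits, function names and sample values are hypothetical, and the check digit shown uses the modulus 11 scheme, which is one common method and not necessarily the one these notes have in mind.

```python
from datetime import datetime

def presence_check(key, file_keys):
    """True if a record with this key already exists in the file."""
    return key in file_keys

def size_check(value, max_length):
    return len(value) <= max_length

def range_check(mark, low=1, high=100):
    return low <= mark <= high

def character_check(name):
    """Reject digits where letters are expected."""
    return not any(ch.isdigit() for ch in name)

def format_check(text, date_format="%d/%m/%y"):
    """True if text parses as a real date in DD/MM/YY form."""
    try:
        datetime.strptime(text, date_format)
        return True
    except ValueError:
        return False

def reasonableness_check(quantity_gm, upper_gm=50_000):
    # Hypothetical limits: more than 0 gm and at most 50 kg.
    return 0 < quantity_gm <= upper_gm

def mod11_check_digit(digits):
    """Modulus 11 check digit: weights 2, 3, 4, ... from the rightmost digit."""
    weights = range(2, len(digits) + 2)
    total = sum(int(d) * w for d, w in zip(reversed(digits), weights))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

def check_digit_valid(number):
    """The last character is the check digit for the preceding digits."""
    return mod11_check_digit(number[:-1]) == number[-1]
```

For example, `mod11_check_digit("12345")` gives `"5"`, so `123455` is accepted, while the transposed entry `123545` is rejected.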
IDIC:C1001 It is most important to recover correct data in case of system failure or other unforeseen disasters. For programs that take a long time to run, adequate program checkpoint and re-start facilities must be provided. At specific points in a program run, a dump is made of memory contents to disc or tape to identify the current status of the run. File dumps are used for online systems. A file dump means copying the entire contents of a file onto a storage medium like magnetic tape and housing it separately from the working file. File dumps can also be taken at regular intervals during a program run. Generations of backup files: the three-generation backup concept (Grandfather-Father-Son) can also be used to enable recovery from data loss or corruption. This is done by re-constructing the file from previous generations of data files.
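The Grandfather-Father-Son idea can be sketched as follows, assuming each update run produces a new master file from the previous one plus that run's transaction file. The master file is modelled as a simple dictionary and all figures are invented for illustration.

```python
from collections import deque

def apply_transactions(master, transactions):
    """Produce the next generation of a master file (key -> balance)."""
    new_master = dict(master)
    for key, change in transactions:
        new_master[key] = new_master.get(key, 0) + change
    return new_master

# Keep only the three most recent generations: grandfather, father, son.
generations = deque(maxlen=3)
generations.append({"A": 100})

# One transaction file per update run; these must be retained as well.
runs = [[("A", -10)], [("A", 5), ("B", 20)], [("B", -1)]]
for transactions in runs:
    generations.append(apply_transactions(generations[-1], transactions))

grandfather, father, son = generations

# If the son is lost or corrupted, rebuild it from the father and the
# transaction file of the last run.
recovered_son = apply_transactions(father, runs[-1])
```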
CCT101: Chapter 9 Files
OBJECTIVES
• Describe the types of data processing files
• Describe the types of file organization
• Data validation
FILE, RECORD & FIELD
- Field
  • Data item
  • e.g. student name
- Record
  • A group of related data items or fields
  • e.g. student record
- File
  • A collection of related records
  • e.g. Student file
ENTITY SET, ENTITY & ATTRIBUTES
- Attributes
  • Describe the properties of the entity (i.e. fields)
- Entity
  • The thing about which we store facts (i.e. record)
- Entity set
  • A collection of logically related entities (i.e. file)
Logical File & Physical Files
1. Physical file :
  – Refers to how the data is stored i.e. the actual arrangement of data in storage device
2. Logical file :
  – What a file contains & how the data should be processed
Key Field
• It is a field within the record which is used for locating & processing the record e.g. student number
FILE LENGTH
• Fixed-length records
  – Each record has the same length
  – Advantage: Easier to design
  – Disadvantage: Wasted storage space
• Variable-length records
  – Each record does not have the same length
  – Advantage: Saves storage space
  – Disadvantage: More difficult to design
INFORMATION RETRIEVAL
1. Writing :
  – The act of transferring a record from main memory to secondary storage.
2. Insertion :
  – Adding a new record to an existing file.
3. Deleting :
  – Removing a record from a file.
INFORMATION RETRIEVAL
4. Updating :
  – Making changes to the contents of a record to show the new status of information.
5. Sorting :
  – Rearranging the records in a file for the purpose of producing ordered reports.
6. Merging :
  – Combination of 2 or more files to produce a single output file.
INFORMATION RETRIEVAL
7. Matching :
  – Where 2 or more input files are compared record against record to ensure there is a complete set of records for each key. Mismatched records are highlighted for action.
8. Searching :
  – Involves looking for a record with a certain key value
9. Appending :
  – Adding a record at the last available space of an existing file
ACTIVITY RATIO (HIT RATE)
• The number of records that are changed as a result of updating when compared to the total number of records in the file.
  – HIT RATE = number of records affected / total records on file
• Volatility :
  – Measuring the number of additions and deletions in a file.
• File growth
  – Number of record additions minus number of record deletions
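As a worked example of the two measures that have formulas here, the sketch below computes hit rate and file growth. The payroll and supermarket figures are invented purely to show a high and a low hit rate.

```python
def hit_rate(records_affected, total_records):
    """Hit rate = number of records affected / total records on file."""
    return records_affected / total_records

def file_growth(additions, deletions):
    """File growth = record additions minus record deletions."""
    return additions - deletions

payroll = hit_rate(500, 500)            # every employee record is updated
supermarket = hit_rate(1_200, 20_000)   # only a small share of items move
growth = file_growth(150, 40)
```

Here the payroll file has a hit rate of 1.0 (100%), the stock file 0.06 (6%), and the file grows by 110 records.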
TYPES OF DP FILES
1. Master file
  – Permanent or semi-permanent data
  – Used for reference and updating
  – Shows the current status of data
  – Never empty except at its time of creation
  – E.g. stock master file
TYPES OF DP FILES
2. Transaction file
  – Contains source or transaction data
  – Used for updating master file
  – E.g. sales transaction file
3. Work file
  – Temporary file
  – Used for storing intermediate data for further processing
  – E.g. file used by sort utility
TYPES OF DP FILES
4. Transition file
  – Temporary file for specific use
  – E.g. meter readings, customer’s detail for printout
5. Security & backup file
  – Extra copy of file against damage/loss
6. Audit file
  – Enables auditor to check correct functioning of computer based procedures
  – Keeps a copy of all transactions
FILE ORGANISATIONS
• 4 Types
  1. Serial
  2. Sequential
  3. Indexed-sequential
  4. Random
SERIAL ORGANISATION
• Simplest, not in any order
• Each record is placed in the next available space
• Suitable for
  – Unsorted transaction files
  – Print files
  – Dump files
  – Temporary data files
• Accessed in the order in which records were placed
SERIAL ORGANISATION
• Advantages :
  – File design is simple
  – Efficient for high activity file
  – Effective use of low cost file media suitable for batch processing
• Disadvantage :
  – Files have to be processed from beginning to end
SEQUENTIAL ORGANISATION
• Predefined order
• A designated field within the record is selected as the basis for ordering records
• This key is also known as the record key or simply the key
• Suitable for master file
• Not suitable for fast-response online enquiry systems
• E.g. Payroll transaction file
SEQUENTIAL ORGANISATION
• Advantages :
  – File design is simple
  – Efficient for high activity file
  – Effective use of low cost file media suitable for batched transactions
• Disadvantage :
  – Entire file must be processed even if activity is low
  – Transactions require sorting
INDEXED SEQUENTIAL ORGANISATION
• Physical sequence according to primary key
• Builds an index separate from the data or records
• Accessed randomly and sequentially
• 3 main parts
  – Prime (Home) area
  – Overflow area
  – Index area
INDEXED SEQUENTIAL ORGANISATION
• When insufficient space in home area (prime area), overflow area will be used
• Overflow areas created at cylinder & track level
• Access controlled by means of pointers
• File reorganization to be done
• Overflow records recovered & indexes rebuilt
INDEXED-SEQUENTIAL FILES
- Support three types of processing :
  1. Sequential processing
  2. Selective sequential processing/ Random access
  3. Block is searched record by record until record is found/ Direct access/ Dynamic access
RANDOM ORGANISATION
• Predictable relationship between record key & record’s location on disc
• Not in sequence physically, scattered at random
• Direct addressing
• Key as physical address of record
• Device dependent
INDEXED-SEQUENTIAL ORGANISATION
• Advantages :
  – Transactions may be sorted or unsorted
  – Only the affected master records are processed during updating
  – Response time is reasonably fast
  – Facilitates file enquiry
  – Can be processed sequentially and randomly
• Disadvantage :
  – Each master file access requires index file access
  – Requires direct access storage devices (still costly)
  – Storage space required for indexes
RANDOM ORGANIZATION
• Predictable relationship between record key and record location on disc
• Records may be scattered at random
• Direct addressing
RANDOM ORGANIZATION
• Key transformation techniques used
  1. Division remainder method
    − Divide key value by an appropriate number
    − Remainder of division as address of record
    − Number used to divide is prime number
  2. Mid Square Hashing
    − The key is squared, specified digits extracted from middle of the result to yield address of the record
RANDOM ORGANIZATION
  3. Hashing By Folding
    – Key is divided into 2 or more parts which are then added together
    – Truncation to bring result into required range of numbers
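The mid-square and folding techniques can be sketched as below. Which digits count as the "middle", how the key is split into parts, and truncating by keeping the low-order digits are all assumptions for this example; other conventions are possible.

```python
def mid_square(key, digits=3):
    """Square the key and extract `digits` digits from the middle of the result."""
    squared = str(key * key)
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])

def fold(key, part_size=3, address_digits=3):
    """Split the key into parts, add them, then truncate to the address range."""
    text = str(key)
    parts = [int(text[i:i + part_size]) for i in range(0, len(text), part_size)]
    total = sum(parts)
    # Truncation (assumed here): keep only the low-order digits.
    return total % (10 ** address_digits)

mid_square_address = mid_square(123)       # 123 * 123 = 15129, middle digits 512
folding_address = fold(123456789)          # 123 + 456 + 789 = 1368, truncated to 368
```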
RANDOM ORGANISATION
• Advantages :
  – As indexes are not required, space and searching time are saved
  – Insertion and deletion of records can take place
• Disadvantage :
  – Variable-length records are difficult to handle
  – Gaps in keys can cause wasted space
  – Synonyms can occur
  – Allocation of efficient overflow areas is difficult
DATA VERIFICATION
• Double punching method
• Sight verification
DATA VALIDATION
• Presence
• Size
• Range
• Character check
• Format
• Reasonableness
• Check digits
ERROR RECOVERY
• Adequate program checkpoint/ restart facilities
• File dumps
• Generations of backup files