This document discusses file structures and record access methods. It introduces the central problem of locating stored data within a file. Record keys, both primary and secondary, are used to search for records. There are two major search methods: sequential search, which starts at the beginning and examines records in order, and binary search, which starts in the middle of a sorted file and discards half of the remaining records at each step. Direct access allows going straight to the desired record. File headers contain metadata about the file's structure and organization so that different software can access the contents. The file organization and access method must work together: direct access by relative record number requires fixed-length records, while variable-length records require an index. Files can contain different types of data objects by treating each as a variable-length record.
2. 2
What’s Up for This Chapter?
This Chapter’s Material
– Accessing records in files
– Record structures for access
– File access methods vs. file organizations
– Some real-world examples of file structures
– File portability issues
3. 3
The Central Problem
Locating Stored Data
– Once the data has been stored into a file,
how do you find it to retrieve it?
– What does “find the data” even mean?
How do you decide what you want to find?
How do you look for it?
What if it’s not there?
What if something very much like it is there?
What if there are lots of “it” there?
– And, of course, there are efficiency considerations
How fast is your search algorithm?
What would you have to do to the file to use a faster one?
Which will you do more often, add records or find them?
– Bringing you back to the design of the file itself
4. 4
Record Keys
What Is a Key?
– Data stored in a record by which you look for the record
– Can be one field or a set of fields
Examples – { name } or { last_name + first_name }
Two Types of Keys
– Primary key
Key value, unique in entire file, by which an individual record
can be located or determined to be absent
– Secondary key
Key value by which one or more records can be located
5. 5
Primary Keys
Required Characteristics
– Unique across the entire file
Can never have 2 records with same primary key
Error to try to add record with duplicate primary key
– In “canonical” form
Format precisely known, so search candidates can be brought
into that same format before the search
Example – words (names, etc.) in all upper-case
– Not often used any more: rather, program the system to do the
search independently of case
– Unchanging
Value for given record should never change
– Given primary key value should always identify same record
– Example – Texas Driver’s License number stays with you, even
if you move away from Texas, then come back
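The canonical-form requirement above can be sketched in C++. This is only an illustration under stated assumptions: the helper name `canonical_key` and the choice of "trimmed, all upper-case" as the canonical form are hypothetical, picked to match the upper-case example on this slide.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: bring a key into one canonical form
// (surrounding whitespace removed, letters converted to upper-case)
// so that stored keys and search candidates compare equal.
std::string canonical_key(const std::string& raw) {
    auto is_space = [](unsigned char c) { return std::isspace(c) != 0; };
    // First and one-past-last non-whitespace positions.
    auto first = std::find_if_not(raw.begin(), raw.end(), is_space);
    auto last = std::find_if_not(raw.rbegin(), raw.rend(), is_space).base();
    if (first >= last) return std::string();   // empty or all whitespace
    std::string key(first, last);
    std::transform(key.begin(), key.end(), key.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return key;
}
```

Both the key stored in the file and the key typed by the user would pass through the same function before they are compared, so "ames" and " Ames " locate the same record.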
6. 6
Primary Keys, cont’d.
Implication on File Design
– Don’t use possibly non-unique field(s) as primary key
Bad – name, birth date, etc.
– Don’t use anything that can possibly change
Bad – name, address, etc.
– What can we use?
Best – artificial identifier
– Student number
– Driver’s license number
– Other artificially created unique value
7. 7
Secondary Keys
Not Such Stringent Rules
– Duplicates allowed
Still have to define what “find” means if duplicates allowed
– Usually real data, as opposed to primary keys
The kinds of thing you’d want to search for in real life
– Not used to impose any order on the file
Can return results based on secondary key(s)
– Selected by secondary key value(s)
– Sorted on secondary key value(s)
8. 8
Searching
From 2325 – Two Major Methods
– Sequential
Start at beginning, look until you find what you’re after
Choices:
– Non-unique keys allowed?
– Return first match or all of them?
– Binary
Start in middle, remove half the list each time through
Requires:
– Primary key values unique across file
– File sorted on primary key
– Records directly accessible
There are others, but …
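A minimal C++ sketch of the binary method listed above. It assumes the records are already in memory in a `std::vector` sorted on a unique integer key; the `Record` layout and the function name `binary_search_key` are hypothetical. In a file, each probe of `recs[mid]` would instead be a direct access to a fixed-length record.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory record.
struct Record {
    int key;            // primary key: unique, and the list is sorted on it
    std::string data;
};

// Returns the position of the record whose key equals `key`, or -1 if absent.
long binary_search_key(const std::vector<Record>& recs, int key) {
    std::size_t lo = 0;
    std::size_t hi = recs.size();              // search the half-open range [lo, hi)
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;  // start in the middle
        if (recs[mid].key == key) return static_cast<long>(mid);
        if (recs[mid].key < key) lo = mid + 1; // discard the lower half
        else hi = mid;                         // discard the upper half
    }
    return -1;                                 // key is not in the list
}
```

Each pass halves the remaining range, so the number of probes grows with log2(N) rather than with N.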
9. 9
Sequential Searching
Performance
– It might take 1 try; it might take N tries
Average number of tries = N / 2 if:
– Searching on a unique key
– Returning first match
Average number of tries = N if:
– Returning all matches
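The two cases above can be made concrete with a small C++ sketch that counts how many records each search examines. The `Person` record and the function names are hypothetical. With a unique key and the target equally likely to be at any of the N positions, the exact average is (N + 1) / 2, which the slide rounds to N / 2.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record with a unique key and a non-unique secondary key.
struct Person {
    int id;                  // primary key
    std::string last_name;   // secondary key, duplicates allowed
};

// First-match search on the unique key: stops as soon as the key is found.
// `tries` receives the number of records examined.
long find_first(const std::vector<Person>& recs, int id, std::size_t& tries) {
    tries = 0;
    for (std::size_t i = 0; i < recs.size(); ++i) {
        ++tries;
        if (recs[i].id == id) return static_cast<long>(i);
    }
    return -1;   // not found: all N records were examined
}

// All-matches search on the secondary key: must always examine all N records.
std::vector<std::size_t> find_all(const std::vector<Person>& recs,
                                  const std::string& last_name,
                                  std::size_t& tries) {
    std::vector<std::size_t> matches;
    tries = 0;
    for (std::size_t i = 0; i < recs.size(); ++i) {
        ++tries;
        if (recs[i].last_name == last_name) matches.push_back(i);
    }
    return matches;
}
```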
10. 10
Sequential Searching
Performance
– Big factor in disk access
Worst case:
– File fragmented around the disk
– Each program read takes one physical read
Best case:
– File fairly contiguous on disk
– I/O System buffers things so very few (1?) actual reads are done
– In multi-user OSs, this seldom happens
However:
– If read/write head didn’t move between accesses
• Rotational latency & transfer times small compared to seek time
• Multiple physical reads wouldn’t have as much of an impact
– However, most OSs are multi-tasking now
• Can’t rely on read/write head’s being where you left it
• Must assume N physical reads take N full disk accesses
11. 11
Improving Sequential Searches
Reduce Number of Physical Reads
– We can’t do anything about:
File fragmentation
– If file’s clusters scattered around disk, multiple seeks are necessary
Multi-tasking environment
– Have to assume each program read causes a physical read
– (May not be true, if I/O System has good internal caching)
– So what do we do?
Increase the number of records pulled in by each physical read
– Saw this with magnetic tape – group the records into blocks
– Similar to way we collected fields into records, but …
• Grouping fields into records is dependent on data characteristics
• Grouping records into blocks is dependent on I/O system & disk
– Block size should be:
• Multiple of disk sector size
• Compatible with I/O System’s ability to read
12. 12
When to Use Sequential Searching
Sequential Searching is Good for:
– Text files where you’re looking for a pattern
Unix ‘grep’ (global regular expression print) command
– Small files
Like you use in labs here
– Files that are searched very infrequently
Not worth the effort to sort to make binary search work
– When you expect a large number of matches
Example – searching on a secondary key
It’s Not so Good for:
– Binary files
– Sorted files
– Big files
13. 13
Unix Tools for Sequential Access
cat
– Seen this one – concatenate files
– cat F1 F2 >F3
wc
– Word count (also character & line count)
– wc article.txt
grep
– Search file for occurrences of regular expression pattern
– grep "Ames" personlist.txt
od
– Octal dump – or hex, or …
– od -ch list.dat
14. 14
Direct Access
What is it?
– Go straight to the record you want in the file
No searching
No unnecessary disk accesses
– What’s its “order”?
Time to find a record is independent of number of records
However, it can be harder to do
15. 15
Direct Access
How to Do It?
– At I/O System level, seek to record
C++ seek operations go to relative byte address (RBA) in file
Variants:
– Seek with “get” pointer vs. seek with “put” pointer
– Relative to start, current position, or end of file (default: start)
– But that still doesn’t answer the question
How do we know what RBA a particular record starts at?
We’ve talked about index files – but that’s for later
We could move the problem up one level
– Use relative record number (RRN)
But that’s no real help
– Still need some kind of index – way to find record’s RRN
– Also requires use of fixed-length records:
RBA = RRN * Record_Size
(assuming, of course, that the first RRN is 0)
16. 16
Building a File of Records
Like Building a Record of Fields
– Same problem, up one level
Fixed-length or specified-length records?
How to directly access records?
– But wait – there’s more:
Want to require software to know as few details about file
as possible
To do that, those details need to be stored with (in) the file
– File header records
Store file-specific information at start of file
Header record format
– Constant across all file types within one system
– Why?
17. 17
File Header Records
Things a Header Record Might Contain
– File structure
Type of record structure
Number of data records
Length of records (if fixed-length)
Record delimiter (if delimited)
– Record structure (if records have consistent structure)
Number of fields
Length of each field or delimiter between each field
Format of each field
Key information – if needed
– Primary key field
– Secondary key field(s), if any
– Date/time of most recent access
– Date/time of most recent update
18. 18
File Header Records, continued
Header Record Format
– Binary or character?
Depends – is it important for people to read it?
– Here’s a place where HTML-style format might work
Lets files of different formats have different headers
(in some ways)
Only invokes that parse overhead once per file
19. 19
What’s the Difference?
File Organization
– Format of the file itself
Fixed-length, specified-length, or delimited records
ASCII (text) or binary data encoding
File Access Method
– Way(s) software can get at contents of file
Sequential vs. direct
Indexed sequential
20. 20
Designing a File
Access Affects Organization
– If sequential access is all we need
Pretty much any organization is OK
Subject, of course, to application needs
– If we need direct access
Need fixed-length records
Can also use indexed files, but that’s for later on
But Organization Also Affects Access
– What if data to be stored in a record is wildly variable?
Fixed-length records would be extremely wasteful
But if we use specified-length records, how to do direct access?
– Just about have to use indexing then
21. 21
Metadata
Data About Data
– Usually in the form of a file header
– Example in text
Astronomy image storage format
HTML format (name = value)
But look on page 177: coding style makes a BIG difference
– Parsing this kind of data
Read field name; read field value
Convert ASCII value to type required for storage & use
Store converted value into right variable
– Why use this type of header?
22. 22
More Metadata
PC Graphics Storage Formats
– Data
Color values for each pixel in image
Data compression often used (GIF, JPG)
Different color “depth” possibilities
– Metadata
Height, width
Number of bits per pixel (color depth)
If not true color (24 bits / pixel)
– Color look-up table
• Normally 256 entries
• Indexed by values stored for each pixel (normally 1 byte)
• Contains R/G/B values for color combination
– Formatted to be loaded directly into PC graphics RAM
23. 23
Mixing Data Objects in a File
Objective
– Store different types of data in the same file
– Textbook example – mix of astronomy data
“File” header (HTML-style)
“File” of notes – lines of ASCII text
“File” of image data – in whatever format
– So our data file becomes a file of files
Each individual “file” (header, notes, or image) looks like
a record in this new “mega-file”
These “mega-records” are of varying length
How do we store the “records” in the “mega-records”?
– Could use another level of specified-length record software
– Or, …
25. 25
More on Our Mega-File
Access
– Can we just read it sequentially?
Why or why not?
What if we wanted to skip a notes sub-file?
What if some image didn’t even have a notes sub-file?
– Can we access it directly?
What would the header have to include to allow that?
– An index of the “records” in the file
– We call the entries in that index “tags”
Each tag in the tag list has:
– Type of sub-file referred to
• Special-case type: end of file
– RBA of sub-file in mega-file
– Length of sub-file (not necessary, but helpful)
– Key information, if any, for sub-file
26. 26
More on Our Mega-File
Access, continued
– So how do we access the mega-file now?
Read and process the header
– Get whole-file information
– Build in-memory tag table for sub-files
Sequential access
– Same as before
– May be able to program in some speed-ups from tag table
Direct access
– Locate sub-file in tag table
– Go right to it
27. 27
Extensibility
Look at Our “Mega-File” Format Again
– Header tells us things about the sub-files:
What kinds of files they are
Where to find them
– Files themselves
To the mega-file processor, just random bytes
To the sub-file processor, meaningful information
What if we need a new type of sub-file?
– Define a new type of header entry
– Extend header processor to understand that entry
– Write (or borrow or buy) code to handle new sub-file
Cardinal Rule:
– Everything changes – file types, data types, ...
28. 28
Factors Affecting Portability - 1
Operating System Differences
– Example – text lines
End with line-feed character
End with carriage-return and line-feed
Prefixed by a count of characters in the line
Natural Language Differences
– Example – character coding
Single-byte coding – ASCII, EBCDIC
Multi-byte coding – Unicode (UTF-8, UTF-16, UTF-32)
Programming Language Differences
– Pascal can’t directly process varying-length records
– Different C++ compilers use different byte lengths
for the standard data types
29. 29
Factors Affecting Portability - 2
Computer Architecture Differences
– Byte order in 16-bit and 32-bit integer values
Big-endian – first (lowest-address) byte is most significant
Little-endian – first (lowest-address) byte is least significant
– Storage of data in memory
Some architectures require values that are N bytes long
to start at a byte whose address is divisible by N
Example – two bytes stored in the order 0x15 0x32
Big-endian interpretation: 0x1532
Little-endian interpretation: 0x3215
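The 0x15 0x32 example can be checked with a short C++ sketch that assembles a 16-bit value from two stored bytes under each byte order. The function names are hypothetical; the arithmetic is independent of the byte order of the machine running it.

```cpp
#include <cstdint>

// bytes[0] is the byte at the lower address (the first byte in the file).

// Big-endian: the first byte is the most significant.
std::uint16_t from_big_endian(const unsigned char bytes[2]) {
    return static_cast<std::uint16_t>((bytes[0] << 8) | bytes[1]);
}

// Little-endian: the first byte is the least significant.
std::uint16_t from_little_endian(const unsigned char bytes[2]) {
    return static_cast<std::uint16_t>((bytes[1] << 8) | bytes[0]);
}
```

For the bytes 0x15 0x32, `from_big_endian` yields 0x1532 and `from_little_endian` yields 0x3215.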
30. 30
How to Port Files
Define Your Format C*A*R*E*F*U*L*L*Y
– Once a format is defined, never change it
If you need a new format, add it so as not to invalidate
the existing formats
If you need to change a format, add a new one instead,
and let programs that need the new version use it
– Decide on a standard format for data elements
Text lines
– ASCII, EBCDIC, or Unicode?
– Which character(s) to end lines?
Binary
– Tightly packed or multiple-of-N addressing?
– Which “endian”?
– You can always write code to convert to & from the
standard format on a new language, computer, etc.
31. 31
The Conversion Problem
Few Environments – can do directly
IBM → VAX
VAX → IBM
Many Env’ts. – need intermediate form
IBM, VAX, IA-32, IA-64, ... → XML (or some other standard format) → IBM, VAX, IA-32, IA-64, ...