Unit 3 chapter-1managing-files-of-records

1
Chapter 5 –
Managing Files of Records

2
What’s Up for This Chapter?
 This Chapter’s Material
– Accessing records in files
– Record structures for access
– File access methods vs. file organizations
– Some real-world examples of file structures
– File portability issues

3
The Central Problem
 Locating Stored Data
– Once the data has been stored into a file,
how do you find it to retrieve it?
– What does “find the data” even mean?
 How do you decide what you want to find?
 How do you look for it?
 What if it’s not there?
 What if something very much like it is there?
 What if there are lots of “it” there?
– And, of course, there are efficiency considerations
 How fast is your search algorithm?
 What would you have to do to the file to use a faster one?
 Which will you do more often, add records or find them?
– Bringing you back to the design of the file itself

4
Record Keys
 What Is a Key?
– Data stored in a record by which you look for the
record
– Can be one field or a set of fields
 Examples – { name } or { last_name + first_name }
 Two Types of Keys
– Primary key
 Key value, unique in the entire file, by which an individual
record can be located or determined to be absent
– Secondary key
 Key value by which one or more records can be located

5
Primary Keys
Required Characteristics
– Unique across the entire file
 Can never have 2 records with same primary key
 Error to try to add record with duplicate primary key
– In “canonical” form
 Format precisely known, so search candidates can be brought
into that same format before the search
 Example – words (names, etc.) in all upper-case
– Not often used any more: rather, program the system to do the
search independently of case
– Unchanging
 Value for given record should never change
– Given primary key value should always identify same record
– Example – Texas Driver’s License number stays with you, even
if you move away from Texas, then come back
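A minimal C++ sketch of the canonical-form idea above. It assumes the canonical form is "surrounding blanks removed, letters upper-cased"; the function name canonicalKey is made up for illustration and is not from the text.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Bring a key into one canonical form (an assumed convention):
// strip leading/trailing blanks and tabs, then upper-case every letter.
std::string canonicalKey(const std::string& raw) {
    const char* blanks = " \t";
    const std::size_t first = raw.find_first_not_of(blanks);
    if (first == std::string::npos) return "";          // nothing but blanks
    const std::size_t last = raw.find_last_not_of(blanks);
    std::string key = raw.substr(first, last - first + 1);
    std::transform(key.begin(), key.end(), key.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return key;
}
```

Both the stored key and the search candidate would be passed through the same function before they are compared, so "Ames", " ames " and "AMES" all match.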

6
Primary Keys, cont’d.
Implications for File Design
– Don’t use possibly non-unique field(s) as primary
key
 Bad – name, birth date, etc.
– Don’t use anything that can possibly change
 Bad – name, address, etc.
– What can we use?
 Best – artificial identifier
– Student number
– Driver’s license number
– Other artificially created unique value

7
Secondary Keys
 Not Such Stringent Rules
– Duplicates allowed
 Still have to define what “find” means if duplicates allowed
– Usually real data, as opposed to primary keys
 The kinds of thing you’d want to search for in real life
– Not used to impose any order on the file
 Can return results based on secondary key(s)
– Selected by secondary key value(s)
– Sorted on secondary key value(s)

8
Searching
 From 2325 – Two Major Methods
– Sequential
 Start at beginning, look until you find what you’re after
 Choices:
– Non-unique keys allowed?
– Return first match or all of them?
– Binary
 Start in middle, remove half the list each time through
 Requires:
– Primary key values unique across file
– File sorted on primary
– Records directly accessible
 There are others, but …
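The binary search requirements above (unique keys, sorted file, directly accessible records) can be sketched in C++ as follows. An in-memory vector stands in for the directly accessible file, and the Record layout is a hypothetical one chosen for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record layout: a unique primary key plus some payload.
struct Record {
    std::string key;
    std::string data;
};

// Returns the position of the record with the given key, or -1 if absent.
// Precondition: `file` is sorted ascending on key and keys are unique.
long binarySearch(const std::vector<Record>& file, const std::string& key) {
    std::size_t low = 0;
    std::size_t high = file.size();              // search the range [low, high)
    while (low < high) {
        const std::size_t mid = low + (high - low) / 2;
        if (file[mid].key == key) return static_cast<long>(mid);
        if (file[mid].key < key) low = mid + 1;  // discard the lower half
        else high = mid;                         // discard the upper half
    }
    return -1;                                   // range is empty: key not present
}
```

Each pass halves the remaining range, which is why the file must already be sorted on the key being searched.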

9
Sequential Searching
Performance
– It might take 1 try; it might take N tries
 Average number of tries ≈ N / 2 (exactly (N + 1) / 2 when the record is present) if:
– Searching on a unique key
– Returning first match
 Average number of tries = N if:
– Returning all matches
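The two cases above can be sketched in C++. The Person record and the function names are assumptions made for illustration; the `tries` counter is there only to make the cost visible.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type; lastName plays the role of the search key.
struct Person {
    std::string lastName;
    std::string city;
};

// Return-first-match: stops at the first hit, so it examines between 1 and N records.
long findFirst(const std::vector<Person>& file, const std::string& key, std::size_t& tries) {
    tries = 0;
    for (std::size_t i = 0; i < file.size(); ++i) {
        ++tries;
        if (file[i].lastName == key) return static_cast<long>(i);
    }
    return -1;  // not found after looking at all N records
}

// Return-all-matches: cannot stop early, so it always examines all N records.
std::vector<std::size_t> findAll(const std::vector<Person>& file, const std::string& key,
                                 std::size_t& tries) {
    std::vector<std::size_t> hits;
    tries = 0;
    for (std::size_t i = 0; i < file.size(); ++i) {
        ++tries;
        if (file[i].lastName == key) hits.push_back(i);
    }
    return hits;
}
```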

10
Sequential Searching
 Performance
– Biggest factor is disk access
 Worst case:
– File fragmented around the disk
– Each program read takes one physical read
 Best case:
– File fairly contiguous on disk
– I/O System buffers things so very few (1?) actual reads are done
– In multi-user OSs, this seldom happens
 However:
– If read/write head didn’t move between accesses
• Rotational latency & transfer times small compared to seek time
• Multiple physical reads wouldn’t have as much of an impact
– However, most OSs are multi-tasking now
• Can’t rely on read/write head’s being where you left it
• Must assume N physical reads take N full disk accesses

11
Improving Sequential Searches
Reduce Number of Physical Reads
– We can’t do anything about:
 File fragmentation
– If file’s clusters scattered around disk, multiple seeks are necessary
 Multi-tasking environment
– Have to assume each program read causes a physical read
– (May not be true, if I/O System has good internal caching)
– So what do we do?
 Increase the number of records pulled in by each physical read
– Saw this with magnetic tape – group the records into blocks
– Similar to way we collected fields into records, but …
• Grouping fields into records is dependent on data characteristics
• Grouping records into blocks is dependent on I/O system & disk
– Block size should be:
• Multiple of disk sector size
• Compatible with I/O System’s ability to read

12
When to Use Sequential Searching
 Sequential Searching is Good for:
– Text files where you’re looking for a pattern
 Unix ‘grep’ command (name comes from the ed command g/re/p: globally search for a regular expression and print)
– Small files
 Like you use in labs here
– Files that are searched very infrequently
 Not worth the effort to sort to make binary search work
– When you expect a large number of matches
 Example – searching on a secondary key
 It’s Not so Good for:
– Binary files
– Sorted files
– Big files

13
Unix Tools for Sequential
Access
 cat
– Seen this one – concatenate files
– cat F1 F2 >F3
 wc
– Word count (also character & line count)
– wc article.txt
 grep
– Search file for occurrences of regular expression pattern
– grep "Ames" personlist.txt
 od
– Octal dump – or hex, or …
– od -ch list.dat

14
Direct Access
 What is it?
– Go straight to the record you want in the file
 No searching
 No unnecessary disk accesses
– What’s its “order”?
 Time to find a record is independent of number of records
 However, it can be harder to do

15
Direct Access
 How to Do It?
– At I/O System level, seek to record
 C++ seek operations go to relative byte address (RBA) in file
 Variants:
– Seek with “get” pointer vs. seek with “put” pointer
– Relative to start, current position, or end of file (default: start)
– But that still doesn’t answer the question
 How do we know what RBA a particular record starts at?
 We’ve talked about index files – but that’s for later
 We could move the problem up one level
– Use relative record number (RRN)
 But that’s no real help
– Still need some kind of index – way to find record’s RRN
– Also requires use of fixed-length records:
RBA = RRN * Record_Size
(assuming, of course, that the first RRN is 0)

16
Building a File of Records
Like Building a Record of Fields
– Same problem, up one level
 Fixed-length or specified-length records?
 How to directly access records?
– But wait – there’s more:
 Want to require software to know as few details about file
as possible
 To do that, those details need to be stored with (in) the file
– File header records
 Store file-specific information at start of file
 Header record format
– Constant across all file types within one system
– Why?

17
File Header Records
Things a Header Record Might Contain
– File structure
 Type of record structure
 Number of data records
 Length of records (if fixed-length)
 Record delimiter (if delimited)
– Record structure (if records have consistent structure)
 Number of fields
 Length of each field or delimiter between each field
 Format of each field
 Key information – if needed
– Primary key field
– Secondary key field(s), if any
– Date/time of most recent access
– Date/time of most recent update

18
File Header Records,
continued
Header Record Format
– Binary or character?
 Depends – is it important for people to read it?
– Here’s a place where HTML-style format might
work
 Lets files of different formats have different headers
(in some ways)
 Only invokes that parse overhead once per file

19
What’s the Difference?
File Organization
– Format of the file itself
 Fixed-length, specified-length, or delimited records
 ASCII or binary character encoding
File Access Method
– Way(s) software can get at contents of file
 Sequential vs. direct
 Indexed sequential

20
Designing a File
Access Affects Organization
– If sequential access is all we need
 Pretty much any organization is OK
 Subject, of course, to application needs
– If we need direct access
 Need fixed-length records
 Can also use indexed files, but that’s for later on
But Organization Also Affects Access
– What if data to be stored in a record is wildly variable?
 Fixed-length records would be extremely wasteful
 But if we use specified-length records, how to do direct access?
– Just about have to use indexing then

21
Metadata
Data About Data
– Usually in the form of a file header
– Example in text
 Astronomy image storage format
 HTML format (name = value)
 But look on page 177: coding style makes a BIG difference
– Parsing this kind of data
 Read field name; read field value
 Convert ASCII value to type required for storage & use
 Store converted value into right variable
– Why use this type of header?

22
More Metadata
PC Graphics Storage Formats
– Data
 Color values for each pixel in image
 Data compression often used (GIF, JPG)
 Different color “depth” possibilities
– Metadata
 Height, width
 Number of bits per pixel (color depth)
 If not true color (24 bits / pixel)
– Color look-up table
• Normally 256 entries
• Indexed by values stored for each pixel (normally 1 byte)
• Contains R/G/B values for color combination
– Formatted to be loaded directly into PC graphics RAM

23
Mixing Data Objects in a File
Objective
– Store different types of data in the same file
– Textbook example – mix of astronomy data
 “File” header (HTML-style)
 “File” of notes – lines of ASCII text
 “File” of image data – in whatever format
– So our data file becomes a file of files
 Each individual “file” (header, notes, or image) looks like
a record in this new “mega-file”
 These “mega-records” are of varying length
 How do we store the “records” in the “mega-records”?
– Could use another level of specified-length record software
– Or, …

24
Our “Mega-File”
Organization
– Mega-file: Header, Notes sub-file, Image sub-file, Notes sub-file, Image sub-file, …
– Notes sub-file: Notes header, Text line, Text line, …, Text line, Terminator line
– Image sub-file: Image header, Image data

25
More on Our Mega-File
Access
– Can we just read it sequentially?
 Why or why not?
 What if we wanted to skip a notes sub-file?
 What if some image didn’t even have a notes sub-file?
– Can we access it directly?
 What would the header have to include to allow that?
– An index of the “records” in the file
– We call the entries in that index “tags”
 Each tag in the tag list has:
– Type of sub-file referred to
• Special-case type: end of file
– RBA of sub-file in mega-file
– Length of sub-file (not necessary, but helpful)
– Key information, if any, for sub-file

26
More on Our Mega-File
 Access, continued
– So how do we access the mega-file now?
 Read and process the header
– Get whole-file information
– Build in-memory tag table for sub-files
 Sequential access
– Same as before
– May be able to program in some speed-ups from tag table
 Direct access
– Locate sub-file in tag table
– Go right to it

27
Extensibility
Look at Our “Mega-File” Format Again
– Header tells us things about the sub-files:
 What kinds of files they are
 Where to find them
– Files themselves
 To the mega-file processor, just random bytes
 To the sub-file processor, meaningful information
What if we need a new type of sub-file?
– Define a new type of header entry
– Extend header processor to understand that entry
– Write (or borrow or buy) code to handle new sub-file
Cardinal Rule:
– Everything changes – file types, data types, ...

28
Factors Affecting Portability - 1
Operating System Differences
– Example – text lines
 End with line-feed character
 End with carriage-return and line-feed
 Prefixed by a count of characters in the line
Natural Language Differences
– Example – character coding
 Single-byte coding – ASCII, EBCDIC
 Multi-byte coding – Unicode (UTF-16 uses 2 or 4 bytes per character; UTF-8 uses 1 to 4)
Programming Language Differences
– Pascal can’t directly process varying-length records
– Different C++ compilers use different byte lengths
for the standard data types

29
Factors Affecting Portability - 2
Computer Architecture Differences
– Byte order in 16-bit and 32-bit integer values
 Big-endian – leftmost (lowest-address) byte is most significant
 Little-endian – rightmost (highest-address) byte is most significant
– Storage of data in memory
 Some architectures require values that are N bytes long
to start at a byte whose address is divisible by N
– Byte-order example: bytes 0x15 0x32 stored in that order
 Big-endian interpretation: 0x1532
 Little-endian interpretation: 0x3215
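The two interpretations of the byte pair 0x15 0x32 can be checked with a short C++ sketch. The function names are made up for illustration. Because the value is assembled with shifts, the result does not depend on the byte order of the machine running the code.

```cpp
#include <cstdint>

// Interpret two stored bytes as a 16-bit value, first byte most significant.
std::uint16_t fromBigEndian(const unsigned char bytes[2]) {
    return static_cast<std::uint16_t>((bytes[0] << 8) | bytes[1]);
}

// Interpret the same two bytes with the first byte least significant.
std::uint16_t fromLittleEndian(const unsigned char bytes[2]) {
    return static_cast<std::uint16_t>((bytes[1] << 8) | bytes[0]);
}
```

For the stored bytes { 0x15, 0x32 }, fromBigEndian yields 0x1532 and fromLittleEndian yields 0x3215.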

30
How to Port Files
Define Your Format C*A*R*E*F*U*L*L*Y
– Once a format is defined, never change it
 If you need a new format, add it so as not to invalidate
the existing formats
 If you need to change a format, add a new one instead,
and let programs that need the new version use it
– Decide on a standard format for data elements
 Text lines
– ASCII, EBCDIC, or Unicode?
– Which character(s) to end lines?
 Binary
– Tightly packed or multiple-of-N addressing?
– Which “endian”?
– You can always write code to convert to & from the
standard format on a new language, computer, etc.
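One way to "convert to and from the standard format" for binary integers, sketched in C++. It assumes the file format fixes 32-bit integers as 4 tightly packed bytes, most significant first; that choice and the function names are illustrative, not something the slides prescribe.

```cpp
#include <array>
#include <cstdint>

// Encode a 32-bit value into the assumed standard file order (most significant byte first).
std::array<unsigned char, 4> toStandard(std::uint32_t v) {
    return {{ static_cast<unsigned char>((v >> 24) & 0xFF),
              static_cast<unsigned char>((v >> 16) & 0xFF),
              static_cast<unsigned char>((v >> 8) & 0xFF),
              static_cast<unsigned char>(v & 0xFF) }};
}

// Decode 4 bytes in the standard file order back into a native 32-bit value.
std::uint32_t fromStandard(const std::array<unsigned char, 4>& b) {
    return (static_cast<std::uint32_t>(b[0]) << 24) |
           (static_cast<std::uint32_t>(b[1]) << 16) |
           (static_cast<std::uint32_t>(b[2]) << 8) |
            static_cast<std::uint32_t>(b[3]);
}
```

Only these two routines need to exist on each new machine; every program reads and writes the file through them instead of dumping its native in-memory layout.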

31
The Conversion Problem
Few Environments – can do directly
– IBM to VAX
– VAX to IBM
Many Environments – need intermediate form
– From: IBM, VAX, IA-32, IA-64, …
– Via: XML (or some other standard format)
– To: IBM, VAX, IA-32, IA-64, …
