1
Chapter 5 – Managing Files of Records
2
What’s Up for This Chapter?
This Chapter’s Material
– Accessing records in files
– Record structures for access
– File access methods vs. file organizations
– Some real-world examples of file structures
– File portability issues
3
The Central Problem
Locating Stored Data
– Once the data has been stored into a file, how do you find it to retrieve it?
– What does “find the data” even mean?
  • How do you decide what you want to find?
  • How do you look for it?
  • What if it’s not there?
  • What if something very much like it is there?
  • What if there are lots of “it” there?
– And, of course, there are efficiency considerations
  • How fast is your search algorithm?
  • What would you have to do to the file to use a faster one?
  • Which will you do more often, add records or find them?
– Bringing you back to the design of the file itself
4
Record Keys
What Is a Key?
– Data stored in a record by which you look for the record
– Can be one field or a set of fields
  • Examples – { name } or { last_name + first_name }
Two Types of Keys
– Primary key
  • Key value, unique in the entire file, by which an individual record can be located or determined to be absent
– Secondary key
  • Key value by which one or more records can be located
5
Primary Keys
Required Characteristics
– Unique across the entire file
  • Can never have 2 records with the same primary key
  • Error to try to add a record with a duplicate primary key
– In “canonical” form
  • Format precisely known, so search candidates can be brought into that same format before the search
  • Example – words (names, etc.) in all upper-case
    – Not often used any more: rather, program the system to do the search independently of case
– Unchanging
  • Value for a given record should never change
    – A given primary key value should always identify the same record
    – Example – Texas Driver’s License number stays with you, even if you move away from Texas, then come back
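A minimal sketch of the first two characteristics, assuming a canonical form of "trimmed, single-spaced, upper-case" and a hypothetical student-number key such as S1024. The class name PrimaryKeyIndex and the in-memory dictionary are illustrative stand-ins for a real keyed file.

```python
def canonical_key(raw: str) -> str:
    """Bring a key into the assumed canonical form: trimmed, single-spaced, upper-case."""
    return " ".join(raw.split()).upper()


class PrimaryKeyIndex:
    """Toy in-memory stand-in for a file keyed by a unique primary key."""

    def __init__(self):
        self._records = {}

    def add(self, key: str, record: dict) -> None:
        k = canonical_key(key)
        if k in self._records:
            # Adding a duplicate primary key is an error, never an overwrite.
            raise KeyError(f"duplicate primary key: {k}")
        self._records[k] = record

    def find(self, key: str):
        # The search candidate is canonicalized before the lookup.
        # Returns the record, or None when the key is absent.
        return self._records.get(canonical_key(key))
```

Because both add and find canonicalize first, " s1024 " and "S1024" name the same record, and a second add under either spelling is rejected.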
6
Primary Keys, cont’d.
Implications for File Design
– Don’t use possibly non-unique field(s) as primary key
  • Bad – name, birth date, etc.
– Don’t use anything that can possibly change
  • Bad – name, address, etc.
– What can we use?
  • Best – artificial identifier
    – Student number
    – Driver’s license number
    – Other artificially created unique value
7
Secondary Keys
Not Such Stringent Rules
– Duplicates allowed
  • Still have to define what “find” means if duplicates are allowed
– Usually real data, as opposed to primary keys
  • The kinds of thing you’d want to search for in real life
– Not used to impose any order on the file
  • Can return results based on secondary key(s)
    – Selected by secondary key value(s)
    – Sorted on secondary key value(s)
8
Searching
From 2325 – Two Major Methods
– Sequential
  • Start at beginning, look until you find what you’re after
  • Choices:
    – Non-unique keys allowed?
    – Return first match or all of them?
– Binary
  • Start in middle, remove half the list each time through
  • Requires:
    – Primary key values unique across file
    – File sorted on primary key
    – Records directly accessible
There are others, but …
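As a rough illustration of the binary method, here is a sketch over an in-memory list that stands in for a sorted, directly accessible file. The probe counter and the sample student numbers are assumptions added for the example.

```python
def binary_search(sorted_keys, target):
    """Return (index or None, probes) for target in an ascending list of unique keys."""
    lo, hi = 0, len(sorted_keys) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2          # start in the middle of what is left
        probes += 1
        if sorted_keys[mid] == target:
            return mid, probes
        if sorted_keys[mid] < target:
            lo = mid + 1              # discard the lower half
        else:
            hi = mid - 1              # discard the upper half
    return None, probes


# 1000 hypothetical student numbers, already sorted and unique.
keys = list(range(1000, 2000))
index, probes = binary_search(keys, 1731)
```

With 1000 keys, no search needs more than 10 probes, whether or not the key is present.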
9
Sequential Searching
Performance
– It might take 1 try; it might take N tries
  • Average number of tries ≈ N / 2 if:
    – Searching on a unique key
    – Returning first match
  • Number of tries = N if:
    – Returning all matches
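The two cases can be checked with a small sketch. The record layout and sample data are hypothetical; the exact average for a key that is present and equally likely to be anywhere works out to (N + 1) / 2, which is the "about N / 2" figure.

```python
def find_first(records, key_of, target):
    """Sequential search that stops at the first match. Returns (record or None, tries)."""
    for tries, rec in enumerate(records, start=1):
        if key_of(rec) == target:
            return rec, tries
    return None, len(records)


def find_all(records, key_of, target):
    """Sequential search for every match: always examines all N records."""
    matches = [rec for rec in records if key_of(rec) == target]
    return matches, len(records)


# Hypothetical sample data: 100 records with unique ids 0..99.
records = [{"id": i, "city": "Ames" if i % 10 == 0 else "Other"} for i in range(100)]

# Search once for every id and average the number of tries.
total = sum(find_first(records, lambda r: r["id"], t)[1] for t in range(100))
average_tries = total / 100

# Searching on a secondary key has to look at every record.
ames, tries_for_all = find_all(records, lambda r: r["city"], "Ames")
```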
10
Sequential Searching
Performance
– Big factor is disk access
  • Worst case:
    – File fragmented around the disk
    – Each program read takes one physical read
  • Best case:
    – File fairly contiguous on disk
    – I/O System buffers things so very few (1?) actual reads are done
    – In multi-user OSs, this seldom happens
  • However:
    – If read/write head didn’t move between accesses
      • Rotational latency & transfer times small compared to seek time
      • Multiple physical reads wouldn’t have as much of an impact
    – However, most OSs are multi-tasking now
      • Can’t rely on read/write head’s being where you left it
      • Must assume N physical reads take N full disk accesses
11
Improving Sequential Searches
Reduce Number of Physical Reads
– We can’t do anything about:
  • File fragmentation
    – If file’s clusters are scattered around disk, multiple seeks are necessary
  • Multi-tasking environment
    – Have to assume each program read causes a physical read
    – (May not be true, if I/O System has good internal caching)
– So what do we do?
  • Increase the number of records pulled in by each physical read
    – Saw this with magnetic tape – group the records into blocks
    – Similar to the way we collected fields into records, but …
      • Grouping fields into records is dependent on data characteristics
      • Grouping records into blocks is dependent on I/O system & disk
    – Block size should be:
      • Multiple of disk sector size
      • Compatible with I/O System’s ability to read
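A sketch of record blocking, under assumed sizes: 64-byte records, 512-byte sectors, and a block of 8 sectors. It counts read() calls as a proxy for physical reads, which is only true if the I/O system does no caching of its own.

```python
import io
import math

RECORD_SIZE = 64              # assumed fixed record length in bytes
SECTOR_SIZE = 512             # assumed disk sector size
BLOCK_SIZE = 8 * SECTOR_SIZE  # block is a multiple of the sector size
assert BLOCK_SIZE % RECORD_SIZE == 0  # keeps records from straddling blocks


def scan_blocked(f, wanted: bytes):
    """Sequentially search f for a record starting with `wanted`.

    Returns (relative record number or None, number of read() calls issued).
    """
    reads = 0
    rrn = 0
    while True:
        block = f.read(BLOCK_SIZE)    # one read pulls in many records
        reads += 1
        if not block:
            return None, reads
        for off in range(0, len(block), RECORD_SIZE):
            if block[off:off + RECORD_SIZE].startswith(wanted):
                return rrn, reads
            rrn += 1


# 1000 hypothetical records, each an 8-digit key padded to RECORD_SIZE.
data = b"".join(f"{i:08d}".encode().ljust(RECORD_SIZE, b" ") for i in range(1000))
records_per_block = BLOCK_SIZE // RECORD_SIZE
blocks_needed = math.ceil(1000 / records_per_block)
last_rrn, reads_used = scan_blocked(io.BytesIO(data), b"00000999")
```

Reading one record per call would take up to 1000 reads to reach the last record; with 64 records per block it takes 16.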
12
When to Use Sequential Searching
Sequential Searching is Good for:
– Text files where you’re looking for a pattern
  • Unix ‘grep’ (“global regular expression print”) command
– Small files
  • Like you use in labs here
– Files that are searched very infrequently
  • Not worth the effort to sort to make binary search work
– When you expect a large number of matches
  • Example – searching on a secondary key
It’s Not so Good for:
– Binary files
– Sorted files
– Big files
13
Unix Tools for Sequential Access
cat
– Seen this one – concatenate files
– cat F1 F2 >F3
wc
– Word count (also character & line count)
– wc article.txt
grep
– Search file for occurrences of regular expression pattern
– grep "Ames" personlist.txt
od
– Octal dump – or hex, or …
– od -ch list.dat
14
Direct Access
What is it?
– Go straight to the record you want in the file
  • No searching
  • No unnecessary disk accesses
– What’s its “order”?
  • Time to find a record is independent of the number of records: O(1)
  • However, it can be harder to do
15
Direct Access
How to Do It?
– At I/O System level, seek to record
  • C++ seek operations go to relative byte address (RBA) in file
  • Variants:
    – Seek with “get” pointer vs. seek with “put” pointer
    – Relative to start, current position, or end of file (default: start)
– But that still doesn’t answer the question
  • How do we know what RBA a particular record starts at?
  • We’ve talked about index files – but that’s for later
  • We could move the problem up one level
    – Use relative record number (RRN)
  • But that’s no real help
    – Still need some kind of index – way to find record’s RRN
    – Also requires use of fixed-length records:
      RBA = RRN * Record_Size
      (assuming, of course, that the first RRN is 0)
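The RBA formula can be sketched with ordinary seek and read calls. This example uses Python's file interface rather than the C++ stream operations named above; the 32-byte record size, the helper names, and the sample names are assumptions for illustration.

```python
import io

RECORD_SIZE = 32  # assumed fixed record length in bytes


def read_record(f, rrn: int) -> bytes:
    """Fetch record number `rrn` (first record is RRN 0) with one seek and one read."""
    f.seek(rrn * RECORD_SIZE)          # RBA = RRN * Record_Size
    rec = f.read(RECORD_SIZE)
    if len(rec) != RECORD_SIZE:
        raise IndexError(f"no record with RRN {rrn}")
    return rec


def write_record(f, rrn: int, payload: bytes) -> None:
    """Overwrite record `rrn` in place, padding the payload with spaces."""
    if len(payload) > RECORD_SIZE:
        raise ValueError("payload longer than record size")
    f.seek(rrn * RECORD_SIZE)
    f.write(payload.ljust(RECORD_SIZE, b" "))


# An in-memory file stands in for a real one opened with open(path, "r+b").
f = io.BytesIO()
for i, name in enumerate([b"Ames", b"Mason", b"Brown"]):
    write_record(f, i, name)

second = read_record(f, 1).rstrip()
```

The cost of read_record does not depend on how many records the file holds, but the caller still has to know the RRN, which is the indexing problem noted above.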
16
Building a File of Records
Like Building a Record of Fields
– Same problem, up one level
  • Fixed-length or specified-length records?
  • How to directly access records?
– But wait – there’s more:
  • Want to require software to know as few details about file as possible
  • To do that, those details need to be stored with (in) the file
– File header records
  • Store file-specific information at start of file
  • Header record format
    – Constant across all file types within one system
    – Why?
17
File Header Records
Things a Header Record Might Contain
– File structure
  • Type of record structure
  • Number of data records
  • Length of records (if fixed-length)
  • Record delimiter (if delimited)
– Record structure (if records have consistent structure)
  • Number of fields
  • Length of each field or delimiter between each field
  • Format of each field
  • Key information – if needed
    – Primary key field
    – Secondary key field(s), if any
– Date/time of most recent access
– Date/time of most recent update
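One possible binary header carrying a few of these items is sketched below. The layout (a 4-byte magic value, a record-structure code, a record count, and a record length) and every name in it are assumptions made for the example, not a format defined in this chapter.

```python
import io
import struct

# Assumed layout: magic, record-structure code, record count, record length,
# all little-endian with no padding.
HEADER = struct.Struct("<4sBIH")
FIXED_LENGTH = 1  # hypothetical code for "fixed-length records"


def write_header(f, record_count: int, record_size: int) -> None:
    f.seek(0)
    f.write(HEADER.pack(b"RECF", FIXED_LENGTH, record_count, record_size))


def read_header(f) -> dict:
    f.seek(0)
    magic, kind, count, size = HEADER.unpack(f.read(HEADER.size))
    if magic != b"RECF":
        raise ValueError("not a record file")
    return {"kind": kind, "record_count": count, "record_size": size,
            "data_start": HEADER.size}


def record_rba(header: dict, rrn: int) -> int:
    """With a header in front, data records start after it."""
    return header["data_start"] + rrn * header["record_size"]


f = io.BytesIO()
write_header(f, record_count=3, record_size=32)
header = read_header(f)
```

Software that reads the header first needs no built-in knowledge of the record length; it only has to agree on the header layout itself.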
18
File Header Records, continued
Header Record Format
– Binary or character?
  • Depends – is it important for people to read it?
– Here’s a place where HTML-style format might work
  • Lets files of different formats have different headers (in some ways)
  • Only invokes that parse overhead once per file
19
What’s the Difference?
File Organization
– Format of the file itself
  • Fixed-length, specified-length, or delimited records
  • ASCII or binary character encoding
File Access Method
– Way(s) software can get at contents of file
  • Sequential vs. direct
  • Indexed sequential
20
Designing a File
Access Affects Organization
– If sequential access is all we need
  • Pretty much any organization is OK
  • Subject, of course, to application needs
– If we need direct access
  • Need fixed-length records
  • Can also use indexed files, but that’s for later on
But Organization Also Affects Access
– What if data to be stored in a record is wildly variable?
  • Fixed-length records would be extremely wasteful
  • But if we use specified-length records, how to do direct access?
    – Just about have to use indexing then
21
Metadata
Data About Data
– Usually in the form of a file header
– Example in text
  • Astronomy image storage format
  • HTML format (name = value)
  • But look on page 177: coding style makes a BIG difference
– Parsing this kind of data
  • Read field name; read field value
  • Convert ASCII value to type required for storage & use
  • Store converted value into right variable
– Why use this type of header?
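The three parsing steps can be sketched as follows. The field names, their types, and the END terminator line are assumptions chosen for illustration; they are not the textbook's astronomy format.

```python
# Assumed field names and types; a real format defines its own.
FIELD_TYPES = {"WIDTH": int, "HEIGHT": int, "BITPIX": int, "SCALE": float, "OBJECT": str}


def parse_header(text: str) -> dict:
    """Parse 'name = value' lines, converting each ASCII value to its storage type."""
    header = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "END":                     # assumed terminator line
            break
        name, sep, value = line.partition("=")   # read field name; read field value
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        name = name.strip().upper()
        convert = FIELD_TYPES.get(name, str)  # unknown fields stay as text
        header[name] = convert(value.strip()) # store the converted value
    return header


sample = """\
width  = 512
height = 256
bitpix = 16
scale  = 0.25
object = M31
END
ignored = 1
"""
hdr = parse_header(sample)
```

An unrecognized name is kept as text rather than rejected, which is one way a name = value header lets older software read files written with newer fields.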
22
More Metadata
PC Graphics Storage Formats
– Data
  • Color values for each pixel in image
  • Data compression often used (GIF, JPG)
  • Different color “depth” possibilities
– Metadata
  • Height, width
  • Number of bits per pixel (color depth)
  • If not true color (24 bits / pixel)
    – Color look-up table
      • Normally 256 entries
      • Indexed by values stored for each pixel (normally 1 byte)
      • Contains R/G/B values for color combination
    – Formatted to be loaded directly into PC graphics RAM
23
Mixing Data Objects in a File
Objective
– Store different types of data in the same file
– Textbook example – mix of astronomy data
  • “File” header (HTML-style)
  • “File” of notes – lines of ASCII text
  • “File” of image data – in whatever format
– So our data file becomes a file of files
  • Each individual “file” (header, notes, or image) looks like a record in this new “mega-file”
  • These “mega-records” are of varying length
  • How do we store the “records” in the “mega-records”?
    – Could use another level of specified-length record software
    – Or, …
24
Our “Mega-File”
Organization
[Figure: mega-file layout. The mega-file is a Mega-file Header followed by alternating Notes sub-files and Image sub-files (Notes, Image, Notes, Image, …). Each Notes sub-file is a Notes Header, a run of Text lines, and a Terminator line. Each Image sub-file is an Image Header followed by Image Data.]
25
More on Our Mega-File
Access
– Can we just read it sequentially?
  • Why or why not?
  • What if we wanted to skip a notes sub-file?
  • What if some image didn’t even have a notes sub-file?
– Can we access it directly?
  • What would the header have to include to allow that?
    – An index of the “records” in the file
    – We call the entries in that index “tags”
  • Each tag in the tag list has:
    – Type of sub-file referred to
      • Special-case type: end of file
    – RBA of sub-file in mega-file
    – Length of sub-file (not necessary, but helpful)
    – Key information, if any, for sub-file
26
More on Our Mega-File
Access, continued
– So how do we access the mega-file now?
  • Read and process the header
    – Get whole-file information
    – Build in-memory tag table for sub-files
  • Sequential access
    – Same as before
    – May be able to program in some speed-ups from tag table
  • Direct access
    – Locate sub-file in tag table
    – Go right to it
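A small sketch of a tagged mega-file follows, assuming each tag is a 4-byte type, a 4-byte RBA, and a 4-byte length, with the tag list stored at the start of the file. The tag layout, the type codes NOTE, IMAG, and END, and the function names are all hypothetical.

```python
import io
import struct

# Assumed tag layout: 4-byte type, 4-byte RBA, 4-byte length, little-endian.
TAG = struct.Struct("<4sII")
END_TAG = b"END "          # special-case type marking the end of the tag list


def write_megafile(subfiles):
    """subfiles: list of (type, payload bytes). Returns the whole file as bytes."""
    header_size = TAG.size * (len(subfiles) + 1)   # one tag each, plus the end tag
    tags, body, rba = [], b"", header_size
    for kind, payload in subfiles:
        tags.append(TAG.pack(kind, rba, len(payload)))
        body += payload
        rba += len(payload)
    tags.append(TAG.pack(END_TAG, rba, 0))
    return b"".join(tags) + body


def read_tag_table(f):
    """Build the in-memory tag table by reading tags until the end tag."""
    f.seek(0)
    table = []
    while True:
        kind, rba, length = TAG.unpack(f.read(TAG.size))
        if kind == END_TAG:
            return table
        table.append((kind, rba, length))


def read_subfile(f, tag):
    """Direct access: seek straight to the sub-file named by a tag."""
    _, rba, length = tag
    f.seek(rba)
    return f.read(length)


f = io.BytesIO(write_megafile([
    (b"NOTE", b"first light\nseeing poor\n"),
    (b"IMAG", bytes(range(16))),
    (b"IMAG", bytes(range(16, 32))),   # an image with no notes sub-file
]))
table = read_tag_table(f)
images = [read_subfile(f, t) for t in table if t[0] == b"IMAG"]
```

Skipping notes, or coping with an image that has none, becomes a filter over the tag table rather than a scan through the data.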
27
Extensibility
Look at Our “Mega-File” Format Again
– Header tells us things about the sub-files:
  • What kinds of files they are
  • Where to find them
– Files themselves
  • To the mega-file processor, just opaque bytes
  • To the sub-file processor, meaningful information
What if we need a new type of sub-file?
– Define a new type of header entry
– Extend header processor to understand that entry
– Write (or borrow or buy) code to handle new sub-file
Cardinal Rule:
– Everything changes – file types, data types, ...
28
Factors Affecting Portability - 1
Operating System Differences
– Example – text lines
  • End with line-feed character
  • End with carriage-return and line-feed
  • Prefixed by a count of characters in the line
Natural Language Differences
– Example – character coding
  • Single-byte coding – ASCII, EBCDIC
  • Multi-byte coding – Unicode (e.g., UTF-16 uses two or four bytes per character)
Programming Language Differences
– Pascal can’t directly process varying-length records
– Different C++ compilers use different byte lengths for the standard data types
29
Factors Affecting Portability - 2
Computer Architecture Differences
– Byte order in 16-bit and 32-bit integer values
  • Big-endian – leftmost (lowest-address) byte is most significant
  • Little-endian – rightmost (highest-address) byte is most significant
  • Example – the two bytes 0x15 0x32, in that order:
    – Big-endian interpretation: 0x1532
    – Little-endian interpretation: 0x3215
– Storage of data in memory
  • Some architectures require values that are N bytes long to start at a byte whose address is divisible by N
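The 0x15 0x32 example can be reproduced directly. This sketch uses Python's int.from_bytes and the struct module; the last two lines illustrate the alignment point, and the native size they report is platform-dependent.

```python
import struct

raw = bytes([0x15, 0x32])               # the two bytes, in file order

big = int.from_bytes(raw, "big")        # first byte is most significant
little = int.from_bytes(raw, "little")  # last byte is most significant

# struct spells the same choice with a prefix: ">" big-endian, "<" little-endian.
(big_s,) = struct.unpack(">H", raw)
(little_s,) = struct.unpack("<H", raw)

# "@" (native) mode may insert alignment padding; "<" and ">" never do.
packed_size = struct.calcsize("<BI")    # 1 + 4 bytes, tightly packed
native_size = struct.calcsize("@BI")    # may be larger, depending on the platform
```

Here big is 0x1532 and little is 0x3215, matching the two interpretations above.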
30
How to Port Files
Define Your Format C*A*R*E*F*U*L*L*Y
– Once a format is defined, never change it
  • If you need a new format, add it so as not to invalidate the existing formats
  • If you need to change a format, add a new one instead, and let programs that need the new version use it
– Decide on a standard format for data elements
  • Text lines
    – ASCII, EBCDIC, or Unicode?
    – Which character(s) to end lines?
  • Binary
    – Tightly packed or multiple-of-N addressing?
    – Which “endian”?
– You can always write code to convert to & from the standard format on a new language, computer, etc.
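As one example of fixing these choices, the sketch below assumes a standard of ASCII text with line-feed line ends and big-endian, tightly packed integers. That standard, the two-field record, and the function names are illustrative choices, not a recommendation from the slides.

```python
import struct

# Assumed standard: ASCII text, "\n" line ends, big-endian tightly packed integers.
_RECORD = struct.Struct(">iH")   # 32-bit signed id, 16-bit unsigned count


def text_to_standard(text: str) -> bytes:
    """Normalize CRLF / CR / LF line ends to LF and encode as ASCII."""
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.encode("ascii")


def text_from_standard(data: bytes, line_end: str = "\n") -> str:
    """Decode standard text, using whatever line ending the local system wants."""
    return data.decode("ascii").replace("\n", line_end)


def record_to_standard(rec_id: int, count: int) -> bytes:
    """Pack the two fields in the standard byte order, with no padding."""
    return _RECORD.pack(rec_id, count)


def record_from_standard(data: bytes):
    """Unpack a standard-format record into (rec_id, count)."""
    return _RECORD.unpack(data)
```

Each new platform then needs only this pair of conversions, to and from the standard, whatever its own conventions are.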
31
The Conversion Problem
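Few Environments – can do directly
[Figure: direct conversion between each pair of environments, e.g. IBM to VAX and VAX to IBM.]
Many Env’ts. – need intermediate form
[Figure: each environment (IBM, VAX, IA-32, IA-64, …) converts to and from XML (or some other standard format).]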
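The reason the intermediate form wins as environments multiply is simple counting, sketched here under the assumption that one converter is needed per direction.

```python
def direct_converters(n: int) -> int:
    """One converter per ordered pair of distinct environments."""
    return n * (n - 1)


def intermediate_converters(n: int) -> int:
    """One converter into and one out of the standard form per environment."""
    return 2 * n


for n in (2, 3, 4, 10):
    print(n, direct_converters(n), intermediate_converters(n))
```

With 2 environments, direct conversion needs 2 converters against 4; the two approaches tie at 3 environments; at 10 environments it is 90 against 20.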

 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
 
Introduction to Casting Processes in Manufacturing
Introduction to Casting Processes in ManufacturingIntroduction to Casting Processes in Manufacturing
Introduction to Casting Processes in Manufacturing
 
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
 
A case study of cinema management system project report..pdf
A case study of cinema management system project report..pdfA case study of cinema management system project report..pdf
A case study of cinema management system project report..pdf
 
Scaling in conventional MOSFET for constant electric field and constant voltage
Scaling in conventional MOSFET for constant electric field and constant voltageScaling in conventional MOSFET for constant electric field and constant voltage
Scaling in conventional MOSFET for constant electric field and constant voltage
 

Unit 3 chapter-1managing-files-of-records

  • 1. 1 Chapter 5 – Managing Files of Records
  • 2. 2 What’s Up for This Chapter?  This Chapter’s Material – Accessing records in files – Record structures for access – File access methods vs. file organizations – Some real-world examples of file structures – File portability issues
  • 3. 3 The Central Problem  Locating Stored Data – Once the data has been stored into a file, how do you find it to retrieve it? – What does “find the data” even mean?  How do you decide what you want to find?  How do you look for it?  What if it’s not there?  What if something very much like it is there?  What if there are lots of “it” there? – And, of course, there are efficiency considerations  How fast is your search algorithm?  What would you have to do to the file to use a faster one?  Which will you do more often, add records or find them? – Bringing you back to the design of the file itself
  • 4. 4 Record Keys  What Is a Key? – Data stored in a record by which you look for the record – Can be one field or a set of fields  Examples – { name } or { last_name + first_name }  Two Types of Keys – Primary key  Key value, unique in entire file, by which an individual record can be located or determined to be absent – Secondary key  Key value by which one or more records can be located
  • 5. 5 Primary Keys Required Characteristics – Unique across the entire file  Can never have 2 records with same primary key  Error to try to add record with duplicate primary key – In “canonical” form  Format precisely known, so search candidates can be brought into that same format before the search  Example – words (names, etc.) in all upper-case – Not often used any more: rather, program the system to do the search independently of case – Unchanging  Value for given record should never change – Given primary key value should always identify same record – Example – Texas Driver’s License number stays with you, even if you move away from Texas, then come back
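The "canonical form" idea on this slide can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the function name `canonical_key` and the choice of rules (strip surrounding blanks, fold letters to upper case) are hypothetical, not something the slides prescribe.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: bring a key into one canonical form
// (surrounding blanks removed, letters upper-cased) before comparing.
std::string canonical_key(const std::string& raw) {
    const std::string blanks = " \t";
    const std::size_t first = raw.find_first_not_of(blanks);
    if (first == std::string::npos) return "";   // key was empty or all blanks
    const std::size_t last = raw.find_last_not_of(blanks);
    std::string key = raw.substr(first, last - first + 1);
    for (char& c : key) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    assert(key.size() <= raw.size());   // trimming never lengthens the key
    return key;
}
```

Both the stored key and the search candidate would pass through the same function, so `canonical_key(" ames ")` and `canonical_key("AMES")` compare equal.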
  • 6. 6 Primary Keys, cont’d. Implication on File Design – Don’t use possibly non-unique field(s) as primary key  Bad – name, birth date, etc. – Don’t use anything that can possibly change  Bad – name, address, etc. – What can we use?  Best – artificial identifier – Student number – Driver’s license number – Other artificially created unique value
  • 7. 7 Secondary Keys  Not Such Stringent Rules – Duplicates allowed  Still have to define what “find” means if duplicates allowed – Usually real data, as opposed to primary keys  The kinds of thing you’d want to search for in real life – Not used to impose any order on the file  Can return results based on secondary key(s) – Selected by secondary key value(s) – Sorted on secondary key value(s)
  • 8. 8 Searching  From 2325 – Two Major Methods – Sequential  Start at beginning, look until you find what you’re after  Choices: – Non-unique keys allowed? – Return first match or all of them? – Binary  Start in middle, remove half the list each time through  Requires: – Primary key values unique across file – File sorted on primary – Records directly accessible  There are others, but …
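The binary search requirements listed on this slide (unique keys, sorted order, direct access) can be seen in a minimal sketch. It assumes the primary keys are already held in a sorted in-memory `std::vector`; the name `binary_find` is hypothetical, and the standard `std::lower_bound` does the halving.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Binary search over primary keys held in a sorted, directly indexable vector.
// Returns the position of `key`, or -1 if it is absent.
long binary_find(const std::vector<std::string>& sorted_keys,
                 const std::string& key) {
    // Precondition from the slide: the file must be sorted on the primary key.
    assert(std::is_sorted(sorted_keys.begin(), sorted_keys.end()));
    auto it = std::lower_bound(sorted_keys.begin(), sorted_keys.end(), key);
    if (it == sorted_keys.end() || *it != key) return -1;   // not present
    return static_cast<long>(it - sorted_keys.begin());
}
```

On a disk file the same logic would index records by position rather than vector elements, which is why the records must be directly accessible.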
  • 9. 9 Sequential Searching Performance – It might take 1 try; it might take N tries  Average number of tries = N / 2 if: – Searching on a unique key – Returning first match  Average number of tries = N if: – Returning all matches
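The difference between "return first match" and "return all matches" can be made concrete with a small sketch that counts the tries. The `Record` layout and both function names are assumptions made for illustration; the file is modelled as an in-memory vector.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Record {            // hypothetical record layout for illustration
    std::string key;
    std::string data;
};

// Return the index of the first record whose key matches, or -1 if absent.
// `tries` receives the number of records examined.
long find_first(const std::vector<Record>& file, const std::string& key,
                std::size_t& tries) {
    tries = 0;
    for (std::size_t i = 0; i < file.size(); ++i) {
        ++tries;
        if (file[i].key == key) return static_cast<long>(i);
    }
    return -1;
}

// Return the indexes of all matching records; every record is examined.
std::vector<std::size_t> find_all(const std::vector<Record>& file,
                                  const std::string& key, std::size_t& tries) {
    std::vector<std::size_t> hits;
    tries = 0;
    for (std::size_t i = 0; i < file.size(); ++i) {
        ++tries;
        if (file[i].key == key) hits.push_back(i);
    }
    assert(tries == file.size());   // returning all matches costs N tries
    return hits;
}
```

`find_first` stops as soon as it succeeds, which is where the N / 2 average comes from; `find_all` cannot stop early because a later record might also match.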
  • 10. 10 Sequential Searching  Performance – Big factor in disk access  Worst case: – File fragmented around the disk – Each program read takes one physical read  Best case: – File fairly contiguous on disk – I/O System buffers things so very few (1?) actual reads are done – In multi-user OSs, this seldom happens  However: – If read/write head didn’t move between accesses • Rotational latency & transfer times small compared to seek time • Multiple physical reads wouldn’t have as much of an impact – However, most OSs are multi-tasking now • Can’t rely on read/write head’s being where you left it • Must assume N physical reads take N full disk accesses
  • 11. 11 Improving Sequential Searches Reduce Number of Physical Reads – We can’t do anything about:  File fragmentation – If file’s clusters scattered around disk, multiple seeks are necessary  Multi-tasking environment – Have to assume each program read causes a physical read – (May not be true, if I/O System has good internal caching) – So what do we do?  Increase the number of records pulled in by each physical read – Saw this with magnetic tape – group the records into blocks – Similar to way we collected fields into records, but … • Grouping fields into records is dependent on data characteristics • Grouping records into blocks is dependent on I/O system & disk – Block size should be: • Multiple of disk sector size • Compatible with I/O System’s ability to read
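The effect of blocking on the number of physical reads is simple arithmetic, sketched here. The function names are hypothetical, and the sketch assumes that records never span a block boundary and that each physical read fetches exactly one block.

```cpp
#include <cassert>
#include <cstddef>

// Records that fit in one block (records are assumed not to span blocks).
std::size_t records_per_block(std::size_t block_size, std::size_t record_size) {
    assert(record_size > 0 && block_size >= record_size);
    return block_size / record_size;
}

// Physical reads needed to scan n_records when each read fetches one block.
std::size_t reads_for_scan(std::size_t n_records, std::size_t block_size,
                           std::size_t record_size) {
    const std::size_t per_block = records_per_block(block_size, record_size);
    return (n_records + per_block - 1) / per_block;   // round up
}
```

For example, scanning 1000 records of 64 bytes takes 1000 reads with one record per read, but only 16 reads with 4096-byte blocks (64 records per block).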
  • 12. 12 When to Use Sequential Searching  Sequential Searching is Good for: – Text files where you’re looking for a pattern  Unix ‘grep’ command (the name comes from the ed editor command g/re/p: globally search for a regular expression and print) – Small files  Like you use in labs here – Files that are searched very infrequently  Not worth the effort to sort to make binary search work – When you expect a large number of matches  Example – searching on a secondary key  It’s Not so Good for: – Binary files – Sorted files – Big files
  • 13. 13 Unix Tools for Sequential Access  cat – Seen this one – concatenate files – cat F1 F2 >F3  wc – Word count (also character & line count) – wc article.txt  grep – Search file for occurrences of regular expression pattern – grep "Ames" personlist.txt  od – Octal dump – or hex, or … – od -ch list.dat
  • 14. 14 Direct Access  What is it? – Go straight to the record you want in the file  No searching  No unnecessary disk accesses – What’s its “order”?  Time to find a record is independent of number of records  However, it can be harder to do
  • 15. 15 Direct Access  How to Do It? – At I/O System level, seek to record  C++ seek operations go to relative byte address (RBA) in file  Variants: – Seek with “get” pointer vs. seek with “put” pointer – Relative to start or end of file (default: start) – But that still doesn’t answer the question  How do we know what RBA a particular record starts at?  We’ve talked about index files – but that’s for later  We could move the problem up one level – Use relative record number (RRN)  But that’s no real help – Still need some kind of index – way to find record’s RRN – Also requires use of fixed-length records: RBA = RRN * Record_Size (assuming, of course, that the first RRN is 0)
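The formula RBA = RRN * Record_Size maps directly onto the C++ stream "get" pointer. The sketch below is a minimal illustration: `rba_of` and `read_record` are hypothetical names, the `header_size` parameter is an added assumption for files that begin with a header record, and any `std::istream` opened in binary mode would serve as the file.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this without a disk file
#include <string>

// RBA of a fixed-length record, counting RRNs from 0.
// header_size is an assumption: bytes occupied by any file header (0 if none).
std::size_t rba_of(std::size_t rrn, std::size_t record_size,
                   std::size_t header_size = 0) {
    assert(record_size > 0);
    return header_size + rrn * record_size;
}

// Seek the "get" pointer to the record and read it; returns false if the
// record lies beyond the end of the stream.
bool read_record(std::istream& in, std::size_t rrn, std::size_t record_size,
                 std::string& out) {
    in.clear();   // clear any earlier EOF state so the seek can succeed
    in.seekg(static_cast<std::streamoff>(rba_of(rrn, record_size)),
             std::ios::beg);
    out.assign(record_size, '\0');
    in.read(&out[0], static_cast<std::streamsize>(record_size));
    return in.gcount() == static_cast<std::streamsize>(record_size);
}
```

With a stream holding `AAAABBBBCCCC` and a record size of 4, `read_record(in, 1, 4, rec)` goes straight to `BBBB` without touching the first record. Finding out which RRN holds the record you want is still the index problem the slide mentions.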
  • 16. 16 Building a File of Records Like Building a Record of Fields – Same problem, up one level  Fixed-length or specified-length records?  How to directly access records? – But wait – there’s more:  Want to require software to know as few details about file as possible  To do that, those details need to be stored with (in) the file – File header records  Store file-specific information at start of file  Header record format – Constant across all file types within one system – Why?
  • 17. 17 File Header Records Things a Header Record Might Contain – File structure  Type of record structure  Number of data records  Length of records (if fixed-length)  Record delimiter (if delimited) – Record structure (if records have consistent structure)  Number of fields  Length of each field or delimiter between each field  Format of each field  Key information – if needed – Primary key field – Secondary key field(s), if any – Date/time of most recent access – Date/time of most recent update
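A header record can be as small as two of the items listed on this slide. The following sketch assumes a hypothetical character-format layout: the record count and the record length, each stored as a 10-digit zero-padded decimal field, for a fixed 20-byte header. The struct and function names are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical header layout: two 10-digit decimal fields, 20 bytes total.
struct FileHeader {
    std::uint32_t record_count;    // number of data records
    std::uint32_t record_length;   // bytes per fixed-length record
};

const std::size_t kHeaderSize = 20;

// Build the 20 header bytes that would be written at the start of the file.
std::string pack_header(const FileHeader& h) {
    char buf[kHeaderSize + 1];
    std::snprintf(buf, sizeof buf, "%010lu%010lu",
                  static_cast<unsigned long>(h.record_count),
                  static_cast<unsigned long>(h.record_length));
    return std::string(buf, kHeaderSize);
}

// Recover the header fields from the first 20 bytes of the file.
FileHeader unpack_header(const std::string& bytes) {
    assert(bytes.size() >= kHeaderSize);
    FileHeader h;
    h.record_count = static_cast<std::uint32_t>(
        std::strtoul(bytes.substr(0, 10).c_str(), nullptr, 10));
    h.record_length = static_cast<std::uint32_t>(
        std::strtoul(bytes.substr(10, 10).c_str(), nullptr, 10));
    return h;
}
```

Because the header is plain text, a person can read it with `od -c` or an editor, at the cost of converting the digits on every open.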
  • 18. 18 File Header Records, continued Header Record Format – Binary or character?  Depends – is it important for people to read it? – Here’s a place where HTML-style format might work  Lets files of different formats have different headers (in some ways)  Only invokes that parse overhead once per file
  • 19. 19 What’s the Difference? File Organization – Format of the file itself  Fixed-length, specified-length, or delimited records  ASCII or binary character encoding File Access Method – Way(s) software can get at contents of file  Sequential vs. direct  Indexed sequential
  • 20. 20 Designing a File Access Affects Organization – If sequential access is all we need  Pretty much any organization is OK  Subject, of course, to application needs – If we need direct access  Need fixed-length records  Can also use indexed files, but that’s for later on But Organization Also Affects Access – What if data to be stored in a record is wildly variable?  Fixed-length records would be extremely wasteful  But if we use specified-length records, how to do direct access? – Just about have to use indexing then
  • 21. 21 Metadata Data About Data – Usually in the form of a file header – Example in text  Astronomy image storage format  HTML format (name = value)  But look on page 177: coding style makes a BIG difference – Parsing this kind of data  Read field name; read field value  Convert ASCII value to type required for storage & use  Store converted value into right variable – Why use this type of header?
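The three parsing steps on this slide (read name, read value, convert and store) might look like the sketch below. The field names used in the usage note, the one-field-per-line layout, and the helper names are assumptions for illustration; the textbook's actual header format is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Strip blanks, tabs, and carriage returns from both ends of a string.
static std::string trim(const std::string& s) {
    const std::size_t first = s.find_first_not_of(" \t\r");
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(" \t\r");
    return s.substr(first, last - first + 1);
}

// Read "name = value" lines into a table; lines without '=' are skipped.
std::map<std::string, std::string> parse_header(const std::string& text) {
    std::map<std::string, std::string> fields;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;
        fields[trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return fields;
}

// Convert the ASCII value of a required field to the int the program needs.
int int_field(const std::map<std::string, std::string>& fields,
              const std::string& name) {
    auto it = fields.find(name);
    assert(it != fields.end());   // caller asked for a field that must exist
    return std::stoi(it->second);
}
```

Given the text `width = 500` and `height = 375` on separate lines, `int_field(parse_header(text), "width")` yields 500. The parse cost is paid once per file, not once per record.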
  • 22. 22 More Metadata PC Graphics Storage Formats – Data  Color values for each pixel in image  Data compression often used (GIF, JPG)  Different color “depth” possibilities – Metadata  Height, width  Number of bits per pixel (color depth)  If not true color (24 bits / pixel) – Color look-up table • Normally 256 entries • Indexed by values stored for each pixel (normally 1 byte) • Contains R/G/B values for color combination – Formatted to be loaded directly into PC graphics RAM
  • 23. 23 Mixing Data Objects in a File Objective – Store different types of data in the same file – Textbook example – mix of astronomy data  “File” header (HTML-style)  “File” of notes – lines of ASCII text  “File” of image data – in whatever format – So our data file becomes a file of files  Each individual “file” (header, notes, or image) looks like a record in this new “mega-file”  These “mega-records” are of varying length  How do we store the “records” in the “mega-records”? – Could use another level of specified-length record software – Or, …
  • 24. 24 Our “Mega-File” Organization [Figure: the mega-file consists of a Mega-file Header followed by alternating Notes Sub-files and Image Sub-files. Each Notes Sub-file holds a Notes Header, a series of text lines, and a terminator line; each Image Sub-file holds an Image Header followed by Image Data.]
  • 25. 25 More on Our Mega-File Access – Can we just read it sequentially?  Why or why not?  What if we wanted to skip a notes sub-file?  What if some image didn’t even have a notes sub-file? – Can we access it directly?  What would the header have to include to allow that? – An index of the “records” in the file – We call the entries in that index “tags”  Each tag in the tag list has: – Type of sub-file referred to • Special-case type: end of file – RBA of sub-file in mega-file – Length of sub-file (not necessary, but helpful) – Key information, if any, for sub-file
  • 26. 26 More on Our Mega-File  Access, continued – So how do we access the mega-file now?  Read and process the header – Get whole-file information – Build in-memory tag table for sub-files  Sequential access – Same as before – May be able to program in some speed-ups from tag table  Direct access – Locate sub-file in tag table – Go right to it
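An in-memory tag table for direct access could be sketched as follows. The `Tag` fields follow the list on the previous slide (type, RBA, length, key); the type names, the function names, and the use of a `std::string` to stand in for the mega-file's bytes are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class SubFileType { Notes, Image, EndOfFile };

struct Tag {
    SubFileType type;     // kind of sub-file this tag refers to
    std::size_t rba;      // where the sub-file starts in the mega-file
    std::size_t length;   // how many bytes it occupies
    std::string key;      // key information for the sub-file, if any
};

// Locate the first sub-file of the given type and key in the tag table.
// Returns nullptr if no such tag precedes the end-of-file tag.
const Tag* find_tag(const std::vector<Tag>& table, SubFileType type,
                    const std::string& key) {
    for (const Tag& t : table) {
        if (t.type == SubFileType::EndOfFile) break;
        if (t.type == type && t.key == key) return &t;
    }
    return nullptr;
}

// Go right to the sub-file: copy its bytes out of the mega-file image.
std::string read_sub_file(const std::string& mega_file, const Tag& tag) {
    assert(tag.rba + tag.length <= mega_file.size());
    return mega_file.substr(tag.rba, tag.length);
}
```

The mega-file processor never interprets the bytes it returns; it hands them to whichever sub-file processor understands that type.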
  • 27. 27 Extensibility Look at Our “Mega-File” Format Again – Header tells us things about the sub-files:  What kinds of files they are  Where to find them – Files themselves  To the mega-file processor, just random bytes  To the sub-file processor, meaningful information What if we need a new type of sub-file? – Define a new type of header entry – Extend header processor to understand that entry – Write (or borrow or buy) code to handle new sub-file Cardinal Rule: – Everything changes – file types, data types, ...
  • 28. 28 Factors Affecting Portability - 1 Operating System Differences – Example – text lines  End with line-feed character  End with carriage-return and line-feed  Prefixed by a count of characters in the line Natural Language Differences – Example – character coding  Single-byte coding – ASCII, EBCDIC  Multi-byte coding – Unicode (UTF-16 uses 2 or 4 bytes per character; UTF-8 uses 1 to 4) Programming Language Differences – Pascal can’t directly process varying-length records – Different C++ compilers use different byte lengths for the standard data types
  • 29. 29 Factors Affecting Portability - 2 Computer Architecture Differences – Byte order in 16-bit and 32-bit integer values  Big-endian – leftmost byte is most significant  Little-endian – rightmost byte is most significant – Storage of data in memory  Some architectures require values that are N bytes long to start at a byte whose address is divisible by N  Example – the two bytes 0x15 0x32, stored in that order: big-endian interpretation is 0x1532; little-endian interpretation is 0x3215
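The two interpretations of the bytes 0x15 0x32 can be checked with a short sketch. The function names are hypothetical; the code assembles the value with shifts, so it gives the same answer regardless of the byte order of the machine it runs on.

```cpp
#include <cassert>
#include <cstdint>

// Interpret two stored bytes with the first (leftmost) byte most significant.
std::uint16_t from_big_endian(const unsigned char* bytes) {
    assert(bytes != nullptr);
    return static_cast<std::uint16_t>((bytes[0] << 8) | bytes[1]);
}

// Interpret two stored bytes with the last (rightmost) byte most significant.
std::uint16_t from_little_endian(const unsigned char* bytes) {
    assert(bytes != nullptr);
    return static_cast<std::uint16_t>((bytes[1] << 8) | bytes[0]);
}
```

For the stored bytes `{0x15, 0x32}`, `from_big_endian` returns 0x1532 and `from_little_endian` returns 0x3215, matching the slide.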
  • 30. 30 How to Port Files Define Your Format C*A*R*E*F*U*L*L*Y – Once a format is defined, never change it  If you need a new format, add it so as not to invalidate the existing formats  If you need to change a format, add a new one instead, and let programs that need the new version use it – Decide on a standard format for data elements  Text lines – ASCII, EBCDIC, or Unicode? – Which character(s) to end lines?  Binary – Tightly packed or multiple-of-N addressing? – Which “endian”? – You can always write code to convert to & from the standard format on a new language, computer, etc.
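Converting to and from a standard format can be sketched for one data element, a 32-bit unsigned integer. The choice of big-endian as the file's standard is an assumption made for this example, as are the names `encode_u32` and `decode_u32`; the point is that the file format fixes the byte order, whatever the host machine uses.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode as 4 bytes, most significant first (the standard assumed here).
std::string encode_u32(std::uint32_t v) {
    std::string out(4, '\0');
    for (int i = 0; i < 4; ++i) {
        out[i] = static_cast<char>((v >> (8 * (3 - i))) & 0xFF);
    }
    return out;
}

// Decode the first 4 bytes back into a host integer.
std::uint32_t decode_u32(const std::string& bytes) {
    assert(bytes.size() >= 4);
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i) {
        v = (v << 8) | static_cast<unsigned char>(bytes[i]);
    }
    return v;
}
```

A program on any architecture that writes with `encode_u32` and reads with `decode_u32` sees the same values, so only these two functions need porting, not the file.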
  • 31. 31 The Conversion Problem Few Environments – can do directly [Figure: IBM and VAX, each converting directly to the other] Many Env’ts. – need intermediate form [Figure: IBM, VAX, IA-32, IA-64, ... each converting to and from XML (or some other standard format)]