File Organization Techniques

Data organization is based on a hierarchy. If data is to be useful in data processing, it must be organized systematically. The hierarchy of data is described below:-

  1. A character is any single digit, letter or special symbol.
  2. A data field is an area that can hold one or more characters that together represent a specific data element (e.g. the name field, the quantity field).
  3. A data record consists of a group of related data fields (e.g. an employee’s record, a customer record, etc.).
  4. A data file is a collection of related data records maintained in some prearranged order.
  5. A database usually consists of several related or integrated data files.

A file consists of a number of records. Each record is made up of a number of fields, and each field consists of a number of characters. To produce useful information by means of computerized data processing, data must be organized in a systematic way. Methods of organizing data are referred to as data structures. The most important structure is a vertical hierarchy of data consisting of files, records, data items (or fields) and characters, with characters encoded in terms of bits. At the top of the hierarchy is the file. When all records of the same type are gathered into a single collection of information, the collection is referred to as a file (or data set). The records in a file, in turn, are groupings of related data items or fields, which hold specific data values.
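The hierarchy of file, record, field and character can be pictured with a small sketch. The Python below is illustrative only: the employee record, its field names and its values are assumptions made for the example.

```python
from dataclasses import dataclass, fields

@dataclass
class EmployeeRecord:      # a record: a group of related fields (hypothetical layout)
    emp_no: int            # a field that can serve as the key
    name: str              # a field: one or more characters
    dept: str

# A file: a collection of records of the same type.
employee_file = [
    EmployeeRecord(101, "Asha", "Sales"),
    EmployeeRecord(102, "Ravi", "Stores"),
]

# Walk down the hierarchy: file -> record -> field -> characters.
for record in employee_file:
    for f in fields(record):
        value = str(getattr(record, f.name))
        print(f.name, "=", value, "->", list(value))
```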

For ease of access or reference, each record is allocated a key field, that is, a field by which it is identified. Files are normally organized in key field sequence. A record may contain two or more keys. For example, an invoice identified by its invoice number may also be required by customer number. Here, the invoice number is referred to as the primary key and the customer number as the secondary key.
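A record that carries two keys can be reached through either of them. The sketch below is a minimal illustration under assumed data: the invoice numbers, customer numbers and amounts are invented, and it assumes each invoice number occurs once while one customer may have several invoices.

```python
invoices = [
    {"invoice_no": 5001, "customer_no": 17, "amount": 250.0},
    {"invoice_no": 5002, "customer_no": 42, "amount": 99.5},
    {"invoice_no": 5003, "customer_no": 17, "amount": 410.0},
]

# Key that is unique per record: each key value maps to exactly one record.
by_invoice = {rec["invoice_no"]: rec for rec in invoices}

# Key shared by several records: each key value maps to a list of records.
by_customer = {}
for rec in invoices:
    by_customer.setdefault(rec["customer_no"], []).append(rec)

print(by_invoice[5002]["amount"])                    # 99.5
print([r["invoice_no"] for r in by_customer[17]])    # [5001, 5003]
```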

Data organization may culminate in a data bank. A data bank is a collection of libraries of files. To clarify further: one line of an invoice may form a data item, a complete invoice may form a record, a complete set of such records may form a file, and a collection of files (say, for inventory control) may form a library. A collection of libraries, in turn, is referred to as a data bank.

Fixed-Length Records:-

A file is said to consist of fixed-length records when each record is the same length. This can be achieved in two ways: either all the records are identical in layout or, where the layouts differ, each record is padded so that its total length is equal to that of the longest. The first is common in most master files, while the second approach is often found in transaction files, where the data required for different transactions varies. Fixed-length records are usually easier to design and write programs for, but can be more wasteful of backing storage than variable-length records.
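As a rough illustration of fixed-length records, the sketch below packs records into a fixed 18-byte layout with Python's struct module. The layout (a 4-byte id, a 10-byte name and a 4-byte quantity) and the helper names are assumptions chosen for the example.

```python
import struct

# Assumed layout: 4-byte unsigned id, 10-byte name, 4-byte unsigned quantity.
RECORD = struct.Struct("<I10sI")

def pack_record(item_id, name, qty):
    # struct pads a short name with NUL bytes up to the full 10 bytes.
    return RECORD.pack(item_id, name.encode("ascii"), qty)

def unpack_record(raw):
    item_id, name, qty = RECORD.unpack(raw)
    return item_id, name.rstrip(b"\x00").decode("ascii"), qty

data = pack_record(1, "bolt", 500) + pack_record(2, "washer", 1200)

# Every record is RECORD.size bytes, so record n starts at byte n * RECORD.size.
n = 1
print(RECORD.size)                                                     # 18
print(unpack_record(data[n * RECORD.size:(n + 1) * RECORD.size]))      # (2, 'washer', 1200)
```

The padding bytes after "bolt" are the storage cost the text refers to: they occupy space but carry no data.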

Variable-Length Records:-

There are several types of variable-length record situations, for example:

  • a group of different types of records, each of fixed length but different lengths.
  • records with a fixed minimum length and a variable number of fixed-length items following the fixed portion.
  • records with a fixed minimum length and a fixed number of variable-length items following the fixed portion.
  • records with a fixed minimum length and a variable number of variable-length items following the fixed portion.
  • complete and random variability of length.

Often, it is possible to break down a variable-length record into a group of fixed-length records so that, with careful design, the advantages of both fixed and variable working can be achieved.
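One of the cases listed above, a fixed portion followed by a variable number of fixed-length items, can be sketched as follows. This is only one possible encoding: the header layout (invoice number plus item count), the item layout and the function names are all assumptions.

```python
import struct

HEADER = struct.Struct("<IH")   # assumed fixed portion: invoice number, item count
ITEM = struct.Struct("<IH")     # assumed fixed-length item: product code, quantity

def pack_invoice(invoice_no, items):
    body = b"".join(ITEM.pack(code, qty) for code, qty in items)
    return HEADER.pack(invoice_no, len(items)) + body

def read_invoices(buf):
    pos = 0
    while pos < len(buf):
        invoice_no, count = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        # The count in the fixed portion says how many items follow.
        items = [ITEM.unpack_from(buf, pos + i * ITEM.size) for i in range(count)]
        pos += count * ITEM.size
        yield invoice_no, items

buf = pack_invoice(5001, [(10, 2), (11, 1)]) + pack_invoice(5002, [(12, 7)])
for invoice in read_invoices(buf):
    print(invoice)
```

Because the item count is stored in the fixed portion, the reader can step from one record to the next even though the records differ in length.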

Given that a file consists, generally speaking, of a collection of records, a key element in file management is the way in which the records themselves are organized inside the file, since this heavily affects system performance as far as finding and accessing records is concerned. Note carefully that by “organization” we refer here to the logical arrangement of the records in the file (their ordering or, more generally, the presence of “closeness” relations between them based on their content), and not to the physical layout of the file as stored on a storage medium. Choosing a file organization is a design decision, hence it must be made with a view to achieving good performance with respect to the most likely usage of the file. The criteria usually considered important are:

  • Fast access to a single record or a collection of related records.
  • Easy record addition/update/removal, without disrupting fast access.
  • Storage efficiency.
  • Redundancy, as a safeguard against data corruption.

Needless to say, these requirements conflict with each other in all but the most trivial situations, and it is the designer’s job to find a good compromise among them, yielding an adequate solution to the problem at hand. For example, ease of adding records is not an issue when defining the data organization of a CD-ROM product, whereas fast access is, given the huge amount of data that this medium can store. However, as will become apparent shortly, fast access techniques are based on the use of additional information about the records, which in turn competes for space with the high volumes of data to be stored.

Logical data organization is the subject of whole shelves of books in the “Database” section of your library. Here we’ll briefly address some of the simpler techniques in common use, mainly because of their relevance to data management from the lower-level (with respect to a database’s) point of view of an OS. Six organization models will be considered:-

  1. Pile
  2. Sequential
  3. Indexed-sequential
  4. Indexed
  5. Hashed
  6. Direct

The term file organization refers to the relationship of the key of a record to the physical location of that record in the computer file. A distinction may be made at the outset between physical and logical files and records. A physical file is a physical unit, such as a magnetic tape or disc. A logical file, on the other hand, is a complete set of records for a specific application or purpose. It may occupy only part of a physical file or may extend over more than one physical file; e.g. an inventory master file, which is a logical file containing a record for each item in inventory, may require one or more reels of magnetic tape.

There are two objectives of a computer-based file organization:

  1. Ease of file creation and maintenance.
  2. Providing an efficient means of storing and retrieving information.

The four file organization methods that are commonly used in business data processing applications are:

  1. Sequential
  2. Indexed
  3. Indexed-sequential
  4. Direct or random access

The selection of a particular file organization depends upon:

  • The type of application.
  • The method of processing for updating files.
  • Size of the file.
  • File inquiry capabilities.
  • File volatility.
  • The response time.

Sequential File Organization:

This is the most common structure for large files that are typically processed in their entirety, and it is at the heart of the more complex schemes. In this scheme, all the records have the same size and the same field format, with the fields having fixed size as well. The records are sorted in the file according to the content of a field of a scalar type, called the “key”. The key must uniquely identify a record, hence different records have different keys. This organization is well suited for batch processing of the entire file without adding or deleting items: this kind of operation can take advantage of the fixed size of the records and of the file. Moreover, this organization is easily stored on both disk and tape. The key ordering, along with the fixed record size, makes this organization amenable to dichotomic (binary) search.
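To see why key ordering plus a fixed record size permits this kind of search, consider the sketch below. It is a simplified illustration: an in-memory io.BytesIO stands in for a disk file, and the 16-byte record layout and the names build_file and find are assumptions.

```python
import io
import struct

RECORD = struct.Struct("<I12s")   # assumed layout: 4-byte key, 12-byte payload

def build_file(rows):
    # Records are written in ascending key order.
    packed = (RECORD.pack(k, v.encode("ascii")) for k, v in sorted(rows))
    return io.BytesIO(b"".join(packed))

def find(f, key):
    lo, hi = 0, f.seek(0, io.SEEK_END) // RECORD.size   # number of records
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECORD.size)          # jump straight to record number mid
        k, payload = RECORD.unpack(f.read(RECORD.size))
        if k == key:
            return payload.rstrip(b"\x00").decode("ascii")
        if k < key:
            lo = mid + 1                   # discard the lower half
        else:
            hi = mid                       # discard the upper half
    return None

f = build_file([(31, "gamma"), (7, "alpha"), (23, "beta")])
print(find(f, 23))   # beta
print(find(f, 24))   # None
```

The seek to `mid * RECORD.size` only works because every record has the same length; on tape, where such jumps are impractical, the file would have to be read in sequence instead.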

However, adding and deleting records in this kind of file is a tricky process: the logical sequence of records typically matches their physical layout on the storage media, so as to ease file navigation, hence adding a record while maintaining the key order requires a reorganization of the whole file. The usual solution is to make use of a “log file” (also called a “transaction file”), structured as a pile, to record this kind of modification, and periodically perform a batch update on the master file.

In a sequential file, records are arranged in ascending or descending order, or in chronological order, of a key field, which may be numeric, alphabetic (such as a customer’s name), or both (as in a file of vehicle license numbers, which include both letters of the alphabet and numerals). Since the records are ordered by the key field, there is no storage location identification. Sequential file organization is particularly suited to applications such as payroll, in which the file is to be processed in its entirety, i.e. each and every record is processed in the same run.

To locate a record in the file, it is necessary to start at a given reference point (often the beginning of the file) and examine each record in sequence until the desired record is located. There are often gaps in the numbering of the records; for example, record number 23 may be followed by record number 31. Transactions affecting a sequential file are accumulated in batches, which are then used to update the file at periodic intervals.

Sequential files are normally created and maintained on magnetic tape (an exception is a card file). The vast majority are updated via batch processing, which is the most efficient method to use with a sequential file. All update data is sorted into key sequence prior to use. During batch processing, the update data and the sequential file data are read and processed alternately: after reading each incoming record, the computer determines which master record requires updating. Once all the changes have been made, the update is complete.
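The batch update described here can be sketched as a single merge pass over the old master file and the sorted update data, producing a new master. The sketch is deliberately simplified: in-memory lists stand in for tape files, and the operation codes "add", "change" and "delete" as well as the function name are assumptions.

```python
def batch_update(master, transactions):
    """Merge a sorted master file with a batch of transactions into a new master.

    master:       list of (key, value) pairs in ascending key order
    transactions: list of (key, op, value), op in {"add", "change", "delete"}
    """
    new_master = []
    txns = sorted(transactions, key=lambda t: t[0])   # sort before the run
    i = 0
    for key, value in master:
        # Additions whose keys fall before the current master record.
        while i < len(txns) and txns[i][0] < key:
            if txns[i][1] == "add":
                new_master.append((txns[i][0], txns[i][2]))
            i += 1
        # Changes and deletions that match the current master record.
        keep = True
        while i < len(txns) and txns[i][0] == key:
            if txns[i][1] == "change":
                value = txns[i][2]
            elif txns[i][1] == "delete":
                keep = False
            i += 1
        if keep:
            new_master.append((key, value))
    # Remaining additions belong after the last master record.
    new_master.extend((k, v) for k, op, v in txns[i:] if op == "add")
    return new_master

old = [(10, "A"), (20, "B"), (30, "C")]
txns = [(30, "delete", None), (15, "add", "X"), (20, "change", "B2"), (40, "add", "Y")]
print(batch_update(old, txns))   # [(10, 'A'), (15, 'X'), (20, 'B2'), (40, 'Y')]
```

Each file is read once from start to finish, which is why the method suits tape. It also shows the cost: every master record is copied even if only a few change.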

Sequential files can also be constructed on magnetic disk. However, when this is done, the direct access capabilities of the disk are not taken advantage of. With most tape units, it is impractical to write a new record back to the same position that the old one occupied; as a result, the master file is updated in the CPU and then written on to a new tape.

Advantages (characteristics of sequential file organization) include:

1. Simple-to-understand approach.

2. Easy to organize, maintain and understand.

3. Locating a record requires only the record key.

4. Efficient and economical if the activity rate, i.e. the proportion of file records to be processed, is high.

5. Relatively inexpensive I/O media and devices may be used.

6. Errors in files remain localized.

7. Files may be relatively easy to reconstruct since a good measure of built-in backup is usually available.

Drawbacks include:

1. Entire file must be processed even when the activity rate is very low.

2. Transactions must be sorted and placed in sequence prior to processing.

3. Timeliness of data in file deteriorates while the batches are being accumulated.

4. Data redundancy is typically high since the same data may be stored in several files sequenced on different keys.

5. Random inquiries are virtually impossible to handle.

Typical applications: payroll accounts, financial accounts, etc.

Indexed File Organization:

Why use only a single index, built on one key field of a data record? Indexes can be built for every field that uniquely identifies a record (or a set of records within a file) and whose type is amenable to ordering. Multiple indexes provide a high degree of flexibility for accessing the data via searches on various attributes; this organization also allows the use of variable-length records (containing different fields). Note that when multiple indexes are used, the concept of sequential ordering of the records within the file loses its meaning: each attribute (field) used to construct an index typically imposes an ordering of its own. For this very reason it is typically not possible to use the “sparse” (or “spaced”) type of indexing, in which only some evenly spaced records are indexed. Two types of index are usually found in applications: the exhaustive type, which contains an entry for each record in the main file, in the order given by the indexed key, and the partial type, which contains an entry only for those records that contain the chosen key field (for variable-length records only).
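The difference between an exhaustive and a partial index can be illustrated with a short sketch. The records, the field names and the helper build_index are assumptions; record positions in a list stand in for disk addresses.

```python
records = [
    {"id": 3, "name": "Mehta", "email": "mehta@example.com"},
    {"id": 1, "name": "Bose"},                        # this record has no email field
    {"id": 2, "name": "Arora", "email": "arora@example.com"},
]

def build_index(recs, field):
    # Sorted (key value, record position) pairs; records lacking the field are skipped.
    return sorted((r[field], pos) for pos, r in enumerate(recs) if field in r)

id_index = build_index(records, "id")          # exhaustive: one entry per record
email_index = build_index(records, "email")    # partial: only records with an email

print([pos for _, pos in id_index])                        # [1, 2, 0]
print([records[pos]["name"] for _, pos in email_index])    # ['Arora', 'Mehta']
```

Each index imposes its own ordering on the same unordered data file, which is why the records themselves need not be stored in any particular sequence.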

Indexed Sequential File Organization:

An index file can be used to overcome the problems of the plain sequential organization and to speed up key searches. The simplest indexing structure is the single-level one: a file whose records are key-pointer pairs, where the pointer is the position in the data file of the record with the given key. Only a subset of the data records, evenly spaced along the data file, is indexed, so as to mark intervals of data records. A key search then proceeds as follows:

The search key is compared with the index keys to find the highest index key that does not exceed it, and a linear search is performed onwards from the record that this index key points to, until the search key is matched or until the record pointed to by the next index entry is reached. In spite of the double file access (index + data) needed by this kind of search, the decrease in access time with respect to a sequential file is significant.

Consider, for example, a simple linear search on a file with 1,000 records. With the sequential organization, an average of 500 key comparisons is necessary (assuming the search keys are uniformly distributed among the data keys). Using an evenly spaced index with 100 entries, however, the search needs on average about 50 comparisons in the index file plus about 5 in the data file, since each index entry covers an interval of only 10 records: roughly a 9:1 reduction in the number of operations. This scheme can be extended hierarchically. An index is itself a sequential file, and can in turn be indexed by a second-level index, and so on, exploiting the hierarchical decomposition of the searches to decrease the access time further. If the layering of indexes is pushed too far, however, a point is reached where the advantages of indexing are outweighed by the increased storage costs and by the index access times.
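Estimates like these can be checked by counting comparisons directly. The sketch below is a rough simulation under assumed conditions: the keys are the integers 0 to 999, every key is searched for exactly once, and both the index and the data interval are scanned linearly. The function names are illustrative.

```python
N, STEP = 1000, 10                                  # 1,000 records, one index entry per 10
keys = list(range(N))                               # assumed: sorted keys 0..999
index = [(keys[i], i) for i in range(0, N, STEP)]   # 100 (key, position) pairs

def sequential_cost(target):
    # Comparisons needed by a plain linear scan of the data file.
    return keys.index(target) + 1

def indexed_cost(target):
    # Scan the index for the last entry whose key is <= target,
    # then scan the data file from the position that entry points to.
    comparisons, start = 0, 0
    for k, pos in index:
        comparisons += 1
        if k > target:
            break
        start = pos
    for k in keys[start:]:
        comparisons += 1
        if k == target:
            break
    return comparisons

avg_seq = sum(sequential_cost(t) for t in keys) / N
avg_idx = sum(indexed_cost(t) for t in keys) / N
print(avg_seq, avg_idx)
```

Changing STEP shows the trade-off between index size and search cost: a denser index shortens the data scan but lengthens the index scan.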

The sequential and the direct access files are often considered opposites of each other. The indexed sequential file is a synthesis of the two: its organization combines the positive aspects of both. In an indexed sequential file, records are stored sequentially, in key order, on a direct access storage device (DASD) such as a magnetic disk, and data is accessible either randomly or sequentially. Sequential access reads one record at a time until the desired item of data is found. This type of file organization is best suited for situations where both batch and on line processing are to be supported: the records are organized in sequence for the efficient processing of large batch jobs, but an index is also used to speed up access to individual records. During a periodic batch update, the direct access capability is not used; only the first record is accessed directly, and all other records are read in sequence in the same manner as records stored on magnetic tape. For inquiries, however, the indexes permit access to selected records without searching the entire file. This technique is known as the indexed sequential access method (ISAM).

Indexed sequential file organization is also well suited to situations where it is not known in advance whether a particular record exists. The index is scanned for the requisite key by a technique known as binary search, which narrows the scope of the search by looking at the middle entry of the index and discarding the half that cannot contain the key. Each subsequent step repeats this on the remaining entries until the desired entry is found or no entries remain. If the desired entry is not found in the index, no further search need be made and no time is wasted on the data file.
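A binary search over an index can be sketched with Python's standard bisect module. The index contents and the lookup function are assumptions for illustration; the index here holds one entry per record, so that a miss in the index means the record does not exist.

```python
from bisect import bisect_left

# Assumed index: sorted keys with the matching record positions.
index_keys = [104, 117, 123, 131, 150, 162]
positions = [0, 1, 2, 3, 4, 5]

def lookup(key):
    i = bisect_left(index_keys, key)       # binary search over the sorted keys
    if i < len(index_keys) and index_keys[i] == key:
        return positions[i]
    return None                            # key absent: no data-file access needed

print(lookup(131))   # 3
print(lookup(125))   # None
```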

Direct File Organization (also referred to as Random or Relative Organization):

Files in this type of organization are stored on a direct access storage device (DASD), such as a magnetic disk, using an identifying key. This key relates a record to its actual storage position in the file, so the computer can use the key to locate the desired record directly, without having to search through any other records first. For example, employee records can be accessed by using the employee number assigned to them. This kind of processing is known as random processing. It is used in on line systems where rapid response and fast updating are important. Compared with sequential file organization, a direct-access file is more useful where the majority of accesses are to individual records at unpredictable times; sequential files are more suitable when most or all of the records need to be processed as a group.

Although it is possible to use the storage location numbers in a DASD as the keys for the records stored in those locations, this is usually not done. Instead, an arithmetic procedure called a transform is used to convert the record key into a DASD storage location number. Transactions can be processed in any order and written to locations scattered throughout the stored file. To access a record, prior records need not be examined first: the CPU can go directly to the desired record without searching the others in the file. In other words, this scheme permits immediate access to records for inquiry and updating purposes. Direct access is used where file activity is low.
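A transform of this kind can be sketched with the division-remainder method, one common choice. Everything specific below is an assumption: the number of storage locations, the use of linear probing when two keys map to the same location (a situation the text above does not discuss), and the function names.

```python
SLOTS = 11                                 # assumed number of storage locations
table = [None] * SLOTS

def transform(key):
    # Division-remainder transform: record key -> storage location number.
    return key % SLOTS

def store(key, record):
    slot = transform(key)
    for step in range(SLOTS):
        probe = (slot + step) % SLOTS      # linear probing on collision (assumption)
        if table[probe] is None or table[probe][0] == key:
            table[probe] = (key, record)
            return probe
    raise RuntimeError("file is full")

def fetch(key):
    slot = transform(key)
    for step in range(SLOTS):
        probe = (slot + step) % SLOTS
        if table[probe] is None:
            return None                    # empty location: the key is not stored
        if table[probe][0] == key:
            return table[probe][1]
    return None

print(store(1234, "Asha"), store(1245, "Ravi"))   # 2 3
print(fetch(1245), fetch(9999))                   # Ravi None
```

Both keys transform to location 2, so the second record is placed in the next free location. Unused locations in the table also illustrate why direct files may use storage less efficiently than sequential ones.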

Advantages include:

1. Immediate access to records for updating purposes is possible.

2. Immediate updating of several files as a result of a single transaction is possible.

3. Transactions need not be sorted.

4. Different discs or disc units are not required for updating records as existing records may be amended by overwriting.

5. Random inquiries, which are very frequent in business situations, can be easily handled.

6. It is also possible to process direct file records sequentially in a record key sequence.

7. A direct file organization is most suitable for interactive on line applications such as airline or railway reservation systems, teller facility in banking applications, etc.

Disadvantages include:

1. Data may be accidentally erased or over-written unless special precautions are taken.

2. Records in the on line file may be exposed to the risks of loss of accuracy and a breach of security; therefore, special backup and reconstruction procedures must be established.

3. May be less efficient in the use of storage space than sequentially organized files.

4. Expensive hardware and software are required.

5. System design is complex and costly.

6. File updating is more difficult than with sequential files.

7. Special security measures are necessary for on line files that are accessible from several stations.
