Since a disk drive, or indeed any computer storage, can store only bits, the computer must have some way of converting information to 0s and 1s and vice-versa. There are different kinds of formats for different kinds of information. Within any format type, e.g., word processor documents, there will typically be several different formats. Sometimes these formats compete with each other.
Using file formats without a publicly available specification can be costly. Learning how the format works will require either reverse engineering it from a reference implementation or acquiring the specification document for a fee from the format developers. This second approach is possible only when there is a specification document, and typically requires the signing of a non-disclosure agreement. Both strategies require significant time, money, or both. Therefore, as a general rule, file formats with publicly available specifications are supported by a large number of programs, while non-public formats are supported by only a few programs.
Patent law, rather than copyright, is more often used to protect a file format. Although patents for file formats are not directly permitted under US law, some formats require the encoding of data with patented algorithms. For example, the GIF file format requires the use of a patented algorithm. The patent owner did not initially enforce the patent, but later began collecting fees for use of the algorithm. This resulted in a significant decrease in the use of GIFs and is partly responsible for the development of the alternative PNG format. The patent expired in the US in mid-2003 and worldwide in mid-2004. Algorithms are usually held not to be patentable under current European law, which also includes a provision that members "shall ensure that, wherever the use of a patented technique is needed for a significant purpose such as ensuring conversion of the conventions used in two different computer systems or networks so as to allow communication and exchange of data content between them, such use is not considered to be a patent infringement". This would apparently allow implementation of a patented file system where necessary to allow two different computers to interoperate.
Of course, most modern operating systems, and individual applications, need to use all of these approaches to process various files, at least to be able to read 'foreign' file formats, if not work with them completely.
One artifact of this approach is that the system can easily be tricked into treating a file as a different format simply by renaming it. An HTML file can, for instance, be treated as plain text by renaming it from filename.html to filename.txt. Although this strategy was useful to expert users who could easily understand and manipulate this information, it was frequently confusing to less technical users, who might accidentally make a file unusable (or 'lose' it) by renaming it incorrectly.
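The renaming effect can be illustrated with a minimal sketch. The extension table and the function name `type_from_name` are hypothetical and exist only for this example; a real system would consult a much larger registry.

```python
from pathlib import PurePath

# Hypothetical extension table, for illustration only.
EXTENSION_TYPES = {
    ".html": "HTML document",
    ".txt": "plain text",
    ".gif": "GIF image",
}

def type_from_name(filename: str) -> str:
    """Guess a file's format from its extension alone, ignoring its content."""
    suffix = PurePath(filename).suffix.lower()
    return EXTENSION_TYPES.get(suffix, "unknown")

# The same bytes on disk are classified differently after a rename.
print(type_from_name("filename.html"))
print(type_from_name("filename.txt"))
```

Because the lookup never opens the file, nothing stops the reported type from disagreeing with the actual content.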
This led more recent operating system shells, such as Windows 95 and Mac OS X, to hide the extension when displaying lists of recognized files. Hiding it separates the user from the complete filename and prevents accidental changes to a file's type, while expert users can retain the original behavior by enabling the display of file extensions.
A downside of hiding the extension is that it then becomes possible to have what appear to be two or more identical filenames in the same folder. This is especially likely when image files are needed in more than one format for different applications. For example, a company logo may be needed both in .tif format (for publishing) and .gif format (for web sites). With the extensions visible, these would appear as the unique filenames "CompanyLogo.tif" and "CompanyLogo.gif". With the extensions hidden, both would appear to have the identical filename "CompanyLogo", making it more difficult to determine which to select for a particular application.
A further downside is that hiding such information can become a security risk. On a system reliant on filename extensions, all usable files have such an extension (for example, all JPEG images have ".jpg" or ".jpeg" at the end of their name), so users are accustomed to seeing extensions and may depend on them to identify a file's format. With file extensions hidden, a malicious user can create an executable program with an innocent-looking name such as "Holiday photo.jpg.exe". The ".exe" is hidden, so a user sees the file as "Holiday photo.jpg", which appears to be a JPEG image that cannot harm the machine except through bugs in the application used to view it. The operating system, however, still sees the ".exe" extension and runs the program, which is then able to cause harm. To trick users further, it may be possible to store an icon inside the program, as on Microsoft Windows. The operating system's icon assignment can then be overridden with an icon commonly used to represent JPEG images, so that the program looks like an image, and appears to be named like one, until it is opened. Users with extensions hidden must therefore be vigilant and never open a file that seems to have a known extension displayed despite the hiding option being enabled, since it must have two extensions, the real one being unknown until hiding is disabled. In practice this is a problem on Windows systems, where extension hiding is turned on by default.
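The double-extension trick can be sketched as a simple check. The suffix lists and the function names below are assumptions made for illustration; they are not an exhaustive or authoritative security policy.

```python
from pathlib import PurePath

# Illustrative lists only (assumptions, not a complete policy).
EXECUTABLE_SUFFIXES = {".exe", ".com", ".bat", ".scr"}
INNOCENT_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".txt", ".pdf"}

def displayed_name(filename: str) -> str:
    """Return the name as a shell would show it with the final extension hidden."""
    return PurePath(filename).stem

def looks_disguised(filename: str) -> bool:
    """True if an executable extension is preceded by an innocent-looking one."""
    suffixes = [s.lower() for s in PurePath(filename).suffixes]
    return (
        len(suffixes) >= 2
        and suffixes[-1] in EXECUTABLE_SUFFIXES
        and suffixes[-2] in INNOCENT_SUFFIXES
    )

name = "Holiday photo.jpg.exe"
print(displayed_name(name))   # what the user sees when extensions are hidden
print(looks_disguised(name))  # whether the full name matches the trick
```

A check like this only catches the naming pattern; it says nothing about what the file actually contains.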
The magic number approach offers better guarantees that the format will be identified correctly, and can often determine more precise information about the file. Since reliable "magic number" tests can be fairly complex, and each file must effectively be tested against every possibility in the magic database, this approach is also relatively inefficient, especially for displaying large lists of files. In contrast, filename and metadata-based methods need to check only one piece of data and match it against a sorted index. Data must also be read from the file itself, which increases latency compared with metadata stored in the directory. Where file types do not lend themselves to recognition in this way, the system must fall back to metadata. It is, however, the best way for a program to check whether a file it has been told to process is of the correct format: while the file's name or metadata may be altered independently of its content, failing a well-designed magic number test is a strong sign that the file is either corrupt or of the wrong type.
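A minimal sketch of content-based detection follows, assuming a tiny hand-picked signature table. Real magic databases are far larger and often test more than the leading bytes; the names `MAGIC_SIGNATURES` and `sniff_format` are illustrative.

```python
from typing import Optional

# A tiny illustrative "magic database": leading bytes mapped to a format name.
MAGIC_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"PK\x03\x04", "ZIP archive"),
]

def sniff_format(data: bytes) -> Optional[str]:
    """Test the data against every signature in turn; None if nothing matches."""
    for signature, name in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return name
    return None

# The result depends only on content, so renaming the file cannot change it.
print(sniff_format(b"GIF89a" + bytes(10)))
print(sniff_format(b"<html><body>hello</body></html>"))
```

The linear scan over the table mirrors the cost noted above: every candidate signature may have to be tried for every file.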
So-called shebang lines in script files are a special case of magic numbers. Here, the magic number is human-readable text that identifies a specific command interpreter and options to be passed to the command interpreter.
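As a rough sketch, a shebang line can be recognized and split as shown below. Splitting the rest of the line on whitespace is a simplifying assumption; actual operating systems differ in how they pass options to the interpreter. The function name `parse_shebang` is hypothetical.

```python
from typing import List, Optional

def parse_shebang(first_line: bytes) -> Optional[List[str]]:
    """Return [interpreter, option, ...] if the line starts with '#!', else None."""
    if not first_line.startswith(b"#!"):
        return None
    # Simplification: treat whitespace-separated words as interpreter and options.
    parts = first_line[2:].decode("utf-8", errors="replace").split()
    return parts or None

print(parse_shebang(b"#!/bin/sh -e\n"))
print(parse_shebang(b"echo hello\n"))
```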
This approach keeps the metadata separate from both the main data and the name, but is also less portable than either file extensions or "magic numbers", since the format has to be converted from filesystem to filesystem. While this is also true to an extent with filename extensions (for instance, for compatibility with MS-DOS's three-character limit), most forms of storage have a roughly equivalent definition of a file's data and name, but may have varying or no representation of further metadata.
Archive files, such as zip files, solve the problem of handling metadata. A utility program collects multiple files, together with metadata about each file and the folders or directories they came from, into one new file (e.g. a zip file with the extension .zip). The new file is also compressed and possibly encrypted, and can be transmitted as a single file across operating systems by FTP or as an email attachment. At the destination it must be unzipped by a compatible utility to be useful, but the problems of transmission are solved this way.
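The idea can be sketched with Python's standard `zipfile` module, which stores each member's path, size, and timestamp alongside its data. The helper names `pack` and `list_metadata` are hypothetical, and the sketch works in memory rather than on disk.

```python
import io
import zipfile

def pack(files: dict) -> bytes:
    """Bundle {path: content} into one compressed archive held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path, content in files.items():
            archive.writestr(path, content)
    return buffer.getvalue()

def list_metadata(blob: bytes) -> list:
    """Read back each member's stored path, uncompressed size, and timestamp."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return [
            (info.filename, info.file_size, info.date_time)
            for info in archive.infolist()
        ]

blob = pack({"docs/readme.txt": b"hello", "images/logo.gif": b"GIF89a"})
for entry in list_metadata(blob):
    print(entry)
```

The resulting `blob` is a single byte string, so it can be transferred like any other file while the per-file metadata travels inside it.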
A UTI is a Core Foundation string written in reverse-DNS style. Common or standard types use the public domain (e.g. public.png for a Portable Network Graphics image), while other domains can be used for third-party types (e.g. com.adobe.pdf for Portable Document Format). UTIs can be defined within a hierarchical structure, known as a conformance hierarchy. Thus, public.png conforms to a supertype of public.image, which itself conforms to a supertype of public.data. A UTI can exist in multiple hierarchies, which provides great flexibility.
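A conformance check over such a hierarchy can be sketched as a walk up the supertype links. The table below is a hand-written slice containing only the relationships named above; it does not use the system's type database, and `conforms_to` is a hypothetical helper rather than Apple's API.

```python
# Hand-written slice of a conformance hierarchy (illustrative only).
# Each type maps to a list of supertypes, so a type may have several parents.
CONFORMS_TO = {
    "public.png": ["public.image"],
    "public.image": ["public.data"],
    "public.data": [],
}

def conforms_to(uti: str, supertype: str) -> bool:
    """True if `uti` equals `supertype` or reaches it through supertype links."""
    if uti == supertype:
        return True
    return any(
        conforms_to(parent, supertype) for parent in CONFORMS_TO.get(uti, [])
    )

print(conforms_to("public.png", "public.data"))
print(conforms_to("public.data", "public.png"))
```

Storing a list of supertypes per type is what lets one identifier belong to more than one hierarchy.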
In addition to file formats, UTIs can also be used for other entities which can exist in the OS X file system, including:
The NTFS filesystem also allows OS/2 extended attributes to be stored as one of a file's forks, but this feature is present merely to support the OS/2 subsystem (not present in XP), so the Win32 subsystem treats this information as an opaque block of data and does not use it. Instead, it relies on other file forks to store meta-information in Win32-specific formats. OS/2 extended attributes can still be read and written by Win32 programs, but the data must be parsed entirely by applications.
MIME types have problems, however: several organisations and people have created their own MIME types without registering them properly with IANA, which makes the use of this standard awkward in some cases.
This has several drawbacks. Unless the memory images also have reserved spaces for future extensions, extending and improving this type of structured file is very difficult. It also creates files that might be specific to one platform or programming language (for example a structure containing a Pascal string is not recognized as such in C). On the other hand, developing tools for reading and writing these types of files is very simple.
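The trade-off can be sketched with Python's `struct` module, which writes a fixed binary layout much like a raw memory image. The record layout here (two 16-bit integers, a 16-byte Pascal-style name, and 8 reserved bytes) is entirely hypothetical and chosen only to illustrate the points above.

```python
import struct

# Hypothetical fixed layout, little-endian with no alignment padding:
#   H    width  (unsigned 16-bit)
#   H    height (unsigned 16-bit)
#   16p  Pascal-style string: one length byte followed by up to 15 bytes of text
#   8x   reserved bytes, written as zeros, kept for future extensions
RECORD = struct.Struct("<HH16p8x")

def write_record(width: int, height: int, name: bytes) -> bytes:
    """Serialize the record exactly as it would be laid out in memory."""
    return RECORD.pack(width, height, name)

def read_record(blob: bytes) -> tuple:
    """Recover (width, height, name); the reserved bytes are skipped."""
    return RECORD.unpack(blob)

blob = write_record(640, 480, b"logo")
print(len(blob))
print(read_record(blob))
```

Reading and writing take one call each, which is the simplicity noted above. A reader that assumed a C-style, zero-terminated string at the same offset would misinterpret the leading length byte, and any new field must fit within the reserved bytes or break existing readers.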
The limitations of the unstructured formats led to the development of other types of file formats that could be easily extended and be backward compatible at the same time.
With this type of file structure, tools that do not know certain chunk identifiers simply skip those that they do not understand.
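A simplified sketch of this skipping behavior is shown below. It assumes each chunk is a 4-byte identifier, a 32-bit big-endian payload length, and the payload itself; real chunk formats add details such as padding or checksums that are omitted here. The identifiers and function names are invented for the example.

```python
import struct

# Assumed chunk header: 4-byte identifier plus 32-bit big-endian payload length.
HEADER = struct.Struct(">4sI")

def make_chunk(chunk_id: bytes, payload: bytes) -> bytes:
    """Build one chunk: header followed by the payload bytes."""
    return HEADER.pack(chunk_id, len(payload)) + payload

def read_known_chunks(blob: bytes, known: set) -> dict:
    """Collect payloads of recognized chunks and skip over all others."""
    found = {}
    offset = 0
    while offset + HEADER.size <= len(blob):
        chunk_id, length = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        if chunk_id in known:
            found[chunk_id] = blob[offset:offset + length]
        # The stored length lets the reader jump past a payload it cannot interpret.
        offset += length
    return found

data = (
    make_chunk(b"NAME", b"logo")
    + make_chunk(b"XTRA", b"added by a newer tool")
    + make_chunk(b"SIZE", b"\x02\x80")
)
print(read_known_chunks(data, {b"NAME", b"SIZE"}))
```

An older reader that only knows `NAME` and `SIZE` still parses the file correctly, which is what makes this structure extensible and backward compatible.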
This concept has been reused repeatedly, by RIFF (the Microsoft-IBM equivalent of IFF), PNG, JPEG storage, DER (Distinguished Encoding Rules) encoded streams and files, and the Structured Data Exchange Format (SDXF). Even XML can be considered a kind of chunk-based format, since each data element is surrounded by tags which are akin to chunk identifiers.