Text file
Wikipedia, the free encyclopedia
A text file (sometimes spelled "textfile") is a kind of computer file in a computer file system. At the most generic level of description, there are two kinds of computer files: text files and binary files. This broad two-way distinction is widely recognized and applied in computing, even though it can be misleading and is subject to differing interpretations.
Composition
Text files are sequences of readable characters such as letters, digits, punctuation, or whitespace, together with control characters such as section boundaries, rendering instructions for different languages, line feeds, and carriage returns. Embedded information such as font information, hyperlinks, or inline images does not appear in text files, though references to it can be included (for example as HTML elements or metadata). This simplicity allows a wide variety of programs to display their contents.
Encoding
A text file contains members of a character encoding set, or code page. Early sets include ASCII, developed by a committee of the American Standards Association (the predecessor of the American National Standards Institute, ANSI), and EBCDIC, developed by IBM. Though these are still used today, other sets have been derived from or created in competition with them, including ISO 8859, EUC, various code pages for Microsoft Windows, a special Mac-Roman encoding for Mac OS, and Unicode encoding schemes (common on many platforms) such as UTF-8 or UTF-16.
"Plain text" and "plaintext"
The terms "text file" and "plain text" are often confused with each other. "Text file" refers to a type of container, while "plain text" refers to a type of content. Text files can contain plain text, but they are not limited to it.
Further confusing the matter is the term "plaintext", which refers to unencoded content. Text files can contain plaintext, code, or both. Text files themselves have plaintext (character) and encoded (binary) renderings. Furthermore, plaintext may or may not contain plain text.
Rendering
Text editors
When a file is opened using the correct code page, a text editor presents human-readable content to the user. This often consists of the file's plain text. Depending on the application, control codes may be rendered either as literal instructions acted upon by the editor, or as visible escape characters that can be edited as plain text. Even when a text file contains plain text, control characters within the file (especially the end-of-file character) can prevent some of that text from being displayed by a particular program.
Hexadecimal editors
Hexadecimal editors allow the individual bytes of a text file to be directly manipulated. Typically, a hex editor presents both the hexadecimal and character representation for every byte in the file. Control characters are displayed rather than acted upon.
Parsers
Text files of certain kinds can be given as input to a parser, which interprets specific sequences of characters as commands or values. The output is either an application, intelligible content usable by a human or another application, or some mixture of the two. For instance, an XML document will produce content with contextual information that can be further interpreted by a program; a Rich Text Format document will present formatted text; a file containing source code can be compiled and executed by its corresponding runtime environment.
Data storage
Although text files are often meant for humans to read, they are also commonly used for data storage by computer programs. Text files have some advantages even for data storage because they avoid certain problems with other file formats, such as endianness, padding bytes, or differences in the number of bytes in a machine word. Further, when data corruption occurs in a file used for data storage, it is far easier for a human to fix if it is a text file. As a bonus, it may be easier for the program to recover from the error, because text files are fairly verbose, while binary files are usually compact. Text files have a low entropy rate: damaging a given amount of a text file destroys little information, while damaging the same amount of a binary file destroys more.
A large drawback of plain text files is that there is no way for a program to reliably determine what encoding is used. A text editor may save its text file in UTF-8, but a compiler might expect its input in ISO 8859. Trying to compile the UTF-8 text file would cause confusion and errors. Some text formats (such as XML) have an in-band mechanism for specifying the encoding of the document, but most text files have no such mechanism. Some programs go to great lengths to "guess" the encoding by looking for patterns in the text file, but this guessing procedure is very difficult to specify correctly for all cases (see AI-complete).
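The encoding ambiguity described above can be illustrated with a minimal Python sketch. The sample word and the helper name `guess_decode` are assumptions made for illustration; the helper simply tries candidate encodings in order, which is a heuristic and not a reliable detector.

```python
# The same bytes decode to different text under different assumed encodings.
data = "caf\u00e9".encode("utf-8")  # "café" stored as UTF-8: b'caf\xc3\xa9'

as_utf8 = data.decode("utf-8")         # the text the author intended
as_latin1 = data.decode("iso-8859-1")  # every byte is valid Latin-1, so no error is raised


def guess_decode(raw, candidates=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes without error.

    Hypothetical helper: a clean decode does not prove the guess is right,
    which is exactly why encoding detection is hard to specify.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")
```

Because ISO 8859-1 assigns a character to every byte value, it never fails to decode, so it only makes sense as the last candidate in such a list.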
Formats
MIME
Text files usually have the MIME type "text/plain", often with additional information indicating an encoding. Prior to the advent of Mac OS X, the Mac OS system regarded the content of a file (the data fork) as a text file when its resource fork indicated that the type of the file was "TEXT". Under the Windows operating system, a file is regarded as a text file if the suffix of the name of the file (the "extension") is "txt". However, many other suffixes are used for text files with specific purposes. For example, source code for computer programs is usually kept in text files that have file name suffixes indicating the programming language in which the source is written.
ASCII
The ASCII standard allows ASCII-only text files (unlike most other file types) to be freely interchanged and readable on Unix, Macintosh, Microsoft Windows, DOS, and other systems. These systems differ in their preferred line ending convention and their interpretation of values outside the ASCII range (their character encoding).
.txt
.txt is a filename extension for files consisting of text with very little formatting (for example, no bold or italics). A file in this format is also called a plain text file to differentiate it from binary files, which, at the time the distinction was made, were not supposed to contain human-readable text. The precise definition of the .txt format is not specified, but it typically matches the format accepted by the system terminal or a simple text editor. Files with the .txt extension can easily be read or opened by any program that reads text and, for that reason, are considered universal (or platform independent).
Plain text versus .txt
Not all systems use the .txt extension when creating plain text files. In particular, on Unix systems, where extensions are entirely optional, it is common to see text files with no extension at all, the most prominent example being the README file present in many software packages. However, there is no difference between a plain text file with no extension and a .txt file. The term "plain text" describes the contents of the file, while the term ".txt" describes the file metadata (i.e. the extension).
Plain text variations
Since plain text is not a formally defined standard, the definition of the format of a plain text file is rather loose. The principal differences are in character sets and character encodings, and in conventions about the semantics of formatting characters.
The ASCII character set is the most common format for English-language text files, and is generally assumed to be the default file format in many situations. For accented and other non-ASCII characters, it is necessary to choose a character encoding. In many systems, this is chosen on the basis of the default locale setting on the computer the file is read on. Common character encodings include ISO 8859-1 for many European languages.
Because many encodings have only a limited repertoire of characters, they are often only usable to represent text in a limited subset of human languages. Unicode is an attempt to create a common standard for representing all known languages, and most known character sets are subsets of the very large Unicode character set. Although there are multiple character encodings available for Unicode, the most common is UTF-8, which has the advantage of being backwards-compatible with ASCII: that is, every ASCII text file is also a UTF-8 text file with identical meaning.
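The backwards compatibility of UTF-8 with ASCII can be checked directly. The following is a small Python sketch; the sample strings are arbitrary and chosen only for illustration.

```python
# An ASCII-only string produces the same bytes in ASCII and in UTF-8.
ascii_text = "Hello, world"
ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# A non-ASCII character is where the encodings diverge.
accented = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
latin1_len = len(accented.encode("iso-8859-1"))  # one byte in ISO 8859-1
utf8_len = len(accented.encode("utf-8"))         # two bytes in UTF-8
```

The reverse does not hold: a UTF-8 file containing non-ASCII characters is not a valid ASCII file.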
Formatting characters
The character that marks the end of a line depends on the platform. On an old Macintosh, the newline is ASCII character number 13 (carriage return). On Unix, it is ASCII character number 10 (line feed). IBM mainframes use EBCDIC rather than ASCII, and there the next-line character is hexadecimal 15 (decimal 21).
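As a rough illustration of these conventions, the Python sketch below identifies which ASCII-based line ending a byte string uses and converts it to the Unix form. The function names are hypothetical, and EBCDIC input is deliberately not handled.

```python
def detect_newline(raw: bytes) -> str:
    """Name the first line-ending convention found in ASCII-compatible bytes."""
    for i, byte in enumerate(raw):
        if byte == 13:  # carriage return
            # CR immediately followed by LF is the two-character CR-LF pair.
            if raw[i + 1 : i + 2] == b"\n":
                return "CRLF"
            return "CR"
        if byte == 10:  # line feed
            return "LF"
    return "none"


def normalize_newlines(raw: bytes) -> bytes:
    """Convert CR-LF and lone CR line endings to LF."""
    # Replace CR-LF first so the pair is not turned into two line feeds.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Only the first line ending is inspected, so a file that mixes conventions would be reported by whichever one appears first.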
Standard Windows .txt files
Microsoft MS-DOS and Windows use a common text file format, with each line of text separated by a two character combination: CR and LF, which have ASCII codes 13 and 10. It is common for the last line of text not to be terminated with a CR-LF marker, and many text editors (including Notepad) do not automatically insert one on the last line.
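Because the final line may or may not end with a CR-LF marker, code that reads such files usually has to accept both forms. A minimal Python sketch, with hypothetical function names, might look like this:

```python
def split_crlf_lines(raw: bytes) -> list:
    """Split CR-LF separated text into lines, tolerating a missing final CR-LF."""
    lines = raw.split(b"\r\n")
    if lines and lines[-1] == b"":
        lines.pop()  # the data ended with CR-LF (or was empty); drop the empty tail
    return lines


def join_crlf_lines(lines, terminate_last=False) -> bytes:
    """Join lines with CR-LF; optionally terminate the last line as well."""
    body = b"\r\n".join(lines)
    return body + b"\r\n" if terminate_last and lines else body
```

With this approach `b"one\r\ntwo"` and `b"one\r\ntwo\r\n"` both yield the same two lines, at the cost of not distinguishing a trailing empty line from a terminated last line.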
Most Windows text files use a form of ANSI, OEM or Unicode encoding. What Windows terminology calls "ANSI encodings" are usually single-byte encodings closely related to ISO 8859 (for example, Windows-1252 extends ISO 8859-1), except in locales such as Chinese, Japanese and Korean that require double-byte character sets. ANSI encodings were traditionally used as default system locales within Windows, before the transition to Unicode. By contrast, OEM encodings, also known as MS-DOS code pages, were defined by IBM for use in the original IBM PC text mode display system. They typically include graphical and line-drawing characters common in full-screen MS-DOS applications. Newer Windows text files may use a Unicode encoding such as UTF-16LE or UTF-8.
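The difference between these families can be seen by encoding one character several ways. In the Python sketch below, code page 1252 stands in for an "ANSI" encoding and code page 437 for an OEM encoding; both choices are assumptions for illustration, since the actual code pages depend on the system locale.

```python
accented = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

ansi_bytes = accented.encode("cp1252")      # b'\xe9'     (Windows-1252, an "ANSI" code page)
oem_bytes = accented.encode("cp437")        # b'\x82'     (original IBM PC code page)
utf16_bytes = accented.encode("utf-16-le")  # b'\xe9\x00' (two bytes, little-endian)
utf8_bytes = accented.encode("utf-8")       # b'\xc3\xa9'

# The same three bytes read as line-drawing characters under the OEM code page
# and as accented letters and punctuation under the ANSI code page.
box_top = b"\xc9\xcd\xbb"
as_oem = box_top.decode("cp437")    # box-drawing characters
as_ansi = box_top.decode("cp1252")  # accented letters and a guillemet
```

This is why a file written by a full-screen MS-DOS application can look garbled when opened in a Windows editor that assumes an ANSI code page.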
See also
- List of file formats
- File extensions
- ASCII
- EBCDIC
- Newline
- Text editor
- Unicode
- Plain text
- Binary file
- textfiles.com
Wikipedia, the free encyclopedia © 2001-2006 Wikipedia contributors (Disclaimer)
This article is licensed under the GNU Free Documentation License.
Last updated on Wednesday March 05, 2008 at 12:10:08 PST (GMT -0800)