SMPTE defines content as metadata plus essence. Essence, in the digital cinema application, is the term applied to a single form of expression, such as picture, or sound, or subtitles. Essence types are singular in nature, i.e., only a 24 fps picture file, only a 48 fps picture file, only a 3D picture file, only a 5.1 sound track, only a 7.1 sound track, and so on. Using these definitions, a Track File carries a single essence type plus the necessary metadata to facilitate its use.
The independence of essence types in the Composition provides a high degree of extensibility, allowing new types of essence to be introduced in the future without breaking the structure of the Composition. For example, when the concepts of the Composition and the Digital Cinema Package were first introduced in digital cinema, stereoscopic 3D was not on the roadmap. But the extensibility of the Composition allowed the Stereoscopic Picture Track File to be quickly incorporated as digital 3D emerged.
MXF Picture & Sound
Track Files are wrapped per a constrained version of the Material Exchange Format (MXF) specification. MXF provides a structured method for carrying a variety of essence types with metadata. While MXF is capable of carrying more than one essence type in a single file, the digital cinema application permits only one essence type per file. More can be learned about MXF on Wikipedia. The constraints applied to MXF for wrapping digital cinema Picture and Sound are defined in SMPTE ST 429-3 Sound and Picture Track File.
MXF track files consist of a header, an essence container, and a footer. The header carries metadata that describes the track file. The essence container carries, of course, the essence. The footer carries an index table of the essence.
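The three-part layout described above can be pictured with a small conceptual model. This is only an illustrative sketch, not an MXF implementation: the class name `TrackFileModel`, its fields, and the plain byte-offset index are all assumptions made for the example, and a real MXF index table carries more information than a list of offsets.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrackFileModel:
    """Hypothetical, simplified picture of a track file (not real MXF)."""

    # A track file carries exactly one essence type, e.g. "picture" or "sound".
    essence_type: str
    # Header: metadata describing the track file (keys here are made up).
    header_metadata: Dict[str, str] = field(default_factory=dict)
    # Essence container: one entry per frame of essence.
    frames: List[bytes] = field(default_factory=list)

    def add_frame(self, frame: bytes) -> None:
        """Append one frame of essence to the container."""
        self.frames.append(frame)

    def index_table(self) -> List[int]:
        """Footer: byte offset of each frame within the essence container.

        Offsets ignore any wrapping overhead, which is a simplification.
        """
        offsets = []
        position = 0
        for frame in self.frames:
            offsets.append(position)
            position += len(frame)
        return offsets
```

The point of the index is random access: a player that wants frame n can look up its offset in the footer rather than scanning the essence container from the start.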
Picture and Sound essence is frame-wrapped using KLV (Key-Length-Value) coding. The Key identifies the nature of the essence present, the Length gives the size of the Value field, and the Value field itself contains one frame of essence. More can be learned about KLV on Wikipedia.
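To make the Key-Length-Value layout concrete, here is a minimal sketch of reading KLV triplets from a byte stream. It assumes the usual MXF conventions of a 16-byte key and a BER-encoded length; the function names and the demo key are invented for illustration, and the sketch does not interpret keys or handle the rest of the MXF file structure.

```python
import io
from typing import BinaryIO, Iterator, Tuple

KEY_SIZE = 16  # assumption: 16-byte keys, as used by MXF


def read_ber_length(stream: BinaryIO) -> int:
    """Decode a BER length: short form (< 0x80) or long form (0x80 | n, then n bytes)."""
    first = stream.read(1)
    if not first:
        raise EOFError("stream ended while reading length")
    byte = first[0]
    if byte < 0x80:
        return byte  # short form: the byte itself is the length
    count = byte & 0x7F
    if count == 0:
        raise ValueError("indefinite BER length is not supported in this sketch")
    raw = stream.read(count)
    if len(raw) != count:
        raise EOFError("stream ended inside long-form length")
    return int.from_bytes(raw, "big")


def iter_klv(stream: BinaryIO) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (key, value) pairs until the stream is exhausted."""
    while True:
        key = stream.read(KEY_SIZE)
        if not key:
            return  # clean end of stream
        if len(key) != KEY_SIZE:
            raise EOFError("truncated key")
        length = read_ber_length(stream)
        value = stream.read(length)
        if len(value) != length:
            raise EOFError("truncated value")
        yield key, value


def encode_klv(key: bytes, value: bytes) -> bytes:
    """Build one triplet using a long-form length with 3 length bytes (demo choice)."""
    if len(key) != KEY_SIZE:
        raise ValueError("key must be 16 bytes")
    return key + b"\x83" + len(value).to_bytes(3, "big") + value


if __name__ == "__main__":
    demo_key = bytes(range(16))  # hypothetical key, not a registered label
    data = encode_klv(demo_key, b"frame-1") + encode_klv(demo_key, b"frame-22")
    for key, value in iter_klv(io.BytesIO(data)):
        print(key.hex(), len(value))
```

Because each triplet states its own length, a reader that does not recognize a key can skip the value and continue with the next triplet.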
XML & Timed-Text
Timed Text Track Files, such as open or closed Subtitles and Captions, are defined in XML and then wrapped in MXF. Timed Text XML is wrapped reel by reel, much as Picture and Sound are, except that font resources may also be included in the wrap. The constraints applied to MXF for wrapping digital cinema Timed Text are defined in SMPTE ST 429-5 Timed Text Track File.