What Is MP4? A Practical MP4 Box Deep Dive

A practical walkthrough of MP4 box structure with concrete byte-level examples, from ftyp/moov/mdat to sample tables.

When people say “MP4 file,” they often mix up three different layers:

  1. Container (MP4)
  2. Codec (H.264, H.265, AAC, etc.)
  3. Bitstream payload (actual encoded samples)

MP4 is a container format based on ISO Base Media File Format (ISO/IEC 14496-12). Internally, it is organized as nested boxes (also called atoms).

The shortest useful mental model

An MP4 file is a tree of boxes:

  • Each box has a size and type.
  • Some boxes only store metadata.
  • Some boxes store media payload.
  • Player startup speed depends heavily on where metadata boxes are placed.

Common top-level layout:

[ftyp][moov][mdat]
  • ftyp: file type and compatibility brands
  • moov: movie metadata (tracks, timing, sample tables)
  • mdat: raw media data (encoded samples)

Box header format

A standard box starts with:

  • 4 bytes: size (big-endian)
  • 4 bytes: type (ASCII)

If size == 1, a 64-bit extended size follows the type field. A size of 0 means the box extends to the end of the file (only valid for the last top-level box).
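The header rules above can be sketched in a few lines of Python. This is a minimal illustration, not a library API; the function name is my own.

```python
import struct

def read_box_header(buf: bytes, offset: int = 0):
    """Read one box header at `offset`.

    Returns (size, box_type, header_len). A 32-bit size of 1 signals
    that a 64-bit extended size follows the 4-byte type field.
    """
    size, = struct.unpack_from(">I", buf, offset)          # 4-byte big-endian size
    box_type = buf[offset + 4:offset + 8].decode("ascii")  # 4-byte ASCII type
    header_len = 8
    if size == 1:  # extended size: next 8 bytes hold the real 64-bit size
        size, = struct.unpack_from(">Q", buf, offset + 8)
        header_len = 16
    return size, box_type, header_len
```

For a normal box, `read_box_header(b"\x00\x00\x00\x20ftyp...")` yields `(32, "ftyp", 8)`; for an extended-size box the header length becomes 16.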

Example 1: reading ftyp from bytes

Hex bytes:

00 00 00 20 66 74 79 70 69 73 6F 6D 00 00 02 00
69 73 6F 6D 69 73 6F 32 61 76 63 31 6D 70 34 31

Parse:

  • 00 00 00 20 => size = 0x20 = 32 bytes
  • 66 74 79 70 => type = ftyp
  • 69 73 6F 6D => major brand = isom
  • 00 00 02 00 => minor version = 512
  • remaining brands: isom, iso2, avc1, mp41

This tells us the file claims compatibility with ISO BMFF profiles including AVC-oriented usage.
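The manual parse above can be reproduced in code. A sketch in Python, using the exact bytes from Example 1 (the function name is my own):

```python
import struct

# The 32 hex bytes from Example 1
FTYP_HEX = (
    "00000020 66747970 69736f6d 00000200"
    "69736f6d 69736f32 61766331 6d703431"
)

def parse_ftyp(buf: bytes):
    """Parse an ftyp box: major brand, minor version, compatible brands."""
    size, = struct.unpack_from(">I", buf, 0)
    assert buf[4:8] == b"ftyp"
    major = buf[8:12].decode("ascii")
    minor, = struct.unpack_from(">I", buf, 12)
    # Remaining 4-byte entries up to `size` are compatible brands
    brands = [buf[i:i + 4].decode("ascii") for i in range(16, size, 4)]
    return major, minor, brands

print(parse_ftyp(bytes.fromhex(FTYP_HEX)))
# ('isom', 512, ['isom', 'iso2', 'avc1', 'mp41'])
```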

Why moov matters for startup latency

A player usually needs moov before it can resolve the timeline-to-sample mapping and start decoding.

If moov is at file tail ([ftyp][mdat][moov]), startup over HTTP range requests is slower. If moov is at file head ([ftyp][moov][mdat]), startup is typically faster.

This is why pipelines often run “faststart” post-processing.
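A quick way to check whether a file needs this treatment is to scan its top-level box order. A rough sketch, assuming the whole file fits in a buffer (function names are my own):

```python
import struct

def top_level_layout(buf: bytes):
    """Return the ordered list of top-level box types in an MP4 buffer."""
    types, offset = [], 0
    while offset + 8 <= len(buf):
        size, = struct.unpack_from(">I", buf, offset)
        box_type = buf[offset + 4:offset + 8].decode("ascii", "replace")
        if size == 1:    # 64-bit extended size follows the type field
            size, = struct.unpack_from(">Q", buf, offset + 8)
        elif size == 0:  # box runs to the end of the file
            size = len(buf) - offset
        if size < 8:
            break        # malformed size; stop rather than loop forever
        types.append(box_type)
        offset += size
    return types

def needs_faststart(buf: bytes) -> bool:
    """True if moov sits after mdat, i.e. relocating moov would help."""
    order = top_level_layout(buf)
    return ("mdat" in order and "moov" in order
            and order.index("moov") > order.index("mdat"))
```

In practice the relocation itself is usually done by existing tools, e.g. ffmpeg with `-movflags +faststart` (which also rewrites the chunk offsets that moving moov invalidates).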

Inside moov: where real indexing happens

High-value path:

moov
└── trak (per track)
    └── mdia
        └── minf
            └── stbl
                ├── stsd (sample description)
                ├── stts (decoding time-to-sample)
                ├── ctts (composition offset, optional)
                ├── stsc (sample-to-chunk)
                ├── stsz (sample sizes)
                └── stco/co64 (chunk offsets)

For debugging playback issues, stbl is usually the first place to inspect.

Example 2: parsing one stts entry

Suppose stts payload contains one entry:

entry_count = 1
sample_count = 300
sample_delta = 1000

Interpretation:

  • 300 consecutive samples
  • each sample advances decode timeline by 1000 timescale units

If track timescale is 90000:

  • per-sample duration = 1000 / 90000 s ≈ 11.11 ms
  • frame rate = 90000 / 1000 = 90 fps (example values only)

Real files often contain multiple entries when frame durations vary.
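The multi-entry case is easy to handle in code: expand each (sample_count, sample_delta) pair into per-sample durations. A minimal sketch, assuming the 90000 timescale from the text:

```python
TIMESCALE = 90000  # track timescale, as in the example above

def stts_durations_ms(entries, timescale):
    """Expand stts (sample_count, sample_delta) entries into
    per-sample durations in milliseconds."""
    out = []
    for sample_count, sample_delta in entries:
        out.extend([sample_delta / timescale * 1000.0] * sample_count)
    return out

# Single entry from Example 2: 300 samples, each 1000 timescale units
durations = stts_durations_ms([(300, 1000)], TIMESCALE)
# 300 durations of ~11.11 ms each; 1000 / durations[0] gives 90 fps
```

A variable-frame-rate file would simply pass more entries, e.g. `[(100, 1000), (200, 1500)]`.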

Example 3: linking sample tables to payload

Imagine simplified tables:

  • stsz: sample sizes = [1200, 980, 1105]
  • stco: first chunk offset = 4096
  • stsc: says these samples are in one chunk

Then payload mapping in mdat is:

  • sample #1 bytes: offset [4096, 4096+1200)
  • sample #2 bytes: offset [5296, 5296+980)
  • sample #3 bytes: offset [6276, 6276+1105)
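The offsets above follow mechanically from a running sum. A sketch for the single-chunk case in the example (the function name is my own; real stsc handling spans multiple chunks):

```python
def sample_byte_ranges(chunk_offset, sample_sizes):
    """Map consecutive samples within one chunk to [start, end)
    byte ranges in mdat: each sample starts where the previous ends."""
    ranges, pos = [], chunk_offset
    for size in sample_sizes:
        ranges.append((pos, pos + size))
        pos += size
    return ranges

ranges = sample_byte_ranges(4096, [1200, 980, 1105])
# [(4096, 5296), (5296, 6276), (6276, 7381)]
```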

This mapping logic is fundamental for:

  • random seek
  • segment packaging
  • repair/re-mux tooling

MP4 troubleshooting checklist (practical)

When a file “looks valid” but playback is bad, check in order:

  1. Is moov present and readable early enough?
  2. Do stsz/stco/stsc mappings point to valid mdat ranges?
  3. Are timescales and stts/ctts coherent?
  4. Is codec config (stsd, avcC/hvcC) consistent with samples?
  5. For streaming, are keyframe boundaries aligned with segment strategy?

Closing note

MP4 is simple on the surface and deep in practice.

Understanding box structure is one of the highest-leverage skills in media platform work, because many “player bugs” are actually container-indexing issues.

If useful, I can write a follow-up focused on moof/mdat in fragmented MP4 (fMP4) for HLS/DASH workflows.