What Is MP4? A Practical MP4 Box Deep Dive

A practical walkthrough of MP4 box structure with concrete byte-level examples, from ftyp/moov/mdat to sample tables.

When people say “MP4 file,” they often mix up three different layers:

  1. Container (MP4)
  2. Codec (H.264, H.265, AAC, etc.)
  3. Bitstream payload (actual encoded samples)

MP4 is a container format based on ISO Base Media File Format (ISO/IEC 14496-12). Internally, it is organized as nested boxes (also called atoms).

The shortest useful mental model

An MP4 file is a tree of boxes:

  • Each box has a size and type.
  • Some boxes only store metadata.
  • Some boxes store media payload.
  • Player startup speed depends heavily on where metadata boxes are placed.

Common top-level layout:

[ftyp][moov][mdat]
  • ftyp: file type and compatibility brands
  • moov: movie metadata (tracks, timing, sample tables)
  • mdat: raw media data (encoded samples)

Box header format

A standard box starts with:

  • 4 bytes: size (big-endian)
  • 4 bytes: type (ASCII)

If size == 1, a 64-bit extended size follows the type field. A size of 0 means the box extends to the end of the file (only valid for the last top-level box).
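The header rules above can be sketched in a few lines of Python. This is a minimal illustration, not a library API; the function name is my own.

```python
import struct

def read_box_header(buf: bytes, offset: int = 0):
    """Read one box header at `offset`.

    Returns (size, box_type, header_len). A 32-bit size of 1 signals
    that a 64-bit extended size follows the 4-byte type field.
    """
    size, = struct.unpack_from(">I", buf, offset)          # 4-byte big-endian size
    box_type = buf[offset + 4:offset + 8].decode("ascii")  # 4-byte ASCII type
    header_len = 8
    if size == 1:  # extended size: next 8 bytes hold the real 64-bit size
        size, = struct.unpack_from(">Q", buf, offset + 8)
        header_len = 16
    return size, box_type, header_len
```

For a normal box, `read_box_header(b"\x00\x00\x00\x20ftyp...")` yields `(32, "ftyp", 8)`; for an extended-size box the header length becomes 16.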

Example 1: reading ftyp from bytes

Hex bytes:

00 00 00 20 66 74 79 70 69 73 6F 6D 00 00 02 00
69 73 6F 6D 69 73 6F 32 61 76 63 31 6D 70 34 31

Parse:

  • 00 00 00 20 => size = 0x20 = 32 bytes
  • 66 74 79 70 => type = ftyp
  • 69 73 6F 6D => major brand = isom
  • 00 00 02 00 => minor version = 512
  • remaining brands: isom, iso2, avc1, mp41

This tells us the file claims compatibility with ISO BMFF profiles including AVC-oriented usage.
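The manual parse above can be reproduced in code. A sketch in Python, using the exact bytes from Example 1 (the function name is my own):

```python
import struct

# The 32 hex bytes from Example 1
FTYP_HEX = (
    "00000020 66747970 69736f6d 00000200"
    "69736f6d 69736f32 61766331 6d703431"
)

def parse_ftyp(buf: bytes):
    """Parse an ftyp box: major brand, minor version, compatible brands."""
    size, = struct.unpack_from(">I", buf, 0)
    assert buf[4:8] == b"ftyp"
    major = buf[8:12].decode("ascii")
    minor, = struct.unpack_from(">I", buf, 12)
    # Remaining 4-byte entries up to `size` are compatible brands
    brands = [buf[i:i + 4].decode("ascii") for i in range(16, size, 4)]
    return major, minor, brands

print(parse_ftyp(bytes.fromhex(FTYP_HEX)))
# ('isom', 512, ['isom', 'iso2', 'avc1', 'mp41'])
```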

Why moov matters for startup latency

A player usually needs moov before it can resolve the timeline-to-sample mapping and start decoding.

If moov is at file tail ([ftyp][mdat][moov]), startup over HTTP range requests is slower. If moov is at file head ([ftyp][moov][mdat]), startup is typically faster.

This is why pipelines often run “faststart” post-processing.
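A quick way to check whether a file needs this treatment is to scan its top-level box order. A rough sketch, assuming the whole file fits in a buffer (function names are my own):

```python
import struct

def top_level_layout(buf: bytes):
    """Return the ordered list of top-level box types in an MP4 buffer."""
    types, offset = [], 0
    while offset + 8 <= len(buf):
        size, = struct.unpack_from(">I", buf, offset)
        box_type = buf[offset + 4:offset + 8].decode("ascii", "replace")
        if size == 1:    # 64-bit extended size follows the type field
            size, = struct.unpack_from(">Q", buf, offset + 8)
        elif size == 0:  # box runs to the end of the file
            size = len(buf) - offset
        if size < 8:
            break        # malformed size; stop rather than loop forever
        types.append(box_type)
        offset += size
    return types

def needs_faststart(buf: bytes) -> bool:
    """True if moov sits after mdat, i.e. relocating moov would help."""
    order = top_level_layout(buf)
    return ("mdat" in order and "moov" in order
            and order.index("moov") > order.index("mdat"))
```

In practice the relocation itself is usually done by existing tools, e.g. ffmpeg with `-movflags +faststart` (which also rewrites the chunk offsets that moving moov invalidates).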

Inside moov: where real indexing happens

High-value path:

moov
└── trak (per track)
    └── mdia
        └── minf
            └── stbl
                ├── stsd (sample description)
                ├── stts (decoding time-to-sample)
                ├── ctts (composition offset, optional)
                ├── stsc (sample-to-chunk)
                ├── stsz (sample sizes)
                └── stco/co64 (chunk offsets)

For debugging playback issues, stbl is usually the first place to inspect.

Example 2: parsing one stts entry

Suppose stts payload contains one entry:

entry_count = 1
sample_count = 300
sample_delta = 1000

Interpretation:

  • 300 consecutive samples
  • each sample advances decode timeline by 1000 timescale units

If track timescale is 90000:

  • per-sample duration = 1000 / 90000 s ≈ 11.11 ms
  • frame rate = 90000 / 1000 = 90 fps (example values only)

Real files often contain multiple entries when frame durations vary.
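The multi-entry case is easy to handle in code: expand each (sample_count, sample_delta) pair into per-sample durations. A minimal sketch, assuming the 90000 timescale from the text:

```python
TIMESCALE = 90000  # track timescale, as in the example above

def stts_durations_ms(entries, timescale):
    """Expand stts (sample_count, sample_delta) entries into
    per-sample durations in milliseconds."""
    out = []
    for sample_count, sample_delta in entries:
        out.extend([sample_delta / timescale * 1000.0] * sample_count)
    return out

# Single entry from Example 2: 300 samples, each 1000 timescale units
durations = stts_durations_ms([(300, 1000)], TIMESCALE)
# 300 durations of ~11.11 ms each; 1000 / durations[0] gives 90 fps
```

A variable-frame-rate file would simply pass more entries, e.g. `[(100, 1000), (200, 1500)]`.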

Example 3: linking sample tables to payload

Imagine simplified tables:

  • stsz: sample sizes = [1200, 980, 1105]
  • stco: first chunk offset = 4096
  • stsc: says these samples are in one chunk

Then payload mapping in mdat is:

  • sample #1 bytes: offset [4096, 4096+1200)
  • sample #2 bytes: offset [5296, 5296+980)
  • sample #3 bytes: offset [6276, 6276+1105)
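The offsets above follow mechanically from a running sum. A sketch for the single-chunk case in the example (the function name is my own; real stsc handling spans multiple chunks):

```python
def sample_byte_ranges(chunk_offset, sample_sizes):
    """Map consecutive samples within one chunk to [start, end)
    byte ranges in mdat: each sample starts where the previous ends."""
    ranges, pos = [], chunk_offset
    for size in sample_sizes:
        ranges.append((pos, pos + size))
        pos += size
    return ranges

ranges = sample_byte_ranges(4096, [1200, 980, 1105])
# [(4096, 5296), (5296, 6276), (6276, 7381)]
```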

This mapping logic is fundamental for:

  • random seek
  • segment packaging
  • repair/re-mux tooling

MP4 troubleshooting checklist (practical)

When a file “looks valid” but playback is bad, check in order:

  1. Is moov present and readable early enough?
  2. Do stsz/stco/stsc mappings point to valid mdat ranges?
  3. Are timescales and stts/ctts coherent?
  4. Is codec config (stsd, avcC/hvcC) consistent with samples?
  5. For streaming, are keyframe boundaries aligned with segment strategy?

Closing note

MP4 is simple on the surface and deep in practice.

Understanding box structure is one of the highest-leverage skills in media platform work, because many “player bugs” are actually container-indexing issues.

If useful, I can write a follow-up focused on moof/mdat in fragmented MP4 (fMP4) for HLS/DASH workflows.