What Is MP4? A Practical MP4 Box Deep Dive
A practical walkthrough of MP4 box structure with concrete byte-level examples, from ftyp/moov/mdat to sample tables.
When people say “MP4 file,” they often mix up three different layers:
- Container (MP4)
- Codec (H.264, H.265, AAC, etc.)
- Bitstream payload (actual encoded samples)
MP4 is a container format based on the ISO Base Media File Format (ISO/IEC 14496-12). Internally, it is organized as nested boxes (also called atoms).
The shortest useful mental model
An MP4 file is a tree of boxes:
- Each box has a size and type.
- Some boxes only store metadata.
- Some boxes store media payload.
- Player startup speed depends heavily on where metadata boxes are placed.
Common top-level layout:
[ftyp][moov][mdat]
- ftyp: file type and compatibility brands
- moov: movie metadata (tracks, timing, sample tables)
- mdat: raw media data (encoded samples)
Box header format
A standard box starts with:
- 4 bytes: size (big-endian)
- 4 bytes: type (ASCII)
If size == 1, a 64-bit extended size follows.
Example 1: reading ftyp from bytes
Hex bytes:
00 00 00 20 66 74 79 70 69 73 6F 6D 00 00 02 00
69 73 6F 6D 69 73 6F 32 61 76 63 31 6D 70 34 31
Parse:
- 00 00 00 20 => size = 0x20 = 32 bytes
- 66 74 79 70 => type = ftyp
- 69 73 6F 6D => major brand = isom
- 00 00 02 00 => minor version = 512
- remaining compatible brands: isom, iso2, avc1, mp41
This tells us the file claims compatibility with ISO BMFF profiles including AVC-oriented usage.
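The parse above can be reproduced in a few lines. The following is a minimal sketch (not a full box parser) that decodes the exact 32 bytes from Example 1 using Python's struct module:

```python
import struct

# The 32 bytes of the ftyp box from Example 1.
data = bytes.fromhex(
    "000000206674797069736f6d00000200"
    "69736f6d69736f32617663316d703431"
)

# Standard box header: 32-bit big-endian size, then a 4-byte ASCII type.
size, box_type = struct.unpack(">I4s", data[:8])

# If size were 1, a 64-bit extended size would follow the type field;
# here it is a plain 32-bit size, so the ftyp payload starts at byte 8.
major_brand = data[8:12].decode("ascii")             # "isom"
minor_version = struct.unpack(">I", data[12:16])[0]  # 512
compatible = [data[i:i + 4].decode("ascii") for i in range(16, size, 4)]

print(size, box_type.decode(), major_brand, minor_version, compatible)
# → 32 ftyp isom 512 ['isom', 'iso2', 'avc1', 'mp41']
```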
Why moov matters for startup latency
A player usually needs moov before it can build the timeline and the sample-to-byte mapping.
If moov is at file tail ([ftyp][mdat][moov]), startup over HTTP range requests is slower.
If moov is at file head ([ftyp][moov][mdat]), startup is typically faster.
This is why pipelines often run “faststart” post-processing.
Inside moov: where real indexing happens
High-value path:
moov
└── trak (per track)
└── mdia
└── minf
└── stbl
├── stsd (sample description)
├── stts (decoding time-to-sample)
├── ctts (composition offset, optional)
├── stsc (sample-to-chunk)
├── stsz (sample sizes)
└── stco/co64 (chunk offsets)
For debugging playback issues, stbl is usually the first place to inspect.
Example 2: parsing one stts entry
Suppose stts payload contains one entry:
entry_count = 1
sample_count = 300
sample_delta = 1000
Interpretation:
- 300 consecutive samples
- each sample advances decode timeline by 1000 timescale units
If track timescale is 90000:
- per-sample duration = 1000 / 90000 s ≈ 11.11 ms
- frame rate ≈ 90 fps (example only)
Real files often contain multiple entries when frame durations vary.
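Expanding stts entries into per-sample durations is a common first step when debugging timing. A sketch assuming entries are given as (sample_count, sample_delta) pairs:

```python
def stts_to_durations(entries, timescale):
    """Expand stts (sample_count, sample_delta) entries into
    per-sample durations in seconds."""
    durations = []
    for sample_count, sample_delta in entries:
        durations.extend([sample_delta / timescale] * sample_count)
    return durations

# The single-entry table from Example 2, with a 90000 Hz timescale.
durs = stts_to_durations([(300, 1000)], 90000)
print(len(durs), round(durs[0] * 1000, 2))  # → 300 11.11
```

A variable-frame-rate file would simply pass multiple entries, e.g. `[(100, 1000), (200, 1500)]`.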
Example 3: linking sample tables to payload
Imagine simplified tables:
- stsz: sample sizes = [1200, 980, 1105]
- stco: first chunk offset = 4096
- stsc: says these three samples are in one chunk
Then payload mapping in mdat is:
- sample #1: bytes [4096, 4096+1200) = [4096, 5296)
- sample #2: bytes [5296, 5296+980) = [5296, 6276)
- sample #3: bytes [6276, 6276+1105) = [6276, 7381)
This mapping logic is fundamental for:
- random seek
- segment packaging
- repair/re-mux tooling
MP4 troubleshooting checklist (practical)
When a file “looks valid” but playback is bad, check in order:
- Is moov present and readable early enough?
- Do stsz/stco/stsc mappings point to valid mdat ranges?
- Are timescales and stts/ctts coherent?
- Is the codec config (stsd, avcC/hvcC) consistent with the samples?
- For streaming, are keyframe boundaries aligned with the segment strategy?
Closing note
MP4 is simple on the surface and deep in practice.
Understanding box structure is one of the highest-leverage skills in media platform work, because many “player bugs” are actually container-indexing issues.
If useful, I can write a follow-up focused on moof/mdat in fragmented MP4 (fMP4) for HLS/DASH workflows.