I don’t use the template myself, but having read the manual, this is my best guess at how it’s supposed to work.
First things first: it’s designed to be used when Scrivener is in Split Editor mode, with the transcript document in one editor and a media file (video or sound clip) in the other. The transcript is set up to automatically add the time stamp from the media at the relevant point.
Each of the things you’re likely to want to do in the transcript has an ‘element’ attached. These are really just special formats applied to the paragraph (e.g. bold, centred, hanging indents). To save you explicitly choosing an element each time, you can use Enter or Tab to automatically apply a specific format (the most common one) to the next element. In normal scriptwriting, this could take the following form: you start off in Scene Heading. The most common next format is Action, so Scrivener lets you choose Action by pressing Enter. After the Action you either want to add another Action paragraph or a Character, and Scrivener lets you choose Character with a simple Enter. After the character name, Enter will move to Dialogue. Pressing Tab will give you the second most common option, e.g. Parenthetical instead of Dialogue. In other words, Tab and Enter help you move quickly through a script, choosing the next element you want. But they can also add information automatically, such as Scene Numbers, when you press Tab.
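If it helps to see that laid out, here is a rough sketch of the Enter/Tab idea as a simple lookup table. This is only my own illustration of the behaviour described above, not how Scrivener is actually implemented: the names `NEXT_ELEMENT`, `next_element` and `walk` are made up, and the table only contains the transitions mentioned in this post.

```python
# Hypothetical model of the Enter/Tab behaviour described above.
# Only the transitions mentioned in the post are listed; Scrivener's
# real script settings contain more elements and rules.
NEXT_ELEMENT = {
    ("Scene Heading", "Enter"): "Action",
    ("Action", "Enter"): "Character",
    ("Character", "Enter"): "Dialogue",
    ("Character", "Tab"): "Parenthetical",
}


def next_element(current, key):
    """Return the element the key selects, or the current one if no rule is listed."""
    return NEXT_ELEMENT.get((current, key), current)


def walk(start, keys):
    """Apply a sequence of key presses and return every element visited."""
    visited = [start]
    for key in keys:
        visited.append(next_element(visited[-1], key))
    return visited


print(walk("Scene Heading", ["Enter", "Enter", "Enter"]))
# prints ['Scene Heading', 'Action', 'Character', 'Dialogue']
```

The point is just that each element has a "most likely next" (Enter) and sometimes a "second most likely next" (Tab), so you rarely have to pick from a menu.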
In the Transcript format, it looks like the Speaker, Time and Text format works this way, once you have selected that element for a line:
You type in the Speaker name and press Tab. Scrivener automatically adds the current time stamp from the media clip in the other editor, followed by a space. You then add the text by simply typing it; it will be formatted with a large left indent and hanging tab.
When you’ve finished with that entry, you press Enter. The element will still be Speaker / Time / Text and you can just add another Speaker, Time Stamp and Text as before.
Or you can press Tab, which changes the current element to Time / Text (no Speaker), presumably to record the speech of the same speaker. Press Tab a second time and the current media timestamp will be added. If you don’t want either of those, press Enter to select a new element.
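To make the two transcript elements a bit more concrete, here is a small sketch of what the resulting lines might look like. It is purely illustrative: the `HH:MM:SS` time format, the single-space separators and the function names are my assumptions, not something taken from the manual or the template.

```python
def format_timestamp(seconds):
    """Format a media position in seconds as HH:MM:SS (assumed format)."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def transcript_line(text, media_seconds, speaker=None):
    """Build one transcript line.

    With a speaker this mimics the Speaker / Time / Text element;
    without one it mimics the Time / Text element.
    """
    stamp = format_timestamp(media_seconds)
    if speaker:
        return f"{speaker} {stamp} {text}"
    return f"{stamp} {text}"


print(transcript_line("Thanks for having me.", 75, speaker="ANNA"))
# prints ANNA 00:01:15 Thanks for having me.
print(transcript_line("It's good to be here.", 80))
# prints 00:01:20 It's good to be here.
```

In the template itself the time stamp comes from wherever the media clip is currently paused or playing in the other editor; here it is just passed in as a number of seconds.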
I think that’s what’s going on, anyway. Hope this helps you figure it out…