The "zero-byte delimited text" recommendation is based on some work I did on serial protocols. Serial ports are a bit trickier than TCP connections, since in certain situations the sender of a data stream doesn't know whether anybody is listening. The listener then has to "sync" with the data stream in order to interpret the data properly.
For TCP streams the situation is a little better, since the sender always knows when a client started listening. But the current design demands a lot of care from the client implementation: it has to count bytes correctly to know when the JSON is finished. And from experience I can tell that this byte counting can be a headache in certain programming languages...
The alternative approach I suggested basically lets the client collect incoming data (which might arrive in odd chunks, since TCP does not preserve message boundaries), check whether there is a zero byte in it, and then try to parse everything from the start up to the zero byte as JSON. If it parses: great. If not, discard the data up to (and including) the zero byte. Then check/wait for the next zero byte...
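To make the idea concrete, here is a minimal Python sketch of such a client-side reader. The class name `ZeroDelimitedJsonReader` and its `feed` method are made up for illustration and are not part of the existing API. It also assumes the server terminates every JSON message with a single zero byte and encodes it as UTF-8.

```python
import json


class ZeroDelimitedJsonReader:
    """Hypothetical helper: splits a byte stream into zero-byte delimited JSON messages."""

    def __init__(self):
        # Bytes received so far that do not yet end in a zero byte.
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add a chunk of received bytes and return all messages completed by it."""
        self._buffer.extend(chunk)
        messages = []
        while True:
            end = self._buffer.find(b"\x00")
            if end == -1:
                # No complete frame yet; wait for more data.
                break
            frame = bytes(self._buffer[:end])
            # Drop the frame and its delimiter from the buffer in either case.
            del self._buffer[:end + 1]
            try:
                messages.append(json.loads(frame.decode("utf-8")))
            except (UnicodeDecodeError, json.JSONDecodeError):
                # Not valid JSON (e.g. we joined mid-message): discard and resync.
                continue
        return messages
```

The client would call `feed()` with whatever `socket.recv()` returns and handle the returned list. No byte counting is needed, and a garbled frame costs at most one message.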
The "verbose enums" suggestion is based on some JSON API design I did. Instead of passing around numbers that have to be kept in sync between the server and all the different clients, use verbose strings. The server and every client are then free to map these strings in any way they like. This has kept me sane in quite a few situations where I had to add a new enum value and it made the most sense to put it bang in the middle of the existing ones, instead of ending up with a very weird order of enum values that has to stay that way for backward compatibility with all clients.
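As a rough illustration of what I mean on the client side, here is a small Python sketch. The enum name `PlayerState`, its members, and the wire strings are invented for this example and are not taken from the actual API.

```python
from enum import Enum


class PlayerState(Enum):
    """Hypothetical client-side enum; the values are the strings sent over the wire."""
    STOPPED = "stopped"
    BUFFERING = "buffering"  # added later, in the middle, without renumbering anything
    PLAYING = "playing"
    PAUSED = "paused"


def parse_state(wire_value: str):
    """Map a wire string to PlayerState, or return None for values this client doesn't know."""
    try:
        return PlayerState(wire_value)
    except ValueError:
        # An older client simply ignores states that a newer server introduced.
        return None
```

Since the string is the contract, the order of the members is purely cosmetic, and each client can keep whatever internal representation it likes.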
As for my use case: I am experimenting with external haptic devices controlled by a Raspberry Pi and some Python scripting. I am trying to use the remote control API as a means to sync up haptic events with what is shown in the (Oculus Quest) headset. So far it seems to work reasonably well, but one thing that annoys me is that identifying the currently playing movie from the DLNA server is hard, since the URL is obfuscated by the DLNA server. I can't easily identify the filename (and I'd prefer not to query the DLNA server myself, since DLNA seems to be quite a mess...). Having the proper name available would make it easy to figure out the matching haptics description.
Maybe this doesn't have to be sent in each and every packet, but having a command to get extended information about the currently playing movie would be helpful.
FYI: right now I have a latency of about 100 to 200 ms between what is shown in the Oculus Quest and the timestamps I get via TCP, with quite some jitter. There is probably not much you can do about it, since there is some Wi-Fi involved in between. But it works reasonably well with some guesswork on the client side.