#spatial
Explore tagged Tumblr posts
Photo
11K notes
·
View notes
Text
116 notes
·
View notes
Text
Shanghai Shuiguo · Tangquan Life (Wujiaochang Store)
Courtesy: DJX Design Studio,
Photography: Tan Xiaozhong, Wu Hehongquan
#art#design#stairwell#stairway#architecture#staircase#stairs#interiors#staircases#shop#flagship#shanghai#china#shanghai shuiguo#tangquan life#wujiaochang#DJX Studio#spatial
30 notes
·
View notes
Text
The view at the edge. Slight motion due to temporal-spatial shift and atmospheric distortion as the photo was taken from within the Phoenix AZ USA urban light dome.
#photography#space#time#crescent moon#davideragland#moon#photographers on tumblr#nature#temporal#spatial#stars#night sky#restaurant at the end of the universe#wonder of nature#when I think of heaven#i think of you
13 notes
·
View notes
Photo
Echoes of Form ended: The Symphony of Spatial Structures
9 notes
·
View notes
Text
Apple’s Mysterious Fisheye Projection
If you’ve read my first post about Spatial Video, the second about Encoding Spatial Video, or if you’ve used my command-line tool, you may recall a mention of Apple’s mysterious “fisheye” projection format. Mysterious because they’ve documented a CMProjectionType.fisheye enumeration with no elaboration, they stream their immersive Apple TV+ videos in this format, yet they’ve provided no method to produce or playback third-party content using this projection type.
Additionally, the format is undocumented, they haven’t responded to an open question on the Apple Discussion Forums asking for more detail, and they didn’t cover it in their WWDC23 sessions. As someone who has experience in this area – and a relentless curiosity – I’ve spent time digging-in to Apple’s fisheye projection format, and this post shares what I’ve learned.
As stated in my prior post, I am not an Apple employee, and everything I’ve written here is based on my own history, experience (specifically my time at immersive video startup, Pixvana, from 2016-2020), research, and experimentation. I’m sure that some of this is incorrect, and I hope we’ll all learn more at WWDC24.
Spherical Content
Imagine sitting in a swivel chair and looking straight ahead. If you tilt your head to look straight up (at the zenith), that’s 90 degrees. Likewise, if you were looking straight ahead and tilted your head all the way down (at the nadir), that’s also 90 degrees. So, your reality has a total vertical field-of-view of 90 + 90 = 180 degrees.
Sitting in that same chair, if you swivel 90 degrees to the left or 90 degrees to the right, you’re able to view a full 90 + 90 = 180 degrees of horizontal content (your horizontal field-of-view). If you spun your chair all the way around to look at the “back half” of your environment, you would spin past a full 360 degrees of content.
When we talk about immersive video, it’s common to only refer to the horizontal field-of-view (like 180 or 360) with the assumption that the vertical field-of-view is always 180. Of course, this doesn’t have to be true, because we can capture whatever we’d like, edit whatever we’d like, and playback whatever we’d like.
But when someone says something like VR180, they really mean immersive video that has a 180-degree horizontal field-of-view and a 180-degree vertical field-of-view. Similarly, 360 video is 360-degrees horizontally by 180-degrees vertically.
Projections
When immersive video is played back in a device like the Apple Vision Pro, the Meta Quest, or others, the content is displayed as if a viewer’s eyes are at the center of a sphere watching video that is displayed on its inner surface. For 180-degree content, this is a hemisphere. For 360-degree content, this is a full sphere. But it can really be anything in between; at Pixvana, we sometimes referred to this as any-degree video.
It's here where we run into a small problem. How do we encode this immersive, spherical content? All the common video codecs (H.264, VP9, HEVC, MV-HEVC, AVC1, etc.) are designed to encode and decode data to and from a rectangular frame. So how do you take something like a spherical image of the Earth (i.e. a globe) and store it in a rectangular shape? That sounds like a map to me. And indeed, that transformation is referred to as a map projection.
Equirectangular
While there are many different projection types that each have useful properties in specific situations, spherical video and images most commonly use an equirectangular projection. This is a very simple transformation to perform (it looks more complicated than it is). Each x location on a rectangular image represents a longitude value on a sphere, and each y location represents a latitude. That’s it. Because of these relationships, this kind of projection can also be called a lat/long.
Imagine “peeling” thin one-degree-tall strips from a globe, starting at the equator. We start there because it’s the longest strip. To transform it to a rectangular shape, start by pasting that strip horizontally across the middle of a sheet of paper (in landscape orientation). Then, continue peeling and pasting up or down in one-degree increments. Be sure to stretch each strip to be as long as the first, meaning that the very short strips at the north and south poles are stretched a lot. Don’t break them! When you’re done, you’ll have a 360-degree equirectangular projection that looks like this.
If you did this exact same thing with half of the globe, you’d end up with a 180-degree equirectangular projection, sometimes called a half-equirect. Performed digitally, it’s common to allocate the same number of pixels to each degree of image data. So, for a full 360-degree by 180-degree equirect, the rectangular video frame would have an aspect ratio of 2:1 (the horizontal dimension is twice the vertical dimension). For 180-degree by 180-degree video, it’d be 1:1 (a square). Like many things, these aren’t hard and fast rules, and for technical reasons, sometimes frames are stretched horizontally or vertically to fit within the capabilities of an encoder or playback device.
This is a 180-degree half equirectangular image overlaid with a grid to illustrate its distortions. It was created from the standard fisheye image further below. Watch an animated version of this transformation.
What we’ve described so far is equivalent to monoscopic (2D) video. For stereoscopic (3D) video, we need to pack two of these images into each frame…one for each eye. This is usually accomplished by arranging two images in a side-by-side or over/under layout. For full 360-degree stereoscopic video in an over/under layout, this makes the final video frame 1:1 (because we now have 360 degrees of image data in both dimensions). As described in my prior post on Encoding Spatial Video, though, Apple has chosen to encode stereo video using MV-HEVC, so each eye’s projection is stored in its own dedicated video layer, meaning that the reported video dimensions match that of a single eye.
Standard Fisheye
Most immersive video cameras feature one or more fisheye lenses. For 180-degree stereo (the short way of saying stereoscopic) video, this is almost always two lenses in a side-by-side configuration, separated by ~63-65mm, very much like human eyes (some 180 cameras).
The raw frames that are captured by these cameras are recorded as fisheye images where each circular image area represents ~180 degrees (or more) of visual content. In most workflows, these raw fisheye images are transformed into an equirectangular or half-equirectangular projection for final delivery and playback.
This is a 180 degree standard fisheye image overlaid with a grid. This image is the source of the other images in this post.
Apple’s Fisheye
This brings us to the topic of this post. As I stated in the introduction, Apple has encoded the raw frames of their immersive videos in a “fisheye” projection format. I know this, because I’ve monitored the network traffic to my Apple Vision Pro, and I’ve seen the HLS streaming manifests that describe each of the network streams. This is how I originally discovered and reported that these streams – in their highest quality representations – are ~50Mbps, HDR10, 4320x4320 per eye, at 90fps.
While I can see the streaming manifests, I am unable to view the raw video frames, because all the immersive videos are protected by DRM. This makes perfect sense, and while I’m a curious engineer who would love to see a raw fisheye frame, I am unwilling to go any further. So, in an earlier post, I asked anyone who knew more about the fisheye projection type to contact me directly. Otherwise, I figured I’d just have to wait for WWDC24.
Lo and behold, not a week or two after my post, an acquaintance introduced me to Andrew Chang who said that he had also monitored his network traffic and noticed that the Apple TV+ intro clip (an immersive version of this) is streamed in-the-clear. And indeed, it is encoded in the same fisheye projection. Bingo! Thank you, Andrew!
Now, I can finally see a raw fisheye video frame. Unfortunately, the frame is mostly black and featureless, including only an Apple TV+ logo and some God rays. Not a lot to go on. Still, having a lot of experience with both practical and experimental projection types, I figured I’d see what I could figure out. And before you ask, no, I’m not including the actual logo, raw frame, or video in this post, because it’s not mine to distribute.
Immediately, just based on logo distortions, it’s clear that Apple’s fisheye projection format isn’t the same as a standard fisheye recording. This isn’t too surprising, given that it makes little sense to encode only a circular region in the center of a square frame and leave the remainder black; you typically want to use all the pixels in the frame to send as much data as possible (like the equirectangular format described earlier).
Additionally, instead of seeing the logo horizontally aligned, it’s rotated 45 degrees clockwise, aligning it with the diagonal that runs from the upper-left to the lower-right of the frame. This makes sense, because the diagonal is the longest dimension of the frame, and as a result, it can store more horizontal (post-rotation) pixels than if the frame wasn’t rotated at all.
This is the same standard fisheye image from above transformed into a format that seems very similar to Apple’s fisheye format. Watch an animated version of this transformation.
Likewise, the diagonal from the lower-left to the upper-right represents the vertical dimension of playback (again, post-rotation) providing a similar increase in available pixels. This means that – during rotated playback – the now-diagonal directions should contain the least amount of image data. Correctly-tuned, this likely isn’t visible, but it’s interesting to note.
More Pixels
You might be asking, where do these “extra” pixels come from? I mean, if we start with a traditional raw circular fisheye image captured from a camera and just stretch it out to cover a square frame, what have we gained? Those are great questions that have many possible answers.
This is why I liken video processing to turning knobs in a 747 cockpit: if you turn one of those knobs, you more-than-likely need to change something else to balance it out. Which leads to turning more knobs, and so on. Video processing is frequently an optimization problem just like this. Some initial thoughts:
It could be that the source video is captured at a higher resolution, and when transforming the video to a lower resolution, the “extra” image data is preserved by taking advantage of the square frame.
Perhaps the camera optically transforms the circular fisheye image (using physical lenses) to fill more of the rectangular sensor during capture. This means that we have additional image data to start and storing it in this expanded fisheye format allows us to preserve more of it.
Similarly, if we record the image using more than two lenses, there may be more data to preserve during the transformation. For what it’s worth, it appears that Apple captures their immersive videos with a two-lens pair, and you can see them hiding in the speaker cabinets in the Alicia Keys video.
There are many other factors beyond the scope of this post that can influence the design of Apple’s fisheye format. Some of them include distortion handling, the size of the area that’s allocated to each pixel, where the “most important” pixels are located in the frame, how high-frequency details affect encoder performance, how the distorted motion in the transformed frame influences motion estimation efficiency, how the pixels are sampled and displayed during playback, and much more.
Blender
But let’s get back to that raw Apple fisheye frame. Knowing that the image represents ~180 degrees, I loaded up Blender and started to guess at a possible geometry for playback based on the visible distortions. At that point, I wasn’t sure if the frame encodes faces of the playback geometry or if the distortions are related to another kind of mathematical mapping. Some of the distortions are more severe than expected, though, and my mind couldn’t imagine what kind of mesh corrected for those distortions (so tempted to blame my aphantasia here, but my spatial senses are otherwise excellent).
One of the many meshes and UV maps that I’ve experimented with in Blender.
Radial Stretching
If you’ve ever worked with projection mappings, fisheye lenses, equirectangular images, camera calibration, cube mapping techniques, and so much more, Google has inevitably led you to one of Paul Bourke’s many fantastic articles. I’ve exchanged a few e-mails with Paul over the years, so I reached out to see if he had any insight.
After some back-and-forth discussion over a couple of weeks, we both agreed that Apple’s fisheye projection is most similar to a technique called radial stretching (with that 45-degree clockwise rotation thrown in). You can read more about this technique and others in Mappings between Sphere, Disc, and Square and Marc B. Reynolds’ interactive page on Square/Disc mappings.
Basically, though, imagine a traditional centered, circular fisheye image that touches each edge of a square frame. Now, similar to the equirectangular strip-peeling exercise I described earlier with the globe, imagine peeling one-degree wide strips radially from the center of the image and stretching those along the same angle until they touch the edge of the square frame. As the name implies, that’s radial stretching. It’s probably the technique you’d invent on your own if you had to come up with something.
By performing the reverse of this operation on a raw Apple fisheye frame, you end up with a pretty good looking version of the Apple TV+ logo. But, it’s not 100% correct. It appears that there is some additional logic being used along the diagonals to reduce the amount of radial stretching and distortion (and perhaps to keep image data away from the encoded corners). I’ve experimented with many approaches, but I still can’t achieve a 100% match. My best guess so far uses simple beveled corners, and this is the same transformation I used for the earlier image.
It's also possible that this last bit of distortion could be explained by a specific projection geometry, and I’ve iterated over many permutations that get close…but not all the way there. For what it’s worth, I would be slightly surprised if Apple was encoding to a specific geometry because it adds unnecessary complexity to the toolchain and reduces overall flexibility.
While I have been able to playback the Apple TV+ logo using the techniques I’ve described, the frame lacks any real detail beyond its center. So, it’s still possible that the mapping I’ve arrived at falls apart along the periphery. Guess I’ll continue to cross my fingers and hope that we learn more at WWDC24.
Conclusion
This post covered my experimentation with the technical aspects of Apple’s fisheye projection format. Along the way, it’s been fun to collaborate with Andrew, Paul, and others to work through the details. And while we were unable to arrive at a 100% solution, we’re most definitely within range.
The remaining questions I have relate to why someone would choose this projection format over half-equirectangular. Clearly Apple believes there are worthwhile benefits, or they wouldn’t have bothered to build a toolchain to capture, process, and stream video in this format. I can imagine many possible advantages, and I’ve enumerated some of them in this post. With time, I’m sure we’ll learn more from Apple themselves and from experiments that all of us can run when their fisheye format is supported by existing tools.
It's an exciting time to be revisiting immersive video, and we have Apple to thank for it.
As always, I love hearing from you. It keeps me motivated! Thank you for reading.
12 notes
·
View notes
Text
「室温」
ここちよい。
52 notes
·
View notes
Text
#metaverse advertising#spatial#art gallery#web3#augmented reality#alteredalley#over the reality#nft#altered alley#spatial gallery
2 notes
·
View notes
Text
SemiCreative
#SemiCreative#spatial#design#studio#New Zealand#one page#white#portfolio#type#typeface#font#ABC Camera Plain#2023#Week 37#website#web design#inspire#inspiration#happywebdesign
16 notes
·
View notes
Video
Veiled Melancholy/Book Narratives : Film Collages. #3 by Russell Moreton Via Flickr: pictify.com/user/russellmoreton The Politics of Architecture : Theorizing through speculative spatial practices. "He rubbed his eyes. The riddle of his surroundings was confusing but his mind was quite clear - evidently his sleep had benefited him. He was not in a bed at all as he understood the word, but lying naked on a very soft and yeilding mattress, in a trough of dark glass. The mattress was partly transparent, a fact he observed with a sense of insecurity, and below it was a mirror reflecting him greyly. Above his arm- and he saw with a shock that his skin was strangely dry and yellow - was bound a curious apparatus of rubber, bound so cunningly that it seemed to pass into his skin above and below. And this bed was placed in a case of greenish-coloured glass (as it seemed to him), a bar in the white framework of which had first arrested his attention. In the corner of the case was a stand of glittering and delicately made apparatus, for the most part quite strange appliances, though a maximum and minimum thermometer was recognizable." H. G. Wells : The Sleeper Awakes. 1899/1910 Spatiality : The Spatial Turn, Robert T. Tally Jr. 2013 Immediate Architectural Interventions, Durations and Effects : Apparatuses, Things and People in the Making of the City and the World. Alberto Altes Arlandis, Oren Lieberman. 2013 Preface (1921) ” The great city of this story is no more than a nightmare of Capitalism triumphant, a nightmare that was dreamt a quarter of a century ago. It is a fantastic possibility no longer possible. Much evil may be in store for mankind, but to this immense, grim organization of servitude, our race will never come” H.G. Wells. EastonGlebe, Dunmow,1921.
#collage#artist book#veiled melancholy#assemblage#materials#narrative#space#spatial#relationships#sepia#historical#image#page#spaces#rooms#interior#abstractions#film poetry#russell moreton#substances#place#fragments#black book#memory#consciousness#art#book works#research artist#visual cartography#evocative spaces
2 notes
·
View notes
Photo
166 notes
·
View notes
Text
#crobard#esquisse#sf#station spatiale#balancier#science fiction#vaisseau#spatial#caisson#mekton#cyberpunk#jdr
7 notes
·
View notes
Text
Painful Poetry Of The Imploded Kiss
2 notes
·
View notes
Text
This Date In Manka Bros. History - February 9, 1957
Manka Bros’ first attempt at 3D TV (MankaVision) ends badly after several Manka scientists are blinded by a combination of the radiation from the TV and the ‘Super Electrified Radia-ggles™.’
3 notes
·
View notes
Text
#art#spatial#art gallery#virtual gallery#art on tumblr#artfinityx#abstract expressionism#art shop#wall art#art prints#artists on tumblr#digital art#digital painting#digital download#art ai#ai art gallery
2 notes
·
View notes
Text
Encoding Spatial Video
As I mentioned in my prior post about Spatial Video, the launch of the Apple Vision Pro has reignited interest in spatial and immersive video formats, and it's exciting to hear from users who are experiencing this format for the first time. The release of my spatial video command-line tool and example spatial video player has inadvertently pulled me into a lot of fun discussions, and I've really enjoyed chatting with studios, content producers, camera manufacturers, streaming providers, enthusiasts, software developers, and even casual users. Many have shared test footage, and I've been impressed by a lot of what I've seen. In these interactions, I'm often asked about encoding options, playback, and streaming, and this post will focus on encoding.
To start, I'm not an Apple employee, and other than my time working at an immersive video startup (Pixvana, 2016-2020), I don't have any secret or behind-the-scenes knowledge. Everything I've written here is based on my own research and experimentation. That means that some of this will be incorrect, and it's likely that things will change, perhaps as early as WWDC24 in June (crossing my fingers). With that out of the way, let's get going.
Encoding
Apple's spatial and immersive videos are encoded using a multi-view extension of HEVC referred to as MV-HEVC (found in Annex G of the latest specification). While this format and extension were defined as a standard many years ago, as far as I can tell, MV-HEVC has not been used in practice. Because of this, there are very few encoders that support this format. As of this writing, these are the encoders that I'm aware of:
spatial - my own command-line tool for encoding MV-HEVC on an Apple silicon Mac
Spatialify - an iPhone/iPad app
SpatialGen - an online encoding solution
QooCam EGO spatial video and photo converter - for users of this Kandao camera
Dolby/Hybrik - professional online encoding
Ateme TITAN - professional encoding (note the upcoming April 16, 2024 panel discussion at NAB)
SpatialMediaKit - an open source GitHub project for Mac
MV-HEVC reference software - complex reference software mostly intended for conformance testing
Like my own spatial tool, many of these encoders rely on the MV-HEVC support that has been added to Apple's AVFoundation framework. As such, you can expect them to behave in similar ways. I'm not as familiar with the professional solutions that are provided by Dolby/Hyrbik and Ateme, so I can't say much about them. Finally, the MV-HEVC reference software was put together by the standards committee, and while it is an invaluable tool for testing conformance, it was never intended to be a commercial tool, and it is extremely slow. Also, the reference software was completed well before Apple defined its vexu metadata, so that would have to be added manually (my spatial tool can do this).
Layers
As I mentioned earlier, MV-HEVC is an extension to HEVC, and the multi-view nature of that extension is intended to encode multiple views of the same content all within a single bitstream. One use might be to enable multiple camera angles of an event – like a football game – to be carried in a single stream, perhaps allowing a user to switch between them. Another use might be to encode left- and right-eye views to be played back stereoscopically (in 3D).
To carry multiple views, MV-HEVC assigns each view a different layer ID. In a normal HEVC stream, there is only one so-called primary layer that is assigned an ID of 0. When you watch standard 2D HEVC-encoded media, you're watching the only/primary layer 0 content. With Apple's spatial and immersive MV-HEVC content, a second layer (typically ID 1) is also encoded, and it represents a second view of the same content. Note that while it's common for layer 0 to represent a left-eye view, this is not a requirement.
One benefit of this scheme is that you can playback MV-HEVC content on a standard 2D player, and it will only playback the primary layer 0 content, effectively ignoring anything else. But, when played back on a MV-HEVC-aware player, each layered view can be presented to the appropriate eye. This is why my spatial tool allows you to choose which eye's view is stored in the primary layer 0 for 2D-only players. Sometimes (like on iPhone 15 Pro), one camera's view looks better than the other.
All video encoders take advantage of the fact that the current video frame looks an awful lot like the prior video frame. Which looks a lot like the one before that. Most of the bandwidth savings depends on this fact. This is called temporal (changes over time) or inter-view (where a view in this sense is just another image frame) compression. As an aside, if you're more than casually interested in how this works, I highly recommend this excellent digital video introduction. But even if you don't read that article, a lot of the data in compressed video consists of one frame referencing part of another frame (or frames) along with motion vectors that describe which direction and distance an image chunk has moved.
Now, what happens when we introduce the second layer (the other eye's view) in MV-HEVC-encoded video? Well, in addition to a new set of frames that is tagged as layer 1, these layer 1 frames can also reference frames that are in layer 0. And because stereoscopic frames are remarkably similar – after all, the two captures are typically 65mm or less apart – there is a lot of efficiency when storing the layer 1 data: "looks almost exactly the same as layer 0, with these minor changes…" It isn't unreasonable to expect 50% or more savings in that second layer.
This diagram shows a set of frames encoded in MV-HEVC. Perhaps confusing at first glance, the arrows show the flow of referenced image data. Notice that layer 0 does not depend on anything in layer 1, making this primary layer playable on standard 2D HEVC video players. Layer 1, however, relies on data from layer 0 and from other adjacent layer 1 frames.
Thanks to Fraunhofer for the structure of this diagram.
Mystery
I am very familiar with MV-HEVC output that is recorded by Apple Vision Pro and iPhone 15 Pro, and it's safe to assume that these are being encoded with AVFoundation. I'm also familiar with the output of my own spatial tool and a few of the others that I mentioned above, and they too use AVFoundation. However, the streams that Apple is using for its immersive content appear to be encoded by something else. Or at least a very different (future?) version of AVFoundation. Perhaps another WWDC24 announcement?
By monitoring the network, I've already learned that Apple's immersive content is encoded in 10-bit HDR, 4320x4320 per-eye resolution, at 90 frames-per-second. Their best streaming version is around 50Mbps, and the format of the frame is (their version of) fisheye. While they've exposed a fisheye enumeration in Core Media and their files are tagged as such, they haven't shared the details of this projection type. Because they've chosen it as the projection type for their excellent Apple TV immersive content, though, it'll be interesting to hear more when they're ready to share.
So, why do I suspect that they're encoding their video with a different MV-HEVC tool? First, where I'd expect to see a FourCC codec type of hvc1 (as noted in the current Apple documentation), in some instances, I've also seen a qhvc codec type. I've never encountered that HEVC codec type, and as far as I know, AVFoundation currently tags all MV-HEVC content with hvc1. At least that's been my experience. If anyone has more information about qhvc, drop me a line.
Next, as I explained in the prior section, the second layer in MV-HEVC-encoded files is expected to achieve a bitrate savings of around 50% or more by referencing the nearly-identical frame data in layer 0. But, when I compare files that have been encoded by Apple Vision Pro, iPhone 15 Pro, and the current version of AVFoundation (including my spatial tool), both layers are nearly identical in size. On the other hand, Apple's immersive content is clearly using a more advanced encoder, and the second layer is only ~45% of the primary layer…just what you'd expect.
Here is a diagram that shows three subsections of three different MV-HEVC videos, each showing a layer 0 (blue), then layer 1 (green) cadence of frames. The height of each bar represents the size of that frame's payload. Because the content of each video is different, this chart is only useful to illustrate the payload difference between layers.
As we've learned, for a mature encoder, we'd expect the green bars to be noticeably smaller than the blue bars. For Apple Vision Pro and spatial tool encodings (both using the current version of AVFoundation), the bars are often similar, and in some cases, the green bars are even higher than their blue counterparts. In contrast, look closely at the Apple Immersive data; the green layer 1 frame payload is always smaller.
Immaturity
What does this mean? Well, it means that Apple's optimized 50Mbps stream might need closer to 70Mbps using the existing AVFoundation-based tools to achieve a similar quality. My guess is that the MV-HEVC encoder in AVFoundation is essentially encoding two separate HEVC streams, then "stitching" them together (by updating layer IDs and inter-frame references), almost as-if they're completely independent of each other. That would explain the remarkable size similarity between the two layers, and as an initial release, this seems like a reasonable engineering simplification. It also aligns with Apple's statement that one minute of spatial video from iPhone 15 Pro is approximately 130MB while one minute of regular video is 65MB…exactly half.
Another possibility is that it's too computationally expensive to encode inter-layer references while capturing two live camera feeds in Vision Pro or iPhone 15 Pro. This makes a lot of sense, but I'd then expect a non-real-time source to produce a more efficient bitstream, and that's not what I'm seeing.
For what it's worth, I spent a bit of time trying to validate a lack of inter-layer references, but as mentioned, there are no readily-available tools that process MV-HEVC at this deeper level (even the reference decoder was having its issues). I started to modify some existing tools (and even wrote my own), but after a bunch of work, I was still too far away from an answer. So, I'll leave it as a problem for another day.
To further improve compression efficiency, I tried to add AVFoundation's multi-pass encoding to my spatial tool. Sadly, after many attempts and an unanswered post to the Apple Developer Forums, I haven't had any luck. It appears that the current MV-HEVC encoder doesn't support multi-pass encoding. Or if it does, I haven't found the magical incantation to get it working properly.
Finally, I tried adding more data rate options to my spatial tool. The tool can currently target a quality level or an average bitrate, but it really needs finer control for better HLS streaming support. Unfortunately, I couldn't get the data rate limits feature to work either. Again, I'm either doing something wrong, or the current encoder doesn't yet support all of these features.
Closing Thoughts
I've been exploring MV-HEVC in depth since the beginning of the year. I continue to think that it's a great format for immersive media, but it's clear that the current state of encoders (at least those that I've encountered) are in their infancy. Because the multi-view extensions for HEVC have never really been used in the past, HEVC encoders have reached a mature state without multi-view support. It will now take some effort to revisit these codebases to add support for things like multiple input streams, the introduction of additional layers, and features like rate control.
While we wait for answers at WWDC24, we're in an awkward transition period where the tools we have to encode media will require higher bitrates and offer less control over bitstreams. We can encode rectilinear media for playback in the Files and Photos apps on Vision Pro, but Apple has provided no native player support for these more immersive formats (though you can use my example spatial player). Fortunately, Apple's HLS toolset has been updated to handle spatial and immersive media. I had intended to write about streaming MV-HEVC in this post, but it's already long enough, so I'll save that topic for another time.
As always, I hope this information is useful, and if you have any comments, feedback, suggestions, or if you just want to share some media, feel free to contact me.
3 notes
·
View notes