author    | Wim Taymans <wim.taymans@collabora.co.uk> | 2011-11-07 12:23:15 +0100
committer | Wim Taymans <wim.taymans@collabora.co.uk> | 2011-11-07 12:23:15 +0100
commit    | 7ac25e9b26dcf61cf26bdcf83b0e361c75c3ef4d (patch)
tree      | f1c258673c9db7ec6b7e8711c32fe11d21971a54 /docs/design
parent    | 6cc887c53bb6d7ce8efbeea4e7d1a754a3622070 (diff)
parent    | 3df415d4c7d142cf07f805464ab9f41d098b505f (diff)
download  | gstreamer-plugins-base-7ac25e9b26dcf61cf26bdcf83b0e361c75c3ef4d.tar.gz
Merge branch 'master' into 0.11
Conflicts:
common
configure.ac
gst-libs/gst/audio/gstbaseaudiosink.c
gst/playback/gstdecodebin2.c
gst/playback/gstplaysinkaudioconvert.c
gst/playback/gstplaysinkaudioconvert.h
gst/playback/gstplaysinkvideoconvert.c
gst/playback/gstplaysinkvideoconvert.h
Diffstat (limited to 'docs/design')
-rw-r--r-- | docs/design/draft-subtitle-overlays.txt | 548 |
1 file changed, 548 insertions, 0 deletions
diff --git a/docs/design/draft-subtitle-overlays.txt b/docs/design/draft-subtitle-overlays.txt
new file mode 100644
index 000000000..ceff5e5e5
--- /dev/null
+++ b/docs/design/draft-subtitle-overlays.txt
@@ -0,0 +1,548 @@

===============================================================
 Subtitle overlays, hardware-accelerated decoding and playbin2
===============================================================

Status: EARLY DRAFT / BRAINSTORMING

The following text will use "playbin" synonymously with "playbin2".

 === 1. Background ===

Subtitles can be muxed in containers or come from an external source.

Subtitles come in many shapes and colours. Usually they are either
text-based (incl. 'pango markup') or bitmap-based (e.g. DVD subtitles
and the most common form of DVB subs). Bitmap-based subtitles are
usually compressed in some way, e.g. with some form of run-length
encoding.

Subtitles are currently decoded and rendered in subtitle-format-specific
overlay elements. These elements have two sink pads (one for raw video
and one for the subtitle format in question) and one raw video source
pad. They take care of synchronising the two input streams, and of
decoding and rendering the subtitles on top of the raw video stream.

Digression: one could theoretically have dedicated decoder/render
elements that output an AYUV or ARGB image, and then let a videomixer
element do the actual overlaying, but this is not very efficient,
because it requires us to allocate and blend whole pictures
(1920x1080 AYUV = 8MB, 1280x720 AYUV = 3.6MB, 720x576 AYUV = 1.6MB)
even if the overlay region is only a small rectangle at the bottom.
This wastes memory and CPU. We could do something better by introducing
a new format that only encodes the region(s) of interest, but we don't
have such a format yet, and are not necessarily keen to rewrite this
part of the playbin2 logic at this point - and we can't change existing
elements' behaviour, so we would need to introduce new elements for
this.

Playbin2 supports outputting compressed formats, i.e. it does not force
decoding to a raw format, but is happy to output in a non-raw format as
long as the sink supports that as well.

In the case of certain hardware-accelerated decoding APIs, we will make
use of that functionality. However, the decoder will then not output a
raw video format, but some kind of hardware/API-specific format (in the
caps), and the buffers will reference hardware/API-specific objects
that the hardware/API-specific sink knows how to handle.


 === 2. The Problem ===

In the case of such hardware-accelerated decoding, the decoder will not
output raw pixels that can easily be manipulated. Instead, it will
output hardware/API-specific objects that can later be used to render
a frame using the same API.

Even if we could transform such a buffer into raw pixels, we would most
likely want to avoid that, in order to avoid the need to map the data
back into system memory (and then later back to the GPU). It's much
better to upload the much smaller encoded data to the GPU/DSP and then
leave it there until rendered.

Currently playbin2 only supports subtitles on top of raw decoded video.
It will try to find a suitable overlay element from the plugin registry
based on the input subtitle caps and the elements' rank. (It is assumed
that we will be able to convert any raw video format into any format
required by the overlay using a converter such as ffmpegcolorspace.)
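
For illustration, such a rank-based registry lookup could look roughly
like the sketch below (function and variable names are made up; the
actual playbin2 code is more involved):

  /* sketch: find element factories whose sink pad template caps can
   * intersect with the subtitle caps, and which have a usable rank */
  static gboolean
  filter_overlay_factories (GstPluginFeature * feature, gpointer user_data)
  {
    const GstCaps *subtitle_caps = user_data;
    const GList *templs;

    if (!GST_IS_ELEMENT_FACTORY (feature))
      return FALSE;
    if (gst_plugin_feature_get_rank (feature) < GST_RANK_MARGINAL)
      return FALSE;

    templs = gst_element_factory_get_static_pad_templates (
        GST_ELEMENT_FACTORY (feature));
    for (; templs != NULL; templs = templs->next) {
      GstStaticPadTemplate *templ = templs->data;

      if (templ->direction == GST_PAD_SINK) {
        GstCaps *caps = gst_static_caps_get (&templ->static_caps);
        gboolean ok = gst_caps_can_intersect (caps, subtitle_caps);

        gst_caps_unref (caps);
        if (ok)
          return TRUE;
      }
    }
    return FALSE;
  }

  ...
  factories = gst_registry_feature_filter (gst_registry_get_default (),
      filter_overlay_factories, FALSE, subtitle_caps);
  /* ... then sort by rank and try the highest-ranked factory first */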

It will not render subtitles if the video sent to the sink is not raw
YUV or RGB, or if conversions have been disabled by setting the
native-video flag on playbin2.

Subtitle rendering is considered an important feature. Enabling
hardware-accelerated decoding by default should not lead to a major
feature regression in this area.

This means that we need to support subtitle rendering on top of
non-raw video.


 === 3. Possible Solutions ===

The goal is to keep knowledge of the subtitle format within the
format-specific GStreamer plugins, and knowledge of any specific video
acceleration API within the GStreamer plugins implementing that API.
We do not want to make the pango/dvbsuboverlay/dvdspu/kate plugins
link to libva/libvdpau/etc., and we do not want to make the
vaapi/vdpau plugins link to libpango/libkate/libass etc.

Multiple possible solutions come to mind:

 (a) backend-specific overlay elements

     e.g. vaapitextoverlay, vdpautextoverlay, vaapidvdspu, vdpaudvdspu,
     vaapidvbsuboverlay, vdpaudvbsuboverlay, etc.

     This assumes the overlay can be done directly on the
     backend-specific object passed around.

     The main drawback of this solution is that it leads to a lot of
     code duplication and may also lead to uncertainty about
     distributing certain duplicated pieces of code. The code
     duplication is pretty much unavoidable, since making textoverlay,
     dvbsuboverlay, dvdspu, kate, assrender, etc. available in the form
     of base classes to derive from is not really an option. Similarly,
     one would not really want the vaapi/vdpau plugin to depend on a
     bunch of other libraries such as libpango, libkate, libtiger,
     libass, etc.

     One could add some new kind of overlay plugin feature in
     combination with a generic base class of some sort, but in order
     to accommodate all the different cases and formats one would end
     up with quite a convoluted/tricky API.

     (Of course there could also be a GstFancyVideoBuffer that provides
     an abstraction for such accelerated video objects and that could
     provide an API to add overlays to it in a generic way, but in the
     end this is just a less generic variant of (c), and it is not
     clear that there are real benefits to a specialised solution vs. a
     more generic one.)

 (b) convert the backend-specific object to raw pixels and then overlay

     Even where technically possible, this is most likely very
     inefficient.

 (c) attach the overlay data to the backend-specific video frame
     buffers in a generic way and do the actual overlaying/blitting
     later in backend-specific code such as the video sink (or an
     accelerated encoder/transcoder)

     In this case, the actual overlay rendering (i.e. the actual text
     rendering or decoding of DVD/DVB data into pixels) is done in the
     subtitle-format-specific GStreamer plugin. All knowledge about the
     subtitle format is then contained in the overlay plugin, and all
     knowledge about the video backend in the video-backend-specific
     plugin.

     The main question then is how to get the overlay pixels (and we
     will only deal with pixels here) from the overlay element to the
     video sink.

     This could be done in multiple ways: one could send custom events
     downstream with the overlay data, or one could attach the overlay
     data directly to the video buffers in some way.

     Sending inline events has the advantage that it is fairly
     transparent to any elements between the overlay element and the
     video sink: if an effects plugin creates a new video buffer for
     the output, nothing special needs to be done to maintain the
     subtitle overlay information, since the overlay data is not
     attached to the buffer. However, it slightly complicates things at
     the sink, which would also need to look for the new event in
     question instead of just processing everything in its buffer
     render function.
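
     For illustration, the event-based variant might look roughly like
     this (hypothetical structure/field names; this assumes the
     composition object would be a GstMiniObject):

       /* overlay element: wrap the (hypothetical) composition in a
        * custom downstream event and send it before the video buffer */
       GstStructure *s;

       s = gst_structure_new ("application/x-gst-video-overlay",
           "composition", GST_TYPE_MINI_OBJECT, comp, NULL);
       gst_pad_push_event (overlay->srcpad,
           gst_event_new_custom (GST_EVENT_CUSTOM_DOWNSTREAM, s));

       /* video sink: intercept the event in its event handler and
        * store the composition for use in the render function */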

     If one attaches the overlay data to the buffer directly, any
     element between overlay and video sink that creates a new video
     buffer would need to be aware of the overlay data attached to it
     and copy it over to the newly-created buffer.
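
     As a sketch (the function names here are made up for
     illustration), such a filter element would have to do something
     like:

       /* copy the attached composition over to the newly-created
        * output buffer; an element that scales or crops the video
        * would also need to adjust the rectangles' render position
        * and dimensions accordingly */
       comp = gst_video_buffer_get_overlay_composition (inbuf);
       if (comp != NULL)
         gst_video_buffer_set_overlay_composition (outbuf, comp);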

     One would have to implement a special kind of new query (e.g. a
     FEATURE query) that is not passed on automatically by
     gst_pad_query_default() in order to make sure that all elements
     downstream will handle the attached overlay data. (This is only a
     problem if we also want to attach overlay data to raw video pixel
     buffers; for new non-raw types we can just make it mandatory,
     assume support, and be done with it; for existing non-raw types
     nothing changes anyway if subtitles don't work. We need to
     maintain backwards compatibility for existing raw video pipelines
     like e.g.: ..decoder ! suboverlay ! encoder..)

     Even though it is slightly more work, attaching the overlay
     information to buffers seems more intuitive than sending it
     interleaved as events. And buffers stored or passed around (e.g.
     via the "last-buffer" property in the sink when doing screenshots
     via playbin2) always contain all the information needed.

 (d) create a video/x-raw-*-delta format and use a backend-specific
     videomixer

     This possibility was hinted at already in the digression in
     section 1. It would satisfy the goal of keeping subtitle format
     knowledge in the subtitle plugins and video backend knowledge in
     the video backend plugin. It would also add a concept that might
     be generally useful (think ximagesrc capture with xdamage).
     However, it would require adding foorender variants of all the
     existing overlay elements, and changing playbin2 to that new
     design, which is somewhat intrusive. And given the general nature
     of such a new format/API, we would need to take a lot of care to
     be able to accommodate all possible use cases when designing the
     API, which makes it considerably more ambitious. Lastly, we would
     need to write videomixer variants for the various accelerated
     video backends as well.


Overall (c) appears to be the most promising solution. It is the least
intrusive and should be fairly straightforward to implement with
reasonable effort, requiring only small changes to existing elements
and no new elements.

Doing the final overlaying in the sink, as opposed to in a videomixer
or overlay in the middle of the pipeline, has other advantages:

 - if video frames need to be dropped, e.g. for QoS reasons, we could
   also skip the actual subtitle overlaying, and possibly the
   decoding/rendering as well, if the implementation and API allow for
   that to be delayed.

 - the sink often knows the actual size of the window/surface/screen
   the output video is rendered to. This *may* make it possible to
   render the overlay image at a higher resolution than the input
   video, solving a long-standing issue with pixelated subtitles on
   top of low-resolution videos that are then scaled up in the sink.
   This would of course require the rendering to be delayed, instead
   of just attaching an AYUV/ARGB/RGBA blob of pixels to the video
   buffer in the overlay element, but that could all be supported.

 - if the video backend / sink has support for high-quality text
   rendering (clutter?) we could just pass the text or pango markup to
   the sink and let it do the rest (this is unlikely to be supported
   in the general case - text and glyph rendering is hard; also, we
   don't really want to make up our own text markup system, and pango
   markup is probably too limited for complex karaoke stuff).


 === 4. API needed ===

 (a) Representation of subtitle overlays to be rendered

     We need to pass the overlay pixels from the overlay element to
     the sink somehow. Whatever the exact mechanism, let's assume we
     pass a refcounted GstVideoOverlayComposition struct or object.

     A composition is made up of one or more overlays/rectangles.

     In the simplest case an overlay rectangle is just a blob of
     RGBA/ABGR [FIXME?] or AYUV pixels with positioning info and other
     metadata, and there is only one rectangle to render.

     We're keeping the naming generic ("OverlayFoo" rather than
     "SubtitleFoo") here, since this might also be handy for other use
     cases such as e.g. logo overlays. It is not designed for
     full-fledged video stream mixing though.

     // Note: don't mind the exact implementation details, they'll be
     // hidden

     // FIXME: might be confusing in 0.11 though, since GstXOverlay
     // was renamed to GstVideoOverlay in 0.11; not much we can do,
     // maybe we can rename GstVideoOverlay to something better

     struct GstVideoOverlayComposition
     {
       guint num_rectangles;
       GstVideoOverlayRectangle ** rectangles;

       /* lowest rectangle sequence number still used by the upstream
        * overlay element. This way a renderer maintaining some kind of
        * rectangles <-> surface cache can know when to free cached
        * surfaces/rectangles. */
       guint min_seq_num_used;

       /* sequence number for the composition (same series as the
        * rectangles) */
       guint seq_num;
     };

     struct GstVideoOverlayRectangle
     {
       /* Position on the video frame and dimensions of the output
        * rectangle in output frame terms (already adjusted for the
        * PAR of the output frame). x/y can be negative (the overlay
        * will be clipped then) */
       gint x, y;
       guint render_width, render_height;

       /* Dimensions of the overlay pixels */
       guint width, height, stride;

       /* The PAR of the overlay pixels */
       guint par_n, par_d;

       /* Format of the pixels: GST_VIDEO_FORMAT_ARGB on big-endian
        * systems and BGRA on little-endian systems (i.e. pixels are
        * treated as 32-bit values with alpha always in the
        * most-significant byte and blue in the least-significant
        * byte).
        *
        * FIXME: does anyone actually use AYUV in practice? (we do in
        * our utility function to blend on top of raw video)
        * What about AYUV and endianness? Do we always have
        * [A][Y][U][V] in memory? */
       /* FIXME: maybe use our own enum? */
       GstVideoFormat format;

       /* Refcounted blob of memory, no caps or timestamps */
       GstBuffer *pixels;

       // FIXME: how to express a source like text or pango markup?
       // (just add a source type enum + a source buffer with the data)
       //
       // FOR 0.10: always send pixel blobs, but attach the source
       // data in addition (reason: if downstream changes, we can't
       // renegotiate that properly, if we just do a query of
       // supported formats from the start). The sink will just ignore
       // the pixels and use the pango markup from the source data if
       // it supports that.
       //
       // FOR 0.11: the overlay should query the formats (pango
       // markup, pixels) supported by downstream and then only send
       // those. We can renegotiate via the reconfigure event.

       /* sequence number: useful for backends/renderers/sinks that
        * want to maintain a cache of rectangles <-> surfaces. The
        * value of min_seq_num_used in the composition tells the
        * renderer which rectangles have expired. */
       guint seq_num;

       /* FIXME: we also need a (private) way to cache
        * converted/scaled pixel blobs */
     };
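
     In code, the pixel format convention described above boils down
     to something like this (a sketch; the macro name is made up):

       /* pixels are 32-bit values with alpha in the most-significant
        * byte, so the memory byte order depends on the host CPU */
       #if G_BYTE_ORDER == G_LITTLE_ENDIAN
       #define OVERLAY_FORMAT GST_VIDEO_FORMAT_BGRA
       #else
       #define OVERLAY_FORMAT GST_VIDEO_FORMAT_ARGB
       #endif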

 (a1) Overlay consumer API:

     How this could work in a video sink that supports scaling of
     textures (pseudocode):

     gst_foo_sink_render () {
       /* assume only one composition for now */
       if video_buffer has composition:
         composition = video_buffer.get_composition()

         for each rectangle in composition:
           if rectangle.source_data_type == PANGO_MARKUP
             actor = text_from_pango_markup (rectangle.get_source_data())
           else
             pixels = rectangle.get_pixels_unscaled (FORMAT_RGBA, ...)
             actor = texture_from_rgba (pixels, ...)

           ... position + scale on top of video surface ...
     }

 (a2) Overlay producer API:

     e.g. logo or subpicture overlay: got pixels, stuff them into a
     rectangle:

     if (logoverlay->cached_composition == NULL) {
       comp = composition_new ();

       rect = rectangle_new (format, pixels_buf,
           width, height, stride, par_n, par_d,
           x, y, render_width, render_height);

       /* composition adds its own ref for the rectangle */
       composition_add_rectangle (comp, rect);
       rectangle_unref (rect);

       /* buffer adds its own ref for the composition */
       video_buffer_attach_composition (video_buffer, comp);

       /* we take ownership of the composition and save it for later */
       logoverlay->cached_composition = comp;
     } else {
       video_buffer_attach_composition (video_buffer,
           logoverlay->cached_composition);
     }

     FIXME: also add some API to modify the render position/dimensions
     of a rectangle (probably requires the creation of a new rectangle,
     unless we handle writability like with other mini objects).

 (b) Fallback overlay rendering/blitting on top of raw video

     Eventually we want to use this overlay mechanism not only for
     hardware-accelerated video, but also for plain old raw video,
     either at the sink or in the overlay element directly.

     Apart from the advantages listed earlier in section 3, this
     allows us to consolidate in one place a lot of overlaying/blitting
     code that is currently repeated in every single overlay element.
     This makes it considerably easier to support a whole range of raw
     video formats out of the box, to add SIMD-optimised rendering
     using ORC, and to handle corner cases correctly.

     (Note: a side-effect of overlaying raw video at the video sink is
     that if e.g. a screenshotter gets the last buffer via the
     last-buffer property of basesink, it would get an image without
     the subtitles on top. This could probably be fixed by
     re-implementing the property in GstVideoSink, though. Playbin2
     could also handle this internally.)
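
     From the overlay element's point of view, the raw video fallback
     would then boil down to something like this (a sketch, using the
     utility function outlined right below):

       /* hypothetical fallback path in an overlay element: blend the
        * current composition into the (writable) raw video buffer and
        * push it out as usual */
       buf = gst_buffer_make_writable (buf);
       gst_video_overlay_composition_blend (overlay->composition, buf);
       ret = gst_pad_push (overlay->srcpad, buf);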

     void
     gst_video_overlay_composition_blend (GstVideoOverlayComposition * comp,
         GstBuffer * video_buf)
     {
       guint n;

       g_return_if_fail (gst_buffer_is_writable (video_buf));
       g_return_if_fail (GST_BUFFER_CAPS (video_buf) != NULL);

       ... parse video_buf caps into BlendVideoFormatInfo ...

       for each rectangle in the composition: {

         if (gst_video_format_is_yuv (video_buf_format)) {
           overlay_format = FORMAT_AYUV;
         } else if (gst_video_format_is_rgb (video_buf_format)) {
           overlay_format = FORMAT_ARGB;
         } else {
           /* FIXME: grayscale? */
           return;
         }

         /* this will scale and convert AYUV<->ARGB if needed */
         pixels = rectangle_get_pixels_scaled (rectangle, overlay_format);

         ... clip output rectangle ...

         __do_blend (video_buf_format, video_buf->data,
             overlay_format, pixels->data,
             x, y, width, height, stride);

         gst_buffer_unref (pixels);
       }
     }

 (c) Flatten all rectangles in a composition

     We cannot assume that the video backend API can handle any number
     of rectangle overlays; it is possible that it only supports one
     single overlay, in which case we need to squash all rectangles
     into one.

     However, we'll just declare this a corner case for now, and
     implement it only if someone actually needs it. It's easy to add
     later, API-wise. It might be a bit tricky if we have rectangles
     with different PARs/formats (e.g. subs and a logo), though we
     could probably always just use the code from (b) with a fully
     transparent video buffer to create a flattened overlay buffer.

 (d) core API: new FEATURE query

     For 0.10 we need to add a FEATURE query, so the overlay element
     can query whether the sink downstream and all elements between
     the overlay element and the sink support the new overlay API.
     Elements in between need to support it because render positions
     and dimensions need to be updated if the video is cropped or
     rescaled, for example.

     In order to ensure that all elements support the new API, we need
     to drop the query in the pad default query handler (so that it
     only succeeds if all elements handle it explicitly).

     We might want two variants of the feature query - one where all
     elements in the chain need to support it explicitly, and one
     where it is enough if some element downstream supports it.

     In 0.11 this could probably be handled via GstMeta and ALLOCATION
     queries (and/or we could simply require elements to be aware of
     this API from the start).

     There appears to be no issue with downstream possibly not being
     linked yet at the time when an overlay would want to do such a
     query.
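
     In 0.10 terms, such a query could be built on top of the existing
     custom/application query mechanism; a sketch (the structure name
     and element are made up):

       /* overlay element: ask whether everything downstream handles
        * attached overlay compositions; the default query handler
        * must *not* pass this on, so it only succeeds if every
        * element implements it explicitly */
       static gboolean
       gst_foo_overlay_check_downstream_support (GstFooOverlay * overlay)
       {
         GstQuery *query;
         gboolean res;

         query = gst_query_new_application (GST_QUERY_CUSTOM,
             gst_structure_empty_new ("GstOverlayComposition/feature"));
         res = gst_pad_peer_query (overlay->srcpad, query);
         gst_query_unref (query);

         return res;
       }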

Other considerations:

 - renderers (overlays or sinks) may be able to handle only ARGB or
   only AYUV (for most graphics/hw APIs it's likely ARGB of some sort,
   while our blending utility functions will likely want the same
   colour space as the underlying raw video format, which is usually
   YUV of some sort). We need to convert where required, and should
   cache the conversion.

 - renderers may or may not be able to scale the overlay. We need to
   do the scaling internally if not (simple case: just horizontal
   scaling to adjust for PAR differences; complex case: both
   horizontal and vertical scaling, e.g. if the subs come from a
   different source than the video, or the video has been rescaled or
   cropped between overlay element and sink).

 - renderers may be able to generate (possibly scaled) pixels on
   demand from the original data (e.g. a string or RLE-encoded data).
   We will ignore this for now, since this functionality can still be
   added later via API additions. The most interesting case would be
   to pass a pango markup string, since e.g. clutter can handle that
   natively.

 - renderers may be able to write data directly on top of the video
   pixels (instead of creating an intermediary buffer with the overlay
   which is then blended on top of the actual video frame), e.g.
   dvdspu, dvbsuboverlay.

   However, in the interest of simplicity, we should probably ignore
   the fact that some elements can blend their overlays directly on
   top of the video (decoding/uncompressing them on the fly), even
   more so as it's not obvious that it's actually faster to decode the
   same overlay 70-90 times (say) (i.e. ca. 3 seconds of video frames)
   and then blend it 70-90 times, instead of decoding it once into a
   temporary buffer and then blending it directly from there, possibly
   SIMD-accelerated. Also, this is only relevant if the video is raw
   video and not some hardware-acceleration backend object.

   And ultimately it is the overlay element that decides whether to do
   the overlay right there and then or have the sink do it (if
   supported). It could decide to keep doing the overlay itself for
   raw video and only use our new API for non-raw video.

 - renderers may want to make sure they upload the overlay pixels only
   once per rectangle if that rectangle recurs in subsequent frames
   (as part of the same composition or a different composition), as is
   likely. This caching of e.g. surfaces needs to be done
   renderer-side and can be accomplished based on the sequence
   numbers. The composition contains the lowest sequence number still
   in use upstream (an overlay element may want to cache created
   compositions+rectangles as well, after all, to re-use them for
   multiple frames); based on that, the renderer can expire cached
   objects (see the sketch after this list). The caching needs to be
   done renderer-side because attaching renderer-specific objects to
   the rectangles won't work well given the refcounted nature of
   rectangles and compositions, which makes it unpredictable when a
   rectangle or composition will be freed and from which thread
   context it will be freed. The renderer-specific objects are likely
   bound to other types of renderer-specific contexts, and need to be
   managed in connection with those.

 - compositions/rectangles should internally provide a certain degree
   of thread-safety. Multiple elements (sinks, overlay element) might
   access or use the same objects from multiple threads at the same
   time, and it is expected that elements will keep a ref to
   compositions and rectangles they push downstream for a while, e.g.
   until the current subtitle composition expires.
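
Renderer-side cache expiry based on the sequence numbers could look
roughly like this (a sketch; CachedSurface and the surrounding sink
element are made up):

  /* free cached surfaces whose rectangles can no longer recur:
   * anything with a sequence number below min_seq_num_used is no
   * longer referenced by the upstream overlay element */
  static void
  gst_foo_sink_expire_surface_cache (GstFooSink * sink, guint min_seq_num_used)
  {
    GList *walk = sink->cached_surfaces;

    while (walk != NULL) {
      GList *next = walk->next;
      CachedSurface *surface = walk->data;

      if (surface->seq_num < min_seq_num_used) {
        cached_surface_free (surface);
        sink->cached_surfaces =
            g_list_delete_link (sink->cached_surfaces, walk);
      }
      walk = next;
    }
  }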

 === 5. Future considerations ===

 - alternatives: there may be multiple versions/variants of the same
   subtitle stream. On DVDs, there may be a 4:3 version and a 16:9
   version of the same subtitles. We could attach both variants and
   let the renderer pick the best one for the situation (currently we
   just use the 16:9 version). With totem, it's ultimately totem that
   adds the 'black bars' at the top/bottom, so totem also knows
   whether it's got a 4:3 display and whether it can/wants to fit 4:3
   subs (which may render on top of the bars) or not, for example.

 === 6. Misc. FIXMEs ===

TEST: the following three pipelines should look (roughly) alike (note
the text distortion) - needs fixing in textoverlay

gst-launch-0.10 \
 videotestsrc ! video/x-raw-yuv,width=640,height=480,pixel-aspect-ratio=1/1 \
   ! textoverlay text=Hello font-desc=72 ! xvimagesink \
 videotestsrc ! video/x-raw-yuv,width=320,height=480,pixel-aspect-ratio=2/1 \
   ! textoverlay text=Hello font-desc=72 ! xvimagesink \
 videotestsrc ! video/x-raw-yuv,width=640,height=240,pixel-aspect-ratio=1/2 \
   ! textoverlay text=Hello font-desc=72 ! xvimagesink

 ~~~ THE END ~~~