A gentle introduction to video technology, although it's aimed at software developers / engineers, we want to make it easy for anyone to learn. This idea was born during a mini workshop for newcomers to video technology.
The goal is to introduce some digital video concepts with a simple vocabulary, lots of visual elements and practical examples when possible, and make this knowledge available everywhere. Please, feel free to send corrections, suggestions and improve it.
There will be hands-on sections which require you to have docker installed and this repository cloned.
git clone https://github.com/leandromoreira/digital_video_introduction.git cd digital_video_introduction ./setup.sh
WARNING: when you see a
./s/ffmpeg
or./s/mediainfo
command, it means we're running a containerized version of that program, which already includes all the needed requirements.
All the hands-on should be performed from the folder you cloned this repository. For the jupyter examples you must start the server ./s/start_jupyter.sh
and copy the URL and use it in your browser.
An image can be thought of as a 2D matrix. If we think about colors, we can extrapolate this idea seeing this image as a 3D matrix where the additional dimensions are used to provide color data.
If we chose to represent these colors using the primary colors (red, green and blue), we define three planes: the first one for red, the second for green, and the last one for the blue color.
We'll call each point in this matrix a pixel (picture element). One pixel represents the intensity (usually a numeric value) of a given color. For example, a red pixel means 0 of green, 0 of blue and maximum of red. The pink color pixel can be formed with a combination of the three colors. Using a representative numeric range from 0 to 255, the pink pixel is defined by Red=255, Green=192 and Blue=203.
Other ways to encode a color image
Many other possible models may be used to represent the colors that make up an image. We could, for instance, use an indexed palette where we'd only need a single byte to represent each pixel instead of the 3 needed when using the RGB model. In such a model we could use a 2D matrix instead of a 3D matrix to represent our color, this would save on memory but yield fewer color options.
For instance, look at the picture down below. The first face is fully colored. The others are the red, green, and blue planes (shown as gray tones).
We can see that the red color will be the one that contributes more (the brightest parts in the second face) to the final color while the blue color contribution can be mostly only seen in Mario's eyes (last face) and part of his clothes, see how all planes contribute less (darkest parts) to the Mario's mustache.
And each color intensity requires a certain amount of bits, this quantity is known as bit depth. Let's say we spend 8 bits (accepting values from 0 to 255) per color (plane), therefore we have a color depth of 24 bits (8 bits * 3 planes R/G/B), and we can also infer that we could use 2 to the power of 24 different colors.
It's great to learn how an image is captured from the world to the bits.
Another property of an image is the resolution, which is the number of pixels in one dimension. It is often presented as width × height, for example, the 4×4 image below.
Hands-on: play around with image and color
You can play around with image and colors using jupyter (python, numpy, matplotlib and etc).
You can also learn how image filters (edge detection, sharpen, blur...) work.
Another property we can see while working with images or video is the aspect ratio which simply describes the proportional relationship between width and height of an image or pixel.
When people says this movie or picture is 16x9 they usually are referring to the Display Aspect Ratio (DAR), however we also can have different shapes of individual pixels, we call this Pixel Aspect Ratio (PAR).
DVD is DAR 4:3
Although the real resolution of a DVD is 704x480 it still keeps a 4:3 aspect ratio because it has a PAR of 10:11 (704x10/480x11)
Finally, we can define a video as a succession of n frames in time which can be seen as another dimension, n is the frame rate or frames per second (FPS).
The number of bits per second needed to show a video is its bit rate.
bit rate = width * height * bit depth * frames per second
For example, a video with 30 frames per second, 24 bits per pixel, resolution of 480x240 will need 82,944,000 bits per second or 82.944 Mbps (30x480x240x24) if we don't employ any kind of compression.
When the bit rate is nearly constant it's called constant bit rate (CBR) but it also can vary then called variable bit rate (VBR).
This graph shows a constrained VBR which doesn't spend too many bits while the frame is black.
In the early days, engineers came up with a technique for doubling the perceived frame rate of a video display without consuming extra bandwidth. This technique is known as interlaced video; it basically sends half of the screen in 1 "frame" and the other half in the next "frame".
Today screens render mostly using progressive scan technique. Progressive is a way of displaying, storing, or transmitting moving images in which all the lines of each frame are drawn in sequence.
Now we have an idea about how an image is represented digitally, how its colors are arranged, how many bits per second do we spend to show a video, if it's constant (CBR) or variable (VBR), with a given resolution using a given frame rate and many other terms such as interlaced, PAR and others.
Hands-on: Check video properties
You can check most of the explained properties with ffmpeg or mediainfo.
We learned that it's not feasible to use video without any compression; a single one hour video at 720p resolution with 30fps would require 278GB<sup>*</sup>. Since using solely lossless data compression algorithms like DEFLATE (used in PKZIP, Gzip, and PNG), won't decrease the required bandwidth sufficiently we need to find other ways to compress the video.
<sup>*</sup> We found this number by multiplying 1280 x 720 x 24 x 30 x 3600 (width, height, bits per pixel, fps and time in seconds)
In order to do this, we can exploit how our vision works. We're better at distinguishing brightness than colors, the repetitions in time, a video contains a lot of images with few changes, and the repetitions within the image, each frame also contains many areas using the same or similar color.
Our eyes are more sensitive to brightness than colors, you can test it for yourself, look at this picture.
If you are unable to see that the colors of the squares A and B are identical on the left side, that's fine, it's our brain playing tricks on us to pay more attention to light and dark than color. There is a connector, with the same color, on the right side so we (our brain) can easily spot that in fact, they're the same color.
Simplistic explanation of how our eyes work The eye is a complex organ, it is composed of many parts but we are mostly interested in the cones and rods cells. The eye contains about 120 million rod cells and 6 million cone cells.
To oversimplify, let's try to put colors and brightness in the eye's parts function. The rod cells are mostly responsible for brightness while the cone cells are responsible for color, there are three types of cones, each with different pigment, namely: S-cones (Blue), M-cones (Green) and L-cones (Red).
Since we have many more rod cells (brightness) than cone cells (color), one can infer that we are more capable of distinguishing dark and light than colors.
Contrast sensitivity functions
Researchers of experimental psychology and many other fields have developed many theories on human vision. And one of them is called Contrast sensitivity functions. They are related to spatio and temporal of the light and their value presents at given init light, how much change is required before an observer reported there was a change. Notice the plural of the word "function", this is for the reason that we can measure Contrast sensitivity functions with not only black-white but also colors. The result of these experiments shows that in most cases our eyes are more sensitive to brightness than color.
Once we know that we're more sensitive to luma (the brightness in an image) we can try to exploit it.
We first learned how to color images work using the RGB model, but there are other models too. In fact, there is a model that separates luma (brightness) from chrominance (colors) and it is known as YCbCr<sup>*</sup>.
<sup>*</sup> there are more models which do the same separation.
This color model uses Y to represent the brightness and two color channels Cb (chroma blue) and Cr (chroma red). The YCbCr can be derived from RGB and it also can be converted back to RGB. Using this model we can create full colored images as we can see down below.
Some may argue, how can we produce all the colors without using the green?
To answer this question, we'll walk through a conversion from RGB to YCbCr. We'll use the coefficients from the standard BT.601 that was recommended by the group ITU-R<sup>*</sup> . The first step is to calculate the luma, we'll use the constants suggested by ITU and replace the RGB values.
Y = 0.299R + 0.587G + 0.114B
Once we had the luma, we can split the colors (chroma blue and red):
Cb = 0.564(B - Y)
Cr = 0.713(R - Y)
And we can also convert it back and even get the green by using YCbCr.
R = Y + 1.402Cr
B = Y + 1.772Cb
G = Y -
字节跳动发布的AI编程神器IDE
Trae是一种自适应的集成开发环境(IDE),通过自动化和多元协作改变开发流程。利用Trae,团队能够更快速、精确地编写和部署代码,从而提高编程效率和项目交付速度。Trae具备上下文感知和代码自动完成功能,是提升开发效率的理想工具。
全能AI智能助手,随时解答生活与工作的多样问题
问小白,由元石科技研发的AI智能助手,快速准确地解答各种生活和工作问题,包括但不限于搜索、规划和社交互动,帮助用户在日常生活中提高效率,轻松管理个人事务。
实时语音翻译/同声传译工具
Transly是一个多场景的AI大语言模型驱动的同声传译、专业翻译助手,它拥有超精准的音频识别翻译能力,几乎零延迟的使用体验和支持多国语言可以让你带它走遍全球,无论你是留学生、商务人士、韩剧美剧爱好者,还是出国游玩、多国会议、跨国追星等等,都可以满足你所有需要同传的场景需求,线上线下通用,扫除语言障碍,让全世界的语言交流不再有国界。
一键生成PPT和Word,让学习生活更轻松
讯飞智文是一个利用 AI 技术的项目,能够帮助用户生成 PPT 以及各类文档。无论是商业领域的市场分析报告、年度目标制定,还是学生群体的职业生涯规划、实习避坑指南,亦或是活动策划、旅游攻略等内容,它都能提供支持,帮助用户精准表达,轻松呈现各种信息。
深度推理能力全新升级,全面对标OpenAI o1
科大讯飞的星火大模型,支持语言理解、知识问答和文本创作等多功能,适用于多种文件和业务场景,提升办公和日常生活的效率。讯飞星火是一个提供丰富智能服务的平台,涵盖科技资讯、图像创作、写作辅助、编程解答、科研文献解读等功能,能为不同需求的用户提供便捷高效的帮助,助力用户轻松获取信息、解决问题,满足多样化使用场景。
一种基于大语言模型的高效单流解耦语音令牌文本到语音合成模型
Spark-TTS 是一个基于 PyTorch 的开源文本到语音合成项目,由多个知名机构联合参与。该项目提供了高效的 LLM(大语言模型)驱动的语音合成方案,支持语音克隆和语音创建功能,可通过命令行界面(CLI)和 Web UI 两种方式使用。用户可以根据需求调整语音的性别、音高、速度等参数,生成高质量的语音。该项目适用于多种场景,如有声读物制作、智能语音助手开发等。
AI助力,做PPT更简单!
咔片是一款轻量化在线演示设计工具,借助 AI 技术,实现从内容生成到智能设计的一站式 PPT 制作服务。支持多种文档格式导入生成 PPT,提供海量模板、智能美化、素材替换等功能,适用于销售、教师、学生等各类人群,能高效制作出高品质 PPT,满足不同场景演示需求。
选题、配图、成文,一站式创作,让内容运营更高效
讯飞绘文,一个AI集成平台,支持写作、选题、配图、排版和发布。高效生成适用于各类媒体的定制内容,加速品牌传播,提升内容营销效果。
专业的AI公文写作平台,公文写作神器
AI 材料星,专业的 AI 公文写作辅助平台,为体制内工作人员提供高效的公文写作解决方案。拥有海量公文文库、9 大核心 AI 功能,支持 30 + 文稿类型生成,助力快速完成领导讲话、工作总结、述职报告等材料,提升办公效率,是体制打工人的得力写作神器。
OpenAI Agents SDK,助力开发者便捷使用 OpenAI 相关功能。
openai-agents-python 是 OpenAI 推出的一款强大 Python SDK,它为开发者提供了与 OpenAI 模型交互的高效工具,支持工具调用、结果处理、追踪等功能,涵盖多种应用场景,如研究助手、财务研究等,能显著提升开发效率,让开发者更轻松地利用 OpenAI 的技术优势。
最新AI工具、AI资讯
独家AI资源、AI项目落地
微信扫一扫关注公众号