digital_video_introduction

digital_video_introduction

通俗易懂的数字视频技术入门指南

本教程通过简洁易懂的语言和丰富的视觉元素,介绍数字视频技术的核心概念。涵盖基础术语、冗余去除、编解码器原理和在线流媒体等主题,并提供实践环节。适合视频技术初学者,无需专业背景即可轻松掌握。

视频技术编码压缩色彩模型帧类型Github开源项目

🇺🇸 🇨🇳 🇯🇵 🇮🇹 🇰🇷 🇷🇺 🇧🇷 🇪🇸

license

Intro

A gentle introduction to video technology, although it's aimed at software developers / engineers, we want to make it easy for anyone to learn. This idea was born during a mini workshop for newcomers to video technology.

The goal is to introduce some digital video concepts with a simple vocabulary, lots of visual elements and practical examples when possible, and make this knowledge available everywhere. Please, feel free to send corrections, suggestions and improve it.

There will be hands-on sections which require you to have docker installed and this repository cloned.

git clone https://github.com/leandromoreira/digital_video_introduction.git cd digital_video_introduction ./setup.sh

WARNING: when you see a ./s/ffmpeg or ./s/mediainfo command, it means we're running a containerized version of that program, which already includes all the needed requirements.

All the hands-on should be performed from the folder you cloned this repository. For the jupyter examples you must start the server ./s/start_jupyter.sh and copy the URL and use it in your browser.

Changelog

  • added DRM system
  • released version 1.0.0
  • added simplified Chinese translation
  • added FFmpeg oscilloscope filter example
  • added Brazilian Portuguese translation
  • added Spanish translation

Index

Basic terminology

An image can be thought of as a 2D matrix. If we think about colors, we can extrapolate this idea seeing this image as a 3D matrix where the additional dimensions are used to provide color data.

If we chose to represent these colors using the primary colors (red, green and blue), we define three planes: the first one for red, the second for green, and the last one for the blue color.

an image is a 3d matrix RGB

We'll call each point in this matrix a pixel (picture element). One pixel represents the intensity (usually a numeric value) of a given color. For example, a red pixel means 0 of green, 0 of blue and maximum of red. The pink color pixel can be formed with a combination of the three colors. Using a representative numeric range from 0 to 255, the pink pixel is defined by Red=255, Green=192 and Blue=203.

Other ways to encode a color image

Many other possible models may be used to represent the colors that make up an image. We could, for instance, use an indexed palette where we'd only need a single byte to represent each pixel instead of the 3 needed when using the RGB model. In such a model we could use a 2D matrix instead of a 3D matrix to represent our color, this would save on memory but yield fewer color options.

NES palette

For instance, look at the picture down below. The first face is fully colored. The others are the red, green, and blue planes (shown as gray tones).

RGB channels intensity

We can see that the red color will be the one that contributes more (the brightest parts in the second face) to the final color while the blue color contribution can be mostly only seen in Mario's eyes (last face) and part of his clothes, see how all planes contribute less (darkest parts) to the Mario's mustache.

And each color intensity requires a certain amount of bits, this quantity is known as bit depth. Let's say we spend 8 bits (accepting values from 0 to 255) per color (plane), therefore we have a color depth of 24 bits (8 bits * 3 planes R/G/B), and we can also infer that we could use 2 to the power of 24 different colors.

It's great to learn how an image is captured from the world to the bits.

Another property of an image is the resolution, which is the number of pixels in one dimension. It is often presented as width × height, for example, the 4×4 image below.

image resolution

Hands-on: play around with image and color

You can play around with image and colors using jupyter (python, numpy, matplotlib and etc).

You can also learn how image filters (edge detection, sharpen, blur...) work.

Another property we can see while working with images or video is the aspect ratio which simply describes the proportional relationship between width and height of an image or pixel.

When people says this movie or picture is 16x9 they usually are referring to the Display Aspect Ratio (DAR), however we also can have different shapes of individual pixels, we call this Pixel Aspect Ratio (PAR).

display aspect ratio

pixel aspect ratio

DVD is DAR 4:3

Although the real resolution of a DVD is 704x480 it still keeps a 4:3 aspect ratio because it has a PAR of 10:11 (704x10/480x11)

Finally, we can define a video as a succession of n frames in time which can be seen as another dimension, n is the frame rate or frames per second (FPS).

video

The number of bits per second needed to show a video is its bit rate.

bit rate = width * height * bit depth * frames per second

For example, a video with 30 frames per second, 24 bits per pixel, resolution of 480x240 will need 82,944,000 bits per second or 82.944 Mbps (30x480x240x24) if we don't employ any kind of compression.

When the bit rate is nearly constant it's called constant bit rate (CBR) but it also can vary then called variable bit rate (VBR).

This graph shows a constrained VBR which doesn't spend too many bits while the frame is black.

constrained vbr

In the early days, engineers came up with a technique for doubling the perceived frame rate of a video display without consuming extra bandwidth. This technique is known as interlaced video; it basically sends half of the screen in 1 "frame" and the other half in the next "frame".

Today screens render mostly using progressive scan technique. Progressive is a way of displaying, storing, or transmitting moving images in which all the lines of each frame are drawn in sequence.

interlaced vs progressive

Now we have an idea about how an image is represented digitally, how its colors are arranged, how many bits per second do we spend to show a video, if it's constant (CBR) or variable (VBR), with a given resolution using a given frame rate and many other terms such as interlaced, PAR and others.

Hands-on: Check video properties

You can check most of the explained properties with ffmpeg or mediainfo.

Redundancy removal

We learned that it's not feasible to use video without any compression; a single one hour video at 720p resolution with 30fps would require 278GB<sup>*</sup>. Since using solely lossless data compression algorithms like DEFLATE (used in PKZIP, Gzip, and PNG), won't decrease the required bandwidth sufficiently we need to find other ways to compress the video.

<sup>*</sup> We found this number by multiplying 1280 x 720 x 24 x 30 x 3600 (width, height, bits per pixel, fps and time in seconds)

In order to do this, we can exploit how our vision works. We're better at distinguishing brightness than colors, the repetitions in time, a video contains a lot of images with few changes, and the repetitions within the image, each frame also contains many areas using the same or similar color.

Colors, Luminance and our eyes

Our eyes are more sensitive to brightness than colors, you can test it for yourself, look at this picture.

luminance vs color

If you are unable to see that the colors of the squares A and B are identical on the left side, that's fine, it's our brain playing tricks on us to pay more attention to light and dark than color. There is a connector, with the same color, on the right side so we (our brain) can easily spot that in fact, they're the same color.

Simplistic explanation of how our eyes work The eye is a complex organ, it is composed of many parts but we are mostly interested in the cones and rods cells. The eye contains about 120 million rod cells and 6 million cone cells.

To oversimplify, let's try to put colors and brightness in the eye's parts function. The rod cells are mostly responsible for brightness while the cone cells are responsible for color, there are three types of cones, each with different pigment, namely: S-cones (Blue), M-cones (Green) and L-cones (Red).

Since we have many more rod cells (brightness) than cone cells (color), one can infer that we are more capable of distinguishing dark and light than colors.

eyes composition

Contrast sensitivity functions

Researchers of experimental psychology and many other fields have developed many theories on human vision. And one of them is called Contrast sensitivity functions. They are related to spatio and temporal of the light and their value presents at given init light, how much change is required before an observer reported there was a change. Notice the plural of the word "function", this is for the reason that we can measure Contrast sensitivity functions with not only black-white but also colors. The result of these experiments shows that in most cases our eyes are more sensitive to brightness than color.

Once we know that we're more sensitive to luma (the brightness in an image) we can try to exploit it.

Color model

We first learned how to color images work using the RGB model, but there are other models too. In fact, there is a model that separates luma (brightness) from chrominance (colors) and it is known as YCbCr<sup>*</sup>.

<sup>*</sup> there are more models which do the same separation.

This color model uses Y to represent the brightness and two color channels Cb (chroma blue) and Cr (chroma red). The YCbCr can be derived from RGB and it also can be converted back to RGB. Using this model we can create full colored images as we can see down below.

ycbcr example

Converting between YCbCr and RGB

Some may argue, how can we produce all the colors without using the green?

To answer this question, we'll walk through a conversion from RGB to YCbCr. We'll use the coefficients from the standard BT.601 that was recommended by the group ITU-R<sup>*</sup> . The first step is to calculate the luma, we'll use the constants suggested by ITU and replace the RGB values.

Y = 0.299R + 0.587G + 0.114B

Once we had the luma, we can split the colors (chroma blue and red):

Cb = 0.564(B - Y)
Cr = 0.713(R - Y)

And we can also convert it back and even get the green by using YCbCr.

R = Y + 1.402Cr
B = Y + 1.772Cb
G = Y -

编辑推荐精选

商汤小浣熊

商汤小浣熊

最强AI数据分析助手

小浣熊家族Raccoon,您的AI智能助手,致力于通过先进的人工智能技术,为用户提供高效、便捷的智能服务。无论是日常咨询还是专业问题解答,小浣熊都能以快速、准确的响应满足您的需求,让您的生活更加智能便捷。

imini AI

imini AI

像人一样思考的AI智能体

imini 是一款超级AI智能体,能根据人类指令,自主思考、自主完成、并且交付结果的AI智能体。

Keevx

Keevx

AI数字人视频创作平台

Keevx 一款开箱即用的AI数字人视频创作平台,广泛适用于电商广告、企业培训与社媒宣传,让全球企业与个人创作者无需拍摄剪辑,就能快速生成多语言、高质量的专业视频。

即梦AI

即梦AI

一站式AI创作平台

提供 AI 驱动的图片、视频生成及数字人等功能,助力创意创作

扣子-AI办公

扣子-AI办公

AI办公助手,复杂任务高效处理

AI办公助手,复杂任务高效处理。办公效率低?扣子空间AI助手支持播客生成、PPT制作、网页开发及报告写作,覆盖科研、商业、舆情等领域的专家Agent 7x24小时响应,生活工作无缝切换,提升50%效率!

TRAE编程

TRAE编程

AI辅助编程,代码自动修复

Trae是一种自适应的集成开发环境(IDE),通过自动化和多元协作改变开发流程。利用Trae,团队能够更快速、精确地编写和部署代码,从而提高编程效率和项目交付速度。Trae具备上下文感知和代码自动完成功能,是提升开发效率的理想工具。

AI工具TraeAI IDE协作生产力转型热门
��蛙蛙写作

蛙蛙写作

AI小说写作助手,一站式润色、改写、扩写

蛙蛙写作—国内先进的AI写作平台,涵盖小说、学术、社交媒体等多场景。提供续写、改写、润色等功能,助力创作者高效优化写作流程。界面简洁,功能全面,适合各类写作者提升内容品质和工作效率。

AI辅助写作AI工具蛙蛙写作AI写作工具学术助手办公助手营销助手AI助手
问小白

问小白

全能AI智能助手,随时解答生活与工作的多样问题

问小白,由元石科技研发的AI智能助手,快速准确地解答各种生活和工作问题,包括但不限于搜索、规划和社交互动,帮助用户在日常生活中提高效率,轻松管理个人事务。

热门AI助手AI对话AI工具聊天机器人
Transly

Transly

实时语音翻译/同声传译工具

Transly是一个多场景的AI大语言模型驱动的同声传译、专业翻译助手,它拥有超精准的音频识别翻译能力,几乎零延迟的使用体验和支持多国语言可以让你带它走遍全球,无论你是留学生、商务人士、韩剧美剧爱好者,还是出国游玩、多国会议、跨国追星等等,都可以满足你所有需要同传的场景需求,线上线下通用,扫除语言障碍,让全世界的语言交流不再有国界。

讯飞智文

讯飞智文

一键生成PPT和Word,让学习生活更轻松

讯飞智文是一个利用 AI 技术的项目,能够帮助用户生成 PPT 以及各类文档。无论是商业领域的市场分析报告、年度目标制定,还是学生群体的职业生涯规划、实习避坑指南,亦或是活动策划、旅游攻略等内容,它都能提供支持,帮助用户精准表达,轻松呈现各种信息。

AI办公办公工具AI工具讯飞智文AI在线生成PPTAI撰写助手多语种文档生成AI自动配图热门
下拉加载更多