
Computer Vision

The tasks of computer vision encompass the methods for acquiring, processing, analyzing, and ultimately, understanding digital images. This field is fundamentally about the extraction of high-dimensional data from the messy, chaotic real world to generate numerical or symbolic information—distilled, for instance, into the form of decisions.[1][2][3][4] "Understanding," in this context, is a generous term. It signifies the laborious transformation of visual images—the raw input that bombards a retina, biological or silicon—into descriptions of the world that can integrate with other thought processes and provoke an appropriate, or at least not entirely inappropriate, action. This so-called image understanding can be viewed as the painstaking act of disentangling symbolic meaning from raw image data. This is achieved by deploying models constructed with the indispensable aid of geometry, physics, statistics, and learning theory.

The scientific discipline of computer vision obsesses over the theory underpinning artificial systems that extract information from images. This image data is not monolithic; it can manifest in various forms, such as fluid video sequences, disjointed views from multiple cameras, multi-dimensional data from a 3D scanner, vast 3D point clouds from LiDAR sensors, or the internal landscapes captured by medical scanning devices. In parallel, the technological discipline of computer vision seeks to apply these theories and models to the decidedly less abstract construction of actual computer vision systems.

The subdisciplines that huddle under the computer vision umbrella are numerous and varied. They include scene reconstruction, object detection, event detection, activity recognition, video tracking, object recognition, 3D pose estimation, learning, indexing, motion estimation, visual servoing, 3D scene modeling, and image restoration. Each is a world of complexity unto itself.

Definition

Computer vision is an interdisciplinary field, which is a polite way of saying it borrows, begs, and steals from any domain that can help it solve the problem of how computers can be coerced into gaining a high-level understanding from digital images or videos. From the stark perspective of engineering, it’s an attempt to automate the tasks that the human visual system performs, often without the user even noticing.[5][6][7] It has been described as being "concerned with the automatic extraction, analysis, and understanding of useful information from a single image or a sequence of images. It involves the development of a theoretical and algorithmic basis to achieve automatic visual understanding."[8] As a scientific discipline, computer vision delves into the theory behind artificial systems that pull information from visual data, which can be anything from video feeds to the multi-dimensional outputs of a medical scanner.[9] As a technological discipline, it’s about applying those theories to build something that actually works. The term machine vision is also thrown around, historically referring to a systems engineering discipline, particularly in the sterile, predictable context of factory automation. In recent years, however, the lines have blurred; the terms computer vision and machine vision have converged, much to the confusion of newcomers.[10]:13 

History

Computer vision sputtered to life in the late 1960s, gestating in universities that were the cradles of artificial intelligence. The initial ambition was as grand as it was naive: to mimic the human visual system as a stepping stone toward creating robots with genuinely intelligent behavior.[11] In a now-infamous display of hubris, it was believed in 1966 that this monumental task could be solved through an undergraduate summer project.[12] The plan was simple: attach a camera to a computer and instruct it to "describe what it saw."[13][14] One can only assume the resulting description was less than poetic.

What set this fledgling field of computer vision apart from the already established domain of digital image processing was its audacious goal: to extract a three-dimensional structure from flat images, to achieve a complete understanding of the scene. The research of the 1970s laid the foundational groundwork for many of the computer vision algorithms that are still in use today. This includes the extraction of edges from images, the labeling of lines, modeling for both non-polyhedral and polyhedral shapes, representing objects as interconnected webs of smaller structures, and the first forays into optical flow and motion estimation.[11]

The following decade was marked by a turn towards more rigorous mathematical analysis and a quantitative approach to vision problems. This era introduced concepts like scale-space theory, methods for inferring shape from cues like shading, texture, and focus, and the development of contour models famously known as snakes. Researchers also began to recognize that many of these disparate mathematical concepts could be unified within the same optimization framework, alongside techniques like regularization and Markov random fields.[15]

By the 1990s, the field had matured, and certain research avenues proved more fruitful than others. Research into projective 3-D reconstructions led to a much deeper understanding of camera calibration. As optimization methods for camera calibration were refined, it became clear that many of these ideas had already been explored in the field of photogrammetry under the theory of bundle adjustment. This convergence led to powerful methods for creating sparse 3-D reconstructions of scenes from multiple images. Significant progress was also made on the dense stereo correspondence problem and other multi-view stereo techniques. Simultaneously, variations of graph cut algorithms were being employed to solve image segmentation problems. This decade also saw the first practical application of statistical learning techniques for recognizing faces in images, a notable example being the Eigenface method. Toward the end of the 1990s, the field was energized by increased interaction with computer graphics, leading to innovations in image-based rendering, image morphing, view interpolation, panoramic image stitching, and early forms of light-field rendering.[11]

More recent work has seen a resurgence of feature-based methods, now used in concert with sophisticated machine learning techniques and complex optimization frameworks.[16][17] The dramatic advancement of Deep Learning techniques has injected a new, almost frantic, energy into computer vision. The accuracy of deep learning algorithms on numerous benchmark datasets, for tasks ranging from classification[18] to segmentation and optical flow, has decisively surpassed the performance of prior methods, heralding a new era for the field.[19][20]

Related fields

An object detection algorithm identifying a person in a photograph. One hopes it's correct.

Solid-state physics

Solid-state physics is another field with which computer vision is uncomfortably intimate. The vast majority of computer vision systems depend on image sensors to function. These sensors detect electromagnetic radiation, typically in the form of visible, infrared, or ultraviolet light. The design of these sensors is a direct application of quantum physics. The very process by which light interacts with surfaces is explained by physics. Physics dictates the behavior of the optics that are the core of most imaging systems. The most sophisticated image sensors even require a grasp of quantum mechanics to fully comprehend the image formation process.[11] Conversely, computer vision can be used to address measurement problems in physics, such as analyzing motion in fluids.

Neurobiology

A simplified, almost offensively so, diagram of training a neural network for object detection. The network is fed multiple images known to contain starfish and sea urchins, which are then correlated with "nodes" representing visual features. The starfish is associated with a ringed texture and a star outline, while most sea urchins match a striped texture and an oval shape. A complication arises: an instance of a ring-textured sea urchin creates a weak association between those features.

A subsequent run of the network on an input image (left):[21] The network correctly identifies the starfish. However, that weak link between ringed texture and sea urchin also sends a faint signal to the sea urchin output. To make matters worse, a shell that wasn't in the training data weakly activates the "oval shape" node, also contributing a weak signal for the sea urchin. These weak signals could easily conspire to produce a false positive for a sea urchin. In reality, of course, textures and outlines are not represented by single nodes but by complex patterns of weights across many nodes. It's messier.

The field of Neurobiology has profoundly, and at times misguidedly, influenced the development of computer vision. For the better part of a century, scientists have conducted extensive studies of eyes, neurons, and the intricate brain structures dedicated to processing visual stimuli in humans and other animals. This has yielded a coarse, yet convoluted, description of how natural vision systems function to solve certain vision-related tasks. These findings have spawned a sub-field within computer vision where artificial systems are designed to mimic the processing and behavior of biological systems at varying levels of complexity. Furthermore, many learning-based methods in computer vision, such as the neural net and deep learning based approaches to image and feature analysis, have their conceptual roots in neurobiology. The Neocognitron, a neural network developed in the 1970s by Kunihiko Fukushima, stands as an early example of computer vision taking direct inspiration from neurobiology, specifically the primary visual cortex.

Some branches of computer vision research are inextricably linked to the study of biological vision—just as many areas of AI research are tied to the study of human intelligence. The field of biological vision studies and models the physiological processes that underlie visual perception in humans and animals. Computer vision, in contrast, develops the algorithms implemented in software and hardware that drive artificial vision systems. This interdisciplinary exchange between biological and computer vision has proven to be fruitful for both fields, a rare instance of symbiosis.[22]

Signal processing

Yet another field related to computer vision is signal processing. Many methods for processing one-variable signals, typically temporal ones, can be naturally extended to handle the two-variable or multi-variable signals found in computer vision. However, the unique nature of images means that many methods developed within computer vision have no direct counterpart in the world of one-variable signals. This, combined with the multi-dimensionality of the signal, carves out a distinct subfield in signal processing that is an integral part of computer vision.
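
As a loose illustration of that extension, here is a minimal sketch using NumPy and SciPy (the signal, image, and averaging kernels are arbitrary examples, not drawn from the text): the same convolution that smooths a one-variable signal generalizes directly to a two-variable image.

```python
import numpy as np
from scipy.ndimage import convolve, convolve1d

# 1-D case: smooth a temporal signal with a small averaging kernel.
signal_1d = np.array([0.0, 1.0, 4.0, 2.0, 8.0, 3.0, 1.0])
kernel_1d = np.ones(3) / 3.0
smoothed_1d = convolve1d(signal_1d, kernel_1d, mode="nearest")

# 2-D case: the same idea extended to an image, here with a 3x3 averaging kernel.
image_2d = np.random.rand(64, 64)
kernel_2d = np.ones((3, 3)) / 9.0
smoothed_2d = convolve(image_2d, kernel_2d, mode="nearest")

print(smoothed_1d.shape, smoothed_2d.shape)  # (7,) and (64, 64)
```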

Robotic navigation

Robot navigation is sometimes concerned with autonomous path planning or deliberation, allowing robotic systems to navigate through an environment without bumping into things too often.[23] A detailed understanding of these environments is essential for successful navigation. This information can be provided by a computer vision system, which acts as a vision sensor, delivering high-level information about the environment and the robot's place within it.

Visual computing

This section is an excerpt from Visual computing.

Visual computing is a generic, catch-all term for all computer science disciplines that deal with images and 3D models. This includes computer graphics, image processing, visualization, computer vision, virtual and augmented reality, video processing, and computational visualistics. Visual computing also touches upon aspects of pattern recognition, human-computer interaction, machine learning, and digital libraries. The core challenges are the acquisition, processing, analysis, and rendering of visual information, primarily images and video. Its application areas are broad, spanning industrial quality control, medical image processing and visualization, surveying, robotics, multimedia systems, virtual heritage, special effects in movies and television, and ludology. Visual computing also extends into the realms of digital art and digital media studies.

Other fields

Beyond these specific views, many related research topics can be examined from a purely mathematical standpoint. For example, a vast number of methods in computer vision are built upon foundations of statistics, optimization, or geometry. Finally, a significant portion of the field is dedicated to the practicalities of implementation: how existing methods can be realized in various combinations of software and hardware, or how they can be modified to gain processing speed without sacrificing too much performance. Computer vision has also found its way into fashion eCommerce, inventory management, patent searches, furniture design, and the beauty industry.[24]

Distinctions

The fields most closely related to computer vision are image processing, image analysis, and machine vision. A significant overlap exists in the techniques and applications they cover, which might lead one to believe they are all just different names for the same thing. On the other hand, research groups, scientific journals, conferences, and companies find it necessary to market themselves as belonging to one of these specific fields. Consequently, various characterizations have been proposed to distinguish them. A simple rule of thumb is that in image processing, the input and output are both images, whereas in computer vision, the input is an image or video, and the output could be anything from an enhanced image to an analysis of its content, or even a system's behavior based on that analysis.

Computer graphics typically generates image data from 3D models, while computer vision often does the reverse, producing 3D models from image data.[25] There is also a growing trend toward combining the two disciplines, as explored in applications like augmented reality.

The following characterizations are relevant, though not universally agreed upon:

  • Image processing and image analysis tend to focus on 2D images and how to transform one image into another. This could involve pixel-wise operations like contrast enhancement, local operations such as edge extraction or noise removal, or geometrical transformations like rotating the image. This characterization implies that image processing and analysis do not require assumptions about the image content, nor do they produce interpretations of it. They just manipulate pixels. (A few such operations are sketched in code after this list.)
  • Computer vision often includes 3D analysis from 2D images. It analyzes the 3D scene projected onto one or more images, for example, by reconstructing the structure or other information about the 3D scene from those images. Computer vision frequently relies on complex, and sometimes questionable, assumptions about the scene depicted in an image.
  • Machine vision is the process of applying a range of technologies and methods to provide imaging-based automatic inspection, process control, and robot guidance[26] in industrial applications.[22] It tends to focus on practical applications, primarily in manufacturing, such as vision-based robots and systems for inspection, measurement, or picking (like bin picking[27]). This implies that image sensor technologies and control theory are often integrated with image data processing to control a robot, and real-time performance is heavily emphasized through efficient hardware and software implementations. It also means that external conditions, such as lighting, can be and often are more tightly controlled in machine vision than in general computer vision, which allows for the use of different, often simpler, algorithms.
  • There is also a field called imaging, which primarily focuses on the process of producing images but sometimes also deals with their processing and analysis. For example, medical imaging involves substantial work on the analysis of image data in medical applications. Advances in convolutional neural networks (CNNs) have notably improved the accurate detection of diseases in medical images, particularly in cardiology, pathology, dermatology, and radiology.[28]
  • Finally, pattern recognition is a field that uses various methods to extract information from signals in general, mainly based on statistical approaches and artificial neural networks.[29] A significant part of this field is dedicated to applying these methods to image data.
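
To make the contrast in the first bullet concrete, here is a minimal sketch of purely image-to-image operations, pixel-wise contrast enhancement, local edge extraction, and a geometric rotation, using OpenCV; the input filename is a placeholder, and no interpretation of the image content is involved.

```python
import cv2

# Load a grayscale image (the filename is purely illustrative).
img = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)

# Pixel-wise operation: contrast enhancement via histogram equalization.
enhanced = cv2.equalizeHist(img)

# Local operation: edge extraction with the Canny detector.
edges = cv2.Canny(enhanced, threshold1=100, threshold2=200)

# Geometric transformation: rotate the image by 30 degrees about its center.
h, w = img.shape
rotation = cv2.getRotationMatrix2D((w / 2, h / 2), angle=30, scale=1.0)
rotated = cv2.warpAffine(img, rotation, (w, h))

cv2.imwrite("edges.png", edges)
cv2.imwrite("rotated.png", rotated)
```

Every input and output here is an image or image-like array; nothing in the code asks what the picture depicts, which is precisely the distinction drawn above.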

Photogrammetry also overlaps with computer vision, with areas like stereophotogrammetry being a close cousin to computer stereo vision.

Applications

Applications range from mundane tasks like industrial machine vision systems inspecting bottles on a production line, to the loftier research into artificial intelligence and robots that can comprehend the world around them. The computer vision and machine vision fields have a significant, and often confusing, overlap. Computer vision covers the core technology of automated image analysis, which is then used in many other fields. Machine vision usually refers to the process of combining this automated image analysis with other technologies to provide automated inspection and robot guidance in industrial settings. While many computer-vision applications involve computers pre-programmed for a specific task, methods based on learning are becoming increasingly common.

Examples of computer vision applications include systems for:

Learning 3D shapes has been a notoriously difficult task in computer vision. Recent advances in deep learning, however, have enabled researchers to build models capable of generating and reconstructing 3D shapes from single or multi-view depth maps or silhouettes with surprising efficiency.[25]

For 2024, the leading sectors for computer vision were industry (market size US$5.22 billion),[34] medicine (market size US$2.6 billion),[35] and the military (market size US$996.2 million).[36]

Medicine

DARPA's Visual Media Reasoning concept video. One can dream.

One of the most prominent application fields is medical computer vision, or medical image processing, characterized by the extraction of information from image data to diagnose a patient.[37] An example is the detection of tumours, arteriosclerosis, or other malignant changes, and a variety of dental pathologies. Measuring organ dimensions or blood flow is another. It also supports medical research by providing new information, for instance, about the structure of the brain or the efficacy of medical treatments. Applications in the medical area also include enhancing images for human interpretation—like ultrasonic images or X-ray images—to reduce the influence of noise.

Machine vision

A second major application area for computer vision is industry, often termed machine vision, where information is extracted to support a manufacturing process. A classic example is quality control, where parts or final products are automatically inspected for defects. One of the most critical fields for such inspection is the wafer industry, where every single wafer is measured and inspected for inaccuracies to prevent a faulty computer chip from ever reaching the market. Another example is measuring the position and orientation of parts to be picked up by a robot arm. Machine vision is also heavily used in agricultural processes to remove undesirable foodstuff from bulk material, a process known as optical sorting.[38]

Military

The military applications are as obvious as they are unsettling. They include the detection of enemy soldiers or vehicles and missile guidance. More advanced systems send a missile to a general area rather than a specific target, with target selection occurring once the missile arrives, based on locally acquired image data. Modern military concepts like "battlefield awareness" rely on various sensors, including image sensors, to provide a rich set of information about a combat scene, which can then be used to support strategic decisions. In this context, automatic data processing is employed to reduce complexity and fuse information from multiple sensors to increase reliability.

Autonomous vehicles

An artist's concept of the Curiosity, an uncrewed land-based vehicle. The stereo camera is perched on top of the rover, its unblinking eyes scanning Mars.

One of the newer and more visible application areas is autonomous vehicles. This category includes submersibles, land-based vehicles (from small wheeled robots to cars and trucks), aerial vehicles, and unmanned aerial vehicles (UAVs). The level of autonomy varies, from fully autonomous (unmanned) vehicles to those where computer-vision-based systems merely support a human driver or pilot. Fully autonomous vehicles typically use computer vision for navigation—for instance, to know where they are or to map their environment (SLAM)—and for detecting obstacles. It can also be used for task-specific event detection, like a UAV searching for forest fires. Supporting systems include obstacle warning systems in cars, cameras and LiDAR sensors in vehicles, and systems for the autonomous landing of aircraft. Several car manufacturers have demonstrated systems for the autonomous driving of cars, with varying degrees of success. There are ample examples of military autonomous vehicles, from advanced missiles to UAVs for reconnaissance or missile guidance. Space exploration is already heavily reliant on autonomous vehicles using computer vision, such as NASA's Curiosity and CNSA's Yutu-2 rover.

Tactile feedback

A rubber artificial skin layer with a flexible structure for shape estimation of micro-undulation surfaces.

Materials like rubber and silicon are being used to create sensors that enable applications such as detecting micro-undulations and calibrating robotic hands. Rubber can be used to create a mold that fits over a finger, with multiple strain gauges embedded inside. This finger mold and its sensors could then be placed on a small sheet of rubber containing an array of rubber pins. A user wearing the mold can then trace a surface. A computer reads the data from the strain gauges and measures if any of the pins are being pushed upward. If a pin is displaced, the computer recognizes it as an imperfection in the surface. This technology is useful for acquiring accurate data on imperfections across a very large surface.[39] Another variation on this concept involves sensors that contain a camera suspended in silicon. The silicon forms a dome around the camera, and embedded within it are equally spaced point markers. These cameras can be placed on devices like robotic hands to provide the computer with highly accurate tactile data.[40]
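
A hedged sketch of the decision step described above, assuming the strain-gauge readings arrive as a NumPy array and that a single per-pin threshold (chosen arbitrarily here) separates a resting pin from a displaced one; a real sensor would need per-gauge calibration.

```python
import numpy as np

def find_surface_imperfections(gauge_readings, rest_level, threshold=0.05):
    """Flag pins whose reading rises above the resting level by more than
    `threshold` (hypothetical units). Returns (row, col) indices of raised pins."""
    displacement = gauge_readings - rest_level
    return np.argwhere(displacement > threshold)

# Example: an 8x8 pin array with one pin pushed upward by a surface bump.
rest = np.zeros((8, 8))
current = rest.copy()
current[3, 5] = 0.2
print(find_surface_imperfections(current, rest))  # [[3 5]]
```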

Other application areas include:

Typical tasks

Each of the application areas described above employs a range of computer vision tasks—more or less well-defined measurement or processing problems that can be solved using a variety of methods. Some examples of typical computer vision tasks are presented below.

Computer vision tasks include methods for acquiring, processing, analyzing, and understanding digital images, and extracting high-dimensional data from the real world to produce numerical or symbolic information, such as decisions.[1][2][3][4] Understanding, in this context, means transforming visual images (the retina's input) into descriptions of the world that can interface with other thought processes and elicit an appropriate action. This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory.[45]

Recognition

The classical problem in computer vision, image processing, and machine vision is determining whether image data contains a specific object, feature, or activity. The literature describes different varieties of the recognition problem.[46]

  • Object recognition (also called object classification) – One or several pre-specified or learned objects or object classes can be recognized, usually along with their 2D positions in the image or 3D poses in the scene. Blippar, Google Goggles, and LikeThat provide stand-alone programs that illustrate this functionality.
  • Identification – An individual instance of an object is recognized. Examples include identifying a specific person's face or fingerprint, identifying handwritten digits, or identifying a specific vehicle.
  • Detection – The image data is scanned for specific objects along with their locations. Examples include detecting an obstacle in a car's field of view, finding possible abnormal cells in medical images, or detecting a vehicle in an automatic road toll system. Detection, often based on relatively simple and fast computations, is sometimes used to find smaller regions of interest that can then be analyzed by more computationally demanding techniques.

Currently, the most effective algorithms for these tasks are based on convolutional neural networks. An illustration of their capabilities is provided by the ImageNet Large Scale Visual Recognition Challenge, a benchmark in object classification and detection that uses millions of images and 1000 object classes.[47] The performance of convolutional neural networks on ImageNet tests is now approaching that of humans.[47] However, the best algorithms still struggle with objects that are small or thin, like a tiny ant on a flower stem or a person holding a quill. They also have trouble with images that have been distorted with filters, an increasingly common phenomenon. By contrast, these kinds of images rarely trouble humans. Humans, however, tend to have their own issues. For example, they are not particularly good at classifying objects into fine-grained classes, such as distinguishing between specific breeds of dog or species of bird, whereas convolutional neural networks handle this with unnerving ease.[citation needed]
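
As a minimal sketch of such a network in use, here is image classification with a CNN pretrained on ImageNet; it assumes PyTorch and torchvision (version 0.13 or later for the `weights` argument) are installed, ResNet-18 is used purely as a convenient example, and the input filename is a placeholder.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# A CNN pretrained on the 1000-class ImageNet task (ResNet-18 as an example).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.eval()

# Standard ImageNet preprocessing: resize, crop, convert, normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("example.jpg").convert("RGB")   # placeholder filename
batch = preprocess(img).unsqueeze(0)             # add a batch dimension

with torch.no_grad():
    logits = model(batch)
    predicted_class = int(logits.argmax(dim=1))  # index into the 1000 classes

print(predicted_class)
```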

Several specialized tasks based on recognition exist, such as:

  • Content-based image retrieval – Finding all images in a large set that have a specific content. The content can be specified in various ways, for example, in terms of similarity to a target image (give me all images similar to image X) by using reverse image search techniques, or through high-level search criteria given as text (give me all images with many houses, taken during winter, with no cars).
  • Pose estimation – Estimating the position or orientation of a specific object relative to the camera. An example application would be assisting a robot arm in retrieving objects from a conveyor belt in an assembly line situation or picking parts from a bin.
  • Optical character recognition (OCR) – Identifying characters in images of printed or handwritten text, usually to encode the text in a format more amenable to editing or indexing (e.g., ASCII). A related task is reading 2D codes like data matrix and QR codes.
  • Facial recognition – A technology that enables the matching of faces in digital images or video frames to a face database, now widely used for everything from mobile phone unlocking to smart door locks.[48] (The detection step that precedes any such matching is sketched in code after this list.)
  • Emotion recognition – A subset of facial recognition, this refers to the process of classifying human emotions. Psychologists, it should be noted, caution that internal emotional states cannot be reliably detected from facial expressions alone.[49]
  • Shape Recognition Technology (SRT) in people counter systems, used to differentiate human beings (based on head and shoulder patterns) from inanimate objects.
  • Human activity recognition – This deals with recognizing an activity from a series of video frames, for example, determining if a person is picking up an object or simply walking.
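
Matching a face against a database presupposes that faces have first been located in the frame. A minimal sketch of that detection step, using the Haar-cascade model bundled with OpenCV (the filename is a placeholder; recognition proper would add an embedding-and-matching stage not shown here):

```python
import cv2

# OpenCV ships a pretrained frontal-face Haar cascade.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_detector = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("group_photo.jpg")              # placeholder filename
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Detect faces; returns (x, y, width, height) rectangles.
faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("faces_marked.png", img)
print(f"{len(faces)} face(s) detected")
```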

Motion analysis

Several tasks relate to motion estimation, where an image sequence is processed to produce an estimate of velocity, either at each point in the image, in the 3D scene, or even of the camera that produced the images. Examples of such tasks are:

  • Egomotion – Determining the 3D rigid motion (rotation and translation) of the camera from an image sequence produced by that camera.
  • Tracking – Following the movements of a (usually) smaller set of interest points or objects (e.g., vehicles, people, or other organisms[44]) in an image sequence. This has vast industrial applications, as most high-speed machinery can be monitored this way.
  • Optical flow – To determine, for each point in the image, how that point is moving relative to the image plane—its apparent motion. This motion is a result of both how the corresponding 3D point is moving in the scene and how the camera is moving relative to the scene.
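
A minimal sketch of dense optical flow between two consecutive frames, using the classical Farnebäck method in OpenCV; the video filename and parameter values are placeholders, and this is one approach among many.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("traffic.mp4")            # placeholder video file
_, frame1 = cap.read()
_, frame2 = cap.read()

prev_gray = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

# Dense optical flow: one (dx, dy) apparent-motion vector per pixel.
flow = cv2.calcOpticalFlowFarneback(
    prev_gray, next_gray, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print("mean apparent motion (pixels/frame):", float(np.mean(magnitude)))
```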

Scene reconstruction

Given one or (typically) more images of a scene, or a video, scene reconstruction aims to compute a 3D model of that scene. In the simplest case, the model can be a set of 3D points. More sophisticated methods produce a complete 3D surface model. The advent of 3D imaging that does not require motion or scanning, along with related processing algorithms, is enabling rapid advances in this field. Grid-based 3D sensing can be used to acquire 3D images from multiple angles. Algorithms are now available to stitch multiple 3D images together into point clouds and 3D models.[25]
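
In the simplest two-view case, a coarse depth map can be recovered from a rectified stereo pair by block matching; a hedged sketch follows. The image files, focal length, and baseline are hypothetical values, and OpenCV's StereoBM is only one of many stereo correspondence methods.

```python
import cv2
import numpy as np

# Rectified left/right views of the same scene (placeholder filenames).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block-matching stereo; disparity is inversely proportional to depth.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed-point to pixels

# Convert disparity to depth with an assumed focal length f (pixels)
# and stereo baseline b (metres); both values are hypothetical.
f, b = 700.0, 0.12
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = f * b / disparity[valid]

print("median depth of matched pixels (m):", float(np.median(depth[valid])))
```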

Image restoration

Image restoration becomes necessary when the original image is degraded or damaged due to external factors like incorrect lens positioning, transmission interference, low lighting, or motion blur—collectively referred to as noise. When images are degraded, the information to be extracted from them is also compromised. Therefore, it's necessary to recover or restore the image to its intended state. The goal of image restoration is the removal of noise (sensor noise, motion blur, etc.) from images. The simplest approach involves various types of filters, such as low-pass or median filters. More sophisticated methods assume a model of how local image structures should look to distinguish them from noise. By first analyzing the image data in terms of local structures like lines or edges, and then controlling the filtering based on this local information, a better level of noise removal can usually be achieved compared to simpler approaches.

An example in this field is inpainting.
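
A minimal sketch of the two simple approaches just mentioned, median filtering for noise removal and inpainting for filling damaged regions, using OpenCV; the input image and the binary damage mask are placeholders that a real system would have to supply.

```python
import cv2

img = cv2.imread("degraded.png")                 # placeholder filename

# Simple noise removal: a 5x5 median filter suppresses salt-and-pepper noise.
denoised = cv2.medianBlur(img, 5)

# Inpainting: fill in pixels marked as damaged by a binary mask
# (white = damaged); the mask itself must come from elsewhere.
mask = cv2.imread("damage_mask.png", cv2.IMREAD_GRAYSCALE)
restored = cv2.inpaint(denoised, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)

cv2.imwrite("restored.png", restored)
```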

System methods

The organization of a computer vision system is highly dependent on its application. Some systems are stand-alone applications that solve a specific measurement or detection problem. Others constitute a sub-system within a larger design that might also include sub-systems for controlling mechanical actuators, planning, information databases, man-machine interfaces, and so on. The specific implementation also depends on whether its functionality is pre-specified or if parts of it can be learned or modified during operation. While many functions are unique to their application, there are typical functions found in many computer vision systems.

  • Image acquisition – A digital image is produced by one or several image sensors. Besides various types of light-sensitive cameras, these include range sensors, tomography devices, radar, ultra-sonic cameras, etc. Depending on the sensor, the resulting image data can be an ordinary 2D image, a 3D volume, or an image sequence. Pixel values typically correspond to light intensity in one or several spectral bands (gray images or color images), but they can also relate to various physical measures, such as depth, absorption or reflectance of sonic or electromagnetic waves, or magnetic resonance imaging.[38]
  • Pre-processing – Before a computer vision method can be applied to image data to extract a specific piece of information, it is usually necessary to process the data to ensure it satisfies certain assumptions implied by the method. Examples include:
    • Re-sampling to ensure the image coordinate system is correct.
    • Noise reduction to ensure that sensor noise does not introduce false information.
    • Contrast enhancement to ensure that relevant information can be detected.
    • Scale space representation to enhance image structures at locally appropriate scales.
  • Feature extraction – Image features at various levels of complexity are extracted from the image data.[38] Typical examples of such features are:
    • Lines, edges, and ridges.
    • Localized interest points such as corners, blobs, or points.
    • More complex features related to texture, shape, or motion.
  • Detection/segmentation – At some point in the processing, a decision is made about which image points or regions are relevant for further processing.[38] Examples are:
    • Selection of a specific set of interest points.
    • Segmentation of one or multiple image regions that contain a specific object of interest.
    • Segmentation of an image into a nested scene architecture comprising foreground, object groups, single objects, or salient object parts[50] (also referred to as a spatial-taxon scene hierarchy),[51] where visual salience is often implemented as spatial and temporal attention.
    • Segmentation or co-segmentation of one or multiple videos into a series of per-frame foreground masks while maintaining temporal semantic continuity.[52][53]
  • High-level processing – At this step, the input is typically a small set of data, for example, a set of points or an image region assumed to contain a specific object.[38] The remaining processing deals with tasks like:
    • Verification that the data satisfies model-based and application-specific assumptions.
    • Estimation of application-specific parameters, such as object pose or object size.
    • Image recognition – Classifying a detected object into different categories.
    • Image registration – Comparing and combining two different views of the same object.
  • Decision making – Making the final decision required for the application.[38] For example:
    • Pass/fail on automatic inspection applications.
    • Match/no-match in recognition applications.
    • Flag for further human review in medical, military, security, and recognition applications.
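
A minimal sketch tying these stages together for a toy pass/fail inspection task, assuming OpenCV 4.x; the input file, the expected area, and the tolerance are all hypothetical, and a real system would tune every stage to its application.

```python
import cv2

# Image acquisition (here simply read from disk; placeholder filename).
img = cv2.imread("part_on_belt.png", cv2.IMREAD_GRAYSCALE)

# Pre-processing: noise reduction and contrast enhancement.
img = cv2.GaussianBlur(img, (5, 5), 0)
img = cv2.equalizeHist(img)

# Detection/segmentation: Otsu threshold, then keep connected regions.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# High-level processing: estimate an application-specific parameter,
# here the area of the largest segmented region.
largest = max(contours, key=cv2.contourArea) if contours else None
area = cv2.contourArea(largest) if largest is not None else 0.0

# Decision making: pass/fail against a hypothetical size tolerance.
EXPECTED_AREA, TOLERANCE = 12000.0, 0.10
passed = abs(area - EXPECTED_AREA) / EXPECTED_AREA <= TOLERANCE
print("PASS" if passed else "FAIL", f"(area = {area:.0f} px)")
```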

Image-understanding systems

Image-understanding systems (IUS) incorporate three levels of abstraction: a low level that includes image primitives such as edges, texture elements, or regions; an intermediate level that includes boundaries, surfaces, and volumes; and a high level that includes objects, scenes, or events.

The representational requirements in designing an IUS for these levels are demanding: representation of prototypical concepts, concept organization, spatial knowledge, temporal knowledge, scaling, and description by comparison and differentiation. Many of these requirements remain topics for future research.

While inference refers to the process of deriving new, not explicitly represented facts from currently known ones, control refers to the process that selects which of the many inference, search, and matching techniques should be applied at a particular stage. The inference and control requirements for an IUS include: search and hypothesis activation, matching and hypothesis testing, generation and use of expectations, change and focus of attention, certainty and strength of belief, and inference and goal satisfaction.[54]

Hardware

A 2020 model iPad Pro equipped with a LiDAR sensor. Because your tablet needs depth perception.

There are many kinds of computer vision systems, but all of them contain these basic elements: a power source, at least one image acquisition device (a camera, CCD, etc.), a processor, and control and communication cables or some form of wireless interconnection. In addition, a practical vision system contains software and a display for monitoring. Vision systems for indoor spaces, like most industrial ones, often include an illumination system and may be placed in a controlled environment. Furthermore, a complete system includes numerous accessories, such as camera supports, cables, and connectors.

Most computer vision systems use visible-light cameras that passively view a scene at frame rates of at most 60 frames per second (and usually much slower).

A few computer vision systems use image-acquisition hardware with active illumination or something other than visible light, or both. This includes structured-light 3D scanners, thermographic cameras, hyperspectral imagers, radar imaging, lidar scanners, magnetic resonance images, side-scan sonar, and synthetic aperture sonar. Such hardware captures "images" that are then processed, often using the same computer vision algorithms applied to visible-light images.

While traditional broadcast and consumer video systems operate at a rate of 30 frames per second, advances in digital signal processing and consumer graphics hardware have made high-speed image acquisition, processing, and display possible for real-time systems, on the order of hundreds to thousands of frames per second. For applications in robotics, fast, real-time video systems are critically important and can often simplify the processing required for certain algorithms. When combined with a high-speed projector, fast image acquisition allows for the realization of 3D measurement and feature tracking.[55]

Egocentric vision systems are composed of a wearable camera that automatically takes pictures from a first-person perspective.

As of 2016, vision processing units are emerging as a new class of processor, designed to complement CPUs and graphics processing units (GPUs) in this role.[56]