Mova

  White Paper

 

Volumetric Cinematography: The World No Longer Flat

Steve Perlman
President, Mova
October 25, 2006

Today, most movies are shot with motion picture cameras that are strikingly similar to the first motion picture camera developed by Thomas Edison’s lab in the 1890s. In fact, the film size used by most motion picture cameras, 35mm, is exactly the same film size used by Edison in the first motion picture cameras. Of course, there have been enormous refinements to motion picture cameras and film over the last 100 years. And, recently, we’ve begun to see motion pictures, such as Superman Returns (2006) and director David Fincher’s upcoming Zodiac (2007), which eliminate film altogether and utilize high-definition digital video cameras.

But, despite all of these refinements to motion picture cameras, Edison would have been quite familiar with the way motion picture cameras today are used to shoot a scene: The camera looks upon the scene from a given point-of-view (POV), and then records a flat image of the scene as it appears from that POV. Of course, the camera may move or the lens may be adjusted during a shot, and there may be more than one camera recording a shot at once. But, ultimately, when the motion picture is edited, for each given frame in the final film, the director is limited to choosing the flat image recorded from the POV of one of the cameras shooting the scene. We call this “POV Cinematography”. Developed over one hundred years ago by Edison’s team, this paradigm for creating movies is still the dominant paradigm in use today.

Although most live action motion pictures and video segments are shot this way today, increasingly other forms of moving image entertainment are “shot” in a very different way: they are synthesized by a computer. Video games and computer-generated animated movies present a scene to the viewer from the POV of a “virtual camera”; that is, the computer generates an image of the scene as it would have been viewed by a camera from a particular position in space, through a particular lens. In the case of a video game, typically the player controls the position and lens of this virtual camera. In the case of a computer-generated animation, the director controls it. So, no longer is the director limited to the selection of POVs from the cameras placed in the scene. The virtual camera can be positioned anywhere. Essentially, the computer has given us cinematography in the round.
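
To make the idea of a virtual camera concrete, here is a minimal sketch, in Python, of the projection a renderer performs. The names and the single-parameter lens model are illustrative simplifications, not any particular engine’s implementation: given an arbitrary camera position, orientation, and focal length, 3D points in the scene are mapped onto a flat 2D image.

```python
import numpy as np

def project_points(points_3d, cam_pos, cam_rot, focal_len):
    """Project 3D world points onto a virtual camera's 2D image plane.

    points_3d : (N, 3) world-space points
    cam_pos   : (3,) camera position in world space
    cam_rot   : (3, 3) rotation matrix mapping world axes to camera axes
    focal_len : focal length in pixels (a one-parameter "lens")
    """
    # Express each point in the camera's coordinate frame.
    cam_space = (points_3d - cam_pos) @ cam_rot.T
    # Perspective divide (assumes all points are in front of the camera).
    x = focal_len * cam_space[:, 0] / cam_space[:, 2]
    y = focal_len * cam_space[:, 1] / cam_space[:, 2]
    return np.stack([x, y], axis=1)

# Moving the "camera" is just changing cam_pos and cam_rot -- no physical
# camera ever has to occupy that position in the scene.
```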

But the flexibility of computer-generated motion imagery goes far beyond that. The director can choose the lighting of the scene; any creature or prop can be created and arbitrarily manipulated; scenes with thousands of characters can be shot cost-effectively; complex special effects can be achieved. We are no longer limited to working with the flat images captured by POV Cinematography, but rather have the ability to manipulate whatever we want within the entire 3D volume of the scene. Literally, the computer has opened up an entire dimension of creative opportunity that was unimaginable in Edison’s day. It is not simply a refinement of the paradigm Edison’s team established over a hundred years ago. It is an entirely new animal. We call this new paradigm “Volumetric Cinematography”. Just as POV Cinematography defined motion image production for most of the last century, we expect that Volumetric Cinematography will define motion image production for the next.

Already in the last 10 years, we have seen both the motion picture industry and the video game industry rapidly embrace the new paradigm. 2006 is seeing the release of more than a dozen computer-generated animated features (up from one computer-generated movie in 1995), and three new video game platforms are just arriving on the scene, all touting their ability to render 3D scenes with increasing realism and interactivity. A tidal wave of creativity in both the motion picture (“linear”) and video game (“non-linear”) worlds has resulted, and audiences, both for linear and non-linear content, have responded voraciously to the new look, creating enormous demand for computer-generated content. And the technology has evolved to the point where we can synthesize computer-generated natural scenery and even animals such that they are virtually indistinguishable from the images shot by a real camera.

But all of this stands in stark contrast to an assertion I made at the beginning: Most motion pictures today are shot almost the same way that Edison shot motion pictures over a hundred years ago. With all the tools for creating realistic computer-generated imagery, and with all the creative flexibility Volumetric Cinematography affords, why is almost every single motion picture shot using POV motion picture cameras?

The answer is very simple: until now, no Volumetric Camera has existed. We certainly have the ability for an animation team to synthesize 3D scenes within a computer. And, once those scenes are synthesized, we have the ability to manipulate them with all the power and creative freedom that Volumetric Cinematography affords. But, to date, we have not had a means to volumetrically shoot a scene that exists in the real world.

The closest thing we have today to a Volumetric Camera that can capture an entire scene is a marker-based motion capture system. With such a system, markers (typically retroreflective markers or painted dots) are placed on tightly-clad performers and props, and an array of cameras surrounding the scene tracks the dots; by triangulating between camera views, the system determines where the dots are in the “capture volume” (i.e. the scene). Since only a small number of markers can practically be placed on a performer (typically 30-160 on a human face), the result is a very sparse representation of the motion that occurred in the scene. While this sparse information works well for capturing the motion of rigid objects (like human skeletal motion, or the motion of a rigid prop), it is not very effective at capturing the motion of deformable surfaces, like faces, skin, and cloth. And even when it captures the positions of rigid objects as they move, it provides no information about the objects themselves, such as their shape, color, or surface characteristics. Motion capture dots, literally and figuratively, are like 3D versions of the 2D constellations we see in the sky at night. And it is up to the observers of the constellation (in the case of motion capture, a 3D animation production team) to draw lines between the dots, and then flesh out the characters represented by those lines.
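
To make the triangulation step concrete, here is a minimal sketch of the standard linear (DLT) method for recovering one marker’s 3D position from two calibrated camera views. The names are illustrative, and a production system would use many cameras and more robust solvers.

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Recover a marker's 3D position from two camera views (linear DLT).

    P1, P2   : (3, 4) projection matrices of two calibrated cameras
    uv1, uv2 : (2,) pixel coordinates of the same marker in each view
    """
    # Each view contributes two linear constraints on the homogeneous point.
    A = np.array([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    # The least-squares solution is the singular vector associated with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize to (x, y, z)
```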

The limitations of marker-based motion capture systems (and the inconvenience and discomfort of markers on performers) have spurred the development of a variety of markerless motion capture techniques. Each technique is suited for certain specialized applications, but none of them provides a general Volumetric Camera solution. Here are some examples:

  • Optical flow: Two or more cameras are used to triangulate between natural surface features (e.g. pores in the skin) and reconstruct surface geometry (see footnote 1). While the concept works well in theory, in practice it is difficult to consistently capture tiny surface features from multiple camera angles, even with high-resolution cameras, flat lighting, and a capture volume limited to a very small area (e.g. just the face). The result is extensive manual data cleanup and high production cost.

  • Structured light: Patterns of light are projected onto a surface, and one or more cameras determine the surface shape either by measuring the deformation of the light pattern on the surface (see footnote 2), or by triangulating between cameras (see footnote 3). Because the projected patterns are typically useful only within a small volume, such systems are usually limited to lock-down captures, where the subject’s motion is very restricted. Also, most computer animation applications require “vertex continuity”, such that a uniform polygonal mesh of the character is maintained from frame to frame (the sketch following this list illustrates the idea). Since the projected light patterns are not “attached” to the surface (e.g. as a marker or a pore would be), it is not practical to track the motion of the surface at high resolution.

  • Image analysis (also called “image understanding” or “machine vision”): One or more camera views of a scene are analyzed to identify character poses (typically limited to facial expressions) previously trained into the system, and then these derived poses are used to “puppeteer” a synthetic character model (see footnote 4). Since such systems do not capture the actual 3D geometry of the performance at all, the performance range is limited to the scope of the trained poses and the synthesized shapes of the character models.
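
To make “vertex continuity” concrete, here is a minimal sketch of a vertex-continuous mesh sequence; the class and field names are purely illustrative, not any product’s actual data format. The triangle list is captured once, and vertex i refers to the same physical point on the surface in every frame, so the motion of any individual surface point can be followed through the performance.

```python
import numpy as np

class MeshSequence:
    """Vertex-continuous mesh sequence: fixed topology, moving vertices."""

    def __init__(self, triangles, frames):
        self.triangles = triangles  # (T, 3) int vertex indices, same every frame
        self.frames = frames        # list of (V, 3) float vertex positions

    def vertex_track(self, i):
        """Trajectory of one physical surface point across the performance."""
        return np.array([frame[i] for frame in self.frames])

# Without vertex continuity (e.g. independent per-frame scans), vertex i in
# one frame bears no relation to vertex i in the next, and a per-point
# trajectory like this cannot be recovered.
```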

While each of these motion capture techniques is quite useful for certain applications, none of them offers the generality, flexibility, and quality of Edison’s century-old invention: the motion picture camera. Still, the results that have been achieved by creative teams using these systems, even with their limitations, have given us a tantalizing glimpse into what might be possible if a flexible, general-purpose, high-quality Volumetric Camera system actually existed. Well, now such a system does exist.

In July 2006, Mova introduced the first Volumetric Camera System capable of capturing an entire scene of vertex-continuous 3D surfaces in motion: the Contour™ Reality Capture System. Like state-of-the-art marker-based motion capture systems, Contour uses an array of cameras placed around the capture volume to look at the scene from many different angles. But unlike such systems, Contour is markerless: rather than having upwards of 100 markers on the performer’s face, there are none. And rather than placing the performers in tight-fitting latex suits, the performers wear whatever costume is appropriate for their performance. In short, Contour tries to create a performance environment that is closer to what actors are accustomed to in a POV Cinematography shoot.

Also, rather than capturing a sparse set of dots in a scene, Contour captures entire 3D surfaces. For example, with markers it might be feasible to capture a 3D “constellation” of up to 200 points on a human face. With Contour, we capture over 100,000 3D points on a face with 0.1mm precision, and while we are doing it, we also capture the visual image of the face as it is lit in the scene. With so many points captured, the face no longer looks like a constellation of dots; it looks like a photoreal face, just like one you’d see captured by a POV motion picture camera. And there is good reason for this: we are capturing surfaces in 3D volumetrically (i.e. in the round), at a resolution similar to that with which a conventional motion picture camera records scenes in 2D from a single POV. Effectively, this makes it possible to shoot in 3D without compromising the realism we expect when we shoot in 2D.

The details of how Contour works can be found on the mova.com website. But what I’d like to cover here are the implications of having a camera system that is capable of Volumetric Cinematography. Let’s start with what is perhaps the most compelling image that a camera can record—and what has been the most elusive to photorealistically recreate in 3D—the human face.

In 1970, a roboticist, Masahiro Mori, published a research paper about how humans respond to robot faces. He found that as robot faces became more and more like human faces, people responded to them with increasing empathy. Not surprisingly, people responded more empathetically to faces that had eye sockets and protrusions for noses than they did to a very simple face with just circles for eyes and a triangle for a nose. Then he discovered an odd perceptual response. When a face became very close to a human face, but was imperfect, people responded quite negatively to it, and it wasn’t until a face became so real that it was virtually indistinguishable from a healthy human’s face that people once again responded empathetically. He called this perceptual region, where a face is almost perfect but there is a sudden drop in viewer empathy, the “Uncanny Valley”.

The computer graphics community has adopted this terminology to describe a perceptual response that is observed when viewers see computer-generated faces that are almost real, but not quite real. For example, there is higher viewer empathy for a 3D caricature like “Princess Fiona” from Shrek (2001) than for a 2D caricature, like “Princess Odette” from The Swan Princess (1994), since Princess Fiona looks more like a real woman. Viewer empathy for a live action actress’ face is even higher than that of a 3D caricature like Princess Fiona. But, if you create a 3D woman’s face that almost looks and moves like the face of a real woman, but not exactly, viewer empathy for that character plummets. Such near-photoreal faces are often described as lifeless, or zombie-like. So, typically in computer-generated movies, a deliberate effort is made to “back off” from photorealism to stay outside of the “Uncanny Valley”.

Just as a 2D POV camera captures an actual scene in 2D at photographic resolution, Contour captures an actual scene in 3D at photographic resolution. It captures every wrinkle, depression, protrusion and curve. So, when it captures faces, they are virtually indistinguishable from faces captured with a 2D POV camera. One of the early challenges we had in developing Contour was that when we demonstrated it to people and asked them what they thought of the faces we had captured, they often would ask, “What am I supposed to look for? Isn’t this just a video of someone’s face?” It was at that point we knew: we had crossed the Uncanny Valley. The computer-generated faces captured by Contour no longer elicited comments about what was real or unreal about them; they simply looked like a live performance. Finally, we had a camera system that captured a face in 3D with such a high degree of realism that it achieved the same high levels of empathy that a 2D POV camera could achieve.

So, now that we are past the Uncanny Valley, the question is what lies beyond? Contour doesn’t just capture faces, it captures lips, skin, hands, cloth—essentially any surface that is dry and holds together as it moves. (For example, on the human body, Contour can’t capture eyeballs, loose hair or the inside of the mouth.) Contour can also capture the surface of props, including deformable ones, like bouncing balls. So, that means we can capture a pretty good percentage of anything a director might want to drop into a scene.

But unlike a POV camera, Contour captures the scene volumetrically. So, the director is no longer limited to manipulating the 2D pixels from the POV of the cameras shooting the scene, but instead is able to manipulate the real-world scene like a video game: choosing any camera position and lens configuration; completely controlling the lighting; adding 3D environments, props, and characters; and, most importantly, applying the subtleties of human performance to characters in the scene, a capability which is now being called “Digital Makeup”.
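
As one concrete example of “completely controlling the lighting” after the shoot, here is a minimal sketch of re-lighting a captured surface with a new, director-placed light. It assumes per-vertex normals and colors are available from the capture, uses simple Lambertian (diffuse) shading, and all names are illustrative.

```python
import numpy as np

def relight(positions, normals, albedo, light_pos, light_color):
    """Re-light a captured 3D surface with a virtual light source.

    positions   : (V, 3) captured vertex positions
    normals     : (V, 3) unit surface normals derived from the mesh
    albedo      : (V, 3) per-vertex surface color from the captured imagery
    light_pos   : (3,) position of the new virtual light
    light_color : (3,) RGB intensity of the new light
    """
    # Unit direction from each vertex toward the light.
    to_light = light_pos - positions
    to_light /= np.linalg.norm(to_light, axis=1, keepdims=True)
    # Lambertian shading: brightness falls off with the cosine of the angle
    # between the surface normal and the light direction.
    ndotl = np.clip(np.sum(normals * to_light, axis=1), 0.0, None)
    return albedo * light_color * ndotl[:, None]
```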

Digital Makeup refers to capturing the human face in 3D, and then manipulating the face, while still retaining the performance of the actor. An actor can be aged, made to look younger, changed to look like a historical figure, or even transformed into a fantasy character. But, if the actor flares his nostrils or furrows his brow, this same subtle performance will be carried through to the nose or the forehead of his new character. So, just as actors today use their voices to add life to animated characters, with Contour, actors can use their bodies to bring life to computer-generated characters, while still retaining the subtleties of their performance.
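
Here is a minimal sketch of one simple way Digital Makeup can work, under the (strong) assumption that the captured face and the target character share the same vertex-continuous topology; real retargeting pipelines are considerably more sophisticated, and the names are illustrative.

```python
import numpy as np

def apply_digital_makeup(actor_neutral, actor_frame, character_neutral):
    """Carry a captured facial performance over to a character mesh.

    Assumes vertex i on the actor capture corresponds to vertex i on the
    character model (shared vertex-continuous topology).
    """
    # The performance, expressed as per-vertex motion from the neutral pose.
    deformation = actor_frame - actor_neutral
    # A nostril flare or a furrowed brow moves the character's nose or
    # forehead by the same subtle per-vertex displacement.
    return character_neutral + deformation
```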

Once we bring live actors into the world of Volumetric Cinematography, their performances can be enhanced by all of the flexibility and creativity that has been developed for computer-generated motion imagery. We get the best of both worlds: the photorealism that was previously exclusive to POV Cinematography, and the flexibility, power and interactivity of computer-generated imagery.

Welcome to the world of Volumetric Cinematography.

© 2006, Mova LLC

1 Borshukov et al., “Universal Capture—Image-based Facial Animation for ‘The Matrix Reloaded’,” http://www.virtualcinematography.org/publications/acrobat/UCap-s2003.pdf, describes such a technique.

2 Eyetronics (www.eyetronics.com) is an example of a system using such a technique.

3 3DMD (www.3dmd.com) is an example of a system using such a technique.

4 Image Metrics (www.imagemetrics.net) is an example of a system using such a technique.

 
