SYSTEM AND METHOD FOR FACE RECOGNITION USING THREE DIMENSIONS
A system for facial recognition comprising at least one processor; at least one input operatively connected to the at least one processor; a database configured to store three-dimensional facial image data comprising facial feature coordinates in a predetermined common plane; the at least one processor configured to locate three-dimensional facial features in the image of the subject, estimate three-dimensional facial feature location coordinates in the image of the subject, obtain the three-dimensional facial feature location coordinates and orientation parameters in a coordinate system in which the facial features are located in the predetermined common plane; and compare the location of the facial feature coordinates of the subject to images of people in the database; whereby recognition, comparison and/or likeness of the facial images is determined by comparing the predetermined common plane facial feature coordinates of the subject to images in the database. A method is also disclosed.
The embodiments herein may be manufactured, used, and/or licensed by or for the United States Government without the payment of royalties thereon.
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. patent application Ser. No. 15/198,344, entitled "System and Method for Face Recognition with Two-Dimension Sensing Modality," by Dr. Shiqiong Susan Young, filed Jun. 30, 2016, herein incorporated by reference.
BACKGROUND OF THE INVENTION

In automatic face recognition, an unknown subject is identified by comparing an input facial image against a gallery of previously identified persons. The database of stored images is referred to as a gallery or watch list, and the inputted image or video is usually referred to as a probe.
Biometric identifiers are distinctive, measurable characteristics used to label and describe individuals. Facial images are a commonly used biometric characteristic, as are images of the iris, fingerprints, gait, etc. Accurate and reliable face recognition can be utilized for surveillance and security tasks (e.g., entrance security, law enforcement, criminal record identification, etc.).
Two-dimensional facial recognition may not be effective under differing illuminations, expressions, and poses/viewpoints. Use of three-dimensional (3D) face models preserves the geometric structure of a face despite illumination, expression, and pose variables. For example, U.S. Pat. No. 7,620,217, entitled "Three-dimensional face recognition system and method," by W. C. Chen, et al., herein incorporated by reference, discloses a generalized framework for three-dimensional face recognition. U.S. Patent Publication No. 2006/0078172, entitled "3D Face Authentication and Recognition Based on Bilateral Symmetry Analysis," by L. Zhang, et al., herein incorporated by reference, discloses the use of curvatures for 3D face profile recognition and authentication.
Since depth information is lost in a two-dimensional image, construction of 3D models from 2D images requires specific algorithms and sensors. Examples of attempts to obtain a 3D representation of a human face include U.S. Pat. No. 6,047,078, entitled "Method for Extracting a Three-dimensional Model Using Appearance-based Constrained Structure from Motion," by S. B. Kang, herein incorporated by reference, which discloses creation of a 3D face model from a sequence of temporally related 2D images by tracking the facial features. In the publication by Z. L. Sun, et al., entitled "Depth Estimation of Face Images Using the Nonlinear Least-squares Model," IEEE Transactions on Image Processing, 22(1): 17-30 (January 2013), the three-dimensional structure of a human face is reconstructed from its corresponding 2D images with different poses. The depth values of feature points are estimated by a nonlinear least-squares method. The appearance-based approaches require two or more input 2D images of different pose views of the subject and are therefore difficult to apply to moving subjects. Also, 2D images are sensitive to environment lighting.
Passive stereo sensors, as disclosed for example in U.S. Published Application No. 2005/0111705, entitled "Passive stereo sensing for 3D facial shape biometrics," May 26, 2005, herein incorporated by reference, use two cameras to capture the object and determine the object's location in three-dimensional space by using a triangulation technique. Although passive stereo works well on textured scenes and has a high resolution, it has problems at occluding boundaries and difficulty with smooth regions (no textures for matching correspondences). Passive stereo requires illumination from the environment because it does not have an active light source. It is therefore only suitable for outdoor daylight or indoor strong-light scenarios and cannot be applied in low-light conditions.
Range sensors or Time-of-Flight (ToF) sensors resolve depth information by measuring the time or phase changes of the light emitted from the camera to the scene point. Similar to approaches based on the structured light technique, time-of-flight sensors emit active lighting of a certain spectrum into the scene and do not require additional environment light, so they can be used in low-light conditions. This class of apparatus is a popular solution in existing three-dimensional face recognition approaches due to high resolution and high speed. U.S. Pat. No. 6,947,579, entitled "Three-dimensional Face Recognition," issued September 2005, by M. M. Bronstein, et al., discloses high-precision three-dimensional representations of human faces acquired using a range camera; the 3D isometric face model is used to deal with expression changes. In the publication by Kakadiaris, I. A., et al., entitled "Three-dimensional Face Recognition in the Presence of Facial Expressions: an Annotated Deformable Model Approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(4): 640-649 (2007), a laser scanner is used to acquire high-resolution depth information. In this method, the annotated face model (AFM) is used to fit a face geometry image from three-dimensional (3D) face data to eliminate the variations caused by expressions.
Due to the high cost of Time-of-Flight or range sensors, it is not practical to deploy them in a large-scale environment for surveillance purposes. It is desirable to design an affordable large-scale face recognition system that can be used in dark/low-light environments. For this reason, Kinect sensors, such as disclosed in U.S. Pat. No. 7,433,024, are desirable for acquiring 3D depth information. U.S. Pat. No. 7,433,024, entitled "Range Mapping Using Speckle Decorrelation," issued Oct. 7, 2008, by J. Garcia, et al. (herein incorporated by reference), discloses a Kinect sensor comprising a color camera (with three color channels: red, green, and blue, RGB) and an infrared (IR) projector and receiver. Similar to structured light approaches, the IR projector emits a random dotted pattern into the scene. By correlating the received pattern with the projected pattern, the depth information is resolved by stereo triangulation. Since the depth resolution acquired by the Kinect sensor is very low and detecting facial features requires high-fidelity imagery, tailored algorithms are needed to optimize usage of the Kinect sensor for face recognition.
A Kinect sensor provides a color image (called an RGB image) and a depth image (called a D image) of the scene. Together, they are called RGBD images. Using the depth information from a Kinect sensor for face recognition is an active area of research. Most existing methods use a single Kinect sensor for three-dimensional face data acquisition. The publication by R. I. Hg, et al., entitled "An RGB-D Database Using Microsoft's Kinect for Windows for Face Detection," published by the Eighth International Conference on Signal Image Technology and Internet Based Systems, Naples (November 2012), discloses the building of an RGB-D dataset using a Kinect sensor, comprising 1581 images of 31 targets. Using this method, faces are aligned using a triangle formed by the two eyes and the nose. The publication by B. Y. Li, et al., entitled "Using Kinect for Face Recognition Under Varying Poses, Expressions, Illumination and Disguise," published by IEEE Workshop on Applications of Computer Vision, Tampa, Fla. (2013), discloses the use of "dictionary learning" to fit the noisy depth data acquired by a Kinect sensor onto a canonical model. The publication by G. Goswami, et al., entitled "On RGB-D Face Recognition Using Kinect," published by IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems, Arlington, Va. (2013), discloses the application of a histogram of oriented gradients (HOG) feature on the entropy and saliency maps of both RGB and depth images for face classification. The publication by T. Huynh, et al., entitled "An efficient LBP-based descriptor for facial depth images applied to gender recognition using RGB-D face data," published by ACCV Workshop on Computer Vision with Local Binary Pattern Variants, Daejeon, Korea (Nov. 5-6, 2012), discloses the use of gradient local binary patterns (GLBP) to represent faces in depth images and the application of this descriptor to identify the gender of the subject.
However, a single Kinect sensor has a limited field of view. To increase the field of view, some approaches use multiple Kinects for data acquisition. The publication by M. Hossny, et al., entitled "Low cost multimodal facial recognition via Kinect sensors," published by Proceedings of the 2012 Land Warfare Conference, Melbourne, Victoria (2012), discloses a system comprising three Kinect sensors on a triangular rig to capture the 3D information of a subject and the application of Haar features to detect human faces.
The above Kinect sensor based methods consider the depth image as a regular gray image and apply traditional statistics-based classification methods for recognition. See, e.g., M. A. Turk, et al., "Face Recognition Using Eigenfaces," Proceedings of IEEE Computer Vision and Pattern Recognition (CVPR), pp. 586-591 (1991).
Pose variation is one of the major challenges in face recognition, even where three-dimensional data is available, because important facial features (e.g., eye corners and mouth corners) may not be complete in non-frontal poses. Most existing methods exploit facial symmetry to complete the missing data caused by pose variation or occlusion (see, for example, the publication by G. Passalis, et al., entitled "Using Facial Symmetry to Handle Pose Variations in Real-world 3D Face Recognition," published by IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(10): 1938-1951 (2011); and the publication by B. Y. Li, et al., entitled "Using Kinect for Face Recognition Under Varying Poses, Expressions, Illumination and Disguise," published by IEEE Workshop on Applications of Computer Vision, Tampa, Fla. (2013)). However, the completed face information obtained by this method lacks accuracy because the estimation is based on a hypothetical model. Taking a different approach, U.S. Published Application No. 2012/0293635, entitled "Head Pose Estimation Using RGBD Camera," published Nov. 22, 2012, by P. Sharma, et al. (herein incorporated by reference), discloses multiple temporally related depth images and application of an extended Kalman filter to estimate the 3D head poses (translations and rotations) with respect to a reference pose. The accuracy of this approach is determined by the number of depth images.
It is also known in the art to connect a plurality of Kinects to obtain 3D images. For example, Chinese Patent No. CN103279987A, by Huarong, et al., herein incorporated by reference, discloses a fast three-dimensional object modeling method based on Kinect comprising the steps of:
(step 1) fixing the relative positions of each Kinect and a rotating platform, enabling all Kinects to directly face the rotating platform from different visual angles, respectively, to obtain a relatively complete object model;
(step 2) placing an object to be reconstructed in the center of the rotating platform, starting the system to carry out reconstruction of the object, achieving scene modeling of the scene depth information output by each Kinect by using three-dimensional vision theory, and unifying the scene depth information of the Kinects, located in different coordinate systems, into an identical coordinate system;
(step 3) filtering erroneous three-dimensional point clouds by using a removal method based on normal correction; specifically, obtaining the dense three-dimensional point clouds of scene depth information through step 2, extracting the normal information of the three-dimensional point clouds, constructing exterior-point judgment functions based on a local normal constraint, judging the data of the three-dimensional point clouds that do not meet the local normal constraint to be exterior points, and removing the exterior points; and
(step 4) obtaining a three-dimensional model of the object.
SUMMARY

A preferred method for facial recognition comprises:
inputting image data representing a plurality of images from a database; the database comprising images of people wherein the location of the three-dimensional facial features is defined relative to a predetermined common plane;
inputting an image of a subject to be identified;
locating predetermined three-dimensional facial features in the image of the subject for comparison to the image data from the database;
estimating three-dimensional facial feature location coordinates of the subject's head in the image of the subject;
obtaining the three-dimensional facial feature location coordinates and orientation parameters in a coordinate system in which the facial features are located in the predetermined common plane;
comparing the location of the coordinates of the subject to the locations of the coordinates of the images of people in the database; and
determining the identity of the subject.
Optionally, the facial features comprise the eyes, nose, and mouth of a subject, and the coordinates of the location of the facial features are defined as the locations of the eye corners, nose tip, and mouth corners; the predetermined common plane is a vertical plane passing through the midpoints of the eye corners and mouth corners; and the orientation parameters correlate to the yaw, pitch, and roll of the subject's head. Optionally, the step of obtaining the three-dimensional facial feature location coordinates and orientation parameters comprises estimating the three-dimensional orientation of the subject's head. As a further option, the yaw, pitch, and roll of each head in the database of images of people is approximately the same, since all facial features are specified relative to a predetermined common plane comprising the centers of the eye corners and mouth corners. As an additional option, a plurality of geodesic distances between facial coordinates of the subject are determined and matched against the geodesic distances between facial coordinates of the image data in the database. The step of inputting an image of a subject to be identified may optionally comprise generating point clouds and/or surface meshes depicting the subject's head using at least one three-dimensional image sensor. The step of obtaining three-dimensional facial feature location coordinates and orientation parameters may optionally comprise estimating the three-dimensional orientation and scale of the subject's head; and further comprises transforming the three-dimensional facial feature location coordinates of the corners of the eyes, nose tip, and corners of the mouth of the subject into corresponding coordinates relative to a vertical plane containing the midpoints of the corners of the eyes and corners of the mouth using a Gaussian Least Squares Differential Correction (GLSDC) minimization procedure. For each image of a person in the database, the method may optionally comprise:
locating predetermined three-dimensional facial features in the image of the person;
estimating three-dimensional facial feature location coordinates;
obtaining the three-dimensional facial feature location coordinates and orientation parameters relative to a coordinate system defined by the location of facial features relative to the predetermined common plane.
A preferred embodiment of the present invention comprises:
at least one processor;
at least one input operatively connected to at least one processor and configured to input at least one image of a subject to be identified;
a database configured to store three-dimensional facial image data comprising facial feature coordinates in a predetermined common plane;
the at least one processor configured to locate predetermined three-dimensional facial features in the image of the subject, estimate three-dimensional facial feature location coordinates of the subject's head in the image of the subject, and obtain the three-dimensional facial feature location coordinates and orientation parameters relative to a coordinate system in which the facial features are located in the predetermined common plane; and compare the location of the facial feature coordinates of the subject to the locations of the facial feature coordinates of the images of people in the database;
whereby recognition, comparison, and/or likeness of the facial images is determined by comparing the predetermined common plane facial feature coordinates of the subject to the predetermined common plane facial feature coordinates of the images in the database.
As an option, a preferred embodiment may utilize facial features comprising the eyes, nose, and mouth of a subject, and the facial feature coordinates may be the locations of the eye corners, nose tip, and mouth corners; the predetermined common plane may be a vertical plane containing the centers of the eye corners and mouth corners; and the orientation parameters may correlate to the yaw, pitch, and roll of the subject's head.
As a further option, the database may be configured to store a plurality of geodesic distances between facial feature coordinates of the images of people in the database, and the at least one processor determines geodesic distances between facial feature coordinates of the subject and matches those geodesic distances against the geodesic distances between facial coordinates of the images of people in the database. Another option is to use at least one three-dimensional image sensor configured to generate point clouds and/or surface meshes depicting the subject's head. The at least one processor may be configured to obtain the three-dimensional facial feature location coordinates and orientation parameters in a coordinate system in which the facial features are located in the predetermined common plane using a Gaussian Least Squares Differential Correction (GLSDC) minimization procedure.
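The distance-based matching described above can be sketched as follows. This is an illustrative example only: the function names are hypothetical, and straight-line (Euclidean) distances between landmarks stand in for the along-surface geodesic distances described in the text.

```python
import numpy as np
from itertools import combinations

def distance_signature(landmarks):
    """Pairwise distances between facial feature coordinates (e.g., eye
    corners, nose tip, mouth corners), used as a pose-invariant signature.
    Euclidean distances are used here for simplicity in place of the
    geodesic distances described in the text."""
    return np.array([np.linalg.norm(landmarks[i] - landmarks[j])
                     for i, j in combinations(range(len(landmarks)), 2)])

def best_match(probe_landmarks, gallery):
    """Return the index of the gallery entry whose distance signature is
    nearest (in least-squares error) to the probe's signature."""
    sig = distance_signature(probe_landmarks)
    errors = [np.linalg.norm(sig - distance_signature(g)) for g in gallery]
    return int(np.argmin(errors))
```

Because pairwise distances are unchanged by translation and rotation of the head, the signature comparison is insensitive to where and how the subject was positioned when each image was acquired.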
Another preferred method of facial recognition comprises:
inputting image data representing a plurality of images from a database; the database comprising images of people wherein the location of the facial features is defined relative to a predetermined common plane;
inputting a plurality of images of a subject to be identified from at least one image sensor;
locating predetermined three-dimensional facial features in the image of the subject for comparison to the image data from the database;
estimating three-dimensional facial feature location coordinates of the subject's head in the image of the subject;
obtaining the three-dimensional facial feature location coordinates and orientation parameters in a coordinate system in which the facial features are located in the predetermined common plane;
comparing the location of the coordinates of the subject to the locations of the coordinates of the images of people in the database; and
determining the identity of the query subject.
Optionally, the method may be used for one of authentication and surveillance and may further comprise:
selectively activating a device based upon the likeness, resemblance, and/or comparison of the subject to images of people in the database; and wherein the step of obtaining the three-dimensional facial feature location coordinates and orientation parameters comprises using a Gaussian Least Squares Differential Correction (GLSDC) minimization procedure.
These and other embodiments will be described in further detail below with respect to the following figures.
A more complete appreciation of the invention will be readily obtained by reference to the following Description of the Preferred Embodiments and the accompanying drawings in which like numerals in different figures represent the same structures or elements. The representations in each of the figures are diagrammatic and no attempt is made to indicate actual scales or precise ratios. Proportional relationships are shown as approximations.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The embodiments of the invention and the various features and advantageous details thereof are explained more fully with reference to the nonlimiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments of the invention. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments of the invention may be practiced and to further enable those of skill in the art to practice the embodiments of the invention. Accordingly, the examples should not be construed as limiting the scope of the embodiments of the invention. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. In the drawings, the dimensions of objects and regions may be exaggerated for clarity. Like numbers refer to like elements throughout. As used herein the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the full scope of the invention. The singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. Use of the terms “comprises” and/or “comprising” in this specification specifies the presence of stated features, integers, steps, operations, elements, and/or components, but does not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
In this application the terms first, second, etc. may be used to describe various ranges, elements, components, regions, layers and/or sections. These elements, components, regions, layers and/or sections should not be limited by these terms. For example, when referring to first and second areas, these terms are only used to distinguish one area from another area. Thus, a first element, component, region, layer or section discussed below could be termed a second element, component, region, layer or section without departing from the teachings of the present invention.
Furthermore, relative terms, such as "lower" or "bottom" and "upper" or "top," may be used herein to describe one element's relationship to other elements as illustrated in the Figures. It will be understood that relative terms are intended to encompass different orientations of the device in addition to the orientation depicted in the Figures. For example, if the device in the Figures is turned over, elements described as being on the "lower" side of other elements would then be oriented on "upper" sides of the other elements. The exemplary term "lower" can, therefore, encompass both an orientation of "lower" and "upper," depending on the particular orientation of the figure. Similarly, if the device in one of the figures is turned over, elements described as "below" or "beneath" other elements would then be oriented "above" the other elements. The exemplary terms "below" or "beneath" can, therefore, encompass both an orientation of above and below. Furthermore, the term "outer" may be used to refer to a surface and/or layer that is farthest away from a substrate.
As may be used herein, the terms "substantially" and "approximately" provide an industry-accepted tolerance for their corresponding terms and/or relativity between items. Such an industry-accepted tolerance ranges from less than one percent to ten percent and corresponds to, but is not limited to, component values, angles, et cetera. Such relativity between items ranges from less than one percent to ten percent. As may be used herein, the term "substantially negligible" means there is little relative difference, the difference ranging from less than one percent to ten percent.
As used herein the terminology “substantially all” means for the most part; essentially all.
As may be used herein, the term “significantly” means of a size and/or effect that is large or important enough to be noticed or have an important effect.
This description and the accompanying drawings that illustrate inventive aspects and embodiments should not be taken as limiting the scope of the invention. Modifications may be made without departing from the scope of the invention defined in the claims. Details of known structures and techniques have not been shown or described in detail for clarity and so as to not obscure the invention. The drawings are not to scale, and relative sizes of components are for illustrative purposes only and do not reflect the actual sizes that may occur in any actual embodiment of the invention. In accordance with the MPEP, like numbers in two or more figures represent the same or similar elements. Elements and their associated aspects that are described in detail with reference to one embodiment may, whenever practical, be included in other embodiments in which they are not specifically shown or described. If an element is described in detail with reference to one embodiment, it may not be described again in the description of the second embodiment, but the intention is to incorporate the element by reference into the second embodiment, and the element may nevertheless be claimed as included in the second embodiment.
The preferred or alternate embodiments of the present invention described in this application and illustrated using crosssection illustrations are schematic in nature and represent idealized embodiments. Thus, variations from the shapes of the elements presented in this application are contemplated as being within the scope of the present invention and not a departure from the invention. The embodiments of the present invention are not limited to the particular shapes illustrated herein but are to include deviations in shape and sizing. The figures are schematic in nature and shapes are not precise and are not intended to limit the scope of the present invention.
Unless otherwise stated or defined, all terms (including technical and scientific terms) used herein have the meaning commonly understood by one of ordinary skill in the art to which this invention belongs. Terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless it is expressly stated or defined otherwise.
A preferred embodiment of the present invention may be used to identify personnel targets in dark or low-light conditions, where images must be obtained without external illumination. A preferred embodiment of the present invention offers an improved system using, for example, low-cost and high-speed Kinect sensor(s) to capture 3D face imagery. Due to the low depth resolution of the Kinect sensor, there is a need to utilize common structures (biometric landmarks) to characterize 3D faces from Kinect sensors. A preferred embodiment of the present invention develops 3D face model reconstruction from single-pose input imagery via a new stereotactic registration method. As an option, to increase the effective field of view and prevent the scenario in which only partial face landmarks are available from one Kinect sensor, a preferred embodiment may use multiple Kinect sensors in the system. As an option, a preferred embodiment may enable 3D face model reconstruction from multiple-pose input imagery obtained from a multiple Kinect sensor system via a registration method. Also, in order to operate in an uncontrolled environment, pose variations may be accommodated in face recognition by registering images that are acquired at different times and/or by different sensors into a common 3D coordinate system.
V_P = (a_1, a_2, a_3, a_4, a_5, a_6, a_7, a_8, θ, φ, ψ)

where a_1, a_2, a_3, a_4, a_5, a_6, a_7, a_8 are independent of yaw, pitch, and roll, and where θ, φ, ψ, and α are used in a Gaussian Least Squares Differential Correction (GLSDC) minimization procedure to perform the conversion. Inputted facial features are transformed to a new unified coordinate system (UCS) and matching, recognition, and/or identification is performed (Box 24).
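As an illustrative sketch of such a minimization (not the full GLSDC procedure, which also estimates the parameters a_1 through a_8 and scale), the orientation angles can be recovered with an iterative least-squares solver by driving the out-of-plane deviations of the anchor points (here assumed to be the eye-corner centers and mouth corners, which lie in a common vertical plane x = const after alignment) to zero. All function names below are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

def rot(yaw, pitch, roll):
    """3-2-1 Euler rotation matrix built from yaw, pitch, roll (radians)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    return Rz @ Ry @ Rx

def to_common_plane(anchors):
    """Estimate orientation angles that rotate the anchor points into a
    common vertical plane x = const. The residual is each rotated point's
    deviation from that plane, driven toward zero by an iterative
    least-squares (Gauss-Newton-style) solver."""
    def residual(angles):
        p = anchors @ rot(*angles).T          # rotate every anchor point
        return p[:, 0] - p[:, 0].mean()       # out-of-plane deviations
    sol = least_squares(residual, x0=np.zeros(3))
    return sol.x, anchors @ rot(*sol.x).T
```

Once the anchors lie in the common plane, the remaining facial feature coordinates can be expressed relative to that plane, so that gallery and probe features share the same unified coordinate system.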
In the following (shown in
Referring now to
In order to acquire 3D information in low-light/dark environments, a Kinect sensor may be used for depth sensing, since the Kinect sensor projects IR light into the scene without requiring external lighting. To increase the effective field of view and avoid the scenario in which only partial face landmarks are available from one Kinect sensor, a preferred embodiment system may comprise three Kinect sensors for capturing the RGBD images of the subject, with a processing unit for data storage and mathematical and logical computation.
As depicted in
To acquire a gallery of subjects of known identities, the subject is standing or sitting at a fixed distance away from the rig, facing the Kinect 72 in the middle. Consequently, RGBD images of three poses are available in the gallery including the front (obtained from Kinect 72), left (obtained from Kinect 71), and right profiles (obtained from Kinect 73). To acquire a probe, the multiple Kinect sensor system may be placed in a fixed location in the area that needs surveillance. When a subject approaches, the system is triggered by the control unit to capture the RGBD images of the subject. In an uncontrolled environment, a subject could approach in any direction and with various poses. Since the preferred embodiment system has three Kinect sensors (71, 72, 73) to acquire images of the subject simultaneously, the system can increase the field of view and can handle pose variations more effectively.
A Kinect sensor may be used to capture the RGBD images of a subject as illustrated schematically in
As shown in
{right arrow over (r)}=C+λ{right arrow over (d)} Equation (1)
where C(0,0,0) is the camera's center of the projection, which is also the origin of the coordinate system, {right arrow over (d)} is a unit vector indicating the ray's direction, and λ is the propagation factor.
Since P is on the trajectory of the ray {right arrow over (r)}, then:
P=C+λ_{P}{right arrow over (d)}. Equation (2)
In a camera coordinate system, the direction vector {right arrow over (d)} can be decomposed along the three major axes as {right arrow over (d)}=(d_{x}, d_{y}, d_{z}), where the x and y directions determine the image plane and the z direction is the optical axis. Another point lying on the ray {right arrow over (r)} is the pixel p(u, v), which is the intersection of the ray {right arrow over (r)} and the image plane. Since the image plane is at z=1 in this case, it follows that:
(u,v,1)=(0,0,0)+λ_{p}(d_{x},d_{y},d_{z}). Equation (3)
Therefore the ray's direction can be computed as
{right arrow over (d)}=(u,v,1)/√{square root over (u^{2}+v^{2}+1)}. Equation (4)
For each pixel p(u, v) on the input image, there is a depth measurement from the depth image along the optical axis. Given the depth measurement {tilde over (z)} of the 3D point P corresponding to the pixel p(u, v), its propagation factor can be computed by Equation (2) as
λ_{P}={tilde over (z)}√{square root over (u^{2}+v^{2}+1)}. Equation (5)
Therefore, by substituting Equation (5) into Equation (2), the 3D location of P can be computed as (u{tilde over (z)}, v{tilde over (z)}, {tilde over (z)}).
When this procedure is used for every input pixel in the depth image, a point cloud representation is obtained for the acquired face from a Kinect sensor. In order to have a better visualization of a 3D face, or to make any measurements from the face, a 3D surface mesh can be reconstructed by applying the Delaunay triangulation (see L. Guibas and J. Stolfi, “Primitives for the manipulation of general subdivisions and the computation of Voronoi diagrams,” ACM Transactions on Graphics, 4(2):74-123, April 1985) (herein incorporated by reference) to the point cloud.
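The pixel-to-point back-projection of Equations (1)-(5), followed by the Delaunay triangulation step, can be sketched as follows. This is a minimal illustration rather than the patented implementation; the function names, the normalized pixel-coordinate convention, and the use of SciPy's Delaunay routine are assumptions of the sketch.

```python
import numpy as np
from scipy.spatial import Delaunay

def depth_to_point_cloud(depth):
    """Back-project a depth image into a 3D point cloud.

    With normalized pixel coordinates (u, v) on an image plane at z = 1,
    each 3D point is simply (u*z, v*z, z), as derived in Equations (1)-(5).
    """
    h, w = depth.shape
    u = (np.arange(w) - w / 2.0) / w          # normalized column coordinates
    v = (np.arange(h) - h / 2.0) / h          # normalized row coordinates
    uu, vv = np.meshgrid(u, v)
    valid = depth > 0                          # drop pixels with no depth return
    z = depth[valid]
    return np.column_stack((uu[valid] * z, vv[valid] * z, z))

def point_cloud_to_mesh(points):
    """Triangulate the (x, y) projection to obtain a 3D surface mesh."""
    tri = Delaunay(points[:, :2])
    return points, tri.simplices               # vertices and triangle indices

depth = np.full((4, 4), 2.0)                   # toy 4x4 depth image, 2.0 everywhere
cloud = depth_to_point_cloud(depth)
verts, faces = point_cloud_to_mesh(cloud)
```

Each row of `faces` indexes three vertices of one surface triangle; a real pipeline would also filter long boundary edges before meshing.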
The point cloud or the 3D surface mesh is referred to as a 3D face premodel. From the resultant 3D face surface mesh (3D face premodel), the facial features are either manually annotated on the 3D model or can be located using existing automatic 3D face feature extraction methods.
In accordance with the preferred embodiment, the face plane is a vertical plane that passes through four points (the centers of corners of each eye and the corners of mouth).
For each subject in the gallery, if the single pose input imagery is utilized, one depth image from one pose is acquired from one Kinect sensor and one 3D face premodel is reconstructed. If multiple pose input imagery is utilized, multiple depth images from multiple poses are acquired from multiple Kinect sensors and multiple 3D face premodels are reconstructed. Each 3D face premodel corresponds to one pose input acquisition. Similarly, the probe could contain one 3D face premodel for one subject or multiple 3D face premodels for one subject.
3D Registered Face Model Reconstruction
When a 3D face premodel is acquired under an uncooperative (uncontrolled) condition, the 3D face premodel can contain three types of rotations, i.e., in-plane, pose, and tilt rotations as shown in
In order to match 3D face premodels that are taken by different sensors, at different times, and under uncooperative conditions, the face premodels from the gallery and the probe need to be transformed, via the preferred embodiment 3D registration method, into a common coordinate system, which may be referred to as the unified coordinate system. A preferred embodiment of the present invention utilizes a coordinate system that is invariant to the sensor position with respect to the human subject when the face image is taken; i.e., invariant relative to the pose of the human head.
The preferred embodiment also may utilize a novel method of a 3D registration, which may be referred to as 3D stereotactic registration. Although the word “stereotactic” may be used herein, the invention is not limited to stereotactic registration. The preferred embodiment 3D registration method transforms the 3D face premodel into a unified coordinate system (UCS) to generate the 3D registered face model. The transformation contains shifts, 3D rotation, and scale, which is a generalized yaw, pitch, roll, scale, and shift transformation to a unified coordinate system (UCS).
Described in the following are a) the unified coordinate system (UCS); b) estimation of preferred embodiment transformation parameters using a single 3D face premodel; c) estimation of preferred embodiment transformation parameters using a single 3D face premodel with the iterative refinement; d) estimation of preferred embodiment transformation parameters using multiple 3D face premodels; and e) the integrated face model from multiple registered face models.
Unified Coordinate System (UCS)
When a human face in three-dimensional space is viewed from the side, there is a difference in depth between the eyes and the mouth, which is unique to each person. Thus, for a human head in a “normal” position, normally termed a frontal pose, the centers of the corners of each eye and the corners of the mouth lie on two different vertical planes.
A preferred embodiment utilizes a unified coordinate system (UCS) defined by first labeling four points on a human face, two midpoints of the corners of each eye and two corners of mouth, as labeled in
This face plane is distinguishable from the face plane described in U.S. Pat. No. 7,221,809 (the '809 Geng patent), entitled “Face recognition system and method,” May 22, 2007, by Z. J. Geng, herein incorporated by reference. The '809 Geng patent defined “face plane” differently in that Geng's face plane was not defined as being vertical, but simply as passing through the centers of the two eyes and the outer corners of the mouth, with the location of the nose tip being on the central bisection plane of the face plane, and the projection of the nose tip towards the face plane forming a “Nose Vector.”
The four points defined using the preferred embodiment Unified Coordinate System (UCS) (for example, the midpoints of the corners of the eyes and the corners of the mouth) uniquely define each person regardless of the coordinates of the sensor with respect to the relative pose of the head. In accordance with a preferred embodiment of the present invention, a face premodel is transformed from some undefined position into the unified coordinate system (UCS) to generate a registered face model. Face recognition or face matching can then be performed under the unified coordinate system on the registered face models.
As stated above, a face plane may be defined as a vertical plane passing through four points (the centers of the corners of each eye and the corners of the mouth). For a human head in a normal frontal pose, the centers of the corners of each eye and the corners of the mouth lie on two different vertical planes. In order for these four points to lie on a single vertical plane, the human head is tilted from the normal frontal pose; this defines the face plane as depicted in
Since face premodels at any other positions, with up to three possible rotation angles, are transformed into the unified coordinate system (UCS) using a preferred embodiment of the present invention, the face recognition problem under uncooperative or uncontrolled conditions, where face premodels have 3D rotational angles, can be handled. An uncooperative or random subject for face recognition may be processed based on one single 3D face premodel obtained from one Kinect sensor or multiple 3D face premodels obtained from multiple Kinect sensors. Recognition and registration are achieved by performing the generalized yaw, pitch, roll, scale, and shift transformation to the unified coordinate system (UCS).
Shift is related to the relative position of the object with respect to the center of the image plane. Scale is related to the object range and the sensor focal plane properties (number of pixels, element spacing, etc.). Yaw, pitch, and roll are related to the pose rotation, head tilt rotation, and in-plane rotation, respectively.
Estimation of Transformation Parameters Using a Single 3D Face Pre-Model
A face plane under the unified coordinate system (UCS) is illustrated in
The origin point O (0, 0, 0) of the face plane is defined as the intersection of two lines as shown in
As depicted in
−x_{e1}=x_{e4}=a_{1}, y_{e1}=y_{e4}=a_{2}, −x_{e2}=x_{e3}=a_{3}, y_{e2}=y_{e3}=a_{4}, x_{n}=0, y_{n}=a_{5}, z_{n}=a_{6}, −x_{m1}=x_{m2}=a_{7}, y_{m1}=y_{m2}=a_{8}.
The seven landmark coordinates with respect to the primary parameters may be defined as follows (Equation list (8)):
left eye outer corner: p_{1}=(−a_{1},a_{2},0), left eye inner corner: p_{2}=(−a_{3},a_{4},0), right eye inner corner: p_{3}=(a_{3},a_{4},0), right eye outer corner: p_{4}=(a_{1},a_{2},0), nose tip: p_{5}=(0,a_{5},a_{6}), mouth left corner: p_{6}=(−a_{7},a_{8},0), mouth right corner: p_{7}=(a_{7},a_{8},0).
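Equation list (8) can be written out directly in code; a small sketch in which the dictionary keys are illustrative names rather than terms from the specification:

```python
def face_plane_landmarks(a):
    """Seven UCS landmark coordinates from the primary parameters a1..a8 (Equation list (8))."""
    a1, a2, a3, a4, a5, a6, a7, a8 = a
    return {
        "left_eye_outer":  (-a1, a2, 0.0),
        "left_eye_inner":  (-a3, a4, 0.0),
        "right_eye_inner": ( a3, a4, 0.0),
        "right_eye_outer": ( a1, a2, 0.0),
        "nose_tip":        (0.0, a5, a6),   # the only landmark off the face plane
        "mouth_left":      (-a7, a8, 0.0),
        "mouth_right":     ( a7, a8, 0.0),
    }
```

Note the left/right symmetry about x = 0 and that every landmark except the nose tip has z = 0, i.e., lies in the face plane.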
In accordance with the principles of the present invention, each landmark point on a 3D face premodel can be transformed to the UCS domain by the following transformation:
P_{M}=q·R·P_{U} Equation (9)
where P_{M}=[u, v, r]^{T} represents the coordinates of a point in the 3D face premodel and denotes a point in the measurement domain; P_{U}=[x,y,z]^{T} represents the coordinates of the corresponding point on the registered 3D face model in the UCS domain; q is the scaling factor that normalizes faces of different sizes; and R is the rotation matrix. As illustrated in
R=Θ·Φ·Ψ Equation (10)
where Θ represents the in-plane rotation transformation (roll) in the (x, y) plane, that is,
Θ(θ)=[cos θ, −sin θ, 0; sin θ, cos θ, 0; 0, 0, 1], Equation (11)
Φ represents the pose-rotation transformation (yaw) in the (x, z) plane, that is,
Φ(φ)=[cos φ, 0, sin φ; 0, 1, 0; −sin φ, 0, cos φ], Equation (12)
and Ψ represents the tilt-rotation transformation (pitch) in the (y, z) plane, that is,
Ψ(ψ)=[1, 0, 0; 0, cos ψ, −sin ψ; 0, sin ψ, cos ψ], Equation (13)
where θ, Φ, and ψ are the angles of inplane, pose, and tilt rotations, respectively.
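The factored rotation R = Θ·Φ·Ψ of Equation (10) can be sketched with standard elementary rotation matrices; the sign conventions below are conventional right-handed choices and are an assumption of this sketch.

```python
import numpy as np

def roll(theta):
    """In-plane rotation in the (x, y) plane."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def yaw(phi):
    """Pose rotation in the (x, z) plane."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def pitch(psi):
    """Tilt rotation in the (y, z) plane."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rotation(theta, phi, psi):
    """Combined rotation R = Theta * Phi * Psi of Equation (10)."""
    return roll(theta) @ yaw(phi) @ pitch(psi)
```

The product is itself a proper rotation (orthogonal with determinant 1), which is what the transformation of Equation (9) requires.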
In order to estimate the transformation that transforms a 3D face premodel into the UCS, 12 parameters need to be estimated, that is, (a_{1},a_{2},a_{3},a_{4},a_{5},a_{6},a_{7},a_{8},θ,φ,ψ,q). These parameters can be normalized by the scale q, which is derived from the unit distance defined above. Therefore, 11 transformation parameters need to be estimated.
A parameter vector can be defined as:
V_{P}=(a_{1},a_{2},a_{3},a_{4},a_{5},a_{6},a_{7},a_{8},θ,φ,ψ)^{T} Equation (14)
The coordinates of each landmark in a 3D face premodel can be represented by this parameter vector, V_{p}, as shown in Equation (9). The coordinates of each face landmark can also be obtained by measurements from the face premodel, which is denoted as
For a 3D face premodel, 7 landmarks are obtained in the measurement domain, which results in 7 sets of (u, v, r) values. Therefore, 21 equations are formed from Equation (9) to estimate the 11 parameters in Equation (14). The goal is to make the difference between the measurement coordinates and the estimated coordinates of the face landmarks minimal, that is,
min_{V_{p}}∥F−ƒ(V_{p})∥. Equation (15)
The function F, which is the function of the parameter vector V_{p}, can be defined as
F=ƒ(V_{p}) Equation (16)
where (u, v, r) is calculated from Equation (9) using the parameter vector V_{p}, n=1, . . . , 7 represents the 7 landmarks, and k=1, . . . , 21 represents the 21 equations that result from the 7 landmark measurements. The measurements (ū, v̄, r̄) are obtained directly from the 3D face premodel.
A primary goal is to find an estimate of V_{p}, denoted as {tilde over (V)}_{p}. This estimate minimizes
∥F−ƒ({tilde over (V)}_{p})∥. Equation (17)
To estimate the eleven parameters V_{p}=(a_{1},a_{2},a_{3},a_{4},a_{5},a_{6},a_{7},a_{8},θ,φ,ψ)^{T}, a nonlinear least squares algorithm, such as the Gaussian least squares differential correction (GLSDC), can be applied. The function F can be approximated to first order by
ΔF≈J(V_{p})·ΔV_{p} Equation (18)
where J(V_{p}) is the Jacobian matrix of ƒ with respect to V_{p}. If ΔF is given, then one can find ΔV_{p} as follows:
ΔV_{p}=[J^{T}(V_{p})J(V_{p})]^{−1}J^{T}(V_{p})ΔF. Equation (19)
Next, the estimation is updated using V_{p}=V_{p}+ΔV_{p}. The algorithm can be summarized as follows as shown in

 Step (1) Input measurements F (Box 102).
 Step (2) Input parameter vector starting estimates (Box 104)
V_{p}=(a_{1},a_{2},a_{3},a_{4},a_{5},a_{6},a_{7},a_{8},θ,φ,ψ)^{T }

 Step (3) Calculate ΔF=F−ƒ(V_{p})(Box 106).
 Step (4) Calculate the estimate correction vector ΔV_{p }using Equation (19) (Box 108).
 Step (5) Update the parameter vector V_{p}=V_{p}+ΔV_{p }(Box 110).
 Step (6) Stop iteration if ΔF does not change significantly from one iteration to another (Box 112), otherwise go to step (3).
 Step (7) Output the estimated transformation parameters V_{p }(Box 114).
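Steps (1)-(7) amount to a Gauss-Newton style iteration; below is a generic sketch using a forward-difference Jacobian. The helper name `glsdc`, the convergence constants, and the numerical Jacobian are assumptions of the sketch, not details of the patented GLSDC implementation.

```python
import numpy as np

def glsdc(f, F, vp0, tol=1e-9, max_iter=100, eps=1e-6):
    """Gauss-Newton / GLSDC sketch for min ||F - f(Vp)|| (Steps 1-7).

    f  : model function mapping the parameter vector to predicted landmarks
    F  : measured landmark coordinates (flattened)
    vp0: starting estimate of the parameter vector (Step 2)
    """
    vp = np.asarray(vp0, dtype=float)
    prev = np.inf
    for _ in range(max_iter):
        dF = F - f(vp)                           # Step 3: residual
        # Forward-difference Jacobian of f at vp.
        J = np.column_stack([
            (f(vp + eps * e) - f(vp)) / eps
            for e in np.eye(len(vp))
        ])
        dvp, *_ = np.linalg.lstsq(J, dF, rcond=None)   # Step 4: correction
        vp = vp + dvp                            # Step 5: update
        cost = float(dF @ dF)
        if abs(prev - cost) < tol:               # Step 6: convergence test
            break
        prev = cost
    return vp                                    # Step 7: estimated parameters
```

For the face problem, `f` would map the 11-element parameter vector to the 21 predicted landmark coordinates of Equation (9).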
Estimation of Preferred Embodiment Transformation Parameters Using a Single 3D Face PreModel with the Iterative Refinement
To improve the accuracy of the estimation, the transformation parameters are separated into two groups: the face plane parameters P_{ƒ}=(a_{1},a_{2},a_{3},a_{4},a_{5},a_{6},a_{7},a_{8})^{T} and the rotation parameters P_{r}=(θ,φ,ψ)^{T}. They may then be iteratively refined in two alternating steps. In this case, the objective function of the optimization problem becomes:
where R(P_{r}) corresponds to the combined rotational transformation (in-plane, pose, and tilt). According to Equation (10), R(P_{r}) can be written as
R(P_{r})=Θ(θ)·Φ(φ)·Ψ(ψ). Equation (21)
M is the transformation matrix that maps the parameters P_{ƒ} (an 8 by 1 column vector) to P_{u} (a 3 by 1 column vector):
P_{u}=M·p_{ƒ}. Equation (22)
The transformation is performed according to the landmark coordinate as shown in Equation (8). Each landmark has a unique transformation matrix M.
A technique that iteratively refines face plane parameters P_{ƒ }and rotation parameters P_{r }is described in
1) P_{r} Sub-Problem (Box 126)
The face plane parameters P_{ƒ} are initialized using the aligned mean face and applied as known parameters to solve P_{r}=(θ,φ,ψ)^{T} by gradient descent optimization. Since the unknown parameters are reduced to P_{r} only, our objective function becomes:
Next, G(θ,φ,ψ) is defined as:
G(θ,φ,ψ)=F−R(P_{r})·M·P_{ƒ}=F−Θ(θ)Φ(φ)Ψ(ψ)·M·P_{ƒ},
A goal is to find optimal solutions for (θ,φ,ψ) that results in the minimum value of G.
One approach to solve this optimization is by using an exhaustive search, which computes G for all possible values of θ,φ,ψ and returns the combination that produces the smallest G as the solution. However, this approach has very low efficiency since the search space is of O(n^{3}).
Instead of conducting an exhaustive search, a gradient descent method is used to solve the optimization problem. Gradient descent is based on the observation that the multivariable function G is differentiable in a neighborhood of an arbitrary point (θ,φ,ψ); therefore, G decreases fastest if one moves from (θ,φ,ψ) in the direction of the negative gradient of G at (θ,φ,ψ). Assuming (θ^{k},φ^{k},ψ^{k}) has been determined at the k-th iteration, (θ^{k+1},φ^{k+1},ψ^{k+1}) is updated for the next iteration k+1 as:
(θ^{k+1},φ^{k+1},ψ^{k+1})=(θ^{k},φ^{k},ψ^{k})−γ·∇H(θ^{k},φ^{k},ψ^{k}) Equation (24)
where
H(θ,φ,ψ)=½∥G(θ,φ,ψ)∥^{2}
and γ is a parameter that controls the size of the update step. In the experiments, γ=0.01 was chosen. The gradient of H(θ^{k},φ^{k},ψ^{k}), ∇H(θ^{k},φ^{k},ψ^{k}), can be computed as:
∇H(θ^{k},φ^{k},ψ^{k})=J_{G}(θ^{k},φ^{k},ψ^{k})G(θ^{k},φ^{k},ψ^{k})
where J_{G }is the Jacobian matrix of G:
G_{1, . . . 7 }are the objective functions for each landmark point.
If γ is small enough, then G(θ^{k+1},φ^{k+1},ψ^{k+1})<G(θ^{k},φ^{k},ψ^{k}). The iteration is stopped when the decrease between successive iterations falls below some predefined threshold ε (ε>0), i.e., stop at the k-th iteration if G(θ^{k},φ^{k},ψ^{k})−G(θ^{k+1},φ^{k+1},ψ^{k+1})≤ε.
2) P_{ƒ }SubProblem (Box 128)Once the rotation parameters P_{r}=(θ,φ,ψ)^{T }are obtained, they are used as known parameters to solve for the face plane parameters, P_{ƒ}=(a_{1},a_{2},a_{3},a_{4},a_{5},a_{6},a_{7},a_{8})^{T}. In this subproblem, an objective function becomes:
Since the objective function has eight variables, solving the optimization problem using the gradient descent method would need to compute a Jacobian matrix with dimension 7 by 8, which is computationally expensive. To reduce computation time, the problem is formulated into a large linear system:
F=R·M·P_{ƒ}. Equation (27)
This linear system is solved using least squares optimization. Let A=R·M. Since A is not a square matrix (3 by 8), the pseudoinverse of A is computed using the singular value decomposition (SVD) A=U·S·V^{T}:
A^{+}=V·S^{+}·U^{T} Equation (28)
where U and V are orthogonal, so that their inverses are equal to their transposes; S is a diagonal matrix composed of the singular values of A (the square roots of the eigenvalues of A^{H}A); and S^{+}=[diag(σ_{1}^{−1},σ_{2}^{−1}, . . . ,σ_{n}^{−1})]. When σ_{i} is very close to 0, one can replace σ_{i}^{−1} with 0 to reduce round-off error. Therefore, the solution of Equation (26) can be computed as:
P_{ƒ}=V·[diag(σ_{1}^{−1},σ_{2}^{−1}, . . . ,σ_{n}^{−1})]·(U^{T}F). Equation (29)
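Equations (28)-(29) are the standard SVD pseudoinverse solution of an overdetermined linear system; a NumPy sketch, where the singular-value cutoff is an illustrative assumption:

```python
import numpy as np

def pinv_solve(A, F, cutoff=1e-10):
    """Least-squares solution of F = A @ P via the SVD pseudoinverse (Eqs. (28)-(29)).

    Singular values below `cutoff` are zeroed instead of inverted, to reduce
    round-off error as described in the text.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.where(s > cutoff, 1.0 / s, 0.0)   # invert only well-conditioned values
    return Vt.T @ (s_inv * (U.T @ F))

A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])  # toy overdetermined 3x2 system
p_true = np.array([2.0, -1.0])
F = A @ p_true
p = pinv_solve(A, F)
```

For a consistent system, the pseudoinverse recovers the exact parameters; otherwise it returns the least-squares minimizer of ∥F − A·P∥.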
In summary, the transformation of a 3D face premodel into the UCS domain with iterative refinement proceeds as follows, as shown in

 Step (1) Input measurements F (Box 122).
 Step (2) Initialize the face plane parameters (Box 124):
P_{ƒ}=(a_{1},a_{2},a_{3},a_{4},a_{5},a_{6},a_{7},a_{8})^{T}.

 Step (3) Use P_{ƒ }as known parameters and solve the rotation parameters
 P_{r}=(θ,φ,ψ)^{T }by the gradient descent method (Box 126):
 Step (3) Use P_{ƒ }as known parameters and solve the rotation parameters
(θ^{k+1},φ^{k+1},ψ^{k+1})=(θ^{k},φ^{k},ψ^{k})−γ·J_{G}(θ^{k},φ^{k},ψ^{k})G(θ^{k},φ^{k},ψ^{k}).

 Step (4) Use P_{r }as known parameters and solve the face plane parameters P_{ƒ }by the SVD (Box 128):
P_{ƒ}=V·[diag(σ_{1}^{−1},σ_{2}^{−1}, . . . ,σ_{n}^{−1})]·(U^{T}F).

 Step (5) Stop the iteration when the difference between the updated result and the previous result is below some predefined threshold ε (ε>0) (Box 130); otherwise go to Step (3).
 Step (6) Output the parameters P_{r }and P_{ƒ }(Box 132).
Estimation of Preferred Embodiment Transformation Parameters Using Multiple 3D Face PreModels
Note that the above algorithms require the input to be a complete set of seven landmark points; otherwise the problem is underdetermined, i.e., the face parameter vector in the UCS domain cannot be solved. In real experiments, however, partial landmark sets can occur due to occlusion caused by large pose changes or by missing depth data. To solve the problem of partial landmarks, multiple face premodels acquired from a multiple Kinect sensor system, whose landmark points are complementary, are utilized as input. Illustrated in
Due to the annotation error and incomplete depth data, the location or depth information of a landmark point can sometimes be inaccurate. To eliminate inaccurate landmark points, an interactive user interface can be implemented to allow the user to select desirable landmark points that are most suitable for the transformation parameter estimation to improve accuracy.
As shown in
Similar to the iterative refinement approach described in the previous section, an iterative refinement approach is used to estimate P_{ƒ }and {P_{r}^{i}}, where {P_{r}^{i}}={θ^{i},φ^{i},ψ^{i}}, i=1, 2, 3. The Jacobian matrix in the θ,φ,ψ subproblem becomes
Integrated Face Model from Multiple Registered Face Models
When multiple Kinect data are available, multiple registered 3D face models may be obtained for each subject. In order to unify the multiple 3D models and reduce noise, a face model fusion operation is performed to generate the integrated 3D face model. First, a grid with M×N uniformly sampled bins is established. Then, the multiple registered 3D face models are placed into this grid. For each bin, outliers may first be removed using a median filter. If multiple points {p_{k}}, k=1, . . . , n, fall into one bin, the final depth value for the bin may be obtained by averaging their depth values, that is, z=(1/n)Σ_{k=1}^{n}z_{k}.
After these steps, the integrated 3D face model is constructed.
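The binning, median-based outlier removal, and per-bin averaging can be sketched as follows; the grid resolution, the median-absolute-deviation outlier test, and the function name are assumptions of this sketch rather than details from the specification.

```python
import numpy as np

def fuse_models(models, grid=(64, 64), k=2.0):
    """Fuse several registered 3D face models into one integrated model.

    models: list of (n_i, 3) point arrays already in the UCS domain.
    Points are binned on an M x N grid over (x, y); within each bin, depth
    outliers far from the median are discarded and the rest averaged.
    k is a hypothetical outlier threshold in median-absolute-deviation units.
    """
    pts = np.vstack(models)
    M, N = grid
    # Map (x, y) coordinates to integer bin indices on the grid.
    xi = np.clip(((pts[:, 0] - pts[:, 0].min()) /
                  (np.ptp(pts[:, 0]) + 1e-12) * (M - 1)).astype(int), 0, M - 1)
    yi = np.clip(((pts[:, 1] - pts[:, 1].min()) /
                  (np.ptp(pts[:, 1]) + 1e-12) * (N - 1)).astype(int), 0, N - 1)
    depth = np.full(grid, np.nan)
    for b in set(zip(xi, yi)):
        z = pts[(xi == b[0]) & (yi == b[1]), 2]
        med = np.median(z)
        mad = np.median(np.abs(z - med)) + 1e-12
        z = z[np.abs(z - med) <= k * mad]      # median-based outlier removal
        if z.size:
            depth[b] = z.mean()                # per-bin averaging
    return depth
```

Bins that receive no surviving points remain NaN and could be filled by interpolation in a full pipeline.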
Example
When there is a single pose input image, one registered 3D face model is obtained that is constructed from one face premodel; this is referred to herein as a registered 3D face model from single pose input. When there are multiple pose input images, one integrated 3D face model may be constructed from multiple registered 3D face models; this is referred to herein as an integrated 3D face model from multiple pose input. Either a registered 3D face model or an integrated 3D face model is referred to as the final 3D face model. In either case, the final 3D face model is transformed into the unified coordinate system (UCS) domain using a preferred embodiment of the present invention.
In an alternate preferred embodiment matching approach, the matching problem is formulated as multiple hypotheses testing. In this matching approach, two methods are described: a) using facial landmark features and b) using geodesic distances between features. This alternate approach is depicted in Boxes 209-212 of
In this multiple hypotheses testing formulation, the gallery model, the probe model, and the decision rule are constructed.
Gallery Model: The gallery contains M final 3D face models that belong to M subjects. In the gallery model, the gallery is denoted by a set of M hypotheses,
H_{m}, m=1,2, . . . ,M Equation (33)
with a priori probabilities
P_{m}=Prob(Gallery generating H_{m}). Equation (34)
Probe Model: The probe contains the test final 3D face model that belongs to a test subject. The probe observation vector, {right arrow over (R)}, contains N landmark measurement coordinates (r_{xi},r_{yi},r_{zi}), that is,
{right arrow over (R)}=[r_{x1}r_{y1}r_{z1}, . . . r_{xN}r_{yN}r_{zN}]. Equation (35)
The coordinates of the landmarks of the gallery are denoted as (x_{mi},y_{mi},z_{mi}). The coordinates from the probe face model and the gallery face model are related by some noise; this noise is assumed to follow the additive white Gaussian noise (AWGN) model. Therefore, the N landmark measurement coordinates (r_{xi},r_{yi},r_{zi}) are represented as
r_{xi}=x_{mi}+n_{xi},r_{yi}=y_{mi}+n_{yi},r_{zi}=z_{mi}+n_{zi},i=1,2, . . . N Equation (36)
where n_{xi}'s, n_{yi}'s, and n_{zi}'s are independent identically distributed (iid) normal or Gaussian random variables with zero mean and variance σ_{n}^{2}, respectively, that is,
E(n_{xi})=0,E(n_{xi}^{2})=σ_{n}^{2},E(n_{yi})=0,E(n_{yi}^{2})=σ_{n}^{2},E(n_{zi})=0,E(n_{zi}^{2})=σ_{n}^{2}. Equation (37)
The conditional probability of the observation vector, given that the gallery generates H_{m}, is represented as follows:
p({right arrow over (R)}|H_{m})=(2πσ_{n}^{2})^{−3N/2} exp{−Σ_{i=1}^{N}[(r_{xi}−x_{mi})^{2}+(r_{yi}−y_{mi})^{2}+(r_{zi}−z_{mi})^{2}]/(2σ_{n}^{2})}. Equation (38)
Under the above model, the multiple hypotheses testing theory allows the decision rule to be developed as shown in the following. By manipulating the equation in (36), a cost function between the probe and the m-th gallery subject is formed:
C_{m}=Σ_{i=1}^{N}[(r_{xi}−x_{mi})^{2}+(r_{yi}−y_{mi})^{2}+(r_{zi}−z_{mi})^{2}]. Equation (39)
The decision rule selects the gallery subject with the minimum cost:
m*=arg min_{m} C_{m}. Equation (41)
In summary, our decision rule is to calculate the minimum value according to Equation (41) between the test subject in the probe and the M subjects in the gallery. The subject m that results in the minimum value in Equation (41) is claimed as the right match to the test subject.
The theory of multiple hypotheses testing shows that the cost function of this decision rule in Equation (41) is optimal, minimizing the probability of error under the additive white Gaussian noise (AWGN) measurement model.
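Under uniform priors, the decision rule of Equation (41) is a nearest-neighbor search over landmark coordinates in the UCS domain; a sketch with a hypothetical helper name:

```python
import numpy as np

def match_landmarks(probe, gallery):
    """Decision rule of Equation (41): pick the gallery subject whose
    landmarks minimize the summed squared distance to the probe.

    probe  : (N, 3) landmark coordinates of the test subject
    gallery: list of (N, 3) landmark arrays, one per enrolled subject
    """
    costs = [float(np.sum((probe - g) ** 2)) for g in gallery]
    return int(np.argmin(costs)), costs

# Toy gallery of two subjects; the probe is a slightly perturbed copy of subject 1.
gallery = [np.zeros((7, 3)), np.ones((7, 3))]
probe = np.ones((7, 3)) + 0.01
idx, costs = match_landmarks(probe, gallery)
```

The minimum-cost index plays the role of the hypothesis H_m selected by the decision rule.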
Multiple Hypotheses Testing Using Geodesic Distances Between the Features
The above cost function in Equation (39) is formulated based on the facial landmark features. It can also be formulated using the geodesic distances between the features. This takes advantage of the 3D face data obtained from the Kinect sensor and provides a more accurate matching metric given the registered 3D face model.
In the following, the computation of the geodesic distance and the decision rule based on the geodesic distances between face landmarks are described.
Computation of the Geodesic Distance
Geometrically, for a given surface z=ƒ(x, y) and two points p_{1 }and p_{2 }on the surface, the geodesic distance between p_{1 }and p_{2 }is defined as
D_{g}(p_{1},p_{2})=∫_{p_{1}}^{p_{2}}ds=∫_{p_{1}}^{p_{2}}√{square root over (dx^{2}+dy^{2}+dz^{2})} Equation (42)
where s is the shortest path between p_{1} and p_{2} on the 3D surface, and dx, dy, and dz are the differential units along the x, y, and z axes, respectively, as shown in
However, computing the geometric shortest path on a polyhedral surface is a classical NP-hard problem. Our approach is to approximate the solution by mapping the problem to a graph shortest path and solving it using Dijkstra's algorithm (see E. W. Dijkstra, “A Note on Two Problems in Connexion with Graphs”, Numerische Mathematik 1, 269-271, 1959) (herein incorporated by reference).
In the discrete case, it can be assumed that the 3D facial surface is discretized into a triangular mesh with N vertices, {x_{i}}, i=1, . . . , N. The geodesic distance D_{g}(x_{1},x_{2}) between vertices x_{1} and x_{2} can be approximated as the length of the shortest piecewise linear path on the mesh surface, that is,
D_{g}(x_{1},x_{2})≈min_{P(x_{1},x_{2})}L(P(x_{1},x_{2})) Equation (43)
where P(x_{1},x_{2}) is a path from x_{1} to x_{2} defined as an ordered sequence of adjacent vertices and L(•) is the length of the path computed as the sum of its edge lengths.
In order to increase the accuracy of the graph approximation, Steiner points may be added on each edge as additional vertices. P(x_{1},x_{2}) is determined using Dijkstra's algorithm. An initial vertex sequence list {x_{1}} is created that contains only the source vertex x_{1}, and its distance to the source is set to zero. The distances from the source x_{1} to all other vertices not in the list are set to infinity. Assume x_{i} is a vertex on the path whose shortest distance path P(x_{1},x_{i}) has already been found, and let x_{j} be a vertex in the neighborhood of x_{i}. The distance between x_{1} and x_{j} is updated as
min{L(P(x_{1},x_{i}))+L(P(x_{i},x_{j})),L(P(x_{1},x_{j}))} Equation (44)
where L(P(x_{i},x_{j})) is the Euclidean distance between the neighboring vertices x_{i} and x_{j}. The path sequence is updated until the destination vertex x_{2} is added. The length of the final path is returned as the geodesic distance between x_{1} and x_{2}.
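The graph search above can be sketched with a standard Dijkstra implementation over a mesh edge list (Steiner-point insertion omitted; the edge list and helper name are illustrative):

```python
import heapq
from collections import defaultdict

def geodesic_distance(edges, src, dst):
    """Approximate geodesic distance as a graph shortest path (Dijkstra).

    edges: iterable of (i, j, length) undirected mesh edges with Euclidean lengths.
    Returns the length of the shortest piecewise-linear path src -> dst.
    """
    graph = defaultdict(list)
    for i, j, w in edges:
        graph[i].append((j, w))
        graph[j].append((i, w))
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d                     # destination settled: distance is final
        if d > dist.get(u, float("inf")):
            continue                     # stale queue entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd             # relaxation step of Equation (44)
                heapq.heappush(heap, (nd, v))
    return float("inf")

# Toy triangle: the direct edge 0-2 is shorter than going around via vertex 1.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.5)]
```

On a real face mesh, the vertices would be the triangulated 3D surface points and each edge weight the Euclidean length of that mesh edge.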
Example
Since the geodesic distance is determined by the positions of the feature points and the face shape from the 3D face data, using geodesic distances in the decision rule is a more robust matching criterion than using the face landmark positions alone.
When the geodesic distances in the decision rule are used, the observation vectors are replaced with the geodesic distances in the gallery and the probe models that are described in the previous section. Now, the probe observation vector, {right arrow over (R)}, that was described in Equation (35), contains N geodesic distances:
{right arrow over (R)}=[{tilde over (D)}_{g1},{tilde over (D)}_{g2}, . . . {tilde over (D)}_{gN}]. Equation (45)
A geodesic distance D_{g} can be calculated between any pair of facial landmarks. Since there are seven landmarks, there can be at most 21 pairs of landmarks; therefore the maximum value of N is 21. To reduce computational time and select reliable geodesic distances, certain pair combinations of geodesic distances are utilized. For example, geodesic distances between the following landmarks can be used: nose to left eye center (D_{g1}), nose to right eye center (D_{g2}), and nose to mouth center (D_{g3}).
The additive white Gaussian noise (AWGN) model can be used to relate the geodesic distances from the probe model and the gallery model. By substituting D_{g} into Equations (36)-(40), the following decision rule is created:
m*=arg min_{m}Σ_{i=1}^{N}({tilde over (D)}_{gi}−D_{gi})^{2} Equation (46)
where {tilde over (D)}_{gi}, i=1, . . . N, is the geodesic distance between a pair of the facial landmarks in the probe, D_{gi}, i=1, . . . N, is the geodesic distance between a pair of the facial landmarks in the gallery, N is the number of the geodesic distances.
In summary, a decision rule is used to calculate the minimum value according to Equation (46) between the test subject in the probe and M subjects in the gallery. The subject m that results in the minimum value in Equation (46) is claimed as the right match to the test subject.
Example
The following test examples and results help to further explain the invention. However, the examples are merely illustrative, and the invention is not limited to these examples.
Experiments are performed on the dataset provided in the publication by B. Y. Li, A. S. Mian, W. Liu and A. Krishna, entitled “Using Kinect for face recognition under varying poses, expressions, illumination and disguise”, IEEE Workshop on Applications of Computer Vision, Tampa, Fla. (2013), herein incorporated by reference. The depth images of the frontal, left (30 degrees), and right (−30 degrees) poses are used as the gallery images to emulate three depth images of three poses acquired by the multiple Kinect sensor system. For the probe image, another arbitrary pose is chosen to emulate the uncontrolled environment. The recognition results, as matching scores of 10 cross-compared faces, are shown in the table in
The matching scores are calculated using Equation (46), where N is the number of geodesic distances. In this example, N is 3; that is, three geodesic distances are used: between the corners of the right eye (D_{g1}), between the corners of the mouth (D_{g2}), and between the nose tip and the left eye inner corner (D_{g3}), as shown in the example in
A low matching score represents a correct match. The diagonal elements represent the matching pairs and the non-diagonal elements represent the non-matching pairs. The results show that the matching pairs have lower scores and the non-matching pairs have higher scores.
Although various preferred embodiments of the present invention have been described herein in detail to provide for complete and clear disclosure, it will be appreciated by those skilled in the art that variations may be made thereto without departing from the spirit of the invention. It should be emphasized that the abovedescribed embodiments are merely possible examples of implementations. Many variations and modifications may be made to the abovedescribed embodiments. All such modifications and variations are intended to be included herein within the scope of the disclosure and protected by the following claims. The invention has been described in detail with particular reference to certain preferred embodiments thereof, but it will be understood that variations and modifications can be effected within the spirit and scope of the invention.
As used herein, the term “optimal” means most desirable or satisfactory result for an application or applications.
As used herein the terminology “geodesic” means the shortest line between two points that lies in a given surface.
As used herein, the center or centers of the corners of the eyes is determined by determining the midpoint of a line between the corners of the eyes.
As used herein, the terminology “transform,” “transforming,” or “transformation” is to be interpreted as defined in Merriam Webster Dictionary: “the operation of changing (as by rotation or mapping) of one configuration or expression into another in accordance with a mathematical rule; especially: a change of variables or coordinates in which a function of new variables or coordinates is substituted for each original variable or coordinate . . . the formula that effects a transformation.”
As used herein, the terminology “estimating” or “estimate” means to roughly calculate or judge the value, number, quantity, or extent of, calculate roughly; approximate.
As used herein, the terminology “alert” means a message or signal, such as a quick notice of unusual and potentially dangerous or difficult circumstances or of possible danger; a warning of a danger, threat, or problem, typically with the intention of having it avoided or dealt with by those persons to whom the alert is sent.
The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the claims.
Claims
1. A method of facial recognition comprising:
 inputting image data representing a plurality of images from a database; the database comprising images of people wherein the location of the three dimensional facial features is defined relative to a predetermined common plane;
 inputting an image of a subject to be identified;
 locating predetermined three-dimensional facial features in the image of the subject for comparison to the image data from the database;
 estimating three-dimensional facial feature location coordinates of the subject's head in the image of the subject;
 obtaining the three-dimensional facial feature location coordinates and orientation parameters in a coordinate system in which the facial features are located in the predetermined common plane;
 comparing the location of the coordinates of the subject to the locations of the coordinates of the images of people in the database relative to the predetermined common plane; and
 determining the identity of the subject.
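Purely as an illustration of the kind of pipeline claim 1 recites, the steps above can be sketched with five 3D landmarks (eye corners, nose tip, mouth corners) that are transformed into a common pose-free coordinate frame and then matched by nearest neighbor. The landmark ordering, the frame construction, and the Euclidean matcher are assumptions of this sketch, not details taken from the claims, which leave the pose-estimation method and comparison metric open:

```python
import numpy as np

# Landmark order assumed by this sketch (not specified by the claims):
# 0,1 = outer eye corners, 2 = nose tip, 3,4 = mouth corners.

def normalize(pts):
    """Map 3D landmarks into a pose-free frame: origin at the mouth-corner
    midpoint, +y along the mouth-midpoint-to-eye-midpoint line, and the
    eye-corner direction (orthogonalized against y) along +x.  The x = 0
    plane then plays the role of the claims' predetermined common plane."""
    pts = np.asarray(pts, float)
    eye_mid = 0.5 * (pts[0] + pts[1])
    mouth_mid = 0.5 * (pts[3] + pts[4])
    y = eye_mid - mouth_mid
    y /= np.linalg.norm(y)                 # vertical axis of the face
    v = pts[1] - pts[0]
    x = v - (v @ y) * y                    # eye-corner direction, orthogonal to y
    x /= np.linalg.norm(x)
    z = np.cross(x, y)                     # completes a right-handed frame
    frame = np.stack([x, y, z])            # rows are the new axes
    return (pts - mouth_mid) @ frame.T

def identify(probe, gallery):
    """Nearest-neighbor match on pose-normalized landmark coordinates."""
    p = normalize(probe)
    return min(gallery, key=lambda name: np.linalg.norm(normalize(gallery[name]) - p))
```

Because the frame is built from the landmarks themselves, `normalize` is invariant to any rigid motion of the head, which is what makes the gallery comparison pose-independent in this sketch.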
2. The method of claim 1 wherein the facial features are eyes, nose and mouth of a subject and wherein the coordinates of the location of the facial features are defined as the location of eye corners, nose tip and mouth corners; and wherein the predetermined common plane is a vertical plane passing through the midpoints of the eye corners and mouth corners; and wherein the orientation parameters correlate to the yaw, pitch, and roll of the subject's head.
3. The method of claim 1 wherein the step of obtaining the three-dimensional facial feature location coordinates and orientation parameters comprises estimating the three-dimensional orientation of the subject's head; and wherein the predetermined common plane comprises the midpoints of the corners of the eyes and corners of the mouth.
4. The method of claim 2 wherein the yaw, pitch and roll of each head in the database of images of people is approximately the same since all facial features are specified relative to a predetermined common plane comprising the centers of the eye corners and mouth corners; and wherein the centers of the eye corners and mouth corners are located in a vertical plane.
5. The method of claim 1 wherein a plurality of geodesic distances between facial coordinates of the subject are determined and matched against the geodesic distances between facial coordinates of the image data in the database.
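The geodesic distances of claim 5 are distances measured along the facial surface rather than straight-line chords between landmarks. One common approximation, assumed here for illustration only since the claims do not prescribe an algorithm, is a shortest-path search (Dijkstra) over the edge graph of a triangle mesh such as the surface meshes of claim 6:

```python
import heapq
import numpy as np

def mesh_edges(verts, faces):
    """Build an adjacency map vertex -> {neighbor: edge length} from triangle faces."""
    adj = {i: {} for i in range(len(verts))}
    for f in faces:
        for i in range(3):
            a, b = f[i], f[(i + 1) % 3]
            d = float(np.linalg.norm(verts[a] - verts[b]))
            adj[a][b] = d
            adj[b][a] = d
    return adj

def dijkstra(adj, src):
    """Single-source shortest path distances over the weighted edge graph."""
    dist = {v: float("inf") for v in adj}
    dist[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale queue entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def landmark_geodesics(verts, faces, landmarks):
    """Pairwise approximate geodesic distance matrix between landmark vertices."""
    adj = mesh_edges(np.asarray(verts, float), faces)
    n = len(landmarks)
    G = np.zeros((n, n))
    for i, s in enumerate(landmarks):
        dist = dijkstra(adj, s)
        for j, t in enumerate(landmarks):
            G[i, j] = dist[t]
    return G
```

Edge-path distances slightly overestimate true surface geodesics (paths are restricted to mesh edges), but the resulting distance matrix can be matched against the gallery's stored distances in the same way the claim describes.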
6. The method of claim 1 wherein the step of inputting an image of a subject to be identified comprises generating point clouds and surface meshes depicting the subject's head using at least one three-dimensional image sensor.
7. The method of claim 1 wherein the step of obtaining three-dimensional facial feature location coordinates and orientation parameters comprises estimating the three-dimensional orientation and scale of the subject's head; and further comprises transforming the three-dimensional facial feature location coordinates of the corners of the eyes, nose tip, and corners of the mouth of the subject into corresponding coordinates relative to a vertical plane containing the midpoints of the corners of the eyes and corners of the mouth using a Gaussian Least Squares Differential Correction (GLSDC) minimization procedure.
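GLSDC is the classical Gauss-Newton differential-correction loop: linearize the measurement model about the current parameter estimate, solve the resulting linear least-squares problem for a correction, and iterate. A generic sketch for recovering yaw, pitch, roll, and translation of a rigid landmark template is given below; the Euler-angle convention and the numerical Jacobian are choices of this illustration, not details taken from the patent:

```python
import numpy as np

def rotation(yaw, pitch, roll):
    """Rotation from yaw (about y), pitch (about x), roll (about z), applied Rz @ Rx @ Ry."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])
    return Rz @ Rx @ Ry

def residual(params, template, observed):
    """Stacked residual between the posed template and the observed landmarks."""
    yaw, pitch, roll = params[:3]
    t = params[3:]
    pred = template @ rotation(yaw, pitch, roll).T + t
    return (pred - observed).ravel()

def glsdc(template, observed, p0=None, iters=25, eps=1e-6):
    """Gauss-Newton differential-correction loop with a forward-difference Jacobian."""
    p = np.zeros(6) if p0 is None else np.asarray(p0, float)
    for _ in range(iters):
        r = residual(p, template, observed)
        J = np.empty((r.size, p.size))
        for k in range(p.size):            # numerical Jacobian, one column per parameter
            dp = np.zeros(p.size)
            dp[k] = eps
            J[:, k] = (residual(p + dp, template, observed) - r) / eps
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)
        p = p + step
        if np.linalg.norm(step) < 1e-12:   # converged
            break
    return p
```

The same iteration structure carries over when the unknowns also include a scale factor, as in claim 7; only the parameter vector and the residual model change.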
8. The method of claim 1 wherein the predetermined common plane is a vertical plane comprising the midpoints of the corners of the eyes and corners of the mouth.
9. A method of facial recognition comprising:
 for each image of a person in a database: locating predetermined three-dimensional facial features in the image of the person; estimating three-dimensional facial feature location coordinates; obtaining the three-dimensional facial feature location coordinates and orientation parameters relative to a coordinate system defined by the location of the facial features relative to a predetermined common plane;
 inputting image data representing a plurality of images from the database;
 inputting an image of a subject to be identified;
 locating predetermined three-dimensional facial features in the image of the subject for comparison to the image data from the database;
 estimating three-dimensional facial feature location coordinates of the subject's head in the image of the subject;
 obtaining the three-dimensional facial feature location coordinates and orientation parameters in a coordinate system in which the facial features are located in the predetermined common plane;
 comparing the location of the coordinates of the subject to the locations of the coordinates of the images of people in the database; and
 determining the identity of the subject.
10. The method of claim 9 wherein the step of obtaining the three-dimensional facial feature location coordinates and orientation parameters comprises using a Gaussian Least Squares Differential Correction (GLSDC) minimization procedure.
11. The method of claim 9 further comprising determining the geodesic distances between at least two of the facial features; and wherein the step of comparing the location of the coordinates of the subject to the locations of the coordinates of the images of people in the database comprises comparing the geodesic distances between at least two of the facial features.
12. The method of claim 1 wherein the inputting of an image of a subject to be identified comprises combining a plurality of images to form an image of the subject.
13. A system for facial recognition comprising:
 at least one processor;
 at least one input operatively connected to at least one processor and configured to input at least one image of a subject to be identified;
 a database configured to store three-dimensional facial image data comprising facial feature coordinates in a predetermined common plane;
 the at least one processor configured to locate predetermined three-dimensional facial features in the image of the subject, estimate three-dimensional facial feature location coordinates of the subject's head in the image of the subject, and obtain the three-dimensional facial feature location coordinates and orientation parameters relative to a predetermined common plane; and compare the locations of the facial feature coordinates of the subject to the three-dimensional facial image data in the database relative to the predetermined common plane;
 whereby recognition, comparison and/or likeness of the facial images is determined by comparing the predetermined common plane facial feature coordinates of the subject to the predetermined common plane facial feature coordinates of the images in the database.
14. The system of claim 13 wherein the facial features are eyes, nose and mouth of a subject and wherein the facial feature coordinates are the location of eye corners, nose tip and mouth corners; and wherein the predetermined common plane is a vertical plane containing the centers of the eye corners and mouth corners; and wherein the orientation parameters correlate to the yaw, pitch, and roll of the subject's head.
15. The system of claim 13 wherein the predetermined common plane is a vertical plane containing the centers of the eye corners and mouth corners.
16. The system of claim 13 wherein the database is configured to store a plurality of geodesic distances between facial feature coordinates of the images of people in the database and the at least one processor determines geodesic distances between facial feature coordinates of the subject and matches the geodesic distances against the geodesic distances between facial coordinates of the images of people in the database.
17. The system of claim 13 further comprising at least one three-dimensional image sensor configured to generate point clouds and surface meshes depicting the subject's head.
18. The system of claim 13 wherein the at least one processor is configured to obtain the three-dimensional facial feature location coordinates and orientation parameters in a coordinate system in which the facial features are located in the predetermined common plane using a Gaussian Least Squares Differential Correction (GLSDC) minimization procedure.
19. A method of facial recognition comprising:
 inputting image data representing a plurality of images from a database; the database comprising images of people wherein the location of the facial features is defined relative to a predetermined common plane;
 inputting a plurality of images of a subject to be identified from at least one image sensor;
 locating predetermined three-dimensional facial features in the image of the subject for comparison to the image data from the database;
 estimating three-dimensional facial feature location coordinates of the subject's head in the image of the subject;
 obtaining the three-dimensional facial feature location coordinates and orientation parameters in a coordinate system in which the facial features are located in the predetermined common plane;
 comparing the location of the coordinates of the subject to the locations of the coordinates of the images of people in the database relative to the predetermined common plane; and
 determining the identity of the subject.
20. The method of claim 19 wherein the method is used for one of authentication and surveillance and further comprising:
 selectively activating a device based upon the likeness or resemblance of the subject to images of people in the database; and wherein the step of obtaining the three-dimensional facial feature location coordinates and orientation parameters comprises using a Gaussian Least Squares Differential Correction (GLSDC) minimization procedure.
Type: Application
Filed: Jul 11, 2017
Publication Date: Jan 4, 2018
Patent Grant number: 9959455
Applicant: U.S. Army Research Laboratory ATTN: RDRL-LOC-I (Adelphi, MD)
Inventors: Shiqiong Susan Young (Bethesda, MD), Jinwei Ye (Elkton, MD)
Application Number: 15/647,238