Development and evaluation of wireless 3D video conference system using decision tree and behavior network
© Sung and Cho; licensee Springer. 2012
Received: 30 August 2011
Accepted: 18 February 2012
Published: 18 February 2012
Video conferencing is a communication technology that allows multiple users to communicate with each other by both images and sound signals. As the performance of wireless network has improved, the data are transmitted in real time to mobile devices with the wireless network. However, there is the limit of the amount of the data to be transmitted. Therefore it is essential to devise a method to reduce data traffic. There are two general methods to reduce data rates: extraction of the user's image shape and the use of virtual humans in video conferencing. However, data rates in a wireless network remain high even if only the user's image shape is transferred. With the latter method, the virtual human may express a user's movement erroneously with insufficient information of body language or gestures. Hence, to conduct a video conference on a wireless network, a method to compensate for such erroneous actions is required. In this article, a virtual human-based video conference framework is proposed. To reduce data traffic, only the user's pose data are extracted from photographed images using an improved binary decision tree, after which they are transmitted to other users by using the markup language. Moreover, a virtual human executes behaviors to express a user's movement accurately by an improved behavior network according to the transmitted pose data. In an experiment, the proposed method is implemented in a mobile device. A 3-min video conference between two users was then analyzed, and the video conferencing process was described. Photographed images were converted into text-based markup language. Therefore, the transmitted amount of data could effectively be reduced. By using an improved decision tree, the user's pose can be estimated by an average of 5.1 comparisons among 63 photographed images carried out four times a second. An improved behavior network makes virtual human to execute diverse behaviors.
Video conferencing has widely been used in public organizations and private companies. However, communication problems due to increased data traffic may occur if many users are connected simultaneously . Hence, one strategy to ensure that many users are connected at the same time is to reduce the amount of data traffic.
There are at least two approaches to reduce data traffic in video conferencing. One is to extract the shape of the user when images are captured [2–4]. The shapes extracted from multiple users are then arranged in a three-dimensional (3D) virtual environment to reconstruct a virtual conference space. A user is readily identified because actual human images are shown, as in this study . However, data sent by one user are delivered to multiple other users at the same time. Hence, if more users are connected, the data traffic increases accordingly. Given that multiple images are transmitted in real time, there would be too much data to transmit on a wireless network.
Another approach to reduce the data traffic is to extract and send the physical location and features of a user and reconstruct it in the virtual environment [1, 5]. This approach is advantageous in that it expresses the gestures and body language of users by using their physical location and features . For example, there is a method that provides distance consulting services by calculating a depth map on an image to represent a speaker in three dimensions . In other studies, a speaker's body position has been used to control virtual humans by using a body-tracking system that recognizes skin color in 2D images . However, in these studies, it was difficult to express the movements of a virtual character with only partial data on the speaker's body. Hence, the positions of unavailable body parts were estimated with inverse kinematics. This makes it difficult to express a speaker's motions precisely.
In studies on data reduction in video conferencing, a common problem is that of low quality of service (QoS). In particular, when multiple users are connected at the same time, the amount of data that a user can transmit simultaneously on a wireless network becomes relatively small compared to that on a physically wired network. Hence, it is necessary to improve the QoS of video conferencing.
In this article, a framework that enables video conferencing by multiple users on a wireless network is proposed. To reduce data traffic, a user's pose is first recognized through a binary decision tree and transmitted by using the markup language. Next, a method based on a behavior network is introduced to express the movements of the virtual human precisely. Subsequently, the proposed method is implemented in an experiment and verified in a mobile device. The proposed method involves multiple users communicating with each other using a mobile device. Therefore, it is applicable to various forms of communication such as chatting, gaming, and video conferencing.
The rest of the article is organized as follows. In Section 2, we introduce a method to reduce data traffic in video conferencing. In Section 3, we propose a video conferencing framework. In Section 4, we describe a series of processes to implement the proposed framework in a mobile device and control a virtual human. In Section 5, we summarize the proposed method and discuss future directions for research.
2. Related study
In a video conference, the amount of photographed images increases in proportion to the number of connected users. Therefore, even if the images are compressed, all images cannot be transmitted on a wireless network. To make wireless video conferencing possible, it is necessary to solve data traffic in advance. In this section, we introduce studies on video conferencing, and examine research results that could be adopted to reduce the amount of data.
The following are studies in which the aim was to extract a user's shape from an image to reduce data traffic. The Virtual Team User Environment (VIRTUE) provides an environment in which multiple users can communicate with each other at the same time . A virtual conference space was constructed and integrated with a 3D environment after receiving all users' shape images. The position of the virtual human in the virtual environment was calculated through the photographed images. In other studies, methods to improve the speed of the VIRTUE have been proposed as well [3, 4]. Although not directly related to video conferencing, there have been some studies on the reconstruction of 3D shape [6, 7]. In these studies, reconstruction was performed as follows. First, the background was removed from images that were photographed with multiple cameras. The objects were then extracted from the images and the 3D shape created. Lastly, the 3D shape was colored. However, to apply the photographed images and virtual environment to a wireless network at the same time, further data reduction is necessary.
Given that images increase data traffic, there have been a number of studies on the reconstruction of video conferences by extracting and transmitting only a user's features from photographed images. For example, it has been shown that medical advice can be obtained through a telemedicine system to perform an operation . Here, a depth-map was extracted from the photographed images, and then, a distance user was represented. In other studies on virtual humans for video conferences, body features were extracted from photographed images . After locating the body positions by identifying the hands and face, the body positions were transmitted. Then, the virtual human gestures by referencing the transmitted body positions.
However, it is difficult to express exact gestures due to insufficient data on body features. Further data are required to make the virtual human act naturally. Facial images and body features can also be extracted from photographed images. The extracted facial images are mapped onto the face of the virtual human after analyzing facial features. By extracting and transmitting only faces, data traffic is reduced. Actual human faces are used in this method. This makes it easier to distinguish real users from virtual humans.
Lastly, there have been studies on the virtual meeting room [8, 9]. In these studies, the photographed images were converted into silhouettes for comparison with pre-defined models . The poses were estimated, and deictic motions were expressed using the silhouette. Data traffic can be reduced because the data are in xml format. However, the problem is to define a model in person using a tool. In addition, it takes time to estimate poses if there are many models. To reduce the amount of estimation, decision tree  can be applied. If silhouettes which should be compared are classified, the number of comparison could be reduced.
If only data traffic is considered, studies in which the actions of a user are expressed through the virtual human by transmitting the user's features are more appropriate in a wireless network environment than the method of extracting the user's image shape. In these studies, however, the data are insufficient to describe a user's body. Therefore, it is necessary to devise a method to express the body more precisely. Moreover, to make virtual human to act naturally, behavior network  could be applied. Behavior network selects the behavior for virtual human considering goals and pre-executed behaviors. Therefore, virtual human can execute behaviors naturally. By defining the behaviors of the virtual human in advance and using behavior network, this article proposes a method in which body motions can be freely expressed even with little data traffic.
3. 3D video conference framework
To express a user's movements by a virtual human, it is necessary to devise a method to extract and transmit a user's features and then reconstruct the virtual conference by virtual humans. This section describes a method to estimate user's pose and control a virtual human with the estimated pose.
Data are determined in definition stage as follows. First, to estimate a user's pose, the necessary images must be defined. This requires the generation of images for a user's expected pose photographed by camera. The generated images are then compared with real-time photographed images of the user to estimate the user's pose. However, pose estimation will be time-consuming if the number of expected poses is excessive. Thus, the pose-estimation time can be reduced by using only a subset of expected pose images by constructing a binary decision tree (referred to as the pose decision tree)
Next, a behavior network is defined to generate the behavior that is executed by a virtual human. At first, an action is defined by a virtual human's joint angles, and a behavior is expressed by its actions. Selected behaviors are executed by virtual human. The pose images, pose decision tree, motions, actions, and consecutive action network are shown in Figure 1. These are all defined in the definition stage. This behavior network is referred to as the consecutive action network.
In the recognition stage, images are created by photographing users at certain intervals. A user's poses are estimated by comparing the photographed images with the pose decision tree. The estimated poses are then transmitted to other users through the network. In the reconstruction stage, a user's presence is expressed in a virtual human by considering the estimated pose.
3.2. Framework structure
The recognition stage converts the estimated pose into markup language, which is transmitted to the network as follows. The photographed images are received by the image receiver and sent to the background learner and silhouette extractor. Background learner acquires backgrounds when the user is absent and then transfers the background image to the silhouette extractor. Subsequently, the silhouette extractor extracts the shapes of users from the received images by considering the background images and transmits them to the pose estimator. The pose estimator searches the pose decision tree and estimates the poses of the received images. The estimated poses are then transmitted to the network through the message generator and message sender (the former creates messages and the latter transmits them to other users). Each message contains a user's pose and speech.
The reconstruction stage then creates the image and voice from the received messages as follows. The message receiver transmits the pose and speech to the behavior planner and speech generator, respectively. The behavior planner plans the behaviors to be executed by the virtual human. The virtual human controller then executes the planned behaviors.
3.2.1. Image receiver and silhouette extractor
3.2.2. Pose decision tree and pose estimator
The pose estimator, which estimates poses with the extracted silhouette in the recognition stage, must recognize multiple poses in real time in a mobile environment. However, the time to estimate poses increases with the number of poses because the number of comparisons also increases. To solve this problem, we propose a pose decision tree.
where and are the left and right nodes of node n i , respectively; and m i and v i are the matching value and center value of node n i (range 0 to 1), respectively. The matching value indicates the similarity of two silhouettes. For example, if its value is 1, the silhouettes are considered identical only when they have exactly the same images. In contrast, if its value is 0, the silhouettes are considered identical regardless of their differences. The matching value is determined to estimate the pose by establishing various values. The center value, which expresses a standard based on a search of the left and right child nodes, is automatically established when the pose decision tree is constructed. As shown in Equation (6), there is a one-to-one relation between pose r i and node n i .
Fifth, the mean of the correlation coefficients between the last node on the left and the first node on the right is set to the center value of the root node. Lastly, both left and right nodes sort the child nodes through repetitive comparisons just as in the case of the root node.
In the recognition stage, the pose decision tree is used as follows. The user-silhouette is compared to the silhouette of the root node. If the correlation coefficient of two silhouettes is equal to or greater than εPoseMatching, the index of the root node is transmitted to the message generator. Otherwise, the user-silhouette is compared to left child node of the root node. If it is greater than the center value, the user-silhouette is compared to the right child node. Therefore, the comparison of nodes continues until the correlation coefficient of the two silhouettes is over εPoseMatching or the terminal node is reached. The index of the node that is ultimately reached is also transmitted to the message generator.
3.2.3. Action, consecutive action network and behavior
Next, all the actions of Set A are placed, and the sequences of consecutive actions are expressed as a tree by using directed acyclic graph (DAG). The action nodes which contain one action of Set A are primarily defined, and it is placed between start-poses and goal-poses, as shown in Figure 7. After action node is arranged, then DAG connects each action node. To prevent tree's containing any loop, an action node which has identical action, is repeatedly defined like action a1 of Figure 7.
Next, the transition probability of all DAG is defined. The sums of probabilities transit from each action to another action are normalized to be 100.
where m k is a action node that contains a j . s k and g k represent the index of start-pose and goal-pose connected with action node m k . o k x means the other action nodes to which action node m k is connected in x th. q k y is the probability of transition to o k x . The network composing of action nodes is defined as the consecutive action network.
The defined consecutive action network is used when an action is selected in the reconstruction stage. By using the pose index received in time t - 1 as start-pose index and the pose index received in time t as goal-pose index, behavior planner generates consecutive actions as follows. First, among the several numbers of start-poses, the pose in the pose index received in just time t - 1 is selected. Next, among the several numbers of goal-poses, the pose in pose index received in just time t is selected. Next, all action nodes and connections which is movable from the selected start-pose to the selected goal-pose, is selected. Next, the transition probability for the selected nodes besides each action node is normalized up to be 100. Next, among the selected connections, one connection is selected through the probability. If the only one action node is connected it can be directly selected. The selection of connections is repeatedly processed until the action node which is connected with the goal-pose is reached. Finally, the consecutive actions are constructed by connecting the actions in all the visiting action nodes, and are defined as a behavior. The defined behavior is then executed.
For example, behavior is generated as shown in Figure 8 when pose index 2 is received at time t - 1 and time t from Figure 7. Each m4 and m2 is activated by connecting to start-pose and to goal-pose. The other activated two action nodes, m7 and m, exist in the connection from m4 to m6. From the consecutive action network, action nodes according to the connections from m4 are visited as the connection of m4·m7·m6 or m4·m·m6. Therefore, the behaviors having the orders of a2a3a2 or a2a a2 are generated and executed.
To verify the proposed virtual human-based video conference framework, we carried out an experiment by establishing a conference system in iPad. We introduced a method to define the pose decision tree and behaviors for a video conference. We then verified this series of processes.
4.1. Verification of definition stage
The proposed framework requires the definition of a pose decision tree and a consecutive action network. The pose decision tree requires expected poses and silhouettes for the recognition stage, and the consecutive action network requires actions and behaviors for the reconstruction stage.
4.1.1. Expected poses, silhouettes, and pose decision tree
Each child node of the root node also constructs child nodes in the manner of the root node. The remaining nodes in each child node were also equally processed and constructed. Therefore, a pose decision tree constructed into a full binary tree is six levels high. In the pose decision tree in Figure 11, the correlation coefficient EMatchRate was set to 0.93. The numerical values on each image represent the matching and center values. For example, the root node has a matching value of 0.93 and a center value of 0.640616.
After defining the pose decision tree, in recognition stage photographed images are compared with the pose of the root node. If the match rate was over 0.93, the pose of the photographed image was estimated as the pose of the first node. If the correlation coefficient was 0.93 or below and the match rate was 0.640616 or below, the pose was estimated after searching the child nodes on the left side. In contrast, the right child nodes were searched when the match rate was greater than 0.640616.
4.1.2. Actions and consecutive actions
Virtual human's action
-2.57,5.25,-4.12,0.55,3.55,-1.79,-0.44,-3.57,1.76,1.94,-0.71,1.56,-0.00,0.00,0.00,0.00, 0.00,0.00,3.35,-10.17,8.05,-2.85,7.04,-5.65,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00, 0.00,6.29,22.05,-1.40,-6.77,-25.64,0.51,0.00,0.00,0.00,0.00,0.00,0.00,7.25,2.37,23.87,
4.2. Verification of recognition stage
In Figure 13, BMP and JPG represent data rates in BMP and JPG formats, respectively, without removing the background from the images. When transmitted in BMP format, about 83K occurred four times a second (252K/s on average). If the images were compressed in JPG format, data were transmitted at 122K/s on average. To extract the user's shape, the data were transmitted in BMP or JPG format. Given that BMP remains uncompressed even when the background is eliminated, the data rates were identical even though the background was not removed. In contrast, JPB decreased by approximately 16K (57%). However, there is too much data traffic when multiple users attend the conference. In the proposed method, 111 bytes were transmitted on average at a time (i.e., 444 bytes/s). Hence, data rates were significantly less than when images were transmitted or when user's shape was extracted.
4.3. Verification of reconstruction stage
In this article, a framework to display a user as a virtual human in a 3D virtual conference by using pose decision tree and consecutive action network is proposed. The framework consists of a recognition stage that generates messages based on the pose decision tree and a reconstruction stage that controls a virtual human with the consecutive action network.
The recognition stage includes an image receiver, background learner, silhouette extractor, pose estimator, message generator, and message sender. The image receiver transmits the user's images to the silhouette extractor, which then extracts the user's silhouettes from user's images. The pose estimator estimates the poses of user-silhouettes based on the pose decision tree and sends the pose's index to the message generator. After converting the index of the received pose into markup language, the message generator then delivers the messages to the message sender.
The reconstruction stage consists of a message receiver, behavior planner, and virtual human controller. The message receiver sends the index of the pose to the behavior planner, which in turn generates a behavior to be executed by the 3D virtual human by using a consecutive action network. The generated behavior is delivered to the virtual human controller. Finally, the virtual human controller controls the joint angle of the virtual human according to the generated behavior.
In an experiment, the proposed framework was verified for the series of processes designed to control the virtual human in a mobile device. In the definition stage, the pose decision tree was constructed by using 63 expected silhouettes. A total of 63 actions were defined. The consecutive action network was constructed on considering the 63 actions. In the recognition stage, images were photographed at 4 fps. The pose most similar to each photographed image was found by using the pose decision tree. In the recognition stage, a pose decision tree was used. As a result, 31 poses could be estimated by 5.1 expected silhouette comparisons on average. In the reconstruction stage, behaviors defined as consecutive actions are generated by consecutive action network. Therefore, even if the same pose index was received, the various behaviors could be generated and executed.
The user's images are converted to markup language by using the index of the pose. This process could allow multiple users to attend a video conference at the same time. Further studies are being planned on a method to generate the user's silhouettes for the pose decision tree as well as the actions and behaviors for consecutive action network automatically.
This article was supported by the research program of Dongguk University.
- Mortlock AN, Machin D, McConnell S, Sheppard P: Virtual conferencing. BT Technol J 1997, 15(4):120-129. 10.1023/A:1018687630541View ArticleGoogle Scholar
- Xu LQ, Lei B, Hendriks E: Computer vision for a 3-D visualization and telepresence collaborative working environment. BT Technol J 2002, 1(1):64-74.View ArticleGoogle Scholar
- Lei BJ, Hendriks EA: Real-time multi-step view reconstruction for a virtual teleconference system. EURASIP J Appl Signal Process 2002, 2002(1):1067-1087. 10.1155/S1110865702206071View ArticleGoogle Scholar
- Lei BJ, Chang C, Hendriks EA: An efficient image-based telepresence system for videoconferencing. IEEE Trans Circ Syst Video Technol 2004, 14(3):335-347. 10.1109/TCSVT.2004.823393View ArticleGoogle Scholar
- Fuchs H, Bishop G, Arthur K, McMillan L, Bajcsy R, Lee SW, Farid H, Kanade T: Virtual space teleconferencing using a sea of cameras. Proceeding of First International Conference on Medical Robotics and Computer Assisted Surgery, Pittsburgh 1994, 2: 161-167.Google Scholar
- Tai LC, Jain R: 3D video generation with multiple perspective camera views. Image Process 1997, 1: 9-12.Google Scholar
- Moezzi S, Tai LC, Gerard P: Virtual view generation for 3D digital video. Multimedia IEEE 1997, 4(1):18-26. 10.1109/93.580392View ArticleGoogle Scholar
- Nijholt A, Welbergen H, Zwiers J: Introducing an embodied virtual presenter agent in a virtual meeting room. Proceedings of the IASTED International Conference on Artificial Intelligence and Applications (AIA 2005), Innsbruck 2005, 579-584.Google Scholar
- Welbergen H, Nijholt A, Reidsma D, Zwiers J: Presenting in virtual worlds: towards an architecture for a 3D presenter explaining 2D-presented information. In International Conference on Intelligent Technologies for Interactive Entertainment, (INTETAIN 2005). Madonna di Campiglio; 2005:203-212.View ArticleGoogle Scholar
- Poppe R, Heylen D, Nijholt A, Poel M: Toward real-time body pose estimation for presenters in meeting environments. In 13th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG 2005). University of West Bohemia, Plzen-Bory; 2005:41-44.Google Scholar
- Leo B, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software, Monterey, CA; 1984.Google Scholar
- Yoon J, Cho S: A mobile intelligent synthetic character with natural behavior generation. 2nd International Conference on Agents and Artificial Intelligence (ICAART 2010), Valencia 2010, 132-133.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.