TaintDroid on Android
Figure 1 gives an overview of the TaintDroid architecture as background for dynamic taint tracking technology. The architecture is based on Enck’s work [5, 12, 13].
Implicit flow and implicit taint
Examining how an implicit flow works gives a clear view, in theory, of the implicit taint that dynamic taint tracking technology fails to capture, and of why it matters for private information. The example code of implicit flow is as follows:
In the code shown in program 1, the program reads a variable x input by the user and then generates msg, which is submitted by the POST method. The variable x is marked as tainted. Under the explicit-flow taint analysis method, msg is assigned a constant from {'a', 'b'} and is marked as untainted. However, the value of msg depends on the true or false outcome of the conditions at lines L3 and L7, that is, msg is control dependent on x (also known as implicit flow pollution). Explicit analysis therefore misses an alarm here: msg should be marked as tainted. The value of uri is not affected by L3 and L7 and is correctly labeled as untainted. The code segment is a security risk because msg is submitted by POST; an attacker who captures msg can infer the value of x input by the user. In program 2, the explicit taint analysis method marks msg at lines L4 and L5 as tainted directly by x, but the value of msg is the constant 'a' in either case, so the analysis raises a false alarm: msg should be marked as untainted.
A judgment (conditional) statement is expressed abstractly as:
$$ S_0 = e : S_1, S_2, \dots, S_n, $$
where e is the conditional expression of S0 and S1, S2…Sn are the assignment statements in its branches. Implicit taint exists when the following three conditions hold:
(1) The conditional expression e of S0 is tainted.
(2) There is control dependence between the judgment statement S0 and the assignment statements S1, S2…Sn.
(3) The same variable is assigned different values in different branches among the assignment statements S1, S2…Sn.
The three conditions above cover two aspects: control flow (condition (2)) and data flow (conditions (1) and (3)). The implicit taint test builds on data flow taint analysis: it diagnoses implicit taint and corrects the taint attribute of the affected variables. A program may contain multiple nested judgment statements (as shown in Fig. 2a). This paper examines the values of variables in all code blocks between a judgment statement and all subsequent assignment statements. If a variable has more than one distinct value before the confluence point (Fig. 2b), it is marked as tainted; if all of its values are the same, it is marked as untainted.
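The marking rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `implicit_taint`, the per-branch dictionaries, and the two sample inputs (loosely modelled on the description of programs 1 and 2, with a made-up `uri` value) are hypothetical and are not taken from the TaintChaser implementation.

```python
def implicit_taint(cond_tainted, branches):
    """Return the variables that are implicitly tainted at the confluence point.

    cond_tainted: True if the conditional expression e of S0 is tainted
                  (condition 1).
    branches:     one dict per branch of S0, mapping variable -> constant
                  assigned in that branch. The branches are assumed to be
                  control dependent on S0 (condition 2).
    """
    if not cond_tainted:
        return set()
    variables = set()
    for assignments in branches:
        variables.update(assignments)
    tainted = set()
    for var in variables:
        # A branch that does not assign var is modelled as the value None.
        values = {assignments.get(var) for assignments in branches}
        if len(values) > 1:  # condition 3: the branches disagree
            tainted.add(var)
    return tainted


# Hypothetical stand-ins for the two programs discussed in the text.
program1 = [{"msg": "a", "uri": "/post"}, {"msg": "b", "uri": "/post"}]
program2 = [{"msg": "a"}, {"msg": "a"}]

result1 = implicit_taint(True, program1)  # {'msg'}: msg differs, uri does not
result2 = implicit_taint(True, program2)  # set(): msg is 'a' on every path
```

Under these assumptions, `msg` in the first input is marked as tainted while `uri` is not, and `msg` in the second input stays untainted because every branch assigns the same constant.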
Principle of dynamic taint tracking technology
The basic principle of dynamic taint tracking technology is shown in Fig. 3. The six lines in the boxes of programs A and B represent the threads running within each program's process. A solid curve represents a thread that does not currently contain tainted data. A dotted line represents a thread in which tainted data exist, and different dotted-line styles represent different types of tainted data.
When the system detects that a program reads the privacy information of users, it marks the data read as tainted. When the program operates on tainted user-sensitive data, the operation is processed accordingly so that the taint follows the privacy data as they propagate. When two programs communicate, the taint likewise follows the data. For example, when a thread of program A (dotted line) sends tainted data to a thread of program B, the taint continues to track the data in the receiving thread. When a program sends tainted data to the outside through a privacy leak point, the behavior is recorded in real time. Finally, the log is analyzed to determine whether the program leaks privacy: if the log shows that tainted data were sent out during execution, the program contains a privacy leak.
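The source, propagation, and sink steps described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Tainted` wrapper, the label name `"IMEI"`, the placeholder device number, and the destination string are hypothetical, and real systems attach taint inside the virtual machine rather than in application-level objects.

```python
class Tainted:
    """A value paired with a set of taint labels (labels are hypothetical)."""

    def __init__(self, value, labels=frozenset()):
        self.value = value
        self.labels = frozenset(labels)

    def __add__(self, other):
        # An operation on data: the result carries the labels of both operands.
        if isinstance(other, Tainted):
            return Tainted(self.value + other.value, self.labels | other.labels)
        return Tainted(self.value + other, self.labels)


leak_log = []  # records written when tainted data leave the system


def read_source(value, label):
    """Taint source: privacy data are marked when they are read."""
    return Tainted(value, {label})


def ipc_send(data):
    """Inter-program communication: the labels travel with the data."""
    return Tainted(data.value, data.labels)


def send_out(data, destination):
    """Leak point: log the event only if the outgoing data are tainted."""
    if data.labels:
        leak_log.append((destination, sorted(data.labels), data.value))


imei = read_source("000000000000000", "IMEI")  # placeholder value
payload = Tainted("id=") + imei                # program A operates on the data
received = ipc_send(payload)                   # program A -> program B
send_out(received, "example.com:80")           # tainted: logged
send_out(Tainted("hello"), "example.com:80")   # untainted: not logged
has_leak = bool(leak_log)
```

In this sketch the log ends up with a single entry for the IMEI-derived payload, so the final log analysis would report a privacy leak.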
The realization of the taint tracking system
Compared with TaintDroid, the TaintChaser system proposed in this paper can monitor more user privacy information: not only IMEI, phone number, location information, photos, and audio, but also communication, SMS, email, and other important information, all of which are treated as taint sources. At the same time, the system achieves a more fine-grained taint tracking mode by tracking each byte of memory content. The system also detects more privacy leak points (socket communication, HTTPS encrypted communication, SMS, and Bluetooth communication), outputs detailed path information for the executed program under test, and supports automated testing of programs [14].
To explain how implicit taint is tracked, we first describe the analysis method. The implicit control flow analysis method based on SSA form has two steps: (1) from the control flow graph (CFG) of the program, find the control dependence between judgment code blocks and assignment statements; and (2) compute the confluence point of the code blocks containing the assignment statements, convert the data flow of the program into SSA form, collect the values of the multiple versions of each variable at the confluence point, and finally determine the taint attribute of each variable from those values.
Implicit taint arises because there is control dependence between assignment statements and judgment statements. The control dependence of a program can be expressed as follows: if the execution of statement S2 determines whether statement S1 executes, then S1 is control dependent on S2. In this paper, we use the dominator tree to find this kind of dependence. First, the CFG of the program is extracted and its dominator tree is calculated; then the control dependence between judgment and assignment statements is obtained from the tree. Definitions are given as follows.
(1) Control dependency
Immediate dominator: for x ≠ y, node x is the immediate dominator of y if x dominates y and there is no other node z such that x dominates z and z dominates y; it is written idom(y).
Dominator tree: the tree contains every node of the CFG, and for each node x there is an edge from idom(x) to x. Because each node other than the entry node has exactly one immediate dominator, these edges form a tree, known as the dominator tree.
This paper computes the dominator tree of the CFG with a method based on the depth-first search tree [15]. The complexity of the method is close to linear time. The dominator tree of the CFG describes the control dependence of the program, as shown in Fig. 2b. Judgment node B2 is the immediate dominator of assignment nodes B3 and B4, and judgment node B4 is the immediate dominator of B5. That is, {B3, B4} are control dependent on B2, and {B5} is control dependent on B4.
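To make the definitions concrete, the following sketch computes immediate dominators for a small CFG. It deliberately uses the simple iterative data-flow formulation rather than the near-linear depth-first-search method cited as [15], and the CFG itself is a hypothetical graph chosen only so that it reproduces the relations stated for B2 to B6; it is not a reconstruction of Fig. 2.

```python
def dominators(cfg, entry):
    """Iteratively compute the dominator set of every node.

    cfg maps each node to the list of its successors; every node must
    appear as a key, and every node is assumed reachable from entry.
    """
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)

    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = set(nodes)
            for p in preds[n]:
                new &= dom[p]        # dominators common to all predecessors
            new.add(n)               # every node dominates itself
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def immediate_dominators(cfg, entry):
    """Derive idom(n) for every node except the entry node."""
    dom = dominators(cfg, entry)
    idom = {}
    for n in cfg:
        if n == entry:
            continue
        strict = dom[n] - {n}
        # Strict dominators form a chain; the closest one is dominated by
        # all the others, so it has the largest dominator set.
        idom[n] = max(strict, key=lambda d: len(dom[d]))
    return idom


# Hypothetical CFG: B2 and B4 are judgment blocks, all paths merge at B6.
cfg = {
    "B1": ["B2"],
    "B2": ["B3", "B4"],
    "B3": ["B6"],
    "B4": ["B5", "B6"],
    "B5": ["B6"],
    "B6": [],
}
idom = immediate_dominators(cfg, "B1")
```

For this graph, `idom` maps B3 and B4 to B2 and B5 to B4, which matches the control dependence described in the text, and maps the merge block B6 to B2.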
(2) Path convergence criterion
Implicit taint is caused by control dependence together with assignment differences among the branches of a judgment statement; these differences affect the variables after the branches merge. In the sample code in Fig. 2a, judgment node B4 is nested in judgment node B2, so considering B4 alone cannot correctly capture the effect on variable assignments; in fact, the paths of {B2, B4} converge at B6. Therefore, to analyze the effect of a judgment statement on variables, we must examine the confluence of the paths of judgment statements. The path convergence criterion is defined as follows: node n is a confluence point of variable a when all of the following conditions hold.
1) There is a code block x that contains an assignment to a.
2) There is a code block y (y ≠ x) that contains an assignment to a.
3) There is a non-empty path Pxn from x to n.
4) There is a non-empty path Pyn from y to n.
5) Paths Pxn and Pyn have no node in common other than n.
6) Node n does not appear in both Pxn and Pyn before their end, although it may appear in one or the other.
By the above criterion, there is a one-to-one mapping between the assignments of variable a and the paths. If the confluence point n has several predecessors, there are correspondingly several possible assignments of variable a.
(3) Discovery of path convergence
Applying the convergence criterion directly to discover confluence points is impractical, because it requires traversing every path from nodes x and y to the confluence point n. Instead, the dominator tree is used to determine the dominance frontier of the code blocks of a judgment statement, from which the confluence point of the judgment statement can be found efficiently.
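A dominance frontier can be computed directly from the immediate dominators, without enumerating paths. The sketch below uses a commonly taught formulation that walks up the dominator tree from each predecessor of a merge block; it is not claimed to be the procedure used in this paper. The predecessor lists and the `idom` map describe a hypothetical CFG consistent with the relations stated for B2 to B6.

```python
def dominance_frontier(preds, idom):
    """Compute the dominance frontier of every node.

    preds maps each node to the list of its CFG predecessors;
    idom maps each non-entry node to its immediate dominator.
    """
    df = {n: set() for n in preds}
    for block, block_preds in preds.items():
        if len(block_preds) < 2:
            continue                      # only merge blocks can be frontiers
        for p in block_preds:
            runner = p
            # Walk up the dominator tree until reaching idom(block).
            while runner != idom[block]:
                df[runner].add(block)
                runner = idom[runner]
    return df


# Hypothetical CFG (as predecessor lists) and its immediate dominators.
preds = {
    "B1": [],
    "B2": ["B1"],
    "B3": ["B2"],
    "B4": ["B2"],
    "B5": ["B4"],
    "B6": ["B3", "B4", "B5"],
}
idom = {"B2": "B1", "B3": "B2", "B4": "B2", "B5": "B4", "B6": "B2"}

df = dominance_frontier(preds, idom)
# Confluence point of the branch blocks that depend on B2 and B4.
confluence = df["B3"] | df["B4"] | df["B5"]
```

In this example the frontier of every branch block is {B6}, so B6 is found as the confluence point of the nested judgments without traversing the individual paths.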
(4) Virtual value function
The values of variables at a confluence point are analyzed using the static single assignment (SSA) form of the program. SSA is an intermediate representation in which each variable has exactly one assignment in the program text; because that assignment may be located in a loop and be executed dynamically many times, the form is called static single assignment. In SSA form, each assignment to the same variable is renamed to a different version of the variable, and where two control flows merge in the control flow graph a virtual value function φ is inserted. For example, at node n the function φ(a1, a2) is inserted for variable a to distinguish its multiple assignments. The virtual value function has the following characteristics: its number of parameters equals the number of possible values of the variable, and each parameter corresponds to a particular control flow predecessor. Given these characteristics, all values of a variable at the confluence point can be obtained from the parameters of the virtual value function. The calculation starts from the CFG of the program and consists of inserting the φ functions and computing their parameters [16]. The virtual value function φ in SSA thus lets us judge whether the values of a variable are the same, and so determine its taint attribute [17]. With this method in place, we now return to taint tracking and TaintChaser.
Taint data
The TaintChaser system can detect most user privacy information: identification numbers of mobile phone equipment, phone numbers, location information, e-mail, contacts, SMS, schedule, and browser history. Because these kinds of privacy information are obtained in different ways, they must be taint marked separately. The identification number of the mobile phone equipment, the phone number, and the location information are supplied by the related service processes of the Android system [5, 17], so the taint marking is handled in the service process. The taint marking process is shown in Fig. 4. When reading data, the program acts as a client and sends a data request through Binder to the corresponding service process; the server side acquires the privacy data; the TaintChaser system marks these data as tainted, ensuring that the data the process receives carry taint labels.
Email, contacts, SMS, schedule, and browser history are stored in databases. To taint mark these data, the database read operations are processed. In general, reading data from a database first obtains a database cursor and then uses the cursor to call the relevant function (getString) to read the contents of the database. The taint marking process for user-sensitive information stored in a database is shown in Fig. 5: when the program obtains a cursor on a database that stores user privacy information, the cursor is taint marked.
Later during execution, when the program reads specific content through the tainted cursor, all data read are marked as tainted. In this way, email, short messages, and other privacy information stored in databases are taint marked.
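The cursor-based marking can be pictured with a small model. This is a sketch under stated assumptions, not Android code: the `Cursor` class, the `get_string` method, and the table names in `PRIVATE_TABLES` are hypothetical stand-ins for the real database cursor and its getString call.

```python
# Hypothetical names of tables that hold user privacy information.
PRIVATE_TABLES = {"sms", "contacts", "email"}


class Cursor:
    """Toy database cursor that is taint marked when it is created."""

    def __init__(self, table, rows):
        self.table = table
        self.rows = rows
        self.position = 0
        # The cursor itself is marked if it points at a private table.
        self.tainted = table in PRIVATE_TABLES

    def get_string(self, column):
        """Return (value, taint label); data read via a tainted cursor are tainted."""
        value = self.rows[self.position][column]
        label = self.table if self.tainted else None
        return value, label


sms_cursor = Cursor("sms", [["see you at 8"]])
settings_cursor = Cursor("settings", [["dark"]])

sms_value = sms_cursor.get_string(0)            # ('see you at 8', 'sms')
settings_value = settings_cursor.get_string(0)  # ('dark', None)
```

Marking the cursor once, at the point where it is obtained, means that every later read through it is labelled without inspecting the individual queries.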
Taint storage
Unlike TaintDroid's taint storage, TaintChaser describes taint in a fine-grained way: each byte in memory has a corresponding taint. Tainted data can therefore be tracked more accurately, which greatly reduces the probability of excessive taint diffusion. The storage structure of the taint is shown in Fig. 6; because the Android system is 32 bit, the taint storage table is organized as a two-level address index. This storage method saves memory space and allows fast lookup of the taint of a given memory location [18].
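A two-level, byte-granular taint table can be sketched as follows. The 16/16 split of the 32-bit address, the one-byte tag per memory byte, and all names are assumptions for illustration; the layout actually used in Fig. 6 may differ.

```python
class TaintTable:
    """Byte-granular taint storage indexed in two levels (assumed 16/16 split)."""

    PAGE_BITS = 16
    PAGE_SIZE = 1 << PAGE_BITS

    def __init__(self):
        # First level: high 16 address bits -> second-level page.
        # Pages are allocated lazily, so untainted regions cost no memory.
        self.pages = {}

    def set(self, addr, length, tag):
        """Mark `length` bytes starting at `addr` with a tag in 0..255."""
        for a in range(addr, addr + length):
            high = a >> self.PAGE_BITS
            if high not in self.pages:
                self.pages[high] = bytearray(self.PAGE_SIZE)
            self.pages[high][a & (self.PAGE_SIZE - 1)] = tag

    def get(self, addr):
        """Return the tag of one byte (0 means untainted)."""
        page = self.pages.get(addr >> self.PAGE_BITS)
        if page is None:
            return 0
        return page[addr & (self.PAGE_SIZE - 1)]

    def is_tainted(self, addr, length):
        """Range query, e.g. for a buffer given by address and length."""
        return any(self.get(a) for a in range(addr, addr + length))


table = TaintTable()
table.set(0x1000FFFE, 4, 1)   # four tainted bytes straddling two pages
```

Lazy allocation of second-level pages is what saves memory in this sketch: only the regions that ever held tainted bytes get a page, while a lookup is still two index operations. The range query is the kind of check a privacy leak point would make with the address and length of outgoing data.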
Taint tracking
Android programs are interpreted and executed by the Dalvik virtual machine. The TaintChaser system tracks tainted data by processing all instructions in the Dalvik virtual machine. This involves the following three aspects:
(1) General instructions mainly comprise basic operations on variables (assignment, addition, subtraction, multiplication, division, etc.) and function calls. For basic operations on variables, if any operand contains taint, the result is marked as tainted. In a function call, tainted data may spread through the input parameters and the return value of the function, so the invoke and return instructions need to be processed.
(2) File read and write: file reads and writes are detected. When tainted data are written to a file, the file is marked as tainted; when the contents of a tainted file are read out, the data read are marked as tainted. This is also processed in real time.
(3) Inter-process communication: when inter-process communication is detected and the communication content includes tainted data, the taint is propagated in real time so that it continues to follow the data in the new process. However, because of the JNI mechanism in Android applications, a program can leave the Dalvik virtual machine and enter native C/C++ libraries, where the system cannot carry on normal taint tracking [19, 20]. To keep taint propagation working in this case, the related functions are handled by hooking.
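The propagation rule for general instructions in aspect (1) can be illustrated with a toy interpreter that tracks only taint, not values. The instruction names, the flat register file shared by caller and callee, and the bitmask labels are simplifying assumptions and do not reproduce the Dalvik instruction set.

```python
IMEI = 0b01      # hypothetical taint labels encoded as bit flags
LOCATION = 0b10


def propagate(instructions, taint):
    """Apply taint propagation rules to a list of toy instructions.

    taint maps register name -> bitmask of labels; it is updated in place.
    """
    for ins in instructions:
        op = ins[0]
        if op == "const":            # constant assignment clears the taint
            taint[ins[1]] = 0
        elif op == "move":           # dst takes the taint of src
            taint[ins[1]] = taint.get(ins[2], 0)
        elif op == "binop":          # add, sub, mul, div: union of operands
            taint[ins[1]] = taint.get(ins[2], 0) | taint.get(ins[3], 0)
        elif op == "invoke":         # argument taint flows into parameters
            for param, arg in zip(ins[1], ins[2]):
                taint[param] = taint.get(arg, 0)
        elif op == "return":         # return value taint is saved
            taint["ret"] = taint.get(ins[1], 0)
        elif op == "move-result":    # caller picks up the saved taint
            taint[ins[1]] = taint.get("ret", 0)
    return taint


taint = {"v0": IMEI, "v5": LOCATION}
program = [
    ("const", "v1"),
    ("binop", "v2", "v0", "v1"),       # v2 = v0 + v1 carries IMEI
    ("binop", "v6", "v2", "v5"),       # v6 carries IMEI and LOCATION
    ("invoke", ["p0"], ["v2"]),        # call f(v2)
    ("move", "v3", "p0"),              # body of f
    ("return", "v3"),
    ("move-result", "v4"),             # back in the caller
    ("const", "v0"),                   # v0 overwritten by a constant
]
propagate(program, taint)
```

After the run, `v2`, `v3`, and `v4` carry the IMEI label, `v6` carries both labels, and `v0` is clean again because it was overwritten by a constant.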
Privacy leak point
The privacy leak points detected by the TaintChaser system include the socket communication interface, the HTTPS encrypted communication interface, Bluetooth, short messages, etc. For each type of privacy leak point there is more than one interface for sending data, so all related functions must be processed. First, using the length and memory address of the transmitted data, the taint storage table is checked to determine whether the data are tainted; if they are, the event is recorded, including the content and destination address of the sent data and other related information [21, 22].
Output of execution path information of the tested program
The execution path information of the tested program is output by processing the function call instruction (invoke) in the Dalvik virtual machine. Whenever an invoke instruction is executed, the class name and method name of the called function are printed.
However, because the tested program calls a large number of system functions during execution (such as interface rendering, inter-process communication, and event processing), the printed result includes not only the execution path of the tested program but also a great deal of information about system function calls. A filtering mechanism is therefore needed to remove the call information of system functions and other useless information. Because system classes and the classes of the tested software have different names, calls can be filtered by the class name of the called function [23, 24].
First, before the program is tested, the relevant classes of the tested program are extracted and imported into the specified configuration file. Then, during the test, whenever an invoke instruction is executed, the class name of the called function is matched against the contents of the configuration file. If it matches, the information about the called function is output; otherwise it is discarded. In this way, the path information of the tested program that has actually been executed is obtained.
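The class-name filter can be sketched as follows. The configuration format (one class name per line), the function names, and the sample class names are assumptions for illustration only.

```python
def load_app_classes(config_text):
    """Parse the configuration file: one class name of the tested app per line."""
    return {line.strip() for line in config_text.splitlines() if line.strip()}


def filter_invokes(invoke_records, app_classes):
    """Keep only calls whose class belongs to the tested program.

    invoke_records is a list of (class name, method name) pairs, in the
    order in which the invoke instructions were executed.
    """
    return [
        f"{cls}.{method}"
        for cls, method in invoke_records
        if cls in app_classes
    ]


# Hypothetical configuration contents and invoke trace.
config_text = """
com.example.app.MainActivity
com.example.app.Uploader
"""
trace = [
    ("android.view.View", "draw"),
    ("com.example.app.MainActivity", "onCreate"),
    ("android.os.Handler", "dispatchMessage"),
    ("com.example.app.Uploader", "send"),
]

path = filter_invokes(trace, load_app_classes(config_text))
```

Holding the class names in a set keeps the per-invoke check cheap, which matters because the check runs on every call instruction the virtual machine executes.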
Automated testing program
Automated testing requires that the tested program be installed on the system automatically, started automatically, and sent a series of events automatically during execution. Automatic installation and start-up use the adb and am tools provided by the Android system. Before automated testing, some pre-processing is needed. First, the tested program is decompressed and the related program information is read from the file AndroidManifest.xml. Then, according to the file directory structure of the program, the program name is inferred and the class names are written to a file in the specified directory, so that the execution path information of the tested program can be output. After these preparations, the adb tool is used to install the tested program on the system and begin the test. During testing, the am command is used to send a series of events to the program.
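The install, start, and event steps can be expressed as a list of command lines. The sketch below only builds the commands and does not run them. The package, activity, and broadcast action names are hypothetical, and using `am broadcast` for the events is an assumption, since the text does not say which am subcommands are used.

```python
def build_test_commands(apk_path, package, activity, broadcast_actions):
    """Build adb command lines for one automated test run.

    Each command is a list of arguments suitable for subprocess.run.
    """
    commands = [
        # Install the tested program on the device or emulator.
        ["adb", "install", apk_path],
        # Start the main activity of the tested program.
        ["adb", "shell", "am", "start", "-n", f"{package}/{activity}"],
    ]
    # Send a series of events to the program as broadcast intents.
    for action in broadcast_actions:
        commands.append(["adb", "shell", "am", "broadcast", "-a", action])
    return commands


# Hypothetical names; in practice the package and activity would be read
# from AndroidManifest.xml during pre-processing.
commands = build_test_commands(
    "app-under-test.apk",
    "com.example.app",
    ".MainActivity",
    ["com.example.app.TEST_EVENT"],
)
```

Building the commands as data, separate from executing them, makes it easy to log or replay a test run; each list can be handed to `subprocess.run` when a device is attached.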