Yelu Yu is currently a master’s student at the University of Science and Technology of China. She received her B.E. degree from the North China Electric Power University in 2021. Her research interests include privacy and security in databases.
Han Fang is currently a research fellow at the School of Computing, National University of Singapore. He received his B.S. degree in 2016 from Nanjing University of Aeronautics and Astronautics and his Ph.D. degree in 2021 from the University of Science and Technology of China. His research interests include image watermarking, information hiding, and adversarial machine learning.
Weiming Zhang is currently a Professor with the School of Cyber Science and Technology, University of Science and Technology of China. He received his M.S. degree and Ph.D. degree in 2002 and 2005, respectively, from the University of Information Engineering. His research interests include information hiding and multimedia security.
Graphical Abstract
A robust time series database watermarking method that can keep statistical characteristics unchanged.
Abstract
Database watermarking is one of the most effective methods to protect the copyright of databases. However, traditional database watermarking has a potential drawback: watermark embedding will change the distribution of data, which may affect the use and analysis of databases. Considering that most analyses are based on the statistical characteristics of the target database, keeping the consistency of the statistical characteristics is the key to ensuring analyzability. Since statistical characteristics analysis is performed in groups, compared with traditional relational databases, time series databases (TSDBs) have obvious time-grouping characteristics and are more valuable for analysis. Therefore, this paper proposes a robust watermarking algorithm for time series databases, effectively ensuring the consistency of statistical characteristics. Based on the time-group characteristics of TSDBs, we propose a three-step watermarking method, which is based on linear regression, error compensation, and watermark verification, named RCV. According to the properties of the linear regression model and error compensation, the proposed watermark method generates a series of data that have the same statistical characteristics. Then, the verification mechanism is performed to validate the generated data until it conveys the target watermark message. Compared with the existing methods, our method achieves superior robustness and preserves constant statistical properties better.
Public Summary
This paper proposes a robust database watermarking scheme for time series databases, which can effectively ensure the consistency of statistical characteristics before and after watermark embedding.
Based on the time-group characteristics of TSDBs, we propose a three-step watermarking method, which is based on linear regression, error compensation, and watermark verification, named RCV.
The effectiveness of our scheme in keeping the statistical characteristics unchanged is verified both theoretically and practically. The experimental results show that our scheme has good robustness against database malicious attacks.
In the era of the digital economy, all walks of life are generating massive amounts of data every day. Databases, such as time-series databases which consist of thousands of pieces of data, have a high commercial value because the analysis of such data can effectively help the development of the industry.
However, precisely because of their high commercial value, databases are also exposed to security risks such as data breaches, unauthorized copying, and copyright violations. Such risks also exist in the field of multimedia, where one common solution is digital watermarking. By inserting different watermark signals into different multimedia data, such as images[1,2], video[3,4], and 3D meshes[5–7], their copyright can be effectively protected. To address such risks in databases, there has been some research on digital watermarking techniques for time series data[8–10]. Duy et al.[8] proposed a watermarking scheme that embeds watermark information by modifying the mean modulation relationship of approximation coefficients in the wavelet domain. This scheme treats time series data as one-dimensional signals and obtains good robustness to signal processing noise. However, it cannot guarantee robustness against common attacks on databases.
Therefore, database watermarking was introduced by Agrawal and Kiernan[11], who provided a new technical solution for database security. Database watermarking embeds unique watermark information in the database to prove its copyright and to prevent malicious piracy or unauthorized use. In addition, database watermarking can be used to track and prevent tampering. When pirated databases are freely distributed, identity information can be extracted by specific means as reliable digital evidence. To deal with increasingly severe database security problems, the development of database watermarking has become a vital research topic of wide concern.
Since then, several robust database watermarking schemes[12–17] have been proposed for copyright protection and traceability. Guo et al.[12] proposed a robust watermarking algorithm for relational databases based on fingerprint recognition. The algorithm embedded fingerprints to identify legitimate recipients of relational data and provided a digital confidence level to identify owners and illegal distributors. Guo et al.[13] proposed an improved LSB algorithm for watermarking digital attributes in relational databases to protect copyright. Franco-Contreras et al.[14] proposed a robust database watermarking scheme that can achieve semantic control of data distortion and extend quantization index modulation (QIM) to circular histograms of numerical attributes. However, traditional robust database watermarking schemes have a potential drawback, i.e., the embedding of the watermark will change the statistical characteristics of databases, which may influence the analysis of the whole database.
One solution for maintaining the consistency of the database is reversible watermarking[18–29], which embeds the watermark in a reversible manner so that both the watermark and the original database can be recovered from the watermarked database. The first reversible watermarking scheme for databases was proposed in 2006[18]; it used histogram expansion, but its resistance to attacks was poor. In 2009, difference expansion-based watermarking (DEW)[19] was utilized to watermark a database reversibly, but since the watermark is embedded in the integer part, the resulting data distortion is very large. In Ref. [20], Jawad and Khan combined the DEW scheme with the genetic algorithm (GA) to enhance the robustness of DEW. Imamoglu et al.[22] improved DEW with the firefly algorithm to reduce data distortion. Hu et al.[21] designed a robust reversible database watermarking scheme based on distortion control, which uses the genetic algorithm to optimize histograms for watermark embedding. This method keeps the data distortion of a single attribute value within a certain range. Refs. [25, 26] proposed a robust and reversible watermarking algorithm based on continuous columns in histograms. In 2020, Ge et al.[28] proposed a novel, robust, and reversible database watermarking technique named histogram shifting watermarking based on random forest and genetic algorithm (RF-GAHCSW). However, the reversible process requires people with key permissions to restore the undistorted database before analysis, which is unsatisfactory in practice. First, the reverse operation is computationally complex. Second, watermarked databases are commonly expected to be analyzed by people without such permissions.
The analysis of the database is often conducted on the statistical characteristics of the data, e.g., mean and variance. Therefore, the key point to ensure the analyzability of the database is keeping the statistical characteristics unchanged before and after watermark embedding. However, none of the existing methods can satisfy this goal. Therefore, designing a database watermarking method that maintains the statistical property invariance is currently an urgent demand.
To this end, this paper proposes a robust watermarking scheme for time series databases (TSDBs) that can effectively embed the watermark while maintaining the statistical characteristics at the same time. Statistical characteristics analysis often needs to be performed with groups of data. Compared with traditional relational databases which are clustered with single rows and single columns, TSDBs have obvious time-grouping characteristics where the information contained in the same group effectively reflects the characteristics of a certain time period. Based on the time-group characteristics of the time series database, we propose performing watermark embedding on a group basis rather than on a point basis. Specifically, we propose a three-step watermarking method, which is based on linear regression, error compensation, and watermark verification, named RCV. First, based on the linear regression model and error compensation, the proposed watermark method could generate a series of data that have the same statistical characteristics as the original database. Then, the verification mechanism is performed to validate the generated data until it conveys the target watermark message. The specific process of the RCV method is described in Section 2.2.3. In watermark extraction, we will determine the final extracted watermark bits by the majority voting principle.
The main contributions of this article are summarized as follows:
(Ⅰ) We propose a robust watermarking scheme that can effectively keep the statistical characteristics unchanged. Based on the time-group characteristics of time series databases, we propose a mechanism named RCV to embed the watermark into data groups. We verify the validity of the proposed RCV scheme in statistical characteristics preservation both theoretically and practically.
(Ⅱ) Extensive experimental results indicate that our method has strong robustness and can resist common database attacks. Under alteration attacks, the watermark extraction accuracy remains 0.84 even when up to 90% of the groups are altered. Our method also maintains a high watermark extraction accuracy under deletion and insertion attacks.
2. The proposed method
2.1 Motivation
Since time series databases have obvious time-group characteristics, data editing operations on time series databases are often performed on groups of data (e.g., the data in a certain time period). Therefore, in this paper, we propose embedding the watermark in groups rather than in individual data points. The group-based operation has two advantages: (ⅰ) Embedding the same watermark bit in a group of data is similar to spread spectrum watermarking, which can effectively improve the redundancy and enhance the robustness of watermarking; (ⅱ) Operating within a group is more conducive to maintaining statistical properties. Compared with embedding in individual data points, the modifications to different data in the same group can compensate for each other and thus better preserve the statistical characteristics.
Based on this idea, we propose a three-step watermarking scheme named RCV which is realized by a “regression, compensation, and verification” operation. Based on the properties of the linear regression model, watermarking can be effectively achieved with statistical characteristics preserving.
2.2 Framework
In this section, we mainly introduce the proposed watermarking framework. The whole framework can be divided into three phases: preprocessing, watermark embedding, and watermark extraction. Before introducing each phase, we first describe the common components of the time series database.
2.2.1 Composition of time series database
For better illustration, we give an example of a time series database[30], as shown in Table 1. The time series database mainly contains four parts: point, timestamp, tag, and field. The definition of each of them can be expressed as:
• Point: the piece of data in the database, for example, “67.20” in the second row of the “Price” column.
• Timestamp: a column of points that must exist in a time series database, which indicates the time point when the data were collected.
• Tag: a column of points representing the attribute of the collected data, which generally does not change with time, such as the “Information” column in Table 1.
• Field: a column of points representing the measured value of the data, which fluctuates smoothly over time, such as “Price” and “Demand” in Table 1.
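The four parts can be pictured as one record type. The sketch below is only an illustration: the class name `Row` and the demand and information values are hypothetical, while the column names and the price 67.20 follow the example mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Row:
    """One row of a time series database (hypothetical layout)."""
    timestamp: datetime  # when the data were collected; not modified by watermarking
    information: str     # tag: attribute that generally does not change with time
    price: float         # field: measured value
    demand: float        # field: measured value


# Each individual value, such as 67.20, is a "point".
# The timestamp, tag, and demand values here are made up for illustration.
row = Row(datetime(2003, 1, 1, 0, 30), "example-tag", 67.20, 4500.0)
```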
In addition, for a quick reference, we list the notations used in this paper, as shown in Table 2.
Table 2. Notations used in the paper.

| Symbol | Description |
| --- | --- |
| $W$ | Original watermark |
| $W_e$ | Extracted watermark |
| $K_s$ | The secret key |
| $l$ | The length of the watermark |
| $N$ | Number of points in the database |
| $n$ | Number of points in the group |
| $m$ | Total number of groups |
| $T$ | The set of timestamp points |
| $T_i$ | The $i$th group of timestamp points |
| $t_{ij}$ | Timestamp point of the $j$th data in group $i$ |
| $G$ | Timestamp grouping function |
| $F_t$ | Function to map the watermark index to timestamp groups |
In the preprocessing phase, any form of watermark information (such as pictures, text, or sounds) is converted into a binary bit sequence $W$ of length $l$; $W \in \{0,1\}^l$ is the watermark to be embedded. We then cluster the original database $D$ into $m$ groups according to its timestamps and embed one watermark bit in each group. Because timestamp points are usually not allowed to be modified, they provide a stable basis for grouping. The grouping function is denoted as $G$ and is used on both the embedding side and the extraction side. After grouping, we obtain $m$ groups with different timestamps, denoted as $T_i$, $i \in [1,m]$. Specifically, $G$ is defined as $T_i = G(D,m) = \{D_{ij} \mid N \times (i-1)/m < j \le N \times i/m\}$, $i \in [1,m]$. Each group contains $n$ timestamp points, denoted as $t_{ij}$, $j \in [1,n]$. Next, we determine which bit will be embedded in each group. This operation is realized by a mapping function $F_t$. For the $i$th group, $F_t$ receives the timestamp points $T_i$, a secret key $K_s$, and the watermark length $l$ as inputs, and outputs the index of the watermark bit to be embedded in this group, denoted as $k_i$:
$$k_i = F_t(T_i, K_s, l), \tag{1}$$
where $k_i \in [1,l]$. In this paper, $F_t$ is achieved by
$$F_t(T_i, K_s, l) = \mathrm{mod}\left(H\left(K_s, \sum_{j=1}^{n} t_{ij}\right), l\right), \tag{2}$$
where $\mathrm{mod}(\cdot)$ indicates the modulo operation and $H$ indicates the hashing operation. Based on Eq. (2), we can determine the index of the watermark bit to be embedded in group $i$. Note that $m$ should be larger than $l$ for full watermark embedding. In addition, according to the definition of $F_t$, the watermark bit with the same index might be embedded more than once.
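The grouping and index-mapping steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper does not specify the hash $H$, so HMAC-SHA256 is used here as a stand-in; timestamps are assumed to be integers (e.g., epoch seconds); and the function names `group_bounds` and `watermark_index` are hypothetical. The returned index is 0-based, so adding 1 gives the range $[1,l]$ used in the text.

```python
import hashlib
import hmac


def group_bounds(N, m):
    """Slice bounds (0-based, end exclusive) of the m groups.

    Group i (1-based) holds the 1-based positions j with
    N*(i-1)/m < j <= N*i/m, as in the definition of G.
    """
    return [((N * (i - 1)) // m, (N * i) // m) for i in range(1, m + 1)]


def watermark_index(timestamps, key, l):
    """F_t: keyed hash of the group's timestamp sum, reduced modulo l.

    HMAC-SHA256 is an assumed choice for H. Returns a 0-based index.
    """
    total = sum(timestamps)
    digest = hmac.new(key, str(total).encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % l


# Toy usage: 10 days of half-hourly integer timestamps, 4-bit watermark.
N, m, l = 480, 10, 4
timestamps = [1_000_000 + 1800 * j for j in range(N)]
bounds = group_bounds(N, m)
indices = [watermark_index(timestamps[a:b], b"secret", l) for a, b in bounds]
```

Because the index depends only on the key and the unmodified timestamps, the extractor can recompute it without access to the original field values.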
2.2.3 Watermark embedding
For a typical time series database, the points of the “timestamp” and “tag” columns often have less information than the “field” column. Therefore, it would be better to embed the watermark into “field” points. Assume that in “field” points, there are several column points with a confidential attribute such as salary information, and several column points with a nonconfidential attribute such as behavior information. The data we need to maintain statistical characteristics rely more on points with confidential attributes. Therefore, in this paper, we propose embedding the watermark into the confidential points while leaving the nonconfidential points unmodified. The nonconfidential points could effectively serve as a reference to maintain the statistical characteristics of the confidential points. In addition, each point in the “field” column corresponds to a “timestamp” feature, so the group information of “timestamp” can be directly applied to the “field” points.
Denote the column of points in which we want to embed the watermark as $X$, and the reference column of points as $S$. Since we have clustered the database into $m$ groups, the goal is to embed one watermark bit into each group of $X$ to generate the watermarked points $Y$.
In addition, the statistical characteristics of each group of $X$ (denoted as $X_i \in X$, $i \in [1,m]$), i.e., the mean and the variance of $X_i$, should be the same as those of $Y_i \in Y$, $i \in [1,m]$, and the covariance between $X_i$ and $S_i$ should equal the covariance between $Y_i$ and $S_i$.
To achieve this goal, we propose RCV, a regression-compensation-verification-based method for watermarking. Based on $X_i$, $S_i$, and data points $A_i$ drawn from the standard normal distribution, we first use linear regression models and an error compensation mechanism to generate a set of data points $Y_i$, guaranteeing that the statistical characteristics of $X_i$ and $Y_i$ are consistent. Then we apply a watermark verification mechanism $F_v$ to check whether the generated $Y_i$ conveys the target watermark bit. If $Y_i$ passes the verification, that is, $Y_i$ carries the watermark information, we proceed to the next group of data. Otherwise, we re-execute the regression-compensation process to regenerate $Y_i$ and repeat until $Y_i$ passes the verification mechanism. A specific example of the RCV scheme is given in Section 3.1.1.
Specifically, for the $i$th group, we use a linear regression model with parameters $\bar{\alpha}_{i0}$ and $\bar{\alpha}_{i1}$ to predict the value of $x_{ij} \in X_i$, $j \in [1,n]$, from $s_{ij} \in S_i$, $j \in [1,n]$:
$$\hat{x}_{ij} = \bar{\alpha}_{i0} + \bar{\alpha}_{i1} \times s_{ij}, \tag{3}$$
where $\hat{x}_{ij}$ is the predicted point. Then we sample a set of noise $A_i$ with the same size as $X_i$ from the standard normal distribution. $A_i$ is further represented by another linear regression model with parameters $\bar{\beta}_{i0}$, $\bar{\beta}_{i1}$, and $\bar{\beta}_{i2}$, i.e.,
$$\hat{a}_{ij} = \bar{\beta}_{i0} + \bar{\beta}_{i1} \times s_{ij} + \bar{\beta}_{i2} \times x_{ij}, \tag{4}$$
where $\hat{a}_{ij}$ indicates the predicted value of $a_{ij} \in A_i$. After obtaining $\hat{A}_i$, we calculate the differences between $A_i$ and $\hat{A}_i$, denoted as $B_i$:
$$B_i = A_i - \hat{A}_i. \tag{5}$$
Then we calculate the compensation parameter $C_i$ from $B_i$ with Eq. (6):
$$c_{ij} = \frac{b_{ij}}{\sigma_{B_i}} \xi, \tag{6}$$
where $c_{ij} \in C_i$, $b_{ij} \in B_i$, $\sigma_{B_i}$ is the standard deviation of $B_i$, and $\xi$ can be calculated as
$$\xi^2 = \sigma_{X_i}^2 - \frac{\sigma_{X_i S_i}^2}{\sigma_{S_i}^2}, \tag{7}$$
where $\sigma_{X_i}^2$, $\sigma_{S_i}^2$, and $\sigma_{X_i S_i}$ are the variance of $X_i$, the variance of $S_i$, and the covariance between $X_i$ and $S_i$, respectively. After determining the compensation parameter $C_i$, the final watermarked data $Y_i$ of group $i$ can be calculated as:
$$y_{ij} = \hat{x}_{ij} + c_{ij}. \tag{8}$$
In this manner, the generated $Y_i$ maintains the same mean, the same variance, and the same covariance (with $S_i$) as $X_i$. The relevant proof can be found in Section 2.2.5.
Then, we apply a verification mechanism $F_v$ to check whether the generated $Y_i$ conveys the watermark bit, i.e., whether $F_v(Y_i, K_s) = W(k_i)$. The verification mechanism in this paper is shown in Eq. (9):
$$F_v(Y_i, K_s) = \mathrm{mod}\left(H\left(K_s, \sum_{j=1}^{n} \lfloor y_{ij} \rfloor\right), 2\right), \tag{9}$$
where $\lfloor \cdot \rfloor$ indicates the round-down (floor) function. The embedding process is repeated until the generated $Y_i$ passes the verification. Then we replace all $X$ with $Y$ to generate the final watermarked database $D_w$.
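The embedding algorithm is illustrated in Algorithm 1.
Algorithm 1: Watermark embedding algorithm
Input: Secret key $K_s$, original database $D$, watermark $W$, group number $m$
Output: Watermarked database $D_w$
$n, T_i = G(T \in D, m)$, $i \in [1,m]$
for $i = 1 \to m$ do
    $k_i = F_t(T_i, K_s, l)$
    do
        Generate $A_i$
        for $j = 1 \to n$ do
            $\hat{x}_{ij} = \bar{\alpha}_{i0} + \bar{\alpha}_{i1} \times s_{ij}$
            $\hat{a}_{ij} = \bar{\beta}_{i0} + \bar{\beta}_{i1} \times s_{ij} + \bar{\beta}_{i2} \times x_{ij}$
        end
        $B_i = A_i - \hat{A}_i$
        for $j = 1 \to n$ do
            $c_{ij} = \frac{b_{ij}}{\sigma_{B_i}} \xi$
            $y_{ij} = \hat{x}_{ij} + c_{ij}$
        end
        $v_i = F_v(Y_i, K_s)$
    while $v_i \ne W(k_i)$
end
$D_w = D(X \Rightarrow Y)$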
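The regression, compensation, and verification loop for a single group can be sketched with NumPy as follows. This is an illustrative sketch under assumptions rather than the authors' code: HMAC-SHA256 stands in for the unspecified hash $H$, population (not sample) moments are used consistently for all variances and covariances, the retry cap `max_tries` is an added safeguard, the function names are hypothetical, and the data are synthetic.

```python
import hashlib
import hmac

import numpy as np


def verify_bit(y, key):
    """F_v: keyed hash of the sum of floored values, reduced modulo 2."""
    total = int(np.floor(y).sum())
    digest = hmac.new(key, str(total).encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % 2


def rcv_embed_group(x, s, bit, key, rng, max_tries=1000):
    """Generate y that carries `bit` and matches x in mean, variance, and cov with s."""
    x = np.asarray(x, dtype=float)
    s = np.asarray(s, dtype=float)
    n = len(x)
    # Regression: least-squares fit of x on s with intercept (Eq. 3).
    S1 = np.column_stack([np.ones(n), s])
    alpha, *_ = np.linalg.lstsq(S1, x, rcond=None)
    x_hat = S1 @ alpha
    # xi from Eq. 7, using population moments.
    cov_xs = np.mean((x - x.mean()) * (s - s.mean()))
    xi = np.sqrt(x.var() - cov_xs**2 / s.var())
    SX = np.column_stack([np.ones(n), s, x])
    for _ in range(max_tries):
        a = rng.standard_normal(n)                       # noise A_i
        beta, *_ = np.linalg.lstsq(SX, a, rcond=None)    # Eq. 4
        b = a - SX @ beta                                # residual B_i (Eq. 5)
        c = b / b.std() * xi                             # compensation (Eq. 6)
        y = x_hat + c                                    # candidate Y_i (Eq. 8)
        if verify_bit(y, key) == bit:                    # verification (Eq. 9)
            return y
    raise RuntimeError("no candidate passed verification")


# Synthetic group of 48 points (values are made up for illustration).
rng = np.random.default_rng(0)
s = rng.normal(70.0, 20.0, size=48)
x = 2650.0 + 9.0 * s + rng.normal(0.0, 130.0, size=48)
y = rcv_embed_group(x, s, bit=1, key=b"demo-key", rng=rng)
```

Since each candidate passes the one-bit check with probability of roughly one half, only a few iterations are typically needed per group.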
2.2.4 Watermark extraction
In this section, we introduce the mechanism to extract the watermark from the watermarked database $D_w$. Specifically, we first use the grouping algorithm $G$, which is the same as in the embedding phase, to cluster the database with the secret key $K_s$ and the group number $m$. For each group $i$, we utilize the mapping function $F_t$ to determine the index $k_i$ of the extracted watermark bit according to $T_i$, as shown in Eq. (1) and Eq. (2). Then we take the embedded column $Y_i$ and apply the verification mechanism $F_v$ to it to extract the embedded watermark bit of $W_e$. In embedding, the watermark bit with the same index might be embedded into more than one group; in extraction, for different groups corresponding to the same index, we determine the final extracted watermark bit by the majority voting principle. Specifically, for each watermark index $k \in [1,l]$, we record the number of bits “0” (denoted as $Num_k^0$) and the number of bits “1” (denoted as $Num_k^1$) that are extracted from all groups corresponding to $k$. If $Num_k^0$ is larger than $Num_k^1$, we regard the watermark bit of index $k$ as “0”; otherwise, we regard the bit as “1”.
Algorithm 2: Watermark extraction algorithm
Input: Secret key $K_s$, watermarked database $D_w$, group number $m$, watermark length $l$
Output: Watermark $W_e$
$E \in \mathbb{R}^{l \times 2} = 0$
$n, T_i = G(T \in D_w, m)$, $i \in [1,m]$
for $i = 1 \to m$ do
    $k_i = F_t(T_i, K_s, l)$
    $v_i = F_v(Y_i, K_s)$
    $E(k_i, v_i) = E(k_i, v_i) + 1$
end
for $k = 1 \to l$ do
    if $E(k,0) \ge E(k,1)$ then
        $W_e(k) = 0$
    else
        $W_e(k) = 1$
    end
end
For example, suppose there are 3 groups all corresponding to the 7th bit, and the extracted watermark of these 3 groups is {1,1,0}, so the 7th bit watermark will be determined as 1 according to the majority voting principle.
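The voting step can be written compactly. The sketch below assumes the per-group indices and extracted bits have already been computed, uses 0-based indices, and resolves ties (and indices with no votes) to 0, following the $\ge$ comparison in Algorithm 2; the function name `majority_vote` is hypothetical.

```python
def majority_vote(indices, bits, l):
    """Combine per-group (index, bit) votes into an l-bit watermark."""
    votes = [[0, 0] for _ in range(l)]      # votes[k] = [count of 0s, count of 1s]
    for k, v in zip(indices, bits):
        votes[k][v] += 1
    return [0 if zeros >= ones else 1 for zeros, ones in votes]


# Three groups map to the 7th bit (0-based index 6) and yield {1, 1, 0}.
recovered = majority_vote([6, 6, 6], [1, 1, 0], l=8)
seventh_bit = recovered[6]
```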
2.2.5 Statistical characteristics analysis
In this section, we will analyze the statistical characteristics of the database before and after embedding.
Since we use the linear regression model to predict confidential points $X$ from nonconfidential points $S$, we first introduce two properties of the linear regression model. Take the simple linear regression model with parameters $p$ and $q$ and $n$ points as an example, as shown in Eq. (10):
$$y_i = p x_i + q, \quad i \in [1,n]. \tag{10}$$
Denote the predicted value as $\hat{y}_i = p x_i + q$ and the prediction residual as $e_i = p x_i + q - y_i = \hat{y}_i - y_i$. The two properties are as follows: (ⅰ) the mean value of $e_i$ is 0; (ⅱ) $e_i$ is orthogonal to $x_i$.
Here we give the proof. The purpose of model training is to solve for the optimal parameters $p$ and $q$, which make the regression result as close as possible to the true value. Without loss of generality, we use the least squares method to solve the optimization problem. Denoting the input data as $x_i$, $i = 1, 2, \ldots, n$, we obtain a set of predicted values $f(x_i)$ according to Eq. (10) and calculate the squared loss between $f(x_i)$ and the real value $y_i$ as follows:
$$L(p,q) = \frac{1}{n} \sum_{i=1}^{n} \left(f(x_i) - y_i\right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left(p x_i + q - y_i\right)^2, \tag{11}$$
where $L(p,q)$ is the squared loss with parameters $p$ and $q$. In the ideal case, the partial derivatives of $L$ with respect to $p$ and $q$ should be 0, i.e.:
$$\frac{\partial L}{\partial p} = \frac{2}{n} \sum_{i=1}^{n} \left(p x_i + q - y_i\right) x_i = 0, \tag{12}$$
$$\frac{\partial L}{\partial q} = \frac{2}{n} \sum_{i=1}^{n} \left(p x_i + q - y_i\right) = 0. \tag{13}$$
According to Eq. (13), we have
$$E(e_i) = \frac{1}{n} \sum_{i=1}^{n} e_i = 0, \tag{14}$$
where $E(e_i)$ indicates the mean value of $e_i$. Based on Eq. (14), property (ⅰ) is satisfied. Property (ⅱ) can be proved with Eq. (12), which can be rewritten as:
$$\sum_{i=1}^{n} \left(p x_i + q - y_i\right) \cdot x_i = \sum_{i=1}^{n} e_i \cdot x_i = 0. \tag{15}$$
Therefore, $e_i$ is orthogonal to $x_i$.
In addition, according to Eq. (12) and Eq. (13), we can determine the value of $p$ as:
$$p = \frac{E(XY) - E(X)E(Y)}{E(X^2) - E(X)^2} = \frac{\sigma_{XY}}{\sigma_X^2}, \tag{16}$$
where $E(X)$ indicates the mean value of $x_i \in X$, $E(Y)$ indicates the mean value of $y_i \in Y$, $\sigma_{XY}$ is the covariance of $x_i \in X$ and $y_i \in Y$, and $\sigma_X^2$ is the variance of $X$. The value of $q$ can be determined as:
$$q = E(Y) - pE(X). \tag{17}$$
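The two residual properties and the closed forms for the slope and intercept are easy to confirm numerically. The following check uses synthetic data (the coefficients 3.0 and 1.0 and the noise scale are arbitrary choices) and takes $p$ as the ratio of the covariance of $X$ and $Y$ to the variance of $X$, with population moments.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=200)

# Closed-form least-squares solution: slope = cov(X, Y) / var(X).
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
p = cov_xy / x.var()
q = y.mean() - p * x.mean()

e = p * x + q - y                # prediction residuals
mean_residual = e.mean()         # property (i): should be ~0
inner_product = np.dot(e, x)     # property (ii): should be ~0
```

Both quantities vanish up to floating-point error, and `p`, `q` agree with the coefficients returned by `np.polyfit(x, y, 1)`.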
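Now we analyze the statistical characteristics of $Y_i$ and $X_i$ for a specific group $i$. Based on the proposed embedding mechanism, the value $y_{ij} \in Y_i$ is generated as:
$$y_{ij} = \hat{x}_{ij} + c_{ij} = \bar{\alpha}_{i0} + \bar{\alpha}_{i1} \times s_{ij} + c_{ij}. \tag{18}$$
Mean: For the mean of $Y_i$, we get:
$$E(Y_i) = E(\hat{X}_i) + E(C_i) = E(\hat{X}_i) + E\left(\frac{B_i}{\sigma_{B_i}} \xi\right), \tag{19}$$
where $B_i$ is the prediction residual of $A_i$, as shown in Eq. (5). According to property (ⅰ), $E(B_i)$ is 0, and the residual of the regression in Eq. (3) also has zero mean, so $E(\hat{X}_i) = E(X_i)$. Eq. (19) can therefore be rewritten as:
$$E(Y_i) = E(\hat{X}_i) + \frac{\xi}{\sigma_{B_i}} E(B_i) = E(\hat{X}_i) = E(X_i). \tag{20}$$
Therefore, the mean value of $Y_i$ is equal to the mean value of $X_i$.
Variance: The variance of $Y_i$ can be calculated as: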
Therefore, when conducting an ideal linear regression and satisfying Eq. (22), the mean value, the variance value, and the covariance value with the reference column $S_i$ of the embedded column $Y_i$ are equal to those of the original column $X_i$.
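The variance and covariance claims can be checked with a short derivation. The following is a sketch rather than a reproduction of the paper's own equations: it assumes population moments throughout, uses $\bar{\alpha}_{i1} = \sigma_{X_i S_i} / \sigma_{S_i}^2$ from the least-squares solution, and relies on properties (ⅰ) and (ⅱ) applied to $B_i$, which make $C_i$ zero-mean and uncorrelated with $S_i$ (and hence with $\hat{X}_i$). The last step of the first line uses the definition of $\xi$ in Eq. (7).

$$
\begin{aligned}
\sigma_{Y_i}^2 &= \sigma_{\hat{X}_i}^2 + \sigma_{C_i}^2 + 2\,\sigma_{\hat{X}_i C_i}
= \bar{\alpha}_{i1}^2 \sigma_{S_i}^2 + \xi^2 + 0
= \frac{\sigma_{X_i S_i}^2}{\sigma_{S_i}^2} + \xi^2
= \sigma_{X_i}^2, \\
\sigma_{Y_i S_i} &= \sigma_{\hat{X}_i S_i} + \sigma_{C_i S_i}
= \bar{\alpha}_{i1} \sigma_{S_i}^2 + 0
= \sigma_{X_i S_i}.
\end{aligned}
$$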
3. Experimental results and analysis
Experiments are implemented on a common PC with an Intel Core i5 CPU and RAM of 16 GB. It should be noted that the proposed RCV scheme could be applied to all time series databases rather than specific databases with special characteristics. Without loss of generality, we take one available public time series database, the Circuit Load Data of Singapore[30], to evaluate the performance of our scheme, in which we focus on three attributes: timestamp, electricity load, and real-time electricity price. The equipment collects power load and electricity price information every half an hour, i.e., 48 data points per day, and a total of 340080 data points are collected from January 1, 2003, to May 22, 2022. In our experiment, the length of the embedded watermark is set as 70 bits with a group size of 48 data points, the same size as one day. Note that the length and grouping size of the watermark information can be selected reasonably by the owner according to the specific situation of the database. For the time series database, the statistical analysis for a certain time period is more important, so we pay more attention to how the statistical characteristics of each group remain unchanged. To better analyze the information contained in the data, we recommend that the group size be a day, a week, or a meaningful period of data points.
The following experiments are illustrated in two aspects. The first part analyzes the statistical characteristics and compares the statistical preserving properties of the proposed RCV scheme with other watermarking schemes. In the second part, the robustness of our scheme is compared with some watermarking schemes.
3.1 Statistical characteristics analysis
This subsection is mainly divided into two parts. The first part proves the effectiveness of the proposed scheme in preserving statistical characteristics. Then, the statistical characteristics of RCV are compared and analyzed with the existing database watermarking schemes, including GAHSW[21], DEW[19], and time series data watermarking scheme Signal[8].
3.1.1 Local statistical characteristics preservation
The proposed RCV does not degrade data availability after embedding watermarks, because it keeps the local statistical properties of the data unchanged. In each group, the watermarked data $Y_i$ have exactly the same mean and variance as the original data $X_i$. Additionally, the covariance between $Y_i$ and $S_i$ is exactly the same as that between $X_i$ and $S_i$. To better illustrate the statistical characteristics preservation of the proposed RCV scheme, we randomly select a group of timestamp points $T_i$ as an example for analysis. In this paper, we group the time series database by days, and each group has 48 data points, including the values of the confidential data $X$ (electricity price) and the nonconfidential data $S$ (electricity load). As shown in Table 3, the mean and the variance of $X_i$ are 3304.7456 and 52423.3529, respectively. The covariance between $X_i$ and $S_i$ is 3808.8394. Regressing $X_i$ on $S_i$, we obtain $\bar{\alpha}_1 = 9.1830$ and $\bar{\alpha}_0 = 2650.2113$ based on Eq. (3) and calculate the predicted values for each observation. The next column, $A_i$, represents the set of random variables generated from a univariate normal distribution with mean 0 and variance 1. Next, based on Eq. (4), $A_i$ is regressed on both $X_i$ and $S_i$, and we obtain $\bar{\beta}_0 = 4.5530$, $\bar{\beta}_1 = 0.0120$, and $\bar{\beta}_2 = -0.0017$. The prediction residuals calculated by Eq. (5) are shown in the next column as $B_i$. Based on $\sigma_{X_i}^2$, $\sigma_{S_i}^2$, and $\sigma_{X_i S_i}$ and using Eq. (7), we calculate that $\xi = 132.0865$. Finally, we calculate $C_i$ and $Y_i$ for each observation according to Eq. (6) and Eq. (8).
Table 3. An illustration of local statistical characteristics preservation.
It is easily verified that $Y_i$ has the same mean and variance as $X_i$. In addition, the covariance between $Y_i$ and $S_i$ is 3808.8394, which is exactly the same as that between $X_i$ and $S_i$. Thus, the results of any analysis for which the mean and covariance are sufficient statistics, such as regression analysis, will be exactly the same when using $Y_i$ in place of $X_i$.
In the verification mechanism, we need to verify whether the watermark information in the obtained group $Y_i$ is the same as the watermark information to be embedded. Using Eq. (2), we calculate that the 7th watermark bit, “1”, is to be embedded in the group. Then, Eq. (9) is used to calculate whether the watermark information carried by this group $Y_i$ is “1”. If so, $Y_i$ passes the verification; otherwise, we regenerate $A_i$ and repeat the above steps until $Y_i$ passes the verification.
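The reported value of $\xi$ can be reproduced from the rounded figures quoted for this example. The check below treats 3808.8394 as the covariance between $X_i$ and $S_i$ and, since the variance of $S_i$ is not quoted in the text, uses the least-squares identity $\bar{\alpha}_1 = \sigma_{X_i S_i} / \sigma_{S_i}^2$ to rewrite $\sigma_{X_i S_i}^2 / \sigma_{S_i}^2$ as $\bar{\alpha}_1 \sigma_{X_i S_i}$. Small differences in the last digits are expected because the inputs are rounded to four decimals.

```python
import math

var_x = 52423.3529    # variance of X_i quoted in the text
cov_xs = 3808.8394    # covariance between X_i and S_i quoted in the text
alpha_1 = 9.1830      # regression slope quoted in the text

# alpha_1 = cov_xs / var_s, hence cov_xs**2 / var_s = alpha_1 * cov_xs.
xi = math.sqrt(var_x - alpha_1 * cov_xs)
```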
3.1.2 Comparative experiment of statistical characteristics preservation
In this subsection, we compare the statistical characteristics preservation performance of the proposed watermarking scheme and the comparative ones, including Signal[8], GAHSW[21], and DEW[19]. As shown in Table 4, the mean, variance, and covariance between the attribute X and attribute S of the watermarked database with the proposed RCV are the same as the corresponding statistical characteristics of the original database. Specifically, our scheme can not only keep the local statistical characteristics unchanged but also keep the overall statistical characteristics unchanged. For the comparative schemes, the statistical characteristics of the database watermarked by GAHSW change slightly, yet the variance and covariance values of DEW and Signal have relatively large variations. The evaluation results verify the effectiveness of our proposed scheme in keeping the statistical characteristics unchanged.
Table 4. Statistical characteristics of the database watermarked with different schemes.
Referring to the experimental setting of existing database watermark schemes, we consider three common attacks, i.e., insertion, deletion, and alteration, to evaluate the robustness of the proposed RCV scheme. The watermark extraction accuracy (Acc), i.e., the ratio of the correctly extracted bits in the extracted watermark bits, is used as the robustness metric, which can be calculated as follows:
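$$Acc = \frac{\sum_{i=1}^{l} w_i \oplus w_i^{det}}{l}, \tag{25}$$
where $w_i$ is an embedded watermark bit and $w_i^{det}$ is the corresponding extracted watermark bit. Because Acc is the ratio of correctly extracted bits, $\oplus$ here contributes 1 when the two bits agree. A higher Acc means higher robustness of the watermarking scheme.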
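As a concrete reading of the metric, the sketch below computes Acc as the fraction of positions at which the embedded and extracted bits agree, which matches the stated definition (ratio of correctly extracted bits). The function name `extraction_accuracy` and the sample bit strings are hypothetical.

```python
def extraction_accuracy(embedded, extracted):
    """Fraction of watermark bits that were extracted correctly."""
    if len(embedded) != len(extracted):
        raise ValueError("watermarks must have the same length")
    matches = sum(1 for w, w_det in zip(embedded, extracted) if w == w_det)
    return matches / len(embedded)


# One of four bits differs, so three quarters are correct.
acc = extraction_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```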
Considering that the high-value data of time series databases are often concentrated, the attacker tends to modify important parts of the continuous data. Therefore, we evaluate the Acc of the watermarked database under insertion, deletion, or alteration with percentages from 10% to 90% in steps of 10% of the data groups. The experimental results are illustrated in the following charts, in which the vertical axis represents the rate of successful watermark detection, and the horizontal axis represents the change of groups in attack percentage according to the size of the database.
To simulate the alteration attack, we alter different ratios of the data groups of the watermarked database, mainly changing the measured values. Fig. 1 shows the Acc of the watermarks extracted by RCV, Signal, DEW, and GAHSW under the alteration attack. For all schemes, the Acc of the extracted watermark decreases as the number of altered groups increases. Nevertheless, RCV outperforms the other methods under the alteration attack. This is because our scheme has a higher watermark embedding rate: for the same amount of modification, more duplicate watermark information is embedded, which makes the scheme more resistant to alteration. The GAHSW scheme embeds watermark information into the database through HSW (histogram shifting of prediction error expansion watermarking), which mainly embeds by shifting the left and right sides of the histogram. Its embedding formula can embed a bit only when the absolute value of the prediction error and the peak bin are equal, so the embedding rate of the watermark is not very high. The DEW scheme embeds watermark information through difference expansion, but it also needs certain conditions to be met before a watermark bit can be embedded into a tuple. Compared with these schemes, our scheme embeds the watermark in each group and has a higher embedding rate. Therefore, even after an alteration of up to 90% of the data groups, the Acc of RCV is still higher than 80%.
Figure 1. A comparison of watermark extraction Acc of RCV with Signal, DEW, and GAHSW after alteration attack.
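The alteration attack described above can be simulated roughly as in the following Python sketch. It is an illustration under stated assumptions rather than the experimental code: the database is modeled as a list of groups of float values, the altered groups form one contiguous run (reflecting the remark that attackers target contiguous data), and the uniform noise model and the names `alteration_attack` and `noise` are hypothetical.

```python
import random


def alteration_attack(groups, ratio, rng, noise=1.0):
    """Perturb the measured values of a contiguous run of groups.

    groups: list of groups, each a list of measured values.
    ratio:  fraction of groups to alter, e.g. 0.1 up to 0.9.
    Returns the attacked copy and the range of altered group indices.
    """
    n_attack = round(ratio * len(groups))
    start = rng.randrange(len(groups) - n_attack + 1)
    altered = range(start, start + n_attack)
    attacked = [list(g) for g in groups]  # leave the input untouched
    for i in altered:
        attacked[i] = [v + rng.uniform(-noise, noise) for v in attacked[i]]
    return attacked, altered


# Sweep the attack strength from 10% to 90% in steps of 10%.
rng = random.Random(0)
database = [[float(i)] * 4 for i in range(20)]
altered_counts = {
    pct: len(alteration_attack(database, pct / 100, rng)[1])
    for pct in range(10, 100, 10)
}
```

In an actual evaluation, each attacked database would be passed to the scheme's extractor and the resulting bits scored with Acc.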
In the deletion attack, we randomly delete different ratios of the data groups from the watermarked database. As shown in Fig. 2, for RCV, even a small preserved portion of the database is sufficient for successful watermark extraction. Meanwhile, when the database suffers from a heavy deletion attack, e.g., when 90% of the tuples are deleted, the watermarks of Signal and DEW are extracted with Acc values of only 49% and 10%, respectively, showing a lack of robustness to deletion. The Signal scheme embeds watermark information by modifying the average modulation relationship of the approximate coefficients in the wavelet domain. Although treating time series as one-dimensional signals in this way gives good robustness against noise attacks, deletion destroys the synchronization structure of the watermark signal, which weakens the robustness against deletion attacks. Compared with GAHSW and DEW, RCV has higher data redundancy when embedding the watermark. Therefore, even though many groups are deleted, the watermark can be correctly extracted as long as the remaining groups contain the bits of the watermark.
Figure 2. A comparison of watermark extraction Acc of RCV with Signal, DEW, and GAHSW after deletion attack.
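The role of redundancy under deletion can be illustrated with the following Python sketch. It randomly deletes a fraction of the groups and then reports which watermark bit positions are still carried by at least one surviving group. The mapping from a group to a bit position (`group_id % watermark_len`) is purely a hypothetical stand-in for whatever assignment the scheme actually uses, and the function names are assumptions.

```python
import random


def deletion_attack(group_ids, ratio, rng):
    """Randomly delete a fraction of the groups; return the survivors in order."""
    n_delete = round(ratio * len(group_ids))
    deleted = set(rng.sample(group_ids, n_delete))
    return [g for g in group_ids if g not in deleted]


def covered_bits(remaining_ids, watermark_len):
    """Bit positions still represented among the surviving groups.

    Assumes (hypothetically) that group g carries bit g % watermark_len.
    """
    return {g % watermark_len for g in remaining_ids}


# 1000 groups, 90% deleted, 8-bit watermark (all values are illustrative).
ids = list(range(1000))
survivors = deletion_attack(ids, 0.9, random.Random(1))
still_covered = covered_bits(survivors, 8)
```

When every bit position remains covered, extraction can in principle still recover the full watermark, which is the situation the paragraph above describes for RCV.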
For the insertion attack, data groups are randomly created and inserted between existing groups, aiming to weaken the embedded watermark. As shown in Fig. 3, the proposed RCV and GAHSW have excellent resilience to insertion attacks, while DEW and Signal are not very robust to them. As with deletion, the insertion attack destroys the synchronization structure of the watermark signal, so the Signal scheme fails to extract the correct watermark.
Figure 3. A comparison of watermark extraction Acc of RCV with Signal, DEW, and GAHSW after insertion attack.
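A rough Python sketch of the insertion attack follows. It creates random groups and inserts them at random positions between the existing groups, leaving the original groups and their relative order intact. The value range of the fake groups, the group size, and the function name are assumptions made for illustration.

```python
import random


def insertion_attack(groups, ratio, rng, low=0.0, high=1.0):
    """Insert randomly generated groups at random positions.

    The number of inserted groups is ratio * len(groups); each fake group
    has the same length as the first original group (an assumption).
    """
    n_insert = round(ratio * len(groups))
    size = len(groups[0])
    attacked = [list(g) for g in groups]
    for _ in range(n_insert):
        fake = [rng.uniform(low, high) for _ in range(size)]
        attacked.insert(rng.randrange(len(attacked) + 1), fake)
    return attacked


# Original values start at 100 so they are distinguishable from fake ones.
original_groups = [[100.0 + i] * 4 for i in range(10)]
attacked_groups = insertion_attack(original_groups, 0.5, random.Random(2))
```

Because the inserted groups shift the positions of the genuine ones, a scheme that relies on absolute position loses synchronization, whereas a scheme that can locate genuine groups independently is less affected.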
Based on the inherent time-group characteristics of time series databases, this paper proposes a robust watermarking scheme for time series databases, which effectively ensures the consistency of statistical characteristics. Specifically, we devise a three-step scheme RCV, composed of regression, compensation, and verification operations. Based on the characteristics of the linear regression model, the statistical characteristics of the database can be kept constant. The experimental results show that the proposed method outperforms the existing methods in terms of effectiveness, robustness, and fidelity of statistical characteristics.
To maintain the availability of data, we design a database watermarking scheme that preserves statistical features. However, some scenarios demand higher data accuracy and therefore set a higher bar for data availability. In future work, we will first further extend our RCV watermarking scheme. The current scheme keeps the single-row and single-column statistical characteristics unchanged; we will extend it to keep the multirow and multicolumn statistical characteristics unchanged as well. To this end, we will further study the characteristics of the high-dimensional mean vector and covariance matrix in order to design a more versatile database watermarking scheme. We will then study lossless database watermarking technology in order to meet the requirements of scenarios with more stringent data accuracy demands.
Acknowledgements
This work was supported by the Natural Science Foundation of China (62072421, U2336206, 62102386, 62372423, and U20B2047) and Fundamental Research Funds for the Central Universities (WK2100000041).
Conflict of interest
The authors declare that they have no conflict of interest.
1 Ks should be selected from a large key space so that it is computationally infeasible for an attacker to guess the key.
2 The “Circuit Load Dataset” records the electricity price and load data of a certain device over a period of time, and an observation value is recorded every half an hour.
This paper proposes a robust database watermarking scheme for time series databases, which can effectively ensure the consistency of statistical characteristics before and after watermark embedding.
Based on the time-group characteristics of TSDBs, we propose a three-step watermarking method, which is based on linear regression, error compensation, and watermark verification, named RCV.
The effectiveness of our scheme in keeping the statistical characteristics unchanged is verified both theoretically and practically. The experimental results show that our scheme has good robustness against database malicious attacks.
Xu J, Chen H, Yang X, et al. Verifiable image revision from chameleon hashes. Cybersecurity, 2021, 4: 34. DOI: 10.1186/s42400-021-00097-3
[2]
Yuan G, Hao Q. Digital watermarking secure scheme for remote sensing image protection. China Communications, 2020, 17: 88–98. DOI: 10.23919/JCC.2020.04.009
[3]
Sun J, Jiang X, Liu J, et al. An anti-recompression video watermarking algorithm in bitstream domain. Tsinghua Science and Technology, 2020, 26: 154–162. DOI: 10.26599/TST.2019.9010050
[4]
Munir R, Harlili. A secure fragile video watermarking algorithm for content authentication based on Arnold cat map. In: 2019 4th International Conference on Information Technology (InCIT). Bangkok, Thailand: IEEE, 2019 : 32–37.
[5]
Wang F, Zhou H, Fang H, et al. Deep 3D mesh watermarking with self-adaptive robustness. Cybersecurity, 2022, 5: 24. DOI: 10.1186/s42400-022-00125-w
[6]
Hamidi M, Haziti M E, Cherifi H, et al. A robust blind 3-D mesh watermarking based on wavelet transform for copyright protection. In: 2017 International Conference on Advanced Technologies for Signal and Image Processing (ATSIP). Fez, Morocco: IEEE, 2017 : 1–6.
[7]
Hou J U, Kim D G, Lee H K. Blind 3D mesh watermarking for 3D printed model by analyzing layering artifact. IEEE Transactions on Information Forensics and Security, 2017, 12: 2712–2725. DOI: 10.1109/TIFS.2017.2718482
[8]
Duy T P, Tran D, Ma W. An intelligent learning-based watermarking scheme for outsourced biomedical time series data. In: 2017 International Joint Conference on Neural Networks (IJCNN). Anchorage, AK, USA: IEEE, 2017 : 4408–4415.
[9]
Kaur S, Singhal R, Farooq O, et al. Digital watermarking of ECG data for secure wireless commuication. In: 2010 International Conference on Recent Trends in Information, Telecommunication and Computing. Kerala, India: IEEE, 2010 : 140–144.
[10]
Edward Jero S, Ramu P, Swaminathan R. Imperceptibility- Robustness tradeoff studies for ECG steganography using Continuous Ant Colony Optimization. Expert Systems With Applications, 2016, 49: 123–135. DOI: 10.1016/j.eswa.2015.12.010
[11]
Agrawal R, Kiernan J. Watermarking relational databases. In: Proceedings of the 28th international conference on Very Large Data Bases. New York: ACM, 2002 : 155–166.
[12]
Guo F, Wang J, Li D. Fingerprinting relational databases. In: Proceedings of the 2006 ACM symposium on Applied computing. New York: ACM, 2006 : 487–492.
[13]
Guo F, Wang J, Zhang Z, et al. An improved algorithm to watermark numeric relational data. In: Proceedings of the 6th international conference on Information Security Applications. New York: ACM, 2005 : 138–149.
[14]
Franco-Contreras J, Coatrieux G. Robust watermarking of relational databases with ontology-guided distortion control. IEEE Transactions on Information Forensics and Security, 2015, 10: 1939–1952. DOI: 10.1109/TIFS.2015.2439962
[15]
Sion R, Atallah M, Prabhakar S. Rights protection for relational data. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, New York: ACM, 2003: 98–109.
[16]
Shehab M, Bertino E, Ghafoor A. Watermarking relational databases using optimization-based techniques. IEEE Transactions on Knowledge and Data Engineering, 2008, 20: 116–129. DOI: 10.1109/TKDE.2007.190668
[17]
Gross-Amblard D. Query-preserving watermarking of relational databases and XML documents. In: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. New York: ACM, 2003 : 191–201.
[18]
Zhang Y, Yang B, Niu X. Reversible watermarking for relational database authentication. Journal of Computer, 2006 , 17(2): 59–65.
[19]
Gupta G, Pieprzyk J. Reversible and blind database watermarking using difference expansion. International Journal of Digital Crime and Forensics, 2009, 1: 42–54. DOI: 10.4018/jdcf.2009040104
[20]
Jawad K, Khan A. Genetic algorithm and difference expansion based reversible watermarking for relational databases. Journal of Systems and Software, 2013, 86: 2742–2753. DOI: 10.1016/j.jss.2013.06.023
[21]
Hu D, Zhao D, Zheng S. A new robust approach for reversible database watermarking with distortion control. IEEE Transactions on Knowledge and Data Engineering, 2019, 31: 1024–1037. DOI: 10.1109/TKDE.2018.2851517
[22]
Imamoglu M B, Ulutas M, Ulutas G. A new reversible database watermarking approach with firefly optimization algorithm. Mathematical Problems in Engineering, 2017, 2017: 1387375. DOI: 10.1155/2017/1387375
[23]
Farfoura M E, Horng S J, Wang X. A novel blind reversible method for watermarking relational databases. Journal of the Chinese Institute of Engineers, 2013, 36: 87–97. DOI: 10.1080/02533839.2012.726041
[24]
Iftikhar S, Kamran M, Anwar Z. RRW—a robust and reversible watermarking technique for relational data. IEEE Transactions on Knowledge and Data Engineering, 2015, 27: 1132–1145. DOI: 10.1109/TKDE.2014.2349911
[25]
Li Y, Wang J, Jia H. A robust and reversible watermarking algorithm for a relational database based on continuous columns in histogram. Mathematics, 2020, 8: 1994. DOI: 10.3390/math8111994
[26]
Li Y, Wang J, Luo X. A reversible database watermarking method non-redundancy shifting-based histogram gaps. International Journal of Distributed Sensor Networks, 2020, 16: 1550147720921769. DOI: 10.1177/1550147720921769
[27]
Tang X, Cao Z, Dong X, et al. PKMark: A robust zero-distortion blind reversible scheme for watermarking relational databases. In: 2021 IEEE 15th International Conference on Big Data Science and Engineering (BigDataSE). Shenyang, China: IEEE, 2021 : 72–79.
[28]
Ge C, Sun J, Sun Y, et al. Reversible database watermarking based on random forest and genetic algorithm. In: 2020 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC). Chongqing, China: IEEE, 2020 : 239–247.
[29]
Wang W, Liu C, Wang Z, et al. FBIPT: A new robust reversible database watermarking technique based on position tuples. In: 2022 4th International Conference on Data Intelligence and Security (ICDIS). Shenzhen, China: IEEE, 2022 : 67–74.
Figure
1.
A Comparison of watermark extraction Acc of RCV with Signal, DEW, and GAHSW after alteration attack.
Figure
2.
A Comparison of watermark extraction Acc of RCV with Signal, DEW, and GAHSW after deletion attack.
Figure
3.
A Comparison of watermark extraction Acc of RCV with Signal, DEW, and GAHSW after insertion attack.
References
[1] Xu J, Chen H, Yang X, et al. Verifiable image revision from chameleon hashes. Cybersecurity, 2021, 4: 34. DOI: 10.1186/s42400-021-00097-3
[2] Yuan G, Hao Q. Digital watermarking secure scheme for remote sensing image protection. China Communications, 2020, 17: 88–98. DOI: 10.23919/JCC.2020.04.009
[3] Sun J, Jiang X, Liu J, et al. An anti-recompression video watermarking algorithm in bitstream domain. Tsinghua Science and Technology, 2020, 26: 154–162. DOI: 10.26599/TST.2019.9010050
[4] Munir R, Harlili. A secure fragile video watermarking algorithm for content authentication based on Arnold cat map. In: 2019 4th International Conference on Information Technology (InCIT). Bangkok, Thailand: IEEE, 2019: 32–37.
[5] Wang F, Zhou H, Fang H, et al. Deep 3D mesh watermarking with self-adaptive robustness. Cybersecurity, 2022, 5: 24. DOI: 10.1186/s42400-022-00125-w
[6] Hamidi M, Haziti M E, Cherifi H, et al. A robust blind 3-D mesh watermarking based on wavelet transform for copyright protection. In: 2017 International Conference on Advanced Technologies for Signal and Image Processing (ATSIP). Fez, Morocco: IEEE, 2017: 1–6.
[7] Hou J U, Kim D G, Lee H K. Blind 3D mesh watermarking for 3D printed model by analyzing layering artifact. IEEE Transactions on Information Forensics and Security, 2017, 12: 2712–2725. DOI: 10.1109/TIFS.2017.2718482
[8] Duy T P, Tran D, Ma W. An intelligent learning-based watermarking scheme for outsourced biomedical time series data. In: 2017 International Joint Conference on Neural Networks (IJCNN). Anchorage, AK, USA: IEEE, 2017: 4408–4415.
[9] Kaur S, Singhal R, Farooq O, et al. Digital watermarking of ECG data for secure wireless communication. In: 2010 International Conference on Recent Trends in Information, Telecommunication and Computing. Kerala, India: IEEE, 2010: 140–144.
[10] Edward Jero S, Ramu P, Swaminathan R. Imperceptibility-Robustness tradeoff studies for ECG steganography using Continuous Ant Colony Optimization. Expert Systems With Applications, 2016, 49: 123–135. DOI: 10.1016/j.eswa.2015.12.010
[11] Agrawal R, Kiernan J. Watermarking relational databases. In: Proceedings of the 28th international conference on Very Large Data Bases. New York: ACM, 2002: 155–166.
[12] Guo F, Wang J, Li D. Fingerprinting relational databases. In: Proceedings of the 2006 ACM symposium on Applied computing. New York: ACM, 2006: 487–492.
[13] Guo F, Wang J, Zhang Z, et al. An improved algorithm to watermark numeric relational data. In: Proceedings of the 6th international conference on Information Security Applications. New York: ACM, 2005: 138–149.
[14] Franco-Contreras J, Coatrieux G. Robust watermarking of relational databases with ontology-guided distortion control. IEEE Transactions on Information Forensics and Security, 2015, 10: 1939–1952. DOI: 10.1109/TIFS.2015.2439962
[15] Sion R, Atallah M, Prabhakar S. Rights protection for relational data. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data. New York: ACM, 2003: 98–109.
[16] Shehab M, Bertino E, Ghafoor A. Watermarking relational databases using optimization-based techniques. IEEE Transactions on Knowledge and Data Engineering, 2008, 20: 116–129. DOI: 10.1109/TKDE.2007.190668
[17] Gross-Amblard D. Query-preserving watermarking of relational databases and XML documents. In: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. New York: ACM, 2003: 191–201.
[18] Zhang Y, Yang B, Niu X. Reversible watermarking for relational database authentication. Journal of Computer, 2006, 17(2): 59–65.
[19] Gupta G, Pieprzyk J. Reversible and blind database watermarking using difference expansion. International Journal of Digital Crime and Forensics, 2009, 1: 42–54. DOI: 10.4018/jdcf.2009040104
[20] Jawad K, Khan A. Genetic algorithm and difference expansion based reversible watermarking for relational databases. Journal of Systems and Software, 2013, 86: 2742–2753. DOI: 10.1016/j.jss.2013.06.023
[21] Hu D, Zhao D, Zheng S. A new robust approach for reversible database watermarking with distortion control. IEEE Transactions on Knowledge and Data Engineering, 2019, 31: 1024–1037. DOI: 10.1109/TKDE.2018.2851517
[22] Imamoglu M B, Ulutas M, Ulutas G. A new reversible database watermarking approach with firefly optimization algorithm. Mathematical Problems in Engineering, 2017, 2017: 1387375. DOI: 10.1155/2017/1387375
[23] Farfoura M E, Horng S J, Wang X. A novel blind reversible method for watermarking relational databases. Journal of the Chinese Institute of Engineers, 2013, 36: 87–97. DOI: 10.1080/02533839.2012.726041
[24] Iftikhar S, Kamran M, Anwar Z. RRW—a robust and reversible watermarking technique for relational data. IEEE Transactions on Knowledge and Data Engineering, 2015, 27: 1132–1145. DOI: 10.1109/TKDE.2014.2349911
[25] Li Y, Wang J, Jia H. A robust and reversible watermarking algorithm for a relational database based on continuous columns in histogram. Mathematics, 2020, 8: 1994. DOI: 10.3390/math8111994
[26] Li Y, Wang J, Luo X. A reversible database watermarking method non-redundancy shifting-based histogram gaps. International Journal of Distributed Sensor Networks, 2020, 16: 1550147720921769. DOI: 10.1177/1550147720921769
[27] Tang X, Cao Z, Dong X, et al. PKMark: A robust zero-distortion blind reversible scheme for watermarking relational databases. In: 2021 IEEE 15th International Conference on Big Data Science and Engineering (BigDataSE). Shenyang, China: IEEE, 2021: 72–79.
[28] Ge C, Sun J, Sun Y, et al. Reversible database watermarking based on random forest and genetic algorithm. In: 2020 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC). Chongqing, China: IEEE, 2020: 239–247.
[29] Wang W, Liu C, Wang Z, et al. FBIPT: A new robust reversible database watermarking technique based on position tuples. In: 2022 4th International Conference on Data Intelligence and Security (ICDIS). Shenzhen, China: IEEE, 2022: 67–74.