ISSN 0253-2778

CN 34-1054/N

2017 Vol. 47, No. 4

Original Paper
Corpus selection for Uyghur-Chinese machine translation based on bilingual sentence coverage
ZHU Shaolin, YANG Yating, MI Chenggang, LI Xiao, WANG Lei
2017, 47(4): 283-289. doi: 10.3969/j.issn.0253-2778.2017.04.001
When selecting corpora, redundancy arises not only at the vocabulary level but also at the sentence level. Existing methods mainly select corpora by vocabulary-level coverage. These methods can effectively reduce the redundancy of words and phrases, but they do not take sentence-level coverage into account. To select a smaller training corpus from a large-scale bilingual corpus while obtaining a translation system as good as or better than one trained on the full data, sentences were selected mainly by sentence coverage, combining an unseen n-grams method with edit distance. The experimental results show that the proposed method uses a smaller training corpus but still achieves almost equivalent performance compared with the original training corpus.
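The abstract does not say how the unseen n-grams criterion and edit distance are combined, so the following is only a minimal sketch under stated assumptions: a sentence pair is kept when its source side contributes at least `min_new` n-grams not yet seen and its length-normalized token edit distance to every already selected sentence is at least `min_dist`. The function names, the thresholds, and the use of the source side only are all hypothetical.

```python
def ngrams(tokens, n_max=2):
    """All n-grams of order 1..n_max as a set of tuples."""
    return {tuple(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)}


def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def select_corpus(pairs, n_max=2, min_new=1, min_dist=0.3):
    """Greedy one-pass selection over (source, target) sentence pairs.

    Assumed rule (not taken from the paper): keep a pair only if it adds
    unseen n-grams AND is not a near-duplicate of a selected sentence.
    """
    seen, selected = set(), []
    for src, tgt in pairs:
        toks = src.split()
        grams = ngrams(toks, n_max)
        if len(grams - seen) < min_new:
            continue  # vocabulary-level redundancy
        near_duplicate = any(
            edit_distance(toks, s.split()) / max(len(toks), len(s.split()), 1) < min_dist
            for s, _ in selected)
        if near_duplicate:
            continue  # sentence-level redundancy
        selected.append((src, tgt))
        seen |= grams
    return selected
```

The pairwise edit-distance check is quadratic in the number of selected sentences; a real system over a large corpus would need some form of candidate pruning, which is omitted here.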
Research on web page automatic categorization based on structural and text information
GU Min, GUO Qing, CAO Ye, ZHU Feng, GU Yanhui, ZHOU Junsheng, QU Weiguang
2017, 47(4): 290-296. doi: 10.3969/j.issn.0253-2778.2017.04.002
Since web pages contain abundant information resources, better extraction and management of that information can be achieved through web page categorization. Considering the complex structure and abundant text information of web pages, a categorization method based on both structure and text was proposed. Joint features and atomic features were combined to classify the web pages. The experimental results show that the proposed method is feasible to some extent and has higher precision and recall than using text information only.
Data reliable assurance model of Internet of Things based on the trust evaluation of perceived source
CHEN Zhenguo, TIAN Liqin, LIN Chuang
2017, 47(4): 297-303. doi: 10.3969/j.issn.0253-2778.2017.04.003
To solve the source reliability problem of Big Data in the Internet of Things, a data reliability assurance model based on the trust evaluation of perceived sources was constructed from evaluation units. Each evaluation unit includes three categories of nodes: the working node, the companion node and the decision node. Working nodes that sense the same kind of index can serve as the basis for calculating each other's trust values. The companion nodes and decision nodes monitor the status of the working nodes: the companion nodes regularly verify the data of the working nodes to determine their status, and the decision node gives the final result when a working node is suspected to be abnormal. Through the data collection and validation of these three kinds of nodes, a method for calculating and adjusting the reliability of nodes was presented and used to obtain the trust value of each working node. Then, according to a given threshold, the trust list was constructed and the non-trusted nodes were removed, so that only data perceived by trusted nodes is transmitted and processed. In addition, to ensure the initial reliability of sensor nodes, an access authentication mechanism was introduced. Theoretical analysis and simulation results show that the model provides reliable node sensing data and flexible expansion, and can effectively improve the reliability of the data source of the Internet of Things.
A cloud storage access scheme with security proxy based on attribute mapping node
LI Huakang, LIU Pan, YANG Yitao, SUN Guozi
2017, 47(4): 304-310. doi: 10.3969/j.issn.0253-2778.2017.04.004
With the development of mobile technology, more and more people use cloud storage to back up their local data. Cloud platforms provide cheap and convenient data storage services, but they also raise serious data security problems, especially because access control over ciphertext data depends entirely on the cloud provider. An advanced CP-ABE scheme based on mapping nodes was presented to prevent illegal access by unauthorized users or partially trusted cloud storage providers. To guarantee the security of cloud data in an open environment, a Key Generation Center and a Security Proxy are introduced to separate the data service from the security service in the access scheme. Experimental results show that the proposed attribute management scheme can separate the secret key from the data service at a low computational cost, showing great potential for applications.
Application of sparse spectral clustering algorithm in high-dimensional data
XU Xueli, ZHAO Xuejing
2017, 47(4): 311-319. doi: 10.3969/j.issn.0253-2778.2017.04.005
A new sparse spectral clustering algorithm, high-dimensional sparse spectral clustering based on partitioning around medoids (HSSPAM), was proposed. It takes advantage of the sparse similarity matrix in computation as well as the superiority of the PAM algorithm over K-means. To reduce or even eliminate the impact of the "curse of dimensionality" on high-dimensional data processing, the high correlation filter (HCF) and principal component analysis (PCA) are also investigated in the algorithm. On real high-dimensional gene data and under different clustering evaluation criteria, the proposed method gives higher precision and more stable clustering results than the algorithms used for comparison in the paper.
Detection and recognition of high-speed railway catenary locator based on Deep Learning
CHEN Dongjie, ZHANG Wensheng, YANG Yang
2017, 47(4): 320-327. doi: 10.3969/j.issn.0253-2778.2017.04.006
High-speed rail monitoring is conducted mainly by adopting image processing and computer vision technology to detect, identify and track catenary components in image sequences taken by a visible-light high-definition camera. In the entire monitoring system, the detection and recognition of the locator constitutes the very basis. With traditional target detection algorithms, it is difficult to design a feature descriptor that is versatile, robust and highly accurate. High-accuracy detection of the locators based on the Faster R-CNN framework has been realized. Meanwhile, the Hough transform is used to detect the skeleton outline of the locator, and the optimal fitting straight line of the locator is extracted by a filtering mechanism, which paves the way for non-contact precision measurement of the slope of the locators.
Bursty topic detection method for microblog based on influence from user behaviors
WAN Yue, SUI Jie
2017, 47(4): 328-335. doi: 10.3969/j.issn.0253-2778.2017.04.007
Social networks, where people can post anything at any time, are becoming more and more popular. Because of the huge user community, social network data grows with each passing day, which makes extracting knowledge from such huge data hard work. Since microblogs have time-related characteristics and social network behavior attributes, a momentum signal enhancement model is put forward to detect bursty microblog topics effectively. An influence factor and a hot energy factor are put forward to improve the momentum model. The influence factor uses the data before the current point but within a given period to calculate the
Research on leakage location of secondary water distribution networks
ZHANG Zhenya, ZHANG Meng, XIE Chenlei, ZHANG Zhaoxiang, FANG Qiansheng
2017, 47(4): 336-341. doi: 10.3969/j.issn.0253-2778.2017.04.008
Solving the leakage problem of secondary water distribution networks often requires a combination of detection instruments and worker experience. However, this kind of method has several disadvantages, such as time consumption, low efficiency and strong subjectivity. A new leakage-location method based on data analysis was proposed. The method gathered data from the networks' pressure monitoring points at a high frequency and then built a data set under a no-leakage condition. The K-means clustering algorithm was used to classify the data set, thus obtaining the pressure data features at different times. By comparing a new nodal pressure vector with the data set, one can find whether there is leakage and where it is. Experimental results show that the method can help locate leakage in secondary water distribution networks. Compared with existing methods, the proposed approach is faster, more objective and of higher practical value.
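The comparison step is not detailed in the abstract, so the sketch below fills it in with assumptions: no-leakage pressure vectors are clustered with a plain K-means, a new vector is flagged as a leak when its Euclidean distance to the nearest centroid exceeds a hypothetical `threshold`, and the leak is attributed to the monitoring point with the largest pressure drop relative to that centroid. The function names, the threshold rule and the deterministic initialization are illustrative only.

```python
import math


def dist(a, b):
    """Euclidean distance between two pressure vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=50):
    """Plain Lloyd's K-means; the first k points seed the centroids
    (an assumption made here only to keep the sketch deterministic)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[idx].append(p)
        updated = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
                   for i, cl in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def check_leak(vector, centroids, threshold):
    """Return (leak_suspected, index_of_monitoring_point_or_None)."""
    nearest = min(centroids, key=lambda c: dist(vector, c))
    if dist(vector, nearest) <= threshold:
        return False, None
    # Assumed localization rule: largest drop below the no-leak pattern.
    drops = [c - v for v, c in zip(vector, nearest)]
    return True, max(range(len(drops)), key=drops.__getitem__)
```

For example, with centroids learned from day-time and night-time baseline readings, a vector whose second component falls well below the night-time pattern would be reported as a suspected leak near monitoring point index 1.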
A new method for variable message sign layout based on the actual guidance effect maximization
GUO Yirong, WANG Xiaoming, WANG Min, ZHANG Hong
2017, 47(4): 342-349. doi: 10.3969/j.issn.0253-2778.2017.04.009
To guide urban traffic flow properly, a new optimization layout method for variable message signs based on maximization of the actual guidance effect was proposed. The repeated induction utility and the wasted induction utility were added to the calculation of the actual utility, which improved how repeated and wasted induction are accounted for in the actual utility. This method also redefined the inductive coverage rate and the inductive repetition rate. Finally, a method for solving the optimum layout function of message signs based on the greedy algorithm was designed. Evidence collected from 36 road segments verifies that the method is simple and efficient, and that it can optimize the allocation of signs by analyzing complex traffic flow situations, increasing the inductive efficiency of the whole system. Hence, it is better suited to the needs of actual traffic flow guidance.
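The greedy solution step can be illustrated with a simplified stand-in for the paper's utility function, which is not given in the abstract. In this sketch each candidate site covers a set of road segments, the gain of a site is the number of newly covered segments minus a hypothetical `repeat_penalty` for segments that are already covered, and sites are added one at a time until the budget is spent or no site improves the total.

```python
def greedy_layout(candidates, budget, repeat_penalty=0.0):
    """Pick up to `budget` sign sites greedily.

    candidates: dict mapping a site name to the set of road segments it guides.
    The gain formula is an assumption standing in for the paper's utility.
    """
    covered, chosen = set(), []
    remaining = dict(candidates)

    def gain(site):
        segs = remaining[site]
        return len(segs - covered) - repeat_penalty * len(segs & covered)

    while remaining and len(chosen) < budget:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=gain)
        if gain(best) <= 0:
            break  # no remaining site adds net guidance
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered
```

Greedy selection of this kind does not guarantee the optimal layout in general; it trades optimality for simplicity and speed, which matches the abstract's emphasis on a simple and efficient method.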
Ensembles of classifier chains for multi-label classification based on Spark
WANG Jin, WANG Hong, XIA Cuiping, OUYANG Weihua, CHEN Qiaosong, DENG Xin
2017, 47(4): 350-357. doi: 10.3969/j.issn.0253-2778.2017.04.010
With the wide application of data mining technology, multi-label learning has become a hot topic in the data mining domain. Although the ensembles of classifier chains (ECC) algorithm is an effective and accurate multi-label learning method, its time and space complexity is so high that it cannot adapt to large-scale multi-label classification tasks. A new algorithm named Spark ensembles of classifier chains (S-ECC) was proposed based on the Spark platform, on which each step of the sequential ECC algorithm was implemented in parallel. The test results in stand-alone and cluster environments show that S-ECC adapts well to large-scale data with a high speedup, and that it is no less capable than the traditional sequential program.
BDAP: A data mining platform based on Spark
BU Yao, WU Bin, CHEN Yufeng, BAI Demeng
2017, 47(4): 358-368. doi: 10.3969/j.issn.0253-2778.2017.04.011
Big data processing systems have become a hot research topic in the field of big data. First, the architecture and functions of the data analysis platform were analyzed, dividing it into the data source layer, data absorption layer, data storage layer, data platform layer, security and monitoring layer, equipment layer and application layer. The platform includes multiple data preprocessing and algorithm modules, and its architecture provides a foundation for big data analysis. The platform offers a comprehensive set of features that can be freely combined. The coupling between modules is low, which is convenient for maintenance and further development. From the user's point of view, parameter adjustment, process construction, monitoring, and the data mining process are all visual, and workflow and scheduling stream technology are available. In terms of performance, the BDAP algorithms work better than Hive and MLlib. Finally, an example illustrates the application scenarios of this data mining platform. After analyzing circuit fault and meteorological data, faults can be predicted and classified. Video mining can also be used to obtain useful information.