ISSN 0253-2778

CN 34-1054/N

2016 Vol. 46, No. 3

Display Method:
Original Paper
A novel voting method and parallel implementation for soft clustering
ZHANG Jingjing, YANG Yan, WANG Hongjun, HAN Xiaotao, DENG Qiang
2016, 46(3): 173-179. doi: 10.3969/j.issn.0253-2778.2016.03.001
As an important tool of Data Mining, clustering ensemble has been widely recognized and studied. This paper proposes a novel voting method for Soft Clustering(VMSC). The ensemble process consists of two steps: calculating the average degree of membership matrix as the input of the second step, and iterative optimization. This method deals well with eliminating the influences of noise and has good stability. The cloud computing platform of Spark handles big data efficiently. The VMSC algorithm was parallelizod to make it suitable for big data on Spark Cloud Computing platform. In the VMSC experiments, 12 UCI datasets were used to test it, and its results were compared with 4 other soft clustering ensemble algorithms: sCSPA, sMCLA, sHGBF and SVCE. The experiments indicate that the VMSC algorithm has a better integration effect. And the parallel experiments show that its parallel implementation manages big data efficiently.
An efficient weighted graph aggregation algorithm
HU Baoli, YOU Jinguo, ZHOU Cuilian, WANG Yang, CUI Hongbo
2016, 46(3): 180-187. doi: 10.3969/j.issn.0253-2778.2016.03.002
Graph aggregation (graph summarization) technique is one of the effective ways to mine and analyze huge graphs. However, in reality, these graphs are not only huge but also carry weighted edges. The current algorithms do not or seldom take the weight into consideration, leading to a great difference between the aggregation graph and the original one. In order to solve this problem and improve the quality and efficiency of graph aggregation, The weighted graph aggregation algorithm was studied, the consistency of grouping area values of the adjacent matrix of the aggregation graphs was introduced to measure the consistency of weights of edges, compression ratio was defined to measure the spatial efficiency of the graph aggregation algorithm, and error rate was used to evaluate the difference between the aggregation graph and the original graph. The compression quality is ensured by controlling error rates and a comparison is made between the proposed algorithm and the existing graph aggregation algorithms. The experiment results show the effectiveness of the graph aggregation algorithm.
BDCode: An erasure code algorithm for big data storage systems
YIN Chao, WANG Jianzong, LV Haitao, CUI Zongmin, CHENG Lianglun, LI Tongfang, LIU Yan
2016, 46(3): 188-199. doi: 10.3969/j.issn.0253-2778.2016.03.003
An optimized algorithm, based on erasure coding technology towards the big data storage system that contains a lot of data, was proposed. By studying existing coding technologies and big data systems, this algorithm, named BDCode (big data code) can not only protect system reliability, but also improve the security and the utilization of storage space. Due to the high reliability and space saving rate of coding technology, coding mechanisms were introduced into big data systems. The storage nodes are divided into many virtual nodes to realize load balancing. By setting different virtual nodes’ storage groups for different codec servers , ensure system availability. And by using the parallel decoding computing of the nodes and the block of data, we can be ensured the recovery efficiency of the system can be proved when data is corrupted. Additionally, different users setting different coding parameters can improve the robustness of big data storage systems. We configured various data block m and calibration block k to improve the utilization rate in the quantitative experiments. The results show that parallel decoding speed can be nearly two times faster than the past serial decoding speed. The encoding efficiency with BDCode coding is, on average, 36.1% higher than using CRS and 58.2% higer than using RS coding. The decoding rate by using BDCode averages 19.3% higher than using CRS and 33.1% higher than using RS.
Research on stream program task scheduling and cache optimization for X86 multi-core processor
TANG Jiufei, LI He, YU Junqing
2016, 46(3): 200-207. doi: 10.3969/j.issn.0253-2778.2016.03.004
Stream programming as a kind of programming paradigm widely used in multi-core systems, in which the parallel scheduling and access latency of main memory has a great influence on the performance of the program. To solve this problem, a task scheduling and cache optimization method was proposed for the stream program by combining the characteristics of X86 multi-core processors. To improve the parallel scheduling granularity and locality of the target program, expanded scheduling in the pre-processing phase was used firstly. Then the compiler exploitsed the data parallelism and task parallelism to keep load balancing and construct the corresponding software pipeline scheduling. According to cache hierarchies of the target system, the interference were reduced among parallel scheduling cores were reduced by eliminating cache false sharing, and the optimized mapping between logic threads and physical cores in line with the inter-thread communication distribution were implemented. The compiler took a COStream dataflow program as input and outputs the optimized program. The common algorithms in media processing technology were chosen as the test programs for the experiment. The experimental result shows that the optimized test programs achieve linear speedup, which indicates the effectiveness of the compiling optimization system.
An outlier sample eliminating algorithm based on joint XY distances for near infrared spectroscopy analysis
YIN Baoquan, SHI Yinxue, SUN Ruizhi
2016, 46(3): 208-214. doi: 10.3969/j.issn.0253-2778.2016.03.005
Outlier samples in near infrared spectroscopy analysis can strongly influence on the performance of the prediction model. To detect and eliminate the outlier samples, a new outlier sample eliminating algorithm base on joint XY distances (ODXY) was presented, and the relation of XY distances of NIR is proposed and proved. In this research, 102 lamb samples were collected and the data of NIR spectroscopy and moisture content was measured and analyzed. Initially, Mahalanobis distances method, Monte-Carlo sampling method and ODXY method to were employed to eliminate the outlier samples and built the PLS prediction model based on the processed samples. Then, the predictive mean square error (RMSEP) and the coefficient of determination (R2) were used to test the performance of the prediction model. Finally, the generalization of the eliminating algorithm was tested by new calibration and validation sets. The experiments show that ODXY method has better performance and better generalization ability than the other methods tested in our experiments.
Real-time multitask load balance algorithm for heterogeneous cloud computing platforms
XU Aiping, WU Di, XU Wuping, CHEN Jun
2016, 46(3): 215-221. doi: 10.3969/j.issn.0253-2778.2016.03.006
A load-balancing algorithm applying in heterogeneous cloud Computing Platform handling real-time multitasks was proposed. Average hardware resource consumption of jobs running on nodes was measured. Balancing server receive load status of each node in the cluster periodically. A load status vector that reflects the quantity of resources required to finish allocated jobs of each node can be estimated according to the latest load status report and other parameters. As a request is submitted to the cluster, Balancing Server calculates the load status estimation vector of each node, and then dispatches it to the node that possesses the minimal load status estimation value. Experiment results proved that this dynamic load balancing algorithm is reasonable and effective.
Research on Boosting theory and its applications
ZHANG Wensheng, YU Tingzhao
2016, 46(3): 222-230. doi: 10.3969/j.issn.0253-2778.2016.03.007
Boosting is one of the most popular ensemble algorithms in machine learning, and it has been widely used in machine learning and pattern recognition. There are mainly two frameworks of Boosting, learnable theory and statistical theory. Boosting was first proposed from the theory of weak learnability which illustrates the theory of boosting a group of weak learners into a strong learner. After a finite number of iterations, the combination of weak learners could be boosted into any accuracy on the training set, and the only requirement for a weak learner is that the accuracy be slightly better than a random guess. From the statistical point of view, Boosting is an additive model, and the equivalence between these two models has already been proved. The theory of weak learnability is reviewed from the PAC perspective, and the challenges Boosting may face are presented, includeing effectiveness for high dimension data and the Margin theory. Then, various Boosting algorithms are discussed from the above two viewpoints and their new applications with Boosting framework. Finally, the future of Boosting is discussed.
An exam robot for sentence completion in high school English tests
CHEN Zhigang, LIU Qingwen, LIN Wei, WANG Yang, CHEN XiaoPing
2016, 46(3): 231-237. doi: 10.3969/j.issn.0253-2778.2016.03.008
Addressed in this paper is the problem of sentence completion in Chinese national college or high school entrance English examinations in which the most appropriate word or phrase from a given set shoud be chosen to complete a sentence. Although a variety of methods have been developed to solve this problem in the literature, these approaches mainly focused on language modeling (LM) and latent semantic analysis (LSA) to the best of our knowledge. An exam robot prototype was built by extending the language modeling and latent semantic analysis methods to verb tense analysis and long distance phrase extraction. Specifically speaking, the syntactic, lexical and semantic features are extracted separately using by means of LM and LSA as well as verb tense analysis and phrase extraction two methods developed by the authors. These features are then fed into a learning to rank model to build the exam robot. The proposed approach outperforms LM and LSA models by 4.0 percentage points, achieving 78% accuracy on the question sets for senior entrance exams and 76% accuracy on the question sets for college entrance exams.
Forecasting Shanghai stock index using FTS model based on SVM-modify
LI Xiaolin, SUN Yue, LIU Yang
2016, 46(3): 238-246. doi: 10.3969/j.issn.0253-2778.2016.03.009
Traditional methods for stock index research are still at the stage of judging by experience or relying on simple data analysis, among which fundamental analysis and trading indicator analysis frequently used. These methods have noticeable disadvantages: Inefficient utilization of existing information or requirement for highly experienced users. A modified fuzzy time series (FTS) model was proposed based on the following three aspects. Firstly, a new method of interval division was developed. Secondly, a new weight formula for fuzzy set was devised. Thirdly, a modified FTS model was built with the application of SVM classification model. Predictions for stock index were made by using the proposed model. Experiment results from Shanghai index data ranging from 1996 to 2003 indicate that compared with other important FTS models; the proposed model provides better performance.
An anomaly detection algorithm for taxis based on trajectory data mining and online real-time monitoring
HAN Boyang, WANG Zhaoyang, JIN Beihong
2016, 46(3): 247-252. doi: 10.3969/j.issn.0253-2778.2016.03.010
Taking the prevention of taxi frauds as a motivating example, an anomalous spatio-temporal trajectory detection method that combines offline mining and online detection was proposed. A city roadmap was partitioned into a grid based on the longitude and latitude, using Pathlet sequences to express taxi trajectories instead of the traditional GPS sequences. Then, K-racial classes’ normal sequences were clustered in the same origin-destination pair from history data sets. The incoming online GPS data was transformed into Pathlet sequences and matched with K-racial classes’ normal sequences. The distance was computed and scored. Distance along with spatial and temporal factors together forms the criterion for determing anomalous taxi trajectories. Finally, based on the real taxi GPS data sets in Beijing area during March, 2011 to May, 2011, experimental results indicate that the proposed method is able to detect online anomalous trajectories efficiently and quickly.
Research on an automatic retrieval method for special topic news based on semantic frame
XIE Xingsheng, ZHOU Bangding, XIONG Yan
2016, 46(3): 253-258. doi: 10.3969/j.issn.0253-2778.2016.03.011
A novel negative news retrieving semantic frame(NNFrame) and its identification were presented. Different from traditional semantic FrameNets, which were defined based on word-sense disambiguation, NNFrames were defined on each subcategory of negative news with a single semantic context. By constructing the NNFrame knowledge base, domain ontology repository, and a collection of annotated example sentences for each NNFrame, a method was described for identifying NNFrame by a task-specific extended conditional log-likelihood model, that takes dependency-syntax structure representations, and the part of speech tags as input. This approach is practical, efficient, and can achieve state-of-the-art results on precision/recall metrics for identification and classification of negative news whose subcategories are pre-defined in the NNFrame knowledge base.