ISSN 0253-2778

CN 34-1054/N

2016 Vol. 46, No. 9

Display Method:
Original Paper
Parallel ISOMAP algorithm based on Spark
SHI Lukui, YUAN Bin, LIU Wenhao
2016, 46(9): 711-718. doi: 10.3969/j.issn.0253-2778.2016.09.001
To achieve quick dimensionality reduction of the nonlinear high dimensional in the big data environment, a parallel ISOMAP algorithm based on Spark was proposed. In the method, a parallel algorithm to search for near neighbors was designed and realized to fast build the neighborhood matrix, which was based on exact Euclidean locality sensitive hashing. A parallel eigenvalue solving method was designed to quickly solve the eigenvalues, which executes the power method and the order reduction method in turn. To further improve the performance of the algorithm, the parallel method was optimized to reduce the memory consumption and data transmissions through using Spark’s sparse vector, broadcast mechanism and caching mechanism according to the characteristics of Spark. Experimental results on Swiss roll data and S-curve data demonstrated that the parallel ISOMAP algorithm based on Spark greatly improved the executing efficiency of the method through parallel executing and optimizing the calculating procedure. It could be suitable for reducing the dimension of large scale data sets.
Semantic similarity measurement based on low-dimensional sense vector model
CAI Yuanyuan, LU Wei
2016, 46(9): 719-726. doi: 10.3969/j.issn.0253-2778.2016.09.002
Semantic similarity measurement enables the improvement of information retrieval in terms of accuracy and efficiency, so it has become one of the core components in text processing. To solve the problem of lexical ambiguity like polysemy, a sense vector model based on vector composition was proposed, which integrates knowledge base with corpus by fusing multiple semantic features derived from both of them. This model focuses on the continuous distributed word vectors and the inherent semantic properties in WordNet. Firstly, the continuous word vectors were trained from a textual corpus in advance by the neural network language model in deep learning. Then multiple semantic information and relationship information were extracted from WordNet to augment original vectors and generate sense vectors for words. Hence, the semantic similarity between concepts can be measured by the similarity of sense vectors. The experimental results on benchmark indicate that this measure outperforms state-of-the-art measures based on either WordNet or corpora. Compared with the measures based on original distributed word vectors, the proposed measure has an improvement of Pearson correlation coefficient (7.5%). The outstanding results also show the contribution of multiple feature fusion to measuring the conceptual semantic similarity.
An incremental cost-sensitive support vector machine
QUAN Xin, GU Yuanhua, ZHENG Guansheng, GU Bin
2016, 46(9): 727-735. doi: 10.3969/j.issn.0253-2778.2016.09.003
Cost-sensitive learning is an important field in machine learning, which widely exists in real-world applications, such as cancer diagnosis, credit application, etc. Cost-sensitive support vector machine proposed by Masnadi et al. handles cost-sensitive problems through making the hinge loss function cost-sensitive, which has better generalization accuracy than other traditional cost-sensitive algorithms. In practice data are obtained one batch after another. Conventional batch algorithms would waste a lot of time when appending samples, because they should re-train the model from scratch. To make the cost-sensitive support vector machine more practical in on-line learning problems, an incremental cost-sensitive support vector machine algorithm was proposed, which can directly update the trained model without re-training it from scratch when appending samples. Experiment study on several datasets show that our algorithm is significantly more efficient than batch algorithms of the cost-sensitive support vector machine.
Collaborative filtering recommendation algorithm based on nearest neighbor clustering
WEI Huijuan, DAI Muhong, NING Yongyu
2016, 46(9): 736-742. doi: 10.3969/j.issn.0253-2778.2016.09.004
With the increasing number of users and items in recommender systems, designing a scalable algorithm becomes a big challenge for recommendation systems. However, many recommendation algorithms and the improved algorithms proposed thus far have focused on improving recommendation quality, resulting in shortcomings such as lower recommendation efficiency and running time consumption as the system increases in scale. To address the problem of scalability, a collaborative filtering recommendation algorithm based on nearest neighbor clustering was proposed. Firstly, the k-means algorithm was utilized to place similar scores into the same cluster, which was used to build the user clustering model. Then, it picked out the active users’ nearest neighbor clusters from the clustering model and treats them as a retrieval space. Finally, the nearest neighbors of an active user are found according to the retrieval space, and the recommendation to the active user was given. Experimental results show that the algorithm proposed in this paper not only significantly improves the response speed of the recommendation system online but also maintains a high accuracy.
A fast algorithm of image moments in copy-move forgery detection
LAI Yuecong, HUANG Tianqiang, LIN Jing
2016, 46(9): 743-748. doi: 10.3969/j.issn.0253-2778.2016.09.005
In copy-move forgery detection, feature extraction is an important step. The image moments (e.g., Zernike moments, radial harmonic Fourier moments, Exponential Fourier moments) are common feature vectors. In view of the excessive length of running time of most existing feature extraction algorithms, a fast algorithm of image moments in copy-move forgery detection was proposed. Integral expression should be discretized and turn integration turned into sum. In the final discretization expression, we can divide it into two parts. One is the fixed section, the other is the grey level of the image. In the classical computing algorithm of image moments, the two parts should be calanlated Num times to get their product if Num image moments are to be obtoined. A fast computing method was put forward to calculate the two parts independently. The fixed parts are calculated just once and the grey level Num times before the product was obtained. This reduced the running time because of the decline of the computing times. If the resolution ratios of the images are invariant and the numbers of the image moments sufficiently large, the fast algorithm of image moments can greatly reduce the running time compared with the classical method.
Dimensionality reduction method of graph kernel based on KLDA
YU Yajun, PAN Zhisong, HU Guyu, MO Xiaoyong, XUE Jiao
2016, 46(9): 749-756. doi: 10.3969/j.issn.0253-2778.2016.09.006
Graph structure has strong expression ability and high flexibility. The identification and classification of graph structure data fall into the category of structural pattern recognition. The research idea of the graph structure data is to transform the graph structure data to the vector in the vector space, then the traditional machine learning algorithm is used to analyze the vector. Data representation and analysis based on graph structure has become a hot research topic in the field of machine learning. The classical graph kernel method was extended. The kernel linear discriminant analysis (KLDA) was employed to reduce the dimension of the high dimension feature space, and the low dimensional feature space corresponding to the original graph structure data was obtained. Then the traditional machine learning algorithm was used to analyze these new data. The effectiveness of the proposed method is verified by the experimental results on standard data sets.
Core-points based spectral clustering for big data analysis
YANG Yi, MA Runing
2016, 46(9): 757-763. doi: 10.3969/j.issn.0253-2778.2016.09.007
With regard to failures in applying spectral clustering to big data due to its computation complexity, a new spectral clustering algorithm for big data was proposed. Firstly, core-points based on random sampling and data similarity were selected, with which, the big data were grouped. Secondly, spectral clustering was applied to the core-points. Finally, the clustering of whole data was completed by combining the clustering result of the core-points and the grouped big data information. The algorithm both promotes the spectral clustering to big data and reduces the influence of noise or abnormal data by the core-points. A large number of experiments fully verify the effectiveness of the method proposed in this paper.
Belief rule based inference methodology for classification based on differential evolution algorithm
LIU Wanling, WANG Hanjie, FU Yanggeng, YANG Longhao, Wu Yingjie
2016, 46(9): 764-773. doi: 10.3969/j.issn.0253-2778.2016.09.008
A new method, based on belief rule base(BRB) was proposed for constructing a classification system with high performance for classification problems. The existing belief rule base classification system (BRBCS) is flawed because its classification accuracy is limited to the partition number, the parameter training method needs the number of rules given in advance, and the reasoning process does not reflect the correlation between characteristics and results. The belief rule base inference method was thus proposed for classification based on the differential evolutionary algorithm (DEBRM) to solve the classification problems. The proposed method consists of two procedures: belief rule base classification system (BRBCS) and parameter training method. The new method first introduced the construction strategy of BRBCS to determine the number of rules. Then, belief reasoning method was adopted as the inference engine. Finally, the training model for classification which is combined with differential evolutionary algorithm was built. In the experiment analysis, the effectiveness of the method was validated by comparing it with the existing parameter training method, and the rationality of parameters training in comparison of other belief rule base methods for different number of partitions. The classification results show that the proposed method is effective and reasonable.
Comparative study of data-driven intelligent flood forecasting methods for small- and medium-sized rivers
MA Kaikai, LI Shijin, WANG Jimin, YU Yufeng
2016, 46(9): 774-779. doi: 10.3969/j.issn.0253-2778.2016.09.009
In recent years, data driven flood forecasting methods have been widely used in flood forecasting, and good results have been achieved. But most data-driven models are applied to large basins, seldom in small basins. Flash floods in small- and medium-sized rivers, which are mostly located in data-poor mountainous areas in China, are featured by abruptness, rapid concentration and short forecasting time. The support-vector-machine (SVM) model, the BP neural network model, the RBF neural network model and extreme learning machine (ELM) model respectively are established and the used to forecast flash floods in Changhua basin. The results show that the SVM model and RBF network model have accurate prediction in the low flow section with simple parameters while BP network has better performance in the high flow section with less stable forecast results for the low flow section, and that the ELM model is not stable with large deviations. As a result, the SVM model or RBF model was adopted for the low flow section, and BP network for the high flow section. This final combination model shows better performance in experiments.
A TSP algorithm based on clustering ensemble ACO and restricted solution space
PANG Yongming, ZHONG Caiming, CHENG Kai
2016, 46(9): 780-787. doi: 10.3969/j.issn.0253-2778.2016.09.010
Ant colony algorithm (ACO) is a metaheuristic search method, which can solve efficiently NP-complete problems such as the famous traveling salesman problem (TSP). To alleviate, to some extent, ACO’s drawback of getting stuck in a local optimum, a TSP algorithm based on clustering ensemble ACO and restricted solution space is proposed in this paper. The main idea is as follows: Firstly, a triangle algorithm is presented to generate initial TSPs, which are used to construct the primitive transfer probability matrix so that the randomness of trip of ants is reduced; Secondly, to avoid premature convergence, the k-means clustering method is employed repeatedly to produce co-association matrix (CM), which can be viewed as a perturbation factor and improve the selection of the next city for each ant, namely, city pairs with high values in CM are connected closely, or vice versa; Finally, a restricted solution space, namely, the restricted connections, is reconsidered to improve the solutions produced by ACO. The experiments on 9 benchmarks from TSPLIB demonstrate that the proposed method outperforms some of the state-of-art TSP algorithms.
A bridge crack image detection and classification method based on a climbing robot
CHEN Yao, MEI Tao, WANG Xiaojie, LI Feng, LIU Yanwei
2016, 46(9): 788-796. doi: 10.3969/j.issn.0253-2778.2016.09.011
Traditional bridge crack detection methods are of high cost and high risk. A bridge crack detection and classification method was proposed based on a climbing robot using image analysis with a miniature camera mounted on the robot to collect images. First, the motion blur of acquired images was removed by Wiener filtering method. Second, wavelet transform was used to enhance the fractures of the crack in the image. Third, to complete crack image recognition, the surface morphology analysis is applied to extract crack fragments and then KD-tree was used to connect them. Finally, support vector machine method was used to classify crack images based on a series of basic visual characteristics and geometric features. Comparison of geometrical characteristic classification method and BP neural network classification method, results show that our method is better.