基于子序列全连接和最大团的时间序列模体发现算法
来源:用户上传
作者:
摘 要:针对时间序列模体发现算法计算复杂,并且无法发现多实例模体的问题,提出基于子序列全连接和最大团的时间序列模体发现(TSSJMC)算法。首先,使用快速时间序列子序列全连接算法求得所有子序列之间的距离,生成距离矩阵;然后,设置相似性阈值,将距离矩阵转化为邻接矩阵,构造子序列相似图;最后采用最大团搜索算法从相似图中搜索最大团,最大团的顶点对应的时间序列为包含最多实例的模体。在公开的时间序列数据集上进行实验,选用已有的能够发现多实例模体的Brute Force和Random Projection算法作为对比对象,分别从准确性、效率、可扩展性和鲁棒性对TSSJMC算法进行分析并获得了客观的评判结果。实验结果表明,与Random Projection算法相比,TSSJMC算法在效率、可扩展性和鲁棒性法方面均有明显优势;与Brute Force算法相比,TSSJMC算法发现的模体实例数量虽略低,但其效率和可扩展性都优于Brute Force算法。因此,TSSJMC是质量和效率相平衡的算法。
关键词:时间序列;时间序列子序列;子序列连接;最大团;模体发现
中图分类号: TP311.13
文献标志码:A
Abstract: Existing time series motif discovery algorithms have high computational complexity and cannot find multi-instance motifs. To overcome these defects, a Time Series motif discovery algorithm based on Subsequence full Joins and Maximum Clique (TSSJMC) was proposed. Firstly, the fast time series subsequence full join algorithm was used to obtain the distance between all subsequences and generate the distance matrix. Then, the similarity threshold was set, the distance matrix was transformed into the adjacency matrix, and the sub-sequence similarity graph was constructed. Finally, the maximum clique in the similarity graph was extracted by the maximum clique search algorithm, and the time series corresponding to the vertices of the maximum clique were the motifs containing most instances. In the experiments on public time series datasets, TSSJMC algorithm was compared with Brute Force algorithm and Random Projection algorithm which also could find multi-instance motifs in accuracy, efficiency, scalability and robustness. The experimental results demonstrate that compared with Random Projection algorithm, TSSJMC algorithm has obvious advantages in terms of efficiency, scalability and robustness; compared with Bruce Force algorithm, TSSJMC algorithm finds slightly less motif instances, but its efficiency and scalability are better. Therefore, TSSJMC is an algorithm that balances quality and efficiency.
Key words: time series; time series subsequence; subsequence join; maximum clique; motif discovery
0 引言
時间序列是按时间顺序排列的、具有相等时间间隔的一系列数据的集合[1]。时间序列无处不在,使其在各个行业获得普遍的应用。例如金融领域的证券交易数据[2]、气象领域的气温气压数据[3]、工业领域的用电数据[4]、医学领域的脑电波和心电图数据[5-6]等。在时间序列数据挖掘的诸多问题中,时间序列模式发现是一个基础性问题。时间序列模式发现包括查找事先指定模式和预先未知的模式。查找事先指定模式的问题(即按内容查询)已有诸多解决方法[7-10]。然而,查找预先未知、重复出现的模式即时间序列模体发现(也称为时间序列的序列主题发现)问题则面临更多挑战[11]。模体发现对于时间序列挖掘具有重要意义,可以用于解决时间序列聚类、分类、关联规则发现等问题[12]。
模体发现源于生物信息学,用于寻找脱氧核糖核酸(DeoxyriboNucleic Acid, DNA)序列中具有相似排列和功能的短核苷酸片段。2002年,Lin等[11]首次将“模体”一词引入时间序列,并提出时间序列模体发现的概念,此后出现了许多模体发现方法。第一类常见的时间序列模体发现算法采用近似离散化方法,该类算法先采用字符串聚集近似(Symbolic Aggregate Approximation, SAX)算法将时间序列进行离散并符号化,然后在压缩后的数据中寻找相似片段或者提取规则[13-14]。此类算法虽表现出良好的性能,但是SAX会对数据进行平均处理,可能丢失数据集中具有重要意义的峰、谷等信息。第二类常见的模体发现方法是采用聚类发现时间序列模体,先将时间序列进行分段,然后使用聚类算法进行模体发现[15-16]。Eamonn等[17]指出利用滑动窗口提取时间序列进行聚类挖掘是完全无意义的,因为通过滑动窗口平移提取的每个数据点对于整体贡献为一条直线。第三类时间序列模体发现算法基于概率方法,该类算法主要采用滑动窗口提取子序列,然后将投影算法作用于子序列得到候选模体,再将候选模体作为基准从原始序列中查找多条模体[18-19]。此类算法虽效率高,但涉及参数过多并且计算复杂。第四类算法是子序列连接的时间序列模体发现算法,该类算法通过子序列连接得到所有子序列的最近邻后,构建子序列相似图,再采用最大团或最近邻算法计算相似序列作为模体[20-21]。该类算法在子序列连接和最大团搜索中计算成本很高。 2017年Yeh等[1]使用快速傅里叶变换提高子序列全连接的速度,然后寻找每个子序列的1-最近邻(1-Nearest Neighbor, 1-NN),彼此间相似度最高的两子序列即为模体,该算法无法发现多实例模体。结合文献[1]中算法计算简单、速度快的优势,本文提出基于子序列全连接和最大团的时间序列模体发现(Time Series motif discovery based on Subsequence full Joins and Maximum Cliques, TSSJMC)算法以高效地发现等长的多实例模体。实验结果表明,本文算法能够快速、准确地发现多实例模体,同时对噪声具有较好的鲁棒性。
4 结语
针对时间序列模体发现算法计算复杂,并且无法发现多实例模体的问题,本文提出了一种计算简单并且能够发现多實例模体的基于子序列全连接和最大团的时间序列模体发现(TSSJMC)算法。该算法通过子序列全连接,构建子序列相似图,寻找图中的最大团三个步骤获得时间序列中的多实例模体。基于多个公开数据集的多组实验表明,本文提出的算法能够在较短时间内发现多条模体,并且具有高效性、准确性和更强的可扩展性和鲁棒性。文中算法发现的模体均为等长模体,未来我们将考虑发现不同长度的模体。
参考文献:
[1] YEH C-C M, ZHU Y, ULANOVA L, et al. Matrix Profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets [C]// Proceedings of the 2016 IEEE 16th International Conference on Data Mining. Piscataway, NJ: IEEE, 2016: 1317-1322.
[2] 周博,严洪森.基于小波和多维泰勒网动力学模型的金融时间序列预测[J].系统工程理论与实践,2013,33(10):2654-2662. (ZHOU B, YAN H S. Financial time series forecasting based on wavelet and multi-dimensional Taylor network dynamics model [J]. Systems Engineering ― Theory and Practice, 2013, 33(10): 2654-2662.)
[3] AILLIOT P, BESSAC J, MONBET V, et al. Non-homogeneous hidden Markov-switching models for wind time series [J]. Journal of Statistical Planning and Inference, 2015, 160: 75-88.
[4] 张淑清,师荣艳,李盼,等.基于混沌关联积分的暂态电能质量扰动分类[J].仪器仪表学报,2015,36(1):160-166. (ZHANG S C, SHI R Y, LI P, et al. Transient power quality disturbance classification based on chaos-correlation-integral [J]. Chinese Journal of Scientific Instrument, 2015, 36(1): 160-166.)
[5] ARES J, LARA J A, LIZCANO D, et al. A soft computing framework for classifying time series based on fuzzy sets of events [J]. Information Sciences, 2016, 330: 125-144.
[6] PADMAVATHI S, RAMANUJAM E. Naive bayes classifier for ECG abnormalities using multivariate maximal time series motif [J]. Procedia Computer Science, 2015, 47: 222-228.
[7] GE X, SMYTH P. Deformable Markov model templates for time-series pattern matching [C]// Proceedings of the 2000 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2000: 81-90.
[8] KALPAKIS K, GADA D, PUTTAGUNTA V. Distance measures for effective clustering of ARIMA time-series [C]// Proceedings of the 2001 IEEE International Conference on Data Mining. Washington, DC: IEEE Computer Society, 2001: 273-280.
[9] KEOGH E, CHAKRABARTI K, PAZZANI M, et al. Dimensionality reduction for fast similarity search in large time series databases [J]. Knowledge & Information Systems, 2001, 3(3): 263-286. [10] CHAKRABARTI K, KEOGH E, MEHROTRA S, et al. Locally adaptive dimensionality reduction for indexing large time series databases [J]. ACM Transactions on Database Systems, 2002, 27(2): 188-228.
[11] LIN J, KEOGH E, LONARDI S, et al. Finding motifs in time series [C]// Proceedings of the 2nd Workshop on Temporal Data Mining. New York: ACM, 2002: 53-68.
[12] 馬百鸣.基于DTW度量的时间序列主旨模式提取[D].大连:大连理工大学,2011:1-3. (MA B M. Motif extraction algorithm of time series based on DTW [D]. Dalian: Dalian University of Technology, 2011:1-3.)
[13] PATEL P, KEOGH E, LIN J, et al. Mining motifs in massive time series databases [C]// Proceedings of the 2002 IEEE International Conference on Data Mining. Washington, DC: IEEE Computer Society, 2002: 370-377.
[14] LIN J, LI Y. Finding approximate frequent patterns in streaming medical data [C]// Proceedings of the 2010 IEEE 23rd International Symposium on Computer-Based Medical Systems (CBMS). Washington, DC: IEEE Computer Society, 2010: 13-18.
[15] MURAKAMI K, DOKI S, OKUMA S, et al. A study of extraction method of motion patterns observed frequently from time-series posture data [C]// Proceeding of the 2005 IEEE International Conference on Systems, Man and Cybernetics. Piscataway, NJ: IEEE, 2005: 3610-3615.
[16] FU T-C, CHUNG F-L, LUK R, et al. Preventing meaningless stock time series pattern discovery by changing perceptually important point detection [C]// FSKD 2005: Proceedings of the 2005 Second International Conference on Fuzzy Systems and Knowledge Discovery, LNCS 3613. Berlin: Springer, 2005: 1171-1174.
[17] EAMONN KEOGH J L. Clustering of time series subsequences is meaningless: implications for past and future research [J]. Knowledge & Information Systems, 2003, 8(2):154-177.
[18] CHIU B, KEOGH E, et al. Probabilistic discovery of time series motifs [C]// KDD03: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2003: 493-498.
[19] MINNEN D, ISBELL C, ESSA I, et al. Detecting subdimensional motifs: an efficient algorithm for generalized multivariate pattern discovery [C]// Proceedings of the 2007 Seventh IEEE International Conference on Data Mining. Washington, DC: IEEE Computer Society, 2007: 601-606
[20] LIN Y, MCCOOL M D, GHORBANI A A. Time series motif discovery and anomaly detection based on subseries join [J]. IAENG International Journal of Computer Science, 2010, 37(3): 259-271. [21] SILVA D F, YEH C-C M, ENRIQUE G, et al. SiMPle: assessing music similarity using subsequences joins [C]// Proceedings of the 17th International Society for Music Information Retrieval Conference, New York: ISMIR, 2016:23-29.
[22] GRABOCKA J, SCHILLING N, SCHMIDTTHIEME L. Latent time-series motifs [J]. ACM Transactions on Knowledge Discovery from Data, 2016, 11(1): 1-20.
[23] YANKOV D, KEOGH E, MEDINA J, et al. Detecting timeseries motifs under uniform scaling [C]// KDD07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2007: 844-853.
[24] LI Y, LIN J. Approximate variable-length time series motif discovery using grammar inference [C]// MDMKDD10: Proceedings of the 10th International Workshop on Multimedia Data Mining. New York: ACM, 2010: Article No. 10.
[25] LI Y, LIN J, OATES T. Visualizing variable-length time series motifs [C]// Proceedings of the 12th SIAM International Conference on Data Mining. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM), 2012: 895-906.
[26] DUY T C, ANH D T. A fast method for motif discovery in large time series database under dynamic time warping [C]// Proceeding of the Sixth International Conference on Knowledge and Systems Engineering, AISC 326. Cham: Springer, 2015: 155-167.
[27] MUEEN A, ZHU Y, YEH M, et al. The fastest similarity search algorithm for time series subsequences under euclidean distance [EB/OL]. (2015-08-01) [2017-11-01]. http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html.
[28] MUEEN A, NATH S, LIU J. Fast approximate correlation for massive time-series data [C]// Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2010: 171-182.
[29] SAKURAI Y, PAPADIMITRIOU S, FALOUTSOS C. BRAID: stream mining through group lag correlations [C]// Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2005: 599-610.
[30] RAKTHANMANON T, CAMPANA B, MUEEN A, et al. Searching and mining trillions of time series subsequences under dynamic time warping [C]// Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2012: 262-270.
[31] BELACHEW M T, GILLIS N. Solving the maximum clique problem with symmetric rank-one non-negative matrix approximation [J]. Journal of Optimization Theory and Applications, 2017, 173(1): 279-296.
[32] MUEEN A, KEOGH E. Online discovery and maintenance of time series motif [EB/OL]. (2010-06-25) [2017-11-01]. http://alumni.cs.ucr.edu/~mueen/OnlineMotif/.
转载注明来源:https://www.xzbu.com/8/view-14941827.htm