
Distributed Anomaly Detection and Prevention for Virtual Platforms

dc.contributor.advisor: Yahyapour, Ramin Prof. Dr.
dc.contributor.author: Jehangiri, Ali Imran
dc.title: Distributed Anomaly Detection and Prevention for Virtual Platforms [de]
dc.contributor.referee: Yahyapour, Ramin Prof. Dr.
dc.description.abstracteng: An increasing number of applications are being hosted on cloud-based platforms. Cloud platforms serve as a general computing facility, and the applications hosted on them range from simple multi-tier web applications to complex social networking, eCommerce and Big Data applications. High availability, performance and auto-scaling are key requirements of cloud-based applications. Cloud platforms meet these requirements by provisioning resources dynamically in an on-demand, multi-tenant fashion. A key challenge for cloud service providers is to ensure Quality of Service (QoS), as users and customers require more explicit QoS guarantees for the provisioning of services. Cloud service performance problems can directly lead to extensive financial losses. Thus, control and verification of QoS become a vital concern for any production-level deployment, and it is crucial to address performance as a managed objective. The success of cloud services depends critically on automated problem diagnostics and predictive analytics that enable organizations to manage their performance proactively. Moreover, effective and advanced monitoring is equally important for performance management support in clouds. In this thesis, we explore the key techniques for developing monitoring and performance management systems to achieve robust cloud systems. First, two case studies are presented to motivate the need for a scalable monitoring and analytics framework. The first is a case study on performance issues of a software service hosted on a virtualized platform. The second analyzes cloud services offered by a large IT service provider. A generalization of the case studies forms the basis for the requirement specifications, which are used for the state-of-the-art analysis.
Although some solutions for particular challenges have already been provided, a scalable approach for performance problem diagnosis and prediction is still missing. To address this issue, a distributed, scalable monitoring and analytics framework is presented in the first part of this thesis. We conducted a thorough analysis of the technologies to be used by our framework. The framework makes use of existing monitoring and analytics technologies. However, we develop custom collectors to retrieve data non-intrusively from different layers of the cloud. In addition, we develop the analytics subscriber and publisher components to retrieve service-related events from different APIs and send alerts to the SLA Management component for taking corrective measures. Further, we implemented an Open Cloud Computing Interface (OCCI) monitoring extension using the OCCI Mixin mechanism. To deal with performance problem diagnosis, a novel distributed parallel approach for performance anomaly detection is presented. First, all anomalous metrics are found from a distributed database of time series for a particular window. For comparative analysis, three lightweight statistical anomaly detection techniques are selected. We extend these techniques to work with the MapReduce paradigm and assess and compare the methods in terms of precision, recall, execution time, speedup and scaleup. Next, we correlate the anomalous metrics with the target SLO in order to locate the suspicious metrics. We implemented and evaluated our approach on a production Cloud encompassing the Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) service models. Experimental results confirm that our approach is efficient and effective in capturing the metrics causing performance anomalies. Finally, we present the design and implementation of an online anomaly prediction system for cloud computing infrastructures.
We further present an experimental evaluation of a set of anomaly prediction methods that aim to predict upcoming periods of high utilization or poor performance with enough lead time to enable the appropriate scheduling, scaling, and migration of virtual resources. Using real data sets gathered from the Cloud platforms of a university data center, we compare several approaches ranging from time-series methods (e.g., autoregression (AR)) to statistical classification methods (e.g., a Bayesian classifier). We observe that linear time-series models, especially AR models, are most likely suitable for modeling QoS measures and forecasting their future values. Moreover, linear time-series models can be integrated with Machine Learning (ML) methods to improve proactive QoS
dc.contributor.coReferee: Tchernykh, Andrei Prof. Dr.
dc.contributor.thirdReferee: Damm, Carsten Prof. Dr.
dc.contributor.thirdReferee: Fu, Xiaoming Prof. Dr.
dc.contributor.thirdReferee: Hogrefe, Dieter Prof. Dr.
dc.contributor.thirdReferee: Kurth, Winfried Prof. Dr.
dc.subject.eng: Cloud, Monitoring, Analytics, Performance, Diagnosis, Prediction, root cause, time series, Machine Learning, Big data [de]
dc.affiliation.institute: Fakultät für Mathematik und Informatik [de]
dc.subject.gokfull: Informatik (PPN619939052) [de]
