Most customers today are looking for multi-cloud and container solutions. What matters most to their business is providing a better service and keeping their own customers happy, and the efficiency of the IT Ops team is key to that superior customer experience. In most cases the customer reports an issue and support then fixes it, but support is not aware of problems in a multi-container environment (such as node failures or resource limits being hit) until customers report them. The monitoring and alerting systems on the market today only raise alerts once an issue has occurred; we need smarter solutions that analyze existing systems and predict future anomalies.
The proposed system will:
- Collect data (unstructured) from k8s components across the environments.
- Identify the common patterns that occur in failure cases.
- Create a knowledge base (structured data) of the identified patterns and their related components.
- Use a specific data model for the prediction.
- Use the output of the data model for the predictive analysis.
- Send alerts and reports.
In the proposed architecture, this is grouped into 3 main components:
- Data collection
- Data Prediction
- Alerts & Reports
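The three components can be sketched end to end as follows. This is only an illustration under stated assumptions: the project does not name a prediction model, so a plain least-squares trend line stands in for it here, and the sample readings, the function names (`collect`, `predict_hours_until`, `alert`) and the 24-hour horizon are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (hours since start, usage in percent)


@dataclass
class Alert:
    resource: str
    message: str


def collect() -> List[Sample]:
    """Data collection (stubbed). Hypothetical disk-usage readings;
    a real collector would pull these from the cluster."""
    return [(0.0, 70.0), (1.0, 72.0), (2.0, 74.0), (3.0, 76.0)]


def predict_hours_until(samples: List[Sample], limit: float) -> Optional[float]:
    """Data prediction: fit a least-squares line through the samples and
    extrapolate to the time at which usage reaches `limit`.
    Returns hours from the last sample, or None if no crossing is expected."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None  # usage is flat or falling
    intercept = mean_v - slope * mean_t
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - samples[-1][0])


def alert(resource: str, hours: Optional[float], horizon: float = 24.0) -> Optional[Alert]:
    """Alerts & reports: emit an alert only if the limit is predicted
    to be reached within `horizon` hours (assumed default: 24)."""
    if hours is None or hours > horizon:
        return None
    return Alert(resource, f"{resource} predicted to reach its limit in {hours:.1f} h")


if __name__ == "__main__":
    print(alert("disk", predict_hours_until(collect(), limit=90.0)))
```

The point of the sketch is the separation of stages: any of the three functions could be swapped out (a real log collector, a trained model, a mail or chat notifier) without touching the other two.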
Resources that can be considered for the analysis and prediction:
- Storage devices: capacity, state
- Network devices (LB, firewalls): link status, packet drops
- Compute nodes: CPU, memory, I/O, storage
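One possible way to turn these resources into structured records is sketched below. The class names, field names and units are assumptions made for illustration; the project has not fixed a schema yet.

```python
from dataclasses import dataclass, asdict


@dataclass
class StorageSample:
    device: str
    capacity_used_pct: float  # assumed unit: percent of total capacity
    state: str                # e.g. "healthy", "degraded" (hypothetical values)


@dataclass
class NetworkSample:
    device: str
    kind: str                 # "lb" or "firewall"
    link_up: bool
    packet_drops: int         # drops counted since the previous sample


@dataclass
class ComputeSample:
    node: str
    cpu_pct: float
    memory_pct: float
    io_wait_pct: float
    storage_pct: float


def to_record(sample) -> dict:
    """Flatten any sample into one row tagged with its type, so that
    different resource kinds can be stored in the same knowledge base."""
    return {"type": type(sample).__name__, **asdict(sample)}


if __name__ == "__main__":
    print(to_record(NetworkSample("lb-1", "lb", link_up=False, packet_drops=120)))
```

Keeping one record type per resource family makes it explicit which metrics exist for which device, while `to_record` gives the prediction step a uniform input shape.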
Solution Approach:
- Create data model
- Scan & filter data
- Extract entity
- Annotate data and input to model
- Process output from model
- Notify / recommend / self heal
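The scan, extract and annotate steps might look roughly like this for plain-text logs. The log line layout, the sample lines, the `node-<name>` naming pattern and the keyword-to-label table are all hypothetical; real k8s component logs will need their own patterns.

```python
import re
from typing import Dict, Iterable, Iterator, List

# Assumed line layout: "<timestamp> <LEVEL> <component> <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<component>\S+)\s+(?P<message>.*)$"
)
# Assumed node naming scheme: "node-<name>"
NODE_RE = re.compile(r"\b(?P<node>node-[\w.-]+)")

# Hypothetical keyword -> failure label table
LABELS = {
    "NotReady": "node_failure",
    "OOMKilled": "memory_pressure",
    "DiskPressure": "disk_pressure",
}

SAMPLE_LINES = [
    "2021-03-22T10:00:00Z INFO kubelet Started container web",
    "2021-03-22T10:00:05Z ERROR kubelet node-worker1 status is NotReady",
    "2021-03-22T10:00:09Z WARN kubelet container db OOMKilled on node-worker2",
]


def scan_and_filter(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Parse each line and keep only warnings and errors."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") in {"WARN", "ERROR"}:
            yield match.groupdict()


def extract_entity(record: Dict[str, str]) -> Dict[str, object]:
    """Attach the node named in the message, or None if there is none."""
    match = NODE_RE.search(record["message"])
    return {**record, "node": match.group("node") if match else None}


def annotate(record: Dict[str, object]) -> Dict[str, object]:
    """Label the record by the first known keyword found in its message."""
    message = str(record["message"])
    label = next((lab for kw, lab in LABELS.items() if kw in message), "unknown")
    return {**record, "label": label}


def prepare(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Scan & filter -> extract entity -> annotate, ready as model input."""
    return [annotate(extract_entity(r)) for r in scan_and_filter(lines)]


if __name__ == "__main__":
    for row in prepare(SAMPLE_LINES):
        print(row)
```

The remaining steps (processing the model output and notifying, recommending or self-healing) would consume the labeled rows that `prepare` returns.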
Goal for this Hackweek
Use an existing log collector to collect the data from Rancher k8s clusters and come up with an appropriate data model.
ML, Python, Kubernetes, data model, monitoring tools
This project is part of:
Hack Week 20