Project Description

Most customers today are looking for multi-cloud and container solutions. What matters most to their business is providing a better service and keeping their own customers happy, and the efficiency of the IT Ops team is key to a superior customer experience. In most cases, customers report an issue and support then fixes it, but support is not aware of problems in the multi-container environment (such as node failures or resource limits being reached) until customers report them. Monitoring and alerting systems exist on the market, but they only raise alerts once an issue has occurred. We need smarter solutions that analyze existing systems and predict future anomalies.

The proposed system will:

  1. Collect data (unstructured) from k8s components across the environments.
  2. Identify the common patterns that occur in failure cases.
  3. Create a knowledge base (structured data) of the identified patterns and their related components.
  4. Use a specific data model for prediction.
  5. Use the output of the data model to predict future anomalies.
  6. Send alerts and reports.
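As a rough illustration of steps 1 to 3, the sketch below turns unstructured log lines into structured records and counts how often each failure pattern occurs per component. The log format, the pattern names, and the helpers `extract_records` and `build_knowledge_base` are all assumptions made for this example; real signatures would have to be derived from the logs actually collected.

```python
import re
from collections import Counter

# Hypothetical failure signatures (assumed for illustration only).
FAILURE_PATTERNS = {
    "node_not_ready": re.compile(r"node (?P<component>\S+) .*NotReady"),
    "oom_killed": re.compile(r"pod (?P<component>\S+) .*OOMKilled"),
    "disk_pressure": re.compile(r"node (?P<component>\S+) .*DiskPressure"),
}


def extract_records(lines):
    """Turn raw log lines into structured records, skipping lines that match no pattern."""
    records = []
    for line in lines:
        for name, pattern in FAILURE_PATTERNS.items():
            match = pattern.search(line)
            if match:
                records.append(
                    {"pattern": name, "component": match.group("component"), "raw": line}
                )
                break  # one pattern per line is enough for this sketch
    return records


def build_knowledge_base(records):
    """Count occurrences of each (pattern, component) pair."""
    return Counter((r["pattern"], r["component"]) for r in records)


# Invented sample lines; not output from a real cluster.
SAMPLE_LOGS = [
    "node worker-1 status changed to NotReady",
    "pod web-7f9c terminated: OOMKilled",
    "node worker-1 status changed to NotReady",
    "pod web-7f9c started",
]

if __name__ == "__main__":
    kb = build_knowledge_base(extract_records(SAMPLE_LOGS))
    for (pattern, component), count in kb.most_common():
        print(pattern, component, count)
```

A counter keyed by pattern and component is only a starting point; a real knowledge base would likely also keep timestamps and the relations between components.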

These steps are grouped into 3 main components in the proposed architecture:

  1. Data collection
  2. Data Prediction
  3. Alerts & Reports

Resources that can be considered for the analysis and prediction:

  1. Storage devices: capacity, state
  2. Network devices (LB, firewalls): link status, packet drops
  3. Compute nodes: CPU, memory, I/O, storage
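One possible starting point for the data model is a typed record per resource class listed above. This is only a sketch: the class names, field names, and units (percentages, counters) are assumptions, not a settled schema.

```python
from dataclasses import dataclass, asdict
from enum import Enum


class ResourceKind(Enum):
    STORAGE = "storage"
    NETWORK = "network"
    COMPUTE = "compute"


@dataclass
class StorageSample:
    device: str
    capacity_used_pct: float  # assumed unit: percent of total capacity
    state: str                # e.g. "healthy" or "degraded" (assumed values)


@dataclass
class NetworkSample:
    device: str               # load balancer or firewall name
    link_up: bool
    packet_drops: int         # drops since the previous sample (assumed)


@dataclass
class ComputeSample:
    node: str
    cpu_pct: float
    memory_pct: float
    io_wait_pct: float
    storage_used_pct: float


def to_row(kind, sample):
    """Flatten a sample into a plain dict tagged with its resource kind."""
    return {"kind": kind.value, **asdict(sample)}


if __name__ == "__main__":
    sample = ComputeSample("worker-1", 72.5, 64.0, 3.2, 81.0)
    print(to_row(ResourceKind.COMPUTE, sample))
```

Flat rows like these are easy to store and to feed into a model later, which is why the sketch flattens each sample instead of nesting it.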

Solution Approach:

  1. Create data model
  2. Scan & filter data
  3. Extract entity
  4. Annotate data and input to model
  5. Process output from model
  6. Notify / recommend / self-heal
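To make the last steps of the approach concrete (process the model output, then notify or recommend), here is a minimal sketch that uses a plain least-squares trend line as a stand-in for the prediction model. The project does not specify a model, so the linear fit, the 90% threshold, the 24-hour horizon, and the function names are all assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("need samples taken at two or more distinct times")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def hours_until_threshold(samples, threshold=90.0):
    """Estimate hours after the last sample until usage crosses the threshold.

    samples is a list of (hour, used_pct) pairs. Returns None when the
    trend is flat or falling, so no crossing is predicted.
    """
    xs = [hour for hour, _ in samples]
    ys = [used for _, used in samples]
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - xs[-1])


def recommend(node, samples, threshold=90.0, horizon_hours=24.0):
    """Return an alert message if the threshold is predicted within the horizon."""
    eta = hours_until_threshold(samples, threshold)
    if eta is not None and eta <= horizon_hours:
        return f"ALERT {node}: usage predicted to reach {threshold:.0f}% in about {eta:.1f}h"
    return None


if __name__ == "__main__":
    # Invented samples: usage grows by 5 points per hour.
    print(recommend("worker-1", [(0, 50.0), (1, 55.0), (2, 60.0), (3, 65.0)]))
```

A linear trend only captures steady growth; it is used here because it is easy to verify by hand, not because it is the intended model.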

Goal for this Hackweek

Use an existing log collector to collect data from Rancher k8s clusters and come up with an appropriate data model.


