---
title: Optimized Llm Log Classification
emoji: 😻
colorFrom: green
colorTo: pink
sdk: streamlit
sdk_version: 1.42.2
app_file: app.py
pinned: false
---

# Optimized Log Classification Using LLMs

A framework for hybrid log classification that integrates multiple analytical techniques to process and categorize log data. The system combines these methods to handle simple, complex, and sparsely labeled log patterns.

---

## Overview

This project combines three primary classification strategies:

- **Regex-based Classification**
  Captures predictable patterns using predefined regular expressions.

- **Embedding-based Classification**
  Uses Sentence Transformers to generate embeddings, followed by a Logistic Regression classifier for nuanced pattern recognition.

- **LLM-assisted Classification**
  Employs large language models to classify data when traditional methods struggle due to limited labeled samples.

![System Architecture](resources/arch.png)
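
In code, the routing between these three strategies looks roughly like the sketch below. This is illustrative only: the embedding model name, classifier path, regex rules, labels, and confidence threshold are assumptions, not the exact contents of the repo's `processor_*.py` modules.

```python
import re

import joblib
from sentence_transformers import SentenceTransformer

# Illustrative assets; the actual artifacts live under models/ and may be named differently.
embedder = SentenceTransformer("all-MiniLM-L6-v2")           # assumed embedding model
clf = joblib.load("models/log_classifier.joblib")            # assumed pre-trained Logistic Regression

# Example regex rules for fixed-format messages (patterns and labels are illustrative).
REGEX_RULES = {
    r"User \w+ logged (in|out)": "User Action",
    r"Backup (started|completed) at": "System Notification",
}


def llm_classify(log_message: str) -> str:
    """Placeholder for the LLM-assisted path (handled by processor_llm.py in this repo):
    a real implementation would prompt an LLM and parse the label from its response."""
    return "Unclassified"


def classify_log(source: str, log_message: str) -> str:
    # `source` mirrors the CSV column; a real router might also use it to pick a strategy.

    # 1) Cheap regex pass for trivially predictable messages.
    for pattern, label in REGEX_RULES.items():
        if re.search(pattern, log_message, flags=re.IGNORECASE):
            return label

    # 2) Embedding + Logistic Regression for patterns with enough labeled data.
    probs = clf.predict_proba(embedder.encode([log_message]))[0]
    if probs.max() >= 0.5:                                   # confidence threshold is an assumption
        return clf.classes_[probs.argmax()]

    # 3) Fall back to the LLM for sparse or unusual log patterns.
    return llm_classify(log_message)
```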

---

## Directory Structure

- **`training/`**
  Contains notebooks and scripts for training the models and experimenting with different approaches (a rough training sketch follows this list).

- **`models/`**
  Stores pre-trained models such as the logistic regression classifier and embedding models.

- **`resources/`**
  Holds auxiliary files like CSV datasets, output samples, and images.

- **Root Directory**
  Includes the main API server (`server.py`) and the command-line classification utility (`classify.py`).
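
As a rough illustration of how the artifacts in `models/` might be produced by the `training/` notebooks, a minimal embedding-plus-Logistic-Regression training loop could look like this. The dataset file name, label column, embedding model, and output path are assumptions.

```python
import joblib
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed labeled dataset under resources/; column names follow the format used elsewhere in this README.
df = pd.read_csv("resources/labeled_logs.csv")                # hypothetical file name
texts, labels = df["log_message"], df["target_label"]

# Encode log messages into dense vectors with a Sentence Transformer.
embedder = SentenceTransformer("all-MiniLM-L6-v2")            # assumed model choice
X = embedder.encode(texts.tolist())

# Fit a Logistic Regression classifier on the embeddings and check held-out accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.3f}")

# Persist the classifier so server.py / classify.py can load it from models/.
joblib.dump(clf, "models/log_classifier.joblib")
```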

---

## Installation & Setup

1. **Clone the Repository**
   ```bash
   git clone <your_repository_url>
   ```

2. **Install Dependencies**
   Ensure Python is installed and run:
   ```bash
   pip install -r requirements.txt
   ```

3. **Train the Model (if needed)**
   Open and run the training notebook:
   ```bash
   jupyter notebook training/log_classification.ipynb
   ```

4. **Run the API Server**
   Start the server using one of the following methods:
   - Direct execution:
     ```bash
     python server.py
     ```
   - With Uvicorn:
     ```bash
     uvicorn server:app --reload
     ```
   Once the server is running, access the API and its documentation at:
   - Main Endpoint: [http://127.0.0.1:8000/](http://127.0.0.1:8000/)
   - Swagger UI: [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
   - Redoc: [http://127.0.0.1:8000/redoc](http://127.0.0.1:8000/redoc)

5. **Run the Streamlit App**
   Start the Streamlit interface for log classification:
   ```bash
   streamlit run app.py
   ```
   This command will launch the app in your browser at a URL like http://localhost:8501.

---

## Usage Instructions

- **Input Data**
  Upload a CSV file with the following columns:
  - `source`
  - `log_message`

- **Output**
  The system processes the logs and returns a CSV file with an additional `target_label` column indicating the classification result.
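
For example, with the FastAPI server running, a CSV could be classified programmatically along these lines. The `/classify/` route and the input file name are assumptions; check `server.py` for the actual endpoint.

```python
import requests

# Assumed endpoint; adjust to match the route defined in server.py.
url = "http://127.0.0.1:8000/classify/"

# CSV with `source` and `log_message` columns (file name is hypothetical).
with open("resources/test.csv", "rb") as f:
    response = requests.post(url, files={"file": ("test.csv", f, "text/csv")})

response.raise_for_status()

# Save the returned CSV, which includes the added `target_label` column.
with open("output.csv", "wb") as f:
    f.write(response.content)
```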

---

## Customization

Feel free to modify and extend the classification logic in the following modules:
- `processor_bert.py`
- `processor_llm.py`
- `processor_regex.py`

These modules are designed to be flexible, allowing you to tailor the classification approaches to your specific needs.
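
For instance, extending the regex stage can be as simple as adding a pattern-to-label mapping in the spirit of `processor_regex.py`. The dictionary, patterns, labels, and function name below are illustrative, not the module's actual API.

```python
import re
from typing import Optional

# Map regular expressions to labels; add new entries to cover more fixed-format messages.
CUSTOM_PATTERNS = {
    r"disk usage (above|exceeds) \d+%": "Resource Warning",
    r"nightly report generated successfully": "System Notification",
}


def classify_with_regex(log_message: str) -> Optional[str]:
    """Return a label if any pattern matches, otherwise None so the
    embedding/LLM stages can take over."""
    for pattern, label in CUSTOM_PATTERNS.items():
        if re.search(pattern, log_message, flags=re.IGNORECASE):
            return label
    return None
```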

---

## Contributions
Contributions, feedback, and feature requests are welcome.
Please open an issue or submit a pull request on the project's GitHub repository.