SeoyeonPark1223 committed
Commit 7974173 · verified · 1 Parent(s): fef64b3

Upload 6 files

README.md CHANGED
---
language:
- ko
tags:
- cnn
- audio
- mfcc
- homecam
- senior
---

# Model Card for `SilverAudio`

## Model Details

- The audio model is a critical component of the `SilverAssistant` system, designed to detect potentially dangerous situations such as falls, cries for help, or signs of violence through audio analysis. Its primary purpose is to enhance user safety while respecting privacy.
- By using Mel-Frequency Cepstral Coefficients (MFCC) for feature extraction and a Convolutional Neural Network (CNN) for classification, the model detects these events from sound alone, so no constant video monitoring is required and user privacy is preserved.
![Audio Model](./pics/audio-overview.png)

### Model Description

- **Developed during:** NIPA-Google program (2024.10.23–2024.11.08), Kosa Hackathon (2024.12.09)
- **Model type:** Audio CNN model
- **API used:** Keras
- **Dataset:** [HuggingFace SilverDataset](https://huggingface.co/datasets/SilverAvocado/SilverDataset)
- **Code:** [GitHub Silver Model Code](https://github.com/silverAvocado/silver-model-code)
- **Language(s) (NLP):** Korean

## Training Details
### Dataset Preparation

- **HuggingFace:** [HuggingFace SilverDataset](https://huggingface.co/datasets/SilverAvocado/SilverDataset)
- **Description:**
  - The dataset used for this audio model consists of `.npy` files containing MFCC features extracted from raw audio recordings. These recordings cover various real-world scenarios, such as:
    - Criminal activities
    - Violence
    - Falls
    - Cries for help
    - Normal indoor sounds

  - Feature Extraction Process (a minimal Python sketch follows this list):
    1. Audio Collection:
       - Audio samples were sourced from public datasets, such as AI Hub, to ensure coverage of diverse scenarios.
       - [AI Hub Emergency Voice/Sound dataset (위급상황 음성/음향)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=&topMenu=&aihubDataSe=data&dataSetSn=170)
       - These include both emergency and non-emergency sounds so the model learns to classify the two accurately.
    2. MFCC Extraction:
       - The raw audio signals were processed to extract Mel-Frequency Cepstral Coefficients (MFCC).
       - MFCC features capture the frequency characteristics of the audio, making them well suited to sound classification tasks.
       ![MFCC Output](./pics/mfcc-output.png)
    3. Output Format:
       - The extracted MFCC features are saved as `13 x n` numpy arrays, where:
         - 13: the number of MFCC coefficients (features).
         - n: the number of frames in the audio segment.
    4. Saved Dataset:
       - The processed `13 x n` MFCC arrays are stored as `.npy` files, which serve as the direct input to the model.
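
To make the preprocessing concrete, here is a minimal sketch of one way to turn a raw recording into the `13 x n` MFCC arrays described above, assuming `librosa` for loading and feature extraction. The 16,000 Hz sample rate and 5-second segment length follow the preprocessing summary later in this card; the padding strategy, hop length, and file names are placeholder assumptions, and the actual pipeline in the linked repository may differ.

```python
import numpy as np
import librosa

SAMPLE_RATE = 16_000   # sample rate used throughout preprocessing (per this card)
SEGMENT_SECONDS = 5    # audio is split into uniform 5-second segments
N_MFCC = 13            # number of MFCC coefficients per frame

def extract_mfcc_segments(wav_path: str) -> list[np.ndarray]:
    """Load an audio file, split it into 5 s segments, and return one 13 x n MFCC array per segment."""
    signal, _ = librosa.load(wav_path, sr=SAMPLE_RATE, mono=True)
    samples_per_segment = SAMPLE_RATE * SEGMENT_SECONDS

    # Zero-pad the tail so every segment has the same length (padding strategy is an assumption).
    remainder = len(signal) % samples_per_segment
    if remainder:
        signal = np.pad(signal, (0, samples_per_segment - remainder))

    segments = []
    for start in range(0, len(signal), samples_per_segment):
        chunk = signal[start:start + samples_per_segment]
        mfcc = librosa.feature.mfcc(y=chunk, sr=SAMPLE_RATE, n_mfcc=N_MFCC)  # shape: (13, n_frames)
        segments.append(mfcc)
    return segments

# Save each segment as its own .npy file, the format consumed by the model.
for i, mfcc in enumerate(extract_mfcc_segments("example_help_call.wav")):
    np.save(f"example_help_call_{i:03d}.npy", mfcc)
```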

### Model Architecture and Performance

- **Model Structure:** (a minimal Keras sketch of this architecture appears after the performance summary below)
  ![Audio Model Structure](./pics/audio-model-structure.png)
  - Input Layer
    - Shape: (128, 128, 3)
    - Represents the input MFCC features reshaped into a 128x128 image-like format with 3 channels (RGB).
    - This transformation lets the CNN process the MFCC features as spatial data.
  - Convolutional Layers (Conv2D)
    1. First Conv2D Layer:
       - Extracts low-level features from the MFCC input (e.g., frequency characteristics).
       - Applies filters (kernels) to create feature maps by learning patterns in the audio data.
    2. Second Conv2D Layer:
       - Further processes the feature maps, learning more complex and hierarchical audio features.
  - MaxPooling2D Layers
    - Located after each Conv2D layer.
    - Reduce the spatial dimensions of the feature maps, lowering computational cost while retaining the dominant features.
    - Help prevent overfitting by discarding noise and irrelevant detail.
  - Flatten Layer
    - Converts the 2D feature maps into a 1D vector.
    - Prepares the data for the dense (fully connected) layers, which act as the classifier.
  - Fully Connected Layers (Dense)
    1. First Dense Layer:
       - Contains 128 neurons.
       - Processes the flattened features to learn complex relationships in the data.
    2. Dropout Layer:
       - Randomly sets a fraction of input neurons to zero during training.
       - Acts as regularization, preventing overfitting by ensuring the model does not rely on specific neurons.
    3. Output Dense Layer:
       - Neurons: 2 (binary classification)
         - **Class 0: Normal**
         - **Class 1: Danger**
       - Activation: Softmax (outputs a probability for each class).
- **Model Performance:**
  ![Model Performance](./pics/audio-performance.png)
  1. Accuracy and Preprocessing (Table Summary)
     - The CNN model achieves its highest accuracy, 98.37%, in the 8th configuration.
     - Key factors contributing to this performance:
       - Input size: 128 × 128 × 3, leveraging image-like MFCC features.
       - Padding/splitting: audio is preprocessed into uniform 5-second segments.
       - No noise addition: training without added noise leads to better feature retention.
       - Library: Keras is used for the architecture and training.
       - Sample rate: a consistent 16,000 Hz is maintained for all preprocessing steps.
     ![Confusion Matrix](./pics/confusion-matrix.png)
  2. Confusion Matrix Analysis
     - High precision: minimal false positives, so the model is very specific when flagging emergencies.
     - High recall: minimal false negatives, so most emergencies are correctly identified.
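
The sketch below assembles the layers described above into a Keras `Sequential` model. The layer order, the (128, 128, 3) input, the 128-unit dense layer, and the 2-class softmax output come from this card; filter counts, kernel sizes, the dropout rate, and the optimizer/loss choice are not stated here and are placeholder assumptions, so see the linked repository for the actual configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Minimal reconstruction of the architecture described above. Filter counts, kernel
# sizes, dropout rate, and compile settings are assumed values, not the trained model's.
model = keras.Sequential([
    layers.Input(shape=(128, 128, 3)),             # MFCC features reshaped into an image-like tensor
    layers.Conv2D(32, (3, 3), activation="relu"),  # first Conv2D: low-level frequency patterns
    layers.MaxPooling2D((2, 2)),                   # downsample feature maps, keep dominant features
    layers.Conv2D(64, (3, 3), activation="relu"),  # second Conv2D: more complex, hierarchical features
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # 2D feature maps -> 1D vector
    layers.Dense(128, activation="relu"),          # fully connected layer with 128 neurons
    layers.Dropout(0.5),                           # regularization against overfitting
    layers.Dense(2, activation="softmax"),         # class 0 = Normal, class 1 = Danger
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```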

## Model Usage
- `Silver Assistant` Project
  - Within the `SilverAssistant` system, the audio model monitors for potentially dangerous situations such as falls, cries for help, or signs of violence, enhancing user safety while respecting privacy (a minimal inference sketch follows this list).
  - [GitHub SilverAvocado](https://github.com/silverAvocado)
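
For illustration, the sketch below shows how a saved MFCC `.npy` file might be fed to the trained model at inference time. This card does not specify how the `13 x n` MFCC matrix is mapped onto the `(128, 128, 3)` input, so the resize-and-stack step here is an assumption, and the model path and file names are placeholders.

```python
import numpy as np
import tensorflow as tf

CLASS_NAMES = ["Normal", "Danger"]  # class 0 / class 1, as defined in this card

# Placeholder path; substitute the real artifact from the SilverAvocado repository.
model = tf.keras.models.load_model("silver_audio_cnn.keras")

# Load one preprocessed 13 x n MFCC array.
mfcc = np.load("example_segment.npy").astype("float32")

# Assumed adaptation step: resize the MFCC matrix to 128 x 128 and repeat it across
# 3 channels to match the (128, 128, 3) input; the original pipeline may differ.
x = tf.image.resize(mfcc[..., np.newaxis], (128, 128))  # (128, 128, 1)
x = tf.repeat(x, repeats=3, axis=-1)                    # (128, 128, 3)
x = tf.expand_dims(x, axis=0)                           # add batch dimension -> (1, 128, 128, 3)

probs = model.predict(x)[0]  # softmax probabilities for [Normal, Danger]
print(CLASS_NAMES[int(np.argmax(probs))], probs)
```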

## Conclusion
- The `SilverAudio` model detects emergency audio scenarios with high accuracy and reliability, reaching a peak accuracy of 98.37% in the 8th configuration. By leveraging MFCC features and a CNN-based architecture, it classifies audio inputs into the predefined categories (normal vs. danger). Because it operates on preprocessed `.npy` files, inference is efficient enough for real-time applications.
- The model is a vital component of the `SilverAssistant` project, enabling the system to detect critical situations and support timely interventions. Training on diverse datasets such as AI Hub gives it real-world applicability, making it a useful tool for enhancing the safety and well-being of vulnerable individuals, particularly the elderly.
pics/audio-model-structure.png ADDED
pics/audio-overview.png ADDED
pics/audio-performance.png ADDED
pics/confusion-matrix.png ADDED
pics/mfcc-output.png ADDED