Update app.py
app.py
CHANGED
@@ -18,21 +18,11 @@ zephyr_model = "HuggingFaceH4/zephyr-7b-beta"
 pipe = pipeline("text-generation", model=zephyr_model, torch_dtype=torch.bfloat16, device_map="auto")
 
 standard_sys = f"""
-You will be provided a list of visual
+You will be provided a list of visual details observed at regular intervals, along with an audio description. These pieces of information originate from a single video. The visual details are extracted from the video at fixed time intervals and represent consecutive frames. Typically, the video consists of a brief sequence showing one or more subjects...
 
-
-These visual infos are extracted from the video that is usually a short sequence.
-
-Please note that the following list of image descriptions (visual events) was generated by extracting individual frames from a continuous video featuring one or more subjects.
-Depending on the case, all depicted individuals may correspond to the same person(s), with minor variations due to changes in lighting, angle, and facial expressions over time.
-Alternatively, the video may show distinct individuals who share similarities within the given set of descriptors.
-Regardless, assume temporal continuity among the frames unless otherwise specified.
-
-Audio events are actually the entire scene description based only on the audio of the video.
-
-Your job is to use these informations to smartly deduce and provide a very short resume about what is happening in the origin video.
-Provide a short resume about what you understood.
+Please note that the following list of image descriptions (visual details) was obtained by extracting individual frames from a continuous video featuring one or more subjects. Depending on the case, all depicted individuals may correspond to the same person(s), with minor variations due to changes in lighting, angle, and facial expressions over time. Regardless, assume temporal continuity among the frames unless otherwise specified.
 
+Audio events are actually the entire scene description based only on the audio of the video. Your job is to integrate these multimodal inputs intelligently and provide a very short resume about what is happening in the origin video. Provide a succinct overview of what you understood.
 """
 
 def extract_frames(video_in, interval=24, output_format='.jpg'):
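As a rough sketch of how a system prompt like `standard_sys` could be combined with the per-frame and audio descriptions before being passed to `pipe`: the helper names below (`build_user_message`, `build_zephyr_prompt`) are illustrative and not taken from app.py, and the turn layout follows the `<|system|>`/`<|user|>`/`<|assistant|>` chat format documented on the zephyr-7b-beta model card.

```python
def build_user_message(visual_events, audio_event):
    """Join per-frame captions and the audio caption into one user turn.

    Hypothetical helper -- app.py's actual call sites are not shown in this diff.
    """
    visual_block = "\n".join(f"- {event}" for event in visual_events)
    return f"Visual events:\n{visual_block}\n\nAudio events:\n{audio_event}"


def build_zephyr_prompt(system_msg, user_msg):
    """Lay out system/user/assistant turns in zephyr-7b-beta's chat format."""
    return (
        f"<|system|>\n{system_msg}</s>\n"
        f"<|user|>\n{user_msg}</s>\n"
        f"<|assistant|>\n"
    )


# Example: build the full prompt string; the generation call itself
# (e.g. pipe(prompt, max_new_tokens=256)) is omitted here.
prompt = build_zephyr_prompt(
    "Summarize the video from the events below.",
    build_user_message(["a person waves", "a person smiles"], "upbeat music plays"),
)
```

In practice one would prefer `tokenizer.apply_chat_template` so the special tokens always match the model's own template, rather than hard-coding them as above.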