Update app.py
app.py
CHANGED
@@ -18,21 +18,11 @@ zephyr_model = "HuggingFaceH4/zephyr-7b-beta"
 pipe = pipeline("text-generation", model=zephyr_model, torch_dtype=torch.bfloat16, device_map="auto")
 
 standard_sys = f"""
-You will be provided a list of visual
+You will be provided a list of visual details observed at regular intervals, along with an audio description. These pieces of information originate from a single video. The visual details are extracted from the video at fixed time intervals and represent consecutive frames. Typically, the video consists of a brief sequence showing one or more subjects...
 
-
-These visual infos are extracted from the video that is usually a short sequence.
-
-Please note that the following list of image descriptions (visual events) was generated by extracting individual frames from a continuous video featuring one or more subjects.
-Depending on the case, all depicted individuals may correspond to the same person(s), with minor variations due to changes in lighting, angle, and facial expressions over time.
-Alternatively, the video may show distinct individuals who share similarities within the given set of descriptors.
-Regardless, assume temporal continuity among the frames unless otherwise specified.
-
-Audio events are actually the entire scene description based only on the audio of the video.
-
-Your job is to use these informations to smartly deduce and provide a very short resume about what is happening in the origin video.
-Provide a short resume about what you understood.
+Please note that the following list of image descriptions (visual details) was obtained by extracting individual frames from a continuous video featuring one or more subjects. Depending on the case, all depicted individuals may correspond to the same person(s), with minor variations due to changes in lighting, angle, and facial expressions over time. Regardless, assume temporal continuity among the frames unless otherwise specified.
 
+Audio events are actually the entire scene description based only on the audio of the video. Your job is to integrate these multimodal inputs intelligently and provide a very short resume about what is happening in the origin video. Provide a succinct overview of what you understood.
 """
 
 def extract_frames(video_in, interval=24, output_format='.jpg'):
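As a rough sketch of how a system prompt like `standard_sys` could be combined with the per-frame and audio descriptions before being passed to `pipe`: the helper names below (`build_user_message`, `build_zephyr_prompt`) are illustrative and not taken from app.py, and the turn layout follows the `<|system|>`/`<|user|>`/`<|assistant|>` chat format documented on the zephyr-7b-beta model card.

```python
def build_user_message(visual_events, audio_event):
    """Join per-frame captions and the audio caption into one user turn.

    Hypothetical helper -- app.py's actual call sites are not shown in this diff.
    """
    visual_block = "\n".join(f"- {event}" for event in visual_events)
    return f"Visual events:\n{visual_block}\n\nAudio events:\n{audio_event}"


def build_zephyr_prompt(system_msg, user_msg):
    """Lay out system/user/assistant turns in zephyr-7b-beta's chat format."""
    return (
        f"<|system|>\n{system_msg}</s>\n"
        f"<|user|>\n{user_msg}</s>\n"
        f"<|assistant|>\n"
    )


# Example: build the full prompt string; the generation call itself
# (e.g. pipe(prompt, max_new_tokens=256)) is omitted here.
prompt = build_zephyr_prompt(
    "Summarize the video from the events below.",
    build_user_message(["a person waves", "a person smiles"], "upbeat music plays"),
)
```

In practice one would prefer `tokenizer.apply_chat_template` so the special tokens always match the model's own template, rather than hard-coding them as above.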