---
language:
- en
license: apache-2.0
tags:
- generated_from_trainer
- documentation_tag
- tag_generation
- github
- github_tag
- tagging
- github_repo
- summarization
metrics:
- rouge
widget:
- text: susya plant disease detector ml powered app to assist farmers in crop disease
    detection and alerts product walkthrough download product apk here machine learning
    python notebook solutions system to detect the problem when it arises and warn
    the farmers disease detection using machine learning model enabled through android
    app which uses flask api solution to overcome the problem once it arises remedy
    is suggested for the disease detected by the app using ml model solution that
    will ensure that the problem will never occur in the future again pdf report is
    generated on the disease predicted along with user information pdf can be used
    as a document to be submitted in nearby krishibhavan thereby seeking help easily
    method that will reduce the impact of the dilemma to a significant level disease
    detected news can be sent to other users as a notification which contatins userplant
    and disease this will help other farmers take up precautions thereby reducing
    the impact of the dilemma to a significant level considering a region machine
    learning model multiclass image classifier built on pytorch framework using cnn
    architecture currently project detects 17 states of disease in 4 plants aiming
    kerala state namely cherry pepper potato and tomato framework pytorch architecture convolutional
    neural networks validation accuracy 777 how to train upload the python notebook
    to google colab and run each cell for training the model i have included a demo
    dataset to configure quickly you can use this kaggle dataset which is the original
    one with huge amount of pictures how it works the input image dataset is converted
    to tensor and is passed through a cnn model returning an output value corresponding
    to the plant disease input image tensor is passed through four convolutional layers
    and then flattened and inputted to fully connected layers api api is built using
    flask framework and hosted in render the api provides two functionalities they
    are plant disease detection accepts a post request with an image in the form of
    base64 string and returns plant disease and remedy notification accepts a post
    request with plant user and disease which is then pushed as a notification to
    other users to warn them regarding a probable outbreak of disease how to use api
    has been built on this classifier url user has to send a post request to the
    given api with base64 string of the image to be input python import requests url imgdata base64
    string of image r requestsposturljson imageimgdata printrtextstrip outputpython
    diseaseseptoria leaf spotplanttomatoremedyremove infected leaves immediatelyfungonil
    and daconil app download product apk here to run app shell cd app flutter run
    to build app shell cd app flutter build apk features authentication using google
    oauth user profile page uses camera or device media to get an image of the crop
    preview the image and sends it to api for disease detection result page showing
    detected disease and remedy generates a pdf report to saveshare predicted disease
    details option to send the generated result as a notification warning to other
    users tech stack used python pytorch flask flutter firebase contributors nanda
    kishor m paiml model api ajay krishna k v flutter dev api hari krishnan uml model
    data collection antony s johnflutter dev
  example_title: 'Github Cleaned Readme #1'
pipeline_tag: summarization
base_model: t5-small
model-index:
- name: t5-small-github-repo-tag-generation
  results: []
---
# t5-small-github-repo-tag-generation

A machine learning model that generates tags for GitHub repositories from their documentation (`README.md`). It is a version of [t5-small](https://huggingface.co/t5-small) fine-tuned on a collection of repositories from [Kaggle/vatsalparsaniya/github-repositories-analysis](https://www.kaggle.com/datasets/vatsalparsaniya/github-repositories-analysis). Tag generation is usually formulated as a multi-label classification problem; this model instead treats _tag generation_ as a text2text generation task (inspiration and reference: [fabiochiu/t5-base-tag-generation](https://huggingface.co/fabiochiu/t5-base-tag-generation)).
<br><br>
The Inference API here expects cleaned README text; the code for cleaning a README is given in the Preprocessing section below.
<br><br>
Fine-tuning notebook reference: [Hugging Face summarization notebook](https://github.com/huggingface/notebooks/blob/main/examples/summarization.ipynb).

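To make the difference between the two formulations concrete, here is a minimal sketch. The label vocabulary and tags are hypothetical, and the comma-separated target format is an assumption inferred from the postprocessing code in this card, which splits the generated text on commas; the exact format used during fine-tuning is not stated.

```python
# Hypothetical label vocabulary and tag set, for illustration only.
vocabulary = ["cnn", "flask", "machine learning", "python", "rust"]
tags = ["python", "machine learning", "cnn"]

# Multi-label classification: one 0/1 entry per vocabulary item,
# so only tags from the fixed vocabulary can ever be predicted.
multi_hot = [1 if label in tags else 0 for label in vocabulary]

# Text2text generation: the tags become a single target string
# (assumed comma-separated), and generated text is split back into a list.
target_text = ", ".join(tags)
decoded_tags = [t.strip() for t in target_text.split(",")]

print(multi_hot)     # [1, 0, 1, 1, 0]
print(target_text)   # python, machine learning, cnn
print(decoded_tags)  # ['python', 'machine learning', 'cnn']
```

One consequence of the text2text framing is that the output is free text, so it is not restricted to a fixed label set and may repeat a tag.
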
# How to use the model

Input: GitHub repo URL<br>
Output: Tags

Remarks: Ensure the repo has a README.<b>md</b> file; the extraction code looks for that exact filename.
### Installations

```shell
pip install transformers torch nltk clean-text beautifulsoup4 markdown requests
```
### Code
Imports
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import re
import nltk
nltk.download('punkt')
from cleantext import clean
from bs4 import BeautifulSoup
from markdown import Markdown
import requests
from io import StringIO
import string
```

Preprocessing
```python
# Script to convert Markdown to plain text
# Reference: Stack Overflow, https://stackoverflow.com/questions/761824/python-how-to-convert-markdown-formatted-text-to-text

def unmark_element(element, stream=None):
    # Recursively write the text content of an element tree to a stream
    if stream is None:
        stream = StringIO()
    if element.text:
        stream.write(element.text)
    for sub in element:
        unmark_element(sub, stream)
    if element.tail:
        stream.write(element.tail)
    return stream.getvalue()


# Patch Markdown with a "plain" output format
Markdown.output_formats["plain"] = unmark_element
__md = Markdown(output_format="plain")
__md.stripTopLevelTags = False


def unmark(text):
    return __md.convert(text)


def readme_extractor(github_repo_url):
    try:
        # Fetch the repository page and parse it with BeautifulSoup
        html_content = requests.get(github_repo_url).text
        soup = BeautifulSoup(html_content, "html.parser")

        # Get the README file URL from the repository page
        # (lstrip avoids a double slash when the href starts with "/")
        readme_href = soup.find("a", {"title": "README.md"}).get("href")
        readme_url = "https://github.com/" + readme_href.lstrip("/")

        # Generate the raw README file URL
        # https://github.com/rasbt/python-machine-learning-book/blob/master/README.md --> https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/README.md
        readme_raw_url = readme_url.replace("/blob/", "/")
        readme_raw_url = readme_raw_url.replace("github.com", "raw.githubusercontent.com")

        readme_html_content = requests.get(readme_raw_url).text
        readme_soup = BeautifulSoup(readme_html_content, "html.parser")
        readme_text = readme_soup.get_text()
        documentation_text = unmark(readme_text)
        return documentation_text
    except Exception:
        # Covers network errors and pages without a README.md link
        print("FAILED : ", github_repo_url)
        return "README_NOT_MARKDOWN"


def clean_readme(readme):
    text = clean(readme, no_emoji=True)
    # Remove URLs
    lst = re.findall(r'http://\S+|https://\S+', text)
    for i in lst:
        text = text.replace(i, '')
    # Strip punctuation, lowercase, and flatten newlines
    text = "".join([i for i in text if i not in string.punctuation])
    text = text.lower()
    text = text.replace("\n", " ")
    return text
```
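The raw-URL step in `readme_extractor` can be checked in isolation. The helper below is a small sketch with a hypothetical name (`to_raw_url`); it applies the same two string replacements, limited to the first occurrence of each pattern, which is a minor deviation from the code above. The sample URL is the one quoted in the code comment.

```python
def to_raw_url(blob_url):
    """Turn a github.com file-view ("blob") URL into its raw.githubusercontent.com form."""
    # Drop the "/blob/" path segment, then swap the host name.
    raw_url = blob_url.replace("/blob/", "/", 1)
    return raw_url.replace("github.com", "raw.githubusercontent.com", 1)


blob = "https://github.com/rasbt/python-machine-learning-book/blob/master/README.md"
print(to_raw_url(blob))
# https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/README.md
```
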
Postprocess Tags [Removing duplicates]
```python
def post_process_tags(tag_string):
    final_tags = []
    for tag in tag_string.split(","):
        tag = tag.strip()
        # Skip duplicates and fragments of one character or less
        if tag in final_tags or len(tag) <= 1:
            continue
        final_tags.append(tag)
    return final_tags
```

Main Function
```python
# `tokenizer` and `model` must already be loaded with the imported
# AutoTokenizer and AutoModelForSeq2SeqLM classes, using this model's Hub ID.

def github_tags_generate(github_repo_url):
    readme = readme_extractor(github_repo_url)
    readme = clean_readme(readme)
    inputs = tokenizer([readme], max_length=1536, truncation=True, return_tensors="pt")
    output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10,
                            max_length=128)
    decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    tags = post_process_tags(decoded_output)

    return tags


github_tags_generate("https://github.com/Enter_Repo_URL")

# github_tags_generate("https://github.com/nandakishormpai/Plant_Disease_Detector")
# ['python', 'machine learning', 'ml', 'cnn']
```

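`github_tags_generate` relies on a `tokenizer` and a `model` that the snippets in this card never create. A minimal loading sketch follows. The helper name `load_tag_generator` is hypothetical, and `MODEL_ID` is a placeholder: this card gives only the model name, so the namespace has to be filled in.

```python
def load_tag_generator(model_id):
    """Load the tokenizer and seq2seq model for a Hub ID or local path."""
    # Imported inside the function so that defining this helper does not
    # itself require the transformers package to be importable.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    model.eval()  # inference only: disables dropout
    return tokenizer, model


# Placeholder only: replace <namespace> with the account that hosts this model.
MODEL_ID = "<namespace>/t5-small-github-repo-tag-generation"

# tokenizer, model = load_tag_generator(MODEL_ID)
```

With these two names defined at module level, the main function above can be called as shown.
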
## Dataset Preparation
Of the 1000 repositories in the dataset, only 870 had tags and a README longer than 50 characters. Only these were kept, and their `README.md` files were scraped using BeautifulSoup.

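The filtering rule described above can be sketched as follows. The record layout and field names (`repo`, `tags`, `readme`) are hypothetical, not the Kaggle dataset's actual schema; only the two conditions (has tags, README longer than 50 characters) come from this card.

```python
MIN_README_CHARS = 50  # threshold stated in this section


def keep_entry(entry):
    """True if the entry has at least one tag and a README longer than the threshold."""
    has_tags = bool(entry.get("tags"))
    readme = entry.get("readme") or ""
    return has_tags and len(readme) > MIN_README_CHARS


# Hypothetical records, for illustration only.
entries = [
    {"repo": "a/one", "tags": ["python", "ml"], "readme": "x" * 120},
    {"repo": "b/two", "tags": [], "readme": "y" * 300},    # no tags
    {"repo": "c/three", "tags": ["cli"], "readme": "short"},  # README too short
]

kept = [e["repo"] for e in entries if keep_entry(e)]
print(kept)  # ['a/one']
```
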
## Intended uses & limitations

The generated output may contain duplicate tags, which must be handled when postprocessing the results. The postprocessing code is given above (`post_process_tags`).

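As a worked illustration of the duplicate handling, here is a compact variant that uses `dict.fromkeys` for order-preserving deduplication. It is an alternative sketch, not the card's own function; the helper name `dedupe_tags` and the sample string are made up for the example.

```python
def dedupe_tags(tag_string, min_len=2):
    """Split a comma-separated tag string, dropping duplicates and very short fragments."""
    stripped = (tag.strip() for tag in tag_string.split(","))
    # dict.fromkeys keeps the first occurrence of each key, in insertion order.
    return list(dict.fromkeys(t for t in stripped if len(t) >= min_len))


raw = "python, machine learning, ml, python, cnn, c, ml"
print(dedupe_tags(raw))  # ['python', 'machine learning', 'ml', 'cnn']
```

Like the original postprocessing, it keeps the first occurrence of each tag and discards fragments of one character or less.
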
## Results

It achieves the following results on the evaluation set:
- Loss: 1.8196
- Rouge1: 25.0142
- Rouge2: 8.1802
- Rougel: 22.77
- Rougelsum: 22.8017
- Gen Len: 19.0

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 40
- mixed_precision_training: Native AMP

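For anyone reproducing the run, the list above maps roughly onto keyword arguments of `Seq2SeqTrainingArguments` in Transformers. The sketch below only builds the dictionary. The `output_dir` value is an assumption, and mapping the batch sizes to the `per_device_*` arguments assumes single-device training, which the card does not state.

```python
# Keyword arguments mirroring the hyperparameters listed above.
training_kwargs = {
    "output_dir": "t5-small-github-repo-tag-generation",  # assumed; not stated in the card
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,  # assumes a single device
    "per_device_eval_batch_size": 8,   # assumes a single device
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 40,
    "fp16": True,  # "Native AMP" mixed precision; requires a CUDA device
}

# With transformers installed, the arguments object could be built as:
# from transformers import Seq2SeqTrainingArguments
# args = Seq2SeqTrainingArguments(**training_kwargs)
```
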
### Framework versions

- Transformers 4.26.1
- Pytorch 1.13.1+cu116
- Datasets 2.10.0
- Tokenizers 0.13.2