README.md · nandakishormpai/t5-small-github-repo-tag-generation at refs/pr/17


	---
	language:
	- en
	license: apache-2.0
	tags:
	- generated_from_trainer
	- documentation_tag
	- tag_generation
	- github
	- github_tag
	- tagging
	- github_repo
	- summarization
	metrics:
	- rouge
	widget:
	- text: susya plant disease detector ml powered app to assist farmers in crop disease
	    detection and alerts product walkthrough download product apk here machine learning
	    python notebook solutions system to detect the problem when it arises and warn
	    the farmers disease detection using machine learning model enabled through android
	    app which uses flask api solution to overcome the problem once it arises remedy
	    is suggested for the disease detected by the app using ml model solution that
	    will ensure that the problem will never occur in the future again pdf report is
	    generated on the disease predicted along with user information pdf can be used
	    as a document to be submitted in nearby krishibhavan thereby seeking help easily
	    method that will reduce the impact of the dilemma to a significant level disease
	    detected news can be sent to other users as a notification which contatins userplant
	    and disease this will help other farmers take up precautions thereby reducing
	    the impact of the dilemma to a significant level considering a region machine
	    learning model multiclass image classifier built on pytorch framework using cnn
	    architecture currently project detects 17 states of disease in 4 plants aiming
	    kerala state namely cherry pepper potato and tomato framework pytorch architecture convolutional
	    neural networks validation accuracy 777 how to train upload the python notebook
	    to google colab and run each cell for training the model i have included a demo
	    dataset to configure quickly you can use this kaggle dataset which is the original
	    one with huge amount of pictures how it works the input image dataset is converted
	    to tensor and is passed through a cnn model returning an output value corresponding
	    to the plant disease input image tensor is passed through four convolutional layers
	    and then flattened and inputted to fully connected layers api api is built using
	    flask framework and hosted in render the api provides two functionalities they
	    are plant disease detection accepts a post request with an image in the form of
	    base64 string and returns plant disease and remedy notification accepts a post
	    request with plant user and disease which is then pushed as a notification to
	    other users to warn them regarding a probable outbreak of disease how to use api
	    has been built on this classifier url user has to send a post request to the
	    given api with base64 string of the image to be input python import requests url imgdata base64
	    string of image r requestsposturljson imageimgdata printrtextstrip outputpython
	    diseaseseptoria leaf spotplanttomatoremedyremove infected leaves immediatelyfungonil
	    and daconil app download product apk here to run app shell cd app flutter run
	    to build app shell cd app flutter build apk features authentication using google
	    oauth user profile page uses camera or device media to get an image of the crop
	    preview the image and sends it to api for disease detection result page showing
	    detected disease and remedy generates a pdf report to saveshare predicted disease
	    details option to send the generated result as a notification warning to other
	    users tech stack used python pytorch flask flutter firebase contributors nanda
	    kishor m paiml model api ajay krishna k v flutter dev api hari krishnan uml model
	    data collection antony s johnflutter dev
	  example_title: 'Github Cleaned Readme #1'
	pipeline_tag: summarization
	base_model: t5-small
	model-index:
	- name: t5-small-github-repo-tag-generation
	  results: []
	---

	<!-- This model card has been generated automatically according to the information the Trainer had access to. You
	should probably proofread and complete it, then remove this comment. -->

	# t5-small-github-repo-tag-generation

	A machine learning model that generates tags for GitHub repositories from their documentation (`README.md`). It is a version of [t5-small](https://huggingface.co/t5-small) fine-tuned on a collection of repositories from [Kaggle/vatsalparsaniya/github-repositories-analysis](https://www.kaggle.com/datasets/vatsalparsaniya/github-repositories-analysis). Tagging is usually formulated as a multi-label classification problem; this model instead treats _tag generation_ as a text2text generation task (inspiration and reference: [fabiochiu/t5-base-tag-generation](https://huggingface.co/fabiochiu/t5-base-tag-generation)).
	<br><br>
	The Inference API here expects cleaned README text; the code for cleaning a README is given below.
	<br><br>
	Fine-tuning notebook reference: [Hugging Face summarization notebook](https://github.com/huggingface/notebooks/blob/main/examples/summarization.ipynb).


	# How to use the model

	Input: GitHub repository URL<br>
	Output: Tags

	Remark: the repository must contain a file named `README.md`; the extraction code below looks for that exact filename.

	### Installations

	```shell
	pip install transformers torch nltk clean-text beautifulsoup4 markdown requests
	```
	### Code
	Imports
	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
	import re
	import nltk
	nltk.download('punkt')
	from cleantext import clean
	from bs4 import BeautifulSoup
	from markdown import Markdown
	import requests
	from io import StringIO
	import string
	```

	Preprocessing
	```python
	# Script to convert Markdown to plain text
	# Reference: Stack Overflow, https://stackoverflow.com/questions/761824/python-how-to-convert-markdown-formatted-text-to-text

	def unmark_element(element, stream=None):
	    """Recursively collect the text of an ElementTree element, dropping the tags."""
	    if stream is None:
	        stream = StringIO()
	    if element.text:
	        stream.write(element.text)
	    for sub in element:
	        unmark_element(sub, stream)
	    if element.tail:
	        stream.write(element.tail)
	    return stream.getvalue()


	# Patch Markdown with a "plain" output format that emits text only
	Markdown.output_formats["plain"] = unmark_element
	__md = Markdown(output_format="plain")
	__md.stripTopLevelTags = False


	def unmark(text):
	    return __md.convert(text)


	def readme_extractor(github_repo_url):
	    try:
	        # Fetch the repository page and parse it with BeautifulSoup
	        html_content = requests.get(github_repo_url).text
	        soup = BeautifulSoup(html_content, "html.parser")

	        # Get README file URL from the repository page
	        readme_url = "https://github.com/" + soup.find("a", {"title": "README.md"}).get("href")

	        # Generate raw README file URL
	        # https://github.com/rasbt/python-machine-learning-book/blob/master/README.md --> https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/README.md
	        readme_raw_url = readme_url.replace("/blob/", "/")
	        readme_raw_url = readme_raw_url.replace("github.com", "raw.githubusercontent.com")

	        readme_html_content = requests.get(readme_raw_url).text
	        readme_soup = BeautifulSoup(readme_html_content, "html.parser")
	        readme_text = readme_soup.get_text()
	        documentation_text = unmark(readme_text)
	        return documentation_text
	    except Exception:
	        print("FAILED : ", github_repo_url)
	        return "README_NOT_MARKDOWN"


	def clean_readme(readme):
	    text = clean(readme, no_emoji=True)
	    # Remove URLs; "|" is regex alternation between the http and https forms
	    lst = re.findall(r"http://\S+|https://\S+", text)
	    for i in lst:
	        text = text.replace(i, "")
	    text = "".join([i for i in text if i not in string.punctuation])
	    text = text.lower()
	    text = text.replace("\n", " ")
	    return text
	```
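To inspect what the preprocessing produces without network access or third-party packages, here is a minimal stdlib-only sketch of two steps from the code above: the blob-to-raw URL rewrite and the text cleanup. The names `to_raw_url` and `simple_clean` are hypothetical. `simple_clean` only approximates `clean_readme`: it skips the `clean-text` normalization and, as an added convenience, collapses repeated whitespace.

```python
import re
import string

# Matches http:// and https:// URLs up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def to_raw_url(readme_url: str) -> str:
    """Rewrite a github.com blob URL into its raw.githubusercontent.com form."""
    raw_url = readme_url.replace("/blob/", "/")
    return raw_url.replace("github.com", "raw.githubusercontent.com")


def simple_clean(text: str) -> str:
    """Approximate clean_readme: drop URLs and punctuation, lowercase, flatten newlines."""
    text = URL_PATTERN.sub("", text)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = text.lower().replace("\n", " ")
    # Assumption: collapsing whitespace is added here and is not in the original code.
    return " ".join(text.split())


print(to_raw_url("https://github.com/rasbt/python-machine-learning-book/blob/master/README.md"))
print(simple_clean("See https://example.com/docs for details.\nBuilt with PyTorch!"))
```

The second call shows the kind of lowercase, punctuation-free text that the model expects as input.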
	Postprocess Tags [Removing duplicates]
	```python
	def post_process_tags(tag_string):
	    final_tags = []
	    for tag in tag_string.split(","):
	        if tag.strip() in final_tags or len(tag.strip()) <= 1:
	            continue
	        final_tags.append(tag.strip())
	    return final_tags
	```
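As a quick check of the postprocessing step, the sketch below repeats `post_process_tags` so that it runs on its own, then applies it to a made-up string (an assumption for illustration, not real model output). Besides duplicates, it also drops tags of one character or less, so a tag such as `c` is lost.

```python
def post_process_tags(tag_string):
    """Split a comma-separated tag string, dropping duplicates and tags of length <= 1."""
    final_tags = []
    for tag in tag_string.split(","):
        tag = tag.strip()
        if tag in final_tags or len(tag) <= 1:
            continue
        final_tags.append(tag)
    return final_tags


# Hypothetical decoded output, for illustration only
sample_output = "python, machine learning, ml, python, c, cnn, "
print(post_process_tags(sample_output))
```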

	Main Function
	```python
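	# Load the fine-tuned model and its tokenizer (model id taken from this repository's name)
	model_name = "nandakishormpai/t5-small-github-repo-tag-generation"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


	def github_tags_generate(github_repo_url):
	    readme = readme_extractor(github_repo_url)
	    if readme == "README_NOT_MARKDOWN":
	        # README extraction failed, so there is nothing to tag
	        return []
	    readme = clean_readme(readme)
	    inputs = tokenizer([readme], max_length=1536, truncation=True, return_tensors="pt")
	    output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10,
	                            max_length=128)
	    decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
	    tags = post_process_tags(decoded_output)

	    return tags


	github_tags_generate("https://github.com/Enter_Repo_URL")

	# Example (output can vary between runs because do_sample=True):
	# github_tags_generate("https://github.com/nandakishormpai/Plant_Disease_Detector")
	# ['python', 'machine learning', 'ml', 'cnn']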
	```

	## Dataset Preparation
	Of the 1000 repositories in the dataset, only 870 had both tags and a README longer than 50 characters. These 870 were selected, and the README.md of each was scraped using BeautifulSoup.
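
The selection rule described above can be sketched as follows, assuming the 870 repositories are the ones that were kept. This is an illustration only: the record layout and the field names `tags` and `readme` are assumptions, not the Kaggle dataset's actual schema.

```python
MIN_README_CHARS = 50  # threshold stated in the text


def keep_repository(record: dict) -> bool:
    """Keep a repository only if it has tags and a README longer than 50 characters."""
    has_tags = bool(record.get("tags"))
    readme = record.get("readme") or ""
    return has_tags and len(readme) > MIN_README_CHARS


# Hypothetical records, for illustration only
records = [
    {"name": "repo-a", "tags": ["python", "ml"], "readme": "x" * 200},
    {"name": "repo-b", "tags": [], "readme": "x" * 200},         # no tags
    {"name": "repo-c", "tags": ["cli"], "readme": "too short"},  # README too short
]

kept = [r["name"] for r in records if keep_repository(r)]
print(kept)
```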


	## Intended uses & limitations

	The generated output may contain duplicate tags, which must be removed in postprocessing; the postprocessing code given above does this.


	## Results

	It achieves the following results on the evaluation set:
	- Loss: 1.8196
	- Rouge1: 25.0142
	- Rouge2: 8.1802
	- Rougel: 22.77
	- Rougelsum: 22.8017
	- Gen Len: 19.0
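
For intuition about the Rouge1 number, the sketch below computes a simplified unigram-overlap F1 between a reference tag string and a generated one. It is not the implementation used for the evaluation above: the tokenizer is a crude assumption and the example strings are made up. The scores listed above are on a 0-100 scale, whereas this function returns a value between 0 and 1.

```python
import re
from collections import Counter


def rouge1_f1(reference: str, prediction: str) -> float:
    """Simplified ROUGE-1: F1 over clipped unigram overlap."""
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    pred_tokens = re.findall(r"[a-z0-9]+", prediction.lower())
    if not ref_tokens or not pred_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and prediction, for illustration only
score = rouge1_f1("python, machine learning, cnn", "python, ml, cnn")
print(round(score, 4))
```

Here two of the three predicted unigrams appear among the four reference unigrams, so precision is 2/3, recall is 2/4, and the F1 is 4/7.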


	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 2e-05
	- train_batch_size: 8
	- eval_batch_size: 8
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 40
	- mixed_precision_training: Native AMP
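
As an illustration of `lr_scheduler_type: linear`, the sketch below computes a learning rate that decays linearly from the listed `learning_rate` of 2e-05 to zero. It assumes no warmup steps (none are listed above), and `total_steps` is a hypothetical value because the number of optimizer steps is not stated.

```python
BASE_LR = 2e-05  # learning_rate listed in the hyperparameters


def linear_lr(step: int, total_steps: int, warmup_steps: int = 0) -> float:
    """Linear warmup (if any) followed by linear decay to zero."""
    if warmup_steps and step < warmup_steps:
        return BASE_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # hypothetical; the real step count is not given
for step in (0, 500, 1000):
    print(step, linear_lr(step, total_steps))
```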



	### Framework versions

	- Transformers 4.26.1
	- Pytorch 1.13.1+cu116
	- Datasets 2.10.0
	- Tokenizers 0.13.2