Commit 3ba2b45
Parent(s): b756aab

added multiple ways to download a model

Files changed:
- download.ipynb +168 -0
- mlflow.txt +43 -0
download.ipynb
ADDED
@@ -0,0 +1,168 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Download models from pipeline module"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import pipeline"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.\n",
+      "\n",
+      "All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.\n",
+      "If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.\n"
+     ]
+    }
+   ],
+   "source": [
+    "summarizer = pipeline(\"summarization\", model=\"google-t5/t5-small\", tokenizer=\"google-t5/t5-small\", truncation=True, framework=\"tf\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n",
+      "I0000 00:00:1725208010.799880 11454710 service.cc:146] XLA service 0x340929ca0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:\n",
+      "I0000 00:00:1725208010.799988 11454710 service.cc:154] StreamExecutor device (0): Host, Default Version\n",
+      "2024-09-01 21:56:50.805880: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.\n",
+      "I0000 00:00:1725208010.830035 11454710 device_compiler.h:188] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[{'summary_text': 'MLflow is an open-source platform created by Databricks . it is designed to manage the four primary phases of the machine learning lifecycle . this includes experiment tracking, project packaging, model deployment, and lifecycle management . each experiment is documented and replicated in the future .'}]\n"
+     ]
+    }
+   ],
+   "source": [
+    "with open('mlflow.txt', 'r') as file:\n",
+    "    print(summarizer(file.read()))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Download Models using hf_hub_download\n",
+    "- This will store the models in your cache, and you don't have to download the model every time you use it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import hf_hub_download"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "bd7624af3b3d4a2bba2c889754d04560",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "pytorch_model.bin: 0%| | 0.00/242M [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'/Users/seemasaharan/.cache/huggingface/hub/models--google-t5--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/pytorch_model.bin'"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "hf_hub_download(repo_id=\"google-t5/t5-small\", filename=\"pytorch_model.bin\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Download Models using Git"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!git lfs install\n",
+    "!git clone <model-URL>"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
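The `hf_hub_download` cell above emits a `tokenizers` fork warning, and the warning text itself names the fix: set the `TOKENIZERS_PARALLELISM` environment variable before the process forks. A minimal sketch of that fix (the variable name and accepted values come straight from the warning in the notebook output):

```python
import os

# As the tokenizers warning suggests: set this before any fork happens
# (i.e. before using tokenizers or spawning worker processes).
# "false" disables tokenizer parallelism; "true" keeps it enabled.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Setting the variable in the shell before launching Jupyter (`export TOKENIZERS_PARALLELISM=false`) achieves the same effect.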
mlflow.txt
ADDED
@@ -0,0 +1,43 @@
+Introduction to MLflow: A Comprehensive Guide to Managing the Machine Learning Lifecycle
+
+Machine learning (ML) has become an essential tool for data scientists and organizations, offering the ability to predict trends, automate decision-making, and gain insights from vast amounts of data. However, the lifecycle of developing, deploying, and managing machine learning models is complex, involving multiple stages and a variety of tools. MLflow, an open-source platform created by Databricks, addresses these challenges by providing an integrated environment for managing the end-to-end machine learning lifecycle.
+
+Overview of MLflow
+
+MLflow is designed to manage the four primary phases of the machine learning lifecycle: experiment tracking, project packaging, model deployment, and lifecycle management. These components are modular, meaning that they can be used independently or together, depending on the needs of the project.
+
+1. MLflow Tracking: This component allows users to log and query experiments, storing parameters, metrics, and artifacts (such as models or data) associated with each run. This is crucial for reproducibility, as it ensures that every experiment is documented and can be replicated in the future. MLflow Tracking supports a variety of storage backends, including local files, SQL databases, and cloud storage, making it versatile for different deployment environments.
+
+2. MLflow Projects: A project is a way to package data science code in a reusable and reproducible manner. MLflow Projects use standardized formats, typically a directory containing code and a configuration file (`MLproject`), which defines dependencies and execution commands. This modularity promotes collaboration across teams, as projects can be easily shared and run in various environments with consistent results.
+
+3. MLflow Models: Once a model is trained, it needs to be packaged and deployed for inference. MLflow Models standardize how models are stored and deployed. A model in MLflow can be saved in different formats such as TensorFlow, PyTorch, Scikit-learn, and more, allowing for seamless integration with existing frameworks. Furthermore, MLflow Models facilitate the deployment of models as a REST API or within a cloud environment, making the transition from experimentation to production smoother and more efficient.
+
+4. MLflow Model Registry: The Model Registry component provides a central repository for managing the lifecycle of ML models. It allows users to version models, transition models through different stages (e.g., staging, production), and maintain a detailed history of models, including who trained them and when. This component is essential for organizations that need to manage multiple models in production, ensuring that only the best-performing models are deployed.
+
+Benefits of Using MLflow
+
+MLflow offers several key benefits that make it an attractive option for managing the machine learning lifecycle:
+
+1. Reproducibility: By tracking experiments and standardizing the packaging of projects and models, MLflow ensures that experiments can be reproduced with consistent results. This is vital for both research and production, where reproducibility guarantees the reliability and validity of models.
+
+2. Scalability: MLflow is designed to be scalable, supporting a wide range of deployment environments from local development setups to large-scale cloud platforms. This flexibility allows teams to scale their ML operations as needed, whether they are working on small projects or large enterprise applications.
+
+3. Interoperability: MLflow supports a wide variety of machine learning frameworks and languages, making it a versatile tool for teams that use different tools and technologies. This interoperability reduces the friction between different stages of the ML lifecycle and allows for seamless integration into existing workflows.
+
+4. Collaboration: With features like the Model Registry and the standardized format of MLflow Projects, teams can collaborate more effectively. Models and experiments can be shared, reviewed, and iterated upon by different members of a team, ensuring that the collective knowledge and efforts are harnessed to produce the best possible outcomes.
+
+5. Model Governance: The ability to version and manage models through the MLflow Model Registry ensures that organizations can maintain strict control over their models in production. This is particularly important for industries with regulatory requirements, where knowing which model was used for a particular decision can be crucial.
+
+Challenges and Considerations
+
+While MLflow provides a robust platform for managing the machine learning lifecycle, there are challenges and considerations to keep in mind:
+
+1. Learning Curve: For teams unfamiliar with MLflow, there is a learning curve associated with understanding and implementing its components. However, once mastered, the benefits typically outweigh the initial investment of time and effort.
+
+2. Integration with Existing Tools: Although MLflow is designed to be interoperable, integrating it with an organization’s existing infrastructure and tools can require significant effort. Custom integrations or plugins may be necessary for specific use cases.
+
+3. Cost of Infrastructure: Depending on the scale at which MLflow is deployed, the associated infrastructure costs can be significant. Organizations need to carefully consider the cost-benefit ratio when deploying MLflow, particularly if they require high availability and scalability.
+
+Conclusion
+
+MLflow has emerged as a powerful tool for managing the machine learning lifecycle, offering features that address the challenges of reproducibility, scalability, and collaboration. By providing an integrated environment for tracking experiments, packaging projects, deploying models, and managing model lifecycles, MLflow enables organizations to streamline their ML operations and accelerate their path to production. While there are challenges in adopting and integrating MLflow, the benefits it offers make it a valuable asset for data science teams looking to optimize their machine learning workflows.
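The MLflow Tracking component described in `mlflow.txt` reduces, conceptually, to recording the parameters, metrics, and artifacts of each run so it can be reproduced later. A toy, framework-free sketch of that idea follows. This is not MLflow's actual API; the class and method names here are purely illustrative, and real MLflow adds querying, artifact storage, and pluggable backends on top of this core:

```python
import json
import time
import uuid
from pathlib import Path


class ToyTracker:
    """Illustrative stand-in for experiment tracking: one JSON record per run."""

    def __init__(self, root="runs"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def log_run(self, params, metrics):
        # Each run gets a unique id, a timestamp, and its params/metrics,
        # persisted as a small JSON file so it can be inspected later.
        run = {
            "run_id": uuid.uuid4().hex,
            "timestamp": time.time(),
            "params": params,
            "metrics": metrics,
        }
        (self.root / f"{run['run_id']}.json").write_text(json.dumps(run))
        return run["run_id"]


tracker = ToyTracker()
run_id = tracker.log_run(
    params={"model": "google-t5/t5-small", "truncation": True},
    metrics={"rougeL": 0.31},  # illustrative value, not a measured result
)
print(run_id)
```

The design point this sketch illustrates is the one the text makes: because every run is persisted with its configuration, an experiment can be re-created from its record rather than from memory.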