FAQ for TU257 Data Analytics
This page contains questions I've received about the module, the topics, labs, and assignments.
It can be challenging to ask questions, whether in class or after class. This FAQ attempts to address these challenges and allows knowledge to be shared.
Students can contact me with their questions. I will attempt to respond promptly with answers. To assist with sharing knowledge with the whole class, I will post the questions I receive on this page, along with the answers. [The name of the student asking the question will not be listed]
IMPORTANT: When I say I will attempt to respond promptly, it means I will endeavor to respond within a day or two of receiving the question. If I don't respond as quickly as you would like, just remember I have other classes, tasks and roles to perform each day. If I haven't responded within three days, then yes, please get onto me and gently remind me :-)
IMPORTANT: There will be a delay for any questions asked during weekends, holidays and vacation periods. Questions will be answered after such periods.
If you see a week with no questions, that means no one asked me a question.
IMPORTANT: It is important that students check this page regularly for new Q&A. There will be no notices posted to the class group when new Q&A are added.
Week 0 - Course Admin
Q: Is there an exam?
A: Nope, no exam. There are two assessments. The combination of marks from these makes up your final mark for the module. See the Module Introduction & Admin for the breakdown of these marks.
Q: How quickly will we get feedback on the assessment?
A: Typically within 2-3 working weeks. Depending on dates, timing, etc., it might be a little longer. Feedback consists of a short paragraph on your assessment highlighting good things and things that needed some additional work.
Week 1 - Introduction
Q: Will we be coding or doing lab work during Week 1?
A: There will be a bit of an overlap with your other module: similar tasks to complete in getting your environment set up and installing software. These tasks aren't very complex, but some people might have minor issues. It's important to get these resolved asap.
Week 2 - Bank Holiday
Q: Do we have a class or work to do this week?
A: There is NO class and technically NO work you need to complete. If you'd like to do some learning, I've posted links on a webpage to some tutorials on using Pandas to process data. Check out the link to the webpage. NB: You don't have to do this.
Week 3
Q: I cannot get the library to install in Anaconda. Is there another way to do this?
A: Yes, run the following command on the command line. It will do the install, and the library will then appear in your Anaconda environment.
conda install -c conda-forge ydata-profiling
Q: I get an error when I change the directory to the location of the data file.
A: This means you have typed the full directory path to the file incorrectly. Double-check the full path and make sure the path in the notebook matches it exactly; an error message means there is a typo in the path. Keep checking until it works.
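If you want to check the path before pointing the notebook at it, Python's pathlib module can confirm each part exists. A minimal sketch; the path shown is a made-up example, so swap in your own:

```python
from pathlib import Path

# Made-up example path - replace with the location of your own data file
data_file = Path("C:/Users/student/TU257/data/bank.csv")

print(data_file.parent.exists())  # True if the folder exists
print(data_file.exists())         # True if the file itself exists
```

If the folder check prints False, the typo is in the directory part of the path; if only the file check fails, recheck the file name.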
Q: Should I use / or \ in the directory path?
A: It shouldn't matter if you use / in the path. But if you get an error message (and you are using a Windows machine), just change it to \. If you are using a Mac, you can use /.
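One way to sidestep the separator question altogether is to build the path with pathlib, which inserts the right separator for your operating system. The folder and file names below are made up for illustration:

```python
from pathlib import Path

# Made-up folder and file names - pathlib joins the parts with
# / on Mac/Linux and \ on Windows automatically
data_dir = Path("TU257") / "data"
data_file = data_dir / "bank.csv"
print(data_file)
```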
Week 4
Q: My chart/graph doesn't display. All I get is an empty cell. What is wrong and how do I fix it?
A: There is nothing wrong. Go to the cell containing the code to display the chart/graph and run the cell a second time. It should now display.
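If re-running the cell still gives nothing, explicitly asking for the figure at the end of the cell often helps. A minimal sketch, assuming matplotlib and some invented sample data:

```python
import matplotlib.pyplot as plt

# Invented sample data - any small series will do to test your setup
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [2, 4, 6, 8])
ax.set_title("Test chart")
plt.show()  # explicitly requesting the figure can fix an empty cell
```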
Week 5
Week 6
Assignment A
Q: I'm trying to load the data set into a pandas dataframe, but it isn't loading the header correctly. Can you suggest what I need to do to make this work correctly for me?
A: The read_csv function assumes the file is in CSV format. By default, the column separator for CSV is a comma, but any character can be used as a separator. See the documentation for this function: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
For example, if the data set has a semicolon separator, you can use the following.
# import dataset
df = pd.read_csv('.........', sep=';')
Q: Is there a sample report/solution you can give us, for example, a sample from last year?
A: The problem with providing sample solutions for assignments is that I'll just get 20+ reports which are exactly the same as the sample; effectively a copy and paste. There is no learning experience in doing this, and for this reason NO sample solutions will be provided. I have given you a template notebook you can use. This contains a structured layout of the sections you will typically have to work through, and these sections also correspond to the marking scheme. Each student/team is responsible for completing these sections with code and clear documentation covering the work completed, along with discussions of your work and findings. This makes every report/notebook/assignment submission individual to each student/team.
Q: Is there a marking rubric?
A: The marking scheme is provided on the assignment handout. Although it isn't in a rubric format, it does give you the available marks for each section. Marks will be awarded on a range from zero to the maximum mark for a section, based on the depth of detail, analysis, and discussion completed for each section. For example, if you write a few lines of code and only give a short general comment on them, then you will get minimal marks. Similarly, if you just reuse the code from the Demo Notebooks I've supplied and provide no additional comments/discussions/analysis, then you will be looking at marks in the lower ranges, possibly zero. On the other hand, if you write lots of code, the code is well documented, and you provide in-depth analysis of the results and perform additional analysis, then you might be looking at a mark towards the upper range.
Q: Do we need to write a separate report, in addition to the notebook?
A: There is no need for a separate report document. Everything should be in the notebook and it is this notebook you should submit.
Q: Does each person need to submit an assignment on BrightSpace?
A: Only one submission from a team should be submitted on BrightSpace. The names, student numbers, etc. must be clearly given at the top of each notebook. Feedback will consist of a short paragraph and a mark. This will be entered into BrightSpace, and it is the responsibility of the person receiving it to share it with the other students in the team. The short feedback paragraph will focus on areas in your submission that needed more work (i.e. areas where you lost marks). You can learn from this and try to address these in Assignment-2.
Q: I'm a little confused about whether I need to use Up Sampling or Under Sampling. Can you help guide us on this?
A: Up/Down Sampling allows you to create a modified data set. Most classification algorithms like to see a similar number of cases/records for each value of the Target variable; in some instances, they like to see a 50:50 split (for binary classification). In most data sets and work-related classification problems, one of the values of the Target variable is a small percentage of the overall data set. To overcome this small percentage, Up Sampling can be used to increase the number of cases/records for that value. This will typically be used when dealing with small data sets, like what we have for the Assignment. When you have LARGE data sets (Big Data), consisting of several million or tens of millions of cases/records, you might want to use Down Sampling to bring the data set down to a manageable size to work with, and so the algorithms can run without exhausting computer memory (RAM).
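A minimal Up Sampling sketch in pandas; the column names and values here are invented for illustration:

```python
import pandas as pd

# Toy imbalanced data set - 6 'no' cases and only 2 'yes' cases
df = pd.DataFrame({
    "feature": range(8),
    "target":  ["no"] * 6 + ["yes"] * 2,
})

majority = df[df["target"] == "no"]
minority = df[df["target"] == "yes"]

# Up Sampling: draw from the minority class with replacement
# until it matches the size of the majority class
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced["target"].value_counts())  # now a 50:50 split
```

Down Sampling is the mirror image: sample the majority class (without replacement) down to the size of the minority class.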
Q: Is the assignment based on what is covered in weeks 5+6?
A: The assessment work is based on everything we have covered up to and including week 7.
Q: Should we spend any time understanding the domain knowledge of the dataset?
A: You don't have to spend a lot of time on this. Everyone probably has some knowledge of the domains for the two problem data sets. Use this knowledge and imagine you were an employee: what would you say or report to your manager about the work you did?
Q: If columns are correlated, should we remove them?
A: Have a look back over the notes/demo relating to correlation analysis. It's also vital to look at the meta-data or the descriptive information provided with the data set. This will give you some hints about certain attributes/features in the data set.
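A minimal correlation check in pandas; the column names and values are invented so the example is self-contained:

```python
import pandas as pd

# Toy data - 'b' is perfectly correlated with 'a' (b = 2 * a),
# while 'c' is unrelated to both
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 8, 1, 4],
})

corr = df.corr()
print(corr.round(2))
# A pair with a correlation close to 1 (or -1) is a candidate
# for dropping one of the two columns
```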
Q: I ran Naive Bayes and Decision Tree on the data (Supermarket case study) and got 100% accuracy on both. Is that beginner's luck? Totally random?
A: Yes, that's a bit of beginner's luck, or perhaps bad luck. It is commonly referred to as having an 'overfitted model'. I've mentioned this several times in class. Have a look at the previous question (above) about correlated data, and at the notes and examples that cover this. Hopefully, this is a good hint towards what you might have to do. Also, check out the documentation for the data set and any related articles.
Q: Is it ok to pick 2-3 algorithms to build a model?
A: Remember the quote from George Box: "all models are wrong, but some are useful". Something to consider is what algorithms you can test and evaluate, and whether they are sufficient.
Q: For problem set 2, should we perform any data manipulations based on the date the data was extracted or the current date?
A: It should be based on the date the data was extracted. You need to use this to correctly calculate any additional information you think is necessary. If you used the current date (i.e. today's date), then all calculations would be incorrect. Additionally, all calculations would give a different outcome depending on whether you run them today, tomorrow, or the day after, etc. FYI, the date of extraction is 23rd February 1998.
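A minimal sketch of anchoring calculations to the extraction date; the column name and dates are invented for illustration:

```python
import pandas as pd

# The fixed date the data was extracted (from the question above)
EXTRACTION_DATE = pd.Timestamp("1998-02-23")

# Invented column - dates of each customer's last purchase
df = pd.DataFrame({
    "last_purchase": pd.to_datetime(["1997-11-10", "1998-01-05"]),
})

# Always measure from the extraction date, never from pd.Timestamp.today(),
# so the result is the same no matter when the notebook is run
df["days_since_purchase"] = (EXTRACTION_DATE - df["last_purchase"]).dt.days
print(df)
```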
Assessment - General Questions & Comments
Q: What feedback will we get?
A: For a group assessment, one person should submit the assignment on BrightSpace. Feedback will consist of a short paragraph and a mark. This will be entered into BrightSpace for the assignment submission. The student receiving it should share it with the other members of the team. For individual assessments, feedback will be similar with a paragraph and mark.
Q: How much detail should we include in the assignment for each topic?
A: It's important, when completing this assignment, to explain your work: why you are doing a task, what the outcomes are, what they mean, how it feeds into the next step/cell, etc. Document all of this as code comments and Markdown. This helps the reader to see and understand what you are doing, (most importantly) why you are doing it, and (even more importantly) that you can explain the outcomes and what they mean. This will be useful for potential employers who might ask to see examples of your work. The more detail you include beyond a copy & paste of the code examples I've given, the better it will be for you and the more marks I can award. A copy & paste of what I've given will not gain many marks.
Q: Can I get regular feedback on my assessment work during the weeks leading up to the Due Date?
A: All the materials necessary to complete the assessment have been covered. Your task is to apply these to the specified problem/use case. This is standard for assessments. If you have a specific question about the problem/use case, it will be addressed and shared with the class and on this FAQ page. Questions such as "is this code correct?", "is this the correct answer?", "what is missing?", etc. will not be answered.
Q: Can I get a higher mark/grade?
A: The final mark/grade for the assessment has been carefully considered and will be in keeping with the marking and standards of previous students on this module and of other similar modules. All marks are classed as provisional and subject to change as per the TU Dublin General Assessment Regulations. All assessments and marks/grades are reviewed by an External Examiner to ensure the standard of marking and feedback.
Check the TU Dublin General Assessment Regulations.
Some students expect to receive a very high mark for their work, typically in the 80%-100% range. This is an incorrect assumption. Marks are awarded on a 0%-100% range. Only a small percentage of students will attain a mark of >70%. Most students will achieve a mark between 50% and 69%.
Q: I'm not happy with the mark/grade I've received. What can I do?
A: Check the TU Dublin General Assessment Regulations for details on what you can do regarding this. The procedure for appealing your grade is outlined there, along with the fees for requesting this.
Q: I attempted a section of the assessment and didn't get full marks. Why?
A: Marks are awarded on a sliding scale and are based on several factors such as completeness, correctness, depth of detail, explanations, etc. If you didn't get full marks, then some details were probably missing.
Week 7
Q: When I first ran TPOT with the default hyperparameter settings, all 5 generations gave the same CV score (see cells 51 & 52 below). When I changed some of the hyperparameters in the TPOT algorithm, it appeared to execute without any issue but came back with the following message: "Pipeline encountered that has previously been evaluated during the optimization process. Using the score from the previous evaluation."
A: Check out this post on the TPOT GitHub repository, which contains details about this message and what can be done: https://github.com/EpistasisLab/tpot/issues/927
Q: AutoML has inbuilt steps to perform Data Selection and Data Transformations. Does this mean I don't have to do any of these steps manually?
A: Data cleaning and preparation still need to be performed before using the dataset with AutoML. AutoML tools perform some additional Feature Engineering and Feature Selection. For example, with Feature Selection they will run a variety of statistical tests to select which features to include in each iteration. This typically involves selecting a random subset of the features to include in the model.
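A rough sketch of the random-subset idea only, not the internals of any particular AutoML tool; the feature names are invented:

```python
import random

# Invented feature names for illustration
features = ["age", "income", "balance", "tenure", "region"]

random.seed(42)  # fixed seed so the example is repeatable
# On each iteration, a tool might fit and score a model using
# only a random subset of the available features
for iteration in range(3):
    subset = random.sample(features, k=3)
    print(f"Iteration {iteration}: {subset}")
```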
Week 8
Week 9
Q: I cannot install or find the mlxtend library in Anaconda. How can I install it?
A: If you go to the Anaconda webpage for this library, you will find a command to run in a Terminal/Command Line window. You had to do something similar when installing the tpot library for AutoML.
conda install -c conda-forge mlxtend
You might need to close Anaconda and reopen it (or refresh the environment); the library should then appear in your list of installed libraries. If you had Jupyter Notebook open, you will need to restart the kernel (see the menu option).
Assignment B (See section above on Assessment - General Questions & Comments)
Q: How many references would be enough for each notebook?
A: The answer to everything in IT is "it depends". It depends on the topic you are covering and how much detail is included. This can be a useful section to illustrate to employers your reading on the topic and your understanding of it. I'd suggest having a minimum of three references. One of these can be the reference to the data set you are using. But consider adding 4, 5, or 6 references, up to a maximum of 8.
Q: For Assignment B, is it OK to use the dataset for Assignment A that we didn't work on?
A: The assignment handout, in the section Important, mentions you "should not include any of the datasets and examples used throughout the module". This includes the data set not used in Assignment A.
The only exception to this is for AutoML.
Q: How many different data sets should we use for Assessment B?
A: Given the nature of the topics, you'll need to use two different data sets.
Q: How can we share the datasets we used in the assignment with you?
A: There are a number of ways to do this. If you downloaded the data set from the internet using Python code (using pandas or otherwise), then just leave that code in your notebook submission. If you discover there are download limits or rate limits, or the download can be temperamental, you can give a Dropbox or Google Drive link. An alternative is to include a ZIP file with your submission in BrightSpace.
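Leaving the download code in the notebook is the simplest route. The URL below is a placeholder, and the runnable part uses an in-memory file so this sketch works without a network connection:

```python
import io
import pandas as pd

# Placeholder URL - pandas can read straight from a web address,
# so leaving a line like this in your notebook lets the reader
# re-download the same data you used:
# df = pd.read_csv("https://example.com/my-dataset.csv")

# read_csv also accepts any file-like object, shown here with an
# in-memory CSV so the sketch runs offline
df = pd.read_csv(io.StringIO("a,b\n1,2\n3,4\n"))
print(df.shape)
```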
Week 10
Q: I've seen examples using PCA (and other approaches) for Clustering. Should we use this (for our assignment)?
A: PCA, or Principal Component Analysis, is a dimensionality reduction method. It is complex to use and to understand what it produces, and because of this, PCA and similar approaches/algorithms are outside the scope of this module. If you decide to continue your studies to higher levels, you will encounter PCA etc. in other modules such as Machine Learning, where the various challenges of using it will be explored.
For Assessment-B, you don't have to use PCA, and using it may limit your ability to explain your work and discoveries in the data.
Week 11
Week 12
Q: Is there a class/topic for this week?
A: There is a topic for this week, but most people will be busy finishing Assessment-B, and maybe you are still working on your final assessment in your other module. It will be a busy week. I've recorded videos of the lecture, and you can view these at any time, along with the additional readings. The topic is not assessed. Instead of covering the topic live during our scheduled class time, we will have a Q&A session for Assessment-B. I think most people found the same session for Assessment-A useful earlier in the semester; we'll do similar this week.
Week 13
Q: Is there a class/topic for this week?
A: There is no class this week, as people will be busy completing and submitting Assessment-B.