Predictive Analysis and Classification

Closed | Posted 2 months ago | Paid on delivery

Overview of the Task:

The original test sheets contain many data sets, each with 49 numbers. Each data set is a column. In each data set, 7 of the 49 numbers are selected as process numbers (also referred to below as pattern numbers) and are shown in bold red. The last (rightmost) column is the target data set for prediction; all other columns are data sets used for training the model. The project's ultimate objective is to predict the 7 process numbers of that last column using Machine Learning models. We use 5 different types of ML models to predict these 7 numbers for the target data set, which is the last column of each test sheet.

During prediction we made a number of observations, which are described below. We now need expert data scientists to develop methods or approaches that address these observations and improve the prediction accuracy.

This task named “Data analysis and classification” is for that objective.

We have predicted the 7 process numbers of approximately 50 data sets using these 5 ML models at various test sizes. The prediction results are collected in the Excel workbook named “Comparison of prediction results of 50 data sets”. The workbook is read as follows:

1) The workbook has 50 sheets. The leftmost sheet is named 388 and the rightmost 438. Data is currently filled in up to sheet 431, a total of 44 data sets. The remaining sheets will be filled in as the data becomes available.

2) The sheet names are the numbers of the data sets, from 388 to 438. Each of these numbers is also the name of the target data set, the rightmost column of each test sheet.

3) One data set can have up to 6 or 7 test sheets, named 388-1, 3881A, 3881B, 388-2 … up to 388-5. Each test sheet has a varying number of training data sets and one target data set. The number of data sets in each test sheet is stated in the test sheet name.

4) A test sheet name starts with the number of the target variable (or target column) where we have to predict the 7 numbers.

5) Each of the 50 sheets of the workbook has a list of 9 numbers predicted by different ML models. The models used are: RF - Random Forest classifier, SVML - SVM classifier with linear kernel, SVMR - SVM classifier with RBF kernel, SVMP - SVM classifier with polynomial kernel, and NB - Naive Bayes classifier.

6) The actual 7 values (pattern numbers) are given in the coloured cells at the top left of each sheet. Wherever these numbers occur in the prediction results, they are highlighted in the same colours.

7) You will also notice names such as 388-1, 388-2, 388-3, 388-4, etc. These are variations of the test sheets for data set 388; in each of these 5 to 7 test sheets, 388 is the target column. We make predictions using each of these test sheets of various sizes.

8) Finally, we noticed that changing the test size during the train-test split gave better results. We therefore tested each model at different test sizes: 0.2, 0.3, 0.4, 0.5, 0.6. The test size is given in brackets next to each test sheet name.

9) At the top left of each sheet you will also notice 'Result type'. This describes a data manipulation criterion: 'No column removal' - no columns are removed from the test sheet; 'Two column removal' - the first two columns are removed from the test sheet; 'Four column removal' - the first four columns (the first four training data sets) are removed from the test sheet; and so on. This increased prediction accuracy slightly, so please keep an eye on this variable.
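The 'Result type' manipulation in point 9 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a test sheet is modelled as a list of columns with the target column last, and the names `RESULT_TYPE_TO_DROP` and `apply_result_type` are hypothetical, not part of the existing workbook or scripts.

```python
# Hypothetical mapping from the 'Result type' label to the number of leading
# training columns to drop. Only the three labels named in the brief are listed.
RESULT_TYPE_TO_DROP = {
    "No column removal": 0,
    "Two column removal": 2,
    "Four column removal": 4,
}


def apply_result_type(columns, result_type):
    """Drop the first k training columns and keep the target (last) column.

    Assumption: `columns` is a list of columns, each a list of 49 numbers,
    and the last entry is the target data set.
    """
    k = RESULT_TYPE_TO_DROP[result_type]
    training, target = columns[:-1], columns[-1]
    if k > len(training):
        raise ValueError("cannot remove more training columns than the sheet has")
    return training[k:] + [target]


# Placeholder sheet for illustration only: five training columns plus a target.
sheet = [[c] * 49 for c in range(6)]
trimmed = apply_result_type(sheet, "Two column removal")
print(len(trimmed))  # number of columns left, target included
```

Keeping the target column out of the slice means a removal rule can never drop the column that is being predicted.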

The Task:

A. First, look through the predictions in each sheet (there are 150 predictions per sheet), then count and tabulate the facts available there, such as:

a) How many of the pattern numbers have occurred in each type of prediction?

b) Which type of prediction has the highest number of correct pattern numbers?

c) Which type of prediction gives consistent results, i.e. a similar number of correct numbers repeatedly?

d) Variations in Dataset: Explore the variations of the same dataset (e.g., 388-1, 388-2) and note any significant differences in prediction accuracy.

e) Effect of Test Sizes: Investigate the impact of different test sizes (0.2, 0.3, 0.4, 0.5, 0.6) on prediction accuracy for each model.

f) Influence of 'Result Type': Assess how different 'Result Types' affect the accuracy, especially whether column removal enhances or hinders the predictions.

And so on….

All such observations will help us determine which type of model, at which test size, performs best.
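The tabulation asked for in part A could start from something like the following sketch. It assumes each prediction has been exported as a record of (model, test size, result type, predicted numbers); that record layout, the function names and the sample numbers are all made up for illustration and are not taken from the workbook.

```python
from collections import defaultdict
from statistics import mean, pstdev


def count_hits(predicted, actual):
    """Number of predicted values that are among the actual pattern numbers."""
    return len(set(predicted) & set(actual))


def summarise(records, actual):
    """Group hit counts by (model, test_size, result_type).

    Each record is a tuple: (model, test_size, result_type, predicted_numbers).
    A low 'spread' means the configuration gives a similar hit count repeatedly.
    """
    hits = defaultdict(list)
    for model, test_size, result_type, predicted in records:
        hits[(model, test_size, result_type)].append(count_hits(predicted, actual))
    return {
        key: {
            "runs": len(values),
            "mean_hits": mean(values),
            "max_hits": max(values),
            "spread": pstdev(values),
        }
        for key, values in hits.items()
    }


# Made-up numbers, for illustration only.
actual = [3, 11, 17, 24, 30, 38, 45]
records = [
    ("RF", 0.2, "No column removal", [3, 5, 11, 17, 20, 24, 29, 41, 48]),
    ("RF", 0.2, "No column removal", [1, 3, 11, 17, 22, 24, 33, 40, 47]),
    ("NB", 0.2, "No column removal", [2, 3, 9, 14, 26, 30, 35, 44, 49]),
]
summary = summarise(records, actual)
best = max(summary, key=lambda key: summary[key]["mean_hits"])
print(best, summary[best])
```

Questions a) to f) then become different groupings of the same hit counts: by model, by test size, by result type, or by test sheet variation if that is added to the key.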

B. Analyse each test sheet in detail using standard data science metrics to determine which characteristics of a test sheet or target data set give the best prediction result.

a) Prediction Accuracy: Calculate the overall accuracy of predictions for each test sheet. This involves assessing the ratio of correct predictions to the total number of predictions.

b) Precision, Recall, and F1 Score: Break down the performance using precision, recall, and F1 score metrics. Precision measures the accuracy of positive predictions, recall assesses the ability to capture all positive instances, and F1 score combines both metrics.

c) Feature Importance: If applicable, analyze the importance of features in the prediction. This is particularly relevant if certain columns or variables significantly influence the model's performance. You may use the SHAP graphs generated using interpretML to achieve this.

d) Hyperparameter Tuning: Explore the impact of hyperparameter tuning on model performance. Assess how adjustments to parameters influence the predictive accuracy.
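For item b), precision, recall and F1 can be computed directly from one list of predicted numbers and the 7 actual pattern numbers. The sketch below assumes a prediction is simply a set of numbers flagged as pattern numbers; the function name and the sample values are hypothetical.

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall and F1 for one prediction list.

    precision = correct predictions / all predicted numbers
    recall    = correct predictions / all actual pattern numbers
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Made-up numbers, for illustration only: 9 predicted values, 4 of them correct.
actual = [3, 11, 17, 24, 30, 38, 45]
predicted = [3, 5, 11, 17, 20, 24, 29, 41, 48]
precision, recall, f1 = precision_recall_f1(predicted, actual)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

With 4 correct numbers out of 9 predicted and 7 actual, precision is 4/9, recall is 4/7 and F1 is 0.5. Since every prediction list has 9 entries and every target has 7 pattern numbers, precision is capped at 7/9 even for a perfect prediction.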

C. Analyse each data set (each data set is one column of 49 numbers) in detail using metrics that can be derived from the data set itself, without considering the prediction results.

a) Descriptive Statistics: Compute basic descriptive statistics such as mean, median, standard deviation, minimum, and maximum values. This provides an initial understanding of the central tendency and variability of the dataset.

b) Data Distribution: Visualize the distribution of the dataset using histograms, box plots, or kernel density plots. This helps identify any skewness, outliers, or patterns within the data.
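A minimal sketch of part C using only the Python standard library is shown below. The `describe` helper is hypothetical, the input column is a placeholder rather than real data, and the 1.5 x IQR rule for outliers is one common convention (the same one box plots usually draw), not something specified in the brief.

```python
from statistics import mean, median, quantiles, stdev


def describe(column):
    """Basic descriptive statistics for one data set (one column)."""
    q1, _, q3 = quantiles(column, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(column),
        "mean": mean(column),
        "median": median(column),
        "std": stdev(column),  # sample standard deviation
        "min": min(column),
        "max": max(column),
        "outliers": [x for x in column if x < low or x > high],
    }


# Placeholder column for illustration only; real columns come from the test sheets.
column = list(range(1, 50))
stats = describe(column)
print(stats)
```

Running `describe` over every column of a test sheet gives a table of per-column statistics that can later be compared against the prediction results.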

The objective of this analysis and expected results:

After this detailed study and analysis, we expect to have the following abilities/knowledge:

I) Be able to classify or categorise the Test Sheets into categories or classes like:

a) Most friendly with SVM linear with ----test size.

b) Needs removal or addition of data sets so that its metric values reach the range that gives better prediction results.

c) ……

d) …..

II) Be able to classify or categorise individual data sets into categories or classes like:

a) Most friendly with SVM linear with ----test size.

b) Needs removal or addition of data sets so that its metric values reach the range that gives better prediction results.

c) ….

d) ……

III) Be able to remove or add training data sets from a test sheet to get the highest possible number of correct predictions per different types of prediction models and test size.

IV) Any other corrective actions to help us get high prediction accuracy

Plan of Action

To verify the predictions of these models, a few metrics have to be computed. These metrics describe how well a model performs. They are listed below with details:

Accuracy: The proportion of correctly classified occurrences, as defined by the pattern set. Count the predictions that match the pattern set and compute the proportion. The complement of this proportion is the error rate. We know the threshold and use it to interpret the results.

Confusion Matrix: Accuracy alone is not enough to judge a model. Carry out an in-depth analysis using the underlying counts. The matrix gives the True Positives, True Negatives, False Positives and False Negatives. These measures show where the errors lie and whether a particular model can be relied on.

Sensitivity and Specificity: Sensitivity shows how many of the actual pattern numbers were predicted as pattern numbers (true positives). Specificity shows how many of the non-pattern numbers were correctly identified as non-pattern numbers (true negatives).
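The metrics in this plan can all be derived from the four confusion-matrix counts. The sketch below assumes the candidate pool is the integers 1 to 49 and that a prediction is a list of numbers flagged as pattern numbers; both are assumptions for illustration, as are the function names and the sample values.

```python
def confusion_counts(predicted, actual, pool=range(1, 50)):
    """Return (TP, FP, FN, TN) for one prediction over the candidate pool.

    Assumption: the pool of candidate numbers is 1..49.
    """
    predicted, actual, pool = set(predicted), set(actual), set(pool)
    tp = len(predicted & actual)          # pattern numbers that were predicted
    fp = len(predicted - actual)          # predicted, but not pattern numbers
    fn = len(actual - predicted)          # pattern numbers that were missed
    tn = len(pool - predicted - actual)   # correctly left out
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """Accuracy, error rate, sensitivity and specificity from the counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Made-up numbers, for illustration only.
actual = [3, 11, 17, 24, 30, 38, 45]
predicted = [3, 5, 11, 17, 20, 24, 29, 41, 48]
counts = confusion_counts(predicted, actual)
result = rates(*counts)
empty = rates(*confusion_counts([], actual))
print(counts, result)
```

Because only 7 of the 49 numbers are positives, a prediction that flags nothing at all still scores an accuracy of 42/49, which is higher than the 41/49 of the example with 4 correct numbers. This is a concrete reason to read sensitivity and specificity alongside accuracy.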

Statistical Analysis, Statistics, Analytics, Machine Learning (ML), Data Science

Project ID: #37842295

About the project

17 bids | Remote project | Active 1 month ago

17 freelancers are bidding an average of ₹14812 per hour for this job

sajjadtaghvaeifr

Hi, I hope you are doing fine. I have almost 10 years of experience in machine learning algorithms. I can implement various types of artificial intelligence algorithms including yours with Matlab, Python etc. I hav…

₹75000 INR in 7 days
(40 reviews)
6.8
nrajuu07

I am excited to submit my proposal for the task of analyzing and classifying predictions generated by machine learning models for your data sets. After carefully reviewing the project overview and requirements, I am co…

₹5500 INR in 7 days
(91 reviews)
6.0
eldadshericks

Hello there, I am a Data Analyst by practice, and by using appropriate statistical software, I can help you discover facts from your pool of data. I run both quantitative and qualitative analyses, professionally; I use…

₹6000 INR in 3 days
(82 reviews)
6.3
HiraMahmood4072

Hlo! I have done MS in statistics. I read your job description. I have expertise in SPSS, R studio, excel, ML, data science and statistical analysis. I will provide you with the finest work that perfectly aligns with y…

₹5500 INR in 1 day
(30 reviews)
4.8
Shahzadanalyst

With a passion for data analysis and problem-solving, I believe I'm the ideal candidate for your project. My proficiency in Python libraries, including Pandas, Numpy, Matplotlib, Scipy, and Scikit-Learn aligns perfectl…

₹5500 INR in 3 days
(5 reviews)
2.9
harshbhatti7704

Dear Hiring Manager, I am excited about the opportunity to analyze and classify the data sheets for your project. With my expertise in statistics, machine learning, and data science, I am confident in conducting compr…

₹5200 INR in 1 day
(3 reviews)
1.6
janita24

Hi There Rajat K., Good afternoon! My name is Jane a skilled data analyst with skills including Statistical Analysis, Statistics, Machine Learning (ML), Data Science and Analytics. I have over 5 years in tutoring data…

₹6105 INR in 4 days
(2 reviews)
0.0
ainyrehman12

Hi there, I am a Data Scientist proficient in Python. I can assist in analyzing your dataset and enhancing prediction accuracy using machine learning models. With expertise in data analysis and classification, I'll condu…

₹5000 INR in 7 days
(0 reviews)
0.0
shantgaurav10121

Hello I have an expertise on improving the prediction model accuracy using Python and machine learning. I have completed around 30+projects in python so far. I will be happy to do analysis for you.

₹5500 INR in 2 days
(0 reviews)
0.0
pcnguyen2901

Having acquired knowledge of deep learning through courses on Coursera, I am confident in my ability to successfully complete projects in this domain. My coursework has equipped me with a comprehensive understanding of…

₹6000 INR in 7 days
(0 reviews)
0.0
gargmolugarg

Hi! I studied machine learning methods at the London School of Economics (THE Ranked 11, QS Ranked 8, for economics and social sciences)! I'm familiar with running Support Vector Machines in R and Python, and can provi…

₹6000 INR in 14 days
(0 reviews)
0.0
subhpram

Dear hiring manager, I can help you in this Predictive Analysis and Classification assignment. It suits my experience and capability. I am very interested in discussing the assignment details further. I am available…

₹12000 INR in 7 days
(0 reviews)
0.0
AryAyush01

I am offering to complete the entire project within one week for a total cost of 6000 INR. My proposal includes thorough research, precise execution, and timely delivery of the project to meet your expectations.

₹6000 INR in 7 days
(0 reviews)
0.0
NormanDaniel711

Hi, I have one year of experience in data science and analytics, and I graduated with a master's degree. I am ready for work at any time. I have conducted numerous analyses, and my expertise includes Machine Learning,…

₹5500 INR in 9 days
(0 reviews)
0.0