Write three very simple Python scripts to reduce .csv files

  • Status: Closed
  • Prize: $35
  • Entries received: 2
  • Winner: sarasixti

Contest Instructions

I need three simple scripts to reduce data (information about mutations in proteins):

All the scripts should work from the Mac command line and with arbitrary input files, e.g.:
python [login to view URL] [login to view URL]

Here are the scripts:
Script 1:
Reduce [login to view URL] file to generate [login to view URL] file by copying only the lines with a unique combination of "Organism" and "Total number of mutations" values (please find the files in the attachment; XXX is just an example, a generic name of a protein):

For instance, if [login to view URL] file looks like this:

Organism: Total number of mutations:
Helicobacter pylori 0
Helicobacter pylori 0
Helicobacter pylori 1
Helicobacter pylori 0
Escherichia coli 0
Escherichia coli 2
Escherichia coli 1
Escherichia coli 0
Escherichia coli 1

then the XXX_reduced file will look like this:

Organism: Total number of mutations:
Helicobacter pylori 0
Helicobacter pylori 1
Escherichia coli 0
Escherichia coli 2
Escherichia coli 1
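
A minimal sketch of Script 1, assuming the files are comma-separated with a header row whose column names are exactly "Organism" and "Total number of mutations" (the attached files may differ, for example by trailing colons in the headers). The names `unique_rows` and `reduce_file` are illustrative, not part of the task description.

```python
import csv
import sys

# Assumed header names; adjust to match the real files.
KEY_COLUMNS = ["Organism", "Total number of mutations"]


def unique_rows(rows, key_columns=KEY_COLUMNS):
    """Keep the first row for each distinct combination of key_columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple((row[col] or "").strip() for col in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def reduce_file(in_path, out_path):
    """Read in_path, drop repeated rows, write the result to out_path."""
    with open(in_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        kept = unique_rows(reader)
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)


# Runs only when called with exactly two arguments: input path, output path.
if __name__ == "__main__" and len(sys.argv) == 3:
    reduce_file(sys.argv[1], sys.argv[2])
```

Rows keep their original order, so the first occurrence of each combination is the one that survives, as in the example above.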


Script 2:
Reduce [login to view URL] to [login to view URL] using the following rule:
for all organism names that share the same first word (the genus), keep only the first line with the highest number of mutations:

For instance, the lines:

Organism: Total number of mutations:
Helicobacter acinonychis 1
Helicobacter bilis 0
Helicobacter cetorum 0
Helicobacter cinaedi 2
Helicobacter felis 2

will be reduced to:
Helicobacter cinaedi 2
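
One way to sketch this rule, under the same assumptions as before (comma-separated input, header names "Organism" and "Total number of mutations", integer mutation counts). The function name `best_per_genus` is hypothetical. A strictly-greater comparison is used so that, among ties, the first line wins, which is how "Helicobacter cinaedi 2" beats "Helicobacter felis 2" in the example.

```python
import csv
import sys


def best_per_genus(rows, name_col="Organism", count_col="Total number of mutations"):
    """For each first word of the organism name, keep the first row with the highest count."""
    best = {}  # genus -> row; dicts keep insertion order (Python 3.7+)
    for row in rows:
        words = (row[name_col] or "").split()
        if not words:
            continue  # skip rows without an organism name
        genus = words[0]
        count = int(row[count_col])  # assumes counts are plain integers
        # Strict ">" means a later row with an equal count does not replace an earlier one.
        if genus not in best or count > int(best[genus][count_col]):
            best[genus] = row
    return list(best.values())


def reduce_file(in_path, out_path):
    """Read in_path, keep one row per genus, write the result to out_path."""
    with open(in_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        kept = best_per_genus(reader)
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)


# Runs only when called with exactly two arguments: input path, output path.
if __name__ == "__main__" and len(sys.argv) == 3:
    reduce_file(sys.argv[1], sys.argv[2])
```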


Script 3:
Finally, I need a script [login to view URL] to create a [login to view URL] file from multiple .csv files:

python [login to view URL] [login to view URL]
[login to view URL] [login to view URL]

This script should generate the [login to view URL] file as follows:

• Merge all .csv files into one .csv file
• Remove all the columns but "Organism name Organism Groups Lifestyle Size (Mb) GC%"
• Remove all the duplicates, so that the .csv contains only lines with a unique combination of "Organism name Organism Groups Lifestyle Size (Mb) GC%" values.

For instance, the list:


Organism name Organism Groups Lifestyle Size (Mb) GC%
Bla-bla1 X A 1 20
Bla-bla1 X A 1 20
Bla-bla2 X A 2 30
Bla-bla2 X B 2 30


should be reduced to:

Organism name Organism Groups Lifestyle Size (Mb) GC%
Bla-bla1 X A 1 20
Bla-bla2 X A 2 30
Bla-bla2 X B 2 30



• Add columns "XXX", "YYY", "ZZZ" and fill them using the following rule: for each "Organism name" in the [login to view URL] file, find the identical "Organism name" value in the [login to view URL] file, take the corresponding value from the "Total number of mutations" column of the [login to view URL] file, and write it into the XXX column of the [login to view URL] file (and likewise for YYY and ZZZ). If the [login to view URL] file does not have a matching "Organism name" entry, write "-" in the XXX column of the [login to view URL] file instead.
• Finally, add a column "Total number of mutated proteins" to the [login to view URL] file and fill it by counting how many of the XXX, YYY and ZZZ columns are not equal to "0" or "-". For example:


Organism: XXX YYY ZZZ Total number of mutated proteins
Helicobacter cinaedi 2 1 5 3
Escherichia coli 0 0 0 0
Weird name 1 0 4 2
Gibberish word 0 - 4 1


• So, at the end, the [login to view URL] file will have the following columns:
Organism name Organism Groups Lifestyle Size (Mb) GC% Organism XXX YYY ZZZ Total number of mutated proteins
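
The steps for Script 3 can be sketched as follows. This is only an outline under stated assumptions: the inputs are comma-separated, the five kept headers split as "Organism name", "Organism Groups", "Lifestyle", "Size (Mb)", "GC%", the per-protein files use the headers "Organism" and "Total number of mutations", and when an organism appears more than once in a protein file the first value is used. All function names are hypothetical, the command-line wiring is left out because the exact invocation is not shown above, and the sketch does not reproduce the extra "Organism" column from the final column list.

```python
import csv

# Assumed split of the header line into five column names.
KEEP = ["Organism name", "Organism Groups", "Lifestyle", "Size (Mb)", "GC%"]


def merge_unique(sources, keep=KEEP):
    """Merge several CSV sources (open files or lists of lines),
    keep only the `keep` columns, and drop repeated combinations."""
    seen = set()
    merged = []
    for src in sources:
        for row in csv.DictReader(src):
            key = tuple((row.get(col) or "").strip() for col in keep)
            if key not in seen:
                seen.add(key)
                merged.append(dict(zip(keep, key)))
    return merged


def load_counts(lines, name_col="Organism", count_col="Total number of mutations"):
    """Map organism name -> mutation count (as text) for one protein file."""
    counts = {}
    for row in csv.DictReader(lines):
        name = (row.get(name_col) or "").strip()
        if name:
            # Assumption: the first occurrence of a name wins.
            counts.setdefault(name, (row.get(count_col) or "").strip())
    return counts


def add_mutation_columns(rows, protein_tables, name_col="Organism name"):
    """protein_tables maps a column label (e.g. "XXX") to a load_counts() dict.
    Adds one column per protein plus "Total number of mutated proteins"."""
    result = []
    for row in rows:
        new_row = dict(row)
        mutated = 0
        for label, counts in protein_tables.items():
            value = counts.get(row[name_col], "-")  # "-" when there is no match
            new_row[label] = value
            if value not in ("0", "-"):
                mutated += 1
        new_row["Total number of mutated proteins"] = str(mutated)
        result.append(new_row)
    return result


def write_csv(path, rows):
    """Write a list of row dicts to path, using the first row's keys as the header."""
    if not rows:
        return
    with open(path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Because `csv.DictReader` accepts any iterable of lines, `merge_unique` and `load_counts` work with open file objects as well as with in-memory lists, which makes them easy to try on a few sample rows before running them on the attached files.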

Please find the examples of the .csv files in the attachment. Also, if you are sure you can do this within an hour, please post on the Clarification board so that other Freelancers know that this project is most likely already taken care of by somebody.

When you complete the task, please take a screenshot of a small portion of the [login to view URL] file so that I can see that I explained the task clearly and can award the project.

Thank you.
Sergey


Public Clarification Board

  • sergeyvmelnikov
    Contest Holder
    • 3 months ago

    Dear all, it seems my explanations were not clear enough. I have attached three sample files to this contest and I was expecting you to process these files. So, rather than using the generic XXX, YYY and ZZZ from my explanations, I was expecting you to use AlaRS_new, IleRS_new and LeuRS_new.

  • roshansanthoshh
    • 3 months ago

    I have completed the code and can provide it to you right away.
