Advanced Python for Data Science Assignment 1

  1. Do the following using the Linux/Mac shell or GitBash on Windows:

    1. Navigate to the directory where you would like to store materials for the class. While doing so, experiment with tab-based autocomplete; it will make navigating the shell much quicker.
    2. Create a directory for all of your homework related material for this course.
    3. Navigate into this directory and create a new subdirectory for this assignment.
    4. Using nano (or a command line editor of your choosing) create a text file titled diary.txt.
    5. Enter today’s date and the statement “I am now a command line genius! Waa haa haa haa haa.” You may substitute a different transcription of your evil laugh if appropriate.
    6. Realize that you don’t want your enemies to be able to find out your secret so easily and change the name of your file to ornithology_notes.txt. No one would ever want to look at that.
    7. Make a backup of this file by creating a copy of it called ornithology_notes_backup.txt.
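    The steps above can be sketched as shell commands. The directory names here are placeholders (pick your own), and since nano is an interactive editor, `date` and `echo` stand in for it so the sketch runs end to end:

```shell
# Sketch of steps 2-7. Directory names are placeholders; pick your own.
mkdir -p apds_homework/assignment1    # homework dir plus a subdir for this assignment
cd apds_homework/assignment1

# nano is interactive, so date/echo stand in here to build the same file.
date > diary.txt
echo 'I am now a command line genius! Waa haa haa haa haa.' >> diary.txt

mv diary.txt ornithology_notes.txt    # rename to hide your secret
cp ornithology_notes.txt ornithology_notes_backup.txt
```

    In your own solution, replace the `date`/`echo` lines with an editing session in nano (or the editor of your choice).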
  2. We have collected data on bird communities that we need to analyze. The data has three columns: a date, a common name, and a count of the number of individuals.

    2013-03-22 bluejay 5
    2013-03-22 mallard 9
    2013-03-21 robin 1
    

    Download one of these files using the curl command:

    curl -O https://nyu-cds.github.io/courses/data/data_drycanyon_2013.txt

    If we wanted to find the least common species in the data file and store that information we could do something like:

    sort data_drycanyon_2013.txt -k 3 -n > sorted_counts.txt
    head -1 sorted_counts.txt > least_common_species.txt
    

    Now we want to get the most common species at the site. You can do this using the tail command. Since we don’t need the intermediate sorted_counts.txt file, use a pipe instead of writing that file to disk.

    Save both the curl command and the one line command for storing the most common species in a text file called get_most_common_species.sh.
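    One possible shape for the pipeline part of the script is sketched below. The curl line is commented out, and a three-line sample stands in for the downloaded file so the pipeline can be tried offline; the output filename most_common_species.txt is an assumption, not specified by the assignment.

```shell
# In get_most_common_species.sh the download line would run for real:
# curl -O https://nyu-cds.github.io/courses/data/data_drycanyon_2013.txt
# Here a small sample stands in for the downloaded file.
printf '2013-03-22 bluejay 5\n2013-03-22 mallard 9\n2013-03-21 robin 1\n' > data_drycanyon_2013.txt

# Sort numerically on the count column, then keep the last (largest) line.
# The output filename is an assumption; name it whatever you like.
sort data_drycanyon_2013.txt -k 3 -n | tail -1 > most_common_species.txt
```

    Note how `tail -1` mirrors the `head -1` from the earlier example: after an ascending numeric sort, the first line is the least common species and the last line is the most common.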

  3. We have collected data on bird communities that we need to analyze. The data has three columns: a date, a common name, and a count of the number of individuals.

    2013-03-22 bluejay 5
    2013-03-22 mallard 9
    2013-03-21 robin 1
    

    Download the data files using the curl command:

    curl -O https://nyu-cds.github.io/courses/data/data_drycanyon_2013.txt
    curl -O https://nyu-cds.github.io/courses/data/data_greencanyon_2013.txt
    curl -O https://nyu-cds.github.io/courses/data/data_logancanyon_2013.txt
    

    We want to count the total number of individuals of each species that were seen in each data file. We could solve this problem ourselves, but our lab mate has already written some code that does this. Instead of rewriting the code ourselves, we can simply add it to a pipeline. Let’s go ahead and download the file:

    curl -O https://nyu-cds.github.io/courses/code/species_counts.py

    To run this code we need to tell the shell to run it with Python: we give the shell the name of the interpreter (python), then the name of our program, and then the input file.

    python species_counts.py data_greencanyon_2013.txt

    This can then be integrated into our pipeline. So if we want to sort based on the total number of individuals:

    python species_counts.py data_greencanyon_2013.txt | sort -k 2 -n

    This is great for a single data file with a particular name, but we’ve been collecting data on birds from a number of different places and we’d like to run all of these analyses at once. Write a simple for loop that loops over all of the files in the current directory matching data_*.txt, prints the name of each data file, and then runs species_counts.py on it. Save this in a text file called all_species_counts.sh.
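    The loop part of such a script can be sketched as below. Since the real species_counts.py comes from the course download, a minimal stand-in (assumed interface: takes a filename, prints one "species total" line per species) is created here so the example is self-contained; in your solution you would use the downloaded script.

```shell
# Stand-in for the course's species_counts.py (assumed interface only):
# takes a filename, prints "species total" for each species in the file.
cat > species_counts.py <<'EOF'
import sys, collections
counts = collections.Counter()
with open(sys.argv[1]) as f:
    for line in f:
        date, species, n = line.split()
        counts[species] += int(n)
for species, total in sorted(counts.items()):
    print(species, total)
EOF

# A small sample data file so the glob below matches something.
printf '2013-03-22 bluejay 5\n2013-03-21 bluejay 2\n' > data_example_2013.txt

# The loop itself: print each matching filename, then run the script on it.
# (The course materials write "python"; python3 is used here for portability.)
for datafile in data_*.txt
do
    echo "$datafile"
    python3 species_counts.py "$datafile"
done
```

    The glob `data_*.txt` is expanded by the shell before the loop runs, so the loop body sees one real filename at a time.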