The study of biology has changed dramatically since I earned my Bachelor's degree. Technological progress has made the collection and analysis of huge datasets commonplace. I firmly believe that all modern biologists must now not only possess content knowledge from the field, but also be proficient in data science skills. Therefore, I have been teaching myself so that I can prepare our Ferris students for biological research in the twenty-first century.

Learning About Data Science

Much of what I have learned about data science has come from reading books - lots of books. Many of these are listed in my Goodreads list. If I could recommend only one book, though, it would be R for Data Science by Wickham and Grolemund. I have also learned (and practiced) additional skills by taking a variety of online courses, which have been useful as I have refined my own online teaching skills.

Coursera

I have completed several courses on Coursera from the University of Michigan and Johns Hopkins University. The built-in quizzes and exercises are not too challenging, but they do give you a chance to try out R and Python on real datasets. I personally found that the lectures moved from freshman level to first-year graduate level pretty quickly. I suspect that most people taking these MOOCs would have trouble following the math presented.

DataCamp

I have also recently begun to try out courses on DataCamp. Here, the instruction is more basic - a short video lecture is presented and then you are left to solve a practical problem. DataCamp has a web plugin that lets you write and check your code online. It pays to read the instructions for each problem carefully, because they are looking for one specific solution as the answer.

The Carpentries

I have also gotten involved with The Carpentries. Data Carpentry and Software Carpentry are communities of instructors, trainers, maintainers, helpers, and supporters who share a mission to teach foundational computational and data science skills to researchers. I have gone through a training program and am now a certified instructor for The Carpentries.

My Toolbox

I have slowly developed a set of data science tools that I feel proficient with. These particular tools are operating system agnostic - they can be run on Windows, Macintosh, and Linux. In fact, I do run them under all three regularly. I have also set them up on this DigitalOcean droplet so that I can access them with my Chromebook too. These tools can be accessed from the Data Science tab in the top navigation bar of this page.

bash

At various times in my career, I have used many of the common Unix shells - csh, ksh, and tcsh. My current shell of choice is bash (which I use on all of my computing devices). The command line interface is my main way of writing text documents (like this file). I prefer to write using BBEdit, but I also use vim, nano, and emacs to get the job done. For version control, I like to use git and GitHub (though I also have Mercurial and Bitbucket set up to teach others). I use a lot of Unix tools like awk, sed, and make to quickly automate repetitive and boring tasks. An excellent introduction to the command line world is A Practical Guide to Linux Commands, Editors, and Shell Programming.

OpenRefine

Nearly all datasets that I get to work with are messy. They contain miscoded values, missing values, duplicate entries, typos, and poor formatting. OpenRefine is a program that helps to clean up those problems. I am still learning this particular tool, but it is a great way to get a pile of messy data cleaned up in a reproducible way.
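To make those kinds of problems concrete, here is a minimal sketch in plain Python (not OpenRefine itself) of a few typical cleanup steps: trimming stray whitespace, treating placeholder strings as missing, normalizing capitalization, and dropping duplicate entries. The `clean_records` function, the `species` and `height_m` column names, and the sample rows are all hypothetical and exist only for illustration.

```python
# Hypothetical placeholders that should be treated as missing values.
MISSING_MARKERS = {"", "NA", "na", "N/A"}


def clean_records(rows):
    """Return a cleaned copy of rows (a list of dicts with string values)."""
    seen = set()
    cleaned = []
    for row in rows:
        tidy = {}
        for key, value in row.items():
            value = value.strip()  # remove stray leading/trailing whitespace
            tidy[key] = None if value in MISSING_MARKERS else value
        # Normalize capitalization of the (assumed) species column.
        if tidy.get("species") is not None:
            tidy["species"] = tidy["species"].capitalize()
        # Skip rows that are exact duplicates after cleaning.
        fingerprint = tuple(sorted(tidy.items(), key=lambda item: item[0]))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(tidy)
    return cleaned


# Made-up example rows showing typical messiness.
rows = [
    {"species": " quercus alba", "height_m": "12.5"},
    {"species": "Quercus alba ", "height_m": "12.5"},
    {"species": "ACER RUBRUM", "height_m": "NA"},
]
cleaned = clean_records(rows)
```

Because every step lives in a script, the same cleanup can be rerun on a fresh copy of the raw data, which is the reproducibility that matters here.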

Python

Python is a powerful and approachable programming language that can be used for data science (and many other things too). Most times, I use Jupyter Notebooks to write Python code. On my machines, I have loaded kernels for Python 2, Python 3, R, and Octave. This allows me to use many different languages within my data analysis projects. Gotta use the right tool for the job…

SQL

I also use relational databases to store my data. These include MySQL or, more often, SQLite databases. I can easily access these databases from the command line or using Python and R.
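As a rough illustration of reaching a SQLite database from Python, the sketch below uses the standard library `sqlite3` module with an in-memory database. The `trees` table, its columns, the sample values, and the `mean_height_by_species` helper are all made up for this example and do not come from any of my real projects.

```python
import sqlite3


def mean_height_by_species(conn):
    """Return (species, mean height) pairs from the hypothetical trees table."""
    query = """
        SELECT species, AVG(height_m) AS mean_height
        FROM trees
        GROUP BY species
        ORDER BY species
    """
    return conn.execute(query).fetchall()


# An in-memory database keeps the example self-contained;
# a real project would pass a file path to connect() instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trees (species TEXT, height_m REAL)")
conn.executemany(
    "INSERT INTO trees VALUES (?, ?)",
    [("Acer rubrum", 10.0), ("Acer rubrum", 14.0), ("Quercus alba", 12.5)],
)
conn.commit()

summary = mean_height_by_species(conn)
conn.close()
```

The `?` placeholders let `sqlite3` handle quoting of the inserted values, which is safer than building SQL strings by hand.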

R

My tool of choice for Data Science is R. I nearly always use the integrated development environment RStudio to access this programming language.

My ‘R-senal’

The R programming language can be expanded (dramatically) by adding in extensions via packages. There are a LOT of these available. Here is a brief list of some of the ones that I use the most often.

RStudio, readr, bookdown, broom, devtools, dplyr, forcats, ggplot2, googledrive, googlesheets, knitr, lubridate, pipe, readxl, rmarkdown, roxygen2, rsample, shiny, stringr, testthat, tibble, tidyr, tidyverse
