Data Science

The study of biology has changed dramatically since I earned my Bachelor's degree. Technological progress has made the collection and analysis of huge databases of information commonplace. I firmly believe that all modern biologists must now not only possess content knowledge from the field, but also be proficient in data science skills. Therefore, I have been teaching myself so that I can prepare our Ferris students for biological research in the twenty-first century.

Learning About Data Science

Much of what I have learned of data science has come from reading books - lots of books. Many of these are listed in my Goodreads list. If I could only recommend one book, though, it would have to be R for Data Science by Wickham and Grolemund. I have also learned (and practiced) additional skills by taking a variety of online courses. These courses have been useful as well for refining my own online teaching skills.

Coursera

I have completed several courses on Coursera from the University of Michigan and Johns Hopkins University. The built-in quizzes and exercises are not too challenging, but they do give you a chance to try out R and Python on real datasets. I personally found that the lectures ran from freshman level to first-year graduate level pretty quickly. I suspect that most people taking these MOOCs would have trouble following the math presented.

DataCamp

I have also recently begun to try out courses on DataCamp. Here, the instruction is more basic - a short video lecture is presented and then you are left to solve a practical problem. DataCamp has a web plugin where you can write your code and check it online. It pays to read the instructions for each problem carefully, because they are looking for one specific solution as the answer.

Chromebook Data Science

During the fall semester of 2019, I am hoping to find the time to complete another online course. This course was created by Jeff Leek and others at the Johns Hopkins Bloomberg School of Public Health. It consists of twelve online courses that demonstrate how data science analysis can be done even on devices as limited as Chromebooks.

The Carpentries

I have also gotten involved with The Carpentries. Data Carpentry and Software Carpentry are communities of instructors, trainers, maintainers, helpers, and supporters who share a mission to teach foundational computational and data science skills to researchers. I have gone through a training program and am now a certified instructor for The Carpentries.

Masters in Data Science

If you are like me, and are interested in learning more about data science, you might find these resources useful. This site provides links to many different schools offering graduate degrees in data science. Many of these are online or blended delivery programs. You can also find more information about data science careers and programs at this website.

My Toolbox

I have slowly developed a set of data science tools that I feel proficient with. These particular tools are operating system agnostic - they can be run on Windows, Macintosh, and Linux. In fact, I do run them under all three regularly. I have also set them up on this DigitalOcean droplet so that I can access them with my Chromebook too. These tools can be accessed from the Data Science tab in the top navigation bar of this page.

bash

At various times in my career, I have used many of the common Unix shells - csh, ksh, and tcsh. My current shell-of-choice is bash (which I use on all of my computing devices). The command line interface is my main way of writing text documents (like this file). I prefer to write using BBEdit, but I also use vim, nano, and emacs to get the job done. For version control, I like to use git and GitHub (though I also have Mercurial and Bitbucket set up to teach others). I use a lot of Unix tools like awk, sed, and make to quickly automate repetitive and boring tasks. An excellent resource to get introduced to the command line world is A Practical Guide to Linux Commands, Editors, and Shell Programming.

OpenRefine

Nearly all datasets that I get to work with are messy. Values are miscoded or missing, and there are duplicate entries, typos, and poor formatting. OpenRefine is a program that helps to clean up those problems. I am still learning this particular tool, but it is a great way to get a pile of messy data cleaned up in a reproducible way.
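To make the kinds of problems listed above concrete, here is a minimal sketch in plain Python. It is not OpenRefine and does not use OpenRefine's interface; it only illustrates the same cleanup ideas (trimming whitespace, fixing inconsistent capitalization, marking unusable values as missing, and dropping duplicates). The function name clean_records and the sample records are made up for illustration.

```python
def clean_records(records):
    """Tidy (species, count) pairs that arrive as raw strings.

    - trims stray whitespace and lower-cases the species name
    - converts the count to an int, or None when it is blank or not a number
    - drops rows that are exact duplicates after cleaning
    """
    cleaned = []
    seen = set()
    for species, count in records:
        species = species.strip().lower() or None
        count = count.strip()
        count = int(count) if count.isdigit() else None
        row = (species, count)
        if row not in seen:  # keep only the first copy of each cleaned row
            seen.add(row)
            cleaned.append(row)
    return cleaned


# Hypothetical messy input: extra spaces, mixed case, blank and miscoded counts.
messy = [(" Bluegill", "12"), ("bluegill ", "12"), ("Perch", ""), ("perch", "n/a")]
print(clean_records(messy))
```

Because every step lives in a function, the same cleanup can be rerun on a new export of the data, which is the reproducibility benefit described above.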

Python

Python is a powerful and approachable programming language that can be used for data science (and many other things too). Most times, I use Jupyter Notebooks to write Python code. On my machines, I have loaded kernels for Python 2, Python 3, R, and Octave. This allows me to use many different languages within my data analysis projects. Gotta use the right tool for the job…

SQL

I also use relational databases to store my data. These include MySQL or, more often, SQLite databases. I can easily access these databases from the command line or using Python and R.
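As a small illustration of reaching a SQLite database from Python, here is a sketch that uses the sqlite3 module from the Python standard library. The table name specimens, its columns, and the sample measurements are hypothetical, and the database lives in memory so that the example leaves no file behind.

```python
import sqlite3


def mean_length_by_species(rows):
    """Load (species, length_mm) rows into SQLite and average length per species."""
    conn = sqlite3.connect(":memory:")  # temporary in-memory database
    try:
        conn.execute(
            "CREATE TABLE specimens (species TEXT NOT NULL, length_mm REAL NOT NULL)"
        )
        # Parameterized inserts keep the data separate from the SQL text.
        conn.executemany(
            "INSERT INTO specimens (species, length_mm) VALUES (?, ?)", rows
        )
        cursor = conn.execute(
            "SELECT species, AVG(length_mm) FROM specimens "
            "GROUP BY species ORDER BY species"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Made-up sample data for illustration only.
sample = [("bluegill", 120.0), ("bluegill", 140.0), ("perch", 200.0)]
print(mean_length_by_species(sample))
```

Replacing ":memory:" with a file path would open (or create) a database file on disk instead.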

R

My tool of choice for Data Science is R. I nearly always use the integrated development environment RStudio to access this programming language. I have set up an RStudio server with my DigitalOcean droplet to allow me to work using my Chromebook when I want. In addition, I have set up a Shiny Server as well. This server enables me to deliver interactive data applications, reports, and dashboards.

My ‘R-senal’

The R programming language can be expanded (dramatically) by adding in extensions via packages. There are a LOT of these available. Here is a brief list of some of the ones that I use the most often.