Learning About Data Science
Much of what I have learned of data science has come from reading books - lots of books. Many of these are listed in my Good Reads list. If I could only recommend one book, though, it would have to be R for Data Science by Wickham and Grolemund. I have also learned (and practiced) additional skills by taking a variety of different online courses. I have also found these to be useful as I have refined my own online teaching skills.
I have completed several courses on Coursera from the University of Michigan and John Hopkins University. The built in quizzes and exercises are not too challenging, but they do give you a chance to try out R and Python on real datasets. I personally found that the lectures ran from Freshman-level to first-year graduate-level pretty quickly. I suspect that most people taking these MOOCs would not track with the math presented very well.
I have also recently begun to try out courses on DataCamp. Here, the instruction are more basic - a short video lecture is presented and then you are left off to solve a practical problem. DataCamp has a web plugin for you to write your code and check it out online. It pays to read the instructions for each problem carefully. They are looking for one specific solution as the answer.
I have also gotten involved with The Carpentries. DataCarpentry and SoftwareCarpentry are communities of instructors, trainers, maintainers, helpers, and supporters who share a mission to teach foundational computational and data science skills to researchers. I have gone through a training program and am now a certified instructor for the Carpentries.
I have slowly developed a set of data science tools that I feel proficient with. These particular tools are operating system agnostic - they can be run on Windows, Macintosh, and Linux. In fact, I do run them under all three regularly. I have also set them up on this DigitalOcean droplet so that I can access them with my Chromebook too. These tools can be accessed from the
Data Science tab in the top navigation bar of this page.
At various times in my career, I have use many of the common Unix shells - csh, ksh, and tcsh. My current shell-of-choice is bash (which I use on all of my computing devices). The command line interface is my main way of writing text documents (like this file). I prefer to write using BBEdit, but I also use
emacs to get the job done. For version control, I like to use
GitHub (though I also have
BitBucket set up to teach others. I use a lot of Unix tools like
make to quickly automate repetitive and boring tasks. An excellent resource to get introduced to the command line world is A Practical Guide to Linux Commands, Editors, and Shell Programming.
Nearly all datasets that I get to work with are messy. Values are miscoded, there are missing values, duplicate entries, typos and poor formatting. OpenRefine is an program that helps to clean up those problems. I am still learning this particular tool, but it is a great way to get a pile of messy data cleaned up in a reproducible way.
Python is a powerful and approachable programming language that can be used for data science (and many other things too). Most times, I use Jupyter Notebooks to write Python code. On my machines, I have loaded kernels for
Octave. This allows my to use many different languages within my data analysis projects. Gotta use the right tool for the job…
I also use relational databases to store my data. These include
mySQL or, more often,
SQLite databases. I can easily access these databases from the command line or using Python and R.
My tool of choice for Data Science is
R. I nearly always use the integrated development environment RStudio to access this programming language.
R programming language can be expanded (dramatically) by adding in extensions via packages. There are a LOT of these available. Here is a brief list of some of the ones that I use the most often.