Here’s what I’ve been working on lately, along with things I’m thinking about or interested in. Generally these posts are divided into opinions, tutorials, and updates.
“The best way to predict the future is to create it.”
Just came out with a new post on Medium called Data Science Hiring Tips. If you’re out of free articles this month and would like to take a read, send me an email and I’ll send you a copy of the article. Lately, I’ve noticed that a lot of hiring tips are very general and there isn’t good advice about hiring in data science, so I’ve tried to make my article include very specific tips.
As of a few minutes ago, I will be working for IBM as a data scientist in Austin, TX. I’m glad to be working with a company that has a high ethical standard for data. I really believe IBM is an emerging leader in AI and data science and I am looking forward to taking this position!
I am now on Medium! My first article is “How to Win a Data Science Competition” and draws on some of my recent experiences and trends I noticed in other competitors. Mainly, I am not seeing Jupyter Notebooks used to their full potential.
I got to be a mentor at the first ever tamudatathon. I was lucky to have worked with two of the organizers, Josiah Coad and Chinmay Phulse, on a past data science project, and they allowed me to be a part of this. It was a great event, especially considering that it was the first of its kind; their only problems were a result of being too successful.
I have a new Twitter account for my goings-on in data science. It should be much like this blog, except more day-to-day and with fewer tutorials. When my team won the Texas A&M Institute of Data Science competition it was not because we had much better models. It was because we had these roles on our team, and we focused on making actionable results. https://t.co/lxwtD2feHn — Brandon Walker (@BranWalkerData) September 26, 2019
If you’ve read even a little bit of my blog you’ll notice I am a big fan of R. One of the best things about R, in my opinion, is its community of packages. They’re generally all well maintained, well written, and (most importantly) work well together. If you have a task you want to get done in R, I have two primary recommendations. 1. Check out the CRAN Task Views. They will help you find whatever packages are relevant to the task you want to get done.
I made my first tutorial over the tidyverse, which is the best style to follow when coding in R. Here’s a link to the interactive tutorial I set up. Please let me know if it is or is not good.
In the past year I’ve done two things that stand above all else in their value to my education, and I’d recommend you do the same. As the title suggests, I coded a neural network without using any external packages (except numpy). I think it’s alright to use numpy, as it will only handle your matrix multiplication. Doing this allows you to really understand activation functions, loss functions, learning rates, and backpropagation.
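To give a feel for that exercise, here is a minimal sketch of a tiny network written from scratch. It is not the code from the post: the layer sizes, starting weights, learning rate, and the XOR toy data are all assumptions picked for illustration, and it uses plain Python lists instead of numpy so it runs with no dependencies (the small loops stand in for the matrix multiplication numpy would handle). It shows a sigmoid activation function, a mean squared error loss function, a learning rate, and backpropagation in one place.

```python
import math

def sigmoid(z):
    """Activation function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Toy data (an assumption for this sketch): the XOR truth table.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [0.0, 1.0, 1.0, 0.0]

def init_params():
    # Fixed, asymmetric starting weights so every run gives the same result.
    return {
        "W1": [[0.5, -0.4], [0.3, 0.8]],  # W1[j][i]: input i -> hidden unit j
        "b1": [0.1, -0.1],
        "W2": [0.7, -0.6],                # W2[j]: hidden unit j -> output
        "b2": 0.05,
    }

def forward(p, x):
    """Forward pass: returns the hidden activations and the output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(p["W1"], p["b1"])]
    out = sigmoid(sum(w * hj for w, hj in zip(p["W2"], h)) + p["b2"])
    return h, out

def mse(p):
    """Loss function: mean squared error over the whole dataset."""
    return sum((forward(p, x)[1] - y) ** 2 for x, y in zip(X, Y)) / len(X)

def train(p, learning_rate=0.5, epochs=2000):
    n = len(X)
    for _ in range(epochs):
        # Accumulate gradients over the full batch before updating.
        gW1 = [[0.0, 0.0], [0.0, 0.0]]
        gb1 = [0.0, 0.0]
        gW2 = [0.0, 0.0]
        gb2 = 0.0
        for x, y in zip(X, Y):
            h, out = forward(p, x)
            # Backpropagation, output layer: d(loss)/d(pre-activation).
            d_out = 2.0 * (out - y) * out * (1.0 - out) / n
            gb2 += d_out
            for j in range(2):
                gW2[j] += d_out * h[j]
                # Backpropagation, hidden layer: chain rule through W2 and sigmoid.
                d_hid = d_out * p["W2"][j] * h[j] * (1.0 - h[j])
                gb1[j] += d_hid
                for i in range(2):
                    gW1[j][i] += d_hid * x[i]
        # Gradient descent step: the learning rate scales every update.
        for j in range(2):
            p["W2"][j] -= learning_rate * gW2[j]
            p["b1"][j] -= learning_rate * gb1[j]
            for i in range(2):
                p["W1"][j][i] -= learning_rate * gW1[j][i]
        p["b2"] -= learning_rate * gb2
    return p

params = init_params()
loss_before = mse(params)
train(params)
loss_after = mse(params)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

The loss should go down over training; whether a network this small fully learns XOR depends on the starting weights and learning rate, which is part of what makes the exercise instructive.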
This is just a quick post of a good article I read on Medium today. I myself am guilty of some of the problems addressed in it, and I am now much less likely to commit one of these errors. Data Scientists: Your Variable Names Are Awful. Here’s How to Fix Them
I spend at least an hour and a half each work day on the train. In that time I’ve been using the app SoloLearn on my phone. There are lots of mini-courses and quizzes to test your knowledge on, related to both data science and computer science. While it is no substitute for doing a real project, or even taking an online course, it has been really helpful for me in reinforcing some of my knowledge of SQL, HTML, and CSS.
If you are looking at a lot of data science jobs or Kaggle competitions you’ll notice that there is a lot of demand for computer vision and natural language processing (NLP). Here are the steps I took to learn NLP. This sounds like an unrelated first step, but I learned how to make use of Shiny in R. I read the book Text Mining With R by Julia Silge and David Robinson.
I have about a 45-minute commute both ways to my job via the Washington, D.C. Metro. This gives me a lot of time to sit and read, and I have found excellent free books to read at bookdown.org. Of the books on there I have read Forecasting: Principles and Practice, Text Mining with R, and blogdown: Creating Websites with R Markdown. I am now reading R Markdown: The Definitive Guide.
If you want to be considered good at R, it’s best you know ggplot. If you’re using base R graphics I get the impression other data scientists may look at your graphics as childish (though I think there is nothing wrong with using base R). I’ll give a quick walkthrough of ggplot and making use of ggplotly from the plotly package. We’re going to use the cars data set, which comes with R, so don’t worry about getting it.
If you’re an up and coming data scientist or student you may want to be building your portfolio. If that’s the case, here are some quick suggestions on datasets you may want to work with. Don’t pick a dataset that is common. If I were a hiring manager and saw you do an analysis on the iris dataset, I would not be impressed. Aim for something that isn’t on the UCI Machine Learning Repository.
Today marks the end of my first week as a data science intern at SphereOI Studios! It’s been a great first week. I’ve already gotten into the thick of it and I can tell it’s going to be a fantastic learning experience!
I’ve heard that Jeff Bezos has three main questions he asks his employees to consider when hiring someone. Will you admire this person? Will this person raise the average level of effectiveness of the group they’re entering? Along what dimension might this person be a superstar? The earliest reference I can find to this is from his 1998 Letter to Shareholders. The second question, “Will this person raise the average level of effectiveness of the group they’re entering?
RStudio, the company that produces the IDE of the same name, also produces some business-facing products. One I would be very excited to try out is RStudio Connect because it solves a lot of the problems I have encountered in my work experience. Check out this video from rstudio:conf2019().
Well, I am finally done with my time at Texas A&M. I finished with a double major in statistics and economics. I’m sad my time here has come to a close, but I am excited to be able to work on data science all the time now!
After taking 4th place last year in the undergraduate competition, my team and I won the graduate student competition. There are two primary reasons I think we won. Our results were made interactive and viewable online. You can find them here. Our results were actionable. We actually largely ignored the competition organizers’ suggestions of what to cover, because all the other teams would likely cover that as well, and instead focused on giving results that would be most useful for making decisions.
This was my third hackathon! I built a web app that constructs a Spotify playlist in your library with music you’ve never listened to, based on the artists you listen to most frequently. It did not work so well because I developed the algorithm myself and ran with my first iteration of it, but I learned a bit about OAuth, which was the hardest part. I’d really recommend hackathons to students interested in data science. It’s important to develop the skills that are necessary to complete data science work but aren’t machine learning (e.
Plotting is a big part of getting your point across in data science. Even people who know how to create plots often don’t select the right plot for their data. Take a look at the three pie charts I made with R. See if you can tell if the red or blue section is bigger in each chart. Here is the same data, displayed as a bar chart.
Last night was my first night teaching/presenting to the Texas A&M Analytics Club! I will be doing this for the next 2 semesters in collaboration with a few of my fellow statistics majors. In my first presentation I discussed data cleaning and the ETL process, since that is common to nearly all data science projects and typically not covered in the regular curriculum (unfortunately).
If you are learning R, something that makes your code much more readable to yourself and others is the pipe operator, which looks like this: %>%. You can use the pipe operator by installing and loading either the magrittr package or the tidyverse package. The pipe takes the object on the left and passes it in as the first argument of the function on the right.
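The pipe itself is R syntax, but the idea carries over to other languages. As a rough analogy only, here is a small Python sketch of the same rule: the value on the left becomes the first argument of the next function, so R's `x %>% f(y)` means `f(x, y)`. The `pipe` helper and the sample numbers are hypothetical, made up for this illustration; they are not part of magrittr or the tidyverse.

```python
import math

def pipe(value, *steps):
    """Thread `value` through `steps` from left to right.

    Each step is either a function, or a tuple of (function, extra args...).
    The running value is always passed as the FIRST argument, mirroring
    how R evaluates `x %>% f(y)` as `f(x, y)`.
    """
    for step in steps:
        if callable(step):
            value = step(value)
        else:
            func, *extra = step
            value = func(value, *extra)
    return value

values = [4.0, 9.0, 16.5]

# Nested calls have to be read from the inside out.
nested = round(math.sqrt(sum(values)), 2)

# The piped version reads in the order the work happens:
# sum the values, take the square root, then round to 2 places.
piped = pipe(values, sum, math.sqrt, (round, 2))

print(nested == piped)
```

Both versions compute the same thing; the piped one just lists the steps in the order they run, which is the readability gain the pipe gives you in R.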
This month I happened to be featured as the amazing economics student of the month. I answered a few questions the department asked me and they put the interview up at this link.
If you’re a college student, your university may pay for your access to Lynda.com (mine did, and many students from other schools that I have talked to say theirs did). You can also get access to all these courses on LinkedIn Learning if you have LinkedIn Premium. Whether you have LinkedIn Learning or Lynda.com, you can publish the certificates of these courses onto your LinkedIn. Once you have access, I would watch the following courses in the following order.
I’m going to go over two ways to quickly get products on to your résumé. Both require that you have the RStudio IDE installed and that you know a little R. I’ll walk through both in more in-depth articles later, but you may be able to get started with just this. Rmarkdown and Rpubs: This first project is one I really recommend if you don’t have anything on your résumé yet.
Opinion/Tutorial: What should you do if you're a high school student or college freshman and want to get into data science?
If you’re a high school student or college freshman and you find machine learning/AI really neat and want to learn, here are my best suggestions for you. Learn Calculus. If you want to understand machine learning you won’t get far without understanding statistics, probability theory, and distribution theory. You won’t be good at any of those if you don’t take calculus. If you think you have to wait until your calculus class to get started, you’re wrong!
Yesterday, my good friends Juliang Li (now at Google), Ishan Vasandani (now at PwC), Ian George (still at Texas A&M), and I competed in the first ever Texas A&M Institute of Data Science Competition. The competition centered around forecasting trip revenue for a taxi ride in Chicago, and we took 4th place. Besides needing to improve my skill with time series (which is not covered in my curriculum for some reason), I realize that an impressive part of having a good presentation, no matter how skilled you are at communicating, is having interactivity.
If you’ve discovered my blog you may wonder why I have one at all. Why don’t I just post content to Facebook/Twitter instead? There are 3 main reasons. It’s kind of a portfolio that I can display. It’s for me to keep track of what I am learning. It’s for other data science students to learn from. In each blog post I’ll be keeping track of the data science work I do and what I am learning from hackathons, data science competitions, Kaggle, personal projects, books I’m reading, classes, and MOOCs I am taking.