Learning Data Science


Exploring the intersection between data science and education

Flatiron Capstone: LANL Earthquake Prediction

For my capstone project as a Flatiron Data Science student, I decided to take a crack at a Kaggle competition: LANL Earthquake Prediction. I chose this competition because I thought it would be a good way to push myself out of my comfort zone, which turned out to be true! It was challenging in a number of ways:


Regression with Long Datasets in Pandas

For my capstone project, I am taking a crack at the LANL Earthquake Prediction Competition on Kaggle. This competition aims to use acoustic data recorded by ground sensors to predict the timing of earthquakes in a laboratory experiment. The training dataset for this competition has over 600M rows representing acoustic measurements of the soil, and the target is the remaining time until the next earthquake occurs. Since the acoustic signal and the target are the only variables in the dataset, the data file is not too large to process on my moderately powerful PC. However, working with such a long dataset is a new challenge for me, and I thought I’d share some of what I learned while trying to wrangle it.


Time Management in Deep Learning

I think it is very easy to get stressed about time when working on machine learning problems. This can happen for a whole bunch of big-picture reasons:


Working with Genetic Variant Classification Data

For Module 3, we were asked to solve a classification problem using a publicly available dataset. I wanted to see if I could apply the techniques that I have been learning to a dataset that I was unfamiliar with. I figured that this would stretch me to think about ways to select and tune models without domain knowledge to inform my decisions.


Fixed Effects Regression Models in Statsmodels

In this blog post, I describe how I used pandas and statsmodels to implement a fixed effects regression model: a useful but counterintuitive type of regression model. I will also walk through the proper interpretation of the main coefficient of interest from this model. Hopefully, you will finish this post feeling comfortable with implementing such a model yourself, and understanding when/why you might want to do so.