R has always been my go-to for building regression models. This is a step-by-step, dear-diary type guide of my attempt to model basic single variable linear regression in Python, using Scikit-Learn (for the first time).
I like to learn through examples. I found the official Scikit-Learn tutorial, which is very thorough, well-documented, and enough for me to get started.
Choosing a Topic
Choosing data for a single-variate regression is relatively easy. I wanted to investigate the relationship between an underlying and its levered ETFs: GDX (Gold Miners ETF), NUGT (Gold Miners Bull 3x), and DUST (Gold Miners Bear 3x). I was first introduced to levered ETFs at work, and have always wanted to learn more about how the relationship between the underlying and the levered instrument holds up. The regression I’m about to construct is fairly straightforward: regressing NUGT returns (and, separately, DUST returns) on GDX returns.
Get the data and calculate the daily return of each instrument. GDX is the explanatory, or X, variable. NUGT and DUST are the dependent, or Y, variables for two separate single-variate regressions.
Getting the Data
I found the historical closing prices of GDX, NUGT, and DUST over the past three months on NASDAQ. Download the data in .csv (scroll all the way down to download).
Install All Necessary Packages
I already had pandas on my Mac, but still needed to install Scikit-Learn. I followed this guide.
Unfortunately, I ran into the following error when upgrading my numpy:
$ sudo pip install --upgrade numpy
The directory '/Users/janetye/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/janetye/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting numpy
  Downloading numpy-1.11.1-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (3.9MB)
    100% |████████████████████████████████| 3.9MB 281kB/s
Installing collected packages: numpy
  Found existing installation: numpy 1.8.0rc1
    DEPRECATION: Uninstalling a distutils installed project (numpy) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
    Uninstalling numpy-1.8.0rc1:
Exception:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip/commands/install.py", line 317, in run
    prefix=options.prefix_path,
  File "/Library/Python/2.7/site-packages/pip/req/req_set.py", line 736, in install
    requirement.uninstall(auto_confirm=True)
  File "/Library/Python/2.7/site-packages/pip/req/req_install.py", line 742, in uninstall
    paths_to_remove.remove(auto_confirm)
  File "/Library/Python/2.7/site-packages/pip/req/req_uninstall.py", line 115, in remove
    renames(path, new_path)
  File "/Library/Python/2.7/site-packages/pip/utils/__init__.py", line 267, in renames
    shutil.move(old, new)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 302, in move
    copy2(src, real_dst)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 131, in copy2
    copystat(src, dst)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 103, in copystat
    os.chflags(dst, st.st_flags)
OSError: [Errno 1] Operation not permitted: '/tmp/pip-hfh1B7-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy-1.8.0rc1-py2.7.egg-info'
I had recently upgraded to macOS Sierra. The above error is similar to one involving six that I hit when I first upgraded to El Capitan: pip is not permitted to uninstall the copy of the package that ships with the system Python. This post discusses the issue in detail. I had luck with
sudo pip install --upgrade numpy --ignore-installed six.
Line 1-4: import graphing tools, numpy for dealing with arrays, scikit-learn.
Line 11: import data into a numpy array using genfromtxt. Documentation here.
skip_header = 2 skips the header rows at the top of the file.
dtype is set to float by default. I had trouble importing the numbers as float. Some set dtype=None to let numpy automatically detect the type of data. I decided to force it to be str, and convert to float later.
Line 13: I’m only interested in the second column of data, which is closing price.
Line 15: As mentioned above, I’m now getting rid of the quotes that come with the string type (from Line 11), and casting the values to float.
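The load-as-string, strip-quotes, cast-to-float steps can be sketched roughly as follows. This is a minimal illustration, not the original script: it assumes a current Python 3 and NumPy, and the in-memory sample file (two header rows, every field wrapped in double quotes) is a made-up stand-in for the downloaded .csv, whose real layout may differ.

```python
import io
import numpy as np

# Hypothetical stand-in for the downloaded file: two header rows,
# then quoted "date","close","volume" rows. The real export may differ.
sample_csv = io.StringIO(
    '"date","close","volume"\n'
    '"second header row","",""\n'
    '"2016/09/30","26.43","1000"\n'
    '"2016/09/29","26.10","1200"\n'
    '"2016/09/28","25.80","900"\n'
)

# Read every field as a string and skip the two header rows.
raw = np.genfromtxt(sample_csv, delimiter=",", dtype=str, skip_header=2)

# Keep only the second column (closing price), strip the surrounding
# double quotes, then cast the cleaned strings to float.
close = np.char.strip(raw[:, 1], '"').astype(float)
print(close)
```

Reading as str first sidesteps the float parsing failure that quoted numbers would otherwise cause.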
Line 17: Calculate the daily return – difference in price between two consecutive days normalized by first day’s price.
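As a rough sketch of that return calculation: the helper name daily_returns is made up for illustration, and it assumes the prices are ordered oldest to newest. If the downloaded file lists the newest day first, the array would need reversing first.

```python
import numpy as np

def daily_returns(prices):
    """Return (P[t] - P[t-1]) / P[t-1] for each pair of consecutive days.

    Assumes prices are ordered oldest to newest.
    """
    prices = np.asarray(prices, dtype=float)
    # np.diff gives P[t] - P[t-1]; dividing by prices[:-1] normalizes
    # each difference by the first day's price of the pair.
    return np.diff(prices) / prices[:-1]

example = daily_returns([100.0, 110.0, 99.0])
print(example)
```

Note that n prices yield n - 1 returns, so the return series is one element shorter than the price series.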
Line 19-20: I reserve the last 20 data points as my testing data. The rest is for training the model.
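A hold-out split like this comes down to two slices. The sketch below uses a made-up helper name, split_last_n, and a dummy array in place of the real return series.

```python
import numpy as np

def split_last_n(values, n_test=20):
    """Split a series into (training, testing), reserving the last n_test points."""
    values = np.asarray(values)
    return values[:-n_test], values[-n_test:]

# Dummy series of 60 observations standing in for the daily returns.
returns = np.arange(60)
train, test = split_last_n(returns)
print(len(train), len(test))
```

Because the data is a time series, slicing off the tail keeps the test period strictly after the training period, which a random shuffle would not.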
Line 22-43: Same setup for DUST and NUGT.
Line 48-50: Before running any regression, I like to scatterplot the data to see if there is any apparent relationship. See below.
From the graph, we see that there is a strong, negative, and linear relationship between GDX return and DUST return.
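For reference, a scatterplot of this kind takes only a few lines of matplotlib. The sketch below is not the original plotting code: the returns are synthetic (randomly generated with an assumed slope of -3), and the output filename is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the real return series (assumed values, not market data).
rng = np.random.default_rng(1)
gdx_ret = rng.normal(0, 0.02, 60)
dust_ret = -3.0 * gdx_ret + rng.normal(0, 0.005, 60)

fig, ax = plt.subplots()
ax.scatter(gdx_ret, dust_ret, color="black")
ax.set_xlabel("GDX daily return")
ax.set_ylabel("DUST daily return")
ax.set_title("Underlying vs. levered bear ETF (synthetic data)")
fig.savefig("gdx_vs_dust.png")
```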
Line 53-55: Line 53 fits a regression using the training data. Line 54-55 use the fitted model to predict on the hold-out test data.
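The fit-then-predict step looks roughly like the sketch below, using scikit-learn's LinearRegression. The data is synthetic (generated with an assumed slope of -3), so the numbers it produces say nothing about the real ETFs. It assumes Python 3 and a recent NumPy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic returns standing in for the real series (assumed, not market data).
rng = np.random.default_rng(0)
x = rng.normal(0, 0.02, size=60)               # "underlying" returns
y = -3.0 * x + rng.normal(0, 0.005, size=60)   # "bear 3x" returns

# scikit-learn expects X as a 2-D array of shape (n_samples, n_features).
X = x.reshape(-1, 1)
X_train, X_test = X[:-20], X[-20:]
y_train, y_test = y[:-20], y[-20:]

regr = LinearRegression()
regr.fit(X_train, y_train)       # estimate intercept and slope from training data
y_pred = regr.predict(X_test)    # predict the hold-out points

print("intercept:", regr.intercept_, "slope:", regr.coef_[0])
```

The reshape is the step most likely to trip up someone coming from R: a 1-D array of returns has to become a single-column matrix before fit will accept it.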
Line 56-57: Basically “pause” the program and wait for user input to proceed.
Line 58-64: Summary stats. In this case, the regression returns
Return of DUST = -0.0057 - 8.858 * Return of GDX. The residual sum of squares is 0.01. R^2 is weak at 0.06, meaning that only 6% of the variability in the return of DUST can be explained by the linear relationship with the return of GDX.
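To make those summary numbers concrete, here is a small NumPy-only sketch of how a residual sum of squares and R^2 can be computed from actual and predicted values. The function name regression_summary and the four toy data points are made up, and the original script may compute its figures differently (for example, reporting a mean rather than a sum).

```python
import numpy as np

def regression_summary(y_true, y_pred):
    """Compute residual sum of squares, mean squared error, and R^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return {"rss": rss, "mse": rss / y_true.size, "r2": 1.0 - rss / tss}

# Toy example: predictions close to the actual values give R^2 near 1.
summary = regression_summary([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(summary)
```

R^2 is the share of total variation that the fitted line accounts for, which is why a value of 0.06 reads as "6% of variability explained".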
Line 65-67: Plots the graph with the regression line in blue.
Line 71-96: Regression of NUGT against GDX. Here, the least-squares regression is
Return of NUGT = 0.006125 + 8.909 * Return of GDX. The residual sum of squares is 0.01, and R^2 is even weaker at 0.02.