Linear Regression with Python Scikit Learn: Levered ETFs

R has always been my go-to for building regression models. This is a step-by-step, dear-diary-style account of my first attempt at basic single-variable linear regression in Python, using Scikit-Learn.

I like to learn through examples. I found the official Scikit-Learn tutorial, which is very thorough, well-documented, and enough for me to get started.

Feel free to scroll down for actual code.

Choosing a Topic

Choosing data for a single-variable regression is relatively easy. I wanted to investigate the relationship between an underlying and its levered ETFs: GDX (Gold Miners ETF), NUGT (Gold Miners Bull 3x), and DUST (Gold Miners Bear 3x). I was first introduced to levered ETFs at work, and have always wanted to learn more about how well the relationship between the underlying and the levered instrument holds up. The regression I’m about to construct is fairly straightforward: regressing NUGT returns on GDX returns.

The Plan

Get the data and calculate the daily return of each instrument. GDX is the explanatory, or X, variable. NUGT and DUST are the dependent, or Y, variables for two separate single-variable regressions.
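
As a minimal sketch of the daily-return step in the plan (not the code from this post), assuming the closing prices are ordered oldest first. The function name daily_returns and the sample prices are made up for illustration.

```python
import numpy as np


def daily_returns(prices):
    """Return (p[t] - p[t-1]) / p[t-1] for each pair of consecutive closes.

    Assumes prices are ordered oldest first (hypothetical helper).
    """
    prices = np.asarray(prices, dtype=float)
    # np.diff gives p[t] - p[t-1]; dividing by prices[:-1] normalizes
    # each change by the earlier day's price.
    return np.diff(prices) / prices[:-1]


# Made-up closing prices, for illustration only.
closes = [100.0, 102.0, 99.96]
print(daily_returns(closes))
```

If a downloaded file lists the newest date first, the array would need reversing (for example with prices[::-1]) before this calculation, otherwise each return is measured backwards in time.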

Getting the Data

I found the historical closing prices of GDX, NUGT, and DUST over the past three months on NASDAQ. Download the data as .csv (scroll all the way down the page for the download link).

Install All Necessary Packages

I already had numpy and pandas on my Mac, but still needed to install Scikit-Learn. I followed this guide.

Unfortunately, I ran into the following error when upgrading my numpy:

$ sudo pip install --upgrade numpy
The directory '/Users/janetye/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/janetye/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting numpy
  Downloading numpy-1.11.1-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (3.9MB)
    100% |████████████████████████████████| 3.9MB 281kB/s 
Installing collected packages: numpy
  Found existing installation: numpy 1.8.0rc1
    DEPRECATION: Uninstalling a distutils installed project (numpy) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
    Uninstalling numpy-1.8.0rc1:
Exception:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip/commands/install.py", line 317, in run
    prefix=options.prefix_path,
  File "/Library/Python/2.7/site-packages/pip/req/req_set.py", line 736, in install
    requirement.uninstall(auto_confirm=True)
  File "/Library/Python/2.7/site-packages/pip/req/req_install.py", line 742, in uninstall
    paths_to_remove.remove(auto_confirm)
  File "/Library/Python/2.7/site-packages/pip/req/req_uninstall.py", line 115, in remove
    renames(path, new_path)
  File "/Library/Python/2.7/site-packages/pip/utils/__init__.py", line 267, in renames
    shutil.move(old, new)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 302, in move
    copy2(src, real_dst)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 131, in copy2
    copystat(src, dst)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 103, in copystat
    os.chflags(dst, st.st_flags)
OSError: [Errno 1] Operation not permitted: '/tmp/pip-hfh1B7-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy-1.8.0rc1-py2.7.egg-info'

I had recently upgraded to macOS Sierra. The error above is of the same kind as one I hit with six when I first upgraded to El Capitan: pip is not permitted to uninstall the system-installed copy of the package. This post discusses the issue in detail. I had luck with sudo pip install --upgrade numpy --ignore-installed six.

Actual Code
Breakdown

Lines 1-4: import graphing tools, numpy for dealing with arrays, and scikit-learn.
Line 11: import data into a numpy array using genfromtxt. Documentation here. skip_header = 2 skips the first two lines of the file, which include the header row. dtype is set to float by default, but I had trouble importing the numbers as float. Some people set dtype=None to let numpy detect the type of data automatically. I decided to force it to str and convert to float later.
Line 13: I’m only interested in the second column of data, which is the closing price.
Line 15: As mentioned, I now strip the quotes that come with the string type (from Line 11) and cast the values to float.
Line 17: Calculate the daily return: the difference in price between two consecutive days, normalized by the first day’s price.
Lines 19-20: I reserve the last 20 data points as my testing data. The rest is for training the model.
Lines 22-43: Same setup for DUST and NUGT.
Lines 48-50: Before running any regression, I like to scatterplot the data to see if there is any apparent relationship. See below.
[Figure: GDX vs DUST]

From the graph, we see that there is a strong, negative, and linear relationship between GDX return and DUST return.
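
The loading, cleaning, and hold-out steps described for Lines 11-20 can be sketched roughly as follows. This is an illustration under stated assumptions, not the script from this post: the sample data is made up, its layout (one header row, quoted fields, closing price in the second column) is assumed, and load_closes and split_holdout are hypothetical helper names. The sample has a single header row, so it uses skip_header=1 rather than 2, and it holds out 2 points rather than 20.

```python
import io

import numpy as np

# Made-up sample in an assumed layout: one header row, quoted fields,
# closing price in the second column, oldest date first.
SAMPLE_CSV = '''"date","close","volume"
"2016/10/03","10.00","100"
"2016/10/04","10.50","120"
"2016/10/05","10.29","90"
"2016/10/06","10.29","95"
"2016/10/07","11.319","110"
'''


def load_closes(source, skip_header=1):
    """Read every field as a string, then clean and convert the close column."""
    raw = np.genfromtxt(source, delimiter=",", skip_header=skip_header, dtype=str)
    close = raw[:, 1]  # second column: closing price, still quoted text
    # Strip the surrounding double quotes, then cast to float.
    return np.char.strip(close, '"').astype(float)


def split_holdout(values, n_test):
    """Keep the last n_test points for testing and the rest for training."""
    return values[:-n_test], values[-n_test:]


closes = load_closes(io.StringIO(SAMPLE_CSV))
returns = np.diff(closes) / closes[:-1]  # daily returns
train, test = split_holdout(returns, n_test=2)
print(train, test)
```

Reading everything as str first sidesteps the conversion failure that quoted numbers cause when dtype is left at float. With a real file, a path can be passed to load_closes in place of the io.StringIO object.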

Lines 53-55: Line 53 fits a regression using the training data. Lines 54-55 use the model to predict on the hold-out test data.
Lines 56-57: Basically “pauses” the script and waits for user input to proceed.
Lines 58-64: Summary stats. In this case, the regression returns Return of DUST = -0.0057 - 8.858 * Return of GDX. The residual sum of squares is 0.01. R^2 is weak at 0.06, meaning that only 6% of the variability in the return of DUST can be explained by the linear relationship with the return of GDX.
Lines 65-67: Plots the graph with the regression line in blue.
Lines 71-96: Regression of NUGT on GDX. Here, the least-squares regression is Return of NUGT = 0.006125 + 8.909 * Return of GDX. The residual sum of squares is 0.01, and R^2 is even weaker at 0.02.
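
For reference, the fit, predict, and summary steps can be sketched with scikit-learn's LinearRegression as below. This is a minimal sketch and not the script from this post. The data is synthetic: the slope of -3 and the noise level are built in purely for illustration and are not results from the GDX, NUGT, or DUST data. The way the residual sum of squares is computed here is also an assumption about what the summary stats report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic returns, for illustration only: x plays the role of the
# underlying, y a levered instrument with a made-up slope of -3.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.02, size=60)
y = -3.0 * x + rng.normal(0.0, 0.002, size=60)

# Hold out the last 20 points for testing; train on the rest.
n_test = 20
X_train, X_test = x[:-n_test].reshape(-1, 1), x[-n_test:].reshape(-1, 1)
y_train, y_test = y[:-n_test], y[-n_test:]

# scikit-learn expects a 2-D feature array, hence the reshape above.
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# Residual sum of squares and R^2, both on the hold-out data.
rss = np.sum((y_test - pred) ** 2)
r2 = model.score(X_test, y_test)

print("intercept:", model.intercept_)
print("slope:", model.coef_[0])
print("residual sum of squares:", rss)
print("R^2:", r2)
```

model.score returns the coefficient of determination, which equals 1 minus the residual sum of squares divided by the total sum of squares of y_test around its mean.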