Exploring the S&P 500 using ETFs holdings

Published November 14, 2019

Introduction

Stock indexes are everywhere. They are the gauge of the stock market, displayed on every finance newspaper and yet are quite obscure! Their historical constituents are often hidden and not available.

Using public ETF holdings, we can retrieve a portfolio tracking the underlying index, and discover its historical constituents. In this article we replicate the most popular index in the world, the S&P 500 thanks to the iShares Core S&P 500 ETF.

Keep in mind this method can be applied to any Index, as long as there is a corresponding ETF publishing their holdings!

This article is split in two parts. The first one is the data collection process, and the second one is the study of the S&P 500. If you are only interested in the study, you can click here.

Data Collection

Holdings for the iShares Core S&P 500 ETF holdings are available here. Scraping is quite straightforward as we just need to iterate dates, and parse the csv:

import io
import multiprocessing
import requests
import pandas as pd

def get_holdings(date):
    resp = requests.get(
        'https://www.ishares.com/us/products/239726/ishares-core-sp-500-etf/1467271812596.ajax',
        params={
            'fileType': 'csv',
            'fileName': 'IVV_holdings',
            'dataType': 'fund',
            'asOfDate': date.strftime('%Y%m%d'),
        }
    )

    fund_info = pd.read_csv(
        io.StringIO(resp.text),
        skiprows=1,
        nrows=8,
        header=None,
        names=['attribute', 'value'],
    )

    fund_info = fund_info.set_index('attribute').T
    fund_info = fund_info.reset_index(drop=True)
    fund_info['date'] = date

    holdings = pd.read_csv(
        io.StringIO(resp.text),
        skiprows=10,
        dtype='str'
    )

    holdings['date'] = date

    return fund_info, holdings

dates = pd.bdate_range('2009-01-01', '2019-11-19')

with multiprocessing.Pool(16) as pool:
    res = pool.map(get_holdings, dates)

fund_infos = pd.concat([fund_info for fund_info, _ in res], axis=0, ignore_index=True)
holdings = pd.concat([holding for _, holding in res], axis=0, ignore_index=True)

Data Cleaning

We just remove incomplete rows, rename and keep only columns of interest and finally type them.

fund_infos = fund_infos.replace({'-': pd.np.nan, '': pd.np.nan})
fund_infos = fund_infos[fund_infos.count(axis=1) > 2]

fund_infos = fund_infos.rename(columns={
    'Total Net Assets': 'total_net_assets',
    'Shares Outstanding': 'shares_oustanding',
})

fund_infos = fund_infos[['date', 'total_net_assets', 'shares_oustanding']]

for column in ['total_net_assets', 'shares_oustanding']:
    fund_infos[column] = pd.to_numeric(fund_infos[column].fillna('').str.replace(',', ''))

holdings = holdings.replace({'-': pd.np.nan, '': pd.np.nan})
holdings = holdings[holdings.count(axis=1) > 2]

holdings = holdings.rename(columns={
    'Asset Class': 'asset_class',
    'Exchange': 'exchange',
    'ISIN': 'isin',
    'Market Value': 'market_value',
    'Name': 'name',
    'Notional Value': 'notional_value',
    'Price': 'price',
    'SEDOL': 'sedol',
    'Sector': 'sector',
    'Shares': 'shares',
    'Ticker': 'ticker',
    'Weight (%)': 'weight',
})

for column in ['market_value', 'notional_value', 'price', 'shares', 'weight']:
    holdings[column] = pd.to_numeric(holdings[column].fillna('').str.replace(',', ''))

fund_infos.to_csv('./fund_infos.csv', index=False)
holdings.to_csv('./holdings.csv', index=False)

Data Presentation

We collected two datasets, describing the iShares Core S&P 500 ETF. Frequency is monthly before 2017, then daily afterwards. Below we show an extract of the data, and the links if you want to download them.

Holdings

datetickernameweightpricesharesisin
2019–11–20MSFTMICROSOFT CORP4.41149.625.76509e+07US5949181045
2019–11–20AAPLAPPLE INC4.32263.193.20746e+07US0378331005
2019–11–20AMZNAMAZON COM INC2.81745.533.1373e+06US0231351067
2019–11–20FBFACEBOOK CLASS A INC1.84197.511.81643e+07US30303M1027
2019–11–20BRKBBERKSHIRE HATHAWAY INC CLASS B1.65217.481.48054e+07US0846707026

Download CSV (54M)

Fund Infos

datetotal_net_assetsshares_oustanding
2019–11–201.95462e+116.254e+08
2019–11–191.96129e+116.253e+08
2019–11–181.96233e+116.253e+08
2019–11–151.96712e+116.2715e+08
2019–11–141.95161e+116.271e+08

Download CSV (103K)

Historical Constituents

Let’s start our study by displaying the historical constituents. Below is a treemap, where each rectangle is a stock with an area proportional to its weight in the index. Remember, the S&P 500 is a float-adjusted market capitalization index of the top 500 stocks in the US market. So here we have a view of the biggest companies per market capitalization.

By clicking on a specific year, we are able to see the evolution of stocks and their sectors. The rise of Technology and the fall of Energy is quite impressive these last 10 years!

USDMARGIN_USDESZ9ESU9ESM9ESH9UBFUTBLKFDSXTSLASEEMOSBMSALBWRKPKGEMNCFAVYFMCNUEMLMIFFFCXCEAMCRVMCIPCTVALYBBLLPPGNEMDOWSHWECLDDAPDDWDPLINMACSLGAIVVNOREGKIMIRMFRTHSTEXRDREUDRMAAPEAKHCPCBREAREVTRESSBXPWYODLRSBACPSAEQRAVBWELLSPGEQIXPLDCCIAMTPCGPNWNRGNILNTCNPATOAESEVRGCMSAEEAWKPPLETRDTEFEESEIXWECEDPEGXELSREEXCAEPSODUKDNEENFXXECHPNOVHFCFTIDVNCOGAPANBLMROFANGCXOBKRBHGEHESHALPXDWMBOKEOXYKMIAPCVLOMPCEOGSLBPSXCOPCVX0.9%XOM1.1%COTYCPBTAPSJMLWHRLBFBKCAGKHCCLXCHDMKCKRMNSTHSYADMTSNSTZGISSYYELWBAKMBCLMDLZMOPMCOST0.5%WMT0.7%PEP0.7%KO0.8%PG1.2%ROLPWRFLSFLRSNARHIPNRNLSNFBHSAOSALKURITXTJBHTHIICHRWALLEAALXYLMASJECIEXGWWEXPDARNCWABKSUDOVRSGEFXCPRTUALLLLFTVFASTCTASVRSKSWKROKAMEPHTDGPCARLUVINFOCMIIRDALJCIROPFDXETNWMHRSGDEMRLHXNSCITWNOCDECSXRTNCATUPSMMMLMTGEUTXUNPHONBA0.8%UAAUALBJWNGPSRLMATMHRBHOGHBIGTFLCPRIWHRTPRPVHNWLMHKLEGKSSBWAWYNNTSCOPHMNCLHLKQAAPULTATIFNVRHASGRMNEXPEDRIMGMKMXGPCLENDHICCLBBYRCLLVSCMGAPTVDLTRVFCHLTEBAYAZOYUMORLYFMARROSTDGGMTGTTJXBKNGLOWSBUXNKEMCD0.6%HD0.9%AMZN2.8%TFCFATFCFTRIPNWSNWSAFOXDISCAVIABIPGDISHDISCKTTWOFOXACBSOMCCTLTWTRTMUSEAATVICHTRNFLX0.5%CMCSA0.8%VZ1%DIS1%T1.1%GOOGL1.5%GOOG1.5%FB1.8%TMKUNMJEFBHFAMGZIONPBCTIVZBENAIZRJFRELNCGLETFCCMASIVBPFGNDAQLCBOERFMKTXHBANCINFKEYFRCCFGAJGMTBMSCIFITBAMPWLTWSYFNTRSHIGSTTDFSTROWSTITRVMCOALLPRUMETAFLPGRBKBBTCOFAONAIGSCHWICEMMCBLKMSSPGIPNCGSCBCMEAXPUSBC0.7%WFC0.8%BAC1.1%JPM1.6%BRKB1.7%NKTRPRGOMYLDVAABMDUHSPKIHSICXRAYVARHOLXDGXABCWCGWATTFXLHINCYCOOCAHMTDRMDCERNALGNIDXXCNCALXNAIQVZBHREGNMCKHCABAXILMNHUMEWBIIBVRTXZTSBSXAGNSYKISRGBDXCICELGANTMGILDDHRLLYCVSTMOABBVBMY0.5%AMGN0.5%ABT0.6%MDT0.6%PFE0.8%MRK0.8%UNH1%JNJ1.4%IPGPADSXRXJNPRFLIRFFIVWUDXCANETSYMCSTXQRVONLOKLDOSJKHYCTXSBRWDCSWKSNTAPMXIMITFTNTAKAMCDNSVRSNSNPSMCHPKEYSCDWANSSXLNXTSSHPEGLWFLTPAYXMSIKLACHPQTELAPHRHTCTSHADSKLRCXADIAMDMUNOWGPNAMATFISVINTUADPFISQCOMTXNIBMPYPLORCLAVGOACNNVDACRM0.6%ADBE0.6%CSCO0.7%MA1%INTC1%V1.2%AAPL4.3%MSFT4.4%Cash and/or Derivatives0%Materials3%Real Estate3%Utilities3%Energy4%Consumer Staples7%Industrials10%Consumer Discretionary10%Communication10%Financials13%Health Care14%Information Technology23%

Weights Distribution

How are the companies distributed by market capitalization? We plot below the distribution of weights per stock. It seems the distribution is close to a power of law.

0.50 %1.00 %1.50 %2.00 %2.50 %3.00 %3.50 %4.00 %

Conclusion

In this article we did a very light analysis of the S&P 500 Index, as our main goal was to obtain a portfolio that we will be able to use later on, since we now have a way to obtain an historical universe of stocks free of suvivorship bias. This is the first step in order to try other quantitative analysis. You can of course explore more in depth this data set, by simply downloading it or running the code by yourself. Happy exploration!