Indego operates a bike sharing network in Philadelphia. They also publish anonymized data detailing usage of the service.
As a part of my coursework for the Data Science & Strategic Analytics masters program at Stockton University, I was tasked with creating a visualization that told a data story regarding the bike sharing program using the Python library matplotlib. Here’s how I solved this project and produced an animated visualization showing bike usage over time.
How It Works
To create the visualization, I wanted to create a simple and repeatable workflow. Rather than designing multiple scripts that had to be run individually or doing things manually, my goal from a technical standpoint was to have a single script that I could execute from the command line that would run all of the needed code and output a final animated GIF suitable for sharing.
Downloading the Data
The bash script works by first obtaining a list of all quarterly data files published on the Indego website. It does this by using a grep expression to filter out just the URLs we are interested in from the HTML code of the web page that we have downloaded using curl. It stores the list of URLs as an array and iterates over each URL to download a ZIP file containing data in CSV format.
To conserve bandwidth, I used the -N argument in the script’s calls to wget to tell it not to download the data files if they are already present on the user’s computer and have not changed on the server.
Once the data files are downloaded, the bash script subsequently extracts each ZIP archive of data and stores the extracted CSV files that each cover one quarter’s worth of activity in a directory for each year from 2015-2019.
Creating the Visualization
The Python code utilizes Pandas and GeoPandas to clean the data as well as parse the latitude and longitude columns in order to plot them using matplotlib. Some of the data files have latitude and longitude values equal to zero, which caused problems with the visualization, so I cleaned the dataset to remove these problematic cases.
The Python code uses matplotlib for visualizing trends in Indego usage by plan. I wanted to map this against a background of a map of the city of Philadelphia and was able to do so using open data published by the city government.
The city publishes a GeoJSON file that can be used to draw the city limits and various neighborhoods within Philadelphia. GeoPandas makes it simple to read in this file and plot it using matplotlib.
For this visualization, I chose to focus on the destinations column of the dataset in order to plot the markers on a map of Philadelphia. A different marker and color is used based upon which plan the customer was subscribed to when they used the bike sharing network:
- Indego30 - Red star
- Walk-up (no plan) - Blue dot
- IndegoFlex - Green triangle
Putting It All Together
The end result of the Python script is a PNG file for each quarter’s worth of data from 2015 to the present. The final step uses the ImageMagick library to stitch all of the PNG files into the final animated GIF visualizing trends in bike usage over time.
The Code
Bash Script
#!/bin/bash
# This script downloads all Indego data and stores it in a folder for each year
# Get a list of URLs to download
URL_LIST=$(curl https://www.rideindego.com/about/data/ | grep -e ".zip" | \
grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | grep -e "uploads" | \
tr -d '"' | cut -c 6-)
# Iterate over each line in the URL list to download the data
echo $URL_LIST | while read line ; do
wget -N $line -P data/
done
# Find all downloaded ZIPs and extract them, then run Python script
YEARS=(2015 2016 2017 2018 2019)
for year in ${YEARS[@]}; do
FILES=$(find data/ -type f -name *$year*)
for file in $FILES; do
unzip -o $file -d data/$year
done
YEAR_CSV=$(find data/$year/ -type f -name "*.csv" | sort -n)
for csv in $YEAR_CSV; do
python3 plot.py $csv
done
done
# Convert the plotted pngs to an animated GIF sequence
convert -delay 75 -loop 0 *.png indego.gif
Python Script (plot.py)
import sys
import os
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
# Set coordinate system for map
crs = {'init': 'epsg:4326'}
# Create name of plot based upon filename of CSV
base = os.path.basename(sys.argv[1])
file_name = os.path.splitext(base)[0]
ext = '.png'
plot_file_name = file_name + ext
print("Plotting " + sys.argv[1])
# Read in CSV file to pandas
df = pd.read_csv(sys.argv[1])
# Remove problematic cases where either lat or lon is equal to 0
df = df[df.start_lat != 0]
df = df[df.start_lon != 0]
df = df[df.end_lat != 0]
df = df[df.end_lon != 0]
# Convert to a geodataframe
destinations = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.end_lon, df.end_lat))
# Load GeoJSON file of Philadelphia neighborhoods
philly = gpd.read_file('Neighborhoods_Philadelphia.geojson')
# Create map
fig, ax = plt.subplots(figsize=(15, 15))
philly.plot(ax=ax, color='grey')
destinations[destinations['passholder_type'] == 'Indego30'].plot(
ax=ax, color='red', marker="*", label="Indego30")
destinations[destinations['passholder_type'] == 'Walk-up'].plot(
ax=ax, color='blue', marker="8", label="Walk-up")
destinations[destinations['passholder_type'] == 'IndegoFlex'].plot(
ax=ax, color='green', marker="^", label="IndegoFlex")
ax.axis('off')
plt.legend()
plt.title('Indego Usage by Plan Type')
plt.savefig(plot_file_name, format="png")