Scenario

A junior data analyst working is part of the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, the team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the analyst’s recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Analysis Process

1. Loading Required Packages and Data

The packages contains the functions that will be used in this study. If they cannot be loaded, the packages must be installed first.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.0
## ✔ tibble  3.1.8     ✔ stringr 1.5.0
## ✔ tidyr   1.2.1     ✔ forcats 0.5.2
## ✔ readr   2.1.3     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(skimr)
library(lubridate)
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
1.1 Import dataframes of 2022 data

The project officially begins with importing all the needed data.

jan_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202201-divvy-tripdata.csv")
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
feb_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202202-divvy-tripdata.csv")
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mar_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202203-divvy-tripdata.csv")
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
apr_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202204-divvy-tripdata.csv")
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202205-divvy-tripdata.csv")
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jun_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202206-divvy-tripdata.csv")
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jul_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202207-divvy-tripdata.csv")
## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aug_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202208-divvy-tripdata.csv")
## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202209-divvy-tripdata.csv")
## Rows: 701339 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202210-divvy-tripdata.csv")
## Rows: 558685 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202211-divvy-tripdata.csv")
## Rows: 337735 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dec_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202212-divvy-tripdata.csv")
## Rows: 181806 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Wrangle Data and Combine Into a Single Datset

2.2 Compare column of dataframes

The column names of each uploaded datasets are then compared.

colnames(jan_2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(feb_2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(mar_2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(apr_2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(may_2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(jun_2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(jul_2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(aug_2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(sep_2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(oct_2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(nov_2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(dec_2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

In this step, it could be seen that they possess the same column names.

2.3 Compare datatype of each dataframe using loop
months_2022 <- list(jan_2022, feb_2022, mar_2022, apr_2022, may_2022, jun_2022, jul_2022, aug_2022, sep_2022, oct_2022, nov_2022, dec_2022)
results <- lapply(months_2022, function(month) Map(function(a, b) all(class(a) == class(b)), jan_2022, month))
print(results)
## [[1]]
## [[1]]$ride_id
## [1] TRUE
## 
## [[1]]$rideable_type
## [1] TRUE
## 
## [[1]]$started_at
## [1] TRUE
## 
## [[1]]$ended_at
## [1] TRUE
## 
## [[1]]$start_station_name
## [1] TRUE
## 
## [[1]]$start_station_id
## [1] TRUE
## 
## [[1]]$end_station_name
## [1] TRUE
## 
## [[1]]$end_station_id
## [1] TRUE
## 
## [[1]]$start_lat
## [1] TRUE
## 
## [[1]]$start_lng
## [1] TRUE
## 
## [[1]]$end_lat
## [1] TRUE
## 
## [[1]]$end_lng
## [1] TRUE
## 
## [[1]]$member_casual
## [1] TRUE
## 
## 
## [[2]]
## [[2]]$ride_id
## [1] TRUE
## 
## [[2]]$rideable_type
## [1] TRUE
## 
## [[2]]$started_at
## [1] TRUE
## 
## [[2]]$ended_at
## [1] TRUE
## 
## [[2]]$start_station_name
## [1] TRUE
## 
## [[2]]$start_station_id
## [1] TRUE
## 
## [[2]]$end_station_name
## [1] TRUE
## 
## [[2]]$end_station_id
## [1] TRUE
## 
## [[2]]$start_lat
## [1] TRUE
## 
## [[2]]$start_lng
## [1] TRUE
## 
## [[2]]$end_lat
## [1] TRUE
## 
## [[2]]$end_lng
## [1] TRUE
## 
## [[2]]$member_casual
## [1] TRUE
## 
## 
## [[3]]
## [[3]]$ride_id
## [1] TRUE
## 
## [[3]]$rideable_type
## [1] TRUE
## 
## [[3]]$started_at
## [1] TRUE
## 
## [[3]]$ended_at
## [1] TRUE
## 
## [[3]]$start_station_name
## [1] TRUE
## 
## [[3]]$start_station_id
## [1] TRUE
## 
## [[3]]$end_station_name
## [1] TRUE
## 
## [[3]]$end_station_id
## [1] TRUE
## 
## [[3]]$start_lat
## [1] TRUE
## 
## [[3]]$start_lng
## [1] TRUE
## 
## [[3]]$end_lat
## [1] TRUE
## 
## [[3]]$end_lng
## [1] TRUE
## 
## [[3]]$member_casual
## [1] TRUE
## 
## 
## [[4]]
## [[4]]$ride_id
## [1] TRUE
## 
## [[4]]$rideable_type
## [1] TRUE
## 
## [[4]]$started_at
## [1] TRUE
## 
## [[4]]$ended_at
## [1] TRUE
## 
## [[4]]$start_station_name
## [1] TRUE
## 
## [[4]]$start_station_id
## [1] TRUE
## 
## [[4]]$end_station_name
## [1] TRUE
## 
## [[4]]$end_station_id
## [1] TRUE
## 
## [[4]]$start_lat
## [1] TRUE
## 
## [[4]]$start_lng
## [1] TRUE
## 
## [[4]]$end_lat
## [1] TRUE
## 
## [[4]]$end_lng
## [1] TRUE
## 
## [[4]]$member_casual
## [1] TRUE
## 
## 
## [[5]]
## [[5]]$ride_id
## [1] TRUE
## 
## [[5]]$rideable_type
## [1] TRUE
## 
## [[5]]$started_at
## [1] TRUE
## 
## [[5]]$ended_at
## [1] TRUE
## 
## [[5]]$start_station_name
## [1] TRUE
## 
## [[5]]$start_station_id
## [1] TRUE
## 
## [[5]]$end_station_name
## [1] TRUE
## 
## [[5]]$end_station_id
## [1] TRUE
## 
## [[5]]$start_lat
## [1] TRUE
## 
## [[5]]$start_lng
## [1] TRUE
## 
## [[5]]$end_lat
## [1] TRUE
## 
## [[5]]$end_lng
## [1] TRUE
## 
## [[5]]$member_casual
## [1] TRUE
## 
## 
## [[6]]
## [[6]]$ride_id
## [1] TRUE
## 
## [[6]]$rideable_type
## [1] TRUE
## 
## [[6]]$started_at
## [1] TRUE
## 
## [[6]]$ended_at
## [1] TRUE
## 
## [[6]]$start_station_name
## [1] TRUE
## 
## [[6]]$start_station_id
## [1] TRUE
## 
## [[6]]$end_station_name
## [1] TRUE
## 
## [[6]]$end_station_id
## [1] TRUE
## 
## [[6]]$start_lat
## [1] TRUE
## 
## [[6]]$start_lng
## [1] TRUE
## 
## [[6]]$end_lat
## [1] TRUE
## 
## [[6]]$end_lng
## [1] TRUE
## 
## [[6]]$member_casual
## [1] TRUE
## 
## 
## [[7]]
## [[7]]$ride_id
## [1] TRUE
## 
## [[7]]$rideable_type
## [1] TRUE
## 
## [[7]]$started_at
## [1] TRUE
## 
## [[7]]$ended_at
## [1] TRUE
## 
## [[7]]$start_station_name
## [1] TRUE
## 
## [[7]]$start_station_id
## [1] TRUE
## 
## [[7]]$end_station_name
## [1] TRUE
## 
## [[7]]$end_station_id
## [1] TRUE
## 
## [[7]]$start_lat
## [1] TRUE
## 
## [[7]]$start_lng
## [1] TRUE
## 
## [[7]]$end_lat
## [1] TRUE
## 
## [[7]]$end_lng
## [1] TRUE
## 
## [[7]]$member_casual
## [1] TRUE
## 
## 
## [[8]]
## [[8]]$ride_id
## [1] TRUE
## 
## [[8]]$rideable_type
## [1] TRUE
## 
## [[8]]$started_at
## [1] TRUE
## 
## [[8]]$ended_at
## [1] TRUE
## 
## [[8]]$start_station_name
## [1] TRUE
## 
## [[8]]$start_station_id
## [1] TRUE
## 
## [[8]]$end_station_name
## [1] TRUE
## 
## [[8]]$end_station_id
## [1] TRUE
## 
## [[8]]$start_lat
## [1] TRUE
## 
## [[8]]$start_lng
## [1] TRUE
## 
## [[8]]$end_lat
## [1] TRUE
## 
## [[8]]$end_lng
## [1] TRUE
## 
## [[8]]$member_casual
## [1] TRUE
## 
## 
## [[9]]
## [[9]]$ride_id
## [1] TRUE
## 
## [[9]]$rideable_type
## [1] TRUE
## 
## [[9]]$started_at
## [1] TRUE
## 
## [[9]]$ended_at
## [1] TRUE
## 
## [[9]]$start_station_name
## [1] TRUE
## 
## [[9]]$start_station_id
## [1] TRUE
## 
## [[9]]$end_station_name
## [1] TRUE
## 
## [[9]]$end_station_id
## [1] TRUE
## 
## [[9]]$start_lat
## [1] TRUE
## 
## [[9]]$start_lng
## [1] TRUE
## 
## [[9]]$end_lat
## [1] TRUE
## 
## [[9]]$end_lng
## [1] TRUE
## 
## [[9]]$member_casual
## [1] TRUE
## 
## 
## [[10]]
## [[10]]$ride_id
## [1] TRUE
## 
## [[10]]$rideable_type
## [1] TRUE
## 
## [[10]]$started_at
## [1] TRUE
## 
## [[10]]$ended_at
## [1] TRUE
## 
## [[10]]$start_station_name
## [1] TRUE
## 
## [[10]]$start_station_id
## [1] TRUE
## 
## [[10]]$end_station_name
## [1] TRUE
## 
## [[10]]$end_station_id
## [1] TRUE
## 
## [[10]]$start_lat
## [1] TRUE
## 
## [[10]]$start_lng
## [1] TRUE
## 
## [[10]]$end_lat
## [1] TRUE
## 
## [[10]]$end_lng
## [1] TRUE
## 
## [[10]]$member_casual
## [1] TRUE
## 
## 
## [[11]]
## [[11]]$ride_id
## [1] TRUE
## 
## [[11]]$rideable_type
## [1] TRUE
## 
## [[11]]$started_at
## [1] TRUE
## 
## [[11]]$ended_at
## [1] TRUE
## 
## [[11]]$start_station_name
## [1] TRUE
## 
## [[11]]$start_station_id
## [1] TRUE
## 
## [[11]]$end_station_name
## [1] TRUE
## 
## [[11]]$end_station_id
## [1] TRUE
## 
## [[11]]$start_lat
## [1] TRUE
## 
## [[11]]$start_lng
## [1] TRUE
## 
## [[11]]$end_lat
## [1] TRUE
## 
## [[11]]$end_lng
## [1] TRUE
## 
## [[11]]$member_casual
## [1] TRUE
## 
## 
## [[12]]
## [[12]]$ride_id
## [1] TRUE
## 
## [[12]]$rideable_type
## [1] TRUE
## 
## [[12]]$started_at
## [1] TRUE
## 
## [[12]]$ended_at
## [1] TRUE
## 
## [[12]]$start_station_name
## [1] TRUE
## 
## [[12]]$start_station_id
## [1] TRUE
## 
## [[12]]$end_station_name
## [1] TRUE
## 
## [[12]]$end_station_id
## [1] TRUE
## 
## [[12]]$start_lat
## [1] TRUE
## 
## [[12]]$start_lng
## [1] TRUE
## 
## [[12]]$end_lat
## [1] TRUE
## 
## [[12]]$end_lng
## [1] TRUE
## 
## [[12]]$member_casual
## [1] TRUE

In this step, it could be seen that the dataframes have the same datatype for each of their columns.

2.4 Stack individual dataframes into one dataframe
all_trips <- bind_rows(jan_2022, feb_2022, mar_2022, apr_2022, may_2022, jun_2022, jul_2022, aug_2022, sep_2022, oct_2022, nov_2022, dec_2022)

The individual dataframes are compiled into one big dataframe to start doing the needed cleaning and calculations.

3. Clean up and Add Data to Prepare for Analysis

colnames(all_trips)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
nrow(all_trips)
## [1] 5667717
dim(all_trips)
## [1] 5667717      13
head(all_trips)
## # A tibble: 6 × 13
##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
##   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
## 1 C2F7DD78E82EC… electr… 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwo… 525    
## 2 A6CF8980A652D… electr… 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwo… 525    
## 3 BD0F91DFF741C… classi… 2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffi… TA1306…
## 4 CBB80ED419105… classi… 2022-01-04 00:18:04 2022-01-04 00:33:00 Clark … KA1504…
## 5 DDC963BFDDA51… classi… 2022-01-20 01:31:10 2022-01-20 01:37:12 Michig… TA1309…
## 6 A39C6F6CC0586… classi… 2022-01-11 18:48:09 2022-01-11 18:51:31 Wood S… 637    
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>, and abbreviated variable names ¹​rideable_type,
## #   ²​start_station_name, ³​start_station_id
tail(all_trips)
## # A tibble: 6 × 13
##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
##   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
## 1 7BDEDE9860418… classi… 2022-12-07 06:52:45 2022-12-07 06:56:36 Sangam… 13409  
## 2 43ABEE85B6E15… classi… 2022-12-05 06:51:04 2022-12-05 06:54:48 Sangam… 13409  
## 3 F041C89A3D1F0… electr… 2022-12-14 17:06:28 2022-12-14 17:19:27 Bernar… 18016  
## 4 A2BECB88430BE… classi… 2022-12-08 16:27:47 2022-12-08 16:32:20 Wacker… KA1503…
## 5 37B392960E566… classi… 2022-12-28 09:37:38 2022-12-28 09:41:34 Sangam… 13409  
## 6 2DD1587210BA4… classi… 2022-12-09 00:27:25 2022-12-09 00:35:28 Southp… 13235  
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>, and abbreviated variable names ¹​rideable_type,
## #   ²​start_station_name, ³​start_station_id
str(all_trips)
## spc_tbl_ [5,667,717 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5667717] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type     : chr [1:5667717] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5667717], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
##  $ ended_at          : POSIXct[1:5667717], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
##  $ start_station_name: chr [1:5667717] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr [1:5667717] "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr [1:5667717] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr [1:5667717] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num [1:5667717] 42 42 41.9 42 41.9 ...
##  $ start_lng         : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:5667717] 42 42 41.9 42 41.9 ...
##  $ end_lng           : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr [1:5667717] "casual" "casual" "member" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(all_trips)
##    ride_id          rideable_type        started_at                    
##  Length:5667717     Length:5667717     Min.   :2022-01-01 00:00:05.00  
##  Class :character   Class :character   1st Qu.:2022-05-28 19:21:05.00  
##  Mode  :character   Mode  :character   Median :2022-07-22 15:03:59.00  
##                                        Mean   :2022-07-20 07:21:18.74  
##                                        3rd Qu.:2022-09-16 07:21:29.00  
##                                        Max.   :2022-12-31 23:59:26.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2022-01-01 00:01:48.00   Length:5667717     Length:5667717    
##  1st Qu.:2022-05-28 19:43:07.00   Class :character   Class :character  
##  Median :2022-07-22 15:24:44.00   Mode  :character   Mode  :character  
##  Mean   :2022-07-20 07:40:45.33                                        
##  3rd Qu.:2022-09-16 07:39:03.00                                        
##  Max.   :2023-01-02 04:56:45.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5667717     Length:5667717     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :45.64   Max.   :-73.80  
##                                                                        
##     end_lat         end_lng       member_casual     
##  Min.   : 0.00   Min.   :-88.14   Length:5667717    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.37   Max.   :  0.00                     
##  NA's   :5858    NA's   :5858

The new table that has been created is inspected.

3.1 Add year, month and day columns
all_trips$year <- format(as.Date(all_trips$started_at), "%Y")
all_trips$month <- format(as.Date(all_trips$started_at), "%m")
all_trips$day <- format(as.Date(all_trips$started_at), "%d")

Information regarding the year, month and day of the trips are extracted to three new columns.

3.2 Add day_of_week column in making day as number ( 1 = Sunday)
all_trips <- all_trips %>% mutate(day_of_the_week = wday(as.Date(all_trips$started_at)))

The day_of_week column is transformed to numbers with Sunday being “1”.

3.3 Add month_of_year column in making month as number ( 1 = January)
all_trips <- all_trips %>% mutate(month_of_year = month(as.Date(all_trips$started_at)))

The month_of_year column is transformed to numbers with January being “1”.

3.4 Adding column ride_length
all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

The ride_length column is added to calculate the duration of each trips taken.

3.5 Convert column ride_length to numeric
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
## [1] TRUE

The ride_length column is converted to numeric type.

3.4 Remove negative ride_length
all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]

Negative values are removed from the ride_length column since it is not possible to have trip durations lesser than 0.

4. Conduct Descriptive Analysis

4.1 ride_length descriptive analysis (in sec)
summary(all_trips_v2$ride_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0     359     630    1232    1134 2483235  833041

The minimum, maximum, median, mean, 1st quartile and 3rd quartile and the number of NA values of the ride_length column is calculated.

4.2 Compare members and casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                1901.1018
## 2                     member                 770.2833
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                      814
## 2                     member                      536
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                  2483235
## 2                     member                    93594
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                        0
## 2                     member                        0

The mean, median, maximum and minimum values for members (annual) and casual members are compared with each other.

4.3 Compare average ride_length by each day for each members vs casual
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_the_week, FUN = mean)
##    all_trips_v2$member_casual all_trips_v2$day_of_the_week
## 1                      casual                            1
## 2                      member                            1
## 3                      casual                            2
## 4                      member                            2
## 5                      casual                            3
## 6                      member                            3
## 7                      casual                            4
## 8                      member                            4
## 9                      casual                            5
## 10                     member                            5
## 11                     casual                            6
## 12                     member                            6
## 13                     casual                            7
## 14                     member                            7
##    all_trips_v2$ride_length
## 1                 2216.0519
## 2                  854.8404
## 3                 1903.7340
## 4                  743.0153
## 5                 1684.9592
## 6                  733.1064
## 7                 1614.0556
## 8                  731.6995
## 9                 1665.2090
## 10                 743.8793
## 11                1837.6679
## 12                 758.3058
## 13                2117.0493
## 14                 862.1894

The average trip duration in each day of the day of the week for each annual and casual members are compared with each other.

4.4 Visualize number by rides by rider type AND drop NA values; days_of_the_week VS number of rides
all_trips_v2 %>%
  drop_na() %>%
  group_by(member_casual,day_of_the_week) %>% 
  summarise(number_of_rides=n(), average_duration = mean(ride_length)) %>% 
  group_by(day_of_the_week) %>% 
  mutate(percent_number_of_rides = number_of_rides/sum(number_of_rides)) %>% 
  mutate(percent_number_of_rides = round(percent_number_of_rides*100)) %>%
  ggplot(aes(x = day_of_the_week, y = number_of_rides, fill= member_casual)) + 
  geom_col(position="stack") + # Add stacked column graph
  scale_x_continuous(breaks = seq(1, 7)) + # Set x-axis to 1-7
  geom_text(aes(label = paste0(percent_number_of_rides, "%")), 
            position = position_stack(vjust = 0.5)) + # Add percentage labels for subscription types
  ggtitle("Subscription Type on Number of rides per Day of Week", subtitle = "(1 = Sunday)") +
  xlab("Day of the Week") +
  ylab("Number of Rides") +
  scale_fill_discrete(name = "Subscription Type") + # Change legend title for member_casual
  theme(title = element_text(size = 15),
        axis.title.x = element_text(size = 15),
        axis.title.y = element_text(size = 15),
        axis.text.x = element_text(size = 12),
        axis.text.y = element_blank())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

The number of rides per customer type is plotted against the day of each week and is visualized using a stacked column chart. The percentages are shown for each member (subscription) type. The findings are as follows:

  • Saturday has the most number of rides.

  • Monday has the least number of rides.

  • Members have more number of rides per day except in Saturday.

  • Members have the most number of rides in Tuesday and Wednesday and least in Saturday.

  • Casuals have the most number of rides in Saturday and least in Tuesday and Wednesday.

4.5 Visualize average duration; days_of_the_week VS ride_length
all_trips_v2 %>%
  drop_na() %>%
  group_by(member_casual,day_of_the_week) %>% 
  summarise(number_of_rides=n(), average_duration = mean(ride_length)) %>% 
  group_by(day_of_the_week) %>% 
  mutate(percent_average_duration = average_duration/sum(average_duration)) %>% 
  mutate(percent_average_duration = round(percent_average_duration*100)) %>%
  ggplot(aes(x = day_of_the_week, y = average_duration, fill=member_casual)) + 
  geom_col(position="stack") +
  scale_x_continuous(breaks = seq(1, 7)) +
  geom_text(aes(label = paste0(percent_average_duration, "%")), 
            position = position_stack(vjust = 0.5)) +
  ggtitle("Subscription Type on Ride Duration per Day of Week", subtitle = "(1 = Sunday)") +
  xlab("Day of Week") +
  ylab("Ride Duration") +
  scale_fill_discrete(name = "Subscription Type") + 
  theme(title = element_text(size = 15),
          axis.title.x = element_text(size = 15),
          axis.title.y = element_text(size = 15),
          axis.text.x = element_text(size = 12),
          axis.text.y = element_blank())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

The ride duration per customer type is plotted against the day of each week and is visualized using a stacked column chart. The percentages are shown for each member (subscription) type. The findings are as follows:

  • Sunday and Saturday both have the highest ride duration.

  • Wednesday has the lowest ride duration.

  • Casuals have higher ride duration than Members in every day of the week.

4.6 Visualize number of rides by subscription type; month_of_year VS number of rides
all_trips_v2 %>%
  drop_na() %>%
  group_by(member_casual, month_of_year) %>% 
  summarise(number_of_rides=n(), average_duration = mean(ride_length)) %>% 
  group_by(month_of_year) %>% 
  mutate(percent_number_of_rides = number_of_rides/sum(number_of_rides)) %>% 
  mutate(percent_number_of_rides = round(percent_number_of_rides*100)) %>% 
  ggplot(aes(x = month_of_year, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "stack") +
  scale_x_continuous(breaks = seq(1, 12)) +
  geom_text(aes(label = paste0(percent_number_of_rides, "%")), 
            position = position_stack(vjust = 0.5)) +
  ggtitle("Subscription Type on Number of rides per Month of Year", subtitle = "(1 = January)") +
  xlab("Month of Year") +
  ylab("Number of Rides") +
  scale_fill_discrete(name = "Subscription Type") +
  theme(title = element_text(size = 15),
        axis.title.x = element_text(size = 15),
        axis.title.y = element_text(size = 15),
        axis.text.x = element_text(size = 12),
        axis.text.y = element_blank())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

The number of rides per customer type is plotted against the month of the year and is visualized using a stacked column chart. The percentages are shown for each member (subscription) type. The findings are as follows:

  • July has the most number of rides.

  • January has the least number of rides.

  • Members and Casuals have the most number of rides in June, July and August and the least in January and February.

4.7 Visualize average duration; month_of_year VS ride_length
all_trips_v2 %>%
  drop_na() %>%
  group_by(member_casual,month_of_year) %>% 
  summarise(number_of_rides=n(), average_duration = mean(ride_length)) %>% 
  group_by(month_of_year) %>% 
  mutate(percent_average_duration = average_duration/sum(average_duration)) %>% 
  mutate(percent_average_duration = round(percent_average_duration*100)) %>%
  ggplot(aes(x = month_of_year, y = average_duration, fill=member_casual)) +
  geom_col(position="stack") +
  scale_x_continuous(breaks = seq(1, 12)) +
  geom_text(aes(label = paste0(percent_average_duration, "%")), 
            position = position_stack(vjust = 0.5)) +
  ggtitle("Subscription Type on Ride Duration per Month of Year", subtitle = "(1 = January)") +
  xlab("Month of Year") +
  ylab("Ride Duration") +
  scale_fill_discrete(name = "Subscription Type") +
  theme(title = element_text(size = 15),
        axis.title.x = element_text(size = 15),
        axis.title.y = element_text(size = 15),
        axis.text.x = element_text(size = 12),
        axis.text.y = element_blank())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

The ride duration per customer type is plotted against the month of the year and is visualized using a stacked column chart. The percentages are shown for each member (subscription) type. The findings are as follows:

  • Casuals have higher ride duration than Members in every month of the year.

  • April and May have the highest ride duration while December has the lowest ride duration.

4.8 Visualize number of rides; bike type vs number of rides
all_trips_v2 %>%
  drop_na() %>%
  mutate(rideable_type = gsub("_", " ", rideable_type)) %>% # Replace all "_" to " " in rideable_type
  group_by(member_casual, rideable_type) %>% 
  summarise(number_of_rides=n(), average_duration = mean(ride_length)) %>% 
  group_by(rideable_type) %>% 
  mutate(percent_number_of_rides = number_of_rides/sum(number_of_rides)) %>% 
  mutate(percent_number_of_rides = round(percent_number_of_rides*100)) %>% 
  ggplot(aes(x = rideable_type, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "stack") +
  geom_text(aes(label = paste0(percent_number_of_rides, "%")), 
            position = position_stack(vjust = 0.5)) +
  ggtitle("Subscription Type on Number of rides per Bike Type") +
  xlab("Bike Type") +
  ylab("Number of Rides") +
  scale_fill_discrete(name = "Subscription Type") +
  theme(title = element_text(size = 15),
      axis.title.x = element_text(size = 15),
      axis.title.y = element_text(size = 15),
      axis.text.x = element_text(size = 12),
      axis.text.y = element_blank())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

The number of rides per customer type is plotted against the bike type and is visualized using a stacked column chart. The percentages are shown for each member (subscription) type. The findings are as follows:

  • Classic bike has highest number of rides. Members rode classic bikes more than Casuals.

  • Docked bikes has lowest number of rides. Docked bikes are ridden only by casuals.

  • Members rode electric bikes more than Casuals.

4.9 Visualize average duration; bike type vs ride_length
all_trips_v2 %>%
  drop_na() %>%
  mutate(rideable_type = gsub("_", " ", rideable_type)) %>%
  group_by(member_casual, rideable_type) %>% 
  summarise(number_of_rides=n(), average_duration = mean(ride_length)) %>% 
  group_by(rideable_type) %>% 
  mutate(percent_average_duration = average_duration/sum(average_duration)) %>% 
  mutate(percent_average_duration = round(percent_average_duration*100)) %>%
  ggplot(aes(x = rideable_type, y = average_duration, fill = member_casual)) +
  geom_col(position = "stack") +
  geom_text(aes(label = paste0(percent_average_duration, "%")), 
            position = position_stack(vjust = 0.5)) +
  ggtitle("Subscription Type on Ride Duration per Bike Type") +
  xlab("Bike Type") +
  ylab("Ride Duration") +
  scale_fill_discrete(name = "Subscription Type") +
  theme(title = element_text(size = 15),
        axis.title.x = element_text(size = 15),
        axis.title.y = element_text(size = 15),
        axis.text.x = element_text(size = 12),
        axis.text.y = element_blank());
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

The ride duration per customer type is plotted against the bike type and is visualized using a stacked column chart. The percentages are shown for each member (subscription) type. The findings are as follows:

  • Docked bikes have the highest ride duration which are only ridden by Casuals.

  • Electric bikes have the lowest ride duration.

  • Casuals have higher ride duration than Members per bike type.

5. Export Summary File into csv

counts <- aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual +
                      all_trips_v2$day_of_the_week +
                      all_trips_v2$month_of_year+
                      all_trips_v2$rideable_type, FUN = mean)
write.csv(counts, file='~/Bike-Sharing/avg_ride_length.csv')

The dataframe used in analysis is then exported to a csv file in the event that it will be used for analysis in other platforms.

Summary and Recommendations

From the analysis done, the following actions are recommended to convert more casual members to annual type subscription: