A junior data analyst working is part of the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, the team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the analyst’s recommendations, so they must be backed up with compelling data insights and professional data visualizations.
The packages contains the functions that will be used in this study. If they cannot be loaded, the packages must be installed first.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
## ✔ tibble 3.1.8 ✔ stringr 1.5.0
## ✔ tidyr 1.2.1 ✔ forcats 0.5.2
## ✔ readr 2.1.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(skimr)
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
The project officially begins with importing all the needed data.
jan_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202201-divvy-tripdata.csv")
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
feb_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202202-divvy-tripdata.csv")
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mar_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202203-divvy-tripdata.csv")
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
apr_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202204-divvy-tripdata.csv")
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202205-divvy-tripdata.csv")
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jun_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202206-divvy-tripdata.csv")
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jul_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202207-divvy-tripdata.csv")
## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aug_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202208-divvy-tripdata.csv")
## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202209-divvy-tripdata.csv")
## Rows: 701339 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202210-divvy-tripdata.csv")
## Rows: 558685 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202211-divvy-tripdata.csv")
## Rows: 337735 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dec_2022 <- read_csv("~/Bike-Sharing/Case Study - Bike-Share/Datasets/2022/202212-divvy-tripdata.csv")
## Rows: 181806 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The column names of each uploaded datasets are then compared.
colnames(jan_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(feb_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(mar_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(apr_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(may_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jun_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jul_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(aug_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(sep_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(oct_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(nov_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(dec_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
In this step, it could be seen that they possess the same column names.
months_2022 <- list(jan_2022, feb_2022, mar_2022, apr_2022, may_2022, jun_2022, jul_2022, aug_2022, sep_2022, oct_2022, nov_2022, dec_2022)
results <- lapply(months_2022, function(month) Map(function(a, b) all(class(a) == class(b)), jan_2022, month))
print(results)
## [[1]]
## [[1]]$ride_id
## [1] TRUE
##
## [[1]]$rideable_type
## [1] TRUE
##
## [[1]]$started_at
## [1] TRUE
##
## [[1]]$ended_at
## [1] TRUE
##
## [[1]]$start_station_name
## [1] TRUE
##
## [[1]]$start_station_id
## [1] TRUE
##
## [[1]]$end_station_name
## [1] TRUE
##
## [[1]]$end_station_id
## [1] TRUE
##
## [[1]]$start_lat
## [1] TRUE
##
## [[1]]$start_lng
## [1] TRUE
##
## [[1]]$end_lat
## [1] TRUE
##
## [[1]]$end_lng
## [1] TRUE
##
## [[1]]$member_casual
## [1] TRUE
##
##
## [[2]]
## [[2]]$ride_id
## [1] TRUE
##
## [[2]]$rideable_type
## [1] TRUE
##
## [[2]]$started_at
## [1] TRUE
##
## [[2]]$ended_at
## [1] TRUE
##
## [[2]]$start_station_name
## [1] TRUE
##
## [[2]]$start_station_id
## [1] TRUE
##
## [[2]]$end_station_name
## [1] TRUE
##
## [[2]]$end_station_id
## [1] TRUE
##
## [[2]]$start_lat
## [1] TRUE
##
## [[2]]$start_lng
## [1] TRUE
##
## [[2]]$end_lat
## [1] TRUE
##
## [[2]]$end_lng
## [1] TRUE
##
## [[2]]$member_casual
## [1] TRUE
##
##
## [[3]]
## [[3]]$ride_id
## [1] TRUE
##
## [[3]]$rideable_type
## [1] TRUE
##
## [[3]]$started_at
## [1] TRUE
##
## [[3]]$ended_at
## [1] TRUE
##
## [[3]]$start_station_name
## [1] TRUE
##
## [[3]]$start_station_id
## [1] TRUE
##
## [[3]]$end_station_name
## [1] TRUE
##
## [[3]]$end_station_id
## [1] TRUE
##
## [[3]]$start_lat
## [1] TRUE
##
## [[3]]$start_lng
## [1] TRUE
##
## [[3]]$end_lat
## [1] TRUE
##
## [[3]]$end_lng
## [1] TRUE
##
## [[3]]$member_casual
## [1] TRUE
##
##
## [[4]]
## [[4]]$ride_id
## [1] TRUE
##
## [[4]]$rideable_type
## [1] TRUE
##
## [[4]]$started_at
## [1] TRUE
##
## [[4]]$ended_at
## [1] TRUE
##
## [[4]]$start_station_name
## [1] TRUE
##
## [[4]]$start_station_id
## [1] TRUE
##
## [[4]]$end_station_name
## [1] TRUE
##
## [[4]]$end_station_id
## [1] TRUE
##
## [[4]]$start_lat
## [1] TRUE
##
## [[4]]$start_lng
## [1] TRUE
##
## [[4]]$end_lat
## [1] TRUE
##
## [[4]]$end_lng
## [1] TRUE
##
## [[4]]$member_casual
## [1] TRUE
##
##
## [[5]]
## [[5]]$ride_id
## [1] TRUE
##
## [[5]]$rideable_type
## [1] TRUE
##
## [[5]]$started_at
## [1] TRUE
##
## [[5]]$ended_at
## [1] TRUE
##
## [[5]]$start_station_name
## [1] TRUE
##
## [[5]]$start_station_id
## [1] TRUE
##
## [[5]]$end_station_name
## [1] TRUE
##
## [[5]]$end_station_id
## [1] TRUE
##
## [[5]]$start_lat
## [1] TRUE
##
## [[5]]$start_lng
## [1] TRUE
##
## [[5]]$end_lat
## [1] TRUE
##
## [[5]]$end_lng
## [1] TRUE
##
## [[5]]$member_casual
## [1] TRUE
##
##
## [[6]]
## [[6]]$ride_id
## [1] TRUE
##
## [[6]]$rideable_type
## [1] TRUE
##
## [[6]]$started_at
## [1] TRUE
##
## [[6]]$ended_at
## [1] TRUE
##
## [[6]]$start_station_name
## [1] TRUE
##
## [[6]]$start_station_id
## [1] TRUE
##
## [[6]]$end_station_name
## [1] TRUE
##
## [[6]]$end_station_id
## [1] TRUE
##
## [[6]]$start_lat
## [1] TRUE
##
## [[6]]$start_lng
## [1] TRUE
##
## [[6]]$end_lat
## [1] TRUE
##
## [[6]]$end_lng
## [1] TRUE
##
## [[6]]$member_casual
## [1] TRUE
##
##
## [[7]]
## [[7]]$ride_id
## [1] TRUE
##
## [[7]]$rideable_type
## [1] TRUE
##
## [[7]]$started_at
## [1] TRUE
##
## [[7]]$ended_at
## [1] TRUE
##
## [[7]]$start_station_name
## [1] TRUE
##
## [[7]]$start_station_id
## [1] TRUE
##
## [[7]]$end_station_name
## [1] TRUE
##
## [[7]]$end_station_id
## [1] TRUE
##
## [[7]]$start_lat
## [1] TRUE
##
## [[7]]$start_lng
## [1] TRUE
##
## [[7]]$end_lat
## [1] TRUE
##
## [[7]]$end_lng
## [1] TRUE
##
## [[7]]$member_casual
## [1] TRUE
##
##
## [[8]]
## [[8]]$ride_id
## [1] TRUE
##
## [[8]]$rideable_type
## [1] TRUE
##
## [[8]]$started_at
## [1] TRUE
##
## [[8]]$ended_at
## [1] TRUE
##
## [[8]]$start_station_name
## [1] TRUE
##
## [[8]]$start_station_id
## [1] TRUE
##
## [[8]]$end_station_name
## [1] TRUE
##
## [[8]]$end_station_id
## [1] TRUE
##
## [[8]]$start_lat
## [1] TRUE
##
## [[8]]$start_lng
## [1] TRUE
##
## [[8]]$end_lat
## [1] TRUE
##
## [[8]]$end_lng
## [1] TRUE
##
## [[8]]$member_casual
## [1] TRUE
##
##
## [[9]]
## [[9]]$ride_id
## [1] TRUE
##
## [[9]]$rideable_type
## [1] TRUE
##
## [[9]]$started_at
## [1] TRUE
##
## [[9]]$ended_at
## [1] TRUE
##
## [[9]]$start_station_name
## [1] TRUE
##
## [[9]]$start_station_id
## [1] TRUE
##
## [[9]]$end_station_name
## [1] TRUE
##
## [[9]]$end_station_id
## [1] TRUE
##
## [[9]]$start_lat
## [1] TRUE
##
## [[9]]$start_lng
## [1] TRUE
##
## [[9]]$end_lat
## [1] TRUE
##
## [[9]]$end_lng
## [1] TRUE
##
## [[9]]$member_casual
## [1] TRUE
##
##
## [[10]]
## [[10]]$ride_id
## [1] TRUE
##
## [[10]]$rideable_type
## [1] TRUE
##
## [[10]]$started_at
## [1] TRUE
##
## [[10]]$ended_at
## [1] TRUE
##
## [[10]]$start_station_name
## [1] TRUE
##
## [[10]]$start_station_id
## [1] TRUE
##
## [[10]]$end_station_name
## [1] TRUE
##
## [[10]]$end_station_id
## [1] TRUE
##
## [[10]]$start_lat
## [1] TRUE
##
## [[10]]$start_lng
## [1] TRUE
##
## [[10]]$end_lat
## [1] TRUE
##
## [[10]]$end_lng
## [1] TRUE
##
## [[10]]$member_casual
## [1] TRUE
##
##
## [[11]]
## [[11]]$ride_id
## [1] TRUE
##
## [[11]]$rideable_type
## [1] TRUE
##
## [[11]]$started_at
## [1] TRUE
##
## [[11]]$ended_at
## [1] TRUE
##
## [[11]]$start_station_name
## [1] TRUE
##
## [[11]]$start_station_id
## [1] TRUE
##
## [[11]]$end_station_name
## [1] TRUE
##
## [[11]]$end_station_id
## [1] TRUE
##
## [[11]]$start_lat
## [1] TRUE
##
## [[11]]$start_lng
## [1] TRUE
##
## [[11]]$end_lat
## [1] TRUE
##
## [[11]]$end_lng
## [1] TRUE
##
## [[11]]$member_casual
## [1] TRUE
##
##
## [[12]]
## [[12]]$ride_id
## [1] TRUE
##
## [[12]]$rideable_type
## [1] TRUE
##
## [[12]]$started_at
## [1] TRUE
##
## [[12]]$ended_at
## [1] TRUE
##
## [[12]]$start_station_name
## [1] TRUE
##
## [[12]]$start_station_id
## [1] TRUE
##
## [[12]]$end_station_name
## [1] TRUE
##
## [[12]]$end_station_id
## [1] TRUE
##
## [[12]]$start_lat
## [1] TRUE
##
## [[12]]$start_lng
## [1] TRUE
##
## [[12]]$end_lat
## [1] TRUE
##
## [[12]]$end_lng
## [1] TRUE
##
## [[12]]$member_casual
## [1] TRUE
In this step, it could be seen that the dataframes have the same datatype for each of their columns.
all_trips <- bind_rows(jan_2022, feb_2022, mar_2022, apr_2022, may_2022, jun_2022, jul_2022, aug_2022, sep_2022, oct_2022, nov_2022, dec_2022)
The individual dataframes are compiled into one big dataframe to start doing the needed cleaning and calculations.
colnames(all_trips)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
nrow(all_trips)
## [1] 5667717
dim(all_trips)
## [1] 5667717 13
head(all_trips)
## # A tibble: 6 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 C2F7DD78E82EC… electr… 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwo… 525
## 2 A6CF8980A652D… electr… 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwo… 525
## 3 BD0F91DFF741C… classi… 2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffi… TA1306…
## 4 CBB80ED419105… classi… 2022-01-04 00:18:04 2022-01-04 00:33:00 Clark … KA1504…
## 5 DDC963BFDDA51… classi… 2022-01-20 01:31:10 2022-01-20 01:37:12 Michig… TA1309…
## 6 A39C6F6CC0586… classi… 2022-01-11 18:48:09 2022-01-11 18:51:31 Wood S… 637
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## # start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## # member_casual <chr>, and abbreviated variable names ¹rideable_type,
## # ²start_station_name, ³start_station_id
tail(all_trips)
## # A tibble: 6 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 7BDEDE9860418… classi… 2022-12-07 06:52:45 2022-12-07 06:56:36 Sangam… 13409
## 2 43ABEE85B6E15… classi… 2022-12-05 06:51:04 2022-12-05 06:54:48 Sangam… 13409
## 3 F041C89A3D1F0… electr… 2022-12-14 17:06:28 2022-12-14 17:19:27 Bernar… 18016
## 4 A2BECB88430BE… classi… 2022-12-08 16:27:47 2022-12-08 16:32:20 Wacker… KA1503…
## 5 37B392960E566… classi… 2022-12-28 09:37:38 2022-12-28 09:41:34 Sangam… 13409
## 6 2DD1587210BA4… classi… 2022-12-09 00:27:25 2022-12-09 00:35:28 Southp… 13235
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## # start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## # member_casual <chr>, and abbreviated variable names ¹rideable_type,
## # ²start_station_name, ³start_station_id
str(all_trips)
## spc_tbl_ [5,667,717 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5667717] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
## $ rideable_type : chr [1:5667717] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:5667717], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
## $ ended_at : POSIXct[1:5667717], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
## $ start_station_name: chr [1:5667717] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
## $ start_station_id : chr [1:5667717] "525" "525" "TA1306000016" "KA1504000151" ...
## $ end_station_name : chr [1:5667717] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
## $ end_station_id : chr [1:5667717] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
## $ start_lat : num [1:5667717] 42 42 41.9 42 41.9 ...
## $ start_lng : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:5667717] 42 42 41.9 42 41.9 ...
## $ end_lng : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual : chr [1:5667717] "casual" "casual" "member" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(all_trips)
## ride_id rideable_type started_at
## Length:5667717 Length:5667717 Min. :2022-01-01 00:00:05.00
## Class :character Class :character 1st Qu.:2022-05-28 19:21:05.00
## Mode :character Mode :character Median :2022-07-22 15:03:59.00
## Mean :2022-07-20 07:21:18.74
## 3rd Qu.:2022-09-16 07:21:29.00
## Max. :2022-12-31 23:59:26.00
##
## ended_at start_station_name start_station_id
## Min. :2022-01-01 00:01:48.00 Length:5667717 Length:5667717
## 1st Qu.:2022-05-28 19:43:07.00 Class :character Class :character
## Median :2022-07-22 15:24:44.00 Mode :character Mode :character
## Mean :2022-07-20 07:40:45.33
## 3rd Qu.:2022-09-16 07:39:03.00
## Max. :2023-01-02 04:56:45.00
##
## end_station_name end_station_id start_lat start_lng
## Length:5667717 Length:5667717 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80
##
## end_lat end_lng member_casual
## Min. : 0.00 Min. :-88.14 Length:5667717
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.37 Max. : 0.00
## NA's :5858 NA's :5858
The new table that has been created is inspected.
all_trips$year <- format(as.Date(all_trips$started_at), "%Y")
all_trips$month <- format(as.Date(all_trips$started_at), "%m")
all_trips$day <- format(as.Date(all_trips$started_at), "%d")
Information regarding the year, month and day of the trips are extracted to three new columns.
all_trips <- all_trips %>% mutate(day_of_the_week = wday(as.Date(all_trips$started_at)))
The day_of_week column is transformed to numbers with Sunday being “1”.
all_trips <- all_trips %>% mutate(month_of_year = month(as.Date(all_trips$started_at)))
The month_of_year column is transformed to numbers with January being “1”.
all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
The ride_length column is added to calculate the duration of each trips taken.
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
## [1] TRUE
The ride_length column is converted to numeric type.
all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]
Negative values are removed from the ride_length column since it is not possible to have trip durations lesser than 0.
summary(all_trips_v2$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 359 630 1232 1134 2483235 833041
The minimum, maximum, median, mean, 1st quartile and 3rd quartile and the number of NA values of the ride_length column is calculated.
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 1901.1018
## 2 member 770.2833
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 814
## 2 member 536
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 2483235
## 2 member 93594
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 0
## 2 member 0
The mean, median, maximum and minimum values for members (annual) and casual members are compared with each other.
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_the_week, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$day_of_the_week
## 1 casual 1
## 2 member 1
## 3 casual 2
## 4 member 2
## 5 casual 3
## 6 member 3
## 7 casual 4
## 8 member 4
## 9 casual 5
## 10 member 5
## 11 casual 6
## 12 member 6
## 13 casual 7
## 14 member 7
## all_trips_v2$ride_length
## 1 2216.0519
## 2 854.8404
## 3 1903.7340
## 4 743.0153
## 5 1684.9592
## 6 733.1064
## 7 1614.0556
## 8 731.6995
## 9 1665.2090
## 10 743.8793
## 11 1837.6679
## 12 758.3058
## 13 2117.0493
## 14 862.1894
The average trip duration in each day of the day of the week for each annual and casual members are compared with each other.
all_trips_v2 %>%
drop_na() %>%
group_by(member_casual,day_of_the_week) %>%
summarise(number_of_rides=n(), average_duration = mean(ride_length)) %>%
group_by(day_of_the_week) %>%
mutate(percent_number_of_rides = number_of_rides/sum(number_of_rides)) %>%
mutate(percent_number_of_rides = round(percent_number_of_rides*100)) %>%
ggplot(aes(x = day_of_the_week, y = number_of_rides, fill= member_casual)) +
geom_col(position="stack") + # Add stacked column graph
scale_x_continuous(breaks = seq(1, 7)) + # Set x-axis to 1-7
geom_text(aes(label = paste0(percent_number_of_rides, "%")),
position = position_stack(vjust = 0.5)) + # Add percentage labels for subscription types
ggtitle("Subscription Type on Number of rides per Day of Week", subtitle = "(1 = Sunday)") +
xlab("Day of the Week") +
ylab("Number of Rides") +
scale_fill_discrete(name = "Subscription Type") + # Change legend title for member_casual
theme(title = element_text(size = 15),
axis.title.x = element_text(size = 15),
axis.title.y = element_text(size = 15),
axis.text.x = element_text(size = 12),
axis.text.y = element_blank())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
The number of rides per customer type is plotted against the day of each week and is visualized using a stacked column chart. The percentages are shown for each member (subscription) type. The findings are as follows:
Saturday has the most number of rides.
Monday has the least number of rides.
Members have more number of rides per day except in Saturday.
Members have the most number of rides in Tuesday and Wednesday and least in Saturday.
Casuals have the most number of rides in Saturday and least in Tuesday and Wednesday.
all_trips_v2 %>%
drop_na() %>%
group_by(member_casual,day_of_the_week) %>%
summarise(number_of_rides=n(), average_duration = mean(ride_length)) %>%
group_by(day_of_the_week) %>%
mutate(percent_average_duration = average_duration/sum(average_duration)) %>%
mutate(percent_average_duration = round(percent_average_duration*100)) %>%
ggplot(aes(x = day_of_the_week, y = average_duration, fill=member_casual)) +
geom_col(position="stack") +
scale_x_continuous(breaks = seq(1, 7)) +
geom_text(aes(label = paste0(percent_average_duration, "%")),
position = position_stack(vjust = 0.5)) +
ggtitle("Subscription Type on Ride Duration per Day of Week", subtitle = "(1 = Sunday)") +
xlab("Day of Week") +
ylab("Ride Duration") +
scale_fill_discrete(name = "Subscription Type") +
theme(title = element_text(size = 15),
axis.title.x = element_text(size = 15),
axis.title.y = element_text(size = 15),
axis.text.x = element_text(size = 12),
axis.text.y = element_blank())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
The ride duration per customer type is plotted against the day of each week and is visualized using a stacked column chart. The percentages are shown for each member (subscription) type. The findings are as follows:
Sunday and Saturday both have the highest ride duration.
Wednesday has the lowest ride duration.
Casuals have higher ride duration than Members in every day of the week.
all_trips_v2 %>%
drop_na() %>%
group_by(member_casual, month_of_year) %>%
summarise(number_of_rides=n(), average_duration = mean(ride_length)) %>%
group_by(month_of_year) %>%
mutate(percent_number_of_rides = number_of_rides/sum(number_of_rides)) %>%
mutate(percent_number_of_rides = round(percent_number_of_rides*100)) %>%
ggplot(aes(x = month_of_year, y = number_of_rides, fill = member_casual)) +
geom_col(position = "stack") +
scale_x_continuous(breaks = seq(1, 12)) +
geom_text(aes(label = paste0(percent_number_of_rides, "%")),
position = position_stack(vjust = 0.5)) +
ggtitle("Subscription Type on Number of rides per Month of Year", subtitle = "(1 = January)") +
xlab("Month of Year") +
ylab("Number of Rides") +
scale_fill_discrete(name = "Subscription Type") +
theme(title = element_text(size = 15),
axis.title.x = element_text(size = 15),
axis.title.y = element_text(size = 15),
axis.text.x = element_text(size = 12),
axis.text.y = element_blank())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
The number of rides per customer type is plotted against the month of the year and is visualized using a stacked column chart. The percentages are shown for each member (subscription) type. The findings are as follows:
July has the most number of rides.
January has the least number of rides.
Members and Casuals have the most number of rides in June, July and August and the least in January and February.
all_trips_v2 %>%
drop_na() %>%
group_by(member_casual,month_of_year) %>%
summarise(number_of_rides=n(), average_duration = mean(ride_length)) %>%
group_by(month_of_year) %>%
mutate(percent_average_duration = average_duration/sum(average_duration)) %>%
mutate(percent_average_duration = round(percent_average_duration*100)) %>%
ggplot(aes(x = month_of_year, y = average_duration, fill=member_casual)) +
geom_col(position="stack") +
scale_x_continuous(breaks = seq(1, 12)) +
geom_text(aes(label = paste0(percent_average_duration, "%")),
position = position_stack(vjust = 0.5)) +
ggtitle("Subscription Type on Ride Duration per Month of Year", subtitle = "(1 = January)") +
xlab("Month of Year") +
ylab("Ride Duration") +
scale_fill_discrete(name = "Subscription Type") +
theme(title = element_text(size = 15),
axis.title.x = element_text(size = 15),
axis.title.y = element_text(size = 15),
axis.text.x = element_text(size = 12),
axis.text.y = element_blank())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
The ride duration per customer type is plotted against the month of the year and is visualized using a stacked column chart. The percentages are shown for each member (subscription) type. The findings are as follows:
Casuals have higher ride duration than Members in every month of the year.
April and May have the highest ride duration while December has the lowest ride duration.
all_trips_v2 %>%
drop_na() %>%
mutate(rideable_type = gsub("_", " ", rideable_type)) %>% # Replace all "_" to " " in rideable_type
group_by(member_casual, rideable_type) %>%
summarise(number_of_rides=n(), average_duration = mean(ride_length)) %>%
group_by(rideable_type) %>%
mutate(percent_number_of_rides = number_of_rides/sum(number_of_rides)) %>%
mutate(percent_number_of_rides = round(percent_number_of_rides*100)) %>%
ggplot(aes(x = rideable_type, y = number_of_rides, fill = member_casual)) +
geom_col(position = "stack") +
geom_text(aes(label = paste0(percent_number_of_rides, "%")),
position = position_stack(vjust = 0.5)) +
ggtitle("Subscription Type on Number of rides per Bike Type") +
xlab("Bike Type") +
ylab("Number of Rides") +
scale_fill_discrete(name = "Subscription Type") +
theme(title = element_text(size = 15),
axis.title.x = element_text(size = 15),
axis.title.y = element_text(size = 15),
axis.text.x = element_text(size = 12),
axis.text.y = element_blank())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
The number of rides per customer type is plotted against the bike type and is visualized using a stacked column chart. The percentages are shown for each member (subscription) type. The findings are as follows:
Classic bike has highest number of rides. Members rode classic bikes more than Casuals.
Docked bikes has lowest number of rides. Docked bikes are ridden only by casuals.
Members rode electric bikes more than Casuals.
all_trips_v2 %>%
drop_na() %>%
mutate(rideable_type = gsub("_", " ", rideable_type)) %>%
group_by(member_casual, rideable_type) %>%
summarise(number_of_rides=n(), average_duration = mean(ride_length)) %>%
group_by(rideable_type) %>%
mutate(percent_average_duration = average_duration/sum(average_duration)) %>%
mutate(percent_average_duration = round(percent_average_duration*100)) %>%
ggplot(aes(x = rideable_type, y = average_duration, fill = member_casual)) +
geom_col(position = "stack") +
geom_text(aes(label = paste0(percent_average_duration, "%")),
position = position_stack(vjust = 0.5)) +
ggtitle("Subscription Type on Ride Duration per Bike Type") +
xlab("Bike Type") +
ylab("Ride Duration") +
scale_fill_discrete(name = "Subscription Type") +
theme(title = element_text(size = 15),
axis.title.x = element_text(size = 15),
axis.title.y = element_text(size = 15),
axis.text.x = element_text(size = 12),
axis.text.y = element_blank());
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
The ride duration per customer type is plotted against the bike type and is visualized using a stacked column chart. The percentages are shown for each member (subscription) type. The findings are as follows:
Docked bikes have the highest ride duration which are only ridden by Casuals.
Electric bikes have the lowest ride duration.
Casuals have higher ride duration than Members per bike type.
counts <- aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual +
all_trips_v2$day_of_the_week +
all_trips_v2$month_of_year+
all_trips_v2$rideable_type, FUN = mean)
write.csv(counts, file='~/Bike-Sharing/avg_ride_length.csv')
The dataframe used in analysis is then exported to a csv file in the event that it will be used for analysis in other platforms.
From the analysis done, the following actions are recommended to convert more casual members to annual type subscription:
Regarding the bike type, promotions should be offered for docked bikes as they are more popular among casual members in terms of both ride frequency and duration, compared to annual members.
Regarding the monthly promotions, July should be favored as it has the highest number of rides. Another promotion should be considered based on trip duration, as casual members consistently have longer rides compared to annual members every month.
Regarding the weekly promotions, Saturdays should receive a promotion aimed at casual members as this is the day when casual members have the highest number of rides and longest trip duration.