tomrea.uk

dbplyr


Steps to get started with dplyr & Postgres, using dbplyr.

Postgres Setup

Don’t install Postgres. Docker pull Postgres

To avoid installing Postgres locally use Docker.

Pull image for the latest stable release:

docker pull postgres

Start Postgres:

docker run \
  -d --rm \
  --name pg-docker \
  -e POSTGRES_PASSWORD=docker \
  -p 127.0.0.1:5432:5432 \
  postgres

db Connection

Setup

The DBI package provides a simple interface for connecting to databases and, along with the odbc package, makes it easy to set up connections to any with Open Database Connectivity (ODBC) drivers.

Install odbc dependencies and Postgres driver:

sudo apt-get install unixodbc unixodbc-dev
sudo apt-get install odbc-postgresql

Install DBI and odbc:

install.packages(c('DBI','odbc'))

Check odbc package can find Postgres driver:

# list unique driver names
unique(odbc::odbcListDrivers()['name'])
##                 name
## 1    PostgreSQL ANSI
## 7 PostgreSQL Unicode

Connection

Connect to Postgres database:

# load DBI package
library(DBI)

# create connection
con <- dbConnect(
  odbc::odbc(),
  Driver = 'PostgreSQL Unicode',
  Server = '127.0.0.1',
  Database = 'postgres',
  UID = 'postgres',
  PWD = 'docker',
  Port = 5432
)

dbplyr

Introduction to dbplyr

Install tidyverse packages (and nycflights13 data):

install.packages(c('tidyverse', 'nycflights13'))

Use dbplyr to load nycflights13 data to database:

# load tidyverse packages
library(tidyverse)

# copy data to database
con %>%
  copy_to(nycflights13::flights, 'flights')

dbplyr interacts with database tables through table references:

# create table reference to 'flights' table
flights <- tbl(con, 'flights')

# return top of flights table
head(flights)
## # Source:   lazy query [?? x 19]
## # Database: postgres [postgres@localhost:/postgres]
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Once a table reference has been created you can use dplyr to interrogate the table.

dplyr tries to interact with the database as little as possible:

# this will not touch the database
flights_jfk <-
  flights %>%
  filter(origin == 'JFK')

# this will
flights_jfk %>%
  ggplot() +
  geom_boxplot(aes(carrier, distance))