
rOpenSci Software Review: Always Improving


The R package ecosystem now contains more than 10K packages, and several flagship packages belong to the rOpenSci suite. Some of these are: magick for image manipulation, plotly for interactive plots, and git2r for interacting with git.

rOpenSci is a community of people making software to facilitate open and reproducible science/research. While the rOpenSci team continues to develop and maintain core infrastructure packages, an increasing number of packages in our suite are contributed by members of the extended R community.

In the early days we accepted contributions to our suite without any formal process for submission or acceptance. When someone wanted to contribute software to our collection, and we could envision scientific applications, we just moved it aboard. But as our community and codebase grew, we began formalizing standards and processes to control quality. This is what became our peer review process. You can read more about it in our recent blog post.

As our submissions have grown over the past couple of years, our standards around peer review have also changed and continue to evolve in response to changing community needs and updates to the R development infrastructure.

Although a large number of packages submitted to CRAN could also be part of rOpenSci, our submissions are limited to packages that fit our mission and are able to pass a stringent and time-intensive review process.

Here, we summarize some of the more important changes to peer review at rOpenSci over the past year. The most recent information can always be found at https://onboarding.ropensci.org/.

We’ve Expanded Our Scope

Our Aims and Scope document what types of packages we accept from community contributors. The scope emerges from three main guidelines. First, we accept packages that fit our mission of enabling open and reproducible research. Second, we only accept packages that we feel our editors and community of reviewers are competent to review. Third, we accept packages that we can reasonably endorse as improving on existing solutions. In practice, we don’t accept general-purpose packages. That’s why, for instance, our “data munging” category only applies to packages designed to work with specific scientific data types.

We’ve refined our focal areas from

  • data retrieval
  • data visualization
  • data deposition
  • data munging
  • reproducibility

to

  • data retrieval - packages for accessing and downloading data from online sources with scientific applications
  • data deposition - packages that support deposition of data into research repositories, including data formatting and metadata generation
  • data munging - packages for processing data from formats mentioned above
  • data extraction - packages that aid in retrieving data from unstructured sources such as text, images and PDFs, as well as parsing scientific data types and outputs from scientific equipment
  • database access - bindings and wrappers for generic database APIs
  • reproducibility - tools that facilitate reproducible research
  • geospatial data - accessing geospatial data, manipulating geospatial data, and converting between geospatial data formats
  • text analysis (pilot) - we are piloting a sub-specialty area for text analysis which includes implementation of statistical/ML methods for analyzing or extracting text data

You will note that we’ve removed data visualization. We’ve had some truly excellent data visualization packages come aboard, starting with plotly. But since then we’ve found data visualization is too general a field for us to confidently evaluate, and at this point have dropped it from our main categories.

We’ve also added geospatial and text analysis as areas where we accept packages that might seem more general or methods-y than we would otherwise. These are areas where we’ve built, among our staff and reviewers, topic-specific expertise.

Given that we accept packages that improve on existing solutions, in practice we generally avoid accepting more than one package of similar scope. We’ve also added clarifying language about what this entails and how we define overlap with other packages.

We now strongly encourage pre-submission inquiries to quickly assess whether a package falls into scope. Some of these lead to us suggesting the author submit their package, while others are determined out-of-scope. This reduces effort on all sides for packages that would be out-of-scope. Many authors do this before completing their package so they can decide whether to tailor their development process to rOpenSci.

To see examples of what has recently been determined to be out-of-scope, see the out-of-scope label in the onboarding repository.

As always, we’d like to emphasize that even when packages are out-of-scope, we’re very grateful that authors consider an rOpenSci submission!

Standards changes

Our packaging guide contains both recommended and required best practices. They evolve continually as our community reaches consensus on best practices that we want to encourage and standardize. Here are some of the changes we’ve incorporated in the past year.

  • We now encourage using an object_verb() function naming scheme to avoid namespace conflicts (see the short sketch after this list).
  • We now encourage functions to facilitate piping workflows if possible. We don’t have an official stance on using pipes in a package.
  • We’ve clarified and filled out our README recommendations.
  • Documentation: we now recommend inclusion of a package level manual file, and at least one vignette.
  • Testing: we clarify that packages should pass devtools::check() on all major platforms, and that each package should have a test suite covering all major functionality.
  • Continuous integration (CI): we now require that packages with compiled code run continuous integration on all major platforms, integrate reporting of test coverage, and include README badges for CI and coverage.
  • We’ve clarified our recommended scaffolding suggestions around XML to be more nuanced. Briefly, we recommend the xml2 package in general, but the XML package may be needed in some cases.
  • We added a section on CRAN gotchas to help package maintainers avoid common pitfalls encountered during CRAN submission.
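
To make the first two items concrete, here is a hypothetical sketch: the stn_* functions below are invented for illustration and are not from any rOpenSci package.

library(magrittr)

# A shared "stn_" prefix keeps exported names from colliding with generic names
# such as read() or filter() in other packages, and returning a data frame from
# each function keeps the workflow pipe-friendly.
stn_read   <- function(path) read.csv(path, stringsAsFactors = FALSE)
stn_filter <- function(stations, active_only = TRUE) {
  # assumes a logical column named "active" in the hypothetical data
  if (active_only) stations[stations$active, , drop = FALSE] else stations
}
stn_count  <- function(stations) nrow(stations)

# Example usage (commented because "stations.csv" is a placeholder file):
# stn_read("stations.csv") %>% stn_filter() %>% stn_count()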

Standards changes often take place because we find that both editors and reviewers are making the same recommendations on multiple packages. Other requirements are added as good practices become accessible to the broader community. For instance, CI and code coverage reporting have gone from recommended to required as the tooling and documentation/tutorials for these have made them more accessible.

Process changes

Editors

As the pace of package submissions increases, we’ve expanded our editorial team to keep up. Maëlle Salmon joined us in February, bringing our team to four. With four, we need to be more coordinated, so we’ve moved to a system of a rotating editor-in-chief, who makes decisions about scope, assigns handling editors, and brings up edge cases for discussion with the whole team.

The process our editors follow is summarized in our editors’ guide, which will help bring editors up to speed when we further expand our team.

Automation

As submissions increase, the entire process benefits more from automation. Right now most steps of the review system are manual - we aim to automate as much as possible. Here are a few things we’re doing or planning on:

  1. With every package submission, we run goodpractice on the package to highlight common issues. We do this manually right now, but we’re working on an automated system (aka, a bot) for automatically running goodpractice on submissions and reporting back to the issue (see the example after this list). Other rOpenSci-specific checks, e.g., checking rOpenSci policies, are likely to be added to this system.
  2. Reminders: Some readers who have reviewed for rOpenSci may remember the bot that would remind you to get your review in. We’ve disabled it for now - but will likely bring it back online soon. Right now, editors do these reminders manually.
  3. On approval, packages go through a number of housekeeping steps to ensure a smooth transfer. Eventually we’d like to automate this process.
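
For package authors who want to see what that first check surfaces before submitting, goodpractice can be run locally on the package source; a minimal sketch (the path below is a placeholder):

# install.packages("goodpractice")
library(goodpractice)

# Run a battery of checks (code style, test coverage, common pitfalls) and
# print the resulting advice; point this at your own package's source directory.
checks <- gp("path/to/your/package")
checks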

Other changes

  • JOSS harmonization/co-submission: For authors wishing to submit their software papers to the Journal of Open Source Software after acceptance, we have also begun streamlining the process. Editors check to make sure that the paper clearly states the scientific application, includes a separate .bib file and that the accepted version of the software is deposited at Zenodo or Figshare with a DOI. Having these steps completed allows for a fast track acceptance at JOSS.
  • Reviewer template and guide: We now have a reviewer template, making reviews more standardized and helping reviewers know what to look for. We also have an updated reviewer guide that gives high-level guidance as well as specific things to look for, tools to use, examples of good reviews, and instructions on how to submit reviews.
  • Badges: We now have badges for rOpenSci review that show whether a package is under review or has been approved. Packages in either state can display the badge in their README.


Get in touch with us on Twitter (@ropensci, or in the comments) if you have any questions or thoughts about our software review policies, scope, or processes.

To find out more about our software review process, join us on the next rOpenSci Community Call.

We hope to see you soon in the onboarding repository as a submitter or as a reviewer!


Accessing patent data with the patentsview package


Why care about patents?

1. Patents play a critical role in incentivizing innovation, without which we wouldn’t have much of the technology we rely on every day

What do your iPhone, Google’s PageRank algorithm, and a butter substitute called Smart Balance all have in common?

…They all probably wouldn’t be here if not for patents. A patent provides its owner with the ability to make money off of something that they invented, without having to worry about someone else copying their technology. Think Apple would spend millions of dollars developing the iPhone if Samsung could just come along and rip it off? Probably not.

2. Patents offer a great opportunity for data analysis

There are two primary reasons for this:

  • Patent data is public. In return for the exclusive right to profit off an invention, an individual/company has to publicly disclose the details of their invention to the rest of the world. Examples of those details include the patent’s title, abstract, technology classification, assigned organizations, etc.
  • Patent data can answer questions that people care about. Companies (especially big ones like IBM and Google) have a vested interest in extracting insights from patents, and spend a lot of time/resources trying to figure out how to best manage their intellectual property (IP) rights. They’re plagued by questions like “who should I sell my underperforming patents to,” “which technology areas are open to new innovations,” “what’s going to be the next big thing in the world of buttery spreads,” etc. Patents offer a way to provide data-driven answers to these questions.

Combined, these two things make patents a prime target for data analysis. However, until recently it was hard to get at the data inside these documents. One had to either collect it manually using the official United States Patent and Trademark Office (USPTO) search engine, or figure out a way to download, parse, and model huge XML data dumps. Enter PatentsView.

PatentsView and the patentsview package

PatentsView is one of USPTO’s new initiatives intended to increase the usability and value of patent data. One feature of this project is a publicly accessible API that makes it easy to programmatically interact with the data. A few of the reasons why I like the API (and PatentsView more generally):

  • The API is free (no credential required) and currently doesn’t impose rate limits/bandwidth throttling.
  • The project offers bulk downloads of patent data on their website (in a flat file format), for those who want to be closest to the data.
  • Both the API and the bulk download data contain disambiguated entities such as inventors, assignees, organizations, etc. In other words, the API will tell you whether it thinks that John Smith on patent X is the same person as John Smith on patent Y. 1

The patentsview R package is a wrapper around the PatentsView API. It contains a function that acts as a client to the API (search_pv()) as well as several supporting functions. Full documentation of the package can be found on its website.

Installation

You can install the stable version of patentsview from CRAN:

install.packages("patentsview")

Or get the development version from GitHub:

if (!require(devtools)) install.packages("devtools")

devtools::install_github("ropensci/patentsview")

Getting started

The package has one main function, search_pv(), that makes it easy to send requests to the API. There are two parameters to search_pv() that you’re going to want to think about just about every time you call it - query and fields. You tell the API how you want to filter the patent data with query, and which fields you want to retrieve with fields. 2

query

Your query has to use the PatentsView query language, which is a JSON-based syntax that is similar to the one used by Lucene. You can write the query directly and pass it as a string to search_pv():

library(patentsview)

qry_1 <- '{"_gt":{"patent_year":2007}}'
search_pv(query = qry_1, fields = NULL) # This will retrieve a default set of fields
#> $data
#> #### A list with a single data frame on the patent data level:
#>
#> List of 1
#>  $ patents:'data.frame': 25 obs. of  3 variables:
#>   ..$ patent_id    : chr [1:25] "7313829" ...
#>   ..$ patent_number: chr [1:25] "7313829" ...
#>   ..$ patent_title : chr [1:25] "Sealing device for body suit and sealin"..
#>
#> $query_results
#> #### Distinct entity counts across all downloadable pages of output:
#>
#> total_patent_count = 100,000

…Or you can use the domain-specific language (DSL) provided in the patentsview package to help you write the query:

qry_2 <- qry_funs$gt(patent_year = 2007) # All DSL functions are in the qry_funs list
qry_2 # qry_2 is the same as qry_1
#> {"_gt":{"patent_year":2007}}

search_pv(query = qry_2)
#> $data
#> #### A list with a single data frame on the patent data level:
#>
#> List of 1
#>  $ patents:'data.frame': 25 obs. of  3 variables:
#>   ..$ patent_id    : chr [1:25] "7313829" ...
#>   ..$ patent_number: chr [1:25] "7313829" ...
#>   ..$ patent_title : chr [1:25] "Sealing device for body suit and sealin"..
#>
#> $query_results
#> #### Distinct entity counts across all downloadable pages of output:
#>
#> total_patent_count = 100,000

qry_1 and qry_2 will result in the same HTTP call to the API. Both queries search for patents in USPTO that were published after 2007. There are three gotchas to look out for when writing a query:

  1. Field is queryable. The API has 7 endpoints (the default endpoint is “patents”), and each endpoint has its own set of fields that you can filter on. The fields that you can filter on are not necessarily the same as the ones that you can retrieve. In other words, the fields that you can include in query (e.g., patent_year) are not necessarily the same as those that you can include in fields. To see which fields you can query on, look in the fieldsdf data frame (View(patentsview::fieldsdf)) for fields that have a “y” indicator in their can_query column for your given endpoint.
  2. Correct data type for field. If you’re filtering on a field in your query, you have to make sure that the value you are filtering on is consistent with the field’s data type. For example, patent_year has type “integer,” so if you pass 2007 as a string then you’re going to get an error (patent_year = 2007 is good, patent_year = "2007" is no good). You can find a field’s data type in the fieldsdf data frame.
  3. Comparison function works with field’s data type. The comparison function(s) that you use (e.g., the greater-than function shown above, qry_funs$gt()) must be consistent with the field’s data type. For example, you can’t use the “contains” function on fields of type “integer” (qry_funs$contains(patent_year = 2007) will throw an error). See ?qry_funs for more details.

In short, use the fieldsdf data frame when you write a query and you should be fine. Check out the writing queries vignette for more details.
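
As a quick illustration of that workflow, you can pull the queryable fields for the default endpoint straight out of fieldsdf (using only the columns described above):

library(patentsview)

# Fields that can be used inside `query` at the "patents" endpoint
queryable <- fieldsdf$field[fieldsdf$endpoint == "patents" &
                              fieldsdf$can_query == "y"]

# Confirm that patent_year is queryable before using it in qry_funs$gt()
"patent_year" %in% queryable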

fields

Up until now we have been using the default value for fields. This results in the API giving us some small set of default fields. Let’s see about retrieving some more fields:

search_pv(
  query = qry_funs$gt(patent_year = 2007),
  fields = c("patent_abstract", "patent_average_processing_time","inventor_first_name", "inventor_total_num_patents")
)
#> $data
#> #### A list with a single data frame (with list column(s) inside) on the patent data level:
#>
#> List of 1
#>  $ patents:'data.frame': 25 obs. of  3 variables:
#>   ..$ patent_abstract               : chr [1:25] "A sealing device for a"..
#>   ..$ patent_average_processing_time: chr [1:25] "1324" ...
#>   ..$ inventors                     :List of 25
#>
#> $query_results
#> #### Distinct entity counts across all downloadable pages of output:
#>
#> total_patent_count = 100,000

The fields that you can retrieve depend on the endpoint that you are hitting. We’ve been using the “patents” endpoint thus far, so all of these are retrievable: fieldsdf[fieldsdf$endpoint == "patents", "field"]. You can also use get_fields() to list the retrievable fields for a given endpoint:

search_pv(
  query = qry_funs$gt(patent_year = 2007),
  fields = get_fields(endpoint = "patents", groups = c("patents", "inventors"))
)
#> $data
#> #### A list with a single data frame (with list column(s) inside) on the patent data level:
#>
#> List of 1
#>  $ patents:'data.frame': 25 obs. of  31 variables:
#>   ..$ patent_abstract                       : chr [1:25] "A sealing devi"..
#>   ..$ patent_average_processing_time        : chr [1:25] "1324" ...
#>   ..$ patent_date                           : chr [1:25] "2008-01-01" ...
#>   ..$ patent_firstnamed_assignee_city       : chr [1:25] "Cambridge" ...
#>   ..$ patent_firstnamed_assignee_country    : chr [1:25] "US" ...
#>   ..$ patent_firstnamed_assignee_id         : chr [1:25] "b9fc6599e3d60c"..
#>   ..$ patent_firstnamed_assignee_latitude   : chr [1:25] "42.3736" ...
#>   ..$ patent_firstnamed_assignee_location_id: chr [1:25] "42.3736158|-71"..
#>   ..$ patent_firstnamed_assignee_longitude  : chr [1:25] "-71.1097" ...
#>   ..$ patent_firstnamed_assignee_state      : chr [1:25] "MA" ...
#>   ..$ patent_firstnamed_inventor_city       : chr [1:25] "Lucca" ...
#>   ..$ patent_firstnamed_inventor_country    : chr [1:25] "IT" ...
#>   ..$ patent_firstnamed_inventor_id         : chr [1:25] "6416028-3" ...
#>   ..$ patent_firstnamed_inventor_latitude   : chr [1:25] "43.8376" ...
#>   ..$ patent_firstnamed_inventor_location_id: chr [1:25] "43.8376211|10."..
#>   ..$ patent_firstnamed_inventor_longitude  : chr [1:25] "10.4951" ...
#>   ..$ patent_firstnamed_inventor_state      : chr [1:25] "Tuscany" ...
#>   ..$ patent_id                             : chr [1:25] "7313829" ...
#>   ..$ patent_kind                           : chr [1:25] "B1" ...
#>   ..$ patent_number                         : chr [1:25] "7313829" ...
#>   ..$ patent_num_cited_by_us_patents        : chr [1:25] "5" ...
#>   ..$ patent_num_claims                     : chr [1:25] "25" ...
#>   ..$ patent_num_combined_citations         : chr [1:25] "35" ...
#>   ..$ patent_num_foreign_citations          : chr [1:25] "0" ...
#>   ..$ patent_num_us_application_citations   : chr [1:25] "0" ...
#>   ..$ patent_num_us_patent_citations        : chr [1:25] "35" ...
#>   ..$ patent_processing_time                : chr [1:25] "792" ...
#>   ..$ patent_title                          : chr [1:25] "Sealing device"..
#>   ..$ patent_type                           : chr [1:25] "utility" ...
#>   ..$ patent_year                           : chr [1:25] "2008" ...
#>   ..$ inventors                             :List of 25
#>
#> $query_results
#> #### Distinct entity counts across all downloadable pages of output:
#>
#> total_patent_count = 100,000

Example

Let’s look at a quick example of pulling and analyzing patent data. We’ll look at patents from the last ten years that are classified below the H04L63/02 CPC code. Patents in this area relate to “network architectures or network communication protocols for separating internal from external traffic.” 3 CPC codes offer a quick and dirty way to find patents of interest, though getting a sense of their hierarchy can be tricky.

  1. Download the data
library(patentsview)

# Write a query:
query <- with_qfuns( # with_qfuns is basically just: with(qry_funs, ...)
  and(
    begins(cpc_subgroup_id = 'H04L63/02'),
    gte(patent_year = 2007)
  )
)

# Create a list of fields:
fields <- c(
  c("patent_number", "patent_year"),
  get_fields(endpoint = "patents", groups = c("assignees", "cpcs"))
)

# Send HTTP request to API's server:
pv_res <- search_pv(query = query, fields = fields, all_pages = TRUE)

  2. See where the patents are coming from (geographically)
library(leaflet)
library(htmltools)
library(dplyr)
library(tidyr)

data <-
  pv_res$data$patents %>%
    unnest(assignees) %>%
    select(assignee_id, assignee_organization, patent_number,
           assignee_longitude, assignee_latitude) %>%
    group_by_at(vars(-matches("pat"))) %>%
    mutate(num_pats = n()) %>%
    ungroup() %>%
    select(-patent_number) %>%
    distinct() %>%
    mutate(popup = paste0("<font color='Black'>",
                          htmlEscape(assignee_organization), "<br><br>Patents:",
                          num_pats, "</font>")) %>%
    mutate_at(vars(matches("_l")), as.numeric) %>%
    filter(!is.na(assignee_id))

leaflet(data) %>%
  addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
  addCircleMarkers(lng = ~assignee_longitude, lat = ~assignee_latitude,
                   popup = ~popup, radius = ~sqrt(num_pats), color = "yellow")


  3. Plot the growth of the field’s topics over time
library(ggplot2)
library(RColorBrewer)

data <-
  pv_res$data$patents %>%
    unnest(cpcs) %>%
    filter(cpc_subgroup_id != "H04L63/02") %>% # remove patents categorized into only top-level category of H04L63/02
    mutate(
      title = case_when(
        grepl("filtering", .$cpc_subgroup_title, ignore.case = T) ~"Filtering policies",
        .$cpc_subgroup_id %in% c("H04L63/0209", "H04L63/0218") ~"Architectural arrangements",
        grepl("Firewall traversal", .$cpc_subgroup_title, ignore.case = T) ~"Firewall traversal",
        TRUE ~
          .$cpc_subgroup_title
      )
    ) %>%
    mutate(title = gsub(".*(?=-)-", "", title, perl = TRUE)) %>%
    group_by(title, patent_year) %>%
    count() %>%
    ungroup() %>%
    mutate(patent_year = as.numeric(patent_year))

ggplot(data = data) +
  geom_smooth(aes(x = patent_year, y = n, colour = title), se = FALSE) +
  scale_x_continuous("\nPublication year", limits = c(2007, 2016),
                     breaks = 2007:2016) +
  scale_y_continuous("Patents\n", limits = c(0, 700)) +
  scale_colour_manual("", values = brewer.pal(5, "Set2")) +
  theme_bw() + # theme inspired by https://hrbrmstr.github.io/hrbrthemes/
  theme(panel.border = element_blank(), axis.ticks = element_blank())

Learning more

For analysis examples that go into a little more depth, check out the data applications vignettes on the package’s website. If you’re just interested in search_pv(), there are examples on the site for that as well. To contribute to the package or report an issue, check out the issues page on GitHub.

Acknowledgments

I’d like to thank the package’s two reviewers, Paul Oldham and Verena Haunschmid, for taking the time to review the package and providing helpful feedback. I’d also like to thank Maëlle Salmon for shepherding the package along the rOpenSci review process, as well as Scott Chamberlain and Stefanie Butland for their miscellaneous help.


  1. This is both good and bad, as there are errors in the disambiguation. The algorithm that is responsible for the disambiguation was created by the winner of the PatentsView Inventor Disambiguation Technical Workshop.
  2. These two parameters end up getting translated into a MySQL query by the API’s server, which then gets sent to a back-end database. query and fields are used to create the query’s WHERE and SELECT clauses, respectively.
  3. There is a slightly more in-depth definition that says that these are patents “related to the (logical) separation of traffic/(sub-) networks to achieve protection.”

rrricanes to Access Tropical Cyclone Data


What is rrricanes

Why Write rrricanes?

There is a tremendous amount of weather data available on the internet. Much of it is in raw format and not very easy to obtain. Hurricane data is no different. When one thinks of this data they may be inclined to think it is a bunch of map coordinates with some wind values and not much else. A deeper look will reveal structural and forecast data. An even deeper look will find millions of data points from hurricane reconnaissance, computer forecast models, ship and buoy observations, satellite and radar imagery, …

rrricanes is an attempt to bring this data together in a way that doesn’t just benefit R users, but users of other languages as well.

I began learning R in 2015 and immediately wished I had a hurricane-specific dataset when Hurricane Patricia became a harmless but historic hurricane roaming the Pacific waters. I revisited this idea as Hurricane Matthew took aim at Florida and the southeast in 2016. Being unable to use R to study and consolidate Matthew’s data led me to begin learning package development. Thus, the birth of rrricanes.

In this article, I will take you on a lengthy tour of the most important features of rrricanes and what the data means. If you have a background working with hurricane data, most of this will be redundant. My aim here is to cover the big ideas behind the package and explain them under the assumption you, the reader, are unfamiliar with the data offered.

rrricanes is not intended to be used in emergency situations. I write this article as areas I have lived in or currently live are under the gun from Hurricane Harvey, and rrricanes is unable to obtain data due to external issues (I will describe these later). It is designed with the intent of answering questions and exploring ideas outside of a time-sensitive environment.

rrricanes will not be available on CRAN for quite some time. The current schedule is May 15, 2018 (the “start” of the East Pacific hurricane season). This year is solely for testing under real-time conditions.

And rrricanesdata

The NHC archives text products dating back to at least 1998 (some earlier years exist but are not yet implemented in this package). Accessing this data is a time-consuming process on any computer. A limit of 4 requests per second is put in place to avoid being banned (or restricted) from the archives. So, if a hurricane has 20 text products you wish to pull and parse, this will take 5 seconds. Most cyclones have more and some, far more.

rrricanesdata is a complement package to rrricanes. rrricanesdata contains post-scraped datasets of the archives for all available storms, with the exception of advisories issued in the current month. This means you can explore the various datasets without the wait.

rrricanesdata will be updated monthly if an advisory has been issued the previous month. There will be regular monthly updates approximately from May through November - the typical hurricane season. In some cases, a cyclone may develop in the off-season. rrricanesdata will be updated on the same schedule.

ELI5 the Data

This package covers tropical cyclones that have developed in the Atlantic basin (north Atlantic ocean) or East Pacific basin (northeast Pacific east of 140°W). Central Pacific (140°W - 180°W) may be mixed in if listed in the NHC archives.

While traditionally the hurricane season for each basin runs from mid-May or June through November, some cyclones have developed outside of this time frame.

Every tropical cyclone (any tropical low whether classified as a tropical depression, tropical storm or hurricane) contains a core set of text products officially issued from the National Hurricane Center. These products are issued every six hours.

Much of this data has changed in format over the years. Some products have been discontinued and replaced by new products or wrapped into existing products. Some of these products are returned in raw text format; it is not cleaned and may contain HTML characters. Other products are parsed with every piece of data extracted and cleaned.

I have done my best to ensure data is high quality. But, I cannot guarantee it is perfect. If you do believe you have found an error, please let me know, even if it seems small. I would rather be notified of a false error than ignore a true one.

The Products

Each advisory product is listed below with an abbreviation in parentheses. Unless otherwise noted, these products are issued every six hours. Generally, the times issued are 03:00, 09:00, 15:00 and 21:00 UTC. Some products may be issued in three-hour increments and, sometimes, two-hour increments. The update product can be issued at any time.

  • Storm Discussion (discus) - These are technical discussions centered on the current structure of the cyclone, satellite presentation, computer forecast model tendencies and more. These products are not parsed.

  • Forecast/Advisory (fstadv) - This data-rich product lists the current location of the cyclone, its wind structure, forecast and forecast wind structure.

  • Public Advisory (public) - These are general text statements issued for the public-at-large. Information in these products is a summary of the Forecast/Advisory product along with any watches and warnings issued, changed, or cancelled. Public Advisory products are the only regularly-scheduled product that may be issued intermittently (every three hours and, occasionally, every two hours) when watches and warnings are in effect. These products are not parsed.

  • Wind Speed Probabilities (wndprb) - These products list the probability of a minimum sustained wind speed expected in a given forecast window. This product replaces the Strike Probabilities product beginning in 2006 (see below).

  • Updates (update) - Tropical Cyclone Updates may be issued at any time if a storm is an immediate threat to land or if the cyclone undergoes a significant change of strength or structure. The information in this product is general. These products are not parsed.

Discontinued Products

  • Strike Probabilities (prblty) - List the probability of a tropical cyclone passing within 65 nautical miles of a location within a forecast window. Replaced in 2006 by the Wind Speed Probabilities product.

  • Position Estimates (posest) - Typically issued as a storm is threatening land but generally rare (see Hurricane Ike 2008, Key AL092008). It is generally just an update of the current location of the cyclone. After the 2011 hurricane season, this product was discontinued; Updates are now issued in their place. These products are not parsed.

Primary Key

Every cyclone has a Key. However, not all products contain this value (prblty, for example). Products issued during and after the 2005 hurricane season contain this variable.

Use Key to tie datasets together. If Key does not exist, you will need to use a combination of Name and Date, depending on your requirements. Keep in mind that, unless a name is retired, names are recycled every seven years. For example, there are multiple cyclones named Katrina but you may want to isolate on Katrina, 2005.
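
As a rough sketch of what that looks like with dplyr (fstadv and prblty here stand in for parsed product data frames, which are introduced below):

library(dplyr)

# With Key (2005 onward), one filter isolates a single cyclone across datasets
harvey <- fstadv %>% filter(Key == "AL092017")

# Without Key (older products such as prblty), combine Name and Date so a
# recycled name like Katrina only matches the 2005 cyclone
katrina_2005 <- prblty %>%
  filter(Name == "Katrina", format(Date, "%Y") == "2005")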

Installation

rrricanes will not be submitted to CRAN until just prior to the 2018 hurricane season. It can be installed from GitHub using devtools:

devtools::install_github("ropensci/rrricanes", build_vignettes = TRUE)

Optional Supporting Packages

rrricanesdata uses a drat repository to host the large, pre-processed datasets.

install.packages("rrricanesdata",
                 repos = "https://timtrice.github.io/drat/",
                 type = "source")

To use high-resolution tracking charts, you may also wish to install the rnaturalearthhires package:

install.packages("rnaturalearthhires",
                 repos = "http://packages.ropensci.org",
                 type = "source")

Linux users may also need to install:

  • libgdal-dev
  • libproj-dev
  • libxml2-dev

Get a List of Storms

We start exploring rrricanes by finding a storm (or storms) we wish to analyze. For this, we use get_storms. There are two optional parameters:

  • years - Between 1998 and the current year

  • basins - One or both of “AL” and “EP”

An empty call to the function will return storms for both the Atlantic and East Pacific basin for the current year.

library(dplyr)
library(rrricanes)
get_storms() %>% print(n = nrow(.))
## # A tibble: 33 x 4
##     Year                           Name Basin
##    <dbl>                          <chr> <chr>
##  1  2017          Tropical Storm Arlene    AL
##  2  2017            Tropical Storm Bret    AL
##  3  2017           Tropical Storm Cindy    AL
##  4  2017       Tropical Depression Four    AL
##  5  2017             Tropical Storm Don    AL
##  6  2017           Tropical Storm Emily    AL
##  7  2017             Hurricane Franklin    AL
##  8  2017                 Hurricane Gert    AL
##  9  2017               Hurricane Harvey    AL
## 10  2017 Potential Tropical Cyclone Ten    AL
## 11  2017                 Hurricane Irma    AL
## 12  2017                 Hurricane Jose    AL
## 13  2017                Hurricane Katia    AL
## 14  2017                  Hurricane Lee    AL
## 15  2017                Hurricane Maria    AL
## 16  2017          Tropical Storm Adrian    EP
## 17  2017         Tropical Storm Beatriz    EP
## 18  2017          Tropical Storm Calvin    EP
## 19  2017                 Hurricane Dora    EP
## 20  2017               Hurricane Eugene    EP
## 21  2017             Hurricane Fernanda    EP
## 22  2017            Tropical Storm Greg    EP
## 23  2017    Tropical Depression Eight-E    EP
## 24  2017               Hurricane Hilary    EP
## 25  2017                Hurricane Irwin    EP
## 26  2017   Tropical Depression Eleven-E    EP
## 27  2017            Tropical Storm Jova    EP
## 28  2017              Hurricane Kenneth    EP
## 29  2017           Tropical Storm Lidia    EP
## 30  2017                 Hurricane Otis    EP
## 31  2017                  Hurricane Max    EP
## 32  2017                Hurricane Norma    EP
## 33  2017           Tropical Storm Pilar    EP
## # ... with 1 more variables: Link <chr>

Function get_storms returns four variables:

  • Year - year of the cyclone.

  • Name - name of the cyclone.

  • Basin - basin in which the cyclone developed (AL for Atlantic, EP for East Pacific).

  • Link - URL to the cyclone’s archive page.

The variables Name and Link are the only variables that could potentially change. For example, you’ll notice a Name value of Potential Tropical Cyclone Ten. If this storm became a tropical storm then it would receive a new name and the link to the archive page would change as well.
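
If you only need particular seasons or basins, the optional parameters can be supplied directly (a minimal sketch; I am assuming both accept vectors, per the descriptions above):

# Atlantic storms only, for the 2015 through 2017 seasons
get_storms(years = 2015:2017, basins = "AL")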

For this example we will explore Hurricane Harvey.

Text Products

Current Data

Once we have identified the storms we want to retrieve we can begin working on getting the products. In the earlier discussion of the available products, recall I used abbreviations such as discus, fstadv, etc. These are the terms we will use when obtaining data.

The easiest method to getting storm data is the function get_storm_data. This function can take multiple storm archive URLs and return multiple datasets within a list.

ds <- get_storms() %>%
  filter(Name == "Hurricane Harvey") %>%
  pull(Link) %>%
  get_storm_data(products = c("discus", "fstadv"))

This process may take some time (particularly, fstadv products). This is because the NHC website allows no more than 80 connections every 10 seconds. rrricanes processes four links every half second.

rrricanes uses the dplyr progress bar to keep you informed of the status. You can turn this off by setting option dplyr.show_progress to FALSE.

An additional option is rrricanes.working_msg; FALSE by default. This option will show a message for each advisory currently being worked on. I primarily added it to help find products causing problems, but you may find it useful at some point.
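
Both options can be set in a single call at the top of a session:

# Silence the dplyr progress bar and show a message per advisory being worked on
options(dplyr.show_progress = FALSE,
        rrricanes.working_msg = TRUE)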

At this point, we have a list - ds - of dataframes. Each dataframe is named after the product.

names(ds)
## [1] "discus" "fstadv"

discus is one of the products that isn’t parsed; the full text of the product is returned.

str(ds$discus)
## Classes 'tbl_df', 'tbl' and 'data.frame':	43 obs. of  6 variables:
##  $ Status  : chr  "Potential Tropical Cyclone" "Tropical Storm" "Tropical Storm" "Tropical Storm" ...
##  $ Name    : chr  "Nine" "Harvey" "Harvey" "Harvey" ...
##  $ Adv     : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Date    : POSIXct, format: "2017-08-17 15:00:00" "2017-08-17 21:00:00" ...
##  $ Key     : chr  "AL092017" "AL092017" "AL092017" "AL092017" ...
##  $ Contents: chr  "\nZCZC MIATCDAT4 ALL\nTTAA00 KNHC DDHHMM\n\nPotential Tropical Cyclone Nine Discussion Number   1\nNWS National"| __truncated__ "\nZCZC MIATCDAT4 ALL\nTTAA00 KNHC DDHHMM\n\nTropical Storm Harvey Discussion Number   2\nNWS National Hurricane"| __truncated__ "\nZCZC MIATCDAT4 ALL\nTTAA00 KNHC DDHHMM\n\nTropical Storm Harvey Discussion Number   3\nNWS National Hurricane"| __truncated__ "\nZCZC MIATCDAT4 ALL\nTTAA00 KNHC DDHHMM\n\nTropical Storm Harvey Discussion Number   4\nNWS National Hurricane"| __truncated__ ...

The fstadv dataframes, however, are parsed and contain the bulk of the information for the storm.

str(ds$fstadv)
## Classes 'tbl_df', 'tbl' and 'data.frame':	43 obs. of  117 variables:
##  $ Status       : chr  "Potential Tropical Cyclone" "Tropical Storm" "Tropical Storm" "Tropical Storm" ...
##  $ Name         : chr  "Nine" "Harvey" "Harvey" "Harvey" ...
##  $ Adv          : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Date         : POSIXct, format: "2017-08-17 15:00:00" "2017-08-17 21:00:00" ...
##  $ Key          : chr  "AL092017" "AL092017" "AL092017" "AL092017" ...
##  $ Lat          : num  13.1 13 13 13.1 13.1 13.4 13.7 13.8 13.9 14.1 ...
##  $ Lon          : num  -54.1 -55.8 -57.4 -59.1 -61.3 -62.9 -64.1 -65.9 -68.1 -70 ...
##  $ Wind         : num  30 35 35 35 35 35 35 35 35 30 ...
##  $ Gust         : num  40 45 45 45 45 45 45 45 45 40 ...
##  $ Pressure     : num  1008 1004 1005 1004 1005 ...
##  $ PosAcc       : num  30 30 30 30 30 40 40 30 30 30 ...
##  $ FwdDir       : num  270 270 270 270 270 275 275 275 275 275 ...
##  $ FwdSpeed     : num  15 16 16 16 18 18 16 18 19 19 ...
##  $ Eye          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ SeasNE       : num  NA 60 60 60 75 150 60 60 45 NA ...
##  $ SeasSE       : num  NA 0 0 0 0 0 0 0 0 NA ...
##  $ SeasSW       : num  NA 0 0 0 0 0 0 0 0 NA ...
##  $ SeasNW       : num  NA 60 60 60 60 60 45 60 45 NA ...
##  $ NE64         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ SE64         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ SW64         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ NW64         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ NE50         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ SE50         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ SW50         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ NW50         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ NE34         : num  NA 30 50 50 60 60 60 60 0 NA ...
##  $ SE34         : num  NA 0 0 0 0 0 0 0 0 NA ...
##  $ SW34         : num  NA 0 0 0 0 0 0 0 0 NA ...
##  $ NW34         : num  NA 30 50 50 60 60 60 60 60 NA ...
##  $ Hr12FcstDate : POSIXct, format: "2017-08-18 00:00:00" "2017-08-18 06:00:00" ...
##  $ Hr12Lat      : num  13.1 13.1 13.2 13.2 13.3 13.6 14 14 14.1 14.3 ...
##  $ Hr12Lon      : num  -56.4 -58.3 -59.9 -61.7 -63.8 -65.7 -66.8 -68.7 -70.9 -73 ...
##  $ Hr12Wind     : num  30 35 35 35 35 35 35 35 35 30 ...
##  $ Hr12Gust     : num  40 45 45 45 45 45 45 45 45 40 ...
##  $ Hr12NE64     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr12SE64     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr12SW64     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr12NW64     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr12NE50     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr12SE50     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr12SW50     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr12NW50     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr12NE34     : num  NA 30 50 50 60 60 60 60 60 NA ...
##  $ Hr12SE34     : num  NA 0 0 0 0 0 0 0 0 NA ...
##  $ Hr12SW34     : num  NA 0 0 0 0 0 0 0 0 NA ...
##  $ Hr12NW34     : num  NA 30 50 50 60 60 60 60 60 NA ...
##  $ Hr24FcstDate : POSIXct, format: "2017-08-18 12:00:00" "2017-08-18 18:00:00" ...
##  $ Hr24Lat      : num  13.2 13.4 13.6 13.5 13.6 13.9 14.3 14.3 14.4 14.6 ...
##  $ Hr24Lon      : num  -59.8 -61.6 -63.3 -65.2 -67.3 -69.3 -70.4 -72.7 -74.9 -77 ...
##  $ Hr24Wind     : num  35 40 40 40 40 40 40 40 40 35 ...
##  $ Hr24Gust     : num  45 50 50 50 50 50 50 50 50 45 ...
##  $ Hr24NE64     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr24SE64     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr24SW64     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr24NW64     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr24NE50     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr24SE50     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr24SW50     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr24NW50     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr24NE34     : num  50 40 50 50 60 60 60 60 60 60 ...
##  $ Hr24SE34     : num  0 0 0 0 30 0 0 0 0 0 ...
##  $ Hr24SW34     : num  0 0 0 0 30 0 0 0 0 0 ...
##  $ Hr24NW34     : num  50 40 50 50 60 60 60 60 60 60 ...
##  $ Hr36FcstDate : POSIXct, format: "2017-08-19 00:00:00" "2017-08-19 06:00:00" ...
##  $ Hr36Lat      : num  13.5 13.7 13.9 13.9 14 14.2 14.5 14.6 14.9 15.2 ...
##  $ Hr36Lon      : num  -63.2 -65.1 -67 -68.8 -71.1 -73 -74.3 -76.7 -78.7 -80.5 ...
##  $ Hr36Wind     : num  40 45 45 40 40 40 40 45 45 40 ...
##  $ Hr36Gust     : num  50 55 55 50 50 50 50 55 55 50 ...
##  $ Hr36NE64     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr36SE64     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr36SW64     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr36NW64     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr36NE50     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr36SE50     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr36SW50     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr36NW50     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr36NE34     : num  60 60 60 60 60 60 60 70 70 60 ...
##  $ Hr36SE34     : num  0 30 30 30 30 0 0 0 0 0 ...
##  $ Hr36SW34     : num  0 30 30 30 30 0 0 0 0 0 ...
##  $ Hr36NW34     : num  60 60 60 60 60 60 60 70 70 60 ...
##  $ Hr48FcstDate : POSIXct, format: "2017-08-19 12:00:00" "2017-08-19 18:00:00" ...
##  $ Hr48Lat      : num  13.9 14 14.1 14.1 14.3 14.5 14.8 15.2 15.7 16 ...
##  $ Hr48Lon      : num  -66.7 -68.8 -70.9 -72.7 -75 -76.7 -78.1 -80.1 -82.4 -83.8 ...
##  $ Hr48Wind     : num  45 45 50 50 50 45 45 50 50 45 ...
##  $ Hr48Gust     : num  55 55 60 60 60 55 55 60 60 55 ...
##  $ Hr48NE50     : num  NA NA 30 30 30 NA NA 30 30 NA ...
##  $ Hr48SE50     : num  NA NA 0 0 0 NA NA 0 0 NA ...
##  $ Hr48SW50     : num  NA NA 0 0 0 NA NA 0 0 NA ...
##  $ Hr48NW50     : num  NA NA 30 30 30 NA NA 30 30 NA ...
##  $ Hr48NE34     : num  60 60 60 60 70 60 60 90 90 70 ...
##  $ Hr48SE34     : num  30 30 30 30 30 30 0 50 50 0 ...
##  $ Hr48SW34     : num  30 30 30 30 30 30 0 40 40 0 ...
##  $ Hr48NW34     : num  60 60 60 60 70 60 60 70 70 70 ...
##  $ Hr72FcstDate : POSIXct, format: "2017-08-20 12:00:00" "2017-08-20 18:00:00" ...
##  $ Hr72Lat      : num  14.5 14.5 14.8 15 15 15.5 16.5 17 17.5 18 ...
##  $ Hr72Lon      : num  -74.5 -76.5 -78.6 -80.2 -82 -83.5 -84.7 -86.5 -88 -89 ...
##  $ Hr72Wind     : num  55 55 60 60 60 60 55 55 60 45 ...
##  $ Hr72Gust     : num  65 65 75 75 75 75 65 65 75 55 ...
##   [list output truncated]

Each product can also be accessed on its own. For example, if you only wish to view discus products, use the get_discus function. fstadv products can be accessed with get_fstadv. Every product-specific function is prefixed with get_.

To understand the variable definitions, access the help file for each of these functions (i.e., ?get_fstadv). They contain full definitions on the variables and their purpose.
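
For instance, if the forecast/advisory data is all you need, the product-specific function can be called directly on the archive link (a short sketch mirroring the get_storm_data() call above; I am assuming the link is the only required argument):

harvey_link <- get_storms() %>%
  filter(Name == "Hurricane Harvey") %>%
  pull(Link)

# Scrape and parse only the forecast/advisory products
harvey_fstadv <- get_fstadv(harvey_link)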

As you can see, the fstadv dataframe is very wide. There may be instances where you only want to focus on specific pieces of the product. I’ve developed tidy functions to help trim these datasets:

  • tidy_fcst

  • tidy_fcst_wr

  • tidy_fstadv

  • tidy_wr

These datasets exist in rrricanesdata as fcst, fcst_wr, adv, and wr, respectively (see below).

Most tropical cyclone forecast/advisory products will contain multiple forecast points. Initially, only three-day forecasts were issued. Beginning with the 2003 season, 96- and 120-hour (five-day) forecasts were issued.

If a storm is not expected to survive the full forecast period, then only relevant forecasts will be issued.

We use tidy_fcst to return these forecast points in a tidy fashion from fstadv.

str(tidy_fcst(ds$fstadv))
## Classes 'tbl_df', 'tbl' and 'data.frame':	283 obs. of  8 variables:
##  $ Key     : chr  "AL092017" "AL092017" "AL092017" "AL092017" ...
##  $ Adv     : num  1 1 1 1 1 1 1 2 2 2 ...
##  $ Date    : POSIXct, format: "2017-08-17 15:00:00" "2017-08-17 15:00:00" ...
##  $ FcstDate: POSIXct, format: "2017-08-18 00:00:00" "2017-08-18 12:00:00" ...
##  $ Lat     : num  13.1 13.2 13.5 13.9 14.5 15.5 17 13.1 13.4 13.7 ...
##  $ Lon     : num  -56.4 -59.8 -63.2 -66.7 -74.5 -82 -87.5 -58.3 -61.6 -65.1 ...
##  $ Wind    : num  30 35 40 45 55 65 65 35 40 45 ...
##  $ Gust    : num  40 45 50 55 65 80 80 45 50 55 ...

Wind radius values are issued for thresholds of 34, 50 and 64 knots. These values are the radius within which one-minute sustained winds of at least that speed can be expected or exist.

A tropical depression will not have associated wind radius values since the maximum winds of a depression are 30 knots. If a tropical storm has winds less than 50 knots, then it will only have wind radius values for the 34-knot wind field. If winds are greater than 50 knots, then it will have wind radius values for 34 and 50 knot winds. A hurricane will have all wind radius fields.

Wind radius values are further separated by quadrant: NE (northeast), SE, SW and NW. Not all quadrants will have values, particularly if the cyclone is struggling to organize. For example, you will often find that a minimal hurricane only has hurricane-force winds (64 knots) in the northeast quadrant.

When appropriate, a forecast/advisory product will contain these values for the current position and for each forecast position. Use tidy_wr and tidy_fcst_wr, respectively, for these variables.

str(tidy_wr(ds$fstadv))
## Classes 'tbl_df', 'tbl' and 'data.frame':	56 obs. of  8 variables:
##  $ Key      : chr  "AL092017" "AL092017" "AL092017" "AL092017" ...
##  $ Adv      : num  2 3 4 5 6 7 8 9 15 16 ...
##  $ Date     : POSIXct, format: "2017-08-17 21:00:00" "2017-08-18 03:00:00" ...
##  $ WindField: num  34 34 34 34 34 34 34 34 34 34 ...
##  $ NE       : num  30 50 50 60 60 60 60 0 100 80 ...
##  $ SE       : num  0 0 0 0 0 0 0 0 0 30 ...
##  $ SW       : num  0 0 0 0 0 0 0 0 0 20 ...
##  $ NW       : num  30 50 50 60 60 60 60 60 50 50 ...
str(tidy_fcst_wr(ds$fstadv))
## Classes 'tbl_df', 'tbl' and 'data.frame':	246 obs. of  9 variables:
##  $ Key      : chr  "AL092017" "AL092017" "AL092017" "AL092017" ...
##  $ Adv      : num  1 1 1 1 1 2 2 2 2 2 ...
##  $ Date     : POSIXct, format: "2017-08-17 15:00:00" "2017-08-17 15:00:00" ...
##  $ FcstDate : POSIXct, format: "2017-08-18 12:00:00" "2017-08-19 00:00:00" ...
##  $ WindField: num  34 34 34 34 50 34 34 34 34 34 ...
##  $ NE       : num  50 60 60 80 30 30 40 60 60 80 ...
##  $ SE       : num  0 0 30 40 0 0 0 30 30 40 ...
##  $ SW       : num  0 0 30 40 0 0 0 30 30 40 ...
##  $ NW       : num  50 60 60 80 30 30 40 60 60 80 ...

Lastly, you may only want to focus on current storm details. For this, we use tidy_fstadv:

str(tidy_fstadv(ds$fstadv))
## Classes 'tbl_df', 'tbl' and 'data.frame':	43 obs. of  18 variables:
##  $ Key     : chr  "AL092017" "AL092017" "AL092017" "AL092017" ...
##  $ Adv     : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Date    : POSIXct, format: "2017-08-17 15:00:00" "2017-08-17 21:00:00" ...
##  $ Status  : chr  "Potential Tropical Cyclone" "Tropical Storm" "Tropical Storm" "Tropical Storm" ...
##  $ Name    : chr  "Nine" "Harvey" "Harvey" "Harvey" ...
##  $ Lat     : num  13.1 13 13 13.1 13.1 13.4 13.7 13.8 13.9 14.1 ...
##  $ Lon     : num  -54.1 -55.8 -57.4 -59.1 -61.3 -62.9 -64.1 -65.9 -68.1 -70 ...
##  $ Wind    : num  30 35 35 35 35 35 35 35 35 30 ...
##  $ Gust    : num  40 45 45 45 45 45 45 45 45 40 ...
##  $ Pressure: num  1008 1004 1005 1004 1005 ...
##  $ PosAcc  : num  30 30 30 30 30 40 40 30 30 30 ...
##  $ FwdDir  : num  270 270 270 270 270 275 275 275 275 275 ...
##  $ FwdSpeed: num  15 16 16 16 18 18 16 18 19 19 ...
##  $ Eye     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ SeasNE  : num  NA 60 60 60 75 150 60 60 45 NA ...
##  $ SeasSE  : num  NA 0 0 0 0 0 0 0 0 NA ...
##  $ SeasSW  : num  NA 0 0 0 0 0 0 0 0 NA ...
##  $ SeasNW  : num  NA 60 60 60 60 60 45 60 45 NA ...

In release 0.2.1, tidy_fstadv will be renamed to tidy_adv.

One final note on the data: all speed variables are measured in knots, distance variables in nautical miles, and pressure variables in millibars. Functions knots_to_mph and mb_to_in are available for speed/pressure conversions. Function nm_to_sm to convert nautical miles to survey miles will be included in release 0.2.1.
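
For example, the current-position values in the fstadv dataset can be converted like so (a quick sketch using the helpers named above, each assumed to take a numeric vector):

# Maximum sustained winds in miles per hour, and pressure in inches of mercury
knots_to_mph(ds$fstadv$Wind)
mb_to_in(ds$fstadv$Pressure)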

Archived Data

rrricanesdata was built to make it easier to get pre-processed datasets. As mentioned earlier, rrricanesdata will be updated the first of every month if any advisory was issued for the previous month. (As I am now writing this portion in September, all of Hurricane Harvey’s advisories - the last one issued the morning of August 31 - exist in rrricanesdata release 0.0.1.4.)

As with rrricanes, rrricanesdata is not available on CRAN (nor will it be, due to size limitations).

I’ll load all datasets with the call:

library(rrricanesdata)
data(list = data(package = "rrricanesdata")$results[,3])

All core product datasets are available. The dataframes adv, fcst, fcst_wr and wr are the dataframes created by tidy_fstadv, tidy_fcst, tidy_fcst_wr and tidy_wr, respectively.

Tracking Charts

rrricanes also comes with helper functions to quickly generate tracking charts. These charts use rnaturalearthdata (for high-resolution maps, use package rnaturalearthhires). These charts are not required - Bob Rudis demonstrates this succinctly - so feel free to experiment.

You can generate a default plot for the entire globe with tracking_chart:

tracking_chart()

You may find this handy when examining cyclones that cross basins (from the Atlantic to east Pacific such as Hurricane Otto, 2016).

tracking_chart takes three parameters (in addition to dots for other ggplot calls):

  • countries - By default, show country borders

  • states - By default, show state borders

  • res - resolution; default is 110nm.

We do not see countries and states in the map above because of the ggplot defaults. Let’s try it again:

tracking_chart(color = "black", size = 0.1, fill = "white")

We can “zoom in” on each basin with helper functions al_tracking_chart and ep_tracking_chart:

al_tracking_chart(color = "black", size = 0.1, fill = "white")

ep_tracking_chart(color = "black", size = 0.1, fill = "white")

GIS Data

GIS data exists for some cyclones and varies by year. This is a relatively new archive by the NHC and is inconsistent from storm to storm.

The “gis” functions are as follows:

  • gis_advisory

  • gis_latest

  • gis_prob_storm_surge

  • gis_windfield

  • gis_wsp

Another area of inconsistency with these products is how they are organized. For example, gis_advisory, gis_prob_storm_surge and gis_windfield can be retrieved with a storm Key (unique identifier for every cyclone; see fstadv$Key). Except for gis_prob_storm_surge, you can even pass an advisory number (see fstadv$Adv).

gis_wsp requires a datetime value; to access a specific GIS package for a storm’s advisory you would need to use a variable such as fstadv$Date, subtract three hours and convert to “%Y%m%d%H” format (“%m”, “%d”, and “%H” are optional).
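
A sketch of that datetime construction (the datetime parameter name for gis_wsp is my assumption):

# Take the Date of advisory 19, step back three hours, and format as %Y%m%d%H
adv_date <- ds$fstadv %>% filter(Adv == 19) %>% pull(Date)
gis_wsp(datetime = format(adv_date - 3 * 3600, "%Y%m%d%H"))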

All of the above functions only return URLs to their respective datasets. This was done to allow you to validate the quantity of datasets you wish to retrieve, as, in some cases, the dataset may not exist at all or there may be several available. Use gis_download with the requested URLs to retrieve your datasets.

Let’s go through each of these. First, let’s get the Key of Hurricane Harvey:

# Remember that ds already and only contains data for Hurricane Harvey
key <- ds$fstadv %>% pull(Key) %>% first()

gis_advisory

gis_advisory returns a dataset package containing past and forecast plot points and lines, a forecast cone (area representing where the cyclone could track), wind radius data and current watches and warnings.

gis_advisory takes two parameters:

  • Key

  • advisory (optional)

If we leave out advisory we get all related datasets for Hurricane Harvey:

x <- gis_advisory(key = key)
length(x)
## [1] 77
head(x, n = 5L)
## [1] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_001.zip"
## [2] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_001A.zip"
## [3] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_002.zip"
## [4] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_002A.zip"
## [5] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_003.zip"

As you can see, there is quite a bit (which is why the core gis functions only return URLs rather than the actual datasets). Let’s trim this down a bit. Sneaking a peek (cheating), I find advisory 19 seems a good choice.

gis_advisory(key = key, advisory = 19)
## [1] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_019.zip"

Good; there is a data package available for this advisory. Once you have confirmed the package you want to retrieve, use gis_download to get the data.

gis <- gis_advisory(key = key, advisory = 19) %>%
  gis_download()
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017-019_5day_lin"
## with 1 features
## It has 7 fields
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017-019_5day_pgn"
## with 1 features
## It has 7 fields
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017-019_5day_pts"
## with 8 features
## It has 23 fields
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017-019_ww_wwlin"
## with 5 features
## It has 8 fields

Let’s see what we have.

str(gis)
## List of 4
##  $ al092017_019_5day_lin:Formal class 'SpatialLinesDataFrame' [package "sp"] with 4 slots
##   .. ..@ data       :'data.frame':	1 obs. of  7 variables:
##   .. .. ..$ STORMNAME: chr "Harvey"
##   .. .. ..$ STORMTYPE: chr "HU"
##   .. .. ..$ ADVDATE  : chr "1000 PM CDT Thu Aug 24 2017"
##   .. .. ..$ ADVISNUM : chr "19"
##   .. .. ..$ STORMNUM : num 9
##   .. .. ..$ FCSTPRD  : num 120
##   .. .. ..$ BASIN    : chr "AL"
##   .. ..@ lines      :List of 1
##   .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots
##   .. .. .. .. ..@ Lines:List of 1
##   .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot
##   .. .. .. .. .. .. .. ..@ coords: num [1:8, 1:2] -94.6 -95.6 -96.5 -97.1 -97.3 -97.5 -97 -95 25.2 26.1 ...
##   .. .. .. .. ..@ ID   : chr "0"
##   .. ..@ bbox       : num [1:2, 1:2] -97.5 25.2 -94.6 29.5
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "x" "y"
##   .. .. .. ..$ : chr [1:2] "min" "max"
##   .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
##   .. .. .. ..@ projargs: chr "+proj=longlat +a=6371200 +b=6371200 +no_defs"
##  $ al092017_019_5day_pgn:Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
##   .. ..@ data       :'data.frame':	1 obs. of  7 variables:
##   .. .. ..$ STORMNAME: chr "Harvey"
##   .. .. ..$ STORMTYPE: chr "HU"
##   .. .. ..$ ADVDATE  : chr "1000 PM CDT Thu Aug 24 2017"
##   .. .. ..$ ADVISNUM : chr "19"
##   .. .. ..$ STORMNUM : num 9
##   .. .. ..$ FCSTPRD  : num 120
##   .. .. ..$ BASIN    : chr "AL"
##   .. ..@ polygons   :List of 1
##   .. .. ..$ :Formal class 'Polygons' [package "sp"] with 5 slots
##   .. .. .. .. ..@ Polygons :List of 1
##   .. .. .. .. .. ..$ :Formal class 'Polygon' [package "sp"] with 5 slots
##   .. .. .. .. .. .. .. ..@ labpt  : num [1:2] -95.4 29.2
##   .. .. .. .. .. .. .. ..@ area   : num 51.7
##   .. .. .. .. .. .. .. ..@ hole   : logi FALSE
##   .. .. .. .. .. .. .. ..@ ringDir: int 1
##   .. .. .. .. .. .. .. ..@ coords : num [1:1482, 1:2] -94.6 -94.7 -94.7 -94.7 -94.7 ...
##   .. .. .. .. ..@ plotOrder: int 1
##   .. .. .. .. ..@ labpt    : num [1:2] -95.4 29.2
##   .. .. .. .. ..@ ID       : chr "0"
##   .. .. .. .. ..@ area     : num 51.7
##   .. ..@ plotOrder  : int 1
##   .. ..@ bbox       : num [1:2, 1:2] -100 24.9 -91 33
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "x" "y"
##   .. .. .. ..$ : chr [1:2] "min" "max"
##   .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
##   .. .. .. ..@ projargs: chr "+proj=longlat +a=6371200 +b=6371200 +no_defs"
##  $ al092017_019_5day_pts:Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
##   .. ..@ data       :'data.frame':	8 obs. of  23 variables:
##   .. .. ..$ ADVDATE  : chr [1:8] "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" ...
##   .. .. ..$ ADVISNUM : chr [1:8] "19" "19" "19" "19" ...
##   .. .. ..$ BASIN    : chr [1:8] "AL" "AL" "AL" "AL" ...
##   .. .. ..$ DATELBL  : chr [1:8] "10:00 PM Thu" "7:00 AM Fri" "7:00 PM Fri" "7:00 AM Sat" ...
##   .. .. ..$ DVLBL    : chr [1:8] "H" "H" "M" "M" ...
##   .. .. ..$ FCSTPRD  : num [1:8] 120 120 120 120 120 120 120 120
##   .. .. ..$ FLDATELBL: chr [1:8] "2017-08-24 7:00 PM Thu CDT" "2017-08-25 7:00 AM Fri CDT" "2017-08-25 7:00 PM Fri CDT" "2017-08-26 7:00 AM Sat CDT" ...
##   .. .. ..$ GUST     : num [1:8] 90 115 135 120 85 45 45 45
##   .. .. ..$ LAT      : num [1:8] 25.2 26.1 27.2 28.1 28.6 28.5 28.5 29.5
##   .. .. ..$ LON      : num [1:8] -94.6 -95.6 -96.5 -97.1 -97.3 -97.5 -97 -95
##   .. .. ..$ MAXWIND  : num [1:8] 75 95 110 100 70 35 35 35
##   .. .. ..$ MSLP     : num [1:8] 973 9999 9999 9999 9999 ...
##   .. .. ..$ SSNUM    : num [1:8] 1 2 3 3 1 0 0 0
##   .. .. ..$ STORMNAME: chr [1:8] "Hurricane Harvey" "Hurricane Harvey" "Hurricane Harvey" "Hurricane Harvey" ...
##   .. .. ..$ STORMNUM : num [1:8] 9 9 9 9 9 9 9 9
##   .. .. ..$ STORMSRC : chr [1:8] "Tropical Cyclone" "Tropical Cyclone" "Tropical Cyclone" "Tropical Cyclone" ...
##   .. .. ..$ STORMTYPE: chr [1:8] "HU" "HU" "MH" "MH" ...
##   .. .. ..$ TCDVLP   : chr [1:8] "Hurricane" "Hurricane" "Major Hurricane" "Major Hurricane" ...
##   .. .. ..$ TAU      : num [1:8] 0 12 24 36 48 72 96 120
##   .. .. ..$ TCDIR    : num [1:8] 315 9999 9999 9999 9999 ...
##   .. .. ..$ TCSPD    : num [1:8] 9 9999 9999 9999 9999 ...
##   .. .. ..$ TIMEZONE : chr [1:8] "CDT" "CDT" "CDT" "CDT" ...
##   .. .. ..$ VALIDTIME: chr [1:8] "25/0000" "25/1200" "26/0000" "26/1200" ...
##   .. ..@ coords.nrs : num(0)
##   .. ..@ coords     : num [1:8, 1:2] -94.6 -95.6 -96.5 -97.1 -97.3 -97.5 -97 -95 25.2 26.1 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : chr [1:2] "coords.x1" "coords.x2"
##   .. ..@ bbox       : num [1:2, 1:2] -97.5 25.2 -94.6 29.5
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "coords.x1" "coords.x2"
##   .. .. .. ..$ : chr [1:2] "min" "max"
##   .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
##   .. .. .. ..@ projargs: chr "+proj=longlat +a=6371200 +b=6371200 +no_defs"
##  $ al092017_019_ww_wwlin:Formal class 'SpatialLinesDataFrame' [package "sp"] with 4 slots
##   .. ..@ data       :'data.frame':	5 obs. of  8 variables:
##   .. .. ..$ STORMNAME: chr [1:5] "Harvey" "Harvey" "Harvey" "Harvey" ...
##   .. .. ..$ STORMTYPE: chr [1:5] "HU" "HU" "HU" "HU" ...
##   .. .. ..$ ADVDATE  : chr [1:5] "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" ...
##   .. .. ..$ ADVISNUM : chr [1:5] "19" "19" "19" "19" ...
##   .. .. ..$ STORMNUM : num [1:5] 9 9 9 9 9
##   .. .. ..$ FCSTPRD  : num [1:5] 120 120 120 120 120
##   .. .. ..$ BASIN    : chr [1:5] "AL" "AL" "AL" "AL" ...
##   .. .. ..$ TCWW     : chr [1:5] "TWA" "HWA" "TWR" "TWR" ...
##   .. ..@ lines      :List of 5
##   .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots
##   .. .. .. .. ..@ Lines:List of 1
##   .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot
##   .. .. .. .. .. .. .. ..@ coords: num [1:3, 1:2] -97.7 -97.4 -97.2 24.3 25.2 ...
##   .. .. .. .. ..@ ID   : chr "0"
##   .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots
##   .. .. .. .. ..@ Lines:List of 1
##   .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot
##   .. .. .. .. .. .. .. ..@ coords: num [1:3, 1:2] -97.2 -97.2 -97.3 26 26.1 ...
##   .. .. .. .. ..@ ID   : chr "1"
##   .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots
##   .. .. .. .. ..@ Lines:List of 1
##   .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot
##   .. .. .. .. .. .. .. ..@ coords: num [1:3, 1:2] -97.2 -97.2 -97.3 26 26.1 ...
##   .. .. .. .. ..@ ID   : chr "2"
##   .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots
##   .. .. .. .. ..@ Lines:List of 1
##   .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot
##   .. .. .. .. .. .. .. ..@ coords: num [1:10, 1:2] -95.6 -95.3 -95.1 -95.1 -94.8 ...
##   .. .. .. .. ..@ ID   : chr "3"
##   .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots
##   .. .. .. .. ..@ Lines:List of 1
##   .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot
##   .. .. .. .. .. .. .. ..@ coords: num [1:16, 1:2] -97.3 -97.3 -97.3 -97.4 -97.4 ...
##   .. .. .. .. ..@ ID   : chr "4"
##   .. ..@ bbox       : num [1:2, 1:2] -97.7 24.3 -94.4 29.8
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "x" "y"
##   .. .. .. ..$ : chr [1:2] "min" "max"
##   .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
##   .. .. .. ..@ projargs: chr "+proj=longlat +a=6371200 +b=6371200 +no_defs"

We get four spatial dataframes: two of lines, one of polygons and one of points.

names(gis)
## [1] "al092017_019_5day_lin" "al092017_019_5day_pgn" "al092017_019_5day_pts"
## [4] "al092017_019_ww_wwlin"

With the exception of point spatial dataframes (which can be converted to a dataframe using tibble::as_data_frame), use the helper function shp_to_df to convert the spatial dataframes to regular dataframes.
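
For example, a minimal sketch of both conversions, using the dataset names from above:

# lines and polygons: use the helper shp_to_df
lin_df <- shp_to_df(gis$al092017_019_5day_lin)
# points: a plain tibble::as_data_frame conversion is enough
pts_df <- tibble::as_data_frame(gis$al092017_019_5day_pts)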

Forecast Track

library(ggplot2)
al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_path(data = shp_to_df(gis$al092017_019_5day_lin), aes(x = long, y = lat))

Use geom_path instead of geom_line to keep the positions in order.

You can “zoom in” even further using ggplot2::coord_equal. For that, we need to know the limits of our objects (minimum and maximum latitude and longitude) or bounding box. Thankfully, the sp package can get us this information with the bbox function.

But we don't want to use the "al092017_019_5day_lin" dataset for this; our gis list contains a forecast cone which extends well beyond the lines dataset. Take a look:

sp::bbox(gis$al092017_019_5day_lin)
##     min   max
## x -97.5 -94.6
## y  25.2  29.5
sp::bbox(gis$al092017_019_5day_pgn)
##          min       max
## x -100.01842 -90.96327
## y   24.86433  33.01644

So, let’s get the bounding box of our forecast cone dataset and zoom in on our map.

bb <- sp::bbox(gis$al092017_019_5day_pgn)
al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_path(data = shp_to_df(gis$al092017_019_5day_lin),
            aes(x = long, y = lat)) +
  coord_equal(xlim = c(bb[1,1], bb[1,2]),
              ylim = c(bb[2,1], bb[2,2]))

That’s much better. For simplicity I’m going to save the base map, bp, without the line plot.

bp <- al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  coord_equal(xlim = c(bb[1,1], bb[1,2]),
              ylim = c(bb[2,1], bb[2,2]))

Forecast Points

Forecast points identify each forecast position along with forecast winds and the date. Remember that for point spatial dataframes you use tibble::as_data_frame rather than shp_to_df.

bp +
  geom_point(data = tibble::as_data_frame(gis$al092017_019_5day_pts),
             aes(x = long, y = lat))

If you ran the code above you would get an error.

Error in FUN(X[[i]], ...) : object 'long' not found

Why? The variable long does not exist in this dataset as it does in the other GIS datasets; here the coordinates are stored as LAT and LON, and all the variable names are uppercase. This is one of the inconsistencies I was referring to previously.

names(gis$al092017_019_5day_pts)
##  [1] "ADVDATE"   "ADVISNUM"  "BASIN"     "DATELBL"   "DVLBL"
##  [6] "FCSTPRD"   "FLDATELBL" "GUST"      "LAT"       "LON"
## [11] "MAXWIND"   "MSLP"      "SSNUM"     "STORMNAME" "STORMNUM"
## [16] "STORMSRC"  "STORMTYPE" "TCDVLP"    "TAU"       "TCDIR"
## [21] "TCSPD"     "TIMEZONE"  "VALIDTIME"

Let’s try it again.

bp +
  geom_point(data = tibble::as_data_frame(gis$al092017_019_5day_pts),
             aes(x = LON, y = LAT))

Better.

Forecast Cone

A forecast cone identifies the probability of error in a forecast. Forecasting tropical cyclones is tricky business - errors increase the further out a forecast is issued. Theoretically, any area within a forecast cone is at risk of seeing cyclone conditions within the given period of time.

Generally, a forecast cone package contains two subsets: 72-hour forecast cone and 120-hour forecast cone. This is identified in the dataset under the variable FCSTPRD. Let’s take a look at the 72-hour forecast period:

bp +
  geom_polygon(data = shp_to_df(gis$al092017_019_5day_pgn) %>%
                 filter(FCSTPRD == 72),
               aes(x = long, y = lat, color = FCSTPRD))

Nothing there!

As mentioned earlier, these are experimental products issued by the NHC and they do contain inconsistencies. To demonstrate, I’ll use Hurricane Ike advisory 42.

df <- gis_advisory(key = "AL092008", advisory = 42) %>%
  gis_download()
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092008.042_5day_lin"
## with 2 features
## It has 9 fields
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092008.042_5day_pgn"
## with 2 features
## It has 9 fields
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092008.042_5day_pts"
## with 13 features
## It has 20 fields
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092008.042_ww_wwlin"
## with 5 features
## It has 10 fields
al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_polygon(data = shp_to_df(df$al092008_042_5day_pgn) %>%
                 filter(FCSTPRD == 72),
                  aes(x = long, y = lat))

We do, however, have a 120-hour forecast cone for Hurricane Harvey. Let’s go ahead and plot that.

bp +
  geom_polygon(data = gis$al092017_019_5day_pgn,
               aes(x = long, y = lat), alpha = 0.15)

It’s an odd-looking forecast cone, for sure. But this demonstrates the entire area that Harvey could have potentially traveled.

Watches and Warnings

Our last dataset in this package is "al092017_019_ww_wwlin". These are the current watches and warnings in effect. This is a spatial lines dataframe that needs shp_to_df. Again, we use geom_path instead of geom_line, and we want to group our paths by the variable TCWW.

bp +
  geom_path(data = shp_to_df(gis$al092017_019_ww_wwlin),
            aes(x = long, y = lat, group = group, color = TCWW), size = 1)

The paths won’t follow our coastlines exactly but you get the idea. The abbreviations don’t really give much information, either. Convert TCWW to factor and provide better labels for your legend.

ww_wlin <- shp_to_df(gis$al092017_019_ww_wwlin)
ww_wlin$TCWW <- factor(ww_wlin$TCWW,
                              levels = c("TWA", "TWR", "HWA", "HWR"),
                              labels = c("Tropical Storm Watch","Tropical Storm Warning","Hurricane Watch","Hurricane Warning"))

bp +
  geom_path(data = ww_wlin,
            aes(x = long, y = lat, group = group, color = TCWW), size = 1)

See Forecast/Advisory GIS on the rrricanes website for an example of putting all of this data together in one map.

gis_prob_storm_surge

We can also plot the probabilistic storm surge for given locations. Again, you will need the storm Key for this function. There are two additional parameters:

  • products

  • datetime

products can be one or both of “esurge” and “psurge”. esurge shows the probability of the cyclone exceeding the given storm surge plus tide within a given forecast period. psurge shows the probability of a given storm surge within a specified forecast period.

One or more products may not exist depending on the cyclone and advisory.

The products parameter expects a list of values for each product. For esurge products, valid values are 10, 20, 30, 40 or 50. For psurge products, valid values are 0, 1, 2, …, 20.

Let’s see if any esurge products exist for Harvey.

length(gis_prob_storm_surge(key = key,
                            products = list("esurge" = seq(10, 50, by = 10))))
## [1] 150

And psurge:

length(gis_prob_storm_surge(key = key, products = list("psurge" = 0:20)))
## [1] 630

So, we have access to a ton of data here. When discussing gis_advisory, we were able to filter by advisory number. With gis_prob_storm_surge, this is not an option; we have to use the datetime parameter to filter. Let’s find the Date for advisory 19.

(d <- ds$fstadv %>% filter(Adv == 19) %>% pull(Date))
## [1] "2017-08-25 03:00:00 UTC"

esurge

Now, let's view all esurge products for the date only (excluding the time).

gis_prob_storm_surge(key = key,
                     products = list("esurge" = seq(10, 50, by = 10)),
                     datetime = strftime(d, "%Y%m%d", tz = "UTC"))
##  [1] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge10_2017082500.zip"
##  [2] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge10_2017082506.zip"
##  [3] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge10_2017082512.zip"
##  [4] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge10_2017082518.zip"
##  [5] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge20_2017082500.zip"
##  [6] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge20_2017082506.zip"
##  [7] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge20_2017082512.zip"
##  [8] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge20_2017082518.zip"
##  [9] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge30_2017082500.zip"
## [10] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge30_2017082506.zip"
## [11] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge30_2017082512.zip"
## [12] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge30_2017082518.zip"
## [13] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge40_2017082500.zip"
## [14] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge40_2017082506.zip"
## [15] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge40_2017082512.zip"
## [16] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge40_2017082518.zip"
## [17] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge50_2017082500.zip"
## [18] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge50_2017082506.zip"
## [19] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge50_2017082512.zip"
## [20] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge50_2017082518.zip"

That's still quite a bit. We can narrow it down further by adding the hour to the datetime parameter.

gis_prob_storm_surge(key = key,
                     products = list("esurge" = seq(10, 50, by = 10)),
                     datetime = strftime(d, "%Y%m%d%H", tz = "UTC"))

This call will give you an error:

Error: No data available for requested storm/advisory

But this isn't entirely correct. When an advisory package is issued, it contains information for the release time. Some of the GIS datasets are based on the release time minus 3 hours, so we need to subtract 3 hours from d.

Note: There is an additional value, not extracted as of the latest release, that records the position of the cyclone three hours prior. (As I understand it from the NHC, this is due to the time it takes to collect and prepare the data.) Per Issue #102, these values will be added in release 0.2.1. Once available, instead of subtracting three hours from the Date variable, you can simply use the PrevPosDate value for this function.
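
Once PrevPosDate is available, the call could look something like the sketch below (hypothetical until release 0.2.1, as the column is not yet extracted):

# Hypothetical: assumes fstadv gains a PrevPosDate column per Issue #102
prev <- ds$fstadv %>% filter(Adv == 19) %>% pull(PrevPosDate)
gis_prob_storm_surge(key = key,
                     products = list("esurge" = seq(10, 50, by = 10)),
                     datetime = strftime(prev, "%Y%m%d%H", tz = "UTC"))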

For now, let's try it again, subtracting the three hours ourselves:

gis_prob_storm_surge(key = key,
                     products = list("esurge" = seq(10, 50, by = 10)),
                     datetime = strftime(d - 60 * 60 * 3, "%Y%m%d%H",
                                         tz = "UTC"))
## [1] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge10_2017082500.zip"
## [2] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge20_2017082500.zip"
## [3] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge30_2017082500.zip"
## [4] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge40_2017082500.zip"
## [5] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge50_2017082500.zip"

As I don’t want to get all of these datasets, I’ll limit my esurge to show surge values with at least a 50% chance of being exceeded:

gis <- gis_prob_storm_surge(key = key,
                            products = list("esurge" = 50),
                            datetime = strftime(d - 60 * 60 * 3, "%Y%m%d%H",
                                                tz = "UTC")) %>%
  gis_download()
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017_2017082500_e50"
## with 97 features
## It has 2 fields

This will bring us a spatial polygon dataframe.

df <- shp_to_df(gis$al092017_2017082500_e50)
bb <- sp::bbox(gis$al092017_2017082500_e50)
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame':	161313 obs. of  9 variables:
##  $ long   : num  -93.2 -93.2 -93.2 -93.2 -93.2 ...
##  $ lat    : num  29.9 29.9 29.9 29.9 29.9 ...
##  $ order  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hole   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ piece  : Factor w/ 349 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ group  : Factor w/ 7909 levels "0.1","0.2","0.3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ id     : chr  "0" "0" "0" "0" ...
##  $ POINTID: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ TCSRG50: num  0 0 0 0 0 0 0 0 0 0 ...
al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_polygon(data = df,
            aes(x = long, y = lat, group = group, color = TCSRG50)) +
  coord_equal(xlim = c(bb[1,1], bb[1,2]),
              ylim = c(bb[2,1], bb[2,2]))


This plot tells us that, along the central Texas coast, the expected storm surge along with tides is greater than 7.5 feet and there is a 50% chance of this height being exceeded.

psurge

The psurge product gives us the probabilistic storm surge for a location within the given forecast period.

gis <- gis_prob_storm_surge(key = key,
                            products = list("psurge" = 20),
                            datetime = strftime(d - 60 * 60 * 3, "%Y%m%d%H",
                                                tz = "UTC")) %>%
  gis_download()
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017_2017082500_gt20"
## with 12 features
## It has 2 fields

This will bring us a spatial polygon dataframe.

df <- shp_to_df(gis$al092017_2017082500_gt20)
bb <- sp::bbox(gis$al092017_2017082500_gt20)
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame':	3293 obs. of  9 variables:
##  $ long     : num  -96.8 -96.8 -96.8 -96.8 -96.8 ...
##  $ lat      : num  28.5 28.5 28.4 28.4 28.4 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hole     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ piece    : Factor w/ 54 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ group    : Factor w/ 227 levels "0.1","0.2","0.3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ id       : chr  "0" "0" "0" "0" ...
##  $ POINTID  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ PSurge20c: num  1 1 1 1 1 1 1 1 1 1 ...
al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_polygon(data = df,
            aes(x = long, y = lat, group = group, color = PSurge20c)) +
  coord_equal(xlim = c(bb[1,1], bb[1,2]),
              ylim = c(bb[2,1], bb[2,2]))


This map shows the cumulative probability that a storm surge of greater than 20 feet will be seen within the highlighted regions.

This particular map doesn't help much as we've zoomed in too far. What may be useful is a list of probability stations as obtained from the NHC. For this, you can use al_prblty_stations (ep_prblty_stations returns FALSE since, as of this writing, the format is invalid).

stations <- al_prblty_stations()

al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_polygon(data = df,
            aes(x = long, y = lat, group = group, color = PSurge20c)) +
  geom_label(data = stations, aes(x = Lon, y = Lat, label = Location)) +
  coord_equal(xlim = c(bb[1,1], bb[1,2]),
              ylim = c(bb[2,1], bb[2,2]))


gis_windfield

For some storms, there may also be a cyclone wind radius dataset for the current and forecast positions. With this function we can go back to using the Key and an advisory number.

gis <- gis_windfield(key = key, advisory = 19) %>%
  gis_download()
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017_2017082503_forecastradii"
## with 15 features
## It has 13 fields
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017_2017082503_initialradii"
## with 3 features
## It has 13 fields
names(gis)
## [1] "al092017_2017082503_forecastradii" "al092017_2017082503_initialradii"

Let’s get the bounding box and plot our initialradii dataset.

bb <- sp::bbox(gis$al092017_2017082503_initialradii)

al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_polygon(data = shp_to_df(gis$al092017_2017082503_initialradii),
            aes(x = long, y = lat, group = group, fill = factor(RADII)),
            alpha = 0.5) +
  coord_equal(xlim = c(bb[1,1], bb[1,2]),
              ylim = c(bb[2,1], bb[2,2]))


And add the forecast wind radii data onto the chart (modifying bb):

bb <- sp::bbox(gis$al092017_2017082503_forecastradii)

al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_polygon(data = shp_to_df(gis$al092017_2017082503_initialradii),
            aes(x = long, y = lat, group = group, fill = factor(RADII)),
            alpha = 0.5) +
  geom_polygon(data = shp_to_df(gis$al092017_2017082503_forecastradii),
               aes(x = long, y = lat, group = group, fill = factor(RADII)),
               alpha = 0.5) +
  geom_label(data = stations, aes(x = Lon, y = Lat, label = Location)) +
  coord_equal(xlim = c(bb[1,1], bb[1,2]),
              ylim = c(bb[2,1], bb[2,2]))


gis_wsp

Our last GIS dataset is wind speed probabilities. This dataset is neither storm-specific nor basin-specific; you may get results for cyclones halfway across the world.

The two parameters needed are:

  • datetime - again, using the %Y%m%d%H format (not all values are required)

  • res - Resolution of the probabilities; 5 km, 0.5 degrees or 0.1 degrees.

Wind fields are for 34, 50 and 64 knots. Not all resolutions or windfields will be available at a given time.

Sticking with our variable d, let’s first make sure there is a dataset that exists for that time.

gis_wsp(datetime = strftime(d - 60 * 60 * 3, format = "%Y%m%d%H", tz = "UTC"))
## [1] "http://www.nhc.noaa.gov/gis/"

For this article, we'll stick to the higher-resolution 5 km product.

We need a temporarily fixed function to replace gis_wsp(); the function will be fixed in the package soon:

gis_wsp_2 <- function(datetime, res = c(5, 0.5, 0.1)) {
  # Convert the numeric resolutions to the strings used in the NHC file names
  res <- as.character(res)
  res <- stringr::str_replace(res, "^5$", "5km")
  res <- stringr::str_replace(res, "^0.5$", "halfDeg")
  res <- stringr::str_replace(res, "^0.1$", "tenthDeg")
  year <- stringr::str_sub(datetime, 0L, 4L)
  # Scrape the wind speed probability archive page for that year
  request <- httr::GET("http://www.nhc.noaa.gov/gis/archive_wsp.php",
                       query = list(year = year))
  contents <- httr::content(request, as = "parsed", encoding = "UTF-8")
  # Keep only the links that point at zip files
  ds <- rvest::html_nodes(contents, xpath = "//a") %>% rvest::html_attr("href") %>%
    stringr::str_extract(".+\\.zip$") %>% .[stats::complete.cases(.)]
  # A partial datetime (e.g. date only) matches any trailing digits
  if (nchar(datetime) < 10) {
    ptn_datetime <- paste0(datetime, "[:digit:]+")
  } else {
    ptn_datetime <- datetime
  }
  ptn_res <- paste(res, collapse = "|")
  ptn <- sprintf("%s_wsp_[:digit:]{1,3}hr(%s)", ptn_datetime,
                 ptn_res)
  # Filter to the requested datetime/resolution and build the full URLs
  links <- ds[stringr::str_detect(ds, ptn)]
  links <- paste0("http://www.nhc.noaa.gov/gis/", links)
  return(links)
}
gis <- gis_wsp_2(
  datetime = strftime(d - 60 * 60 * 3, format = "%Y%m%d%H", tz = "UTC"),
  res = 5) %>%
  gis_download()
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "2017082500_wsp34knt120hr_5km"
## with 11 features
## It has 1 fields
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "2017082500_wsp50knt120hr_5km"
## with 11 features
## It has 1 fields
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "2017082500_wsp64knt120hr_5km"
## with 11 features
## It has 1 fields

All of these datasets are spatial polygon dataframes. Again, we will need to convert to dataframe using shp_to_df.

bb <- sp::bbox(gis$`2017082500_wsp34knt120hr_5km`)

Examine the structure.

df <- shp_to_df(gis$`2017082500_wsp34knt120hr_5km`)
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame':	24182 obs. of  8 variables:
##  $ long      : num  -97.2 -97.2 -97.2 -97.3 -97.3 ...
##  $ lat       : num  20.3 20.3 20.3 20.3 20.4 ...
##  $ order     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hole      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ piece     : Factor w/ 8 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ group     : Factor w/ 52 levels "0.1","0.2","0.3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ id        : chr  "0" "0" "0" "0" ...
##  $ PERCENTAGE: chr  "<5%" "<5%" "<5%" "<5%" ...
al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_polygon(data = df,
               aes(x = long, y = lat, group = group, fill = PERCENTAGE),
               alpha = 0.25) +
  coord_equal(xlim = c(bb[1,1], bb[1,2]),
              ylim = c(bb[2,1], bb[2,2]))


There aren’t many ways we can narrow this down other than using arbitrary longitude values. The observations in the dataset do not have a variable identifying which storm each set of values belongs to. So, I’ll remove the coord_equal call so we’re only focused on the Atlantic basin.

al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_polygon(data = df,
               aes(x = long, y = lat, group = group, fill = PERCENTAGE),
               alpha = 0.25)


Of course, you can narrow it down further as you see fit.

Do not confuse this GIS dataset with the wndprb product or similar prblty products, both of which only identify probabilities for given locations.

gis_latest

For active cyclones, you can retrieve all available GIS datasets using gis_latest. Note that, unlike the previous GIS functions, which only return URLs, this function returns a list of all available GIS dataframes.

gis <- gis_latest()

Now we have a large list of GIS spatial dataframes. Two things to point out here: first, we now have a "windswath" GIS dataset. This dataset, to the best of my knowledge, does not exist on its own; therefore, no archived "windswath" datasets are available.

Second, I have found this data fluctuates from minute to minute. Earlier this year, when attempting to develop automated reporting, I found the return value would vary with almost every call.

Of course, that doesn't mean it is not valuable, which is why it has been included. You can easily check for the specific data you are looking for; if it doesn't exist, bail and try again in a few minutes.
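
As a minimal sketch (the layer name pattern here is an assumption), you could check the returned names for the dataset you need and retry later if it is missing:

gis <- gis_latest()
# look for a wind swath layer among the returned datasets
if (!any(grepl("windswath", names(gis)))) {
  Sys.sleep(60 * 5)   # wait a few minutes...
  gis <- gis_latest() # ...and try again
}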

Potential Issues Using rrricanes

I cannot stress enough that rrricanes is not intended for use during emergency situations, as I myself learned during Hurricane Harvey. The package currently relies on the NHC website which, I truly believe, is curated by hand.

The most common problems I’ve noticed are:

  • The NHC website being unable to load or slow to respond. This was a hassle in previous releases but seems to have been ironed out as of release 0.2.0.6. Nonetheless, there may be instances where response time is slow.

  • Incorrect storm archive links. For example, during Harvey the link to Harvey's archive page was listed incorrectly on the annual archives page; if I manually typed the link as it should have been, the storm's correct archive page would load.

As I become more aware of potential problems, I will look for workarounds. I would greatly appreciate any problems being reported as issues on the rrricanes repository.

I will also post known issues beyond my control (such as NHC website issues) to Twitter using the #rrricanes hashtag.

Future Plans

The following data will be added to rrricanes as time allows:

  • Reconnaissance data (release 0.2.2)

  • Computer forecast model data (release 0.2.3)

  • Archived satellite images (tentative)

  • Ship and buoy data (tentative)

Reconnaissance data itself will be a massive project; there are numerous types of products and, as advisory product formats have changed over the years, so have these. Any help with this task would be tremendously appreciated!

Some computer forecast models are in the public domain and can certainly be of tremendous use. Some are difficult to find (especially archived). The caution in this area is that many websites post this data but may have limitations on how it can be accessed.

Additionally, data may be added as deemed fitting.

Contribute

Anyone is more than welcome to contribute to the package. I would definitely appreciate any help. See Contributions for more information.

I would ask that you follow the Tidyverse style guide. Release 0.2.1 will fully incorporate these rules.

You do not need to submit code in order to be listed as a contributor. If there is a data source (that can legally be scraped) that you feel should be added, please feel free to submit a request. Submitting bug reports and feature requests is also extremely valuable to the success of rrricanes.

Acknowledgments

I want to thank the rOpenSci community for embracing rrricanes and accepting the package into their vast portfolio. This is my first attempt at making a project part of a larger community, and the lessons learned have been outstanding.

I want to thank Maelle Salmon who, in a sense, has been like a guiding angel from start to finish during the entire onboarding and review process.

I want to give a very special thanks to Emily Robinson and Joseph Stachelek for taking the time to put rrricanes to the test, giving valuable insight and recommendations on improving it.

And a thank-you also to James Molyneux, Mark Padgham, and Bob Rudis, all of whom have offered guidance or input that has helped make rrricanes far better than it would have been on my own.

Help us build capacity of open software users and developers with Hacktoberfest


One of rOpenSci’s aims is to build capacity of software users and developers and foster a sense of pride in their work. What better way to do that than to encourage you to participate in Hacktoberfest, a month-long celebration of open source software!

It doesn’t take much to get involved

Beginners to experts. Contributors and package maintainers welcome. You can get involved by applying the label Hacktoberfest to issues in your rOpenSci repo (or any project) that are ready for contributors to work on. You can find already-labelled rOpenSci and ropenscilabs issues here. A contribution can be anything - fixing typos, improving documentation, writing tests, fixing bugs, or creating new features. Who better to improve a vignette than the person who’s using the package?!

Consider opening beginner-level issues and label them beginner and Hacktoberfest. Remember, everyone had a first pull request so put yourself in their shoes, keep our code of conduct in mind and help open up the doors to our community!

Have a look at this great discussion on first-timers-only issues with pointers to other discussions and resources. Please chime in in the comments below with your suggestions for ways to help newcomers get involved.

Everyone who signs up gets stickers, but make four pull requests between October 1–31 and you get a limited edition Hacktoberfest 2017 t-shirt.

Spread the word

Share this blog post and link to your issue or link to all the issues! #Hacktoberfest #rstats

googleLanguageR - Analysing language through the Google Cloud Machine Learning APIs


One of the greatest assets human beings possess is the power of speech and language, from which almost all our other accomplishments flow. To be able to analyse communication offers us a chance to gain a greater understanding of one another.

To help you with this, googleLanguageR is an R package that allows you to perform speech-to-text transcription, neural net translation and natural language processing via the Google Cloud machine learning services.

An introduction to the package is below, but you can find out more details at the googleLanguageR website.

Google’s bet

Google predicts that machine learning is to be a fundamental feature of business, and so they are looking to become the infrastructure that makes machine learning possible. Metaphorically speaking: If machine learning is electricity, then Google wants to be the pylons carrying it around the country.

Google may not be the only company with such ambitions, but one advantage Google has is the amount of data it possesses. Twenty years of web crawling has given it an unprecedented corpus to train its models. In addition, its recent moves into voice and video give it one of the biggest audio and speech datasets, all of which have been used to help create machine learning applications within its products such as search and Gmail. Further investment in machine learning is shown by Google's purchase of Deepmind, a UK-based A.I. research firm that recently was in the news for defeating the top Go champion with its neural network trained Go bot. Google has also taken an open-source route with the creation and publication of Tensorflow, a leading machine learning framework.

Whilst you can create your own machine learning models, for those users who haven’t the expertise, data or time to do so, Google also offers an increasing range of machine learning APIs that are pre-trained, such as image and video recognition or job search. googleLanguageR wraps the subset of those machine learning APIs that are language flavoured - Cloud Speech, Translation and Natural Language.

Since their outputs are complementary and can feed into each other's inputs, all three APIs are included in one package. For example, you can transcribe a recording of someone speaking in Danish, translate that to English, identify how positive or negative the speaker felt about its content (sentiment analysis), and then identify the most important concepts and objects within the content (entity analysis).
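
As a rough sketch of how the pieces could fit together (the audio file name is hypothetical, speech settings are omitted, and the column names come from the examples below):

# transcribe a Danish recording, translate it, then analyse the English text
transcript <- gl_speech("danish_recording.wav")$transcript
english    <- gl_translate(transcript)$translatedText
analysis   <- gl_nlp(english)
analysis$documentSentiment  # how positive or negative the content is
analysis$entities           # the most important concepts and objects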

Motivations

Fake news

One reason why I started looking at this area was the growth of 'fake news', and its effect on political discourse on social media. I wondered if there was some way to put metrics on how much a news story fuels your own bias within your filter bubble. The entity API provides a way to perform entity and sentiment analysis at scale on tweets; by comparing the preferences of different users and news sources, the hope is to be able to judge how much they agree with your own biases, views and trusted sources.

Make your own Alexa

Another motivating application is the growth of voice commands, which may become the primary way we interface with technology. Already, Google reports up to 20% of search in its app is via voice search. I'd like to be able to say "R, print me out that report for client X". A Shiny app that records your voice, uploads it to the API and then parses the returned text into actions gives you a chance to create your very own Alexa-like infrastructure.

The voice activated internet connected speaker, Amazon’s Alexa - image from www.amazon.co.uk

Translate everything

Finally, I live and work in Denmark. As Danish is spoken by fewer than 6 million people, applications that work in English may not be available in Danish very quickly, if at all. The API's translation service is the one that made the news in 2016 for "inventing its own language"; it offers much better English-to-Danish translations than the free web version and may make services available in Denmark sooner.

Using the library

To use these APIs within R, you first need to do a one-time setup to create a Google Project, add a credit card and authenticate which is detailed on the package website.

After that, you feed in the R objects you want to operate upon. The rOpenSci review helped to ensure that this can scale up easily, so that you can feed in large character vectors which the library will parse and rate limit as required. The functions also work within tidyverse pipe syntax.

Speech-to-text

The Cloud Speech API is exposed via the gl_speech function.

It supports multiple audio formats and languages, and you can either feed in a sub-60-second audio file directly or perform asynchronous requests for longer audio files.

Example code:

library(googleLanguageR)

my_audio <- "my_audio_file.wav"
gl_speech(my_audio)
#  A tibble: 1 x 3
#  transcript confidence                 words
#* <chr>      <dbl>                <list>
#1 Hello Mum  0.9227779 <data.frame [19 x 3]>
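
For longer recordings, the asynchronous route looks roughly like the sketch below; the asynch argument and the gl_speech_op() helper are my reading of the package documentation, so treat them as assumptions and check the reference pages.

# Sketch: asynch = TRUE is assumed to return an operation object rather than the transcription
async_job <- gl_speech("my_long_audio_file.wav", asynch = TRUE)
# ...later, poll the operation to retrieve the result
gl_speech_op(async_job)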

Translation

The Cloud Translation API lets you translate text via gl_translate.

As you are charged per character, one tip if you are working with lots of different languages is to detect the language offline first using another rOpenSci package, cld2. That way you can avoid charges for text that is already in your target language, i.e. English.

library(googleLanguageR)
library(cld2)
library(purrr)

my_text <- c("Katten sidder på måtten", "The cat sat on the mat")

## offline detect language via cld2
detected <- map_chr(my_text, detect_language)
# [1] "DANISH"  "ENGLISH"

## get non-English text
translate_me <- my_text[detected != "ENGLISH"]

## translate
gl_translate(translate_me)
## A tibble: 1 x 3
#                 translatedText detectedSourceLanguage                    text
#*                         <chr>                  <chr>                   <chr>
#1 The cat is sitting on the mat                     da Katten sidder på måtten

Natural Language Processing

The Natural Language API reveals the structure and meaning of text, accessible via the gl_nlp function.

It returns several analyses:

  • Entity analysis - finds named entities (currently proper names and common nouns) in the text along with entity types, salience, mentions for each entity, and other properties. If possible, will also return metadata about that entity such as a Wikipedia URL.
  • Syntax - analyzes the syntax of the text and provides sentence boundaries and tokenization along with part of speech tags, dependency trees, and other properties.
  • Sentiment - the overall sentiment of the text, represented by a magnitude [0, +inf] and score between -1.0 (negative sentiment) and 1.0 (positive sentiment)

These are all useful for understanding the meaning of a sentence, and this API has potentially the greatest number of applications of the three featured. With entity analysis, automatic categorisation of text is possible; the syntax output lets you pull out nouns and verbs for parsing into other actions; and the sentiment analysis allows you to gauge the emotion within text.

A demonstration is below which gives an idea of what output you can generate:

library(googleLanguageR)
quote <- "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
nlp <- gl_nlp(quote)

str(nlp)
#List of 6
# $ sentences        :List of 1
#  ..$ :'data.frame':	1 obs. of  4 variables:
#  .. ..$ content    : chr "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
#  .. ..$ beginOffset: int 0
#  .. ..$ magnitude  : num 0.6
#  .. ..$ score      : num -0.6
# $ tokens           :List of 1
#  ..$ :'data.frame':	20 obs. of  17 variables:
#  .. ..$ content       : chr [1:20] "Two" "things" "are" "infinite" ...
#  .. ..$ beginOffset   : int [1:20] 0 4 11 15 23 25 29 38 42 48 ...
#  .. ..$ tag           : chr [1:20] "NUM" "NOUN" "VERB" "ADJ" ...
#  .. ..$ aspect        : chr [1:20] "ASPECT_UNKNOWN" "ASPECT_UNKNOWN" "ASPECT_UNKNOWN" "ASPECT_UNKNOWN" ...
#  .. ..$ case          : chr [1:20] "CASE_UNKNOWN" "CASE_UNKNOWN" "CASE_UNKNOWN" "CASE_UNKNOWN" ...
#  .. ..$ form          : chr [1:20] "FORM_UNKNOWN" "FORM_UNKNOWN" "FORM_UNKNOWN" "FORM_UNKNOWN" ...
#  .. ..$ gender        : chr [1:20] "GENDER_UNKNOWN" "GENDER_UNKNOWN" "GENDER_UNKNOWN" "GENDER_UNKNOWN" ...
#  .. ..$ mood          : chr [1:20] "MOOD_UNKNOWN" "MOOD_UNKNOWN" "INDICATIVE" "MOOD_UNKNOWN" ...
#  .. ..$ number        : chr [1:20] "NUMBER_UNKNOWN" "PLURAL" "NUMBER_UNKNOWN" "NUMBER_UNKNOWN" ...
#  .. ..$ person        : chr [1:20] "PERSON_UNKNOWN" "PERSON_UNKNOWN" "PERSON_UNKNOWN" "PERSON_UNKNOWN" ...
#  .. ..$ proper        : chr [1:20] "PROPER_UNKNOWN" "PROPER_UNKNOWN" "PROPER_UNKNOWN" "PROPER_UNKNOWN" ...
#  .. ..$ reciprocity   : chr [1:20] "RECIPROCITY_UNKNOWN" "RECIPROCITY_UNKNOWN" "RECIPROCITY_UNKNOWN" "RECIPROCITY_UNKNOWN" ...
#  .. ..$ tense         : chr [1:20] "TENSE_UNKNOWN" "TENSE_UNKNOWN" "PRESENT" "TENSE_UNKNOWN" ...
#  .. ..$ voice         : chr [1:20] "VOICE_UNKNOWN" "VOICE_UNKNOWN" "VOICE_UNKNOWN" "VOICE_UNKNOWN" ...
#  .. ..$ headTokenIndex: int [1:20] 1 2 2 2 2 6 2 6 9 6 ...
#  .. ..$ label         : chr [1:20] "NUM" "NSUBJ" "ROOT" "ACOMP" ...
#  .. ..$ value         : chr [1:20] "Two" "thing" "be" "infinite" ...
# $ entities         :List of 1
#  ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	6 obs. of  9 variables:
#  .. ..$ name         : chr [1:6] "human stupidity" "things" "universe" "universe" ...
#  .. ..$ type         : chr [1:6] "OTHER" "OTHER" "OTHER" "OTHER" ...
#  .. ..$ salience     : num [1:6] 0.1662 0.4771 0.2652 0.2652 0.0915 ...
#  .. ..$ mid          : Factor w/ 0 levels: NA NA NA NA NA NA
#  .. ..$ wikipedia_url: Factor w/ 0 levels: NA NA NA NA NA NA
#  .. ..$ magnitude    : num [1:6] NA NA NA NA NA NA
#  .. ..$ score        : num [1:6] NA NA NA NA NA NA
#  .. ..$ beginOffset  : int [1:6] 42 4 29 86 29 86
#  .. ..$ mention_type : chr [1:6] "COMMON" "COMMON" "COMMON" "COMMON" ...
# $ language         : chr "en"
# $ text             : chr "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
# $ documentSentiment:Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	1 obs. of  2 variables:
#  ..$ magnitude: num 0.6
#  ..$ score    : num -0.6

Acknowledgements

This package is 10 times better due to the efforts of the rOpenSci reviewers Neal Richardson and Julia Gustavsen, who have whipped the documentation, outputs and test cases into the form they are today in 0.1.0. Many thanks to them.

Hopefully, this is just the beginning and the package can be further improved by its users - if you do give the package a try and find a potential improvement, raise an issue on GitHub and we can try to implement it. I’m excited to see what users can do with these powerful tools.

Governance, Engagement, and Resistance in the Open Science Movement: A Comparative Study


A growing community of scientists from a variety of disciplines is moving the norms of scientific research toward open practices. Supporters of open science hope to increase the quality and efficiency of research by enabling the widespread sharing of datasets, research software source code, publications, and other processes and products of research. The speed at which the open science community seems to be growing mirrors the rapid development of technological capabilities, including robust open source scientific software, new services for data sharing and publication, and novel data science techniques for working with massive datasets. Organizations like rOpenSci harness such capabilities and deploy various combinations of these research tools, or what I refer to here as open science infrastructures, to facilitate open science.

As studies of other digital infrastructures have pointed out, developing and deploying the technological capabilities that support innovative work within a community of practitioners constitutes just one part of making innovation happen. As quickly as the technical solutions to improving scientific research may be developing, a host of organizational and social issues are lagging behind and hampering the open science community’s ability to inscribe open practices in the culture of scientific research. Remedying organizational and social issues requires paying attention to open science infrastructures’ human components, such as designers, administrators, and users, as well as the policies, practices, and organizational structures that contribute to the smooth functioning of these systems.1,2 These elements of infrastructure development have, in the past, proven to be equal to or more important than technical capabilities in determining the trajectory the infrastructure takes (e.g., whether it “succeeds” or “fails”).3,4

As a postdoc with rOpenSci, I have begun a qualitative, ethnographic project to explore the organizational and social processes involved in making open science the norm in two disciplines: astronomy and ecology. I focus on these two disciplines to narrow, isolate, and compare the set of contextual factors (e.g., disciplinary histories, research norms, and the like) that might influence perceptions of open science. Specifically, I aim to answer the following research questions (RQ):

  • RQ1a: What are the primary motivations of scientists who actively engage with open science infrastructures?

  • RQ1b: What are the factors that influence resistance to open science among some scientists?

  • RQ2: What strategies do open science infrastructure leaders use to encourage participation, govern contributions, and overcome resistance to open science infrastructure use?

    a. To what extent do governance strategies balance standardization and flexibility, centralization and decentralization, and voluntary and mandatory contributions?

    b. By what mechanisms are open science policies and practices enforced?

    c. What are the commonalities and differences in the rationale behind choices of governance strategies?

Below, I describe how I am systematically investigating these questions in two parts. In Part 1, I am identifying the issues raised by scientists who engage with or resist the open science movement. In Part 2, I am studying the governance strategies open science leaders and decision-makers use to elicit engagement with open science infrastructures in these disciplines.

Part 1: Engagement with and Resistance to Open Science

I am firmly rooted in a research tradition which emphasizes that studying the uptake of a new technology or technological approach, no matter the type of work or profession, begins with capturing how the people charged with changing their activities respond to the change “on the ground.” In this vein, Part 1 of the study aims to lend empirical support or opposition to arguments for and against open science that are commonly found in opinion pieces, on social media, and in organizational mission statements. A holistic reading of such documents reveals several commonalities in the positions for and against open science. Supporters of open science often cite increased transparency, reproducibility, and collaboration as the overwhelming benefits of making scientific research processes and products openly available. Detractors highlight concerns over “scooping,” ownership, and the time costs of curating and publishing code and data.

I am seeking to verify and test these claims by interviewing and surveying astronomers and ecologists or, more broadly, earth and environmental scientists who fall on various parts of the open science engagement-to-resistance spectrum. I am conducting interviews using a semi-structured interview protocol5 across all interviewees. I will then use a qualitative data analysis approach based on the grounded theory method6 to extract themes from the responses, focusing on the factors that promote engagement (e.g., making data available, spending time developing research software, or making publications openly accessible) or resistance (e.g., unwillingness to share code used in a study or protecting access to research data). Similar questions will be asked at scale via a survey.

Armed with themes from the responses, I will clarify and refine the claims often made in the public sphere about the benefits and drawbacks of open science. I hope to develop this part of the study into actionable recommendations for promoting open science, governing contributions to open science repositories, and addressing the concerns of scientists who are hesitant about engagement.

Part 2: Focusing on Governance

Even with interviews and surveys of scientists on the ground, it is difficult to systematically trace and analyze the totality of social and political processes that support open science infrastructure development because the processes occur across geographic, disciplinary, and other boundaries.

However, as others have pointed out,7 the organizational and social elements of digital infrastructure development often become visible and amenable to study through infrastructure governance. Governance refers to the combination of “executive and management roles, program oversight functions organized into structures, and policies that define management principles and decision making.”8 Effective governance provides projects with the direction and oversight necessary to achieve desired outcomes of infrastructure development while allowing room for creativity and innovation.2,9,10 Studying a project’s governance surfaces the negotiation processes that occur among stakeholders—users, managers, organizations, policymakers, and the like—throughout the development process. Outcomes include agreements about the types of technologies used, standards defining the best practices for technology use, and other policies to ensure that a robust, sustainable infrastructure evolves.9,11

Despite the scientific research community’s increasing reliance on open science infrastructures, few studies compare different infrastructure governance strategies2 and even fewer develop new or revised strategies for governing infrastructure development and use.12 The primary goal of this part of the project is to address this gap in our understanding of the governance strategies used to create, maintain, and grow open science infrastructures.

I am administering this part of the study by conducting in-depth, semi-structured interviews with leaders of various open science infrastructure projects supporting work in astronomy and ecology. I define “leaders” in this context as individuals or small groups of individuals who make decisions about the management of open science infrastructures and their component parts. This set of leaders includes founders and administrators of widely-used scientific software packages and collections of packages, of open data repositories, of open access publication and preprint services, and various combinations of open science tools. Furthermore, I intend to interview the leaders of organizations with which the open science community interacts—top publication editors, for example—to gauge how open science practices and processes are being governed outside of active open science organizations.

I will conduct qualitative coding as described above to develop themes from the responses of open science leaders. I will then ground these themes in the literature on digital infrastructure governance—which emphasizes gradual, decentralized, and voluntary development—and look for avenues to improve governance strategies.

Alongside the interview and survey methods, I am actively observing and retaining primary documents from the ongoing discourse around open science in popular scientific communication publications (e.g., Nature and Science), conferences and meetings (e.g., OpenCon and discipline-specific hackweeks), and in the popular media/social media (e.g., The New York Times and Twitter threads).

Preliminary Themes

I entered this project with a very basic understanding of how open science “works”—the technical and social mechanisms by which scientists make processes and outputs publicly available. In learning about the open science movement, in general and in particular instantiations, I’ve begun to see the intricacies involved in efforts to change scientific research and its modes of communication: research data publication, citation, and access; journal publication availability; and research software development and software citation standards. Within the community trying to sustain these changes are participants and leaders who are facing and tackling several important issues head-on. I list some of the most common engagement, resistance, and governance challenges appearing in interview and observation transcripts below.

Participation challenges

  • Overcoming the fear of sharing code and data, specifically the fear of sharing “messy” code and the fear of being shamed for research errors.
  • Defending the time and financial costs of participation in open science—particularly open source software development—to supervisors, collaborators, or tenure and promotion panels who are not engaged with open science.
  • Finding time to make code and data usable for others (e.g., through good documentation or complete metadata) and, subsequently, finding a home where code and data can easily be searched and found.

Governance challenges

  • Navigating the issue of convincing researchers that software development and data publication/archiving “count” as research products, even though existing funding, publication, and tenure and promotion models may not yet value those contributions.
  • Developing guidelines and processes for conducting peer review on research publication, software, and data contributions, especially the tensions involved in “open review.”
  • Deciding whose responsibility it is to enforce code and data publication standards or policies, both within open science organizations and in traditional outlets like academic journals.

The points raised in this post and the questions guiding my project might seem like discussions you’ve had too many times over coffee during a hackathon break or over beers after a conference session. If so, I’d love to hear from you, even if you are not an astronomer, an ecologist, or an active leader of an open science infrastructure (dsholler at berkeley dot edu). I am always looking for new ideas, both confirming and disconfirming, to refine my approach to this project.



Changes to Internet Connectivity in R on Windows


This week we released version 3.0 of the curl R package to CRAN. You may have never used this package directly, but curl provides the foundation for most HTTP infrastructure in R, including httr, rvest, and all packages that build on it. If R packages need to go online, chances are traffic is going via curl.

This release introduces an important change for Windows users: we are switching from OpenSSL to Secure Channel on Windows 7 / 2008-R2 and up. Let me explain this in a bit more detail.

Why Switch SSL Backends

The libcurl C library requires an external crypto library to provide the SSL layer (the S in HTTPS). On Linux / macOS, libcurl is included with the OS so we don’t have to worry about this. However, on Windows we ship our own build of libcurl, so we can choose whether to build against OpenSSL or the native Windows SSL API called Secure Channel, also referred to as just “WinSSL”.

Thus far we have always used libcurl with OpenSSL, which works consistently on all versions of Windows. However OpenSSL requires that we provide our own CA bundle, which is not ideal. In particular users on corporate / government networks have reported difficulty connecting to the internet in R. The reason is often that their enterprise gateway / proxy uses custom certificates which are installed in the Windows certificate manager, but are not present in R’s bundle.

Moreover shipping our own CA bundle can be a security risk. If a CA gets hacked, the corresponding certificate needs to be revoked immediately. Operating systems can quickly push a security update to all users, but we cannot do this in R.

Switching to WinSSL

If we build libcurl against Windows native Secure Channel, it automatically uses the same SSL certificates as Internet Explorer. Hence we do not have to ship and maintain a custom CA bundle. Earlier this year I tried to switch the curl package to WinSSL, and everything seemed to work great on my machine.

However, when we started checking reverse dependencies on CRAN WinBuilder, many packages depending on curl started to fail! It turned out that Windows versions before Windows 7 do not support TLS 1.1 and 1.2 by default. Because TLS 1.2 is used by the majority of HTTPS servers today, WinSSL is basically useless on these machines. Unfortunately this also includes CRAN WinBuilder, which runs Windows 2008 (the server edition of Vista).

So we had no choice but to roll back to OpenSSL in order to keep everything working properly on CRAN. Bummer.

Towards Dual SSL

I had almost given up on this when a few weeks ago Daniel Stenberg posted the following announcement on the libcurl mailing list:

Hi friends! As of minutes ago, libcurl has the ability to change SSL backend dynamically at run-time - if built with the support enabled. That means that the choice does no longer only have to happen at build-time.

This new feature gives us exactly the flexibility we need. We can take advantage of native Secure Channel on Windows 7 and up, which covers almost all users, while keeping things working on legacy systems, including the CRAN WinBuilder, by falling back on OpenSSL on those machines.

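For example, you can check which SSL backend your installation ended up with. A minimal sketch (curl_version() is part of the curl package and reports details of the bundled libcurl build):

library(curl)
# Show the libcurl version and the SSL backend it is using
curl_version()$version
curl_version()$ssl_version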

So this is where we are. Version 3.0 of the curl R package uses the latest libcurl 7.56.0 and automatically switches to native SSL on Windows 7 and up. If all goes well, nobody should notice any changes, except people on enterprise networks, where things will, hopefully, magically start working.

Feedback

Because each Windows network seems to have a different setup, testing and debugging these things is often difficult. We are interested to hear from Windows users if updating to curl 3.0 has improved the situation, or if any unexpected side effects arise. Please open an issue on Github if you run into problems.

.rprofile: David Smith


David Smith is a Blogger and Community Lead at Microsoft. I had the chance to interview David last May at rOpenSci unconf17. We spoke about his career, the process of working remote within a team, community development/outreach and his personal methods for discovering great content to share and write about.


KO: What is your name, job title, and how long have you been using R?

DS: My name is David Smith. I work at Microsoft and my self-imposed title is ‘R Community Lead’. I’ve been working with R specifically for about 10 years, but I’d been working with S since the early 90s.

KO: How did you transition into using R?

DS: I was using S for a long, long time, and I worked for the company that commercialized S, where I was a project manager for S-PLUS. And I got out of that company, and then worked for a startup in a different industry for a couple of years. But while I was there, the people that founded Revolution Analytics approached me because they were setting up a company to build a commercial business around R, and reached out to me because of my connections with the S community.

KO: So you came to Microsoft through Revolution?

DS: That’s correct. I was with Revolution Analytics and then Microsoft bought that company, so I’ve been with Microsoft since then.

KO: How has that transition gone, is there a Revolution team inside of Microsoft, or has it become more spread out?

DS: It’s been more spread out. It got split up into the engineering people and the marketing people, sales people all got distributed around. When I first went to Microsoft I started off in the engineering group doing product management. But I was also doing the community role, and it just wasn’t a very good fit just time-wise, between doing community stuff and doing code or product management. So then I switched to a different group called the Ecosystem team, and so now I’m 100% focused on community within a group that’s focused on the ecosystem in general.

The one piece of advice I could give anyone starting out in their careers is - write what you do, write it in public, and make it so that other people can reproduce what you did.

KO: What does it mean to be 100% community focused, do you do a lot of training?

DS: I don’t do a lot of training myself, but I work with a lot of other people on the team who do training. We’re focused mainly on building up the ecosystem of people that ultimately add value to the products that Microsoft has. And we’re specifically involved in the products that Microsoft has that now incorporate R by building up the value of the R ecosystem.

KO: What does your day-to-day look like, are you in an office, do you work remote?

DS: I work from home. I had moved from Seattle where Microsoft is headquartered to Chicago a couple of months before the acquisition happened, so I wasn’t about to move back to Seattle. But they let me work from home in Chicago, which has worked out great because most of my job is communicating with the community at large. So I do the Revolutions Blog, which I’ve been writing for eight or nine years now, writing stories about people using R, and applications of R packages. All as a way of communicating to the wider world the cool stuff that people can do with R, and also to the R community occasionally, what kind of things you can do with R in the Microsoft environment.

KO: Have you always been a writer or interested in writing and communications?

DS: No, no. I have a mathematics and computer science degree. I’m not trained as a writer. But it’s actually been useful having the perspective of statistics and mathematics and programming, and to bring that to a broader audience through writing. I’ve learned a lot about the whole writing and blogging and journalism process through that, but I’m certainly not trained in that way.

KO: How does your Ecosystems team at Microsoft function and collaborate?

DS: Unlike many teams at Microsoft, our team is very distributed. We have people working remotely from Denver, I’m in Chicago, Seattle, we’re all kind of distributed all around the place. So we meet virtually through Skype, have video meetings once a week and communicate a lot online.

KO: What kind of tools are you using?

DS: Traditionally, as in Microsoft, mainly email and Skype for the meetings. I set up an internal team focused around community more broadly around Microsoft and we use Microsoft Teams for that, which is a little bit like Slack. But a lot of the stuff that I do is more out in the open, so I use a lot of Twitter and Github for the code that I point to and stuff like that.

KO: How do you manage your Twitter?

DS: Twitter I do manually in real-time. I don’t do a lot of scheduling except for @RLangTip which is a feed of daily R tips. And for that I do scheduling through Tweetdeck on the web.

KO: How many Twitter accounts are you managing?

DS: I run @revodavid which is my personal twitter account, and @RLangTip which is R language tips. I tweet for @R_Forwards which is the diversity community for R, @RConsortium, the R Consortium, so quite a few.

KO: How long has this been a core part of your work day?

DS: The community thing as a focus, maybe five or six years? My career path for a long time was in product management. So I managed S-PLUS as a product for a long time, I managed another product at a different startup, and then I came to Revolution and I did a combination of engineering and product management. But in the last 18 months I’ve been 100% in the community space.

KO: How did you get into product management to begin with?

DS: That’s a good question that I’m not sure I know the answer to. I started off my first job after university – I actually left university specifically to become a support engineer for S-PLUS. When I took on that role, they didn’t really have product management yet at that company, and so when they were looking for somebody to basically steer S-PLUS as a product, it was a good fit for me and an opportunity to move to the States. I took that on and I kind of just learned product management as I did it. I went to a few sort of training/seminar type things, but I didn’t study it.

KO: Sure. It seems like something that people just kind of get saddled with sometimes?

DS: Exactly. It’s a discipline that doesn’t really have discipline. But for the various companies I’ve worked for, mostly startups, they all seem to have very different perspectives on what product management is and what the role of a product manager is.

KO: Yeah, I know what you mean. Are you happy to have sort of moved away from that?

DS: I am in the sense of – it was different being in a startup where being a product manager was more like being the shepherd of an entire product ecosystem, whereas in a big company the product manager is a lot more focused and inherently so, a lot more narrow. I happen to prefer the bigger picture I guess.

Honestly, I kind of focus from the point of view of what interests me personally. Which doesn’t sound very community oriented at all… but it’s an exercise in empathy.

KO: What’s your process for deciding what things you talk about and bring to the community?

DS: Honestly, I kind of focus from the point of view of what interests me personally. Which doesn’t sound very community oriented at all… but it’s an exercise in empathy. If I can write about something, or find a topic that I might find is interesting or exciting and I can communicate that with other people, I’m motivated to write about it and I hope that people are then motivated to learn about it. Kind of the antithesis of this is when I worked in marketing for a while; a lot of that style of writing was the bane of my existence because you’re producing these documents that literally are designed for nobody to read, in this language that nobody engages with. I much prefer blogging and tweeting because it’s much more directly for people.

KO: What have some of your most popular or successful engagements been about? Feel free to interpret ‘successful’ in any way.

DS: Well, from the point of view of what has been the most rewarding part of my job, is finding under-recognized or just these really cool things that people have done that just haven’t had a lot of exposure. And I’ve got a fairly big audience and a fairly wide reach, and it’s always fun for me to find things that people have done that maybe haven’t been seen. And it’s not my work, but I can – you know – take an eight page document that somebody’s written that has really cool things in it and just pull out various things. There is so much very cool stuff that people have done, half of the battle is getting it out there.

KO: What are some of your favorite sources for discovering cool things on the internet?

DS: There are channels on Reddit that I get a lot of material from, like /r/dataisbeautiful and things like that. It’s hard to say particular accounts on twitter, but I’ve spent a lot of time following people where I’ve read one of their blog posts and I find their twitter account, and they have just a few followers, I’ll follow them, and then over time it amounts to some good stuff. I have twitter open all day, every day. I don’t read everything on my feed every day, but I certainly keep it open.

KO: How much of your day is just spent exploring?

DS: A lot of it. I spend about half of any given day reading. It takes a long time, but every now and then you find this really cool stuff.

It’s one thing to be able to do really cool stuff in R or any other language, but until you can distill that down into something that other people consume, it’s going to be hard to sell yourself.

KO: Do you have any last nuggets of wisdom for people starting out their careers in R?

DS: For people starting out their careers, I think one of the most important skills to learn is that communication skill. It’s one thing to be able to do really cool stuff in R or any other language, but until you can distill that down into something that other people consume, it’s going to be hard to sell yourself. And it’s also going to be hard to be valuable. A lot of the people I’ve watched evolve in the community are people who have begun very early in their careers, blogging about what they do. The one piece of advice I could give anyone starting out in their careers is - write what you do, write it in public, and make it so that other people can reproduce what you did.


Data from Public Bicycle Hire Systems


A new rOpenSci package provides access to data to which users may already have directly contributed, and for which contribution is fun, keeps you fit, and helps make the world a better place. The data come from using public bicycle hire schemes, and the package is called bikedata. Public bicycle hire systems operate in many cities throughout the world, and most systems collect (generally anonymous) data, minimally consisting of the times and locations at which every single bicycle trip starts and ends. The bikedata package provides access to data from all cities which openly publish these data, currently including London, U.K., and in the U.S.A., New York, Los Angeles, Philadelphia, Chicago, Boston, and Washington DC. The package will expand as more cities openly publish their data (with the newly enormously expanded San Francisco system next on the list).

Why bikedata?

The short answer to that question is that the package provides access to what is arguably one of the most spatially and temporally detailed databases of finely-scaled human movement throughout several of the world’s most important cities. Such data are likely to prove invaluable in the increasingly active and well-funded attempt to develop a science of cities. Such a science does not yet exist in any way comparable to most other well-established scientific disciplines, but the importance of developing a science of cities is indisputable, and reflected in such enterprises as the NYU-based Center for Urban Science and Progress, or the UCL-based Centre for Advanced Spatial Analysis.

People move through cities, yet at present anyone faced with the seemingly fundamental question of how, when, and where people do so would likely have to draw on some form of private data (typically operators of transport systems or mobile phone providers). There are very few open, public data providing insight into this question. The bikedata package aims to be one contribution towards filling this gap. The data accessed by the package are entirely open, and are constantly updated, typically on a monthly basis. The package thus provides ongoing insight into the dynamic changes and reconfigurations of these cities. Data currently available via the package amounts to several tens of Gigabytes, and will expand rapidly both with time, and with the inclusion of more cities.

Why are these data published?

In answer to that question, all credit must rightfully go to Adrian Short, who submitted a Freedom of Information request in 2011 to Transport for London for usage statistics from the relatively new, and largely publicly-funded, bicycle scheme. This request from one individual ultimately resulted in the data being openly published on an ongoing basis. All U.S. systems included in bikedata commenced operation subsequent to that point in time, and many of them have openly published their data from the very beginning. The majority of the world’s public bicycle hire systems (see list here) nevertheless do not openly publish data, notably including very large systems in China, France, and Spain. One important aspiration of the bikedata package is to demonstrate the positive benefit for the cities themselves of openly and easily facilitating complex analyses of usage data, which brings us to …

What’s important about these data?

As mentioned, the data really do provide uniquely valuable insights into the movement patterns and behaviour of people within some of the world’s major cities. While the more detailed explorations below demonstrate the kinds of things that can be done with the package, the variety of insights these data facilitate is best demonstrated through considering the work of other people, exemplified by Todd Schneider’s high-profile blog piece on the New York City system. Todd’s analyses clearly demonstrate how these data can provide insight into where and when people move, into inter-relationships between various forms of transport, and into relationships with broader environmental factors such as weather. As cities evolve, and public bicycle hire schemes along with them, data from these systems can play a vital role in informing and guiding the ongoing processes of urban development. The bikedata package greatly facilitates analysing such processes, not only through making data access and aggregation enormously easier, but through enabling analyses from any one system to be immediately applied to, and compared with, any other systems.

How it works

The package currently focusses on the data alone, and provides functionality for downloading, storage, and aggregation. The data are stored in an SQLite3 database, enabling newly published data to be continually added, generally with one simple line of code. It’s as easy as:

store_bikedata (city = "chicago", bikedb = "bikedb")

If the nominated database (bikedb) already holds data for Chicago, only new data will be added, otherwise all historical data will be downloaded and added. All bicycle hire systems accessed by bikedata have fixed docking stations, and the primary means of aggregation is in terms of “trip matrices”, which are square matrices of numbers of trips between all pairs of stations, extracted with:

trips <- bike_tripmat (bikedb = "bikedb", city = "chi")

Note that most parameters are highly flexible in terms of formatting, so pretty much anything starting with "ch" will be recognised as Chicago. Of course, if the database only contains data for Chicago, the city parameter may be omitted entirely. Trip matrices may be filtered by time, through combinations of year, month, day, hour, minute, or even second, as well as by demographic characteristics such as gender or date of birth for those systems which provide such data. (These latter data are freely provided by users of the systems, and there can be no guarantee of their accuracy.) These can all be combined in calls like the following, which further demonstrates the highly flexible ways of specifying the various parameters:

trips <- bike_tripmat ("bikedb", city = "london, innit",
                       start_date = 20160101, end_date = "16,02,28",
                       start_time = 6, end_time = 24,
                       birth_year = 1980:1990, gender = "f")

The second mode of aggregation is as daily time series, via the bike_daily_trips() function. See the vignette for further details.
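
As a minimal sketch (assuming bike_daily_trips() takes the same bikedb and city arguments as the functions above), daily totals can be extracted with:

# Daily counts of trips for Chicago
daily <- bike_daily_trips (bikedb = "bikedb", city = "chicago")
head (daily)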

What can be done with these data?

Lots of things. How about examining how far people ride. This requires getting the distances between all pairs of docking stations as routed through the street network, to yield a distance matrix corresponding to the trip matrix. The latest version of bikedata has a brand new function to perform exactly that task, so it’s as easy as

devtools::install_github ("ropensci/bikedata") # to install latest version
dists <- bike_distmat (bikedb = bikedb, city = "chicago")

These are distances as routed through the underlying street network, with street types prioritised for bicycle travel. The network is extracted from OpenStreetMap using the rOpenSci osmdata package, and the distances are calculated using a brand new package called dodgr (Distances on Directed Graphs). (Disclaimer: It’s my package, and this is a shameless plug for it - please use it!)

The distance matrix extracted with bike_distmat is between all stations listed for a given system, while bike_tripmat returns trip matrices only between those stations in operation over a specified time period. Because systems expand over time, the two matrices will generally not be directly comparable, so it is necessary to submit both to the bikedata function match_matrices():

dim (trips); dim (dists)
## [1] 581 581

## [1] 636 636
mats <- match_matrices (trips, dists)
trips <- mats$trip
dists <- mats$dist
dim (trips); dim (dists)
## [1] 581 581

## [1] 581 581
identical (rownames (trips), rownames (dists))
## [1] TRUE

Distances can then be visually related to trip numbers to reveal their distributional form. These matrices contain too many values to plot directly, so the hexbin package is used here to aggregate in a ggplot.

library (hexbin)
library (ggplot2)
dat <- data.frame (distance = as.vector (dists),
                   number = as.vector (trips))
ggplot (dat, aes (x = distance, y = number)) +
    stat_binhex(aes(fill = log (..count..))) +
    scale_x_log10 (breaks = c (0.1, 0.5, 1, 2, 5, 10, 20),
                   labels = c ("0.1", "0.5", "1", "2", "5", "10", "20")) +
    scale_y_log10 (breaks = c (10, 100, 1000)) +
    scale_fill_gradientn(colours = c("seagreen","goldenrod1"),
                         name = "Frequency", na.value = NA) +
    guides (fill = FALSE)

The central region of the graph (yellow hexagons) reveals that numbers of trips generally decrease roughly exponentially with increasing distance (noting that scales are logarithmic), with most trip distances lying below 5km. What is the “average” distance travelled in Chicago? The easiest way to calculate this is as a weighted mean,

sum (as.vector (dists) * as.vector (trips) / sum (trips), na.rm = TRUE)
## [1] 2.510285

giving a value of just over 2.5 kilometres. We could also compare differences in mean distances between cyclists who are registered with a system and casual users. These two categories may be loosely considered to reflect “residents” and “non-residents”. Let’s wrap this in a function so we can use it for even cooler stuff in a moment.

dmean <- function (bikedb = "bikedb", city = "chicago")
{
    # Trip matrices: all trips, member-only trips, and non-member trips
    tm <- bike_tripmat (bikedb = bikedb, city = city)
    tm_memb <- bike_tripmat (bikedb = bikedb, city = city, member = TRUE)
    tm_nomemb <- bike_tripmat (bikedb = bikedb, city = city, member = FALSE)
    stns <- bike_stations (bikedb = bikedb, city = city)
    dists <- bike_distmat (bikedb = bikedb, city = city)

    # Match each trip matrix to the distance matrix so dimensions agree
    mats <- match_matrices (tm_memb, dists)
    tm_memb <- mats$trip
    mats <- match_matrices (tm_nomemb, dists)
    tm_nomemb <- mats$trip
    mats <- match_matrices (tm, dists)
    tm <- mats$trip
    dists <- mats$dist

    # Trip-weighted mean distance overall, plus the member/non-member ratio
    d0 <- sum (as.vector (dists) * as.vector (tm) / sum (tm), na.rm = TRUE)
    dmemb <- sum (as.vector (dists) * as.vector (tm_memb) / sum (tm_memb), na.rm = TRUE)
    dnomemb <- sum (as.vector (dists) * as.vector (tm_nomemb) / sum (tm_nomemb), na.rm = TRUE)
    res <- c (d0, dmemb / dnomemb)
    names (res) <- c ("dmean", "ratio_memb_non")
    return (res)
}

Differences in distances ridden between “resident” and “non-resident” cyclists can then be calculated with

dmean (bikedb = bikedb, city = "ch")
##          dmean ratio_memb_non
##       2.510698       1.023225

And system members cycle slightly longer distances than non-members. (Do not at this point ask about statistical tests - these comparisons are made between millions, often tens of millions, of points, so any difference may safely be assumed to be statistically significant.) Whatever the reason for this difference between “residents” and others, we can use this exact same code to compare equivalent distances for all cities which record whether users are members or not (which is all cities except London and Washington DC).

cities <- c ("ny", "ch", "bo", "la", "ph") # NYC, Chicago, Boston, LA, Philadelphia
sapply (cities, function (i) dmean (bikedb = bikedb, city = i))
##                       ny       ch       bo       la       ph
## dmean          2.8519131 2.510285 2.153918 2.156919 1.702372
## ratio_memb_non 0.9833729 1.023385 1.000635 1.360099 1.130929

And we thus discover that Boston manifests the greatest equality in terms of distances cycled between residents and non-residents, while LA manifests the greatest difference. New York City is the only one of these five in which non-members of the system actually cycle further than members. (And note that these two measures can’t be statistically compared in any direct way, because mean distances are also affected by relative numbers of member to non-member trips.) These results likely reflect a host of (scientifically) interesting cultural and geo-spatial differences between these cities, and demonstrate how the bikedata package (combined with dodgr and osmdata) can provide unique insight into differences in human behaviour between some of the most important cities in the U.S.

Visualisation

Many users are likely to want to visualise how people use a given bicycle system, and in particular are likely to want to produce maps. This is also readily done with the dodgr package, which can route and aggregate transit flows for a particular mode of transport throughout a street network. Let’s plot bicycle flows for the Indego System of Philadelphia PA. First get the trip matrix, along with the coordinates of all bicycle stations.

devtools::install_github ("gmost/dodgr") # to install latest version
city <- "ph"
# store_bikedata (bikedb = bikedb, city = city) # if not already done
trips <- bike_tripmat (bikedb = bikedb, city = city)
stns <- bike_stations (bikedb = bikedb, city = city)
xy <- stns [, which (names (stns) %in% c ("longitude", "latitude"))]

Flows of cyclists are calculated between those xy points, so the trips table has to match the stns table:

indx <- match (stns$stn_id, rownames (trips))
trips <- trips [indx, indx]

The dodgr package can be used to extract the underlying street network surrounding those xy points (expanded here by 50%):

net <- dodgr_streetnet (pts = xy, expand = 0.5) %>%
    weight_streetnet (wt_profile = "bicycle")

We then need to align the bicycle station coordinates in xy to the nearest points (or “vertices”) in the street network:

verts <- dodgr_vertices (net)
pts <- verts$id [match_pts_to_graph (verts, xy)]

Flows between these points can then be mapped onto the underlying street network with:

flow <- dodgr_flows (net, from = pts, to = pts, flow = trips) %>%
    merge_directed_flows ()
net <- net [flow$edge_id, ]
net$flow <- flow$flow

See the dodgr documentation for further details of how this works. We’re now ready to plot those flows, but before we do, let’s overlay them on top of the rivers of Philadelphia, extracted with rOpenSci’s osmdata package.

q <- opq ("Philadelphia pa")
rivers1 <- q %>%
    add_osm_feature (key = "waterway", value = "river", value_exact = FALSE) %>%
    osmdata_sf (quiet = FALSE)
rivers2 <- q %>%
    add_osm_feature (key = "natural", value = "water") %>%
    osmdata_sf (quiet = FALSE)
rivers <- c (rivers1, rivers2)

And finally plot the map, using rOpenSci’s osmplotr package to prepare a base map with the underlying rivers, and the ggplot2::geom_segment() function to add the line segments with colours and widths weighted by bicycle flows.

library (osmplotr)
require (ggplot2)
bb <- get_bbox (c (-75.22, 39.91, -75.10, 39.98))
cols <- colorRampPalette (c ("lawngreen", "red")) (30)
map <- osm_basemap (bb, bg = "gray10") %>%
    add_osm_objects (rivers$osm_multipolygons, col = "gray20") %>%
    add_osm_objects (rivers$osm_lines, col = "gray20") %>%
    add_colourbar (zlims = range (net$flow / 1000), col = cols)
map <- map + geom_segment (data = net, size = net$flow / 50000,
                           aes (x = from_lon, y = from_lat, xend = to_lon, yend = to_lat,
                                colour = flow, size = flow)) +
    scale_colour_gradient (low = "lawngreen", high = "red", guide = "none")
print_osm_map (map)

The colour bar on the right shows thousands of trips, with the map revealing the relatively enormous numbers crossing the South Street Bridge over the Schuylkill River, leaving most other flows coloured in the lower range of green or yellows. This map thus reveals that anyone wanting to see Philadelphia’s Indego bikes in action without braving the saddle themselves would be best advised to head straight for the South Street Bridge.

Future plans

Although the dodgr package greatly facilitates the production of such maps, the code is nevertheless rather protracted, and it would probably be very useful to convert much of the code in the preceding section to an internal bikedata function to map trips between pairs of stations onto corresponding flows through the underlying street networks.

Beyond that point, and the list of currently open issues awaiting development on the github repository, future development is likely to depend very much on how users use the package, and on what extra features people might want. How can you help? A great place to start might be the official Hacktoberfest issue, helping to import the next lot of data from San Francisco. Or just use the package, and open up a new issue in response to any ideas that might pop up, no matter how minor they might seem. See the contributing guidelines for general advice.

Acknowledgements

Finally, this package wouldn’t be what it is without my co-author Richard Ellison, who greatly accelerated development through encouraging C rather than C++ code for the SQL interfaces. Maëlle Salmon majestically guided the entire review process, and made the transformation of the package to its current polished form a joy and a pleasure. I remain indebted to both Bea Hernández and Elaine McVey for offering their time to extensively test and review the package as part of rOpenSci’s onboarding process. The review process has made the package what it is, and for that I am grateful to all involved!

Building Communities Together at ozunconf, 2017


Just last week we organised the 2nd rOpenSci ozunconference, the Australian sibling of the rOpenSci unconference. Last year it was held in Brisbane; this time around, the ozunconf was hosted in Melbourne, on October 26-27, 2017.

At the ozunconf, we brought together 45 R-software users and developers, scientists, and open data enthusiasts from academia, industry, government, and non-profits. Participants travelled from far and wide, with people coming from 6 cities around Australia, 2 cities in New Zealand, and one city in the USA. Before the ozunconf we discussed and dreamt up projects to work on for a few days, then met up and brought about a baker’s dozen of them into reality.

Upskilling participants

On Day 0, one day before the ozunconf, Roger Peng and I ran a half day training session on how to develop R packages and share them on GitHub. The participants picked things up really quickly, and by the end of the session, everyone could make an R package, and push it to GitHub. We also introduced them to the wonders of RMarkdown. The event then kicked on to the R-Ladies Melbourne special one year anniversary event, which featured a great talk and introduction to Random Forests by Elisabeth Vogel.

Bringing people together

Before the ozunconf, we discussed various ideas for projects in the GitHub issues. Things really started to pick up in the last week and we ended up at 41 issues - almost as many issues as participants.

Day one kicked off with decorating some hex cookies, baked by Di Cook. This uncovered a fun fact that Stefan Milton Bache - creator of the beloved pipe operator (%>%) from the magrittr package, apparently also created the first #rstats hex sticker.

We then stuck the various projects that had been discussed throughout the week around the room and participants sticker voted on projects that they were interested in working on. Introductions were made, and quotes like these (from Steph de Silva) led to entertaining discussions around data:

We were really lucky to be in the beautiful Monash City Campus, a place that almost seems to have been designed for an unconf, with some classroom style space, as well as plenty of nooks and crannies to sit in, including an outdoor astroturfed garden complete with bean bags and native flora.

The venue even seemed to reflect our love of hex stickers, providing a nice hex sticker themed carpet:

We had some great sponsors for this event, including rOpenSci, RStudio, The R Consortium, The Ingham Institute, and Monash Business School. The event was organised by myself, Di Cook, Rob Hyndman, and Miles McBain.

We wrapped up at the end of day 2, giving each project group three minutes to debrief on their project, using the unconf style - only the README.md (mostly!). You can check out all the ozunconf projects here, thanks to a template from Sean Kross. Soon we will publish a series of short posts covering some of these great fun projects.

Here’s a quick taster:

  • realtime. Real-time streaming plots built on the p5.js library.

  • stow. A simplified version control interface to git, from within R.

  • icon. Easily access and insert web icons into HTML and PDF documents.

  • ochRe. Provides Australia-themed colour palettes.

We’ll share a quick summary of all of the projects over the coming weeks.

What’s Your Story?

Some key #ozunconf lessons from Steph de Silva

  • The opportunity to try but not succeed is a luxury to be savoured
  • Mistakes make you a better programmer
  • The best thing about R isn’t the language, it’s the number of people around who want to help
  • Your skills are valuable, so your productivity is too. Investing in the tools that maximise it is worthwhile.
  • git really is out to get you.

A few people have already written about their ozunconf experience. Have you? Share the link in the comments below and we’ll add it here.

Projects posts

Community posts

Introducing people to the rOpenSci community

rOpenSci has had a profound impact on me and my work. At the end of 2015 I got in touch with them to discuss arranging an unconference in Australia, and they welcomed me and my friends. Today, I am proud to be welcoming those from the ozunconf to this big, kind, wonderful community, and to say, as Shannon Ellis summed up: “Hey! You there! You are welcome here”. It was also really great to have a diverse group of participants at the ozunconf, and in particular, that 40% of participants were women or other underrepresented genders.

Starts, not ends

One thing that I’ve realised in my involvement with organising and attending these events is that when the unconf ends, it feels a bit sad, sure, to say goodbye to the environment, the community, the friends, and the projects. At the last unconf in LA, we were sending out a stream of tweets: “it’s not over until it’s over”. But, on reflection, standing back, taking it all in, the unconference doesn’t really end - it just starts. It starts many new things - projects, ideas, collaborations, and friendships.

The ozunconf comes to an end. Now, let’s get started.

Image Convolution in R using Magick


Release 1.4 of the magick package introduces a new feature called image convolution that was requested by Thomas L. Pedersen. In this post we explain what this is all about.

Kernel Matrix

The new image_convolve() function applies a kernel over the image. Kernel convolution means that each pixel value is recalculated using the weighted neighborhood sum defined in the kernel matrix. For example, let’s look at this simple kernel:

library(magick)

kern <- matrix(0, ncol = 3, nrow = 3)
kern[1, 2] <- 0.25
kern[2, c(1, 3)] <- 0.25
kern[3, 2] <- 0.25
kern
##      [,1] [,2] [,3]
## [1,] 0.00 0.25 0.00
## [2,] 0.25 0.00 0.25
## [3,] 0.00 0.25 0.00

This kernel changes each pixel to the mean of its horizontal and vertical neighboring pixels, which results in a slight blurring effect in the right-hand image below:

img <- image_read('logo:')
img_blurred <- image_convolve(img, kern)
image_append(c(img, img_blurred))


Standard Kernels

Many operations in magick such as blurring, sharpening, and edge detection are actually special cases of image convolution. The benefit of explicitly using image_convolve() is more control. For example, we can blur an image and then blend it together with the original image in one step by mixing a blurring kernel with the unit kernel:

img %>% image_convolve('Gaussian:0x5', scaling = '60,40%')


The above requires a bit of explanation. ImageMagick defines several common standard kernels such as the Gaussian kernel. Most of the standard kernels take one or more parameters; e.g. the example above used a Gaussian kernel with radius 0 and sigma 5.

In addition, the scaling argument defines the magnitude of the kernel, and possibly how much of the original picture should be mixed in. Here we mix 60% of the blurring with 40% of the original picture in order to get a diffused lighting effect.

Edge Detection

Another area where kernels are of use is in edge detection. A simple example of a direction-aware edge detection kernel is the Sobel kernel. As can be seen below, vertical edges are detected while horizontals are not.

img %>% image_convolve('Sobel') %>% image_negate()


Something less apparent is that the result of the edge detection is truncated. Edge detection kernels can result in negative color values, which get truncated to zero. To combat this it is possible to add a bias to the result. Often you’ll end up scaling the kernel to 50% and adding a 50% bias to move the midpoint of the result to 50% grey:

img %>% image_convolve('Sobel', scaling = '50%', bias = '50%')


Sharpening

ImageMagick has many more edge detection kernels, some of which are insensitive to the direction of the edge. To emulate a classic high-pass filter from Photoshop, use a difference of Gaussians kernel:

img %>% image_convolve('DoG:0,0,2') %>% image_negate()


As with the blurring, the original image can be blended in with the transformed one, effectively sharpening the image along edges.

img %>% image_convolve('DoG:0,0,2', scaling = '100, 100%')


The ImageMagick documentation has more examples of convolution with the various available kernels.

Using Magick with RMarkdown and Shiny


This week magick 1.5 appeared on CRAN. The latest update adds support for using images in knitr documents and shiny apps. In this post we show how this nicely ties together a reproducible image workflow in R, from source image or plot directly into your report or application.

library(magick)
stopifnot(packageVersion('magick') >= 1.5)

Also the magick intro vignette has been updated in this version to cover the latest features available in the package.

Magick in Knitr / RMarkdown Documents

Magick 1.5 is now fully compatible with knitr. To embed magick images in your rmarkdown report, simply use standard code chunk syntax in your Rmd file. No special options or packages are required; the image automatically appears in your documents when printed!

# Example from our post last week
image_read('logo:') %>%
  image_convolve('DoG:0,0,2') %>%
  image_negate() %>%
  image_resize("400x400")


You can also combine this with the magick graphics device to post process or animate your plots and figures directly in knitr. Again no special packages or system dependencies are required.

# Produce graphic
fig <- image_graph(width = 800, height = 600, res = 96)
ggplot2::qplot(factor(cyl), data = mtcars, fill = factor(gear))
invisible(dev.off())

print(fig)



# Some post-processing
frink <- image_read("https://jeroen.github.io/images/frink.png")

fig %>%
  image_rotate(10) %>%
  image_implode(.6) %>%
  image_composite(frink, offset = "+140+70") %>%
  image_annotate("Very usefull stuff", size = 40, location = "+300+100", color = "navy", boxcolor = "pink")


The same works for animation with image_animate(); the figure automatically shows up in the report as a gif image:

image_read("https://jeroen.github.io/images/banana.gif") %>%
  image_apply( function(banana){
    image_composite(fig, banana, offset = "+200+200")
  }) %>%
  image_resize("50%") %>%
  image_animate()


The magick vignette source code is itself written in Rmarkdown, so it’s a great example to see this in action. Try rendering it in RStudio to see how easy it is!

Magick in Shiny Apps

While we’re at it, several people had asked how to use magick images in shiny apps. The easiest way is to write the image to a tempfile() within the renderImage() callback function. For example the server part could look like this:

output$img <- renderImage({
  tmpfile <- image %>%
    image_resize(input$size) %>%
    image_implode(input$implode) %>%
    image_blur(input$blur, input$blur) %>%
    image_rotate(input$rotation) %>%
    image_write(tempfile(fileext = '.jpg'), format = 'jpg')

  # Return a list pointing renderImage at the temporary file
  list(src = tmpfile, contentType = "image/jpeg")
})

Below is a simple shiny app that demonstrates this. Have a look at the source code or just run it in R:

library(shiny)
library(magick)
runGitHub("shinymagick", "jeroen")


Perhaps there’s an even better way to make this work by wrapping magick images into an htmlwidget but I have not figured this out yet.

solrium 1.0: Working with Solr from R


Nearly 4 years ago I wrote on this blog about an R package solr for working with the database Solr. Since then we’ve created a refresh of that package in the solrium package. Since solrium first hit CRAN about two years ago, users have raised a number of issues that required breaking changes. Thus, this blog post is about a major version bump in solrium.


What is Solr?

Solr is a “search platform” - a NoSQL database - in which data is organized into so-called documents: xml/json/etc. blobs of text. Documents are nested within either collections or cores (depending on the mode you start Solr in). Solr makes it easy to search for documents, with a huge variety of parameters, and a number of different data formats (json/xml/csv). Solr is similar to Elasticsearch (see our Elasticsearch client elastic) - and was around before it. Solr is, in my opinion, harder to set up than Elasticsearch, but I don’t claim to be an expert on either.


Vignettes

Notable features

  • Added in v1, you can now work with many connection objects to different Solr instances.
  • Methods for the major search functionalities: search, highlight, stats, mlt, group, and facet. In addition, a catch-all function, solr_all(), combines all of those.
  • Comprehensive coverage of the Solr HTTP API
  • Can coerce data from the Solr API into data.frames when possible

Setup

Install solrium

install.packages("solrium")

Or get the development version:

devtools::install_github("ropensci/solrium")
library(solrium)


Initialize a client

A big change in v1 of solrium is that solr_connect has been replaced by SolrClient. Now you create an R6 connection object with SolrClient, then you can either call methods on that R6 object, or pass the connection object to functions.

By default, SolrClient$new() sets connection details for a Solr instance running on localhost, port 8983.

(conn <- SolrClient$new())
#> <Solr Client>
#>   host: 127.0.0.1
#>   path:
#>   port: 8983
#>   scheme: http
#>   errors: simple
#>   proxy:

On instantiation, it does not check that the Solr instance is up, but merely sets connection details. You can check whether the instance is up, for example (assuming you have a collection named gettingstarted):

conn$ping("gettingstarted")
#> $responseHeader
#> $responseHeader$zkConnected
#> [1] TRUE
#>
#> $responseHeader$status
#> [1] 0
#>
#> $responseHeader$QTime
#> [1] 163
#>
#> $responseHeader$params
#> $responseHeader$params$q
#> [1] "{!lucene}*:*"
#>
#> $responseHeader$params$distrib
#> [1] "false"
#>
#> $responseHeader$params$df
#> [1] "_text_"
#>
#> $responseHeader$params$rows
#> [1] "10"
#>
#> $responseHeader$params$wt
#> [1] "json"
#>
#> $responseHeader$params$echoParams
#> [1] "all"
#>
#>
#>
#> $status
#> [1] "OK"

A good hint when connecting to a publicly exposed Solr instance is that you likely don’t need to specify a port, so a pattern like this should work to connect to a URL like http://foobar.com/search:

SolrClient$new(host = "foobar.com", path = "search", port = NULL)

If the instance uses SSL, simply specify that like:

SolrClient$new(host = "foobar.com", path = "search", port = NULL, scheme = "https")


Query and body parameters

Another big change in the package is that we wanted to make it easy to determine whether your Solr query gets passed as query parameters in a GET request or as the body of a POST request. Solr clients in some other languages do this, and it made sense to port that idea over here. Now you pass your key-value pairs to either params or body. If nothing is passed to body, we do a GET request. If something is passed to body, we do a POST request, even if key-value pairs are also passed to params.

This change does break the interface we had in the old version, but we think it’s worth it.

For example, to do a search you have to pass the collection name and a list of named parameters:

conn$search(name = "gettingstarted", params = list(q = "*:*"))
#> # A tibble: 5 x 5
#>      id   title title_str  `_version_` price
#>   <chr>   <chr>     <chr>        <dbl> <int>
#> 1    10 adfadsf   adfadsf 1.582913e+18    NA
#> 2    12  though    though 1.582913e+18    NA
#> 3    14 animals   animals 1.582913e+18    NA
#> 4     1    <NA>      <NA> 1.582913e+18   100
#> 5     2    <NA>      <NA> 1.582913e+18   500

You can instead pass the connection object to solr_search:

solr_search(conn, name = "gettingstarted", params = list(q = "*:*"))
#> # A tibble: 5 x 5
#>      id   title title_str  `_version_` price
#>   <chr>   <chr>     <chr>        <dbl> <int>
#> 1    10 adfadsf   adfadsf 1.582913e+18    NA
#> 2    12  though    though 1.582913e+18    NA
#> 3    14 animals   animals 1.582913e+18    NA
#> 4     1    <NA>      <NA> 1.582913e+18   100
#> 5     2    <NA>      <NA> 1.582913e+18   500

And the same pattern applies for the other functions:

  • solr_facet
  • solr_group
  • solr_mlt
  • solr_highlight
  • solr_stats
  • solr_all


New functions for atomic updates

A user requested the ability to do atomic updates - partial updates to documents without having to re-index the entire document.

Two functions were added: update_atomic_json and update_atomic_xml for JSON and XML based updates. Check out their help pages for usage.
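
As a rough sketch of what this can look like (the exact arguments are an assumption here, so do check the help pages; the body below follows Solr's standard atomic-update JSON syntax):

# Sketch: set a new value for the "price" field of the document with id "1"
body <- '[{"id": "1", "price": {"set": 250}}]'
update_atomic_json(conn, body, name = "gettingstarted")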


Search results as attributes

solr_search and solr_all in v1 gain attributes that include numFound, start, and maxScore. That is, you can get to these three values after data is returned. Note that some Solr instances may not return all three values.

For example, let’s use the Public Library of Science Solr search instance at http://api.plos.org/search:

plos <- SolrClient$new(host = "api.plos.org", path = "search", port = NULL)

Search

res <- plos$search(params = list(q = "*:*"))

Get attributes

attr(res, "numFound")
#> [1] 1902279
attr(res, "start")
#> [1] 0
attr(res, "maxScore")
#> [1] 1


Automatically adjust rows parameter

A user highlighted that there’s a performance penalty when asking for too many rows. The resulting change in solrium is that in some search functions we automatically adjust the rows parameter to avoid the performance penalty.
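
In practice nothing changes in how you call the search functions; for example (a sketch against the gettingstarted collection used above), you can still request a generous rows value and let solrium adjust it:

# solrium adjusts the rows parameter internally to avoid the penalty
res <- solr_search(conn, name = "gettingstarted", params = list(q = "*:*", rows = 100000))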


Usage in other packages

I maintain 4 other packages that use solrium: rplos, ritis, rdatacite, and rdryad. If you are interested in using solrium in your package, looking at any of those four packages will give a good sense of how to do it.


Notes

solr pkg

The solr package will soon be archived on CRAN. We’ve moved all packages depending on it to solrium. Let me know ASAP if you have any complaints about archiving it on CRAN.

Feedback!

Please do upgrade/install solrium v1 and let us know what you think.


.rprofile: Mara Averick


Mara Averick is a non-profit data nerd, NBA stats junkie, and most recently, tidyverse developer advocate at RStudio. She is the voice behind two very popular Twitter accounts, @dataandme and @batpigandme. Mara and I discussed sports analytics, how attending a cool conference can change the approach to your career, and how she uses Twitter as a mechanism for self-imposed forced learning.



KO: What is your name, job title, and how long have you been using R? [Note: This interview took place in May 2017. Mara joined RStudio as their tidyverse developer advocate in November 2017.]

MA: My name is Mara Averick, I do consulting, data science, I just say “data nerd at large” because I’ve seen those Venn diagrams and I’m definitely not a data scientist. I used R in high school for fantasy basketball. I graduated from high school in 2003, and then in college used SPSS, and I didn’t use R for a long time. And then I was working with a company that does grant proposals for non-profits, doing all of the demand- and outcome-analysis and it all was in Excel and I thought, we could do better - R might also be helpful for this. It turns out there’s a package for American Community Survey data in R (acs), so that was how I got back into R.

KO: How did you find out about R when you first started using it in high school?

MA: I honestly don’t remember. I didn’t even use RStudio until two years ago. I think it was probably from other fantasy nerds?

KO: Is there an underground R fantasy basketball culture?

MA: Well R for fantasy football is legit. Fantasy Football Analytics is all R modeling.

KO: That’s awesome - so now, do you work with sports analytics? Or is that your personal project/passion?

MA: A little bit of both, I worked for this startup called Stattleship (@stattleship). Because I’ll get involved with anything if there’s a good pun involved… and so we were doing sports analytics work that kind of ended up shifting more in a marketing direction. I still do consulting with the head data scientist [Tanya Cashorali] for that [at TCB Analytics]. Some of the analysis/consulting will be with companies who are doing either consumer products for sports or data journalism stuff around sports analytics.

KO: How often do you use R now?

MA: Oh, I use R like every day. I use it… I don’t use Word any more. [Laughter] Yeah so one of the things about basketball is that there are times of the year where there are games every day. So that’s been my morning workflow for a while - scraping basketball data.

KO: So you get up every morning and scrape what’s new in Basketball?

MA: Yeah! So I end up in RStudio bright and early (often late, as well).

KO: So is that literally what the first half hour of your day looks like?

MA: No, so incidentally that’s kind of how this Twitter thing got started. My dog has long preceded me on Twitter and the internet at large, he’s kind of an internet famous dog @batpigandme. There’s an application called Buffer which allows you to schedule tweets and facebook page posts, which was most of Batpig’s traffic - facebook page visits from Japan. And so I had this morning routine (started in the winter when I had one of those light things you sit in front of for a certain number of minutes) where I would wake up and schedule batpig posts while I’m sitting there and read emails. And that ended up being a nice morning workflow thing.

I went to a Do Good Data conference, which is a Data Analysts for Social Good (@DA4SG) event, just over two years ago, and everyone there was giving out their twitter handles, and I was like, oh - maybe people who aren’t dogs also use Twitter? [Laughter] So that was how I ended up creating my own account @dataandme independent from Batpig.

KO: What happened after you went to this conference? Was it awesome, did it inspire you?

MA: Yeah so, I was the stats person at the company I was working at. And I didn’t realize there was all this really awesome work being done with really rigorous evaluation that wasn’t necessarily federal grant proposal stuff. So I was really inspired by that and started learning more about what other people were doing, some of it in R, some of it not. I kept in touch with some of the people from that conference. And then NBA Twitter is also a thing it turns out, and NBA, R/Statistics is also a really big thing so that was kind of what pulled me in. And it was really fun. A lot of interesting projects and people that I work with were all through that [Twitter] which still surprises me - that I can read a book and tell the author something and they care? It’s weird.

I like to make arbitrary rules for myself, one of the things is I don’t tweet stuff that I haven’t read.

KO: Everyone loves your twitter account. How do you find and curate the things you end up posting about?

MA: I like to make arbitrary rules for myself, one of the things is I don’t tweet stuff that I haven’t read. I like to learn new things and/or I have to learn new things every day so I basically started scheduling [tweets] as a way to make myself read the things that I want to read and get back to.

KO: Wait, so you schedule a tweet and then you’re like, okay well this is my deadline to read this thing - or I’ll be a liar.

MA: Totally.

KO: Whoa that’s awesome.

MA: I’ve also never not finished a book in my life. It’s one of my rules, I’m really strict about it.

KO: That’s a lot of pressure!

MA: So that was kind of how it started out - especially because I didn’t even know all the stuff I didn’t know. Then, as I’ve used R more and more, there’s stuff that I’ve just happened to read because I don’t know what I’m doing.

KO: The more you learn the more you can learn.

MA: Yeah so now a lot of the stuff [tweets] is stuff I end up reading over the course of the day and then add it [to the queue]. Or it’s just stuff I’ve already read when I feel like being lazy.

KO: Do you have side projects other than the basketball/sports stuff?

MA: I actually majored in science and technology studies, which means I was randomly trained in ethical/legal/social implications of science. So I’m working on some data ethics projects which unfortunately I can’t talk about. And then my big side project for total amusement was this D3.js in Action analysis of Archer which is a cartoon that I watch. But that’s also how I learned really how to use tidytext. So then I ended up doing a technical review for David [Robinson] and Julia’s [Silge] book Text Mining with R: A Tidy Approach. It was super fun. So yeah, I always have a bunch of random side projects going on.

KO: How is your work-life balance?

MA: It’s funny because I like what I do. So I don’t always know where that starts and ends. And I’m really bad at capitalism. It never occurs to me that I should be paid for doing some things. Especially if it involves open data and open source - surely you can’t charge for that? But I read a lot of stuff that’s not R too. I think I’m getting sort of a balance, but I’m not sure.

KO: Switching back to your job-job now. Are you on a team, are you remote, are you in an office, what are the logistics like?

MA: Kind of all of the above. In my old job I was on a team but I was the only person doing anything data related. And I developed some really lazy habits from that - really ugly code and committing terrible stuff to git. But with this NBA project I end up working with a lot of different people (who are also basketball-stat nerds).

KO: Do you work with people who are employed by the actual NBA teams, or just people who are really interested in the subject?

MA: No, so there is an unfortunate attrition of people whom I work with when they get hired by teams - which is not unfortunate, it’s awesome, but then they can no longer do anything with us. So that’s collaborative work but I don’t work on a team anymore.

KO: So you don’t have daily stand-ups or anything.

MA: No, no. I could probably benefit from that, but my goal is never to be 100% remote. After I went to that first data conference, I felt like being around all these people who are so much smarter than I am, and know so much more than I do is intimidating, but I also learned so much. And I learned so many things I was doing, not wrong, but inefficiently. I still learn about 80 things I’m doing inefficiently every day.

My goal right now - stop holding on to all of my projects that are not as done as I want them to be, and will never be done.

KO: Do you have set beginnings and endings to projects? How many projects are you juggling at a given time?

MA: After doing federal grant proposals, it doesn’t feel like anything is a deadline compared to that. They don’t care if your house burned down if it’s not in at the right time. So nothing feels as hard and fast as that. There are certain things like the NBA that —

KO: There are timely things.

MA: Yeah, and then sometimes we’ll just set arbitrary deadlines, just to kind of get out of a cycle of trying to perfect it, which I fall deeply into. Yeah so that’s kind of a little bit of my goal right now - stop holding on to all of my projects that are not as done as I want them to be, and will never be done. With the first iteration of this Archer thing I literally spent three days trying to get this faceted bar chart thing to sort in multiple ways and was super frustrated and then I tweeted something about it and immediately David Robinson responded with precisely what I needed and would have never figured out. So I’m working on doing that more. And also because it’s so helpful to me when other people do that.

KO: How did you get hooked up with Julia and David, just through Twitter?

MA: Yeah! So Julia I’d met at Open Vis Conf, David I’d read his blog about a million lines of bad code - it was open on my iPad for like years because I loved it so much, and still do. And yeah so again as this super random twitter-human that I feel like I am, I do end up meeting and doing things with cool people who are super smart and do really cool things.

KO: It’s impressive how much you post and not just that, but it’s really evident that you care. People can tell that this isn’t just someone who reposts a million things a day.

MA: I mean it’s totally selfish, don’t get me wrong. But I’m super glad that it’s helpful to other people too. It gives me so much anxiety to think that people might think I know how to do all the things that I post, which I don’t, that’s why I had to read them - but even when I read them, sometimes I don’t know. The R community is pretty awesome, at least the parts of it that I know; which is not universally true of any community of any group of scientists. R Twitter is super-super helpful. And that was evident really quickly, at least to me.

My plea to everyone who has a blog is to put their Twitter handle somewhere on it.

KO: What are some of your favorite things on the internet? Blogs, Twitter Accounts, Podcasts…

MA: I have never skipped one of Julia Silge’s blog posts. Her posts are always something that I know I should learn how to do. Both she and D-Rob [David Robinson] know their stuff and they write really well. So those are two blogs and follows that I love. Bob Rudis - almost daily, I can’t believe how quickly he churns stuff out. R-Bloggers is a great way to discover new stuff. Dr. Simon J [Simon Jackson] - I literally think of people by their twitter handles [@drsimonj], and there are so many others.

PUT A BIRD ON IT!

Every day I’m amazed by all the stuff I didn’t know existed. And also there’s stuff that people wrote three or four years ago. A lot of the data vis stuff I end up finding from weird angles. So those are some of my favorites - I’m sure there are more. Oh! Thomas Lin Pedersen, Data Imaginist is his blog. There are so many good blogs. My plea to everyone who has a blog is to put their twitter handle somewhere on it. I actually try really hard to find attribution stuff. Every now and then I get it really wrong and it’ll be someone who has nothing to do with it but who has the same name. There’s a bikini model who has the same name as someone who I said wrote a thing - which I vetted it too! I was like, well she’s multi-faceted, good for her! And then somebody was like, I don’t think that’s the right one. Oops! I have to say that that’s the one thing that Medium nailed - when you click share it gives you their twitter handle. If you have a blog, put your twitter handle there so I don’t end up attributing it to a bikini model.

2017 rOpenSci ozunconf :: Reflections and the realtime Package


This year’s rOpenSci ozunconf was held in Melbourne, bringing together over 45 R enthusiasts from around the country and beyond. As is customary, ideas for projects were discussed in GitHub Issues (41 of them by the time the unconf rolled around!) and there was no shortage of enthusiasm, interesting concepts, and varied experience.

I’ve been to a few unconfs now and I treasure the time I get to spend with new people, new ideas, new backgrounds, new approaches, and new insights. That’s not to take away from the time I get to spend with people I met at previous unconfs; I’ve gained great friendships and started collaborations on side projects with these wonderful people.

When the call for nominations came around this year it was an easy decision. I don’t have employer support to attend these things so I take time off work and pay my own way. This is my networking time, my development time, and my skill-building time. I wasn’t sure what sort of project I’d be interested in but I had no doubts something would come up that sounded interesting.

As it happened, I had been playing around with a bit of code, purely out of interest and hoping to learn how htmlwidgets work. The idea I had was to make a classic graphic equaliser visualisation like this

using R.

This presents several challenges: how can I get live audio into R, and how fast can I plot the signal? I had doubts about both parts, partly because of the way that R calls tie up the session (for now…) and partly because constructing a ggplot2 object is somewhat slow (in terms of raw audio speeds). I’d heard about htmlwidgets and thought there must be a way to leverage that towards my goal.

I searched for a graphic equaliser javascript library to work with and didn’t find much that aligned with what I had in my head. Eventually I stumbled on p5.js and its examples page which has an audio-input plot with a live demo. It’s a frequency spectrum, but I figured that’s just a bit of binning away from what I need. Running the example there looks like

This seemed to be worth a go. I managed to follow enough of this tutorial to have the library called from R. I modified the javascript canvas code to look a little more familiar, and the first iteration of geom_realtime() was born

This seemed like enough of an idea that I proposed it in the GitHub Issues for the unconf. It got a bit of attention, which was worrying, because I had no idea what to do with this next. Peter Hickey pointed out that Sean Kross had already wrapped some of the p5.js calls into R calls with his p5 package, so this seemed like a great place to start. It’s quite a clever way of doing it too; it involves re-writing the javascript which htmlwidgets calls on each time you want to do something.

Fast forward to the unconf and a decent number of people gathered around a little slip of paper with geom_realtime() written on it. I had to admit to everyone that the ggplot2 aspect of my demo was a sham (it’s surprisingly easy to draw a canvas in just the right shade of grey with white gridlines), but people stayed, and we got to work seeing what else we could do with the idea. We came up with some suggestions for input sources, some different plot types we might like to support, and set about trying to understand what Sean’s package actually did.

As it tends to work out, we had a great mix of people with different experience levels in different aspects of the project; some who knew how to make a package, some who knew how to work with javascript, some who knew how to work with websockets, some who knew about realtime data sources, and some who knew about nearly none of these things (✋ that would be me). If everyone knew every aspect about how to go about an unconf project I suspect the endeavor would be a bit boring. I love these events because I get to learn so much about so many different topics.

I shared my demo script and we deconstructed the pieces. We dug into the inner workings of the p5 package and started determining which parts we could siphon off to meet our own needs. One of the aspects that we wanted to figure out was how to simulate realtime data. This could be useful both for testing, and also in the situation where one might want to ’re-cast’ some time-coded data. We were thankful that Jackson Kwok had done a deep dive into websockets and pretty soon (surprisingly soon, perhaps; within the first day) we had examples of (albeit constructed) real-time (every 100ms) data streaming from a server and being plotted at speed

Best of all, running the plot code didn’t tie up the session; it uses a listener written into the javascript so it just waits for input on a particular port.
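
For readers curious about the underlying mechanism, here is a minimal sketch (not the realtime package itself) of pushing a fake data stream over a websocket from R with the httpuv package; the port, message format, and timing are illustrative assumptions, and the browser-side javascript that would receive and draw the values is omitted.

library(httpuv)

# keep handles to connected clients so we can push data to them
clients <- new.env()

server <- startServer("127.0.0.1", 8000, list(
  call = function(req) {
    # minimal HTTP response for non-websocket requests
    list(status = 200L, headers = list("Content-Type" = "text/plain"),
         body = "websocket stream")
  },
  onWSOpen = function(ws) {
    id <- as.character(length(ls(clients)) + 1)
    assign(id, ws, envir = clients)
    ws$onClose(function() rm(list = id, envir = clients))
  }
))

# emit a random value roughly every 100 ms; a javascript listener at the
# other end of the websocket would redraw on each message it receives
for (i in 1:100) {
  for (id in ls(clients)) get(id, envir = clients)$send(as.character(rnorm(1)))
  service(100)  # let httpuv handle connections between sends
}

stopServer(server)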

With the core goal well underway, people started branching out into aspects they found most interesting. We had some people work on finding and connecting actual data sources, such as the bitcoin exchange rate

and a live-stream of binary-encoded data from the Australian National University (ANU) Quantum Random Numbers Server

Others formalised the code so that it can be piped into different ‘themes’, and retain the p5 structure for adding more components

These were still toy examples of course, but they highlight what’s possible. They were each constructed using an offshoot of the p5 package whereby the javascript is re-written to include various features each time the plot is generated.

Another route we took was to use the direct javascript binding API with factory functions. This had less flexibility in terms of adding modular components, but meant that the javascript could be modified without worrying about how it needed to interact with p5 so much. This resulted in some outstanding features such as side-scrolling and date-time stamps. We also managed to pipe the data off to another thread for additional processing (in R) before being sent to the plot.

The example we ended up with reads the live-feed of Twitter posts under a given hashtag, computes a sentiment analysis on the words with R, and live-plots the result:
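
As a rough illustration of the sentiment step only (the Twitter streaming and live plotting are left out), something along the following lines scores a batch of tweet texts with tidytext; the example tweets are invented.

library(dplyr)
library(tidyr)
library(tidytext)

tweets <- tibble(
  id   = 1:3,
  text = c("this package is wonderful",
           "ugh, the build is broken again",
           "results look fine to me")
)

tweets %>%
  unnest_tokens(word, text) %>%                        # one row per word
  inner_join(get_sentiments("bing"), by = "word") %>%  # word-level sentiment
  count(id, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(score = positive - negative)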

Overall I was amazed at the progress we made over just two days. Starting from a silly idea/demo, we built a package which can plot realtime data, and can even serve up some data to be plotted. I have no expectations that this will be the way of the future, but it’s been a fantastic learning experience for me (and hopefully others too). It’s highlighted that there are ways to achieve realtime plots, even if we’ve used a library built for drawing rather than one built for plotting per se.

It’s even inspired offshoots in the form of some R packages; tRainspotting which shows realtime data on New South Wales public transport using leaflet as the canvas

and jsReact which explores the interaction between R and Javascript

The possibilities are truly astounding. My list of ‘things to learn’ has grown significantly since the unconf, and projects are still starting up/continuing to develop. The ggeasy package isn’t related, but it was spawned from another unconf Github Issue idea. Again; ideas and collaborations starting and developing.

I had a great time at the unconf, and I can’t wait until the next one. My hand will be going up to help out, attend, and help start something new.

My thanks and congratulations go out to each of the realtime developers: Richard Beare, Jonathan Carroll, Kim Fitter, Charles Gray, Jeffrey O Hanson, Yan Holtz, Jackson Kwok, Miles McBain and the entire cohort of 2017 rOpenSci ozunconf attendees. In particular, my thanks go to the organisers of such a wonderful event; Nick Tierney, Rob Hyndman, Di Cook, and Miles McBain.


Six tips for running a successful unconference


Attendees at the May 2017 rOpenSci unconference. Photo credit: Nistara Randhawa

In May 2017, I helped run a wildly successful “unconference” that had a huge positive impact on the community I serve. rOpenSci is a non-profit initiative enabling open and reproducible research by creating technical infrastructure in the form of staff- and community-contributed software tools in the R programming language that lower barriers to working with scientific data sources on the web, and creating social infrastructure through a welcoming and diverse community of software users and developers. Our 4th annual unconference brought together 70 people to hack on projects they dreamed up and to give them opportunities to meet and work together in person. One third of the participants had attended before, and two thirds were first-timers, selected from an open call for applications. We paid all costs up front for anyone who requested this in order to lower barriers to participation.

It’s called an “unconference” because there is no schedule set before the event – participants discuss project ideas online in advance and projects are selected by participant-voting at the start. I’m sharing some tips here on how to do this well for this particular flavour of unconference.

1. Have a code of conduct

Having a code of conduct that the organizers promote in the welcome goes a long way to creating a welcoming and safe environment and preventing violations in the first place.

2. Host online discussion of project ideas before the unconference

Our unconference centered on teams working on programming projects, rather than discussions, so prior to the unconference, we asked all participants to suggest project ideas using an open online system, called GitHub, that allows everyone to see and comment on ideas or just share enthusiastic emoji to show support.

3. Have a pre-unconference video-chat with first-time participants

Our AAAS CEFP training emphasizes the importance of extending a personalized welcome to community members, so I was inspired to make the bold move of talking with more than 40 first-time participants prior to the unconference. I asked each person to complete a short questionnaire to get them to consider the roles they anticipated playing prior to our chat. Frequently, within an hour of a video-chat, without prompting, I would see the person post a new project idea or share their thoughts about someone else’s. Other times, our conversation gave the person an opportunity to say “this idea maybe isn’t relevant but…” and I would help them talk through it, inevitably leading to “oh my gosh this is such a cool idea”. I got a huge return on my investment. People’s questions like “how do you plan to have 20 different projects present their work?” led to better planning on my part. Specific ideas for improvements came from me responding with “well…how would YOU do it?”

Between the emails, slack channel, issues on GitHub, and personal video chats, I felt completely at ease going into the unconf (where I knew next to no one!).

– rOpenSci unconf17 participant

4. Run an effective ice breaker

I adapted the “Human Barometer” ice breaker to enable 70 people to share opinions across all perceived levels of who a participant is and introduce themselves to the entire group within a 1 hour period. Success depended on creating questions that were relevant to the unconference crowd, and on visually keeping track of who had spoken up in order to call on those who had not. Ice breakers and the rOpenSci version of the Human Barometer will be the subject of a future CEFP blog post.

5. Have a plan to capture content

So much work and money go into running a great unconference that you can’t afford to do it without a plan to capture and disseminate stories about the people and the products. Harness the brain work that went into the ideas! I used the concept of content pillars to come up with a plan. Every project group was given a public repository on GitHub to house their code and documentation. In a 2-day unconference with 70 people in 20 projects, how do people present their results?! We told everyone that they had just three minutes to present, and that the only presentation material they could use was their project README (the page of documentation in their code repository). No slides allowed! This kept their focus on great documentation for the project rather than an ephemeral set of pretty slides. Practically speaking, this meant that all presentations were accessible from a single laptop connected to the projector and that to access their presentation, a speaker had only to click on the link to their repo. Where did the essence of this great idea come from? From a pre-unconference chat of course!

In the week following the unconference, we used the projects’ README documentation to create a series of five posts released Monday through Friday, noting every one of the 20 projects with links to their code repositories. To get more in-depth stories about people and projects, I let participants know we were keen to host community-contributed blog posts and that accepted posts would be tweeted to rOpenSci’s >13,000 followers. Immediately after the unconference, we invited selected projects to contribute longer form narrative blog posts and posted these once a week. The series ended with Unconf 2017: The Roads Not Taken, about all the great ideas that were not yet pursued and inviting people to contribute to these.

All of this content is tied together in one blog post to summarize the unconference and link to all staff- and community-contributed posts in the unconference projects series as well as to posts of warm personal and career-focussed reflections by some participants.

6. Care about other people’s success

Community managers do a lot of “emotional work”. In all of this, my #1 key to running a successful unconference is to genuinely and deeply care that participants arrive feeling ready to hit the ground running and leave feeling like they got what they wanted out of the experience. Ultimately, the success of our unconference is more about the human legacy – building people’s capacity as software users and developers, and the in-person relationships they develop within our online community – than it is about the projects.

“I’ve steered clear of ‘hackathons’ because they feel intimidating and the ‘bad’ kind of competitive. But, this….this was something totally different.”

– rOpenSci unconf17 participant

Additional Resources

ochRe - Australia themed colour palettes


The second rOpenSci OzUnConf was held in Melbourne Australia a few weeks ago. A diverse range of scientists, developers and general good-eggs came together to make some R-magic happen and also learn a lot along the way. Before the conference began, a huge stack of projects were suggested on the unconf GitHub repo. For six data-visualisation enthusiasts, one issue in particular caught their eye, and the ochRe package was born.

The ochRe package contains colour palettes influenced by the Australian landscape, iconic Australian artists and images. OchRe is originally the brain-child of Di Cook, who was inspired by Karthik Ram’s wesanderson package.

Why “ochre”?

Naming our package was the “most important” task facing us after we had all jumped on board the project. Fuelled by a selection of pastries we opened the discussions, fully expecting this to take some time. Fortunately, we all agreed on the name in less than 5 minutes, which meant plenty of pastries were left for the serious business of package building.

Ochre is a naturally occurring, brownish-yellow pigment found in many parts of Australia, so frequently, in fact, that Australia is sometimes referred to as the “land of ochre soil”. Additionally, ochre pigment has been used for thousands of years by Aboriginal people in Australia, with many culturally important uses from artwork to the preservation of animal skins.

Building ochRe

We started our package building journey by each picking an iconic Australian artwork (this took longer than you might think). Once we had selected our images, we used the online Image Color Extract PHP demo tool to extract the hex code for the main colours within each image. Some images required a more selective approach, so where needed the colour code extraction was done using the eyedropper tool (in macOS) or the Google Chrome colourPick extension.

Once we were happy with the colours, codes and order for each palette we loaded this information into ochRe as lists of hex codes associated with the palette name. We adopted the scales package to improve the functionality of the package when using ggplot, in particular to allow manipulation of colour ramping and transparency. The package also contains a few simple functions for displaying the different palettes.
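
To give a feel for the general pattern (greatly simplified relative to the real ochRe internals), a palette list plus thin ggplot2 wrappers might look like the sketch below; the palette name and hex values here are placeholders rather than actual ochRe colours.

library(ggplot2)

# palettes stored as named character vectors of hex codes
demo_palettes <- list(
  demo_qual = c("#1B4F72", "#B9770E", "#7D6608", "#633974")
)

# interpolate a palette to any number of colours (useful for continuous data)
demo_pal <- function(palette = "demo_qual", reverse = FALSE) {
  cols <- demo_palettes[[palette]]
  if (reverse) cols <- rev(cols)
  grDevices::colorRampPalette(cols)
}

# discrete and continuous scales built on top of the palette list
scale_colour_demo <- function(palette = "demo_qual", ...) {
  scale_colour_manual(values = demo_palettes[[palette]], ...)
}
scale_fill_demo_c <- function(palette = "demo_qual", ...) {
  scale_fill_gradientn(colours = demo_pal(palette)(256), ...)
}

ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  scale_colour_demo()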

Below are some examples of original art work and their associated palettes:

namatjira_qual and namatjira_div are both inspired by the watercolour painting “Twin Ghosts”, by Aboriginal artist Albert Namatjira. namatjira_div is ordered for plotting divergent datasets.

The nolan_ned palette is inspired by the famous paintings of the outlaw Ned Kelly by Sidney Nolan.

olsen_seq has been designed for plotting sequential data, such as a heat map or landscape layers. The colours come from the abstract piece “Sydney Sun”, 1965, by John Olsen.

There was a high proportion of ecologists at the #ozunconf, which inspired the somewhat pessimistic healthy_reef and dead_reef palettes, with the colours taken from recent underwater photographs of the Great Barrier Reef.

Introducing: ochRe

Our package currently contains 16 colour palettes, each one inspired by either an Australian landscape, an artwork or image by an Australian artist, or an Australian animal. Some of the palettes are more suited to displaying continuous data (such as in the Australian elevation maps above). Other palettes will perform best plotting discrete data (as in the parliament example below).

ochRe can currently be installed from GitHub:

# You need to install the 'devtools' package first
devtools::install_github("ropenscilabs/ochRe")

You can visualise all 16 palettes using the following code snippet:

library(ochRe)

pal_names <- names(ochre_palettes)

par(mfrow=c(length(ochre_palettes)/2, 2), lheight = 2, mar=rep(1, 4), adj = 0)
for (i in 1:length(ochre_palettes)){
    viz_palette(ochre_palettes[[i]], pal_names[i])
}

Here are some worked examples, showing how to use the palettes for different types of data visualisation, including both ggplot and base plotting in R.

An example using base R and the winmar palette, this is based on an iconic photograph by Wayne Ludbey, “Nicky Winmar St Kilda Footballer”, 1993. In the photo, Aboriginal AFL player Nicky Winmar is baring his skin in response to racial abuse during an AFL game.

## basic example code
library(ochRe)
pal <- colorRampPalette(ochre_palettes[["winmar"]])
image(volcano, col = pal(20))

Paired scatter plot using the emu_woman_paired palette, inspired by “Emu Woman”, 1988-89, by Emily Kame Kngwarreye.

library(tidyverse)
library(ochRe)
library(naniar)

# Exploring missing values benefits from a paired palette, like emu_woman_paired
# Here missing status on air temperature is shown in a plot of the two wind variables
data(oceanbuoys)
oceanbuoys <- oceanbuoys %>% add_shadow(humidity, air_temp_c)
ggplot(oceanbuoys, aes(x=wind_ew, y=wind_ns, colour=air_temp_c_NA)) +
    geom_point(alpha=0.8) +
    scale_colour_ochre(palette="emu_woman_paired") +
    theme_bw() + theme(aspect.ratio=1)

# Slightly more complicated, forcing the pairs
clrs <- ochre_palettes$emu_woman_paired[11:12]
ggplot(oceanbuoys, aes(x=wind_ew, y=wind_ns, colour=air_temp_c_NA)) +
    geom_point(alpha=0.8) +
    scale_colour_manual(values=clrs) +
    theme_bw() + theme(aspect.ratio=1)

A map of the Australian electoral boundaries, using the galah palette. Galahs are a common species of cockatoo found throughout mainland Australia.

# Map of the 2016 Australian electoral boundaries
# with the galah palette
library(eechidna)
library(ggthemes)
data(nat_map_2016)
data(nat_data_2016)
ggplot(aes(map_id=id), data=nat_data_2016) +
    geom_map(aes(fill=Area_SqKm), map=nat_map_2016) +
    expand_limits(x=nat_map_2016$long, y=nat_map_2016$lat) +
    scale_fill_ochre(palette="galah", discrete=FALSE) +
    theme_map()

Results of the 2016 Australian election to the senate, coloured by political party using the parliament palette. The colours for this palette were taken from the tapestry by Arthur Boyd found in the Great Hall of Parliament House.

# Election results
senate <- read_csv("http://results.aec.gov.au/20499/Website/Downloads/SenateSenatorsElectedDownload-20499.csv",
                   skip = 1)
coalition <- c("Country Liberals (NT)", "Liberal", "Liberal National Party of Queensland","The Nationals")
labor <- c("Australian Labor Party", "Australian Labor Party (Northern Territory) Branch","Labor")
greens <- c("The Greens", "Australian Greens", "The Greens (WA)")

senate <- senate %>% mutate(PartyNm = ifelse(as.character(PartyNm) %in% coalition,"Liberal National Coalition", PartyNm))

senate <- senate %>% mutate(PartyNm = ifelse(as.character(PartyNm) %in% labor,"Australian Labor Party", PartyNm))

senate <- senate %>% mutate(PartyNm = ifelse(as.character(PartyNm) %in% greens,"Australian Greens", PartyNm))

senate$PartyNm <- factor(senate$PartyNm,
                         levels = names(sort(table(senate$PartyNm),
                            decreasing = T)))

ggplot(data = senate, aes(x = PartyNm, fill = PartyNm)) +
    geom_bar() + xlab("") +
    ylab("") + scale_fill_ochre(palette="parliament") + coord_flip() +
    theme_bw() + theme(legend.position = "None")

For more information about the individual palettes available in ochRe, visit our vignette.

All of the ochRe team had a great time at the #ozunconf. Thank you to the organisers for a brilliant event, and special thanks to Michael Sumner for providing code to access the Australian elevation map you see at the start of this post.

changes: easy Git-based version control from R


Are you new to version control and always running into trouble with Git? Or are you a seasoned user, haunted by the traumas of learning Git and reliving them whilst trying to teach it to others? Yeah, us too.

Git is a version control tool designed for software development, and it is extraordinarily powerful. It didn’t actually dawn on me quite how amazing Git is until I spent a weekend in Melbourne with a group of Git whizzes using Git to write a package targeted toward Git beginners. Whew, talk about total Git immersion! I was taking part in the 2017 rOpenSci ozunconf, in which forty-odd developers, scientists, researchers, nerds, teachers, starving students, cat ladies, and R users of all descriptions form teams to create new R packages fulfilling some new and useful function. Many of the groups used Git for their collaborative workflows all weekend.

Unfortunately, just like many a programming framework, Git can often be a teensy bit (read: extremely, prohibitively) intimidating, especially for beginners who don’t need all of Git’s numerous and baffling features. It’s one of those platforms that makes your life a million times better once you know how to use it, but if you’re trying to teach yourself the basics using the internet, or—heaven forbid—trying to untangle yourself from some Git-branch tangle that you’ve unwittingly become snarled in… (definitely done that one…) well, let’s just say using your knuckles to break a brick wall can sometimes seem preferable. Just ask the Git whizzes. They laugh, because they’ve been there, done that.

The funny thing is, doing basic version control in Git only requires a few commands. After browsing through the available project ideas and settling into teams, a group of eight of us made a list of the commands that we use on a daily basis, and the list came to about a dozen. We looked up our Git histories and compiled a Git vocabulary, which came out to less than 50 commands, including combination commands.

As Nick Golding so shrewdly recognized in the lead up to this year’s unconference, the real obstacle for new Git users is not the syntax, it’s actually (a) the scary, scary terminal window and (b) the fact that Git terminology was apparently chosen by randomly opening a verb dictionary and blindly pointing to a spot on the page. (Ok, I’m exaggerating, but the point is that the terminology is pretty confusing). We decided to address these two problems by making a package that uses the R console and reimagining the version control vocabulary and workflow for people who are new to version control and only need some of its many features.

Somewhat ironically, nine people worked for two days on a dozen branches, using Git and GitHub to seamlessly merge our workflows. It was wonderful to see how so many people’s various talents can be combined to make something that no group members could have done all on their own.

Enter changes (repo, website – made using pkgdown), our new R package to do version control with a few simple commands. It uses Git and git2r under the hood, but new users don’t need to know any Git to begin using version control with changes. Best of all, it works seamlessly with regular Git. So if a user thinks they’re ready to expand their horizons, they can start using Git commands via the githug package, RStudio’s Git interface, or on the command line.

Here is an overview of some of the ways we’ve made simple version control easy with changes:

Simple terminology

It uses simple and deliberately un-git-like terminology (a short usage sketch follows the list below):

  • You start a new version control project with create_repo(), which is like git init but it can set up a nice project directory structure for you, automatically ignoring things like output folders.
  • All of the steps involved in committing edits have been compressed into one function: record(). All files that aren’t ignored will be committed, so users don’t need to know the difference between tracking, staging and committing files.
  • It’s easy to set which files to omit from version control with ignore(), and to change your mind with unignore().
  • changes() lets you know which files have changed since the last record, like a hybrid of git status and git diff.
  • You can look back in history with timeline() (a simplified version of git log), go_to() a previous record (like git checkout), and scrub() any unwanted changes since the last record (like git reset --hard).
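
A minimal usage sketch tying these together, assuming only the function names described above; the arguments shown (a project name, a record message, a version number) and the install location in the comment are illustrative guesses rather than the package’s documented interface.

# install with: devtools::install_github("ropenscilabs/changes")
library(changes)

create_repo("my_analysis")   # start a version-controlled project
ignore("output")             # keep generated files out of the history

# ... edit some files ...
changes()                    # what has changed since the last record?
record("first pass at data cleaning")   # snapshot everything not ignored

timeline()                   # linear history with simple version numbers
go_to(1)                     # revisit the project as it was at record 1
scrub()                      # discard unwanted changes since the last record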

It’s linear

After a long discussion, we decided that changes won’t provide an interface to Git branches (at least not yet), as the merge conflicts it leads to are one of the scariest things about version control for beginners. With linear version control, users can easily go_to() a past record using a version number rather than an unfamiliar SHA. These numbers appear in a lovely visual representation of their timeline():

      (1) initial commit
       |  2017-11-18 02:55
       |
      (2) set up project structure
       |  2017-11-18 02:55
       |
      (3) added stuff to readme
          2017-11-18 02:55

If you want to roll your project back to a previous record, you can retrieve() it, and changes will simply append that record at the top of your timeline (storing all the later records, just in case).

Readable messages and automatic reminders

Some of Git’s messages and helpfiles are totally cryptic to all but the most hardened computer scientists. Having been confronted with our fair share of detached HEADs and offers to update remote refs along with associated objects, we were keen to make sure all the error messages and helpfiles in changes are as intuitive and understandable as possible.

It can also be hard to get into the swing of recording edits, so changes will give you reminders to encourage you to use record() regularly. You can change the time interval for reminders, or switch them off, using remind_me().

Coming soon

We made a lot of progress in two days, but there’s plenty more we’re planning to add soon:

  1. Simplified access to GitHub with a sync() command to automagically handle most uses of git fetch, git merge, and git push.
  2. A Git training-wheels mode, so that people who want to move on to using Git can view the Git commands changes is using under the hood.
  3. Added flexibility – we are working on adding functionality to handle simple deviations from the defaults, such as recording changes only to named files, or to all except some excluded files.

We’d be really keen to hear your suggestions too, so please let us know your ideas via the changes issue tracker!

I have only recently started using Git and GitHub, and this year’s rOpenSci ozunconf was a big eye-opener for me, in several ways. Beyond finally understanding the power of proper version control, I met a group of wonderful people dedicated to participating in the R community. Now as it turns out, R users take the word “community” very seriously. Each and every person I met during the event was open and friendly. Each person had ideas for attracting new users to R, making it easier to learn, making methods and data more readily available, and creating innovative new functionality. Even before the workshop began, dozens of ideas for advancement circulated on GitHub Issues. Throughout the conference, it was a pleasure to be a part of the ongoing conversation and dialogue about growing and improving the R community. That’s right, you can delete any lingering ‘introverted computer geek’ stereotypes you might still be harbouring in a cobwebbed attic of your mind. In today’s day and age, programming is as much about helping each other, communicating, learning, and networking as it is about solving problems. And building the community is a group effort.

R users come from all sorts of backgrounds, but I was gratified to see scientists and researchers well-represented at the unconference. Gone are the days when I need to feel like the ugly duckling for being the only R user in my biology lab! If you still find yourself isolated, join the blooming online R users community, or any one of a number of meetups and clubs that are popping up everywhere. I have dipped my toe in those waters, and boy am I glad I did!

Announcing a New rOpenSci Software Review Collaboration


rOpenSci is pleased to announce a new collaboration with Methods in Ecology and Evolution (MEE), a journal of the British Ecological Society, published by Wiley. Publications destined for MEE that include the development of a scientific R package will now have the option of a joint review process whereby the R package is reviewed by rOpenSci, followed by fast-tracked review of the manuscript by MEE. Authors opting for this process will be recognized via a mark on both web and print versions of their paper.

We are very excited for this partnership to improve the rigor of both scientific software and software publications and to provide greater recognition to developers in the fields of ecology and evolution. It is a natural outgrowth of our interest in supporting scientists in developing and maintaining software, and of MEE’s mission of vetting and disseminating tools and methods for the research community. The collaboration formalizes and eases a path already pursued by researchers: The rotl, RNeXML, and treebase packages were all developed or reviewed by rOpenSci and subsequently had associated manuscripts published in MEE.

About rOpenSci software review

rOpenSci is a diverse community of researchers from academia, non-profit, government, and industry who collaborate to develop and maintain tools and practices around open data and reproducible research. The rOpenSci suite of tools is made up of core infrastructure software developed and maintained by the project staff. The suite also contains numerous packages contributed by members of the broader R community. The volume of community submissions has grown considerably over the years, necessitating a formal system of review quite analogous to that of a peer-reviewed academic journal.

rOpenSci welcomes full software submissions that fit within our aims and scope, with the option of a pre-submission inquiry in cases when the scope of a submission is not immediately obvious. This software peer review framework, known as the rOpenSci Onboarding process, operates with three editors and one editor in chief who carefully vet all incoming submissions. After an editorial review, editors solicit detailed, public and signed reviews from two reviewers, and the path to acceptance from then on is similar to a standard journal review process. Details about the system are described in various blog posts by the editorial team.

Collaboration with journals

This is our second collaboration with a journal. Since late 2015, rOpenSci has partnered with the Journal of Open Source Software (JOSS), an open access journal that publishes brief articles on research software. Packages accepted to rOpenSci can be submitted for fast-track publication at JOSS, in which case JOSS editors may evaluate them based on rOpenSci’s reviews alone. As rOpenSci’s review criteria are significantly more stringent and designed to be compatible with JOSS, these packages are generally accepted without additional review. We have had great success with this partnership, which provides rOpenSci authors with an additional venue to publicize and archive their work. Given this success, we are keen on expanding to other journals and fields where there is potential for software reviewed and created by rOpenSci to play a significant role in supporting scientific findings.

The details

Our new partnership with MEE broadly resembles that with JOSS, with the major difference that MEE, rather than rOpenSci, leads review of the manuscript component. Authors with R packages and associated manuscripts that fit the Aims and Scope for both rOpenSci and MEE are encouraged to first submit to rOpenSci. The rotl, RNeXML, and treebase packages are all great examples of such packages. MEE editors may also refer authors to this option if authors submit an appropriate manuscript to MEE first.

On submission to rOpenSci, authors can use our updated submission template to choose MEE as a publication venue. Following acceptance by rOpenSci, the associated manuscript will be reviewed by an expedited process at MEE, with reviewers and editors having the knowledge that the software has already been reviewed and the public reviews available to them.

Should the manuscript be accepted, a footnote will appear in the web version and the first page of the print version of the MEE article indicating that the software as well as the manuscript has been peer-reviewed, with a link to the rOpenSci open reviews.

As with any collaboration, there may be a few hiccups early on and we welcome ideas to make the process more streamlined and efficient. We look forward to the community’s submissions and to your participation in this process.

Many thanks to MEE’s Assistant Editor Chris Grieves and Senior Editor Bob O’Hara for working with us on this collaboration.

The Value of Welcome, part 2: How to prepare 40 new community members for an unconference


I’ve raved about the value of extending a personalized welcome to new community members and I recently shared six tips for running a successful hackathon-flavoured unconference. Building on these, I’d like to share the specific approach and (free!) tools I used to help prepare new rOpenSci community members to be productive at our unconference. My approach was inspired directly by my AAAS Community Engagement Fellowship Program (AAAS-CEFP) training. Specifically, 1) one mentor said that the most successful conference they ever ran involved having one-to-one meetings with all participants prior to the event, and 2) prior to our in-person AAAS-CEFP training, we completed an intake questionnaire that forced us to consider things like “what do you hope to get out of this” and “what do you hope to contribute”.

A challenge of this year’s unconference was the fact that we were inviting 70 people to participate. As a rule, one third of the crowd would have participated in one of our previous unconferences and two thirds would be first-time participants. With only two days together, these people need to quickly self-sort into project groups and get working.

So I sent this email to 45 first-time participants:

[Screenshot: the pre-unconference welcome email sent to first-time participants]

Arranging meetings is one of my least favorite activities, but the free Calendly tool made this process relatively painless. When a person clicks on the calendar link in the email above, it reveals only times that I am available in my Google Calendar, the time slot they choose shows up in my calendar, and I receive a confirmation email indicating who booked a meeting with me. In my busiest week, I had 19 meetings, but that meant the bulk of them were done!

To make the meeting time most effective I followed AAAS-CEFP program director Lou Woodley’s model for onboarding our AAAS CEFP cohort by sending a set of questions to be answered in advance. I took the model to the next level by creating a free Google Form questionnaire so that all answers were automatically collected and could be viewed per individual or collectively, and automatically exported to a spreadsheet.

[Screenshot: the pre-unconference Google Form questionnaire]

Questions included:

  1. List three things you hope to get from the unconference

    • Examples: connect with people working in a similar domain, learn about best practices in data science with R, or develop a new package that does X
  2. List three things you hope to contribute to the unconference

    • Examples: expertise or experience in X, mentoring skills, write all the docs!
  3. Have you had any previous interactions with the rOpenSci community?

    • Examples: I read the rOpenSci blog, or I submitted software for rOpenSci open peer review
  4. Do you have any concerns about your readiness to participate?

    • Examples: I’ve never developed an R package or How do I decide what project to work on?
  5. Would you be interested in writing a blog post about your unconference project or your unconference experience?

  6. Do you have a preferred working style or anything you would like to let us know about how you work best?

    • Examples: I’m an introvert who likes to take my lunch alone sometimes to recover from group activities – it doesn’t mean I’m not having fun. or I know I have a tendency to dominate in group discussions – it’s totally fine to ask me to step back and let others contribute. I won’t be offended.

These questions encouraged participants to reflect in advance. The example answer snippets we provided gave them ideas from which to seed their answers and in some cases gave them permission to show some vulnerability. Individuals’ answers gave me cues for things to address in our chat and freed both of us to spend our time talking about the most important issues.

The answers to the question “List three things you hope to get from the unconf” were so heartening:

[Screenshot: participants’ answers to “List three things you hope to get from the unconference”]

Beautiful, but in a different way, were answers to the question, “Do you have any concerns about your readiness to participate?”. People expressed real concerns about impostor syndrome, their perceived ability to contribute “as much or as well” as others, and feeling “outclassed by all the geniuses present”. These responses prompted me to reassure people that they were 100% qualified to participate, and opened an opportunity to listen to and address specific concerns.

To conduct the pre-unconference chats, I used video conferencing via appear.in, a free, browser-based application that does not require plugins or user accounts. Rather than being exhausted from these calls, I felt energized and optimistic and I experienced many direct positive outcomes. These conversations enabled me to prime people to connect on day-one of the unconference with others with similar interests or from related work sectors. Frequently, I noticed that immediately after our conversation, first-time participants would join the online discussion of existing project ideas, or they themselves proposed new ideas. My conversations with two first-time participants led directly to their proposing community-focussed projects - a group discussion and a new blog series of interviews!

An unexpected benefit was that questions people asked me during the video chats led to actions I could take to improve the unconference. For example, when someone wanted to know what previous participants wished they knew beforehand, I asked for and shared example resources. One wise person asked me what my plan was for having project teams report out at the end of the unconference and this led directly to a streamlined plan (See Six tips for running a successful unconference).

Big thanks to AAAS CEFP training for giving me the confidence to try this community experiment! Arranging and carrying out these pre-unconference questionnaires and video chats took a big investment of my time and energy and yet, I consider this effort to be one of the biggest contributors to participants’ satisfaction with the unconference. Will 100% do this again!
