Generative AI for R programmers

Melissa Van Bussel

Outline

  • Part 1: Introduction to ChatGPT

  • Part 2: Using ChatGPT as an R programmer

  • Part 3: GitHub CoPilot

Part 1: Introduction to ChatGPT

What is ChatGPT?

  • The easiest way to answer this question is to just ask ChatGPT!
  • Note: Each time you ask this question, you’ll receive a different answer.

I’m giving a presentation right now about ChatGPT. The audience is primarily composed of R programmers, many of whom have a university-level background in Statistics. Can you explain what ChatGPT is and how it works, in a way that’s appropriate for this audience?

How does ChatGPT work?

The easiest way to understand is to break down the name “ChatGPT”.

  • Generative: Can generate new text from scratch
  • Pretrained: A Large Language Model (LLM) that’s been trained on 45 terabytes of data, including Wikipedia, books, and webpages
  • Transformer: A type of neural network
  • Chat: ChatGPT is a specific version of the GPT models created by OpenAI, but tailored for chatting (trained on both text data and conversational data)

What is ChatGPT good at?

  • Creative writing (emails, documentation, cover letters)

Important

You should only use ChatGPT to create a first draft of a piece of writing.

What is ChatGPT good at?

In this workshop, we will explore the world of generative AI with a focus on two cutting-edge tools: ChatGPT and GitHub CoPilot. Designed for R programmers, we’ll dive into the mechanics of these tools, discuss their respective strengths and weaknesses, and delve into the ethical considerations that arise from their use. The workshop will also include a live demo of using ChatGPT and GitHub CoPilot in R, covering several different approaches and highlighting best practices.

What is ChatGPT good at?

In this workshop, we’ll explore the world of generative AI with a focus on two innovative tools: ChatGPT and GitHub CoPilot. We’ll discuss the mechanics of these tools, dive into their respective strengths and weaknesses, and explore the ethical considerations that arise from their use. We’ll then demo how to use ChatGPT and GitHub CoPilot as an R programmer, covering several different approaches and highlighting best practices.

What is ChatGPT good at?

  • Writing code

But again, you should only use it to create a first draft.

What is ChatGPT good at?

The errors are usually fairly easy to fix. Here, ChatGPT passed the unquoted column name species to aes_string(), which expects strings:

ggplot(penguins, aes_string(
  x = input$x_axis,
  y = input$y_axis,
  color = species
)) +
  geom_point(size = 3) +
  labs(
    x = input$x_axis, 
    y = input$y_axis, 
    title = "Palmer Penguins"
  )

What is ChatGPT good at?

The errors are usually fairly easy to fix. Here, the fix is to move the color mapping into aes() inside geom_point():

ggplot(penguins, aes_string(
  x = input$x_axis,
  y = input$y_axis
)) +
  geom_point(aes(color = species), size = 3) +
  labs(
    x = input$x_axis, 
    y = input$y_axis, 
    title = "Palmer Penguins"
  )
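Note that aes_string() is deprecated in recent versions of ggplot2. A minimal sketch of the same plot using the .data pronoun instead might look like the following. The plain list named input is a stand-in (an assumption for illustration) for the input object that Shiny would provide, and the two column choices are hypothetical.

```r
library(ggplot2)
library(palmerpenguins)

# Stand-in for Shiny's `input` (hypothetical selections, for illustration only)
input <- list(x_axis = "bill_length_mm", y_axis = "bill_depth_mm")

# .data[[...]] looks up a column by its name stored as a string,
# so no string-based aes_string() call is needed
p <- ggplot(penguins, aes(
  x = .data[[input$x_axis]],
  y = .data[[input$y_axis]]
)) +
  geom_point(aes(color = species), size = 3) +
  labs(
    x = input$x_axis,
    y = input$y_axis,
    title = "Palmer Penguins"
  )

p
```

Inside a Shiny app, the same ggplot() call would sit inside renderPlot() and use the real input object.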

What is ChatGPT good at?

  • Teaching you new concepts
  • Helping you become a better programmer

What is ChatGPT bad at?

Basically, anything that requires critical thinking or analytical reasoning skills.

ChatGPT is often incorrect, and confident in its incorrect answers.

What word is missing from this sequence?

  • inch
  • chapel
  • elongate
  • [??????]
  • amaze
  • zebra
  • radius
  • user

What word is missing from this sequence?

  • inch
  • chapel
  • elongate
  • team
  • amaze
  • zebra
  • radius
  • user
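The slides do not spell out the rule behind this puzzle. One reading that fits the answer (an assumption, not the author's stated rule) is that each word begins with the last two letters of the word before it. A short sketch in base R to check a candidate against that reading:

```r
# Does each word start with the last two letters of the word before it?
follows_chain <- function(words) {
  previous <- head(words, -1)
  following <- tail(words, -1)
  endings <- substr(previous, nchar(previous) - 1, nchar(previous))
  beginnings <- substr(following, 1, 2)
  all(endings == beginnings)
}

# Build the full sequence with a candidate word in the missing slot
sequence_with <- function(candidate) {
  c("inch", "chapel", "elongate", candidate, "amaze", "zebra", "radius", "user")
}

follows_chain(sequence_with("team"))  # TRUE
follows_chain(sequence_with("tar"))   # FALSE
```

Under this reading, "team" is consistent with every neighbouring word, while a candidate such as "tar" breaks the chain on both sides.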

An example of ChatGPT being incorrectly confident

The biggest problem here is not that ChatGPT is incorrect, but how confident it is in its incorrect answer.



“Although most people talk about machine learning’s ability to predict the future, what it really does is predict the past.”

- Ben Green

Asking ChatGPT unethical questions

  • Baked in “safety” – ChatGPT will tell you if a prompt is unethical




But there are ways to “jailbreak” ChatGPT, or bypass its filtering system…

Trusting OpenAI

  • Even if ChatGPT were able to recognize unethical prompts 100% of the time, we would still have to trust OpenAI

Trusting OpenAI

  • Remember that ChatGPT is Pretrained, so the model can’t update itself based on input the user provides
  • That being said, OpenAI will continue to create new and improved models
  • It’s stated directly in their Terms of Service that OpenAI may use your provided data to “provide and maintain the Services”.

The Unethical way that ChatGPT was trained to be Ethical

  • Because the GPT-3 model had been trained using publicly available data, it was prone to saying some pretty toxic things
  • To fix this, OpenAI needed to feed the model labelled examples of hate speech, violence, abuse, etc.
  • The workers who labelled this data were paid between $1.32 and $2.00 per hour
  • All four workers interviewed by TIME said they were “mentally scarred by the work”

Generation of false information

  • If you ask ChatGPT to provide references or citations for information that it provides to you, it will gladly do so
  • The only problem is that these citations will be made up
  • Remember: ChatGPT is Pretrained, so it can’t give you recent news sources!

This is especially concerning given that we know how confident ChatGPT is in its incorrect information.

Stealing content from creators

  • There have been class-action lawsuits filed against OpenAI and companies like it for essentially stealing the work of others without the “3 C’s”:
  1. Consent (for their work to be included in the training data)
  2. Compensation (for their work being used to train the model)
  3. Credit (for when the model outputs results based off the creator’s original content)

Environmental impact of ChatGPT

It’s difficult to determine the environmental cost of ChatGPT, though people have tried to estimate this:

  • A conversation of 20-50 questions and answers with ChatGPT is estimated to consume about 500 ml of water (one small bottle)



“Any tool can be used for good or bad. It’s really the ethics of the artist using it.”

- John Knoll

Best practices as an R programmer

  • Always use as a first draft, whether it’s generated text or generated code
  • Check your work
  • Details are your friend
  • Think about word limits
  • Remember that ChatGPT remembers what you say!

Part 2: Using ChatGPT as an R programmer

Three main approaches

  1. Browser-based version
  2. Direct interfacing with OpenAI API
  3. Using GPTStudio

No. 2 and No. 3 require an OpenAI API key…

Understanding usage and pricing

The first step to understanding pricing for the OpenAI API is understanding “tokens”:

Multiple models, each with different capabilities and price points. Prices are per 1,000 tokens. You can think of tokens as pieces of words, where 1,000 tokens is about 750 words. This paragraph is 35 tokens.
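The rule of thumb above (1,000 tokens is about 750 words) can be turned into a rough estimator. This is a sketch under that approximation only: real token counts depend on the model's tokenizer, and the price used in the example is a made-up placeholder, not an actual OpenAI price.

```r
# Rough token estimate from a word count, using "1,000 tokens ~ 750 words"
estimate_tokens <- function(text) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  n_words <- sum(nzchar(words))
  ceiling(n_words / 0.75)
}

# Cost of a given number of tokens when prices are quoted per 1,000 tokens
estimate_cost <- function(n_tokens, price_per_1k) {
  n_tokens / 1000 * price_per_1k
}

prompt <- "Write an R function that reads a CSV file and returns a tidy data frame."
n_tokens <- estimate_tokens(prompt)

# 0.002 is a hypothetical price per 1,000 tokens, for illustration only
estimate_cost(n_tokens, price_per_1k = 0.002)
```

Keep in mind that both the prompt and the model's answer count towards the tokens you are billed for.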

Understanding usage and pricing

  • When you sign up, you’ll get $5.00 worth of credit (expires after 3 months)
  • Go to Billing > Usage limits to set “hard” and “soft” spending caps per month – you won’t accidentally spend too much!
  • Recommendation: Use the $5.00 credit for a project and then monitor your usage throughout the project

Browser-based version

To use the browser-based version of ChatGPT, go to https://chat.openai.com/ and log in or create an account.


Pros:

  • Doesn’t cost you any money (exception: GPT-4)
  • Is easy to use
  • Provides an aesthetically pleasing experience

Browser-based version

Cons:

  • Lots of manual copy-pasting, and…

Using the OpenAI API directly

You can use R to access ChatGPT through the OpenAI API. This is great when…


  • You get the “ChatGPT is at capacity right now” message and you don’t mind paying money (or using your free trial) in order to avoid this
  • You want more control over the answers that ChatGPT provides (tuning parameters, or different models)
  • You want to incorporate ChatGPT into a Shiny app

Creating your OpenAI API key

  • Go to Personal > View API keys
  • Create new secret key (make sure you save it)

Defining your OPENAI_API_KEY environment variable

To define in current R session:

Sys.setenv(OPENAI_API_KEY = "PASTE KEY HERE")

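To define for an R project (add a line of the form OPENAI_API_KEY=... to the file that opens, then restart R):

usethis::edit_r_environ(scope = "project")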

To define everywhere:

On Windows: Edit the system environment variables > Environment Variables > New
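Whichever option you choose, it can help to confirm that R actually sees the key before making any API calls. A minimal sketch in base R; the helper names has_openai_key() and require_openai_key() are made up for this example.

```r
# TRUE if the OPENAI_API_KEY environment variable is set and non-empty
has_openai_key <- function() {
  nzchar(Sys.getenv("OPENAI_API_KEY"))
}

# Stop early with a readable message instead of a failed API request
require_openai_key <- function() {
  if (!has_openai_key()) {
    stop("OPENAI_API_KEY is not set. Define it, then restart R.", call. = FALSE)
  }
  invisible(TRUE)
}

if (has_openai_key()) {
  message("OPENAI_API_KEY found.")
} else {
  message("OPENAI_API_KEY is not set in this session.")
}
```

Avoid printing the key itself, and keep it out of scripts that you share or commit to version control.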

The openai package

  • There’s a package called openai that contains a wrapper for interacting with OpenAI’s models in R.

  • Before you can use it, make sure you’ve set your OPENAI_API_KEY environment variable

library(openai)

Chatting

create_chat_completion(
  model = "gpt-3.5-turbo",
  messages = list(
    list(
      "role" = "user",
      "content" = "What is the meaning of life?"
    )
  )
)
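Through the API, the model only sees the messages you send with each request, so a conversation has to be carried along by your own code. A sketch of one way to do that is shown here. The add_message() helper and the example prompts are made up for illustration, and the API call only runs if a key is set and the openai package is installed.

```r
# Append one message to a conversation history (a list of role/content pairs)
add_message <- function(history, role, content) {
  stopifnot(role %in% c("system", "user", "assistant"))
  c(history, list(list(role = role, content = content)))
}

history <- list()
history <- add_message(history, "system", "You are a helpful assistant for R programmers.")
history <- add_message(history, "user", "How do I read a CSV file in R?")

# Only call the API when a key is available and the openai package is installed
if (nzchar(Sys.getenv("OPENAI_API_KEY")) && requireNamespace("openai", quietly = TRUE)) {
  response <- openai::create_chat_completion(
    model = "gpt-3.5-turbo",
    messages = history,
    temperature = 0.2  # lower values give less varied answers
  )
  str(response$choices)  # inspect the structure to find the reply text
}
```

To continue the conversation, add the model's reply to history with the "assistant" role, add your next question with the "user" role, and send the whole list again.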

Creating images

response <- create_image(
  prompt = "A white siamese cat",
  n = 1,
  size = "1024x1024"
)
response$data$url

Creating transcriptions

# Create transcription
my_transcription <- create_transcription(
  file = "my_video.mp4",
  model = "whisper-1"
)

# Extract results
my_transcription$text

Creating translations

# Create translation
my_translation <- create_translation(
  file = "my_video.mp4",
  model = "whisper-1"
)

# Extract results
my_translation$text

Creating your own custom models

  • You can create (for example) a custom classification model using your OWN data (use the create_fine_tune() function)

I must say that I am fairly disappointed by this “horror” movie. I did not get scared even once while watching it. It also is not very suspenseful either…. I was able to guess the ending half way through the movie… So.. what’s left?

“The Ring” is a trully scary movie… I wish other movies would stop copying from it (e.g. the trade-mark: long hair). Please give me some originality.

Will not recommend this movie.
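Before a review like this can be used for fine-tuning, it has to be paired with a label and written to a file. The sketch below assumes the prompt/completion JSON Lines layout used by OpenAI's original fine-tuning endpoint; check the current documentation for the format your model expects. The labels and the second review are invented placeholders, and the upload and create_fine_tune() steps are left out because they need an API key.

```r
library(jsonlite)

# One row per labelled example
training <- data.frame(
  prompt = c(
    "I did not get scared even once while watching it.",   # from the review above
    "A wonderful film that I would happily watch again."   # invented placeholder
  ),
  completion = c("negative", "positive"),                   # hypothetical labels
  stringsAsFactors = FALSE
)

# Convert each row to a single line of JSON
jsonl_lines <- vapply(
  seq_len(nrow(training)),
  function(i) as.character(toJSON(as.list(training[i, ]), auto_unbox = TRUE)),
  character(1)
)

# Write the lines to a temporary .jsonl file and show the result
training_file <- tempfile(fileext = ".jsonl")
writeLines(jsonl_lines, training_file)
readLines(training_file)
```

A real training set would need far more than two examples.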

gptstudio

  • gptstudio is an R package that contains add-ins that let you work with the openai package through a point-and-click system in RStudio.

  • Before installing gptstudio, make sure you have the most up-to-date version of RStudio installed.

install.packages("gptstudio")
  • Once installed, close and restart RStudio. You should then be able to use gptstudio by going to Addins > GPTStudio in RStudio.

gptstudio: Spelling and Grammar

  • With the gptstudio package, you can easily check your spelling and grammar in RStudio. Simply highlight the text you want to check, and then go to Addins > GPTStudio > Spelling and Grammar.

  • Spelling and grammar will automatically be corrected in-line, and the results will also be printed to the console.

  • Additionally, gptstudio will tell you how many tokens you used to perform this action.

gptstudio: Turning comments into code

With the gptstudio package, you can quickly turn your comments into code. You can use this feature in the same way that you used the “Spelling and grammar” feature.

Note

There used to be a “Write code from prompt” option, but this is no longer available. Instead, use “Comment your code”.

Example

# Create a scatterplot of the palmerpenguins dataset with bill length along the x-axis and bill depth along the y-axis, using ggplot

becomes…

# load the palmerpenguins package
library(palmerpenguins)

# load the ggplot2 package
library(ggplot2)

# create a scatterplot of the palmerpenguins dataset with bill length along the x-axis 
# and bill depth along the y-axis, using ggplot 
ggplot(data = penguins, 
       aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point()

gptstudio: Using ChatGPT in the RStudio Viewer

With the gptstudio package, you can also use ChatGPT as a Shiny app.

Tip

Before using this feature, make sure all your packages are updated.

To use ChatGPT, go to Addins > GPTStudio > ChatGPT.

GPT-4

  • More creative, accepts visual input, can create longer outputs

  • Performs better on standardized tests

  • Less “jailbreaking”

  • 2 options: ChatGPT Plus ($20 USD/month) or use the API directly ($$$)

GPT-4

Still not perfect, though:

“What word is missing from this sequence?”

The missing word is “tar”. The words in the sequence are in alphabetical order, but not based on the entire word. Instead, the order is based on the second, third and fourth letters in each word. So the correct sequence would be: inch, chapel, elongate, tar, amaze, zebra, radius, user.

How much was spent today?

To get a better sense of how expensive it was to create this presentation, we can check my API usage for the month of September.

Note

Remember, this only includes:

  • Using the ChatGPT Shiny app that I had created
  • Using gptstudio from within RStudio

Anything done in the browser-based version was free.

Part 3: GitHub CoPilot

What is CoPilot?

GitHub CoPilot is a collaboration between GitHub and OpenAI. It generates code suggestions in real-time, directly within your IDE. It can do things like…

  • Convert comments to code
  • Automatically suggest what code should come next
  • Show you code alternatives

CoPilot has a flat rate fee of $10/month for individuals.

Why should you care?

  • CoPilot is specifically designed to assist with writing code, whereas ChatGPT is a general-purpose language model
  • CoPilot’s training data consists of open source code from GitHub, so the model has a better understanding of code conventions
  • The integration with IDEs solves the “copy paste” problem with ChatGPT

R is to Python as CoPilot is to ChatGPT

IDE integration

Unfortunately, there’s no CoPilot integration in RStudio. You’ll need to use one of the following:

  • Visual Studio Code (VS Code)
  • Visual Studio
  • A compatible JetBrains IDE
  • Neovim

Keyboard Shortcuts for GitHub CoPilot (on Windows)


Action                             Shortcut
Accept an inline suggestion        Tab
Dismiss an inline suggestion       Esc
Show next inline suggestion        Alt+]
Show previous inline suggestion    Alt+[
Trigger inline suggestion          Alt+\

GitHub CoPilot Demo

Thank You!

@ggnot2

melissavanbussel.com

@melvanbussel

@melissavanbussel

github.com/melissavanbussel