Generative AI for R programmers

Melissa Van Bussel

Outline

Part 1: Introduction to ChatGPT
Part 2: Using ChatGPT as an R programmer
Part 3: GitHub CoPilot

Part 1: Introduction to ChatGPT

What is ChatGPT?

The easiest way to answer this question is to just ask ChatGPT!
Note: Each time you ask this question, you’ll receive a different answer.

I’m giving a presentation right now about ChatGPT. The audience is primarily composed of R programmers, many of whom have a university-level background in Statistics. Can you explain what ChatGPT is and how it works, in a way that’s appropriate for this audience?

How does ChatGPT work?

The easiest way to understand is to break down the name “ChatGPT”.

Generative: Can generate new text from scratch
Pretrained: A Large Language Model (LLM) that’s been trained on 45 terabytes of data, including Wikipedia, books, and webpages
Transformer: A type of neural network
Chat: ChatGPT is a specific version of the GPT models created by OpenAI, but tailored for chatting (trained on both text data and conversational data)
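For a Statistics audience, "generative" can be pictured as repeated sampling from a probability distribution over the next word. The sketch below is a toy illustration only: the three candidate words and their probabilities are made up, whereas a real model computes probabilities over a very large vocabulary with a neural network. It also shows why asking the same question twice can give different answers.

```r
# Toy illustration only: these probabilities are invented for the example.
# A real LLM computes them with a neural network over a huge vocabulary.
next_word_probs <- c(programmers = 0.5, users = 0.3, packages = 0.2)

generate_next_word <- function(probs) {
  # Draw one word at random, weighted by its probability
  sample(names(probs), size = 1, prob = probs)
}

prompt <- "This workshop is for R"

# Running the same "prompt" twice can produce different continuations
paste(prompt, generate_next_word(next_word_probs))
paste(prompt, generate_next_word(next_word_probs))
```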

What is ChatGPT good at?

Creative writing (emails, documentation, cover letters)

Important

You should only use ChatGPT to create a first draft of a piece of writing.

What is ChatGPT good at?

First draft:

In this workshop, we will explore the world of generative AI with a focus on two cutting-edge tools: ChatGPT and GitHub CoPilot. Designed for R programmers, we’ll dive into the mechanics of these tools, discuss their respective strengths and weaknesses, and delve into the ethical considerations that arise from their use. The workshop will also include a live demo of using ChatGPT and GitHub CoPilot in R, covering several different approaches and highlighting best practices.

Edited version:

In this workshop, we’ll explore the world of generative AI with a focus on two innovative tools: ChatGPT and GitHub CoPilot. We’ll discuss the mechanics of these tools, dive into their respective strengths and weaknesses, and explore the ethical considerations that arise from their use. We’ll then demo how to use ChatGPT and GitHub CoPilot as an R programmer, covering several different approaches and highlighting best practices.

What is ChatGPT good at?

Writing code

What is ChatGPT good at?

But again, you should only use it to create a first draft.

What is ChatGPT good at?

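The generated code often contains errors. Here, `species` is passed to aes_string() as a bare name rather than as a string, so R cannot find it: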

ggplot(penguins, aes_string(
  x = input$x_axis,
  y = input$y_axis,
  color = species
)) +
  geom_point(size = 3) +
  labs(
    x = input$x_axis, 
    y = input$y_axis, 
    title = "Palmer Penguins"
  )

What is ChatGPT good at?

The errors are usually fairly easy to fix. Here, the `species` mapping moves into a regular aes() call inside geom_point():

ggplot(penguins, aes_string(
  x = input$x_axis,
  y = input$y_axis
)) +
  geom_point(aes(color = species), size = 3) +
  labs(
    x = input$x_axis, 
    y = input$y_axis, 
    title = "Palmer Penguins"
  )
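Note that aes_string() is soft-deprecated in recent versions of ggplot2. A minimal sketch of the same plot using the `.data` pronoun instead, assuming the ggplot2 and palmerpenguins packages are installed. The `input` list is a hypothetical stand-in for Shiny's `input` object so that the code runs outside of an app:

```r
library(ggplot2)
library(palmerpenguins)

# Hypothetical stand-in for Shiny's `input` object, so the example
# runs outside of a Shiny app
input <- list(x_axis = "bill_length_mm", y_axis = "bill_depth_mm")

p <- ggplot(penguins, aes(
  x = .data[[input$x_axis]],  # look up the column by its name (a string)
  y = .data[[input$y_axis]]
)) +
  geom_point(aes(color = species), size = 3) +
  labs(
    x = input$x_axis,
    y = input$y_axis,
    title = "Palmer Penguins"
  )

p
```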

What is ChatGPT good at?

Teaching you new concepts

Helping you become a better programmer

What is ChatGPT bad at?

Basically, anything that requires critical thinking or analytical reasoning skills.

ChatGPT is often incorrect, and confident in its incorrect answers.

What word is missing from this sequence?

inch
chapel
elongate
[??????]
amaze
zebra
radius
user

What word is missing from this sequence?

inch
chapel
elongate
team
amaze
zebra
radius
user
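The rule behind the sequence is that each word begins with the last two letters of the word before it. A short sketch that checks this rule; the function names are hypothetical, chosen for this example. It confirms that "team" fits, and that "tar" (the answer GPT-4 gives later in this deck) does not:

```r
# Check that each word starts with the last two letters of the word before it
follows_chain <- function(words) {
  n <- length(words)
  previous <- words[-n]    # every word except the last
  following <- words[-1]   # every word except the first
  endings <- substr(previous, nchar(previous) - 1, nchar(previous))
  beginnings <- substr(following, 1, 2)
  all(endings == beginnings)
}

# Build the full sequence with a candidate word in the missing slot
sequence_with <- function(candidate) {
  c("inch", "chapel", "elongate", candidate,
    "amaze", "zebra", "radius", "user")
}

follows_chain(sequence_with("team"))  # TRUE
follows_chain(sequence_with("tar"))   # FALSE
```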

An example of ChatGPT being incorrectly confident

The biggest problem here is not that ChatGPT is incorrect, but the level to which it is confident in its incorrect answer.

“Although most people talk about machine learning’s ability to predict the future, what it really does is predict the past.”

- Ben Green

Asking ChatGPT unethical questions

Baked in “safety” – ChatGPT will tell you if a prompt is unethical

But there are ways to “jailbreak” ChatGPT, or bypass its filtering system…

Trusting OpenAI

Even if ChatGPT were able to recognize unethical prompts 100% of the time, we would still have to trust OpenAI.

Remember that ChatGPT is Pretrained, so the model can’t update itself based on input the user provides

That being said, OpenAI will continue to create new and improved models

It’s stated directly in their Terms of Service that OpenAI may use your provided data to “provide and maintain the Services”.

The Unethical way that ChatGPT was trained to be Ethical

Because the GPT-3 model had been trained on publicly available data, it was prone to saying some pretty toxic things.
To teach it to recognize toxic content, OpenAI needed to feed the model labelled examples of hate speech, violence, abuse, etc.
The workers who labelled this data were paid between $1.32 and $2.00 per hour.
All 4 employees interviewed by TIME said they were “mentally scarred by the work”.

Generation of false information

If you ask ChatGPT to provide references or citations for information that it provides to you, it will gladly do so

The only problem is that these citations will be made up

Remember: ChatGPT is Pretrained, so it can’t provide you with recent news sources!

This is especially concerning given that we know how confident ChatGPT is in its incorrect information.

Stealing content from creators

There have been class-action lawsuits filed against OpenAI and companies like it for essentially stealing the work of others without the “3 C’s”:

Consent (for their work to be included in the training data)

Compensation (for their work being used to train the model)

Credit (for when the model outputs results based off the creator’s original content)

Environmental impact of ChatGPT

It’s difficult to determine the environmental cost of ChatGPT, though people have tried to estimate this:

Training GPT-3 is estimated to have a water footprint of around 3.5 million litres of freshwater (an Olympic swimming pool holds about 2.5 million litres)

A 20-50 question/answer conversation with ChatGPT is estimated to consume a 500 ml bottle of water

Daily carbon footprint of 23.04 kg CO$_2$ emissions (average Canadian = 18.72 kg CO$_2$ emissions per year)

“Any tool can be used for good or bad. It’s really the ethics of the artist using it.”

- John Knoll

Best practices as an R programmer

Always use as a first draft, whether it’s generated text or generated code

Check your work

Details are your friend

Think about word limits

Remember that ChatGPT remembers what you say!

Now that we’re done talking about ethical considerations, it’s time for the fun part of the workshop where we talk about using these tools as an R programmer specifically.

Details: Be as specific as possible. Tell ChatGPT the intended audience or the persona it should take on, the desired length and format (e.g., social media post, table), and what information it should include and exclude.

Word limits: The length of your input is limited by the model you choose, and the price increases accordingly. gpt-3.5-turbo accepts up to 4,096 tokens at $0.002 per 1,000 tokens; gpt-4 can go up to 32k tokens, but at $0.06 per 1,000 tokens (30x the price). When using the API, the reply counts toward the same token limit, so a very long prompt leaves less room for the answer.

Remember: Chain multiple prompts together to get the best result, and tell ChatGPT when it’s wrong. If the conversation has gone badly off track, the accumulated context may be working against you, and it’s better to start fresh. You can also show it samples of your own writing so that its output sounds more like you.

Part 2: Using ChatGPT as an R programmer

Three main approaches

Browser-based version

Direct interfacing with OpenAI API

Using GPTStudio

No. 2 and No. 3 require an OpenAI API key…

Understanding usage and pricing

The first step to understanding pricing for the OpenAI API is understanding “tokens”:

Multiple models, each with different capabilities and price points. Prices are per 1,000 tokens. You can think of tokens as pieces of words, where 1,000 tokens is about 750 words. This paragraph is 35 tokens.
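To make the pricing concrete, here is a rough cost estimator built from the rule of thumb above (1,000 tokens is about 750 words). The function names are hypothetical, and the per-1,000-token prices are the ones quoted earlier in this workshop; treat them as assumptions that may be out of date.

```r
# Rule of thumb from above: 1,000 tokens is about 750 words
estimate_tokens <- function(n_words) {
  ceiling(n_words * 1000 / 750)
}

# Prices in USD per 1,000 tokens, as quoted earlier in this workshop.
# These are assumptions and may be out of date.
price_per_1k <- c("gpt-3.5-turbo" = 0.002, "gpt-4" = 0.06)

estimate_cost <- function(n_words, model) {
  estimate_tokens(n_words) / 1000 * price_per_1k[[model]]
}

estimate_tokens(1500)                 # 2000 tokens
estimate_cost(1500, "gpt-3.5-turbo")  # 0.004
estimate_cost(1500, "gpt-4")          # 0.12
```

The true token count depends on the tokenizer, so this is only a planning estimate; the usage page on your OpenAI account shows what was actually billed.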

Understanding usage and pricing

When you sign up, you’ll get $5.00 worth of credit (expires after 3 months)

Go to Billing > Usage limits to set “hard” and “soft” spending caps per month – you won’t accidentally spend too much!

Recommendation: Use the $5.00 credit for a project and then monitor your usage throughout the project

Browser-based version

To use the browser-based version of ChatGPT, go to https://chat.openai.com/ and log in or create an account.

Pros:

Doesn’t cost you any money (exception: GPT-4)

Is easy to use

Provides an aesthetically pleasing experience

Browser-based version

Cons:

Lots of manual copy-pasting, and…

Using the OpenAI API directly

You can use R to access ChatGPT through the OpenAI API. This is great when…

You get the “ChatGPT is at capacity right now” message and you don’t mind paying money (or using your free trial) in order to avoid this

You want more control over the answers that ChatGPT provides (tuning parameters, or different models)

You want to incorporate ChatGPT into a Shiny app

Creating your OpenAI API key

Go to platform.openai.com and create an account

Go to Personal > View API keys

Create new secret key (make sure you save it)

Defining your `OPENAI_API_KEY` environment variable

To define in current R session:

Sys.setenv(OPENAI_API_KEY = "PASTE KEY HERE")

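To define for an R project (add a line of the form OPENAI_API_KEY=your-key to the .Renviron file that opens, then restart R):

usethis::edit_r_environ(scope = "project")

To define everywhere (on Windows):

Edit the system environment variables > Environment Variables > New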
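Whichever approach you use, it is worth confirming that R can actually see the key before calling the API. A minimal sketch using base R only; `has_openai_key()` is a hypothetical helper name:

```r
# Returns TRUE if the OPENAI_API_KEY environment variable is set and non-empty
has_openai_key <- function() {
  nzchar(Sys.getenv("OPENAI_API_KEY"))
}

if (has_openai_key()) {
  message("OPENAI_API_KEY is set.")
} else {
  message("OPENAI_API_KEY is not set. Define it before calling the API.")
}
```

This checks only that the variable exists, not that the key is valid.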

The `openai` package

There’s a package called openai that contains a wrapper for interacting with OpenAI’s models in R.
Before you can use it, make sure you’ve set your OPENAI_API_KEY environment variable

library(openai)

Chatting

create_chat_completion(
  model = "gpt-3.5-turbo",
  messages = list(
    list(
      "role" = "user",
      "content" = "What is the meaning of life?"
    )
  )
)
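The `messages` argument is a list of role/content pairs, so a conversation can be built up one message at a time. The sketch below uses a hypothetical helper, `add_message()`, and passes `temperature` as an example of a tuning parameter. The API call only runs if the openai package is installed and a key is set.

```r
# Hypothetical helper: append one message to a conversation
add_message <- function(messages, role, content) {
  c(messages, list(list(role = role, content = content)))
}

messages <- list()
messages <- add_message(
  messages, "system", "You are a helpful assistant for R programmers."
)
messages <- add_message(
  messages, "user", "What does the pipe operator do?"
)

# Only call the API if the package is installed and a key is available
if (requireNamespace("openai", quietly = TRUE) &&
    nzchar(Sys.getenv("OPENAI_API_KEY"))) {
  response <- openai::create_chat_completion(
    model = "gpt-3.5-turbo",
    messages = messages,
    temperature = 0  # lower values give less varied answers
  )
  # Inspect the top-level structure of the response
  str(response, max.level = 1)
}
```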

Creating images

response <- create_image(
  prompt = "A white siamese cat",
  n = 1,
  size = "1024x1024"
)
response$data$url

Creating transcriptions

# Create transcription
my_transcription <- create_transcription(
  file = "my_video.mp4",
  model = "whisper-1"
)

# Extract results
my_transcription$text

Creating translations

# Create translation
my_translation <- create_translation(
  file = "my_video.mp4",
  model = "whisper-1"
)

# Extract results
my_translation$text

Creating your own custom models

You can create (for example) a custom classification model using your OWN data (use the create_fine_tune() function)

I must say that I am fairly disappointed by this “horror” movie. I did not get scared even once while watching it. It also is not very suspenseful either…. I was able to guess the ending half way through the movie… So.. what’s left?

“The Ring” is a trully scary movie… I wish other movies would stop copying from it (e.g. the trade-mark: long hair). Please give me some originality.

Will not recommend this movie.
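Before fine-tuning, the labelled examples need to be saved to a file. The sketch below assumes the prompt/completion JSONL layout (one JSON object per line) used by OpenAI's original fine-tuning endpoint, and assumes the jsonlite package is installed. The second review and both labels are made up for illustration, and the upload and create_fine_tune() steps are not shown.

```r
library(jsonlite)

# Hypothetical labelled examples; in practice these would be your own data
training_data <- data.frame(
  prompt = c(
    "Will not recommend this movie.",
    "A wonderful film with a great cast."
  ),
  completion = c("negative", "positive"),
  stringsAsFactors = FALSE
)

# Write one JSON object per line (JSONL) to a temporary file
training_file <- tempfile(fileext = ".jsonl")
stream_out(training_data, file(training_file), verbose = FALSE)

# Inspect the result
readLines(training_file)
```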

`gptstudio`

gptstudio is an R package that contains add-ins that let you work with the openai package through a point-and-click system in RStudio.
Before installing gptstudio, make sure you have the most up-to-date version of RStudio installed.

install.packages("gptstudio")

Once installed, close and restart RStudio, and then you should be able to use gptstudio by going to Addins > GPTSTUDIO in RStudio.

`gptstudio`: Spelling and Grammar

With the gptstudio package, you can easily check your spelling and grammar in RStudio. Simply highlight the text you want to check, and then go to Addins > GPTStudio > Spelling and Grammar.
Spelling and grammar will automatically be corrected in-line, and the results will also be printed to the console.
Additionally, gptstudio will tell you how many tokens you used to perform this action.

`gptstudio`: Turning comments into code

With the gptstudio package, you can quickly turn your comments into code. You can use this feature in the same way that you used the “Spelling and grammar” feature.

Note

There used to be a “Write code from prompt” option, but this is no longer available. Instead, use “Comment your code”.

Example

# Create a scatterplot of the palmerpenguins dataset with bill length along the x-axis and bill depth along the y-axis, using ggplot

becomes…

# load the palmerpenguins package
library(palmerpenguins)

# load the ggplot2 package
library(ggplot2)

# create a scatterplot of the palmerpenguins dataset with bill length along the x-axis 
# and bill depth along the y-axis, using ggplot 
ggplot(data = penguins, 
       aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point()

`gptstudio`: Using ChatGPT in the RStudio Viewer

With the gptstudio package, you can also use ChatGPT as a Shiny app.

Tip

Before using this feature, make sure all your packages are updated.

To use ChatGPT, go to Addins > GPTStudio > ChatGPT.

GPT-4

More creative, accepts visual input, can create longer outputs
Performs better on standardized tests
Less “jailbreaking”
2 options: ChatGPT Plus ($20 USD/month) or use the API directly ($$$)

GPT-4

Still not perfect, though:

“What word is missing from this sequence?”

The missing word is “tar”. The words in the sequence are in alphabetical order, but not based on the entire word. Instead, the order is based on the second, third and fourth letters in each word. So the correct sequence would be: inch, chapel, elongate, tar, amaze, zebra, radius, user.

How much was spent today?

To get a better sense of how expensive it was to create this presentation, we can check my API usage for the month of September.

Note

Remember, this only includes:

Using the ChatGPT Shiny app that I had created
Using gptstudio from within RStudio

Anything done in the browser-based version was free.

Part 3: GitHub CoPilot

What is CoPilot?

GitHub CoPilot is a collaboration between GitHub and OpenAI. It generates code suggestions in real-time, directly within your IDE. It can do things like…

Convert comments to code

Automatically suggest what code should come next

Show you code alternatives

CoPilot has a flat rate fee of $10/month for individuals.

Why should you care?

CoPilot is specifically designed to assist with writing code, whereas ChatGPT is a general-purpose language model

CoPilot’s training data is comprised of open source code from GitHub, so the model has a better understanding of code conventions

The integration with IDEs solves the “copy paste” problem with ChatGPT

R is to Python as CoPilot is to ChatGPT

IDE integration

Unfortunately, there’s no CoPilot integration in RStudio. You’ll need to use one of the following:

Visual Studio Code (VS Code)
Visual Studio
A compatible JetBrains IDE
Neovim

Keyboard Shortcuts for GitHub CoPilot (on Windows)

Action	Shortcut
Accept an inline suggestion	`Tab`
Dismiss an inline suggestion	`Esc`
Show next inline suggestion	`Alt+]`
Show previous inline suggestion	`Alt+[`
Trigger inline suggestion	`Alt+\`

GitHub CoPilot Demo

References

Thank You!

github.com/melissavanbussel
