Takeaways from Anthropic's Course 1 - Anthropic API Fundamentals
Getting Started
Install the anthropic library:

```
pip install anthropic
```
After you get an API key from https://console.anthropic.com, use it to create the Anthropic client object, which is the main entry point for interacting with the API:

```python
from anthropic import Anthropic

client = Anthropic(api_key=my_api_key)
```
If you set the environment variable `ANTHROPIC_API_KEY`, you can just say `client = Anthropic()`.
Example of a request:

```python
our_first_message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1000,
    messages=[
        {"role": "user", "content": "Hi there! Please write me a haiku about a pet chicken"}
    ]
)

print(our_first_message.content[0].text)
```
Messages Format
We can use `client.messages.create()` to send a message to Claude and get a response. There is a certain format to the `messages` parameter.
The `messages` parameter expects a list of message dictionaries, where each dictionary represents a single message in the conversation.
Each message dictionary should have the following keys:
- `role`: A string indicating the role of the message sender. It can be either "user" (for messages sent by the user) or "assistant" (for messages sent by Claude).
- `content`: A string or list of content dictionaries representing the actual content of the message. If a string is provided, it will be treated as a single text content block. If a list of content dictionaries is provided, each dictionary should have a "type" (e.g., "text" or "image") and the corresponding content. For now, we'll leave `content` as a single string (see the sketch after this list for the block form).
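Purely as an illustration of the block form mentioned above, here is the same message written with `content` as a list of content blocks (the message text is invented for the example):

```python
# Equivalent message with content expressed as a list of content blocks.
# Only the "text" block type is shown here; image blocks have a similar
# shape but carry image data instead of text.
message_with_blocks = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Hello Claude! How are you today?"}
    ],
}
```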
Here's an example of a messages list with a single user message:

```python
messages = [
    {"role": "user", "content": "Hello Claude! How are you today?"}
]
```
And here's an example with multiple messages representing a conversation:

```python
messages = [
    {"role": "user", "content": "Hello Claude! How are you today?"},
    {"role": "assistant", "content": "Hello! I'm doing well, thank you. How can I assist you today?"},
    {"role": "user", "content": "Can you tell me a fun fact about ferrets?"},
    {"role": "assistant", "content": "Sure! Did you know that excited ferrets make a clucking vocalization known as 'dooking'?"},
]
```
Remember that messages always alternate between user and assistant roles and always start with a user message.
The Message object we receive contains a handful of properties:

In order to print the actual text content we use `.content[0].text`.

In addition to `content`, the Message object contains some other pieces of information:
- `id` - a unique object identifier
- `type` - The object type, which will always be "message"
- `role` - The conversational role of the generated message. This will always be "assistant".
- `model` - The model that handled the request and generated the response
- `stop_reason` - The reason why the model stopped generating its response
- `stop_sequence` - The sequence that caused the model to stop generating, if any
- `usage` - Information on billing and rate-limit usage, which contains:
  - `input_tokens` - The number of input tokens that were used
  - `output_tokens` - The number of output tokens that were used
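For example, a quick sketch of inspecting these fields on a response (reusing the `client` and haiku request from earlier):

```python
response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1000,
    messages=[{"role": "user", "content": "Hi there! Please write me a haiku about a pet chicken"}],
)

# Inspect the metadata alongside the generated text.
print(response.id)                   # unique object identifier
print(response.model)                # model that handled the request
print(response.stop_reason)          # e.g. "end_turn" or "max_tokens"
print(response.usage.input_tokens)   # tokens in the prompt
print(response.usage.output_tokens)  # tokens in the generated reply
print(response.content[0].text)      # the actual generated text
```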
It's important to know that we have access to these pieces of information, but if you only remember one thing, make it this: `content` contains the actual model-generated content.
Few-shot prompting
One of the most useful prompting strategies is called "few-shot prompting" which involves providing a model with a small number of **examples**. These examples help guide Claude's generated output. The messages conversation history is an easy way to provide examples to Claude.
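A minimal sketch of few-shot prompting via the message history (the task and example labels here are invented for illustration):

```python
# Few-shot sentiment classification: the earlier user/assistant turns act as
# examples showing Claude the exact output format we want.
few_shot_messages = [
    {"role": "user", "content": "Classify the sentiment: 'I love this product!'"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Classify the sentiment: 'The delivery was late and the box was damaged.'"},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Classify the sentiment: 'It works, I guess.'"},
]

response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=10,
    messages=few_shot_messages,
)
print(response.content[0].text)  # expected to follow the one-word pattern from the examples
```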
Using these principles we can create a multi-turn command-line chatbot. Here is my solution to the exercise:

```python
chat_log = [
    {"role": "user", "content": "Perform the role of a chatbot and respond to the user"},
    {"role": "assistant", "content": "Hello! How can I help you today?"},
]

print("Welcome to the chatbot! Type 'exit' to end the conversation.")
while True:
    chat_input = input("You: ")
    if chat_input.lower() == "exit":
        break  # stop before sending "exit" to the API
    chat_log.append({"role": "user", "content": chat_input})
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=500,
        messages=chat_log,
    )
    print("Bot: " + response.content[0].text)
    chat_log.append({"role": "assistant", "content": response.content[0].text})
print("Goodbye! Have a great day!")
```
Models
The Claude Python SDK supports multiple models, each with different capabilities and performance characteristics. The course includes a visualization comparing cost vs. speed across the Claude 3 and 3.5 models, showcasing the range of tradeoffs between cost, speed, and intelligence:

[Figure: cost vs. speed comparison across the Claude 3 and 3.5 models]
Anthropic generally recommends Claude 3.5 Sonnet for use cases involving:
- Coding: Claude 3.5 Sonnet writes, edits, and runs code autonomously, streamlining code translations for faster, more accurate updates and migrations.
- Customer support: Claude 3.5 Sonnet understands user context and orchestrates multi-step workflows, enabling 24/7 support, faster responses, and improved customer satisfaction.
- Data science & analysis: Claude 3.5 Sonnet navigates unstructured data, generates insights, and produces visualizations and predictions to enhance data science expertise.
- Visual processing: Claude 3.5 Sonnet excels at interpreting charts, graphs, and images, accurately transcribing text to derive insights beyond just the text alone.
- Writing: Claude 3.5 Sonnet represents a significant improvement in understanding nuance and humor, producing high-quality, authentic, and relatable content.
Picking a model
Copied from the Anthropic tutorial:
The next logical question is: which model should you use? It's a difficult question to answer without knowing the specific tasks and demands of a given application. The choice of model can significantly impact the performance, user experience, and cost-effectiveness of your application:
- Capabilities: The first and foremost consideration is whether the model possesses the necessary capabilities to handle the tasks and use cases specific to your application. Different models have varying levels of performance across different domains, such as general language understanding, task-specific knowledge, reasoning abilities, and generation quality. It's essential to align the model's strengths with the demands of your application to ensure optimal results.
- Speed: The speed at which a model can process and generate responses is another critical factor, particularly for applications that require real-time or near-real-time interactions. Faster models can provide a more responsive and seamless user experience, reducing latency and improving overall usability. However, it's important to strike a balance between speed and model capabilities, as the fastest model may not always be the most suitable for your specific needs.
- Cost: The cost associated with using a particular model is a practical consideration that can impact the viability and scalability of your application. Models with higher capabilities often come with a higher price tag, both in terms of API usage costs and computational resources required. It's crucial to assess the cost implications of different models and determine the most cost-effective option that still meets your application's requirements.
One approach: start with Haiku
When experimenting, we often recommend starting with the Haiku model. Haiku is a lightweight and fast model that can serve as an excellent starting point for many applications. Its speed and cost-effectiveness make it an attractive option for initial experimentation and prototyping. In many use cases, Haiku proves to be perfectly capable of generating high-quality responses that meet the needs of the application. By starting with Haiku, you can quickly iterate on your application, test different prompts and configurations, and gauge the model's performance without incurring significant costs or latency. If you are unhappy with the responses, it's easy to "upgrade" to a model like Claude 3.5 Sonnet.
Evaluating and upgrading
As you develop and refine your application, it's essential to set up a comprehensive suite of evaluations specific to your use case and prompts. These evaluations will serve as a benchmark to measure the performance of your chosen model and help you make informed decisions about potential upgrades.
If you find that Haiku's responses do not meet your application's requirements or if you desire higher levels of sophistication and accuracy, you can easily transition to more capable models like Sonnet or Opus. These models offer enhanced capabilities and can handle more complex tasks and nuanced language understanding.
By establishing a rigorous evaluation framework, you can objectively compare the performance of different models across your specific use case. This empirical evidence will guide your decision-making process and ensure that you select the model that best aligns with your application's needs.
Model Parameters
max_tokens
What is a token? Tokens are the small building blocks of a text sequence that an LLM processes, understands, and generates text with. For Claude, a token approximately represents 3.5 English characters, though the exact number can vary depending on the language used.
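As a rough back-of-the-envelope sketch (using the ~3.5 characters-per-token figure above rather than an exact tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate for English text, based on the
    ~3.5 characters per token rule of thumb."""
    return max(1, round(len(text) / 3.5))

print(estimate_tokens("Hi there! Please write me a haiku about a pet chicken"))  # ~15
```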
`max_tokens` controls the maximum number of tokens that Claude should generate in its response.

To find out why a model stopped generating, use `response.stop_reason`, where `response` is the Message object returned by the API.

It's important to note that the models don't "know" about `max_tokens` when generating content. Changing `max_tokens` won't alter how Claude generates the output; it just gives the model room to keep generating (with a high `max_tokens` value) or truncates the output (with a low `max_tokens` value).

To find the number of output tokens in a response, use `response.usage.output_tokens`.
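A small sketch of checking these fields after deliberately setting a low `max_tokens` (the prompt is arbitrary):

```python
response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=25,  # deliberately small so the output gets cut off
    messages=[{"role": "user", "content": "Write a short story about a lighthouse keeper."}],
)

print(response.stop_reason)          # "max_tokens" when the limit truncated the reply,
                                     # "end_turn" when Claude finished on its own
print(response.usage.output_tokens)  # will be at most 25 here
print(response.content[0].text)      # the (possibly truncated) story
```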
Why alter max tokens?
- API limits & Cost: The number of tokens in your input text and the generated response count towards the API usage limits. Each API request has a maximum limit on the number of tokens it can process. Being aware of tokens helps you stay within the API limits and manage your usage efficiently.
- Performance: The number of tokens Claude generates directly impacts the processing time and memory usage of the API. Longer input texts and higher max_tokens values require more computational resources. Understanding tokens helps you optimize your API requests for better performance.
- Response quality: Setting an appropriate max_tokens value ensures that the generated response is of sufficient length and contains the necessary information. If the max_tokens value is too low, the response may be truncated or incomplete. Experimenting with different max_tokens values can help you find the optimal balance for your specific use case.
Stop sequences
`stop_sequences` allows us to provide the model with a set of strings that, when encountered in the generated response, cause the generation to stop. They are essentially a way of telling Claude, "if you generate this sequence, stop generating anything else!". For example:

```python
response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=500,
    messages=[{"role": "user", "content": "Generate a JSON object representing a person with a name, email, and phone number."}],
    stop_sequences=["}"],
)

print(response.content[0].text)
```
Note that the stop sequence itself is not included in the generated response text.
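As a follow-up sketch, the response also reports which stop sequence (if any) was hit:

```python
# Continuing from the request above:
print(response.stop_reason)    # "stop_sequence" when a stop sequence ended generation
print(response.stop_sequence)  # the matched string, e.g. "}" (None if none fired)
```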
Temperature
The `temperature` parameter is used to control the "randomness" and "creativity" of the generated responses. It ranges from 0 to 1, with higher values resulting in more diverse and unpredictable responses with variations in phrasing. Lower temperatures can result in more deterministic outputs that stick to the most probable phrasing and answers. Temperature has a default value of 1.
When generating text, Claude predicts the probability distribution of the next token (word or subword). The temperature parameter is used to manipulate this probability distribution before sampling the next token. If the temperature is low (close to 0.0), the probability distribution becomes more peaked, with high probabilities assigned to the most likely tokens. This makes the model more deterministic and focused on the most probable or "safe" choices. If the temperature is high (closer to 1.0), the probability distribution becomes more flattened, with the probabilities of less likely tokens increasing. This makes the model more random and exploratory, allowing for more diverse and creative outputs.

We can see how the probability of the tokens other than possible token 3 increases with increasing temperature.
So use temperatures close to 0 for analytical tasks, and temperatures closer to 1.0 for creative and generative tasks.
With a temperature of 0, Claude is likely to generate very similar responses each time we run the API with the same message. While this setting doesn't guarantee completely identical results, it significantly increases the consistency of Claude's responses.
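A quick sketch comparing outputs at the two extremes (the prompt is arbitrary):

```python
def sample_at_temperature(temperature: float, n: int = 3) -> list[str]:
    """Ask the same question several times at a given temperature."""
    outputs = []
    for _ in range(n):
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=100,
            temperature=temperature,
            messages=[{"role": "user", "content": "Name a flavor of ice cream. Respond with one word only."}],
        )
        outputs.append(response.content[0].text)
    return outputs

print(sample_at_temperature(0.0))  # answers are typically near-identical
print(sample_at_temperature(1.0))  # answers tend to vary more between calls
```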
System Prompt
The system prompt, passed via the `system` parameter, is an optional parameter that you can include when sending messages. It provides high-level instructions, like a role definition or background information, that Claude should take into account in its responses. Key points about the system prompt:
- It's optional but can be useful for setting the tone and context of the conversation.
- It's applied at the conversation level, affecting all of Claude's responses in that exchange.
- It can help steer Claude's behavior without needing to include instructions in every user message.
Note - This is not a replacement for prompt content like detailed instructions, external input content, and examples. Those should go inside the first `user` message for better results. For the most part, only tone, context, and role content should go inside the system prompt. For example:
```python
def generate_questions(topic, num_questions=3):
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=500,
        system=f"You are an expert on {topic}. Generate thought-provoking questions about this topic.",
        messages=[
            {"role": "user", "content": f"Generate {num_questions} questions about {topic} as a numbered list."}
        ],
        stop_sequences=[f"{num_questions+1}."],
    )
    print(response.content[0].text)
```
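For instance, calling it with a made-up topic:

```python
generate_questions("the history of jazz")  # prints 3 numbered questions, stopping before "4."
```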
Streaming
Our approach so far waits for the entire content to be generated, which isn't ideal for user experience. Streaming allows us to display content as the model generates it, similar to how ChatGPT and claude.ai work.

How to use streaming?
Just set the parameter `stream=True` in the call to `client.messages.create()`:

```python
stream = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "Write me a 3 word sentence, without a preamble. Just give me 3 words",
        }
    ],
    model="claude-3-haiku-20240307",
    max_tokens=100,
    temperature=0,
    stream=True,
)
```
This returns a Stream object like `<anthropic.Stream at 0x114e51210>`.
On iterating through the object, you see something like this:

- MessageStartEvent - A message with empty content
- Series of content blocks - Each of which contains:
- A ContentBlockStartEvent
- One or more ContentBlockDeltaEvents
- A ContentBlockStopEvent
- One or more MessageDeltaEvents which indicate top-level changes to the final message
- A final MessageStopEvent
We can see that each streamed chunk can contain partial words and whitespace, so printing each chunk with `print()`'s defaults would put every piece on its own line. Two arguments fix this (see the sketch after this list):
- `end=""`: By default, the print() function adds a newline character (\n) at the end of the printed text. By setting end="", we specify that the printed text should not be followed by a newline character, so the next print() statement continues on the same line.
- `flush=True`: Forces the output to be immediately written to the console or standard output, without waiting for a newline character or the buffer to be filled. This ensures that the text is displayed in real-time as it is received from the streaming response.
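Putting it together, a sketch of a streaming print loop (this assumes the SDK's raw event shapes, where text arrives in content_block_delta events):

```python
stream = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    temperature=0,
    messages=[{"role": "user", "content": "Write me a 3 word sentence, without a preamble. Just give me 3 words"}],
    stream=True,
)

for event in stream:
    # Text chunks arrive in content_block_delta events; other event types
    # (message_start, message_delta, message_stop, ping, ...) carry metadata.
    if event.type == "content_block_delta":
        print(event.delta.text, end="", flush=True)
print()  # final newline once the stream is done
```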
The MessageStartEvent and the MessageDeltaEvents contain the input token and output token counts, respectively.
Other streaming events
- Ping events - streams may also include any number of ping events.
- Error events - you may occasionally see error events in the event stream. For example, during periods of high usage, you may receive an overloaded_error, which would normally correspond to an HTTP 529 in a non-streaming context.
Here's an example error event:

```
event: error
data: {"type": "error", "error": {"type": "overloaded_error", "message": "Overloaded"}}
```
TTFT - Time to first token
The main reason to use streaming is to reduce the time it takes for the user to receive the first bit of model-generated content.
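A rough sketch of measuring TTFT with the streaming loop above (the timing code is illustrative, not from the course):

```python
import time

start = time.perf_counter()
stream = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    messages=[{"role": "user", "content": "Write me a short poem about streaming APIs."}],
    stream=True,
)

first_token_time = None
for event in stream:
    if event.type == "content_block_delta":
        if first_token_time is None:
            first_token_time = time.perf_counter() - start  # time to first token
        print(event.delta.text, end="", flush=True)

print(f"\nTTFT: {first_token_time:.2f}s")
```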