Fly.io

Supported OS: Linux, Windows, macOS

Integration version: 2.0.1
This integration is in public beta. Use caution if enabling it on production workloads.

Overview

This check monitors Fly.io metrics through the Datadog Agent.

Setup

Follow the instructions below to install and configure this check for an Agent running on a Fly application.

Installation

The Fly.io check is included in the Datadog Agent package. We recommend running the Fly.io check on the Datadog Agent in a Fly.io application. The Agent collects Prometheus metrics as well as some additional data from the Machines API. Additionally, you can configure the Agent to receive traces and custom metrics from all of your Fly.io applications inside the organization.

Deploying the Agent as a Fly.io application

  1. Create a new Fly.io application that uses the Datadog Agent image, either by specifying the image at launch or by setting it in the fly.toml file:

    [build]
        image = 'gcr.io/datadoghq/agent:7'
    
  2. Set a secret for your Datadog API key called DD_API_KEY, and optionally your site as DD_SITE.

  3. In your app’s directory, create a conf.yaml file for the Fly.io integration, configure the integration, and mount it in the Agent’s conf.d/fly_io.d/ directory as conf.yaml:

    instances:
    - empty_default_hostname: true
      headers:
          Authorization: Bearer <YOUR_FLY_TOKEN>
      machines_api_endpoint: http://_api.internal:4280
      org_slug: <YOUR_ORG_SLUG>
    
  4. Deploy your app.

Note: To collect traces and custom metrics from your applications, see Application traces.
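
Before deploying, it can help to confirm that the token and organization slug from step 3 are accepted by the Machines API. The following is a minimal sketch, not the check's own code: it assumes the Machines API exposes a list-apps route at /v1/apps that takes an org_slug query parameter, and the function name build_list_apps_request is hypothetical. It only builds the request, so it can be inspected without network access.

```python
import urllib.parse
import urllib.request


def build_list_apps_request(endpoint: str, org_slug: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request listing the apps in an organization.

    Assumption: the Machines API serves GET /v1/apps?org_slug=<slug>.
    The endpoint and Bearer header mirror the conf.yaml values above.
    """
    query = urllib.parse.urlencode({"org_slug": org_slug})
    url = f"{endpoint.rstrip('/')}/v1/apps?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


# Usage from a machine inside the Fly.io private network (placeholders as in conf.yaml):
#   req = build_list_apps_request("http://_api.internal:4280", "<YOUR_ORG_SLUG>", "<YOUR_FLY_TOKEN>")
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       print(resp.status)
```

A 200 response suggests the credentials are usable; an authorization error at this stage usually points to the token or slug rather than to the Agent configuration.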

Configuration

  1. Edit the fly_io.d/conf.yaml file, located in the conf.d/ folder at the root of your Agent’s configuration directory, to start collecting your Fly.io performance data. See the sample fly_io.d/conf.yaml for all available configuration options.

  2. Restart the Agent.

Validation

Run the Agent’s status subcommand and look for fly_io under the Checks section.

Data Collected

Metrics

fly_io.app.concurrency
(gauge)
fly_io.app.connect_time.bucket
(count)
Shown as second
fly_io.app.connect_time.count
(count)
fly_io.app.connect_time.sum
(count)
Shown as second
fly_io.app.count
(gauge)
Count of apps
fly_io.app.http_response_time.bucket
(count)
Shown as second
fly_io.app.http_response_time.count
(count)
fly_io.app.http_response_time.sum
(count)
Shown as second
fly_io.app.http_responses.count
(gauge)
Shown as response
fly_io.app.tcp_connects.count
(gauge)
fly_io.app.tcp_disconnects.count
(gauge)
fly_io.edge.data_in
(gauge)
Shown as byte
fly_io.edge.data_out
(gauge)
Shown as byte
fly_io.edge.http_response_time.bucket
(count)
Shown as second
fly_io.edge.http_response_time.count
(count)
fly_io.edge.http_response_time.sum
(count)
Shown as second
fly_io.edge.http_responses.count
(gauge)
Shown as response
fly_io.edge.tcp_connects.count
(gauge)
fly_io.edge.tcp_disconnects.count
(gauge)
fly_io.edge.tls_handshake_errors
(gauge)
Shown as error
fly_io.edge.tls_handshake_time.bucket
(count)
Shown as second
fly_io.edge.tls_handshake_time.count
(count)
fly_io.edge.tls_handshake_time.sum
(count)
Shown as second
fly_io.instance.cpu.count
(count)
The amount of time each CPU (cpu_id) has spent performing different kinds of work (mode) in centiseconds
fly_io.instance.disk.io_in_progress
(gauge)
Incremented as requests are given to appropriate struct request_queue and decremented as they finish.
fly_io.instance.disk.reads_completed.count
(count)
This is the total number of reads completed successfully.
fly_io.instance.disk.reads_merged.count
(count)
Reads and writes which are adjacent to each other may be merged for efficiency. This field lets you know how often this was done.
fly_io.instance.disk.sectors_read.count
(count)
This is the total number of sectors read successfully.
fly_io.instance.disk.sectors_written.count
(count)
This is the total number of sectors written successfully.
fly_io.instance.disk.time_io.count
(count)
Counts jiffies when at least one request was started or completed. If a request runs for more than 2 jiffies, some I/O time might not be accounted for when requests are concurrent.
Shown as millisecond
fly_io.instance.disk.time_io_weighted.count
(count)
Incremented at each I/O start, I/O completion, I/O merge, or read of these stats by the number of I/Os in progress (field 9) times the number of milliseconds spent doing I/O since the last update of this field.
Shown as millisecond
fly_io.instance.disk.time_reading.count
(count)
This is the total number of milliseconds spent by all reads.
Shown as millisecond
fly_io.instance.disk.time_writing.count
(count)
This is the total number of milliseconds spent by all writes.
Shown as millisecond
fly_io.instance.disk.writes_completed.count
(count)
This is the total number of writes completed successfully.
fly_io.instance.disk.writes_merged.count
(count)
Reads and writes which are adjacent to each other may be merged for efficiency. This field lets you know how often this was done.
fly_io.instance.filefd.allocated
(gauge)
Number of allocated file descriptors
fly_io.instance.filefd.max
(gauge)
Number of maximum file descriptors
fly_io.instance.filesystem.block_size
(gauge)
File system block size.
fly_io.instance.filesystem.blocks
(gauge)
Total number of blocks on file system
fly_io.instance.filesystem.blocks_avail
(gauge)
Total number of available blocks.
fly_io.instance.filesystem.blocks_free
(gauge)
Total number of free blocks.
fly_io.instance.load.avg
(gauge)
System load average measuring the number of processes in the system run queue, with samples representing averages over 1, 5, and 15 minutes.
Shown as process
fly_io.instance.memory.active
(gauge)
Memory that has been used more recently and usually not reclaimed unless absolutely necessary.
Shown as byte
fly_io.instance.memory.buffers
(gauge)
Relatively temporary storage for raw disk blocks
Shown as byte
fly_io.instance.memory.cached
(gauge)
In-memory cache for files read from the disk (the pagecache) as well as tmpfs & shmem. Doesn't include SwapCached.
Shown as byte
fly_io.instance.memory.dirty
(gauge)
Memory which is waiting to get written back to the disk
Shown as byte
fly_io.instance.memory.inactive
(gauge)
Memory which has been less recently used. It is more eligible to be reclaimed for other purposes
Shown as byte
fly_io.instance.memory.mem_available
(gauge)
An estimate of how much memory is available for starting new applications, without swapping.
Shown as byte
fly_io.instance.memory.mem_free
(gauge)
Total free RAM.
Shown as byte
fly_io.instance.memory.mem_total
(gauge)
Total usable RAM (i.e. physical RAM minus a few reserved bits and the kernel binary code)
Shown as byte
fly_io.instance.memory.pressure_full
(gauge)
Memory pressure for all processes
fly_io.instance.memory.pressure_some
(gauge)
Memory pressure for at least one process
fly_io.instance.memory.shmem
(gauge)
Total memory used by shared memory (shmem) and tmpfs
Shown as byte
fly_io.instance.memory.slab
(gauge)
In-kernel data structures cache
Shown as byte
fly_io.instance.memory.swap_cached
(gauge)
Memory that once was swapped out, is swapped back in but still also is in the swapfile
Shown as byte
fly_io.instance.memory.swap_free
(gauge)
Memory which has been evicted from RAM, and is temporarily on the disk
Shown as byte
fly_io.instance.memory.swap_total
(gauge)
Total amount of swap space available
Shown as byte
fly_io.instance.memory.vmalloc_chunk
(gauge)
Largest contiguous block of vmalloc area which is free
Shown as byte
fly_io.instance.memory.vmalloc_total
(gauge)
Total size of vmalloc virtual address space
Shown as byte
fly_io.instance.memory.vmalloc_used
(gauge)
Amount of vmalloc area which is used
Shown as byte
fly_io.instance.memory.writeback
(gauge)
Memory which is actively being written back to the disk
Shown as byte
fly_io.instance.net.recv_bytes.count
(count)
Number of good bytes received by the interface.
Shown as byte
fly_io.instance.net.recv_compressed.count
(count)
Number of correctly received compressed packets.
fly_io.instance.net.recv_drop.count
(count)
Number of packets received but not processed, e.g. due to lack of resources or unsupported protocol.
Shown as packet
fly_io.instance.net.recv_errs.count
(count)
Total number of bad packets received on this network device.
Shown as packet
fly_io.instance.net.recv_fifo.count
(count)
Receiver FIFO overflow event counter.
fly_io.instance.net.recv_frame.count
(count)
Receiver frame alignment errors.
fly_io.instance.net.recv_multicast.count
(count)
Multicast packets received.
Shown as packet
fly_io.instance.net.recv_packets.count
(count)
Number of good packets received by the interface.
Shown as packet
fly_io.instance.net.sent_bytes.count
(count)
Number of good transmitted bytes.
Shown as byte
fly_io.instance.net.sent_carrier.count
(count)
Number of frame transmission errors due to loss of carrier during transmission.
fly_io.instance.net.sent_colls.count
(count)
Number of collisions during packet transmissions.
fly_io.instance.net.sent_compressed.count
(count)
Number of transmitted compressed packets.
fly_io.instance.net.sent_drop.count
(count)
Number of packets dropped on their way to transmission, e.g. due to lack of resources.
Shown as packet
fly_io.instance.net.sent_errs.count
(count)
Total number of transmit problems.
fly_io.instance.net.sent_fifo.count
(count)
Sent FIFO overflow event counter.
fly_io.instance.net.sent_packets.count
(count)
Number of packets successfully transmitted.
Shown as packet
fly_io.instance.up
(gauge)
Reports 1 if the VM is reporting correctly
fly_io.instance.volume.size
(gauge)
Volume size in bytes.
Shown as byte
fly_io.instance.volume.used
(gauge)
Percentage of volume used.
Shown as byte
fly_io.machine.count
(gauge)
Count of running machines
fly_io.machine.cpus.count
(gauge)
Number of CPUs
fly_io.machine.gpus.count
(gauge)
Number of GPUs
fly_io.machine.memory
(gauge)
Memory of a machine
Shown as megabyte
fly_io.machine.swap_size
(gauge)
Swap space to reserve for the Fly Machine
Shown as megabyte
fly_io.machines_api.up
(gauge)
Whether the check can access the machines API or not
fly_io.pg.database.size
(gauge)
Database size
Shown as byte
fly_io.pg.replication.lag
(gauge)
Replication lag
fly_io.pg_stat.activity.count
(gauge)
Number of connections in this state
fly_io.pg_stat.activity.max_tx_duration
(gauge)
Maximum duration in seconds any active transaction has been running
Shown as second
fly_io.pg_stat.archiver.archived_count
(gauge)
Number of WAL files that have been successfully archived
fly_io.pg_stat.archiver.failed_count
(gauge)
Number of failed attempts for archiving WAL files
fly_io.pg_stat.bgwriter.buffers_alloc
(gauge)
Number of buffers allocated
fly_io.pg_stat.bgwriter.buffers_backend
(gauge)
Number of buffers written directly by a backend
fly_io.pg_stat.bgwriter.buffers_backend_fsync
(gauge)
Number of times a backend had to execute its own fsync call (normally the background writer handles those even when the backend does its own write)
fly_io.pg_stat.bgwriter.buffers_checkpoint
(gauge)
Number of buffers written during checkpoints
fly_io.pg_stat.bgwriter.buffers_clean
(gauge)
Number of buffers written by the background writer
fly_io.pg_stat.bgwriter.checkpoint_sync_time
(gauge)
Total amount of time that has been spent in the portion of checkpoint processing where files are synchronized to disk, in milliseconds
Shown as millisecond
fly_io.pg_stat.bgwriter.checkpoint_write_time
(gauge)
Total amount of time that has been spent in the portion of checkpoint processing where files are written to disk, in milliseconds
Shown as millisecond
fly_io.pg_stat.bgwriter.checkpoints_req
(gauge)
Number of requested checkpoints that have been performed
fly_io.pg_stat.bgwriter.checkpoints_timed
(gauge)
Number of scheduled checkpoints that have been performed
fly_io.pg_stat.bgwriter.maxwritten_clean
(gauge)
Number of times the background writer stopped a cleaning scan because it had written too many buffers
fly_io.pg_stat.bgwriter.stats_reset
(gauge)
Time at which these statistics were last reset
fly_io.pg_stat.database.blk_read_time
(gauge)
Time spent reading data file blocks by backends in this database, in milliseconds
Shown as millisecond
fly_io.pg_stat.database.blk_write_time
(gauge)
Time spent writing data file blocks by backends in this database, in milliseconds
Shown as millisecond
fly_io.pg_stat.database.blks_hit
(gauge)
Number of times disk blocks were found already in the buffer cache, so that a read was not necessary (this only includes hits in the PostgreSQL buffer cache, not the operating system's file system cache)
fly_io.pg_stat.database.blks_read
(gauge)
Number of disk blocks read in this database
fly_io.pg_stat.database.conflicts
(gauge)
Number of queries canceled due to conflicts with recovery in this database. Conflicts occur only on standby servers
fly_io.pg_stat.database.conflicts_confl_bufferpin
(gauge)
Number of queries in this database that have been canceled due to pinned buffers
fly_io.pg_stat.database.conflicts_confl_deadlock
(gauge)
Number of queries in this database that have been canceled due to deadlocks
fly_io.pg_stat.database.conflicts_confl_lock
(gauge)
Number of queries in this database that have been canceled due to lock timeouts
fly_io.pg_stat.database.conflicts_confl_snapshot
(gauge)
Number of queries in this database that have been canceled due to old snapshots
fly_io.pg_stat.database.conflicts_confl_tablespace
(gauge)
Number of queries in this database that have been canceled due to dropped tablespaces
fly_io.pg_stat.database.deadlocks
(gauge)
Number of deadlocks detected in this database
fly_io.pg_stat.database.numbackends
(gauge)
Number of backends currently connected to this database. This is the only column in this view that returns a value reflecting current state; all other columns return the accumulated values since the last reset.
fly_io.pg_stat.database.stats_reset
(gauge)
Time at which these statistics were last reset
fly_io.pg_stat.database.tup_deleted
(gauge)
Number of rows deleted by queries in this database
fly_io.pg_stat.database.tup_fetched
(gauge)
Number of rows fetched by queries in this database
fly_io.pg_stat.database.tup_inserted
(gauge)
Number of rows inserted by queries in this database
fly_io.pg_stat.database.tup_returned
(gauge)
Number of rows returned by queries in this database
fly_io.pg_stat.database.tup_updated
(gauge)
Number of rows updated by queries in this database
fly_io.pg_stat.database.xact_commit
(gauge)
Number of transactions in this database that have been committed
fly_io.pg_stat.database.xact_rollback
(gauge)
Number of transactions in this database that have been rolled back
fly_io.pg_stat.replication.pg_current_wal_lsn_bytes
(gauge)
WAL position in bytes
Shown as byte
fly_io.pg_stat.replication.pg_wal_lsn_diff
(gauge)
Lag in bytes between master and slave
Shown as byte
fly_io.pg_stat.replication.reply_time
(gauge)
Send time of last reply message received from standby server
fly_io.volume.block_size
(gauge)
The size of each memory block in bytes
Shown as byte
fly_io.volume.blocks.count
(gauge)
The total number of blocks in the volume
fly_io.volume.blocks_avail
(gauge)
The number of blocks available for data in the volume
fly_io.volume.blocks_free
(gauge)
The total number of blocks free for data and root user ops
fly_io.volume.created
(gauge)
Whether the volume has been created or not
fly_io.volume.encrypted
(gauge)
Whether the volume is encrypted or not
fly_io.volume.size
(gauge)
The size of the volume in GB
Shown as gigabyte
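
Several of the metrics above are most useful in combination. The sketch below shows two derived values under stated assumptions: volume capacity in bytes from the fly_io.volume.block_size, blocks.count, blocks_free, and blocks_avail metrics, and a mean duration from a histogram's .sum and .count values taken over the same interval. The function names are hypothetical, and the pairing of .sum with .count follows the usual Prometheus histogram convention rather than anything stated on this page.

```python
from typing import Optional


def volume_bytes(block_size: float, blocks: float, blocks_free: float, blocks_avail: float) -> dict:
    """Convert block counts into byte totals for one volume."""
    total = block_size * blocks
    used = block_size * (blocks - blocks_free)
    available = block_size * blocks_avail  # blocks usable for data
    return {
        "total": total,
        "used": used,
        "available": available,
        "used_fraction": used / total if total else 0.0,
    }


def mean_seconds(sum_delta: float, count_delta: float) -> Optional[float]:
    """Mean observation over an interval, e.g. http_response_time.sum / .count."""
    if not count_delta:
        return None  # no observations in the interval
    return sum_delta / count_delta
```

For example, a volume reporting 1000 blocks of 4096 bytes with 250 free would be 75% used.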

Events

The Fly.io integration does not include any events.

Service Checks

The Fly.io integration does not include any service checks.

Application traces

Follow these steps to collect traces for an application in your Fly.io environment.

  1. Instrument your application.

  2. Deploy the Datadog Agent as a Fly.io application.

  3. Set the required environment variables in the fly.toml or Dockerfile of your application and deploy the app.

    Set the following as an environment variable to submit metrics to the Datadog Agent application:

    [env]
        DD_AGENT_HOST="<YOUR_AGENT_APP_NAME>.internal"
    

    Set the following environment variable to ensure you report the same host for logs and metrics:

    DD_TRACE_REPORT_HOSTNAME="true"
    

    To utilize unified service tagging, set these environment variables:

    DD_SERVICE="APP_NAME"
    DD_ENV="ENV_NAME"
    DD_VERSION="VERSION"
    

    To correlate logs and traces, follow the log and trace correlation steps for your language and set this environment variable:

    DD_LOGS_INJECTION="true"
    
  4. Set the following environment variables in your Datadog Agent application’s fly.toml and deploy the app:

    [env]
        DD_APM_ENABLED = "true"
        DD_APM_NON_LOCAL_TRAFFIC = "true"
        DD_DOGSTATSD_NON_LOCAL_TRAFFIC = "true"
        DD_BIND_HOST = "fly-global-services"
    

Note: Ensure that the settings on your Fly.io instances do not publicly expose the ports for APM and DogStatsD, if enabled.
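
With DD_DOGSTATSD_NON_LOCAL_TRAFFIC enabled on the Agent application, other applications in the organization can submit custom metrics to it. The following is a minimal sketch of that flow using only the standard library; in practice an official DogStatsD client would normally be used. It assumes the default DogStatsD UDP port 8125, reads DD_AGENT_HOST as set in step 3, and uses a hypothetical metric name. Fly.io private addresses are IPv6, so the socket family is taken from the address lookup instead of being hard-coded.

```python
import os
import socket


def unified_tags(env=None):
    """Build service/env/version tags from the unified service tagging variables."""
    env = os.environ if env is None else env
    mapping = {"DD_SERVICE": "service", "DD_ENV": "env", "DD_VERSION": "version"}
    return [f"{tag}:{env[var]}" for var, tag in mapping.items() if env.get(var)]


def format_gauge(name, value, tags=()):
    """Format a DogStatsD gauge datagram: <name>:<value>|g|#<tag>,<tag>."""
    datagram = f"{name}:{value}|g"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_gauge(name, value, tags=(), host=None, port=8125):
    """Send one gauge over UDP to the Agent application (port 8125 is an assumption)."""
    host = host or os.environ.get("DD_AGENT_HOST", "localhost")
    payload = format_gauge(name, value, tags).encode("utf-8")
    # Let getaddrinfo choose IPv4 or IPv6 for the resolved address.
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_DGRAM
    )[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.sendto(payload, sockaddr)


# Usage inside an instrumented app (metric name is hypothetical):
#   send_gauge("checkout.queue_depth", 3, unified_tags())
```

UDP delivery is fire-and-forget, so a successful call does not confirm that the Agent received the metric; check the Agent's status output for DogStatsD packet counts.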

Log collection

Use the fly_logs_shipper to collect logs from your Fly.io applications.

  1. Clone the logs shipper project.

  2. Modify the vector-configs/vector.toml file to set the logs source (ddsource) for your Fly.io logs:

    [transforms.log_json]
    type = "remap"
    inputs = ["nats"]
    source  = '''
    . = parse_json!(.message)
    .ddsource = "fly-io"
    .host = .fly.app.instance
    .env = "<YOUR_ENV_NAME>"
    '''
    

This configuration parses basic Fly-specific log attributes. To fully parse all log attributes, set ddsource to a known logs integration on a per-app basis using Vector transforms.

  3. Set secrets for NATS: ORG and ACCESS_TOKEN.

  4. Set secrets for Datadog: DATADOG_API_KEY and DATADOG_SITE.

  5. Deploy the logs shipper app.
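
To make the effect of the remap transform in step 2 concrete, here is a rough Python equivalent of what it does to each event. This is an illustration only, not part of the shipper: the function name remap and the sample event shape are assumptions based on the fields the transform reads.

```python
import json


def remap(event: dict, env_name: str) -> dict:
    """Mimic the log_json transform on a single event."""
    record = json.loads(event["message"])  # . = parse_json!(.message)
    record["ddsource"] = "fly-io"           # .ddsource
    # .host = .fly.app.instance (None when the path is missing)
    record["host"] = record.get("fly", {}).get("app", {}).get("instance")
    record["env"] = env_name                # .env
    return record


# Example event, shaped like a NATS message whose payload is a JSON string:
sample = {"message": json.dumps({"message": "hi", "fly": {"app": {"name": "web", "instance": "abc123"}}})}
result = remap(sample, "prod")
```

The parsed payload replaces the whole event, so any fields outside the original message string are dropped unless the transform copies them explicitly.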

Troubleshooting

Need help? Contact Datadog support.
