Supported OS: Linux, Windows, macOS

Integration version: 2.1.0
This integration is in public beta. Use caution if you enable it on production workloads.

Overview

This check monitors Fly.io through the Datadog Agent.

Setup

Follow the instructions below to install and configure this check for an Agent running on a Fly application.

Installation

The Fly.io check is included in the Datadog Agent package. We recommend deploying a dedicated Fly.io application to run the Datadog Agent. This Agent can run the Fly.io check, which collects Prometheus metrics as well as some additional data from the Machines API. You can also configure the Agent to receive traces and custom metrics from all of your Fly.io applications within the organization.

Deploy the Agent as a Fly.io application

  1. Create a new application in Fly.io with the image set to the Datadog Agent at launch, or provide the image in the fly.toml file:

    [build]
        image = 'gcr.io/datadoghq/agent:7'
    
  2. Set a secret named DD_API_KEY for your Datadog API key, and optionally set your site as DD_SITE.

  3. In your application's directory, create a conf.yaml file for the Fly.io integration, configure the integration, and mount it in the Agent's conf.d/fly_io.d/ directory as conf.yaml:

    instances:
    - empty_default_hostname: true
      headers:
          Authorization: Bearer <YOUR_FLY_TOKEN>
      machines_api_endpoint: http://_api.internal:4280
      org_slug: <YOUR_ORG_SLUG>
    
  4. Deploy your application.

Note: To collect traces and custom metrics from your applications, see Application traces.
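The image and mount steps above can be combined in a single fly.toml for the Agent application. The following is a minimal sketch, not a definitive configuration: the `[[files]]` table and the `/etc/datadog-agent` path are assumptions (this page only says to mount the file under conf.d/fly_io.d/), and `<YOUR_AGENT_APP_NAME>` is a placeholder.

```toml
# fly.toml for the dedicated Datadog Agent application (sketch).
# <YOUR_AGENT_APP_NAME> is a placeholder, not a value from this page.
app = '<YOUR_AGENT_APP_NAME>'

[build]
  image = 'gcr.io/datadoghq/agent:7'

# Assumption: the Agent's configuration directory in the Linux image is
# /etc/datadog-agent, and a [[files]] entry is used to place the local
# conf.yaml from step 3 into conf.d/fly_io.d/.
[[files]]
  guest_path = '/etc/datadog-agent/conf.d/fly_io.d/conf.yaml'
  local_path = 'conf.yaml'

# DD_API_KEY (and optionally DD_SITE) are set as secrets in step 2,
# so they are intentionally not listed in this file.
```

Keeping the API key in a secret rather than in an `[env]` table avoids committing it alongside the rest of the configuration.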

Configuration

  1. Edit the fly_io.d/conf.yaml file, located in the conf.d/ folder at the root of your Agent's configuration directory, to start collecting your Fly.io performance data. See the sample fly_io.d/conf.yaml for all available configuration options.

  2. Restart the Agent.

Validation

Run the Agent's status subcommand and look for fly_io under the Checks section.

Data collected

Metrics

fly_io.app.concurrency
(gauge)
fly_io.app.connect_time.bucket
(count)

Shown as second
fly_io.app.connect_time.count
(count)
fly_io.app.connect_time.sum
(count)

Shown as second
fly_io.app.count
(gauge)
Count of apps
fly_io.app.http_response_time.bucket
(count)

Shown as second
fly_io.app.http_response_time.count
(count)
fly_io.app.http_response_time.sum
(count)

Shown as second
fly_io.app.http_responses.count
(gauge)

Shown as response
fly_io.app.tcp_connects.count
(gauge)
fly_io.app.tcp_disconnects.count
(gauge)
fly_io.edge.data_in
(gauge)

Shown as byte
fly_io.edge.data_out
(gauge)

Shown as byte
fly_io.edge.http_response_time.bucket
(count)

Shown as second
fly_io.edge.http_response_time.count
(count)
fly_io.edge.http_response_time.sum
(count)

Shown as second
fly_io.edge.http_responses.count
(gauge)

Shown as response
fly_io.edge.tcp_connects.count
(gauge)
fly_io.edge.tcp_disconnects.count
(gauge)
fly_io.edge.tls_handshake_errors
(gauge)

Shown as error
fly_io.edge.tls_handshake_time.bucket
(count)

Shown as second
fly_io.edge.tls_handshake_time.count
(count)
fly_io.edge.tls_handshake_time.sum
(count)

Shown as second
fly_io.instance.cpu.count
(count)
The amount of time each CPU (cpu_id) has spent performing different kinds of work (mode) in centiseconds
fly_io.instance.disk.io_in_progress
(gauge)
Incremented as requests are given to the appropriate struct request_queue and decremented as they finish.
fly_io.instance.disk.reads_completed.count
(count)
This is the total number of reads completed successfully.
fly_io.instance.disk.reads_merged.count
(count)
Reads and writes which are adjacent to each other may be merged for efficiency. This field lets you know how often this was done.
fly_io.instance.disk.sectors_read.count
(count)
This is the total number of sectors read successfully.
fly_io.instance.disk.sectors_written.count
(count)
This is the total number of sectors written successfully.
fly_io.instance.disk.time_io.count
(count)
Counts jiffies when at least one request was started or completed. If a request runs for more than 2 jiffies, some I/O time might not be accounted for when requests are concurrent.
Shown as millisecond
fly_io.instance.disk.time_io_weighted.count
(count)
Incremented at each I/O start, I/O completion, I/O merge, or read of these stats by the number of I/Os in progress (field 9) times the number of milliseconds spent doing I/O since the last update of this field.
Shown as millisecond
fly_io.instance.disk.time_reading.count
(count)
This is the total number of milliseconds spent by all reads.
Shown as millisecond
fly_io.instance.disk.time_writing.count
(count)
This is the total number of milliseconds spent by all writes.
Shown as millisecond
fly_io.instance.disk.writes_completed.count
(count)
This is the total number of writes completed successfully.
fly_io.instance.disk.writes_merged.count
(count)
Reads and writes which are adjacent to each other may be merged for efficiency. This field lets you know how often this was done.
fly_io.instance.filefd.allocated
(gauge)
Number of allocated file descriptors
fly_io.instance.filefd.max
(gauge)
Number of maximum file descriptors
fly_io.instance.filesystem.block_size
(gauge)
File system block size.
fly_io.instance.filesystem.blocks
(gauge)
Total number of blocks on file system
fly_io.instance.filesystem.blocks_avail
(gauge)
Total number of available blocks.
fly_io.instance.filesystem.blocks_free
(gauge)
Total number of free blocks.
fly_io.instance.load.avg
(gauge)
System load average measuring the number of processes in the system run queue, with samples representing averages over 1, 5, and 15 minutes.
Shown as process
fly_io.instance.memory.active
(gauge)
Memory that has been used more recently and usually not reclaimed unless absolutely necessary.
Shown as byte
fly_io.instance.memory.buffers
(gauge)
Relatively temporary storage for raw disk blocks
Shown as byte
fly_io.instance.memory.cached
(gauge)
In-memory cache for files read from the disk (the pagecache) as well as tmpfs & shmem. Doesn't include SwapCached.
Shown as byte
fly_io.instance.memory.dirty
(gauge)
Memory which is waiting to get written back to the disk
Shown as byte
fly_io.instance.memory.inactive
(gauge)
Memory which has been less recently used. It is more eligible to be reclaimed for other purposes
Shown as byte
fly_io.instance.memory.mem_available
(gauge)
An estimate of how much memory is available for starting new applications, without swapping.
Shown as byte
fly_io.instance.memory.mem_free
(gauge)
Total free RAM.
Shown as byte
fly_io.instance.memory.mem_total
(gauge)
Total usable RAM (i.e. physical RAM minus a few reserved bits and the kernel binary code)
Shown as byte
fly_io.instance.memory.pressure_full
(gauge)
Memory pressure for all processes
fly_io.instance.memory.pressure_some
(gauge)
Memory pressure for at least one process
fly_io.instance.memory.shmem
(gauge)
Total memory used by shared memory (shmem) and tmpfs
Shown as byte
fly_io.instance.memory.slab
(gauge)
In-kernel data structures cache
Shown as byte
fly_io.instance.memory.swap_cached
(gauge)
Memory that once was swapped out, is swapped back in but still also is in the swapfile
Shown as byte
fly_io.instance.memory.swap_free
(gauge)
Memory which has been evicted from RAM, and is temporarily on the disk
Shown as byte
fly_io.instance.memory.swap_total
(gauge)
Total amount of swap space available
Shown as byte
fly_io.instance.memory.vmalloc_chunk
(gauge)
Largest contiguous block of vmalloc area which is free
Shown as byte
fly_io.instance.memory.vmalloc_total
(gauge)
Total size of vmalloc virtual address space
Shown as byte
fly_io.instance.memory.vmalloc_used
(gauge)
Amount of vmalloc area which is used
Shown as byte
fly_io.instance.memory.writeback
(gauge)
Memory which is actively being written back to the disk
Shown as byte
fly_io.instance.net.recv_bytes.count
(count)
Number of good bytes received by the interface.
Shown as byte
fly_io.instance.net.recv_compressed.count
(count)
Number of correctly received compressed packets.
fly_io.instance.net.recv_drop.count
(count)
Number of packets received but not processed, e.g. due to lack of resources or unsupported protocol.
Shown as packet
fly_io.instance.net.recv_errs.count
(count)
Total number of bad packets received on this network device.
Shown as packet
fly_io.instance.net.recv_fifo.count
(count)
Receiver FIFO overflow event counter.
fly_io.instance.net.recv_frame.count
(count)
Receiver frame alignment errors.
fly_io.instance.net.recv_multicast.count
(count)
Multicast packets received.
Shown as packet
fly_io.instance.net.recv_packets.count
(count)
Number of good packets received by the interface.
Shown as packet
fly_io.instance.net.sent_bytes.count
(count)
Number of good transmitted bytes.
Shown as byte
fly_io.instance.net.sent_carrier.count
(count)
Number of frame transmission errors due to loss of carrier during transmission.
fly_io.instance.net.sent_colls.count
(count)
Number of collisions during packet transmissions.
fly_io.instance.net.sent_compressed.count
(count)
Number of transmitted compressed packets.
fly_io.instance.net.sent_drop.count
(count)
Number of packets dropped on their way to transmission, e.g. due to lack of resources.
Shown as packet
fly_io.instance.net.sent_errs.count
(count)
Total number of transmit problems.
fly_io.instance.net.sent_fifo.count
(count)
Sent FIFO overflow event counter.
fly_io.instance.net.sent_packets.count
(count)
Number of packets successfully transmitted.
Shown as packet
fly_io.instance.up
(gauge)
Reports 1 if the VM is reporting correctly
fly_io.instance.volume.size
(gauge)
Volume size in bytes.
Shown as byte
fly_io.instance.volume.used
(gauge)
Percentage of volume used.
Shown as byte
fly_io.machine.count
(gauge)
Count of running machines
fly_io.machine.cpus.count
(gauge)
Number of CPUs
fly_io.machine.gpus.count
(gauge)
Number of GPUs
fly_io.machine.memory
(gauge)
Memory of a machine
Shown as megabyte
fly_io.machine.swap_size
(gauge)
Swap space to reserve for the Fly Machine
Shown as megabyte
fly_io.machines_api.up
(gauge)
Whether the check can access the machines API or not
fly_io.pg.database.size
(gauge)
Database size
Shown as byte
fly_io.pg.replication.lag
(gauge)
Replication lag
fly_io.pg_stat.activity.count
(gauge)
Number of connections in this state
fly_io.pg_stat.activity.max_tx_duration
(gauge)
Max duration in seconds any active transaction has been running
Shown as second
fly_io.pg_stat.archiver.archived_count
(gauge)
Number of WAL files that have been successfully archived
fly_io.pg_stat.archiver.failed_count
(gauge)
Number of failed attempts for archiving WAL files
fly_io.pg_stat.bgwriter.buffers_alloc
(gauge)
Number of buffers allocated
fly_io.pg_stat.bgwriter.buffers_backend
(gauge)
Number of buffers written directly by a backend
fly_io.pg_stat.bgwriter.buffers_backend_fsync
(gauge)
Number of times a backend had to execute its own fsync call (normally the background writer handles those even when the backend does its own write)
fly_io.pg_stat.bgwriter.buffers_checkpoint
(gauge)
Number of buffers written during checkpoints
fly_io.pg_stat.bgwriter.buffers_clean
(gauge)
Number of buffers written by the background writer
fly_io.pg_stat.bgwriter.checkpoint_sync_time
(gauge)
Total amount of time that has been spent in the portion of checkpoint processing where files are synchronized to disk, in milliseconds
Shown as millisecond
fly_io.pg_stat.bgwriter.checkpoint_write_time
(gauge)
Total amount of time that has been spent in the portion of checkpoint processing where files are written to disk, in milliseconds
Shown as millisecond
fly_io.pg_stat.bgwriter.checkpoints_req
(gauge)
Number of requested checkpoints that have been performed
fly_io.pg_stat.bgwriter.checkpoints_timed
(gauge)
Number of scheduled checkpoints that have been performed
fly_io.pg_stat.bgwriter.maxwritten_clean
(gauge)
Number of times the background writer stopped a cleaning scan because it had written too many buffers
fly_io.pg_stat.bgwriter.stats_reset
(gauge)
Time at which these statistics were last reset
fly_io.pg_stat.database.blk_read_time
(gauge)
Time spent reading data file blocks by backends in this database, in milliseconds
Shown as millisecond
fly_io.pg_stat.database.blk_write_time
(gauge)
Time spent writing data file blocks by backends in this database, in milliseconds
Shown as millisecond
fly_io.pg_stat.database.blks_hit
(gauge)
Number of times disk blocks were found already in the buffer cache, so that a read was not necessary (this only includes hits in the PostgreSQL buffer cache, not the operating system's file system cache)
fly_io.pg_stat.database.blks_read
(gauge)
Number of disk blocks read in this database
fly_io.pg_stat.database.conflicts
(gauge)
Number of queries canceled due to conflicts with recovery in this database. Conflicts occur only on standby servers
fly_io.pg_stat.database.conflicts_confl_bufferpin
(gauge)
Number of queries in this database that have been canceled due to pinned buffers
fly_io.pg_stat.database.conflicts_confl_deadlock
(gauge)
Number of queries in this database that have been canceled due to deadlocks
fly_io.pg_stat.database.conflicts_confl_lock
(gauge)
Number of queries in this database that have been canceled due to lock timeouts
fly_io.pg_stat.database.conflicts_confl_snapshot
(gauge)
Number of queries in this database that have been canceled due to old snapshots
fly_io.pg_stat.database.conflicts_confl_tablespace
(gauge)
Number of queries in this database that have been canceled due to dropped tablespaces
fly_io.pg_stat.database.deadlocks
(gauge)
Number of deadlocks detected in this database
fly_io.pg_stat.database.numbackends
(gauge)
Number of backends currently connected to this database. This is the only column in this view that returns a value reflecting current state; all other columns return the accumulated values since the last reset.
fly_io.pg_stat.database.stats_reset
(gauge)
Time at which these statistics were last reset
fly_io.pg_stat.database.tup_deleted
(gauge)
Number of rows deleted by queries in this database
fly_io.pg_stat.database.tup_fetched
(gauge)
Number of rows fetched by queries in this database
fly_io.pg_stat.database.tup_inserted
(gauge)
Number of rows inserted by queries in this database
fly_io.pg_stat.database.tup_returned
(gauge)
Number of rows returned by queries in this database
fly_io.pg_stat.database.tup_updated
(gauge)
Number of rows updated by queries in this database
fly_io.pg_stat.database.xact_commit
(gauge)
Number of transactions in this database that have been committed
fly_io.pg_stat.database.xact_rollback
(gauge)
Number of transactions in this database that have been rolled back
fly_io.pg_stat.replication.pg_current_wal_lsn_bytes
(gauge)
WAL position in bytes
Shown as byte
fly_io.pg_stat.replication.pg_wal_lsn_diff
(gauge)
Lag in bytes between master and slave
Shown as byte
fly_io.pg_stat.replication.reply_time
(gauge)
Send time of last reply message received from standby server
fly_io.volume.block_size
(gauge)
The size of each memory block in bytes
Shown as byte
fly_io.volume.blocks.count
(gauge)
The total number of blocks in the volume
fly_io.volume.blocks_avail
(gauge)
The number of blocks available for data in the volume
fly_io.volume.blocks_free
(gauge)
The total number of blocks free for data and root user ops
fly_io.volume.created
(gauge)
Whether the volume has been created or not
fly_io.volume.encrypted
(gauge)
Whether the volume is encrypted or not
fly_io.volume.size
(gauge)
The size of the volume in GB
Shown as gigabyte

Events

The Fly.io integration does not include any events.

Service checks

The Fly.io integration does not include any service checks.

Application traces

Follow these steps to collect traces from an application in your Fly.io environment.

  1. Instrument your application.

  2. Deploy the Datadog Agent as a Fly.io application.

  3. Set the required environment variables in the fly.toml or the Dockerfile of your application, and deploy the application.

    Set the following environment variable to send metrics to the Datadog Agent application:

    [env]
        DD_AGENT_HOST="<YOUR_AGENT_APP_NAME>.internal"
    

    Set the following environment variable to make sure logs and metrics are reported from the same host:

    DD_TRACE_REPORT_HOSTNAME="true"
    

    To use unified service tagging, set these environment variables:

    DD_SERVICE="APP_NAME"
    DD_ENV="ENV_NAME"
    DD_VERSION="VERSION"
    

    To correlate logs and traces, follow Datadog's steps for connecting logs and traces, and set this environment variable:

    DD_LOGS_INJECTION="true"
    
  4. Set the following environment variables in the fly.toml of your Datadog Agent application, and deploy the application:

    [env]
        DD_APM_ENABLED = "true"
        DD_APM_NON_LOCAL_TRAFFIC = "true"
        DD_DOGSTATSD_NON_LOCAL_TRAFFIC = "true"
        DD_BIND_HOST = "fly-global-services"
    

Note: Make sure the configuration of your Fly.io instances does not publicly expose the ports for APM and DogStatsD, if they are enabled.
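The application-side variables from step 3 are shown one at a time above. As a sketch, assuming they are all set in the application's fly.toml rather than in its Dockerfile, they could sit together in one `[env]` table. The values are the same placeholders used in the individual snippets.

```toml
# fly.toml of an instrumented application (not the Agent application).
[env]
  # Points the tracer and DogStatsD client at the Agent application.
  DD_AGENT_HOST = "<YOUR_AGENT_APP_NAME>.internal"
  # Reports the same host for logs and metrics.
  DD_TRACE_REPORT_HOSTNAME = "true"
  # Unified service tagging (placeholder values).
  DD_SERVICE = "APP_NAME"
  DD_ENV = "ENV_NAME"
  DD_VERSION = "VERSION"
  # Log and trace correlation.
  DD_LOGS_INJECTION = "true"
```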

Log collection

Use the Fly log shipper to collect logs from your Fly.io applications.

  1. Clone the log shipper project.

  2. Modify the vector-configs/vector.toml file to set the logs source as fly_io:

    [transforms.log_json]
    type = "remap"
    inputs = ["nats"]
    source  = '''
    . = parse_json!(.message)
    .ddsource = 'fly-io'
    .host = .fly.app.instance
    .env = <YOUR_ENV_NAME>
    '''
    

This configuration parses Fly-specific log attributes. To fully parse the attributes of all logs, set ddsource to a known logs integration on a per-application basis using Vector transforms.

  3. Set secrets for NATS: ORG and ACCESS_TOKEN.

  4. Set secrets for Datadog: DATADOG_API_KEY and DATADOG_SITE.

  5. Deploy the log shipper application.

Troubleshooting

Need help? Contact Datadog support.
