View Kafka broker metrics for a 360° view of the health and performance of your Kafka clusters in real time. With this integration, you can collect metrics and logs from your Kafka deployment to visualize telemetry and alert on the performance of your Kafka stack.
If you would benefit from visualizing the topology of your streaming data pipelines and identifying the root cause of bottlenecks, learn more about Data Streams Monitoring.
Note: The Agent’s Kafka check is included in the Datadog Agent package, so you don’t need to install anything else on your Kafka nodes.
The check collects metrics from JMX with JMXFetch. A JVM is needed on each Kafka node so the Agent can run JMXFetch; JMXFetch can use the same JVM as Kafka.
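For example, JMX can be exposed on a broker by setting the JMX_PORT environment variable before starting Kafka (a minimal sketch; port 9999 is only an example, and your startup script path may differ):

# Kafka’s startup scripts enable the JMX remote agent when JMX_PORT is set.
# The port value is a placeholder; use any free port and point the Agent at it.
export JMX_PORT=9999
bin/kafka-server-start.sh config/server.properties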
Note: The Kafka check cannot be used with Managed Streaming for Apache Kafka (Amazon MSK). Use the Amazon MSK integration instead.
To configure this check for an Agent running on a host:
Edit the kafka.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory. Kafka bean names depend on the exact Kafka version you’re running. Use the example configuration file that comes packaged with the Agent as a base, since it is the most up-to-date configuration. Note: the Agent version in the example may be newer than the version you have installed.
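For reference, a minimal kafka.d/conf.yaml sketch based on the packaged example (the host and port values are placeholders for your broker’s JMX endpoint):

init_config:
  is_jmx: true                    # run this check through JMXFetch
  collect_default_metrics: true   # collect the integration's default JMX metrics

instances:
  - host: localhost   # placeholder: JMX hostname of the broker
    port: 9999        # placeholder: JMX port of the broker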
Available for Agent versions >6.0
Kafka uses the log4j logger by default. To activate logging to a file and customize the format, edit the log4j.properties file:
# Set root logger level to INFO and its only appender to R
log4j.rootLogger=INFO, R

# R must be defined as an appender; a rolling file appender is one common choice
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.File=/var/log/kafka/server.log
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
By default, the Datadog integration pipeline supports the following conversion patterns:
%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
%d [%t] %-5p %c - %m%n
%r [%t] %p %c %x - %m%n
[%d] %p %m (%c)%n
Clone and edit the integration pipeline if you have a different format.
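For reference, a line produced by the first pattern above looks like the following (contents are illustrative):

2024-05-01 12:34:56 INFO  KafkaServer:123 - [KafkaServer id=0] started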
Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:
logs_enabled: true
Add the following configuration block to your kafka.d/conf.yaml file. Change the path and service parameter values based on your environment. See the sample kafka.d/conf.yaml for all available configuration options.
logs:
  - type: file
    path: /var/log/kafka/server.log
    source: kafka
    service: myapp
    # To handle multi-line logs that start with yyyy-mm-dd, use the following pattern:
    # log_processing_rules:
    #   - type: multi_line
    #     name: log_start_with_date
    #     pattern: \d{4}\-(0?[1-9]|1[012])\-(0?[1-9]|[12][0-9]|3[01])
For containerized environments, see the Autodiscovery with JMX guide.
Available for Agent versions >6.0
Collecting logs is disabled by default in the Datadog Agent. To enable it, see Kubernetes Log Collection.
Parameter | Value |
---|---|
<LOG_CONFIG> | {"source": "kafka", "service": "<SERVICE_NAME>"} |
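For example, applied as Kubernetes pod annotations (a sketch that assumes your Kafka container is named kafka; <SERVICE_NAME> remains a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: kafka
  annotations:
    ad.datadoghq.com/kafka.logs: '[{"source": "kafka", "service": "<SERVICE_NAME>"}]'
spec:
  containers:
    - name: kafka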
Run the Agent’s status subcommand and look for kafka under the JMXFetch section:
========
JMXFetch
========

  Initialized checks
  ==================
    kafka
      instance_name : kafka-localhost-9999
      message :
      metric_count : 46
      service_check_count : 0
      status : OK
Metric | Description |
---|---|
kafka.consumer.bytes_consumed (gauge) | The average number of bytes consumed per second for a specific topic. Shown as byte |
kafka.consumer.bytes_in (gauge) | Consumer bytes in rate. Shown as byte |
kafka.consumer.delayed_requests (gauge) | Number of delayed consumer requests. Shown as request |
kafka.consumer.expires_per_second (gauge) | Rate of delayed consumer request expiration. Shown as eviction |
kafka.consumer.fetch_rate (gauge) | The minimum rate at which the consumer sends fetch requests to a broker. Shown as request |
kafka.consumer.fetch_size_avg (gauge) | The average number of bytes fetched per request for a specific topic. Shown as byte |
kafka.consumer.fetch_size_max (gauge) | The maximum number of bytes fetched per request for a specific topic. Shown as byte |
kafka.consumer.kafka_commits (gauge) | Rate of offset commits to Kafka. Shown as write |
kafka.consumer.max_lag (gauge) | Maximum consumer lag. Shown as offset |
kafka.consumer.messages_in (gauge) | Rate of consumer message consumption. Shown as message |
kafka.consumer.records_consumed (gauge) | The average number of records consumed per second for a specific topic. Shown as record |
kafka.consumer.records_per_request_avg (gauge) | The average number of records in each request for a specific topic. Shown as record |
kafka.consumer.zookeeper_commits (gauge) | Rate of offset commits to ZooKeeper. Shown as write |
kafka.expires_sec (gauge) | Rate of delayed producer request expiration. Shown as eviction |
kafka.follower.expires_per_second (gauge) | Rate of request expiration on followers. Shown as eviction |
kafka.log.flush_rate.rate (gauge) | Log flush rate. Shown as flush |
kafka.messages_in.rate (gauge) | Incoming message rate. Shown as message |
kafka.net.bytes_in.rate (gauge) | Incoming byte rate. Shown as byte |
kafka.net.bytes_out (gauge) | Outgoing byte total. Shown as byte |
kafka.net.bytes_out.rate (gauge) | Outgoing byte rate. Shown as byte |
kafka.net.bytes_rejected.rate (gauge) | Rejected byte rate. Shown as byte |
kafka.net.processor.avg.idle.pct.rate (gauge) | Average fraction of time the network processor threads are idle. Shown as fraction |
kafka.producer.available_buffer_bytes (gauge) | The total amount of buffer memory that is not being used (either unallocated or in the free list). Shown as byte |
kafka.producer.batch_size_avg (gauge) | The average number of bytes sent per partition per-request. Shown as byte |
kafka.producer.batch_size_max (gauge) | The max number of bytes sent per partition per-request. Shown as byte |
kafka.producer.buffer_bytes_total (gauge) | The maximum amount of buffer memory the client can use (whether or not it is currently used). Shown as byte |
kafka.producer.bufferpool_wait_ratio (gauge) | The fraction of time an appender waits for space allocation. |
kafka.producer.bufferpool_wait_time (gauge) | The fraction of time an appender waits for space allocation. |
kafka.producer.bufferpool_wait_time_ns_total (gauge) | The total time in nanoseconds an appender waits for space allocation. Shown as nanosecond |
kafka.producer.bytes_out (gauge) | Producer bytes out rate. Shown as byte |
kafka.producer.compression_rate (gauge) | The average compression rate of record batches for a topic. Shown as fraction |
kafka.producer.compression_rate_avg (rate) | The average compression rate of record batches. Shown as fraction |
kafka.producer.delayed_requests (gauge) | Number of producer requests delayed. Shown as request |
kafka.producer.expires_per_seconds (gauge) | Rate of producer request expiration. Shown as eviction |
kafka.producer.io_wait (gauge) | Producer I/O wait time. Shown as nanosecond |
kafka.producer.message_rate (gauge) | Producer message rate. Shown as message |
kafka.producer.metadata_age (gauge) | The age in seconds of the current producer metadata being used. Shown as second |
kafka.producer.record_error_rate (gauge) | The average per-second number of errored record sends for a topic. Shown as error |
kafka.producer.record_queue_time_avg (gauge) | The average time in ms record batches spent in the record accumulator. Shown as millisecond |
kafka.producer.record_queue_time_max (gauge) | The maximum time in ms record batches spent in the record accumulator. Shown as millisecond |
kafka.producer.record_retry_rate (gauge) | The average per-second number of retried record sends for a topic. Shown as record |
kafka.producer.record_send_rate (gauge) | The average number of records sent per second for a topic. Shown as record |
kafka.producer.record_size_avg (gauge) | The average record size. Shown as byte |
kafka.producer.record_size_max (gauge) | The maximum record size. Shown as byte |
kafka.producer.records_per_request (gauge) | The average number of records sent per second. Shown as record |
kafka.producer.request_latency_avg (gauge) | Producer average request latency. Shown as millisecond |
kafka.producer.request_latency_max (gauge) | The maximum request latency in ms. Shown as millisecond |
kafka.producer.request_rate (gauge) | Number of producer requests per second. Shown as request |
kafka.producer.requests_in_flight (gauge) | The current number of in-flight requests awaiting a response. Shown as request |
kafka.producer.response_rate (gauge) | Number of producer responses per second. Shown as response |
kafka.producer.throttle_time_avg (gauge) | The average time in ms a request was throttled by a broker. Shown as millisecond |
kafka.producer.throttle_time_max (gauge) | The maximum time in ms a request was throttled by a broker. Shown as millisecond |
kafka.producer.waiting_threads (gauge) | The number of user threads blocked waiting for buffer memory to enqueue their records. Shown as thread |
kafka.replication.active_controller_count (gauge) | Number of active controllers in the cluster. Shown as node |
kafka.replication.isr_expands.rate (gauge) | Rate of replicas joining the ISR pool. Shown as node |
kafka.replication.isr_shrinks.rate (gauge) | Rate of replicas leaving the ISR pool. Shown as node |
kafka.replication.leader_count (gauge) | Number of leaders on this broker. Shown as node |
kafka.replication.leader_elections.rate (gauge) | Leader election rate. Shown as event |
kafka.replication.max_lag (gauge) | Maximum lag in messages between the follower and leader replicas. Shown as offset |
kafka.replication.offline_partitions_count (gauge) | Number of partitions that don't have an active leader. |
kafka.replication.partition_count (gauge) | Number of partitions across all topics in the cluster. |
kafka.replication.unclean_leader_elections.rate (gauge) | Unclean leader election rate. Shown as event |
kafka.replication.under_min_isr_partition_count (gauge) | Number of under min ISR partitions. |
kafka.replication.under_replicated_partitions (gauge) | Number of under replicated partitions. |
kafka.request.channel.queue.size (gauge) | Number of queued requests. Shown as request |
kafka.request.fetch.failed.rate (gauge) | Client fetch request failures rate. Shown as request |
kafka.request.fetch_consumer.rate (gauge) | Fetch consumer requests rate. Shown as request |
kafka.request.fetch_consumer.time.99percentile (gauge) | Total time in ms to serve the specified request. Shown as millisecond |
kafka.request.fetch_consumer.time.avg (gauge) | Total time in ms to serve the specified request. Shown as millisecond |
kafka.request.fetch_follower.rate (gauge) | Fetch follower requests rate. Shown as request |
kafka.request.fetch_follower.time.99percentile (gauge) | Total time in ms to serve the specified request. Shown as millisecond |
kafka.request.fetch_follower.time.avg (gauge) | Total time in ms to serve the specified request. Shown as millisecond |
kafka.request.fetch_request_purgatory.size (gauge) | Number of requests waiting in the fetch purgatory. Shown as request |
kafka.request.handler.avg.idle.pct.rate (gauge) | Average fraction of time the request handler threads are idle. Shown as fraction |
kafka.request.metadata.time.99percentile (gauge) | Time for metadata requests for 99th percentile. Shown as millisecond |
kafka.request.metadata.time.avg (gauge) | Average time for metadata request. Shown as millisecond |
kafka.request.offsets.time.99percentile (gauge) | Time for offset requests for 99th percentile. Shown as millisecond |
kafka.request.offsets.time.avg (gauge) | Average time for an offset request. Shown as millisecond |
kafka.request.produce.failed.rate (gauge) | Failed produce requests rate. Shown as request |
kafka.request.produce.rate (gauge) | Produce requests rate. Shown as request |
kafka.request.produce.time.99percentile (gauge) | Time for produce requests for 99th percentile. Shown as millisecond |
kafka.request.produce.time.avg (gauge) | Average time for a produce request. Shown as millisecond |
kafka.request.producer_request_purgatory.size (gauge) | Number of requests waiting in the producer purgatory. Shown as request |
kafka.request.update_metadata.time.99percentile (gauge) | Time for update metadata requests for 99th percentile. Shown as millisecond |
kafka.request.update_metadata.time.avg (gauge) | Average time for a request to update metadata. Shown as millisecond |
kafka.server.socket.connection_count (gauge) | Number of currently open connections to the broker. Shown as connection |
kafka.session.fetch.count (gauge) | Number of fetch sessions. |
kafka.session.fetch.eviction (gauge) | Eviction rate of fetch session. Shown as event |
kafka.session.zookeeper.disconnect.rate (gauge) | Zookeeper client disconnect rate. Shown as event |
kafka.session.zookeeper.expire.rate (gauge) | Zookeeper client session expiration rate. Shown as event |
kafka.session.zookeeper.readonly.rate (gauge) | Zookeeper client readonly rate. Shown as event |
kafka.session.zookeeper.sync.rate (gauge) | Zookeeper client sync rate. Shown as event |
kafka.topic.messages_in.rate (gauge) | Incoming message rate by topic. Shown as message |
kafka.topic.net.bytes_in.rate (gauge) | Incoming byte rate by topic. Shown as byte |
kafka.topic.net.bytes_out.rate (gauge) | Outgoing byte rate by topic. Shown as byte |
kafka.topic.net.bytes_rejected.rate (gauge) | Rejected byte rate by topic. Shown as byte |
The Kafka check does not include any events.
kafka.can_connect
Returns CRITICAL if the Agent is unable to connect to and collect metrics from the monitored Kafka instance, WARNING if no metrics are collected, and OK otherwise.
Statuses: ok, critical, warning
This Agent integration collects message offset metrics from your Kafka consumers. The check fetches the highwater offsets from the Kafka brokers and the consumer offsets stored in Kafka (or in ZooKeeper for old-style consumers), then calculates consumer lag, which is the difference between the broker offset and the consumer offset.
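For example, if a partition’s highwater offset on the broker is 1050 and the consumer group’s committed offset for that partition is 1000, the check reports a consumer lag of 50 messages for that partition.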
If you would benefit from visualizing the topology of your streaming data pipelines and identifying the root cause of bottlenecks, learn more about Data Streams Monitoring.
Note: The Agent’s Kafka consumer check is included in the Datadog Agent package. No additional installation is needed on your Kafka nodes.
To configure this check for an Agent running on a host running your Kafka consumers:
Edit the kafka_consumer.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample kafka_consumer.d/conf.yaml for all available configuration options.
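A minimal sketch (the broker address, consumer group, topic name, and partition list are placeholders):

init_config:

instances:
  - kafka_connect_str: localhost:9092   # placeholder: broker to query
    consumer_groups:
      my_consumer_group:                # placeholder: consumer group to monitor
        my_topic: [0, 1]                # placeholder: topic and partitions
    # Alternatively, monitor every group without listing them:
    # monitor_unlisted_consumer_groups: true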
This check does not collect additional logs. To collect logs from Kafka brokers, see log collection instructions for Kafka.
For containerized environments, see the Autodiscovery Integration Templates for guidance on applying the parameters below.
Parameter | Value |
---|---|
<INTEGRATION_NAME> | kafka_consumer |
<INIT_CONFIG> | blank or {} |
<INSTANCE_CONFIG> | {"kafka_connect_str": <KAFKA_CONNECT_STR>} For example, {"kafka_connect_str": "server:9092"} |
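For example, applied as Kubernetes pod annotations (a sketch; the container name kafka and the broker port are placeholders, and %%host%% is resolved by Autodiscovery):

apiVersion: v1
kind: Pod
metadata:
  name: kafka
  annotations:
    ad.datadoghq.com/kafka.check_names: '["kafka_consumer"]'
    ad.datadoghq.com/kafka.init_configs: '[{}]'
    ad.datadoghq.com/kafka.instances: '[{"kafka_connect_str": "%%host%%:9092"}]'
spec:
  containers:
    - name: kafka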
Run the Agent’s status subcommand and look for kafka_consumer under the Checks section.
Metric | Description |
---|---|
kafka.broker_offset (gauge) | Current message offset on broker. Shown as offset |
kafka.consumer_lag (gauge) | Lag in messages between consumer and broker. Shown as offset |
kafka.consumer_offset (gauge) | Current message offset on consumer. Shown as offset |
kafka.estimated_consumer_lag (gauge) | Lag in seconds between consumer and broker. This metric is provided through Data Streams Monitoring. Additional charges may apply. Shown as second |
consumer_lag:
The Datadog Agent emits an event when the value of the consumer_lag metric goes below 0, tagging it with topic, partition, and consumer_group.
The Kafka consumer check does not include any service checks.
Kerberos GSSAPI Authentication
Depending on your Kafka cluster’s Kerberos setup, you may need to configure the following:
- The tls_cert and/or tls_ca_cert options.
- The tls_private_key option. If applicable, the private key password should be configured with the tls_private_key_password option.
- The KRB5_CLIENT_KTNAME environment variable pointing to the Kafka client’s Kerberos keytab location if it differs from the default path (for example, KRB5_CLIENT_KTNAME=/etc/krb5.keytab).
- The KRB5CCNAME environment variable pointing to the Kafka client’s Kerberos credentials ticket cache if it differs from the default path (for example, KRB5CCNAME=/tmp/krb5cc_xxx).

If the Agent runs as a systemd service, set these environment variables in the service’s environment, as in the following example.

Linux Systemd Example
Create an environment file, for example /path/to/environment/file, containing:
KRB5_CLIENT_KTNAME=/etc/krb5.keytab
KRB5CCNAME=/tmp/krb5cc_xxx
Create a Datadog Agent service configuration override file: sudo systemctl edit datadog-agent.service
Configure the following in the override file:
[Service]
EnvironmentFile=/path/to/environment/file
Reload the systemd daemon and restart the Datadog Agent service:

sudo systemctl daemon-reload
sudo systemctl restart datadog-agent.service

On hosts managed with service instead of systemctl, restart the Agent with:

sudo service datadog-agent restart