Overview
Connect your TiDB cluster to Datadog to:
- Collect key TiDB metrics from your cluster.
- Collect logs from your cluster, such as TiDB/TiKV/TiFlash logs and slow query logs.
- Visualize cluster performance on the provided dashboard.
Note:
Setup
Installation
First, download and launch the Datadog Agent.
Then, manually install the TiDB check. Instructions vary depending on the environment.
Run the following command to install the check:
datadog-agent integration install -t datadog-tidb==<INTEGRATION_VERSION>
Configuration
Metric collection
- Edit the tidb.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory to start collecting your TiDB performance data. See the sample tidb.d/conf.yaml for all available configuration options.
The sample tidb.d/conf.yaml only configures the PD instance. You need to manually configure the other instances in the TiDB cluster, for example:
init_config:

instances:
  - pd_metric_url: http://localhost:2379/metrics
    send_distribution_buckets: true
    tags:
      - cluster_name:cluster01
  - tidb_metric_url: http://localhost:10080/metrics
    send_distribution_buckets: true
    tags:
      - cluster_name:cluster01
  - tikv_metric_url: http://localhost:20180/metrics
    send_distribution_buckets: true
    tags:
      - cluster_name:cluster01
  - tiflash_metric_url: http://localhost:8234/metrics
    send_distribution_buckets: true
    tags:
      - cluster_name:cluster01
  - tiflash_proxy_metric_url: http://localhost:20292/metrics
    send_distribution_buckets: true
    tags:
      - cluster_name:cluster01
- Restart the Agent.
Log collection
Available for Agent versions >6.0
Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:
logs_enabled: true
Add this configuration block to your tidb.d/conf.yaml file to start collecting your TiDB logs:
logs:
  # pd log
  - type: file
    path: "/tidb-deploy/pd-2379/log/pd*.log"
    service: "tidb-cluster"
    source: "pd"

  # tikv log
  - type: file
    path: "/tidb-deploy/tikv-20160/log/tikv*.log"
    service: "tidb-cluster"
    source: "tikv"

  # tidb log
  - type: file
    path: "/tidb-deploy/tidb-4000/log/tidb*.log"
    service: "tidb-cluster"
    source: "tidb"
    exclude_paths:
      - /tidb-deploy/tidb-4000/log/tidb_slow_query.log
  - type: file
    path: "/tidb-deploy/tidb-4000/log/tidb_slow_query*.log"
    service: "tidb-cluster"
    source: "tidb"
    log_processing_rules:
      - type: multi_line
        name: new_log_start_with_datetime
        pattern: '#\sTime:'
    tags:
      - "custom_format:tidb_slow_query"

  # tiflash log
  - type: file
    path: "/tidb-deploy/tiflash-9000/log/tiflash*.log"
    service: "tidb-cluster"
    source: "tiflash"
Change the path and service according to your cluster’s configuration.
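The multi_line rule in the slow query entry exists because one slow query record spans several lines. As an illustration only (this is not the Agent's implementation), the sketch below groups lines so that each line matching the configured pattern '#\sTime:' starts a new entry. The function name and the sample lines are made up for the example.

```python
import re

# Same pattern as the log_processing_rules entry in the configuration.
NEW_ENTRY = re.compile(r"#\sTime:")


def group_slow_query_lines(lines):
    """Group raw lines into entries; a line matching NEW_ENTRY starts a new one."""
    entries = []
    for line in lines:
        # The first line always opens an entry, even if it does not match.
        if NEW_ENTRY.match(line) or not entries:
            entries.append([line])
        else:
            entries[-1].append(line)
    return ["\n".join(entry) for entry in entries]


if __name__ == "__main__":
    # Illustrative sample lines, not taken from a real cluster.
    sample = [
        "# Time: 2024-01-01T00:00:00Z",
        "# Query_time: 1.5",
        "select * from t1;",
        "# Time: 2024-01-01T00:00:05Z",
        "# Query_time: 2.0",
        "select * from t2;",
    ]
    for entry in group_slow_query_lines(sample):
        print(entry)
        print("---")
```

Without such a rule, each of these lines would be treated as a separate log entry.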
Use these commands to show all log paths:
# show deploying directories
tiup cluster display <YOUR_CLUSTER_NAME>
# find specific logging file path by command arguments
ps -fwwp <TIDB_PROCESS_PID/PD_PROCESS_PID/etc.>
Restart the Agent.
Validation
Run the Agent’s status subcommand and look for tidb under the Checks section.
Data Collected
Metrics
Metric | Type | Description | Unit
tidb_cluster.tidb_executor_statement_total | count | The total number of statements executed | execution
tidb_cluster.tidb_server_execute_error_total | count | The total number of execution errors | error
tidb_cluster.tidb_server_connections | gauge | Current number of connections in TiDB server | connection
tidb_cluster.tidb_server_handle_query_duration_seconds.count | count | The total number of handled queries in server | query
tidb_cluster.tidb_server_handle_query_duration_seconds.sum | count | The sum of handled query duration in server | second
tidb_cluster.tikv_engine_size_bytes | gauge | The disk usage bytes of TiKV instances | byte
tidb_cluster.tikv_store_size_bytes | gauge | The disk capacity bytes of TiKV instances | byte
tidb_cluster.tikv_io_bytes | count | The I/O read/write bytes of TiKV instances | byte
tidb_cluster.tiflash_store_size_used_bytes | gauge | The disk usage bytes of TiFlash instances | byte
tidb_cluster.tiflash_store_size_capacity_bytes | gauge | The disk capacity bytes of TiFlash instances | byte
tidb_cluster.process_cpu_seconds_total | count | The CPU usage seconds of TiDB/TiKV/TiFlash instances | second
tidb_cluster.process_resident_memory_bytes | gauge | The resident memory bytes of TiDB/TiKV/TiFlash instances | byte
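The .sum and .count metrics for tidb_server_handle_query_duration_seconds are most useful together: dividing the duration sum by the query count over the same interval gives an average query latency. A minimal sketch, assuming both inputs are per-interval deltas that you have already retrieved (the function name is hypothetical):

```python
def average_query_duration(sum_delta_seconds, count_delta):
    """Average seconds per handled query over an interval.

    Returns None when no queries were handled, to avoid dividing by zero.
    """
    if count_delta <= 0:
        return None
    return sum_delta_seconds / count_delta


if __name__ == "__main__":
    # Illustrative numbers: 12.5 seconds spent on 250 queries.
    print(average_query_duration(12.5, 250))
```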
It is possible to use the metrics configuration option to collect additional metrics from a TiDB cluster.
Events
The TiDB check does not include any events.
Service Checks
tidb_cluster.prometheus.health
Returns CRITICAL if the Agent cannot fetch Prometheus metrics; otherwise, returns OK.
Statuses: ok, critical
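To check an endpoint by hand before suspecting the Agent, you can mirror the semantics described above with a small probe. This is a sketch under assumptions, not the check's actual code: it treats an HTTP 200 response as ok and anything else, including connection errors, as critical. The function name and the injectable fetch parameter are hypothetical.

```python
import urllib.error
import urllib.request


def prometheus_health(url, fetch=None, timeout=5.0):
    """Return "ok" if the metrics endpoint answers with HTTP 200, else "critical"."""
    if fetch is None:
        def fetch(target):
            # Default fetcher: plain HTTP GET, returns the status code.
            with urllib.request.urlopen(target, timeout=timeout) as resp:
                return resp.status
    try:
        status = fetch(url)
    except (urllib.error.URLError, OSError, ValueError):
        # Unreachable host, refused connection, HTTP error, or malformed URL.
        return "critical"
    return "ok" if status == 200 else "critical"


if __name__ == "__main__":
    # URL taken from the sample configuration; adjust to your cluster.
    print(prometheus_health("http://localhost:2379/metrics"))
```

The fetch parameter only exists so the function can be exercised without a running cluster.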
Troubleshooting
Missing CPU and Memory metrics for TiKV and TiFlash instances on macOS
CPU and Memory metrics are not provided for TiKV and TiFlash instances in the following cases:
Too many metrics
The TiDB check enables Datadog’s distribution metric type by default. Distribution data is large and may consume significant resources. You can disable it in the tidb.d/conf.yaml file:
send_distribution_buckets: false
Because a TiDB cluster exposes many important metrics, the TiDB check sets max_returned_metrics to 10000 by default. You can decrease max_returned_metrics in the tidb.d/conf.yaml file if necessary:
max_returned_metrics: 1000
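Before changing these settings, it can help to see how much of an endpoint's output consists of histogram buckets. The sketch below counts sample lines in Prometheus text exposition output and separates series whose name ends in _bucket from the rest. It is a rough estimate only: how the Agent counts metrics against max_returned_metrics is not specified here, and the function name and sample text are illustrative.

```python
def count_samples(exposition_text):
    """Count sample lines in Prometheus text format, split into bucket and other series."""
    buckets = 0
    others = 0
    for raw in exposition_text.splitlines():
        line = raw.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.endswith("_bucket"):
            buckets += 1
        else:
            others += 1
    return {"bucket": buckets, "other": others, "total": buckets + others}


if __name__ == "__main__":
    # Made-up sample in Prometheus text format.
    sample = "\n".join([
        "# TYPE tidb_server_handle_query_duration_seconds histogram",
        'tidb_server_handle_query_duration_seconds_bucket{le="0.1"} 3',
        'tidb_server_handle_query_duration_seconds_bucket{le="+Inf"} 5',
        "tidb_server_handle_query_duration_seconds_sum 1.2",
        "tidb_server_handle_query_duration_seconds_count 5",
        "tidb_server_connections 7",
    ])
    print(count_samples(sample))
```

If bucket series dominate the total, setting send_distribution_buckets to false is likely to have the larger effect.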
Need help? Contact Datadog support.