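At 05:28 on 2016-10-30 the Redis cluster reported CLUSTERDOWN once. It looks like it was caused by network jitter; the details are recorded here for future reference.
Redis 3.0.7 64bit cluster mode
Three servers (285/286/287), two instances per server (30001/30002)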

Log from instance 30001 on 285 at the time (node ID 9d21b96013bbee9319a2387a243271c255b411dd):
[code]
31672:S 30 Oct 05:28:11.809 # Cluster state changed: fail
31672:S 30 Oct 05:28:30.551 * FAIL message received from d4a1b5802d51faa245e1f7e2723f05521faa0c2c about 94bd2201144028727f5560b3e088b9224f08d5b3
31672:S 30 Oct 05:28:30.551 * FAIL message received from d4a1b5802d51faa245e1f7e2723f05521faa0c2c about 4a0d258e31dd4220fbe6d08b06ff2bb63e4cb3ed
31672:S 30 Oct 05:28:30.939 # Cluster state changed: ok
31672:S 30 Oct 05:28:31.941 * Clear FAIL state for node 94bd2201144028727f5560b3e088b9224f08d5b3: slave is reachable again.
31672:S 30 Oct 05:28:31.941 * Clear FAIL state for node 4a0d258e31dd4220fbe6d08b06ff2bb63e4cb3ed: slave is reachable again.
[/code]

Log from instance 30002 on 285 (node ID d4a1b5802d51faa245e1f7e2723f05521faa0c2c):
[code]
31722:M 30 Oct 05:28:13.388 # Cluster state changed: fail
31722:M 30 Oct 05:28:30.531 * Marking node 94bd2201144028727f5560b3e088b9224f08d5b3 as failing (quorum reached).
31722:M 30 Oct 05:28:30.531 * Marking node 4a0d258e31dd4220fbe6d08b06ff2bb63e4cb3ed as failing (quorum reached).
31722:M 30 Oct 05:28:30.551 * Clear FAIL state for node 4a0d258e31dd4220fbe6d08b06ff2bb63e4cb3ed: slave is reachable again.
31722:M 30 Oct 05:28:31.553 * Clear FAIL state for node 94bd2201144028727f5560b3e088b9224f08d5b3: slave is reachable again.
31722:M 30 Oct 05:28:35.459 # Cluster state changed: ok
[/code]
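
To put a number on how long each instance considered the cluster down, the two "Cluster state changed" lines can be diffed. The following is only a minimal sketch: the function name `fail_window_seconds` is my own, the year is passed in because the Redis log lines do not carry one, and it assumes month abbreviations parse under the default C/English locale.

```python
from datetime import datetime

def fail_window_seconds(log_text, year=2016):
    """Seconds between the first 'Cluster state changed: fail'
    and the next 'Cluster state changed: ok', or None if not found."""
    start = None
    for line in log_text.splitlines():
        # Expected shape: "31672:S 30 Oct 05:28:11.809 # Cluster state changed: fail"
        parts = line.split(None, 5)
        if len(parts) < 6 or not parts[5].startswith("Cluster state changed:"):
            continue
        stamp = datetime.strptime(
            f"{year} {parts[1]} {parts[2]} {parts[3]}", "%Y %d %b %H:%M:%S.%f"
        )
        state = parts[5].rsplit(":", 1)[1].strip()
        if state == "fail" and start is None:
            start = stamp
        elif state == "ok" and start is not None:
            return (stamp - start).total_seconds()
    return None

# The state-change lines copied from the two logs above.
LOG_30001 = """\
31672:S 30 Oct 05:28:11.809 # Cluster state changed: fail
31672:S 30 Oct 05:28:30.939 # Cluster state changed: ok
"""
LOG_30002 = """\
31722:M 30 Oct 05:28:13.388 # Cluster state changed: fail
31722:M 30 Oct 05:28:35.459 # Cluster state changed: ok
"""

print(fail_window_seconds(LOG_30001))  # 19.13
print(fail_window_seconds(LOG_30002))  # 22.071
```

So the slave on 30001 saw the cluster as failed for about 19 seconds and the master on 30002 for about 22 seconds.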


For a while I was plagued by SYN flooding warnings:
[code]
kernel: possible SYN flooding on port 80. Sending cookies.
[/code]

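I happened to come across Tengine's reuse_port directive and decided to give it a try. It appears to have solved the problem.
Postscript: after surviving one evening peak, I can confirm the problem is solved.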

[code]
events {
use epoll;
reuse_port on;
worker_connections 655350;
}
[/code]

Before enabling reuse_port on:
[code]
# ss -lnt
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 65535 *:80 *:*
[/code]

After enabling reuse_port on (the number of listening sockets on port 80 matches the number of workers):
[code]
# ss -lnt
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 65535 *:80 *:*
LISTEN 0 65535 *:80 *:*
LISTEN 0 65535 *:80 *:*
LISTEN 0 65535 *:80 *:*
LISTEN 0 65535 *:80 *:*
LISTEN 0 65535 *:80 *:*
LISTEN 0 65535 *:80 *:*
LISTEN 0 65535 *:80 *:*
[/code]
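
The multiple LISTEN entries are what you would expect if each worker opens its own listening socket with the Linux SO_REUSEPORT socket option, which as far as I understand is the mechanism behind reuse_port. The sketch below only illustrates that kernel feature; it is not Tengine's code, the helper name `open_reuseport_listeners` is made up, and it assumes Linux 3.9 or later.

```python
import socket

def open_reuseport_listeners(count, host="127.0.0.1", port=0):
    """Open `count` independent listening sockets on the same host:port.

    With port=0 the kernel picks a free port for the first socket and
    the remaining sockets bind to that same port.
    """
    listeners = []
    for _ in range(count):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Must be set before bind() on every socket that shares the port.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        s.listen(128)
        port = s.getsockname()[1]  # reuse the chosen port for the next socket
        listeners.append(s)
    return listeners

listeners = open_reuseport_listeners(4)
print("sockets:", len(listeners), "distinct ports:",
      len({s.getsockname()[1] for s in listeners}))
for s in listeners:
    s.close()
```

While such sockets are open, `ss -lnt` shows one LISTEN line per socket for the shared port, and each socket has its own accept queue.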


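When analyzing a SYN flooding problem, you need to know how many connections are currently in the SYN_RECV state. Here are several ways to find out, collected for future reference.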

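1. ss -s: this is the fastest command and returns almost instantly, but synrecv always shows 0, so it is useless for this purpose. The rest of its output is complete.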
[code]
# ss -s
Total: 30234 (kernel 30462)
TCP: 115175 (estab 30148, closed 77237, orphaned 7771, synrecv 0, timewait 77237/0), ports 1139

Transport Total IP IPv6
* 30462 - -
RAW 0 0 0
UDP 1 1 0
TCP 37938 37938 0
INET 37939 37939 0
FRAG 0 0 0
[/code]
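
Since the synrecv counter here stays at 0, one alternative is to count the entries yourself. The sketch below parses text in the /proc/net/tcp format, where the `st` column value 03 means SYN_RECV. The function name `count_syn_recv` and the sample rows are made up for illustration, and whether half-open connections are listed in /proc/net/tcp may depend on the kernel version, so treat this as an assumption to verify on your own machine.

```python
def count_syn_recv(proc_net_tcp_text, port=None):
    """Count rows in /proc/net/tcp-style text whose state is SYN_RECV (03).

    If `port` is given, only rows whose local port matches are counted.
    """
    count = 0
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Data rows start with a slot number like "0:"; this skips the header.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        local_port = int(fields[1].rsplit(":", 1)[1], 16)  # port is hex
        if fields[3] == "03" and (port is None or local_port == port):
            count += 1
    return count

# Made-up sample rows: one LISTEN (0A), one ESTABLISHED (01), and two
# SYN_RECV (03) entries on ports 80 (0x0050) and 8080 (0x1F90).
SAMPLE = "\n".join([
    "  sl  local_address rem_address   st tx_queue rx_queue",
    "   0: 00000000:0050 00000000:0000 0A 00000000:00000000",
    "   1: 0100007F:0050 0200007F:C350 03 00000000:00000000",
    "   2: 0100007F:0050 0200007F:C351 01 00000000:00000000",
    "   3: 0100007F:1F90 0200007F:C352 03 00000000:00000000",
])

print(count_syn_recv(SAMPLE))           # 2
print(count_syn_recv(SAMPLE, port=80))  # 1
```

On a live server it would be called as `count_syn_recv(open("/proc/net/tcp").read(), port=80)`. Reading the whole file can be slow on a machine with tens of thousands of connections like the one above.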


This message showed up in /var/log/messages on the nginx server:
[code]
kernel: possible SYN flooding on port 80. Sending cookies.
[/code]

Looking up syncookies in the ip-sysctl documentation (https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt)
turned up this entry:
[code]
tcp_syncookies - BOOLEAN
Only valid when the kernel was compiled with CONFIG_SYN_COOKIES
Send out syncookies when the syn backlog queue of a socket
overflows. This is to prevent against the common 'SYN flood attack'
Default: 1

Note, that syncookies is fallback facility.
It MUST NOT be used to help highly loaded servers to stand
against legal connection rate. If you see SYN flood warnings
in your logs, but investigation shows that they occur
because of overload with legal connections, you should tune
another parameters until this warning disappear.
See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow.

syncookies seriously violate TCP protocol, do not allow
to use TCP extensions, can result in serious degradation
of some services (f.e. SMTP relaying), visible not by you,
but your clients and relays, contacting you. While you see
SYN flood warnings in logs not being really flooded, your server
is seriously misconfigured.

If you want to test which effects syncookies have to your
network connections you can set this knob to 2 to enable
unconditionally generation of syncookies.
[/code]
The text mentions three other settings: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow
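
Before tuning anything it helps to see what these settings are currently set to. Here is a minimal sketch that reads them from /proc/sys/net/ipv4, the usual location of the net.ipv4 sysctls on Linux. The names `SYN_KNOBS` and `read_syn_sysctls` are my own, and a value of None simply means the file could not be read on that machine.

```python
from pathlib import Path

# The setting from the quoted documentation plus the three it points to.
SYN_KNOBS = (
    "tcp_syncookies",
    "tcp_max_syn_backlog",
    "tcp_synack_retries",
    "tcp_abort_on_overflow",
)

def read_syn_sysctls(base="/proc/sys/net/ipv4"):
    """Return {name: int value, or None if the file is missing or unreadable}."""
    values = {}
    for name in SYN_KNOBS:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values

for name, value in read_syn_sysctls().items():
    print(f"net.ipv4.{name} = {value}")
```

The same values can be read with `sysctl net.ipv4.tcp_max_syn_backlog` and so on; the script just collects them in one place.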


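Today a QA colleague reported that Chrome on a Mac showed a garbled certificate when accessing our https service and could not open the page.
Chrome version: 54.0.2840.59 (64-bit)

I asked the colleague to try another https environment, and that one worked fine.

Comparing the certificates of the two environments: the working one runs Tengine, which does not support 8192-bit certificates, so it uses a 4096-bit one.
The failing one runs OpenResty with an 8192-bit certificate, so I suspected that was the cause.

After replacing the OpenResty certificate with a 4096-bit one, the colleague could access the service normally.

From what we have seen so far, 8192-bit certificates are not usable: Tengine does not support them, and neither does Chrome on Mac. Let's stick with 4096 bits for now.

Noted for future reference.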