Goroutine Leaks and How to Fix Them

兰玉磊

5 years ago

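Overview

Goroutines in Go are lightweight, and creating thousands or even tens of thousands of them is not a problem. The trouble starts when their number keeps growing because they never exit and never release their resources. This article walks through realistic goroutine leak scenarios and discusses how to fix them.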

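Root causes

A goroutine leak typically has one of the following causes:

The goroutine blocks forever on a channel because the goroutine at the other end (the reader or the writer) has exited, so it keeps holding its resources and can never exit.
The goroutine is stuck in an infinite loop, so its resources are never released.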

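How a goroutine terminates

A goroutine terminates in one of the following situations:

It has completed its work.
It hits an error it cannot handle.
Another goroutine tells it to stop.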

Goroutine leaks in practice

Producer-consumer scenario

Code

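package main

import (
	"fmt"
	"math/rand"
	"os"
	"runtime"
	"time"
)

func main() {
	newRandStream := func() <-chan int {
		randStream := make(chan int)

		go func() {
			// Neither deferred call ever runs, because the loop below never returns.
			defer fmt.Println("newRandStream closure exited.")
			defer close(randStream)
			// Infinite loop: keep sending values until nobody receives
			// and the send blocks forever.
			for {
				randStream <- rand.Int()
			}
		}()

		return randStream
	}

	randStream := newRandStream()
	fmt.Println("3 random ints:")

	// Consume only 3 values, then move on to other work. The producer is
	// now blocked; unless the main goroutine deals with it, it has leaked.
	for i := 1; i <= 3; i++ {
		fmt.Printf("%d: %d\n", i, <-randStream)
	}

	// Print the goroutine count, wait, and print it again:
	// the blocked producer is still there.
	fmt.Fprintf(os.Stderr, "%d\n", runtime.NumGoroutine())
	time.Sleep(10 * time.Second)
	fmt.Fprintf(os.Stderr, "%d\n", runtime.NumGoroutine())
}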

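The producer goroutine runs an infinite loop that keeps generating values. The consumer, which here is the main goroutine, reads only 3 of them and then moves on to other work without ever reading from the channel again. The producer tries to send one more value, but no goroutine will ever receive it, so the send blocks. Unless someone reads from the channel again, the producer goroutine has leaked.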

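Solution

In general, to fix a goroutine leak caused by a channel, look at every point where a goroutine can block on a channel and ask whether the block is temporary or whether the goroutine might never get a chance to run again. If it might never run again, it can leak. So decide how a goroutine will terminate at the moment you create it.

The usual fix is for the main goroutine to notify the producer goroutine when it no longer needs it. On receiving the notification, the producer cleans up and exits.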

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	newRandStream := func(done <-chan struct{}) <-chan int {
		randStream := make(chan int)

		go func() {
			defer fmt.Println("newRandStream closure exited.")
			defer close(randStream)

			for {
				select {
				case randStream <- rand.Int():
				case <-done: // notified: stop and exit
					return
				}
			}
		}()

		return randStream
	}

	done := make(chan struct{})
	randStream := newRandStream(done)
	fmt.Println("3 random ints:")

	for i := 1; i <= 3; i++ {
		fmt.Printf("%d: %d\n", i, <-randStream)
	}

	// Tell the producer goroutine to stop. Sending a value
	// (done <- struct{}{}) would also work; closing is simpler.
	close(done)
	// Simulate ongoing work
	time.Sleep(1 * time.Second)
}

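In the code above, the goroutine learns through a channel that it should stop, which lets it clean up and prevents the leak. The notification can be an empty struct sent on the channel; a simpler way is to close the channel, as the code above does.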

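Master-worker scenario

In this scenario we usually split a job into several sub-tasks and hand each sub-task to its own goroutine. Handled carelessly, this can also leak goroutines. Here is a concrete example:

Code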

package main

import (
	"fmt"
	"runtime"
	"time"
)

// worker_adder sums the numbers in s and sends the result on c.
func worker_adder(s []int, c chan int) {
	sum := 0
	for _, v := range s {
		sum += v
	}
	// Send the sum back to the caller; blocks until someone receives.
	c <- sum
	fmt.Println("end")
}

func main() {
	s := []int{7, 2, 8, -9, 4, 0}

	c1 := make(chan int)
	c2 := make(chan int)

	// Start one worker goroutine per half of the slice.
	go worker_adder(s[:len(s)/2], c1)
	go worker_adder(s[len(s)/2:], c2)

	// Receiving from both channels (x, y := <-c1, <-c2) would avoid the
	// leak; here only c1 is read, so the second worker blocks forever.
	x := <-c1
	// Print the value received from the channel.
	fmt.Println(x)

	fmt.Println(runtime.NumGoroutine())
	time.Sleep(10 * time.Second)
	fmt.Println(runtime.NumGoroutine())
}

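In the code above, the main goroutine splits a slice into two halves and hands each half to a worker goroutine to sum. The two workers send their results back to the main goroutine over channels. However, the main goroutine only receives from one of the channels, so the other worker blocks on its send and never gets to run again. The problem becomes more obvious when the same code runs inside a long-running service:

HTTP server scenario

Code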

package main

import (
	"fmt"
	"log"
	"net/http"
	"runtime"
	"time"
)

// sumInt adds up the numbers in s and sends the result on c.
func sumInt(s []int, c chan int) {
	sum := 0
	for _, v := range s {
		sum += v
	}
	c <- sum
}

// HTTP handler for /sum
func sumConcurrent2(w http.ResponseWriter, r *http.Request) {
	s := []int{7, 2, 8, -9, 4, 0}

	c1 := make(chan int)
	c2 := make(chan int)

	go sumInt(s[:len(s)/2], c1)
	go sumInt(s[len(s)/2:], c2)

	// Deliberately never read from c2, so the goroutine sending on c2 blocks.
	x := <-c1

	// write the response.
	fmt.Fprint(w, x)
}

func main() {
	// Print the current goroutine count once per second.
	StasticGroutine := func() {
		for {
			time.Sleep(1 * time.Second)
			total := runtime.NumGoroutine()
			fmt.Println(total)
		}
	}

	go StasticGroutine()

	http.HandleFunc("/sum", sumConcurrent2)
	err := http.ListenAndServe(":8001", nil)
	if err != nil {
		log.Fatal("ListenAndServe: ", err)
	}
}

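Run the program above and open this URL in a browser:

http://127.0.0.1:8001/sum

Keep refreshing the page to send more requests. The number the program prints every second is the goroutine count of our HTTP server: it goes up by one with each request and never comes down, which means goroutines are leaking.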

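Solution

The fix is to make sure that, whatever happens, some goroutine is always able to complete the other side of each channel operation, so that no goroutine blocks forever. The handler changes as follows: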

...
	x, y := <-c1, <-c2

	// write the response.
	fmt.Fprint(w, x+y)
...

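How to debug and detect goroutine leaks

runtime

The runtime.NumGoroutine() function returns the number of goroutines that currently exist in the process. By watching how this number changes over time, we can tell whether a goroutine leak is happening.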

...
	fmt.Fprintf(os.Stderr, "%d\n", runtime.NumGoroutine())
	time.Sleep(10 * time.Second) // wait a while, then check how the count has changed
	fmt.Fprintf(os.Stderr, "%d\n", runtime.NumGoroutine())
...

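Using pprof to locate the leak

Once we know that a goroutine leak exists, we need to find out where it comes from. A handler that dumps the goroutine stacks on request helps here: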

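package main

import (
	"log"
	"net/http"
	"runtime/debug"
	"runtime/pprof"
)

// getStackTraceHandler writes the stack of the current goroutine,
// followed by the stacks of all goroutines, to the response.
func getStackTraceHandler(w http.ResponseWriter, r *http.Request) {
	stack := debug.Stack()
	w.Write(stack)
	// Debug level 2 prints every goroutine's stack in panic-style format.
	pprof.Lookup("goroutine").WriteTo(w, 2)
}

func main() {
	http.HandleFunc("/_stack", getStackTraceHandler)
	// Start the server so the handler is reachable; the port is the one
	// used in the earlier example.
	log.Fatal(http.ListenAndServe(":8001", nil))
}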

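Summary

Goroutine leaks are usually caused by a goroutine blocking on a channel or getting stuck in an infinite loop, and they hurt most in long-running background services. When working with channels and goroutines, keep the following in mind:

When you create a goroutine, decide up front how it will end.
When you use a channel, consider how the goroutine behaves if the channel operation blocks.
Watch for the common leak scenarios, including the master-worker and producer-consumer patterns.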
