A Study of the Mining Dilemma in Proof of Work Based on the Policy Gradient Algorithm

  Abstract: To address the mining dilemma caused by block withholding attacks under the Proof of Work (PoW) consensus mechanism in blockchain, the game between mining pools was modeled as an Iterated Prisoner's Dilemma (IPD), and the policy gradient algorithm of deep reinforcement learning was used to study strategy selection in the IPD. Each mining pool was treated as an independent Agent, and the miners' infiltration rate was quantified as the action distribution in reinforcement learning. The policy network of the policy gradient algorithm was used to predict and optimize each Agent's behavior so as to maximize the miners' average revenue, and the effectiveness of the algorithm was validated through simulation experiments. The experiments show that the mining pools attack each other at the beginning, with the miners' average revenue below 1, which is the Nash equilibrium of the dilemma. After self-adjustment by the policy gradient algorithm, the pools shift from mutual attack to mutual cooperation, the infiltration rate of each pool tends to 0, and the miners' average revenue tends to 1. The results indicate that the policy gradient algorithm can resolve the Nash equilibrium problem of the mining dilemma and maximize the miners' average revenue.
  Key words: blockchain; Proof of Work (PoW); game theory; deep reinforcement learning; policy gradient algorithm
  CLC number: TP183
  Document code: A
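  0 Introduction
  Blockchain is the underlying technology of cryptocurrencies such as Bitcoin [1]. Bitcoin, the most successful application of blockchain, records its transactions under the Proof of Work (PoW) consensus mechanism. In the Bitcoin system, every node takes part in producing blocks and must supply a certain amount of PoW; the node that produces a block first receives a Bitcoin reward. This process is called "mining", and the participating nodes are called "miners". By design, the Bitcoin system produces roughly one block every 10 min, which means that most miners never find a block. To obtain a relatively stable income, miners choose to join mining pools and mine cooperatively. A pool consists of a pool manager and a number of miners. The miners continually send partial proofs of work or full proofs of work to the manager, and the manager distributes the revenue in proportion to each member's contributed work.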
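  The distinction between partial and full proofs of work, and the proportional payout, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the two targets, the single SHA-256 hash, and the function names `classify_nonce`, `mine`, and `split_reward` are hypothetical and are not Bitcoin's real parameters or any pool's actual protocol.

```python
import hashlib

# Hypothetical toy targets (not Bitcoin's real difficulty): a hash below
# SHARE_TARGET counts as a partial proof of work (a "share"); a hash below
# the much smaller BLOCK_TARGET is a full proof of work (a valid block).
BLOCK_TARGET = 1 << 236
SHARE_TARGET = 1 << 248


def classify_nonce(header: bytes, nonce: int) -> str:
    """Classify one hashing attempt as 'full', 'partial', or 'none'."""
    digest = hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()
    value = int.from_bytes(digest, "big")
    if value < BLOCK_TARGET:
        return "full"
    if value < SHARE_TARGET:
        return "partial"
    return "none"


def mine(header: bytes, attempts: int) -> dict:
    """Count how many of the first `attempts` nonces fall in each class."""
    counts = {"full": 0, "partial": 0, "none": 0}
    for nonce in range(attempts):
        counts[classify_nonce(header, nonce)] += 1
    return counts


def split_reward(shares: dict, reward: float) -> dict:
    """Split a block reward in proportion to each miner's submitted shares."""
    total = sum(shares.values())
    return {miner: reward * n / total for miner, n in shares.items()}
```

  Because the block target is far smaller than the share target, partial proofs are found much more often than full ones, which is what lets the manager estimate each member's work between blocks.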
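  However, some miners send only partial proofs of work to the manager and discard any full proof of work they find, so they collect a share of the pool's revenue without contributing effective mining power. This behavior is called a block withholding attack [2]. A pool can send its own miners to infiltrate other pools and launch block withholding attacks against them to increase its own revenue, but when all pools attack one another, their revenue is lower than when none attacks. This is the mining dilemma arising from a loophole in PoW consensus, and it can be viewed as a prisoner's dilemma in game theory. The game has a Nash equilibrium: no party can increase its revenue by unilaterally changing its own strategy [3]. The central question of this paper is how to optimize the pools' choice of behavior under the PoW consensus mechanism so as to increase their average revenue per miner and thereby resolve the mining dilemma caused by block withholding attacks.
  1 Related Work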
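  Many researchers have proposed game models for the mining problems of the PoW-based Bitcoin system.
  In 2014, Eyal et al. [4] showed that the Bitcoin system is in fact vulnerable: miners can follow a strategy called selfish mining, in which they keep mining a private chain without publishing it and release it only once it is longer than the public chain, so that the public chain is discarded and the computing power of the "honest" miners is wasted. This is the familiar blockchain "fork" problem. In response to this vulnerability, Kiayias et al. [5] simplified Bitcoin mining into a stochastic game with complete information, in which miners control the length of the main chain by choosing when to release the blocks they have mined. They proposed the Frontier strategy (a miner publishes each block as soon as it is mined and appends it to the longest chain) and analyzed, under different experimental settings, the levels of mining power for which Frontier is the best strategy. Lewenberg et al. [6] analyzed miners' choice of pool from the perspective of cooperative games, treating the members of a pool as a coalition in which miners increase their revenue by switching pools. Liu et al. [7] proposed an evolutionary game model in which miners compute the revenue of joining each pool in advance and then decide which pool to join.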
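  The studies above model Bitcoin mining as a game from different angles, but they do not consider pools attacking one another, that is, the mining dilemma that arises under the PoW consensus mechanism. In 2015, Eyal [3] studied the mining dilemma caused by block withholding attacks. Starting from the two cases of mutual attack between two pools and among multiple pools, he analyzed the game between pools qualitatively, modeled it as an Iterated Prisoner's Dilemma (IPD), and proved with Nash equilibrium theory that the pools' revenue decreases when they attack one another, which pushes pools toward a closed, stable state. Building on this work, Tang et al. [8] further studied the pure-strategy and mixed-strategy equilibria of the dilemma and used the Zero Determinant (ZD) strategy to optimize the block withholding game.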
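  The prisoner's dilemma structure of the two-pool case can be checked numerically with a small sketch. The revenue model below is an assumption written in the spirit of the two-pool analysis in [3], not a formula taken from this text: total mining power is normalized to 1, infiltrating power earns shares in the victim pool but never publishes blocks, and each pool divides its income equally over its own miners plus the infiltrators it hosts. The function name `revenue_densities` and the example values (0.3, 0.03) are hypothetical.

```python
def revenue_densities(m1, m2, x12, x21):
    """Per-miner revenue (r1, r2) of two pools under block withholding.

    m1, m2: pool sizes as fractions of total mining power (total = 1)
    x12:    power pool 1 uses to infiltrate pool 2
    x21:    power pool 2 uses to infiltrate pool 1
    With no attacks, both densities equal 1.
    """
    effective = 1.0 - x12 - x21        # power that still publishes blocks
    R1 = (m1 - x12) / effective        # pool 1's fraction of found blocks
    R2 = (m2 - x21) / effective        # pool 2's fraction of found blocks
    # Income balance of each pool (assumed model):
    #   r1 * (m1 + x21) = R1 + x12 * r2
    #   r2 * (m2 + x12) = R2 + x21 * r1
    # Solved as a 2x2 linear system by Cramer's rule.
    a, b = m1 + x21, m2 + x12
    det = a * b - x12 * x21
    r1 = (R1 * b + x12 * R2) / det
    r2 = (R2 * a + x21 * R1) / det
    return r1, r2


if __name__ == "__main__":
    for x12, x21 in [(0.0, 0.0), (0.03, 0.0), (0.0, 0.03), (0.03, 0.03)]:
        r1, r2 = revenue_densities(0.3, 0.3, x12, x21)
        print(f"x12={x12:.2f} x21={x21:.2f} -> r1={r1:.4f} r2={r2:.4f}")
```

  Under these assumed numbers, a pool that attacks alone earns more than 1 per miner, the attacked pool earns less than 1, and when both attack both earn less than 1, which is the payoff ordering of a prisoner's dilemma.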
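  Building on [3], this paper establishes a game model between mining pools and uses the policy gradient algorithm of deep reinforcement learning [9-10] to optimize the pools' game behavior, which increases the miners' average revenue.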
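  As a rough illustration of what a policy gradient update does, the sketch below implements one REINFORCE-style step [10] for a softmax policy over a small discrete set of actions. It is a minimal sketch under stated assumptions and not the paper's policy network: the tabular logits, the discrete action set, the `baseline` and `lr` parameters, and the function names are all hypothetical.

```python
import math


def softmax(logits):
    """Turn a list of logits into action probabilities."""
    peak = max(logits)                       # subtract max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, baseline=0.0, lr=0.1):
    """Return updated logits after one REINFORCE gradient-ascent step.

    For a softmax policy, the gradient of log pi(action) with respect to
    logit k is (1 if k == action else 0) - pi_k. The step scales that
    gradient by the advantage (reward - baseline).
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        z + lr * advantage * ((1.0 if k == action else 0.0) - p)
        for k, (z, p) in enumerate(zip(logits, probs))
    ]
```

  For example, if the actions index candidate infiltration rates and the reward is the pool's per-miner revenue, an action whose reward exceeds the baseline becomes more probable after the step, and one whose reward falls below it becomes less probable.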
  References
  [1]     NAKAMOTO S. Bitcoin: a peer-to-peer electronic cash system [EB/OL]. [2017-10-10]. https://bitcoin.org/bitcoin.pdf.
  [2]     COURTOIS N T, BAHACK L. On subversive miner strategies and block withholding attack in bitcoin digital currency [J/OL]. arXiv Preprint, 2014, 2014: arXiv:1402.1718 (2014-01-28) [2014-12-02]. https://arxiv.org/abs/1402.1718.
  [3]     EYAL I. The miner's dilemma [C]// Proceedings of the 2015 IEEE Symposium on Security and Privacy. Piscataway, NJ: IEEE, 2015: 89-103.
  [4]     EYAL I, SIRER E G. Majority is not enough: bitcoin mining is vulnerable [C]// FC 2014: International Conference on Financial Cryptography and Data Security. Berlin: Springer, 2014: 436-454.
  [5]     KIAYIAS A, KOUTSOUPIAS E, KYROPOULOU M, et al. Blockchain mining games [C]// Proceedings of the 2016 ACM Conference on Economics and Computation. New York: ACM, 2016: 365-382.
  [6]     LEWENBERG Y, BACHRACH Y, SOMPOLINSKY Y, et al. Bitcoin mining pools: a cooperative game theoretic analysis [C]// Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, 2015: 919-927.
  [7]     LIU X, WANG W, NIYATO D, et al. Evolutionary game for mining pool selection in blockchain networks [J]. IEEE Wireless Communications Letters, 2017, 7(5): 760-763.
  [8]     TANG C B, YANG Z, ZHENG Z L, et al. Game dilemma analysis and optimization of PoW consensus algorithm [J]. Acta Automatica Sinica, 2017, 43(9): 1520-1531. (in Chinese)
  [9]     SUTTON R S, McALLESTER D, SINGH S, et al. Policy gradient methods for reinforcement learning with function approximation [C]// NIPS 2000: Neural Information Processing Systems. Boston: MIT Press, 2000: 1057-1063.
  [10]    WILLIAMS R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning [J]. Machine Learning, 1992, 8(3/4): 229-256.
  [11]    MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning [J]. Nature, 2015, 518(7540): 529-533.
  [12]    TAMPUU A, MATIISEN T, KODELJA D, et al. Multiagent cooperation and competition with deep reinforcement learning [J]. PLoS One, 2017, 12(4): e0172395.
  [13]    LILLICRAP T P, HUNT J J, PRITZEL A, et al. Continuous control with deep reinforcement learning [J/OL]. arXiv Preprint, 2015, 2015: arXiv:1509.02971 [2015-09-09]. https://arxiv.org/abs/1509.02971.
  [14]    WANG B T, ZHANG Z Q, ZHAO P F. Numerical Analysis Concise Tutorial (University Mathematics Series) [M]. Beijing: Tsinghua University Press, 2012: 50-60. (in Chinese)