The Application of OpenStack Virtual Cloud Desktop in the Ctrip Call Center
OpenStack is currently the most mainstream and popular cloud platform. Ctrip's OpenStack environment is used not only for the Ctrip website but also, extensively, for the desktop cloud system of the Ctrip call center. As one of the industry's leading call centers, the Ctrip service contact center has tens of thousands of employees providing 24-hour service, 365 days a year, to users around the world, so that travelers can set off without worries.
Desktop cloud greatly improves the efficiency of IT operation and maintenance, significantly reduces the user failure rate, and is a major trend in the future of IT. So how does Ctrip combine the two effectively in its call center?
This article describes the desktop cloud system widely used in the Ctrip call center, introduces the OpenStack-based cloud desktop architecture and some OpenStack-related problems encountered during development, and covers the operation and maintenance, monitoring, and automated testing of the cloud desktop system.
I. Why use virtual cloud desktop
1. Background
The Ctrip call center, also known as the service contact center, is one of Ctrip's core departments, with tens of thousands of employees who serve Ctrip users around the world 24 hours a day, 7 days a week, all year round. In the past, call center seats used desktop PCs. As the business grew, the PC maintenance workload multiplied, and large amounts of manpower, materials, and money had to be invested to keep the system running stably. For this reason, Ctrip officially introduced virtual cloud desktop.
What is virtual cloud desktop? As shown in the figure, the user's desktop PC is replaced by a thin client (TC), and all CPU, memory, and disk resources live in the cloud. The cloud hosts the virtual machines, and the user connects to a virtual machine through the thin client to use Windows. The virtual machines are implemented with QEMU and KVM, the cloud environment is managed by OpenStack, and the remote desktop protocol is a third party's heavily customized version of the SPICE protocol.
2. Advantages of cloud desktop
First, operation and maintenance cost. Deploying a PC and installing its system software takes a long time, whereas the cloud desktop back end can automatically deliver a virtual machine to a user within 5 minutes. Expanding a PC deployment requires heavy investment, whereas cloud desktop only needs a small number of additional servers connected to the cloud system to expand quickly.
Second, fault-handling efficiency. If a PC has a problem, a technician may need to go to the user's desk and open the case to inspect it. Troubleshooting takes a long time, and if a serious hardware problem requires replacement parts, the wait is even longer. The cloud desktop standard is to resolve a fault within 5 minutes; if a problem cannot be solved in 5 minutes, the virtual machine is simply replaced in the back end.
Third, operation and maintenance management. PCs are scattered across users' desks, and maintenance needs the users' cooperation (such as leaving the machine on). Cloud desktop provides an operation and maintenance system: once the time and the installation task parameters are set, the system installs and maintains automatically. The thin client is also lightweight and holds no user data, which is convenient for users. Typically, when a user moves to another location, nothing needs to be moved; the user simply logs in at the new location.
Finally, cloud desktop as a whole is low-carbon and environmentally friendly. A thin client draws about as much power as an ordinary energy-saving lamp, an order of magnitude less than a PC.
3. Current status of Ctrip cloud desktop
Ctrip cloud desktop has been deployed to six call centers in Shanghai, Nantong, Rugao, Hefei, Xinyang, and Muling, with hundreds of compute nodes and nearly 10,000 seats, and the scale is still expanding. New call centers are also planned.
At the same time, the failure rates of the cloud desktop platform and the thin clients are far lower than that of PCs. The following figure shows failure-rate statistics from Ctrip's operation and maintenance department.
II. How to implement virtual cloud desktop
1. The original architecture of cloud desktop
The back-end cloud platform for cloud desktop has gone through many iterations in practice; the original architecture is shown in the figure above. Its distinguishing feature is that the customization was done directly inside OpenStack Nova: an interface for allocating virtual machines was added, and thin clients accessed OpenStack directly to obtain virtual machine information.
Under this architecture, the cloud desktop platform can directly access all virtual machine information and operate all virtual machines. The data is also stored centrally in the OpenStack database, which makes deployment convenient. User permissions are controlled directly by OpenStack Keystone. The management interface uses OpenStack Horizon with an added cloud desktop management page.
In the typical use case of allocating a virtual machine, the thin client authenticates through OpenStack Keystone, obtains a token, and then asks Nova for a virtual machine. As shown in the figure above, the thin client first authenticates with Keystone. After confirming that the user exists, Keystone verifies the password against the domain LDAP, and once the user is confirmed to be legitimate it returns a token. The thin client then uses the token to request a virtual machine from Nova.
Nova first checks, based on the seat information configured on the thin client, whether a virtual machine has already been allocated to that seat. If so, it returns that virtual machine directly. If not, it allocates one from the idle virtual machines in the back end, updates the allocation in the database, and returns the remote desktop protocol connection information.
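The seat lookup described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the code that was added to Nova: the class name, the in-memory dictionary standing in for the database, and the VM and seat identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SeatAllocator:
    """Toy model of seat-based allocation (all names are hypothetical)."""
    idle_vms: List[str]                                        # pre-created, unassigned VMs
    assignments: Dict[str, str] = field(default_factory=dict)  # seat -> VM, stands in for the database

    def allocate(self, seat: str) -> Optional[str]:
        # 1. If the seat already has a VM, return the same one.
        vm = self.assignments.get(seat)
        if vm is not None:
            return vm
        # 2. Otherwise take an idle VM and record the allocation.
        if not self.idle_vms:
            return None  # nothing idle; the caller has to wait or retry
        vm = self.idle_vms.pop(0)
        self.assignments[seat] = vm
        return vm


allocator = SeatAllocator(idle_vms=["vm-001", "vm-002"])
first = allocator.allocate("seat-A")   # new seat: takes an idle VM
again = allocator.allocate("seat-A")   # same seat: same VM comes back
other = allocator.allocate("seat-B")   # another seat: next idle VM
```

A real implementation would return the remote desktop connection details rather than a bare VM name, and would need a transaction around the lookup and update so that two thin clients cannot be handed the same idle machine.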
2. Limitations of the original architecture
As the business grew, the original architecture showed some limitations. First, the business logic was tightly bound to OpenStack, so upgrading OpenStack meant rewriting business code, and changing business logic required regression testing of the entire cloud platform.
Second, every user had to be a Keystone user, and user management had to follow the Keystone model. As a result, Keystone and LDAP had to be synchronized regularly, and special users sometimes had to be synchronized by hand.
At the management level, Horizon is oriented toward cloud resource management, whereas our work is mainly oriented toward operation and maintenance. This gap led us to develop a separate portal to make up for it, so administrators had to carry out operation and maintenance through two systems.
In terms of the overall solution, the cloud desktop remote desktop protocol is provided by a third party. If a third-party solution does not support OpenStack, it cannot be used in the Ctrip cloud desktop system.
Finally, user departments have a wide variety of needs. Developing them directly inside OpenStack is difficult and takes a long time to go live, which makes it hard for developers to let technology lead business development.
3. New architecture
After the architecture adjustment, the new architecture decouples OpenStack from our business logic and adapts to the direction in which the user departments' business is developing, which makes it easier to iterate and launch features quickly.
As the figure shows, the cloud desktop business logic has been separated from OpenStack and now lives in two components, VMPool and Allocator. On the management side, we independently developed a portal system for IT operation and maintenance to replace Horizon. The cloud platform can use native OpenStack directly.
VMPool is responsible for maintaining the number of available virtual machines of a given specification, so that users are not kept waiting because no virtual machine is available when one is needed. Allocator serves the requests of users who meet its conditions: it returns the virtual machine that corresponds to the user, or allocates one from the VMPool to the user.
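To make the VMPool idea concrete, here is a minimal Python sketch of a pool that tops itself up to a target number of ready virtual machines. It is an assumption-laden illustration rather than Ctrip's implementation: the class, the `target` parameter, and the `create_vm` callback (a stand-in for whatever cloud API call actually builds a machine) are all hypothetical.

```python
from collections import deque
from itertools import count
from typing import Callable, Deque


class VMPool:
    """Keeps `target` ready VMs of one specification on hand (hypothetical sketch)."""

    def __init__(self, target: int, create_vm: Callable[[], str]) -> None:
        self.target = target
        self.create_vm = create_vm          # stand-in for the real cloud API call
        self.ready: Deque[str] = deque()

    def replenish(self) -> int:
        """Create VMs until the pool is back at its target; return how many were created."""
        created = 0
        while len(self.ready) < self.target:
            self.ready.append(self.create_vm())
            created += 1
        return created

    def take(self) -> str:
        """Hand out one ready VM; raises IndexError if the pool is empty."""
        return self.ready.popleft()


_ids = count(1)
vm_pool = VMPool(target=3, create_vm=lambda: f"vm-{next(_ids):03d}")
created_initially = vm_pool.replenish()   # fills the empty pool
taken = vm_pool.take()                    # a user is given a VM
created_after_take = vm_pool.replenish()  # the pool tops itself up again
```

In practice `replenish` would run periodically or be triggered after each allocation, so that the slow step of creating a virtual machine happens before a user asks for one.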
The typical use case of assigning a virtual machine to a user changes considerably from the original architecture. First, the thin client accesses the business layer's API directly. The API layer authenticates the user directly against LDAP and obtains the user's OU, group, and other information.
Next, the business layer matches the user against rules. Each Allocator uses rules based on the user's group, OU, tags, and so on to decide whether it should serve the user. If the user does not meet the rules defined by one Allocator, the next Allocator is tried in priority order until one matches or the default rule applies.
After matching, if the allocation rule involves a binding relationship, such as a user binding, agent binding, or TC binding, the Allocator returns the existing binding directly from the database. If there is no binding relationship, the Allocator allocates a virtual machine to the user from the corresponding VMPool.
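The matching and allocation steps above can be sketched as a priority-ordered chain. The Python below is one possible reading of that flow, not the production code: it matches on groups only (OU and tags are left out for brevity), assumes a lower priority number is tried first, treats an allocator with no groups as the default rule, and uses plain lists and dictionaries in place of the VMPool and the database. Every name in it is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class User:
    name: str
    groups: Set[str]


@dataclass
class Allocator:
    name: str
    priority: int                                            # assumption: lower number is tried first
    groups: Set[str] = field(default_factory=set)            # empty set acts as the default rule
    idle_vms: List[str] = field(default_factory=list)        # stands in for this allocator's VMPool
    bindings: Dict[str, str] = field(default_factory=dict)   # user name -> VM, stands in for the database

    def matches(self, user: User) -> bool:
        return not self.groups or bool(self.groups & user.groups)

    def allocate(self, user: User) -> Optional[str]:
        if user.name in self.bindings:          # an existing binding is returned as-is
            return self.bindings[user.name]
        if not self.idle_vms:
            return None                         # pool exhausted
        vm = self.idle_vms.pop(0)
        self.bindings[user.name] = vm
        return vm


def pick_allocator(user: User, allocators: List[Allocator]) -> Optional[Allocator]:
    """Try allocators in priority order and return the first whose rule matches."""
    for candidate in sorted(allocators, key=lambda a: a.priority):
        if candidate.matches(user):
            return candidate
    return None


allocators = [
    Allocator("default", priority=100, idle_vms=["std-1", "std-2"]),
    Allocator("vip", priority=10, groups={"vip-agents"}, idle_vms=["vip-1"]),
]
alice = User("alice", {"vip-agents"})
bob = User("bob", {"agents"})

alice_allocator = pick_allocator(alice, allocators)
bob_allocator = pick_allocator(bob, allocators)
alice_vm = alice_allocator.allocate(alice)
alice_vm_again = alice_allocator.allocate(alice)
bob_vm = bob_allocator.allocate(bob)
```

Keeping the rules in data rather than in code is what lets a change to a user's group alter which virtual machine the user receives, without touching the cloud platform itself.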
Finally, for user departments, a user belongs to a group, and the group corresponds to a specific kind of virtual machine. Simply adjusting a user's attributes is enough to have a specific virtual machine assigned, which fully meets the departments' varied needs.
III. Various obstacles encountered in large-scale deployment
1. Software version selection
Before building OpenStack, carry out a requirements analysis to determine what is actually needed. Then choose versions of OpenStack and its related components that satisfy those requirements, to avoid system and virtual machine problems later on.
Based on the business needs of the Ctrip call center, we shortlisted several versions of KVM, QEMU, and Open vSwitch, then selected kernel and libvirt versions compatible with them, and eliminated versions that were unstable or had known problems. We assembled the remaining components into reasonable combinations and ran 7x24 automated user-simulation tests to find the most stable combination that met our needs for production use.
2. Resource overcommit
The overcommit ratio depends heavily on the application scenario. We must first determine whether the workload is CPU intensive, memory intensive, I/O intensive, or storage intensive. After thorough user surveys, we wrote a large number of user-simulation scripts and ran automated tests to choose the most reasonable overcommit ratio.
Our test results show that the main bottleneck is memory. Overcommitting memory too far causes the host to go down with an OOM (out of memory) condition. Windows and Windows applications consume a great deal of memory, especially heavy memory consumers such as Chrome. Although we use KSM (kernel samepage merging), which saves some memory, in the end we could only reach a memory overcommit ratio of 1:1.2.
I/O pressure is most obvious during Windows startup. When a large number of Windows instances boot at the same time, they cause a boot storm. In our extreme-condition test, booting Windows took 40 minutes, disk I/O utilization was at 100%, and each read/write request took 0.2 seconds on average to respond. In a large-scale deployment, therefore, the number of virtual machines booting concurrently must be limited, and the disks must be configured as a multi-disk RAID to provide higher I/O throughput.
Finally, CPU. Overcommitting CPU too far seriously degrades the user experience, but it generally does not bring the host down. Under our test conditions, the user experience began to decline when the overcommit ratio reached 1:2, so the ratio actually used in production is modest.
In the end, our current production environment is sized with memory as the governing limit, with disk and CPU kept within an acceptable range.
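Sizing a host by memory comes down to simple arithmetic, sketched below in Python. Only the 1:1.2 memory ratio comes from the text; the host size (256 GiB), the memory reserved for the host itself (16 GiB), and the per-VM memory (4 GiB) are made-up figures for illustration, and the function name is hypothetical.

```python
import math
from fractions import Fraction


def max_vms_per_host(host_gib: int, reserved_gib: int, vm_gib: int, ratio: Fraction) -> int:
    """Number of VMs a host can carry when memory is the governing limit.

    `ratio` is the memory overcommit ratio: Fraction(6, 5) means 1:1.2.
    """
    usable = (host_gib - reserved_gib) * ratio   # memory that may be promised to guests
    return math.floor(usable / vm_gib)


# Hypothetical host: 256 GiB RAM, 16 GiB kept for the host, 4 GiB per Windows VM.
without_overcommit = max_vms_per_host(256, 16, 4, Fraction(1))
with_overcommit = max_vms_per_host(256, 16, 4, Fraction(6, 5))
```

With these assumed figures, a 1:1.2 ratio raises the host's capacity from 60 to 72 virtual machines. `Fraction` is used instead of a float so the ratio is exact and the floor never lands one short because of rounding.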
3. Network details
The multiple dnsmasq instance problem
Our virtual machines obtain their IP addresses through DHCP. The dnsmasq version on our DHCP server is relatively old. It supports running multiple instances only in a simplistic way and does not truly bind each instance to its virtual interface.
In production we observed that VMs could obtain an IP address but often failed when renewing it. Packet captures showed why: on its first request a virtual machine has no IP address, so it sends the DHCP request by broadcast; when renewing the lease it already has an IP address and knows the DHCP server's address, so it sends a point-to-point unicast request.
On the server side, when multiple dnsmasq instances are running, every instance receives a broadcast packet, so all broadcast requests are answered correctly. With unicast, only the most recently started dnsmasq instance receives the request, so the virtual machine ends up not getting a correct DHCP renewal response. Finally we passed
Copyright © 2011 JIN SHI