The application of OpenStack virtual cloud desktop in the Ctrip call center

OpenStack is currently the most mainstream and popular cloud platform. Ctrip's OpenStack environment is used not only for the Ctrip website but also widely in the desktop cloud system of the Ctrip call center. As one of the industry's leading call centers, the Ctrip service contact center has tens of thousands of employees and serves customers around the world 24 hours a day, 365 days a year, so that travelers can set off without worries.

Desktop cloud greatly improves the efficiency of IT operations and maintenance, significantly reduces the user failure rate, and is a major trend for the future of IT. So how does Ctrip combine OpenStack and desktop cloud effectively in its call center?

This article describes the desktop cloud system widely used in the Ctrip call center, introduces the architecture of the OpenStack-based cloud desktop system and some OpenStack-related problems encountered during development, and covers operation and maintenance, monitoring, and automated testing of the cloud desktop system.

I. Why use virtual cloud desktop

1. Background

The Ctrip call center, that is, the service contact center, is one of Ctrip's core departments, with tens of thousands of employees. They serve Ctrip users around the world 24x7, all year round. In the past, the call center used desktop PCs. As the business grew, the PC maintenance workload doubled, and a great deal of manpower, material and money had to be invested to ensure stable operation of the system. For this reason, Ctrip officially introduced the virtual cloud desktop.

What is a virtual cloud desktop? As shown in the figure, the user's desktop PC is replaced by a thin client (TC). All CPUs, memory and hard disks are in the cloud. The cloud hosts the virtual machines, and the user connects to a virtual machine through the thin client to use Windows. The virtual machines are implemented with QEMU and KVM, the cloud environment is managed by OpenStack, and the remote desktop protocol is a third party's heavily customized version of the SPICE protocol.

2. Advantages of cloud desktop

First, operation and maintenance costs. Deploying a PC and installing its system software takes a long time, whereas the cloud desktop back end can automatically deliver a virtual machine to a user within 5 minutes. Expanding a PC deployment requires a huge investment, whereas a cloud desktop only needs a small number of additional servers connected to the cloud system to expand rapidly.

Second, fault handling efficiency. If a PC has a problem, a technician may need to go to the user's site and open the case to inspect it, so troubleshooting takes a long time. If a serious hardware problem requires replacement parts, the wait is even longer. The cloud desktop standard is to resolve a fault within 5 minutes. For problems that cannot be solved in 5 minutes, the virtual machine is simply replaced in the back end.

Third, operation and maintenance management. PCs are scattered across users' desks, and maintenance needs the users' cooperation (such as leaving the machine on). The cloud desktop provides an operation and maintenance system: just set the time and the installation task parameters, and the system installs and maintains software automatically. In addition, the thin client is lightweight and holds no user data, which is a great convenience for users. Typically, if a user moves to a different location, nothing needs to be moved; the user simply logs in at the new location.

Finally, the cloud desktop as a whole is low-carbon and environmentally friendly. A thin client draws about as much power as an ordinary energy-saving lamp, an order of magnitude less than a PC.

3. Current status of the Ctrip cloud desktop

The Ctrip cloud desktop has been deployed to six call centers in Shanghai, Nantong, Rugao, Hefei, Xinyang and Muling, with hundreds of compute nodes and nearly 10,000 seats. The scale is still expanding, and new call centers are planned.

At the same time, the failure rate of the cloud desktop platform and thin clients is far lower than that of PCs. The following figure shows failure-rate statistics from Ctrip's operation and maintenance department.

II. How the virtual cloud desktop is implemented

1. The original architecture of cloud desktop

The back-end cloud platform of the cloud desktop has gone through many iterations in practice, and the original architecture is shown in the above figure. Its defining feature is that the customization was done directly inside OpenStack Nova: an interface for allocating virtual machines was added, and thin clients accessed OpenStack directly to obtain virtual machine information.

Under this architecture, the cloud desktop platform can directly access all virtual machine information and directly operate on all virtual machines. The data is also stored centrally in the OpenStack database, which is convenient for deployment. User permissions are controlled directly by OpenStack Keystone. The management interface uses OpenStack Horizon with an added cloud desktop management page.

In the typical use case of allocating a virtual machine, the thin client authenticates through OpenStack Keystone, obtains a token, and then asks Nova for a virtual machine. As shown in the above figure, the thin client first authenticates with Keystone. After confirming that the user exists, Keystone verifies the password against the domain LDAP, and returns a token once the user is confirmed to be legitimate. The thin client then uses the token to request a virtual machine from Nova.

Nova first checks, based on the seat information configured on the thin client, whether a virtual machine has already been allocated to this seat. If so, it returns that virtual machine directly. If not, it allocates one from the idle virtual machines in the back end, updates the allocation in the database, and returns the remote desktop protocol connection information.
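The seat lookup described above can be sketched as follows. This is a minimal illustration, not Ctrip's Nova extension: the class names, the in-memory dictionary standing in for the OpenStack database, and the connection string format are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VirtualMachine:
    vm_id: str
    connection_info: str  # hypothetical "host:port" for the remote desktop protocol


@dataclass
class SeatAllocator:
    idle_vms: List[VirtualMachine]  # idle virtual machines waiting in the back end
    # Stands in for the allocation table in the database (assumption).
    assignments: Dict[str, VirtualMachine] = field(default_factory=dict)

    def allocate(self, seat: str) -> Optional[str]:
        """Return connection info for the VM that serves this seat."""
        vm = self.assignments.get(seat)
        if vm is None:
            if not self.idle_vms:
                return None  # no idle VM left to hand out
            vm = self.idle_vms.pop(0)
            self.assignments[seat] = vm  # record the new allocation
        return vm.connection_info
```

Calling `allocate` twice with the same seat returns the same connection information, which is the property the thin client relies on when a user logs in again.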

2. Limitations of the original architecture

As the business grew, the original architecture showed some limitations. First, the business logic was tightly bound to OpenStack, so an OpenStack upgrade meant rewriting business code, and changing business logic required regression testing of the entire cloud platform.

Second, every user had to be a Keystone user, and user management had to follow the Keystone model. This required regular synchronization between Keystone and LDAP, and some special users had to be synchronized manually.

At the management level, Horizon is oriented toward cloud resource management, whereas our business is mainly oriented toward operation and maintenance. This gap led us to develop a new portal to make up for it, so administrators had to carry out operation and maintenance through two systems.

In the overall solution, the remote desktop protocol is provided by a third party. If a third-party solution did not support OpenStack, it could not be used in the Ctrip cloud desktop system.

Finally, user departments have diverse needs. Developing directly inside OpenStack is difficult and takes a long time to go live, which makes it hard for developers to let technology lead business development.

3. New architecture

After the architecture was adjusted, the new architecture decouples OpenStack from our business logic and follows the business direction of the user departments, which makes it easier to iterate and launch features quickly.

As the figure shows, the cloud desktop business logic has been separated from OpenStack into two components, VMPool and Allocator. On the management side, an independently developed portal for IT operation and maintenance replaces Horizon. The cloud platform can therefore run native OpenStack directly.

Among them, VMPool is responsible for maintaining the number of available virtual machines of a given specification, so that users never have to wait because no virtual machine is available when one is needed. Allocator handles the user requests that match its rules: it returns the virtual machine already assigned to the user, or allocates one to the user from the VMPool.
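The pool's job of keeping a fixed number of idle virtual machines ready can be sketched as below. The class shape, the `target_available` parameter and the `create_vm` callback (which would wrap whatever cloud API call actually creates a machine) are assumptions made for illustration.

```python
from typing import Callable, List


class VMPool:
    """Keeps at least `target_available` idle VMs of one specification."""

    def __init__(self, spec: str, target_available: int,
                 create_vm: Callable[[str], str]):
        self.spec = spec
        self.target_available = target_available
        self._create_vm = create_vm  # hypothetical hook that creates one VM and returns its id
        self.available: List[str] = []

    def replenish(self) -> int:
        """Create VMs until the idle count reaches the target; return how many were created."""
        created = 0
        while len(self.available) < self.target_available:
            self.available.append(self._create_vm(self.spec))
            created += 1
        return created

    def take(self) -> str:
        """Hand out one idle VM; the caller is expected to call replenish() afterwards."""
        if not self.available:
            raise LookupError(f"no idle VM of spec {self.spec!r}")
        return self.available.pop(0)
```

In a real deployment `replenish` would presumably run periodically or after each `take`, so that creation time is paid in the background rather than while a user waits.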

The typical use case of allocating a virtual machine to a user changes considerably from the original architecture. First, the thin client accesses the business layer API directly. The API layer authenticates the user directly against LDAP and obtains the user's OU, group and other information.

Then the business layer matches the user against rules. Each Allocator matches on user group, OU, tags and so on to decide whether it should serve the user. If the user does not meet the rules defined by one Allocator, the next Allocator is tried in priority order until one matches or the default rule applies.

After matching, if the allocation rule involves a binding relationship, such as user binding, agent binding or TC binding, the Allocator returns the existing binding directly from the database. If there is no binding relationship, the Allocator allocates a virtual machine from the corresponding VMPool.
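The priority-ordered matching can be sketched as below. Only group matching and user binding are modeled; the field names, the convention that a lower `priority` number is tried first, and the use of an empty rule set as the default rule are assumptions rather than details from the article.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class User:
    name: str
    ou: str
    groups: Set[str] = field(default_factory=set)


@dataclass
class Allocator:
    name: str
    priority: int            # lower number is tried first (assumption)
    match_groups: Set[str]   # empty set acts as the default rule and matches everyone
    pool: List[str]          # idle VM ids from the corresponding VMPool
    bindings: Dict[str, str] = field(default_factory=dict)  # user name -> VM id

    def matches(self, user: User) -> bool:
        return not self.match_groups or bool(self.match_groups & user.groups)

    def allocate(self, user: User) -> Optional[str]:
        if user.name in self.bindings:
            return self.bindings[user.name]  # existing binding wins
        if not self.pool:
            return None
        vm = self.pool.pop(0)
        self.bindings[user.name] = vm        # remember the user binding
        return vm


def dispatch(user: User, allocators: List[Allocator]) -> Optional[str]:
    """Try allocators in priority order and use the first one whose rules match."""
    for allocator in sorted(allocators, key=lambda a: a.priority):
        if allocator.matches(user):
            return allocator.allocate(user)
    return None
```

Because matching depends only on user attributes, moving a user to a different group is enough to route them to a different pool of virtual machines.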

Finally, from the user department's point of view, each user belongs to a group, and each group corresponds to a specific kind of virtual machine. Simply by adjusting user attributes, a user can be allocated a specific virtual machine, which fully meets the departments' various needs.

III. Obstacles encountered in large-scale deployment

1. Software version selection

Before building OpenStack, it is necessary to analyze requirements and determine what is actually needed. Then select versions of OpenStack and related components that satisfy those requirements, to avoid system and virtual machine problems later on.

Based on the business needs of the Ctrip call center, we shortlisted several versions of KVM, QEMU and Open vSwitch, selected the kernel and libvirt versions compatible with them, and eliminated unstable versions and versions with known problems. We then assembled the remaining components into candidate combinations and ran 24x7 automated user-simulation tests to find the most stable combination that met our needs for production use.

2. Resource overcommit

The overcommit ratio depends heavily on the application scenario. We must first determine whether the workload is CPU intensive, memory intensive, I/O intensive or storage intensive. After thorough user surveys, we wrote a large number of user-simulation scripts and ran automated tests to choose the most reasonable overcommit ratio.

Our test results show that the bottleneck is mainly memory. Overcommitting memory too far causes the host to go down directly with OOM (out of memory). Windows and Windows applications consume a lot of memory, especially programs such as Chrome, which are among the heaviest consumers. Although we use KSM (kernel samepage merging), which saves some memory, in the end we could only reach an overcommit ratio of 1:1.2.
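To make the 1:1.2 memory ratio concrete, the following sketch computes how many virtual machines fit on one host. The host size, the reserved memory and the per-VM memory in the usage note are illustrative numbers, not figures from Ctrip's environment, and the function name is hypothetical.

```python
def max_vm_count(host_mem_mb: int, vm_mem_mb: int,
                 overcommit_percent: int = 120, reserved_mb: int = 0) -> int:
    """Number of VMs a host can carry at a given memory overcommit ratio.

    overcommit_percent=120 corresponds to a ratio of 1:1.2, meaning the VMs
    may be promised 1.2x the physical memory left after the host's own reserve.
    Integer arithmetic is used to avoid floating-point rounding surprises.
    """
    if vm_mem_mb <= 0:
        raise ValueError("vm_mem_mb must be positive")
    usable_mb = max(host_mem_mb - reserved_mb, 0)
    promised_mb = usable_mb * overcommit_percent // 100
    return promised_mb // vm_mem_mb
```

For example, a hypothetical host with 256 GB of memory, 16 GB reserved for the host itself, and 4 GB per virtual machine carries 72 machines at 1:1.2, compared with 60 without overcommit.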

For I/O, the pressure is most visible during the Windows startup phase. When a large number of Windows instances start at the same time, the result is a boot storm. In our extreme-condition test, it took 40 minutes to start Windows, disk I/O utilization was at 100%, and each read or write request took an average of 0.2 seconds to respond. Therefore, in large-scale deployment, the number of virtual machines booting concurrently must be limited. At the same time, the disks must be in a multi-disk RAID array to provide higher I/O throughput.
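One common way to cap concurrent boots is a counting semaphore, sketched below. This is a generic illustration rather than the mechanism Ctrip used: `boot_vm` is a hypothetical callback that starts one virtual machine and returns when it is up, and the `peak` counter exists only so the limit can be observed.

```python
import threading
from typing import Callable, Iterable


class BootThrottle:
    """Caps the number of VM boots in flight at any moment."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0  # highest concurrency observed, useful for monitoring

    def boot(self, vm_id: str, boot_vm: Callable[[str], None]) -> None:
        with self._slots:  # blocks while max_concurrent boots are already running
            with self._lock:
                self._running += 1
                self.peak = max(self.peak, self._running)
            try:
                boot_vm(vm_id)  # hypothetical hook: start the VM and wait for it
            finally:
                with self._lock:
                    self._running -= 1


def boot_all(vm_ids: Iterable[str], boot_vm: Callable[[str], None],
             throttle: BootThrottle) -> None:
    """Request every boot at once; the throttle spreads them out over time."""
    threads = [threading.Thread(target=throttle.boot, args=(vm_id, boot_vm))
               for vm_id in vm_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The right value for `max_concurrent` would have to come from the same kind of load testing described above, since it depends on the disk array's throughput.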

Finally, CPU. Overcommitting CPU too far seriously degrades the user experience, but it generally does not bring the host down. Under our test conditions, the user experience began to decline once the overcommit ratio reached 1:2, so the ratio actually used in production is modest.

In the end, our current production environment is sized by memory, with disk and CPU load kept within an acceptable range.

3. Network details

The multiple dnsmasq instances problem

Our virtual machines obtain their IP addresses through DHCP. The dnsmasq version we use on the DHCP server is relatively old. It supports running multiple instances only in a simplistic way and does not actually bind each instance to its virtual interface.

In the production environment, we observed that VMs could obtain an IP address but often failed when renewing it. Packet capture analysis showed why. When a virtual machine requests an IP address for the first time, it has no address yet, so it sends the DHCP request by broadcast. When renewing the lease, it already has its own IP address and the DHCP server's address, so it sends a point-to-point unicast request.

On the server side, when multiple dnsmasq instances are running, every instance receives a broadcast packet, so all broadcast requests are answered correctly. With unicast, only the dnsmasq instance that was started last receives the request, so the virtual machine does not get a correct DHCP renewal response. Finally we passed

Copyright © 2011 JIN SHI