# Service Discovery

## Candidates

### [[Consul]] by [[HashiCorp]]

*Consul uses service identities and traditional networking practices to help organizations securely connect applications running in any environment.*

- [consul.io](https://www.consul.io/)
- GitHub: [hashicorp/consul](https://github.com/hashicorp/consul)
- [[Rust]] client libraries are dead
    - [consul-rust](https://github.com/pierresouchay/consul-rust) last updated in 2021
    - [async-consul](https://github.com/nuclearfurnace/async-consul) last updated in 2021
- Has nice [docs](https://developer.hashicorp.com/consul/docs)
- MPL 2.0 License

### [[etcd]]

*A distributed, reliable key-value store for the most critical data of a distributed system*

- [etcd.io](https://etcd.io/)
- Written in [[Go]]
- GitHub: [etcd-io/etcd](https://github.com/etcd-io/etcd)
- Interactive [play](http://play.etcd.io/play) website
- Rust client libraries:
    - Well-maintained `etcd-client` [crate](https://crates.io/crates/etcd-client) ([docs](https://docs.rs/etcd-client/0.12.1/etcd_client/), GitHub: [etcdv3/etcd-client](https://github.com/etcdv3/etcd-client)); basic usage sketched below, after the comparison intro
        - *An etcd v3 API client for Rust. It provides asynchronous client backed by tokio and tonic.*
    - Maintained MadSim `etcd-client` simulator [crate](https://crates.io/crates/madsim-etcd-client)

> etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system or cluster of machines. It gracefully handles leader elections during network partitions and can tolerate machine failure, even in the leader node.

### [[TiKV]]

*TiKV is a highly scalable, low latency, and easy to use key-value database.*

- [tikv.org](https://tikv.org/)
- GitHub: [tikv/tikv](https://github.com/tikv/tikv)
- Written in [[Rust]]
- Raft implementation `raft-rs` ([crate](https://crates.io/crates/raft-rs), no docs, [repo](https://github.com/tikv/raft-rs))

## Comparisons

### (GitHub gist) yurishkuro: [Etcd vs Consul vs Zookeeper](https://gist.github.com/yurishkuro/10cb2dc42f42a007a8ce0e055ed0d171)

### (Hacker News) [comment](https://news.ycombinator.com/item?id=15487084)

> Zookeeper and Consul have very different scopes and therefore architectures. Consul servers are very much like Zookeeper: an odd number of nodes running a consensus protocol to provide a consistent API to a key/value store.
>
> However, Consul builds tons more on top of the servers. Consul has a first class notion of services and health checks. It has agents that run on every node to register services, perform health checks, and provide service discovery to local apps via HTTP or DNS.
>
> For every Consul feature there is probably a similar library or tool to do the same thing with Zookeeper. Consul just chose to focus on the service discovery problem and address it as a first class feature of the project.
>
> Full disclosure: I work for HashiCorp, but not on Consul.

### (Blog post) [Service Discovery – Consul vs ZooKeeper vs etcd](https://www.bizety.com/2019/01/17/service-discovery-consul-vs-zookeeper-vs-etcd/)

All three are similar in architecture, with server nodes that need a quorum to operate, typically via a simple majority. They are highly consistent and expose primitives that can be used via client libraries within applications to build complex distributed systems. All have roughly the same semantics in relation to offering key/value storage. The differences between them are more apparent when they are used for advanced cases.
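For a concrete sense of those primitives, here is a minimal sketch of writing and reading a key with the `etcd-client` crate noted above. It assumes an etcd instance on the default `localhost:2379`, a hypothetical `services/web/instance-1` key, and a `tokio` runtime:

```rust
use etcd_client::{Client, Error};

#[tokio::main]
async fn main() -> Result<(), Error> {
    // Connect to a local etcd endpoint (2379 is etcd's default client port).
    let mut client = Client::connect(["localhost:2379"], None).await?;

    // Write a key/value pair.
    client
        .put("services/web/instance-1", "10.0.0.5:8080", None)
        .await?;

    // Read it back.
    let resp = client.get("services/web/instance-1", None).await?;
    if let Some(kv) = resp.kvs().first() {
        println!("{} -> {}", kv.key_str()?, kv.value_str()?);
    }
    Ok(())
}
```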
Zookeeper, for instance, only has a primitive K/V store, and application developers have to build their own systems on top of it to provide service discovery. Consul, by comparison, offers an opinionated framework for service discovery, which cuts out the guesswork and the need for extra development. Zookeeper has been around the longest; it originated in Hadoop for use in Hadoop clusters. [Developers commend](https://stackshare.io/stackups/consul-vs-etcd-vs-zookeeper) its high performance and the support it offers for Kafka and Spring Boot. etcd is the newest option and is the simplest and easiest to use. Developers who have tried it [say](https://medium.com/@andybons/service-discovery-with-etcd-dc697e65acd9) it is “one of the best-designed tools precisely because of this simplicity”. It is bundled with CoreOS and works as a fault-tolerant key-value store. Consul offers more features than the other two, including a key-value store, a built-in framework for service discovery, health checking, gossip clustering and high availability, in addition to Docker integration. Developers [often cite](https://www.consul.io/intro/vs/zookeeper.html) its first-class support for service discovery, health checking, K/V storage and multiple data centers as reasons for its use.

#### Consul

[Consul](https://www.consul.io/) is distributed, highly scalable and highly available. It is a decentralized, fault-tolerant service developed by HashiCorp (the company behind Vagrant, Terraform, Atlas, Otto, and others) explicitly for service discovery and configuration. A Consul agent is installed on each host and is a first-class cluster participant. This means services don't have to know a discovery address within the network; all discovery requests can be handled through a local address.

Consul distributes information using algorithms based on an eventual consistency model: agents spread information via a gossip protocol, while the servers use the Raft algorithm for leader election. Consul can also be run as a cluster, i.e. a network of related nodes whose running services are registered in discovery. Consul ensures that information about the cluster is distributed between all cluster participants and is available when required. Beyond peer support, there is also support for multi-zone clusters, which means [it is possible both to work with individual data centers and to perform actions across others](https://medium.com/@LogPacker/consul-service-discovery-part-1-167e4b711d8). Agents from one data center can request information from another data center, which helps in building effective distributed systems.

A service can be registered with Consul in two ways: (i) through the HTTP API or an agent configuration file, if the service communicates with Consul itself; (ii) by registering the service as a third-party component, in case it cannot communicate with Consul. (A registration sketch follows the list below.)

[Reasons developers have cited](https://stackshare.io/stackups/consul-vs-etcd-vs-zookeeper) for choosing Consul include:

- Thorough health checking
- Superior service discovery infrastructure
- Distributed key-value store
- Insightful monitoring
- High availability
- Web UI
- Gossip clustering
- Token-based ACLs
- DNS server
- Docker integration
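Since the Rust Consul clients noted at the top are unmaintained, the simplest route from Rust is to call the local agent's HTTP API directly. A minimal sketch, assuming a Consul agent on the default `127.0.0.1:8500` and a hypothetical `web` service with an HTTP health check; `reqwest` (with its `json` feature) and `tokio` are assumed dependencies, not anything the post prescribes:

```rust
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Service definition: name, instance ID, address/port, and an HTTP health check.
    let service = json!({
        "Name": "web",
        "ID": "web-1",
        "Address": "10.0.0.5",
        "Port": 8080,
        "Check": {
            "HTTP": "http://10.0.0.5:8080/health",
            "Interval": "10s"
        }
    });

    // Register with the local Consul agent over its HTTP API.
    reqwest::Client::new()
        .put("http://127.0.0.1:8500/v1/agent/service/register")
        .json(&service)
        .send()
        .await?
        .error_for_status()?;

    println!("web-1 registered with the local agent");
    Ok(())
}
```

A service definition file dropped into the agent's configuration directory covers the other registration path described above.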
#### Zookeeper

[ZooKeeper](https://zookeeper.apache.org/) is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and offering group services. All of these services are used by distributed applications in one form or another, and every time they are implemented, a large amount of manual work goes into fixing bugs and the inevitable race conditions. Applications tend to skimp on them initially because of the amount of work involved, which can make them brittle in the face of change and hard to manage. Different implementations of these services can cause problems and add management complexity even when deployed correctly. Zookeeper is an attempt to solve these challenges by enabling highly reliable distributed coordination.

To achieve high availability with Zookeeper, multiple Zookeeper servers must be used instead of just one; this is called an ensemble. All Zookeeper servers in an ensemble store copies of the data, replicated across the hosts to guarantee high availability. Each server maintains an in-memory image of the state, along with a transaction log in a persistent store, and knows about the other servers in the ensemble. As long as a majority of the servers are available, the Zookeeper service will be available. A leader for the ensemble is chosen via a [leader election recipe](https://medium.com/@gavindya/what-is-zookeeper-db8dfc30fc9b); the leader's job is to maintain consensus, and leader election also happens if an existing leader fails. All update requests go through the leader so they are applied consistently across the ensemble.

Zookeeper maintains a hierarchical structure of nodes, known as znodes. Each znode has data associated with it and may have children connected to it, so the node structure resembles a standard file system. There are two types of znode: persistent znodes and ephemeral znodes.

[Reasons developers have cited](https://stackshare.io/stackups/consul-vs-etcd-vs-zookeeper) for choosing Zookeeper include:

- High performance
- Straightforward generation of node-specific config
- Support for Kafka
- Java-enabled and embeddable in Java services
- Spring Boot support
- Supports DC/OS
- Enables extensive distributed IPC
- Used in Hadoop

Apache Zookeeper is a volunteer-led open source project managed by the Apache Software Foundation.

#### etcd

[etcd](https://coreos.com/etcd/) is a distributed key-value store that offers a reliable way to store data across a cluster of machines, providing shared configuration and service discovery for Container Linux clusters. It is available on GitHub as an open source project. etcd is written in Go and uses the Raft protocol, which specializes in helping multiple nodes maintain identical logs of state-changing commands. Any node in a Raft cluster can be treated as the master; it then works with the others to decide the order in which state changes happen. etcd handles leader elections during network partitions and is able to tolerate machine failure, including failure of the leader.

Application containers running on a cluster can read and write data into etcd; use cases include storing database connection details, cache settings, or feature flags in the form of key-value pairs. The values can be watched, enabling your app to reconfigure itself when or if they change (see the watch sketch below). Advanced uses leverage the consistency guarantees to implement database leader elections or carry out distributed locking across a cluster of workers.
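That watch behaviour maps directly onto the `etcd-client` crate's watch API. A minimal sketch, assuming the same local endpoint as before and a hypothetical `config/feature-flags` key:

```rust
use etcd_client::{Client, Error, EventType};

#[tokio::main]
async fn main() -> Result<(), Error> {
    let mut client = Client::connect(["localhost:2379"], None).await?;

    // Open a watch on a (hypothetical) configuration key.
    // Keeping `_watcher` alive keeps the watch open.
    let (_watcher, mut stream) = client.watch("config/feature-flags", None).await?;

    // Each message carries the events that fired since the last one.
    while let Some(resp) = stream.message().await? {
        for event in resp.events() {
            match event.event_type() {
                EventType::Put => {
                    if let Some(kv) = event.kv() {
                        println!("reconfigure: {} = {}", kv.key_str()?, kv.value_str()?);
                    }
                }
                EventType::Delete => println!("key deleted, falling back to defaults"),
            }
        }
    }
    Ok(())
}
```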
[Kubernetes is built on top of etcd](https://thenewstack.io/about-etcd-the-distributed-key-value-store-used-for-kubernetes-googles-cluster-container-manager/), leveraging the etcd distributed K/V store, as does Cloud Foundry. etcd handles the storage and replication of the data Kubernetes uses across the entire cluster. Thanks to the Raft consensus algorithm, etcd is able to recover from hardware failure and network partitions. It was designed to be the backbone of any distributed system, which is why projects like Kubernetes, Cloud Foundry and Fleet depend on it.

Developers cite choosing etcd for a range of reasons, including:

- Service discovery
- Bundled with CoreOS
- Runs on a range of operating systems, including Linux, OS X and BSD
- Fault-tolerant key-value store
- Simple interface: reads and writes values with curl and other HTTP libraries
- Easy cluster coordination and state management
- Optional SSL client cert authentication
- Optional TTLs for key expiration (see the registration sketch below)
- Properly distributed via the Raft protocol
- Benchmarked at 1000s of writes/s per instance
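Those TTLs are what make etcd practical for service registration: an instance writes its address under a lease and keeps the lease alive while it is healthy, so a crashed instance drops out of discovery automatically. A minimal sketch with the `etcd-client` crate, reusing the hypothetical key and endpoint from earlier; the 10-second TTL and 3-second refresh interval are arbitrary choices:

```rust
use std::time::Duration;

use etcd_client::{Client, Error, PutOptions};

#[tokio::main]
async fn main() -> Result<(), Error> {
    let mut client = Client::connect(["localhost:2379"], None).await?;

    // Grant a 10-second lease and attach the registration key to it.
    // If this process dies and stops refreshing the lease, the key expires
    // and the instance disappears from discovery on its own.
    let lease = client.lease_grant(10, None).await?;
    client
        .put(
            "services/web/instance-1",
            "10.0.0.5:8080",
            Some(PutOptions::new().with_lease(lease.id())),
        )
        .await?;

    // Refresh the lease periodically while the service is healthy.
    let (mut keeper, mut responses) = client.lease_keep_alive(lease.id()).await?;
    loop {
        keeper.keep_alive().await?;
        if let Some(resp) = responses.message().await? {
            println!("lease {} refreshed, ttl {}s", resp.id(), resp.ttl());
        }
        tokio::time::sleep(Duration::from_secs(3)).await;
    }
}
```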