Let’s study distributed systems — 1. Introduction

Hidetatsu YAGINUMA
Nov 4, 2019

Imagine what happens when you run your company all by yourself. You can get your work done without any friction with other people, but you also have to do everything yourself. If you hire employees, you can hand some of your tasks over to them. However, you then have to tell them how to do the work and what you expect. Sometimes you will have to wait for them to finish, and there is no guarantee that they will finish by the deadline you gave them.

The difference between centralized systems and distributed systems is very similar to the example above. We build a system out of multiple computers so that, together, they can do more work than one computer could. We call such a system, which consists of multiple computers connected to each other via a network, a distributed system.

Why do we create distributed systems?

In most cases, there are 2 main purposes for building distributed systems:

  • To achieve performance that a single computer cannot reach
  • To keep working when some of the computers become unavailable

To see why we choose distributed systems, let’s look at an example: Apache Hadoop.

Hadoop — distributed data processing

Apache Hadoop is a framework that enables us to process large data sets across clusters of computers.

Nowadays, we can store huge amounts of data thanks to advances in disk capacity and computer performance. Many companies want to use this data to benefit their business and their customers. So we want a way to process and analyze this data quickly.

Using Hadoop, we can process huge amounts of data on multiple distributed computers. Hadoop distributes 2 things across multiple computers: data and tasks.

HDFS

HDFS (Hadoop Distributed File System) is a distributed file system designed to be used with Hadoop. To use many computers effectively, we want to spread the data across them as evenly as possible. We also need to copy the same data to multiple nodes, because we want to avoid data loss when a node crashes. In HDFS, each Datanode holds a part of the data.

(From HDFS Architecture guide)

HDFS consists of a single Namenode and multiple Datanodes. The Namenode manages the metadata of the huge data set and decides how the data is distributed across the Datanodes. Internally, the Namenode knows how many Datanodes are working and which part of the data each Datanode holds.
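To make the bookkeeping above a bit more concrete, here is a minimal sketch in Python. It is not how HDFS actually places blocks; the function names, the tiny block size, and the round-robin placement are all assumptions made for illustration. It only shows the idea of splitting data into blocks and remembering which Datanodes hold a copy of each block.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, datanodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct Datanodes, round-robin.

    The returned mapping (block id -> Datanode names) is the kind of
    information a Namenode-like component would keep.
    Hypothetical policy: the real HDFS placement is more elaborate.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of Datanodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(b"abcdefghij", block_size=4)
block_map = place_blocks(len(blocks), ["dn1", "dn2", "dn3"], replication=2)
print(block_map)  # {0: ['dn1', 'dn2'], 1: ['dn2', 'dn3'], 2: ['dn3', 'dn1']}
```

With 3 Datanodes and 2 copies per block, every block survives the crash of any single Datanode, and the data is spread evenly.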

Hadoop MapReduce

For the same reason, Hadoop also supports distributing tasks (the processing of each piece of data). In Hadoop, data processing consists of 2 parts: the MapReduce processing platform and the MapReduce application. In this article, I omit the description of MapReduce applications because they are not closely related to distributed systems themselves. If you are interested, you can refer to the official tutorial.

Like HDFS, the MapReduce platform also has 2 parts: the JobTracker and the TaskTrackers. A single JobTracker manages multiple TaskTrackers and distributes tasks to them. Each TaskTracker fetches data from HDFS and executes the pre-defined processing on it.
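The coordinator and worker relationship can be imitated on one machine. The sketch below is only an analogy, not Hadoop's API: a thread pool stands in for the TaskTrackers, `run_job` stands in for the JobTracker, and word counting is used as the pre-defined processing. All of these names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_task(split: str) -> Counter:
    """Map phase: count the words in one input split."""
    return Counter(split.split())


def reduce_counts(partials: list[Counter]) -> Counter:
    """Reduce phase: merge the partial counts from all map tasks."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(splits: list[str], workers: int = 3) -> Counter:
    """Hand each split to a worker, as a JobTracker-like coordinator might."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_task, splits))
    return reduce_counts(partials)


result = run_job(["a b a", "b c", "a c c"])
print(result["a"], result["b"], result["c"])  # 3 2 3
```

In a real cluster the workers are separate machines, so the coordinator also has to deal with slow or crashed workers, which this sketch ignores.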

The benefits of distributed systems

Because Hadoop works as a distributed system, we get some benefits:

  • If the dataset is huge, the processing time can be reduced
  • If one of the Datanodes or TaskTrackers crashes, processing can continue

By choosing a distributed systems architecture, we may be able to get these benefits.
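The second benefit depends on the copies of the data described earlier. As a rough illustration (not HDFS's real read path), the sketch below reads a block from the first replica whose node is still alive. The data structures `block_map`, `storage`, and `alive` are hypothetical stand-ins for cluster state.

```python
def read_block(block_id: int,
               block_map: dict[int, list[str]],
               storage: dict[str, dict[int, bytes]],
               alive: set[str]) -> bytes:
    """Return the block from the first live Datanode that holds a replica."""
    for node in block_map[block_id]:
        if node in alive:
            return storage[node][block_id]
    raise RuntimeError(f"no live replica for block {block_id}")


# Block 0 is stored on both dn1 and dn2 (hypothetical node names).
block_map = {0: ["dn1", "dn2"]}
storage = {"dn1": {0: b"abcd"}, "dn2": {0: b"abcd"}}

# dn1 has crashed, but the read still succeeds through dn2.
print(read_block(0, block_map, storage, alive={"dn2"}))  # b'abcd'
```

If every node holding a replica is down, the read fails, which is why the number of copies matters.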

So what’s the downside?

However, you will also face some difficulties with distributed systems:

  • If the Namenode or the JobTracker crashes, what will happen?
  • How do we ensure that the multiple copies of data across Datanodes are the same?

Distributed systems tend to become complicated. Having multiple nodes brings the possibility of connection failures and server crashes. We need to think about each kind of failure and how to recover from it. It is hard to have a solution for every corner case.
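To get a feeling for the second question, here is a small sketch that only detects copies that differ, by comparing SHA-256 digests. It does not solve the harder problem of keeping copies in sync, and it is an illustration rather than a description of what HDFS does. The helper names and the majority rule are assumptions.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Fingerprint a copy so that copies can be compared cheaply."""
    return hashlib.sha256(data).hexdigest()


def find_divergent_replicas(replicas: dict[str, bytes]) -> list[str]:
    """Return the nodes whose copy differs from the most common digest.

    Assumption: the majority is right. With a tie, the digest seen first
    is treated as the reference, which a real system could not rely on.
    """
    digests = {node: digest(data) for node, data in replicas.items()}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(node for node, d in digests.items() if d != majority)


replicas = {"dn1": b"abcd", "dn2": b"abcd", "dn3": b"abcX"}
print(find_divergent_replicas(replicas))  # ['dn3']
```

Even this simple check raises new questions: which copy is the correct one, and what happens if a node crashes while we are repairing it?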

Will my system really become faster?

We also need to think about whether we can really get the benefits of distributed systems.

(From: Latency Numbers Every Programmer Should Know)

Let’s look at "L2 cache reference" and "Round trip within same datacenter". If we run everything in 1 process, the data can live in the L2 cache. However, if we replicate our data across multiple datacenters, we have to pay the round trip time between them.
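A quick back-of-the-envelope calculation shows how large this gap is. The sketch below uses the commonly cited approximate figures from that list (about 7 ns for an L2 cache reference and about 500,000 ns for a round trip within the same datacenter); real values depend on the hardware and network. The cost model, which assumes perfectly parallel work plus a fixed number of round trips, is a deliberate simplification.

```python
# Approximate, commonly cited figures; actual values vary by hardware.
L2_CACHE_REFERENCE_NS = 7
DATACENTER_ROUND_TRIP_NS = 500_000


def distributed_time_ns(work_ns: float, nodes: int, round_trips: int) -> float:
    """Idealized cost: work split evenly across nodes, plus network round trips."""
    return work_ns / nodes + round_trips * DATACENTER_ROUND_TRIP_NS


# One round trip costs as much as tens of thousands of L2 cache references.
ratio = DATACENTER_ROUND_TRIP_NS / L2_CACHE_REFERENCE_NS
print(round(ratio))  # 71429

# A small job (1 ms of work) gets slower on 4 nodes with 2 round trips...
small = distributed_time_ns(1_000_000, nodes=4, round_trips=2)
print(small > 1_000_000)  # True

# ...while a large job (1 s of work) gets much faster.
large = distributed_time_ns(1_000_000_000, nodes=4, round_trips=2)
print(large < 1_000_000_000)  # True
```

So whether distribution pays off depends on how much work there is compared with the communication it adds.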

When architecting systems, we should definitely do some micro benchmarking and estimate whether there are good enough reasons to build a distributed system.

Conclusion and What’s next?

In this article, I tried to explain what distributed systems are and why we choose them. I would be really happy if this article gives you a chance to dive into distributed systems.

In this series, Let’s study distributed systems, I want to explain the technology around distributed systems as much as I can. Looking into distributed systems can feel like a very long, endless journey, so I want to explain things 1 by 1, as simply as possible.

Next, I am planning to write an article about clocks. We are all very familiar with clocks, but using them correctly in distributed systems is a bit difficult. A clock is a very basic mechanism for defining the order of events in distributed systems. If you are interested, please check out my next article.

Hidetatsu loves infrastructures, database, concurrent programming, transactions, distributed systems… https://github.com/hidetatz https://hidetatz.io