In this post I will share some tips I learned after using the Apache Hadoop environment for some years and running many workshops and courses. The information here is based on Apache Hadoop around version 2.9, but it can probably be extended to other similar versions.
These are considerations for when building or using a Hadoop cluster. Some of them are specific to the Cloudera distribution. Anyway, I hope it helps!
- Don’t use Hadoop for millions of small files. Small files overload the namenode and make it slower, and it is not difficult to overload the namenode: always check capacity vs. the number of files. Files on Hadoop should usually be larger than 100 MB (there is a quick way to check this in the sketches after this list).
- Plan for about 1 GB of namenode memory for every 1 million files.
- Nodes usually start to fail after about 5 years. Node failure is one of the most frequent problems in Hadoop; big companies like Facebook and Google probably see node failures every minute.
- The MySQL database behind Cloudera Manager does not have redundancy by default, so it can be a single point of failure.
- For reference: the merging of the fsimage with the edit log (checkpointing) happens on the secondary namenode.
- Hadoop can cache HDFS blocks in memory to improve read performance. By default it caches nothing, since the cache size is 0 (see the cacheadmin example after this list).
- You can set a parameter so that the client receives an acknowledgment from the datanodes after only the first or second replica has been written, instead of waiting for all of them. That might make writing data faster (config sketch after the list).
- Hadoop has rack awareness: it knows which node is connected to which switch. Actually, it is the Hadoop admin who configures that, through a topology script (example after the list).
- Blocks are checked from time to time to verify whether there was any data corruption (by default roughly every three weeks). This is possible because datanodes store block checksums (the relevant setting is shown after the list).
- Log files are kept for 7 days by default.
- part-m-* files come from map tasks and part-r-* files come from reduce tasks. The number at the end is the index of the task that wrote the file, so if the last output file is part-r-00008 the job ran 9 reducers (numbering starts at 0).
- You can change the log level of map and reduce tasks to get more information (example after the list), e.g.:
- mapreduce.reduce.log.level=DEBUG
- The YARN web UI shows what Spark did when running on YARN. The Spark UI at localhost:4040 also shows what has been executed while the application is running.
- It is important to think about where to put the namenode fsimage. You probably want to write it to more than one location, for example a second disk and an NFS mount (config sketch after the list).
- You have to reserve a good amount of disk space (around 25%) through dfs.datanode.du.reserved for the shuffle phase: its intermediate data is written to local disk, so there needs to be space for it (also covered in the sketch after the list).
- When you remove files, they stay in the .Trash directory for a while before being deleted for good. The default retention time is 1 day (trash settings and commands after the list).
- You can build a lambda architecture with Flume. You can also choose whether Flume buffers events in memory or on disk, via the channel type (small example after the list).
- Regarding hardware, worker nodes need more cores because they do most of the processing. The master nodes don’t process that much.
- For the namenode you want better-quality disks and better hardware (like RAID; RAID makes no sense on worker nodes, because HDFS already replicates the data).
- The rule of thumb is: if you want to store 1 TB of data you need about 4 TB of raw disk space (3x replication plus temporary/working space).
- Hadoop applications are typically not CPU-bound; disk and network I/O are usually the bottleneck.
- Virtualization might give you some benefits (it is easier to manage), but it hurts performance: it usually brings between 5% and 30% of overhead.
- Hadoop does not support IPv6, so you can disable IPv6. You can also disable SELinux inside the cluster. Both add overhead when enabled.
- A good size for a starting cluster is around 6 nodes.
- Sometimes, when the cluster is too full, you might have to remove a small file before you can remove a bigger one.
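Below are a few command and configuration sketches for some of the items above. Treat them as illustrations under my assumptions (paths, pool names and values are invented), not as drop-in configs.

To get a feeling for the number of files and blocks versus namenode heap (the small-files tip), hdfs fsck gives a quick summary:

```bash
# Rough sanity check: ~1 GB of namenode heap per ~1 million files/blocks.
hdfs fsck / | grep -E 'Total (dirs|files|blocks)'

# Capacity and usage summary per datanode; an average file size well below
# 100 MB usually points to a small-files problem.
hdfs dfsadmin -report | head -n 20
```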
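HDFS centralized caching is managed with hdfs cacheadmin. The pool and path below are made up, and dfs.datanode.max.locked.memory (0 by default, which is why nothing is cached out of the box) has to be raised on the datanodes first:

```bash
# Hypothetical pool and path; requires dfs.datanode.max.locked.memory > 0.
hdfs cacheadmin -addPool hot-data
hdfs cacheadmin -addDirective -path /data/lookup-tables -pool hot-data
hdfs cacheadmin -listDirectives
```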
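If I remember correctly, the write-acknowledgment behaviour mentioned above maps to the minimum-replication setting: the write is considered successful once that many replicas are on disk, and the remaining copies are made asynchronously. A hdfs-site.xml sketch, with the value only as an example:

```xml
<!-- Sketch: acknowledge a write once 1 replica is stored;
     the remaining replicas are created in the background. -->
<property>
  <name>dfs.namenode.replication.min</name>
  <value>1</value>
</property>
```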
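Rack awareness is configured by the admin through a topology script. A minimal sketch, assuming a script at /etc/hadoop/conf/topology.sh and an invented IP-to-rack mapping:

```xml
<!-- core-site.xml: point Hadoop at the admin-provided topology script. -->
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>
```

```bash
#!/bin/bash
# Toy topology script: prints the rack for each host/IP passed as an argument.
for host in "$@"; do
  case "$host" in
    10.0.1.*) echo "/rack1" ;;
    10.0.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done
```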
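The periodic corruption check is done by the datanode block scanner. Its period is configurable in hdfs-site.xml; 504 hours is the three weeks cited above (and, if I recall correctly, also the default):

```xml
<!-- Verify stored block checksums every 504 hours (~3 weeks). -->
<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>504</value>
</property>
```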
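Task log levels can be raised per job from the command line. The jar, class and paths below are only placeholders:

```bash
# Hypothetical job run with map and reduce task logs raised to DEBUG.
# The -D options are picked up when the driver uses ToolRunner/GenericOptionsParser.
hadoop jar my-job.jar com.example.MyDriver \
  -D mapreduce.map.log.level=DEBUG \
  -D mapreduce.reduce.log.level=DEBUG \
  /input /output
```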
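Replicating the namenode metadata and reserving local disk for non-HDFS data (like the shuffle) are both hdfs-site.xml settings. The paths and the byte value (roughly 25% of a 4 TB disk) are just examples:

```xml
<!-- Write the namenode metadata (fsimage + edits) to two locations, one on NFS. -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/1/dfs/nn,/mnt/nfs/dfs/nn</value>
</property>

<!-- Reserve non-HDFS space on each datanode disk, in bytes (here ~1 TB). -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>1000000000000</value>
</property>
```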
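Trash behaviour is controlled by fs.trash.interval (in minutes) in core-site.xml; 1440 gives the one-day retention mentioned above:

```xml
<!-- Keep deleted files in .Trash for 1440 minutes (1 day) before purging them. -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
```

```bash
hadoop fs -rm /data/old-file.txt         # goes to the user's .Trash first
hadoop fs -rm -skipTrash /data/old.txt   # bypasses trash, frees space immediately
hadoop fs -expunge                       # force-empty the trash
```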
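In Flume, the memory-versus-disk choice is the channel type. A minimal sketch with an invented agent name; only the channel definitions matter here (a complete agent also needs sources and sinks wired to the channels):

```properties
# Hypothetical agent "a1": two channel flavours for the same idea.
# Memory channel: fast, but events are lost if the agent dies.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# File channel: slower, but events survive an agent restart.
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /var/flume/checkpoint
a1.channels.c2.dataDirs = /var/flume/data
```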
That is it for now. I will try to write a part 2 soon. Let me know if there is anything I missed here!