Do not stop threads
I dedicate this article to László Fekete, my former boss and director at T-Mobile Hungary. He plays a significant role in this story as he was the one who made the decision to cancel our contract. I must acknowledge that he made the right call, and it was the correct course of action.
However, I also remember some instances where he seemed less concerned about his health, disregarding his blood pressure and cholesterol levels, despite my concerns, which we discussed a few times. Sadly, László passed away in 2017 at the young age of 57 due to a heart attack. It’s a stark reminder of the importance of taking care of our well-being and not neglecting warning signs.
Now, as I find myself at the same age László was when he left us, it serves as a poignant reminder of the fragility of life and the need to prioritize our health and well-being.
1. Introduction, Topic
I am 57, and I recently made some bad moves, and my back aches. I cannot sit for a long time, and I suddenly had ample time at my hand watching YouTube videos.
During my exploration, I stumbled upon an impressive channel called https://www.youtube.com/@ThePrimeTimeagen. The creator of this channel is a remarkable young individual who possesses wisdom beyond his years. His videos exhibit a profound understanding of technology, which captivates me.
I appreciate how he simply sits and discusses other videos or articles without feeling the need to over-explain things. It’s a "take it or leave it" approach. Those who comprehend his content gain valuable insights, and those who don’t: sorry.
I very much enjoy when I understand what he says and feel that probably not many do. It is a snug but somewhat arrogant feeling that one should be careful.
Also, I could hardly find any of his statements I would strongly disagree. Sometimes I feel we could have some discussion, but generally I can agree to, or accept his points. Go and watch him!
Recently I saw a video where he was commenting an article that was about a story how someone almost accidentally corrupted PayPal in the early days. I will not talk about that, it is here https://www.youtube.com/watch?v=MzescXc5SW0. It is a story with lots of technical details you can learn from.
Being 57 does not only mean backache. It also means that I have seen and done a few things that sometimes I tell younger people in the office. Why not write articles about these? So I decided I will write a few articles about things that I have seen and done and that I think are worth sharing.
And here we go.
2. Disclaimer
Most of the story is true and based on real events.
3. Stopping threads
As I said, I have time to watch YouTube videos. I came across the video https://www.youtube.com/shorts/f4fajEBqY0g. It is a short video about how to stop a thread, which you should not. This is one minute, and it does say you should not, and a sentence why, but one minute is too limited to explain the reasons.
I know why you should not stop a thread and not only what the documentation says. It cost me 20,000$ in lost revenue in 2006 when the GDP per capita per year in my country was less than that.
4. Background
I started programming in 1980. My father was a professor at TU Budapest in Hungary and could access a TI-30 calculator. It was a programmable calculator. I remember I tried to write a program to crack an RSA encoded text published to be cracked. Although the prime numbers were only 10 digits long, and the calculator had 1024-step program memory, registers were perhaps 16bit integers, and I had to implement multi precision arithmetic in my code.
I never succeeded with this one, but the exposure to programming "infected" me. I was 14. Later I programmed the Swedish ABC80, the Hungarian C64 clone, and the Hungarian VT-1080z that resembled the Enterprise computer, ZX Sinclair Spectrum, and many others. That time we programmed whatever we could get our hands on. My Unix exposure was minimal because the chair I was volunteering had VAX VMS machines.
I finished TU Budapest Electric Engineer and started to work as a sales rep for Digital Equipment Corporation in Hungary in 1991. Does not fit a programming carrier, does it. That the time paid programming in Hungary was mainly crafting bookkeeping applications in BDase, and it did not pay well. I was already married and had a child with the twins on the way, so I needed a respectable wage. You can afford to live your hobby as a profession if you can afford it. My priorities were different.
I kept programming in C and Perl that time as a hobby. I even wrote a small book in Hungarian about perl, which was the first such, and many learned Perl programming that time from my book. So much that when Larry Wall visited the Budapest Perl conference in the late 90s, I was invited as a keynote speaker. The title of my talk was "Forbid Perl", and I was talking about how Perl makes you so productive that using Perl eliminates the need of too many other programmers, and therefore it has to be forbidden to be used for real applications. I was saying that in front of the father of Perl sitting in the first row. I intended that as humor, but after a few decades I see that I was right. At the time, I did not see the benefit of professional software development overhead versus hacking something together in Perl. It is not the trait of the language per se, but Perl usually was used to script things in a hacky way.
I left DEC in 1999 and joined index.hu as CIO. It was a small startup, the first only online news site founded by a few university friends of mine. We wanted to make history and get rich. We achieved the first one.
I also programmed the advertisement engine of the site, which is a story on its own.
When the dotcom bubble burst, we had to lay off people, restructure the operation from investment oriented growing to sustainable operation. There were a lot of things I learned there, but those were management lessons, not programming. The last step was to give in my own notice, and I left the company in 2001.
Then I started to work for T-Mobile, but they did not hire me as a programmer. I had no prior professional experience and "hobby programming" did not count. I was hired as a project manager.
I was working in that position, I even ignited the development of a reformed project management methodology, but this was not my piece of cake. Five years later, my brother told me to create our own company. He was one sixth owner of a small company that was doing software development, and the other five developers moved towards SQL and stored-procedure direction. My brother thought that Java development is more interesting and more prospective, so he wanted to start a new company.
Why we decided to go to the Java direction and not Microsoft is again another topic that deserves an article on its own. It was more political/philosophical than a technical decision. I will write an article about that later, as well as about why we chose to trade in our old Linux and Windows machines to MacBooks with MacOS. These are interesting topics because people approach such decisions based on belief, and it can lead to heated discussions. Not now.
We started the company in 2006. One of our first clients was T-Mobile. We knew the people there, they knew me, and they needed an advertisement engine. I wrote the one for index.hu, and it was still in production six years later, delivering millions of HTTP responses per day. Not only it was the far largest traffic web server in the country, but it was also the most reliable one.
Much later at a conference, a speaker said that back in the days they checked their Internet connection by pinging the adserver of index.hu. Other sites can be down, but if the adserver is not reachable then it is more likely they have a connection problem. He did not know I was sitting there in the audience. It was a great feeling hearing that. That ad server ran for nine years uninterrupted and without any code modification.
5. Thread Stopping AdServer
So we got the contract to develop an ad server for T-Mobile. The contract size was around 30,000$. I did not know any Java that time. I had limited OOP experience. I was mainly programming in C and Perl and not commercial. But I was a good programmer, at least I though so.
We created the application in Java while I was learning it. The users were authenticated, and we had a backing database with user data. The ad engine had to select the ads based on the mobile subscription, number of used minutes, phone type, and other parameters.
We used PostgreSQL as the database in the dev environment and Hibernate on a Tomcat. An advertisement had to be displayed in two seconds. If the selection process was running longer, then a default ad was displayed. To achieve this, we executed the selection logic in a separate thread using the ExecutorService and waiting on a Future object. We also used the database connection pool available from the Hibernate library.
We manually tested the application, and it worked fine. We ran some load test and it worked fine. But I wanted to deliver perfect software, so I decided to play a bit with the case when the selection times out. In that case, the request serving thread sends a response, but the selection thread is still running putting a useless load on an already overloaded system. We can call 'stop' on the thread.
We tested this scenario, and it worked fine. The connection pool realized that the thread was stopped and closed the connection and created a new one in these cases. I knew that the production will use ORACLE database and the connection pool will also be the one provided by ORACLE. We did not have a test environment with these components, therefore, I decided not to use this performance-saving trick in the production system. But I was proud of my code, and I did not want to delete the line stopping the thread. Instead, I put it into an if statement that was never true, with a comment something like
// this 'if' is always false but I keep it here to show that I know how to stop a thread
if( true ){
thread.stop();
}
Now, you already get a clue, especially if you skip over the line reading it not realizing that the ACTUAL value is 'true'. The code went into production and worked fine. It worked fine for a while, except when the load went up.
When the load went up, the application started to deliver the default ad. The weird thing was that after the load went down, the application still delivered the default ad. Operation had to restart the application to work again. We did not have a clue what was going on, and we responded suggesting to increase the hardware capacity. It was clearly needed to handle the peak load, but there was another problem eventually. We tried to ignore it. Being a small company, we were already occupied with the next project. Putting new hardware under a service in a large corporation does not happen from one day to the other. The service needed to restart a few times every day. It went on between us and the project manager till he escalated the issue, and we could not ignore it anymore.
We had the log files, and we started to investigate. The log clearly showed that the application allocated connection from the pool when a selection started. The log also showed that the connection was returned to the pool when the selection finished even when the selection timed out. I strongly believed that this could not be the problem, especially because we did not stop the threads in the case of timeout.
At least that was what I thought.
We added more logging to the code, deployed it to production which essentially made it a bit slower, making the client even less happy, but it was needed. There were log items for each request and response, we knew when a request timed out, the connection id, thread id and so on. The log was huge, and I wrote Perl scripts to analyze it. It took a week and a lot of diagrams until I realized that whenever a thread timed out, that connection ID never appeared later in the log. The connection never returned to the pool, even though the library falsely reported that it was. But why? We did not stop the threads, and the log showed that these threads always stopped a few milliseconds after the selection timed out.
This was the first clue. It seemed fishy. When the selection using a few SQL selects timed out, why was it always only a little bit late? The fact that we first tried to increase the timeout from two seconds to two and a half seconds shows how clueless we were. It made the time outing threads to finish in two and a half second plus a few milliseconds. Always the timeout time plus a few milliseconds.
"Didn’t you leave the code in that stops the thread?" asked my brother.
"Sure, I didn’t, see, it is in an if statement that is never true."
"No. That is what the comment says." — he replied. — "But the code is there, and it stops the thread."
I was looking at that code hundreds of times blindly during those past two weeks. I read the comment and skipped the code. I read what I wanted to be there and not what really was there.
This time I deleted the line and the comment, and we deployed the code. It worked fine, unlike our relationship with the client. They canceled our contract for the further development of the ad server. We have lost a 20,000$ contract, and we were told that we will never get any contract from them again. I could not blame them.
This "never" lasted three years when partnering with another company, we delivered a system they used to electronically sign four million invoices every month. Do you remember what my very first program was on that TI-30 calculator? That delivery I am not ashamed of. I learned a lot during those three years.
6. Conclusion
There are many things to learn from this story.
6.1. Don’t stop threads
Even though you technically can stop threads, you MUST not.
If you MUST not, then why experiment with it?
You can tell the thread that it can stop if it feels so.
You can use some shared state for the thread to check periodically and stop when it can do safely.
Calling interrupt()
on a thread is a good way to tell the thread that it can stop.
Documentations list a lot of things that may happen when you call stop()
on a thread, but reading it is one thing and when it happens to you is another.
Everybody has to burn the hands a few times. The cleverer you are, the less you need to burn your hands. There are some Mucius Scaevolas out there, not learning from their mistakes. Do not be one.
6.2. Logs are only logs
Logs contain the messages that the application writes about what it does and not what really happens. Programmers make bugs, including misleading logs. Even when you use a high reputation library, you can still face bugs.
6.3. Comments can be dangerous
Comments can be dangerous. Comments are in English and no matter how nerd you are, your eyes will read the human text first. In this case, non-native English speakers may have a slight advantage. If the comment is outdated, misleading or plain wrong, it may lead the maintainers' eyes away from the code.
A good comment does not explain what the code does. The code precisely describes that. You should explain why it does what it does and how other parts of the code should use, and interface the code.
In this case not having any comment before the if
statement, or just
// we can switch experimental thread stopping on and off here
if( true ){
thread.stop();
}
would have been better. My today wisdom says to delete the line and the comment. If you want to keep the line as a legacy, do it in a separate branch or tag in the version control system.
6.4. You do not know when you are stupid
At that point, writing my first commercial application, I was at the peak of my Dunner-Kruger curve. You do not know when you are there. If you feel you are an expert, you know everything, you are the best: be very careful. You are probably at that dangerous peek. Don’t stay there, climb off on the right side and start to climb up on the peek-less long slope to the right, always with a healthy level of self-doubt.
6.5. Customer is always right
When the customer says that you are wrong, you are wrong. They complained that the application does not come back from the overloaded state and our first response was to ask for more hardware. Technically, we were right. If the system does not ever get into the overloaded state, then there is no problem not getting back to normal from it. However, you see how arrogant this standpoint was. Probably this was the number one reason we lost the contract.
We learned from this mistake. We learned many more mistakes after that, and this is a process that I have not finished yet. Learning from mistakes may be the most perpetual thing in my life, and I think it is important for everyone. I have many similar stories, and if you liked this one then leave a comment, give some feedback that will make me know that I should write more.
Comments
Please leave your comments using Disqus, or just press one of the happy faces. If for any reason you do not want to leave a comment here, you can still create a Github ticket.