Thursday, March 17, 2011

Javacore and jextract

Introducing javacore
Javacore can be a powerful tool to diagnose why an application powered by IBM JVM is not very responsive. With javacore, we can see which thread is currently running at the time of dumping.

As an example, I analyzed a hanging syncNode command. The syncNode refreshes a dependent profile information from the 'master' deployment manager profile.
Typically, syncNode command succeeds within a minute.
In my case, 20 minutes had passed and it was still running.
To analyze, I produced 3 javacores every two minutes by sending a SIGQUIT (by running kill -3) signal to the syncNode process.
These javacores were then used as inputs to IBM Thread and Monitor Dump Analyzer for Java bundled with IBM Support Assistant.
Threads comparison of 3 javacores in IBM Thread and Monitor Dump Analyzer



From the picture above, you can see that the main thread is getting the localhost information with InetAddress.getLocaHost. The getLocalHost() method, in turn, resolves the hostname from the IP address.

Why does the getLocalHost() span over 6 minutes snapshot time?
If we take a closer look, we can see that the look up exploits dual-IP stack implementation of IBM JVM. In this case, the syncNode command tries to lookup its own hostname. However, it starts with IPv6 look up and the DNS does not respond. So this command hangs. This is the DNS problem but I worked around by configuring resolver to only use IPv4 lookup i.e. set netsvc.conf to hosts=local4,bind4 instead of hosts=local,bind.


Backup plan: native core file + jextract to the rescue
Now we see the value of the javacore.
Unfortunately, there are cases when a java process hangs but kill -3 doesn't produce a javacore.
(Examples of these cases include a native memory issue where we simply do not have sufficient memory to satisfy just dumping a javacore).
With IBM, you are not running out of tools. IBM JVM equips you with a jextract tool. The jextract command analyzes native (core) dump of the JVM process and generate the output file together with the dump itself.
As the analysis is JVM build-specific; ideally, we should run jextract on the same problematic machine which generates the core dump. Of course, the dump must be complete for jextract to be useful. On AIX, you should enable full core dump.

Please allow me to show the value of jextract in the real-world. It comes to help my customer whose application intermittently hangs from time to time for no apparent reason.
We ran ps -ef command to confirm the process is alive.
We checked SystemOut.
> Nothing is printed.
We tried to kill -3 but no javacore was produced.
We followed the mustgather when kill -3 produces no javacore.
We have to wait for the problem to surface again to continue with this mustgather. Then, we had jextract output ready to proceed.

We use "IBM Monitoring and Diagnostic Tools - Dump Analyzer" (DTFJ) to analyze jextract output. We are able to see from all jextract snapshots which thread was hung. In our case, the current thread is always 2162735.


            Java Thread: SoapConnectorThreadPool : 1 (com.ibm.ws.util.ThreadPool$Worker@0x070000003960EA20)
                System thread ID: 2162735   JNI: 0x000000011788DC00  

                State: 0x00000005 (ALIVE,RUNNABLE)   Priority: 5  

                WAS thread-ID: 0x00000024
                Currently waiting on: (none)
                Currently holding monitors: NLS hash table ID=0x0000000110014A30 ((Raw monitor - no Java object))
                Currently holding monitors: (un-named monitor @0x700000005d65500 for object @0x700000005d65500) ID=0x0700000005D65500 (java.net.InetAddress$Lock@0x0700000005D65500)
                Java Stack Area: 0x0000000117B7D208-0x0000000117B81A37 (18 KB) "JavaStackSection for JavaThread: SoapConnectorThreadPool : 1"
                Java Stack Frames: 
                    ip=0x00000006/sp=0x0000000117B80A68: java.net.Inet6AddressImpl.lookupAllHostAddr()
                    ip=0x000000011173FEE0/sp=0x0000000115BF28B0: java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
                    ip=0x000000011166CED8/sp=0x0000000115BF2940: java.net.InetAddress.getAddressFromNameService(InetAddress.java:1297)
                    ip=0x0000000111765074/sp=0x0000000117B812C8: java.net.InetAddress.getLocalHost(InetAddress.java:1475)
                    ip=0x00000001117B2418/sp=0x0000000117B993D0: org.apache.soap.util.mime.MimeUtils.getUniqueValue(MimeUtils.java:85)
                    ip=0x00000001117B2238/sp=0x0000000117B99448: org.apache.soap.rpc.SOAPContext.setRootPart(SOAPContext.java:383)
                    ip=0x00000001117B2098/sp=0x0000000117B99548: org.apache.soap.rpc.SOAPContext.setRootPart(SOAPContext.java:426)
                    ip=0x0000000111719014/sp=0x00000000: org.apache.soap.rpc.SOAPContext.setRootPart(SOAPContext.java:401)
                    ip=0x0000000111719014/sp=0x0000000117B99650: com.ibm.ws.management.connector.soap.SOAPResponse.writeSOAPResponse(SOAPResponse.java:74)
                    ip=0x000000011171328C/sp=0x0000000117B99838: com.ibm.ws.management.connector.soap.SOAPConnector.service(SOAPConnector.java:280)
                    ip=0x07000000423F0BE9/sp=0x0000000117B99858: com.ibm.ws.management.connector.soap.SOAPConnection.handleRequest(SOAPConnection.java:64)
                    ip=0x07000000423EE3C1/sp=0x0000000117B99978: com.ibm.ws.http.HttpConnection.readAndHandleRequest(HttpConnection.java:742)
                    ip=0x000000011170A628/sp=0x0000000117B99A60: com.ibm.ws.http.HttpConnection.run(HttpConnection.java:528)
                    ip=0x0700000040EDE902/sp=0x0000000113CFB490: com.ibm.ws.util.ThreadPool$Worker.run(ThreadPool.java:1497)
                Native Thread: 2162735
                    Associated Java Thread: SoapConnectorThreadPool : 1 (com.ibm.ws.util.ThreadPool$Worker@0x070000003960EA20)
                    Properties: 
                        processor on which I'm bound: 0
                        wait channel: 0
                        current/last signal taken: 0
                        scheduling policy: 52
                        current effective priority: 0
                        suspend count: 1
                        thread state: 5
                        type of thread wait: 1
                        thread flags: c10404
                        processor usage: ffff0002
                    Native Stack Area: 0x0000000110000000-0x000000011CE6C55F (211377 KB) "stack section at 110000000"
                    Native Stack Frames: 
                        ip=0x09000000004972E4/sp=0x0000000118543260: /usr/lib/libpthreads.a(shr_xpg5_64.o)::._validate_local_lock+0x3c
                        ip=0x0900000000052C90/sp=0x00000001185433A0: /usr/lib/libc.a(shr_64.o)::_$STATIC+0x48
                        ip=0x090000000004FFA4/sp=0x0000000118543480: /usr/lib/libc.a(shr_64.o)::+0x4c
                        ip=0x09000000028E1774/sp=0x00000001185434F0: /usr/IBM/WebSphere/AppServer/java/jre/bin/libj9prt23.so::_$STATIC+0x14c
                        ip=0x09000000028E1088/sp=0x0000000118543560: /usr/IBM/WebSphere/AppServer/java/jre/bin/libj9prt23.so::.j9mem_startup+0x38
                        ip=0x09000000028E1348/sp=0x0000000118543600: /usr/IBM/WebSphere/AppServer/java/jre/bin/libj9prt23.so::.checkTagSumCheck+0x28
                        ip=0x09000000028EEBDC/sp=0x00000001185436A0: /usr/IBM/WebSphere/AppServer/java/jre/bin/libj9prt23.so::.parse_catalog+0x47c
                        ip=0x09000000028EF9B4/sp=0x0000000118543C00: /usr/IBM/WebSphere/AppServer/java/jre/bin/libj9prt23.so::.j9nls_set_catalog+0x24
                        ip=0x09000000028EFED0/sp=0x0000000118543CB0: /usr/IBM/WebSphere/AppServer/java/jre/bin/libj9prt23.so::.j9nls_get_variant+0x10
                        ip=0x09000000028EFA44/sp=0x0000000118543D50: /usr/IBM/WebSphere/AppServer/java/jre/bin/libj9prt23.so::.j9nls_set_catalog+0xb4
                        ip=0x09000000029736F0/sp=0x0000000118543DC0: /usr/IBM/WebSphere/AppServer/java/jre/bin/libj9dmp23.so::.rasDumpEnableHooks_34_29+0xd8
                        ip=0x090000000288C29C/sp=0x0000000118544100: /usr/IBM/WebSphere/AppServer/java/jre/bin/libj9vm23.so::.generateSystemDump+0xac
                        ip=0x09000000028EC0B8/sp=0x00000001185451E0: /usr/IBM/WebSphere/AppServer/java/jre/bin/libj9prt23.so::.masterSynchSignalHandler+0xd0
                        ip=0x090000000288C594/sp=0x0000000118545560: /usr/IBM/WebSphere/AppServer/java/jre/bin/libj9vm23.so::.writeVMInfo_18_2+0x13c
                        ip=0x09000000028EC32C/sp=0x0000000118545630: /usr/IBM/WebSphere/AppServer/java/jre/bin/libj9prt23.so::.initializeSignalTools+0x138
                        ip=0x000044F4/sp=0x0000000118547340: <unknown location>
                        ip=0xE9E1FFD8/sp=0x09000000028852CC: <unknown location>


From the above stack trace, it is crystal clear that the hanging code is not application code. It's time to get help from IBM.

Using the above information, IBM support further identifies the incident is from AIX defects below:
IZ66709 : CORE DUMP AFTER REFERENCING THE ADDR AFTER GETADDRINFO 
IZ68793 : GETGRENT_R CAUSES SIGSEGV WHEN INVOKED BY MULTIPLE THREADS

After we upgraded AIX, the problem is gone.

No comments:

Post a Comment