Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Andoird] ANR if app closed and opened too quickly #7567

Closed
AntTheAlchemist opened this issue Apr 2, 2023 · 54 comments
Closed

[Andoird] ANR if app closed and opened too quickly #7567

AntTheAlchemist opened this issue Apr 2, 2023 · 54 comments

Comments

@AntTheAlchemist
Copy link
Contributor

AntTheAlchemist commented Apr 2, 2023

**Update: skip to the last message to get to the conclusion and avoid the waffle **

If the Java side of the app hasn't finished shutting down properly when re-opened, it all goes a bit FUBAR when you re-open the app. You have to be pretty quick with your fingers to trigger this, but still shouldn't happen? Might be obvious / simple to fix?

Looking at Logcat, I'm seeing these messages come up, which might be a clue:

 E  eglSwapBuffersWithDamageKHRImpl:1313 error 300d (EGL_BAD_SURFACE)
 E  eglSwapBuffersWithDamageKHRImpl:1313 error 300d (EGL_BAD_SURFACE)
 E  eglSwapBuffersWithDamageKHRImpl:1313 error 300d (EGL_BAD_SURFACE) (repeated 100s of times a second)
 W  Client was already closed and can't be reused. Please create another instance.  (this can come up before or after the bad_surface errors)
@1bsyl
Copy link
Contributor

1bsyl commented Apr 4, 2023

Can you put more logcat, before there should be relevant sdl message ...
Are you sure when you say "app hasn't finished shutting down" ... is it onDestroy() // or could this be background/foreground

@AntTheAlchemist
Copy link
Contributor Author

The problem happens with SDL_PumpEvents(), it completely locks up after that. The renderer seems to work before then no problem though. It's difficult to debug because it only happens if you re-start the app quckly after closing. There are no relevant SDL messages on logcat. By "app hasn't finished shutting down", it could be at any process of the shut down, not easy to say because you can't debug a shutdown and restart in Android Studio. Not sure if it's relevant to background? It only happens if you re-start the app too quickly, therefore, it has something to do with the previous execution, because it never happens if you give the previous execution enough time to shut down.

@AntTheAlchemist
Copy link
Contributor Author

By shut down, I mean SDL_Quit() and exiting main()

@1bsyl
Copy link
Contributor

1bsyl commented Apr 4, 2023

By shut down, I mean SDL_Quit() and exiting main()

Exiting main(), should be return so that you should see some message, onDestroy() etc.

that's where the main() ends, and tells too "finish()" the activity:
https://github.com/libsdl-org/SDL/blob/main/android-project/app/src/main/java/org/libsdl/app/SDLActivity.java#L1905
the activity onDestroy() message:
https://github.com/libsdl-org/SDL/blob/main/android-project/app/src/main/java/org/libsdl/app/SDLActivity.java#L602

The problem happens with SDL_PumpEvents(),
So in the re-start ?
not sure it this helps, but in your xml, you should have some "singleInstance" parameter to tell you have only 1 instance
https://github.com/libsdl-org/SDL/blob/main/android-project/app/src/main/AndroidManifest.xml#L76

otherwise, maybe, this is just an issue with phone gles2 driver...

@AntTheAlchemist
Copy link
Contributor Author

I just had the idea to put a goto at the end of main() to force it to restart without exiting main(). My app on Windows 10 restarts ok, so the init & cleanup code seems ok, but on Android I get "unable to create an EGL window surface (call to eglCreateWindowSurface failed, reporting an error of EGL_BAD_ALLOC)".

I will do some more digging to find out if it's my code, or something in SDL and come back with a repo, or bug fix.

@AntTheAlchemist
Copy link
Contributor Author

AntTheAlchemist commented Apr 4, 2023

I can confirm it's an SDL(3) issue. If you SDL_Quit cleanly and then try to create a window without dropping out of main(), it will fall over. Maybe someone with better knowledge of the inner-workings of SDL will instinctively know why this is happening, otherwise I can dive in the deep end and risk drowning 😉 . This test code reproduces:

#include <SDL3/SDL_main.h>
#include <SDL3/SDL.h>

int main(int argc, char** argv) {
	again:
	SDL_SetHint(SDL_HINT_ANDROID_TRAP_BACK_BUTTON, "1");
	if(SDL_Init(SDL_INIT_VIDEO)) {
		SDL_ShowSimpleMessageBox(SDL_MESSAGEBOX_ERROR, "SDL_Init", SDL_GetError(), 0);
		return 1;
	}
	SDL_Window* w = SDL_CreateWindowWithPosition( "", SDL_WINDOWPOS_CENTERED, SDL_WINDOWPOS_CENTERED, 640, 480, SDL_WINDOW_RESIZABLE);
	if(!w) {
		SDL_ShowSimpleMessageBox(SDL_MESSAGEBOX_ERROR, "SDL_CreateWindow", SDL_GetError(), 0);
		return 1;
	}
	SDL_Renderer *r = SDL_CreateRenderer(w, 0, SDL_RENDERER_PRESENTVSYNC);
	if (!r) {
		SDL_ShowSimpleMessageBox(SDL_MESSAGEBOX_ERROR, "SDL_CreateRenderer", SDL_GetError(), 0);
		return 1;
	}
	SDL_bool running = SDL_TRUE;
	while (running) {
		SDL_SetRenderDrawColor(r, 0, 0, 255, SDL_ALPHA_OPAQUE);
		SDL_RenderClear(r);
		static SDL_FRect box = { 0,0,50,50 };
		SDL_SetRenderDrawColor(r, 0, 255, 0, SDL_ALPHA_OPAQUE);
		SDL_RenderFillRect(r, &box);
		SDL_RenderPresent(r);
		SDL_PumpEvents();
		SDL_Event event;
		while (SDL_PeepEvents(&event, 1, SDL_GETEVENT, SDL_EVENT_FIRST, SDL_EVENT_LAST) == 1) {
			if(event.type == SDL_EVENT_KEY_UP || event.type == SDL_EVENT_FINGER_UP)
				running = SDL_FALSE;
		}
		if(box.x != 300)
			box.x++;
		else
			box.x = 0;
	}
	SDL_DestroyRenderer(r);
	SDL_DestroyWindow(w);
	SDL_Quit();
	/* SDL_Delay(8000); /* It still crashes even with a delay to let the java-side catch up */
	goto again;
}

@AntTheAlchemist
Copy link
Contributor Author

Oh yes, and this is what comes through logcat:

19:12:39.482  V  setOrientation() requestedOrientation=10 width=640 height=480 resizable=true hint=
19:12:39.485  E  [SurfaceView[org.libsdl.app/org.libsdl.app.SDLActivity]#1(BLAST Consumer)1](id:4a6c00000001,api:1,p:19052,c:19052) connect: already connected (cur=1 req=1)
19:12:39.486  E  eglCreateWindowSurface: native_window_api_connect (win=0x704449f810) failed (0xffffffea) (already connected to another API?)
19:12:39.486  E  eglCreateWindowSurfaceTmpl:676 error 3003 (EGL_BAD_ALLOC)
19:12:39.486  I  pixel format wanted SDL_PIXELFORMAT_RGBA8888 (1), got SDL_PIXELFORMAT_RGBA8888 (1)
19:12:39.529  D  Compat change id reported: 171228096; UID 10437; state: ENABLED
19:12:39.605  V  onWindowFocusChanged(): false

@1bsyl
Copy link
Contributor

1bsyl commented Apr 4, 2023

ok. but you change your bug, before was exit from main(), to some re-jump to start main.

Indeed this doesn't work, like in other platforms, because the android java surface view is still there and is connected to the egl.
you can probably make this working by adding:

  + SDL_MinimizeWindow(w);
  + SDL_Delay(1000);
    SDL_DestroyWindow(w);

so that you see a "surfaceDestroyed()" message in you log (nothing remains connected).

if you think the initial bug is the same as your 2nd, you can add a global variable counter to check that. You display and increment the variable at each start of the main() and see the value. (but that shouldn't happen! unless you have different activity configuration).

@AntTheAlchemist
Copy link
Contributor Author

I didn't change the bug, it still happens if you exit main(),. This repo is a better way of catching it, that's all. When it happens by exiting main() and re-starting the app, the bug is worse because CreateWindow succeeds, but the event process locks-up, and there is no way to debug it.

Other platforms, this works, so it's a bug specific to the Android platform.

I'm not adding code to work around the bug, especially not a delay because that will create ANRs, which is bad behaviour in Google Play.

I'm pretty sure this can be easily fixed in the SDL java code, where the source of the problem is.

@1bsyl
Copy link
Contributor

1bsyl commented Apr 5, 2023

You're also saying the error doesn't occur at the same time.

Can you add a global variable counter that you display and increment at the start of you main().
then do your fast close-restart as you describe in you first message.
to see what's the value of this counter, when the bug occurs

@AntTheAlchemist
Copy link
Contributor Author

The code I pasted reproduces the same error, just with a slightly different result - the error's coming back from logcat at the same. The counter will be at 2, because it happens after the first exit and re-start if you open the app quick enough.

@AntTheAlchemist
Copy link
Contributor Author

To clarify, SDL is getting stuck, and produces an ANR if restarted too quickly. It gets stuck in nativePollOnce in MessageQueue. I am getting ANR reporets from Google Play about this.

There's an interesting comment in this method's source code which might be relevant. It might be related to the bug.

        // Return here if the message loop has already quit and been disposed.
        // This can happen if the application tries to restart a looper after quit
        // which is not supported.

@1bsyl
Copy link
Contributor

1bsyl commented Apr 5, 2023

To clarify, SDL is getting stuck, and produces an ANR if restarted too quickly. It gets stuck in nativePollOnce in MessageQueue. I am getting ANR reporets from Google Play about this.

There's an interesting comment in this method's source code which might be relevant. It might be related to the bug.

not sure if this is relevant, we don't use the looper directly, but the activity.... and what is the "quit" in this situation, onDestroy() ? if so nothing is done after this. This is an android comment for android internals.

The code I pasted reproduces the same error, just with a slightly different result - the error's coming back from logcat at the same.

strange. because I tried it, several times, and i could restart the main() with this. Maybe the timing is different (try to increase). for this test, the relevant information is to check the logcat and search for "onSurfaceDestroyed" and see it if arrive before you reach the "again:" label. SDL_MinimizeWindow should trigger indirectly "onSurfaceDestroyed".

The` counter will be at 2, because it happens after the first exit and re-start if you open the app quick enough.

Sorry to ask, but here this is really not clear to me. "counter will be at 2". did you check that ? or you assume it ? the goal is to tell if the main() is re-started in the same process, or in a new one. (this is has to be tested without the "again:" label + goto).

@AntTheAlchemist
Copy link
Contributor Author

The bug is clear and easy to reproduce with the latest source using the android-project template. There's not really anything more I can add.

@AntTheAlchemist
Copy link
Contributor Author

AntTheAlchemist commented Apr 5, 2023

Just to clarify further facepalm: I didn't assume the counter will be 2, I know it will. Global / static data does not get destroyed on Android when main() exits. You can see this by the position of the green box that slides to the right. It will persist, even when you return to the app an hour later, unless you force-close the app, or the Android system decides to garbage collect.

I've just tested this on an old Nexus 7 and the game loop continues, but the event loop locks up. I did have to add if(event.type == SDL_EVENT_KEY_DOWN || event.type == SDL_EVENT_FINGER_UP) because the back button doesn't produce a key event. The point is, it's still essentially the same bug.

@madebr
Copy link
Contributor

madebr commented Apr 5, 2023

This is an apk for aarch64 I created, using the code from #7567 (comment) and my android cmake pr (+ printing of a counter value at the start of main).
issue_7567.zip

I don't see a ANR on a OnePlus 5, after more then 500 restarts.

issue_7567.c
#include <SDL3/SDL_main.h>
#include <SDL3/SDL.h>

static int counter = 0;

int main(int argc, char** argv) {
again:
    SDL_Log("At start of main. counter=%d", counter);
    counter++;
	SDL_SetHint(SDL_HINT_ANDROID_TRAP_BACK_BUTTON, "1");
	if(SDL_Init(SDL_INIT_VIDEO)) {
		SDL_ShowSimpleMessageBox(SDL_MESSAGEBOX_ERROR, "SDL_Init", SDL_GetError(), 0);
		return 1;
	}
	SDL_Window* w = SDL_CreateWindowWithPosition( "", SDL_WINDOWPOS_CENTERED, SDL_WINDOWPOS_CENTERED, 640, 480, SDL_WINDOW_RESIZABLE);
	if(!w) {
		SDL_ShowSimpleMessageBox(SDL_MESSAGEBOX_ERROR, "SDL_CreateWindow", SDL_GetError(), 0);
		return 1;
	}
	SDL_Renderer *r = SDL_CreateRenderer(w, 0, SDL_RENDERER_PRESENTVSYNC);
	if (!r) {
		SDL_ShowSimpleMessageBox(SDL_MESSAGEBOX_ERROR, "SDL_CreateRenderer", SDL_GetError(), 0);
		return 1;
	}
	SDL_bool running = SDL_TRUE;
	while (running) {
		SDL_SetRenderDrawColor(r, 0, 0, 255, SDL_ALPHA_OPAQUE);
		SDL_RenderClear(r);
		static SDL_FRect box = { 0,0,50,50 };
		SDL_SetRenderDrawColor(r, 0, 255, 0, SDL_ALPHA_OPAQUE);
		SDL_RenderFillRect(r, &box);
		SDL_RenderPresent(r);
		SDL_PumpEvents();
		SDL_Event event;
		while (SDL_PeepEvents(&event, 1, SDL_GETEVENT, SDL_EVENT_FIRST, SDL_EVENT_LAST) == 1) {
			if(event.type == SDL_EVENT_KEY_UP || event.type == SDL_EVENT_FINGER_UP)
				running = SDL_FALSE;
		}
		if(box.x != 300)
			box.x++;
		else
			box.x = 0;
	}
	SDL_DestroyRenderer(r);
	SDL_DestroyWindow(w);
	SDL_Quit();

	/* SDL_Delay(8000); /* It still crashes even with a delay to let the java-side catch up */
	goto again;
}

@1bsyl
Copy link
Contributor

1bsyl commented Apr 5, 2023

@madebr thanks for the cooperation !
I have also tried, actually, and I do see there is an issue :( trying more thing at the moment
there are actually two bugs:
1/ fast close-restart re-use the same .so. (counter increment) so that's really bad. I don't think we can support it.
Even if sdl could, user app could also have static var. so I think we should just warn and quit.

2/ the label "again:" bug. eg: Init() Quit() ... Init() Quit(). maybe that can work .. but that's an uncommon scenario

@1bsyl
Copy link
Contributor

1bsyl commented Apr 5, 2023

@AntTheAlchemist
can you try this https://github.com/libsdl-org/SDL/pull/7585/files
(not with the "again:" label testcase).
it shouldn't allow to re-oppen the app with previous global static data. you can have an un-noticeable aborted start when it would have previously failed, but no crash/anr.

@madebr
Copy link
Contributor

madebr commented Apr 5, 2023

I can reliably reproduce the error when removing the again label.
New issue_7567.zip.

And start/stop it using:

adb shell "while true; do am start-activity org.libsdl.sdl.test.issue_7567/.SDLTestActivity; input tap 400 400; done"

For me, it always fails in the 3rd iteration.

When stuck, you can unstuck it with:

adb shell am start -S org.libsdl.sdl.test.issue_7567/.SDLTestActivity

This forces a restart of the program.

source.c
#include <SDL3/SDL_main.h>
#include <SDL3/SDL.h>

int main(int argc, char** argv) {
	SDL_SetHint(SDL_HINT_ANDROID_TRAP_BACK_BUTTON, "1");
	if(SDL_Init(SDL_INIT_VIDEO)) {
		SDL_ShowSimpleMessageBox(SDL_MESSAGEBOX_ERROR, "SDL_Init", SDL_GetError(), 0);
		return 1;
	}
	SDL_Window* w = SDL_CreateWindowWithPosition( "", SDL_WINDOWPOS_CENTERED, SDL_WINDOWPOS_CENTERED, 640, 480, SDL_WINDOW_RESIZABLE);
	if(!w) {
		SDL_ShowSimpleMessageBox(SDL_MESSAGEBOX_ERROR, "SDL_CreateWindow", SDL_GetError(), 0);
		return 1;
	}
	SDL_Renderer *r = SDL_CreateRenderer(w, 0, SDL_RENDERER_PRESENTVSYNC);
	if (!r) {
		SDL_ShowSimpleMessageBox(SDL_MESSAGEBOX_ERROR, "SDL_CreateRenderer", SDL_GetError(), 0);
		return 1;
	}
	SDL_bool running = SDL_TRUE;
	while (running) {
		SDL_SetRenderDrawColor(r, 0, 0, 255, SDL_ALPHA_OPAQUE);
		SDL_RenderClear(r);
		static SDL_FRect box = { 0,0,50,50 };
		SDL_SetRenderDrawColor(r, 0, 255, 0, SDL_ALPHA_OPAQUE);
		SDL_RenderFillRect(r, &box);
		SDL_RenderPresent(r);
		SDL_PumpEvents();
		SDL_Event event;
		while (SDL_PeepEvents(&event, 1, SDL_GETEVENT, SDL_EVENT_FIRST, SDL_EVENT_LAST) == 1) {
			if(event.type == SDL_EVENT_KEY_UP || event.type == SDL_EVENT_FINGER_UP)
				running = SDL_FALSE;
		}
		if(box.x != 300)
			box.x++;
		else
			box.x = 0;
	}
	SDL_DestroyRenderer(r);
	SDL_DestroyWindow(w);
	SDL_Quit();

	return 0;
}

@1bsyl
Copy link
Contributor

1bsyl commented Apr 6, 2023

@madebr are you reproducing the error with the patch from the PR ?

@madebr
Copy link
Contributor

madebr commented Apr 6, 2023

I modified the adb command to repeatedly start/stop the app to:

adb shell "while true; do am start-activity org.libsdl.sdl.test.issue_7567/.SDLTestActivity; while true; do dumpsys activity | grep org.libsdl.sdl.test.issue_7567/.SDLTestActivity >/dev/null; res=\$?; if test x\$res = x0; then break; else echo "not finished yet"; fi; done; input tap 400 400; done"

Then, it does not block anymore.

The app keeps the global static state though: the green block does not move left after a close/start cycle.

When testing manually, the behavior is not good: as slouken wrote in #7585, you need to re-start the app 5 times before the activity shows up. And then too, the global state remains.

@1bsyl
Copy link
Contributor

1bsyl commented Apr 6, 2023

I have updated the PR so that it indicate the count the main is called (eg: 3c703f2 )
do you see the message in the log ? do you have a logcat to see what's happening ?

@slouken
Copy link
Collaborator

slouken commented Apr 6, 2023

FYI, the real fix is to add a hint to call exit() after main() returns, and default that to true. It should be a hint, because some applications properly handle the Android life cycle and expect the activity to be reentered. This may need special handling to survive SDL_Quit(), possibly implemented as a callback that sets a global variable.

@1bsyl
Copy link
Contributor

1bsyl commented Apr 6, 2023

@madebr I am not sure how you get the global state to remain.
it means:

  • you click, it enters onCreate(), then call returnrs / finish(), but doesn't destroy() the static data ?
  • so you do it 5x times, and it enters and recall main() will old datas.
    (probably it call onDestroy() because the app is now not blocked, but the static data remains).

I am curious, about the method, and I also would like to know it calls onDestroy() in the logcat !

@slouken
I've tested with exit() and:
1/ if it's called after the main(), so in the side thread, then it logs a bad crash. so not sure that we should do that.
this is something that would go in the crash reports probably.
but
2/ if we add it in the onCreate() (instead of finish()), then:

  • it produces the same kind of exit than what appears after onDestroy()
: System.exit called, status: 0
I AndroidRuntime: VM exiting with result code 0, cleanup skipped.

if i do the quick sequence close-restart, then:

  • before it was like a bad click: app didn't start again (because the 2nd call was to finish the first one)
  • now, it exits the activity with the message, but recreate a new one just after: it clean up the previous activity and start a new one. so the behavior looks good to me

@1bsyl
Copy link
Contributor

1bsyl commented Apr 7, 2023

I've updated #7585
@AntTheAlchemist please test the PR when you have some time

@madebr
Copy link
Contributor

madebr commented Apr 7, 2023

I locally merged #7581 into #7585, and recorded a video of me opening the app, closing, and reopening it by clicking 5 times.
I recorded it with scrcpy, and watched a logcat log in parallel.
In it, I saw onCreate, followed by onDestroy.
And also

10-18 00:19:09.754 15202  4793 V SDL     : Running main function SDL_main from library /data/app/org.libsdl.sdl.test.issue_7567-5t57_6_bHEagcW1y8pOVgA==/lib/arm64/libissue_7567.so
10-18 00:19:09.754 15202  4793 V SDL     : nativeRunMain() 19 time
repro.mp4

@1bsyl
Copy link
Contributor

1bsyl commented Apr 8, 2023

so onDestroy() is called, but the static native data remains ?
nativeRunMain() 19 time
it means, we should add an exit() in onDestroy().
how to you reproduce this ?

@madebr
Copy link
Contributor

madebr commented Apr 8, 2023

I just keep tapping on my phone screen (or clicking in the scrcpy window).
And yes, onDestroy is visible in adb logcat.

Here's a screen cast:
https://user-images.githubusercontent.com/4138939/230724734-a5651f1e-8835-4da8-85d2-3f0c4bd3453e.mp4

@1bsyl
Copy link
Contributor

1bsyl commented Apr 11, 2023

I've looked at the video, the process id, remains the same across all the video (20391) (except at the very start).
even after onDestroy() is called.
If the pid is the same, the native library remain also the same with the static data. (not possibility to unload it).

Now the patch use system.exit(0). so it should really exits. (before, only finish() that call onDestroy()).
so each tap, should be a new processid.

also:
why does the pid remains the same ? do you have another activity that never gets terminated ?
if your app: SDLActivity + SDLEntryTest ? Process would terminate on exit only if all the activity are destroyed?

@madebr
Copy link
Contributor

madebr commented Apr 11, 2023

also: why does the pid remains the same ? do you have another activity that never gets terminated ? if your app: SDLActivity + SDLEntryTest ? Process would terminate on exit only if all the activity are destroyed?

SDLEntryActivity is not started in the video: it is only used when started through a shortcut.
I have no clue why the pid remained the same. I was only tapping on the screen.

Repeating the above test procedure makes logcat return a new process id, and resets the test executable.

Video + logcat of running:

adb shell "while true; do am start-activity org.libsdl.sdl.test.issue_7567/.SDLTestActivity; usleep 900000; input tap 200 200; done"
out.mp4

@AntTheAlchemist
Copy link
Contributor Author

I've updated #7585
@AntTheAlchemist please test the PR when you have some time

@1bsyl I was certain I had reported back on this before now, maybe elsewhere. Apologies if I haven't already & for the delay.

I did test, and it does seem to have fixed the issue from the user's perspective.

I have been re-entering apps without issue for a long time. I was careful not to leave global / static data behind that would disrupt a re-start of the app, simply by not assuming global data defaults when it's restarted on Android. It was kind of useful to maintain the state of the app sometimes. I could see it tripping up a lot of unaware developers though.

Are we sure there won't be any side-effects as a result of this fix?

@madebr
Copy link
Contributor

madebr commented Apr 19, 2023

I think the fix assumes the SDLActivity is the only activity of your app, or at least the last one.
If it is part of a stacl of activities, System.exit(0) will terminate the complete app and other activities on the stack won't get re-activated.

I haven't tested this theory. So this might not be true.

@1bsyl
Copy link
Contributor

1bsyl commented Apr 19, 2023

@AntTheAlchemist thanks for the feedback ! yes, you gave some feedback, but it's better when it's re-actualized

the side effect is almost as @madebr say.

Now by default, it only lets you run the SDLActivity once, with the same PID.
You can have multiple activities (eg also SDLActivity + AdActivity).
But when SDLActivity Finished (eg killed from tasks, or the C main() returned), the next time the SDLActivity is re-created, if it's the same process - it will exit(). Re-creation in same process can happen because of: multiple activities in the app, fast user click, of OS setting (?) (see madebr video).
Nicely, the behavior I see on my phone, is that the exit() is immediately followed by an automatic re-opened. so this is a good unexpected behavior.

Just to make clear:

  • going to background / foreground, isn't affected. this is a pause/resume (not a create/destroy). So unrelated. And that works normally.
  • usually, when the C main() finishes, onDestroy() is called, and the process exit(). And next time, it's a new process.
    So here, we fixes a corner case of the apps being re-opened with the same process / (eg same PID). It should rarely happen with a normal usage (and for an app that doesn't explicitly rely on it).
  • It doesn't exit() in onDestroy(). it exits if it re-opens with same PID.

When is this an issue ?

  • if you have multiple activities, and you want to create/destroy SDL Activity several times. Because you have several activities running, Activity will keep the same PID.

    • I don't think you want to do that for an SDL game. Usually games starts, ends, and you do a fresh re-start a later.
      Maybe it makes sense if you create 1 activity that can embed and start various SDL games...
      But, since SDL_Init/Quit/Init/Quit isn't working, I don't think people do that. (see Android: allow to re-do the sequence SDL_init / SDL_quit #7591)

    • Maybe if you only use SDL for the Joystick subsystem for instance, this can make sense: I mean, your game here is differently designed: SDL is just a third party component. And you don't want you app to exit() when SDLActivity is re-recreated.
      So here, you should use the Hint SDL_HINT_ANDROID_ALLOW_RECREATE_ACTIVITY

Are we sure there won't be any side-effects as a result of this fix?

  • I think this is ok, as long as you make sure to set the hint if you need it (SDL as third party). but in that case, it's kind of easy to detect that your app exits with your multiple activities.

  • without this PR (Android: make sure the activity gets really finished #7585), I can re-produce the bug almost 100% of the time, so clearly, this is something to fix.

@AntTheAlchemist
Copy link
Contributor Author

All very comprehensive, thank you.

The only remaining thing I can think of is will any of the waiting for things to end / restart spark another ANR? I have an SDL3 app live on Play Store that's being punished for exceeding the bad behaviour for ANRs (I'll post a separate bug for this), especially in onDestroy. Seems to only be happening on TV devices. Most of the other ANRs are either caused by AdMob, or joystick operations (SDL_hid_open_path, SDL_LockJoysticks), which I'll update an existing issue report about.

@1bsyl
Copy link
Contributor

1bsyl commented Apr 19, 2023

Here, with the PR, there is no "wait" for things to end / restart. So I don´t believe this can add more ANR.
(if the activity is recreated, it will exit directly)

In fact, I hope this could reduce the ANR rate on android. Because, if you re-create the activity with a previous -random- state, then this is an undefined behavior, leading to ANR (event sent / lock) or Crash

My experience about android anr/crashes:

  • there is a very low crash rate. like totally un-understandable.
  • there are ANR (waiting queue / android.os.MessageQueue.nativePollOnce ) but this seems related to admod

both are under the threshold. crash rate is far below, but the anr rate is quite close to the threshold.
Admob has added: OPTIMIZE_INITIALIZATION and OPTIMIZE_AD_LOADING but that doesn't help much. maybe that defers some anr to crashes ... (look for "MobileAds.Initialize() taking 8 seconds")

On TV devices ... it cannot run AdMob ?

I have disable HID, so I cannot tell. maybe there are some ANR there, that add to the Admob ones, so that it exceeds threshold.

( btw, after updating your app, I think you need to wait some time, eg 2 weeks to have so relevant new rates.)

@1bsyl
Copy link
Contributor

1bsyl commented Apr 20, 2023

@AntTheAlchemist.
The worse case, I can think of:
App starts, it somehow locks the mutex SDL event queue.
The SDL Activity get destroyed (from user swiping the app out? or from OS ? )
SDL Activity get re-created with same PID. so the mutex is already locked. it tries to re-lock it, and dead-lock -> ANR.

Not sure, if this is feasible. but for instance, the video of madebr, prove the same pid is always use there. SDLActivity is always reopened with same PID. maybe because it use other activity to instrument the SDLActivty (you were using scrcpy ? )

@AntTheAlchemist
Copy link
Contributor Author

I've applied dfd80f3 and everything seems to be ok. The '.MainActivity.onDestroy' ANR has disappeared from Google Play Console reports - which was responsible for sending the ANR rate many times over the bad behaviour limit. And of course there's no glitching when restarting the app too quickly. Thanks everyone!

@1bsyl
Copy link
Contributor

1bsyl commented May 17, 2023

@AntTheAlchemist Thanks for the feedback, I was recently wondering how it goes !! thanks also for reporting this !
how long was the new version alive ?
what's your ANR rate right now ? (select in the googleplay console, for the new revision only)
do you have the "ANR delay " reports ?

can we leave it open, so that you also refresh the status, in a couple again ?

@1bsyl 1bsyl reopened this May 17, 2023
@AntTheAlchemist
Copy link
Contributor Author

@1bsyl this update has only been live for 5 days, but the ANRs would appear after 1 day, so I'm confident we won't see any onDestroy related ANRs. 600 new downloads a day gives a good sample. I was averaging 8 ANRs a day, now it's dropped to 3-4 and still dropping daily. I'm still over the bad behaviour threshold (0.2%, needs to be under 0.04%?), but that's caused by joystick mutex locking (e.g: SDL_hid_open_path, SDL_LockJoystick), which will affect this app considerably, since users will be connecting and disconnecting joysticks a lot - I've already opened another issue about this; I'm hoping Sam can look into it at some point when I post more info on the other issue.

It's also worth noting that because people will be testing game controllers on their game-console TV set-up, most of my installs (over 50%) are on Android TV devices, as opposed to phone. My ANRs seem to be something specific to how Android TV works. I've updated about 20 different versions trying to tackle the ANRs and strangely they're just going up and up until your activity destroying patch.

I can revisit here later, but I'm confident this works. All I will say is that there may be another hidden issue that caused the activity to not perfectly be re-used without destruction, so using the hint to keep the activity alive may still be a problem for those that need it.

@slouken
Copy link
Collaborator

slouken commented May 17, 2023

Great! Thanks for your feedback! :)

I haven't forgotten your joystick ANRs, but I don't know what the underlying cause might be. Can you ping the other issue with an updated stack trace and I'll come back and look when I get a chance?

@slouken slouken closed this as completed May 17, 2023
@AntTheAlchemist
Copy link
Contributor Author

I haven't forgotten your joystick ANRs, but I don't know what the underlying cause might be. Can you ping the other issue with an updated stack trace and I'll come back and look when I get a chance?

I'll put together some info that should help point in the right direction, leave it with me. It should be a lot easier now activity is being renewed. I may need to acquire an Android TV box and see if I can trigger the ANR with the profiler running - might be when controllers are physically plugged in and out of a USB port on the box. Bare with :-)

@1bsyl
Copy link
Contributor

1bsyl commented Jun 6, 2023

Hey @AntTheAlchemist !
can you give more feedback about the updat ?
especially, crash rate / anr rate with later vs previous ?

I am currently testing this recently ...

btw, another, question:
in you ANR stack traces, do you have this case where:
there is the list of all threads. you have the SDLThread at let say position "Thread 22".
then you have another "SDLThread" at "Thread 16123" ?

@AntTheAlchemist
Copy link
Contributor Author

AntTheAlchemist commented Jun 6, 2023

Hi @1bsyl !

Crashes are very rare. There are some null pointer crashes in the Java code - will list them in another post. They should be easy to fix with a null check.

The destroy-activity-on-close update has greatly reduced my ANRs, but I'm still way over bad behaviour. I only had 1 rare one 19 days ago in the update. Looking at the other SDL activities, there is one doing a Android_JNI_RequestPermission, following a PLATFORM_hid_init.

I see there has been a recent hid / libusb update to SDL3. I'd like to see if it makes any difference.

Yes, I am seeing multiple SDLTread in the stack trace, with the thread numbers going quite high, eg: 463.

The main culprit is SDL_LockJoysticks.

I gave Sam admin access to my Google Play Console so he could view the stack traces. I can do that, or paste a bunch of collapsible plain-text stack traces here.

@1bsyl
Copy link
Contributor

1bsyl commented Jun 6, 2023

ok !
Please, can you copy paste one one with multiple SDLTread 463 just to see ! because that's strange to me ! thanks !

@1bsyl
Copy link
Contributor

1bsyl commented Jun 9, 2023

@AntTheAlchemist
maybe this helps to reduce more ANR / Crash.
#7792: fix EventFilter with DID_ENTER_FOREGROUND
#7794: robustness, for native code.

Also, if you plane to do a new release, could you add some smart trace in SDLActivity:

replace in SDLActivity.java:
mSDLThread = new Thread(new SDLMain(), "SDLThread");

by:

long seconds = System.currentTimeMillis() / 1000l;
mSDLThread = new Thread(new SDLMain(), "SDLThread_" + seconds);

this can give some insight, like:
is a new SDLThread really re-created? or the stack trace is just broken and re-using the SDLThread name ...

(the longest thread name for me in gp console, so far is 35 chars)

@1bsyl
Copy link
Contributor

1bsyl commented Jun 12, 2023

@AntTheAlchemist
Maybe this fix some ANR in onDestroy #7806

SDLActivity waiting, in onDestroy() : thread.join()

and SDLThread thread is created and seems to be running, without executing the quit event.

@1bsyl
Copy link
Contributor

1bsyl commented Jun 16, 2023

Also, maybe an ANR with nativeKeyDown/Up: #7832

@1bsyl
Copy link
Contributor

1bsyl commented Jun 22, 2023

After pushing, some update to gather some more information about the duplicated SDLThread...
it seems to me ANR logs are erroneous.
I got for instance:
SDLThread_123456789 and SDLThread_123456
So the name was truncated (java bug), but without show the timestamp/

I've inversed the string and got:
123456789_SDLThread
123456789_S
(still truncated), but same timestamp (in miliseconds)

clearly android bug. timestamp couldn't be identical nor truncated.
at least that part of the ANR are not relevants...

@AntTheAlchemist
Copy link
Contributor Author

I suspect some of the ANRs (and crashes) are caused by the Android system, and the app gets the blame. I've seen reports showing SQL_Lite code. I'm not even using SQL.

My ANRs are decreasing slightly in my latest live production. I wasn't in time to include #378e33b though. If you give me a couple more weeks, I'll have some nice fresh stack logs (which I'll post in the other issue) that'll help identify the culprit.

@1bsyl
Copy link
Contributor

1bsyl commented Jun 22, 2023

what's #378e33b ?

@AntTheAlchemist
Copy link
Contributor Author

what's #378e33b ?

378e33b

Sorry, I'm not very good with this Github lark.

@1bsyl
Copy link
Contributor

1bsyl commented Jun 22, 2023

ok ! (copy paste is fine !). I am currently testing it.

@1bsyl
Copy link
Contributor

1bsyl commented Jun 22, 2023

#7859: Android: also protect Hat/Joy/PadDown,Up so they are not sent without window

@1bsyl
Copy link
Contributor

1bsyl commented Jun 22, 2023

@AntTheAlchemist : I meant, you don't need for your next update to change the SDLThread java name. (Unless you've idea about what to include to gather more information)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants