Reviving wrote...
GuyBlade wrote...
Forgive me for a moment while I wax about the technical challenges of this little toy.
....
I've been doing a little bit of N7HQ page scraping myself recently; at one point I was thinking of contacting you to ask for some pointers, as I struggled a bit with DOM tools (I tried using the HTML Agility Pack via C#). In the end I just used some simple regular expressions. My purpose was to pull in the number of UR ranks and the silver challenges that haven't been completed, including the progress made on those challenges.
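In case anyone wants to try the same thing, the extraction pass really can be that simple. Here's a rough Python sketch of the kind of thing I mean (the markup, names and pattern below are made up purely for illustration; the real N7HQ HTML is laid out differently, so you'd adjust the pattern to what you actually see in the page source):

import re

# Made-up markup: the real N7HQ pages are structured differently,
# so this sample and the pattern below are only illustrative.
sample_html = """
<div class="challenge">
  <span class="name">Example Challenge</span>
  <span class="progress">3/10</span>
</div>
"""

# Capture each challenge name and its current/total progress.
pattern = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>\s*'
    r'<span class="progress">(?P<done>\d+)/(?P<total>\d+)</span>'
)

for m in pattern.finditer(sample_html):
    done, total = int(m.group("done")), int(m.group("total"))
    if done < total:  # only report challenges that aren't finished yet
        print(f"{m.group('name')}: {done}/{total}")

Running that prints "Example Challenge: 3/10", which is about all you need to build a list of unfinished challenges.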
I was going to make a small automated tool and see if anyone else on the boards was interested, but I came across the same problem you mention, the "age gate". Do you hit this because the tool trying to download the page doesn't have the necessary cookie? Anyway, as I was just playing around for my own amusement, I manually downloaded the HTML I was interested in and then ran a couple of scripts to extract the data.
Back to your actual problem: does it take 60 seconds to download the six HTML pages that make up the N7HQ? That's just the HTML text and none of the other files? How long do you cache the data for, or rather, after what amount of time do you re-request data from N7HQ? I'm sure it wouldn't matter too much to make it even as long as 24 hours, as this seems to be the thing causing the logjam.
Are you using an insane amount of bandwidth to provide this service? Do you think that N7HQ has throttled you because of high usage, and hence the long times you have retrieving the pages?
That's a lot of questions, and just me thinking off the top of my head. The first thing that struck me when I read your initial message was that you could just add names to a queue when you receive a request for a badge and return whatever you have cached straight away (even if it's well out of date); then a separate process can work on the queue, saving the data and badges for the next time they are accessed. To be honest, it's easy to talk about systems like this in a "hand-wavy" way when I don't know all the technical challenges you've faced and the way the project has changed over time.
Anyway, well done on getting this running again!
Both the age gate and language selection show up if you don't have the appropriate cookies. My solution to both of these was to use a tool that could understand and maintain cookies across requests and to emulate going through them as a user. That is, I pretend to select English and then pretend to enter my age before I do any of the actual polling. I also discard cookies after each user is scraped. That is partially out of laziness (the polling script is standalone) and partially out of not wanting to manage cookie renewal and the like.
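For anyone wanting to do something similar, the flow is roughly the sketch below. This is Python with placeholder URLs and form fields (my actual polling script is not this; the real endpoints and parameter names would have to be pulled from the live pages), but the fresh-session-per-user idea is the same:

import requests

# Rough sketch of the "pretend to be a user" flow described above.
# The URLs and form fields here are placeholders, not the real N7HQ ones.
def make_session():
    session = requests.Session()  # keeps cookies across requests

    # Pretend to pick English on the language-selection page.
    session.post("https://example.com/language-select",
                 data={"locale": "en_US"}, timeout=30)

    # Pretend to enter a birth date to clear the age gate.
    session.post("https://example.com/age-gate",
                 data={"day": "1", "month": "1", "year": "1980"},
                 timeout=30)

    return session

def scrape_user(name):
    # Fresh cookies for each user, thrown away afterward, as described above.
    with make_session() as session:
        return session.get(f"https://example.com/n7hq/{name}", timeout=60).text

Throwing the session away after each user means never having to worry about a stale or expired cookie, at the cost of two extra requests per scrape.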
I cache data for 30 minutes, currently. My thinking was that that's long enough to go play a game and come back if you want to see it updated.
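The freshness check itself is nothing fancy; something along these lines (Python, simplified, not my actual code):

import os
import time

CACHE_TTL = 30 * 60  # 30 minutes, as described above

def is_fresh(cache_path):
    """Return True if a cached copy exists and is newer than the TTL."""
    try:
        age = time.time() - os.path.getmtime(cache_path)
    except OSError:
        return False  # no cached copy yet
    return age < CACHE_TTL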
The amount of bandwidth is usually pretty small. The images being output are on the order of a kilobyte or so, which makes serving them pretty easy. The main user of bandwidth is fetching pages. Each one is in the dozens or hundreds of kilobytes. That doesn't sound like much, but when you're getting a lot of them and they're all blocking long enough to time out and need to be redone, it starts to clog things up. When I got home from work yesterday, I had over a hundred processes in some form of waiting-on-data state. With that many contending for the limited bandwidth of my internet connection, none of them were making progress, so the problem just kept getting worse as more requests rolled in.
I don't think that I've been throttled. I have two internet connections at my home. One is where the servers are, and there is a separate one that I use for day-to-day things. My non-server connection has to wait quite a while to get data from N7HQ and it is never used for polling, so I think the servers are just slow. My guess is that BioWare has dialed back the number of machines serving out the data as the number of players has decreased over time.
Ultimately, the "queue and serve what's available" method is what I did to resolve the most recent batch of problems. I had previously used other bandaids, but they were all relatively vulnerable to race conditions or sudden large request bursts. The current solution uses an (internal-only) RPC server to manage update requests. The webserver will block for up to two minutes on a request while waiting for the RPC server to update data, then serve whatever it has. I didn't want to go this way initially because it means another piece of code that I'll have to daemonize and ensure is running every time the machine boots. I prefer to do everything on the webserver side, when possible, as it tends to be a bit more resilient.
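For the curious, the shape of it is roughly the following. This is a Python sketch using the standard library's XML-RPC bits as a stand-in for the actual RPC server, with made-up names and a fake refresh step, so treat it as the pattern rather than the real code:

import queue
import threading
import time
from socketserver import ThreadingMixIn
from xmlrpc.server import SimpleXMLRPCServer

DONE = {}               # name -> timestamp of last completed refresh
PENDING = queue.Queue() # names waiting to be refreshed

def refresh_player(name):
    # Stand-in for the real work: scraping N7HQ and rewriting the cache.
    time.sleep(5)
    DONE[name] = time.time()

def worker():
    # A single worker drains the queue, so N7HQ fetches never pile up in parallel.
    while True:
        refresh_player(PENDING.get())

def request_update(name, timeout=120):
    """RPC entry point: queue a refresh and wait up to `timeout` seconds for it."""
    queued_at = time.time()
    PENDING.put(name)
    while time.time() - queued_at < timeout:
        if DONE.get(name, 0) >= queued_at:
            return True   # fresh data is ready
        time.sleep(1)
    return False          # timed out; the caller serves stale data instead

class ThreadedRPCServer(ThreadingMixIn, SimpleXMLRPCServer):
    pass  # threaded, so several web requests can wait at the same time

threading.Thread(target=worker, daemon=True).start()
server = ThreadedRPCServer(("127.0.0.1", 8000))
server.register_function(request_update)
server.serve_forever()

The webserver side then just calls request_update(name) over xmlrpc.client with its own two-minute limit, and if that comes back False it serves whatever badge image is already cached rather than leaving the request hanging.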