
The Vault Preservation Project


161 replies to this topic

#51
ehye_khandee
  • Members
  • 855 messages
If there's any way admin access can be had, we'd be happy to wrangle the essential file transfer for you. It would be as simple as copying the databases and the files themselves. I'm not sure what the original is written in, but now might be a good time to revamp the thing, along the lines of what was done / is being done on neverwinterconnections.com.

Be well. Game on.
GM_ODA

#52
Just a ghost
  • Members
  • 146 messages
Quite sure that database is huge and not something you can easily export.

#53
ehye_khandee
  • Members
  • 855 messages
It can be done, and when going from server to server you cut out one step vs. having individuals download and store, then re-upload into forms. The amount of man-hours required to do it the latter way seems prohibitive. We did this with neverwinterconnections.com, copying all data (with admin access to both servers, they connect directly at higher speeds than the bottleneck of your local ISP would allow).

Be well. Game on.
GM_ODA
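
For what it's worth, the file side of a direct server-to-server copy can be as small as the sketch below. The login, host and paths are invented for illustration, and it assumes rsync and ssh access on both machines:

#!/usr/bin/perl
# sketch only: pull the vault's file store straight onto the new server.
# the login, host and paths here are placeholders, not real ones.
# --partial lets an interrupted transfer resume instead of starting over.
system('rsync -avz --partial vaultadmin@nwvault.ign.com:/var/www/fms/files/ /backup/nwvault/files/');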

#54
acomputerdood
  • Members
  • 219 messages
yeah, if you have shell access to the server, you can just dump the whole db. this sort of problem has come up before, and people have already worked out solutions for it. :)
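
For example, something along these lines, assuming the Vault sits on MySQL (a guess) and using placeholder credentials and database name:

#!/usr/bin/perl
# sketch only: dump the whole database to a compressed file.
# the user and db name are placeholders; --single-transaction avoids locking a live site.
system('mysqldump -u vault_admin -p --single-transaction nwvault_db | gzip > nwvault_db.sql.gz');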

#55
meaglyn
  • Members
  • 807 messages
I second, or rather third, that copying the DB and files directly would be the best way to do this. There may be political/legal hurdles, and nwvault may not be a separate DB but part of a single massive IGN database. Maybe the right tables could be exported selectively...
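
If it is one big shared IGN database, mysqldump can at least take an explicit table list, so only the Vault's own tables would need to go out. The database and table names below are made up purely for illustration:

#!/usr/bin/perl
# sketch only: export just the relevant tables instead of the whole database.
# the database and table names here are hypothetical.
system('mysqldump -u vault_admin -p ign_db vault_entries vault_files vault_comments | gzip > vault_tables.sql.gz');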

#56
Rolo Kipp
  • Members
  • 2 791 messages
 <dancing...>

Maximus says
...
Thanks for the heads up and all your efforts. I fully endorse this...
Getting any resources to help with the current site is very difficult, and any effort you're making to keep the community going is much appreciated...

That is Soooooo cool! :-)
Unfortunately, the "difficult" comment includes getting access to the existing Db :-P
Which means we continue with the "EULA-friendly" "Many Hands" approach :-)

Edit: Trying to write with half a brain just doesn't work :-P I'm just happy Maximus and IGN are okay with this project :-)

<...like no one is watching>

Edited by Rolo Kipp, 01 October 2012 - 09:38.


#57
Tarot Redhand
  • Members
  • 2 674 messages
So I says to myself how hard can it be to do a section with just 6 pages (as opposed to 50+ for modules) for this project? Little did I know just how absurd some people can be. Just on the first page I came upon 2 submissions -

Submission 1 - Submission 2

which between them have 98 (77 and 21 respectively) separate small files to download. This begs the question: don't they realise the inherent laziness of some people (me included) who would much rather download a single file and sort it out for themselves? If it was done so that credit to the original authors would be explicit, I have to ask: have they never heard of a readme file? In exasperation I really have to ask - are they nuts?

I have to say that Carcerian nearly made it into this rant, but he had the decency to also include "download all fonts" type files in his 2 font submissions.

TR <still banging head against wall, while gibbering in corner>

Edited by Tarot Redhand, 01 October 2012 - 11:54.


#58
Bannor Bloodfist
  • Members
  • 924 messages
Well, it appears that special care was taken to give credit where credit is due, which is what all the links in the description are for. However, I think just compiling the "final" haks would have been better still.

I am surprised that CEP allowed any of their content to be removed/split from their compilations, as they have always been very vocal about their "ownership" of other folks' works in the past.

As to Maximus; well, in my experience with him over the past 9 years or so, he has ALWAYS been a very helpful person, in more ways than the community at large has ever officially recognized. He may even be willing to give a port into the db, if asked correctly. I.e., just the table name(s) and access to export those tables, OR he may be willing to port them directly (less likely, as the bandwidth required is going to be HUGE, and it would require him to export to himself and then either send some sort of link to that data OR directly send the data).

I know I have half a dozen CDs, yes, CDs not DVDs, of data that I grabbed from the vault years ago. I would not be able to transmit that much data across my internet connection without it hogging the bandwidth for at least a week.

Has anyone considered a time limit or data limit on how far back you are willing to "mine" the data? I mean, there are haks up there from the very beginning of NWN, and I would suspect that most of those haks have had no traffic for years or have been superseded by much more recent uploads. Of course, having said that, I also know that there are some gems buried back there that are still worth saving but may still require updating/fixing.

As you all know, I have always been tileset-specific in my searches for data, and have saved mainly tilesets. I have many of the haks, with whatever documentation was included in them, but have no backups of the original postings on the vault. Gawd, I wish I could have kept the CTP plugging away; we had a huge amount of content that never got finished and released. A large section of the "extra" work we had done has been lost, but I still have the "original" files stored away on CDs with much of the "original work" that was performed by the early CTP team. (I lost the interim work that was performed by the "middle" team during CTP's life cycle, but still have most of the "end" stages, etc.)

Anyway, back to THIS project. We are talking in the range of 150-200 gig, possibly more, of data to mine. I don't care how fast your internet connection is, that is a HUGE amount of bandwidth, and it WILL set off alarm bells for any ISP out there. Downloads to a personal pc/location are one thing, but when you start "sending" that much data to a centralized location, your ISP may, and likely will, cut or slow down your internet connection on a monthly basis.

Have we figured out how/where, exactly, the data is going to end up? I know, I know, you have this Drupal site, but from my recent experience in posting there, Drupal is going to take a huge amount of re-editing of posts to get formatting to work. Much less xmitting all the actual hak files.

Please excuse me if I missed some notes on this over all the posts for this project, but I wish for this project to succeed and am just making sure we are ALL considering the amount of real data that we are attempting to save along with the "dangers" involved.

quick recap:
1) Age of files to save?
2) Size of files to be saved (along with posts etc.)
3) Formatting, which to my knowledge has not really been addressed yet?
4) Possible direct DB access for direct export and to where exactly?
5) Editing - Reformatting of all that data?

#59
virusman
  • Members
  • 282 messages

Bannor Bloodfist wrote...

Anyway, back to THIS project. We are talking in the range of 150-200 gig, possibly more, of data to mine. I don't care how fast your internet connection is, that is a HUGE amount of bandwidth, and it WILL set off alarm bells for any ISP out there. Downloads to a personal pc/location are one thing, but when you start "sending" that much data to a centralized location, your ISP may, and likely will, cut or slow down your internet connection on a monthly basis.

Not everywhere... I have a 60 Mbit connection and uploaded 1.7 TB last month, with no problems from my ISP.

I think the current goal is to save as much data as we can and store it somewhere, and only set up a site if the Vault ever goes down.

#60
Bannor Bloodfist
  • Members
  • 924 messages
wow... my max speed for download is 2mb (with limitations) and max upload is only 256k... Not that I really need anything faster than that anymore, unless I am downloading a new version of Skyrim or something (which happens frequently enough to be annoying).

So, you have roughly 30 times the speed that I can get... damn ISP. I can upgrade to a max of 10mb, but that costs about $80 per month just for the internet, on top of the $80 that I pay for basic cable. I live in the boonies, and there is absolutely NO competition for internet connections out here. It is a one-shop area, and they claim "We are a small company, so we are excluded from allowing others to share our cable lines"; where in a normal city environment you can have multiple carriers on the same line, here we only have one. Not enough customers for the bigger internet folks to even bother attempting to run new cables out here. Besides, I live in the USA, and we are what, 47th in internet speeds worldwide? That number changes back and forth, but the USA is inherently slower than anyone in Europe or the Far East.
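
For scale, taking the 200 GB upper estimate from above (about 1.6e12 bits) and ignoring protocol overhead: at a 256 kbit/s uplink that is roughly 6,250,000 seconds, or about 72 days of continuous uploading; at 2 Mbit/s it is roughly 800,000 seconds, or a bit over 9 days; and on a 60 Mbit/s line it is roughly 26,700 seconds, or about 7.5 hours. Whether the transfer is painful really does come down to the link doing the sending.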

#61
Mecheon
  • Members
  • 439 messages
Just going to say Bannor, I once was a moderator on a Warcraft website that exceeded its allocation by about 5 TB

I think we only had 500 megs. And we'd filled nearly every server this place had. ISPs can miss a lot

#62
Pstemarie
  • Members
  • 2 745 messages
I feel your pain, Bannor - I pay $75 per month for cable through Charter for a 30mb line with unlimited bandwidth. On top of that I have DirecTV (since Charter's cable TV package goes up every 3 months and keeps dropping channels).

#63
acomputerdood
  • Members
  • 219 messages
my next attempt:


#!/usr/bin/perl
# scrape nwvault detail pages: one directory per project, plus its files and comment pages


$OUTPUT = "projects";
$project = "";


for($id=1; $id < 5; $id++){
#       $url = "http://nwvault.ign.com/View.php?view=Scripts.Detail\\&id=3800";
#       $url = "http://nwvault.ign.com/View.php?view=hakpaks.detail\\&id=7849";
        $url = "http://nwvault.ign.com/View.php?view=Scripts.Detail\\&id=" . $id;

        $page = `curl -s $url`;

        @lines = split /\n/, $page;

        $files = 0;
        $comments = 0;
        foreach $l (@lines){
                # the project title doubles as the directory name, lightly sanitized
                if($l =~ /<a href="\#Files" title="Downloads Below">(.*?)<\/a>/){
                        $project = $1;
                        $project =~ s/\// /g;
                        $project =~ s/&/and/g;
                        $project =~ s/--/-/g;
                        $project =~ s/\(//g;
                        $project =~ s/\)//g;
                        $project =~ s/ /_/g;
                        print "\nprocessing $project -> $OUTPUT/$project\n";
                        `mkdir $OUTPUT/$project`;
                }
                if($l =~ /<a name="Files"><\/a>Files/){
                        $files = 1;
                }
                if($l =~ /<\/TABLE>/){
                        $files = 0;
                }


                # download every file listed in the Files table
                if($l =~ /<a href="(fms\/Download\.php.*?)".*?>(.*?)<span>/){
                        print "downloading: $2\n";
                        `wget -O $OUTPUT/$project/$2 http://nwvault.ign.com/$1`;
                }
                # follow the "Next" link through the comment pages, once per entry
                if($comments == 0){
                        if($l =~ /<A href="\/View.php.*" >Next&gt;<\/A>/){
                                $comments = 1;
                                &get_next_page($url, 2);
                        }
                }
                next if !$files;

        }

        open(FILE, ">$OUTPUT/$project/index.html");
        print FILE $page;
        close FILE;
}


sub get_next_page{
        $u = shift;
        $num = shift;
        print "fetching comments page: $num\n";

        $u2 = $u . "\\&comment_page=$num";

        $p = `curl -s $u2`;
        open(FILE, ">$OUTPUT/$project/index$num.html");
        print FILE $p;
        close FILE;


        @lines2 = split /\n/, $p;
        foreach $l2 (@lines2){
                if($l2 =~ /<A href="\/View.php.*" >Next&gt;<\/A>/){
                        &get_next_page($u, $num + 1);
                }
        }
}

it will iterate through all of the entries in the for loop, creating a new project directory for each page, downloading each file, and grabbing each comment page. it doesn't do screenshots - do we care about that?

also, i've not been able to find a page with an external linked source to test against, but it *should* try to download it from the vault and fail.

i've tested it for entries 1-10 and it works great. i'll try to capture all of the scripts directory next, but i don't know how much space i'll be using. anybody want to volunteer testing it?
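
For anyone volunteering, the output ends up laid out like this (the project and file names below are only examples; the real ones come from each page, and note the plain mkdir means the projects/ directory has to exist before the first run):

projects/
    Some_Project_Name/
        index.html       <- the entry's main page (description, files list, first comment page)
        index2.html, index3.html, ...  <- additional comment pages fetched by get_next_page, if any
        somefile.zip     <- each file from the Files table, saved under its listed name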


PS Tarot Redhand:
i changed my url to process the link you posted for the textures page. it seems the vault pages are standardized enough that it works beautifully against it.

i did notice, however, that i'm not trying to grab anything linked in from the "description" section. i think that's fine because those links are either to external files or to other vault pages.

#64
acomputerdood
  • Members
  • 219 messages
just an update - this script will grab the screenshots and thumbs:

#!/usr/bin/perl
# same scraper as above, now also grabbing screenshots and thumbnails


$OUTPUT = "projects";
$project = "";


for($id=147; $id < 148; $id++){
#       $url = "http://nwvault.ign.com/View.php?view=Scripts.Detail\\&id=3800";
#       $url = "http://nwvault.ign.com/View.php?view=hakpaks.detail\\&id=7849";
#       $url = "http://nwvault.ign.com/View.php?view=Scripts.Detail\\&id=" . $id;
        $url = "http://nwvault.ign.com/View.php?view=Textures.Detail\\&id=" . $id;

        $page = `curl -s $url`;

        @lines = split /\n/, $page;

        $files = 0;
        $comments = 0;
        $images = 0;
        foreach $l (@lines){
                if($images == 1){
                        if($l =~ /<a href/){
                                &grab_screenshots($l);
                        }
                }

                if($l =~ /<a href="\#Files" title="Downloads Below">(.*?)<\/a>/){
                        $project = $1;
                        $project =~ s/\// /g;
                        $project =~ s/&/and/g;
                        $project =~ s/--/-/g;
                        $project =~ s/\(//g;
                        $project =~ s/\)//g;
                        $project =~ s/ /_/g;
                        print "\nprocessing $project -> $OUTPUT/$project\n";
                        `mkdir -p $OUTPUT/$project`;
                }
                if($l =~ /<a name="Files"><\/a>Files/){
                        $files = 1;
                }
                if($l =~ /<\/TABLE>/){
                        $files = 0;
                }


                if($l =~ /<a href="(fms\/Download\.php.*?)".*?>(.*?)<span>/){
                        print "downloading: $2\n";
                        `wget -O $OUTPUT/$project/$2 http://nwvault.ign.com/$1`;
                }
                if($comments == 0){
                        if($l =~ /<A href="\/View.php.*" >Next&gt;<\/A>/){
                                $comments = 1;
                                &get_next_page($url, 2);
                        }
                }
                if($l =~ /-START OF IMAGE CODE-/){
                        $images = 1;
                }
        }

        open(FILE, ">$OUTPUT/$project/index.html");
        print FILE $page;
        close FILE;
}


sub get_next_page{
        $u = shift;
        $num = shift;
        print "fetching comments page: $num\n";

        $u2 = $u . "\\&comment_page=$num";

        $p = `curl -s $u2`;
        open(FILE, ">$OUTPUT/$project/index$num.html");
        print FILE $p;
        close FILE;


        @lines2 = split /\n/, $p;
        foreach $l2 (@lines2){
                if($l2 =~ /<A href="\/View.php.*" >Next&gt;<\/A>/){
                        &get_next_page($u, $num + 1);
                }
        }
}

sub grab_screenshots{
        # pull both the full-size screenshots and the thumbnails off the image row
        $images = 0;
        $imgline = shift;

        @imgchunks = split /<p>/, $imgline;

        foreach $ic (@imgchunks){
                if($ic =~ /<a href="(fms\/Image.php\?id=(.*?))"/){
                        `wget -O $OUTPUT/$project/$2.jpg http://nwvault.ign.com/$1`;
                }
                if($ic =~ /src="(http:\/\/vnmedia.ign.com\/nwvault.ign.com\/fms\/images\/.*?\/.*?\/(.*?))"/){
                        `wget -O $OUTPUT/$project/$2 $1 `;
                }
        }
}


#65
werelynx
  • Members
  • 628 messages
@acomputerdood: "not trying to grab anything linked in from the "description" section"
Sometimes in the modules section, the hakpaks are linked in the description... sometimes those hakpaks are module-specific. Of course it's fine as long as you are going to grab all the hakpaks as well, but there is still a need to link those haks back to the module.

Good luck with this project.
I hope I'll be able to join soon, but I probably won't have time until February.

#66
acomputerdood
  • Members
  • 219 messages
well, the way i see it, the things linked in the description section fall into one of three categories:

1) they're already included in the downloads section below, which i'm grabbing
2) they're linked to another vault page that somebody else is in charge of archiving
3) they're hosted on an external site, which we're not supposed to download, only preserve the link to

i believe my script handles all 3 cases.
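
Roughly, and not lifted from the actual script above, the split can be made on the href alone; $href here is just taken from the command line for the sake of the example:

#!/usr/bin/perl
# sketch only: classify a description link by pattern
my $href = shift @ARGV;
if    ($href =~ m{^fms/Download\.php}) { print "case 1: local file, already grabbed from the Files table\n"; }
elsif ($href =~ m{View\.php})          { print "case 2: another vault page, archived separately\n"; }
elsif ($href =~ m{^https?://})         { print "case 3: external site, keep the link only\n"; }
else                                   { print "unclassified: $href\n"; }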

#67
werelynx
  • Members
  • 628 messages
What I meant for 2) is that the old link will direct you to the nwvault(..) address when it should direct you to its nwn-ccc(..) equivalent. If the vault goes down, it would be a dead link.

What you could do is make your code flag, in big red letters, that there is a link that needs to be changed (visible when checking the uploaded page), so it can be changed manually once all the content is already on nwn-ccc.

#68
acomputerdood
  • Members
  • 219 messages
that's easy enough to fix in post-processing. once everything is grabbed, i assume somebody will take up the effort of reformatting everything into the new pages and layouts. until that happens, there's no reason to try and correct links now.

it will be just as easy to identify incorrect links at that time as it is now.
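
A first pass at that post-processing step could be as small as the sketch below; it assumes the pages were saved as projects/*/index*.html the way the script above does, and it will be noisy since it also flags the vault's own navigation links:

#!/usr/bin/perl
# sketch only: list saved pages that still carry links back to nwvault.ign.com
foreach my $file (glob("projects/*/index*.html")) {
        open(my $fh, "<", $file) or next;
        while (my $line = <$fh>) {
                print "$file: needs relinking -> $1\n" while $line =~ m{href="(http://nwvault\.ign\.com[^"]*)"}gi;
        }
        close $fh;
}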


oh, and incidentally, my first run just finished.  it took 348 minutes to download the 3865 entries in the "scripts" section.

rolo, can you run a perl script on the server you're archiving on?

Edited by acomputerdood, 02 October 2012 - 05:11.


#69
Rolo Kipp
  • Members
  • 2 791 messages
<doing the whole...>

@ Bannor: I agree, now that we have some legitimacy from Maximus. Concentrating on the Last In (Newest) projects first makes a lot of sense to me. (Note: the VPP has an upload limit of 100mb. Larger files will need to be somehow sent to me (dropbox, sky-drive, google drive, yousendit... etc.) for FTP, or dropped into the Vault's "over 25mb" FTP site. I can access them from there directly, just not after they've been moved to the Vault's permanent storage :-P.)

@ Tarot: My take is to bundle up the Realms stuff into one 7z or rar :-P Personally, when I do the haks (soon, real soon) I'll be re-archiving anything that isn't either rar, zip (:-P ) or 7z. I'm only including zip because it's ubiquitous ;-/

@ Virusman: Save everything! Newest stuff first, though... But eventually, I hope to update the site (like neverwinter connections did) and incorporate the full Vault functionality.

@ ACD: I can run perl (though I'm currently perl ignorant). Email me ( rolo@amethysttapestry.com )? Let's talk. Or call me after 7pm PST if you have a cell phone and are in the US (pm me for number?)

My current thought for automating stuff is to get a CSV or Db of the metadata, create all the projects (flagged "pending upload") using drupal's migration tools and then present a filterable list of projects still needing files/screenies/comments... Make any sense?

In that context, preserving the data is still the first priority, while I structure the site.

@ Oda: I would love to take this opportunity to improve the Vault. But then, I really only know how to do tricky stuff in Drupal (and 3DS Max, but that's a different thread ;-). I am negotiating <pleading> for Db/files access.

@ Werelynx: The Original Page link is specifically there to point to the NwVault page the project was salvaged from. My reasoning is two-fold: first, to be sure proper credit is given to the original author, and second, to provide a quick compare link to find/fix mistakes. If the NwVault does go down, that field may be disabled. I.e., it's there for construction purposes at the moment.

<...one-armed paper-hanger bit>

Edited by Rolo Kipp, 02 October 2012 - 07:01.


#70
Rolo Kipp
  • Members
  • 2 791 messages
<going...>

Just an update in the middle of other things, Maximus has fixed my admin tools and I've approved most of the backlog on the Vault.

<...approval-happy>

#71
Bard Simpson
  • Members
  • 162 messages
You're the man, Rolo! Oh, wait; should that be you're the wizard? Eh, either way, thank you for starting this project and thank you for all your hard work on the Vault itself.

#72
Lovelamb
  • Members
  • 38 messages
Sir, did you say the Vault is now read-only? I've devoted over a year of my recent life to working on an evil module that I doubt the Nexus, with their strict rules, would accept... (Should I kill myself for being late? :()

I would like to help with backing up the Vault, though I might need an explanation as to how to upload the content to your site. I can save all the web pages and related files for now. You can sign me up for the first 10 pages (or 250 modules) on the module list. I'm not sure how the modules are ordered; I hope everyone sees the same list. I have the disk space, but my upload speed isn't very high.

Edited by Lovelamb, 02 October 2012 - 10:26.


#73
Vibrant Penumbra
  • Members
  • 162 messages
 Hmmm, like the look, lambchops =]

Yeah, the Kipper said it was read-only... for a while :P

Maxy-dear fixed the old man's wagon and now it's working again. For another little while. :unsure:

Ack! Sunshine!

Toodles!

#74
meaglyn
  • Members
  • 807 messages
ACD - drat you beat me to it :)

I've just about completed a set of scripts which do almost the same thing yours does. The major difference is the creation of a key-value metadata file along with the downloads. The idea there was to make it easy to get that data into a new DB, but that could be done with tools on the saved raw html too once it's all downloaded.

Cheers,
Meaglyn

#75
Rolo Kipp
  • Members
  • 2 791 messages
<reaching out...>

@ meaglyn: But that is what I want! :-P The key value metadata, that is... preferably in CSV or Excel format.

Would you be willing to share with ACD and incorporate that? He's sent me one updated version, why not another =)

Getting the metadata into an easily imported format would make things vastly easier. I'd then use that CSV file to generate the projects. Then all I need to do is link up the files/screenies and comments.
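
For what it's worth, a few lines dropped into ACD's main loop could write that CSV as it goes; the columns here (id, title, original url) are only a guess at what the import would want:

# sketch only: append one metadata row per project, using the variables
# ($id, $project, $url) already set in the scraper's main loop above
(my $clean_url = $url) =~ s/\\&/&/;             # drop the shell-escaping backslash before storing
open(META, ">>metadata.csv");
print META "$id,\"$project\",\"$clean_url\"\n";
close(META);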

Actually, comments could be collected in a keyed file, also. Drupal gives each comment its own node and links the nodes to the project. So I'd just need a field for each comment with the unique identifier for the project... I think... :-P

<...with both hands>