The Vault Preservation Project
#51
Posted 30 September 2012 - 10:28
Be well. Game on.
GM_ODA
#52
Posted 01 October 2012 - 06:35
#53
Posted 01 October 2012 - 08:29
Be well. Game on.
GM_ODA
#54
Posted 01 October 2012 - 12:17
#55
Posted 01 October 2012 - 02:04
this. There may be political/legal hurdles: nwvault may not be a separate DB, but part of a single massive IGN database. Maybe the right tables could be exported selectively...
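(Purely a sketch of what "exported selectively" could mean mechanically, assuming the Vault lives in MySQL; the database and table names below are guesses, not the real IGN schema:)
#!/usr/bin/perl
# Hypothetical: dump only the nwvault-related tables out of a shared database.
# "ign_master" and the three table names are placeholders for whatever the real schema uses.
`mysqldump --single-transaction ign_master nwvault_submissions nwvault_files nwvault_comments > nwvault_export.sql`;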
#56
Posted 01 October 2012 - 08:33
That is Soooooo cool! :-)
Maximus says
...
Thanks for the heads up and all your efforts. I fully endorse this...
Getting any resources to help with the current site is very difficult, and any effort you're making to keep the community going is much appreciated...
Unfortunately, the "difficult" comment includes getting access to the existing Db :-P
Which means we continue with the "EULA-friendly" "Many Hands" approach :-)
Edit: Trying to write with half a brain just doesn't work :-P I'm just happy Maximus and IGN are okay with this project :-)
<...like no one is watching>
Edited by Rolo Kipp, 01 October 2012 - 09:38.
#57
Posted 01 October 2012 - 11:52
Submission 1 - Submission 2
which between them have 98 (77, 21) separate small files to download. This begs the question: don't they realise the inherent laziness of some people (me included) who would much rather download a single file and sort it out for themselves? If it was done so that credit to the original authors would be explicit, I have to ask: have they never heard of a readme file? In exasperation I really have to ask - are they nuts?
I have to say that Carcerian nearly made it into this rant, but he had the decency to also include "download all fonts"-type files in his 2 font submissions.
TR <still banging head against wall, while gibbering in corner>
Edited by Tarot Redhand, 01 October 2012 - 11:54.
#58
Posted 02 October 2012 - 06:27
I am surprised that CEP allowed any of their content to be removed/split from their compilations, as they have always been very vocal about their "ownership" of other folks' works in the past.
As to Maximus: well, in my experience with him over the past 9 years or so, he has ALWAYS been a very helpful person, in more ways than the community at large has ever officially recognized. He may even be willing to give a port into the db, if asked correctly. I.e., just the table name(s) and access to export those tables, OR he may be willing to port them directly (less likely, as the bandwidth required is going to be HUGE, and would require him to export to himself and then send some sort of link to that data, OR to directly send the data).
I know I have half a dozen CDs, yes, CDs not DVDs, of data that I grabbed from the vault years ago. I would not be able to transmit that much data across my internet connection without it hogging the bandwidth for at least a week.
Has anyone considered a time limit or data limit on how far back you are willing to "mine" the data? I mean, there are haks up there from the very beginning of NWN, and I would suspect that most of those haks have had no traffic for years or have been superseded by much more recent uploads. Of course, having said that, I also know that there are some gems buried back then that are still worth saving that may still require updating/fixing.
As you all know, I have always been tileset-specific in my searches for data, and have saved mainly tilesets. I have many of the haks, with whatever documentation was included in them, but have no backups of the original postings on the vault. Gawd, I wish I could have kept the CTP plugging away; we had a huge amount of content that never got finished and released. A large section of the "extra" work we had done has been lost, but I still have the "original" files stored away on CDs with much of the "original work" that was performed by the early CTP team. (I lost the interim work that was performed by the "middle" team during CTP's life cycle, but still have most of the "end" stages, etc.)
Anyway, back to THIS project. We are talking in the range of 150-200 gig, possibly more, of data to mine. I don't care how fast your internet connection is, that is a HUGE amount of bandwidth, and it WILL set off alarm bells for any ISP out there. Downloads to a personal PC/location are one thing, but when you start "sending" that much data to a centralized location, your ISP may, and likely will, cut or slow down your internet connection on a monthly basis.
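(Back-of-envelope only, taking the 200 gig upper figure as given: 200 GB is roughly 1,600,000 megabits, so even a fully saturated 10 Mbit/s upload needs about 160,000 seconds, close to two days of nonstop transfer, and at a more typical 1-2 Mbit/s residential upload it stretches to somewhere between one and three weeks.)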
Have we figured out how/where, exactly, the data is going to end up? I know, I know, you have this Drupal site, but from my recent experience in posting there, Drupal is going to take a huge amount of re-editing of posts to get the formatting to work. Much less transmitting all the actual hak files.
Please excuse me if I missed some notes on this across all the posts for this project, but I wish for this project to succeed and am just making sure we are ALL considering the amount of real data that we are attempting to save, along with the "dangers" involved.
Quick recap:
1) Age of files to save?
2) Size of files to be saved (along with posts, etc.)
3) Formatting - which, to my knowledge, has not really been addressed yet?
4) Possible direct DB access for direct export - and to where, exactly?
5) Editing/reformatting of all that data?
#59
Posted 02 October 2012 - 06:49
Bannor Bloodfist wrote...
Anyway, back to THIS project. We are talking in the range of 150-200 gig, possibly more, of data to mine. I don't care how fast your internet connection is, that is a HUGE amount of bandwidth, and it WILL set off alarm bells for any ISP out there. Downloads to a personal pc/location is one thing, but when you start "sending" that much data to a centralized location, your ISP may, and likely will, cut or slow down your internet connection on a monthly basis.
Not everywhere... I have a 60 Mbit connection and uploaded 1.7 TB last month, with no problems from the ISP.
I think the current goal now is to save as much data as we can and store it somewhere, and set up a site only if the Vault ever goes down.
#60
Posted 02 October 2012 - 07:49
So, you have roughly 30 times the speed that I can get... damn ISP. I can upgrade to a max of 10 Mbit, but that costs about $80 per month just for the internet, on top of the $80 that I pay for basic cable. I live in the boonies, and there is absolutely NO competition for internet connections out here. It is a one-shop area, and they claim "We are a small company, so we are excluded from allowing others to share our cable lines"; whereas in a normal city environment you can have multiple carriers on the same line, here we only have one. Not enough customers for the bigger internet folks to even bother attempting to run new cables out here. Besides, I live in the USA, and we are what, 47th in internet speeds worldwide? That number changes back and forth, but the USA is inherently slower than anyone in Europe or the Far East.
#61
Posted 02 October 2012 - 08:20
I think we only had 500 megs. And we'd filled nearly every server this place had. ISPs can miss a lot.
#62
Posted 02 October 2012 - 10:38
#63
Posted 02 October 2012 - 10:40
#!/usr/bin/perl

$OUTPUT = "projects";
$project = "";

for($id=1; $id < 5; $id++){
#   $url = "http://nwvault.ign.com/View.php?view=Scripts.Detail\\&id=3800";
#   $url = "http://nwvault.ign.com/View.php?view=hakpaks.detail\\&id=7849";
    $url = "http://nwvault.ign.com/View.php?view=Scripts.Detail\\&id=" . $id;

    $page = `curl -s $url`;
    @lines = split /\n/, $page;

    $files = 0;
    $comments = 0;

    foreach $l (@lines){
        if($l =~ /<a href="\#Files" title="Downloads Below">(.*?)<\/a>/){
            $project = $1;
            $project =~ s/\// /g;
            $project =~ s/&/and/g;
            $project =~ s/--/-/g;
            $project =~ s/\(//g;
            $project =~ s/\)//g;
            $project =~ s/ /_/g;
            print "\nprocessing $project -> $OUTPUT/$project\n";
            `mkdir $OUTPUT/$project`;
        }
        if($l =~ /<a name="Files"><\/a>Files/){
            $files = 1;
        }
        if($l =~ /<\/TABLE>/){
            $files = 0;
        }
        if($l =~ /<a href="(fms\/Download\.php.*?)".*?>(.*?)<span>/){
            print "downloading: $2\n";
            `wget -O $OUTPUT/$project/$2 http://nwvault.ign.com/$1`;
        }
        if($comments == 0){
            if($l =~ /<A href="\/View.php.*" >Next><\/A>/){
                $comments = 1;
                &get_next_page($url, 2);
            }
        }
        next if !$files;
    }

    open(FILE, ">$OUTPUT/$project/index.html");
    print FILE $page;
    close FILE;
}

sub get_next_page{
    $u = shift;
    $num = shift;
    print "fetching comments page: $num\n";
    $u2 = $u . "\\&comment_page=$num";
    $p = `curl -s $u2`;
    open(FILE, ">$OUTPUT/$project/index$num.html");
    print FILE $p;
    close FILE;
    @lines2 = split /\n/, $p;
    foreach $l2 (@lines2){
        if($l2 =~ /<A href="\/View.php.*" >Next><\/A>/){
            &get_next_page($u, $num + 1);
        }
    }
}
It will iterate through all of the entries in the for loop, creating a new project directory for each page, downloading each file, and grabbing each comment page. It doesn't do screenshots - do we care about that?
Also, I've not been able to find a page with an externally linked source to test against, but it *should* try to download it from the vault and fail.
I've tested it for entries 1-10 and it works great. I'll try to capture all of the scripts directory next, but I don't know how much space I'll be using. Anybody want to volunteer to test it?
PS Tarot Redhand:
I changed my URL to process the link you posted for the textures page. It seems the vault pages are standardized enough that it works beautifully against it.
I did notice, however, that I'm not trying to grab anything linked from the "description" section. I think that's fine because those links are either to external files or to other vault pages.
#64
Posted 02 October 2012 - 11:05
#!/usr/bin/perl

$OUTPUT = "projects";
$project = "";

for($id=147; $id < 148; $id++){
#   $url = "http://nwvault.ign.com/View.php?view=Scripts.Detail\\&id=3800";
#   $url = "http://nwvault.ign.com/View.php?view=hakpaks.detail\\&id=7849";
#   $url = "http://nwvault.ign.com/View.php?view=Scripts.Detail\\&id=" . $id;
    $url = "http://nwvault.ign.com/View.php?view=Textures.Detail\\&id=" . $id;

    $page = `curl -s $url`;
    @lines = split /\n/, $page;

    $files = 0;
    $comments = 0;
    $images = 0;

    foreach $l (@lines){
        if($images == 1){
            if($l =~ /<a href/){
                &grab_screenshots($l);
            }
        }
        if($l =~ /<a href="\#Files" title="Downloads Below">(.*?)<\/a>/){
            $project = $1;
            $project =~ s/\// /g;
            $project =~ s/&/and/g;
            $project =~ s/--/-/g;
            $project =~ s/\(//g;
            $project =~ s/\)//g;
            $project =~ s/ /_/g;
            print "\nprocessing $project -> $OUTPUT/$project\n";
            `mkdir -p $OUTPUT/$project`;
        }
        if($l =~ /<a name="Files"><\/a>Files/){
            $files = 1;
        }
        if($l =~ /<\/TABLE>/){
            $files = 0;
        }
        if($l =~ /<a href="(fms\/Download\.php.*?)".*?>(.*?)<span>/){
            print "downloading: $2\n";
            `wget -O $OUTPUT/$project/$2 http://nwvault.ign.com/$1`;
        }
        if($comments == 0){
            if($l =~ /<A href="\/View.php.*" >Next><\/A>/){
                $comments = 1;
                &get_next_page($url, 2);
            }
        }
        if($l =~ /-START OF IMAGE CODE-/){
            $images = 1;
        }
    }

    open(FILE, ">$OUTPUT/$project/index.html");
    print FILE $page;
    close FILE;
}

sub get_next_page{
    $u = shift;
    $num = shift;
    print "fetching comments page: $num\n";
    $u2 = $u . "\\&comment_page=$num";
    $p = `curl -s $u2`;
    open(FILE, ">$OUTPUT/$project/index$num.html");
    print FILE $p;
    close FILE;
    @lines2 = split /\n/, $p;
    foreach $l2 (@lines2){
        if($l2 =~ /<A href="\/View.php.*" >Next><\/A>/){
            &get_next_page($u, $num + 1);
        }
    }
}

sub grab_screenshots{
    $images = 0;
    $imgline = shift;
    @imgchunks = split /<p>/, $imgline;
    foreach $ic (@imgchunks){
        if($ic =~ /<a href="(fms\/Image.php\?id=(.*?))"/){
            `wget -O $OUTPUT/$project/$2.jpg http://nwvault.ign.com/$1`;
        }
        if($ic =~ /src="(http:\/\/vnmedia.ign.com\/nwvault.ign.com\/fms\/images\/.*?\/.*?\/(.*?))"/){
            `wget -O $OUTPUT/$project/$2 $1`;
        }
    }
}
#65
Posted 02 October 2012 - 02:26
Sometimes in the modules section, the hakpaks are linked in the description... sometimes those hakpaks are module-specific. Of course it's good as long as you are going to grab all the hakpaks as well, but there is still a need to "link up" those haks.
Good luck with this project.
I hope I'll be able to join soon, but probably I won't have time until February.
#66
Posted 02 October 2012 - 03:37
1) They're already included in the downloads section below, which I'm grabbing.
2) They're linked to another vault page that somebody else is in charge of archiving.
3) They're hosted on an external site, which we're not to download, only preserve the link to.
I believe my script handles all 3 cases.
#67
Posted 02 October 2012 - 04:49
What you could do is make your code write, in big red letters, that there is a link that needs to be changed (visible when checking the uploaded page), so it could be changed manually after all the content is already on nwn-ccc.
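(A minimal sketch of how that could be bolted on after the fact, assuming pages are saved as index.html per project as in acomputerdood's script; the regex and the marker text here are only illustrative:)
#!/usr/bin/perl
# Sketch: flag any leftover nwvault.ign.com links in a saved page so they can be fixed by hand later.
$file = shift;    # e.g. projects/Some_Project/index.html
open(IN, "<$file") or die "cannot open $file: $!";
@html = <IN>;
close IN;
foreach $line (@html){
    # prepend a loud red marker to every link that still points at the old Vault
    $line =~ s/(<a href="http:\/\/nwvault\.ign\.com[^"]*")/<span style="color:red;font-weight:bold">[FIX THIS LINK]<\/span>$1/g;
}
open(OUT, ">$file");
print OUT @html;
close OUT;
Run over each saved page after a batch finishes, the [FIX THIS LINK] marker then shows up in red when the archived page is viewed.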
#68
Posted 02 October 2012 - 04:56
It will be just as easy to identify incorrect links at that time as it is now.
Oh, and incidentally, my first run just finished. It took 348 minutes to download the 3865 entries in the "scripts" section.
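(Rough rate check: 348 minutes over 3865 entries is about 20,880 / 3865 ≈ 5.4 seconds per entry, averaged across the project page, its comment pages, and the file downloads.)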
Rolo, can you run a Perl script on the server you're archiving on?
Edited by acomputerdood, 02 October 2012 - 05:11.
#69
Posted 02 October 2012 - 06:56
@ Bannor: I agree, now that we have some legitimacy from Maximus. Concentrating on the Last In (Newest) projects first makes a lot of sense to me. (Note: the VPP has an upload limit of 100 MB. Larger files will need to be somehow sent to me (Dropbox, SkyDrive, Google Drive, YouSendIt... etc.) for FTP, or dropped into the Vault's "over 25 MB" FTP site. I can access them from there directly, just not after they've been moved to the Vault's permanent storage :-P )
@ Tarot: My take is to bundle up the Realms stuff into one 7z or rar :-P Personally, when I do the haks (soon, real soon) I'll be re-archiving anything that isn't either rar, zip (:-P ) or 7z. I'm only including zip because it's ubiquitous ;-/
@ Virusman: Save everything! Newest stuff first, though... But eventually, I hope to update the site (like Neverwinter Connections did) and incorporate the full Vault functionality.
@ ACD: I can run Perl (though I'm currently Perl-ignorant). Email me ( rolo@amethysttapestry.com )? Let's talk. Or call me after 7pm PST if you have a cell phone and are in the US (PM me for the number?).
My current thought for automating stuff is to get a CSV or Db of the metadata, create all the projects (flagged "pending upload") using Drupal's migration tools, and then present a filterable list of projects still needing files/screenies/comments... Make any sense?
In that context, preserving the data is still the first priority, while I structure the site.
@ Oda: I would love to take this opportunity to improve the Vault. But then, I really only know how to do tricky stuff in Drupal (and 3DS Max, but that's a different thread ;-). I am negotiating <pleading> for Db/files access.
@ Werelynx: The Original Page link is specifically there to point to the NwVault page the project was salvaged from. My reasoning is two-fold: first, to be sure proper credit is given to the original author, and second, to provide a quick comparison link to find/fix mistakes. If the NwVault does go down, that field may be disabled. I.e., it's for construction purposes at the moment.
<...one-armed paper-hanger bit>
Edited by Rolo Kipp, 02 October 2012 - 07:01.
#70
Posted 02 October 2012 - 09:58
Just an update in the middle of other things: Maximus has fixed my admin tools and I've approved most of the backlog on the Vault.
<...approval-happy>
#71
Posted 02 October 2012 - 10:10
#72
Posted 02 October 2012 - 10:15
I would like to help with backing up the Vault, though I might need an explanation as to how to upload the content to your site. I can save all the web pages and related files for now. You can sign me up for the first 10 pages (or 250 modules) on the module list. I'm not sure how the modules are ordered; I hope everyone sees the same list. I have the disk space, but my upload speed isn't very high.
Edited by Lovelamb, 02 October 2012 - 10:26.
#73
Posted 02 October 2012 - 10:26
Yeah, the Kipper said it was read-only... for a while
Maxy-dear fixed the old man's wagon and now it's working again. For another little while.
Ack! Sunshine!
Toodles!
#74
Posted 03 October 2012 - 02:16
I've just about completed a set of scripts which do almost the same thing yours does. The major difference is the creation of a key-value metadata file along with the downloads. The idea there was to make it easy to get that data into a new DB, but that could also be done with tools on the saved raw HTML once it's all downloaded.
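(Meaglyn has not posted his format, so purely as an illustration, a per-project key-value file of that sort might look something like the sketch below; every field name and value here is a guess at the kind of thing worth keeping, not his actual output.)
# projects/Some_Project/metadata.txt - hypothetical layout
title=Some Project
author=SomeAuthor
category=Scripts
vault_id=3800
original_url=http://nwvault.ign.com/View.php?view=Scripts.Detail&id=3800
submitted=2004-06-12
updated=2007-01-03
description_file=index.html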
Cheers,
Meaglyn
#75
Posted 03 October 2012 - 04:56
@ meaglyn: But that is what I want! :-P The key-value metadata, that is... preferably in CSV or Excel format.
Would you be willing to share with ACD and incorporate that? He's sent me one updated version; why not another =)
Getting the metadata into an easily imported format would make things vastly easier. I'd then use that CSV file to generate the projects. Then all I need to do is link up the files/screenies and comments.
Actually, comments could be collected in a keyed file, also. Drupal gives each comment its own node and links the nodes to the project. So I'd just need a field for each comment with the unique identifier for the project... I think... :-P
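(A minimal sketch of what those two import files might look like, assuming one CSV of project metadata and one of comments keyed on the project's old Vault id; none of these column names or values are settled, they are only illustrations:)
vault_id,title,author,category,original_url
3800,Some Project,SomeAuthor,Scripts,http://nwvault.ign.com/View.php?view=Scripts.Detail&id=3800

vault_id,comment_no,commenter,posted,comment_text
3800,1,SomeUser,2004-07-01,"Works great in my module..."
The shared vault_id column is the "unique identifier for the project" mentioned above, so each comment row can be matched back to its project row during the import.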
<...with both hands>




