bryguy ([info]bryguypgh) wrote,
@ 2008-04-29 11:52:00
very large dataset
I've got a problem for work and I thought perhaps one of my readers out there might have some insight.

I need to come up with a very large dataset of publicly available data that is related to some topic high schoolers might find intriguing (as much as that's possible :) ). My first thought was the freedb project, which is a free implementation of the cddb database that Gracenote stole from the users a few years back, but I was disappointed to find that the whole dataset clocks in at under 600 megabytes. I need something that's at least a terabyte and ideally much bigger. It has to be free (not just as in cost, but as in freedom, i.e., no copyright/trademark restrictions), it can't raise any privacy concerns, and it has to be PG-13.

If you have any suggestions and/or pointers I will be in your debt.





[info]sui66iy
2008-04-29 04:10 pm UTC (link)
Check out this article:

http://www.readwriteweb.com/archives/where_to_find_open_data_on_the.php

Incidentally, if you actually need to work with the data, fetching terabytes off someone's servers is never a lot of fun... why the size requirement?



[info]bryguypgh
2008-04-29 05:56 pm UTC (link)
That looks like a great resource, thanks.

The long and short of it is that I'm trying to create an exercise where map-reduce is better suited to the data than a traditional rdbms, and generally that means a whole lot of data. I could generate fake data or mine some of our logs, but it's meant to be interesting to teenagers so I want to make it captivating.
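For anyone who hasn't seen the pattern, here is a rough single-process sketch of the map-reduce idea, assuming some made-up CD-style records (the field names, sample data, and function names are all hypothetical, and a real job would run the same phases across many machines rather than in one Python process):

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit (key, value) pairs for a single record.

    Here the key is the record's genre and the value is a count of 1.
    """
    yield (record["genre"], 1)


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse every value for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle, and reduce in sequence over an iterable of records."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample records, invented purely for illustration.
SAMPLE = [
    {"title": "A", "genre": "rock"},
    {"title": "B", "genre": "jazz"},
    {"title": "C", "genre": "rock"},
]

counts = map_reduce(SAMPLE)
print(counts)
```

The point of the exercise is that each map call looks at one record in isolation, so the work can be split across machines once the data no longer fits comfortably in a single database.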



[info]sui66iy
2008-04-29 06:20 pm UTC (link)
What about image processing? It's (relatively) easy to create very large rasters that you could break into overlapping tiles or something. For instance, you could harvest photos, make a giant mosaic, and then do an image recognition task or something. Also, it's not very RDBMSy ;-)
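To make the tiling idea concrete, here is a minimal sketch that only computes the bounding boxes of overlapping tiles for a raster of a given size. The function name and parameters are assumptions chosen for illustration; it does not read or write any image data:

```python
def overlapping_tiles(width, height, tile, overlap):
    """Return (left, top, right, bottom) boxes covering a width x height raster.

    Neighbouring tiles share `overlap` pixels; tiles on the right and bottom
    edges are clipped to the raster bounds.
    """
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    step = tile - overlap  # distance between the origins of adjacent tiles
    boxes = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            right = min(left + tile, width)
            bottom = min(top + tile, height)
            boxes.append((left, top, right, bottom))
    return boxes


# A 10x10 raster cut into 6x6 tiles that overlap by 2 pixels.
boxes = overlapping_tiles(10, 10, tile=6, overlap=2)
for box in boxes:
    print(box)
```

Each box could then be handed to a separate worker, which is what makes this kind of task a natural fit for a map step.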

Another dataset not mentioned in the article (and I have no idea how big it is) is the Netflix Challenge dataset. Also, I think Google has an n-gram dataset --- not sure how large it is (and probably a bit technical).


Free?
(Anonymous)
2008-04-29 07:09 pm UTC (link)
If you really need a "free" database with no copyright restrictions, then you shouldn't be using freedb anyway. It is copyrighted, if you hadn't noticed. It is released under the GNU license (even though the GNU license cannot be applied to databases, but that doesn't seem to bother the freedb guys). The GNU license does have restrictions for use. But maybe you didn't really mean "no copyright restrictions"?

If you want a large free database, try the 2000 US census data at:

http://www2.census.gov/census_2000/datasets/

There are a lot of files, but uncompressed it may all approach a terabyte, or at the least many gigabytes.

As for Gracenote, how did they steal the database from users? As far as I've seen, no user has ever paid to access the Gracenote database. Only developers have to pay for access, and then only commercial ones. But hey, bashing is fun, eh?


Re: Free?
[info]bryguypgh
2008-04-29 07:48 pm UTC (link)
Thanks for the census pointer, anonymous poster.

The cddb was created with community input, including my labor (I submitted track information for a few local compilation CDs). Gracenote took the list proprietary, and free music-playing software distributed with Linux was no longer able to use their list. To top it off, Gracenote attempted to sue users of other projects that did the same thing for patent infringement. Whether or not they had a case (they didn't), they were bad corporate citizens. Your scorn is not appreciated; believe me, I wish there were no behavior to bash.

There's some discussion of the history here:
http://www.news.com/2100-1023-257529.html
http://slashdot.org/article.pl?sid=06/11/13/1312214&tid=141

About the fees you had to pay to distribute applications that used cddb circa 2001 when they first walled off their database:
http://linuxplanet.com/linuxplanet/opinions/3127/1/

If you really want to quibble over whether GNU licenses are free, we can do it elsewhere, but they are certainly free in the sense that I mean. Students can reuse the data and share it with others without paying someone a license fee for the privilege. They could even charge for the data if they like; the only thing they can't do is what Gracenote did, i.e., steal someone else's work, call it their own, and not let others use it.


Re: Free?
(Anonymous)
2008-04-29 09:15 pm UTC (link)
Actually, I was trying to make a point, because you wanted data that you could use without restriction. If you're going to bash Gracenote for not being free and praise freedb for being free, at least understand what you're talking about - freedb data, by their copyright, does have restrictions. However, my point is academic. You can actually use freedb however you like, because you can't actually copyright databases of public information. Both Gracenote and freedb, etc., have no claim over the data.

As for the rest, I certainly don't look to slashdot for facts. The actual facts are that Gracenote only ever tried to sue (and did) two entities, Roxio and Musicmatch. The former case was settled and Roxio licensed Gracenote's service. In the latter, the court ruled that Musicmatch was in breach of contract with Gracenote. Other aspects of the case are murky, but ended without resolution when Yahoo, a Gracenote licensee, bought Musicmatch.

Free music players can and do use Gracenote's service. There is a Linux/Unix SDK for accessing Gracenote, and it's free to use for noncommercial (free) players. Xmcd, the Unix player that started it all, is one of the ones that uses it.

In case you are wondering, yes, I was a Gracenote employee, so I am admittedly biased. But I do have actual facts, as opposed to those who formulated the popular mythology about the evils of Gracenote. There were a couple of jerks using strongarm business tactics (they're history), and that's basically what set off the freebie crowd's anger - about a year after Gracenote (then CDDB) went commercial, and more than a year after the database was removed for full public download (well before CDDB went "commercial"). Nobody complained then... That anger fueled a lot of hate, and I never cease to be amazed at some of the antics these haters have perpetrated because Gracenote is so evil, heedless of the fact that their acts were far worse than anything Gracenote was accused of doing. Spreading untruths about Gracenote (mixed in with truths to add credibility, of course) is the least of these acts.

In any case, the data you personally entered is and has always been available for download. Gracenote started with a database seeded from the original public database, but so did freedb (and dozens of other "free" and "commercial" databases). If you didn't want anyone using the data you personally entered commercially, you shouldn't have submitted it. It's getting used endlessly in countless commercial ways, and always has been. For example, Nero, the most popular media player/ripper out there, is commercial, and it uses freedb. It's making money off of the data you typed in personally. Yadb (jriver) also. And the list goes on. The quid pro quo has always been that you get to download data typed in by others, and occasionally maybe enter data yourself when it's not found. I would hazard a guess that you've downloaded much more information for free from Gracenote and freedb than you have actually entered yourself. That quid pro quo has never precluded commercial use of whatever data you might enter yourself.

But as I said, it's popular to bag on Gracenote because of the "ungraceful" (no pun intended) way in which they transitioned to a commercial organization. Far from perfect, to say the least. But if you look at where it is now and compare it to others, it's not that different. So much of the garbage out there is really just that: garbage, fueled by misinformation spread through anger and accepted as truth by those eager to believe it all (though not all of that anger is necessarily unjustified).


Re: Free?
[info]bryguypgh
2008-04-29 09:29 pm UTC (link)
I'll respond fully later as I don't have time at the moment, but I am dying to know: how did you stumble on my little corner of the web here?

Actually let me try to boil down my reply. Am I just as free to take information from cddb and use it to publish my own database as Gracenote was to take the information I submitted and publish it in their database?


Re: Free?
(Anonymous)
2008-04-29 11:04 pm UTC (link)
Disclaimer: IANAL... My layman's understanding of copyright law is that for a work to be copyrightable it has to contain meaningful amounts of creative content. A collection of names and titles from CD covers, much like the phone book, *probably* does not constitute a copyrightable work. However, editorial enhancement of such content is likely to be copyrightable, if it can be claimed to be original work above and beyond what's on the CD cover itself. Gracenote data, at this point, is heavily edited in many areas and contains a lot of things that are not part of the original work (such as extremely deep genre assignments, which require educated editorial decisions). The titles and such are also often edited in ways that differ from the original CD. This probably means that much of Gracenote's data is well-protected by copyright at this point. Sorting out the parts that are and aren't would be a daunting task.

Freedb and like services tend to be totally automated in their collection of data. That makes any claim of copyrightability very weak, because the process of amassing the data does not involve originality or creativity. Laying the GNU license on it, or any other license for that matter, is fairly meaningless, as it's largely public data.

However, data and service are different things. The freedb folks, while not really at liberty to claim any copyright on their data, could limit access to their service if they felt so inclined. They have no obligation to allow any particular person access to the database, nor do they have an obligation to allow anyone unlimited access. More than likely they do have some sort of daily access limit to keep load reasonable, at the very least, like the old CDDB always did (it's largely the same code base).

Gracenote as well is not obligated to allow unlimited access to their service, nor does it do so. Users are allowed a generous daily access limit to allow for reasonable use of the service. They could at need block anyone completely if they felt so inclined, though that rarely if ever happens to my knowledge. So nobody is entitled to the data of either party, and the Gracenote EULA may preclude certain uses of data downloaded from their service (not sure, but would presumably be legal for them to do so). But violating that EULA in itself would not necessarily constitute copyright violation (as far as I know).

PS: Check out google alerts. You can set one up for any number of topics of interest to you.


Re: Free?
[info]bryguypgh
2008-04-30 11:04 am UTC (link)
So the answer is "no" and I stand by my original statement.

I know you think it's "garbage", and that we "haters" perpetrated "far worse" acts than Gracenote did, but Gracenote walled off a piece of the commons and tried to call it theirs under threat of lawsuit. I don't think you've succeeded at all in trying to clear their name; I actually feel worse about Gracenote now.

I appreciate the fact that you eventually came clean about your connection to the company; I was wondering after that first post. I understand that you don't want the name Gracenote to be the first example everyone uses to describe the process of making public data proprietary through abuse of the legal system, but your PR spin is too little, too late in my opinion.


