buildhigh.com

Of me...
Name : Jon
Profession : Programmer

September 02, 2003 - Data Mining. Because someone has to.


Today on Slashdot I read a post which got me thinking. It was a reply to a comment in the "Office 2003 to incorporate DRM" story. A poster bemoaned the fact that DRM could really screw with his ability to carry out data mining on spreadsheets, which is part of his job. A wanker replied that data belongs in databases.

The wanker is right. Data belongs in databases. My guess is el wankero is not working in anything close to the real world, or doesn't have a lot of programming experience, because one of the basic realities of working with databases is that the data has to get into the database. It doesn't just grow there.

In my past life as a Bloomberg drone I did a lot of data mining. If I was *lucky* I got a spreadsheet to work with. Usually I wasn't lucky and ended up with a really jumbled text file. Usually the same application would rely on 15 different data sources, each coming in a subtly different format.
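To give a feel for the "subtly different format" problem, here is a minimal Python sketch. It assumes the sources differ only in which delimiter they use (comma, semicolon, pipe, or tab); the delimiter set and the function name `split_record` are made up for illustration, and the real feeds were a lot messier than this.

```python
import re

# Hypothetical delimiter set: comma, semicolon, pipe, or tab,
# with any surrounding whitespace swallowed along with it.
_DELIMITERS = re.compile(r"\s*[,;|\t]\s*")


def split_record(line):
    """Split one raw line into trimmed fields, whichever delimiter it uses."""
    return _DELIMITERS.split(line.strip())


# Three made-up lines carrying the same record in three styles.
samples = ["ON, ABCD123 ,2003", "ON|ABCD123|2003", "ON;\tABCD123;2003"]
records = [split_record(s) for s in samples]
```

One rule set handles all three sample lines, though lines like the last one (two delimiters in a row) produce empty fields that still need cleaning up. Multiply that by 15 sources and the job gets old fast.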

Back then we didn't use Perl. Why, I can't begin to imagine. Using Perl for something like text parsing is like using a tire on a car. It just fits. No, we used VB. Not even a modern version of VB, one that supported the InStr function. That'd be too easy. We had to use the one where you had to look at each individual char in a string to do your work. How quaint.
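For anyone who never had the pleasure, the char-by-char approach looks roughly like the sketch below. It's written in Python rather than old VB, `find_by_chars` is a name invented for this example, and the sample line is made up. The built-in `str.find` on the last line plays the role InStr would have played.

```python
def find_by_chars(haystack, needle):
    """Return the 0-based index of the first match of needle, or -1."""
    if not needle:
        return 0
    last_start = len(haystack) - len(needle)
    # Try every possible starting position in turn.
    for start in range(last_start + 1):
        matched = True
        # Compare one character at a time, bailing on the first mismatch.
        for offset in range(len(needle)):
            if haystack[start + offset] != needle[offset]:
                matched = False
                break
        if matched:
            return start
    return -1


# Hypothetical sample line, not taken from any real data feed.
line = "PLATE=ABCD123;PROVINCE=ON"
assert find_by_chars(line, "PROVINCE") == line.find("PROVINCE")
```

Fifteen lines of loop to do what one library call does. Note that VB's InStr counts positions from 1 and returns 0 for "not found", whereas this sketch follows Python's 0-based, -1 convention.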

Data mining is not a fun job. When you think about writing programs and scripts to pull data out of mangled documents grabbed from the Ontario Ministry of Transportation and from Tesoro, fun doesn't jump to mind. But it seemed that it was a valuable job at the 'berg, and a lot of governmental agencies hire people to do just that.

I don't think the DRM in Office will alter data mining. If people want their data available, they're not going to protect it. And even if they do, it's not like Microsoft stuff is secure, so I'm sure there will be some sort of hack/crack.


When is a filesystem not a filesystem? When it's WinFS.

WinFS, if you haven't heard, is the all-singing, all-dancing filesystem that's going to be included in the next version of Windows. It will allow you to search files based on content, comments, etc. You know, stuff that OS/2 had a decade ago.

But I digress. We know what happened to OS/2.

So it would seem that the advanced searching capabilities will finally make it into a version of Windows. Or will they? According to a lot of articles, WinFS is not a filesystem at all. It's just a really souped-up version of that "Fast Find/Indexing" crap that's existed since Windows 9x. The same stuff that makes your computer run like shit while it's doing its magic in the background. Originally it was going to be a database-based filesystem, but apparently that proved too challenging. Or maybe it wasn't. There's more disinformation than information concerning Microsoft projects most of the time.

I'm not going to guess at how extensive the whole search is going to be. I'm not sure how extensively it will be integrated into the old APIs (will the disk-writing API calls be updated to support indexing in WinFS?). And who knows what the virus writers will do...



