OscarsX

OscarsX: An XML database for the Academy Awards®

The site has several sections: Browsing the Database, Oscars Trivia, Academy Trivia, Transforming Database Structure, RDF/semweb, and Tech Notes.

Caveat Emptor

The contents of the site are my own. I make no claim that the information here is correct or authoritative. If you want authoritative, you should refer to the Academy's own website, and in particular its online database (see below).

This site and its Oscars backend are still a work in progress. I continue to uncover the odd typo in the database and am still in the process of reworking the structure and content of several of the award categories. It's very close though (although I'd hate to be asked just how "close" is "close").

You might want to note the version number on the top-level <oscars> element. It's currently 3.68. I'll be updating the version number as I fix mistakes and typos. Check back from time to time if you want to stay current.

If you find bugs or other problems, you can let me know at howardk @AT@ fatdog.com.

Browsing the Database

Let's look around the database a bit. Queries in the following section explore the structure of the Oscars DB and its contents. Let's look at some of the results of the 79th Annual Awards from 2006.


Return all the award winners for 2006.


Return all the nominees for 2006 (i.e., both winners and losers).


Who won the major acting awards in 2006?


Who won the supporting awards in 2006?


Show all the nominees for Best Picture.


Show who won.


What awards was "Pan's Labyrinth" nominated for?


Which ones did it win?


Which award(s) had the most nominees in 2006, and how many nominees were there?


Which awards were won by "Marie Antoinette" in 2006?


How many remakes of "Marie Antoinette" have there been?


So much for 2006. Let's range a bit further afield ...

Return everything in the database. This particular query uses a call to XQuery's doc() function. As shown above, the Mark Logic Content Server also lets us use /oscars and even // to root a query. I use all three syntactic forms somewhat freely.



How many XML nodes are in the database?


More information on the structure of the DB. What are the award categories by name?


More information on the structure and contents of the DB. What are the award categories by count?


Another way of formatting the same information.


Yet more information on the structure and contents of the DB.

(This one shows a few mistakes in my structure that make me laugh. If you look hard, you can spot them too. I'll leave them in for the moment as a salutary example of why it's important, if your database is in flux, to periodically run queries like this to check for surprises. You'll surmise of course that I don't have a schema to validate against, or if I did, haven't bothered to do so. This is the next best thing.)
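
A check of this kind can be sketched in a few lines of standard XQuery. This is only an illustration on a made-up fragment, not the site's own query: the citation wrapper element and the deliberately misspelled wonn element are assumptions, not the contents of the real database. The query lists every distinct element name with the number of times it occurs, so a stray name stands out by its low count.

```xquery
xquery version "1.0";

(: Made-up sample data. The <citation> wrapper is an assumption, and
   <wonn/> is a typo planted on purpose so the check has something to find. :)
let $oscars :=
  <oscars version="0.0">
    <cinematography year="1939" subcat="black and white">
      <won><citation><person>Person A</person></citation></won>
      <lost><citation><person>Person B</person></citation></lost>
    </cinematography>
    <musicSong year="1939">
      <won><citation><person>Person C</person></citation></won>
      <wonn/>
    </musicSong>
  </oscars>

(: Collect the name of every element, then report each distinct name
   with the number of times it occurs. :)
let $names := for $e in $oscars/descendant-or-self::* return name($e)
for $name in distinct-values($names)
let $count := count($names[. = $name])
order by $name
return concat($name, ": ", $count)
```

In this sample the planted wonn element appears in the listing with a count of 1, right next to the legitimate won.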



What kinds of Oscar "things" are found in the DB?


Show all the sound-related nominations since 2000.


What is the current version of the Oscars document?

(The latest is 3.67.)



As a double-check that the XML encoding is set correctly ("UTF-8" in this document), show any awards with a nominee named "Rivera".

You should see <person>José Rivera</person>, with an acute accent over the "e" in "José".


Oscars Trivia

One of the purposes of this project is to show how to use XQuery to pose and answer Oscar-type trivia questions, the kind that pop up in the press every year a few months before the Academy Awards ceremony. The Academy of Motion Picture Arts and Sciences has its own list (see below), but here are just a few quick examples I've come up with on my own. There's no hard and fast rule about what constitutes a trivia question; sometimes it's just a question of degree.

If you have interesting trivia questions of your own and the XQuery code to solve them (or not), send them along. If they're novel or illustrate a particularly interesting facet of XQuery, I'll publish them here. If you have suggestions for improving any solutions already shown, those are welcome too.


Which 2005 winners were first-time nominees?


Which 2005 winners were first-time winners?

The previous query states that the winners have never been nominated before. This one states they've been nominated but never won.



Songs With the Same Title as Their Movie

Which nominated songs had the same title as the movie they were nominated for?



Songs With the Same Title as Their Movie -- How Many?

How many were there?



Songs With the Same Title as Their Movie -- That Won

Which winning songs had the same title as the movie they were nominated for?



Songs With the Same Title as Their Movie, Where Both Won

Which winning songs had the same title as the movie they were nominated for, where the movie also won for Best Picture? Results show both the winning song and the winning picture for any matches.



John Williams Win-to-Loss Ratio

How many awards has John Williams won vs. how many has he lost?


Academy Trivia

The Academy of Motion Picture Arts and Sciences (AMPAS) website has an excellent list of trivia questions, along with what must be considered authoritative solutions. You can find them by visiting the Academy's Awards Database website; follow the link titled "Browse Statistics" in the left-hand column. The database is an invaluable tool for checking on anything that's Oscar-related. The following are just some of the trivia-type questions that are posted on the site, and my attempts at providing a suitable solution in XQuery. I'll be providing more solutions as time goes on.


Films Receiving 10 or More Nominations

Here's a first cut at attempting to answer this question. It doesn't provide quite the correct answer, though. The Academy database says that "All About Eve" and "Titanic" tied for first place with 14 nominations each, while this XQuery says (Don't believe me? Try it for yourself!) that "A Star is Born" received 17 nominations, "Titanic" received 16, and "All About Eve" is in fifth place with 14 nominations. What's going on?



Films Receiving 10 or More Nominations (version 2)

The prior query is over-counting because of the remakes problem. The earlier query on remakes of "Marie Antoinette" gave a first look at this. "A Star is Born" for example has been remade on three separate occasions, and the preceding XQuery doesn't discriminate between them, so the nomination counts shown are high.

One way to correct for this is to pose the same query again, and then to look at each of the movies that survive this first step, further partitioning each remake by year and again eliminating any with fewer than the desired number of nominations.

(This query may take up to a minute or so to execute, btw. It should run somewhat faster than that, at least as fast as it does on my desktop machine, but the server's configuration is a bit awry at the moment and needs some TLC.)



Films Receiving 5 or More Competitive Awards

This is almost identical to the prior query, except that the number is different, and while that one looked at nominated films, this one looks at winners. Both queries are great candidates for parameterization (see below).



Persons Receiving 5 or More Acting Nominations

This query highlights a current deficiency in my database; it produces a slightly different list than the one produced by AMPAS. Mine shows Bette Davis receiving 11 nominations for example, while AMPAS says 10.

The Academy's number is correct (which is not surprising, since they're nothing if not sticklers for accuracy). Mine is high because I'm including a nomination for Best Actress in 1934 ("Of Human Bondage"). The Academy notes that "THIS IS NOT AN OFFICIAL NOMINATION", while my DB doesn't take this into account (and should). There are a number of such notes in the AMPAS database (all nominations from 1929 for example are officially marked as "UNOFFICIAL").

A few solutions present themselves:
  1. add an attribute @status="UNOFFICIAL" to the 1934 Best Actress citation as well as the others, and then test for the absence of this marker in any query that might inadvertently pick up this entry, as in ... and not( ./@status="UNOFFICIAL" ) ...
  2. simply remove all such entries from the database
I'm undecided for the moment. And then there's the issue of Special or Honorary awards, some of which the Academy counts and I don't. But I'll leave that for the moment. In any event, here's the query:
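
To illustrate option 1 above, here is a minimal sketch on a made-up fragment. The element names actress, citation and film, and the placement of the status attribute on citation, are assumptions for illustration only; the real database may be structured differently. The sketch counts one person's citations with and without the not( @status = "UNOFFICIAL" ) test.

```xquery
xquery version "1.0";

(: Made-up fragment; element names and layout are assumptions. :)
let $oscars :=
  <oscars version="0.0">
    <actress year="1934">
      <lost>
        <citation status="UNOFFICIAL">
          <person>Bette Davis</person>
          <film>Of Human Bondage</film>
        </citation>
      </lost>
    </actress>
    <actress year="1935">
      <won>
        <citation>
          <person>Bette Davis</person>
          <film>Dangerous</film>
        </citation>
      </won>
    </actress>
  </oscars>

(: Every citation naming the person, then only those not marked UNOFFICIAL.
   not(@status = "UNOFFICIAL") is also true when the attribute is absent. :)
let $all      := $oscars//citation[person = "Bette Davis"]
let $official := $all[not(@status = "UNOFFICIAL")]
return <counts all="{count($all)}" official="{count($official)}"/>
```

On this fragment the result is <counts all="2" official="1"/>: the unmarked citation survives the filter and the marked one is dropped.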



Persons Receiving N or More Acting Nominations (the parameterized version)

Let's take a bit of a technical digression for those interested in XQuery. We can parameterize the preceding query by implementing it as an XQuery function. In this case, we'll build a function called nominations() that takes a single integer argument to indicate how many nominations we're interested in looking at.

NOTE: Mark Logic's function declaration syntax differs slightly from that of the current XQuery specification. See "Mark Logic function syntax" under Tech Notes.
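
A minimal sketch of such a function, written in the syntax of the current XQuery specification rather than Mark Logic's, might look like the following. The sample data, the actor, actress and citation element names, and the nominee result element are all assumptions made up for this example; only the function name nominations and its single integer argument come from the description above.

```xquery
xquery version "1.0";

(: Made-up sample data with placeholder names; the structure is an assumption. :)
declare variable $oscars :=
  <oscars version="0.0">
    <actress year="2001">
      <won><citation><person>Person A</person></citation></won>
      <lost><citation><person>Person B</person></citation></lost>
    </actress>
    <actor year="2001">
      <lost><citation><person>Person C</person></citation></lost>
    </actor>
    <actress year="2002">
      <won><citation><person>Person B</person></citation></won>
      <lost><citation><person>Person A</person></citation></lost>
    </actress>
    <actress year="2003">
      <won><citation><person>Person A</person></citation></won>
    </actress>
    <cinematography year="2003">
      <won><citation><person>Person C</person></citation></won>
    </cinematography>
  </oscars>;

(: Return everyone with at least $n acting citations, most-cited first.
   Only the two lead acting categories are counted in this sketch. :)
declare function local:nominations($n as xs:integer) as element(nominee)*
{
  for $name in distinct-values($oscars/(actor | actress)//person)
  let $count := count($oscars/(actor | actress)//person[. = $name])
  where $count ge $n
  order by $count descending, $name
  return <nominee name="{$name}" count="{$count}"/>
};

local:nominations(2)
```

With this sample, calling local:nominations(2) returns Person A (3 citations) and Person B (2), and leaves out Person C, whose second citation is not in an acting category.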



Persons Nominated in Two Acting Categories in the Same Year


Best Actor/Actress Winners from the Same Film

Title says it all, I think.



Best Picture and Directing Winners NOT from Same Film

This one was fun. Note there were two awards for Directing in 1928 (formally "1927/1928" in Academy parlance): one for Comedy Picture and one for Dramatic Picture.



Films Nominated for Best Picture Not Nominated for Directing

Poor directors!


Transforming Database Structure

This section looks at just a few of the XSLT-like transformations you can do on the structure of the "oscars.xml" file, in case you don't like the way I've structured it here or you just require a different format for more efficient querying. I may add more examples as I have time.


This transformation regroups award nominations by year. In the process I've added a count attribute to the nominations element and removed the year attribute from the individual nominations, since it's been hoisted upwards into nominations.

Finally, I've amalgamated the two won and lost elements into a single won attribute on the award name, that takes a boolean value of "true" or "false".
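
The regrouping described above can be sketched roughly as follows. This is a guess at the shape of the transformation, not the actual code: the citation wrapper and the sample data are made up, and the exact output format of the real transformation may differ.

```xquery
xquery version "1.0";

(: Made-up sample data; the <citation> wrapper is an assumption. :)
let $oscars :=
  <oscars version="0.0">
    <cinematography year="1939" subcat="black and white">
      <won><citation><person>Person A</person></citation></won>
      <lost><citation><person>Person B</person></citation></lost>
    </cinematography>
    <musicSong year="1939">
      <won><citation><person>Person C</person></citation></won>
    </musicSong>
    <musicSong year="1940">
      <lost><citation><person>Person D</person></citation></lost>
    </musicSong>
  </oscars>

(: One <nominations> element per year, carrying the year and a count. :)
for $year in distinct-values($oscars/*/@year)
let $awards    := $oscars/*[@year = $year]
let $citations := $awards/(won | lost)/citation
order by $year
return
  <nominations year="{$year}" count="{count($citations)}">{
    for $c in $citations
    let $award := $c/../..
    return
      (: Rebuild the award element without @year, folding the won/lost
         wrapper into a boolean won attribute. :)
      element { node-name($award) } {
        $award/@*[name() ne "year"],
        attribute won { exists($c/parent::won) },
        $c/node()
      }
  }</nominations>
```

Each citation becomes its own award element here, so the count attribute is simply the number of children in the year's group.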



Here's a small snippet of code, part of a larger transformation, that transforms all musicSong elements into song elements and musicScore elements into score. The original "oscars.xml" document went through a number of such transformations while I was working out the final names of the award categories I wanted.

In this example, the default clause on typeswitch returns the empty sequence for any award other than the above two. This is handy if you just want to view the results of a small, localized transformation such as this one. If you're doing a full and final award-name transformation, then it's useful to have default return something like <ILLEGAL-AWARD-NAME/> or such to pick up errors during development.
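
A minimal sketch of a typeswitch rename of this kind, in standard XQuery syntax and on a made-up fragment, is shown below. The function name local:rename and the sample data are assumptions for illustration; the musicSong, musicScore, song and score names come from the description above.

```xquery
xquery version "1.0";

(: Rename musicSong to song and musicScore to score, keeping attributes
   and content. Any other award falls through to default and is dropped. :)
declare function local:rename($award as element()) as element()?
{
  typeswitch ($award)
    case element(musicSong)  return <song>{ $award/@*, $award/node() }</song>
    case element(musicScore) return <score>{ $award/@*, $award/node() }</score>
    default return ()
};

(: Made-up sample data. :)
let $oscars :=
  <oscars version="0.0">
    <musicSong year="2006"><won/><lost/></musicSong>
    <musicScore year="2006"><won/><lost/></musicScore>
    <cinematography year="2006"><won/><lost/></cinematography>
  </oscars>
for $award in $oscars/*
return local:rename($award)
```

On this sample the result is a song element and a score element, each with year="2006" and its won and lost children, and nothing for cinematography. Replacing the default branch with <ILLEGAL-AWARD-NAME/> would make the third award visible as an error marker instead.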



A way of "chunkifying" the database, in this case partitioning the awards into groups of n nominations each (in this particular example 10), without otherwise changing the general structure of the data. This capability can sometimes be useful, and I wanted to figure out how to do it. You can of course set the group size to whatever you want.
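
One way to sketch this kind of chunking in standard XQuery is with subsequence(). The group wrapper element, its n attribute, and the sample data are assumptions made up for this example; the real script may partition at a different level or name things differently.

```xquery
xquery version "1.0";

(: Group size. The example in the text uses 10; 2 keeps this sample short. :)
declare variable $n as xs:integer := 2;

(: Made-up sample: five empty award elements. :)
declare variable $oscars :=
  <oscars version="0.0">
    <actor year="2001"/>
    <actress year="2001"/>
    <directing year="2001"/>
    <actor year="2002"/>
    <actress year="2002"/>
  </oscars>;

let $awards := $oscars/*
(: Number of groups, rounding up so a final partial group is kept. :)
let $groups := (count($awards) + $n - 1) idiv $n
return
  <oscars>{
    $oscars/@*,
    for $g in 1 to $groups
    return
      <group n="{$g}">{ subsequence($awards, ($g - 1) * $n + 1, $n) }</group>
  }</oscars>
```

With five awards and a group size of 2, this produces three groups holding two, two, and one award, in the original order.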

RDF/semweb

For semweb-heads, a script that emits the Oscars database in RDF/XML.


Under construction.

Tech Notes

Timing queries
Some of the queries illustrated here are followed by a comma and the expression xdmp:query-meters()//*:elapsed-time/text(). This is a Mark Logic built-in function that returns the elapsed time required to return a query result. If you like exploring query performance, you can append this to the end of any query you're interested in timing. Clock starts ticking when the query begins.

Limiting xdmp: functionality
xdmp:query-meters() and xdmp:estimate() (a replacement for fn:count() that can improve on that function's performance in certain situations) are the only two Mark Logic functions I'm exposing through the query interface, since a number of the others can let naughty people do nasty things to the database, and that's something I'd prefer not to have happen.

Mark Logic function syntax
The Mark Logic syntax for declaring functions in the query header differs slightly from that of the current XQuery specification. The Mark Logic implementation is based on an older, May 2003 Working Draft.

  • Mark Logic's define function is now called declare function
  • Function declarations in the current specification must be followed by a semi-colon (";") (not required in Mark Logic), and
  • Both the function declaration and invocation for user-defined functions must be preceded, in the current language specification, by the default prefix for user-defined functions, local: (or a prefix that's been equated to that namespace in your query.) Mark Logic doesn't require this.
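
The three differences can be seen side by side in a small sketch. The function itself (winners, with its two parameters) and the sample data are made up for illustration, and the older Mark Logic form is reconstructed only from the points listed above, so treat it as an approximation rather than verified Mark Logic syntax.

```xquery
xquery version "1.0";

(: Current-specification form: "declare function", a local: prefix on the
   name, and a semicolon after the closing brace. :)
declare function local:winners($oscars as element(oscars), $year as xs:string)
  as element(won)*
{
  $oscars/*[@year = $year]/won
};

(: The same function in the older Mark Logic form, reconstructed from the
   three differences listed above and shown only as a comment:

   define function winners($oscars as element(oscars), $year as xs:string)
     as element(won)*
   {
     $oscars/*[@year = $year]/won
   }
:)

(: Made-up sample data. :)
let $oscars :=
  <oscars version="0.0">
    <musicSong year="2006"><won/><lost/></musicSong>
    <musicSong year="2005"><won/><lost/></musicSong>
  </oscars>
return local:winners($oscars, "2006")
```

The invocation carries the local: prefix too; under the older syntax the call would be written without it.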

Throttled output
I've implemented a so-called "throttler" on the output of the query processor that limits the number of XML nodes returned by a query. That limit is currently set just high enough for you to grab a full copy of the entire database. If your query exceeds that number, nodes will be removed from the query result to keep it within the allowable limit, and the last emitted top-level node from which subnodes have been removed will be so notated. I'm not sure how well this will work in practice, but I've been curious to try it out.

In the following, for example, the contents of the lost element have been removed (since in this hypothetical example they drove the node count over the allowed limit), and the containing cinematography element has been decorated with a "limit exceeded" warning:
   <cinematography xmlns:oscar-results="http://www.fatdog.com/oscar-results"
                   oscar-results:warning="LIMIT [125000 item(s)] exceeded, 7 node(s) removed"
                   year="1939"
                   subcat="black and white">
       <lost/>
   </cinematography>
A counting bug
There's a minor bug in my throttling algorithm, one that's eluding me for the moment, that produces a slightly incorrect node count. You'd think a grown, professional programmer would be able to count nodes in an XML document, but this seems to be a challenge for me. I'll take another stab at fixing it when I'm less frustrated.

Pretty printing
My own query engine, XQEngine, handles the throttling as part of a post-processing phase once Mark Logic has determined the result set. As a by-product, it pretty-prints the output. XQEngine is rather dated and not very well maintained, I'm afraid, but at least it does this quite well (better than I can count at any rate).



Howard Katz, Fatdog Software Inc.
Tue Jul 08 19:13:02 PDT 2008
Last updated 29March2007