LARGE DISCLAIMER: This is just a short summary of what
I heard during the talk. I took a few notes but did not make a
full verbatim transcript of everything that was said today. Just
because I wrote something, there is no guarantee that this
has to be correct. I am very grateful for any suggestions,
additions and corrections. Feel free to point out any spelling
error to me (email@example.com).
Two days before I wrote about the “Volltextsuche Online” (VTO) project at Google Blogoscoped, I had registered for a workshop on that subject on February 20, organized by the MVB company, which is the commercial branch of the German Publishers’ and Book Shops’ Association (Börsenverein des Deutschen Buchhandels). In the aftermath of the posting and the media attention, Theodor Brüggemann, head of the VTO team, “was released from work” by the MVB (according to them) or “left the job” (according to the MVB-owned news magazine Börsenblatt). [UPDATE 2007/03/18: Mr. Brüggemann’s lawyers and the Börsenverein and other newspapers have reached an agreement on how to properly describe the events. The current version, blessed by the parties involved, mentions that Mr. Brüggemann cancelled his temporary contract by himself before the debate about security and proper installation unfolded. Everyone wishes him the best for his new adventure.] A few days later, I got a call from an MVB employee who said that the February 20 event was canceled because there was nobody to run the show. I was kindly invited to join the workshop on February 28. Last week, she called again and said that she had to disinvite me because I wasn’t a book publisher. In short, I refused to get disinvited and she accepted. In the meantime, I also got two invitations for a chat with people involved in the VTO project, which turned out to be a great pleasure. The last call was fun, too: a publishing house in Berlin asked me to have a look at their full text search project and ways of “accessing the content”. In a strange coincidence, the same publisher was sitting right next to me at the workshop today without knowing it (until we introduced ourselves).
The event went from 11:00 a.m. to 02:00 p.m. with a short break and lovely biscuits and cake.
The workshop/presentation was led by Reiner Klink, head of the department “Information Services” at the MVB. He apparently joined the book publishing industry in 1974. Several members of his team were in the room, too, along with a representative from the AKEP (working group electronic publishing) and two people from hgv, a Holtzbrinck company that got the contract to build up VTO.
I counted representatives from a few publishing houses with very different areas of work (from STM to cooking).
Most of the time was spent going through the slides of Mr. Klink’s presentation, which probably won’t appear online even though it did not contain any confidential information, IIRC. Mr. Klink mentioned the debate about my previous encounter with VTO. Several statements concerned the rather unspecified schedule. They (MVB and hgv) are working on translating the user interface of the different parts of VTO, which is scheduled to be completed in time for the Leipzig Book Fair on March 22 (this year, I guess). Mr. Fischbach (hgv) said that no new features are planned until then.
When they took down the beta version of the VTO end user front end, they also deleted all of the existing books inside this installation. During the presentation today, there was only one book present (Thomas Hettche: Woraus wir gemacht sind) and Klink announced that this book would be deleted again for security reasons. Security has now become a number one priority, he said.
Publishers participating in this project should receive their user data soon, if they haven’t already. MVB won’t be sending out login information via email but will make phone calls to deliver the passwords. Email will just be used for sending out the docs. MVB said that it takes up to two weeks after signing the contract until the account data is handed out. One publisher said that he had signed the contract a while ago and hasn’t received anything so far.
The first non-overview slide was about uploading the data. The slide read something like
“Sicherung der Rechte (Texte und Bilder)” (securing the rights (text and images))
and Mr. Klink started to talk about security related issues. He was soon stopped by his team and realized that this was about “Sicherung” as in copyright, not security. In the past, many publishers only got the rights for a specific country or for a specific purpose such as “print”. Such limited rights make it impossible for those publishers to participate in VTO or any similar project right now. The problems with images seem to be far worse. To make it even more annoying, German copyright law (“Urheberrecht” -> right of the author) does not permit authors to grant rights for applications that are not invented yet. As far as I understand it, publishers who made a contract with an author about the “full rights of usage” in, say, 1986 without specifically mentioning publishing in electronic, remote, TCPish ways do not have these rights right now, unless they made another deal afterwards. The German government is currently attempting to change this as part of the “2. Korb” (second basket) of the new copyright legislation.
Every book will be stored in a .zip file containing several parts and meta information of the book. These files will be uploaded via SFTP (220.127.116.11 port 22), where they stay until they are processed.
Now should be a good time to mention that all the specific information may be changed in the future with or without warning or backwards-compatibility. Several specs changed between version 0.9 and 1.5, albeit in minor ways (the file used to be $ISBN.zip, it's now $USERID-$ISBN.zip).
There are several kinds of files to be included in this zip archive:
- The front cover: $USERID-$ISBN_COVER.jpg (or pdf)
- The full text: $USERID-$ISBN.pdf
- The back cover
- The “inventory” file
The inventory file is a rather funny one. It is currently a Microsoft Excel file that contains all the file names of the other files in that zip archive. As far as I can see, this is totally redundant in several ways and one of many possible sources of confusion and mistakes. The system seems to be unable to determine the order of the files, even if they are named as required by the specs.
$USERID refers to a number that identifies the publisher to VTO. The number is central in many ways: it is the user login at the SFTP server and it is part of the file names. Without any further security SNAFU, knowing that 5106488 refers to the Börsenverein itself as a publisher won’t help you much, I guess. I am not sure if that number has any connection to other IDs used at MVB.
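The naming scheme described above can be sketched in a few lines of Python. This is only my reading of the notes, not anything taken from the VTO specs: the function name is made up, the ISBN is a placeholder, and the names of the back cover and the inventory file are left out because they were not spelled out.

```python
def vto_file_names(user_id: str, isbn: str, cover_ext: str = "jpg") -> dict:
    """Build the file names mentioned in the notes for one book.

    Hypothetical helper: only the three names given in the notes are
    produced (archive, front cover, full text).
    """
    if cover_ext not in ("jpg", "pdf"):
        raise ValueError("the front cover is either a jpg or a pdf")
    stem = f"{user_id}-{isbn}"  # $USERID-$ISBN
    return {
        "archive": f"{stem}.zip",
        "front_cover": f"{stem}_COVER.{cover_ext}",
        "full_text": f"{stem}.pdf",
    }


# 5106488 is the user ID mentioned above; the ISBN is a made-up placeholder.
names = vto_file_names("5106488", "9780000000002")
for role, file_name in names.items():
    print(role, file_name)
```

Since everything follows from the two values $USERID and $ISBN, a sketch like this also shows why a separate inventory file listing the same names looks redundant to me.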
After uploading the zip file, the publisher has to go to the “Content Tracking System” (currently at http://cts.volltextsuche-online.de).
His tasks there are to:
- Create and upload the (additional) meta files
- Check the meta data files and the zip files
- Give feedback after the “conversion” process
- Check the converted PDF files (“QS-Qualitätssicherung” / quality assurance)
- Release the book into the wilderness (the live VTO system).
Even though the Börsenverein, the Task Force VTO, the MVB and everybody and his dog have already mentioned a feature to offer different access levels to different audiences (book shops, end users, libraries), there is currently only one global setting per book. It was the standard phrase of the workshop to declare anything above the bare minimum to be “Das ist Zukunftsmusik” (“this is the music of the future”). Hence the title of my blog posting.
I would like to repeat it once more: As of February 2007, VTO does not offer different levels of access to different audiences (as long as you ignore the “hackers”).
The conversion process (i.e. converting the “web-ready” PDF file into a web-ready TIFF file and some meta data) currently takes between 3 days and 4-6 weeks. Mr. Fischbach asked to add that it is currently closer to 4-6 weeks than to 3 days. 6 weeks as in 3.6 million seconds.
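For those who want to check the arithmetic on the conversion time, here is a tiny Python sketch. The variable names are mine; the durations are the 3 days and 6 weeks quoted above.

```python
from datetime import timedelta

best_case = timedelta(days=3)    # the optimistic figure
worst_case = timedelta(weeks=6)  # the upper end of "4-6 weeks"

# 6 weeks = 6 * 7 * 24 * 60 * 60 seconds
print(int(worst_case.total_seconds()))  # 3628800, i.e. roughly 3.6 million
print(int(best_case.total_seconds()))   # 259200
print(worst_case / best_case)           # 14.0, the worst case is 14 times longer
```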
As I mentioned before, publishers have to pay 17 Euro plus tax per book per year to participate. There is currently a special offer: Uploading books before March 31 will get you one year for free. Klink said that there are discussions about a new deadline. Anyway, the discount is granted at the time of the upload.
Even though neither Jens Redmer nor any other Google Book Search person was physically in the room, there was some kind of metaphysical presence there. MVB is currently (read: since 2005) in what they call negotiations with Google and Yahoo. MVB wants information at VTO to be included in Google and other search engines. One of the more important aspects is the question of how much content third parties will get and in what form. This is what seems to be the current understanding at MVB about Google’s demand:
100% of the content of a book, not necessarily in full text but stripped of punctuation, formatting and so on.
Mr. Klink had a rather interesting explanation for pushing publishers to give up 100% of the content to third parties: “You won’t get found unless there are 100% in the index”.
The VTO internal search will index 100% of the content and will show very small snippets around search keywords, even if the pages are restricted.
There is another meta file, a Microsoft Excel spreadsheet that lists all the books from one publisher.
One really scary “feature” (maybe also as in “future”) is to provide the email addresses of the people who registered to see pages of a book to the specific publishers. Mr. Fischbach mentioned concerns about privacy but as far as I understood it, a EULA clause will let end users of VTO grant VTO the right to give away email addresses to third parties.
Passwords do not appear to be changeable by the publishers at the moment. It would be my theory that the different places to make use of the password do not have a unified authentication system right now but I might be very wrong here.
During the presentation of the CTS, there was a very funny situation when a person saw the timestamp of a demo meta file that was uploaded at 05:44 a.m. at VTO, and said he was very proud of this – until one technician mentioned that this is Delhi time zone, not CET. Still impressive, IMHO.
There was a large malfunction at the “AdminTool” demonstration. Speaking of malfunction, this tool is completely written in Macromedia Flash/Shockwave. Nonetheless, it did not go beyond the login-screen. They tried it again half an hour later, still with no luck. On the bright side, even the technicians in Munich at hgv were unable to log in.
In the discussion round after the presentation, I was not able to get a feeling of overwhelming optimism from anyone in that round. A publisher mentioned that he considers this system to be not very intuitive, especially the MS Excel part. He also said that this might create some confusion and security SNAFU when put into practice. Mr. Klink said in his answer that they intend to make this easier.
One publisher had two legal questions, so they brought Mr. Adil-Dominik Al-Jubouri into the room, who does legal stuff related to VTO. The question was that the legal contract between the publishers and the MVB does not seem to mention any responsibility on MVB’s side to implement measures of security. Mr. Al-Jubouri referred to a rather vague statement in the first paragraph. He insisted that downloading/saving/printing pages from VTO was impossible. IANAL. I would love to get a legal opinion from a non-affiliated person that agrees or disagrees with my impression that right now, the MVB is under no obligation to protect the content of the publishers and that there is no compensation for publishers whose content gets leaked via VTO.
Of course, there were other issues that have no technical or political background. I would like to refer to them as “psychological issues”, such as the lack of hyperlinks to the publisher’s web site. The only place right now is a company logo close to the upper left corner. Clicking on that logo did not work today. Above the publisher logo, there is an even larger VTO logo which does work.
A rather surprisingly honest statement came from Mr. Klink at the end of the workshop. He said (after putting this into two “heretic-tags”) that no user will ever come to the content via volltextsuche-online.de itself. They fully rely on the company that starts with G and ends with oogle.
The VTO internal search engine currently works thanks to Lucene. I asked about the ranking mechanism on a site-wide search. The ranking is meant to be in “relevancy as determined by Lucene” order. This feature seemed to be offline today; the default mode seems to be alphabetical order right now. (As a publisher, you have to be lucky to have lots of Aaron Aaftergoods as writers or at least lots of books about the city of Aachen. Hey, that’s almost a neverending source of jokes that involve 18th century SEO techniques for book publishers.) (Offtopic: You’ve got to see this movie: http://www.youtube.com/watch?v=xFAWR6hzZek)
So. This is just my set of impressions and I would love to hear more. Feel free to go to volltextsuche-online.de in case you are interested in this topic. Those of you who went to that workshop, I would like to hear your opinion on this, too. Please think of that large disclaimer at the top and don’t rely on this text when it comes to business decisions.
These are the questions that are in my mind right now:
- Does the MVB have access to the source code of the software?
- Do they have an infinite license to run the software?
- How long does hgv have to do support for VTO related issues?
- If that process of conversion of content from PDF into TIFF+X is that hard, can’t publishers do this on their own?
- Will publishers get full raw access to the converted content from MVB/hgv/MPS?
- Will VTO become part of a “2. Korb” deal that involves mandatory licenses of VTO for libraries?
- And if so, will libraries get 100% content even if the publishers state otherwise in their meta files?
- And if so, how would the MVB be able to explain this to publishers?