wiki:Dev/FileIO

Introduction

This page is about file I/O in general (formats, streaming, network, caching), and how we can improve it in OpenSG.

Current ongoing efforts

  • Christoph Fuenfzig requested a SceneFileHandler for remote files. Work is ongoing on ContentProviders for file, http and ftp, and on a Local Mirror function.

Loading more formats

A key feature of any scene graph is the loaders that it supports. If model format X is not supported, then people may just move on. So, we need more formats supported.

If you want a format added that OpenSG does not support at the moment, add it to the wish list. If it is in development and you need it urgently, ask on the mailing list.

Note that we can (and perhaps should) port loaders from other scene graphs as well.

Loader status:

  • GIF - Animated formats: only some animation modes are supported. Some GIFs are flawed (no EOL); OpenSG chokes on them, although web browsers happily accept them. - #61

See OSGComparison for a list of all currently supported formats. We ought to add the status of each loader as well. -ML

Custom content providers, network-streaming

Some history:

  • Pre 1.8 only allowed users to add local paths for loading (via the PathHandler).
  • In 1.8, all loaders use iostreams, and loading can be customized by users.
  • However, a file name is still passed along for compatibility; 3rd-party loaders might not support iostreams (they use FILE or just path names instead).

There have been requests for, and some work on, network loaders (http/ftp) and archives (zip/rar/etc). In theory they could be added as is, but there are a few (larger or smaller) problems:

  • Files that reference other files:
      • For our loaders this is not a problem.
      • Non-iostream loaders are worse.

CF: Add a FILE* function to ContentProvider in addition to the istream function. It could download the content and give access to a local file. Still requires source modifications in non-iostream loaders?!

ML: See resolveToLocal() below.

  • Hook into OS openfile call, ala some *nix vfs-systems?
  • Progress callback seeks to end of stream to get size. gzip/zip-streams do not usually support random-access.

CF: Abandon usage of random-access completely? Instead put file-size query function into ContentProvider.

ML: Preferably. Use resolveToLocal() or resolveToMemory() if random access is needed.

  • A file-cache would be nice, to avoid multiple downloads.
  • How to combine these in a general fashion? (Copy behaviour from java, which supports something like http://host/file.jar!dir/mypic.txt)

In addition, we probably want to support loading from zip/dat/pk and automatically add decryption/decompression.
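
As a rough illustration of the Java-style combined URL mentioned above, the following sketch splits such a URL into its layers. The function name splitLayers and the choice of '!' as the layer separator are assumptions made for this example only; nothing like it exists in OpenSG yet.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split a layered URL such as
// "http://host/file.jar!dir/mypic.txt" into its layers.
// The '!' separator mimics Java's jar URLs and is an assumption here.
std::vector<std::string> splitLayers(const std::string& url)
{
    std::vector<std::string> layers;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type bang = url.find('!', start);
        if (bang == std::string::npos) {
            layers.push_back(url.substr(start));   // last (or only) layer
            break;
        }
        layers.push_back(url.substr(start, bang - start));
        start = bang + 1;                          // continue after the '!'
    }
    return layers;
}
```

The first layer would go to a base provider (file, http, ftp), every following layer to an archive handler that looks up the entry inside the stream opened by the previous layer.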

Proposal for a URL based scheme

(see end of this page for some highly preliminary code)

Need to define good names for following concepts: (first suggestions and what is used below in italics)

  1. A single-protocol, path-to-istream converter such as file, http, ftp: ContentProvider.
  2. An object holding the istream and additional data (URL & streamsize) for a single object: DataSource.
  3. A singleton for choosing between these (url-lookup & url-to-datasource converter): ContentResolver.
  4. A PathHandler replacement that allows several paths to be scanned: .. .
  5. An archive file type (zip, tar, jar), i.e. an istream-and-path-to-istream converter: ArchiveProvider.
  6. A stream file type (gzip, bzip2, enc/dec), i.e. an istream-to-istream converter: StreamProvider.

There might be some confusion below on what we mean. Be warned. :)

Task list:

  • Replace PathHandler with ContentProvider. Make a registry of these so that different protocols can be supported.

CF: Need replacement for PathHandler, e.g. ContentLocator. Consists of protocol (http, ftp, file, ...) and path (absolute or relative). Need some place to store current path for all protocols, perhaps in the ContentResolver.

ML: Replace PathHandler by a content:// protocol? If we want to store a 'current path', we must consider the implications (see issues below).

  • Move current PathHandler into a FileProvider that uses std::ifstream and file:/// urls.
  • The SceneFileHandler should check whether the input path is a URL or not, to provide backwards compatibility.
  • URLs are converted to DataSources by a new ContentResolver, which is the registry for ContentProviders.
  • A DataSource should contain:
      • An istream object.
      • A URL.
      • An optional streamsize.

CF: Agree with all of them! So accessing the content would be by DataSource::getStream or DataSource::resolveToLocal for older loaders.

CF: Why not put that into the ContentProvider interface? istream object or FILE* object, content-size, absolute and relative position, ...

ML: The plan was to have ContentProvider output a single object which holds all that, so that it can be called several times and provide several open streams.

CF: Of course, absolute and relative position are not appropriate here.

  • All loaders are converted to using DataSources.

CF: So the user can load scenes with SceneFileHandler::read(std::string& url) as before? The resolution of the scene file URL into a DataSource is done by ContentResolver::get. What about subsequent content (images, materials, ..) referenced by a relative path inside the SceneFileType? Does this always refer to the same ContentProvider again, i.e. "appending" the relative path to the DataSource::getURL? We could use the layered URL for that as well.

ML: Sort of. The plan was that, e.g., the vrml-loader gets a URL, appends relative paths to it and loads the images from there. Embedded images could then use other protocols than the vrml-file, which is the whole point of using URLs everywhere. (If there is a relative url, it would be appended to the url of the vrml-file itself, by the VRML loader.)

CF: This is the case occurring most often, so it should be as simple as possible inside the SceneFileType.

ML: Yeah. boost::filesystem::path has a combining operator '/', so you could do ImageFileHandler.read(baseURL.branch() / relativePath); . That's the kind of URL compositing I expect will be enough for 95% of the loaders.
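
A minimal sketch of this kind of URL compositing, written on plain strings so that it does not depend on Boost. The names hasProtocol and resolveAgainst are made up for this example, and the code deliberately ignores ".." segments and references starting with "/".

```cpp
#include <string>

// Hypothetical helper, not an existing OpenSG function.
// True if 'ref' carries its own protocol ("http://...", "file://...").
bool hasProtocol(const std::string& ref)
{
    return ref.find("://") != std::string::npos;
}

// Resolve 'ref' against the URL of the file that references it.
std::string resolveAgainst(const std::string& baseUrl, const std::string& ref)
{
    if (hasProtocol(ref))
        return ref;                              // absolute URL, may use another protocol
    const std::string::size_type slash = baseUrl.rfind('/');
    if (slash == std::string::npos)
        return ref;                              // base has no directory part
    return baseUrl.substr(0, slash + 1) + ref;   // directory of base + relative path
}
```

A VRML loader reading http://server/models/scene.wrl would pass each texture reference through resolveAgainst() before handing it to the image loader.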

  • ProgressCallBack is modified to provide either the stream position alone, or both position and size.
      • The callback provides URL, streampos and streamsize. streamsize is null to indicate unknown size.
  • If loader is 3rd-party, it should be able to request from a FileCache that a URL be downloaded locally and converted to a local path.
  • For archives, construct an ArchiveResolver or something (with a better name) which resolves internal entries to new DataSources?
  • For streams, something like a StreamResolver that adds decryption/decompression etc? Configure these based on url-extension or stream data?
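
The modified progress callback from the task list could look roughly like the sketch below. ReadProgress, ProgressCallback, percentDone and notify are illustrative names, not existing OpenSG API; a size of 0 stands for "unknown", as it would be for most gzip or network streams.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical progress record; the names are illustrative only.
struct ReadProgress {
    std::string url;        // what is being read
    std::size_t position;   // bytes consumed so far
    std::size_t size;       // total bytes, 0 if unknown
};

typedef std::function<void(const ReadProgress&)> ProgressCallback;

// Percentage in [0,100], or -1 if the total size is unknown.
int percentDone(const ReadProgress& p)
{
    if (p.size == 0)
        return -1;
    if (p.position >= p.size)
        return 100;
    return static_cast<int>(
        (static_cast<unsigned long long>(p.position) * 100ULL) / p.size);
}

// Invoke the callback only if the user installed one.
void notify(const ProgressCallback& cb, const ReadProgress& p)
{
    if (cb)
        cb(p);
}
```

With this shape no loader has to seek to the end of the stream; a provider that knows the size fills it in, all others leave it at 0.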

Comment:

  • Are URLs the best way to handle this? (seems to be)
  • Not sure whether streamsize is handled optimally:
      • ContentProviders should give it if they can, so that ContentResolver can report read-progress if enabled.
      • SceneFileTypes need not really use it? .. Could be useful if you need random access (alloc mem & read all of stream into there, then parse).
  • Differentiating between local paths and file:// paths: Mimic web-browsers (IE), i.e. file:// is one thing and c:\foo or /usr/local/foo is another thing. That is unambiguously resolvable.
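
The distinction between URLs and plain local paths can be made with a simple syntactic check, sketched below. looksLikeUrl is a hypothetical name; the accepted characters for the protocol part follow the scheme grammar of RFC 3986 (a letter, followed by letters, digits, '+', '-' or '.').

```cpp
#include <cctype>
#include <string>

// Hypothetical check: true for "scheme://...", false for plain paths
// such as "c:\foo" or "/usr/local/foo".
bool looksLikeUrl(const std::string& s)
{
    const std::string::size_type pos = s.find("://");
    if (pos == std::string::npos || pos == 0)
        return false;                            // no scheme separator, or empty scheme
    if (!std::isalpha(static_cast<unsigned char>(s[0])))
        return false;                            // a scheme must start with a letter
    for (std::string::size_type i = 1; i < pos; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (!std::isalnum(c) && c != '+' && c != '-' && c != '.')
            return false;                        // character not allowed in a scheme
    }
    return true;
}
```

SceneFileHandler could use such a check to route old-style path arguments to the existing PathHandler code and everything else to the ContentResolver.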

Issues:

  • How to handle relative paths? Two solutions spring to mind:
      • Explicit resolving by the loaders (i.e. they know the current url and have to piece things together themselves, mostly just tacking the relative url onto the end of the current url-directory). It would mean changing existing loaders a bit, but probably not much, and it would probably make things clearer and more straightforward.
      • Implicit management would perhaps work for one level of data, but it would probably be ambiguous if you go several levels down (at least, it seems that way to me when I think about it; someone else might be smarter).

CF: We can start with explicit resolving, but I would like to see some automatism to make the implementation in SceneFileType easier.

ML: Sure. Implicit handling will be pretty easy for one-layer loadings (vrml->image) but for more complex stuff, you need to push/pop directories on a stack.
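
The push/pop idea can be sketched as a small stack of directory URLs. BaseUrlStack is a hypothetical class for illustration only; where such a stack would live (ContentResolver, SceneFileHandler, per thread) is left open.

```cpp
#include <string>
#include <vector>

// Hypothetical stack of "current directory" URLs for implicit resolution.
class BaseUrlStack {
public:
    // Called when a loader starts reading 'fileUrl': remember its directory.
    void push(const std::string& fileUrl)
    {
        const std::string::size_type slash = fileUrl.rfind('/');
        dirs_.push_back(slash == std::string::npos
                            ? std::string()
                            : fileUrl.substr(0, slash + 1));
    }

    // Called when the loader is done with the file.
    void pop()
    {
        if (!dirs_.empty())
            dirs_.pop_back();
    }

    // Relative references are resolved against the innermost file.
    std::string resolve(const std::string& ref) const
    {
        if (ref.find("://") != std::string::npos || dirs_.empty())
            return ref;
        return dirs_.back() + ref;
    }

private:
    std::vector<std::string> dirs_;
};
```

A loader that inlines another scene file pushes the inlined URL before parsing it and pops afterwards, so references at every nesting level resolve against the right directory.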

  • How to handle current path-handler functionality (i.e. several roots)?
      • Use a content:// or opensg:// protocol that would be a VFS into which other content-providers could be mounted:

content://base -> http://server/dir/

content://foo/bar -> file:///usr/local/mysystem/data/

content://foo/baz -> file:///usr/local/prepack.zip

CF: Just for understanding: This is a map of alternative names, e.g. content://base refers to http://server/dir/? Who should fill that map?

ML: The user. PathHandler today allows setting several local directories that work as load "root" directories, in a sense. This is an extension of that concept.
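
A user-filled mount map of this kind might be sketched as follows. MountTable is a hypothetical name; the sketch assumes that every target is a directory URL ending in '/', so mounting an archive such as prepack.zip would need the layered handling discussed under the archive issue.

```cpp
#include <map>
#include <string>

// Hypothetical mount table mapping content:// prefixes to real URLs.
class MountTable {
public:
    void mount(const std::string& prefix, const std::string& target)
    {
        mounts_[prefix] = target;
    }

    // The longest matching prefix wins; unmatched URLs pass through unchanged.
    std::string resolve(const std::string& url) const
    {
        const std::string* bestPrefix = nullptr;
        const std::string* bestTarget = nullptr;
        for (const auto& entry : mounts_) {
            const std::string& prefix = entry.first;
            // Match whole path segments only: "content://foo/bar" must not
            // capture "content://foo/barn/...".
            const bool matches =
                url.compare(0, prefix.size(), prefix) == 0 &&
                (url.size() == prefix.size() || url[prefix.size()] == '/');
            if (matches && (!bestPrefix || prefix.size() > bestPrefix->size())) {
                bestPrefix = &prefix;
                bestTarget = &entry.second;
            }
        }
        if (!bestPrefix)
            return url;
        std::string rest = url.substr(bestPrefix->size());
        if (!rest.empty() && rest[0] == '/')
            rest.erase(0, 1);                    // target already ends in '/'
        return *bestTarget + rest;
    }

private:
    std::map<std::string, std::string> mounts_;
};
```

After mounting content://base on http://server/dir/, a request for content://base/models/car.wrl would be forwarded to the http provider as http://server/dir/models/car.wrl.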

  • How to resolve archives?

CF: For the FileIO implementation this is simple and nice, but can you give an example of how the user would refer to a scene /world/root.wrl inside a scene.zip? Does the user give a layered URL to the SceneFileHandler then?

Yup. That would be something like SceneFileHandler::the().read("file:///scene.zip","zip://world/root.wrl"); (or an array of URLs, if you prefer). But since zip isn't really a protocol, it doesn't make sense to write it like that. Perhaps we should favor the combined URL scheme instead?

PD: These two solutions do not work. Consider the URL "http://server/archive.zip#file.obj" and having a relative URL "images/foo.jpg" inside file.obj. This will be transformed to the absolute URL "http://server/images/foo.jpg", instead of the correct URL "http://server/archive.zip#images/foo.jpg". There is a reason why Sun developed that ugly jar URL scheme ("jar:http://server/archive.jar!foo.class"), instead of using fragments...

CF: If you resolve the relative path this way, it won't work. But why not implement a special behaviour for encapsulating entities (ZIP, TAR, ...), see rule 5.1.2 http://www.gbiv.com/protocols/uri/rfc/rfc3986.html#rfc.section.5.1.2. Sun's JAR URL scheme is really ugly.
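
To make the problem PD describes concrete, here is a sketch of a resolver that treats the archive as the encapsulating entity. It assumes a '!' separator between archive URL and entry path (borrowed from the jar scheme); the function name resolveInArchive and the separator are assumptions, not an agreed design.

```cpp
#include <string>

// Hypothetical archive-aware resolution of a relative reference.
// Assumes "archiveUrl!entryPath" as the combined form.
std::string resolveInArchive(const std::string& baseUrl, const std::string& ref)
{
    const std::string::size_type npos = std::string::npos;
    if (ref.find("://") != npos)
        return ref;                               // absolute reference, leave alone

    const std::string::size_type bang  = baseUrl.rfind('!');
    const std::string::size_type slash = baseUrl.rfind('/');

    // The referencing entry sits in the archive root: stay inside the archive
    // instead of falling back to the directory that contains the archive.
    if (bang != npos && (slash == npos || slash < bang))
        return baseUrl.substr(0, bang + 1) + ref;

    if (slash == npos)
        return ref;                               // base has no directory part
    return baseUrl.substr(0, slash + 1) + ref;    // ordinary directory-relative case
}
```

With this rule, images/foo.jpg referenced from http://server/archive.zip!file.obj stays inside the archive, which is the behaviour PD asks for.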

  • How to resolve layers in streams?
      • extension lookup?
      • header lookup?
  • Should we support memory mapped I/O?
      • efficient for local files
      • no need to use iostreams (which might be slow?)
      • streams can unpack to memory, then read. (Useful architecture for multi-threaded unpack/loading anyway)
      • I.e. base all loaders on a raw-memory approach, which is either memory mapped files or an entire unpacked stream. (Solves some progress issues too.)
      • more efficient streaming from disk since we don't fetch blocks here and there?
      • problem: Can't use istream-operators. Might not be a problem, since OpenSG fields use raw-memory reads anyway. Non-optimized loaders can always use iostreams by layering a stream onto memory (using the Boost.Iostreams library)
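
Both the header lookup and the "stream layered onto memory" idea can be sketched with the standard library alone. This is a stand-in for what Boost.Iostreams would provide; MemoryStreamBuf, StreamKind and sniffHeader are hypothetical names. The magic numbers are the gzip signature (0x1f 0x8b) and the zip local file header signature ("PK" 0x03 0x04).

```cpp
#include <cstddef>
#include <istream>
#include <streambuf>

// Read-only streambuf over an existing memory block; no copy is made.
// The block may be a memory mapped file or a fully unpacked stream.
class MemoryStreamBuf : public std::streambuf {
public:
    MemoryStreamBuf(const char* data, std::size_t size)
    {
        // setg() wants non-const pointers; the buffer is never written to.
        char* begin = const_cast<char*>(data);
        setg(begin, begin, begin + size);
    }
};

enum StreamKind { KIND_PLAIN, KIND_GZIP, KIND_ZIP };

// Header lookup: decide how to treat a block by its first bytes.
StreamKind sniffHeader(const unsigned char* data, std::size_t size)
{
    if (size >= 2 && data[0] == 0x1f && data[1] == 0x8b)
        return KIND_GZIP;
    if (size >= 4 && data[0] == 'P' && data[1] == 'K' &&
        data[2] == 0x03 && data[3] == 0x04)
        return KIND_ZIP;
    return KIND_PLAIN;
}
```

A non-optimized loader would wrap the block as MemoryStreamBuf buf(ptr, len); std::istream in(&buf); and keep using operator>>. Seeking is not implemented in this sketch.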

Lists

ContentProviders:

  • path "protocol" - convert PathHandler?
  • file protocol - just call path-handler
  • http protocol
  • ftp protocol

ArchiveProviders:

  • zip archive:
  • Needs porting into OpenSG.
  • tar?

StreamProviders:

  • gzip - OpenSG supports zipstream already, just put into new interfaces
  • bzip2?

File caching

Caching loaded files is desirable, for a number of reasons:

  • Heavy GraphOps (striping/material share) & vrml-parse take time, especially in debug builds on windows.
  • Loading from network should have cache similar to web browsers.
  • Derived data (normal maps, tangent space coords) might be useful to cache as well.
  • Fine-grained caching is probably needed, since we can't require all applications to support general state save/restore. If yours does, however, you could just cache the entire app & scene after processing, thus not requiring support from the user.
  • Local Mirror function after loading a scene with SceneFileHandler. Stores all accessed files over various protocols in a local directory hierarchy. This is useful in addition to the file cache, caching the OpenSG parsing result.
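
For the Local Mirror function, one possible mapping from URLs to a local directory hierarchy is sketched below. mirrorPathFor and the layout root/protocol/server/path are assumptions for illustration; a real implementation would also have to sanitize characters that are not valid in file names.

```cpp
#include <string>

// Hypothetical mapping from a URL to a path inside a local mirror directory,
// e.g. "http://server/dir/a.png" -> "<root>/http/server/dir/a.png".
std::string mirrorPathFor(const std::string& url, const std::string& mirrorRoot)
{
    const std::string::size_type sep = url.find("://");
    if (sep == std::string::npos)
        return std::string();                    // not a URL, nothing to mirror

    const std::string protocol = url.substr(0, sep);
    std::string rest = url.substr(sep + 3);

    // "file:///usr/x" has an empty host, so 'rest' starts with '/': drop it.
    while (!rest.empty() && rest[0] == '/')
        rest.erase(0, 1);

    return mirrorRoot + "/" + protocol + "/" + rest;
}
```

Keeping the protocol as the first directory level avoids collisions between, say, an ftp and an http server with the same host name.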

Solutions:

  • Needs Boost and hence 2.0.
  • Does not work with current readprogresscb, as it gzips cached files as well (and boost::gzipstream does not support random access)

Implementation suggestion

Some quick suggestion-hacks on relevant interfaces:

CF: Nice work!

#!cpp

#include <istream>
#include <string>
#include <vector>
#include <boost/filesystem/path.hpp>
#include <boost/shared_ptr.hpp>

// NodePtr and ImagePtr are the existing OpenSG pointer types.

class URL {
public:
    /// Allows automatic conversion from string/const char*. Should check that
    /// the URL is well-formed. Also converts local paths to file://
    URL(std::string url);
    URL(URL base, const char* relativePath);

    std::string getProtocol();
    std::string getServer();
    boost::filesystem::path getPath(); // or something that supports similar branch()/leaf() ops.
};

typedef boost::shared_ptr<boost::filesystem::path> LocalFilePtr; /// could probably be a FILE* instead, although it sort of limits loaders that

class DataSource {
public:
    URL getURL();
    std::istream& getStream();
    unsigned int getSize(); ///< may be zero if not available
    LocalFilePtr resolveToLocal(); ///< on ptr release, file may be deleted by OpenSG if caching is disabled (custom deleter used)
};

typedef boost::shared_ptr<DataSource> DataSourcePtr; // some kind of ref count here is probably a good idea

/// all current filetypes
class SceneFileType {
public:
    virtual ~SceneFileType() {}
    virtual NodePtr read(DataSourcePtr ds) = 0;
};

/// and images are similar
class ImageFileType {
public:
    virtual ~ImageFileType() {}
    virtual ImagePtr read(DataSourcePtr ds) = 0;
};

/// base providers (file, http, ftp, ...)
class ContentProvider {
public:
    virtual ~ContentProvider() {}
    virtual std::string getProtocol() = 0;
    virtual DataSourcePtr get(URL url) = 0;
};

/// 'archive' providers (zip, jar, tar, ...)
class LayeredContentProvider {
public:
    virtual ~LayeredContentProvider() {}
    virtual bool isExtensionMatch(std::string ext) = 0;
    virtual DataSourcePtr get(DataSourcePtr source, URL url) = 0;
};

typedef boost::shared_ptr<std::istream> IStreamPtr; // useful to help passing these around

/// for layering streams (gzip, encryption, etc..)
class StreamProvider {
public:
    virtual ~StreamProvider() {}
    virtual bool isExtensionMatch(std::string ext) = 0;
    virtual IStreamPtr composeWith(IStreamPtr input) = 0;
};

/// Main class for retrieving data
class ContentResolver {
public:
    static ContentResolver& the(); // singleton access

    void setCachingPolicy(bool storeFetchedData, bool storeParsedData, ...); /// ought to go in here

    void addContentResolver(ContentProvider* cr); /// new func.
    void subContentResolver(ContentProvider* cr); /// new func. (sub is useful if we add dynamic (re)loading of plugins)

    DataSourcePtr get(URL url);
    DataSourcePtr get(std::vector<URL> urls); // layered access?
};
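
To show how the registry and the providers might play together, here is a small self-contained toy. It simplifies the interfaces above: URLs are plain strings, std::shared_ptr replaces boost::shared_ptr, the resolver is not a singleton, and MemoryProvider with its "mem" protocol is invented purely for the demonstration.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <memory>
#include <sstream>
#include <string>

// Simplified stand-ins for the interfaces above.
struct DataSource {
    std::string                   url;
    std::shared_ptr<std::istream> stream;
    std::size_t                   size;   // 0 if unknown
};
typedef std::shared_ptr<DataSource> DataSourcePtr;

class ContentProvider {
public:
    virtual ~ContentProvider() {}
    virtual std::string   getProtocol() const = 0;
    virtual DataSourcePtr get(const std::string& url) = 0;
};

// Hypothetical provider that serves "mem://name" from in-memory strings.
class MemoryProvider : public ContentProvider {
public:
    void add(const std::string& name, const std::string& content)
    {
        files_[name] = content;
    }
    std::string getProtocol() const override { return "mem"; }
    DataSourcePtr get(const std::string& url) override
    {
        const std::string name = url.substr(std::string("mem://").size());
        auto it = files_.find(name);
        if (it == files_.end())
            return DataSourcePtr();              // unknown entry
        auto ds = std::make_shared<DataSource>();
        ds->url    = url;
        ds->stream = std::make_shared<std::istringstream>(it->second);
        ds->size   = it->second.size();
        return ds;
    }
private:
    std::map<std::string, std::string> files_;
};

// Registry: picks the provider by the protocol part of the URL.
class ContentResolver {
public:
    void addProvider(const std::shared_ptr<ContentProvider>& p)
    {
        providers_[p->getProtocol()] = p;
    }
    DataSourcePtr get(const std::string& url) const
    {
        const std::string::size_type sep = url.find("://");
        if (sep == std::string::npos)
            return DataSourcePtr();              // not a URL
        auto it = providers_.find(url.substr(0, sep));
        return it == providers_.end() ? DataSourcePtr() : it->second->get(url);
    }
private:
    std::map<std::string, std::shared_ptr<ContentProvider> > providers_;
};
```

A loader would then call resolver.get("mem://hello.txt") and read from *ds->stream; http or ftp providers would plug into the same registry without changes to the loaders.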



Last modified on 01/17/10 01:11:44