(Contact Info: larry at larryblakeley dot com)
I manage this Web site and the following Web sites:
Leslie (Blakeley) Adkins - my oldest daughter
Lori Ann Blakeley (June 20, 1985 - May 4, 2005) - my middle daughter
Evan Blakeley - my youngest child
For many years, scientific data has been stored and transferred using a variety of data formats. In recent years XML has become an important and popular format for exchanging digital information, including scientific data. At the most abstract level, XML can be used for any purpose for which a binary format might be used, and vice versa. XML has many good features, but XML is not likely to replace binary data formats for scientific data. At NCSA we have studied emerging Web and XML technologies for many years. In particular, we are interested in learning how XML and binary formats can best be used by scientists, and how XML and binary formats should work together in systems. This paper summarizes some of the lessons we have learned.
It is important to point out that all computer data is at least implicitly “formatted”, although the format may be informal and undocumented. For trivial or evanescent data there is little need to worry about how the data is represented or stored. However, for data that must persist, be transported, and/or be shared, it is important to carefully store the data in a known format, so that the intended meaning can be accurately reconstructed. This paper discusses “data formats” for scientific data. A data format is a method for representing data in a digital store (which may be transient or persistent). Essentially a data format has the following aspects:
- A model of data representation
- A mechanism for describing data, i.e., for mapping concepts of the data model to digital objects such as files or memory
- When the data model is very general, a data format may also have means to specify profiles, i.e., more specific sub-cases of the data model that model particular application or community concepts
This paper considers two mechanisms for implementing formatted data: XML and “binary” data formats. A “binary” format maps data concepts to the native entities of the computer system, to bytes, numbers, and perhaps pointers or other structures. A data format can also be defined in terms of composites or aggregations of ‘atomic’ binary elements. For example, many data formats define multidimensional array objects, specified as conventional arrangements of bytes. There are many examples of ‘binary’ formats, including HDF http://hdf.ncsa.uiuc.edu/ [18], netCDF http://my.unidata.ucar.edu/content/software/netcdf/index.html [36], FITS http://fits.gsfc.nasa.gov/ [33], TIFF http://www.libtiff.org/ [34], GeoTIFF http://remotesensing.org/geotiff/geotiff.html [10], and so on. A “text” format maps the data concepts to character strings, which must be translated to and from computer entities. For example, to perform arithmetic, the string “-7.0” must be converted to an appropriate representation. One advantage of text representations is that they can (to a degree) be read by a human. XML has emerged as the most popular text format.
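The difference between the two representations of “-7.0” can be sketched in a few lines of Python. This is only an illustration, not something taken from the paper: the standard `struct` module stands in for a binary format, and the variable names are hypothetical.

```python
import struct

# Text representation: four characters that must be parsed before any arithmetic.
text_value = "-7.0"
number = float(text_value)

# Binary representation (assumed here: little-endian IEEE 754 double, 8 bytes).
binary_value = struct.pack("<d", number)
(decoded,) = struct.unpack("<d", binary_value)

# Both routes arrive at the same number once the text has been converted.
result_from_text = float(text_value) * 2
result_from_binary = decoded * 2
```

The text form can be read directly by a person, while the packed bytes can be used for computation with no string parsing, which is the trade-off the paragraph above describes.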
Brief Comparison of “Binary” and XML Data Formats
This section (4. Brief Comparison of “Binary” and XML Data Formats) presents a summary of the characteristics of XML and “Binary” files for storing and transferring scientific data. Both XML and binary formats such as HDF can meet the essential requirements of scientific data. These are: storing objects (e.g., tables, arrays, lists, images, etc., and the values of numbers and strings), storing descriptions of data (including facts critical to interpret the scientific meaning of the numbers), and specifying the layout of bits. However, there are several important differences in the capabilities of XML and binary formats.
First, XML is, in theory, “human readable”, while binary data requires software. Of course, any non-trivial use of data almost always requires software assistance, even with XML.
Second, XML is a universal standard with widely available software. It is a good bet that anyone who needs to can get documentation and software to access XML. Because XML is a Web standard, it enjoys massive commercial use and support, and is built in ‘by default’ to most systems. As a result, there is a huge base of excellent commercial and free software that can be adopted for scientific use, at a huge cost savings. In the long run, the cost savings may be the overwhelming practical advantage of XML. There are many binary formats, and none is as universal as XML. For any given binary format, there are readers and documentation, but it may be necessary to obtain and install extra software, and/or write code.
A third important difference is in the flexibility of the storage format. XML is, by design, organized as a stream of Unicode characters, with its objects organized as a tree. In contrast, binary formats store numbers in ‘native’ formats which are typically compact and require no translation in order to perform computation. A binary format can implement many storage strategies, including a graph, mesh, or other multidimensional structure. Also, a binary file can be organized to optimize space and transfer speed. For instance, a binary data format may implement optimal numeric coding, data compression, or interleaving. Also, a binary file can access data in the middle of the file (“random access”). These features are extremely critical for practical access to large complex data sets that cannot be stored in memory in their entirety. The ability to access sub-sets of large data (e.g., a region of a global dataset) requires careful organization of the data http://hdf.ncsa.uiuc.edu/apps/dods/perfeval/report.html [21]. Compression and other strategies can have a very significant impact on cost and performance (e.g., http://esto.nasa.gov/conferences/estc-2002/Papers/A3P2(Yeh).pdf [48]).
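A minimal Python sketch can make the points about compactness and random access concrete. It is an illustration under stated assumptions, not the method of HDF or any other format named above: the flat array of doubles, the `<array>`/`<v>` element names, and the helper functions are all hypothetical, and only the standard library is used.

```python
import os
import struct
import tempfile
import xml.etree.ElementTree as ET

values = [i * 0.5 for i in range(1000)]

# Binary encoding: a fixed 8 bytes per value (little-endian doubles).
binary_blob = struct.pack(f"<{len(values)}d", *values)

# XML encoding: each value becomes character data inside a <v> element.
root = ET.Element("array")
for v in values:
    ET.SubElement(root, "v").text = repr(v)
xml_blob = ET.tostring(root, encoding="utf-8")


def read_binary_slice(path, start, count):
    """Seek straight to the requested values; the rest of the file is never read."""
    with open(path, "rb") as f:
        f.seek(start * 8)
        data = f.read(count * 8)
    return list(struct.unpack(f"<{count}d", data))


def read_xml_slice(xml_bytes, start, count):
    """Parse the whole document, then convert the requested strings to floats."""
    elems = ET.fromstring(xml_bytes).findall("v")
    return [float(e.text) for e in elems[start:start + count]]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.bin")
    with open(path, "wb") as f:
        f.write(binary_blob)
    binary_slice = read_binary_slice(path, 500, 3)

xml_slice = read_xml_slice(xml_blob, 500, 3)
```

Both reads return the same three values, but the binary read touches only 24 bytes of the file, whereas this simple XML read parses every element first. In this sketch the XML encoding is also larger than the 8000-byte binary encoding, before any compression is applied to either.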
Fourth, XML and specific binary formats integrate with other software in different ways. XML is the ‘native format’ for the Web, and many standards are built on top of XML, including XSL stylesheets http://www.w3.org/TR/xslt [38], Web Services http://www.w3.org/TR/2002/WD-ws-arch-20021114/ [41], Grid services http://www.globus.org/research/papers/gsspec.pdf [35], and the Semantic Web http://www.w3.org/2001/sw [39]. On the other hand, binary formats can be tightly coupled to software, for optimal integration with a specific environment.
- "XML and Scientific File Formats," Robert E. McGrath http://www.ncsa.uiuc.edu/AboutUs/People/Contacts/people291.html, Scientific Data Technologies Group http://www.ncsa.uiuc.edu/AboutUs/People/Divisions/divisions9.html, National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign http://www.ncsa.uiuc.edu/, August 2003.
File URL here (PDF) http://www.ncsa.uiuc.edu/NARA/XML_and_Binary.pdf
Post Date: March 22, 2005 at 9:00 AM CST; 1500 GMT