Technology and scientific methodology have advanced greatly over the last decade, particularly where data is concerned. Scientists are simultaneously fortunate and unfortunate: these advances provide larger and more intricate datasets than ever, yet evaluating a dataset both as a whole and in its finest details grows ever more difficult.
Processing large datasets requires a significant amount of forethought; traditional supercomputing models are not an ideal match for big data, so compromises or significantly over-complicated solutions are often necessary to achieve science goals. The Data-Scope endeavors to overcome the issues big data faces on traditional HPC in the following ways:
Large datasets are traditionally stored on communal “head” or “storage” nodes, which are shared by hundreds or thousands of compute nodes. Putting aside IO subsystem obstacles, the data also has to travel across high-speed Infiniband or Ethernet networks to computational nodes that typically have less than a terabyte of slow, local storage. Thus the network and the slow local disk become bottlenecks.
Providing users with their own nodes eliminates the problem of sharing large head-node storage among users with different access patterns and data layouts. Giving exclusive access to a single project (or set of complementary projects) allows the data to be laid out in a fashion conducive to computation.
GPUs are a cost-effective and extremely fast solution to big data problems. A Fermi-generation NVIDIA GPU is capable of roughly half a teraflop of double-precision performance and delivers far more flops per dollar than a conventional CPU.
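The flops-per-dollar argument is simple arithmetic. The sketch below illustrates it; the prices and the CPU peak rate are assumptions chosen for illustration, not figures from the Data-Scope proposal (only the ~515 GFLOPS double-precision peak of the Tesla C2070 is a published specification).

```python
# Illustrative flops-per-dollar comparison for a Fermi-class GPU vs. a
# contemporary server CPU. Prices and the CPU rate are assumed values.

GPU_GFLOPS_DP = 515          # Tesla C2070 peak double precision (published spec)
GPU_PRICE_USD = 2200         # assumed street price at the time

CPU_GFLOPS_DP = 60           # assumed peak for a contemporary server CPU
CPU_PRICE_USD = 1000         # assumed price

gpu_ratio = GPU_GFLOPS_DP / GPU_PRICE_USD    # GFLOPS per dollar, GPU
cpu_ratio = CPU_GFLOPS_DP / CPU_PRICE_USD    # GFLOPS per dollar, CPU

# Under these assumptions the GPU delivers several times more
# double-precision throughput per dollar spent.
advantage = gpu_ratio / cpu_ratio
print(round(advantage, 1))
```

Even with generous assumptions for the CPU, the per-dollar advantage of the GPU remains a small integer multiple, which is the heart of the cost-effectiveness claim.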
The Data-Scope removes the network from the equation during computation. The computational nodes are equipped with twenty-four 1TB hard drives mapped one-to-one on the backplane, as well as four MLC SSDs. Because the design does not funnel drives through SAS expanders, it does not bottleneck on them: sequential IO throughput scales with the aggregate bandwidth of the drives, up to the limit of the host bus adapter.
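A back-of-the-envelope estimate shows why the one-to-one drive mapping matters. The per-drive rates and HBA ceiling below are assumed, round numbers for illustration, not measured Data-Scope figures; only the drive counts come from the description above.

```python
# Estimate of aggregate sequential throughput for one Data-Scope node.
# Per-drive rates and the HBA ceiling are illustrative assumptions.

HDD_COUNT = 24
HDD_SEQ_MBPS = 140           # assumed sequential rate per 1TB SATA drive
SSD_COUNT = 4
SSD_SEQ_MBPS = 250           # assumed sequential rate per MLC SSD

hdd_total = HDD_COUNT * HDD_SEQ_MBPS         # aggregate HDD bandwidth
ssd_total = SSD_COUNT * SSD_SEQ_MBPS         # aggregate SSD bandwidth
aggregate = hdd_total + ssd_total            # total if nothing throttles

# With drives mapped one-to-one (no SAS expanders in the path), throughput
# scales with the drives until it reaches the host bus adapter's ceiling.
HBA_LIMIT_MBPS = 4000                        # assumed HBA/PCIe ceiling
node_throughput = min(aggregate, HBA_LIMIT_MBPS)
print(node_throughput)
```

Under these assumptions a single node can stream on the order of several gigabytes per second from local disks, which is what a shared head node behind a network link cannot match.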
Proposals will be reviewed by the Data-Scope Allocation Committee on demand as they are submitted, and the overall usage of the machine will be evaluated and reported quarterly. Please use the form on the right to contact Data-Scope administrators.
The Data-Scope is intended to provide a data-intensive analysis capability for Big Data problems. As such, the majority of users will run projects of finite duration, typically 3 to 6 months, that leverage Data-Scope’s unique properties: fast I/O with SSDs or high computing density with GPUs. Proposals that would use Data-Scope as a compute facility alone will be redirected to other JHU resources, such as the Homewood High-Performance Computing Cluster (HHPC) or the GPU Laboratory.
We expect a small fraction of the machine to be used for long-standing services. The Institute for Data-Intensive Science and Engineering (IDIES) runs many such services, including SDSS.org, The Turbulence Project, and the Open Connectome Project. Proposals to this effect will be considered; however, this will always be a secondary use of the machine.
Permanent and backed-up storage may be available for projects that are long-term or that generate data products the investigators cannot easily retrieve. Projects that wish to use long-term storage will be assessed a one-time charge-back covering the acquisition and deployment of the storage, increasing the capacity of Data-Scope commensurately. Ask us about backup charges. These rates will be determined at the time of proposal, but we expect them not to exceed $100 per terabyte.
The Data-Scope project was funded by a grant from the National Science Foundation. Intel provided the CPUs, valued at $294K, and NVIDIA donated 60 TESLA C2070 cards.
Researchers interested in using the Data-Scope instrument should submit a short (1-2 page) PDF document that addresses the following points:
How many and what types of Data-Scope resources do you require?
What are your storage requirements?
What is your timeline for use of the machine?