IDIES Data Center

High-value Datasets

The data center is the core of IDIES research infrastructure. Its state-of-the-art hardware is optimized for AI/ML workloads and serves researchers working with large scientific datasets.

This hardware is backed by the technical expertise of our support staff and research software engineers, and by decades of experience gained through our pioneering work on high-profile data projects such as the Sloan Digital Sky Survey (SDSS) and the Johns Hopkins Turbulence Database (JHTDB); case studies on both projects appear below. Together, these resources are well equipped for the deposition, curation, and dissemination of datasets of all sizes.

Storage—Ceph storage cluster and Weka frontend
The 90 nodes from the 2012 predecessor, Data Scope, were divided into three separate clusters and built in phases. Each cluster has 10+ PB and provides CephFS POSIX file access in addition to S3 object storage. The S3 object storage is planned to serve as a slower storage tier for Weka, which acts as a front-end cache and provides much faster WekaFS POSIX access.

Compute Cluster: CPUs and GPUs
Each of the 25 quad nodes in our compute cluster has 2 x AMD EPYC 7702 64-core processors, 1 TiB of DDR4 3200 MT/s RAM, 3.84 TiB of NVMe scratch storage, and 200 Gbps fiber networking.

Support for AI/ML workloads is provided by the 16-A100 compute cluster.

High-speed connectivity
A 100 Gb uplink to Internet2 provides fast data transfer between JHU and other participating institutions.

Kubernetes—Grendel and Kraken
The Kubernetes Compute Cluster now supports all the compute workflows resulting from SciServer Compute, as well as other computational tasks from all the science domains and data sets that IDIES supports.

Scratch Storage

Dedicated Data Volumes

Kafka

IT staff on-site

Research software engineers (RSEs) and project consultation available

The IDIES Data Center contains more than 30 PB of Ceph storage. A 16+4 erasure code is used for all data, along with 5x replication for metadata. With this erasure-coding scheme, there are 4 parity chunks for every 16 data chunks. Under this configuration, up to 4 of the 30 hosts can be lost before data loss becomes a concern. Ceph also continuously performs scrubs on metadata and deep scrubs on data to help detect and prevent bit rot.
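The arithmetic behind a 16+4 scheme can be sketched as follows. This is an illustration only: the function name `erasure_profile` is hypothetical, and treating the 30 PB figure as raw capacity is an assumption made for the example, since the text does not say whether it is raw or usable.

```python
def erasure_profile(data_chunks: int, parity_chunks: int, raw_pb: float) -> dict:
    """Summarize a k+m erasure-coding layout (illustrative, not a Ceph API)."""
    total_chunks = data_chunks + parity_chunks
    return {
        # Each object is split into this many chunks, one per host.
        "chunks_per_object": total_chunks,
        # Any `parity_chunks` chunks can be lost and the object is still recoverable.
        "tolerated_chunk_losses": parity_chunks,
        # Raw bytes stored per byte of user data.
        "storage_overhead": total_chunks / data_chunks,
        # Capacity left for user data out of the raw pool.
        "usable_pb": raw_pb * data_chunks / total_chunks,
    }


# Assumption: 30 PB is raw capacity; metadata replication is ignored here.
profile = erasure_profile(data_chunks=16, parity_chunks=4, raw_pb=30.0)
print(profile)
```

Under these assumptions the overhead is 1.25 bytes stored per byte of data, compared with 3.0 for plain three-way replication, which is the usual reason for choosing erasure coding on large data pools.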

DBMS cluster

The current database environment includes approximately 40 Microsoft SQL Server instances, along with fewer than 10 combined MySQL and PostgreSQL instances. Backup methods vary by platform, data criticality, retention needs, and disaster recovery requirements.

SQL Server Backup Strategy

Three primary backup approaches are used for SQL Server databases:

One-time archival backups
Scheduled backups to tape
Scheduled backups to disk

All SDSS catalog data stored in SQL Server format is backed up to tape. Two copies are maintained at separate physical locations, providing geographic redundancy and supporting disaster recovery.

The remaining SQL Server research databases are protected through a combination of tape and disk backups. Critical administrative databases are backed up to tape every day.

Approximately 30,000 “My DB” databases are backed up using a weekly full-backup schedule, supplemented by daily differential backups. This approach provides regular recovery points while reducing the storage and processing requirements associated with daily full backups.
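As a minimal sketch of how a weekly-full plus daily-differential schedule is restored, the function below picks the backups needed to recover a database as of a given day. The function name, the choice of Sunday for full backups, and the assumption that no differential is taken on the day of a full backup are all hypothetical; the actual schedule is not stated here.

```python
from datetime import date, timedelta

FULL_WEEKDAY = 6  # assumption: full backups run on Sunday (Monday == 0)


def backups_needed(target: date) -> list[tuple[str, date]]:
    """Return the backups to restore, in order, to recover the state as of `target`."""
    days_since_full = (target.weekday() - FULL_WEEKDAY) % 7
    last_full = target - timedelta(days=days_since_full)
    chain = [("full", last_full)]
    # A differential holds every change since the last full backup,
    # so only the most recent one is needed, never the ones in between.
    if days_since_full:
        chain.append(("differential", target))
    return chain


print(backups_needed(date(2024, 1, 10)))  # a Wednesday
```

A restore therefore never needs more than two backup sets, whereas an incremental scheme would need every daily backup since the last full one.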

Other SQL Server research datasets are backed up to disk at frequencies determined by operational needs, data value, rate of change, and recovery requirements.

MySQL and PostgreSQL Backup Strategy

MySQL and PostgreSQL represent a smaller portion of the database environment, with fewer than 10 instances in total. These databases are backed up to disk on schedules aligned with the criticality of each system and dataset.

Current Backup Coverage

The environment provides several layers of protection:

Geographic disaster recovery for SDSS catalog data
Daily tape backups for critical administrative databases
Weekly full and daily differential backups for approximately 30,000 My DB databases
Flexible disk-based backup schedules for research databases
Criticality-based disk backups for MySQL and PostgreSQL

The current backup model uses a risk-based combination of archival, tape, and disk-based protection. The most critical and high-value datasets receive frequent backups and, where required, off-site redundancy. Research and lower-criticality databases are backed up according to operational need, allowing backup frequency and storage use to be balanced across the environment.

High availability is maintained primarily via a load-balanced web server cluster for both Windows and Linux websites. At the database level, high availability is ensured with multiple redundant copies of each database and with workload segregation to optimize performance between copies. Production datasets typically get their own dedicated cluster of two or more servers, whereas legacy databases are consolidated on “fat” enterprise-grade storage with a failover VM configuration and mirroring of critical databases. At the filesystem level, high availability is maintained by keeping high-value data on enterprise-grade storage with built-in high-availability features.

IDIES hosts more than 50 different websites in astronomy and other science domains. All of these sites are monitored at least hourly by an extensive monitoring system:

The monitoring stack consists of Grafana for visualization, Zabbix for data collection, and TimescaleDB for data storage, all of which are open-source software freely available to the public.

Zabbix Agents and Proxies are deployed to enable a distributed monitoring architecture, providing redundancy and load balancing for data collection.

Zabbix currently monitors 1400+ IDIES hosts and services, collecting 1700+ new values per second. Over 1.2 TiB of compressed time-series data has been gathered in the past year alone.
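For a rough sense of scale, the two figures above can be combined into an approximate storage cost per collected value. This is a back-of-the-envelope sketch only: it assumes the collection rate was a constant 1700 values per second for the whole year and that the 1.2 TiB covers exactly those values, neither of which is stated, and both source figures are lower bounds.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

values_per_second = 1700          # quoted as "1700+", so a lower bound
stored_bytes = 1.2 * 2**40        # 1.2 TiB of compressed data, also a lower bound

# Assumption: the rate was constant over the whole year.
values_per_year = values_per_second * SECONDS_PER_YEAR
bytes_per_value = stored_bytes / values_per_year

print(f"{values_per_year:,} values per year")
print(f"about {bytes_per_value:.1f} compressed bytes per value")
```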

All end-user data in the SciServer “persistent” storage folders (the 10 GB of storage that users get for free with their SciServer registration) is backed up nightly with incremental backups to a tape robot system (IBM Spectrum Protect). For data providers, the backup strategy is negotiated as part of their MOU/contract with IDIES. Datasets that are SQL-compatible relational databases are automatically covered by the Database Backup and Recovery strategy outlined above. For file-based datasets, the backup strategy depends on whether IDIES hosts a mirror site or the master site for the data. For example, the SDSS Science Archive Server (SAS) file-based data is stored on our Ceph filesystem as part of an official mirror site for the SAS. As such, it is not backed up beyond the data protection that Ceph provides by default.
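The decision logic in this paragraph can be summarized in a short sketch. The type and function names are hypothetical, and the "negotiated" outcome for file-based master sites is an assumption based on the general rule that data-provider strategies are set in the MOU/contract; the text gives an explicit outcome only for the mirror-site case.

```python
from enum import Enum
from typing import Optional


class DataKind(Enum):
    USER_PERSISTENT = "SciServer persistent user storage"
    SQL_DATABASE = "SQL-compatible relational database"
    FILE_BASED = "file-based dataset"


class SiteRole(Enum):
    MIRROR = "mirror"
    MASTER = "master"


def backup_policy(kind: DataKind, role: Optional[SiteRole] = None) -> str:
    """Map a dataset description to the backup approach described in the text."""
    if kind is DataKind.USER_PERSISTENT:
        return "nightly incremental backup to tape"
    if kind is DataKind.SQL_DATABASE:
        return "database backup and recovery strategy"
    # File-based data: the outcome depends on the site's role.
    if role is SiteRole.MIRROR:
        return "Ceph built-in data protection only"
    if role is SiteRole.MASTER:
        return "negotiated in the MOU/contract"  # assumption, see lead-in
    raise ValueError("file-based datasets need a site role")


# The SDSS SAS example from the text: file-based data on a mirror site.
print(backup_policy(DataKind.FILE_BASED, SiteRole.MIRROR))
```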

SciServer

Custom Web Services
Examples include SDSS, JHTDB, and AstroPath.

Case Study: The Sloan Digital Sky Survey

Case Study: SciServer and Johns Hopkins Turbulence Database (JHTDB)

SciServer is an NSF-funded project (NSF ACI-1261715 and CSSI-3211791) that supports collaborative data-intensive research. The datasets it supports are hosted in the IDIES Data Center, which offers extensive resources for data storage, both databases and file systems, and for computational processing.

Across the storage systems that support file storage, databases, logs, user storage, and working compute space, there is approximately 6 PB of storage for SciServer operations and for the multiple science domains that it supports. This is in addition to several PB of storage for the group’s traditional astronomy datasets.

SciServer has 15 compute servers, each with between 48 and 64 cores and between 256 GB and 1 TB of RAM, for a total of just under 900 processing cores and just under 8 TB of RAM. SciServer supports a large number of data-intensive projects from many disciplines, with several petabytes of data in active use.

Other University Resources

Data-driven HPC—Advanced Research Computing at Hopkins (ARCH)

The mission of the Advanced Research Computing center at Hopkins (ARCH, https://www.arch.jhu.edu/about-arch/), previously the Maryland Advanced Research Computing Center (MARCC), is to enable research, creative undertakings, and learning that involve and rely on the use and development of advanced computing.

ARCH is a shared computing facility at Johns Hopkins University. It administers state-of-the-art high-performance computing resources, manages highly reliable data storage, and provides collaborative scientific support to empower computational research, scholarship, and innovation.

ARCH provides us with potential access to 23,000 cores and over 1.4 PFlops. The system uses an FDR-14 InfiniBand topology and includes Dell PowerEdge GPU nodes along with dual Intel Xeon servers. As with SciServer, all of our team has worked on MARCC, so it provides a common collaborative framework for working together.

NSF Grant Information

Renovations from an NSF ARRA grant (OCI-0963185) “Advanced CyberInfrastructure for High Performance, Data Intensive Computing” created a flexible, stable environment for a high density of computing equipment and petabyte storage to support data-intensive research.

An NSF STCI grant (OCI-1137045), “Collaborative Research: 100G Connectivity for Data-Intensive Computing at JHU,” provides for a 100G connection through the MidAtlantic Crossroads (MAX) to Pittsburgh and then to Starlight in Chicago.

The Institute for Data-Intensive Engineering and Science
