HPCC is a proven, battle-tested platform for manipulating, transforming, querying, and warehousing Big Data. Key features of the platform include:
- Processing clusters using commodity off-the-shelf (COTS) hardware
- Utilizes typical rack-mounted blade servers with Intel or AMD processors, local memory and disk connected to a high-speed communications switch (usually Gigabit Ethernet connections) or hierarchy of communications switches depending on the total size of the cluster
- Clusters are usually homogeneous (all processors are configured identically), although this is not a requirement
- Thor, the Data Refinery, is the extraction, transformation and loading engine
- Roxie, the Data Delivery Engine, provides separate high-performance online query processing and data warehouse capabilities
- Distributed File System
- Thor distributed file system (Thor DFS) is optimized for Big Data ETL
- Roxie distributed file system (Roxie DFS) is optimized for high concurrent query processing
- Linux Operating System
- Services for job execution
- Services for distributed file system access
- A Thor cluster is configured with a master node and multiple slave nodes
- A Roxie cluster is a peer-coupled cluster where each node runs Server and Agent tasks for query execution and key and file processing
- The file system on the Roxie cluster is a distributed indexed-based file system which uses a custom B+Tree structure for data storage
- Indexes and data supporting queries are pre-built on Thor clusters and deployed to Roxie with portions of the index and data stored on each node
- ECL Agent acting on behalf of a client program to manage the execution of a job on a Thor cluster
- ESP Server (Enterprise Services Platform) providing authentication, logging, security, and other services for the job execution and Web services environment
- Dali server which functions as the system data store for job workunit information and provides naming services for the distributed file systems
- ECL IDE, the program development environment
- ECL code migration tool
- Distributed File Utility (DFU)
- Environment Configuration Utility
- Roxie Configuration Utility
- ECLWatch is a Web-based utility program for monitoring the HPCC environment and includes queue management, distributed file system management, job monitoring, and system performance monitoring tools
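To make the components above concrete, the following minimal ECL job (the record layout and logical file name are hypothetical) is the kind of program that gets compiled into a workunit, dispatched through the ECL Agent, tracked in Dali, and monitored from ECLWatch:

```ecl
// Hypothetical record layout and logical file name for illustration.
PersonRec := RECORD
    UNSIGNED4 id;
    STRING30  name;
END;

persons := DATASET('~thor::example::persons', PersonRec, THOR);

// This action compiles into an execution graph; the packaged job
// (workunit) is managed by the middleware components listed above.
OUTPUT(COUNT(persons));
```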
About the HPCC Distributed File System
The Thor DFS is record-oriented and uses a local Linux file system to store file parts. Files are initially loaded (sprayed) across the nodes; each node holds a single file part, which may be empty, for each distributed file. Files are divided on even record/document boundaries specified by the user. In the master/slave architecture, name services and file-mapping information are stored on a separate server. Only one local file per node is required to represent a distributed file. Read/write access is supported between clusters configured in the same environment. Special adapters allow files from external databases such as MySQL to be accessed, so transactional data can be integrated with DFS data and incorporated into batch jobs. The Roxie DFS utilizes distributed B+Tree index files containing key information and data stored in local files on each node.
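A sketch of how a sprayed, record-oriented file is declared and rewritten in ECL (layout and logical names are illustrative, assuming a comma-separated source file):

```ecl
// Hypothetical layout for a file sprayed via the DFU; the spray splits
// the file on record boundaries across the cluster's nodes.
LogRec := RECORD
    STRING10 date;
    STRING   message;
END;

logs := DATASET('~thor::example::weblogs', LogRec, CSV(SEPARATOR(',')));

// Each node reads only its local file part; the name-to-part mapping
// is resolved through the Dali server's naming services.
OUTPUT(logs, , '~thor::example::weblogs_copy');
```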
The DFS for both Thor and Roxie stores replicas of file parts on other configurable nodes to protect against disk and node failure. Following a node failure, the Thor system offers either automatic or manual node swap and warm start, and jobs are restarted from the last checkpoint or persist. Replicas are used automatically while data is copied to the replacement node. The Roxie system continues running after a node failure with a reduced number of nodes.
Job Execution Environment
Thor utilizes a master/slave processing architecture. Processing steps defined in an ECL job can specify local (data processed separately on each node) or global (data processed across all nodes) operation. Multiple processing steps for a procedure are executed automatically as part of a single job, based on an optimized execution graph for the compiled ECL dataflow program. A single Thor cluster can be configured to run multiple jobs concurrently, reducing latency if adequate CPU and memory resources are available on each node. Middleware components including the ECL Agent, ECL Server, and Dali server provide the client interface and manage execution of the job, which is packaged as a workunit. Roxie utilizes a multiple Server/Agent architecture to process ECL programs accessed by queries, with a Server task acting as manager for each query and multiple Agent tasks retrieving and processing data for the query as needed.
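The local/global distinction can be sketched in ECL as follows (field names and the logical file are illustrative):

```ecl
// Illustrative record layout and dataset.
SaleRec := RECORD
    UNSIGNED4 custid;
    REAL8     amount;
END;
sales := DATASET('~thor::example::sales', SaleRec, THOR);

// Global operation: data is redistributed and ordered across all nodes.
globalSorted := SORT(sales, custid);

// Local variant: distribute records by a hash of the key first, then
// each node sorts only its own file part, independently of the others.
byCust      := DISTRIBUTE(sales, HASH32(custid));
localSorted := SORT(byCust, custid, LOCAL);
```

The local form avoids a cluster-wide merge when the preceding DISTRIBUTE already guarantees that all records for a given key reside on the same node.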
ECL is the primary programming language for the HPCC environment. ECL is compiled into optimized C++ which is then compiled into DLLs for execution on the Thor and Roxie platforms. ECL can include inline C++ code encapsulated in functions. External services can be written in any language and compiled into shared libraries of functions callable from ECL. A pipe interface allows execution of external programs written in any language to be incorporated into jobs.
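A minimal sketch of the inline C++ and pipe facilities mentioned above (the function, record layout, and use of `/bin/cat` as the external program are illustrative):

```ecl
// Inline C++ encapsulated in an ECL function via BEGINC++ ... ENDC++.
UNSIGNED4 AddOne(UNSIGNED4 n) := BEGINC++
    return n + 1;
ENDC++;

// PIPE streams the dataset through an external program; '/bin/cat'
// is a trivial stand-in for any external tool.
TextRec := RECORD
    STRING line;
END;
inData := DATASET([{'hello'}, {'world'}], TextRec);
piped  := PIPE(inData, '/bin/cat', TextRec, CSV);

OUTPUT(AddOne(41));
OUTPUT(piped);
```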
The HPCC platform includes the capability to build multi-key, multivariate indexes on DFS files. These indexes can be used to improve performance and provide keyed access for batch jobs on a Thor system, or to support development of queries deployed to Roxie systems. Keyed access to data is supported directly in the ECL language.
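Building and using such an index might look like the following sketch (record layout and logical names are hypothetical):

```ecl
// Hypothetical base dataset.
PersonRec := RECORD
    UNSIGNED4 id;
    STRING30  name;
END;
persons := DATASET('~thor::example::persons', PersonRec, THOR);

// Index keyed on id with name carried as payload; built on Thor,
// then deployable to a Roxie cluster to serve queries.
personIdx := INDEX(persons, {id}, {name}, '~thor::example::persons.idx');
BUILD(personIdx);

// Keyed access directly in ECL: the filter is resolved via the index.
OUTPUT(personIdx(id = 42));
```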
Online Query and Data Warehouse Capabilities
The Roxie system configuration in the HPCC platform is specifically designed to provide data warehouse capabilities for structured queries and data analysis applications. Roxie is a high-performance platform capable of supporting thousands of concurrent users while providing sub-second response times, depending on the application.
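A query published to Roxie is itself an ECL program; the sketch below (index name, layout, and the `searchId` parameter are hypothetical) shows the typical shape, where a STORED value becomes a query parameter supplied by the caller:

```ecl
// Standalone index declaration referencing an index built on Thor
// and deployed to the Roxie cluster.
personIdx := INDEX({UNSIGNED4 id}, {STRING30 name},
                   '~thor::example::persons.idx');

// STORED exposes this definition as a named query parameter.
UNSIGNED4 searchId := 0 : STORED('searchId');

// At runtime the Server task manages the query and Agent tasks
// retrieve matching index entries from each node's file parts.
OUTPUT(personIdx(id = searchId));
```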