1. Introduction
1.2. What is IBM Big Replicate?
IBM Big Replicate is a software application that allows Hadoop deployments to replicate HDFS data between Hadoop clusters that run different, even incompatible, versions of Hadoop. It can also replicate between different vendor distributions and versions of Hadoop.
1.2.1. Benefits
-
Virtual File System for Hadoop, compatible with all Hadoop applications.
-
Single, virtual Namespace that integrates storage from different types of Hadoop, including CDH, HDP, EMC Isilon, Amazon S3/EMRFS and MapR.
-
Storage can be globally distributed.
-
WAN replication using WANdisco's patented active-active replication technology, delivering single-copy consistent HDFS data, replicated between far-flung data centers.
1.3. Using this guide
This guide describes how to install and administer IBM Big Replicate as part of a multi data center Hadoop deployment, using either on premises or cloud-based clusters. We break down the guide into the following three sections:
- Deployment Guide
-
This section describes the evaluation and review process, through to the actual software installation. Use the Deployment Guide to get set up. If you need to make changes to your platform, recheck the Deployment Checklist to ensure that you're not going to impact Hadoop data replication.
- User Guide
-
This section describes all the common actions and procedures that are required as part of managing IBM Big Replicate in a deployment. It covers how to work with the UI's monitoring and management tools. Use the User Guide if you need to know how to do something.
- Reference Guide
-
This section describes the UI, systematically covering all screens and providing an explanation for what everything does. Use the Reference Guide if you need to check what something does on the UI, or gain a better understanding of IBM Big Replicate’s underlying architecture.
1.4. Admonitions
In this guide we highlight types of information using the following callouts:
The alert symbol highlights important information. |
The STOP symbol cautions you against doing something. |
Tips are principles or practices that you’ll benefit from knowing or using. |
The KB symbol shows where you can find more information, such as in our online Knowledge Center. |
1.5. Get support
See our online Knowledge Center which contains updates and more information.
2. Release Notes
2.1. Release 2.1.0
May 2017
Big Replicate 2.1.0 includes significant new product functionality that leverages the Big Replicate architecture to support a broader range of use cases, expand performance and scale, and ease the administration of Big Replicate environments.
2.2. Installation
Find detailed installation instructions in the user guide at Installation checklist.
2.3. New Feature Highlights
This release includes the following major new features.
2.3.1. IBM Big Replicate for Network File Systems
Big Replicate 2.1.0 adds support for replicating data efficiently from Network File System (NFS) storage on NetApp devices to any mix of on-premises and cloud environments. This feature allows data replication at any scale from NFS to other Big Replicate zones.
2.3.2. User Interface
The Big Replicate user interface now presents a logical view of the Big Replicate operational components, Big Replicate zones and bandwidth limit policies in place of the physical map of locations. This makes it easier to observe the deployment of complex solutions and navigate directly to detailed views of individual item configuration.
2.3.3. Client Bypass
An improvement has been made to the mechanism used by the HDFS and HCFS client library to detect when a Big Replicate server is unavailable. This allows clients to bypass the Big Replicate server when needed without waiting for a TCP connection loss or timeout.
2.3.4. Replication of Read-Only Locations
Big Replicate 2.1.0 can be configured to replicate from storage system locations that do not provide write access for the identity used by the Big Replicate server.
2.3.5. S3 Enhancements
Big Replicate configuration options now include custom S3 endpoints so that replication can occur to non-AWS S3 providers. Additionally, when Big Replicate is hosted in AWS EC2, replication can occur to an S3 endpoint that is in a region other than where the Big Replicate services reside.
2.3.6. Repair Features and Improvements
The Big Replicate repair feature allows the transfer of initial content between Big Replicate zones that have not previously replicated, and can be used as a mechanism to perform once-off replication that remains consistent with other replication activity. Repair has been enhanced significantly in Big Replicate 2.1.0, including the following:
Auto-Parallelization of Repair
Big Replicate repair functionality has been extended with major improvements in performance by automatically scaling a single repair task across multiple threads of execution. This removes the need to issue multiple repair requests for subsets of a replicated location. It also allows the threads used for repair to be tuned independently of those used for consensus-driven activity on replicated content.
Checkpoint Repair
When initiating a repair task for initial data transfer or similar, you now have the option of selecting a checkpoint repair. This avoids the need for Big Replicate to scan the file system of the originating zone under the repair path to determine content. Checkpoint repair refers to content from an HDFS fsimage file, avoiding the need to lock other operations during a repair scan.
2.3.7. Consistency Check Features and Improvements
2.3.8. User Interface Security
The IBM Big Replicate user interface can now be accessed over HTTPS, and that configuration can be performed independently of other SSL configuration.
2.3.9. Relocatable Installation
You can choose to install Big Replicate 2.1.0 in a location other than the default /opt/wandisco.
2.3.10. Network Support for Firewalled Big Replicate Zones
Big Replicate 2.1.0 can operate in an environment where one Big Replicate zone does not allow inbound network connectivity. This is typical for a secured on-premises deployment, where it may be difficult to modify or establish corporate firewall rules to allow inbound TCP connections to the Big Replicate services.
2.3.11. ACL Replication
ACL replication can be enabled to allow changes from local- and remote-originated zones to be replicated. ACL information will be represented in consistency check results as appropriate.
2.3.12. Enhanced Logging
Among a range of minor improvements to logged information, Big Replicate 2.1.0 adds the ability to log the identity of the proxy user for which requests are made.
2.3.13. Manual Fast Bypass
This feature introduces a mechanism to quickly prevent applications from using Big Replicate when interacting with the underlying file system, without the need to make configuration changes. The fusion.replicated.dir.exchange configuration property in core-site.xml specifies the location under which a directory named bypass can be created to trigger this. Subsequent client activity in that cluster will bypass coordination through Big Replicate.
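For example, a minimal sketch of toggling the bypass from the command line, assuming fusion.replicated.dir.exchange is set to the hypothetical path /wandisco/exchange:

hdfs dfs -mkdir /wandisco/exchange/bypass    # clients in this cluster now bypass Big Replicate coordination
hdfs dfs -rm -r /wandisco/exchange/bypass    # remove the directory to restore coordination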
2.3.14. API to Track Completion of Transfers for a Specified Location
The API to track the status of transfers under a replicated directory now allows that tracking to be limited to a sub-directory of a replicated location.
2.3.15. Installation without Root Identity
Big Replicate 2.1.0 can be installed as a non-root user with sufficient permissions (sudo tar, sudo ambari-server, sudo cp).
2.3.16. Shadow Client JAR
The Big Replicate 2.1.0 client library for HDFS and HCFS compatibility ensures that classpath conflicts do not occur with any client application, allowing Big Replicate to be used by applications that use alternative versions of the Guava and Netty libraries.
2.3.17. Unsidelining
Periods of extended network outage between Big Replicate zones can be accommodated by limits that allow Big Replicate servers to identify a sidelined node, ensuring that operation of other nodes can continue in its absence. Prior to this release, bringing a sidelined node back into operation was a completely manual process. Big Replicate 2.1.0 adds a mechanism by which sidelined nodes can be recovered and participate in ongoing activity.
2.3.18. Operation as an HDFS Non-Superuser
To support operation in environments where minimal security privileges must be allocated, the Big Replicate server can now operate as a principal without HDFS superuser privileges.
2.3.19. Selective Replication of open() Requests
A configuration option (fusion.client.coordinate.read) is provided to allow coordination of open() requests; it is false by default.
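A sketch of enabling the option, assuming it is set in core-site.xml alongside the other Big Replicate client properties:

<property>
  <!-- coordinate open() requests between zones; the default is false -->
  <name>fusion.client.coordinate.read</name>
  <value>true</value>
</property>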
2.3.20. Preferred Writer Selection
This release provides an API by which a preferred writer node can be specified for a given replicated path. The writer node is the Big Replicate server instance responsible for executing modifications to the local zone’s metadata via the file system API.
2.3.21. Grace Period for License Expiry
License expiration allows continued operation for a short grace period (by default one month for production licenses), during which notifications are presented to the administrator about the license expiry. This is in addition to the existing warnings provided prior to license expiration.
Additionally, license expiry does not halt operation of the Big Replicate server, which remains available to service activities that occur in non-replicated locations.
2.4. Available Packages
This release of IBM Big Replicate supports the following versions of Hadoop:
-
CDH 5.2.0 - CDH 5.10.0
-
HDP 2.1.0 - HDP 2.5.0
-
MapR 4.0.1 - MapR 5.2.0
-
Pivotal HD 3.0.0 - 3.4.0
-
IOP (BigInsights) 2.1.2 - 4.2
The trial download includes the installation packages for CDH and HDP distributions only.
2.5. System Requirements
Before installing, ensure that your systems, software and hardware meet the requirements found in the Deployment guide.
2.5.1. Certified Third-Party Components
IBM certifies the interoperability of Big Replicate with a wide variety of systems, including Hadoop distributions, object storage platforms, cloud environments, and applications.
For information on the data source and data target compatibility supported on Big Replicate, see Big Replicate data compatibility.
2.5.2. Client Applications Supported
IBM Big Replicate is architected for maximum compatibility and interoperability with applications that use standard Hadoop File System APIs. All applications that use the standard Hadoop Distributed File System API or any Hadoop-Compatible File System API should be interoperable with IBM Big Replicate, and will be treated as supported applications. Additionally, Big Replicate supports the replication of content with Amazon S3 and S3-compatible object stores, locally-mounted file systems, and NetApp NFS devices, but does not require or provide application compatibility libraries for these storage services.
2.6. Known Issues
Big Replicate 2.1.0 includes a small set of known issues with workarounds. In each case, resolution for the known issues is underway.
-
Renaming the parent directory of a location with current file transfers may result in incomplete transfer - FUS-387.
In some circumstances, modification of the metadata for a parent directory within a replicated location can prevent the completion of content transfer that is underway for files underneath that directory. Big Replicate’s metadata consistency is unaffected, but file content may not be available in full. Consistency check and repair can be used to both detect and resolve any resulting missing content.
-
Metadata change following move of file from non-replicated to replicated location may be overwritten - FUS-3433
Under certain conditions, a metadata modification to a file that has recently been moved from a non-replicated to a replicated location may be lost. Consistency check and repair can be used to both detect and resolve the resulting inconsistency.
-
Big Replicate does not support truncate command - FUS-3022
The public boolean truncate(Path f, long newLength) operation in org.apache.hadoop.fs.FileSystem (Hadoop 2.7.0 and later) is not yet supported. Files will be truncated only in the cluster where the operation is initiated. Consistency check and repair can be used to both detect and resolve any resulting inconsistencies.
-
Big Replicate does not support the concat() operation - FUS-3714
The public void concat(Path trg, Path[] psrcs) operation in org.apache.hadoop.fs.FileSystem is not yet supported, and will result in filesystem inconsistency. Consistency check and repair can be used to both detect and resolve any resulting inconsistencies.
-
Consistency check will not be marked as done when initiated from a non-writer node - FUS-2675
While a consistency check initiated via the API at a non-writer node will execute and complete, its status will not be marked as such. The workaround is to ensure that consistency check operations are only initiated at writer nodes.
2.7. Other Improvements
In addition to the highlighted features listed above, Big Replicate 2.1.0 includes a wide set of improvements in performance, functionality, scale, interoperability and general operation.
-
Parallel repair functionality avoids duplicate repair activity - FUS-3073
-
Correction to handling of specific path names to avoid issues with Hive replication - FUS-3543
-
Stack installer does not access non-initialized variables (fix for install on Oracle Enterprise Linux) - FUS-3551
-
Installation completes with WebHDFS disabled - FUS-3555
-
/fusion/fs no longer returns 500 response when adding removed replicated location - FUS-2148
-
Talkback does not attempt to ssh to KDC as root user - FUS-3192
-
Consistency check tasks can be cancelled - FUS-3053
-
service fusion-server restart displays success - FUS-3193
-
Installer supports configuration changes needed for SOLR - FUS-3200
-
Client library no longer conflicts with user jars - FUS-3372, FUS-3407
-
CDH parcel upgrade performed for alternatives - FUS-3418
-
IHC SSL configuration no longer in core-site.xml - FUS-2828
-
MapR 5.2.0 support - FUS-2870
-
Big Replicate UI now applies auth_to_local setting when browsing HDFS - FUI-3995
-
Repair page redesigned to avoid unselectable source of truth - FUI-3759
-
Big Replicate handshake token directory installer input is pre-populated when adding node to an existing zone - FUI-3920
-
UI correctly displays size of replicated folder - FUI-3974/FUI-3995
-
Support for CDH 5.9 - FUI-4084
-
Support for Cloudera Manager 5.9 - FUI-4085
-
Support for CDH and Cloudera Manager 5.10 - FUI-4089
-
Consistency check marked as done when initiated from a non-writer node - FUI-3921/FUS-2675
-
Improved checks for Big Replicate client installation - FUI-3922
-
Install accommodates HIVE_AUX_JARS with single jar - FUS-3438
-
Allow operation with ambari-agent as non-root user - FUS-3211
-
Log proxy.user.name for requests - FUS-3154
-
Improve default exclusion paths for Hive tables - HIVE-310
-
Heap requirements for consistency check now independent of data volume - FUS-2402, FUS-3292
-
Avoid out of memory under failed socket connection scenario - DCO-683
-
Empty metadata content does not result in recursive delete - FUS-3190
-
Correct permission replication for Hive tables - FUS-3095, REPL-16
-
Allow cancellation of repair tasks that are underway - FUS-3052
-
Provide aggregate reporting of repair status across zones - FUS-2823, FUS-2948
-
Integrate with alternative native SSL libraries - FUS-2859
-
Talkback improves host resolution - FUS-3249
-
Service init functions allow AD groups with spaces in name - FUI-4278
-
RPM upgrades do not overwrite logging configuration - FUI-3894
-
Email alert interval polling defaults to 60s - FUI-3768
-
Metastore starts with DBTokenStore configured on CDH 5.5 - HIVE-384, HIVE-389
-
Support replication of external tables via default DSM - HIVE-225, HIVE-284
-
Correct Metastore configuration deployment with multiple nodes - HIVE-299
-
Bypass mechanism for replicated Metastore - HIVE-134
-
Metastore event listener replication - HIVE-222, HIVE-243, HIVE-234, REPL-2, REPL-7
-
WD Hive Metastore service status in Cloudera Manager - HIVE-257
-
Correct Hive installation on RHEL 7 - HIVE-261
-
Improve installation of Hive for HDP configuration - HIVE-296
-
Stack removal for Hive improved - HIVE-307
-
Standardized Java detection - FUS-2479, FUI-3165, HIVE-327
-
Hive support for CDH 5.9 - HIVE-356
-
Hive support for CDH 5.10 - HIVE-257
-
Correct permissions on /tmp/wd-hive-metrics.log et al. - HIVE-392
-
Sidelined DSMs no longer trigger re-elections - FUS-3083
-
fusion.ssl.enabled property renamed to fusion.client.ssl.enabled - FUS-3013
-
Additional properties for S3 configuration - FUS-3513
-
Client requests to sidelined DSM no longer retry - FUS-3003, FUS-2927, FUS-3051, FUS-3299
-
HttpFS classpath corrections - FUS-3201
3. Deployment Guide
3.1. Big Replicate server requirements
This section describes hardware requirements for deploying Hadoop using Big Replicate. These are guidelines that provide a starting point for setting up data replication between your Hadoop clusters.
Glossary: We'll be using terms that relate to the Hadoop ecosystem, Big Replicate and DConE replication technology. If you encounter any unfamiliar terms, check out the Glossary. |
- Big Replicate UI
-
A separate server that provides administrators with a browser-based management console for each Big Replicate server. This can be installed on the same machine as Big Replicate’s server or on a different machine within your data center.
- IHC Server
-
Inter Hadoop Communication servers handle the traffic that runs between zones or data centers that use different versions of Hadoop. IHC Servers are matched to the version of Hadoop running locally. It’s possible to deploy different numbers of IHC servers at each data center, additional IHC Servers can form part of a High Availability mechanism.
Big Replicate servers don't need to be collocated with IHC servers. If you deploy using the installer, both the Big Replicate and IHC servers are installed on the same system by default. This configuration is made for convenience, but they can be installed on separate systems. This is recommended if your servers don't have the recommended amount of system memory. |
- Big Replicate Client
-
Client jar files to be installed on each Hadoop client, such as mappers and reducers that are connected to the cluster. The client is designed to have a minimal memory footprint and impact on CPU utilization.
Big Replicate must not be collocated with HDFS servers (DataNodes, etc.). HDFS's default block placement policy dictates that if a client is collocated on a DataNode, then that DataNode will receive one block of whatever file is being put into HDFS from that client. This means that if the Big Replicate server (through which all transfers go) is collocated on a DataNode, then all incoming transfers will place one block onto that DataNode. That DataNode is then likely to consume a lot of disk space in a transfer-heavy cluster, potentially forcing the Big Replicate server to shut down in order to keep the Prevaylers from getting corrupted. |
3.2. Licensing
Big Replicate includes a licensing model that can limit operation based on time, the number of nodes, and the volume of data under replication. IBM generates a license file matched to your agreed usage model. You need to renew your license if you exceed these limits or if your license period ends. See License renewals.
3.2.1. License Limits
When your license limits are exceeded, Big Replicate will operate in a limited manner, but allows you to apply a new license to bring the system back to full operation. Once a license is no longer valid:
-
Write operations to replicated locations are blocked,
-
Warnings and notifications related to the license expiry are delivered to the administrator,
-
Replication of data will no longer occur,
-
Consistency checks and repair operations are not allowed, and
-
Operations for adding replication rules and memberships will be denied.
Each different type of license has different limits.
Evaluation license
To simplify the process of pre-deployment testing, Big Replicate is supplied with an evaluation license (also known as a "trial license"). This type of license imposes the following limits:
Source | Time limit | No. Big Replicate servers | No. of Zones | Replicated Data | Plugins | Specified IPs |
---|---|---|---|---|---|---|
Website | 14 days | 1-2 | 1-2 | 5TB | No | No |
Production license
Customers entering production need a production license file for each node. These license files are tied to the node's IP address. In the event that a node needs to be moved to a new server with a different IP address, customers should contact IBM's support team and request that a new license be generated. Production licenses can be set to expire or they can be perpetual.
Source | Time limit | No. Big Replicate servers | No. of Zones | Replicated Data | Plugins | Specified IPs |
---|---|---|---|---|---|---|
IBM | variable (default: 1 year) | variable (default: 20) | variable (default: 10) | variable (default: 20TB) | Yes | Yes |
3.3. Prerequisites Checklist
The following prerequisites checklist applies to both the Big Replicate server and to separate IHC servers. We recommend that you deploy on physical hardware rather than on a virtual platform; however, there is no reason why you can't deploy in a virtual environment.
3.3.1. Scaling a deployment
The amount of Big Replicate you need to deploy is proportionate neither to the amount of data stored in your clusters nor to the number of nodes in your clusters. You deploy Big Replicate/IHC server nodes in proportion to the data traffic between clusters: the more data traffic you need to handle, the more resources you need to put into the Big Replicate server software.
If you plan to locate both the Big Replicate and IHC servers on the same machine then check the Collocated Server requirements:
- CPUs
-
Small Big Replicate server deployment: 8 cores
Large Big Replicate server deployment: 16 cores
Architecture: 64-bit only.
- System memory
-
There are no special memory requirements, except for the need to support a high throughput of data:
Type: Use ECC RAM
Size: 64 GB recommended (minimum of 16 GB)
Small Big Replicate server deployment: 32 GB
Large Big Replicate server deployment: 128 GB
System memory requirements are matched to the expected cluster size and should take into account the number of files and block size. The more RAM you have, the bigger the supported file system, or the smaller the block size.
Collocation of Big Replicate/IHC servers: Both the Big Replicate server and the IHC server are, by default, installed on the same machine, in which case you need to double the minimum memory requirements stated above. E.g.
Size: 64 GB recommended (minimum of 32 GB)
Small Big Replicate server deployment: 64 GB
Large Big Replicate server deployment: 128 GB or more |
- Storage space
-
Type: Hadoop operations are storage-heavy and disk-intensive so we strongly recommend that you use enterprise-class Solid State Drives (SSDs).
Size: Recommended: 1 TiB
Minimum: You need at least 250 GiB of disk space for a production environment.
- Network Connectivity
-
Minimum 1Gb Ethernet between local nodes.
Small IBM Big Replicate server: 2Gbps
Large IBM Big Replicate server: 4x10 Gbps (cross-rack)
TCP Port Allocation: The following default TCP ports need to be reserved for Big Replicate installations:
- Big Replicate Server
-
DConE replication port: 6444
The DConE port handles all coordination traffic that manages replication. It needs to be open between all Big Replicate nodes. Nodes that are situated in zones external to the data center's network will require unidirectional access through the firewall.
Application/REST API: 8082
The REST port is used by the Big Replicate application for configuration and reporting, both internally and via the REST API. The port needs to be open between all Big Replicate nodes and any systems or scripts that interface with Big Replicate through the REST API.
Big Replicate Client port: 8023
Port used by the Big Replicate server to communicate with HCFS/HDFS clients. The port is generally only open to the local Big Replicate server; however, you must make sure that it is open to edge nodes.
Big Replicate Server listening port: 8024
Port used by the Big Replicate server to listen for connections from remote IHC servers. It is only used in unidirectional mode, but it is always opened for listening. Remote IHCs connect to this port if the connection can't be made in the other direction because of a firewall. The SSL configuration for this port is controlled by the same ihc.ssl.enabled property that is used for IHC connections performed from the other side. See Enable SSL for Big Replicate.
IHC ports: 7000-range or 9000-range
The 7000 range (the exact port is determined at installation time based on what ports are available) is used for data transfer between the Big Replicate server and IHC servers. It must be accessible from all Big Replicate nodes in the replicated system.
The 9000 range (the exact port is determined at installation time based on available ports) is used for an HTTP server that exposes JMX metrics from the IHC server.
- Big Replicate UI
-
Web UI interface: 8083
Used to access the Big Replicate Administration UI by end users (requires authentication); also used for inter-UI communication. This port should be accessible from all Big Replicate servers in the replicated system, as well as visible to any part of the network where administrators require UI access.
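As an illustration, the following iptables rules (RHEL 6 syntax) would open the default fixed ports listed above on a Big Replicate node; the IHC ports chosen at installation time from the 7000/9000 ranges would need equivalent rules:

for port in 6444 8023 8024 8082 8083; do
  sudo iptables -A INPUT -p tcp --dport "$port" -j ACCEPT
done
sudo service iptables save    # persist the rules across reboots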
3.3.2. Software requirements
Operating systems:
RHEL 6 x86_64
RHEL 7 x86_64
- Web browsers
-
Mozilla Firefox 11 and higher
Google Chrome
- Java
-
Java JRE 1.7 / 1.8. See Supported versions. Hadoop requires Java JRE 1.7 as a minimum. It is built and tested on Oracle's version of the Java Runtime Environment. We have now added support for OpenJDK 7, which is used in Amazon Cloud deployments. For other types of deployment we recommend running with Oracle's Java as it has undergone more testing.
- Architecture
-
64-bit only
Heap size: Set the Java heap size to a minimum of 1 GB, up to the maximum memory available on your server.
Use a fixed heap size: give -Xms and -Xmx the same value, and make this as large as your server can support.
Avoid Java defaults. Ensure that garbage collection will run in an orderly manner: configure NewSize and MaxNewSize, using 1/10 to 1/5 of the max heap size for JVMs larger than 4 GB. Stay deterministic!
When deploying to a cluster, make sure you have exactly the same version of the Java environment on all nodes.
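As an illustration of the sizing rules above, a fixed 16 GB heap (a hypothetical value; size it to your server) might be expressed with the following JVM flags, passed wherever your deployment supplies Java options to the Big Replicate server:

# Fixed heap: -Xms equals -Xmx; NewSize at roughly 1/8 of the heap,
# within the 1/10 to 1/5 guidance above.
-Xms16g -Xmx16g -XX:NewSize=2g -XX:MaxNewSize=2g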
Where's Java?
Although Big Replicate only requires the Java Runtime Environment (JRE), Cloudera and Hortonworks may install the full Oracle JDK with the high-strength encryption package included. This JCE package is a requirement for running Kerberized clusters.
For good measure, remove any JDK 6 that might be present in /usr/java. Make sure that /usr/java/default and /usr/java/latest point to a Java 7 instance; your Hadoop manager should install this.
Ensure that you set the JAVA_HOME environment variable for the root user on all nodes. Remember that, on some systems, invoking sudo strips environment variables, so you may need to add JAVA_HOME to sudo's list of preserved variables.
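For example, adding the following line to /etc/sudoers (via visudo) preserves JAVA_HOME across sudo invocations:

Defaults env_keep += "JAVA_HOME"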
Due to a bug in JRE 7, you should not run FINER level logging for javax.security.sasl if you are running on JDK 7. Doing so may result in an NPE. You can guard against the problem by locking down logging with the addition of the following line in Big Replicate's logger.properties file (in /etc/fusion/server): |
`javax.security.sasl.level=INFO`
The problem has been fixed in JDK 8 (FUS-1946).
Due to a separate bug in JDK 8 prior to 8u60, replication throughput with SSL enabled can be extremely slow (less than 4 MB/sec). This is down to an inefficient GCM implementation.
Workaround
Upgrade to Java 8u60 or greater, or ensure Big Replicate is able to make use of OpenSSL libraries instead of the JDK. Requirements for this can be found at http://netty.io/wiki/requirements-for-4.x.html (FUS-3041).
ulimit -u && ulimit -n
-u The maximum number of processes available to a single user.
-n The maximum number of open file descriptors.
For optimal performance, we recommend both hard and soft limits values to be set to 64000 or more:
RHEL 6 and later: the file /etc/security/limits.d/90-nproc.conf explicitly overrides the settings in security.conf, i.e.:

# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.
*          soft    nproc     1024    <- Increase this limit or ulimit -u will be reset to 1024
Ambari/Pivotal HD and Cloudera Manager will set various ulimit entries; you must ensure hard and soft limits are set to 64000 or higher. Check with the ulimit or limit command. If the limit is exceeded the JVM will throw an error: java.lang.OutOfMemoryError: unable to create new native thread.
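For example, to apply the recommended values for the user running Big Replicate (hdfs here), an override file such as /etc/security/limits.d/90-nproc.conf could contain:

hdfs    soft    nproc     64000
hdfs    hard    nproc     64000
hdfs    soft    nofile    64000
hdfs    hard    nofile    64000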
- Additional requirements
-
iptables: Use the following procedure to temporarily disable iptables during installation:
3.3.3. RedHat 7
-
Turn off with
$ sudo systemctl disable firewalld
-
Reboot the system.
-
On completing installation, re-enable with
$ sudo systemctl enable firewalld
Comment out requiretty in /etc/sudoers
The installer's use of sudo won't work with some Linux distributions (e.g. CentOS, where /etc/sudoers enables requiretty, so that sudo can only be invoked from a logged-in terminal session, not through cron or a bash script). When it is enabled the installer will fail with an error:

execution refused with "sorry, you must have a tty to run sudo" message

Ensure that requiretty is commented out:

# Defaults requiretty
- SSL encryption
-
Basics
Big Replicate supports SSL for any or all of the three channels of communication: Big Replicate Server - Big Replicate Server, Big Replicate Server - Big Replicate Client, and Big Replicate Server - IHC Server.
keystore
A keystore (containing a private key / certificate chain) is used by an SSL server to encrypt the communication and create digital signatures.
truststore
A truststore is used by an SSL client for validating certificates sent by other servers. It simply contains certificates that are considered "trusted". For convenience you can use the same file as both the keystore and the truststore; you can also use the same file for multiple processes.
- Enabling SSL
-
You can enable SSL during installation (Step 4 Server) or through the SSL Settings screen, selecting a suitable Big Replicate HTTP Policy Type. It is also possible to enable SSL through a manual edit of the application.properties file. We don’t recommend using the manual method, although it is available if needed: Enable HTTPS.
Due to a bug in JDK 8 prior to 8u60, replication throughput with SSL enabled can be extremely slow (less than 4MB/sec). This is down to an inefficient GCM implementation. |
Workaround
Upgrade to Java 8u60 or greater, or ensure Big Replicate is able to make use of OpenSSL libraries instead of the JDK. Requirements for this can be found at http://netty.io/wiki/requirements-for-4.x.html (FUS-3041).
3.3.4. Supported versions
This table shows the versions of Hadoop and Java that we currently support:
Distribution: | Console: | JRE: |
---|---|---|
Apache Hadoop 2.5.0 | | Oracle JDK 1.7 / 1.8 or OpenJDK 7 |
HDP 2.1 / 2.2 / 2.3 / 2.4 | Ambari 1.6.1 / 1.7 / 2.1 | Oracle JDK 1.7 / 1.8 or OpenJDK 7 |
CDH 5.2.0 / 5.3.0 / 5.4 / 5.5 / 5.6 / 5.7 / 5.8 | Cloudera Manager 5.3.x, 5.4.x, 5.5.x, 5.6.x, 5.7.x and 5.8.x | Oracle JDK 1.7 / 1.8 or OpenJDK 7 |
Pivotal HD 3.0, 3.4 | Ambari 1.6.1 / 1.7 | Oracle JDK 1.7 / 1.8 or OpenJDK 7 |
MapR 4.0.x, 4.1.0, 5.0.0 | Ambari 1.6.1 / 1.7 | Oracle JDK 1.7 / 1.8 or OpenJDK 7 |
3.3.5. Supported applications
Supported Big Data applications may be noted here as we complete testing:

Application: | Version Supported: | Tested with: |
---|---|---|
Syncsort DMX-h | 8.2.4 | See Knowledge base |
3.3.6. Final Preparations
We’ll now look at what you should know and do as you begin the installation.
Time requirements
The time required to complete a deployment of Big Replicate will depend in part on its size; larger deployments with more nodes and more complex replication rules will take correspondingly more time to set up. Use the guide below to help you plan for deployments.
-
Run through this document and create a checklist of your requirements. (1-2 hours).
-
Complete the Big Replicate installation (about 20 minutes per node, or 1 hour for a test deployment).
-
Complete client installations and complete basic tests (1-2 hours).
Of course, this is a guideline to help you plan your deployment. You should think ahead and determine if there are additional steps or requirements introduced by your organization’s specific needs.
Network requirements
See the deployment checklist for a list of the TCP ports that need to be open for Big Replicate.
3.3.7. Kerberos Security
If you are running Kerberos on your cluster you should consider the following requirements:
-
Kerberos is already installed and running on your cluster
-
Big Replicate-Server is configured for Kerberos as described in Setting up Kerberos.
-
Kerberos Configuration before starting the installation
Before running the installer on a platform that is secured by Kerberos, you’ll need to run through the following steps: Setting up Kerberos.
Warning about mixed Kerberized / non-Kerberized zones: In deployments that mix Kerberized and non-Kerberized zones, it's possible that permission errors will occur because the different zones don't share the same underlying system superusers. In this scenario you need to ensure that the superuser for each zone is created on the other zones. |
For example, if you connect a zone that runs CDH, which has superuser 'hdfs', with a zone running MapR, which has superuser 'mapr', you would need to create the user 'hdfs' on the MapR zone and 'mapr' on the CDH zone.
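A minimal sketch of that example, assuming plain local accounts are acceptable (use your organization's user-management tooling where appropriate):

sudo useradd hdfs    # run on each node in the MapR zone
sudo useradd mapr    # run on each node in the CDH zone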
Kerberos Relogin Failure with Hadoop 2.6.0 and JDK7u80 or later: Hadoop Kerberos relogin fails silently due to HADOOP-10786. This impacts Hadoop 2.6.0 when JDK7u80 or later is used (including JDK8). Users should downgrade to JDK7u79 or earlier, or upgrade to Hadoop 2.6.1 or later. |
Manual instructions
See the Knowledge Base for instructions on setting up manual Kerberos settings. You only need these in special cases as the steps have been handled by the installer. See Manual Updates for Big Replicate UI Configuration.
The Knowledge Base also includes instructions on setting up auth-to-local permissions, which map a Kerberos principal onto a local system user. See the KB article - Setting up Auth-to-local.
3.3.8. Clean Environment
Before you start the installation you must ensure that there are no existing Big Replicate installations or Big Replicate components installed on your elected machines. If you are about to upgrade to a new version of Big Replicate you must first make sure that you run through the removal instructions provided in the Appendix - Cleanup Big Replicate.
Ensure HADOOP_HOME is set in the environment: Where the hadoop command isn't in the standard system path, administrators must ensure that the HADOOP_HOME environment variable is set for the root user and for the user Big Replicate will run as, typically hdfs. When set, HADOOP_HOME must be the parent of the bin directory into which the Hadoop scripts are installed. Example: if the hadoop command is: |
/opt/hadoop-2.6.0-cdh5.4.0/bin/hadoop
then HADOOP_HOME must be set to /opt/hadoop-2.6.0-cdh5.4.0/.
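Matching that example, you could set the variable as follows:

export HADOOP_HOME=/opt/hadoop-2.6.0-cdh5.4.0
export PATH="$HADOOP_HOME/bin:$PATH"    # optional: put the hadoop command on the path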
3.3.9. Installer File
You need to match IBM’s Big Replicate installer file to each data center’s version of Hadoop. Installing the wrong version of Big Replicate will result in the IHC servers being misconfigured.
Why installation requires root user
Big Replicate core and Big Replicate UI packages are installed using root permissions, using the RPM tool (or the equivalent for .deb packages). RPM requires root to run - hence the need for the permissions. The main requirement for running as root is the need for the installer to create the folder structure for Big Replicate components, e.g. under /opt/wandisco.
Once all files are put into place, they are permissioned and owned by a specific Big Replicate user. After the installation of the artifacts, root is not used and the Big Replicate processes themselves run as a specific Big Replicate user (usually "hdfs"). |
3.4. SLM Tagging
IBM Big Replicate generates Software License Metric (SLM) tags on a periodic basis to provide auditable records of licenseable use during operation.
3.4.1. Tag Generation
These tags are appended to /var/ibm/slmtags every 60 minutes; the frequency of generation can be modified through the ibm.slm.tag.freq configuration property.
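For example, to generate tags every 30 minutes (the unit of ibm.slm.tag.freq is assumed here to be minutes; confirm against your configuration reference):

ibm.slm.tag.freq=30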
3.4.2. Sample tags
An example of the generated data:
<SchemaVersion>2.1.1</SchemaVersion>
<SoftwareIdentity>
  <PersistentId>a9520f9da8684344a184545d46401434</PersistentId>
  <Name>IBM Big Replicate</Name>
  <InstanceId>/opt/wandisco/fusion-ui-server/properties/version</InstanceId>
</SoftwareIdentity>
<Metric logTime="2016-06-07T14:12:54+00:00">
  <Type>USER_VALUE_UNIT</Type>
  <SubType>FUSION_ZONES</SubType>
  <Value>1</Value>
  <Period>
    <StartTime>2016-06-07T14:12:54+00:00</StartTime>
    <EndTime>2016-06-07T14:12:54+00:00</EndTime>
  </Period>
</Metric>
<Metric logTime="2016-06-07T14:12:54+00:00">
  <Type>USER_VALUE_UNIT</Type>
  <SubType>FUSION_NODES</SubType>
  <Value>1</Value>
  <Period>
    <StartTime>2016-06-07T14:12:54+00:00</StartTime>
    <EndTime>2016-06-07T14:12:54+00:00</EndTime>
  </Period>
</Metric>
<Metric logTime="2016-06-07T14:12:54+00:00">
  <Type>USER_VALUE_UNIT</Type>
  <SubType>FUSION_TRANSFERS_TOTAL_SIZE</SubType>
  <Value>1988498661</Value>
  <Period>
    <StartTime>2016-06-07T14:12:54+00:00</StartTime>
    <EndTime>2016-06-07T14:12:54+00:00</EndTime>
  </Period>
</Metric>
3.4.3. Tag Interpretation
IBM Big Replicate uses specific SubType metric entries that should be interpreted as follows:
- FUSION_ZONES
-
The number of Big Replicate replication zones in operation. A zone consists of a single underlying file system (or file system equivalent) that participates in replication, and will have 1 or more Big Replicate nodes servicing replication for that zone.
- FUSION_NODES
-
The number of Big Replicate nodes in operation. A node consists of a single Big Replicate server process, and each zone will have 1 or more Big Replicate nodes servicing replication.
- FUSION_TRANSFERS_TOTAL_SIZE
-
The total size in bytes of data that has been transferred between Big Replicate zones.
4. Installation
This section will run through the installation of Big Replicate from the initial steps where we make sure that your existing environment is compatible, through the procedure for installing the necessary components and then finally configuration.
- Deployment Checklist
-
Important hardware and software requirements, along with considerations that need to be made before starting to install Big Replicate.
- Final Preparations
-
Things that you need to do immediately before you start the installation.
- Starting the installer
-
Step by step guide to the installation process when using the unified installer. For instructions on completing a fully manual installation see On-premises Installation.
- Configuration
-
Runs through the changes you need to make to start Big Replicate working on your platform.
- Working in the Hadoop ecosystem
-
Necessary steps for getting Big Replicate to work with supported Hadoop applications.
- Deployment appendix
-
Extras that you may need that we didn’t want cluttering up the installation guide.
4.1. On premises installation
The following section covers the installation of IBM Big Replicate into a cluster that is based in your organization’s own premises.
Installation via sudo-restricted non-root user
In some deployments it may not be permitted to complete the installation as the root user. It should be possible to complete an installation with a limited set of sudo commands.
|
4.1.1. Starting the installation
Use the following steps to complete an installation using the installer file. This requires an administrator to enter details throughout the procedure. Once the initial settings are entered through the terminal session, the installation is completed through a browser or, alternatively, using a Silent Installation option to handle configuration programmatically.
-
Open a terminal session on your first installation server. Copy the Big Replicate installer script into a suitable directory.
-
Make the script executable, e.g.
chmod +x fusion-ui-server-<version>_rpm_installer.sh
-
Execute the file with root permissions, e.g.
sudo ./fusion-ui-server-<version>_rpm_installer.sh
-
The installer will now start.
Verifying archive integrity... All good.
Uncompressing IBM Big Replicate..............................
Welcome to the IBM Big Replicate installation
You are about to install IBM Big Replicate version 2.4-206
Do you want to continue with the installation? (Y/n) y
The installer will perform an integrity check, confirm the product version that will be installed, then invite you to continue. Enter "Y" to continue the installation. The installer performs some basic checks and lets you modify the Java heap settings. The heap settings apply only to the Big Replicate UI.
Checking prerequisites:
Checking for perl: OK
Checking for java: OK
INFO: Using the following Memory settings:
INFO: -Xms128m -Xmx512m
Do you want to use these settings for the installation? (Y/n) y
The installer checks for Perl and Java. See the Installation Checklist Java Requirements for more information about these requirements. Enter "Y" to continue the installation.
-
Next, confirm the port that will be used to access Big Replicate through a browser.
Which port should the UI Server listen on? [8083]:
-
Select the Hadoop version and type from the list of supported platforms:
Please specify the appropriate backend from the list below:
[0] cdh-5.2.0
[1] cdh-5.3.0
[2] cdh-5.4.0
[3] cdh-5.5.0
[4] cdh-5.6.0
[5] cdh-5.7.0
[6] cdh-5.8.0
Which Big Replicate backend do you wish to use? 5
You chose cdh-5.7.0:5.7.0.2.3.0.0-2557
MapR/Pivotal availability
The MapR/PHD versions of Hadoop have been removed from the trial version of Big Replicate in order to reduce the size of the installer for most prospective customers. These versions are run by a small minority of customers, while their presence nearly doubled the size of the installer package. Contact IBM if you need to evaluate Big Replicate running with MapR or PHD.
Additional available packages:
[1] mapr-4.0.1
[2] mapr-4.0.2
[3] mapr-4.1.0
[4] mapr-5.0.0
[5] phd-3.0.0 |
MapR requirements
URI: MapR needs to use Big Replicate's native "fusion:///" URI instead of the default hdfs:///. Ensure that during installation you select the Use Big Replicate URI with HCFS file system option.
Superuser
|
-
The installer now confirms which system user/group will be applied to Big Replicate.
We strongly advise against running Big Replicate as the root user. For default HDFS setups, this is set to 'hdfs'. However, you should choose a user appropriate for running HDFS commands on your system.

Which user should Big Replicate run as? [hdfs]
Checking 'hdfs' ...
... 'hdfs' found.

Please choose an appropriate group for your system. By default HDP uses the 'hadoop' group.

Which group should Big Replicate run as? [hadoop]
Checking 'hadoop' ...
... 'hadoop' found.
-
The installer searches for the commonly used account and group, and assigns these by default. Check the summary to confirm that your chosen settings are appropriate:

Installing with the following settings:
User and Group: hdfs:hadoop
Hostname: node04-example.host.com
Big Replicate Admin UI Listening on: 0.0.0.0:8083
Big Replicate Admin UI Minimum Memory: 128
Big Replicate Admin UI Maximum memory: 512
Platform: cdh-5.7.0 (5.7.0.2.3.0.0-2557)
Manager Type: CLOUDERA
Manager Host and Port: :
Big Replicate Server Hostname and Port: node04-example.host.com:8082
SSL Enabled: false
Do you want to continue with the installation? (Y/n) y

You are now given a summary of all the settings provided so far. If these settings are correct, enter "Y" to complete the installation of the Big Replicate server. The package will now install:
Installing hdp-2.1.0 packages:
fusion-hdp-2.1.0-server-2.4_SNAPSHOT-1130.noarch.rpm ... Done
fusion-hdp-2.1.0-ihc-server-2.4_SNAPSHOT-1130.noarch.rpm ... Done
Installing fusion-ui-server package
Starting fusion-ui-server:[ OK ]
Checking if the GUI is listening on port 8083: .....Done
-
The Big Replicate server will now start up:
Please visit http://<YOUR-SERVER-ADDRESS>.com:8083/ to access the IBM Big Replicate
If 'http://<YOUR-SERVER-ADDRESS>.com' is internal or not available from your browser, replace this with an externally available address to access it.
Installation Complete
[root@node05 opt]#
At this point the Big Replicate server and corresponding IHC server will be installed. The next step is to configure the Big Replicate UI through a browser or using the silent installation script.
4.1.2. Configure Big Replicate through a browser
Follow this section to complete the installation by configuring Big Replicate using a browser-based graphical user interface.
Silent Installation For large deployments
it may be worth using
Silent Installation
option. |
-
Open a web browser and point it at the provided URL, e.g.
http://<YOUR-SERVER-ADDRESS>.com:8083/
-
In the first "Welcome" screen you’re asked to choose between Create a new Zone and Add to an existing Zone.
Figure 2. Welcome - Installer screen
Make your selection as follows:
Adding a new Big Replicate cluster: select Add Zone.
Adding additional Big Replicate servers to an existing Big Replicate cluster: select Add to an existing Zone.
High Availability for Big Replicate / IHC Servers
It's possible to enable High Availability in your Big Replicate cluster by adding additional Big Replicate/IHC servers to a zone. These additional nodes ensure that in the event of a system outage, sufficient Big Replicate/IHC servers remain running to maintain replication. Add HA nodes to the cluster using the installer and choosing Add to an existing Zone, using a new node name.
Configuration for High Availability
When setting up the configuration for a High Availability cluster, ensure that fs.defaultFS, located in core-site.xml, is not duplicated between zones. This property is used to determine whether an operation is being executed locally or remotely; if two separate zones have the same default file system address, problems will occur. Big Replicate should never see the same URI (scheme + authority) for two different clusters.
-
Run through the installer’s detailed Environment checks. For more details about exactly what is checked in this stage, see Environmental Checks in the Appendix.
Figure 3. Installer screen
-
On clicking Validate, the installer will run through a series of checks of your system's hardware and software setup and warn you if any of Big Replicate's prerequisites are missing.
Figure 4. Validation results
Any element that fails the check should be addressed before you continue the installation. Warnings may be ignored for the purposes of completing the installation, especially if the installation is only for evaluation purposes and not for production. However, when installing for production, you should address all warnings, or at least take note of them and exercise due care if you continue the installation without resolving and revalidating.
-
Upload the license file.
Figure 5. Installer screen
-
The conditions of your license agreement will be presented in the top panel, including License Type, Expiry date, Name Node Limit and Data Node Limit.
Figure 6. Check screen
-
Read through the EULA. When the scroll bar reaches the bottom you can click I agree to the EULA to continue, then click Next Step.
Figure 7. License screen
-
Enter settings for the Big Replicate server
Figure 8. Server settings
Big Replicate Server
- Fully Qualified Domain Name / IP
-
The full hostname for the server.
- We have detected the following hostname/IP addresses for this machine.
-
The installer will try to detect the server’s hostname from its network settings. Additional hostnames will be listed on a dropdown selector.
- DConE Port
-
TCP port used by Big Replicate for replicated traffic. Validation will check that the port is free and that it can be bound to.
- Big Replicate HTTP Policy Type
-
Sets the policy for communication with the Big Replicate Core Server API.
Select from one of the following policies:
Only HTTP - Big Replicate will not use SSL encryption on its API traffic.
Only HTTPS - Big Replicate will only use SSL encryption for API traffic.
Use HTTP and HTTPS - Big Replicate will use both encrypted and un-encrypted traffic.
Known Issue: Currently, the HTTP policy and SSL settings both independently alter how Big Replicate uses SSL, when they should be linked. You need to make sure that your HTTP policy selection and the use of SSL (enabled in the next section of the installer) are in sync. If you choose either of the policies that use HTTPS, then you must enable SSL. If you stick with "Only HTTP" then you must ensure that you do not enable SSL. In a future release these two settings will be linked so it won't be possible to have contradictory settings. |
- Big Replicate HTTP Server Port
-
The TCP port used for standard HTTP traffic. Validation checks whether the port is free and that it can be bound.
- Maximum Java heap size (GB)
-
Enter the maximum Java heap value for the Big Replicate server.
- Umask (currently 0022)
-
Set the default permissions applied to newly created files. The value 022 results in default directory permissions of 755 and default file permissions of 644. This ensures that the installation will be able to start up/restart.
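The permission arithmetic can be verified from a shell: 0022 masks the group and other write bits, so directories default to 777-022=755 and files to 666-022=644.

umask 0022
mkdir demo_dir && touch demo_file
ls -ld demo_dir demo_file    # drwxr-xr-x ... demo_dir / -rw-r--r-- ... demo_file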
Advanced options
Custom Big Replicate Request Port
You can provide a custom TCP port for the Big Replicate Request Port (also known as the Big Replicate client port). The default value is 8023.
Strict Recovery
Two advanced options are provided to change the way that Big Replicate responds to a system shutdown in which Big Replicate was not shut down cleanly. Currently the default setting is to not enforce a panic event in the logs if, during startup, we detect that Big Replicate wasn't shut down cleanly. This is suitable for using the product as part of an evaluation effort. However, when operating in a production environment, you may prefer to enforce the panic event, which will stop any attempted restarts to prevent possible corruption to the database.
-
DConE panic if db is dirty
This option lets you enable the strict recovery option for IBM’s replication engine, to ensure that any corruption to its prevayler database doesn’t lead to further problems. When the checkbox is ticked, Big Replicate will log a panic message whenever Big Replicate is not properly shutdown, either due to a system or application problem.
-
App panic if db is dirty
This option lets you enable the strict recovery option for Big Replicate’s database, to ensure that any corruption to its internal database doesn’t lead to further problems. When the checkbox is ticked, Big Replicate will log a panic message whenever Big Replicate is not properly shutdown, either due to a system or application problem.
Push Threshold
-
Set threshold manually
Set to blocksize by default. See Set Push Threshold Manually.
If synchronizing with S3 (or another object store), Hadoop clusters that replicate to object stores should set the threshold to off ('0'). |
- Custom UI hostname
-
Lets you set a custom hostname for the Big Replicate UI, distinct from the communication.hostname which is already set as part of the install and used by Big Replicate nodes to connect to the Big Replicate server.
- Custom UI Port
-
Lets you change the Big Replicate UI's default port, in case it is assigned elsewhere; e.g. Cloudera's headlamp debug server also uses it.
- Strict Recovery
-
See explanation of the Strict Recovery Advanced Options.
Only apply these options if you fully understand what they do. The following advanced options provide a number of low-level configuration settings that may be required for installation into certain environments. The incorrect application of some of these settings could cause serious problems, so for this reason we strongly recommend that you discuss their use with IBM's support team before enabling them. |
IHC Server
- Maximum Java heap size (GB)
-
Enter the maximum Java Heap value for the WD Inter-Hadoop Communication server.
- IHC network interface
-
The hostname for the IHC server.
Advanced Options (optional)
IHC server binding address: In the advanced settings you can decide which address the IHC server will bind to. The address is optional; by default the IHC server binds to all interfaces (0.0.0.0), using the port specified in the ihc.server field. In all cases the port should be identical to the port used in the ihc.server address, i.e.
/etc/wandisco/fusion/ihc/server/cdh-5.4.0/2.6.0-cdh5.4.0.ihc or /etc/wandisco/fusion/ihc/server/localfs-2.7.0/2.7.0.ihc
Once all settings have been entered, click Next step.
-
Next, you will enter the settings for your new Zone.
Zone Information
Entry fields for zone properties:
- Fully Qualified Domain Name
-
The full hostname for the server.
- Node ID
-
A unique identifier that will be used by Big Replicate UI to identify the server.
- Location Name (optional)
-
A location name that can quickly identify where the server is located.
Induction failure: If induction fails, attempting a fresh installation may be the most straightforward cure; however, it is possible to push through an induction manually, using the REST API. See Handling Induction Failure. |
Known issue with Location names: You must use different Location names/Node IDs for each zone. If you use the same name for multiple zones, then you will not be able to complete the induction between those nodes. |
- DConE Port
-
TCP port used by Big Replicate for replicated traffic.
- Zone Name
-
The name used to identify the zone in which the server operates.
- Management Endpoint
-
Select the Hadoop manager that you are using, i.e. Cloudera Manager, Ambari or Pivotal HD. The selection will display the entry fields for your selected manager.
Advanced Options
Only apply these options if you fully understand what they do. The following Advanced Options provide a number of low-level configuration settings that may be required for installation into certain environments. The incorrect application of some of these settings could cause serious problems, so for this reason we strongly recommend that you discuss their use with IBM's support team before enabling them. |
URI Selection
The default behavior for Big Replicate is to fix all replication to the Hadoop Distributed File System / hdfs:/// URI. Setting the hdfs scheme provides the widest support for Hadoop client applications, since some applications can't support the "fusion:///" URI; they can only use the HDFS protocol. Each option is explained below:
- Use HDFS URI with HDFS file system
-
The element appears in a radio button selector:
This option is available for deployments where the Hadoop applications support neither the Big Replicate URI nor the HCFS standards. Big Replicate operates entirely within HDFS.
This configuration will not allow paths with the fusion:/// URI to be used; only paths starting with hdfs:/// (or with no scheme) that correspond to a mapped path will be replicated. The underlying file system will be an instance of the HDFS DistributedFileSystem, which will support applications that aren't written to the HCFS specification.
- Use Big Replicate URI with HCFS file system
-
This is the default option that applies if you don't enable Advanced Options, and was the only option in Big Replicate prior to version 2.6. When selected, you need to use fusion:// for all data that must be replicated over an instance of the Hadoop Compatible File System. If your deployment includes Hadoop applications that are either unable to support the Big Replicate URI or are not written to the HCFS specification, this option will not work.
- Use Big Replicate URI with HDFS file system
-
This differs from the default in that while the Big Replicate URI is used to identify data to be replicated, the replication is performed using HDFS itself. This option should be used if you are deploying applications that can support the Big Replicate URI but not the Hadoop Compatible File System. See Benefits of HDFS.
- Use Big Replicate URI and HDFS URI with HDFS file system
-
This "mixed mode" supports all the replication schemes (
fusion://
, hdfs:// and no scheme) and uses HDFS for the underlying file system, to support applications that aren’t written to the HCFS specification.
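For illustration only, here is how the same replicated path would be addressed under each scheme, using the standard Hadoop shell (the path /repl1 is an assumption):
hadoop fs -ls fusion:///repl1/    # Big Replicate URI options
hadoop fs -ls hdfs:///repl1/      # HDFS URI option and mixed mode
hadoop fs -ls /repl1/             # no scheme; HDFS URI option and mixed mode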
4.2. Silent Installation
The " Silent" installation tools are still under development, although, with a bit of scripting, it should now be possible to automate Big Replicate node installation. The following section looks at the provided tools, in the form of a number of scripts, which automate different parts of the installation process.
Client Installations The silent installer does not handle the deployment of client stacks/parcels. Be aware of the following: stacks/parcels must be in place before the silent installer is run, which includes restarting and checking for the parcels on their respective managers. Failure to do so will leave the HDFS cluster without Big Replicate clients while running with a configuration that expects them to be there; this can be fixed by reverting service configs if necessary. See Installing Parcels. See Installing Stacks. |
4.2.1. Overview
The silent installation process supports two levels: Unattended installation handles just the command line steps of the installation, leaving the web UI-based configuration steps in the hands of an administrator. See Unattended Installation.
Fully Automated installation also includes the steps to handle the configuration, without the need for user interaction. See Fully Automated Installation.
4.2.2. Unattended Installation
Use the following command for an unattended installation where an administrator will complete the configuration steps using the browser UI.
sudo FUSIONUI_USER=x FUSIONUI_GROUP=y FUSIONUI_FUSION_BACKEND_CHOICE=z ./fusion-ui-server_rpm_installer.sh
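For example, a run with illustrative values might look like the following sketch (the user, group and backend package here are assumptions; pick the backend that matches your distribution from the list in the next section):
sudo FUSIONUI_USER=hdfs FUSIONUI_GROUP=hadoop FUSIONUI_FUSION_BACKEND_CHOICE=cdh-5.4.0:2.6.0-cdh5.4.0 ./fusion-ui-server_rpm_installer.sh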
4.2.3. Set the environment
There are a number of properties that need to be set up before the installer can be run:
- FUSIONUI_USER
-
User which will run Big Replicate services. This should match the user who runs the hdfs service.
- FUSIONUI_GROUP
-
Group of the user which will run Big Replicate services. The specified group must be one that FUSIONUI_USER is in.
Check FUSIONUI_USER is in FUSIONUI_GROUP Verify that your chosen user is in your selected group, e.g.:
> groups hdfs
hdfs : hdfs hadoop |
- FUSIONUI_FUSION_BACKEND_CHOICE
-
Should be one of the supported package names, as per the following list:
-
cdh-5.2.0:2.5.0-cdh5.2.0
-
cdh-5.3.0:2.5.0-cdh5.3.0
-
cdh-5.4.0:2.6.0-cdh5.4.0
-
cdh-5.5.0:2.6.0-cdh5.5.0
-
hdp-2.1.0:2.4.0.2.1.5.0-695
-
hdp-2.2.0:2.6.0.2.2.0.0-2041
-
hdp-2.3.0:2.7.1.2.3.0.0-2557
-
mapr-4.0.1:2.4.1-mapr-1408
-
mapr-4.0.2:2.5.1-mapr-1501
-
mapr-4.1.0:2.5.1-mapr-1503
-
mapr-5.0.0:2.7.0-mapr-1506
-
phd-3.0.0:2.6.0.3.0.0.0-249
-
emr-4.6.0:2.7.2-amzn-1
-
emr-4.7.1:2.7.2-amzn-2
-
emr-5.0.0:2.7.2-amzn-3
This mode only automates the initial command line installation step; the configuration steps must still be completed manually in the browser.
4.2.4. Fully Automated Installation
This mode is closer to a full "Silent" installation as it handles the configuration steps as well as the installation.
Properties that need to be set:
- SILENT_CONFIG_PATH
-
Path to the file of environment variables used in the command-line driven part of the installation, typically called silent_installer_env.sh.
- SILENT_PROPERTIES_PATH
-
Path to the 'silent_installer.properties' file. This file is parsed during the installation, providing all the remaining parameters required for getting set up. The template is annotated with information to guide you through making the changes that you'll need.
Take note that parameters stored in this file will automatically override any default settings in the installer.
- FUSIONUI_USER
-
User which will run Big Replicate services. This should match the user who runs the hdfs service.
- FUSIONUI_GROUP
-
Group of the user which will run Big Replicate services. The specified group must be one that FUSIONUI_USER is in.
- FUSIONUI_FUSION_BACKEND_CHOICE
-
Should be one of the supported package names, as per the list in the Unattended Installation section above.
- FUSIONUI_UI_HOSTNAME
-
The hostname for the Big Replicate server.
- FUSIONUI_UI_PORT
-
Specify a fusion-ui-server port (default is 8083).
- FUSIONUI_TARGET_HOSTNAME
-
The hostname or IP of the machine hosting the Big Replicate server.
- FUSIONUI_TARGET_PORT
-
The fusion-server port (default is 8082).
- FUSIONUI_MEM_LOW
-
Starting Java Heap value for the Big Replicate server.
- FUSIONUI_MEM_HIGH
-
Maximum Java Heap.
- FUSIONUI_UMASK
-
Sets the default permissions applied to newly created files. The value 022 results in default directory permissions 755 and default file permissions 644. This ensures that the installation will be able to start up/restart.
- FUSIONUI_INIT
-
Sets whether the server will start automatically when the system boots. Set to "1" for yes or "0" for no.
Cluster Manager Variables are deprecated
The cluster manager variables are mostly redundant, as they generally get set in different processes, though they currently remain in the installer code: FUSIONUI_MANAGER_TYPE, FUSIONUI_MANAGER_HOSTNAME, FUSIONUI_MANAGER_PORT.
- FUSIONUI_MANAGER_TYPE
-
"AMBARI", "CLOUDERA", "MAPR" or "UNMANAGED_EMR". This setting can still be used but it is generally set at a different point in the installation now.
- validation.environment.checks.enabled
-
Enables or disables the environment validation checks.
- validation.manager.checks.enabled
-
Enables or disables the manager validation checks. Note that manager validation is currently not available for S3 installs.
- validation.kerberos.checks.enabled
-
Enables or disables the Kerberos validation checks. Note that Kerberos validation is currently not available for S3 installs.
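As a sketch, a minimal silent_installer_env.sh using the properties above might look like the following (all values are illustrative assumptions and must be adjusted for your environment):
export FUSIONUI_USER=hdfs
export FUSIONUI_GROUP=hadoop
export FUSIONUI_FUSION_BACKEND_CHOICE=hdp-2.3.0:2.7.1.2.3.0.0-2557
export FUSIONUI_UI_HOSTNAME=node1.example.com
export FUSIONUI_UI_PORT=8083
export FUSIONUI_TARGET_HOSTNAME=node1.example.com
export FUSIONUI_TARGET_PORT=8082
export FUSIONUI_MEM_LOW=1
export FUSIONUI_MEM_HIGH=4
export FUSIONUI_UMASK=022
export FUSIONUI_INIT=1
export SILENT_PROPERTIES_PATH=/path/to/silent_installer.properties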
If this part of the installation fails it is possible to re-run the silent_installer part of the installation by running:
/opt/wandisco/fusion-ui-server/scripts/silent_installer_full_install.sh /path/to/silent_installer.properties
4.2.5. Uninstall Big Replicate UI only
This procedure is useful for UI-only installations:
sudo yum erase -y fusion-ui-server
sudo rm -rf /opt/wandisco/fusion-ui-server /etc/wandisco/fusion/ui
4.2.6. Uninstall Big Replicate UI, Big Replicate Server and Big Replicate IHC Server (leaving any Big Replicate clients installed)
sudo yum erase -y "fusion-*-server" sudo rm -rf /opt/wandisco/fusion-ui-server /etc/wandisco/fusion/ui
4.2.7. Silent Installation files
For every package of Big Replicate there’s both an env.sh and a .properties file. The env.sh sets environment variables that complete the initial command step of an installation. The env.sh also points to a properties file that is used to automate the browser-based portion of the installer. The properties files for the different installation types are provided below:
- silent_installer.properties
-
properties file for a standard HDFS installation.
- emr_silent_installer.properties
-
properties file for Amazon EMR-based installation.
- s3_silent_installer.properties
-
properties file for Amazon S3-based installation.
- swift_silent_installer.properties
-
properties file for an IBM Swift-based installation.
- azure_silent_installer.properties
-
properties file for Microsoft Azure-based installation.
- google_silent_installer.properties
-
properties file for a Google Cloud-based installation.
4.3. Manual installation
The following procedure covers the hands-on approach to installation and basic setup of a deployment over the LocalFileSystem. For the vast majority of cases you should use the Installer-based LocalFileSystem Deployment procedure instead.
Don’t do it this way unless you have
to. We provide this example to illustrate how a completely
hands-on installation can be performed. We don’t recommend
that you use it for a deployment unless you absolutely can’t
use the installers. Instead, use it as a reference so that you can
see what changes are made by our
installer. |
4.3.1. Non-HA Local filesystem setup
-
Start with the regular Big Replicate setup. You can go through the installation either manually or using the installer.
-
When you select the $user:$group you should pick a master user account that will have complete access to the local directory that you plan to replicate. You can set this manually by modifying /etc/wandisco/fusion-env.sh, setting FUSION_SERVER_GROUP to $group and FUSION_SERVER_USER to $user.
-
Next, you'll need to configure the core-site.xml, typically in /etc/hadoop/conf/, and override "fs.file.impl" to "com.wandisco.fs.client.FusionLocalFs", "fs.defaultFS" to "file:///", and "fusion.underlyingFs" to "file:///". (Make sure to add the usual Big Replicate properties as well, such as "fs.fusion.server".) A sketch of these overrides follows this list.
-
If you are running with the Big Replicate URI (via "fs.fusion.impl"), then you should still set the value to "com.wandisco.fs.client.FusionLocalFs".
-
If you are running with Kerberos then you should also override “fusion.handshakeToken.dir” to point to some directory that will exist within the local directory you plan to replicate to/from. You should also make sure to have “fs.fusion.keytab” and “fs.fusion.principal” defined as usual.
-
Ensure that the local directory you plan to replicate to/from already exists. If not, create it and give it 777 permissions, or create a symlink (locally) that will point to the local path you plan to replicate to/from.
-
For example, if you want to replicate /repl1/ but don't want to create a directory at your root level, you can create a symlink to repl1 at your root level and point it to wherever you want your replicated directory to actually be. If you are using NFS, the symlink should point to the NFS mount, e.g. /mnt/nfs/.
-
Set up an NFS.
Be sure to point your replicated directory to your NFS mount, either directly or using a symlink.
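As a sketch, the core-site.xml overrides from step 3 would look like the following property fragment (the fs.fusion.server host and port are assumptions; use your own Big Replicate server address):
<property><name>fs.file.impl</name><value>com.wandisco.fs.client.FusionLocalFs</value></property>
<property><name>fs.defaultFS</name><value>file:///</value></property>
<property><name>fusion.underlyingFs</name><value>file:///</value></property>
<property><name>fs.fusion.server</name><value>node1.example.com:8023</value></property>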
4.3.2. HA local file system setup
-
Install Big Replicate UI, Server, IHC, and Client (for LocalFileSystem) on every node you plan to use for HA.
-
When you select the
$user:$group
you should pick a master user account that will have complete access to the local directory that you plan to replicate. You can set this manually by modifying /etc/wandisco/fusion-env.sh, setting FUSION_SERVER_GROUP to $group and FUSION_SERVER_USER to $user.
-
Next, you'll need to configure the core-site.xml, typically in /etc/hadoop/conf/, and override "fs.file.impl" to "com.wandisco.fs.client.FusionLocalFs", "fs.defaultFS" to "file:///", and "fusion.underlyingFs" to "file:///". (Make sure to add the usual Big Replicate properties as well, such as "fs.fusion.server".)
-
If you are running with the Big Replicate URI (via "fs.fusion.impl"), then you should still set the value to "com.wandisco.fs.client.FusionLocalFs".
-
If you are running with Kerberos then you should also override “fusion.handshakeToken.dir” to point to some directory that will exist within the local directory you plan to replicate to/from. You should also make sure to have “fs.fusion.keytab” and “fs.fusion.principal” defined as usual.
-
Ensure that the local directory you plan to replicate to/from already exists. If not, create it and give it 777 permissions, or create a symlink (locally) that will point to the local path you plan to replicate to/from.
-
For example, if you want to replicate /repl1/ but don't want to create a directory at your root level, you can create a symlink to repl1 at your root level and point it to wherever you want your replicated directory to actually be. If you are using NFS, the symlink should point to the NFS mount, e.g. /mnt/nfs/.
-
Now follow a regular HA setup, making sure that you copy over the core-site.xml and fusion-env.sh everywhere, so all HA nodes have the same configuration.
-
Create the replicated directory (or symlink to it) on every HA node and chmod it to 777.
5. Working in the Hadoop ecosystem
The deployment section covers the final step in setting up a Big Replicate cluster, where supported Hadoop applications are plugged into Big Replicate's synchronized distributed namespace. It isn't possible to cover all the requirements for all the third-party software listed here, so we strongly recommend that you get hold of the corresponding documentation for each Hadoop application before you work through these procedures.
Deployed Hadoop applications
Hive
Impala
Presto
Oozie
Oracle Big Data Appliance
Apache Tez
Apache Ranger
Solr
Flume
Spark
HBase (cold-backup mode)
Apache Phoenix
Deploying
Big Replicate into a LocalFileSystem
Running with HAWQ
KMS / TDE Encryption and Big Replicate
5.1. Hive
This guide covers integrating Big Replicate with Apache Hive. It aims to accomplish the following goals:
-
Replicate Hive table storage.
-
Use Big Replicate URIs as store paths.
-
Use Big Replicate URIs as load paths.
-
Share the Hive metastore between two clusters.
5.1.1. Prerequisites
-
Knowledge of Hive architecture.
-
Ability to modify Hadoop site configuration.
-
Big Replicate installed and operating.
5.1.2. Replicating Hive Storage via fusion:///
The following requirements come into play if you have deployed Big Replicate with its native fusion:/// URI.
In order to store a Hive table in Big Replicate, you specify a Big Replicate URI when creating the table. For example, consider creating a table called log that will be stored in a replicated directory:
CREATE TABLE log(requestline string) stored as textfile location 'fusion:///repl1/hive/log';
Note: Replicating table storage without sharing the Hive metadata will create a logical discrepancy in the Hive catalog. For example, consider a case where a table is defined on one cluster and replicated on the HCFS to another cluster. A Hive user on the other cluster would need to define the table locally in order to make use of it.
5.1.3. Exceptions
Hive from CDH 5.3/5.4 does not work with Big Replicate (because of HIVE-9991). To get it working with CDH 5.3 and 5.4, you need to modify the default Hive file system setting. In Cloudera Manager, add the following property to hive-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>fusion:///</value>
</property>
This property should be added in 3 areas:
-
Service Wide
-
Gateway Group
-
HiveServer2 Group
5.1.4. Replicated directories as store paths
It's possible to configure Hive to use Big Replicate URIs as output paths for storing data. To do this, you must specify a Big Replicate URI when writing data back to the underlying Hadoop-compatible file system (HCFS). For example, consider writing data out from a table called log to a file stored in a replicated directory:
INSERT OVERWRITE DIRECTORY 'fusion:///repl1/hive-out.csv' SELECT * FROM log;
5.1.5. Replicated directories as load paths
In this section we’ll describe how to configure Hive to use Big Replicate URIs as input paths for loading data.
It is not common to load data into a Hive table from a file using the Big Replicate URI. When loading data into Hive from files, the core-site.xml setting fs.default.name must also be set to use the fusion:/// URI, which may not be desirable. It is much more common to load data from a local file using the LOCAL keyword:
LOAD DATA LOCAL INPATH '/tmp/log.csv' INTO TABLE log;
If you do wish to use a Big Replicate URI as a load path, you must change the fs.defaultFS setting to use Big Replicate, as noted in a previous section. Then you may run:
LOAD DATA INPATH 'fusion:///repl1/log.csv' INTO TABLE log;
5.1.6. Sharing the Hive metastore
Advanced configuration - please contact IBM before attempting
In this section we'll describe how to share the Hive metastore between two clusters. Since IBM Big Replicate can replicate the file system that contains the Hive data storage, sharing the metadata presents a single logical view of Hive to users on both clusters.
When sharing the Hive metastore, note that Hive users on all clusters will know about all tables. If a table is not actually replicated, Hive users on other clusters will experience errors if they try to access that table.
There are two options available.
5.1.7. Hive metastore available read-only on other clusters
In this configuration, the Hive metastore is configured normally on one cluster. On other clusters, the metastore process points to a read-only copy of the metastore database. MySQL can be used in master-slave replication mode to provide the metastore.
5.1.8. Hive metastore writable on all clusters
In this configuration, the Hive metastore is writable on all clusters.
-
Configure the Hive metastore to support high availability.
-
Place the standby Hive metastore in the second data center.
-
Configure both Hive services to use the active Hive metastore.
Performance over WAN Performance of Hive metastore updates may suffer if the writes are routed over the WAN. |
Hive metastore replication The following strategies are available for replicating Hive metastore data with Big Replicate: |
Standard
For Cloudera CDH: see Hive Metastore High Availability.
For Hortonworks/Ambari: see High Availability for Hive Metastore.
Manual Replication
In order to manually replicate metastore data, ensure that the DDLs are placed on both clusters, and perform a partitions rescan (a sketch follows).
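For example, after creating the table with the same DDL on the second cluster, a partitions rescan can be run there. A minimal sketch, in which the table name log is an assumption (MSCK REPAIR TABLE is Hive's partition rescan command):
hive -e "MSCK REPAIR TABLE log;"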
5.2. Impala
5.2.1. Prerequisites
-
Knowledge of Impala architecture.
-
Ability to modify Hadoop site configuration.
-
Big Replicate installed and operating.
5.2.2. Impala Parcel
If you plan to use Big Replicate's own fusion:/// URI, then you will need to use the provided parcel (see the link in the Client Download section of the Settings screen):
Follow the same steps described for installing the Big Replicate client, downloading the parcel and SHA file, i.e.:
-
Have a cluster with CDH installed with parcels and Impala.
-
Copy the
FUSION_IMPALA
parcel and SHA into the local parcels repository, on the same node where Cloudera Manager Services is installed; this need not be the same location where the Cloudera Manager Server is installed. The default location is /opt/cloudera/parcel-repo, but it is configurable. In Cloudera Manager, you can go to the Parcels Management Page → Edit Settings to find the Local Parcel Repository Path. See Parcel Locations. FUSION_IMPALA should then be available to distribute and activate on the Parcels Management Page; remember to click the Check for New Parcels button.
-
Once installed, restart the cluster.
-
Impala reads on Big Replicate files should now be available.
5.2.3. Setting the CLASSPATH
In order for Impala to load the Big Replicate Client jars, the user needs to make a small configuration change in their Impala service, through Cloudera Manager. In Cloudera Manager, the user needs to add an environment variable in the section Impala Service Environment Advanced Configuration Snippet (Safety Valve).
AUX_CLASSPATH='colon-delimited list of all the Big Replicate client jars'
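As a sketch, assuming a standard client installation under /opt/wandisco/fusion/client/lib, the colon-delimited list could be generated like this:
# Build the colon-delimited list of Big Replicate client jars for the safety valve
AUX_CLASSPATH=$(ls /opt/wandisco/fusion/client/lib/*.jar | tr '\n' ':')
echo "AUX_CLASSPATH=${AUX_CLASSPATH%:}"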
5.3. Presto
5.3.1. Presto Interoperability
Presto is an open source distributed SQL query engine for running interactive analytic queries. It can query and interact with multiple data sources, and can be extended with plugins.
Presto requires the use of Java 8, and has internal dependencies on Java library versions that may conflict with those of the Hadoop distribution with which it communicates when using the "hive-hadoop2" plugin. For example, Presto makes use of guava-18.0.jar, while HDP 2.4 uses guava-11.0.2.jar.
5.3.2. Presto and Big Replicate
IBM Big Replicate leverages a replacement client library when overriding the hdfs:// scheme for access to the cluster file system in order to coordinate that access among multiple clusters. This replacement library is provided as a collection of jar files in the /opt/wandisco/fusion/client/lib directory for a standard installation. These jar files need to be available to any process that accesses the file system using the com.wandisco.fs.client.FusionHdfs implementation of the Apache Hadoop FileSystem API.
Because Presto requires these classes to be available to the hive-hadoop2 plugin, they must reside in the plugins/hive-hadoop2 directory of the Presto installation. Additionally, the extra JARs made available in that directory must not provide conflicting versions of classes already used by the hive-hadoop2 plugin.
As Big Replicate uses conflicting versions of some classes (e.g. guava), the Big Replicate client library used by Presto must be repackaged to avoid such conflicts.
5.3.3. Repackaging the Big Replicate Client Library
IBM has provided a repackaged version of the Big Replicate 2.9.3.1 client libraries to help demonstrate Big Replicate-Presto interoperability. Please take note that this is an early access release of the client library that is specifically for Presto DB deployment.
The repackaged client library consists of a single JAR file: fusion-client-bundle-2.9.3.1-hdp-2.4.0-SNAPSHOT-all.jar. This file consists of the classes otherwise made available through the JAR files in the /opt/wandisco/fusion/client/lib directory:
bcprov-jdk15on-1.54.jar fusion-adk-client-2.9.3.1-hdp-2.4.0.jar fusion-adk-common-2.9.3.1.jar fusion-adk-netty-2.9.3.1.jar fusion-adk-security-2.9.3.1.jar fusion-client-common-2.9.3.1-hdp-2.4.0.jar fusion-client-hdfs-2.9.3.1-hdp-2.4.0.jar fusion-common-2.9.3.1-hdp-2.4.0.jar guava-11.0.2.jar netty-all-4.0.23.Final.jar
The three 3rd-party JARs have had their classes repackaged under "shadow" package names in order that they do not conflict with the alternative versions used by Presto. Additionally, the IBM Big Replicate client library has been adjusted to use those versions of the classes under their alternative package names.
IBM has made this repackaged client library available on fd.wandisco.com.
5.3.4. Using the Repackaged Big Replicate Client Library with Presto
-
Place the fusion-client-bundle-2.9.3.1-hdp-2.4.0-SNAPSHOT-all.jar file in the plugins/hive-hadoop2 directory of each Presto server.
-
Restart the Presto coordinators.
It is also important to confirm that the Presto configuration includes the necessary properties to function correctly with the hive-hadoop2 plugin. For example (the specific values below will need to be adjusted for the actual environment), the configuration includes references to the IBM replicated metastore, the HDP cluster configuration that includes Big Replicate configuration, and Kerberos-specific information to allow Presto to interoperate with a secured cluster:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://vwang02-vm1.bdfrem.wandisco.com:9084
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
hive.metastore.authentication.type=KERBEROS
hive.metastore.service.principal=hive/vwang02-vm1.bdfrem.wandisco.com@WANDISCO.HADOOP
hive.metastore.client.principal=presto/vwang02-vm0.bdfrem.wandisco.com@WANDISCO.HADOOP
hive.metastore.client.keytab=/etc/security/keytabs/presto.keytab
hive.hdfs.authentication.type=KERBEROS
hive.hdfs.impersonation.enabled=true
hive.hdfs.presto.principal=hdfs-VWANG-02@WANDISCO.HADOOP
hive.hdfs.presto.keytab=/etc/security/keytabs/hdfs.headless.keytab
Keytabs and principals will need to be configured correctly, and as the hive-hadoop2 Presto plugin uses YARN for operation, the /user/yarn directory must exist and be writable by the yarn user in all clusters in which Big Replicate operates.
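For example, a minimal sketch for creating that directory in each cluster (assuming the hdfs superuser and the yarn user/group):
$ sudo -u hdfs hdfs dfs -mkdir -p /user/yarn
$ sudo -u hdfs hdfs dfs -chown yarn:yarn /user/yarn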
5.3.6. Known Issue
Presto embeds Hadoop configuration defaults into the hive-hadoop2 plugin, including a core-default.xml file that specifies the following property entry:
<property>
  <name>hadoop.security.authentication</name>
  <value>simple</value>
  <description>Possible values are simple (no authentication), and kerberos</description>
</property>
Presto allows the hive-hadoop2 plugin to use additional configuration properties, by specifying an entry like the following in a .properties file in the etc/catalog directory:
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
This entry allows extra configuration properties to be loaded from a standard Hadoop configuration file, but those entries cannot override settings that are embedded in the core-default.xml that ships with the Presto hive-hadoop2 plugin.
In a kerberized implementation the Big Replicate client library relies on the ability to read the hadoop.security.authentication configuration property to determine whether it should perform a secure handshake with the Big Replicate server. Without that property defined, the client and server will fail to perform their security handshake, and Presto queries will not succeed.
5.3.7. Workaround
The solution to this issue is to update the core-default.xml file contained in the hive-hadoop2 plugin:
$ mkdir ~/tmp
$ cd ~/tmp
$ jar -xvf <path to…>/presto-server-0.164/plugin/hive-hadoop2/hadoop-apache2-0.10.jar
Edit the core-default.xml file to update the hadoop.security.authentication property so that its value is "kerberos", then repack the jar:
$ jar -uf <path to...>/presto-server-0.164/plugin/hive-hadoop2/hadoop-apache2-0.10.jar core-default.xml
Distribute the hadoop-apache2-0.10.jar to all Presto nodes, and restart the Presto coordinator.
5.4. Oozie
The Oozie service can function with Big Replicate, running without problems under Cloudera CDH. Under Hortonworks HDP you need to apply the following procedure after completing the Big Replicate installation:
-
Open a terminal to the node with root privileges.
-
Go into the Oozie lib directory:
cd /usr/hdp/current/oozie-server/oozie-server/webapps/oozie/WEB-INF/lib
-
Create symlinks for the Big Replicate client jars:
ln -s /opt/wandisco/fusion/client/lib/* /usr/hdp/{hdp_version}/oozie/libext
-
Open a terminal session as oozie-user and run:
$ /usr/hdp/current/oozie/bin/oozie-setup.sh prepare war
-
Restart the Oozie service and the Big Replicate services. Run shareliblist to verify the shared library contents, e.g.
oozie admin -oozie http://<node-ip>:11000/oozie -shareliblist
5.5. Oracle: Big Data Appliance
Each node in an Oracle:BDA deployment has multiple network interfaces, with at least one used for intra-rack communications and one used for external communications. Big Replicate requires external communications, so it must be configured using the public IP address instead of host names.
5.5.1. Prerequisites
-
Knowledge of Oracle:BDA architecture and configuration.
-
Ability to modify Hadoop site configuration.
5.5.2. Required steps
-
Configure Big Replicate to support Kerberos. See Setting up Kerberos
-
Configure Big Replicate to work with NameNode High Availability, as described in Oracle's documentation
-
Restart the cluster, Big Replicate and IHC processes. See init.d management script
-
Test that replication between zones is working.
5.5.3. Operating in a multi-homed environment
Oracle:BDA is built on top of Cloudera's Hadoop and requires some extra steps to support a multi-homed network environment.
5.5.4. Running Big Replicate with Oracle BDA 4.2 / CDH 5.5.1
There’s a known issue concerning configuration and the Cloudera Navigator Metadata Server classpath.
Error message:
2016-04-19 08:50:31,434 ERROR com.cloudera.nav.hdfs.extractor.HdfsExtractorShim [CDHExecutor-0-CDHUrlClassLoader@3bd4729d]: Internal Error while extracting
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.wandisco.fs.client.FusionHdfs not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2199)
There’s no clear way to override the fs.hdfs.impl setting just for the Navigator Metadata server, as is required for running with Big Replicate.
5.5.5. Fix Script
Use the following fix script to overcome the problem:
CLIENT_JARS=$(for i in $(ls -1 /opt/cloudera/parcels/CDH/lib/hadoop/client/*.jar | grep -v jsr305 | awk '{print $NF}') ; do echo -n $i: ; done)
NAVIGATOR_EXTRA_CLASSPATH=/opt/wandisco/fusion/client/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/lib/jetty-*.jar:$CLIENT_JARS
echo "NAVIGATOR_EXTRA_CLASSPATH=$NAVIGATOR_EXTRA_CLASSPATH" > ~/navigator_env.txt
The environment variables are provided here - navigator_env.txt
You need to put this in the configuration for the Cloudera Management Service under "Navigator Metadata Server Environment Advanced Configuration Snippet (Safety Valve)". This modification currently needs to be reapplied whenever you upgrade or downgrade Big Replicate.
5.6. Apache Tez
Apache Tez is a YARN application framework that supports high-performance data processing through DAGs. When set up, Tez uses its own tez.tar.gz containing the dependencies and libraries that it needs to run DAGs. For a DAG to access Big Replicate's fusion:/// URI it needs the Big Replicate client jars:
Configure the tez.lib.uris property with the path to the Big Replicate client jar files.
<property>
  <name>tez.lib.uris</name>
  <!-- Location of the Tez jars and their dependencies.
       Tez applications download required jar files from this location, so it should be publicly accessible. -->
  <value>${fs.default.name}/apps/tez/,${fs.default.name}/apps/tez/lib/</value>
</property>
5.6.1. Tez with Hive
In order to make Hive with Tez work, you need to append the Big Replicate jar files to tez.cluster.additional.classpath.prefix under the Advanced tez-site section:
tez.cluster.additional.classpath.prefix = /opt/wandisco/fusion/client/lib/*
When running Hortonworks Data Platform, the tez.lib.uris parameter defaults to /hdp/apps/${hdp.version}/tez/tez.tar.gz. To add the Big Replicate libs, there are two choices:
Option 1: Delete the above value, and instead provide a list that includes the path where the above tar.gz unpacks to, and the path where the Big Replicate libs are.
Option 2: Unpack the above tar.gz, repack it with the Big Replicate libs and re-upload it to HDFS (a sketch follows).
Note that both changes are vulnerable to a platform (HDP) upgrade.
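A minimal sketch of Option 2 follows, assuming HDP default paths and an illustrative HDP version; verify the paths and version for your cluster before use:
$ mkdir /tmp/tez-repack && cd /tmp/tez-repack
$ hdfs dfs -get /hdp/apps/2.4.2.0-258/tez/tez.tar.gz .
$ mkdir tez && tar -xzf tez.tar.gz -C tez
$ cp /opt/wandisco/fusion/client/lib/*.jar tez/lib/
$ tar -czf tez.tar.gz -C tez .
$ hdfs dfs -put -f tez.tar.gz /hdp/apps/2.4.2.0-258/tez/tez.tar.gz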
5.7. Apache Ranger
Apache Ranger is another centralized security console for Hadoop clusters, and the preferred solution for Hortonworks HDP (whereas Cloudera prefers Apache Sentry). While Apache Sentry stores its policy file in HDFS, Ranger uses its own local MySQL database, which introduces concerns over non-replicated security policies.
Ranger also applies its policies to the ecosystem via Java plugins in the ecosystem components - the namenode, hiveserver, etc. In testing, the Big Replicate client has not experienced any problems communicating with Apache Ranger-enabled platforms (Ranger+HDFS).
Ensure that the Hadoop system user, typically hdfs, has permission to impersonate other users:
<property>
  <name>hadoop.proxyuser.hdfs.users</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hdfs.groups</name>
  <value>*</value>
</property>
5.8. Solr
Apache Solr is a scalable search engine that can be used with HDFS. In this section we cover what you need to do for Solr to work with a Big Replicate deployment.
5.8.1. Minimal deployment using the default hdfs:// URI
Getting set up with the default URI is simple; Solr just needs to be able to find the Big Replicate client jar files that contain the FusionHdfs class.
-
Copy the Fusion/Netty jars into the classpath. Please follow these steps on all deployed Solr servers. For CDH5.4 with parcels, use these two commands:
cp /opt/cloudera/parcels/FUSION/lib/fusion* /opt/cloudera/parcels/CDH/lib/solr/webapps/solr/WEB-INF/lib
cp /opt/cloudera/parcels/FUSION/lib/netty-all-*.Final.jar /opt/cloudera/parcels/CDH/lib/solr/webapps/solr/WEB-INF/lib
-
Restart all Solr Servers.
-
Solr is now successfully configured to work with Big Replicate.
5.8.2. Minimal deployment using the WANdisco "fusion://" URI
This is a minimal working solution with Solr on top of fusion.
Requirements
Solr will use a shared replicated directory.
-
Symlink the Big Replicate jars into the Solr webapp:
cd /opt/cloudera/parcels/CDH/lib/solr/webapps/solr/WEB-INF/lib
ln -s /opt/cloudera/parcels/FUSION/lib/fusion* .
ln -s /opt/cloudera/parcels/FUSION/lib/netty-all-4* .
ln -s /opt/cloudera/parcels/FUSION/lib/bcprov-jdk15on-1.52 .
-
Restart Solr.
-
Create instance configuration.
$ solrctl instancedir --generate conf1
-
Edit conf1/conf/solrconfig.xml and replace
solr.hdfs.home
in the directoryFactory definition with an actual fusion:/// URI, like fusion:///repl1/solr.
-
Create solr directory and set solr:solr permissions on it.
$ sudo -u hdfs hdfs dfs -mkdir fusion:///repl1/solr
$ sudo -u hdfs hdfs dfs -chown solr:solr fusion:///repl1/solr
-
Upload the configuration to ZooKeeper.
$ solrctl instancedir --create conf1 conf1
-
Create collection on first cluster
$ solrctl collection --create col1 -c conf1 -s 3
Tip For Cloudera, fusion.impl.disable.cache = true should be set for the Solr servers. (Don't set this option cluster-wide; that would stall the Big Replicate server with an unbounded number of client connections.) |
5.9. Flume
This set of instructions will set up Flume to ingest data via the fusion:/// URI.
Edit the configuration: set "agent.sources.flumeSource.command" to the path of the source data, and set "agent.sinks.flumeHDFS.hdfs.path" to the replicated directory of one of the DCs. Make sure the path begins with fusion:/// so that files are pushed through Fusion and not plain HDFS. A sketch of such a configuration follows.
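A minimal flume.conf fragment, in which the agent, source and sink names, the source command, and the replicated path are all illustrative assumptions:
agent.sources.flumeSource.type = exec
agent.sources.flumeSource.command = tail -F /var/log/source-data.log
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = fusion:///repl1/flume_out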
5.9.1. Prerequisites
-
Create a user in both clusters:
useradd -G hadoop <username>
-
Create a user directory in the Hadoop file system:
hadoop fs -mkdir /user/<username>
-
Create the replication directory in both DCs:
hadoop fs -mkdir /fus-repl
-
Set permissions on the replication directory:
hadoop fs -chown username:hadoop /fus-repl
-
Install and configure Big Replicate.
5.9.2. Setting up Flume through Cloudera Manager
If you want to set up Flume through Cloudera Manager follow these steps:
-
Download the client in the form of a parcel and the parcel.sha through the UI.
-
Put the parcel and .sha into /opt/cloudera/parcel-repo on the Cloudera Managed node.
-
Go to the UI on the Cloudera Manager node. On the main page, click the small button that looks like a gift-wrapped box and the Big Replicate parcel should appear (if it doesn't, try clicking Check for new parcels and wait a moment).
-
Install, distribute, and activate the parcel.
-
Repeat steps 1-4 for the second zone.
-
Make sure membership and replicated directories are created for sharing between Zones.
-
Go onto Cloudera Manager’s UI on one of the zones and click Add Service.
-
Select the Flume Service. Install the service on any of the nodes.
-
Once installed, go to Flume→Configurations.
-
Set 'System User' to 'hdfs'
-
Set 'Agent Name' to 'agent'
-
Set 'Configuration File' to the contents of the flume.conf configuration.
-
Restart Flume Service.
-
Selected data should now be in Zone1 and replicated to Zone2. To check that the data was replicated, open a terminal on one of the DCs, become the
hdfs
user, e.g. su hdfs, and run:
hadoop fs -ls /repl1/flume_out
-
On both Zones, there should be the same FlumeData file with a long number. This file will contain the contents of the source(s) you chose in your configuration file.
5.10. Spark
It's possible to deploy Big Replicate with Apache Spark, the high-speed data processing engine. Note that prior to version 2.9.1 you needed to manually add the SPARK_CLASSPATH.
5.10.1. Spark with CDH
There is a known issue where Spark does not pick up hive-site.xml; see Hadoop configuration is not localised when submitting job in yarn-cluster mode (fixed in Spark 1.4).
You need to manually add it in by either:
-
Copy /etc/hive/conf/hive-site.xml into /etc/spark/conf. or
-
Do one of the following, depending on which deployment mode you are running in:
- Client
-
set HADOOP_CONF_DIR to /etc/hive/conf/ (or the directory where hive-site.xml is located).
- Cluster
-
add --files=/etc/hive/conf/hive-site.xml (or the path for hive-site.xml) to the spark-submit script.
-
For CDH with parcels, the classpath containing the Big Replicate client needs to be added to the following configuration in the Yarn service:
Gateway Client Environment Advanced Configuration Snippet (Safety Valve) for hadoop-env.sh =
HADOOP_CLASSPATH=/opt/cloudera/parcels/$FUSION-PARCEL/lib/*:$HADOOP_CLASSPATH:
-
Deploy configs and restart services.
Using the FusionUri The fusion:/// URI has a known issue where it complains about "Wrong fs". For now, Spark is only verified with FusionHdfs going through the hdfs:/// URI. |
5.11. Big Replicate Spark Interoperability
Spark applications run on a cluster as independent sets of processes, coordinated by the SparkContext object in the driver program. To run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark's own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
With Spark 2.0, the SPARK_CLASSPATH environment variable is deprecated, and the mechanisms available for adding external classes to Spark applications include the spark.driver.extraClassPath and spark.executor.extraClassPath configuration properties. When set, these provide a list of extra classpath entries that are available to code running within the driver program or the executor respectively.
Unlike SPARK_CLASSPATH, these properties do not affect the operation of the Spark history server process, or the code run by the executor when it accesses the cluster file system to obtain the packaged Spark application for execution. While both spark.driver.extraClassPath and spark.executor.extraClassPath can be set as Spark configuration properties on a cluster-wide basis, each can be overridden at the time of Spark job deployment. Additionally, if SPARK_CLASSPATH is set and the spark.executor.extraClassPath property is set, Spark applications will fail to run.
The Spark Context executed by the Spark driver will throw a SparkException, indicating the deprecation of that environment variable, and stating that only spark.executor.extraClassPath should be used. Spark 2.0+ therefore provides no complete, configurable mechanism for affecting the classes available to the Spark history server, the Spark Executor or the Spark Driver program.
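For illustration, this is how the extra classpath properties can be supplied at job submission time; a sketch only, with the Big Replicate client path as an assumption (the following sections describe the assembly-jar approach Big Replicate actually uses for Spark):
$ spark-submit --conf spark.driver.extraClassPath='/opt/wandisco/fusion/client/lib/*' --conf spark.executor.extraClassPath='/opt/wandisco/fusion/client/lib/*' --class org.apache.spark.examples.SparkPi lib/spark-examples*.jar 10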
5.11.1. Spark and Big Replicate
IBM Big Replicate uses a replacement client library when overriding the hdfs:// scheme for access to the cluster file system in order to coordinate that access among multiple clusters. This replacement library is provided as a collection of jar files in the /opt/wandisco/fusion/client/lib directory for a standard installation. These jar files need to be available to any process that accesses the file system using the com.wandisco.fs.client.FusionHdfs implementation of the Apache Hadoop FileSystem API.
Because Spark does not provide a configurable mechanism for making the Big Replicate classes available to the Spark history server, the Spark Executor or Spark Driver programs, IBM Big Replicate client library classes need to be made available in the existing Spark assembly jar that holds the classes used by these Spark components. This requires updating that assembly jar to incorporate the Big Replicate client library classes.
5.11.2. Updating the Spark Assembly JAR
This is one of a number of methods that may be employed to provide Big Replicate-Spark integration. We hope to cover some alternate methods at a later date.
-
First, make a backup of the original Spark assembly jar:
$ cp /usr/hdp/2.4.2.0-258/spark/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar /usr/hdp/2.4.2.0-258/spark/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar.original
Then follow this process to update the Spark assembly jar.
$ mkdir /tmp/spark_assembly
$ cd /tmp/spark_assembly
$ jar -xf /opt/wandisco/fusion/client/lib/bcprov-jdk15on-1.54.jar
$ jar -xf /opt/wandisco/fusion/client/lib/fusion-adk-client-2.9.3.1-hdp-2.4.0.jar
$ jar -xf /opt/wandisco/fusion/client/lib/fusion-adk-common-2.9.3.1.jar
$ jar -xf /opt/wandisco/fusion/client/lib/fusion-adk-netty-2.9.3.1.jar
$ jar -xf /opt/wandisco/fusion/client/lib/fusion-adk-security-2.9.3.1.jar
$ jar -xf /opt/wandisco/fusion/client/lib/fusion-client-common-2.9.3.1-hdp-2.4.0.jar
$ jar -xf /opt/wandisco/fusion/client/lib/fusion-client-hdfs-2.9.3.1-hdp-2.4.0.jar
$ jar -xf /opt/wandisco/fusion/client/lib/fusion-common-2.9.3.1-hdp-2.4.0.jar
$ jar -xf /opt/wandisco/fusion/client/lib/guava-11.0.2.jar
$ jar -xf /opt/wandisco/fusion/client/lib/netty-all-4.0.23.Final.jar
$ jar -uf /usr/hdp/2.4.2.0-258/spark/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar com/** io/** org/**
-
You now have both the original Spark assembly jar (with the extension ".original") and a version with the Big Replicate client libraries available in it. The updated version needs to be made available on each node in the cluster in the /usr/hdp/2.4.2.0-258/spark/lib directory.
-
If you need to revert to the original Spark assembly jar, simply copy it back in place on each node in the cluster.
5.11.3. Spark Configuration
With the updated Spark assembly jar, Big Replicate does not require any modification to the cluster SPARK_CLASSPATH environment variable; the spark.driver.extraClassPath and spark.executor.extraClassPath properties can also be left unset. This will require removing any modifications made to the spark-env.sh script by the IBM Big Replicate installer, leaving it in its original state, and removing any Ambari configuration properties created or modified for Big Replicate in the Custom spark-defaults section.
-
Once the custom spark-defaults and the spark-env.sh script have been restored to their original state, the Spark services should be restarted on the cluster, and you should validate that the Spark history server is running. A simple test to then validate Spark operation is to run a sample Spark job.
-
As an appropriate user:
$ cd /usr/hdp/current/spark-client/
$ ./bin/spark-submit --master yarn-master --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 --deploy-mode cluster --class org.apache.spark.examples.SparkPi lib/spark-examples*.jar 10
Other use of Spark should function as normal. Spark jobs have full access to the cluster file system via HDFS, mediated and coordinated by IBM Big Replicate.
5.12. HBase (Cold Back-up mode)
It's possible to run HBase in a cold back-up mode across multiple data centers using Big Replicate, so that in the event of the active HBase node going down, you can bring up the HBase cluster in another data center. However, there will be unavoidable and considerable inconsistency between the lost node and the awakened replica. The following procedure should make it possible to overcome corruption problems enough to start running HBase again; however, since the damage dealt to the underlying filesystem might be arbitrary, it's impossible to account for all possible corruptions.
5.12.1. Requirements
For HBase to run with Big Replicate, the following directories need to be created and permissioned, as shown below:
platform | path | permission
CDH5.x | /user/hbase | hbase:hbase
HDP2.x | /hbase /user/hbase | hbase:hbase
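As a sketch, the directories could be created and permissioned like this (shown for an HDP-style layout; adjust the paths and ownership for your platform):
$ sudo -u hdfs hdfs dfs -mkdir -p /hbase /user/hbase
$ sudo -u hdfs hdfs dfs -chown hbase:hbase /hbase /user/hbase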
Known problem: permissions error blocks HBase repair.
Error example:
2016-09-22 17:14:43,617 WARN [main] util.HBaseFsck: Got AccessControlException when preCheckPermission
org.apache.hadoop.security.AccessControlException: Permission denied: action=WRITE path=hdfs://supp16-vm0.supp:8020/apps/hbase/data/.fusion user=hbase
at org.apache.hadoop.hbase.util.FSUtils.checkAccess(FSUtils.java:1685)
at org.apache.hadoop.hbase.util.HBaseFsck.preCheckPermission(HBaseFsck.java:1606)
at org.apache.hadoop.hbase.util.HBaseFsck.exec(HBaseFsck.java:4223)
at org.apache.hadoop.hbase.util.HBaseFsck$HBaseFsckTool.run(HBaseFsck.java:4063)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
You can configure the root path for all .fusion directories associated with distributed state machines.
Customizable DSM token directories
These can be set in the respective configurations to change the location of the .fusion directory. It is important to note that, if used, the same configuration and path must be added to all Big Replicate servers in all zones. |
5.12.2. Procedure
The steps below provide a method of handling a recovery using a cold back-up. Note that multiple HMaster/region server restarts might be needed for certain steps, since the hbck command generally requires the master to be up, which may require fixing filesystem-level inconsistencies first.
-
Delete all recovered.edits folder artifacts from possible log splitting for each table/region. This might not be strictly necessary, but could reduce the number of errors observed during startup.
hdfs dfs -rm /apps/hbase/data/data/default/TestTable/8fdee4924ac36e3f3fa430a68b403889/recovered.edits
-
Detect and clean up (quarantine) all corrupted HFiles in all tables (including the system tables hbase:meta and hbase:namespace). The sideline option forces hbck to move corrupted HFiles to a special .corrupted folder, which can be examined/cleaned up by admins:
hbase hbck -checkCorruptHFiles -sidelineCorruptHFiles
-
Attempt to rebuild corrupted table descriptors based on filesystem information:
hbase hbck -fixTableOrphans
-
General recovery step - try to fix assignments, possible region overlaps and region holes in HDFS - just in case:
hbase hbck -repair
-
Clean up ZK. This is particularly necessary if hbase:meta or hbase:namespace were corrupted (note that the exact name of the ZK znode is set by the cluster admin).
hbase zkcli rmr /hbase-unsecure
Final step to correct metadata-related errors:
hbase hbck -metaonly
hbase hbck -fixMeta
5.13. Apache Phoenix
The Phoenix Query Server provides an alternative means for interaction with Phoenix and HBase. When Big Replicate is installed, the Phoenix query server may fail to start. The following workaround will get it running with Big Replicate.
-
Open up phoenix_utils.py, comment out
#phoenix_class_path = os.getenv('PHOENIX_LIB_DIR','')
and set IBM Big Replicate's classpath instead (using the client jar files as a colon-separated string), e.g.
def setPath():
    PHOENIX_CLIENT_JAR_PATTERN = "phoenix-*-client.jar"
    PHOENIX_THIN_CLIENT_JAR_PATTERN = "phoenix-*-thin-client.jar"
    PHOENIX_QUERYSERVER_JAR_PATTERN = "phoenix-server-*-runnable.jar"
    PHOENIX_TESTS_JAR_PATTERN = "phoenix-core-*-tests*.jar"
    # Backward support old env variable PHOENIX_LIB_DIR replaced by PHOENIX_CLASS_PATH
    global phoenix_class_path
    #phoenix_class_path = os.getenv('PHOENIX_LIB_DIR','')
    phoenix_class_path = "/opt/wandisco/fusion/client/lib/fusion-client-hdfs-2.6.7-hdp-2.3.0.jar:/opt/wandisco/fusion/client/lib/fusion-client-common-2.6.7-hdp-2.3.0.jar:/opt/wandisco/fusion/client/lib/fusion-netty-2.6.7-hdp-2.3.0.jar:/opt/wandisco/fusion/client/lib/netty-all-4.0.23.Final.jar:/opt/wandisco/fusion/client/lib/guava-11.0.2.jar:/opt/wandisco/fusion/client/lib/fusion-common-2.6.7-hdp-2.3.0.jar"
    if phoenix_class_path == "":
        phoenix_class_path = os.getenv('PHOENIX_CLASS_PATH','')
-
Edit queryserver.py and change the Java construction command to look like the one below, appending phoenix_class_path to it within the "else" portion of the java_home check:
if java_home:
    java = os.path.join(java_home, 'bin', 'java')
else:
    java = 'java'

# " -Xdebug -Xrunjdwp:transport=dt_socket,address=5005,server=y,suspend=n " + \
# " -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:FlightRecorderOptions=defaultrecording=true,dumponexit=true" + \
java_cmd = '%(java)s -cp ' + hbase_config_path + os.pathsep + phoenix_utils.phoenix_queryserver_jar + os.pathsep + phoenix_utils.phoenix_class_path + \
    " -Dproc_phoenixserver" + \
    " -Dlog4j.configuration=file:" + os.path.join(phoenix_utils.current_dir, "log4j.properties") + \
    " -Dpsql.root.logger=%(root_logger)s" + \
    " -Dpsql.log.dir=%(log_dir)s" + \
    " -Dpsql.log.file=%(log_file)s" + \
    " " + opts + \
5.14. Deploying Big Replicate into a LocalFileSystem
5.14.1. Installer-based LocalFileSystem Deployment
The following procedure covers the installation and setup of Big Replicate deployed over the LocalFileSystem. This requires an administrator to enter details throughout the procedure. Once the initial settings are entered through the terminal session, the deployment to the LocalFileSystem is then completed through a browser.
-
Open a terminal session on your first installation server. Copy the Big Replicate installer script into a suitable directory.
-
Make the script executable, e.g.
chmod +x fusion-ui-server-<version>_rpm_installer.sh
-
Execute the file with root permissions, e.g.
sudo ./fusion-ui-server-<version>_rpm_installer.sh
-
The installer will now start. You will be asked if you wish to continue with the installation. Enter Y to continue.
Figure 16. LocalFS Installer starts
-
The installer performs some basic checks and lets you modify the Java heap settings. The heap settings apply only to the Big Replicate UI.
INFO: Using the following Memory settings for the IBM Big Replicate Admin UI process:
INFO: -Xms128m -Xmx512m
-
Do you want to use these settings for the installation? (Y/n) y
The default values should be fine for evaluation, although you should review your system resource requirements for production. Enter Y to continue.
Figure 17. LocalFS Settings confirmation
Select the localfs platform and then enter a username and password that you will use to log in to the Big Replicate web UI.
Which port should the UI Server listen on [8083]:
Please specify the appropriate platform from the list below:
[0] localfs-2.7.0
Which Big Replicate platform do you wish to use? 0
You chose localfs-2.7.0:2.7.0
Please provide an admin username for the Big Replicate web ui: admin
Please provide an admin password for the Big Replicate web ui: ************
Figure 18. LocalFS Platform
-
Provide a system user account for running Big Replicate. Following the on-screen instructions, you should set up an account called 'fusion' when running the default LocalFS setup.
We strongly advise against running Big Replicate as the root user. For default LOCALFS setups, the user should be set to 'fusion'. However, you should choose a user appropriate for running HDFS commands on your system.
Which user should Big Replicate run as? [fusion] fusion
-
Press Enter to accept 'fusion' or enter another suitable system account.
-
Now choose a suitable group; again, 'fusion' is the default.
Please choose an appropriate group for your system. By default LOCALFS uses the 'fusion' group.
Which group should Big Replicate run as? [fusion] fusion
-
You will get a summary of all the configuration that you have entered so far. Check it before you continue.
Figure 19. LocalFS Summary
-
The installation process will complete. The final configuration steps will now be done over the web UI. Follow the on-screen instructions for where to point your browser, i.e. http://your-server-IP:8083/
Figure 20. LocalFS Packages
In the first "Welcome" screen you're asked to choose between Create a new Zone and Add to an existing Zone. Make your selection as follows:
- Adding a new Big Replicate cluster
-
Select Add Zone.
- Adding additional Big Replicate servers to an existing Big Replicate cluster
-
Select Add to an existing Zone.
Figure 21. LocalFS GUI installer
-
Run through the installer’s detailed Environment checks. For more details about exactly what is checked in this stage, see Environmental Checks in the Deployment Appendix.
Figure 22. LocalFS Environmental checks
-
On clicking Validate, any element that fails the check should be addressed before you continue the installation.
Figure 23. LocalFS Environmental checks validated
Warnings may be ignored for the purposes of completing the installation, especially if the installation is only for evaluation purposes and not for production. However, when installing for production, you should address all warnings, or at least take note of them and exercise due care if you continue the installation without resolving and revalidating. Click Next Step to continue.
-
Click on Select file and then navigate to the license file provided by IBM.
Figure 24. LocalFS installer - license
-
Click on Upload to validate the license file.
Figure 25. LocalFS installer - select / upload license
-
Providing the license file is validated successfully, you will see a summary of what features are provided under the license.
Figure 26. LocalFS installer - Agree
Click I agree to the EULA to continue, then click Next Step.
-
Enter settings for the Big Replicate server.
Figure 27. LocalFS installer - server
- Big Replicate Server Max Memory (GB)
-
Enter the maximum Java Heap value for the Big Replicate server. For production deployments we recommend at least 16 GB.
- Umask (currently 022)
-
Set the default permissions applied to newly created files. The value 022 results in default directory permissions 755 and default file permissions 644. This ensures that the installation will be able to start up/restart.
Advanced options
Only apply these options if you fully understand what they do. The following advanced options provide a number of low level configuration settings that may be required for installation into certain environments. The incorrect application of some of these settings could cause serious problems, so for this reason we strongly recommend that you discuss their use with IBM’s support team before enabling them. |
- Custom UI hostname
-
Lets you set a custom hostname for the Big Replicate UI, distinct from the communication.hostname which is already set as part of the install and used by Big Replicate nodes to connect to the Big Replicate server.
- Custom UI Port
-
Lets you change the Big Replicate UI's default port, in case it is already assigned elsewhere, e.g. Cloudera's headlamp debug server also uses it.
5.14.2. IHC Server
- Maximum Java heap size (GB)
-
Enter the maximum Java Heap value for the Inter-Hadoop Communication (IHC) server. For production deployments we recommend at least 16 GB.
- IHC Network Interface
-
The address on which the IHC (Inter-Hadoop Communication) server will listen. Once all settings have been entered, click Next step.
-
Next, you will enter the settings for your new Zone.
5.14.3. Zone Properties
Entry fields for zone properties:
- Fully Qualified Domain Name
-
the full hostname for the server.
- Node ID
-
A unique identifier that will be used by Big Replicate UI to identify the server.
- DConE Port
-
TCP port used by Big Replicate for replicated traffic.
- Zone Name
-
The name used to identify the zone in which the server operates.
Add an entry for the EC2 node in your hosts file You need to ensure that the hostname of your EC2 machine has been added to the /etc/hosts file of your LocalFS server node. If you don't do this then you currently get an error when you start the node: |
Could not resolve Kerberos principal name: java.net.UnknownHostException: ip-10-0-100-72: ip-10-0-100-72" exception
5.14.4. File System Information
Configuration for the local file system:
- Use Kerberos for file system access
-
Tick this check-box to enable Kerberos authentication on the local filesystem.
- Kerberos Token Directory
-
This defines the root token directory for the Kerberos Token field. It is only set if you are using LocalFileSystem with Kerberos and want to target the token creations within the NFS directory, and not just on the actual LocalFileSystem. If left unset, it defaults to the original behavior, which is to create tokens in the /user/<username>/ directory.
The installer will validate that the directory given, or the default (if you leave the field blank), can be written to by Big Replicate.
- Configuration file path
-
System path to the Kerberos configuration file, e.g. /etc/krb5.conf
- Keytab file path
-
System path to your generated keytab file, e.g. /etc/krb5.keytab
Name and place the keytab where you like These paths and file names can be anything you like, providing they are consistent with your field entries. |
-
Review the summary. Click Validate to continue.
Figure 31. LocalFS installer - Summary
-
In the next step you must complete the installation of the Big Replicate client package on all the existing HDFS client machines in the cluster. The Big Replicate client is required to support Big Replicate's data replication across the Hadoop ecosystem.
Figure 32. LocalFS installer - Clients
In this case, download the client RPM file. Leave your browser session running while you do this; we haven't finished yet.
-
For localFS deployments, download the client RPM manually onto each client system; in the screenshot we use wget to copy the file into place.
Figure 33. LocalFS installer - CLI download Clients
Ensure that the client install file has suitable permissions to run. Then use your package manager to install the client.
yum install -y fusion-localfs-2.7.0-client-localfs-2.6.4.1.e16-1510.noarch.rpm
Figure 34. LocalFS installer - ClientsOnce the client has successfully installed you will see a verification message.
Figure 35. LocalFS installer - Summary
It's now time to return to the browser session and start up the Big Replicate UI for the first time. Click Start Big Replicate.
Once started, we complete the final step of the installer's configuration: Induction.
Figure 36. Induction
For the first node you skip this step. For all following node installations you will provide the FQDN or IP address and port of this first node. (In fact, you can complete induction by referring to any node that has itself completed induction.)
"Could not resolved Kerberos principal" errorYou need to ensure that the hostname of your EC2 machine has been added to the /etc/hosts file of your LocalFS server. -
Log in to the Big Replicate UI using the admin username and password that you provided during the installation. See step 6.
Figure 37. Induction -
The installation of your first node is now complete. You can find more information about working with the Big Replicate UI in the Admin section of this guide.
5.15. Notes on user settings
When using LocalFileSystem, you can only support a single user. This means that when you configure the Big Replicate server's process owner, that process owner should also be the process owner of the IHC server, the Big Replicate UI server, and the client user that will be used to perform any puts.
Big Replicate under LocalFileSystem only supports 1 user
Again, Big Replicate under LocalFileSystem only supports 1 user (on THAT side; you don't have to worry about the other DCs). To assist administrators, the LocalFS RPM comes with both a Big Replicate and a Hadoop shell, so that it is possible to run suitable commands from either. E.g. |
hadoop fs -ls /
fusion fs -ls /
Using the shell is required for replication.
5.16. Running with Apache HAWQ
In order to get HAWQ to work with the Big Replicate HDFS client libraries, an update needs to be made to the PXF classpath. This can be done in Ambari through the "Advanced pxf-public-classpath" setting, adding an entry for the client lib path:
/opt/wandisco/fusion/client/lib/*
5.17. KMS / TDE Encryption and Fusion
TDE (Transparent Data Encryption) is available to enhance data security. TDE uses Hadoop KMS (Key Management Server) and is typically implemented using Ranger KMS (in Hortonworks / Ambari installs) or Navigator Key Trustee (Cloudera installs).
In simple terms, a security / encryption key or EEK (encrypted encryption key) is used to encrypt the HDFS data that is physically stored to disk. This encryption occurs within the HDFS client, before the data is transported to the datanode.
The key management server (KMS) centrally holds these EEKs in an encrypted format. ACLs (access control lists) define what users/groups are permitted to do with these keys. This includes creating keys, deleting keys, rolling over (re-encrypting the EEK, not changing the EEK itself), obtaining the EEK, listing the key or keys, and so on.
Data encrypted in HDFS is split into encryption zones. A zone is defined by specifying a path (e.g. /data/warehouse/encrypted1) and the EEK that is used to protect it (i.e. the key used to encrypt / decrypt the data). A zone is configured with a single key, but different zones can have different keys. Not all of HDFS needs to be encrypted; only the specific zones (and all sub-directories of those zones) that an admin defines are.
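For context, encryption zones are created with the standard Hadoop key and crypto tooling; a minimal sketch, assuming a key named key1 and the example path above:
hadoop key create key1    # create an EEK in the KMS
hdfs dfs -mkdir -p /data/warehouse/encrypted1
hdfs crypto -createZone -keyName key1 -path /data/warehouse/encrypted1    # the directory must be empty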
A user then needs to be granted appropriate ACL access to the EEK (specifically the "Get Metadata" and "Decrypt EEK" permissions) in order to read from or write to the zone.
IBM Big Replicate runs as an HDFS user just like any other user. As such, Big Replicate will need permissions in order to read from and write to an encrypted zone.
Big Replicate may need to write metadata (consistency check, repair and other meta operations), tokens or other items for administrative reasons, and these may fall under an encrypted zone. Depending on configuration and requirements, repair itself writes data and thus needs access.
Additionally, KMS provides its own proxyuser implementation which is separate from the HDFS proxyusers, although it works in the same way, defining who is permitted to impersonate another user whilst working with EEKs.
To add complication, the "hdfs" user is typically blacklisted from performing the "Decrypt EEK" function by default. The fact that "hdfs" is a superuser means it wields great power in the cluster, but that does not make it a superuser in KMS. As "hdfs" is commonly the default user of choice for fixing things in HDFS (given the simple fact that it overrides permissions), it seems wise to prevent such an account from accessing EEKs by default. Note: Cloudera also seems to blacklist the group "supergroup", which is the group defined as the superusergroup. That is, any users added to "supergroup" become superusers, but they then also automatically get blacklisted from being able to perform EEK operations.
5.17.1. Configuring Big Replicate
To configure Big Replicate for access to encrypted zones, two aspects need to be considered:
-
The local user that Big Replicate runs as in HDFS (after kerberos auth_to_local mapping) must be able to access and decrypt EEKs.
-
Although other users will be performing the requests themselves, the Big Replicate server will proxy that request. As such, a proxyuser within the KMS configs for the Big Replicate user must also be provided.
5.17.2. Step-by-step guide
The following items need to be considered within KMS configuration to ensure Big Replicate has access:
The kms-site configuration (such as Advanced kms-site in Ambari) contains its own auth_to_local-type parameter called "hadoop.kms.authentication.kerberos.name.rules".
Ensure that any auth_to_local mapping used for the Big Replicate principal is also contained here. This can be most easily achieved via simple copy/paste from core-site.xml.
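For example, a sketch of one way to view the existing rules so they can be copied across (the grep context size is arbitrary):
grep -A 20 "hadoop.security.auth_to_local" /etc/hadoop/conf/core-site.xml    # print the rules to paste into hadoop.kms.authentication.kerberos.name.rules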
The kms-site configuration (such as Custom kms-site in Ambari) contains proxyuser parameters such as:
hadoop.kms.proxyuser.USERNAME.hosts
hadoop.kms.proxyuser.USERNAME.groups
hadoop.kms.proxyuser.USERNAME.users
Entries should be created for the local Big Replicate user (after auth_to_local translation) to allow Big Replicate to proxy / impersonate other users' requests. This could be as simple as:
hadoop.kms.proxyuser.USERNAME.hosts=fusion.node1.hostname,fusion.node2.hostname
hadoop.kms.proxyuser.USERNAME.groups=*
hadoop.kms.proxyuser.USERNAME.users=*
In the dbks-site configuration, the parameter hadoop.kms.blacklist.DECRYPT_EEK exists. Ensure this does not contain the username that Big Replicate uses (after auth_to_local translation).
In the KMS ACLs, such as using Ranger KMS, ensure that the Big Replicate user (after auth_to_local translation) has "Get Metadata" and "Decrypt EEK" permissions to keys.
Access could be granted to all keys; this avoids the need to review rules when new keys are added. However, Big Replicate only needs these permissions for keys that apply to zones falling within a replicated path. Consideration is needed here based on the user that Big Replicate has been configured as - either "hdfs" will need access to EEKs, OR the fusion user will need access, OR the supergroup could be given access to EEKs (it is enabled by default on Ambari but disabled on CDH) and the Big Replicate user made a member of the supergroup.
5.17.3. Troubleshooting
If you do not perform the correct configuration, local operations (as performed by a client) and/or replicated actions may fail when the Big Replicate client is invoked. This should only apply to replicated paths.
So to troubleshoot:
-
Perform the same command without Big Replicate (use the -D "fs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem" parameter if running basic HDFS CLI tests; see the sketch after this list). If clients can read/write encrypted content without Big Replicate, this points to misconfiguration in the above.
-
Test with an encrypted but non-replicated folder through the Big Replicate client. If this works but a replicated folder does not, this suggests issues on the remote cluster.
-
Look in client-side application / service logs for permissions issues (these may be MapReduce, Hive, HBase Region Server logs, etc.). This may require debug logging to be enabled temporarily.
-
Search for the path / file under investigation; you are looking for KMS ACL exceptions.
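As a sketch of the first check above, assuming a test file inside an encryption zone at /data/warehouse/encrypted1:
hdfs dfs -D "fs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem" -cat /data/warehouse/encrypted1/test.txt    # bypasses Big Replicate; success here points to the KMS configuration above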
6. Deployment Appendix
The appendix section contains extra help and procedures that may be required when running through a Big Replicate deployment.
6.1. Environmental Checks
During the installation, your system's environment is checked to ensure that it will support IBM Big Replicate. The environment checks are intended to catch basic compatibility issues, especially those that may appear during an early evaluation phase. The checks are not intended to replace carefully running through the Deployment Checklist.
- Operating System
-
The Big Replicate installer verifies that you are installing onto a system that is running on a compatible operating system.
See the Operating system section of the Deployment Checklist, although the current supported distributions of Linux are listed here:
Supported Operating Systems
-
RHEL 6 x86_64
-
RHEL 7 x86_64
-
Oracle Linux 6 x86_64
-
Oracle Linux 7 x86_64
-
CentOS 6 x86_64
-
CentOS 7 x86_64
-
Ubuntu 12.04LTS
-
Ubuntu 14.04LTS
-
SLES 11 x86_64
- Architecture
-
64-bit only
The Big Replicate installer verifies that the necessary Java components are installed on the system. The installer checks:
-
Environment variables: JRE_HOME and JAVA_HOME; the installer also runs the which java command.
-
Version: 1.7/1.8 recommended. Must be at least 1.7.
-
Architecture: JVM must be 64-bit.
-
Distribution: Must be from Oracle. See Oracle’s Java Download page.
For more information about Java requirements, see the Java section of the Deployment Checklist.
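You can run the same checks manually before the installer does; a quick sketch (the expected output depends on your install):
echo $JAVA_HOME $JRE_HOME
which java
java -version    # expect a 64-bit Oracle JVM, version 1.7 or 1.8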
- Kerberos Relogin Failure with Hadoop 2.6.0 and JDK7u80 or later
-
Hadoop Kerberos relogin fails silently due to HADOOP-10786. This impacts Hadoop 2.6.0 when JDK7u80 or later is used (including JDK8).
Users should downgrade to JDK7u79 or earlier, or upgrade to Hadoop 2.6.1 or later.
- ulimit
-
The Big Replicate installer verifies that the system’s maximum user processes and maximum open files are set to 64000.
For more information about these settings, see the File descriptor/Maximum number of processes limit section of the Deployment Checklist.
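To confirm the limits by hand, run the checks as the user that Big Replicate will run as; a short sketch:
ulimit -u    # maximum user processes; expected value 64000
ulimit -n    # maximum open files; expected value 64000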
- System memory and storage
-
Big Replicate's requirements for system resources are split between its component parts: the Big Replicate server, the Inter-Hadoop Communication (IHC) servers, and the Big Replicate UI, all of which can, in principle, be either collocated on the same machine or hosted separately.
The installer will warn you if the system on which you are currently installing Big Replicate falls below the requirements. For more details about the RAM and storage requirements, see the Memory and Storage sections of the Deployment Checklist.
- Compatible Hadoop flavour
-
Big Replicate’s installer confirms that a compatible Hadoop platform is installed. Currently, it takes the Cluster Manager detail provided on the Zone screen and polls the Hadoop Manager (CM or Ambari) for details. The installation can only continue if the Hadoop Manager is running a compatible version of Hadoop.
See the Deployment Checklist for Supported Versions of Hadoop.
- HDFS service state
-
Big Replicate validates that the HDFS service is running. If it is unable to confirm the HDFS state, a warning is given telling you to check the UI logs for possible errors.
See the Logs section for more information.
- HDFS service health
-
Big Replicate validates the overall health of the HDFS service. If the installer is unable to communicate with the HDFS service then you’re told to check the Big Replicate UI logs for any clues.
See the Logs section for more information.
- HDFS maintenance mode
-
Big Replicate looks to see if HDFS is currently in maintenance mode. Both Cloudera Manager and Ambari support this mode for when you need to make changes to your Hadoop configuration or hardware; it suppresses alerts for a host, service, role or, if required, the entire cluster.
- Big Replicate node running as a client
-
We validate that the Big Replicate server is configured as an HDFS client.
- HTTP Server Port
-
Validates whether the port number that you entered is free and can be bound.
- HTTPS Server Port
-
Validates whether the port number that you entered is free and can be bound.
- Fusion DConE Port Validation
-
Validates whether the port number is free and can be bound.
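You can pre-empt these port checks from a terminal; a sketch using the default UI port 8083 as an example:
netstat -tlnp | grep 8083 || echo "port 8083 is free"    # no output from grep means the port is unbound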
6.2. Client Installations
6.2.1. Client Installation with RPMs
The Big Replicate installer doesn’t currently handle the installation of the client to the rest of the nodes in the cluster. You need to go through the following procedure:
In the Client Installation section of the installer you will see the line "Download a list of your client nodes" along with links to the client RPM packages.
RPM package location
If you need to find the packages after leaving the installer page with the link, you can find them in your installation directory, here:
/opt/wandisco/fusion-ui-server/ui/client_packages
If you are installing the RPMs, download and install the package on each of the nodes that appear on the list from step 1.
Installing the client RPM is done in the usual way:
rpm -i <package-name>
6.2.2. Install checks
-
First, we check if we can run hadoop classpath, in order to complete the installation.
-
If we're unable to run hadoop classpath then we check for HADOOP_HOME and run the Hadoop classpath from that location.
-
If the checks cause the installation to fail, you need to export HADOOP_HOME and set it so that the hadoop binary is available at $HADOOP_HOME/bin/hadoop, e.g.
export HADOOP_HOME=/opt/hadoop/hadoop
export HIVE_HOME=/opt/hadoop/hive
export PATH=$HADOOP_HOME/bin:$HIVE_HOME/bin:$PATH
HDP2.1/Ambari 1.6: Start services after installation
When installing clients via RPM into HDP2.1/Ambari 1.6, ensure that you restart services in Ambari before continuing to the next step.
6.2.3. Installation with DEB
Debian not supported
Although Ubuntu uses Debian's packaging system, currently Debian itself is not supported.
Note: Hortonworks HDP does not support Debian.
If you are running with an Ubuntu Linux distribution, you need to go through the following procedure for installing the clients using Debian’s DEB package:
-
In the Client Installation section of the installer you will see the link to the list of nodes here and the link to the client DEB package.
DEB package location
If you need to find the packages after leaving the installer page with the link, you can find them in your installation directory, here:
/opt/wandisco/fusion-ui-server/ui/client_packages
-
To install IBM Big Replicate client, download and install the package on each of the nodes that appear on the list from step 1.
-
You can install it using
sudo dpkg -i /path/to/deb/file
followed by
sudo apt-get install -f
Alternatively, move the DEB file to
/var/cache/apt/archives/
and then run apt-get install <fusion-client-filename.deb>
6.2.4. Client Installation with Parcels
For deployments into Cloudera clusters, clients can be installed using Cloudera’s own packaging format: Parcels.
Parcel Locations
By default, local parcels are stored on the Cloudera Manager Server at /opt/cloudera/parcel-repo. To change this location, follow the instructions in Configuring Server Parcel Settings.
The location can be changed by setting the parcel_dir property in the /etc/cloudera-scm-agent/config.ini file of the Cloudera Manager Agent and restarting the Cloudera Manager Agent, or by following the instructions in Configuring the Host Parcel Directory.
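For illustration, a sketch of the config.ini route, assuming the default repository path:
# in /etc/cloudera-scm-agent/config.ini:
#   parcel_dir=/opt/cloudera/parcel-repo
service cloudera-scm-agent restart    # restart the agent so the new location takes effect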
Don't link to /usr/lib/
The path to the CDH libraries is /opt/cloudera/parcels/CDH/lib instead of the usual /usr/lib. We strongly recommend that you don't link /usr/lib/ elements to parcel deployed paths, as some scripts distinguish between the two paths. |
Installing the parcel
-
Open a terminal session to the location of your parcels repository, it may be your Cloudera Manager server, although the location may have been customized. Ensure that you have suitable permissions for handling files.
-
Download the appropriate parcel and sha for your deployment.
wget "http://fusion.example.host.com:8083/ui/parcel_packages/FUSION-<version>-cdh5.<version>.parcel" wget "http://node01-example.host.com:8083/ui/parcel_packages/FUSION-<version>-cdh5.<version>.parcel.sha"
-
Change the ownership of the parcel and .sha files so that they match the system account that runs Cloudera Manager:
chown cloudera-scm:cloudera-scm FUSION-<version>-cdh5.<version>.parcel*
-
Move the files into the server’s local repository, i.e.
mv FUSION-<version>-cdh5.<version>.parcel* /opt/cloudera/parcel-repo/
-
Open Cloudera Manager and navigate to the Parcels screen.
Click Check for New Parcels.
-
The Big Replicate client package is now ready to distribute.
-
Click on the Distribute button to install IBM Big Replicate from the parcel.
Figure 38. Distribute Parcels -
Click on the Activate button to activate IBM Big Replicate from the parcel.
Figure 39. Distribute Parcels -
The configuration files need redeploying to ensure the Big Replicate elements are put in place correctly. You will need to check Cloudera Manager to see which processes will need to be restarted in order for the parcel to be deployed. Cloudera Manager provides a visual cue about which processes will need a restart.
Important
To be clear, you must restart the services; it is not sufficient to run the "Deploy client configuration" action.
Figure 40. Restarts
Big Replicate uses Hadoop configuration files associated with the Yarn Gateway service and not HDFS Gateway. Big Replicate uses config files under /etc/hadoop/conf and CDH deploys the Yarn Gateway files into this directory.
Replacing earlier parcels?
If you are replacing an existing package that was installed using a parcel, once the new package is activated you should remove the old package through Cloudera Manager. Use the Remove From Host button.
Installing HttpFS with parcels
HttpFS is a server that provides a REST HTTP gateway supporting all HDFS File System operations (read and write), and it is interoperable with the webhdfs REST HTTP API.
While HttpFS runs fine with Big Replicate, there is an issue where it may be installed without the correct class paths being put in place, which can result in errors when running Mammoth test scripts.
Example errors
Running An HttpFS Server Test -- accessing hdfs directory info via curl requests
Start running httpfs test
HTTP/1.1 401 Unauthorized
Server: Apache-Coyote/1.1
WWW-Authenticate: Negotiate
Set-Cookie: hadoop.auth=; Path=/; Expires=Thu, 01-Jan-1970 00:00:00 GMT; HttpOnly
Content-Type: text/html;charset=utf-8
Content-Length: 997
Date: Thu, 04 Feb 2016 16:06:52 GMT

HTTP/1.1 500 Internal Server Error
Server: Apache-Coyote/1.1
Set-Cookie: hadoop.auth="u=oracle&p=oracle/bdatestuser@UATBDAKRB.COM&t=kerberos&e=1454638012050&s=7qupbmrZ5D0hhtBIuop2+pVrtmk="; Path=/; Expires=Fri, 05-Feb-2016 02:06:52 GMT; HttpOnly
Content-Type: application/json
Transfer-Encoding: chunked
Date: Thu, 04 Feb 2016 16:06:52 GMT
Connection: close

{"RemoteException":{"message":"java.lang.ClassNotFoundException: Class com.wandisco.fs.client.FusionHdfs not found","exception":"RuntimeException","javaClassName":"java.lang.RuntimeException"}}
6.2.5. Fusion Client installation with HDP Stack / Pivotal HD / IBM BigInsights
For deployments into a Hortonworks HDP/Ambari/IBM BigInsights cluster, version 1.7 or later, clients can be installed using Hortonworks' own packaging format: HDP Stack. This approach also works for Pivotal HD.
Ambari 1.6 and earlier
If you are deploying with Ambari 1.6 or earlier, don't use the provided Stacks; instead use the generic RPMs.
Ambari 1.7
If you are deploying with Ambari 1.7, take note of the requirement to perform some necessary restarts on Ambari before completing an installation.
Ambari 2.0
When adding a stack to Ambari 2.0 (any stack, not just the Big Replicate client) there is a bug which causes the YARN parameter yarn.nodemanager.resource.memory-mb to reset to a default value for the YARN stack. This may result in the Java heap dropping from a manually-defined value back to a low default value (2 GB). Note that this issue is fixed from Ambari 2.1.
Upgrading Ambari
When running Ambari prior to 2.0.1, we recommend that you remove and then reinstall the Big Replicate stack if you perform an update of Ambari. Prior to version 2.0.1, an upgraded Ambari refuses to restart the WD Big Replicate stack because the upgrade may wipe out the added services folder on the stack.
If you perform an Ambari upgrade and the Ambari server fails to restart, the workaround is to copy the Big Replicate service directory from the old to the new directory, so that it is picked up by the new version of Ambari, e.g.:
cp -R /var/lib/ambari-server/resources/stacks_25_08_15_21_06.old/HDP/2.2/services/FUSION /var/lib/ambari-server/resources/stacks/HDP/2.2/services
Again, this issue doesn’t occur once Ambari 2.0.1 is installed.
HDP 2.3/Ambari 2.1.1 install
There's currently a problem that can block the installation of the WD Fusion client stack. If the installation of the client service gets stuck at the "Customize Service" step, you may need to use a workaround:
-
If possible, restart the sequence again. If that option is not available, because the Next button is disabled or doesn't work, try the next workaround.
-
Try installing the client RPMs.
-
Install the Big Replicate client service manually, using the Ambari API.
6.2.6. Install & Start the service via Ambari’s API
Make sure the service components are created and the configurations attached by making a GET call, e.g.
http://<ambari-server-host>:8080/api/v1/clusters/<cluster-name>/services/<service-name>
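For example, a sketch of that GET call with curl, using the same placeholders as the steps below and the FUSION service name:
curl -u <username>:<password> -H "X-Requested-By: ambari" http://<ambari-server-host>:8080/api/v1/clusters/<cluster-name>/services/FUSION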
1. Add the service
curl -u <username>:<password> -H "X-Requested-By: ambari" http://<ambari-server-host>:8080/api/v1/clusters/<cluster-name>/services -d '{"ServiceInfo":{"service_name":"FUSION"}}'
2. Add the component
curl -u <username>:<password> -H "X-Requested-By: ambari" http://<ambari-server-host>:8080/api/v1/clusters/<cluster-name>/services/FUSION/components/FUSION_CLIENT -X POST
3. Get a list of the hosts
curl -u <username>:<password> -H "X-Requested-By: ambari" http://<ambari-server-host>:8080/api/v1/clusters/<cluster-name>/hosts/
4. For each of the hosts in the list, add the FUSION_CLIENT
component
curl -u <username>:<password> -H "X-Requested-By: ambari" http://<ambari-server-host>:8080/api/v1/clusters/<cluster-name>/hosts/<host-name>/host_components/FUSION_CLIENT -X POST
5. Install the FUSION_CLIENT component
curl -u <username>:<password> -H "X-Requested-By: ambari" http://<ambari-server-host>:8080/api/v1/clusters/<cluster-name>/services/FUSION/components/FUSION_CLIENT -X PUT -d '{"ServiceComponentInfo":{"state": "INSTALLED"}}'
6.2.7. Installing the IBM service into your HDP Stack
-
Download the service from the installer client download panel, or after the installation is complete, from the client packages section on the Settings screen.
-
The service is a .gz file (e.g. fusion-hdp-2.2.0-2.4_SNAPSHOT.stack.tar.gz) that will expand to a folder called /FUSION.
-
For HDP, place this folder in /var/lib/ambari-server/resources/stacks/HDP/<version-of-stack>/services.
For Pivotal HD, place it in one of the following or similar folders: /var/lib/ambari-server/resources/stacks/PHD/<version-of-stack>/services, or /var/lib/ambari-server/resources/stacks/<distribution>/<version-of-stack>/services.
-
Restart the ambari-server
service ambari-server restart
-
After the server restarts, go to + Add Service.
Figure 41. Ambari Service restart -
Choose Service, scroll to the bottom.
Figure 42. Add service -
Tick the IBM Big Replicate service checkbox. Click Next.
Figure 43. Checkbox -
Datanodes and node managers are automatically selected. You must ensure that all servers are ticked as "Client"; by default only the local node is ticked. Then click Next.
Assign Slaves and Clients. Add all the nodes as "Client"
-
Deploy the changes.
Figure 44. Deploy -
Install, Start and Test.
-
Review Summary and click Complete.
Known bug (AMBARI-9022): Installation of Services can remove Kerberos settings
During the installation of services via stacks, it is possible that Kerberos configuration can be lost. This has been seen to occur on Kerberized HDP2.2 clusters when installing Kafka or Oozie. Kerberos configuration in the core-site.xml file was removed during the installation, which resulted in all HDFS / Yarn instances being unable to restart. You will need to re-apply your Kerberos settings in Ambari, etc.
For more details, see AMBARI-9022.
6.3. Removing a Big Replicate client stack
When we use the "Deploy Stack" button it can on rare occasions fail. If it does you can recover the situation with the following procedure, which involves removing the stack, then adding it again using Ambari’s "Add New Service" wizard.
-
Send these two curl calls to Ambari:
curl -u admin:admin -X PUT -d '{"RequestInfo":{"context":"Stop Service"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' http://<manager_hostname>:<manager_port>/api/v1/clusters/<cluster_name>/services/FUSION -H "X-Requested-By: admin"
curl -u admin:admin -X DELETE http://<manager_hostname>:<manager_port>/api/v1/clusters/<cluster_name>/services/FUSION -H "X-Requested-By: admin"
-
Now remove the client from each node:
yum erase <the client>
rm -rf /opt/wandisco/fusion/client/
-
Restart ambari-server using the following command on the manager node:
ambari-server restart
-
Finally, add the service using Ambari’s Add Service Wizard.
6.3.1. MapR Client Configuration
On MapR clusters, you need to copy Big Replicate configuration onto all other nodes in the cluster:
-
Open a terminal to your Big Replicate node.
-
Navigate to
/opt/mapr/hadoop/<hadoop-version>/etc/hadoop
. -
Copy the
core-site.xml
andyarn-site.xml
files to the same location on all other nodes in the cluster. -
Now restart HDFS, and any other service that indicates that a restart is required.
6.3.2. MapR Impersonation
Enable impersonation when cluster security is disabled
Follow these steps on the client to configure impersonation without enabling cluster security.
-
Enable impersonation for all relevant components in your ecosystem. See the MapR documentation - Component Requirements for Impersonation.
-
Enable impersonation for the MapR core components:
The following steps will ensure that MapR will have the necessary permissions on your Hadoop cluster:-
Open the
core-site.xml
file in a suitable editor. -
Add the following hadoop.proxyuser properties:
<property>
  <name>hadoop.proxyuser.mapr.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.mapr.groups</name>
  <value>*</value>
</property>
Note: The wildcard asterisk * lets the "mapr" user connect from any host and impersonate any user in any group.
-
Check that your settings are correct, save and close the
core-site.xml
file.
-
-
On each client system on which you need to run impersonation:
-
Set a MAPR_IMPERSONATION_ENABLED environment variable with the value true. This value must be set in the environment of any process you start that does impersonation. E.g.
export MAPR_IMPERSONATION_ENABLED=true
-
Create a file in /opt/mapr/conf/proxy/ that has the name of the mapr superuser. The default file name would be mapr. To verify the superuser name, check the mapr.daemon.user= line in the /opt/mapr/conf/daemon.conf file on a MapR cluster server (see the sketch after this list).
-
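As a minimal sketch of the steps above, assuming the default superuser name mapr:
grep "mapr.daemon.user=" /opt/mapr/conf/daemon.conf    # confirm the superuser name
touch /opt/mapr/conf/proxy/mapr    # create the (empty) proxy file named after it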
6.4. Removing IBM Service
If you are removing Big Replicate, maybe as part of a reinstallation, you should remove the client packages as well. Ambari never deletes any services from the stack; it only disables them. If you remove the WD Big Replicate service from your stack, remember to also delete fusion-client.repo.
[WANdisco-fusion-client]
name=IBM Big Replicate Client repo
baseurl=file:///opt/wandisco/fusion/client/packages
gpgcheck=0
For instructions on cleaning up the Stack, see Host Cleanup for Ambari and Stack.
6.4.1. Cleanup Big Replicate HD
The following section is used when preparing to install Big Replicate on a system that already has an earlier version of Big Replicate installed. Before you install an updated version of Big Replicate you need to ensure that components and configuration from an earlier installation have been removed. Go through the following steps before installing a new version of Big Replicate:
-
On the production cluster, run the following curl to remove the service:
curl -su <user>:<password> -H "X-Requested-By: ambari" http://<ambari-server>:<ambari-port>/api/v1/clusters/<cluster>/services/FUSION -X DELETE
-
On ALL nodes, run the corresponding package manager command to remove the client package, e.g.
yum remove fusion-hdp-x.x.x-client
-
Remove all remnant Big Replicate directories from services/. These left-over files can cause problems if you come to reinstall, so it is worth checking places like /var/lib/ambari-agent/ and /opt/wandisco/fusion. Ensure the removal of /etc/yum.repos.d/fusion-client.repo; if it is left in place it will prevent the next installation of Big Replicate.
6.4.2. Uninstall Big Replicate
There’s currently no uninstall function for our installer, so the system will have to be cleaned up manually. If you used the unified installer then use the following steps:
To uninstall all of Big Replicate:
-
Remove the packages on the Big Replicate node:
yum remove -y "fusion-*"
-
Remove the jars, logs, configs:
rm -rf /opt/wandisco/ /etc/wandisco/ /var/run/fusion/ /var/log/fusion/
Cloudera Manager:
-
Go to "Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml"
-
Delete all Fusion-related content
-
Remove Big Replicate parcel
-
Restart services
Ambari
-
Go to HDFS → Configs → Advanced → Custom core-site
-
Delete all Big Replicate-related elements
-
Remove stack (See Removing IBM Service)
-
Remove the package from all clients, e.g.
yum remove -y fusion*client*.rpm
-
Restart services
6.4.3. Core-site properties to delete
For a complete uninstallation, remove the following properties from the core-site.xml:
-
fs.fusion.server (If removing a single node from a zone, remove just that node from the property’s value, instead).
-
fs.hdfs.impl (its removal ensures that this native hadoop class is used, e.g.
org.apache.hadoop.hdfs.DistributedFileSystem
). -
fs.fusion.impl
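After editing core-site.xml, you can confirm the properties are gone; a sketch using hdfs getconf (a missing key returns an error rather than a value):
hdfs getconf -confKey fs.fusion.server
hdfs getconf -confKey fs.fusion.impl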
Reinstalling Big Replicate server only
If you reinstall the fusion-server without also reinstalling the fusion-ui-server, then you should restart the fusion-ui-server service to ensure the correct function of some parts of the UI. If the service is not restarted then you may find that the dashboard graphs stop working properly, along with the UI's stop/start controls. E.g. run:
[root@redhat6 init.d]# service fusion-ui-server restart
7. Cloud Installation
The following section covers the installation of IBM Big Replicate into a cloud / hybrid-cloud environment.
7.1. Installing IBM Big Replicate
This section runs through the installation of the companion installer that is specific to an IBM Big Replicate deployment.
Installing into a deployment that runs Hive - See IBM Hive Metastore plugin
-
Download the archive ibm-biginsights-installer-bundle.tar.gz from IBM’s FD website, along with the appropriate Big Replicate installer file.
-
Extract the ibm-biginsights-installer-bundle.tar.gz onto your node server. Copy the Big Replicate installer to the same location.
-
Navigate to the extracted files.
-
Copy the .swidtag file to <PRODUCT-LOCATION>/iso-swid/ i.e.:
[vagrant@dc00-vm1 ~]$ ls -la /opt/wandisco/fusion-ui-server/iso-swid/ total 4 -rw-r--r-- 1 1001 1001 682 Jun 16 10:33 ibm.com_IBM_Big_Replicate-2.0.0.swidtag
There can be only one (installer file)
Important: Ensure that only the one, relevant, Fusion install .sh file is present in the IBM location into which you extracted the tarball. Currently, if there is more than one installer in the location, the IBM install script will use the first one that it finds - probably the oldest one. We'll address this issue in a future release. -
Run the IBM Big Replicate installer:
./ibmBigInsightsInstallScript.sh
You can optionally pass an argument to the script which will set the location where the IBM SLM tags will be written. E.g.
./ibmBigInsightsInstallScript.sh /opt/some/other/location
-
If no argument is provided, the location defaults to /var/ibm/slmtags.
-
The location will be created if it does not already exist.
IBM SLM Tagging
Read our guide on how to interpret IBM SLM tagging - SLM Tagging for IBM Big Replicate
-
-
Run through the installer:
[root@dc00-vm1 IBMTEST]# ll total 703112 -rw-r--r-- 1 vagrant vagrant 718215802 Jun 16 10:28 fusion-cdh-nothive-33.sh -rw-r--r-- 1 vagrant vagrant 1763050 Jun 16 10:26 ibm-biginsights-installer-bundle.tar.gz -rwxr-xr-x 1 vagrant vagrant 1346 Jun 16 10:23 ibmBigInsightsInstallScript.sh drwxr-xr-x 1 vagrant vagrant 154 Jun 16 10:18 lib [root@dc00-vm1 IBMTEST]# ./ibmBigInsightsInstallScript.sh LICENSE INFORMATION The Programs listed below are licensed under the following License Information terms and conditions in addition to the Program license terms previously agreed to by Client and IBM. If Client does not have previously agreed to license terms in effect for the Program, the International Program License Agreement (Z125-3301-14) applies. Program Name: IBM Big Replicate 2.0 Program Number: 5737-A55 As described in the International Program License Agreement ("IPLA") and this License Information, IBM grants Licensee a limited right to use the Program. This right is limited Press Enter to continue viewing the license agreement, or enter "1" to accept the agreement, "2" to decline it, "3" to print it, or "99" to go back to the previous screen.
Enter " 1" to continue.
-
The installer will first perform a health check and confirm that there is sufficient Java heap to support an installation.
Installing Big Replicate Verifying archive integrity... All good. Uncompressing IBM Big Replicate........................ Welcome to the IBM Big Replicate installation You are about to install IBM Big Replicate version 2.10-393 Do you want to continue with the installation? (Y/n)
Enter " Y" to continue.
-
The installer checks that both Perl and Java are installed on the system.
Checking prerequisites: Checking for perl: OK Checking for java: OK INFO: Using the following Memory settings for the IBM Big Replicate Admin UI process: INFO: -Xms128m -Xmx512m Do you want to use these settings for the installation? (Y/n)
Enter " Y" or "N" if you wish to set different Java heap settings.
-
The installer asks you to confirm which TCP port will be used for accessing the Fusion/Big Replicate web UI, the default is "8083".
Which port should the UI Server listen on? [8083]:
Please specify the appropriate platform from the list below:
[0] ibm-4.0
[1] ibm-4.1
[2] ibm-4.2
Which Big Replicate platform do you wish to use? 1
You chose ibm-4.2:3.2.1
Select from the available Hadoop packages.
-
Next, you set the system user, group for running the application.
We strongly advise against running Big Replicate as the root user.
For default CDH setups, the user should be set to 'hdfs'. However, you should choose a user appropriate for running HDFS commands on your system.
Which user should Big Replicate run as? [hdfs]
Checking 'hdfs' ...
 ... 'hdfs' found.
Please choose an appropriate group for your system. By default CDH uses the 'hdfs' group.
Which group should Big Replicate run as? [hdfs]
Checking 'hdfs' ...
 ... 'hdfs' found.
You should press enter to go with the default "hdfs".
-
You will now be shown a summary of the settings that you have provided so far:
Installing with the following settings:
User and Group: hdfs:hadoop
Hostname: ip-10-11-0-40
Big Replicate Admin UI Listening on: 0.0.0.0:8083
Big Replicate Admin UI Minimum Memory: 128
Big Replicate Admin UI Maximum memory: 512
Platform: ibm-4.2 (2.10)
Big Replicate Server Hostname and Port: ip-10-12-0-40:8082
Do you want to continue with the installation? (Y/n)
Enter " Y" unless you need to make changes to any of the settings.
-
The installation will now complete:
Adding the user hdfs to the hive group if the hive group is present.
Installing ibm-4.1 packages:
fusion-hcfs-ibm-4.1-server-2.9_RC7.el6-1925.noarch.rpm ... Done
fusion-hcfs-ibm-4.1-ihc-server-2.9_RC7.el6-1925.noarch.rpm ... Done
Installing plugin packages:
Installing fusion-ui-server package:
fusion-ui-server-2.9-74.noarch.rpm ... Done
-
Once the installation has completed, you need to configure the WD Big Replicate server using the browser-based UI.
Starting fusion-ui-server: [ OK ]
Checking if the GUI is listening on port 8083: .....Done
Please visit http://your.hostname.server.com:8083/ to complete installation of IBM Big Replicate
If 'your.hostname.server.com' is internal or not available from your browser, replace this with an externally available address to access it.
Stopping fusion-ui-server: [ OK ]
Starting fusion-ui-server: [ OK ]
Open a browser and enter the provided URL, or IP address.
-
Follow the guide for running the configuration steps here - Configure Big Replicate through a browser
Known Issue:
If you are deploying IBM Big Replicate / Big Replicate into a BigInsights environment you must make the following changes to your Hive configuration. The changes should apply even if you are using regular Hive, rather than IBM's Hive Metastore replication plugin. These changes disable caching. Without the changes in place, Hive operation will quickly grind to a halt. |
Set the following 2 Hive configurations to true:
fs.hdfs.impl.disable.cache=true
fs.file.impl.disable.cache=true
- fs.hdfs.impl.disable.cache
-
Disable HDFS filesystem cache. Default: "false".
- fs.file.impl.disable.cache
-
Disable local filesystem cache. Default: "false".
These are both properties in the hive-site configuration file and can be changed and saved through the Ambari UI.
7.2. Swift Installation
7.2.1. Installing into IBM Openstack/Swift storage
This section runs through the installation of Big Replicate into an IBM Openstack environment using Swift storage. Currently this deployment is limited to an active-passive configuration that would be used to ingest data from your on-premises cluster to your Swift storage.
7.2.2. Pre-requisites
Before you begin an installation to an Openstack Swift cluster make sure that you have the following directories created and suitably permissioned. Examples:
Important!
For installations to IBM Openstack/Swift storage, we currently only support Keystone 3.0. |
7.2.3. Overview
The installation process runs through the following steps:
-
On-premises installation - installing a Big Replicate node on your cluster
-
Swift storage node installation - the second node can be installed onto a VM situated on OpenStack, or an on-premises server.
-
Setting up replication - Configure the nodes to ingest data from the on-premises cluster to the OpenStack Swift storage.
-
Silent Installation - Notes on automating the installation process.
-
Parallel Repairs - Running initial repairs in parallel.
7.2.4. Install Node for Swift storage
Follow this section to complete the installation by configuring Big Replicate on a server that will place data replicated from your on-premises cluster into your OpenStack Swift storage. This second node can also be on-premises or co-located with your OpenStack platform.
Open a web browser and point it at the provided URL, e.g.
http://<YOUR-SERVER-ADDRESS>.com:8083/
-
In the first "Welcome" screen you’re asked to choose between Create a new Zone and Add to an existing Zone.
Make your selection as follows:- Adding a new Big Replicate cluster
-
Select Add Zone.
- Adding additional Big Replicate servers to an existing Big Replicate cluster
-
Select Add to an existing Zone.
Figure 45. Welcome screen
-
Run through the installer’s detailed Environment checks. For more details about exactly what is checked in this stage, see Environmental Checks in the Appendix.
Figure 46. Environmental checks -
On clicking Validate, the installer will run through a series of checks of your system's hardware and software setup and warn you if any of Big Replicate's prerequisites are not going to be met.
Figure 47. Example check results
Address any failures before you continue the installation. Warnings may be ignored for the purposes of completing the installation, especially if the installation is only for evaluation purposes and not for production. However, when installing for production, you should address all warnings, or at least take note of them and exercise due care if you continue the installation without resolving and revalidating.
-
Upload the license file.
Figure 48. Upload your license file -
The conditions of your license agreement will be presented in the top panel, including License Type, Expiry date, Name Node Limit and Data Node Limit.
Figure 49. Verify license and agree to subscription agreement
Click I agree to the EULA to continue, then click Next Step.
-
Enter settings for the Big Replicate server. See Big Replicate Server for more information about what is entered during this step.
-
In step 5 the zone information is added.
Figure 50. Swift Install 1
7.2.5. Zone Information
- Fully Qualified Domain Name
-
The full hostname for the server.
- Node ID
-
A unique identifier that will be used by Big Replicate UI to identify the server.
- DConE Port
-
TCP port used by Big Replicate for replicated traffic.
- Zone Name
-
The name used to identify the zone in which the server operates.
7.2.6. Swift Information
Some of the required information can be gathered from the Bluemix UI, in the Service Credentials section:
- User ID
-
The unique ID for the Swift user
- Password
-
The password for the Swift user
Swift password changes
During installation, the Swift password is encrypted for use with Big Replicate. This process doesn't require any further interaction except in the case where the Swift password is changed. If you change your Swift password you need to do the following:
-
Open a terminal to the Big Replicate node and navigate to /opt/wandisco/fusion/server.
-
Run the following script:
./encrypt-password.sh
Please enter the password to be encrypted
Enter your Swift password and press return:
> password
eCefUDtgyYczh3wtX2DgKAvXOpWAQr5clfhXSm7lSMZOwLfhG9YdDflfkYIBb7psDg3SlHhY99QsHlmr+OBvNyzawROKTd/nbV5g+EdHtx/J3Ulyq3FPNs2xrulsbpvBb2gcRCeEt+A/4O9K3zb3LzBkiLeM17c4C7fcwcPAF0+6Aaoay3hug/P40tyIvfnVUkJryClkENRxgL6La8UooxaywaSTaac6g9TP9I8yH7vJLOeBv4UBpkm6/LdiwrCgKQ6mlwoXVU4WtxLgs4UKSgoNGnx5t8RbVwlrMLIHf/1MFbkOmsCdij0eLAN8qGRlLuo4B4Ehr0mIoFu3DWKuDw==
[ec2-user@ip-172-29-0-158 server]$
-
Place the re-encrypted password in core-site.xml and application.properties.
- Auth URL
-
The URL required for authenticating against Swift.
- Swift Container Name
-
The name of the Swift storage container that Big Replicate will be connecting to.
- Project Id
-
The Bluemix project ID.
- Domain Name
-
The Swift Domain Name.
- Segment Container
-
The name of the Segment container. The Segment container is used where large files break Swift’s 5GB limit for object size. Objects that exceed 5GB are broken into segments and get stored in here.
- Region
-
The Swift Object Storage Region. Not to be confused with the Bluemix region.
7.2.7. Swift Validation
- Authorization URL reachable
-
Checks that you can reach the Keystone v3.0 authorization URL.
- Account valid
-
The installer checks that the Swift account details are valid. If the validation fails, you should recheck your Swift account credentials.
- Container valid
-
The installer confirms that a container with the provided details exists. If the validation fails, check that you have provided the right container name.
- Container readable
-
The container is checked to confirm that it can be read. If the validation fails, check the permissions on the container.
- Container writable
-
The container is checked to confirm that the container can be written to. If the validation fails, check the permissions on the container.
- Segment Account valid
-
The installer checks that the Swift account details are valid for accessing the segment container. If the validation fails, you should recheck your Swift account credentials.
- Segment Container valid
-
The installer confirms that a segment container with the provided details exists. If the validation fails, check that you have provided the right segment container name.
- Segment Container readable
-
The container is checked to confirm that it can be read. If the validation fails, check the permissions on the segment container.
- Segment Container writable
-
The container is checked to confirm that the container can be written to. If the validation fails, check the permissions on the segment container.
- Segment Account writable
-
The Account is checked to confirm that it can be written to. If the validation fails, check the permissions on the segment account.
-
Authentication credentials that will be used to access the Big Replicate UI. When deploying Big Replicate under a Hadoop management layer such as Cloudera Manager or Ambari, you would use the same credentials as that manager. In this case we're running without a separate manager, so we need to provide our own username and password.
-
- Username
-
A username that will be used for accessing the Big Replicate UI.
- Password
-
The corresponding password for use with the username, when logging into the Big Replicate UI.
-
The summary screen lists all the configuration that has been entered so far, during the installation. You can check your entries by clicking on each category on the left-side menu. Click Next Step.
-
You can ignore the next step; click Next Step. This step is reserved for deployments where HDFS clients need to be installed. These are not required when using Big Replicate to replicate data into a cloud storage solution.
-
It’s now time to Start up the Big Replicate server. Click Start Big Replicate. The Big Replicate server will now start up.
-
The final step is Induction. This will connect this second node to your existing "on-premises" node. When adding a node to an existing zone, users will be prompted for zone details at the start of the installer and induction will be handled automatically. Nodes added to a new zone will have the option of being inducted at the end of the install process where the user can add details of the remote node.
Enter the following details then Click Start Induction.
-
- Fully Qualified Domain Name
-
The full address of the existing on-premises node.
- Big Replicate Server Port
-
The TCP port on which the on-premises node is running. Default: 8082
7.2.8. Setting up replication
It’s now time to demonstrate data replication between the on-premises cluster and the IBM OpenStack / Swift storage. First we need to perform a synchronization to ensure that the data stored in both zones is in exactly the same state.
7.2.9. Synchronization
You can synchronize data in both directions:
- Synchronize from on-premises to the Swift node zone
-
Login to the on-premises Big Replicate UI.
The following guide covers the replication from on-premises to the OpenStack/Swift node.
-
Login to the on-premises Big Replicate UI and click on the Replicated Folders tab.
-
Click on the Create button to set up a folder on the local system.
Navigate the HDFS File Tree (1), on the right-hand side of the New Rule panel, to select your target folder, created in the previous step. The selected folder will appear in the Path entry field. You can, instead, type or copy the full path to the folder into the Path entry field.
Next, select both zones from the Zones list (2). You can leave the default membership in place. This will replicate data between the two zones.
More about Membership
Read about Membership in the Big Replicate User Guide - Managing Replication.IMPORTANT: files not appearing in the Swift store file tree
If you upload files to a Swift store using the Swift client, it is possible to exploit Swift’s pseudo-file structure, placing a file in a subdirectory that isn’t mapped to the file system. While this works internally, folders that exist in this state will not be visible to WD Big Replicate and so can’t be viewed in the Big Replicate Rule file tree or set for replication.workaround
When uploading files using the Swift client, ensure that you add a trailing slash, e.g.swift upload [container name] [directory name]"/"
Folders that are uploaded in this way will be visible in the File Tree.
-
When you first create the folder you may notice status messages for the folder indicating that the system is preparing the folder for replication. Wait until all pending messages are cleared before moving to the next step.
-
Now that the folder is set up it is likely that the file replicas between both zones will be in an inconsistent state, in that you will have files on the local (on-premises) zone that do not yet exist in the Swift store. Click on the Inconsistent link in the Big Replicate UI to address these.
The consistency report will show you the number of inconsistencies that need correction. We will use bulk resolve to do the first replication.
See the Appendix for more information on improving performance of your first synch and resolving individual inconsistencies if you have a small number of files that might conflict between zones - Running initial repairs in parallel
-
Click on the dropdown selector entitled Bulk resolve inconsistencies to display the options that determine synch direction. Choose the zone that will be used for the source files. Tick the check box Preserve extraneous files so that files are not deleted if they don't exist in the source zone. The system will begin the file transfer process.
-
We will now verify the file transfers. Login to the Big Replicate UI on the HDI instance. Click on the Replicated Folders tab. In the File Transfers column, click the View link.
By checking off the boxes for each status type, you can report on files that are:
* In progress
* Incomplete
* Complete
No transfers in progress?
You may not see files in progress if they are very small, as they tend to clear before the UI polls for in-flight transfers. -
Congratulations! You have successfully installed, configured, replicated and monitored data transfer with IBM Big Replicate.
7.2.10. Swift Silent Installation
You can complete an IBM Swift installation using the Silent Installation procedure, putting the necessary configuration in the swift_silent_installer.properties file, as described in the section that covers Silent Installation.
7.2.11. Swift-specific settings
Environment Variables Required for Swift deployments:
###############################
# Swift Configuration
###############################
# Swift installation mode
# REQUIRED for Swift installations. Defaults to false
swift.installation.mode=true
# The Swift container name to use
# REQUIRED for Swift installations.
swift.containerName=
# The Swift username to use
# REQUIRED for Swift installations.
swift.username=
# The Swift password to use
# REQUIRED for Swift installations.
swift.password=
# The Swift auth URL to use for authenticating access to the storage
# REQUIRED for Swift installations.
swift.auth.url=
# The Swift tenant name to use
# Optional, for Swift installations.
swift.tenantName=
# The Swift tenant id to use
# Optional, for Swift installations.
swift.tenantId=
-
FUSIONUI_MANAGER_TYPE=UNMANAGED_SWIFT
-
FUSIONUI_INTERNALLY_MANAGED_USERNAME
-
FUSIONUI_INTERNALLY_MANAGED_PASSWORD
-
FUSIONUI_FUSION_BACKEND_CHOICE
-
FUSIONUI_USER
-
FUSIONUI_GROUP
-
SILENT_PROPERTIES_PATH
silent_installer.properties file additional settings, or specific required values, are listed here:
swift.installation.mode=true
swift.containerName=container1
etc.
kerberos.enabled=false (or unspecified)
7.2.12. Example Installation
As an example (as root), running with the installer moved to /tmp:
# If necessary download the latest installer and make the script executable
chmod +x /tmp/installer.sh
# You can reference an original path to the license directly in the silent properties but note the requirement for being in a location that is (or can be made) readable for the $FUSIONUI_USER
# The following is partly for convenience in the rest of the script
cp /path/to/valid/license.key /tmp/license.key
# Create a file to encapsulate the required environmental variables:
cat <<EOF> /tmp/swift_silent_installer_env.sh
export FUSIONUI_MANAGER_TYPE=UNMANAGED_SWIFT
export FUSIONUI_INTERNALLY_MANAGED_USERNAME=admin
export FUSIONUI_FUSION_BACKEND_CHOICE=
export FUSIONUI_USER=hdfs
export FUSIONUI_GROUP=hdfs
export SILENT_PROPERTIES_PATH=/tmp/swift_silent.properties
export FUSIONUI_INTERNALLY_MANAGED_PASSWORD=admin
EOF
# Create a silent installer properties file - this must be in a location that is (or can be made) readable for the $FUSIONUI_USER:
cat <<EOF > /tmp/swift_silent.properties
existing.zone.domain=
existing.zone.port=
license.file.path=/tmp/license.key
server.java.heap.max=4
ihc.server.java.heap.max=4
fusion.domain=my.s3bucket.fusion.host.name
fusion.server.dcone.port=6444
fusion.server.zone.name=twilight
swift.installation.mode=true
swift.container.name=container-name
induction.skip=false
induction.remote.node=my.other.fusion.host.name
induction.remote.port=8082
EOF
# If necessary (when $FUSIONUI_GROUP is not the same as $FUSIONUI_USER and the group is not already created), create the $FUSIONUI_GROUP (the group that our various servers will be running as):
[[ "$FUSIONUI_GROUP" = "$FUSIONUI_USER" ]] || groupadd hadoop
# If necessary, create the $FUSIONUI_USER (the user that our various servers will be running as):
if [[ "$FUSIONUI_GROUP" = "$FUSIONUI_USER" ]]; then
useradd $FUSIONUI_USER
else
useradd -g $FUSIONUI_GROUP $FUSIONUI_USER
fi
# The silent properties file and the license key *must* be accessible to the created user as the silent installer is run by that user
chown $FUSIONUI_USER:$FUSIONUI_GROUP /tmp/swift_silent.properties /tmp/license.key
# Source the environment file to populate the environment
. /tmp/swift_silent_installer_env.sh
# If you want to make any final checks of the environment variables, the following command can help - sorted to make it easier to find variables!
env | sort
# Run installer:
/tmp/installer.sh
7.2.13. Running initial repairs in parallel
If you have a large folder you can parallelize the initial repair using the Fusion API. This can be accomplished on a single file or a whole directory. Choosing a directory will push all files from the source to the target regardless of existence at the target.
Consider the following directory structure for a Big Replicate replicated folder /home:
/home
/home/fileA
/home/fileB
/home/userDir1
/home/userDir2
/home/userDir3
We could run a bulk resolve in the UI against the /home directory, however, to provide parallelism of the repair operations we can use the Big Replicate API to issue repairs against each folder and the individual files in the /home folder.
7.2.14. Example - Multiple API Calls using curl
curl -X PUT "FUSION_NODE:8082/fusion/fs/repair?path=/home/userDir1&recursive=true&src=LocalFS" curl -X PUT "FUSION_NODE:8082/fusion/fs/repair?path=/home/userDir2&recursive=true&src=LocalFS" curl -X PUT "FUSION_NODE:8082/fusion/fs/repair?path=/home/userDir3&recursive=true&src=LocalFS" curl -X PUT "FUSION_NODE:8082/fusion/fs/repair?path=/home/fileA&recursive=false&src=LocalFS" curl -X PUT "FUSION_NODE:8082/fusion/fs/repair?path=/home/fileB&recursive=false&src=LocalFS"
This will spawn simultaneous repairs increasing the performance of the initial synchronization. This is especially helpful when you have small file sizes to better saturate the network.
For files, the recursive parameter is ignored.
You can use the file transfers view in the Big Replicate UI on the OpenStack-replicating node to monitor the incoming files.
7.2.15. Repairing individual folders in the UI
You can use the Fusion Web UI to selectively choose which files to repair when you have a small number of files that exist on both sides and a decision needs to be made as to which one is the source of truth.
-
In the UI on the Replicated Folders tab click the Inconsistent link in the Consistency column to get to the Consistency Report.
-
If the list of files is small you’ll be presented with a list. If it is longer than 100 files you will need to click Show All Inconsistencies. Note that you can still bulk resolve these.
-
For each file, you can choose the Zone that is the source and click resolve.
-
You will be prompted with a confirmation button.
-
After clicking resolve, you will see a Fix Requested message. You can check the file transfers view in the target zone's UI if you want to verify the repair.
8. Administration Guide
This Admin Guide describes how to set up and use IBM’s Big Replicate.
8.1. Housekeeping
This section covers basic operations for running a Big Replicate deployment, including commands and tools that allow you to set up and maintain replicated directories.
8.1.1. Starting up
To start Big Replicate UI:
-
Open a terminal window on the server and log in with suitable file permissions.
-
Locate the fusion-ui-server service script in the /etc/init.d folder:
rwxrwxrwx 1 root root 47 Apr 10 16:05 fusion-ui-server -> /opt/wandisco/fusion-ui-server/bin/fusion-ui-server
-
Run the script with the start command:
[root@localhost init.d]# ./fusion-ui-server start
Starting fusion-ui-server:.                                [ OK ]
Big Replicate starts. Read more about the fusion-ui-server init.d script.
-
You can also invoke the service directly, e.g.
service fusion-ui-server stop/start
8.1.2. Shutting down
To shut down:
-
Open a terminal window on the server and log in with suitable file permissions.
-
Locate the Big Replicate UI service script in the init.d folder:
rwxrwxrwx 1 root root 47 Dec 10 16:05 fusion-ui-server -> /opt/wandisco/fusion-ui-server/bin/fusion-ui-server
-
Run the stop script:
[root@redhat6 init.d]# ./fusion-ui-server stop
stopping fusion-ui-server:                                 [ OK ]
[root@redhat6 init.d]#
The process shuts down.
Shutdowns take some time
The shutdown script attempts to stop processes in order before completing; as a result (from Big Replicate 2.1.3), shutdowns may take up to a minute to complete.
8.1.3. init.d management script
The start-up script for persistent running of Big Replicate is in the /etc/init.d folder. Run the script with the help command to list the available commands:
[root@redhat6 init.d]# service fusion-ui-server help
usage: ./fusion-ui-server (start|stop|restart|force-reload|status|version)

start          Start Fusion services
stop           Stop Fusion services
restart        Restart Fusion services
force-reload   Restart Fusion services
status         Show the status of Fusion services
version        Show the version of Fusion
Check the running status (with current process ID):
[root@redhat6 init.d]# service fusion-ui-server status
Checking delegate:not running                              [ OK ]
Checking ui:running with PID 17579                         [ OK ]
Check the version:
[root@redhat6 init.d]# service fusion-ui-server version
1.0.0-83
8.2. Managing cluster restarts
Big Replicate’s replication system is deeply tied to the cluster’s file system (HDFS). If HDFS is shut down, the Big Replicate server will no longer be able to write to HDFS, stopping replication even if the cluster is brought back up.
To avoid replication problems:
-
Where possible, avoid doing a full shutdown. Instead, restart services to trigger a rolling restart of datanodes.
-
If a full shutdown is done, you should do a rolling restart of all Big Replicate nodes in the corresponding zone. A rolling restart ensures that you will keep the existing quorum.
8.3. Managing services through the Big Replicate UI
Providing that the UI service is running, you can stop and start Big Replicate through the Big Replicate Nodes tab.
8.4. Big Replicate UI login
The UI for managing Big Replicate can be accessed through a browser, providing you have network access and the port that the UI is listening on is not blocked.
http://<url-for-the-server>:<UI port>
e.g.
http://wdfusion-static-0.dev.organisation.com:8083/ui/
You should not need to add the /ui/ at the end; you should be redirected there automatically.
Login using your Hadoop platform’s manager credentials.
8.4.1. Login credentials
Currently you need to use the same username and password that are required for your platform manager, e.g. Cloudera Manager or Ambari. In a future release we will separate Big Replicate UI from the manager and use a new set of credentials.
LDAP/Active Directory and Big Replicate login
If your Cloudera-based cluster uses LDAP/Active Directory to handle authentication then please note that a user that is added to an LDAP group will not automatically be assigned the corresponding Administrator role in the internal Cloudera Manager database. A new user in LDAP that is assigned an Admin role will, by default, not be able to log in to Big Replicate. To be allowed to log in, they must first be changed to an administrator role type from within Cloudera Manager.
No sync between CM and LDAP
There is no sync between Cloudera Manager and LDAP in either direction, so a user who loses their Admin privileges in LDAP will still be able to log in to Big Replicate until their role is updated in Cloudera Manager.
You must audit Big Replicate users in Cloudera Manager.
Administrators will need to change any user in the Cloudera Manager internal database (from the Cloudera Manager UI) to the required access level for Big Replicate. Please note the warning given above, that changing access levels in LDAP will not be enough to change the admin level in Big Replicate.
8.5. Authentication misalignment
There are four possible scenarios concerning how LDAP authentication can align and potentially misalign with the internal CM database:
- User has full access in CM, denied access in Big Replicate UI
-
-
User is in the Full Administrator group in LDAP
-
User is left as the default read-only in the internal Cloudera Manager database
-
- User has full access in CM, full access in Big Replicate UI
-
-
User is in the Full Administrator group in LDAP
-
User is changed to Full Administrator in the internal Cloudera Manager database
-
- User has read-only access in CM, denied access to Big Replicate UI
-
-
User is removed from the Full Administrator group in LDAP and added to the read-only group
-
User is left as the default read-only in the internal Cloudera Manager database
-
- User has read-only access to CM, Full access to Big Replicate UI
-
-
User is removed from the Full Administrator group in LDAP and added to the read-only group
-
User is set as Full Administrator in the internal Cloudera Manager database. Clearly this scenario represents a serious access control violation; administrators must audit Big Replicate users in Cloudera Manager.
-
8.5.1. Checking cluster status on the Dashboard
The Big Replicate UI dashboard provides a view of Big Replicate’s status. From the world map you can identify which data centers are experiencing problems, track replication between data centers or monitor the usage of system resources.
UI Dashboard will indicate if there are problems with Big Replicate on your cluster.
8.6. Server Logs Settings
The Big Replicate logs that we display in the Big Replicate UI
are configured by properties in the
ui.properties
file.
8.6.1. Big Replicate UI Logs viewer
Using Big Replicate UI’s log viewer (View Logs):
-
Login to the Big Replicate UI and click on the Big Replicate Nodes tab button. Then click on the Node on which you wish to view logs.
Figure 56. Log viewer 1
-
Click on the View Logs link, under Big Replicate Server Logs in the Local Big Replicate Server table:
Figure 57. Log viewer 2
-
The View Logs screen lets you select from either Big Replicate or UI Server logs.
Figure 58. Log viewer 3
8.6.2. Default log paths:
Unless configured differently, Big Replicate logs should be written to the following locations:
logs.directory.fusion    /var/log/fusion/server/
logs.directory.ihc       /var/log/fusion/ihc
logs.directory.uiserver  /var/log/fusion/ui
8.6.3. Configure log directory
By default the log location properties are not exposed in the ui.properties file. If you need the UI server to look in different locations for the log files, you can add the following properties (in ui.properties). To be clear, these entries do not set alternate locations for Big Replicate to write its logs; they only ensure that the UI server can still read the logs in the event that they are moved:
- logs.directory.fusion
-
sets the path to the Big Replicate server logs.
- logs.directory.uiserver
-
sets the path to the UI server logs.
- logs.directory.ihc
-
sets the path to the ihc server logs.
The file is read by the UI server on startup, so you will need to restart the server for changes to take effect. The ui.properties file is not replicated between nodes, so you must currently set it manually on each node.
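For example, if the logs had been moved under /data/logs, the added ui.properties entries might look like the following (the paths are illustrative):

logs.directory.fusion=/data/logs/fusion/server
logs.directory.uiserver=/data/logs/fusion/ui
logs.directory.ihc=/data/logs/fusion/ihc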
8.6.4. Logging at startup
At startup the default log location is /dev/null. If there's a problem before log4j has initialised, this will result in important logs getting lost. You can set the log location to a filespace that preserves early logging.
Edit fusion_env.sh, adding paths to the following properties:
- SERVER_LOG_OUT_FILE
-
Path for Big Replicate server log output
- IHC_LOG_OUT_FILE
-
Path for IHC server log output
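A sketch of the resulting fusion_env.sh entries, using illustrative file names under the standard log directories:

SERVER_LOG_OUT_FILE=/var/log/fusion/server/server_out.log
IHC_LOG_OUT_FILE=/var/log/fusion/ihc/ihc_out.log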
More about logging: for more information about Big Replicate's logging, see Troubleshooting - Read the logs.
8.7. Induction
Induction is the process used to incorporate new nodes into IBM’s replication system. The process is run at the end of a node installation, although it is also possible to delay the process, then use the + Induct link on the Big Replicate Nodes tab.
Use this procedure if you have installed a new node but did not complete its induction into your replication system at the end of the installation process.
-
Login to one of the active nodes, clicking on the Big Replicate Nodes tab. Click the + Induct button.
-
Enter the fully qualified domain name of the new node that you wish to induct into your replication system.
- Fully Qualified Domain Name
-
The full domain name for the new node that you will induct into your replication system.
- Big Replicate Server Port
-
The TCP port used by the Big Replicate application for configuration and reporting, both internally and via the REST API. The port needs to be open between all Big Replicate nodes and any systems or scripts that interface with Big Replicate through the REST API.
Click Start Induction.
-
When the induction process completes, the Big Replicate Node tab will refresh with the new node added to the list.
8.7.1. Induction Failure
The induction process performs some validation before running. If this validation fails you will quickly see warning messages appear.
- Automatic Induction Failure
-
If the induction process can't connect to the new node using the details provided, a failure will happen instantly. This could happen because of an error in the new node's installation; it could also be caused by the node being kerberized.
- We also could not reach any of our standard ports
-
If connections can’t be made on specific Big Replicate ports, they will be listed here. If none of the standard ports are reachable then you will be warned that this is the case.
- Fully Qualified Domain Name
-
The full hostname for the server.
- Node ID
-
A unique identifier that will be used by Big Replicate UI to identify the server.
- Location ID
-
This is the unique string (e.g. "db92a062-10ea-11e6-9df2-4ad1c6ce8e05") that appears on the Node screen (see below).
- DConE Port
-
The TCP port used by the replication system. It needs to be open between all Big Replicate nodes. Nodes that are situated in zones that are external to the data center’s network will require unidirectional access through the firewall.
9. Troubleshooting
This section details how to diagnose and fix problems that may occur in deployment. It's important that you check the Release Notes for any known issues in the release that you are using. See Release Notes.
9.1. Troubleshooting Overview
-
Run Talkback then send the results to IBM’s support team
-
Common Problems
9.2. Read the logs
There are a number of log files that provide information that will be necessary in finding the cause of many problems.
The log files for Big Replicate are spread over three locations. Some processes contain more than one log file for the service. All pertinent log files are captured by running the IBM talkback shell script that is covered in the next section.
9.2.1. Big Replicate Server Logs
The logs on the Big Replicate server record events that relate to the data replication system.
- Log locations
-
/var/log/fusion/server
- Primary log(s)
-
fusion-dcone.log.0
-
this is the live log file for the running Big Replicate server process.
-
- Historical logs
-
The following logs are listed for completeness but are not generally useful for monitoring purposes.
fusion-dcone.log.x
-
the log file is rotated once its file size reaches 200MB; the "x" represents an incrementing number, starting at 1. By default, the last 100 log files are retained. Both the rotation size (200MB) and the retention (100 files) can be customised.
fusion-server.log
-
a log of application-level events, such as Kerberos authentication and license validation.
fusion-server.log.yyyy-mm-dd
log_out.log
-
this is the output redirected from the STDOUT and STDERR of the invoked Java process. It is used to capture exceptions that occur before logging could start.
-
9.2.2. Big Replicate UI Server Logs
The Big Replicate user interface layer, responsible for handling interactions between the administrator, Big Replicate and the Hadoop Management layer.
- Log locations
-
/var/log/fusion/ui/
- Primary log(s)
-
fusion-ui.log
- Historical logs
-
fusion-ui.log.x
The UI logs will contain errors such as failed access to the user interface, connectivity errors between the user interface and the Big Replicate server's REST API, and other errors encountered while performing administrative actions across the UI.
9.2.3. Inter-Hadoop Connect (IHC) Server Logs
Responsible for streaming files from the location of the client write to the Big Replicate server process in any remote cluster to which Hadoop data is replicated.
- Log location
-
/var/log/fusion/ihc
/var/log/fusion/ihc/server
- Primary log(s)
-
server/fusion-ihc-ZZZ-X.X.X.log
-
The live IHC process log files. The components of the filename are as follows:
ZZZ - Hadoop distribution marker (hdp, cdh, phd, etc). This will be "hdp" for a Hortonworks integrated cluster.
X.X.X - A matching cluster version number. This will be "2.2.0" for a Hortonworks 2.2 cluster.
-
- Historical logs
-
server/fusion-ihc-ZZZ-X.X.X.log.yyy-mm-dd
log_out.log
This log file contains details of any errors by the process when reading from HDFS in the local cluster, such as access control violations, or network write errors when streaming to the Big Replicate server in any remote cluster.
9.2.4. Log analysis
This is the standard format of the IBM log messages within Big Replicate. It includes an ISO8601-formatted timestamp of the entry and the log level/priority, followed by the log entry itself. The log levels, in order of severity (highest to lowest), that you may observe are:
-
PANIC
-
SEVERE
-
ERROR
-
WARNING
-
INFO
For log analysis and reporting, logs at the PANIC, SEVERE and ERROR levels should be investigated. Warning-level messages indicate that an unexpected result has been observed, but one that hasn't impacted the system's continued operation. Additional levels may exist, but these are only used when the logging level has been increased for specific debug purposes; at other times, other levels should be treated as informational (INFO).
9.2.5. Quickly picking out problems
One simple thing that can be done is to grep the log file for any instance of "exception" and/or "PANIC" - this will tell the administrator a great deal without much effort. Using something like:
cat /var/log/fusion/server/fusion-dcone.log.0 | egrep -i "exception|panic"
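To sweep the rotated history as well as the live log, the same expression can be run across all of the rotated files (the glob assumes the default naming described above):

egrep -i "exception|panic" /var/log/fusion/server/fusion-dcone.log.*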
9.3. Talkback
Talkback is a bash script that is provided in your Big Replicate installation for gathering all the logs and replication system configuration that may be needed for troubleshooting problems. Should you need assistance from IBM’s support team, they will ask for an output from Talkback to begin their investigation.
9.3.1. Talkback location
You can find the talkback script located on the Big Replicate server’s installation directory:
$ cd /opt/wandisco/fusion/server/
You can run talkback as follows:
$ sudo ./talkback.sh
If a cluster has Kerberos security enabled (Talkback will detect this from Big Replicate’s configuration), you may be asked for Kerberos details needed to authenticate with the cluster.
You will be asked to complete the following details:
-
Location to store the talkback to. Suggest /tmp if acceptable disk space is available.
Reserve plenty of storage
Note, Big Replicate talkbacks can exceed 300MB compressed, but well over 10GB uncompressed (due to logs). /tmp may or may not be suitable.
-
Kerberos keytab location.
-
User to perform kinit with when obtaining kerberos ticket.
-
Whether you wish to perform a HDFS fsck, or not. Option 1 for yes, option 2 for no.
9.3.2. Running talkback
To run the talkback script, follow this procedure:
-
Log into the Big Replicate server. If you’re not logged in as root, use sudo to run the talkback script, e.g.
[vagrant@supp26-vm1 ~]$ sudo /opt/wandisco/fusion/server/talkback.sh

#######################################################################
# IBM talkback - Script for picking up system & replicator           #
# information for support                                             #
#######################################################################

To run this script non-interactively please set following environment vars:
ENV-VAR:
FUSION_SUPPORT_TICKET      Set ticket number to give to IBM support team
FUSION_TALKBACK_DIRECTORY  Set the absolute path directory where the tarball will be saved
FUSION_KERBEROS_ENABLED    Set to "true" or "false"
FUSION_PERFORM_FSCK        Set to "true" or "false" to perform a file system consistency check

Which directory would you like the talkback tarball saved to? /tmp

===================== INFO ========================
The talkback agent will capture relevant configuration and log files to help
IBM diagnose the problem you may be encountering.

Retrieving current system state information
Kerberos is enabled
Kerberos is enabled. Please provide the absolute path to the keytab you wish to use to obtain a ticket: /etc/security/keytabs/hdfs.headless.keytab
Please provide the corresponding username for the keytab located /etc/security/keytabs/hdfs.headless.keytab: hdfs
Performing kinit as user: hdfs
Gathering information from Big Replicate endpoints
Protocol is: http
Hostname is: supp26-vm1dddd
Port is: 8082
retrieving details for node "supp26-vm0_2"
retrieving details for node "supp25-vm1_59"
retrieving details for node "supp25-vm0_61"
retrieving details for node "supp26-vm1_20"
Copying Big Replicate server log files, this can take several minutes.
Copying Big Replicate IHC log files, this can take several minutes.
Would you like to include hadoop fsck? This can take some time to complete and may drastically increase the size of the tarball.
1) Yes
2) No
#? 2
Running sysinfo script to capture maximum hardware and software information...
Gathering Summary info....
Gathering Kernel info....
Gathering Hardware info....
Gathering File-Systems info....
Gathering Network info....
Gathering Services info....
Gathering Software info....
Gathering Stats info....
Gathering Misc-Files info....
THE FILE sysinfo/sysinfo_supp26-vm1-20160428-132245.tar.gz HAS BEEN CREATED BY sysinfo
tar: Removing leading `/' from member names

TALKBACK COMPLETE

---------------------------------------------------------------
Please upload the file:

/tmp/talkback-201604281321-supp26-vm1.lcx.tar.gz

to IBM support with a description of the issue.

Note: do not email the talkback files, only upload them
via ftp or attach them via the web ticket user interface.
--------------------------------------------------------------
-
Follow the instructions for uploading the output on IBM’s support website.
9.4. Common problems
9.4.1. Moving objects between mismatched filesystems
If you move objects onto the distributed file system you must make sure that you use the same URI on both the originating and destination paths. Otherwise you’d see an error like this:
[admin@vmhost01-vm1 ~]$ hadoop fs -mv /repl2/rankoutput1 fusion:///repl2/rankoutput2/
15/05/13 21:22:40 INFO client.FusionFs: Initialized FusionFs with URI: fusion:///, and Fs: hdfs://vmhost01-vm1.cluster.domain.com:8020. FileSystem: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-721726966_1, ugi=admin@DOMAIN.EXAMPLE (auth:KERBEROS)]]
mv: `/repl2/rankoutput1': Does not match target filesystem
If you use the fusion:/// URI on both paths it will work, e.g.
[admin@vmhost01-vm1 ~]$ hadoop fs -mv fusion:///repl2/rankoutput1 fusion:///repl2/rankoutput1
15/05/13 21:23:27 INFO client.FusionFs: Initialized FusionFs with URI: fusion:///, and Fs: hdfs://vmhost01-vm1.cluster.domain.com:8020. FileSystem: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1848371313_1, ugi=admin@DOMAIN.EXAMPLE (auth:KERBEROS)]]
Note that since the non-replicated directory doesn't yet exist in ZONE2, it will be created without the files it contains on the originating zone. When running Big Replicate using the fusion:/// URI, moving a non-replicated directory into a replicated directory will not work unless you use the fusion:/// URI on both paths.
You can't move files between replicated directories
Currently you can't perform a straight move operation between two separate replicated directories.
9.4.2. Handling file inconsistencies
Big Replicate's replication technology ensures that changes to data are efficiently propagated to each zone. There are, however, a few cases where objects in the distributed file system can lose consistency. Big Replicate can be set to schedule periodic consistency checks, or an administrator can trigger a check from the Admin UI or via the REST API.
If an inconsistency is found then the administrator needs to use the repair functions available through the Big Replicate UI or manually repair the issue using whatever system tools correspond with the Hadoop application. This may require that up-to-date files are manually copied over from one zone to overwrite the corrupted version of the files. In some cases files will need to be deleted/removed in order to restore consistency. You will need to follow the guidelines and documentation that correspond with your underlying applications, e.g. MapR, Hive etc.
Consistency Checks look at file size, not content
The current implementation of the Consistency Check tool
compares the size of files between zones. We’re looking
carefully at how we can implement a qualitative check that can
specifically identify file corruption while not greatly impacting
performance.
Repairs on large files
Please note that when very large files are repaired, it may
appear that the process has stalled with different numbers of
appends getting reported, post-completion. We recommend that you
allow repair operations plenty of time to complete.
Username Translation
If any nodes that take part in a consistency check have the
Username Translation feature
enabled, then inconsistencies in the "user" field will be
ignored.
9.4.3. Transfer reporting
When looking at transfer reporting, note that there are situations involving HFlush/early file transfer where transfer logs will appear incorrect. For example, the push threshold may appear to be ignored. This can happen if an originating file is closed and renamed before pulls are triggered by the HFlush lookup. Although this results in confusing logs, those logs are in fact correct: you would see only two appends, rather than the number determined by your push threshold - one at the very beginning, and one from the rename, which pulls the remainder of the file. What is happening is optimal; all the data is available to be pulled at that instant, so we might as well pull all of it at once instead of in chunks.
9.4.4. Fine-tuning Replication
WANdisco’s patented replication engine, DConE, can be configured for different use cases, balancing between performance and resource costs. The following section looks at a number of tunable properties that can be used to optimize Big Replicate for your individual deployment.
Increasing thread limit
Big Replicate processes agreements using a set number of threads, 20 by default, which offers a good balance between performance and system demands.
It is possible, in cases where there are many Copy agreements arriving at the same time, that all available threads become occupied by the Copy commands. This will block the processing of any further agreements.
You can set Big Replicate to reserve more threads, to protect against this type of bottleneck situation:
9.4.5. Increase executor.threads property
-
Make a backup copy of Big Replicate's applications config file /opt/wandisco/fusion-server/applications.properties, then open the original in your preferred text editor.
-
Modify the property executor.threads:

Property: executor.threads
Description: The number of threads executing agreements in parallel.
Permitted values: 1 - Integer.MAX_VALUE
Default: 20
Checked at: Startup

Don't go alone
Any upward adjustment will clearly increase resourcing costs. Before you make any changes to DConE properties, you should open up discussions with IBM's support team. Applying incorrect or inappropriate settings to the replication system may result in hard-to-diagnose problems.
-
Save your edited applications.properties file, then restart Big Replicate.
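For illustration only, a raised thread count agreed with IBM's support team would appear in applications.properties as a single line; the value 40 below is a placeholder, not a recommendation:

executor.threads=40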
9.4.6. Tuning Writer Re-election
Only one Big Replicate node per zone is allowed to write into a particular replicated directory. The node that is assigned to do the writing is called the writer. See more about the role of the writer.
Should the current writer suddenly become unavailable, a re-election process begins for assigning the role to one of the remaining nodes. Although the re-election process is designed to balance speed against system resource usage, there may be deployments where processing speed is critical. For this reason, the re-election timing can be tuned with the following system properties:
9.5. Tunable properties
- writerCheckPeriod
-
The period of time (in seconds) between writer check events. Default: 60.
- writerCheckMultiple
-
The number of check events that will fail before initiating an election. Default: 3.
9.5.1. Setting the writer re-election period
The period of time between a writer going offline and another writer being elected and starting to pick up is writerCheckPeriod * writerCheckMultiple, i.e. the default is 3 minutes (writerCheckPeriod 60s x writerCheckMultiple 3).
If you feel these default settings cause the system to wait too long before kicking off a re-election, you can update them using an API call:
curl -X POST "http://.../fusion/fs/properties/global?path=<mapped path>&writerCheckPeriod=<new period>&writerCheckMultiple=<new multiple>"
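For example, to halve both defaults for a replicated path /repl1 on a hypothetical node fusion01, giving a 30s x 2 = 60 second re-election window (host and path are illustrative):

curl -X POST "http://fusion01.example.com:8082/fusion/fs/properties/global?path=/repl1&writerCheckPeriod=30&writerCheckMultiple=2"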
You can adjust these properties to be optimal for your deployment. However, consider the following pointers:
-
Setting the properties so that the period is very short will ensure that if a writer is lost, a new writer will be brought into action so quickly that there should be no impact on replication. However, very short periods are likely to result in a larger number of false alarms, where writer re-elections are triggered unnecessarily.
-
Setting the properties so that the period is very long will ensure that a re-election only takes place if the current writer is really "out for the count", however, a long delay between the loss of the writer and a new writer picking up could be very detrimental in some situations, such as where very large numbers of small files are being replicated between zones.
9.6. Handling Induction Failure
In the event that the induction of a new node fails, here is a possible approach for manually fixing the problem using the API.
Requirements: A minimum of two nodes with a Big Replicate server installed and running, without having any prior knowledge of each other. This can be verified by querying <hostname>:8082/fusion/nodes
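For example, the following check can be run against each node before starting (the hostname is illustrative); at this stage each server should return only its own details:

curl -s "http://fusion01.example.com:8082/fusion/nodes"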
9.6.1. Steps:
Generate an xml file (we'll call it induction.xml) containing an induction ticket with the inductor's details. (Generally the inductor port should not change, but this is the port that all DConE traffic uses; you can find it in your application.properties file as application_port.)
<inductionTicket>
  <inductorNodeId>${NODE1_NODEID}</inductorNodeId>
  <inductorLocationId>${NODE1_LOCATIONID}</inductorLocationId>
  <inductorHostName>${NODE1_HOSTNAME}</inductorHostName>
  <inductorPort>6789</inductorPort>
</inductionTicket>
Send the xml file to your inductee:
curl -v -s -X PUT -d@induction.xml -H "Content-Type: application/xml" http://${NODE2_HOSTNAME}:8082/fusion/node/${NODE2_IDENTITY}
9.6.2. MEMBERSHIP
Requirements: A minimum of two nodes that have been inducted.
Steps:
Generate an xml file (we'll call it membership.xml) containing a membership object. DConE supports various configurations of node roles, but for the time being the Big Replicate UI only supports <Acceptor, Proposer, Learner> and <Proposer, Learner>. If you choose to have an even number of <Acceptor, Proposer, Learner> nodes you must specify a tiebreaker.
<membership>
  <membershipIdentity>${MEANINGFUL_MEMBERSHIP_NAME}</membershipIdentity>
  <distinguishedNodeIdentity>${NODE1_NODEID}</distinguishedNodeIdentity>
  <acceptors>
    <node>
      <nodeIdentity>${NODE1_NODEID}</nodeIdentity>
      <nodeLocation>${NODE1_LOCATIONID}</nodeLocation>
    </node>
    <node>
      <nodeIdentity>${NODE2_NODEID}</nodeIdentity>
      <nodeLocation>${NODE2_LOCATIONID}</nodeLocation>
    </node>
  </acceptors>
  <proposers>
    <node>
      <nodeIdentity>${NODE1_NODEID}</nodeIdentity>
      <nodeLocation>${NODE1_LOCATIONID}</nodeLocation>
    </node>
    <node>
      <nodeIdentity>${NODE2_NODEID}</nodeIdentity>
      <nodeLocation>${NODE2_LOCATIONID}</nodeLocation>
    </node>
  </proposers>
  <learners>
    <node>
      <nodeIdentity>${NODE1_NODEID}</nodeIdentity>
      <nodeLocation>${NODE1_LOCATIONID}</nodeLocation>
    </node>
    <node>
      <nodeIdentity>${NODE2_NODEID}</nodeIdentity>
      <nodeLocation>${NODE2_LOCATIONID}</nodeLocation>
    </node>
  </learners>
</membership>
Send the xml file to one of your nodes:
curl -v -s -X POST -d@membership.xml -H "Content-Type: application/xml" http://${NODE_HOSTNAME}:8082/fusion/node/${NODE_IDENTITY}/membership
9.6.3. STATEMACHINE
Requirements: A minimum of two nodes inducted together and a membership created that contains them (you’ll want to make a note of the membership id of your chosen membership).
Steps:
Generate an xml file (we'll call it statemachine.xml) containing a fsMapping object.
<replicatedDirectory>
  <uri>${URI_TO_BE_REPLICATED}</uri>
  <membershipId>${MEMBERSHIP_ID}</membershipId>
  <familyRepresentativeId>
    <nodeId>$NODE1_ID</nodeId>
  </familyRepresentativeId>
</replicatedDirectory>
Send the xml file to one of your nodes:
curl -v -s -X POST -d@statemachine.xml -H "Content-Type: application/xml" http://${NODE1_HOSTNAME}:8082/fusion/fs
9.7. Emergency bypass to allow writes to proceed
If Big Replicate is down and clients use the HDFS URI, then further writes will be blocked. The emergency bypass feature gives the administrator an option to bypass Big Replicate and write to the underlying file system, which will introduce inconsistencies between zones. This is suitable for when short-term inconsistency is seen as a lesser evil compared to blocked progress.
The inconsistencies can then be fixed later using the Consistency and Repair process(es). A client that is allowed to bypass to the underlying filesystem will continue to bypass for the duration of the retry interval. Long-running clients will automatically reload configurations at a hardcoded 60 second interval. Thus it is possible to disable and enable the bypass on-the-fly.
Don’t enable the Emergency bypass
We strongly recommend that you currently don't use the bypass option. We're investigating a possible issue where enabling the Emergency bypass may cause application instability during periods of high activity. [FUS-2392]
9.8. Enable/disable emergency bypass via the UI
-
Log in to the Big Replicate UI and go to the Settings tab. Click Client Bypass Settings.
Figure 59. Client Bypass - step1
-
Tick the Enable fusion bypass checkbox. This will enable two entry fields for configuration:
Figure 60. Client Bypass - step2
- Bypass response time
-
The time (in seconds) that will pass before the client will bypass Big Replicate. Default: 14.
- Bypass retry interval
-
The time (in seconds) before the client attempts to use Big Replicate again. Default: 60.
-
Click Update to save your changes.
9.9. Enable/disable emergency bypass via manual configuration change
In core-site.xml add the following properties:
<property>
  <name>fusion.client.can.bypass</name>
  <value>true or false; default is false</value>
</property>
<property>
  <name>fusion.client.bypass.response.secs</name>
  <value>integer number representing seconds; default is 14</value>
</property>
<property>
  <name>fusion.client.bypass.retry.interval.secs</name>
  <value>integer number representing seconds; default is 60</value>
</property>
The properties are also listed in the Reference Section.
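For illustration, a core-site.xml fragment that enables the bypass while keeping the default timings would look like this (enable it only with the caveat above in mind):

<property>
  <name>fusion.client.can.bypass</name>
  <value>true</value>
</property>
<property>
  <name>fusion.client.bypass.response.secs</name>
  <value>14</value>
</property>
<property>
  <name>fusion.client.bypass.retry.interval.secs</name>
  <value>60</value>
</property>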
Known Issue: Failed to install metastore service during Big Replicate installation in HDP 2.4, 2.5
Example failure during an Ambari-based installation: the error is caused by the stack not being available via ambari-server.
Workaround
To fix this you need to ensure that only a single ambari-server process is running before doing the service ambari-server restart. To find the ambari-server processes that are running you can use:
ps -aux | grep ambari-server
Then kill all the ambari-server processes using:
kill -9 [pid of process]
Then restart the ambari-server using:
service ambari-server restart
Also rerun the check to ensure you only have a single process running:
ps -aux | grep ambari-server
You can then check in the Ambari UI whether the WD Hive Metastore and WD Hiveserver2 Template services are available. If they are present then you can proceed with retrying the service installation via the installer.
9.10. Kerberos Troubleshooting
This section covers some recommended fixes for potential Kerberos problems.
9.10.1. Kerberos Error with MIT Kerberos 1.8.1 and JDK6 prior to update 27
Prior to JDK6 Update 27, Java fails to load the Kerberos ticket cache correctly when using MIT Kerberos 1.8.1 or later, even after a kinit.
The following exception will occur when attempting to access the Hadoop cluster.
WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
The workaround is:
-
Renew the local Kerberos ticket with "kinit -R" (this requires that the Kerberos ticket is renewable). This is fixed in JDK 6 Update 27 or later: http://www.oracle.com/technetwork/java/javase/2col/6u27bugfixes-444150.html
-
See: CCacheInputStream fails to read ticket cache files from Kerberos 1.8.1
Kerberos 1.8.1 introduced a new feature by which configuration settings can be stored in the ticket cache file using a special principal name.
10. Uninstall Big Replicate
In cases where you need to remove Big Replicate from a system, use the following script:
/opt/wandisco/fusion-ui-server/scripts/uninstall.sh
-
The script is placed on the node during the installation process.
-
You must run the script as root or invoke sudo.
-
Running the script without using an additional option performs the following actions:
Default uninstall
-
Stops all Big Replicate related services
-
Uninstalls the Big Replicate, IHC and UI servers
-
Uninstalls any Big Replicate-related plugins (See Plugins)
-
Uninstalls itself. You’ll need to handle backups manually from this point
10.2. Uninstall with config purge
Running the script with -p will also remove any configuration changes that were made during the Big Replicate installation.
Reinstallation
Use the purge (-p) option in the event that you need to complete a fresh installation.
As the purge option will completely wipe your installation, there’s a backup option that can be run to back up your config files, which gives you an easier method for recovering your installation:
10.3. Backup config/log files
Run the script with the -c option to back up your config and -l to back up Big Replicate logs. The files will be backed up to the following location:
/tmp/fusion_config_backup/fusion_configs-YYYYMMDD-HHmmss.tar.gz
Change the default save directory
You can change the locations that the script uses for these backups by adding the following environmental variables:
CONFIG_BACKUP_DIR=/path/to/config/backup/dir
LOG_BACKUP_DIR=/path/to/log/backup/dir
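For example, a single run that backs up both config and logs to custom locations (the paths are illustrative) could look like:

sudo CONFIG_BACKUP_DIR=/backup/fusion/config LOG_BACKUP_DIR=/backup/fusion/logs /opt/wandisco/fusion-ui-server/scripts/uninstall.sh -c -l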
10.4. Help
Running the script with -h outputs a list of options for the script.
[sysadmin@localhost ~]$ sudo /opt/wandisco/fusion-ui-server/scripts/uninstall.sh -h
Usage: /opt/wandisco/fusion-ui-server/scripts/uninstall.sh [-c] [-l] [-p] [-d]
 -c: Backup config to '$CONFIG_BACKUP_DIR' (default: /tmp/fusion_config_backup).
 -d: Dry run mode. Demonstrates the effect of the uninstall without performing the requested actions.
 -h: This help message.
 -l: Backup logs to '$LOG_BACKUP_DIR' (default: /tmp/fusion_log_backup).
 -p: Purge config, log, data files, etc to leave a cleaned up system.
11. Managing Replication
Big Replicate is built on WANdisco’s patented DConE active-active replication technology. DConE sets a requirement that all replicating nodes that synchronize data with each other are joined in a "membership". Memberships are coordinated groups of nodes where each node takes on a particular role in the replication system.
For more information about DConE and its different roles see the reference section’s chapter called A Paxos Primer.
11.1. Create a membership
Log in to the Big Replicate UI. Click on the Membership tab. Click on the Create New tab. The "New Membership" window will open that will display the Big Replicate nodes organized by zone.
Configure the membership by selecting which nodes should be acceptors. Acceptors vote on the ordering of changes.
Note how a two-node membership requires that one of the nodes be upgraded to a Distinguished Node.
For some guidance on the best way to configure a membership read Create Resilient Memberships in the reference section.
Click Create to complete the operation. Click Cancel to discard the changes.
Identical memberships are not allowed
You will be prevented from creating more than one membership with a particular configuration.
Figure: Rejected membership
11.2. Guide to node types
- APL
-
Acceptor - the node will vote on the order in which replicated changes will play out.
Proposer - the node will create proposals for changes that can be applied to the other nodes.
Learner - the node will receive replication traffic that will synchronize its data with other nodes. - PL
-
Proposer - the node will create proposals for changes that can be applied to the other nodes.
Learner - the node will receive replication traffic that will synchronize its data with other nodes. - Distinguished Node
-
Acceptor + - the distinguished node is used in situations where there is an even number of nodes, a configuration that introduces the risk of a tied vote. The Distinguished Node’s bigger vote ensures that it is not possible for a vote to become tied.
11.3. Replicated Folders
Big Replicate allows selected folders within your HDFS file system to be replicated to other data centers in your cluster. This section covers the setup and management of replicated folders.
11.3.1. Create a replicated folder
The first step in setting up a replicated folder is the creation of a target folder:
-
In each zone, create a directory in the hdfs file space. To avoid permission problems, ensure that the owning user/group are identical across the zones. Use Hadoop’s filesystem command to complete the tasks:
hadoop fs -mkdir /user/hiver
hadoop fs -chown -R hiver:groupname /user/hiver
-
As user hdfs, run the following commands on each data center:
hadoop fs -mkdir /user/hiver/warehouse-replicated
hadoop fs -chown hiver:hiver /user/hiver/warehouse-replicated
This ensures that a universal system user has read/write access to the HDFS directory warehouse-replicated that will be replicated through Big Replicate.
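A quick sanity check in each zone confirms that the ownership matches before replication starts:

hadoop fs -ls /user/hiver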
11.4. Create Rule
-
Once the folder is in place on all nodes, login to Big Replicate’s UI on one of the Big Replicate nodes and click on the Replicated Folders tab.
-
Click on the + Create button.
Figure 64. Create rule 1
-
The replicated folder entry form screen will appear.
Figure 65. Create rule 2
Navigate the HDFS File Tree (1), on the right-hand side of the New Rule panel, to select your target folder, created in the previous section. The selected folder will appear in the Path entry field. You can, instead, enter the full path to the folder in the Path entry field.
Next, select two or more zones from the Zones list (2). You then select a Membership from the dropdown selector. If there’s no existing membership with the combination of Zones that you selected, then you will see the message:
There are no memberships available matching your criteria.
In this case you can create a new membership, see Create a membership and restart the Create Replicated Folder process.
-
You can now complete the creation of the Replicated folder by clicking on the Create button. However, there are some additional options available on the Advanced Options panel. Consider if you need to apply any Advanced Options for the folder.
Note that the allocated writer for this zone is listed under the Advanced Options panel. This can be useful information in case you need to troubleshoot replication problems.
These include Preserve Origin Block Size, which is used for columnar storage formats such as Parquet, and Preserve Replication Factor, which is used when you want replica data to continue to use the replication factor that is set on its originating cluster, rather than the factor that applies on the new cluster.
Exclude from replication? lets you set an "exclude pattern" to indicate files and folders in your replicated folder that you don’t want to be replicated. If you apply any Advanced Options you need to click the Update button to make sure that they are applied.
The option Override Consistency Check Interval allows administrators to set a consistency check interval that is specific to the replicated folder space and different from the default value that is set in the Consistency Check section of the Settings tab.
11.4.1. Path interpretation
If the path contains a leading slash "/", we assume it is an absolute path; if it contains no leading slash, we assume it is a relative path and the root directory will be added to the beginning of the exclusion.
-
If you didn’t complete a consistency check on the selected folder, you may do so now.
Figure 67. Replicate to Zones
-
After the completion of a consistency check, the Consistency column will report the consistency status.
Figure 68. Replicated folder status
11.5. View/Edit
The View/Edit tab lets you make changes to selected properties of the Replicated Folder:
- Writer for this zone
-
Indicates which node is set to handle writes for this zone.
- Path
-
The file path for the replicated folder in question.
- Zones
-
The zones that are replicated between, for the corresponding folder.
- Membership
-
The membership used to define the replication.
- Advanced Options
-
Various advanced options that can be set for a replicated folder. See Advanced Options.
11.6. Consistency Check
The Consistency Check tab offers access to the consistency repair tool.
- Path
-
The path to the replicated folder currently being viewed for consistency.
- Properties
-
The system properties for the folder, including the following properties:
-
Length: - byte length of the file (in kilobytes)
-
Is a directory: - distinguishes files from directories (true or false)
-
Owner: - Owning system account
-
Group: - Associated system account group
-
Permissions: - File permissions applied to the element
-
ACLs: - Associated Access Control Lists for the element
-
- Source of truth
-
From the available zones, you must choose the one that represents the most up-to-date state.
- Resolve
-
Once you have selected from the available zones, click the Resolve button.
You will see a confirmation message concerning your choice of repair. There is a checkbox that lets you choose to Preserve extraneous files, Click Confirm to complete the repair.
After clicking Confirm, you will get a rundown of the state of each zone, after the repair has been completed.
11.7. Custom Consistency Check
Use the Custom Consistency Check to select a sub directory of the Replicated Directory and check that it is in a consistent state across all nodes.
- Path
-
Shows the path to be checked
- HDFS File Tree
-
Use the HDFS File Tree to select the directory to be checked.
- Outcome
-
Note: When running a custom consistency check, there may be a delay before results are shown. Stay on this page to see the results.
Please select a path and click "Check Now".
- Outcome
-
The Outcome panel will now report on the number of inconsistencies. You will be invited to "Click for a full report".
11.8. File Transfers
The File Transfer panel shows the movement of data coming into the zone.
11.9. Repair
The repair tab provides a tool for repairing file inconsistencies between available zones. The repair tool provides three different types of repair operation, based on the option you select from the Repair Type dropdown.
- HCFS Repair
-
this is a consistency repair on the live Hadoop Compatible File System. This method is the most direct for making repairs, although running a repair will block writes to the replicated folder in the local zone. The block on writes is removed once the repair operation completes.
- Checkpoint Repair
-
this option uses the fsimage checkpoints created by Hadoop's admin tool. The use of a snapshot from the namenode ensures that the local filesystem does not get locked during the repair.
- SnapDiff (NetApp)
-
The SnapDiff implementation of repair allows a repair to be driven by the NetApp snapdiff API. The process for using this type of snapshot repair is detailed below; see Repair type SnapDiff (NetApp).
11.9.1. Repair type HCFS
Run through the following procedure to perform a repair:
-
Select the Source of truth from the drop-down. This will flag one of the available zones as most up-to-date / most correct in terms of stored data.
-
Select from one of two Resolution types, Recursive or Preserve
- Recursive
-
If checkbox is ticked, this option will cause the path and all files under it to be made consistent. The default is true, but is ignored if the path represents a file.
- Preserve
-
If checkbox is ticked, when the repair is executed in a zone that is not the source zone, any data that exists in that zone but not the source zone will be retained and not removed. The default is false, i.e., to make all replicas of the path consistent by removing all data in the no-source zone(s) that does not exist in the source.
11.9.2. Repair status
The repair status listing accepts the following parameters:
- path
-
The path for which the list of repairs should be returned. The default value is the root path, "/".
- recursive
-
If true, also get repairs done on descendants of path. This option is false by default.
- showAll
-
Whether or not to include past repairs for the same file. The options are "true" to show all repairs on the given path, and "false" to show only the last repair.
- sortField
-
The field by which the entries in the
RepairListDTO
should be sorted. The options are to sort by the "startTime" or "path" property. The default value is "path". - sortOrder
-
The order in which the entries should be sorted according to the sort field. The options are to sort in ASC (ascending) or DESC (descending) order.
- return
-
A RepairListDTO representing a list of repairs under path.
Command-line only
The Repair status tool is currently only available through the command-line. In the next release the functionality will be added to the Fusion UI.
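As a hedged illustration only, such a command-line query might look like the following, assuming the repair listing is exposed under the node's REST API with the parameters above (the exact endpoint path and hostname are assumptions, not confirmed by this guide):

curl -s "http://fusion01.example.com:8082/fusion/fs/repairs?path=/repl1&recursive=true&showAll=false&sortField=startTime&sortOrder=DESC"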
11.9.3. Repair type SnapDiff (NetApp)
SnapDiff is an internal Data ONTAP engine that quickly identifies the file and directory differences between two Snapshot copies. See What SnapDiff is.
- Type
-
The type of repair that you wish to start. hdfs or ontap
- Path
-
The replicated system path.
- Netapp Filer URI
-
The URI used for Netapp API traffic.
- Virtual Filer Name
-
A name provided for the virtual filer.
- Volume Name
-
Storage volume name.
- Mountpoint
-
Path where the volume is mounted on the underlying filesystem.
- Base Snapshot
-
Name of the base snapshot. Diffs are calculated as deltas between a base and diff snapshot
- Diff Snapshot
-
Name of the diff snapshot
- Maximum number of diffs per request
-
Max diffs returned per request. See MaxDiffs.
- Source of truth
-
The node on which the most correct/update data is stored.
- Resolution of truth
-
Mechanism that determines how the latest/most correct data is copied into place.
- Paths
-
Paths to replicated files.
- HDFS File Tree
-
Rendered view of the current file tree.
-
A user initiates a snapshot of the NFS content (externally to Big Replicate). This will be called the "base snapshot".
-
Time passes, changes occur in that NFS file system.
-
The user initiates another snapshot of that content (externally to Big Replicate) - this will be called the "diff snapshot".
-
The user invokes the snapshot repair API, including this information:
-
Required parameters:
HTTP authentication (user/password) in the header of the request. Ontap requires this to invoke their API.
- snapshotType
-
The type of repair that you wish to start. hdfs or ontap.
- path
-
Replicated path.
- endpoint
-
URI of the Netapp Filer serving Ontap requests.
- vfiler
-
Name of the virtual filer.
- volume
-
The exported volume.
- mountpoint
-
Path where the volume is mounted on the underlying filesystem.
- baseSnapshot
-
Name of the base snapshot. Diffs are calculated as deltas between a base and diff snapshot.
- diffSnapshot
-
Name of diff snapshot.
Optional parameters:
- recursive
-
Indicates whether subdirectories should be considered.
Non-Recursive requests in 2.10
In Big Replicate 2.10, if the recursive parameter is set to "false", the parameter is ignored: NetApp snapshots are ALWAYS recursive over a directory hierarchy. From 2.10.2 and beyond, an error code will be returned instead, as it's not a valid request for this API call.
- replace
-
Replace files/dirs of the same name on the receiving zone.
- preserve
-
If preserve == true, do not remove any files on the receiving zone that don’t exist on the source zone.
- maxDiffs
-
Max diffs returned per request. There is a hard limit of 256, unless an admin goes to the admin server and changes the registry keys:
To change the maxDiffs limit on the NetApp Filer, use the following steps:

system node run -node "nameofvserver"
priv set advanced
registry walk
registry set options.replication.zapi.snapdiff.max_diffs SOMENUMBER
Example to invoke via curl:
curl --user admin:Ontap4Testing -v -X PUT 'http://172.30.1.179:8082/fusion/fs/repair/snapshot?snapshotPath=/tmp/snapshot1&snapshotType=ontap&path=/tmp/repl1/vol1&endpoint=https://172.30.1.200:443/servlets/netapp.servlets.admin.XMLrequest_filer&vfiler=svm_taoenv&volume=vol1&maxDiffs=256&mountpoint=/tmp/repl1/vol1&preserve=true&baseSnapshot=snap1&diffSnapshot=snap2'
-
The snapshot repair then executes as per the standard repair mechanism to update zones, but will only consider the information that has changed between the base and diff snapshots. The intention is for the base snapshot to reflect the known state of all zones at a prior point in time, and to use the difference between it and the diff snapshot for reconciliation. Non-source zones for snapshot repair with this mechanism trust that the difference between the base and diff snapshots is a true representation of the changes required.
The user interaction with Big Replicate is similar to that offered for HDFS-based snapshot repair, with the addition of extra parameters for initiating the snapshot repair. The user is presented with the option to select the type of snapshot repair to be performed, and the required input fields adjust based on that selection. Note that Big Replicate is not responsible for storing or providing selection from a list of snapshot names (as these are generated externally).
11.10. Configure Hadoop
Once Big Replicate has been installed and set up, you will need to modify your Hadoop applications so that when appropriate, they write to your replicated folder.
Configure Hadoop applications to write to the replicated file space.
11.11. Configure for High Availability Hadoop
If you are running Hadoop in a High Availability (HA) configuration then you should run through the following steps for Big Replicate:
-
Enable High Availability on your Hadoop clusters. See the documentation provided by your Hadoop vendor, i.e. - Cloudera or Hortonworks.
The HA wizard does not set the HDFS dependency on ZooKeeper
Workaround:
-
Create and start a ZooKeeper service if one doesn’t exist.
-
Go to the HDFS service.
-
Click the Configuration tab.
-
In the Service-Wide category, set the ZooKeeper Service property to the ZooKeeper service.
-
-
Edit Big Replicate configuration element 'fusion.underlyingFs' to match the new nameservice ID in the cluster-wide core-site.xml in your Hadoop manager. E.g. change:

<property>
  <name>fusion.underlyingFs</name>
  <value>hdfs://vmhost08-vm0.cfe.domain.com:8020</value>
</property>

To:

<property>
  <name>fusion.underlyingFs</name>
  <value>hdfs://myCluster</value>
</property>
-
Click Save Changes to commit the changes.
-
If Kerberos security is installed make sure the configurations are there as well: Setting up Kerberos with Big Replicate.
-
You’ll need to restart all Big Replicate and IHC servers once the client configurations have been deployed.
11.12. Known issue on failover
Where High Availability is enabled for the NameNode and Big Replicate, when the client attempts to failover to the Standby NameNode it generates a stack trace that outputs to the console. As the Big Replicate client can only delegate the method calls to the underlying FileSystem object, it isn't possible to properly report that the connection has been reestablished. Take care not to assume that a client has hung; it may, in fact, be in the middle of a transfer.
11.13. Reporting
The following section details the reporting tools that Big Replicate currently provides.
11.13.1. Consistency Check
The consistency check mechanism lets you verify that replicated HDFS data is consistent between sites. Read about Handling file inconsistencies.
11.13.2. Consistency Checks through Big Replicate UI
Username Translation
If any nodes that take part in a consistency check have the Username Translation feature enabled, then inconsistencies in the "user" field will be ignored.
11.13.3. Consistency
- Consistency Status
-
A status which links to the consistency check report. It can report Check Pending, Inconsistent, Consistent or Unknown.
- Last Check
-
Shows the time and date of the check that produced the current status. By default, Consistency checks are automatically started every 24 hours.
- Next Check
-
Shows the time and date of the next automatically scheduled Consistency Check. Remember, you don’t need to wait for this automatic check, you can trigger a consistency check at any time through the Consistency Check tool.
Click on the report link to get more information about the current consistency check results.
11.13.4. Fix inconsistencies with the Consistency Check tool
Big Replicate’s Consistency Check tool includes a feature for resolving any inconsistencies that are detected across the distributed file system. Use the following procedure to resolve any such inconsistencies:
-
Start by completing a fresh Consistency Check. Select the inconsistent object using the corresponding check box, then click on the Consistency Check button. After a few moments you’ll get an up-to-date report on inconsistency.
Figure 82. Consistency Check
-
To fix an inconsistency, click on the Inconsistent link in the Consistency column.
Figure 83. Inconsistent
-
The inconsistency is shown in terms of object properties.
Figure 84. Consistency Check
- Path
-
The absolute path for the object.
- Length
-
The size of the object.
- Is a directory
-
Identifies if the object is a directory (true) or a file (false).
- Owner
-
System account that owns the object.
- Group
-
System group associated with the object(s)
- Permission
-
File permissions for the object.
-
Compare the various states of the inconsistent element across your cluster. You need to decide which zone(s) have a correct/up-to-date copy of the element, then select the zone under the Source of truth column. Click Resolve.
Figure 85. Confirm Consistency Check
-
You'll get a confirmation prompt that states which copies will be overwritten and which zone will source the file. Click Confirm to complete the fix or click Cancel to stop the process.
Figure 86. NameNode Settings
-
If you clicked Confirm then the fix operation will begin. The UI will indicate Fix requested.
Figure 87. Consistency Check
-
Rechecking the consistency will confirm that the object is now consistent across all zones.
Figure 88. Consistency Check
11.14. File Transfer Report
As a file is being pulled into the local zone, the transfer is recorded in the Big Replicate server and can be monitored for progress.
Use the REST API to filter by the replicated path and sort by ascending or descending "complete time" or "start time":
GET /fusion/fs/transfers?path=[path]&sortField=[startTime|completeTime]&order=[ascending|descending]
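For example, a quick way to pull this report from the command line with curl (hostname and path are illustrative; 8082 is the default Big Replicate HTTP server port noted later in this guide):
curl -s "http://localhost:8082/fusion/fs/transfers?path=/repl1&sortField=completeTime&order=descending"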
11.15. File transfer Report Output
Example output showing an in-progress and completed transfer:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<fileTransfers>
  <fileTransfer>
    <startTime>1426020372314</startTime>
    <elapsedTime>4235</elapsedTime>
    <completeTime>1426020372434</completeTime>
    <username>IBM</username>
    <familyRepresentativeId>
      <nodeId>dconefs5-1</nodeId>
      <dsmId>93452fe3-c755-11e4-911e-5254001ba4b1</dsmId>
    </familyRepresentativeId>
    <file>/tmp/repl/isoDEF._COPYING_</file>
    <remoteFs>hdfs://vmhost5-vm4.frem.wandisco.com:8020</remoteFs>
    <origin>dc1</origin>
    <size>4148166656</size>
    <remaining>4014477312</remaining>
    <bytesSec>3.3422336E7</bytesSec>
    <percentRemaining>96.77714626516683</percentRemaining>
    <state>in progress</state>
  </fileTransfer>
  <fileTransfer>
    <startTime>1426019512082</startTime>
    <elapsedTime>291678</elapsedTime>
    <completeTime>1426019803760</completeTime>
    <username>wandisco</username>
    <familyRepresentativeId>
      <nodeId>dconefs5-1</nodeId>
      <dsmId>93452fe3-c755-11e4-911e-5254001ba4b1</dsmId>
    </familyRepresentativeId>
    <file>/tmp/repl/isoABC</file>
    <remoteFs>hdfs://vmhost5-vm4.frem.wandisco.com:8020</remoteFs>
    <origin>dc1</origin>
    <size>4148166656</size>
    <remaining>0</remaining>
    <bytesSec>1.4221733E7</bytesSec>
    <percentRemaining>0.0</percentRemaining>
    <state>complete</state>
  </fileTransfer>
</fileTransfers>
11.16. Output key with data type
- Username
-
System user performing the transfer. (String)
- File name
-
Name of the file being transferred. (String)
- Remote FS
-
The file system of the originating node. (URI)
- Origin
-
The file’s originating Zone. (String)
- Size
-
The cumulative size of data transferred. (Long)
- Appends
-
The number of appends that have been made to the file being transferred. (Long)
- AppendSize
-
The size of the latest append. (Long)
- Remaining
-
Remaining bytes still to be transferred for the latest append. (Long)
- Percent remaining
-
Percentage of the file still to be transferred. (Double)
- Bytes/Sec
-
The current rate of data transfer, i.e. amount of file downloaded so far / elapsed download time. (Long)
- State
-
One of "in progress", "incomplete", "completed", "appending", "append complete", "deleted" or "failed". (TransferState)
In progress: an initial pull of the file is being performed.
Appending: data is currently being pulled and appended to the local file.
Append complete: all available data has been pulled and appended to the local file, although more data could be requested later.
Note: files can be renamed, moved or deleted while the data is being pulled, in which case the state will become "incomplete".
When the remote file is closed and all of its data has been pulled, the state then changes to "complete".
If a file is deleted while the pull is in progress, the end state will be "deleted".
If the transfer fails, the state will be "failed".
- Start Time
-
The time when the transfer started. (Long)
- Elapsed Time
-
Time that has so far elapsed during the transfer. Once the transfer completes it is then a measure of the time between starting the transfer and completing. (Long)
- Complete Time
-
During the transfer this is an estimate of the completion time based on the rate of throughput so far. Once the transfer completes, this is the actual time at completion. (Long)
- Delete Time
-
If the file is deleted then this is the time the file was deleted from the underlying filesystem. (Long)
11.16.1. Record retention
Records are not persisted and are cleared on a restart. The log records are truncated to prevent unbounded use of memory; the current implementation is as follows: for each state machine, if there are more than 1,000 entries in its list of transfers, the oldest transfers (sorted by complete time) that are in a terminal state ("completed", "failed" or "deleted") are removed until the size of the list is equal to 1,000. The check on the number of records in the list is performed every hour.
11.16.2. Deleting memberships
It is not currently possible to delete memberships that are no longer required, as removing memberships could potentially break the replication system.
11.16.3. Bandwidth management
For deployments that are run under an enterprise license, additional tools are available for monitoring and managing the amount of data transferred between zones.
Enterprise License only: The Bandwidth Management tools are only enabled on clusters that are running on an Enterprise license. See the Deployment Checklist for details about License Types.
11.16.4. Overview
The bandwidth management tools provide two additional areas of functionality to support Enterprise deployments.
-
Limit the rate of outgoing traffic to each other zone.
-
Limit the rate of incoming traffic from each other zone.
Any applicable bandwidth limits are replicated across your nodes and applied on a per-zone basis.
Big Replicate Nodes screen (when an Enterprise license is in use)
The Big Replicate Nodes screen will display current incoming traffic for the local zone. You will need to log in to the Big Replicate UI on a node within each zone to see all incoming traffic levels.
11.16.5. Setting up bandwidth limits
Use this procedure to set up bandwidth limits between your zones.
Click on the Set bandwidth limit button for the corresponding zone.
The Maximum bandwidth dialog will open. For each remote zone you can set maximum Outgoing to and Incoming from values. Entered values are in megabits per second; these are converted into gigabytes per hour and displayed in brackets after each entry field.
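For example, an entry of 100 Mbps works out at 100 × 3,600 ÷ 8 ÷ 1,000 = 45 GB per hour (using decimal units; the figure displayed by the UI may be rounded slightly differently).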
Maximum bandwidth entry dialog.
- Outgoing to
-
The provided value will be used as the bandwidth limit for data going out to the target zone.
- Incoming from
-
As it is only possible to actually limit traffic at source, the Incoming from value is applied at the target zone as the Outgoing to limit for data being sent to the present zone.
When you have set your bandwidth values, click Update to apply these settings to your deployment.
Maximum bandwidth entry dialog.
12. Settings
12.1. Change the UI Settings
You can change how you interact with Big Replicate UI through the browser:
12.1.1. Change UI ports
-
Log into the Big Replicate UI. Click on the Settings tab.
-
Click on UI Settings link on the side menu.
-
Enter a new HTTP Port or HTTPS (SSL) Port.
Figure 90. Settings - Fusion UI host and port
-
Click Update. You may need to update the URL in your browser to account for the change you just made.
12.1.2. Use HTTPS Port
You can enable SSL encryption between the Big Replicate UI and your browser.
-
Before you enable use of HTTPS, ensure that all Big Replicate nodes/zones have been installed and configured (without using HTTPS for browser access). However, it is not necessary to have inducted the nodes or created memberships.
Enable HTTPS on all nodes: If you don't enable HTTPS on some nodes, some information, such as graph data, will not be displayed.
-
Create a Key Store file using keytool (a sample command appears at the end of this procedure), then save the file to a location on each node where the Big Replicate server can read it.
-
Log in to the Big Replicate UI. Click on the Settings tab.
-
Click on UI Settings link on the side menu.
-
Tick the Use HTTPS checkbox, then enter the following properties:
- HTTPS Port
-
The TCP port that will be used for the SSL traffic.
- Key Store
-
The security certificate repository.
- Key Store Password
-
Password that is set to protect the Key Store.
- Key Alias
-
An identifying name for the Key.
Important: Check that you are using the correct Key Alias. Currently, if you use an alias that doesn't exist in the keystore then the Big Replicate UI server will fail to start without warning. Improved error handling will be added in a later release.
- Trust Store
-
Is used to store certificates from trusted Certificate Authorities.
- Trust Store Password
-
The password that protects the Trust Store.
Restart the node for the setting changes to take effect.
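As a minimal sketch of the keystore creation mentioned in step 2, you might run the following on each node (the path, alias and password are illustrative; keytool prompts for the certificate details):
keytool -genkey -keyalg RSA -keystore /opt/wandisco/ssl/ui-keystore.ks -alias fusionui -validity 3650 -storepass <YOUR PASSWORD>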
12.2. Enable SSL for Big Replicate
The following procedure is used for setting up SSL encryption for Big Replicate. The encryption will be applied between all components: Big Replicate servers, IHC servers and clients.
The procedure must be followed for each Big Replicate server in your replication system, in turn.
-
Login to Big Replicate UI, click on the Settings tab.
-
Click the Enable SSL for Big Replicate checkbox.
-
Enter the details for the following properties:
- KeyStore Path
-
Path to the keystore, e.g. /opt/wandisco/ssl/keystore.ks
- KeyStore Password
-
Encrypted password for the KeyStore.
- Key Alias
-
The alias of the private key, e.g. IBM
- Key Password
-
Encrypted password for the private key.
- TrustStore Path
-
Path to the TrustStore, e.g. /opt/wandisco/ssl/keystore.ks
- TrustStore Password
-
Encrypted password for the TrustStore.
-
Ensure that the HTTP Policy for the Big Replicate Core Server API is changed to match your SSL selection. Having enabled SSL, you need to change the HTTP Policy to Only HTTPS or Both HTTP and HTTPS.
-
If applicable, edit the Big Replicate HTTP Server Port, default is 8082.
-
Click Update to save the settings. Repeat the steps for all Big Replicate servers.
12.3. Changing SSL Settings
If you disable SSL, you must also update the HTTP Policy for the Big Replicate Core Server API so that it is set to HTTP only.
Any changes that you make to the SSL settings must be applied manually in the UI of every other Big Replicate node. An update to the SSL settings will apply changes in the core-site file via the management endpoint (Cloudera Manager, Ambari, etc.). You may be required to make manual changes to configuration files and restart some services.
Known Issue: Currently, the HTTP policy and SSL settings both independently alter how Big Replicate uses SSL, when they should be linked. You need to make sure that your HTTP policy selection and the use of SSL (enabled in the next section of the Installer) are in sync. If you choose either of the policies that use HTTPS, then you must enable SSL. If you stick with "Only HTTP" then you must ensure that you do not enable SSL. In a future release these two settings will be linked so that it will not be possible to have contradictory settings.
12.3.1. Setting up SSL
What follows is a manual procedure for setting up SSL. In most cases it has been superseded by the above Big Replicate UI-driven method. If you make changes using the following method, you will need to restart the Big Replicate server in order for the changes to appear on the Settings tab.
Create the keystores / truststores. Every Big Replicate server and IHC server should have a KeyStore with a private key entry / certificate chain for encrypting and signing. Every Big Replicate server and Big Replicate client must also have a truststore for validating certificates in the path specified in "fusion.ssl.truststore". The keystores and truststores can be the same file and may be shared amongst the processes.
Big Replicate Server configuration for SSL
To configure Server-Server or Server-Client SSL, add the following configuration to the application.properties file. E.g.
ssl.enabled=true
ssl.key.alias=socketbox
ssl.key.password=***********
ssl.keystore=/etc/ssl/key.store
ssl.keystore.password=**************
Server-Server or Server-Client
Configure the keystore for each server:
Key | Value | Default | File
---|---|---|---
ssl.key.alias | Alias of private key/certificate chain in KeyStore | NA | application.properties
ssl.key.password | Encrypted password to key | NA | application.properties
ssl.keystore | Path to KeyStore | NA | application.properties
ssl.keystore.password | Encrypted password to KeyStore | NA | application.properties
Server-to-Server or Server-to-IHC
Configure the truststore for each server:
Key | Value | Default | File
---|---|---|---
ssl.truststore | Path to truststore | Default | application.properties
ssl.truststore.password | Encrypted password to truststore | Default | application.properties
Big Replicate client configuration (Server-Client SSL only)
Configure the truststore for each client:
Key | Value | Default | File
---|---|---|---
fusion.ssl.truststore | Path to truststore | NA | core-site.xml
fusion.ssl.truststore.password | Encrypted password for truststore | NA | core-site.xml
fusion.ssl.truststore.type | JKS, PKCS12 | JKS | core-site.xml
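For illustration, the client truststore settings above would be added to core-site.xml like this (the path and password are placeholders):
<property>
  <name>fusion.ssl.truststore</name>
  <value>/etc/security/clientKeys/all.jks</value>
</property>
<property>
  <name>fusion.ssl.truststore.password</name>
  <value><encrypted password></value>
</property>
<property>
  <name>fusion.ssl.truststore.type</name>
  <value>JKS</value>
</property>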
IHC Server configuration (Server-IHC SSL only)
Configure the keystore for each IHC server:
Key | Value | Default | File
---|---|---|---
ihc.ssl.key.alias | Alias of private key/certificate chain in keystore | NA | .ihc
ihc.ssl.key.password | Encrypted password to key | NA | .ihc
ihc.ssl.keystore | Path to keystore | NA | .ihc
ihc.ssl.keystore.password | Encrypted password to keystore | NA | .ihc
ihc.ssl.keystore.type | JKS, PKCS12 | JKS | .ihc
Enable SSL:
The following configuration is used to turn on each type of SSL encryption:
Type | Key | Value | Default | File
---|---|---|---|---
Big Replicate Server - Big Replicate Server | ssl.enabled | true | false | application.properties
Big Replicate Server - Big Replicate Client | fusion.ssl.enabled | true | false | core-site.xml
Big Replicate Server - Big Replicate IHC Server | fusion.ihc.ssl.enabled | true | false | .ihc
12.4. Enable SSL (HTTPS) for the Big Replicate Server
The manual steps (if you prefer not to use the UI's server settings) for getting the Big Replicate Server to support HTTPS connections: you need to add the following property to application.properties.
Type | Key | Value | Default | File
---|---|---|---|---
Enable HTTPS support for Big Replicate core | fusion.http.policy | HTTP_ONLY, HTTPS_ONLY, BOTH_HTTP_HTTPS. If you enable HTTPS_ONLY, you need to make some matching changes to the Big Replicate UI server so that it is able to communicate with the core Big Replicate server. | HTTP_ONLY | application.properties
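For example, to accept both protocols while you transition, the entry in application.properties would be a single line (choose the value that matches your SSL configuration):
fusion.http.policy=BOTH_HTTP_HTTPS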
12.4.1. Enable HTTPS for Big Replicate UI
Note that if you enable the Big Replicate Server to communicate over HTTPS-only, then you must also make the following changes so that the Big Replicate UI matches up:
target.ssl true
target.port 443
(This is the port that Big Replicate Server uses for accepting REST requests over HTTPS.)
Advanced Options: Only apply these options if you fully understand what they do. The following Advanced Options provide a number of low-level configuration settings that may be required for installation into certain environments. The incorrect application of some of these settings could cause serious problems, so for this reason we strongly recommend that you discuss their use with IBM's support team before enabling them.
URI Selection
The default behaviour for Big Replicate is to fix all replication to the Hadoop Distributed File System / hdfs:/// URI. Setting the hdfs scheme provides the widest support for Hadoop client applications, since some applications can't support the "fusion:///" URI, or they can only run on HDFS rather than the more lenient HCFS. Each option is explained below:
- Use HDFS URI with HDFS file system
-
The element appears in a radio button selector:
This option is available for deployments where the Hadoop applications support neither the Big Replicate URI nor the HCFS standards. Big Replicate operates entirely within HDFS.
This configuration will not allow paths with the fusion:/// URI to be used; only paths starting with hdfs:/// or with no scheme that correspond to a mapped path will be replicated. The underlying file system will be an instance of the HDFS DistributedFileSystem, which will support applications that aren't written to the HCFS specification.
- Use Big Replicate URI with HCFS file system
-
This is the default option that applies if you don't enable Advanced Options, and was the only option in Big Replicate prior to version 2.6. When selected, you need to use fusion:// for all data that must be replicated over an instance of the Hadoop Compatible File System. If your deployment includes Hadoop applications that are either unable to support the Big Replicate URI or are not written to the HCFS specification, this option will not work.
- Use Big Replicate URI with HDFS file system
-
This differs from the default in that while the Big Replicate URI is used to identify data to be replicated, the replication is performed using HDFS itself. This option should be used if you are deploying applications that can support the Big Replicate URI but not the Hadoop Compatible File System.
- Use Big Replicate URI and HDFS URI with HDFS file system
-
This "mixed mode" supports all the replication schemes (
fusion://
, hdfs:// and no scheme) and uses HDFS for the underlying file system, to support applications that aren’t written to the HCFS specification.
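For example, with mixed mode enabled, a replicated directory can be addressed through either scheme (the path is illustrative):
hadoop fs -ls fusion:///repl1/
hadoop fs -ls hdfs:///repl1/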
12.4.2. Setting up Node Location
Big Replicate is designed to fit into deployments that have far-flung data centers. The Node Location setting is used to identify where in the world the data center is situated, using standard global positioning system coordinates. These coordinates are used by any connected Big Replicate nodes to correctly place the node's location on the world map.
12.4.3. Set up email notifications
This section describes how to set up notification emails that will be triggered if one of the tracked system resources reaches a defined threshold.
Email notification is disabled by default. You must complete the following steps before any messages will be sent.
Email Notification Settings are located in the Zone section of the settings.
Complete the following steps to enable email notification:
-
Enter your SMTP properties in the Server configuration tab.
-
Enter recipient addresses in the Recipients tab.
-
Tick the Enable check-box for each trigger-event for which you want an email notification sent out.
-
Optionally, you can customize the messaging that will be included in the notification email by adding your own text in the Templates tab.
12.5. Notification emails
The following triggers support email notification. See the Templates section for more information.
- Consistency Check Failing
-
Email sent if a consistency check fails.
- CPU Load Threshold Hit
-
The threshold set on the Dashboard graph for CPU Load has been reached. See Dashboard Graphs Settings.
- HDFS Usage Threshold Hit
-
The threshold set on the Dashboard graph for database partition disk usage has been reached. See Dashboard Graphs Settings.
- Java Heap Usage Threshold Hit
-
The system’s available Java Heap Threshold has been reached. See Dashboard Graphs Settings.
- License Expiring
-
The deployment’s IBM license is going to expire.
- Node Down
-
One of the nodes in your deployment is down.
- Quorum Lost
-
One of the active replication groups is unable to continue replication due to the loss of one or more nodes.
12.5.1. Server config
The Server config tab contains the settings for the SMTP email server that you will use for relaying your notification emails. You need to complete these details and check that they are correct before your notification emails can be enabled.
Email Notification Settings are located in the Zone section of the settings
- SMTP Host
-
The hostname or IP address for your email relay server.
- SMTP Port
-
The port used by your email relay service. SMTP default port is 25.
- Connection Encryption
-
Drop-down for choosing the type of encryption that the mail server uses; None, SSL or TLS are supported. If SSL or TLS is selected, make sure that you adjust the SMTP port value, if required.
- Authentication
-
Checkbox for indicating that a username and password are required for connecting to the mail server. If you tick the checkbox additional entry fields will appear.
- SMTP Username
-
A username for connecting to the email server.
- SMTP Password
-
A password for connecting to the email server.
- From
-
Optional field for adding the sender email address that will be seen by the recipient.
- To
-
Optional field for entering an email address that can be used for testing that the email setup will work.
- Update Settings
-
Click this button to store your email notification entries.
- Reset Changes
-
Reloads the saved settings, undoing any changes that you have made in the template that have not been saved.
- Send Test Email
-
Sends a test email to the address in the To field, so that you can verify your settings work.
12.5.2. Recipients
The recipients tab is used to store one or more email addresses that can be used when sending out notification emails. You can enter any number of addresses, although you will still need to associate an entered address with a specific notification before it will be used. See Adding recipients
12.5.3. Adding recipients
-
Enter a valid email address for a recipient who should receive a notification email from Big Replicate.
-
Click the Add button.
You can repeat the procedure as many times as you like: you can send each notification to a different recipient (by associating that recipient's address with the particular trigger), or you can send a single notification email to multiple recipients (by associating multiple addresses with the notification email).
12.5.4. Enable Notification Emails
Once you have working server settings and valid recipient email addresses, you can start to enable notification emails from the Alerts tab.
-
Go to the Alerts tab and select a notification trigger for which you would like to send emails. For example Consistency Check Failing. Tick the Enabled checkbox.
If a trigger is not enabled, no email notification will ever be sent. Likewise, an enabled trigger will not send out notification emails unless recipients are added.
Figure 95. Email Notification Enabled
-
From the Add More Recipients window, click on one or more of the recipients that you entered into the Recipients tab. Once you have finished selecting recipients, click Add.
Figure 96. Email Notification Selected
-
The email notification is now set up. You can change or add recipients, and review or customize the messaging by clicking on the Edit Template link.
Figure 97. Email Notification - Add
12.5.5. Templates
The Templates tab gives you access to the default email text, allowing you to review it and customize it with additional messaging.
Email templates
- Consistency Check Failing
-
This is the trigger system event for which the notification email will be sent.
- Subject
-
The email’s subject line. A default value is set for each of the triggers, however, you can reword these by changing the text in the template.
- Custom Message
-
This entry box lets you add your own messaging to the notification. This could be anything that might be useful to an on-duty administrator such as links to related documentation or contact details for the next level of support, etc.
- Message Body
-
The message body contains the fixed payload of the notification email; you can’t edit this element and it may contain specific error messaging taken from logs.
12.5.6. Example Notification Email
This is what an email notification looks like:
From: cluster-admin@organization.com
Date: Mon, Jan 4, 2016 at 3:49 PM
Subject: IBM Big Replicate UI - Consistency Check Failing
To: admin@organization.com

Here is a custom message. - Custom messaging entered in the Template
Consistency Check Failing triggered a watch event, any relevant error message will appear below. - Default Message
The following directory failed consistency check: /repl1 - Specific error message
==================== NODE DETAILS =====================
Host Name : xwstest-01.your.organization.com
IP address : 10.0.0.146
IP port : 6444
-------------------------------------------------------
Node Id : wdfs1
Node Name : wdfs1
Node status : LOCAL
Node's zone : zone1
Node location : location1
-------------------------------------------------------
Memory usage : 0.0%
Disk usage : 0.0%
Last update : 2016.Jan.04 at 15:49:28 GMT
Time Now : 2016.Jan.04 at 15:49:48 GMT
======================================================= - Standard footer
12.6. Setting up Kerberos
If the Hadoop deployment is secured using Kerberos, you need to enable Kerberos in the Big Replicate UI. Use the following procedure, and refer to the security procedures of your particular form of Hadoop.
Running with a unified or per-service principal:
Unified
Some Hadoop platforms are Kerberized under a single hdfs user; this is common in Cloudera deployments. For simplicity, this is what we recommend.
-
Generate a keytab for each of your Big Replicate nodes using the hdfs service. For clarification, the steps below present a manual setup:
ktadd -k fusion.keytab -norandkey hdfs/${hostname}@${krb_realm}
Per-service
-
If your deployment uses separate principals for each HDFS service then you will need to set up a principal for Big Replicate.
-
On the KDC, using kadmin.local, create new principals for the Big Replicate user and generate a keytab file, e.g.:
> addprinc -randkey hdfs/${hostname}@${krb_realm}
> ktadd -k fusion.keytab -norandkey hdfs/${hostname}@${krb_realm}
Copy the generated keytab to a suitable filesystem location, e.g. /etc/wandisco/security/ on the Big Replicate server, that will be accessible to your controlling system user, "hdfs" by default.
Note: We don't recommend storing the keytab in Hadoop's own Kerberos directory /etc/hadoop/conf, given that this is overwritten by the cluster manager.
12.6.1. Setting up handshake tokens
By default, handshake tokens are created in the users' working directories, e.g. /user/jdoe. It is recommended that you create them elsewhere, using the following procedure:
-
Open the core-site.xml file and add the following property:
<property>
  <name>fusion.handshakeToken.dir</name>
  <value>/some/token/dir</value>
</property>
fusion.handshakeToken.dir
This is the location where you want handshake tokens to be created for the cluster. E.g., if for DC1 you configure "handshakeToken.dir" to be "/repl1/tokens/", then handshake tokens will be written to "/repl1/tokens/.fusion/.token_$USERNAME_$UUID", where $USERNAME is the username of the connecting user and $UUID is a random UUID.
Important requirement: All Big Replicate system users must have read and write permissions for the location.
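For example, you might create the token location used above and grant all users read/write access (the path comes from the example; 777 is for illustration only, so tighten the permissions to match your security policy):
sudo -u hdfs hadoop fs -mkdir -p /repl1/tokens
sudo -u hdfs hadoop fs -chmod 777 /repl1/tokens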
Important: Known issue running Teragen and Terasort
There are known problems running Teragen and Terasort with FusionHdfs or FusionHcfs configurations. Some required directories are currently missing and will cause Terasort to hang. You can work around the problem by creating the following directories, then making sure that the yarn and mapred users are added and that they have access to the directories. E.g.,
sudo -u hdfs hadoop fs -mkdir /user/yarn
sudo -u hdfs hadoop fs -chown yarn /user/yarn
sudo -u hdfs hadoop fs -mkdir /user/mapred
sudo -u hdfs hadoop fs -chown mapred /user/mapred
12.6.2. Set up Kerberos single KDC with Ambari
The following procedures illustrate how to install Kerberos, running with a single Key Distribution Center, under Ambari.
When to use kadmin.local and kadmin?
When performing the Kerberos commands in this procedure you can use kadmin.local or kadmin, depending on your access and account:
-
If you can log onto the KDC host directly, and have root access or a Kerberos admin account, use the kadmin.local command:
$ sudo kadmin.local
-
When accessing the KDC from a remote host, use the kadmin command:
$ kadmin
12.6.3. Setup Procedure
Before you start, download and install the Java Cryptographic Extension (JCE) Unlimited Strength Jurisdiction Policy Files 7.
unzip UnlimitedJCEPolicyJDK7.zip -d /usr/jdk64/jdk1.7.0_67/jre/lib/security/
Install the Kerberos server:
yum install -y krb5-server krb5-libs krb5-auth-dialog krb5-workstation
Edit /etc/krb5.conf and replace "EXAMPLE.COM" with your realm. E.g.
sed -i "s/EXAMPLE.COM/DOMAIN.COM/g" /etc/krb5.conf /var/kerberos/krb5kdc/kdc.conf /var/kerberos/krb5kdc/kadm5.acl
[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = DOMAIN.COM
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true

[realms]
 DOMAIN.COM = {
  kdc = host15-vm0.cfe.domain.com
  admin_server = host15-vm0.cfe.domain.com
 }

[domain_realm]
 .wandisco.com = DOMAIN.COM
 wandisco.com = DOMAIN.COM
Edit /var/kerberos/krb5kdc/kdc.conf:
[kdcdefaults]
 kdc_ports = 88
 kdc_tcp_ports = 88

[realms]
 DOMAIN.COM = {
  #master_key_type = aes256-cts
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  dict_file = /usr/share/dict/words
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  max_life = 24h 0m 0s
  max_renewable_life = 7d
  supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
 }
Edit the /var/kerberos/krb5kdc/kadm5.acl and replace EXAMPLE.COM with your realm.
To create a database, run
/usr/sbin/kdb5_util create -s
Start the Kerberos services:
/sbin/service krb5kdc start
/sbin/service kadmin start
Prepare your Kerberos clients. Run:
yum install -y krb5-libs krb5-workstation
Repeat this on all other machines in the cluster to make them Kerberos workstations connecting to the KDC. E.g.
for i in {1..4}; do ssh root@vmhost17-nfs$i.cfe.domain.com 'yum install -y krb5-libs krb5-workstation';done
Copy the /etc/krb5.conf file from the Kerberos server node to all Kerberos client nodes:
for i in {1..5}; do scp /etc/krb5.conf root@vmhost17-vm$i.cfe.domain.com:/etc/;done
Create a user on all nodes: useradd -u 1050 testuser
for i in {0..4}; do ssh root@vmhost17-nfs$i.cfe.domain.com 'useradd -u 1050 testuser';done
Create a principal and password for the user (testuser):
[root@vmhost17-vm0 ~]# kadmin.local
Authenticating as principal root/admin@DOMAIN.COM with password.
kadmin.local: addprinc testuser/admin
WARNING: no policy specified for testuser/admin@DOMAIN.COM; defaulting to no policy
Enter password for principal "testuser/admin@DOMAIN.COM":
Re-enter password for principal "testuser/admin@DOMAIN.COM":
Principal "testuser/admin@DOMAIN.COM" created.
kadmin.local: exit
[root@vmhost01-vm1 ~]# su - testuser
[testuser@vmhost01-vm1 ~]$ kinit
Password for testuser/admin@DOMAIN.COM:
[testuser@vmhost01-vm1 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_519
Default principal: testuser/admin@DOMAIN.COM
Valid starting     Expires            Service principal
04/29/15 18:17:15  04/30/15 18:17:15  krbtgt/DOMAIN.COM@DOMAIN.COM
        renew until 04/29/15 18:17:15
12.6.4. Big Replicate installation step
During the Big Replicate Installation’s Kerberos step, set the configuration for an existing Kerberos setup.
12.6.5. Set up Kerberos single KDC on CDH cluster
The following procedures illustrate how to install Kerberos, running with a single Key Distribution Center, under CDH.
12.6.6. Set up a KDC and Default Domain
When to use kadmin.local and kadmin?
When performing the Kerberos commands in this procedure you can use kadmin.local or kadmin, depending on your access and account:
-
If you can log onto the KDC host directly, and have root access or a Kerberos admin account, use the kadmin.local command:
$ sudo kadmin.local
-
When accessing the KDC from a remote host, use the kadmin command:
$ kadmin
12.6.7. Setup Procedure
-
Before you start, download and install the Java Cryptographic Extension (JCE) Unlimited Strength Jurisdiction Policy Files 7.
unzip UnlimitedJCEPolicyJDK7.zip -d /usr/jdk64/jdk1.7.0_67/jre/lib/security/
-
Install the Kerberos server:
yum install -y krb5-server krb5-libs krb5-auth-dialog krb5-workstation
-
Edit /etc/krb5.conf and replace "EXAMPLE.COM" with your realm. E.g.
sed -i "s/EXAMPLE.COM/DOMAIN.COM/g" /etc/krb5.conf /var/kerberos/krb5kdc/kdc.conf /var/kerberos/krb5kdc/kadm5.acl
[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = DOMAIN.COM
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true

[realms]
 DOMAIN.COM = {
  kdc = host15-vm0.cfe.domain.com
  admin_server = host15-vm0.cfe.domain.com
 }

[domain_realm]
 .wandisco.com = DOMAIN.COM
 wandisco.com = DOMAIN.COM
-
Edit /var/kerberos/krb5kdc/kdc.conf:
[kdcdefaults]
 kdc_ports = 88
 kdc_tcp_ports = 88

[realms]
 DOMAIN.COM = {
  #master_key_type = aes256-cts
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  dict_file = /usr/share/dict/words
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  max_life = 24h 0m 0s
  max_renewable_life = 7d
  supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
 }
-
Edit the /var/kerberos/krb5kdc/kadm5.acl and replace EXAMPLE.COM with your realm.
-
To create a database, run
/usr/sbin/kdb5_util create -s
-
Start the Kerberos services:
/sbin/service krb5kdc start
/sbin/service kadmin start
-
Prepare your Kerberos clients. Run:
yum install -y krb5-libs krb5-workstation
Repeat this on all other machines in the cluster to make them Kerberos workstations connecting to the KDC. E.g.
for i in {1..4}; do ssh root@vmhost17-nfs$i.cfe.domain.com 'yum install -y krb5-libs krb5-workstation';done
-
Copy the /etc/krb5.conf file from the Kerberos server node to all Kerberos client nodes:
for i in {1..5}; do scp /etc/krb5.conf root@vmhost17-vm$i.cfe.domain.com:/etc/;done
-
Create a user on all nodes: useradd -u 1050 testuser
for i in {0..4}; do ssh root@vmhost17-nfs$i.cfe.domain.com 'useradd -u 1050 testuser';done
-
Create a principal and password for the user (testuser):
[root@vmhost17-vm0 ~]# kadmin.local
Authenticating as principal root/admin@DOMAIN.COM with password.
kadmin.local: addprinc testuser/admin
WARNING: no policy specified for testuser/admin@DOMAIN.COM; defaulting to no policy
Enter password for principal "testuser/admin@DOMAIN.COM":
Re-enter password for principal "testuser/admin@DOMAIN.COM":
Principal "testuser/admin@DOMAIN.COM" created.
kadmin.local: exit
[root@vmhost01-vm1 ~]# su - testuser
[testuser@vmhost01-vm1 ~]$ kinit
Password for testuser/admin@DOMAIN.COM:
[testuser@vmhost01-vm1 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_519
Default principal: testuser/admin@DOMAIN.COM
Valid starting     Expires            Service principal
04/29/15 18:17:15  04/30/15 18:17:15  krbtgt/DOMAIN.COM@DOMAIN.COM
        renew until 04/29/15 18:17:15
-
Then add:
kadmin.local: addprinc hdfs@DOMAIN.COM
Create hdfs.keytab and move the hdfs.keytab file into the /etc/cloudera-scm-server/ directory on the host where you are running the Cloudera Manager Server. Make sure that the hdfs.keytab file has readable permissions for all users:
kadmin: xst -k hdfs.keytab hdfs@DOMAIN.COM
mv hdfs.keytab /etc/cloudera-scm-server/
chmod +r /etc/cloudera-scm-server/hdfs.keytab
12.6.8. Create a Kerberos Principal and Keytab File for the Cloudera Manager Server
The following sequence is an example procedure for creating the Cloudera Manager Server principal and keytab file for MIT Kerberos.
-
In the kadmin.local or kadmin shell, type in the following command to create the Cloudera Manager Service principal:
kadmin: addprinc -randkey cloudera-scm/admin@DOMAIN.COM
-
Create the Cloudera Manager Server
cmf.keytab
file:kadmin: xst -k cmf.keytab cloudera-scm/admin@DOMAIN.COM
Important: The Cloudera Manager Server keytab file must be named cmf.keytab because that name is hard-coded in Cloudera Manager.
12.6.9. Deploying the Cloudera Manager Server Keytab
After obtaining or creating the Cloudera Manager Server principal and keytab, follow these instructions to deploy them:
Move the cmf.keytab file to the /etc/cloudera-scm-server/ directory on the host where you are running the Cloudera Manager Server:
$ mv cmf.keytab /etc/cloudera-scm-server/
Ensure that the cmf.keytab file is only readable by the Cloudera Manager Server user account, cloudera-scm:
sudo chown cloudera-scm:cloudera-scm /etc/cloudera-scm-server/cmf.keytab
sudo chmod 600 /etc/cloudera-scm-server/cmf.keytab
Add the Cloudera Manager Server principal (cloudera-scm/admin@DOMAIN.COM) to a text file named cmf.principal and store the cmf.principal file in the /etc/cloudera-scm-server/ directory on the host where you are running the Cloudera Manager Server.
Make sure that the cmf.principal file is only readable by the Cloudera Manager Server user account cloudera-scm.
sudo chown cloudera-scm:cloudera-scm /etc/cloudera-scm-server/cmf.principal
sudo chmod 600 /etc/cloudera-scm-server/cmf.principal
Note: For a single KDC, copy cmf.keytab and cmf.principal to the other CM node:
scp /etc/cloudera-scm-server/cmf* vmhost17-vm0.bdfrem.wandisco.com:/etc/cloudera-scm-server/
12.7. Configure the Kerberos Default Realm in the Cloudera Manager Admin Console
-
In the Cloudera Manager Admin Console, select Administration > Settings.
-
Click the Security category, and enter the Kerberos realm for the cluster in the Kerberos Security Realm field that you configured in the
krb5.conf
file. -
Click Save Changes.
12.8. Adding Gateway roles to all YARN hosts.
-
From the Services tab, select your YARN service.
-
Click the
Instances
tab. -
Click
Add Roles
and chooseGateway role
. -
Select
all hosts
and clickInstall
.
12.9. Enable Hadoop Security
You can do this by hand, see CM Enable Security.
12.10. Cloudera Manager Kerberos Wizard
After configuring Kerberos, you now have a working Kerberos server and can secure the Hadoop cluster. The wizard will do most of the heavy lifting; you just have to fill in a few values.
-
To start, log into Cloudera Manager by going to http://your_hostname:7180 in your browser. The user ID and password are the same as those used for accessing your Management Endpoint (Ambari or Cloudera Manager, etc.), or if you're running without a manager, such as with a Cloud deployment, they will be set in a properties file.
-
There are lots of productivity tools here for managing the cluster, but ignore them for now and head straight for the Administration > Kerberos wizard.
-
Click on the " Enable Kerberos" button.
-
Check each KRB5 Configuration item and select Continue.
Figure 99. Kerberos config
-
The Kerberos Wizard needs to know the details of what the script configured. Fill in the entries as follows:
-
KDC Server Host: KDC_hostname
-
Kerberos Security Realm: DOMAIN.COM
-
Kerberos Encryption Types: aes256-cts-hmac-sha1-96
Click Continue.
-
-
You want Cloudera Manager to manage the krb5.conf files in your cluster, so check "Yes" and then select "Continue".
-
Enter the credentials for the account that has permissions to create other principals.
User: testuser/admin@DOMAIN.COM
Password: password for testuser/admin@DOMAIN.COM
-
The next screen provides good news. It lets you know that the wizard was able to successfully authenticate.
-
In this step the setup wizard will create Kerberos principals for each service in the cluster.
-
You’re ready to let the Kerberos Wizard do its work. You should select I’m ready to restart the cluster now and then click Continue.
-
Kerberos is now successfully enabled. You are now running a Hadoop cluster secured with Kerberos.
12.11. Big Replicate installation step
Enter the paths to the /etc/krb5.conf file and the hdfs.keytab file, then select the hdfs principal.
12.11.1. Kerberos and HDP’s Transparent Data Encryption
There are some extra steps required to overcome a class-loading error that occurs when Big Replicate is used with at-rest encrypted folders. Specifically, the cluster configuration changes described as follows:
<property>
  <name>hadoop.kms.proxyuser.fusion.users</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.kms.proxyuser.fusion.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.kms.proxyuser.fusion.hosts</name>
  <value>*</value>
</property>
12.12. Setting up SSL encryption for DConE traffic
Big Replicate supports the use of Secure Socket Layer encryption (SSL) for securing its replication traffic. To enable this encryption you need to generate a keypair that must be put into place on each of your Big Replicate nodes. You then need to add some variables to the application.properties file.
-
Open a terminal and navigate to <INSTALL_DIR>/etc/wandisco/config.
-
Within /config make a new directory called ssl:
mkdir ssl
-
Navigate into the new directory.
cd ssl
-
Copy your private key into the directory. If you don't already have keys set up, you can use Java's keytool utility, using the command:
keytool -genkey -keyalg RSA -keystore wandisco.ks -alias server -validity 3650 -storepass <YOUR PASSWORD>
Read more about the Java keystore generation tool in the KB article - Using Java Keytool to manage keystores
Ensure that the system account that runs the Big Replicate server process has sufficient privileges to read the keystore files.
Java keytool options
- -genkey
-
Switch for generating a key pair (a public key and associated private key). Wraps the public key into an X.509 v1 self-signed certificate, which is stored as a single-element certificate chain. This certificate chain and the private key are stored in a new keystore entry identified by alias.
- -keyalg RSA
-
The key algorithm; in this case RSA is specified.
- wandisco.ks
-
The file name for your private key file that will be stored in the current directory.
- -alias server
-
Assigns an alias "server" to the key pair. Aliases are case-insensitive.
- -validity 3650
-
Validates the keypair for 3650 days (10 years). The default would be 3 months.
- -storepass <YOUR PASSWORD>
-
This provides the keystore with a password. If no password is specified on the command, you'll be prompted for it. Your entry will not be masked, so you (and anyone else looking at your screen) will be able to see what you type.
Most commands that interrogate or change the keystore will need to use the store password. Some commands may need to use the private key password. Passwords can be specified on the command line (using the -storepass and -keypass options). However, a password should not be specified on a command line or in a script unless it is for testing purposes, or you are on a secure system.
The utility will prompt you for the following information:
What is your first and last name? [Unknown]:
What is the name of your organizational unit? [Unknown]:
What is the name of your organization? [Unknown]:
What is the name of your City or Locality? [Unknown]:
What is the name of your State or Province? [Unknown]:
What is the two-letter country code for this unit? [Unknown]:
Is CN=Unknown, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=Unknown correct? [no]: yes
Enter key password for <mykey> (RETURN if same as keystore password):
-
With the keystore now in place, you'll need to add variables to application.properties.
12.13. SSL DConE Encryption Variables for application.properties
Variable Name | Example | Description
---|---|---
ssl.enabled | true | Requires a "true" or "false" value. When the value is set to false, none of the other variables will be used.
ssl.debug | true | Requires a "true" or "false" value. When set to true, debugging mode is enabled.
ssl.keystore | ./properties/wandisco.ks | The path to the SSL private keystore file that is stored in the node. By default this is called "wandisco.ks".
ssl.key.alias | wandisco | The assigned alias for the key pair. Aliases are case-insensitive.
ssl.keystore.password | <a password> | The SSL key password. This is described in more detail in Setting a password for SSL encryption.
ssl.truststore | ./properties/wandisco.ks | The path to the SSL private truststore file that is stored in the node. By default this is called "wandisco.ks" because, by default, the keystore and truststore are one and the same file, although they don't have to be.
ssl.truststore.password | bP0L7SY7f/4GWSdLLZ3e+ | The truststore password. The password should be encrypted.
Changes to any of these values require a restart of the DConE service. Any invalid value will restart the replicator and no DConE traffic will flow.
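Drawing the table together, the DConE SSL entries in application.properties might look like the following sketch (the alias, paths and passwords are placeholders; passwords should be the encrypted values):
ssl.enabled=true
ssl.debug=false
ssl.key.alias=server
ssl.keystore=/opt/wandisco/ssl/wandisco.ks
ssl.keystore.password=<encrypted keystore password>
ssl.truststore=/opt/wandisco/ssl/wandisco.ks
ssl.truststore.password=<encrypted truststore password>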
12.13.1. Setting the server key
In the keystore, the server certificate is associated with a key. By default, we look for a key named server to validate the certificate. If you use a key with a different name for the server, enter this name in the SSL settings.
12.13.2. SSL Troubleshooting
A complete debug of the SSL logging will be required to diagnose any problems. To capture the debugging, ensure that the variable ssl.debug is set to "true". To enable logging of the SSL implementation layer, set logging to FINEST for the 'com.wandisco.platform.net' package.
12.14. Enable SSL for Hadoop Services
This section shows you how to enable SSL encryption for Hadoop’s native services such as HDFS, Yarn or MapReduce.
-
On ALL nodes create key directories:
/etc/security/serverKeys and /etc/security/clientKeys
-
On all nodes, create keystore files:
cd /etc/security/serverKeys
keytool -genkeypair -alias $HOSTNAME -keyalg RSA -keysize 2048 -dname CN=$HOSTNAME,OU=Dev,O=BigData,L=SanRamon,ST=ca,C=us -keypass $PASSWORD -keystore $HOSTNAME.ks -storepass $PASSWORD
For further explanation of what these options do, see the Java keytool options above.
-
On all nodes export the certificate public key to a certificate file:
cd /etc/security/serverKeys
keytool -exportcert -alias $HOSTNAME -keystore $HOSTNAME.ks -rfc -file $HOSTNAME.crt -storepass $PASSWORD
-
On all nodes, import the certificate into truststore file:
cd /etc/security/serverKeys
keytool -importcert -noprompt -alias $HOSTNAME -file $HOSTNAME.crt -keystore $HOSTNAME.trust -storepass $PASSWORD
-
Create a single truststore file containing the public keys from all of the certificates (this will be for clients). Start on node1:
cd /etc/security/serverKeys
Copy the truststore file from the current node to the next one and repeat the import step above, so that each host's certificate is added (see the sketch after this procedure).
-
From the last node, copy the truststore, which now has all the certificates, to all servers under /etc/security/clientKeys/all.jks
-
On all nodes, copy keystore to “service”.ks (e.g. hdfs.ks)
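A sketch of the copy-and-merge loop from step 5, run on each host in turn so that every certificate ends up in the single truststore (the next-host name is a placeholder; assumes SSH access between hosts):
cd /etc/security/serverKeys
keytool -importcert -noprompt -alias $HOSTNAME -file $HOSTNAME.crt -keystore all.jks -storepass $PASSWORD
scp all.jks root@next-host:/etc/security/serverKeys/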
12.15. Keystores are used in two ways:
-
The keystore contains private keys and certificates used by SSL servers to authenticate themselves to SSL clients. By convention, such files are referred to as keystores.
-
When used as a truststore, the file contains certificates of trusted SSL servers, or of Certificate Authorities trusted to identify servers. There are no private keys in the truststore.
Most commonly, cert-based authentication is only done in one direction, server→client. When a client also authenticates with a certificate, this is called mutual authentication.
While all SSL clients must have access to a truststore, it is not always necessary to create and deploy truststores across a cluster. The standard JDK distribution includes a default truststore which is pre-provisioned with the root certificates of a number of well-known Certificate Authorities. If you do not provide a custom truststore, the Hadoop daemons load this default truststore. Therefore, if you are using certificates issued by a CA in the default truststore, you do not need to provide custom truststores. However, you must consider the following before you decide to use the default truststore:
If you choose to use the default truststore, it is your responsibility to maintain it. You may need to remove the certificates of CAs you do not deem trustworthy, or add or update the certificates of CAs you trust. Use the keytool utility to perform these actions.
12.15.1. Security Considerations
Keystores contain private keys; truststores do not. Therefore, the security requirements for keystores are more stringent:
-
Hadoop SSL requires that truststores and the truststore password be stored, in plaintext, in a configuration file that is readable by all.
-
Keystore and key passwords are stored, in plaintext, in a file that is readable only by members of the appropriate group.
These considerations should guide your decisions about which keys and certificates you will store in the keystores and truststores that you will deploy across your cluster.
Keystores should contain a minimal set of keys and certificates. Ideally you should create a unique keystore for each host, which would contain only the keys and certificates needed by the Hadoop SSL services running on the host. Usually the keystore would contain a single key/certificate entry. However, because truststores do not contain sensitive information you can safely create a single truststore for an entire cluster. On a production cluster, such a truststore would often contain a single CA certificate (or certificate chain), since you would typically choose to have all certificates issued by a single CA.
Important: Do not use the same password for truststores and keystores/keys. Since truststore passwords are stored in the clear in files readable by all, doing so would compromise the security of the private keys in the keystore.
12.15.2. SSL roles for Hadoop Services
Service | SSL Role
---|---
HDFS | server and client
MapReduce | server and client
YARN | server and client
HBase | server
Oozie | server
Hue | client
SSL servers load the keystores when starting up. Clients then take a copy of the truststore and use it to validate the server's certificate.
12.17. Before you begin
Ensure keystores/certificates are accessible on all hosts running HDFS, MapReduce or YARN. As these services also run as clients they also need access to the truststore. (As mentioned, it’s okay to put the truststores on all nodes as you can’t always determine which hosts will be running the relevant services.)
Keystores must be owned by the hadoop group and have permissions 0440 (readable by owner and group). Truststores must have permission 0444 (readable by all).
You'll need to specify the absolute paths to the keystore and truststore files. These paths need to be valid for all hosts, which translates into a requirement for all keystore file names for a given service to be the same on all hosts.
Multiple daemons running on a host can share a certificate. For example, in case there is a DataNode and an Oozie server running on the same host, they can use the same certificate.
12.18. Configuring SSL for HDFS
-
In Ambari, navigate to the HDFS service and edit the configuration.
-
Type SSL into the search field to show the SSL properties.
-
Make edits to the following properties:
Property | Description
---|---
SSL Server Keystore File Location | Path to the keystore file containing the server certificate and private key.
SSL Server Keystore File Password | Password for the server keystore file.
SSL Server Keystore Key Password | Password that protects the private key contained in the server keystore.
-
If you don’t plan to use the default truststore, configure SSL client truststore properties:
Property | Description
---|---
Cluster-Wide Default SSL Client Truststore Location | Path to the client truststore file. This truststore contains certificates of trusted servers, or of Certificate Authorities trusted to identify servers.
Cluster-Wide Default SSL Client Truststore Password | Password for the client truststore file.
-
We recommend that you also enable web UI authentication for the HDFS service, providing that you have already secured the HDFS service. Enter web consoles in the search field to bring up Enable Authentication for HTTP Web-Consoles property. Tick the check box to enable web UI authentication.
Property | Description
---|---
Enable Authentication for HTTP Web-Consoles | Enables authentication for hadoop HTTP web-consoles for all roles of this service.
-
Now the necessary edits are complete, click Save Changes.
-
Follow the next section for setting up SSL for YARN/MapReduce.
12.19. Configuring SSL for YARN / MapReduce
Follow these steps to configure SSL for YARN or MapReduce services.
Navigate to the YARN or MapReduce service and click Configuration.
In the search field, type SSL to show the SSL properties.
Edit the following properties according to your cluster configuration:
Property | Description
---|---
SSL Server Keystore File Location | Path to the keystore file containing the server certificate and private key.
SSL Server Keystore File Password | Password for the server keystore file.
SSL Server Keystore Key Password | Password that protects the private key contained in the server keystore.
We recommend that you also enable web UI authentication for this service, providing that you have already secured it. Enter web consoles in the search field to bring up the Enable Authentication for HTTP Web-Consoles property. Tick the check box to enable web UI authentication.
Property | Description
---|---
Enable Authentication for HTTP Web-Consoles | Enables authentication for hadoop HTTP web-consoles for all roles of this service.
Click Save Changes.
Navigate to the HDFS service and in the search field, type Hadoop SSL Enabled. Click the value for the Hadoop SSL Enabled property and select the checkbox to enable SSL communication for HDFS, MapReduce, and YARN.
Property | Description
---|---
Hadoop SSL Enabled | Enable SSL encryption for HDFS, MapReduce, and YARN web UIs, as well as encrypted shuffle for MapReduce and YARN.
Restart all affected services (HDFS, MapReduce and/or YARN), as well as their dependent services.
13. Hive Metastore plugin for IBM BI deployment
13.1. Introduction
The Hive Metastore plugin enables Big Replicate to replicate Hive’s metastore, allowing Big Replicate to maintain a replicated instance of Hive’s metadata and, in future, support Hive deployments that are distributed between data centers. This guide is specifically aimed at deployments into IBM’s Big Replicate platform.
13.2. Release Notes
Check out the Hive Metastore Plugin Release Notes for the latest information. See Hive Metastore Plugin Release Notes
13.3. Hive Metastore plugin installation
This section covers the installation of the Hive Metastore plugin. Currently, the plugin cannot be installed into an existing installation; you need to complete a full Big Replicate installation, using an installer that incorporates the Hive plugin.
13.4. Pre-requisites
Along with the default requirements that you can find on the Big Replicate Deployment Checklist, you also need to ensure that the Hive service is already running on your server. Installation will fail if the Big Replicate Plugin can’t detect that Hive is already running.
13.4.1. Known Issues
Known Issue: On HDP 2.5, the Metastore fails to start when using DBTokenStore. This problem has been fixed in HDP 2.5.3.
The issue in question doesn't actually stop the metastore from starting. However, the null-pointer/IOException that happens because of the missing token does take the current instance of the metastore down, but it does so when shutdown is called (and subsequently cancel_token). This prevents beeline connections from closing properly outside of the service-level timeout, which causes the service to appear red, as the standard heartbeat for the vanilla metastore is a beeline connection.
Beeline connections will not close immediately in the usual manner (i.e. through the !quit command); they will only close once the !quit command has timed out. (Users may also choose to ctrl-c out of the beeline shell.)
To lessen the pain of this issue, customers may choose to modify the beeline timeout to be more snappy (hive.server2.idle.operation.timeout appears to be the most suitable timeout for this). In order to get the service "healthy" again, the customer can also modify the heartbeat timeout to be greater than the beeline timeout (check.command.timeout is the property for this).
Limitation: Hive must be running at all zones
[HIVE-231] All zones within a membership must be running Hive in order to support replication. We're aware that this currently prevents the popular use case for replicating between on-premises clusters and s3/cloud storage, where Hive is not running. We intend to remove the limitation in a future release.
Known Issue: bigsql-sync.jar must be included in the wd-hive-metastore classpath or the server will not start properly. There are two options for meeting this requirement:
If the jar is available on the wd-hive-metastore node, create a symlink:
cd /opt/wandisco/wd-hive-metastore
ln -s /usr/ibmpacks/current/bigsql/bigsql/lib/java/bigsql-sync.jar bigsql-sync.jar
If the jar is not available, copy the jar from a node that has it to:
/opt/wandisco/wd-hive-metastore
Known Issue: Failed to install metastore service during Big Replicate installation in IOP 4.1 and IOP 4.2
Example failure, during BigInsights installation: the error is caused by the stack not being available via ambari-server.
Workaround:
-
Ensure that only a single ambari-server process is running before doing the service ambari-server restart. To find the ambari-server processes that are running, use:
ps -aux | grep ambari-server
-
Then kill all the ambari-server processes using
kill -9 [pid of process]
and restart the ambari-server using
service ambari-server restart
-
Rerun the check to ensure that only a single process is running:
ps -aux | grep ambari-server
You can then check in the Ambari UI that the WD Hive Metastore and WD Hiveserver2 Template services are available. If they are present, you can proceed with retrying the service installation via the installer.
Known Issue: HDP deployments and the Hive Metastore port
Currently, you can't run the standard Hive metastore service on the same host as the wd-hive-metastore service, because HDP uses the hive.metastore.uris parameter to set the port for the standard Hive metastore service. This impacts IBM Big Replicate, which is based on HDP.
See the Hortonworks documentation about Hive Service Ports.
As noted above, HDP uses the hive.metastore.uris parameter to set the Hive Metastore port. Without the WD Hive Template installed, the HiveServer2 service would use an embedded metastore service and not the separate Hive Metastore service. When this is the case, we can't support running the standard Hive Metastore and the wd-hive-metastore on the same host when using an HDP distribution. We recommend that you stop the standard Hive metastore when using WD Hive on HDP; to be clear, even if the wd-hive-metastore service is deployed onto another host, the standard Hive metastore service port will still be changed by our configuration of hive.metastore.uris.
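As a quick check, you can inspect the metastore URI directly on a Hive node; the configuration path is typical for HDP and the example value is illustrative only:
# Show the current metastore URI (the port is the part after the colon)
grep -A1 'hive.metastore.uris' /etc/hive/conf/hive-site.xml
# e.g. <value>thrift://metastore-host.example.com:9083</value>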
Known Issue: Big Replicate Hive Metastore plugin not installed locally to the Hive Metastore
If Big Replicate Hive Metastore is not installed on the same server as the Hive metastore, then you need to complete the following workaround:
-
Login to kadmin.local or kadmin on the host machine running wd-hive-metastore.
In kadmin use:
addprinc -randkey hive/<WD-hive-metastore-hostname>@<REALM>
addprinc -randkey HTTP/<WD-hive-metastore-hostname>@<REALM>
xst -norandkey -k hive.keytab hive/<WD-hive-metastore-hostname>@<REALM> HTTP/<WD-hive-metastore-hostname>@<REALM>
-
Exit kadmin
Check the keytab has the correct entries by using
klist -e -k -t hive.keytab
-
Move the keytab into place:
sudo mv hive.keytab /etc/wandisco/hive/
-
Make sure the keytab is readable by the hive user by using:
sudo chown hive:hive /etc/wandisco/hive/hive.keytab
-
Now restart Big Replicate server using:
service fusion-server restart
-
Now restart the wd-hive-metastore and HiveServer2 services using your Hadoop manager (e.g. Ambari).
-
Connect to beeline again.
Known Issue: HiveServer2 fails to restart during installation
At the end of an installation of Big Replicate Hive on BigInsights 4.2, without Kerberos, the HiveServer2 process fails to restart. It appears that during the process the deletion and recreation of the hive scratch directory (/tmp/hive) fails, leaving the directory with the wrong permissions.
Workaround:
-
In Ambari, go to the Hive Configs and search for "scratch".
-
Set hive.start.cleanup.scratchdir to false.
-
Deploy the configuration to all nodes.
-
Modify the permissions in HDFS for /tmp/hive, e.g.
hdfs dfs -chmod 733 /tmp/hive
-
Restart all Hive components through Ambari.
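As an optional check, you can confirm that the scratch directory now carries the expected mode (standard HDFS commands; the expected value comes from the step above):
# List the directory itself and inspect the permissions column
hdfs dfs -ls -d /tmp/hive
# expect drwx-wx-wx, i.e. mode 733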
13.5. Installation procedure
-
Download the installer script fusion-ui-server-hdp-hive_deb_installer.sh, etc., from WANdisco.
In this early version of Hive Metastore, the Hive Metastore plugin is provided as a full-blown installer that installs Big Replicate with the Hive Metastore replication plugin already built in.
-
Navigate to the extracted files.
-
Run the IBM Big Replicate installer:
./fusion-ui-server-ibm-hive_rpm_installer.sh
You can optionally pass an argument to the script to set the location where the IBM SLM tags will be written, e.g.
./fusion-ui-server-ibm-hive_rpm_installer.sh /opt/some/other/location
-
If no argument is provided, the location defaults to /var/ibm/slmtags.
-
The location will be created if it does not already exist.
IBM SLM Tagging
Read our guide on how to interpret IBM SLM tagging - SLM Tagging for IBM Big Replicate
-
Run through the installer:
[root@dc00-vm1 IBMTEST]# ll
total 703112
-rw-r--r-- 1 vagrant vagrant 718215802 Jun 16 10:28 fusion-cdh-nothive-33.sh
-rw-r--r-- 1 vagrant vagrant   1763050 Jun 16 10:26 fusion-ui-server-ibm-hive_rpm_installer.sh
-rwxr-xr-x 1 vagrant vagrant      1346 Jun 16 10:23 fusion-ui-server-ibm-hive_rpm_installer.sh
drwxr-xr-x 1 vagrant vagrant       154 Jun 16 10:18 lib
[root@dc00-vm1 IBMTEST]# ./fusion-ui-server-ibm-hive_rpm_installer.sh

LICENSE INFORMATION

The Programs listed below are licensed under the following License Information terms and conditions in addition to the Program license terms previously agreed to by Client and IBM. If Client does not have previously agreed to license terms in effect for the Program, the International Program License Agreement (Z125-3301-14) applies.

Program Name: IBM Big Replicate 2.0
Program Number: 5737-A55

As described in the International Program License Agreement ("IPLA") and this License Information, IBM grants Licensee a limited right to use the Program. This right is limited

Press Enter to continue viewing the license agreement, or enter "1" to accept the agreement, "2" to decline it, "3" to print it, or "99" to go back to the previous screen.
Enter " 1" to continue.
-
The installer will first perform a health check and confirm that there is sufficient Java heap to support an installation.
Installing Big Replicate
Verifying archive integrity... All good.
Uncompressing IBM Big Replicate........................

Welcome to the IBM Big Replicate installation
You are about to install IBM Big Replicate version 2.10-393

Do you want to continue with the installation? (Y/n)
Enter " Y" to continue.
-
The installer checks that both Perl and Java are installed on the system.
Checking prerequisites:

Checking for perl: OK
Checking for java: OK

INFO: Using the following Memory settings for the IBM Big Replicate Admin UI process:
INFO: -Xms128m -Xmx512m

Do you want to use these settings for the installation? (Y/n)
Enter " Y" or ``N'' if you wish to set different Java heap settings.
-
The installer asks you to confirm which TCP port will be used for accessing the Big Replicate web UI; the default is "8083".
Which port should the UI Server listen on? [8083]:

Please specify the appropriate platform from the list below:
[0] ibm-4.0
[1] ibm-4.1
[2] ibm-4.2

Which Big Replicate platform do you wish to use? 1
You chose ibm-4.2:3.2.1
Select from the available Hadoop packages.
-
Next, you set the system user and group for running the application.
We strongly advise against running Big Replicate as the root user.
For default CDH setups, the user should be set to 'hdfs'. However, you should choose a user appropriate for running HDFS commands on your system.

Which user should Big Replicate run as? [hdfs]
Checking 'hdfs' ...
 ... 'hdfs' found.

Please choose an appropriate group for your system. By default CDH uses the 'hdfs' group.

Which group should Big Replicate run as? [hdfs]
Checking 'hdfs' ...
 ... 'hdfs' found.
Press Enter to go with the default "hdfs".
-
You will now be shown a summary of the settings that you have provided so far:
Installing with the following settings:

User and Group: hdfs:hadoop
Hostname: ip-10-11-0-40
Big Replicate Admin UI Listening on: 0.0.0.0:8083
Big Replicate Admin UI Minimum Memory: 128
Big Replicate Admin UI Maximum memory: 512
Platform: ibm-4.1 (2.7.1)
Big Replicate Server Hostname and Port: ip-10-12-0-40:8082

Do you want to continue with the installation? (Y/n)
Enter " Y" unless you need to make changes to any of the settings.
-
The installation will now complete:
Adding the user hdfs to the hive group if the hive group is present.
Installing ibm-4.1 packages:
  fusion-hcfs-ibm-4.1-server-2.9_RC7.el6-1925.noarch.rpm ... Done
  fusion-hcfs-ibm-4.1-ihc-server-2.9_RC7.el6-1925.noarch.rpm ... Done
Installing plugin packages:
Installing fusion-ui-server package:
  fusion-ui-server-2.9-74.noarch.rpm ... Done
-
Once the installation has completed, you need to configure the Big Replicate server using the browser-based UI.
Starting fusion-ui-server: [ OK ]
Checking if the GUI is listening on port 8083: .....Done

Please visit http://your.hostname.server.com:8083/ to complete installation of IBM Big Replicate

If 'your.hostname.server.com' is internal or not available from your browser, replace this with an externally available address to access it.

Stopping fusion-ui-server:. [ OK ]
Starting fusion-ui-server: [ OK ]
Open a browser and enter the provided URL, or IP address.
-
Follow this section to complete the installation by configuring WD Fusion using a browser-based graphical user interface.
Silent Installation
For large deployments it may be worth using the Silent Installation option, detailed in the main Big Replicate installation guide.
Open a web browser and point it at the provided URL, e.g.
http://<YOUR-SERVER-ADDRESS>.com:8083/
-
In the first "Welcome" screen you're asked to choose between Create a new Zone and Add to an existing Zone.
Make your selection as follows:
- Adding a new Big Replicate cluster
-
Select Add Zone.
- Adding additional Big Replicate servers to an existing Big Replicate cluster
-
Select Add to an existing Zone.
High Availability for Big Replicate / IHC Servers
It's possible to enable High Availability in your Big Replicate cluster by adding additional Big Replicate/IHC servers to a zone. These additional nodes ensure that in the event of a system outage, there will remain sufficient Big Replicate/IHC servers running to maintain replication.
Add HA nodes to the cluster using the installer and choose Add to an existing Zone, using a new node name.
Configuration for High Availability
When setting up the configuration for a High Availability cluster, ensure that fs.defaultFS, located in core-site.xml, is not duplicated between zones. This property is used to determine whether an operation is being executed locally or remotely; if two separate zones have the same default file system address, problems will occur.
Big Replicate should never see the same URI (scheme + authority) for two different clusters.
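A minimal sketch of how to verify this: run the following on a node in each zone and confirm that the values differ (the core-site.xml path is typical; the example values are illustrative only):
# Show the default filesystem URI for this zone
grep -A1 'fs.defaultFS' /etc/hadoop/conf/core-site.xml
# e.g. zone 1: <value>hdfs://nameservice-one</value>, zone 2: <value>hdfs://nameservice-two</value>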
-
Run through the installer’s detailed Environment checks. For more details about exactly what is checked in this stage, see Environmental Checks in the Appendix.
-
On clicking Validate, the installer will run through a series of checks of your system's hardware and software setup and warn you if any of Big Replicate's prerequisites are missing.
Any element that fails the check should be addressed before you continue the installation. Warnings may be ignored for the purposes of completing the installation, especially if the installation is only for evaluation purposes and not for production. However, when installing for production, you should address all warnings, or at least take note of them and exercise due care if you continue the installation without resolving and revalidating.
-
Upload the license file.
-
The conditions of your license agreement will be presented in the top panel, including License Type, Expiry Date, Name Node Limit and Data Node Limit.
Click I agree to the EULA to continue, then click Next Step.
-
Enter settings for the Big Replicate server.
13.5.1. Big Replicate Server
- Fully Qualified Domain Name / IP
-
The full hostname for the server.
- We have detected the following hostname/IP addresses for this machine.
-
The installer will try to detect the server’s hostname from its network settings. Additional hostnames will be listed on a dropdown selector.
- DConE Port
-
TCP port used by Big Replicate for replicated traffic. Validation will check that the port is free and that it can be bound to.
- Big Replicate HTTP Policy Type
-
Sets the policy for communication with the Big Replicate Core Server API.
Select from one of the following policies:
Only HTTP - Big Replicate will not use SSL encryption on its API traffic.
Only HTTPS - Big Replicate will only use SSL encryption for API traffic.
Use HTTP and HTTPS - Big Replicate will use both encrypted and un-encrypted traffic.
Known Issue: Currently, the HTTP policy and SSL settings both independently alter how Big Replicate uses SSL, when they should be linked. You need to make sure that your HTTP policy selection and the use of SSL (enabled in the next section of the installer) are in sync. If you choose either of the policies that use HTTPS, then you must enable SSL. If you stick with "Only HTTP" then you must ensure that you do not enable SSL. In a future release these two settings will be linked, so it won't be possible to have contradictory settings.
- Big Replicate HTTP Server Port
-
The TCP port used for standard HTTP traffic. Validation checks whether the port is free and that it can be bound.
- Maximum Java heap size (GB)
-
Enter the maximum Java Heap value for the WD inter-Hadoop Communication server.
- Umask (currently 0022)
-
Set the default permissions applied to newly created files. The value 022 results in default directory permissions 755 and default file permissions 644. This ensures that the installation will be able to start up/restart.
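As a quick illustration of how a 0022 umask produces those modes (this is standard POSIX behaviour, not specific to Big Replicate):
umask                                                      # expect 0022
touch /tmp/example-file && stat -c '%a' /tmp/example-file  # expect 644
mkdir /tmp/example-dir && stat -c '%a' /tmp/example-dir    # expect 755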
Advanced options
Only apply these options if you fully understand what they do.
The following advanced options provide a number of low-level configuration settings that may be required for installation into certain environments. The incorrect application of some of these settings could cause serious problems, so we strongly recommend that you discuss their use with the support team before enabling them.
- Custom UI hostname
-
Lets you set a custom hostname for the Big Replicate UI, distinct from the communication.hostname which is already set as part of the install and used by Big Replicate nodes to connect to the Big Replicate server.
- Custom UI Port
-
Lets you change the Big Replicate UI's default port, in case it is already assigned elsewhere, e.g. Cloudera's headlamp debug server also uses it.
- Strict Recovery
-
See explanation of the Strict Recovery Advanced Options.
13.5.2. Enable SSL for Big Replicate
-
Tick the checkbox Enable SSL for Big Replicate.
- KeyStore Path
-
System file path to the keystore file, e.g. /opt/wandisco/ssl/keystore.ks
- KeyStore Password
-
Encrypted password for the KeyStore, e.g. ***********
- Key Alias
-
The alias of the private key, e.g. IBM
- Key Password
-
Encrypted password for the private key, e.g. ***********
- TrustStore Path
-
System file path to the TrustStore file, e.g. /opt/wandisco/ssl/keystore.ks
- TrustStore Password
-
Encrypted password for the TrustStore, e.g. ***********
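For reference, a minimal sketch of creating a keystore like the one referenced above, using the JDK's keytool. The alias, paths, passwords and distinguished name are illustrative only; production deployments should use certificates signed by your own CA:
# Generate a key pair in a new keystore (values are examples only)
keytool -genkeypair -alias IBM -keyalg RSA -keysize 2048 \
    -keystore /opt/wandisco/ssl/keystore.ks -storepass changeit -keypass changeit \
    -dname "CN=your.hostname.server.com, O=Example"
# Confirm the entry exists
keytool -list -keystore /opt/wandisco/ssl/keystore.ks -storepass changeit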
13.5.3. IHC Server
- Maximum Java heap size (GB)
-
Enter the maximum Java Heap value for the WD Inter-Hadoop Communication server.
- IHC network interface
-
The hostname for the IHC server.
13.5.4. Advanced Options (optional)
- IHC server binding address
-
In the advanced settings you can decide which address the IHC server will bind to. The address is optional; by default the IHC server binds to all interfaces (0.0.0.0), using the port specified in the ihc.server field.
In all cases the port should be identical to the port used in the ihc.server address, i.e. in
/etc/wandisco/fusion/ihc/server/cdh-5.4.0/2.6.0-cdh5.4.0.ihc
or
/etc/wandisco/fusion/ihc/server/localfs-2.7.0/2.7.0.ihc
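The .ihc file is a plain properties file, so a hedged way to confirm the two ports match is simply to read it; the example value below is illustrative:
# Show the IHC server address; the port is the part after the colon
grep 'ihc.server' /etc/wandisco/fusion/ihc/server/cdh-5.4.0/2.6.0-cdh5.4.0.ihc
# e.g. ihc.server=your.hostname.server.com:7000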
-
Once all settings have been entered, click Next step.
-
Next, you will enter the settings for your new Zone.
Zone Information
Entry fields for zone properties:
- Fully Qualified Domain Name
-
The full hostname for the server.
- Node ID
-
A unique identifier that will be used by Big Replicate UI to identify the server.
- Location Name (optional)
-
A location name that can quickly identify where the server is located.
Induction failure
If induction fails, attempting a fresh installation may be the most straightforward cure; however, it is possible to push through an induction manually, using the REST API. See Handling Induction Failure.
Known issue with Location names
You must use different Location names / Node IDs for each zone. If you use the same name for multiple zones then you will not be able to complete the induction between those nodes.
- DConE Port
-
TCP port used by Big Replicate for replicated traffic.
- Zone Name
-
The name used to identify the zone in which the server operates.
- Management Endpoint
-
Select the Hadoop manager that you are using, i.e. Cloudera Manager, Ambari or Pivotal HD. The selection will trigger the entry fields for your selected manager.
Advanced Options
Only apply these options if you fully understand what they do.
The following advanced options provide a number of low-level configuration settings that may be required for installation into certain environments. The incorrect application of some of these settings could cause serious problems, so we strongly recommend that you discuss their use with IBM's support team before enabling them.
13.5.5. URI Selection
The default behavior for Big Replicate is to fix all replication to the Hadoop Distributed File System / hdfs:/// URI. Setting the hdfs scheme provides the widest support for Hadoop client applications, since some applications can't support the available fusion:/// URI; they can only use the HDFS protocol.
Each option is explained below:
- Use HDFS URI with HDFS file system
-
This option is available for deployments where the Hadoop applications support neither the Big Replicate URI nor the HCFS standards. Big Replicate operates entirely within HDFS.
This configuration will not allow paths with the fusion:// URI to be used; only paths starting with hdfs:// or with no scheme that correspond to a mapped path will be replicated. The underlying file system will be an instance of the HDFS DistributedFileSystem, which will support applications that aren't written to the HCFS specification.
- Use Big Replicate URI with HCFS file system
-
When selected, you need to use fusion:// for all data that must be replicated over an instance of the Hadoop Compatible File System. If your deployment includes Hadoop applications that are either unable to support the Big Replicate URI or are not written to the HCFS specification, this option will not work.
MapR deployments
Use this URI selection if you are installing into a MapR cluster.
- Use Big Replicate URI with HDFS file system
-
This differs from the default in that while the Big Replicate URI is used to identify data to be replicated, the replication is performed using HDFS itself. This option should be used if you are deploying applications that can support the Big Replicate URI but not the Hadoop Compatible File System.
- Use Big Replicate URI and HDFS URI with HDFS file system
-
This mixed mode supports all the replication schemes (fusion://, hdfs:// and no scheme) and uses HDFS for the underlying file system, to support applications that aren't written to the HCFS specification.
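To make the difference concrete, here is a hedged sketch of how replicated paths are addressed under the different selections; the paths and hostname are placeholders:
# Big Replicate URI modes address replicated data through the fusion scheme
hadoop fs -ls fusion:///repl1/
# HDFS URI modes use ordinary hdfs:// paths (or no scheme) under a mapped location
hadoop fs -ls hdfs://namenode.example.com:8020/repl1/
hadoop fs -ls /repl1/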
13.5.6. Big Replicate Server API Port
This option lets you select the TCP port that is used for Big Replicate’s API.
13.5.7. Strict Recovery
Two advanced options are provided to change the way that Big Replicate responds to a system shutdown where Big Replicate was not shut down cleanly. Currently the default setting is to not enforce a panic event in the logs if, during startup, we detect that Big Replicate wasn't shut down cleanly. This is suitable when using the product as part of an evaluation effort. However, when operating in a production environment, you may prefer to enforce the panic event, which will stop any attempted restarts to prevent possible corruption to the database.
- DConE panic if dirty (checkbox)
-
This option lets you enable the strict recovery option for IBM’s replication engine, to ensure that any corruption to its prevayler database doesn’t lead to further problems. When the checkbox is ticked, Big Replicate will log a panic message whenever Big Replicate is not properly shutdown, either due to a system or application problem.
- App Integration panic if dirty (checkbox)
-
This option lets you enable the strict recovery option for Big Replicate’s database, to ensure that any corruption to its internal database doesn’t lead to further problems. When the checkbox is ticked, WD Big Replicate will log a panic message whenever Big Replicate is not properly shutdown, either due to a system or application problem.
13.5.8. <Hadoop Manager e.g. Ambari> Configuration
This section configures Big Replicate to interact with the
management layer, which could be Ambari or Cloudera Manager, etc.
- Manager Host Name /IP
-
The full hostname or IP address for the working server that hosts the Hadoop manager.
- Port
-
TCP port on which the Hadoop manager is running.
- Username
-
The username of the account that runs the Hadoop manager.
- Password
-
The password that corresponds with the above username.
- SSL
-
(Checkbox) Tick the SSL checkbox to use https in your Manager Host Name and Port. You may be prompted to update the port if you enable SSL but don't update from the default HTTP port.
Authentication without a management layer
Big Replicate normally uses the authentication built into your cluster's management layer, i.e. the Cloudera Manager username and password are required to log in to Big Replicate. However, in cloud-based deployments, such as Amazon's S3, there is no management layer. In this situation, WD Big Replicate adds a local user to Big Replicate's ui.properties file, either during the silent installation or through the command line during an installation.
-
Enter security details, if applicable to your deployment.
Kerberos Configuration
In this step you also set the configuration for an existing Kerberos setup. If you are installing into a Kerberized cluster, include the following configuration.
Enabling Kerberos authentication on Big Replicate's REST API
When a user has enabled Kerberos authentication on their REST API, they must kinit before making REST calls, and enable GSS-Negotiate authentication. To do this with curl, the user must include the --negotiate and -u: options, like so:
curl --negotiate -u: -X GET "http://${HOSTNAME}:8082/fusion/fs/transfers"
See Setting up Kerberos for more information about Kerberos setup.
-
Click Validate to confirm that your settings are valid. Once validated, click Next step.
-
The remaining panels in step 6 detail all of the installation settings. All your license, Big Replicate server, IHC server and zone settings are shown. If you spot anything that needs to be changed, you can use the go back link.
Once you are happy with the settings and all your Big Replicate clients are installed, click Deploy Big Replicate Server.
-
In the next step you must complete the installation of the Big Replicate client package on all the existing HDFS client machines in the cluster. The Big Replicate client is required to support the ingestion of data to WD Big Replicate nodes.
client package location
You can find them in your installation directory, here:
/opt/wandisco/fusion-ui-server/ui/client_packages
/opt/wandisco/fusion-ui-server/ui/stack_packages
/opt/wandisco/fusion-ui-server/ui/parcel_packages
Important! If you are installing on Ambari 1.7 or CDH 5.3.x
Additionally, due to a bug in Ambari 1.7, and an issue with the classpath in CDH 5.3.x, before you can continue you must log into Ambari/Cloudera Manager and complete a restart of HDFS, in order to re-apply Big Replicate's client configuration.
-
We now handle the configuration of the Hive Metastore Plugin, which is now integrated into Big Replicate rather than being handled in a separate post-installation step.
The installer performs some basic validation, checking the following criteria:
- Manager Validation
-
Checks that the system is being configured with valid distribution manager support. In this example, "AMBARI" should be detected. Should this validation check fail, check that you have entered the right Manager details in Step 5.
- Hive Service installed Validation
-
The installer will check that Hive is running on the server. Should the validation check fail, you should check that Hive is running.
13.5.9. Configuration
During the installation you need to enter the following properties:
- Hive Metastore host
-
The hostname for the Hive Metastore service.
Known Issue: You must provide a hostname, not an IP address. Currently, an IP address is not enough to verify the presence of the service. We will add support for IP addresses once we have identified a workable method for validating it with the manager.
- Hive Metastore port
-
The TCP port that will be used by the Hive Metastore service. Default: 9084
-
In this step you need to copy over and unpack the Hive services to the service directory of your Ambari installation.
If you check Ambari, providing the new packages are correctly put in place, you will see them listed.
Important: You should see that the package for WD Hive Metastore is now listed in Ambari. Do NOT enable the package at this time; WD Hive Metastore will be installed through steps that appear later.
-
At the end of this step, we address a possible problem that you may have in connecting Big Replicate to a remote Hive Metastore database. Please note that the following MySQL query is only applicable to Ambari/IBM Big Replicate installations.
GRANT ALL PRIVILEGES ON *.* TO 'hive'@'<HOSTNAME-FOR-HIVE-METASTORE-SERVICE-NODE>' IDENTIFIED BY '<hive database password>' WITH GRANT OPTION;
-
The next step handles the plugin’s installation:
When you have confirmed that the stack files are in place, on the installer screen, click Next.
13.5.10. Summary
The summary confirms the values of the entries you provided in the first sub-step of the IBM Hive Metastore installation section.
-
To begin the installation of the Plugin, click Start Install.
Ambari-based installation
The following notes explain what is happening during each phase of the installation into an Ambari-based cluster:
- Metastore Service Install
-
This step handles the installation of the WD Hive Metastore Service into Ambari.
- Hive Metastore Template Install
-
Install the IBM Hive Metastore Service Template on Ambari.
- Update Hive Configuration
-
Updates the URIs for Hive connections in Ambari.
- Configure Hive Configuration Files
-
Symlink the Hive configuration file into the Big Replicate Hive Metastore plugin.
- Restart Hive Service
-
Restarts Hive Service in Ambari. Note this process can take several minutes to complete. Please don’t make any changes or refresh your installer’s browser session.
- Restart WD Hive Metastore Service
-
Restarts Hive Metastore Service in Ambari. Note this process can take several minutes to complete.
- Restart WD HiveServer2 Service
-
Restart HiveServer2 Service in Ambari. Note this process can take several minutes to complete.
13.5.11. Cloudera-based installation
The following notes explain what is happening during each phase of the installation into a CDH-based cluster:
- Big Replicate Hive parcel distribution and activation
-
Distribute and activate Big Replicate Hive parcels.
- Hive-site Setup
-
Retrieve and set up hive-site.xml for use with WD Fusion.
- Big Replicate Hive service descriptor
-
Install Big Replicate Hive service descriptor.
Known Issue: Cloudera-based deployments only
When installing the Hive Metastore plugin, you must create the folder /etc/wandisco/hive on the Metastore host you specified above. This folder must have owner hive:hive and the Hive user must be able to access and write to this location.
- Big Replicate Hive service setup
-
Install Big Replicate Hive service.
- Cloudera metastore configuration
-
Configuring Cloudera to use Big Replicate Hive metastore.
- Recursive
-
Repair the selected context and all its children.
- Add Missing
-
Add missing data to the destination zone that exists at the "source of truth" zone.
- Remove Extra
-
Remove data from the destination zone that does not exist at the "source of truth" zone.
- Update Different
-
Update existing data at the destination zone to match the "source of truth" zone.
-
After clicking on the Repair button, the repair process will start.
Monitor Repair Status
-
Check on the status of ongoing repairs by using the Repair Status REST API call, for example with the sketch below.
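A hedged sketch of such a call; the resource path follows the pattern of the transfers call shown earlier in this section, but check the REST API reference for the exact Repair Status resource in your release:
# Query the repair status on a Big Replicate node (kinit first on a Kerberized cluster)
curl --negotiate -u: -X GET "http://${HOSTNAME}:8082/fusion/fs/repairs"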
Known Issue with failed consistency repairs (FUI-3272)
There is a known issue that may appear if Hive contains two indexes that have the same name and that apply to a column of the same name (even though the indexes apply to different tables in different databases). In this scenario, repairs will fail. The cause of the issue is related to the ordering of actions. You can work around the issue using the following procedure.
-
If you need to repair a database that contains Index tables then you will need to repair them in stages as follows:
-
Repair the database with recursive unset.
-
Repair the Index tables.
-
Repair the parent tables with recursive enabled to repair any partitions and indexes.
13.5.12. Installing on a Kerberized cluster
The Installer lets you configure Big Replicate to use your platform’s Kerberos implementation. You can find supporting information about how Big Replicate handles Kerberos in the Admin Guide, see Setting up Kerberos.
13.5.13. Configuring Kerberos principals for Hive Metastore plugin
You need to configure Kerberos principals for the wd-hive-metastore and hive Big Replicate plugin to use. All these steps need to be carried out with reference to the host where the wd-hive-metastore and Big Replicate services are running.
For reference
See Cloudera’s documentation on
Create and Deploy the Kerberos Principals and Keytab Files.
13.5.14. Procedure
-
Replace fusion-server.wandisco.com with the actual FQDN of your wd-hive-metastore host.
-
Login to kadmin.local or kadmin on the host machine running wd-hive-metastore.
-
In kadmin use:
addprinc -randkey hive/fusion-server.wandisco.com@WANDISCO.HADOOP
-
In kadmin use:
addprinc -randkey HTTP/fusion-server.wandisco.com@WANDISCO.HADOOP
-
In kadmin use
xst -k hive.service.keytab hive/fusion-server.wandisco.com@WANDISCO.HADOOP HTTP/fusion-server.wandisco.com@WANDISCO.HADOOP
-
Exit kadmin
-
Check the keytab has the correct entries by using
klist -e -k -t hive.service.keytab
-
Use
sudo mv hive.service.keytab /etc/security/keytabs/
-
Make sure the keytab is readable by the hive user by using
sudo chown hive:hadoop /etc/wandisco/hive.service.keytab
sudo chmod +r /etc/wandisco/hive.service.keytab
-
Now restart Big Replicate server using service fusion-server restart.
-
Restart the wd-hive-metastore service via CM.
-
Restart the hiveServer2 service via CM.
-
Reconnect to beeline. Remember that you need to perform a kinit before starting beeline, using that node's keytab and hive principal (see the sketch after this step). You may also need to change the following in hive-site.xml:
<property>
  <name>hive.metastore.kerberos.keytab.file</name>
  <value>hive.service.keytab</value>
</property>
to
<property>
  <name>hive.metastore.kerberos.keytab.file</name>
  <value>/etc/wandisco/hive/hive.service.keytab</value>
</property>
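A minimal sketch of that kinit, using the principal created earlier in this procedure and the keytab path from the hive-site.xml value above:
# Obtain a ticket from the keytab, then start beeline
kinit -kt /etc/wandisco/hive/hive.service.keytab hive/fusion-server.wandisco.com@WANDISCO.HADOOP
beeline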
All connections using beeline should use the same connection string regardless of the node that is being used - always use your hiveserver2 host’s FQDN, e.g.:
!connect jdbc:hive2://your.server.url:10000/default;principal=hive/principle.server.com@WANDISCO.HADOOP
even if you are connecting on the principal server itself.
13.5.15. High Availability with Hive
It’s possible to set up High Availability in a Fusion-Hive deployment. For a basic setup, use the following procedure:
Install on your first node
-
Follow the instructions for a regular Fusion-Hive installation provided in the installation guide - Installation
Install on your second node:
-
Select Add to existing zone (give the address of the first Big Replicate node).
-
Continue with the installation as you did on your first Big Replicate node until you reach the Hive setup screens.
-
On the first Hive screen, add the address of the Metastore server associated with your first Big Replicate node (this will require changing the default) and click Update.
-
Proceed to the next screen.
-
Skip the next screen as the Big Replicate-Hive stacks / parcels are already installed.
-
Transfer /etc/wandisco/fusion/server/hive-site.xml from your first Big Replicate node to the same path on this node (see the example after this procedure).
-
Click the Install button to launch the installation process.
-
When prompted, proceed to the end of the UI installer.
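For the transfer step above, a minimal sketch using scp; the hostname is a placeholder for your first Big Replicate node:
# Copy hive-site.xml from the first node into place on this node
scp root@first-node.example.com:/etc/wandisco/fusion/server/hive-site.xml /etc/wandisco/fusion/server/hive-site.xml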
13.6. Technical Glossary
13.6.1. Hive
Hive is a Hadoop-specific data warehouse component. It provides facilities to abstract a structured representation of data in Hadoop’s file system. This structure is presented as databases containing tables that are split into partitions. Hive can prescribe structure onto existing Hadoop data, or it can be used to create and manage that data.
It uses an architecture that includes a ''metastore'', which provides the interface for accessing all metadata for Hive tables and partitions. The metastore is the component that persists the structure information of the various tables and partitions in the warehouse, including column and column type information, the serializers and deserializers necessary to read and write data and the location of any corresponding Hadoop files where the data is stored.
Hive offers a range of options for the deployment of a metastore:
As a metastore database:
-
Local/embedded metastore database (Derby)
-
Remote metastore database
As a metastore server:
-
Local/embedded metastore server
-
Remote metastore server
In remote mode, the metastore server is a Thrift service. In embedded mode, the Hive client connects directly to the underlying database using JDBC. Embedded mode supports only a single client session, so it is not normally used for multi-user production environments.
IBM’s implementation of a replicated Hive metastore supports deployments that use a remote metastore server. As tools exist that use interfaces to the metastore other than the thrift interface, the implementation does not just proxy that interface.
13.6.2. IBM Hive Metastore
The IBM Hive Metastore can act as a replacement or complement for the standard Hive Metastore, and provides two components:
-
A plugin for Big Replicate that allows for the coordination of Hive Metastore activities.
-
A replacement Hive Metastore implementation that delegates the coordination of activities to the plugin in order that they can be performed in a consistent manner across multiple deployments of the Metastore.
The resulting system ensures that Hive metadata can be made consistent across multiple Hadoop clusters, and by performing that coordination in conjunction with actions performed against the Hadoop file system, also ensures that this consistency applies to Hive-resident metadata and any corresponding files where Hive table/partition data is stored.
The following diagram provides a simplified view of how IBM’s Hive Metastore plugin interfaces between your Hive deployment and Big Replicate.
The IBM Hive Metastore (NsHive Metastore in the diagram above) can replace the standard Hive Metastore from the Hadoop distribution, or run alongside that Metastore. It provides all the functionality of the standard Hive Metastore, but adds interaction with IBM Big Replicate when coordination and replication is required (i.e. for activities that result in writes against the metadata database used by Hive). Different versions of the Hive Metastore are supported.
13.6.3. Hive Metastore replication in a nutshell
-
IBM runs its own Metastore server instance that replaces the default server.
-
IBM only replicates write operations against the metastore database.
-
The WD Hive Metastore Plugin sends proposals into the Big Replicate core.
-
Big Replicate uses the Hive Metastore plugin to communicate directly with the metastore database.
13.6.4. Overview of Target Release
The IBM Hive Metastore will provide functionality for the replication of Hive metadata and underlying table data as a plugin to Big Replicate 2.8 or later.
This document describes the plugin’s functionality, behaviour and user experience.
13.6.5. Core Use Case
The functionality will address the core use case of interacting with Hive as a data warehouse technology in environments where active-active replication of Hive information is required, including underlying table data. Changes made to Hive metadata and data are replicated between multiple participating Hadoop clusters.
13.6.6. Restrictions
Table data location
All files that hold the data for a given Hive table need to exist under a single root directory that can be replicated using Big Replicate. This is because there is a need for a single distributed state machine (DSM) to coordinate the activities of metadata changes with underlying table/partition content.
This limitation may be removed if/when Big Replicate adds the ability to coordinate multiple otherwise independent DSMs.
Replacement of Metastore implementation
In order to coordinate all actions across multiple Metastore servers, the current solution replaces each standard Metastore server with a minimally-modified version of it, as provided by IBM. If selected portions of Hive metadata require replication, the IBM Hive Metastore can operate in addition to the standard Metastore server.
Metastore Version
The Hive Metastore differs in implementation across versions. The IBM Hive Metastore provides versions to match Hive 0.13, 0.14, 1.1.0 and 1.2.1.
The Metastore version used by all participating replicated instances must match. Future versions of the Replicated Hive Metastore may allow replication between different versions of Hive.
Functionality Not Addressed
The following functionality does not exist in the 1.0 release of the IBM Hive Metastore:
-
Hive transactions are not supported.
-
Hive SQL Standard Based Authorization (which provides column-level access control granularity) is not supported, because this mode requires that HiveServer2 runs an embedded metastore.
-
Replication between different versions of Hive is not supported (some combinations might work, but would need to be specifically tested).
-
Table directories have to be under the database directory (or at least in the same replicated directory as the database).
-
Limitations related to known Hive issues:
-
HIVE-2573: Create Function is not replicated
-
HIVE-10719: Alter rename table does not rename the table
13.6.7. Metadata Replication
Changes made to Hive metadata are replicated between multiple participating Hadoop clusters, by coordinating all write operations that will affect the metastore database, and ensuring that these operations are performed in a consistent manner across all IBM Hive Metastore instances within a Big Replicate membership.
13.6.8. Consistency Check and Repair
The functionality of consistency check and repair provides the means to:
-
Determine if there are any inconsistencies between the hive metastore data across all zones for any given replicated folder/location that maps to a database or table within the hive metastore.
-
Identify which metastore data is inconsistent, i.e. which tables are missing from a database or which columns are different/missing in a table.
-
Allow the user to select a zone as the source of truth and then repair the metastore data based on that version of the metastore data.
Assumptions made for the operation of this feature include:
-
This feature will only cover checking and repairing the hive metastore data and not any inconsistencies in the data within the replicated folder. This will be the responsibility of the Big Replicate Server.
-
The initial version of this feature may not rely on any Big Replicate ADK framework for consistency check and repair.
-
The feature will at some point need to be migrated to the Big Replicate ADK consistency check and repair framework, when available.
Key facilities of this feature are:
-
Provide the ability to request and return the current version of the database metadata for a particular location/replicated folder from each node within the membership of the supplied replicated folder / location.
-
Provide the ability to compare the current database metadata of all the nodes/zones and to create a list of inconsistencies, if any. This list of inconsistencies will need to be displayed to the user in the same way as inconsistencies in the files / sub folders of a replicated folder are currently displayed.
-
Provide the ability to accept a "source of truth" for each inconsistency and the ability to propose to change the data on all other nodes to match this "source of truth". Provide support for "bulk" proposals so that all inconsistencies can be repaired via a single proposal if they share the same "source of truth".
-
Provide the ability to manage the CC&R process. This could be replaced by functionality in the Big Replicate Plugin ADK at a future date.
13.7. Testing
Here are some examples for testing basic functionality of the WD Hive Metastore. They cover connection, creation of a replicated database, population of temporary table data, populating partitions of a table, and creating inconsistent data to test consistency check and repair functionality.
13.8. How to start beeline and connect to hive
You can use the hdfs user to prevent any permission issues:
-
As the hdfs user, start beeline on the master node, i.e. vm0:
beeline
-
Connect to the hive metastore using the following command:
!connect jdbc:hive2://hiveserver2_host:10000 hdfs
-
You don’t need a password here so press enter.
13.9. How to create a replicated database within Hive
-
Using the hdfs user, create a new HDFS folder on both clusters as a home for your test databases:
hdfs dfs -mkdir -p /hive/databases
-
Within the Big Replicate UI create a new replicated folder for /hive/databases.
-
Start beeline and connect as above.
-
To create your new test database enter the following command.
CREATE DATABASE test_01 LOCATION '/hive/databases/test_01';
where test_01 is the database name you want to use.
-
To check the database has been created and replicated, connect to beeline on the master node of the other cluster using the instructions above, making sure to use the correct hiveserver2_host for that cluster. Then on both clusters use:
SHOW DATABASES;
This should display the default database and the new database you just created.
13.9.1. How to create and populate the temporary table used for the other tests
-
This example assumes that you have a test data file containing a single string per line, placed in
/usr/local/share/installers/Batting.csv.
-
Start beeline and connect as above if you are not already connected.
-
Set the test database you want to use for this test data.
USE test_01;
-
Create the temporary table for the batting data
create table temp_batting (col_value STRING);
-
Now load the test data into the temp_batting table:
LOAD DATA LOCAL INPATH '/usr/local/share/installers/Batting.csv' OVERWRITE INTO TABLE temp_batting;
This should replicate the data to the second cluster for you. Replace the path with the location of your uploaded Batting.csv file.
-
To see the loaded data you can use
SELECT * FROM temp_batting LIMIT 100;
How to create and populate a partitioned table with data from the above temp_batting table
-
Start beeline and connect as above if you are not already connected.
-
Set the test database you want to use for this test table using
USE test_01;
-
Create the new empty table partitioned by year
create table batting (player_id STRING,runs INT) PARTITIONED BY(year INT);
-
Now load the new table with data from the temp_batting table by
insert overwrite table batting PARTITION(year) SELECT regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id, regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) run, regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) year from temp_batting;
-
The above step may take a little while because it has to run a MapReduce job to process the data.
-
To see that the table has been populated with data run
SELECT * FROM batting WHERE year='2000' LIMIT 100;
13.9.2. How to create inconsistent or missing data for testing CC and repair
-
Create a new folder in HDFS for the location of your database on both clusters.
hdfs dfs -mkdir /testing
Warning: Do not add this folder to Big Replicate as a replicated folder.
-
On one of the clusters, connect to beeline and create your test database:
CREATE DATABASE testing_01 LOCATION '/testing/testing_01';
-
Select this new database
USE testing_01;
-
Create a table within this database
create table temp_batting (col_value STRING);
-
Now load the test data into the temp_batting table
LOAD DATA LOCAL INPATH '/usr/local/share/installers/Batting.csv' OVERWRITE INTO TABLE temp_batting;
Replace the path with the location of your Batting.csv file.
-
Create the new empty table
create table batting (player_id STRING,runs INT, year INT);
-
Now load the new table with data from the temp_batting table
insert overwrite table batting SELECT regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id, regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) run, regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) year from temp_batting;
-
Now add the /testing folder to Big Replicate as a replicated folder with the same membership of the two zones you created earlier.
-
Both the hdfs and the hive metastore data will be inconsistent so you will be able to test consistency check and repair functionality.
-
To create inconsistent data, create the database and tables with whatever changes you want to make them inconsistent before adding the location of the database to Big Replicate.
14. Removing the Hive Plugin from deployment
The following procedure shows how you can remove the Hive plugin from your cluster.
-
Configuration in Ambari must be updated in the Hive service as follows:
-
Section Advanced hive-site, property fs.file.impl.disable.cache: ensure the value is "true".
-
Section Advanced hive-site, property fs.hdfs.impl.disable.cache: ensure the value is "true".
-
Section Advanced hive-site, property hive.metastore.uris: return this to the URI of the original metastore hosts. Consult a prior configuration version to obtain the values.
Do not revert configuration: if you revert the config you will lose any other configuration changes made post-deployment, not just those listed.
-
Look in all other sections of the Hive service for instances of hive.metastore.uris. Remove the property from all sections other than Advanced hive-site (which must remain).
-
Within the Custom hiveserver2-site add a property named
wd.hiveserver2.template.revert
and put any value in it. The word true will be as good as anything.
-
Save the above configuration changes.
-
Stop the WD Hive Metastore if it is co-located on the same node as the base metastore.
-
Deploy the configuration changes in Ambari, restarting services in Hive.
-
Restart the WD Hiveserver2 Template service in Ambari. Note, the service will show offline after restarting and may mark itself as having failed the restart. This is normal.
Required: This step is required, but the outcome may now differ due to changes in the stack.
-
Restart all Hiveserver2 processes. This is needed so they start again using the embedded metastore (step 5 is what triggers this).
-
WD Hive components will now be stopped; removal can occur from the Ambari API.
Do not remove the Hiveserver2 template
Note: You should not remove the Hiveserver2 template if Hiveserver2 is still using the external metastore.
# Stop the WD Metastore
curl -u admin:admin -H "X-Requested-By: ambari" -X PUT -d '{"RequestInfo":{"context":"Stop Service"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' http://AMBARI_SERVER_HOST:8080/api/v1/clusters/c1/services/WD-HIVE-METASTORE_MASTER
# Stop the Hiveserver2 template
curl -u admin:admin -H "X-Requested-By: ambari" -X PUT -d '{"RequestInfo":{"context":"Stop Service"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' http://AMBARI_SERVER_HOST:8080/api/v1/clusters/c1/services/WD-HIVESERVER2_MASTER
# Remove the WD Metastore - MUST BE REMOVED HERE - the Hiveserver2 Template depends on the metastore
curl -u admin:admin -H "X-Requested-By: ambari" -X DELETE http://AMBARI_SERVER_HOST:8080/api/v1/clusters/c1/services/WD-HIVE-METASTORE_MASTER
# Remove the Hiveserver2 template
curl -u admin:admin -H "X-Requested-By: ambari" -X DELETE http://AMBARI_SERVER_HOST:8080/api/v1/clusters/c1/services/WD-HIVESERVER2_MASTER
-
Remove the WD Metastore from the host it was on.
rpm -e wd-hive-metastore-ibm-4.1-2.9.3-595.noarch
Use the correct package name; you can find it with rpm -qa | grep wd-hive-metastore.
-
Remove the Metastore plugin from the Big Replicate nodes.
rpm -e wd-hive-plugin-ibm-4.1-2.9.3-595.noarch
Use the correct package name; you can find it with rpm -qa | grep wd-hive-plugin.
-
All other Big Replicate server, IHC, UI and Client removal steps are now valid.
15. Reference
The reference guide walks through Big Replicate's various screens, providing a basic explanation of what everything does. For specific instruction on how to perform a particular task, see the Admin Guide.
15.1. Big Replicate Architecture
15.1.1. Big Replicate Example Workflow
The following diagram presents a simplified workflow for Big Replicate, which illustrates a basic use case and points to how WANdisco’s distributed coordination engine (DConE) is implemented to overcome the challenges of coordination.
-
User makes a request to create or change a file on the cluster.
-
Big Replicate coordinates File Open to the external cluster.
-
File is added to underlying storage.
-
Big Replicate coordinates at configurable write increments and File Close with other clusters (see membership).
-
Big Replicate server at remote cluster pulls data from IHC server on source cluster.
-
Big Replicate server at remote site writes data to its local cluster.
15.2. What is IBM Big Replicate
IBM Big Replicate (Big Replicate) shares data between two or more clusters. Shared data is replicated between clusters using DConE, WANdisco's proprietary coordination engine. This isn't a spin on mirroring data: every cluster can write into the shared data directories, and the resulting changes are coordinated in real time between clusters.
15.3. 100% Reliability
Paxos-based algorithms enable DConE to continue to replicate even after brief network outages; data changes will automatically catch up once connectivity between clusters is restored.
Below the coordination stream, actual data transfer is done as an asynchronous background process and doesn’t consume MapReduce resources.
15.4. Replication where and when you need
Big Replicate supports selective replication, where you control which data is replicated to particular clusters, based on your security or data management policies. Data can be replicated globally, making it available to every cluster, or to just a single cluster.
15.5. The Benefits of IBM Big Replicate
-
Ingest data to any cluster, sharing it quickly and reliably with other clusters. This removes fragile data transfer bottlenecks, and lets you process data at multiple places, improving performance and getting more utilization from backup clusters.
-
Support a bimodal or multimodal architecture to enable innovation without jeopardizing SLAs. Perform different stages of the processing pipeline on the best cluster. Need a dedicated high-memory cluster for in-memory analytics? Or want to take advantage of an elastic scale-out on a cheaper cloud environment? Got a legacy application that’s locked to a specific version of Hadoop? IBM Big Replicate has the connections to make it happen. And unlike batch data transfer tools, IBM Big Replicate provides fully consistent data that can be read and written from any site.
-
Put away the emergency pager. If you lose data on one cluster, or even an entire cluster, IBM Big Replicate has made sure that you have consistent copies of the data at other locations.
-
Set up security tiers to isolate sensitive data on secure clusters, or keep data local to its country of origin.
-
Perform risk-free migrations. Stand up a new cluster and seamlessly share data using IBM Big Replicate. Then migrate applications and users at your leisure, and retire the old cluster whenever you’re ready.
15.6. A Primer on Paxos
Replication networks are composed of a number of nodes, each node takes on one of a number of roles:
15.6.1. Acceptors (A)
The Acceptors act as the gatekeepers for state change and are collected into groups called Quorums. For any proposal to be accepted, it must be sent to a Quorum of Acceptors, and it will be ignored unless it is accepted by each Acceptor in the Quorum.
15.6.2. Proposers (P)
Proposer nodes are responsible for proposing changes, via client requests, and aim to receive agreement from a majority of Acceptors.
15.6.3. Learners (L)
Learners handle the actual work of replication. Once a Client request has been agreed on by a Quorum the Learner may take the action, such as executing a request and sending a response to the client. Adding more learner nodes will improve availability for the processing.
15.6.4. Distinguished Node
It’s common for a Quorum to be a majority of participating Acceptors. However, if there’s an even number of nodes within a Quorum this introduces a problem: the possibility that a vote may tie. To handle this scenario a special type of Acceptor is available, called a Distinguished Node. This machine gets a slightly larger vote so that it can break 50/50 ties.
15.7. Paxos Node Roles in DConE
When setting up your Big Replicate servers they’ll all be Acceptors, Proposers and Learners. In a future version of the product you’ll then be able to modify each Big Replicate server’s role to balance between resilience and performance, or to remove any risk of a tied vote.
15.8. Creating resilient Memberships
Big Replicate is able to maintain HDFS replication even after the loss of WD Big Replicate nodes from a cluster. However, there are some configuration rules that are worth considering:
15.8.1. Rule 1: Understand Learners and Acceptors
The unique Active-Active replication technology used by Big Replicate is an evolution of the Paxos algorithm, as such we use some Paxos concepts which are useful to understand:
-
Learners:
Learners are the Big Replicate nodes that are involved in the actual replication of Namespace data. When changes are made to HDFS metadata these nodes raise a proposal for the changes to be made on all the other copies of the filesystem space on the other data centers running WD Big Replicate within the membership.
Learner nodes are required for the actual storage and replication of HDFS data. You need a learner node wherever you need to store a copy of the shared HDFS data.
-
Acceptors:
All changes being made in the replicated space at each data center must be made in exactly the same order. This is a crucial requirement for maintaining synchronization. Acceptors are nodes that take part in the vote for the order in which proposals are played out.
Acceptor nodes are required for keeping replication going. You need enough Acceptors to ensure that agreement over proposal ordering can always be met, even after accounting for possible node loss. For configurations where there is an even number of Acceptors, it is possible that voting could become tied. For this reason it is possible to make an Acceptor node into a tie-breaker, which has slightly more voting power so that it can outvote another single Acceptor node.
15.8.2. Rule 2: Replication groups should have a minimum membership of three learner nodes
Two-node clusters (running two Big Replicate servers) are not fault tolerant; you should strive to replicate according to the following guideline:
-
The number of Learner nodes required to survive the loss of N nodes is 2N+1.
So, to survive the loss of a single Big Replicate server equipped data center, you need a minimum of 2x1+1 = 3 nodes.
To keep on replicating after losing a second node, you need 5 nodes.
15.8.3. Rule 3: Learner Population - resilience vs rightness
-
During the installation of each of your nodes you may configure the Content Node Count number. This is the number of other Learner nodes in the replication group that need to receive the content for a proposal before the proposal can be submitted for agreement.
Setting this number to 1 ensures that replication won’t halt if some nodes are behind and have not yet received replicated content. This strategy reduces the chance that a temporary outage or a heavily loaded node will stop replication; however, it also increases the risk that namenode data will go out of sync (requiring admin intervention) in the event of an outage.
15.8.4. Rule 4: 2 nodes per site provides resilience and performance benefits
Running with two nodes per site provides three important advantages:
-
Firstly, it provides every site with a local hot-backup of the namenode data.
-
Secondly, it enables a site to load-balance namenode access between the nodes, which can improve performance during times of heavy usage.
-
Thirdly, providing the nodes are Acceptors, it increases the population of nodes that can form agreement and improves resilience for replication.
15.9. Replication Frequently Asked Questions
15.9.1. What stops a file replication between zones from failing if an operation such as a file name change is done on a file that is still transferring to another zone?
Operations such as a rename affect only metadata. So long as the file’s underlying data isn’t changed, the operation to transfer the file will complete; only then will the rename operation play out. When you start reading a file for the first time you acquire all the block locations necessary to fulfil the read, so at that point metadata changes won’t halt the transfer of the file to another zone.
15.10. Agreement recovery in Big Replicate
This section explains why, when monitoring replication recovery, you may see a brief delay and seemingly out-of-order delivery of proposals at the catching-up node.
If the WAN link between clusters is temporarily dropped, you may notice that when the link returns there’s a brief delay before the reconnected zones are back in sync, and it may appear that recovery is happening with agreements being made out of order, in terms of the global sequence numbers (GSNs) associated with each agreement.
This behaviour can be explained as follows:
-
The "non-writer" nodes review the GSNs to determine which agreements the current writer has processed and which agreements they can remove from their own store, where they are kept in case the writer node fails and they have to take over.
-
When a new writer is elected, the presence/absence of a particular GSN tells the new writer which agreements can be skipped. There may be gaps in this sequence as not all proposals are filesystem operations. For example, writer and leader election proposals are not filesystem operations, therefore their GSNs are not written to the underlying filesystem.
15.10.1. Why are proposals seemingly being delivered out-of-order?
This is related to why you will see GSNs written "out-of-order" in the filesystem. Internally, Big Replicate processes "non-interfering" agreements in parallel so that throughput increases and the global sequence is not blocked on operations that may take a long time, such as a large file copy.
15.10.2. Example
Consider the following global sequence, where /repl1 is the replicated directory:
1. Copy 10TB file to /repl1/dir1/file1
2. Copy 10TB file to /repl1/dir2/file1
3. Chown /repl1/dir1
Agreements 1 and 2 may be executed in parallel since they do not interfere with one another. However, agreement 3 must wait for agreement 1 to complete before it can be applied to the filesystem. If agreement 2 completes before 1, its GSN will be recorded before the preceding agreement’s and look on the surface like out-of-order delivery of GSNs.
15.10.3. Under the hood
DConE’s Output Proposal Sequence (OPS) delivers agreed values in strict sequence, one-at-a-time, to an application. Applying these values to the application state in the sequence delivered by the OPS ensures the state is consistent with other replicas at that point in the sequence. However, an optimization can be made: if two or more values do not interfere with one another (see below for definition of 'interfere') they may be applied in parallel without adverse effects. This parallelization has several benefits, for example:
-
It may increase the rate of agreed values applied to the application state if there are many non-interfering agreements;
-
It avoids an agreement that takes a long time to complete (such as a large file transfer) from blocking later agreements that aren’t dependent on that agreement having completed.
16. Big Replicate Configuration
This section lists the available configuration for Big Replicate’s component applications. You should take care when making any configuration changes on your clusters.
16.1. Big Replicate Server
Big Replicate server configuration is stored in two files, the first of which is:
/etc/wandisco/fusion/server/application.properties
Property | Description | Permitted Values | Default | Checked at… |
---|---|---|---|---|
node.name |
Sets the name of the node. |
Valid node name |
NA |
Startup |
application.hostname |
This is the hostname used in reporting the address |
valid hostname |
NA |
Startup |
application.port |
The port DConE uses for communication |
1 – 65535 |
6444 |
Startup |
dcone.system.db.panic.if.dirty |
If set to true and the DConE system database was not shut down 'cleanly' (i.e., the prevaylers weren’t closed) then on restart the server will not start. |
true or false |
true |
Startup |
application.integration.db.panic.if.dirty |
If set to true and the application integration database was not shut down 'cleanly' (i.e., the prevaylers weren’t closed) then on restart the server will not start. |
true or false |
true |
Startup |
communication.hostname |
Hostname used for binding opened ports for DConE, the requests and REST. While DConE has logic which will default the value to 0.0.0.0, Big Replicate does not set a default, so the property must be specified. |
A valid hostname or IP |
None - must be specified |
Startup |
data.center |
The zone where the Big Replicate server is located |
Any String |
None - must be present |
Startup |
database.location |
The directory DConE will use for persistence |
Any existing path |
None - must be present |
Startup |
executor.threads |
The number of threads executing agreements in parallel. (Since 2.10, where repair.threads were introduced, this is the total number of repair and agreement execution threads.) |
1 – reasonable max number of threads as allowed per platform (taking into account other threads) |
20 (2.10+ - 25) |
Startup |
repair.threads
|
Number of executor threads dedicated for repair only. These are the ones which will do the work for repairing and nothing else. |
1 - less than executor.threads |
5 |
Startup |
repair.thread.limit
|
(A slightly misleading name.) The maximum number of outstanding files that a single repair will have scheduled for execution at any given time. If this limit is reached, the repair waits for some files to complete before scheduling more. This is a mechanism that allows multiple parallel repairs to zip together. E.g. if a repair with 1000 files arrives and then another with 10, and the repair with 1000 files scheduled them all, the short 10-file repair would have to wait. With this limit, only the first 10 of the 1000 are scheduled, on a rolling basis, so when the other repair arrives it can schedule its 10 and they will start sharing the executors evenly. This should be set to a value equal to, or a bit larger than, repair.threads. |
1 - MAX_INTEGER |
10 |
Startup |
fusion.decoupler |
The decoupler the Big Replicate server will use |
dcone, disruptor, simple |
disruptor |
Startup |
fusion.http.policy |
Determines the transfer protocol(s) to be supported by Big Replicate Server. |
HTTP_ONLY, HTTPS_ONLY, BOTH_HTTP_HTTPS |
HTTP_ONLY |
Startup |
disruptor.wait.strategy |
The wait strategy to use when the disruptor is selected for fusion.decoupler |
blocking, busy.spin, lite.blocking, sleeping, yielding |
yielding |
Startup |
jetty.http.port |
The port the Big Replicate HTTP server will use |
1 – 65535 |
8082 |
Startup |
request.port |
The port Big Replicate clients will use |
1 – 65535 |
8023 |
Startup |
transport |
The transport the Big Replicate server should use |
EPOLL, NIO, OIO |
NIO |
Startup |
transfer.chunk.size |
The size of the "chunks" used in a file transfer. Used as input to Netty’s ChunkedStream. |
1 – Integer.MAX_VALUE |
4096kb |
When each pull is initiated |
dcone.use.boxcar |
Whether boxcars should be used |
true or false |
false |
Startup |
license.file |
The path to the license file |
A valid path to a license key |
/etc/wandisco/server/license.key |
On each periodic license check |
max.retry.attempts |
The maximum number of times to retry an agreed request. |
1 – Integer.MAX_VALUE |
180 |
When executing an agreed request |
remote.ihc.port |
The port remote ihc servers should connect to when the zone is Inbound. |
1 – Integer.MAX_VALUE |
8024 |
Startup |
retry.sleep.time |
The sleep time (milliseconds) in between retries of an agreed request. |
1 – Long.MAX_VALUE (note the capital L suffix in the default value, e.g. 1000L; make sure to include it) |
1000L |
When executing an agreed request |
ssl.enabled |
Whether Big Replicate Server - Big Replicate Server, Big Replicate Server - IHC Server, and Big Replicate Server - Big Replicate Client communications should all use SSL. (In 2.8 and beyond, this property ONLY enables Big Replicate Server - Big Replicate Server SSL.) |
true or false |
false |
Startup |
ssl.key.alias |
Alias of private key / certificate chain used to encrypt communications by server. |
alias of a keystore entry |
None - required if server-server or server-client SSL is enabled |
Startup |
ssl.key.password |
Encrypted password of key entry |
Password encrypted using password-encryptor.sh |
None |
Startup |
ssl.keystore |
Location of key store containing key entry |
Absolute path to key store |
None - required if server-server or server-client SSL is enabled |
Startup |
ssl.keystore.password |
Encrypted password of key store |
Password encrypted using password-encryptor.sh |
None |
Startup |
ssl.truststore |
Location of trust store used to validate certificates sent by other Big Replicate Servers or IHC servers |
Absolute path to trust store |
None - required if server-server or server-IHC SSL is enabled |
Startup |
ssl.truststore.password |
Encrypted password of trust store |
Password encrypted using password-encryptor.sh |
None |
Startup |
ihc.ssl.enabled (2.10+) |
If Big Replicate Server - IHC communications should use SSL (the Big Replicate-server part of config). |
true/false |
false |
Startup |
cooperative.limit |
15000 |
fusion.replicated.dir.exchange
|
Location of a directory in the replicated filesystem to which the Big Replicate server will write information about replicated directories for clients to read. It should be a non-replicated location, readable by all users and writable by the Big Replicate user. For this to work, the same location must be configured for clients in their core-site.xml. |
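For reference, a minimal application.properties combining the properties above might look like the following sketch. The hostnames, zone name and database path are placeholder values, not defaults:
# Placeholder example only - adjust every value for your deployment
node.name=fusion-node-1
application.hostname=fusion01.example.com
application.port=6444
communication.hostname=fusion01.example.com
data.center=zoneA
database.location=/opt/wandisco/fusion/server/dcone/db
jetty.http.port=8082
request.port=8023
license.file=/etc/wandisco/server/license.key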
The second file is:
/etc/hadoop/conf/core-site.xml
fs. prefix removal
Please take note that in Big Replicate 2.8 many of the properties in the following table have had the fs. prefix removed. The fs. prefix is now used exclusively for filesystem-specific properties. |
Property | Description | Permitted Values (default value in bold) |
---|---|---|
fusion.http.authentication.enabled
|
Enables authentication on the REST API |
true or false |
fusion.http.authentication.type
|
Type of authentication used. |
"simple" (for simple authentication) or "kerberos" (for kerberos authentatication) |
fusion.http.authentication.simple.anonymous.allowed
|
If type is "simple", whether anonymous API calls are allowed. If set to false, users must append a query parameter at the end of their URL "user.name=$USER_NAME" |
true or false |
fusion.http.authentication.kerberos.principal |
If type is "kerberos", the principal the fusion server will use to login with. The name of the principal must be "HTTP". |
'*' (Putting simply an asterisk will cause the filter to pick up any principal found in the keytab that is of the form |
fusion.http.authentication.kerberos.keytab |
If type is "kerberos", the path to a keytab that contains the principal specified. |
Any String |
fusion.http.authentication.signature.secret.file |
Path to a readable secret file. File is used to authenticate cookies. |
Any String |
fusion.enable.early.pulls |
A property targeted at FileSystems that do not support appends (e.g. S3, Azure). When set to the default "false" the Big Replicate server will ignore incoming HFlushRequests. The "fs." prefix has been removed as the property may not be specific to FileSystems in future. |
true or false |
Property that sets the state of authorization. |
true or false. |
|
fusion.http.authorization.authorized.read.writers |
The read-writers config dictates which users are allowed to make write REST calls (e.g. DELETE, PATCH, POST, and PUT). Read-writers have both read and write permissions. |
A comma-delimited list of authorized users. |
fusion.http.authorization.authorized.readers |
Users who have read-only permission. They are unable to do all of the calls noted in the read.writers entry, above. |
A comma-delimited list of authorized users. |
fusion.http.authorization.authorized.proxies |
The core filter reads a new local property which specifies proxy principals - this is the remote user principal that the UI will authenticate as. The value for the property should be set to the user part of the UI kerberos credential, e.g. |
|
fusion.client.can.bypass |
Enables or disables the ability for the client to bypass to underlying filesystem without waiting for a response from Big Replicate. |
true or false |
fusion.client.coordinate.read (2.1.0 only) |
If true, the Big Replicate client will coordinate open() operations (which is used when an application opens a file for read). See Big Replicate Client OpenRequests coordination |
true or false |
fusion.client.bypass.response.secs (2.8 only) |
Sets how long the client will wait for a response from Big Replicate before bypassing to the underlying filesystem. |
integer (seconds) |
fusion.client.bypass.retry.interval.secs (2.8 only) |
Sets how long to keep bypassing for once a client has been forced to bypass for the first time. |
integer (seconds) |
fusion.backup.dir (2.8 only) |
The directory path where backups will reside. Unless intended to replicate, ensure that this is a non-replicated directory so backup files only exist on this datacenter. |
path |
fusion.client.repl.dir.cache.dispose.interval.secs |
The period of time that a client maintains the cache of Replicated Directories. After the period, the client will clear its cache, connect to a fusion-server, and build a new cache of Replicated Directories. |
time in seconds |
fusion.username.translations |
This property enables administrators to handle user-mapping between replicated folders. It consists of a comma-separated list of regex rules. Each rule consists of a username (from an incoming request) separated from a translated pattern by a "/". See the further explanation in Username Translations below. |
null by default. pattern-string/translate-to-string |
fusion.replicated.dir.exchange (2.9.3+, 2.10+) |
Location from which clients should try to read information about replicated directories, before contacting the Big Replicate server. (FUS-3121) It’s necessary to configure the same in the server’s application.properties, so that it generates the necessary data. |
hdfs://nn/shared/fusionDirExchange |
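As an illustration of the authentication properties above, enabling simple authentication with anonymous access disabled might look like this in core-site.xml (a sketch, not a recommendation):
<property>
  <name>fusion.http.authentication.enabled</name>
  <value>true</value>
</property>
<property>
  <name>fusion.http.authentication.type</name>
  <value>simple</value>
</property>
<property>
  <name>fusion.http.authentication.simple.anonymous.allowed</name>
  <value>false</value>
</property>
REST calls would then need to carry the user.name query parameter described above, e.g. http://fusion01.example.com:8082/fusion/fs?user.name=admin (the hostname and user are placeholders).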
16.2. Username Translations
16.2.1. Example
<property>
  <name>fusion.username.translations</name>
  <value>hdp-(.*)/cdh-$1,([A-Z]*)-([0-9]*)-user/usa-$2-$1</value>
</property>
In the data center where the fusion.username.translations property is set, when a request comes in, the username of the request is checked against each listed pattern; if the username matches a pattern, an attempt is made to translate it using the listed value. If, during a check, none of the rules are found to match, we default to the username of the request, with no attempt to translate it.
Furthermore, the user translation will iterate over the list of translations and use the first match. Once a match is made, no further translation is attempted.
Looking at the example translation rules:
hdp-(.*)/cdh-$1,([A-Z]*)-([0-9]*)-user/usa-$2-$1
Notice here that we have two rules:
-
hdp-(.*)/cdh-$1
-
([A-Z]*)-([0-9]*)-user/usa-$2-$1
To reiterate, we expect the following in the property:
-
Rules are comma separated.
-
Patterns and translations are separated by "/".
-
Patterns and translations don’t contain "/".
-
White spaces should be accounted for in code, but are discouraged.
"user" field inconsistencies are ignored
If any nodes that take part in a consistency check have the
Username Translation feature enabled, then
inconsistencies in the "user" field will be ignored. |
For the above config example, assume a createRequest comes in with the following usernames:
Username: ROOT-1991-user
-
We will check against the first pattern, hdp-(.*), and notice it doesn’t match.
-
We will check against the second pattern, ([A-Z]*)-([0-9]*)-user, and notice it matches.
-
Attempt to translate the username using usa-$2-$1.
-
Username is translated to usa-1991-ROOT.
-
Create is done on the underlying filesystem using username, usa-1991-ROOT.
Username: hdp-KPac
-
We will check against the first pattern, hdp-(.*), and notice it matches.
-
Attempt to translate the username using cdh-$1.
-
Username is translated to cdh-KPac.
-
Create is done on the underlying filesystem using username, cdh-KPac.
Username: hdfs
-
We will check against the first pattern, hdp-(.*), and notice it doesn’t match.
-
We will check against the second pattern, ([A-Z]*)-([0-9]*)-user, and notice it doesn’t match.
-
Username is left as hdfs. Create is done on the underlying filesystem using username, hdfs.
Because these are config properties, any data center can have any set of rules. They must be identical across fusion-servers that occupy the same zone but do not have to be identical across data centers.
See more about enabling Kerberos authentication on Big Replicate’s REST API.
16.3. Kerberos Settings
When Kerberos is enabled on the underlying cluster, Big Replicate Server and IHC need to have the following defined in their /etc/hadoop/conf/core-site.xml files.
Property |
Description |
Permitted Values |
Default |
Checked at… |
fs.fusion.keytab
|
The absolute location of the readable keytab file. |
/etc/security/keytabs/fusion.service.keytab
|
None - must be present if FileSystem cluster configured for Kerberos. |
Startup |
fs.fusion.principal
|
The name of the Big Replicate principal found in the keytab file. Used for Kerberos login purposes since a keytab can contain multiple principals. |
"fusion/_HOST@${KERBEROS_REALM}"
|
None - must be present if "fs.fusion.keytab" is defined. |
Startup |
fusion.handshakeToken.dir |
Path to the handshake directory. Big Replicate will attempt to write to this directory to verify that the user has the proper Kerberos credentials to write to the underlying file system. |
/user/hdfs/ |
Varies per file system. For HDFS, it is the user’s home directory. |
On processing a client request. |
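By way of example, the corresponding core-site.xml entries for the Big Replicate server and IHC might look like the following sketch, assuming the keytab location shown above and a placeholder realm of EXAMPLE.COM:
<property>
  <name>fs.fusion.keytab</name>
  <value>/etc/security/keytabs/fusion.service.keytab</value>
</property>
<property>
  <!-- EXAMPLE.COM is a placeholder; substitute your Kerberos realm -->
  <name>fs.fusion.principal</name>
  <value>fusion/_HOST@EXAMPLE.COM</value>
</property>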
16.4. IHC Server
The Inter-Hadoop Communication Server is configured from a single file located at:
/etc/wandisco/fusion/ihc/server/{distro}/{version string}.ihc.
Property |
Description |
Permitted Values |
Default |
Checked at… |
ihc.server |
The hostname and port the IHC server will listen on. |
String:[1 - 65535] |
None - must be present |
Startup |
ihc.transport |
The transport the IHC server should use. |
OIO, NIO, EPOLL |
NIO |
Startup |
ihc.server.bind (2.8 only) |
The address the IHC server will bind to. Equivalent of DConE’s communication.hostname. It need not be specified; if not specified, it will be "0.0.0.0:port". In all cases the port should be identical to the port used in the ihc.server address (above). |
String:[1 - 65535] |
0.0.0.0:port |
Startup |
ihc.ssl.enabled |
Signifies that Big Replicate server - IHC communications should use SSL encryption. |
true, false |
false |
Startup |
ihc.ssl.truststore.password |
Encrypted password of trust store |
Password encrypted using password-encryptor.sh |
None |
Startup |
ihc.ssl.key.alias |
Alias of private key / certificate chain used to encrypt communications by IHC Server. |
alias of a keystore entry |
None - required if Server-IHC SSL is enabled |
Startup |
ihc.ssl.key.password |
Encrypted password of key entry |
Password encrypted using password-encryptor.sh |
None |
Startup |
ihc.ssl.keystore |
Location of key store containing key entry |
Absolute path to key store |
None - required if Server-IHC SSL is enabled |
Startup |
ihc.ssl.truststore.type |
Format of trust store |
JKS, PKCS12, etc. |
JKS |
Startup |
http.server |
The host and port for the web server, used when the fusion.ihc.http.policy is equal to HTTP_ONLY or BOTH_HTTP_HTTPS. |
String:[1 - 65535] |
0.0.0.0:9001 |
Startup |
https.server |
The host and port for the web server, used when the fusion.ihc.http.policy is equal to HTTPS_ONLY or BOTH_HTTP_HTTPS. |
String:[1 - 65535] |
0.0.0.0:8001 |
Startup |
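For illustration, a minimal .ihc file using the properties above might contain the following; the hostname and IHC port are placeholders, not defaults:
# Placeholder values - use your own IHC hostname and port
ihc.server=ihc01.example.com:7000
ihc.transport=NIO
http.server=0.0.0.0:9001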
16.5. IHC Network configuration
The following is a description of how IHC servers are added to the replication system from Big Replicate 2.9:
-
The IHC servers are configured with the addresses of the Big Replicate servers that inhabit the same zone.
-
Periodically, the IHC servers ping the Big Replicate servers using these stored addresses.
-
The Big Replicate servers will announce the IHC servers that have pinged them.
IHC servers in standard configuration should have the addresses of all Big Replicate servers, since the core-site.xml property fusion.server lists them all. This is important because only the writer node in each zone will confirm the existence of IHCs that have pinged it; other Big Replicate servers don’t. Therefore the IHC has to talk to all Big Replicate servers in the zone in order to be flagged as available (see the example below).
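For example, a zone with two Big Replicate servers might list both, with their request ports, in core-site.xml (the hostnames are placeholders):
<property>
  <!-- Placeholder hostnames; 8023 is the default request.port -->
  <name>fusion.server</name>
  <value>fusion01.example.com:8023,fusion02.example.com:8023</value>
</property>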
This is the same method used in Hadoop to handle namenode and datanode connections: the datanode is configured with the namenode’s address and uses it to contact the namenode and indicate its availability. If the namenode doesn’t hear from the datanode within a set period, the namenode assumes that the datanode is offline.
Note:
If the fusion.server property was missing from the file during initialization then it is possible that the IHC server will fail to make a connection to the Big Replicate server, causing replication to stall. If a restart of the cluster fixes the problem, this may indicate that a necessary restart isn’t happening, which may result in IHC servers running with outdated configuration. |
16.6. Big Replicate Client
Client configuration is handled in
/etc/hadoop/conf/core-site.xml
Property | Description | Permitted Values | Default | Checked at… |
---|---|---|---|---|
fs.AbstractFileSystem.fusion.impl |
The Abstract FileSystem implementation to be used |
See comment 1 below |
None |
Startup |
pre 2.8 - fs.fusion.client.retry.max.attempts
|
Max number of times to attempt to connect to a Big Replicate server before failing over (in the case of multiple Big Replicate servers) |
Any integer |
3 |
Startup |
fs.fusion.impl |
The FileSystem implementation to be used |
See comment 1 below |
None |
Startup |
fs.fusion.push.threshold |
The number of bytes the client will write before sending a push request to the Big Replicate server indicating bytes are available for transfer. |
0 - Long.MAX_VALUE. (If the threshold is 0, pushes are disabled) |
The block size of the underlying filesystem |
Startup |
pre 2.8 - fs.fusion.server
|
The hostname and request port of the Big Replicate server. Comma-separated list of hostname:port for multiple Big Replicate servers. |
String:[1 – 65535] (Comma-separated list of Big Replicate servers) |
None - must be present |
Startup |
pre 2.8 - fs.fusion.transport, 2.8 - fusion.transport |
The transport the FsClient should use |
EPOLL, NIO, OIO |
NIO |
Startup |
pre 2.8 - fusion.underlyingFs, 2.8 - fs.fusion.underlyingFs |
The address of the underlying filesystem |
Often this is the same as the fs.defaultFS property of the underlying hadoop. However, in cases like EMRFS, the fs.defaultFS points to a local HDFS built on the instance storage which is temporary, with persistent data being stored in S3. Our customers are likely to use the S3 storage as the fs.fusion.underlyingFs. |
None - must be present |
Startup |
fs.fusion.underlyingFsClass |
The name of the implementation class for the underlying file system specified with fs.fusion.underlyingFs |
Fusion expects particular implementation classes to be associated with common URI schemes used by Hadoop clients when accessing the file system (e.g. s3://, file://, hdfs://, etc.) If your cluster is configured to use alternative implementations classes for the scheme configured in fs.fusion.underlyingFs, you need to specify the name of the implementation for the underlying file system with this item. You also need to specify the implementation if using a URI scheme that is not one of those known to the defaults here. |
There is a default per scheme:
|
Startup |
fs.hdfs.impl |
The DistributedFileSystem implementation to be used |
See comment 1 below |
None |
Startup |
pre 2.8 - fs.fusion.ssl.enabled, 2.8 - fusion.ssl.enabled, 2.1.0 - fusion.client.ssl.enabled |
If Big Replicate Server - Big Replicate Client communications should use SSL |
true, false |
false |
|
pre 2.8 - fs.fusion.ssl.truststore, 2.8 - fusion.ssl.truststore |
Location of trust store used to validate certificates sent by Fusion Servers |
Absolute path to trust store file |
None - must be present if server-client SSL enabled |
Startup |
pre 2.8 - fs.fusion.ssl.truststore.password, 2.8 - fusion.ssl.truststore.password |
Encrypted password of trust store |
Password encrypted using password-encryptor.sh |
None |
Startup |
pre 2.8 - fs.fusion.ssl.truststore.type, 2.8 - fusion.ssl.truststore.type |
Format of trust store |
JKS, PKCS12, etc. |
JKS |
Startup |
fusion.client.can.bypass (2.8 only) |
If true, bypasses the request to the underlying filesystem after retrying. |
true/false |
false |
Every time a request is submitted. |
fusion.client.coordinate.read (2.1.0 only) |
If true, the Big Replicate client will coordinate open() operations (which is used when an application opens a file for read). |
true or false |
false |
Every time open() is called. |
fusion.client.bypass.response.secs (2.8 only) |
Determines, in seconds, how long the client will wait for a Big Replicate server to handle the request |
Time in seconds. |
14 |
Startup |
fusion.client.bypass.retry.interval.secs (2.8 only) |
Rather than having the client retry over and over (which could cause extreme slowness while a Big Replicate server is down), once the client has switched to bypass operation, all subsequent operations bypass Big Replicate until this interval has passed, at which time the client will use Big Replicate again. |
Time in seconds |
60 |
Every time a Big Replicate client attempts to connect to a Big Replicate server. |
fusion.backup.dir (2.8 only) |
The directory path where backups will reside. Unless intended to replicate, ensure that this is a non-replicated directory so backup files only exist on this datacenter. |
Path |
/fusion/backup/ |
Startup |
fusion.backup.enabled (2.8 only) |
If true, enables backup on this datacenter. On every delete, rather than the file being removed, it will be moved to a timestamped directory within fusion.backup.dir. |
true/false |
false |
Startup |
fusion.backup.strategy |
A class that implements BackupStrategy that is the strategy of how backups are handled. |
class that implements BackupStrategy (i.e com.wandisco.fusion.plugin.hcfs.server.backup.HcfsBackupStrategy.class) |
NoBackupStrategy.class |
Startup |
fusion.enable.early.pulls |
If true, the Big Replicate server will pull data when it receives an HFlush request from another zone. |
true/false |
true |
When an HFlushRequest is received |
fusion.username.translations |
A list of username translations that dictate what username is used for operations coming from other datacenters. Example: fusion.username.translations= "datacenterA/datacenterB,dcA/dcB" If datacenterA did a create with "datacenterA" as a user, datacenterB would translate to "datacenterB" before applying the create. If there are multiple translations, they need to be comma-separated. |
Comma separated strings of the format <expected>/<translated>. |
None |
Requests from other datacenters. |
fusion.replicated.dir.exchange (2.9.3+, 2.10+) |
Location from which clients should try to read information about replicated directories, before contacting the Big Replicate server. (FUS-3121) It’s necessary to configure the same in the server’s application.properties, so that it generates the necessary data. |
hdfs://nn/shared/fusionDirExchange |
none |
startup |
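Drawing several of the properties above together, a sketch of a 2.8-style client core-site.xml for an HDFS cluster might look like this; the namenode URI and Big Replicate server address are placeholders (see the Usage Guide below for the full set of implementation-class combinations):
<property>
  <!-- Placeholder namenode URI -->
  <name>fs.fusion.underlyingFs</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>
<property>
  <!-- Placeholder Big Replicate server address -->
  <name>fusion.server</name>
  <value>fusion01.example.com:8023</value>
</property>
<property>
  <name>fs.fusion.impl</name>
  <value>com.wandisco.fs.client.FusionHdfs</value>
</property>
<property>
  <name>fs.AbstractFileSystem.fusion.impl</name>
  <value>com.wandisco.fs.client.FusionAbstractFs</value>
</property>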
16.6.1. Usage Guide
There’s a fixed relationship between the type of deployment and some of the Big Replicate Client parameters. The following table describes this relationship:
Configuration | fs.fusion.impl | fs.AbstractFileSystem.fusion.impl | fs.hdfs.impl
---|---|---|
Use of fusion:/// with HCFS | com.wandisco.fs.client.FusionHcfs | com.wandisco.fs.client.FusionAbstractFs | Blank
Use of fusion:/// with HDFS | com.wandisco.fs.client.FusionHdfs | com.wandisco.fs.client.FusionAbstractFs | Blank
Use of hdfs:/// with HDFS | Blank | Blank | com.wandisco.fs.client.FusionHdfs
Use of fusion:/// and hdfs:/// with HDFS | com.wandisco.fs.client.FusionHdfs | com.wandisco.fs.client.FusionAbstractFs | com.wandisco.fs.client.FusionHdfs
16.7. LocalFileSystems
We’ve introduced FusionLocalFs for LocalFileSystems using Big Replicate. This is necessary because there are a couple of places where the system expects to use a Local File System.
Configuration | fs.fusion.impl | fs.AbstractFileSystem.fusion.impl | fs.hdfs.impl
---|---|---|
LocalFileSystems (see below) | com.wandisco.fs.client.FusionLocalFs | com.wandisco.fs.client.FusionLocalFs | com.wandisco.fs.client.FusionLocalFs
Therefore, for LocalFileSystems, users should set their fs.<parameter>.impl configuration to com.wandisco.fs.client.FusionLocalFs.
16.8. Usage
-
Set fs.file.impl to FusionLocalFs (then any file:/// command will go through FusionLocalFs).
-
Set fs.fusion.impl to FusionLocalFs (then any fusion:/// command will go through FusionLocalFs).
Furthermore, a user can now set any scheme to any Fusion*Fs, and when running a command with that scheme it will go through that Fusion*Fs. For example:
-
Set fs.orange.impl to FusionLocalFs (then any orange:/// command will go through FusionLocalFs).
-
Set fs.lemon.impl to FusionHdfs (then any lemon:/// command will go through FusionHdfs).
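As a sketch of the scheme mapping described above, the lemon example would be configured in core-site.xml like this (the scheme name is purely illustrative):
<property>
  <!-- Illustrative custom scheme; any scheme name can be mapped this way -->
  <name>fs.lemon.impl</name>
  <value>com.wandisco.fs.client.FusionHdfs</value>
</property>
A command such as hadoop fs -ls lemon:///repl1/ would then be routed through FusionHdfs.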
16.9. Big Replicate Client OpenRequests coordination
In Big Replicate 2.10, OpenRequests from clients are no longer coordinated operations; since OpenRequests are read-only, their coordination adds unnecessary traffic.
16.9.1. Failsafe
If the operation of any applications is affected by this change, it is possible to make OpenRequests coordinated again by changing the Big Replicate Client property fusion.client.coordinate.read to "true".
16.10. System Usage Graphs
The dashboard provides running monitors for key system resources.
License Limit: Volume of Replicated Data - on nodes that have data limits on their product license, this graph displays the volume of replicated data as a percentage of the license limit.
Java Heap Consumption - this graph tracks the percentage of configured Java Heap space that is currently in use across the cluster.
Big Replicate Database Partition Disk Usage - this graph measures the percentage of available storage in the partition that hosts the Big Replicate installation.
System CPU Load - this graph tracks the current percentage load on the cluster’s processors.
16.11. CPU Graph clarification
We display CPU load averages. Low values indicate that the system’s processor(s) have unused capacity. Above the warning threshold (80% by default) available capacity starts to run out. Note that the number that drives the graph is between 0 and 1, and so already takes multi-core systems into consideration.
16.12. Replicated Rules
The Replicated Folders screen lists those folders in the cluster’s HDFS space that are set for replication between Big Replicate nodes.
- Replication Rules
-
Lists all active replicated folders, currently running on the node.
- Pending Rules
-
Replicated folders that have been added but have not yet been established across all nodes. In most situations, pending rules will eventually move up to the Replication Rules table.
- Failed Rules
-
In rare situations, a replicated folder creation will be rejected because of a file inconsistency between nodes. In such cases, the Failure will be reported in the Failed Rules table.
16.13. Filtering
In deployments that use large numbers of rules, you can use the filter tool to focus on specific sets of rules, filtering by Path, Membership or Consistency.
16.15. Advanced Options
The Advanced Options provide additional control for replicated folders.
This identifies which node is assigned as the writer for this replicated folder. See the glossary for an explanation of the role of the writer.
Check consistency (Your rule will not be created if there are inconsistencies)
Use this option to perform a consistency check before creating the replicated folder space. The check must succeed in order for the rule to be applied. If you want to perform the check but not enforce consistency see the next checkbox.
- Create rule even if folder is inconsistent
-
The replication space is added, even if it has been found to be inconsistent between nodes. This option lets you create new replicated folders and then remedy inconsistencies.
- Shared Encrypted KMS
-
In deployments where multiple zones share a common KMS server, enable this parameter to specify a virtual prefix path.
- Preserve Origin Block Size
-
The option to preserve the block size from the originating file system is required when Hadoop has been set up to use a columnar storage solution such as Apache Parquet. If you are using a columnar storage format in any of your applications then you should enable this option to ensure that each file sits within the same HDFS block.
- Preserve Replication Factor
-
By default, data that is shared between clusters will follow the local cluster’s replication rules rather than preserve the replication rules of the originating cluster. When this option is enabled, the replication factor of the originating cluster is preserved.
Example
Data in Zone A, which has a replication factor of 3, is replicated to Zone B, which has a replication factor of 5. When Preserve Replication Factor is enabled, this replica of the data in Zone B will continue to use a replication factor of 3 rather than Zone B’s native replication factor of 5.
16.15.1. Exclude from replication?
You can select files or file system locations that will be excluded from replication across your clusters; these will not show up as inconsistent when a consistency check is run on the file system. This feature is used to exclude housekeeping and temporary system files that you don’t want clogging up the replication traffic. The entry field will accept specific paths and files, or a glob pattern (sets of filenames with wildcard characters) for paths or files.
16.15.2. Default Exclusions
The following glob patterns are automatically excluded from replication:
/**/.fusion, /**/.fusion/**
-
These directories store Big Replicate’s housekeeping files; they should always be excluded in the global zone properties (even after an update).
/**/.Trash, /**/.Trash/**
-
Trash directories are excluded by default, but this exclusion can be removed if required.
Example
Requirement: exclude all files in the replicated directory with the "norep_" prefix from replication.
Folder structure:
/repl1/rep_a
/repl1/norep_b
/repl1/subfolder/rep_c
/repl1/subfolder1/norep_d
/repl1/subfolder2/rep_e
/repl1/subfolder2/norep_e
Required rule:
**/norep_*
-
The pattern does not need to be an absolute path; e.g. /repl1/subfolder1/norep_d is automatically treated as relative to the replicated folder, i.e. /subfolder1/norep_d.
-
Take care when adding exclusion rules as there is currently no validation on the field.
17. Web UI Reference Guide
The following section takes a look at each screen that is available in the Big Replicate UI.
17.2. Node
17.2.2. About This Node
The About This Node panel shows the version information for the underlying Hadoop deployment as well as the Big Replicate server and UI components:
- Big Replicate UI Version
-
The current version of the Big Replicate UI.
- Big Replicate Build Number
-
The specific build for this version of the Big Replicate UI.
- Hadoop Version
-
The version of the underlying Hadoop deployment.
- Big Replicate Version
-
The version of the Big Replicate replicator component.
- Big Replicate Uptime
-
The time elapsed since the Big Replicate system last started up.
- Cluster Manager
-
The management application used with the underlying Hadoop.
17.2.3. Client Bypass Settings
The emergency bypass feature gives the administrator an option to bypass Big Replicate and write to the underlying file system, which will introduce inconsistencies between zones. This is suitable for when short-term inconsistency is seen as a lesser evil compared to blocked progress.
For an explanation about how to use this feature, see Emergency bypass to allow writes to proceed.
17.2.4. Graph Settings
The graphs that are displayed on the Big Replicate dashboard can be modified so that they use different thresholds for their "Warning" and "Critical" levels. By default, warn triggers at 80% usage and critical triggers at 90% or 95%.
- Warning
-
At the warn level, the need for administrator intervention is likely, although the state should have no current impact on operation. On a breach, there is the option for Big Replicate to send out an alerting email, providing that you have configured the email notification system. See Set up email notifications.
- Critical
-
At the critical level, the need for administrator intervention may be urgent, especially if the breach concerns partition usage where reaching 100% will cause the system to fail and potentially result in data corruption. On a breach, there is the option for Big Replicate to send out an alerting email, providing that you have configured the email notification system. See Set up email notifications.
- License Data Limit
-
This corresponds with the dashboard graph "License Limit: Volume of Replicated Data". The graph tracks the percentage of replicated data permitted by the current license. Enter your own values for the "Warn" trigger and "Critical" trigger. Defaults: Warn 80%, Critical 90%.
- CPU
-
This corresponds with the dashboard graph "System CPU Load", it tracks usage in terms of percentage of available cycles. Defaults: Warn 80%, Critical 95%.
- Memory
-
This corresponds with the dashboard graph "Java Heap Consumption" which tracks the ammount of JAVA heap currently in use, based on the Maximum Heap Settings that were applied during the installation. Defaults: Warn 80%, Critical 95%.
- Disk
-
This corresponds with the dashboard graph "Big Replicate Database Partition Disk Usage". This graph measures the percentage of the available partition specifically being used by Big Replicate. The monitor is important because exhausting the available storage will cause Big Replicate to fail and potentially could result in corruption to the internal prevayler database. Defaults: Warn 80%, Critical 95%.
17.2.5. Support
The support tab contains links and details that may help you if you run into problems using Big Replicate.
17.2.6. License Settings
The License Settings panel gives you a summary of the server’s license status.
- License
-
Type of license employed (Evaluation or Production).
- Expiry
-
Date on which the license will expire — at which point a new license will be required.
- Big Replicate Server Limit
-
Maximum number of Big Replicate servers (nodes) across all zones included under the license.
- Data Transfer Limit
-
The maximum amount of replicated data (TB) permitted by the license.
- Source
-
The origin of the license. Provided or Downloaded.
- Licensed IPs
-
Machine IP addresses that are covered under the license.
- Zone Limit
-
The maximum number of zones supported under the license.
- License Location
-
Location of the license key on the server’s file system.
17.2.7. Plugins
The Plugins screen lists the versions of all installed Plugins in the deployment.
- Big Replicate Core
-
Lists plugins that modify Big Replicate’s core. Currently Big Replicate’s Hadoop Compatible File System is registered as a plugin. Note that this is built into the product and doesn’t need to be installed.
- UI Server
-
Lists plugins that modify the UI Server. There are currently no UI Server plugins available.
- UI Client
-
Lists UI Client plugins.
17.2.8. Server Settings
The server settings give you control over traffic encryption between Big Replicate and IHC servers.
- Enable SSL for Big Replicate (checkbox)
-
Tick to enable the Big Replicate server to encrypt traffic using SSL. See below for additional settings that appear if you enable SSL.
- KeyStore Path
-
Path to the keystore.
- KeyStore Password
-
Encrypted password for the KeyStore.
- Key Alias
-
The Alias of the private key.
- Key Password
-
Private key encrypted password.
- TrustStore Path
-
Path to the TrustStore.
- TrustStore Password
-
Encrypted password for the TrustStore.
Changes must be applied to all servers
Changes to SSL settings require the same changes to be made manually in the UI of every other Big Replicate node. Updating will also make changes in the core-site file via the management endpoint. You will need to push out configs and restart some services.
- DConE panic if dirty (checkbox)
-
This option lets you enable the strict recovery option for IBM’s replication engine, to ensure that any corruption to its prevayler database doesn’t lead to further problems. When the checkbox is ticked, Big Replicate will log a panic message whenever Big Replicate is not properly shut down, either due to a system or application problem.
- App Integration panic if dirty (checkbox)
-
This option lets you enable the strict recovery option for Big Replicate’s database, to ensure that any corruption to its internal database doesn’t lead to further problems. When the checkbox is ticked, Big Replicate will log a panic message whenever Big Replicate is not properly shut down, either due to a system or application problem.
Configuration changes
These options set the following properties in /etc/hadoop/conf/core-site.xml:
fusion.ssl.enabled=true
fusion.ihc.ssl.enabled=true
This ensures that both Big Replicate server and the IHC server traffic is secured using SSL. The properties, as defined in the Big Replicate Configuration are repeated below:
Property |
Description |
Permitted Values |
Default |
Checked at… |
fs.fusion.ssl.enabled
|
If Client-Big Replicate server communications use SSL encryption. |
true, false |
false |
Startup |
ihc.ssl.enabled |
Signifies that Big Replicate server - IHC communications should use SSL encryption. |
true, false |
false |
Startup |
Setting a password for SSL encryption
Use our provided bash script for generating a password. Run the script at the command line and enter a plaintext password; the script then generates and outputs the encrypted version of the entry:
[root@vmhost01-vm3 fusion-server]# ./encrypt-password.sh
Please enter the password to be encrypted
> ********
btQoDMuub7F47LivT3k1TFAjWSoAgM7DM+uMnZUA0GUet01zwZl7M8zixVZDT+7l0sUuw6IqGse9kK0TiDuZi0eSWreeW8ZC59o4R15CCz0CtohER7O3uUzYdHaW6hmT+21RaFkUF5STXXHcwdflwq4Zgm+KdUXKF/8TrgEVqT854gci1KQyk+2TKSGtGbANg12LplEre3DEGoMFOpy2wXbwO5kGOQM07bZPjsDkJmAyNwERg0F3k2sebbuGmz4VSAY1NTq4djX1bVwMWoPwcuiQXLwWLgfrGZDHaT+Cm88vRUsYaK2CDlZI4C7r+Lkkm/U4F/M6TFLGT6ZFlB+xRQ==
URI Selection
The default behaviour for Big Replicate is to handle all replication using the Hadoop Distributed File System via the hdfs:/// URI. Selecting the HDFS scheme provides the widest support for Hadoop client applications, since some applications can’t support the available fusion:/// URI or they can only run on HDFS instead of the less strict HCFS. Each option is explained below:
- Use HDFS URI with HDFS file system
-
This option is available for deployments where the Hadoop applications support neither the Big Replicate URI nor the HCFS standards. Big Replicate operates entirely within HDFS. This configuration will not allow paths with the fusion:// URI to be used; only paths starting with hdfs://, or with no scheme, that correspond to a mapped path will be replicated. The underlying file system will be an instance of the HDFS DistributedFileSystem, which will support applications that aren’t written to the HCFS specification.
- Use Big Replicate URI with HCFS file system
-
This is the default option that applies if you don’t enable Advanced Options, and was the only option in Big Replicate prior to version 2.6. When selected, you need to use fusion:// for all data that must be replicated over an instance of the Hadoop Compatible File System. If your deployment includes Hadoop applications that are neither able to support the Big Replicate URI nor written to the HCFS specification, then this option will not work.
- Use Big Replicate URI with HDFS file system
-
This differs from the default in that, while the Big Replicate URI is used to identify data to be replicated, the replication is performed using HDFS itself. This option should be used if you are deploying applications that can support the Big Replicate URI but not the Hadoop Compatible File System.
- Use Big Replicate URI and HDFS URI with HDFS file system
-
This "mixed mode" supports all the replication schemes (fusion://
, hdfs:// and no scheme) and uses HDFS for the underlying file system, to support applications that aren’t written to the HCFS specification.
Set Push Threshold Manually
This feature exposes the configuration property fs.fusion.push.threshold, stored in the core-site.xml file. It provides administrators with a means of making a small performance improvement, useful in a small number of cases. When enabled in the UI the entry displays as "Required".
You can enter your own value (in bytes) and click the Update button.
Amazon cloud deployments
For cloud deployments, ensure the property is disabled (unticked) or set to zero "0". This will disable HFLUSH, which is suitable because appends are not supported for S3 storage.
17.2.9. UI Settings
Settings that control the TCP port and hostname used by the Big Replicate UI.
- HTTP Port
-
Port used to access Big Replicate UI.
- HTTP Host
-
Hostname of the Big Replicate UI.
Restart required
Any change that you make will require a restart of the Big Replicate server in order for it to be applied.
17.3. AWS Settings
This section covers the Amazon Web Services settings that are available through Big Replicate’s UI. Note that the AWS section will only appear in deployments that connect to a suitable S3 storage bucket.
17.3.1. AWS About
- S3 Bucket Name
-
The name of your S3 storage bucket that you are connecting to.
- Amazon S3 Encryption
-
Indicates if you are using Amazon’s own encryption, showing as "true" or "false".
If you run with the S3 Plugin, the following Properties will display instead:
- S3 Bucket Name
-
The name of your S3 storage bucket that you are connecting to.
- S3 Buffer dir
-
The local directory used to handle S3 buffered data.
- S3 Segment Size
-
The size of the S3 segment.
- S3 Endpoint
-
The endpoint used to connect to your S3 Bucket.
17.3.2. Credentials
If the node you are installing is set up with the correct IAM role, then you won’t need to use the following AWS Credentials, as the EC2 instance will have access to S3.
However if IAM is not correctly set for the instance or the machine isn’t even in AWS then you need to provide AWS Credentials:
- Access Key ID
-
This is your AWS Access Key ID. Validation tests that there is a provided value, along with a valid secret key.
- Secret Key
-
This is the secret key that is used in conjunction with your Access Key ID to sign programmatic requests that are sent to AWS. Validation checks that the credentials file is accessible.
Read more on how to Use access key and secret key.
17.3.3. EMR Client
The EMR client panel provides a copy of the configuration instructions that are provided during the Big Replicate installation, in case you need to account for changes to your EMR cluster.
Configure EMR for Big Replicate
For detailed steps please see: Deploying Big Replicate Client to a new EMR cluster.
When setting up a new AWS Elastic MapReduce cluster, to incorporate Big Replicate into the auto-scaling feature and propagate core-site.xml changes, you must perform the steps below. If you do not perform these steps, anything your EMR nodes write to your S3 bucket will not be replicated to other instances of Big Replicate. The steps are as follows:
-
Click the " Place files" button below to copy an install script, a client RPM and a JSON configuration file to your S3 bucket.
-
Supply the JSON file to "Edit software settings".
-
Select the install script as a "Bootstrap Action".
17.3.4. Metering
The Metering panel is available for deployments that use the Metering payment option. See AWS Metering.
The panel provides a guide to the following data:
- Metered time
-
The billing increment, e.g. 1pm-2pm. Indicated as a timestamp (UTC) for when we attempted to send the bill to AWS.
- Billed data under replication
-
This is the highest transfer in the hour, which is used to account for possible data deletion that would reduce the apparent billable transferred data.
- Status
-
Indicates whether the metering event has been successfully billed or not. Displays "success" or "failed".
- Detail
-
Shows you the submission history, e.g. if a metering report was submitted twice and then succeeded on a third attempt, the detail will show 2 failed then 1 successful attempt to send the bill.
17.4. Hadoop
17.4.1. Client Downloads
The client applications required to talk to Big Replicate are provided here. The client packages are provided during installation; they’re listed in the Settings section in case they are required again after installation is complete.
- List of client nodes
-
This link goes to a list of nodes on the platform.
- Client RPMs
-
This is a link to the RPMs.
- Client DEBs
-
Link to Debian Package files, for use with Ubuntu deployments.
- Client Parcels
-
(Available on a Cloudera platform) Link to the client in Cloudera’s parcel package format.
WARNING
Please take note that, on some very rare occasions, the distribution of Cloudera’s parcel files may fail. We continue to investigate the cause of the problem, which appears to occur if the client parcel installation is not completed during the installation of Big Replicate, but has been left until after the installation instead. We always recommend getting your client files installed during step 8 of the installer journey.
Workaround:
- Client Stacks
-
(Available on a Hortonworks platform) Link to the client in Hortonworks Stack format.
17.4.2. Consistency Check
The consistency check options provide administrators with additional tunable properties concerning the tool for validating that all replica data is synchronized.
Default Check Interval
This lets you set the default amount of time that is allowed to pass before a new replicated folder is checked for consistency between replicas. The default value is 24 hours. It’s possible to set a different value for each specific replicated folder, using the Advanced Options available when setting up or editing a Replicated Folder.
- Default Check Interval
-
The entered value must be an integer between 1 and 24, representing number of hours before a consistency check will automatically take place.
- Save
-
Click the save button to store the entered value and use it for all replicated folders that don’t have their own set interval, using the Override the Consistency Check interval Advanced Option.
- Apply to All
-
This button lets you force a new default value onto all replicated folders, even those that have specified an override (noted above).
17.4.3. Kerberos
Big Replicate supports Kerberized environments, supporting the use of an established Kerberos Key Distribution Center (KDC) and realms.
Cluster Kerberos Configuration
- Kerberos enabled
-
The Hadoop manager is polled to confirm that Kerberos is enabled and active.
- Handshake Token Directory
-
This is an optional entry. It defines what the root token directory should be for the Kerberos Token field. This is set if you want to target token creation within the NFS directory and not just on the actual LocalFileSystem. If left unset it will default to the original behaviour, which is to create tokens in the /user/<username>/ directory.
- Configuration file path
-
Path to the current configuration file.
- Keytab file path
-
Path to the Keytab file.
Click the Validate button to have your entries checked by the installer.
Big Replicate Kerberos Configuration
Tick the Enable HTTP Authentication check-box to use Kerberos authentication for communication with the Big Replicate server.
Click Next step to continue or go back to return to the previous screen.
The remaining panels in step 6 detail all of the installation settings. All your license, Big Replicate server, IHC server and zone settings are shown. If you spot anything that needs to be changed you can click on the go back link.
17.5. Zone
Settings that apply between different zones. See Zones.
17.5.1. Email Notifications
Email Notifications let you set up system emails that are triggered if a particular system event occurs.
Email notification lets you set up notification emails that can be sent from the Big Replicate server if there’s a system event that requires administrator attention. For more information about setting up email notifications, see Set up email notifications
17.5.2. Networking
- Networking direction between Big Replicate Server and IHC server
-
Defines whether the Big Replicate server proactively creates outbound connections to remote IHC servers during data transfer, or waits for and re-uses inbound connections. Important: when set to inbound, the Big Replicate server must have a publicly visible hostname.
Select:
- Outbound connection
-
The default value, meaning when the Big Replicate server transfers data it will create an outbound connection to the IHC server.
- Inbound connection
-
The Big Replicate server will transfer data over an inbound connection from a remote IHC server. This is the option to be used in AWS since the on-premises Big Replicate servers cannot accept incoming connections.
17.5.3. Replication
- ACL Replication
-
If ACL replication is enabled, then changes from both local and remote zones are executed. Otherwise, only locally originated ACL modification commands are executed.
- Enable ACL replication
-
Checkbox (ticked by default)
- Replication Exchange Directory
-
Location of a directory in the replicated filesystem to which the Big Replicate Server will write information about replicated directories for clients to read. It should be a non-replicated location, readable by all users and writable by the Big Replicate user.
- Use replication exchange directory
-
Checkbox (unticked by default)
17.7. KMS
The status of the KMS (Key Management Service) is presented here. KMS is an encryption keystore, providing network users with keys required to decrypt data-at-rest. To be clear, KMS is not a function of Kerberos, and is functional without Kerberos, however, Kerberos is employed to provide HTTP SPNEGO authentication to the REST API.
For more information, see the Hadoop Key Management System documentation. |
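If you want to confirm that a client can reach the KMS, the standard Hadoop key commands can be used; the provider URI below is a placeholder for your KMS address:
# List the encryption keys visible to this client
# (the kms:// provider URI is illustrative).
hadoop key list -provider kms://http@kms.example.com:9600/kms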
- Configuration file path
-
Path to the Kerberos configuration file, e.g. krb5.conf.
- Keytab file path
-
Path to the generated keytab.
- Principal
-
Principal for the keytab file.
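You can verify that the keytab and principal work together by authenticating with them directly; the keytab path and principal name here are placeholders:
# Obtain a ticket using the keytab, then show the resulting credentials.
kinit -kt /etc/security/keytabs/fusion.service.keytab fusion/host.example.com@EXAMPLE.REALM
klist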
17.8. Big Replicate Kerberos Configuration
Selecting the option below will enable Kerberos authentication for communication with the Big Replicate Server.
- Enable Kerberos on Big Replicate
-
Check-box to enable Kerberos authentication
- Keytab file path
-
Path to the generated keytab.
- Principal
-
Principal for the keytab file.
- Enable Authorization
-
Tick this check-box to enable support for Authorization within Big Replicate. Authorization lets administrators grant two levels of access: full user access, and read-only access, where a user is not authorized to make changes to data. This check-box is only visible if Kerberos core authentication has been enabled and configured.
For auditing purposes, every time an unauthorized request is made, the Big Replicate server will log the user and attempted request.
Manually distribute client configuration. You will need to redistribute the client configurations after making this change because the core-site.xml config file will be modified. The Big Replicate server will then need to be restarted for these settings to take effect.
Enabling Authorization sets up some default values: the UI automatically adds the "fusionUISystem" user and the correct proxy principal when Authorization is enabled. All other users need to be configured manually, to match your deployment's specific needs.
See the authorization properties, as written to the core-site.xml file.
Whenever the Big Replicate UI makes a REST call on behalf of another user, it includes a header (tentatively called proxy.user.name) with the real user. Example curl call:
curl --negotiate -u: -H "proxy.user.name: bob" hostname:8082/fusion/fs
- File system path
-
Enter the full path of the system directory that will be monitored for disk usage.
- Severity level
-
Select a system log severity level (Severe, Warning, Info or Debug) that will correspond with the Disk Capacity Threshold.
Assigning a monitor with Severe level will impact operation if its Disk Capacity Threshold trigger is met: the affected Big Replicate node will immediately shut down to protect its file system from corruption. Ensure that Severe level monitors are set up with a threshold that corresponds with serious risk; set the threshold too low and you may find that Big Replicate nodes are shut down needlessly. |
- Disk Capacity Threshold (bytes)
-
The maximum amount of data that can be consumed by the selected system path before the monitor sends an alert message to the log file.
- Message
-
A human-readable message that will be sent to the log at the point that the Disk Capacity Threshold is reached.
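When choosing a Disk Capacity Threshold, it can help to check how many bytes a candidate path currently consumes. A simple sketch, with an illustrative path:
# Report the current usage of the monitored directory, in bytes.
du -sb /opt/fusion/dcone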
18. Glossary
Technical guide and glossary for Hadoop and Big Replicate terms.
18.1. Introducing Big Replicate
Big Replicate provides consistent, continuous data replication between file systems in Hadoop clusters. Client applications that use Big Replicate interact with a virtual file system that integrates the underlying storage across multiple clusters. When changes are made to files in one cluster, they are replicated immediately and consistently to the other clusters that Big Replicate spans.
-
Applications in all clusters can read and write to the file system through Big Replicate at any time, and be guaranteed that the file systems will remain consistent across all the participating clusters.
-
Big Replicate can span different versions and distributions of Hadoop, including CDH, HDP, EMC Isilon, Amazon S3/EMRFS and MapR, and presents the standard Hadoop-Compatible File System interface to applications, which do not need to be modified.
-
Similarly, Big Replicate does not require any changes to the underlying Hadoop clusters or their file systems. It operates as a proxy that client applications use when working with replicated file system content.
Big Replicate Terms
To help you understand how Big Replicate operates, this documentation uses the terms Zone, Membership and Replication Rule. They each play a critical role in your configuration and use of Big Replicate. You should understand this terminology before installing Big Replicate.
- Zones
-
A Zone represents the file system used in a standalone Hadoop cluster. Multiple Zones could be from separate clusters in the same data center, or could be from distinct clusters operating in geographically-separate data centers that span the globe. Big Replicate operates as a distributed collection of servers. While each Big Replicate server always belongs to only one Zone, a Zone can have multiple Big Replicate servers (for load balancing and high availability). When you install Big Replicate, you should create a Zone for each cluster’s file system.
- Edge Nodes
-
Edge nodes (also known as gateway nodes) are servers that interface between the Hadoop cluster and systems outside the network. Most commonly, edge nodes are used to run client applications and cluster administration tools.
18.2. DConE Terms
- Memberships
-
A Membership is a defined group of Big Replicate servers that replicate data between their Zones. You can have as many Big Replicate servers in a Membership as you like, and each Big Replicate server can participate in multiple Memberships. Big Replicate allows you to define multiple Memberships, and Big Replicate servers can fulfill different roles in each Membership they participate in. This allows you to control exactly how your Big Replicate environment operates normally, and how it behaves when faced with network failures, host failures and other types of issues.
- Replication Rules
-
File system content is replicated selectively by defining Replication Rules, which specify: the directory in the file system that will be replicated, the Zones that will participate in that replication, and the Membership associated with those Zones. Without any Replication Rules defined, each Zone’s file system operates independently of the others. With the combination of Zones, Memberships and Replication Rules, Big Replicate gives you complete control over how data are replicated between the file systems of your Hadoop clusters.
- Induction
-
The process of forming a membership between a number of Big Replicate nodes is called Induction.
- Apache Kafka
-
Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. It is often used in place of traditional message brokers like JMS and AMQP because of its higher throughput, reliability and replication.
Kafka works in combination with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering of streaming data.
- Writer
-
In Big Replicate's architecture, only one process (node/Big Replicate server) per zone is allowed to write into a replicated filespace - this node is the "Writer" for that replicated folder. Therefore, if there is one replicated folder and two zones, there will be two writers for the replicated folder, one in each zone. If there are two replicated folders and two zones there will be four writers, two in each zone.
The writer for a replicated folder does not have to be the same node as the writer for another replicated folder, e.g., node1 may be the writer for /dir1/dir2 and /dir1/dir3 and node2 may be the writer for /dir1/dir4, which allows for some degree of load-balancing across Big Replicate servers within a zone. If a writer fails, the election process will ensure a new writer for that folder is elected within a given period (set through the process for Tuning Writer Re-election).
Currently, the writer for a replicated folder is elected at random - it may be any node in a zone. In a future release, Big Replicate will let administrators specify a particular node to be the writer, using the REST API. However, if the specified node fails, another will be elected at random.
- HDInsight
-
HDInsight deploys and creates Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data with high reliability and availability. HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution. Hadoop often refers to the entire Hadoop ecosystem of components, which includes Apache Hadoop, Apache Storm and Apache HBase clusters, as well as other technologies under the Hadoop umbrella.
- Microsoft Azure
-
Azure is Microsoft’s cloud computing platform, a growing collection of integrated services (analytics, computing, database, mobile, networking, storage, and web) for moving faster, achieving more, and saving money. Azure provides an Azure Preview Portal for monitoring and managing the cluster. For more information, see What is Microsoft Azure? and Microsoft Azure infographics.
- Azure resource group
-
Applications are typically made up of many components, for example a web app, database, database server, storage, and 3rd party services. Azure Resource Manager (ARM) enables you to work with the resources in your application as a group, referred to as an Azure Resource Group. You can deploy, update, monitor or delete all of the resources for your application in a single, coordinated operation. You use a template for deployment and that template can work for different environments such as testing, staging and production. You can clarify billing for your organization by viewing the rolled-up costs for the entire group. For more information, see Azure Resource Manager Overview.
- Azure Blob storage
-
Azure Blob storage is a robust, general-purpose storage solution that integrates seamlessly with HDInsight. Through the WASB driver and the WebWasb (WebHDFS over WASB) interface, the full set of components in HDInsight can operate directly via standard Hadoop DFS tools (command line, File System Java API) on structured or unstructured data in Blob storage.
There are several benefits associated with using Azure Blob Storage as the native file system:
-
Storing data in Blob storage enables users to safely delete the HDInsight clusters that are used for computation without losing user data.
-
Data reuse and sharing
-
Data storage cost
Although there is an implied performance cost of not co-locating compute clusters and storage resources, this is mitigated by the way the compute clusters are created close to the storage account resources inside the Azure datacenter, where the high-speed network makes it very efficient for the compute nodes to access the data inside Azure Blob storage.
For more information, see Use Azure Blob storage with Hadoop in HDInsight.
- Address files in Blob storage
-
HDInsight uses Azure Blob storage through the WASB(S) (also known as Windows Azure Storage - Blob) driver. Azure Blob storage is transparent to users and developers.
To access the files on the default storage account, you can use any of the following syntaxes:
/example/jars/hadoop-mapreduce-examples.jar
wasb:///example/jars/hadoop-mapreduce-examples.jar
wasb://mycontainer@myaccount.blob.core.windows.net/example/jars/hadoop-mapreduce-examples.jar
If the data is stored outside the default storage account, you must link to the storage account at creation time. The URI scheme for accessing files in Blob storage from HDInsight is:
wasb[s]://<BlobStorageContainerName>@<StorageAccountName>.blob.core.windows.net/<path>
-
wasb[s]: The URI scheme provides unencrypted access (with the wasb: prefix) and SSL encrypted access (with the wasbs: prefix). We recommend using wasbs wherever possible, even when accessing data that lives inside the same datacenter in Azure.
-
<BlobStorageContainerName>: Identifies the name of the container in Azure Blob storage.
-
<StorageAccountName>: Identifies the Azure Storage account name. A fully qualified domain name (FQDN) is required.
-
<path>: The file or directory HDFS path name. Because containers in Azure Blob storage are simply key-value stores, there is no true hierarchical file system. A slash character ( / ) inside a blob key is interpreted as a directory separator. For example, the blob name for hadoop-mapreduce-examples.jar is:
example/jars/hadoop-mapreduce-examples.jar
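For example, you could list that directory through the standard Hadoop CLI; the container and account names below are placeholders:
# List the example jars directory via the WASB driver
# (container and account names are illustrative).
hdfs dfs -ls wasbs://mycontainer@myaccount.blob.core.windows.net/example/jars/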
When working with blobs outside of HDInsight, most utilities do not recognize the WASB format and instead expect a basic path format, such as example/jars/hadoop-mapreduce-examples.jar.
Best Practices for using blob storage with HDInsight
-
Don’t share a default container between two live clusters. This is not a supported scenario.
-
Re-use the default container to reuse the same root path on a different cluster.
-
Use additional linked storage account for user data.
18.3. WebWasb
WebHDFS is the implementation of the HTTP REST API for HDFS-compatible file systems. WebWasb is simply WebHDFS for the WASB file system.
WebWasb can be installed on the edge node where the ISV applications live. From the edge node, WebWasb can be accessed by referring to localhost and the port 50073.
WebWasb works off of the default file system for the cluster (a specified default container in the default storage account) specified in /etc/hadoop/conf/core-site.xml under the property fs.defaultFS. As an example, if your default storage account is named storage1 and your default container is named container1, you could create a new directory called dir1 within that container by the following WebHDFS command:
curl -i -X PUT http://localhost:50073/WebWasb/webhdfs/v1/dir1?op=MKDIRS
WebWasb commands are case sensitive, so pay particular attention to the casing of "WebWasb"; operation names should all be uppercase.
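As a sketch of verifying the setup, you could confirm the cluster's default file system and then list the directory created above; the operations used are standard WebHDFS:
# Show the default file system named in core-site.xml
# (fs.defaultFS is the standard Hadoop property cited above).
hdfs getconf -confKey fs.defaultFS
# List the root of the default container to confirm dir1 was created.
curl -i "http://localhost:50073/WebWasb/webhdfs/v1/?op=LISTSTATUS"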
- Azure virtual network
-
With virtual network integration, Hadoop clusters can be deployed to the same virtual network as your applications so that applications can communicate with Hadoop directly. The benefits include:
-
Direct connectivity of web applications or ISV applications to the nodes of the Hadoop cluster, which enables communication to all ports via various protocols, such as HTTP or Java RPC.
-
Improved performance by not having your traffic go over multiple gateways and load-balancers.
-
Virtual networks give you the ability to process information more securely, and to expose only specific endpoints for public access.
-