Set up clusters in HDInsight with Apache Hadoop, Apache Spark, Apache Kafka, and more

  • Overview
  • Azure portal
  • Azure Data Factory
  • Azure CLI
  • Azure PowerShell
  • REST API (curl)
  • Azure Resource Manager templates

Learn how to set up and configure Apache Hadoop, Apache Spark, Apache Kafka, Interactive Query, Apache HBase, or Apache Storm in HDInsight. Also learn how to customize clusters and add security by joining them to a domain.

A Hadoop cluster consists of several virtual machines (nodes) that are used for distributed processing of tasks. Azure HDInsight handles implementation details of installation and configuration of individual nodes, so you only have to provide general configuration information.

Important

HDInsight cluster billing starts once a cluster is created and stops when the cluster is deleted. Billing is prorated per minute, so you should always delete your cluster when it's no longer in use. Learn how to delete a cluster.

If you're using multiple clusters together, you'll want to create a virtual network, and if you're using a Spark cluster you'll also want to use the Hive Warehouse Connector. For more information, see Plan a virtual network for Azure HDInsight and Integrate Apache Spark and Apache Hive with the Hive Warehouse Connector.

Cluster setup methods

The following table shows the different methods you can use to set up an HDInsight cluster.

| Clusters created with | Web browser | Command line | REST API | SDK |
| Azure portal | ✔ | | | |
| Azure Data Factory | ✔ | ✔ | ✔ | ✔ |
| Azure CLI | | ✔ | | |
| Azure PowerShell | | ✔ | | |
| curl | | ✔ | ✔ | |
| Azure Resource Manager templates | | ✔ | | |

This article walks you through setup in the Azure portal, where you can create an HDInsight cluster.

Basics

[Screenshot: HDInsight creation options, quick and custom]

Project details

Azure Resource Manager helps you work with the resources in your application as a group, referred to as an Azure resource group. You can deploy, update, monitor, or delete all the resources for your application in a single coordinated operation.

Cluster details

Cluster name

HDInsight cluster names have the following restrictions:

  • Allowed characters: a-z, 0-9, A-Z
  • Max length: 59
  • Reserved names: apps
  • The cluster naming scope is for all Azure, across all subscriptions. So the cluster name must be unique worldwide.
  • The first six characters must be unique within a virtual network

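If you generate cluster names programmatically, these restrictions are easy to validate up front. The following Python sketch is an illustration only (the `is_valid_cluster_name` helper is not part of any Azure SDK); it encodes the rules listed above:

```python
import re

# Rules taken from this article: letters and digits only, max length 59,
# and the reserved name "apps" is not allowed.
RESERVED_NAMES = {"apps"}
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9]{1,59}$")

def is_valid_cluster_name(name: str) -> bool:
    """Return True if `name` satisfies the documented naming restrictions."""
    if name.lower() in RESERVED_NAMES:
        return False
    return NAME_PATTERN.fullmatch(name) is not None
```

Note that this checks only the documented character and length rules; worldwide uniqueness can only be confirmed by Azure at creation time.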
Region

You don't need to specify the cluster location explicitly: the cluster is in the same location as the default storage. For a list of supported regions, select the Region drop-down list on HDInsight pricing.

Cluster type

Azure HDInsight currently provides the following cluster types, each with a set of components to provide certain functionalities.

Important

HDInsight clusters are available in various types, each for a single workload or technology. There is no supported method to create a cluster that combines multiple types, such as Storm and HBase on one cluster. If your solution requires technologies that are spread across multiple HDInsight cluster types, an Azure virtual network can connect the required cluster types.

| Cluster type | Functionality |
| Hadoop | Batch query and analysis of stored data |
| HBase | Processing for large amounts of schemaless, NoSQL data |
| Interactive Query | In-memory caching for interactive and faster Hive queries |
| Kafka | A distributed streaming platform that can be used to build real-time streaming data pipelines and applications |
| Spark | In-memory processing, interactive queries, micro-batch stream processing |
| Storm | Real-time event processing |

Version

Choose the version of HDInsight for this cluster. For more information, see Supported HDInsight versions.

Cluster credentials

With HDInsight clusters, you can configure two user accounts during cluster creation:

  • Cluster login username: The default username is admin. It uses the basic configuration on the Azure portal. It's sometimes called "Cluster user" or "HTTP user."
  • Secure Shell (SSH) username: Used to connect to the cluster through SSH. For more information, see Use SSH with HDInsight.

The HTTP username has the following restrictions:

  • Allowed special characters: _ and @
  • Characters not allowed: #;."',/:`!*?$(){}[]<>|&--=+%~^space
  • Max length: 20

The SSH username has the following restrictions:

  • Allowed special characters: _ and @
  • Characters not allowed: #;."',/:`!*?$(){}[]<>|&--=+%~^space
  • Max length: 64
  • Reserved names: hadoop, users, oozie, hive, mapred, ambari-qa, zookeeper, tez, hdfs, sqoop, yarn, hcat, ams, hbase, storm, administrator, admin, user, user1, test, user2, test1, user3, admin1, 1, 123, a, actuser, adm, admin2, aspnet, backup, console, david, guest, john, owner, root, server, sql, support, support_388945a0, sys, test2, test3, user4, user5, spark
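Both sets of restrictions can be checked client-side before you submit the creation form. The Python sketch below is illustrative only: the helper is not an Azure API, and the reserved-name set is abbreviated (the full list is given above).

```python
# Characters the article lists as not allowed in either username.
FORBIDDEN = set('#;."\',/:`!*?$(){}[]<>|&-=+%~^ ')

# Abbreviated subset of the SSH reserved names documented above.
SSH_RESERVED = {
    "hadoop", "users", "oozie", "hive", "mapred", "ambari-qa", "zookeeper",
    "tez", "hdfs", "sqoop", "yarn", "hcat", "ams", "hbase", "storm",
    "administrator", "admin", "root", "spark",
}

def is_valid_username(name: str, max_length: int, reserved=frozenset()) -> bool:
    """Check length, forbidden characters, and reserved names.

    Use max_length=20 for the HTTP user and max_length=64 plus
    SSH_RESERVED for the SSH user, per the rules above.
    """
    if not name or len(name) > max_length or name.lower() in reserved:
        return False
    return not any(ch in FORBIDDEN for ch in name)
```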

Storage

Cluster storage settings: HDFS-compatible endpoints

Although an on-premises installation of Hadoop uses the Hadoop Distributed File System (HDFS) for storage on the cluster, in the cloud you use storage endpoints connected to the cluster. Using cloud storage means you can safely delete the HDInsight clusters used for computation while still retaining your data.

HDInsight clusters can employ the following storage options:

  • Azure Data Lake Storage Gen2
  • Azure Data Lake Storage Gen1
  • Azure Storage General Purpose v2
  • Azure Storage General Purpose v1
  • Azure Storage Block blob (only supported as secondary storage)

For more information on storage options with HDInsight, see Compare storage options for use with Azure HDInsight clusters.

Warning

Using an additional storage account in a different location from the HDInsight cluster is not supported.

During configuration, for the default storage endpoint you specify a blob container of an Azure Storage account or Data Lake Storage. The default storage contains application and system logs. Optionally, you can specify additional linked Azure Storage accounts and Data Lake Storage accounts that the cluster can access. The HDInsight cluster and the dependent storage accounts must be in the same Azure location.
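For reference, a blob container used as default storage is addressed with a wasbs:// URI, and a Data Lake Storage Gen2 filesystem with an abfs:// URI. The following sketch just assembles these well-known endpoint shapes; the account and container names are placeholders:

```python
def wasbs_uri(container: str, account: str, path: str = "") -> str:
    """Blob container endpoint (Azure Storage) as used for HDInsight storage."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path}"

def abfs_uri(filesystem: str, account: str, path: str = "") -> str:
    """Data Lake Storage Gen2 endpoint."""
    return f"abfs://{filesystem}@{account}.dfs.core.windows.net/{path}"

# Example with placeholder names:
print(wasbs_uri("cluster1", "mystore"))
# wasbs://cluster1@mystore.blob.core.windows.net/
```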

Important

Enabling secure storage transfer after creating a cluster can result in errors using your storage account and is not recommended. It is better to create a new cluster using a storage account with secure transfer already enabled.

Note

Azure HDInsight does not automatically transfer, move, or copy your data stored in Azure Storage from one region to another.

Metastore settings

You can create optional Hive or Apache Oozie metastores. However, not all cluster types support metastores, and Azure Synapse Analytics isn't compatible with metastores.

For more information, see Use external metadata stores in Azure HDInsight.

Important

When you create a custom metastore, don't use dashes, hyphens, or spaces in the database name. This can cause the cluster creation process to fail.

SQL database for Hive

If you want to retain your Hive tables after you delete an HDInsight cluster, use a custom metastore. You can then attach the metastore to another HDInsight cluster.

An HDInsight metastore that is created for one HDInsight cluster version can't be shared across different HDInsight cluster versions. For a list of HDInsight versions, see Supported HDInsight versions.

Important

The default metastore provides an Azure SQL Database with a basic tier 5 DTU limit (not upgradeable). It is suitable for basic testing purposes. For large or production workloads, we recommend migrating to an external metastore.

SQL database for Oozie

To increase performance when using Oozie, use a custom metastore. A metastore can also provide access to Oozie job data after you delete your cluster.

SQL database for Ambari

Ambari is used to monitor HDInsight clusters, make configuration changes, and store cluster management information as well as job history. The custom Ambari DB feature allows you to deploy a new cluster and set up Ambari in an external database that you manage. For more information, see Custom Ambari DB.

Important

You cannot reuse a custom Oozie metastore. To use a custom Oozie metastore, you must provide an empty Azure SQL Database when creating the HDInsight cluster.

Security + networking

[Screenshot: HDInsight creation options, Enterprise security package selection]

Enterprise security package

For Hadoop, Spark, HBase, Kafka, and Interactive Query cluster types, you can choose to enable the Enterprise Security Package. This package provides the option for a more secure cluster setup by using Apache Ranger and integrating with Azure Active Directory. For more information, see Overview of enterprise security in Azure HDInsight.

The Enterprise Security Package allows you to integrate HDInsight with Active Directory and Apache Ranger. Multiple users can be created using the Enterprise Security Package.

For more information on creating a domain-joined HDInsight cluster, see Create a domain-joined HDInsight sandbox environment.

TLS

For more information, see Transport Layer Security.

Virtual network

If your solution requires technologies that are spread across multiple HDInsight cluster types, an Azure virtual network can connect the required cluster types. This configuration allows the clusters, and any code you deploy to them, to communicate directly with each other.

For more information on using an Azure virtual network with HDInsight, see Plan a virtual network for HDInsight.

For an example of using two cluster types within an Azure virtual network, see Use Apache Spark Structured Streaming with Apache Kafka. For more information about using HDInsight with a virtual network, including specific configuration requirements for the virtual network, see Plan a virtual network for HDInsight.

Disk encryption setting

For more information, see Customer-managed key disk encryption.

Kafka REST proxy

This setting is only available for cluster type Kafka. For more information, see Using a REST proxy.

Identity

For more information, see Managed identities in Azure HDInsight.

Configuration + pricing

[Screenshot: HDInsight node size selection]

You're billed for node usage for as long as the cluster exists. Billing starts when a cluster is created and stops when the cluster is deleted. Clusters can't be deallocated or put on hold.

Node configuration

Each cluster type has its own number of nodes, terminology for nodes, and default VM size. In the following table, the number of nodes for each node type is in parentheses.

| Type | Nodes | Diagram |
| Hadoop | Head node (2), Worker node (1+) | HDInsight Hadoop cluster nodes |
| HBase | Head server (2), region server (1+), master/ZooKeeper node (3) | HDInsight HBase cluster type setup |
| Storm | Nimbus node (2), supervisor server (1+), ZooKeeper node (3) | HDInsight Storm cluster type setup |
| Spark | Head node (2), Worker node (1+), ZooKeeper node (3) (free for A1 ZooKeeper VM size) | HDInsight Spark cluster type setup |

For more information, see Default node configuration and virtual machine sizes for clusters in "What are the Hadoop components and versions in HDInsight?"

The cost of HDInsight clusters is determined by the number of nodes and the virtual machine sizes for the nodes.

Different cluster types have different node types, numbers of nodes, and node sizes:

  • Hadoop cluster type default:
    • Two head nodes
    • Four Worker nodes
  • Storm cluster type default:
    • Two Nimbus nodes
    • Three ZooKeeper nodes
    • Four supervisor nodes

If you're just trying out HDInsight, we recommend you use one Worker node. For more information about HDInsight pricing, see HDInsight pricing.
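Because billing is prorated per minute for as long as the cluster exists, a rough cost estimate is simply node count times hourly rate times hours, summed over node roles. The sketch below uses made-up hourly rates purely for illustration; see HDInsight pricing for actual rates:

```python
def estimate_cost(node_counts_and_rates, hours: float) -> float:
    """Sum node_count * hourly_rate * hours across node roles.

    `node_counts_and_rates` is a list of (count, usd_per_node_hour) pairs.
    The rates used below are hypothetical, not actual Azure prices.
    """
    return sum(count * rate * hours for count, rate in node_counts_and_rates)

# Default Hadoop cluster shape from this article: 2 head nodes + 4 Worker nodes.
cluster = [(2, 0.50), (4, 0.30)]   # (count, hypothetical $/node-hour)
print(round(estimate_cost(cluster, hours=24), 2))  # one day of runtime
```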

Note

The cluster size limit varies among Azure subscriptions. Contact Azure billing support to increase the limit.

When you use the Azure portal to configure the cluster, the node size is available through the Configuration + pricing tab. In the portal, you can also see the cost associated with the different node sizes.

Virtual machine sizes

When you deploy clusters, choose compute resources based on the solution you plan to deploy. The following VMs are used for HDInsight clusters:

  • A and D1-4 series VMs: General-purpose Linux VM sizes
  • D11-14 series VMs: Memory-optimized Linux VM sizes

To find out what value you should use to specify a VM size while creating a cluster using the different SDKs or while using Azure PowerShell, see VM sizes to use for HDInsight clusters. From this linked article, use the value in the Size column of the tables.

Important

If you need more than 32 Worker nodes in a cluster, you must select a head node size with at least 8 cores and 14 GB of RAM.

For more information, see Sizes for virtual machines. For information about pricing of the various sizes, see HDInsight pricing.

Add application

An HDInsight application is an application that users can install on a Linux-based HDInsight cluster. You can use applications provided by Microsoft or third parties, or applications that you develop yourself. For more information, see Install third-party Apache Hadoop applications on Azure HDInsight.

Most HDInsight applications are installed on an empty edge node. An empty edge node is a Linux virtual machine with the same client tools installed and configured as on the head node. You can use the edge node for accessing the cluster, testing your client applications, and hosting your client applications. For more information, see Use empty edge nodes in HDInsight.

Script actions

You can install additional components or customize cluster configuration by using scripts during creation. Such scripts are invoked via Script Action, which is a configuration option that can be used from the Azure portal, HDInsight Windows PowerShell cmdlets, or the HDInsight .NET SDK. For more information, see Customize HDInsight cluster using Script Action.
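A script action is just a Bash script that HDInsight runs on the nodes you select. The sketch below is a minimal, hypothetical example (the package name and a Debian-based node image are assumptions, and the actual install line is left commented out so the sketch has no side effects); real script actions typically run with root privileges and should be idempotent, since they can rerun when nodes are re-imaged:

```shell
#!/usr/bin/env bash
# Minimal, hypothetical script action: install one extra package on each node.
# Assumes a Debian-based HDInsight node image; the package name is illustrative.
set -euo pipefail

install_package() {
  local pkg="$1"
  # On a real node this would be: sudo apt-get -y install "${pkg}"
  echo "installing ${pkg} on $(hostname)"
}

# The package name can be passed as a script action parameter; defaults to jq.
install_package "${1:-jq}"
```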

Some native Java components, like Apache Mahout and Cascading, can be run on the cluster as Java Archive (JAR) files. These JAR files can be distributed to Azure Storage and submitted to HDInsight clusters with Hadoop job submission mechanisms. For more information, see Submit Apache Hadoop jobs programmatically.

Sometimes, you want to configure the following configuration files during the creation process:

  • clusterIdentity.xml
  • core-site.xml
  • gateway.xml
  • hbase-env.xml
  • hbase-site.xml
  • hdfs-site.xml
  • hive-env.xml
  • hive-site.xml
  • mapred-site
  • oozie-site.xml
  • oozie-env.xml
  • storm-site.xml
  • tez-site.xml
  • webhcat-site.xml
  • yarn-site.xml

For more information, see Customize HDInsight clusters using Bootstrap.

Next steps

  • Troubleshoot cluster creation failures with Azure HDInsight
  • What are HDInsight, the Apache Hadoop ecosystem, and Hadoop clusters?
  • Get started using Apache Hadoop in HDInsight
  • Work in Apache Hadoop on HDInsight from a Windows PC