Connection between RapidMiner and FusionInsight HD¶
Successful Case¶
RapidMiner Studio 8.2.001 ↔ FusionInsight HD V100R002C80SPC200 (HDFS/Hive/MapReduce/Spark)
- Download and install RapidMiner Studio, download site
- Start RapidMiner. From the top of the main menu, choose , install it, and restart RapidMiner.
- Configure the local hosts file. The file path is ; add the IP address and host name of each cluster node, then save the file.
Configure the Kerberos files
Create a user of type "Human-Machine" (see the product documentation for details) and grant the Hive, Spark, and HDFS privileges to this user. For example, create a user named developuser, then download that user's user.keytab and krb5.conf and save them on your computer.
Prepare the FusionInsight client configuration files and jar files
- In the Manager GUI, choose Service->Download Client->Only Configuration File
- Unzip the file, find the following files, and copy them into a directory, like
- Open , and delete the following property: <property> <name></name> <value>Yarn</value> </property>
- Log in to one of the cluster nodes, go to the following path , download the jars directory, and save it on your computer, like C:/jars
Configure the cluster¶
Bind the UDP port
- Download the UDP port bind tool, website is
- After building and installing the tool, you get the executable file uredir. Upload it to the KDC server nodes in the cluster and run the following command, where IP is the IP address of the node: ./uredir IP:88 IP:21732
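When the cluster has more than one KDC server node, the same command has to be repeated once per node. The following is a minimal dry-run sketch that only prints the command for each node so it can be reviewed before being run there. The function name `uredir_commands` and the two IP addresses are hypothetical placeholders, not values from the cluster.

```shell
# Print the uredir command for every KDC node IP given as an argument.
# This is a dry run: nothing is executed on any node.
uredir_commands() {
    for ip in "$@"; do
        # Same form as the command in the step above: IP:88 -> IP:21732
        printf './uredir %s:88 %s:21732\n' "$ip" "$ip"
    done
}

# Placeholder addresses; replace them with the real KDC node IPs.
uredir_commands 192.0.2.11 192.0.2.12
```

Each printed line is then run on the node whose IP it names.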
Configure Radoop Jars
- Download the Radoop jars from this address and get the correct version.
- Upload the jar files to each node of the cluster. For example, on the HiveServer node, upload the Radoop jar files to the following paths and change their owner and permissions:
```
# Hive library directory
cd /opt/huawei/Bigdata/FusionInsight_HD_V100R002C80SPC200/install/FusionInsight-Hive-1.3.0/hive-1.3.0/lib
chown omm:wheel radoop_hive-v4.jar
chown omm:wheel rapidminer_libs-8.2.0.jar
chmod 700 radoop_hive-v4.jar
chmod 700 rapidminer_libs-8.2.0.jar

# MapReduce library directory
cd /opt/huawei/Bigdata/FusionInsight_HD_V100R002C80SPC200/install/FusionInsight-Hadoop-2.7.2/hadoop/share/hadoop/mapreduce/lib
chown omm:ficommon radoop_hive-v4.jar
chown omm:ficommon rapidminer_libs-8.2.0.jar
chmod 750 radoop_hive-v4.jar
chmod 750 rapidminer_libs-8.2.0.jar
```
- In the FusionInsight Manager GUI, choose Service->Hive->Service Configuration and add the following configuration: radoop\.operation\.id|mapred\.job\.name|hive\.warehouse\.subdir\.inherit\.perms|hive\.exec\.max\.dynamic\.partitions|hive\.exec\.max\.dynamic\.partitions\.pernode|spark\.app\.name
- Notice that there should be a | as separator between the entries
- Save the configuration and restart HiveServer
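The configuration value above follows a simple pattern: every dot in a parameter name is escaped with a backslash, and the names are joined with `|`. As a small sketch of how that string could be assembled from plain parameter names rather than typed by hand, the helper below (the name `build_whitelist` is hypothetical) uses only `printf`, `sed`, and `paste`.

```shell
# Escape each literal dot and join all names with "|".
build_whitelist() {
    printf '%s\n' "$@" | sed 's/\./\\./g' | paste -sd '|' -
}

# The six parameter names from the step above, written without escaping.
build_whitelist \
    radoop.operation.id \
    mapred.job.name \
    hive.warehouse.subdir.inherit.perms \
    hive.exec.max.dynamic.partitions \
    hive.exec.max.dynamic.partitions.pernode \
    spark.app.name
```

The printed line can be pasted into the Manager GUI as the configuration value.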
Create Radoop UDF functions
On the client node, run the following commands to log in to Hive:
source /opt/hadoopclient/bigdata_env
kinit developuser
beeline
Create a database in Hive, for example rapidminer, and create the functions by running the following commands in beeline:
create database rapidminer;
use rapidminer;
DROP FUNCTION IF EXISTS r3_add_file;
DROP FUNCTION IF EXISTS r3_apply_model;
DROP FUNCTION IF EXISTS r3_correlation_matrix;
DROP FUNCTION IF EXISTS r3_esc;
DROP FUNCTION IF EXISTS r3_gaussian_rand;
DROP FUNCTION IF EXISTS r3_greatest;
DROP FUNCTION IF EXISTS r3_is_eq;
DROP FUNCTION IF EXISTS r3_least;
DROP FUNCTION IF EXISTS r3_max_index;
DROP FUNCTION IF EXISTS r3_nth;
DROP FUNCTION IF EXISTS r3_pivot_collect_avg;
DROP FUNCTION IF EXISTS r3_pivot_collect_count;
DROP FUNCTION IF EXISTS r3_pivot_collect_max;
DROP FUNCTION IF EXISTS r3_pivot_collect_min;
DROP FUNCTION IF EXISTS r3_pivot_collect_sum;
DROP FUNCTION IF EXISTS r3_pivot_createtable;
DROP FUNCTION IF EXISTS r3_score_naive_bayes;
DROP FUNCTION IF EXISTS r3_sum_collect;
DROP FUNCTION IF EXISTS r3_which;
DROP FUNCTION IF EXISTS r3_sleep;
CREATE FUNCTION r3_add_file AS 'eu.radoop.datahandler.hive.udf.GenericUDFAddFile';
CREATE FUNCTION r3_apply_model AS 'eu.radoop.datahandler.hive.udf.GenericUDTFApplyModel';
CREATE FUNCTION r3_correlation_matrix AS 'eu.radoop.datahandler.hive.udf.GenericUDAFCorrelationMatrix';
CREATE FUNCTION r3_esc AS 'eu.radoop.datahandler.hive.udf.GenericUDFEscapeChars';
CREATE FUNCTION r3_gaussian_rand AS 'eu.radoop.datahandler.hive.udf.GenericUDFGaussianRandom';
CREATE FUNCTION r3_greatest AS 'eu.radoop.datahandler.hive.udf.GenericUDFGreatest';
CREATE FUNCTION r3_is_eq AS 'eu.radoop.datahandler.hive.udf.GenericUDFIsEqual';
CREATE FUNCTION r3_least AS 'eu.radoop.datahandler.hive.udf.GenericUDFLeast';
CREATE FUNCTION r3_max_index AS 'eu.radoop.datahandler.hive.udf.GenericUDFMaxIndex';
CREATE FUNCTION r3_nth AS 'eu.radoop.datahandler.hive.udf.GenericUDFNth';
CREATE FUNCTION r3_pivot_collect_avg AS 'eu.radoop.datahandler.hive.udf.GenericUDAFPivotAvg';
CREATE FUNCTION r3_pivot_collect_count AS 'eu.radoop.datahandler.hive.udf.GenericUDAFPivotCount';
CREATE FUNCTION r3_pivot_collect_max AS 'eu.radoop.datahandler.hive.udf.GenericUDAFPivotMax';
CREATE FUNCTION r3_pivot_collect_min AS 'eu.radoop.datahandler.hive.udf.GenericUDAFPivotMin';
CREATE FUNCTION r3_pivot_collect_sum AS 'eu.radoop.datahandler.hive.udf.GenericUDAFPivotSum';
CREATE FUNCTION r3_pivot_createtable AS 'eu.radoop.datahandler.hive.udf.GenericUDTFCreatePivotTable';
CREATE FUNCTION r3_score_naive_bayes AS 'eu.radoop.datahandler.hive.udf.GenericUDFScoreNaiveBayes';
CREATE FUNCTION r3_sum_collect AS 'eu.radoop.datahandler.hive.udf.GenericUDAFSumCollect';
CREATE FUNCTION r3_which AS 'eu.radoop.datahandler.hive.udf.GenericUDFWhich';
CREATE FUNCTION r3_sleep AS 'eu.radoop.datahandler.hive.udf.GenericUDFSleep';
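Because every function needs a matching DROP and CREATE statement, the list above is easy to get out of sync when typed by hand. As a sketch, the statements could instead be generated from "function name, class name" pairs. The helper name `gen_udf_sql` is hypothetical; the package name and the sample pairs are taken from the statements above, and only three pairs are shown for brevity.

```shell
# Package that holds the Radoop UDF classes, as used in the statements above.
UDF_PACKAGE='eu.radoop.datahandler.hive.udf'

# Read "function_name ClassName" pairs on stdin and print all DROP
# statements first, followed by all CREATE statements.
gen_udf_sql() {
    pairs=$(cat)
    printf '%s\n' "$pairs" | while read -r name class; do
        if [ -n "$name" ]; then
            printf 'DROP FUNCTION IF EXISTS %s;\n' "$name"
        fi
    done
    printf '%s\n' "$pairs" | while read -r name class; do
        if [ -n "$name" ]; then
            printf "CREATE FUNCTION %s AS '%s.%s';\n" "$name" "$UDF_PACKAGE" "$class"
        fi
    done
}

# Three of the twenty pairs, for illustration only.
gen_udf_sql <<'EOF'
r3_add_file GenericUDFAddFile
r3_apply_model GenericUDTFApplyModel
r3_sleep GenericUDFSleep
EOF
```

The output could be saved to a file and run with `beeline -f` after the `create database` and `use` statements, assuming the session has already been authenticated with kinit.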
RapidMiner Configuration¶
- In RapidMiner Studio, choose Connections->Manage Radoop Connections in the top menu.
- Choose New Connections->Import Hadoop Configuration Files, select the configuration files downloaded from the cluster, and click Import Configuration
- After the import, click , go to the Connection settings window, and configure as follows:
  - Global:
    - Hadoop Version: Other (Hadoop 2X line)
    - Additional Libraries Directory: Spark jar files downloaded from the cluster
    - Client Principal: Kerberos user name
    - Keytab File: the keytab file downloaded from Manager
    - KDC Address: the KDC server IP (see the krb5.conf file)
    - Kerberos Config File: the krb5.conf file downloaded from Manager
  - Hadoop:
    - In the filter in the upper right corner, search , uncheck mapreduce.input.fileinputformat.split.maxsize
  - Spark:
    - Spark Version: Spark 2.1
    - Spark Archive (or libs) Path: local:///opt/huawei/Bigdata/FusionInsight_Spark2x_V100R002C80SPC200/install/FusionInsight-Spark2x-2.1.0/spark/jars
    - Spark Resource Allocation Policy: Static, Default Configuration
    - Advanced Spark Parameters: add the following two parameters for Spark:
      The values can be found in the Manager GUI: choose Services->Spark2X Configuration->type all in the search bar, then choose the parameters in Spark2x->SparkResource2x
      Copy the values into a text file, replace the relative path in each value with the absolute path in the cluster, like /opt/huawei/Bigdata/FusionInsight_Spark2x_V100R002C80SPC200/1_21_SparkResource2x/etc , then copy the values into the RapidMiner Spark configuration
  - Hive:
    - Hive Version: Hive Server2
    - Hive Server Address: Hive Server IP
    - Hive Port: 21066
    - Database Name: the name of the database created in Hive, here rapidminer
    - Custom database for UDFs: same as Database Name
- Click OK->Proceed Anyway->Save
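The path replacement in the Advanced Spark Parameters step can be scripted instead of edited by hand. The sketch below replaces every literal occurrence of a relative prefix with an absolute directory. The helper name `absolutize`, the relative prefix `./`, and the sample parameter value are all assumptions made for illustration; only the absolute directory comes from the example in the text. Check the actual values copied from the Manager GUI before applying it.

```shell
# Replace every literal occurrence of a relative prefix ($1) with an
# absolute prefix ($2) in the lines read from stdin.
absolutize() {
    awk -v rel="$1" -v abs="$2" '{
        out = ""
        # index() matches literally, so dots and slashes need no escaping.
        while ((i = index($0, rel)) > 0) {
            out = out substr($0, 1, i - 1) abs
            $0 = substr($0, i + length(rel))
        }
        print out $0
    }'
}

# Absolute directory from the example in the text, with a trailing slash.
ABS_DIR='/opt/huawei/Bigdata/FusionInsight_Spark2x_V100R002C80SPC200/1_21_SparkResource2x/etc/'

# Hypothetical parameter value containing an assumed "./" relative prefix.
printf '%s\n' '-Dexample.config=./example.conf' | absolutize './' "$ABS_DIR"
```

Note that a plain `./` prefix would also match inside `../`, so a value containing parent-directory references needs a more specific prefix.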
Test the Connection¶
Click Configure. In the Global tab, click Test. The test results are shown as follows:
In the Hadoop tab, click Test. The test results are shown as follows:
In the Spark tab, click Test. The test results are shown as follows:
In the Hive tab, click Test. The test results are shown as follows:
Click Full test. The test results are shown as follows:
Radoop Demo¶
In the RapidMiner Studio main menu, choose
Help->Tutorials->User Hadoop->RapidMiner Radoop
- Run the demo according to the tutorial to get the following results