Connecting Jupyter Notebook to FusionInsight¶
Applicable Scenarios¶
Jupyter Notebook (Anaconda2 4.4.0) ↔ FusionInsight HD V100R002C70SPC200 (pySpark)
Installing Jupyter Notebook¶
Jupyter Notebook depends on Python and pulls in many tool packages with interdependent version requirements, which makes a manual installation cumbersome. It is usually simpler to install the Anaconda distribution, which bundles Python, Jupyter Notebook, and a large collection of scientific packages, so that is what we do here.
- Download Anaconda2-4.4.0 from the Anaconda site and install it::

    wget https://repo.continuum.io/archive/Anaconda2-4.4.0-Linux-x86_64.sh
    bash Anaconda2-4.4.0-Linux-x86_64.sh
- Generate the Jupyter Notebook configuration file::

    jupyter notebook --generate-config --allow-root
- Edit the Jupyter Notebook configuration and set the c.NotebookApp.ip option to the local IP address (see the sketch below)::

    vi /root/.jupyter/jupyter_notebook_config.py
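The relevant line in the file might look like the following (a minimal sketch; the IP address shown is taken from the sample output below and must be replaced with this node's own address)::

    # /root/.jupyter/jupyter_notebook_config.py
    # Bind the notebook server to the host's real address instead of localhost,
    # so that it can be reached from a browser on another machine.
    c.NotebookApp.ip = '172.21.33.122'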
- Start Jupyter Notebook::

    jupyter notebook --allow-root
- Output like the following indicates that Jupyter Notebook started successfully::

    [I 15:53:46.918 NotebookApp] Serving notebooks from local directory: /opt
    [I 15:53:46.918 NotebookApp] 0 active kernels
    [I 15:53:46.918 NotebookApp] The Jupyter Notebook is running at: http://172.21.33.122:8888/?token=f0494a2274cba1a6098ef21c417af2f3c49df872c6b34938
    [I 15:53:46.918 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    [W 15:53:46.919 NotebookApp] No web browser found: could not locate runnable browser.
    [C 15:53:46.919 NotebookApp] Copy/paste this URL into your browser when you connect for the first time,
        to login with a token:
            http://172.21.33.122:8888/?token=f0494a2274cba1a6098ef21c417af2f3c49df872c6b34938
- Press Ctrl+C to stop Jupyter Notebook.
Installing the FusionInsight Client¶
- Follow the FusionInsight product documentation to install the FusionInsight client for Linux into the /opt/hadoopclient directory.
Kerberos Authentication¶
- Authenticate with Kerberos as sparkuser (a human-machine user created in FusionInsight with Spark access permissions)::

    cd /opt/hadoopclient/
    source bigdata_env
    kinit sparkuser
Exporting IPython-related Environment Variables¶
- Run the following commands to export the environment variables, or append these two lines to the /opt/hadoopclient/bigdata_env file so that they are set automatically every time bigdata_env is sourced (a quick way to verify them is sketched below)::

    export PYSPARK_DRIVER_PYTHON="ipython"
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook --allow-root"
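With these variables set, the pyspark launcher starts IPython's notebook server as the driver process instead of the plain Python shell. As a quick optional check, a minimal sketch that inspects the variables from any notebook cell (the expected values assume the exports above)::

    # The notebook server and its kernels inherit the driver environment,
    # so both variables should be visible from a cell.
    import os
    print os.environ.get('PYSPARK_DRIVER_PYTHON')       # expected: ipython
    print os.environ.get('PYSPARK_DRIVER_PYTHON_OPTS')  # expected: notebook --allow-root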
Analyzing Data with pySpark in Jupyter Notebook¶
- Running pyspark now starts Jupyter Notebook automatically::

    [root@test01 opt]# pyspark
    [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated and will be removed in future versions.
    [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook` in the future
    [I 16:24:20.802 NotebookApp] The port 8888 is already in use, trying another port.
    [I 16:24:20.809 NotebookApp] Serving notebooks from local directory: /opt
    [I 16:24:20.809 NotebookApp] 0 active kernels
    [I 16:24:20.809 NotebookApp] The Jupyter Notebook is running at: http://172.21.33.121:8889/?token=a951f440e47d932b1782fd97383c3dc935d468799a3c36c6
    [I 16:24:20.809 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    [W 16:24:20.810 NotebookApp] No web browser found: could not locate runnable browser.
    [C 16:24:20.810 NotebookApp] Copy/paste this URL into your browser when you connect for the first time,
        to login with a token:
            http://172.21.33.121:8889/?token=a951f440e47d932b1782fd97383c3dc935d468799a3c36c6
- Open the URL above in a browser and you can start analyzing data; a quick sanity check and a full example follow.
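Before a full analysis, a minimal sanity check that the notebook really drives the cluster (a sketch assuming the sc SparkContext that the pyspark shell creates automatically)::

    # Runs a trivial job on the YARN executors; prints 4950 if the cluster is reachable.
    print sc.version
    print sc.parallelize(range(100)).sum()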
- Download the MovieLens 100k user data set and upload it to HDFS (the job runs on YARN, so sc.textFile resolves the relative path against the sparkuser HDFS home directory)::

    wget http://files.grouplens.org/datasets/movielens/ml-100k/u.user
    hdfs dfs -mkdir ml-100k
    hdfs dfs -put u.user ml-100k/

- In a notebook cell, profile the users and plot their age distribution (a further aggregation sketch follows the block)::

    %pylab inline

    # Each line of u.user has the form "id|age|gender|occupation|zipcode".
    user_data = sc.textFile("ml-100k/u.user")
    user_fields = user_data.map(lambda line: line.split("|"))
    num_users = user_fields.map(lambda fields: fields[0]).count()
    num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
    num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()
    num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()
    print "Users: %d, genders: %d, occupations: %d, ZIP codes: %d" % (num_users, num_genders, num_occupations, num_zipcodes)

    # Histogram of user ages, rendered inline by matplotlib.
    ages = user_fields.map(lambda x: int(x[1])).collect()
    hist(ages, bins=20, color='lightblue', normed=True)
    fig = matplotlib.pyplot.gcf()
    fig.set_size_inches(16, 10)
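The same user_fields RDD supports further aggregations; as one more illustrative sketch, the user count per occupation, continuing in the same notebook session::

    # Classic word-count pattern: (occupation, 1) pairs reduced by key.
    pairs = user_fields.map(lambda fields: (fields[3], 1))
    count_by_occupation = pairs.reduceByKey(lambda x, y: x + y).collect()
    x_axis = [c[0] for c in count_by_occupation]
    y_axis = [c[1] for c in count_by_occupation]

    # %pylab puts arange/bar/xticks into the notebook namespace.
    pos = arange(len(x_axis))
    bar(pos, y_axis, align='center')
    xticks(pos, x_axis, rotation=30)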
Analyzing Data with R in Jupyter Notebook¶
- Download the flights sample data set and upload it to the sparkuser HDFS home directory::

    wget http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv
    hdfs dfs -put flights.csv /user/sparkuser/

- Provided an R kernel (for example IRkernel) is available in Jupyter, initialize SparkR against YARN in a notebook cell and analyze the data::

    Sys.setenv(SPARK_HOME = "/opt/hadoopclient/Spark/spark")
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    library(SparkR)
    library(magrittr)

    # Start SparkR on YARN; the spark-csv package is needed to read CSV in Spark 1.x.
    sc <- sparkR.init(master = "yarn-client", sparkPackages = "com.databricks:spark-csv_2.10:1.2.0")
    sqlContext <- sparkRSQL.init(sc)

    # Load the CSV from HDFS into a SparkR DataFrame.
    flightsDF <- read.df(sqlContext, "/user/sparkuser/flights.csv", source = "com.databricks.spark.csv", header = "true")
    destDF <- select(flightsDF, "dest", "cancelled")

    # Average departure and arrival delay per day.
    groupBy(flightsDF, flightsDF$date) %>%
        summarize(avg(flightsDF$dep_delay), avg(flightsDF$arr_delay)) -> dailyDelayDF
    head(dailyDelayDF)