Connecting Jupyter Notebook to FusionInsight

Applicable Scenarios

Jupyter Notebook 2.4.4.0 ↔ FusionInsight HD V100R002C70SPC200 (PySpark)

Installing Jupyter Notebook

Jupyter Notebook depends on Python and pulls in many tool dependencies with interlocking version requirements, so installing it by hand is cumbersome. The usual shortcut is to install the Anaconda distribution, which bundles Python, Jupyter Notebook, and a large set of scientific packages. Here we install Anaconda2 directly; Anaconda2 ships Python 2.7, which matches the Python 2 syntax used in the examples below.

  • Download Anaconda2-4.4.0 from the Anaconda repository and install it

    wget https://repo.continuum.io/archive/Anaconda2-4.4.0-Linux-x86_64.sh
    bash Anaconda2-4.4.0-Linux-x86_64.sh
    

  • Generate the Jupyter Notebook configuration file

    jupyter notebook --generate-config --allow-root
    

  • In the Jupyter Notebook configuration file, set c.NotebookApp.ip to the local IP address

    vi /root/.jupyter/jupyter_notebook_config.py
    

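  • For reference, the changed line in /root/.jupyter/jupyter_notebook_config.py looks like the following sketch; the address is the example host used in this guide and should be replaced with your own

    c.NotebookApp.ip = '172.21.33.122'
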
  • Start Jupyter Notebook

    jupyter notebook --allow-root
    

  • Output like the following indicates that Jupyter Notebook started successfully

    [I 15:53:46.918 NotebookApp] Serving notebooks from local directory: /opt
    [I 15:53:46.918 NotebookApp] 0 active kernels
    [I 15:53:46.918 NotebookApp] The Jupyter Notebook is running at: http://172.21.33.122:8888/?token=f0494a2274cba1a6098ef21c417af2f3c49df872c6b34938
    [I 15:53:46.918 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    [W 15:53:46.919 NotebookApp] No web browser found: could not locate runnable browser.
    [C 15:53:46.919 NotebookApp]
    
        Copy/paste this URL into your browser when you connect for the first time,
        to login with a token:
            http://172.21.33.122:8888/?token=f0494a2274cba1a6098ef21c417af2f3c49df872c6b34938
    

  • Press Ctrl+C to stop Jupyter Notebook

Installing the FusionInsight Client

  • Follow the FusionInsight product documentation to install the FusionInsight client for Linux into the /opt/hadoopclient directory

Completing Kerberos Authentication

  • Authenticate with Kerberos as sparkuser (a human-machine user created in FusionInsight with Spark access permissions)

    cd /opt/hadoopclient/
    source bigdata_env
    kinit sparkuser
    klist    # optional: confirm that a ticket was granted
    

Exporting the IPython Environment Variables

  • Run the following commands to export the environment variables, or append the two lines to /opt/hadoopclient/bigdata_env so that they are exported automatically whenever bigdata_env is sourced

    export PYSPARK_DRIVER_PYTHON="ipython"
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook --allow-root"
    

Using PySpark for Analysis in Jupyter Notebook

  • With the variables above set, running pyspark starts Jupyter Notebook automatically

    [root@test01 opt]# pyspark
    [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated and will be removed in future versions.
    [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook` in the future
    [I 16:24:20.802 NotebookApp] The port 8888 is already in use, trying another port.
    [I 16:24:20.809 NotebookApp] Serving notebooks from local directory: /opt
    [I 16:24:20.809 NotebookApp] 0 active kernels
    [I 16:24:20.809 NotebookApp] The Jupyter Notebook is running at: http://172.21.33.121:8889/?token=a951f440e47d932b1782fd97383c3dc935d468799a3c36c6
    [I 16:24:20.809 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    [W 16:24:20.810 NotebookApp] No web browser found: could not locate runnable browser.
    [C 16:24:20.810 NotebookApp]
    
        Copy/paste this URL into your browser when you connect for the first time,
        to login with a token:
            http://172.21.33.121:8889/?token=a951f440e47d932b1782fd97383c3dc935d468799a3c36c6
    

  • Open the URL above in a browser; data analysis can now be done in notebook cells, as shown below

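  • Before loading data, it helps to sanity-check that notebook cells actually drive the cluster; a minimal sketch, assuming the SparkContext sc that pyspark pre-creates for the session

    print sc.version                           # Spark version reported by the context
    print sc.parallelize(range(100)).sum()     # trivial job on YARN; prints 4950
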
  • Download the MovieLens u.user sample data and upload it to HDFS; in yarn-client mode, sc.textFile resolves the relative path ml-100k/u.user against sparkuser's HDFS home directory, so the file must be on HDFS first

    wget http://files.grouplens.org/datasets/movielens/ml-100k/u.user
    hdfs dfs -mkdir ml-100k
    hdfs dfs -put u.user ml-100k/

  • Run the following in a notebook cell to count users, genders, occupations, and ZIP codes, then plot the age distribution

    %pylab inline
    # load the pipe-delimited user records and split them into fields
    user_data = sc.textFile("ml-100k/u.user")
    user_fields = user_data.map(lambda line: line.split("|"))
    # count users and the distinct values of each demographic field
    num_users = user_fields.map(lambda fields: fields[0]).count()
    num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
    num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()
    num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()
    print "Users: %d, genders: %d, occupations: %d, ZIP codes: %d" % (num_users, num_genders, num_occupations, num_zipcodes)

    # plot a normalized histogram of user ages, enlarged to 16x10 inches
    ages = user_fields.map(lambda x: int(x[1])).collect()
    hist(ages, bins=20, color='lightblue', normed=True)
    fig = matplotlib.pyplot.gcf()
    fig.set_size_inches(16, 10)
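
  • The parsed user_fields RDD supports further breakdowns along the same lines; a small sketch counting users per occupation

    # count users per occupation and show the five most common
    occupation_counts = user_fields.map(lambda fields: (fields[3], 1)).reduceByKey(lambda a, b: a + b)
    print occupation_counts.takeOrdered(5, key=lambda kv: -kv[1])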

Using R for Analysis in Jupyter Notebook

  • Running SparkR from a notebook requires an R kernel (e.g. IRkernel), whose installation is not covered here. Download the flights sample data and upload it to HDFS, where the SparkR example below reads it

    wget http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv
    hdfs dfs -put flights.csv /user/sparkuser/

  • In an R notebook cell, initialize SparkR from the FusionInsight client installation and analyze the flight data

    Sys.setenv(SPARK_HOME = "/opt/hadoopclient/Spark/spark")
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    library(SparkR)
    library(magrittr)
    # start SparkR on YARN; the spark-csv package coordinate is groupId:artifactId:version
    sc <- sparkR.init(master = "yarn-client", sparkPackages = "com.databricks:spark-csv_2.10:1.2.0")
    sqlContext <- sparkRSQL.init(sc)
    # read the CSV from HDFS, treating the first row as a header
    flightsDF <- read.df(sqlContext, "/user/sparkuser/flights.csv", source = "com.databricks.spark.csv", header = "true")
    # keep only the destination and cancellation columns
    destDF <- select(flightsDF, "dest", "cancelled")
    # average departure and arrival delay per day
    groupBy(flightsDF, flightsDF$date) %>%
        summarize(avg(flightsDF$dep_delay), avg(flightsDF$arr_delay)) -> dailyDelayDF
    head(dailyDelayDF)
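
  • head(dailyDelayDF) triggers the computation and returns only the first rows; to plot the full result in the notebook, first pull it into a local data.frame with SparkR's collect(dailyDelayDF).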