DART_QCEFF experiment

Telling story

what is quantile? what is quantile-conserving filter?

If we note CDF of a random variable \(X\) at value \(a\) as \(F^x(a)\), quantile as \(q\), then the quantile can be expressed as \(F^x(a)=q=Pr(X<=a)\), and the quantile function of q can be noted as \((F^x)^{-1}(q)=a\), which is noted as the inverse of CDF. 更直观一点,\(q=0.75\)表示在独立随机变量\(X\)上任意取一个数,在概率上有75\%的可能性小于这个quantile所代表的value \(a\), 也就是说把PDF从负无穷积分到a, 值是0.75, 也就是CDF=0.75, 也就是3/4分位数。此为一点基本概念。

Jeff_20221提出了QCEFF(quantile-conserving ensemble filter framework)的概念,but in a limitation of observed variable(也就是说,没有Forward operator and regression of increment), 在这种情况下,quantile conserving filter的posterior应该由下面的equ给出,或者满足下面的条件(Jeff_2022, equ(2))

\[ y_n^a=\left(F^a\right)^{-1}\left[F^p\left(y_n^p\right)\right], n=1, \ldots, N \]

这个equ想要保证的是 prior and posterior 的每一个ensemble member在各自的PDF/CDF中的分位数是一致的。实际上,DART中最常用的EAKF以及处理non-gaussian PDF的RHF(rank histogram filter)都是quantile conserving filter: EAKF可以简单证明,RHF则在构建PDF的时候就要求quantile uniformly distributed at(0,1)。请注意,上面所说的quantile conserving filter只是在observed variable这个条件限制下。Jeff的part2 about unobserved variable似乎还没publish, but Jeff talk about this in ISDA2023_Jan

QCEFF in unobserved variable

unobserved variable 需要将obs increment regress到state space上,这一步我们目前普遍用的是linear regression,这样的话即使在obs space上increment做的再好,在经过regression后依然会有问题(比如说dont respect bound and nonlinearity, see Jeff's slide)。 regression

Jeff试图通过probit transform来处理这个problem with linear regression of increments。直接拿Jeff ISDA2023_Jan slide来说明具体是如何处理的。

equation

我将用四个space(state, obs, state-probit, obs-probit)来描述每一步发生在什么地方。state/obs space是我们很熟悉的,两个probit space指的是我们将state/obs variable 进行probit transform之后的space。所谓的probit transform,可以理解为进行了一次quantile-conserving filter(准确的说是transform),只不过inverse CDF(quantile function)不是posterior的CDF, 而是一个N~(0,1)的normal CDF(about probit, see its wikipedia)

在DART code实现上,

  1. at state/obs space, turn prior state/obs space variable transform into probit space
  2. at obs space, get \(y_n^a\) from \(y_n^p\) and obs likelihood 这里需要注意,\(y_n^a\)是在obs space上计算的。也就是说,我们先常规地计算了一遍obs space上的increment,算出了\(y_n^a\),再进行下面的操作。code about this part can be refered in Appendix.
  3. at obs space, turn \(y_n^a\) into obs-probit space \(\tilde{y}_n^a\)
  4. at obs-probit space, calculate the increment from \(\tilde{y}_n^a, \tilde{y}_n^p\)
  5. at obs-probit space, regress increment into state-probit space.
  6. at state-probit space, do the increment.
  7. at state-probit space, turn values in state-probit space into state space

需要注意的是,此时我们做的实际上不满足quantile conserving的条件了, obviously, \(\Phi(\tilde{x}_n^a) \neq \Phi(\tilde{x}_n^p) = F_x^p(\tilde{x}_n^p)\) 不过相似的是posterior依然通过quantile进行计算。Jeff的slide里面也没有再提conserving,更强调的是probit-transformed

\[ x_n^a=\left(F_x^p\right)^{-1}\left[\Phi\left(\tilde{x}_n^a\right)\right] \]

关于probit transform,有必要再多说两句。如果state/obs space上的PDF估计的足够好,那么quantile分布应该是比较接近uniform dist U(0,1)的,那么经过probit transform之后distribution就会比较接近N(0,1)。对于quantile exactly ~ U(0,1), for example RHF, then in probit space it will exactly ~N(0,1), 此时linear regression就是最优线性无偏估计(BLUE),并且通过quantile function可以保证posterior不会出现在prior PDF=0的地方,这样开头提到的regression的问题就基本解决了。此外,Grooms_20222比较了gaussian anamorphose(高斯畸变) EnKF(GA-EnKF)and two-step bayesian update filter (like RHF in DART)作为两个处理nongaussian, nonlinear的filter在L96中的表现。原文abstract对于gaussian anamorphose的描述如下:

Info

GA-EnKF methods apply univariate transforms to the state and observation variables to make their distribution more Gaussian before applying an EnKF

that is somehow like what we do in probit transform。Grooms是U. Colorado应用数学系的AP,他在acknowledgement中还感谢了Jeff,说不定probit transform的idea跟他也有关系……

update: Ian和Jeff确实是紧密的合作者。Ian每周四都会在Jeff的Lab里面办公。

quantile regression

下面是我自己的一些想法了……probit transform非常好地解决了bound问题,并且一定程度上也处理了nonlinear regression的问题(linear regression at probit space,after transform back it will show the nonlinearity),但是实际上increment依然做的是linear regression。有没有可能替换掉regression中的linear,让regression本身就直接尊重分布的之间的nonlinearity而不是通过space transform来体现nonlinearity呢?

或许quantile regression可以做到?下面简单介绍一下quantile regression:

我们知道均值\(\mu\)可以表示成下式的解 $$ \min_\mu f=\sum_i^n\left(y_i-\mu\right)^2 $$ 同样的,中位数\(M\)也可以表示成下式的解(具体证明略) $$ \min_M f=\sum_i^n\left|y_i-M\right| $$

\(X,Y\)之间线性回归实际上就是求关于自变量\(X\)的系数为\(\beta\)的线性方程\(\mu\left(x_i, \beta\right)\),使得有

\[ \min _\beta \sum_i^n\left(y_i-\mu\left(x_i, \beta\right)\right)^2 \]

这个\(\mu\left(x_i, \beta\right)\)就是因变量\(Y\)的条件期望方程,也即是最小二乘法,也即是linear regression。

同样推之,我们令中位数\(M\)变为自变量的线性方程\(\xi(x_i, \beta)\),那么最优问题就转化成下式:

\[ \min _\beta \sum_i^n\left|y_i-\xi\left(x_i, \beta\right)\right| \]

中位数实际上就是一个特殊的quantile(分位数),我们可以把这个最优化问题它推广到任意一个quantile上。具体来说,对于一个quantile=q,在q左侧的样本点我们给予1-q的权重;在q右侧的样本点给予q的权重,问题就转化为求这样一个最优化问题。

\[ \min _{\beta_q} \sum_{i: y_i \geq \xi\left(x_i, \beta_q\right)} q\left(y_i-\xi\left(x_i, \beta_q\right)\right)+\sum_{i: y_i<\xi\left(x_i, \beta_q\right)}(1-q)\left(\xi\left(x_i, \beta_q\right)-y_i\right) \]

这样的结果就是我们对于处在分布中不同quantile处的数据,可以有不同的increment regression。尽管这个regression在特定的quantile依然展现为linear regression,但是在整个分布上可以说是一种nonlinear regression。一个简单的例图展示quantile regression的效果,虚线是10个分位数回归的回归方程:

quantile_regression

在DART中实现这样一个quantile regression脑补一下似乎不会特别困难,但是会比较显著地增加regression part的计算量,并且这个regression part对于每个obs-state pair都要做一遍,或许对于large model并不适合……总之,我目前还没有搜到quantile regression在DA algorithm上的应用,这或许是值得考虑的一个idea?上图以及关于quantile regression的推导来自石川的知乎专栏(难得在知乎搜到了有价值的信息~)

分位数回归损失函数代码实现解析 - Aidan_Lee - 博客园 (cnblogs.com)

Warning

最小二乘有高斯-马尔可夫肯定其合法性,那么这个quantile regression的loss function能不能找到其在统计上的合法性?

DART_QCEFF experiment design

3 filters: main_eakf, main_rhf, quantile_bnrhf

bnrhf: bounded normal rank histogram,如果我code理解正确的话就是一个两边可以有bound,两边没有bound的话是normal tail的rank histogram filter

2 kind of observation: allsky, cloudysky(可能后面会丢掉cloudsky obs)

d01_7.5km: 0030, 0100, 2 sets.

d02_1.5km: 0030-0100, 10min inverval, 4 sets.

d03_300m: 0030-0100, 5min inverval, 7 sets. Together 13 sets

Others: no inflation; localization scale is similar with linfan (cutoff=0.005 for 7.5km, 0.001 for 1.5km, 300m现在还没有进行测试,以及前面可能的还需要多试几个cutoff value调整一下,不过会保证三个filter之间cutoff一致)

The output/data directory should named as $DIR=f"{expt}_{domain}_{time}_{obs}".

Important

关于{obs}: 我在d01 0030 尝试了allsky and cloudy两种observation。尝试cloudy obs的原因是cloudy area的prior dist有更明显的non-gaussian (Larger KLD),而在cloudy obs影响的区域中,这种non-gaussian区域的占比显然更大。后续不一定会继续用。可能有些太多余了。

difference between main and quantile_method branch

see the difference at github

assimilation_code/modules/assimilation/assim_tools_mod.f90 assimilation_code/modules/assimilation/filter_mod.f90 assimilation_code/modules/assimilation/algorithm_info_mod.f90 assimilation_code/modules/assimilation/probit_transform_mod.f90

what i have changed?

just algorithm_info_mod.f90, and a minor change in rh_distribution.f90

如何重复quantile-conserving filter 的实验?

bashrc in lililei4: dart_rttov

quantile version of DART: /share/home/lililei4/haoxing/DART_quantile

main version of DART: /share/home/lililei4/haoxing/DART_main

Raw data在哪里?

在lililei4上,raw data全部在 ~/haoxing/quantile_data/ 中。llleistu200上同样有备份。

var_cut*: 关于domain格点的lat/lon信息,用于确定OSSE观测的位置。

$DIR: 主要的raw data。关于OSSE obs/FG/POSTERIOR都放在这里。里面的README记录了$DIR文件夹里内容的基本信息。

Tip

$DIR 的定义如下:f"{domain}_{time}" 一个$DIR里至少应该有:

FG -> firstguess dir

obs_{domain}_ch1025_totalline_withpert: text OBSERVATION

obs_seq.out_{domain}_{time}: DART OBSERVATION

prior_BT: external FO data (gengerated by RTTOVv11.2)

diagfile/: 把data从text转成diag再转成DART observation_converter需要的text......see more in README。

scripts/: 一些用于处理observation的脚本,以及一个基本的matlab画图脚本for d01(7.5km)

from linfan, raw data在哪里?

Ens:/ldata3/llleistu/linfan/Ensemble/

d01_7.5km/ldata3/llleistu/linfan/Ensemble/d01_7.5km/ens_20210608/rst0400_50mem_040000_040230/1-50 是1-50个member 不同时刻,如果只要几个时刻,在上层文件夹里,有04_00_30 04_01_00 04_01_30 。。。

d02_1.5km/2rst0400_50mem_1.5km/1-50

d03_300m/rst0400_50mem_300m/1-50

/share/home/lililei1/lfzhou/hyperspectral_da/step1_obs_ensBT/step2_les_obs/4obsdir_d01 -d04 有不同分辨率的obs hx,加了withpert就是添加了扰动的

/share/home/lililei1/lfzhou/scripts/bt_vars_corr/1nonlinear_criterion/d01/Ca_judgment: 这里有云检测的东西。

from kecheng, input/namelist/driver在哪里?

lililei1@202.119.37.252:/share/home/lililei1/kcfu/models/EnOI/fg_dir这个是 input

/share/home/lililei1/kcfu/models/EnOI/4wrf_advance 这个是namelist

/share/home/lililei1/kcfu/models/EnOI/EachTime_EnOI.sh 这是提交任务的script

Create DART obs_seq.out

see here: /share/home/lililei4/haoxing/DART_quantile/observations/obs_converters/text_giirs/

制作符合DART_RTTOV要求的观测并且使用external FO(也就是说在外面调用RTTOV,直接把RTTOV输出的BT值写进obs_seq.out,并且让DART_RTTOV不要再在filter里面调用FO来算prior,直接使用obs_seq.out里面我们给定的值作为prior)会有点麻烦。

Note

haoxing: create_3d_obs will set obs_def for this obs, but we want more, like(has_external_FO) so better do it ourselves.

haoxing: need to tell obs_def we use external FO. refer to GSI2DART convertor.

具体这个text_to_obs是如何运作的,请参见text_giirs/text_to_obs.f90

Warning

新定义了一个observation type: FY4_1GIIRS_TB

在observation/forward_operator/obs_def_rtttov_mod.f90 中可以看到

FY4_1_GIIRS_TB, QTY_BRIGHTNESS_TEMPERATURE

在迁移到别的DART version时记得需要修改obs_def

How to use external RTTOV as FO?

主要参见两个script,在/share/home/lililei4/haoxing/quantile_data/scripts

driver_matlab_prof.sh: 从wrfout生成profile data。需要wrfout中有给定的变量。

Warning

需要在DART input.nml &model_nml中列出所有driver_matlab_prof.sh需要的变量,因为posterior根据wrf_state_variables输出并存储变量(如果不使用obs_impact来约束的话,所有相关的variable都会被观测更新)。

一个范例:

&model_nml
default_state_variables = .false.,
wrf_state_variables     =      'U','QTY_U_WIND_COMPONENT','TYPE_U','UPDATE','999',
                                'V','QTY_V_WIND_COMPONENT','TYPE_V','UPDATE','999',
                                'U10','QTY_10M_U_WIND_COMPONENT','TYPE_U10','UPDATE','999',
                                'V10','QTY_10M_V_WIND_COMPONENT','TYPE_V10','UPDATE','999',
                                'W','QTY_VERTICAL_VELOCITY','TYPE_W','UPDATE','999',
                                'T','QTY_POTENTIAL_TEMPERATURE','TYPE_T','UPDATE','999',
                                'T2','QTY_2M_TEMPERATURE','TYPE_T2','UPDATE','999',
                                'TH2','QTY_POTENTIAL_TEMPERATURE','TYPE_TH2','UPDATE','999',
                                'TSK','QTY_SKIN_TEMPERATURE','TYPE_TSK','UPDATE','999',
                                'P','QTY_PRESSURE','TYPE_P','UPDATE','999',
                                'PB','QTY_PRESSURE','TYPE_PB','UPDATE','999',
                                'PH','QTY_GEOPOTENTIAL_HEIGHT','TYPE_GZ','UPDATE','999',
                                'MU','QTY_PRESSURE','TYPE_MU','UPDATE','999',
                                'PSFC','QTY_SURFACE_PRESSURE','TYPE_PS','UPDATE','999',
                                'QVAPOR','QTY_VAPOR_MIXING_RATIO','TYPE_QV','UPDATE','999',
                                'QCLOUD','QTY_CLOUDWATER_MIXING_RATIO','TYPE_QC','UPDATE','999',
                                'QRAIN','QTY_RAINWATER_MIXING_RATIO','TYPE_QR','UPDATE','999',
                                'QSNOW','QTY_SNOW_MIXING_RATIO','TYPE_QS','UPDATE','999',
                                'QICE','QTY_ICE_MIXING_RATIO','TYPE_QI','UPDATE','999',
                                'QGRAUP','QTY_GRAUPEL_MIXING_RATIO','TYPE_QG','UPDATE','999',
                                'Q2','QTY_2M_SPECIFIC_HUMIDITY','TYPE_Q2','UPDATE','999',
                                'CLDFRA','QTY_CLOUD_FRACTION','TYPE_CFRA','UPDATE','999',

driver_rttov.sh: 用于drive RTTOV,根据matlab生成的prof data来计算BT。

merged_FO_data是什么?

&text_to_obs_nml
    text_input_file = '../data/giirs_BT'
    FO_input_file = '../data/merged_FO_data.txt'
    obs_out_file    = 'obs_seq.out_allsky'
    is_prior = .true.
    debug           = .true.
  /

在text_to_obs的nml中我添加了一项FO_input_file,这是用来读取上述external FO data的。RTTOV输出了$ENS_SIZE的FO data file,为了读取方便需要把他们merge到一个file中,再指定FO_input_file,从而把data读取到obs_seq.out的external_FO 定义中。

! haoxing : open input ensemble FO file
iFOunit = open_file(FO_input_file, 'formatted', 'read')
if (debug) print *, 'opened input file ' // trim(FO_input_file)

read(iFOunit, "(A)", iostat=rcio) FO_input_line
read(FO_input_line, *, iostat=rcio) FO_copy

if (is_prior) then 
    call set_obs_def_external_FO(obs_def, has_external, write_external,key,ens_size,real(FO_copy,r8)) ! CSS sets obs_def%has_exteral_FO 
endif

关于如何merge成一个file,参见/scripts/merge_FO_in_one_file.py

About observation

首先要明确一下,synthetic obs用的是point-like method,not average,无论是linfan还是haoxing

linfan method:

  1. 找到d03/d02/d01 grid points在d04上对应的grid index
  2. call rttov based on d04 profile,generate synthetic obs(without perturbation)
  3. for each domain, add perturbation.

haoxing method

  1. call rttov for d04 grid points.
  2. add perturbation for d04.
  3. 找到d03/d02/d01 grid points在d04上对应的grid index, 直接在obs_d04_withpert上slice出d03/d02/d01的obs

linfan的方法在不同的domain都独立添加了perturbation,这会导致synthetic obs即使在相同的lat/lon,domain不同实际的数值也不同。haoxing method可以保证lat/lon相同,对应的obs在所有domain都一致,这样就可以进行不同domain之间OMB的对比了。

Info

slice validation在llleistu@114.212.48.200:/share/home/lililei4/haoxing/quantile_data/OBS_reduce_fromd04/valid_archive/ ,这个validation可以说明在没有加perturbation的情况下,slice profile->call rttov(linfan method)与直接slice d04 observation结果是完全一致的。

Forecast

040100 rst是全的。

Appendix

about code block for probit transform

   ! Compute observation space increments for each group
   do group = 1, num_groups
      grp_bot = grp_beg(group); grp_top = grp_end(group)
      call obs_increment(obs_prior(grp_bot:grp_top), grp_size, obs(1), &
         obs_err_var, base_obs_kind, obs_inc(grp_bot:grp_top), inflate, my_inflate,   &
         my_inflate_sd, net_a(group))
      obs_post(grp_bot:grp_top) = obs_prior(grp_bot:grp_top) + obs_inc(grp_bot:grp_top)

      ! Convert both the prior and posterior to probit space (efficiency for prior???)
      ! Running probit space with groups needs to be studied more carefully
      ! EFFICIENCY NOTE: FOR RHF, THE OBS_INCREMENT HAS TO DO A SORT
      ! THE POSTERIOR WOULD HAVE THE SAME RANK STATISTICS, SO THIS SORT WOULD BE THE SAME
      ! THE SECOND CONVERT_TO_PROBIT CAN BE MUCH MORE EFFICIENT USING A SORT
      ! SHOULD FIGURE OUT A WAY TO PASS THE SORT ORDER
      ! NOTE 2: THIS CONVERSION IS USING THE INFO FROM THE CURRENT (UPDATED) PRIOR ENSEMBLE. THIS
      ! IS GENERALLY GOING TO BE A DIFFERENT PROBIT TRANSFORMED ENSEMBLE THAN THE ONE THAT WAS JUST
      ! CONVERTED FROM PROBIT SPACE BY THE PROCESS THAT OWNS THIS OBSERVATION. 

      ! Need to specify what kind of prior to use for obs being assimilated
      call probit_dist_info(base_obs_kind, .false., .false., dist_for_obs, &
         bounded_below, bounded_above, lower_bound, upper_bound)

      ! Convert the prior and posterior for this observation to probit space
      call transform_to_probit(grp_size, obs_prior(grp_bot:grp_top), dist_for_obs, &
         temp_dist_params, probit_obs_prior(grp_bot:grp_top), .false., &
         bounded_below, bounded_above, lower_bound, upper_bound)
      call transform_to_probit(grp_size, obs_post(grp_bot:grp_top), dist_for_obs, &
         temp_dist_params, probit_obs_post(grp_bot:grp_top), .true., &
         bounded_below, bounded_above, lower_bound, upper_bound)
      ! Free up the storage used for this transform
      call deallocate_distribution_params(temp_dist_params)

      ! Copy back into original storage
      obs_prior(grp_bot:grp_top) = probit_obs_prior(grp_bot:grp_top)
      obs_post(grp_bot:grp_top) = probit_obs_post(grp_bot:grp_top)
      ! Recompute obs_inc in probit space
      obs_inc(grp_bot:grp_top) = obs_post(grp_bot:grp_top) - obs_prior(grp_bot:grp_top)

      ! Also compute prior mean and variance of obs for efficiency here
      obs_prior_mean(group) = sum(obs_prior(grp_bot:grp_top)) / grp_size
      obs_prior_var(group) = sum((obs_prior(grp_bot:grp_top) - obs_prior_mean(group))**2) / &
         (grp_size - 1)
      if (obs_prior_var(group) < 0.0_r8) obs_prior_var(group) = 0.0_r8
   end do

   ! Compute updated values for single state space inflation
   if(local_single_ss_inflate) then
      ! Update for each group separately
      do group = 1, num_groups
         call update_single_state_space_inflation(inflate, my_inflate, my_inflate_sd, &
            ens_handle%copies(ENS_SD_COPY, 1), orig_obs_prior_mean(group), &
            orig_obs_prior_var(group), obs(1), obs_err_var, grp_size, inflate_only)
      end do
   endif


  1. J. L. Anderson. A quantile-conserving ensemble filter framework. part i: updating an observed variable. Monthly Weather Review, 150(5):1061–1074, 2022. URL: <Go to ISI>://WOS:000807973600006, doi:10.1175/mwr-d-21-0229.1

  2. Ian Grooms. A comparison of nonlinear extensions to the ensemble kalman filter. Computational Geosciences, 26(3):633–650, 2022. URL: https://dx.doi.org/10.1007/s10596-022-10141-x, doi:10.1007/s10596-022-10141-x