[Deep RL Course] Q-Learning Hands-on (1) - Frozen Lake

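This post summarizes what I learned from the Deep RL Course.

Creating and Understanding the FrozenLake Environment ⛄

FrozenLake is a reinforcement learning environment in which the agent walks across frozen tiles (F) from the start (S) to the goal (G) while avoiding the holes (H).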

Basic FrozenLake Settings

1. Map size

FrozenLake provides two built-in map sizes.

map_name = "4x4": a 4 x 4 grid
map_name = "8x8": an 8 x 8 grid

2. Environment mode

FrozenLake supports both a deterministic and a stochastic mode.

is_slippery = False: a non-slippery environment in which the agent always moves in the intended direction.
is_slippery = True: a slippery environment in which the agent may slip and move in a direction other than the intended one.
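
As a rough illustration of what "slippery" means, the sketch below lists the directions the agent can actually end up moving in for a given intended action. It assumes the default FrozenLake-v1 behaviour as I understand it (the intended direction or one of the two perpendicular directions, each with probability 1/3) and the action encoding 0 = left, 1 = down, 2 = right, 3 = up. The helper name possible_moves is hypothetical and not part of Gym.

```python
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}

def possible_moves(action, is_slippery):
    """Return (actual_action, probability) pairs for an intended action.

    Assumption: on a slippery map the agent moves in the intended direction
    or in one of the two perpendicular directions, each with probability 1/3.
    """
    if not is_slippery:
        return [(action, 1.0)]
    # Neighbours in the 0..3 encoding are the perpendicular directions.
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return [(a, 1.0 / 3.0) for a in candidates]

# Intended action: right (2)
for actual, prob in possible_moves(2, is_slippery=True):
    print(f"{ACTION_NAMES[actual]}: {prob:.2f}")
```

Under this assumption, an agent that intends to move right can also slide down or up, but never directly backwards to the left.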

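3. Rendering option

render_mode = "rgb_array": returns the current state of the environment as an RGB image, represented as a numpy array of shape (x, y, 3).

Creating the FrozenLake Environment

We create a 4 x 4 non-slippery environment and set the render mode to "rgb_array".

import gymnasium as gym  # assumed import (not shown in the original post); gym >= 0.26 exposes the same API

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False, render_mode="rgb_array")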

💡 Tip: You can also define a custom map.

desc=["SFFF", "FHFH", "FFFH", "HFFG"]
gym.make('FrozenLake-v1', desc=desc, is_slippery=True)
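
To see how such a map description translates into state numbers, here is a small sketch in plain Python. It assumes states are numbered row by row starting from 0; the helper tiles_of_type is a hypothetical name used only for illustration.

```python
desc = ["SFFF", "FHFH", "FFFH", "HFFG"]

def tiles_of_type(desc, tile):
    """Return the state indices of all tiles with the given letter."""
    n_cols = len(desc[0])
    return [
        row_idx * n_cols + col_idx
        for row_idx, row in enumerate(desc)
        for col_idx, letter in enumerate(row)
        if letter == tile
    ]

print("start:", tiles_of_type(desc, "S"))  # [0]
print("holes:", tiles_of_type(desc, "H"))  # [5, 7, 11, 12]
print("goal: ", tiles_of_type(desc, "G"))  # [15]
```

The hole and goal states are terminal, so the agent never acts from them.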

Exploring the FrozenLake Environment

1. Observation Space

The observation in FrozenLake is a single integer that encodes the agent's current position, with rows and columns counted from 0.

$$\text{current state} = \text{current row} \times \text{n cols} + \text{current col}$$

For example, on the 4 x 4 map the goal (G) is at state $3 \times 4 + 3 = 15$.
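
The conversion in both directions can be sketched in a few lines of Python. The function names to_state and to_row_col are hypothetical helpers, not part of Gym.

```python
def to_state(row, col, n_cols):
    """Encode a zero-indexed (row, col) position as a state index."""
    return row * n_cols + col

def to_row_col(state, n_cols):
    """Decode a state index back into a zero-indexed (row, col) position."""
    return divmod(state, n_cols)

print(to_state(3, 3, n_cols=4))   # 15, the goal tile on the 4 x 4 map
print(to_row_col(6, n_cols=4))    # (1, 2)
```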

print("_____OBSERVATION SPACE_____ \n")
print("Observation Space", env.observation_space)
print("Sample observation", env.observation_space.sample())

_____OBSERVATION SPACE_____ 

Observation Space Discrete(16)
Sample observation 1

2. Action Space

In FrozenLake the agent can take four actions: 0 (left), 1 (down), 2 (right), and 3 (up).

print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample()) # Take a random action

 _____ACTION SPACE_____ 

Action Space Shape 4
Action Space Sample 3

3. Reward Function

Reaching the goal (G): +1
Falling into a hole (H): 0
Stepping on a frozen tile (F): 0

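Creating and Initializing the Q-Table

Checking the Sizes of the State and Action Spaces

To size the table we need the number of states and the number of actions, both of which depend on the environment. Gym exposes this information at runtime, so the same Q-Learning code can be applied to different environments without hard-coding the sizes.

In Gym, the sizes of the state and action spaces are available through the following attributes.

env.observation_space.n: the number of possible states in the environment
env.action_space.n: the number of actions available to the agent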

state_space = env.observation_space.n
print("There are ", state_space, " possible states")

action_space = env.action_space.n
print("There are ", action_space, " possible actions")

There are  16  possible states
There are  4  possible actions

Initializing the Q-Table

The Q-table is a 2-D array that stores a Q-value for every state-action pair, with all entries initially set to 0. Its shape is (state_space, action_space), and we create it with np.zeros().

import numpy as np

def initialize_q_table(state_space, action_space):
  Qtable = np.zeros(shape=(state_space, action_space))
  return Qtable

Qtable_frozenlake = initialize_q_table(state_space, action_space)

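Defining the Policies

Q-Learning is an off-policy algorithm: the policy the agent uses to select actions differs from the policy used to update the value function. This lets the agent keep exploring the environment while still learning the value of the best action in each state.

Q-Learning uses the following two policies.

$\epsilon$-Greedy Policy: used during training to balance exploration and exploitation.
Greedy Policy: used when updating Q-values during training and as the final policy after training; it always selects the action with the highest Q-value.

Greedy Policy

Given a state, the greedy policy returns the action with the highest Q-value in the Q-table. It only exploits, and it serves as the final policy once training is complete.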

def greedy_policy(Qtable, state):
  action = np.argmax(Qtable[state][:])
  
  return action

$\epsilon$-Greedy Policy

The $\epsilon$-greedy policy mixes exploration and exploitation during training. With probability $\epsilon$ it picks a random action to explore new states, and with probability $1 - \epsilon$ it picks the action with the highest Q-value in the Q-table.

import random

def epsilon_greedy_policy(Qtable, state, epsilon):
  random_num = random.uniform(0,1)

  if random_num > epsilon:
    action = greedy_policy(Qtable, state)
  else:
    action = env.action_space.sample()

  return action
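
One detail worth spelling out: the random branch can also happen to pick the greedy action, so the greedy action is chosen slightly more often than $1 - \epsilon$. A minimal sketch of that arithmetic, assuming a single best action and uniform random sampling (the function name is hypothetical):

```python
def greedy_action_probability(epsilon, n_actions):
    """Probability that epsilon-greedy ends up taking the greedy action.

    Assumes exactly one greedy action and a uniform random exploration step.
    """
    return (1.0 - epsilon) + epsilon / n_actions

print(greedy_action_probability(1.0, 4))   # 0.25: fully random
print(greedy_action_probability(0.05, 4))  # 0.9625
```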

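Defining the Hyperparameters

In reinforcement learning, hyperparameters strongly affect how well the agent learns.

The exploration-related hyperparameters in particular must be set so that the agent explores the state space sufficiently and learns accurate values. With too little exploration (for example, if $\epsilon$ decays too quickly), the agent may not visit enough of the state space and may fail to learn the optimal policy.

Training hyperparameters

n_training_episodes: total number of training episodes. More episodes give the agent more experience to learn from.
learning_rate: the learning rate $\alpha$ controls how much weight newly received information gets when a Q-value is updated. A higher value weights the latest information more; a lower value weights the existing estimate more.

Evaluation hyperparameters

n_eval_episodes: number of test episodes used to evaluate the agent after training

Environment hyperparameters

env_id: ID of the environment in use
max_steps: maximum number of steps the agent may take in each episode
gamma: the discount factor $\gamma$ determines how much future rewards matter. Values close to 1 weight future rewards more heavily; smaller values focus on immediate rewards.
eval_seed: seed values used to reset the evaluation environment so that results are reproducible under identical conditions

Exploration hyperparameters

max_epsilon: exploration probability at the start of training. A high initial value lets the agent explore new states sufficiently early on.
min_epsilon: minimum exploration probability late in training, so that the agent shifts from exploration to exploitation as training progresses.
decay_rate: controls how fast $\epsilon$ decreases. If it is too large, exploration shrinks quickly and the agent may not explore enough early in training.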

n_training_episodes = 10000
learning_rate = 0.7

n_eval_episodes = 100

env_id = "FrozenLake-v1"    
max_steps = 99              
gamma = 0.95                 
eval_seed = []               

max_epsilon = 1.0         
min_epsilon = 0.05         
decay_rate = 0.0005
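
To get a feel for these exploration values, the sketch below evaluates the exponential decay schedule that the train function later in this post uses, at a few episode numbers. The helper name epsilon_at is hypothetical; the defaults mirror the hyperparameters above.

```python
import math

def epsilon_at(episode, min_epsilon=0.05, max_epsilon=1.0, decay_rate=0.0005):
    """Exploration probability at a given episode under exponential decay."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)

for episode in (0, 1000, 5000, 10000):
    print(episode, round(epsilon_at(episode), 3))
```

With these settings $\epsilon$ starts at 1.0, is still roughly 0.63 after 1,000 episodes, and settles near 0.06 by episode 10,000, so the agent explores heavily for the first few thousand episodes.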

Building the Q-Learning Training Loop

We implement a training loop that updates the Q-values.

For episode in the total of training episodes:
  Reduce epsilon (since we need less and less exploration)
  Reset the environment

  For step in max timesteps:
    Choose the action a using the epsilon-greedy policy
    Take the action (a) and observe the outcome state (s') and reward (r)
    Update the Q-value using the Bellman equation: Q(s,a) <- Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
    If done, finish the episode
    Our next state is the new state

The Q-Learning training function below implements these steps.

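from tqdm import tqdm  # assumed import for the progress bar (not shown in the original post)

# Note: learning_rate and gamma are read from the global hyperparameters defined above.
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):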
  for episode in tqdm(range(n_training_episodes)):
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
    state, info = env.reset()
    step = 0
    terminated = False
    truncated = False

    for step in range(max_steps):
      action = epsilon_greedy_policy(Qtable, state, epsilon)

      new_state, reward, terminated, truncated, info = env.step(action)

      Qtable[state][action] = Qtable[state][action] + learning_rate * (reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action])

      if terminated or truncated:
        break

      state = new_state
  return Qtable
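
To make the update rule concrete, here is a minimal single-step sketch in plain Python, using the same learning rate (0.7) and discount factor (0.95) as this post. The function name q_update is hypothetical. It repeatedly updates the Q-value of the move that steps onto the goal (reward 1, terminal next state, so the max over next Q-values is 0).

```python
def q_update(q_sa, reward, max_q_next, learning_rate=0.7, gamma=0.95):
    """One Q-Learning update for a single state-action pair."""
    td_target = reward + gamma * max_q_next
    return q_sa + learning_rate * (td_target - q_sa)

q = 0.0
for visit in range(1, 4):
    q = q_update(q, reward=1.0, max_q_next=0.0)
    print(visit, round(q, 3))  # 0.7, then 0.91, then 0.973
```

Each visit closes 70% of the remaining gap to the target, which is why the value approaches 1.0 quickly with a learning rate this high.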

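Training the Q-Learning Agent

We now train the Q-Learning agent with the train function defined above. When training finishes, we can inspect the resulting Q-table, which stores the value of each state-action pair and lets the agent choose the best action in each state.

Running the Training

Run the following code to train the agent.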

Qtable_frozenlake = train(
    n_training_episodes=n_training_episodes,
    min_epsilon=min_epsilon,               
    max_epsilon=max_epsilon,
    decay_rate=decay_rate,                    
    env=env,                               
    max_steps=max_steps,                  
    Qtable=Qtable_frozenlake                 
)

Inspecting the Trained Q-Table

After training completes, print the learned Q-table to inspect it.

print("Trained Q-Table:")
print(Qtable_frozenlake)

array([[0.73509189, 0.77378094, 0.77378094, 0.73509189],
       [0.73509189, 0.        , 0.81450625, 0.77378094],
       [0.77378094, 0.857375  , 0.77378094, 0.81450625],
       [0.81450625, 0.        , 0.77378094, 0.77378094],
       [0.77378094, 0.81450625, 0.        , 0.73509189],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.9025    , 0.        , 0.81450625],
       [0.        , 0.        , 0.        , 0.        ],
       [0.81450625, 0.        , 0.857375  , 0.77378094],
       [0.81450625, 0.9025    , 0.9025    , 0.        ],
       [0.857375  , 0.95      , 0.        , 0.857375  ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.9025    , 0.95      , 0.857375  ],
       [0.9025    , 0.95      , 1.        , 0.9025    ],
       [0.        , 0.        , 0.        , 0.        ]])
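
One way to read this table is to follow the greedy action from the start state and compare the best Q-value at each step with powers of the discount factor. The sketch below does this in plain Python. It assumes the deterministic (non-slippery) 4 x 4 grid and the action encoding 0 = left, 1 = down, 2 = right, 3 = up; the helpers move and greedy_path are hypothetical names, not part of Gym.

```python
GAMMA = 0.95
N_ROWS, N_COLS = 4, 4

# Q-table copied from the printed output above (rows = states, columns = actions).
Q = [
    [0.73509189, 0.77378094, 0.77378094, 0.73509189],
    [0.73509189, 0.0, 0.81450625, 0.77378094],
    [0.77378094, 0.857375, 0.77378094, 0.81450625],
    [0.81450625, 0.0, 0.77378094, 0.77378094],
    [0.77378094, 0.81450625, 0.0, 0.73509189],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9025, 0.0, 0.81450625],
    [0.0, 0.0, 0.0, 0.0],
    [0.81450625, 0.0, 0.857375, 0.77378094],
    [0.81450625, 0.9025, 0.9025, 0.0],
    [0.857375, 0.95, 0.0, 0.857375],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9025, 0.95, 0.857375],
    [0.9025, 0.95, 1.0, 0.9025],
    [0.0, 0.0, 0.0, 0.0],
]

def greedy_action(q_row):
    # Index of the first maximum, the same tie-breaking as np.argmax.
    return max(range(len(q_row)), key=lambda a: q_row[a])

def move(state, action):
    # Deterministic grid move; moving into a wall leaves the agent in place.
    row, col = divmod(state, N_COLS)
    if action == 0:
        col = max(col - 1, 0)
    elif action == 1:
        row = min(row + 1, N_ROWS - 1)
    elif action == 2:
        col = min(col + 1, N_COLS - 1)
    else:
        row = max(row - 1, 0)
    return row * N_COLS + col

def greedy_path(q, start=0, goal=15, max_steps=99):
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        state = path[-1]
        path.append(move(state, greedy_action(q[state])))
    return path

path = greedy_path(Q)
print(path)  # [0, 4, 8, 9, 13, 14, 15]

# Best Q-value at each state versus gamma ** (steps remaining - 1).
for steps_left, state in zip(range(len(path) - 1, 0, -1), path):
    print(state, max(Q[state]), round(GAMMA ** (steps_left - 1), 8))
```

The two columns agree: the only reward is the +1 at the goal, so the best Q-value in a state is that reward discounted once for every additional step still needed. The all-zero rows (5, 7, 11, 12, 15) are the holes and the goal, which are terminal and therefore never updated.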

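Evaluating the Agent

To check whether the Q-Learning agent has learned successfully, we define an evaluation function. It runs the agent in the given environment for several episodes and returns the mean and the standard deviation of the episode rewards.

Implementing the Evaluation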

def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
    episode_rewards = []

    for episode in tqdm(range(n_eval_episodes)):
        if seed:
            state, info = env.reset(seed=seed[episode])
        else:
            state, info = env.reset()

        step = 0
        truncated = False
        terminated = False
        total_rewards_ep = 0 

        for step in range(max_steps):
            action = greedy_policy(Q, state)

            new_state, reward, terminated, truncated, info = env.step(action)
            total_rewards_ep += reward 

            if terminated or truncated:
                break

            state = new_state

        episode_rewards.append(total_rewards_ep)

    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)

    return mean_reward, std_reward

Running the Evaluation

mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)

print(f"Mean_reward = {mean_reward:.2f} +/- {std_reward:.2f}")

Mean_reward = 1.00 +/- 0.00
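
A result of 1.00 +/- 0.00 means every evaluation episode reached the goal. For a less perfect agent the two numbers would look different; the sketch below works through a made-up list of episode rewards using only the standard library. Note that np.std defaults to the population standard deviation (ddof=0), which corresponds to statistics.pstdev rather than statistics.stdev.

```python
import statistics

# Hypothetical rewards from four evaluation episodes (not results from this post).
episode_rewards = [1.0, 1.0, 0.0, 1.0]

mean_reward = statistics.fmean(episode_rewards)
std_reward = statistics.pstdev(episode_rewards)  # population std, like np.std's default

print(f"Mean_reward = {mean_reward:.2f} +/- {std_reward:.2f}")  # Mean_reward = 0.75 +/- 0.43
```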

Uploading the Model to the Hugging Face Hub

Preparing to Upload

1. Create an account or log in on the Hugging Face website.

2. On the settings page, create an access token with write permission.

3. Run the code below to store the token.

from huggingface_hub import notebook_login

notebook_login()

Helper Functions for the Upload

1. Video recording function

Records the trained agent playing an episode.

def record_video(env, Qtable, out_directory, fps=1):
    images = []
    terminated = False
    truncated = False
    state, info = env.reset(seed=random.randint(0, 500))
    img = env.render()
    images.append(img)
    
    while not terminated and not truncated:
        action = np.argmax(Qtable[state][:])
        state, reward, terminated, truncated, info = env.step(action)
        img = env.render()
        images.append(img)

    import imageio
    imageio.mimsave(out_directory, [np.array(img) for img in images], fps=fps)

2. Model upload function

The function below handles the whole process: evaluating the model, recording the video, writing the model card, and uploading everything to the Hugging Face Hub.

def push_to_hub(repo_id, model, env, video_fps=1, local_repo_path="hub"):

    from huggingface_hub import HfApi, snapshot_download
    from huggingface_hub.repocard import metadata_eval_result, metadata_save
    from pathlib import Path
    import json
    import pickle
    import datetime

    api = HfApi()

    repo_url = api.create_repo(repo_id=repo_id, exist_ok=True)
    repo_local_path = Path(snapshot_download(repo_id=repo_id))

    with open(repo_local_path / "q-learning.pkl", "wb") as f:
        pickle.dump(model, f)

    mean_reward, std_reward = evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])

    evaluate_data = {
        "env_id": model["env_id"],
        "mean_reward": mean_reward,
        "n_eval_episodes": model["n_eval_episodes"],
        "eval_datetime": datetime.datetime.now().isoformat(),
    }
    with open(repo_local_path / "results.json", "w") as outfile:
        json.dump(evaluate_data, outfile)

    metadata = {"tags": [model["env_id"], "q-learning", "reinforcement-learning"]}
    model_card = f"""
    # **Q-Learning** Agent playing **{model['env_id']}**
    This is a trained model of a **Q-Learning** agent playing **{model['env_id']}**.
    """
    with open(repo_local_path / "README.md", "w", encoding="utf-8") as f:
        f.write(model_card)
    metadata_save(repo_local_path / "README.md", metadata)

    record_video(env, model["qtable"], repo_local_path / "replay.mp4", video_fps)

    api.upload_folder(repo_id=repo_id, folder_path=repo_local_path, path_in_repo=".")

    print(f"Your model is pushed to the Hub: {repo_url}")

Creating the Model Dictionary

We bundle the trained Q-table and the hyperparameters into a single dictionary.

model = {
    "env_id": env_id,
    "max_steps": max_steps,
    "n_training_episodes": n_training_episodes,
    "n_eval_episodes": n_eval_episodes,
    "eval_seed": eval_seed,
    "learning_rate": learning_rate,
    "gamma": gamma,
    "max_epsilon": max_epsilon,
    "min_epsilon": min_epsilon,
    "decay_rate": decay_rate,
    "qtable": Qtable_frozenlake,
}
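
push_to_hub stores this dictionary with pickle.dump as q-learning.pkl. As a minimal local sketch of that save-and-load round trip, the code below uses a temporary directory and a tiny stand-in dictionary (hypothetical values, with a nested list in place of the numpy Q-table), so it does not touch the Hub at all.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the real model dictionary; the values are illustrative only.
toy_model = {
    "env_id": "FrozenLake-v1",
    "gamma": 0.95,
    "qtable": [[0.0, 0.95], [1.0, 0.0]],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "q-learning.pkl"

    # Save the dictionary the same way push_to_hub does.
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)

    # Load it back, as someone reusing the saved file would.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

print(loaded_model == toy_model)  # True
```

Because pickle can execute arbitrary code while loading, only unpickle files from sources you trust.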

Running the Upload

username = "RangDev"  # Hugging Face username
repo_name = "q-FrozenLake-v1-4x4-noSlippery"

push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)

Your model is pushed to the Hub. You can view your model here:  https://huggingface.co/RangDev/q-FrozenLake-v1-4x4-noSlippery
