Forest - Olivier Casile

Before dwelling into the details, let's first discuss some basic concepts and terminology: What do we call an Artificial Intelligence (AI) ? And, what algorithms and techniques can we use to make an AI ?

Actually, in many good programs, several of these algorithms and techniques are combined. In the following paragraphs I have listed the main techniques I have used in FOREST, with some advices about what works best, or what doesn't work well.

Now the details

Evaluation function

Function that evaluates, or estimates, the value of a given position in the game. In the Othello game, traditional evaluation functions are based on prior knowledge of the game strategy, and are a combination of the number of discs of the players, on their mobility (number of moves they can make), on the frontier of their position (numer of outbound discs), and on tables of specific patterns such as edges, corners, and diagonals patterns. The evaluation usually provides an estimate of the final score, i.e. the final disc difference between the two players (-64..+64).

In the early days of Othello programming, evaluation functions were empiric, i.e. manually tuned by their programmer after after playing many many games against their program to find out the best coefficients for each of the above parameters. Then, starting with the famous Logistello program written by Michael Buro, the coefficients of the evaluation functions were automatically computed by Logistic Regression applied to a large number of positions with known or well estimated results. That method proved to be superior and Logistello, as well as subsequent programs using that technique such as EDAX or Zebra, defeated all other programs and even the Othello world champion in 1997.

Until its version 5.0, FOREST was using a basic evaluation function combining occupation, mobility, and frontier parameters plus tables of edges, diagonals, and corner patterns, manually tuned. That evaluation function performed reasonably well against player, but has never been able to match the best programs, such as Zebra.

Minimax

The most basic tree search technique, recursive, trying to minimize the opponent's moves while maximizing the players moves, according to the evaluation of their respective position after every move. The Minimax algorithm guarantees to find the optimal solution, but can be very long, depending on the expansion factor of the tree (average number of moves at each turn).

Alpha-beta

Improvement of the Minimax search, where branches of the tree are cut if outside of the [alpha, beta] window, because it is remembered from previous branches examination that the branches outside of that interval would not influence the partial result already obtained for a given node. Alpha beta is a big improvement of the Minimax search and allows to dramatically improve the search depth at equal search time.

Negascout

Improvement of the Alpha-Beta search, where, at a given node, once the best branch has been explored, subsequent branches are searched with a null window to check whether they are better on not than the best branch already found. The search with a null window being much faster than an alpha-beta search, Negascout can save significant time if the branches are properly ordered, i.e. if the best branch is explored first. This assumes that a preliminary sort of the branches can be performed efficiently with a good accurracy, otherwise Negascout can even lose time compared to the plain Alpha-Beta.

Prob-cut

Somewhat an improvement of alpha-beta algorithms where "weak" branches are automatically cut, based on their value at an intermediate depth and pre-computed probabilities that branches with that value will not improve when searching deeper. Prob-cut and its multi prop-cut variation, when carefully tuned, can lead to a significant pruning of the search tree, allowing deeper searches. Programs with deep searches (20 moves look-ahead and more), such as Logistello, NTest, Zebra, Edax, all use prob-cut techniques. However, prob-cut is risky, because there is always a probability that you may cut a branch that eventually would give a good result at a high depth.

Personally, I don't like prob-cut and FOREST doesn't use it. I beleive that the accuracy of neural networks and their capability to properly order and evaluate moves at any depth can lead to similar and even better results than prob-cut, even with much lower search depths.

Neural networks

Neural networks are combinations of "neurons" that take inputs, apply linear combinations of these inputs, then fire a signal according to a non-linear activation function to the next neuron. The digital neurons in a neural net are a simplistic modelisation of human neurons. The number of neurons, and of their parameters, roughly conditions the capacity of the neural to learn how to properly process the input signal.

Neural networks can be used for estimating, predicting, classifying events, for recognizing patterns (image recognition), for taking decisions, and for many other purposes. In the case of board games such as Othello, we see that there are several elements that make neural networks relevant: patterns (combinations of discs on the board) are important, the score can always be predicted in a finite range [0, 64] for the player, and there are tactics and strategies you can learn in order to improve your chances of weening. Basically, given a given position of the discs on the board, there is one and only one possible final score if both players play the best possible moves at every turn. Ohello is a "perfect information games", like Chess or Go, but unlike Poker.

The training phase requires a lot of "tagged" samples, i.e. of inputs for which the outputs are known. The more you have neurons in your neural network, the more you need training samples. Regarding Othello, my experience is that a neural network able to decently estimate the score of a position easily requires 100's of 1000's of parameters... That means you need a very large number of games, and of board positions at every turn of the game, for which the final score is known. There exist libraries of Othello games, such as the "WTHOR database", that contain 100.000+ games played by human players and computers, that constitute a good basis. However, such libraries are not sufficient, mainly because the final score resulting from every position at every turn is not always known. In the WTHOR database, only positions with 24 empty squares or less are resolved, but positions with more empty squares are more or less certains, depending on the strength of the players of each game. And, anyway, 100.000 games are an order of magnitude inferior to the number of "good" games you really need. Practically, that means you need to produce a lot of additional games, for instance by using known good programs, such as EDAX, NTEST or ZEBRA, and letting them play games between each others.

In the prediction phase, the neural network "observes" a position of the discs on the board, and predicts the final score, more or less accurately. The more the prediction is accurate, the better your program plays. My experience is that it is extremely difficult to obtain a good accurracy, i.e. to predict the exact final score resulting from a position, probably because the game of Othello is very "chaotic", in the sense that, at every turn, many discs can be flipped and the board pattern significantly changed. This particularity of Othello, which increases as a game gets closer to its end, makes it almost impossible to estimate the final score of a position with an accurracy of 1 disc by just examining the position, unless that position is final or near to final. But, if we notice that the neural network basically acts as an "evaluation function", the we can use it in a slightly different way... Tradictional Othello (or Chess, or Go) programs use alpha-beta tree searches to anticipate moves in the game, combined to an evaluation function computed at each leaf of the tree, to estimate the value of a given position or move. Why not combining that traditional alpha-beta approach with a neural network used as evaluation function ? This is what FOREST does, and it works well. Actually, I observed that the alpha-beta is "smoothing" the estimations computed by the neural network, and gives position estimations that are more an more accurate as the depth of the alpha-beta grows. I have no mathematical proof, but my guess is that the alpha-beta, with its large number of leaves, is doing some "bagging" or "ensembling" of the results, techniques which are known to improve the accurracy of neural networks. The only caveat of that method is that the alpha-beta, even improved by negascout, can't go very deep, because the neural network requires a lot of computations, much more than traditional evaluation functions. Traditional programs such as ZEBRA or EDAX will easily explore the game 24 moves ahead, whereas with a neural network, reaching 12 moves ahead is already a challenge. But... the neural network evaluates a position much more accurately than traditional evaluation functions, which compensates the lower look-ahead depth.

Sorts of neural networks

Neural networks can have different structures and can be composed of neurons of different kinds: dense layers made of fully connected neurons, convolutional networks, recurrent networks, residual networks, etc.

The first approach works very well on large boards such as the 19x19 board of Go, where convolutional neurons can scan the board and detect patterns without being too much disturbed by the edges of the board. AlphaGo has proved CNN are very efficient on detecting the stones patterns of the Go game. But for Othello (and likely for Chess), my experience is that it doesn't work well, because on a 8x8 board a 3x3 convolutional neurons can only move 6 times in each dimensions before hiting an edge, and because the importance of edges and corners at Othello is such that the convolutional neuron cannot learn anything generic enough about its 3x3 pattern. Then, a second layer of 3x3 convolutional neurons can only move 4 times, which is even worse. The net is that I gave up with that approach.

The second approach is theoretically perfect, but has a major drawback: if each square is considered as a three-state input with a possible value of (-1, 0, +1), then the first hidden -dense- layer of N neurons will need to perform a N x 64 matrix multiplication at run time, and the next dense layer of M neurons will do N x M multiplications, which is a lot. The resulting neural network used as an evaluation function will be slow, which will seriously penalize the depth of the Alpha Beta, hence reduce drastically the overall accuracy of evaluations. Another drawback of that approach is the very high number of training position it requires. I also rapidly gave up with that approach.

Another aspect of neural networks is their depth (number of hidden layers) and width (number of neurones in each layer). Initially, we are often tempted to put many layers and many neurons in the network, driven by the belief that the problem to solve is complex. That approach generally leads to over-fitting. The right approach is rather to start with the simplest network, see the results, and augment the number the number of neurons an layers to improve accuracy, to the point the network starts to over-fit. Remember that a 1-layer neural network, known as a perceptron, is already a universal function approximator, i.e. it can approximate about any continuous function, hence it is already quite powerful.

Eventually, each layer of the neural networks has an activation function. After trying the Sigmoïd, Tanh, Relu, Leaky Relu, and PRelu, the conclusion is that -unsurprisingly- the Relu family works best, and the Leaky Relu slightly better than the standard Relu. I also tried the PRelu (Parametric Relu), but it didn't bring any additional accurracy and requires much more training data.

The last aspect of neural networks is their output layer. The two main options are:

After many tries, I eventually chosed the... first approach, the reason being that the Softmax function is extremely expensive to compute at run time, with many multiplications and exponentials. The second approach is certainly more accurate, but it is just too slow. Sometimes we have to make compromises.

Features engineering

Training data

Neural networks need lots of training data. The more the network has parameters, the more it needs data to tune those parameters using the gradient descent method. The neural network I use in FOREST doesn't have that many neurons (less than 70), but due to its embedding layer and the multiple phases of the game, each with its own set of parameters, it has more than 1 million parameters overall. Computing this 1+ million parameters with a good accurracy actually requires hundreds of millions of labelled positions, i.e. of positions for which the final score is known, either exactly or with a good approximation.

There are several options, but anyway you need a very large number of games, played by human players and computers. A first set of games can be found in the WTHOR database, containing more than 100.000 games played during tournaments, between human and computer players of various strength. But these games are not sufficient, and you need to complement that database by confronting your program against a reference program such as EDAX, which can easily be used in scripts. Those automatically generated games can either have a random beginning, or, better, be generated by a Monte Carlo Tree Search (MCTS).

Once you have a large number of games, you need to compute the value of each position of those games. Resolving positions with 26 empty squares or less is reasonably quick with a program like EDAX and a fast multi-core computer, but resolving positions with more empty squares is slow or even impossible. The only option left is to estimate the value of those positions as accurately as possible. This can be done in two ways: either you use your reference program to evaluate those position with its own alpha-beta search and evaluation function, or you make statistics on your database of games and estimate the value of each position with a minimax ordering of the game tree. Also, don't forget that the Othello game has 8 symetries, which provides an easy mean to augment the set of training positions by x8.

As an indication of what is needed to train a good neural network to play Othello, FOREST uses a game database of more than one million games. And, the more game I add to the database, the better accuracy I obtain. Training a neural network is a never ending process.

Note that, for many technical reasons, and to help the neural network to generalize, both the input data and the output scores must be normalized, i.e. scaled to the [-1, +1] range. Ideally, the intermediate outputs of the hidden layers should also be normalized (batch normalization), but this raises difficulties at prediction time and is not absolutely necessary when the number of hidden layers is small, which is the case for Othello.

Under- and Over- fitting

One of the main challenges of defining and training a neural network is to stay on the thin line between under-fitting and over-fitting.

Although over-fitting is realatively easy to detect (the network accuracy reaches excellent scores on the training data, but remains poor on a set of validation data, it is a difficult problem to tackle. Adding more training data can be difficult when those data are expensive to obtain, and it is the case for the game of Othello. Techniques known to be efficient, such as drop-out and batch normalisation, are easy to apply at training time, but pose a serious problem when using the neural network as an evaluation function at run time, because the neural network is used on individual positions rather than on batches of positions, hence it is very difficult to compensate the effects of drop-out and batch regularization on individual evaluations. My experience on Othello is that, practically, the only techniques that work to prevent over-fitting are to properly tune the network size and depth, and L2 regularization.

Regularization and Generalization

One of the key differentiators of neural networks and deep learning is their capability to generalize, i.e. to apply what they have learnt on a training data set to apply that knowledge to inputs they have never seen and to deliver meaningful predictions.

However, that generalization capability only works when the neural network doesn't over-fit, i.e. its coefficients are not customized to match exactly the training set. The best way I have found to achieve this is L2 regularization, which penalize the largest coefficients (or weights) of the neural network, thus keeping those weights in a reasonable range and avoiding the extreme values that would be needed to match the inputs exactly.

Tuning the regularization parameter, known as "lambda", is tricky. There is no rule other than it should be small, let's say in the range of 10e-5. Too small it will have no effect; too large it will hamper the prediction accuracy. The right way to find the correct value(s) is to try different values until an optimum of the prediction on validation data (not training data) is attained. What I have observed with othello is that lambda is progressiveny decreasing as the games progresses. Which is consistent with the intuition that the beginning of the Othello game is more strategic (i.e. good players apply "general" strategic rules to reach a solid position), whereas the end of the game more tactical (i.e. players try traps and start counting stable discs to maximize their final score).

Monte Carlo Tree Search and UCB

Monte Carlo methods are a category of mathematical methods based on randomness; their name come from the randomness of games at the casinos in Monte Carlo. One specific Monte Carlo method that applies to games such as Othello (as well as to Chess and Go) is the Monte Carlo Tree Search (MCTS), which can be used as an alternative to the traditional Alpha-Beta tree search.

The idea of the MCTS to search a tree structure for the score of the optimal branch is to explore a random subset of the branches, and to assume that the optimal score is the average of the scores of the branches that were explored. The more random branches you take, the closer your estimate is to the theoretical result (which could be obtained by an Alpha-Beta). The MCTS can be extremely efficient because, practically, you might only need to explore very few branches, i.e. to play or simulate very few games, to get a decent estimate of the result. How many games to you need to simulate ? That depends... I have no mathematical proof, but my intuition is that it depends on the "quiescence" or "stability" of the game. Games such as Go and Chess are relatively quiet, i.e. strength or value of a player's position evolves relatively quietly throughout the game, without brutal evolutions, except perhaps for the final mat phase in the Chess game. Hence, MCTS works extremely well for the game of Go, as proved by the AlphaGo program. But Othello is quite different from Go and Chess: the begining of an Othello game is relatively quiet, but the game gets more and more chaotic as it appraches its end, and in the 20 last moves the score can completely change at every move. That makes of Othello a challenge for the MCTS, because playing completely random games will give an average score which is the average of a random variable with a high variance, hence a meaningless predicted score.

Nevertheless, those improvements of the MCTS algorithm have a negative consequence: for performance reasons, they make the MCTS almost unusable at runtime, i.e. during live games, because the execution of a decent selection algorithm at every move plus the resolution of all end-games at 18 empty squares or more dramatically reduce the number of games that can be tested in a short amount of time (unless you have a huge parallele system available for free, which is out of reach of individuals). The Alpha-Beta and its optimized variants still remain the best option for live games... So, can the MCTS still be useful ? Yes it can, in two areas:

Bit-boards

Most of the early Othello programs were representing, including FOREST until its version 5.x, were internally representing and manipulating the game board as a vector of bytes, let's say a 10 x 10 = 100 bytes in order to contain the 8 x 8 squares plus two additional rows and columns to represent the edges and facilitate the detection of these edges by the loops traversing the board. Each square of the board then contained a three state value, lets say 0 for an empty square, 1 for a white disc and 2 for a black disc, or (-1, 0, +1). That representation was straightforward and allowed to implement the basic algorithms of the game (finding an empty square, finding a licit move, and playing a move plus flipping the adversarial discs, copying a game) in a few lines of code. However, that representation had three major drawbacks:

Of course, CPU's were improving generation after generation, with better branch prediction reducing the cost of the loops, larger internal caches reducing the need to access the RAM, optimized block copy operations reducing the cost of copying game boards, and deeper pipelines with out of order and parallel execution reducing the cost of flipping and un-flipping discs. Also, programs such as FOREST 4.x were extensively using loop unrolling techniques to improve the key algorithms. But the real breakthrough came with the new generations of 64-bits CPUs allowing to store an entire game board in a pair of register, one register storing the presence of white discs (0 or 1) in each square, the other register doing so with black discs.

The new techniques using pairs of 64 bits registers to hold and manipulate board representations are referred to as "bitboard manipulations". The use of bit masking, shift operations, and of the classical and/or/xor bitwise operations enable the development of very fast routines to play Othello (and Chess). You can even list all the licit moves on a given board position in a few instructions without any access to the RAM, the algorithm is known as Dumb7Fill.

I started using bitboards with the versions 5.x of FOREST, which gave the program a speed-up of about 40%. Actually, that is less than what I was expecting, likely because the program was already very optimized, and because the latest x86 CPU's are so good at caching/fetching data and predicting branches that they have significantly reduced the drawbacks of the vector representation of board games. But, anyway, a 40% improvement is worth the effort of adopting bitboards.

Multi-threading

Modern CPU's have more and more cores, and are even able to execute more than one logical thread per physical core. Actually, the increase of CPU horsepower is now mainly due to the increase of the number of cores (horizontal scaling) rather than on the increase of the core frequency or IPC (instructions per cycle). However, the power of all these cores remains difficult to exploit, because a CPU is not able, by itself, to distribute the workload on all its cores. The logic of the applications needs to be modified to exploit all the cores, and the operating system needs a scheduler able to distribute threads of workload over all the available cores.

How can multi-cores architectures be used in an Othello program ? The computing workload generated by FOREST comprises four main algorithms: the Alpha-Beta tree search in mid-game, the Neural Network that serves as evaluation function, the Win-Loss-Draw search in end-game, and the exact score determination in end-game. Each of these algorithms are opportunities for parallelism:

Once it is decided what can be parallelised using mult_threading, the next question is "how ?". The C++ language and Standard Template Library (STL) offers nice high-level multi-threading and synchronization primitives since the C++11 standard, the Future and Promise objects, which are very convenient for launching asynchronous jobs and waiting for their result. The issue with those high level primitive is their overhead. They are incredibly slow and can only be applied to macro tasks. Using them to parallelize small tasks, such as several computations of the neural network, is actually slower than doing the computations sequentially. There are two reasons for this:

The solution to the two above problems is known as "Thread Pooling", which consists in creating a fixed number of threads at program initialization, then assigning workload to those threads when needed. This eliminates the cost of dynamically creating threads and allows to control exactly the number of threads that will run concurrently.

Unfortunately, so far, there is no standard thread pooling mechanism in C++ (although some C++ threading primitives use a Windows thread pooling mechanism, but in an un-controlled manner). The only solution is therefore to develop a custom thread pooling mechanism, using the low level Condition Variables and Mutex primitives of the C++ STL.

The Mutex mechanism has also, by itself, a large impact on the performance of multi-threading. In a thread pool implementation, Mutex's allow a given thread to wait for a task to perform, or to pick an available task and perform it. For relatively small tasks such as the computation of a Neural Network, the times it takes to test the Mutex, block the thread, and awaken the thread are therefore critical. The issue is that different compilers have different implementations of the std::mutex object: GCC/G++ rely on the Pthread POSIX library, whereas Visual Studio C++ relies on the Windows Mutex mechanism. And, the Pthread implementation is at least 10 times faster than the standard Windows Mutex. This is because Pthreads relies on Windows "Critical Sections", which are actually an inter-thread exclusion mechanism implemented at the "user level" of the operating system, whereas the Windows Mutex is an inter-process exclusion mechanism implemented at the kernel level, i.e. that requires going downto the kernel of the operating system. The Windows Mutex is therefore more powerfull, because it is capable of inter-process mutual exclusion, but it is also much slower. Considering that for Othello computations we only need multi-threading, not multi-processing, using Windows Critical Sections is therefore much more efficient than using the regular Windows Mutex mechansm. And, if we want to use Visual Studio C++ rather than GCC, and avoid to drag the Pthread library with the program, the best option is to develop a custom Mutex object that relies on the Windows Critical Section (or on the Eventing and WaitForSingleObject primitives). Practically this is only a few lines of code, not a big deal.

Note that it is possible to implement further optimizations using lock-free mechanisms, such as "memory barriers", but I won't enter into the details, even though I actually use such a mechanism to retrieve the results of the tasks performed by my thread pool without blocking the thread that launched the tasks.

Vectorization

Modern CPUs such as Intel Core, AMD Rizen, IBM Power, and some ARM processors have SIMD (Single Instruction Multiple Data) capabilities, i.e. they can process multiple data with a single instruction. Such instructions are usually available for the + - x / operations, the binary operations (and or xor not), shift and rotate operations, and for integer as well as floating point data. Depending on the processor, on its generation and the instruction set, SIMD instructions can process 128, 256, 512 bits at once. Intel processors, for instance, support AVX2 (Advanced Vector Extensions) instruction since the Haswell generation (Sandy Bridge series in 2013), which can process 256 bits at once, i.e. four 64 bits integers or eight 32 bits floating point values. Such instructions can considerably accelerate bitboard manipulations, as well as the matrix operations of the neural network.

However, the difficulty is to properly use those instructions in conjunction to the programming language you are using. There are actually two options to do this:

Note that the latest Visual C++ 2019 compiler has an excellent auto-vectorizer, and is even able to apply to SIMD instructions all the kinds of optimizations that are applicable to regular instructions (arithmetic simplifications, sub-expresssion elimination, detection of fused multiply-adds, constant folding, pipeline optimization, etc.), which results in very fast code, see the details here.

Bibliography:

There has been many technical papers and even research papers on the programming of Othello/Reversi and on the use of various machine learning, deep learning techniques, and re-inforcement learning in that game. Here are the most notable ones with my comments about their content and usefulness.

Tooling

Programming

FOREST is architected in two layers: the user interface which is developped in C++ and uses the MFC GUI classes, and the game logic which is mostly in C for performance reasons, with a couple of C++ modules when specific features of the C++ standard library are necessary, such as multi-threading. It is worth noting that some parts of the C code, such as bitboard manipulations and the neural network logic, are automatically generated by Python scripts to unroll loops and adapt to the neural network hyper parameters. The C++/C compiler I use is the Visual Studio 2017 C++ compiler. I have tried GCC and CLang, but the VS C++ compiler generates faster code in my case (about 10% faster). In particular, the auto-vectorizer of VS C++ does an excellent job which is absolutely critical for the performance of the neural network. This doesn't mean the VS C++ compiler would is always faster than others, it really depends on your code and you have to try all af them to find out the one that delivers the best performance for your code.

But the runtime code of FOREST, what you execute when playing with it, is only the visible part of the iceberg. There is actually more code to manage the training data, train the neural network, and to post-process the outputs of the training before FOREST can be compiled and executed. All that code is written in Python, which, in my mind, is definitively the most flexible programming language for data science. The features I use the most in Python are of course the lists and dictionaries, that are so usefull to manipulate game positions and build game trees. The only drawbacks of Python are its relative slowness (it is a semi-interpretor) and its high consumption of RAM (all variables are dynamically linked, hence pointers and dynamic memory allocation everywhere), which become problems when processing tens of millions of data records. The best ways to overcome these drawbacks are a powerfull computer with lots of RAM (I am at 32 Gb RAM and will soon double that amount), and the use specific extensions written in Cython for specific routines which must absolutely run fast. Cython is basically a Python compiler which simplifies the development of Python extensions written in pure C for absolute performance. Python alone is very nice, but most of its power actually comes from its vast catalog of extensions, which allow to integrate most of the known algorithms for data processing. Here are the extensions I use the most:

Numpy, the ubiquitous high performance matrix and N-dimensions arrays manipulation library;
SciPy, the Scikit Learn machine learning library that contain all sorts or algorithms that are usefull in feature engineering and data preparation, such as the K-Best ranking of your features;
Sqlite3, providing an easy way to manipulate the SQL database where I store my huge base of games and positions;
csv, to manipulate "comma separated value" files, quite useful to store intermediate results;
othpy, the python extension I have written in Cython and C to accelerate the manipulation of Othello boards;
and of course Keras/Tensorflow, the deep learning framework I will explain in the next section.

Deep Learning

To be continuated.

A world-championship-level Othello program, Paul S. Rosenbloom, 1981	One of the very first papers explaining with details all the techniques and steps that led to the development of the IAGO program, which one of the first really good programs prior to the arrival of Logistello. In particular, there is a good discussion about the "features", i.e. the key factors, that play a role in the strategy of the game of Othello.
An evaluation function for Othello based on statistics, Michael Buro, 2002	The must-read paper from Michael Buro, the author of the famous Logistello program, explaining for the first time how to determine mathematically an efficient evaluation function. Although Michael used a logistic regression technique, the principles of his supervised learning methodology still appies today with neural networks.
The evolution of Strong Othello programs, Michael Buro, 2003	A paper from the author of the famous Logistello program relating the progress of stong Othello programs from Iago and its hand crafted evaluation function to Logistello and its mathematically tuned evaluation function.
Experiments with Multi-ProbCut and a new High-Quality evaluation Function for Othello, Michael Buro, 2000	An excellent paper from Michael Buro on the multi-prob-cut technique allowing to prune Alpha-Beta search tree hence to dramatically augment the search depth.
Toward Opening Book Learning, Michael Buro, 2000	Another important paper from Michael Buro on the automatic learning of opening books. This is the most sophisticated and efficient technique I have seen so far.
Knowledge-based feature discovery for evaluation functions, Tom E. Fawcet, 1995	A very interesting approach to discover the key features of the Othello game, i.e. the characteristics that play a key role in the value (expected score) of a board position.
Linear versus Non-Linear Learning in the Context of Othello, Ali Alanjawi, 2000	A short paper with some indications that non linear evaluation functions are superior than linear ones. This is also the only paper I found using Support Verctor Machines (SVM) for Othello.
Observing the Evolution of Neural Networks, Learning to Play the Game of Othello, Siang Y. Chong, 2005	A paper summarizing the main thoughts behind the use of neural networks to play the game of Othello. This paper doesn't list the sophisticated techniques that are now possible with modern AI frameworks, but is is the first paper to mention the use of a combination of pattern detection neurones (convolutional neurons) and of game strategy features. In that sense it is quite interesting.
On Bayesian Upper Conﬁdence Bounds for Bandit Problems, Emilie Kaufmann, Olivier Cappe, Aurelien Garivier , 2012	The essential paper to understand and use the UCB algortithms in Monte Carlo Tree Searches.

Developing an Artificial Intelligence for Othello/Reversi

Introduction