INFO2049-1

Duration

30h Th

Number of credits

	Master MSc. in Computer Science, professional focus in computer systems security	5 credits
	Master MSc. in Data Science, professional focus	5 credits
	Master MSc. in Data Science and Engineering, professional focus	5 credits
	Master MSc. in Computer Science and Engineering, professional focus in management	5 credits
	Master MSc. in Computer Science and Engineering, professional focus in intelligent systems	5 credits
	Master MSc. in Computer Science and Engineering, professional focus in intelligent systems (double degree with HEC)	5 credits
	Master MSc. in Computer Science, professional focus in management	5 credits
	Master MSc. in Computer Science and Engineering, professional focus in computer systems and networks	5 credits
	Master MSc. in Computer Science, professional focus in intelligent systems	5 credits
	Master MSc. in Computer Science, professional focus in intelligent systems (double degree with HEC)	5 credits
	Master in business engineering, professional focus in Supply Chain Management and Business Analytics	5 credits
	Master in business engineering, professional focus in Supply Chain Management and Business Analytics (Digital Business)	5 credits
	Master in linguistics, professional focus in analysis of textual data	5 credits
	Master in linguistics, professional focus in word processing and analysis of textual data (joint degree programme) (double degree)	5 credits

Lecturer

Ashwin Ittoo

Language(s) of instruction

English language

Organisation and examination

Taught in the first semester, with the exam in January

Schedule

Schedule online

Prerequisite and corequisite units

Prerequisite or corequisite units are presented within each program

Learning unit contents

Objective

LLMs have generated lots of hype recently. However, the field of natural language processing encompasses a much broader range of topics.

This course covers basic and advanced algorithms and techniques in natural language processing (NLP) and text mining/analytics. We will study a number of fundamental methods, which have led to the advent of LLMs.

In addition, there will be practical sessions and implementation projects. These will mainly concern deep learning architectures, in particular RNNs, Transformers and related models such as BERT and GPT.

PyTorch (with Python) will be the framework of choice.

Course participants will also have to read selected scientific articles, and present their main conclusions.

Course Structure

1. Vector Space Model and Information Retrieval

Vector representation of text
Term document matrices, document term matrices
Similarity measures (cosine, Euclidean, Jaccard)
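
As an illustration of the three similarity measures listed above, here is a minimal sketch in plain Python. The function names and the toy vectors are assumptions made for this example and are not taken from the course material. The zero-vector and empty-set conventions are one possible choice among several.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention assumed here for all-zero vectors
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def jaccard_similarity(tokens_a, tokens_b):
    """Size of the intersection over size of the union of two token sets."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # convention assumed here for two empty documents
    return len(a & b) / len(a | b)

# Toy documents over the vocabulary ["cat", "dog", "fish"] (illustrative only).
doc1 = [1, 1, 0]  # "cat dog"
doc2 = [1, 0, 1]  # "cat fish"
print(cosine_similarity(doc1, doc2))
print(euclidean_distance(doc1, doc2))
print(jaccard_similarity(["cat", "dog"], ["cat", "fish"]))
```

Cosine similarity ignores vector length, so it is often preferred when documents differ widely in size, whereas Euclidean distance is sensitive to it.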

2. Feature Selection

Tf-idf (Term Frequency-Inverse Document Frequency)
Chi-squared measure
Mutual information
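
To make the tf-idf weighting concrete, the sketch below computes it for a toy corpus. It assumes raw term counts for tf and idf = log(N / df); several other variants exist, and the function name `tf_idf` and the corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes tf = raw count and idf = log(N / df), one variant among many.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus (illustrative only).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A term that appears in every document, such as "the" here, receives a weight of zero, which is why tf-idf can serve as a simple feature filter.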

3. Naïve-Bayes for Text Classification

Bayesian theory revision
Multinomial vs. Bernoulli Naïve-Bayes
Parameter estimation

4. Evaluating Models

Bootstrapping, cross-validation
Metrics: precision, recall, F-score
Metrics (Machine Translation): BLEU
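
As a small worked example of precision, recall and F-score for binary classification, consider the sketch below. The function name, the label encoding (1 for the positive class) and the toy labels are assumptions made for this illustration. The F-score shown is the balanced F1.

```python
def precision_recall_f1(gold, predicted, positive=1):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Return 0.0 when a denominator is zero (a convention assumed here).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy labels: 2 true positives, 1 false positive, 1 false negative.
gold = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
print(precision_recall_f1(gold, pred))
```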

5. Language Models Fundamentals

Markov models
n-gram (tri-gram) models
Parameter estimation
Perplexity metric
Discounting methods and Katz Back-off
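
The sketch below illustrates parameter estimation and perplexity on a bigram model. For brevity it uses add-one (Laplace) smoothing rather than the discounting and Katz back-off listed above, and it assumes every evaluated word is in the training vocabulary. All names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

START, END = "<s>", "</s>"

def train_bigram(sentences):
    """Count contexts and bigrams over sentences padded with START and END."""
    contexts, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = [START] + sent + [END]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    # Words the model can predict: every training word plus the END marker.
    vocab = {t for s in sentences for t in s} | {END}
    return contexts, bigrams, vocab

def bigram_prob(prev, word, contexts, bigrams, vocab):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def perplexity(sentences, contexts, bigrams, vocab):
    """Exponential of the average negative log-probability per predicted token."""
    log_prob, n_tokens = 0.0, 0
    for sent in sentences:
        tokens = [START] + sent + [END]
        for prev, word in zip(tokens[:-1], tokens[1:]):
            log_prob += math.log(bigram_prob(prev, word, contexts, bigrams, vocab))
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

# Toy corpus (illustrative only).
train = [["a", "b"], ["a", "c"]]
contexts, bigrams, vocab = train_bigram(train)
print(bigram_prob("a", "b", contexts, bigrams, vocab))
print(perplexity(train, contexts, bigrams, vocab))
```

Lower perplexity means the model assigns higher probability to the evaluated text. In practice it is measured on held-out data rather than on the training sentences, as done here only to keep the example short.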

6. Neural Language Models

Distributational semantics and distributed word representation
Comparison with 1-hot encodings
Illustration with Word2Vec: skipgram, CBOW

7. Recurrent Neural Networks

LSTM architecture
Issues with vanishing and exploding gradients

8. Machine Translation

Overview of Statistical Machine Translation
Seq2Seq Model
Evaluation metrics and limitations

9. Transformers & BERT

Motivation for contextual embeddings
Self-attention
Transformer architecture
BERT

10. LLMs

Overall architecture
Architecture comparison (e.g. Llama, Falcon)
What LLMs cannot do

11. Ethics

Bias
Privacy
Hallucination

Learning outcomes of the learning unit

Understand the underlying principles and algebraic formulations of machine learning models
Apply these models to information extraction from text and to text classification
Synthesize the principles and algorithms introduced in the course to develop a full-fledged text analytics application (as part of the course project)
Implement text analytics solutions to support an organization's business intelligence activities
Formulate a strategy based on the acquired text analytics skills to optimize the value of an organization
Research and understand advanced topics in the field, and keep informed of recent developments in order to adapt easily to changing requirements
Appreciate how the algorithms studied could solve real-life managerial issues
Communicate appropriately about text analytics projects/applications to various stakeholders

This course contributes to the learning outcomes I.1, I.2, I.3, II.1, II.2, II.3, III.1, III.2, III.3, III.4, IV.1, IV.2, IV.3, VI.1, VI.2, VII.1, VII.2, VII.3, VII.4, VII.5 of the MSc in data science and engineering.

This course contributes to the learning outcomes I.1, I.2, II.1, II.2, II.3, III.1, III.2, III.3, III.4, IV.1, IV.2, VI.1, VI.2, VII.1, VII.2, VII.3, VII.4, VII.5 of the MSc in computer science and engineering.

Prerequisite knowledge and skills

Students are expected to have reasonable maths/stats and programming skills. Appropriate guidance and support will be offered to students:

Lecture notes
Online references
Consultation (if time permits)

Planned learning activities and teaching methods

The course carries 5 credits and therefore requires 150 hours of work (1 credit = 30 hours).
Theory lectures = 18-22 hours

Self-study for exam = approx. 70 hours
Practical lectures = 9-12 hours
Working on practical exercises and projects = approx. 80 hours
Total = 150 hours (5 credits)

Mode of delivery (face to face, distance learning, hybrid learning)

Lectures
Practical (during lectures and as homework)

Course materials and recommended or required readings

Topics covered in the course are drawn from several textbooks:
Deep Learning (2016) by Goodfellow, Bengio and Courville
Speech and Language Processing (2017) by Jurafsky and Martin
Neural Network Methods for Natural Language Processing (2017) by Goldberg

Exam(s) in session

Any session

- In-person

written exam (open-ended questions)

- Remote

oral exam

Written work / report

Continuous assessment

Additional information:

Final written exam: 50%
Final practical project: 35%
Paper presentation: 15%
(May be adjusted during the course)

Work placement(s)

Organisational remarks and main changes to the course

Contacts

Ashwin Ittoo, ashwin.ittoo@uliege.be

Association of one or more MOOCs

Items online

Lecture Notes

Web and Text Analytics