2024-2025 / INFO2049-1

Web and Text Analytics

Duration

30h Th

Number of credits

 Master MSc. in Computer Science, professional focus in computer systems security: 5 credits
 Master MSc. in Data Science, professional focus: 5 credits
 Master MSc. in Data Science and Engineering, professional focus: 5 credits
 Master MSc. in Computer Science and Engineering, professional focus in management: 5 credits
 Master MSc. in Computer Science and Engineering, professional focus in intelligent systems: 5 credits
 Master MSc. in Computer Science and Engineering, professional focus in intelligent systems (double degree with HEC): 5 credits
 Master MSc. in Computer Science, professional focus in management: 5 credits
 Master MSc. in Computer Science and Engineering, professional focus in computer systems and networks: 5 credits
 Master MSc. in Computer Science, professional focus in intelligent systems: 5 credits
 Master MSc. in Computer Science, professional focus in intelligent systems (double degree with HEC): 5 credits
 Master in business engineering, professional focus in Supply Chain Management and Business Analytics: 5 credits
 Master in business engineering, professional focus in Supply Chain Management and Business Analytics (Digital Business): 5 credits
 Master in linguistics, professional focus in textual data analysis: 5 credits
 Master in linguistics, professional focus in word processing and analysis of textual data (joint degree programme) (double degree): 5 credits

Lecturer

Ashwin Ittoo

Language(s) of instruction

English language

Organisation and examination

Teaching in the first semester, exam in January

Schedule

Schedule online

Prerequisite and corequisite units

Prerequisite or corequisite units are presented within each program

Learning unit contents

Objective

Large language models (LLMs) have attracted a great deal of attention recently. However, the field of natural language processing encompasses a much broader range of topics.

This course covers basic and advanced algorithms and techniques in natural language processing (NLP) and text mining/analytics. We will study a number of fundamental methods, which have led to the advent of LLMs.

In addition, there will be practical sessions and implementation projects. These will mainly concern deep learning architectures, in particular RNNs, Transformers, and related models such as BERT and GPT.

PyTorch (in Python) will be the framework of choice.

Course participants will also have to read selected scientific articles, and present their main conclusions.

 

Course Structure

1. Vector Space Model and Information Retrieval 

  • Vector representation of text
  • Term document matrices, document term matrices
  • Similarity measures (cosine, Euclidean, Jaccard)
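
As an illustration of the three similarity measures listed above, here is a minimal sketch in plain Python. It is not taken from the course material: the toy documents, the raw-count vectors, and the function names are assumptions made for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of two token sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

# Two toy documents (hypothetical), represented as raw term-count vectors
# over their shared vocabulary.
doc1 = "the cat sat on the mat".split()
doc2 = "the dog sat on the log".split()
vocab = sorted(set(doc1) | set(doc2))
vec1 = [doc1.count(t) for t in vocab]
vec2 = [doc2.count(t) for t in vocab]

print(cosine_similarity(vec1, vec2))
print(euclidean_distance(vec1, vec2))
print(jaccard_similarity(doc1, doc2))
```

Cosine and Jaccard are similarities (higher means more alike), whereas Euclidean is a distance (lower means more alike).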
 

2. Feature Selection 

  • Tf-idf (Term Frequency-Inverse Document Frequency) 
  • Chi-squared measure
  • Mutual information 
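
To make the tf-idf weighting above concrete, the following sketch uses one common variant: raw term count multiplied by log(N / df). Other variants exist (smoothed or normalised), and the course may use a different one. The toy corpus and the function name are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed variant: weight = raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical three-document corpus.
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the dog barked".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document ("the") receives weight 0, while a term confined to a single document ("cat") receives the highest weight.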
 

3. Naïve-Bayes for Text Classification 

  • Bayesian theory revision
  • Multinomial vs. Bernoulli Naïve-Bayes
  • Parameter estimation
 

4. Evaluating Models

  • Bootstrapping, cross-validation
  • Metrics: precision, recall, F-score
  • Metrics (Machine Translation): BLEU
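
The precision, recall and F-score metrics listed above can be sketched for a binary classification task as follows. The labels are made-up data, the function name is hypothetical, and F-score is taken here to mean the balanced F1 variant.

```python
def precision_recall_f1(gold, predicted, positive=1):
    """Precision, recall and F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up gold labels and predictions: 3 true positives,
# 1 false positive, 1 false negative.
gold = [1, 1, 1, 0, 0, 1]
pred = [1, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(gold, pred)
print(precision, recall, f1)
```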
 

5. Language Models Fundamentals

  • Markov models
  • n-gram (tri-gram) models
  • Parameter estimation
  • Perplexity metric
  • Discounting methods and Katz Back-off 
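
As a rough illustration of n-gram parameter estimation and the perplexity metric, the sketch below trains a bigram model by maximum likelihood on a made-up two-sentence corpus. It deliberately omits discounting and back-off, so any unseen bigram gets probability zero and infinite perplexity, which is the problem those methods address. All names and data are assumptions.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their contexts, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s + ["</s>"]
        contexts.update(tokens[:-1])  # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams

def bigram_prob(prev, word, contexts, bigrams):
    """Maximum-likelihood estimate of P(word | prev), without smoothing."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

def perplexity(sentence, contexts, bigrams):
    """exp of the negative average log-probability per predicted token."""
    tokens = ["<s>"] + sentence + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = bigram_prob(prev, word, contexts, bigrams)
        if p == 0.0:
            return float("inf")  # unseen bigram, no smoothing applied
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

corpus = [["i", "like", "nlp"], ["i", "like", "pizza"]]
contexts, bigrams = train_bigram(corpus)
print(perplexity(["i", "like", "nlp"], contexts, bigrams))
print(perplexity(["i", "like", "cats"], contexts, bigrams))
```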
 

6. Neural Language Models

  • Distributational semantics and distributed word representation
  • Comparison with 1-hot encodings
  • Illustration with Word2Vec: skipgram, CBOW
 

7. Recurrent Neural Networks

  • LSTM architecture
  • Issues with vanishing and exploding gradients
 

8. Machine Translation

  • Overview of Statistical Machine Translation
  • Seq2Seq Model
  • Evaluation metrics and limitations
 

9. Transformers & BERT

  • Motivation for contextual embeddings
  • Self-attention
  • Transformer architecture
  • BERT
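
The self-attention item above can be illustrated with a minimal single-head, scaled dot-product attention in plain Python. This is a simplified sketch, not the course's implementation: it omits the learned query, key and value projections, multiple heads and masking, and it reuses the same toy matrix as queries, keys and values.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted average of the rows of V.

    Weights are softmax(q . k / sqrt(d_k)) over all keys k.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors (hypothetical); self-attention with Q = K = V = X.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how strongly one token attends to every token in the sequence, itself included.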
 

10. LLMs

  • Overall architecture
  • Architecture comparison (e.g. Llama, Falcon)
  • What LLMs cannot do
 

11. Ethics

  • Bias
  • Privacy
  • Hallucination

Learning outcomes of the learning unit

  • Understand the underlying principles and algebraic formulations of machine learning models
  • Apply these models to information extraction from text and to text classification
  • Synthesize the principles and algorithms introduced in the course to develop a full-fledged text analytics application (as part of the course project)
  • Implement text analytics solutions to support an organization's business intelligence activities
  • Formulate a strategy based on the acquired text analytics skills to optimize the value of an organization
  • Research and understand advanced topics in the field, and keep up with recent developments in order to adapt easily to changing requirements
  • Appreciate how the algorithms studied could solve real-life managerial issues
  • Communicate appropriately about text analytics projects/applications to various stakeholders

This course contributes to the learning outcomes I.1, I.2, I.3, II.1, II.2, II.3, III.1, III.2, III.3, III.4, IV.1, IV.2, IV.3, VI.1, VI.2, VII.1, VII.2, VII.3, VII.4, VII.5 of the MSc in data science and engineering.

This course contributes to the learning outcomes I.1, I.2, II.1, II.2, II.3, III.1, III.2, III.3, III.4, IV.1, IV.2, VI.1, VI.2, VII.1, VII.2, VII.3, VII.4, VII.5 of the MSc in computer science and engineering.

Prerequisite knowledge and skills

Students are expected to have reasonable maths/stats and programming skills. Appropriate guidance and support will be offered to students:

  • Lecture notes
  • Online references
  • Consultation (if time permits)

Planned learning activities and teaching methods

The course carries 5 credits and therefore requires 150 hours of work (1 credit = 30 hours).

  • Theory lectures = 18-22 hours
  • Self-study for exam = approx. 70 hours
  • Practical lectures = 9-12 hours
  • Working on practical exercises and projects = approx. 80 hours
  • Total = 150 hours (5 credits)

Mode of delivery (face to face, distance learning, hybrid learning)

  • Lectures 
  • Practical sessions (during lectures and as homework)

Topics covered in the course will come from different textbooks:

  • Deep Learning (2016) by Goodfellow, Bengio and Courville (soft copy available online)
  • Speech and Language Processing (2017) by Jurafsky and Martin (soft copy available online)
  • Neural Network Methods for Natural Language Processing (2017) by Goldberg (soft copy available online)

Exam(s) in session

Any session

- In-person

written exam (open-ended questions)

- Remote

oral exam

Written work / report

Continuous assessment


Additional information:

Final written exam: 50%
Final practical project: 35%
Paper presentation: 15%
(May be adjusted during the course)
 

Work placement(s)

Organisational remarks and main changes to the course

Contacts

Ashwin Ittoo, ashwin.ittoo@uliege.be

Association of one or more MOOCs

Items online

Lecture Notes